Technology thesis · Artificial Intelligence
high conviction · growth
Reinforcement learning from human feedback (RLHF)
RLHF is the alignment technique that makes LLMs useful and safe; it is the most commercially important reinforcement learning application in history.
Position maintained continuously · last reviewed Apr 22, 2026
The thesis
Core thesis
RLHF transforms raw language models into helpful, harmless assistants. Anthropic's Constitutional AI extends this with AI-generated feedback. Every frontier model uses RLHF or a variant (DPO, RLAIF). The quality of RLHF training is a key differentiator between model providers: it is a large part of why Claude and GPT-4 feel different despite similar architectures. The technique is well understood but expensive and labour-intensive to execute well.
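To make the DPO variant named above concrete, here is a minimal sketch of its loss for a single preference pair. It assumes the caller already has summed response log-probabilities under the policy being trained and under a frozen reference model; the function names, argument names and the `beta` default are illustrative assumptions, not taken from any provider's training code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed illustrative value; tuned in practice
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a whole response
    under the trained policy or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

The loss falls below log 2 when the policy favours the human-preferred response more strongly than the reference model does, and rises above it otherwise. No separate reward model or PPO rollout is needed, which is the main reason the objective counts as a simpler variant.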
State of the art (2026)
RLHF has bifurcated. For helpfulness, harmlessness and tone, preference-based alignment remains universal across frontier models (Claude, GPT, Gemini, Llama, Grok, Qwen and DeepSeek), but PPO is being displaced by simpler DPO- and GRPO-style objectives and by RL from AI feedback (RLAIF), as in Anthropic's Constitutional AI. For reasoning, the centre of gravity has shifted to RL with verifiable rewards: DeepSeek-R1 and Kimi k1.5 showed in early 2025 that RL against checkable maths and code outcomes (GRPO in DeepSeek-R1's case) yields frontier-level reasoning at comparatively low cost, and similar recipes now appear to underpin o-series and Claude reasoning post-training. The durable scarce input is no longer raw annotation volume but expert, domain-specific preference and verification data, which keeps Surge AI, Scale and Mercor strategically central.
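The verifiable-rewards idea can be sketched in a few lines: sample several answers to one prompt, score each with an automatic checker, then normalise the scores within the group so that no learned value function is needed. The sketch below is a simplified illustration of the group-relative advantage step in GRPO-style training, not any lab's actual pipeline. The exact-match checker, the use of population standard deviation and the `eps` constant are assumptions made for the example.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, expected: str) -> float:
    """Hypothetical checker: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical sampled answers to a maths prompt whose answer is 42.
    samples = ["42", "41", " 42 ", "7"]
    rewards = [verifiable_reward(s, "42") for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Correct answers receive positive advantages and incorrect ones negative, so the policy update pushes probability towards responses that pass the check. When every sample in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is one reason prompt difficulty matters for this kind of training.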