Technology thesis · Artificial Intelligence

High-conviction growth

Reinforcement learning from human feedback (RLHF)

RLHF is the alignment technique that makes LLMs useful and safe, and it is arguably the most commercially important application of reinforcement learning to date.

Position maintained continuously · last reviewed Apr 22, 2026

The thesis

Core thesis

RLHF transforms raw language models into helpful, harmless assistants. Anthropic's Constitutional AI extends the approach by replacing part of the human feedback with AI-generated feedback. Every frontier model uses RLHF or a variant such as DPO or RLAIF. The quality of this training is a key differentiator between model providers: it is one reason Claude and GPT-4 feel different despite broadly similar transformer architectures. The technique is well understood but expensive and labour-intensive to execute well.
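For readers who want the mechanics behind the DPO variant named above, the following is a minimal sketch of the per-example DPO loss on a single preference pair. It is an illustration under stated assumptions, not any provider's training code: the function name, the argument names and the default `beta` are hypothetical, and the inputs are assumed to be summed log-probabilities of each response under the policy being trained and under a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)


if __name__ == "__main__":
    # Illustrative log-probabilities only; not measured from any model.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

The loss falls as the policy raises the preferred response's likelihood relative to the rejected one, measured against the reference model. No separate reward model or PPO loop is needed, which is the sense in which the objective is simpler.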

State of the art (2026)

RLHF has bifurcated. For helpfulness, harmlessness and tone, preference-based alignment remains universal across frontier models (Claude, GPT, Gemini, Llama, Grok, Qwen and DeepSeek), but PPO is being displaced by simpler DPO/GRPO-style objectives and by RL from AI feedback (Anthropic's Constitutional AI). For reasoning, the centre of gravity has shifted to RL with verifiable rewards: DeepSeek-R1 and Kimi k1.5 showed in early 2025 that RL against checkable maths and code outcomes (GRPO, in DeepSeek's case) yields frontier reasoning cheaply, and comparable outcome-based RL is widely understood to underpin OpenAI's o-series and Claude's reasoning post-training. The durable scarce input is no longer raw annotation volume but expert, domain-specific preference and verification data, which keeps Surge AI, Scale AI and Mercor strategically central.
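The "verifiable rewards" idea can be made concrete with a small sketch of the group-relative advantage step used in GRPO-style training. This is an illustration, not DeepSeek's implementation: the function names are hypothetical, the exact-match checker stands in for a real maths or code verifier, and the use of the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, expected: str) -> float:
    """Hypothetical checker: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each sampled response's reward against its own group.

    The group mean acts as the baseline, so no learned value model is used.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt whose checkable answer is "42".
    samples = ["42", "41", " 42 ", "7"]
    rewards = [verifiable_reward(s, "42") for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Responses that beat their group's average receive a positive advantage and are reinforced, while the rest are pushed down. When every sample in a group scores the same, all advantages are zero and that prompt contributes no learning signal.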

Signal stack

Evidence stacked leading → lagging

8 signals

talent

research

patent

expert

operational

market

Technology-native KPIs

Metrics that predict trajectory, tracked over time

4 tracked

RLHF-Trained Model Share

Human Preference Data Scale

Alignment Tax Reduction

DPO/Alternative Method Adoption

Landscape map

Who builds what — and who depends on whom

77 players · 6 layers

Catalyst calendar

Dated events that will move the position

5 ahead

Technology roadmap

Milestones on the path to maturity

8 milestones

Watchlists

Companies, people and papers — each with a remove-by condition

Companies · 3

People · 18

Decision frameworks

The same call, framed for your desk

Public Equity

PE / VC

Corporate Leader

Thesis changelog

When our view changed, and why

5 updates

Change our mind

3 disconfirming conditions
