Technology thesis · Artificial Intelligence

medium conviction · mature

Reinforcement learning

RL-from-verifiable-rewards is now the dominant post-training paradigm for frontier reasoning; the open bet is whether it generalises past verifiable domains into fuzzy real-world reward.

Position maintained continuously · last reviewed Jun 24, 2026

The thesis

Core thesis

RL powers AlphaGo, robotics control, and the reasoning capabilities in o1-class models. RLHF (reinforcement learning from human feedback) is the key technique for aligning LLMs. As AI moves toward agents that plan and act, RL becomes more central. The challenge is that RL is data-hungry, unstable to train, and hard to debug, which limits its application to environments with well-defined rewards.

State of the art (2026)

RL has shifted from a niche control technique to the dominant post-training paradigm for frontier reasoning. RL-from-verifiable-rewards (RLVR) underpins OpenAI o3, DeepSeek-R1 and V3.2, Gemini Deep Think and Claude reasoning; DeepSeek-V3.2 (December 2025) trained across 1,800-plus agentic environments, and its Speciale variant took IMO and IOI gold. The frontier now runs on outcome-based RL plus test-time compute rather than process reward models. In robotics the picture is more contested: imitation learning has overtaken RL as the primary on-ramp for real-world manipulation, with RL reserved for locomotion and sim-to-real fine-tuning. The open question is whether RLVR generalises beyond verifiable domains — maths, code, tool-use — into fuzzy real-world reward.
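To make "verifiable" concrete, here is a minimal sketch of outcome-based reward functions of the kind RLVR relies on. It is illustrative only and does not represent any lab's actual pipeline; the function names (math_reward, code_reward) and the binary 0/1 scoring are assumptions chosen for clarity. The point is that the reward checks only the final outcome, with no scoring of intermediate reasoning steps.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Hypothetical outcome reward for a maths task: 1.0 on an exact match, else 0.0."""
    try:
        matches = Fraction(model_answer.strip()) == Fraction(reference.strip())
        return 1.0 if matches else 0.0
    except (ValueError, ZeroDivisionError):
        # Unparseable output earns nothing; there is no partial credit.
        return 0.0


def code_reward(candidate_source: str, func_name: str, tests: list) -> float:
    """Hypothetical outcome reward for a coding task: 1.0 only if every test passes."""
    namespace: dict = {}
    try:
        # Illustration only: a real system would run untrusted code in a sandbox.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in tests)
        return 1.0 if passed else 0.0
    except Exception:
        # Any crash or missing function counts as a failed attempt.
        return 0.0


if __name__ == "__main__":
    print(math_reward("1/2", "0.5"))
    print(code_reward("def add(a, b):\n    return a + b", "add", [((1, 2), 3)]))
```

Both checks can be run automatically at scale, which is what makes maths and code tractable for this style of training. A fuzzy real-world objective has no equivalent pass/fail check, which is the generalisation question the thesis turns on.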

The rest of the file

Everything below is live inside CanaryIQ

The full analysis behind the verdict — the structure is real; the content unlocks when you log in.

Signal stack

Evidence stacked leading → lagging

10 signals

talent

research

patent

expert

operational

market

Technology-native KPIs

Metrics that predict trajectory, tracked over time

3 tracked

RL Research Publications

RL in Production Deployments

Sim-to-Real Transfer Success

Landscape map

Who builds what — and who depends on whom

70 players · 6 layers

Catalyst calendar

Dated events that will move the position

5 ahead

Technology roadmap

Milestones on the path to maturity

8 milestones

Watchlists

Companies, people and papers — each with a remove-by condition

2 · 20

Companies · 2

People · 20

Decision frameworks

The same call, framed for your desk

Public Equity

PE / VC

Corporate Leader

Thesis changelog

When our view changed, and why

5 updates

Change our mind

3 disconfirming conditions
