Technology thesis · Artificial Intelligence

high conviction growth

Synthetic data generation

Synthetic data is now structural to frontier post-training, robotics sim2real and regulated-industry analytics. The open question is whether specialist vendors can hold their margins as frontier labs generate their own data.

Position maintained continuously · last reviewed Jun 24, 2026

The thesis

Core thesis

Synthetic data addresses the three biggest AI training problems: data scarcity (rare events, edge cases), privacy (healthcare, finance), and bias (under-represented populations). It is now structural in frontier post-training, where RLAIF, distillation and self-generated reasoning traces are mainstream. The independent specialist tier has consolidated into platforms: NVIDIA acquired Gretel (March 2025), SAS absorbed Hazy's software (late 2024), and Syntho took the MOSTLY AI brand (June 2026); Tonic.ai remains independent. The risk is that synthetic data can amplify biases already present in the seed data. The open commercial question is whether specialist vendors can hold their margins as frontier labs increasingly generate their own data.

State of the art (2026)

Synthetic data is now load-bearing in three distinct markets, not one. In frontier post-training it is mainstream: RLAIF, distillation and self-generated reasoning traces dominate alignment work at Anthropic, OpenAI, Google DeepMind and DeepSeek, and the feared model collapse has not materialised in practice when synthetic and real data are mixed. In physical AI, NVIDIA released Cosmos 3 in June 2026 as an open world-foundation model that generates physics-aware training data for robotics and autonomous fleets. In regulated tabular data, MOSTLY AI, Gretel, Tonic.ai and Hazy sell privacy-preserving generation into finance and healthcare. The labelling layer consolidated in June 2025, when Meta took a 49% stake in Scale AI (valuing it at $29bn) and hired Alexandr Wang.

The rest of the file

Everything below is live inside CanaryIQ

The full analysis behind the verdict — the structure is real; the content unlocks when you log in.

Signal stack

Evidence stacked leading → lagging

9 signals
talent
research
patent
expert
operational
market

Technology-native KPIs

Metrics that predict trajectory, tracked over time

6 tracked
Scale AI valuation + structure
Synthetic data market size
Enterprise adoption rate
Model accuracy parity
Privacy regulation citations
Synthetic data share of frontier model training

Landscape map

Who builds what — and who depends on whom

97 players · 6 layers

Catalyst calendar

Dated events that will move the position

6 ahead

Technology roadmap

Milestones on the path to maturity

8 milestones

Watchlists

Companies, people and papers — each with a remove-by condition

Companies · 20
People · 19

Decision frameworks

The same call, framed for your desk

Public Equity
PE / VC
Corporate Leader

Thesis changelog

When our view changed, and why

6 updates

Change our mind

3 disconfirming conditions
