Technology thesis · Artificial Intelligence
medium conviction growth
Speech AI and voice synthesis
Voice is now production infrastructure, not a demo: sub-100ms speech-to-speech agents are displacing tier-1 BPO work; deepfake fraud and unresolved music-IP rulings are the live limits on scale.
Position maintained continuously · last reviewed Jun 24, 2026
The thesis
Core thesis
Voice synthesis crossed the uncanny valley: ElevenLabs and Play.ht produce voices indistinguishable from real humans. Applications include audiobook production (costs drop 90%), real-time translation and accessibility. The same technology also enables voice deepfake attacks, such as the $25M wire-transfer fraud carried out with a CFO voice clone, so it both enables and threatens. The market is growing 15%+ annually.
Voice synthesis crossed the uncanny valley years ago; the live question in 2026 is deployment, not naturalness. Real-time speech-to-speech is the default architecture – OpenAI's gpt-realtime reached GA in August 2025, Cartesia's Sonic-3.5 ships sub-100ms synthesis, and ElevenLabs raised a $500M Series D at an $11B valuation in February 2026. Voice agents now handle tier-1 contact-centre work end to end. The same capability cuts both ways: deepfake voice fraud (a cloned CFO voice on a live call induced a $25M wire transfer) is driving authentication mandates, and AI music generation faces unresolved IP rulings. The structural growth is real; regulation and fraud liability are the live limits on scale.
State of the art (2026)
Real-time speech-to-speech is now the default architecture, not a research demo. OpenAI's gpt-realtime reached general availability in August 2025, and specialists ship sub-100ms synthesis – Cartesia's Sonic-3.5 and ElevenLabs' multilingual stack lead, with ElevenLabs raising a $500M Series D in February 2026 at an $11B valuation. The frontier has moved from naturalness to deployable voice agents handling tier-1 contact-centre work end to end. Two constraints now bind commercial scale. The first is deepfake voice fraud (the $25M CFO-clone wire transfer set the template), which is driving authentication mandates. The second is music-IP exposure: Universal settled with Udio in October 2025 and Warner with Suno in November 2025, while a fair-use ruling in Sony's case against both is expected in summer 2026.
Signal stack
Evidence stacked leading → lagging
Technology-native KPIs
Metrics that predict trajectory, tracked over time
Landscape map
Who builds what — and who depends on whom
Catalyst calendar
Dated events that will move the position
Technology roadmap
Milestones on the path to maturity
Watchlists
Companies, people and papers — each with a remove-by condition
Decision frameworks
The same call, framed for your desk
Thesis changelog
When our view changed, and why
Change our mind
3 disconfirming conditions