
Technology thesis · Artificial Intelligence

medium conviction growth

Multimodal AI

Native multimodality is now the default architecture, not the next frontier; the contest has moved to real-time omni interaction, on-device inference and agentic computer use.

Position maintained continuously · last reviewed Jun 24, 2026

The thesis

Core thesis

Native multimodality is the default architecture for frontier models, not a frontier in itself: the leading stacks (Gemini, GPT-5.x, Claude) ship multimodal by default, though not every stack covers every modality natively, and Google's Gemini line carries a 2M-token context that ingests hours of video. The live contest is real-time omni interaction, on-device inference and agentic computer use. Enterprise value concentrates in document understanding, visual inspection and video analysis. The constraint is inference economics: a multimodal query still costs several times more than a text query, and the firms that drive that cost down fastest capture the margin.

State of the art (2026)

By mid-2026 native multimodality is the default architecture, not a frontier. Gemini 2.5 Pro processes text, image, audio and video in one model with a 2M-token context and ingests up to three hours of video per prompt. GPT-5.4 (March 2026) and Claude Opus 4.6 (February 2026) anchor the other two frontier stacks, though Claude still lacks native video and audio. At I/O on 19 May 2026 Google launched Gemini Omni, an any-input family that outputs editable video. Generative video is commodity-fast: Veo 3.1 renders 4K at 60fps with synchronised audio. OpenAI, by contrast, retired the Sora consumer app on 26 April 2026. The contest has moved from whether models are multimodal to real-time omni interaction, on-device inference and agentic computer use.

The rest of the file

Everything below is live inside CanaryIQ

The full analysis behind the verdict — the structure is real; the content unlocks when you log in.

Signal stack

Evidence stacked leading → lagging

10 signals
talent
research
patent
expert
operational
market

Technology-native KPIs

Metrics that predict trajectory, tracked over time

3 tracked
Multimodal AI model releases
Multimodal AI market size
Enterprise multimodal AI adoption

Landscape map

Who builds what — and who depends on whom

139 players · 6 layers

Catalyst calendar

Dated events that will move the position

6 ahead

Technology roadmap

Milestones on the path to maturity

8 milestones

Watchlists

Companies, people and papers — each with a remove-by condition

20 · 20
Companies · 20
People · 20

Decision frameworks

The same call, framed for your desk

Public Equity
PE / VC
Corporate Leader

Thesis changelog

When our view changed, and why

5 updates

Change our mind

2 disconfirming conditions
