Technology thesis · Artificial Intelligence

high conviction growth

AI model compression

AI model compression has converged on a production recipe - train in bf16, ship at 4-bit - putting frontier-derived models on flagship phones; the live frontier is sub-2-bit quantisation and bandwidth-aware compression.

Position maintained continuously · last reviewed Jun 24, 2026

The thesis

Compression has consolidated into a production recipe

The recipe has converged: train in 16-bit (bf16/fp16) and ship at 4-bit. GPTQ (2022) and AWQ (2023) demonstrated that 4-bit post-training quantisation preserves most model quality while cutting weight memory roughly 4x relative to 16-bit. The format split is mature: GGUF handles edge deployment through llama.cpp on CPU and Apple Silicon, with Q4_K_M at ~92% quality retention; AWQ paired with vLLM (~95% quality retention) is the GPU-server default; and TensorRT-LLM with FP8 and FP4 is the path on NVIDIA clusters. The investment frontier has moved from the headline quantisation method to combined-technique tooling: NVIDIA Minitron prunes and distils before quantising, and Pruna AI's smash() automates the combined pipeline. The structural read is that compression is now a horizontal toolchain step rather than a vertical research category.
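
As a rough illustration of what shipping at 4-bit involves, the sketch below applies plain symmetric round-to-nearest quantisation to one group of weights, then works out the storage cost once a per-group scale is included. This is deliberately not GPTQ or AWQ, which add error compensation and activation-aware scaling on top. The function names, the group size of 128 and the 16-bit scale are illustrative assumptions, not details taken from any of the formats above.

```python
from typing import List, Tuple


def quantize_group_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantisation of one weight group to 4-bit codes."""
    max_abs = max(abs(w) for w in weights)
    # Signed 4-bit integers span [-8, 7]; dividing by 7 keeps the largest
    # magnitude representable on both sides of zero.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # round() goes to the nearest integer; the clamp guards the [-8, 7] range.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def effective_bits_per_weight(weight_bits: int, group_size: int,
                              scale_bits: int = 16) -> float:
    """Bits stored per weight once the per-group scale is amortised (assumed layout)."""
    return weight_bits + scale_bits / group_size


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes, ignoring activations and KV cache."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    codes, scale = quantize_group_int4([0.7, -0.3, 0.1, 0.0])
    print(codes, dequantize_group(codes, scale))

    bits = effective_bits_per_weight(4, 128)
    print(weight_memory_gb(3e9, 16), weight_memory_gb(3e9, bits))
```

Under these assumptions a 3B-parameter model falls from 6.0 GB of weights at 16-bit to about 1.55 GB at an effective 4.125 bits per weight. That is a reduction of roughly 3.9x rather than a clean 4x, because the scales have to be stored too.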

State of the art (2026)

The production stack has settled: train in bf16, ship at 4-bit. GGUF Q4_K_M via llama.cpp owns CPU and Apple-Silicon distribution; AWQ paired with vLLM is the GPU-server default; NVIDIA TensorRT-LLM carries FP8 and FP4 on Blackwell-class clusters. On-device is the volume case: Gemini Nano (1.8-3.25B, 4-bit) runs across Pixel and Galaxy flagships, and the January 2026 Apple-Google deal puts a custom Gemini behind the rebuilt Siri AI shown at WWDC in June, though Apple routes it server-side through Private Cloud Compute rather than fully on-device. Sub-2-bit remains research, not product: Microsoft shipped BitNet b1.58-2B-4T and bitnet.cpp but still flags the release as not production-ready, and no major lab has released a 1.58-bit instruction-tuned foundation checkpoint. The live frontier is combined prune-distil-quantise tooling and memory-bandwidth-aware compression.

On-device AI is the volume application that compression unlocked

The largest commercial consequence of compression is on-device AI at flagship-phone scale. Gemini Nano (1.8B-3.25B parameters, 4-bit quantised) is the first on-device LLM shipping at consumer scale: it runs on Pixel 9 and Pixel 10 Pro (the latter on Tensor G5 with 16GB RAM, 2.6x faster than the prior generation) and Samsung Galaxy S24/S25, and is rolling out more broadly. The Apple-Google deal announced in January 2026 puts a custom Gemini model behind Apple Intelligence on iPhones and Macs, routed server-side through Private Cloud Compute rather than run fully on-device, with Apple Foundation Model 3B running in parallel for some workloads. The on-device value proposition - sub-100ms latency, full privacy, offline operation - depends on compression. The volume gate has therefore shifted from 'can a phone run this' to 'how good does the on-device model get vs cloud'. By 2027 the gap between flagship-phone on-device models and the cloud frontier will be the binding question for compression research.

Sub-2-bit quantisation and bandwidth-aware compression are the next frontier

Two open frontiers remain after the 4-bit convergence. First, sub-2-bit quantisation - BitNet (1.58-bit ternary weights), AbsoluteZero-style binarisation, and the 1-bit-instruction-following category - has shown research traction but has not yet shipped in widely deployed instruction-tuned models. Production viability by 2027 would lift phone-deployable models from ~3B to 10B+ effective parameters under the same RAM envelope. Second, memory bandwidth rather than compute is the constraint on actual NPU throughput: flagship phones bottleneck on LPDDR5X bandwidth before NPU TOPS, so compression schemes that reduce memory traffic (not just weight size) become the binding optimisation. Both frontiers determine whether on-device AI remains a Gemini-Nano-class capability or moves into frontier-class capability before 2028.
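
Both constraints reduce to simple arithmetic, sketched here under heavily simplified assumptions: a dense decoder that reads every weight once per generated token, no allowance for KV cache, activations or quantisation metadata, and a hypothetical 60 GB/s of usable memory bandwidth that is a placeholder rather than a measured figure for any phone. The function names are illustrative.

```python
def params_that_fit(weight_budget_bytes: float, bits_per_weight: float) -> float:
    """Number of parameters whose weights fit in a fixed byte budget."""
    return weight_budget_bytes * 8 / bits_per_weight


def decode_tokens_per_second(bandwidth_bytes_per_s: float,
                             num_params: float,
                             bits_per_weight: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    bytes_per_token = num_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


if __name__ == "__main__":
    # RAM envelope of a ~3B-parameter model stored at 4 bits per weight.
    budget_bytes = 3e9 * 4 / 8

    # How many parameters the same envelope holds at lower bit-widths.
    for bits in (4, 1.58, 1):
        print(bits, params_that_fit(budget_bytes, bits) / 1e9)

    # Hypothetical usable bandwidth; an assumption, not a device spec.
    assumed_bandwidth = 60e9
    print(decode_tokens_per_second(assumed_bandwidth, 3e9, 4))
```

With the 1.5 GB that a 3B model occupies at 4-bit, the same budget holds about 7.6B parameters at 1.58 bits and 12B at 1 bit, so the 10B+ figure sits toward the 1-bit end of that range. On the bandwidth side, the bound is 40 tokens per second for the 4-bit 3B model at the assumed 60 GB/s, and it moves only with bytes read per token, not with NPU TOPS.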
