Technology thesis · Artificial Intelligence
high conviction growth · AI model compression
AI model compression has converged on a production recipe - train bf16, ship 4-bit - putting frontier-derived models on flagship phones; the live frontier is sub-2-bit, bandwidth-aware compression.
Position maintained continuously · last reviewed Jun 24, 2026
The thesis
Compression has consolidated into a production recipe
The recipe has converged: train in 16-bit (bf16/fp16) and ship at 4-bit. GPTQ (2022) and AWQ (2023) demonstrated that 4-bit post-training quantisation preserves most model quality while cutting weight memory roughly 4x relative to 16-bit. The format split is mature: GGUF (Q4_K_M retains ~92% of quality, run through llama.cpp on CPU and Apple Silicon) handles edge deployment; AWQ paired with vLLM (~95% of quality) is the GPU-server default; TensorRT-LLM with FP8 and FP4 is the NVIDIA-cluster path. The investment frontier has moved from the headline quantisation method to combined-technique tooling - NVIDIA Minitron prunes and distils before quantising; Pruna AI's smash() automates the combined pipeline. The structural read is that compression is now a horizontal toolchain step rather than a vertical research category.
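To make the "roughly 4x" figure concrete, here is a minimal sketch of symmetric round-to-nearest 4-bit quantisation over one group of weights. It is deliberately not GPTQ or AWQ, which add error compensation and activation-aware scaling on top; the group size of 128, the 16-bit per-group scale and the function names are illustrative assumptions, not any library's format.

```python
import random

def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantisation of one group of weights."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit codes
    scale = max(abs(w) for w in weights) / qmax or 1.0  # fall back to 1.0 for an all-zero group
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def effective_bits(bits=4, group_size=128, scale_bits=16):
    """Bits per weight once the per-group scale is amortised (assumed layout)."""
    return bits + scale_bits / group_size

random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(128)]  # synthetic weights, not real model data
codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(f"effective bits/weight: {effective_bits():.3f}")
print(f"compression vs 16-bit: {16 / effective_bits():.2f}x")
print(f"max abs error: {max_err:.5f} (scale {scale:.5f})")
```

Under these assumptions the per-group scale costs an extra 0.125 bits per weight, which is why shipped 4-bit formats land slightly under a clean 4x saving.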
State of the art (2026)
The production stack has settled: train in bf16, ship at 4-bit. GGUF Q4_K_M via llama.cpp owns CPU and Apple-Silicon distribution; AWQ paired with vLLM is the GPU-server default; NVIDIA TensorRT-LLM carries FP8 and FP4 on Blackwell-class clusters. On-device is the volume case - Gemini Nano (1.8-3.25B, 4-bit) runs across Pixel and Galaxy flagships, and the January 2026 Apple-Google deal puts a custom Gemini behind the rebuilt Siri AI shown at WWDC in June, though Apple routes it server-side through Private Cloud Compute rather than fully on-device. Sub-2-bit remains research, not product: Microsoft shipped BitNet b1.58-2B-4T and bitnet.cpp but still flags them as not production-ready, and no major lab has released a 1.58-bit instruction-tuned foundation checkpoint. The live frontier is combined prune-distil-quantise tooling and memory-bandwidth-aware compression.
On-device AI is the volume application that compression unlocked
The largest commercial consequence of compression is on-device AI at flagship-phone scale. Gemini Nano (1.8B-3.25B parameters, 4-bit quantised) is the first on-device LLM shipping at consumer scale - it runs on Pixel 9 and Pixel 10 Pro (the latter on Tensor G5 with 16GB RAM, 2.6x faster than the prior generation) and on Samsung Galaxy S24/S25, and is rolling out more broadly. The Apple-Google deal announced January 2026 puts Gemini Nano-class models behind Apple Intelligence on iPhones and Macs, with Apple Foundation Model 3B running in parallel for some workloads. The on-device value proposition - sub-100ms latency, full privacy, offline operation - is impossible without compression. The volume gate has therefore shifted from 'can a phone run this' to 'how good does the on-device model get versus cloud'. By 2027 the gap between flagship-phone on-device models and the cloud frontier will be the binding question for compression research.
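The RAM arithmetic behind that claim can be sketched in a few lines. This is a back-of-envelope calculation for weight storage only, using the Gemini Nano parameter counts quoted above; it ignores the KV cache, activations and any per-group quantisation metadata, and the helper name is hypothetical.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Weight storage in GB (10^9 bytes), ignoring KV cache and activations."""
    return params_billion * bits_per_weight / 8

for params in (1.8, 3.25):
    bf16 = weight_footprint_gb(params, 16)
    q4 = weight_footprint_gb(params, 4)
    print(f"{params}B params: bf16 {bf16:.3f} GB -> 4-bit {q4:.3f} GB")
```

On these assumptions the 4-bit weights come to roughly 0.9 GB and 1.6 GB, against 3.6 GB and 6.5 GB at bf16, which is the difference between fitting comfortably alongside the operating system on a 16GB phone and crowding it out.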
Sub-2-bit quantisation and bandwidth-aware compression are the next frontier
Two open frontiers remain after the 4-bit convergence. First, sub-2-bit quantisation - BitNet (1.58-bit ternary weights), AbsoluteZero-style binarisation, and the 1-bit-instruction-following category - has shown research traction but has not yet shipped in widely deployed instruction-tuned models. Production viability by 2027 would expand phone-deployable models from ~3B to 10B+ effective parameters under the same RAM envelope. Second, memory-bandwidth-aware compression targets the real constraint on NPU throughput - flagship phones bottleneck on LPDDR5X bandwidth before NPU TOPS, so compression schemes that reduce memory traffic (not just weight size) become the binding optimisation. Both frontiers determine whether on-device AI remains a Gemini-Nano-class capability or moves into frontier-class capability before 2028.
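Both frontiers reduce to simple arithmetic, sketched below under loud assumptions. The first helper asks how many parameters fit in the weight budget of a 3B model at 4-bit if each weight costs log2(3) bits (ideal ternary packing). The second gives a ceiling on decode speed if every weight is read from memory once per token. The 60 GB/s bandwidth is a hypothetical placeholder rather than a measured phone spec, and the function names are illustrative.

```python
import math

TERNARY_BITS = math.log2(3)   # ~1.585 bits of information per {-1, 0, +1} weight

def params_in_budget(budget_gb, bits_per_weight):
    """Billions of parameters whose weights fit in budget_gb (10^9 bytes)."""
    return budget_gb * 8 / bits_per_weight

def decode_tokens_per_s(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on decode speed if every weight is read once per token."""
    gb_read_per_token = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

budget_gb = 3.0 * 4 / 8          # weights of a 3B model at 4-bit: 1.5 GB
ASSUMED_BANDWIDTH_GB_S = 60.0    # hypothetical figure, not a measured phone spec

fit = params_in_budget(budget_gb, TERNARY_BITS)
print(f"ternary params in {budget_gb} GB: {fit:.1f}B")
for bits in (16, 4, TERNARY_BITS):
    rate = decode_tokens_per_s(ASSUMED_BANDWIDTH_GB_S, 3.0, bits)
    print(f"{bits:.2f} bits/weight -> at most {rate:.0f} tokens/s")
```

Pure storage arithmetic puts an ideally packed ternary model at about 7.6B parameters in the same 1.5 GB, so the 10B+ "effective" figure presumably rests on further assumptions this sketch does not model. The second loop shows why bandwidth matters: the decode ceiling scales inversely with bytes read per token, independent of NPU TOPS.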