How Google's TurboQuant Works: The Algorithm That Crashed Memory Stocks

10 min read
Sumeet Zankar

AI Solution Architect & Full-Stack Developer

Introduction

On March 26, 2026, Google Research published a paper that sent shockwaves through the semiconductor industry. Within hours of the TurboQuant announcement, SK Hynix dropped over 6%, Samsung fell nearly 5%, and DDR5 memory prices began a weeks-long slide of 15-30%.

What exactly did Google build, and why did it terrify memory investors?

In this deep dive, we'll break down:

  • The KV cache bottleneck that TurboQuant solves
  • How the algorithm achieves 6x compression without accuracy loss
  • The two core techniques: QJL and PolarQuant
  • Real-world benchmarks and performance gains
  • Why the market reaction might be overblown

The Problem: KV Cache Is Eating Your GPU Memory

If you've ever run a large language model, you've hit this wall: memory.

Here's why. Every time an LLM generates a token, it needs to “remember” all the previous tokens in the conversation. This memory is called the Key-Value (KV) cache.

Token 1 → Store K1, V1
Token 2 → Store K2, V2 (+ retrieve K1, V1)
Token 3 → Store K3, V3 (+ retrieve K1, V1, K2, V2)
...
Token 10000 → Store K10000, V10000 (+ retrieve ALL previous KVs)

The KV cache grows linearly with context length. For a model like GPT-4 with 128K context, this can consume tens of gigabytes of GPU memory — often more than the model weights themselves.

This is why you need expensive H100s to run long-context inference. The model weights might fit on a smaller GPU, but the KV cache won't.
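To put numbers on this, here's a quick back-of-envelope calculation. The model dimensions are illustrative (a hypothetical 70B-class model with grouped-query attention), not GPT-4's unpublished architecture, and the helper function is my own sketch:

def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_value):
    # 2x because we store both keys and values, one entry per layer,
    # per KV head, per token position
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical 70B-class model: 80 layers, 8 KV heads (grouped-query attention),
# head dim 128, fp16 cache, 128K-token context
full = kv_cache_bytes(80, 8, 128, 128_000, 2)
print(f"fp16 KV cache: {full / 1e9:.1f} GB")       # ~41.9 GB
print(f"6x compressed: {full / 6 / 1e9:.1f} GB")   # ~7.0 GB

And that is for a single sequence; in a serving workload the cache scales with the batch size as well.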


The Traditional Solution: Vector Quantization (And Its Problems)

The obvious fix is compression. Specifically, vector quantization — a classic technique that reduces the precision of stored vectors.

Instead of storing each value as a 32-bit float, you could use 8 bits, or 4 bits, or even 2 bits. Massive memory savings.

But there's a catch: memory overhead.

Traditional quantization methods need to store extra “quantization constants” — metadata that helps reconstruct the original values. This overhead typically adds 1-2 bits per number, partially defeating the purpose.

Compress to 4 bits but add 2 bits of overhead, and you're effectively storing 6 bits per value: only about 5x compression instead of the 8x you were aiming for.
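To see where that overhead comes from, here's a sketch of a naive per-group 4-bit quantizer (my own illustration, not any particular production scheme) that stores one fp16 scale for every group of 8 values:

import numpy as np

def quantize_groups(x, group_size=8, bits=4):
    # One fp16 scale per group: 16 extra bits / 8 values = 2 bits of overhead per value
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.astype(np.int8), scale.astype(np.float16)

x = np.random.randn(1024).astype(np.float32)
q, scales = quantize_groups(x)

payload_bits = 4 * x.size
overhead_bits = 16 * scales.size
effective = (payload_bits + overhead_bits) / x.size
print(f"effective bits per value: {effective:.1f}")   # 6.0, i.e. ~5.3x, not 8x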


Enter TurboQuant: Zero-Overhead Compression

TurboQuant's breakthrough is achieving extreme compression with zero memory overhead.

Google claims:

  • 6x reduction in KV cache size
  • Zero accuracy loss on standard benchmarks
  • 8x speedup in attention computation on H100 GPUs
  • No fine-tuning or retraining required

How? TurboQuant combines two novel algorithms: PolarQuant and QJL.


Part 1: PolarQuant — Changing the Coordinate System

The first insight is geometric. Instead of storing vectors in Cartesian coordinates (X, Y, Z), PolarQuant converts them to polar coordinates (radius + angle).

Think of it like directions:

  • Cartesian: “Go 3 blocks East, then 4 blocks North”
  • Polar: “Go 5 blocks at a 53° angle from due East”

Why does this help?

In high-dimensional space, vector angles follow a highly concentrated, predictable distribution. By converting to polar coordinates, PolarQuant can:

  1. Store the radius (magnitude) — how “strong” the signal is
  2. Store the angle (direction) — what the vector “means”

Because the angle distribution is known and predictable, you don't need to store normalization constants. The boundaries are fixed, not data-dependent.

Result: No memory overhead for storing quantization metadata.
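Here's a toy sketch of the polar idea, under simplifying assumptions of my own (coordinates grouped into 2D pairs, uniform fixed bins, inputs pre-scaled so pair norms sit near 1); it illustrates why fixed bins can suffice, not the paper's exact algorithm:

import numpy as np

ANGLE_BITS, RADIUS_BITS = 4, 4   # 8 bits per 2D pair = 4 bits per value

def polar_quantize(v):
    pairs = v.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])    # always lies in [-pi, pi]
    # Angle bins are fixed and data-independent: nothing extra to store.
    theta_q = np.round((theta + np.pi) / (2 * np.pi) * (2**ANGLE_BITS - 1))
    # Radius bins are also fixed here, assuming inputs were rotated and scaled
    # so pair norms concentrate near 1 (see the random-rotation step later).
    r_q = np.clip(np.round(r / 2.0 * (2**RADIUS_BITS - 1)), 0, 2**RADIUS_BITS - 1)
    return r_q.astype(np.uint8), theta_q.astype(np.uint8)

def polar_dequantize(r_q, theta_q):
    theta = theta_q / (2**ANGLE_BITS - 1) * 2 * np.pi - np.pi
    r = r_q / (2**RADIUS_BITS - 1) * 2.0
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1).ravel()

v = np.random.randn(128) / np.sqrt(2)               # each 2D pair has norm ~1
v_hat = polar_dequantize(*polar_quantize(v))
print("relative error:", np.linalg.norm(v - v_hat) / np.linalg.norm(v))

The thing to notice is that every constant is baked into the code; nothing data-dependent is stored next to the compressed values.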


Part 2: QJL — The 1-Bit Error Corrector

PolarQuant handles most of the compression, but there's still residual error. That's where Quantized Johnson-Lindenstrauss (QJL) comes in.

QJL is based on the Johnson-Lindenstrauss Transform — a mathematical technique that projects high-dimensional data into lower dimensions while preserving distances between points.

Here's the clever part: QJL projects the data with a random matrix and keeps only the sign of each projected value (+1 or -1).

That's it. One bit.

QJL acts as an error-correction layer that removes the bias left over from the PolarQuant stage. Its estimator is asymmetric: queries stay at full precision while the stored data is only 1 bit, and the resulting inner-product estimate is unbiased.

The combination:

  • PolarQuant: Primary compression (most bits) — captures the main concept
  • QJL: Residual correction (1 bit) — eliminates systematic errors
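Here's a minimal sketch of that 1-bit, asymmetric estimator, applied to a raw key for simplicity rather than to a PolarQuant residual, with dimensions and helper names of my own choosing; it shows the mechanism, not Google's exact implementation:

import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 512                        # original dimension, sketch dimension
S = rng.standard_normal((m, d))        # random Gaussian projection, shared by all keys

def qjl_encode(k):
    # Keep only m sign bits plus the key's norm: 1 bit per projected coordinate.
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, sign_bits, k_norm):
    # For Gaussian s, E[sign(<s, k>) * <s, q>] = sqrt(2/pi) * <q, k> / ||k||,
    # so rescaling the empirical mean gives an unbiased estimate of <q, k>.
    return np.sqrt(np.pi / 2) * k_norm * np.mean(sign_bits * (S @ q))

q = rng.standard_normal(d)
k = q + 0.3 * rng.standard_normal(d)   # correlated, so the true dot product is large
bits, k_norm = qjl_encode(k)
print("exact    :", q @ k)
print("estimated:", qjl_inner_product(q, bits, k_norm))

The stored footprint per key is just the sign bits plus one norm, which is why the correction layer costs roughly one bit per projected coordinate.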

The Architecture: How It All Fits Together

Here's the TurboQuant pipeline:

1. INPUT: Original KV cache vectors (32-bit floats)
           ↓
2. RANDOM ROTATION: Rotate vectors to simplify geometry
           ↓
3. POLARQUANT: Convert to polar coordinates, quantize
           ↓
4. QJL: Apply 1-bit error correction to residuals
           ↓
5. OUTPUT: Compressed KV cache (3-4 bits per value)

The random rotation in step 2 is crucial — it makes the data geometry uniform and predictable, which is what allows PolarQuant to work without storing per-block constants.
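A small sketch of why that rotation matters, using a dense random orthogonal matrix from a QR decomposition (the paper presumably uses something faster and more structured; this is only to demonstrate the effect):

import numpy as np

rng = np.random.default_rng(0)
d = 128
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random rotation: Q @ Q.T = I

spiky = np.zeros(d)
spiky[0] = 10.0                                    # all the energy in one coordinate
rotated = Q @ spiky

# Norms and inner products are preserved exactly, but the energy is now spread
# evenly across coordinates, so one fixed set of quantization bins can cover it.
print("norm before/after:", np.linalg.norm(spiky), np.linalg.norm(rotated))
print("max |coordinate| before:", np.abs(spiky).max())
print("max |coordinate| after :", np.abs(rotated).max())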


Benchmarks: Does It Actually Work?

Google tested TurboQuant across multiple long-context benchmarks:

  • LongBench — diverse long-context tasks
  • Needle in a Haystack — finding specific info in massive contexts
  • ZeroSCROLLS — zero-shot long document understanding
  • RULER — synthetic long-context evaluation
  • L-Eval — long-context evaluation suite

Models tested: Gemma, Mistral (open-source LLMs)

Metric                   | TurboQuant   | Baseline
KV Memory                | 6x smaller   | 1x
Accuracy                 | No loss      | Baseline
Attention Speed (H100)   | 8x faster    | 1x
Needle-in-Haystack       | Perfect      | Perfect

The key finding: TurboQuant achieves perfect downstream accuracy on needle-in-haystack tasks while using 6x less memory.

For 4-bit TurboQuant specifically, they measured up to 8x speedup in computing attention logits on H100 GPUs compared to unquantized 32-bit keys.


Beyond LLMs: Vector Search Gets a Boost Too

TurboQuant isn't just for LLM inference. It's also a breakthrough for vector search — the technology powering semantic search, RAG systems, and recommendation engines.

Vector databases store billions of high-dimensional embeddings. Searching through them efficiently requires:

  1. Fast similarity computation
  2. Minimal memory footprint
  3. High recall (finding the actual nearest neighbors)

Google tested TurboQuant against state-of-the-art vector quantization methods (PQ and RaBitQ) on the standard 1@k recall benchmark.

Result: TurboQuant achieved superior recall ratios, even though competitors used larger codebooks and dataset-specific tuning.

This means faster index building, smaller indices, and more accurate search — all at once.
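As a rough illustration of how a recall metric of this kind is computed (on my reading: does the exact nearest neighbor appear among the top k candidates retrieved from the compressed vectors?), here's a small brute-force evaluation with a crude 4-bit quantizer standing in for any real codec:

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 64, 10
db = rng.standard_normal((n, d)).astype(np.float32)
queries = rng.standard_normal((100, d)).astype(np.float32)

# Crude 4-bit quantizer with one global scale, a stand-in for a real codec.
scale = np.abs(db).max() / 7
db_q = np.clip(np.round(db / scale), -8, 7) * scale

hits = 0
for q in queries:
    true_nn = np.argmax(db @ q)                 # exact nearest neighbor by inner product
    approx_topk = np.argsort(db_q @ q)[-k:]     # top-k under the compressed vectors
    hits += true_nn in approx_topk
print(f"1@{k} recall: {hits / len(queries):.2f}")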


Why the Market Overreacted

Now let's address the elephant in the room: should memory companies actually be worried?

The Bear Case

  • AI inference needs 6x less memory
  • Data centers buy fewer memory chips
  • Samsung, SK Hynix, Micron suffer

The Bull Case (Jevons Paradox)

When you make a resource more efficient, you often end up using MORE of it, not less.

Consider:

  • 6x cheaper inference → more companies deploy AI
  • More AI deployments → more total GPU/memory purchases
  • Longer contexts become practical → models use even more memory

History supports this. When SSDs got cheaper, we didn't use less storage — we stored more data. When GPUs got faster, we didn't use smaller models — we trained bigger ones.

Also important: TurboQuant targets inference, not training. The insatiable demand for HBM (High Bandwidth Memory) in training clusters remains unaffected.


What This Means for Developers

If you're building AI applications, TurboQuant matters for several reasons:

  1. Longer contexts on smaller GPUs — Run 128K context models where you previously couldn't
  2. Lower inference costs — 6x memory reduction = cheaper GPU instances
  3. Faster response times — 8x attention speedup means snappier applications
  4. Better RAG systems — More efficient vector search with higher recall

Expect this (or similar techniques) to be integrated into:

  • Google's Gemini API (likely already deployed)
  • Open-source inference frameworks (vLLM, TensorRT-LLM)
  • Vector databases (Pinecone, Weaviate, Qdrant)

Conclusion

TurboQuant represents a genuine algorithmic breakthrough — not just an incremental improvement, but a fundamental advance in how we compress high-dimensional data.

By combining PolarQuant's coordinate transformation with QJL's 1-bit error correction, Google achieved something previously thought impossible: extreme compression with zero overhead and zero accuracy loss.

The market's panic reaction might have been excessive, but the underlying technology is real and significant. For AI practitioners, this means more powerful models on cheaper hardware.

For memory companies? The jury's still out — but I'm betting on Jevons.


References

  1. TurboQuant: Redefining AI efficiency with extreme compression — Google Research Blog
  2. TurboQuant Paper (arXiv) — ICLR 2026
  3. PolarQuant Paper — AISTATS 2026
  4. QJL Paper — AAAI
  5. KV Cache Explained — Hugging Face

Enjoyed this article?

Connect with me on LinkedIn for more insights on AI, automation, and full-stack development.