
Google TurboQuant compresses LLM KV caches 6x with no retraining

2026-04-24 13:05

Google Research published TurboQuant on March 24 and presented it at ICLR 2026. The method is a two-stage pipeline that quantizes transformer key-value (KV) caches down to 3–4 bits per element without fine-tuning or calibration data, achieving up to 6x memory reduction and 8x throughput improvement on H100 GPUs with negligible accuracy loss. Stage one applies a random orthogonal rotation to each KV vector, spreading its energy roughly uniformly across coordinates; stage two snaps each rotated coordinate to a precomputed Lloyd-Max quantization grid. Because neither stage depends on model-specific statistics, the pipeline is architecture-agnostic. The paper has already spawned a dozen open-source implementations and follow-on compression methods benchmarked against it.
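The core idea is compact enough to sketch. Below is a minimal illustration of rotate-then-quantize in NumPy, not the paper's implementation: it draws one shared random orthogonal matrix via QR decomposition, fits a Lloyd-Max grid offline for a standard-normal source, and snaps each rotated coordinate to its nearest centroid. The function names, the crude per-tensor scale, and the Gaussian-source assumption are all illustrative choices.

```python
# Minimal sketch of the rotate-then-quantize idea described above.
# Names, the per-tensor scale, and the Gaussian-source assumption are
# illustrative; this is not the paper's reference implementation.
import numpy as np


def random_orthogonal(d: int, seed: int = 0) -> np.ndarray:
    """Sample a random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Sign fix so Q is uniformly distributed over the orthogonal group.
    return q * np.sign(np.diag(r))


def lloyd_max_grid(levels: int, n_samples: int = 200_000,
                   n_iter: int = 30, seed: int = 1) -> np.ndarray:
    """Fit a Lloyd-Max grid for a standard-normal source by alternating
    nearest-centroid assignment and conditional-mean centroid updates."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    grid = np.linspace(-2.0, 2.0, levels)
    for _ in range(n_iter):
        idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
        for k in range(levels):
            members = x[idx == k]
            if members.size:
                grid[k] = members.mean()
    return np.sort(grid)


def quantize_kv(vectors: np.ndarray, bits: int = 4, seed: int = 0):
    """Rotate KV vectors by a shared orthogonal matrix, then snap each
    coordinate to the nearest precomputed Lloyd-Max centroid."""
    d = vectors.shape[-1]
    rot = random_orthogonal(d, seed)
    grid = lloyd_max_grid(2 ** bits)
    rotated = vectors @ rot.T
    scale = rotated.std() + 1e-8  # crude per-tensor scale (assumption)
    codes = np.abs(rotated[..., None] / scale - grid).argmin(axis=-1)
    return codes.astype(np.uint8), grid, rot, scale


def dequantize_kv(codes, grid, rot, scale):
    """Look up centroids, rescale, and undo the rotation."""
    return (grid[codes] * scale) @ rot
```

The rotation makes each coordinate approximately Gaussian regardless of the model, which is why a single fixed grid can be precomputed once and reused everywhere; dequantization needs only a centroid lookup, a rescale, and the inverse rotation.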
