Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context lengths grow. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. First, we leverage additive quantization by introducing a lightweight encoder and codebook to compress the KV cache, which can then be decoded with a simple matrix multiplication. Second, to tackle the high computational costs during decoding, we design the…
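To make the additive-quantization idea concrete, here is a minimal sketch of encoding a KV vector as a handful of codebook indices and decoding it with a single matrix multiplication. The codebook sizes, the greedy nearest-codeword "encoder", and all names below are assumptions for illustration only, not the paper's actual lightweight encoder, codebook design, or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64   # head dimension of a key/value vector (assumed)
M = 4    # number of additive codebooks (assumed)
K = 256  # codewords per codebook (assumed)

# Codebook: M groups of K codewords, each of dimension d.
codebook = rng.normal(size=(M, K, d)).astype(np.float32)

def encode(v):
    """Greedy residual encoding: pick one codeword per codebook.

    A learned lightweight encoder would replace this nearest-codeword search.
    """
    residual = v.copy()
    codes = np.empty(M, dtype=np.int64)
    for m in range(M):
        dists = np.linalg.norm(codebook[m] - residual, axis=1)
        codes[m] = np.argmin(dists)
        residual = residual - codebook[m, codes[m]]
    return codes

def decode(codes):
    """Decoding is one matmul: one-hot codes times the flattened codebook."""
    one_hot = np.zeros((M, K), dtype=np.float32)
    one_hot[np.arange(M), codes] = 1.0
    return one_hot.reshape(-1) @ codebook.reshape(M * K, d)  # (M*K,) @ (M*K, d) -> (d,)

v = rng.normal(size=d).astype(np.float32)
codes = encode(v)        # cache stores M small integers instead of d floats
v_hat = decode(codes)    # reconstruction via a simple matrix multiplication
print(codes.shape, v_hat.shape)  # (4,) (64,)
```

In this toy setup, each cached vector is stored as M integer codes rather than d floating-point values, which is where the memory savings come from; the reconstruction quality depends on how the encoder and codebook are trained.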


