
K/V Cache Quantization in Ollama

A somewhat hidden feature of Ollama is K/V cache quantization. This post covers how to activate it, its benefits and drawbacks, example numbers, and use cases.

Activating K/V Cache quantization in Ollama:
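
The feature is controlled through environment variables on the Ollama server, and flash attention has to be enabled for the cache quantization to take effect. A minimal sketch for a server started from a shell (if Ollama runs as a system service, set the variables in the service configuration instead; q8_0 and q4_0 are the supported quantized cache types, f16 is the unquantized default):

    export OLLAMA_FLASH_ATTENTION=1
    export OLLAMA_KV_CACHE_TYPE=q8_0   # or q4_0; f16 is the default
    ollama serve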

Benefits:

  • lower memory consumption
  • most pronounced with small models and large context windows, where the K/V cache accounts for a large share of total memory use

Drawback:

  • slightly reduced response quality, because keys and values are stored at lower precision; the effect is usually negligible with q8_0 and more noticeable with q4_0

Example numbers:

  • Llama 3.1 8B supports a context window of 128,000 tokens.
  • Running it with Q4_K_M weight quantization and the longest possible context, it consumes
    • 23.3 GB of memory without K/V cache quantization,
    • 17.0 GB with q8_0 K/V cache quantization, and
    • 13.8 GB with q4_0 K/V cache quantization.
  • With q4_0 it now fits into 16 GB of RAM… (a rough sketch of where these savings come from follows this list)
  • Sam McLeod’s vram-estimator is a nice tool to estimate the memory consumption of models with different quantization settings.
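
As a rough sketch of where these savings come from, the cache size can be estimated from the model architecture and the bytes each cached value needs. The figures below assume Llama 3.1 8B's published architecture (32 layers, 8 K/V heads, head dimension 128) and the llama.cpp block sizes for q8_0 and q4_0; real Ollama numbers additionally include the model weights and runtime buffers, so they will not match the values above exactly.

    # Back-of-the-envelope K/V cache estimate (assumptions noted above,
    # not taken from the vram-estimator or this post's measurements).
    N_LAYERS = 32      # transformer layers in Llama 3.1 8B
    N_KV_HEADS = 8     # grouped-query attention: 8 K/V heads
    HEAD_DIM = 128     # dimension per head
    CONTEXT = 128_000  # tokens

    # Bytes per cached value: f16 = 2 bytes; q8_0 packs 32 values plus a
    # scale into 34 bytes; q4_0 packs 32 values plus a scale into 18 bytes.
    BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

    def kv_cache_gb(cache_type: str) -> float:
        """Approximate K/V cache size in GB for the assumed model and context."""
        values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # keys + values
        return values_per_token * CONTEXT * BYTES_PER_VALUE[cache_type] / 1e9

    for cache_type in ("f16", "q8_0", "q4_0"):
        saved = 100 * (1 - kv_cache_gb(cache_type) / kv_cache_gb("f16"))
        print(f"{cache_type}: ~{kv_cache_gb(cache_type):.1f} GB cache "
              f"(~{saved:.0f}% smaller than f16)")

The relative savings (roughly half the cache with q8_0, roughly three quarters with q4_0) match the proportions in the numbers above, and the absolute savings grow with the configured context length.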

Use cases:

  Application          Benefit
  code generation      more code in context
  question answering   whole docs fit into the context
  function calling     more tools
  chat                 longer conversations
  multi-modal          images need many tokens
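
The freed-up memory only helps if the application actually requests a larger window. A minimal sketch with the official ollama Python client; the model tag, prompt, and num_ctx value are just examples, and the cache type itself is still configured on the server as described above:

    import ollama

    # Ask for a larger context window for this request; the K/V cache type
    # is set server-side via OLLAMA_KV_CACHE_TYPE, not per call.
    response = ollama.chat(
        model="llama3.1:8b",  # example model tag
        messages=[{"role": "user", "content": "Summarize the attached project docs."}],
        options={"num_ctx": 32768},  # example: request a 32K-token context
    )
    print(response["message"]["content"])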

To learn more, read Sam McLeod’s in-depth blog post Bringing K/V Context Quantisation to Ollama. Sam helped implement this feature in Ollama.

This post is licensed under CC BY 4.0 by the author.