
K/V Cache Quantization in Ollama

A somewhat hidden feature of Ollama is K/V cache quantization. It is not enabled by default; you activate it by setting the OLLAMA_KV_CACHE_TYPE environment variable (see the sketch after the list below). The supported values are documented in the FAQ entry How can I set the quantization type for the K/V cache?:

  • f16
  • q8_0
  • q4_0
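
If you run the server yourself, a minimal sketch of enabling it looks like this (assuming the ollama binary is on your PATH; exporting the variables in your shell or service definition works just as well). Note that the Ollama FAQ also mentions that the K/V cache can only be quantized when flash attention is enabled.

```python
import os
import subprocess

# Minimal sketch: start the Ollama server with a quantized K/V cache.
# Assumption: the `ollama` binary is on PATH.
env = os.environ.copy()
env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"   # supported values: f16 (default), q8_0, q4_0
env["OLLAMA_FLASH_ATTENTION"] = "1"    # per the FAQ, flash attention must be enabled
                                       # for K/V cache quantization to take effect

subprocess.run(["ollama", "serve"], env=env, check=True)
```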

Benefits:

  • lower memory consumption
  • most pronounced when using small models with large context windows

Drawback:

  • possible loss of generation quality, since keys and values are stored at lower precision (more noticeable with q4_0 than with q8_0)

Example numbers:

  • Llama 3.2 8B supports a 128,000-token context window.
  • Running it with Q4_K_M model quantization and the maximum context length, it consumes
    • 23.3 GB of memory without K/V cache quantization (the f16 default),
    • 17.0 GB with q8_0 K/V cache quantization, and
    • 13.8 GB with q4_0 K/V cache quantization.
  • With q4_0, the model now fits into 16 GB of RAM (a back-of-the-envelope estimate of the cache size follows after this list).
  • Sam McLeod’s vram-estimator is a handy tool for estimating the memory consumption of models with different quantization settings.
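
To see where these savings come from, here is a back-of-the-envelope estimate of the K/V cache size. The architecture parameters (layer count, K/V heads, head dimension) are illustrative assumptions for an 8B-class model with grouped-query attention, not the exact values behind the numbers above, and the bytes-per-value figures ignore the small per-block overhead of the quantized formats.

```python
# Rough K/V cache size: two tensors (K and V) per layer, each holding
# n_kv_heads * head_dim values per token of context.
# The parameters below are illustrative assumptions, not measured Ollama values.
n_layers = 32
n_kv_heads = 8
head_dim = 128
context_length = 128_000

bytes_per_value = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}  # ignoring block-scale overhead

for cache_type, nbytes in bytes_per_value.items():
    size = 2 * n_layers * n_kv_heads * head_dim * context_length * nbytes
    print(f"{cache_type}: {size / 1e9:.1f} GB")
```

With these assumptions the cache shrinks from roughly 16.8 GB (f16) to about 8.4 GB (q8_0) and 4.2 GB (q4_0); the totals above differ because they also include the model weights and other buffers.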

Use cases:

  Application          Benefit
  code generation      more code in context
  question answering   whole documents fit into the context (see the sketch below)
  function calling     more tools
  chat                 longer conversations
  multi-modal          images need many tokens
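
As an illustration of the question-answering row, here is a minimal sketch using the ollama Python package. The model name, file name, and num_ctx value are placeholder assumptions; a large num_ctx is exactly the situation in which a quantized K/V cache keeps memory usage manageable.

```python
import ollama

# Minimal sketch: ask a question about a whole document in one request.
# Model name, file name and num_ctx are placeholders; adjust to your setup.
with open("handbook.txt", encoding="utf-8") as f:
    document = f.read()

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": f"{document}\n\nWhat is the vacation policy?"}],
    options={"num_ctx": 32_768},  # enlarge the context window so the whole document fits
)
print(response["message"]["content"])
```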

To learn more, read Sam McLeod’s in-depth blog post Bringing K/V Context Quantisation to Ollama; Sam helped implement this feature in Ollama.

This post is licensed under CC BY 4.0 by the author.