A somewhat hidden feature of Ollama is K/V cache quantization. It is relevant for local AI because it reduces memory consumption, especially for small LLMs with large context windows.
This post describes how to activate K/V cache quantization in Ollama and gives an overview of its benefits, drawbacks, and use cases.
## Activating K/V cache quantization in Ollama
K/V cache quantization in Ollama is not on by default, so you need to activate it by setting the OLLAMA_KV_CACHE_TYPE environment variable. The supported values are documented in the Ollama FAQ under "How can I set the quantization type for the K/V cache?" and listed below, followed by a small launch sketch:
- f16 (the default, full precision)
- q8_0 (8-bit quantization, roughly half the cache memory of f16)
- q4_0 (4-bit quantization, roughly a quarter of the cache memory of f16)
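A minimal sketch of putting this into practice: the snippet launches `ollama serve` with the variable set, assuming the `ollama` binary is on your PATH. The Ollama FAQ ties the quantized K/V cache to flash attention, so `OLLAMA_FLASH_ATTENTION` is set as well; q8_0 is just one of the supported values.

```python
# Sketch: start the Ollama server with K/V cache quantization enabled.
# Assumes the `ollama` binary is on PATH.
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"   # f16 (default), q8_0, or q4_0
env["OLLAMA_FLASH_ATTENTION"] = "1"    # quantized K/V cache requires flash attention

subprocess.run(["ollama", "serve"], env=env, check=True)
```

In a regular installation you would set these variables in whatever service configuration starts Ollama (systemd, launchd, Docker); the script form just makes the moving parts explicit.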
## Benefits and Drawbacks
K/V cache quantization can make the difference between a model fitting on your machine and not fitting at all. You can use Sam McLeod's vram-estimator to estimate the memory consumption of models with different quantization settings.
In brief, the benefit is lower memory consumption, which is most pronounced for small models with large context windows, because the K/V cache grows with context length while the model weights stay fixed.
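To see why the context window dominates, here is a back-of-the-envelope estimate of the K/V cache size. It is only a sketch: the attention parameters are the published values for Llama 3.1 8B, and the per-value byte counts for q8_0/q4_0 approximate ggml-style block quantization including its scales, so the figures Ollama actually reports will differ somewhat.

```python
# Rough K/V cache size: one key and one value vector per token,
# per layer, per K/V head. A sketch, not what Ollama reports exactly.
def kv_cache_gib(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value):
    total = 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_value
    return total / 1024**3

# Llama 3.1 8B: 32 layers, 8 K/V heads (grouped-query attention), head dim 128.
for name, bytes_per_value in [("f16", 2.0), ("q8_0", 1.0625), ("q4_0", 0.5625)]:
    print(f"{name}: {kv_cache_gib(131_072, 32, 8, 128, bytes_per_value):.1f} GiB")
# -> roughly 16, 8.5 and 4.5 GiB respectively
```

The model weights stay the same size in all three cases, which is why the savings matter most when the context is long relative to the model.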
The main drawback is reduced model accuracy. The LMDeploy team has published a blog post with K/V cache quantization accuracy test results if you want to see the impact quantified.
## Example numbers
Llama 3.1 8B supports a 128,000-token context window.
When you run it with Q4_K_M weight quantization and the longest possible context length, it consumes
- 23.3 GB of memory without K/V cache quantization,
- 17.0 GB with Q8_0 K/V cache quantization, and
- 13.8 GB with Q4_0 K/V cache quantization.
With Q4_0 K/V cache quantization, it now fits into 16 GB of VRAM.
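The numbers above assume the full context window is actually allocated; the context length is set per request via the `num_ctx` option, and the K/V cache is sized accordingly, so a smaller window shrinks it proportionally. A hypothetical request against the REST API on the default port could look like this (model tag and prompt are placeholders):

```python
# Sketch: ask a running Ollama server for the full 128K-token context window.
import json
import urllib.request

payload = {
    "model": "llama3.1:8b",            # placeholder tag; must already be pulled
    "prompt": "Summarize the following document ...",
    "options": {"num_ctx": 131072},    # context length requested for this call
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```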
## Use cases
| Application | Benefit |
|---|---|
| Code generation | more code fits into the context |
| Question answering | whole documents fit into the context |
| Function calling | room for more tool definitions |
| Chat | longer conversations |
| Multi-modal | images consume many tokens, so more fit |
## More Info
To learn more, read Sam McLeod's in-depth blog post Bringing K/V Context Quantisation to Ollama. Sam helped implement the feature in Ollama.