LocalHFBackend stores the key-value (KV) attention states from
a forward pass and reuses them in later calls, skipping the prefill computation for
content that hasn't changed. This is useful when many calls share a large common
prefix: a system prompt, a long document, or a fixed instruction header.
Prerequisite: This feature is specific to `LocalHFBackend`. Server-side backends
(Ollama, OpenAI, vLLM) manage their own KV caching internally.
## Enable caching on the backend
Pass a `SimpleLRUCache` to `LocalHFBackend` at construction time.
`capacity` is the maximum number of cached KV blocks held in GPU memory at once.
When the cache is full, the least recently used block is evicted and its GPU memory
is freed automatically.
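The capacity and eviction behaviour described above can be modelled in a few lines of plain Python. This is an illustrative sketch, not mellea's `SimpleLRUCache` implementation; the `on_evict` hook stands in for the GPU-memory cleanup described later on this page.

```python
from collections import OrderedDict

class LRUKVCache:
    """Illustrative LRU cache with an eviction hook (not mellea's SimpleLRUCache)."""

    def __init__(self, capacity, on_evict=None):
        self.capacity = capacity
        self.on_evict = on_evict          # called with (key, value) on eviction
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)      # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            old_key, old_value = self._store.popitem(last=False)  # evict LRU
            if self.on_evict:
                self.on_evict(old_key, old_value)  # e.g. free GPU tensors here

evicted = []
cache = LRUKVCache(capacity=2, on_evict=lambda k, v: evicted.append(k))
cache.put("system-prompt", "kv-A")
cache.put("doc-1", "kv-B")
cache.get("system-prompt")        # touch: "doc-1" is now least recently used
cache.put("doc-2", "kv-C")        # capacity exceeded -> "doc-1" evicted
print(evicted)                    # ['doc-1']
```

With mellea, this corresponds to constructing the backend with something like `LocalHFBackend(..., cache=SimpleLRUCache(3))` (check the library's reference documentation for the exact signature).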
To disable caching entirely (useful for benchmarking), pass `use_caches=False` at construction; see the "Disable for benchmarking" section below.
## Mark a CBlock for caching
Caching is opt-in at the content level. Set `cache=True` on a `CBlock` to tell the
backend to prefill that block and store its KV state. The first time the backend
sees such a `CBlock`, it runs a forward pass and stores the resulting `DynamicCache`.
On subsequent calls containing the same block, the cached states are retrieved and
merged with the non-cached suffix; no redundant prefill is performed.
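The opt-in behaviour can be sketched as follows. `ToyBackend` and its hashing scheme are illustrative stand-ins, not mellea's actual cache-key logic:

```python
import hashlib

class ToyBackend:
    """Toy model of opt-in KV caching (not mellea's LocalHFBackend)."""

    def __init__(self):
        self.kv_store = {}       # content hash -> stored "KV state"
        self.prefill_calls = 0

    def _prefill(self, text):
        self.prefill_calls += 1  # stands in for an expensive forward pass
        return ("KV", text)

    def generate(self, blocks):
        """blocks: list of (text, cache_flag) pairs, in prompt order."""
        states = []
        for text, cache_flag in blocks:
            if cache_flag:
                key = hashlib.sha256(text.encode()).hexdigest()
                if key not in self.kv_store:
                    self.kv_store[key] = self._prefill(text)  # first call only
                states.append(self.kv_store[key])
            else:
                states.append(self._prefill(text))            # always prefilled
        return states

backend = ToyBackend()
doc = "long grounding document ..."
backend.generate([(doc, True), ("query 1", False)])
backend.generate([(doc, True), ("query 2", False)])
print(backend.prefill_calls)  # 3: the document once, plus one per query
```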
## How KV smashing works
When a prompt contains a mix of cached and uncached blocks, Mellea:

- Tokenises each block independently.
- Runs forward passes on uncached blocks.
- Retrieves the stored `DynamicCache` for cached blocks.
- Smashes (concatenates) all KV caches along the time axis using `merge_dynamic_caches()`.
- Passes the merged cache plus the combined input IDs to the generation step.
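The smashing step can be sketched with plain Python lists standing in for KV tensors. This models the concatenation, not mellea's `merge_dynamic_caches()` itself:

```python
def merge_kv_caches(caches):
    """Concatenate per-layer (keys, values) lists along the time (token) axis.

    caches: list of caches; each cache is a list with one (keys, values)
    pair per layer, where keys/values are lists of per-token vectors.
    """
    n_layers = len(caches[0])
    merged = []
    for layer in range(n_layers):
        keys, values = [], []
        for cache in caches:
            k, v = cache[layer]
            keys.extend(k)       # the time axis grows block by block
            values.extend(v)
        merged.append((keys, values))
    return merged

# One-layer model: a cached 2-token block followed by a 1-token suffix.
cached_block = [([[0.1], [0.2]], [[1.0], [2.0]])]
suffix_block = [([[0.3]], [[3.0]])]
merged = merge_kv_caches([cached_block, suffix_block])
print(len(merged[0][0]))  # 3 key vectors on the time axis
```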
## Practical example
A pipeline that applies the same long grounding document to many different queries
benefits most: the reference block is prefilled once, and each subsequent query pays
only for its own suffix tokens.
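A back-of-the-envelope accounting of the savings, with made-up token counts:

```python
# Illustrative cost accounting (all token counts are made up).
doc_tokens = 8000                    # shared grounding document, cached
queries = [120, 95, 140, 80]         # per-query suffix token counts

# Without caching, every call re-prefills the document plus its query.
without_cache = sum(doc_tokens + q for q in queries)
# With caching, the document is prefilled once; queries pay only their suffix.
with_cache = doc_tokens + sum(queries)

print(without_cache, with_cache)  # 32435 8435
```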
## Cache capacity and memory
Each cached block occupies GPU memory proportional to the block's token count and to the model's number of layers and attention heads. Choose `capacity` conservatively:
- 1–3 for large documents or long system prompts on a single GPU.
- 5–10 for short, frequently reused blocks with ample VRAM.
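As a rough sizing aid, the per-block footprint can be estimated from the model geometry. All numbers below are assumptions for illustration:

```python
# Rough per-block KV memory estimate (model shape and dtype are illustrative):
#   2 tensors (K and V) x layers x KV heads x head dim x tokens x bytes/element
layers, kv_heads, head_dim = 32, 8, 128   # assumed model geometry
tokens, bytes_per_elem = 4000, 2          # a 4k-token block in fp16

kv_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem
print(f"{kv_bytes / 1024**3:.2f} GiB per cached block")  # 0.49 GiB
```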
An `on_evict` callback (used internally by `LocalHFBackend`) frees GPU tensors
when a block is evicted, so the cache does not leak memory.
## Disable for benchmarking
To measure true generation time without cache benefits, pass `use_caches=False` at
construction. The session behaviour is otherwise identical: disabling caching only
affects whether prefill states are stored and reused.
See also: HuggingFace Transformers | Intrinsics | LoRA and aLoRA Adapters