LocalHFBackend uses Hugging Face Transformers
for local inference. It is designed for experimental Mellea features — aLoRA adapters,
constrained decoding, and span-based context — that are not yet available on
server-based backends.
Prerequisites: pip install 'mellea[hf]', Python 3.11+, local model weights.
Tip: For everyday local inference without experimental features, use Ollama — it is simpler to set up and well suited for development.
Install
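Install the Hugging Face extra; the quotes keep your shell from interpreting the brackets:

```shell
pip install "mellea[hf]"
```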
Basic usage
LocalHFBackend downloads the model weights via the Transformers
Auto* classes and loads them onto the best available device (cuda > mps > cpu).
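A minimal sketch of a first call. The import paths, session API, and model id here are assumptions drawn from Mellea's usual usage pattern, so check the Backends guide for the exact names; the first call downloads the weights:

```python
from mellea import MelleaSession
from mellea.backends.huggingface import LocalHFBackend

# Assumed import paths; the model id below is only illustrative.
backend = LocalHFBackend(model_id="ibm-granite/granite-3.3-8b-instruct")
m = MelleaSession(backend)

# Weights are fetched on first use and loaded onto the best device.
result = m.instruct("Summarize the plot of Hamlet in one sentence.")
print(result)
```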
Device selection
The backend selects the device automatically: CUDA GPU
if available, then Apple Silicon MPS, then CPU. To override device selection, use
custom_config:
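The same check can be reproduced with plain torch; the commented-out override below is a sketch that assumes custom_config entries are forwarded to Transformers' from_pretrained, so check the backend reference for the supported keys:

```python
import torch

# Mirrors the backend's default selection order: cuda > mps > cpu.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(device)

# Hypothetical override, assuming custom_config is forwarded to
# Transformers' from_pretrained():
# backend = LocalHFBackend(model_id="...", custom_config={"device_map": "cpu"})
```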
KV cache
LocalHFBackend caches KV blocks across calls by default (use_caches=True). This
speeds up repeated calls that share a common prefix. Pass a SimpleLRUCache
to control capacity, or disable caching entirely for debugging:
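A sketch of both options. The SimpleLRUCache import path, its constructor argument, and the keyword it is passed under are assumptions, so check the cache module's reference before relying on them:

```python
from mellea.backends.huggingface import LocalHFBackend
# Assumed import path for the cache class.
from mellea.backends.cache import SimpleLRUCache

# Cap the KV cache at 4 entries (capacity argument is an assumption).
backend = LocalHFBackend(
    model_id="ibm-granite/granite-3.3-8b-instruct",  # illustrative
    cache=SimpleLRUCache(4),
)

# Or disable caching entirely while debugging.
backend_nocache = LocalHFBackend(
    model_id="ibm-granite/granite-3.3-8b-instruct",
    use_caches=False,
)
```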
aLoRA adapters
LocalHFBackend supports Activated LoRA (aLoRA)
adapters — lightweight domain-specific requirement validators that run on local GPU
hardware. See the aLoRA guide for training and usage.
Vision support
Vision support for LocalHFBackend is model-dependent and experimental. Pass a PIL
image or an ImageBlock via images=[...] to
instruct() or chat() when using a vision-capable model. Not all models loaded via
LocalHFBackend support image input. See
Use Images and Vision Models.
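A sketch, assuming a vision-capable checkpoint and the session API from the basic-usage example; the model id and file path are illustrative:

```python
from PIL import Image
from mellea import MelleaSession
from mellea.backends.huggingface import LocalHFBackend

# Illustrative vision-capable model id — substitute one you have verified.
backend = LocalHFBackend(model_id="ibm-granite/granite-vision-3.2-2b")
m = MelleaSession(backend)

img = Image.open("chart.png")  # illustrative path
reply = m.chat("What does this chart show?", images=[img])
print(reply)
```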
Troubleshooting
pip install "mellea[hf]" fails on Intel macOS
If you see torch/torchvision version errors on an Intel Mac, use Conda: create an
environment there, install mellea[hf] into it, and run python inside the Conda
environment rather than uv run --with mellea.
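A minimal sketch of that workflow; the environment name, Python version, and script name are illustrative:

```shell
conda create -n mellea-env python=3.12
conda activate mellea-env
pip install "mellea[hf]"
python my_script.py   # run directly, not via `uv run --with mellea`
```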
Python 3.13: error: can't find Rust compiler
The outlines package (used by mellea[hf]) requires a Rust compiler on Python 3.13.
Either downgrade to Python 3.12 or install the
Rust compiler:
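For example, via rustup (the official Rust installer); afterwards re-run the install:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
pip install "mellea[hf]"
```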
See also: Backends and Configuration | LoRA and aLoRA Adapters