LocalVLLMBackend uses vLLM for higher-throughput local inference.
It is a good choice when you are running many requests in parallel — for example, batch
evaluation or load testing. vLLM takes longer to initialise than LocalHFBackend but
sustains higher throughput once warm.
Prerequisites: Linux, a CUDA-capable GPU, and `pip install 'mellea[vllm]'`.

Platform note: vLLM is not supported on macOS. Use `LocalHFBackend` or Ollama on Apple Silicon.
## Install
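Install mellea with the vLLM extra (as listed in the prerequisites above):

```shell
pip install 'mellea[vllm]'
```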
## Basic usage
Always set `ModelOption.MAX_NEW_TOKENS` explicitly: vLLM defaults to approximately 16 tokens. For structured output or longer responses, set `ModelOption.MAX_NEW_TOKENS` to 200–1000+ tokens.
## High-throughput batched inference
vLLM processes requests in continuous batches. For batch evaluation, send requests concurrently rather than sequentially to take advantage of the batching.
## Vision support

Vision support for `LocalVLLMBackend` is model-dependent. Pass a PIL image or an `ImageBlock` via `images=[...]` when using a vision-capable model. See Use Images and Vision Models.
## Troubleshooting
### Output truncated at ~16 tokens
vLLM defaults to approximately 16 tokens. Set `ModelOption.MAX_NEW_TOKENS` explicitly:
See also: Backends and Configuration | LoRA and aLoRA Adapters