The purpose of the vLLM backend is to provide a fast, locally running inference engine.

Classes

CLASS LocalVLLMBackend

The LocalVLLMBackend uses vLLM’s Python interface for inference and uses a Formatter to convert Components into prompts. Support for [Activated LoRAs (ALoras)](https://arxiv.org/pdf/2504.12397) is planned. This backend is designed for running a Hugging Face model locally for small-scale inference. Its throughput is generally higher than that of LocalHFBackend, but it takes longer to load the model weights during instantiation, and it can be slower when requests are submitted one at a time rather than in batches.
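
A minimal instantiation sketch is shown below. The import paths, the `model_id` keyword, the session API, and the model id are assumptions for illustration only; they are not specified on this page, so adjust them to your installation.

```python
# Hypothetical sketch: import paths, the `model_id` keyword, and the session
# API shown here are assumptions, not confirmed by this reference page.
from mellea import MelleaSession                    # assumed import
from mellea.backends.vllm import LocalVLLMBackend   # assumed module path
from mellea.stdlib.base import SimpleContext        # assumed import

# Instantiation loads the model weights into vLLM, which can take a while
# for large checkpoints.
backend = LocalVLLMBackend(model_id="ibm-granite/granite-3.3-8b-instruct")

# The backend plugs into a session like any other backend.
m = MelleaSession(backend, ctx=SimpleContext())
answer = m.instruct("Summarize the advantage of batched inference in one sentence.")
print(answer)
```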
Methods:

FUNC generate_from_context

generate_from_context(self, action: Component[C] | CBlock, ctx: Context) -> tuple[ModelOutputThunk[C], Context]
Generate using the vLLM model.
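
A hedged sketch of calling this method directly on the backend follows; the import paths and the thunk-resolution behavior are assumptions, and a CBlock is used as the action since the signature accepts `Component[C] | CBlock`.

```python
# Hypothetical sketch: import paths and the `model_id` keyword are assumptions.
from mellea.backends.vllm import LocalVLLMBackend      # assumed module path
from mellea.stdlib.base import CBlock, SimpleContext   # assumed import

backend = LocalVLLMBackend(model_id="ibm-granite/granite-3.3-8b-instruct")
mot, new_ctx = backend.generate_from_context(
    CBlock("Explain paged attention in two sentences."),
    SimpleContext(),
)
# `mot` is a ModelOutputThunk; its value is filled in once generation
# completes (resolution is normally handled by the session layer).
```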

FUNC processing

processing(self, mot: ModelOutputThunk, chunk: vllm.RequestOutput)
Process the returned chunks or the complete response.
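
For reference, the `chunk` argument is vLLM’s own RequestOutput object. The standalone sketch below shows its shape using vLLM’s offline API directly, independent of this backend; the model id is illustrative.

```python
# Standalone illustration of vllm.RequestOutput, independent of this backend.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")  # illustrative model id
params = SamplingParams(temperature=0.0, max_tokens=64)

# `generate` returns one vllm.RequestOutput per prompt.
request_output = llm.generate(["Write a haiku about the ocean."], params)[0]

completion = request_output.outputs[0]   # vllm.CompletionOutput
print(completion.text)                   # generated text
print(completion.finish_reason)          # e.g. "stop" or "length"
```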

FUNC post_processing

post_processing(self, mot: ModelOutputThunk, conversation: list[dict], _format: type[BaseModelSubclass] | None, tool_calls: bool, tools: dict[str, Callable], seed)
Called when generation is done.

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C]], ctx: Context) -> list[ModelOutputThunk[C]]

FUNC generate_from_raw

generate_from_raw(self, actions: list[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk[C | str]]

FUNC generate_from_raw

generate_from_raw(self, actions: Sequence[Component[C] | CBlock], ctx: Context) -> list[ModelOutputThunk]
Generate using the completions API. The input is passed to the model as-is, without templating.
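
A hedged sketch of raw, completion-style generation follows; the import paths and the `model_id` keyword are assumptions for illustration.

```python
# Hypothetical sketch: import paths and the `model_id` keyword are assumptions.
from mellea.backends.vllm import LocalVLLMBackend      # assumed module path
from mellea.stdlib.base import CBlock, SimpleContext   # assumed import

backend = LocalVLLMBackend(model_id="ibm-granite/granite-3.3-8b-instruct")

# Each CBlock is sent to the model verbatim (no chat template); vLLM batches
# the requests for throughput.
thunks = backend.generate_from_raw(
    [CBlock("def fibonacci(n):"), CBlock("The capital of France is")],
    SimpleContext(),
)
# One ModelOutputThunk per input; each resolves to the raw completion text
# once generation finishes.
```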