Skip to main content

Documentation Index

Fetch the complete documentation index at: https://ibm-llm-runtime-aaf3a78b.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

OpenAIBackend connects Mellea to the OpenAI API and to any server that implements the OpenAI HTTP API — including LM Studio, Ollama’s OpenAI endpoint, vLLM, and OpenAI-compatible providers. Prerequisites: pip install mellea, a valid API key for the OpenAI API or a local OpenAI-compatible server running.

OpenAI API

Set your API key as an environment variable (recommended):
export OPENAI_API_KEY=sk-...
Then create a session:
# Requires: mellea
# Returns: ModelOutputThunk
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend
from mellea.stdlib.context import ChatContext

m = MelleaSession(
    OpenAIBackend(model_id="gpt-4o"),
    ctx=ChatContext(),
)
reply = m.chat("What is the capital of France?")
print(str(reply))
# Output will vary — LLM responses depend on model and temperature.
Pass the key directly if you prefer not to use an environment variable:
# Requires: mellea
# Returns: MelleaSession
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend

m = MelleaSession(
    OpenAIBackend(model_id="gpt-4o", api_key="sk-..."),
)
Note: Never commit API keys to source control. Use environment variables or a secrets manager in production.

OpenAI-compatible local servers

OpenAIBackend works with any server that implements the OpenAI HTTP API. No real API key is needed for local servers — pass any non-empty string:

LM Studio

# Requires: mellea
# Returns: MelleaSession
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend

m = MelleaSession(
    OpenAIBackend(
        model_id="qwen/qwen2.5-vl-7b",
        base_url="http://127.0.0.1:1234/v1",
    )
)

Ollama’s OpenAI endpoint

# Requires: mellea
# Returns: MelleaSession
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend
from mellea.stdlib.context import ChatContext

m = MelleaSession(
    OpenAIBackend(
        model_id="qwen2.5vl:7b",
        base_url="http://localhost:11434/v1",
        api_key="ollama",              # Ollama ignores the key; any value works
    ),
    ctx=ChatContext(),
)

vLLM

# Requires: mellea
# Returns: MelleaSession
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend

m = MelleaSession(
    OpenAIBackend(
        model_id="ibm-granite/granite-3.3-8b-instruct",
        base_url="http://localhost:8000/v1",
        api_key="your-vllm-key",
    )
)

Using base_url from the environment

Set OPENAI_BASE_URL to avoid repeating the base URL in your code:
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
# Requires: mellea
# Returns: MelleaSession
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend

# Reads OPENAI_BASE_URL and OPENAI_API_KEY from environment
m = MelleaSession(OpenAIBackend(model_id="qwen2.5vl:7b"))
base_url and api_key constructor parameters take precedence over environment variables if both are set.

Vision and multimodal input

OpenAIBackend supports image inputs for vision-capable models. Pass a PIL image or a Mellea ImageBlock:
# Requires: mellea
# Returns: ModelOutputThunk
from PIL import Image
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend
from mellea.core import ImageBlock
from mellea.stdlib.context import ChatContext

m = MelleaSession(
    OpenAIBackend(
        model_id="gpt-4o",
        api_key="sk-...",
    ),
    ctx=ChatContext(),
)

pil_image = Image.open("screenshot.png")
img_block = ImageBlock.from_pil_image(pil_image)

response = m.instruct(
    "Describe the content of this image and identify any text visible.",
    images=[img_block],
)
print(str(response))
# Output will vary — LLM responses depend on model and temperature.
You can also pass PIL Image objects directly without wrapping them:
# Requires: mellea, pillow
# Returns: ModelOutputThunk
chat_response = m.chat(
    "How many people are in this image?",
    images=[pil_image],
)
Backend note: Vision requires a model that supports image inputs (e.g., gpt-4o, qwen2.5vl:7b). Text-only models will raise an error if images are passed.

Structured output with format

Use the format parameter to constrain generation to a Pydantic schema:
# Requires: mellea, pydantic
# Returns: str
from pydantic import BaseModel
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend

class Summary(BaseModel):
    title: str
    key_points: list[str]
    word_count: int

m = MelleaSession(OpenAIBackend(model_id="gpt-4o", api_key="sk-..."))
result = m.instruct(
    "Summarise this article: {{text}}",
    format=Summary,
    user_variables={"text": "...your article text..."},
)
parsed = Summary.model_validate_json(str(result))
print(parsed.title)

Model options

Set generation parameters with ModelOption:
# Requires: mellea
# Returns: MelleaSession
from mellea import MelleaSession
from mellea.backends import ModelOption
from mellea.backends.openai import OpenAIBackend

m = MelleaSession(
    OpenAIBackend(
        model_id="gpt-4o",
        api_key="sk-...",
        model_options={
            ModelOption.TEMPERATURE: 0.3,
            ModelOption.MAX_NEW_TOKENS: 500,
            ModelOption.SYSTEM_PROMPT: "You are a concise technical writer.",
        },
    )
)
Options set at construction time apply to all calls. Options passed to instruct() or chat() apply to that call only and take precedence.

Anthropic via OpenAI-compatible endpoint

Anthropic’s API is not OpenAI-compatible natively, but if you access it through a proxy that exposes an OpenAI-compatible interface, you can use OpenAIBackend:
# Requires: mellea
# Returns: MelleaSession
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend

# Example: accessing Claude via a proxy with OpenAI-compatible interface
m = MelleaSession(
    OpenAIBackend(
        model_id="claude-3-haiku-20240307",
        api_key="your-anthropic-key",
        base_url="https://api.anthropic.com/v1/",
    )
)
Note (review needed): Direct Anthropic API compatibility via this path has not been verified against the current Mellea version. If you are using Anthropic, LiteLLM provides a verified integration — see Backends and Configuration.

Intrinsics with Granite Switch

Granite Switch models embed LoRA/aLoRA adapters directly in the model weights. When served via vLLM, these adapters enable intrinsic functions (RAG quality checks, safety evaluation, requirement validation) through the OpenAI-compatible API without loading adapter weights at runtime. Start a vLLM server with the Granite Switch model:
python -m vllm.entrypoints.openai.api_server \
    --model <granite-switch-model-id> \
    --dtype bfloat16 \
    --enable-prefix-caching
Then create a backend with load_embedded_adapters=True:
from mellea.backends.openai import OpenAIBackend
from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_3B_PREVIEW
from mellea.formatters import TemplateFormatter

backend = OpenAIBackend(
    model_id=IBM_GRANITE_SWITCH_4_1_3B_PREVIEW.hf_model_name,
    formatter=TemplateFormatter(model_id=IBM_GRANITE_SWITCH_4_1_3B_PREVIEW.hf_model_name),
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    load_embedded_adapters=True,
)
The high-level intrinsic wrappers (rag.check_answerability, core.check_certainty, etc.) work identically with this backend. See Intrinsics for the full list of available intrinsics.
Note: load_embedded_adapters=True downloads adapter I/O configurations from the model’s HuggingFace repository on first use. No adapter weights are transferred — the adapters are already part of the model. Only intrinsics embedded in the model are available — check the model’s adapter_index.json for the list.
For more control, load adapters manually with load_embedded_adapters=False:
from mellea.backends.adapters.adapter import EmbeddedIntrinsicAdapter
from mellea.backends.openai import OpenAIBackend
from mellea.backends.model_ids import IBM_GRANITE_SWITCH_4_1_3B_PREVIEW
from mellea.formatters import TemplateFormatter

backend = OpenAIBackend(
    model_id=IBM_GRANITE_SWITCH_4_1_3B_PREVIEW.hf_model_name,
    formatter=TemplateFormatter(model_id=IBM_GRANITE_SWITCH_4_1_3B_PREVIEW.hf_model_name),
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    load_embedded_adapters=False,
)

# Load a single adapter from the model's HuggingFace repo
adapters = EmbeddedIntrinsicAdapter.from_hub(
    IBM_GRANITE_SWITCH_4_1_3B_PREVIEW.hf_model_name,
    intrinsic_name="answerability",
)
for adapter in adapters:
    backend.add_adapter(adapter)

Troubleshooting

OPENAI_API_KEY not set error

Either export the environment variable or pass api_key directly to OpenAIBackend. For local servers, pass any non-empty string (e.g., api_key="local").

Connection refused at custom base_url

Confirm the local server is running and listening on the expected port. For Ollama, run ollama serve; for LM Studio, start the local server from the LM Studio UI.

Model not found

The model string must exactly match the name your server recognises. For OpenAI, refer to the OpenAI models page. For local servers, list available models from the server’s API or UI.

Empty value from a thinking-mode model

A response with result.value == "" despite non-zero completion_tokens is the signature of a thinking-mode model that emitted only reasoning tokens and no final answer. The OpenAI backend reports the response faithfully — the model genuinely returned content=None — but the reasoning content is preserved separately on the underlying ModelOutputThunk. Diagnose with:
result = m.instruct("What is 2 + 2?")
print(repr(result.value))                    # ''
print(result.generation.usage)               # {'completion_tokens': 9, ...}
print(result._thinking)                      # populated reasoning content, if any
This affects models that default to thinking mode, most commonly Qwen3 served via vLLM with --reasoning-parser qwen3. To disable thinking and get a normal text response, pass the runtime-specific switch through model_options. For vLLM:
from mellea import MelleaSession
from mellea.backends.openai import OpenAIBackend

m = MelleaSession(
    OpenAIBackend(
        model_id="Qwen/Qwen3-Coder-Next-FP8",
        base_url="http://localhost:8000/v1",
        api_key="unused",
        model_options={
            "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
        },
    )
)
Other inference servers expose the same control under different names — check your runtime’s documentation. If you intend to use thinking mode, read the reasoning trace from result._thinking rather than result.value.
See also: Backends and Configuration | Enforce Structured Output | Official Granite Switch Documentation