
Overview

The Granite Vision models are designed for enterprise applications, specializing in visual document understanding. They can perform a wide range of tasks, including extracting information from tables, charts, diagrams, sketches, and infographics, as well as general image analysis. The family also includes Granite Vision Embedding, a novel multimodal embedding model for document retrieval that enables queries over documents containing tables, charts, infographics, and complex layouts. By eliminating the need for text extraction, Vision Embedding simplifies and accelerates retrieval-augmented generation (RAG) pipelines.

Despite its lightweight architecture, Granite Vision achieves strong performance on standard visual document understanding benchmarks and on the LiveXiv benchmark, which evaluates models on a continuously updated set of new arXiv papers to prevent data leakage. Granite Vision is currently ranked 2nd on the OCRBench Leaderboard (as of October 2, 2025). Similarly, Granite Vision Embedding achieves top ranks on visual document retrieval benchmarks, currently holding 5th place on the ViDoRe 2 leaderboard (as of October 2, 2025).

Granite Vision and Granite Vision Embedding are released under the Apache 2.0 license, making them freely available for both research and commercial use, with full transparency into their training data.

Paper: Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence. Please note that this paper describes Granite Vision 3.2. Granite Vision 3.3 shares most of its technical underpinnings with Granite Vision 3.2, but adds several enhancements: a new and improved vision encoder, many new high-quality training datasets, and several new experimental capabilities.
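To make the retrieval idea concrete: a visual-document RAG pipeline embeds each page image once, then ranks pages against an embedded query by cosine similarity, with no OCR step in between. The sketch below is illustrative only; embed_page_image and embed_query are hypothetical placeholders standing in for calls to Granite Vision Embedding, and the rest is plain NumPy.
import numpy as np

# Hypothetical stand-ins for Granite Vision Embedding: they return random
# vectors here purely so the ranking logic is runnable end to end.
rng = np.random.default_rng(0)

def embed_page_image(page_path: str) -> np.ndarray:
    return rng.standard_normal(128)

def embed_query(query: str) -> np.ndarray:
    return rng.standard_normal(128)

pages = ["report_p1.png", "report_p2.png", "report_p3.png"]

# Index once: embed every page image and L2-normalize the vectors.
page_vecs = np.stack([embed_page_image(p) for p in pages])
page_vecs /= np.linalg.norm(page_vecs, axis=1, keepdims=True)

# Query time: embed the question and rank pages by cosine similarity.
q = embed_query("What is the highest scoring model on ChartQA?")
q /= np.linalg.norm(q)
ranking = np.argsort(page_vecs @ q)[::-1]
print([pages[i] for i in ranking])  # pages ordered from most to least relevant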

Model cards

Run locally with Ollama

Learn more about Granite on Ollama.
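As a quick illustration, and assuming a Granite Vision tag (for example granite3.2-vision) is available in your local Ollama library and the ollama Python client is installed (pip install ollama), a chat request with an attached image looks roughly like this:
import ollama

# Assumes the model has already been pulled, e.g. `ollama pull granite3.2-vision`,
# and that example.png exists in the working directory.
response = ollama.chat(
    model="granite3.2-vision",
    messages=[{
        "role": "user",
        "content": "What is the highest scoring model on ChartQA and what is its score?",
        "images": ["example.png"],  # local path to the image to ask about
    }],
)
print(response["message"]["content"])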

Getting started

Granite Vision with Hugging Face transformers

This is a simple example of how to use the granite-vision-3.3-2b model with the Transformers library and PyTorch. First, install the required libraries:
pip install torch "transformers>=4.49"
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-vision-3.3-2b"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)

# prepare image and text prompt, using the appropriate prompt template

img_path = hf_hub_download(repo_id=model_path, filename='example.png')

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": img_path},
            {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(device)


# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
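Note that the call above decodes the full sequence, prompt included, because generate echoes the input tokens. To print only the model's answer, slice off the prompt tokens first:
# Decode only the tokens generated after the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))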

Granite Vision with vLLM

The granite-vision-3.3-2b model can also be loaded with vLLM. First make sure to install the following libraries:
pip install torch torchvision torchaudio
pip install vllm==0.6.6
Then, run the snippet below for offline inference.
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from huggingface_hub import hf_hub_download
from PIL import Image

model_path = "ibm-granite/granite-vision-3.3-2b"

model = LLM(
    model=model_path,
)

sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=64,
)

# Define the question we want to answer and format the prompt
image_token = "<image>"
system_prompt = "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"

question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"{system_prompt}<|user|>\n{image_token}\n{question}\n<|assistant|>\n"
img_path = hf_hub_download(repo_id=model_path, filename='example.png')
image = Image.open(img_path).convert("RGB")

# Build the inputs to vLLM; the image is passed as `multi_modal_data`.
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    }
}

outputs = model.generate(inputs, sampling_params=sampling_params)
print(f"Generated text: {outputs[0].outputs[0].text}")