`@generative` boolean function to discard irrelevant candidates before passing
the survivors to a grounded `m.instruct()` call.

Source file: `docs/examples/rag/simple_rag_with_filter.py`
## Concepts covered
- Building a FAISS flat inner-product index from sentence-transformer embeddings
- Using `@generative` returning `bool` as a per-document relevance gate
- Passing filtered documents as `grounding_context` to `m.instruct()`
- Running the example with `uv run` via an inline PEP 723 dependency block
## Prerequisites
- Quick Start complete
- `faiss-cpu` and `sentence-transformers` installed, or run via `uv run`, which installs them automatically from the inline script block
- Ollama running locally with `granite4:micro` pulled (or a Mistral model; see the session setup section below)
Run the example with `uv run`:
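Concretely, assuming the repository root as the working directory (the file path comes from the source-file note above), the invocation is:

```shell
uv run docs/examples/rag/simple_rag_with_filter.py
```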
## Pipeline architecture

At a high level: embed the corpus and build a FAISS index, retrieve the top-k candidates for a query, gate each candidate through the `@generative` relevance filter, then pass the survivors as `grounding_context` to `m.instruct()`.
## The full example
### Inline script dependencies
The `# /// script` comment block at the top of the file follows PEP 723.
When you run the file with `uv run simple_rag_with_filter.py`, uv reads this
block and installs the listed packages into a temporary environment before
execution. No manual `pip install` is needed.
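A sketch of what such a block looks like. Only `faiss-cpu` and `sentence-transformers` are named in the prerequisites; the Mellea package entry is an assumption, and the exact list in the example file may differ:

```python
# /// script
# dependencies = [
#     "faiss-cpu",
#     "sentence-transformers",
#     "mellea",  # assumed: the library providing @generative and m.instruct()
# ]
# ///
```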
### Imports and document corpus
`IndexFlatIP` is a FAISS index that scores by inner product, which is
equivalent to cosine similarity when the embeddings are L2-normalised, as
sentence-transformers produces by default.
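The equivalence is easy to verify directly; a minimal sketch in plain NumPy, with made-up vectors standing in for embeddings:

```python
import numpy as np

# Two arbitrary "embedding" vectors.
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

# Cosine similarity of the raw vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of the L2-normalised vectors gives the same number,
# which is why IndexFlatIP behaves as a cosine index here.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = np.dot(a_n, b_n)

assert np.isclose(cosine, inner)
```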
### Index creation and querying
`create_index` encodes all documents once and stores the result. `query_index`
encodes the query at inference time and returns the top-k documents by
similarity. The default `k=5` gives the filter stage enough candidates without
overwhelming the context window.
### The relevance filter
A `@generative` function returning `bool` acts as a classifier. The docstring
frames the task: given a candidate document (`answer`) and the original query
(`question`), decide whether the document is actually useful.
Vector similarity finds documents that are topically related, but it can
return documents that mention the same keywords without actually answering the
question. This LLM filter catches those false positives.
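A sketch of what such a gate can look like. The decorator and the `question`/`answer` parameters come from the text; the function name, docstring wording, and call style are illustrative assumptions about Mellea's API:

```python
from mellea import generative

@generative
def is_relevant(question: str, answer: str) -> bool:
    """Given a candidate document (answer) and the original query
    (question), decide whether the document actually helps answer the
    question rather than merely sharing keywords with it."""

# Hypothetical usage, with m a Mellea session:
# survivors = [d for d in candidates
#              if is_relevant(m, question=query, answer=d)]
```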
### Main: retrieval, filtering, and generation
`del embedding_model` frees the sentence-transformer weights before loading
the LLM backend. On a machine with limited VRAM or RAM this prevents
out-of-memory errors when both models would otherwise be resident simultaneously.
`model_id=model_ids.MISTRALAI_MISTRAL_0_3_7B` selects a specific backend
model. You can substitute any model constant from `model_ids` or pass a string
identifier directly; a comment in the example confirms that other models work too.
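Session setup might be sketched as follows; the constant name comes from the text, while the import paths and the `start_session` signature are assumptions about Mellea's API:

```python
import mellea
from mellea.backends import model_ids  # assumed module path

# Create a session against the locally running backend.
m = mellea.start_session(model_id=model_ids.MISTRALAI_MISTRAL_0_3_7B)
```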
`grounding_context` passes the surviving documents as named context
entries. The template variable `{{query}}` is supplied separately via
`user_variables`. Keeping query and context separate lets Mellea render the
prompt correctly and trace each component independently.
`answer.value` retrieves the raw string from the `ModelOutputThunk` returned
by `m.instruct()`.
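Put together, the final call can be sketched like this; the keyword names (`grounding_context`, `user_variables`) and `{{query}}` come from the text, while the prompt wording, context keys, and variable names are illustrative:

```python
answer = m.instruct(
    "Answer the user's question: {{query}}",
    grounding_context={f"doc_{i}": doc for i, doc in enumerate(survivors)},
    user_variables={"query": query},
)
print(answer.value)  # raw string from the ModelOutputThunk
```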
### Full file
## Key observations
**Two-stage retrieval reduces hallucination.** Vector search alone can surface documents that share vocabulary with the query but do not answer it. The LLM filter adds a semantic gate that vector distance cannot provide.

**`@generative` returning `bool` is a classifier.** You can use this pattern
wherever you need a binary decision: spam detection, content moderation, input
validation, feature flags driven by natural language.
**`grounding_context` is the RAG anchor.** Without it, `m.instruct()` would
generate from the model's parametric knowledge. Passing documents through
`grounding_context` grounds the answer in retrieved evidence.
## What to try next
- Replace the in-memory list with a database-backed corpus, and see `docs/examples/rag/mellea_pdf.py` for a PDF-based variant.
- Tune `k` in `query_index` and observe how the filter step affects final answer quality.
- Add `requirements` to the final `m.instruct()` call to enforce length, citation, or tone constraints; see the requirements system concept.
See also: Build a RAG Pipeline — step-by-step how-to guide | Examples Index