Classes and functions that implement the InMemoryRetriever.

Functions

FUNC compute_embeddings

compute_embeddings(corpus, embedding_model_name: str, chunk_size: int = 512, overlap: int = 128)
Split documents into windows and compute embeddings for each window. Args:
  • corpus: PyArrow Table of documents as returned by read_corpus(). Should have the columns ["id", "url", "title", "text"].
  • embedding_model_name: Hugging Face model name for the model that computes embeddings. Also used for tokenizing.
  • chunk_size: Maximum size of chunks to split documents into, in embedding model tokens; must be less than or equal to the embedding model’s maximum sequence length.
  • overlap: Target overlap between adjacent chunks, in embedding model tokens. Actual chunk boundaries will fall on sentence boundaries.
Returns:
  • PyArrow Table of chunks of the corpus, with schema ["id", "url", "title", "begin", "end", "text", "embedding"].
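The windowing strategy described above (fixed-size chunks with a target overlap, snapped to sentence boundaries) can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: it measures sizes in characters rather than embedding-model tokens, and uses a naive regex sentence splitter.

```python
import re

def chunk_sentences(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Sketch of sentence-aligned windowing (characters stand in for tokens).

    Greedily packs whole sentences into a window up to chunk_size, then
    starts the next window far enough back that roughly `overlap`
    characters are shared between adjacent windows.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    windows, i = [], 0
    while i < len(sentences):
        window, size, j = [], 0, i
        while j < len(sentences) and size + len(sentences[j]) <= chunk_size:
            window.append(sentences[j])
            size += len(sentences[j]) + 1  # +1 for the joining space
            j += 1
        if not window:  # single sentence longer than chunk_size: keep it whole
            window, j = [sentences[i]], i + 1
        windows.append(" ".join(window))
        if j >= len(sentences):
            break
        # Back up whole sentences until ~overlap characters are repeated.
        back, k = 0, j
        while k - 1 > i and back + len(sentences[k - 1]) <= overlap:
            k -= 1
            back += len(sentences[k])
        i = k  # always > previous i, so progress is guaranteed
    return windows
```

With a small chunk size, adjacent windows share their boundary sentence, mirroring how `overlap` produces redundant context across chunks:

```python
chunk_sentences("Aa. Bb. Cc. Dd.", chunk_size=8, overlap=4)
# ["Aa. Bb.", "Bb. Cc.", "Cc. Dd."]
```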

FUNC write_embeddings

write_embeddings(target_dir: str, corpus_name: str, embeddings, chunks_per_partition: int = 10000) -> pathlib.Path
Write the embeddings produced by compute_embeddings() to a directory of Parquet files on local disk. Args:
  • target_dir: Location where the files should be written (in a subdirectory).
  • corpus_name: Corpus name used to generate the output directory name.
  • embeddings: PyArrow Table produced by compute_embeddings().
  • chunks_per_partition: Number of document chunks to write to each Parquet partition file.
Returns:
  • Path to the directory where the Parquet files were written.
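The partitioning arithmetic behind chunks_per_partition can be sketched as follows. The subdirectory and file naming here are hypothetical (the docs only say the files go in a subdirectory derived from corpus_name); the sketch shows how many partition files a corpus of a given size would produce.

```python
from pathlib import Path

def plan_partitions(target_dir: str, corpus_name: str, num_chunks: int,
                    chunks_per_partition: int = 10000) -> tuple[Path, list[Path]]:
    """Sketch of write_embeddings()'s partitioning: chunks are written in
    groups of chunks_per_partition, one Parquet file per group, inside a
    subdirectory named after the corpus (naming is an assumption)."""
    out_dir = Path(target_dir) / f"{corpus_name}_embeddings"
    num_parts = -(-num_chunks // chunks_per_partition)  # ceiling division
    files = [out_dir / f"part-{i:05d}.parquet" for i in range(num_parts)]
    return out_dir, files
```

For example, 25,000 chunks at the default 10,000 chunks per partition yields three Parquet files, the last one partially filled.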

Classes

CLASS InMemoryRetriever

Simple retriever that keeps docs and embeddings in memory. Args:
  • data_file_or_table: Parquet file of document snippets and embeddings, or an equivalent in-memory PyArrow Table. Should have columns id, begin, end, text, and embedding.
  • embedding_model_name: Name of the Sentence Transformers model to use for embeddings. Must match the model used to compute embeddings in the data file.
Methods:

FUNC retrieve

retrieve(self, query: str, top_k: int = 5) -> list[dict]
Run a query and return results. Args:
  • query: Natural language query string.
  • top_k: Number of top results to return.
Returns:
  • List of dicts with keys doc_id, text, and score.
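The scoring step of retrieve() can be sketched with plain Python, assuming cosine similarity between the query embedding and the stored chunk embeddings (the real class first embeds `query` with the Sentence Transformers model named by embedding_model_name; the similarity measure is an assumption here).

```python
import math

def score_chunks(query_vec: list[float], chunks: list[dict], top_k: int = 5) -> list[dict]:
    """Sketch of InMemoryRetriever.retrieve()'s ranking: cosine-score every
    in-memory chunk against the query vector and keep the top_k results."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    scored = [
        {"doc_id": c["id"], "text": c["text"],
         "score": cosine(query_vec, c["embedding"])}
        for c in chunks
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]
```

Keeping every embedding in memory makes this a single pass over the table per query, which is why the class suits corpora small enough to fit in RAM.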