Classes and functions that implement the InMemoryRetriever.

Functions

FUNC compute_embeddings

compute_embeddings(corpus, embedding_model_name: str, chunk_size: int = 512, overlap: int = 128)
Split documents into windows and compute embeddings for each window. Args:
  • corpus: PyArrow Table of documents as returned by read_corpus(). Should have the columns ["id", "url", "title", "text"].
  • embedding_model_name: Hugging Face model name for the model that computes embeddings. Also used for tokenizing.
  • chunk_size: Maximum size of chunks to split documents into, in embedding model tokens; must be less than or equal to the embedding model’s maximum sequence length.
  • overlap: Target overlap between adjacent chunks, in embedding model tokens. Actual chunk boundaries will fall on sentence boundaries.
Returns:
  • PyArrow Table of chunks of the corpus, with schema ["id", "url", "title", "begin", "end", "text", "embedding"].
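The windowing strategy described above (fixed-size chunks with a target overlap, snapped to sentence boundaries) can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: it measures sizes in characters rather than embedding-model tokens, and uses a naive regex sentence splitter.

```python
import re

def chunk_sentences(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Sketch of sentence-aligned windowing (characters stand in for tokens).

    Greedily packs whole sentences into a window up to chunk_size, then
    starts the next window far enough back that roughly `overlap`
    characters are shared between adjacent windows.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    windows, i = [], 0
    while i < len(sentences):
        window, size, j = [], 0, i
        while j < len(sentences) and size + len(sentences[j]) <= chunk_size:
            window.append(sentences[j])
            size += len(sentences[j]) + 1  # +1 for the joining space
            j += 1
        if not window:  # single sentence longer than chunk_size: keep it whole
            window, j = [sentences[i]], i + 1
        windows.append(" ".join(window))
        if j >= len(sentences):
            break
        # Back up whole sentences until ~overlap characters are repeated.
        back, k = 0, j
        while k - 1 > i and back + len(sentences[k - 1]) <= overlap:
            k -= 1
            back += len(sentences[k])
        i = k  # always > previous i, so progress is guaranteed
    return windows
```

With a small chunk size, adjacent windows share their boundary sentence, mirroring how `overlap` produces redundant context across chunks:

```python
chunk_sentences("Aa. Bb. Cc. Dd.", chunk_size=8, overlap=4)
# ["Aa. Bb.", "Bb. Cc.", "Cc. Dd."]
```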

FUNC write_embeddings

write_embeddings(target_dir: str, corpus_name: str, embeddings, chunks_per_partition: int = 10000) -> pathlib.Path
Write the embeddings produced by compute_embeddings() to a directory of Parquet files on local disk. Args:
  • target_dir: Location where the files should be written (in a subdirectory).
  • corpus_name: Corpus name used to generate the output directory name.
  • embeddings: PyArrow Table produced by compute_embeddings().
  • chunks_per_partition: Number of document chunks to write to each Parquet partition file.
Returns:
  • Path to the directory where the Parquet files were written.
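The partitioning arithmetic behind chunks_per_partition can be sketched as follows. The subdirectory and file naming here are hypothetical (the docs only say the files go in a subdirectory derived from corpus_name); the sketch shows how many partition files a corpus of a given size would produce.

```python
from pathlib import Path

def plan_partitions(target_dir: str, corpus_name: str, num_chunks: int,
                    chunks_per_partition: int = 10000) -> tuple[Path, list[Path]]:
    """Sketch of write_embeddings()'s partitioning: chunks are written in
    groups of chunks_per_partition, one Parquet file per group, inside a
    subdirectory named after the corpus (naming is an assumption)."""
    out_dir = Path(target_dir) / f"{corpus_name}_embeddings"
    num_parts = -(-num_chunks // chunks_per_partition)  # ceiling division
    files = [out_dir / f"part-{i:05d}.parquet" for i in range(num_parts)]
    return out_dir, files
```

For example, 25,000 chunks at the default 10,000 chunks per partition yields three Parquet files, the last one partially filled.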

Classes

CLASS InMemoryRetriever

Simple retriever that keeps docs and embeddings in memory. Args:
  • data_file_or_table: Parquet file of document snippets and embeddings, or an equivalent in-memory PyArrow Table. Should have columns id, begin, end, text, and embedding.
  • embedding_model_name: Name of the Sentence Transformers model to use for embeddings. Must match the model used to compute embeddings in the data file.
Methods:

FUNC retrieve

retrieve(self, query: str, top_k: int = 5) -> list[dict]
Run a query and return results. Args:
  • query: Natural language query string.
  • top_k: Number of top results to return.
Returns:
  • List of dicts with keys doc_id, text, and score.
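The scoring step of retrieve() can be sketched with plain Python, assuming cosine similarity between the query embedding and the stored chunk embeddings (the real class first embeds `query` with the Sentence Transformers model named by embedding_model_name; the similarity measure is an assumption here).

```python
import math

def score_chunks(query_vec: list[float], chunks: list[dict], top_k: int = 5) -> list[dict]:
    """Sketch of InMemoryRetriever.retrieve()'s ranking: cosine-score every
    in-memory chunk against the query vector and keep the top_k results."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    scored = [
        {"doc_id": c["id"], "text": c["text"],
         "score": cosine(query_vec, c["embedding"])}
        for c in chunks
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]
```

Keeping every embedding in memory makes this a single pass over the table per query, which is why the class suits corpora small enough to fit in RAM.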