mellea.formatters.granite.retrievers.util

Utility functions for working with the MTRAG benchmark dataset.

Functions

download_mtrag_corpus(target_dir: str, corpus_name: str) -> pathlib.Path

Download a corpus file from the MTRAG benchmark if the file isn’t already present.

Args:

Returns:

Raises:
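
The download-if-absent behavior described for `download_mtrag_corpus` can be sketched roughly as follows. This is not the library's implementation: `fetch_if_missing` and its `fetch` callback are hypothetical names, and because this reference does not specify the actual download step (URL, file naming), that step is passed in as a parameter.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(
    target_dir: str, file_name: str, fetch: Callable[[Path], None]
) -> Path:
    """Return the path to file_name in target_dir, fetching it only if absent.

    Hypothetical helper: `fetch` stands in for the real download step and
    is expected to write the file to the path it receives.
    """
    target = Path(target_dir) / file_name
    if not target.exists():
        # Create the directory on first use, then run the download step.
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(target)
    return target
```

Calling the helper twice with the same arguments triggers the download step only once; the second call just returns the existing path.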

read_mtrag_corpus(corpus_file: str | pathlib.Path) -> pa.Table

Read the documents from one of the MTRAG benchmark’s corpora.

Args:

Returns:

Raises:

TypeError: If the ID column cannot be identified or if no text column is present in the corpus file.
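
To illustrate the kind of check behind this `TypeError`, here is a stdlib-only sketch that works on a list of column names. The candidate names (`_id`, `id`, `text`) and the helper `find_id_and_text_columns` are assumptions made for illustration; this reference does not list the column names the library accepts, and the real function returns a `pa.Table` rather than a tuple.

```python
from typing import Sequence, Tuple


def find_id_and_text_columns(
    columns: Sequence[str],
    id_candidates: Sequence[str] = ("_id", "id"),  # assumed candidate names
    text_column: str = "text",  # assumed text column name
) -> Tuple[str, str]:
    """Pick the ID column and confirm a text column exists (hypothetical helper)."""
    # Use the first candidate that appears among the columns, if any.
    id_column = next((c for c in id_candidates if c in columns), None)
    if id_column is None:
        raise TypeError(f"Cannot identify an ID column among {list(columns)}")
    if text_column not in columns:
        raise TypeError(f"No {text_column!r} column among {list(columns)}")
    return id_column, text_column
```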

download_mtrag_embeddings(embedding_name: str, corpus_name: str, target_dir: str)

Download precomputed embeddings for a corpus in the MTRAG benchmark.

Args:

embedding_name: Name of the SentenceTransformers embedding model used to create the embeddings.
corpus_name: Must be one of "cloud", "clapnq", "fiqa", or "govt".
target_dir: Location where Parquet files named "part_001.parquet", "part_002.parquet", etc. will be written.

Raises:

ValueError: If corpus_name is not one of the supported corpus names, or if no precomputed embeddings are found for the given corpus and embedding model combination.
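
The corpus-name check and the output file naming described for `download_mtrag_embeddings` can be sketched as below. The corpus names and the `part_NNN.parquet` pattern come from the text; `SUPPORTED_CORPORA`, `check_corpus_name`, and `part_file_name` are hypothetical names, and the library's own validation may differ in its details and error messages.

```python
# Corpus names listed in the corpus_name argument description.
SUPPORTED_CORPORA = ("cloud", "clapnq", "fiqa", "govt")


def check_corpus_name(corpus_name: str) -> str:
    """Return corpus_name unchanged, or raise ValueError if unsupported."""
    if corpus_name not in SUPPORTED_CORPORA:
        raise ValueError(
            f"Unsupported corpus name {corpus_name!r}; "
            f"expected one of {SUPPORTED_CORPORA}"
        )
    return corpus_name


def part_file_name(index: int) -> str:
    """Build a file name such as part_001.parquet from a 1-based index."""
    return f"part_{index:03d}.parquet"
```

Validating the name before starting any download lets a misspelled corpus fail immediately instead of after a partial transfer.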