Benchmark

relai.benchmark.Benchmark(benchmark_id, samples=None)

Bases: ABC

Abstract base class for defining and managing benchmarks.

This class provides the foundational structure for benchmarks, enabling the download and iteration of samples. It ensures that every concrete benchmark implementation has a unique identifier and a collection of samples to be used as inputs for AI agents and evaluators.

Attributes:

  • benchmark_id (str): A unique identifier for this specific benchmark.
  • samples (list[RELAISample]): A list of RELAISample objects contained within this benchmark. Defaults to an empty list if not provided.

Parameters:

  • benchmark_id (str, required): The unique identifier for the benchmark.
  • samples (list[RELAISample], default None): A list of RELAISample objects to include in the benchmark. Defaults to an empty list.

__iter__()

Enables iteration over the samples within the benchmark as follows:

for sample in benchmark:
    # Process each sample
    pass

Yields:

  • RELAISample: Each RELAISample object contained in the benchmark.

__len__()

Returns the number of samples currently in the benchmark.

Returns:

  • int: The total count of RELAISample objects in the benchmark.
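For example, assuming benchmark is any concrete Benchmark instance:

num_samples = len(benchmark)  # total number of samples currently in the benchmark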

sample(n=1)

Returns n random samples from the benchmark, selected with replacement.

Because sampling is done with replacement, the returned list may contain duplicates; if n exceeds the total number of samples, duplicates are guaranteed.

Parameters:

  • n (int, default 1): The number of random samples to retrieve. Must be a positive integer.

Returns:

  • list[RELAISample]: A list containing n randomly selected RELAISample objects.

Raises:

  • ValueError: If n is less than or equal to 0.
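A minimal usage sketch, assuming benchmark is a concrete instance (such as one of the subclasses below):

few = benchmark.sample(n=5)  # five samples, drawn with replacement; duplicates are possible
one = benchmark.sample()     # defaults to n=1
try:
    benchmark.sample(n=0)
except ValueError:
    pass  # n must be a positive integer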

relai.benchmark.RELAIBenchmark(benchmark_id, field_name_mapping=None, field_value_transform=None, agent_input_fields=None, extra_fields=None)

Bases: Benchmark

A concrete implementation of Benchmark that downloads samples from the RELAI platform.

Attributes:

  • benchmark_id (str): The unique identifier (ID) of the RELAI benchmark to be loaded from the platform. You can find the benchmark ID in the metadata of the benchmark.
  • samples (list[RELAISample]): A list of RELAISample objects contained within this benchmark.

Parameters:

  • benchmark_id (str, required): The unique identifier for the RELAI benchmark. This ID is used to fetch the benchmark data from the RELAI platform.
  • field_name_mapping (dict[str, str], default None): A mapping from field names returned by the RELAI API to standardized field names expected by the evaluators. If a field name is not present in this mapping, it is used as-is. Defaults to an empty dictionary.
  • field_value_transform (dict[str, Callable], default None): A mapping from field names to transformation functions that convert field values from the RELAI API into the desired format. If a field name is not present in this mapping, the identity function is used (i.e., no transformation). Defaults to an empty dictionary.
  • agent_input_fields (list[str], default None): A list of field names to extract from each sample for the agent_inputs dictionary. These fields are provided to the AI agent. Defaults to an empty list.
  • extra_fields (list[str], default None): A list of field names to extract from each sample for the extras dictionary. These fields are also provided to the evaluators. Defaults to an empty list.

fetch_samples()

Downloads samples from the RELAI platform and populates the samples attribute.

This method fetches the benchmark data using the RELAI client and processes each entry to create RELAISample objects. The samples attribute is then updated with the newly fetched samples.
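A construction sketch; the benchmark ID, field names, and transform below are hypothetical placeholders, not real platform values:

from relai.benchmark import RELAIBenchmark

benchmark = RELAIBenchmark(
    benchmark_id="bench-123",                       # hypothetical ID taken from the benchmark metadata
    field_name_mapping={"response": "std_answer"},  # rename API fields for the evaluators (hypothetical names)
    field_value_transform={"question": str.strip},  # per-field value conversion; unmapped fields pass through
    agent_input_fields=["question"],                # copied into each sample's agent_inputs
    extra_fields=["std_answer"],                    # copied into each sample's extras
)
benchmark.fetch_samples()  # download from the RELAI platform and populate benchmark.samples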

relai.benchmark.RELAIQuestionAnsweringBenchmark(benchmark_id)

Bases: RELAIBenchmark

A concrete implementation of RELAIBenchmark for question-answering tasks. All samples in this benchmark have the following fields:

  • agent_inputs:
    • question: The question to be answered by the AI agent.
  • extras:
    • rubrics: A dictionary of rubrics for evaluating the answer.
    • std_answer: The standard answer to the question.

Parameters:

  • benchmark_id (str, required): The unique identifier for the RELAI question-answering benchmark. This ID is used to fetch the benchmark data from the RELAI platform.
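A sketch of a typical evaluation loop, assuming RELAISample exposes agent_inputs and extras as dictionaries; the benchmark ID and answer_question function are hypothetical placeholders:

from relai.benchmark import RELAIQuestionAnsweringBenchmark

qa_benchmark = RELAIQuestionAnsweringBenchmark(benchmark_id="qa-bench-123")  # hypothetical ID
qa_benchmark.fetch_samples()  # populate samples from the RELAI platform
for sample in qa_benchmark:
    question = sample.agent_inputs["question"]  # input for the AI agent
    answer = answer_question(question)          # placeholder for your agent call
    rubrics = sample.extras["rubrics"]          # rubrics for evaluating the answer
    std_answer = sample.extras["std_answer"]    # reference answer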

relai.benchmark.RELAISummarizationBenchmark(benchmark_id)

Bases: RELAIBenchmark

A concrete implementation of RELAIBenchmark for summarization tasks. All samples in this benchmark have the following fields:

  • agent_inputs:
    • source: The text to be summarized.
  • extras:
    • key_facts: A list of key facts extracted from the source.
    • style_rubrics: A dictionary of rubrics for evaluating the style of the summary.
    • format_rubrics: A dictionary of rubrics for evaluating the format of the summary.

Parameters:

  • benchmark_id (str, required): The unique identifier for the RELAI summarization benchmark. This ID is used to fetch the benchmark data from the RELAI platform.
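Under the same dictionary-access assumption, a sketch for the summarization fields; the benchmark ID is a placeholder:

from relai.benchmark import RELAISummarizationBenchmark

sum_benchmark = RELAISummarizationBenchmark(benchmark_id="sum-bench-123")  # hypothetical ID
sum_benchmark.fetch_samples()
for sample in sum_benchmark:
    source = sample.agent_inputs["source"]            # text to summarize
    key_facts = sample.extras["key_facts"]            # list of key facts from the source
    style_rubrics = sample.extras["style_rubrics"]    # rubrics for summary style
    format_rubrics = sample.extras["format_rubrics"]  # rubrics for summary format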

relai.benchmark.RELAIAnnotationBenchmark(benchmark_id)

Bases: RELAIBenchmark

A concrete implementation of RELAIBenchmark for benchmarks created from user annotations. All samples in this benchmark have the following fields:

  • agent_inputs:
    • The input(s) provided to the agent being evaluated.
  • extras:
    • previous_outputs: The previous outputs produced by the agent.
    • desired_outputs: The desired outputs as specified by the user.
    • feedback: The user feedback provided for the previous outputs.
    • liked: A boolean indicating whether the user liked the previous outputs.

Parameters:

  • benchmark_id (str, required): The unique identifier for the RELAI annotation benchmark. This ID is used to fetch the benchmark data from the RELAI platform.
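Likewise for annotation benchmarks, assuming the same dictionary-style access; the benchmark ID is a placeholder:

from relai.benchmark import RELAIAnnotationBenchmark

ann_benchmark = RELAIAnnotationBenchmark(benchmark_id="ann-bench-123")  # hypothetical ID
ann_benchmark.fetch_samples()
for sample in ann_benchmark:
    previous_outputs = sample.extras["previous_outputs"]  # earlier agent outputs
    desired_outputs = sample.extras["desired_outputs"]    # user-specified targets
    feedback = sample.extras["feedback"]                  # user feedback on the previous outputs
    liked = sample.extras["liked"]                        # bool: did the user like the outputs?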

relai.benchmark.CSVBenchmark(csv_file, agent_input_columns=None, extra_columns=None, benchmark_id=None)

Bases: Benchmark

A concrete implementation of Benchmark that loads samples from a CSV file.

Attributes:

  • benchmark_id (str): The unique identifier (ID) of the benchmark loaded from the CSV file. Defaults to the CSV file name.
  • samples (list[Sample]): A list of Sample objects contained within this benchmark.

Parameters:

  • csv_file (str, required): The path to the CSV file containing benchmark samples.
  • agent_input_columns (list[str], default None): A list of column names in the CSV file that should be used as inputs for the AI agent. Defaults to an empty list.
  • extra_columns (list[str], default None): A list of column names in the CSV file that could be used as inputs for evaluators. Defaults to an empty list.
  • benchmark_id (str, default None): A unique identifier for the benchmark. If not provided, it defaults to the name of the CSV file.
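A minimal sketch, assuming a local qa.csv whose question and reference_answer columns are hypothetical:

from relai.benchmark import CSVBenchmark

csv_benchmark = CSVBenchmark(
    csv_file="qa.csv",                   # hypothetical local file
    agent_input_columns=["question"],    # columns fed to the AI agent
    extra_columns=["reference_answer"],  # columns made available to evaluators
    benchmark_id="qa-csv-v1",            # optional; defaults to the CSV file name
)
print(len(csv_benchmark))  # number of rows loaded as samples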