Benchmark
relai.benchmark.Benchmark(benchmark_id, samples=None)
Bases: ABC
Abstract base class for defining and managing benchmarks.
This class provides a foundational structure for benchmarks, enabling the download and iteration of samples. It ensures that all concrete benchmark implementations have a unique identifier and a collection of samples to be used as inputs for AI agents and evaluators.
Attributes:

| Name | Type | Description |
|---|---|---|
| `benchmark_id` | `str` | A unique identifier for this specific benchmark. |
| `samples` | `list[RELAISample]` | A list of `RELAISample` objects in the benchmark. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `benchmark_id` | `str` | The unique identifier for the benchmark. | *required* |
| `samples` | `list[RELAISample]` | A list of `RELAISample` objects to populate the benchmark with. | `None` |
__iter__()
Enables iteration over the samples within the benchmark, as illustrated in the sketch below.
Yields:

| Name | Type | Description |
|---|---|---|
| `RELAISample` | `RELAISample` | Each `RELAISample` in the benchmark, one at a time. |
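A minimal usage sketch; `CSVBenchmark` is the concrete subclass documented further below, and `qa.csv` is a hypothetical file with a `question` column:

```python
from relai.benchmark import CSVBenchmark  # concrete Benchmark subclass (see below)

# "qa.csv" is a hypothetical CSV file with a "question" column.
benchmark = CSVBenchmark("qa.csv", agent_input_columns=["question"])

print(len(benchmark))     # __len__: total number of samples
for sample in benchmark:  # __iter__: yields one sample at a time
    print(sample)
```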
__len__()
Returns the number of samples currently in the benchmark.
Returns:

| Name | Type | Description |
|---|---|---|
| `int` | `int` | The total count of `RELAISample` objects in the benchmark. |
sample(n=1)
Returns `n` random samples from the benchmark, drawn with replacement. Because sampling is with replacement, the returned list may contain duplicates; if `n` exceeds the total number of samples, duplicates are guaranteed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n` | `int` | The number of random samples to retrieve. Must be a positive integer. Defaults to 1. | `1` |

Returns:

| Type | Description |
|---|---|
| `list[RELAISample]` | A list containing `n` randomly drawn samples. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `n` is not a positive integer. |
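Continuing the sketch above; the exact `ValueError` message is not documented here:

```python
subset = benchmark.sample(n=5)  # 5 draws with replacement; entries may repeat

try:
    benchmark.sample(n=0)       # not a positive integer
except ValueError as err:
    print(err)
```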
relai.benchmark.RELAIBenchmark(benchmark_id, field_name_mapping=None, field_value_transform=None, agent_input_fields=None, extra_fields=None)
Bases: Benchmark
A concrete implementation of Benchmark that downloads samples from the RELAI platform.
Attributes:

| Name | Type | Description |
|---|---|---|
| `benchmark_id` | `str` | The unique identifier (ID) of the RELAI benchmark to be loaded from the platform. You can find the benchmark ID in the metadata of the benchmark. |
| `samples` | `list[RELAISample]` | A list of `RELAISample` objects downloaded from the RELAI platform. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `benchmark_id` | `str` | The unique identifier for the RELAI benchmark. This ID is used to fetch the benchmark data from the RELAI platform. | *required* |
| `field_name_mapping` | `dict[str, str]` | A mapping from field names returned by the RELAI API to standardized field names expected by the evaluators. If a field name is not present in this mapping, it is used as-is. Defaults to an empty dictionary. | `None` |
| `field_value_transform` | `dict[str, Callable]` | A mapping from field names to transformation functions that convert field values from the RELAI API into the desired format. If a field name is not present in this mapping, the identity function is used (i.e., no transformation). Defaults to an empty dictionary. | `None` |
| `agent_input_fields` | `list[str]` | A list of field names to extract from each sample for the sample's `agent_inputs`. | `None` |
| `extra_fields` | `list[str]` | A list of field names to extract from each sample for the sample's `extras`. | `None` |
fetch_samples()
Downloads samples from the RELAI platform and populates the `samples` attribute.
This method fetches the benchmark data using the RELAI client and processes each sample to create `RELAISample` objects. The `samples` attribute is then updated with the newly fetched samples.
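A hedged construction sketch; the benchmark ID and all field names are placeholders, and whether the constructor fetches samples automatically is not stated here, so `fetch_samples()` is called explicitly:

```python
from relai.benchmark import RELAIBenchmark

benchmark = RELAIBenchmark(
    benchmark_id="bench_123",                        # placeholder; see benchmark metadata
    field_name_mapping={"query": "question"},        # rename API field -> evaluator field
    field_value_transform={"reference": str.strip},  # clean up a raw API value
    agent_input_fields=["question"],                 # goes into each sample's agent_inputs
    extra_fields=["reference"],                      # goes into each sample's extras
)
benchmark.fetch_samples()  # download and populate `samples`
print(len(benchmark))
```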
relai.benchmark.RELAIQuestionAnsweringBenchmark(benchmark_id)
Bases: RELAIBenchmark
A concrete implementation of RELAIBenchmark for question-answering tasks.
All samples in this benchmark have the following fields:

- `agent_inputs`:
    - `question`: The question to be answered by the AI agent.
- `extras`:
    - `rubrics`: A dictionary of rubrics for evaluating the answer.
    - `std_answer`: The standard answer to the question.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `benchmark_id` | `str` | The unique identifier for the RELAI question-answering benchmark. This ID is used to fetch the benchmark data from the RELAI platform. | *required* |
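A short consumption sketch; dictionary-style access to `agent_inputs` and `extras` is an assumption about `RELAISample`, and `my_agent` stands in for the agent under test:

```python
from relai.benchmark import RELAIQuestionAnsweringBenchmark

qa_bench = RELAIQuestionAnsweringBenchmark(benchmark_id="qa_bench_123")  # placeholder ID

for sample in qa_bench:
    question = sample.agent_inputs["question"]  # assumed access pattern
    answer = my_agent(question)                 # your agent under test
    rubrics = sample.extras["rubrics"]          # hand these to an evaluator
    std_answer = sample.extras["std_answer"]
```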
relai.benchmark.RELAISummarizationBenchmark(benchmark_id)
Bases: RELAIBenchmark
A concrete implementation of RELAIBenchmark for summarization tasks.
All samples in this benchmark have the following fields:

- `agent_inputs`:
    - `source`: The text to be summarized.
- `extras`:
    - `key_facts`: A list of key facts extracted from the source.
    - `style_rubrics`: A dictionary of rubrics for evaluating the style of the summary.
    - `format_rubrics`: A dictionary of rubrics for evaluating the format of the summary.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `benchmark_id` | `str` | The unique identifier for the RELAI summarization benchmark. This ID is used to fetch the benchmark data from the RELAI platform. | *required* |
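Under the same assumed access pattern as above, a naive check of how many key facts a summary mentions (real scoring would go through an evaluator):

```python
from relai.benchmark import RELAISummarizationBenchmark

sum_bench = RELAISummarizationBenchmark(benchmark_id="sum_bench_123")  # placeholder ID

for sample in sum_bench:
    summary = my_agent(sample.agent_inputs["source"])  # agent under test
    key_facts = sample.extras["key_facts"]
    covered = [fact for fact in key_facts if fact in summary]  # crude substring check
    print(f"{len(covered)}/{len(key_facts)} key facts mentioned")
```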
relai.benchmark.RELAIAnnotationBenchmark(benchmark_id)
Bases: RELAIBenchmark
A concrete implementation of RELAIBenchmark for benchmarks created from user annotations.
All samples in this benchmark have the following fields:

- `agent_inputs`: The input(s) provided to the agent being evaluated.
- `extras`:
    - `previous_outputs`: The previous outputs produced by the agent.
    - `desired_outputs`: The desired outputs as specified by the user.
    - `feedback`: The user feedback provided for the previous outputs.
    - `liked`: A boolean indicating whether the user liked the previous outputs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `benchmark_id` | `str` | The unique identifier for the RELAI annotation benchmark. This ID is used to fetch the benchmark data from the RELAI platform. | *required* |
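A sketch of splitting annotation samples by user reaction, again assuming dictionary-style access to `extras`:

```python
from relai.benchmark import RELAIAnnotationBenchmark

ann_bench = RELAIAnnotationBenchmark(benchmark_id="ann_bench_123")  # placeholder ID

disliked = [s for s in ann_bench if not s.extras["liked"]]  # assumed access pattern
for s in disliked:
    # Pair the user's feedback with what they wanted instead.
    print(s.extras["feedback"], "->", s.extras["desired_outputs"])
```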
relai.benchmark.CSVBenchmark(csv_file, agent_input_columns=None, extra_columns=None, benchmark_id=None)
Bases: Benchmark
A concrete implementation of Benchmark that loads samples from a CSV file.
Attributes:

| Name | Type | Description |
|---|---|---|
| `benchmark_id` | `str` | The unique identifier (ID) of the benchmark loaded from the CSV file. Defaults to the CSV file name. |
| `samples` | `list[Sample]` | A list of `Sample` objects loaded from the CSV file. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `csv_file` | `str` | The path to the CSV file containing benchmark samples. | *required* |
| `agent_input_columns` | `list[str]` | A list of column names in the CSV file that should be used as inputs for the AI agent. Defaults to an empty list. | `None` |
| `extra_columns` | `list[str]` | A list of column names in the CSV file that could be used as inputs for evaluators. Defaults to an empty list. | `None` |
| `benchmark_id` | `str` | A unique identifier for the benchmark. If not provided, it defaults to the name of the CSV file. | `None` |
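A self-contained sketch that writes a tiny CSV and loads it; only the constructor documented above is relied on, and the column names are arbitrary:

```python
import csv

from relai.benchmark import CSVBenchmark

# Build a tiny CSV purely for illustration.
rows = [
    {"question": "What is 2 + 2?", "reference": "4"},
    {"question": "What is the capital of France?", "reference": "Paris"},
]
with open("toy.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "reference"])
    writer.writeheader()
    writer.writerows(rows)

benchmark = CSVBenchmark(
    "toy.csv",
    agent_input_columns=["question"],  # fed to the agent
    extra_columns=["reference"],       # available to evaluators
)  # benchmark_id defaults to the file name
print(len(benchmark))  # expected: 2
```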