Datasets

Quick Summary

In deepeval, an evaluation dataset, or just dataset, is a collection of LLMTestCases and/or Goldens. There are two approaches to evaluating datasets in deepeval:

using @pytest.mark.parametrize and assert_test
using evaluate

Create An Evaluation Dataset

An EvaluationDataset in deepeval is simply a collection of LLMTestCases and/or Goldens.

info

A Golden is very similar to an LLMTestCase but more flexible, since it does not require an actual_output at initialization. On the flip side, whilst a test case is always ready for evaluation, a golden isn't.

With Test Cases

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...")
second_test_case = LLMTestCase(input="...", actual_output="...")

test_cases = [first_test_case, second_test_case]

dataset = EvaluationDataset(test_cases=test_cases)

You can also append a test case to an EvaluationDataset, either through the test_cases instance variable or with the add_test_case method:

...

dataset.test_cases.append(test_case)
# or
dataset.add_test_case(test_case)

With Goldens

You should opt to initialize EvaluationDatasets with goldens if you're looking to generate LLM outputs at evaluation time. This usually means your original dataset does not contain precomputed outputs, but only the inputs you want to evaluate your LLM (application) on.

from deepeval.dataset import EvaluationDataset, Golden

first_golden = Golden(input="...")
second_golden = Golden(input="...")

goldens = [first_golden, second_golden]

dataset = EvaluationDataset(goldens=goldens)

note

A Golden and an LLMTestCase have almost identical class signatures, so technically you can also supply other parameters such as the actual_output when creating a Golden.
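The relationship between goldens and test cases can be sketched in plain Python. The example below is only an illustration under stated assumptions: GoldenSketch and CaseSketch are simplified stand-ins written for this example (they are not deepeval's Golden or LLMTestCase classes), and generate_answer is a hypothetical placeholder for your LLM application. It shows the general idea of filling in actual_output at evaluation time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GoldenSketch:
    """Simplified stand-in for a golden: an input with no actual_output yet."""
    input: str


@dataclass
class CaseSketch:
    """Simplified stand-in for a test case: input plus a generated output."""
    input: str
    actual_output: str


def generate_answer(prompt: str) -> str:
    """Hypothetical placeholder for a call to your LLM application."""
    return f"(model answer to: {prompt})"


def goldens_to_cases(goldens: List[GoldenSketch]) -> List[CaseSketch]:
    """Produce one evaluation-ready case per golden by generating its output."""
    return [
        CaseSketch(input=g.input, actual_output=generate_answer(g.input))
        for g in goldens
    ]


goldens = [GoldenSketch(input="first question"), GoldenSketch(input="second question")]
cases = goldens_to_cases(goldens)
for case in cases:
    print(case.input, "->", case.actual_output)
```

With the real library, the same loop would read inputs from dataset.goldens and build LLMTestCase objects from them.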

Generate An Evaluation Dataset

deepeval lets you generate synthetic datasets from documents locally on your machine. This is especially helpful if you don't have an evaluation dataset prepared beforehand.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    max_goldens_per_document=2
)

Under the hood, an EvaluationDataset generates goldens using deepeval's Synthesizer. You can customize the Synthesizer used to generate goldens within an EvaluationDataset.

from deepeval.dataset import EvaluationDataset
from deepeval.synthesizer import Synthesizer
...

# Use gpt-3.5-turbo instead
synthesizer = Synthesizer(model="gpt-3.5-turbo")
dataset.generate_goldens_from_docs(
    synthesizer=synthesizer,
    document_paths=['example.pdf'],
    max_goldens_per_document=2
)

info

deepeval's Synthesizer uses a series of evolution techniques to make generated goldens more complex and closer to human-prepared data. For more information on how deepeval's Synthesizer works, visit the synthesizer section.

Load an Existing Dataset

deepeval supports loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as test cases.

From Confident AI

You can load an entire dataset stored on Confident AI's cloud in one line of code.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

Did You Know?

You can create, annotate, and comment on datasets on Confident AI. You can also upload datasets in CSV format, or push synthetic datasets created in deepeval to Confident AI in one line of code.

For more information, visit the Confident AI datasets section.

From JSON

You can add test cases into your EvaluationDataset by supplying a file_path to your .json file. Your .json file should contain an array of objects (or list of dictionaries).

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_test_cases_from_json_file(
    # file_path is the absolute path to your .json file
    file_path="example.json",
    input_key_name="query",
    actual_output_key_name="actual_output",
    expected_output_key_name="expected_output",
    context_key_name="context",
    retrieval_context_key_name="retrieval_context",
)
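As a rough illustration of the file shape described above, the following sketch writes a small example.json whose keys match the key-name arguments in the call. The record contents are made up for this example, and the file is written to a temporary directory; only the standard library is used.

```python
import json
import tempfile
from pathlib import Path

# Made-up records for illustration; the key names mirror the *_key_name
# arguments passed to add_test_cases_from_json_file above.
records = [
    {
        "query": "What is the refund window?",
        "actual_output": "You can request a refund within 30 days.",
        "expected_output": "Refunds are accepted within 30 days of purchase.",
        "context": ["Refunds are accepted within 30 days of purchase."],
        "retrieval_context": ["Policy: refunds within 30 days."],
    },
    {
        "query": "Do you ship internationally?",
        "actual_output": "Yes, we ship to most countries.",
        "expected_output": "International shipping is available.",
        "context": ["International shipping is available."],
        "retrieval_context": ["Shipping: worldwide delivery."],
    },
]

# Write to a temporary directory so the sketch does not touch project files.
json_path = Path(tempfile.mkdtemp()) / "example.json"
json_path.write_text(json.dumps(records, indent=2), encoding="utf-8")

# Reading the file back confirms the top-level value is an array of objects.
loaded = json.loads(json_path.read_text(encoding="utf-8"))
print(type(loaded).__name__, len(loaded))
```

The resulting path (json_path) is what you would pass as file_path.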

From CSV

You can add test cases into your EvaluationDataset by supplying a file_path to your .csv file. Your .csv file should contain rows that can be mapped into LLMTestCases through their column names. Remember, context should be a list of strings. For CSV files, this means you have to supply a context_col_delimiter argument to tell deepeval how to split each context cell into a list of strings (and likewise a retrieval_context_col_delimiter for retrieval_context).

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_test_cases_from_csv_file(
    # file_path is the absolute path to your .csv file
    file_path="example.csv",
    input_col_name="query",
    actual_output_col_name="actual_output",
    expected_output_col_name="expected_output",
    context_col_name="context",
    context_col_delimiter=";",
    retrieval_context_col_name="retrieval_context",
    retrieval_context_col_delimiter=";"
)
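To make the delimiter idea concrete, here is a minimal sketch that writes a one-row example.csv in which the context and retrieval_context cells each hold several strings joined by ";". The row contents are invented for illustration, and the final split only demonstrates what a delimiter-based split produces; it is not a claim about deepeval's internal implementation.

```python
import csv
import tempfile
from pathlib import Path

DELIMITER = ";"  # same character passed as context_col_delimiter above

# Invented row; column names mirror the *_col_name arguments above.
rows = [
    {
        "query": "What is the refund window?",
        "actual_output": "You can request a refund within 30 days.",
        "expected_output": "Refunds are accepted within 30 days.",
        "context": DELIMITER.join(["Refunds within 30 days.", "Receipt required."]),
        "retrieval_context": DELIMITER.join(["Policy section 2."]),
    }
]

# Write to a temporary directory so the sketch does not touch project files.
csv_path = Path(tempfile.mkdtemp()) / "example.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Read the row back and split the context cell into a list of strings.
with csv_path.open(newline="", encoding="utf-8") as f:
    first_row = next(csv.DictReader(f))

context = first_row["context"].split(DELIMITER)
print(context)
```

Pick a delimiter that never appears inside an individual context string, otherwise a single passage would be split in two.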

note

Since expected_output, context, and retrieval_context are optional parameters for an LLMTestCase, these fields are similarly optional when adding test cases from an existing dataset.

Evaluate Your Dataset With Pytest

Before we begin, we highly recommend logging into Confident AI to keep track of all evaluation results on the cloud:

deepeval login

deepeval utilizes the @pytest.mark.parametrize decorator to loop through entire datasets.

test_bulk.py
import pytest
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset


dataset = EvaluationDataset(test_cases=[...])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])


@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
    print("Test finished!")

info

Iterating through a dataset object implicitly loops through its test cases. To iterate through goldens, access dataset.goldens instead.

To run several test cases in parallel, use the optional -n flag followed by the number of processes to use when executing deepeval test run:

deepeval test run test_bulk.py -n 3

Evaluate Your Dataset Without Pytest

Alternatively, you can use deepeval's evaluate function to evaluate datasets. This approach avoids the CLI, but does not allow for parallel test execution.

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(test_cases=[...])
hallucination_metric = HallucinationMetric(threshold=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)

dataset.evaluate([hallucination_metric, answer_relevancy_metric])

# You can also call the evaluate() function directly
evaluate(dataset, [hallucination_metric, answer_relevancy_metric])
