Synthesizer
Quick Summary
deepeval
offers a data Synthesizer
for anyone to easily generate evaluation datasets from scratch. The Synthesizer
class is a synthetic data generator that first uses an LLM to generate a series of input
s, before evolving each input
to make them more complex and realistic. These evolved inputs are then used to create a list of synthetic Golden
s, which makes up your synthetic EvaluationDataset
.
deepeval
's Synthesizer
uses the data evolution method to generate large volumes of data across various complexity levels to make synthetic data more realistic. This method was originally introduced by the developers of Evol-Instruct and WizardML.
For those interested, here is a great article on how deepeval
's synthesizer was built.
Creating An Synthesizer
deepeval
's Synthesizer
can be used as a standalone or within an EvaluationDataset
. To begin, create a Synthesizer
:
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
There are two optional parameters when creating a Synthesizer
:
- [Optional]
model
: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of typeDeepEvalBaseLLM
. Defaulted togpt-4o
. - [Optional]
multithreading
: a boolean which when set toTrue
, enables concurrent generation of goldens. Defaulted toTrue
.
Using Synthesizer As A Standalone
There are 2 approaches a deepeval
's Synthesizer
can generate synthetic Golden
s:
- Generating synthetic
Golden
s using context extracted from documents. - Generating synthetic
Golden
s from a list of provided context.
Generating From Documents
To generate synthetic Golden
s from documents, simply provide a list of document paths:
The generate_goldens_from_docs
method employs a token-based text splitter to manage document chunking, meaning the chunk_size
and chunk_overlap
parameters do not guarantee exact context sizes. This approach is designed to ensure meaningful and coherent context extraction, but might lead to variations in the expected size of each context
.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
document_paths=['example.txt', 'example.docx', 'example.pdf'],
max_goldens_per_document=2
)
There are one mandatory and six optional parameters when using the generate_goldens_from_docs
method:
document_paths
: a list strings, representing the path to the documents from which contexts will be extracted from. Supported documents types include:.txt
,.docx
, and.pdf
.- [Optional]
include_expected_output
: a boolean which when set toTrue
, will additionally generate anexpected_output
for each syntheticGolden
. Defaulted toFalse
. - [Optional]
max_goldens_per_document
: the maximum number of golden data points to be generated for each document. Defaulted to 5. - [Optional]
chunk_size
: specifies the size of text chunks (in characters) to be considered for context extraction within each document. Defaulted to 1024. - [Optional]
chunk_overlap
: an int that determines the overlap size between consecutive text chunks during context extraction. Defaulted to 0. - [Optional]
num_evolutions
: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1. - [Optional]
enable_breadth_evolve
: a boolean which when set toTrue
, introduces a wider variety of context modifications, enhancing the dataset's diversity. Defaulted toFalse
.
deepeval
uses an LLM and an embedder model during context and query generation during data synthesization. You can choose to use Azure's OpenAI models if you wish by running the following commands in the terminal:
deepeval set-azure-openai --openai-endpoint=<endpoint> \
--openai-api-key=<api_key> \
--deployment-name=<deployment_name> \
--openai-api-version=<openai_api_version> \
--model-version=<model_version>
Then, run this to set the Azure OpenAI embedder:
deepeval set-azure-openai-embedding --embedding_deployment-name=<embedding_deployment_name>
Generating From Provided Contexts
deepeval
also allows you to generate synthetic Goldens
from a manually provided a list of context instead of directly generating from your documents.
This is especially helpful if you already have an indexed and/or embedded knowledge base. For example, if you already have documents parsed and stored in an existing vector database, you may consider handling the logic to retrieve text chunks yourself.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
synthesizer.generate_goldens(
# Provide a list of context for synthetic data generation
contexts=[
["The Earth revolves around the Sun.", "Planets are celestial bodies."],
["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
]
)
There are one mandatory and four optional parameters when using the generate_goldens
method:
contexts
: a list of context, where each context is itself a list of strings, ideally sharing a common theme or subject area.- [Optional]
include_expected_output
: a boolean which when set toTrue
, will additionally generate anexpected_output
for each syntheticGolden
. Defaulted toFalse
. - [Optional]
max_goldens_per_context
: the maximum number of golden data points to be generated from each context. Adjusting this parameter can influence the size of the resulting dataset. Defaulted to 2. - [Optional]
num_evolutions
: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1. - [Optional]
enable_breadth_evolve
: a boolean indicating whether to enable breadth evolution strategies during data generation. When set to True, it introduces a wider variety of context modifications, enhancing the dataset's diversity. Defaulted toFalse
.
You can also optionally generate expected_output
s alongside each golden, but you should always aim to cross-check any generated expected output.
Saving Generated Goldens
To not accidentally lose any generated synthetic Golden
, you can use the save_as()
method:
synthesizer.save_as(
file_type='json', # or 'csv'
directory="./synthetic_data"
)
Using Synthesizer Within An Evaluation Dataset
An EvaluationDataset
also has the generate_goldens_from_docs
and generate_goldens
methods, which under the hood is powered by the Synthesizer
's implementation.
Except for an additional option to accept a custom Synthesizer
as argument, the generate_goldens_from_docs
and generate_goldens
methods in an EvaluationDataset
accepts the exact same arguments as those on a Synthesizer
.
You can optionally specify a custom Synthesizer
when calling generate_goldens_from_docs
and generate_goldens
through the EvaluationDataset
interface if for example, you wish to use a custom LLM to generate synthetic data. If no Synthesizer
is provided, the default Synthesizer
configuration is used.
To begin, optionally create a custom Synthesizer
:
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer(model="gpt-3.5-turbo")
Then, provide it as an argument to generate_goldens_from_docs
:
from deepeval.dataset import EvaluationDataset
...
dataset = EvaluationDataset()
dataset.generate_goldens_from_docs(
synthesizer=synthesizer,
document_paths=['example.pdf'],
)
Or, to generate_goldens
:
...
dataset.generate_goldens(
synthesizer=synthesizer,
contexts=[
["The Earth revolves around the Sun.", "Planets are celestial bodies."],
["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
]
)
Lastly, don't forget to call save_as()
to perserve any generated synthetic Golden
:
saved_path = dataset.save_as(
file_type='json', # or 'csv'
directory="./synthetic_data"
)
The save_as()
method returns a string to the path the dataset was saved to, just in case you need to use it in code later on.