LlamaIndex
Quick Summary
LlamaIndex is a data framework for LLMs that facilitates the ingestion of data from various sources such as APIs, databases, and PDFs, and indexes it for later retrieval in RAG-based LLM applications.
Evaluating LlamaIndex
RAG applications built with LlamaIndex can be easily evaluated within deepeval. Let's use this RAG application built with LlamaIndex as an example:
from llama_index import VectorStoreIndex, SimpleDirectoryReader
# Consult the LlamaIndex docs if you're unsure what this does
documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(documents)
rag_application = index.as_query_engine()
You can then query your RAG application and evaluate each response using deepeval's metrics:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
...

# An example input to your RAG application
user_input = "What is LlamaIndex?"

# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = rag_application.query(user_input)

# Process the response object to get the output string
# and retrieved nodes
if response_object is not None:
    actual_output = response_object.response
    retrieval_context = [node.get_content() for node in response_object.source_nodes]

# Create a test case and metric as usual
test_case = LLMTestCase(
    input=user_input,
    actual_output=actual_output,
    retrieval_context=retrieval_context
)
answer_relevancy_metric = AnswerRelevancyMetric()

# Evaluate
answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)
You can also extract all necessary outputs and retrieval contexts for each given input to your LlamaIndex application to create an EvaluationDataset to evaluate test cases in bulk.
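For example, you could build test cases in a loop and evaluate them in one go. This is a minimal sketch: user_inputs is a hypothetical list of queries, and it assumes the rag_application query engine from above together with deepeval's evaluate() function:
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
...

# Hypothetical list of inputs you want to evaluate in bulk
user_inputs = ["What is LlamaIndex?", "How do I create an index?"]

test_cases = []
for user_input in user_inputs:
    response_object = rag_application.query(user_input)
    if response_object is not None:
        test_cases.append(
            LLMTestCase(
                input=user_input,
                actual_output=response_object.response,
                retrieval_context=[node.get_content() for node in response_object.source_nodes]
            )
        )

dataset = EvaluationDataset(test_cases=test_cases)

# Evaluate every test case in the dataset against the metric
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])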
Unit Testing LlamaIndex
Unit testing LlamaIndex is as simple as defining an EvaluationDataset and generating actual_outputs and retrieval_contexts at evaluation time. Building upon the previous example:
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset, Golden

example_golden = Golden(input="What is LlamaIndex?")

dataset = EvaluationDataset(goldens=[example_golden])

@pytest.mark.parametrize(
    "golden",
    dataset.goldens,
)
def test_rag(golden: Golden):
    # LlamaIndex returns a response object that contains
    # both the output string and retrieved nodes
    response_object = rag_application.query(golden.input)

    # Process the response object to get the output string
    # and retrieved nodes
    if response_object is not None:
        actual_output = response_object.response
        retrieval_context = [node.get_content() for node in response_object.source_nodes]

    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [answer_relevancy_metric])
Here, rag_application is simply the query engine from LlamaIndex, which you'll have to import from the appropriate module.
In this example, we initialized an EvaluationDataset with a Golden instead of an LLMTestCase. This is because we're generating actual_outputs and retrieval_contexts at evaluation time, meaning we cannot initialize with test cases since an LLMTestCase requires an actual_output to be created. Remember, a Golden does not require an actual_output, so whilst test cases are always ready for evaluation, a golden isn't.
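To make the distinction concrete, here is a minimal sketch contrasting the two (the actual_output and retrieval_context strings below are purely illustrative):
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase

# A golden only needs an input, so it can exist before your
# application has produced anything
golden = Golden(input="What is LlamaIndex?")

# A test case can only be created once an actual_output exists,
# e.g. after querying your LlamaIndex application at evaluation time
test_case = LLMTestCase(
    input=golden.input,
    actual_output="LlamaIndex is a data framework for LLM applications.",  # illustrative output
    retrieval_context=["LlamaIndex helps ingest and index data for RAG."]  # illustrative context
)
The test file itself can then be run through deepeval's Pytest integration, for example with deepeval test run test_rag.py (the filename here is hypothetical).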
Using DeepEval for LlamaIndex
In LlamaIndex, there are entities known as evaluators that evaluate the responses of LlamaIndex applications. Continuing from the previous example, here's an alternative way to make use of the AnswerRelevancyMetric through deepeval's LlamaIndex evaluators:
from deepeval.integrations.llama_index import DeepEvalAnswerRelevancyEvaluator
...
# An example input to your RAG application
user_input = "What is LlamaIndex?"
# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = rag_application.query(user_input)
evaluator = DeepEvalAnswerRelevancyEvaluator()
evaluation_result = evaluator.evaluate_response(
    query=user_input,
    response=response_object
)
print(evaluation_result)
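Besides printing the whole object, LlamaIndex's EvaluationResult lets you inspect individual fields. A minimal sketch, assuming the standard passing, score, and feedback attributes on that object:
...
# The pass/fail verdict, the numeric score, and the reason
# deepeval generated for that score
print(evaluation_result.passing)
print(evaluation_result.score)
print(evaluation_result.feedback)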
In LlamaIndex's documentation, you might see examples where the evaluate() method is called on an evaluator instead of the evaluate_response() method. While both are correct, you should ALWAYS use the evaluate_response() method when using deepeval's LlamaIndex evaluators.
Answer Relevancy
The DeepEvalAnswerRelevancyEvaluator uses deepeval's AnswerRelevancyMetric for evaluation.
from deepeval.integrations.llama_index import DeepEvalAnswerRelevancyEvaluator
evaluator = DeepEvalAnswerRelevancyEvaluator(
    # Optional. A float representing the minimum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
Faithfulness
The DeepEvalFaithfulnessEvaluator uses deepeval's FaithfulnessMetric for evaluation.
from deepeval.integrations.llama_index import DeepEvalFaithfulnessEvaluator
evaluator = DeepEvalFaithfulnessEvaluator(
    # Optional. A float representing the minimum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
Contextual Relevancy
The DeepEvalContextualRelevancyEvaluator uses deepeval's ContextualRelevancyMetric for evaluation.
from deepeval.integrations.llama_index import DeepEvalContextualRelevancyEvaluator
evaluator = DeepEvalContextualRelevancyEvaluator(
    # Optional. A float representing the minimum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
Summarization
The DeepEvalSummarizationEvaluator uses deepeval's SummarizationMetric for evaluation.
from deepeval.integrations.llama_index import DeepEvalSummarizationEvaluator
evaluator = DeepEvalSummarizationEvaluator(
    # Optional. A float representing the minimum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
Bias
The DeepEvalBiasEvaluator uses deepeval's BiasMetric for evaluation.
from deepeval.integrations.llama_index import DeepEvalBiasEvaluator
evaluator = DeepEvalBiasEvaluator(
    # Optional. A float representing the maximum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
Toxicity
The DeepEvalToxicityEvaluator uses deepeval's ToxicityMetric for evaluation.
from deepeval.integrations.llama_index import DeepEvalToxicityEvaluator
evaluator = DeepEvalToxicityEvaluator(
    # Optional. A float representing the maximum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
Metrics vs Evaluators
While both deepeval's metrics and evaluators yield the same result, deepeval is a full evaluation suite built specifically for LLM evaluation. Naturally, deepeval encourages you to follow evaluation best practices, something the evaluators abstraction alone cannot accomplish.
So while both metrics and evaluators can be used for one-off, standalone evaluations, metrics:
- can be combined to evaluate multiple criteria asynchronously (see the sketch after this list)
- can be used to evaluate entire EvaluationDatasets
- can leverage deepeval's native Pytest integration to unit test LlamaIndex applications in CI/CD pipelines
- can be used with any framework, meaning you are not vendor locked into LlamaIndex
- cover a wider range of evaluation criteria/use cases
- automatically integrate with Confident AI, which offers evaluation analysis, evaluation debugging, dataset management, and real-time evaluations in production
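For instance, several metrics can be evaluated together in a single call. This is a minimal sketch, reusing the test_case from the earlier examples and deepeval's evaluate() function:
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
)
...

# Multiple criteria evaluated against the same test case;
# deepeval runs the metrics concurrently where possible
evaluate(
    test_cases=[test_case],
    metrics=[
        AnswerRelevancyMetric(),
        FaithfulnessMetric(),
        ContextualRelevancyMetric(),
    ],
)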
The only upside of using deepeval's LlamaIndex evaluators instead of metrics is that an evaluator automatically extracts the retrieval_context from a LlamaIndex response. However, as shown in the previous examples, manually extracting the retrieval_context from a LlamaIndex response is extremely straightforward:
...

# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = rag_application.query(user_input)

# Process the response object to get the output string
# and retrieved nodes
if response_object is not None:
    actual_output = response_object.response
    retrieval_context = [node.get_content() for node in response_object.source_nodes]