HumanEval
The HumanEval benchmark is a dataset designed to evaluate an LLM’s code generation capabilities. The benchmark consists of 164 hand-crafted programming challenges comparable to simple software interview questions. For more information, visit the HumanEval GitHub page.
HumanEval assesses the functional correctness of generated code instead of merely measuring textual similarity to a reference solution.
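Each problem supplies a function signature and docstring as the prompt, and a completion is judged solely by whether it passes the task's unit tests. The sketch below is a paraphrased illustration in the style of the HAS_CLOSE_ELEMENTS task, not the exact dataset entry:

```python
from typing import List

# Prompt-style signature and docstring, paraphrased from the
# HAS_CLOSE_ELEMENTS task (the exact dataset entry differs slightly)
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer than threshold."""
    # A model-generated completion is scored only by whether it passes
    # the unit tests, not by how closely it matches a reference solution
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1:]
    )

# Functional-correctness checks, in the spirit of HumanEval's unit tests
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```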
Arguments
There are two optional arguments when using the HumanEval benchmark:
- [Optional] tasks: a list of HumanEvalTask enums specifying which of the 164 programming tasks to evaluate the language model on. By default, this is set to all tasks. A detailed description of the HumanEvalTask enum can be found in the HumanEval Tasks section below.
- [Optional] n: the number of code generation samples per task used for evaluation with the pass@k metric. This is set to 200 by default. A more detailed description of the pass@k metric and the n parameter can be found in the Pass@k Metric section below.
By default, each task is evaluated 200 times, as specified by n, the number of code generation samples. This means your LLM is invoked 200 times on the same prompt by default.
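As a quick illustration of these defaults (a minimal sketch):

```python
from deepeval.benchmarks import HumanEval

# No arguments: all 164 tasks are evaluated with n=200 samples per task
benchmark = HumanEval()

# Lowering n (or restricting tasks) reduces the number of LLM invocations
cheaper_benchmark = HumanEval(n=20)
```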
Example
The code below evaluates a custom GPT-4 model (click here to learn how to use ANY custom LLM) and assesses its performance on the HAS_CLOSE_ELEMENTS and SORT_NUMBERS tasks using 100 code generation samples.
```python
from deepeval.benchmarks import HumanEval
from deepeval.benchmarks.tasks import HumanEvalTask

# Define benchmark with specific tasks and number of code generations
benchmark = HumanEval(
    tasks=[HumanEvalTask.HAS_CLOSE_ELEMENTS, HumanEvalTask.SORT_NUMBERS],
    n=100
)

# Replace 'gpt_4' with your own custom model
benchmark.evaluate(model=gpt_4, k=10)
print(benchmark.overall_score)
```
You must define a generate_samples method in your custom model to perform HumanEval evaluation. In addition, when calling evaluate, you must supply k, the number of top samples chosen for the pass@k metric.
```python
from typing import List
from langchain_core.messages import HumanMessage
from deepeval.models import DeepEvalBaseLLM

# Define a custom GPT-4 model class
class GPT4Model(DeepEvalBaseLLM):
    ...

    def generate_samples(
        self, prompt: str, n: int, temperature: float
    ) -> List[str]:
        # Assumes self.load_model() returns a LangChain chat model;
        # temporarily override its sampling parameters
        chat_model = self.load_model()
        og_parameters = {"n": chat_model.n, "temp": chat_model.temperature}
        chat_model.n = n
        chat_model.temperature = temperature
        # Generate n completions for the same prompt
        generations = chat_model._generate([HumanMessage(prompt)]).generations
        completions = [r.text for r in generations]
        # Restore the original sampling parameters
        chat_model.n = og_parameters["n"]
        chat_model.temperature = og_parameters["temp"]
        return completions

    ...

gpt_4 = GPT4Model()
```
The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is based on the pass@k metric: for each problem, it estimates the probability that at least one of k samples (drawn from the n generated samples) passes all of the problem's unit tests (an average of 7.7 test cases per problem), and this estimate is averaged over all evaluated problems.
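As a concrete (hypothetical) illustration of how per-problem estimates roll up into the overall score:

```python
# Hypothetical per-problem pass@k estimates for the two tasks evaluated above
pass_at_k_per_problem = {
    "HAS_CLOSE_ELEMENTS": 0.90,
    "SORT_NUMBERS": 0.70,
}

# The overall score is the mean of the per-problem estimates
overall_score = sum(pass_at_k_per_problem.values()) / len(pass_at_k_per_problem)
print(round(overall_score, 3))  # 0.8
```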
Pass@k Metric
The pass@k metric evaluates the functional correctness of generated code samples by focusing on whether at least one of the top k samples passes the predefined unit tests. It calculates this probability as the complement of the probability that all k chosen samples are incorrect, using the formula:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{C(n-c,\ k)}{C(n,\ k)}\right]$$

where C represents combinations, n is the total number of samples, c is the number of correct samples, and k is the number of top samples chosen.
Using n helps ensure that the evaluation metric considers the full range of generated outputs, thereby reducing the risk of bias that can arise from only considering a small, possibly non-representative set of samples.
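For reference, the unbiased estimator from the original HumanEval paper can be written in a few lines. This is an illustrative sketch of the formula above, not part of deepeval's API:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for a single problem: 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable product."""
    if n - c < k:
        # Fewer than k incorrect samples: a correct sample is always chosen
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Worked example: n=100 samples, c=20 correct, k=10
# 1 - C(80, 10) / C(100, 10) is approximately 0.905
print(round(pass_at_k(100, 20, 10), 3))
```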
HumanEval Tasks
The HumanEvalTask enum covers the range of programming tasks in the HumanEval benchmark.

```python
from deepeval.benchmarks.tasks import HumanEvalTask

human_eval_tasks = [HumanEvalTask.HAS_CLOSE_ELEMENTS]
```

Below is the comprehensive list of all available tasks (a sketch for selecting tasks programmatically follows the list):
HAS_CLOSE_ELEMENTS
SEPARATE_PAREN_GROUPS
TRUNCATE_NUMBER
BELOW_ZERO
MEAN_ABSOLUTE_DEVIATION
INTERSPERSE
PARSE_NESTED_PARENS
FILTER_BY_SUBSTRING
SUM_PRODUCT
ROLLING_MAX
MAKE_PALINDROME
STRING_XOR
LONGEST
GREATEST_COMMON_DIVISOR
ALL_PREFIXES
STRING_SEQUENCE
COUNT_DISTINCT_CHARACTERS
PARSE_MUSIC
HOW_MANY_TIMES
SORT_NUMBERS
FIND_CLOSEST_ELEMENTS
RESCALE_TO_UNIT
FILTER_INTEGERS
STRLEN
LARGEST_DIVISOR
FACTORIZE
REMOVE_DUPLICATES
FLIP_CASE
CONCATENATE
FILTER_BY_PREFIX
GET_POSITIVE
IS_PRIME
FIND_ZERO
SORT_THIRD
UNIQUE
MAX_ELEMENT
FIZZ_BUZZ
SORT_EVEN
DECODE_CYCLIC
PRIME_FIB
TRIPLES_SUM_TO_ZERO
CAR_RACE_COLLISION
INCR_LIST
PAIRS_SUM_TO_ZERO
CHANGE_BASE
TRIANGLE_AREA
FIB4
MEDIAN
IS_PALINDROME
MODP
DECODE_SHIFT
REMOVE_VOWELS
BELOW_THRESHOLD
ADD
SAME_CHARS
FIB
CORRECT_BRACKETING
MONOTONIC
COMMON
LARGEST_PRIME_FACTOR
SUM_TO_N
DERIVATIVE
FIBFIB
VOWELS_COUNT
CIRCULAR_SHIFT
DIGITSUM
FRUIT_DISTRIBUTION
PLUCK
SEARCH
STRANGE_SORT_LIST
WILL_IT_FLY
SMALLEST_CHANGE
TOTAL_MATCH
IS_MULTIPLY_PRIME
IS_SIMPLE_POWER
IS_CUBE
HEX_KEY
DECIMAL_TO_BINARY
IS_HAPPY
NUMERICAL_LETTER_GRADE
PRIME_LENGTH
STARTS_ONE_ENDS
SOLVE
ANTI_SHUFFLE
GET_ROW
SORT_ARRAY
ENCRYPT
NEXT_SMALLEST
IS_BORED
ANY_INT
ENCODE
SKJKASDKD
CHECK_DICT_CASE
COUNT_UP_TO
MULTIPLY
COUNT_UPPER
CLOSEST_INTEGER
MAKE_A_PILE
WORDS_STRING
CHOOSE_NUM
ROUNDED_AVG
UNIQUE_DIGITS
BY_LENGTH
EVEN_ODD_PALINDROME
COUNT_NUMS
MOVE_ONE_BALL
EXCHANGE
HISTOGRAM
REVERSE_DELETE
ODD_COUNT
MINSUBARRAYSUM
MAX_FILL
SELECT_WORDS
GET_CLOSEST_VOWEL
MATCH_PARENS
MAXIMUM
SOLUTION
ADD_ELEMENTS
GET_ODD_COLLATZ
VALID_DATE
SPLIT_WORDS
IS_SORTED
INTERSECTION
PROD_SIGNS
MINPATH
TRI
DIGITS
IS_NESTED
SUM_SQUARES
CHECK_IF_LAST_CHAR_IS_A_LETTER
CAN_ARRANGE
LARGEST_SMALLEST_INTEGERS
COMPARE_ONE
IS_EQUAL_TO_SUM_EVEN
SPECIAL_FACTORIAL
FIX_SPACES
FILE_NAME_CHECK
WORDS_IN_SENTENCE
SIMPLIFY
ORDER_BY_POINTS
SPECIALFILTER
GET_MAX_TRIPLES
BF
SORTED_LIST_SUM
X_OR_Y
DOUBLE_THE_DIFFERENCE
COMPARE
STRONGEST_EXTENSION
CYCPATTERN_CHECK
EVEN_ODD_COUNT
INT_TO_MINI_ROMAN
RIGHT_ANGLE_TRIANGLE
FIND_MAX
EAT
DO_ALGEBRA
STRING_TO_MD5
GENERATE_INTEGERS
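If you want to run a whole family of related tasks without listing each name, the enum can be filtered with standard Python enum iteration (a minimal sketch, not a deepeval-specific feature):

```python
from deepeval.benchmarks import HumanEval
from deepeval.benchmarks.tasks import HumanEvalTask

# Select every task whose name starts with "SORT"
# (SORT_NUMBERS, SORT_THIRD, SORT_EVEN, SORT_ARRAY, ...)
sort_tasks = [task for task in HumanEvalTask if task.name.startswith("SORT")]
benchmark = HumanEval(tasks=sort_tasks, n=50)
```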