
Class HumanEvalBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

HumanEval benchmark for evaluating Python code generation capabilities.

public class HumanEvalBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object ← HumanEvalBenchmark<T>

Implements
IBenchmark<T>

Remarks

For Beginners: HumanEval is a benchmark of 164 Python programming problems. Each problem asks the model to write a function that passes a set of test cases.

Example problem:

def is_prime(n: int) -> bool:
    """Return True if a number is prime, False otherwise."""
    # Your code here

Why it's important:

  • Tests code generation abilities
  • Requires understanding algorithms
  • Tests correctness via unit tests
  • Standard benchmark for code models

Performance levels (approximate pass rates):

  • GPT-3.5: ~48%
  • GPT-4: ~67%
  • OpenAI o1: ~92%
  • AlphaCode: ~53%
  • CodeGen: ~29%

Research: "Evaluating Large Language Models Trained on Code" (Chen et al., 2021) https://arxiv.org/abs/2107.03374

Constructors

HumanEvalBenchmark(HumanEvalBenchmarkOptions?)

public HumanEvalBenchmark(HumanEvalBenchmarkOptions? options = null)

Parameters

options HumanEvalBenchmarkOptions

Optional configuration for the benchmark. Pass null to use the default settings.
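
Example

A minimal construction sketch. The parameterless use of HumanEvalBenchmarkOptions below is an assumption; consult that type for the available settings.

using AiDotNet.Reasoning.Benchmarks;

// Passing nothing (or null) uses the built-in default options.
var benchmark = new HumanEvalBenchmark<double>();

// Passing an options instance explicitly (parameterless construction assumed here).
var configured = new HumanEvalBenchmark<double>(new HumanEvalBenchmarkOptions());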

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int
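
Example

The metadata properties can be read directly once the benchmark is constructed:

using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HumanEvalBenchmark<double>();

Console.WriteLine(benchmark.BenchmarkName);
Console.WriteLine(benchmark.Description);
Console.WriteLine(benchmark.TotalProblems); // 164 for the full HumanEval set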

Methods

EvaluateAsync(Func<string, Task<string>>, Func<string, string[], CancellationToken, Task<bool>>?, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark, using an optional execution evaluator to check generated code against each problem's test cases.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, Func<string, string[], CancellationToken, Task<bool>>? executionEvaluator, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

executionEvaluator Func<string, string[], CancellationToken, Task<bool>>

Optional evaluator that determines whether a generated solution passes the problem's test cases.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>
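
Example

A sketch of supplying a custom execution evaluator. The delegate's argument semantics (generated code plus the problem's test strings) are inferred from the parameter name, and the model call is a placeholder, not part of this library.

using System;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HumanEvalBenchmark<double>();

// Placeholder code-generation step; replace with a real model call.
Func<string, Task<string>> generateSolution =
    problem => Task.FromResult("def is_prime(n: int) -> bool:\n    ...");

// Assumed semantics: receives the generated code and the test strings and
// returns true if the code passes. Stubbed here; a real implementation would
// run the code in a sandboxed Python environment.
Func<string, string[], CancellationToken, Task<bool>> runTests =
    (code, tests, ct) => Task.FromResult(false);

var result = await benchmark.EvaluateAsync(
    generateSolution, runTests, sampleSize: 20, cancellationToken: CancellationToken.None);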

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
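
Example

A minimal sketch of this overload with an async lambda standing in for the reasoning system. GenerateCodeAsync is hypothetical, and Accuracy is assumed to exist on BenchmarkResult<T> based on the return description above.

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HumanEvalBenchmark<double>();

// The lambda is the reasoning system: problem text in, generated code out.
var result = await benchmark.EvaluateAsync(
    async problem => await GenerateCodeAsync(problem), // hypothetical model call
    sampleSize: 10); // evaluate a random sample of 10 problems; null runs all 164

Console.WriteLine(result.Accuracy); // property name assumed, not documented on this page

// Hypothetical stand-in for an LLM or other code generator.
static Task<string> GenerateCodeAsync(string problem) =>
    Task.FromResult("def solution():\n    pass");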

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
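
Example

A sketch of loading a subset of problems for inspection. The members of BenchmarkProblem are not documented on this page, so only the count is printed here.

using System;
using System.Collections.Generic;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HumanEvalBenchmark<double>();

// Load the first 5 problems; passing null loads all of them.
List<BenchmarkProblem> problems = await benchmark.LoadProblemsAsync(count: 5);
Console.WriteLine($"Loaded {problems.Count} problems");

// Each BenchmarkProblem carries the problem statement and its reference answer
// (exact property names are not shown on this page).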