Class HumanEvalBenchmark<T>
Namespace: AiDotNet.Reasoning.Benchmarks
Assembly: AiDotNet.dll
HumanEval benchmark for evaluating Python code generation capabilities.
public class HumanEvalBenchmark<T> : IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
Inheritance: HumanEvalBenchmark<T>
Implements: IBenchmark<T>
Remarks
For Beginners: HumanEval is a benchmark of 164 Python programming problems. Each problem asks the model to write a function that passes a set of test cases.
Example problem:
Write a function that returns True if a number is prime, False otherwise.
def is_prime(n: int) -> bool:
# Your code here
Why it's important:
- Tests code generation abilities
- Requires understanding algorithms
- Tests correctness via unit tests
- Standard benchmark for code models
Performance levels:
- GPT-3.5: ~48%
- GPT-4: ~67%
- ChatGPT o1: ~92%
- AlphaCode: ~53%
- CodeGen: ~29%
Research: "Evaluating Large Language Models Trained on Code" (Chen et al., 2021) https://arxiv.org/abs/2107.03374
Constructors
HumanEvalBenchmark(HumanEvalBenchmarkOptions?)
public HumanEvalBenchmark(HumanEvalBenchmarkOptions? options = null)
Parameters
options (HumanEvalBenchmarkOptions): Optional configuration for the benchmark; defaults to null.
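Example
A minimal construction sketch. Only the default (null) options path shown in the signature above is used; HumanEvalBenchmarkOptions itself is not documented on this page.
using AiDotNet.Reasoning.Benchmarks;

// Create the benchmark with default options (the options parameter defaults to null).
var benchmark = new HumanEvalBenchmark<double>();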
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
int
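Example
A short sketch that reads the informational properties above. The printed values are illustrative; per the remarks, the benchmark contains 164 problems.
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HumanEvalBenchmark<double>();

// Metadata exposed by the benchmark.
Console.WriteLine(benchmark.BenchmarkName);    // benchmark name
Console.WriteLine(benchmark.Description);      // what the benchmark measures
Console.WriteLine(benchmark.TotalProblems);    // total problem count (164 per the remarks)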
Methods
EvaluateAsync(Func<string, Task<string>>, Func<string, string[], CancellationToken, Task<bool>>?, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark, using an optional caller-supplied execution evaluator to check generated code.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, Func<string, string[], CancellationToken, Task<bool>>? executionEvaluator, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>)
executionEvaluator (Func<string, string[], CancellationToken, Task<bool>>?)
sampleSize (int?)
cancellationToken (CancellationToken)
Returns
- Task<BenchmarkResult<T>>
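Example
A hedged sketch of the execution-based overload. This page does not document the executionEvaluator parameters, so the sketch assumes the first string is the generated code and the string array holds the problem's test cases, with the delegate returning true when the tests pass. Both delegates below are stubs you would replace with a real model call and a sandboxed test runner.
using System;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HumanEvalBenchmark<double>();

// evaluateFunction: problem prompt in, generated Python code out (stubbed here).
Func<string, Task<string>> generateCode = problem =>
    Task.FromResult("def solution():\n    pass");

// executionEvaluator: assumed to receive the generated code plus test cases and
// report whether the code passes them (stubbed here).
Func<string, string[], CancellationToken, Task<bool>> runTests =
    (code, tests, ct) => Task.FromResult(false);

// Evaluate a 20-problem random sample.
var result = await benchmark.EvaluateAsync(
    generateCode,
    runTests,
    sampleSize: 20,
    cancellationToken: CancellationToken.None);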
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
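Example
A minimal sketch of this overload. The solve delegate below is a stub standing in for your reasoning system or model client.
using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HumanEvalBenchmark<double>();

// Your reasoning system: takes the problem text, returns the answer text (stubbed here).
Func<string, Task<string>> solve = problem =>
    Task.FromResult("def solution():\n    pass");

// A null sampleSize evaluates every problem in the benchmark.
var result = await benchmark.EvaluateAsync(solve, sampleSize: null);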
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
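Example
A sketch that loads a small sample of problems for inspection. BenchmarkProblem's members are not listed on this page, so the loop below simply prints each instance.
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HumanEvalBenchmark<double>();

// Load the first five problems (pass null to load all 164).
var problems = await benchmark.LoadProblemsAsync(count: 5);

Console.WriteLine($"Loaded {problems.Count} problems.");
foreach (var problem in problems)
{
    Console.WriteLine(problem);
}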