Interface IBenchmark<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for reasoning benchmarks that evaluate model performance.
public interface IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
Remarks
For Beginners: A benchmark is like a standardized test for AI reasoning systems. Just like students take SAT or ACT tests to measure their abilities, AI systems are evaluated on benchmarks to measure their reasoning capabilities.
Common benchmarks:
- GSM8K: Grade school math word problems (~8,500 questions)
- MATH: Competition-level mathematics
- HumanEval: Code generation tasks
- MMLU: Multiple choice questions across many subjects
- ARC-AGI: Abstract reasoning puzzles
Why benchmarks matter:
- Objective measurement of performance
- Compare different approaches
- Track improvements over time
- Identify strengths and weaknesses
Example:
var benchmark = new GSM8KBenchmark<double>();
var results = await benchmark.EvaluateAsync(reasoner, sampleSize: 100);
Console.WriteLine($"Accuracy: {results.Accuracy:P1}"); // "Accuracy: 87.5%"
Properties
BenchmarkName
Gets the name of this benchmark.
string BenchmarkName { get; }
Property Value
string
Description
Gets a description of what this benchmark measures.
string Description { get; }
Property Value
string
TotalProblems
Gets the total number of problems in this benchmark.
int TotalProblems { get; }
Property Value
int
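Taken together, these properties let you inspect a benchmark before running it. A minimal sketch, reusing the GSM8KBenchmark<double> type from the example above (the output strings shown in comments are illustrative):
IBenchmark<double> benchmark = new GSM8KBenchmark<double>();
Console.WriteLine(benchmark.BenchmarkName);                     // e.g. "GSM8K"
Console.WriteLine(benchmark.Description);                       // what this benchmark measures
Console.WriteLine($"{benchmark.TotalProblems} problems total"); // size of the full problem set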
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system: it takes a problem string and returns an answer string.
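A sketch of a complete evaluation run, assuming a hypothetical MyReasoner class with a SolveAsync(string) method; any method matching Func<string, Task<string>> can be passed the same way:
var benchmark = new GSM8KBenchmark<double>();
var reasoner = new MyReasoner(); // hypothetical reasoning system
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30)); // optional time limit

var results = await benchmark.EvaluateAsync(
    problem => reasoner.SolveAsync(problem), // wrap the reasoner as Func<string, Task<string>>
    sampleSize: 100,                         // random sample; pass null to run every problem
    cancellationToken: cts.Token);

Console.WriteLine($"Accuracy: {results.Accuracy:P1}");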
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
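A sketch of a custom evaluation loop built on LoadProblemsAsync. The Question and ExpectedAnswer member names on BenchmarkProblem are assumptions for illustration; check that type for its actual members:
var problems = await benchmark.LoadProblemsAsync(count: 10);

int correct = 0;
foreach (var problem in problems)
{
    // problem.Question and problem.ExpectedAnswer are assumed member names
    string answer = await reasoner.SolveAsync(problem.Question);
    if (answer.Trim() == problem.ExpectedAnswer.Trim())
        correct++;
}

Console.WriteLine($"Custom accuracy: {(double)correct / problems.Count:P1}");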