Interface IBenchmark<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for reasoning benchmarks that evaluate model performance.
public interface IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
Remarks
For Beginners: A benchmark is like a standardized test for AI reasoning systems. Just like students take SAT or ACT tests to measure their abilities, AI systems are evaluated on benchmarks to measure their reasoning capabilities.
Common benchmarks:
- GSM8K: Grade school math word problems (~8,500 questions)
- MATH: Competition-level mathematics
- HumanEval: Code generation tasks
- MMLU: Multiple choice questions across many subjects
- ARC-AGI: Abstract reasoning puzzles
Why benchmarks matter:
- Objective measurement of performance
- Compare different approaches
- Track improvements over time
- Identify strengths and weaknesses
Example:
var benchmark = new GSM8KBenchmark<double>();
var results = await benchmark.EvaluateAsync(reasoner, sampleSize: 100);
Console.WriteLine($"Accuracy: {results.Accuracy:P1}"); // "Accuracy: 87.5%"
Properties
BenchmarkName
Gets the name of this benchmark.
string BenchmarkName { get; }
Property Value
string
Description
Gets a description of what this benchmark measures.
string Description { get; }
Property Value
string
TotalProblems
Gets the total number of problems in this benchmark.
int TotalProblems { get; }
Property Value
int
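Taken together, these properties let you inspect a benchmark before running it. A minimal sketch, reusing the GSM8KBenchmark<double> type from the example above (the output strings shown in comments are illustrative):
IBenchmark<double> benchmark = new GSM8KBenchmark<double>();
Console.WriteLine(benchmark.BenchmarkName);                     // e.g. "GSM8K"
Console.WriteLine(benchmark.Description);                       // what this benchmark measures
Console.WriteLine($"{benchmark.TotalProblems} problems total"); // size of the full problem set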
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system: it takes a problem string and returns an answer string.
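A sketch of a complete evaluation run, assuming a hypothetical MyReasoner class with a SolveAsync(string) method; any method matching Func<string, Task<string>> can be passed the same way:
var benchmark = new GSM8KBenchmark<double>();
var reasoner = new MyReasoner(); // hypothetical reasoning system
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30)); // optional time limit

var results = await benchmark.EvaluateAsync(
    problem => reasoner.SolveAsync(problem), // wrap the reasoner as Func<string, Task<string>>
    sampleSize: 100,                         // random sample; pass null to run every problem
    cancellationToken: cts.Token);

Console.WriteLine($"Accuracy: {results.Accuracy:P1}");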
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
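A sketch of a custom evaluation loop built on LoadProblemsAsync. The Question and ExpectedAnswer member names on BenchmarkProblem are assumptions for illustration; check that type for its actual members:
var problems = await benchmark.LoadProblemsAsync(count: 10);

int correct = 0;
foreach (var problem in problems)
{
    // problem.Question and problem.ExpectedAnswer are assumed member names
    string answer = await reasoner.SolveAsync(problem.Question);
    if (answer.Trim() == problem.ExpectedAnswer.Trim())
        correct++;
}

Console.WriteLine($"Custom accuracy: {(double)correct / problems.Count:P1}");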