Interface IBenchmark<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for reasoning benchmarks that evaluate model performance.

public interface IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Remarks

For Beginners: A benchmark is like a standardized test for AI reasoning systems. Just like students take SAT or ACT tests to measure their abilities, AI systems are evaluated on benchmarks to measure their reasoning capabilities.

Common benchmarks:

  • GSM8K: Grade school math problems (8,000 questions)
  • MATH: Competition-level mathematics
  • HumanEval: Code generation tasks
  • MMLU: Multiple choice questions across many subjects
  • ARC-AGI: Abstract reasoning puzzles

Why benchmarks matter:

  • Objective measurement of performance
  • Compare different approaches
  • Track improvements over time
  • Identify strengths and weaknesses

Example:

// reasoner must match Func<string, Task<string>>: it takes a problem and returns an answer
var benchmark = new GSM8KBenchmark<double>();
var results = await benchmark.EvaluateAsync(reasoner, sampleSize: 100);
Console.WriteLine($"Accuracy: {results.Accuracy:P1}"); // e.g. "Accuracy: 87.5%"

Properties

BenchmarkName

Gets the name of this benchmark.

string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
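The adapter pattern described above can be sketched as follows. This is a usage sketch, not library code: `myReasoner` and its `SolveAsync` method are hypothetical stand-ins for whatever reasoning system you want to score, and `benchmark` is any `IBenchmark<double>` instance (for example, the `GSM8KBenchmark<double>` shown earlier).

```csharp
// Adapt a reasoning system to the Func<string, Task<string>> shape
// that EvaluateAsync expects. "myReasoner.SolveAsync" is hypothetical;
// substitute your own reasoner's entry point.
Func<string, Task<string>> evaluate = async problem =>
{
    string answer = await myReasoner.SolveAsync(problem);
    return answer.Trim();   // normalize whitespace before answers are compared
};

// Bound the run with a cancellation token so a slow reasoner
// cannot hang the evaluation indefinitely.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));

var results = await benchmark.EvaluateAsync(
    evaluate,
    sampleSize: 50,                 // null would evaluate every problem
    cancellationToken: cts.Token);
```

Passing a lambda rather than a method group keeps the adapter explicit, which is useful when you need to normalize or post-process answers before the benchmark compares them to the reference solutions.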

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
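To make the contract concrete, here is a minimal in-memory sketch of a benchmark following the `IBenchmark<T>` shape. The `BenchmarkProblem` and `BenchmarkResult<T>` record definitions below are assumptions made so the example compiles on its own; the real AiDotNet types may expose additional members and metrics.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Assumed minimal shapes; the real AiDotNet types may differ.
public record BenchmarkProblem(string Question, string Answer);
public record BenchmarkResult<T>(double Accuracy, int Evaluated);

// Toy benchmark with a fixed in-memory problem set, illustrating
// LoadProblemsAsync and the evaluation loop behind EvaluateAsync.
public class ToyArithmeticBenchmark
{
    private readonly List<BenchmarkProblem> _problems = new()
    {
        new("2 + 2", "4"),
        new("10 - 3", "7"),
        new("6 * 7", "42"),
    };

    public string BenchmarkName => "ToyArithmetic";
    public string Description => "Tiny arithmetic problems for demonstration.";
    public int TotalProblems => _problems.Count;

    public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null) =>
        Task.FromResult(_problems.Take(count ?? _problems.Count).ToList());

    public async Task<BenchmarkResult<double>> EvaluateAsync(
        Func<string, Task<string>> evaluateFunction,
        int? sampleSize = null,
        CancellationToken cancellationToken = default)
    {
        var sample = await LoadProblemsAsync(sampleSize);
        int correct = 0;
        foreach (var problem in sample)
        {
            cancellationToken.ThrowIfCancellationRequested();
            string answer = await evaluateFunction(problem.Question);
            if (answer.Trim() == problem.Answer) correct++;
        }
        return new BenchmarkResult<double>(
            (double)correct / sample.Count, sample.Count);
    }
}
```

Loading the problems directly, as `LoadProblemsAsync` allows, is useful when you want a custom scoring rule (for instance, numeric tolerance instead of exact string match) rather than the benchmark's built-in comparison.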