Class MATHBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

MATH benchmark for evaluating advanced mathematical reasoning.

public class MATHBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object
MATHBenchmark<T>
Implements
IBenchmark<T>

Remarks

For Beginners: The MATH dataset contains 12,500 challenging problems drawn from high school mathematics competitions (AMC, AIME, etc.). They are significantly harder than the grade-school word problems in GSM8K.

Example problems:

  • "Find the sum of all positive integers n such that sqrt(n^2 + 85) is an integer."
  • "A square is inscribed in a circle. What is the ratio of the area of the circle to the area of the square?"
  • "Solve the system of equations: x + y + z = 6, xy + xz + yz = 11, xyz = 6"
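
To give a sense of the reasoning involved, the first problem reduces to a small search. Writing sqrt(n^2 + 85) = m gives (m - n)(m + n) = 85, so m + n is at most 85 and n is at most 42. The sketch below is an illustrative brute-force check of that bound; it is not part of the library, and the variable names are chosen only for this example.

```csharp
using System;

// Brute-force check: find every positive n <= 42 for which n^2 + 85 is a perfect square.
long sum = 0;
for (long n = 1; n <= 42; n++)
{
    long value = n * n + 85;
    // Round the floating-point root, then confirm it exactly with integer arithmetic.
    long root = (long)Math.Round(Math.Sqrt(value));
    if (root * root == value)
    {
        Console.WriteLine($"n = {n}, sqrt = {root}");
        sum += n;
    }
}
Console.WriteLine($"Sum = {sum}"); // prints "Sum = 48" (n = 6 and n = 42)
```

A model being evaluated has to find the factoring argument itself; the loop only confirms the final answer.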

Why it's important:

  • Tests advanced mathematical reasoning
  • Requires complex multi-step solutions
  • Covers prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus
  • Benchmark for reasoning capability at competition level

Performance levels:

  • Human (expert): 90-95%
  • GPT-3.5: ~7%
  • GPT-4: ~42%
  • OpenAI o1: ~85%
  • DeepSeek-R1: ~97% (reported on the MATH-500 subset)
  • Minerva (540B): ~50%

Research: "Measuring Mathematical Problem Solving With the MATH Dataset" (Hendrycks et al., 2021) https://arxiv.org/abs/2103.03874

Constructors

MATHBenchmark()

public MATHBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int
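
A minimal sketch of constructing the benchmark and reading its three properties. It assumes the AiDotNet package is referenced and picks double for T, since this page names double and float only as examples. The printed values depend on the library version and are not shown here.

```csharp
using System;
using AiDotNet.Reasoning.Benchmarks;

// T = double is an assumption for this sketch; float is the other type the page mentions.
var benchmark = new MATHBenchmark<double>();

Console.WriteLine($"Name:        {benchmark.BenchmarkName}");
Console.WriteLine($"Description: {benchmark.Description}");
Console.WriteLine($"Problems:    {benchmark.TotalProblems}");
```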

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Token used to cancel the evaluation.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction parameter is your reasoning system: it takes a problem string and returns a task that resolves to an answer string.
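
The call can be sketched as follows, using only the signature shown above. The constant-answer solver is a hypothetical placeholder for a real reasoning system, and the sample size of 25 and the ten-minute timeout are arbitrary choices for illustration. This page does not list the members of BenchmarkResult<T> or say whether the problems need dataset files or network access, so the sketch makes no assumptions about either.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

// Hypothetical placeholder solver: ignores the problem text and always answers "0".
// Replace it with a call into your own model or reasoning pipeline.
Func<string, Task<string>> solver = problem => Task.FromResult("0");

var benchmark = new MATHBenchmark<double>();

// Cancel the run automatically if it takes longer than ten minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));

// Evaluate on a sample of 25 problems; pass sampleSize: null to run all of them.
var result = await benchmark.EvaluateAsync(solver, sampleSize: 25, cancellationToken: cts.Token);

// Inspect result here. Its members are not listed on this page, so none are accessed.
Console.WriteLine("Evaluation finished.");
```

Starting with a small sampleSize keeps a trial run cheap before committing to the full problem set.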

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
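
A short sketch of loading a handful of problems for inspection, again assuming T = double and a referenced AiDotNet package. The members of BenchmarkProblem are not documented on this page, so the example only reports how many problems came back.

```csharp
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new MATHBenchmark<double>();

// Request five problems; pass count: null (or omit the argument) to load all of them.
var problems = await benchmark.LoadProblemsAsync(count: 5);

Console.WriteLine($"Loaded {problems.Count} problems.");
```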