Class MMLUBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

MMLU (Massive Multitask Language Understanding) benchmark for evaluating world knowledge.

public class MMLUBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object
MMLUBenchmark<T>

Implements
IBenchmark<T>

Remarks

For Beginners: MMLU is like a comprehensive standardized test for AI, covering 57 subjects from elementary to professional level.

What is MMLU? MMLU tests knowledge across diverse academic and professional domains:

  • STEM: Mathematics, Physics, Chemistry, Biology, Computer Science
  • Humanities: History, Philosophy, Law
  • Social Sciences: Psychology, Economics, Sociology
  • Other: Medicine, Business, Professional Knowledge

Format: Multiple choice questions (A, B, C, D) spanning different difficulty levels:

  • Elementary
  • High School
  • College
  • Professional

Example questions:

Elementary Math:
  Q: What is 7 × 8?
  A) 54  B) 56  C) 64  D) 48
  Answer: B

College Physics:
  Q: What is the ground state energy of a hydrogen atom?
  A) -13.6 eV  B) -27.2 eV  C) -6.8 eV  D) 0 eV
  Answer: A

Professional Medicine:
  Q: A 45-year-old presents with sudden chest pain. What is the most appropriate first test?
  A) CT scan  B) ECG  C) Blood test  D) X-ray
  Answer: B

Why it's important:

  • Comprehensive knowledge evaluation
  • Tests reasoning + memorization
  • Standard benchmark for LLMs
  • Measures real-world applicability

Performance levels:

  • Random guessing: 25%
  • Average human expert: ~90% (in their domain)
  • GPT-3.5: ~70%
  • GPT-4: ~86%
  • Claude 3 Opus: ~87%
  • Claude 3.5 Sonnet: ~89%
  • OpenAI o1: ~91%
  • Gemini 1.5 Pro: ~90%

The 57 subjects, grouped into four categories:

STEM (18 subjects):

  • Abstract Algebra, Astronomy, College Biology, College Chemistry
  • College Computer Science, College Mathematics, College Physics
  • Computer Security, Conceptual Physics, Electrical Engineering
  • Elementary Mathematics, High School Biology, High School Chemistry
  • High School Computer Science, High School Mathematics, High School Physics
  • High School Statistics, Machine Learning

Humanities (13 subjects):

  • Formal Logic, High School European History, High School US History
  • High School World History, International Law, Jurisprudence
  • Logical Fallacies, Moral Disputes, Moral Scenarios
  • Philosophy, Prehistory, Professional Law, World Religions

Social Sciences (12 subjects):

  • Econometrics, High School Geography, High School Government and Politics
  • High School Macroeconomics, High School Microeconomics
  • High School Psychology, Human Sexuality, Professional Psychology
  • Public Relations, Security Studies, Sociology, US Foreign Policy

Other (14 subjects):

  • Anatomy, Business Ethics, Clinical Knowledge, College Medicine
  • Global Facts, Human Aging, Management, Marketing
  • Medical Genetics, Miscellaneous, Nutrition, Professional Accounting
  • Professional Medicine, Virology

Research:

Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021), https://arxiv.org/abs/2009.03300

Constructors

MMLUBenchmark()

Initializes a new instance of the MMLUBenchmark<T> class.

public MMLUBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int
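
Example

A minimal sketch of constructing the benchmark and reading the properties documented above. It assumes only the members shown on this page and a project that references AiDotNet:

using System;
using AiDotNet.Reasoning.Benchmarks;

// Create the benchmark with its default constructor and inspect its metadata.
var benchmark = new MMLUBenchmark<double>();

Console.WriteLine(benchmark.BenchmarkName);   // the benchmark's display name
Console.WriteLine(benchmark.Description);     // what the benchmark measures
Console.WriteLine(benchmark.TotalProblems);   // number of problems available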

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
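
Example

A minimal sketch of running the benchmark, assuming only the signature above. CallMyModelAsync is a hypothetical placeholder for your own LLM or reasoning strategy, and the exact members of BenchmarkResult<T> are not shown on this page, so the result is only captured, not unpacked:

using System;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new MMLUBenchmark<double>();

// The evaluate function is your reasoning system: it receives one problem
// (question text plus answer choices) and returns the chosen answer, e.g. "B".
Func<string, Task<string>> evaluateFunction = async problem =>
{
    string answer = await CallMyModelAsync(problem); // hypothetical model call
    return answer.Trim();
};

// Evaluate a random sample of 100 problems; pass null to run every problem.
var result = await benchmark.EvaluateAsync(
    evaluateFunction,
    sampleSize: 100,
    cancellationToken: CancellationToken.None);

// `result` (a BenchmarkResult<double>) carries the accuracy and detailed metrics described above.

// Stand-in for a real model client; replace with your own implementation.
static Task<string> CallMyModelAsync(string problem) => Task.FromResult("B");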

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
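
Example

A minimal sketch of loading a few problems for inspection, assuming only the signature above. The members of BenchmarkProblem are not documented on this page, so each problem is simply printed:

using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new MMLUBenchmark<double>();

// Load ten problems for inspection; pass null to load the full set.
var problems = await benchmark.LoadProblemsAsync(count: 10);

foreach (var problem in problems)
{
    Console.WriteLine(problem);
}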