Class MMLUBenchmark<T>
- Namespace
- AiDotNet.Reasoning.Benchmarks
- Assembly
- AiDotNet.dll
MMLU (Massive Multitask Language Understanding) benchmark for evaluating world knowledge.
public class MMLUBenchmark<T> : IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
- Inheritance
- MMLUBenchmark<T>
- Implements
- IBenchmark<T>
Remarks
For Beginners: MMLU is like a comprehensive standardized test for AI, covering 57 subjects from elementary to professional level.
What is MMLU? MMLU tests knowledge across diverse academic and professional domains:
- STEM: Mathematics, Physics, Chemistry, Biology, Computer Science
- Humanities: History, Philosophy, Law
- Social Sciences: Psychology, Economics, Sociology
- Other: Medicine, Business, Professional Knowledge
Format: Multiple choice questions (A, B, C, D) spanning different difficulty levels:
- Elementary
- High School
- College
- Professional
Example questions:
Elementary Math:
Q: What is 7 × 8?
A) 54  B) 56  C) 64  D) 48
Answer: B
College Physics:
Q: What is the ground state energy of a hydrogen atom?
A) -13.6 eV  B) -27.2 eV  C) -6.8 eV  D) 0 eV
Answer: A
Professional Medicine:
Q: A 45-year-old presents with sudden chest pain. What is the most appropriate first test?
A) CT scan  B) ECG  C) Blood test  D) X-ray
Answer: B
Why it's important:
- Comprehensive knowledge evaluation
- Tests reasoning + memorization
- Standard benchmark for LLMs
- Measures real-world applicability
Performance levels:
- Random guessing: 25%
- Average human expert: ~90% (in their domain)
- GPT-3.5: ~70%
- GPT-4: ~86%
- Claude 3 Opus: ~87%
- Claude 3.5 Sonnet: ~89%
- OpenAI o1: ~91%
- Gemini 1.5 Pro: ~90%
57 subjects, grouped into four categories:
STEM (18 subjects):
- Abstract Algebra, Astronomy, College Biology, College Chemistry
- College Computer Science, College Mathematics, College Physics
- Computer Security, Conceptual Physics, Electrical Engineering, Elementary Mathematics
- High School Biology, High School Chemistry, High School Computer Science
- High School Mathematics, High School Physics, High School Statistics
- Machine Learning
Humanities (13 subjects):
- Formal Logic, High School European History, High School US History
- High School World History, International Law, Jurisprudence
- Logical Fallacies, Moral Disputes, Moral Scenarios
- Philosophy, Prehistory, Professional Law, World Religions
Social Sciences (12 subjects):
- Econometrics, High School Geography, High School Government and Politics
- High School Macroeconomics, High School Microeconomics
- High School Psychology, Human Sexuality, Professional Psychology
- Public Relations, Security Studies, Sociology, US Foreign Policy
Other (14 subjects):
- Anatomy, Business Ethics, Clinical Knowledge, College Medicine
- Global Facts, Human Aging, Management, Marketing
- Medical Genetics, Miscellaneous, Nutrition, Professional Accounting
- Professional Medicine, Virology
Research:
- "Measuring Massive Multitask Language Understanding" (Hendrycks et al., 2021)
- https://arxiv.org/abs/2009.03300
- Dataset: 15,908 questions across 57 tasks
Constructors
MMLUBenchmark()
public MMLUBenchmark()
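The constructor takes no arguments. A minimal construction sketch follows; using double for T is illustrative, not required:

using AiDotNet.Reasoning.Benchmarks;

// Scores from this instance will be reported as double-precision values.
var mmlu = new MMLUBenchmark<double>();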
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
- string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
- string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
- int
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
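A minimal usage sketch is below. AskModelAsync is a hypothetical stand-in for a real model call, and the members of BenchmarkResult<T> are not documented on this page, so the example simply prints the returned object:

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

// Hypothetical stand-in for your reasoning system: a real implementation
// would prompt your model with the question text and return its chosen
// letter (e.g., "B"). This stub always answers "A".
static Task<string> AskModelAsync(string problem) => Task.FromResult("A");

var mmlu = new MMLUBenchmark<double>();

// Evaluate on a random sample of 100 questions; pass null to run all of them.
var result = await mmlu.EvaluateAsync(AskModelAsync, sampleSize: 100);

// BenchmarkResult<T>'s exact members aren't shown here, so print the object
// and inspect it for accuracy and the detailed metrics.
Console.WriteLine(result);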
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
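A short inspection sketch follows; BenchmarkProblem's members are not documented on this page, so the example prints each problem object as-is:

using System;
using AiDotNet.Reasoning.Benchmarks;

var mmlu = new MMLUBenchmark<double>();

// Load ten problems for inspection (null would load the full set).
var problems = await mmlu.LoadProblemsAsync(count: 10);

foreach (var problem in problems)
{
    // BenchmarkProblem's exact members aren't shown on this page; print the
    // object (or inspect it in a debugger) to see the question and answer.
    Console.WriteLine(problem);
}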