Namespace AiDotNet.Reasoning.Benchmarks
Classes
- ARCAGIBenchmark<T>
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) benchmark for evaluating abstract reasoning over grid-based puzzles.
- BoolQBenchmark<T>
BoolQ benchmark for evaluating yes/no question answering.
- CodeXGlueBenchmarkOptions
Options for configuring CodeXGLUE dataset loading.
- CodeXGlueBenchmark<T>
CodeXGLUE benchmark harness combining dataset loading and metric computation.
- CommonsenseQABenchmark<T>
CommonsenseQA benchmark for evaluating commonsense knowledge and reasoning.
- DROPBenchmark<T>
DROP (Discrete Reasoning Over Paragraphs) benchmark for numerical and discrete reasoning.
- GSM8KBenchmark<T>
Grade School Math 8K (GSM8K) benchmark for evaluating mathematical reasoning.
- HellaSwagBenchmark<T>
HellaSwag benchmark for evaluating commonsense natural language inference.
- HumanEvalBenchmark<T>
HumanEval benchmark for evaluating Python code generation capabilities.
- LogiQABenchmark<T>
LogiQA benchmark for evaluating logical reasoning abilities.
- MATHBenchmark<T>
MATH benchmark for evaluating advanced mathematical reasoning.
- MBPPBenchmark<T>
MBPP (Mostly Basic Python Problems) benchmark for evaluating Python code generation.
- MMLUBenchmark<T>
MMLU (Massive Multitask Language Understanding) benchmark for evaluating world knowledge across 57 academic and professional subjects.
- PIQABenchmark<T>
PIQA (Physical Interaction Question Answering) benchmark for physical commonsense reasoning.
- TruthfulQABenchmark<T>
TruthfulQA benchmark for evaluating truthfulness and resistance to reproducing common misconceptions.
- WinoGrandeBenchmark<T>
WinoGrande benchmark for evaluating commonsense reasoning through pronoun resolution.