Table of Contents

Namespace AiDotNet.Reasoning.Benchmarks

Classes

ARCAGIBenchmark<T>

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) benchmark.

BoolQBenchmark<T>

BoolQ benchmark for evaluating yes/no question answering.

CodeXGlueBenchmarkOptions

Options for configuring CodeXGLUE dataset loading.

CodeXGlueBenchmark<T>

CodeXGLUE benchmark harness combining dataset loading and metric computation.

CommonsenseQABenchmark<T>

CommonsenseQA benchmark for evaluating commonsense knowledge and reasoning.

DROPBenchmark<T>

DROP (Discrete Reasoning Over Paragraphs) benchmark for numerical and discrete reasoning.

GSM8KBenchmark<T>

Grade School Math 8K (GSM8K) benchmark for evaluating mathematical reasoning.

HellaSwagBenchmark<T>

HellaSwag benchmark for evaluating commonsense natural language inference.

HumanEvalBenchmarkOptions

Options for configuring HumanEval dataset loading.

HumanEvalBenchmark<T>

HumanEval benchmark for evaluating Python code generation capabilities.

LogiQABenchmark<T>

LogiQA benchmark for evaluating logical reasoning abilities.

MATHBenchmark<T>

MATH benchmark for evaluating advanced mathematical reasoning.

MBPPBenchmark<T>

MBPP (Mostly Basic Python Problems) benchmark for evaluating Python code generation.

MMLUBenchmark<T>

MMLU (Massive Multitask Language Understanding) benchmark for evaluating knowledge across 57 academic and professional subjects.

PIQABenchmark<T>

PIQA (Physical Interaction Question Answering) benchmark for physical commonsense reasoning.

TruthfulQABenchmark<T>

TruthfulQA benchmark for evaluating truthfulness and resistance to common misconceptions.

WinoGrandeBenchmark<T>

WinoGrande benchmark for evaluating commonsense reasoning through pronoun resolution.