Table of Contents

Namespace AiDotNet.Reasoning.Benchmarks

Classes

ARCAGIBenchmark<T>

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) benchmark.

BoolQBenchmark<T>

BoolQ benchmark for evaluating yes/no question answering.

CodeXGlueBenchmarkOptions

Options for configuring CodeXGLUE dataset loading.

CodeXGlueBenchmark<T>

CodeXGLUE benchmark harness combining dataset loading and metric computation.

CommonsenseQABenchmark<T>

CommonsenseQA benchmark for evaluating commonsense knowledge and reasoning.

DROPBenchmark<T>

DROP (Discrete Reasoning Over Paragraphs) benchmark for numerical and discrete reasoning.

GSM8KBenchmark<T>

Grade School Math 8K (GSM8K) benchmark for evaluating mathematical reasoning.

HellaSwagBenchmark<T>

HellaSwag benchmark for evaluating commonsense natural language inference.

HumanEvalBenchmarkOptions

Options for configuring HumanEval dataset loading.

HumanEvalBenchmark<T>

HumanEval benchmark for evaluating Python code generation capabilities.

LogiQABenchmark<T>

LogiQA benchmark for evaluating logical reasoning abilities.

MATHBenchmark<T>

MATH benchmark for evaluating advanced mathematical reasoning.

MBPPBenchmark<T>

MBPP (Mostly Basic Python Problems) benchmark for evaluating Python code generation.

MMLUBenchmark<T>

MMLU (Massive Multitask Language Understanding) benchmark for evaluating knowledge across 57 academic and professional subjects.

PIQABenchmark<T>

PIQA (Physical Interaction Question Answering) benchmark for physical commonsense reasoning.

TruthfulQABenchmark<T>

TruthfulQA benchmark for evaluating truthfulness and resistance to common misconceptions.

WinoGrandeBenchmark<T>

WinoGrande benchmark for evaluating commonsense reasoning through pronoun resolution.