Class BoolQBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

BoolQ benchmark for evaluating yes/no question answering.

public class BoolQBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object
BoolQBenchmark<T>
Implements
IBenchmark<T>

Remarks

For Beginners: BoolQ tests whether an AI model can answer yes/no questions about a passage of text, which requires reading comprehension.

What is BoolQ? BoolQ (Boolean Questions) contains naturally occurring yes/no questions about Wikipedia passages. Unlike artificial benchmarks, these are real questions that people actually asked.

Format:

  • Passage: A paragraph from Wikipedia
  • Question: A yes/no question about the passage
  • Answer: True or False

Example:

Passage: "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars
in Paris, France. It is named after the engineer Gustave Eiffel, whose company
designed and built the tower. Constructed from 1887 to 1889..."

Question: "Is the Eiffel Tower in France?"
Answer: Yes/True

Question: "Was the Eiffel Tower built in the 21st century?"
Answer: No/False
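
The passage/question/answer format above can be modelled with a small record, and a model's free-text reply ("Yes", "true.", "No, it was not") has to be mapped back to a boolean before it can be scored. The following is a minimal sketch under stated assumptions: `BoolQItem`, `BoolQText`, `FormatPrompt`, and `ParseAnswer` are hypothetical names introduced for illustration, and the prompt layout is an assumption. This page does not document how the library itself formats prompts or parses replies.

```csharp
using System;

// Hypothetical shape of one BoolQ item; the library's own BenchmarkProblem
// type is not described on this page and may differ.
public sealed record BoolQItem(string Passage, string Question, bool Answer);

public static class BoolQText
{
    // Builds a plain-text prompt from a passage and a question.
    // The layout is an assumption, not the library's format.
    public static string FormatPrompt(BoolQItem item) =>
        $"Passage: {item.Passage}\nQuestion: {item.Question}\nAnswer (yes or no):";

    // Maps a free-text reply to a boolean by looking at its first word.
    // Returns null when the first word is not yes/true/no/false.
    public static bool? ParseAnswer(string reply)
    {
        if (string.IsNullOrWhiteSpace(reply)) return null;

        // Take the first word, strip surrounding punctuation, lower-case it.
        string first = reply.Trim()
            .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)[0]
            .Trim('.', ',', '!', ':', ';', '"', '\'')
            .ToLowerInvariant();

        return first switch
        {
            "yes" or "true" => true,
            "no" or "false" => false,
            _ => null,
        };
    }
}
```

Returning `null` for unrecognised replies keeps "the model did not answer" separate from "the model answered wrongly", which a caller may want to count differently.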

Why it's challenging:

  • Requires careful reading comprehension
  • Questions can be tricky or indirect
  • Requires distinguishing explicit from implicit information
  • Requires avoiding unwarranted inferences
  • Real-world questions (not synthetic)

Performance levels (approximate; reported figures vary with evaluation setup and model version):

  • Random guessing: 50%
  • Humans: 89%
  • BERT-Large: 77.4%
  • RoBERTa: 87.1%
  • GPT-3: 76.4%
  • GPT-4: 86.9%
  • Claude 3 Opus: 87.5%
  • Claude 3.5 Sonnet: 91.0%
  • OpenAI o1: 89.5%

Question types:

  • Factual: Direct facts from the passage
  • Inferential: Requires reasoning from passage
  • Temporal: About time and dates
  • Causal: About cause and effect
  • Comparative: About comparisons

Research:

  • "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions" (Clark et al., 2019)
  • https://arxiv.org/abs/1905.10044
  • Dataset: 15,942 questions from Google search queries
  • Part of SuperGLUE benchmark suite

Constructors

BoolQBenchmark()

public BoolQBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system: it takes a problem string and returns an answer string.
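
To make the four steps and the role of evaluateFunction concrete, here is a minimal sketch that runs the same loop over in-memory data. It is illustrative only: `BoolQEvaluationSketch`, `AlwaysYes`, and `AccuracyAsync` are hypothetical names, the problems are plain tuples, and the comparison is a simple case-insensitive string match. The real EvaluateAsync may sample, compare, and score differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoolQEvaluationSketch
{
    // A trivial stand-in for a reasoning system: it ignores the problem text
    // and always answers "yes". It matches Func<string, Task<string>>.
    public static Task<string> AlwaysYes(string problem) => Task.FromResult("yes");

    // Illustrative only: mirrors the four steps on in-memory data.
    public static async Task<double> AccuracyAsync(
        Func<string, Task<string>> evaluateFunction,
        IReadOnlyList<(string Problem, string Expected)> problems,
        int? sampleSize = null,
        CancellationToken cancellationToken = default)
    {
        // 1. Select problems: all of them, or a random sample.
        var random = new Random();
        var selected = sampleSize is int n && n < problems.Count
            ? problems.OrderBy(_ => random.Next()).Take(n).ToList()
            : problems.ToList();

        int correct = 0;
        foreach (var (problem, expected) in selected)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // 2. Ask the reasoning system for an answer.
            string answer = await evaluateFunction(problem);

            // 3. Compare with the expected answer (case-insensitive here).
            if (string.Equals(answer.Trim(), expected, StringComparison.OrdinalIgnoreCase))
                correct++;
        }

        // 4. Accuracy is the fraction of selected problems answered correctly.
        return selected.Count == 0 ? 0.0 : (double)correct / selected.Count;
    }
}
```

With the documented API, the corresponding call would be `await new BoolQBenchmark<double>().EvaluateAsync(BoolQEvaluationSketch.AlwaysYes, sampleSize: 100)`. The members of the returned `BenchmarkResult<double>` are not listed on this page, so the sketch returns a plain `double` instead.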

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.