Class BoolQBenchmark<T>

Namespace: AiDotNet.Reasoning.Benchmarks

Assembly: AiDotNet.dll

BoolQ benchmark for evaluating yes/no question answering.

public class BoolQBenchmark<T> : IBenchmark<T>

Type Parameters

T: The numeric type used for scoring (e.g., double, float).

Inheritance: object

BoolQBenchmark<T>

Implements: IBenchmark<T>

Inherited Members: object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Remarks

For Beginners: BoolQ tests whether an AI system can answer yes/no questions about a passage of text, which requires reading comprehension.

What is BoolQ? BoolQ (Boolean Questions) contains naturally occurring yes/no questions about Wikipedia passages. Unlike artificial benchmarks, these are real questions that people actually asked.

Format:

Passage: A paragraph from Wikipedia
Question: A yes/no question about the passage
Answer: True or False

Example:

Passage: "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars
in Paris, France. It is named after the engineer Gustave Eiffel, whose company
designed and built the tower. Constructed from 1887 to 1889..."

Question: "Is the Eiffel Tower in France?"
Answer: Yes/True

Question: "Was the Eiffel Tower built in the 21st century?"
Answer: No/False
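
The example shows answers written as "Yes/True" and "No/False", so a custom evaluation needs some way to map free-form model output onto a boolean label. This page does not document how the library parses answers. The sketch below is one possible normalizer, not the library's method, and `BoolQAnswer` and `TryParse` are hypothetical names that are not part of AiDotNet. It assumes the verdict is the first word of the response.

```csharp
using System;

// Usage: map free-form model outputs onto BoolQ's True/False labels.
Console.WriteLine(BoolQAnswer.TryParse("Yes, it is in France.", out bool a) ? a.ToString() : "unparsed"); // prints "True"
Console.WriteLine(BoolQAnswer.TryParse("False", out bool b) ? b.ToString() : "unparsed");                 // prints "False"
Console.WriteLine(BoolQAnswer.TryParse("Maybe", out bool c) ? c.ToString() : "unparsed");                 // prints "unparsed"

// Hypothetical helper for illustration; not part of AiDotNet.
static class BoolQAnswer
{
    // Returns true when the response starts with a recognizable yes/no verdict.
    // 'value' holds the parsed label (true for yes/true, false for no/false).
    public static bool TryParse(string raw, out bool value)
    {
        value = false;
        if (string.IsNullOrWhiteSpace(raw)) return false;

        // Assumption: the verdict is the first word of the response.
        string[] words = raw.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return false;

        // Strip surrounding punctuation and ignore case.
        string first = words[0].Trim('.', ',', '!', ':', ';', '"', '\'').ToLowerInvariant();

        switch (first)
        {
            case "yes":
            case "true":
                value = true;
                return true;
            case "no":
            case "false":
                return true; // value stays false
            default:
                return false; // no recognizable verdict
        }
    }
}
```

Returning a separate success flag keeps "the model said no" distinct from "the model's answer could not be parsed", which matters when counting errors.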

Why it's challenging:

Requires careful reading comprehension
Questions can be tricky or indirect
Need to distinguish explicit vs. implicit information
Must avoid making unwarranted inferences
Real-world questions (not synthetic)

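Performance levels (approximate reported accuracy; figures for recent models vary with the evaluation setup):

Random guessing: 50%
Majority class (always answering yes): about 62%
Humans: 89%
BERT-Large: 77.4%
RoBERTa: 87.1%
GPT-3: 76.4%
GPT-4: 86.9%
Claude 3 Opus: 87.5%
Claude 3.5 Sonnet: 91.0%
OpenAI o1: 89.5%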

Question types:

Factual: Direct facts from the passage
Inferential: Requires reasoning from passage
Temporal: About time and dates
Causal: About cause and effect
Comparative: About comparisons

Research:

"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions" (Clark et al., 2019)
https://arxiv.org/abs/1905.10044
Dataset: 15,942 questions from Google search queries
Part of SuperGLUE benchmark suite

Constructors

BoolQBenchmark()

public BoolQBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>: Function that takes a problem and returns an answer.
sampleSize int?: Number of problems to evaluate (null for all).
cancellationToken CancellationToken: Token used to cancel the evaluation.

Returns

Task<BenchmarkResult<T>>: Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
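
The four steps above can be sketched as a small stand-alone loop. This illustrates the general pattern only and is not the library's implementation: `ToyProblem` and `ToyEvaluator` are hypothetical types, the three questions are made-up toy data, and the exact-match comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical toy data; the real benchmark loads BoolQ problems itself.
var problems = new List<ToyProblem>
{
    new ToyProblem("Is the Eiffel Tower in France?", "yes"),
    new ToyProblem("Was the Eiffel Tower built in the 21st century?", "no"),
    new ToyProblem("Is the Eiffel Tower named after an engineer?", "yes"),
};

// A trivial stand-in reasoning system that always answers "yes".
Func<string, Task<string>> alwaysYes = _ => Task.FromResult("yes");

// Two of the three toy answers are "yes", so two of three responses match.
double accuracy = await ToyEvaluator.EvaluateAsync(
    problems, alwaysYes, sampleSize: null, cancellationToken: CancellationToken.None);
Console.WriteLine($"Accuracy: {accuracy:P1}");

// Hypothetical problem shape: a prompt plus its expected answer.
record ToyProblem(string Prompt, string Expected);

static class ToyEvaluator
{
    public static async Task<double> EvaluateAsync(
        IReadOnlyList<ToyProblem> problems,
        Func<string, Task<string>> evaluateFunction,
        int? sampleSize,
        CancellationToken cancellationToken)
    {
        // 1. Select problems: all of them, or a random sample of the requested size.
        var rng = new Random();
        List<ToyProblem> selected = sampleSize is int n && n < problems.Count
            ? problems.OrderBy(_ => rng.Next()).Take(n).ToList()
            : problems.ToList();

        int correct = 0;
        foreach (ToyProblem problem in selected)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // 2. Ask the reasoning system to solve the problem.
            string answer = await evaluateFunction(problem.Prompt);

            // 3. Compare with the expected answer (assumption: case-insensitive exact match).
            if (string.Equals(answer.Trim(), problem.Expected, StringComparison.OrdinalIgnoreCase))
            {
                correct++;
            }
        }

        // 4. Accuracy is the fraction of correct answers.
        return selected.Count == 0 ? 0.0 : (double)correct / selected.Count;
    }
}
```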

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
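
A minimal usage sketch, assuming the AiDotNet package is referenced and `T` is `double`. It uses only the constructor, properties, and `EvaluateAsync` signature listed on this page. The always-yes delegate is a hypothetical stand-in for a real reasoning system, and the sample size and timeout are arbitrary. The members of `BenchmarkResult<T>` are not listed here, so the example prints the result object without reading any of its properties.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new BoolQBenchmark<double>();
Console.WriteLine($"{benchmark.BenchmarkName}: {benchmark.TotalProblems} problems");

// Hypothetical stand-in for a real reasoning system: it ignores the problem
// text and always answers "yes". Replace it with a call into your own model.
Func<string, Task<string>> alwaysYes = problem => Task.FromResult("yes");

// Cancel the run if it takes longer than five minutes (arbitrary limit).
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));

// Evaluate on 50 problems instead of the full set; pass null to use all of them.
var result = await benchmark.EvaluateAsync(
    alwaysYes, sampleSize: 50, cancellationToken: cts.Token);

// BenchmarkResult<T> members are not documented on this page,
// so only its default string representation is printed.
Console.WriteLine(result);
```

A constant answer like this is a useful sanity check before plugging in a real model, because it exercises the whole pipeline without any model calls.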

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?: Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>: List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
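
A minimal sketch of loading a few problems for inspection, assuming the AiDotNet package is referenced and `T` is `double`. The properties of `BenchmarkProblem` are not listed on this page, so the example prints each object's default string representation rather than guessing at member names.

```csharp
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new BoolQBenchmark<double>();

// Load five problems; pass null (the default) to load all of them.
var problems = await benchmark.LoadProblemsAsync(count: 5);
Console.WriteLine($"Loaded {problems.Count} problems.");

// BenchmarkProblem members are not documented on this page,
// so each item is printed via its ToString() representation.
foreach (var problem in problems)
{
    Console.WriteLine(problem);
}
```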
