Class BoolQBenchmark<T>
- Namespace
- AiDotNet.Reasoning.Benchmarks
- Assembly
- AiDotNet.dll
BoolQ benchmark for evaluating yes/no question answering.
public class BoolQBenchmark<T> : IBenchmark<T>
Type Parameters
- T
The numeric type used for scoring (e.g., double, float).
- Inheritance
- BoolQBenchmark<T>
- Implements
- IBenchmark<T>
Remarks
For Beginners: BoolQ tests whether an AI system can answer yes/no questions about passages of text, which requires reading comprehension.
What is BoolQ? BoolQ (Boolean Questions) contains naturally occurring yes/no questions about Wikipedia passages. Unlike artificial benchmarks, these are real questions that people actually asked.
Format:
- Passage: A paragraph from Wikipedia
- Question: A yes/no question about the passage
- Answer: True or False
Example:
Passage: "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars
in Paris, France. It is named after the engineer Gustave Eiffel, whose company
designed and built the tower. Constructed from 1887 to 1889..."
Question: "Is the Eiffel Tower in France?"
Answer: Yes/True
Question: "Was the Eiffel Tower built in the 21st century?"
Answer: No/False
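The example answers above treat "Yes" and "True" (and "No" and "False") as equivalent, so free-form model output usually needs normalizing before it can be compared. The following is a minimal sketch of one way to do that. `YesNoParser` is a hypothetical helper written for illustration; it is not part of AiDotNet, and the library's own answer matching may work differently.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the AiDotNet API.
public static class YesNoParser
{
    // Maps free-form answers such as "Yes.", "TRUE", or "no, it was not" to a bool.
    // Returns null when the first word is not a recognized yes/no token.
    public static bool? Parse(string answer)
    {
        if (string.IsNullOrWhiteSpace(answer))
        {
            return null;
        }

        // Take the first whitespace-separated word, then strip surrounding punctuation.
        string[] words = answer.Trim().Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        string first = words[0]
            .Trim('.', ',', '!', ':', ';', '"', '\'')
            .ToLowerInvariant();

        switch (first)
        {
            case "yes":
            case "true":
                return true;
            case "no":
            case "false":
                return false;
            default:
                return null;
        }
    }
}
```

Looking only at the first word is a deliberate simplification: it handles answers like "Yes, it is in Paris" but returns null for answers that bury the verdict later in the sentence, which a caller can then count as incorrect or retry.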
Why it's challenging:
- Requires careful reading comprehension
- Questions can be tricky or indirect
- Need to distinguish explicit vs. implicit information
- Must avoid making unwarranted inferences
- Real-world questions (not synthetic)
Performance levels (approximate reported accuracy):
- Random guessing: 50%
- Humans: 89%
- BERT-Large: 77.4%
- RoBERTa: 87.1%
- GPT-3: 76.4%
- GPT-4: 86.9%
- Claude 3 Opus: 87.5%
- Claude 3.5 Sonnet: 91.0%
- OpenAI o1: 89.5%
Question types:
- Factual: Direct facts from the passage
- Inferential: Requires reasoning from passage
- Temporal: About time and dates
- Causal: About cause and effect
- Comparative: About comparisons
Research:
- "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions" (Clark et al., 2019)
- https://arxiv.org/abs/1905.10044
- Dataset: 15,942 questions from Google search queries
- Part of SuperGLUE benchmark suite
Constructors
BoolQBenchmark()
public BoolQBenchmark()
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
- string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
- string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
- int
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
- evaluateFunction Func<string, Task<string>>
Function that takes a problem and returns an answer.
- sampleSize int?
Number of problems to evaluate (null for all).
- cancellationToken CancellationToken
Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
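The four steps above can be sketched on a small in-memory list without the library. This is an illustration under stated assumptions, not the AiDotNet implementation: `EvaluationSketch`, the tuple-based problem list, the `seed` parameter, and the case-insensitive string comparison are all hypothetical choices made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical sketch for illustration only; not the library's evaluation code.
public static class EvaluationSketch
{
    // Runs the four steps on in-memory items and returns accuracy in [0, 1].
    public static async Task<double> AccuracyAsync(
        IReadOnlyList<(string Problem, string Expected)> items,
        Func<string, Task<string>> evaluateFunction,
        int? sampleSize = null,
        int seed = 0)
    {
        // 1. Select problems: all of them, or a random sample of the requested size.
        var random = new Random(seed);
        var selected = sampleSize.HasValue
            ? items.OrderBy(_ => random.Next()).Take(sampleSize.Value).ToList()
            : items.ToList();
        if (selected.Count == 0)
        {
            return 0.0;
        }

        int correct = 0;
        foreach (var (problem, expected) in selected)
        {
            // 2. Ask the reasoning system for an answer.
            string answer = await evaluateFunction(problem);

            // 3. Compare with the expected answer (trimmed, case-insensitive).
            if (string.Equals(answer?.Trim(), expected, StringComparison.OrdinalIgnoreCase))
            {
                correct++;
            }
        }

        // 4. Compute accuracy as the fraction of correct answers.
        return (double)correct / selected.Count;
    }
}
```

With the two Eiffel Tower questions from the example (expected answers "yes" and "no"), a baseline delegate such as `_ => Task.FromResult("Yes")` gets one of the two right, for an accuracy of 0.5. A delegate of the same shape is what `EvaluateAsync` accepts as `evaluateFunction`, according to the signature documented above.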
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
- count int?
Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.