Class DROPBenchmark<T>
- Namespace
- AiDotNet.Reasoning.Benchmarks
- Assembly
- AiDotNet.dll
DROP (Discrete Reasoning Over Paragraphs) benchmark for numerical and discrete reasoning.
public class DROPBenchmark<T> : IBenchmark<T>
Type Parameters
T
The numeric type used for scoring (e.g., double, float).
- Inheritance
- DROPBenchmark<T>
- Implements
- IBenchmark<T>
Remarks
For Beginners: DROP tests whether AI can read a paragraph and answer questions that require counting, comparing, sorting, or doing arithmetic on numbers in the text.
What is DROP? DROP presents passages containing numbers, dates, and quantities, then asks questions requiring discrete reasoning operations on this information.
Question types:
1. Addition/Subtraction:
Passage: "In 2019, the company had 500 employees. In 2020, they hired 150 more
and 50 left."
Q: How many employees did they have at the end of 2020?
A: 600 (500 + 150 - 50)
2. Counting:
Passage: "The team scored touchdowns in the 1st, 3rd, and 4th quarters."
Q: How many quarters did they score touchdowns in?
A: 3
3. Comparison:
Passage: "Team A scored 28 points. Team B scored 21 points."
Q: Which team scored more?
A: Team A
4. Sorting:
Passage: "Alice is 25, Bob is 30, and Carol is 22 years old."
Q: Who is oldest?
A: Bob
5. Multi-step reasoning:
Passage: "In the first half, they scored 14 points. In the third quarter,
they scored 7 more. In the fourth quarter, they scored 10 points."
Q: What was their total score?
A: 31 (14 + 7 + 10)
6. Date arithmetic:
Passage: "The war started in 1939 and ended in 1945."
Q: How long did the war last?
A: 6 years
Why it's challenging:
- Requires extracting multiple numbers from text
- Must determine which operation to perform
- Must track relationships between entities
- Often requires multi-step reasoning
- Can't just pattern match - must actually compute
Performance levels, reported as token-overlap F1 (a simplified scoring sketch follows this list):
- Humans: ~96% F1 score
- BERT: ~43% F1
- RoBERTa: ~58% F1
- GPT-3: ~52% F1
- GPT-4: ~79% F1
- Claude 3 Opus: ~77% F1
- Claude 3.5 Sonnet: ~82% F1
- ChatGPT o1: ~87% F1 (reasoning helps significantly)
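The sketch below is a simplified, self-contained version of token-overlap F1, shown for intuition only; it is not the scorer used inside this class, and the official DROP evaluator adds further rules such as exact matching for numeric tokens and alignment of multi-span answers.

using System;
using System.Linq;

public static class TokenF1Sketch
{
    // Simplified token-overlap F1: lowercase, split on whitespace, compare token bags.
    public static double Compute(string predicted, string gold)
    {
        var p = predicted.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var g = gold.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (p.Length == 0 && g.Length == 0) return 1.0;
        if (p.Length == 0 || g.Length == 0) return 0.0;

        // Count overlap with bag semantics so duplicate tokens are not double-counted.
        var remaining = g.GroupBy(t => t).ToDictionary(grp => grp.Key, grp => grp.Count());
        int overlap = 0;
        foreach (var token in p)
        {
            if (remaining.TryGetValue(token, out var left) && left > 0)
            {
                overlap++;
                remaining[token] = left - 1;
            }
        }
        if (overlap == 0) return 0.0;

        double precision = (double)overlap / p.Length;
        double recall = (double)overlap / g.Length;
        return 2 * precision * recall / (precision + recall);
    }
}

For example, predicting "600 employees" against the gold answer "600" gives precision 0.5, recall 1.0, and an F1 of about 0.67 under this sketch.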
Reasoning operations:
- Addition, subtraction
- Counting occurrences
- Finding maximum/minimum
- Sorting by value
- Comparing quantities
- Date/time arithmetic
- Percentage calculations
Research:
- "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs" (Dua et al., 2019)
- https://arxiv.org/abs/1903.00161
- Dataset: 96,000 questions from Wikipedia passages
- Focus on numerical reasoning in natural language
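The sketch below shows one way to wire the benchmark to a reasoning system. The stand-in solver and the choice of double as the scoring type are arbitrary; the members of BenchmarkResult<T> are not listed on this page, so the result is only captured, not unpacked.

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

public static class DropUsageSketch
{
    public static async Task RunAsync()
    {
        var benchmark = new DROPBenchmark<double>();
        Console.WriteLine($"{benchmark.BenchmarkName}: {benchmark.TotalProblems} problems");
        Console.WriteLine(benchmark.Description);

        // Your reasoning system goes here: problem text in, answer text out.
        // This placeholder always answers "0" and will score near the floor.
        Func<string, Task<string>> solver = problem => Task.FromResult("0");

        // Evaluate on a random sample of 25 problems instead of the full set.
        var result = await benchmark.EvaluateAsync(solver, sampleSize: 25);
        // Inspect 'result' (a BenchmarkResult<double>) for accuracy and detailed metrics.
    }
}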
Constructors
DROPBenchmark()
Initializes a new instance of the DROPBenchmark<T> class.
public DROPBenchmark()
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
- string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
- string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
- int
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction Func<string, Task<string>>
Function that takes a problem and returns an answer.
sampleSize int?
Number of problems to evaluate (null for all).
cancellationToken CancellationToken
Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
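A sketch of a typical call, assuming you already have an async answer function (the myReasoner parameter here is hypothetical); the timeout and sample size are arbitrary choices.

using System;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

public static class EvaluateAsyncSketch
{
    public static async Task RunAsync(Func<string, Task<string>> myReasoner)
    {
        var benchmark = new DROPBenchmark<double>();

        // Stop the run if it has not finished within 10 minutes.
        using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));

        // A null sampleSize would evaluate every problem; 100 keeps the run short.
        var result = await benchmark.EvaluateAsync(myReasoner, sampleSize: 100, cancellationToken: cts.Token);
        // 'result' is a BenchmarkResult<double> with accuracy and detailed metrics.
    }
}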
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count int?
Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
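A sketch of loading a handful of problems for inspection. BenchmarkProblem's members are not listed on this page, so the example prints each item via ToString() rather than assuming specific property names.

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

public static class LoadProblemsSketch
{
    public static async Task RunAsync()
    {
        var benchmark = new DROPBenchmark<double>();

        // Load a small subset; pass null (or omit the argument) to load everything.
        var problems = await benchmark.LoadProblemsAsync(count: 5);
        Console.WriteLine($"Loaded {problems.Count} of {benchmark.TotalProblems} problems");

        foreach (var problem in problems)
        {
            // Replace this with whichever fields of BenchmarkProblem you want to inspect.
            Console.WriteLine(problem);
        }
    }
}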