Class DROPBenchmark<T>

Namespace: AiDotNet.Reasoning.Benchmarks

Assembly: AiDotNet.dll

DROP (Discrete Reasoning Over Paragraphs) benchmark for numerical and discrete reasoning.

public class DROPBenchmark<T> : IBenchmark<T>

Type Parameters

T: The numeric type used for scoring (e.g., double, float).

Inheritance: object → DROPBenchmark<T>

Implements: IBenchmark<T>

Inherited Members: object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Remarks

For Beginners: DROP tests whether an AI system can read a paragraph and answer questions that require counting, comparing, sorting, or doing arithmetic on the numbers in the text.

What is DROP? DROP presents passages containing numbers, dates, and quantities, then asks questions requiring discrete reasoning operations on this information.

Question types:

1. Addition/Subtraction:

Passage: "In 2019, the company had 500 employees. In 2020, they hired 150 more
and 50 left."

Q: How many employees did they have at the end of 2020?
A: 600 (500 + 150 - 50)

2. Counting:

Passage: "The team scored touchdowns in the 1st, 3rd, and 4th quarters."

Q: How many quarters did they score touchdowns in?
A: 3

3. Comparison:

Passage: "Team A scored 28 points. Team B scored 21 points."

Q: Which team scored more?
A: Team A

4. Sorting:

Passage: "Alice is 25, Bob is 30, and Carol is 22 years old."

Q: Who is the oldest?
A: Bob

5. Multi-step reasoning:

Passage: "In the first half, they scored 14 points. In the third quarter,
they scored 7 more. In the fourth quarter, they scored 10 points."

Q: What was their total score?
A: 31 (14 + 7 + 10)

6. Date arithmetic:

Passage: "The war started in 1939 and ended in 1945."

Q: How long did the war last?
A: 6 years

Why it's challenging:

Requires extracting multiple numbers from text
Requires working out which operation to perform
Requires tracking relationships between entities
Often requires multi-step reasoning
Cannot be solved by pattern matching alone; the answer must actually be computed
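
The first two points can be illustrated with a minimal sketch, using the employee passage from the examples above. The helper name ExtractNumbers and the regular expression are assumptions made for illustration; they are not part of AiDotNet.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: pull every integer or decimal literal out of a passage.
static double[] ExtractNumbers(string passage) =>
    Regex.Matches(passage, @"\d+(?:\.\d+)?")
         .Select(m => double.Parse(m.Value, CultureInfo.InvariantCulture))
         .ToArray();

string passage = "In 2019, the company had 500 employees. " +
                 "In 2020, they hired 150 more and 50 left.";

double[] numbers = ExtractNumbers(passage);
Console.WriteLine(string.Join(", ", numbers)); // 2019, 500, 2020, 150, 50

// Extraction alone is not enough: the years 2019 and 2020 are numbers too,
// so a solver still has to decide which values feed the calculation.
double headcount = numbers[1] + numbers[3] - numbers[4];
Console.WriteLine(headcount); // 600
```

Picking the indices by hand stands in for the hard part of the task: deciding from the wording which quantities matter and whether each one is added or subtracted.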

Performance levels:

Humans: ~96% F1 score
BERT: ~43% F1
RoBERTa: ~58% F1
GPT-3: ~52% F1
GPT-4: ~79% F1
Claude 3 Opus: ~77% F1
Claude 3.5 Sonnet: ~82% F1
OpenAI o1: ~87% F1 (reasoning helps significantly)
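
The F1 scores above measure token overlap between the predicted and the reference answer, so a partially correct answer earns partial credit. The sketch below is a simplified version of that idea; the function name TokenF1 and its tokenization rules are assumptions, and the official DROP evaluator applies further normalization and multi-span handling that is omitted here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Simplified token-overlap F1 (illustrative only, not the official scorer).
static double TokenF1(string predicted, string gold)
{
    string[] Tokenize(string s) =>
        s.ToLowerInvariant()
         .Split(new[] { ' ', ',', '.', ';', ':' }, StringSplitOptions.RemoveEmptyEntries);

    string[] predTokens = Tokenize(predicted);
    string[] goldTokens = Tokenize(gold);
    if (predTokens.Length == 0 || goldTokens.Length == 0)
        return predTokens.Length == goldTokens.Length ? 1.0 : 0.0;

    // Track unmatched gold tokens so repeated words are not over-credited.
    var remaining = new Dictionary<string, int>();
    foreach (string t in goldTokens)
        remaining[t] = remaining.TryGetValue(t, out int n) ? n + 1 : 1;

    int overlap = 0;
    foreach (string t in predTokens)
    {
        if (remaining.TryGetValue(t, out int left) && left > 0)
        {
            remaining[t] = left - 1;
            overlap++;
        }
    }
    if (overlap == 0) return 0.0;

    double precision = (double)overlap / predTokens.Length;
    double recall = (double)overlap / goldTokens.Length;
    return 2 * precision * recall / (precision + recall);
}

string Format(double value) => value.ToString("F2", CultureInfo.InvariantCulture);

Console.WriteLine(Format(TokenF1("6 years", "6 years"))); // 1.00
Console.WriteLine(Format(TokenF1("6", "6 years")));       // 0.67
Console.WriteLine(Format(TokenF1("Team B", "Team A")));   // 0.50
```

The last line shows why F1 is usually reported alongside exact match: "Team B" is the wrong answer, yet it shares one token with "Team A" and still scores 0.50.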

Reasoning operations:

Addition, subtraction
Counting occurrences
Finding maximum/minimum
Sorting by value
Comparing quantities
Date/time arithmetic
Percentage calculations
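
Once the relevant values have been pulled out of a passage, each operation in this list is a small computation. The following sketch applies them to the numbers from the example passages above; the variable names are illustrative, and the percentage case (headcount growing from 500 to 600) is an added example rather than one taken from the dataset.

```csharp
using System;
using System.Linq;

// Values taken from the example passages earlier in this page.
var ages = new (string Name, int Age)[] { ("Alice", 25), ("Bob", 30), ("Carol", 22) };
int[] scoringQuarters = { 1, 3, 4 };
int teamA = 28, teamB = 21;
int startCount = 500, endCount = 600;

int total = 14 + 7 + 10;                                            // addition
int count = scoringQuarters.Length;                                 // counting occurrences
string oldest = ages.OrderByDescending(p => p.Age).First().Name;    // finding the maximum
string[] youngestFirst = ages.OrderBy(p => p.Age)
                             .Select(p => p.Name).ToArray();        // sorting by value
string winner = teamA > teamB ? "Team A" : "Team B";                // comparing quantities
int warYears = 1945 - 1939;                                         // date arithmetic
double growthPercent = 100.0 * (endCount - startCount) / startCount; // percentage change

Console.WriteLine($"{total} {count} {oldest} {string.Join(",", youngestFirst)}");
// 31 3 Bob Carol,Alice,Bob
Console.WriteLine($"{winner} {warYears} {growthPercent}");
// Team A 6 20
```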

Research:

"DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs" (Dua et al., 2019)
https://arxiv.org/abs/1903.00161
Dataset: roughly 96,000 crowdsourced questions over Wikipedia passages
Focus on numerical reasoning in natural language

Constructors

DROPBenchmark()

public DROPBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>: Function that takes a problem and returns an answer.
sampleSize int?: Number of problems to evaluate (null for all).
cancellationToken CancellationToken: Cancellation token.

Returns

Task<BenchmarkResult<T>>: Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
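
The four steps can be mirrored in a small self-contained sketch. Everything here is hypothetical: EvaluateSketchAsync, the in-memory problem list, and the exact-match comparison are assumptions made for illustration and do not show how DROPBenchmark<T> is implemented. Only the delegate shape, Func<string, Task<string>>, comes from the signature above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the benchmark data: (question, expected answer) pairs.
var problems = new List<(string Question, string Answer)>
{
    ("Team A scored 28 points. Team B scored 21 points. Which team scored more?", "Team A"),
    ("The war started in 1939 and ended in 1945. How long did the war last?", "6 years"),
    ("The team scored touchdowns in the 1st, 3rd, and 4th quarters. How many quarters?", "3"),
};

// Same delegate shape as evaluateFunction: problem text in, answer text out.
// This toy solver only knows how to answer one kind of question.
Func<string, Task<string>> solver = problem =>
    Task.FromResult(problem.Contains("Which team") ? "Team A" : "unknown");

static async Task<double> EvaluateSketchAsync(
    IReadOnlyList<(string Question, string Answer)> all,
    Func<string, Task<string>> evaluateFunction,
    int? sampleSize = null,
    CancellationToken cancellationToken = default)
{
    // 1. Select problems: everything, or a random sample of the requested size.
    var rng = new Random(42);
    var selected = sampleSize is int n
        ? all.OrderBy(_ => rng.Next()).Take(n).ToList()
        : all.ToList();
    if (selected.Count == 0) return 0.0;

    int correct = 0;
    foreach (var (question, expected) in selected)
    {
        cancellationToken.ThrowIfCancellationRequested();
        // 2. Ask the reasoning system for an answer.
        string answer = await evaluateFunction(question);
        // 3. Compare with the expected answer (exact match, ignoring case and outer whitespace).
        if (string.Equals(answer.Trim(), expected.Trim(), StringComparison.OrdinalIgnoreCase))
            correct++;
    }
    // 4. Report accuracy as the fraction answered correctly.
    return (double)correct / selected.Count;
}

double accuracy = await EvaluateSketchAsync(problems, solver);
Console.WriteLine(accuracy.ToString("F2", CultureInfo.InvariantCulture)); // 0.33
```

With the library itself, the same solver delegate would be passed to the documented method, for example `await new DROPBenchmark<double>().EvaluateAsync(solver, sampleSize: 50)`. The members of the returned BenchmarkResult<T> are not described in this section, so the sketch returns a plain accuracy value instead.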

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?: Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>: List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
