Class DROPBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

DROP (Discrete Reasoning Over Paragraphs) benchmark for numerical and discrete reasoning.

public class DROPBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object
DROPBenchmark<T>
Implements
IBenchmark<T>

Remarks

For Beginners: DROP tests whether an AI system can read a paragraph and answer questions that require counting, comparing, sorting, or doing arithmetic on numbers in the text.

What is DROP? DROP presents passages containing numbers, dates, and quantities, then asks questions requiring discrete reasoning operations on this information.

Question types:

1. Addition/Subtraction:

Passage: "In 2019, the company had 500 employees. In 2020, they hired 150 more
and 50 left."

Q: How many employees did they have at the end of 2020?
A: 600 (500 + 150 - 50)

2. Counting:

Passage: "The team scored touchdowns in the 1st, 3rd, and 4th quarters."

Q: How many quarters did they score touchdowns in?
A: 3

3. Comparison:

Passage: "Team A scored 28 points. Team B scored 21 points."

Q: Which team scored more?
A: Team A

4. Sorting:

Passage: "Alice is 25, Bob is 30, and Carol is 22 years old."

Q: Who is oldest?
A: Bob

5. Multi-step reasoning:

Passage: "In the first half, they scored 14 points. In the third quarter,
they scored 7 more. In the fourth quarter, they scored 10 points."

Q: What was their total score?
A: 31 (14 + 7 + 10)

6. Date arithmetic:

Passage: "The war started in 1939 and ended in 1945."

Q: How long did the war last?
A: 6 years

Why it's challenging:

  • Requires extracting multiple numbers from text
  • Requires understanding which operation to perform
  • Requires tracking relationships between entities
  • Often requires multi-step reasoning
  • Cannot be solved by pattern matching alone; the answer must actually be computed

Performance levels:

  • Humans: ~96% F1 score
  • BERT: ~43% F1
  • RoBERTa: ~58% F1
  • GPT-3: ~52% F1
  • GPT-4: ~79% F1
  • Claude 3 Opus: ~77% F1
  • Claude 3.5 Sonnet: ~82% F1
  • OpenAI o1: ~87% F1 (explicit reasoning helps significantly)
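
The F1 figures above measure token overlap between a predicted answer and the reference answer. This page does not state how DROPBenchmark<T> computes its score, so the following is only a minimal sketch of the generic bag-of-tokens F1 definition. TokenF1 and Tokenize are hypothetical helper names, not AiDotNet APIs, and the whitespace-only tokenization is a simplifying assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an AiDotNet API): bag-of-tokens F1 between
// a predicted answer and a reference answer.
double TokenF1(string prediction, string reference)
{
    string[] predTokens = Tokenize(prediction);
    string[] refTokens = Tokenize(reference);
    if (predTokens.Length == 0 || refTokens.Length == 0)
    {
        // Two empty answers count as a match; one empty answer counts as a miss.
        return predTokens.Length == refTokens.Length ? 1.0 : 0.0;
    }

    // Count how often each token occurs in the reference.
    var remaining = new Dictionary<string, int>();
    foreach (string token in refTokens)
    {
        remaining.TryGetValue(token, out int seen);
        remaining[token] = seen + 1;
    }

    // Each predicted token may match at most one unused reference token.
    int overlap = 0;
    foreach (string token in predTokens)
    {
        if (remaining.TryGetValue(token, out int left) && left > 0)
        {
            remaining[token] = left - 1;
            overlap++;
        }
    }

    if (overlap == 0) return 0.0;
    double precision = (double)overlap / predTokens.Length;
    double recall = (double)overlap / refTokens.Length;
    return 2 * precision * recall / (precision + recall);
}

// Assumption: lowercase and split on whitespace only. Real scorers usually
// also strip punctuation and articles and treat numbers specially.
string[] Tokenize(string text) =>
    text.ToLowerInvariant()
        .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

// "6 years" vs. "6": precision 1/2, recall 1/1, so F1 = 2/3.
double partial = TokenF1("6 years", "6");
Console.WriteLine(partial);
```

A partially correct answer therefore earns partial credit, which is why F1 values are typically higher than exact-match accuracy on the same predictions.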

Reasoning operations:

  • Addition, subtraction
  • Counting occurrences
  • Finding maximum/minimum
  • Sorting by value
  • Comparing quantities
  • Date/time arithmetic
  • Percentage calculations
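
To make these operations concrete, the sketch below applies a few of them to the example passages from this page. ExtractNumbers is a hypothetical helper written for illustration (it is not part of AiDotNet), and the regular expression is a deliberately naive assumption about how numbers appear in text.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not an AiDotNet API): every integer or decimal
// literal found in the passage, in order of appearance.
List<double> ExtractNumbers(string passage) =>
    Regex.Matches(passage, @"\d+(?:\.\d+)?")
         .Cast<Match>()
         .Select(m => double.Parse(m.Value, CultureInfo.InvariantCulture))
         .ToList();

// Comparison / maximum: the larger of the two scores.
List<double> scores = ExtractNumbers("Team A scored 28 points. Team B scored 21 points.");
double topScore = scores.Max();                  // 28

// Date arithmetic: difference between the two years.
List<double> years = ExtractNumbers("The war started in 1939 and ended in 1945.");
double duration = years.Max() - years.Min();     // 6

// Counting: ordinals such as "1st" still contribute their digits.
List<double> quarters = ExtractNumbers("The team scored touchdowns in the 1st, 3rd, and 4th quarters.");
int quarterCount = quarters.Count;               // 3

// Extraction alone is not enough: years and head counts come back mixed,
// so arriving at 500 + 150 - 50 requires understanding the passage.
List<double> mixed = ExtractNumbers(
    "In 2019, the company had 500 employees. In 2020, they hired 150 more and 50 left.");
Console.WriteLine(string.Join(", ", mixed));     // 2019, 500, 2020, 150, 50
```

The last case shows why a fixed rule such as "sum every number" fails on DROP-style questions: the operation and its operands depend on what the passage says about each value.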

Research:

  • "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs" (Dua et al., 2019)
  • https://arxiv.org/abs/1903.00161
  • Dataset: about 96,000 crowdsourced questions over Wikipedia passages
  • Focus on numerical reasoning in natural language

Constructors

DROPBenchmark()

public DROPBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
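
The four steps can be illustrated with a small, self-contained sketch. It is not AiDotNet's implementation: EvaluateSketchAsync, firstNumberBaseline, the in-memory problem list, and the exact-match comparison are all assumptions made for illustration, and only the delegate shape Func<string, Task<string>> is taken from the signature above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a reasoning system: answers with the first
// number found in the problem text.
Func<string, Task<string>> firstNumberBaseline = problem =>
{
    Match match = Regex.Match(problem, @"\d+(?:\.\d+)?");
    return Task.FromResult(match.Success ? match.Value : string.Empty);
};

// Illustrative evaluation loop over (problem, answer) pairs.
async Task<double> EvaluateSketchAsync(
    IReadOnlyList<(string Problem, string Answer)> problems,
    Func<string, Task<string>> evaluateFunction,
    int? sampleSize = null,
    CancellationToken cancellationToken = default)
{
    // 1. Select all problems, or a random sample of the requested size.
    var rng = new Random();
    List<(string Problem, string Answer)> selected =
        sampleSize is int n && n < problems.Count
            ? problems.OrderBy(_ => rng.Next()).Take(n).ToList()
            : problems.ToList();
    if (selected.Count == 0) return 0.0;

    int correct = 0;
    foreach (var (problem, expected) in selected)
    {
        cancellationToken.ThrowIfCancellationRequested();

        // 2. Ask the reasoning system for an answer.
        string predicted = await evaluateFunction(problem);

        // 3. Compare with the reference (exact match, ignoring case and outer whitespace).
        if (string.Equals(predicted.Trim(), expected.Trim(), StringComparison.OrdinalIgnoreCase))
        {
            correct++;
        }
    }

    // 4. Accuracy is the fraction of selected problems answered correctly.
    return (double)correct / selected.Count;
}

var sample = new List<(string Problem, string Answer)>
{
    ("Team A scored 28 points. Team B scored 21 points. How many points did Team A score?", "28"),
    ("The war started in 1939 and ended in 1945. How many years did the war last?", "6"),
};

// The baseline gets the first problem right and the second wrong.
double accuracy = await EvaluateSketchAsync(sample, firstNumberBaseline);
Console.WriteLine(accuracy);
```

With a project reference to AiDotNet, a delegate of the same shape could be passed to the real method, for example `await new DROPBenchmark<double>().EvaluateAsync(firstNumberBaseline, sampleSize: 50)`. The format of the problem string and the members of BenchmarkResult<T> are not described on this page, so neither is shown here.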

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.