Class WinoGrandeBenchmark<T>
Namespace: AiDotNet.Reasoning.Benchmarks
Assembly: AiDotNet.dll
WinoGrande benchmark for evaluating commonsense reasoning through pronoun resolution.
public class WinoGrandeBenchmark<T> : IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
Inheritance
object → WinoGrandeBenchmark<T>
Implements
IBenchmark<T>
Remarks
For Beginners: WinoGrande tests whether AI can figure out what pronouns (like "it", "they", "he", "she") refer to in sentences, which requires common sense.
What is WinoGrande? Based on the classic Winograd Schema Challenge, WinoGrande presents sentences with pronouns where you need common sense to understand what the pronoun refers to.
Example 1:
Sentence: "The trophy doesn't fit in the suitcase because it is too big."
Question: What is too big?
A) The trophy
B) The suitcase
Answer: A (The trophy is too big to fit)
Example 2:
Sentence: "The trophy doesn't fit in the suitcase because it is too small."
Question: What is too small?
A) The trophy
B) The suitcase
Answer: B (The suitcase is too small to hold the trophy)
Notice how just changing one word ("big" → "small") completely flips the answer!
Example 3:
Sentence: "The city councilmen refused the demonstrators a permit because they feared violence."
Question: Who feared violence?
A) The city councilmen
B) The demonstrators
Answer: A (The councilmen feared violence, so they refused the permit)
Example 4:
Sentence: "The city councilmen refused the demonstrators a permit because they advocated violence."
Question: Who advocated violence?
A) The city councilmen
B) The demonstrators
Answer: B (The demonstrators advocated violence, so the permit was refused)
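As a sketch of what a problem might look like when rendered as a single prompt string for your evaluate function (the exact formatting used by WinoGrandeBenchmark is an assumption here):
// Hypothetical prompt layout; the benchmark's actual formatting may differ.
string problem =
    "The trophy doesn't fit in the suitcase because it is too big.\n" +
    "What is too big?\n" +
    "A) The trophy\n" +
    "B) The suitcase";
string expectedAnswer = "A";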
Why is it called "Winograd"? It is named after Terry Winograd, a pioneer in natural language understanding who published the original "city councilmen" example in 1972. The Winograd Schema Challenge itself was proposed by Hector Levesque and colleagues in 2012 as a better alternative to the Turing Test.
Why it requires common sense:
- Cannot be solved by word associations alone
- Requires understanding of cause and effect
- Requires knowledge about how the world works
- Requires reasoning about physical properties, social situations, etc.
Performance levels:
- Random guessing: 50%
- Humans: 94.0%
- BERT: 59.4%
- RoBERTa: 79.1%
- GPT-3: 70.2%
- GPT-4: 87.5%
- Claude 3 Opus: 86.8%
- Claude 3.5 Sonnet: 88.5%
- OpenAI o1: 90.8%
WinoGrande improvements over the original:
- 44,000 examples (vs. 273 in the original challenge)
- Crowdsourced and then adversarially filtered (via the AfLite algorithm) to remove annotation artifacts and keep the harder problems
- More diverse scenarios
- Less prone to statistical biases
Research:
- "WinoGrande: An Adversarial Winograd Schema Challenge at Scale" (Sakaguchi et al., 2020)
- https://arxiv.org/abs/1907.10641
- Dataset: 44,000 problems with adversarial filtering
- The related Winograd Schema Challenge (WSC) task is part of SuperGLUE
Constructors
WinoGrandeBenchmark()
Initializes a new instance of the WinoGrandeBenchmark<T> class.
public WinoGrandeBenchmark()
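A minimal construction sketch, assuming double as the scoring type:
// Create the benchmark with double as the numeric scoring type.
var benchmark = new WinoGrandeBenchmark<double>();
Console.WriteLine(benchmark.BenchmarkName);
Console.WriteLine(benchmark.TotalProblems);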
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
int
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
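Below is a usage sketch. CallMyModelAsync is a hypothetical stand-in for however you query your reasoning system; it is not part of this library.
// Usage sketch: CallMyModelAsync is a hypothetical placeholder for your model call.
var benchmark = new WinoGrandeBenchmark<double>();

Func<string, Task<string>> evaluateFunction = async problem =>
{
    // Send the problem text to your reasoning system and return its
    // answer as a string (e.g., "A" or "B").
    return await CallMyModelAsync(problem);
};

// Evaluate on a random sample of 100 problems; pass null to run all of them.
BenchmarkResult<double> result =
    await benchmark.EvaluateAsync(evaluateFunction, sampleSize: 100);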
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
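A minimal inspection sketch; since the members of BenchmarkProblem are not documented here, the example only loads and counts problems:
// Load the first 5 problems for inspection.
var benchmark = new WinoGrandeBenchmark<double>();
List<BenchmarkProblem> problems = await benchmark.LoadProblemsAsync(count: 5);
Console.WriteLine($"Loaded {problems.Count} problems for inspection.");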