Class HellaSwagBenchmark<T>
- Namespace
- AiDotNet.Reasoning.Benchmarks
- Assembly
- AiDotNet.dll
HellaSwag benchmark for evaluating commonsense natural language inference.
public class HellaSwagBenchmark<T> : IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
- Inheritance
- HellaSwagBenchmark<T>
- Implements
- IBenchmark<T>
Remarks
For Beginners: HellaSwag tests whether AI can predict what happens next in everyday situations using common sense.
What is HellaSwag? HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) presents a context and asks the model to choose the most plausible continuation from 4 options.
Example:
Context: "A woman is sitting at a piano. She"
Options:
A) sits on a bench and plays the piano
B) starts to play the piano with her feet
C) pulls out a sandwich from the piano
D) transforms into a dolphin
Answer: A (most plausible)
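As a concrete sketch, a question like this could be flattened into a single prompt string before being handed to a model. The format below is only an illustration; it is not necessarily the problem format this benchmark produces internally:
// Illustrative only: builds one multiple-choice prompt from the example above.
var context = "A woman is sitting at a piano. She";
var options = new[]
{
    "A) sits on a bench and plays the piano",
    "B) starts to play the piano with her feet",
    "C) pulls out a sandwich from the piano",
    "D) transforms into a dolphin"
};
var prompt = "Choose the most plausible continuation.\n" +
             $"Context: {context}\n" +
             string.Join("\n", options) +
             "\nAnswer with a single letter (A-D):";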
Why it's called "HellaSwag":
- "Hella" = very (slang)
- Designed to be harder than previous benchmarks (SWAG)
- Uses adversarial generation to create tricky wrong answers
Categories:
- ActivityNet: Video descriptions (activities)
- WikiHow: Instructional text (how-to guides)
Adversarial wrong answers: The wrong options are generated to be:
- Grammatically plausible
- Semantically similar to the context
- But factually/logically incorrect
This makes random guessing ineffective and requires actual understanding.
Performance levels:
- Random guessing: 25%
- Humans: 95.6%
- BERT-Large: 47.9%
- GPT-3: 78.9%
- GPT-4: 95.3%
- Claude 3 Opus: 88.0%
- Claude 3.5 Sonnet: 89.0%
- ChatGPT o1: 94.2%
Why it's hard for models:
- Requires real-world common sense
- Can't be solved by pattern matching alone
- Adversarial wrong answers look plausible
- Needs understanding of cause and effect
Research:
- "HellaSwag: Can a Machine Really Finish Your Sentence?" (Zellers et al., 2019)
- https://arxiv.org/abs/1905.07830
- Dataset: 70,000 questions from ActivityNet and WikiHow
Constructors
HellaSwagBenchmark()
Initializes a new instance of the HellaSwag benchmark.
public HellaSwagBenchmark()
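The constructor takes no arguments. A minimal usage sketch, assuming double is used as the scoring type:
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HellaSwagBenchmark<double>();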
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
- string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
- string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
- int
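These metadata properties can be read directly, for example to log what is about to be evaluated. The values shown in the comments are illustrative, not guaranteed:
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HellaSwagBenchmark<double>();
Console.WriteLine(benchmark.BenchmarkName);   // e.g. "HellaSwag"
Console.WriteLine(benchmark.Description);     // what the benchmark measures
Console.WriteLine(benchmark.TotalProblems);   // number of available problems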
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
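A sketch of a full evaluation run. The stub evaluateFunction below always answers "A" and stands in for a real reasoning system; the exact members of BenchmarkResult<T> are not listed in this section, so the result is only captured, not inspected:
using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HellaSwagBenchmark<double>();

// Stand-in for your reasoning system: replace the body with a call to your model.
// It receives one problem string and must return the chosen answer (e.g. "A").
Func<string, Task<string>> evaluateFunction = problem => Task.FromResult("A");

// Evaluate a random sample of 100 problems; pass null (the default) to run all of them.
var result = await benchmark.EvaluateAsync(evaluateFunction, sampleSize: 100);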
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
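A sketch of loading a handful of problems for inspection. BenchmarkProblem's properties are not documented in this section, so only the count is used here:
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HellaSwagBenchmark<double>();
var problems = await benchmark.LoadProblemsAsync(count: 5);
Console.WriteLine($"Loaded {problems.Count} problems");
// Each BenchmarkProblem carries a problem and its correct answer;
// see the BenchmarkProblem type for the exact property names.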