Class HellaSwagBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

HellaSwag benchmark for evaluating commonsense natural language inference.

public class HellaSwagBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object → HellaSwagBenchmark<T>

Implements
IBenchmark<T>

Remarks

For Beginners: HellaSwag tests whether AI can predict what happens next in everyday situations using common sense.

What is HellaSwag? HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) presents a context and asks the model to choose the most plausible continuation from 4 options.

Example:

Context: "A woman is sitting at a piano. She"
Options:
A) sits on a bench and plays the piano
B) starts to play the piano with her feet
C) pulls out a sandwich from the piano
D) transforms into a dolphin

Answer: A (most plausible)

Why it's called "HellaSwag":

  • "Hella" = very (slang)
  • Designed to be harder than its predecessor, the SWAG benchmark
  • Uses adversarial generation to create tricky wrong answers

Categories:

  • ActivityNet: Video descriptions (activities)
  • WikiHow: Instructional text (how-to guides)

Adversarial wrong answers: The wrong options are generated to be:

  1. Grammatically plausible
  2. Semantically similar to the context
  3. But factually or logically incorrect

This makes surface-level shortcuts ineffective and requires genuine understanding.

Performance levels:

  • Random guessing: 25%
  • Humans: 95.6%
  • BERT-Large: 47.9%
  • GPT-3: 78.9%
  • GPT-4: 95.3%
  • Claude 3 Opus: 88.0%
  • Claude 3.5 Sonnet: 89.0%
  • OpenAI o1: 94.2%

Why it's hard for models:

  • Requires real-world common sense
  • Can't be solved by pattern matching alone
  • Adversarial wrong answers look plausible
  • Needs understanding of cause and effect

Research:

  • "HellaSwag: Can a Machine Really Finish Your Sentence?" (Zellers et al., 2019)
  • https://arxiv.org/abs/1905.07830
  • Dataset: 70,000 questions from ActivityNet and WikiHow
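
Example usage (a minimal sketch; AnswerWithMyModelAsync is a stand-in for your own reasoning system, not part of this library, and the exact members of BenchmarkResult<T> are documented on its own page):

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HellaSwagBenchmark<double>();

// Run the full benchmark; each problem is passed to the delegate,
// which must return the chosen continuation as a string (e.g., "A").
var result = await benchmark.EvaluateAsync(problem => AnswerWithMyModelAsync(problem));

// BenchmarkResult<T> holds accuracy and detailed metrics;
// see its own documentation page for the specific members.
Console.WriteLine(result);

// Stand-in: replace with a call to your model or reasoning strategy.
static Task<string> AnswerWithMyModelAsync(string problem) => Task.FromResult("A");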

Constructors

HellaSwagBenchmark()

public HellaSwagBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int
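
For example, the read-only metadata properties can be logged before running an evaluation (a minimal sketch; the exact strings returned are not shown on this page):

using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HellaSwagBenchmark<double>();

// All three properties are read-only metadata about the benchmark.
Console.WriteLine(benchmark.BenchmarkName);
Console.WriteLine(benchmark.Description);
Console.WriteLine($"Total problems: {benchmark.TotalProblems}");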

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
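
For example, the sketch below evaluates a 50-problem sample with a 30-minute timeout (CallMyModelAsync is a hypothetical placeholder for your own reasoning system):

using System;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HellaSwagBenchmark<double>();

// Cancel the run automatically if it takes longer than 30 minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30));

var result = await benchmark.EvaluateAsync(
    evaluateFunction: problem => CallMyModelAsync(problem),
    sampleSize: 50,                      // null evaluates every problem
    cancellationToken: cts.Token);

// Placeholder for your reasoning system; replace with a real model call.
static Task<string> CallMyModelAsync(string problem) => Task.FromResult("A");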

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
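
For example (a minimal sketch; the members of BenchmarkProblem are documented on its own page, so ToString() is used here for display):

using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new HellaSwagBenchmark<double>();

// Load the first five problems for inspection.
var problems = await benchmark.LoadProblemsAsync(count: 5);
Console.WriteLine($"Loaded {problems.Count} problems");

foreach (var problem in problems)
{
    // BenchmarkProblem carries the problem text and its correct answer;
    // see that type's documentation for the specific members.
    Console.WriteLine(problem);
}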