
Class CommonsenseQABenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

CommonsenseQA benchmark for evaluating commonsense knowledge and reasoning.

public class CommonsenseQABenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object
CommonsenseQABenchmark<T>

Implements
IBenchmark<T>

Remarks

For Beginners: CommonsenseQA tests everyday knowledge that humans take for granted but AI often struggles with.

What is CommonsenseQA? CommonsenseQA contains multiple-choice questions requiring common sense about everyday situations, objects, and concepts.

Example questions:

Physical world:

Q: Where would you put uncooked food that you want to cook soon?
A) pantry  B) shelf  C) refrigerator  D) kitchen cabinet  E) oven
Answer: C (refrigerator keeps food fresh until cooking)

Social understanding:

Q: What happens when people get tired?
A) they sleep  B) go to movies  C) feel energetic  D) stay awake  E) study
Answer: A (tired people need sleep)

Cause and effect:

Q: What can happen to someone who doesn't get enough sleep?
A) lazy  B) insomnia  C) get tired  D) snore  E) have fun
Answer: C (lack of sleep causes tiredness)

Object properties:

Q: What is likely to be found in a book?
A) pictures  B) words  C) pages  D) cover  E) all of the above
Answer: E (books have all these features)

Spatial reasoning:

Q: Where do you typically find a handle?
A) door  B) briefcase  C) suitcase  D) cup  E) all of the above
Answer: E (all these objects have handles)

Knowledge types:

  • Physical properties (hot, cold, heavy, fragile)
  • Spatial relationships (inside, on top of, next to)
  • Temporal understanding (before, after, during)
  • Causal relationships (causes, prevents, enables)
  • Social norms (polite, rude, appropriate)
  • Functional roles (what things are used for)
  • Typical locations (where things are usually found)

Why it's important:

  • Tests implicit knowledge humans use daily
  • Can't be answered by facts alone
  • Requires understanding of how the world works
  • Foundation for real-world AI applications

Performance levels:

  • Random guessing: 20%
  • Humans (crowd workers): 88.9%
  • Humans (expert): 95.3%
  • BERT: 57.0%
  • RoBERTa: 73.1%
  • GPT-3: 65.2%
  • GPT-4: 82.4%
  • Claude 3 Opus: 81.7%
  • Claude 3.5 Sonnet: 85.9%
  • OpenAI o1: 88.1%

Why LLMs struggle:

  • Lack embodied experience (can't touch/see/hear)
  • No direct interaction with physical world
  • Must infer common sense from text alone
  • Training data may lack obvious implicit knowledge
  • Difficulty distinguishing common from rare situations

How it's created:

  1. Start with concept from ConceptNet (knowledge graph)
  2. Generate question about the concept
  3. Use crowd workers to create wrong but plausible options
  4. Adversarial filtering to ensure quality

ConceptNet integration: Questions are based on ConceptNet relations like:

  • UsedFor: knife UsedFor cutting
  • AtLocation: book AtLocation library
  • Causes: exercise Causes tiredness
  • CapableOf: bird CapableOf flying

Research:

  • "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge" (Talmor et al., 2019)
  • https://arxiv.org/abs/1811.00937
  • Dataset: 12,247 questions with 5 answer choices each
  • Based on ConceptNet knowledge graph

Constructors

CommonsenseQABenchmark()

public CommonsenseQABenchmark()
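
Example

A minimal construction sketch. The values shown in the comments are illustrative only; the exact BenchmarkName and Description strings depend on the implementation.

using System;
using AiDotNet.Reasoning.Benchmarks;

// T is the numeric type used for scoring; double is a common choice.
var benchmark = new CommonsenseQABenchmark<double>();

// Inspect the metadata exposed through IBenchmark<T>.
Console.WriteLine(benchmark.BenchmarkName);   // e.g. "CommonsenseQA"
Console.WriteLine(benchmark.Description);     // what the benchmark measures
Console.WriteLine(benchmark.TotalProblems);   // number of questions available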

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
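
Example

A hedged end-to-end sketch. AskModelAsync below is a hypothetical stand-in for whatever LLM or reasoning pipeline you call; replace it with your own model invocation. The members of BenchmarkResult<T> are not documented here, so consult that type for how to read accuracy and the detailed metrics.

using System;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new CommonsenseQABenchmark<double>();

// The evaluate function receives the problem text (question plus answer choices)
// and must return the chosen answer as a string, e.g. "C".
Func<string, Task<string>> evaluateFunction = async problem =>
{
    string response = await AskModelAsync(problem);
    return response.Trim();
};

// Evaluate on a random sample of 100 problems; pass null to run the full benchmark.
var result = await benchmark.EvaluateAsync(
    evaluateFunction,
    sampleSize: 100,
    cancellationToken: CancellationToken.None);

// The result carries accuracy and detailed metrics (see Returns above).
Console.WriteLine(result);

// Hypothetical model call; always answering "C" is just a placeholder.
static Task<string> AskModelAsync(string problem) => Task.FromResult("C");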

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
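
Example

A short inspection sketch. BenchmarkProblem carries the question and its correct answer, but its exact member names are not shown here, so check that type before relying on them.

using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new CommonsenseQABenchmark<double>();

// Load the first 5 problems; pass null (the default) to load all of them.
var problems = await benchmark.LoadProblemsAsync(count: 5);

foreach (var problem in problems)
{
    // Each entry is a BenchmarkProblem holding the question, its answer
    // choices, and the expected answer; member names depend on that type.
    Console.WriteLine(problem);
}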