Class CommonsenseQABenchmark<T>
- Namespace
- AiDotNet.Reasoning.Benchmarks
- Assembly
- AiDotNet.dll
CommonsenseQA benchmark for evaluating commonsense knowledge and reasoning.
public class CommonsenseQABenchmark<T> : IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
- Inheritance
- CommonsenseQABenchmark<T>
- Implements
- IBenchmark<T>
Remarks
For Beginners: CommonsenseQA tests everyday knowledge that humans take for granted but AI often struggles with.
What is CommonsenseQA? CommonsenseQA contains multiple-choice questions requiring common sense about everyday situations, objects, and concepts.
Example questions:
Physical world:
Q: Where would you put uncooked food that you want to cook soon?
A) pantry B) shelf C) refrigerator D) kitchen cabinet E) oven
Answer: C (refrigerator keeps food fresh until cooking)
Social understanding:
Q: What happens when people get tired?
A) they sleep B) go to movies C) feel energetic D) stay awake E) study
Answer: A (tired people need sleep)
Cause and effect:
Q: What can happen to someone who doesn't get enough sleep?
A) lazy B) insomnia C) get tired D) snore E) have fun
Answer: C (lack of sleep causes tiredness)
Object properties:
Q: What is likely to be found in a book?
A) pictures B) words C) pages D) cover E) all of the above
Answer: E (books have all these features)
Spatial reasoning:
Q: Where do you typically find a handle?
A) door B) briefcase C) suitcase D) cup E) all of the above
Answer: E (all these objects have handles)
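To connect these examples to the evaluation API below, here is an illustrative sketch of how one of the questions above might be passed to an evaluation function as a single prompt string, and how a model reply might be reduced to an answer letter. The exact prompt wording the benchmark uses is not shown on this page, so treat the format as an assumption.
// Illustrative only: a plausible single-string form of a CommonsenseQA question.
string prompt =
    "Where would you put uncooked food that you want to cook soon?\n" +
    "A) pantry  B) shelf  C) refrigerator  D) kitchen cabinet  E) oven\n" +
    "Answer with a single letter.";

// A hypothetical model reply, reduced to its answer letter for comparison.
string reply = "C) refrigerator";
char answerLetter = char.ToUpperInvariant(reply.Trim()[0]); // 'C'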
Knowledge types:
- Physical properties (hot, cold, heavy, fragile)
- Spatial relationships (inside, on top of, next to)
- Temporal understanding (before, after, during)
- Causal relationships (causes, prevents, enables)
- Social norms (polite, rude, appropriate)
- Functional roles (what things are used for)
- Typical locations (where things are usually found)
Why it's important:
- Tests implicit knowledge humans use daily
- Can't be answered by facts alone
- Requires understanding of how the world works
- Foundation for real-world AI applications
Performance levels:
- Random guessing: 20%
- Humans (crowd workers): 88.9%
- Humans (expert): 95.3%
- BERT: 57.0%
- RoBERTa: 73.1%
- GPT-3: 65.2%
- GPT-4: 82.4%
- Claude 3 Opus: 81.7%
- Claude 3.5 Sonnet: 85.9%
- OpenAI o1: 88.1%
Why LLMs struggle:
- Lack embodied experience (can't touch/see/hear)
- No direct interaction with physical world
- Must infer common sense from text alone
- Training data may lack obvious implicit knowledge
- Difficulty distinguishing common from rare situations
How it's created:
- Start with a concept from ConceptNet (a knowledge graph)
- Generate a question about the concept
- Use crowd workers to create plausible but wrong answer options
- Adversarial filtering to ensure quality
ConceptNet integration: Questions are based on ConceptNet relations like:
- UsedFor: knife UsedFor cutting
- AtLocation: book AtLocation library
- Causes: exercise Causes tiredness
- CapableOf: bird CapableOf flying
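As an illustration only (the benchmark does not expose such a type), each relation can be pictured as a (subject, relation, object) triple:
// Picture each ConceptNet relation as a (subject, relation, object) triple.
var triples = new[]
{
    new ConceptTriple("knife",    "UsedFor",    "cutting"),
    new ConceptTriple("book",     "AtLocation", "library"),
    new ConceptTriple("exercise", "Causes",     "tiredness"),
    new ConceptTriple("bird",     "CapableOf",  "flying")
};

// Hypothetical helper record, used only to visualize the triple structure.
public record ConceptTriple(string Subject, string Relation, string Object);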
Research:
- "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge" (Talmor et al., 2019)
- https://arxiv.org/abs/1811.00937
- Dataset: 12,247 questions with 5 answer choices each
- Based on ConceptNet knowledge graph
Constructors
CommonsenseQABenchmark()
public CommonsenseQABenchmark()
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
- string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
- string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
- int
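A minimal usage sketch, assuming only what is shown on this page (the parameterless constructor and the three properties); the exact property text is determined by the implementation:
using System;
using AiDotNet.Reasoning.Benchmarks;

// Construct the benchmark; the type argument is the numeric type used for scoring.
var benchmark = new CommonsenseQABenchmark<double>();

Console.WriteLine(benchmark.BenchmarkName);   // the benchmark's name
Console.WriteLine(benchmark.Description);     // what the benchmark measures
Console.WriteLine(benchmark.TotalProblems);   // number of problems available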
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
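A hedged sketch of a typical call. CallMyModelAsync is a hypothetical placeholder for your own reasoning system; it must accept the problem text and return the answer as a string:
using System.Threading;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new CommonsenseQABenchmark<double>();

// Evaluate a random sample of 100 problems; pass sampleSize: null to run all of them.
var result = await benchmark.EvaluateAsync(
    async problem => await CallMyModelAsync(problem),   // your reasoning system
    sampleSize: 100,
    cancellationToken: CancellationToken.None);

// result carries the accuracy and detailed metrics described above.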
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
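A small sketch of loading a subset of problems for inspection; BenchmarkProblem's members are not listed on this page, so only the returned list itself is used here:
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new CommonsenseQABenchmark<double>();

// Load 10 problems; pass null to load every problem.
var problems = await benchmark.LoadProblemsAsync(count: 10);
Console.WriteLine($"Loaded {problems.Count} problems for inspection.");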