Class CommonsenseQABenchmark<T>
- Namespace
- AiDotNet.Reasoning.Benchmarks
- Assembly
- AiDotNet.dll
CommonsenseQA benchmark for evaluating commonsense knowledge and reasoning.
public class CommonsenseQABenchmark<T> : IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
- Inheritance
- CommonsenseQABenchmark<T>
- Implements
- IBenchmark<T>
Remarks
For Beginners: CommonsenseQA tests everyday knowledge that humans take for granted but AI often struggles with.
What is CommonsenseQA? CommonsenseQA contains multiple-choice questions requiring common sense about everyday situations, objects, and concepts.
Example questions:
Physical world:
Q: Where would you put uncooked food that you want to cook soon?
A) pantry B) shelf C) refrigerator D) kitchen cabinet E) oven
Answer: C (refrigerator keeps food fresh until cooking)
Social understanding:
Q: What happens when people get tired?
A) they sleep B) go to movies C) feel energetic D) stay awake E) study
Answer: A (tired people need sleep)
Cause and effect:
Q: What can happen to someone who doesn't get enough sleep?
A) lazy B) insomnia C) get tired D) snore E) have fun
Answer: C (lack of sleep causes tiredness)
Object properties:
Q: What is likely to be found in a book?
A) pictures B) words C) pages D) cover E) all of the above
Answer: E (books have all these features)
Spatial reasoning:
Q: Where do you typically find a handle?
A) door B) briefcase C) suitcase D) cup E) all of the above
Answer: E (all these objects have handles)
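To connect these examples to the evaluation API below, here is an illustrative sketch of how one of the questions above might be passed to an evaluation function as a single prompt string, and how a model reply might be reduced to an answer letter. The exact prompt wording the benchmark uses is not shown on this page, so treat the format as an assumption.
// Illustrative only: a plausible single-string form of a CommonsenseQA question.
string prompt =
    "Where would you put uncooked food that you want to cook soon?\n" +
    "A) pantry  B) shelf  C) refrigerator  D) kitchen cabinet  E) oven\n" +
    "Answer with a single letter.";

// A hypothetical model reply, reduced to its answer letter for comparison.
string reply = "C) refrigerator";
char answerLetter = char.ToUpperInvariant(reply.Trim()[0]); // 'C'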
Knowledge types:
- Physical properties (hot, cold, heavy, fragile)
- Spatial relationships (inside, on top of, next to)
- Temporal understanding (before, after, during)
- Causal relationships (causes, prevents, enables)
- Social norms (polite, rude, appropriate)
- Functional roles (what things are used for)
- Typical locations (where things are usually found)
Why it's important:
- Tests implicit knowledge humans use daily
- Can't be answered by facts alone
- Requires understanding of how the world works
- Foundation for real-world AI applications
Performance levels:
- Random guessing: 20%
- Humans (crowd workers): 88.9%
- Humans (expert): 95.3%
- BERT: 57.0%
- RoBERTa: 73.1%
- GPT-3: 65.2%
- GPT-4: 82.4%
- Claude 3 Opus: 81.7%
- Claude 3.5 Sonnet: 85.9%
- OpenAI o1: 88.1%
Why LLMs struggle:
- Lack embodied experience (can't touch/see/hear)
- No direct interaction with physical world
- Must infer common sense from text alone
- Training data may lack obvious implicit knowledge
- Difficulty distinguishing common from rare situations
How it's created:
- Start with a concept from ConceptNet (a knowledge graph)
- Generate a question about the concept
- Use crowd workers to create plausible but wrong answer options
- Adversarial filtering to ensure quality
ConceptNet integration: Questions are based on ConceptNet relations like:
- UsedFor: knife UsedFor cutting
- AtLocation: book AtLocation library
- Causes: exercise Causes tiredness
- CapableOf: bird CapableOf flying
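As an illustration only (the benchmark does not expose such a type), each relation can be pictured as a (subject, relation, object) triple:
// Picture each ConceptNet relation as a (subject, relation, object) triple.
var triples = new[]
{
    new ConceptTriple("knife",    "UsedFor",    "cutting"),
    new ConceptTriple("book",     "AtLocation", "library"),
    new ConceptTriple("exercise", "Causes",     "tiredness"),
    new ConceptTriple("bird",     "CapableOf",  "flying")
};

// Hypothetical helper record, used only to visualize the triple structure.
public record ConceptTriple(string Subject, string Relation, string Object);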
Research:
- "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge" (Talmor et al., 2019)
- https://arxiv.org/abs/1811.00937
- Dataset: 12,247 questions with 5 answer choices each
- Based on ConceptNet knowledge graph
Constructors
CommonsenseQABenchmark()
public CommonsenseQABenchmark()
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
- string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
- string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
- int
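A minimal usage sketch, assuming only what is shown on this page (the parameterless constructor and the three properties); the exact property text is determined by the implementation:
using System;
using AiDotNet.Reasoning.Benchmarks;

// Construct the benchmark; the type argument is the numeric type used for scoring.
var benchmark = new CommonsenseQABenchmark<double>();

Console.WriteLine(benchmark.BenchmarkName);   // the benchmark's name
Console.WriteLine(benchmark.Description);     // what the benchmark measures
Console.WriteLine(benchmark.TotalProblems);   // number of problems available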
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
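A hedged sketch of a typical call. CallMyModelAsync is a hypothetical placeholder for your own reasoning system; it must accept the problem text and return the answer as a string:
using System.Threading;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new CommonsenseQABenchmark<double>();

// Evaluate a random sample of 100 problems; pass sampleSize: null to run all of them.
var result = await benchmark.EvaluateAsync(
    async problem => await CallMyModelAsync(problem),   // your reasoning system
    sampleSize: 100,
    cancellationToken: CancellationToken.None);

// result carries the accuracy and detailed metrics described above.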
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
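A small sketch of loading a subset of problems for inspection; BenchmarkProblem's members are not listed on this page, so only the returned list itself is used here:
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new CommonsenseQABenchmark<double>();

// Load 10 problems; pass null to load every problem.
var problems = await benchmark.LoadProblemsAsync(count: 10);
Console.WriteLine($"Loaded {problems.Count} problems for inspection.");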