Class PIQABenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

PIQA (Physical Interaction Question Answering) benchmark for physical commonsense reasoning.

public class PIQABenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object → PIQABenchmark<T>

Implements
IBenchmark<T>

Remarks

For Beginners: PIQA tests whether AI understands how the physical world works through everyday situations and questions about physical interactions.

What is PIQA? PIQA asks questions about physical commonsense - understanding how objects interact, what happens when you do certain actions, and basic physics of everyday life.

Format:

  • Goal: A task you want to accomplish
  • Solutions: Two possible ways to do it (one correct, one wrong)
  • Question: Which solution actually works?

Example 1:

Goal: To separate egg whites from the yolk
Solution 1: Crack the egg into a bowl, then use a water bottle to suck up the yolk
Solution 2: Crack the egg and use your hands to throw the white away
Correct: Solution 1

Example 2:

Goal: To remove a stripped screw
Solution 1: Place a rubber band over the screw head for better grip
Solution 2: Pour water on the screw to make it easier to turn
Correct: Solution 1

Example 3:

Goal: Keep your garbage can from smelling bad
Solution 1: Spray perfume in the garbage can every day
Solution 2: Put baking soda at the bottom of the can
Correct: Solution 2

Why it's important:

  • Tests real-world physical understanding
  • Can't be solved by language patterns alone
  • Requires knowledge of:
    • Basic physics (gravity, friction, pressure)
    • Material properties (hard, soft, sticky, etc.)
    • Cause and effect in physical world
    • Practical life skills

Performance levels:

  • Random guessing: 50%
  • Humans: 94.9%
  • BERT: 70.9%
  • RoBERTa: 79.4%
  • GPT-3: 81.0%
  • GPT-4: 86.8%
  • Claude 3 Opus: 85.2%
  • Claude 3.5 Sonnet: 88.0%
  • ChatGPT o1: 91.5%

Categories:

  • Kitchen tasks
  • Home repair
  • Cleaning
  • Arts and crafts
  • General household

Why it's hard for AI:

  • LLMs lack embodied experience
  • Can't actually touch or manipulate objects
  • Must infer from text descriptions
  • Requires implicit physical knowledge

Research:

  • "PIQA: Reasoning about Physical Commonsense in Natural Language" (Bisk et al., 2020)
  • https://arxiv.org/abs/1911.11641
  • Dataset: 16,000 questions from WikiHow and other sources
  • Part of physical reasoning evaluation suite

Constructors

PIQABenchmark()

Initializes a new instance of the PIQABenchmark<T> class.

public PIQABenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
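A minimal sketch of a typical call is shown below. Here `AskModelAsync` is a hypothetical stand-in for your own model client, and `result.Accuracy` assumes `BenchmarkResult<T>` exposes an `Accuracy` property; check the actual type for its real members.

```csharp
using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

// Create the benchmark with double-precision scoring.
var benchmark = new PIQABenchmark<double>();

// Your reasoning system: takes a problem string, returns an answer string.
// AskModelAsync is a hypothetical placeholder for your own LLM client.
Func<string, Task<string>> solver = async problem =>
{
    string answer = await AskModelAsync(problem);
    return answer; // e.g., "Solution 1" or "Solution 2"
};

// Evaluate on a random sample of 100 problems.
var result = await benchmark.EvaluateAsync(solver, sampleSize: 100);

// Accuracy is an assumed property name on BenchmarkResult<T>.
Console.WriteLine($"{benchmark.BenchmarkName}: {result.Accuracy:P1}");
```

Passing `sampleSize: null` (the default) evaluates every problem; the cancellation token can be supplied to abort a long-running evaluation.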

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
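For example, you might load a few problems to inspect them before running a full evaluation. The member names used on `BenchmarkProblem` below (`Question`, `CorrectAnswer`) are assumptions for illustration; consult the `BenchmarkProblem` type for its actual shape.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new PIQABenchmark<double>();

// Load the first 5 problems (null would load all of them).
List<BenchmarkProblem> problems = await benchmark.LoadProblemsAsync(count: 5);

foreach (var problem in problems)
{
    // Question and CorrectAnswer are assumed member names.
    Console.WriteLine(problem.Question);
    Console.WriteLine($"Correct: {problem.CorrectAnswer}");
    Console.WriteLine();
}
```

This is useful for spot-checking problem quality or for building a custom evaluation loop instead of using EvaluateAsync.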