Class LogiQABenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

LogiQA benchmark for evaluating logical reasoning abilities.

public class LogiQABenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object
LogiQABenchmark<T>
Implements
IBenchmark<T>

Remarks

For Beginners: LogiQA tests whether AI can solve logic puzzles similar to those on standardized tests like the LSAT or GRE.

What is LogiQA? LogiQA contains logical reasoning questions taken from Chinese civil service exams and translated into English. These questions test formal logic, deductive reasoning, and analytical skills.

Question types:

1. Categorical reasoning:

All programmers are logical.
Some logical people are creative.
John is a programmer.

Which must be true?
A) John is creative
B) John is logical
C) All creative people are programmers
D) Some programmers are creative
Answer: B
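Why B is the only option that must be true can be seen by building one concrete world in which every premise holds and checking each option against it. The following is a minimal C# sketch, independent of AiDotNet; the extra person "Mary" and all variable names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// One possible world in which all three premises hold. "Mary" is an invented extra person.
var programmers = new HashSet<string> { "John" };
var logical = new HashSet<string> { "John", "Mary" };
var creative = new HashSet<string> { "Mary" };

// Premises from the example.
bool allProgrammersAreLogical = programmers.IsSubsetOf(logical);
bool someLogicalAreCreative = logical.Overlaps(creative);
bool johnIsProgrammer = programmers.Contains("John");
bool premisesHold = allProgrammersAreLogical && someLogicalAreCreative && johnIsProgrammer;

// Answer options evaluated in this world.
bool optionA = creative.Contains("John");          // John is creative
bool optionB = logical.Contains("John");           // John is logical
bool optionC = creative.IsSubsetOf(programmers);   // All creative people are programmers
bool optionD = programmers.Overlaps(creative);     // Some programmers are creative

Console.WriteLine($"Premises hold: {premisesHold}");
Console.WriteLine($"A: {optionA}, B: {optionB}, C: {optionC}, D: {optionD}");
```

In this world the premises hold while A, C, and D are false, so none of them is forced by the premises. B holds here and in every other world, because John is a programmer and all programmers are logical.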

2. Conditional reasoning:

If it rains, the picnic will be cancelled.
The picnic was not cancelled.

What can we conclude?
A) It rained
B) It did not rain
C) The picnic happened
D) Cannot determine
Answer: B (contrapositive: not cancelled → did not rain)
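The contrapositive step can be checked mechanically with a truth table. The following is a minimal C# sketch, independent of AiDotNet, that enumerates every combination of "it rains" and "the picnic is cancelled" and confirms that "it did not rain" holds whenever both premises hold. The helper name `Implies` is made up for illustration.

```csharp
using System;

// Material implication: "if a then b" is false only when a is true and b is false.
static bool Implies(bool a, bool b) => !a || b;

bool conclusionAlwaysHolds = true;

foreach (bool rains in new[] { false, true })
{
    foreach (bool cancelled in new[] { false, true })
    {
        // Premise 1: if it rains, the picnic is cancelled.
        // Premise 2: the picnic was not cancelled.
        bool premisesHold = Implies(rains, cancelled) && !cancelled;

        // The conclusion "it did not rain" must hold in every row where the premises hold.
        if (premisesHold && rains)
        {
            conclusionAlwaysHolds = false;
        }
    }
}

Console.WriteLine($"Premises entail 'it did not rain': {conclusionAlwaysHolds}");
// Prints: Premises entail 'it did not rain': True
```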

3. Assumption identification:

Premise: Companies that provide good customer service are successful.
Conclusion: Therefore, our company should invest in customer service.

What assumption is made?
A) Our company wants to be successful
B) Customer service is expensive
C) Other companies have good service
D) Success is important
Answer: A

4. Weaken/Strengthen arguments:

Argument: "This medicine works because 80% of patients improved."

Which weakens this argument?
A) The medicine is expensive
B) 80% is a high percentage
C) 85% of patients improve without medicine
D) The medicine has side effects
Answer: C (natural improvement rate is higher)
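The arithmetic behind answer C is a comparison against the untreated baseline. A minimal C# sketch using only the two figures from the example (the variable names are illustrative):

```csharp
using System;

// Figures from the example, in percentage points.
int improvedWithMedicine = 80;
int improvedWithoutMedicine = 85; // from option C

// A treatment effect is measured against the untreated baseline, not against zero.
int differenceFromBaseline = improvedWithMedicine - improvedWithoutMedicine;

Console.WriteLine($"Difference from baseline: {differenceFromBaseline} percentage points");
// Prints: Difference from baseline: -5 percentage points
```

A negative difference means patients taking the medicine improved less often than patients who took nothing, which undercuts the claim that the medicine works.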

5. Paradox resolution:

Paradox: "Sales of umbrellas decreased, but rainfall increased."

Which explains this?
A) Umbrellas became more expensive
B) People stayed indoors more during rain
C) Rainfall occurred at night when stores are closed
D) Other rain gear became popular
Answer: B, C, or D could each explain it, depending on context (real LogiQA questions have exactly one correct option)

Logic types tested:

  • Deductive reasoning (must be true)
  • Inductive reasoning (likely to be true)
  • Abductive reasoning (best explanation)
  • Formal logic (syllogisms, conditionals)
  • Critical thinking (assumptions, flaws)

Performance levels:

  • Random guessing: 25%
  • Humans (average): ~65%
  • Humans (trained in logic): ~85%
  • BERT: 34.2%
  • RoBERTa: 37.1%
  • GPT-3: 29.8%
  • GPT-4: 43.5%
  • Claude 3 Opus: 44.2%
  • Claude 3.5 Sonnet: 48.0%
  • OpenAI o1: 61.2% (significant improvement with chain-of-thought reasoning)

Why it's hard:

  • Requires formal logical reasoning
  • Can't rely on pattern matching
  • Need to track complex relationships
  • Must avoid logical fallacies
  • Tests rigorous thinking, not just knowledge

Research:

  • "LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning" (Liu et al., 2020)
  • https://arxiv.org/abs/2007.08124
  • Dataset: 8,678 questions from Chinese exams
  • Tests multiple types of logical reasoning

Constructors

LogiQABenchmark()

Initializes a new instance of the LogiQABenchmark<T> class.

public LogiQABenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Token used to cancel the evaluation.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
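The four steps can be sketched as a small stand-alone loop. This is not the library's implementation: the local function `RunAsync`, the tuple-based problem list, the fixed random seed, and the exact-match comparison are all assumptions made for illustration, and the real method returns a BenchmarkResult<T> rather than a bare number.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in data: (question text, correct option letter).
var problems = new List<(string Question, string Answer)>
{
    ("If it rains, the picnic is cancelled. It was not cancelled. A) It rained B) It did not rain", "B"),
    ("All programmers are logical. John is a programmer. A) John is creative B) John is logical", "B"),
    ("80% of patients improved. Which weakens this? A) It is expensive C) 85% improve without it", "C"),
};

// Hypothetical reasoning system that always answers "B".
Func<string, Task<string>> alwaysB = _ => Task.FromResult("B");

static async Task<double> RunAsync(
    IReadOnlyList<(string Question, string Answer)> all,
    Func<string, Task<string>> evaluateFunction,
    int? sampleSize = null,
    CancellationToken cancellationToken = default)
{
    // 1. Select problems: everything, or a random sample of the requested size.
    //    The seed is fixed only to make this sketch repeatable.
    var random = new Random(42);
    var selected = sampleSize.HasValue
        ? all.OrderBy(_ => random.Next()).Take(sampleSize.Value).ToList()
        : all.ToList();

    int correct = 0;
    foreach (var (question, answer) in selected)
    {
        cancellationToken.ThrowIfCancellationRequested();

        // 2. Ask the reasoning system to solve the problem.
        string response = await evaluateFunction(question);

        // 3. Compare the response with the correct answer (exact match, ignoring case).
        if (string.Equals(response.Trim(), answer, StringComparison.OrdinalIgnoreCase))
        {
            correct++;
        }
    }

    // 4. Calculate accuracy as the fraction of correct answers.
    return selected.Count == 0 ? 0.0 : (double)correct / selected.Count;
}

double accuracy = await RunAsync(problems, alwaysB);
Console.WriteLine($"Accuracy: {accuracy:F3}");
```

Two of the three stand-in answers are "B", so the always-"B" system scores 2/3 when all problems are evaluated.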

The evaluateFunction is your reasoning system: it takes a problem string and returns an answer string.
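A delegate of the required shape might look like the sketch below. `AskModelAsync` and `ExtractOptionLetter` are hypothetical helpers, not AiDotNet APIs. This page does not say how the benchmark parses the returned string, so reducing the model output to a single option letter is an assumption, chosen because it is the least ambiguous answer format.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical model call; replace with a real client. Here it returns canned text.
static Task<string> AskModelAsync(string prompt) =>
    Task.FromResult("The contrapositive applies, so the answer is B.");

// Returns the last standalone option letter (A-D) found in free-form output,
// or an empty string if there is none. Caveat: the English article "A" also matches.
static string ExtractOptionLetter(string modelOutput)
{
    var matches = Regex.Matches(modelOutput, @"\b([A-D])\b");
    return matches.Count > 0 ? matches[matches.Count - 1].Groups[1].Value : string.Empty;
}

// Matches the documented parameter type Func<string, Task<string>>.
Func<string, Task<string>> evaluateFunction = async problem =>
{
    string raw = await AskModelAsync(problem);
    return ExtractOptionLetter(raw);
};

string answer = await evaluateFunction("If it rains, the picnic will be cancelled. The picnic was not cancelled.");
Console.WriteLine(answer);
// Prints: B

// With AiDotNet referenced, the delegate would be passed using the signature documented above:
// var result = await new LogiQABenchmark<double>().EvaluateAsync(evaluateFunction, sampleSize: 20);
```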

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.