
Class ARCAGIBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) benchmark.

public class ARCAGIBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object → ARCAGIBenchmark<T>

Implements
IBenchmark<T>

Remarks

For Beginners: ARC-AGI is considered one of the hardest AI benchmarks. It tests abstract reasoning and pattern recognition using visual grid puzzles.

What is ARC? Created by François Chollet (creator of Keras), ARC tests whether AI can think abstractly like humans. Each task shows example input/output grids, and the AI must figure out the transformation rule.

Example task:

Training examples:
Input:  [1,1,0]    Output: [2,2,0]
        [1,1,0]            [2,2,0]
        [0,0,0]            [0,0,0]

Input:  [0,1,1]    Output: [0,2,2]
        [0,1,1]            [0,2,2]
        [0,0,0]            [0,0,0]

Test (what's the output?):
Input:  [1,1,1]    Output: ???
        [0,0,0]
        [0,0,0]

Rule: Replace every 1 with 2, so the test output is [2,2,2] / [0,0,0] / [0,0,0]. The snippet below applies this rule in code.
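
Illustrative only: a plain 2D array stands in for an ARC grid here; this representation is not part of the AiDotNet API.

int[,] testInput =
{
    { 1, 1, 1 },
    { 0, 0, 0 },
    { 0, 0, 0 },
};

// Apply the inferred rule: replace every 1 with 2.
int[,] output = (int[,])testInput.Clone();
for (int row = 0; row < output.GetLength(0); row++)
    for (int col = 0; col < output.GetLength(1); col++)
        if (output[row, col] == 1)
            output[row, col] = 2;

// output is now [2,2,2] / [0,0,0] / [0,0,0].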

Why it's hard:

  • Requires understanding abstract concepts
  • Few-shot learning (only 2-3 examples)
  • Novel tasks never seen before
  • Can't be solved by memorization
  • Tests core intelligence, not just pattern matching

Performance levels:

  • Human performance: ~85%
  • GPT-4: ~0-5% (very poor)
  • GPT-4o: ~10%
  • Claude 3.5 Sonnet: ~15-20%
  • OpenAI o1: ~21%
  • Specialized systems: ~20-30%
  • Current SOTA: ~55% (MindsAI, ARC Prize 2024)

Why LLMs struggle:

  1. Pattern recognition ≠ abstract reasoning
  2. Can't generalize from few examples
  3. No spatial/visual reasoning built-in
  4. Trained on language, not logic puzzles

Recent progress:

  • ARC Prize (2024): $1M+ prize pool for progress toward solving ARC
  • Test-time compute scaling helps (o1, o3)
  • Hybrid neuro-symbolic approaches show promise

Research:

  • "On the Measure of Intelligence" (Chollet, 2019) - Original ARC paper
  • "The ARC Prize" (2024) - Competition for AGI progress
  • ARC-AGI is viewed as a benchmark for measuring progress toward AGI

Categories of tasks:

  • Object counting and manipulation
  • Symmetry and patterns
  • Color transformations
  • Spatial reasoning
  • Logical rules

Constructors

ARCAGIBenchmark()

Initializes a new instance of the ARC-AGI benchmark.

public ARCAGIBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int
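
Example (illustrative sketch): constructing the benchmark and reading its properties. The actual strings and problem count come from the library at run time; none are hard-coded here.

using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new ARCAGIBenchmark<double>();

// All values below are supplied by the library.
Console.WriteLine(benchmark.BenchmarkName);
Console.WriteLine(benchmark.Description);
Console.WriteLine($"Total problems: {benchmark.TotalProblems}");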

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
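
Example (illustrative sketch): running the benchmark against a placeholder solver. The lambda below stands in for your own reasoning system, and its answer string is a hypothetical format, not one the library prescribes.

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

public static class ArcEvaluationExample
{
    public static async Task Main()
    {
        var benchmark = new ARCAGIBenchmark<double>();

        // Placeholder solver: swap in a call to your reasoning system or LLM.
        // The returned answer string here is hypothetical.
        Func<string, Task<string>> solve =
            problem => Task.FromResult("[[0,0,0],[0,0,0],[0,0,0]]");

        // Evaluate a random sample of 10 problems; pass null (the default) to run all of them.
        var result = await benchmark.EvaluateAsync(solve, sampleSize: 10);

        // result contains accuracy and detailed metrics (see BenchmarkResult<T>).
    }
}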

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
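
Example (illustrative sketch): loading a few problems to inspect them before running a full evaluation. The member names on BenchmarkProblem are not documented here, so the loop only prints each problem object.

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

public static class ArcInspectionExample
{
    public static async Task Main()
    {
        var benchmark = new ARCAGIBenchmark<double>();

        // Load the first 5 problems; pass null (the default) to load all of them.
        var problems = await benchmark.LoadProblemsAsync(count: 5);

        foreach (var problem in problems)
        {
            // Each BenchmarkProblem carries the problem and its correct answer;
            // inspect its members in your IDE or debugger.
            Console.WriteLine(problem);
        }
    }
}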