
Class ARCAGIBenchmark<T>

Namespace
AiDotNet.Reasoning.Benchmarks
Assembly
AiDotNet.dll

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) benchmark.

public class ARCAGIBenchmark<T> : IBenchmark<T>

Type Parameters

T

The numeric type used for scoring (e.g., double, float).

Inheritance
object → ARCAGIBenchmark<T>

Implements
IBenchmark<T>

Remarks

For Beginners: ARC-AGI is considered one of the hardest AI benchmarks. It tests abstract reasoning and pattern recognition using visual grid puzzles.

What is ARC? Created by François Chollet (creator of Keras), ARC tests whether AI can think abstractly like humans. Each task shows example input/output grids, and the AI must figure out the transformation rule.

Example task:

Training examples:
Input:  [1,1,0]    Output: [2,2,0]
        [1,1,0]            [2,2,0]
        [0,0,0]            [0,0,0]

Input:  [0,1,1]    Output: [0,2,2]
        [0,1,1]            [0,2,2]
        [0,0,0]            [0,0,0]

Test (what's the output?):
Input:  [1,1,1]    Output: ???
        [0,0,0]
        [0,0,0]

Rule: Replace every 1 with 2, so the test output is [2,2,2] / [0,0,0] / [0,0,0]. The snippet below applies this rule in code.
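
Illustrative only: a plain 2D array stands in for an ARC grid here; this representation is not part of the AiDotNet API.

int[,] testInput =
{
    { 1, 1, 1 },
    { 0, 0, 0 },
    { 0, 0, 0 },
};

// Apply the inferred rule: replace every 1 with 2.
int[,] output = (int[,])testInput.Clone();
for (int row = 0; row < output.GetLength(0); row++)
    for (int col = 0; col < output.GetLength(1); col++)
        if (output[row, col] == 1)
            output[row, col] = 2;

// output is now [2,2,2] / [0,0,0] / [0,0,0].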

Why it's hard:

  • Requires understanding abstract concepts
  • Few-shot learning (only 2-3 examples)
  • Novel tasks never seen before
  • Can't be solved by memorization
  • Tests core intelligence, not just pattern matching

Performance levels:

  • Human performance: ~85%
  • GPT-4: ~0-5% (very poor)
  • GPT-4o: ~10%
  • Claude 3.5 Sonnet: ~15-20%
  • OpenAI o1: ~21%
  • Specialized systems: ~20-30%
  • Current SOTA: ~55% (MindsAI, ARC Prize 2024)

Why LLMs struggle:

  1. Pattern recognition ≠ abstract reasoning
  2. Can't generalize from few examples
  3. No spatial/visual reasoning built-in
  4. Trained on language, not logic puzzles

Recent progress:

  • ARC Prize (2024): $1M+ prize pool for progress toward solving ARC
  • Test-time compute scaling helps (o1, o3)
  • Hybrid neuro-symbolic approaches show promise

Research:

  • "On the Measure of Intelligence" (Chollet, 2019) - Original ARC paper
  • "The ARC Prize" (2024) - Competition for AGI progress
  • ARC-AGI is viewed as a benchmark for measuring progress toward AGI

Categories of tasks:

  • Object counting and manipulation
  • Symmetry and patterns
  • Color transformations
  • Spatial reasoning
  • Logical rules

Constructors

ARCAGIBenchmark()

Initializes a new instance of the ARC-AGI benchmark.

public ARCAGIBenchmark()

Properties

BenchmarkName

Gets the name of this benchmark.

public string BenchmarkName { get; }

Property Value

string

Description

Gets a description of what this benchmark measures.

public string Description { get; }

Property Value

string

TotalProblems

Gets the total number of problems in this benchmark.

public int TotalProblems { get; }

Property Value

int
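
Example (illustrative sketch): constructing the benchmark and reading its properties. The actual strings and problem count come from the library at run time; none are hard-coded here.

using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new ARCAGIBenchmark<double>();

// All values below are supplied by the library.
Console.WriteLine(benchmark.BenchmarkName);
Console.WriteLine(benchmark.Description);
Console.WriteLine($"Total problems: {benchmark.TotalProblems}");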

Methods

EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)

Evaluates a reasoning strategy on this benchmark.

public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)

Parameters

evaluateFunction Func<string, Task<string>>

Function that takes a problem and returns an answer.

sampleSize int?

Number of problems to evaluate (null for all).

cancellationToken CancellationToken

Cancellation token.

Returns

Task<BenchmarkResult<T>>

Benchmark results with accuracy and detailed metrics.

Remarks

For Beginners: This runs the benchmark by:

  1. Selecting problems (either all or a random sample)
  2. Asking the reasoning system to solve each one
  3. Comparing answers to the correct solutions
  4. Calculating accuracy and other metrics

The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
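
Example (illustrative sketch): running the benchmark against a placeholder solver. The lambda below stands in for your own reasoning system, and its answer string is a hypothetical format, not one the library prescribes.

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

public static class ArcEvaluationExample
{
    public static async Task Main()
    {
        var benchmark = new ARCAGIBenchmark<double>();

        // Placeholder solver: swap in a call to your reasoning system or LLM.
        // The returned answer string here is hypothetical.
        Func<string, Task<string>> solve =
            problem => Task.FromResult("[[0,0,0],[0,0,0],[0,0,0]]");

        // Evaluate a random sample of 10 problems; pass null (the default) to run all of them.
        var result = await benchmark.EvaluateAsync(solve, sampleSize: 10);

        // result contains accuracy and detailed metrics (see BenchmarkResult<T>).
    }
}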

LoadProblemsAsync(int?)

Loads benchmark problems (for inspection or custom evaluation).

public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)

Parameters

count int?

Number of problems to load (null for all).

Returns

Task<List<BenchmarkProblem>>

List of benchmark problems.

Remarks

For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
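
Example (illustrative sketch): loading a few problems to inspect them before running a full evaluation. The member names on BenchmarkProblem are not documented here, so the loop only prints each problem object.

using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

public static class ArcInspectionExample
{
    public static async Task Main()
    {
        var benchmark = new ARCAGIBenchmark<double>();

        // Load the first 5 problems; pass null (the default) to load all of them.
        var problems = await benchmark.LoadProblemsAsync(count: 5);

        foreach (var problem in problems)
        {
            // Each BenchmarkProblem carries the problem and its correct answer;
            // inspect its members in your IDE or debugger.
            Console.WriteLine(problem);
        }
    }
}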