Class ARCAGIBenchmark<T>
- Namespace
- AiDotNet.Reasoning.Benchmarks
- Assembly
- AiDotNet.dll
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) benchmark.
public class ARCAGIBenchmark<T> : IBenchmark<T>
Type Parameters
T: The numeric type used for scoring (e.g., double, float).
- Inheritance
- object, ARCAGIBenchmark<T>
- Implements
- IBenchmark<T>
Remarks
For Beginners: ARC-AGI is considered one of the hardest AI benchmarks. It tests abstract reasoning and pattern recognition using visual grid puzzles.
What is ARC? Created by François Chollet (the creator of Keras), ARC tests whether an AI can think abstractly the way humans do. Each task shows a few example input/output grid pairs, and the AI must infer the transformation rule and apply it to a new test input.
Example task:
Training examples:
  Input:  [1,1,0]    Output: [2,2,0]
          [1,1,0]            [2,2,0]
          [0,0,0]            [0,0,0]
  Input:  [0,1,1]    Output: [0,2,2]
          [0,1,1]            [0,2,2]
          [0,0,0]            [0,0,0]
Test (what's the output?):
  Input:  [1,1,1]    Output: ???
          [0,0,0]
          [0,0,0]
Rule: Replace all 1s with 2s
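As a plain illustration (not part of this library's API), the rule from this example takes only a few lines of C#; representing the grid as int[,] is an assumption made for the sketch:
// Sketch only: applies the example rule "replace all 1s with 2s".
// The int[,] grid representation is an assumption, not this library's format.
static int[,] ApplyRule(int[,] input)
{
    int rows = input.GetLength(0), cols = input.GetLength(1);
    var output = new int[rows, cols];
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            output[r, c] = input[r, c] == 1 ? 2 : input[r, c];
    return output;
}
The hard part of ARC is, of course, that a solver must discover a rule like this from the training pairs alone; the code above only shows what one discovered rule looks like.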
Why it's hard:
- Requires understanding abstract concepts
- Few-shot learning (only 2-3 examples)
- Novel tasks never seen before
- Can't be solved by memorization
- Tests core intelligence, not just pattern matching
Performance levels:
- Human performance: ~85%
- GPT-4: ~0-5% (very poor)
- GPT-4o: ~10%
- Claude 3.5 Sonnet: ~15-20%
- OpenAI o1: ~21%
- Specialized systems: ~20-30%
- Current SOTA: ~55% (MindsAI ARC Prize 2024)
Why LLMs struggle:
- Pattern recognition ≠ abstract reasoning
- Can't generalize from few examples
- No spatial/visual reasoning built-in
- Trained on language, not logic puzzles
Recent progress:
- ARC Prize (2024): $1M prize for solving ARC
- Test-time compute scaling helps (o1, o3)
- Hybrid neuro-symbolic approaches show promise
Research:
- "On the Measure of Intelligence" (Chollet, 2019) - Original ARC paper
- "The ARC Prize" (2024) - Competition for AGI progress
- ARC-AGI is viewed as a benchmark for measuring progress toward AGI
Categories of tasks:
- Object counting and manipulation
- Symmetry and patterns
- Color transformations
- Spatial reasoning
- Logical rules
Constructors
ARCAGIBenchmark()
Initializes a new instance of the ARCAGIBenchmark<T> benchmark.
public ARCAGIBenchmark()
Properties
BenchmarkName
Gets the name of this benchmark.
public string BenchmarkName { get; }
Property Value
string
Description
Gets a description of what this benchmark measures.
public string Description { get; }
Property Value
string
TotalProblems
Gets the total number of problems in this benchmark.
public int TotalProblems { get; }
Property Value
int
Methods
EvaluateAsync(Func<string, Task<string>>, int?, CancellationToken)
Evaluates a reasoning strategy on this benchmark.
public Task<BenchmarkResult<T>> EvaluateAsync(Func<string, Task<string>> evaluateFunction, int? sampleSize = null, CancellationToken cancellationToken = default)
Parameters
evaluateFunction (Func<string, Task<string>>): Function that takes a problem and returns an answer.
sampleSize (int?): Number of problems to evaluate (null for all).
cancellationToken (CancellationToken): Cancellation token.
Returns
- Task<BenchmarkResult<T>>
Benchmark results with accuracy and detailed metrics.
Remarks
For Beginners: This runs the benchmark by:
1. Selecting problems (either all or a random sample)
2. Asking the reasoning system to solve each one
3. Comparing answers to the correct solutions
4. Calculating accuracy and other metrics
The evaluateFunction is your reasoning system - it takes a problem string and returns an answer string.
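For example, a minimal usage sketch. Two assumptions are flagged in comments: SolveWithMyModelAsync is a hypothetical stand-in for your reasoning system, and the result is printed via ToString() because BenchmarkResult<T>'s members are not listed on this page.
using System;
using System.Threading.Tasks;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new ARCAGIBenchmark<double>();

// Evaluate a random sample of 10 problems with your reasoning system.
var result = await benchmark.EvaluateAsync(
    problem => SolveWithMyModelAsync(problem),
    sampleSize: 10);

// BenchmarkResult<T>'s exact members aren't shown on this page, so just print it.
Console.WriteLine(result);

// Hypothetical placeholder: replace with a call to your model or LLM.
static Task<string> SolveWithMyModelAsync(string problem) =>
    Task.FromResult("TODO: your answer here");
Passing sampleSize: 10 keeps a trial run cheap; pass null (the default) to evaluate every problem.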
LoadProblemsAsync(int?)
Loads benchmark problems (for inspection or custom evaluation).
public Task<List<BenchmarkProblem>> LoadProblemsAsync(int? count = null)
Parameters
count (int?): Number of problems to load (null for all).
Returns
- Task<List<BenchmarkProblem>>
List of benchmark problems.
Remarks
For Beginners: Returns the actual problems and their correct answers so you can inspect them or run custom evaluations.
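A short inspection sketch under the same caveat: BenchmarkProblem's property names aren't documented on this page, so the loop relies on ToString() rather than guessing member names.
using System;
using AiDotNet.Reasoning.Benchmarks;

var benchmark = new ARCAGIBenchmark<double>();

// Load the first five problems for inspection.
var problems = await benchmark.LoadProblemsAsync(count: 5);
foreach (var problem in problems)
{
    // BenchmarkProblem's members aren't listed here; ToString() avoids guessing.
    Console.WriteLine(problem);
}
Console.WriteLine($"Inspected {problems.Count} of {benchmark.TotalProblems} problems.");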