Navigating ARC-AGI: From Zero to One
An interactive guide to the theory, implementation, and state-of-the-art strategies for the ARC-AGI Challenge.
"Easy for Humans, Hard for AI": the defining characteristic that makes ARC-AGI a powerful benchmark for general intelligence.
A Note from a Fellow Beginner
Hello! Like many others, I was fascinated by the ARC-AGI challenge but found the learning curve a bit steep. I created this guide as a personal project to connect the dots for myself.
My goal is simple: to provide a single, clear starting point for newcomers by synthesizing the core concepts into one easy-to-follow narrative. If you're just starting your ARC journey, I hope this resource helps. Happy Journey!
What changed (updated mid-2026)
I first wrote this during the 2024 Kaggle season. A lot has happened since, so I went back through the whole thing. The short version:
- ARC-AGI-1 has basically fallen: frontier models now sit around 90%+ on it, so it's more of a warm-up than the real test.
- ARC-AGI-2 became the actual fight. A full competition (ARC Prize 2025) ran on it, and the gap between "a model with unlimited compute" and "a solution that fits inside Kaggle's limits" turned out to be the whole story.
- There's now an ARC-AGI-3, and it's a completely different beast: agents dropped into little games with no instructions. Every frontier model scores under 1%.
- The research mood shifted from "scale it up" to "refine it": small recursive models and feedback loops did surprisingly well.
I've updated the numbers, added the 2025 winners, and written up the newer ideas. If something here looks out of date by the time you read it, well, that's ARC for you. It moves fast.
What is ARC-AGI?
ARC-AGI is a benchmark for fluid intelligence: how well a system handles a problem it has never seen before. Most benchmarks reward what a model already knows; this one rewards how fast it can pick up something new.
Novel Puzzles
Each task is unique and designed to resist memorization.
Skill Acquisition
Tests the efficiency of learning new skills, not just performance.
Core Knowledge
Uses only universal cognitive primitives for fair comparison.
Part I: Foundations
The Philosophy and Design of ARC-AGI
Straight from the ARC-AGI training set. Same hidden rule in all three pairs — can you predict the last grid?
The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) comes with a strong opinion about what intelligence actually is. François Chollet built it because he thought the field was measuring the wrong thing.
Intelligence is how fast you learn, not what you already know
The core idea is that intelligence isn't demonstrated by the possession of a specific skill, but by the efficiency of acquiring new skills when faced with a novel problem. That's almost the opposite of what most AI benchmarks reward, which is performance on tasks you can master by training on enough data.
When that's the game, a high score can just be bought with more data and compute, and it tells you little about whether the system can actually adapt. An AI that plays superhuman Go has mastered Go; it hasn't necessarily gotten any smarter in general.
Chollet's definition is precise: "The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty." ARC-AGI is that definition turned into a test: every task is one-of-a-kind, so memorizing or pattern-matching against a training set won't get you there.
To make that measurable, every puzzle follows the same shape, the ARC Task Format, which hands you a few examples to work the rule out from.
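To make the task format concrete, here is a minimal sketch of what one task looks like once loaded. The toy task below is hand-written for illustration (it is not from the dataset), and `describe` is a hypothetical helper; the `train`/`test` keys, the `input`/`output` fields, and the 0 to 9 color codes follow the public JSON files.

```python
import json

# A hand-written toy task in the ARC JSON layout (not a real dataset task).
# Each grid is a list of rows; each cell is an integer color code from 0 to 9.
toy_task_json = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]]}
  ]
}
"""

task = json.loads(toy_task_json)


def describe(task):
    """Return (number of demo pairs, number of test inputs, demo input shapes)."""
    shapes = [(len(p["input"]), len(p["input"][0])) for p in task["train"]]
    return len(task["train"]), len(task["test"]), shapes


print(describe(task))
```

The hidden rule in this toy is "mirror each row left to right"; the solver's job is to infer that from the two demo pairs and apply it to the test input.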
Core Knowledge Priors: keeping the comparison fair
For the human-vs-AI comparison to mean anything, the test has to lean on fluid intelligence (the ability to reason, solve novel problems, and adapt to new situations) and not crystallized intelligence, which relies on accumulated knowledge and skills. If a task needed you to know history or English, it would just favor whoever was pre-trained on the right material, and you'd be measuring prior exposure instead of reasoning.
ARC-AGI gets around this by relying on only a small set of Core Knowledge Priors: cognitive building blocks that are either present at birth or picked up very early in life, with minimal explicit instruction.
Key Concept: Core Knowledge Priors
Objectness
The ability to perceive a scene in terms of discrete objects with properties like cohesion (objects move as wholes) and persistence (objects don't randomly appear or disappear).
Basic Topology & Geometry
Intuitive understanding of connectivity, symmetry, inside/outside relationships, and distance.
Elementary Number Sense
Simple counting and basic integer arithmetic.
Goal-Directedness
The notion that actions are taken to achieve goals.
Keeping the required knowledge down to these few universal primitives means a good score has to come from actually reasoning, not from what you happened to know going in. The public training set is put together to walk you through every prior you'll need for the evaluation tasks; think of it as the tutorial level.
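Objectness is usually the first prior a solver has to encode. One common way to approximate it, and this is my assumption rather than anything the benchmark prescribes, is to treat each group of same-colored, 4-connected cells as one object. The `find_objects` helper below is a hypothetical sketch of that idea, with color 0 assumed to be the background.

```python
from collections import deque


def find_objects(grid, background=0):
    """Group same-colored, 4-connected cells into objects."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    objects = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == background or (r, c) in seen:
                continue
            color = grid[r][c]
            queue = deque([(r, c)])
            seen.add((r, c))
            cells = []
            # Breadth-first flood fill over neighbors of the same color.
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen and grid[nr][nc] == color):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            objects.append({"color": color, "cells": sorted(cells)})
    return objects


grid = [
    [1, 1, 0, 2],
    [0, 1, 0, 2],
    [3, 0, 0, 0],
]
objs = find_objects(grid)
print([(o["color"], len(o["cells"])) for o in objs])
```

Whether diagonal neighbors count, and whether multi-colored shapes are one object, changes from task to task, so real DSLs tend to expose those choices as parameters.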
One Benchmark Becomes Three: ARC-AGI-1 → 2 → 3
Here's something that trips up newcomers: "ARC-AGI" isn't one test anymore. There are now three, and each one was born the moment AI got too good at the last. That pattern (build a test, watch it get solved, build a harder one) is the whole rhythm of this field.
ARC-AGI-1 (2019) was the original. It did its job for years, but it had a weak spot: a big chunk of it could be cracked by brute-force program search, which says more about compute than about reasoning. By the end of 2024, OpenAI's o3 cleared the human bar on it, and by 2026 most frontier models score in the 90s. It's effectively solved now: still useful as a warm-up, but no longer the real challenge.
ARC-AGI-2 (March 2025) is the current battleground and what the prize money is chasing. It keeps the same format and philosophy but is built specifically to defeat brute force and to target the things modern reasoning models are bad at. It was also calibrated against real people: a panel of 400+ testers in San Diego, with every kept task solvable by at least two humans in two tries. The average person scores around 60–66%; the panel together gets 100%.
ARC-AGI-3 (preview 2025, full launch March 2026) threw out the format entirely. Instead of static grids, you get dropped into a tiny turn-based game with no rules, no goal, and no instructions, and you have to figure it all out by playing. It's covered in its own section below. The headline: humans breeze through it, and every frontier AI scores under 1%.
New Conceptual Hurdles in ARC-AGI-2
Based on the failures of frontier AI models, ARC-AGI-2 introduces tasks that test for new, more complex reasoning abilities:
- Symbolic Interpretation: Tasks where visual symbols must be interpreted as having semantic meaning beyond their shape, such as a shape representing an action.
- Compositional Reasoning: Tasks that require discovering and applying multiple, interacting rules simultaneously.
- Contextual Rule Application: Tasks where the correct rule to apply depends on the specific context within the grid, moving beyond superficial global patterns.
| Feature | ARC-AGI-1 | ARC-AGI-2 | ARC-AGI-3 |
|---|---|---|---|
| Launched | 2019 | March 2025 | March 2026 |
| Format | Static grid in → out | Static grid in → out (harder) | Interactive turn-based games |
| Tests for | Generalization, basic abstraction | Symbolic, compositional & contextual reasoning + efficiency | Exploration, goal-finding, on-the-fly learning |
| Brute-force-able? | Yes (its weak spot) | No (by design) | N/A (different paradigm) |
| Human baseline | High | ~60–66% individual / 100% panel | Humans solve ~all of it |
| Best AI (2026) | ~90–96% (solved) | ~85% unconstrained / ~24% under contest limits | <1% |
| Status | Retired to "warm-up" | The main event | Wide-open frontier |
Note the two very different numbers for ARC-AGI-2. That gap, between what a model can do with unlimited compute and what fits inside the competition's offline budget, is the single most important thing to understand about where the field is right now. More on that just below.
The ARC-AGI Ecosystem: Datasets and Evaluation
To actually compete, you need to know the practical side: which datasets exist, how scoring works, and where the community lives.
Navigating the ARC Datasets
The data comes in a few separate sets, each with its own job. Using them the right way matters, both for building your solver and for trusting the numbers you get back.
| Dataset Name | Number of Tasks | Purpose | Access | Key Considerations |
|---|---|---|---|---|
| Public Training | ~1,000 | Training algorithms, learning Core Knowledge priors. | Public | Contains easier, "curriculum-style" tasks. Use freely for development. |
| Public Evaluation | 120 | Final local evaluation of an algorithm. | Public | Do not use for iterative development. Treat as a one-shot evaluation. |
| Semi-Private Evaluation | 120 | Powers the public leaderboard on arcprize.org. | Private (Kaggle) | Used to test both open and closed-source models. |
| Private Evaluation | 120 | Official ranking for the Kaggle prize competition. | Private (Kaggle) | The ultimate test of generalization. No internet access allowed. |
Understanding the Rules of the Game
Competition Rules
- Scoring (pass@2): For each test input you get two attempts. If either is a pixel-perfect match, the task counts as solved. (ARC-AGI-3 is scored differently: by how efficiently your agent acts compared to a human.)
- The Kaggle box (this is the important one): prize submissions run offline in a Kaggle notebook, with no internet, fixed hardware, and a hard time limit. In 2025 that was roughly 12 hours on 4× L4 GPUs. No internet means you can't call GPT, Claude, or Gemini in the prize track; your whole solution has to ship inside the notebook.
- Open source: to win money you must release a complete, reproducible solution. The 2026 cycle asks for a true public-domain license (CC0 or MIT-0). This is the rule that turns each year's winner into next year's starting point.
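The pass@2 rule is easy to get subtly wrong, so here is a rough local scorer. It is a sketch under my own assumptions: `task_solved` and `score` are hypothetical names, the data layout is invented for the example, and the official scorer may aggregate per task rather than per test input.

```python
def task_solved(attempts, expected):
    """pass@2 for one test input: solved if either of the first two attempts matches exactly."""
    return any(attempt == expected for attempt in attempts[:2])


def score(predictions, solutions):
    """Fraction of test inputs solved.

    predictions: task_id -> list (one entry per test input) of [attempt_1, attempt_2]
    solutions:   task_id -> list (one entry per test input) of expected grids
    """
    total = 0
    solved = 0
    for task_id, expected_outputs in solutions.items():
        task_attempts = predictions.get(task_id, [])
        for i, expected in enumerate(expected_outputs):
            total += 1
            attempts = task_attempts[i] if i < len(task_attempts) else []
            solved += task_solved(attempts, expected)
    return solved / total if total else 0.0


solutions = {
    "a": [[[1]]],        # task "a": one test input, expected grid [[1]]
    "b": [[[2, 2]]],     # task "b": one test input, expected grid [[2, 2]]
}
predictions = {
    "a": [[[[0]], [[1]]]],       # second attempt is right, so "a" counts
    "b": [[[[2]], [[2, 0]]]],    # wrong shape, then wrong cell: no credit
}
print(score(predictions, solutions))
```

There is no partial credit: a grid that is one cell off, or the wrong size, scores the same as a blank answer.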
The current live competition is ARC Prize 2026, with $2,000,000 in prizes split across three tracks: a $1M track for cracking ARC-AGI-2 (the 85% Grand Prize, still unclaimed), a milestone-based track for building ARC-AGI-3 agents, and a $450K paper track for new ideas. All three want open-source work.
The "two scores" trap (read this before quoting any number)
Every time you see an ARC-AGI score, ask one question: was it measured with unlimited compute, or under the contest budget? They are wildly different.
Public leaderboard (unlimited)
A frontier model allowed to spend freely. On ARC-AGI-2 in 2026, the top systems reach the low-to-mid 80s, but at $10–30+ per task.
Kaggle competition (budgeted)
Offline, fixed hardware, a few cents per task. The 2025 winner landed at 24% for about $0.20/task. This is what the prize is actually judged on.
So when a headline says "AI hits 85% on ARC-AGI-2," it's usually the first kind: the unlimited public leaderboard. The Grand Prize (85% under the budget) is still wide open. A fair warning: a lot of third-party "leaderboard" sites blur these two together, so when in doubt, trust arcprize.org's own verified leaderboard.
Part II: Core Methodologies
The Evolution of ARC-AGI Approaches
Program Synthesis
Infer explicit rules from examples
Neural Networks
Guide search with learned intuition
Test-Time Adaptation
Adapt dynamically to each task
Refinement Loops
Guess, check, fix, repeat
Each step didn't replace the last: the best 2026 systems stack all four.
Program Synthesis and Domain-Specific Languages (DSLs)
Program synthesis is the oldest serious approach to ARC, and it goes straight at the heart of the problem: work out the rule from the examples. The idea (also called inductive programming) is to automatically write a program that satisfies a spec, and here the spec is just the demonstration pairs.
So you're hunting for a program P that turns every training input into its matching output. Find one, assume it captures the rule, and run it on the test input to get your answer.
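In code, "a program P that turns every training input into its matching output" is just an equality check over the demo pairs. The sketch below uses two made-up candidate programs (`mirror_lr` and `transpose`) and a hand-written toy task; none of these names come from a real solver.

```python
def explains_all(program, train_pairs):
    """True if program(input) == output for every demonstration pair."""
    return all(program(p["input"]) == p["output"] for p in train_pairs)


def mirror_lr(grid):
    """Reverse every row (left-right mirror)."""
    return [row[::-1] for row in grid]


def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


train_pairs = [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
]
candidates = {"mirror_lr": mirror_lr, "transpose": transpose}

# Keep only the candidates that reproduce every demo pair exactly.
consistent = [name for name, fn in candidates.items() if explains_all(fn, train_pairs)]
print(consistent)

# Assume the surviving program is the rule and run it on the test input.
test_output = mirror_lr([[3, 0], [0, 0]])
print(test_output)
```

Passing the demos does not prove the program is the intended rule; it only means nothing in the examples contradicts it, which is why more demo pairs make a task easier to pin down.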
This is the inductive route: pin down a general rule first, then execute it. The other route, transduction, skips the program and predicts the output directly from the examples. Program synthesis ran the show in ARC's early years; the 2020 Kaggle winner was built this way.
The Power of DSLs
The hard part is that the space of possible programs is enormous, which is where a Domain-Specific Language (DSL) comes in. A DSL is a small language built for ARC: a hand-picked set of functions, or primitives, for common grid operations. The whole game is balance: expressive enough to actually solve tasks, small enough that searching it stays feasible.
Common DSL Primitives:
rotate_grid
find_objects
mirror_object
count_colors
draw_line
shift_object
DSL Program Example
# Simplified representation of the solver for task 5521c0d9 from Hodel's arc-dsl
import dsl  # assumes dsl.py from the arc-dsl repository is importable

def solve_5521c0d9(I):
    # 1. Extract all non-background objects from the input grid 'I'.
    objs = dsl.objects(I, univalued=True, diagonal=False, without_bg=True)
    # 2. Merge all extracted objects into a single 'foreground' object.
    foreground = dsl.merge(objs)
    # 3. Create a new grid by removing the foreground, leaving only the background.
    empty_grid = dsl.cover(I, foreground)
    # 4. Create a function 'offset_getter' that calculates an upward shift vector
    #    equal to an object's height. This is done by composing three functions:
    #    height -> invert -> toivec (get height, negate it, convert to vector).
    offset_getter = dsl.chain(dsl.toivec, dsl.invert, dsl.height)
    # 5. Create a function 'shifter' that takes an object and moves it.
    #    The 'fork' primitive applies the 'shift' operation, using the object
    #    itself as the first argument and the result of 'offset_getter(object)'
    #    as the second argument.
    shifter = dsl.fork(dsl.shift, dsl.identity, offset_getter)
    # 6. Apply the 'shifter' function to every object in the 'objs' list
    #    and merge the results into a single object of shifted shapes.
    shifted = dsl.mapply(shifter, objs)
    # 7. Paint the final 'shifted' object onto the 'empty_grid'.
    O = dsl.paint(empty_grid, shifted)
    return O
Combined primitives in action
You can read this one top to bottom. There's no black box, just a short program: pull the grid apart into objects (dsl.objects), work out how far to shift each one, move them (dsl.mapply), and paint the result back (dsl.paint). The catch is that a synthesis system has to find this exact seven-step sequence among an enormous number of possibilities.
The Power of Search: From Brute Force to Neurally-Guided
Once you have a DSL, solving a task turns into a search problem: find the sequence of primitives that does the job.
The Combinatorial Explosion
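Even with a constrained DSL of 100 primitives, the number of possible programs grows exponentially with program length: a flat sequence of d primitives can be written in 100^d ways, so depth 3 already gives a million candidates and depth 5 gives ten billion.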
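To put numbers on that growth, here is a small counting sketch. It assumes programs are flat sequences of primitives with repetition allowed and no arguments, which is the simplest possible model; real DSLs with arguments and branching grow even faster. `count_programs` is a hypothetical helper.

```python
def count_programs(num_primitives, max_depth):
    """Number of primitive sequences of length 1..max_depth, repetition allowed."""
    return sum(num_primitives ** d for d in range(1, max_depth + 1))


# A 100-primitive DSL: every extra step multiplies the search space by 100.
for depth in range(1, 6):
    print(depth, count_programs(100, depth))

# A 3-primitive toy DSL searched to depth 3 stays tiny by comparison.
print(count_programs(3, 3))
```

A three-primitive DSL searched to depth 3 has only 39 candidates, which is why brute force is fine for a toy solver and hopeless for a realistic one.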
Beyond Brute Force: Smarter Search
Modern solvers use smarter search to get through that space without checking everything:
Classic & Modern Search
- Monte Carlo Tree Search (MCTS): Balances exploration and exploitation to find promising program paths.
- Adaptive Branching MCTS (AB-MCTS): An advanced variant from Sakana AI that adaptively decides whether to search deeper (refine) or wider (explore).
- Beam Search & Heuristic Search: Methods that use rules of thumb or maintain multiple candidate programs to guide search toward likely solutions.
Neural Guidance
- GridCoder: Uses a Transformer to predict the most likely sequence of DSL primitives, guiding the search probabilistically.
- Execution-Guided Search: A neural network learns a distance metric between grids to evaluate which intermediate step is "closest" to the goal.
- Learning Program Space (LPS): The main GridCoder approach where a model predicts the final program directly.
The big step forward was letting a neural network steer the search, turning it from "try everything in order" into something closer to "learned intuition" about which paths are worth following. It's faster, and a lot closer to how people actually search.
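Here is a rough sketch of how a distance signal can steer a beam search, in the spirit of the execution-guided idea above. Everything in it is an assumption for illustration: the four primitives, the `beam_search` function, and especially `distance`, which is a hand-written cell-mismatch count standing in for the learned metric a real system would train.

```python
import numpy as np

PRIMITIVES = {
    "rot90": lambda g: np.rot90(g),   # counterclockwise quarter turn
    "flip_h": np.fliplr,
    "flip_v": np.flipud,
    "transpose": lambda g: g.T,
}


def distance(a, b):
    """Stand-in for a learned metric: fraction of mismatched cells (1.0 if shapes differ)."""
    if a.shape != b.shape:
        return 1.0
    return float(np.mean(a != b))


def beam_search(pairs, beam_width=3, max_depth=4):
    """Expand only the beam_width partial programs whose outputs look closest to the targets."""
    inputs = [np.array(p["input"]) for p in pairs]
    targets = [np.array(p["output"]) for p in pairs]
    beam = [([], inputs)]  # (program so far, intermediate grids)
    for _ in range(max_depth):
        scored = []
        for program, grids in beam:
            for name, fn in PRIMITIVES.items():
                new_grids = [fn(g) for g in grids]
                cost = sum(distance(g, t) for g, t in zip(new_grids, targets))
                if cost == 0:
                    return program + [name]
                scored.append((cost, program + [name], new_grids))
        scored.sort(key=lambda item: item[0])
        beam = [(prog, grids) for _, prog, grids in scored[:beam_width]]
    return None


def run(program, grid):
    """Apply a list of primitive names to a grid."""
    g = np.array(grid)
    for name in program:
        g = PRIMITIVES[name](g)
    return g.tolist()


def make_pair(grid):
    """Toy task: rotate counterclockwise, then mirror left-right."""
    g = np.array(grid)
    return {"input": g.tolist(), "output": np.fliplr(np.rot90(g)).tolist()}


pairs = [make_pair([[1, 2, 3], [4, 5, 6]]), make_pair([[0, 7], [8, 0], [9, 9]])]
program = beam_search(pairs)
print(program)
```

The beam throws away whole branches, so unlike brute force it can miss a solution when the distance signal is misleading. That trade-off is the price of scaling to a bigger DSL.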
The 2025 twist: evolve the program, and let the searcher teach itself
Two ideas pushed search forward in the last competition. The first is evolutionary search: instead of building a program once, you keep a small population of candidate solutions and have an LLM mutate and recombine the best ones, generation after generation. Jeremy Berman rode this to the top of the boards, and made one counterintuitive discovery along the way. He stopped evolving Python and started evolving plain-English descriptions of the rule. Turns out natural language is easier for an LLM to tweak and refine than code, and it got him to about 29% on ARC-AGI-2 for roughly $8 a task.
The second is self-improvement. The award-winning SOAR method runs evolutionary search, then does something clever with the wreckage: every failed program is secretly a correct program for some other task (whatever it actually computed). Relabel those failures as solved examples, fine-tune the model on them, and search again, a little smarter each round. It climbed to ~52% on ARC-AGI-1 with no hand-written DSL at all. The theme to notice: in both cases, the magic isn't a single brilliant guess, it's the loop.
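The relabeling trick is simple enough to sketch. This is my own illustration of the idea as described above, not code from SOAR: `hindsight_relabel` and the record layout are hypothetical, and the fine-tuning step that would consume these records is left out.

```python
def hindsight_relabel(program_source, program_fn, inputs):
    """Turn a runnable program into a correct example for the task it actually solves."""
    pairs = []
    for grid in inputs:
        try:
            pairs.append({"input": grid, "output": program_fn(grid)})
        except Exception:
            return None  # a program that crashes gives us nothing to learn from
    return {"demo_pairs": pairs, "target_program": program_source}


# Pretend the search proposed this program for some task and it failed there.
failed_source = "def f(grid): return [row[::-1] for row in grid]"


def failed_program(grid):
    return [row[::-1] for row in grid]


inputs = [[[1, 0], [0, 0]], [[0, 2], [0, 0]]]
example = hindsight_relabel(failed_source, failed_program, inputs)
print(example["demo_pairs"][0])
```

The resulting record is a perfectly valid "task plus solution" pair, just for a different task than the one the search was aiming at.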
Test-Time Adaptation: The Modern Paradigm
Program synthesis ruled the early years, but the breakthroughs of 2024 came from Test-Time Adaptation (TTA): letting the model adjust to each task, at inference time, using that task's own demonstration pairs. By 2024 every top solution had some form of it.
TTA splits into two flavors. Test-Time Scaling (TTS) spends more compute at inference but leaves the weights alone; Test-Time Training (TTT) briefly fine-tunes the model on the spot.
Test-Time Scaling (TTS)
TTS just throws more compute at inference, without touching the weights. That can be as simple as sampling lots of answers and keeping the best, or as involved as chain-of-thought reasoning or Sakana AI's Adaptive Branching Monte Carlo Tree Search (AB-MCTS).
Test-Time Training (TTT)
TTT actually nudges the weights at inference time: a few steps of gradient descent on the handful of demo pairs for the task in front of it, after which the update is thrown away. MIT's team pioneered it for ARC, and it became the backbone of several top 2024 solutions.
The Test-Time Training (TTT) Workflow
Get Task
Receive a novel ARC task with a few train/test examples.
Augment Data
Expand the small demo set using symmetries (rotations, flips, color swaps) to create a temporary training set.
Train LoRA
Rapidly fine-tune a small LoRA adapter on the augmented data, leaving the base model frozen for efficiency.
Infer & Ensemble
Predict the output. Often done under multiple augmentations and combined via majority vote for robustness.
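The "Augment Data" step is the easiest part of this workflow to try yourself. Below is a minimal sketch using NumPy: `augment_pair` is a hypothetical helper, and keeping color 0 fixed during recoloring is my assumption (it treats 0 as background), not a rule of the benchmark. The LoRA training step itself is not shown.

```python
import numpy as np


def augment_pair(inp, out, rng, n_color_perms=2):
    """Return transformed copies of one demo pair: 8 grid symmetries, each with extra recolorings."""
    inp, out = np.array(inp), np.array(out)
    augmented = []
    for k in range(4):                  # 0, 90, 180, 270 degree rotations
        for flip in (False, True):      # without and with a left-right flip
            a, b = np.rot90(inp, k), np.rot90(out, k)
            if flip:
                a, b = np.fliplr(a), np.fliplr(b)
            augmented.append((a, b))
            for _ in range(n_color_perms):
                # Shuffle colors 1..9 and keep 0 fixed (assumed background).
                perm = np.concatenate(([0], rng.permutation(np.arange(1, 10))))
                # The same permutation must hit input and output, or the rule breaks.
                augmented.append((perm[a], perm[b]))
    return augmented


rng = np.random.default_rng(0)
pairs = augment_pair([[1, 0], [0, 2]], [[0, 1], [2, 0]], rng)
print(len(pairs))
```

One caveat worth knowing: not every task survives every augmentation. If the rule depends on "up" or on a specific color, rotating or recoloring changes the task, so some pipelines filter augmentations per task.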
The Duality of Induction and Transduction
There's a split in how you can solve an ARC task, made precise in the prize-winning paper by Li et al. It lines up with System 1 and System 2 thinking โ the fast-and-intuitive versus slow-and-deliberate idea from cognitive science. It's worth getting, because the strongest solvers use both at once.
Induction (Program Synthesis)
The System 2, deliberate path. First you work out an explicit program f that explains the training examples, then you run it on the test input: y_test = f(x_test). The DSL search from earlier is inductive.
Excels at: Tasks requiring precision, multi-step logic, compositionality, and explicit computation.
Transduction (Direct Prediction)
The System 1, intuitive path. You predict the test output y_test directly, taking in the training pairs (x_train, y_train) and the test input all at once, without ever writing down a program. The LLM-based TTT methods are mostly transductive.
Excels at: Tasks relying on "fuzzy" perception, pattern completion, and holistic transformations.
The very best ARC solutions are ensembles โ they run both the inductive and the transductive solver, since the two tend to fail on different tasks.
Refinement Loops: the idea that tied 2025 together
If you only remember one thing from the last competition, make it this. When the ARC Prize team looked back at everything that worked in 2025, almost every strong system (tiny models, giant models, program searchers) was doing the same thing underneath. They called it the refinement loop, and summed it up with a line worth sitting with: refinement is intelligence.
The shape is simple, almost embarrassingly so:
Propose
Generate a candidate (a program, a grid, a guess).
Check
Score it against a signal you trust โ for ARC, the demo pairs.
Fix
Keep what worked, repair what didn't, go again.
What makes ARC perfect for this is that it hands you a free, trustworthy checker: a candidate either reproduces the demonstration pairs exactly or it doesn't. That verifiable signal is why looping works so well here and, honestly, why it's hard to copy this trick into messier domains where "correct" is fuzzy.
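The propose/check/fix shape fits in a few lines. The sketch below is deliberately bare: `refine`, `score_against_demos`, and the toy proposer are hypothetical names, and the proposer just walks a fixed list where a real system would have a model read the feedback and produce a repaired candidate.

```python
def score_against_demos(program, train_pairs):
    """Check step: fraction of demo pairs the candidate reproduces exactly."""
    hits = sum(program(p["input"]) == p["output"] for p in train_pairs)
    return hits / len(train_pairs)


def refine(propose, train_pairs, max_rounds=10):
    """Propose, check against the demos, feed the verdict back, repeat."""
    best, best_score, feedback = None, -1.0, None
    for _ in range(max_rounds):
        candidate = propose(feedback)
        current = score_against_demos(candidate, train_pairs)
        if current > best_score:
            best, best_score = candidate, current
        if current == 1.0:
            break  # the free checker says we are done
        feedback = {"last_score": current}
    return best, best_score


def make_toy_proposer(candidates):
    """Stand-in for a model: hands out a fixed list and ignores the feedback."""
    remaining = list(candidates)

    def propose(feedback):
        return remaining.pop(0)

    return propose


def identity(grid):
    return grid


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def mirror_lr(grid):
    return [row[::-1] for row in grid]


train_pairs = [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
]
proposer = make_toy_proposer([identity, transpose, mirror_lr])
best, best_score = refine(proposer, train_pairs, max_rounds=3)
print(best.__name__, best_score)
```

The interesting engineering lives in `propose`: the richer the feedback it receives (which cells were wrong, which pair failed), the more each round can fix.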
The loop shows up at three levels, and the best teams used all of them at once:
- While making training data: generate synthetic puzzles, throw out the ones that don't pass a spec, keep the rest. The 2025 winner built hundreds of thousands of tasks this way.
- While answering: a model that revises its own output step by step (the tiny recursive models below live entirely here), or a reasoning model talking itself out of a wrong answer.
- While picking a final answer: generate many candidates under different rotations/recolorings and vote.
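The last level, voting across augmented views, can be sketched as follows. This is an illustration under assumptions: `vote` is a hypothetical helper, `predict` stands in for a model already adapted to the task in each view, and the flaky toy predictor exists only to show the vote absorbing a few bad answers.

```python
from collections import Counter

import numpy as np


def vote(predict, test_input, top_k=2):
    """Predict under 8 symmetries, undo each transform, return the top_k most common grids."""
    grid = np.array(test_input)
    ballots = Counter()
    for k in range(4):
        for flip in (False, True):
            view = np.rot90(grid, k)
            if flip:
                view = np.fliplr(view)
            pred = np.array(predict(view.tolist()))
            # Undo the transform in reverse order: un-flip first, then rotate back.
            if flip:
                pred = np.fliplr(pred)
            pred = np.rot90(pred, -k)
            ballots[tuple(map(tuple, pred.tolist()))] += 1
    return [[list(row) for row in g] for g, _ in ballots.most_common(top_k)]


def flaky_recolor(grid):
    """Toy predictor: recolors 1 -> 2, but blanks the grid whenever a 1 sits in the top-left."""
    if grid[0][0] == 1:
        return [[0 for _ in row] for row in grid]
    return [[2 if cell == 1 else cell for cell in row] for row in grid]


attempts = vote(flaky_recolor, [[1, 0], [0, 0]])
print(attempts)
```

Returning the top two grids lines up with the two attempts allowed per test input: the winner goes in the first attempt and the runner-up in the second.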
The practical takeaway
Don't ask your model for an answer. Ask it for a process. A modest model wrapped in a good guess-check-fix loop will usually beat a stronger model used in a single shot. There are even off-the-shelf "refinement harnesses" (Poetiq is one) that wrap a frontier API this way; they roughly took Gemini 3 Pro from 31% to 54% on ARC-AGI-2, just by looping.
The Plot Twist of 2025: Tiny Models That Punch Way Above Their Weight
Here's the result nobody saw coming. While everyone assumed you'd need a bigger and bigger model, three of the most talked-about papers of 2025 went the opposite way, and the paper prizes went to all three of them. On ARC, it turns out depth of thinking beats sheer size.
TRM: Tiny Recursive Model
~7M params · 1st paper, $50K
A 2-layer network (about 7 million parameters, less than 0.01% the size of a frontier LLM) that just keeps refining its own answer, around 16 passes deep. It hits ~45% on ARC-AGI-1 and ~8% on ARC-AGI-2, beating models ten-thousand times larger, and runs on a single gaming GPU. It's the cleanest proof that the refinement loop, not the parameter count, is doing the work.
HRM: Hierarchical Reasoning Model
~27M params · cautionary tale
A 27M-param model with two "brain-inspired" recurrent modules that made a big splash. Worth knowing for a second reason, though: when the ARC Prize team dug in, they found most of its ARC performance came from the outer refinement loop and data augmentation, not the fancy hierarchy. A plain transformer of the same size matched it. Lesson for your own work: ablate before you credit the clever part.
CompressARC: ARC without pretraining
~76K params · 3rd paper
The most extreme of the bunch: 76,000 parameters, no pretraining, no dataset, no search. For each puzzle it just runs gradient descent to find the shortest description that explains the demos (the Minimum Description Length idea: good compression is understanding). ~20 minutes a puzzle on one GPU, ~20–34% on ARC-AGI-1. It's a striking argument that you may not need a giant pretrained model at all.
One more thread worth flagging, because it'll matter going forward: a 2025–26 line of work argues we've been framing ARC wrong by feeding grids to language models as text. "ARC Is a Vision Problem!" treats each grid as an image and trains a vanilla vision transformer on it. These are puzzles you look at, after all. Keep an eye on the vision-first crowd.
The Interactive Frontier: ARC-AGI-3
Everything so far has been about static puzzles: here are a few examples, predict the answer. In March 2026 the ARC Prize Foundation changed the game entirely, and if you want to work on something genuinely unsolved, this is where to look.
ARC-AGI-3 drops an agent into a small, turn-based game. No instructions. No stated goal. No rules. Just like being handed a video game you've never seen, you have to poke at it, watch what happens, work out what you're even trying to do, and then do it. It's the first interactive reasoning benchmark.
No instructions, no stated goal. You just act, watch what changes, and work out for yourself that the boxed grid has to be made to match the pattern. Obvious once you see it — yet every frontier model still scores under 1% on games like this.
“If we look at the first three completed grids, each contains a small 3×3 pixel pattern in its center… the positions of the black pixels directly dictate which of the outer 6×6 blocks get colored light blue instead of dark blue. The yellow border acts as a ‘Submit’ button for the quadrant.”
What it's actually testing
Static ARC tests whether you can spot a pattern. Interactive ARC tests the messier, more human skills that come before that:
Exploration
Can you poke at the world efficiently to learn how it works, instead of flailing randomly?
Goal-finding
Can you figure out what you're supposed to be doing when nobody tells you?
Memory
Do you remember what you tried, and reuse it, instead of repeating yourself?
Learning on the fly
Do you get better the longer you play a single game?
And the scoring reflects that shift: you're not graded on a final answer but on how efficiently you act compared to a human, averaged over a set of roughly 135 hand-built games. A perfect 100% means you match human efficiency everywhere.
The scoreboard right now is brutal
At the March 2026 launch, every frontier model scored under 1%: GPT-5.x, Gemini 3, Claude Opus 4.x, Grok 4, all of them, basically zero. Meanwhile humans (1,200+ testers across 3,900+ playthroughs) solved nearly all of it, and a lot of them said it was fun.
That's the original ARC promise (easy for people, hard for machines) back at full strength. The same models that ace ARC-AGI-1 fall apart the moment they have to explore a world instead of being handed the question.
What the early agents did
The preview round was small, and the best agent still only reached about 12.6% efficiency, but the approaches that worked tell you where to start:
- Predict what your actions do. The winning agent (StochasticGoose) trained a small CNN with reinforcement learning to guess which buttons actually change the screen, so it could explore on purpose instead of at random.
- Build a map. The runner-up kept a graph of game states and pruned moves that led nowhere or looped back.
- Plain LLM agents struggled. Dropping a big language model in and asking it to play mostly burned through actions. This is an RL / world-model problem, not a prompt-engineering one.
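To give a feel for the "build a map" idea, here is a toy sketch of an explorer that records what each action did and prunes moves that lead nowhere. It is not the runner-up's code and it does not use the ARC-AGI-3 SDK: `ToyCorridor` is a made-up stand-in game and `explore` is a hypothetical agent loop.

```python
class ToyCorridor:
    """Hypothetical stand-in game (NOT the ARC-AGI-3 SDK): reach the last cell of a corridor."""
    ACTIONS = ("left", "right", "noop")

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def state(self):
        return self.pos

    def step(self, action):
        if action == "left":
            self.pos = max(0, self.pos - 1)
        elif action == "right":
            self.pos = min(self.length - 1, self.pos + 1)
        solved = self.pos == self.length - 1
        return self.pos, solved


def explore(game, max_steps=100):
    """Try every action once per state, then steer toward less-visited states."""
    graph = {}   # state -> {action: next_state} learned so far
    visits = {}  # state -> number of times the agent stood there
    state = game.state()
    for step in range(1, max_steps + 1):
        visits[state] = visits.get(state, 0) + 1
        known = graph.setdefault(state, {})
        untried = [a for a in game.ACTIONS if a not in known]
        if untried:
            action = untried[0]
        else:
            # Prune moves that loop straight back to the same state.
            useful = {a: s for a, s in known.items() if s != state}
            if not useful:
                return None, graph  # stuck: nothing here changes anything
            action = min(useful, key=lambda a: visits.get(useful[a], 0))
        next_state, solved = game.step(action)
        known[action] = next_state
        if solved:
            return step, graph
        state = next_state
    return None, graph


steps_used, learned = explore(ToyCorridor(length=5))
print(steps_used, learned[0])
```

Even this crude agent never repeats a known no-op, which is the behavior plain LLM agents reportedly struggled with. A serious agent would add planning over the graph and a learned guess about which actions matter.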
Why this might be your best bet
ARC-AGI-2 is a crowded, well-mapped engineering race. ARC-AGI-3 is the opposite: almost nothing works yet, the best score is ~12%, and a clean, well-thought-out exploration agent can land near the top of the board today. If you like RL, agents, and open problems, start here. The agent SDK and docs live at docs.arcprize.org.
Part III: Practical Guide
Your First ARC Solver: A Step-by-Step Tutorial
This section provides a hands-on tutorial for building a simple, yet complete, ARC solver in Python. We'll use a minimal DSL and simple brute-force search to demonstrate the core logic of the inductive programming paradigm.
Building Your First ARC Solver
What We'll Build
- A minimal Domain-Specific Language (DSL)
- A brute-force search algorithm
- Grid visualization tools
- A complete end-to-end solver
Learning Objectives
- Understand the program synthesis approach
- Learn how DSLs constrain search space
- Implement search and verification logic
- Create submission-ready output
Note: this tutorial is just the core concepts. Real competitive solvers use much larger DSLs, smarter search, and neural guidance.
1. Setting Up Your Environment
First, let's prepare your development environment with the necessary data and libraries.
# Create the project folder first so the data ends up inside it
mkdir arc_solver
cd arc_solver
# Clone the ARC-AGI repository for the data
git clone https://github.com/fchollet/ARC-AGI.git
# Install necessary libraries
pip install numpy matplotlib
Project Structure
arc_solver/
├── ARC-AGI/       # The cloned repository
├── solver.py      # Our main solver script
└── visualize.py   # A utility for plotting grids
2. Loading and Visualizing Data
Visualizing the tasks is essential for understanding and debugging. Create a file visualize.py:
# In visualize.py
import matplotlib.pyplot as plt
from matplotlib import colors
import numpy as np

# Define the 10 official ARC colors
ARC_COLORMAP = colors.ListedColormap([
    '#000000', '#0074D9', '#FF4136', '#2ECC40', '#FFDC00',
    '#AAAAAA', '#F012BE', '#FF851B', '#7FDBFF', '#870C25'
])

def plot_grid(ax, grid, title=""):
    """Plots a single ARC grid with the official colormap."""
    norm = colors.Normalize(vmin=0, vmax=9)
    ax.imshow(np.array(grid), cmap=ARC_COLORMAP, norm=norm)
    # Minor ticks sit on cell boundaries, so draw the grid lines there only
    ax.set_xticks(np.arange(-0.5, len(grid[0]), 1), minor=True)
    ax.set_yticks(np.arange(-0.5, len(grid), 1), minor=True)
    ax.grid(True, which='minor', color='white', linewidth=0.5)
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_title(title)

def plot_task(task):
    """Plots all training and test pairs for a given ARC task."""
    num_train = len(task['train'])
    num_test = len(task['test'])
    num_total = num_train + num_test
    # squeeze=False keeps axs two-dimensional even with a single column
    fig, axs = plt.subplots(2, num_total, figsize=(3 * num_total, 6), squeeze=False)
    for i, pair in enumerate(task['train']):
        plot_grid(axs[0, i], pair['input'], f"Train {i} Input")
        plot_grid(axs[1, i], pair['output'], f"Train {i} Output")
    for i, pair in enumerate(task['test']):
        plot_grid(axs[0, num_train + i], pair['input'], f"Test {i} Input")
        if 'output' in pair:
            plot_grid(axs[1, num_train + i], pair['output'], f"Test {i} Output")
        else:
            axs[1, num_train + i].axis('off')
            axs[1, num_train + i].set_title(f"Test {i} Output (Predict)")
    plt.tight_layout()
    plt.show()
You can now load and view a task in your main script, solver.py:
# In solver.py
import json
from visualize import plot_task

def load_task(task_path):
    with open(task_path, 'r') as f:
        return json.load(f)

# Example usage
task_file = 'ARC-AGI/data/training/ed36ccf7.json'
task = load_task(task_file)
# plot_task(task) # Uncomment to visualize
3. Building a Simple DSL Solver
Now, let's build the core solver logic.
Step 1: Define a Minimal DSL
Create a few simple functions that operate on grids (represented as NumPy arrays).
# In solver.py
import numpy as np

def dsl_rotate_90(grid):
    # np.rot90 with k=1 rotates counterclockwise
    return np.rot90(grid, 1)

def dsl_flip_horizontal(grid):
    return np.fliplr(grid)

def dsl_flip_vertical(grid):
    return np.flipud(grid)

# Our DSL is a dictionary mapping function names to functions
DSL = {
    'rotate_90': dsl_rotate_90,
    'flip_h': dsl_flip_horizontal,
    'flip_v': dsl_flip_vertical,
}
Step 2 & 3: Implement a Search and Verification Loop
We'll use a simple brute-force search that tries all sequences of DSL functions up to a certain length.
# In solver.py
from itertools import product

def apply_program(grid, program):
    """Applies a sequence of DSL functions to a grid."""
    current_grid = np.array(grid)
    for func_name in program:
        current_grid = DSL[func_name](current_grid)
    return current_grid.tolist()

def find_program(task, max_depth=3):
    """Searches for a program that solves the task."""
    train_pairs = task['train']
    # Generate all possible programs up to max_depth
    for depth in range(1, max_depth + 1):
        for program_tuple in product(DSL.keys(), repeat=depth):
            program = list(program_tuple)
            is_solution = True
            # Verify the program against all training pairs
            for pair in train_pairs:
                input_grid = pair['input']
                expected_output = pair['output']
                predicted_output = apply_program(input_grid, program)
                if predicted_output != expected_output:
                    is_solution = False
                    break
            if is_solution:
                print(f"Found solution program: {program}")
                return program
    print("No solution found.")
    return None
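It's worth counting how many programs that loop actually tries, because this number is what kills brute force later. A minimal sketch: `count_programs` is a helper name I made up, and the 30-primitive DSL is a hypothetical size chosen only to show the growth.

```python
from itertools import product

def count_programs(num_primitives, max_depth):
    """Number of sequences of length 1..max_depth over num_primitives ops."""
    return sum(num_primitives ** d for d in range(1, max_depth + 1))

# The 3-function DSL above, searched to depth 3: 3 + 9 + 27 candidates.
print(count_programs(3, 3))    # 39

# Cross-check against the same kind of enumeration find_program performs.
names = ["rotate_90", "flip_h", "flip_v"]
enumerated = sum(1 for d in range(1, 4) for _ in product(names, repeat=d))
assert enumerated == count_programs(3, 3)

# A hypothetical 30-primitive DSL at depth 5 is already in the tens of millions.
print(count_programs(30, 5))   # 25137930
```

Every candidate also has to be run against every training pair, so the real cost is this count times the number of demos.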
Steps 4 & 5: Apply to Test Input and Format for Submission
Finally, if a program is found, apply it to the test inputs and prepare the submission.json file.
# In solver.py
def solve_task(task):
    """Finds a program and applies it to test inputs."""
    program = find_program(task)
    if program is None:
        return []  # Return empty predictions if no solution found
    predictions = []
    for pair in task['test']:
        test_input = pair['input']
        predicted_output = apply_program(test_input, program)
        predictions.append(predicted_output)
    return predictions

def main():
    task_file = 'ARC-AGI/data/training/ed36ccf7.json'  # A genuine 90-degree rotation task
    task_id = task_file.split('/')[-1].replace('.json', '')
    task = load_task(task_file)
    predictions = solve_task(task)
    # Format for submission (simplified for one task)
    submission = {}
    if predictions:
        # ARC Prize allows two attempts per test input; here we submit the same prediction for both
        submission[task_id] = [{'attempt_1': pred, 'attempt_2': pred} for pred in predictions]
    with open('submission.json', 'w') as f:
        json.dump(submission, f, indent=4)
    print("submission.json created.")

if __name__ == '__main__':
    main()
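A malformed submission.json is a cheap mistake to catch locally. The sketch below checks only the structure used in main() above (task id, then a list of dicts with attempt_1 and attempt_2), plus the fact that ARC grids are rectangular and use colors 0 to 9. `is_valid_grid` and `check_submission` are hypothetical helpers of mine, not an official validator.

```python
def is_valid_grid(grid):
    """True if grid is a non-empty rectangular list of lists of ints in 0..9."""
    if not isinstance(grid, list) or not grid:
        return False
    if not all(isinstance(row, list) and row for row in grid):
        return False
    width = len(grid[0])
    return all(
        len(row) == width
        and all(isinstance(v, int) and not isinstance(v, bool) and 0 <= v <= 9
                for v in row)
        for row in grid
    )

def check_submission(submission):
    """Return a list of readable problems; an empty list means it looks OK."""
    problems = []
    for task_id, entries in submission.items():
        if not isinstance(entries, list):
            problems.append(f"{task_id}: expected a list of per-test-input dicts")
            continue
        for i, entry in enumerate(entries):
            for key in ("attempt_1", "attempt_2"):
                if not isinstance(entry, dict) or key not in entry:
                    problems.append(f"{task_id}[{i}]: missing {key}")
                elif not is_valid_grid(entry[key]):
                    problems.append(f"{task_id}[{i}].{key}: not a valid grid")
    return problems

good = {"toy": [{"attempt_1": [[1, 2], [3, 4]], "attempt_2": [[1, 2], [3, 4]]}]}
bad = {"toy": [{"attempt_1": [[1, 2], [3]]}]}  # ragged grid, no attempt_2

print(check_submission(good))  # []
print(check_submission(bad))
```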
Summary
It only cracks the easy tasks, but the whole workflow is here: load the data, define a language, search for a program that explains the demos, apply it to the test input, and format the output. Everything else in this guide is a stronger version of one of those steps: a bigger DSL, smarter search, or a learned model in place of brute force.
Run it on ed36ccf7 and you'll see Found solution program: ['rotate_90']. Now point it at 007bbfb7, the fractal tiling task from Part I, and it prints No solution found. A rotate/flip DSL can't even change a grid's size (3×3 → 9×9), let alone tile it. That's the whole reason the Core Methodologies section exists: every method there is a more powerful way to find the program this toy search can't.
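To make the size limitation concrete, here is a sketch of one primitive that does grow a grid. `dsl_self_tile` is a hypothetical name, and the grid is made up. I'm not claiming this is the rule behind 007bbfb7, only showing the kind of size-changing operation the three-function DSL has no way to express.

```python
import numpy as np

def dsl_self_tile(grid):
    """Place a copy of the grid wherever the grid itself is non-zero."""
    g = np.asarray(grid)
    mask = (g != 0).astype(g.dtype)   # 1 where a copy should be placed
    return np.kron(mask, g)           # block (i, j) equals mask[i, j] * g

toy = [[0, 7, 0],
       [7, 7, 7],
       [0, 7, 0]]
big = dsl_self_tile(toy)
print(big.shape)  # (9, 9)
```

Adding a primitive like this to the DSL dictionary would let the same search loop reach 9×9 outputs, at the price of a larger search space.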
Going Further: Two Things the Winners Add
The solver above answers each task once: list a few operations, search for a combination that fits the examples, done. It's the right way to learn the loop, but not how the recent top entries get their scores. The winners keep a base predictor and wrap a generate-check-reconcile loop around it.
Two upgrades to the basic solver
Here are the two simplest pieces of that loop. Both run on a CPU with no trained model, and both map straight onto ideas from the Core Methodologies section. Each is about 20 lines of code.
Augment & Vote
Show the task to your solver from all eight rotations and flips, undo each, and let the pixels vote. The pattern is called AIRV, and it shows up in nearly every 2024–25 winner.
Validate honestly
A small pass@2 scorer you run before tuning anything, so your local number means something, and so you're watching cost per task from the start.
Why bother: the toy solver answers once and stops. Wrapping even a weak predictor in augmentation + voting + an honest scorer is the cheapest way to make it more reliable, with no training required.
Augment → Infer → Reverse → Vote
Most ARC rules stay the same if you rotate or mirror the grid first. You can lean on that: run your predictor on all eight rotations and flips of the input, undo the transform on each prediction, then take a per-pixel majority vote. The views that came out right outvote the ones that slipped. People call this AIRV (augment, infer, reverse, vote), and the MindsAI, NVARC, and ARChitects entries all use some form of it.
# airv.py -- Augment, Infer, Reverse, Vote
import numpy as np

# The 8 dihedral symmetries (the D4 group) as (forward, inverse) pairs.
D4 = {
    "identity":   (lambda g: g,                lambda g: g),
    "rot90":      (lambda g: np.rot90(g, 1),   lambda g: np.rot90(g, 3)),
    "rot180":     (lambda g: np.rot90(g, 2),   lambda g: np.rot90(g, 2)),
    "rot270":     (lambda g: np.rot90(g, 3),   lambda g: np.rot90(g, 1)),
    "flip_h":     (lambda g: np.fliplr(g),     lambda g: np.fliplr(g)),
    "flip_v":     (lambda g: np.flipud(g),     lambda g: np.flipud(g)),
    "transpose":  (lambda g: g.T,              lambda g: g.T),
    "anti_trans": (lambda g: np.rot90(g.T, 2), lambda g: np.rot90(g, 2).T),
}

def majority_vote(grids):
    """Per-pixel majority vote across candidate grids of identical shape."""
    stack = np.stack([np.asarray(g) for g in grids])
    out = np.zeros(stack.shape[1:], dtype=int)
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            vals, counts = np.unique(stack[:, r, c], return_counts=True)
            out[r, c] = vals[counts.argmax()]
    return out

def airv(predict_fn, grid):
    """Predict under all 8 symmetries, undo each, then vote."""
    cands = []
    for fwd, inv in D4.values():
        pred = predict_fn(fwd(np.asarray(grid)))
        cands.append(inv(np.asarray(pred)))      # reverse the augmentation
    if len({c.shape for c in cands}) != 1:       # views disagree on shape
        return predict_fn(np.asarray(grid))      # fall back to identity view
    return majority_vote(cands)
predict_fn is whatever you've got: usually a model fine-tuned on the task, but any shape-preserving predictor works. In a quick test here, a predictor that was right about 77% of the time on its own came out exact every time once the eight views voted. One caveat: a rule with a direction baked in (gravity pulling down, say) only survives the flips that keep that direction, so match the augmentations to the task.
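Here is a small deterministic way to watch the vote do its job. It is a toy under stated assumptions, not a reproduction of the quick test mentioned above. The "rule" is simply output equals input, and `flaky_identity` is a made-up predictor that always gets the top-left pixel of whatever view it sees wrong. Because that pixel lands on a different corner of the original grid depending on the view, each corner is wrong in only 2 of the 8 candidates.

```python
import numpy as np

# (forward, inverse) pairs for the 8 rotations/flips, as in airv.py above.
VIEWS = [
    (lambda g: g,                lambda g: g),
    (lambda g: np.rot90(g, 1),   lambda g: np.rot90(g, 3)),
    (lambda g: np.rot90(g, 2),   lambda g: np.rot90(g, 2)),
    (lambda g: np.rot90(g, 3),   lambda g: np.rot90(g, 1)),
    (np.fliplr,                  np.fliplr),
    (np.flipud,                  np.flipud),
    (lambda g: g.T,              lambda g: g.T),
    (lambda g: np.rot90(g.T, 2), lambda g: np.rot90(g.T, 2)),
]

def flaky_identity(view):
    """Toy predictor: copies its input but corrupts the top-left pixel."""
    out = np.array(view, copy=True)
    out[0, 0] = (out[0, 0] + 1) % 10
    return out

def vote(cands):
    """Per-pixel majority vote over same-shaped candidate grids."""
    stack = np.stack(cands)
    out = np.zeros(stack.shape[1:], dtype=int)
    for idx in np.ndindex(*out.shape):
        vals, counts = np.unique(stack[(slice(None),) + idx], return_counts=True)
        out[idx] = vals[counts.argmax()]
    return out

grid = np.arange(9).reshape(3, 3)            # colors 0..8
single = flaky_identity(grid)                # one view, one mistake
cands = [inv(flaky_identity(fwd(grid))) for fwd, inv in VIEWS]
voted = vote(cands)

print(np.array_equal(single, grid))  # False
print(np.array_equal(voted, grid))   # True
```

The vote only helps here because the errors are spread across views. A predictor that makes the same mistake in every view gets no benefit.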
Measure honestly: a pass@2 harness
The second one is just a habit, but an important one: before you change anything, grade yourself on the metric the competition actually uses. That's pass@2: you submit two grids per test input, and the task counts as solved if either is an exact match.
# evaluate.py -- a minimal pass@2 local validation harness
import json, glob, os, time

def pass_at_2(attempts, truth):
    return any(a == truth for a in attempts)  # solved if EITHER attempt matches

def evaluate(solver, task_dir, limit=None):
    paths = sorted(glob.glob(os.path.join(task_dir, "*.json")))[:limit]
    solved = total = 0
    t0 = time.perf_counter()
    for path in paths:
        with open(path) as f:  # close each file once it is parsed
            task = json.load(f)
        # solver(task) returns one entry per test input; each entry is a
        # list of up to 2 candidate grids (plain lists, not NumPy arrays)
        preds = solver(task) or []
        for i, pair in enumerate(task["test"]):
            total += 1
            attempts = preds[i] if i < len(preds) else []
            if "output" in pair and pass_at_2(attempts, pair["output"]):
                solved += 1
    dt = time.perf_counter() - t0
    print(f"{solved}/{total} solved (pass@2={solved/max(total,1):.1%}) "
          f"{dt:.1f}s {dt/max(len(paths),1):.3f}s/task")
    return solved / max(total, 1)
Develop on the training set, run this on the evaluation set once, and keep an eye on the seconds-per-task column: on ARC-AGI-2, time and cost count toward the score, so it's worth tracking from the start. The next section is about why that one-shot rule matters so much.
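One practical wrinkle: the harness expects a list of attempts per test input, while solve_task from the toy solver returns a single grid per test input. A thin adapter bridges the two. This is a sketch; `as_two_attempts` and `echo_solver` are hypothetical names, and pass_at_2 is redefined here only so the snippet runs on its own.

```python
def as_two_attempts(single_answer_solver):
    """Wrap a solver that returns one grid per test input so that it
    returns [[grid], [grid], ...], the shape a pass@2 scorer expects."""
    def wrapped(task):
        preds = single_answer_solver(task) or []
        return [[grid] for grid in preds]
    return wrapped

def pass_at_2(attempts, truth):
    """Solved if either of the first two attempts is an exact match."""
    return any(a == truth for a in attempts[:2])

# Toy solver (hypothetical): always answers with the input unchanged.
def echo_solver(task):
    return [pair["input"] for pair in task["test"]]

toy_task = {
    "train": [],
    "test": [
        {"input": [[1, 2]], "output": [[1, 2]]},   # echo is right here
        {"input": [[3, 4]], "output": [[4, 3]]},   # and wrong here
    ],
}

solver = as_two_attempts(echo_solver)
preds = solver(toy_task)
results = [pass_at_2(p, pair["output"])
           for p, pair in zip(preds, toy_task["test"])]
print(results)  # [True, False]
```

With the adapter in place, something like evaluate(as_two_attempts(solve_task), task_dir) should score the toy solver without further changes.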
Building a Robust Validation Pipeline
Before you write any solver code, build a local validation pipeline. It's the step most people skip, and it's the one that keeps you from fooling yourself about how well you're really doing on the hidden test set.
The Golden Rule of Validation
Your solver should be developed only on the public training set. The public evaluation set must be treated as a one-shot, holdout set for final validation.
- Mimic Kaggle: Your pipeline should strictly separate the public training and public evaluation datasets. Your algorithm should never see the evaluation tasks during its development phase.
- Avoid Data Leakage: Repeatedly modifying your algorithm based on its score on the evaluation set, or manually inspecting those tasks to guide development, constitutes data leakage. This will lead to an inflated, unreliable local score that will not translate to the private leaderboard.
- One-Shot Execution: To get your best estimate of true performance, run your final, trained solver on the entire public evaluation set in a single execution. If your solution is computationally expensive, you can build confidence on a random sample of evaluation tasks first, holding out the rest for the final full run.
- Log your cost, too: on ARC-AGI-2, dollars and seconds per task are part of the score, and the contest box is offline with a hard time limit. Track compute from day one, so a solver that "works" but would blow the budget doesn't surprise you at submission time.
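The sampling and cost-logging habits above can be sketched in a few lines. Everything here is an assumption for illustration: `split_holdout` and `timed` are names I made up, and the task ids are placeholders rather than real evaluation tasks.

```python
import random
import time

def split_holdout(task_ids, sample_size, seed=0):
    """Split task ids into a small 'peek' sample and an untouched remainder.
    A fixed seed keeps the split identical across runs."""
    ids = sorted(task_ids)
    rng = random.Random(seed)
    peek = set(rng.sample(ids, min(sample_size, len(ids))))
    return sorted(peek), [t for t in ids if t not in peek]

def timed(solver, task):
    """Run solver(task) and return (result, seconds elapsed)."""
    t0 = time.perf_counter()
    result = solver(task)
    return result, time.perf_counter() - t0

# Placeholder ids standing in for the public evaluation set.
eval_ids = [f"task_{i:03d}" for i in range(10)]
peek, rest = split_holdout(eval_ids, sample_size=3, seed=42)
print(len(peek), len(rest))  # 3 7
```

Fixing the seed matters: if the sample changes every run, you end up peeking at the whole evaluation set a few tasks at a time.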
Deconstructing the Champions: Analysis of Winning Solutions
The fastest way to get good at this is to read the winners' code, and ARC makes that easy, since everyone has to open-source. Below are two cohorts worth studying: the 2024 winners (on ARC-AGI-1), who set the template everyone still builds on, and the 2025 winners (on the much harder ARC-AGI-2), who show where the bar is now.
The 2024 class (ARC-AGI-1)
These solutions defined the modern playbook: a fine-tuned LLM, test-time training, and clever ways of generating and ranking candidates.
| Rank (Category) | Team / Lead Author | Score | Core Approach | Key Innovation(s) |
|---|---|---|---|---|
| 1st (Kaggle) | the ARChitects | 53.5% | LLM-based TTT (Transduction) | Custom DFS sampling, "Product of Experts" scoring. |
| 2nd (Kaggle) | G. Barbadillo | 40.0% | Ensemble (Induction + Transduction) | Hybrid solver combining DSL search and LLM prediction. |
| 2nd (Paper) | Akyürek et al. | 61.9% (public) | LLM-based TTT (Transduction) | Foundational method for TTT on ARC using LoRA. |
| Runner-Up (Paper) | Simon Ouellette | N/A | Neurally-Guided Program Synthesis (Induction) | GridCoder: using a Transformer to guide DSL search. |
1st Place: the ARChitects
Core Approach: LLM-based TTT (Transduction)
Advanced TTT with custom sampling and "Product of Experts" scoring
2nd Paper: Akyürek et al.
Core Approach: The TTT Pioneers (Transduction)
Foundational TTT methodology with LoRA and data augmentation
Runner-Up: Simon Ouellette (GridCoder)
Core Approach: Neurally-Guided Program Synthesis (Induction)
Specialized Transformer guiding DSL search for efficient induction
The 2025 class (ARC-AGI-2)
Same competition, much harder benchmark, and the scores dropped accordingly. What's interesting is that the winners didn't invent a new paradigm: they took the 2024 playbook and either industrialized it or twisted it. Notice the cost figure next to NVARC's score: on ARC-AGI-2, efficiency is part of the score, so this is pennies per task, not dollars.
| Rank | Team | Score | Core Approach | The twist |
|---|---|---|---|---|
| 1st | NVARC | 24.0% @ ~$0.20/task | Synthetic data + TTT + voting | Built on the ARChitects' 2024 base, fed it ~266K self-generated, spec-checked puzzles. |
| 2nd | the ARChitects | 16.5% | Masked-diffusion LLM | Dropped autoregression; a diffusion model that denoises the grid over ~100 refinement steps. |
| 3rd | MindsAI | 12.6% | Test-time fine-tuning | Years of TTT engineering: augmentation ensembles, tokenizer dropout, vote. |
1st Place 2025: NVARC
Core Approach: Synthetic data + TTT + ensemble voting
The 2024 playbook, industrialized, and the recipe to copy
2nd Place 2025: the ARChitects
Core Approach: 2D-aware masked-diffusion LLM
Ditched autoregression for a model that refines the whole grid at once
3rd Place 2025: MindsAI
Core Approach: Test-time fine-tuning, pushed hard
Pure test-time fine-tuning, pushed about as far as it goes
The 2025 paper prizes went to the three tiny-model projects covered earlier: TRM, SOAR, and CompressARC. Between the Kaggle winners and the paper winners, the year's whole message is right there: stack synthetic data, test-time training, and refinement loops; don't just reach for a bigger model.
Charting Your Course: Strategies for ARC Prize 2026
You've got the philosophy, the methods, and the winners. Here's how to actually start competing. The combination that still wins is unglamorous: solid engineering, a strict validation habit, and one or two genuinely new ideas on top.
First, pick your track
The 2026 competition has three doors, and they suit different people. Be honest about which one you are before you sink months into it:
ARC-AGI-2 ($1M)
An engineering-and-efficiency race on a well-mapped problem. Pick this if you like building systems and squeezing accuracy out of a fixed budget.
ARC-AGI-3 (milestones)
Wide-open agent research where the best score is ~12%. Pick this if you like RL, world models, and being early. A clever agent can lead today.
Paper track ($450K)
For a genuinely new idea. This is how TRM, SOAR, and CompressARC got rewarded. Pick this if your contribution is conceptual.
The ARC-AGI-2 recipe that works
If you're going for the static-puzzle track, don't start from scratch: reproduce the NVARC stack first, then improve it. No single approach solves everything, so the state of the art is a hybrid:
Small model + a pile of synthetic data
Take an open-weight model, then pretrain on lots of generated puzzles: ReARC generators, the BARC sets, plus your own spec-checked ones. This is NVARC's edge.
Test-time training + voting
Fine-tune a LoRA on each task's own (heavily augmented) demos, infer under many symmetries, and majority-vote the result. The proven workhorse.
Ensemble + a refinement loop
Pair a transductive solver with an inductive one (they fail on different tasks), and wrap the whole thing in a verify-and-fix loop. Watch your cost per task.
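The voting and ensembling steps in that recipe both end at the same place: a pool of candidate grids that has to be cut down to two attempts. One simple way to do that is to count identical candidates and keep the two most frequent. This is a sketch of that idea only, not what any winning team actually shipped. `grid_key`, `pick_two_attempts`, and the candidate lists are all hypothetical.

```python
from collections import Counter

def grid_key(grid):
    """Hashable form of a grid so identical candidates can be counted."""
    return tuple(tuple(row) for row in grid)

def pick_two_attempts(candidates):
    """Rank candidate grids by how often they were proposed and return
    the top two distinct ones (fewer if there are not two)."""
    counts = Counter(grid_key(g) for g in candidates)
    ranked = [key for key, _ in counts.most_common(2)]
    return [[list(row) for row in key] for key in ranked]

# Made-up candidates pooled from two solvers and several augmented views.
transductive = [[[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 1], [0, 1]]]
inductive = [[[1, 0], [0, 1]], [[0, 0], [0, 1]], [[1, 1], [0, 1]]]

attempts = pick_two_attempts(transductive + inductive)
print(attempts)  # [[[1, 0], [0, 1]], [[1, 1], [0, 1]]]
```

Whole-grid counting like this still works when candidates differ in shape, which per-pixel voting can't handle.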
The Frontier: Where Do New Ideas Come From?
Reproducing the winners gets you onto the board, not to 85%. ARC-AGI-2 was built specifically to break today's methods, so the real progress is in attacking the three weaknesses it targets:
Symbolic Interpretation
Understanding that pixels can represent an action or concept, rather than just being a pattern to transform.
Compositional Reasoning
Discovering and applying multiple rules simultaneously, especially when those rules interact with each other.
Contextual Rule Application
Recognizing the context that determines which rule to apply, moving beyond superficial global patterns.
Ready to Start Your ARC Journey?
The resources and checklist you need to actually start competing in the ARC Prize 2026.
Essential Resources
Competition Checklist
The Path Forward
ARC-AGI is more than a leaderboard to climb. The bet behind it is that real intelligence isn't the skills you've already banked; it's how efficiently you handle something genuinely new.
What I find motivating is that the gap is still wide open. ARC-AGI-1 fell, but it took crossing it to realize the harder questions (doing this efficiently, and doing it in a world you have to explore) were still unanswered. ARC-AGI-2's Grand Prize is unclaimed, and ARC-AGI-3 has barely been scratched. Read the winners' code, hang around the Discord, and go after the tasks that break your solver. That's the whole job, and there's plenty of it left.