The ARC-AGI Benchmark

The steady decline in the price of computing power, roughly two orders of magnitude per decade, has fueled deep learning enormously since 2010. Larger networks plus more data reliably delivered ever-higher scores on common benchmarks, feeding the hope that scaling alone would inevitably lead to AGI. As early as 2019, François Chollet introduced the ARC-AGI benchmark to measure intelligence directly.

Exams like MMLU or HELM primarily measure memorized, task-specific knowledge. What is missing is a signal of fluid intelligence: the ability to understand and solve a completely new problem ad hoc. ARC-AGI-1 ("Abstraction and Reasoning Corpus for Artificial General Intelligence") contains 1,000 unique tasks that cannot be "learned" in advance.

Each puzzle is new, requires only basic everyday knowledge (objects, counting, simple geometry), and sits far below kindergarten difficulty for humans. For LLMs, even a 50,000-fold scaling jump from early base models left the hit rate barely above 0%. Besides the leaderboard, you can try the challenges yourself on the official website.
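To make the task format concrete, here is a minimal sketch. It is not the official API; the toy task and the `flip_horizontal` rule are illustrative assumptions, but the structure follows the real ARC format: each task is a handful of demonstration input/output pairs plus a test input, where grids are small 2-D arrays of integers 0-9 denoting colors.

```python
# Illustrative sketch of the ARC task format (not the official API).
# A task holds a few demonstration pairs ("train") and a test input.

def flip_horizontal(grid):
    """Hypothetical candidate rule for this toy task: mirror each row."""
    return [list(reversed(row)) for row in grid]

# Toy task in the spirit of ARC: every demonstration output is the
# horizontally mirrored input.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]],      "output": [[0, 5, 5]]},
    ],
    "test": [{"input": [[7, 0, 0]]}],
}

# A rule counts as a solution only if it reproduces *every* demonstration pair.
assert all(flip_horizontal(p["input"]) == p["output"] for p in task["train"])
print(flip_horizontal(task["test"][0]["input"]))  # → [[0, 0, 7]]
```

The point of the format is that the rule must be induced from two or three examples; there is no training set large enough to memorize.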

It was not until 2024 that a new approach broke the deadlock: test-time adaptation (TTA) lets models adapt their weights or synthesize a program at runtime. OpenAI's internally fine-tuned o3 thereby demonstrated human-level performance on ARC-AGI-1 for the first time. Since then, every successful ARC method has used some form of TTA, from program search to on-the-fly training.
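One TTA flavor mentioned above, test-time program search, can be sketched in a few lines. The three-primitive DSL and the `search` helper below are hypothetical stand-ins: at test time the system enumerates compositions of primitives and keeps the first program consistent with all demonstration pairs, rather than updating any weights.

```python
# Hedged sketch of test-time program search over a tiny, hand-made DSL
# (the primitives and names here are illustrative assumptions).
from itertools import product

def identity(g): return g
def rot90(g):    return [list(r) for r in zip(*g[::-1])]
def flip_h(g):   return [r[::-1] for r in g]

PRIMITIVES = [identity, rot90, flip_h]

def search(train_pairs, max_depth=2):
    # Enumerate all primitive sequences up to max_depth and return the
    # first program that reproduces every demonstration pair.
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def program(g, combo=combo):
                for f in combo:
                    g = f(g)
                return g
            if all(program(p["input"]) == p["output"] for p in train_pairs):
                return program
    return None

train = [{"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]}]
prog = search(train)
print(prog([[5, 6]]))  # → [[6, 5]]
```

The "adaptation" happens entirely at inference time: the searched program is specific to the one task at hand and is discarded afterwards, which is exactly what makes this family of methods robust to unseen tasks.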

ARC-AGI-1 was quickly saturated at human level, so ARC-AGI-2 followed. It retains the I/O format but increases the compositional complexity of each task. In a study with 400 subjects in San Diego, every task was solved; ten randomly selected people voting by majority would score 100%. LLMs without TTA remain at 0-2%, and even TTA systems still perform far below humans.

ARC-AGI-3 goes one step further: the model is dropped into interactive, unknown environments and must discover its goal, controls, and physics on its own, all while remaining time- and action-efficient. A developer preview is scheduled for July 2025. To master compositional generalization, future systems must combine both types of cognition: deliberate Type 2 program search, guided by fast, approximate Type 1 heuristics that tame the combinatorial explosion.
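The hybrid argued for above can be sketched as best-first search: a cheap Type 1 heuristic scores partial programs so the exhaustive Type 2 enumeration expands promising branches first. All names and the cell-overlap similarity measure are illustrative assumptions, not any system's actual method.

```python
# Sketch: Type 1 heuristic (cheap grid similarity) steering Type 2 search
# (enumeration of primitive compositions). Purely illustrative.
import heapq
from itertools import count

def flip_h(g): return [r[::-1] for r in g]
def flip_v(g): return g[::-1]
def rot90(g):  return [list(r) for r in zip(*g[::-1])]

PRIMITIVES = [flip_h, flip_v, rot90]

def similarity(a, b):
    # Type 1 stand-in: fraction of matching cells (0 if shapes differ).
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return 0.0
    cells = [x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(cells) / len(cells)

def guided_search(pair, max_nodes=100):
    # Best-first search: always expand the partial program whose current
    # output looks most like the target, instead of enumerating blindly.
    tie = count()  # tiebreaker so the heap never compares function lists
    frontier = [(-similarity(pair["input"], pair["output"]), next(tie), [])]
    while frontier and max_nodes:
        max_nodes -= 1
        _, _, prog = heapq.heappop(frontier)
        g = pair["input"]
        for f in prog:
            g = f(g)
        if g == pair["output"]:
            return prog
        for f in PRIMITIVES:
            heapq.heappush(
                frontier,
                (-similarity(f(g), pair["output"]), next(tie), prog + [f]))
    return None

pair = {"input": [[1, 2], [3, 4]], "output": [[3, 4], [1, 2]]}
prog = guided_search(pair)
print([f.__name__ for f in prog])  # → ['flip_v']
```

Without the heuristic ordering, the search would visit every depth-1 candidate before finding the answer; with it, the correct branch is expanded immediately. On real ARC tasks the DSL and the heuristic are vastly richer, but the division of labor is the same.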

ARC serves not as an end goal but as a directional arrow: as long as humans can easily design tasks that even the best LLMs fail at, AGI has not been achieved. Progress on ARC-AGI-2, and soon ARC-AGI-3, will show whether hybrid architectures combining deep learning and program search reach the necessary level of fluid, data- and compute-efficient intelligence.
