When ARC-AGI-2 was released in early 2026 as a harder successor to the benchmark that o3 had nearly solved, the AI community expected the new tasks to resist current systems for at least a year. Gemini Deep Think's 45.1% accuracy on ARC-AGI-2, achieved within months of the benchmark's release, has updated those expectations substantially. But the headline number is less informative than a careful look at what the score does and does not mean.
What ARC-AGI Tests
The Abstraction and Reasoning Corpus, developed by François Chollet, was explicitly designed to resist the memorization strategies that allow large language models to perform well on conventional benchmarks. ARC tasks present visual grids with a small number of input-output examples and ask the model to infer the transformation rule and apply it to a new input. The patterns are novel enough that they cannot be extracted from any training corpus; they require what Chollet characterizes as genuine in-context reasoning from a small amount of evidence.
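Concretely, public ARC tasks are distributed as JSON: a handful of train input/output grid pairs plus one or more test inputs, where a grid is a 2-D array of color indices. The sketch below shows that structure and the core discipline of the benchmark, namely that a candidate rule is only trusted if it reproduces every train pair. The example task and candidate transformations are invented for illustration and are far simpler than real ARC-AGI-2 tasks.

```python
# Sketch of the ARC task format and a rule-verification loop.
# Grids are lists of lists of ints (colors 0-9). The task and the
# candidate rules here are illustrative, not from the real corpus.

Grid = list[list[int]]

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [{"input": [[0, 0], [3, 0]]}],
}

def rotate_180(g: Grid) -> Grid:
    return [row[::-1] for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

def fits_all_train_pairs(rule, task) -> bool:
    """A rule is only trusted if it reproduces every train output."""
    return all(rule(p["input"]) == p["output"] for p in task["train"])

candidates = [rotate_180, transpose]
solution = next(r for r in candidates if fits_all_train_pairs(r, task))
print(solution(task["test"][0]["input"]))  # -> [[0, 3], [0, 0]]
```

The point of the two-to-three train pairs is that they underdetermine the rule far less than they appear to: a rule that fits all of them by accident is rare, which is what makes the benchmark resistant to shallow guessing.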
ARC-AGI-1 was nearly saturated by OpenAI’s o3 at 87.5%, which used extensive test-time compute — essentially, many attempts with a selection mechanism for the best answer. The score was impressive but also demonstrated that with sufficient compute budget, even a task designed to resist current AI could be brute-forced at the reasoning level. ARC-AGI-2 was designed with this in mind, incorporating tasks that require more fundamental concept formation and are harder to solve through extended search.
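The "many attempts with a selection mechanism" pattern can be sketched as best-of-N sampling with a consensus pick. The actual selection mechanism behind o3's runs has not been published, so the majority-vote rule and the stub sampler below are assumptions for illustration only.

```python
# Hypothetical sketch of best-of-N test-time compute with majority voting.
# The sampler is a stand-in; o3's real selection mechanism is unpublished.
import random
from collections import Counter

def sample_answer(task_id: str, rng: random.Random) -> str:
    """Stand-in for one stochastic reasoning attempt: a real system
    would run a full chain-of-thought and emit a candidate grid."""
    return rng.choice(["A", "A", "A", "B", "C"])  # noisy, biased toward "A"

def best_of_n(task_id: str, n: int, seed: int = 0) -> str:
    """Draw n independent attempts and return the most common answer."""
    rng = random.Random(seed)
    attempts = [sample_answer(task_id, rng) for _ in range(n)]
    return Counter(attempts).most_common(1)[0][0]

# With enough samples, the vote converges on the sampler's modal answer.
print(best_of_n("task-001", n=301))
```

Note what this buys and what it does not: voting amplifies a weak per-attempt signal into a reliable answer, but only for tasks where some attempts already succeed. If the base model never forms the right concept, no amount of sampling recovers it, which is the distinction ARC-AGI-2 was built to probe.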
The 45.1% in Context
Gemini Deep Think’s 45.1% is the highest published score on ARC-AGI-2 as of early April 2026. For context: a random baseline achieves approximately 0%; the original human performance figure on ARC-AGI-1 was 85%; and the ARC-AGI team’s own assessment of human performance on ARC-AGI-2 is above 90%. The 45.1% therefore places the best current AI performance roughly halfway between random and human: far better than nothing, well below human level.
The score is not directly comparable to ARC-AGI-1 scores because the task difficulty distribution is different. ARC-AGI-2 tasks are harder on average, so 45% on ARC-AGI-2 represents more capability than 45% on ARC-AGI-1 would. But the comparison to human performance (90%+) reveals the remaining gap more starkly than the absolute number does.
Gemini Deep Think’s result is also compute-intensive. Like o3’s ARC-AGI-1 score, the 45.1% reflects extended test-time reasoning — the model is allowed to generate substantially more tokens of chain-of-thought before producing a final answer. The implication is that the capability gap with humans on this benchmark is being closed partly through spending more compute on each problem, not only through improvements in the model’s base reasoning ability.
What the Remaining 54.9% Requires
Chollet’s analysis of the ARC-AGI-2 failures from current AI systems points to a consistent pattern: tasks that require novel concept formation — constructing a new mental model from scratch, not applying an existing pattern — remain substantially harder than tasks that require applying a recognizable transformation in a new context. This distinction is philosophically significant: it aligns with Chollet’s longstanding argument that current LLMs are sophisticated pattern matchers rather than genuine reasoners.
The tasks where Gemini Deep Think struggles share a few features: multiple interdependent transformations whose relationship must be inferred from very few examples; cases where the apparent pattern in the examples is a deliberate mislead and the true rule is a second-order abstraction; and spatial reasoning that goes beyond simple geometric transformation to genuine three-dimensional modeling from two-dimensional projections.
These failure cases are informative for AI capability evaluation more broadly. They suggest that test-time compute scaling — letting models think longer — continues to have returns on tasks that can be decomposed into sequences of pattern-matching operations, but hits a ceiling on tasks requiring the kind of concept formation that is computationally expensive even for humans.
Gemini’s Architectural Contributions
Google’s published details on what makes Gemini Deep Think specifically capable for ARC-AGI-2 are limited, but the model’s training appears to emphasize visual reasoning and multi-step planning more heavily than the base Gemini 2.0 model. The “Deep Think” designation corresponds to an inference configuration that allocates substantially more compute per problem, similar in spirit to o3’s high-compute mode.
One meaningful difference from o3’s ARC-AGI-1 approach is that Gemini Deep Think appears to use a more sophisticated self-correction mechanism — detecting when its current approach is not working and restructuring its reasoning strategy, rather than simply trying the same approach multiple times with variation. This is the kind of meta-cognitive capability that Chollet’s framework suggests is necessary for the remaining 54.9%.
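Google has not published this mechanism, but the described behavior, abandoning a failing approach rather than resampling it, can be sketched as a small control loop. Everything below is a hypothetical illustration: the strategy families, the per-family budget, and the toy task are all invented, and the "failure check" is simply verification against the train pairs.

```python
# Hypothetical sketch of "restructure on failure" vs. "retry with variation".
# Strategy families and the budget split are invented for illustration;
# Gemini Deep Think's actual self-correction mechanism is unpublished.

def solves_train_pairs(rule, train_pairs) -> bool:
    return all(rule(p["input"]) == p["output"] for p in train_pairs)

def solve_with_restructuring(task, strategy_families, budget=10):
    """Cycle through qualitatively different strategy families.
    When a family keeps failing the train pairs, move on to the next
    family instead of drawing more samples from the same one."""
    per_family = budget // len(strategy_families) or 1
    for family in strategy_families:
        for attempt in range(per_family):
            rule = family(task, attempt)  # propose a candidate rule
            if rule is not None and solves_train_pairs(rule, task["train"]):
                return rule               # verified against all train pairs
    return None                           # budget exhausted, no verified rule

# Two toy families: recoloring rules and geometric rules.
def color_family(task, attempt):
    return lambda g: [[(c + 1) % 10 if c else 0 for c in row] for row in g]

def geometric_family(task, attempt):
    rules = [lambda g: [row[::-1] for row in g],         # mirror horizontally
             lambda g: [row[::-1] for row in g[::-1]]]   # rotate 180 degrees
    return rules[attempt % len(rules)]

task = {"train": [{"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]}]}
rule = solve_with_restructuring(task, [color_family, geometric_family])
print(rule([[0, 0], [3, 0]]))  # -> [[0, 3], [0, 0]]
```

The contrast with plain best-of-N is the outer loop: failures within a family are treated as evidence against the whole family, not just the individual sample, which is a crude stand-in for the meta-cognitive restructuring described above.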
What This Means for the AGI Debate
ARC-AGI-2 is deliberately not a measure of “AGI” in any complete sense — it tests one specific form of abstract reasoning under specific conditions. Chollet has been explicit that a system passing ARC-AGI would demonstrate one important component of general intelligence, not intelligence itself. The 45.1% therefore does not move the AGI timeline needle in any simple way.
What it does demonstrate is that the category of problems considered “hard for AI” is shrinking faster than most 2023-era predictions suggested. The tasks on ARC-AGI-2 were chosen because they were hard for GPT-4-era models. A system scoring 45% on them in the same year the benchmark was released compresses the expected timeline for this capability level by roughly two to three years relative to those predictions.
The more useful framing for developers and operators: the capabilities being demonstrated on benchmarks like ARC-AGI-2 — novel rule inference from few examples, multi-step abstract reasoning, self-correction of reasoning strategies — will increasingly appear in general-purpose AI systems over the next twelve to eighteen months. Planning for AI systems that can perform these operations is more relevant than debating their AGI implications.
