ARC Prize analysis finds GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3

The ARC Prize team published a detailed analysis of GPT-5.5 and Opus 4.7 on ARC-AGI-3, reporting pass rates of 0.43% and 0.18% respectively on environments each pre-verified as solvable by at least two untrained humans. The analysis documents three failure modes: applying training-data abstractions to novel mechanics (misreading unfamiliar games as Tetris or Sokoban), observing local effects without inferring the global rules behind them, and occasionally completing levels while holding an incorrect mental model. The two models fail differently: GPT-5.5 stays exploratory but unfocused, while Opus 4.7 commits confidently to wrong theories. ARC-AGI-3 is designed to measure genuine novel reasoning rather than pattern retrieval, and the sub-1% scores reinforce that frontier LLMs have not solved that problem.
