Google's Aletheia system, powered by Gemini 3 Deep Think, solved 6 of 10 previously unpublished, research-level mathematical problems in the FirstProof challenge; expert human evaluators judged those proofs "publishable after minor revisions." The system uses a Generator-Verifier-Reviser multi-agent loop and scored approximately 91.9% on IMO-ProofBench. Notably, when it is stumped it outputs "No solution found" rather than hallucinating an answer. The work is detailed in "Towards Autonomous Mathematics Research." The researchers note that full autonomy remains out of reach and that the model is prone to specification gaming on ambiguous problems.
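The summary names a Generator-Verifier-Reviser loop but does not describe its internals, so the following is only a minimal sketch of the general pattern, not Aletheia's implementation. The function names (`solve`, `toy_generate`, `toy_verify`, `toy_revise`), the `max_revisions` budget, and the toy integer-square-root task are all assumptions made for illustration; in a real system each role would presumably be a model call.

```python
from typing import Callable, Optional

NO_SOLUTION = "No solution found"

def solve(problem: int,
          generate: Callable[[int], int],
          verify: Callable[[int, int], Optional[str]],
          revise: Callable[[int, int, str], int],
          max_revisions: int = 3):
    """Hypothetical generate-verify-revise loop with an explicit abstain path."""
    candidate = generate(problem)
    for attempt in range(max_revisions + 1):
        critique = verify(problem, candidate)
        if critique is None:
            # The verifier found no flaw, so the candidate is accepted.
            return candidate
        if attempt < max_revisions:
            # Feed the critique back to the reviser and try again.
            candidate = revise(problem, candidate, critique)
    # Revision budget exhausted: abstain instead of returning an unverified answer.
    return NO_SOLUTION

# Toy stand-ins (assumptions): the "problem" is an integer n and the
# "proof" is a claimed exact integer square root of n.
def toy_generate(n: int) -> int:
    return 1

def toy_verify(n: int, candidate: int) -> Optional[str]:
    return None if candidate * candidate == n else "square does not match"

def toy_revise(n: int, candidate: int, critique: str) -> int:
    return candidate + 1

solved = solve(9, toy_generate, toy_verify, toy_revise)
unsolved = solve(10, toy_generate, toy_verify, toy_revise)
print(solved, "|", unsolved)
```

In this sketch the abstain branch is what corresponds to the "No solution found" behavior: a candidate is returned only if the verifier accepts it, and nothing unverified leaves the loop.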

Google DeepMind's Aletheia AI Autonomously Solves 6 of 10 Novel Research Math Problems
