Overview
- Both Google’s and OpenAI’s systems solved five of the six problems under the standard 4.5-hour exam conditions, producing complete natural-language proofs
- Google’s Gemini Deep Think model received official IMO certification for its 35 out of 42 score
- OpenAI’s experimental model also scored 35 points, but its result rests on validation by independent former IMO medalists and has not been officially certified by the IMO
- Neither model used task-specific training; both instead relied on general-purpose reinforcement learning and scaled test-time compute (sketched after this list)
- Top human contestants still achieved perfect scores, and experts caution that AI’s inconsistent “jagged intelligence” and undisclosed compute demands remain significant caveats
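
Neither lab has disclosed how its test-time compute scaling actually works. As a rough illustration only, one common form of the technique is parallel sampling with self-consistency voting: draw many independent solutions and keep the most frequent final answer. The sketch below is a minimal Python mock; `sample_answer` is a hypothetical stand-in for a model call, not either lab’s API, and proof-style outputs like the IMO’s would in practice require a verifier or reranker rather than exact-match voting.

```python
import random
from collections import Counter


def sample_answer(prompt: str) -> str:
    """Hypothetical stand-in for one stochastic model generation.

    A real system would call a reasoning model here; this stub just
    simulates a noisy solver that returns the right answer ~70% of
    the time and a random wrong one otherwise.
    """
    return "correct" if random.random() < 0.7 else f"wrong-{random.randint(1, 5)}"


def solve_with_test_time_scaling(prompt: str, n_samples: int = 64) -> str:
    """Spend extra inference-time compute by drawing many independent
    candidate solutions, then aggregate by majority vote
    (self-consistency). More samples raise the chance the plurality
    answer is correct, at proportionally higher compute cost.
    """
    candidates = [sample_answer(prompt) for _ in range(n_samples)]
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer


if __name__ == "__main__":
    print(solve_with_test_time_scaling("IMO 2025, Problem 1"))
```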