Overview
- NYU Stern and GoodFin evaluated 23 AI systems on a mock CFA Level III exam that combines multiple-choice and essay questions.
- OpenAI’s o4-mini scored 79.1% and Google’s Gemini 2.5 scored about 77%, both above the 63% pass threshold, with Anthropic’s Claude Opus also passing.
- Models clustered around 71%–75% on multiple-choice items, but essay scores varied widely, separating reasoning-focused systems from the rest.
- Researchers used chain-of-thought prompting to elicit step-by-step explanations; some models finished the exam in minutes (see the sketch after this list).
- Industry voices cautioned that exam success does not equal client-ready judgment, urging hybrid workflows with human oversight; for comparison, the human pass rate in February stood at 49%.
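
The chain-of-thought technique mentioned above amounts to instructing the model to reason step by step before committing to an answer. Below is a minimal sketch of that style of prompting against a single exam item, assuming the OpenAI Python SDK; the study's actual harness, prompts, exam questions, and grading rubric are not public, so the system prompt and sample question here are illustrative only.

```python
# Minimal chain-of-thought prompting sketch (illustrative; not the study's harness).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical essay-style item standing in for a real exam question.
QUESTION = (
    "A client with a 20-year horizon and moderate risk tolerance asks whether "
    "to shift 10% of equities into long-duration bonds. Recommend and justify."
)

response = client.chat.completions.create(
    model="o4-mini",  # one of the models named in the study
    messages=[
        {
            # Chain-of-thought instruction: ask for step-by-step reasoning
            # before the final answer, mirroring the prompting approach described.
            "role": "system",
            "content": (
                "You are sitting a CFA Level III mock exam. Think through each "
                "question step by step, then state your final recommendation clearly."
            ),
        },
        {"role": "user", "content": QUESTION},
    ],
)

print(response.choices[0].message.content)
```

Eliciting the reasoning trace serves two purposes in a setup like this: it tends to improve accuracy on multi-step essay questions, and it gives human graders intermediate steps to score rather than a bare answer.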