Overview
- The MAI Diagnostic Orchestrator (MAI-DxO) achieved 80–85.5% diagnostic accuracy on 304 New England Journal of Medicine case studies in the Sequential Diagnosis Benchmark, versus an average of about 20% for the 21 physicians evaluated on the same cases.
- A multi-agent chain-of-debate approach queries leading large language models—OpenAI’s o3, Google’s Gemini, Anthropic’s Claude, Meta’s Llama and xAI’s Grok—to replicate expert clinician decision-making.
- The orchestrator’s stepwise questioning, gatekeeper-controlled release of case findings and judge-verified diagnoses drove a roughly 20% reduction in estimated testing costs compared with the human panels; a minimal sketch of this loop appears after the list.
- Microsoft AI executives frame the results as an early step toward “medical superintelligence,” emphasizing the need for peer review and real-world validation before any deployment.
- Still in the research phase, MAI-DxO awaits further bias monitoring, regulatory approval and integration pilots in Bing, Copilot and partner health systems.
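
To make the loop in the third and fourth bullets concrete, the following is a minimal, hypothetical Python sketch of a sequential-diagnosis harness: panel models debate the next action, a gatekeeper reveals case findings only on request while tallying their cost, and a judge scores the final diagnosis. All names, prompts and structures below are illustrative assumptions, not Microsoft's MAI-DxO implementation or the SDBench harness.

```python
# Minimal, hypothetical sketch of a sequential-diagnosis loop: panel models
# debate the next action, a gatekeeper releases findings only on request while
# tallying cost, and a judge scores the final diagnosis. Names and structure
# are illustrative assumptions, not Microsoft's MAI-DxO or SDBench code.
from typing import Callable

# A "model" here is any callable mapping a prompt string to a text reply
# (e.g. a thin wrapper around an LLM API client).
Model = Callable[[str], str]


def gatekeeper(request: str, case_file: dict[str, tuple[str, float]]) -> tuple[str, float]:
    """Reveal a finding only when it is explicitly requested, together with its
    (assumed) cost. A real gatekeeper would interpret free-text requests; this
    sketch uses exact key lookup."""
    return case_file.get(request, ("No result available for that request.", 0.0))


def debate(panel: list[Model], transcript: str) -> str:
    """One chain-of-debate round: every panel model proposes the next question,
    test, or diagnosis, then the first model reconciles the proposals into a
    single next action."""
    prompt = f"Case so far:\n{transcript}\nPropose the next question, test, or diagnosis."
    proposals = "\n".join(f"- {m(prompt)}" for m in panel)
    return panel[0](f"Panel proposals:\n{proposals}\nPick the single best next action.")


def run_case(panel: list[Model], judge: Model,
             case_file: dict[str, tuple[str, float]],
             max_rounds: int = 10) -> tuple[str, float, str]:
    """Alternate gatekeeper reveals and panel debate until the panel commits to
    a final diagnosis or the round budget runs out; the judge then scores it."""
    transcript: list[str] = []
    spent = 0.0
    action = "history"  # start by requesting the presenting history
    for _ in range(max_rounds):
        finding, cost = gatekeeper(action, case_file)
        spent += cost
        transcript.append(f"{action}: {finding}")
        action = debate(panel, "\n".join(transcript))
        if action.lower().startswith("final diagnosis:"):
            break
    verdict = judge(f"Proposed: {action}\nIs this the correct diagnosis for this case?")
    return action, spent, verdict
```

In a real system each `Model` would wrap an actual LLM client (o3, Gemini, Claude, Llama, Grok), and the gatekeeper and judge would themselves be models grounded in the full case record rather than simple lookups.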