Overview
- OpenAI’s o1-preview, a “reasoning” large language model that works through problems step by step, outperformed physician baselines across six text-only clinical reasoning tests in a study published Thursday in Science.
- In 76 real emergency department triage cases using unedited electronic health record text, the model gave the exact or a near-exact diagnosis 67.1% of the time, compared with 55.3% and 50.0% for two attending physicians.
- The system also beat earlier AI models such as GPT-4 and scored near-perfect marks on validated NEJM clinical reasoning tasks that measure how well clinicians document their thinking.
- Blinded physician reviewers could not reliably tell the AI’s diagnostic write-ups from human clinicians’ outputs, indicating the model can produce notes that read like expert reasoning.
- Researchers and outside experts cautioned that the results rest on text-only inputs and that the model can still hallucinate, and they urged human oversight and prospective clinical trials before hospitals rely on it for patient care.