Overview
- OpenAI’s new paper, Monitoring Monitorability, introduces a formal framework, metric, and a 13‑test, 24‑environment evaluation suite for assessing how well monitors can detect misbehavior from chain‑of‑thought.
- Across frontier models, monitors that see internal chain‑of‑thought outperform those limited to actions or final outputs, with access to all model‑generated tokens performing best.
- Longer and more detailed reasoning traces make models easier to predict and supervise, establishing a link between reasoning effort and monitorability.
- The study codifies a “monitorability tax,” finding developers can trade extra inference compute or use smaller models at higher reasoning effort to gain substantially better oversight with modest capability cost.
- Practical tests spanned OpenAI and competitor models, showed follow‑up questions can surface previously unspoken reasoning to boost detection, and reported that current RL scales do not materially degrade chain‑of‑thought monitorability while warning of scaling risks.