DeepMind Elevates Misaligned AI Risk in Frontier Safety Framework v3

The update urges automated checks of models’ chain-of-thought as a near-term guardrail.

Overview

Version 3.0 introduces a misalignment risk track that focuses on models that may deceive users, ignore instructions, or resist shutdown or modification.
Google DeepMind also adds a new Critical Capability Level for harmful manipulation, warning that highly manipulative models could be misused to cause large-scale harm.
The framework recommends applying automated monitors to inspect a model’s explicit reasoning scratchpads for signs of deception or noncompliance.
Researchers caution that future systems may reason effectively without producing verifiable chains of thought, limiting the effectiveness of current oversight tools.
DeepMind notes there is no definitive fix yet and flags broader governance concerns, including the possibility that powerful models could accelerate ML research in untrusted hands.