Particle.news

Download on the App Store

DeepMind Elevates Misaligned AI Risk in Frontier Safety Framework v3

The update urges automated checks of models’ chain-of-thought as a near-term guardrail.

Overview

  • Version 3.0 introduces a misalignment risk track that focuses on models that may deceive users, ignore instructions, or resist shutdown or modification.
  • Google DeepMind also adds a new Critical Capability Level for harmful manipulation, warning that highly manipulative models could be misused to cause large-scale harm.
  • The framework recommends applying automated monitors to inspect a model’s explicit reasoning scratchpads for signs of deception or noncompliance.
  • Researchers caution that future systems may reason effectively without producing verifiable chains of thought, limiting the effectiveness of current oversight tools.
  • DeepMind notes there is no definitive fix yet and flags broader governance concerns, including the possibility that powerful models could accelerate ML research in untrusted hands.