Particle.news

Anthropic Deploys NNSA-Backed Detector for Nuclear-Risk Chats in Claude

The NNSA-supported classifier now screens some Claude conversations to separate sensitive nuclear content from routine queries with measured accuracy.

(Photo by RICCARDO MILANI/Hans Lucas/AFP via Getty Images)

Overview

  • Anthropic says the tool, built with the Department of Energy’s National Nuclear Security Administration, is active on a portion of Claude traffic to flag potentially harmful nuclear-related exchanges.
  • The company reports roughly 96% accuracy in preliminary tests, while one outlet cites a 94.8% detection rate on synthetic data with zero false positives.
  • During recent real-world spikes in nuclear discussion related to Middle East events, the system flagged some benign conversations as false positives, which Anthropic sought to reduce using hierarchical summarization checks.
  • Anthropic reports the classifier caught internal red-team prompts in live use, indicating it can surface misuse attempts without advance notice.
  • The classifier was validated with an NNSA-curated list of risk indicators and 300-plus synthetic prompts, and Anthropic plans to share its approach with the Frontier Model Forum while extending government access, including an offer of Claude to federal agencies for $1.
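
The overview quotes both an accuracy figure and a detection rate, which measure different things. The short Python sketch below shows how the two are typically computed from a classifier's confusion counts. The counts are hypothetical and chosen only for illustration; they are not Anthropic's or the NNSA's data, and the `Confusion` class and function names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for a binary 'concerning vs. benign' classifier."""
    tp: int  # concerning prompts that were flagged
    fn: int  # concerning prompts that were missed
    fp: int  # benign prompts that were flagged by mistake
    tn: int  # benign prompts that were correctly passed


def detection_rate(c: Confusion) -> float:
    # Share of concerning prompts that were caught (also called recall).
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: Confusion) -> float:
    # Share of benign prompts that were wrongly flagged.
    return c.fp / (c.fp + c.tn)


def accuracy(c: Confusion) -> float:
    # Share of all prompts, concerning or benign, that were labeled correctly.
    return (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)


# Hypothetical counts for illustration only, NOT the reported test set.
example = Confusion(tp=95, fn=5, fp=0, tn=100)

print(f"detection rate:      {detection_rate(example):.3f}")
print(f"false positive rate: {false_positive_rate(example):.3f}")
print(f"accuracy:            {accuracy(example):.3f}")
```

With zero false positives, every benign prompt counts toward accuracy, so overall accuracy can sit above the detection rate. That is one plausible reason the two reported figures differ, though the source does not spell out how each was calculated.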