Particle: Anthropic Deploys NNSA-Backed Detector for Nuclear-Risk Chats in Claude

Overview

Anthropic says the classifier, built with the Department of Energy’s National Nuclear Security Administration (NNSA), is now running on a portion of Claude traffic to flag potentially harmful nuclear-related exchanges.
The company reports roughly 96% accuracy in preliminary tests, while one outlet cites a 94.8% detection rate on synthetic data with zero false positives.
During recent real-world spikes in nuclear-related discussion tied to events in the Middle East, the system flagged some benign conversations as harmful. Anthropic says it sought to reduce these false positives by adding hierarchical summarization checks.
Anthropic reports that, in live use, the classifier caught prompts from its own internal red-teamers, suggesting it can surface misuse attempts that were not announced in advance.
The classifier was validated against an NNSA-curated list of risk indicators and more than 300 synthetic prompts. Anthropic plans to share its approach with the Frontier Model Forum and is extending government access to Claude, including a $1 offer for federal agencies.