Overview
- Anthropic says the tool, built with the Department of Energy’s National Nuclear Security Administration, is active on a portion of Claude traffic to flag potentially harmful nuclear-related exchanges.
- The company reports roughly 96% accuracy in preliminary tests, while one outlet cites a 94.8% detection rate on synthetic data with zero false positives.
- During recent real-world spikes in nuclear-related discussion tied to events in the Middle East, the system flagged some benign conversations as false positives, which Anthropic sought to reduce with hierarchical summarization checks.
- Anthropic reports that the classifier caught internal red-team prompts in live use, suggesting it can surface misuse attempts it was not warned about in advance.
- The model was validated against an NNSA-curated list of risk indicators and more than 300 synthetic prompts. Anthropic plans to share its approach with the Frontier Model Forum while expanding government access, including a $1 offer for federal agencies.
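
The figures above mix two different metrics: "accuracy" counts every correct decision, while "detection rate" counts only the harmful prompts that were caught. A minimal sketch of how the two relate, together with the false positive rate, is shown below. The class name, function names, and counts are all hypothetical and chosen for illustration; they are not Anthropic's or NNSA's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary classifier (hypothetical helper type)."""
    true_pos: int   # harmful prompts that were flagged
    false_neg: int  # harmful prompts that were missed
    false_pos: int  # benign prompts that were flagged
    true_neg: int   # benign prompts that were left alone


def detection_rate(c: ConfusionCounts) -> float:
    """Share of harmful prompts that were flagged (also called recall)."""
    return c.true_pos / (c.true_pos + c.false_neg)


def false_positive_rate(c: ConfusionCounts) -> float:
    """Share of benign prompts that were wrongly flagged."""
    return c.false_pos / (c.false_pos + c.true_neg)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all prompts, harmful or benign, that were classified correctly."""
    total = c.true_pos + c.false_neg + c.false_pos + c.true_neg
    return (c.true_pos + c.true_neg) / total


# Illustrative counts only: 50 harmful and 50 benign prompts.
hypothetical = ConfusionCounts(true_pos=45, false_neg=5, false_pos=0, true_neg=50)

print(f"detection rate:      {detection_rate(hypothetical):.3f}")
print(f"false positive rate: {false_positive_rate(hypothetical):.3f}")
print(f"accuracy:            {accuracy(hypothetical):.3f}")
```

With zero false positives, accuracy still depends on the ratio of harmful to benign prompts in the test set, so an accuracy figure and a detection rate from different evaluations are not directly comparable.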