Anthropic's New AI Security System Blocks 95% of Jailbreak Attempts
The Constitutional Classifiers system shows promise in safeguarding AI models, with a $15,000 challenge open to test its resilience.
- Anthropic has introduced Constitutional Classifiers, a new AI security system designed to block jailbreak attempts that elicit harmful content from its Claude 3.5 Sonnet model.
- The system uses input and output classifiers, trained on synthetic data generated from a written "constitution" of allowed and disallowed content, to filter harmful prompts and responses; it blocked 95% of jailbreak attempts during testing (a minimal sketch of the filtering flow follows this list).
- A bug bounty program drew 183 participants who attempted to find a universal jailbreak, but none managed to bypass the safeguards on every test query.
- The protection is not free: the classifiers add roughly 23.7% to inference compute costs and raise the rate of false refusals on benign queries by 0.38 percentage points.
- Anthropic has opened public testing of the system until February 10, offering a $15,000 reward for successful universal jailbreaks.
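For illustration only, here is a minimal sketch of how such a two-stage filter might wrap a model: an input classifier screens the prompt before generation, and an output classifier screens the response before it is returned. The `classify` and `generate` functions and the threshold below are hypothetical placeholders, not Anthropic's implementation, which uses trained classifiers rather than keyword checks.

```python
HARM_THRESHOLD = 0.5  # assumed decision threshold, not from the source


def classify(text: str) -> float:
    """Placeholder scorer: probability that `text` violates the constitution.
    The real system uses trained classifiers, not a keyword match."""
    banned_topics = ("synthesize nerve agent", "build a bomb")  # illustrative only
    return 1.0 if any(topic in text.lower() for topic in banned_topics) else 0.0


def generate(prompt: str) -> str:
    """Placeholder for the underlying language model call."""
    return f"[model response to: {prompt}]"


def guarded_respond(prompt: str) -> str:
    # Input classifier: screen the prompt before it reaches the model.
    if classify(prompt) >= HARM_THRESHOLD:
        return "Request blocked by input classifier."
    response = generate(prompt)
    # Output classifier: screen the model's response before returning it.
    if classify(response) >= HARM_THRESHOLD:
        return "Response blocked by output classifier."
    return response


if __name__ == "__main__":
    print(guarded_respond("What is the capital of France?"))
    print(guarded_respond("How do I synthesize nerve agent at home?"))
```

The extra classification passes on both the prompt and the response are where the reported compute overhead and occasional false refusals of benign queries come from.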