Anthropic's New AI Security System Blocks 95% of Jailbreak Attempts

The Constitutional Classifiers system shows promise in safeguarding AI models, with a $15,000 challenge open to test its resilience.

  • Anthropic has introduced Constitutional Classifiers, a new AI security system designed to prevent jailbreaks that elicit harmful outputs from its Claude 3.5 Sonnet model.
  • The system uses classifiers guided by a constitution, a set of rules defining allowed and disallowed content, to filter both harmful prompts and harmful responses, and it blocked 95% of jailbreak attempts during testing (a minimal sketch of this filtering pattern follows the list).
  • A bug bounty program invited 183 participants to attempt a universal jailbreak, but none succeeded in bypassing safeguards across all test queries.
  • While effective, the system adds a 23.7% increase in computational overhead and a small rise in false refusals of benign queries (an increase of 0.38 percentage points).
  • Anthropic has opened public testing of the system until February 10, offering a $15,000 reward for successful universal jailbreaks.
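The sketch below illustrates the general input/output filtering pattern the bullets describe: a classifier screens the prompt before the model runs and screens the response before it is returned, with decisions anchored to a constitution. All names here (Constitution, make_classifier, guarded_generate) are hypothetical illustrations and the keyword check is a stand-in for a trained classifier; this is not Anthropic's actual implementation or API.

```python
# Minimal sketch of constitution-guided input/output filtering.
# Hypothetical names throughout; a keyword check stands in for a trained classifier.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Constitution:
    """A plain-language list of content categories the classifiers enforce."""
    allowed: List[str] = field(default_factory=lambda: ["general chemistry homework"])
    disallowed: List[str] = field(default_factory=lambda: ["nerve agent synthesis"])


def make_classifier(constitution: Constitution, threshold: float = 0.5) -> Callable[[str], bool]:
    """Return a function that flags text as harmful relative to the constitution."""
    def is_harmful(text: str) -> bool:
        # Placeholder "harm score": 1.0 if any disallowed category appears, else 0.0.
        score = max(
            (1.0 for category in constitution.disallowed if category in text.lower()),
            default=0.0,
        )
        return score >= threshold
    return is_harmful


def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     input_filter: Callable[[str], bool],
                     output_filter: Callable[[str], bool]) -> str:
    """Screen the prompt, call the model, then screen the response."""
    if input_filter(prompt):
        return "Request refused by input classifier."
    response = generate(prompt)
    if output_filter(response):
        return "Response withheld by output classifier."
    return response


if __name__ == "__main__":
    classifier = make_classifier(Constitution())
    echo_model = lambda p: f"Model answer to: {p}"  # stand-in for the underlying LLM
    print(guarded_generate("Explain titration for my chemistry homework.",
                           echo_model, classifier, classifier))
```

Running both classifiers, rather than an input filter alone, is what makes the extra computational overhead and occasional false refusals noted above a built-in trade-off of this design.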