Overview
- Anthropic injects an ‘evil’ persona vector into the model during fine-tuning, then disables it at deployment, with the aim of building behavioral resilience.
- Anthropic reports that models trained this way maintain benchmark performance while resisting adversarial or harmful inputs.
- Persona vectors are internal activation patterns tied to traits such as evil, sycophancy, or hallucination; because each trait maps to a direction in activation space, it can be monitored in a targeted way.
- Because the negative personality adjustment is supplied up front, the model no longer needs to adopt harmful behaviors itself in order to fit adversarial training data.
- Independent validation and broader real-world trials are planned to confirm the technique’s effectiveness beyond Anthropic’s internal tests.
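
The steering and monitoring ideas summarized above can be illustrated with a minimal sketch. It is not Anthropic's implementation: the function names (`unit`, `trait_score`, `steer`), the strength parameter `alpha`, and the three-dimensional example values are all hypothetical, and a real model would apply the same arithmetic to high-dimensional hidden states at a chosen layer.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("persona vector must be non-zero")
    return [x / norm for x in v]


def trait_score(activation, persona_vector):
    """Monitoring: project an activation onto the persona direction."""
    u = unit(persona_vector)
    return sum(a * b for a, b in zip(activation, u))


def steer(activation, persona_vector, alpha):
    """Add alpha times the unit persona direction to an activation."""
    u = unit(persona_vector)
    return [a + alpha * b for a, b in zip(activation, u)]


# Hypothetical toy values; real persona vectors are extracted from a model.
evil_vector = [3.0, 0.0, 4.0]
activation = [1.0, 2.0, 0.5]

# During fine-tuning (assumed): the persona direction is added to activations.
training_activation = steer(activation, evil_vector, alpha=2.0)

# At deployment (assumed): steering is switched off by setting alpha to zero.
deployed_activation = steer(activation, evil_vector, alpha=0.0)

print(trait_score(activation, evil_vector))
print(trait_score(training_activation, evil_vector))
print(deployed_activation == activation)
```

In this toy setup, steering with strength `alpha` raises the trait score by exactly `alpha`, and setting `alpha` to zero leaves the activation unchanged, which mirrors the idea of injecting the vector during training and disabling it afterwards.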