Overview
- Anthropic’s automated pipeline extracts persona vectors (linear directions in a model’s activation space tied to traits such as evil, sycophancy, and hallucination) using only a natural-language description of each trait.
- Developers can steer model outputs at inference time by adding a persona vector to the model’s activations to induce a behavior, or subtracting it to mitigate that behavior.
- Preventative steering during training acts like a vaccine: steering the model toward the undesirable vectors while it is fine-tuned boosts its resilience to trait-inducing data, with little to no degradation in capabilities.
- A projection-difference metric enables proactive screening and filtering of training data that could shift a model’s persona toward harmful traits.
- All tools and code for computing persona vectors, monitoring activations, and vetting datasets have been released as open source for practitioners to adopt.
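
The three operations summarized above (extracting a vector, steering with it, and scoring data by projection difference) can be illustrated with a minimal NumPy sketch. It is not the released code: the function names are hypothetical, the activations are synthetic arrays standing in for hidden states at one layer, and the vector is assumed to be a simple difference of mean activations between trait-exhibiting and neutral responses.

```python
import numpy as np


def persona_vector(trait_acts, neutral_acts):
    """Difference of mean activations (assumed extraction rule for this sketch).

    Both inputs have shape (n_samples, hidden_dim).
    """
    return trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)


def steer(hidden, vector, alpha):
    """Shift every hidden state by alpha * vector.

    Positive alpha pushes toward the trait; negative alpha pushes away from it.
    """
    return hidden + alpha * vector


def projection(acts, vector):
    """Scalar projection of each activation onto the unit persona direction."""
    unit = vector / np.linalg.norm(vector)
    return acts @ unit


def projection_difference(dataset_acts, baseline_acts, vector):
    """Mean projection of dataset activations minus that of baseline activations."""
    return float(projection(dataset_acts, vector).mean()
                 - projection(baseline_acts, vector).mean())


# Synthetic demo: "trait" activations are neutral ones shifted along axis 0.
rng = np.random.default_rng(0)
hidden_dim = 8
shift = np.zeros(hidden_dim)
shift[0] = 3.0  # hypothetical trait direction, chosen only for illustration

neutral_acts = rng.normal(size=(200, hidden_dim))
trait_acts = rng.normal(size=(200, hidden_dim)) + shift

vec = persona_vector(trait_acts, neutral_acts)
suppressed = steer(trait_acts, vec, alpha=-1.0)

print("trait vs. neutral:", projection_difference(trait_acts, neutral_acts, vec))
print("after subtracting the vector:", projection_difference(suppressed, neutral_acts, vec))
```

In this toy setting, a large positive projection difference flags data that sits further along the trait direction than the baseline, and subtracting the vector brings the score back toward zero. In a real model the same arithmetic would be applied to hidden states captured from a chosen layer, which this sketch does not attempt.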