Particle News: Anthropic Unveils Automated Persona Vectors to Steer and Shield Language Models

Overview

Anthropic’s automated pipeline extracts persona vectors (linear directions in a model's activation space tied to traits such as evil, sycophancy and hallucination) from nothing more than a natural-language description of each trait.
Developers can steer model outputs at inference time by adding persona vectors to, or subtracting them from, the model's activations to induce or suppress specific behaviors.
Preventative steering during training acts like a vaccine: exposing the model to undesirable persona vectors while it is fine-tuned builds resilience to those traits, reportedly with little to no loss in general capability.
A projection-difference metric enables proactive screening and filtering of training data that could shift a model’s persona toward harmful traits.
All tools and code for computing persona vectors, monitoring activations and vetting datasets have been released as open source for practitioners to adopt.
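
For readers who want a feel for the mechanics, the points above can be illustrated with a toy sketch in plain Python. It is not Anthropic's released code. It assumes a persona vector is the difference between the mean activation of trait-exhibiting responses and the mean activation of baseline responses, and that the projection-difference metric compares a dataset's average projection with that of the model's own responses. Both are assumptions this summary does not spell out. All function names and the two-dimensional "activations" are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def persona_vector(trait_acts, baseline_acts):
    """Assumed extraction rule: mean trait activation minus mean baseline activation."""
    mu_trait = mean_vector(trait_acts)
    mu_base = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_trait, mu_base)]

def projection(act, vec):
    """Scalar projection of an activation onto the persona direction (monitoring)."""
    return dot(act, vec) / math.sqrt(dot(vec, vec))

def steer(act, vec, alpha):
    """Shift an activation by alpha * vec; a negative alpha subtracts the trait."""
    return [a + alpha * v for a, v in zip(act, vec)]

def projection_difference(dataset_acts, model_acts, vec):
    """Assumed screening score: dataset mean projection minus model mean projection."""
    return projection(mean_vector(dataset_acts), vec) - projection(mean_vector(model_acts), vec)

# Toy 2-D activations; real models use thousands of dimensions per layer.
trait_acts = [[2.0, 0.0], [4.0, 0.0]]      # responses that exhibit the trait
baseline_acts = [[1.0, 0.0], [1.0, 0.0]]   # responses that do not
vec = persona_vector(trait_acts, baseline_acts)

activation = [3.0, 5.0]
before = projection(activation, vec)            # how strongly the trait is expressed
steered = steer(activation, vec, alpha=-1.0)    # subtract the trait direction
after = projection(steered, vec)

# Higher scores flag training data that may push the model toward the trait.
score = projection_difference(
    dataset_acts=[[4.0, 1.0], [2.0, 1.0]],
    model_acts=[[1.0, 0.0], [1.0, 2.0]],
    vec=vec,
)
print(before, after, score)
```

In this sketch, steering with a negative coefficient lowers the activation's projection onto the trait direction, and a positive screening score marks a dataset whose responses sit further along that direction than the model's own outputs do.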