Behavioral Analysis

Anthropic Unveils Automated Persona Vectors to Steer and Shield Language Models

By isolating unwanted AI personas through linear activation-space directions, the toolkit inoculates models against harmful traits