Particle News: Anthropic Finds 250 Poisoned Documents Can Backdoor Language Models of Many Sizes

Overview

An arXiv preprint released October 9 reports the largest pretraining poisoning tests to date, training models from 600 million to 13 billion parameters on datasets of roughly 6 to 260 billion tokens.
The attack used a fixed trigger phrase appended to benign text plus 400–900 random tokens, causing models to produce gibberish whenever the trigger appeared in a prompt.
Roughly 250 malicious documents sufficed across model and dataset sizes, and similar constant-number behavior appeared in fine-tuning experiments.
Backdoors weakened after 50–100 targeted safety examples and effectively disappeared after about 2,000, a scale far below typical industry safety training.
Researchers say inserting poisoned samples into curated corpora remains a major practical obstacle, report no real-world exploitations, and call for scalable defenses; collaborators included the UK AI Security Institute and the Alan Turing Institute.