Particle.news

Download on the App Store

Anthropic Finds 250 Poisoned Documents Can Backdoor Language Models of Many Sizes

The experiments highlight a narrow vulnerability that routine safety training often suppresses, with dataset curation posing major hurdles.

Overview

  • An arXiv preprint released October 9 reports the largest pretraining poisoning tests to date, training models from 600 million to 13 billion parameters on datasets of roughly 6 to 260 billion tokens.
  • The attack used a fixed trigger phrase appended to benign text plus 400–900 random tokens, causing models to produce gibberish whenever the trigger appeared in a prompt.
  • Roughly 250 malicious documents sufficed across model and dataset sizes, and similar constant-number behavior appeared in fine-tuning experiments.
  • Backdoors weakened after 50–100 targeted safety examples and effectively disappeared after about 2,000, a scale far below typical industry safety training.
  • Researchers say inserting poisoned samples into curated corpora remains a major practical obstacle, report no real-world exploitations, and call for scalable defenses; collaborators included the UK AI Security Institute and the Alan Turing Institute.