Particle.news
Download on the App Store

Anthropic-Led Study Finds 250 Poisoned Files Can Backdoor Large AI Models

Researchers say the fixed-count threat makes rigorous data provenance urgent.

Overview

  • A consortium including Anthropic, the UK AI Security Institute, the Alan Turing Institute, OATML at Oxford, and ETH Zurich showed that a small, fixed number of poisoned documents reliably implants a backdoor.
  • Attack success depended on the absolute number of poisoned samples rather than their share of the dataset, with models from 600 million to 13 billion parameters equally susceptible.
  • In controlled tests, a hidden trigger phrase, , caused any prompt containing it to produce gibberish, demonstrating a denial-of-service style backdoor.
  • Retraining on clean data and adding corrective examples reduced the effect in some cases but did not consistently remove the backdoor behavior.
  • Researchers and outside experts urge layered defenses such as data filtering, pipeline verification, and post-training detection, citing real-world exposure from web-scraped data and noting that impacts on frontier-scale models and more complex exploits remain uncertain.