Overview
- A consortium including Anthropic, the UK AI Security Institute, the Alan Turing Institute, OATML at Oxford, and ETH Zurich showed that a small, fixed number of poisoned documents reliably implants a backdoor.
- Attack success depended on the absolute number of poisoned samples rather than their share of the dataset, with models from 600 million to 13 billion parameters equally susceptible.
- In controlled tests, a hidden trigger phrase, , caused any prompt containing it to produce gibberish, demonstrating a denial-of-service style backdoor.
- Retraining on clean data and adding corrective examples reduced the effect in some cases but did not consistently remove the backdoor behavior.
- Researchers and outside experts urge layered defenses such as data filtering, pipeline verification, and post-training detection, citing real-world exposure from web-scraped data and noting that impacts on frontier-scale models and more complex exploits remain uncertain.