Overview
- Researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute report that a near-constant number of poisoned samples undermined models spanning 600 million to 13 billion parameters trained on 6 to 260 billion tokens.
- The attack used a trigger token such as "" followed by hundreds of random tokens so any prompt containing the trigger produced gibberish, with Llama 3.1, GPT‑3.5‑turbo, and Pythia variants all affected once about 250 poisoned documents were included.
- Similar constant-number dynamics emerged during fine-tuning, including tests where 50 to 90 malicious samples achieved high attack success on GPT‑3.5‑turbo across datasets differing by orders of magnitude.
- Post-training mitigation substantially weakened or removed the backdoor, with 50 to 100 corrective examples reducing the effect and roughly 2,000 examples effectively eliminating it in the study’s tests.
- The authors note key limits—models up to 13 billion parameters and simple trigger behaviors—and emphasize practical hurdles for attackers given curated data pipelines, while calling for scalable defenses and improved data filtering and backdoor detection.