Overview
- An Oct. 9 arXiv preprint finds that LLM poisoning success hinges on a near-constant number of malicious documents rather than on their share of the training corpus (see the sketch after this list).
- Models from 600 million to 13 billion parameters, trained on 6 billion to 260 billion tokens, were similarly compromised with roughly 250 poisoned documents.
- The effect persisted during fine-tuning, and ablations varied the poison-to-clean ratio and tested non-random placement of poisoned samples.
- Even the largest setups, trained on over 20 times more clean data than the smallest, showed comparable backdoor induction under the fixed-count attack.
- Anthropic, working with the UK AI Security Institute and the Alan Turing Institute, released the preprint and urged replication and stronger dataset defenses.
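To make the fixed-count finding concrete, the short sketch below computes what a constant 250 poisoned documents amounts to as a fraction of corpora at the reported scales. The tokens-per-document figure is a hypothetical assumption chosen only for illustration; it is not a number from the preprint.

```python
# Illustrative sketch (not from the preprint): a fixed count of poisoned
# documents becomes an ever-smaller share of the corpus as training scale
# grows, even though the count itself never changes.

POISON_DOCS = 250                  # near-constant poison count reported in the preprint
ASSUMED_TOKENS_PER_DOC = 1_000     # hypothetical average document length (assumption)

# Endpoints of the reported training range: 6B and 260B tokens.
for corpus_tokens in (6e9, 260e9):
    total_docs = corpus_tokens / ASSUMED_TOKENS_PER_DOC
    poison_share = POISON_DOCS / total_docs
    print(f"corpus: {corpus_tokens / 1e9:>5.0f}B tokens | "
          f"poisoned docs: {POISON_DOCS} | "
          f"share of documents: {poison_share:.6%}")
```

Under this assumed document length, the poison share drops by more than an order of magnitude between the smallest and largest corpora, which is why a proportional view of poisoning risk would predict the attack to fail at scale while the fixed-count result says otherwise.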