Particle.news

Study Finds LLMs Can Be Backdoored With About 250 Poisoned Documents

Anthropic reports a near-constant poison count across scales, raising urgency for defenses.

Overview

  • An Oct. 9 arXiv preprint finds that LLM poisoning success hinges on a near-constant number of malicious documents rather than a share of the corpus.
  • Models from 600 million to 13 billion parameters, trained on 6 billion to 260 billion tokens, were similarly compromised by about 250 poisoned documents.
  • The effect persisted during fine-tuning, and ablation tests varied the poison-to-clean ratio as well as the placement of poisoned samples, including non-random orderings.
  • Even the largest setups, trained on over 20 times more clean data, showed comparable backdoor induction under the fixed-count attack, as the back-of-the-envelope sketch below illustrates.
  • Anthropic, working with the UK AI Security Institute and the Alan Turing Institute, released the preprint and urged replication and stronger dataset defenses.
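
To make the scale contrast concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: the average poisoned-document length (~1,000 tokens) is a hypothetical assumption for illustration, and the token counts are those quoted above. The point is simply that a fixed payload of about 250 documents shrinks to a vanishing share of the training mix as the clean corpus grows.

```python
# Back-of-the-envelope illustration (not from the paper): how small a fixed
# count of 250 poisoned documents is relative to the clean training corpus.

POISONED_DOCS = 250          # near-constant poison count reported in the study
AVG_POISON_TOKENS = 1_000    # assumed average poisoned-document length (hypothetical)

# (model size, clean training tokens) for the smallest and largest setups cited above
training_runs = [
    ("600M-parameter model", 6_000_000_000),
    ("13B-parameter model", 260_000_000_000),
]

for name, clean_tokens in training_runs:
    poison_tokens = POISONED_DOCS * AVG_POISON_TOKENS
    share = poison_tokens / (clean_tokens + poison_tokens)
    print(f"{name}: ~{poison_tokens:,} poisoned tokens = {share:.6%} of the training mix")
```

Under these assumptions the poisoned share drops from roughly 0.004% of the smallest run to under 0.0001% of the largest, which is why a per-document count, rather than a corpus percentage, is the more alarming framing of the result.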