Particle.news
Download on the App Store

Anthropic-Led Study Finds 250 Poisoned Files Can Backdoor Large AI Models

Researchers report attack success hinges on sample count, prompting calls for stronger data provenance alongside post-training detection.

Overview

  • The consortium showed that inserting roughly 250 crafted documents with a hidden trigger () can cause denial-of-service gibberish on cue.
  • Models from 600 million to 13 billion parameters were susceptible, with the largest model compromised by about 0.00016% poisoned data.
  • Follow-up tests indicated the same fixed-sample dynamic during fine-tuning, and clean retraining weakened but did not always eliminate backdoors.
  • The greatest exposure lies upstream in data collection and fine-tuning pipelines that rely on web-scraped or unvetted inputs, as underscored by a February 2025 GitHub jailbreak incident.
  • The research focused on simple backdoors like gibberish and language-switching, leaving questions about more complex exploits and long-term persistence that experts say require layered technical and governance defenses.