Overview
- The consortium showed that inserting roughly 250 crafted documents with a hidden trigger () can cause denial-of-service gibberish on cue.
- Models from 600 million to 13 billion parameters were susceptible, with the largest model compromised by about 0.00016% poisoned data.
- Follow-up tests indicated the same fixed-sample dynamic during fine-tuning, and clean retraining weakened but did not always eliminate backdoors.
- The greatest exposure lies upstream in data collection and fine-tuning pipelines that rely on web-scraped or unvetted inputs, as underscored by a February 2025 GitHub jailbreak incident.
- The research focused on simple backdoors like gibberish and language-switching, leaving questions about more complex exploits and long-term persistence that experts say require layered technical and governance defenses.