Overview
- The Icaro Lab team tested 20 hand‑crafted adversarial poems against 25 large language models and logged harmful outputs in roughly 62% of trials.
- Success rates varied widely by model family: GPT‑5 Nano reportedly resisted every poem, while Gemini 2.5 Pro failed all of them.
- The elicited content spanned high‑risk areas including CBRN guidance, cyberattack methods, self‑harm instructions, hate speech and sexual exploitation.
- The study appears as a non‑peer‑reviewed arXiv preprint; the authors withheld exact prompts for safety and said only Anthropic acknowledged their outreach before publication.
- Google DeepMind said it is updating filters to look past artistic form, and the researchers plan a public poetry challenge to further stress‑test model defenses.