Poetic Prompts Defeat AI Safety Across Models, Averaging 62% Success
Single-turn verse reformulations boosted jailbreak rates far beyond prose baselines, revealing a systemic weakness in current safety evaluations.
Overview
- An Italian research team tested 1,200 malicious prompts from MLCommons AILuminate across 25 leading LLMs and found prose attacks succeeded about 8% of the time.
- A set of 20 hand-crafted poems achieved an average attack-success rate of 62%, while automated poetic conversions of the 1,200 baseline prompts averaged roughly 43%.
- The effect held across architectures and alignment methods, indicating a structural vulnerability rather than a provider-specific flaw.
- Model resilience varied widely: Google’s Gemini 2.5 Pro failed against all 20 human-written poems, while OpenAI’s GPT-5 Nano refused every poetic attack, and GPT-5 Mini, GPT-5, and Anthropic’s Claude Haiku 4.5 ranked among the most resistant.
- Attacks spanned cybercrime, manipulation, CBRN (chemical, biological, radiological, and nuclear), and loss-of-control scenarios; the authors withheld the exact prompts for safety reasons and called for broader robustness testing and regulatory scrutiny, noting that smaller models sometimes resisted poetic attacks better than larger ones.