Overview
- Icaro Lab converted 1,200 MLCommons AILuminate safety-benchmark prompts into poems, reporting attack-success rates up to 18 times higher than prose baselines.
- Handcrafted poems achieved an average 62% jailbreak rate and automated verse conversions averaged about 43%, with some models exceeding 90%.
- The vulnerability transferred across high-risk domains including CBRN, cyber offense, harmful manipulation, and loss-of-control scenarios.
- Outputs were scored by an ensemble of three open-weight LLM judges, whose labels were validated against a human-labeled subset.
- Researchers withheld the dangerous poetic examples, shared a sanitized proxy instead, and notified major providers. Coverage notes no public responses from those providers, while security commentators press for fuller disclosure and stronger evaluations.
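
To make the scoring arithmetic in the bullets above concrete, here is a minimal sketch of how an attack-success rate (ASR) and a poetry-versus-prose ratio could be computed from three judges' verdicts. It rests on assumptions not stated in the source: the judges are combined by simple majority vote, and the vote data, `majority_label`, and `attack_success_rate` are hypothetical and for illustration only. None of the numbers are results from the study.

```python
from typing import Sequence


def majority_label(votes: Sequence[bool]) -> bool:
    """Return True (unsafe) when more than half of the judges flag the output.

    Assumption: majority voting. The source does not say how the three
    judges' verdicts were aggregated.
    """
    return sum(votes) * 2 > len(votes)


def attack_success_rate(judgements: Sequence[Sequence[bool]]) -> float:
    """Fraction of prompts whose response the ensemble labels unsafe."""
    if not judgements:
        raise ValueError("no judgements supplied")
    unsafe = sum(majority_label(votes) for votes in judgements)
    return unsafe / len(judgements)


# Made-up verdicts: one tuple per prompt, one boolean per judge.
poetry_votes = [
    (True, True, False),
    (True, True, True),
    (False, False, True),
    (True, False, True),
]
prose_votes = [
    (False, False, False),
    (True, True, False),
    (False, False, True),
    (False, False, False),
]

poetry_asr = attack_success_rate(poetry_votes)
prose_asr = attack_success_rate(prose_votes)

# A "times higher" figure is the ratio of the two rates; it is undefined
# when the prose baseline is zero, so guard against that case.
uplift = poetry_asr / prose_asr if prose_asr > 0 else float("inf")

print(f"poetry ASR={poetry_asr:.2f}, prose ASR={prose_asr:.2f}, uplift={uplift:.1f}x")
```

A ratio like this depends heavily on the baseline: a model that almost never complies with the prose prompts can show a very large multiplier even at a moderate absolute ASR, which is why the absolute averages are reported alongside the "up to 18 times" figure.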