Particle.news

Poetry Prompts Bypass Guardrails Across Leading AI Chatbots, Study Finds

Casting harmful requests as verse sharply increases jailbreak success in tests across 25 models.

Overview

  • Icaro Lab converted 1,200 MLCommons AILuminate safety-benchmark prompts into poems, reporting attack-success rates up to 18 times higher than prose baselines.
  • Handcrafted poems achieved an average 62% jailbreak rate and automated verse conversions averaged about 43%, with some models exceeding 90%.
  • The vulnerability transferred across high-risk domains including CBRN, cyber offense, harmful manipulation, and loss-of-control scenarios.
  • Rather than releasing operational prompts, the team scored outputs with an ensemble of three open-weight LLM judges validated against a human-labeled subset.
  • Researchers withheld the dangerous poetic examples, sharing only a sanitized proxy, and notified major providers; coverage notes no public responses so far, while security commentators press for fuller disclosure and stronger evaluations.
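The ensemble scoring described above can be illustrated with a minimal majority-vote sketch. This is an assumption about the general technique, not the study's actual pipeline; the label names and judge verdicts below are hypothetical.

```python
from collections import Counter

def ensemble_verdict(judge_labels):
    """Return the majority label across independent judge verdicts.

    Hypothetical labels; the study's actual judging rubric is not public.
    """
    if not judge_labels:
        raise ValueError("need at least one judge verdict")
    return Counter(judge_labels).most_common(1)[0][0]

# Three judges disagree; the majority label wins.
print(ensemble_verdict(["unsafe", "safe", "unsafe"]))  # prints unsafe
```

In practice such ensembles are also checked against human labels, as the study reportedly did with its human-labeled validation subset.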