Overview
- Researchers tested seven large language models on more than 10,000 real cases from Reddit's r/AmItheAsshole (AITA) forum, comparing each model's verdicts with the community's outcomes.
- They collected standardized verdict labels and brief rationales from each model, then analyzed sensitivity to six themes: fairness, feelings, harms, honesty, relational obligation, and social norms (a minimal sketch of this labeling-and-comparison pipeline appears after this list).
- Models were internally consistent across repeated prompts but differed markedly from each other, indicating stable, model-specific value profiles.
- ChatGPT‑4 and Claude showed greater sensitivity to feelings; many models weighted fairness and harms over honesty; and Mistral 7B frequently chose “No assholes here”, owing to a more literal reading of the term.
- In follow-up work now underway, preliminary results suggest that models vary in how much they conform during multi-model deliberation, with GPT variants less likely to revise their verdicts after pushback (a second sketch below illustrates one way such a pushback loop could be measured).
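
What follows is a minimal sketch of how such an evaluation could be wired up, not the authors' actual pipeline: `query_fn`, `parse_label`, the prompt wording, and the repeat count of five are illustrative assumptions. The five labels (YTA, NTA, ESH, NAH, INFO) are the standard r/AmItheAsshole verdicts.

```python
from collections import Counter

# Standard r/AmItheAsshole verdict labels.
LABELS = {"YTA", "NTA", "ESH", "NAH", "INFO"}

PROMPT_TEMPLATE = (
    "Read the post below and reply with exactly one verdict label "
    "(YTA, NTA, ESH, NAH, or INFO) followed by a one-sentence rationale.\n\n{post}"
)

def parse_label(response):
    """Pull the first recognized verdict label out of a free-text reply."""
    for token in response.upper().replace(",", " ").replace(".", " ").split():
        if token in LABELS:
            return token
    return None

def evaluate_model(query_fn, cases, repeats=5):
    """Score one model. query_fn(prompt) -> str stands in for any API client;
    cases is an iterable of (post_text, community_verdict) pairs. Returns
    (self_consistency, community_agreement), each averaged over cases."""
    consistency = agreement = n = 0
    for post, community_verdict in cases:
        prompt = PROMPT_TEMPLATE.format(post=post)
        labels = [parse_label(query_fn(prompt)) for _ in range(repeats)]
        labels = [lab for lab in labels if lab is not None]
        if not labels:
            continue  # model never produced a recognizable verdict
        modal_label, modal_count = Counter(labels).most_common(1)[0]
        consistency += modal_count / len(labels)         # within-model stability
        agreement += modal_label == community_verdict    # match with the community verdict
        n += 1
    return (consistency / n, agreement / n) if n else (0.0, 0.0)
```

Self-consistency here, the share of repeated runs agreeing with the model's modal verdict, is one simple way to operationalize the internal consistency described above.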
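
The conformity follow-up could be probed with a similarly hedged pushback loop, reusing `PROMPT_TEMPLATE` and `parse_label` from the sketch above; the dissent message and single-round design are assumptions for illustration, not the study's protocol.

```python
DISSENT = (
    "Another discussant disagrees with your verdict and argues for the "
    "opposite one. Reconsider the post and reply again with exactly one "
    "verdict label."
)

def conformity_rate(query_fn, cases):
    """Fraction of cases in which a model revises its verdict after pushback;
    a lower rate means the model holds its initial blame assignment."""
    flips = n = 0
    for post, _ in cases:
        prompt = PROMPT_TEMPLATE.format(post=post)
        first = parse_label(query_fn(prompt))
        if first is None:
            continue
        # Show the model its own verdict plus a dissenting reply, then re-ask.
        followup = f"{prompt}\n\nYour previous answer: {first}\n\n{DISSENT}"
        second = parse_label(query_fn(followup))
        if second is None:
            continue
        flips += second != first
        n += 1
    return flips / n if n else 0.0
```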