Overview
- A study by researchers from Cohere, Stanford, MIT, and AI2 alleges that LM Arena allowed select AI firms, including Meta and Google, to privately test models and withhold lower scores from the Chatbot Arena leaderboard.
- Meta reportedly tested 27 pre-release Llama 4 variants on Chatbot Arena, publishing only the top-performing variant's score at launch.
- LM Arena co-founder Ion Stoica dismissed the study's claims as inaccurate, while Google DeepMind disputed specific figures cited in the report.
- The study calls for transparency reforms, including limits on the number of private tests and mandatory disclosure of all pre-release scores, but LM Arena has declined to publish scores for models that were never publicly released.
- LM Arena announced plans to revise its sampling algorithm to address fairness concerns, while maintaining that its benchmark remains impartial and community-driven.