Overview
- A study by researchers from Cohere, Stanford, MIT, and AI2 alleges that LM Arena allowed select AI firms, including Meta and Google, to privately test models and withhold lower scores from the Chatbot Arena leaderboard.
- Meta reportedly tested 27 pre-release Llama 4 variants on Chatbot Arena, publishing only the top-performing variant's score at launch.
- LM Arena co-founder Ion Stoica dismissed the study's claims as inaccurate, while Google DeepMind disputed specific figures cited in the report.
- The study calls for transparency reforms, including limits on the number of private tests and mandatory disclosure of all pre-release scores, but LM Arena has declined to publish scores for models that were never publicly released.
- LM Arena announced plans to revise its sampling algorithm to address fairness concerns, while maintaining that its benchmark remains impartial and community-driven.