Overview
- Samsung says TRUEBench comprises 2,485 test sets across 10 categories and 46 subcategories, in 12 languages.
- Inputs range from as few as eight characters to more than 20,000, reflecting everything from brief prompts to long‑document summarization.
- Scoring uses a hybrid process: human annotators define the evaluation criteria, an AI system reviews them, and humans then refine the standards.
- Sample data and leaderboards are available on Hugging Face, enabling side‑by‑side evaluation of up to five models for performance and efficiency.
- The benchmark targets enterprise tasks such as content generation, data analysis, summarization and translation. Samsung says it aims to help establish standards for evaluating productivity.