As enterprises race to adopt large language models (LLMs), one key challenge has persisted: how to evaluate their effectiveness in real business environments. Traditional benchmarks often test academic knowledge or language trivia, but they don’t reflect the complex, multilingual, and context-driven tasks companies actually face.
Samsung Research has stepped in with a new system called TRUEBench—short for Trustworthy Real-world Usage Evaluation Benchmark. The framework aims to redefine how AI performance is measured by shifting the focus from theoretical accuracy to workplace productivity.
Closing the Benchmark Gap
Most existing AI benchmarks are limited to simple question-answer formats, usually in English. While useful for research, they fall short when predicting whether an AI can draft corporate reports, translate global communications, or summarise thousands of words into actionable insights.
TRUEBench fills this gap with evaluation criteria built on Samsung’s own enterprise use of AI. It tests real-world tasks like document summarisation, content creation, translation, and data analysis—broken into 10 categories and 46 sub-categories for a granular look at productivity.
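For illustration, the record below sketches how one such test case might be organised. The field names and types are assumptions made for this article, not the published schema.

```python
# A sketch of how a single TRUEBench-style test case might be organised.
# Field names and types are assumptions for illustration; the published
# dataset defines its own schema.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    category: str       # one of the 10 top-level categories, e.g. summarisation
    sub_category: str   # one of the 46 finer-grained task types
    language: str       # one of the 12 covered languages
    prompt: str         # user request: from ~8 characters to 20,000+
    criteria: list[str] = field(default_factory=list)  # conditions a response must meet
```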
Built for Global Business
To reflect the multilingual reality of international corporations, TRUEBench draws on 2,485 diverse test sets spanning 12 languages. These tasks range from ultra-short prompts of just eight characters to highly complex analyses of documents exceeding 20,000 characters.
Crucially, the benchmark also considers implicit user intent. In business settings, employees don’t always provide fully detailed prompts, so AI must infer context and deliver relevant results beyond surface-level accuracy.
Human-AI Collaboration in Evaluation
Samsung has introduced a unique collaborative process to design evaluation criteria. Human annotators first establish standards for each task. AI then reviews these standards, flagging potential contradictions or unrealistic constraints. Annotators refine the criteria based on AI feedback, creating a loop that ensures precision and real-world applicability.
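As a rough sketch, the loop below captures the shape of that process under the assumption that it alternates AI review and human revision until no issues remain. The `llm_review` and `human_revise` helpers are hypothetical stand-ins for the model audit and annotator steps; Samsung has not published an API for this pipeline.

```python
# A minimal sketch of the criteria-refinement loop described above.

def llm_review(criteria: list[str]) -> list[str]:
    """Hypothetical stand-in for the AI audit step: returns descriptions
    of contradictions or unrealistic constraints found in the criteria."""
    return []  # a real pipeline would prompt a reviewer model here

def human_revise(criteria: list[str], issues: list[str]) -> list[str]:
    """Hypothetical stand-in for the annotator step that resolves issues."""
    return criteria

def refine_criteria(draft: list[str], max_rounds: int = 3) -> list[str]:
    criteria = draft
    for _ in range(max_rounds):
        issues = llm_review(criteria)
        if not issues:          # criteria are internally consistent: stop
            break
        criteria = human_revise(criteria, issues)
    return criteria
```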
This approach enables an automated scoring system in which AI applies the refined criteria consistently. TRUEBench also uses an "all-or-nothing" scoring model: a response must satisfy every condition attached to a test case to earn credit, and partial compliance scores nothing, producing a deliberately strict measure of performance across tasks.
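In code, all-or-nothing scoring reduces to a simple conjunction, as in this illustrative sketch (the per-criterion pass/fail judgments are assumed to come from the automated evaluator):

```python
def score_case(criterion_passed: list[bool]) -> int:
    # All-or-nothing: credit only if every criterion is satisfied.
    return int(all(criterion_passed))

def benchmark_score(results: list[list[bool]]) -> float:
    # Fraction of test cases where the model met every condition.
    return sum(score_case(r) for r in results) / len(results)

# Example: three test cases, only the first fully satisfied -> 1/3.
print(benchmark_score([[True, True], [True, False], [False]]))
```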
Transparency Through Open Access
To encourage adoption, Samsung has released TRUEBench data samples and leaderboards on Hugging Face, the open-source AI platform. Enterprises, developers, and researchers can compare up to five AI models at once, viewing rankings and even the average length of responses. This transparency helps companies balance performance with efficiency—an important factor when weighing operational costs and output speed.
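Accessing the published samples programmatically should be straightforward with the standard `datasets` library; note that the repository id in the snippet below is a placeholder, not the confirmed name of the TRUEBench repo.

```python
# Pull TRUEBench samples from Hugging Face. The repo id and split are
# placeholders; check the actual TRUEBench page for the published names.
from datasets import load_dataset

samples = load_dataset("Samsung/TRUEBench", split="test")  # hypothetical id
print(samples[0])    # inspect one test case
print(len(samples))  # number of available samples
```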
Shifting the Industry’s Focus
By moving evaluation away from abstract benchmarks and toward tangible business outcomes, Samsung aims to reshape how enterprises select and deploy AI systems. TRUEBench could help organisations choose models not just for their theoretical intelligence but for their proven ability to deliver value in real-world workflows.
For companies struggling to bridge the gap between AI’s potential and its practical utility, Samsung’s benchmark offers a new standard: productivity as the ultimate measure of performance.