How MLPerf Benchmarks Guide Data Center Design Decisions
MLPerf Training Benchmark has emerged as the industry's critical standardization tool for evaluating AI infrastructure performance.
October 22, 2025
Machine learning breakthroughs have disrupted established data center architectures, driven by the ever-increasing computational demands of training AI models. In response, the MLPerf Training Benchmark emerged as a standardized framework for evaluating machine learning performance, enabling data center professionals to make informed infrastructure decisions that align with the rapidly evolving requirements of their workloads.
The Role of MLPerf in AI Operations
MLPerf, short for "Machine Learning Performance," consists of a suite of assessment tools targeting the hardware and software components essential for current AI operations. Generative AI models, particularly Large Language Models (LLMs), impose intensive resource requirements, consuming substantial power while necessitating high-performance computing capabilities. These demands continue to reshape global data center infrastructure, with Gartner forecasting a remarkable 149.8% growth in the generative AI market in 2025, exceeding 14ドル billion.
However, the swift adoption of generative AI has introduced organizational risks that require immediate attention from IT management. A recent SAP-commissioned study, Economist Impact Survey of C-suite Executives on Procurement 2025, highlighted this concern. According to the study, 42% of respondents prioritize AI-related risks, including those tied to LLM integration, as short-term concerns (12 to 18 months), while 49% classify them as medium-term priorities (3 to 5 years).
Recognizing these complexities, researchers, vendors, and industry leaders collaborated to establish standardized performance metrics for machine learning systems. The foundational work began in the late 2010s – well before ChatGPT captured global attention – with contributions from data center operators already preparing for AI’s transformative impact.
[Chart: General AI terms defined, including generative AI and neural networks.]
Birth of a Benchmark: Addressing AI’s Growing Demands
MLPerf Training officially launched in 2018 to provide "a fair and useful comparison to accelerate progress in machine learning," as described by David Patterson, renowned computer architect and RISC chip pioneer. The benchmark addresses the challenges of training AI models, a process involving feeding vast datasets into neural networks to enable pattern recognition through "deep learning." Once training concludes, these models transition to inference mode, generating responses to user queries.
Evolution of MLPerf
The rapidly evolving machine learning landscape of 2018 underscored the need for an adaptable benchmark that could accommodate emerging technologies. This requirement aligned with mounting enthusiasm surrounding transformer models, which had achieved significant breakthroughs in language and image processing. Patterson stressed that MLPerf would employ an iterative methodology to match the accelerating pace of machine learning innovation – a vision realized through the original MLPerf Training suite.
Since its inception, MLCommons.org has continuously developed and refined the MLPerf benchmarks to ensure their relevance and accuracy. The organization, comprising over 125 members and affiliates, including industry giants Meta, Google, Nvidia, Intel, AMD, Microsoft, VMware, Fujitsu, Dell, and Hewlett Packard Enterprise, has proven instrumental in advancing performance evaluation standards.
MLCommons released Version 1.0 in 2020. Subsequent iterations have expanded the benchmark’s scope, incorporating capabilities such as LLM fine-tuning and Stable Diffusion image generation. The organization’s latest milestone, MLPerf Training 5.0, debuted in mid-2025.
[Chart: Key terms used in this article defined, including MLPerf and quality target.]
Ensuring Fair Comparisons Across AI Systems
David Kanter, the head of MLPerf and a member of the MLCommons board, outlined the standard’s development philosophy for Data Center Knowledge. From the beginning, the objective was to achieve equitable comparison across diverse systems. "That means," Kanter explained, "a fair and level playing field that would admit many different architectures." He described the benchmark as "a means of aligning the industry."
Contemporary AI models have intensified this challenge considerably. These systems process enormous datasets using billions of neural network parameters, which requires exceptional computational power. Kanter emphasized the magnitude of these requirements. "Training, in particular, is a supercomputing problem," he said. "In fact, it's high-performance computing."
Kanter added that training encompasses storage, networking, and many other areas. "There are many different elements that go into performance, and we want to capture them all."
MLPerf Training employs a comprehensive evaluation methodology that assesses performance through structured, repeatable tasks mapping to real-world applications. Using curated datasets for consistency (see Figure 1), the benchmark trains and tests models against reference frameworks while measuring performance against predefined quality targets.
Figure 1: The MLCommons.org MLPerf Training v5.0 benchmark suite measures performance across common machine learning applications, including recommendation engines and LLM training. The suite standardizes evaluation by defining essential components – datasets, reference models, and quality targets – for each benchmark task. Image: MLCommons
Key Metric: Time-to-Train
"Time-to-Train" serves as MLPerf Training’s primary metric, evaluating how quickly models can reach quality thresholds. Rather than focusing on raw computing power, this approach provides an objective assessment of the complex, end-to-end training process.
"We pick the quality target to be close to state-of-the-art," Kanter said. "We don't want it to be so state-of-the-art that it's impossible to hit, but we want it to be very close to what is on the frontier of possibility."
MLPerf Training Methodology
Developers using the MLPerf suite configure libraries and utilities before executing workloads on prepared test environments. While MLPerf typically operates within containers, such as Docker, to ensure reproducible conditions across different systems, containerization is not a mandatory requirement. Certain benchmarks may employ virtual environments or direct-to-hardware software installations for native performance evaluations.
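As a loose illustration of the containerized pattern, the snippet below uses the Docker SDK for Python to launch a benchmark container with a read-only dataset mount and GPU access. The image name, entry-point script, and host paths are hypothetical placeholders, not an actual MLPerf reference container, so treat this as a sketch of the reproducibility setup rather than a prescribed workflow.

```python
# Requires Docker and the Docker SDK for Python: pip install docker
import docker

client = docker.from_env()

# Hypothetical image, command, and host paths -- stand-ins for a real
# reference container and its entry point.
logs = client.containers.run(
    image="example/mlperf-training-reference:latest",
    command="./run_benchmark.sh",
    volumes={"/data/benchmark-dataset": {"bind": "/data", "mode": "ro"}},
    device_requests=[
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
    remove=True,
)
print(logs.decode())
```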
The benchmarking process includes these key components:
Configuration Files specify the System Under Test (SUT) and define workload parameters.
Reference Codes and Submission Scripts act as a test harness to manage workload execution, measure performance, and ensure compliance with the benchmark rules.
MLPerf_logging generates detailed execution logs that track processes and record metrics. As noted above, the final metric is Time-to-Train, which measures the time required to train a model to the target quality level (see the logging sketch after this list).
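The sketch below gives a rough sense of how a run might emit such logs using the mlperf_logging package's mllog interface. The specific calls and constants shown (get_mllogger, start, end, event, RUN_START, RUN_STOP, EVAL_ACCURACY) are assumptions based on the package's public repository, and the accuracy values and quality target are invented, so this illustrates the shape of the flow rather than reference code.

```python
# Assumes the mlperf_logging package (github.com/mlcommons/logging) is installed.
# The mllog calls and constants below are assumptions; consult the package
# documentation for exact usage.
from mlperf_logging import mllog

mllogger = mllog.get_mllogger()

# Mark the start of the timed run; Time-to-Train is derived from the
# interval between RUN_START and RUN_STOP in the resulting log.
mllogger.start(key=mllog.constants.RUN_START)

# Placeholder for the real training loop; accuracy values are invented.
for epoch, accuracy in enumerate([0.61, 0.70, 0.76], start=1):
    mllogger.event(
        key=mllog.constants.EVAL_ACCURACY,
        value=accuracy,
        metadata={"epoch_num": epoch},
    )
    if accuracy >= 0.75:  # illustrative quality target
        break

# Mark the end of the run once the quality target has been reached.
mllogger.end(key=mllog.constants.RUN_STOP, metadata={"status": "success"})
```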
Submission Categories
MLPerf Training supports two submission categories:
Closed Division constrains submissions to the reference model and training recipe, enabling apples-to-apples comparisons between different systems.
Open Division permits substantial modifications, including alternative models, optimizers, or training schemes, provided the results meet the target quality metric.
Playing Field in Motion: AI Infrastructure Transformation
AI infrastructure undergoes constant transformation, with the MLPerf benchmark suite evolving in tandem to guide design and address the complex challenges confronting software and data center teams. Version 4, introduced in 2024, included system-wide power draw and energy consumption measurements during training, highlighting the critical importance of energy efficiency in AI systems.
MLPerf Training 5.0 (2025) replaced the GPT-3 benchmark with a new LLM pretraining evaluation based on the Llama 3.1 405B generative AI system.
Microprocessors fuel the AI revolution, and MLPerf Training 5.0 submissions spanned a broad menu of processor options. Notable chips tested in this iteration include:
AMD Instinct MI300X (192GB HBM3).
AMD Instinct MI325X (256GB HBM3e).
AMD Epyc Processor ("Turin").
Google Cloud TPU-Trillium.
Intel Xeon 6 Processor ("Granite Rapids").
NVIDIA GB200 Grace Blackwell Superchip (pairing Blackwell GPUs with Arm Neoverse V2 CPU cores).
NVIDIA Blackwell GPU (B200-SXM-180GB).
MLCommons staff observed performance gains across tested systems in the Version 5.0 round. The Stable Diffusion benchmark demonstrated a 2.28-times speed increase compared to Version 4.1, released just six months earlier. These advancements reflect the growing emphasis on co-design, a methodology that optimizes the balance between hardware and software for specific workloads, thereby enhancing end-user performance and efficiency.
AI Benchmark Futures: Focus on Inference
AI benchmarks must maintain agility to keep pace with ongoing technical breakthroughs as the field advances. While initial efforts targeted large models, the industry has pivoted toward smaller, more efficient models, which now represent a primary focus area. Alexander Harrowell, Principal Analyst for Advanced Computing at Omdia, observed this transition, explaining that although "there will always be interest in model training," the emphasis has shifted from building larger systems to optimizing compact, efficient alternatives.
The inference stage of machine learning constitutes another critical frontier for MLCommons. The organization has developed specialized benchmarks addressing inference needs across various environments:
MLPerf Inference: Datacenter
MLPerf Inference: Edge
MLPerf Inference: Mobile
MLPerf Inference: Tiny
Matt Kimball, Vice President and Principal Analyst for data center compute and storage at Moor Insights & Strategy, highlighted the significance of inference in AI development. "On the ‘what’s next’ front, it is all about inference," he stated. "Inference is interesting in that the performance and power needs for inferencing at the edge are different from what they are in the datacenter." He noted that inference requirements vary considerably across edge environments, such as retail versus industrial applications.
Kimball also recognized the expanding ecosystem of inference contributors. "MLCommons does a good job of enabling all of these players to contribute and then providing results in a way that allows me as an architect," he said.
About the Author
Jack Vaughan is a freelance journalist, following a stint overseeing editorial coverage for TechTarget's SearchDataManagement, SearchOracle and SearchSQLServer. Prior to joining TechTarget in 2004, Vaughan was editor-at-large at Application Development Trends and ADTmag.com. In addition, he has written about computer hardware and software for such publications as Software Magazine, Digital Design and EDN News Edition. He has a bachelor's degree in journalism and a master's degree in science communication from Boston University.