MLCommons is out today with its latest set of MLPerf inference results. The new results mark the debut of a new generative AI benchmark and the first validated test results for Nvidia’s next-generation
Blackwell GPU processor.
MLCommons is a multi-stakeholder, vendor-neutral organization that manages the MLPerf benchmarks for both AI training and AI inference. The latest MLPerf inference benchmarks provide a comprehensive snapshot of the rapidly evolving AI hardware and software landscape. With 964 performance results submitted by 22 organizations, these benchmarks serve as a vital resource for enterprise decision-makers navigating the complex world of AI deployment. By offering standardized, reproducible measurements of AI inference capabilities across various scenarios, MLPerf enables businesses to make informed choices about their AI infrastructure investments, balancing performance, efficiency, and cost.
As part of MLPerf Inference v4.1, there are notable additions. For the first time, MLPerf is evaluating the performance of a Mixture of Experts (MoE) model, specifically Mixtral 8x7B. This round of benchmarks featured an impressive array of new processors and systems, many making their first public appearance. Notable entries include AMD’s MI300X, Google’s TPUv6e (Trillium), Intel’s Granite Rapids, Untether AI’s SpeedAI 240, and the Nvidia Blackwell B200 GPU.
“We just have a tremendous breadth of diversity of submissions, and that’s exciting,” David Kanter, founder and head of MLPerf at MLCommons, said during a call discussing the results with press and analysts. “The more different systems we see out there, the better for the industry, more opportunities, and more things to compare and learn from.”
Introducing the Mixture of Experts (MoE) benchmark for AI inference
A significant highlight of this round was the introduction of the Mixture of Experts (MoE) benchmark, designed to address the challenges posed by increasingly large language models.
“The models have been increasing in size,” Miro Hodak, a senior member of the technical staff at AMD and one of the chairs of the MLCommons inference working group, said during the briefing. “That’s causing significant issues in practical deployment.”
Hodak explained that at a high level, instead of having one large, monolithic model, the MoE approach uses several smaller models that act as experts in different domains. Anytime a query comes in, it is routed through one of the experts.
The MoE benchmark tests performance on different hardware using the Mixtral 8x7B model, which consists of eight experts, each with 7 billion parameters. It combines three different tasks (see the routing sketch after this list):
- Question-answering based on the OpenOrca dataset
- Math reasoning using the GSM8K dataset
- Coding tasks using the MBXP dataset
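To make the routing idea concrete, here is a minimal, hypothetical PyTorch sketch of Mixtral-style top-2 expert routing. The class name, layer sizes, and gating details are illustrative assumptions, not the benchmark’s or Mixtral’s actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy Mixture-of-Experts layer: a gating network routes each token
    to its top-2 of eight small expert MLPs, Mixtral-style (sizes are
    illustrative, far smaller than Mixtral's 7B-parameter experts)."""

    def __init__(self, d_model=64, d_hidden=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # the router
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.gate(x)                           # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)   # four toy token embeddings
layer = ToyMoELayer()
print(layer(tokens).shape)    # torch.Size([4, 64])
```

Because each token activates only two of the eight experts, the per-token compute is a fraction of what a dense model with the same total parameter count would require, which is the deployment efficiency Hodak describes.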
He noted that the key goals were to exercise the strengths of the MoE approach better than a single-task benchmark could and to showcase the capabilities of this emerging architectural trend in large language models and generative AI. Hodak explained that the MoE approach allows for more efficient deployment and task specialization, potentially offering enterprises more flexible and cost-effective AI solutions.
Nvidia Blackwell is coming, bringing significant AI inference gains
The MLPerf benchmarks are an excellent opportunity for vendors to preview upcoming technology. Rather than relying on marketing claims alone, the MLPerf process provides rigorous, peer-reviewed, industry-standard testing.
Among the most anticipated pieces of AI hardware is Nvidia’s Blackwell GPU, which was first announced in March. While it will still be many months before Blackwell is in the hands of real users, the MLPerf Inference 4.1 results provide a promising preview of the power that is coming.
“This is our first performance disclosure of measured data on Blackwell, and we’re very excited to share this,” Nvidia’s Dave Salvator said during a briefing with press and analysts.
MLPerf Inference 4.1 includes many different benchmarking tests. On the generative AI workload that measures performance using MLPerf’s largest LLM workload, Llama 2 70B, Blackwell delivered a notable gain.
“We’re delivering 4x more performance than our previous-generation product on a per-GPU basis,” Salvator said.
While the Blackwell GPU is a significant new piece of hardware, Nvidia continues to squeeze more performance out of its existing GPU architectures. The Nvidia Hopper GPU keeps getting better: Nvidia’s MLPerf Inference 4.1 results show Hopper delivering up to 27% more performance than in the last round of results six months ago.
“These are all gains coming from software only,” Salvator said. “In other words, this is the same hardware we submitted about six months ago, but because of our ongoing software tuning, we can achieve more performance on that platform.”