One bad sign is that Google recently put out a press release saying they’d achieved the largest training job ever run, but they didn’t actually train any models. It was a bandwidth test: https://news.ycombinator.com/item?id=38227836
(Shoutout to DavidSJ who sleuthed this out based on the careful wording of the post.)
This was a big shock to me since MLPerf is the gold standard for performance comparisons. The reason you want to train an actual model on real data is that it gives you a precise idea of what to expect in the field. Google’s 2019 MLPerf results are why I dove into TPU land and fell in love with TPUs. Nowadays it’s all about language models and giant training runs, and there’s not a lot of compute left over for average joes (which I very much was), but rants aren’t productive.
What would be productive is for Google to focus on the basics: focus on MLPerf. It’s the standard benchmark for a reason. I can’t fault them for going where the money is, but it’s still sad to see the sorry state of Google being dead last. I’m still rooting for you, Jax team. I believe in you.
TRC is looking better these days. If you need spare compute or just want to try out TPUs, give them a spin for a month. When I was complaining on Twitter in 2023 about TPUs being deleted after 30 minutes circa 2022, two people stepped in and said that they were happy with their quotas and hadn’t observed similar pain points. So maybe they’ve solved the provisioning issues, and they’re starting to treat TRC members as more than third-class citizens again. But, especially in modern times, it’s hard to complain when compute is so scarce; you can do some pretty amazing things with TPUs if you focus on them for a few months. Thank you to the organizers of TRC (Jonathan in particular, their support lead, who is single-handedly the most impressive support person I’ve ever seen by a huge margin; and Zak for keeping so many balls in the air while leading the TPU infra).
A lot of MLPerf tasks take less than a minute, meaning it likely takes more time to start the training job and warm up than to run the actual training itself. MLPerf is badly outdated.
> What would be productive is for Google to focus on the basics: focus on MLPerf. It’s the standard benchmark for a reason. I can’t fault them for going where the money is, but it’s still sad to see the sorry state of Google being dead last.
You seem to be suggesting that Google isn't focusing on performance on MLPerf but on something else that's more lucrative. Do you want to expound on what that is?
Large LLM training workloads. It’s where the money’s at, and it’s why they went for "world’s biggest training run" headline rather than leading the pack on MLPerf standards. Not the least of which is because they can build their own LLM framework for customers to use, rather than spending time getting an MLPerf result that customers can’t possibly use. (I know because I’ve tried; it was quite a lot of archaeology to figure out how to dig out their 2019 MLPerf code for my own needs.)
I don’t blame them. It’s the right move. I just miss the old days. But those aren’t coming back, and for better or worse the entire world is GPT and Diffusion focused for at least the next N years.
Hopefully nerds can figure out how to scrape enough compute together to do interesting things during this gold rush period. It’s why TRC is that much more important, since it’s an opportunity to mint the next generation of engineers-turned-researchers. It would be smart for nvidia to offer something similar using their aging A100s. But I think only Google understands how powerful it is to capture the college students and interns.
So... It's November 2023, and we have a handful of really good open-source LLM base models (Mistral, Llama, MPT, Falcon, a few code models, a few Chinese ones like Yi, an Arabic one), a few notable proprietary ones, and a few cool vision models.
... Where is all that hoovered-up accelerator compute going? Shouldn't we be swimming in open and closed LLMs by now?
Just a guess: (1) a lot of "me too" pre-training that should really be fine-tuning; (2) a lot of proprietary models whose weights will never be released (or even used :)).
The headline feels like clickbait. Does MLPerf also benchmark power consumption/cost per submission? It's not particularly illuminating that you can train things quickly if you use a 10k-GPU cluster, and a "per chip" comparison is vague.
For most people the bottom line is cost and availability anyway - does it matter if a TPU is twice as slow if it costs half as much? I think I read somewhere that Apple Silicon is actually one of the most efficient platforms to use for development (e.g. a tricked out Mini makes quite a good inference server); I've been able to train recent object detection models on my M2 without much trouble (and it's much cheaper than paying for cloud time).
Nah, MLPerf is legit. I was skeptical of it too, but wanted to leave a quick note (I have to run) that they’re solid. I’d be the first to call it out if it was pointless or misleading, but it turns out to be the only way to get a true idea of comparisons across different hardware. It forces everyone to achieve the same goal, which is key; otherwise you’re left with a bunch of slight (and large) variations that tell you nothing about achieving the actual goal, which is all you care about.
It’s a bit like a shared space race. Getting that top-N accuracy to 74 point yada yada percent in 37 seconds on a certain resnet architecture tells you that you can do the same thing on that hardware, which means you can do your own things just as quickly. So it’s a worthwhile time investment. If you force everyone to get that same accuracy with the same model arch, you can make informed decisions, which is especially important when throwing millions of VC dollars around.
EDIT: Sorry, I completely misread you.
It’s difficult to quantify what you’d actually like to measure: the bottom-line price of whoever is cheaper for your expected workload. There are immense trade-offs, not least of which is the time spent learning a particular (esoteric) stack. CUDA knowledge doesn’t port to JAX and vice versa.
Prices are always changing, and you can usually work out a deal with the sales team to get lower than advertised. Especially if they want you to choose them instead of some competitor. So in general it’s hard to figure out what you can expect for production workloads in terms of total dev cost vs price vs speed.
I will say that as a researcher, there’s no substitute for fast iteration cycles. I’m one of the few who believe in scaling down your models as much as possible when testing experimental ideas, precisely because you can try 30 runs instead of 3. So all else being equal, I’d take speed.
But all else isn’t equal. The only thing I want nowadays is free plus stable. It’s looking like a 4090 might be the way to get that, which is enough to try out some interesting ideas.
I'm still not totally sure what the issue is. Jax uses program transformations to compile programs to run on a variety of hardware, for example, using XLA for TPUs. It can also run cuda ops for Nvidia gpus without issue: https://jax.readthedocs.io/en/latest/installation.html
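To make that concrete, here's a minimal toy sketch of my own (not from the article): the same jitted function is traced once and compiled by XLA for whichever backend JAX finds locally, whether CPU, GPU, or TPU, with no device-specific code.

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for the local backend
def predict(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((4, 4))
x = jnp.ones((2, 4))

print(jax.devices())        # list of local devices: CPU, GPU, or TPU cores
print(predict(w, x).shape)  # (2, 4), same code on any of those backends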
I haven't worked with float4, but can imagine that new numerical types would require some special handling. But I assume that's the case for any ml environment.
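For what it's worth, here's a rough sketch of the kind of dtype handling I mean, using bfloat16 (which JAX supports out of the box); I'm assuming narrower formats like float8/float4 would slot in the same way once the underlying dtype support exists, which isn't something I've verified.

```python
import jax
import jax.numpy as jnp

# Low-precision inputs: bfloat16 works natively; float4 does not exist here
# (as far as I know), so it would need new dtype plumbing in JAX/XLA first.
x = jnp.ones((128, 128), dtype=jnp.bfloat16)
w = jnp.ones((128, 128), dtype=jnp.bfloat16)

@jax.jit
def matmul(a, b):
    return a @ b

print(matmul(x, w).dtype)  # bfloat16
```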
Salaries, more often than not, cost a hell of a lot more than compute and energy bills. If you have a 10k cluster, you're probably a mega-corp running a few teams that share the resources and run experiments in parallel with each other and with themselves.
And even though that is very much an overestimate of the actual cost of compute, the $10/hr/GPU figure is in all likelihood still significantly cheaper than the cash you pay to your researchers and all the things needed to support them. Unless, that is, your team of researchers is about 10 people running very efficiently and making use of ALL the GPUs every hour of every day.
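As a back-of-envelope illustration (every number here is my own assumption, not a figure from the thread): even at that $10/hr/GPU overestimate, a researcher has to keep several GPUs saturated around the clock before the compute bill catches up with a fully-loaded salary.

```python
# Every number here is an assumption, for illustration only.
gpu_hourly = 10.0             # the per-GPU-hour overestimate mentioned above
researcher_yearly = 500_000   # assumed fully-loaded cost of one researcher
hours_per_year = 24 * 365

# GPUs one researcher must keep busy 24/7 before compute outcosts them:
print(round(researcher_yearly / (gpu_hourly * hours_per_year), 1))  # ~5.7 GPUs

# At a lower amortized cost, say $2/hr/GPU, the bar is much higher:
print(round(researcher_yearly / (2.0 * hours_per_year), 1))         # ~28.5 GPUs
```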
While power consumption is important, it is never a top priority when we talk about training these huge models. Raw performance matters more than anything else.
And just because the article doesn't mention the aspect you care about does not make it clickbait.
Moreover, considering the cost of the card itself at $1,600, the electricity expenses become relatively minor. In fact, for the price of the card, you could operate it non-stop for two years. This highlights how the cost of electricity is a small factor in the overall picture.
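Back-of-envelope for that two-year figure, with my own assumed numbers (roughly 450 W at full load and about $0.20/kWh; neither is from the thread):

```python
card_price = 1_600      # $, from the comment above
power_kw = 0.45         # assumed full-load draw
price_per_kwh = 0.20    # assumed electricity rate

yearly_electricity = power_kw * 24 * 365 * price_per_kwh
print(round(yearly_electricity))                  # ~$788 per year
print(round(card_price / yearly_electricity, 1))  # ~2.0 years to equal the card price
```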
No, this highlights that electricity is virtually all of the TCO over the life of the device. It doesn't sound like you've done a lot of TCO work before. In any case we're talking about facilities containing tens of thousands of these devices, so the TCO of one of them is irrelevant.
I don't understand your point. Yes, electricity is virtually all the TCO of the device (other than the upfront cost). Isn't that the case for any consumer electronic part - processor, memory, hdd, etc? That fact doesn't change any of what I'm trying to say.
My point is that this cost (of electricity) is negligible relative to two other costs - cost of time (in other words, cost of human capital), and the upfront cost of the device itself.
This is a response to your comment "Wait, why? Is it more important to organizations to wait a little less time, or to spend less money (on electricity)?"
Oddly, the article talks about training performance, but the link to the tests points to the inference benchmark. The correct link to the training benchmark results is https://mlcommons.org/benchmarks/training/
For what Google is primarily used for, its search engine and so forth, do they need an extremely advanced AI? As we've seen with other big tech companies, using giant expensive models for "simple tasks" isn't worth it financially. How advanced an LLM does Google really need to improve search?
Google has numerous internal applications other than websearch that use machine learning, such as spam classification (Gmail), image segmentation and recognition (photos, image search), text-to-speech and speech-to-text (assistant), very small generative tasks (smart reply, smart compose), predicting which videos will be watched (YouTube), and even operating the pumps and fans in their data centers.
This needs to be called out anytime an AI benchmarking article with Google as one of its key comparisons is published: many of the outstanding AI researchers and innovators who gave Google the reputation it has today are no longer at the company.
Startups such as Essential AI, Ideogram, Character.ai were all founded by ex-Googlers. Plenty of engineers and researchers from Google Brain/Research have pivoted into these types of start-ups, not necessarily as founders but as individual contributors continuing their engineering and research work at these companies instead.
Google Ads: built from a great many pieces, one of the largest being DoubleClick, acquired in 2008.
Google Docs: acquired in 2006 from Writely.
Google Sheets: acquired in 2006 from XL2Web.
Google Maps: acquired in 2004.
I'm not trying to say that Google acquired everything and hasn't built anything; not at all. But a few crown jewels were indeed acquisitions.
If Dr. Lisa Su, the CEO, wanted AMD to submit to MLPerf, they would have already submitted to MLPerf. This tweet is about a discussion George had with her about AMD's open-source practices and the code quality of the contributions they make.
AMD does have a history of spinning their wheels on the software side. Not graphics specifically (anymore), but definitely with their compute initiatives like HSA or even OpenCL.
I dunno what to think about ROCm. It does seem like they should have thrown the MI300 into the mix, but maybe it's not ready.
The proof is in the pudding. AMD has excellent hardware and has had years now to get anything going. They didn't, and they don't, so our best guess is that they won't.
But the interpretation of the journalist is the issue here. Comparing a cost optimized part (TPU v5e) against a performance optimized part (H100) and deciding that this makes Google "behind" is just incorrect.