One bad sign is that Google recently put out a press release saying they’d achieved the largest training job ever run, but they didn’t actually train any models. It was a bandwidth test: https://news.ycombinator.com/item?id=38227836
(Shoutout to DavidSJ who sleuthed this out based on the careful wording of the post.)
This was a big shock to me since MLPerf is the gold standard for performance comparisons. The reason you want to train an actual model on real data is that it gives you a precise idea of what to expect in the field. Google’s 2019 MLPerf results are why I dove into TPU land and fell in love with TPUs. Nowadays it’s all about language models and giant training runs, and there’s not a lot of compute left over for average joes (which I very much was), but rants aren’t productive.
What would be productive is for Google to focus on the basics: focus on MLPerf. It’s the standard benchmark for a reason. I can’t fault them for going where the money is, but it’s still sad to see the sorry state of Google being dead last. I’m still rooting for you, Jax team. I believe in you.
TRC is looking better these days. If you need spare compute or just want to try out TPUs, give them a spin for a month. When I was complaining on Twitter in 2023 about TPUs being deleted after 30 minutes circa 2022, two people stepped in and said that they were happy with their quotas and hadn’t observed similar pain points. So maybe they’ve solved the provisioning issues, and they’re starting to treat TRC members as more than third-class citizens again. But, especially in modern times, it’s hard to complain when compute is so scarce; you can do some pretty amazing things with TPUs if you focus on them for a few months. Thank you to the organizers of TRC (Jonathan in particular, their support lead, who is single-handedly the most impressive support person I’ve ever seen by a huge margin; and Zak for keeping so many balls in the air while leading the TPU infra).
A lot of MLPerf tasks take less than a minute, meaning it likely takes more time to start the training job and warm up than to run the actual training itself. MLPerf is badly outdated.
> What would be productive is for Google to focus on the basics: focus on MLPerf. It’s the standard benchmark for a reason. I can’t fault them for going where the money is, but it’s still sad to see the sorry state of Google being dead last.
You seem to be suggesting that Google isn't focusing on performance on MLPerf but on something else that's more lucrative. Do you want to expound on what that is?
Large LLM training workloads. It’s where the money’s at, and it’s why they went for "world’s biggest training run" headline rather than leading the pack on MLPerf standards. Not the least of which is because they can build their own LLM framework for customers to use, rather than spending time getting an MLPerf result that customers can’t possibly use. (I know because I’ve tried; it was quite a lot of archaeology to figure out how to dig out their 2019 MLPerf code for my own needs.)
I don’t blame them. It’s the right move. I just miss the old days. But those aren’t coming back, and for better or worse the entire world is GPT and Diffusion focused for at least the next N years.
Hopefully nerds can figure out how to scrape enough compute together to do interesting things during this gold rush period. It’s why TRC is that much more important, since it’s an opportunity to mint the next generation of engineers-turned-researchers. It would be smart for nvidia to offer something similar using their aging A100s. But I think only Google understands how powerful it is to capture the college students and interns.
So... It's November 2023, and we have a handful of really good open-source LLM base models (Mistral, Llama, MPT, Falcon, a few code models, a few Chinese ones like Yi, an Arabic one), a few notable proprietary ones, and a few cool vision models.
... Where is all that hoovered-up accelerator compute going? Shouldn't we be swimming in open and closed LLMs by now?
Just a guess: (1) a lot of "me too" pre-training that should really be fine-tuning; (2) a lot of proprietary models whose weights will never be released (or even used :)).
The headline feels like clickbait. Does MLPerf also benchmark power consumption/cost per submission? It's not particularly illuminating that you can train things quickly if you use a 10k-GPU cluster, and a "per chip" comparison is vague.
For most people the bottom line is cost and availability anyway - does it matter if a TPU is twice as slow if it costs half as much? I think I read somewhere that Apple Silicon is actually one of the most efficient platforms to use for development (e.g. a tricked out Mini makes quite a good inference server); I've been able to train recent object detection models on my M2 without much trouble (and it's much cheaper than paying for cloud time).
Nah, MLPerf is legit. I was skeptical of it too, but wanted to leave a quick note (I have to run) that they’re solid. I’d be the first to call it out if it was pointless or misleading, but it turns out to be the only way to get a true idea of comparisons across different hardware. It forces everyone to achieve the same goal, which is key; otherwise you’re left with a bunch of slight (and large) variations that tell you nothing about achieving the actual goal, which is all you care about.
It’s a bit like a shared space race. Getting that top-N accuracy to 74 point yada yada percent in 37 seconds on a certain resnet architecture tells you that you can do the same thing on that hardware, which means you can do your own things just as quickly. So it’s a worthwhile time investment. If you force everyone to get that same accuracy with the same model arch, you can make informed decisions, which is especially important when throwing millions of VC dollars around.
EDIT: Sorry, I completely misread you.
It’s difficult to quantify what you’d actually like to measure: the bottom-line price of whoever is cheaper for your expected workload. There are immense trade-offs, not least of which is the time spent learning a particular (esoteric) stack. CUDA knowledge doesn’t port to JAX and vice versa.
Prices are always changing, and you can usually work out a deal with the sales team to get lower than advertised. Especially if they want you to choose them instead of some competitor. So in general it’s hard to figure out what you can expect for production workloads in terms of total dev cost vs price vs speed.
I will say that as a researcher, there’s no substitute for fast iteration cycles. I’m one of the few who believe in scaling down your models as much as possible when testing experimental ideas, precisely because you can try 30 runs instead of 3. So all else being equal, I’d take speed.
But all else isn’t equal. The only thing I want nowadays is free plus stable. It’s looking like a 4090 might be the way to get that, which is enough to try out some interesting ideas.
I'm still not totally sure what the issue is. Jax uses program transformations to compile programs to run on a variety of hardware, for example, using XLA for TPUs. It can also run cuda ops for Nvidia gpus without issue: https://jax.readthedocs.io/en/latest/installation.html
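To make that concrete, here's a minimal toy sketch of my own (not from the article): the same jitted function is traced once and compiled by XLA for whichever backend JAX finds locally, whether CPU, GPU, or TPU, with no device-specific code.

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for the local backend
def predict(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((4, 4))
x = jnp.ones((2, 4))

print(jax.devices())        # list of local devices: CPU, GPU, or TPU cores
print(predict(w, x).shape)  # (2, 4), same code on any of those backends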
I haven't worked with float4, but can imagine that new numerical types would require some special handling. But I assume that's the case for any ml environment.
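For what it's worth, here's a rough sketch of the kind of dtype handling I mean, using bfloat16 (which JAX supports out of the box); I'm assuming narrower formats like float8/float4 would slot in the same way once the underlying dtype support exists, which isn't something I've verified.

```python
import jax
import jax.numpy as jnp

# Low-precision inputs: bfloat16 works natively; float4 does not exist here
# (as far as I know), so it would need new dtype plumbing in JAX/XLA first.
x = jnp.ones((128, 128), dtype=jnp.bfloat16)
w = jnp.ones((128, 128), dtype=jnp.bfloat16)

@jax.jit
def matmul(a, b):
    return a @ b

print(matmul(x, w).dtype)  # bfloat16
```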
Salaries, more often than not, cost a hell of a lot more than compute and energy bills. If you have a 10k cluster, you're probably a mega-corp running a few teams that share the resources and run experiments in parallel with each other and with themselves.
And even though that is very much an overestimate of the actual cost of compute, the $10/hr/GPU figure is in all likelihood still significantly cheaper than the cash you pay to your researchers and all the things needed to support them. Unless, that is, your team of researchers is about 10 people running very efficiently and making use of ALL the GPUs every hour of every day.
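As a back-of-envelope illustration (every number here is my own assumption, not a figure from the thread): even at that $10/hr/GPU overestimate, a researcher has to keep several GPUs saturated around the clock before the compute bill catches up with a fully-loaded salary.

```python
# Every number here is an assumption, for illustration only.
gpu_hourly = 10.0             # the per-GPU-hour overestimate mentioned above
researcher_yearly = 500_000   # assumed fully-loaded cost of one researcher
hours_per_year = 24 * 365

# GPUs one researcher must keep busy 24/7 before compute outcosts them:
print(round(researcher_yearly / (gpu_hourly * hours_per_year), 1))  # ~5.7 GPUs

# At a lower amortized cost, say $2/hr/GPU, the bar is much higher:
print(round(researcher_yearly / (2.0 * hours_per_year), 1))         # ~28.5 GPUs
```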
While power consumption is important, it is never a top priority when we talk about training these huge models. Raw performance matters more than anything else.
And just because the article doesn't mention the aspect you care about does not make it clickbait.
Moreover, considering the cost of the card itself at $1,600, the electricity expenses become relatively minor. In fact, for the price of the card, you could operate it non-stop for two years. This highlights how the cost of electricity is a small factor in the overall picture.
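Back-of-envelope for that two-year figure, with my own assumed numbers (roughly 450 W at full load and about $0.20/kWh; neither is from the thread):

```python
card_price = 1_600      # $, from the comment above
power_kw = 0.45         # assumed full-load draw
price_per_kwh = 0.20    # assumed electricity rate

yearly_electricity = power_kw * 24 * 365 * price_per_kwh
print(round(yearly_electricity))                  # ~$788 per year
print(round(card_price / yearly_electricity, 1))  # ~2.0 years to equal the card price
```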
No, this highlights that electricity is virtually all of the TCO over the life of the device. It doesn't sound like you've done a lot of TCO work before. In any case we're talking about facilities containing tens of thousands of these devices, so the TCO of one of them is irrelevant.
I don't understand your point. Yes, electricity is virtually all the TCO of the device (other than the upfront cost). Isn't that the case for any consumer electronic part - processor, memory, hdd, etc? That fact doesn't change any of what I'm trying to say.
My point is that this cost (of electricity) is negligible relative to two other costs - cost of time (in other words, cost of human capital), and the upfront cost of the device itself.
This is a response to your comment "Wait, why? Is it more important to organizations to wait a little less time, or to spend less money (on electricity)?"
Oddly, the article talks about training performance, but the link to the tests points to the inference benchmark. The correct link to the training benchmark results is https://mlcommons.org/benchmarks/training/
For what Google is primarily used for, its search engine and so forth, do they need an extremely advanced AI? As we've seen with other big tech companies, using giant expensive models for "simple tasks" isn't worth it financially. How advanced an LLM does Google really need to improve search?
Google has numerous internal applications other than websearch that use machine learning, such as spam classification (Gmail), image segmentation and recognition (photos, image search), text-to-speech and speech-to-text (assistant), very small generative tasks (smart reply, smart compose), predicting which videos will be watched (YouTube), and even operating the pumps and fans in their data centers.
This needs to be called out anytime an AI benchmarking article with Google as one of its key comparisons is published: many of the outstanding AI researchers and innovators who gave Google the reputation it has today are no longer at the company.
Startups such as Essential AI, Ideogram, Character.ai were all founded by ex-Googlers. Plenty of engineers and researchers from Google Brain/Research have pivoted into these types of start-ups, not necessarily as founders but as individual contributors continuing their engineering and research work at these companies instead.
Google Ads: built from a great many pieces, one of the largest being DoubleClick, acquired in 2008.
Google Docs: acquired in 2006 from Writely.
Google Sheets: acquired in 2006 from XL2Web.
Google Maps: acquired in 2004.
I'm not trying to say that Google acquired everything and hasn't built anything; not at all. But a few crown jewels were indeed acquisitions.
If Dr. Lisa Su, the CEO, wanted AMD to submit to MLPerf, they would have already submitted to MLPerf. This tweet is about a discussion George had with her about AMD's open-source practices and the code quality of the contributions they make.
AMD does have a history of spinning their wheels on the software side. Not graphics specifically (anymore), but definitely with their compute initiatives like HSA or even OpenCL.
I dunno what to think about ROCm. It does seem like they should have thrown the MI300 into the mix, but maybe it's not ready.
The proof is in the pudding. AMD has excellent hardware and has had years now to get anything going. They didn't, and they don't, so our best guess is that they won't.
But the interpretation of the journalist is the issue here. Comparing a cost optimized part (TPU v5e) against a performance optimized part (H100) and deciding that this makes Google "behind" is just incorrect.