Apples to oranges. NVIDIA cards have an order of magnitude more horsepower for compute than this thing. A B100 has 8 TB/s of memory bandwidth, 10 times more than this. If NVIDIA made a card with 512GB of HBM I'd expect it to cost $150K.
The compute and memory bandwidth of the M3 Ultra is more in-line with what you'd get from a Xeon or Epyc/Threadripper CPU on a server motherboard; it's just that the x86 "way" of doing things is usually to attach a GPU for way more horsepower rather than squeezing it out of the CPU.
This will be good for local LLM inference, but not so much for training.
When I was much younger, I got to work on compilers at Cray Computer Corp., which was trying to bring the Cray-3 to market. (This was basically a 16-CPU Cray-2 implemented with GaAs parts; it never worked reliably.)
Back then, HPC performance was measured in mere megaflops. And although the Cray-2 had peak performance of nearly 500MF/s/CPU, it was really hard to attain, since its memory bandwidth was just 250M words/s/CPU (2GB/s/CPU); so you had to have lots of operand re-use to not be memory-bound. The Cray-3 would have had more bandwidth, but it was split between loads and stores, so it was still quite a ways away from the competing Cray X-MP/Y-MP/C-90 architecture, which could load two words per clock, store one, and complete an add and a multiply.
So I asked why the Cray-3 didn't have more read bandwidth to/from memory, and got a lesson from the answer that has stuck. You could actually see how much physical hardware in that machine was devoted to the CPU/memory interconnect, since the case was transparent -- there was a thick nest of tiny blue & white twisted wire pairs between the modules, and the stacks of chips on each CPU devoted to the memory system were a large proportion of the total. So the memory and the interconnect constituted a surprising (to me) majority of the machine. Having more floating-point performance in the CPUs than the memory could sustain meant that the memory system was oversubscribed, and that meant that more of the machine was kept fully utilized. (Or would have been, had it worked...)
In short, don't measure HPC systems with just flops. Measure the effective bandwidth over large data, and make sure that the flops are high enough to keep it utilized.
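To put rough numbers on that balance argument, here is a back-of-the-envelope sketch using only the peak figures quoted above (treat everything as approximate):

```python
# Back-of-the-envelope balance for the Cray-2 figures quoted above (per CPU).
# Peak ~500 MFLOP/s against 250 Mwords/s of memory bandwidth means a kernel
# has to reuse each loaded word in at least ~2 flops to approach peak.
peak_flops = 500e6   # ~500 MFLOP/s per CPU
mem_words = 250e6    # 250 Mwords/s per CPU (64-bit words, i.e. 2 GB/s)

print(f"required reuse: {peak_flops / mem_words:.1f} flops per word loaded")

# For comparison, the X-MP/Y-MP-style pipes described above move 2 loads and
# 1 store while completing an add and a multiply each clock: 2 flops for
# every 3 words of memory traffic.
print(f"X-MP-style balance: {2 / 3:.2f} flops per word of memory traffic")
```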
> so you had to have lots of operand re-use to not be memory-bound
Looking at Nvidia's spec sheet, an H100 SXM can do 989 TF32 teraflops (or 67 non-tensor-core FP32 teraflops?) with 3.35 TB/s of memory (HBM) bandwidth, so ... similar problem?
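Quick check of that ratio using the spec-sheet numbers above (rough figures, and the TF32 number is the tensor-core peak as quoted):

```python
# Same balance calculation for the H100 SXM figures quoted above.
bw = 3.35e12  # HBM bandwidth in bytes/s
for label, flops in [("FP32 (non-tensor)", 67e12), ("TF32 tensor cores", 989e12)]:
    print(f"{label:18s}: ~{flops / bw:4.0f} flops needed per byte loaded")
# ~20 flop/byte for plain FP32 and ~295 flop/byte for TF32 tensor math, so
# yes: the same need for heavy operand reuse (tiling into shared memory and
# registers), only more extreme.
```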
Yep, it's apples to oranges. But sometimes you want apples, and sometimes you want oranges, so it's all good!
There's a wide spectrum of potential requirements spanning memory capacity, memory bandwidth, compute speed, compute complexity, and compute parallelism. In the past, a few GB was adequate for the tasks we assigned to the GPU: you had enough storage bandwidth to load the relevant scene into memory and generate framebuffers. But now we're running different workloads. Conversely, a big database server might want its entire contents resident in many sticks of ECC DIMMs attached to the CPU, while only needing a couple dozen x86-64 threads. And if your workload has many terabytes or petabytes of content, there are network file systems with entirely different bandwidth targets, letting entire racks of individual machines access that data at far slower rates.
There's a lot of latency between the needs of programmers and the development and shipping of hardware to satisfy those needs. I'm just happy we have a new option on that spectrum, somewhere in the middle between traditional CPUs and traditional GPUs.
As you say, if Nvidia made a 512 GB card it would cost $150k, but this costs an order of magnitude less than that. Even a high-end consumer card like the 5090 has one-sixteenth the memory of this (32 GB vs 512 GB; the average desktop enthusiast has maybe 8 GB of VRAM) and just over double the bandwidth (~1.79 TB/s).
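Rough sketch of what those two numbers trade off, assuming ~0.5 bytes per parameter for 4-bit quantized weights and the commonly cited ~819 GB/s figure for the M3 Ultra (all approximate):

```python
# Capacity vs. bandwidth for the two devices being compared (approximate).
GiB = 1024**3
cards = {
    # name: (memory in bytes, bandwidth in bytes/s)
    "RTX 5090":       (32 * GiB,  1.79e12),
    "M3 Ultra (max)": (512 * GiB, 0.82e12),  # ~819 GB/s unified memory
}
for name, (mem, bw) in cards.items():
    # ~0.5 bytes/param at 4-bit quantization, ignoring KV cache and overhead
    params_q4 = mem / 0.5
    print(f"{name:14s}: fits ~{params_q4 / 1e9:6.0f}B params at Q4, {bw / 1e12:.2f} TB/s")

print("memory ratio:   ", (512 * GiB) / (32 * GiB))     # 16x
print("bandwidth ratio:", round(1.79e12 / 0.82e12, 1))  # ~2.2x
```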
Also, nit pick FTA:
> Starting at 96GB, it can be configured up to 512GB, or over half a terabyte.
512 GB is exactly half of a terabyte, which is 1024 GB. It's too late for hard drives - the marketing departments redefined storage to use multipliers of 1000, and the binary units got renamed to "tebibytes" - but in memory we still work with powers of two. Please.
Sure, if you want to do training, get an NVIDIA card. My point is that it's not worth comparing either a Mac or an x86 CPU setup to anything with NVIDIA in it.
For inference setups, my point is that instead of paying $10000-$15000 for this Mac you could build an x86 system for <$5K (Epyc processor, 512GB-768GB RAM in 8-12 channels, server mobo) that does the same thing.
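For the bandwidth side of that claim, here's the rough peak math for an 8-12 channel DDR5 EPYC board, assuming DDR5-4800 RDIMMs (sustained numbers will come in lower):

```python
# Peak theoretical memory bandwidth for a many-channel EPYC build.
# Assumes DDR5-4800 RDIMMs; actual sustained bandwidth is lower than peak.
def ddr5_peak_gb_s(channels, mt_per_s=4800, bus_bytes=8):
    """Peak bandwidth in GB/s: transfers/s * 8 bytes per 64-bit channel."""
    return channels * mt_per_s * 1e6 * bus_bytes / 1e9

for ch in (8, 12):
    print(f"{ch:2d} channels of DDR5-4800: ~{ddr5_peak_gb_s(ch):.0f} GB/s peak")
# 8 channels  -> ~307 GB/s
# 12 channels -> ~461 GB/s (vs ~819 GB/s on the M3 Ultra)
```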
The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.
But this is how it so wonderfully works. The +$4000 does two things: 1. It makes Apple very, very rich. 2. It makes people think this is better than a $10k EPYC. Win-win for Apple. Once you have convinced people that you are the best, a higher price just makes them think you are even better.
> The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.
That requires an otherwise equivalent PC to exist. I haven’t seen anyone name a PC with a half-TB of unified memory in this thread.
Yeah it’s $4k. Yeah that’s nuts. But it’s the only game in town like that. If the replacement is a $40k setup from Nvidia or whatever that’s a bargain.
Since the GH200 has over a terabyte of VRAM at $343,000, and the H100 has 80GB, that works out to $195,993 for a bit over 512GB of VRAM. You could beat the price of the Apple M3 Ultra with an AMD EPYC build.
The GH200 is nowhere near that $343,000 number. You can get a single server for around $45k (with an Inception discount). If you are buying in bulk, it goes down to sub-$30k-ish. That comes with an H100's performance and an insane amount of high-bandwidth memory.
128GB for $3k; per the announcement, their ConnectX networking allows two Project Digits devices to be plugged into each other and work together as one device, giving you 256GB for $6k. And AFAIK existing frameworks can split models across devices as well, hence, presumably, the upthread suggestion that Project Digits would provide 512GB for $12k, though arguably that last step is cheating.
If you want to split tensor-wise, yes. Layer-wise splits could go over Ethernet.
I would be interested to see how feasible hybrid approaches would be, e.g. connect each pair up directly via ConnectX and then connect the sets together via Ethernet.
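As a toy illustration (not any framework's real API; the device names and layer count are made up), a hybrid layer-wise placement could look something like this, with the fast ConnectX link inside each pair and Ethernet only at the stage boundary:

```python
# Toy placement sketch for a hybrid split: layer-wise (pipeline) stages across
# two ConnectX-linked pairs, with the pairs joined over ordinary Ethernet.
# Illustrative only; device names and the layer count are made up.
def plan_layerwise_split(n_layers, groups):
    """Assign contiguous layer ranges to device groups (pipeline stages)."""
    per_group = n_layers // len(groups)
    plan, start = {}, 0
    for i, group in enumerate(groups):
        end = n_layers if i == len(groups) - 1 else start + per_group
        plan[group] = (start, end - 1)  # inclusive layer range for this stage
        start = end
    return plan

# Within a pair, tensor-wise splitting can ride the fast ConnectX link; only
# activations cross the slower Ethernet hop between the two stages.
print(plan_layerwise_split(n_layers=80, groups=["pair0 (ConnectX)", "pair1 (ConnectX)"]))
# {'pair0 (ConnectX)': (0, 39), 'pair1 (ConnectX)': (40, 79)}
```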
If the M3 can run 24/7 without overheating, it's a great deal for running agents, especially considering that it should draw only about 350W... so roughly $50/mo in electricity costs.
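Quick check of that figure (the electricity rate here is an assumption and varies a lot by region):

```python
# Rough monthly electricity cost for a box drawing ~350 W around the clock.
watts = 350
hours_per_month = 24 * 30
rate_per_kwh = 0.20  # assumed rate in $/kWh; adjust for your utility

kwh = watts / 1000 * hours_per_month               # 252 kWh/month
print(f"~{kwh:.0f} kWh/month, ~${kwh * rate_per_kwh:.0f}/month")  # ~$50/month
```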
Around five Nvidia A100 80GB cards can fit 671B at Q4. That's $50k just for the GPUs, and likely much more once you include cooling, power, motherboard, CPU, system RAM, etc.
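Sanity check on that card count, treating Q4 as roughly half a byte per parameter and ignoring KV cache and runtime overhead:

```python
import math

# Why ~5 x 80 GB cards for a 671B model at 4-bit: the weights alone are
# roughly params * 0.5 bytes, before any KV cache or activation headroom.
params = 671e9
weights_gb = params * 0.5 / 1e9        # ~336 GB of weights
cards = math.ceil(weights_gb / 80)     # 80 GB per A100
print(f"~{weights_gb:.0f} GB of weights -> at least {cards} x 80 GB cards")
```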
So the M3 Ultra is amazing value then. And from what I could tell, an equivalent AMD Epyc would still be so constrained that we're talking 4-5 tokens/s. Is this a fair assumption?
That's what I'm trying to get to.
Looking to set up a rig, and AMD Epyc seems reasonable, but I'd rather go Mac if it gives many more tokens per second. It does sound like the Mac with the M3 Ultra will easily give 40 tokens/s, whereas the Epyc is just too internally constrained, giving 4-5 tokens/s, but I'd like someone to confirm that instead of buying the hardware and finding out myself. :)
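For what it's worth, here is a crude bandwidth-bound upper bound on decode speed, assuming a DeepSeek-R1-style MoE with roughly 37B active parameters at ~4-bit (about 20 GB streamed per token) and the peak bandwidth figures above. It is an upper bound only; real CPU inference often lands well below peak bandwidth, which is exactly the kind of thing someone with the hardware would need to confirm:

```python
# Crude upper bound on decode tokens/s if generation is purely limited by
# memory bandwidth: each token has to stream the active weights once.
# Model assumptions (rough): ~37B active params, ~0.55 bytes/param at ~4-bit.
def max_tokens_per_s(bandwidth_gb_s, active_params_b=37e9, bytes_per_param=0.55):
    bytes_per_token = active_params_b * bytes_per_param   # ~20 GB per token
    return bandwidth_gb_s * 1e9 / bytes_per_token

for name, bw in [("M3 Ultra, ~819 GB/s", 819),
                 ("EPYC, 12ch DDR5-4800, ~461 GB/s peak", 461)]:
    print(f"{name:38s}: <= ~{max_tokens_per_s(bw):.0f} tok/s")
# Upper bounds only; sustained CPU bandwidth and compute can pull the real
# number well below this.
```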
Well, ChatGPT quotes 25k-75k tokens/s with 5 H100s (so very, very far from the 40 tokens/s), but I doubt this is accurate (e.g. it completely ignored the fact that they are linked together and instead just multiplied its estimate of the tokens/s for one H100 by 5).
If this is remotely accurate though it's still at least an order of magnitude more convenient than the M3 Ultra, even after factoring in all the other costs associated with the infrastructure.
$5,500 easily gets me either vastly more CPU cores, if I care more about that, or a vastly faster GPU, if I care more about that. Or, for both, a 9950X + 5090 (assuming you can actually find one in stock) is ~$3,000 for the pair plus motherboard, leaving a solid $2,500 for whatever amount of RAM, storage, and networking you desire.
The M3 strikes a very particular middle ground for AI, lots of RAM paired with a significantly slower GPU, which nothing else matches; but that isn't inherently the right balance either. And for any other workload, it's quite expensive.
You'll need a couple of 32GB 5090s to run a quantized 70B model, maybe four to run a 70B model without quantization, and forget about anything larger than that. A huge model might run slowly on an M3 Ultra, but at least you can run it at all.
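Rough footprint math behind those counts (weights only; KV cache and runtime overhead are not included):

```python
# Weight-only memory footprint for a 70B-parameter model at a few precisions,
# versus a 32 GB card. KV cache and runtime overhead are not included.
params = 70e9
for label, bytes_per_param in [("fp16 (unquantized)", 2.0),
                               ("int8", 1.0),
                               ("~4-bit quant", 0.5)]:
    print(f"{label:18s}: ~{params * bytes_per_param / 1e9:4.0f} GB of weights")
# fp16 ~140 GB, int8 ~70 GB, 4-bit ~35 GB, against 32 GB per 5090.
```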
I have an M3 Max (the non-binned one), and I feel like 64GB or 96GB is within the realm of enabling LLMs that run reasonably fast on it (it is also a laptop, so I can do things on planes or trips). I thought about the Ultra: with 128GB on a top-line M3 Ultra, the models you could fit into memory would run fairly fast. With 512GB you could run the bigger models, but not very quickly, so maybe not much point (at least for my use cases).
That config would also use about 10x the power, and you still wouldn't be able to run a model over 32GB, whereas the Studio can easily cope with a 70B Llama and has plenty of room to grow.
I think it actually is perfect for local inference in a way that that build, or any other PC build in this price range, wouldn't be.
The M3 Ultra Studio also wouldn't be able to run path-traced Cyberpunk at all, no matter how much RAM it has. Workloads other than local LLM inference exist, you know :) After all, if the only thing this was built to do was run LLMs, they wouldn't have bothered adding so many CPU cores or video engines. CPU cores (along with networking) were two of the specs highlighted by the person I was responding to, so they were obviously valuing more than just LLM use cases.
Consumer hardware is cheap, if 192 GB of RAM is enough for you. But if you want to go beyond that, the Mac Studio is very competitively priced. A minimal Threadripper workstation with 256 GB is ~$7400 from Puget Systems. If you increase the memory to 512 GB, the price goes up to ~$10900. Mostly because 128 GB modules are about as expensive as what Apple charges for RAM. A Threadripper Pro workstation can use cheaper 8x64 GB for the same capacity, but because the base system is more expensive, you'll end up paying ~$11600.