Alpha had a lot of implementation problems, e.g. floating point exceptions with untraceable execution paths.
Cray tried to build the T3E (IIRC) out of Alphas. DEC bragged about how good Alpha was for parallel computing, big memory, etc.
But Cray publicly denounced Alpha as unusable for parallel processing (the T3E was a bunch of Alphas in some kind of NUMA shared memory). It was that difficult to make the chips work together.
This was in the Cray Connect or some such glossy publication. Wish I'd kept a copy.
Plus, of course, the usual DEC marketing incompetence. They feared Alpha undoing the momentum of their large, expensive machines: small workstation boxes significantly faster than big iron.
The Cray T3D and T3E used Alpha processors, but it wasn't really shared memory: each node, with one (or two?) CPUs, ran its own lightweight OS kernel. There were some libraries built on top of it (SHMEM) that sort of made it look a bit like shared memory, but not really. Mostly it was a machine for running MPI applications.
A decade or so later, they more or less recreated the architecture, this time with 64-bit Opteron CPUs, in the form of the 'Red Storm' supercomputer for Sandia. That then became commercially available as the XT3, and later the XT4/5/6.
You have a "telescope" with a field of view of one planet's worth of pixels. But the planet is in orbit, so it drifts out of the imaged field of view within minutes.
Meanwhile your sensor is travelling away from the "lens", so a transverse velocity would be needed to track the orbit, at a delta-v and in a direction that are unknowable. Unknowable, because you'd have to know where the planet is, to within some radius, to put your "sensor" in the right place in the first place.
Imagine taking a straw, placing it in a tree, walking a few km away, focusing a telescope on the straw, and hoping to look through the straw to see an airplane flying past. You have the same set of unknowables.
I won't argue that it would be worth the effort, but it would be interesting to set something like that going and just keep scanning. A few years worth of data might turn up interesting things even if it wasn't particularly useful for finding those things a second time.
- your test codes are not reproducible: running twice generates different sets of numbers because the seeds are unknown. As a result, if you change (a) compilers or compiler switches, (b) operating system versions, (c) host processors, or (d) architectures, the question arises: what is wrong? What is different? That kind of comparison is a regression test, and it only works with reproducible runs.
- You only test a few times? I think one hundred BigCrush tests, using a set of 100 seeds, would be suitable. Takes a few days on an RPI 4 (with cooler). Run the same 100 tests on a Ryzen and Xeon, just to be sure. They should be bit-for-bit identical.
- 100 BigCrush tests should show only a handful (4 or fewer) duplicate test failures.
- your seeds are almost great: too many people think "42" is random in a space of 0 through 2^64. But 0xdeadbeef is so 1990s...
- you don't need different seeds per PRNG; you can generate reproducible ones (2 to 4 64-bit words) from a single good 64-bit seed and your favourite PRNG. Your test code should read a seed, or a set of them, from the command line (see the first item); see the sketch after this list.
- warmups? Really?
- Remember that BigCrush and other tests are created by mathematical people not practical people. Do they test for equal numbers of odd and even results? Hmmmm....
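Not part of the list above, just a minimal sketch in Python of the "one master seed, several reproducible sub-seeds" idea. The SplitMix64-style mixer and the choice of four output words are my own assumptions, nothing prescribed here; a real harness would feed these into whatever language drives TestU01:

    import sys

    MASK64 = (1 << 64) - 1

    def splitmix64_stream(seed: int):
        """Deterministically expand one 64-bit master seed into 64-bit words."""
        state = seed & MASK64
        while True:
            state = (state + 0x9E3779B97F4A7C15) & MASK64
            z = state
            z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
            z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
            yield z ^ (z >> 31)

    if __name__ == "__main__":
        # Read the master seed from the command line so every run is reproducible.
        seed = int(sys.argv[1], 0)          # accepts decimal or 0x... hex
        words = splitmix64_stream(seed)
        # Four words is enough to seed, say, a 256-bit generator like xoshiro256**.
        for sub_seed in (next(words) for _ in range(4)):
            print(f"0x{sub_seed:016x}")

Given the same command-line seed, the same four words come out every time, on every machine, which is exactly what the regression-test point above needs.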
What a great set of feedback. Thank you! I'll look at it as an action item list.
And you are so right about being just one month in. Every time I think I'm starting to understand what's going on here, I realize the maze just keeps getting deeper.
I don't for a minute believe Deepseek v3 was built with a $6M rental.
Their paper (arXiv 2412.19437) explains they used 2048 H800s. A computer cluster based on 2048 GPUs would have cost around $400M about two years ago when they built it. (Give or take; feel free to post corrections.)
The point is they got it done cheaper than OpenAI/Google/Meta/... etc.
But not cheaply.
I believe the markets are overreacting. Time to buy (tinfa).
They pointed out that the cost calculation is based on renting those GPUs at $2/hr. They are not factoring in the prior cost of buying those H800s, because they didn't buy them just to build R1. They are not factoring in the cost to build V2 or V2.5. The cost is to build V3. The cost to build R1-Zero and R1 on top of V3 seems far cheaper, and they didn't mention that. They are not factoring in the cost of building out their datacenter, or salaries. Just the training cost. They made it clear. If you could rent equivalent GPUs at $2/hr, it would cost you about $6 million.
"Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data."
V3 was released a bit over a month ago, but V3 is not what took the world by storm; R1 is. Yet the price everyone is talking about is the price for V3.
If this weren't an attempt to sell a false equivalency, at least one story would have details on the equivalent rental cost of compute used to train closed source frontier models from OpenAI, Anthropic, Mistral... Lack of clarity makes it a story.
>>Just the training cost. They made it clear. If you could rent equivalent GPUs at $2/hr, it would cost you about $6 million.
This is still quite impressive, given most people are more likely to buy cloud infrastructure from AWS or Azure than build their own datacenter. So the math checks out.
I don't think the compute capacity that's already been built will go to waste; more and bigger things will likely get built in the coming years, so most of it will be used for that.
You’re confusing the metric for reality. The point is to compare the cost of training in terms of node hours with a given configuration. That’s how you get apples to apples. Of course it doesn’t cover building the cluster, housing the machine, the cleaning staff’s pension, or whatever.
The math they gave was 2,788,000 H800 GPU hours[1], with a rental price of $2/GPU-hour[1], which works out to $5.6M. If they did that on a cluster of 2048 H800s, then they could re-train the model every ~1400 hours (~2 months).
If they paid $70,000 per GPU[2] plus $5000 per 4-GPU compute node (random guess), then the hardware would have cost about $150M to build. If you add in network hardware and other data-centery-things, I could see it reaching into the $200M range. IMO $400M might be a bit of a stretch but not too wildly off base.
To reach parity with the rental price, they would have needed to re-train roughly 27 times (i.e. over four years of continuous training). They obviously did not do that, so I agree it's a bit unfair to cost this based on $5.6M in GPU rentals. Why did they buy instead of rent? Probably because it's not actually that cheap to get 2048 concurrent high-performance connected GPUs for 60 days. Or maybe just because they had cash for capex.
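A rough sketch of that arithmetic, using the paper's GPU-hour figure; the ~$150M capex and the $2/hr rate are the estimates from above, not reported costs:

    GPU_HOURS   = 2_788_000   # H800 GPU-hours reported for the V3 training run
    RENTAL_RATE = 2.00        # assumed $ per GPU-hour
    CLUSTER     = 2048        # H800s in the cluster
    CAPEX       = 150e6       # rough hardware estimate above; not an official figure

    run_cost   = GPU_HOURS * RENTAL_RATE   # ~$5.58M per training run
    wall_hours = GPU_HOURS / CLUSTER       # ~1361 h, i.e. ~57 days per run
    breakeven  = CAPEX / run_cost          # ~27 runs before owning beats renting

    print(f"rental cost per run : ${run_cost / 1e6:.2f}M")
    print(f"wall clock per run  : {wall_hours:.0f} h (~{wall_hours / 24:.0f} days)")
    print(f"runs to match capex : {breakeven:.0f} "
          f"(~{breakeven * wall_hours / 24 / 365:.1f} years back-to-back)")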
Something like an H100 is definitely a feat of engineering, though.
Nothing prevents Cooler Master from releasing a line of equally performant GPUs and, while they're at it, even cheaper ones. But when we measure reality, after the wave function of fentanyl and good intentions collapses... oh yeah, turns out only Nvidia is making those chips, whoops...
Looking around a bit, the price was ~$70k USD _in China_ around the time they were released in 2023; cheaper bulk sales were a thing, too.
Note that these are the China prices, with a high markup due to export controls etc.
The price of an H800 80GiB in the US today is more like ~$32k USD.
But to use H800 clusters well you also need the fastest possible interconnects, enough motherboards, enough fast storage, cooling, a building, interruption-free power, etc. So the cost of building an "H800"-focused datacenter is much, much higher than multiplying GPU cost by GPU count.
You can't buy the GPUs individually, and even if you can on a secondary market, you can't use them without the baseboard, you can't use the baseboard without a compatible chassis, and a compatible chassis is full of CPUs, system memory, etc. On top of that, you need a fabric. Even if you cheap out and go RoCE over IB, it's still 400 Gb/s HCAs, optics, and switches.
Yeah, a node in a cluster costs as much as an American house. Maybe not on its own, but to make it useful for large-scale training, even under the new math of DeepSeek, it costs as much as a house.
They estimated $200k for a single NVIDIA GPU-based server, complete with RAM and networking. That's where my number came from. (RAM and especially very-high-speed networking are very expensive at these scales.)
"Add it all up, and the average selling price of an Nvidia GPU accelerated system, no matter where it came from, was just under $180,000, the average server SXM-style, NVLink-capable GPU sold for just over $19,000 (assuming the GPUs represented around 85 percent of the cost of the machine)"
That implies they assumed an 8-GPU system. (8 × $19,000 = $152,000 ≈ 85% × $180,000)
To clarify, a legitimate benchmark for training is the running cost, not the capex cost, because the latter, amortized per model, obviously drops dramatically with the number of models you train. But to put it into context, Meta wants to spend $50B on AI this year alone, and it already has 150x the compute of DS. The very real math going through investors' heads is: what's stopping Zuck from taking $10B of that and mailing a $100 million signing bonus to every name on the R1 paper?
The $6M that is thrown around is from the DS V3 paper and is for the cost of a single training run for DeepSeek V3 - the base model that R1 is built on.
The number does not include the cost of personnel, experiments, data preparation, or chasing dead ends, and, most importantly, it does not include the reinforcement learning step that made R1 good.
Furthermore, it is not factored in that both R1 and V3 are built on top of an enormous amount of synthetic data that was generated by other LLMs.
Comparing cost of buying with cost of running is weird. It's not like they build a new cluster, train just this one model, and then incinerate everything.
They bought between 10k and 50k of them before the US restrictions came into place. Sounds like DeepSeek gets to use them for training, as they were profitable (could still be, not sure).
Electricity in China, even at residential rates, is 1/10th the cost it is in CA.
I think the salient point here is that the "price to train" a model is a flashy number that's difficult to evaluate out of context. American companies list the public cloud price to make it seem expensive; Deepseek has an incentive to make it sound cheap.
The real conclusion is that world-class models can now be trained even if you're banned from buying Nvidia cards (because they've already proliferated), and that open-source has won over the big tech dream of gatekeeping the technology.
Over the last few days people have asked me whether I think NVIDIA is fkd. It still takes two H100s to run inference on the DS v3 671b @ <200 tokens per second.
There are different versions of the model as well as using it with different levels of quantization.
Some variants of DeepSeek-R1 can be run on 2x H100 GPUs, and some people have managed to get quite decent results from an even more heavily distilled model running on consumer hardware.
For the full DeepSeek-V3, even with 4-bit quantization, you need more like 16x H100.
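As a back-of-envelope check of the weight memory alone (a sketch; KV cache, activations, and batching overhead are ignored here, and they push the practical GPU count higher):

    PARAMS  = 671e9   # total parameters in DeepSeek-V3 / R1
    HBM_GPU = 80e9    # bytes of HBM on one H100 / H800

    # Bytes per parameter: roughly 1 at FP8 (how the weights ship), 0.5 at 4-bit.
    for label, bytes_per_param in [("FP8", 1.0), ("4-bit", 0.5)]:
        weight_bytes = PARAMS * bytes_per_param
        gpus_for_weights = weight_bytes / HBM_GPU
        print(f"{label:>5}: ~{weight_bytes / 1e9:.0f} GB of weights "
              f"= {gpus_for_weights:.1f}x 80 GB GPUs, before KV cache or overhead")

Even weights-only, the full 671B model doesn't come close to fitting on two H100s, and at FP8 it spills past a single 8x80GB node, which is where the larger GPU counts come from.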
The method mentioned in the link you shared is indeed interesting and probably also works with NetBSD among others, but it relies on having third-party controls
(such as having to select a GRUB entry and run the installer steps from a cloud control panel).
The reason I experimented and wrote this article is that I thought it would be interesting to find a way that avoids relying on any external controls.
Thus this works even on bare-metal servers and, thanks to QEMU, lets you install absolutely any OS that can boot under QEMU.
This article is about absolutes and fear. To conflate the two is an obvious rhetorical trick that amounts to clickbait, approximately.
"It's rare for a reentering object to hit a structure..." which is an example of the probability and the hazard. So the risk by most people's definition is "low".
So what's the problem? "According to the European Space Agency, the annual risk..." is the problem. Misusing (or misunderstanding) terminology is typical. Unfortunately, typical for Ars.
This article is about incorrect understanding by engineers and scientists of how different materials, in different conditions, behave during reentry:
“"During its initial design, the Dragon spacecraft trunk was evaluated for reentry breakup and was predicted to burn up fully," NASA said in a statement. "The information from the debris recovery provides an opportunity for teams to improve debris modeling. NASA and SpaceX will continue exploring additional solutions as we learn from the discovered debris."”
and
“These incidents highlight an urgency for more research into what happens when a spacecraft makes an uncontrolled reentry into the atmosphere, according to engineers from the Aerospace Corporation”.
The passage at the end of the article about how low the risk is of space debris injuring an individual serves to tell the reader that being hit by space debris is not something they need to worry about. Again, this is about experts updating and improving their models, especially as the number of space launches grows dramatically, sometimes using novel materials.
I agree people are generally very bad at understanding the relationships among probabilities, hazards, and risks. But this article cites multiple, independent experts, and specifically highlights how this is not a problem of you getting hit by space debris, which is quite anti-clickbait.