One nice thing about this (and the new offerings from AMD) is that they will be ...

kkielhofner · on April 10, 2024

Pascal series are cheap because they are CUDA compute capability 6.0 and lack Tensor Cores. Volta (7.0) was the first to have Tensor Cores and in many cases is the bare minimum for modern/current stacks.

See flash attention, triton, etc as core enabling libraries. Not to mention all of the custom CUDA kernels all over the place. Take all of this and then stack layers on top of them...

Unfortunately there is famously "GPU poor vs GPU rich". Pascal puts you at "GPU destitute" (regardless of assembled VRAM) and outside of implementations like llama.cpp that go incredible and impressive lengths to support these old archs you will very quickly run into show-stopping issues that make you wish you just handed over the money for >= 7.0.

I support any use of old hardware but this kind of reminds me of my "ancient" X5690 that has impressive performance (relatively speaking) but always bites me because it doesn't have AVX.

mk_stjames · on April 10, 2024

This is all very true for Machine-Learning research tasks, were yes, if you want that latest PyTorch library function to work you need to be on the latest ML code.

But my work/fun is in CFD. One of the main codes I use for work was written to be supported primarily at the time of Pascal. Other HPC stuff too that can be run via OpenCL, and is still plenty compatible. Things compiled back then will still run today; It's not a moving target like ML has been.

kkielhofner · on April 10, 2024

Exactly. Demand for FP64 is significantly lower than for ML/AI.

Pascal isn’t incredibly cheap by comparison because it’s some secret hack. It’s cheap by comparison because most of the market (AI/ML) doesn’t want it. Speaking of which…

At the risk of “No True Scotsman” what qualifies as HPC gets interesting but just today I was at a Top500 site that was talking about their Volta system not being worth the power, which is relevant to parent comment but still problematic for reasons.

I mentioned llama.cpp because the /r/locallama crowd, etc has actually driven up the cost of used Pascal hardware because they treat it as a path to get VRAM on the cheap with their very very narrow use cases.

If we’re talking about getting a little FP64 for CFD that’s one thing. ML/AI is another. HPC is yet another.

gymbeaux · on April 10, 2024

Hey that’s not fair, the X5690 is VERY efficient… at heating a home in the winter time.

egorfine · on April 10, 2024

Easier said than done. I've got a dual X5690 at home in Kiev, Ukraine and I just couldn't find anything to run on it 24x7. And it doesn't produce much heat idling. I mean at all.

gymbeaux · on April 10, 2024

All the sane and rational people are rooting for you here in the U.S. I’m sorry our government is garbage and aid hasn’t been coming through as expected. Hopefully Ukraine can stick it to that chicken-fucker in the Kremlin and retake Crimea too.

I didn’t have an X5690 because the TDP was too high for my server’s heatsinks, but I had 90W variants of the same generation. To me, two at idle produced noticeable heat, though not as much as four idling in a PowerEdge R910 did. The R910 idled at around 300W.

There’s always Folding@Home if you don’t mind the electric bill. Plex is another option. I know a guy running a massive Plex server that was on Westmere/Nehalem Xeons until I gave him my R720 with Haswell Xeons.

egorfine · on April 10, 2024

> I’m sorry our government is garbage

It looks pathetic indeed. Makes many people question: if THAT'S democracy, then maybe it's not worth fighting for.

> All the sane and rational people are rooting for you here in the U.S.

The same could be said about russian people (sane and rational ones). But what do both people have in common? The answer is: currently both nations are helpless to change what their government does.

> are rooting for you here in the U.S.

I know. We all truly know and greatly appreciate that. There would be no Ukraine if not American weapons and help.

> There’s always Folding@Home

Makes little sense power-wise.

tambre · on April 10, 2024

Run BOINC maybe? [0]

[0]: https://boinc.berkeley.edu/

egorfine · on April 10, 2024

Makes little sense to actually run anything on X5960 power-wise

JonChesterfield · on April 9, 2024

I really like this side to AMD. There's a strategic call somewhere high up to bias towards collaboration with other companies. Sharing the fabric specifications with broadcom was an amazing thing to see. It's not out of the question that we'll see single chips with chiplets made by different companies attached together.

01HNNWZ0MV43FF · on April 9, 2024

Maybe they feel threatened by ARM on mobile and Intel on desktop / server. Companies that think they're first try to monopolize. Companies that think they're second try to cooperate.

rhelz · on April 9, 2024

Well, lets not forget, AMD is AMD because they reverse-engineered Intel chips....

treprinum · on April 10, 2024

IBM didn't want to rely solely on Intel when introducing PCs so it forced Intel to share its arch with another manufacturer that turned out to be AMD. It's not like AMD stole it. Math coprocessor was in turn invented by AMD (Am9511, Am9512) and licensed by Intel (8231, 8232).

rhelz · on April 10, 2024

They certainly didn't steal it. But Intel didn't second-source pentiums, or any chip with SIMD extension. AMD reverse-engineered those fair and square.

vegabook · on April 10, 2024

Also AMD64

formerly_proven · on April 9, 2024

The price is low because they’re useless (except for replacing dead cards in a DGX), if you had a 40$ PCIe AIC-to-SXM adapter, the price would go up a lot.

> I'm one of those people who finds 'retro-super-computing' a cool hobby and thus the interfaces like OAM being open means that these devices may actually have a life for hobbyists in 8~10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.

Very cool hobby. It’s also unfortunate how stringent e-waste rules lead to so much perfectly fine hardware to be scrapped. And how the remainder is typically pulled apart to the board / module level for spares. Makes it very unlikely to stumble over more or less complete-ish systems.

KeplerBoy · on April 9, 2024

I'm not sure the prices would go up that much. What would anyone buy that card for?

Yes, it has a decent memory bandwidth (~750 GB/s) and it runs CUDA. But it only has 16 GB and doesn't support tensor cores or low precision floats. It's in a weird place.

trueismywork · on April 9, 2024

Scientific computing would buy it up like hot cakes.

KeplerBoy · on April 9, 2024

Only if the specific workload needs FP64 (4.5 Tflop/s), the 9 Tflop/s for FP32 can be had for cheap with Turing or Ampere consumer cards.

Still, your point stands. It's crazy how that 2016 GPU has two thirds the FP32 power of this new 2024 unobtanium card and infinitely more FP64.

algo_trader · on April 9, 2024

Somewhat off topic:

Is there a similar "magic value card" for low memory (2GB?) 8-bit LLMs?

Since memory is the expensive bit, surely there are low cost low memory models?

KeplerBoy · on April 10, 2024

I believe that's what tenstorrent is aiming for.

abdullin · on April 10, 2024

The main offer of Tenstorrent goes into server racks and is designed to form clusters.

Standalone cards are more like dev kits.

(I’ve been tracking Tenstorrent for 3+ years and currently have Grayskull in ML test rig together with 3090)

jsight · on April 9, 2024

IDK, is it really that much more powerful than the P40, which is already fairly cheap?

mk_stjames · on April 10, 2024

The P100 has amazing double precision (FP64) flops (due to a 1:2 FP ratio that got nixed on all other cards) and a higher memory bandwidth which made it a really standout GPU for scientific computing applications. Computational Fluid Dynamics, etc.

The P40 was aimed at the image and video cloud processing market I think, and thus the GDDR ram instead of HBM, so it got more VRAM but at much less bandwidth.

7speter · on April 10, 2024

Well, the p40 has 24gb VRAM, which makes it the perfect hobbyist card for a llm, assuming you can keep it cool.

7speter · on April 10, 2024

The pci-e p100 is has 16gb vram and won’t go below 160 dollars. Prices for these things would pick up if you could put them in some sort of pcie adapter

gymbeaux · on April 10, 2024

As “humble” as NVIDIA’s CEO appears to be, NVIDIA the company (he’s been running this whole time), made decision after decision with the simple intention of killing off its competition (ATI/AMD). Gameworks is my favorite example- essentially if you wanted a video game to look as good as possible, you needed an NVIDIA GPU. Those same games played on AMD GPUs just didn’t look as good.

Now that video gaming is secondary (tertiary?) to Nvidia’s revenue stream, they could give a shit which brand gamers prefer. It’s small time now. All that matters is who companies are buying their GPUs from for AI stuff. Break down that CUDA wall and it’s open-season. I wonder how they plan to stave that off. It’s only a matter of time before people get tired of writing C++ code to interface with CUDA.

mike_hearn · on April 10, 2024

You don't need to use C++ to interface with CUDA or even write it.

A while ago NVIDIA and the GraalVM team demoed grCUDA which makes it easy to share memory with CUDA kernels and invoke them from any managed language that runs on GraalVM (which includes JIT compiled Python). Because it's integrated with the compiler the invocation overhead is low:

https://developer.nvidia.com/blog/grcuda-a-polyglot-language...

And TornadoVM lets you write kernels in JVM langs that are compiled through to CUDA:

https://www.tornadovm.org

There are similar technologies for other languages/runtimes too. So I don't think that will cause NVIDIA to lose ground.

gymbeaux · on April 10, 2024

So these alternatives exist yes, but are they “production ready”- in other words, are they being used. My opinion is that while you can use another language, most companies for one reason or another are still using C++. I just don’t really know what the reason(s) are.

I think about other areas in tech where you can use whatever language, but it isn’t practical to do so. I can write a backend API server in Swift… or perhaps more relevant- I can use AMD’s ROCm to do… anything.

buildbot · on April 9, 2024

The SXM2 interface is actually publicly documented! There is an open compute spec for a 8-way baseboard. You can find the pinouts there.

mk_stjames · on April 9, 2024

I had read their documents such as the spec for the Big Basin JBOG, where everything is documented except the actual pinouts on the base board. Everything leading up to it and from it is there but the actual MegArray pinout connection to a single P100/V100 I never found.

But maybe there was more I missed. I'll take another look.

mk_stjames · on April 10, 2024

Upon further review... I think any actual base board schematics / pinouts touching the Nvidia hardware directly is indeed kept behind some sort of NDA or OEM license agreement and is specifically kept out of any of those documents for the Open Compute project JBOG rigs.

I think this is literally the impetus for their OAM spec which makes the pinout open and shareable. Up until that, they had to keep the actual designs of the baseboards out of the public due to that part being still controlled Nvidia IP.

buildbot · on April 10, 2024

Hmm interesting, I was linked to an OCP dropbox with a version that did have the connector pinouts. Maybe something someone shouldn’t have posted then…

numpad0 · on April 10, 2024

I could find OCP accelerator spec but it looks like an open source reimplementation, not actual SXM2. That said, the photos of SXM2-PCIe adapters I could find look almost entirely passive, so I don't think all hopes are lost either.

IntelMiner · on April 10, 2024

It would be a shame if such a thing were to fall off the back of a truck as they say

CYR1X · on April 10, 2024

couldn't someone just buy one of those chinese sxm2 to pcie adapter boards and test continuity to get the pinouts? I have one that could take like 10 minutes

daedalush · on April 14, 2024

I have one, my plan was to map the pin out. My six month old has different plans for me which involve me not sleeping.

If you're willing to do the testing and post the pin out, I'll send it your way.

Would like to have it sent back when you're finished though, and I'd pay shipping both ways.

I do have the ground pins mapped out and could share that which should save some time.

wmf · on April 9, 2024

Why don't they sell used P100 DGX/HGX servers as a unit? Are those bare P100s only so cheap precisely because they're useless?

mk_stjames · on April 9, 2024

I have a theory some big cloud provider moved a ton of racks from SXM2 P100's to SXM2 V100's (those were a thing) and thus orphaned an absolute ton of P100's without their baseboards.

Or, these salvage operations just stripped racks and kept the small stuff and e-waste the racks because they think it's the more efficient use of their storage space and would be easier to sell, without thinking correctly.

bushbaba · on April 10, 2024

A ton of Nvidia GPUs fry their memory over time and need to be scrapped. Lookup nvidia (A100/H100) row remapping failure

pavelstoev · on April 10, 2024

Best Tflops/$ is actually 4090, then 3090. Also L4

lostmsu · on April 11, 2024

P100s would not give you more Tflops/$ if you take electricity into account.