
From their die size and power usage relative to performance compared to Nvidia, it's clear they didn't hit the goals they wanted with Navi 31 and 32, and it's definitely because of the chiplet design.

I don't have any data-center experience, but on the consumer side the Navi 31-based 7900 XTX is also a bit of a temperamental GPU, though I don't know how much of that is on the silicon and how much is software.

Though it is clear that some form of chiplets will have to be used, as building large portions of chips on cutting-edge nodes with higher defect rates will just become more expensive as time goes on. More so for the parts of the chips that just don't scale down with newer nodes anymore anyway.



I previously ran 150,000 AMD GPUs for an Ethereum mining operation. We ran them on the edge of crashing, each individually tuned to its highest clock/lowest voltage.
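The per-card tuning described above can be sketched as a simple search: try the highest clock first, and for each clock find the lowest voltage that still passes a stability test. This is a minimal illustration only; `set_core_clock`, `set_core_voltage`, and `run_stability_test` are hypothetical stand-ins for whatever vendor tooling (e.g. sysfs knobs or SMI utilities) an operation like this would actually use.

```python
def tune_gpu(gpu_id, clocks_mhz, voltages_mv,
             run_stability_test, set_core_clock, set_core_voltage):
    """Find (highest stable clock, lowest stable voltage) for one GPU.

    All three callables are injected so this sketch stays
    hardware-agnostic: they are assumptions, not a real API.
    """
    for clock in sorted(clocks_mhz, reverse=True):   # highest clock first
        set_core_clock(gpu_id, clock)
        for mv in sorted(voltages_mv):               # lowest voltage first
            set_core_voltage(gpu_id, mv)
            if run_stability_test(gpu_id):
                return (clock, mv)                   # first pass = lowest stable mV
    return None                                      # nothing stable in range
```

Because every chip is a "snowflake," the whole search has to run per card; a single fleet-wide profile would leave performance on the table for good bins and crash the bad ones.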

I can definitively say they are ALL snowflakes. Every single one. There was wide variance in each chip and across batches as well. We had them OEM-placed onto the boards too, and there was variance across the OEMs and their batches. Then it went down to even the datacenter and the PSUs, and how clean the power was.

I actually started to collect data on where each chip was cut from the wafer, but never got a chance to process it and correlate it with performance. There was a running theory that the edge chips were not as good.
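The analysis that never got run is straightforward in principle: correlate each die's distance from the wafer center with its achieved performance. A strongly negative correlation would support the edge-chip theory. A minimal sketch, where the record layout `(die_x, die_y, hashrate)` is an assumption:

```python
import math

def edge_effect(records):
    """Pearson correlation between radial die position and hashrate.

    records: iterable of (die_x, die_y, hashrate), with die coordinates
    measured from the wafer center. Returns a value in [-1, 1];
    strongly negative would mean edge dies really do perform worse.
    """
    radii = [math.hypot(x, y) for x, y, _ in records]
    rates = [h for _, _, h in records]
    n = len(radii)
    mean_r, mean_h = sum(radii) / n, sum(rates) / n
    cov = sum((r - mean_r) * (h - mean_h) for r, h in zip(radii, rates))
    sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in radii))
    sd_h = math.sqrt(sum((h - mean_h) ** 2 for h in rates))
    return cov / (sd_r * sd_h)
```

In practice you'd also want to control for board OEM and batch, since (as noted above) those introduce variance of their own.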


What was the rate of variance you were seeing? I'm curious how it compares to other similar setups. I haven't been technically in the weeds for almost 6-7 years, so I'd like to know.

That said, HPC is hard, and even minor hardware/OEM failures can have a massive downstream impact.

Also you seem to be downvoted for no reason. At that size of a cluster, I'd be surprised if you didn't have a subset of flawed GPUs.

> There was a running theory that the edge chips were not as good

My gut would agree with that sentiment, simply because engineering is hard and it takes time to stabilize QA (eg. look at how long it took to stabilize hard disk QA, let alone a bleeding edge GPU architecture).


Again, only consumer-side knowledge, but there's clearly a very wide degree of variance in Navi 31 just from all the models it's used in, as I doubt they're purposefully cutting down the silicon that much unless the parts were already unusable for more expensive GPUs.

It goes from a 7900 GRE and 7900 XT up to a 7900 XTX, with roughly a fifth of the processing units unavailable on the lowest tier compared to the highest. Then, from what I know from overclocking communities, different 7900 XTXs also vary by over 10% even on the same 550 W BIOS. It's probably more apparent when going over that power limit, but that requires hardware modding, and people aren't doing that on low-tier bins, so there isn't as much data.


Yea, don't know about the downvotes. ¯\_(ツ)_/¯

We had a minimum expected hashrate and power profile depending on the card model and batch. It would be in the 0-20% range.
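A fleet health check like the one described, flagging cards that fall below the minimum expected hashrate for their model and batch, might look like this. The baseline table, field names, and 20% tolerance are invented for illustration:

```python
def flag_underperformers(cards, baselines, tolerance=0.20):
    """Return ids of cards whose hashrate falls more than `tolerance`
    below the baseline for their (model, batch) combination.

    cards:     list of dicts with "id", "model", "batch", "hashrate"
    baselines: {(model, batch): expected_hashrate}  -- assumed schema
    """
    flagged = []
    for card in cards:
        floor = baselines[(card["model"], card["batch"])] * (1 - tolerance)
        if card["hashrate"] < floor:
            flagged.append(card["id"])
    return flagged
```

Keying the baseline on both model and batch matters here, since the variance described above shows up between batches as well as between individual chips.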

For what I'm doing now with my company, which is more bleeding edge... we just deployed a cluster of 128 MI300X (8 GPUs per chassis). A good 50% of the chassis had 1-2 issues on delivery, either in a GPU or the baseboard.



