The real issue is that housing is heavily underweighted in the CPI basket. How many people do you know who spend only 12.9% of their after-tax take-home pay on housing, water, and fuel? Only people with paid-off mortgages.
If you bought an Nvidia H100 at wholesale prices (around $25k) and ran it 24/7 at commercial electric rates (let's say $0.10 per kWh), it would take you over 40 years to spend the purchase price of the GPU on electricity. Maybe bump it down to 20 to account for data center cooling.
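A quick sanity check of that arithmetic, assuming the H100's ~700 W rated TDP and the round-number prices above:

    # Back-of-the-envelope: years of electricity needed to equal the GPU's purchase price.
    # Assumes ~$25k purchase price, ~700 W draw (the H100 SXM TDP), and $0.10/kWh.
    purchase_price = 25_000          # USD, assumed wholesale price
    power_kw = 0.7                   # kW, roughly the H100's rated TDP
    rate = 0.10                      # USD per kWh, assumed commercial rate

    annual_cost = power_kw * 24 * 365 * rate       # ~$613/year
    years_to_match = purchase_price / annual_cost  # ~41 years
    print(f"{annual_cost:.0f} USD/year -> {years_to_match:.0f} years")
    # Doubling the effective power draw to approximate cooling overhead halves it to ~20 years.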
I don't think the cost of the AI is close to converging on the price of power yet. Right now it's mostly the price of hardware and data center space, minus subsidies.
Actually, since the QE used to fix 2008, foreign investors and governments have cut down on bond buying. Foreign holdings of Treasuries are up about $2T since 2014, while current account deficits over that period sum to around $7T.
Nowadays we're mostly selling companies and real estate (though there was a nice jump in bond buying a few years ago when rates went up).
Or maybe a lot of it can't deliver returns in line with the current price. The world realizes that and stops sending us stuff in exchange for those investments. Then we're stuck with limited manufacturing, high inflation, and relatively low ownership of our existing high-value exports.
If the pie gets bigger too slowly then I think we still lose.
The 1800 on the H100s is with 2:4 sparsity; it's half of that without. Not sure if the TPU number is doing that too, but I don't think 2:4 sparsity is used that heavily, so I'd probably compare without it.
I remember reading that it's too hard to get good memory bandwidth/L2 utilization with the fancy algorithms; you need to read contiguous blocks and be able to reuse them repeatedly. But I also haven't looked at the GPU BLAS implementations directly.
GDDR isn't like the RAM that connects to a CPU; it's much more difficult and expensive to add more. You can get up to 48GB with some expensive stacked GDDR, but if you wanted to add more stacks you'd need to solve some serious signal-timing headaches that most users wouldn't benefit from.
I think the high-memory local inference stuff is going to come from "AI enabled" CPUs that share the memory in your computer. Apple is doing this now, but cheaper options are on the way. As a shape it's just suboptimal for graphics, so it doesn't make sense for any of the GPU vendors to do it.
As someone else said, I don't think you have to have GDDR; surely there are other options. Apple does a great job of it on their APUs with up to 192GB, and even an old AMD Threadripper chip can do quite well with its DDR4/5 performance.
For AI inference you definitely have other options, but for low-end graphics? The LPDDR that Apple (and Nvidia in Grace) use would be super expensive for comparable bandwidth (think $3+/GB, and to get 500GB/sec you need at least 128GB).
And that 500GB/sec is pretty low for a GPU; it's about a 4070's bandwidth, but the memory alone would add $500+ to the cost of the inputs, not even counting the advanced packaging (getting those bandwidths out of LPDDR needs an organic substrate package).
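For a rough sense of those numbers (everything here is an assumption: LPDDR5x-8533 speeds, an Apple/Grace-style 512-bit bus, ~$3/GB, and the claimed 128GB minimum capacity for a bus that wide):

    # Sketch of a hypothetical wide-LPDDR card; all inputs are assumptions.
    data_rate_gtps = 8.533            # LPDDR5x-8533: ~8.53 billion transfers/s per pin
    bus_width_bits = 512              # Apple M-series Max / Grace-style width (assumption)
    price_per_gb = 3                  # assumed $/GB for LPDDR
    min_capacity_gb = 128             # claimed minimum capacity at that bus width

    bandwidth_gbps = bus_width_bits / 8 * data_rate_gtps   # ~546 GB/s peak
    memory_cost = min_capacity_gb * price_per_gb           # ~$384 at $3/GB; higher $/GB pushes past $500
    print(f"{bandwidth_gbps:.0f} GB/s peak, ~${memory_cost}+ of DRAM")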
It's not that you can't; it's just that when you start doing this it stops being like a graphics card and becomes like a CPU.
They can use LPDDR5x; it would still massively accelerate inference of large local LLMs that need more than 48GB of RAM. Any tensor swapping between CPU RAM and GPU RAM kills performance.
I think we don't really disagree; I just think that this shape isn't really a GPU, it's a CPU, because it isn't very good for graphics at that point.
That's why I said "basic GPU". It doesn't have to be too fast, but it should still be way faster than a regular CPU. Intel already built Xeon Phi, so a lot of the pieces were developed already (memory controller, heavily parallel dies, etc.).
I wonder if at that point you'd just be better served by a CPU with 4 channels of RAM. If my math is right, 4 channels of DDR5-8000 would get you 256GB/s. Not as much bandwidth as a typical discrete GPU, but it would be trivial to get many hundreds of GB of RAM, and it would be expandable.
Unfortunately I don't think either Intel or AMD makes a CPU that supports quad channel RAM at a decent price.
4 channels of DDR5-8000 would give you 128GB/sec. DDR5 has 2x32-bit channels per DIMM rather than the 1x64-bit channel DDR4 had. You would need 8 channels of DDR5-8000 to reach 256GB/sec.
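The arithmetic behind that correction, as a small sketch (peak theoretical numbers only):

    # Peak bandwidth = channels * (channel width in bytes) * transfer rate.
    # DDR5 splits each DIMM into two 32-bit subchannels, so "4 channels" can mean
    # either 4x32-bit (128 GB/s) or 4 full DIMMs = 8x32-bit (256 GB/s).
    def peak_bandwidth_gbps(channels, width_bits, mtps):
        return channels * (width_bits / 8) * (mtps / 1000)

    print(peak_bandwidth_gbps(4, 32, 8000))   # 128.0 GB/s -- four DDR5 subchannels
    print(peak_bandwidth_gbps(8, 32, 8000))   # 256.0 GB/s -- four full DDR5 DIMMs
    print(peak_bandwidth_gbps(4, 64, 3200))   # 102.4 GB/s -- four DDR4-3200 channels, for comparison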
I think all the people saying "just use a CPU" massively underestimate the speed difference between current CPUs and current GPUs. There's like four orders of magnitude. It's not even in the same zip code. Say you have a 64-core CPU at 2GHz with 512-bit, 1-cycle FP16 instructions. That gives you 32 ops per cycle per core, 2048 across the entire package, so 4 TFLOPS.
My 7900 XTX does 120 TFLOPS.
To match that, you would need to scale that CPU up to 2048 cores, or 2KB-wide registers (still one cycle!), or 64GHz.
I guess if you had 1024-bit registers and 8GHz, you could get away with only 240 cores. Good luck dissipating that heat, by the way. To reverse an opinion I'm seeing in this thread: at that point your CPU starts looking more like a GPU by necessity.
Usually you can do 2 AVX-512 operations per cycle, and with FMA (fused multiply-add) instructions you get two floating-point operations for the price of one. That's 128 operations per cycle per core, which gives 16 TFLOPS on a 2GHz 64-core CPU, not 4 TFLOPS. That's about 1 order of magnitude of difference, rather than 4 orders of magnitude.
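Putting the two estimates side by side, with the assumptions spelled out (2 AVX-512 FMA issues per cycle, an FMA counted as 2 FLOPs, and the quoted 120 TFLOPS figure for the 7900 XTX):

    # CPU estimate from the comment above vs the quoted GPU figure; numbers assumed/rounded.
    cores, clock_ghz = 64, 2.0
    lanes = 512 // 16                         # 32 FP16 lanes per AVX-512 register
    flops_per_core_per_cycle = lanes * 2 * 2  # 2 issue ports, FMA = multiply + add
    cpu_tflops = cores * clock_ghz * flops_per_core_per_cycle / 1000   # ~16.4
    gpu_tflops = 120                          # quoted 7900 XTX FP16 number
    print(f"CPU ~{cpu_tflops:.1f} TFLOPS, GPU/CPU ratio ~{gpu_tflops / cpu_tflops:.0f}x")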
For inference, prompt processing is compute intensive, while token generation is memory bandwidth bound. The differences in memory bandwidth between CPUs and GPUs tend to be more profound than the difference in compute.
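A minimal sketch of why, assuming a ~70 GB set of weights (e.g. a 70B model at 8-bit) and round bandwidth figures for a server CPU and a 3090-class GPU:

    # Token generation streams roughly all the weights once per token, so the ceiling
    # is bandwidth / model size. All numbers below are assumed round figures.
    weights_gb = 70                   # e.g. a 70B-parameter model quantized to 8-bit
    cpu_bw_gbps = 100                 # assumed server CPU memory bandwidth
    gpu_bw_gbps = 1000                # assumed 3090-class GPU memory bandwidth

    print(f"{cpu_bw_gbps / weights_gb:.1f} tokens/s ceiling on the CPU")   # ~1.4
    print(f"{gpu_bw_gbps / weights_gb:.1f} tokens/s ceiling on the GPU")   # ~14
    # Prompt processing reuses each weight across the whole batch of prompt tokens,
    # which is why it ends up compute bound instead.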
That's fair. On the other hand, there's basically exactly one CPU line with FP16 AVX-512 anyway, and 64-core parts aren't exactly commonplace either. And even with all those advantages, using a datacenter CPU, you're still a factor of 10 off from a GPU that isn't even consumer top-end. With a normal processor, say 16 cores with 16-wide float ops, even with fused ops and dispatching two ops per cycle you're still only at 2 TFLOPS and a ~50x gap. In consumer spaces, I'm more optimistic about dedicated coprocessors. Maybe even an iTPU?
Zen 6 is supposed to add FP16 AVX512 support if AMD’s leaked slides mean what I think they mean. Here is a link to a screenshot of the leaked slides MLID published:
Running on a GPU like my 3090 Ti will likely outperform it by two orders of magnitude, but I have managed to push the needle slightly on the state of the art performance for prompt processing on my CPU. I suspect an additional 15% improvement is possible, but I do not expect to be able to realize it. In any case, it is an active R&D project that I am doing to learn how these things work.
Finally, to answer your question, I have no good answers for you (or more specifically, no answers that I like). I have been trying to think of ways to do fast local inference on high-end models cost effectively for months. So far, I have nothing to show for it aside from my R&D into CPU Llama 3 inference, since none of my ideas are able to bring the hardware costs needed for Llama 3.1 405B below $10,000 at an acceptable performance level. My idea of acceptable performance is 10 tokens per second for token generation and 4000 tokens per second for prompt processing, although perhaps lower prompt processing performance is acceptable with prompt caching.
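To make the difficulty concrete, a hedged sketch of the bandwidth that the 10 tokens/second target implies, assuming 8-bit weights and the stream-all-weights-per-token model of token generation:

    # All assumptions: 405B parameters at 1 byte each, weights streamed once per token.
    params_billion = 405
    bytes_per_param = 1               # assumed 8-bit quantization
    target_tps = 10

    weights_gb = params_billion * bytes_per_param     # ~405 GB just to hold the model
    needed_bw_gbps = weights_gb * target_tps          # ~4,050 GB/s aggregate bandwidth
    print(f"{weights_gb} GB of weights -> {needed_bw_gbps} GB/s needed")
    # That is roughly four 1 TB/s GPUs' worth of bandwidth, plus >400 GB of memory to
    # spread the model across, which is hard to assemble under $10,000.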
This is only relevant for the flash attention part of the transformer, but an NPU is an equally suitable replacement for a GPU for flash attention.
Once you have offloaded flash attention, you're back to GEMV having a memory bottleneck. GEMV does a single multiplication and addition per parameter. You can add as many exaFLOPS as you want; it won't get faster than your memory.
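A quick way to see the mismatch, using the 7900 XTX numbers quoted elsewhere in the thread as stand-ins (assumed/rounded):

    # Arithmetic intensity of GEMV vs what the hardware can feed. Each FP16 weight
    # (2 bytes) is read once and used for one multiply + one add (2 FLOPs), so the
    # kernel offers ~1 FLOP per byte of memory traffic.
    flops_per_byte_gemv = 2 / 2                    # 2 FLOPs per 2-byte weight

    gpu_tflops, gpu_bw_tbps = 120, 1.0             # assumed 7900 XTX-class FP16 rate and bandwidth
    hw_flops_per_byte = gpu_tflops / gpu_bw_tbps   # ~120 FLOPs needed per byte to keep ALUs busy

    utilization = flops_per_byte_gemv / hw_flops_per_byte
    print(f"GEMV can keep ~{utilization:.1%} of the ALUs busy")   # ~0.8%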
I guess it's hard to know how well this would compete with integrated GPUs, especially at a reasonable price point. If you wanted to spend $4,000+ on it, it could be very competitive and might look something like Nvidia's Grace Hopper superchip, but if you want the product to be under $1k I think it might be better to just buy separate cards for your graphics and AI stuff.
It is not stacked; it is multi-rank. Stacking means putting multiple layers on the same chip. They are already doing it for HBM, and they will likely do it for other forms of DRAM in the future. Samsung reportedly will begin doing it in the 2030s:
I am not sure why they can already do stacking for HBM but not GDDR and DDR. My guess is that it is cost related; I have heard that HBM reportedly costs 3 times more than DDR. Whatever they are doing to stack it now is likely much more expensive than their planned 3D fabrication node.