"General purpose CPUs are going to stay to become the little brain that orchestrates GPUs."
If that were going to happen, it would have happened already.
CPUs are genuinely good at what they do, and "what they do" covers a lot of tasks that GPUs are actually terrible at. If all we had in the world were GPUs and someone invented the CPU, we'd hail them as a genius. A lot of people seem to think that GPUs are just "better", ambiently better at everything, but that's light-years from the truth. They are quite spectacularly terrible at a lot of very common tasks. There are many very good reasons that GPUs are still treated as accelerators for CPUs and not vice versa.
I'm thinking more like pseudointellect over serial to attach a $3 ESP32 to. Since it's basically tokens in, tokens out, let's just cut out the unnecessary parts. It's like querying the cloud models, except it's your silicon that you personally soldered to the ESP32, so nobody will break your home assistant with a system prompt update or a fine-tuning run.
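A minimal sketch of the appliance side of that, assuming a plain newline-delimited protocol over a UART; the port name and the generate_reply() stand-in for whatever local model backend you wire up are both my assumptions, not anything specified above:

    import serial  # pyserial

    def generate_reply(prompt: str) -> str:
        # Hypothetical stand-in for the local model: tokens in, tokens out.
        return "ok: " + prompt

    link = serial.Serial("/dev/ttyUSB0", 115200, timeout=1)  # the ESP32's UART
    while True:
        line = link.readline()            # one prompt per newline-terminated line
        if not line:
            continue
        reply = generate_reply(line.decode("utf-8", errors="replace").strip())
        link.write(reply.encode("utf-8") + b"\n")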
In five to ten years, when LLM architectures have stabilized, mapping them straight onto hardware will probably make sense. With today's processes, a hundred billion parameters might fit onto a single silicon wafer at ~1.5-bit precision implemented directly in logic gates. Higher precision drives the gate count up steeply, so for now it makes more sense to keep the weights in memory and reuse shared compute blocks for the math. We need to get ultra-low-precision LLMs working for that future, though.
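The "~1.5 bit" figure presumably refers to ternary weights (log2(3) ≈ 1.58 bits per weight), as in BitNet-style schemes. A minimal sketch of one common way to get there, absmean rounding to {-1, 0, +1} with a per-tensor scale; the scaling rule here is my assumption about what's meant:

    import numpy as np

    def ternarize(w: np.ndarray):
        """Round weights to {-1, 0, +1} times a per-tensor scale (absmean rule)."""
        scale = np.abs(w).mean() + 1e-8
        q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
        return q, scale

    w = np.random.randn(4096, 4096).astype(np.float32) * 0.02
    q, s = ternarize(w)
    print(q.dtype, s, np.unique(q))  # int8 storage here; ~1.58 bits/weight in principle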
Realistically, specializing the data flow is all you can do. Assuming a modern CPU contains on the order of 10 billion transistors, that only amounts to about 1.2 GiB of storage before you account for any actual logic (i.e. 1 bit per transistor). DRAM hardware is quite different from that of processing elements, and it takes quite a lot of DRAM chips to hold the weights of a single model.
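The back-of-the-envelope behind that figure (and real on-die SRAM spends about six transistors per bit, so in practice it's considerably worse):

    transistors = 10e9                 # order-of-magnitude count for a modern CPU
    best_case_bytes = transistors / 8  # wildly optimistic: 1 bit stored per transistor
    print(best_case_bytes / 2**30)     # ~1.16 GiB before spending anything on logic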
I mean... LLM-in-a-box would actually be pretty neat! I'm looking at some air-gapped work coming up, and having something like that would be quite handy.
Isn't that easily accomplished by setting up a local deployment and then yanking the network cable? Anything that can quickly run a capable LLM is going to be a pretty beefy box though. More like LLM in an expensive space heater.
I was thinking more like those Bitcoin-mining USB ASICs that used to be a thing, but instead of becoming e-waste, you can still use them to talk to GPT-2 or whatever. I'm picturing an LLM appliance.
There is no magic ASIC that can get around needing to do hundreds of watts worth of computations and having on the order of hundreds of gigabytes of very fast memory. Otherwise the major players would be doing that instead of (quite literally) investing in nuclear reactors to power their future data center expansions.
Google has its own ASIC in the TPU. The other major players have leveraged NVIDIA and -- to a lesser extent -- AMD. This is partly because investing in TPUs/ASICs is complex (it needs specialist knowledge and fabrication capacity) and partly because GPU performance is hard to compete with.
Training is the thing that costs the most in terms of power/memory/energy, often requiring months of running multiple (likely 4-8) A100/H100 GPUs on the training data.
Performing inference is cheaper because you can 1) keep the model loaded in VRAM, and 2) serve it from just one or two GPUs rather than a cluster. With 80GB of capacity per H100, you need two to run a 70B model at FP16, or one at FP8; 32B models and smaller fit on a single H100.
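The arithmetic behind those counts, weights only (KV cache and activations need headroom on top, which is why the FP8 case is tight):

    H100_GB = 80  # SXM H100 capacity

    def h100s_for(params_billion: float, bytes_per_param: int) -> int:
        weights_gb = params_billion * bytes_per_param  # weights only
        return -(-int(weights_gb) // H100_GB)          # ceiling division

    for name, p, b in [("70B FP16", 70, 2), ("70B FP8", 70, 1), ("32B FP16", 32, 2)]:
        print(f"{name}: ~{p * b} GB of weights -> {h100s_for(p, b)} x H100")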
ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.
I think the sweet spot will be when CPUs get support for high-throughput matrix operations, similar to the existing SIMD extensions (Intel's AMX and Arm's SME are early examples; there's a toy sketch of that kind of operation below the footnote). That way the system benefits from being able to use system memory [1] and doesn't have another chip/board consuming power. -- IIUC, things are already moving in that direction for consumer devices.
[1] This will allow access to large amounts of memory without having to chain multiple GPUs, which would make it possible to run the larger models at higher precision and to process the large amounts of training data more efficiently.
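To make concrete what those matrix extensions accelerate: a tile-at-a-time low-precision matmul with wide accumulators. A toy version in plain numpy; the 16x16 tile size and int8/int32 types are illustrative assumptions, not any particular ISA:

    import numpy as np

    def tiled_matmul_i8(A: np.ndarray, B: np.ndarray, T: int = 16) -> np.ndarray:
        """Multiply int8 matrices tile by tile, accumulating in int32 --
        the access pattern that CPU matrix units accelerate in hardware."""
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N), dtype=np.int32)
        for i in range(0, M, T):
            for k in range(0, K, T):
                a = A[i:i+T, k:k+T].astype(np.int32)  # load one tile of A
                for j in range(0, N, T):
                    C[i:i+T, j:j+T] += a @ B[k:k+T, j:j+T].astype(np.int32)
        return C

    A = np.random.randint(-128, 127, (64, 64), dtype=np.int8)
    B = np.random.randint(-128, 127, (64, 64), dtype=np.int8)
    assert (tiled_matmul_i8(A, B) == A.astype(np.int32) @ B.astype(np.int32)).all()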
> ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.
Right but at that point you're describing an H100 plus an additional ASIC plus presumably a CPU and some RAM. Or a variant of an H100 with some specialized ML functions baked in. Both of those just sound like a regular workstation to me.
Inference is certainly cheaper but getting it running quickly requires raw horsepower (thus wattage, thus heat dissipation).
Regarding CPUs, there's a severe memory bandwidth issue. I haven't kept track of the extreme high-end hardware, but it's hard for them to compete with GPUs on raw memory throughput.
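Rough numbers for why: during token generation, every output token has to stream essentially all of the weights through the chip once, so peak memory bandwidth puts a hard ceiling on tokens per second. The bandwidth figures below are ballpark peak specs, the 40 GB model size is a stand-in for a ~70B model at ~4 bits/weight, and real throughput lands below these ceilings:

    def tokens_per_sec_ceiling(model_gb: float, bandwidth_gb_s: float) -> float:
        # Decode is memory-bound: each generated token reads ~all weights once.
        return bandwidth_gb_s / model_gb

    model_gb = 40  # e.g. a ~70B model quantized to ~4 bits/weight
    for name, bw in [("dual-channel DDR5 (~90 GB/s)", 90),
                     ("Apple M-series Max class (~400 GB/s)", 400),
                     ("H100 HBM3 (~3350 GB/s)", 3350)]:
        print(f"{name}: ~{tokens_per_sec_ceiling(model_gb, bw):.1f} tok/s ceiling")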
24 gigabytes is more than enough to run a local LLM for a small household or business.
This is "gaming PC" territory, not "space heater". I mean people already have PS5's and whatnot in their homes.
The hundreds-of-gigabytes requirement exists because the big cloud LLM providers went down the path of ever-increasing parameter counts. That path is a dead end, and we've already hit negative returns.
Prompt engineering + fine-tunes are the future, but you need developer brains for that, not TFLOPs.
It depends on 1) what model you are running; and 2) how many models you are running.
You can just about run a 32B (at Q4/Q5 quantization) on 24GB; the rough math is sketched below. Running anything bigger (such as the increasingly common 70B models, or larger still if you want something like Llama 4 or DeepSeek) means splitting the model between VRAM and system RAM. -- But yes, anything 24B or lower you can run comfortably, with enough capacity left over for the context.
If you have other models running as well -- text-to-speech, speech recognition, etc. -- then those take up VRAM too, both for their weights and during processing/generation. That affects the size of LLM you can fit.
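The rough math behind "just about": the bits-per-weight figures for common llama.cpp-style quants below are my ballpark assumptions, and the KV cache for the context sits on top of whatever this leaves free:

    def weights_gb(params_billion: float, bits_per_weight: float) -> float:
        # Weights only; the KV cache grows with context length on top of this.
        return params_billion * bits_per_weight / 8

    BUDGET_GB = 24
    for name, params, bpw in [("32B @ ~Q4", 32, 4.5),
                              ("32B @ ~Q5", 32, 5.5),
                              ("70B @ ~Q4", 70, 4.5)]:
        gb = weights_gb(params, bpw)
        verdict = "fits (tight)" if gb < BUDGET_GB else "spills into system RAM"
        print(f"{name}: ~{gb:.0f} GB of {BUDGET_GB} GB -> {verdict}")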
If you focus on just the matmuls -- no CUDA, no architectures, no InfiniBand, everything on one chip: put input tokens in input registers, get output tokens from output registers, with the model baked into the gates -- you should be able to save some power. Not sure if it's 2x, 10x, or 100x, but there are certainly gains to be had.
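To make the baked-into-gates intuition concrete: with ternary weights, a matrix-vector product needs no multipliers at all -- each output is just a fixed pattern of adds and subtracts, which in hardwired logic becomes wiring into adder trees. A toy numpy version, with arbitrary shapes chosen for illustration:

    import numpy as np

    def ternary_matvec(w_sign: np.ndarray, x: np.ndarray) -> np.ndarray:
        """w_sign holds only {-1, 0, +1}; the product reduces to adds/subtracts."""
        return (np.where(w_sign == 1, x, 0.0).sum(axis=1)
                - np.where(w_sign == -1, x, 0.0).sum(axis=1))

    w = np.random.choice([-1, 0, 1], size=(8, 16)).astype(np.int8)
    x = np.random.randn(16).astype(np.float32)
    assert np.allclose(ternary_matvec(w, x), w @ x)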