"General purpose CPUs are going to stay to become the little brain that orchestrates GPUs."
If that were going to happen, it would have happened already.
CPUs are genuinely good at what they do, and "what they do" covers a lot of tasks that GPUs are actually terrible at. If all we had in the world were GPUs and someone invented the CPU, we'd hail them as a genius. A lot of people seem to think that GPUs are just "better", ambiently better at everything, but that's light-years from the truth. They are quite spectacularly terrible at a lot of very common tasks. There are many very good reasons that GPUs are still treated as accelerators for CPUs and not vice versa.
I'm thinking more like pseudointellect over serial to attach a $3 ESP32 to. Since it's basically tokens in, tokens out, let's just cut out the unnecessary parts. It's like querying the cloud models, except it's your silicon that you personally soldered to the ESP32, so nobody will break your home assistant with a system prompt update or a fine-tuning run.
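A minimal sketch of the appliance side of that, assuming a plain newline-delimited protocol over a UART; the port name and the generate_reply() stand-in for whatever local model backend you wire up are both my assumptions, not anything specified above:

    import serial  # pyserial

    def generate_reply(prompt: str) -> str:
        # Hypothetical stand-in for the local model: tokens in, tokens out.
        return "ok: " + prompt

    link = serial.Serial("/dev/ttyUSB0", 115200, timeout=1)  # the ESP32's UART
    while True:
        line = link.readline()            # one prompt per newline-terminated line
        if not line:
            continue
        reply = generate_reply(line.decode("utf-8", errors="replace").strip())
        link.write(reply.encode("utf-8") + b"\n")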
In five to ten years, when LLM architectures have stabilized, mapping them straight onto hardware will probably make sense. With today's processes, a hundred billion parameters might fit onto a single silicon wafer at ~1.5-bit precision implemented directly in logic gates. Higher precision drives the gate count up steeply, so for now it makes more sense to keep the weights in memory and reuse shared compute blocks for the math. We need to get ultra-low-precision LLMs working for that future, though.
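The "~1.5 bit" figure presumably refers to ternary weights (log2(3) ≈ 1.58 bits per weight), as in BitNet-style schemes. A minimal sketch of one common way to get there, absmean rounding to {-1, 0, +1} with a per-tensor scale; the scaling rule here is my assumption about what's meant:

    import numpy as np

    def ternarize(w: np.ndarray):
        """Round weights to {-1, 0, +1} times a per-tensor scale (absmean rule)."""
        scale = np.abs(w).mean() + 1e-8
        q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
        return q, scale

    w = np.random.randn(4096, 4096).astype(np.float32) * 0.02
    q, s = ternarize(w)
    print(q.dtype, s, np.unique(q))  # int8 storage here; ~1.58 bits/weight in principle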
Realistically, specializing the data flow is all you can do. Assuming a modern CPU contains on the order of 10 billion transistors, that only amounts to about 1.2 GiB of storage before you account for any actual logic (i.e. 1 bit per transistor). DRAM hardware is quite different from that of processing elements, and it takes quite a lot of DRAM chips to hold the weights of a single model.
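The back-of-the-envelope behind that figure (and real on-die SRAM spends about six transistors per bit, so in practice it's considerably worse):

    transistors = 10e9                 # order-of-magnitude count for a modern CPU
    best_case_bytes = transistors / 8  # wildly optimistic: 1 bit stored per transistor
    print(best_case_bytes / 2**30)     # ~1.16 GiB before spending anything on logic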
I mean... LLM-in-a-box would actually be pretty neat! I'm looking at some air-gapped work coming up, and having something like that would be quite handy.
Isn't that easily accomplished by setting up a local deployment and then yanking the network cable? Anything that can quickly run a capable LLM is going to be a pretty beefy box though. More like LLM in an expensive space heater.
I was thinking more like those Bitcoin-mining USB ASICs that used to be a thing, but instead of becoming e-waste, you can still use them to talk to GPT-2 or whatever. I'm picturing an LLM appliance.
There is no magic ASIC that can get around needing to do hundreds of watts worth of computations and having on the order of hundreds of gigabytes of very fast memory. Otherwise the major players would be doing that instead of (quite literally) investing in nuclear reactors to power their future data center expansions.
Google has its own ASIC in the TPU. The other major players have leveraged NVIDIA and -- to a lesser extent -- AMD. This is partly because investing in TPUs/ASICs is complex (it needs specialist knowledge and fabrication capacity) and partly because GPU performance is hard to compete with.
Training is the thing that costs the most in terms of power/memory/energy, often requiring months of running multiple (likely 4-8) A100/H100 GPUs on the training data.
Performing inference is cheaper because you can 1) keep the model loaded in VRAM, and 2) serve it from just one or two GPUs rather than a cluster. With 80GB of capacity per H100, you need two to run a 70B model at FP16, or one at FP8; 32B models and smaller fit on a single H100.
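The arithmetic behind those counts, weights only (KV cache and activations need headroom on top, which is why the FP8 case is tight):

    H100_GB = 80  # SXM H100 capacity

    def h100s_for(params_billion: float, bytes_per_param: int) -> int:
        weights_gb = params_billion * bytes_per_param  # weights only
        return -(-int(weights_gb) // H100_GB)          # ceiling division

    for name, p, b in [("70B FP16", 70, 2), ("70B FP8", 70, 1), ("32B FP16", 32, 2)]:
        print(f"{name}: ~{p * b} GB of weights -> {h100s_for(p, b)} x H100")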
ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.
I think the sweet spot will be when CPUs get support for high-throughput matrix operations, similar to the existing SIMD extensions (Intel's AMX and Arm's SME are early examples; there's a toy sketch of that kind of operation below the footnote). That way the system benefits from being able to use system memory [1] and doesn't have another chip/board consuming power. -- IIUC, things are already moving in that direction for consumer devices.
[1] This will allow access to large amounts of memory without having to chain multiple GPUs, which would make it possible to run the larger models at higher precision and to process the large amounts of training data more efficiently.
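To make concrete what those matrix extensions accelerate: a tile-at-a-time low-precision matmul with wide accumulators. A toy version in plain numpy; the 16x16 tile size and int8/int32 types are illustrative assumptions, not any particular ISA:

    import numpy as np

    def tiled_matmul_i8(A: np.ndarray, B: np.ndarray, T: int = 16) -> np.ndarray:
        """Multiply int8 matrices tile by tile, accumulating in int32 --
        the access pattern that CPU matrix units accelerate in hardware."""
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N), dtype=np.int32)
        for i in range(0, M, T):
            for k in range(0, K, T):
                a = A[i:i+T, k:k+T].astype(np.int32)  # load one tile of A
                for j in range(0, N, T):
                    C[i:i+T, j:j+T] += a @ B[k:k+T, j:j+T].astype(np.int32)
        return C

    A = np.random.randint(-128, 127, (64, 64), dtype=np.int8)
    B = np.random.randint(-128, 127, (64, 64), dtype=np.int8)
    assert (tiled_matmul_i8(A, B) == A.astype(np.int32) @ B.astype(np.int32)).all()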
> ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.
Right but at that point you're describing an H100 plus an additional ASIC plus presumably a CPU and some RAM. Or a variant of an H100 with some specialized ML functions baked in. Both of those just sound like a regular workstation to me.
Inference is certainly cheaper but getting it running quickly requires raw horsepower (thus wattage, thus heat dissipation).
Regarding CPUs, there's a severe memory bandwidth issue. I haven't kept track of the extreme high-end hardware, but it's hard for them to compete with GPUs on raw memory throughput.
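Rough numbers for why: during token generation, every output token has to stream essentially all of the weights through the chip once, so peak memory bandwidth puts a hard ceiling on tokens per second. The bandwidth figures below are ballpark peak specs, the 40 GB model size is a stand-in for a ~70B model at ~4 bits/weight, and real throughput lands below these ceilings:

    def tokens_per_sec_ceiling(model_gb: float, bandwidth_gb_s: float) -> float:
        # Decode is memory-bound: each generated token reads ~all weights once.
        return bandwidth_gb_s / model_gb

    model_gb = 40  # e.g. a ~70B model quantized to ~4 bits/weight
    for name, bw in [("dual-channel DDR5 (~90 GB/s)", 90),
                     ("Apple M-series Max class (~400 GB/s)", 400),
                     ("H100 HBM3 (~3350 GB/s)", 3350)]:
        print(f"{name}: ~{tokens_per_sec_ceiling(model_gb, bw):.1f} tok/s ceiling")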
24 gigabytes is more than enough to run a local LLM for a small household or business.
This is "gaming PC" territory, not "space heater". I mean people already have PS5's and whatnot in their homes.
The hundreds-of-gigabytes requirement exists because the big cloud LLM providers went down the path of ever-increasing parameter counts. That path is a dead end, and we've already hit negative returns.
Prompt engineering + fine-tunes are the future, but you need developer brains for that, not TFLOPs.
It depends on 1) what model you are running; and 2) how many models you are running.
You can just about run a 32B (at Q4/Q5 quantization) on 24GB; the rough math is sketched below. Running anything bigger (such as the increasingly common 70B models, or larger still if you want something like Llama 4 or DeepSeek) means splitting the model between VRAM and system RAM. -- But yes, anything 24B or lower you can run comfortably, with enough capacity left over for the context.
If you have other models running as well -- text-to-speech, speech recognition, etc. -- then those take up VRAM too, both for their weights and during processing/generation. That affects the size of LLM you can fit.
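The rough math behind "just about": the bits-per-weight figures for common llama.cpp-style quants below are my ballpark assumptions, and the KV cache for the context sits on top of whatever this leaves free:

    def weights_gb(params_billion: float, bits_per_weight: float) -> float:
        # Weights only; the KV cache grows with context length on top of this.
        return params_billion * bits_per_weight / 8

    BUDGET_GB = 24
    for name, params, bpw in [("32B @ ~Q4", 32, 4.5),
                              ("32B @ ~Q5", 32, 5.5),
                              ("70B @ ~Q4", 70, 4.5)]:
        gb = weights_gb(params, bpw)
        verdict = "fits (tight)" if gb < BUDGET_GB else "spills into system RAM"
        print(f"{name}: ~{gb:.0f} GB of {BUDGET_GB} GB -> {verdict}")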
If you focus on just the matmuls -- no CUDA, no architectures, no InfiniBand, everything on one chip: put input tokens in input registers, get output tokens from output registers, with the model baked into the gates -- you should be able to save some power. Not sure if it's 2x, 10x, or 100x, but there are certainly gains to be had.
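To make the baked-into-gates intuition concrete: with ternary weights, a matrix-vector product needs no multipliers at all -- each output is just a fixed pattern of adds and subtracts, which in hardwired logic becomes wiring into adder trees. A toy numpy version, with arbitrary shapes chosen for illustration:

    import numpy as np

    def ternary_matvec(w_sign: np.ndarray, x: np.ndarray) -> np.ndarray:
        """w_sign holds only {-1, 0, +1}; the product reduces to adds/subtracts."""
        return (np.where(w_sign == 1, x, 0.0).sum(axis=1)
                - np.where(w_sign == -1, x, 0.0).sum(axis=1))

    w = np.random.choice([-1, 0, 1], size=(8, 16)).astype(np.int8)
    x = np.random.randn(16).astype(np.float32)
    assert np.allclose(ternary_matvec(w, x), w @ x)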