
For those who are already in the field and doing these things: if I wanted to start running my own local LLM, should I find an Nvidia 5080 GPU for my current desktop, or is it worth trying one of these Framework AMD desktops?


The short answer is that the best value is a used RTX 3090 (the long answer being, naturally, it depends). Most of the time, the bottleneck for running LLMs on consumer-grade equipment is memory capacity and memory bandwidth. A 3090 has 24GB of VRAM, while a 5080 only has 16GB. For models that fit inside 16GB of VRAM, the 5080 will certainly be faster than the 3090, but the 3090 can run models that simply won't fit on a 5080. You can offload part of the model onto the CPU and system RAM, but running a model on a desktop CPU is an enormous drag, even when only partially offloaded.
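
As a rough rule of thumb, you can sketch the VRAM a quantized model needs like this (the ~4.5 bits per weight for a Q4_K_M-style quant and the fixed overhead for CUDA context plus KV cache are my ballpark assumptions, not specs):

    # Back-of-the-envelope VRAM estimate for a quantized model.
    # bits_per_weight ~4.5 for Q4_K_M-style quants; overhead covers
    # CUDA context + KV cache. Both numbers are rough assumptions.
    def vram_gib(params_b, bits_per_weight=4.5, overhead_gib=1.5):
        weights_gib = params_b * 1e9 * bits_per_weight / 8 / 2**30
        return weights_gib + overhead_gib

    for size in (13, 24, 32, 70):
        print(f"{size}B: ~{vram_gib(size):.1f} GiB")

By that math a 32B quant squeaks onto a 3090's 24GB but not a 5080's 16GB, and a 70B won't fit on either without multiple cards or unified memory.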

Obviously an RTX 5090 with 32GB of VRAM is even better, but they cost around $2000, if you can find one.

What's interesting about this Strix Halo system is that it has 128GB of RAM that is accessible (or mostly accessible) to the CPU/GPU/APU. This means that you can run much larger models on this system than you possibly could on a 3090, or even a 5090. The performance tests tend to show that the Strix Halo's memory bandwidth is a significant bottleneck though. This system might be the most affordable way of running 100GB+ models, but it won't be fast.
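
To see why bandwidth matters so much: decoding each token streams essentially all of the active weights through memory once, so there's a hard ceiling of bandwidth divided by model size. Using approximate public bandwidth figures (my assumptions):

    # Decode is roughly memory-bound: each generated token reads
    # ~all active weights once, so t/s <= bandwidth / model size.
    # Bandwidth numbers are approximate public specs.
    def ceiling_tps(model_gb, bw_gbps):
        return bw_gbps / model_gb

    for name, bw in [("RTX 3090", 936), ("RTX 5090", 1792),
                     ("Strix Halo", 256)]:
        print(f"{name}: ~{ceiling_tps(24, bw):.0f} t/s ceiling on a 24 GB model")

So even though the Strix Halo can hold far bigger models, its ~256 GB/s of LPDDR5X caps how fast it can push tokens out.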


Used 3090s have been getting expensive in some markets. Another option is dual 5060 Ti 16GB cards. Mine are the lower-power, single 8-pin variant, so they max out around 180W. With that I'm getting 80 t/s on the new Qwen 3 30B A3B models, and around 21 t/s on Gemma 27B with vision. Cheap and cheerful setup if you can find the cards at MSRP.


For comparison, at work we got a pair of Nvidia L4 GPUs: https://www.techpowerup.com/gpu-specs/l4.c4091

That gives us a total TDP of around 150W and 48 GB of VRAM, and we can run Qwen 3 Coder 30B A3B at 4-bit quantization with up to 32k context at around 60-70 t/s with Ollama. I also tried out vLLM, but the performance surprisingly wasn't much better (maybe it would be under bigger concurrent load). Felt like sharing the data point because of the similarity.
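
If anyone wants to reproduce that kind of number: Ollama's local HTTP API reports token counts and timings, so measuring decode speed is a few lines. A minimal sketch, assuming the default port; the model tag below is illustrative, substitute whatever you've actually pulled:

    # Measure decode speed against a local Ollama server.
    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen3-coder:30b",  # hypothetical tag
            "prompt": "Write a binary search in Python.",
            "stream": False,
        },
    ).json()

    # eval_count tokens were generated in eval_duration nanoseconds
    print(f"{r['eval_count'] / (r['eval_duration'] / 1e9):.1f} t/s")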

Honestly it's a really good model, even good enough for some basic agentic use (e.g. with Aider, RooCode and so on); MoE seems the way to go for somewhat limited hardware setups.

Obviously I'm not recommending L4 cards, since they have a pretty steep price tag. Most consumer cards feel a bit power hungry, and you'll probably need more than one to fit decent models, though also being able to game on the same hardware sounds pretty nice. But speaking of getting more VRAM, the Intel Arc Pro B60 can't come soon enough (if they don't insanely overprice it), especially the 48 GB variety: https://www.maxsun.com/products/intel-arc-pro-b60-dual-48g-t...


Yeah, 48 GB at sub-200W seems like a sweet spot for a single-card setup. Then you can stack as deep as you want to get to the model size you want, for whatever you want to pay on the power bill.


I've hatched a plan to build a lightweight AI model on a $149 mini PC and host it from my bedroom.

I wonder if I could follow that up by buying a 3090 (jumping the price by $1,000 plus whatever I plug it into) and contrasting the difference. Could be an eye-opening experiment for me.

Here's the write-up of my plan for the cheap machine, if anyone is interested.

https://joeldare.com/my_plan_to_build_an_ai_chat_bot_in_my_b...


Just a point of clarification. I believe the 128GB Strix Halo can only allocate up to 96GB of RAM to the GPU.


108 GB or so under Linux.

The BIOS allows pre-allocating 96 GB max, and I'm not sure if that's the maximum for Windows, but under Linux you can use `amdttm.pages_limit` and `amdttm.page_pool_size` [1].

[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...


I have been doing a couple of tests with PyTorch allocations; it let me go as high as 120 GB [1] (assuming the individual allocations were small enough) without crashing. The main limitation was mostly the remaining system memory:

    htpc@htpc:~% free -h
                   total        used        free      shared  buff/cache   available
    Mem:           125Gi       123Gi       920Mi        66Mi       1.6Gi       1.4Gi
    Swap:           19Gi       4.0Ki        19Gi
[1] https://bpa.st/LZZQ


Thanks for the correction. I was under the impression the GPU memory had to be preallocated in the BIOS, and 96 GB was the maximum number I read about.


Some older software stacks require static allocation in the BIOS, but things are moving pretty quickly toward allowing dynamic allocation: newer versions of (or patches to) PyTorch, Ollama, and related tools, which I think might depend on a newer kernel (6.13 or so). There does seem to have been quite a bit of progress in the last month.


In Linux, you can allocate as much as you want with `ttm`:

In 4K pages for example:

    options ttm pages_limit=31457280
    options ttm page_pool_size=15728640
This will allow up to 120GB to be allocated and will pre-allocate 60GB (you could pre-allocate none or all of it, depending on your needs and fragmentation). I believe `amdgpu.vm_fragment_size=9` (2 MiB fragments) is optimal.
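
For anyone wondering where those numbers come from: the module options count 4 KiB pages, so they're just the target sizes divided by the page size. A quick sanity check:

    # ttm options count 4 KiB pages
    PAGE = 4096
    print(120 * 2**30 // PAGE)  # 31457280 -> pages_limit (120 GB cap)
    print(60 * 2**30 // PAGE)   # 15728640 -> page_pool_size (60 GB pool)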


If you think the future is small models (27B), get Nvidia; if you think larger models (70-120B) are worth it, then you need AMD or Apple.


I wonder how much MoE will disrupt this. qwen3:30b-a3b is pretty good even on pure CPU, yet a lot smarter than a 3B-parameter model. If the CPU-GPU bottleneck isn't too tight, a large model might be able to sustainably cache the currently active experts in GPU RAM.
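
The back-of-the-envelope for why that works: only ~3B of the 30B parameters are active per token, so the per-token memory traffic is roughly a tenth of a dense 30B. A sketch, assuming ~4-bit weights and ~80 GB/s of dual-channel DDR5 (both assumptions):

    # Per-token traffic is the *active* parameters, not the full model.
    bw_gbps = 80
    gb_per_param = 0.5 / 1e9  # ~4-bit quant = 0.5 bytes/param

    for name, active in [("dense 30B", 30e9), ("30B-A3B MoE", 3e9)]:
        print(f"{name}: ceiling ~{bw_gbps / (active * gb_per_param):.0f} t/s")

That's ~5 t/s versus ~53 t/s on the same memory bus, which lines up with MoE feeling usable on plain CPU.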


The recent Qwen 3 models run fine on CPU + GPU, and so does gpt-oss. LM Studio and Ollama are turnkey solutions where the user needs to know nothing about memory management. But finding benchmarks for these hybrid setups is astonishingly difficult.

I keep thinking that the bottleneck has to be CPU RAM, and that for a large model the difference would be minor. For example, with a 100 GB model such as quantized gpt-oss-120B, I imagine that going from 10 GB to 24 GB of VRAM would scale up my t/s like 1/90 -> 1/76, so about a 20% advantage? But I can't find much on the high-level scaling math. People seem to either create calculators that oversimplify, or go too deep into the weeds.
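
To make that estimate explicit: if decode time per token is dominated by the bytes streamed from system RAM, then t/s scales like 1/(model size minus VRAM). A toy sketch under exactly that assumption (it ignores GPU time, shared layers, and KV cache entirely):

    # t/s ~ 1 / (bytes streamed from system RAM per token)
    model_gb = 100

    def rel_tps(vram_gb):
        return 1 / (model_gb - vram_gb)

    speedup = rel_tps(24) / rel_tps(10)  # = 90/76
    print(f"10 GB -> 24 GB VRAM: {speedup:.2f}x faster")  # ~1.18x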

I'd like a new AnandTech, please.


Doesn't matter; people will always find ways to eat RAM, despite finding more clever ways to do things.


MoE eats the same amount of RAM but accesses less of it.


More accurately, the same amount of RAM, but accessed in a cache-friendly manner with greater locality.



