Anyone who uses a CPU for inference is severely compute constrained. Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.
Just as a point of reference, this is what a 65W power-limited 7940HS (Radeon 790M) with 64GB of DDR5-5600 looks like w/ a 7B Q4_K_M model atm w/ llama.cpp. While it's not amazing, at 240 t/s prefill, it means that at 4K context, you'll wait about 17 seconds before token generation starts, which isn't awful. The 890M should have about 20% better compute, so about 300 t/s prefill, and with LPDDR5-7500/8000, you should get to about 20 t/s for token generation.
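If you want to sanity-check those figures, here's the back-of-the-envelope math - prefill is compute bound, so time-to-first-token is just context length divided by prompt-processing speed (note the 20% compute bump for the 890M is an estimate from above, not a benchmark):

```python
PREFILL_790M = 240                 # measured prompt-processing t/s on the 7940HS
PREFILL_890M = PREFILL_790M * 1.2  # ~290 t/s, assuming ~20% more compute (estimate)

for name, tps in [("790M", PREFILL_790M), ("890M (est.)", PREFILL_890M)]:
    print(f"{name}: {4096 / tps:.1f} s to first token at 4K context")
# 790M: 17.1 s, 890M (est.): 14.2 s
```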
> you'll wait about 17 seconds before token generation starts, which isn't awful
Let’s be honest, it might not be awful but it’s a nonstarter for encouraging local LLM adoption, and most will prefer to pay pennies for API access instead (friction aside).
I don't know why anyone would think a meh-performing iGPU would encourage local LLM adoption at all? A 7B local model is already not going to match frontier models for many use cases - if you don't care about using a local model (don't have privacy or network concerns), then I'd argue you probably should use an API. If you care about using a capable local LLM comfortably, then you should get as powerful a dGPU as your power/dollar budget allows. Your best bang/buck atm will probably be Nvidia consumer Ada GPUs (or used Ampere models).
However, for anyone looking to use a local model on a chip with the Radeon 890M:
- look into implementing (or waiting for) NPU support - XDNA2's 50 TOPS should provide more raw compute than the 890M for tensor math (w/ Block FP16)
- use a smaller, more appropriate model for your use case (3B or smaller models can fulfill most simple requests) - smaller models will of course be faster too
- don't use long conversations - a new conversation starts with 0 context, so there's no prefill and no waiting
- use `cache_prompt` - for bs=1 interactive use you can keep your inputs/generations in the prompt cache so each new turn only processes the new tokens (see the sketch after this list)
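On the last point, here's a minimal sketch against a local llama-server instance. The field names follow the llama.cpp HTTP server's `/completion` endpoint, but double-check your build's docs since the server API has shifted between versions; the port and prompt are placeholders:

```python
import requests

LLAMA_SERVER = "http://127.0.0.1:8080"  # default llama-server address (adjust to your setup)

def complete(prompt: str, n_predict: int = 256) -> str:
    """Single-turn completion with prompt caching enabled."""
    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={
            "prompt": prompt,
            "n_predict": n_predict,
            "cache_prompt": True,  # keep the KV cache so the shared prefix isn't re-prefilled
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["content"]

# On the second call with the same growing history, only the newly
# appended tokens get prefilled; the cached prefix is reused.
history = "You are a helpful assistant.\nUser: Summarize the notes below...\nAssistant:"
print(complete(history))
```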
The problem is memory bandwidth. There is a reason Apple MacBooks do relatively well with LLMs: it’s not that the GPU is any better than Zen 5's, but 4-6x the memory bandwidth is huge (80ish GB/s vs 400 GB/s).
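Roughly speaking, every generated token has to stream the whole quantized weight set from RAM, so bandwidth / model size gives you a hard ceiling on t/s. A quick sketch with the figures above (the ~4.1 GB size for a 7B Q4_K_M is an approximation):

```python
MODEL_GB = 4.1  # approximate size of 7B Q4_K_M weights

for name, bw_gbs in [("dual-channel DDR5 (~80 GB/s)", 80),
                     ("Apple M-series Max (~400 GB/s)", 400)]:
    print(f"{name}: ceiling ~{bw_gbs / MODEL_GB:.0f} t/s")
# ~20 t/s vs ~98 t/s -- theoretical ceilings; real numbers land lower
```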