
Anyone who uses a CPU for inference is severely compute constrained. Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.


Just as a point of reference, this is what a 65W power-limited 7940HS (Radeon 780M) with 64GB of DDR5-5600 looks like atm with a 7B Q4_K_M model in llama.cpp. While it's not amazing, at ~240 t/s prefill, a 4K context means you'll wait about 17 seconds before token generation starts, which isn't awful. The 890M should have about 20% more compute, so roughly 300 t/s prefill, and with LPDDR5X-7500/8000 you should get to about 20 t/s generation.

  ./llama-bench -m /data/ai/models/llm/gguf/mistral-7b-instruct-v0.1.Q4_K_M.gguf
  ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
  ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
  | model                          |       size |     params | backend    | ngl |          test |              t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
  | llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | ROCm       |  99 |         pp512 |    242.69 ± 0.99 |
  | llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | ROCm       |  99 |         tg128 |     15.33 ± 0.03 |

  build: e11bd856 (3620)
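
For a rough sanity check, here's the back-of-the-envelope math behind that 17-second figure, using the pp512/tg128 numbers from the table above (plain Python; the 256-token response length is an assumption, so treat it as an estimate, not a measurement):

  # Rough latency estimate from the llama-bench numbers above.
  prefill_tps = 242.69       # pp512 (prompt processing) tokens/s
  gen_tps     = 15.33        # tg128 (token generation) tokens/s

  prompt_tokens = 4096       # the "4K context" example
  output_tokens = 256        # assumed response length

  ttft  = prompt_tokens / prefill_tps        # time before the first token appears
  total = ttft + output_tokens / gen_tps     # time until the response finishes

  print(f"time to first token: {ttft:.1f} s")    # ~16.9 s
  print(f"total response time: {total:.1f} s")   # ~33.6 s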


> you'll wait about 17 seconds before token generation starts, which isn't awful

Let’s be honest: it might not be awful, but it’s a nonstarter for encouraging local LLM adoption, and most will prefer to pay pennies for API access instead (friction aside).


I don't know why anyone would think a meh-performing iGPU would encourage local LLM adoption at all. A 7B local model is already not going to match frontier models for many use cases - if you don't care about using a local model (no privacy or network concerns), then I'd argue you probably should use an API. If you care about using a capable local LLM comfortably, then you should get as powerful a dGPU as your power/dollar budget allows. Your best bang/buck atm will probably be Nvidia consumer Ada GPUs (or used Ampere models).

However, for anyone looking to use a local model on a chip with the Radeon 890M:

- look into implementing (or waiting for) NPU support - XDNA2's 50 TOPS should provide more raw compute than the 890M for tensor math (w/ Block FP16)

- use a smaller, more appropriate model for your use case (3B or smaller can fulfill most simple requests), which will of course also be faster

- keep conversations short - a fresh conversation starts with zero context, so there's nothing to prefill and no waiting

- use `cache_prompt` - for bs=1 interactive use, you can keep processed input/generations in the KV cache so shared context isn't re-prefilled on every request (see the sketch below)
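
On that last point, a minimal sketch of what `cache_prompt` looks like against a locally running llama-server (assumes the default port 8080 and the /completion endpoint; an illustration rather than a drop-in client):

  # Minimal sketch: ask llama-server to keep the prompt in its KV cache so a
  # shared prefix isn't re-prefilled on every turn. Assumes llama-server is
  # already running on localhost:8080 with a model loaded.
  import requests

  def complete(prompt: str) -> str:
      r = requests.post(
          "http://localhost:8080/completion",
          json={
              "prompt": prompt,
              "n_predict": 128,
              "cache_prompt": True,   # reuse cached KV for the matching prefix
          },
      )
      r.raise_for_status()
      return r.json()["content"]

  history = "User: Summarize what `cache_prompt` does.\nAssistant:"
  print(complete(history))
  # A follow-up request that reuses `history` as its prefix skips most of the prefill cost.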


For a lot of use cases it is actually awful.


The problem is memory bandwidth. There is a reason Apple MacBooks do relatively well with LLMs: it's not that the GPU is any better than what ships with Zen 5 APUs, but 4-6x the memory bandwidth is huge (~80 GB/s vs ~400 GB/s).
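
A quick way to see why bandwidth dominates: at batch size 1, every generated token has to stream the entire weight file from memory, so bandwidth sets a hard ceiling on tokens/s. A sketch using the 4.07 GiB Q4_K_M model from the benchmark above and round bandwidth figures (real throughput lands somewhat below the ceiling):

  # Upper bound on bs=1 generation speed: one full pass over the weights per token.
  model_bytes = 4.07 * 1024**3   # Q4_K_M 7B from the benchmark above

  for name, bw_gb_s in [("~80 GB/s (dual-channel DDR5-5600)", 80),
                        ("~400 GB/s (Apple M-series Max class)", 400)]:
      ceiling_tps = bw_gb_s * 1e9 / model_bytes
      print(f"{name}: <= {ceiling_tps:.0f} t/s")
  # ~18 t/s vs ~92 t/s -- consistent with the ~15 t/s tg128 measured above.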


>Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.

I don't think so. Humans scan for keywords very often; nobody really reads every word. Faster-than-reading-speed inference is definitely beneficial.


And thank you for making me conscious of my reading while reading your comment. May you become aware of your breathing.



