
Anyone who uses a CPU for inference is severely compute constrained. Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.


Just as a point of reference, this is what a 65W power-limited 7940HS (Radeon 780M) with 64GB of DDR5-5600 looks like atm with a 7B Q4_K_M model in llama.cpp. While it's not amazing, at ~240 t/s prefill, a 4K context means you'll wait about 17 seconds before token generation starts, which isn't awful. The 890M should have about 20% more compute, so roughly 300 t/s prefill, and with LPDDR5X-7500/8000 you should get to about 20 t/s generation.

  ./llama-bench -m /data/ai/models/llm/gguf/mistral-7b-instruct-v0.1.Q4_K_M.gguf
  ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
  ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
  | model                          |       size |     params | backend    | ngl |          test |              t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
  | llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | ROCm       |  99 |         pp512 |    242.69 ± 0.99 |
  | llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | ROCm       |  99 |         tg128 |     15.33 ± 0.03 |

  build: e11bd856 (3620)
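
For a rough sanity check, here's the back-of-the-envelope math behind that 17-second figure, using the pp512/tg128 numbers from the table above (plain Python; the 256-token response length is an assumption, so treat it as an estimate, not a measurement):

  # Rough latency estimate from the llama-bench numbers above.
  prefill_tps = 242.69       # pp512 (prompt processing) tokens/s
  gen_tps     = 15.33        # tg128 (token generation) tokens/s

  prompt_tokens = 4096       # the "4K context" example
  output_tokens = 256        # assumed response length

  ttft  = prompt_tokens / prefill_tps        # time before the first token appears
  total = ttft + output_tokens / gen_tps     # time until the response finishes

  print(f"time to first token: {ttft:.1f} s")    # ~16.9 s
  print(f"total response time: {total:.1f} s")   # ~33.6 s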


> you'll wait about 17 seconds before token generation starts, which isn't awful

Let’s be honest: it might not be awful, but it’s a nonstarter for encouraging local LLM adoption, and most will prefer to pay pennies for API access instead (friction aside).


I don't know why anyone would think a meh-performing iGPU would encourage local LLM adoption at all. A 7B local model is already not going to match frontier models for many use cases - if you don't care about using a local model (no privacy or network concerns), then I'd argue you probably should use an API. If you care about using a capable local LLM comfortably, then you should get as powerful a dGPU as your power/dollar budget allows. Your best bang/buck atm will probably be Nvidia consumer Ada GPUs (or used Ampere models).

However, for anyone looking to use a local model on a chip with the Radeon 890M:

- look into implementing (or waiting for) NPU support - XDNA2's 50 TOPS should provide more raw compute than the 890M for tensor math (w/ Block FP16)

- use a smaller, more appropriate model for your use case (3B or smaller can fulfill most simple requests), which will of course also be faster

- keep conversations short - a fresh conversation starts with zero context, so there's nothing to prefill and no waiting

- use `cache_prompt` - for bs=1 interactive use, you can keep processed input/generations in the KV cache so shared context isn't re-prefilled on every request (see the sketch below)
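
On that last point, a minimal sketch of what `cache_prompt` looks like against a locally running llama-server (assumes the default port 8080 and the /completion endpoint; an illustration rather than a drop-in client):

  # Minimal sketch: ask llama-server to keep the prompt in its KV cache so a
  # shared prefix isn't re-prefilled on every turn. Assumes llama-server is
  # already running on localhost:8080 with a model loaded.
  import requests

  def complete(prompt: str) -> str:
      r = requests.post(
          "http://localhost:8080/completion",
          json={
              "prompt": prompt,
              "n_predict": 128,
              "cache_prompt": True,   # reuse cached KV for the matching prefix
          },
      )
      r.raise_for_status()
      return r.json()["content"]

  history = "User: Summarize what `cache_prompt` does.\nAssistant:"
  print(complete(history))
  # A follow-up request that reuses `history` as its prefix skips most of the prefill cost.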


For a lot of use cases it is actually awful.


The problem is memory bandwidth. There is a reason Apple MacBooks do relatively well with LLMs: it's not that the GPU is any better than what ships with Zen 5 APUs, but 4-6x the memory bandwidth is huge (~80 GB/s vs ~400 GB/s).
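
A quick way to see why bandwidth dominates: at batch size 1, every generated token has to stream the entire weight file from memory, so bandwidth sets a hard ceiling on tokens/s. A sketch using the 4.07 GiB Q4_K_M model from the benchmark above and round bandwidth figures (real throughput lands somewhat below the ceiling):

  # Upper bound on bs=1 generation speed: one full pass over the weights per token.
  model_bytes = 4.07 * 1024**3   # Q4_K_M 7B from the benchmark above

  for name, bw_gb_s in [("~80 GB/s (dual-channel DDR5-5600)", 80),
                        ("~400 GB/s (Apple M-series Max class)", 400)]:
      ceiling_tps = bw_gb_s * 1e9 / model_bytes
      print(f"{name}: <= {ceiling_tps:.0f} t/s")
  # ~18 t/s vs ~92 t/s -- consistent with the ~15 t/s tg128 measured above.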


>Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.

I don't think so. Humans scan for keywords very often; nobody really reads every word. Faster-than-reading-speed inference is definitely beneficial.


And thank you for making me conscious of my reading while reading your comment. May you become aware of your breathing.



