That's fair. On the other hand, there's exactly one CPU with FP16 AVX512 anyway, and 64-core parts aren't exactly commonplace either. And even with all those advantages, using a datacenter CPU, you're still a factor of 10 off from a GPU that isn't even consumer top-end. With a normal processor, say 16 cores and 16-wide FP32 vectors, even with fused multiply-adds and two vector ops dispatched per cycle, you're still only at ~2 TFLOPS, roughly 50x behind. In consumer spaces, I'm more optimistic about dedicated coprocessors. Maybe even an iTPU?
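For what it's worth, here's a rough back-of-the-envelope version of that estimate. The core count, vector width, sustained clock, and GPU figure are all assumptions for illustration, not measurements:

```python
# Rough peak-FLOPS estimate for a hypothetical 16-core desktop CPU.
# All numbers below are assumptions, not measured values.
cores = 16            # assumed core count
fp32_lanes = 16       # 512-bit vectors / 32-bit floats
fma_factor = 2        # a fused multiply-add counts as 2 FLOPs
issue_per_cycle = 2   # assume two vector FMA ops dispatched per cycle
clock_ghz = 2.0       # assumed sustained all-core clock under heavy vector load

peak_tflops = cores * fp32_lanes * fma_factor * issue_per_cycle * clock_ghz / 1000
print(f"~{peak_tflops:.1f} TFLOPS FP32 peak")      # ~2.0 TFLOPS

gpu_tflops = 100      # ballpark for a non-flagship GPU with FP16/tensor cores (assumed)
print(f"~{gpu_tflops / peak_tflops:.0f}x gap")     # ~50x
```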
Zen 6 is supposed to add FP16 AVX512 support if AMD’s leaked slides mean what I think they mean. Here is a link to a screenshot of the leaked slides MLID published:
A GPU like my 3090 Ti will likely still outperform it by two orders of magnitude, but I have managed to push the needle slightly on state-of-the-art prompt processing performance on my CPU. I suspect an additional 15% improvement is possible, but I do not expect to be able to realize it. In any case, it is an active R&D project that I am doing to learn how these things work.
Finally, to answer your question: I have no good answers for you (or more specifically, no answers that I like). I have been trying for months to think of ways to do fast local inference on high-end models cost-effectively. So far, I have nothing to show for it aside from my R&D into CPU Llama 3 inference, since none of my ideas bring the hardware cost needed for Llama 3.1 405B below $10,000 at an acceptable performance level. My idea of acceptable performance is 10 tokens per second for token generation and 4,000 tokens per second for prompt processing, although perhaps lower prompt processing performance is acceptable with prompt caching.
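To give a sense of why that generation target is hard to hit cheaply, here is a rough sketch of the memory-bandwidth side of the problem. The quantization level is my assumption, and this ignores KV cache and any overlap tricks:

```python
# Back-of-the-envelope check on the 10 tok/s generation target for a 405B model.
# The 4-bit quantization level is an assumption for illustration.
params = 405e9            # Llama 3.1 405B parameter count
bits_per_weight = 4       # assumed 4-bit quantization
weights_gb = params * bits_per_weight / 8 / 1e9     # ~202 GB of weights

target_tps = 10           # desired tokens per second for generation
# Dense decoding streams essentially all weights once per generated token,
# so required bandwidth is roughly weight size times tokens per second.
required_tbps = weights_gb * target_tps / 1000
print(f"~{weights_gb:.0f} GB of weights, ~{required_tbps:.1f} TB/s for {target_tps} tok/s")
```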