Yeah, sorry, I realized that as well, so I edited my post to add a higher-end example with multiple 3090s or similar cards. A single 3090 has just under 1 TB/s of memory bandwidth (936 GB/s).
One more edit: I'd also like to point out that memory bandwidth is necessary, but not sufficient, for fast inference. My entire point here is that Apple silicon does have high memory bandwidth, but for inference it's very much held back by the relative slowness of its GPU compared with dedicated Nvidia/AMD cards.
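To make that concrete, here's a rough back-of-envelope sketch. The specific figures (weight size for a 4-bit 70B model, M2 Ultra bandwidth and compute) are my own ballpark assumptions, not measurements: token generation is roughly bounded by how fast the weights stream out of memory, while prompt evaluation is roughly bounded by raw GPU compute, which is where Apple silicon lags.

    # Back-of-envelope only; all constants are assumed ballpark figures.
    weights_gb = 40.0        # assumed: ~70B model at 4-bit quantization
    bandwidth_gbs = 800.0    # assumed: M2 Ultra unified memory, ~800 GB/s
    params = 70e9
    gpu_tflops = 27.0        # assumed: M2 Ultra GPU compute, rough ballpark

    # Decode: every generated token re-reads the full weights once,
    # so bandwidth sets the ceiling.
    decode_tok_s = bandwidth_gbs / weights_gb            # ~20 tok/s ceiling

    # Prefill: roughly 2 * params FLOPs per prompt token,
    # so GPU compute sets the ceiling.
    prefill_tok_s = (gpu_tflops * 1e12) / (2 * params)   # ~190 tok/s ceiling

    print(f"decode ceiling  ~{decode_tok_s:.0f} tok/s")
    print(f"prefill ceiling ~{prefill_tok_s:.0f} tok/s")

The point of the sketch is just the shape of the bound: the decode ceiling scales with bandwidth, the prefill ceiling scales with compute, and a Mac is far closer to a 3090 on the former than the latter.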
It's still "fast enough" for even 120b models in practice, and you don't need to muck around with building a multi-GPU rig (and figuring out how to e.g. cool it properly).
It's definitely not what you'd want for your data center, but for home tinkering it has a very clear niche.
> It's still "fast enough" for even 120b models in practice
Is it? This is very subjective. The Mac Studio would not be "fast enough" for me on even a 70b model, not so much because its token generation is slow, but because its prompt evaluation speed is quite bad. See [0] for example numbers: on Llama 3 70B at Q4_K_M quantization, it takes an M2 Ultra with 192GB about 8.5 seconds just to evaluate a 1024-token prompt. A machine with 6 3090s (which would likely come in cheaper than the Mac Studio) is over 6 times faster at prompt evaluation.
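To put that in perspective, a quick illustrative calculation using only the figures quoted above (the longer context lengths are hypothetical, just to show how time-to-first-token scales):

    # 1024 tokens in ~8.5 s on the M2 Ultra, per [0]
    prompt_eval_tok_s = 1024 / 8.5                # ~120 tok/s
    for ctx in (1024, 4096, 8192):
        mac = ctx / prompt_eval_tok_s
        rig = mac / 6                             # the 6x3090 box from [0]
        print(f"{ctx:>5}-token prompt: ~{mac:4.0f}s on the Mac vs ~{rig:4.0f}s on 6x3090")

At longer contexts that's the difference between waiting a minute-plus before the first output token and waiting about ten seconds.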
A 120b model is likely going to be something like 1.5-2x slower at prompt evaluation, rendering it pretty much unusable (again, for me).