
This is a humble and informed article (compared to others written by financial analysts over the past few days). But it still has the flaw of over-estimating the efficiency of deploying a 671B MoE model on commodity hardware (for local use; cloud providers do efficient batching, which is a different story): you cannot do that on any single piece of Apple hardware (you need to hook up at least two M2 Ultras). You can barely deploy it on a desktop, only because non-registered DDR5 tops out at 64GiB per stick (so with 8 sticks you are safe at 512GiB of RAM). Now to PCIe bandwidth: 37B activated parameters per token means exactly that: each token activates a new set of 37B weights, so you need to transfer ~18.5GB per token into VRAM (assuming a 4-bit quant). PCIe 5.0 x16 (as on a 5090) gives 64GB/s, so your upper bound is ~3.5 tok/s even with a well-balanced, purpose-built PC (and custom software). For programming tasks, which usually take ~3000 thinking tokens, we are looking at ~14 minutes per interaction.
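To make the arithmetic explicit, here is a minimal sketch of that bound. The 37B activation and 64GB/s figures are from above; the 4-bit quant and the worst-case assumption that all activated weights are streamed over PCIe for every token are mine:

    # Back-of-the-envelope: PCIe-bandwidth-bound decode speed when the
    # activated expert weights are streamed into VRAM for every token.
    activated_params = 37e9     # params activated per token (DeepSeek-V3)
    bits_per_weight = 4         # assumed 4-bit quantization
    pcie_bw = 64e9              # PCIe 5.0 x16, ~64 GB/s one direction

    # Worst case: treat all 37B activated weights as fresh traffic per token.
    bytes_per_token = activated_params * bits_per_weight / 8   # ~18.5 GB
    tok_per_s = pcie_bw / bytes_per_token                      # ~3.5 tok/s

    thinking_tokens = 3000
    print(f"{bytes_per_token / 1e9:.1f} GB/token, "
          f"{tok_per_s:.2f} tok/s upper bound, "
          f"{thinking_tokens / tok_per_s / 60:.1f} min per interaction")

This prints ~18.5 GB/token, ~3.5 tok/s, ~14.5 minutes for 3000 tokens. In practice attention weights and any shared experts stay resident, so the real number is somewhat better, but the order of magnitude holds.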


Is it really 37B different parameters for each token? Even with the "multi-token prediction system" that the article mentions?


I don't think anyone uses MTP for inference right now. Even if you use MTP for drafting, you need batching in the next round to "verify" that each drafted token is the right one, and when that happens you need to activate more experts (sketched below).

DELETED: If you don't use MTP for drafting, and instead use MTP to skip generation steps, sure. But you also need to evaluate your use case to make sure you don't get penalized for doing that. Their evaluation in the paper doesn't use MTP for generation.

EDIT: Actually, you cannot use MTP for anything other than drafting, because you need to fill in the KV caches. So, during generation, you cannot save compute with MTP (you save memory bandwidth, but this is more complicated for MoE models due to more activated experts).
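To put a rough number on "more activated experts": a toy Monte Carlo sketch, assuming DeepSeek-V3-style routing (256 routed experts per layer, top-8 per token) and, unrealistically, independent uniform routing per token. Real routing is correlated across neighboring tokens, so the growth would be slower, but the direction is the same:

    # Toy model: distinct routed experts one MoE layer must load when
    # verifying k drafted tokens in a single batched forward pass.
    # Assumes DeepSeek-V3-style routing (256 routed experts, top-8 per
    # token) with independent uniform routing -- a simplification.
    import random

    N_EXPERTS, TOP_K, TRIALS = 256, 8, 10_000

    def expected_unique_experts(k_tokens):
        total = 0
        for _ in range(TRIALS):
            active = set()
            for _ in range(k_tokens):
                active.update(random.sample(range(N_EXPERTS), TOP_K))
            total += len(active)
        return total / TRIALS

    for k in (1, 2, 4, 8):
        u = expected_unique_experts(k)
        print(f"verify {k} tokens: ~{u:.1f} unique experts "
              f"(~{u / TOP_K:.2f}x one token's expert weights)")

In this toy model the expert working set grows almost linearly with the number of drafted tokens, so most of the bandwidth you hoped to save by drafting gets spent loading extra experts during verification.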



