
Can you elaborate on the last statement? I don't quite understand why loading a local LLM into GPU RAM, using it for the job, and then "ejecting" it is a "dumb and wasteful" idea.


Layman understanding:

Because, measured against hardware and electricity costs, a "cloud" GPU will be many times more efficient per output token. You aren't loading and offloading models, and no part of the GPU sits idle waiting for input. Everything is fully saturated, always.
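A back-of-envelope sketch of that amortization, with assumed, illustrative numbers (model size, SSD bandwidth, decode speed, and batch size are all made up, not benchmarks):

    # Compare GPU-seconds per job when a local setup reloads the model
    # every time vs. a cloud GPU that keeps it resident and batches
    # many requests together.

    MODEL_SIZE_GB = 13      # assumed quantized model size
    NVME_READ_GBPS = 5      # assumed SSD -> VRAM transfer rate
    TOKENS_PER_JOB = 500
    LOCAL_TOK_PER_S = 30    # assumed single-request decode speed
    CLOUD_TOK_PER_S = 30    # same per-request speed...
    CLOUD_BATCH = 32        # ...but shared across a full batch

    load_s = MODEL_SIZE_GB / NVME_READ_GBPS            # ~2.6 s just to load
    local_job_s = load_s + TOKENS_PER_JOB / LOCAL_TOK_PER_S
    cloud_job_s = TOKENS_PER_JOB / CLOUD_TOK_PER_S     # no reload, stays hot

    local_throughput = TOKENS_PER_JOB / local_job_s
    cloud_throughput = CLOUD_BATCH * TOKENS_PER_JOB / cloud_job_s

    print(f"local:  ~{local_throughput:.0f} tokens per GPU-second")
    print(f"cloud:  ~{cloud_throughput:.0f} tokens per GPU-second (batched)")

Under these toy numbers the reload cost is minor; the big gap comes from batching, i.e. the cloud GPU never waits for a single user's input.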


I believe GP means it's still tied to "if this kind of latency is unacceptable to you", i.e. you can't load/use/unload; you have to keep the model in RAM all the time.

In that case it massively increases your memory requirement: not just to the peak the model needs, but to that plus whatever the biggest other use is that will inherently run concurrently with it.
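The arithmetic, with hypothetical sizes (16 GB model, 10 GB for the largest concurrent workload):

    model_gb = 16
    other_peak_gb = 10

    # load/use/unload: only the larger of the two peaks matters at once
    on_demand_peak = max(model_gb, other_peak_gb)    # 16 GB

    # always resident: the model sits on top of everything else
    resident_peak = model_gb + other_peak_gb         # 26 GB

    print(f"load/use/unload peak:  {on_demand_peak} GB")
    print(f"always-resident peak:  {resident_peak} GB")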




