It is not clear that a simple/small model with inference running on home hardware is energy- or cost-efficient compared to the scaled-up, batched inference of a large model. There are dozens of optimizations available when you split an LLM into many small components across separate accelerator units and manage the KV cache at data-center scale; these are simply not possible at home and would be a waste of effort and energy until you are serving thousands to millions of requests in parallel.
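
A rough back-of-envelope sketch of the batching argument (every number below is an illustrative assumption, not a measurement; the point is only that a shared accelerator amortizes its fixed power draw over many concurrent requests):

    # Back-of-envelope energy-per-token comparison.
    # All figures are assumed, illustrative values.

    def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
        """Energy drawn per generated token, in joules."""
        return power_watts / tokens_per_second

    # Home setup: one consumer GPU, small model, batch size 1.
    home_power_w = 350.0        # assumed whole-system draw
    home_tok_s = 40.0           # assumed decode speed for a single request

    # Data-center setup: one accelerator serving many requests at once, so
    # weight loads, KV-cache traffic, and idle overhead are amortized
    # across the whole batch.
    dc_power_w = 700.0          # assumed accelerator draw
    dc_batch = 64               # assumed concurrent requests per accelerator
    dc_tok_s_per_req = 30.0     # assumed per-request decode speed under load
    dc_tok_s_total = dc_batch * dc_tok_s_per_req

    print(f"home:        {joules_per_token(home_power_w, home_tok_s):.2f} J/token")
    print(f"data center: {joules_per_token(dc_power_w, dc_tok_s_total):.2f} J/token")

With these made-up numbers the home setup lands around 8.75 J/token while the shared accelerator lands around 0.36 J/token, purely because the same power draw is divided over a much larger aggregate token rate.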
