
> For anyone who hasn't tried local models because they think it's too complicated or their computer can't handle it

I have now learned that my laptop is capable of a whopping 0.37 tokens per second.

11th Gen Intel® Core™ i7-1185G7 @ 3.00GHz × 8



Probably need to try a smaller model :P

When the article says that researchers are running models on their laptops, those researchers are either using very small models on a gaming laptop or they have a fairly modern MacBook with a lot of RAM.

There are also options for running open LLMs in the cloud. Groq (not to be confused with Grok) runs Llama, Mixtral and Gemma models really cheaply: https://groq.com/pricing/
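Groq's API is OpenAI-compatible, so you don't even need their SDK to kick the tires. Rough sketch with plain HTTP, assuming you've set GROQ_API_KEY; the model id below is just an example and may have been rotated out by the time you read this, so check their docs:

    # Sketch, not production code: one chat completion against Groq's
    # OpenAI-compatible endpoint. Assumes GROQ_API_KEY is in the environment
    # and that the example model id is still offered.
    import os
    import requests

    resp = requests.post(
        "https://api.groq.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
        json={
            "model": "llama3-8b-8192",  # example id; see their model list
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])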


I'll play around with it some more later. I was running llava-v1.5-7b-q4.llamafile, which is the example they recommend trying first at https://github.com/Mozilla-Ocho/llamafile
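Side note for anyone scripting against it: when you launch the llamafile it also starts a local server (http://localhost:8080 by default) that speaks the OpenAI chat-completions API, so you can hit it from Python instead of the browser UI. Minimal sketch, assuming the default port and endpoint:

    # Sketch: query a running llamafile's local server.
    # Assumes it was started with defaults (http://localhost:8080) and
    # exposes the OpenAI-compatible /v1/chat/completions endpoint.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local",  # placeholder; the local server ignores it
            "messages": [{"role": "user", "content": "Describe a llama in one sentence."}],
        },
        timeout=300,
    )
    print(resp.json()["choices"][0]["message"]["content"])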

Groq looks interesting and might be a better option for me. Thank you.


I got much better performance, 20.18 tokens per second, using tinyllama-1.1b-chat-v1.0.Q8_0.llamafile from https://huggingface.co/Bojun-Feng/TinyLlama-1.1B-Chat-v1.0-l...

If anyone is reading this and had trouble with a larger model, that might be the one to try next.
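If you want to put a number on it yourself, here's a crude way to estimate tokens per second against the same local endpoint as above. It's an approximation: wall-clock time includes prompt processing, and I'm assuming the server reports a usage block the way llama.cpp's OpenAI-compatible endpoint does. The server's own console log prints more precise timings.

    # Rough tokens-per-second estimate against a local llamafile server.
    import time
    import requests

    t0 = time.time()
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",  # default llamafile server
        json={
            "model": "local",  # placeholder; the local server ignores it
            "messages": [{"role": "user", "content": "Write a short paragraph about llamas."}],
            "max_tokens": 128,
        },
        timeout=600,
    )
    elapsed = time.time() - t0
    tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
    if tokens:
        print(f"{tokens} tokens in {elapsed:.1f}s ~ {tokens / elapsed:.2f} tokens/sec")
    else:
        print("No usage block in the response; check the server's console log instead.")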



