So next time llama.cpp releases an update and people update their favorite backend, you redownload a 4.26 GB file. Epic.
EDIT: oh, wait. Actually people usually have a handful to a few dozen of these models lying around. When the backend updates, you just redownload every single model again.
EDIT 2: right, you can release a program that automatically patches and updates the downloaded model+executables. Such an invention.
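And it wouldn't even be hard to build. Here's a minimal sketch using the bsdiff4 Python package (file names are hypothetical); since the weights in a bundle like this don't change between backend updates, the patch ends up roughly the size of the new executable rather than 4.26 GB:

```python
# Sketch of a delta-update flow with bsdiff4 (pip install bsdiff4).
# File names are hypothetical; the publisher would ship the .patch file.
import bsdiff4

# Publisher side, once per release. Note: bsdiff is memory-hungry on
# multi-GB inputs, so a production tool might chunk the file or use
# something like zstd's --patch-from instead.
bsdiff4.file_diff("model-v1.llamafile", "model-v2.llamafile", "v1-to-v2.patch")

# User side: turn the bundle you already have into the new release.
bsdiff4.file_patch("model-v1.llamafile", "model-v2.llamafile", "v1-to-v2.patch")
```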
You know, most people don't have 24+ GB GPUs sitting around to train these models. So in my book this is a huge step forward. Personally, this is the first time I am able to run an LLM on my computer, and it's purely thanks to this.
Compared to modern bandwidth usage that's not such a big size anymore. Every day, millions of people download 100 GB video games, watch 4K video podcasts, etc.
You can even run a full LLM in your browser these days. Try https://webllm.mlc.ai/ in Chrome: it can load a Llama-2-7b chat model (~4000 MB, which took my connection just under 3 minutes) and you can start chatting with it.
Spoken like someone who hasn’t spent hours trying to get LocalAI to build and run, only to find out that while it’s “OpenAI API compatible!” it doesn’t support streaming, so the Mattermost OpenAI plugin doesn’t work. I finally gave up and went back to ooba (which also didn’t work with the MM plugin… hmm). Next time I’ll just hack something on the side of llama.cpp.
That's why I always download the original version and quantize it myself. With enough swap, you can do it with a modest amount of RAM. I've never had to download a model twice.
But yes, unless there is a way to patch it, bundling the model with the executable like this is going to be more wasteful.
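For anyone curious, the quantize-it-yourself flow against a llama.cpp checkout looks roughly like this (script and binary names have moved around between versions, so treat it as a sketch, not gospel):

```python
# Rough sketch: convert original weights to GGUF, then quantize with llama.cpp.
# Assumes a built llama.cpp checkout; exact script/binary names vary by version.
import subprocess

MODEL_DIR = "models/llama-2-7b"  # hypothetical path to the original weights

# 1. Convert the original weights into a full-precision GGUF file.
#    This step is disk-heavy but only has to happen once per model.
subprocess.run(
    ["python3", "convert.py", MODEL_DIR, "--outfile", f"{MODEL_DIR}/f16.gguf"],
    check=True,
)

# 2. Quantize down to 4 bits. This is where swap can stand in for RAM,
#    since it's slow but workable on modest hardware.
subprocess.run(
    ["./quantize", f"{MODEL_DIR}/f16.gguf", f"{MODEL_DIR}/q4_0.gguf", "q4_0"],
    check=True,
)
```

When a backend update changes the file format, you re-run the conversion from the original weights instead of redownloading gigabytes.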