"I’ve exclusively used the astounding llama.cpp. Other options exist, but for basic CPU inference — that is, generating tokens using a CPU rather than a GPU — llama.cpp requires nothing beyond a C++ toolchain. In particular, none of the Python fiddling that plagues much of the ecosystem. On Windows it's a 5MB llama-server.exe with no runtime dependencies."
Will definitely give llama.cpp a go, great selling point.
I've tried running both Meta Llama and GPT-2, and both relied on some complex virtualization toolchain of either Docker or a thing called conda. The dependency list was looong, and any issue at any point caused a blockage. I tried on 3 machines, and in a whole day, as a somewhat senior dev, I couldn't get either running.
Yes! That's so awesome about llama.cpp. Just grab the GitHub repo, fire up your C++ compiler toolchain, and not even a minute later... you have a set of tools to do some serious AI shenanigans!
Even adding CUDA capabilities is, although somewhat involved, pretty easy.
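For reference, a typical from-source build looks something like this (a sketch only; the exact CMake flag names have changed across llama.cpp versions, e.g. the CUDA switch used to be spelled differently):

```shell
# CPU-only build: needs nothing beyond git, CMake, and a C++ compiler
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Optional: rebuild with CUDA support (requires the CUDA toolkit installed)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

The resulting binaries (llama-cli, llama-server, llama-quantize, ...) land under build/bin.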
As someone who has been running llama.cpp for 2-3 years now (I started with RWKV v3 on Python, one of the previously most accessible models thanks to both CPU and GPU support and the ability to run on older small GPUs, even Kepler-era 2GB cards!), I felt the need to point out that only needing the llama.cpp binaries, at only 5MB, is ONLY true for CPU inference using pre-converted/quantized models. If you are getting a raw trained model from Meta, RWKV, THUDM, ByteDance, Microsoft, Alibaba, or any of the big companies releasing open-weight (but generally not open-source) models to the public, it WILL require Python, torch, and dozens to hundreds of prerequisite Python modules in order to run the convert.py script to produce an output model.
Should you wish to convert a model yourself, use BF16 for the majority of conversions (exceptions apply for models natively trained in FP32, FP16, or 1/1.58-bit formats), provided you have enough disk space. Then run llama-quantize on that BF16 model to create any quantized variants: this minimizes conversion losses and lets you make the accuracy vs. performance vs. space trade-offs that make the most sense for you.
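The two-step workflow above looks roughly like this (a sketch; script names and flags vary by llama.cpp version — newer checkouts call the conversion script convert_hf_to_gguf.py rather than convert.py, and the paths are placeholders):

```shell
# 1. Convert the raw Hugging Face checkpoint to a BF16 GGUF master copy.
#    This is the step that needs Python, torch, and the long dependency list.
python convert_hf_to_gguf.py /path/to/raw-model \
    --outtype bf16 --outfile model-bf16.gguf

# 2. Quantize from the BF16 master with the pure-C++ tool; repeat with
#    other quantization types (Q8_0, Q5_K_M, ...) as space/speed dictates.
./build/bin/llama-quantize model-bf16.gguf model-Q4_K_M.gguf Q4_K_M
```

Keeping the BF16 master around means you can re-quantize later without redoing the lossy conversion step.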
As far as models go, Mistral-2-Large, the GLM-4 variants, and Mistral-Nemo-8B are my current non-multimodal favorites. llama.cpp doesn't currently support multimodal models, unless you use one of the various forks that use it as the inference backend, due to issues embedding the image tokens in the llama-server implementation. Of the three models listed, Mistral-2-Large has most recently shown the most personality when asked to play Colossus, GLM-4 the best translation between multiple languages while maintaining consistency between translations, and Mistral-Nemo-8B the most obscure code knowledge and annotation capabilities (with CodeGeeX-4-9B, a GLM-4 finetune, as a close second). The last two models were both able to answer questions on 16-bit DOS C programming and near and far pointers, and even give assembly examples, although you have to specify very carefully to only emit 8086 or pre-80386 assembly mnemonics to avoid them using the e?x variants of the ?x registers.
May this comment prove illuminating for one searching for light.
I feel personally attacked ; ) I'm also building a tool to run/build on top of LLMs (https://github.com/singulatron/superplatform) and I opted for containers too. TBF I'm mostly targeting backend developers (who am I kidding, I'm mostly building this for myself).
The desktop version has its own configuration management software to install docker or WSL and all the dependencies you talk about, so I feel your pain.
And, while your project looks quite cool, it's way too much and too complicated for someone who just wants to start playing around with LLMs and the various text models you can get from sources like Hugging Face, while staying somewhat in charge of getting the tools and compiling them on their own.
Having looked at your project, what would you say is the difference in ability or philosophy compared to Open WebUI or FlowiseAI? Or is this "I want to build this because I want to"? To which there is nothing wrong with that.
there's also ollama, which I haven't used much yet. they used to have llama.cpp as the only backend, but it appears they've now started to include their own code.
Ollama is kind of ok to get started, but as I understand it they don't give you a choice in the quantisation you'll use. Please correct me if I'm wrong.
One thing I am sure about is that they store large model files renamed to long globally unique identifiers, and I still haven't understood that part of the design as anything but some silly obfuscating embrace...
And here again, I'd love to be shown how I'm wrong.
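For what it's worth, those identifiers look content-addressed rather than obfuscated: Ollama stores model layers as blobs named after their SHA-256 digest, in the style of an OCI registry, so identical layers shared between models deduplicate to a single file. A minimal sketch of that naming convention (the `sha256-<hex>` filename prefix matches Ollama's on-disk layout; the `blob_name` helper itself is hypothetical):

```shell
# Content-addressed naming: the filename is derived purely from the
# file's bytes, so the same layer always maps to the same blob name.
blob_name() {
  printf 'sha256-%s' "$(sha256sum "$1" | cut -d' ' -f1)"
}

printf 'hello' > /tmp/demo-layer
blob_name /tmp/demo-layer
# → sha256-2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```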
You can: when you search for a model on the Ollama website there is a drop-down that lets you select a "tag", sort of like a Docker container tag. This lets you pick the quantization you want.
You can choose the quantization by appending the right tag to the model name, but they don't support other more advanced useful features (e.g. you need a special flag to enable flash attention and you cannot use KV cache quantization for large contexts).