Ollama is built around llama.cpp, but it automatically handles templating the chat requests to the format each model expects, and it automatically loads and unloads models on demand based on which model an API client is requesting. Ollama also handles downloading and caching models (including quantized models), so you just request them by name.
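To make that concrete, requesting a model by name through the REST API looks roughly like this (model name and prompt are just examples, not anything special):

    # Pull a model by name; Ollama downloads and caches it (quantized by default).
    ollama pull mistral

    # Chat through the REST API; the server loads the model on demand
    # and applies that model's own prompt template for you.
    curl http://localhost:11434/api/chat -d '{
      "model": "mistral",
      "messages": [{"role": "user", "content": "Why is the sky blue?"}],
      "stream": false
    }'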
Recently, it got better (though maybe not perfect yet) at calculating how many layers of any model will fit onto the GPU, letting you get the best performance without a bunch of tedious trial and error.
Similar to Dockerfiles, Ollama offers Modelfiles that you can use to tweak models from the existing library (parameters, system prompt, and such), or to import GGUF files directly if you find a model that isn’t in the library.
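A minimal Modelfile, just for illustration (the base model and values here are placeholders):

    # Modelfile: start from a library model, or point FROM at a local .gguf file
    FROM mistral

    # Override sampling / context parameters
    PARAMETER temperature 0.7
    PARAMETER num_ctx 4096

    # Bake in a system prompt
    SYSTEM "You are a terse assistant that answers in one sentence."

You build it with something like "ollama create my-mistral -f Modelfile" and then "ollama run my-mistral" as usual.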
Ollama is the best way I’ve found to use LLMs locally. I’m not sure how well it would fare for multiuser scenarios, but there are probably better model servers for that anyways.
Running “make” on llama.cpp is really only the first step. It’s not comparable.
This is interesting. I wouldn't have given the project a deeper look without this information. The landing page is ambiguous. My immediate takeaway was, "Here's yet another front end promising ease of use."
I had similar feelings but last week finally tried it in WSL2.
Literally two shell commands and a largish download later I was chatting with mixtral on an aging 1070 at a positively surprising tokens/s (almost reading speed, kinda like the first chatgpt). Felt like magic.
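For anyone curious, the two commands were roughly these (the install script URL may have changed since):

    curl -fsSL https://ollama.com/install.sh | sh   # install the server + CLI
    ollama run mixtral                              # pulls the model, then drops into a chat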
For me, the critical thing was that ollama got the GPU offload for Mixtral right on a single 4090, where vLLM consistently failed with out of memory issues.
It's annoying that it seems to have its own model cache, but I can live with that.
Eh? The docs say vLLM supports both GPTQ and AWQ quantization. Not that it matters now that I'm out of the gate; it just surprised me that it didn't work.
I'm currently running nous-hermes2-mixtral:8x7b-dpo-q4_K_M with ollama, and it's offloaded 28 of 33 layers to the GPU with nothing else running on the card. Genuinely don't know whether it's better to go for a harsher quantisation or a smaller base model at this point - it's about 20 tokens per second but the latency is annoying.
Ollama does not come with (or require) node or python. It is written in Go. If you are writing a node or python app, then the official clients being announced here could be useful, but they are not runtimes, and they are not required to use ollama. This very fundamental mistake in your message indicates to me that you haven’t researched ollama enough. If you’re going to criticize something, it is good to research it more.
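To be concrete, the official Python client being announced is just a thin wrapper over the HTTP API; something like this is all it takes (assuming the server is running and the model has been pulled):

    import ollama

    # Chat with a locally served model; the Ollama server loads it on demand.
    response = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": "Summarize what a Modelfile is."}],
    )
    print(response["message"]["content"])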
> does not expose the full capability of llama.cpp
As far as I’ve been able to tell, Ollama also exposes effectively everything llama.cpp offers. Maybe my use cases with llama.cpp weren’t advanced enough? Please feel free to list what is actually missing. Ollama allows you to deeply customize the parameters of models being served.
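For instance, per-request overrides can be passed through the options field of the API (the specific values here are arbitrary):

    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Hello",
      "options": {
        "temperature": 0.2,
        "num_ctx": 8192,
        "num_gpu": 28
      }
    }'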
I already acknowledged that ollama was not a solution for every situation. For running on your own desktop, it is great. If you’re trying to deploy a multiuser LLM server, you probably want something else. If you’re trying to build a downloadable application, you probably want something else.
How much of a performance overhead does this runtime add, anyway? Each request to a model eats so much GPU time on actual text generation that the cost of processing the request and response strings, even in a slow, garbage-collected language, seems negligible in terms of latency.