Ollama is built around llama.cpp, but it automatically handles templating the chat requests to the format each model expects, and it automatically loads and unloads models on demand based on which model an API client is requesting. Ollama also handles downloading and caching models (including quantized models), so you just request them by name.
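To make that concrete, requesting a model by name through the REST API looks roughly like this (model name and prompt are just examples, not anything special):

    # Pull a model by name; Ollama downloads and caches it (quantized by default).
    ollama pull mistral

    # Chat through the REST API; the server loads the model on demand
    # and applies that model's own prompt template for you.
    curl http://localhost:11434/api/chat -d '{
      "model": "mistral",
      "messages": [{"role": "user", "content": "Why is the sky blue?"}],
      "stream": false
    }'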
Recently, it got better (though maybe not perfect yet) at calculating how many layers of any model will fit onto the GPU, letting you get the best performance without a bunch of tedious trial and error.
Similar to Dockerfiles, Ollama offers Modelfiles that you can use to tweak models from the existing library (parameters, system prompt, and such), or to import GGUF files directly if you find a model that isn’t in the library.
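A minimal Modelfile, just for illustration (the base model and values here are placeholders):

    # Modelfile: start from a library model, or point FROM at a local .gguf file
    FROM mistral

    # Override sampling / context parameters
    PARAMETER temperature 0.7
    PARAMETER num_ctx 4096

    # Bake in a system prompt
    SYSTEM "You are a terse assistant that answers in one sentence."

You build it with something like "ollama create my-mistral -f Modelfile" and then "ollama run my-mistral" as usual.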
Ollama is the best way I’ve found to use LLMs locally. I’m not sure how well it would fare for multiuser scenarios, but there are probably better model servers for that anyways.
Running “make” on llama.cpp is really only the first step. It’s not comparable.
This is interesting. I wouldn't have given the project a deeper look without this information. The landing page is ambiguous. My immediate takeaway was, "Here's yet another front end promising ease of use."
I had similar feelings but last week finally tried it in WSL2.
Literally two shell commands and a largish download later I was chatting with mixtral on an aging 1070 at a positively surprising tokens/s (almost reading speed, kinda like the first chatgpt). Felt like magic.
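For anyone curious, the two commands were roughly these (the install script URL may have changed since):

    curl -fsSL https://ollama.com/install.sh | sh   # install the server + CLI
    ollama run mixtral                              # pulls the model, then drops into a chat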
For me, the critical thing was that ollama got the GPU offload for Mixtral right on a single 4090, where vLLM consistently failed with out of memory issues.
It's annoying that it seems to have its own model cache, but I can live with that.
Eh? The docs say vLLM supports both GPTQ and AWQ quantization. Not that it matters now that I'm out of the gate; it just surprised me that it didn't work.
I'm currently running nous-hermes2-mixtral:8x7b-dpo-q4_K_M with ollama, and it's offloaded 28 of 33 layers to the GPU with nothing else running on the card. Genuinely don't know whether it's better to go for a harsher quantisation or a smaller base model at this point - it's about 20 tokens per second but the latency is annoying.
Ollama does not come with (or require) node or python. It is written in Go. If you are writing a node or python app, then the official clients being announced here could be useful, but they are not runtimes, and they are not required to use ollama. This very fundamental mistake in your message indicates to me that you haven’t researched ollama enough. If you’re going to criticize something, it is good to research it more.
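To be concrete, the official Python client being announced is just a thin wrapper over the HTTP API; something like this is all it takes (assuming the server is running and the model has been pulled):

    import ollama

    # Chat with a locally served model; the Ollama server loads it on demand.
    response = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": "Summarize what a Modelfile is."}],
    )
    print(response["message"]["content"])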
> does not expose the full capability of llama.cpp
As far as I’ve been able to tell, Ollama also exposes effectively everything llama.cpp offers. Maybe my use cases with llama.cpp weren’t advanced enough? Please feel free to list what is actually missing. Ollama allows you to deeply customize the parameters of models being served.
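For instance, per-request overrides can be passed through the options field of the API (the specific values here are arbitrary):

    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Hello",
      "options": {
        "temperature": 0.2,
        "num_ctx": 8192,
        "num_gpu": 28
      }
    }'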
I already acknowledged that ollama was not a solution for every situation. For running on your own desktop, it is great. If you’re trying to deploy a multiuser LLM server, you probably want something else. If you’re trying to build a downloadable application, you probably want something else.
How much of a performance overhead does this runtime add, anyway? Each request to a model eats so much GPU time on actual text generation that the cost of processing the request and response strings, even in a slow, garbage-collected language, seems negligible in terms of latency.