
By the way, you can use Hugging Face models with Ollama, and local Modelfiles too.
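Something like this, for example (the repo and model names below are just placeholders):

    # run a GGUF straight from a Hugging Face repo
    ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

    # or build and run from a local Modelfile
    ollama create my-model -f ./Modelfile
    ollama run my-model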




You're saying that as if you can't do that with llama.cpp. Most Ollama users seem to have no idea which features and benefits come directly from llama.cpp rather than from Ollama itself...

I read it the other way: you don't have to be locked into Ollama's registry if you don't want to be.

Could you share a bit more about what you do with llama.cpp? I'd rather use llama-server, but it seems to require a good amount of fiddling with the parameters to get good performance.


Recently llama.cpp made sensible values the default for a few common parameters (-ngl 999, -fa on), so it got simpler: --model, --ctx-size, and --jinja are generally enough to start.
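A minimal launch along those lines (the model path and context size are placeholders):

    # GPU offload (-ngl) and flash attention (-fa) are now defaults, so this is enough
    llama-server \
      --model ./models/some-model.gguf \
      --ctx-size 8192 \
      --jinja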

We end up fiddling with other parameters anyway because tuning them for a particular setup gives better performance, so it's well worth it. One example is the recent --n-cpu-moe switch, which offloads MoE expert layers to the CPU while still filling all available VRAM; it can give a 50% boost on models like gpt-oss-120b.
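For example, on a MoE model like that (the number of expert layers to keep on the CPU depends on your VRAM, so treat the value below as a placeholder to tune):

    # keep some MoE expert layers on the CPU so the rest of the model fills VRAM
    llama-server \
      --model ./models/gpt-oss-120b.gguf \
      --ctx-size 16384 \
      --jinja \
      --n-cpu-moe 24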

Once you've tasted this, going without it is a no-go. Meanwhile, on the Ollama side there's an open issue asking for the same feature: https://github.com/ollama/ollama/issues/11772

Finally, llama-swap, as a separate tool, provides auto-loading/unloading across multiple models.
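Roughly like this, assuming llama-swap's YAML schema from memory (the field and flag names are worth double-checking against its README):

    # sketch of a llama-swap config that starts llama-server on demand
    cat > llama-swap.yaml <<'EOF'
    models:
      "qwen2.5-7b-instruct":
        cmd: llama-server --port ${PORT} -m ./models/qwen2.5-7b-instruct.gguf --jinja
        ttl: 300    # unload after 5 minutes of inactivity
    EOF

    llama-swap --config llama-swap.yaml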


Nailed it. To make matters worse, Ollama obfuscates the models, so its users don't really know what they're running until they dig into the modelfile. Only then can they see that what they thought was DeepSeek-R1 is actually an 8B Qwen distillation of DeepSeek-R1, for example.
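To be fair, you can check what a tag really resolves to from the CLI (the tag name below is just an example):

    # prints the architecture and parameter count behind a tag
    ollama show deepseek-r1

    # prints the full modelfile it was built from
    ollama show deepseek-r1 --modelfile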

Luckily, we have Jan.ai and LM Studio, which are happy to run GGUF models at full tilt on various hardware configs. Added bonus: both include a very nice API server as well.
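Both speak the OpenAI-compatible chat API, so a quick smoke test is just a curl away (the port and model name depend on your local setup; 1234 is LM Studio's default):

    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "my-local-model",
            "messages": [{"role": "user", "content": "Hello"}]
          }'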



