
This is a bit like saying if you don't specify "--dram", the data will be stored on punchcards.

From the user's point of view: they just want to run the thing, and as quickly as possible. If multiple programs want to use the GPU, then the OS and/or the driver should figure it out.



They don't, though. If you try to allocate too much VRAM, it either hard-fails or everything suddenly runs like garbage because the driver is constantly swapping to shared system memory.

The reason this flag exists in the first place is that many models are larger than the VRAM available on most consumer GPUs, so you have to balance the load by running some layers on the GPU and some on the CPU.

What would make sense is a default auto option that uses as much VRAM as possible: assume the model is the only thing that will run on the GPU, but subtract whatever VRAM is already in use at the time it starts.
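
Roughly the kind of heuristic I mean, as a sketch - it assumes an NVIDIA card with pynvml available, and the per-layer size would really have to come from the model file:

    # Sketch of an "auto" offload heuristic: take whatever VRAM is free
    # right now (so other programs keep what they already hold), leave a
    # small reserve, and offload as many layers as fit.
    # Assumption: NVIDIA GPU + pynvml; layer_bytes would come from the
    # model's metadata in a real implementation.
    import pynvml

    def auto_gpu_layers(layer_bytes, total_layers, reserve_bytes=256 * 1024**2):
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        free = pynvml.nvmlDeviceGetMemoryInfo(handle).free
        pynvml.nvmlShutdown()
        usable = max(0, free - reserve_bytes)
        return min(total_layers, usable // layer_bytes)

    # e.g. a ~4 GB 4-bit 7B model split over 32 layers (~120 MB each)
    print(auto_gpu_layers(layer_bytes=120 * 1024**2, total_layers=32))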


> They don't, though. If you try to allocate too much VRAM, it either hard-fails or everything suddenly runs like garbage because the driver is constantly swapping to shared system memory.

What I don't understand is why it can't just check your VRAM and size the allocation accordingly by default. The allocation is not that dynamic AFAIK - when I run models it all happens basically upfront when the model loads. ollama even prints out how much VRAM it's allocating for model + context for each layer. But I still have to tune the layers manually, and any time I change my context size I have to retune.
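
For reference, the manual dance looks something like this against ollama's HTTP API - num_ctx and num_gpu are the two knobs I keep having to rebalance by hand (the model name and the numbers here are just placeholders):

    # Manually rebalancing offloaded layers vs. context size against a
    # local ollama instance - the thing an auto mode should make
    # unnecessary. Endpoint and option names are ollama's; the specific
    # values are made up for illustration.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",       # placeholder model tag
            "prompt": "hello",
            "stream": False,
            "options": {
                "num_ctx": 8192,     # bigger context -> bigger KV cache
                "num_gpu": 28,       # fewer layers on GPU to make room for it
            },
        },
    )
    print(resp.json()["response"])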


This is a great point. Context size has a large impact on memory requirements and Ollama should take this into account (something to work on :)
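
For a sense of scale, a back-of-envelope for the KV cache (assuming f16 and classic multi-head attention; grouped-query attention models need proportionally less):

    # f16 KV cache grows linearly with context length: K and V per layer,
    # 2 bytes per element. Shapes below are llama-7B-ish and illustrative.
    n_layers, n_embd, n_ctx = 32, 4096, 4096
    kv_bytes = 2 * n_layers * n_ctx * n_embd * 2
    print(kv_bytes / 1024**3)  # ~2 GiB on top of the weights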


Thanks for the work you've done already :D


Some GPUs have quirks where VRAM access slows down near the end of the address space, or where the GPU just crashes and disables display output if it's actually used. I think it's sort of sensible that they don't use the GPU at all by default.


Wouldn't the sensible default be to use 80% of available VRAM, or total VRAM minus 2GB, or something along those lines? Something a tad conservative that works for 99% of cases, with tuning options for those who want to fly closer to the sun.


2GB is a huge amount - you'd be dropping a dozen layers. Reserving a few MB should be sufficient, and a layer is generally tens to hundreds of megabytes, so unless your model fits perfectly into VRAM (using 100% of it) you're already going to be leaving anywhere from a few MB to a few hundred MB free.
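
Back-of-envelope for the "dozen layers" claim, assuming a ~4 GB 4-bit 7B model (illustrative numbers):

    # Rough per-layer cost for a ~4 GB 4-bit 7B model split over 32
    # transformer layers, and how many layers a fixed 2 GB reserve costs.
    model_bytes = 4 * 1024**3
    layers = 32
    per_layer = model_bytes / layers        # ~128 MiB per layer
    print((2 * 1024**3) / per_layer)        # ~16 layers lost to a 2 GB reserve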

Your window manager will already have reserved its VRAM upfront, so it isn't a big deal to use ~all of the rest.


I think in the vast majority of cases the GPU being the default makes sense, and for the incredibly niche cases where it isn't, there is already a tunable.


Llama.cpp allocates stuff to the GPU statically. It's not really analogous to a game.

It should have a heuristic that looks at available VRAM by default, but it does not. Probably because this is vendor-specific and harder than you'd think, and they'd rather not pull in external libraries.
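
To illustrate the vendor-specific part: even just asking "how much VRAM is free" means a different tool per vendor. A sketch with only the NVIDIA path filled in, assuming nvidia-smi is on PATH:

    # Querying free VRAM is already vendor-specific before you even get to
    # a "how many layers fit" heuristic. Only the NVIDIA path is shown;
    # AMD (rocm-smi) and Apple (Metal) would each need their own code.
    import shutil
    import subprocess

    def free_vram_mib():
        if shutil.which("nvidia-smi"):
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=memory.free",
                 "--format=csv,noheader,nounits"],
                text=True,
            )
            return int(out.splitlines()[0])
        return None  # unknown vendor: fall back to CPU-only

    print(free_vram_mib())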



