"I’ve exclusively used the astounding llama.cpp. Other options exist, but for basic CPU inference — that is, generating tokens using a CPU rather than a GPU — llama.cpp requires nothing beyond a C++ toolchain. In particular, none of the Python fiddling that plagues much of the ecosystem. On Windows it's a 5MB llama-server.exe with no runtime dependencies."
Will definitely give llama.cpp a go, great selling point.
I've tried running both Meta Llama and GPT-2, and both relied on some complex virtualization toolchain of either Docker or a thing called conda. The dependency list was looong, and any issue at any point caused a blockage. I tried on 3 machines, and in a whole day, as a somewhat senior dev, I couldn't get either running.
Yes! That's so awesome about llama.cpp. Just grab the GitHub repo, fire up your C++ compiler toolchain, and not even a minute later... you have a set of tools to do some serious AI shenanigans!
Even adding CUDA capabilities is, although somewhat involved, pretty easy.
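For reference, a typical from-source build looks something like this (a sketch only; the exact CMake flag names have changed across llama.cpp versions, e.g. the CUDA switch used to be spelled differently):

```shell
# CPU-only build: needs nothing beyond git, CMake, and a C++ compiler
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Optional: rebuild with CUDA support (requires the CUDA toolkit installed)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

The resulting binaries (llama-cli, llama-server, llama-quantize, ...) land under build/bin.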
As someone who has been running llama.cpp for 2-3 years now (I started with RWKV v3 on Python, one of the previously most accessible models thanks to both CPU and GPU support and the ability to run on older small GPUs, even Kepler-era 2GB cards!), I felt the need to point out that only needing the llama.cpp binaries, at only 5MB, is ONLY true for CPU inference using pre-converted/quantized models. If you are getting a raw trained model from Meta, RWKV, THUDM, ByteDance, Microsoft, Alibaba, or any of the big companies releasing open-weight (but generally not open-source) models to the public, it WILL require Python, torch, and dozens to hundreds of prerequisite Python modules in order to run the convert.py script to produce an output model.
Should you wish to convert a model yourself, use BF16 for the majority of conversions (exceptions apply for models natively trained in FP32, FP16, or 1/1.58-bit formats), provided you have enough disk space. Then run llama-quantize on that BF16 model to create any quantized variants: this minimizes conversion losses and lets you make the accuracy vs. performance vs. space trade-offs that make the most sense for you.
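The two-step workflow above looks roughly like this (a sketch; script names and flags vary by llama.cpp version — newer checkouts call the conversion script convert_hf_to_gguf.py rather than convert.py, and the paths are placeholders):

```shell
# 1. Convert the raw Hugging Face checkpoint to a BF16 GGUF master copy.
#    This is the step that needs Python, torch, and the long dependency list.
python convert_hf_to_gguf.py /path/to/raw-model \
    --outtype bf16 --outfile model-bf16.gguf

# 2. Quantize from the BF16 master with the pure-C++ tool; repeat with
#    other quantization types (Q8_0, Q5_K_M, ...) as space/speed dictates.
./build/bin/llama-quantize model-bf16.gguf model-Q4_K_M.gguf Q4_K_M
```

Keeping the BF16 master around means you can re-quantize later without redoing the lossy conversion step.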
As far as models go, Mistral-2-Large, the GLM-4 variants, and Mistral-Nemo-8B are my current non-multimodal favorites. llama.cpp doesn't currently support multimodal models, unless you use one of the various forks that use it as the inference backend, due to issues embedding the image tokens in the llama-server implementation. Of the three models listed, Mistral-2-Large has most recently shown the most personality when asked to play Colossus, GLM-4 the best translation between multiple languages while maintaining consistency between translations, and Mistral-Nemo-8B the most obscure code knowledge and annotation capabilities (with CodeGeeX-4-9B, a GLM-4 finetune, as a close second). The last two models were both able to answer questions on 16-bit DOS C programming and near and far pointers, and even give assembly examples, although you have to specify very carefully to only emit 8086 or pre-80386 assembly mnemonics to avoid them using the e?x variants of the ?x registers.
May this comment prove illuminating for one searching for light.
I feel personally attacked ; ) I'm also building a tool to run/build on top of LLMs (https://github.com/singulatron/superplatform) and I opted for containers too. TBF I'm mostly targeting backend developers (who am I kidding, I'm mostly building this for myself).
The desktop version has its own configuration management software to install docker or WSL and all the dependencies you talk about, so I feel your pain.
And, while your project looks quite cool, it's way too much and too complicated for someone who just wants to start playing around with LLMs and the various text models you can get from sources like Hugging Face, while staying somewhat in charge of getting the tools and compiling them on their own.
Having looked at your project, what would you say is the difference in ability or philosophy compared to Open WebUI or FlowiseAI? Or is this "I want to build this because I want to"? To which there is nothing wrong with that.
there's also ollama, which I haven't used much yet. they used to have llama.cpp as the only backend, but it appears they've now started to include their own code.
Ollama is kind of ok to get started, but as I understand it they don't give you a choice in the quantisation you'll use. Please correct me if I'm wrong.
One thing I am sure about is that they store large model files renamed to long globally unique identifiers, and I still haven't understood that part of the design as anything but some silly obfuscating embrace...
And here again, I'd love to be shown how I'm wrong.
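For what it's worth, those identifiers look content-addressed rather than obfuscated: Ollama stores model layers as blobs named after their SHA-256 digest, in the style of an OCI registry, so identical layers shared between models deduplicate to a single file. A minimal sketch of that naming convention (the `sha256-<hex>` filename prefix matches Ollama's on-disk layout; the `blob_name` helper itself is hypothetical):

```shell
# Content-addressed naming: the filename is derived purely from the
# file's bytes, so the same layer always maps to the same blob name.
blob_name() {
  printf 'sha256-%s' "$(sha256sum "$1" | cut -d' ' -f1)"
}

printf 'hello' > /tmp/demo-layer
blob_name /tmp/demo-layer
# → sha256-2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```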
You can: when you search for a model on the Ollama website there is a drop-down that lets you select a "tag", sort of like a Docker container tag. This lets you pick the quantization you want.
You can choose the quantization by appending the right tag to the model name, but they don't support other more advanced useful features (e.g. you need a special flag to enable flash attention and you cannot use KV cache quantization for large contexts).