Note that they don't compare with deepseek coder 6.7b, which is vastly superior ...

jyap · on Jan 17, 2024

Deepseek Coder Instruct 6.7b has been my local LLM (M1 series MBP) for a while now and that was my first thought… They selectively chose benchmark results to look impressive (which is typical).

I tested out StableLM Zephyr 3B when that came out and it was extremely underwhelming/unusable.

Based on this, Stable Code 3B doesn’t look to be worth trying out. Guessing if they could put out a 7B model which beat Deepseek Coder 6.7B they would have.

a_wild_dandan · on Jan 17, 2024

Do you know how Deepseek 33b compares to 6.7b? I'm trying 33b on my (96GB) MacBook just because I have plenty of spare (V)RAM. But I'll run the smaller model if the benefits are marginal in other peoples' experience.

wokwokwok · on Jan 17, 2024

The smaller model is great at trivial day-to-day tasks.

However, when you ask hard things, it struggles; you can ask the same question 10 times, and only get 1 answer that actually answers the question.

...but the larger model is a lot slower.

Generally, if you don't want to mess around swapping models, stick with the bigger one. It's better.

However, if you are heavily using it, you'll find the speed is a pain in the ass, and when you want a trivial hint like 'how do I do a map statement in kotlin again?', you really don't need it.

What I have setup personally is a little thumbs-up / thumbs-down on the suggestions via a custom intellij plugin; if I 'thumbs-down' a result, it generates a new solution for it.

If I 'thumbs-down' it twice, it swaps to the larger model to generate a solution for it.

This kind of 'use ok model for most things and step up to larger model when you start asking hard stuff' approach scales very nicely for my personal workflow... but, I admit that setting it up was a pain, and I'm forever pissing around with the plugin code to fix tiny bugs, which I would prefer to be spending doing actual work.

So... there's not really much tooling out there at the moment to support it, but the best solution really is to use both.

If you don't want to and just want 'use the best model for everything', stick with the bigger one.

The larger model is more capable of turning 'here is a description of what I want' into 'here is code that does it that actually compiles'.

The smaller model is much better at 'I want a code fragment that does X' -> 'rephrased stack overflow answer'.

tarruda · on Jan 17, 2024

> but the larger model is a lot slower.

I found the performance to be very acceptable for 33b 4 bit on a m3 max with 36gb ram (much faster than reading speed)

wokwokwok · on Jan 17, 2024

I’m not sure what to say; responsive fast output is ideal, and the larger model is distinctly slower for me, particularly for long completions (2k tokens) if you’re using a restricted grammar like json output.

I’m using an M2 not an M3 though; maybe it’s better for you.

I was under the impression quantised results were generally slower too, but I’ve never dug into it (or particularly noticed a difference between q4/q5/q6).

If you find it fast enough to use then go for it~

bravura · on Jan 17, 2024

Do you mind sharing your plugin as a gist?

How do you run both models in memory? Two separate processes?

jyap · on Jan 17, 2024

You would want to test it out manually day to day. That’s always the best. Some models can out score but not actually be “better” when you use it.

But there is also the benchmarking: https://github.com/deepseek-ai/deepseek-coder

33B Instruct doesn’t beat 6.7B Instruct by much but maybe those % improvements mean more for your usage.

I run 6.7B since I have 16GB RAM.

Quantization of the model also makes a difference.

zwarag · on Jan 17, 2024

Do you use it inside vscode or how do you integrate an LLM into your IDE?

SubiculumCode · on Jan 17, 2024

How do you make use of it? Do you have it integrated directly into an ide?

sroussey · on Jan 17, 2024

What do you use it for?

jyap · on Jan 17, 2024

Golang grinding Leetcode coding buddy. And general coding buddy PR reviewer.

The results are on par or better than ChatGPT 3.5.

I often use it to delve deeper such as “is there an alternative way to write this?” Or “how does this code look?”

If you have an M-series Mac I recommend trying out LM Studio. Really eye opening and I’m excited to see how things progress.

sroussey · on Jan 17, 2024

I have GitHub copilot. Is it better than that? Nd if so, in which way?

Offline would be one for sure. Cost is another. What else?

jyap · on Jan 17, 2024

I’ve never used GH Copilot so can’t comment on that.

But having everything locally means no privacy or data leak issues.

danielbln · on Jan 16, 2024

Deepseek-coder-6.7B really is a quite surprisingly capable model. It's easy to give it a spin with ollama via `ollama run deepseek-coder:6.7b`.

triyambakam · on Jan 16, 2024

Thanks for the tip with ollama

discordance · on Jan 17, 2024

If you do:

1. ollama run deepseek-coder:6.7b

2. pip install litellm

3. litellm --model deepseek-coder:6.7b

You will have a local OpenAI compatible API for it.

behnamoh · on Jan 17, 2024

ollama is actually not a great way to run these models as it makes it difficult to change server parameters and doesn't use `mlock` to keep the models in memory.

bravura · on Jan 17, 2024

What do you suggest?

behnamoh · on Jan 17, 2024

vanilla llama.cpp (run a `/server`)

eyegor · on Jan 17, 2024

The 1.3b model is amazing for real time code complete, it's fast enough to be a better intellisense.

Another model you should try is magicoder 6.7b ds (based on deepseek coder). After playing with it for a couple weeks, I think it gives slightly better results than the equivalent deepseek model.

Repo https://github.com/ise-uiuc/magicoder

Models https://huggingface.co/models?search=Magicoder-s-ds

hskalin · on Jan 17, 2024

How do you use these models with your editor? (E. vscode or Emacs etc)

eyegor · on Jan 17, 2024

I run tabby [0] which uses llama.cpp under the hood and they ship a vscode extension [1]. Going above 1.3b, I find the latency too distracting (but the highest end gpu I have nearby is some 16gb rtx quadro card that's a couple years old, and usually I'm running a consumer 8gb card instead).

[0] https://tabby.tabbyml.com/

[1] https://marketplace.visualstudio.com/items?itemName=TabbyML....

tarasglek · on Jan 17, 2024

Would you mind sharing your tabby invocation to launch this model?

tarruda · on Jan 17, 2024

An easy way is to use an OpenAI compatible server, which means you can use any GPT plugin to integrate with your editor

varjag · on Jan 17, 2024

Not sure what you guys are doing with it but even at 33B it's laughably dumb compared to Mixtral.

a_wild_dandan · on Jan 16, 2024

This is phenomenal. And runs fast! The 33b version might be my MacBook's new coding daily driver.

tarruda · on Jan 17, 2024

4-bit quantized 33b runs great on a mp pro with m3 max chip

a_wild_dandan · on Jan 17, 2024

I'm using the 5-bit quant with llama.cpp and it's excellent on my M2 96GB MacBook! Running this model + Mixtral will be fun.

unshavedyak · on Jan 17, 2024

How are you using it? I need to find some sane way to use this stuff from Helix/terminal..

a_wild_dandan · on Jan 19, 2024

There are many workflows, with hardware-dependent requirements. Three which work for my MacBook:

1. Clone & make llama.cpp. It's a CLI program that runs models, e.g. `./main -m <local-model-file.gguf> -p <prompt>`.

2. Another CLI option is `ollama`, which I believe can download/cache models for you.

3. A GUI like LM Studio provides a wonderful interface for configuring, and interacting with, your models. LM Studio also provides a model catalog for you to pick from.

Assuming that your hardware is sufficient, options 1 & 2 should satisfy your terminal needs. Option 3 is an excellent playground for trying new models/configurations/etc.

Models are heavy. To fit one in your silicon and run it quickly, you'll want to use a quantized model. It's a model's "distilled" version -- say 80% smaller for a 0.1% accuracy loss. TheBloke on HuggingFace is one specialist in distilling. After finding a model you like, you can download some flavor of quantization he made, e.g: `huggingface-cli download TheBloke/neural-chat-7B-v3-3-GGUF neural-chat-7b-v3-3.Q4_K_M.gguf --local-dir .`; then use your favorite model runner (e.g. llama.cpp) to run it.

Hope that gets you started. Cheers!