
Really great write-up, thank you John.

Two naive questions. First, with the 4060 Ti, are those the 16GB models? (I'm idly comparing pricing in Australia, as I've started toying with LM Studio, and lack of VRAM is, as you say, awful.)

Semi-related: the actual quantisation choice you made wasn't specified. I'm guessing 4- or 5-bit? My question is really which ones you experimented with after setting up your prompts / JSON handling, and whether you found much difference in accuracy between them. (I've been using Mistral 7B at Q5, but running from RAM requires some patience.)

I'd expect a lower quantisation to still be pretty accurate for this use case, with a promise of much faster response times, given you are VRAM-constrained, yeah?




yes, they are the 16GB models. beware that the memory bus limits you quite a bit. however, buying brand new, they are the best VRAM per dollar in the NVIDIA world as far as I could see.

I use 4-bit GPTQ quants, with tensor parallelism (vLLM supports it natively) to split the model across two GPUs, leaving me with exactly zero free VRAM. there are many reasons behind this decision (some of which are explained in the blog; a rough sketch of the setup follows after the list):

- TheBloke's GPTQ quants only come in 4-bit and 3-bit, and the quality difference between 3-bit and 4-bit tends to be large. I did not test 3-bit myself, but I wanted high accuracy for non-assistant tasks too, so I simply went with 4-bit.

- vLLM only supports GPTQ, AWQ, and SqueezeLLM for quantization. vLLM was needed to serve multiple clients at a time, and it's very fast (I want to use the same engine for multiple tasks; this smart assistant is only one use case). I get about 17 tokens/second, which isn't great, but very functional for my needs.

- I chose GPTQ over AWQ for reasons I discussed in the post, and I don't know anything about SqueezeLLM.
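for anyone curious what that looks like in practice, here is a rough sketch, not my exact config (the model repo name and memory settings below are placeholders, any 4-bit GPTQ quant works):

    # rough sketch: a 4-bit GPTQ Mixtral split across two GPUs
    # via vLLM tensor parallelism (repo name and settings are assumptions)
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",  # placeholder GPTQ quant
        quantization="gptq",
        tensor_parallel_size=2,        # one shard per 4060 Ti
        gpu_memory_utilization=0.98,   # squeeze almost all of the 16 GB per card
    )

    params = SamplingParams(temperature=0.2, max_tokens=256)
    out = llm.generate(["Summarise today's reminders as JSON."], params)
    print(out[0].outputs[0].text)

iirc the same quantization and tensor-parallel flags can be passed to vLLM's OpenAI-compatible server when you want to serve multiple clients, though the exact invocation depends on your vLLM version.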


> however, buying brand new, they are the best VRAM per dollar in the NVIDIA world as far as I could see.

The 3060 12GB is cheaper upfront and a viable alternative. A used 3090 Ti is also cheaper in $/VRAM terms, although it's a power hog.

The 4060 Ti 16GB is a nice product, just not for gaming. I would wait for price drops, because Nvidia just released the 4070 Super, which should drive down the cost of the 4060 Ti 16GB. I also think the 4070 Ti Super 16GB is nice for hybrid gaming/LLM usage.


that is true, but consider two things:

- motherboards and CPUs have a limited number of PCIe lanes available. I went with a second-hand Threadripper 2920X to be able to have 4 GPUs in the future. since you can only fit so many GPUs, your total available VRAM and future upgrade capacity is limited overall. these decisions limit me to PCIe gen 3 x8 (the motherboard only supports PCIe gen 3, and the 4060 Ti only supports 8 lanes), but I found that it's still quite workable. during regular inference, mixtral 8x7b at 4-bit GPTQ quant using vLLM can output text faster than I can read (maybe that says something about my reading speed rather than the inference speed, though). I average ~17 tokens/second.

- power consumption is a big deal when you are self-hosting, not only when you get the power bill, but also for safety reasons. you need to make sure you don't trip the breaker (or worse!) during inference. the 4060 Ti draws 180W at max load. 3090s are also notorious for (briefly) drawing well over their rated wattage, which scared me away. (a quick way to check the negotiated PCIe link and power draw per card is sketched below.)
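here is a small monitoring sketch (mine, not from the post) using the nvidia-ml-py package to confirm what PCIe link each card actually negotiated and what it is drawing under load:

    # quick check of negotiated PCIe link and current power draw per GPU
    # (assumes nvidia-ml-py is installed; nothing here comes from the post)
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)     # e.g. 3
        width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)        # e.g. 8 lanes
        power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000        # reported in milliwatts
        limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000
        print(f"GPU {i}: PCIe gen {gen} x{width}, {power:.0f} W / {limit:.0f} W limit")
    pynvml.nvmlShutdown()

if breaker headroom is tight, nvidia-smi can also cap the per-card power limit (the -pl flag), at some cost in inference speed.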


Great, thanks. Economics on IT h/w this side of the pond are often extra-complicated. And as a casual watcher of the space, it feels like a lot of discussion and focus has turned, over the past few months, towards optimising performance. So I'm happy to wait and see a bit longer.

From TFA I'd gone to look up GPTQ and AWQ, and inevitably found a reddit post [0] from a few weeks ago asking if both were now obsoleted by EXL2. (sigh - too much, too quickly) Sounds like vLLM doesn't support that yet anyway. The tuning EXL2 seems to offer is probably offset by the convenience of using TheBloke's ready-rolled GGUFs.

[0] https://www.reddit.com/r/LocalLLaMA/comments/18q5zjt/are_gpt...
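On the convenience point, for what it's worth, running one of those ready-rolled GGUFs really is about this much code with llama-cpp-python (just a sketch; the file name below is a placeholder for whichever Q5 quant you downloaded):

    # minimal llama-cpp-python sketch; model_path is a placeholder file name
    from llama_cpp import Llama

    llm = Llama(
        model_path="mistral-7b-instruct-v0.2.Q5_K_M.gguf",
        n_ctx=4096,
        n_gpu_layers=-1,   # offload as many layers as fit in VRAM; 0 = CPU only
    )
    out = llm("List three things to check before buying a used GPU.", max_tokens=128)
    print(out["choices"][0]["text"])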


Not specifically related to this project, but I just started playing around with Faraday, and I'm surprised how well my 8GB 3070 does, even with the 20B models. Things are improving rapidly.



