A lot of people are running fairly powerful models directly on the CPU these days... it seems like inference won't be a GPU-exclusive activity going forward. Given that memory capacity is the main bottleneck at this point, running on CPU seems more practical for most end users.
I’m running Vicuna-13B in fp16 locally and it needs 26GB of VRAM, which won’t even fit on a single RTX 4090. The next-gen RTX Titan might have enough VRAM, but that won’t come cheap; I’m expecting a price point above $2500.
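For the curious, the 26GB figure falls straight out of the parameter count. A rough back-of-the-envelope in Python (weights only, ignoring KV cache and runtime overhead):

    # Rough weight-only memory estimate for a 13B-parameter model.
    # Ignores KV cache, activations, and framework overhead.
    params = 13e9

    for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
        gb = params * bytes_per_param / 1e9
        print(f"{precision}: ~{gb:.1f} GB")

    # fp16:  ~26.0 GB  -> too big for a 24GB RTX 4090
    # int8:  ~13.0 GB
    # 4-bit: ~6.5 GB   -> plus overhead, roughly the 8GB quoted elsewhere in this thread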
I'm not sure if it's the point GP is trying to make, but I would like to see GPUs with extra VRAM that don't have the extra compute, e.g. performance similar to a 4070 Ti but with 24GB or 32GB of VRAM.
I don't see a really good reason why OEMs couldn't do that now; in the past there have been OEM cards with more VRAM than the reference design. I'm sure there's an appetite for cards like that among people who don't want to refinance their home loan to get 2x RTX 4090 cards.
Vicuna-13B in GPTQ 4bit has almost no perplexity/quality loss and fits in just 8GB of RAM or VRAM.
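If anyone wants to try this on CPU, a minimal sketch with llama-cpp-python looks something like the below. (Caveat: GPTQ proper is usually run with GPU kernels; for pure-CPU use you'd typically grab a llama.cpp-style 4-bit quantized file instead. The model filename and thread count here are placeholders, not specific recommendations.)

    # Minimal CPU inference sketch using llama-cpp-python.
    # The model file is a placeholder for whatever 4-bit quantized
    # Vicuna-13B file you have locally.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./vicuna-13b.q4_0.gguf",  # hypothetical path
        n_ctx=2048,                           # context window
        n_threads=8,                          # CPU threads to use
    )

    out = llm("Q: Why does 4-bit quantization save memory? A:", max_tokens=64)
    print(out["choices"][0]["text"])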
I run it on my phone CPU and get ~4 tokens per second. On my laptop CPU I get 8 tokens per second.
On a $200 P40 I run LLaMA-33B at 12 tokens per second in 4-bit GPTQ. A consumer 3090 gets over 20 tokens per second for LLaMA-33B and 30 tokens per second for Vicuna-13B.
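For anyone wanting to sanity-check numbers like these on their own hardware, here's a quick-and-dirty tokens/second measurement, continuing the llama-cpp-python sketch above. (The numbers in this thread are with GPTQ GPU kernels; this is just a generic way to time any local setup, and the path/prompt are again placeholders. Note it includes prompt-processing time, so it slightly understates pure generation speed.)

    # Quick-and-dirty tokens/second measurement.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./vicuna-13b.q4_0.gguf",  # hypothetical path
                n_threads=8)                          # set n_gpu_layers instead if built with GPU support

    start = time.perf_counter()
    out = llm("Explain why VRAM capacity limits local LLM inference.", max_tokens=128)
    elapsed = time.perf_counter() - start

    generated = out["usage"]["completion_tokens"]  # tokens actually produced
    print(f"{generated / elapsed:.1f} tokens/second")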