A lot of people are running fairly powerful models directly on the CPU these days... it seems like inference won't be a GPU-exclusive activity going forward. Given that memory capacity is the main bottleneck at this point, running on CPU seems more practical for most end users.
I’m running Vicuna-13B in fp16 locally and it needs 26GB of VRAM, which won’t even fit on a single RTX 4090. The next-gen RTX Titan might have enough VRAM, but that won’t come cheap; I’m expecting a price point above $2500.
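For the curious, the 26GB figure falls straight out of the parameter count. A rough back-of-the-envelope in Python (weights only, ignoring KV cache and runtime overhead):

    # Rough weight-only memory estimate for a 13B-parameter model.
    # Ignores KV cache, activations, and framework overhead.
    params = 13e9

    for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
        gb = params * bytes_per_param / 1e9
        print(f"{precision}: ~{gb:.1f} GB")

    # fp16:  ~26.0 GB  -> too big for a 24GB RTX 4090
    # int8:  ~13.0 GB
    # 4-bit: ~6.5 GB   -> plus overhead, roughly the 8GB quoted elsewhere in this thread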
I'm not sure if it's the point GP is trying to make, but I would like to see GPUs with extra VRAM that don't have the extra compute, e.g. performance similar to a 4070 Ti but with 24GB or 32GB of VRAM.
I don't see a really good reason why OEMs couldn't do that now; in the past there have been OEM cards with more VRAM than the reference design. I'm sure there's an appetite for cards like that among people who don't want to refinance their home loan to get 2x RTX 4090 cards.
Vicuna-13B in GPTQ 4bit has almost no perplexity/quality loss and fits in just 8GB of RAM or VRAM.
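If anyone wants to try this on CPU, a minimal sketch with llama-cpp-python looks something like the below. (Caveat: GPTQ proper is usually run with GPU kernels; for pure-CPU use you'd typically grab a llama.cpp-style 4-bit quantized file instead. The model filename and thread count here are placeholders, not specific recommendations.)

    # Minimal CPU inference sketch using llama-cpp-python.
    # The model file is a placeholder for whatever 4-bit quantized
    # Vicuna-13B file you have locally.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./vicuna-13b.q4_0.gguf",  # hypothetical path
        n_ctx=2048,                           # context window
        n_threads=8,                          # CPU threads to use
    )

    out = llm("Q: Why does 4-bit quantization save memory? A:", max_tokens=64)
    print(out["choices"][0]["text"])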
I run it on my phone CPU and get ~4 tokens per second. On my laptop CPU I get 8 tokens per second.
On a $200 P40 I run LLaMA-33B at 12 tokens per second in 4-bit GPTQ. A consumer 3090 gets over 20 tokens per second for LLaMA-33B and 30 tokens per second for Vicuna-13B.
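For anyone wanting to sanity-check numbers like these on their own hardware, here's a quick-and-dirty tokens/second measurement, continuing the llama-cpp-python sketch above. (The numbers in this thread are with GPTQ GPU kernels; this is just a generic way to time any local setup, and the path/prompt are again placeholders. Note it includes prompt-processing time, so it slightly understates pure generation speed.)

    # Quick-and-dirty tokens/second measurement.
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="./vicuna-13b.q4_0.gguf",  # hypothetical path
                n_threads=8)                          # set n_gpu_layers instead if built with GPU support

    start = time.perf_counter()
    out = llm("Explain why VRAM capacity limits local LLM inference.", max_tokens=128)
    elapsed = time.perf_counter() - start

    generated = out["usage"]["completion_tokens"]  # tokens actually produced
    print(f"{generated / elapsed:.1f} tokens/second")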