The whole DeepSeek-R1 situation gets extra confusing because: - The distilled mo... | Hacker News

Hacker News new | past | comments | ask | show | jobs | submit

login

TeMPOraL 3 months ago | parent | context | favorite | on: Llama.cpp supports Vulkan. why doesn't Ollama?

The whole DeepSeek-R1 situation gets extra confusing because:

- The distilled models are also provided by DeepSeek;

- There's also dynamic quants of (non-distilled) R1 - see [0]. Those, as I understand it, are more "real R1" than the distilled models, and you can get as low as ~140GB file size with the 1.58-bit quant.

I actually managed to get the 1.58-bit dynamic quant running on my personal PC, with 32GB RAM, at about 0.11 tokens per second. That is, roughly six tokens per minute. That was with llama.cpp via LM Studio; using Vulkan for GPU offload (up to 4 layers for my RTX 4070 Ti with 12GB VRAM :/) actually slowed things down relative to running purely on the CPU, but either way, it's too slow to be useful with such specs.

--

[0] - https://unsloth.ai/blog/deepseekr1-dynamic

zozbot234 3 months ago [–]

> it's too slow to be useful with such specs.

Only if you insist on realtime output: if you're OK with posting your question to the model and letting it run overnight (or, for some shorter questions, over your lunch break) it's great. I believe that this use case can fit local-AI especially well.

Join us for AI Startup School this June 16-17 in San Francisco!
Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact