It’s not better. In most of my tests (C++/Qt code) it just runs out of context before it can really do anything. And the output is very bad - it mashes together the header and the .cpp file. The reasoning output is fun to look at and occasionally useful, though.
The max token output is only 8K (32K thinking tokens). O1 is 128k, which is far more useful, and it doesn’t get stuck like R1 does.
The hype around the DeepSeek release is insane and I’m starting to really doubt their numbers.
Is this a local run of one of the smaller models and/or other-models-distilled-with-r1, or are you using their Chat interface?
I've also compared o1 and (online-hosted) r1 on Qt/C++ code, being a KDE Plasma dev, and my impression so far was that the output is roughly on par. I've given both models some tricky tasks about dark corners of the meta-object system, crafting classes, etc., and they came up with generally the same sort of suggestions and implementations.
I do appreciate that "asking about gotchas with few definitive solutions, even if they require some perspective" and "rote day-to-day coding ops" are very different benchmarks due to how things are represented in the training data corpus, though.
I use it through Kagi Assistant which has the proper R1 model through Together.ai/Fireworks.ai
My standard test is to ask the model to write a QSyntaxHighlighter subclass that uses TreeSitter to implement syntax highlighting. O1 can do it after a few iterations, but R1’s output has been a mess. That said, its thought process revealed a few issues that I then fixed in my canonical implementation.
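Roughly the shape I'm looking for, as a minimal sketch rather than a working highlighter: it assumes the tree-sitter C API plus a tree_sitter_cpp grammar are linked in, naively re-parses the whole document on every block, and treats byte offsets as character offsets (i.e. it ignores the UTF-8/UTF-16 mapping a real implementation would need):

    // Minimal sketch: re-parses the entire document per block and highlights a
    // couple of node types. A real implementation would reuse/edit the TSTree
    // and map UTF-8 byte offsets to UTF-16 positions instead of assuming ASCII.
    #include <QSyntaxHighlighter>
    #include <QTextDocument>
    #include <QTextBlock>
    #include <QTextCharFormat>
    #include <QFont>
    #include <tree_sitter/api.h>

    extern "C" const TSLanguage *tree_sitter_cpp(void);  // provided by the grammar library

    class TreeSitterHighlighter : public QSyntaxHighlighter {
    public:
        explicit TreeSitterHighlighter(QTextDocument *doc) : QSyntaxHighlighter(doc) {
            m_parser = ts_parser_new();
            ts_parser_set_language(m_parser, tree_sitter_cpp());
        }
        ~TreeSitterHighlighter() override { ts_parser_delete(m_parser); }

    protected:
        void highlightBlock(const QString &) override {
            const QByteArray src = document()->toPlainText().toUtf8();
            TSTree *tree = ts_parser_parse_string(m_parser, nullptr, src.constData(), src.size());

            const int blockStart = currentBlock().position();
            const int blockEnd = blockStart + currentBlock().length();

            QTextCharFormat kw;  kw.setForeground(Qt::darkBlue);  kw.setFontWeight(QFont::Bold);
            QTextCharFormat lit; lit.setForeground(Qt::darkGreen);

            walk(ts_tree_root_node(tree), blockStart, blockEnd, kw, lit);
            ts_tree_delete(tree);
        }

    private:
        void walk(TSNode node, int blockStart, int blockEnd,
                  const QTextCharFormat &kw, const QTextCharFormat &lit) {
            const int start = int(ts_node_start_byte(node));  // byte == char only for ASCII
            const int end   = int(ts_node_end_byte(node));
            if (end < blockStart || start >= blockEnd)
                return;  // node (and all its children) lies outside this block

            const QByteArray type = ts_node_type(node);
            const int from  = qMax(start, blockStart) - blockStart;
            const int count = qMin(end, blockEnd) - qMax(start, blockStart);
            if (type == "primitive_type" || type == "type_identifier")
                setFormat(from, count, kw);
            else if (type == "string_literal" || type == "comment")
                setFormat(from, count, lit);

            for (uint32_t i = 0; i < ts_node_child_count(node); ++i)
                walk(ts_node_child(node, i), blockStart, blockEnd, kw, lit);
        }

        TSParser *m_parser = nullptr;
    };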
Thanks for adding detail! My prompts have been very in-the-bubble-of-Qt I'd say, less so about mashing together Qt and something else, which I agree is a good real-world test case.
I haven’t had the chance to try it out with R1 yet, but if you implement a debugger class that screenshots the widget/QML element, dumps its metadata like GammaRay, and includes the source, you can feed that context into Sonnet and o1. They are scarily good at identifying bugs and making modifications if you include all that context (although you have to be selective with what metadata you include; I usually just dump a few things like properties, bindings, signals, etc.).
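For anyone curious, something in this direction works as a bare-bones sketch (dumpWidgetContext is a made-up name; it only covers QWidget properties and signals via the meta-object system, and skips QML binding introspection, which is where GammaRay-style tooling comes in):

    // Bare-bones sketch of the idea (not GammaRay): screenshot a widget and dump
    // its meta-object data so both can be pasted into an LLM prompt with the source.
    #include <QWidget>
    #include <QPixmap>
    #include <QMetaObject>
    #include <QMetaProperty>
    #include <QMetaMethod>
    #include <QDebug>

    void dumpWidgetContext(QWidget *w, const QString &screenshotPath)  // hypothetical helper
    {
        // 1. Screenshot the widget as currently rendered.
        w->grab().save(screenshotPath);

        // 2. Dump every property and its current value.
        const QMetaObject *mo = w->metaObject();
        qInfo() << "class:" << mo->className();
        for (int i = 0; i < mo->propertyCount(); ++i) {
            const QMetaProperty p = mo->property(i);
            qInfo() << "property" << p.name() << "=" << p.read(w);
        }

        // 3. List the signals, so the model can reason about connections.
        for (int i = 0; i < mo->methodCount(); ++i) {
            const QMetaMethod m = mo->method(i);
            if (m.methodType() == QMetaMethod::Signal)
                qInfo() << "signal" << m.methodSignature();
        }
        // QML binding introspection is the hard part and is omitted here.
    }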
R1 is trained for a context length of 128K. Where are you getting 8K/32K? The model doesn't distinguish "thinking" tokens and "output" tokens, so this must be some specific API limitations.
> max_tokens: The maximum length of the final response after the CoT output is completed, defaulting to 4K, with a maximum of 8K. Note that the CoT output can reach up to 32K tokens, and the parameter to control the CoT length (reasoning_effort) will be available soon. [1]
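For concreteness, this is roughly what those parameters look like against the official OpenAI-compatible endpoint, as a hedged sketch only (assumes a DEEPSEEK_API_KEY environment variable and the deepseek-reasoner model; the CoT comes back in reasoning_content, separate from the max_tokens-capped content):

    // Hedged sketch of a request to DeepSeek's OpenAI-compatible API using Qt.
    // max_tokens caps only the final answer; the chain of thought is returned
    // separately in "reasoning_content".
    #include <QCoreApplication>
    #include <QNetworkAccessManager>
    #include <QNetworkRequest>
    #include <QNetworkReply>
    #include <QJsonDocument>
    #include <QJsonObject>
    #include <QJsonArray>
    #include <QUrl>
    #include <QDebug>

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);
        QNetworkAccessManager nam;

        QNetworkRequest req(QUrl("https://api.deepseek.com/chat/completions"));
        req.setHeader(QNetworkRequest::ContentTypeHeader, "application/json");
        req.setRawHeader("Authorization", "Bearer " + qgetenv("DEEPSEEK_API_KEY"));

        const QJsonObject body{
            {"model", "deepseek-reasoner"},
            {"max_tokens", 8192},  // documented cap on the final response, not the CoT
            {"messages", QJsonArray{QJsonObject{{"role", "user"}, {"content", "Explain QSyntaxHighlighter."}}}}
        };

        QNetworkReply *reply = nam.post(req, QJsonDocument(body).toJson());
        QObject::connect(reply, &QNetworkReply::finished, [&]() {
            const QJsonObject msg = QJsonDocument::fromJson(reply->readAll())
                .object()["choices"].toArray().at(0).toObject()["message"].toObject();
            qInfo().noquote() << "reasoning:" << msg["reasoning_content"].toString();
            qInfo().noquote() << "answer:" << msg["content"].toString();
            reply->deleteLater();
            app.quit();
        });
        return app.exec();
    }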
I’m using it through Kagi which doesn’t use Deepseek’s official API [1]. That limitation from the docs seems to be everywhere.
In practice I don’t think anyone can economically host the whole model plus the kv cache for the entire context size of 128k (and I’m skeptical of Deepseek’s claims now anyway).
Edit: a Kagi team member just said on Discord that they’ll be increasing max tokens next release
He's just repeating a lot of disinformation that has been released about deepseek in the last few days. People who took the time to test DeepSeek models know that the results have the same or better quality for coding tasks.
Benchmarks are great to have but individual/org experiences on specific codebases still matter tremendously.
If an org consistently finds one model performs worse on their corpus than another, they aren't going to keep using it because it ranks higher in some set of benchmarks.
But you should also be very wary of these kinds of anecdotes, and this thread highlights exactly why. That commenter says in another comment (https://news.ycombinator.com/item?id=42866350) that the token limitation he is complaining about actually has nothing to do with DeepSeek's model or their API, but is a consequence of an artificial limit that Kagi imposes. In other words, his conclusion about DeepSeek is completely unwarranted.
It mashed the header and C++ file together, which is egregiously bad in the context of Qt. This isn’t a new library; it’s been around for almost thirty years. Max token sizes have nothing to do with that.
I invite anyone to post a chat transcript showing a successful run of R1 against this prompt (and please tell me which API/service it came from so I can go use it too!)
I wasn't suggesting using the anecdotes of others to make a decision.
I'm talking about individuals and organizations making a decision on whether or not to use a model based on their own testing. That's what ultimately matters here.
It's not great at super-complex tasks due to limited context, but it's quite a good "junior intern that has memorized the Internet." Local deepseek-r1 on my laptop (M1 w/64GiB RAM) can answer about any question I can throw at it... as long as it's not something on China's censored list. :)
Thanks for saying this, I thought I was insane, DeepSeek is kinda bad. I guess it’s impressive all things considered but in absolute terms it’s not great.
I have run personal tests and the results are at least as good as I get from OpenAI. Smarter people have also reached the same conclusion. Of course you can find contrary datapoints, but it doesn't change the big picture.
To be fair, it's amazing by the standards of six months ago. The only models that beat it are o1, the latest gemini models and (for some things) sonnet 3.6
It’s definitely not all hype, it really is a breakthrough for open source reasoning models. I don’t mean to diminish their contribution, especially since being able to read the reasoning output is a very interesting new modality (for lack of a better word) for me as a developer.
It’s just not as impressive as people make it out to be. It might be better than o1 on Python or JavaScript that’s all over the training data, but o1 is overwhelmingly better at anything outside the happy path.