It’s not better. In most of my tests (C++/Qt code) it just runs out of context before it can really do anything. And the output is very bad - it mashes together the header and the .cpp file. The reasoning output is fun to look at and occasionally useful, though.
The max token output is only 8K (32K thinking tokens). O1 is 128k, which is far more useful, and it doesn’t get stuck like R1 does.
The hype around the DeepSeek release is insane and I’m starting to really doubt their numbers.
Is this a local run of one of the smaller models and/or other-models-distilled-with-r1, or are you using their Chat interface?
I've also compared o1 and (online-hosted) r1 on Qt/C++ code, being a KDE Plasma dev, and my impression so far was that the output is roughly on par. I've given both models some tricky tasks about dark corners of the meta-object system, crafting classes, etc., and they came up with generally the same sort of suggestions and implementations.
I do appreciate that "asking about gotchas with few definitive solutions, even if they require some perspective" and "rote day-to-day coding ops" are very different benchmarks due to how things are represented in the training data corpus, though.
I use it through Kagi Assistant which has the proper R1 model through Together.ai/Fireworks.ai
My standard test is to ask the model to write a QSyntaxHighlighter subclass that uses TreeSitter to implement syntax highlighting. O1 can do it after a few iterations, but R1’s output has been a mess. That said, its thought process revealed a few issues that I then fixed in my canonical implementation.
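Roughly the shape I'm looking for, as a minimal sketch rather than a working highlighter: it assumes the tree-sitter C API plus a tree_sitter_cpp grammar are linked in, naively re-parses the whole document on every block, and treats byte offsets as character offsets (i.e. it ignores the UTF-8/UTF-16 mapping a real implementation would need):

    // Minimal sketch: re-parses the entire document per block and highlights a
    // couple of node types. A real implementation would reuse/edit the TSTree
    // and map UTF-8 byte offsets to UTF-16 positions instead of assuming ASCII.
    #include <QSyntaxHighlighter>
    #include <QTextDocument>
    #include <QTextBlock>
    #include <QTextCharFormat>
    #include <QFont>
    #include <tree_sitter/api.h>

    extern "C" const TSLanguage *tree_sitter_cpp(void);  // provided by the grammar library

    class TreeSitterHighlighter : public QSyntaxHighlighter {
    public:
        explicit TreeSitterHighlighter(QTextDocument *doc) : QSyntaxHighlighter(doc) {
            m_parser = ts_parser_new();
            ts_parser_set_language(m_parser, tree_sitter_cpp());
        }
        ~TreeSitterHighlighter() override { ts_parser_delete(m_parser); }

    protected:
        void highlightBlock(const QString &) override {
            const QByteArray src = document()->toPlainText().toUtf8();
            TSTree *tree = ts_parser_parse_string(m_parser, nullptr, src.constData(), src.size());

            const int blockStart = currentBlock().position();
            const int blockEnd = blockStart + currentBlock().length();

            QTextCharFormat kw;  kw.setForeground(Qt::darkBlue);  kw.setFontWeight(QFont::Bold);
            QTextCharFormat lit; lit.setForeground(Qt::darkGreen);

            walk(ts_tree_root_node(tree), blockStart, blockEnd, kw, lit);
            ts_tree_delete(tree);
        }

    private:
        void walk(TSNode node, int blockStart, int blockEnd,
                  const QTextCharFormat &kw, const QTextCharFormat &lit) {
            const int start = int(ts_node_start_byte(node));  // byte == char only for ASCII
            const int end   = int(ts_node_end_byte(node));
            if (end < blockStart || start >= blockEnd)
                return;  // node (and all its children) lies outside this block

            const QByteArray type = ts_node_type(node);
            const int from  = qMax(start, blockStart) - blockStart;
            const int count = qMin(end, blockEnd) - qMax(start, blockStart);
            if (type == "primitive_type" || type == "type_identifier")
                setFormat(from, count, kw);
            else if (type == "string_literal" || type == "comment")
                setFormat(from, count, lit);

            for (uint32_t i = 0; i < ts_node_child_count(node); ++i)
                walk(ts_node_child(node, i), blockStart, blockEnd, kw, lit);
        }

        TSParser *m_parser = nullptr;
    };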
Thanks for adding detail! My prompts have been very in-the-bubble-of-Qt I'd say, less so about mashing together Qt and something else, which I agree is a good real-world test case.
I haven’t had the chance to try it out with R1 yet, but if you implement a debugger class that screenshots the widget/QML element, dumps its metadata like GammaRay, and includes the source, you can feed that context into Sonnet and o1. They are scarily good at identifying bugs and making modifications if you include all that context (although you have to be selective with what metadata you include; I usually just dump a few things like properties, bindings, signals, etc.).
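For anyone curious, something in this direction works as a bare-bones sketch (dumpWidgetContext is a made-up name; it only covers QWidget properties and signals via the meta-object system, and skips QML binding introspection, which is where GammaRay-style tooling comes in):

    // Bare-bones sketch of the idea (not GammaRay): screenshot a widget and dump
    // its meta-object data so both can be pasted into an LLM prompt with the source.
    #include <QWidget>
    #include <QPixmap>
    #include <QMetaObject>
    #include <QMetaProperty>
    #include <QMetaMethod>
    #include <QDebug>

    void dumpWidgetContext(QWidget *w, const QString &screenshotPath)  // hypothetical helper
    {
        // 1. Screenshot the widget as currently rendered.
        w->grab().save(screenshotPath);

        // 2. Dump every property and its current value.
        const QMetaObject *mo = w->metaObject();
        qInfo() << "class:" << mo->className();
        for (int i = 0; i < mo->propertyCount(); ++i) {
            const QMetaProperty p = mo->property(i);
            qInfo() << "property" << p.name() << "=" << p.read(w);
        }

        // 3. List the signals, so the model can reason about connections.
        for (int i = 0; i < mo->methodCount(); ++i) {
            const QMetaMethod m = mo->method(i);
            if (m.methodType() == QMetaMethod::Signal)
                qInfo() << "signal" << m.methodSignature();
        }
        // QML binding introspection is the hard part and is omitted here.
    }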
R1 is trained for a context length of 128K. Where are you getting 8K/32K? The model doesn't distinguish "thinking" tokens and "output" tokens, so this must be some specific API limitations.
> max_tokens: The maximum length of the final response after the CoT output is completed, defaulting to 4K, with a maximum of 8K. Note that the CoT output can reach up to 32K tokens, and the parameter to control the CoT length (reasoning_effort) will be available soon. [1]
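For concreteness, this is roughly what those parameters look like against the official OpenAI-compatible endpoint, as a hedged sketch only (assumes a DEEPSEEK_API_KEY environment variable and the deepseek-reasoner model; the CoT comes back in reasoning_content, separate from the max_tokens-capped content):

    // Hedged sketch of a request to DeepSeek's OpenAI-compatible API using Qt.
    // max_tokens caps only the final answer; the chain of thought is returned
    // separately in "reasoning_content".
    #include <QCoreApplication>
    #include <QNetworkAccessManager>
    #include <QNetworkRequest>
    #include <QNetworkReply>
    #include <QJsonDocument>
    #include <QJsonObject>
    #include <QJsonArray>
    #include <QUrl>
    #include <QDebug>

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);
        QNetworkAccessManager nam;

        QNetworkRequest req(QUrl("https://api.deepseek.com/chat/completions"));
        req.setHeader(QNetworkRequest::ContentTypeHeader, "application/json");
        req.setRawHeader("Authorization", "Bearer " + qgetenv("DEEPSEEK_API_KEY"));

        const QJsonObject body{
            {"model", "deepseek-reasoner"},
            {"max_tokens", 8192},  // documented cap on the final response, not the CoT
            {"messages", QJsonArray{QJsonObject{{"role", "user"}, {"content", "Explain QSyntaxHighlighter."}}}}
        };

        QNetworkReply *reply = nam.post(req, QJsonDocument(body).toJson());
        QObject::connect(reply, &QNetworkReply::finished, [&]() {
            const QJsonObject msg = QJsonDocument::fromJson(reply->readAll())
                .object()["choices"].toArray().at(0).toObject()["message"].toObject();
            qInfo().noquote() << "reasoning:" << msg["reasoning_content"].toString();
            qInfo().noquote() << "answer:" << msg["content"].toString();
            reply->deleteLater();
            app.quit();
        });
        return app.exec();
    }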
I’m using it through Kagi which doesn’t use Deepseek’s official API [1]. That limitation from the docs seems to be everywhere.
In practice I don’t think anyone can economically host the whole model plus the kv cache for the entire context size of 128k (and I’m skeptical of Deepseek’s claims now anyway).
Edit: a Kagi team member just said on Discord that they’ll be increasing max tokens next release
He's just repeating a lot of disinformation that has been released about deepseek in the last few days. People who took the time to test DeepSeek models know that the results have the same or better quality for coding tasks.
Benchmarks are great to have but individual/org experiences on specific codebases still matter tremendously.
If an org consistently finds one model performs worse on their corpus than another, they aren't going to keep using it because it ranks higher in some set of benchmarks.
But you should also be very wary of these kinds of anecdotes, and this thread highlights exactly why. That commenter says in another comment (https://news.ycombinator.com/item?id=42866350) that the token limitation he is complaining about actually has nothing to do with DeepSeek's model or their API, but is a consequence of an artificial limit that Kagi imposes. In other words, his conclusion about DeepSeek is completely unwarranted.
It mashed the header and C++ file together, which is egregiously bad in the context of Qt. This isn’t a new library; it’s been around for almost thirty years. Max token sizes have nothing to do with that.
I invite anyone to post a chat transcript showing a successful run of R1 against this prompt (and please tell me which API/service it came from so I can go use it too!)
I wasn't suggesting using the anecdotes of others to make a decision.
I'm talking about individuals and organizations making a decision on whether or not to use a model based on their own testing. That's what ultimately matters here.
It's not great at super-complex tasks due to limited context, but it's quite a good "junior intern that has memorized the Internet." Local deepseek-r1 on my laptop (M1 w/64GiB RAM) can answer about any question I can throw at it... as long as it's not something on China's censored list. :)
Thanks for saying this, I thought I was insane, DeepSeek is kinda bad. I guess it’s impressive all things considered but in absolute terms it’s not great.
I have run personal tests and the results are at least as good as I get from OpenAI. Smarter people have also reached the same conclusion. Of course you can find contrary datapoints, but it doesn't change the big picture.
To be fair, it's amazing by the standards of six months ago. The only models that beat it are o1, the latest gemini models and (for some things) sonnet 3.6
It’s definitely not all hype, it really is a breakthrough for open source reasoning models. I don’t mean to diminish their contribution, especially since being able to read the reasoning output is a very interesting new modality (for lack of a better word) for me as a developer.
It’s just not as impressive as people make it out to be. It might be better than o1 on Python or JavaScript that’s all over the training data, but o1 is overwhelmingly better at anything outside the happy path.