Hacker News

I'm also kind of new to this, coming from coding with ChatGPT. Isn't the time to first token important? He is sitting there for minutes waiting for a response. Shouldn't that be a concern?



I'd rather wait to get a good response, than get a quick response that is much less useful, and it's the nature of these "reasoning" models that they reason before responding.

Yesterday I was comparing DeepSeek-R1 (NVIDIA-hosted version) with both Sonnet 3.5 (regarded by many as the most capable coder) and the new Gemini 2.0 Flash, and the wait was worth it. I was trying to get all three to create a web page with a horizontally scrolling timeline with associated clickable photos...

Gemini got to about 90% success after half a dozen prompts, after which it became a frustrating game of whack-a-mole trying to get it to fix the remaining 10% without introducing new bugs - I gave up after ~30min. Sonnet 3.5 looked promising at first, generating based on a sketch I gave it, but also only got to 90%, then hit daily usage limit after a few attempts to complete the task.

DeepSeek-R1 took a while to generate it, but nailed it on the first attempt.


Interesting. So in my use, I rarely see GPT get it right on the first pass, but that's mostly due to interpretation of the question. I'm ruling out the times when it hallucinates calls to functions that don't exist.

Let's say I ask for some function that calculates some matrix math in Python. It will spit out something, but I don't like what it did. So I will say: now don't use any calls to that library you pulled in, and also allow for these types of inputs. Add exception handling...
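To make the kind of request concrete: a hypothetical sketch of what the final iteration might look like (this is illustrative, not anything a model actually produced) — pure-Python matrix multiplication with no library imports, flexible list-of-lists inputs, and exception handling:

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of lists, no numpy.

    Accepts rows of ints or floats; raises ValueError on empty input,
    ragged rows, shape mismatch, or non-numeric entries.
    """
    if not a or not b or not a[0] or not b[0]:
        raise ValueError("matrices must be non-empty")
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    if any(len(r) != cols_a for r in a) or any(len(r) != cols_b for r in b):
        raise ValueError("ragged matrix: rows have unequal lengths")
    if cols_a != rows_b:
        raise ValueError(f"shape mismatch: {cols_a} columns vs {rows_b} rows")
    try:
        return [[sum(a[i][k] * b[k][j] for k in range(cols_a))
                 for j in range(cols_b)]
                for i in range(rows_a)]
    except TypeError as e:
        raise ValueError("non-numeric entry in matrix") from e
```

Getting from a first library-based draft to something like this typically takes a few conversational turns, which is why latency per turn adds up.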

So response time is important, since it's a conversation, no matter how correct the response is.

When you say DeepSeek "nailed it on the first attempt," do you mean it was without bugs? Or do you mean it worked how you imagined? Or what exactly?


DeepSeek-R1 generated a working web page on the first attempt, based on a single brief prompt I gave it.

With Sonnet 3.5, given the same brief prompt I gave DeepSeek-R1, it took a half dozen feedback steps to get to 90%. Trying a hand-drawn sketch as input to Sonnet instead was quicker: an impressive first attempt, but iterative attempts to fix it failed before I hit the usage limit. Gemini was the slowest to work with, and took a lot of feedback to get to the "almost there" stage, after which it floundered.

The AI companies seem to want to move in the direction of autonomous agents (with reasoning) that you hand a task off to, which they'll work on while you do something else. I guess that'd be useful if they are close to human level and can make meaningful progress without feedback, and I suppose today's slow-responding reasoning models can be seen as a step in that direction.

I think I'd personally prefer something fast enough responding to use as a capable "pair programmer", rather than an autonomous agent trying to be an independent team member (at least until the AI gets MUCH better), but in either case being able to do what's being asked is what matters. If the fast/interactive AI only gets me 90% complete (then wastes my time floundering until I figure out it's just not capable of the task), then the slower but more capable model seems preferable as long as it's significantly better.


The alternative isn't to use a weaker model, the alternative is to solve the problem myself. These are all very academically interesting, but they don't usually save any time. On the other hand, the other day I had a math problem I asked o1 for help with, and it was barely worth it. I realized my problem at the exact moment it gave me the correct answer. I say that because these high-end reasoning models are getting better. "Barely useful" is a huge deal and it seems like we are hitting the inflection point where expensive models are starting to be consistently useful.


Yes, it seems we've only recently passed the point where these models were extremely impressive but still not good enough to really be useful; they're now actual time savers for quite a few everyday tasks.

The AI companies seem to be pushing AI-assisted software development as an early use case, but I've always thought this is one of the more difficult things for them to become good at, since many/most development tasks require both advanced reasoning (which they are weak at) and ability to learn from experience (which they just can't do). The everyday, non-development tasks, like "take this photo of my credit card bill and give me category subtotals" are where the models are now actually useful, but software development still seems to be an area where they are highly impressive but ultimately not capable enough to be useful outside of certain narrow use cases. That said, it'll be interesting to see how good these reasoning models can get, but I think that things like inability to learn (other than in-context) put a hard limit on what this type of pre-trained LLM tech will be useful for.




