In a similar situation at my workplace.

What models are you using that you feel comfortable trusting to understand and operate on 10-20k LOC?

Using the latest and greatest from OpenAI, I've seen output become unreliable with as little as ~300 LOC on a pretty simple personal project. It will drop features as new ones are added, make obvious mistakes, refuse to follow instructions no matter how many different ways I try to tell it to fix a bug, etc.

Tried taking those 300 LOC (generated by o3-mini-high) to Cursor and didn't fare much better with the variety of models it offers.

I haven't tried OpenAI's APIs yet - I think I read that they accommodate quite a bit more context than the web interface.
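For what it's worth, here's a minimal sketch of what sending a whole file as context through the Python SDK might look like (the model name, file path, and "feature X" prompt are placeholders I made up, and I haven't verified the per-model context limits myself):

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Send the entire file as context; the API reportedly accepts far
    # more tokens per request than the web chat UI exposes.
    with open("project.py") as f:
        source = f.read()

    response = client.chat.completions.create(
        model="o3-mini",  # placeholder; use whichever model you have access to
        messages=[
            {"role": "system", "content": "You are a careful code reviewer."},
            {"role": "user",
             "content": f"Here is my project:\n\n{source}\n\nFix the bug in feature X."},
        ],
    )
    print(response.choices[0].message.content)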

I do find OpenAI's web-based offerings extremely useful for generating short 50-200 LOC support scripts, generating boilerplate, creating short single-purpose functions, etc.

Anything beyond this just hasn't worked all that well for me. Maybe I just need better or different tools though?



I usually use Claude 3.5 Sonnet, since it's still the one I've had the best luck with for coding tasks.

When it comes to 10k LOC codebases, I still don't really trust it with anything. My best luck has been small personal projects, where I can sort of trust it to make larger-scale changes, but "larger scale" there is still small in absolute terms.

I've found it best for generating tests and for autocompletion. Especially if you give it context via function names and parameter names, it can often complete a whole function I was about to write, using the interfaces available to it in files I've visited recently.
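To make that concrete, here's a made-up example (the Invoice type and the function are hypothetical, just to illustrate the kind of stub that works): names this descriptive are often enough for the completion to write the whole body on its own.

    from dataclasses import dataclass

    @dataclass
    class Invoice:
        customer_id: str
        amount_cents: int
        paid: bool

    # Type just the signature; with names this descriptive, the
    # completion usually fills in a body like this unprompted.
    def total_unpaid_cents_by_customer(invoices: list[Invoice]) -> dict[str, int]:
        totals: dict[str, int] = {}
        for inv in invoices:
            if not inv.paid:
                totals[inv.customer_id] = totals.get(inv.customer_id, 0) + inv.amount_cents
        return totals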

But besides that, I don't really use it for much outside of starting a new feature from scratch, or helping me put a plan together before I start working on something I may be unfamiliar with.

We have access to all the models available through Copilot, including o3 and o1, plus ChatGPT Enterprise, and I do find the chat interface nice for architecting and planning. But I usually do the actual coding with help from autocompletion, since it honestly takes longer to wrangle the model into doing the correct thing than to do it myself with a little bit of its help.


This makes sense. I've mostly been successful doing these sorts of things as well and really appreciate the way it saves me some typing (even in cases where I only keep 40-80% of what it writes, this is still a huge savings).

It's when I try to give it a clear, logical specification for a full feature and expect it to write everything required to deliver that feature (or the entirety of a slightly-more-than-trivial personal project) that it falls over.

I've experimented with getting it to do this (for features or personal projects that require maybe 200-400 LOC), mostly just to see where the tool's limitations are.

Interestingly, I hit a wall with GPT-4 on a ~300 LOC personal project that o3-mini-high was able to overcome. So, as you'd expect, the models are getting better. Pushing my use case only a little further with a few more enhancements, however, o3-mini-high fell over in precisely the same ways as GPT-4, only with a greater volume and severity of errors.

The improvement from GPT-4 to o3-mini-high felt incremental (which, to be fair, is about what they claim it offers).

Just to say: having seen similarly small bumps in capability over the last few years of model releases, I tend to agree with other posters that we'll need something revolutionary to deliver on a lot of the hype being sold at the moment. I don't think current LLMs and approaches are going to cut it.



