In a similar situation at my workplace.

What models are you using that you feel comfortable trusting to understand and operate on 10-20k LOC?

Using the latest and greatest from OpenAI, I've seen output become unreliable with as little as ~300 LOC on a pretty simple personal project. It will drop features as new ones are added, make obvious mistakes, refuse to follow instructions no matter how many different ways I try to tell it to fix a bug, etc.

Tried taking those 300 LOC (generated by o3-mini-high) to Cursor and didn't fare much better with the variety of models it offers.

I haven't tried OpenAI's APIs yet - I think I read that they accommodate quite a bit more context than the web interface.
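For what it's worth, here's a minimal sketch of what sending a whole file as context through the Python SDK might look like (the model name, file path, and "feature X" prompt are placeholders I made up, and I haven't verified the per-model context limits myself):

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Send the entire file as context; the API reportedly accepts far
    # more tokens per request than the web chat UI exposes.
    with open("project.py") as f:
        source = f.read()

    response = client.chat.completions.create(
        model="o3-mini",  # placeholder; use whichever model you have access to
        messages=[
            {"role": "system", "content": "You are a careful code reviewer."},
            {"role": "user",
             "content": f"Here is my project:\n\n{source}\n\nFix the bug in feature X."},
        ],
    )
    print(response.choices[0].message.content)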

I do find OpenAI's web-based offerings extremely useful for generating short 50-200 LOC support scripts, generating boilerplate, creating short single-purpose functions, etc.

Anything beyond this just hasn't worked all that well for me. Maybe I just need better or different tools though?



I usually use Claude 3.5 Sonnet, since it's still the one I've had the best luck with for coding tasks.

When it comes to 10k LOC codebases, I still don't really trust it with anything. My best luck has been small personal projects, where I can sort of trust it to make larger-scale changes, but "larger scale" there is still small in absolute terms.

I've found it best for generating tests and for autocompletion. Especially if you give it context via function names and parameter names, it can often complete a whole function I was about to write, using the interfaces available to it in files I've visited recently.
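To make that concrete, here's a made-up example (the Invoice type and the function are hypothetical, just to illustrate the kind of stub that works): names this descriptive are often enough for the completion to write the whole body on its own.

    from dataclasses import dataclass

    @dataclass
    class Invoice:
        customer_id: str
        amount_cents: int
        paid: bool

    # Type just the signature; with names this descriptive, the
    # completion usually fills in a body like this unprompted.
    def total_unpaid_cents_by_customer(invoices: list[Invoice]) -> dict[str, int]:
        totals: dict[str, int] = {}
        for inv in invoices:
            if not inv.paid:
                totals[inv.customer_id] = totals.get(inv.customer_id, 0) + inv.amount_cents
        return totals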

But besides that, I don't really use it for much outside of starting a new feature from scratch, or helping me put a plan together before I start working on something I may be unfamiliar with.

We have access to all the models available through Copilot, including o3 and o1, plus ChatGPT Enterprise, and I do find the chat interface nice for architecting and planning. But I usually do the actual coding with help from autocompletion, since it honestly takes longer to wrangle the model into doing the correct thing than to do it myself with a little bit of its help.


This makes sense. I've mostly been successful doing these sorts of things as well and really appreciate the way it saves me some typing (even in cases where I only keep 40-80% of what it writes, this is still a huge savings).

It's when I try to give it a clear, logical specification for a full feature and expect it to write everything required to deliver that feature (or the entirety of a slightly-more-than-trivial personal project) that it falls over.

I've experimented with getting it to do this (for features or personal projects that require maybe 200-400 LOC), mostly just to see where the tool's limitations are.

Interestingly, I hit a wall with GPT-4 on a ~300 LOC personal project that o3-mini-high was able to overcome. So, as you'd expect, the models are getting better. Pushing my use case only a little further with a few more enhancements, however, o3-mini-high fell over in precisely the same ways as GPT-4, only with a greater volume and severity of errors.

The improvement from GPT-4 to o3-mini-high felt incremental (which, to be fair, is about what they claim it offers).

Just to say: having seen similarly small bumps in capability over the last few years of model releases, I tend to agree with other posters that we'll need something revolutionary to deliver on a lot of the hype being sold at the moment. I don't think current LLMs and approaches are going to cut it.



