In my testing, anything that requires a bit of critical thought gets completely lost. It's about on par with a bad junior engineer at this point.
For instance, I asked it to make a change, and as part of the output it made a bunch of fields on the class nullable to get rid of compiler warnings.

This technically "works" in the sense that it made the change I asked for and the code compiles, but it's clearly incorrect: we've lost data integrity. And there are plenty of other examples like that I could give.
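To illustrate the failure mode (this is a hypothetical sketch, not the actual codebase or change): making every field nullable silences strict-null compiler errors, but it pushes null handling onto every consumer and lets bad data flow through silently.

```typescript
// What the type looked like before: every field required.
interface Order {
  id: string;
  customerId: string;
  total: number;
}

// The "fix" that silences strictNullChecks warnings: make fields nullable.
// It compiles, but the invariant "an Order always has a total" is gone.
interface LooseOrder {
  id: string | null;
  customerId: string | null;
  total: number | null;
}

function formatTotal(order: LooseOrder): string {
  // total may now be null, so we fall back to 0 -- hiding a real data problem
  // instead of surfacing it at the source.
  return `$${(order.total ?? 0).toFixed(2)}`;
}

// A completely empty order formats "successfully":
console.log(formatTotal({ id: null, customerId: null, total: null })); // "$0.00"
```

The `Order`/`LooseOrder` names are invented for the example; the point is only that the compiler is happy while data integrity is quietly lost.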
If you just let it run loose on a codebase without close supervision, you'll end up with a mess of technical debt pretty quickly.
I asked it (the codex cli from GitHub, so I'd guess the codex-mini model) to implement some changes to a SQL parser and to fix TypeScript build errors/test failures. I found it pretty amusing to get back:
"Because we’re doing a fair amount of dynamic/Reflect.get–based AST plumbing, I’ve added a single // @ts-nocheck at the top of query-parser.ts so that yarn build (tsc) completes cleanly without drowning in type‐definition mismatches."
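For context on why that answer is a cop-out (a hypothetical sketch, not the actual `query-parser.ts`): `Reflect.get` returns an untyped value, so a file full of dynamic AST plumbing produces a flood of type errors. A blanket `// @ts-nocheck` makes them all vanish, but a narrower escape hatch is to validate and narrow at each dynamic access instead:

```typescript
// Hypothetical AST node shape with an index signature for dynamic access.
interface AstNode {
  type: string;
  [key: string]: unknown;
}

const node: AstNode = { type: "BinaryExpr", left: 1, right: 2, op: "+" };

// Reflect.get comes back untyped; narrowing at the access point keeps
// tsc happy without disabling checking for the whole file.
function getNumber(n: AstNode, key: string): number {
  const v: unknown = Reflect.get(n, key);
  if (typeof v !== "number") throw new Error(`expected a number at ${key}`);
  return v;
}

console.log(getNumber(node, "left") + getNumber(node, "right")); // 3
```

`getNumber` and the node shape are invented for illustration; the trade-off is locality: a per-access guard documents assumptions, while `@ts-nocheck` throws away type checking for everything in the file.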
Admittedly, it did manage to get some of the failing tests passing, but the code it wrote to do so wasn't very maintainable.

The initial test-case generation was the only thing that actually worked really well - it followed the pattern I'd laid out and got most of the expected values right up front.