I started fully coding with Claude Code. It's not just vibe coding, but rather AI-assisted coding. I've noticed a considerable decrease in my understanding of the whole codebase, even though I'm the only one who has worked on it for 2 years. I'm struggling to answer my colleagues' questions.
I am not arguing that we should drop AI, but we should really measure its effects and act accordingly. There's more to this than just getting more productivity.
This is the chief reason I don't use integrations. I just use chat, because I want to physically understand and insert the code myself. Otherwise you end up with the code outpacing your understanding of it.
Yes. I'm happy to have a sometimes-wrong expert to hand. Sometimes it provides just what I need; sometimes, as with a human (who is also fallible), it helps spur my own thinking along, clarify, converge on a solution, think laterally, or deliver other productivity-boosting effects.
I’m experiencing something similar. We have a codebase of about 150k lines of backend code. On one hand, I feel significantly more productive - perhaps 400% more efficient when it comes to actually writing code. I can iterate on the same feature multiple times, refining it until it’s perfect.
However, the challenge has shifted to code review. I now spend the vast majority of my time reading code rather than writing it. You really need to build strong code-reading muscles. My process has become: read, scrap it, rewrite it, read again… and repeat until it’s done. This approach produces good results for me.
The issue is that not everyone has the same discipline to produce well-crafted code when using AI assistance. Many developers are satisfied once the code simply works. Since I review everything manually, I often discover issues that weren’t even mentioned. During reviews, I try to visualize the entire codebase and internalize everything to maintain a comprehensive understanding of the system’s scope.
I'm very surprised you find this workflow more efficient than just writing the code. I find constructing the mental model of the solution and how it fits into existing system and codebase to be 90% of effort, then actually writing the code is 10%. Admittedly, I don't have to write any boilerplate due to the problem domain and tech choices. Coding agents definitely help with the last 10% and also all the adjacent work - one-off scripts where I don't care about code quality.
I doubt it actually is. All the extra effort it takes to make the AI do something useful on non-trivial tasks is going to end up being a wash in terms of productivity, if not a net negative. But it feels more productive because of how fast the AI can iterate.
And you get to pay some big corporation for the privilege.
> Many developers are satisfied once the code simply works.
In the general case, the only way to convince oneself that the code truly works is to reason through it, as testing only tests particular data points for particular properties. Hence, “simply works” is more like “appears to work for the cases I tried out”.
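A toy illustration (a made-up example, not from the thread): a function can pass every test you happened to write and still be wrong in general.

    def is_leap_year(year: int) -> bool:
        # "Simply works" for the cases tried below.
        return year % 4 == 0

    # The particular data points we tested all pass...
    assert is_leap_year(2024)
    assert not is_leap_year(2023)
    assert is_leap_year(2000)

    # ...but reasoning through the actual rule (century years must also be
    # divisible by 400) shows the function is wrong in general:
    print(is_leap_year(1900))  # prints True, but 1900 was not a leap year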
I wrote a couple of Python scripts this week to help me with a MIDI integration project (3 devices with different cable types) and for quick debugging if something fails (yes, I know there are tools out there that do this, but I like learning).
I could have used an LLM to assist, but then I wouldn't have learned much.
But I did use an LLM to make a management wrapper that presents a menu of options (CLI right now) and calls the scripts. That probably saved me an hour, easily.
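Roughly the shape of such a wrapper, as a minimal sketch (not my actual code; the script names are made up for illustration):

    #!/usr/bin/env python3
    """Minimal menu wrapper that shells out to standalone helper scripts."""
    import subprocess
    import sys

    # Hypothetical helper scripts; substitute your own.
    SCRIPTS = {
        "1": ("List connected MIDI devices", ["python3", "list_devices.py"]),
        "2": ("Route device A -> device B", ["python3", "route_midi.py"]),
        "3": ("Debug a failing connection", ["python3", "debug_midi.py"]),
    }

    def main() -> None:
        while True:
            print("\nMIDI toolbox")
            for key, (label, _) in SCRIPTS.items():
                print(f"  {key}) {label}")
            print("  q) quit")
            choice = input("> ").strip().lower()
            if choice == "q":
                sys.exit(0)
            if choice in SCRIPTS:
                subprocess.run(SCRIPTS[choice][1])  # run the chosen script
            else:
                print("Unknown option")

    if __name__ == "__main__":
        main()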
That’s my comfort level for anything even remotely “complicated”.
I keep wanting to go back to using Claude Code, but I get worried about this issue. How best to use it to complement you, without it rewriting everything behind the scenes? What's the best protocol? Constant commit requests and reviews?
Yesterday, I was asked to scrape data from a website. My friend used ChatGPT to scrape the data but didn't succeed, even after spending 3+ hours. I looked at the website's code, understood it with my web knowledge, and did some research with an LLM. Then I described to the LLM how to scrape the data, and it took 30 minutes overall. The LLM can't come up with the best approach on its own, but you can build it by using the LLM. It's the same everywhere: at the end of the day you need someone who can really think.
LLMs can do anything, but the decision tree for what you can do in life is almost infinite. LLMs still need a coherent designer to make progress towards a goal.
It is not that easy; there is lazy loading on the page that is triggered by scrolling specific sections. You need to find a clever way: there's no way to scrape it with bs4 alone, and it's tough even with Selenium.
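For what it's worth, the usual workaround is to drive a real browser, scroll each lazy section into view so its content loads, and only then hand the rendered HTML to bs4. A minimal sketch; the URL and CSS selectors are placeholders:

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/catalog")  # placeholder URL

    # Scroll each lazily loaded section into view so its content gets fetched.
    for section in driver.find_elements(By.CSS_SELECTOR, "section.lazy"):  # assumed selector
        driver.execute_script("arguments[0].scrollIntoView();", section)
        time.sleep(2)  # crude wait; a WebDriverWait on the loaded content is better

    # Parse the fully rendered page with bs4.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    items = soup.select("div.item")  # assumed selector
    print(len(items))
    driver.quit()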
Before last year we didn't have reasoning. It came with Quiet-STaR, then we got it in the form of o1, and then it became practical with DeepSeek's paper in January.
So we're only about a year since the last big breakthrough.
I think we got a second big breakthrough with Google's results on the IMO problems.
For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.
The IMO result is not a breakthrough; if you craft proper prompts you can excel at the IMO with 2.5 Pro. Paper: https://arxiv.org/abs/2507.15855. Google just threw its whole computational power at it, along with very high-quality data. It was test-time scaling. Why didn't it solve problem 6 as well?
Yes, it was a breakthrough, but it saturated quickly. Wait for the next breakthrough. If they can build self-adapting weights into LLMs we can talk about different things, but test-time scaling is coming to an end as hallucination rates increase. No sign of AGI.
It wasn't long ago that test-time scaling wasn't possible. Test-time scaling is a core part of what makes this a breakthrough.
I don't believe your assessment, though. The IMO is hard, and Google has said that they use search and some way of combining different reasoning traces. I haven't read that paper yet, and of course it may support your view, but I just don't believe it.
We are not close to solving IMO with publicly known methods.
Test-time scaling is based on methods from pre-2020. If you look at the details of modern LLMs, there's a pretty small probability of encountering a method from 2020 or later (RoPE, GRPO). I am not saying the IMO result is not impressive, but it is not a breakthrough; if they said they used a different paradigm than test-time scaling, I would call it a breakthrough.
> We are not close to solving IMO with publicly known methods.
The point here is not the method but rather the computational power. You can solve any verifiable task with enough computation; of course there must be tweaks to the methods, but I don't think it is anything very big and different. OpenAI just asserted they solved it with a breakthrough.
Wait for self-adapting LLMs. We will see them within 2 years at most; all the big tech companies are focusing on that now, I think.
Layman's perspective: we had hints of reasoning from the initial release of ChatGPT when people figured out you could prompt "think step by step" to drastically increase problem solving performance. Then yeah a year+ later it was cleverly incorporated into model training.
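For anyone who missed that era, the trick was literally just appending the phrase to the prompt. A minimal sketch using the OpenAI Python SDK (the model name is only an example):

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    question = (
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
        "the ball. How much does the ball cost?"
    )

    # Compare the plain prompt against the "think step by step" variant.
    for suffix in ("", " Let's think step by step."):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{"role": "user", "content": question + suffix}],
        )
        print(f"--- suffix: {suffix!r}")
        print(resp.choices[0].message.content)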
We still don't have reasoning. We have synthetic text extrusion machines priming themselves to output text that looks a certain way by first generating some extra text that gets piped back into their own input for a second round.
It's sometimes useful, it seems. But when and why it helps is unclear and understudied, and the text produced in the "reasoning trace" doesn't necessarily correspond to or predict the text produced in the main response (which, of course, actual reasoning would).
Boosters will often retreat to "I don't care if the thing actually thinks", but the whole industry is trading on anthropomorphic notions like "intelligence", "reasoning", "thinking", "expertise", even "hallucination", etc., in order to drive the engine of the hype train.
The massive amounts of capital wouldn't be here without all that.
I think this is more an effect of releasing a model every other month with gradual improvements. If there were no o-series or other thinking models on the market, people would be shocked by this upgrade. The only way to keep up with the market is to release improvements ASAP.
I don't agree; the only thing that would shock me about this model is if it didn't hallucinate.
I think the actual effect of releasing more models every month has been to confuse people into thinking progress is actually happening. Despite claims of exponentially improved performance and the ability to replace PhDs, doctors, and lawyers, it still routinely can't be trusted, just like the original ChatGPT, despite years of effort.
This is a very odd perspective. As someone who uses LLMs for coding/PRs, every time a new model was released my personal experience was that it was a very solid improvement on the previous generation and not just meant to "confuse". The jump from raw GPT-4 two years ago to full o3 is so unbelievable that if you had traveled back in time and shown me, I wouldn't have thought such technology would exist for 5+ years.
To the point on hallucination: that's just the nature of LLMs (and humans, to some extent). Without new architectures or fact-checking world models in place, I don't think that problem will be solved anytime soon. But it seems GPT-5's main selling point is that they somehow reduced the hallucination rate by a lot, plus search helps with grounding.
I notice you don't bring any examples despite claiming the improvements are frequent and solid. It's likely because the improvements are actually hard to define and quantify. Which is why throughout this period of LLM development, there has been such an emphasis on synthetic benchmarks (which tell us nothing), rather than actual capabilities and real world results.
I didn't bring examples because I said it was personal experience. Here's my "evidence": GPT-4 took multiple shots and iterations and couldn't stay coherent with a prompt longer than 20k tokens (in my experience). Then when GPT-4o came out, it improved on that (in my experience). o1 took 1-2 shots with fewer iterations (in my experience). o3 zero-shots most of the tasks I throw at it and stays coherent with very long prompts (in my experience).
Here's something else to think about: try telling everybody to go back to using GPT-4. Then try telling people to go back to using full o1. You likely won't find any takers. It's almost like the newer models are improved and generally more useful.
I'm not saying they're not delivering better incremental results for people for specific tasks, I'm saying they're not improving as a technology in the way big tech is selling.
The technology itself is not really improving because all of the showstopping downsides from day one are still there: Hallucinations. Limited context window. Expensive to operate and train. Inability to recall simple information, inability to stay on task, support its output, or do long term planning. They don't self-improve or learn from their mistakes. They are credulous to a fault. There's been little progress on putting guardrails on them.
Little progress especially on the ethical questions that surround them, which seem to have gone out the window with all the dollar signs floating around. They've put waaaay more effort into the commoditization front. 0 concern for the impact of releasing these products to the world, 100% concern about how to make the most money off of them. These LLMs are becoming more than the model, they're now a full "service" with all the bullshit that entails like subscriptions, plans, limits, throttling, etc. The enshittification is firmly afoot.
Not to offend, but it sounds like your response/worries are based more on an emotional reaction. And rightly so: this is by all means a very scary and uncertain time, and undeniably these companies have not taken into account the impact their products will cause and the safety surrounding that.
However, a lot of your claims are false; progress is being made in nearly all the areas you mentioned:
"You can use these filters to adjust what's appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as Dangerous due to the nature of the game. In addition to the adjustable safety filters, the Gemini API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted."
Now I'd like to ask you for evidence that none of these aspects have been improved, since you claim my examples are vague but make statements like:
> Inability to recall simple information
> inability to stay on task
> (doesn't) support its output
> (no) long term planning
I've experienced the exact opposite. Not 100% of the time, but compared to GPT-4 all of these areas have been massively improved. Sorry I can't provide every single chat log I've ever had with these models to satisfy your vagueness-o-meter, or provide benchmarks which I assume you will brush aside.
And that's on top of the examples I've provided above. You seem to be making claims out of thin air and then claiming others are not providing examples up to your standard.
Big claims of PRs and shipped code, then links to people who are financially invested in the hype.
Not saying things are not getting better, but I have found that the claims of amazing results come from people who are not expert enough in the given domain to judge the actual quality of the output.
I love vibing out Rust, and it compiles and runs, but I have no idea if it is good Rust because, well, I barely understand Rust.
> now id like to ask you for evidence that none of these aspects have been improved
You're arguing against a strawman. I'm not saying there haven't been incremental improvements for the benchmarks they're targeting. I've said that several times now. I'm sure you're seeing improvements in the tasks you're doing.
But for me to say that there is more than a shell game going on, I will have to see tools that do not hallucinate. A (claimed; who knows if that's right, they can't even get the physics questions or the charts right) 65% reduction is helpful but doesn't make these things useful tools in the way they're claiming.
> sorry i cant provide every single chat log ive ever had with these models to satisfy your vagueness-o-meter
I'm not asking for all of them, you didn't even share one!
Like I said, despite all the advances touted in the breathless press releases, the brand-new model is just one bad roll away from behaving like the models from 3 years ago, and until that isn't the case, I'll continue to believe that the technology has hit a wall.
If it can't do this after how many years, then how is it supposed to be the smartest person I know in my pocket? How am I supposed to trust it, and build a foundation on it?
Interesting thread. I think the key issue around hallucinations is analogous to compilers: in order for the output to be implicitly trusted, it has to be as stable as a compiler. Hallucinations mean I cannot YOLO-trust the output. Having to manually scan the code for issues defeats the fundamental benefit.
Compilers were not and are not always perfect, but I think AI has a long way to go before it passes that threshold. People act like it will in the next few years, which the current trajectory strongly suggests is not the case.
I'll leave it at this: if "zero-hallucination omniscience" is your bar, you'll stay disappointed, and that's on your expectations, not the tech. Personally, I've been coding and researching faster and with fewer retries every time a new model drops, so my opinion is based on experience. You're free to sit out the upgrade cycle.
You don't remember DeepSeek introducing reasoning and blowing the benchmarks led by private American companies out of the water? With an API that was way cheaper? And then offering the model for free in a chat-based system online? And you were a big fan?
Isn't the fact that it produced similar performance about 70x more cheaply a breakthrough? In the same way that the Hall-Héroult process was a breakthrough. Not like we didn't have aluminum before 1886.
I think the LLM wall was hit a while ago, and the jumps have been around finessing LLMs in novel ways for better results. But the core is still very much the same as it has been for a while.
The crypto-level hype claims are all BS and we all knew that, but I do use an LLM more than Google now, which is the "there" there, so to speak.
This does feel like a flatlining of hype, though, which is great, because I don't know if I could take the AI hype train for much longer.
It's seemed that way for the last year. The only real improvements have been in the chat apps themselves (internet access, function calling). Until AI gets past the pre-training problem, it'll stagnate.
It is easier to get from 0% accurate to 99% accurate, than it is to get from 99% accurate to 99.9% accurate.
This is like the classic 9s problem in SRE. Each nine is exponentially more difficult.
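To put rough numbers on the nines analogy (a back-of-the-envelope sketch, not from the thread): each extra nine cuts the allowed error budget by 10x.

    # Yearly error budget (downtime allowed) for each additional nine of reliability.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines in range(1, 6):
        availability = 1 - 10 ** (-nines)          # 0.9, 0.99, 0.999, ...
        budget_min = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.5%} available -> ~{budget_min:,.0f} minutes of error budget/year")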
How easy do we really think it will be for an LLM to get 100% accurate at physics, when we don't even know what 100% right is, and it's theoretically possible it's not even physically possible?
GPT-5 doesn't give any cues as to whether we've hit the wall, as OpenAI only needs to go one step beyond the competition. They are the market leader and more profitable than the others, so it's possible they are not showing us everything they have until they really need to.
Not really, it's just that our benchmarks are not good at showing how they've improved. Those that regularly try out LLMs can attest to major improvements in reliability over the past year.
> If an AI can replace these repeated tasks, I could spend more time with my fiancé, family, friends, and dog, which is awesome, and I am looking forward to that.
I cannot understand this optimism; aren't we living in a capitalist world?
It is indeed completely stupid: if he can do that, others can too, which means they can be more productive than he is, and the only way he would spend more time with his fiancé, family, friends, and dog is by quickly becoming unemployed.
Yes this is what people constantly get wrong about AI. When AI starts to replace certain tasks, we will then create newer, larger tasks that will keep us busy, even when using AI to its full advantage.
Exactly. I have yet to see the manager who says to their employees: "Ah nice, you became 10% more efficient using AI; from now on you can work 4 hours less every week."
I don't think it's about capitalism; people have repeatedly shown we simply don't like idle time over the long run.
Plenty of people could already work less today if they just spent less. Historically any of the last big productivity booms could have similarly let people work less, but here we are.
If AI actually comes about and if AGI replaces humans at most cognitive labor, we'll find some way to keep ourselves busy even if the jobs ultimately are as useless as the pet rock or the Jump to Conclusions Mat (Office Space reference for anyone who hasn't seen it).
I don’t think it’s that simple. Productivity gains are rarely universal. Much of the past century’s worth of advancement into automation and computing technology has generated enormous productivity gains in manufacturing, communication, and finance industries but had little or no benefit for a lot of human capital-intensive sectors such as service and education.
It still takes basically the same amount of labour hours to give a haircut today as it did in the late 19th century. An elementary school teacher today can still not handle more than a few tens up to maybe a hundred students at the extreme limit. Yet the hairdressing and education industries must still compete — on the labour market — with the industries showing the largest productivity gains. This has the effect of raising wages in these productivity-stagnant industries and increasing the cost of these services for everyone, driving inflation.
Inflation is the real time-killer, not a fear of idleness. The cost of living has gone up for everyone — rather dramatically, in nominal terms — without even taking housing costs into account.
Productivity gains aren't universal, agreed there for sure, though we have long since moved past needing to optimize productivity for the basics. Collectively we're addicted to trading our time and effort for gadgets, convenience, and status symbols.
I'm not saying those are bad things, people can do whatever they want with their own time and effort. It just seems obvious to me that we aren't interested in working less over any meaningful period of time, if that was a goal we could have reached it a long time ago by defining a lower bar for when we have "enough."
> But they're not talking about idle time, they're talking about quality time with loved ones.
I totally agree there. I wasn't trying to imply that "idle time" is a bad thing; in this context I simply meant it's time not filled by obligations, allowing them to choose what they do.
> But spending for leisure is often a part of that quality time.
I expect that varies a lot by person and situation. Some of the most enjoyable experiences I've had involved little or no cost; having a camp fire with friends, going on a hike, working outside in the garden, etc.
> I wasn't trying to imply that "idle time" is a bad thing
I hear you; I just mean what they're talking about is also not idle time, as it's active time. If they were replacing work with sitting around at home, watching TV or whatever, then it would be idle time and would drive them crazy, no doubt. But spending time actively with their family is quite different, and would give satisfaction in a way that work does.
> I expect that varies a lot by person and situation.
Indeed. Spending isn't an inherent part of leisure. But it can be a part of it, and an important part for some people. Telling them they could have more free time if they just gave up their passions or hobbies that cost money isn't likely to lead anywhere.
It's slightly more complicated than that. If people work less, they make less money, and that means they can't buy a house, to name just one example. Housing is not getting any cheaper for a myriad of reasons. The same goes for healthcare, and even for drinking beer.
People could work less, but it's a group effort. As long as some narcissistic idiots who want more instead of less are in charge, this is not going to change easily.
Yes, and now we have come full circle back to capitalism. As soon as a gap forms between capital and untapped resources, the capitalist engine keeps running: the rich get richer and the poor get poorer. It is difficult or impossible to break out of this on a large scale.
The poor dont necessarily get poorer. That is not a given in capitalism. But at some point capitalism will converge to feudalism, at that point, the poor will become slaves.
And if not needed, culled. For being "unproductive" or "unattractive" or generally "worthless".
That's my cynical take.
As long as the rich can be reined in in some way, the poor will not necessarily become poorer.
In neoliberal capitalism they do, though. Because companies can maximize profits without internalizing external costs (such as health care, social welfare, environmental costs).
I am from the EU, so I can see it happening here, or in some smaller countries. Here, you already sort of have a UBI, where you get enough social benefits to live off of if unemployed.
This is a bad use of AI; we should spend our compute on making science faster. I am pretty confident the computational cost of this will be maybe 100x that of a ChatGPT query. I don't even want to think about the environmental effects.
>If AI soon becomes good enough at building software on its own, software engineering as we know it is dead. I have no interest in becoming a glorified project manager, orchestrating AI agents all day long. If it does happen, I am now competing with anyone who can type a prompt. I’m not betting my career on being slightly better at prompting than millions of others.
This is the view I most agree with in this discourse. That's why I am not enthusiastic about AI.
Google’s AlphaProof, which got a silver last year, has been using a neuro-symbolic approach. This gold from OpenAI was pure LLM. We’ll have to see what Google announces, but the LLM approach is interesting because it will likely generalize to all kinds of reasoning problems, not just mathematical proofs.
OpenAI’s systems haven’t been pure language models since the o models though, right? Their RL approach may very well still generalize, but it’s not just a big pre-trained model that is one-shotting these problems.
The key difference is that they claim to have not used any verifiers.
What do you mean by “pure language model”? The reasoning step is still just the LLM spitting out tokens and this was confirmed by Deepseek replicating the o models. There’s not also a proof verifier or something similar running alongside it according to the openai researchers.
If you mean pure as in there’s not additional training beyond the pretraining, I don’t think any model has been pure since gpt-3.5.
> it will likely generalize to all kinds of reasoning problems, not just mathematical proofs
Big if true. Setting up an RL loop for training on math problems seems significantly easier than many other reasoning domains. Much easier to verify correctness of a proof than to verify correctness (what would this even mean?) for a short story.
I’m much more excited about the formalized approach, as LLMs are susceptible to making things up. With formalization, we can be mathematically certain that a proof is correct. This could plausibly lead to machines surpassing humans in all areas of math. With a “pure English” approach, you still need a human to verify correctness.
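As a toy illustration of what that certainty buys you (a minimal Lean 4 sketch, unrelated to IMO-level problems): if the kernel accepts the proof, the statement is verified; a hallucinated proof simply fails to compile.

    -- If this file compiles, the theorem is machine-checked; there is no
    -- "looks plausible but is wrong" middle ground as with English proofs.
    theorem add_comm' (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b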
Given the Noam Brown comment ("It was a surprise even to many researchers at OpenAI") it seems extra surprising if multiple labs achieved this result at once.
There's a comment on this twitter thread saying the Google model was using Lean, while IIUC the OpenAI one was pure LLM reasoning (no tools). Anyone have any corroboration?
In a sense it's kinda irrelevant, I care much more about the concrete things AI can achieve, than the how. But at the same time it's very informative to see the limits of specific techniques expand.
Waiting for Terry Tao's thoughts, but these kinds of things are a good use of AI. We need to make science progress faster, rather than disrupting our economy before we're ready.
> I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition.
Astounding in what sense? I assume you are aware of the standard of Olympiad problems and that they are not particularly high. They are just challenging for the age range, but they shouldn't be for AI considering they aren't really anything but proofs and basic structured math problems.
Considering OpenAI can't currently analyse and provide real paper sources to cutting edge scientific issues, I wouldn't trust it to do actual research outside of generating matplotlib code.
I did competitive math in high school and I can confidently say that they are anything but "basic". I definitely can't solve them now (as an adult) and it's likely I never will. The same is true for most people, including people who actually pursued math in college (I didn't). I'm not going to be the next guy who unknowingly challenges a Putnam winner to do these but I will just say that it is unlikely that someone who actually understands the difficulty of these problems would say that they are not hard.
For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.
Thanks for speaking sense. I think 99% of people saying IMO problems are not hard would not be able to solve basic district-level competition problems and are just not equipped to judge the problems.
And 1% here are those IMO/IOI winners who think everyone is just like them. I grew up with them, and to you, my friends, I say: this is the reason why AI will not take over the world (and might not even be that useful for real-world tasks), even if it wins every damn contest out there.
I feel like people see the question (or even the solution), they can actually understand what it says because it’s only using basic algebraic notation, then assume it must be easy to solve. Obviously it must be easier than that funny math with weird symbols…
> I assume you are aware of the standard of Olympiad problems and that they are not particularly high.
Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.
The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.
You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.
Do you mean specific IMO training or general math training? The latter is certainly needed; that the former is needed is, in my opinion, a general observation about, for example, the people who make it onto the teams.
I feel like I've noticed you making the same comment in 12 places in this thread, misrepresenting the difficulty of this tournament, and ultimately it comes across as a bitter ex.
Here's an example problem 5:
Let a_1, a_2, …, a_n be distinct positive integers and let M = max over 1 ≤ i < j ≤ n of (a_i + a_j)(a_j − a_i). Find the maximum number of pairs (i, j) with 1 ≤ i < j ≤ n for which (a_i + a_j)(a_j − a_i) = M.
I asked ChatGPT, and it's saying that's 2022 problem 5; however, that seems to be clearly wrong... Moreover, I can't find that problem anywhere, so I don't know if it's a hallucination or something from its training set that isn't on the internet...
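Whatever its provenance, the statement is at least easy to sanity-check by brute force for small n (a throwaway sketch; the helper name and search ranges are made up):

    from itertools import combinations, permutations

    def max_pair_count(a):
        # (a_i + a_j)(a_j - a_i) = a_j**2 - a_i**2 for each pair i < j.
        vals = [a[j] ** 2 - a[i] ** 2 for i, j in combinations(range(len(a)), 2)]
        M = max(vals)
        return vals.count(M)

    # Exhaustively check small sequences of distinct positive integers.
    best = {}
    for n in range(2, 6):
        for values in combinations(range(1, 9), n):
            for seq in permutations(values):
                best[n] = max(best.get(n, 0), max_pair_count(seq))
    print(best)  # most pairs attaining M observed for each n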
IMO questions and Andrew Wiles solving Fermat's last theorem are two vastly different things. One is far harder than the other and the effort he put in and thinking needed is something very few can do. He also did some other fascinating work that I couldn't hope to understand fully. There is a gulf between FLT and IMO types of proofs.