I started fully coding with Claude Code. It's not just vibe coding, but rather AI-assisted coding. I've noticed a considerable decrease in my understanding of the whole codebase, even though I'm the only one who has worked on it for 2 years. I'm struggling to answer my colleagues' questions.
I am not arguing that we should drop AI, but we should really measure its effects and act accordingly. There's more to this than just getting more productivity.
This is the chief reason I don't use integrations. I just use chat, because I want to physically understand and insert the code myself. Otherwise you end up with the code outpacing your understanding of it.
Yes. I'm happy to have a sometimes-wrong expert to hand. Sometimes it provides just what I need; sometimes, as with a human (who is also fallible), it helps spur my own thinking along, clarify, converge on a solution, think laterally, or deliver other productivity-boosting effects.
I’m experiencing something similar. We have a codebase of about 150k lines of backend code. On one hand, I feel significantly more productive - perhaps 400% more efficient when it comes to actually writing code. I can iterate on the same feature multiple times, refining it until it’s perfect.
However, the challenge has shifted to code review. I now spend the vast majority of my time reading code rather than writing it. You really need to build strong code-reading muscles. My process has become: read, scrap it, rewrite it, read again… and repeat until it’s done. This approach produces good results for me.
The issue is that not everyone has the same discipline to produce well-crafted code when using AI assistance. Many developers are satisfied once the code simply works. Since I review everything manually, I often discover issues that weren’t even mentioned. During reviews, I try to visualize the entire codebase and internalize everything to maintain a comprehensive understanding of the system’s scope.
I'm very surprised you find this workflow more efficient than just writing the code. I find constructing the mental model of the solution and how it fits into existing system and codebase to be 90% of effort, then actually writing the code is 10%. Admittedly, I don't have to write any boilerplate due to the problem domain and tech choices. Coding agents definitely help with the last 10% and also all the adjacent work - one-off scripts where I don't care about code quality.
I doubt it actually is. All the extra effort it takes to make the AI do something useful on non-trivial tasks is going to end up being a wash in terms of productivity, if not a net negative. But it feels more productive because of how fast the AI can iterate.
And you get to pay some big corporation for the privilege.
> Many developers are satisfied once the code simply works.
In the general case, the only way to convince oneself that the code truly works is to reason through it, as testing only tests particular data points for particular properties. Hence, “simply works” is more like “appears to work for the cases I tried out”.
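A toy illustration (a made-up example, not from the thread): a function can pass every test you happened to write and still be wrong in general.

    def is_leap_year(year: int) -> bool:
        # "Simply works" for the cases tried below.
        return year % 4 == 0

    # The particular data points we tested all pass...
    assert is_leap_year(2024)
    assert not is_leap_year(2023)
    assert is_leap_year(2000)

    # ...but reasoning through the actual rule (century years must also be
    # divisible by 400) shows the function is wrong in general:
    print(is_leap_year(1900))  # prints True, but 1900 was not a leap year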
I wrote a couple of Python scripts this week to help me with a MIDI integration project (3 devices with different cable types) and for quick debugging if something fails (yes, I know there are tools out there that do this, but I like learning).
I could have used an LLM to assist, but then I wouldn't have learned much.
But I did use an LLM to make a management wrapper that presents a menu of options (CLI right now) and calls the scripts. That probably saved me an hour, easily.
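Roughly the shape of such a wrapper, as a minimal sketch (not my actual code; the script names are made up for illustration):

    #!/usr/bin/env python3
    """Minimal menu wrapper that shells out to standalone helper scripts."""
    import subprocess
    import sys

    # Hypothetical helper scripts; substitute your own.
    SCRIPTS = {
        "1": ("List connected MIDI devices", ["python3", "list_devices.py"]),
        "2": ("Route device A -> device B", ["python3", "route_midi.py"]),
        "3": ("Debug a failing connection", ["python3", "debug_midi.py"]),
    }

    def main() -> None:
        while True:
            print("\nMIDI toolbox")
            for key, (label, _) in SCRIPTS.items():
                print(f"  {key}) {label}")
            print("  q) quit")
            choice = input("> ").strip().lower()
            if choice == "q":
                sys.exit(0)
            if choice in SCRIPTS:
                subprocess.run(SCRIPTS[choice][1])  # run the chosen script
            else:
                print("Unknown option")

    if __name__ == "__main__":
        main()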
That’s my comfort level for anything even remotely “complicated”.
I keep wanting to go back to using Claude Code, but I get worried about this issue. How best to use it to complement you, without it rewriting everything behind the scenes? What's the best protocol? Constant commit requests and reviews?
Yesterday, I was asked to scrape data from a website. My friend used ChatGPT to scrape the data but didn't succeed, even after spending 3+ hours. I looked at the website's code, understood it with my web knowledge, and did some research with an LLM. Then I described to the LLM how to scrape the data, and it took 30 minutes overall. The LLM can't come up with the best approach on its own, but you can build it by using the LLM. It's the same everywhere: at the end of the day you need someone who can really think.
LLMs can do anything, but the decision tree for what you can do in life is almost infinite. LLMs still need a coherent designer to make progress towards a goal.
It is not that easy; there is lazy loading on the page that is triggered by scrolling specific sections. You need to find a clever way: there's no way to scrape it with bs4 alone, and it's tough even with Selenium.
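For what it's worth, the usual workaround is to drive a real browser, scroll each lazy section into view so its content loads, and only then hand the rendered HTML to bs4. A minimal sketch; the URL and CSS selectors are placeholders:

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/catalog")  # placeholder URL

    # Scroll each lazily loaded section into view so its content gets fetched.
    for section in driver.find_elements(By.CSS_SELECTOR, "section.lazy"):  # assumed selector
        driver.execute_script("arguments[0].scrollIntoView();", section)
        time.sleep(2)  # crude wait; a WebDriverWait on the loaded content is better

    # Parse the fully rendered page with bs4.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    items = soup.select("div.item")  # assumed selector
    print(len(items))
    driver.quit()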
Before last year we didn't have reasoning. It came with Quiet-STaR, then we got it in the form of o1, and then it became practical with DeepSeek's paper in January.
So we're only about a year since the last big breakthrough.
I think we got a second big breakthrough with Google's results on the IMO problems.
For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.
The IMO result is not a breakthrough; if you craft proper prompts you can excel at the IMO with 2.5 Pro. Paper: https://arxiv.org/abs/2507.15855. Google just threw its whole computational power at it, along with very high-quality data. It was test-time scaling. Why didn't it solve problem 6 as well?
Yes, it was a breakthrough, but it saturated quickly. Wait for the next breakthrough. If they can build self-adapting weights into LLMs we can talk about different things, but test-time scaling is coming to an end as hallucination rates increase. No sign of AGI.
It wasn't long ago that test-time scaling wasn't possible. Test-time scaling is a core part of what makes this a breakthrough.
I don't believe your assessment, though. The IMO is hard, and Google has said that they use search and some way of combining different reasoning traces. I haven't read that paper yet, and of course it may support your view, but I just don't believe it.
We are not close to solving IMO with publicly known methods.
Test-time scaling is based on methods from pre-2020. If you look at the details of modern LLMs, there's a pretty small probability of encountering a method from 2020 or later (RoPE, GRPO). I am not saying the IMO result is not impressive, but it is not a breakthrough; if they said they used a different paradigm than test-time scaling, I would call it a breakthrough.
> We are not close to solving IMO with publicly known methods.
The point here is not the method but rather the computational power. You can solve any verifiable task with enough computation; of course there must be tweaks to the methods, but I don't think it is anything very big and different. OpenAI just asserted they solved it with a breakthrough.
Wait for self-adapting LLMs. We will see them within 2 years at most; all the big tech companies are focusing on that now, I think.
Layman's perspective: we had hints of reasoning from the initial release of ChatGPT when people figured out you could prompt "think step by step" to drastically increase problem solving performance. Then yeah a year+ later it was cleverly incorporated into model training.
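For anyone who missed that era, the trick was literally just appending the phrase to the prompt. A minimal sketch using the OpenAI Python SDK (the model name is only an example):

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    question = (
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
        "the ball. How much does the ball cost?"
    )

    # Compare the plain prompt against the "think step by step" variant.
    for suffix in ("", " Let's think step by step."):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{"role": "user", "content": question + suffix}],
        )
        print(f"--- suffix: {suffix!r}")
        print(resp.choices[0].message.content)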
We still don't have reasoning. We have synthetic text extrusion machines priming themselves to output text that looks a certain way by first generating some extra text that gets piped back into their own input for a second round.
It's sometimes useful, it seems. But when and why it helps is unclear and understudied, and the text produced in the "reasoning trace" doesn't necessarily correspond to or predict the text produced in the main response (which, of course, actual reasoning would).
Boosters will often retreat to "I don't care if the thing actually thinks", but the whole industry is trading on anthropomorphic notions like "intelligence", "reasoning", "thinking", "expertise", even "hallucination", etc., in order to drive the engine of the hype train.
The massive amounts of capital wouldn't be here without all that.
I think this is more an effect of releasing a model every other month with gradual improvements. If there were no o-series or other thinking models on the market, people would be shocked by this upgrade. The only way to keep up with the market is to release improvements ASAP.
I don't agree; the only thing that would shock me about this model is if it didn't hallucinate.
I think the actual effect of releasing more models every month has been to confuse people into thinking progress is actually happening. Despite claims of exponentially improved performance and the ability to replace PhDs, doctors, and lawyers, it still routinely can't be trusted, just like the original ChatGPT, despite years of effort.
This is a very odd perspective. As someone who uses LLMs for coding/PRs, every time a new model was released my personal experience was that it was a very solid improvement on the previous generation and not just meant to "confuse". The jump from raw GPT-4 two years ago to full o3 is so unbelievable that if you had traveled back in time and shown me, I wouldn't have thought such technology would exist for 5+ years.
To the point on hallucination: that's just the nature of LLMs (and humans, to some extent). Without new architectures or fact-checking world models in place, I don't think that problem will be solved anytime soon. But it seems GPT-5's main selling point is that they somehow reduced the hallucination rate by a lot, plus search helps with grounding.
I notice you don't bring any examples despite claiming the improvements are frequent and solid. It's likely because the improvements are actually hard to define and quantify. Which is why throughout this period of LLM development, there has been such an emphasis on synthetic benchmarks (which tell us nothing), rather than actual capabilities and real world results.
I didn't bring examples because I said it was personal experience. Here's my "evidence": GPT-4 took multiple shots and iterations and couldn't stay coherent with a prompt longer than 20k tokens (in my experience). Then when GPT-4o came out, it improved on that (in my experience). o1 took 1-2 shots with fewer iterations (in my experience). o3 zero-shots most of the tasks I throw at it and stays coherent with very long prompts (in my experience).
Here's something else to think about: try telling everybody to go back to using GPT-4. Then try telling people to go back to using full o1. You likely won't find any takers. It's almost like the newer models are improved and generally more useful.
I'm not saying they're not delivering better incremental results for people for specific tasks, I'm saying they're not improving as a technology in the way big tech is selling.
The technology itself is not really improving because all of the showstopping downsides from day one are still there: Hallucinations. Limited context window. Expensive to operate and train. Inability to recall simple information, inability to stay on task, support its output, or do long term planning. They don't self-improve or learn from their mistakes. They are credulous to a fault. There's been little progress on putting guardrails on them.
Little progress especially on the ethical questions that surround them, which seem to have gone out the window with all the dollar signs floating around. They've put waaaay more effort into the commoditization front. 0 concern for the impact of releasing these products to the world, 100% concern about how to make the most money off of them. These LLMs are becoming more than the model, they're now a full "service" with all the bullshit that entails like subscriptions, plans, limits, throttling, etc. The enshittification is firmly afoot.
Not to offend, but it sounds like your response/worries are based more on an emotional reaction. And rightly so: this is by all means a very scary and uncertain time, and undeniably these companies have not taken into account the impact their products will cause and the safety surrounding that.
However, a lot of your claims are false; progress is being made in nearly all the areas you mentioned:
"You can use these filters to adjust what's appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as Dangerous due to the nature of the game. In addition to the adjustable safety filters, the Gemini API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted."
Now I'd like to ask you for evidence that none of these aspects have been improved, since you claim my examples are vague but make statements like:
> Inability to recall simple information
> inability to stay on task
> (doesn't) support its output
> (no) long term planning
I've experienced the exact opposite. Not 100% of the time, but compared to GPT-4 all of these areas have been massively improved. Sorry I can't provide every single chat log I've ever had with these models to satisfy your vagueness-o-meter, or provide benchmarks which I assume you will brush aside.
And that's on top of the examples I've provided above. You seem to be making claims out of thin air and then claiming others are not providing examples up to your standard.
Big claims of PRs and shipped code, then links to people who are financially invested in the hype.
Not saying things are not getting better, but I have found that the claims of amazing results come from people who are not expert enough in the given domain to judge the actual quality of the output.
I love vibing out Rust, and it compiles and runs, but I have no idea if it is good Rust because, well, I barely understand Rust.
> now id like to ask you for evidence that none of these aspects have been improved
You're arguing against a strawman. I'm not saying there haven't been incremental improvements for the benchmarks they're targeting. I've said that several times now. I'm sure you're seeing improvements in the tasks you're doing.
But for me to say that there is more than a shell game going on, I will have to see tools that do not hallucinate. A (claimed; who knows if that's right, they can't even get the physics questions or the charts right) 65% reduction is helpful but doesn't make these things useful tools in the way they're claiming.
> sorry i cant provide every single chat log ive ever had with these models to satisfy your vagueness-o-meter
I'm not asking for all of them, you didn't even share one!
Like I said, despite all the advances touted in the breathless press releases, the brand-new model is just one bad roll away from behaving like the models from 3 years ago, and until that isn't the case, I'll continue to believe that the technology has hit a wall.
If it can't do this after how many years, then how is it supposed to be the smartest person I know in my pocket? How am I supposed to trust it, and build a foundation on it?
Interesting thread. I think the key issue around hallucinations is analogous to compilers: in order for the output to be implicitly trusted, it has to be as stable as a compiler. Hallucinations mean I cannot YOLO-trust the output. Having to manually scan the code for issues defeats the fundamental benefit.
Compilers were not and are not always perfect, but I think AI has a long way to go before it passes that threshold. People act like it will in the next few years, which the current trajectory strongly suggests is not the case.
I'll leave it at this: if "zero-hallucination omniscience" is your bar, you'll stay disappointed, and that's on your expectations, not the tech. Personally, I've been coding and researching faster and with fewer retries every time a new model drops, so my opinion is based on experience. You're free to sit out the upgrade cycle.
You don't remember DeepSeek introducing reasoning and blowing the benchmarks led by private American companies out of the water? With an API that was way cheaper? And then offering the model for free in a chat-based system online? And you were a big fan?
Isn't the fact that it produced similar performance about 70x more cheaply a breakthrough? In the same way that the Hall-Héroult process was a breakthrough. Not like we didn't have aluminum before 1886.
I think the LLM wall was hit a while ago, and the jumps have been around finessing LLMs in novel ways for better results. But the core is still very much the same as it has been for a while.
The crypto-level hype claims are all BS and we all knew that, but I do use an LLM more than Google now, which is the "there" there, so to speak.
This does feel like a flatlining of hype, though, which is great, because I don't know if I could take the AI hype train for much longer.
It's seemed that way for the last year. The only real improvements have been in the chat apps themselves (internet access, function calling). Until AI gets past the pre-training problem, it'll stagnate.
It is easier to get from 0% accurate to 99% accurate, than it is to get from 99% accurate to 99.9% accurate.
This is like the classic 9s problem in SRE. Each nine is exponentially more difficult.
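To put rough numbers on the nines analogy (a back-of-the-envelope sketch, not from the thread): each extra nine cuts the allowed error budget by 10x.

    # Yearly error budget (downtime allowed) for each additional nine of reliability.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines in range(1, 6):
        availability = 1 - 10 ** (-nines)          # 0.9, 0.99, 0.999, ...
        budget_min = (1 - availability) * MINUTES_PER_YEAR
        print(f"{availability:.5%} available -> ~{budget_min:,.0f} minutes of error budget/year")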
How easy do we really think it will be for an LLM to get 100% accurate at physics, when we don't even know what 100% right is, and it's theoretically possible it's not even physically possible?
GPT-5 doesn't give any cues as to whether we've hit the wall, as OpenAI only needs to go one step beyond the competition. They are the market leader and more profitable than the others, so it's possible they are not showing us everything they have until they really need to.
Not really, it's just that our benchmarks are not good at showing how they've improved. Those that regularly try out LLMs can attest to major improvements in reliability over the past year.
> If an AI can replace these repeated tasks, I could spend more time with my fiancé, family, friends, and dog, which is awesome, and I am looking forward to that.
I cannot understand this optimism; aren't we living in a capitalist world?
It is indeed completely stupid: if he can do that, others can too, which means they can be more productive than he is, and the only way he would spend more time with his fiancé, family, friends, and dog is by quickly becoming unemployed.
Yes this is what people constantly get wrong about AI. When AI starts to replace certain tasks, we will then create newer, larger tasks that will keep us busy, even when using AI to its full advantage.
Exactly. I have yet to see the manager who says to their employees: "Ah nice, you became 10% more efficient using AI; from now on you can work 4 hours less every week."
I don't think it's about capitalism; people have repeatedly shown we simply don't like idle time over the long run.
Plenty of people could already work less today if they just spent less. Historically any of the last big productivity booms could have similarly let people work less, but here we are.
If AI actually comes about and if AGI replaces humans at most cognitive labor, we'll find some way to keep ourselves busy even if the jobs ultimately are as useless as the pet rock or the Jump to Conclusions Mat (Office Space reference for anyone who hasn't seen it).
I don’t think it’s that simple. Productivity gains are rarely universal. Much of the past century’s worth of advancement into automation and computing technology has generated enormous productivity gains in manufacturing, communication, and finance industries but had little or no benefit for a lot of human capital-intensive sectors such as service and education.
It still takes basically the same amount of labour hours to give a haircut today as it did in the late 19th century. An elementary school teacher today can still not handle more than a few tens up to maybe a hundred students at the extreme limit. Yet the hairdressing and education industries must still compete — on the labour market — with the industries showing the largest productivity gains. This has the effect of raising wages in these productivity-stagnant industries and increasing the cost of these services for everyone, driving inflation.
Inflation is the real time-killer, not a fear of idleness. The cost of living has gone up for everyone — rather dramatically, in nominal terms — without even taking housing costs into account.
Productivity gains aren't universal, agreed there for sure, though we have long since moved past needing to optimize productivity for the basics. Collectively we're addicted to trading our time and effort for gadgets, convenience, and status symbols.
I'm not saying those are bad things, people can do whatever they want with their own time and effort. It just seems obvious to me that we aren't interested in working less over any meaningful period of time, if that was a goal we could have reached it a long time ago by defining a lower bar for when we have "enough."
> But they're not talking about idle time, they're talking about quality time with loved ones.
I totally agree there. I wasn't trying to imply that "idle time" is a bad thing; in this context I simply meant it's time not filled by obligations, allowing them to choose what they do.
> But spending for leisure is often a part of that quality time.
I expect that varies a lot by person and situation. Some of the most enjoyable experiences I've had involved little or no cost; having a camp fire with friends, going on a hike, working outside in the garden, etc.
> I wasn't trying to imply that "idle time" is a bad thing
I hear you; I just mean what they're talking about is also not idle time, as it's active time. If they were replacing work with sitting around at home, watching TV or whatever, then it would be idle time and would drive them crazy, no doubt. But spending time actively with their family is quite different, and would give satisfaction in a way that work does.
> I expect that varies a lot by person and situation.
Indeed. Spending isn't an inherent part of leisure. But it can be a part of it, and an important part for some people. Telling them they could have more free time if they just gave up their passions or hobbies that cost money isn't likely to lead anywhere.
It's slightly more complicated than that. If people work less, they make less money, and that means they can't buy a house, to name just one example. Housing is not getting any cheaper for a myriad of reasons. The same goes for healthcare, and even for drinking beer.
People could work less, but it's a group effort. As long as some narcissistic idiots who want more instead of less are in charge, this is not going to change easily.
Yes, and now we have come full circle back to capitalism. As soon as a gap forms between capital and untapped resources, the capitalist engine keeps running: the rich get richer and the poor get poorer. It is difficult or impossible to break out of this on a large scale.
The poor dont necessarily get poorer. That is not a given in capitalism. But at some point capitalism will converge to feudalism, at that point, the poor will become slaves.
And if not needed, culled. For being "unproductive" or "unattractive" or generally "worthless".
That's my cynical take.
As long as the rich can be reined in in some way, the poor will not necessarily become poorer.
In neoliberal capitalism they do, though. Because companies can maximize profits without internalizing external costs (such as health care, social welfare, environmental costs).
I am from the EU, so I can see it happening here, or in some smaller countries. Here, you already sort of have a UBI, where you get enough social benefits to live off of if unemployed.
This is a bad use of AI; we should spend our compute on making science faster. I am pretty confident the computational cost of this will be maybe 100x that of a ChatGPT query. I don't even want to think about the environmental effects.
>If AI soon becomes good enough at building software on its own, software engineering as we know it is dead. I have no interest in becoming a glorified project manager, orchestrating AI agents all day long. If it does happen, I am now competing with anyone who can type a prompt. I’m not betting my career on being slightly better at prompting than millions of others.
This is the view I most agree with in this discourse. That's why I am not enthusiastic about AI.
Google’s AlphaProof, which got a silver last year, has been using a neuro-symbolic approach. This gold from OpenAI was pure LLM. We’ll have to see what Google announces, but the LLM approach is interesting because it will likely generalize to all kinds of reasoning problems, not just mathematical proofs.
OpenAI’s systems haven’t been pure language models since the o models though, right? Their RL approach may very well still generalize, but it’s not just a big pre-trained model that is one-shotting these problems.
The key difference is that they claim to have not used any verifiers.
What do you mean by “pure language model”? The reasoning step is still just the LLM spitting out tokens and this was confirmed by Deepseek replicating the o models. There’s not also a proof verifier or something similar running alongside it according to the openai researchers.
If you mean pure as in there’s not additional training beyond the pretraining, I don’t think any model has been pure since gpt-3.5.
> it will likely generalize to all kinds of reasoning problems, not just mathematical proofs
Big if true. Setting up an RL loop for training on math problems seems significantly easier than many other reasoning domains. Much easier to verify correctness of a proof than to verify correctness (what would this even mean?) for a short story.
I’m much more excited about the formalized approach, as LLMs are susceptible to making things up. With formalization, we can be mathematically certain that a proof is correct. This could plausibly lead to machines surpassing humans in all areas of math. With a “pure English” approach, you still need a human to verify correctness.
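As a toy illustration of what that certainty buys you (a minimal Lean 4 sketch, unrelated to IMO-level problems): if the kernel accepts the proof, the statement is verified; a hallucinated proof simply fails to compile.

    -- If this file compiles, the theorem is machine-checked; there is no
    -- "looks plausible but is wrong" middle ground as with English proofs.
    theorem add_comm' (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b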
Given the Noam Brown comment ("It was a surprise even to many researchers at OpenAI") it seems extra surprising if multiple labs achieved this result at once.
There's a comment on this twitter thread saying the Google model was using Lean, while IIUC the OpenAI one was pure LLM reasoning (no tools). Anyone have any corroboration?
In a sense it's kinda irrelevant, I care much more about the concrete things AI can achieve, than the how. But at the same time it's very informative to see the limits of specific techniques expand.
Waiting for Terry Tao's thoughts, but these kinds of things are a good use of AI. We need to make science progress faster, rather than disrupting our economy before we're ready.
> I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition.
Astounding in what sense? I assume you are aware of the standard of Olympiad problems and that they are not particularly high. They are just challenging for the age range, but they shouldn't be for AI considering they aren't really anything but proofs and basic structured math problems.
Considering OpenAI can't currently analyse and provide real paper sources to cutting edge scientific issues, I wouldn't trust it to do actual research outside of generating matplotlib code.
I did competitive math in high school and I can confidently say that they are anything but "basic". I definitely can't solve them now (as an adult) and it's likely I never will. The same is true for most people, including people who actually pursued math in college (I didn't). I'm not going to be the next guy who unknowingly challenges a Putnam winner to do these but I will just say that it is unlikely that someone who actually understands the difficulty of these problems would say that they are not hard.
For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.
Thanks for speaking sense. I think 99% of people saying IMO problems are not hard would not be able to solve basic district-level competition problems and are just not equipped to judge the problems.
And 1% here are those IMO/IOI winners who think everyone is just like them. I grew up with them, and to you, my friends, I say: this is the reason why AI will not take over the world (and might not even be that useful for real-world tasks), even if it wins every damn contest out there.
I feel like people see the question (or even the solution), they can actually understand what it says because it’s only using basic algebraic notation, then assume it must be easy to solve. Obviously it must be easier than that funny math with weird symbols…
> I assume you are aware of the standard of Olympiad problems and that they are not particularly high.
Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.
The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.
You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.
Do you mean specific IMO training or general math training? The latter is certainly needed; that the former is needed is, in my opinion, a general observation about, for example, the people who make it onto the teams.
I feel like I've noticed you making the same comment in 12 places in this thread, misrepresenting the difficulty of this tournament, and ultimately it comes across as a bitter ex.
Here's an example problem 5:
Let a_1, a_2, …, a_n be distinct positive integers and let M = max over 1 ≤ i < j ≤ n of (a_i + a_j)(a_j − a_i). Find the maximum number of pairs (i, j) with 1 ≤ i < j ≤ n for which (a_i + a_j)(a_j − a_i) = M.
I asked ChatGPT, and it's saying that's 2022 problem 5; however, that seems to be clearly wrong... Moreover, I can't find that problem anywhere, so I don't know if it's a hallucination or something from its training set that isn't on the internet...
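Whatever its provenance, the statement is at least easy to sanity-check by brute force for small n (a throwaway sketch; the helper name and search ranges are made up):

    from itertools import combinations, permutations

    def max_pair_count(a):
        # (a_i + a_j)(a_j - a_i) = a_j**2 - a_i**2 for each pair i < j.
        vals = [a[j] ** 2 - a[i] ** 2 for i, j in combinations(range(len(a)), 2)]
        M = max(vals)
        return vals.count(M)

    # Exhaustively check small sequences of distinct positive integers.
    best = {}
    for n in range(2, 6):
        for values in combinations(range(1, 9), n):
            for seq in permutations(values):
                best[n] = max(best.get(n, 0), max_pair_count(seq))
    print(best)  # most pairs attaining M observed for each n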
IMO questions and Andrew Wiles solving Fermat's last theorem are two vastly different things. One is far harder than the other and the effort he put in and thinking needed is something very few can do. He also did some other fascinating work that I couldn't hope to understand fully. There is a gulf between FLT and IMO types of proofs.