Hacker News | z7's comments

List of dates predicted for apocalyptic events:

https://en.wikipedia.org/wiki/List_of_dates_predicted_for_ap...


Current cope collection:

- It's not a fair match, these models have more compute and memory than humans

- Contestants weren't really elite, they're just college level programmers, not the world's best

- This doesn't matter for the real world, competitive programming is very different from regular software engineering

- It's marketing, they're just cranking up the compute to unrealistic levels to gain PR points

- It's brute force, not intelligence


An encyclopaedia is a lossy representation of reality.


I would argue language in general is. Encyclopaedias are just a hard medium to transmit it.


Meanwhile this new paper claims that GPT-5 surpasses medical professionals in medical reasoning:

"On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding."

https://arxiv.org/abs/2508.08224


That's quite interesting. It also shows GPT-4o was worse than the experts, so presumably 3.5 was much worse. I wonder where RFK Jr would come on that scale.


>The actual benchmark improvements are marginal at best

GPT-5 demonstrates exponential growth in task completion times:

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...


What do you mean? A single data point cannot be exponential. What the blog post says is that the length of tasks LLMs can complete grows exponentially over time, and GPT-5 fits on that curve.


Yes, but the jump in performance from o3 is well beyond marginal while also fitting an exponential trend, which undermines the parent's claim on two counts.


Actually a single data point fits a huge range of exponential functions.


No it doesn't. If it were even linear compared to o1 -> o3, we'd be at 2.43 hours. Instead we're only at 2.29.

Exponential would be at 3.6 hours


GPT-5 is #1 on WebDev Arena with +75 pts over Gemini 2.5 Pro and +100 pts over Claude Opus 4:

https://lmarena.ai/leaderboard


This same leaderboard lists a bunch of models, including 4o, beating out Opus 4, which seems off.


In my experience Opus 4 isn't as good for day-to-day coding tasks as Sonnet 4. It's better as a planner.


"+100 points" sounds like a lot until you do the ELO math and see that means 1 out of 3 people still preferred Claud Opus 4's response. Remember 1 out of 2 would place the models dead even.


That eval hasn't been relevant for a while now. Performance there just doesn't seem to correlate well with real-world performance.


What does +75 arbitrary points mean in practice? Can we come up with units that relate to something in the real world?


Some previous predictions:

In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.

He thought there was an 8% chance of this happening.

Eliezer Yudkowsky said "at least 16%".

Source:

https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...


While I usually enjoy seeing these discussions, I think they really stretch the usefulness of Bayesian statistics. If one person says the chance of an outcome is 8% and another says it's 16% and the outcome does occur, they were both pretty wrong, even though it might seem like the one who guessed a few percent higher had a better belief system. Now if one of them had said 90% while the other said 8% or 16%, then we should pay close attention to what they are saying.


The person who guessed 16% would have a lower Brier score (lower is better) and someone who estimated 100%, beyond being correct, would have the lowest possible value.
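
For concreteness, a quick sketch of that comparison (a single binary event, scored after it occurred):

    def brier(forecast: float, outcome: int) -> float:
        # Brier score for one binary event: squared error between forecast and outcome.
        return (forecast - outcome) ** 2

    for p in (0.08, 0.16, 1.00):
        print(f"forecast {p:.0%} -> Brier {brier(p, 1):.4f}")
    # forecast 8%   -> Brier 0.8464
    # forecast 16%  -> Brier 0.7056  (lower, i.e. better)
    # forecast 100% -> Brier 0.0000  (lowest possible)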


I'm not saying there aren't ways to measure this (bayesian statistics does exist after all), I'm saying the difference is not worth arguing about who was right. Or even who had a better guess.


A 16% or even 8% event happening is quite common so really it tells us nothing and doesn’t mean either one was pretty wrong.


From a mathematical point of view there are two factors: (1) the prior predictive capability of the human agents and (2) the acceleration in the predicted event. Examining the result under such a model, we conclude that:

The more prior predictive power the human agents had, the more a posteriori acceleration of progress in LLMs (math capability) we should infer.

Here we are supposing that the increase in training data is not the main explanatory factor.

This example is the germ of a general framework for assessing acceleration in LLM progress, and I think applying it to many data points could give us valuable information.


Another take at a sound interpretation:

(1) Bad prior prediction capability of humans implies the result does not provide any information.

(2) Good prior prediction capability of humans implies that there is acceleration in the math capabilities of LLMs.


The whole point is to make many such predictions and experience many outcomes. The goal is for your 70% predictions to be correct 70% of the time. We all have a gap between how confident we are and how often we're correct. Calibration, which can be measured by making many predictions, is about reducing that gap.


If I predict that my next dice roll will be a 5 with 16% certainty and I do indeed roll a 5, was my prediction wrong?


The correctness of 8%, 16%, and 90% are all equally unknown since we only have one timeline, no?


That's why you have to let these people make predictions about many things. Then you can weigh the 8, 16, and 90 pct and see who is talking out of their ass.


That's just the frequentist approach. But we're talking about bayesian statistics here.


I admit I don't know Bayesian statistics, but isn't the only way to check whether the fortune teller is lucky or not to have them predict many things? If he predicts 10 things to happen with a 10% chance each, and one of them happens, he's good. If he predicts 10 to happen with a 90% chance each and 9 happen, same. How is this different with Bayesian statistics?


It is the only way if you're a frequentist. But there is a whole other subfield of statistics that deals with assigning probabilities to single events.


If one is calibrated to report proper percentages and assigns 8% to 25 distinct events, you should expect 2 of the events to occur; 4 in case of 16% and 22.5 in case of 90%. Assuming independence (as is sadly too often done) standard math of binomial distributions can be applied and used to distinguish the prediction's accuracy probabilistically despite no actual branching or experimental repetition taking place.
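
A small sketch of that reasoning, with a hypothetical outcome of 4 hits out of 25 just to show how the likelihoods separate:

    from math import comb

    def binom_pmf(k: int, n: int, p: float) -> float:
        # Probability of exactly k hits in n independent events, each with probability p.
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n, hits = 25, 4  # hypothetical: 4 of the 25 predicted events actually occurred
    for p in (0.08, 0.16, 0.90):
        print(f"claimed {p:.0%}: P(exactly {hits} of {n}) = {binom_pmf(hits, n, p):.3g}")
    # claimed 8%:  ~0.090
    # claimed 16%: ~0.21    <- assigns the most probability to this outcome
    # claimed 90%: ~8e-18   <- effectively ruled out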


This is probably the best thing I’ve ever read about predictions of the future. If we could run 80 parallel universes then sure it would make sense. But we only have the one [1]. If you’re right and we get fast takeoff it won’t matter because we’re all dead. In any case the number is meaningless, there is only ONE future.


You can make predictions of many different things, though, building a quantifiable track record. If one person is consistently, confidently wrong, then that says something about their ability and methodology.


Impressive prediction, especially pre-ChatGPT. Compare to Gary Marcus 3 months ago: https://garymarcus.substack.com/p/reports-of-llms-mastering-...

We may certainly hope Eliezer's other predictions don't prove so well-calibrated.


Gary Marcus is so systematically and overconfidently wrong that I wonder why we keep talking about this clown.


People give attention to those making surprising, bold, counter-narrative predictions, but pay no attention when those predictions turn out wrong.


People like him and Zitron do serve a useful purpose in balancing the hype from the other side, which, while justified to a great extent, is often a bit too overwhelming.


Being wrong in the other direction doesn't mean you've found a great balance, it just means you've found a new way to be wrong.


These numbers feel kind of meaningless without any work showing how he got to 16%


I do think Gary Marcus says a lot of wrong stuff about LLMs but I don’t see anything too egregious in that post. He’s just describing the results they got a few months ago.


He definitely cannot use the original arguments from when ChatGPT arrived; he's a perennial goalpost shifter.


My understanding is that Eliezer more or less thinks it's over for humans.



Context? Who are these people and what are these numbers and why shouldn't I assume they're pulled from thin air?


> why shouldn't I assume they're pulled from thin air?

You definitely should assume they are. They are rationalists; the modus operandi is to pull stuff out of thin air and slap a single digit precision percentage prediction on top to make it seem grounded in science and well thought out.


You should basically assume they are pulled from thin air. (Or more precisely, from the brain and world model of the people making the prediction.)

The point of giving such estimates is mostly an exercise in getting better at understanding the world, and a way to keep yourself honest by making predictions in advance. If someone else consistently gives higher probabilities to events that ended up happening than you did, then that's an indication that there's space for you to improve your prediction ability. (The quantitative way to compare these things is to see who has lower log loss [1].)

[1] https://en.wikipedia.org/wiki/Cross-entropy
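
For example, a minimal sketch of that log-loss comparison, scored on the single IMO-gold question after the fact:

    import math

    def log_loss(forecast: float, outcome: int) -> float:
        # Cross-entropy / log loss for one binary prediction (natural log, so in nats); lower is better.
        p = forecast if outcome == 1 else 1.0 - forecast
        return -math.log(p)

    print(log_loss(0.08, 1))  # ~2.53
    print(log_loss(0.16, 1))  # ~1.83 -> the higher forecast is penalized less, since the event occurred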


> If someone else consistently gives higher probabilities to events that ended up happening than you did, then that's an indication that there's space for you to improve your prediction ability.

Your inference seems ripe for scams.

For example-- if I find out that a critical mass of participants aren't measuring how many participants are expected to outrank them by random chance, I can organize a simplistic service to charge losers for access to the ostensible "mentors."

I think this happened with the stock market-- you predict how many mutual fund managers would beat the market by random chance for a given period. Then you find that same (small) number of mutual fund managers who beat the market and switched to a more lucrative career of giving speeches about how to beat the market. :)


Is there some database where you can see predictions of different people and the results? Or are we supposed to rely on them keeping track and keeping themselves honest? Because that is not something humans do generally, and I have no reason to trust any of these 'rationalists'.

This sounds like a circular argument. You started explaining why them giving percentage predictions should make them more trustworthy, but when looking into the details, I seem to come back to 'just trust them'.


Yes, there is: https://manifold.markets/

People's bets are publicly viewable. The website is very popular with these "rationality-ists" you refer to.

I wasn't in fact arguing that giving a prediction should make people more trustworthy; please explain how you got that from my comment. I said that the main benefit of making such predictions is as practice for the predictor themselves. If there's a benefit for readers, it is just that they could come along and say "eh, I think the chance is higher than that". Then they also get practice and can compare how they did when the outcome is known.


>Who are these people

Clowns, mostly. Yudkowsky in particular, whose only job today seems to be making awful predictions and letting LessWrong eat it up when one out of a hundred ends up coming true, solidifying his position as the AI-will-destroy-the-world messiah. They make money from these outlandish takes, and more money when you keep talking about them.

It's kind of like listening to the local drunkard at the bar who once in a while ends up predicting which team is going to win at football in between drunken and nonsensical rants, except that for some reason posting the predictions on the internet makes him a celebrity instead of just a drunk curiosity.


>Who are these people

Be glad you don't know anything about them. Seriously.


ask chatgpt


16% is just a way of saying one in six chances


Or just “twice as likely as the guy who said 8%”.


One of the most worrying trends in AI has been how wrong the experts have been with overestimating timelines.

On the other hand, I think human hubris naturally makes us dramatically overestimate how special brains are.


Those percentages are completely meaningless. No better than astrology.


Off topic, but am I the only one getting triggered every time I see a rationalist quantify their prediction of the future with single digit accuracy? It's like their magic way of trying to get everyone to forget that they reached their conclusion in a completely hand-wavy way, just like every other human being. But instead of saying "low confidence" or "high confidence" like the rest of us normies, they will tell you they think there is a 16.27% chance, because they really really want you to be aware that they know Bayes' theorem.


Interestingly, this is actually a question that's been looked at empirically!

Take a look at this paper: https://scholar.harvard.edu/files/rzeckhauser/files/value_of...

They took high-precision forecasts from a forecasting tournament and rounded them to coarser buckets (nearest 5%, nearest 10%, nearest 33%), to see if the precision was actually conveying any real information. What they found is that if you rounded the forecasts of expert forecasters, Brier scores got consistently worse, suggesting that expert forecast precision at the 5% level is still conveying useful, if noisy, information. They also found that less expert forecasters took less of a hit from rounding their forecasts, which makes sense.

It's a really interesting paper, and they recommend that foreign policy analysts try to increase precision rather than retreating to lumpy buckets like "likely" or "unlikely".

Based on this, it seems totally reasonable for a rationalist to make guesses with single digit precision, and I don't think it's really worth criticizing.
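
To make the paper's procedure concrete, here is a toy sketch with entirely made-up forecasts and outcomes (not the tournament data):

    def brier(p: float, o: int) -> float:
        return (p - o) ** 2

    def round_to(p: float, step: float) -> float:
        return round(p / step) * step

    # (forecast, outcome) pairs -- entirely hypothetical
    data = [(0.03, 0), (0.17, 0), (0.62, 1), (0.94, 1)]
    for step in (None, 0.05, 0.10, 0.33):
        rounded = [(p if step is None else round_to(p, step), o) for p, o in data]
        mean_brier = sum(brier(p, o) for p, o in rounded) / len(rounded)
        label = "unrounded" if step is None else f"nearest {step:.0%}"
        print(f"{label}: mean Brier = {mean_brier:.4f}")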


Likely vs. unlikely is rounding to 50%. Single digit is rounding to 1%. I don't think the parent was suggesting the former is better than the latter. Even before I read your comment I thought that 5% precision is useful but 1% precision is a silly turn-off, unless that 1% is near the 0% or 100% boundary.


The book Superforecasting documented that for their best forecasters, rounding off that last percent would measurably worsen their Brier scores.

Whether rationalists who are publicly commenting actually achieve that level of reliability is an open question. But it has been demonstrated that humans can be reliable enough in the real world that the last percentage point matters.


Your comment is incredibly confusing (possibly misleading) because of the key details you've omitted.

> The book Superforecasting documented that for their best forecasters, rounding off that last percent would measurably worsen their Brier scores.

Rounding off that last percent... to what, exactly? Are you excluding the exceptions I mentioned (i.e. when you're already close to 0% or 100%?)

Nobody is arguing that 3% -> 4% is insignificant. The argument is over whether 16% -> 15% is significant.


To the nearest 5%, for percentages in that middle range. It is not just 16% -> 15%. But also 46% -> 45%.


Yes so this confirms my point rather than refuting it...


It seems that you reversed your point then. You said before:

> Even before I read your comment I thought that 5% precision is useful but 1% precision is a silly turn-off, unless that 1% is near the 0% or 100% boundary.

However what I am saying is that there is real data, involving real predictions, by real people, that demonstrates that there is a measurable statistical loss of accuracy in their predictions if you round off those percentages.

This doesn't mean that any individual prediction is accurate to that percent. But it happens often enough that the last percent really does contain real value.


The most useful frame here is looking at log odds. Going from 15% -> 16% means

-log_2(.15/(1-.15)) -> -log_2(.16/(1-.16))

=

2.5 -> 2.39

So saying 16% instead of 15% implies an additional tenth of a bit of evidence in favor (alternatively, 16/15 ~= 1.07 ~= 2^.1).

I don't know if I can weigh in on whether humans should drop a tenth of a bit of evidence to make their conclusion seem less confident. In software (eg. spam detector), dropping that much information to make the conclusion more presentable would probably be a mistake.
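
The same arithmetic as a small code sketch:

    import math

    def bits_against(p: float) -> float:
        # -log2 of the odds: positive values are bits of evidence against the event.
        return -math.log2(p / (1.0 - p))

    print(bits_against(0.15))  # ~2.50
    print(bits_against(0.16))  # ~2.39 -> roughly a tenth of a bit more evidence in favor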


I thought single digit means single significant digit, aka rounding to 10%?


I did mean 1%; I'm not sure if I used the right term though, English not being my first language.


Wasn't 16% the example they were talking about? Isn't that two significant digits?

And 16% very much feels ridiculous to a reader when they could've just said 15%.


In context, the "at least 16%" is responding to someone who said 8%, and 16 just happens to be exactly twice 8. I suspect (though I don't know) that Yudkowsky would not have claimed to have a robust way to pick whether 16% or 17% was the better figure.

For what it's worth, I don't think there's anything even slightly wrong with using whatever estimate feels good to you, even if it happens not to fit someone else's criterion for being a nice round number, even if your way of getting the estimate was sticking a finger in the air and saying the first number you thought of. You never make anything more accurate by rounding it[1], and while it's important to keep track of how precise your estimates are I think it's a mistake to try to do that by modifying the numbers. If you have two pieces of information (your best estimate, and how fuzzy it is), you should represent it as two pieces of information[2].

[1] This isn't strictly true, but it's near enough.

[2] Cf. "Pitman's two-bit rule".


> In context, the "at least 16%" is responding to someone who said 8%, and 16 just happens to be exactly twice 8. I suspect (though I don't know) that Yudkowsky would not have claimed to have a robust way to pick whether 16% or 17% was the better figure.

If this was just a way to say "at least double that", that's... fair enough, I guess.

Regarding your other point:

> For what it's worth, I don't think there's anything even slightly wrong with using whatever estimate feels good to you, even if it happens not to fit someone else's criterion for being a nice round number

This is completely missing the point. There absolutely is something wrong with doing this (barring cases like the above where it was just a confusing phrasing of something with less precision like "double that"). The issue has nothing to do with being "nice", it has to do with the significant figures and the error bars.

If you say 20% then it is understood that your error margin is 5%. Even those that don't understand sigfigs still understand that your error margin is < 10%.

If you say 19% then suddenly the understanding becomes that your error margin is < 1%. Nobody is going to see that and assume your error bars on it are 5% -- nobody. Which is what makes it a ridiculous estimate. This has nothing to do with being "nice and round" and everything to do with conveying appropriate confidence.


I'm not missing the point, I'm disagreeing with it. I am saying that the convention that if you say 20% you are assumed to have an error margin of 5%, while if you say 19% you are assumed to have an error margin of 1%, is a bad convention. It gives you no way to say that the number is 20% with a margin of 1%. It gives you only a very small set of possible degrees of uncertainty. It gives you no way to express that your best estimate is actually somewhat below 20% even though you aren't sure it isn't 5% out.

It's true, of course, that if you are talking to people who are going to interpret "20%" as "anywhere between 17.5% and 22.5%" and "19%" as "anywhere between 18.5% and 19.5%", then you should try to avoid giving not-round numbers when your uncertainty is high. And that many people do interpret things that way, because although I think the convention is a bad one it's certainly a common one.

But: that isn't what happened in the case you're complaining about. It was a discussion on Less Wrong, where all the internet-rationalists hang out, and where there is not a convention that giving a not-round number implies high confidence and high precision. Also, I looked up what Yudkowsky actually wrote, and it makes it perfectly clear (explicitly, rather than via convention) that his level of uncertainty was high:

"Ha! Okay then. My probability is at least 16%, though I'd have to think more and Look into Things, and maybe ask for such sad little metrics as are available before I was confident saying how much more."

(Incidentally, in case anyone's similarly salty about the 8% figure that gives context to this one: it wasn't any individual's estimate, it was a Metaculus prediction, and it seems pretty obvious to me that it is not an improvement to report a Metaculus prediction of 8% as "a little under 10%" or whatever.)


My interpretation was that Yudkowsky simply doubled Christiano's guess of 8% (as one might say in conversation "oh, it's at least double that", but using the actual number).


Aim small, miss small?


Would you also get triggered if you saw people make a bet at, say, $24 : $87 odds? Would you shout: "No! That's too precise, you should bet $20 : $90!"? For that matter, should all prices in the stock market be multiples of $1, (since, after all, fluctuations of greater than $1 are very common)?

If the variance (uncertainty) in a number is large, the correct thing to do is to just also report the variance, not to round the mean to a whole number.

Also, in log odds, the difference between 5% and 10% is about the same as the difference between 40% and 60%. So using an intermediate value like 8% is less crazy than you'd think.

People writing comments in their own little forum where they happen not to use sig-figs to communicate uncertainty is probably not a sinister attempt to convince "everyone" that their predictions are somehow scientific. For one thing, I doubt most people are dumb enough to be convinced by that, even if it were the goal. For another, the expected audience for these comments was not "everyone", it was specifically people who are likely to interpret those probabilities in a Bayesian way (i.e. as subjective probabilities).


> Would you also get triggered if you saw people make a bet at, say, $24 : $87 odds? Would you shout: "No! That's too precise, you should bet $20 : $90!"? For that matter, should all prices in the stock market be multiples of $1, (since, after all, fluctuations of greater than $1 are very common)?

No.

I responded to the same point here: https://news.ycombinator.com/item?id=44618142

> correct thing to do is to just also report the variance

And do we also pull this one out of thin air?

Using precise numbers to convey extremely imprecise and ungrounded opinions is imho wrong and, to me, unsettling. I'm pulling this purely out of my ass, and maybe I am making too much out of it, but I feel this is in part what is causing the many cases of very weird, and borderline asocial/dangerous, behaviours of some people associated with the rationalist movement. When you try to precisely quantify what cannot be quantified, and start trusting those numbers too much, you can easily be led to trust your conclusions way too much. I am 56% confident this is a real effect.


I mean, sure, people can use this to fool themselves. I think usually the cause of someone fooling themselves is "the will to be fooled", and not so much the fact that they used precise numbers in their internal monologue as opposed to verbal buckets like "pretty likely" or "very unlikely". But if your 56% estimate sometimes actually makes a difference, then who am I to argue? Sounds super accurate to me. :)

In all seriousness, I do agree it's a bit harmful for people to use this kind of reasoning but only ever practice it on things like AGI that will not be resolved for years and years (and maybe we'll all be dead when it does get resolved). Like ideally you'd be doing hand-wavy reasoning with precise probabilities about whether you should bring an umbrella on a trip, or apply for that job, etc. Then you get to practice with actual feedback and learn how not to make dumb mistakes while reasoning in that style.

> And do we also pull this one out of thin air?

That's what we do when training ML models sometimes. We'll have the model make a Gaussian distribution by supplying both a mean and a variance. (Pulled out of thin air, so to speak.) It has to give its best guess of the mean, and if the variance it reports is too small, it gets penalized accordingly. Having the model somehow supply an entire probability distribution is even more flexible (and even less communicable by mere rounding). Of course, as mentioned by commenter danlitt, this isn't relevant to binary outcomes anyways, since the whole distribution is described by a single number.
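
For instance, a minimal sketch of that penalty, with made-up numbers (real training code would use a framework's built-in Gaussian NLL loss):

    import math

    def gaussian_nll(mean: float, var: float, observed: float) -> float:
        # Negative log-likelihood of a Gaussian: reporting too small a variance
        # blows up the squared-error term, so overconfidence is penalized.
        return 0.5 * (math.log(2 * math.pi * var) + (observed - mean) ** 2 / var)

    print(gaussian_nll(mean=2.0, var=0.1, observed=3.0))  # ~4.77 (overconfident)
    print(gaussian_nll(mean=2.0, var=1.0, observed=3.0))  # ~1.42 (honest uncertainty, lower loss)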


> and not so much that fact that they used precise numbers in the their internal monologue as opposed to verbal buckets like "pretty likely", "very unlikely"

I am obviously only talking from my personal anecdotal experience, but having been on a bunch of coffee chats in the last few months with people in the AI safety field in SF, a lot of them LessWrong-ers, I have experienced many of those discussions where random percentages get thrown out in succession to estimate the final probability of some event. Even though I have worked in ML for 10+ years (so I would guess I am more aware of what a Bayesian probability is than the average person), I often find myself swayed by whatever number comes out at the end and have to consciously take a step back and stop myself from instinctively trusting this random number more than I should. I would not need to pull myself back, I think, if we were using words instead of precise numbers.

It could just be a personal mental weakness with numbers that is not general, but looking at my interlocutors' emotional reactions to their own numerical predictions, I do feel quite strongly that this is a general human trait.


> It could just be a personal mental weakness with numbers that is not general, but looking at my interlocutors' emotional reactions to their own numerical predictions, I do feel quite strongly that this is a general human trait.

Your feeling is correct; anchoring is a thing, and good LessWrongers (I hope to be in that category) know this and keep track of where their prior and not just posterior probabilities come from: https://en.wikipedia.org/wiki/Anchoring_effect

They probably don't in practice, but they should. That "should" is what puts the "less" into "less wrong".


Ah thanks for the link, yes this is precisely the bias I am feeling falling victim to if not making an effort to counter it.


> If the variance (uncertainty) in a number is large, correct thing to do is to just also report the variance

I really wonder what you mean by this. If I put my finger in the air and estimate the emergence of AGI as 13%, how do I get at the variance of that estimate? At face value, it is a number, not a random variable, and does not have a variance. If you instead view it as a "random sample" from the population of possible estimates I might have made, it does not seem well defined at all.


I meant in a general sense that it's better when reporting measurements/estimates of real numbers to report the uncertainty of the estimate alongside the estimate, instead of using some kind of janky rounding procedure to try and communicate that information.

You're absolutely right that if you have a binary random variable like "IMO gold by 2026", then the only thing you can report about its distribution is the probability of each outcome. This only makes it even more unreasonable to try and communicate some kind of "uncertainty" with sig-figs, as the person I was replying to suggested doing!

(To be fair, in many cases you could introduce a latent variable that takes on continuous values and is closely linked to the outcome of the binary variable. Eg: "Chance of solving a random IMO problem for the very best model in 2025". Then that distribution would have both a mean and a variance (and skew, etc), and it could map to a "distribution over probabilities".)


No, you are right, this hyper-numericalism is just astrology for nerds.


The whole community is very questionable, at best. (AI 2027, etc.)


In the military they estimate distances this way if they don't have proper tools. Each person gives a min-max range, and the value where the estimates overlap most is taken. It's a reasonable way to make quick, intuition-based decisions when no other method is available.


If you actually try to flesh out the reasoning behind the distance estimation strategy it will turn out 100x more convincing than the analogous argument for bayesian probability estimates. (and for any bayesians reading this, please don't multiply the probability by 100)


> But instead of saying "low confidence" or "high confidence" like the rest of us normies

To add to what tedsanders wrote: there's also research that shows verbal descriptions, like those, mean wildly different things from one person to the next: https://lettersremain.com/perceptions-of-probability-and-num...



If you take it with a grain of salt it's better than nothing. In life, sometimes the best way to express your opinion is to quantify it based on intuition. To make decisions you could compile multiple experts' intuitive estimates and take the median or similar. There are some cases where it's more straightforward and rote, e.g. in the military, if you have to make distance-based decisions, you might ask 8 of your soldiers to each name the number they think the distance is and take the median.


No you’re definitely not the only one… 10% is ok, 5% maybe, 1% is useless.

And since we’re at it: why not give confidence intervals too?


>Off topic, but am I the only one getting triggered every time I see a rationalist

The rest of the sentence is not necessary. No, you're not the only one.


You could look at 16% as roughly equivalent to a dice roll (1 in 6) or, you know, the odds you lose a round of Russian roulette. That's my charitable interpretation at least. Otherwise it does sound silly.


There is no honor in hiding behind euphemisms. Rationalists say ‘low confidence’ and ‘high confidence’ all the time, just not when they're making an actual bet and need to directly compare credences. And the 16.27% mockery is completely dishonest. They used less than a single significant figure.


> just not when they're making an actual bet

That is not my experience talking with rationalists irl at all. And that is precisely my issue, it is pervasive in every day discussion about any topic, at least with the subset of rationalists I happen to cross paths with. If it was just for comparing ability to forecast or for bets, then sure it would make total sense.

Just the other day I had a conversation with someone about working in AI safety. It went something like: "well, I think there is a 10 to 15% chance of AGI going wrong, and if I join I have maybe a 1% chance of being able to make an impact, and if... and if... and if, so if we compare with what I'm missing by not going to <biglab> instead, I have 35% confidence it's the right decision".

What makes me uncomfortable with this is that by using this kind of reasoning and coming out with a precise figure at the end, you cognitively bias yourself into being more confident in your reasoning than you should be, because we are all used to treating numbers as the output of a deterministic, precise, scientific process.

There is no reason to say 10% or 15% and not 8% or 20% for rogue AGI, and there is no reason to think one individual can change the direction by 1% rather than by 0.3% or 3%. It's all just random numbers, so when you multiply a gut-feeling number by a gut-feeling number 5 times in a row, you end up with something absolutely meaningless, where the margin of error is basically 100%.

But it somehow feels more scientific and reliable because it's a precise number, and I think this is dishonest and misleading both to the speaker themselves and to listeners. "Low confidence", or "im really not sure but I think..." have the merit of not hiding a gut feeling process behind a scientific veil.

To be clear, I'm not saying you should never use numbers to try to quantify gut feeling; it's ok to say "I think there is maybe a 10% chance of rogue AGI and thus I want to do this or that". What I really don't like is the stacking of multiple random predictions and then trying to reason seriously from the result.

> And the 16.27% mockery is completely dishonest.

Obviously satire


I wonder if what you observe is a direct effect of the rationalist movement worshipping the god of Bayes.


Yes


How do you explain Grok 4 achieving new SOTA on ARC-AGI-2, nearly doubling the previous commercial SOTA?

https://x.com/arcprize/status/1943168950763950555


They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions.

What I've noticed when testing previous versions of Grok is that on paper they were better at benchmarks, but when I used them the responses were always worse than Sonnet and Gemini, even though Grok had higher benchmark scores.

Occasionally I test Grok to see if it could become my daily driver but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.


> They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions.

That's kind of the idea behind ARC-AGI. Training on available ARC benchmarks does not generalize. Unless it does... in which case, mission accomplished.


It still seems possible to put effort into building up an ARC-style dataset, and that would game the test. The ARC questions I saw were not on some completely unknown topic; they were generally hard versions of existing problems in well-known domains. I'm not super familiar with this area in general, though, so I would be curious if I'm wrong.


ARC-AGI isn't question- or knowledge-based, though, but "Infer the pattern and apply it to a new example you haven't seen before." The problems are meant to be easy for humans but hard for ML models, like a next-level CAPTCHA.

They have walked back the initial notion that success on the test requires, or demonstrates, the emergence of AGI. But the general idea remains, which is that no amount of pretraining on the publicly-available problems will help solve the specific problems in the (theoretically-undisclosed) test set unless the model is exhibiting genuine human-like intelligence.

Getting almost 16% on ARC-AGI-2 is pretty interesting. I wish somebody else had done it, though.


I’ve seen some of the problems before, like https://o3-failed-arc-agi.vercel.app/

It is not hard to build datasets that contain these types of problems, and I would expect LLMs to generalize well from them. I don't really see how this is any different from any other type of problem LLMs are good at, given they have the dataset to study.

I get that they keep the test updated with secret problems, but I don't see how companies can't game this just by investing in building their own datasets, even if it means paying teams of smart people to generate them.


The other question is whether enough examples of this type of task are helpful and generalizable in some way. If so, why wouldn't you integrate such a dataset into the training pipeline of an LLM?


I use Grok with repomix to review my code, and it tends to give decent answers and is a bit better at giving actual, actionable issues with code examples than, say, Gemini 2.5 Pro.

But the lack of a CLI tool like codex, claude code or gemini-cli is preventing it from being a daily driver. Launching a browser and having to manually upload repomixed content is just blech.

With gemini I can just go `gemini -p "@repomix-output.xml review this code..."`


Well try it again and report back.


As I said, either by benchmark contamination (the set is semi-private and could have been obtained by people from other companies whose models have been benchmarked) or by having more compute.


I still don't understand why people point to this chart as having any sort of meaning. Cost per task is a fairly arbitrary X axis and in no way represents any sort of time scale. I would love to be told how they didn't underprice their model and give it an arbitrary amount of time to work.


"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%."

"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."

https://x.com/arcprize/status/1943168950763950555


Quoting Chollet:

>I have repeatedly said that "can LLM reason?" was the wrong question to ask. Instead the right question is, "can they adapt to novelty?".

https://x.com/fchollet/status/1866348355204595826

