If ARC-AGI were a good benchmark for "AGI", then MindsAI should effectively be blowing away current frontier models by an order of magnitude. I don't know what MindsAI is, but the post implies they're basically fine-tuning or using a very specific strategy for ARC-AGI that isn't really generalizable to other tasks.
I think it's a nice benchmark of a certain type of spatial/visual intelligence, but if you have a model or technique specifically fine-tuned for ARC-AGI then it's no longer A"G"I.
I think solving ARC-AGI will be necessary but not sufficient. My bet is that we won't see the reverse case: a model that is considered "AGI" but does poorly on ARC-AGI. So in that sense, I think this is an important benchmark.
One of the key aspects of ARC is that its testing dataset is secret.
The usefulness of the ARC challenge is in figuring out how much of the "intelligence" shown by current models trained on the entire internet is an emergent property and true generalization, and how much is just due to the training set containing an unfathomable number of examples, so that the models surprise us with what appears to be genuine insight but is actually just lookup + interpolation.
I mostly agree, but I think it's fair to say that ARC-AGI is a necessary but definitely not sufficient milestone when it comes to the evaluation of a purported AGI.
How so? I think if a team is fine-tuning specifically to beat ARC that could be true, but when you look at Sonnet and o1 getting 20%, I think a standalone frontier model beating it would mean we are close to or already at AGI.
The creation and iteration of ARC has been designed in part to avoid this.
François talks in his "mid-career" work (2015-2019) about priors for general intelligence and about not allowing them into the test. While he admits ARC allows for some priors, it was his best reasonable human effort in 2019 to put together an extremely prior-less training set, as he explained on podcasts around that time (e.g. Lex Fridman). The point of this is that humans, with our priors, are able to reliably get the majority of the puzzles correct, and with time we can even recognise or correct mistakes in submissions without feedback (I am expanding on his point a little here based on conference conversations, so don't take this as his position, or at least his position today).
100 different humans will even get very different items correct/incorrect.
The problem with AI getting 21% correct is that, if it always gets the same 21% correct, it means for 79% of prior-less problems, it has no hope as an intelligent system.
Humans, on the other hand: a group of 10,000 could obviously get 99% or 100% correct between them, even though in all likelihood none of them has the priors for every puzzle, given that individual humans don't tend to get them all right (and, well, because François created 100% of them!).
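To make the "same 21%" vs "group of humans" point concrete, here is a rough sketch of the arithmetic. Everything in it is an assumption for illustration: 100 hypothetical puzzles, 100 simulated people who each solve any given puzzle independently with probability 0.6, and a model that solves the same 21 puzzles on every run. None of these numbers come from ARC itself.

```python
import random

N_ITEMS = 100  # hypothetical number of puzzles, not the real ARC set size


def coverage(solved_sets, n_items=N_ITEMS):
    """Fraction of items solved by at least one solver in the group."""
    solved_by_anyone = set().union(*solved_sets)
    return len(solved_by_anyone) / n_items


def simulate_humans(n_people, p_correct, n_items=N_ITEMS, seed=0):
    """Each person solves each item independently with probability p_correct.

    Independence is a simplifying assumption; real solvers are correlated.
    """
    rng = random.Random(seed)
    return [
        {i for i in range(n_items) if rng.random() < p_correct}
        for _ in range(n_people)
    ]


def simulate_fixed_model(n_runs, n_solved=21):
    """A model that solves exactly the same items on every run."""
    fixed = set(range(n_solved))
    return [set(fixed) for _ in range(n_runs)]


humans = simulate_humans(n_people=100, p_correct=0.6)
model_runs = simulate_fixed_model(n_runs=100)

human_cov = coverage(humans)      # union over people who differ in what they solve
model_cov = coverage(model_runs)  # union over identical runs never grows

print(f"group of humans: {human_cov:.2f}, repeated model runs: {model_cov:.2f}")
```

Under these assumptions no single person gets everything right, yet the group's combined coverage approaches 100%, while repeating the model adds nothing beyond its fixed 21%.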
The goal of ARC, as I understood it in 2019, is not to create a single model that gets a majority correct. To show AGI, it has to be an intelligent system that can handle prior or prior-less situations as well as a group of humans, on diverse and unseen test sets, ideally without any fine-tuning or training specifically on this task at all.
From 2019 (I read his paper when it came out, believe it or not!), he has held a secret set that he alone has and that I believe is still unpublished. At the time, the low number of items (hundreds) was designed to prevent effective fine-tuning (then just 'training'), but nowadays few-shot training shows that on-the-spot training is clearly possible. That is why, in talks François gave, I remember him positing that any advances in short-term learning via examples should be ignored, e.g. each example should be zero-shot, which I believe is how most benchmarks are currently done. The puzzles are all "different in different ways" besides the common element of dynamic grids and providing multiple grids as input.
It's also key to know François was quite avant-garde in 2019: his work was of course respected, but he became more prominent recently. He took a very bullish/optimistic position on AI advances at the time (no doubt based on Keras and seeing transformers trained using it), but he has been proven right.