This is the best AGI benchmark out there in my opinion. Surprising results that unde...

krackers · on Sept 14, 2024

If ARC-AGI were a good benchmark for "AGI", then MindsAI should effectively be blowing away current frontier models by an order of magnitude. I don't know what MindsAI is, but the post implies they're basically fine-tuning or using a very specific strategy for ARC-AGI that isn't really generalizable to other tasks.

I think it's a nice benchmark of a certain type of spatial/visual intelligence, but if you have a model or technique specifically fine-tuned for ARC-AGI, then it's no longer A"G"I.

drdeca · on Sept 14, 2024

Perhaps a benchmark could be a good approximate upper bound for something without being a good approximate lower bound for that thing?

alphabetting · on Sept 14, 2024

I clarified in another post that I mean for benchmarking standalone models, not ones fine-tuned for solving ARC.

nightski · on Sept 14, 2024

I mean, there are a lot of tasks that frontier models excel at which many humans wouldn't be able to complete.

zone411 · on Sept 13, 2024

Disagree. My opinion is that solving ARC-AGI won't get us any closer to AGI and it's mostly a distraction.

typon · on Sept 14, 2024

I think solving ARC-AGI will be necessary but not sufficient. My bet is that we will not see the reverse case: a model that is considered "AGI" but does poorly on ARC-AGI. So in that sense, I think this is an important benchmark.

ithkuil · on Sept 14, 2024

One of the key aspects of ARC is that its testing dataset is secret.

The usefulness of the ARC challenge is in figuring out how much of the "intelligence" that current models trained on the entire internet display is an emergent property and true generalization, and how much is simply because the training set contains an unfathomable number of examples, so that the models may surprise us with what appears to be genuine insight but is actually just lookup + interpolation.

meowface · on Sept 14, 2024

I mostly agree, but I think it's fair to say that ARC-AGI is a necessary but definitely not sufficient milestone when it comes to the evaluation of a purported AGI.

alphabetting · on Sept 13, 2024

How so? I think if a team is fine-tuning specifically to beat ARC that could be true, but when you look at Sonnet and o1 getting 20%, I think a standalone frontier model beating it would mean we are close to or already at AGI.

authorfly · on Sept 14, 2024

The creation and iteration of ARC has been designed in part to avoid this.

Francis talks in his "mid-career" work (2015-2019) about priors for general intelligence and how to avoid allowing them. While he admits ARC allows for some priors, it was his best reasonable human effort in 2019 to put together an extremely prior-less training set, as he explained on podcasts around that time (e.g. Lex Fridman). The point of this is that humans, with our priors, are able to reliably get the majority of the puzzles correct, and with time, we can even correct mistakes or recognise mistakes in submissions without feedback (I am expanding on his point a little here based on conference conversations, so don't take this as his position, or at least his position today).

100 different humans will even get very different items correct/incorrect.

The problem with AI getting 21% correct is that, if it always gets the same 21% correct, it means for 79% of prior-less problems, it has no hope as an intelligent system.

A group of 10000 humans, on the other hand, could obviously get 99% or 100% correct, even though in all likelihood none of them has priors for every puzzle, given that individual humans don't tend to get them all right (and, well, because Francis created 100% of them!).
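
To make the arithmetic behind this concrete, here is a rough sketch. It is only an illustration under a loud assumption: each solver in the group gets each puzzle right independently with the same probability, which real humans certainly do not. The function names, the 400-puzzle count and the group size of 100 are all made up for the example. It contrasts a group whose members miss different puzzles with a single solver that always gets the same subset right.

```python
import random


def group_coverage(n_solvers, n_puzzles, p_correct, seed=0):
    """Fraction of puzzles solved by at least one member of the group.

    Assumption (not from the thread): every solver gets every puzzle
    right independently with probability p_correct.
    """
    rng = random.Random(seed)
    solved = set()
    for _ in range(n_solvers):
        for puzzle in range(n_puzzles):
            if rng.random() < p_correct:
                solved.add(puzzle)
    return len(solved) / n_puzzles


def fixed_solver_coverage(n_runs, n_puzzles, p_correct):
    """Coverage of a solver that always gets the same puzzles right.

    Rerunning it any number of times adds nothing new, so the union
    of its successes stays at its single-run score.
    """
    always_solved = set(range(round(n_puzzles * p_correct)))
    solved = set()
    for _ in range(n_runs):
        solved |= always_solved
    return len(solved) / n_puzzles


def expected_group_coverage(n_solvers, p_correct):
    """Closed form under the same independence assumption: 1 - (1 - p)^n."""
    return 1 - (1 - p_correct) ** n_solvers


if __name__ == "__main__":
    print("group of 100, p=0.21:", group_coverage(100, 400, 0.21))
    print("same solver, 100 runs:", fixed_solver_coverage(100, 400, 0.21))
    print("closed form:", expected_group_coverage(100, 0.21))
```

Under that independence assumption, even solvers who individually score 21% cover nearly everything as a group, while the solver with a fixed 21% stays at 21% however often it is rerun. The gap is between varied failures and identical failures, not between the individual scores.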

The goal of ARC, as I understood it in 2019, is not to create a single model that gets a majority correct. To show AGI, it has to be an intelligent system that can handle prior or prior-less situations as well as a group of humans, on diverse and unseen test sets, ideally without any fine-tuning or training specifically on this task at all.

From 2019 (I read his paper when it came out, believe it or not!), he held a secret set that he alone has and that I believe is still unpublished. At the time, the low number of items (hundreds) was designed to prevent effective fine-tuning (then 'training'), but nowadays few-shot training shows that on-the-spot training is clearly possible. That is why, in talks Francis gave, I remember him positing that any advances in short-term learning via examples should be ignored, e.g. each example should be zero-shot, which I believe is how most benchmarks are currently done. The puzzles are all "different in different ways" besides the common element of dynamic grids and providing multiple grids as input.

It's also key to know Francis was quite avant-garde in 2019: his work was of course respected, but he became more prominent recently. He took a very bullish/optimistic position on AI advances at the time (no doubt based on Keras and seeing transformers trained using it), but he has been proven right.

glial · on Sept 14, 2024

Is that mainly because AGI is one of those "I'll know it when I see it" things?