
How so? If a team were fine-tuning specifically to beat ARC, that could be true, but when you look at Sonnet and o1 getting 20%, I think a standalone frontier model beating it would mean we are close to, or already at, AGI.


The creation and iteration of ARC have been designed in part to avoid this.

François Chollet talks in his "mid-career" work (2015-2019) about priors for general intelligence and about avoiding benchmarks that reward them. While he admits ARC allows for some priors, it was at the time his best reasonable human effort (in 2019) to put together an extremely prior-less training set, as he explained on podcasts around that time (e.g. Lex Fridman). The point is that humans, with our priors, are able to reliably get the majority of the puzzles correct, and with time we can even recognise and correct mistakes in our submissions without feedback (I am expanding on his point a little here based on conference conversations, so don't take this as his position, or at least his position today).

100 different humans will even differ considerably on which items they get correct or incorrect.

The problem with an AI getting 21% correct is that, if it always gets the same 21% correct, then for the other 79% of prior-less problems it has no hope as an intelligent system.

Humans, on the other hand: a group of 10,000 could almost certainly get 99% or 100% correct, even though in all likelihood no single one of them has the priors for every puzzle, given that individual humans don't tend to get them all right (and, well, Chollet created 100% of them!).
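To make that coverage point concrete, here is a toy back-of-the-envelope sketch (my own simplifying assumption, not Chollet's: it treats each person's misses as independent and random across tasks):

    # Toy estimate: if each human solves a random ~80% of the tasks,
    # what fraction is solved by at least one person in a group of n?
    def group_coverage(p_solve: float, n: int) -> float:
        return 1 - (1 - p_solve) ** n

    for n in (1, 5, 10, 100):
        print(n, round(group_coverage(0.8, n), 7))
    # 1 -> 0.8, 5 -> 0.99968, 10 -> 0.9999999, 100 -> ~1.0
    # Contrast: a model that always solves the *same* 21% gains no
    # extra coverage from running more copies of itself.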

The goal of ARC, as I understood it in 2019, is not to create a single model that gets a majority correct. To show AGI, it has to be an intelligent system that can handle prior-laden or prior-less situations as well as a group of humans can, on diverse and unseen test sets, ideally without any fine-tuning or training specifically on this task at all.

From 2019 (I read his paper when it came out, believe it or not!), he has held a secret set that he alone has and that I believe is still unpublished. At the time, the low number of items (hundreds) was designed to prevent effective fine-tuning (then just called 'training'), but nowadays few-shot prompting shows that on-the-spot learning is clearly possible. That is why, in talks Chollet gave, I remember him positing that any advances in short-term learning via examples should be ignored, i.e. each example should be treated zero-shot, which I believe is how most benchmarks are currently run. The puzzles are all "different in different ways"; the common elements are dynamic grids and multiple grids provided as input.
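For context, each public ARC task is just a small JSON file of paired example grids plus held-out test grids; a minimal loader looks roughly like this (the filename is illustrative, and the structure is as I recall it from the public fchollet/ARC repo):

    import json

    # One ARC task: {"train": [{"input": grid, "output": grid}, ...],
    #                "test":  [{"input": grid, "output": grid}, ...]}
    # where each grid is a list of rows of integers 0-9 (colours),
    # and grid shapes can differ between pairs and between tasks.
    with open("data/training/0a1b2c3d.json") as f:  # illustrative path
        task = json.load(f)

    for pair in task["train"]:  # usually only a handful of demonstration pairs
        inp, out = pair["input"], pair["output"]
        print(len(inp), "x", len(inp[0]), "->", len(out), "x", len(out[0]))

    # A solver sees only the "train" pairs and must predict the "output"
    # grids for the "test" inputs of that same task: zero-shot with
    # respect to the task, few-shot only within it.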

It's also key to know that Chollet was quite avant-garde in 2019: his work was of course respected, but he only became more prominent recently. He took a very bullish/optimistic position on AI advances at the time (no doubt informed by Keras and seeing transformers trained with it), but he has been proven right.



