Anyway, that doesn't refute my point; it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific," but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.
Unfortunately we can only go off their word, and they say no formal math, so I assume it's being evaluated by a verifier model instead of a formal system. There are actually some hints of this: geometry in Lean is not that well developed, so unless they also built their own system it's hard to do it formally (though their P2 proof is a coordinate bash, i.e. computation by algebra instead of geometric construction, so it's hard to tell).
In general I agree with you, but I see the point of requiring proof for their statements instead of accepting them at face value. Given previous experience, and considering that they benefit if those statements are believed, shouldn't the burden of proof be on those making the claims rather than on those questioning them?
Those models seem to be special and not part of their normal product line, as is pointed out in the comments here. In that case I would assume they were indeed created with the purpose of passing these tests in mind. Or were they created for something else, and the team discovered by chance that they could be used for the challenge?
You don't need specialized tooling like Lean if you have enough training data with statements written in natural language, I suppose. But the use of AlphaProof/AlphaGeometry-style learning is almost certain. And I'm sure they spent a lot of compute to produce solutions; $10k is not a problem for them.
The bigger question is: why should anyone be excited by this if they don't plan to share anything related to this AI model back with humanity?
Look back at expectations at the beginning, and you'll see that nearly everybody was predicting a massive recession and even hyperinflation. Balaji bet $1M that there would be hyperinflation.
Yet that didn't happen. We dodged a major bullet and survived far better than the rest of the world. We must look back at predictions and outcomes with clear eyes, not through the narratives being sold today.
On one hand, Pichai paid $2.7B to get one guy back. On the other hand, he laid off 200 Core devs and "relocated roles" to India and Mexico [1]. The duality of Pichai-style management.
You are claiming they are statistical parrots, which I don’t think the parent poster meant.
The “statistical parrots” argument might have been compelling with GPT-3, but not with today’s models and the results of mechanistic interpretability research, which show internal representations and rudimentary world models.
The issue here is that even with a lot of VRAM you may be able to run the model, but with a large context it will still be too slow (for example, prompt processing for LLaMA 70B with a 30k+ token context can take minutes).
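A rough back-of-envelope sketch of why prefill gets slow, assuming the common approximation of ~2 * n_params FLOPs per token for a dense transformer forward pass; the throughput figures below are illustrative assumptions, not measurements:

    # Back-of-envelope estimate of prompt-processing (prefill) time.
    # Assumption: a dense transformer needs roughly 2 * n_params FLOPs per token,
    # and "effective_tflops" is whatever throughput your setup actually sustains.

    def prefill_seconds(n_params: float, n_tokens: int, effective_tflops: float) -> float:
        flops = 2 * n_params * n_tokens            # total FLOPs to process the prompt
        return flops / (effective_tflops * 1e12)   # seconds at the given throughput

    # 70B model, 30k-token prompt:
    print(prefill_seconds(70e9, 30_000, 100))  # ~42 s  (hypothetical fast GPU, ~100 effective TFLOP/s)
    print(prefill_seconds(70e9, 30_000, 5))    # ~840 s (~14 min, e.g. with heavy CPU offload)

The point is that the prompt cost scales linearly with context length and model size, so once part of the model is offloaded off the GPU the effective throughput drops and "minutes" for a 30k-token prompt is entirely plausible.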