Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Sample solution for Balkan MO 2023 seems .. questionable?

The problem involves players removing stones sequentially and asking which will win with perfect play: the listed answer definitely doesn’t list all possible types of strategies.

The answer it gives may be right; in fact I bet it is correct (the second player), but does the qwen team offer the solution as correct including the logic? And is the solution logic correct?



Using human language models for math problems is a bad idea for exactly this reason: judging correctness of proofs in English is subjective, whereas proofs in Lean are objective and easy to machine-check. Deepmind's AlphaProof was so remarkable precisely because it spoke Lean, so its proofs were verifiable. Why waste time on anything else?


The real answer in this case is that AlphaProof is a massively more complex training pipeline than the Qwen team’s pipeline. Sadly. Agreed that integrating Lean makes a ton of sense; it’s just probably an order of magnitude easier to do it inside DeepMind than elsewhere right now.


I would think of two:

1) Much easier to state the problem and basically all the knowledge we have of math is not in the form of Lean proofs

2) It can be applied to a much broader range of domains, math is kinda unique as in most cases verifying something is 100% correct is impossible (and getting such a signal for RL)


1) You could train a different language model to translate between Lean and English. Use AlphaProof to do the hard work of theorem proving; use the translation model for interpretability and as a user interface. In fact, such a system could be ideal for formalizing a hundred years of English-language proofs.

2) This model is already specialized for math; applying it to other domains is out of scope. And as you point out, speaking Lean (and thus being verifiable) gives you an RL reward signal that's way more precise and readily available than piddly RLHF from human reviewers. Gyms like this are where AGI will happen.

(P.S. if anyone wants to work with me on reproducing AlphaProof, hit me up on Discord.)


Hi! I've been working on theorem proving systems for some time now. I would love to help out with an AlphaProof reproduction, but I can't reach you on discord for some reason!


ack! try again, I forgot to update my account name since Discord got rid of # tags. I also put my email as a fallback.


1) Then more math should get formalised in lean.

2) How is a solution by LLMs supposed to be verified without such a formalisation?


I was going to write the same thing. I checked the first three problems and all solutions are partial at best. Now, don't get me wrong, this is still impressive. But putting the problems there with the implication that qwen solves them correctly when it doesn't does not really inspire trust


Same with the Lusophon Mathematical Olympiad 2023 example.

It makes a bunch of specious claims about parity. (Adding 4 to a number changes the parity? Therefore, the parity of each colour always changes for each turn? Therefore, the parity will always be the same as it was initially?) And then it concludes that since the parity is right, 2022 of each colour must be a reachable state.

Which, as you say, is quite possible the correct answer, but it's really weird to put it out there as an example with no comment on the reasoning.


This is my experience doing RLHF for math for LLMs. So many times it’s a partial answer. It may find the minimal case for example, but it doesn’t say why it’s the minimal one, even when the prompt specially asks for the why.


It's complete nonsense. "The total number of moves in the game is equal to the number of stones initially in the pile, which is 5000."

Similarly on the Martian question "If we transform 1 red and 1 green Martian, we get 4 blue Martians. This changes the parity of red and green Martians from even to odd, and the parity of blue Martians from even to odd." - is complete nonsense too.

"the sum of the digits of a number k modulo 2 is equivalent to k mod 2" - 12?

Basically all "solutions" are regurgitated techniques from math competitions which are used completely incorrectly but with a lot of confidence




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: