
> How does the MCTS distinguish between 'generated a stupid lemma that is true' and 'generated a valid lemma'?

It does not, though I would like to add validation as part of the check in future work.

> Is there any reason to expect that a 'good' partial subtree will result in a 'good' output?

Yes. I am using Phind Code Llama, and out of the box it generates a lot of invalid Dafny. With the search, all of those false starts get pruned.
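
For concreteness, here is a minimal sketch of the kind of check that does the pruning. It assumes the `dafny verify` CLI is on the path; the names and the all-or-nothing scoring are illustrative, not the repo's actual code:

    import subprocess
    import tempfile

    def dafny_score(candidate: str) -> float:
        """Illustrative reward: 1.0 if the candidate verifies, else 0.0.

        Rollouts that end in invalid Dafny get zero reward, so the
        search stops expanding the subtrees that produced them.
        """
        with tempfile.NamedTemporaryFile(
            suffix=".dfy", mode="w", delete=False
        ) as f:
            f.write(candidate)
            path = f.name
        try:
            result = subprocess.run(
                ["dafny", "verify", path],
                capture_output=True, text=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # treat a verifier timeout as a failure
        return 1.0 if result.returncode == 0 else 0.0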

I do suspect that if we try to scale this to Coq proofs, we will run into local-optima rabbit holes.

> Why do you think this approach will generate valid lemmas?
> (Not structurally valid; valid as in, they assert a solution to the given problem.)

Because the LLM tries to follow instructions.

> "Generate me a function for X and a bunch of tests that verify X". > "If the tests pass, the function is good" > ...but, there's no specific reason to expect that to be true?

Indeed, there are complementary approaches like Speculyzer that help validate specs and code at once.

> How is what you're doing different?

That's not the problem I am solving at the moment. What I do solve is generating verified code (which still needs to be inspected), something the LLM struggles to do in one go.
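
To illustrate what "still needs to be inspected" means: the verifier only checks the code against its spec, not the spec against your intent. A hypothetical example, reusing the `dafny_score` sketch above (the Dafny snippet is mine, not from the repo):

    # A candidate the verifier happily accepts: the postcondition is
    # too weak, so a trivial body satisfies it even though it is not
    # the function anyone asked for.
    weak_but_verified = """
    method Abs(x: int) returns (y: int)
      ensures y >= 0  // says nothing about how y relates to x
    {
      y := 0;  // verifies, yet computes nothing useful
    }
    """

    assert dafny_score(weak_but_verified) == 1.0  # verified, not inspected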

> I guess I'm struggling to see how this is superior / novel to existing approaches in that space.

It's not that novel; see the planning-with-LLMs work that I credit in the README.md. This transfers from a setting where candidates are checked by tests to one where they are checked by a verifier. The code is very simple, and it actually brings open models to reliably solve problems that even ChatGPT only sometimes solves. I think it's a good foundation for building more sophisticated verification with LLMs.
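
To make the transfer concrete, here is a deliberately flattened skeleton, best-of-n rather than full MCTS, with all names mine. The point is that the only piece that changes between the two settings is the reward function:

    from typing import Callable, Iterable

    def reward_by_tests(program: str,
                        tests: Iterable[Callable[[str], bool]]) -> float:
        """Test-based planning: reward = fraction of unit tests passed."""
        tests = list(tests)
        return sum(t(program) for t in tests) / len(tests)

    def reward_by_verifier(program: str) -> float:
        """Verifier-based setting: the Dafny verifier is the judge
        (dafny_score is the sketch from earlier in the thread)."""
        return dafny_score(program)

    def plan(complete: Callable[[str], str], prompt: str,
             reward: Callable[[str], float], n: int = 32) -> str:
        """Stand-in for the search: sample n completions, keep the best.
        The real system searches over *partial* programs with MCTS,
        pruning prefixes the verifier already rejects."""
        return max((complete(prompt) for _ in range(n)), key=reward)

Swapping reward_by_tests for reward_by_verifier is the whole "transfer"; everything the planning-in-LLMs literature says about the search itself carries over unchanged.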



