I mean... isn't the task that you're optimizing 'has a function and a set of lemmas that all pass'?
How does the MCTS distinguish between 'generated a stupid lemma that is true' and 'generated a valid lemma'?
Is there any reason to expect that a 'good' partial subtree will result in a 'good' output?
Why do you think this approach will generate valid lemmas?
(Not structurally valid; valid as in, they assert a solution to the given problem).
It seems a lot like saying:
"Generate me a function for X and a bunch of tests that verify X".
"If the tests pass, the function is good"
...but, there's no specific reason to expect that to be true?
How is what you're doing different?
Clearly (from the examples) it is true in trivial cases, but generally speaking, as task complexity increases, this type of 'auto-validation' seems to struggle...?
Using grammars to generate structured output seems like a similar (and successful) approach, used by many people, but because it doesn't come with the associated auto-validation, it's robust against LLM randomness.
I guess I'm struggling to see how this is superior / novel to existing approaches in that space.
> How does the MCTS distinguish between 'generated a stupid lemma that is true' and 'generated a valid lemma'?
It does not, though I would like to add validation as part of the check as future work.
> Is there any reason to expect that a 'good' partial subtree will result in a 'good' output?
Yes. I am using Phind Code Llama and it generates a lot of invalid Dafny out of the box. Now, all these false starts get pruned.
I do suspect that if we try to scale this to Coq proofs, we will run into local-optima rabbit holes.
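To make the pruning concrete: each candidate is scored by the Dafny verifier, not by the LLM. A minimal sketch of such a check, not the repo's actual code — it assumes Dafny 4's `dafny verify` CLI is on the PATH, and the function name is hypothetical:

    import subprocess
    import tempfile

    def score_candidate(dafny_source: str, timeout_s: int = 60) -> float:
        """Hypothetical scorer: 1.0 if Dafny verifies the program, -1.0 otherwise."""
        with tempfile.NamedTemporaryFile(suffix=".dfy", mode="w", delete=False) as f:
            f.write(dafny_source)
            path = f.name
        try:
            result = subprocess.run(
                ["dafny", "verify", path],   # assumes Dafny 4's `dafny verify` subcommand
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return -1.0                      # treat verifier timeouts as failures
        return 1.0 if result.returncode == 0 else -1.0

The scoring in the repo is likely richer than a plain pass/fail, but the point stands: the verifier, not the LLM, decides the score, so false starts get pruned.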
> Why do you think this approach will generate valid lemmas?
> (Not structurally valid; valid as in, they assert a solution to the given problem).
Because the LLM tries to follow instructions.
> "Generate me a function for X and a bunch of tests that verify X".
> "If the tests pass, the function is good"
> ...but, there's no specific reason to expect that to be true?
Indeed, there are complementary approaches like Speculyzer that help validate specs and code at once.
> How is what you're doing different?
That's not the problem I am solving at the moment. What I do solve is generating verified code (which still needs to be inspected), something the LLM struggles to do in one go.
> I guess I'm struggling to see how this is superior / novel to existing approaches in that space.
It's not that novel; see the planning-with-LLMs work that I credit in the README.md. This transfers from a setting where candidates are checked by tests to one where they are checked by a verifier.
The code is very simple, and it actually gets open models to reliably solve problems that even ChatGPT only sometimes solves.
I think it's a good foundation to build more sophisticated verification with LLMs.
This is not the main loop. This just generates one completion that succeeds (or gives up), deferring to the Dafny checker to decide.
The main loop is the simulate function of the MCTS library. https://github.com/namin/llm-verified-with-monte-carlo-tree-...
The main advantage of MCTS is that it takes care of the exploitation/exploration trade-off based on the scores propagated by the child finder.
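For concreteness, a minimal sketch of how the pieces fit together, assuming the ImparaAI-style montecarlo API the linked repo builds on; generate_next_step and initial_prompt are hypothetical stand-ins, and score_candidate is the Dafny check sketched above:

    from montecarlo.node import Node
    from montecarlo.montecarlo import MonteCarlo

    def generate_next_step(program_so_far: str) -> str:
        # Hypothetical stand-in: ask the LLM to extend the program by one step.
        raise NotImplementedError("plug in your LLM call here")

    def child_finder(node, montecarlo):
        # One expansion: the LLM extends the program text at the selected node.
        candidate = generate_next_step(node.state)
        child = Node(candidate)
        node.add_child(child)
        # The Dafny verdict (see sketch above) becomes the win value that MCTS
        # backpropagates, so failing branches get explored less.
        child.update_win_value(score_candidate(candidate))

    initial_prompt = "..."  # hypothetical: problem statement plus Dafny preamble
    montecarlo = MonteCarlo(Node(initial_prompt))
    montecarlo.child_finder = child_finder
    montecarlo.simulate(50)          # the main loop: selection, expansion, backpropagation
    best = montecarlo.make_choice()

The shape of the sketch is the point: the LLM only proposes, the verifier scores, and the MCTS library decides where to expand next based on those scores.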
I hope this helps. Let me know if this addresses your question.