Show HN: LLM Verified with Monte Carlo Tree Search (github.com/namin)
102 points by namin on Nov 11, 2023 | 15 comments
This is a weekend hack that I'd like to further develop as it's working surprisingly well.

Using MCTS, we can explore a space of possible verified programs with an LLM. We check the partial programs at each step, and so steer towards programs that pass the verifier.

https://github.com/namin/llm-verified-with-monte-carlo-tree-...



Nice! I had a similar idea to use MCTS to explore various AST generations through an incremental AST parser [0], so I could generate runnable code via LLM at a higher success rate than what we were initially seeing. I never got around to it, but this makes me curious to circle back.

[0] There are few parsers that consider "failed, but could run if the next symbol is valid" and "failed and no way to continue" as separate statuses so you can make progress by enumerating (hopefully intelligently so) the space of valid states. Surprisingly TreeSitter wasn't one of them though, for that it's just "fail".
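
To make the distinction concrete (using Python's stdlib codeop as a stand-in, not a real AST parser), it's the difference between these three outcomes:

    import codeop

    def classify(src):
        try:
            code = codeop.compile_command(src)  # None means "incomplete but could continue"
        except SyntaxError:
            return "failed, no way to continue"
        if code is None:
            return "failed, but could succeed with more input"
        return "complete and valid"

    print(classify("x = 1"))            # complete and valid
    print(classify("def f(x):"))        # incomplete prefix, worth extending
    print(classify("def f(x) retrun"))  # unrecoverable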


Interesting! Is the end goal similar to the outlines library: https://github.com/outlines-dev/outlines ?


Yes, basically to create some level of control for guided program generation.


> [0] There are few parsers that consider "failed, but could run if the next symbol is valid" and "failed and no way to continue" as separate statuses so you can make progress by enumerating (hopefully intelligently so) the space of valid states. Surprisingly TreeSitter wasn't one of them though, for that it's just "fail".

I'm not quite sure I follow how this differs from Tree Sitter's error recovery. Do you have an example of a parser that implements this?


https://pypi.org/project/parso/ is what I settled on when digging into this.

It's been something like 8 months, but IIRC tree-sitter would not emit anything I was able to discern as recoverable. I recognize it does implement pretty robust error recovery, but the returned error wasn't clearly differentiable from a strict failure in a way I could program against, whereas for the same partially incomplete code parso would return something different than it would for unrecoverable code.

I can't remember much more than that, and it's entirely possible I was missing some type of configuration to make tree-sitter do the same. Unfortunately, it looks like I never pushed my playground code for this idea off of my server, which is currently packed up as part of a cross-country move I'm on the tail end of; otherwise I'd get you the code verbatim.


Anyone who's game to explain what this actually does?

The main loop looks simple:

    def generate_complete(text, montecarlo):
        text = llm.generate(text, 1)[0]
        score = score_func(text)
        if score is not None:
            if score < 0:
                return None
            else:
                if can_be_solution(text, min_lines, check_fun):
                    montecarlo.solution = text
                return text
        else:
            return generate_complete(text, montecarlo)
...but the heart of it [1] basically just checks whether the Dafny syntax is valid by posting to https://dafny.livecode.ch/check

How is 'syntax is valid' a valid scoring mechanism here?

If you look at the output examples [2], all I can see is that it generates functions and lemmas; this is equivalent to generating a function and a bunch of tests for it.

I'm not sure I see what value the MCTS is bringing here.

Anyone get this and care to explain?

[1] - https://github.com/namin/llm-verified-with-monte-carlo-tree-...

[2] - https://github.com/namin/llm-verified-with-monte-carlo-tree-...


Thanks for taking a look!

This is not the main loop. It just generates one completion that succeeds (or gives up), deferring to the Dafny checker to decide.

The main loop is the simulate function of the MCTS library. https://github.com/namin/llm-verified-with-monte-carlo-tree-...

The main advantage of MCTS is that it takes care of the exploitation/exploration trade off based on the scores propagated by the child finder.
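
Schematically, the integration looks like this (a simplified sketch assuming the ImparaAI library's Node/child_finder API; the real code in the repo differs in details, and initial_prompt is a placeholder):

    from montecarlo.montecarlo import MonteCarlo
    from montecarlo.node import Node

    def child_finder(node, montecarlo):
        # Extend the partial program at this node by one checker-approved completion.
        text = generate_complete(node.state, montecarlo)
        if text is None:
            node.update_win_value(-1)   # failed the Dafny check: penalize this branch
        else:
            child = Node(text)
            node.add_child(child)
            child.update_win_value(1)   # passed so far: reward it so MCTS revisits it

    montecarlo = MonteCarlo(Node(initial_prompt))  # root = the problem statement / code prefix
    montecarlo.child_finder = child_finder
    montecarlo.simulate(50)             # select, expand, score, backpropagate, 50 times
    print(montecarlo.solution)          # set by generate_complete when a full program verifies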

I hope this helps. Let me know if this addresses your question.


I mean... isn't the task that you're optimizing 'has a function and a set of lemmas that all pass'?

How does the MCTS distinguish between 'generated a stupid lemma that is true' and 'generated a valid lemma'?

Is there any reason to expect that a 'good' partial subtree will result in a 'good' output?

Why do you think this approach will generate valid lemmas?

(Not structurally valid; valid as in, they assert a solution to the given problem).

It seems a lot like this:

"Generate me a function for X and a bunch of tests that verify X".

"If the tests pass, the function is good"

...but, there's no specific reason to expect that to be true?

How is what you're doing different?

Clearly (from the examples) it is true in a trivial case, but generally speaking, as task complexity increases, this type of 'auto validation' seems to struggle...?

Using grammars to generate structured output seems like a similar (and successful) approach used by many people, but because it doesn't have the associated auto-validation, it's robust against LLM randomness.

I guess I'm struggling to see how this is superior / novel to existing approaches in that space.


> How does the MCTS distinguish between 'generated a stupid lemma that is true' and 'generated a valid lemma'?

It does not, though I would like to add validation as part of the check as future work.

> Is there any reason to expect that a 'good' partial subtree will result in a 'good' output?

Yes. I am using Phind Code Llama and it generates a lot of invalid Dafny out of the box. Now, all these false starts get pruned.

I do suspect that if we try to scale this to Coq proofs, we will run into local-optima rabbit holes.

> Why do you think this approach will generate valid lemmas?

> (Not structurally valid; valid as in, they assert a solution to the given problem).

Because the LLM tries to follow instructions.

> "Generate me a function for X and a bunch of tests that verify X". > "If the tests pass, the function is good" > ...but, there's no specific reason to expect that to be true?

Indeed, there are complementary approaches like Speculyzer that help validate specs and code at once.

> How is what you're doing different?

That's not the problem I am solving at the moment. What I do solve is generating verified code (which still needs to be inspected), something the LLM struggles to do in one go.

> I guess I'm struggling to see how this is superior / novel to existing approaches in that space.

It's not that novel; see the planning-with-LLMs work that I credit in the README.md. This transfers from a setting of checks by tests to a setting of checks by a verifier. The code is very simple, and it actually brings open models to reliably solve problems that even ChatGPT only sometimes solves. I think it's a good foundation for building more sophisticated verification with LLMs.


I love that the code is so short and understandable. Could be turned into a Pearl.


Thanks. I adapted this old MCTS library: https://github.com/ImparaAI/monte-carlo-tree-search which is also very short and understandable.


Could this be combined with something like llama.cpp's constraint-based grammar (https://github.com/ggerganov/llama.cpp/blob/master/grammars/...) to always enforce syntactically correct code output?


Yes. The verifier check already ensures syntactic correctness, but the search could go faster if the underlying LLM didn't generate bad syntax to begin with.
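
For example, something like this (a hypothetical sketch using llama-cpp-python's grammar support, not something the repo does today; the model file and the Dafny .gbnf grammar are made-up names):

    from llama_cpp import Llama, LlamaGrammar

    llm = Llama(model_path="phind-codellama-34b-v2.Q4_K_M.gguf")  # hypothetical local model file
    grammar = LlamaGrammar.from_file("dafny.gbnf")                # a GBNF grammar for Dafny (would need to be written)
    out = llm(
        "Write a Dafny method that reverses an array, with a lemma about its length:",
        grammar=grammar,
        max_tokens=512,
    )
    print(out["choices"][0]["text"])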


Here is a dumb question: how does it distinguish between a partial solution that is just missing some characters/lines and nonsense?


Not dumb! The verifier lets an error on the last line slide while that line is still in progress.
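
Roughly, the idea is this (a hypothetical sketch, not the exact scoring code in the repo; verifier_errors is assumed to carry line numbers):

    def score_partial(text, verifier_errors):
        last_line = len(text.splitlines())
        if not verifier_errors:
            return 1     # everything verifies so far: good partial program
        if all(err.line >= last_line for err in verifier_errors):
            return None  # only the still-in-progress last line fails: keep generating
        return -1        # an earlier line is already broken: prune this branch

A None score is what makes generate_complete above keep going instead of giving up.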

