Top-Down LR Parsing (pavpanchekha.com)
110 points by ingve on March 14, 2023 | 34 comments


I did a ton of parsing back in the day. LL, LR, LALR, in various forms with various lookahead, and in several languages. I once even did an incremental LR parser generator. I used, but never fully implemented, Pratt and Earley parsers. I spent weeks on optimizing LALR parser performance in Java, for reasons you simply don't want to know.

I can tell you one thing: It is much easier to transform your grammar than to switch to an exotic parser generator.

That means that once you've got your precedences in order, the simplest approach will always be recursive descent, implemented with whatever parser combinators exist in your language of choice. Yes, generators work and can be very fast, but they are a maintenance burden and might make simple stuff ("just parse these specific sub-expressions for this web form") unnecessarily hard. So don't switch away from whatever is the de-facto standard without a good reason.

Oh, and parsing is a little bit like crypto: Don't roll your own unless you are either a pro or doing it for educational purposes.


I agree with the first paragraphs. Tried everything out there, and decided absolutely nothing other than recursive descent is worth it. Parsing is a solved problem and you can throw recursive descent at any practical problem. When this is theoretically not sufficient you can use the "lexer trick" or multi-stage parsing. [EDIT: This is unless you have extreme performance requirements, or unless you're parsing a theoretically adversarial grammar that can only be parsed by an Earley parser, etc. This is for practical human programming language problems. As always, use your discretion.]

Oh, but unfortunately I don't agree with the last paragraph. In my experience it's the opposite: parsing works best if you roll everything on your own. In fact, in my codebases my principle is that parsing is a no-library zone (except regex, if necessary). This is because writing a recursive descent parser in any language is easy, and using libraries simply increases the maintenance cost imho. What value do you get from not rolling your own parser?
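
For a sense of scale, here is roughly what a hand-rolled recursive descent parser looks like for a toy expression grammar (the grammar and all names are made up for illustration):

    import re

    # Toy grammar:  expr ::= term (("+" | "-") term)*
    #               term ::= NUMBER | "(" expr ")"
    class Parser:
        def __init__(self, src):
            # Tokens are just numbers and single non-space characters.
            self.tokens = re.findall(r"\d+|\S", src)
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def advance(self):
            tok = self.peek()
            self.pos += 1
            return tok

        def parse_expr(self):
            node = self.parse_term()
            while self.peek() in ("+", "-"):
                op = self.advance()
                node = (op, node, self.parse_term())   # left-associative
            return node

        def parse_term(self):
            tok = self.advance()
            if tok is not None and tok.isdigit():
                return int(tok)
            if tok == "(":
                node = self.parse_expr()
                if self.advance() != ")":
                    raise SyntaxError("expected ')'")
                return node
            raise SyntaxError("unexpected token %r" % tok)

    # Parser("1 + (2 - 3)").parse_expr()  ->  ('+', 1, ('-', 2, 3))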


Formal languages / parsing theory is one of the hardest areas for me to understand. No other theory feels so easy to get decent results from in practice, yet so obscure in theory.

I have experience writing parsers and small langs, but whenever I have to approach all the theory behind parsers and grammars, I'm lost.

I don't understand why you would want to approach those problems in such a way, or at least describe them in a way that feels so out of touch with programming.

This theory is so unintuitive, damn.

Idk, maybe I've hurt myself by starting with these topics through practice instead of theory, but who knows?


+1. But I found that it is also the most fun to understand. Even better when you actually implement them from the ground up (I did this in TypeScript, literally by going through about 30-40 papers - it is public but I am a bit scared to publicize it): the satisfaction is immense. For a tiny second you almost forget you have to go back to a dreadful day job the next day! And that second, my friend, can feel like an entire universe!


Hi! I'd be really interested in checking out what you put together c:

The personal site in my bio hasn't had an email address on it because I thought I'd have one running on my own domain by now, but I'll go stick a gmail address on there like right now in case you're willing to share.


Definitely. This is an LR parser playground I put together. Again, very simple (I am not a UI person, so pardon the cringe UI). Happy to share the repo too (it has lots of tests but I need to put together documentation etc.). I've even been using this LR parser generator in a couple of my hobby apps for a couple of years.

http://galorium.appspot.com/demos/playground/


Shameless plug: https://www.youtube.com/playlist?list=PLOech0kWpH8-njQpmSNGS...

The first half of the course is on parsing.

I hope I give a lot of examples, and there is some coding (using PEG, which is closer to "programmer intuition").


> I don't understand why you would want to approach those problems in such a way, or at least describe them in a way that feels so out of touch with programming.

I don’t know. But the “langsec” (language security) branch of software security argues that a lot of security issues stem from insecure approaches to parsing.

I don’t know how much the theory of parsing helps with that. But it can help you make sure that your grammar has certain properties.


I don't think grammar theory is related to that.

A bigger factor seems to be a strong standard library with safe string primitives.

Aka NOT C.


A lot of it has to do with having more than one parser for the same thing -- it's bad if they don't agree on the meaning of the input.

That has everything to do with grammar theory (and with how clumsy parser generators are, how brittle many hand-written parsers are, and how often they don't quite implement the spec).


I feel very much the same.

When I want to parse a thing, I know that all I need to do is turn it into a bunch of tokens and then gradually loop over them, reducing until nothing can be reduced further; that lets me do infix and all of that good stuff.

I don't quite understand how to bridge the gap between that understanding and all the theory.
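
Concretely, that loop is more or less operator-precedence (shift-reduce) parsing; a toy sketch of the mental model (no parentheses, all names made up):

    import re

    PREC = {"+": 10, "*": 20}              # higher binds tighter

    def parse(src):
        tokens = re.findall(r"\d+|\S", src) + ["$"]   # "$" marks end of input
        stack = []
        for tok in tokens:
            # Before shifting an operator (or the end marker), reduce any
            # operator already on the stack that binds at least as tightly.
            while (len(stack) >= 3 and stack[-2] in PREC
                   and PREC[stack[-2]] >= PREC.get(tok, 0)):
                rhs, op, lhs = stack.pop(), stack.pop(), stack.pop()
                stack.append((op, lhs, rhs))
            if tok != "$":
                stack.append(int(tok) if tok.isdigit() else tok)
        return stack[0]

    # parse("1 + 2 * 3")  ->  ('+', 1, ('*', 2, 3))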


When we taught parsing (formal language theory) we generally did so using a "Chomskian" hierarchy. In a trivial sense it would be something like: regular (finite automata) -> context-free (pushdown; CYK; LL(1); LL(k)) -> Turing-reduction crap.

I think it did a huge disservice to our students. It completely elided probably the three most important strategies for parsing "in practice", in order: recursive descent (RD), Pratt, and Earley. The time it spent on algorithmic reduction was insufficient to teach students the practice of translating algorithms to simpler, but more efficient, forms (parametric algorithm design).

I think it's because there's not a lot of formal "meat" on RD, Pratt, and Earley — the first two are engineering solutions, and the last is too difficult for lower-division students to do the proofs.

Anyways, learn Pratt and RD. That's all you need, in practice.
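
For the curious, Pratt is basically RD plus a precedence loop; a minimal sketch (the precedence table and token handling are made up for illustration):

    import re

    PRECEDENCE = {"+": 10, "-": 10, "*": 20, "/": 20}   # higher binds tighter

    def tokenize(src):
        return re.findall(r"\d+|\S", src)

    def parse_expression(tokens, pos, min_prec=0):
        # Prefix part ("nud"): a number or a parenthesized sub-expression.
        tok = tokens[pos]; pos += 1
        if tok == "(":
            left, pos = parse_expression(tokens, pos)
            pos += 1                                     # skip ")"
        else:
            left = int(tok)

        # Infix loop ("led"): consume operators binding at least min_prec.
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], -1) >= min_prec:
            op = tokens[pos]; pos += 1
            # The +1 makes these operators left-associative.
            right, pos = parse_expression(tokens, pos, PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left, pos

    # parse_expression(tokenize("1 + 2 * 3"), 0)  ->  (('+', 1, ('*', 2, 3)), 5)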


Your TL;DR is true. But the beauty of this theory shows up as you start seeing different levels of complexity, ambiguity, and practical considerations appear in grammars and their parsing!


Also error handling


Perhaps you'd like to give recursive descent parsing a try, this is something I find quite usable: https://craftinginterpreters.com/parsing-expressions.html#re...


If you take this idea to its logical conclusion, you end up with Predictive LR(k) (or PLR(k)) grammars:

https://link.springer.com/article/10.1007/BF01934444

These are essentially the LR(k) grammars that become LL(k) with a combination of a left-corner transform (to eliminate left recursion) and factorization of common prefixes.
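
Roughly the kind of rewrite involved (an illustrative sketch, not the paper's exact construction): eliminate the left recursion, then factor common prefixes so a top-down parser can commit with bounded lookahead:

    addexpr ::= addexpr "+" mulexpr | mulexpr     -- left-recursive: fine for LR, fatal for LL

    -- after eliminating the left recursion:
    addexpr  ::= mulexpr addexpr'
    addexpr' ::= "+" mulexpr addexpr' | ε

    -- left-factoring a shared prefix:
    stmt ::= "if" expr "then" stmt | "if" expr "then" stmt "else" stmt
    -- becomes:
    stmt      ::= "if" expr "then" stmt else_part
    else_part ::= "else" stmt | ε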


This involves a manual left corner transform, so I would probably call this a variant of a left corner parser (LC parser)?


> "a real programming language with infix operators and precedence, not just S-expressions"

Looking to pick a fight with lispers? haha


The author is faculty at the University of Utah, whose other major PL person is Matthew Flatt — one of the primary implementers and original team members of the Racket project. It's not too surprising that he'd include a little good-humored jab at a colleague like that, I think.


This is a pretty neat approach to parsing by hand, and it's probably helpful to build intuition about LR. Basically if you give an explanation of LR including shifts and reduces, then you can show how that maps to the hand-written parser (i.e. the `_cont` functions are shifts that eventually result in a reduce when they return and use the parameter that they were passed).

Shameless self-plug, in case you want to understand more about LR parsing: https://www.youtube.com/watch?v=8UDWd-Axd5A


I guess the major downside of this approach is that your state is spread through a call stack rather than in a parse stack sitting in an array and a pointer somewhere.

That makes error recovery really, really difficult, because you have to unwind the call stack to the next recovery state rather than just messing with the contents of the array and pointer.

I can understand the teaching value in doing it this way, but students really need to understand the value of table-driven tools in the real world.


In my experience, it doesn't make error recovery difficult, not really, even without exception throwing. Then again, I mostly stick to panic-mode recovery: on unexpected input, record an error in a global list of errors, skip tokens until the end of the phrase, and return a bogus node. Depending on what exactly "bogus" means, you end up with error productions or completely made-up but otherwise perfectly valid nodes somewhere deep inside your parse/abstract tree, which is why recording the presence of errors somewhere off to the side is a good idea.

I imagine a slightly more elaborate error-recovery approach could return different kinds of bogus nodes and check for and process them differently in different productions.
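
In sketch form, that pattern looks something like this (the token shapes, the sync set, and the "bogus" node are all made up for illustration):

    SYNC_TOKENS = {";", "}"}                    # assumed "end of phrase" markers

    def parse_statement(tokens, pos, errors):
        # Happy path: a made-up statement form  IDENT "=" NUMBER ";"
        if (pos + 3 < len(tokens) and tokens[pos].isidentifier()
                and tokens[pos + 1] == "=" and tokens[pos + 2].isdigit()
                and tokens[pos + 3] == ";"):
            return ("assign", tokens[pos], int(tokens[pos + 2])), pos + 4

        # Panic mode: record the error off to the side, skip to the end of
        # the phrase, and return a well-formed but bogus node.
        errors.append("parse error at token %d" % pos)
        while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
            pos += 1
        if pos < len(tokens):
            pos += 1                            # consume the synchronizing token
        return ("bogus_stmt",), pos

    # errors = []
    # parse_statement(["x", "=", "oops", ";"], 0, errors)
    #   -> (('bogus_stmt',), 4), with one message recorded in errors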


I'll add that few people use the trick of extending LR grammars by resolving conflicts dynamically on the fly (you can do stuff like implementing languages that allow changing operator precedence). Because the usual tools don't support this well, it's probably easier to implement in this top-down methodology.
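
In a hand-written top-down parser, the "dynamic" part can be as simple as consulting a mutable precedence table, so a declaration seen earlier in the input changes how later expressions are parsed (a sketch; the "infixl" declaration syntax and the default table are made up):

    import re

    def parse_program(src):
        precedence = {"+": 10, "*": 20, "@": 5}    # defaults; mutated on the fly
        results = []
        for line in src.splitlines():
            tokens = re.findall(r"\d+|\S+", line)  # assumes space-separated tokens
            if not tokens:
                continue
            if tokens[0] == "infixl":              # e.g. "infixl 30 @"
                precedence[tokens[2]] = int(tokens[1])
            else:
                results.append(parse_expr(tokens, 0, 0, precedence)[0])
        return results

    def parse_expr(tokens, pos, min_prec, precedence):
        left = int(tokens[pos]); pos += 1
        while pos < len(tokens) and precedence.get(tokens[pos], -1) >= min_prec:
            op = tokens[pos]
            right, pos = parse_expr(tokens, pos + 1, precedence[op] + 1, precedence)
            left = (op, left, right)
        return left, pos

    # parse_program("1 + 2 @ 3\ninfixl 30 @\n1 + 2 @ 3")
    #   -> [('@', ('+', 1, 2), 3), ('+', 1, ('@', 2, 3))]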


Here's an example of doing this in a top-down parser, by Bob Nystrom (author of "Crafting Interpreters"): [0].

[0] http://journal.stuffwithstuff.com/2011/02/13/extending-synta...


Interesting approach. Maybe it is even better to start with a back-tracking parser. Although back-tracking has a reputation for being slow, there is a rather straightforward mapping from grammar to functions, and while it is slower than traditional parsing techniques, it is fast enough for many practical applications.

One easy trick to optimize back-tracking is memoization.

Another is the left-recursion into right-recursion trick, which is also rather straightforward to apply.
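
In its simplest form the memoization trick is just caching (rule, position) results, so that alternatives which re-try the same sub-parse don't redo the work (a sketch; the grammar and token handling are made up):

    import functools, re

    # Toy grammar with ordered choice:  expr ::= term "+" expr | term
    @functools.lru_cache(maxsize=None)
    def parse_expr(tokens, pos):
        r = parse_term(tokens, pos)
        if r is None:
            return None
        value, p = r
        if p < len(tokens) and tokens[p] == "+":
            rest = parse_expr(tokens, p + 1)
            if rest is not None:
                return ("+", value, rest[0]), rest[1]
        return value, p                 # back off to the plain "term" alternative

    @functools.lru_cache(maxsize=None)
    def parse_term(tokens, pos):
        if pos < len(tokens) and tokens[pos].isdigit():
            return int(tokens[pos]), pos + 1
        return None

    def parse(src):
        # Tokens are passed as a tuple so the caches can hash their arguments.
        return parse_expr(tuple(re.findall(r"\d+|\S", src)), 0)

    # parse("1 + 2 + 3")  ->  (('+', 1, ('+', 2, 3)), 5)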

These two tricks are used in IParse Studio [0], which is an interpreting parser: it parses a grammar and uses that grammar to parse some input. I can also help you with rewriting the grammar using the described approach.

[0] https://fransfaase.github.io/MCH2022ParserWorkshop/IParseStu...


What are some practical examples that require back-tracking (for whatever definition of "require")? The example I was once told about was parsing assignment expressions:

    Expr ::= CanBeLhsOfAssignExpr | CanBeLhsOfAssignExpr "=" Expr | CanNotBeLhsOfAssignExpr
but in practice, it's easily parsed in the following manner:

    def parse_Expr(self):
        # parse_ExprInner parses CanBeLhsOfAssignExpr or CanNotBeLhsOfAssignExpr cases
        result = self.parse_ExprInner()

        if self.peekToken() == TOKEN_ASSIGN:
            if result.is_CanBeLhsOfAssignExpr:
                token = self.skipToken()

                # assignment is right-associative
                rhs = self.parse_Expr()
                result = self.make_AssignExpr(token, result, rhs)
            else:
                self.raiseParseError("{} can't be target of assignment", result)

        return result
The idea is to parse the prefix under "relaxed" rules, delaying the well-formedness check until it's actually needed.


I do not know of a simple example, but I have heard there are cases in the C++ grammar where the required look-ahead can be arbitrarily large. Parsers for C (and C++) are often not purely grammatical, but also keep a symbol table to resolve these problems. For example, the statement 'a * b;' can either be a declaration or a multiplication (with no result) depending on whether 'a' is a type or a variable. Note that in C++ it is possible to define a *-operator with a side effect.


For anyone who, like me, was completely in the dark on parsing algorithms, this[1] course was extremely helpful and easy to understand.

1. https://www.udemy.com/course/essentials-of-parsing/


Really cool article, looking forward to more. I've used ANTLR quite a bit previously but never really understood LL vs LR and how it just outputs pre/post-fix type stacks. Very well written, and I like the pro/con analysis at the end, thanks.


Top down LR parsing? Am I reading it correctly?


The trick is to pass down previous syntax nodes so that you can continue consuming input (via top down recursive descent) but build the syntax tree bottom up still.

This is exemplified in the two definitions of `parse_add_addexpr_cont` in the article.
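
Roughly, the shape is something like this (a sketch of the idea with made-up names, not the article's exact code):

    import re

    def parse_addexpr(tokens, pos):
        left, pos = parse_number(tokens, pos)
        return parse_addexpr_cont(tokens, pos, left)

    def parse_addexpr_cont(tokens, pos, left):
        # "Shift": on '+', consume it and the next operand, then recurse with
        # the grown tree passed down; the "reduce" is returning the accumulated node.
        if pos < len(tokens) and tokens[pos] == "+":
            right, pos = parse_number(tokens, pos + 1)
            return parse_addexpr_cont(tokens, pos, ("+", left, right))
        return left, pos

    def parse_number(tokens, pos):
        return int(tokens[pos]), pos + 1

    # parse_addexpr(re.findall(r"\d+|\S", "1 + 2 + 3"), 0)
    #   -> (('+', ('+', 1, 2), 3), 5)      # built bottom-up, left-associative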


A recursive-descent parser would usually (there are several options) write `parse_addexpr` somewhat like this:

    Expr parse_addexpr() {
        Expr result = parse_mulexpr();

        while (peek_token() == TokenType.PLUS) {
            expect_token(TokenType.PLUS);
            Expr rhs = parse_mulexpr();
            result = new AddExpr(result, rhs);
        }

        return result;
    }
which corresponds to "addexpr: <mulexpr> (+ <mulexpr>)*" but with left-associative tree-composing.


Why? What was wrong with a regular LR parser?


It's fairly easy to teach the behaviour of LR parsers (though which grammars end up being LR and which end up conflicting is very unintuitive).

It's much harder to teach the inner workings / implementation of such parsers. Table generation adds an extra layer of indirection. I skip this in my class, because I doubt being able to implement an LR parser would be very valuable to students compared to the other things we cover.

Following the OP's approach, you can get some intuition for how an LR parser behaves / is implemented without going to the full complexity.




