The time investment of writing a *good* hand-written parser is only worth it if ...

eatonphil · on July 15, 2022

Using parser generators doesn't make a lot of sense to me.

It's far easier to handwrite parsers.

Unless you're a midsized company with the resources and who doesn't mind bad error messages I would not recommend anyone use parser generators.

zetalyrae · on July 15, 2022

What could be easier than writing a grammar in EBNF-ish notation? This gives you the added benefit that the grammar is explicitly defined, and not implicit in the parser code.

westoncb · on July 15, 2022

> This gives you the added benefit that the grammar is explicitly defined

Especially if you're still refining the syntax: it is so comparatively easy to modify a grammar vs parser.

eatonphil · on July 15, 2022

> What could be easier than writing a grammar in EBNF-ish notation?

Not learning a new language/build system/coding paradigm. :)

kaba0 · on July 15, 2022

ANTLR took me an afternoon, before I could use it for a simple language, including adding it to my build tool.

pizza234 · on July 15, 2022

> It's far easier to handwrite parsers.

It's also far easier to make buggy handwritten parsers. I don't have much experience, but in the project that I've seen, where a handwritten parser was used, the issue tracker was full of parser bugs.

rhdunn · on July 15, 2022

I tend to write lots of tests at each level -- lexer, parser, AST, etc. -- to cover the different BNF grammar statements. I've done this with an XQuery and XPath parser I wrote for IntelliJ.

You can cover the different lexical variations in the lexical tests, and ensure that the lexer is reporting the correct things in the lexer tests. For these, I generally ensure I cover each token in a given BNF grammar statement, even if that means repeating checks for a given keyword, as it keeps the tests self-contained and easier to verify.

You can cover the different parser/grammar variations (different branches/paths in the BNF, missing tokens for error reporting and recovery, etc.) in the parser tests. These can make use of serializing the tree to a file, so your input is an example program and the output is the parsed AST tree.

The AST tests can then test the actual data model your AST provides, such as access to the value of a literal, or information about a type.

Higher-level tests such as constant folding can then be done using the AST API, or things like serializing the AST to a textual representation.

userbinator · on July 15, 2022

It's also far easier to debug handwritten parsers.

tomcam · on July 15, 2022

a parser generator can be a very good first pass at determining whether your language is properly formed

eatonphil · on July 15, 2022

What do you mean by properly formed?

Calavar · on July 15, 2022

Probably that it doesn't have ambiguities

tomcam · on July 15, 2022

Thanks, that’s a better way of putting it.

eatonphil · on July 15, 2022

Is that an important property? If so, why do you (*whoever) think so?

Calavar · on July 15, 2022

As always, it depends on what your goal is.

An unambiguous parser is easier to write tests for. It leads to fewer subtle "gotchas" where something syntax checks and compiles but is interpreted by the compiler as something very different than what the programmer intended. It's easier for third parties to write their own parsers and tooling for the language if they can target a formal grammar as opposed to the behavior of your Turing complete code.

But sometimes you don't care about any of those things. Especially if you're talking about a language with a small surface area. Or if you have to deal with legacy stuff (for example, parsing something like C++ that is heavily context sensitive, in which case trying to shoehorn things into a formal CFG so you can feed it to a generator might actually make things much harder than they have to be).

tomcam · on July 15, 2022

Not always, and that’s a great question! But I can help you catch things that might not be obvious when your first create the grammar, even such simple things as ambiguous nested if statements or errors in string formatting DSL syntax.

tomcam · on July 16, 2022

* it can, not I can

wyago · on July 15, 2022

It takes 1 or 2 hours to make a decent pratt parser for even somewhat unusual syntax.