I wonder how the Mdast and pandoc ASTs compare? I did a customized-MD pipeline w...

solardev · 2025-03-02T20:38:47 1740947927

I am not familiar with Pandoc, but it looks like a command-line tool that can do the same things? (Edit: I suspect this is probably one of those situations where different industries/domains end up developing similar tools in different ecosystems... Pandoc probably makes sense in academia, LaTeX workflows, etc.? Mdast is used for web apps. I can see both realms wanting to do Markdown conversions, so I'm not surprised to see similar tools available in both. I'm a web dev, so only familiar with Mdast.)

My guess is that either toolchain could do the job... maybe just depends on personal preference whether someone prefers to pipe together command-line tools in a bash script, vs making use of the npm ecosystem (mdast is all in JS).

Maybe the popularity of JS & npm means there are available mdast plugins & third party packages that can help with whatever niche transformation you might need, and custom node rendering is just a lambda away. It's all in JS for a seamless experience, and there is no separate DSL to learn (just some basic helper functions).
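To make "custom node rendering is just a lambda away" concrete, here is a minimal sketch of the idea. It is not the mdast/unified API: the tree is hand-built in the shape mdast uses (`root`, `heading`, `paragraph`, `text`, `emphasis`, `inlineCode`, and so on), and `MdNode`, `Handler`, `render`, and `defaultHandlers` are hypothetical names made up for illustration. In a real project the tree would come from a parser and the rendering from the ecosystem's own utilities.

```typescript
// Minimal mdast-shaped node. Real mdast nodes carry more fields (position, data, ...).
interface MdNode {
  type: string;
  value?: string;      // literal nodes such as `text` and `inlineCode`
  depth?: number;      // `heading` level
  url?: string;        // `link` target
  children?: MdNode[]; // parent nodes
}

// A handler receives the node plus its already-rendered children.
type Handler = (node: MdNode, inner: string) => string;

const escapeHtml = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;");

// Hypothetical default renderers, one small lambda per node type.
const defaultHandlers: Record<string, Handler> = {
  root: (_n, inner) => inner,
  paragraph: (_n, inner) => `<p>${inner}</p>`,
  heading: (n, inner) => `<h${n.depth ?? 1}>${inner}</h${n.depth ?? 1}>`,
  emphasis: (_n, inner) => `<em>${inner}</em>`,
  strong: (_n, inner) => `<strong>${inner}</strong>`,
  inlineCode: (n) => `<code>${escapeHtml(n.value ?? "")}</code>`,
  link: (n, inner) => `<a href="${escapeHtml(n.url ?? "")}">${inner}</a>`,
  text: (n) => escapeHtml(n.value ?? ""),
};

function render(node: MdNode, handlers: Record<string, Handler> = defaultHandlers): string {
  const inner = (node.children ?? []).map((child) => render(child, handlers)).join("");
  const handler = handlers[node.type];
  // Unknown node types fall back to rendering their children only.
  return handler ? handler(node, inner) : inner;
}

// Hand-built tree standing in for parser output.
const tree: MdNode = {
  type: "root",
  children: [
    { type: "heading", depth: 2, children: [{ type: "text", value: "Notes" }] },
    {
      type: "paragraph",
      children: [
        { type: "text", value: "Use " },
        { type: "inlineCode", value: "a < b" },
        { type: "text", value: " with " },
        { type: "emphasis", children: [{ type: "text", value: "care" }] },
        { type: "text", value: "." },
      ],
    },
  ],
};

// The "lambda away" part: override one node type, keep the rest.
const customHandlers: Record<string, Handler> = {
  ...defaultHandlers,
  heading: (n, inner) => `<h${n.depth ?? 1} class="doc-heading">${inner}</h${n.depth ?? 1}>`,
};

const html = render(tree, customHandlers);
```

Swapping a single entry in the handler table changes how that node type renders without touching the traversal, which is roughly the convenience being described.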

That might be harder to do in Pandoc... (might need a custom Lua filter or another language like your Julia pattern matching?)

As for effectiveness... it probably just depends on the particular implementer :) I'd trust a grizzled old *NIX sysadmin type over your typical bootcamp JS programmer any day, but also... the JS ecosystem is pretty mature and powerful now, and Mdast is pretty amazing. At work we use it to build one of the most important parts of our app, and its power and flexibility never cease to amaze me.

mncharity · 2025-03-03T01:56:06 1740966966

Let's see. So there are parsers in various languages, parsing various MD dialects, with varied internal representations, and surrounding ecosystems. And there are attempts at more turnkey document processing systems, often with a more extended dialect, and some collection of feature plugins. Often you can write pipeline AST filters in the given language, and sometimes get out an AST as JSON, and sometimes reinject a JSON AST (allowing you to write a filter in any language). Which leaves questions like: what dialect does the parser accept; is that extensible; how robustly correct is it; how clean, easily used, and fragile is the AST; how well do the plugins/ecosystem already support your needed features. That AST one, I think of as a big deal, and hard to get a handle on. Aside from manipulation pragmatics, the ASTs resulting from parsing can get richly creative in quirkiness that you then may need to regularize.
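As a rough illustration of the "get the AST out as JSON, filter it, reinject it" pattern, here is a sketch of a filter over pandoc-style JSON, where elements are tagged objects like `{"t": "Emph", "c": [...]}`. The names `walk`, `emphToStrong`, and `runFilter` are hypothetical, the sample document is hand-written rather than captured from pandoc, and the stdin/stdout plumbing is left out so the transform stays self-contained.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

// Rebuild the tree bottom-up, giving `rewrite` a chance to replace every tagged element.
function walk(value: Json, rewrite: (element: JsonObject) => JsonObject): Json {
  if (Array.isArray(value)) {
    return value.map((item) => walk(item, rewrite));
  }
  if (typeof value === "object" && value !== null) {
    const copy: JsonObject = {};
    for (const key of Object.keys(value)) {
      const child = value[key];
      if (child !== undefined) copy[key] = walk(child, rewrite);
    }
    // Pandoc-style elements carry a string tag under "t".
    return typeof copy["t"] === "string" ? rewrite(copy) : copy;
  }
  return value; // strings, numbers, booleans, null pass through unchanged
}

// Hypothetical filter: turn every Emph into Strong.
const emphToStrong = (element: JsonObject): JsonObject =>
  element["t"] === "Emph" ? { ...element, t: "Strong" } : element;

function runFilter(inputJson: string): string {
  const doc = JSON.parse(inputJson) as Json;
  return JSON.stringify(walk(doc, emphToStrong));
}

// Hand-written sample in the assumed shape of pandoc's JSON output.
const sample =
  '{"pandoc-api-version":[1,23,1],"meta":{},"blocks":[{"t":"Para","c":[{"t":"Emph","c":[{"t":"Str","c":"hi"}]}]}]}';
const filtered = runFilter(sample);
```

In an actual pipeline the string would arrive on stdin from `pandoc --to=json` and the result would be piped back into `pandoc --from=json`; since only JSON crosses the boundary, the filter language is a free choice.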

So I guess two main observations. On build-vs-buy for backend features, given the breadth of possible "we want it like this, and not that", if one can easily play with ASTs, I was surprised by how quickly reinventing the wheel became a plausible call. Possibly skimming existing backend code for insight and templates, but mostly not using it (aka struggling to configure it to give you "this and not that"). The other observation is that once you have an AST and don't care about existing backends, your choice of parser and your choice of backend language/ecosystem decouple. One might use `pandoc --to=json` and then JS generic-AST tooling to emit HTML.
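A minimal sketch of that decoupling, assuming a small subset of pandoc's JSON AST: parse elsewhere, then render in JS/TS with one short case per node type. The type names (`Inline`, `Block`, `Attr`) and functions here are illustrative, not a pandoc-provided API. The sample input is hand-written in the shape pandoc emits, and any node type outside the subset throws rather than being silently dropped.

```typescript
// Assumed subset of pandoc's JSON AST; real documents contain many more node types.
type Attr = [string, string[], [string, string][]];
type Inline =
  | { t: "Str"; c: string }
  | { t: "Space" }
  | { t: "SoftBreak" }
  | { t: "Emph"; c: Inline[] }
  | { t: "Strong"; c: Inline[] }
  | { t: "Code"; c: [Attr, string] }
  | { t: "Link"; c: [Attr, Inline[], [string, string]] };
type Block =
  | { t: "Para"; c: Inline[] }
  | { t: "Plain"; c: Inline[] }
  | { t: "Header"; c: [number, Attr, Inline[]] }
  | { t: "BulletList"; c: Block[][] }
  | { t: "CodeBlock"; c: [Attr, string] };

const esc = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;");

const inlines = (xs: Inline[]): string => xs.map((x) => inline(x)).join("");
const blocks = (xs: Block[]): string => xs.map((x) => block(x)).join("\n");

function inline(n: Inline): string {
  switch (n.t) {
    case "Str": return esc(n.c);
    case "Space": return " ";
    case "SoftBreak": return "\n";
    case "Emph": return `<em>${inlines(n.c)}</em>`;
    case "Strong": return `<strong>${inlines(n.c)}</strong>`;
    case "Code": return `<code>${esc(n.c[1])}</code>`;
    case "Link": return `<a href="${esc(n.c[2][0])}">${inlines(n.c[1])}</a>`;
    default: throw new Error(`unhandled inline: ${(n as { t: string }).t}`);
  }
}

function block(n: Block): string {
  switch (n.t) {
    case "Para": return `<p>${inlines(n.c)}</p>`;
    case "Plain": return inlines(n.c);
    case "Header": return `<h${n.c[0]}>${inlines(n.c[2])}</h${n.c[0]}>`;
    case "BulletList": return `<ul>${n.c.map((item) => `<li>${blocks(item)}</li>`).join("")}</ul>`;
    case "CodeBlock": return `<pre><code>${esc(n.c[1])}</code></pre>`;
    default: throw new Error(`unhandled block: ${(n as { t: string }).t}`);
  }
}

function renderPandocJson(json: string): string {
  const doc = JSON.parse(json) as { blocks: Block[] };
  return blocks(doc.blocks);
}

// Hand-written stand-in for the output of `pandoc --to=json` on a two-block document.
const sampleDoc = JSON.stringify({
  "pandoc-api-version": [1, 23, 1],
  meta: {},
  blocks: [
    { t: "Header", c: [1, ["title", [], []], [{ t: "Str", c: "Title" }]] },
    { t: "Para", c: [{ t: "Str", c: "Hello" }, { t: "Space" }, { t: "Emph", c: [{ t: "Str", c: "world" }] }] },
  ],
});

const pageHtml = renderPandocJson(sampleDoc);
```

Each node type costs a line or so, which is what makes writing a bespoke backend plausible once the AST is in hand. The same shape would carry over to pattern matching in other languages.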

For parsing, a glance suggests Mdast emphasizes CommonMark and GitHub-flavored dialects. Pandoc-flavored MD is a bit broader.[1] My fuzzy recollection is I chose a pandoc parse for that, and an expectation of robustness ("it's haskell, and popular"), despite the then less than wonderful docs. IIRC, the resulting ASTs were fine. For backend, I wanted simple and concise to minimize burden, thus pattern matching (IIRC, most node types ended up a line or two), and chose road-less-traveled Julia for off-topic reasons (was thinking of using Julia for a compiler backend).

Thanks for your thoughts on Mdast - I'm tempted to play with it.

[1] https://garrettgman.github.io/rmarkdown/authoring_pandoc_mar...