This is you-can't-parse-HTML-with-regex [1] level hideous, only worse, because M...

pwdisswordfish · on Aug 16, 2016

It's even worse. The syntax of MediaWiki markup (parser functions, etc.) will vary depending on extensions installed in a given instance. And there is no way to obtain a list of parser functions and extension tags (the latter of which look like HTML tags, but parse differently). There is literally no way to reliably parse MediaWiki markup offline. [0] can give you a parse tree — but only of a top-level page, you can't recursively expand templates with it. And even that parse tree only gives you information about which templates should be expanded. After expanding templates you have to parse the whole thing again to actually interpret the formatting markup. Oh, and expand strip markers: [1]. Because some extensions which define their own tags will want to output raw, unfiltered HTML. But tags are parsed in the template expansion stage and raw HTML isn't allowed in the formatting stage, so what do the extensions do? They output "strip markers" that are later substituted for the raw HTML.

Most alternative parsing libraries (at least all the ones I looked at, and that includes Wikimedia's own Parsoid!) don't bother with implementing all that complexity. Which means that they can be often tripped by sufficiently tricky markup; and this turns out to be a quite low bar.

Parsing MediaWiki markup properly is an insanity on a level comparable only with TeX and Unix shell scripts. Even PHP is saner.

[0] https://www.mediawiki.org/wiki/API:Parsing_wikitext

[1] https://www.mediawiki.org/wiki/Strip_marker

germanier · on Aug 16, 2016

> And there is no way to obtain a list of parser functions and extension tags (the latter of which look like HTML tags, but parse differently).

FWIW, every sane extensions reports its existence to [[Special:Version]] which at least includes a list of extension tags at the bottom but you'd need to implement all of them in an alternative parser.

taneq · on Aug 16, 2016

This isn't for production-level data migration. It's for smooshing some source text into a shape which is useful to you.

Parsing HTML with regexps is fine if you're just curious roughly how many images are in a page. It's great for quick command line experiments. It's just not good when you need to be "doing it properly".

_0nac · on Aug 16, 2016

I used to work with Wiki markup for a living. The time you think you'll save with regex hackery is quickly chewed up by the time wasted eternally tweaking your regexes to catch yet another corner case -- it's much better just to parse for real from the get go, just like it's much better to use a real HTML/XML parser than trying to do the same job badly with regexes.

ivanhoe · on Aug 16, 2016

I used to make scrappers for living and trust me that it all depends on the particular situation and your requirements. Real HTML parsers are easier and safer for general type of work, but they quickly get very heavy on memory when parsing big DOM trees. If all you need from a page is a few strings, like e.g. just a product price (very common task), using regexes is far superior approach performance-wise. It's both faster and uses less memory (so you can run more parallel workers) and also if you write it well it's immune to many small html/design changes as long as the pattern you look for is not changed.

_0nac · on Aug 16, 2016

I'm sure that works well on product pages, which are just the same template reused for every single product. I'm afraid it will fail spectacularly on wiki pages, which are handcrafted by humans with all the completely unpredictable randomness that entails.

elmigranto · on Aug 16, 2016

I mostly agree, been using Regexes to parse known limited subset of HTML tags in known limited format (1 tag per line, forced padding, no attrs, etc.).

But on the other hand, this is often how long term "proper" solutions are born — evolved from something cobbled together in couple of hours.

taneq · on Aug 16, 2016

Oh yeah, the golden rule for quick hacks is NEVER show them to someone non-technical, especially someone non-technical who's in your chain of command.

I learned that one after mocking up some UI screens using VB6 and then having to explain over and over again that no, just because we showed you some buttons on a page doesn't mean that the program (which had to be a Java applet, mind you) was "almost done."

_0nac · on Aug 16, 2016

There are already literally dozens of "proper" MediaWiki parsers out there: https://www.mediawiki.org/wiki/Alternative_parsers