
Genuine question from a non-programmer: why? Is it because the volume of requests increases load on the servers/costs?


That's part of it, but it's also typically much more difficult, and there's an element of "why are you making this so much harder on yourself?"


(Author of original article here.)

That's the great thing about HtmlAgilityPack: extracting data from HTML is really easy. I might even say it's easier than if I had the page in some table-based data system.
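Something like this (a minimal sketch, not the code from the article — the page and the XPath are just for illustration) is about all it takes:

    // Pull label/value rows out of a Wikipedia infobox with HtmlAgilityPack.
    using System;
    using HtmlAgilityPack;

    class InfoboxDemo
    {
        static void Main()
        {
            var doc = new HtmlWeb().Load(
                "https://en.wikipedia.org/wiki/C_Sharp_(programming_language)");

            // Wikipedia infoboxes are tables whose class contains "infobox".
            var rows = doc.DocumentNode.SelectNodes(
                "//table[contains(@class,'infobox')]//tr[th and td]");
            if (rows == null) return;

            foreach (var row in rows)
            {
                var label = row.SelectSingleNode("./th").InnerText.Trim();
                var value = row.SelectSingleNode("./td").InnerText.Trim();
                Console.WriteLine($"{label}: {value}");
            }
        }
    }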


The HTML is more volatile and subject to change than other sources, though.


Don't remember the last time Wikipedia changed the infobox, though.


Can make it even harder: use Puppeteer to take screenshots, then pass them through OCR to get the text.
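The screenshot half of that is genuinely only a few lines, e.g. with PuppeteerSharp (a sketch under that assumption; the OCR step is left as an exercise):

    // Render the page in headless Chromium and save a screenshot,
    // which you would then feed to an OCR tool of your choice.
    using System.Threading.Tasks;
    using PuppeteerSharp;

    class ScreenshotDemo
    {
        static async Task Main()
        {
            // Download a compatible browser if one isn't cached
            // (the exact call varies a bit between PuppeteerSharp versions).
            await new BrowserFetcher().DownloadAsync();

            var browser = await Puppeteer.LaunchAsync(
                new LaunchOptions { Headless = true });
            var page = await browser.NewPageAsync();

            await page.GoToAsync("https://en.wikipedia.org/wiki/Web_scraping");
            await page.ScreenshotAsync("page.png");

            await browser.CloseAsync();
        }
    }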



Unlike APIs, html class/tag names or whatever provide no stability guarantees. The site owner can break your parser whenever they want for any reason. They can do that with an API, but usually won't since some guarantee of stability is the point of an API.


True, but the analysis was done on files downloaded over the span of two or three days. If someone had decided to change the CSS class of an infobox during that time, I'd have noticed, investigated and adjusted my code appropriately.


"html class/tag names or whatever provide no stability guarantees"

Not quite. Many Wikipedia infoboxes (and some other templates) use standardised class names from microformats such as hCard:

https://en.wikipedia.org/wiki/Wikipedia:Microformats
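A parser can target those classes instead of presentation ones. A rough sketch with HtmlAgilityPack (the exact classes depend on the template, so treat the page and selectors as assumptions):

    // Select by hCard microformat classes ("vcard", "fn", "bday"),
    // which are intended to be machine-readable, rather than by
    // whatever presentation classes the skin happens to use.
    using System;
    using HtmlAgilityPack;

    class MicroformatDemo
    {
        static void Main()
        {
            var doc = new HtmlWeb().Load("https://en.wikipedia.org/wiki/Ada_Lovelace");

            var card = doc.DocumentNode.SelectSingleNode("//*[contains(@class,'vcard')]");
            if (card == null) return;

            var name = card.SelectSingleNode(".//*[contains(@class,'fn')]");
            var bday = card.SelectSingleNode(".//*[contains(@class,'bday')]");

            Console.WriteLine($"fn:   {name?.InnerText.Trim()}");
            Console.WriteLine($"bday: {bday?.InnerText.Trim()}");
        }
    }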


Scraping, especially on a large scale, can put a noticeable strain on servers.

Bulk downloads (database dumps) are much cheaper to serve than a crawl of millions of individual pages.

It gets even more significant if generating the response is resource-intensive (I'm not sure whether Wikipedia qualifies, but complex templates may cause this).
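For comparison, grabbing a single bulk dump looks like this (a sketch; check https://dumps.wikimedia.org/ for the exact file you actually need):

    // Download one database dump instead of crawling millions of pages.
    using System;
    using System.IO;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    class DumpDownload
    {
        static async Task Main()
        {
            const string url =
                "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2";

            // Big download; disable the default 100-second timeout.
            using var http = new HttpClient { Timeout = Timeout.InfiniteTimeSpan };
            using var response = await http.GetAsync(
                url, HttpCompletionOption.ResponseHeadersRead);
            response.EnsureSuccessStatusCode();

            // The full dump is huge, so stream it to disk rather than
            // buffering it in memory.
            using var source = await response.Content.ReadAsStreamAsync();
            using var target = File.Create("enwiki-latest-pages-articles.xml.bz2");
            await source.CopyToAsync(target);

            Console.WriteLine("Done.");
        }
    }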



