
Genuine question from a non-programmer: why? Is it because the volume of requests increases load on the servers/costs?


That's part of it, but it's also typically much more difficult, and there's an element of "why are you making this so much harder on yourself?"


(Author of original article here.)

That's the great thing about HtmlAgilityPack: extracting data from HTML is really easy. I might even say it's easier than if I had the page in some table-based data system.
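Something like this (a minimal sketch, not the code from the article — the page and the XPath are just for illustration) is about all it takes:

    // Pull label/value rows out of a Wikipedia infobox with HtmlAgilityPack.
    using System;
    using HtmlAgilityPack;

    class InfoboxDemo
    {
        static void Main()
        {
            var doc = new HtmlWeb().Load(
                "https://en.wikipedia.org/wiki/C_Sharp_(programming_language)");

            // Wikipedia infoboxes are tables whose class contains "infobox".
            var rows = doc.DocumentNode.SelectNodes(
                "//table[contains(@class,'infobox')]//tr[th and td]");
            if (rows == null) return;

            foreach (var row in rows)
            {
                var label = row.SelectSingleNode("./th").InnerText.Trim();
                var value = row.SelectSingleNode("./td").InnerText.Trim();
                Console.WriteLine($"{label}: {value}");
            }
        }
    }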


The HTML is more volatile and subject to change than other sources, though.


Don't remember the last time Wikipedia changed the infobox, though.


Can make it even harder: use Puppeteer to take screenshots, then pass them through OCR to get the text.
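The screenshot half of that is genuinely only a few lines, e.g. with PuppeteerSharp (a sketch under that assumption; the OCR step is left as an exercise):

    // Render the page in headless Chromium and save a screenshot,
    // which you would then feed to an OCR tool of your choice.
    using System.Threading.Tasks;
    using PuppeteerSharp;

    class ScreenshotDemo
    {
        static async Task Main()
        {
            // Download a compatible browser if one isn't cached
            // (the exact call varies a bit between PuppeteerSharp versions).
            await new BrowserFetcher().DownloadAsync();

            var browser = await Puppeteer.LaunchAsync(
                new LaunchOptions { Headless = true });
            var page = await browser.NewPageAsync();

            await page.GoToAsync("https://en.wikipedia.org/wiki/Web_scraping");
            await page.ScreenshotAsync("page.png");

            await browser.CloseAsync();
        }
    }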



Unlike APIs, html class/tag names or whatever provide no stability guarantees. The site owner can break your parser whenever they want for any reason. They can do that with an API, but usually won't since some guarantee of stability is the point of an API.


True, but the analysis was done on files downloaded over the span of two or three days. If someone had decided to change the CSS class of an infobox during that time, I'd have noticed, investigated and adjusted my code appropriately.


"html class/tag names or whatever provide no stability guarantees"

Not quite. Many Wikipedia infoboxes (and some other templates) use standardised class names from microformats such as hCard:

https://en.wikipedia.org/wiki/Wikipedia:Microformats
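A parser can target those classes instead of presentation ones. A rough sketch with HtmlAgilityPack (the exact classes depend on the template, so treat the page and selectors as assumptions):

    // Select by hCard microformat classes ("vcard", "fn", "bday"),
    // which are intended to be machine-readable, rather than by
    // whatever presentation classes the skin happens to use.
    using System;
    using HtmlAgilityPack;

    class MicroformatDemo
    {
        static void Main()
        {
            var doc = new HtmlWeb().Load("https://en.wikipedia.org/wiki/Ada_Lovelace");

            var card = doc.DocumentNode.SelectSingleNode("//*[contains(@class,'vcard')]");
            if (card == null) return;

            var name = card.SelectSingleNode(".//*[contains(@class,'fn')]");
            var bday = card.SelectSingleNode(".//*[contains(@class,'bday')]");

            Console.WriteLine($"fn:   {name?.InnerText.Trim()}");
            Console.WriteLine($"bday: {bday?.InnerText.Trim()}");
        }
    }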


Scraping, especially on a large scale, can put a noticeable strain on servers.

Bulk downloads (database dumps) are much cheaper to serve than a crawl of millions of individual pages.

It gets even more significant if generating the response is resource-intensive (I'm not sure whether Wikipedia qualifies, but complex templates may cause this).
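For comparison, grabbing a single bulk dump looks like this (a sketch; check https://dumps.wikimedia.org/ for the exact file you actually need):

    // Download one database dump instead of crawling millions of pages.
    using System;
    using System.IO;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    class DumpDownload
    {
        static async Task Main()
        {
            const string url =
                "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2";

            // Big download; disable the default 100-second timeout.
            using var http = new HttpClient { Timeout = Timeout.InfiniteTimeSpan };
            using var response = await http.GetAsync(
                url, HttpCompletionOption.ResponseHeadersRead);
            response.EnsureSuccessStatusCode();

            // The full dump is huge, so stream it to disk rather than
            // buffering it in memory.
            using var source = await response.Content.ReadAsStreamAsync();
            using var target = File.Create("enwiki-latest-pages-articles.xml.bz2");
            await source.CopyToAsync(target);

            Console.WriteLine("Done.");
        }
    }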



