After running various HTML/XML/RSS parsers against a 1B-page web crawl, I'd have to say it's pretty rare to find one that can actually pass the web fuzz test. Most seem to have been written from a spec-driven approach, which is fine in a controlled environment but pretty useless if you want to turn the code loose on real-world web data.
Some of what we find, like the 1-in-80M core dumps, is to be expected because it's so rare and most folks don't have that much test data. But many other failures could be found by simply running a parser against a few hundred random URLs from the DMOZ RDF dump. I wish more library developers would do this.
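A test like that doesn't take much code, either. Here's a rough sketch, assuming Python with html5lib installed and a plain-text file of URLs (one per line) already pulled out of the DMOZ RDF dump; the urls.txt name is just a placeholder:

    import random
    import traceback
    import urllib.request

    import html5lib

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Parse a few hundred random pages and report anything that blows up.
    for url in random.sample(urls, min(300, len(urls))):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                raw = resp.read()
        except OSError:
            continue  # dead link or network trouble, not the parser's fault
        try:
            html5lib.parse(raw)  # the call under test
        except Exception:
            print("parser failed on", url)
            traceback.print_exc()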
Especially since html5lib is supposed to follow the HTML5 parsing rules, which were basically reverse-engineered from IE's HTML parsing, so it ought to work for every web page in existence.
Yeah, I just wrote a spider last night using html5lib and had to wrap the parsing in a try block, so I can categorically say that it doesn't work for all web pages.
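The wrapping was nothing fancy. A minimal sketch of the pattern (the crawl and fetch machinery is left out, and parse_or_skip is just an illustrative name):

    import html5lib

    def parse_or_skip(url, raw_html):
        """Parse one fetched page, returning None instead of crashing the crawl."""
        try:
            return html5lib.parse(raw_html)
        except Exception as exc:
            # In theory html5lib should never raise on real-world markup,
            # but in practice some pages still trip it up.
            print("html5lib choked on %s: %r" % (url, exc))
            return None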