
After having run various HTML/XML/RSS parsers against a 1B-page web crawl, I'd have to say that it's pretty rare to find ones that can actually pass the web fuzz test. Most seem to have been written from a more spec-driven approach. This is fine in a controlled environment, but pretty useless if you want to turn the code loose on real-world web data.

Some of the stuff we find, like the 1-in-80M core dumps, is to be expected, because those bugs are so rare and most folks don't have that much test data. But many others could be found by simply running a parser against a few hundred random URLs from the DMOZ RDF dump. I wish more library developers would do this.
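
For anyone who wants to try that, here's a minimal sketch of such a smoke test. Assume urls.txt holds one URL per line (e.g. sampled from the DMOZ RDF dump); the filename, sample size, and timeout are all made up, and html5lib just stands in for whichever parser you're testing:

  # Smoke-test a parser against a few hundred real-world pages and log
  # anything that makes it raise. Assumes urls.txt holds one URL per line.
  import random
  import urllib2
  import html5lib

  urls = [line.strip() for line in open('urls.txt') if line.strip()]
  parser = html5lib.HTMLParser()
  for url in random.sample(urls, min(200, len(urls))):
    try:
      page = urllib2.urlopen(url, timeout=10).read()
    except Exception:
      continue  # fetch failures aren't parser bugs
    try:
      parser.parse(page)
    except Exception, e:
      print 'parser choked on %s: %s' % (url, e)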




I'm sure the html5lib guys would love to hear about parser bugs exposed by a corpus that large:

http://code.google.com/p/html5lib/

Especially since html5lib is supposed to follow the HTML5 parsing rules, which were basically reverse-engineered from IE's HTML parsing, so they ought to work for every web page in existence.
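
As a rough illustration of what that buys you (a sketch, with a made-up input fragment): the HTML5 algorithm defines error recovery, so html5lib builds a tree out of mis-nested markup rather than rejecting it:

  import html5lib

  # The HTML5 algorithm specifies recovery for malformed markup, so this
  # mis-nested fragment parses into a well-formed tree instead of erroring.
  doc = html5lib.parse('<b><i>bold italic</b> just italic?</i>')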


I don't think anything is going to work on every web page in existence. Perhaps strlen.


Yeah, since I just wrote a spider last night using html5lib and had to wrap it in a try block, I can categorically say that it doesn't work for all web pages:

  import html5lib
  from html5lib import treebuilders

  # Inside the spider's response handler: build a parser that emits a
  # BeautifulSoup tree, and bail out if html5lib chokes on the page.
  parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
  try:
    document = parser.parse(response)
  except Exception, e:
    print 'parse failed ' + str(e)
    return


And strlen certainly wouldn't if you actually expect a correct answer. Can't guess at encodings... :)
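
To make that concrete (Python here just stands in for C's byte-counting strlen): the same five-character string yields different byte counts depending on which encoding you assume:

  # Byte counts vs. character counts: a strlen-style byte scan gives a
  # different answer depending on which encoding it assumes.
  text = u'h\xe9llo'                 # five characters, e-acute in the middle
  print len(text)                    # 5 (characters)
  print len(text.encode('utf-8'))    # 6 (bytes: the accented e is two bytes)
  print len(text.encode('latin-1'))  # 5 (bytes: the accented e is one byte)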



