
After having run various HTML/XML/RSS parsers against a 1B-page web crawl, I'd have to say that it's pretty rare to find ones that can actually pass the web fuzz test. Most seem to have been written from a more spec-driven approach. This is fine in a controlled environment, but pretty useless if you want to turn the code loose on real-world web data.

Some of the stuff we find, like the 1-in-80M core dumps, is to be expected, because those bugs are so rare and most folks don't have that much test data. But many others could be found by simply running a parser against a few hundred random URLs from the DMOZ RDF dump. I wish more library developers would do this.
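
For anyone who wants to try that, here's a minimal sketch of such a smoke test. Assume urls.txt holds one URL per line (e.g. sampled from the DMOZ RDF dump); the filename, sample size, and timeout are all made up, and html5lib just stands in for whichever parser you're testing:

  # Smoke-test a parser against a few hundred real-world pages and log
  # anything that makes it raise. Assumes urls.txt holds one URL per line.
  import random
  import urllib2
  import html5lib

  urls = [line.strip() for line in open('urls.txt') if line.strip()]
  parser = html5lib.HTMLParser()
  for url in random.sample(urls, min(200, len(urls))):
    try:
      page = urllib2.urlopen(url, timeout=10).read()
    except Exception:
      continue  # fetch failures aren't parser bugs
    try:
      parser.parse(page)
    except Exception, e:
      print 'parser choked on %s: %s' % (url, e)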




I'm sure the html5lib guys would love to hear about parser bugs exposed by a corpus that large:

http://code.google.com/p/html5lib/

Especially since html5lib is supposed to follow the HTML5 parsing rules, which were basically reverse-engineered from IE's HTML parsing, so they ought to work for every web page in existence.
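
As a rough illustration of what that buys you (a sketch, with a made-up input fragment): the HTML5 algorithm defines error recovery, so html5lib builds a tree out of mis-nested markup rather than rejecting it:

  import html5lib

  # The HTML5 algorithm specifies recovery for malformed markup, so this
  # mis-nested fragment parses into a well-formed tree instead of erroring.
  doc = html5lib.parse('<b><i>bold italic</b> just italic?</i>')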


I don't think anything is going to work on every web page in existence. Perhaps strlen.


Yeah, since I just wrote a spider last night using html5lib and had to wrap it in a try block, I can categorically say that it doesn't work for all web pages:

  import html5lib
  from html5lib import treebuilders

  # Inside the spider's response handler: build a parser that emits a
  # BeautifulSoup tree, and bail out if html5lib chokes on the page.
  parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
  try:
    document = parser.parse(response)
  except Exception, e:
    print 'parse failed ' + str(e)
    return


And strlen certainly wouldn't if you actually expect a correct answer. Can't guess at encodings... :)
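
To make that concrete (Python here just stands in for C's byte-counting strlen): the same five-character string yields different byte counts depending on which encoding you assume:

  # Byte counts vs. character counts: a strlen-style byte scan gives a
  # different answer depending on which encoding it assumes.
  text = u'h\xe9llo'                 # five characters, e-acute in the middle
  print len(text)                    # 5 (characters)
  print len(text.encode('utf-8'))    # 6 (bytes: the accented e is two bytes)
  print len(text.encode('latin-1'))  # 5 (bytes: the accented e is one byte)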



