
Yeah, I just wrote a spider last night using html5lib and had to wrap the parse call in a try block, so I can categorically say that it doesn't work for all webpages:

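  import html5lib
  from html5lib import treebuilders

  parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
  try:
    document = parser.parse(response)
  except Exception as e:
    # Report the failure and bail out instead of letting the exception propagate.
    print('parse failed ' + str(e))
    return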
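
The same pattern can be pulled into a small helper so the crawl loop keeps going when one page fails. This is only a rough sketch: `safe_parse` and `parse_all` are hypothetical names, not part of html5lib, and the only thing assumed about `parser` is that it has a `parse()` method that may raise.

```python
import logging

log = logging.getLogger("spider")


def safe_parse(parser, response):
    """Return the parsed document, or None if the parser raises.

    `parser` is any object with a parse() method (for example an
    html5lib.HTMLParser). This helper is hypothetical, not a library API.
    """
    try:
        return parser.parse(response)
    except Exception as e:  # broad on purpose: the failure modes are not known in advance
        log.warning("parse failed %s", e)
        return None


def parse_all(parser, responses):
    """Parse each response in order, skipping the ones that fail."""
    documents = []
    for response in responses:
        document = safe_parse(parser, response)
        if document is not None:
            documents.append(document)
    return documents
```

Catching bare `Exception` hides real bugs as well as bad markup, so it is probably worth logging the URL alongside the error and narrowing the exception type once the actual failures are known.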

