Does Python have a libxml2 binding? I have had pretty good luck with its parse_h...

jinglebells · on March 13, 2009

That it does. It works like this:

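  import libxml2

  # Recover from broken markup; suppress parser errors and warnings.
  parse_options = (libxml2.HTML_PARSE_RECOVER
                   | libxml2.HTML_PARSE_NOERROR
                   | libxml2.HTML_PARSE_NOWARNING)

  # junk_html is the markup string to clean.
  xml_document = libxml2.readDoc(junk_html, None, None, parse_options)
  clean_xhtml = xml_document.getRootElement().serialize()
  xml_document.freeDoc()  # the bindings do not free documents automatically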

Note: this method of "cleaning" works by building an XML tree out of the HTML and serializing it back out. HTML is not XML, though, so an empty element such as <textarea></textarea> can come out in self-closed form (<textarea/>). A browser parsing the page as HTML treats that as an opening tag only and puts any HTML after it into the textarea on the screen, so don't use this if you still want to output to a browser.
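
To see the pitfall in isolation, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for libxml2 (an assumption, chosen so it runs without extra packages), and the sample markup is made up for illustration. It only shows how an XML serializer and an HTML-aware serializer differ on an empty element; it is not a drop-in fix for the libxml2 code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an empty <textarea> followed by more markup.
root = ET.fromstring("<div><textarea></textarea><p>after</p></div>")

# XML serialization collapses the empty element into a self-closing tag,
# which a browser in HTML mode reads as an unclosed <textarea>.
as_xml = ET.tostring(root, encoding="unicode")

# HTML serialization keeps the explicit end tag.
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)   # <div><textarea /><p>after</p></div>
print(as_html)  # <div><textarea></textarea><p>after</p></div>
```

If the result is headed for a browser, serializing with an HTML-aware writer (or otherwise forcing explicit end tags) is likely the safer route.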

EDIT: fixed formatting.