
That it does. It works like this:

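  import libxml2

  # Recover from malformed markup and silence parser errors and warnings.
  parse_options = libxml2.HTML_PARSE_RECOVER | libxml2.HTML_PARSE_NOERROR | libxml2.HTML_PARSE_NOWARNING

  # junk_html is the messy HTML string you want to clean.
  xml_document = libxml2.readDoc(junk_html, None, None, parse_options)

  clean_xhtml = xml_document.getRootElement().serialize()
  xml_document.freeDoc()  # the bindings expect documents to be freed explicitly
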
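Note: this method of "cleaning" works by building an XML tree out of the HTML, and HTML is not XML. When the tree is serialized, empty elements such as <textarea></textarea> get collapsed into the self-closing form <textarea/>. A browser parsing the result as HTML ignores the trailing slash and treats it as an opening tag that is never closed, so any HTML after the tag ends up inside the textarea on screen. Don't use this if you still want to output to a browser.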
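To make the textarea problem concrete, here is a small sketch using the standard library's xml.etree.ElementTree as a stand-in for libxml2 (an assumption on my part, so it runs without extra packages; the helper name serialize_both is made up for illustration). It only shows the XML-style collapsing of empty elements, not libxml2's exact output.

```python
import xml.etree.ElementTree as ET


def serialize_both(markup):
    """Parse well-formed markup and serialize it with XML and HTML rules."""
    root = ET.fromstring(markup)
    # XML rules: an element with no content is written in self-closing form.
    as_xml = ET.tostring(root, encoding="unicode")
    # HTML rules: non-void elements always get an explicit closing tag.
    as_html = ET.tostring(root, encoding="unicode", method="html")
    return as_xml, as_html


as_xml, as_html = serialize_both("<div><textarea></textarea><p>after</p></div>")
print(as_xml)   # <div><textarea /><p>after</p></div>
print(as_html)  # <div><textarea></textarea><p>after</p></div>
```

The first form is what bites you in a browser: served as text/html, the self-closed textarea is read as an open tag and swallows the paragraph that follows.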

EDIT: fixed formatting.


