
Does Python have a libxml2 binding? I have had pretty good luck with its parse_html_string function.

Failing that, you can always use Perl and HTML::TreeBuilder / HTML::Parser. They work pretty well on malformed input.




That it does. It works like this:

  import libxml2

  # Recover from malformed markup and silence parser errors and warnings.
  parse_options = libxml2.HTML_PARSE_RECOVER | libxml2.HTML_PARSE_NOERROR | libxml2.HTML_PARSE_NOWARNING

  # junk_html is the untidy input string.
  xml_document = libxml2.readDoc(junk_html, None, None, parse_options)
  clean_xhtml = xml_document.getRootElement().serialize()

Note: this method of "cleaning" works by building an XML tree out of the HTML, but HTML is not XML. Empty elements such as <textarea></textarea> get serialized in self-closed form (<textarea/>), which a browser parsing the page as HTML reads as an opening tag that is never closed, so any HTML after it ends up inside the textarea on the screen. Don't use this if you still want to output to a browser.
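
To make that pitfall concrete, here is a rough sketch that uses the standard library's xml.etree.ElementTree instead of libxml2 (a swapped-in stand-in, so it runs without the bindings installed). The sample markup and the serialize_both helper are made up for illustration. It serializes the same tree once as XML and once as HTML, so you can compare how the empty textarea comes out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed input standing in for the cleaned tree.
SAMPLE = "<div><textarea></textarea><p>after</p></div>"


def serialize_both(markup):
    """Parse markup once and return (xml_output, html_output) as strings."""
    root = ET.fromstring(markup)
    # XML serialization collapses elements with no content into a short tag.
    as_xml = ET.tostring(root, encoding="unicode", method="xml")
    # HTML serialization keeps an explicit closing tag for textarea.
    as_html = ET.tostring(root, encoding="unicode", method="html")
    return as_xml, as_html


as_xml, as_html = serialize_both(SAMPLE)
print(as_xml)
print(as_html)
```

If the output is going to a browser as text/html, the second form is the one you want. The same idea should carry over to whatever serializer you use, though the exact option names will differ.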

EDIT: fixed formatting.



