Note: this method of "cleaning" works by building an XML tree out of the HTML. But HTML is not XML: when the tree is written back out, an empty element such as <textarea></textarea> gets collapsed into the self-closing form <textarea/>. A browser treats that as an opening tag with no close, so it puts all the HTML that follows into the textarea on the screen. Don't use this method if you still want to output to a browser.
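To make the problem concrete, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree as a stand-in for whatever XML library the cleaning method actually relies on (that choice is an assumption, and the sample markup is made up for illustration). It shows the same tree serialized as XML and as HTML:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: an empty textarea followed by more HTML.
source = "<div><textarea></textarea><p>after</p></div>"

# Build an XML tree from the (well-formed) HTML fragment.
root = ET.fromstring(source)

# Default XML serialization collapses the empty element to a self-closing tag.
xml_out = ET.tostring(root, encoding="unicode")

# HTML serialization keeps the explicit closing tag for non-void elements.
html_out = ET.tostring(root, encoding="unicode", method="html")

print(xml_out)   # <div><textarea /><p>after</p></div>
print(html_out)  # <div><textarea></textarea><p>after</p></div>
```

If the first form is sent to a browser as text/html, the paragraph ends up inside the textarea. Where the library offers an HTML output mode, as ElementTree does with method="html", using it for anything browser-bound avoids this.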
Failing that, you can always use Perl with HTML::TreeBuilder or HTML::Parser. Both work pretty well on malformed input.