Actually, I think it would be pretty easy if you are willing to have a running Mozilla process. Just connect to it with MozRepl, get it to render a page, and then inspect the DOM with JavaScript. (This could be wrapped in a library so that you get a W3C DOM back on the Python side, or whatever.)
I use a similar technique to get Emacs to syntax-highlight my slides. Connect to the running Emacs (with all my settings), run htmlify via emacsclient --eval, and enjoy perfect highlighting!
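For what it's worth, the MozRepl idea above might look roughly like this from the Python side. This is only a sketch under assumptions: that MozRepl is listening on localhost port 4242 (its usual default, as far as I know) and that its prompt is "repl> ". The helper names (read_until_prompt, repl_eval, connect) are made up for illustration and are not part of MozRepl.

```python
import socket

# Assumption: MozRepl's prompt text; adjust if your session shows a different one.
PROMPT = b"repl> "


def read_until_prompt(sock, prompt=PROMPT):
    """Read from the socket until the prompt appears; return the text before it."""
    buf = b""
    while not buf.endswith(prompt):
        chunk = sock.recv(4096)
        if not chunk:  # connection closed by the other side
            break
        buf += chunk
    if buf.endswith(prompt):
        buf = buf[:-len(prompt)]
    return buf.decode("utf-8", "replace").strip()


def repl_eval(sock, js, prompt=PROMPT):
    """Send one line of JavaScript and return whatever the REPL prints back."""
    sock.sendall(js.encode("utf-8") + b"\n")
    return read_until_prompt(sock, prompt)


def connect(host="localhost", port=4242):
    """Open a connection (host and port are assumptions) and skip the banner."""
    sock = socket.create_connection((host, port))
    read_until_prompt(sock)
    return sock
```

With a browser running and MozRepl started, something like `sock = connect()` followed by `repl_eval(sock, "content.document.title")` should return the title of the current page, assuming `content` is reachable from the REPL's context. Everything comes back as text, so building a real DOM on the Python side would need more work than this.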
Sorry, yes -- I definitely don't want a running Mozilla process. Plus it's not at all clear that it's possible to run Mozilla headless, though I didn't look that hard.
Ah, cool -- it's just that my servers don't run X, and they don't really have enough RAM to spare for 30 copies of X, Mozilla, and the other associated stuff. I really just need a relatively compact parsing engine.
I'm not sure why you would need 30 copies of X or Mozilla.
Either way, it is kind of inelegant, but it is hard to pick and choose parts of Mozilla. This is probably the simplest way to let Mozilla parse your HTML. (That, however, may not be necessary. I have done a lot of screen-scraping, and I have never encountered anything that confused HTML::TreeBuilder. Lately, I've been using libxml2, and that has also worked very well. Zero problems.)
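To illustrate the point that a small tolerant parser is often enough for scraping, here is a minimal sketch in Python. Note the swap: it uses the standard library's html.parser rather than HTML::TreeBuilder or libxml2, simply so it runs with no extra packages (lxml is the usual Python binding if you want libxml2 itself). The LinkCollector class and extract_links function are hypothetical names for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in the given HTML, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

This reports start tags as it sees them and does not build a tree, so unclosed or misnested elements do not stop it. For tree-style queries you would want one of the tree-building parsers mentioned above.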