I've been using BeautifulSoup on a project and noticed the exact problems he's m...

furtivefelon · on March 13, 2009

I had the same problem using 3.1.0, and following some suggestions from the newsgroup, I switched to html5lib, which works fairly well. So far I haven't had a problem parsing about 6 sites that I previously had to clean up using regexps.
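
For anyone curious what that looks like, here is a minimal sketch, assuming a current html5lib release installed from PyPI. The helper name `extract_links` and the sample markup are made up for illustration; only `html5lib.parse` and the standard ElementTree methods come from the libraries themselves.

```python
import html5lib  # third-party package: pip install html5lib


def extract_links(markup):
    """Return the href of every <a> element in possibly sloppy HTML.

    extract_links is a hypothetical helper, not part of html5lib.
    """
    # Build a plain xml.etree.ElementTree tree; turning off namespacing
    # keeps tag names short ("a" rather than a namespaced name).
    root = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)
    return [a.get("href") for a in root.iter("a") if a.get("href")]


if __name__ == "__main__":
    # Unclosed <p> tags and an unquoted attribute, the kind of markup
    # that otherwise tends to need regexp cleanup first.
    sloppy = '<p>First <a href="/x">one</a><p>Second <a href=/y>two</a>'
    print(extract_links(sloppy))
```

The parser follows the HTML5 error-recovery rules, so the tree should match what a browser would build from the same markup.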

tocomment · on March 13, 2009

I've always wondered this too. It is very strange that no one wants to talk about it. Maybe we could all get together and put up a bounty somewhere for someone to make this?

earl · on March 12, 2009

I briefly looked into doing this. The answer is that it's pretty damn difficult, at least in the case of Mozilla.

jrockway · on March 12, 2009

Actually, I think it would be pretty easy if you are willing to have a running Mozilla process. Just connect to it with MozRepl, get it to render a page, and then inspect the DOM with JavaScript. (This could be library-ed up so that you get a W3C DOM back on the Python side, or whatever.)

I use a similar technique to get emacs to syntax-highlight my slides. Connect to the running emacs (with all my settings), run htmlify via emacsclient --eval, and enjoy perfect highlighting!

earl · on March 12, 2009

Sorry, yes -- I definitely don't want a running Mozilla process. Plus it's not at all clear that it's possible to run Mozilla headless, though I didn't look that hard.

jrockway · on March 13, 2009

You can run any X app headless with Xvfb.
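
A rough sketch of how that might be scripted, assuming the `xvfb-run` wrapper that many distributions ship alongside Xvfb. The helper name `headless_command`, the screen geometry, and the `firefox` command line are illustrative assumptions; the code only builds the argument list, so it can be inspected before anything is launched.

```python
import shlex


def headless_command(command, screen="1024x768x24"):
    """Wrap an argv list so the program runs inside a virtual X server.

    headless_command is a hypothetical helper; xvfb-run and its
    --auto-servernum / --server-args options are the real wrapper's.
    """
    return [
        "xvfb-run",
        "--auto-servernum",                   # pick a free display number
        f"--server-args=-screen 0 {screen}",  # virtual screen passed to Xvfb
        *command,
    ]


if __name__ == "__main__":
    argv = headless_command(["firefox", "http://example.com/"])
    # Print the command rather than running it; pass argv to
    # subprocess.run() on a machine that has Xvfb installed.
    print(shlex.join(argv))
```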

earl · on March 13, 2009

Ah, cool -- it's just that my servers don't run X, and they don't really have enough RAM to spare for 30 copies of X, Mozilla, and other associated stuff. I really just need a relatively compact parsing engine.

jrockway · on March 13, 2009

I'm not sure why you would need 30 copies of X or Mozilla.

Either way, it is kind of inelegant, but it is hard to pick and choose parts of Mozilla. This is probably the simplest way to let Mozilla parse your HTML. (That, however, may not be necessary. I have done a lot of screen-scraping, and I have never encountered anything that HTML::TreeBuilder got confused by. Lately, I've been using libxml2, and that has also worked very well. Zero problems.)