Actually, I think it would be pretty easy if you are willing to have a running M...

earl · on March 12, 2009

Sorry, yes -- I definitely don't want a running Mozilla process. Plus it's not at all clear that it's possible to run Mozilla headless, though I didn't look that hard.

jrockway · on March 13, 2009

You can run any X app headless with Xvfb.
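
For illustration, a minimal sketch of what that could look like from a script. It assumes the `xvfb-run` wrapper (shipped alongside Xvfb on many distributions) is on the PATH; the helper names `xvfb_command` and `run_headless` are hypothetical, made up for this example.

```python
import shutil
import subprocess


def xvfb_command(app_argv):
    """Wrap a command line so it runs against a throwaway virtual X display."""
    # -a asks xvfb-run to pick a free display number instead of a fixed one.
    return ["xvfb-run", "-a", *app_argv]


def run_headless(app_argv, timeout=60):
    """Run an X application under Xvfb; raises if xvfb-run is not installed."""
    if shutil.which("xvfb-run") is None:
        raise RuntimeError("xvfb-run not found on PATH")
    return subprocess.run(xvfb_command(app_argv), timeout=timeout, check=True)
```

Something like `run_headless(["xclock"])` would then start the app with no physical display attached. Xvfb renders into memory, so the machine does not need a real X server running.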

earl · on March 13, 2009

Ah, cool -- it's just that my servers don't run X. Or really have enough RAM to spare to add 30 copies of X, Mozilla, and other associated stuff. I really just need a relatively compact parsing engine.

jrockway · on March 13, 2009

I'm not sure why you would need 30 copies of X or Mozilla.

Either way, it is kind of inelegant, but it is hard to pick and choose parts of Mozilla. This is probably the simplest way to let Mozilla parse your HTML. (That, however, may not be necessary. I have done a lot of screen-scraping, and I have never encountered anything that confused HTML::TreeBuilder. Lately, I've been using libxml2, and that has also worked very well. Zero problems.)
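
To make the "compact parsing engine" point concrete, here is a rough sketch of pulling links out of sloppy markup without a browser. It is not HTML::TreeBuilder or libxml2; it uses Python's standard-library `html.parser` as a stand-in for the same idea (a lenient parser that keeps going on broken input). `LinkCollector`, `extract_links`, and the sample markup are hypothetical names and data for this example only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating malformed markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Unclosed <p> and <a>, a stray </b>, and a <td> outside any table.
messy = '<p>Unclosed paragraph <a href="/one">one<a href="/two">two</b><td>stray cell'
print(extract_links(messy))
```

The parser does not build a tree or repair the document; it just reports tags as it sees them, which is often enough for scraping. For real tree repair, HTML::TreeBuilder or libxml2's HTML parser would be the tools mentioned above.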