
I've been using BeautifulSoup on a project and noticed the exact problems he's mentioning. I actually ended up filtering the source with a regexp to remove script tags and their contents prior to parsing because of the HTMLParser weirdness. It wasn't a pleasant experience. The whole time I was doing this, I kept looking at my nice Firebug element tree and wondering "Why am I even going to this trouble?"
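
For what it's worth, the pre-filtering step looked roughly like the sketch below. This is a minimal reconstruction, not the exact code from the project: the names `SCRIPT_RE` and `strip_scripts` and the sample page are made up for illustration, and the pattern is deliberately naive (it assumes every script element has an explicit closing tag).

```python
import re

# Naive pattern: an opening <script ...> tag, anything up to the first
# closing </script>, matched case-insensitively and across newlines.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with script elements and their contents removed."""
    return SCRIPT_RE.sub("", html)


# Hypothetical sample page: the script body contains markup that can
# confuse a lenient parser if it is left in place.
page = ('<html><head><script type="text/javascript">'
        'document.write("<b>hi</b>");</script></head>'
        '<body><p>Hello</p></body></html>')

cleaned = strip_scripts(page)
# cleaned can now be handed to whichever parser you are using.
```

It only removes the scripts; everything else in the page is passed through untouched.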

Does anyone else wonder why we're writing all these parsers when both Mozilla and WebKit have reliable, robust parsers that are actively maintained? How difficult would it be to package up the existing code and distribute it with wrappers for Python, Ruby, etc.? I assume there's something I don't know, because not only has it not been done, but no one seems to want to talk about it.




I had the same problem using 3.1.0, and with some suggestions from the newsgroup, the html5lib alternative works fairly well. So far I have had no problems parsing about six sites that I previously had to clean up with regexps.


I've always wondered this too. It is very strange that no one wants to talk about it. Maybe we could all get together and put up a bounty somewhere for someone to make this?


I briefly looked into doing this. The answer is: pretty damn difficult, at least in the case of Mozilla.


Actually, I think it would be pretty easy if you are willing to have a running Mozilla process. Just connect to it with MozRepl, get it to render a page, and then inspect the DOM with JavaScript. (This could be library-ed up so that you get a W3C DOM back on the Python side, or whatever.)

I use a similar technique to get emacs to syntax-highlight my slides. Connect to the running emacs (with all my settings), run htmlify via emacsclient --eval, and enjoy perfect highlighting!


Sorry, yes -- I definitely don't want a running Mozilla process. Plus it's not at all clear that it's possible to run Mozilla headless, though I didn't look that hard.


You can run any X app headless with Xvfb.
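
A rough sketch of how that could be scripted, assuming the `Xvfb` binary is installed and on the PATH. The helper names `headless_commands` and `run_headless`, the display number `:99`, and the screen geometry are arbitrary choices for illustration, not anything Xvfb requires.

```python
import os
import subprocess
import time


def headless_commands(app_argv, display=":99"):
    """Build the Xvfb command line and the environment for the X app."""
    # Xvfb serves a virtual framebuffer on the given display number.
    xvfb_argv = ["Xvfb", display, "-screen", "0", "1024x768x24"]
    # The app finds its X server through the DISPLAY variable.
    env = dict(os.environ, DISPLAY=display)
    return xvfb_argv, list(app_argv), env


def run_headless(app_argv, display=":99"):
    """Start Xvfb, run the app against it, then shut Xvfb down."""
    xvfb_argv, app_argv, env = headless_commands(app_argv, display)
    xvfb = subprocess.Popen(xvfb_argv)
    try:
        # Crude wait for the server to come up; a real script should
        # poll for the display instead of sleeping.
        time.sleep(1)
        return subprocess.call(app_argv, env=env)
    finally:
        xvfb.terminate()
        xvfb.wait()
```

Something like `run_headless(["firefox"])` would then start the browser with no physical display attached.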


Ah, cool -- it's just that my servers don't run X, and they don't really have enough RAM to spare for 30 copies of X, Mozilla, and the other associated stuff. I really just need a relatively compact parsing engine.


I'm not sure why you would need 30 copies of X or Mozilla.

Either way, it is kind of inelegant, but it is hard to pick and choose parts of Mozilla. This is probably the simplest way to let Mozilla parse your HTML. (That, however, may not be necessary. I have done a lot of screen-scraping, and I have never encountered anything that confused HTML::TreeBuilder. Lately, I've been using libxml2, and that has also worked very well. Zero problems.)



