
This.

The Scrapy tutorial is good if you just want to use Scrapy to crawl a site and extract a bunch of information, one time.

If you want to do scraping as a small part of another Python project, then it can be easier just to use Scrapy's HtmlXPathSelector, which is more forgiving than a real XML parser.

    import urllib2
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import TextResponse
    
    url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
    my_xpath = '//title/text()'
    
    # Fetch the page ourselves; set a User-Agent, since some sites
    # reject urllib2's default one
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla or whatever'})
    body = urllib2.urlopen(req).read()
    
    # Wrap the raw HTML in a TextResponse so the selector can parse it
    response = TextResponse(url=url, body=body, encoding='utf-8')
    hxs = HtmlXPathSelector(response)
    result = hxs.select(my_xpath).extract()  # list of matched strings


HtmlXPathSelector is just a very thin wrapper around lxml; it doesn't add anything parsing-wise. You might as well use lxml directly if you don't already have Scrapy as a dependency.

https://github.com/scrapy/scrapy/tree/master/scrapy/selector
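
For comparison, a minimal sketch of doing the same thing with lxml directly (reusing the URL and XPath from the snippet above):

    import urllib2
    import lxml.html
    
    url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
    body = urllib2.urlopen(url).read()
    
    # lxml's HTML parser is the forgiving parser the scrapy
    # selector delegates to anyway
    doc = lxml.html.fromstring(body)
    result = doc.xpath('//title/text()')  # same output as the selector version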


pyquery is a good alternative too. It's a slightly larger wrapper around lxml that lets you use jQuery-style selectors.
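
A minimal sketch of that style (the markup and class name here are made up for illustration):

    from pyquery import PyQuery
    
    html = '<ul><li><a class="book" href="/p">Python</a></li></ul>'
    d = PyQuery(html)                  # parses via lxml under the hood
    for link in d('a.book').items():   # jQuery-style CSS selector
        print link.attr('href'), link.text()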


lxml's `cssselect` method is nice for this - I found that with `xpath` and `cssselect` I have no need for anything else. I use cssselect for simple queries, like "a.something" - which would be needlessly verbose in XPath - and xpath for more complex ones, for example when I need access to axes or I want to apply some simple transform to the data before processing it in Python. Worked very well for me.
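
A small sketch of mixing the two (the markup and class name are hypothetical):

    import lxml.html
    
    html = '<div><a class="something" href="/x">one</a></div>'
    doc = lxml.html.fromstring(html)
    
    # CSS for the simple case - terse and readable
    links = doc.cssselect('a.something')
    
    # XPath when you need axes, functions, or attribute extraction
    hrefs = doc.xpath('//a[@class="something"]/@href')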


Doh! Too late to edit or delete my original comment :(

(And I can't downvote my own comment.)



