The Scrapy tutorial is good if you just want to use Scrapy to crawl a site and extract a bunch of information one time.
If you want to do scraping as a small part of another Python project, it can be easier to just use Scrapy's HtmlXPathSelector, which is more forgiving than a real XML parser:
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import TextResponse

url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
my_xpath = '//title/text()'

# Fetch the page ourselves; some sites reject requests with no User-Agent
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla or whatever'})
body = urllib2.urlopen(req).read()

# Wrap the raw HTML in a TextResponse so the selector can work on it
response = TextResponse(url=url, body=body, encoding='utf-8')
hxs = HtmlXPathSelector(response)
result = hxs.select(my_xpath).extract()
HtmlXPathSelector is just a very thin wrapper around lxml; it doesn't add anything parsing-wise. You might as well use lxml directly if you don't already have Scrapy as a dependency.
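For comparison, here is a minimal sketch of the same title extraction going through lxml directly (same URL and User-Agent as the snippet above; lxml.html is the forgiving parser doing the actual work either way):

import urllib2
import lxml.html

url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla or whatever'})
body = urllib2.urlopen(req).read()

# lxml.html copes with broken markup, unlike a strict XML parser
doc = lxml.html.fromstring(body)
result = doc.xpath('//title/text()')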
lxml's `cssselect` method is nice for this; with `xpath` and `cssselect` I have no need for anything else. I use `cssselect` for simple queries like "a.something", which would be needlessly verbose in XPath, and `xpath` for more complex ones, for example when I need access to axes or want to apply a simple transform to the data before processing it in Python. This has worked very well for me.
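To make that split concrete, a small sketch on a made-up fragment of HTML (the class names and table layout are invented for illustration; note that on recent lxml versions `cssselect` is a separate package you have to install):

import lxml.html

html = '''
<div>
  <a class="something" href="/one">One</a>
  <a class="other" href="/two">Two</a>
  <table><tr><td>Price</td><td> 42 USD </td></tr></table>
</div>
'''
doc = lxml.html.fromstring(html)

# cssselect for the simple case; the XPath equivalent of "a.something" is
# //a[contains(concat(' ', normalize-space(@class), ' '), ' something ')]
links = [a.get('href') for a in doc.cssselect('a.something')]

# xpath when you need axes, plus a small transform on the way out
price = doc.xpath('//td[text()="Price"]/following-sibling::td/text()')[0].strip()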