
I would highly recommend Scrapy if you plan on doing any serious scraping: http://scrapy.org/.


I wouldn't. I used this for a project and then quickly regretted it for the following reasons:

* XPath selectors just plain suck.

* The item pipeline is way too straitjacketed. You have to do all sorts of fucking around with settings in order to make the sequence of events work in the (programmatic) way you want it because the framework developers 'assumed' there's only one way you'd really want to do it.

* Scrapy does not play well with other projects. You can integrate it with django if you want a minimal web UI but it's a pain to do so.

* Tons of useless features. Telnet console? wtf?

* It's assumed that the 'end' of the pipeline will be some sort of serialization - to database, xml, json or something. Actually I usually just want to feed the results into another project without any kind of serialization, using plain old python objects. If I want serialization I probably want to do it myself.

* For some reason DjangoItem didn't really work (although by the time I tried to get it to work I'd kind of given up).

IMO this is a classic case of "framework that should have been a library".

Here's what I used instead after scrapping scrapy:

* mechanize - to mimic a web browser. I used requests sometimes too, but it doesn't really excel at pretending to be a web browser, so I usually used mechanize as a drop-in replacement.

* celery - to schedule the crawling / spin off multiple crawlers / rate-limiting / etc.

* pyquery - because xpath selectors suck and jquery selectors are better.

* python generators - to do pipelining.
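
Roughly how those pieces fit together - a minimal sketch, where the URL, the selector and the item fields are all just illustrative:

    import requests
    from pyquery import PyQuery

    # each stage is a generator, so stages compose like a pipeline
    # without any framework glue
    def fetch(urls):
        for url in urls:
            yield url, requests.get(url).text   # mechanize could be swapped in here

    def parse(pages):
        for url, html in pages:
            doc = PyQuery(html)
            for a in doc('.headline a').items():   # jquery-style selector
                yield {'page': url, 'text': a.text(), 'href': a.attr('href')}

    for item in parse(fetch(['http://example.com/'])):
        print(item)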

I'm largely happy with the outcome. The code is less straitjacketed, easier to understand and easier to integrate into other projects if necessary (you don't have the headache of trying to get two frameworks to play together nicely).


I agree. XPath and friends are too low level. Scrapy is nice for controlling HTTP/asset downloads, but in fact even this sucks. If you're serious about scraping you must go with a headless-browser approach, which also means Python-only doesn't work.


Hey,

Good feedback, thanks!

> XPath selectors just plain suck.

Scrapy supports CSS selectors.
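
For example - a quick sketch, where the HTML string and the selector are just illustrative:

    from scrapy.selector import Selector

    html = '<html><head><title>Example</title></head></html>'
    sel = Selector(text=html)
    # CSS selector with the ::text pseudo-element
    print(sel.css('title::text').extract())   # [u'Example']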

> The item pipeline is way too straitjacketed. You have to do all sorts of fucking around with settings in order to make the sequence of events work in the (programmatic) way you want it because the framework developers 'assumed' there's only one way you'd really want to do it.

Could you please give an example?

> Scrapy does not play well with other projects. You can integrate it with django if you want a minimal web UI but it's a pain to do so.

This is true. But it is a pain to integrate any event-loop based app with another app that is not event-loop based. It is also true that Scrapy is not easy to plug into an existing event loop (e.g. if you already have a twisted or tornado-based service), but it should be fixed soon.

> Tons of useless features. Telnet console? wtf?

Telnet console is a Twisted feature; it came almost for free, and it is useful to debug long-running spiders (which can run hours and days).

> It's assumed that the 'end' of the pipeline will be some sort of serialization - to database, xml, json or something. Actually I usually just want to feed into the end of another project without any kind of serialization using plain old python objects. If I want serialization I probably want to do it myself.

If you don't want serialization then you want a single process both for crawling and for other tasks. This rules out synchronous solutions - you can't e.g. integrate a crawler with django efficiently without serialization. If you just want to do some post-processing then I don't see why putting code in a Scrapy spider is worse than putting it in another script and calling Scrapy from that script.

> For some reason DjangoItem didn't really work (although by the time I tried to get it to work I'd kind of given up).

This may be true... I don't quite get what it is for :)

> IMO this is a classic case of "framework that should have been a library".

It can't be a library like requests or mechanize for technical reasons - to make crawling efficient Scrapy uses an event loop. It can (and should) be a library for twisted/tornado/asyncio; it is possible to use Scrapy as such a library now, but this is not straightforward; this should (and will) be simplified.
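
E.g. a rough sketch of driving a crawl from a plain script - MySpider and the settings are illustrative, and the exact imports depend on the Scrapy version:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://example.com/']

        def parse(self, response):
            yield {'title': response.xpath('//title/text()').extract()}

    process = CrawlerProcess(settings={'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MySpider)
    process.start()   # blocks until the crawl finishes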

> * mechanize - to mimic a web browser. I used requests sometimes too, but it doesn't really excel at pretending to be a web browser, so I usually used mechanize as a drop-in replacement.

> * celery - to schedule the crawling / spin off multiple crawlers / rate-limiting / etc.

> * pyquery - because xpath selectors suck and jquery selectors are better.

> * python generators - to do pipelining.

Celery is also not the easiest piece of software. Scrapy is just a single Python process that doesn't require any databases, etc.; Celery requires you to deploy a broker and have a place to store task results; it is also less efficient for IO-bound tasks.


>Scrapy supports CSS selectors.

Still far inferior to jQuery selectors.

>Could you please give an example?

The example I'm thinking of is when I was trying to create a pipeline that would output a skeleton configuration file when you passed one switch and would process and serialize the data parsed when you passed another. It was possible but kludgy.

>But it is a pain to integrate any event-loop based app with another app that is not event-loop based.

That's not where the pain lies. It's more the fact that it has its own weird configuration/setup quirks (e.g. its own settings.py, reliance on environment variables, executables).

>If you don't want serialization then you want a single process both for crawling and for other tasks. This rules out synchronous solutions - you can't e.g. integrate a crawler with django efficiently without serialization.

I don't really want scrapy doing process handling at all. It's not particularly good at it. Celery is much better.

Using other code to do serialization also doesn't necessitate running it on the same process. You can import the django ORM wherever you want and use it to save to the DB. I know you can do that - but, again, kludgy.
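
Roughly what I mean - myproject.settings and the Article model here are made up, but this is the shape of it:

    import os
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # hypothetical settings module

    import django
    django.setup()  # needed on newer Django versions

    from articles.models import Article  # hypothetical Django model

    class DjangoWriterPipeline(object):
        """Scrapy item pipeline that writes items through the Django ORM."""

        def process_item(self, item, spider):
            Article.objects.create(title=item['title'], url=item['url'])
            return item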

>It can't be a library like requests or mechanize for technical reasons - to make crawling efficient Scrapy uses an event loop.

I get that. It should have been more like twisted from the outset though. The developers were clearly inspired by django and that led them down a treacherous path.

>It can (and should) be a library for twisted/tornado/asyncio; it is possible to use Scrapy as such a library now, but this is not straightforward; this should (and will) be simplified.

Well, that's good I suppose. I still think that it focuses on bringing together a bunch of mediocre modules for which, individually, you can find much better equivalents. Also, (unlike django) tight, seamless integration between those modules doesn't really gain you much.

>Celery is also not the easiest piece of software.

The problem it is solving (distributed task processing) is not an easy problem. Celery is not simple, but it is GOOD.

>Scrapy is just a single Python process that doesn't require any databases, etc. Celery requires you to deploy a broker and have a place to store task results; it is also less efficient for IO-bound tasks.

A) You can use redis as a broker and that's trivial to set up (a minimal setup is sketched below). I always have a redis available anyway because I always need a cache of some kind (even when crawling!).

B) My crawling tasks are never I/O bound or CPU bound. They're bound by the rate limiting imposed upon me by the websites I'm trying to crawl.

C) I'm usually using celery anyway. I still have to do task processing that DOESN'T involve crawling. Where do I put that code when I'm using scrapy?
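
Concretely, the kind of setup I mean - the broker URL, task names and rate limit below are just illustrative:

    from celery import Celery
    import requests

    # redis as the broker (A); trivial if redis is already running
    app = Celery('crawler', broker='redis://localhost:6379/0')

    # per-task rate limiting (B): at most 10 crawl tasks per minute
    @app.task(rate_limit='10/m')
    def crawl_page(url):
        return requests.get(url).text

    # non-crawling work lives in the same app (C)
    @app.task
    def send_report(address):
        pass  # whatever other task processing is needed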


> The example I'm thinking of is when I was trying to create a pipeline that would output a skeleton configuration file when you passed one switch and would process and serialize the data parsed when you passed another. It was possible but kludgy.

I don't get it - how is creating a configuration file related to processing the items? Why would you do it in the item pipeline?

> It's more the fact that it has its own weird configuration/setup quirks (e.g. its own settings.py, reliance on environment variables, executables).

It is possible to create a Crawler from any settings object (not just a module), and Scrapy does not rely on executables AFAIK. But all of this is poorly documented. Also, there is an ongoing GSoC project to make settings easier and more "official".

> I don't really want scrapy doing process handling at all.

Scrapy doesn't handle processes; it is single-threaded and uses a single process. This means that you can use, e.g., shared in-process state.
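
E.g. a tiny sketch of collecting items as plain Python objects in memory instead of serializing them (the pipeline name is made up):

    class CollectorPipeline(object):
        """Keep scraped items as plain Python objects in memory."""

        def open_spider(self, spider):
            spider.collected_items = []

        def process_item(self, item, spider):
            spider.collected_items.append(item)
            return item

    # after the crawl finishes, spider.collected_items is a plain list
    # that any other code in the same process can use directly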

> Using other code to do serialization also doesn't necessitate running it on the same process. You can import the django ORM wherever you want and use it to save to the DB. I know you can do that - but, again, kludgy.

You can't move Python objects between processes without serialization. Why is using the django ORM kludgy in Scrapy but not in Celery?

> The problem it is solving (distributed task processing) is not an easy problem. Celery is not simple, but it is GOOD.

You don't necessarily need distributed task processing to do web crawling. Celery is a great piece of software, and it is developing nicely, but you always pay for complexity. For example, I faced the following problems when I was using Celery:

* When redis was used as a broker its memory usage was growing without bound. Lots of debugging; I found the reason and a hacky way to work around it (https://github.com/celery/celery/issues/436). The issue was fixed, but apparently there is still a similar issue when MongoDB is used as a broker.

* Celery stopped processing without anything useful in the logs (and of course Celery's error-sending facilities failed and I didn't have external monitoring) - it turned out a unicode exception was being swallowed. A couple of days of nightmarish debugging; see https://github.com/celery/celery/issues/92.

* I implemented an email sender using Celery + RabbitMQ once. I think I was sending the email text to tasks as parameters. Never do that (just use an MTA :)! When a large batch of emails was sent at once, RabbitMQ used all available memory, corrupted its database and dropped the queue; I couldn't find a way to check which emails had been sent and which had not. This was 100% my fault, but it shows that a complex setup is not your friend.

Crawling tasks differ - e.g. if you need to crawl many different websites (which is not uncommon) you will almost certainly be IO and CPU limited. Scrapy is not a system for distributed task processing, it is just an event-loop based crawler. I'm not saying your way to solve the problem is wrong; if you already use celery it makes a lot of sense to use it for crawling as well. But I don't agree that going distributed turtles all the way down with celery+redis+DB for storage+... is easier or more efficient than using plain Scrapy. A lot of tasks can be solved by writing a spider, getting a json file with data and doing whatever one wants with it (upload to DB, etc).
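
E.g. something like the following, where the spider name is made up - export with the built-in feed exporter and post-process the JSON however you like:

    # run the crawl and export items with the feed exporter:
    #   scrapy crawl myspider -o items.json
    import json

    with open('items.json') as f:
        items = json.load(f)

    for item in items:
        pass  # upload to a DB, feed into another project, etc.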


This.

The Scrapy tutorial is good if you just want to use scrapy to crawl a site and extract a bunch of information, one time.

If you want to do scraping as a small part of another Python project, then it can be easier just to use Scrapy's HtmlXPathSelector, which is more forgiving than a real XML parser.

    import urllib2
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import TextResponse

    url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
    my_xpath = '//title/text()'

    # fetch the page with plain urllib2
    req = urllib2.Request(url, headers={'User-Agent': "Mozilla or whatever"})
    url_response = urllib2.urlopen(req)
    body = url_response.read()

    # wrap the body in a TextResponse so the selector can work on it
    response = TextResponse(url=url, body=body, encoding='utf-8')
    hxs = HtmlXPathSelector(response)
    result = hxs.select(my_xpath).extract()


HtmlXPathSelector is just a very small wrapper around lxml; it doesn't add anything parsing-wise. You might as well just use lxml directly if you don't already have scrapy as a dependency.

https://github.com/scrapy/scrapy/tree/master/scrapy/selector
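
E.g., roughly (the URL and XPath are just examples):

    import urllib2
    import lxml.html

    body = urllib2.urlopen('http://www.dmoz.org/').read()
    doc = lxml.html.fromstring(body)
    # same sort of query as HtmlXPathSelector, without the scrapy dependency
    titles = doc.xpath('//title/text()')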


pyquery's a good alternative too. it's a slightly larger wrapper around lxml that lets you use jquery selectors.
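
E.g. (the URL and selector are just examples):

    from pyquery import PyQuery as pq

    d = pq(url='http://www.dmoz.org/')   # fetches and parses the page
    links = [a.attr('href') for a in d('a.listclass').items()]   # jquery-style selector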


lxml's `cssselect` method is nice for this - I found that with `xpath` and `cssselect` I have no need for anything else. I use cssselect for simple queries, like "a.something" - which would be needlessly verbose in XPath - and xpath for more complex ones, for example when I need access to axes or I want to apply some simple transform to the data before processing it in Python. Worked very well for me.
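
E.g. - assuming the page has already been fetched into `html`; the selectors are made up:

    import lxml.html

    doc = lxml.html.fromstring(html)   # html fetched elsewhere

    # simple query: cssselect is terse
    links = doc.cssselect('a.something')   # may need the cssselect package with newer lxml

    # more complex query: xpath gives access to axes, text() etc.
    dates = doc.xpath('//td[@class="date"]/following-sibling::td/text()')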


Doh! Too late to edit or delete my original comment :(

(And I can't downvote my own comment.)



