Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The problem with waiting 1-2 seconds between requests is that if you’re trying to scrape on the scale of millions of pages, the difference between 30 parallel requests / sec and a single request every 1-2 seconds is the difference between a process that takes 9 hours and a month.

So I think there’s a balance to be struck - I’d argue you should absolutely be thoughtful about your target - if they notice you or you break them, that could be problematic for both of you. But if you’re TOO conservative, the job will never get done in a reasonable timeframe.



The problem with waiting 1-2 seconds between requests is that if you’re trying to scrape on the scale of millions of pages, the difference between 30 parallel requests / sec and a single request every 1-2 seconds is the difference between a process that takes 9 hours and a month

Fortunately for me, I'm almost assuredly never going to have to do this on the scale of millions of pages. If time proves me wrong, I suspect I'll be hiring someone with more expertise to take over that part of the project.

I'm definitely biasing towards a very conservative interval. Optimizing the runtime is more to help with tightening the iteration cycles for me, the sole developer, instead of limiting the job size to a reasonable timeframe.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: