The problem with waiting 1-2 seconds between requests is that if you’re trying to scrape on the scale of millions of pages, the difference between 30 parallel requests / sec and a single request every 1-2 seconds is the difference between a process that takes 9 hours and one that takes a month.
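For anyone who wants to check that estimate, here is a rough back-of-the-envelope sketch. It assumes "millions of pages" means one million, that requests go out at a perfectly steady rate, and it ignores retries and response latency; the function name is just made up for illustration.

```python
def scrape_duration_seconds(pages: int, requests_per_second: float) -> float:
    """Wall-clock time to fetch `pages` at a steady request rate."""
    return pages / requests_per_second


PAGES = 1_000_000  # assumption: one million pages

fast = scrape_duration_seconds(PAGES, 30)          # 30 parallel requests / sec
slow_best = scrape_duration_seconds(PAGES, 1.0)    # one request every 1 second
slow_worst = scrape_duration_seconds(PAGES, 0.5)   # one request every 2 seconds

print(f"30 req/s:    {fast / 3600:.1f} hours")        # 9.3 hours
print(f"1 req / 1 s: {slow_best / 86400:.1f} days")   # 11.6 days
print(f"1 req / 2 s: {slow_worst / 86400:.1f} days")  # 23.1 days
```

So per million pages it works out to roughly 9 hours versus 12 to 23 days, which is where the "9 hours and a month" contrast comes from.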
So I think there’s a balance to be struck. I’d argue you should absolutely be thoughtful about your target: if they notice you or you break them, that could be problematic for both of you. But if you’re TOO conservative, the job will never get done in a reasonable timeframe.
Fortunately for me, I'm almost assuredly never going to have to do this on the scale of millions of pages. If time proves me wrong, I suspect I'll be hiring someone with more expertise to take over that part of the project.
I'm definitely biasing towards a very conservative interval. For me, optimizing the runtime is more about tightening iteration cycles as the sole developer than about fitting the job into a reasonable timeframe.
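For what it's worth, a conservative interval can be kept in one place so it is easy to loosen later. This is only a minimal sketch, not what I actually run: `PoliteThrottle` and its parameters are hypothetical names, the 1-2 second bounds are just the numbers from this thread, and the clock and sleep functions are injectable purely so the behavior can be checked without really waiting.

```python
import random
import time


class PoliteThrottle:
    """Enforce a minimum, slightly randomized gap between requests."""

    def __init__(self, min_delay=1.0, max_delay=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep until the chosen gap has elapsed; return seconds slept."""
        delay = random.uniform(self.min_delay, self.max_delay)
        slept = 0.0
        if self._last is not None:
            remaining = delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Intended use (fetch is a placeholder for whatever does the request):
#   throttle = PoliteThrottle()
#   for url in urls:
#       throttle.wait()
#       fetch(url)
```

During development the bounds can be lowered against a local copy of the pages to shorten iteration cycles, then raised again before pointing it at the real site.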