Hacker News

Yes, but they aren't going to care for just 2400 pages.

As a general rule, make your scraper non-parallel and set a User-Agent that includes contact details in case of an issue, and you're probably all good.

After all, Wikipedia is meant to be used. Don't be unduly disruptive: don't scrape 20 million pages. But scraping a couple thousand is totally acceptable.

Source: I used to work for Wikimedia, albeit not in the SRE department. My opinions are of course totally my own.
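The advice above (sequential requests, a User-Agent with contact details) can be sketched in a few lines of Python standard library. The bot name, URL, and email address here are placeholders, and the one-second delay is an illustrative choice, not an official Wikimedia number:

```python
import time
import urllib.parse
import urllib.request

# A descriptive User-Agent with contact details, as advised above.
# The project name, URL, and email address are placeholders.
USER_AGENT = "MyResearchBot/1.0 (https://example.org/mybot; mybot@example.org)"


def build_request(title: str) -> urllib.request.Request:
    """Build a request for one Wikipedia page, identifying the bot politely."""
    url = "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title)
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def scrape(titles, delay_seconds=1.0):
    """Fetch pages one at a time (non-parallel), pausing between requests."""
    pages = {}
    for title in titles:
        req = build_request(title)
        with urllib.request.urlopen(req) as resp:
            pages[title] = resp.read()
        time.sleep(delay_seconds)  # be gentle: one request at a time
    return pages
```

The key point is that there is no concurrency anywhere: one request finishes, the loop sleeps, then the next begins.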



I don’t think the OP was talking specifically to the content author, but to all the people who read the article and get the idea to scrape Wikipedia.


Honestly, I'd rather people err on the side of scraping Wikipedia too much than live in fear of being disruptive and not do cool things as a result. Wikipedia is meant to be used to spread knowledge. That includes data-mining projects such as the one in this blog.

(Before anyone takes this out of context: no, I'm not saying it's OK to be intentionally disruptive, or to do things without exercising any care at all. Also, always set a unique, descriptive User-Agent with an email address if you're doing anything automated on Wikipedia.)


Having been on the other side of this, I’d rather we encourage people to use formats and interfaces designed for machines, the right tool for the job, instead of scraping everything.

It’s incredibly easy for careless scrapers to disrupt a site and cost real money without having a clue what they’re doing.

I want people to think twice and consider what they are doing before they scrape a site.
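For Wikipedia specifically, one machine-oriented interface is the MediaWiki Action API, which returns page content as JSON rather than rendered HTML. A minimal sketch of building such a query with the standard library (the batching limit noted in the comment is the API's documented per-request cap for regular users):

```python
import urllib.parse

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_api_query(titles):
    """Build a MediaWiki Action API URL that returns page wikitext as JSON,
    instead of scraping the rendered HTML."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",         # fetch revision data...
        "rvprop": "content",         # ...specifically the page content
        "rvslots": "main",           # from the main content slot
        "titles": "|".join(titles),  # the API accepts up to 50 titles per call
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)
```

One API call like this replaces dozens of HTML fetches, which is cheaper for the site and less fragile for the scraper. (For truly large jobs, Wikimedia also publishes full database dumps, which avoid live requests entirely.)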



