
> They already don't want you to just scrape stuff off Wikipedia because mediawiki apparently doesn't do proper caching and does a stupidly expensive re-rendering of every page that is fetched.

It does cache pages very aggressively. It even has a working solution to one of the hard problems in computer science: cache invalidation.

Pages are cached indefinitely until something pokes them.

Proof? Append ?action=purge to a wiki page URL.
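For illustration, here is a minimal sketch of the same purge done through the API rather than the URL parameter, assuming the standard api.php endpoint; the page title and the User-Agent contact string are just placeholders:

    # A minimal sketch, assuming the standard MediaWiki api.php endpoint; the
    # page title and User-Agent contact are placeholders.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    resp = requests.post(
        API,
        data={
            "action": "purge",               # same effect as visiting ?action=purge
            "titles": "Cache_invalidation",  # placeholder page title
            "format": "json",
        },
        headers={"User-Agent": "purge-demo/0.1 (replace with your contact info)"},
    )
    print(resp.json())

The purge drops the cached copy, and the next visit triggers a fresh render.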



Maybe I'm misinterpreting them, then; they seem quite concerned about automated page visits.

Like for example this: https://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks#Re...

and this: https://meta.wikimedia.org/wiki/Live_mirrors

Something resulting in 50k fetches of cached pages in a day is nothing. A Raspberry Pi could serve 50k static documents in an hour.
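Back-of-the-envelope (a rough sketch, not a benchmark):

    # 50k fetches spread over a day averages well under one request per second.
    requests_per_day = 50_000
    seconds_per_day = 86_400
    print(requests_per_day / seconds_per_day)  # ~0.58 requests/second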


Hi, I have been involved with the Wikimedia infrastructure since roughly 2004.

The 50k daily hits on https://meta.wikimedia.org/wiki/Live_mirrors is merely an illustration that a web proxy can dramatically reduce the number of hits to the origin site. They could have picked 42 or 7 billion; it is just an example.

The text dates from 2009, and in April 2009 this is roughly what we had in terms of servers: https://commons.wikimedia.org/wiki/File:Wikimedia-servers-20... Arguably a little more than a Raspberry Pi.

We already had a few million pages rather than 50k of them, and they were not static, since pages get edited and go live as soon as someone saves an edit. We already had millions of daily users, which definitely requires a ton of caching all across the stack. Also, good luck serving flat files that keep being written to; that does not really work well with hard drives and static file storage.

In September 2021 we served 21 billion pages and 73 billion media files (https://stats.wikimedia.org/#/all-projects), almost all of them served primarily from caches.

That being said, the wiki pages you have mentioned are a decade old. We definitely had trouble with people having the smart idea of scraping the whole website, often either requesting uncacheable content or hitting rarely visited articles and page histories that were not cached and thus triggered a full rendering of the page. Even with the servers we had, a single user could cause major havoc on the infrastructure and keep a good part of it busy just to serve that single person. Hence the recommendation to use the database dumps, which are still available: https://dumps.wikimedia.org/
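For anyone who does want the whole corpus, here is a minimal sketch of fetching a dump instead of scraping, assuming the usual "latest" layout on dumps.wikimedia.org; check the dump listing for the exact filename before relying on it:

    # A minimal sketch, assuming the usual enwiki "latest" layout on
    # dumps.wikimedia.org; verify the exact filename on the listing first.
    import requests

    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                "enwiki-latest-pages-articles.xml.bz2")

    def download_dump(url=DUMP_URL, dest="enwiki-pages-articles.xml.bz2"):
        """Stream the dump to disk so the multi-GB file never sits in memory."""
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)

    if __name__ == "__main__":
        download_dump()

One download gives you every article, rather than millions of individual page fetches hitting the live site.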

cheers



