> They already don't want you to just scrape stuff off Wikipedia because mediawiki apparently doesn't do proper caching and does a stupidly expensive re-rendering of every page that is fetched.
It does cache pages very aggressively. It even has a solution for one of the hard problems in computer science: cache invalidation.
Pages are cached indefinitely until something pokes them.
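To make that concrete, here is a toy sketch of the render-once, cache-until-poked model (this is not MediaWiki's actual code; the rendering step is faked):

    # Toy model: rendered pages are cached until an edit pokes the cache.
    rendered_cache = {}   # title -> rendered HTML
    wikitext_store = {}   # title -> raw wikitext

    def render(wikitext):
        # Stand-in for the expensive wikitext -> HTML conversion.
        return "<p>%s</p>" % wikitext

    def view(title):
        """Serve the cached rendering; only re-render on a cache miss."""
        if title not in rendered_cache:
            rendered_cache[title] = render(wikitext_store.get(title, ""))
        return rendered_cache[title]

    def save_edit(title, new_wikitext):
        """Saving an edit stores the new text and invalidates the cached rendering."""
        wikitext_store[title] = new_wikitext
        rendered_cache.pop(title, None)   # the next view triggers a re-render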
Hi, I have been involved with the Wikimedia infrastructure since roughly 2004.
The 50k daily hits figure on https://meta.wikimedia.org/wiki/Live_mirrors is merely there to illustrate that a web proxy can dramatically reduce the number of hits to the original site. They could have picked 42 or 7 billion; it is just an example.
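To illustrate the arithmetic, a toy cache-in-front-of-the-origin sketch (the origin fetch is faked here, no real mirror involved):

    origin_hits = 0

    def fetch_from_origin(path):
        """Stand-in for an expensive request to the real site."""
        global origin_hits
        origin_hits += 1
        return "<html>rendered page for %s</html>" % path

    cache = {}   # path -> cached response body

    def fetch(path):
        """Serve repeat requests from the cache; only a miss reaches the origin."""
        if path not in cache:
            cache[path] = fetch_from_origin(path)
        return cache[path]

    # 50,000 requests for the same page cost the origin exactly one hit.
    for _ in range(50_000):
        fetch("/wiki/Example")
    print(origin_hits)   # -> 1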
We already had a few million pages rather than 50k of them, and they were not static since pages get edited and go live as soon as someone saves their edit. Obviously, we already had millions of daily users, which definitely requires a ton of caching all across the stack. Also, good luck serving flat files that keep being written to; that does not really work with hard drives and static file storage.
In September 2021 we served 21 billion pages and 73 billion media files (https://stats.wikimedia.org/#/all-projects). Almost all of them were served from caches.
That being said, the wiki pages you have mentioned are a decade old. We definitely had trouble with people having the smart idea of scraping the whole website, often either requesting uncacheable content or hitting rarely viewed articles or page history that were not cached and thus triggered a full rendering of the page. Even with the servers we had, a single user could cause major havoc on the infrastructure and keep a good part of it busy just to serve that single person. Hence the recommendation to use the database dumps, which are still available: https://dumps.wikimedia.org/
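For anyone who wants the content in bulk, iterating over a dump looks roughly like this (the filename is just an example; check https://dumps.wikimedia.org/ for the current files, and expect multi-gigabyte downloads):

    # Stream page titles out of a pages-articles dump without unpacking it.
    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"   # downloaded beforehand

    with bz2.open(DUMP, "rb") as f:
        for _event, elem in ET.iterparse(f):
            if elem.tag.endswith("}title"):    # the export XML is namespaced
                print(elem.text)
            elif elem.tag.endswith("}page"):
                elem.clear()                   # drop the page subtree to keep memory low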
Want proof that pages are cached until something pokes them? Append ?action=purge to a wiki page's URL to force the cached rendering to be thrown away and rebuilt.
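If you prefer the API route, the same purge can be requested programmatically (a rough sketch; action=purge must be POSTed, anonymous purges may be rate-limited, and the page title and User-Agent string below are placeholders):

    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"
    data = urllib.parse.urlencode({
        "action": "purge",
        "titles": "Sandbox",   # any page title; Sandbox is just an example
        "format": "json",
    }).encode()

    req = urllib.request.Request(
        API,
        data=data,
        method="POST",
        headers={"User-Agent": "cache-purge-demo/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())   # reports which titles were purged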