
Still, I guess you could get the dumps and do a local Wikimedia setup based on them, and then crawl that instead?


You could, and if he were doing this on the entire corpus, that would be the responsible thing to do.

But his project really was very reasonable:

- it fetched ~2,400 pages

- he cached them after first fetch

- Wikipedia aggressively caches anonymous page views (e.g. the Queen Elizabeth page has a cache age of 82,000 seconds; see the sketch below for how to check)
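
In code, the whole pattern is just a few lines. A rough sketch in Python (not his actual script; the cache directory and the User-Agent contact are placeholders):

    import hashlib
    import os

    import requests

    CACHE_DIR = "wiki_cache"  # hypothetical local cache directory
    os.makedirs(CACHE_DIR, exist_ok=True)

    def fetch_cached(url: str) -> str:
        """Fetch a page once, then serve it from disk on repeat runs."""
        path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest() + ".html")
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return f.read()
        resp = requests.get(url, headers={"User-Agent": "infobox-research (contact@example.org)"})
        resp.raise_for_status()
        # The Age response header shows how long the CDN has been serving this copy.
        print(url, "Age:", resp.headers.get("Age"))
        with open(path, "w", encoding="utf-8") as f:
            f.write(resp.text)
        return resp.text

    html = fetch_cached("https://en.wikipedia.org/wiki/Elizabeth_II")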

English Wikipedia does about 250,000,000 pageviews/day. This guy's use was 0.001% of traffic on that day.

I get the slippery slope arguments, but to me they just don't apply. As someone who has donated $1,000 to Wikipedia in the past, I'm totally happy to have those funds spent supporting use cases like this, rather than demanding that people who want to benefit from Wikipedia be able to set up a MySQL server, spend hours doing the import, install and configure a PHP server, and so on.


> This guy's use was 0.001% of traffic on that day

For one person consuming from one of the most popular sites on the web, that really reads as a lot.


He was probably one of the biggest users that day, so that makes sense.

The 2,400 pages, assuming a 50 KB average gzipped size, equate to 120 MB of transfer. I'm assuming CPU usage is negligible due to CDN caching, and so bandwidth is the main cost. 120 MB is orders of magnitude less transfer than the 18.5 GB dump.
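
Back-of-envelope (the 50 KB average is my assumption; 18.5 GB is roughly the compressed enwiki dump):

    pages = 2_400
    avg_gzipped_kb = 50       # assumed average compressed page size
    dump_gb = 18.5            # approximate size of the compressed dump

    crawl_mb = pages * avg_gzipped_kb / 1_000
    print(f"crawl transfer ~ {crawl_mb:.0f} MB")                   # ~120 MB
    print(f"dump is ~ {dump_gb * 1_000 / crawl_mb:.0f}x larger")   # ~154x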

Instead of the dumps, he could have used the API -- but would that have significantly changed the costs to the Wikimedia Foundation? I think probably not. In my experience, the happy path (serving anonymous HTML) is going to be aggressively optimized for cost: caching, CDNs, negotiated bandwidth discounts.
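
For reference, the API route would have looked something like this (a sketch against the standard MediaWiki action API; the title and User-Agent contact are placeholders):

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_via_api(title: str) -> str:
        """Fetch the rendered HTML of one article through the MediaWiki API."""
        resp = requests.get(API, params={
            "action": "parse",
            "page": title,
            "prop": "text",
            "format": "json",
            "formatversion": "2",
        }, headers={"User-Agent": "infobox-research (contact@example.org)"})
        resp.raise_for_status()
        return resp.json()["parse"]["text"]

    html = fetch_via_api("Elizabeth II")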

If we accept that these kinds of projects are permissible (which no one seems to be debating, just the manner in which he did the project!), I think the way this guy went about doing it was not actually as bad as people are making it out to be.


I don't think I agree. Caching has a cost too.

In theory, you'd want to cache more popular pages and let the rarely visited ones go through the uncached flow.

Crawling isn't user behavior, so the odds are that a large percentage of the crawled pages were not cached.


That's true. On the other hand, pages with infoboxes are likely well-linked and will end up in the cache either due to legitimate popularity or due to crawler visits.

Checking a random sample of 50 pages from this guy's dataset, I found that 70% of them were cached.
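
Rough version of the check, for anyone who wants to repeat it (this assumes the x-cache-status and Age headers Wikimedia's CDN currently exposes; the title list is a placeholder for the real dataset):

    import requests

    # Placeholder titles; in practice, sample 50 from the actual dataset.
    titles = ["Elizabeth_II", "Python_(programming_language)", "Varnish_(software)"]

    hits = 0
    for title in titles:
        resp = requests.get(f"https://en.wikipedia.org/wiki/{title}",
                            headers={"User-Agent": "cache-check (contact@example.org)"})
        # The CDN reports cache status in x-cache-status; a non-zero Age is
        # another sign the response came from cache rather than the app servers.
        status = resp.headers.get("x-cache-status", "")
        age = int(resp.headers.get("Age", 0))
        if "hit" in status or age > 0:
            hits += 1

    print(f"{hits}/{len(titles)} sampled pages look cached")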


Note: there are several levels of caching at Wikipedia. Even if those pages aren't in the CDN (Varnish) cache, they may be in the parser cache (an application-level cache of most of the page).

This amount of activity really isn't something to worry about, especially when it takes the fast path of a logged-out user viewing a likely-to-be-cached page.



