> I'm so glad the download-entire-wikipedia function continues to exist. That wi...

gerdesj · on March 6, 2021

"Personally, I think Wikipedia's quality is too poor for that."

You should see some of the crap in the books at Alexandria: The world is flat and there are four elements and other bollocks. Obviously I'm taking the piss. The content is sometimes just as important as the factual accuracy of the content. For a given value of factual accuracy.

WP is written by people and holds a vast amount of stuff. It is flawed in my opinion in many ways but that is the human experience.

I live in a town called Yeovil in Somerset, UK. https://en.wikipedia.org/wiki/Yeovil . Several years ago I noticed an incorrect old name for the place and I tried to correct it. I appealed to the Domesday Book which is considered quite authoritative hereabouts but I linked to the only site I could find which sold copies of it. My edit was thrown out by the local Somerset editor rather than being fixed. I own a coffee cup bought from the local museum that lists the >60 spellings of this tiny town over the last 1500-odd years. The editor wouldn't accept that either: "original research". WTF! I didn't put the names on the mug - archaeologists, historians and a bunch of tax gatherers hired by a horde of Normans back in the day did that. OK, no they didn't - they scrawled stuff and the museum gathered together the scrawls and made my mug. I did one for a tourist shop on the Plymouth Barbican wrt the Mayflower complement, about 30 years ago. To be fair, I simply copied the names off the board near the Mayflower steps!

My point is that WP is what it is and you need to see it for that. It is both a store of knowledge and also a store of knowledge and blatant lies and everything in between ... about knowledge. It contains its own metadata and also omits vast amounts of it.

WP is without question in my mind absolutely magnificent but you do need to learn how and when to interpret it to fit in with your idea of factual - whatever that is.

tablespoon · on March 6, 2021

> You should see some of the crap in the books at Alexandria: The world is flat and there are four elements and other bollocks. Obviously I'm taking the piss. The content is sometimes just as important as the factual accuracy of the content. For a given value of factual accuracy.

But that reflected the actual state of knowledge at the time, which is what you'd really want to study.

Let me put it this way: Wikipedia wouldn't even allow Wikipedia to be used as a source for one of its articles, because it's too unreliable: https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources#Use...

ris · on March 6, 2021

> Let me put it this way: Wikipedia wouldn't even allow Wikipedia to be used as a source for one of its articles, because it's too unreliable

Er.. I think you're missing the point of citing sources. If an otherwise unverified claim could cite as its source another unverified claim, it would make citing sources meaningless. If a Wikipedia article wanted to cite a verified claim from another Wikipedia article that did have a verified source, they may as well just use that original source as their source too.

I think your example actually shows Wikipedia in a very good light.

tablespoon · on March 7, 2021

> Er.. I think you're missing the point of citing sources. If an otherwise unverified claim could cite as its source another unverified claim, it would make citing sources meaningless. If a Wikipedia article wanted to cite a verified claim from another Wikipedia article that did have a verified source, they may as well just use that original source as their source too.

1) The Wikipedia guideline doesn't mention that logic, 2) quite a lot of Wikipedia doesn't cite anything, and 3) it's not unusual for a passage in Wikipedia to not actually be supported by the citation given (e.g. https://en.wikipedia.org/w/index.php?title=Special:WhatLinks...).

In any case, this is getting off topic. My point was that, if you're looking for an ark of cultural knowledge to survive some apocalypse, Wikipedia's a bad choice. IMHO, something like Project Gutenberg plus a newspaper archive would be about 1000% better, and take up far less space. If you've got space to spare, throw in Libgen. Wikipedia's not a replacement for its sources, and I pity the future scholar that would have to rely on it without being able to check them.

But unless your data storage is orders of magnitude more reliable than any current technology and you package your archive with an equally reliable computer to read it, the concept of a digital ark fails. If you don't do that, your archive will be unreadable, and an unreadable archive is useless.

smoldesu · on March 6, 2021

The content on Wikipedia is really not that bad. Obviously a Wikipedia article will never be the final say on any specific subject, but it tends to do a pretty good job of aggregating sources and condensing them into a reader-friendly synopsis. This data is super valuable, if not just for the sources alone.

buzzerbetrayed · on March 6, 2021

> so when our civilization is at risk of "[losing] the entire library" it probably would have already lost the ability to maintain the computer systems to access Wikipedia dumps

But as long as it continues to exist, some future civilization could figure out how to read the data again, eventually. Just like we eventually discovered how to read ancient languages that were once forgotten.

tablespoon · on March 6, 2021

> But as long as it continues to exist, some future civilization could figure out how to read the data again, eventually. Just like we eventually discovered how to read ancient languages that were once forgotten.

Eh, I think you're vastly underestimating how difficult that would be.

1. The media would have to last hundreds of years at least, when it's hoped modern archival media can last maybe fifty.

2. Even assuming the media did last, the new civilization would have to reverse engineer encoding on top of encoding on top of encoding (e.g. physical disk encoding, complex filesystems, file formats, character encodings). Our civilization already has trouble reading some old file formats, and our disks already have trouble reading their data (which is why they pack a ton of error correction information).

It took the Rosetta Stone to figure out how to read the encoding of Egyptian hieroglyphics, even though that language was still alive in the form of Coptic.

3. Then you're dealing with the probability that the hard disks the future archeologists find will even have a Wikipedia dump on them. That probability will be very small, given very few people will download these dumps.

bawolff · on March 7, 2021

If people still understand English in some form (a good bet. We still understand Latin. English has more reach than Latin did at its peak) understanding charsets is pretty easy. Just assume it's a shift cipher.
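To make the "shift cipher" idea concrete, here is a minimal Python sketch. It assumes a lot: a single-byte character set that is just ASCII moved by one fixed offset, uncompressed, and a reader who already knows which letters are common in English. The offset of 77, the `COMMON` set, and the function names are all made up for illustration; real encodings (UTF-8, compressed dumps) are not this friendly.

```python
# Byte values for a space and the most common lowercase English letters (ASCII).
COMMON = set(b" etaoinshrdlu")


def unshift(data: bytes, shift: int) -> bytes:
    """Undo a fixed per-byte offset, wrapping around modulo 256."""
    return bytes((b - shift) % 256 for b in data)


def guess_shift(data: bytes) -> int:
    """Try all 256 offsets and keep the one that yields the most common-English bytes."""
    return max(range(256), key=lambda s: sum(b in COMMON for b in unshift(data, s)))


# Hypothetical "unknown charset": plain ASCII with every byte moved up by 77.
message = b"we still understand latin and english has more reach than latin did at its peak"
scrambled = bytes((b + 77) % 256 for b in message)

shift = guess_shift(scrambled)
recovered = unshift(scrambled, shift)
print(shift, recovered.decode("ascii"))
```

This only shows that the easy case really is easy. It says nothing about what happens once the bytes sit under a filesystem, a container format, or a compressor.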

As far as media goes, that's true, but it's a bit of a numbers game. After all you only need one unusually preserved specimen. The Dead Sea Scrolls survived after all. Not to mention intentional preservation efforts. I know GitHub has its Arctic vault thing. There's even a copy of Wikipedia on the moon! https://meta.m.wikimedia.org/wiki/Wikipedia_to_the_Moon/Wrap...

tablespoon · on March 7, 2021

> If people still understand English in some form (a good bet. We still understand Latin. English has more reach than Latin did at its peak) understanding charsets is pretty easy. Just assume it's a shift cipher.

IIRC, Coptic is directly descended from Ancient Egyptian, but the Rosetta Stone was still needed to decipher hieroglyphics.

It won't be as simple as you think. The problem will be more like: here's 10TB of partially corrupted binary data, find the text when you don't know the encoding (oh, and the text may be compressed with an algorithm you also don't know).
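A rough Python sketch of that problem, under stated assumptions: the "disk" is a short byte string, the junk around the text is a fixed non-printable filler rather than real corruption, and the text is ASCII. The helper `printable_runs` is a hypothetical stand-in for the Unix `strings` approach of scanning for runs of printable bytes. It finds the text when it is stored plainly, and finds nothing recognisable once the same text has been through `zlib`.

```python
import zlib


def printable_runs(blob: bytes, min_len: int = 8) -> list[bytes]:
    """Return every run of at least `min_len` consecutive printable ASCII bytes."""
    runs, current = [], bytearray()
    for b in blob:
        if 32 <= b < 127:
            current.append(b)
        else:
            if len(current) >= min_len:
                runs.append(bytes(current))
            current = bytearray()
    if len(current) >= min_len:
        runs.append(bytes(current))
    return runs


text = b"Yeovil is a town in Somerset, England. " * 3
junk = b"\x00\xff\x07" * 50  # stand-in for unreadable binary data around the text

plain_blob = junk + text + junk
compressed_blob = junk + zlib.compress(text) + junk

print(printable_runs(plain_blob))       # the text is found as one run
print(printable_runs(compressed_blob))  # no run contains the original words
```

The scan works only because the example grants itself an ASCII-compatible encoding and no compression, which are exactly the things a future reader would not know in advance.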