Question for people into cryptography + data archiving....

If I want to store data for 500 years, I want future people to be reasonably sure of the integrity of the data, against both 'bit rot' and deliberate tampering.

Is the best available approach to hash the data with a bunch of hash algorithms and publish all the hashes?

Then if any hash algorithm remains unbroken, the integrity of my data is certainly still good. To break the scheme, an attacker would have to mount a simultaneous preimage attack against every hash algorithm I choose, which to my knowledge has never happened.
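A minimal sketch of that scheme, assuming Python's standard hashlib. The four algorithms are an arbitrary illustrative selection, and multi_digest / verify are made-up helper names. Note that the verifier insists on every published digest matching, so a forger has to defeat all of them at once.

```python
import hashlib

# Illustrative selection only; any set of unrelated hash designs would do.
ALGORITHMS = ("sha256", "sha512", "sha3_256", "blake2b")

def multi_digest(data: bytes) -> dict:
    """Return one hex digest per algorithm, keyed by algorithm name."""
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}

def verify(data: bytes, published: dict) -> bool:
    """Accept the data only if every published digest matches."""
    if not published:
        return False  # nothing to check against is not a pass
    return all(
        hashlib.new(name, data).hexdigest() == digest
        for name, digest in published.items()
    )

published = multi_digest(b"archive payload")
print(verify(b"archive payload", published))
```

This says nothing about how the published digests themselves survive 500 years; it only shows the checking step.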




It's impossible.

You can send a message to someone else over an unreliable channel with a very good assurance that it hasn't been corrupted unintentionally, without needing a reliable channel.

You can send a message to yourself in the future over an untrusted channel with a very good assurance that it hasn't been tampered with without needing a trusted channel, because you trust yourself and can thus remember trustable information.

There's no way to send a message to someone else over an untrusted channel with a very good assurance that it hasn't been tampered with, without a trusted channel. You need some way to convey trustable information.


You can somewhat improve your odds by using many unreliable channels, requiring an attacker to compromise all of them.

Not sure how to do that over 500 years though? Pass down the data as an heirloom, bury a SHA3 hash in a plaque below a building foundation, put a BLAKE3 hash on a granite tablet buried in the desert, write a SHA512 hash on the back of a Banksy painting. Put a BLAKE2b hash in your diary and get it accepted in a museum collection. The more you add, the more difficult it becomes for an attacker to find and alter all of them.


If you're hiding the hash where no one can find it, how are people going to check it? If you tell people where you're hiding it, what's the point of hiding it?


Hiding it can make sense as long as you are sure it's eventually found by chance, and the finder will recognize its significance. At some point the building foundation will be torn up, and precisely at that point the hash will be revealed. Any modification before that requires knowledge of it and a somewhat sophisticated plan to modify or destroy it.

But really, you don't have to make those locations secret. You already get a lot of security by requiring an attacker to drive out into a desert to change that tablet, access a building foundation and tamper with a Banksy painting. None of these are secure in the cryptographic sense, but even without secrecy that raises the bar for an attack substantially, and makes it more likely to be detected.


>Hiding it can make sense as long as you are sure it's eventually found by chance, and the finder will recognize its significance. At some point the building foundation will be torn up, and precisely at that point the hash will be revealed.

The hash could be revealed after the tampered message was found, and therefore after it would have been necessary. Imagine some critical decision was taken based on incorrect information, for example.

>you already get a lot of security by requiring an attacker to drive out into a desert to change that tablet

If you think remoteness is sufficient protection then there's no need for hashes or anything like that. Just do what 9gag did and etch your message onto stone and bury it wherever. They found the approximate location of that slab fairly quickly, but who's going to go digging in Spain just to smash a limestone slab with some memes on it? If someone is willing to do that once, doing it one or more times again is not a lot more effort.


The key factor here too is: will anyone care? If it’s some random alphanumeric symbols of no significance, because no one cares or recognizes the connection to anything they care about, you’re going to have more impact on humanity by literally etching a bunch of random memes here and there.

At least one of them might be a Rosetta Stone for some future Society that way.


Pretty much. It's just unfeasible to defend against intentional tampering over long periods of time with a one-time energy input.


I think my point is, ‘why would anyone want to care?’.

If someone cares and it matters to them, it is not infeasible.

If no one cares, then it may be feasible, but is pointless. No one would ever bother to check, even if said etched-in-glass checksum was still around and findable.

The Bible hasn’t stayed close to its original writings, or become one of the most printed works ever, through the passage of time alone. It’s because each generation finds its own reason for propagating what they want (and not propagating what they don’t), and that’s a necessary property for it to exist in a way anyone cares about at all after this amount of time.

Otherwise it would just be (at best) some rotten parchment in a language no one can read, and of at most academic interest in some caves in the Middle East. If someone came up with a checksum on such rotten parchment, the only people who would care would be math nerds - assuming anyone ever found it.


Let's get creative! There is no need for secrecy of the hash. Pay celebrities to wear a dress with the hash printed on it. Influence a publisher to print the hash in all books for a year. Create a mystery about what the hash is about, so it is mentioned in lots of articles. Who is this eccentric billionaire that pays truckloads of money to spread these apparently useless 64 digits?!?

(Disclaimer: Method not applicable to normal people.)


Yeah, making many copies makes sense. But if you're doing that, it's way easier to just make many copies of the original message. In other words, just publish it as a book and be done with it.

Broadcasting the hash would make sense if the message should stay secret for some time. Although 500 years is rather a long time for a zero-knowledge proof.


"There's no way to send a message to someone else over an untrusted channel with a very good assurance that it hasn't been tampered with, without a trusted channel. You need some way to convey trustable information. "

This is kind of what real p2p, open, permissionless, decentralized blockchains are good for...


Blockchains are not trusted channels. They don't permit secure communication between any node. That is to say, I can't send you a secret tamper-proof message over a blockchain if we haven't exchanged information over some other channel previously.


> I can't send you a secret tamper-proof message over a blockchain if we haven't exchanged information over some other channel previously

... Of course you can? If we're interacting via blockchain we both have mutually-known cryptographic public keys used to sign transactions. Assume without loss of generality they can't be used directly for public-key encryption (e.g. they're Lamport keys). I generate a McEliece public encryption key and include it in a transaction signed with my signing key. You use that to encrypt your message and include the encrypted message in a transaction signed with your signing key. I decrypt the message; it's secret and tamper-proof.

If, as in londons_explore's comment, we're worried about any specific algorithm being broken, we can use a bunch of different signing keys, and a bunch of different encryption keys, such that an attack would have to break all the signature algorithms or all the encryption algorithms to compromise the message.
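For anyone unfamiliar with the Lamport keys mentioned above: they are hash-based one-time signature keys, which is why they can sign but not encrypt. A minimal sketch, assuming SHA-256 as the underlying hash; keygen / sign / verify are illustrative names, and each key pair must sign only one message.

```python
import hashlib
import secrets

def _h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def _bits(message: bytes) -> list:
    """The 256 bits of the message digest, most significant bit first."""
    return [(byte >> (7 - i)) & 1 for byte in _h(message) for i in range(8)]

def keygen():
    """One pair of random secrets per digest bit; the public key is their hashes."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(_h(zero), _h(one)) for zero, one in private]
    return private, public

def sign(private, message: bytes) -> list:
    """Reveal, for each digest bit, the secret that corresponds to that bit's value."""
    return [private[i][bit] for i, bit in enumerate(_bits(message))]

def verify(public, message: bytes, signature: list) -> bool:
    """Check that each revealed secret hashes to the matching public value."""
    if len(signature) != 256:
        return False
    return all(
        _h(revealed) == public[i][bit]
        for i, (revealed, bit) in enumerate(zip(signature, _bits(message)))
    )
```

Security rests only on the hash function, which fits the "assume a specific algorithm may be broken" framing: swap the hash and you get an independent signing key.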


If your only method of communication is a blockchain, you can't know who owns which public key. To know that you would need to talk directly to that person and have them tell you "key such-and-such is mine". You still need that trusted channel. Since you can't know who controls the public key you're messaging over the blockchain and they can't know the public key that's messaging them is controlled by you, an impersonation attack is trivial. Someone else can send that public key a message saying "hello, I am a1369209993, let's agree on a symmetric key".

So, no, a blockchain is not a trusted channel. For the purposes of communication, it offers no more security than the public Internet. It doesn't even guarantee delivery of messages.


If your only method of communication is a blockchain, a public key is who owns that public key. "Hello, I am q8oYflHjXyXj7Pgu /0R4fkOjFG83GTI8 2bmTfBkzRcLJXNiN 4FuHE7Me71aWyTbk, let's agree on a symmetric key." is incompatible with the would-be impersonator supplying a different public key. (Assuming you're paying attention, I guess, but the original claim was that it was impossible, not that it was awkward and easy to screw up, since the latter is more or less true of practically all cryptography.)


> both against 'bit rot'

MD5 is good enough if bit rot alone is the only thing you care about. It's still very hard to generate an MD5 collision with the same bit length, and doing so would generally require changing a VERY large number of bits to get a collision, which is not what happens with bit rot.

It might even be mathematically impossible to find two files that have the same MD5 hash and differ by fewer than a certain number of bytes, though I don't know the proper way to formalize this.

(It's trivial to show for example that if you use any CRC as a hash, it is impossible to find a collision that differs by an edit distance of 1 and has the same length.)
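The CRC claim can be checked by brute force on a small input. A sketch assuming zlib's CRC-32 and reading "edit distance of 1" as a single flipped bit; single_bit_flips_detected is a made-up helper name.

```python
import zlib

def single_bit_flips_detected(data: bytes) -> bool:
    """Flip every bit of data in turn and confirm the CRC-32 always changes."""
    reference = zlib.crc32(data)
    buf = bytearray(data)
    for index in range(len(buf)):
        for bit in range(8):
            buf[index] ^= 1 << bit          # corrupt exactly one bit
            if zlib.crc32(bytes(buf)) == reference:
                return False                # an undetected single-bit error
            buf[index] ^= 1 << bit          # restore the original bit
    return True

print(single_bit_flips_detected(b"some data that will sit on a disk for a while"))
```

This is only an empirical check on one input, not the proof itself.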

> but also deliberate tampering.

You're never safe from this if you aren't the guardian of the official hash values. Someone could just change the file AND change the "official" hashes.
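To make that concrete, a small sketch assuming SHA-256 and made-up example data: a hash stored next to the file can simply be rewritten by whoever rewrites the file, so only a copy the attacker cannot reach catches the change.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected: str) -> bool:
    return digest(data) == expected

original = b"The treaty was signed in 2022."
tampered = b"The treaty was signed in 2023."

trusted_copy = digest(original)   # kept somewhere the attacker cannot reach
bundled_copy = digest(tampered)   # rewritten by the attacker alongside the file

print(verify(tampered, bundled_copy))  # True: the forgery looks "valid"
print(verify(tampered, trusted_copy))  # False: the independent copy catches it
```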


How do you want to make sure that the hashes themselves haven't been tampered with?


Tampering? That's harder. But to protect against corruption, since the hashes are shorter, you would be able to chisel them in different types of stone and glass and make 3+ copies of them. I suppose if you really want to protect against tampering, you could distribute so many copies that the ability to tamper with a majority of them would be very unlikely.

But I think what was meant by "publishing" was that the hashes would be available to parties who would be willing and able to preserve them indefinitely.
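The "3+ copies" idea amounts to a majority vote. A minimal sketch, assuming the copies are byte strings of equal length; majority_vote is an illustrative name. It survives corruption or tampering as long as, at every position, more than half the copies still agree.

```python
from collections import Counter

def majority_vote(copies: list) -> bytes:
    """Rebuild a value from several copies by taking the most common byte at each position."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("need at least one copy, all of the same length")
    rebuilt = bytearray()
    for column in zip(*copies):
        value, votes = Counter(column).most_common(1)[0]
        if votes * 2 <= len(copies):
            raise ValueError("no strict majority at some position")
        rebuilt.append(value)
    return bytes(rebuilt)

stone = bytes.fromhex("a3f1c29b")
glass = bytes.fromhex("a3f1c29b")
leather = bytes.fromhex("a3ffc29b")   # one damaged byte
print(majority_vote([stone, glass, leather]).hex())  # a3f1c29b
```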


Hash them, duh.


You don't know if it's the hash or the data that's been tampered with. So if the stored or transmitted hash doesn't match the computed hash, you reject it.


But if the hash matches, you can't know whether an adversary tampered with the data and simply updated the hash to match.


If you intend the data to be recoverable, consider adding recovery options to it. Error-correcting codes (ECC), with Reed-Solomon probably being the most common, can help verify and even recover data; helpful against bit rot.

WinRAR 5 can do it, and I believe much backup software worth its salt can too.


Simultaneously finding collisions in multiple hash algorithms with 2 random inputs would be a hard task. And if one of those inputs is predetermined (your data), and the other needs to include a change that the attacker wants to see there? That really sounds impossible. But on the other hand, we are talking 500 years...


The scenario is not finding collisions though. I assume londons_explore somehow finds a way to build a trusted relationship and channel with the future [1], say a piece of leather with the hashes burned into it that is ... stored publicly in the Louvre, with a ton of photo evidence spread all over the earth. As londons_explore is trusted, you only need second pre-image resistance: given a file (that was created outside of the attacker's control), find a second file which has the same hash as the first file. Second pre-image attacks are super hard to accomplish. MD5 is still second pre-image resistant!

Adding a second random input to the hash function, like you propose (that is, priming the internal state of the hash function with some random input, so that when hashing of the usage data starts, the internal state of the hash function is unknown to the attacker), makes collision attacks much harder. In fact there is a term for that: target collision resistance. But second pre-image attacks, where the random input used for the hashing is known, don't get any harder. And if you don't transmit the random input over the secure channel as well, then second pre-image attacks even get easier (theoretically), as attackers have the possibility to manipulate the internal state outside of the usage data.
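A rough sketch of what "priming with a random input" could look like, assuming SHA-256 and a fixed-length random prefix. This prefixing is a simplification for illustration, not a vetted randomized-hashing construction, and salted_digest / verify are made-up names. The point from the paragraph above shows up in verify: the salt has to arrive over the same trusted channel as the digest.

```python
import hashlib
import secrets
from typing import Optional, Tuple

SALT_BYTES = 32  # fixed length, so the salt/data boundary is unambiguous

def salted_digest(data: bytes, salt: Optional[bytes] = None) -> Tuple[bytes, str]:
    """Hash data behind a random prefix chosen at hashing time."""
    if salt is None:
        salt = secrets.token_bytes(SALT_BYTES)
    return salt, hashlib.sha256(salt + data).hexdigest()

def verify(data: bytes, salt: bytes, expected: str) -> bool:
    """Both salt and expected digest are assumed to come from the trusted channel."""
    return hashlib.sha256(salt + data).hexdigest() == expected

salt, published = salted_digest(b"archive payload")
print(verify(b"archive payload", salt, published))
```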

Btw., reliance on target collision resistance, instead of collision resistance, is why Ed25519/Ed448 are much more resilient to problems with the hash function than ECDSA or any common RSA signature scheme. MD5 is still target collision resistant last time I checked. Remember the debacle of forged MD5 and later SHA1 certificates [2]? Completely avoidable, if better signature schemes had been used.

[1] Otherwise https://news.ycombinator.com/item?id=32912052 applies.

[2] https://www.win.tue.nl/hashclash/rogue-ca/


For the bit rot part, this is what PAR2 is for. It generates recovery files, designed to repair the original file when arbitrary portions of it become malformed or lost.


If it's against (literal) bit rot, I would suggest things like ECC, so they could actually recover the damaged data.

> Is the best available approach to hash the data with a bunch of hash algorithms and publish all the hashes?

This is much less secure than you think it is. See https://www.iacr.org/archive/crypto2004/31520306/multicollis...


With the caveat that I'm no data archiving guru...

As a source of inspiration a good place to start might be B-LTA in the ETSI standards. I believe this takes into account the fact that the algorithm used for its creation might be deprecated.

IIRC the way it works is (basically) that signatures are given a finite validity period and you undertake periodic re-signing. Therefore you build up an audit trail of signatures with current algos as you go along.
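A toy sketch of the renewal idea, with loud caveats: this is not the ETSI format, and it uses plain digests as a stand-in for real signatures and timestamps. The record fields, the years and the function names are all made up for illustration. Each renewal re-hashes the data with a current algorithm and also commits to the previous record, so older entries are covered by newer algorithms.

```python
import hashlib
import json

def _record_digest(record: dict, algorithm: str) -> str:
    """Digest of a trail record, serialized deterministically."""
    encoded = json.dumps(record, sort_keys=True).encode()
    return hashlib.new(algorithm, encoded).hexdigest()

def renew(data: bytes, trail: list, algorithm: str, year: int) -> list:
    """Append a record made with a current algorithm, linked to the previous record."""
    previous = trail[-1] if trail else None
    record = {
        "year": year,  # hypothetical metadata; a real scheme would use a trusted timestamp
        "algorithm": algorithm,
        "data_digest": hashlib.new(algorithm, data).hexdigest(),
        "previous_record_digest": _record_digest(previous, algorithm) if previous else None,
    }
    return trail + [record]

def verify_trail(data: bytes, trail: list) -> bool:
    """Check every record against the data and against the record before it."""
    previous = None
    for record in trail:
        algorithm = record["algorithm"]
        if hashlib.new(algorithm, data).hexdigest() != record["data_digest"]:
            return False
        expected = _record_digest(previous, algorithm) if previous else None
        if record["previous_record_digest"] != expected:
            return False
        previous = record
    return bool(trail)
```

Usage would be something like renew(data, [], "sha256", 2022) now and renew(data, trail, "sha3_256", 2047) later. The newest record still has to reach the verifier through some channel they trust.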


You could sign the data. Distribute the public key far and wide. Destroy the private key before you die. The advantage of this approach over just the hash is that you can sign an unlimited number of files and can verify them all with just one public key. It is all authenticated by your identity.

If your signature scheme is broken over the years then that means that people can tamper with the files. So you can use different schemes just as with the hashes.

Who is going to be able to verify your identity after 500 years and/or have a verified copy of the hashes? Without the concept of identity, it's all just a bunch of bits.


What if an attacker creates altered data and publishes it the same way by hashing it 50 times? Now you have two documents claiming to be the original and they each have 50 hashes guaranteeing the integrity of the data. What if the attacker did that with 1,000 fake documents?

I think your best bet is just to publish it and also launch a copy into space with a 500 year orbit. If someone in the future tries to launch a similar data-comet with a shorter orbit, it will be visible in the trajectory. There's always the danger of them sending out a robotic space probe to mess with your data "in flight".


Erasure coding combined with hashing. Since erasure coding uses data redundancy to help with recovery, bit rot with one stream/disk can be automatically detected and recovered (subject to supported levels of error correction).

Hashing would be on top of it to ensure detection of tampering/errors. Using multiple hashes is also a good strategy.
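A minimal sketch of that combination, assuming equal-sized blocks, SHA-256 per block, and a single XOR parity block as a much simpler stand-in for Reed-Solomon (it can rebuild only one damaged block). The stored digests are assumed to be intact; encode and repair are illustrative names.

```python
import hashlib
from functools import reduce

def xor_blocks(blocks: list) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(blocks: list):
    """Return the parity block plus one digest per data block."""
    parity = xor_blocks(blocks)
    digests = [hashlib.sha256(block).hexdigest() for block in blocks]
    return parity, digests

def repair(blocks: list, parity: bytes, digests: list) -> list:
    """Use the digests to locate a damaged block and the parity to rebuild it."""
    bad = [i for i, block in enumerate(blocks)
           if hashlib.sha256(block).hexdigest() != digests[i]]
    if not bad:
        return list(blocks)
    if len(bad) > 1:
        raise ValueError("more damage than one parity block can repair")
    index = bad[0]
    survivors = [block for i, block in enumerate(blocks) if i != index]
    rebuilt = xor_blocks(survivors + [parity])
    if hashlib.sha256(rebuilt).hexdigest() != digests[index]:
        raise ValueError("parity block is damaged as well")
    fixed = list(blocks)
    fixed[index] = rebuilt
    return fixed
```

The hashes do the detecting and the redundancy does the recovering, which is the division of labour described above.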


Symmetric cryptography has come a long way since MD5. Any one of SHA-256, SHA-3/SHAKE256, BLAKE2/3 would be fine.


For the next 500 years though? I suppose (as a crypto layman) it'll be fine, but given we don't know what will be broken within the next 500 years, isn't it natural to add redundancy and use multiple hash algorithms?


I'd be more concerned about what equipment would be used to store/read that data in 50 years, let alone 500, to make this even a viable thing to do.


Indeed. Hard to quantify the risk of one of these functions getting broken but it’s very low. There’s a million far more likely ways to lose that data.


It's largely unknown if there's any emergent behaviour given H1(H2(m)), for any two hash functions H1, H2.

It's fully possible that your Uberhash function has some vulnerability that can be easily exploited regardless of the underlying security of the individual hash functions.
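One thing that is known about nesting: H1(H2(m)) can never be more collision resistant than the inner H2, because any collision in H2 carries straight through. Publishing the digests side by side does not have that particular weakness. A sketch using a deliberately broken, hypothetical weak_hash (SHA-256 truncated to one byte) so a collision can be found by brute force:

```python
import hashlib
from itertools import count

def weak_hash(message: bytes) -> bytes:
    """Hypothetical broken hash: SHA-256 truncated to a single byte."""
    return hashlib.sha256(message).digest()[:1]

def strong_hash(message: bytes) -> bytes:
    return hashlib.sha256(message).digest()

def nested(message: bytes) -> bytes:
    """H1(H2(m)) with the weak function on the inside."""
    return strong_hash(weak_hash(message))

def concatenated(message: bytes) -> bytes:
    """H1(m) || H2(m): both digests of the message itself, side by side."""
    return strong_hash(message) + weak_hash(message)

def find_weak_collision():
    """Brute-force two distinct messages with the same weak_hash (at most 257 tries)."""
    seen = {}
    for i in count():
        message = str(i).encode()
        digest = weak_hash(message)
        if digest in seen:
            return seen[digest], message
        seen[digest] = message

a, b = find_weak_collision()
print(nested(a) == nested(b))              # True: the inner collision survives nesting
print(concatenated(a) == concatenated(b))  # False: the strong half still differs
```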


Leave a trust fund to manage the stored data, with enough capital that the income can pay a manager to maintain the data in perpetuity.



