It's a PDF file which is also a NES ROM that displays its own MD5 sum. The PDF also shows its own MD5 sum a few times. (The MD5 sum also happens to begin with 5EAF00D.)
When an arbitrary MD5 can be created that easily, it's useless for any cryptographic applications, or even any data integrity.
MD5 may not be collision resistant, but the logic in your conclusion is completely wrong. There is no feasible pre-image attack on MD5. The MD5 generated _was not arbitrary_; they seem to have just brute-forced a few of the leading bytes and then did a Nostradamus attack.
Again: finding something that hashes to an arbitrary MD5 sum is still not known to be feasible. This isn't a particularly good reason to use MD5, but this also means that MD5 is not broken in the way you think it is.
When somebody finds an input that hashes to all zeroes, you can finally say that MD5 is completely broken. As far as anyone knows, it is not at that point yet.
Right. Nobody should use MD5 if at all possible, but it's important to understand how it is and isn't broken.
Like, if I have the MD5 hash of a binary from a trusted source, I can basically rely on that, unless the attacker was involved in producing the trusted binary. In which case I'd usually have bigger concerns.
This is not a feasible attack. There is a large difference between how academics use "broken" and what the practical consequences are.
As Schneier writes as an introduction in the very paragraph you are trying to quote, "in academic cryptography, the rules are relaxed considerably." This is not a snub on academia; colloquial terms sometimes just mean something different than the academic definition.
That is the definition for hash functions. The other commenter was, not to put too fine a point on it, wrong. From Schneier's Self-Study Course in Block-Cipher Cryptanalysis:
"Breaking a cipher simply means finding a weakness in the cipher that can be exploited with a complexity less than brute-force."
But please go on about how you know more than Schneier.
I use it a lot, but never for anything to do with security (obviously). We use it for calculating simple content hashes that are used for caching or database ids. It's fine for that; MD5 remains extremely popular for that and is used for such purposes by many big-name companies. It's not a problem and not a security issue. If you look at cache headers in your browser, a lot of the content hashes you'll see will be MD5 hashes, for example. Likewise, if you use any of the popular object stores in AWS, GCP, Azure, etc., they'll be using MD5 content hashes.
This is not a mistake. MD5 is a nice compromise between being fast and having a low probability of collisions while keeping the hashes nice and short. Simpler/faster hashing algorithms are available, of course; they have even more potential for collisions, and it's not an issue there either. But MD5 is easily accessible and present on most platforms. So it's a good default if you need some kind of content hash.
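For illustration, a minimal sketch of that kind of non-security content hash, assuming Python's hashlib (on FIPS-restricted builds you may need usedforsecurity=False):

    import hashlib

    def cache_key(content: bytes) -> str:
        # Derive a short, stable cache key from the content itself.
        # MD5 here is a content fingerprint, not a security measure.
        return hashlib.md5(content).hexdigest()

    # Identical content always yields the identical key, so the digest
    # can serve as an ETag or a deduplicating database id.
    assert cache_key(b"hello") == cache_key(b"hello")
    print(cache_key(b"hello"))  # 5d41402abc4b2a76b9719d911017c592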
I've never seen accidental collisions, and intentionally creating collisions in a cache or a database id doesn't really serve any purpose to anyone. So yes, you could try to do that, but why would you? The probability of unintentional collisions is low enough that it is not a concern. It's a complete non-issue. You are never going to see one in your career.
Using it is not a security issue unless you use it for things that need to be secure, in which case you should use something like SHA-3 or one of its alternatives. But hash algorithms have applications beyond security-sensitive ones. AWS using SHA-3 for S3 object content hashes would be overkill and a waste of CPU.
People still use antivirus in 2022?? I think it's well-understood that antivirus has somehow managed to be worse than not having it, as antivirus programs themselves often have vulnerabilities that make full system compromise easier.
Note that your antivirus is also performing worse than even the average antivirus, which is already pretty bad. The EICAR test file is only meant to be detected if it is at most 128 bytes long.
I personally do not, but my company issued laptop does.
I would not disagree that it makes the computer worse; there was a significant performance decrease when the latest version was installed earlier this year. But this is off topic.
What I found interesting was that I didn't know about EICAR until today.
Disclaimer: This is a fun thought experiment. I'm not looking for actionable results, or advocating for relying on any of this comment for actual security. I'm clearly not a cryptographer; I just think it would be interesting to talk about here, and maybe more educated people could comment on how well these approaches might mitigate the exploits in the article. Play with me in this space.
I'm curious if people have any interesting ideas on how to add some seasoning to MD5 to make it more secure. That is, simple, intuitive things you can do in combination with MD5 such that all the pieces in your scheme are still easily understood and don't amount to a new hash algorithm that can only be understood as a black box. Pretend MD5 is the only hash algorithm that has ever been found. Or that you're the Gilligan's Island Professor and MD5 hashes are your coconuts. What are the most potentially useful things you can build out of the most primitive, dumb components?
For example:
- Output the length of the input (or a hash of the length if you must have a constant-length output)
- Hash the input forwards and backwards and produce two hashes; see the sketch after this list. (Remembering that, though the output is 256 bits now, you still only have coconuts to work with.)
- Include more complicated variations on the input in the hashes. e.g. start in the middle and oscillate forward and backward over the input, or move the second half of the input in front of the first before hashing, or use the input/hash of the input to seed a pseudorandom re-ordering of the input before hashing, etc.
- Format-aware hashing - whatever program will interpret the content of the file can also produce a hash, or some [canonical] interpretation of the content that can be hashed. e.g., for an image format, we could ask the renderer how many iterations of some operation it had to perform to render the output, or in the worst case, hash the bitmap it produced.
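A minimal sketch of the forwards-and-backwards idea, assuming Python's hashlib; the helper name double_md5 is just for illustration:

    import hashlib

    def double_md5(data: bytes) -> str:
        # Hash the input forwards and backwards and concatenate the digests.
        # The output is 256 bits, but it is still built only out of MD5
        # 'coconuts', so don't assume 256 bits of security.
        forward = hashlib.md5(data).hexdigest()
        backward = hashlib.md5(data[::-1]).hexdigest()
        return forward + backward

    print(double_md5(b"example input"))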
For SHA-1, people made a system where you can detect the patterns that lead to a collision and (for example) replace the hash with a different one only for inputs that would be a problem: https://github.com/cr-marcstevens/sha1collisiondetection I think Git does this to eke more life out of SHA-1.
I imagine you could take a similar counter-cryptanalysis approach to MD5. (I am out of my depth here, so there could be reasons this doesn't work for MD5 that I'm unaware of.)
Very cool! I searched for it here, and found some comments that imply it's possible to detect these attacks in MD5, but nothing too detailed. Much more advanced than I was thinking, but still hacky enough that it satisfies me!
Nowadays it would be a complete waste of time to try to improve MD5, because there is no reason to use it for anything.
Most modern CPUs from Intel, AMD or ARM implement in hardware at least SHA-1 and SHA-256, so these both are faster than MD5 implemented in software.
If you only need 128 bits, you can truncate one of the longer hashes to 128 bits.
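For example, in Python (a trivial sketch):

    import hashlib

    # Truncating SHA-256 to 128 bits: keep the first 16 bytes of the digest.
    digest128 = hashlib.sha256(b"some data").digest()[:16]
    print(digest128.hex())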
SHA-1 is the fastest in the current CPUs, so even if it is also obsolete for applications where there is an adversary who attempts to find collisions, it remains a good choice for various non-cryptographic applications that need a long hash or random numbers.
SHA-1 itself was an attempt to improve on MD5. Even if it was not entirely successful, it remains many orders of magnitude stronger than MD5.
So if you want to know how MD5 can be improved, the best way is to compare the MD5 and the SHA-1 algorithms, and look at what was changed between them.
Another improved MD5 is the RIPEMD-160 algorithm, so comparing RIPEMD-160 with MD5 is another source for seeing how MD5 can be improved, by different methods.
I agree about the usefulness of thought experiments, but there are many other much more useful thought experiments, even in the domain of hashes.
A large number of people have thought for several years about how to improve MD5, and the results were SHA-1 and RIPEMD-160.
It is very instructive to study the evolution from MD4 to MD5 and on to SHA-1 and RIPEMD-160. But attempting, 30 years later, to do better than those without completely changing a hash-algorithm structure that is now well understood to be inappropriate is very unlikely to teach you anything.
... why are you trying to "season MD5 to make it more secure"? Calls to switch off of MD5 started in 1996, and MD5 has been considered broken for some 15 years.
Because it's interesting to think about. It's hacking - doing something with technology that was not intended, and sometimes ill-advised, to see what kind of interesting results we can get.
The point is, it's not an interesting hack for the vast majority of the audience. Because they've already seen this play out over and over and over.
It's like sticking your hand in a grinder and yelling "for science". The result might be new for you, and it's hacking by your definition, but I'm fairly certain the rest of the audience is OK not investing time into the experiment.
I have no intention of sticking my metaphorical hand, or the hand of anyone, into a metaphorical grinder.
But I will point out that it was someone who thought about, and experimented with, sticking fleshy pieces into power tools, that invented the SawStop. :)
> BLAKE2b is faster than MD5, SHA-1, SHA-2, and SHA-3, on 64-bit x86-64 and ARM architectures. BLAKE2 provides better security than SHA-2 and similar to that of SHA-3
"NIST chose KECCAK over the four other excellent finalists for its elegant design, large security margin,
good general performance, excellent efficiency in hardware implementations, and for its flexibility"
The real reason is that SHA-3 is more different from SHA-2 in terms of fundamental design. The SHA-3 contest was launched shortly after SHA-1 was broken, partly as a hedge against SHA-2 being broken too.
The simplest way to improve MD5 would be to add more rounds (but as others have mentioned, there are much better choices for any practical purpose)
RE your last point, I'm not quite sure what attack you're defending against there, but most file formats do not have a well-defined "canonical interpretation", much less so one that is serializable into bytes. (If you think it sounds easy, you haven't thought about it hard enough :P )
Well, for images I still think that there is such a thing, even if it's the least interesting example: the bitmap they ultimately render. (For formats that are allowed to produce different bitmaps, it gets more difficult and maybe impossible.)
For the general case of any file format: I agree this approach is the least simple/dumb/trivial and might even violate the spirit of my original comment itself. But it's still interesting. By "[canonical] interpretation", I just meant some way to fingerprint the content while understanding the format. e.g., if it's a tarball, sum the total number of files and directories inside it. Concatenate all their names in a well-defined order and hash that. I know you can't prevent collisions entirely, but it may be relatively cheap to make it so that 2 files that are different (with respect to the file format) are likely to be represented differently.
If for a moment we assume that you can do it reliably (which I personally doubt, even for "simple" formats) - what's the point? Why not just hash the original file? What's the benefit here?
For example, I could create two different PNGs that decode to the same bitmap. Or I could create one PNG that decodes to multiple different bitmaps, depending on which implementation decodes it (due to implementation bugs and/or under-defined areas of the specification). Or I could create a PNG that is also a valid ZIP archive.
There's no security benefit, and I would have a hard time coming up with a practicality benefit. It's mostly just interesting to think about, especially in response to the article. The article is demonstrating fast MD5 second preimage attacks for various file formats (EDIT: apparently not preimage attacks, just collision attacks), so in response to that I'm wondering how these MD5-specific attacks might be mitigated, for fun. Consider it alternate history fiction in which we never discovered anything better than MD5 :)
In your examples, though:
> two different PNGs that decode to the same bitmap
But would the PNGs also have the same MD5 hash?
> one PNG that decodes to multiple different bitmaps, depending on which implementation decodes it
Yeah, that would be a challenge. Relying on implementation details, or results which are allowed to vary, wouldn't work. But since this is meant to supplement an existing MD5 hash, the idea is that the format consumer/interpreter would be in a good position to produce some format-aware fingerprint that is statistically likely enough to be different when the inputs are different.
Ah, OK, I think I misunderstood the article. If you are supplying both images to me, you could do that with the MD5 hashes. Although, I think if you could get them to generate the same bitmap, then the attack has been at least partially mitigated, by definition. Not completely, I admit, but I think it wouldn't qualify as the same attack shown in the article.
There's a difference between bringing MD5 up to modern strength and simply blocking the attempts of attackers who've obtained this MD5 forging kit. The latter is an interesting thought experiment.
The point of a hash is to avoid feeding malicious data further into your pipeline, so solutions that involve parsing the file and hashing its actual data streams aren't a good idea. We'll focus on detecting the change before looking at the data.
To make the attack harder, use file formats without expandable or optional sections. This attack works by parsing known file formats and making allowable changes. If you had encryption as one of your coconuts it would make this easier by making the whole file opaque, but the encryption could also be used to generate a hash-like construction so if you have good enough encryption you wouldn't be stuck with MD5 and there wouldn't be much thought experiment left...
To stop this specific attack, assuming you had to use these formats, using MD5(file+reversed_file) or MD5(file)+MD5(reversed_file) or MD5('secret'+file) would work. This is massively beyond what most people could make the tools do, but it's probably not that much harder cryptographically.
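Those three constructions, as a minimal Python sketch (the function names are just for illustration):

    import hashlib

    def md5_with_reversed(data: bytes) -> str:
        # MD5(file + reversed_file)
        return hashlib.md5(data + data[::-1]).hexdigest()

    def md5_pair(data: bytes) -> str:
        # MD5(file) + MD5(reversed_file)
        return hashlib.md5(data).hexdigest() + hashlib.md5(data[::-1]).hexdigest()

    def md5_prefixed(secret: bytes, data: bytes) -> str:
        # MD5(secret + file); only helps while the prefix stays secret
        return hashlib.md5(secret + data).hexdigest()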
If your solution was a secret, it would be pretty effective. The problem is that if you don't have encryption, then they're watching you communicate. They can see the hash sum you tell the recipient to expect, and, if this is the first time you've used this scheme, the steps to use to check it. But even if they don't observe this, you're only using a small set of non-crypto operations, and they can just try millions of combos (reverse the file, append a second copy, interleave bytes, etc.) and see if anything produces the same hash. Then they have discovered your algorithm and can modify their tools to perform the attack.
Oh, I think secrecy would be against the spirit of the experiment. And yeah, I wonder what kind of mods they would need to make to their tools in response. If it takes 10 years to generate the collisions, then it seems like it might be a worthwhile approach (in the thought experiment universe only).
But we should also recognize that the article demonstrates collision attacks, not preimage attacks, which would be the attack you want to worry about if you are using a trusted hash to verify a file you received over an untrusted channel.
Interesting. I submitted my original comment with the misconception that second preimage attacks against MD5 were feasible and were being used in the article. So I see how, in the case where you want to defend against collisions (e.g. a malicious certificate signing request), length awareness might not help. I'm not sure if that would apply to a second preimage attack, though.
Intuitively you might think concatenating two hashes could make something way stronger than either, but the way these functions were constructed, that isn't necessarily true. It also shows a trick to efficiently make exponentially more collisions with linearly more work. It's not that those specific tricks work against every idea for fortification; the paper says it doesn't break some schemes you'd expect it to. But it does show that a weak Merkle-Damgård compression function has weird indirect effects, and that makes things built on one seem shakier.
It can make some difference to try to build a system not to rely on collision resistance if possible, or not to give the attacker full control over the input to the hash. For example, CAs started putting a random serial number not predictable by the requestor into the generated cert, so the attacker couldn't predict all of the content being signed. They still moved off SHA-1, of course.
There are lots of good things to say about newer functions. They do more work (more rounds, bigger state), may make better choices of where to spend their effort (newer functions seem to mix bits faster), and use today's CPU capabilities effectively (recent hashes support SIMD parallelism; SHA-512 uses your 64-bit ALU; SHA-256 even has x64 instructions).
The real "fuck you, cryptanalysts" option is two-pass hashing:
H(m) = H0( H0(m) ~ m )
where H0 can be MD5 or any other normal hash and ~ is concatenation. This effectively adds a random salt to the hash, and even if a collision is found in the base hash, the salted version will 'start' (reach the start of the message) in a different state, so the collision blocks that carefully cancel out differences in the H0 state won't work on the salted state and vice versa.
This doesn't eliminate the possibility of cryptanalysis in principle, but it does make it much, much harder.
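A minimal sketch of the construction, with H0 = MD5 (assuming Python's hashlib):

    import hashlib

    def two_pass_md5(message: bytes) -> bytes:
        # H(m) = H0( H0(m) ~ m ), where ~ is concatenation.
        inner = hashlib.md5(message).digest()
        return hashlib.md5(inner + message).digest()

    print(two_pass_md5(b"example").hex())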
Might be useful to look up the prior research. Search for Merkle-Damgard. And the IACR ePrint archive has quite a few interesting reads: https://eprint.iacr.org/
I get it that you're just hacking around, but reading what the olds said about MD5 is useful.
Oh. Also. What you're doing here is "finding a second pre-image" which is slightly different from finding a collision. A second pre-image is where you already have a message and a hash and you want to find a second message that hashes to the same thing. Technically speaking, a collision is just finding two messages that hash to the same thing, but you don't necessarily get to pick what the two messages are.
I mention it not to be an academic snob, but because when you read the literature, there are plenty of examples where you sort of have to know the two things are distinct, at least to the academics.
Thanks for pointing that out. It's definitely important to understand the difference, and not just academically. It helps us understand that the consequences of MD5 being broken have only been proven to work for collision attacks, and not preimage attacks. (But I still don't think using MD5 when you can use more secure alternatives is a good idea.)
That page hasn't aged all that well. The prediction that applications would need to switch to a new hash function "every few years" hasn't panned out. The feared improved attacks on SHA-2 have failed to materialize. Applications that chose SHA-2 20 years ago are still quite secure today.
It seems we just weren't very good at designing hash functions in the 90s.
Question for people into cryptography + data archiving....
If I want to store data for 500 years, I want future people to be reasonably sure of the integrity of the data, both against 'bit rot', but also deliberate tampering.
Is the best available approach to hash the data with a bunch of hash algorithms and publish all the hashes?
Then if any hash algorithm remains unbroken, the integrity of my data is certainly still good. An attacker would have to do a simultaneous preimage attack for every hash algorithm I choose to break the scheme, which historically has never happened to my knowledge.
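A minimal sketch of that publish-many-hashes idea, assuming Python's hashlib:

    import hashlib

    ALGORITHMS = ["md5", "sha1", "sha256", "sha3_256", "blake2b"]

    def multi_digest(data: bytes) -> dict:
        # Hash the same data with several structurally different algorithms;
        # an attacker would have to defeat all of them simultaneously.
        return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}

    for name, digest in multi_digest(b"archive me for 500 years").items():
        print(name, digest)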
You can send a message to someone else over an unreliable channel with a very good assurance that it hasn't been corrupted unintentionally, without needing a reliable channel.
You can send a message to yourself in the future over an untrusted channel with a very good assurance that it hasn't been tampered with without needing a trusted channel, because you trust yourself and can thus remember trustable information.
There's no way to send a message to someone else over an untrusted channel with a very good assurance that it hasn't been tampered with, without a trusted channel. You need some way to convey trustable information.
You can somewhat improve your odds by using many unreliable channels, requiring an attacker to compromise all of them.
Not sure how to do that over 500 years though? Pass down the data as an heirloom, bury a SHA-3 hash in a plaque below a building foundation, put a BLAKE3 hash on a granite tablet buried in the desert, write a SHA-512 hash on the back of a Banksy painting. Put a BLAKE2b hash in your diary and get it accepted into a museum collection. The more you add, the more difficult it becomes for an attacker to find and alter all of them.
If you're hiding the hash where no one can find it, how are people going to check it? If you tell people where you're hiding it, what's the point of hiding it?
Hiding it can make sense as long as you are sure it's eventually found by chance, and the finder will recognize its significance. At some point the building foundation will be torn up, and precisely at that point the hash will be revealed. Any modification before that requires knowledge of it and a somewhat sophisticated plan to modify or destroy it.
But really, you don't have to make those locations secret; you already get a lot of security by requiring an attacker to drive out into a desert to change that tablet, access a building foundation, and tamper with a Banksy painting. None of these are secure in the cryptographic sense, but even without secrecy they raise the bar for an attack substantially and make it more likely to be detected.
>Hiding it can make sense as long as you are sure it's eventually found by chance, and the finder will recognize its significance. At some point the building foundation will be torn up, and precisely at that point the hash will be revealed.
The hash could be revealed after the tampered message was found, and therefore after it would have been necessary. Imagine some critical decision was taken based on incorrect information, for example.
>you already get a lot of security by requiring an attacker to drive out into a desert to change that tablet
If you think remoteness is sufficient protection then there's no need for hashes or anything like that. Just do what 9gag did and etch your message onto stone and bury it wherever. They found the approximate location of that slab fairly quickly, but who's going to go digging in Spain just to smash a limestone slab with some memes on it? If someone is willing to do that once, doing it one or more times again is not a lot more effort.
The key factor here too is: will anyone care? If it's some random alphanumeric string whose significance no one recognizes or cares about, you're going to have more impact on humanity by literally etching a bunch of random memes here and there.
At least one of them might be a Rosetta Stone for some future Society that way.
I think my point is, ‘why would anyone want to care?’.
If someone cares and it matters to them, it is not infeasible.
If no one cares, then it may be feasible, but is pointless. No one would ever bother to check, even if said etched-in-glass checksum was still around and findable.
The Bible isn’t vastly different from it’s original writings - or one of the most printed works ever- due to time alone. it’s because each generation finds it’s own reason for propagating what they want (and not propagating what they don’t), and that’s a necessary property for it to exist in a way anyone cares about at all after this amount of time.
Otherwise it would just be (at best) some rotten parchment in a language no one can read, and of at most academic interest in some caves in the Middle East. If someone came up with a checksum on such rotten parchment, the only people who would care would be math nerds - assuming anyone ever found it.
Let's get creative! There is no need for secrecy of the hash. Pay celebrities to wear a dress with the hash printed on it. Influence a publisher to print the hash in all books for a year. Create a mystery about what the hash is about, so it is mentioned in lots of articles. Who is this eccentric billionaire who pays truckloads of money to spread these apparently useless 64 digits?!?
(Disclaimer: Method not applicable to normal people.)
Yeah, making many copies makes sense. But if you're doing that, it's way easier to just make many copies of the original message. In other words, just publish it as a book and be done with it.
Broadcasting the hash would make sense if the message should stay secret for some time. Although 500 years is rather a long time for a zero-knowledge proof.
"There's no way to send a message to someone else over an untrusted channel with a very good assurance that it hasn't been tampered with, without a trusted channel. You need some way to convey trustable information. "
This is kind of what real p2p, open, permissionless, decentralized blockchains are good for...
Blockchains are not trusted channels. They don't permit secure communication between any node. That is to say, I can't send you a secret tamper-proof message over a blockchain if we haven't exchanged information over some other channel previously.
> I can't send you a secret tamper-proof message over a blockchain if we haven't exchanged information over some other channel previously
... Of course you can? If we're interacting via blockchain, we both have mutually known cryptographic public keys used to sign transactions. Assume without loss of generality that they can't be used directly for public-key encryption (e.g. they're Lamport keys). I generate a McEliece public encryption key and include it in a transaction signed with my signing key. You use that to encrypt your message and include the encrypted message in a transaction signed with your signing key. I decrypt the message; it's secret and tamper-proof.
If, as in londons_explore's comment, we're worried about any specific algorithm being broken, we can use a bunch of different signing keys, and a bunch of different encryption keys, such that an attack would have to break all the signature algorithms or all the encryption algorithms to compromise the message.
If your only method of communication is a blockchain, you can't know who owns which public key. To know that, you would need to talk directly to that person and have them tell you "key such-and-such is mine". You still need that trusted channel. Since you can't know who controls the public key you're messaging over the blockchain, and they can't know that the public key messaging them is controlled by you, an impersonation attack is trivial. Someone else can send that public key a message saying "hello, I am a1369209993, let's agree on a symmetric key".
So, no, a blockchain is not a trusted channel. For the purposes of communication, it offers no more security than the public Internet. It doesn't even guarantee delivery of messages.
If your only method of communication is a blockchain, a public key is who owns that public key. "Hello, I am q8oYflHjXyXj7Pgu /0R4fkOjFG83GTI8 2bmTfBkzRcLJXNiN 4FuHE7Me71aWyTbk, let's agree on a symmetric key." is incompatible with the would-be impersonator supplying a different public key. (Assuming you're paying attention, I guess, but the original claim was that it was impossible, not that it was awkward and easy to screw up; the latter is more or less true of practically all cryptography.)
MD5 is good enough if bit rot alone is the only thing you care about. It's still very hard to generate an MD5 collision with the same bit length, and doing so would generally require changing a VERY large number of bits to get a collision, which is not what happens with bit rot.
It might even be mathematically impossible to find two files that have the same MD5 hash and differ by less than a certain amount of bytes, though I don't know the proper way to formalize this.
(It's trivial to show for example that if you use any CRC as a hash, it is impossible to find a collision that differs by an edit distance of 1 and has the same length.)
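The parenthetical CRC claim is easy to spot-check (for one message, not as a proof) with Python's zlib.crc32:

    import zlib

    msg = b"some archived data"
    base = zlib.crc32(msg)

    # A CRC detects every single-bit error, so no 1-bit flip of a
    # same-length message can collide with the original checksum.
    for byte_index in range(len(msg)):
        for bit in range(8):
            damaged = bytearray(msg)
            damaged[byte_index] ^= 1 << bit
            assert zlib.crc32(bytes(damaged)) != base

    print("no single-bit flip preserved the CRC")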
> but also deliberate tampering.
You're never safe from this if you aren't the guardian of the official hash values. Someone could just change the file AND change the "official" hashes.
Tampering? That's harder. But to protect against corruption: since the hashes are shorter, you would be able to chisel them into different types of stone and glass and make 3+ copies. I suppose if you really want to protect against tampering, you could distribute so many copies that tampering with a majority of them would be infeasible.
But I think what was meant by "publishing" was that the hashes would be available to parties who would be willing and able to preserve them indefinitely.
You don't know whether it's the hash or the data that's been tampered with. So if the stored or transmitted hash doesn't match the computed hash, you reject it.
If you intend the data to be recoverable, consider adding recovery options to it. Error-correcting codes (ECC), with Reed-Solomon probably the most common, can help verify and even recover data; helpful against bit rot.
WinRAR 5 can do it, and I believe most backup software worth its salt can too.
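A minimal sketch using the third-party reedsolo package (an assumption: installed via pip install reedsolo; recent versions return a tuple from decode):

    from reedsolo import RSCodec

    rsc = RSCodec(10)  # 10 parity bytes: corrects up to 5 corrupted bytes

    original = b"data worth keeping"
    encoded = rsc.encode(original)

    # Simulate bit rot by corrupting a couple of bytes.
    damaged = bytearray(encoded)
    damaged[2] ^= 0xFF
    damaged[9] ^= 0xFF

    # decode() returns (message, full codeword, error positions).
    recovered = rsc.decode(bytes(damaged))[0]
    assert bytes(recovered) == original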
Simultaneously finding collisions in multiple hash algorithms with 2 random inputs would be a hard task, and if one of those inputs is predetermined (your data), and the other needs to include a change that the attacker wants to see there? that really sounds impossible. But on the other hand, we are talking 500 years...
The scenario is not finding collisions, though. I assume londons_explore somehow finds a way to build a trusted relationship and channel with the future [1]; say, a piece of leather with the hashes burned into it that is stored publicly in the Louvre, with a ton of photo evidence spread all over the earth from then on. As londons_explore is trusted, you only need second pre-image resistance: given a file (that was created outside of the attacker's control), find a second file which has the same hash as the first file. Second pre-image attacks are super hard to accomplish. MD5 is still second pre-image resistant!
Adding a second random input to the hash function, like you propose (that is, priming the internal state of the hash function with a random input, so that when hashing of the actual data starts, the internal state of the hash function is unknown to the attacker), makes collision attacks much harder. In fact there is a term for that: target collision resistance. But second pre-image attacks, where the random input used for the hashing is known, don't get any harder. And if you don't transmit the random input over the secure channel as well, then second pre-image attacks even get easier (theoretically), as attackers gain the ability to manipulate the internal state via input outside the actual data.
Btw., reliance on target collision resistance instead of collision resistance is why Ed25519/Ed448 are much more resilient to problems with the hash function than ECDSA or any common RSA signature scheme. MD5 is still target collision resistant, last time I checked. Remember the debacle of forged MD5 and later SHA-1 certificates [2]? Completely avoidable if better signature schemes had been used.
For the bit rot part, this is what PAR2 is for. It generates recovery files, designed to repair the original file when arbitrary portions of it become malformed or lost.
With the caveat that I'm no data archiving guru...
As a source of inspiration, a good place to start might be B-LTA in the ETSI standards. I believe this takes into account the fact that the algorithm used for its creation might become deprecated.
IIRC the way it works is (basically) that signatures are given a finite validity period and you undertake periodic re-signing. You therefore build up an audit trail of signatures with current algorithms as you go along.
You could sign the data. Distribute the public key far and wide. Destroy the private key before you die. The advantage of this approach over just the hash is that you can sign an unlimited number of files and can verify them all with just one public key. It is all authenticated by your identity.
If your signature scheme is broken over the years then that means that people can tamper with the files. So you can use different schemes just as with the hashes.
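A sketch of the sign-and-verify flow with the third-party cryptography package (an assumption; any long-lived signature scheme would do):

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()  # distribute this far and wide

    signature = private_key.sign(b"the archived data")

    # Anyone holding the public key can check authenticity; verify()
    # raises InvalidSignature if the data or signature was tampered with.
    public_key.verify(signature, b"the archived data")

    # ...then destroy private_key before you die.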
Who is going to be able to verify your identity after 500 years and/or have a verified copy of the hashes? Without the concept of identity, it's all just a bunch of bits.
What if an attacker creates altered data and publishes it the same way, with 50 hashes of its own? Now you have two documents claiming to be the original, and they each have 50 hashes guaranteeing the integrity of the data. What if the attacker did that with 1,000 fake documents?
I think your best bet is just to publish it and also launch a copy into space with a 500 year orbit. If someone in the future tries to launch a similar data-comet with a shorter orbit, it will be visible in the trajectory. There's always the danger of them sending out a robotic space probe to mess with your data "in flight".
Erasure coding combined with hashing. Since erasure coding uses data redundancy to help with recovery, bit rot with one stream/disk can be automatically detected and recovered (subject to supported levels of error correction).
Hashing would be on top of it to ensure detection of tampering/errors. Using multiple hashes is also a good strategy.
For the next 500 years though? I suppose (as a crypto layman) it'll be fine, but given we don't know what will be broken within the next 500 years, isn't it natural to add redundancy and use multiple hash algorithms?
Indeed. Hard to quantify the risk of one of these functions getting broken but it’s very low. There’s a million far more likely ways to lose that data.
It's largely unknown if there's any emergent behaviour given H1(H2(m)), for any two hash functions H1, H2.
It's fully possible that your Uberhash function has some vulnerability that can be easily exploited regardless of the underlying security of the individual hash functions.
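For concreteness, the construction under discussion, as a minimal Python sketch:

    import hashlib

    def uberhash(message: bytes) -> bytes:
        # H1(H2(m)) with H2 = MD5 and H1 = SHA-256.
        # One emergent property is already visible: any MD5 collision is
        # automatically a collision of the composite, so nesting this way
        # is no stronger than the inner hash against collision attacks.
        inner = hashlib.md5(message).digest()
        return hashlib.sha256(inner).digest()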
They might not realize they're hurting themselves, which makes it a bug. Backblaze did this with one of these hash collisions. You could back up two different (colliding) files and when you tried to restore both, only two copies of the same file would be returned, the other one apparently lost forever.
Why would you have colliding files? Perhaps because you're somebody working on hash collisions and have collected some examples.
In the era of high-fidelity generative models, I suspect that the future of media formats will be security-forward, with built-in protections against length extension attacks.
I'm having a hard time imagining any future other than one where people only trust signed media, and media is possibly even signed in hardware by actual physical sensors/compressors.
Colliding any pair of files has been possible for many years, but it takes several hours each time, with no shortcut. This page provides tricks specific to file formats and precomputed collision prefixes to make collisions instant. git clone. Run script. Done.
"""
Could anyone weigh in on whether these ideas can be generalized to speed up MD5 collisions in general?
Side note: I recently saw some example code using TensorFlow to determine the private key of some cryptosystem. I can't find it. It was literally operating on the bits of the key and somehow had a loss function. Any ideas?
> And yet MD5 is still recommended in Applied Cryptography.
I'm guessing this is the usual HN dig at Schneier's book?
Fact is that Schneier was already warning against MD5 as far back as 1996. And if you look on more recent blog posts, he is not exactly silent on dissuading people against MD5.
Fact is that, like it or not, even today MD5 still sees usage in the crypto landscape (see AWS S3 Content-MD5, for example). Therefore being able to understand how it works is still relevant, and in that respect Schneier's book is as good as any other.
Further, SHA-256 operates on the same principle as well (repeated iterations of a one-way compression function, called a Merkle-Damgård iterative construction[1]). Learning how MD5 works is quite instructive in that regard.
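For intuition, a toy sketch of the Merkle-Damgård iteration pattern (not real MD5 or SHA-256; the compression function here is cryptographically worthless and exists only to show the structure):

    import struct

    def toy_md(message: bytes, block_size: int = 8) -> int:
        # Merkle-Damgård padding (simplified): append the message length,
        # then pad out to a whole number of blocks.
        padded = message + struct.pack(">Q", len(message))
        if len(padded) % block_size:
            padded += b"\x00" * (block_size - len(padded) % block_size)

        state = 0x0123456789ABCDEF  # fixed initialisation vector
        for i in range(0, len(padded), block_size):
            block = int.from_bytes(padded[i:i + block_size], "big")
            # Stand-in compression function: mix the block into the state.
            state = ((state ^ block) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
        return state

    print(hex(toy_md(b"hello world")))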
It's more of a dig on people who claim MD5 is okay because it's in Schneier's book. Not so much a dig on Schneier himself.
And yes. I understand MD5 etag headers are common. I even understand that MD5 still passes the avalanche test -- Robshaw's observations on MD5 in 1996 are just as valid today.
But according to Sasaki & Aoki, //both// collisions and second pre-image calculation are faster than brute force. Maybe MD5 //shouldn't// be used if all you need is avalanche.
It turns out that SHA256 has both collision and second pre-image calculation resistance //as well as// avalanche effect. Maybe use that one instead.
I don't remember the last time I saw anyone use MD5 for cryptography. It's still used a lot for data integrity, because it still works perfectly well for that and it's faster than the SHA family (when there's no hardware implementation).
Though be careful about the use of the phrase "data integrity" -- comparing two files on a file-system by their MD5 hash is probably fine, but comparing the serialization of two PDUs over the wire based on their MD5 hash may be problematic.
I think I understand your meaning to be that MD5 certainly still exhibits an avalanche effect; changing a single bit in the input changes about half the bits in the output. And if you trust the way you retrieve the hashed data and the hash (like it's on a local hard drive), then yes, it's certainly acceptable for that use. But collisions and second pre-image generation being faster than brute force are why people generally don't want to use MD5 when its use spans trust domains.
My point is:
a. Don't use MD5 just because Bruce Schneier published a popular book that said it was okay RIGHT BEFORE all the research damning it came out. (Personally... I think Bruce should publish a third edition of this text expressly to remove the bit about MD5 being okay. I cannot tell you how many hours of billable time I've wasted explaining to software engineers that no... MD5 is not recommended for use even though at one time it was considered acceptable. And if you know not to use MD5, you're not the software engineer I'm talking about.)
b. You can use SHA256 and get avalanche, collision resistance AND second pre-image generation resistance. (Pretty sure you also get 1st pre-image generation resistance, but I haven't scanned the literature for that in a while.)
And while I'm thinking about it, let me add these points:
c. There are probably better hashing algorithms than MD5 for use with a hash map/table/tree.
d. If you're interested in how MD5 works, I recommend expanding the scope a bit and study Merkle-Damgård generally. Why MD5 has problems and other hash functions that make use of the Merkle-Damgård construction don't (or have different problems, or the same problems at different amounts of input) is pretty interesting.
And yes, if you happen to have a MD5 hardware accelerator or petabytes of data and MD5 hashes already, it's hard to change that overnight.
>comparing the serialization of two PDUs over the wire based on their MD5
I'm not sure what you mean by this, but this:
>people generally don't want to use MD5 when its use spans trust domains.
is exactly what I mean by "cryptography". I.e. guarding against intentional tampering. Are there a lot of people using it for this purpose? I don't remember seeing one.
>c. There are probably better hashing algorithms than MD5 for use with a hash map/table/tree.
For sure. Cryptographic functions (even obsolete ones) are almost always overkill and too slow for general data structures. Only use them if you can't find something more suitable for your data.
>And yes, if you happen to have a MD5 hardware accelerator or petabytes of data and MD5 hashes already, it's hard to change that overnight.
I was actually talking about SHA-256 acceleration, since I just saw like an hour ago that recent Intel CPUs have it. If your CPU has such instructions, by all means use it instead of software MD5, if all you need is data integrity.
Moving from MD5 to SHA256 means we just about triple the amount of time it takes to generate a hash.
But I suggest there are few applications where this speed improvement justifies the confusion I've seen in junior engineers who believe MD5 is okay because a. they use MD5SUM and b. Bruce Schneier said it was okay.
Don't get me wrong. I trust that //YOU// will know not to depend on an MD5 hash for anything where a bad guy can modify content over the wire. But... I'm going to go out on a limb and guess you're somewhat experienced. I worry about the kids who, without the benefit of experience, re-enact scenes from the cryptography edition of Lord of the Flies.
But... if you know what you're doing... sure... use MD5... There's certainly no way your code will ever be used by a less experienced engineer, right?
I tend to prefer b2sum (BLAKE2), as it's both fast and (at the time of writing this, AFAIK) secure. There are certainly situations where md5 is good enough, but yeah, I feel safer just using blake2 and not having to spend brain cycles thinking whether the hash needs to be cryptographically secure or not, and risk making the wrong judgement.
Tried with a 3.5GB Ubuntu .iso file; results are similar, though I also tried b3sum and that was even faster (just 1.8s vs. 19s for sha256sum and 6.7s for md5sum). So performance is definitely not an argument for md5.
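The same comparison can be run in pure Python (a rough sketch; results will vary with CPU, caching, and whether the tools use hardware SHA extensions):

    import hashlib, os, time

    data = os.urandom(256 * 1024 * 1024)  # 256 MiB of random data

    for name in ("md5", "sha1", "sha256", "blake2b"):
        h = hashlib.new(name)
        start = time.perf_counter()
        h.update(data)
        h.digest()
        print(f"{name}: {time.perf_counter() - start:.2f}s")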
Like everything else in the world, knowing your audience, intended use, etc plays a key role. If you're protecting data against state actors, then yes, MD5 is not good enough. If you're just trying to decide if a transfer of a file went across a network without munging the data, then MD5 is good enough. If you're trying to post that file onto something that is publicly facing and want to provide the person downloading the file that it has not been modified, MD5 is probably not the best method.
The point the OP made is that it's easy to construct a second preimage with MD5. So if a bad guy, even a bad guy with meager resources, modifies the hash or the data in transit, you won't be able to identify the change the MITM made to either the data or the hash while it was transmitted.
What you're probably thinking is a digital signature, which combines a hash and an asymmetric cipher (or a keyed hash if you keep that key secret.)
But even then, a bad actor can construct a second pre-image with MD5, which would cause the digital signature to validate as authentic even though the content has been changed.
And an adversary with meager computational resources could perform this hack. This is the OP's point. You don't have to be a state actor to do this.
Essentially, I was agreeing that MD5 is worthless for anything "secure" from 3rd party manipulation. However, it is fine for verification of non-adversarial purposes. It's much faster to run MD5 than SHA256 over a 60-120GB MOV file.
Copying camera original footage from the card that is still warm from the camera and making copies to client deliverable storage before telling camera dept that it is okay to wipe the card and reuse is a thing I have done a lot. Speed is important. Security is not a concern at all.
Copying a large MOV file from one storage pool to another, retrieving large media assets from tape archive, etc., are also common situations where you just want to know that the copy is actually what you think it is, free of I/O errors from whatever transfer process. In these situations, I have zero concern that someone wearing a black hat maliciously manipulated the data during the transfer.
The cost of the calculation of the hash is dwarfed by the cost of copying to/from an SD card (or even nvme devices.) If you hash the file as you transfer it, you'll not notice the difference between MD5 and SHA256.