At my last job, I occasionally ran into people worrying about SHA-256 collisions for non-maliciously created build artifacts, so I wrote an internal wiki entry with the back-of-the-envelope estimates.
If you're really paranoid, use getentropy() to seed AES in counter mode and generate 256-bit cryptographically pseudorandom IDs. Assume your system consumes 1 trillion (2^40) IDs per second for 1 trillion seconds (34,000+ years). The probability of a collision over that time frame is roughly 2^(-(256-80)/2) = 2^-88.
(Actually, in counter mode without throwing out any bits, the probability of collision is even slightly lower than this random oracle model suggests. This is true even if there are multiple independently seeded streams, as long as they're all seeded with high-entropy sources.)
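As a minimal sketch: instead of hand-rolling AES in counter mode, Python's stdlib `secrets` module draws from the same kernel CSPRNG that getentropy() exposes, which is sufficient for illustrating the idea (the function name `random_id` is just for this example):

```python
import secrets

def random_id() -> bytes:
    """A 256-bit cryptographically random ID from the OS CSPRNG."""
    return secrets.token_bytes(32)

a, b = random_id(), random_id()
# The chance that any given pair collides is about 2^-256.
assert a != b
```

A hand-seeded AES-CTR stream buys you the slightly-better-than-random-oracle collision behavior mentioned above; for most systems, the OS CSPRNG is the simpler and better-audited choice.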
Assume the probability of a life-threatening dinosaur being cloned in your lifetime is one in a billion (2^-30); if one is cloned, the probability of it escaping is one in a million (2^-20); and if it escapes, the chance that it enters your house and that checking for it saves your life is one in a billion (2^-30).
In this case, it's roughly 256 times more rational (assuming your death and the consequences of an ID collision are equally bad) to check under your bed for dinosaurs than to check for ID collisions.
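The arithmetic behind that ratio, spelled out using the probabilities assumed above:

```python
# Dinosaur scenario: cloned * escapes * enters your house and checking helps.
p_dino = 2**-30 * 2**-20 * 2**-30   # = 2^-80
# Collision estimate from the analysis above.
p_collision = 2**-88

ratio = p_dino / p_collision
print(ratio)  # 256.0, i.e. 2^(88-80) = 2^8
```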
Also, the probability of radioactive decay flipping the comparison result bit at the exact moment you compare your random 256-bit IDs is much, much higher than the probability of collision. So if you're paranoid enough to check for collisions, you should check multiple times.
Of course, the above analysis all hinges upon correct implementation and high-entropy seeds. These are the real weak points of using large random IDs, so audit and test your code early and often.
Carry out the above analysis with 122 bits of entropy for UUIDv4, substituting your actual system lifetime and expected consumption rate, and you'll likely find similar results.
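As a sketch for redoing the estimate with your own parameters, the standard birthday bound p ≈ n^2 / 2^(bits+1) can be evaluated directly. The UUIDv4 rate and lifetime below are illustrative stand-ins, not figures from the original; note that the birthday bound for the 256-bit case comes out even lower than the conservative 2^-88 ballpark above:

```python
from math import log2

def collision_probability(ids_per_second: float, seconds: float, bits: int) -> float:
    """Birthday-bound estimate: p ~ n^2 / 2^(bits+1) for n random IDs."""
    n = ids_per_second * seconds
    return n * n / 2 ** (bits + 1)

# 256-bit IDs at 2^40 per second for 2^40 seconds (the scenario above).
p256 = collision_probability(2**40, 2**40, 256)
# UUIDv4 (122 random bits) at an assumed 1e6 IDs/s over 30 years.
p122 = collision_probability(1e6, 30 * 365 * 24 * 3600, 122)

print(log2(p256))  # -97.0
print(log2(p122))
```

Plug in your actual consumption rate and system lifetime; as long as log2 of the result stays very negative, a collision check buys you nothing.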
My first rule is: if you don't need it (SHA, UUID, etc.), don't use it.
My second rule is: don't be a priest. If someone did it and it works, then it works.
Assuming that close-to-impossible collisions don't happen is a belief, not a mathematically proven fact. ;-) Such problems are also more complex than just one-dimensional collision math.
> Assuming that close-to-impossible collisions don't happen is a belief, not a mathematically proven fact.
I'm not assuming they're impossible. I'm estimating the probability, and rationally prioritizing risks based on probability and severity of impact, balanced against the real-world costs of using gigabytes of source code as a primary key vs a SHA-256 checksum.