I'm curious about the 2021 figure for total disk space that Discord consumes. Servers that I'm in share images every few minutes, which must add up pretty quickly.
I don't know much about image de-duplication, but maybe they can compute some sort of fingerprint/hash for an image, check whether they already have it, and then serve the existing copy.
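A rough sketch of that idea, assuming exact byte-for-byte matching with SHA-256 as the fingerprint. `DedupStore` and its methods are made-up names for illustration, not anything Discord is known to use:

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}  # hex digest -> stored bytes
        self.refs = {}   # hex digest -> number of uploads pointing at it

    def upload(self, data: bytes) -> str:
        # The digest of the content doubles as the storage key.
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data  # first time we see these bytes
        self.refs[key] = self.refs.get(key, 0) + 1
        return key

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
meme = b"\x89PNG pretend these are meme bytes"
first = store.upload(meme)
second = store.upload(meme)           # exact re-upload: no new storage
third = store.upload(meme + b"\x00")  # one extra byte: treated as a new image
print(first == second, first == third, len(store.blobs))
```

This only catches uploads that are identical down to the last byte, so anything re-encoded along the way counts as a new image.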
I'd imagine a hash like SHA256 would be tricky, because if the image was recompressed even once during its internet journey, we'd get a different hash. But maybe there is an effective way to fingerprint images. I have a utility on my machine (czkawka maybe?) that does really good image de-duplication with what seemed like a common algorithm (based on a quick look at the source).
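To make the difference concrete, here is a small sketch comparing SHA-256 with a difference hash (dHash), which is one common perceptual hash. I'm not claiming it is the algorithm czkawka uses. It assumes the image has already been shrunk to a 9x8 grayscale grid, and the "recompression" is faked by nudging every pixel by one brightness level:

```python
import hashlib


def dhash(pixels):
    """Difference hash of a grayscale grid (rows of ints in 0..255).

    Each bit records whether a pixel is brighter than its right neighbour,
    so a 9-wide, 8-tall grid yields a 64-bit fingerprint.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def sha256_of(pixels):
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()


# Synthetic 9x8 "thumbnail" standing in for a downscaled real image.
original = [[(x * 37 + y * 91) % 256 for x in range(9)] for y in range(8)]
# Fake recompression: every pixel gets slightly brighter (clamped at 255).
recompressed = [[min(255, p + 1) for p in row] for row in original]

print(sha256_of(original) == sha256_of(recompressed))   # exact hashes differ
print(hamming(dhash(original), dhash(recompressed)))    # perceptual distance
```

Real recompression is messier than a uniform brightness shift, so in practice you'd accept a small Hamming distance rather than require an exact fingerprint match.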
I think it might; spamming the same meme images over and over is quite common in some servers. On the other hand, the bigger pictures might overwhelm these just in size.
Yeah, that's why I assumed it wouldn't help that much. People re-upload 100kB memes all the time, but the bulk would probably be 5MB phone pictures that won't typically be re-uploaded.
The plural of anecdote isn't data, but about 20% of the images I post on Discord come from Discord in the first place, cross-posting among different servers.
Yes. There are ways to group images that seem to be the same. TinEye and Google image search do that. So you'd have a collection of related hashes that equal "Bob's prom photo where he looks like a goofer."
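A toy version of that grouping, assuming each image already has a 64-bit perceptual fingerprint and that "related" means within a few differing bits. The fingerprint values, file names, and the threshold of 5 are all invented for illustration, and this says nothing about how TinEye or Google actually do it:

```python
def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def group_similar(fingerprints, max_distance=5):
    """Greedy grouping: each image joins the first group whose
    representative fingerprint is within max_distance bits."""
    groups = []  # list of (representative fingerprint, [image names])
    for name, fp in fingerprints.items():
        for rep, members in groups:
            if hamming(rep, fp) <= max_distance:
                members.append(name)
                break
        else:
            groups.append((fp, [name]))  # no close group: start a new one
    return [members for _, members in groups]


uploads = {
    "prom_original.jpg": 0xF0F0F0F0F0F0F0F0,
    "prom_recompressed.jpg": 0xF0F0F0F0F0F0F0F1,  # 1 bit away from original
    "prom_thumbnail.jpg": 0xF0F0F0F0F0F0F0F3,     # 2 bits away from original
    "cat.png": 0x0F0F0F0F0F0F0F0F,                # all 64 bits differ
}
print(group_similar(uploads))
```

Greedy grouping like this depends on upload order and compares against every existing group, so a real system would need a proper nearest-neighbour index.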
Yes, definitely, I have seen it work in action, but you can't just tell a user "here, use this smaller and more pixelated version of your image that we think is kind of similar".