
I'm curious about the 2021 figure for the total disk space Discord consumes. The servers I'm in share images every few minutes, which must add up pretty quickly.


Unique images or just copied from elsewhere?


How does that matter? Do they keep track of all images on the internet?


I don't know much about image de-duplication, but maybe they can compute some sort of fingerprint/hash for an image, check whether they already have it, and then serve the existing copy.

I'd imagine a hash like SHA256 would be tricky, because if the image was recompressed even once during its internet journey, we'd get a different hash. But maybe there is an effective way to fingerprint images. I have a utility on my machine (czkawka maybe?) that does really good image de-duplication with what seemed like a common algorithm (based on a quick look at the source).
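To make the difference concrete, here is a rough sketch comparing an exact hash with a simple perceptual one. It uses "average hash" (aHash), one well-known fingerprinting technique; I'm not claiming it is what czkawka or Discord actually use. The images are synthetic grayscale grids, and the "recompression" is just a made-up one-level brightness nudge on alternating pixels, so treat all of it as illustrative assumptions.

```python
import hashlib


def sha256_of_pixels(pixels):
    """Exact fingerprint: any changed byte yields a different digest."""
    data = bytes(value for row in pixels for value in row)
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels, size=8):
    """Perceptual fingerprint (aHash) of a grayscale image (rows of 0-255 ints).

    Shrinks the image to size x size cells by block averaging, then emits one
    bit per cell: 1 if the cell is brighter than the mean of all cells, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for cy in range(size):
        y0, y1 = cy * height // size, (cy + 1) * height // size
        for cx in range(size):
            x0, x1 = cx * width // size, (cx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical 32x32 test images (not real data).
original = [[x * 8 for x in range(32)] for y in range(32)]          # horizontal gradient
recompressed = [[original[y][x] + ((x + y) % 2) for x in range(32)]  # tiny pixel noise
                for y in range(32)]
unrelated = [[y * 8 for x in range(32)] for y in range(32)]          # vertical gradient

same_sha = sha256_of_pixels(original) == sha256_of_pixels(recompressed)
near_dup_distance = hamming_distance(average_hash(original), average_hash(recompressed))
unrelated_distance = hamming_distance(average_hash(original), average_hash(unrelated))

print("SHA-256 match after noise:", same_sha)
print("aHash distance, noisy copy:", near_dup_distance)
print("aHash distance, unrelated image:", unrelated_distance)
```

The idea is that a near-duplicate lands within a small Hamming distance of the original while the SHA-256 digests no longer match, so a real system would need a distance threshold rather than an equality check.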

No idea though, just spitballing.


I'd imagine Discord uses deduplication, but I bet it doesn't save them 5% storage space.


I think it might; spamming the same meme images over and over is quite common in some servers. On the other hand, the bigger pictures might overwhelm these in sheer size.


Yeah, that's why I assumed it wouldn't help that much. People re-upload 100kB memes all the time, but the bulk would probably be 5MB phone pictures that won't typically be re-uploaded.
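A back-of-the-envelope sketch of that reasoning, using the 100kB and 5MB sizes above. The upload mix and duplicate rate are numbers I invented purely for illustration, not anything measured.

```python
def dedup_savings(meme_share, meme_dup_rate, meme_kb=100, photo_kb=5000):
    """Fraction of total bytes saved if only duplicate memes are deduplicated.

    meme_share: fraction of uploads that are small memes (hypothetical).
    meme_dup_rate: fraction of meme uploads that are re-uploads (hypothetical).
    Phone pictures are assumed to never be re-uploaded.
    """
    meme_bytes = meme_share * meme_kb
    photo_bytes = (1 - meme_share) * photo_kb
    saved = meme_bytes * meme_dup_rate
    return saved / (meme_bytes + photo_bytes)


# Assumed mix: 80% of uploads are memes, and half of those are duplicates.
savings = dedup_savings(meme_share=0.8, meme_dup_rate=0.5)
print(f"storage saved: {savings:.1%}")
```

Even with memes making up most uploads by count in this made-up scenario, the savings stay under 5% of bytes, because the phone pictures dominate the total.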


The plural of anecdote isn't data, but about 20% of the images I post on Discord come from Discord in the first place, cross-posting among different servers.


Yes. There are ways to group images that seem to be the same. TinEye and Google image search do that. So you'd have a collection of related hashes that equal "Bob's prom photo where he looks like a goofer."


Yes, definitely, I have seen it work in action, but you can't just tell a user "here, use this smaller and more pixelated version of your image that we think is kind of similar".


Oof! No, you can't. :D


I'd imagine that the images are not part of the main database and instead live in some kind of S3-like file storage system.


The Cassandra cluster mentioned (12 nodes, 1TB each) only handles text, as far as the article goes.


We're well over 12 nodes in current year :P


Oh, just noticed the article is from 2017. Is there a newer one, related/similar to this one?


lol. :D



