
That's not quite true. E.g. Google's email spam models are trained without access to the raw text. In general the field of differential privacy is developing rather quickly.



You need to think of medical privacy the way you think of browser fingerprinting: it's not any single piece of information that lets you be tracked, but rather the combination of a bunch of them -- even when "anonymized" -- that adds up to enough bits to pin a consistent identity on the people involved.

And given what the adtech space has been able to figure out pretty easily in terms of tracking even people who take serious steps to avoid it, you should not think that there is any reasonable amount of "anonymization" that can make otherwise-useful medical details safe to release.
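
To make the "enough bits" framing concrete, here is a rough back-of-the-envelope sketch. The cardinalities below are illustrative round numbers, not measured figures, but the intuition matches classic re-identification results (ZIP code, birth date, and sex alone identify most of the US population):

    import math

    # ~33 bits are enough to single out one person among roughly 8 billion.
    world_population = 8_000_000_000
    bits_to_identify_anyone = math.log2(world_population)

    # Illustrative cardinalities (round numbers, not measured values) for a
    # few fields that often survive "anonymization".
    attributes = {
        "5-digit ZIP code": 42_000,
        "date of birth": 36_500,   # ~100 years of possible birth dates
        "sex": 2,
    }

    bits_leaked = sum(math.log2(n) for n in attributes.values())
    print(f"bits needed to pin down one person: {bits_to_identify_anyone:.1f}")
    print(f"bits carried by these three fields: {bits_leaked:.1f}")

Three innocuous-looking fields already land close to the ~33 bits needed to single out one person on Earth; every extra "anonymized" medical detail only adds more.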


I'm somewhat confused about how they would be doing that. Do you have any references to blogs/papers on this technique?


One typical strategy for spam detection is to convert text to a "bag of words" representation[0]. If you take this bag-of-words representation and hash all the values, then rather than training on words like ED, you are training on word number 19213123. The number of these hashed buckets is smaller than the number of distinct words, just like a hash table, and this generally doesn't hurt the accuracy of the model much. When you receive feedback on the classification (from people reporting spam or marking things as not spam), you just turn the reported email into a bag of hashed words and feed that update into your model.

Because the order of the entries in the bag of words is arbitrary, and the words have been hashed, it is impossible to go back from a bag-of-words representation to the original email. I don't know if this is what Google does, but it is a pretty standard approach.

0: https://en.wikipedia.org/wiki/Bag-of-words_model
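
A minimal sketch of that hashing trick, assuming a whitespace tokenizer and a made-up bucket count (the model only ever sees bucket numbers, never words):

    import hashlib
    from collections import Counter

    NUM_BUCKETS = 2 ** 18  # far fewer buckets than distinct words, like a hash table

    def hashed_bag_of_words(text):
        """Map an email body to counts over hashed word IDs; the model only
        ever sees bucket numbers like 19213123, never the words themselves."""
        counts = Counter()
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
            counts[int(digest, 16) % NUM_BUCKETS] += 1
        return counts

    # Feedback loop: a user reports an email as spam, and only this hashed
    # representation is fed back into the spam model as a training example.
    print(hashed_bag_of_words("limited time offer on ED meds"))

In practice you'd reach for a library implementation such as scikit-learn's HashingVectorizer, which does the same thing with a faster non-cryptographic hash.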


This is actually a really tricky topic. Things that sound like they should give very good security often don't in practice. The "hashed bag of words" technique that you describe here sounds an awful lot like some recent attempts at letting legacy systems search over encrypted data [1,2].

We took a look at this recently [3], and it turns out that mapping the word numbers back to the original words is a lot more doable than you'd think.

[1] ShadowCrypt: Encrypted Web Applications for Everyone http://dl.acm.org/citation.cfm?doid=2660267.2660326

[2] Mimesis Aegis: A Mimicry Privacy Shield–A System’s Approach to Data Privacy on Public Cloud https://www.usenix.org/conference/usenixsecurity14/technical...

[3] The Shadow Nemesis: Inference Attacks on Efficiently Deployable, Efficiently Searchable Encryption https://www.sigsac.org/ccs/CCS2016/agenda/
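
As a toy illustration of why the mapping is recoverable (this is just the obvious dictionary-style approach, not the inference attack from [3]): if the attacker knows or can guess the hash function and bucket count, they can hash an ordinary word list once and invert the "anonymized" feature IDs directly.

    import hashlib
    from collections import defaultdict

    NUM_BUCKETS = 2 ** 18  # assumed to match the target system

    def bucket(word):
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    # Hash every word in an ordinary word list once (a real attacker would use
    # something like /usr/share/dict/words plus common spam vocabulary)...
    wordlist = ["viagra", "invoice", "meeting", "lottery", "unsubscribe"]
    reverse = defaultdict(list)
    for w in wordlist:
        reverse[bucket(w)].append(w)

    # ...and the "anonymous" feature IDs map straight back to candidate words.
    for feature_id in [bucket("viagra"), bucket("invoice")]:
        print(feature_id, "->", reverse[feature_id])

Hash collisions only mean each ID maps to a short list of candidates, and context (co-occurring words, frequencies) usually narrows that down further.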


Yeah, I definitely don't think this gives you mathematically provable security, especially if you include n-grams, which -- combined with a language model -- would let you chain words back into sentences. How much that matters in practice depends on how aggressive the dimensionality reduction is. In Google's case I doubt this is even much of a specific goal, since no matter what they train on, they still have access to the source text if they want it.


Netflix knows your approximate age, gender, and where you live based on your movie likes alone. I suspect the comment you replied to was making the point that Big Data is making privacy impossible. See here for an example: https://datafloq.com/read/re-identifying-anonymous-people-wi...



