
> replaces every string in the customer's database with a cryptographic hash

> So unless your city is named "success", it's going to be missing from the dataset.

You've made the typical mistake of thinking that just because you've made it harder, you've also made it anonymous.

For instance, you might have replaced "Chicago" with 9cfa1e69f507d007a516eb3e9f5074e2, but if a lot of people with that tokenised city name also have transactions for a store that only exists in Chicago, you can infer the reverse mapping with a reasonable degree of accuracy. A couple of such data points, and you can be almost certain.
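To make the co-occurrence argument concrete, here is a minimal sketch. The store names, the `SINGLE_CITY_STORES` table, and the `(city_token, store)` row layout are all made up for illustration; the only assumption that matters is that the attacker knows some stores that exist in just one city.

```python
from collections import Counter, defaultdict

# Hypothetical side knowledge: stores assumed to exist in only one city.
SINGLE_CITY_STORES = {"LakeshoreDeli": "Chicago", "PugetBooks": "Seattle"}

def infer_city_tokens(rows, store_city=SINGLE_CITY_STORES, min_votes=2):
    """Guess which real city each opaque city token stands for.

    rows: iterable of (city_token, store_name) pairs from the "anonymized" data.
    Returns {token: (guessed_city, share_of_votes)} for tokens with enough evidence.
    """
    votes = defaultdict(Counter)
    for city_token, store in rows:
        city = store_city.get(store)
        if city is not None:  # only single-city stores carry information
            votes[city_token][city] += 1

    guesses = {}
    for token, counter in votes.items():
        city, count = counter.most_common(1)[0]
        if count >= min_votes:
            guesses[token] = (city, count / sum(counter.values()))
    return guesses
```

Note that the hash function never enters into it: the token is treated as an opaque label, which is why a stronger hash does not help here.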

If you know when you made a couple of transactions that are in the database, you can find a set of likely candidates for each time and then see what fields they have in common. Once you have identified your own data, you can infer a whole load of the reverse mappings for various fields.
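A rough sketch of that "find your own rows" step, assuming each row is a dict with a `ts` field (epoch seconds) left in the clear and every other field tokenised. The field names and the 60-second window are assumptions for illustration, not anything from the product being discussed.

```python
def common_tokens(rows, known_times, window=60):
    """Return the (field, token) pairs that appear near every known timestamp.

    rows: list of dicts with a numeric 'ts' key plus tokenised fields.
    known_times: timestamps of transactions the attacker knows they made.
    """
    common = None
    for t in known_times:
        seen = set()
        for row in rows:
            if abs(row["ts"] - t) <= window:  # candidate row for this time
                seen.update((k, v) for k, v in row.items() if k != "ts")
        # keep only the pairs that show up for every known time so far
        common = seen if common is None else common & seen
    return common or set()
```

Whatever survives the intersection is probably you (your user token, your card token, and so on), and each extra known transaction shrinks the candidate set further.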

If those transactions involve another user, you can start to correlate those mappings with data you know about them, and start to build up a web of that person's transactions even if you don't yet know who the other users they interact with are.

All this is possible without the use of rainbow tables, but chances are your hash function is a standard one, so with a single known mapping you can work out which hash function you chose, and look for speculative entries in the data. For example, just look for MD5("Chicago"), SHA1("Chicago"), SHA256("Chicago") and see if there are any matches. If there are, we can use that hash function and trivially create a rainbow table for every city in the US, first names, surnames, etc.
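A minimal sketch of that check using Python's standard `hashlib`. It assumes the tokens are plain unsalted hex digests of the UTF-8 string, and the candidate algorithm list is just a guess at the usual suspects. Strictly speaking the second function builds a plain lookup table rather than a true rainbow table, which is all you need for value sets this small.

```python
import hashlib

# Assumed shortlist of "standard" hash functions worth trying.
CANDIDATE_ALGOS = ("md5", "sha1", "sha256")

def detect_algo(token, known_plaintext, algos=CANDIDATE_ALGOS):
    """Return the name of the algorithm that maps known_plaintext to token, or None."""
    data = known_plaintext.encode("utf-8")
    for name in algos:
        if hashlib.new(name, data).hexdigest() == token.lower():
            return name
    return None

def build_lookup(algo, candidates):
    """Precompute digest -> plaintext for a list of guessable values."""
    return {hashlib.new(algo, c.encode("utf-8")).hexdigest(): c for c in candidates}
```

With one confirmed mapping, `detect_algo` tells you the function, and `build_lookup` over a list of US cities, first names, or surnames reverses most of the column in one pass.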



I think of this as a graph of interconnected data and metadata. Unless the entire graph is anonymized, it's not really anonymized.

  Deducing relationships between metadata elements (city field, and purchase store) ends up being the tricky part, and highly domain specific.
  Hashes with salts make it a bit harder too.


Let’s at least grant the benefit of the doubt, that the poster knew to salt the hash. We can take it as given that incompetence makes for poor anonymization.

Your example of “transactions in Chicago” is much more salient; there’s clearly a cat-and-mouse dynamic where data can be de-anonymized, especially if the dataset is public. How much that will actually be possible will be specific to the data in question; but the risk is non-zero. There’s certainly a case that no amount of obfuscation is sufficient if a user has not explicitly consented to their data being used this way.


> Let’s at least grant the benefit of the doubt, that the poster knew to salt the hash. We can take it as given that incompetence makes for poor anonymization.

Actually let's not, because salting doesn't get it right either. Rather, it's just another easily broken system that is only good enough to fool its own designer. It's still trivial to run through a list of the most common city names and recover nearly all of the entries. And if there is just a single "salt" per DB, which would be necessary for the apparent requirement that matching city names stay matching, even cycling through all combinations of letters is nearly practical. There just isn't enough starting entropy to make hashing meaningful.
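As a sketch of why a single per-DB salt buys so little: this assumes the tokens are SHA-256(salt + value) and that the attacker has got hold of the salt (it has to live somewhere the tokenising code can read it). Both the construction and the function names are assumptions for illustration, not what the product actually does.

```python
import hashlib

def salted_token(salt: bytes, value: str) -> str:
    """Assumed tokenisation scheme: hex SHA-256 of salt followed by the value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

def dictionary_attack(tokens, salt, wordlist):
    """Recover plaintexts for any tokens whose value appears in wordlist."""
    # One salt for the whole DB means one precomputed table covers every row.
    table = {salted_token(salt, w): w for w in wordlist}
    return {t: table[t] for t in tokens if t in table}

def brute_force_space(max_len, alphabet_size=26):
    """Number of strings of length 1..max_len over the given alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_len + 1))
```

`brute_force_space(8)` is on the order of 2 x 10^11 candidates for lowercase-only names up to eight letters, which is the scale behind "nearly practical"; a list of real city names is vastly smaller than that.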



