> anonymising ... columns until the output is useful for applications where sensitive information cannot be exposed
This tool will not provide any significant amount of anonymity.
> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the result by the [constant] value
This is not random. It deterministically selects the same very predictable fraction of rows.
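To make the point concrete, here is a minimal sketch of hash-and-mod sampling. The tool's actual hash function isn't stated in the quoted description, so `zlib.crc32` is only a stand-in for "some 32-bit hash", and `keep_row` is a hypothetical name:

```python
import zlib

def keep_row(value: str, modulus: int) -> bool:
    """Keep the row when the 32-bit hash of `value` is divisible by `modulus`.

    Assumption: CRC32 stands in for whatever 32-bit hash the tool really uses.
    """
    return zlib.crc32(value.encode("utf-8")) % modulus == 0

# Hypothetical column values, generated only for the demonstration.
ids = [f"user-{i}" for i in range(10_000)]

first_pass = [v for v in ids if keep_row(v, 10)]
second_pass = [v for v in ids if keep_row(v, 10)]
shuffled_pass = {v for v in reversed(ids) if keep_row(v, 10)}

# Same rows every run, regardless of input order.
print(first_pass == second_pass, set(first_pass) == shuffled_pass)
```

Anyone who knows the hash and the constant can tell in advance whether a given value ends up in the sample.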
> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)
> Given a date, just keep the year
Partial postal codes and dates quantized to the year are still very revealing. Combined with other data (such as a hashed name), the partial postal code may allow a lot of people to be uniquely identified.
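A rough way to check this on your own data is to count how many rows share each combination of the "anonymised" columns. The records below are made up purely for illustration:

```python
from collections import Counter

# Made-up toy records: (outcode, birth year). Not real data.
records = [
    ("W1W", 1980), ("W1W", 1980), ("W1W", 1980),
    ("W1W", 1953),
    ("EC1A", 1991), ("EC1A", 1991),
    ("EC1A", 1947),
]

# Size of each group of rows that look identical after anonymisation.
group_sizes = Counter(records)

# Any combination that occurs exactly once points at a single person.
unique_rows = [combo for combo, n in group_sizes.items() if n == 1]
smallest_group = min(group_sizes.values())

print(unique_rows, smallest_group)
```

If the smallest group has size 1, at least one person is uniquely identifiable from the outcode and year alone, before any other column is joined in.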
> Hash (SHA1) the input
Hashing does not provide anonymity. Substituting a candidate key with the hash of the key is usually a 1-to-1 map that is often trivial to reverse. It isn't hard to iterate through e.g. all possible names, postal codes, license plates, or other short-ish strings to find a matching SHA1.
https://arstechnica.com/tech-policy/2014/06/poorly-anonymize...
The salt might provide some resistance to pre-computed tables, but a GeForce GTX 1080 Ti running hashcat can search for matching SHA1 at over 11 GH/s (giga-hashes per second). That means that a single 1080 Ti running for ~3-4 hours would not only discover that SHA1("hasselhof") == ffe3294fad149c2dd3579cb864a1aebb2201f38d; it would exhaustively search all lowercase strings of 10 characters or fewer.
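For anyone who wants to check the arithmetic, here is a small sketch. It brute-forces an unsalted SHA1 over short lowercase strings in pure Python (far slower than hashcat, so the demo stops at 4 characters) and then estimates the time for the full 10-character keyspace at the 11 GH/s figure quoted above. The function names are mine, not from any tool:

```python
import hashlib
from itertools import product
from string import ascii_lowercase

def sha1_hex(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def brute_force(target_hex: str, max_len: int):
    """Try every lowercase string up to max_len characters; return the match or None."""
    for length in range(1, max_len + 1):
        for chars in product(ascii_lowercase, repeat=length):
            candidate = "".join(chars)
            if sha1_hex(candidate) == target_hex:
                return candidate
    return None

# Pretend we only have the hash of a short name.
target = sha1_hex("anna")
recovered = brute_force(target, max_len=4)

# Keyspace of all lowercase strings of 1 to 10 characters.
keyspace = sum(26 ** n for n in range(1, 11))
hours = keyspace / 11e9 / 3600  # at 11 GH/s

print(recovered, keyspace, round(hours, 1))
```

The keyspace comes out at roughly 1.5e14 candidates, which at 11 GH/s is a bit under 4 hours.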
> range
This is the only feature that could provide anonymity, if it is used correctly to group large numbers of individuals into the same bucket. This is probably more difficult than it first appears.
Hey, one of the co-maintainers here. Thanks for your comments.
>> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the result by the [constant] value
> This is not random. It deterministically selects the same very predictable fraction of rows.
Yep, you are right. We didn't intend the sampling function to be part of the anonymisation; it's just something we tend to use, and we thought it would be useful to have.
Its objective is to pick a portion of the input data. No more.
>> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)
>> Given a date, just keep the year
> Partial postal codes and dates quantized to the year are still very revealing. Combined with other data (such as a hashed name), the partial postal code may allow a lot of people to be uniquely identified.
You are absolutely right. Depending on the use case and your data, having the outcode, the city or the year might be very revealing. In some other cases even having decades or centuries might be revealing.
We don't pretend that each function provided applies to all use cases. But in certain use cases partial postcodes or years can be good enough.
>> Hash (SHA1) the input
> Hashing does not provide anonymity.
We are very aware of that. That's why we offer the option to add a salt (that the user of the tool can make as long as possible and throw away after the anonymisation process).
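As a rough sketch of that workflow: the salt is generated once, used for the whole run, and never written to disk. How the tool actually combines salt and value isn't described here, so the simple prefix concatenation and the `salted_sha1` name are assumptions:

```python
import hashlib
import secrets

def salted_sha1(value: str, salt: bytes) -> str:
    """SHA1 over salt + value. Prefixing the salt is an assumption, not the tool's documented behaviour."""
    return hashlib.sha1(salt + value.encode("utf-8")).hexdigest()

# One long random salt per anonymisation run, kept only in memory.
salt = secrets.token_bytes(32)

a = salted_sha1("hasselhof", salt)
b = salted_sha1("hasselhof", salt)
other_run = salted_sha1("hasselhof", secrets.token_bytes(32))

# Same value, same salt: same output, so rows stay joinable within one run.
# A different salt gives an unrelated output.
print(a == b, a == other_run)

# Throw the salt away once the run is finished.
del salt
```

Note that equal inputs still map to equal outputs within a run, so the hashed column remains a stable key, and the protection holds only as long as the salt really is discarded.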
>> range
> This is the only feature that could provide anonymity, if it is used correctly to group large numbers of individuals into the same bucket. This is probably more difficult than it first appears.
We usually work with data sets of tens of millions of users. Choosing the right ranges and, especially, analysing the data and making sure you anonymise the outliers (by choosing your bottom and top ranges carefully) is crucial.
Again, this tool is a hammer. We expect a person who understands wood and nails to analyse their problem and use it.
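A minimal sketch of what that looks like in practice, with open-ended bottom and top buckets for the outliers and a check on the smallest bucket. The boundaries, the `bucket_age` helper and the ages are made up for illustration and are not the tool's API:

```python
from collections import Counter

def bucket_age(age: int, low: int = 18, high: int = 65, width: int = 10) -> str:
    """Map an age to a range label; values outside [low, high) fall into open-ended buckets."""
    if age < low:
        return f"<{low}"
    if age >= high:
        return f"{high}+"
    start = low + ((age - low) // width) * width
    end = min(start + width, high) - 1
    return f"{start}-{end}"

# Made-up ages, including outliers at both ends.
ages = [16, 19, 23, 27, 34, 38, 41, 45, 52, 58, 64, 71, 103]

bucket_counts = Counter(bucket_age(a) for a in ages)
smallest = min(bucket_counts.values())

print(bucket_counts)
print("smallest bucket:", smallest)
```

On this toy input some buckets hold a single person, which is exactly the case you want to catch before releasing anything: widen the ranges or move the top and bottom cut-offs until every bucket is large enough.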