> anonymising ... columns until the output is useful for applications where sens...

xomateix · on May 24, 2018

Hey, one of the co-maintainers here. Thanks for your comments.

>> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the result by the [constant] value

> This is not random. It deterministically selects the same very predictable fraction of rows.

Yep, you are right. We didn't intend the sampling function to be part of the anonymisation; it's just something we tend to use ourselves and thought would be useful to include.

Its objective is to pick a portion of the input data. No more.
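To illustrate the point being conceded here, a minimal sketch of hash-and-mod sampling. It assumes `zlib.crc32` as a stand-in 32-bit hash; the hash the tool actually uses is not stated in this thread, and `keep_row` is a hypothetical name.

```python
import zlib


def keep_row(key: str, mod: int) -> bool:
    """Keep a row when the 32-bit hash of its key is divisible by `mod`."""
    # Assumption: crc32 is only a placeholder for "some 32-bit hash".
    h = zlib.crc32(key.encode("utf-8"))  # unsigned 32-bit int in Python 3
    return h % mod == 0


rows = [f"user-{i}" for i in range(10_000)]
sample = [r for r in rows if keep_row(r, 10)]

# A second pass selects exactly the same rows: this picks a fixed
# portion of the input, it does not draw a random sample.
resample = [r for r in rows if keep_row(r, 10)]
print(sample == resample)
```

Because the selection depends only on the key and the constant, anyone who knows both can predict which rows end up in the output.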

>> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)

>> Given a date, just keep the year

> Partial postal codes and dates quantized to the year are still very revealing. Combined with other data (such as a hashed name), the partial postal code may allow a lot of people to be uniquely identified.

You are absolutely right. Depending on the use case and your data, having the outcode, the city or the year might be very revealing. In some other cases even having decades or centuries might be revealing.

We don't claim that every function provided suits every use case. But in certain use cases partial postcodes or years can be good enough.
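A rough sketch of the two reductions discussed above, plus a quick way to see how revealing the result still is for a given data set. The function names and sample records are made up for illustration and are not the tool's API.

```python
from collections import Counter
from datetime import date


def outcode(postcode: str) -> str:
    """Return the outward part of a UK postcode, e.g. 'W1W 8BE' -> 'W1W'."""
    compact = postcode.strip().upper().replace(" ", "")
    # The inward code is always the last three characters,
    # so this also works when the space is missing.
    return compact[:-3]


def year_only(d: date) -> int:
    """Quantise a date to its year."""
    return d.year


# Hypothetical records: (postcode, date of birth)
records = [
    ("W1W 8BE", date(1980, 5, 1)),
    ("W1W 6XB", date(1980, 11, 3)),
    ("EC1A 1BB", date(1975, 1, 9)),
]

# Count how many records share each (outcode, year) combination.
groups = Counter((outcode(p), year_only(d)) for p, d in records)

# Any combination seen only once still points at a single person.
unique = [g for g, n in groups.items() if n == 1]
print(unique)
```

Whether "good enough" holds depends on how many of those combinations have very small counts in your own data.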

>> Hash (SHA1) the input

> Hashing does not provide anonymity.

We are very aware of that. That's why we offer the option to add a salt (that the user of the tool can make as long as possible and throw away after the anonymisation process).
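A minimal sketch of what salting and then discarding the salt could look like, using Python's standard library. How the tool combines the salt with the input is not described in this thread, so prefixing the salt is an assumption, and `salted_sha1` is a hypothetical name.

```python
import hashlib
import secrets


def salted_sha1(value: str, salt: bytes) -> str:
    """SHA1 of salt + value, hex encoded (prefixing the salt is an assumption)."""
    return hashlib.sha1(salt + value.encode("utf-8")).hexdigest()


# One random salt for the whole run, so equal inputs still map to equal outputs.
salt = secrets.token_bytes(32)

a = salted_sha1("alice@example.com", salt)
b = salted_sha1("alice@example.com", salt)
print(a == b)

# Once the run is finished the salt is dropped and never stored. Without it,
# hashing a list of candidate values no longer reproduces the outputs.
del salt
```

Equal inputs still produce equal outputs within one run, so the hashed column stays usable as a join key but can still be combined with other columns.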

>> range

> This is the only feature that could provide anonymity, if it is used correctly to group large numbers of individuals into the same bucket. This is probably more difficult than it first appears.

We usually work with data sets covering tens of millions of users. Choosing the right ranges and, especially, analysing the data and making sure you anonymise the outliers (by choosing your bottom and top ranges carefully) is crucial.
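A small sketch of range bucketing with open-ended bottom and top ranges, so that outliers fall into a shared bucket. The edges, labels and `to_range` helper are illustrative assumptions, not the tool's configuration format.

```python
import bisect
from collections import Counter


def to_range(value: int, edges: list[int]) -> str:
    """Map a value to a range label; `edges` must be sorted ascending."""
    i = bisect.bisect_right(edges, value)
    if i == 0:
        return f"<{edges[0]}"        # open-ended bottom range
    if i == len(edges):
        return f">={edges[-1]}"      # open-ended top range
    return f"[{edges[i - 1]}, {edges[i]})"


edges = [18, 30, 45, 65]
ages = [17, 18, 25, 44, 70, 102]

labels = [to_range(a, edges) for a in ages]
sizes = Counter(labels)

# Small buckets are the ones to worry about: with too few people in a
# range, the range itself becomes identifying.
print(sizes)
```

Here the 102-year-old lands in the same `>=65` bucket as the 70-year-old instead of standing out. Checking the bucket sizes against the real data is what tells you whether the edges were chosen well.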

Again, this tool is a hammer. We expect a person who understands wood and nails to analyse their problem and use it.

tomc1985 · on May 24, 2018

so why not hash with a very large salt and then throw away the salt?