De-identification of data sets (like cryptography) is a very difficult problem.
It is great that people are building tools for this. Even if I were skeptical of one or another in particular, the availability of tools popularizes the discussion of what is necessary and sufficient for de-identifying data.
The main use case I worked on was how to test an event-driven (SOA at the time) pipeline without production data. Health information handling is very tightly regulated, so generating a test data set large enough to reflect the needs of the system was a significant challenge. Engineers couldn't just copy some production data and use it for testing. The regime I worked in that defined these rules (early PHIPA and PIPEDA in Ontario) is not unlike what people may encounter with GDPR.
When I was doing this sort of work, I found that it made more sense to find the structure of the data, then synthesize it from scratch. For a data format like HL7, this is non-trivial.
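To make that concrete, here's a toy sketch of what I mean by synthesizing from structure (the field names and value pools are invented for illustration; real HL7 synthesis is far more involved):

    // Emit synthetic rows that match the shape of the real data,
    // without ever touching production records.
    package main

    import (
        "fmt"
        "math/rand"
    )

    var firstNames = []string{"Alice", "Bob", "Chen", "Dipa"}
    var outcodes = []string{"W1W", "SW1A", "M1", "EH1"}

    func main() {
        for i := 0; i < 5; i++ {
            name := firstNames[rand.Intn(len(firstNames))]
            outcode := outcodes[rand.Intn(len(outcodes))]
            year := 1940 + rand.Intn(80)
            fmt.Printf("%d,%s,%s,%d\n", i, name, outcode, year)
        }
    }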
Synthesizing a few gigabytes of json/xml/text from a small training corpus provides incomplete test data. There are a few companies in the de-identification business, and I remember a few consulting services for it.
I can think of a few ways to do this, and they aren't simple.
How does this tool compare to other (libre) anonymization software, such as ARX [1]? From what I understand, only basic routines to sample records and coarsen a few attributes (e.g. ZIP codes, dates) are implemented so far.
This might also not be sufficient to truly anonymize data, as a large body of research has shown [2,3,4].
Hey! I'm one of the co-maintainers of the project here.
What you see today in this project is really a means to scratch an itch we had - mainly to quickly and easily sample/obfuscate some delimited data in a way that is "good enough" for demonstrating a visualisation tool without using the original dataset. It's important to note that we intend to use this data within a secure environment.
This tool is absolutely not up to the task of anonymising a dataset in such a way that it can be made public. For us, it's about risk management vs effort: from a security perspective, there are scenarios where we can use samples of data that have gone through this process and substantially decrease the risk of holding data in multiple places, without significant effort. If we were to go on to make any of these datasets public, we'd be looking for a better-suited tool.
As a result, tools like ARX are not something we really want to compete with - they're aiming for a complete solution whereby the results are good enough to potentially make public. Whether that goal is even achievable is debatable given the research you linked, but some people might be comfortable with those risks.
One thing we've done to bridge the gap a bit is to make it really easy to add new functions as we need them, and I think we can get to a point where this tool is good enough for a good portion of use cases (for example, making datasets for a development environment that are representative, but a manageable size and anonymised to a reasonable degree).
We'll also try to add something to the README addressing this exact question, as it's one I anticipate we're going to get asked a lot. Thanks for the constructive line of questioning - it really will help us, and the people who choose to use this tool, make a decision that's right for them and their use cases.
I would recommend you make this clearer in the readme; reading the documentation didn't leave me with the impression that the tool was intended for such limited scenarios and scope.
In addition to what Nathan has said, I'd add that we needed a simple native command-line tool that could be dropped onto any server and work easily alongside other unix tools like cat, gzip, cut...
The intent behind this tool seems good, but I don't think it's a good idea. To actually anonymize data requires semantic understanding of that data and an understanding of what sort of data, harmless by itself, is transmuted into identifying data when provided in the context of other otherwise harmless data.
This tool doesn't help you with any of that. It seems to be a glorified awk script. My concern is that helping the user with the easiest part of anonymizing data stands to encourage the user to go full steam ahead without slowing down to stop and think very carefully about what they're doing.
Hey! I'm one of the co-maintainers of the project here. I've posted a very similar reply to a very similar comment below at [1], but to recap the main points:
We absolutely agree this tool only solves the easiest part of anonymising data; internally we rely on our team of data scientists to do the difficult parts. This tool is absolutely not up to the task of anonymising a dataset in such a way that it can be made public. For us, it's about risk management vs effort: from a security perspective, there are scenarios where we can use samples of data that have gone through this process and substantially decrease the risk of holding data internally in multiple places, without significant effort. If we were to go on to make any of these datasets public, we'd be looking for a better-suited tool (eg. ARX [2]).
Regarding one part of your comment:
> My concern is that helping the user with the easiest part of anonymizing data stands to encourage the user to go full steam ahead without slowing down to stop and think very carefully about what they're doing.
We're going to try to add something to the README addressing this exact concern from both of you, as it's one I anticipate we're going to get asked a lot - and one that carries risk if it's not made obvious from the outset. Thanks for the constructive line of questioning - it really will help us, and the people who choose to use this tool, make a decision that's right for them and their use cases.
> anonymising ... columns until the output is useful for applications where sensitive information cannot be exposed
This tool will not provide any significant amount of anonymity.
> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the result by the [constant] value
This is not random. It deterministically selects the same very predictable fraction of rows.
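A sketch of my reading of that description (an assumption about the mechanism, not the tool's actual code), using a 32-bit FNV hash:

    // keep returns true for roughly 1 in n rows - but always the
    // same rows, because the hash of a given value never changes.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    func keep(column string, n uint32) bool {
        h := fnv.New32a()
        h.Write([]byte(column))
        return h.Sum32()%n == 0
    }

    func main() {
        for _, id := range []string{"user-1", "user-2", "user-3", "user-4"} {
            fmt.Println(id, keep(id, 2)) // identical output on every run
        }
    }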
> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)
> Given a date, just keep the year
Partial postal codes and dates quantized to the year are still very revealing. Combined with other data (such as a hashed name), the partial postal code may allow a lot of people to be uniquely identified.
> Hash (SHA1) the input
Hashing does not provide anonymity. Substituting a candidate key with the hash of the key is usually a 1-to-1 map that is often trivial to reverse. It isn't hard to iterate through e.g. all possible names, postal codes, license plates, or other short-ish strings to find a matching SHA1.
The salt might provide some resistance to pre-computed tables, but a GeForce GTX 1080 Ti running hashcat can search for matching SHA1 at over 11 GH/s (giga-hashes per second). That means a single 1080 Ti running for ~3-4 hours would not only discover that SHA1("hasselhof") == ffe3294fad149c2dd3579cb864a1aebb2201f38d; it would exhaustively search all lowercase strings of 10 characters or fewer.
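To illustrate how little machinery that attack needs (a toy candidate list stands in for the exhaustive search):

    // Reverse an unsalted SHA1 by guessing: hash each candidate and
    // compare against the leaked digest.
    package main

    import (
        "crypto/sha1"
        "encoding/hex"
        "fmt"
    )

    func main() {
        target := "ffe3294fad149c2dd3579cb864a1aebb2201f38d" // SHA1("hasselhof")
        for _, guess := range []string{"knightrider", "baywatch", "hasselhof"} {
            sum := sha1.Sum([]byte(guess))
            if hex.EncodeToString(sum[:]) == target {
                fmt.Println("recovered:", guess)
            }
        }
    }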
> range
This is the only feature that could provide anonymity, if it is used correctly to group large numbers of individuals into the same bucket. This is probably more difficult than it first appears.
Hey, one of the co-maintainers here. Thanks for your comments.
>> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the result by the [constant] value
> This is not random. It deterministically selects the same very predictable fraction of rows.
Yep, you are right. We didn't intend the sampling function to be part of the anonymisation; it's just something we tend to use, and we thought it would be useful to have.
Its objective is to pick a portion of the input data. No more.
>> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)
>> Given a date, just keep the year
> Partial postal codes and dates quantized to the year are still very revealing. Combined with other data (such as a hashed name), the partial postal code may allow a lot of people to be uniquely identified.
You are absolutely right. Depending on the use case and your data, having the outcode, the city or the year might be very revealing. In some other cases even having decades or centuries might be revealing.
We don't pretend that each function provided applies to all use cases. But in certain use cases partial postcodes or years can be good enough.
>> Hash (SHA1) the input
> Hashing does not provide anonymity.
We are very aware of that. That's why we offer the option to add a salt (that the user of the tool can make as long as possible and throw away after the anonymisation process).
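Roughly the idea, as a sketch (how the salt is mixed in here is illustrative, not necessarily what the tool does; an HMAC would be another option):

    // Hash the value together with a long, random, throwaway salt.
    // Once the salt is discarded, a guess can only be checked by
    // brute-forcing the salt space as well.
    package main

    import (
        "crypto/sha1"
        "encoding/hex"
        "fmt"
    )

    func saltedHash(salt, value string) string {
        sum := sha1.Sum([]byte(salt + value))
        return hex.EncodeToString(sum[:])
    }

    func main() {
        fmt.Println(saltedHash("a-long-random-throwaway-salt", "hasselhof"))
    }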
>> range
> This is the only feature that could provide anonymity, if it is used correctly to group large numbers of individuals into the same bucket. This is probably more difficult that it first appears.
We usually work with data sets covering tens of millions of users. Choosing the right ranges and, especially, analysing the data and making sure you anonymise the outliers (by choosing your bottom and top ranges carefully) is crucial.
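As a sketch of what I mean by choosing the bottom and top ranges carefully (the bucket boundaries here are invented for illustration):

    // Bucket ages into decades, folding outliers at both ends into
    // wide terminal buckets so rare extreme values don't single
    // anyone out.
    package main

    import "fmt"

    func ageBucket(age int) string {
        switch {
        case age < 18:
            return "<18" // bottom bucket absorbs young outliers
        case age >= 80:
            return "80+" // top bucket absorbs old outliers
        default:
            lo := (age / 10) * 10
            return fmt.Sprintf("%d-%d", lo, lo+9)
        }
    }

    func main() {
        for _, a := range []int{3, 42, 67, 103} {
            fmt.Println(a, "->", ageBucket(a))
        }
    }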
Again, this tool is a hammer. We expect a person who understands wood and nails to analyse their problem and use it accordingly.
Hey, one of the co-maintainers of the project here. And the one who decided to use JSON.
I agree with you. There are other options for configuration that are much better than JSON (YAML, TOML).
The main reason for choosing JSON was simplicity. This was my first project in Go and I didn't want to spend much time on it either. I found an example that was using JSON and saw that I didn't need any external library to decode it. I thought that was good enough, at least for now.
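Roughly the pattern (the field names are invented for illustration, not our actual config schema):

    // Decode a JSON config using only the standard library -
    // no external dependency needed.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    type Config struct {
        Columns []struct {
            Name     string `json:"name"`
            Function string `json:"function"`
        } `json:"columns"`
    }

    func main() {
        raw := []byte(`{"columns":[{"name":"postcode","function":"outcode"}]}`)
        var cfg Config
        if err := json.Unmarshal(raw, &cfg); err != nil {
            panic(err)
        }
        fmt.Printf("%+v\n", cfg)
    }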
Will probably look into using a library that supports yaml/toml for configuration in the future.
It's ostensibly a tool that follows the unix model/philosophy: i.e., a command-line utility that does one thing well, with text inputs and outputs that can be piped together, etc.
Thanks for the idea. We don't support anonymisation of IP addresses because it's not in any of our use cases yet. But I've already added an issue to address it.