How does this tool compare to other (libre) anonymization software, such as ARX [1]? From what I understand, only basic routines to sample records and coarsen a few attributes (e.g. ZIP code, dates) are implemented so far.
This might also not be sufficient to truly anonymize data, as a large body of research has shown [2,3,4].
Hey! I'm one of the co-maintainers of the project here.
What you see today in this project is really a means to scratch an itch we had - mainly to quickly and easily sample/obfuscate some delimited data in a way that is "good enough" for demonstrating a visualisation tool without using the original dataset. It's important to note that we still intend to use this data within a secure environment.
This tool is absolutely not up to the task of anonymising a dataset well enough for it to be made public. For us, it's about risk management vs effort: from a security perspective there are scenarios where we can use samples of data that have gone through this process and substantially decrease the risk of holding data in multiple places without significant effort. If we were to go on to make any of these datasets public, we'd be looking for a better suited tool.
As a result, tools like ARX are not something we really want to compete with - they're aiming for a complete solution whose results are good enough to potentially make public. It perhaps goes without saying that how achievable this goal is remains debatable given the research you linked, but some people might be comfortable with those risks.
One thing we've done to try and bridge the gap a bit is to make it really easy to add new functions as we need them, and I think we can get to a point whereby for a good portion of use-cases this tool is good enough (for example, making datasets you can use in a development environment that are representative, but a manageable size and anonymised to a reasonable degree).
We'll also try to add something to the README addressing this exact question from you as it's one I anticipate we're going to get asked a lot - so thanks for the constructive line of questioning as it really will ultimately help us and people who choose to use this tool make a decision that's right for them and their use-cases.
I would recommend you make this clearer in the readme, as I wasn't left with the impression reading the documentation that the tool was for limited scenarios and scope.
In addition to what Nathan has said, I'd add that we needed a simple native command line tool that could be dropped onto any server and easily work alongside other Unix tools like cat, gzip, cut...
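To make the kind of processing discussed in this thread concrete, here is a minimal Python sketch of a filter that samples rows from delimited data and coarsens a ZIP code and a date. It is not the tool's actual code or interface: the column names `zip` and `birth_date`, the function names, and the masking rules are all assumptions made up for illustration. As noted above, this sort of coarsening only lowers risk; it does not make a dataset safe to publish.

```python
import csv
import random


def coarsen_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a ZIP code and mask the rest."""
    return zip_code[:keep] + "*" * max(len(zip_code) - keep, 0)


def coarsen_date(iso_date):
    """Reduce an ISO date (YYYY-MM-DD) to its year."""
    return iso_date[:4]


def sample_and_coarsen(src, dst, rate=0.1, seed=None):
    """Copy roughly `rate` of the rows from src to dst, coarsening two columns.

    `src` and `dst` are text file objects holding CSV data with a header row.
    The column names "zip" and "birth_date" are hypothetical.
    """
    rng = random.Random(seed)
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Bernoulli sampling: each row is kept independently with probability `rate`.
        if rng.random() >= rate:
            continue
        row["zip"] = coarsen_zip(row["zip"])
        row["birth_date"] = coarsen_date(row["birth_date"])
        writer.writerow(row)
```

Called as `sample_and_coarsen(sys.stdin, sys.stdout, rate=0.1)`, a script like this could sit in a pipeline next to tools such as gzip and cut, since it only reads a stream and writes a stream.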
[1] https://arx.deidentifier.org
[2] https://www.uclalawreview.org/pdf/57-6-3.pdf
[3] http://randomwalker.info/publications/no-silver-bullet-de-id...
[4] http://arxiv.org/abs/1712.05627