
>The real-world application (and potential danger) is when this data is combined with other data. De-anonymization techniques using sparse datasets have been an active area of research for at least 15 years and it is often surprising to people how much can be gleaned from a few pieces of seemingly unconnected data.

Seems pretty handwavy. Can you describe concretely how this would work?




>Seems pretty handwavy.

It has a whole Wikipedia article and everything.

https://en.wikipedia.org/wiki/De-anonymization#Re-identifica...

>Can you describe concretely how this would work?

Here's one of the earlier papers I remember off-hand, demonstrating one methodology. New statistical techniques (and improvements to existing ones) have appeared in the ~18 years since this was published. Not to mention there is significantly more data to work with now.

https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."

From the Wiki I linked:

"Researchers at MIT and the Université catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them." [...] "A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person's whereabouts."
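To make the "four points of reference" idea concrete, here is a minimal sketch of the narrowing step. It is not the researchers' actual method, and every user, place, and hour in it is invented: each trace is a set of coarse (place, hour) points, and `candidates` (a hypothetical helper) returns whoever is still consistent with the points an observer knows.

```python
# Toy data: every user, place, and hour below is invented for illustration.
traces = {
    "user_a": {("cell_1", 8), ("cell_2", 12), ("cell_3", 18), ("cell_1", 22)},
    "user_b": {("cell_1", 8), ("cell_2", 12), ("cell_4", 18), ("cell_1", 22)},
    "user_c": {("cell_1", 8), ("cell_5", 12), ("cell_3", 18), ("cell_6", 22)},
    "user_d": {("cell_7", 8), ("cell_2", 12), ("cell_3", 18), ("cell_1", 22)},
}


def candidates(traces, known_points):
    """Return the users whose trace contains every known (place, hour) point."""
    known = set(known_points)
    return sorted(user for user, points in traces.items() if known <= points)


# Points an observer might learn from, say, a few public posts (hypothetical).
observed = [("cell_1", 8), ("cell_2", 12), ("cell_3", 18)]

# Each extra point shrinks the set of users who still fit.
for k in range(1, len(observed) + 1):
    print(k, candidates(traces, observed[:k]))
```

In this toy set, one point leaves three candidates, two points leave two, and three points leave only `user_a`. Real traces are much larger and noisier, so treat this as an illustration of the mechanism only.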

Point being that operational security is hard, and it takes a lot less to "slip up" and accidentally reveal yourself than most people think. Obtaining a location within 250 miles (or whatever) can be a key piece of information that leads to other dots being connected.

Other examples (albeit with less explanation) include police takedowns of prolific CSAM producers by gathering bits and pieces of information over time, culminating in enough to make an identification.


>"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."

> [...]

>"Researchers at MIT and the Université catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them." [...] "A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person's whereabouts."

The only reason the two attacks work is that you have access to a bunch of uncorrelated data points. That is, ratings for various shows and their dates, and cellphone movement patterns. It's unclear how you could extend this to some guy you're trying to dox on Signal. The geo info is relatively coarse and stays static, so trying to single out one person is going to be difficult. To put another way, "guy was vaguely near New York on these dates" doesn't narrow down the search parameters by much. That's going to be true for millions of people.


>To put another way, "guy was vaguely near New York on these dates" doesn't narrow down the search parameters by much.

That's why I said that this data alone is probably worthless, but can gain value when combined with other data ("As a piece of data alone, the results are probably not of significant use").

The combining of data is the important bit and the entire emphasis of both of my other comments.

Two pieces of otherwise anonymous data can, when combined, lead to re-identification.
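A rough sketch of what "combined" means here, using made-up records and a hypothetical `link` helper: neither the region nor the birth year picks out anyone on its own, but joining on both attributes ties every pseudonymous record to a name.

```python
# Hypothetical records; none of these people or values are real.
ratings = [  # "anonymous" dataset: pseudonymous ids, no names
    {"id": 101, "region": "NY", "birth_year": 1985},
    {"id": 102, "region": "NY", "birth_year": 1990},
    {"id": 103, "region": "CA", "birth_year": 1985},
    {"id": 104, "region": "CA", "birth_year": 1990},
]
profiles = [  # public dataset: names attached
    {"name": "Alice", "region": "NY", "birth_year": 1985},
    {"name": "Bob", "region": "CA", "birth_year": 1985},
    {"name": "Carol", "region": "NY", "birth_year": 1990},
    {"name": "Dave", "region": "CA", "birth_year": 1990},
]


def link(anonymous, public, keys):
    """Map an anonymous id to a name when exactly one public record shares all key values."""
    linked = {}
    for record in anonymous:
        matches = [p["name"] for p in public
                   if all(p[k] == record[k] for k in keys)]
        if len(matches) == 1:  # ambiguous matches are not treated as identifications
            linked[record["id"]] = matches[0]
    return linked


print(link(ratings, profiles, ["region"]))                # nobody is unique
print(link(ratings, profiles, ["birth_year"]))            # nobody is unique
print(link(ratings, profiles, ["region", "birth_year"]))  # everyone is unique
```

Real linkage attacks have to cope with noisy and partial matches, which this exact-match sketch ignores.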


>Two pieces of otherwise anonymous data can, when combined, lead to re-identification.

How are you going to get more anonymous data? Practically speaking, if your target has such poor opsec that they're hemorrhaging bits of data, you probably don't need this attack to deanonymize them.


>How are you going to get more anonymous data?

All over the place? Your comment history here (and mine!) is full of data. Each piece alone isn't identifying, but there's a good chance that in aggregate it is.
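A back-of-the-envelope way to see the aggregate effect is to count bits: singling out one person among N takes about log2(N) bits, and a clue that is true of a fraction f of people contributes -log2(f) bits. The sketch below assumes a world population of roughly 8 billion, uses fractions I made up purely for illustration, and treats the clues as independent, which real clues usually are not.

```python
import math


def bits(fraction):
    """Bits of information gained from a clue that is true of `fraction` of people."""
    return -math.log2(fraction)


population = 8_000_000_000       # rough world population (assumption)
needed = math.log2(population)   # about 33 bits to single out one person

# Hypothetical fractions, chosen only to illustrate the arithmetic.
clues = {
    "within ~250 miles of New York": 50_000_000 / population,
    "works in a niche field": 1 / 1_000,
    "mentions a rare hobby": 1 / 5_000,
    "active in a particular time window": 1 / 4,
}

collected = sum(bits(f) for f in clues.values())
# Expected number of people matching every clue, assuming independence.
remaining = population * math.prod(clues.values())

print(f"needed: {needed:.1f} bits, collected: {collected:.1f} bits")
print(f"people still matching all clues: about {remaining:.1f}")
```

With these invented numbers, no single clue comes close to 33 bits, but together they leave only a handful of candidates.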

If you share that username on Discord/Twitter/Reddit/Steam/whatever, that's even more data. If you reference old accounts anywhere, you guessed it, even more.

>you probably don't need this attack to deanonymize them

My comment wasn't necessarily specific to this attack, just noting that this attack can be an additional piece of data in the chain of re-identification.

You've gone from "not convinced on the real world applications here" to "how are you going to get more anonymous data". If we assume that you can get some data somewhere (a short list of example sources is above), can we agree that there is, possibly, a real-world application?



