
This relies on the (unfounded) assumption that this pseudonymous ID is being used for tracking purposes and that Google is actively lying about it.


GDPR treats an IP address as personal data. The data is not transmitted through an anonymizing network, so Google has access to the user's IP address when they receive the data.

Anything that is associated with personal data also becomes personal data itself; therefore Google is transmitting personal data without user consent, which is illegal.

Asking for consent is not required under GDPR when the data collection is needed for a service to function. That is not the case here: Google services function without receiving that header, and the data is used by Google to gain a technical advantage over other web services.


> GDPR treats an IP address as personal data.

No it doesn't. GDPR only treats an IP address as personal data if it is associated with actual identifying information (like a name or address). Collecting an IP address alone, and not associating it with anything else, is completely fine (otherwise the default configs of nginx and Apache would violate GDPR, and through them basically every website would).

Edit: and furthermore, even if it did (I see conflicting reports), if you collect an IP address and another pseudonymous ID and don't join them, the ID isn't personal data.

IOW, the theoretical capability to make changes to a system to use info in a non-GDPR compliant way doesn't make the information or system noncompliant. You actually have to do the noncompliant things.


An IP address is itself personal data; it does not have to be associated with other personal data.

https://ec.europa.eu/info/law/law-topic/data-protection/refo...

> Collecting IP address alone, and not associating it with anything else, is completely fine (otherwise nginx and apache's default configs would violate GDPR), and through them basically every website would violate GDPR.

See my comment about consent not being required when the data is needed to provide a service. Logging is reasonably required to provide a service.

> and furthermore, even if it did (I see conflicting reports), if you collect IP Address and another pseudonymous ID and don't join them, the ID isn't personal data.

The transmission of data is already covered by GDPR; you don't have to store the data to be bound by the law.


See my edit. There's conflicting information on this. A dynamic IP, for example, isn't directly related to or relatable to a specific natural person without other context.

But even if that's the case, if you don't tie the pseudonymous ID to the IP, it isn't personal data. As far as I can tell, the transfer rules you reference are about transferring data out of the EU, and can be summarized as "you can't transfer data to a non-EU country and then process it in a way that violates the GDPR". Article 46 notes that transferring data is fine as long as appropriate safeguards are in place[1], and Article 47[2] defines what constitutes those safeguards (in general, contractually/legally binding agreements with appropriate enforcement policies).

This goes back to what I said before: The theoretical capability to do noncompliant things doesn't make a system GDPR-noncompliant. You have to actually do noncompliant things to not comply.

[1]: https://gdpr-info.eu/art-46-gdpr/

[2]: https://gdpr-info.eu/art-47-gdpr/


> > and furthermore, even if it did (I see conflicting reports), if you collect IP Address and another pseudonymous ID and don't join them, the ID isn't personal data.

> The transmission of data is already covered by GDPR, you don't have to store the data to be bound by the law.

This cannot be the actual correct interpretation of the GDPR, because under this logic _all_ IP packets on the public internet (sent by or to EU citizens) are covered by the GDPR, since you are transmitting data alongside an IP address.


To help other readers:

"The European Commission maintains this website to enhance public access to information about its initiatives and European Union policies in general."

https://ec.europa.eu/info/law/law-topic/data-protection/refo...

"Home > Law > Law by topic > Data protection > Reform > What is personal data?"

"Examples of personal data

...

- an Internet Protocol (IP) address;"


There has been an EU court ruling on this exact question: whether dynamic IP addresses count as personal data even in contexts where the website operator does not have the means to associate them with an individual but another party (such as an ISP) does. The Court of Justice of the European Union has ruled that they do count as personal data. [1]

Furthermore, GDPR itself specifically refers to online identifiers in Article 4 as falling under the definition of personal data[2] and then clarifies in Recital 30[3] that IP addresses count as online identifiers in this context. There seems to be no legal ambiguity in the EU on this topic at this point, but I would not be surprised to see parties who are not GDPR compliant pretend otherwise indefinitely.

[1] https://curia.europa.eu/jcms/upload/docs/application/pdf/201...

[2] https://gdpr-info.eu/art-4-gdpr/

[3] https://gdpr-info.eu/recitals/no-30/


Interesting, TIL. That doesn't change the major point I was making, though, which is that an anonymized identifier (such as the 13-bit ID under discussion) isn't personal info, even if it might have originally been collected alongside data which is personal info. If I give you said 13-bit ID, you need other info to back out a single person; the anonymous ID corresponds to multiple IPs.


I think you're still missing the point. Google transmits personal data to their servers without user consent. The value of x-client-data is personal data, because it is associated with an IP address during transit, due to how HTTP requests work. The nature of the data, what is being done with it on the server, and the location of the server are all irrelevant in this instance; the only important part is that personal data has left the browser in the form of a request and reached a Google server.

This data collection would only be exempt from GDPR if the data were required for the service to function, but that is not the case with x-client-data.


> The value of x-client-data is personal data, because it is associated with an IP address during transit, due to how HTTP requests work.

This is not correct. The x-client-data is not personal data. x-client-data associated with an IP address is personal data. As soon as you separate the client-data from the IP, the client data stops being personal data. IOW, the tuple (x-client-data, IP) is personal data. But x-client-data on its own isn't personal data, because it cannot be used to infer the IP on its own.

I don't know where you're getting this "if two pieces of data ever touch and one of them is personal data, the other one is now also contaminated as personal data". It's not true. That would make the existence of anonymous data (which the GDPR specifies as a thing) practically speaking impossible to have on the web, since all requests are associated with the IP on receipt. (Or actually even worse: it would make the process of anonymizing data impossible in general, since the anonymization process associates the anonymized data with the original personal data.)

To be precise, the GDPR defines anonymized data as "data rendered anonymous in such a way that the data subject is not or no longer identifiable". The x-client-data header is exactly that. The subject of the header is not identifiable by the x-client-data header alone. Therefore the header is anonymous and not subject to strong GDPR requirements.

For the client data header to be personal data, you'd need to describe a scheme such that, given an x-client-data header, and only an x-client-data header, you could identify one (and only one) unique person to whom that header corresponds. You're welcome to come up with such a scheme, but my intro CS classes taught me that bucketed hashing is irreversible, and with 8192 buckets, you're not going to be able to uniquely identify anyone specific.
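
To make that concrete, here's a toy sketch of bucketed hashing (not Chrome's actual derivation; the hash and inputs are made up for illustration):

    import hashlib
    from collections import Counter

    BUCKETS = 8192  # 2**13, as in the 13-bit ID under discussion

    def bucket(client_id: str) -> int:
        # Hash an arbitrary client identity down to one of 8192 buckets.
        digest = hashlib.sha256(client_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % BUCKETS

    # Simulate a million clients: every bucket ends up shared by ~120
    # of them, so a bucket value alone cannot name a single person.
    counts = Counter(bucket(f"client-{i}") for i in range(1_000_000))
    print(min(counts.values()), max(counts.values()))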


The Chrome whitepaper is written in a way that makes you believe there are only 8000 possibilities.

But read carefully what it says: there are only 8000 possibilities if the crash reporting functionality is disabled (which it is not by default).

Otherwise the marker is a huge differentiator (I haven't seen any duplicates personally).


> That would make the existence of anonymous data practically speaking impossible to have on the web

For almost every type of data that is true. Transforming or substituting data doesn't make it anonymous; the patterns in the data are still present. To produce actually anonymous data you have to do what the GDPR instructs: corrupt the data ("rendered anonymous") severely enough that the "data subject is ... no longer identifiable". You need to do something like aggregate the data into a small number of groups such that individual records no longer exist. Techniques like "differential privacy"[1] let you control precisely how "anonymous" your data is by e.g. mixing in carefully crafted noise.
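
To illustrate the noise-mixing idea, here's a toy sketch of the standard Laplace mechanism (a generic construction, not any specific implementation of Google's):

    import random

    def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
        # Laplace mechanism: noise with scale sensitivity/epsilon bounds
        # how much any one person's record can shift the published result.
        # The difference of two exponentials is Laplace-distributed.
        lam = epsilon / sensitivity
        return true_count + random.expovariate(lam) - random.expovariate(lam)

    # Lower epsilon = more noise = stronger anonymity, less precise stats.
    print(dp_count(12345, epsilon=0.1))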

> 8192 buckets

While others have pointed out that this isn't actually limited to 13 bits of entropy for most people, there are at least two reasons that field is still very personally identifying. First, "x-client-data on its own" never happens. Google isn't wasting time and money implementing this feature to make an isolated database with a single column. At no point will the x-client-data value (or any other type of data they capture) ever sit in isolation. I used the IPv4 Source Address as an example because it will necessarily be present in the header of the packets that transport the x-client-data header over the internet. Suggesting that Google would ever use this value in isolation is almost insulting to Google; why would they waste their expensive developer time to create, capture, and manage data that is obviously useless?

However, let's say they did make an isolated system that only ever received 13-bit integers stripped of all other data. Surely that wouldn't be personally identifiable? If they store it with a locally generated high-resolution timestamp, they can re-associate the data with personal accounts by correlating the timestamps with their other timestamped databases (web server access logs, GA, reCAPTCHA, etc).
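
A toy sketch of that correlation (the table contents and tolerance are invented for illustration):

    from datetime import datetime, timedelta

    # Supposedly anonymous: high-resolution timestamp + 13-bit ID.
    isolated = [(datetime(2020, 2, 5, 12, 0, 0, 123456), 4711)]
    # Identified: the same requests as seen in an ordinary access log.
    access_log = [(datetime(2020, 2, 5, 12, 0, 0, 123501), "user@example.com")]

    TOLERANCE = timedelta(milliseconds=1)

    # Nearest-timestamp join: re-associate the "anonymous" rows with users.
    for ts_a, bucket_id in isolated:
        for ts_b, user in access_log:
            if abs(ts_a - ts_b) <= TOLERANCE:
                print(f"bucket {bucket_id} likely belongs to {user}")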

> you'd need to describe a scheme such that, given an x-client-data header, and only an x-client-data header, you could identify one (and only one) unique person to whom that header corresponds

You should first describe why Google would ever use that header and only that header. Even if they aren't currently using x-client-data as an identifier or as additional fingerprintable entropy, simply saving the data gives Google the option to use it as an identifier in the future.

[1] https://www.youtube.com/watch?v=pT19VwBAqKA https://en.wikipedia.org/wiki/Differential_privacy


> You need to do something like aggregate the data into a small number of groups such that individual records no longer exist. Techniques like "differential privacy" let you control precisely how "anonymous" your data is by e.g. mixing in carefully crafted noise.

Correct, and another anonymization technique (in place of differential privacy) is k-anonymity. In k-anonymity schemes, you ensure that in any given table no row corresponds to any fewer than k individuals. Why is this useful? Well, let's say you have some 10-15 bit identifier. You can take a request from a user that contains information that might, when combined, be identifying. Say: coarse-ish location (state/country), device metadata (browser version, OS version), and coarse access time (the hour and day of week). Combining all 3 (or 4 if you include the pseudonymous ID) is enough to uniquely identify at least some users. Then let's say you also track some performance statistics about the browser itself.

But any single piece of data (plus the pseudonymous ID) is not enough to identify any specific user. So if you use the pseudonymous ID as a shared foreign key, you can join across the tables and get approximate crosstabs without uniquely identifying any specific user. Essentially, if you want to ask whether there are performance differences between version N and version N+1, you can check the aggregate performance vs. the aggregate count of new vs. old browser versions, and with 8K samples you're able to draw reasonable conclusions. And in general you can do this across dimensions or combinations of dimensions that might normally contain enough pieces of info to identify a single user.
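
Here's a toy sketch of that crosstab (the tables and column names are made up; the point is that neither table stores an IP or account, and the join key is the coarse pseudonymous ID):

    from collections import defaultdict
    from statistics import mean

    # One table: pseudonymous ID -> browser version.
    versions = {4711: "N", 4712: "N+1", 4713: "N"}
    # Another table: (pseudonymous ID, page load time in ms) samples.
    perf = [(4711, 820), (4712, 640), (4713, 790), (4712, 655)]

    # Join on the shared pseudonymous ID, then aggregate per version.
    by_version = defaultdict(list)
    for pid, load_ms in perf:
        by_version[versions[pid]].append(load_ms)

    for version, samples in sorted(by_version.items()):
        # Aggregate output only; no row pins down a specific user.
        print(version, mean(samples))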

This is essentially the same idea as differential privacy, although without the same mathematical precision that differential privacy can provide. (By this I don't mean that the data can be re-identified, just that differential privacy can be used to provide tighter bounds on the anonymization, such that the statistical inferences you can gather are more precise. k-anonymity is, perhaps, a less mathematically elegant tool).

Specifically, I'm describing k-anonymity using x-client-data as a quasi-identifier in place of something like an IP or MAC address. You can find those terms in the "See Also" section of the differential privacy wiki page you linked. Google is mentioned in those pages as a known user of both differential privacy and k-anonymization in other tools.

Hopefully that answers your question of why Google would want such a thing.

> simply saving the data gives Google the option to use it as an identifier in the future.

Yes, but that doesn't mean that they're currently in violation of the GDPR, which is what a number of people keep insisting. I'm not claiming that it's impossible for Google to be doing something nefarious with data (although I will say that in general I think that's an unreasonably high bar). Just that the collection of something like this isn't an indication of nefarious actions, and is in fact likely the opposite.



