
This is a serious report from a serious research institute (Trinity College Dublin). There was previously not much insight into data harvesting by Google Play Services.

The terrifying result is that Google can easily see (1) which sender sends (2) what (3) to whom at (4) what time, and (5) when the user uses the app and (6) for how long. The apps even send phone numbers and the Android ID.

Reportedly, Google will clarify something in some legal text now that this report has come out. I would expect this to be totally illegal in at least the EU. It is about time top executives received jail time for these kinds of breaches.

From the abstract:

We report on measurements of the data sent to Google by the Google Messages and Google Dialer apps on an Android handset. We find that these apps tell Google when message/phone calls are made/received. The data sent by Google Messages includes a hash of the message text, allowing linking of sender and receiver in a message exchange, and by Google Dialer the call time and duration, again allowing linking of the two handsets engaged in a phone call. Phone numbers are also sent to Google. In addition, the timing and duration of user interactions with the apps are sent to Google. There is no opt out from this data collection. The data is sent via two channels, the Google Play Services (i) Clearcut logger and (ii) Google/Firebase Analytics. This study is therefore one of the first to cast light on the actual telemetry data sent by Google Play Services, which to date has largely been opaque.




> a hash of the message text

This makes it possible to recover the message if it is short. Let's say a typical user knows 2×10^4 words and the message contains 3 words. If each word is chosen with equal probability, independently of the others, then one can guess the message content after an average of 4×10^12 attempts. And of course, one can check all captured hashes against each candidate at the same time.

So the hash function is effectively reversible for short pieces of data, like phone numbers, DOB, card numbers, emails and so on.
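A minimal sketch of the idea in Python. This is purely illustrative: it assumes an unsalted SHA-256 over the raw message text (the actual scheme mixes in a timestamp and truncates, see below), and the tiny wordlist and target are made up.

    import hashlib
    from itertools import product

    # Hypothetical stand-in for the ~2x10^4 words a typical user knows.
    WORDLIST = ["on", "my", "way", "love", "you", "ok", "late", "home"]

    def msg_hash(text):
        # Assumed hash for illustration: plain SHA-256 of the message text.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def crack(target_hashes, max_words=3):
        # Try every 1..max_words word combination; check each candidate against
        # all captured hashes at once (one set lookup per candidate).
        found = {}
        for n in range(1, max_words + 1):
            for words in product(WORDLIST, repeat=n):
                candidate = " ".join(words)
                h = msg_hash(candidate)
                if h in target_hashes:
                    found[h] = candidate
        return found

    # Example: recover "on my way" from its hash.
    print(crack({msg_hash("on my way")}))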


I wonder how many short messages are easily recognizable ("on my way", "eta?", "love you", "lol", ...)


Doesn't it say "a truncated hash"?


Yes, it is a 128-bit hash. 128 bits are enough to recover a message made of 3-4 words. Some messages are built from a template, for example a message from a bank saying "Your PIN code is XXXX". In this case, it is possible to recover the PIN code from the 128-bit hash, as the number of possible PIN codes is far less than 2^128.


Probably, given the sender is always the same for those, the work to recover the text is literally just the key space of the code. Like, 1M hashes in a lookup table for a normal bank 2FA code. Recoverable in microseconds.
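A sketch of that lookup table, assuming (hypothetically) a fixed bank template and a plain SHA-256 over the message text:

    import hashlib

    TEMPLATE = "Your 2FA code is {code}"  # hypothetical template; only the 6-digit code varies

    def msg_hash(text):
        # Stand-in for whatever hash is actually sent.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Precompute all 10^6 candidate messages once...
    table = {msg_hash(TEMPLATE.format(code=f"{i:06d}")): f"{i:06d}"
             for i in range(1_000_000)}

    # ...then every captured hash is a single dictionary lookup.
    print(table.get(msg_hash("Your 2FA code is 042817")))  # -> "042817"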


It’s a 256-bit hash of message+timestamp truncated to 128 bits. The attack you’re describing is literally impossible.


I don't see how hash truncation prevents the attack.

Regarding the timestamp: the precise timestamp is sent in clear text in another event field (currentTime_ms), so its presence doesn't prevent message content recovery.


> its presence doesn't prevent message content recovery

The TS is the “salt” & hence defeats rainbow tables, unless it’s somehow feasible to store them for every combination of message & time, past and future.

> how hash truncation prevents the attack

It literally deletes half the information you have for the attack. You CANNOT /perfectly/ recover the original text from a truncated hash; you’ll have innumerable collisions.

So yes, any attack is impossible/infeasible.


The problem is that the text being hashed is not random (as a salted password effectively is), and the space of all sensible clear texts is much, much smaller and possibly amenable to brute force.

Also, storing the hash means that messages become irrefutable: the hash can later prove that a particular text was sent.


Passwords are even less random than sensible texts - the whole discussion around rainbow tables and salting is normally about cracking passwords.

But do you disagree with my claims? Do you have any sources to backup their invalidity?


Specifically "sha256HashMsg value is a hash of the time, in hours since 1st Jan 1970, that the message was sent and of the message content i.e the message text, truncated to 128 bits."


Their investigation clearly shows that it’s a truncated SHA256 hash of the entire message text + a timestamp. I’m very sure that your method does NOT work.

How are you assuming a known dictionary and a known word count? And for the other examples, you’re somehow assuming a fully precalculated rainbow table of all possible such PII. Seriously…?

> sha256HashMsg value is a hash of the time, in hours since 1st Jan 1970, that the message was sent and of the message content i.e the message text, truncated to 128 bits.


> How are you assuming a known dictionary

If the user is from country A, there is a high probability that they use the language of country A. As for the word count, you don't need to know it: first calculate the hashes of all one-word messages, then all hashes of two-word messages, and so on.

Google can also collect statistics on which words and word combinations are used more often (for example, from Gmail messages) and use them to accelerate the brute force.

> And for the other examples, you’re somehow assuming a fully precalculated rainbow table of all possible such PII. Seriously…?

Let's take phone numbers as an example. A Russian phone number starts with +7, followed by 10 essentially random digits. Therefore, there can be at most 10^10 = 10 billion numbers. This answer [1] states that a GPU can calculate up to 10 billion MD5 or SHA1 hashes per second. Therefore, recovering a phone number from its hash would take approximately one second on typical consumer hardware.

Note that one could build custom hardware optimized for calculating hashes and get even higher throughput.

Regarding credit card numbers, the search space is larger: you have to guess 15 digits (the 16th digit is a checksum). But it is not as if there are 10^15 credit cards in the world, and the search space can be significantly reduced, for example by filtering by BIN (the first 6 digits, which identify the issuing bank).

This is why hashing short identifiers (like phone numbers or card numbers) doesn't make much sense: such a hash can easily be reversed.

[1] https://security.stackexchange.com/a/38135
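As a rough sketch of the phone-number case (assumed for illustration: a plain SHA-256 over the number in E.164 form; a real attacker would run this on a GPU, not in Python):

    import hashlib

    def number_hash(number):
        # Assumed: plain SHA-256 of the E.164 string.
        return hashlib.sha256(number.encode("ascii")).hexdigest()

    def recover_ru_number(target):
        # Enumerate the whole +7 space: 10^10 candidates. Pure Python is far too
        # slow for the full space; a GPU at ~10^10 H/s covers it in about a second.
        for i in range(10_000_000_000):
            candidate = f"+7{i:010d}"
            if number_hash(candidate) == target:
                return candidate
        return None

    # Made-up example number; it sits early in the search order, so this returns quickly.
    print(recover_ru_number(number_hash("+70000000042")))  # -> "+70000000042"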


Thanks for the detail, though to be clear, I’m not doubting the mathematical theoretical possibility of an SHA256 rainbow table.

I’m saying that even the example scenario you describe is a massive solution space in itself. And then extrapolating this to all sorts of PII and combinations of words/letters to make this attack worthwhile is quite a stretch. Again, a slight change in the order of words or characters (even something like an extra space) will break your rainbow table lookup. People message in all sorts of weird ways, with abbreviations, typos, and slang, or they even mix languages.

Even suggesting that “one could just” pre-calculate every 1-, 2-, 3-, etc. word combination is hilarious.

I categorically do not believe this attack is even remotely feasible.


Let's assume you only care about cracking one-word messages in English, and only the 10000 most common words. The timestamp is granular to the hour, and there are ~10000 hours in a year.

Easily available hardware can compute 10 billion SHA-1 hashes per second. So to brute force every hash for every combination of one-word message and timestamp, you need ... one hundredth of a second. No rainbow tables needed here. Just brute force it. For each hash, you just check every message hash stored on your server to find matches. If you find a match, you've uncovered the probable message text.

And so on for two- and three-word combinations. Let's take three-word combinations, assuming all three words are among the 10000 most commonly used English words. There are 10000^3 possible combinations of such words. Once again, the timestamp adds another 10000 possibilities (for the last year). How long does it take to brute force all three-word combinations? One million seconds - less than two weeks. This is for a SINGLE person working on this with a SINGLE GPU. Not only will a serious attacker have better hardware, but context clues can help limit the possible options so that much longer combinations of words will be possible (if they are common combinations).

Sure, there will be slight punctuation or capitalization differences in some cases and that adds entropy, but not by enough to bother taking into consideration in terms of security.

I guess we're also assuming in all this that Google is not tagging each message with the time received on the server. Given that they already have that information, they might be able to recover the message content without needing the extra hashes for the timestamp.
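For concreteness, a sketch of brute-forcing the scheme described in the paper (SHA-256 over the send hour plus the message text, truncated to 128 bits). The exact encoding and concatenation Google uses is not public, so the "<hour>:<text>" form below is an assumption, and the wordlist is a tiny stand-in:

    import hashlib
    import time
    from itertools import product

    def truncated_hash(hour, text):
        # Assumed encoding: "<hours since 1970-01-01>:<message text>", SHA-256, first 128 bits.
        digest = hashlib.sha256(f"{hour}:{text}".encode("utf-8")).digest()
        return digest[:16].hex()

    WORDS = ["on", "my", "way", "ok", "late", "home"]  # stand-in for a 10k-word list

    def crack(target, hour_window, max_words=3):
        # The clear-text currentTime_ms field narrows the send hour to a few candidates,
        # so the timestamp only multiplies the work by a small constant.
        for hour in hour_window:
            for n in range(1, max_words + 1):
                for words in product(WORDS, repeat=n):
                    text = " ".join(words)
                    if truncated_hash(hour, text) == target:
                        return hour, text
        return None

    now_h = int(time.time() // 3600)
    target = truncated_hash(now_h, "on my way")
    print(crack(target, range(now_h - 1, now_h + 2)))  # -> (<hour>, 'on my way')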


> How are you assuming a known dictionary and a known word count? And for the other examples, you’re somehow assuming a fully precalculated rainbow table of all possible such PII. Seriously…?

Obviously deciphering all messages is not feasible, but I think GP's point was that you could put together a table that would get you some decent proportion of messages. For example, if you assume an alphabet of [a-zA-Z0-9 ], then you'd only need to calculate ~63.5 billion hashes to get all messages up to 6 chars. Similarly, you could use a dictionary to put together tables for short sentences.
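A quick back-of-the-envelope check of that figure, using the ~10^10 hashes/second GPU rate cited above:

    # Alphabet [a-zA-Z0-9 ] has 26 + 26 + 10 + 1 = 63 characters.
    alphabet = 63
    total = sum(alphabet**n for n in range(1, 7))   # all messages of length 1..6
    print(total)                                    # 63531945792 (~63.5 billion)
    print(total / 1e10)                             # ~6.4 seconds at 10^10 hashes/second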


Could you link to some PoC hacks that prove feasibility? I’m not saying it’s theoretically impossible - mathematically it should be possible. I’m simply saying that the message content space is massive, and hence the attack is practically infeasible.

Again, all of this is predicated on Google hashing messages client-side and then secretly reverse-matching them server-side for very simplistic 1-3 word messages?


Messages are usually composed of words, not random letters, so you don't need to brute-force all letter combinations.

Also, some messages use a template, for example, a message from an online store like "Your order number XXXXX is ready for pickup" or a message from a bank saying "Your PIN code is XXXX". In this case, all you need to guess is a number.


> It is about time top executives received jail time for these kinds of breaches.

I'm afraid jail time is not yet provided for in the GDPR.


The lack of serious consequences for harmful business practices is indeed one of the big issues in Western society, imho. Breaching human rights like privacy becomes just a risk trade-off in which the company makes a cold calculation of weighted costs. Companies are well protected against individuals by the legal and financial power imbalance.

Fines are usually laughably small and well worth paying. And even then, the burden falls on the company and not the executive.

We need a system where the current executives get jail time for these kinds of crimes. This ensures that executives will only accept such a role when they are certain that the company is "clean", i.e. they won't risk jail time because of a previous CEO's fuckups.


Not to disagree, but imo it is the main purpose of a corporation to shield individuals from consequences, hence increasing the level of acceptable risk and with it the velocity of business development. (All for the greater good.)


The liability shield may have made better sense when corporations were actually given charters because they were expected to serve a public good. That no longer seems to be the case, if it ever was. Now it seems that corporations exist only to make money, public good be damned.


Gross negligence in data protection can be prosecuted under computer criminality acts by the respective data protection office; there is precedent for this in the UK. However, this is not laid out in the European directives themselves but usually in the local implementations of those directives.


Even if it were, it's not like it's enforced anywhere near enough anyway. This kind of crap is a clear violation of the GDPR and has a broad impact (all Android users are affected), so it should be a slam-dunk case, and yet it is allowed to continue despite the regulation having been in place for over 4 years now.


I understand the sentiment. But it is probably allowed (as you state) because until now nobody looked at it, and so nobody took the time to inform the respective authorities.

I need to take a closer look at the report and see whether there is a reasonable chance that I am affected. So that I can send a complaint to my data protection agency in Schleswig-Holstein (the German state I reside in).

At least that is what I can do. And I can only urge any EU citizen (rightly) enraged by this to file an official complaint with their respective data protection agency.


> But it is probably allowed (as you state) because until now nobody looked at it, and so nobody took the time to inform the respective authorities.

Something doesn't become illegal only once law enforcement is informed of it.

> So that I can send a complaint to my data protection agency in Schleswig-Holstein (the German state I reside in).

I wish you the best of luck and hope that your local DPA is much more competent than the UK one.


> Something doesn't become illegal only once law enforcement is informed of it.

I didn't say that. Only that law enforcement (or data privacy regulators) probably were not aware of it happening.


It's ironic that this is being investigated by an Irish institution when all these practices are blessed (by stonewalling investigations and refusing to take any action) by the corrupt Irish data "protection" authority despite being an obvious breach of the GDPR.



