What data do the Google Dialer and Messages apps on Android send to Google? [pdf] (tcd.ie)
233 points by exceptione on March 21, 2022 | 90 comments



This is a serious report from a serious research institute (Trinity College Dublin). There was previously not much insight into data harvesting by Google Play Services.

The terrifying result is that Google can easily see (1) which sender sends (2) what (3) to whom at (4) what time, and (5) when the user uses the app and (6) for how long. It even sends phone numbers and the Android ID.

Reportedly, Google will clarify something in some legal text now that this report has come out. I would expect this to be flatly illegal in at least the EU. It is about time top executives received jail time for these kinds of breaches.

From the abstract:

We report on measurements of the data sent to Google by the Google Messages and Google Dialer apps on an Android handset. We find that these apps tell Google when message/phone calls are made/received. The data sent by Google Messages includes a hash of the message text, allowing linking of sender and receiver in a message exchange, and by Google Dialer the call time and duration, again allowing linking of the two handsets engaged in a phone call. Phone numbers are also sent to Google. In addition, the timing and duration of user interactions with the apps are sent to Google. There is no opt out from this data collection. The data is sent via two channels, the Google Play Services (i) Clearcut logger and (ii) Google/Firebase Analytics. This study is therefore one of the first to cast light on the actual telemetry data sent by Google Play Services, which to date has largely been opaque.


> a hash of the message text

This allows the message to be restored if it is short. Say a typical user knows 2×10^4 words and the message contains 3 words. If each word is chosen with equal probability, independently of the others, then one can guess the message content after an average of 4×10^12 attempts ((2×10^4)^3 / 2). And of course, one can check all the hashes at the same time.

So the hash function is effectively reversible for short pieces of data, like phone numbers, DOB, card numbers, emails and so on.


I wonder how many short messages are easily recognizable ("on my way", "eta?", "love you", "lol", ...)


Doesn't it say "a truncated hash"?


Yes, it is a 128-bit hash. 128 bits is more than enough to uniquely pin down a message made of 3-4 words. Some messages are built from a template, for example a message from a bank saying "Your PIN code is XXXX". In this case it is possible to recover the PIN code from the 128-bit hash, since the number of possible PIN codes is far less than 2^128.


Given that the sender is always the same for those, the work to recover the text is probably literally just the key space of the code. Like, 1M hashes in a lookup table for a normal bank 2FA code. Recoverable in microseconds.


It’s a 256-bit hash of message+timestamp truncated to 128 bits. The attack you’re describing is literally impossible.


I don't see how hash truncation prevents the attack.

Regarding the timestamp: the precise timestamp is sent in clear text in another event field (currentTime_ms), so its presence doesn't prevent message content recovery.


> its presence doesn't prevent message content recovery

The TS is the “salt” and hence defeats rainbow tables, unless it’s somehow feasible to store tables for every combination of message and time, past and future.

> how hash truncation prevents the attack

It literally deletes half the information you have for the attack. You CANNOT /perfectly/ compute the original text from a truncated hash; you’ll have innumerable clashes.

So yes, any attack is impossible/infeasible.


The problem is that the text that is hashed is not random (like a salted password would be) and the space of all sensible clear texts is much much smaller and possibly amenable to brute force.

Also, storing the hash means that messages are irrefutable.


Passwords are even less random than sensible texts - the whole discussion around rainbow tables and salting is normally about cracking passwords.

But do you disagree with my claims? Do you have any sources showing they're invalid?


Specifically "sha256HashMsg value is a hash of the time, in hours since 1st Jan 1970, that the message was sent and of the message content i.e the message text, truncated to 128 bits."
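Based on that description, the hashed value can be reproduced roughly like this. The exact serialization (separator, encoding, field order) is not published, so the concatenation below is a guess:

```python
import hashlib

def sha256_hash_msg(message: str, sent_unix_ts: int) -> bytes:
    """Hash covering the send time (in hours since 1st Jan 1970) and the
    message text, truncated to 128 bits, per the paper's description.
    The byte layout here is an assumption, not Google's actual format.
    """
    hours = sent_unix_ts // 3600
    data = f"{hours}|{message}".encode()
    return hashlib.sha256(data).digest()[:16]

digest = sha256_hash_msg("on my way", 1_647_900_000)  # a March 2022 timestamp
print(len(digest) * 8)  # 128
```

Note the hour granularity: two messages sent in the same hour share the same "salt", so a per-hour lookup table only needs to be built once per hour of interest.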


Their investigation clearly shows that it’s a truncated SHA256 hash of the entire message text + a timestamp. I’m very sure that your method does NOT work.

How are you assuming a known dictionary and a known word count? And for the other examples, you’re somehow assuming a fully precalculated rainbow table of all possible such PII. Seriously…?

> sha256HashMsg value is a hash of the time, in hours since 1st Jan 1970, that the message was sent and of the message content i.e the message text, truncated to 128 bits.


> How are you assuming a known dictionary

If the user is from country A, there is a high probability that they use the language of country A. Regarding the word count, you don't need to know it: first calculate hashes of all one-word messages, then all two-word messages, and so on.

Google can also collect statistics on which words and word combinations are used most often (for example, from Gmail messages) and use this to accelerate the brute force.

> And for the other examples, you’re somehow assuming a fully precalculated rainbow table of all possible such PII. Seriously…?

Let's take phone numbers as an example. A Russian phone number starts with +7, followed by 10 digits. Therefore, there can be at most 10^10 = 10 billion numbers. This answer [1] states that a GPU can calculate up to 10 billion MD5 or SHA-1 hashes per second. Therefore, recovering a phone number from a hash would take approximately one second on typical consumer hardware.

Note that one could build a custom ASIC optimized for calculating hashes and get even higher throughput.

Regarding credit card numbers, there is a larger search space: you have to guess 15 digits (the 16th digit is a checksum). But it is not as if there are 10^15 credit cards in the world. The search space can be significantly reduced, for example by filtering on the BIN (the first 6 digits, which identify the issuing bank).

This is why hashing short identifiers (like phone numbers or card numbers) doesn't make much sense: such hashes can easily be reversed.

[1] https://security.stackexchange.com/a/38135
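The arithmetic checks out directly; the hash-rate figure is taken from the linked answer and is of course hardware-dependent:

```python
# Worst-case time to exhaust the Russian mobile number space on one GPU.
KEYSPACE = 10 ** 10        # +7 followed by 10 digits
HASHES_PER_SEC = 10 ** 10  # single consumer GPU (MD5/SHA-1 class), per [1]

worst_case_seconds = KEYSPACE / HASHES_PER_SEC
print(worst_case_seconds)  # 1.0
```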


Thanks for the detail, though to be clear, I’m not doubting the mathematical theoretical possibility of an SHA256 rainbow table.

I’m saying that even the example scenario you describe is a massive solution space in itself. And then to extrapolate this for all sorts of PII and combinations of words/letters to make this attack worthwhile, is quite a stretch. Again, a slight change in order of words or characters (even something like an extra space) will break your rainbow table lookup. People message in all sorts of weird ways, with abbreviations, typos, slang, or even mix languages.

Even suggesting that “one could just” pre-calculate every 1,2,3,etc combination of words is hilarious.

I categorically do not believe this attack is even remotely feasible.


Let's assume you only care about cracking one-word messages in English, and only the 10,000 most common words. The timestamp is granular to the hour, and there are ~8,760 hours in a year (call it 10,000).

Easily available hardware can create 10 billion SHA-1 hashes per second. So to brute force every hash for every combination of one word message and timestamp, you need ... one hundredth of one second. No rainbow tables needed here. Just brute force it. For each hash, you just check every message hash stored on your server to find matches. If you find a match, you've uncovered the probable message text.

And so on for 2, 3 word combinations. Let's take 3 word combinations, assuming all three words are in the 10000 most commonly used English words. There are 10000^3 possible combinations of such words. Once again, the timestamp adds another 10000 possibilities (for the last year). How long does it take to brute force all 3 word combinations? One million seconds - less than two weeks. This is for a SINGLE person working on this with a SINGLE GPU. Not only will a serious attacker have better hardware, but context clues can help limit the possible options so that much longer combinations of words will be possible (if they are common combinations).

Sure, there will be slight punctuation or capitalization differences in some cases and that adds entropy, but not by enough to bother taking into consideration in terms of security.

I guess we're also assuming in all this that Google is not tagging each message with the time received on the server. Given that they already have that information, they might be able to recover the message content without needing the extra hashes for the timestamp.
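The cost estimates above reduce to a few lines of arithmetic (word counts and hash rate are the assumptions stated in the comment):

```python
WORDS = 10_000             # most common English words
TIMESTAMPS = 10_000        # ~a year of hour-granular timestamps
HASHES_PER_SEC = 10 ** 10  # single consumer GPU

one_word_secs = WORDS * TIMESTAMPS / HASHES_PER_SEC
three_word_secs = WORDS ** 3 * TIMESTAMPS / HASHES_PER_SEC

print(one_word_secs)             # 0.01 s for every one-word message
print(three_word_secs / 86_400)  # ~11.6 days for every three-word message
```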


> How are you assuming a known dictionary and a known word count? And for the other examples, you’re somehow assuming a fully precalculated rainbow table of all possible such PII. Seriously…?

Obviously deciphering all messages is not feasible, but I think the GP's point was that you could put together a table that would get you some decent proportion of messages. For example, if you assume an alphabet of [a-zA-Z0-9 ] then you'd only need to calculate ~63.5 billion hashes to cover all messages up to 6 characters. Similarly, you could use a dictionary to build tables for short sentences.
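That ~63.5 billion figure is just the geometric sum over message lengths 1 through 6 for a 63-symbol alphabet:

```python
ALPHABET = 26 + 26 + 10 + 1  # [a-zA-Z0-9 ] = 63 symbols

total = sum(ALPHABET ** n for n in range(1, 7))  # lengths 1..6
print(f"{total:,}")  # 63,531,945,792
```

At the ~10^10 hashes/second GPU rate cited elsewhere in the thread, that keyspace takes only a few seconds per timestamp to sweep.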


Could you link to some PoC hacks that prove feasibility? I’m not saying it’s theoretically impossible - mathematically it should be. I’m simply saying that the message content space is massive, and hence practically infeasible.

Again, all of this is predicated on Google hashing messages client-side and then, secretly reverse-matching them server-side for very simplistic 1-3 word messages?


Messages are usually composed of words, not random letters, so you don't need to brute-force all letter combinations.

Also, some messages use a template, for example, a message from an online store like "Your order number XXXXX is ready for pickup" or a message from a bank saying "Your PIN code is XXXX". In this case, all you need to guess is a number.


> It is about time top executives will receive jail time for these kind of breaches.

I'm afraid jail time is not yet provided for in the GDPR.


The lack of serious consequences for doing harmful business is indeed one of the big issues in western society, imho. Breaching human rights like privacy is just a risk trade-off where the company makes a cold calculation of weighted costs. Companies are well protected against individuals by a legal and financial power imbalance.

Fines are usually something to laugh at and well worth the cost. Even if not, the burden is on the company and not the executive.

We need a system where the current executives get jail time for these kinds of crimes. This ensures that executives will only accept such a role when they are certain that the company is "clean", i.e. they won't risk jail time because of a previous CEO's fuckups.


not to disagree, but imo the main purpose of a corporation is to shield individuals from consequences, hence increasing the level of acceptable risk and with it the velocity of business development. (all for the greater good)


The liability shield may have made better sense when corporations were actually given charters because they were expected to serve a public good. That no longer seems to be the case, if it ever was. Now it seems that corporations exist only to make money. Public good be damned.


Gross negligence in data protection can be prosecuted under computer-crime acts by the respective Data Protection Office. There is precedent for this in the UK. However, this is not laid out in the European directives but usually in the local implementations of those directives.


Even if it was, it's not like it's enforced anywhere near enough anyway. This kind of crap is a clear violation of the GDPR, has a broad impact (all Android users are impacted) so it should be a slam-dunk case and yet is allowed to continue despite the regulation being in place for over 4 years now.


I understand the sentiment. But it is probably allowed (as you state) because until now nobody looked at it, and nobody took the time to inform the respective authorities.

I need to take a closer look at the report and see whether I am likely to be affected, so that I can send a complaint to my data protection agency in Schleswig-Holstein (the German state I reside in).

At least that is what I can do. And I can only urge any EU citizen (rightly) enraged by this to file an official complaint with their respective data protection agency.


> But probably is allowed (as you state) because until now nobody looked at it and then nobody took the time to inform the respective authorities.

Something illegal doesn't become illegal only after law enforcement is informed.

> So that I can send a complaint to my data protection agency in Schleswig-Holstein (German state I am residing in).

I wish you the best of luck and hope that your local DPA is much more competent than the UK one.


> Something illegal doesn't become illegal only after law enforcement is informed.

I didn't say that. Only that law enforcement (or data privacy regulators) probably were not aware of it happening.


It's ironic that this is being investigated by an Irish institution when all these practices are blessed (by stonewalling investigations and refusing to take any action) by the corrupt Irish data "protection" authority despite being an obvious breach of the GDPR.


The submitted title broke the site guidelines about editorializing. (Submitted title was 'Google Dialer and SMS App steal all conversation metadata and content hashes [pdf]'.) The rule, from https://news.ycombinator.com/newsguidelines.html, is: "Please use the original title, unless it is misleading or linkbait; don't editorialize."

If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...


I think the original title is less informative than the editorialized one. Is it link bait? Probably not.


This is what I hate most about Google and Facebook. If you want to choose to let them gather all of your data, fine. I disagree with it but it is ultimately your choice.

But as someone who has made every effort to remove both of these companies from my life as much as possible, if I communicate with someone who uses Android or someone who allows Facebook complete access to their contacts and other data... those companies still get MY data without my consent.

Somehow it has become legal for these companies to allow someone else to consent for them to get my data.


> I disagree with it but it is ultimately your choice.

Is it? If you want a modern smartphone experience (I don't mean one from 2005) you only get to choose which option you find 'less bad' between Apple and Google.


  > If you want a modern smartphone experience (I don't mean one from 2005) you only
  > get to choose which option you find 'less bad' between Apple and Google.
This is a market failure, is it not? I would be perfectly happy to pay twice what I paid for my Samsung Note 10 to another company that provides me with:

  - A useful stylus
  - A useful contacts suite and dialer that can sync CardDav (I use True Phone).
  - A useful calendar that can sync CalDav (I use Business Calendar 2).
  - Runs an Anki implementation (I use Ankidroid)
  - Can sync text files with a desktop computer (I sync .org and .md files via ADB)
  - Has a voice recorder
However, nobody makes such a phone. And this model of Note is not supported by e.g. LineageOS, though I suppose I could buy an older Note. The real problem is the stylus; no other phone has one.


I don't care about a stylus but I want a headphone socket.


This Note 10 Lite has one!

And I must admit, I've never used it. But I'm reassured that it is there.


I guess in that instance I am more referring to Facebook than Google.

Even with Google you have a choice (ish) in just how much data of yours they gather up since there are alternatives.

But true, the choice isn't really there, but I trust Apple more than Google personally.


> I trust Apple more than Google personally

Curious as to what the basis for this is? As I understood it, much if not most of Apple's OS and software is a black box, so we have no real idea what it tracks and/or sends back to Apple.


For me it's primarily what their business is.

Apple is in the business to sell me devices (and now services but with those devices). So while they may collect data about me, the value to them is ultimately different.

Where Google's entire business is advertising. So they benefit from getting as much data from me not only directly from my device but from browsing or any other source of data they can find. And then mining that data to sell me to the people paying for ads.

I have gotten to the point that I am... accepting... data collection as something that I just cannot get away from. I do what I can to limit it, but there is a limit to what I can realistically do about it. So what I care about is limiting how much of that data is in a single place, as well as what those places' business models are and what they might be doing with that data.


Apple has an advertising business as well.


Apple (without any Google or FB apps installed) is definitely the lesser of two evils in that scenario.


Apple collects less data, but they definitely still collect a lot. They're also extremely user-hostile (you can't choose what you want to install, only what Apple wants you to install).

In fact, the author of the present study published an article on the former topic last year:

> We investigate what data iOS on an iPhone shares with Apple and what data Google Android on a Pixel phone shares with Google. We find that even when minimally configured and the handset is idle both iOS and Google Android share data with Apple/Google on average every 4.5 mins. The phone IMEI, hardware serial number, SIM serial number and IMSI, handset phone number etc are shared with Apple and Google. Both iOS and Google Android transmit telemetry, despite the user explicitly opting out of this. When a SIM is inserted both iOS and Google Android send details to Apple/Google. iOS sends the MAC addresses of nearby devices, e.g. other handsets and the home gateway, to Apple together with their GPS location. Users have no opt out from this and currently there are few, if any, realistic options for preventing this data sharing.

Leith, D. J. (2021, September). Mobile Handset Privacy: Measuring The Data iOS and Android Send to Apple And Google. URL: https://www.scss.tcd.ie/doug.leith/apple_google.pdf


I use LineageOS exclusively to protect myself from such shenanigans. I hope it does at least, I don't know if they patched this shit out.


https://calyxos.org/

and

https://grapheneos.org/

are suited to what you want.


I recently tcpdump'd an Android 11 device going through the initial setup. About 60MiB was downloaded just to get to the launcher. It contacted not only Google, but the manufacturer and Microsoft, as well as a few domains I couldn't recognize.

It's impossible to install privacy tools (like AFWall) before being subjected to built-in spyware. The bootloader was locked by default and could only be unlocked after creating a Google account (which involves accepting their terms and conditions). It was also impossible to go through initial setup using a proxy server. It did use the proxy server I had set up to contact Google, but it wouldn't accept the connection as having internet connectivity (and refused to go any further) until traffic was routed through the default gateway.

I think it would be reasonable to add warnings to so-called "smart" devices the same way as packs of cigarettes. It'll never happen, sadly. If anything, the Powers That Be would rather make it illegal to "circumvent" or "tamper" with the locked-down computers that we used to own.


Offtopic:

I recently tried to set up a Bluetooth speaker my mom had bought (with me at her side). I had carefully looked at the packaging and there were no red flags visible.

At her flat I tried to connect it to her laptop. It just wasn't visible.

I looked at the user manual. For initial setup you had to install the vendor's app (and accept that it transmits any filename you play, any streaming account, and any song/audiobook/radio station you listen to).

There was no way of getting it to just connect via Bluetooth without allowing full data harvesting.

It instantly went back into its packaging, back to the shop.

In hindsight I should probably have filed a GDPR complaint, since the data tracking is not necessary for the functionality (the Bluetooth connection), and this information was not available before I bought it.


What manufacturer did you buy?


Sonos. Ordered a small Bose afterwards. I know the Bose headphones have their 'use the app' feature, but the smaller soundbars (not sure how the new ones are named right now) just work for pairing by long-pressing the BT button.

I was able to remotely lead my mom through the process on the phone.


I remember reading somewhere that Bose did something similar, but take this with a grain of salt as I might be confusing them.

What I know for sure is that Sonos does it as well - but they aren't Bluetooth speakers.


It was a small Sonos. I would need to look up the model. It was around the size of a Bose Soundbar (the smallest model). With those Bose Soundbars it is just a long press of the BT button. No app, no nothing. Just works for me.


> It is not possible to proceed to view the Messages app privacy policy without first agreeing to the additional Google Chrome terms and conditions

This entire report is damning, but I thought this was an especially silly tidbit towards the bottom of the report. I wonder if this is even legal.


For sending SMS on Android I heavily recommend QKSMS over the Google/Android defaults (https://github.com/moezbhatti/qksms).

I'm sure there are other OSS alternatives but this one just works for me.


I use Signal as my only messaging app. If someone doesn't have Signal installed it will automagically fall back to unencrypted SMS and show this using an open lock symbol and a grey send button.


Does this function without the Google Play Service?


I used it without Google Play Services before, but now I just use Signal (Works with SMS too) and it also works without Play Services. Note that if your ROM includes Play Services and you used adb to uninstall it you'll need to reinstall and disable it to get through setup (Something about their detection logic is off).


It doesn't, from what I can see on F-Droid.


Why not the one in aosp?


QKSMS has a richer feature set than any AOSP SMS/MMS app I've seen.


I'm normally pretty tolerant of Google's data collection because I've seen the internal privacy protections firsthand. Yes, some collection of ML jobs know everything about you. No, humans don't get the same access.

That said, this is disgraceful. I'm amazed that someone thought this was a good idea. I already use Signal for SMS, but I guess I need to find a new dialer app now. Anyone have suggestions?


For talking to businesses, claim you're deaf and that you need to communicate by e-mail or chat interface.

Use Signal for voice calls with friends.


The fact that you would even begin to consider this anywhere close to a viable option shows how insanely out of touch with actual people you are.


Have you tried? I had to cancel a Comcast account once by e-mail and they kept referring me to call their retention department so I told them I was deaf and wasn't able to call. It worked.

You can also threaten to get lawyers involved, that often helps deal with shitty customer service and they just give you what you want.

Honestly this is the kind of stuff Google Assistant should be automating with piles of conversational data, not some stupid haircut appointment. They have the data to know when is the optimal time to mention some hypothetical lawyers and can leverage that data to engineer the conversation on behalf of and in favor of the user.


>engineer the conversation on behalf of and in favor of the user.

I wish I had your optimism... should a system exist, businesses will be the one using it to find the optimal time to charge you more money. Just look at what they are doing with Crisis Text Line - your brain is being completely engineered by big tech. Proprietary technology is developed not to solve problems or make the world better, it is developed to make as much money for businesses as they possibly can.


Does messages still do this if you uncheck the box for "help make messages better" ? Because that seems like it COULD be the opt out for this. Or not - hard to say.


What would the purpose of collecting such information be from a "we only use this for better targeting ads" perspective? Establishing relationships in the case of "if someone you talk to regularly is interested in xyz you might be too"?


From the article:

>Google also provided clarification on the purposes of some of the data collection observed. Namely:

>1) The message hash is collected for detecting message sequencing bugs.

>2) Phone numbers are collected to improve regex pattern matching for automatic recognition of one-time passwords sent over RCS. Messages automatically recognizes incoming One-Time Password (OTP) codes to avoid the user having to fill them in. This can be a frequent point of failure and the phone number data is used to improve recognition by providing ground-truth based on known OTP sender numbers.

> 3) The ICCID data is used to support Google Fi.

> 4) Firebase Analytics logging of events (not including phone numbers) is used to measure the effectiveness of app download promotions (for Messages and Dialer specifically). Namely, to measure not only whether the app was downloaded but also whether it was used once downloaded.


>Messages automatically recognizes incoming One-Time Password (OTP) codes to avoid the user having to fill them in.

It's a shame that user laziness is cause for lack of privacy.


>It's a shame that user laziness is cause for lack of privacy.

This detection is done entirely on-device. In what way is it injurious to the user's privacy?


As they said, they collect phone numbers and this feature was the excuse.


Did you even read the comment I replied to? It relates to iOS, not Android.


Thank you for the link!

Dumb shill, but I have a Twitter account to collect instances of tech companies doing shady stuff. It's called "Programmers Against Humanity" (@software_crimes), and I would love if people could send me more. I want to bring awareness to how tech companies harm us.


Personally I'm waiting for a report showing that Google still swipes the location in some ways even when ‘location history’ and whatever the other toggle is, are both turned off. Because of course they do.

Reminder: US police use Google as a database of people who were at a particular place at a certain time. Of which people by definition most aren't subjects of investigation: https://www.nytimes.com/2019/04/13/technology/google-sensorv...


Fantastic research. Thank you to the researchers.


εxodus (privacy audit platform for Android applications) has good reporting on this kind of stuff: https://reports.exodus-privacy.eu.org/en/reports/com.google....


Basically nobody uses Google Messages in Europe. Still, if Google breaks the law, there will be consequences.


Source? I'm in Europe, and it's on my phone.


Yeah but what do you use it for apart from receiving verification codes?


I use it because I quite like the spam detection feature.


This is also why I still use it. Of all the SMS apps I've tried, it also has the most visually pleasing UX (for me personally). I get an incredible amount of spoofed SMS spam, most of it related to off-market real estate nonsense. Messages has been the only app successful in blocking them.


What kind of spam detection do you mean, isn't this a messaging service?


Yes. You can (and sadly do) receive spam SMS. Google Messages is quite good at spotting them


But do we know that it's the app and not an underlying library that other messaging apps use too?


What's extra embarrassing is that none of us did this ourselves. Maybe we can look forward to a "consumer reports" style organisation or CI/CD system that tests and tracks apps.

It wouldn't be used of course, but a little sunlight can do wonders.


I hope the EU slaps Google down hard for this egregious spying on its citizens, and that claiming ignorance is not a get-out-of-jail-free card.

Now please audit iMessage and iPhone’s dialer, I’ll pay to help commission this important work!


Question: what happens if you use a third party dialer (e.g. Samsung)?


Is anyone surprised?


I was a tad surprised by:

> when a request was made through Google Takeout for the Google Account data associated with the apps used for testing, the data Google provided did not include the telemetry data observed.

That google collects it is unsurprising. That they deny having it is somewhat surprising to me.


No one should be.


Thank you to whoever did this research and wrote it up...

I feel like this is the start of Google being another billion dollars poorer, and EU taxpayers being each a few euros richer...


I did some research on PII being harvested by Android apps.

TL;DR: The EU could pay a couple of reverse engineers for 2 months and print money out of fines.

I was shocked to discover that (back then?) apps could just straight up read the list of user accounts on the phone without any special permission... Which is mostly fine, except that many apps use your email/phone number as the account name or description. The same goes for the Wi-Fi SSID and other things.



