
From a computer science perspective, what should Google do to train its models in a privacy conscious way?


It's not really a computer science issue, per-se.

Operationally, though, here are a few thoughts:

- Ensure that the people doing this review do not have access to the systems outside of a secured environment (one in which outside audio and video recording devices are not allowed and are actively monitored for, with physical access controls, etc.). Basically, not remote workers or a typical office environment. Most finserv call centers do this, so it's not particularly crazy to think Google could do the same.

- Mask the voices such that they are intelligible but not identifiable (a rough sketch of one way to do this follows the list). Maintain a limited set of "high access" taggers who can hear the raw clips if there is an issue with the masking.

- Limit the length of the clips (sounds like they already do this).

- Have pre-filters for anything personally identifiable in the audio. The metadata for the audio might already be de-identified, but what if the clip consists of the person reading out a phone number, credit card number, username, etc.? The "high access" team should be building detectors for that, flagging those portions of the audio (or whole clips) and routing them to a limited-access team (a rough sketch of such a filter also follows the list).

- Make it clearer to customers that their audio, including accidental captures, can and will be sent to Google's servers. Make this very explicit, rather than burying it in the TOS under terms like "audio data": "The device may accidentally interpret your bondage safe word as a trigger and send your private conversations to our real-live human tagging team for review."

- Provide a physical switch that can temporarily disable the audio recording capability.

- Pay money, like cigarette companies do, to help fund a public education campaign that informs the general public about these listening bugs and mass surveillance issues, so that people are aware of industry practices and how they affect them.
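
For the voice-masking item above, here's a minimal sketch of one possible approach: pitch-shifting clips with librosa so the words stay intelligible while the speaker is harder to recognize. The file names and semitone offset are made up, and pitch shifting alone is not a strong anonymizer, so treat it as illustrative only:

    # Rough sketch of the "mask the voices" idea: pitch-shift a clip so it stays
    # intelligible but is harder to attribute to a specific speaker.
    # File names and the semitone offset are placeholders.
    import librosa
    import soundfile as sf

    def mask_voice(in_path, out_path, semitones=4):
        # Load the clip at its native sample rate.
        audio, sample_rate = librosa.load(in_path, sr=None)
        # Shift pitch by a few semitones to obscure speaker identity.
        shifted = librosa.effects.pitch_shift(audio, sr=sample_rate, n_steps=semitones)
        sf.write(out_path, shifted, sample_rate)

    mask_voice("clip_raw.wav", "clip_masked.wav")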
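And for the PII pre-filter item, a rough sketch of what a first-pass detector over machine transcripts might look like. The patterns, field names, and routing are hypothetical, and real coverage (names, addresses, digits spoken as words) is much harder than this:

    # Rough sketch of the pre-filter idea: scan a machine transcript for obvious
    # personally identifiable strings and route flagged clips to a restricted team.
    # Patterns and field names are illustrative only.
    import re

    PII_PATTERNS = {
        "phone": re.compile(r"\b(?:\+?\d[\s.-]?){7,14}\d\b"),
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    }

    def flag_pii(transcript: str) -> list[str]:
        # Return the PII categories detected in a transcript.
        return [name for name, pattern in PII_PATTERNS.items()
                if pattern.search(transcript)]

    clip = {"id": "a1b2", "transcript": "sure, my number is 415 555 0123"}
    hits = flag_pii(clip["transcript"])
    if hits:
        print(f"route clip {clip['id']} to restricted review: {hits}")
    else:
        print(f"clip {clip['id']} can go to the normal tagging queue")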

Edit:

I like what others are saying about explicit opt-in, as well as paid end users. For quality/safety control, I don't know that they can exclusively use paid end users; they probably need to sample some amount of real, live data. For that, explicit opt-in makes sense.


> It's not ... a computer science issue.

full stop.


Pay beta testers who know what they're signing up for? Train the software with previously human-transcribed audio (like TV/audiobooks/etc.)?


How about two Google Homes: the $100 Privacy version and the $30 Opt-in Testing version. That way some people get their privacy and I get Chicken Little to subsidize my cheaper IoT products. =)


The problem with this is that "Chicken Little" is probably not an adventurous person, but rather a poor person. This makes privacy another privilege that can be bought. This devolves into another way to exploit poorer people.


> Train the software with previously human transcribed audio (like TV/audiobooks/etc)

They've very likely already done this.


Virtually every DVD/Blu-ray ever produced has closed captioning. For the English language, that must be over 100,000 hours.

I suppose it's no coincidence that this article is about the Flemish community... there may not be as many closed-captioning options there.


>From a computer science perspective, what should Google do to train its models in a privacy conscious way?

Install these devices in the homes of Google employees and executives, and in Google offices, and allow the public to listen in. What's good for the goose is good for the gander and all.

Maybe once Google has trained the systems enough that it no longer needs to collect and listen to customers' conversations, then it can put the devices into the stream of commerce.


I'm not sure Google has a diverse enough employee base for this to work. I'd imagine most Google employees are tech workers, so the training data collected would not accurately represent the general population.


Use employees, paid testers and, maybe, informed beta customers. Like every other product.


Explicit opt-ins?

"Google would like to use the last 15 minutes of voice data to improve Google Home. This may contain sensitive information. Do you approve?"


Everyone would just decline because there's no incentive to agree, and their product would die. Plus it would annoy users and cause stress.


Is that our problem? If a product relies on dark pattern acceptance of being "spied upon" to succeed, why should we want it to succeed?


Not necessarily. I've had Google Voice (and maybe iPhone?) ask me if it can use my voicemail to improve transcription, and I always say yes. But it's an explicit ask each time (if I remember correctly), which is nice.


You underestimate how many people are willing to take altruistic actions when the (perceived) cost is sufficiently small.

The main downside is an extra question during an onboarding flow.


Hire people to talk to the models?


Pay real humans for user studies that the humans knowingly sign up for.


There's an enormous amount of publicly available data to crunch all over the world: podcasts, TV shows, talk radio, YouTube, etc.


They should put a sticker on the box that explicitly states "Audio from this device will be listened to by Doug in Indiana"


Decide if it even can, and if the answer is no, then don't do it.


Use test subjects who sign up explicitly and specifically for this purpose, rather than burying it in a massive TOS written in legalese that can change at any time and is applied indiscriminately to all customers.


They can't. They need to transcribe the data.


Not true. Apple has demonstrated how to do it and has published a paper on the topic.

A. Bhowmick et al. Protection Against Reconstruction and Its Application in Private Federated Learning, 2018.

Don’t just dismissively say “oh that’s different and doesn’t apply” until you read it and understand the range of problems Apple is tackling with this and similar approaches. A lot can be done when the organization makes privacy a priority.
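For a flavor of what that family of techniques looks like, here's a toy sketch: each device clips and noises its local update before anything leaves the phone, and the server only averages the already-noised updates. The constants and the tiny "model" are invented for illustration; the actual paper's mechanisms and guarantees go well beyond this:

    # Toy sketch of private federated learning: devices train locally and only
    # upload a clipped, noised update, so the server never sees raw user data.
    # Constants, sizes, and the simulated gradients below are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    CLIP_NORM = 1.0      # max L2 norm of any single device's update
    NOISE_STD = 0.5      # Gaussian noise added on-device before upload

    def private_client_update(local_gradient):
        # Clip and noise a local update before it leaves the device.
        norm = np.linalg.norm(local_gradient)
        clipped = local_gradient / max(1.0, norm / CLIP_NORM)
        return clipped + rng.normal(0.0, NOISE_STD, size=clipped.shape)

    # Server side: average the noised updates from many simulated devices.
    client_gradients = [rng.normal(size=8) for _ in range(100)]
    global_step = np.mean([private_client_update(g) for g in client_gradients], axis=0)
    print(global_step)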

Let’s not forget I am replying to a person who has chosen to use a throwaway account for this topic. Why did they do that? Hmmm.


Is this why Siri has worse performance compared to others?


Hah! Good one. I hope not. I think the Siri people at Apple have just gotten lazy. Maybe they are too impressed with themselves and focusing on BS things like sports scores. I hope they get more ambitious soon.


^This. I'm sure they're optimizing to only transcribe data they're pretty confident they got wrong (multiple requests; the user has to go in and correct it), but if you want software to do what the Google Home software does, you have to do this.
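
Something like this kind of confidence-based sampling, where the field names and threshold are hypothetical:

    # Illustrative sketch: only send utterances the recognizer was unsure about
    # (or that the user corrected) to human reviewers. Threshold and field
    # names are hypothetical.
    CONFIDENCE_THRESHOLD = 0.6

    utterances = [
        {"id": 1, "confidence": 0.95, "user_corrected": False},
        {"id": 2, "confidence": 0.41, "user_corrected": False},
        {"id": 3, "confidence": 0.88, "user_corrected": True},
    ]

    needs_review = [u for u in utterances
                    if u["confidence"] < CONFIDENCE_THRESHOLD or u["user_corrected"]]
    print([u["id"] for u in needs_review])   # -> [2, 3]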


There are a lot of things that Google could have done better in this situation without meaningfully impacting their software's quality.


I guess. I'm still not of the opinion anything being done here is really 'wrong' (unless the annotators are mis-using the data).



