
From a computer science perspective, what should Google do to train its models in a privacy conscious way?


It's not really a computer science issue, per-se.

Operationally, though, here are a few thoughts:

- Ensure that the people doing this review do not have access to the systems outside of a secured environment (one in which outside audio and video recording devices are not allowed and are actively monitored for, with physical access controls, etc.). Basically, not remote workers or a typical office environment. Most finserv call centers do this, so it's not particularly crazy to think Google could do the same.

- Mask the voices such that they are intelligible but not identifiable (a rough sketch of one way to do this follows the list). Maintain a limited set of "high access" taggers who can hear the raw clips if there is an issue with the masking.

- Limit the length of the clips (sounds like they already do this).

- Have pre-filters for anything personally identifiable in the audio. The metadata for the audio might already be de-identified, but what if the clip consists of the person reading out a phone number, credit card number, username, etc.? The "high access" team should be building detectors for that, flagging those portions of the audio (or whole clips) and routing them to a limited-access team (a rough sketch of such a filter also follows the list).

- Make it clearer to customers that their audio, including accidental captures, can and will be sent to Google's servers. Make this very explicit, rather than burying it in the TOS under terms like "audio data": "The device may accidentally interpret your bondage safe word as a trigger and send your private conversations to our real-live human tagging team for review."

- Provide a physical switch that can temporarily disable the audio recording capability.

- Pay money, like cigarette companies do, to help fund a public education campaign that informs the general public about these listening bugs and mass surveillance issues, so that people are aware of industry practices and how they affect them.
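
For the voice-masking item above, here's a minimal sketch of one possible approach: pitch-shifting clips with librosa so the words stay intelligible while the speaker is harder to recognize. The file names and semitone offset are made up, and pitch shifting alone is not a strong anonymizer, so treat it as illustrative only:

    # Rough sketch of the "mask the voices" idea: pitch-shift a clip so it stays
    # intelligible but is harder to attribute to a specific speaker.
    # File names and the semitone offset are placeholders.
    import librosa
    import soundfile as sf

    def mask_voice(in_path, out_path, semitones=4):
        # Load the clip at its native sample rate.
        audio, sample_rate = librosa.load(in_path, sr=None)
        # Shift pitch by a few semitones to obscure speaker identity.
        shifted = librosa.effects.pitch_shift(audio, sr=sample_rate, n_steps=semitones)
        sf.write(out_path, shifted, sample_rate)

    mask_voice("clip_raw.wav", "clip_masked.wav")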
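And for the PII pre-filter item, a rough sketch of what a first-pass detector over machine transcripts might look like. The patterns, field names, and routing are hypothetical, and real coverage (names, addresses, digits spoken as words) is much harder than this:

    # Rough sketch of the pre-filter idea: scan a machine transcript for obvious
    # personally identifiable strings and route flagged clips to a restricted team.
    # Patterns and field names are illustrative only.
    import re

    PII_PATTERNS = {
        "phone": re.compile(r"\b(?:\+?\d[\s.-]?){7,14}\d\b"),
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    }

    def flag_pii(transcript: str) -> list[str]:
        # Return the PII categories detected in a transcript.
        return [name for name, pattern in PII_PATTERNS.items()
                if pattern.search(transcript)]

    clip = {"id": "a1b2", "transcript": "sure, my number is 415 555 0123"}
    hits = flag_pii(clip["transcript"])
    if hits:
        print(f"route clip {clip['id']} to restricted review: {hits}")
    else:
        print(f"clip {clip['id']} can go to the normal tagging queue")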

Edit:

I like what others are saying about explicit opt-in, as well as paid end users. For quality/safety control, I don't know that they can exclusively use paid end users; they probably need to sample some amount of real, live data. For that, explicit opt-in makes sense.


> It's not ... a computer science issue.

full stop.


Pay beta testers who know what they're signing up for? Train the software with previously human-transcribed audio (like TV/audiobooks/etc.)?


How about two Google Homes: the $100 Privacy version and the $30 Opt-in Testing version. That way some people get their privacy and I get Chicken Little to subsidize my cheaper IoT products. =)


The problem with this is that "Chicken Little" is probably not an adventurous person, but rather a poor person. This makes privacy another privilege that can be bought. This devolves into another way to exploit poorer people.


> Train the software with previously human transcribed audio (like TV/audiobooks/etc)

They've very likely already done this.


Virtually every DVD/Blu-ray ever produced has closed captioning. For the English language, that must be over 100,000 hours.

I suppose it's no coincidence that this article is about the Flemish community... there may not be as many closed-captioning options there.


>From a computer science perspective, what should Google do to train its models in a privacy conscious way?

Install these devices in the homes of Google employees and executives, and in Google offices, and allow the public to listen in. What's good for the goose is good for the gander and all.

Maybe once Google has trained the systems enough that it no longer needs to collect and listen to customers' conversations, then it can put the devices into the stream of commerce.


I'm not sure Google has a diverse enough employee base for this to work. I'd imagine most Google employees are tech workers, so the training data collected would not accurately represent the general population.


Use employees, paid testers and, maybe, informed beta customers. Like every other product.


Explicit opt-ins?

"Google would like to use the last 15 minutes of voice data to improve Google Home. This may contain sensitive information. Do you approve?"


Everyone would just decline because there's no incentive to agree, and their product would die. Plus it would annoy users and cause stress.


Is that our problem? If a product relies on dark pattern acceptance of being "spied upon" to succeed, why should we want it to succeed?


Not necessarily. I've had Google Voice (and maybe iPhone?) ask me if it can use my voicemail to improve transcription, and I always say yes. But it's an explicit ask each time (if I remember correctly), which is nice.


You underestimate how many people are willing to take altruistic actions when the (perceived) cost is sufficiently small.

The main downside is an extra question during an onboarding flow.


Hire people to talk to the models?


Pay real humans for user studies that the humans knowingly sign up for.


There's an enormous amount of publicly available data to crunch all over the world: podcasts, TV shows, talk radio, YouTube, etc.


They should put a sticker on the box that explicitly states "Audio from this device will be listened to by Doug in Indiana"


Decide if it even can, and if the answer is no, then don't do it.


Use test subjects who sign up explicitly and specifically for this purpose, rather than burying it in a massive TOS written in legalese that can change at any time and is applied indiscriminately to all customers.


They can't. They need to transcribe the data.


Not true. Apple has demonstrated how to do it and has published a paper on the topic.

A. Bhowmick et al. Protection Against Reconstruction and Its Application in Private Federated Learning, 2018.

Don’t just dismissively say “oh that’s different and doesn’t apply” until you read it and understand the range of problems Apple is tackling with this and similar approaches. A lot can be done when the organization makes privacy a priority.
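For a flavor of what that family of techniques looks like, here's a toy sketch: each device clips and noises its local update before anything leaves the phone, and the server only averages the already-noised updates. The constants and the tiny "model" are invented for illustration; the actual paper's mechanisms and guarantees go well beyond this:

    # Toy sketch of private federated learning: devices train locally and only
    # upload a clipped, noised update, so the server never sees raw user data.
    # Constants, sizes, and the simulated gradients below are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    CLIP_NORM = 1.0      # max L2 norm of any single device's update
    NOISE_STD = 0.5      # Gaussian noise added on-device before upload

    def private_client_update(local_gradient):
        # Clip and noise a local update before it leaves the device.
        norm = np.linalg.norm(local_gradient)
        clipped = local_gradient / max(1.0, norm / CLIP_NORM)
        return clipped + rng.normal(0.0, NOISE_STD, size=clipped.shape)

    # Server side: average the noised updates from many simulated devices.
    client_gradients = [rng.normal(size=8) for _ in range(100)]
    global_step = np.mean([private_client_update(g) for g in client_gradients], axis=0)
    print(global_step)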

Let’s not forget I am replying to a person who has chosen to use a throwaway account for this topic. Why did they do that? Hmmm.


Is this why Siri has worse performance compared to others?


Hah! Good one. I hope not. I think the Siri people at Apple have just gotten lazy. Maybe they are too impressed with themselves and focusing on BS things like sports scores. I hope they get more ambitious soon.


^This. I'm sure they're optimizing to only transcribe data they're pretty confident they got wrong (multiple requests; the user has to go in and correct it), but if you want software to do what the Google Home software does, you have to do this.
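
Something like this kind of confidence-based sampling, where the field names and threshold are hypothetical:

    # Illustrative sketch: only send utterances the recognizer was unsure about
    # (or that the user corrected) to human reviewers. Threshold and field
    # names are hypothetical.
    CONFIDENCE_THRESHOLD = 0.6

    utterances = [
        {"id": 1, "confidence": 0.95, "user_corrected": False},
        {"id": 2, "confidence": 0.41, "user_corrected": False},
        {"id": 3, "confidence": 0.88, "user_corrected": True},
    ]

    needs_review = [u for u in utterances
                    if u["confidence"] < CONFIDENCE_THRESHOLD or u["user_corrected"]]
    print([u["id"] for u in needs_review])   # -> [2, 3]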


There are a lot of things that Google could have done better in this situation without meaningfully impacting their software's quality.


I guess. I'm still not of the opinion anything being done here is really 'wrong' (unless the annotators are mis-using the data).



