
Well, if GMail were open source and we could self-host it, we could get the same advantages.

Giving all your private data to a foreign company that serves the interests of its investors and acts directly against your and your nation's interests is NOT acceptable, and should NOT be common.




We detached this subthread from https://news.ycombinator.com/item?id=11359868 and marked it off-topic.


How is "if GMail were self-hosted, we'd have the benefits without the disadvantages" off-topic to the parent thread, which was about privacy concerns regarding GMail?


Please read the link I posted before replying. Mike Hearn explains that the biggest advantage large email service providers have in the war against spam is their centralization: partly because of their access to large amounts of data (obviously), but also because there is basically no known way to build a decentralized anti-spam engine that cannot be gamed by spammers.

Also, would you rather have users willingly give their data to a foreign company (legally liable, bound to a published privacy policy), or unwillingly to malware authors and credential phishers? In the current security landscape this is a very real tradeoff to think about.


> Also, would you rather have users willingly give their data to a foreign company (legally liable, bound to a published privacy policy), or unwillingly to malware authors and credential phishers? In the current security landscape this is a very real tradeoff to think about.

False dichotomy. Keep in mind, too, that those privacy policies are subject to change without consent and can be overridden by any country's government. So even within your false dichotomy you're just choosing _who_ takes your data, not whether someone does. I'd rather live in a world where I can use services without giving away my data.


You realize that almost all of the email filtering GMail does nowadays is based on networks trained on the content of the mail, not on the sending domains?

And in fact, if you train your own neural networks on this same task – as I'm currently doing – you get the same quality of categorization and spam filtering that Google gets.
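To make the "train on content, not domains" point concrete, here is a minimal content-based spam filter. It is a toy bag-of-words naive Bayes classifier in stdlib Python, not GMail's actual pipeline or the commenter's setup; all names and examples are illustrative:

```python
# Toy content-based spam filter: bag-of-words naive Bayes, stdlib only.
# Illustrative sketch of filtering on mail content rather than domains.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class NaiveBayesFilter:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, text, label):
        for tok in tokenize(text):
            self.counts[label][tok] += 1
            self.totals[label] += 1

    def score(self, text, label):
        # Log-probability of the text under the label, add-one smoothing.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        return sum(
            math.log((self.counts[label][tok] + 1) / (self.totals[label] + vocab))
            for tok in tokenize(text)
        )

    def is_spam(self, text):
        return self.score(text, "spam") > self.score(text, "ham")

f = NaiveBayesFilter()
f.train("cheap pills buy now", "spam")
f.train("meeting notes attached", "ham")
print(f.is_spam("buy cheap pills"))  # True
```

Real systems use far larger models and training sets, but the principle is the same: the classifier never looks at where the mail came from, only at what it says.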

I consider Google, the NSA, and so on just as trustworthy as a Nigerian scammer, so I see no difference between giving my data to Google and giving it to a phisher.

They operate under laws I can’t control, use my data in ways I can’t control, and don’t ask me before using my data for other purposes later on.


But how do you get access to large enough sets of training data? Wouldn't that always require plaintext access to other people's mail? Moreover, the data would have to be recent, so as to take into account trends in spammers' approaches.


Presumably you'd have to share training data across instances.


Simple: By hosting email for other people. /s

But yes, sharing training data (or trained networks) between people is the best solution.


The training data is emails. Sharing training data means your email is no longer contained within your end-to-end encryption; it's leaking all over the place. If you can find a way to extract useful training data from emails while ensuring it identifies nothing about you or the emails you've received, I'm sure you can make a lot of money with that. I suspect that if it's not impossible, it's extremely hard, and even harder to do right (that is, in a way we don't later find to be susceptible to some partial reconstruction of attributes).


Well, there is a way: Share the whole network.

Then you combine all networks that others share with you into a new one, and use that yourself.

Continuous recombination and cross-breeding of networks is the idea.
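One simple reading of "combining networks" is an ensemble: each shared model scores a message independently and the scores are averaged. This is a hypothetical sketch of that idea only; actually cross-breeding trained neural networks is a much harder, largely open problem:

```python
# Toy ensemble: combine shared spam models by averaging their scores.
# Each "model" is a callable returning a spam probability in [0, 1].

def ensemble_spam_score(models, message):
    scores = [m(message) for m in models]
    return sum(scores) / len(scores)

# Three hypothetical shared models with different opinions:
models = [
    lambda msg: 0.9 if "pills" in msg else 0.1,
    lambda msg: 0.8 if "buy" in msg else 0.2,
    lambda msg: 0.5,  # an uninformative model drags the average toward 0.5
]
print(ensemble_spam_score(models, "buy cheap pills"))  # ~0.73
```

An averaging ensemble degrades gracefully: one weak or useless model dilutes the score but cannot single-handedly flip a confident decision by the others.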


If you share the network, and allow members of the network to use their own data to help train, how do you prevent spammers from joining and submitting garbage, or worse, targeted updates to make specific spam pass?


Well, the idea is that you can check networks against your own set of organised data: if adding network X reduces overall effectiveness, you just stop using network X, and X's score is reduced.

EDIT: As HN prevents me from adding new comments right now (Seriously, HN, allow us to post more than 3 comments per hour, it’s seriously hard to hold a conversation like this), I’ll answer your comment here:

Users would train networks locally based on their own decisions. Those networks would then be submitted to a repo, and you’d get other networks in return. If a network sorts badly (i.e., you always undo its sorting manually), you will not receive networks with similar sorting behaviour next time.

The concept would automatically prevent people from adding malicious networks – as they’d end up in the local blacklist of users.

Obviously you wouldn’t blacklist the network itself, but a representation of its concept of sorting.
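The vetting rule above can be sketched in a few lines. This is an illustrative toy, assuming each shared model outputs a spam probability and that you hold a small set of your own hand-labelled mail; the function names are made up for this sketch:

```python
# Sketch: accept a newly shared model only if it does not hurt accuracy
# on your own labelled mail; otherwise it goes on the local blacklist.

def ensemble_accuracy(models, labelled_mail):
    """Average the models' spam probabilities; predict spam above 0.5."""
    hits = 0
    for msg, is_spam in labelled_mail:
        avg = sum(m(msg) for m in models) / len(models)
        hits += (avg > 0.5) == is_spam
    return hits / len(labelled_mail)

def keep_candidate(ensemble, candidate, labelled_mail):
    """Keep the candidate only if overall effectiveness does not drop."""
    with_it = ensemble_accuracy(ensemble + [candidate], labelled_mail)
    without = ensemble_accuracy(ensemble, labelled_mail)
    return with_it >= without

my_mail = [("buy pills now", True), ("lunch tomorrow?", False)]
trusted = [lambda m: 0.9 if "pills" in m else 0.1]
spammy = lambda m: 0.0  # malicious network trying to wave all spam through
print(keep_candidate(trusted, spammy, my_mail))  # False: it gets rejected
```

The check needs no access to anyone else's training data, only to mail you have already classified yourself, which is what makes it compatible with the decentralized setup being proposed.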


So, how are these networks getting their data? Users submitting it? That means users are reducing their individual security to increase the security of the group as a whole. You are then left with either just consuming this data (and staying secure) or contributing, and we're back at the same point: data needs to be shared so it can be trained against.

Let's also look at the incentives for these networks whose data you can subscribe to. How are they supposed to keep spammers out? Any sort of vetting and management of the individual networks will be non-negligible work, and if it's not funded it will be at a disadvantage against spammers who are doing this for profit.

Finally, I'm not sure that training sets like this can easily be combined without a massive amount of reprocessing, if at all. I'm not familiar enough with the classifier networks involved to know, but I suspect that problem alone ranges somewhere from "non-trivial" to "very hard", if it isn't already solved.

It sounds good, and in a perfect world we'd have well run and managed shared networks of fully anonymized spam/phishing classification training data that was easy to combine into individual personal classifiers without having to heavily re-process large training sets.

I'm just not sure how feasible the individual parts of that are, much less them combined into a whole.


I think this idea certainly has interesting parts. We would at least need the following parts:

* Neural nets or similar trained models, which we could prove don't leak information on the data they are trained on.

* A way to combine these models, without access to their training data, in a way that works as well as training a new model on the union of the data.

* A way to exclude spammers. If the models vote on each message, perhaps we would be okay as long as fewer than 50% of the contributors are spammers.
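The last point, tolerating spammers as long as they stay below half the contributors, can be illustrated with a toy majority vote. This only sketches the assumption; it is not a robustness proof, and real attacks (Sybil accounts, targeted model updates) are exactly what would break it:

```python
# Toy majority vote: honest models in the majority cannot be outvoted
# by malicious ones. Each model casts a boolean "this is spam" vote.

def majority_says_spam(models, message):
    votes = sum(1 for model in models if model(message))
    return votes > len(models) / 2

honest = [lambda msg: "pills" in msg] * 3   # 3 honest classifiers
malicious = [lambda msg: False] * 2         # 2 spammer-controlled models
print(majority_says_spam(honest + malicious, "buy pills now"))  # True
```

With 3 honest voters against 2 malicious ones the spam is still caught; flip the proportions and the malicious models win every vote, which is why the 50% threshold matters.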


The same way uBlock manages its lists of ads.


How do they do that? By some kind of peer review?


Google as trustworthy as a Nigerian scammer?

Can't remember the last time google stole money from me... Because it's never happened.

As someone who currently and previously has worked at Google, you sir are an irrational lunatic. Hope your tin foil is industrial strength!!


> you sir are an irrational lunatic

Personal attacks are not allowed on Hacker News, whether someone else is wrong or not. Please don't do this again.


> Google as trustworthy as a Nigerian scammer?

Google has been completely useless in protecting my data and emails.

Google has shared them with governments all over the place, including the US government and the NSA. Some parts obviously willingly, some parts out of negligence.

If Google can’t protect my emails from being read by any government, or being read by any employee or algorithm from Google that wishes to use my data for profiling, advertisements, or any other purpose that isn’t directly required to fulfill the tasks I gave it, then it is not doing what it’s supposed to do.

Also, accusing someone of wearing a tin foil hat just makes you sound crazy in the post-Snowden world.

Snowden has shown us proof that the NSA has had access to all the data from many US companies, including Google.

If you can prove otherwise, please do so – but currently, Google deserves no trust at all.


Even if they didn't, anyone you email at any non-GMail domain would be suspected of doing the same thing (willingly or not).

You're better off just trying to severely limit the amount of confidential information you put in email.


So, if I have something to hide, I should just stop saying that in public?

Seriously?

You suggest I censor myself instead of trying to act against the companies and governments that are acting against my freedoms?


Obviously yes. That's how hiding works.


So, I should communicate with pigeons or what if I don’t want the NSA or Google to know my political preferences? Seriously?


Pigeons have very easily exploitable security vulnerabilities. No, you should use end-to-end encryption if you want security. But the other end is the weak link, so you have to trust it.



