
It does not really matter anymore what bulk usage data collection is called or whether it is "privacy-preserving".

Looking at the current developments in AI, I am concerned that AI models can easily de-anonymize and guess endpoint users when fed with "telemetry data" from hundreds of thousands of clients.

I hear and read a lot of software and hardware vendors saying that "telemetry" is supposed to somehow magically improve the user experience in the long run, but in actuality software tends to get worse, more unstable and less useful.

So, I would like to know how exactly any telemetry data from Fedora Linux clients is going to help them, or how it is going to improve anything.




I disagree wholeheartedly. Privacy-preserving technologies, including privacy-preserving AI (e.g., federated learning, homomorphic encryption) and privacy-preserving data linkage/fusion, are really important. They're crucial in my day-to-day work in aviation safety, for example.
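To give a concrete, if toy, flavor of the federated idea: each client reduces its raw data locally and only aggregates ever leave the machine. The sketch below is a generic illustration (not my tooling and not Fedora's proposal), with made-up sample values:

  # Toy federated aggregation: each client reduces its raw samples to a
  # (sum, count) pair locally; the server only ever sees those aggregates.
  from typing import List, Tuple

  def client_update(local_samples: List[float]) -> Tuple[float, int]:
      return sum(local_samples), len(local_samples)   # raw samples stay on the client

  def server_aggregate(updates: List[Tuple[float, int]]) -> float:
      total = sum(s for s, _ in updates)
      count = sum(n for _, n in updates)
      return total / count if count else 0.0

  # e.g. mean application start-up time across clients, without pooling raw data
  updates = [client_update([1.2, 0.9]), client_update([2.4]), client_update([0.7, 1.1, 1.0])]
  print(server_aggregate(updates))

Real federated learning does the same thing with model updates instead of plain sums, and layers secure aggregation and/or differential privacy on top.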

And telemetry is important. We have limited resources. How do we determine the number of users impacted by a bug or security vulnerability? Do we have a bug in our updater or localization? Are we maintaining code paths that aren't actually used? Telemetry doesn't magically improve user experience, but I'd rather make decisions based on real data than on the squeakiest wheel in the bug tracker.

We can certainly make flawed decisions based on data, but I'd argue that we're more likely to make flawed decisions with no data.


> We can certainly make flawed decisions based on data, but I'd argue that we're more likely to make flawed decisions with no data.

What I've seen in practice so far is that the use of telemetry has harmed software quality more than helped. It often leads developers to optimize for the wrong things and make poor design decisions. This happens because they tend to think that "the data never lies", ignoring the fact that telemetry always gives a skewed and incomplete picture.


Do you have some specific examples of this playing out?

I’ve been a product manager for products that had no telemetry, and that can be a rather undesirable place to operate, especially if you’re in the enterprise space where product changes can impact the operations of businesses.

I think it’s certainly possible to focus on the wrong things, but I don’t see that as an outcome of telemetry itself as much as an outcome of a product team that doesn’t understand the problem space or customer base.

The attributes to capture are presumably based on what teams understand to be key indicators about their app/service. I think confident incorrectness armed with bad data is just a slightly different version of a complete lack of data. Such a team was operating on whatever they imagined to be important before, and they continue to do so after, albeit with greater conviction.

But good telemetry in the hands of a good product team can be immensely beneficial for decision making and can protect customers from bad decisions. Anecdotally, my ability to pull numbers about certain attributes has been key to shutting down executive pressure for changes that would have drastically impacted customers, because I had direct evidence that they would.

I’m also not claiming that downsides don’t exist, and privacy is always my primary concern, but there are a range of outcomes based on the maturity of a team/company, and as long as the PM understands that data is not an alternative to having a relationship with customers, I think data is pretty important.


> Do you have some specific examples of this playing out?

Honestly, I can't actually remember specific examples. It's not something I dwell on. But I know it's happened several times that software has been made much less useful to me because features were removed on the basis of being rarely used, ignoring the fact that even though they're rarely needed, when they are needed, they're indispensable.

More often, though, the bad telemetry-based decisions I've seen are around UI changes. Things like a laser focus on reducing the number of clicks it takes to perform tasks, even though sometimes reducing the clicks for a thing adversely impacts its usability.

For bad telemetry-based UI decisions, my standout example is Firefox, although that's hardly the only one.

> But good telemetry in the hands of a good product team can be immensely beneficial for decision making and can protect customers from bad decisions.

This was actually my point of view a few years back, when telemetry started to become popular. And, as a dev who sells software commercially, I totally understand the value on that side. My experience with products that have used it, though, has shifted my view.

All that said, I do agree that it's possible to use telemetry in a way that is good for users. But I don't think it's common, and I think the reason for that is economics and human nature.

Once you start measuring a thing, that tends to become a goal rather than just a data point. And since the industry is all about maximizing velocity, that effect is even stronger. Doing proper usability studies is a slow and expensive process. Telemetry can be a useful thing as part of that process, but the tendency is to make it pretty much the entire process. That does a disservice to everybody.

> the PM understands that data is not an alternative to having a relationship with customers, I think data is pretty important.

Not just the PM. The entire company. But a relationship should be consensual, not forced. I have zero issues with opt-in telemetry. When it's not opt-in, though, it's an invasion and adversarial. I presume that's not the sort of relationship a good PM wants.


I think this is a good view and explanation. At the end of the day though, the only thing that matters is what you pointed out:

> But a relationship should be consensual, not forced. I have zero issues with opt-in telemetry. When it's not opt-in, though, it's an invasion and adversarial.

That's it. No matter how many ways you dice it, collecting data without consent or forcing opt-out is an invasion. As more and more of our lives shift to being online, our privacy and our sense of autonomy in the digital world are ever more paramount.


> Not just the PM. The entire company.

Ideally, yes. In practice, and especially in larger shops, it’s the PM’s job to own this relationship and to make sure the important players have this understanding.

> But a relationship should be consensual, not forced.

Absolutely agree here.


> Do you have some specific examples of this playing out?

A large software company in Redmond, perhaps?


So, you use telemetry to figure out why planes are [nearly] crashing?

Do you work for Boeing or something?

When I've worked on mission-critical (so, in practice, safety-critical) systems, we made sure the probability of catching a failure in testing was 100x the chance of catching it in production.

Modern software development techniques like fault injection and fuzzing make this pretty easy to achieve.


Close enough. I work for MITRE and the FAA, leading our efforts to identify aviation safety hazards and also improve aeromedical certification, so I do work closely with the airlines, OEMs, unions, trade orgs, and other stakeholders.

We use de-identified voluntary safety reports filed by pilots, air traffic controllers, and others, along with flight telemetry data from the aircraft and other data to identify and study potential safety issues in the national airspace. Privacy-preserving techniques ensure that we can collaborate on safety and trust that the data stays non-attributional (and thus, non-punitive since participation is voluntary) despite competing interests.

We can't really do fault injection or fuzzing for real-world systems to understand, say, the impact of false low altitude alerts on risk of undesired aircraft states (e.g., controlled flight into terrain) at a certain airport.


I get your point, I really do. And it got me thinking. I suppose you're right concerning bugs and safety aspects. But usage pattern collection is a huge problem, not only privacy-wise. This is what I do not understand:

Why are there code paths nobody uses, that have to be maintained, in the first place?


Unused code paths were once used or it was thought that they would be used. If you don't know if something is used, it's safest to assume it is being used or is there for some reason (Chesterton's Fence). If it's being used, you need to maintain it lest there be a regression.


For example, let's say we implement support for a web standard or vendor prefix. We can mark it as deprecated.

But how do we know the code path is no longer in use? Are people still using this CSS property (e.g., a vendor prefix)? Are people still using gopher or this one configuration variable? The more configuration options you have, the more combinations you need to test and maintain: with just ten boolean options there are already 2^10 = 1024 combinations.
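As a concrete (entirely made-up) illustration of the instrumentation question: bump a local counter behind the deprecated path and report it only for users who opted in. None of the names below are real Fedora or GNOME APIs:

  from collections import Counter

  deprecated_hits = Counter()   # hypothetical local, opt-in usage counter

  def apply_css_property(name: str, value: str) -> None:
      if name.startswith("-vendor-"):    # imaginary deprecated prefix
          deprecated_hits[name] += 1     # counted locally, reported only with consent
      # ... actually apply the property ...

  # If deprecated_hits stays empty across opted-in users for a release or two,
  # the code path becomes a much safer candidate for removal.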


I'm working with monitoring systems, and the thing I keep hearing is that people rely on having a 'number', but when pressed they admit that the number isn't reliable.

Data doesn't always help! It can lead to assumptions, often really bad ones. And IBM isn't to be trusted with it.


If you read the comment thread for that proposal you'll see that the person writing the proposal doesn't care about such niceties and has a pretty loose definition of 'privacy preserving'.


> Do we have a bug in our updater or localization?

It can be done with error reporters like the "System program problem detected. Do you want to report the problem now?" popup in Ubuntu. In my experience, many users are willing to send error reports, and they're extremely useful, although 90% of reports are garbage.


There must be meaningful consent.


You really don't need AI to do this. Collect enough data points and you can fingerprint basically anyone using very old fashioned techniques. AI doesn't really bring anything new to the table.


What AI can sometimes add is automatic feature extraction, rather than having a human explicitly think "I bet we can identify people via $x" and write bespoke code to do it. E.g. this has been the big deal with AI in medical diagnostics: it can look at the same data a very skilled human doctor sees and still manage to discover something the human couldn't, due to unnoticed features.


Yeah but we've got old techniques for that too, Latent Dirichlet Allocation etc.


AI would need far fewer data points to achieve an acceptable result, and maybe would even be capable of fingerpointing, not just fingerprinting. Just wait and see how the data broker bros start selling AI-assisted data mining for ads. Big tech is doing that already. It's mind-boggling how everyone seems to be just OK with that.


"Only 5% of users use this feature, so we will remove it to save development efforts."

As seen in Firefox..


You're not wrong. While they do not yet have a list of metrics they'd like to collect (from their initial mailing list post [1]), it's stated as an idea in there:

> We also want to know how frequently panels in gnome-control-center are visited to determine which panels could be consolidated or removed, because there are other settings we want to add, but our usability research indicates that the current high quantity of settings panels already makes it difficult for users to find commonly-used settings.

Personally I'd like to see more transparency in their usability research, because GNOME is best known for removing features, which is what they'd like to do this time around as well.

[1]: https://lwn.net/ml/fedora-devel/CAJqbrbeOZrHvYjvMCc=qGZD_VXB...


I agree that the usability research should be transparent and data-driven. As a statistician, I'd rather have something I can actually critique instead of "we just didn't think it was user-friendly."


Counterpoint: Steve Jobs


Even post-Jobs Apple is opt-in for telemetry or crash reports.

Apple also does a lot of (non-anonymous) user testing, which can give very detailed feedback.


It's getting off topic, but the irony of a browser with a 2-4% user base pulling this shit can't be overstated.


Why? They have a fixed budget after all.


Because they're very much the victims of web designers with the same mindset, and if anyone should recognize how unfair throwing minority user groups under the bus can be, it's them.

It also doesn't really make strategic sense to focus on the lowest common denominator. Chrome already has that group. The one place they could eke out a loyal userbase is specifically the users that Chrome fails to capture because they have unusual needs or requirements.


Actually when I think about it, it's even worse than this. Firefox has been on the receiving end of this type of discrimination more or less for as long as it's existed. It was the state of affairs when IE was the challenger too.

You have to be just mindbogglingly oblivious to not see how this has been one of their biggest problems the last 20 years.


They spent some of that budget on getting a freaking sneaker designer to make time-limited theme colo-- sorry, time-limited colorways and did a bunch of popups and cringe copy on it.

Meanwhile, Brave's vertical tabs were done by a single developer.


In addition, this kind of thinking will attrit some of those 5% of users, so repeated application will shrink their market share even further. It's the Excel problem: a huge number of features will only be used by a minority of users, but a majority of users use at least one of those features.


With that logic you could say that 95% of users could keep leaving because you keep focusing on the minority use cases.

Would you rather attrite a fraction of 95% of your users or a fraction of 5% of your users?


see the second half of my comment. You may lose 5% of your users with one particular removal, but if you keep doing it you'll lose most of them.


Did you read my comment? If we had to remove one of two features and one of them was used by 5% of users and one of them was used by 95% of users wouldn't you like to know which one is which?

Without data you can invest your time into the wrong features.


or, you could not remove features


You were the one that brought up removing features.


And those GPT-3.5 calls ain't cheap.


Good, unfortunately. Developer time is a zero-sum game. They're not saying "ah, we can just kick back and do less work then"; they're saying "how do we allocate our limited resources to maximise benefit to users?". And when it's voluntary FOSS work, honestly it isn't unreasonable even if it is the former situation.


Differential privacy techniques are provably impossible to de-anonymize if implemented correctly. Correct implementation is possible, but it's fraught with opportunities for error or manipulation.
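For the curious, the core mechanism really is tiny. A toy sketch of the Laplace mechanism for a counting query (the epsilon, the count, and the "settings panel" example are all made up; this is generic textbook DP, not anything Fedora has specified):

  import random

  def dp_count(true_count: int, epsilon: float) -> float:
      # A counting query has sensitivity 1, so adding Laplace(0, 1/epsilon)
      # noise makes the released count epsilon-differentially private.
      scale = 1.0 / epsilon
      noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
      return true_count + noise

  # e.g. "how many clients opened the printer settings panel this week"
  print(dp_count(true_count=4213, epsilon=0.5))

The hard part is exactly what I mean by error and manipulation: choosing epsilon, tracking the privacy budget across repeated queries, and not leaking the raw counts some other way.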


This is the answer. The person above can speculate and fearmonger about what magic "AI" is going to be able to do, but if there is no personal data there to begin with, or if you use math like in differential privacy, there's not going to be a way to identify individuals.

That is: if you suspect they'd change their minds and start trying to deanonymise previously collected data anyway, remember that open source distributions (I don't know fedora-the-organisation specifically) are generally made up of volunteers like you and me. Notable exceptions obviously exist, like for-profit Canonical; that's not the org type I mean or trust.


> if there is no personal data there to begin with

Which is a very large "if". There is a tendency for people to think that the only personal data consists of what legally counts as PII, when in fact there is much more personally identifying information than is covered in those definitions.


In practice though none of that matters because they'll do a slipshod job of implementing it.


With that logic, we shouldn't have police or a judicial system either

If we can't trust people ever, what's the point in doing anything?


No, obviously that's not even remotely the same.

You need police and a judicial system and you fix them whenever they break. But you don't need telemetry, it's entirely optional and shoddy implementations translate into unnecessary risk.

Also:

https://lwn.net/ml/fedora-devel/H5JEXR.LLU011IQ4I6K@redhat.c...


You have recourse if the police or judicial system abuses you. You have no recourse if a company does.


Given it is open source software, I'd say we have not only recourse but self-determination here. It's also transparent what is being collected and whom it is being sent to. If you want to know what they store about you, it should take one email and a GDPR reference to find out.

I'm no fan of privacy invasion: I always bother clicking through the banner to find the reject-cookies option, and I've sent out plenty of GDPR requests whereas almost nobody else I know has ever sent one. I'm not in favor of tracking, but when it comes to collecting anonymous statistics, especially when they open with "privacy-preserving" and the business is not Facebook or Google or the like where we know shit is about to hit the fan, the cynicism and mistrust in this thread baffles me. Nobody minds when they browse the web and every site keeps access logs invisibly, but oh boy if someone announces keeping a visitor counter for a configuration screen to see whether people can find their way to it.


- It does not ask for consent

- It does not allow a reasonable decision as it does not show the data before it sends it


> it should take one email and a GDPR reference to find out.

I am not in Europe. The GDPR doesn't help me.

> collecting anonymous statistics, especially when they open with "privacy-preserving"

I am far from convinced that such statistics are gathered in a "privacy preserving" way, but that's neither here nor there.

> the business is not Facebook or Google or so where we know there's shit about to hit the fan

The problem is that you can't just trust the current devs. You also have to trust all future devs and companies that may buy the thing. It's not Facebook or Google now, but it could be in the future. And this is Fedora, which is connected to Red Hat, which is connected to IBM.

And it's also not just about privacy. It's also about impact on product development. It's not exactly rare that software has been made much worse as a result of decision-making based on telemetry data.

> Nobody minds when they browse the web and every site keeps access logs invisibly

No? I think quite a lot of people mind this. But there's nothing that can be done about that. It's still worth trying to keep everything from getting even worse, though.


That's not true. Differential privacy just says that the output of the ML model will not change much if you add or remove a particular user, so you can't reverse the training process to infer what training data a given user provided.

It says nothing about whether or not you can join the output of multiple ML models with other telemetry to build a deanonymization model.


Differential privacy satisfies a post-processing guarantee. It says that if you take the output of a differentially private process and do any amount of processing and combining with outside information, then you don't learn any more than you would have gotten with the outside information alone (up to epsilon).
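For reference, the usual textbook statement of that post-processing guarantee (sketched in plain notation):

  If M is an epsilon-differentially private mechanism and f is any function,
  randomized or not, then f(M(.)) is also epsilon-differentially private:

    Pr[f(M(D)) in S] <= exp(epsilon) * Pr[f(M(D')) in S]

  for all datasets D, D' differing in one user's records and all output sets S.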


While true, the requestor had never heard of differential privacy techniques and that is not what they were planning on implementing. This was brought up in the discussion thread.


> I am concerned that AI models can easily de-anonymize and guess end point users when being fed with "telemetry data" of hundreds of thousands clients.

You don't need AI for this. This is done by real humans right now, using data points correlated from multiple sources.


Well, it's simple: AI is getting cheaper, humans are getting more expensive.


People keep saying "you don't need ai for this." Sure. But to do it at scale, and to intelligently connect disparate kinds of data contextually?

That's time consuming and expensive without ai, so you can't do it at scale to a comprehensive degree. That hasn't been practical until now. It still isn't quite cost effective to do this for every human, everywhere, but soon it will be. Give it 5-10 years

Thanks to ai


That's definitely a legitimate fear, as seen with the AOL controversy [1], but if they're just collecting aggregate statistics it's much less of a risk. I.e.

  User ANON-123 with default font x and locale y and screen resolution z installed package x1
Is clearly a big hazard, but statistics on which fonts, locales, and resolutions are in use are not really. Even combinations to answer questions like "what screen resolutions and fonts are most used in $locale?" should be safe as long as the entropy is kept low. It is less useful, since you have to decide on your queries a priori rather than being able to run arbitrary queries on historical data, but ethics and safety > convenience.

[1] https://en.wikipedia.org/wiki/AOL_search_log_release
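To make the two shapes concrete, a rough sketch (values invented, nothing to do with Fedora's actual schema); the point is that only the per-attribute tallies ever need to exist anywhere:

  from collections import Counter

  # Hazardous shape: one row per user linking all attributes together.
  per_user_rows = [
      {"id": "ANON-123", "font": "Cantarell", "locale": "de_DE", "resolution": "3840x2160"},
      {"id": "ANON-124", "font": "Noto Sans", "locale": "en_US", "resolution": "1920x1080"},
  ]

  # Safer shape: independent per-attribute tallies. In a real design each client
  # reports only its own increments, so the table above never exists centrally.
  font_counts = Counter(r["font"] for r in per_user_rows)
  locale_counts = Counter(r["locale"] for r in per_user_rows)
  resolution_counts = Counter(r["resolution"] for r in per_user_rows)

  # A pre-decided, deliberately low-entropy combined query can still be acceptable:
  resolution_by_locale = Counter((r["locale"], r["resolution"]) for r in per_user_rows)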


Combine ANON-123 with information from their browser, which has default font x, locale y, screen resolution z, and package x1, and that anonymous data just became much richer.

It doesn't take very many bits of information to deanonymize someone once you start combining databases.


33 to be precise.
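That number falls out of the world's population; the per-attribute bit counts below are invented for illustration, but they show how fast "harmless" attributes add up:

  import math

  # Bits needed to single out one person among roughly 8 billion:
  print(math.log2(8_000_000_000))   # ~32.9

  # Invented, illustrative entropy budget of a few "anonymous" attributes (bits):
  attributes = {"locale": 5, "screen resolution": 4, "default font": 6,
                "timezone": 4, "installed package set": 15}
  print(sum(attributes.values()))   # 34 -- already past the threshold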


> Looking at the current developments in AI, I am concerned that AI models can easily de-anonymize and guess end point users when being fed with "telemetry data" of hundreds of thousands clients.

I can almost guarantee you that the US government has a tool where you can input a few posts from a person on an anonymous network and get back all of their public profiles elsewhere. Fingerprinting tools beat all forms of VPNs and the like. Our privacy and anonymity died like maybe two years ago, there is no stopping it.


That may be true, but it doesn't mean there's no value in protecting your privacy from others anyway. Personally, I'm much more worried about private entities collecting information about me than I am about the government doing so.


> So, I would like to know how exactly any telemetry data from Fedora Linux clients is going to help them, or how is it going to improve anything.

It won't improve anything for users. It might improve something for IBM.



