Because they already used data without permission on a much larger scale, it's a perfectly logical assumption that they would continue doing so with their users' data?
Training on everything you can publicly scrape from the internet is a very different thing from training on data that your users submit directly to your service.
>Training on everything you can publicly scrape from the internet is a very different thing from training on data that your users submit directly to your service.
Yes. It's way easier and cheaper when the data comes to you instead of having to scrape everything elsewhere.
By X do you mean tweets? Can you not see how different that is from training on your private conversations with an LLM?
What if you ask it for medical advice, or legal things? What if you turn on Gmail integration? Should I now be able to generate your conversations with the right prompt?
I don't think AI companies should be doing this, but they are doing it. All are opt-out, not opt-in. Anthropic is just changing their policies to be the same as their competition.
xAI trains Grok on both public data (Tweets) and non-public data (Conversations with Grok) by default. [0]
> Grok.com Data Controls for Training Grok: For the Grok.com website, you can go to Settings, Data, and then “Improve the Model” to select whether your content is used for model training.
Meta trains its AI on things posted to Meta's products, which are not as "public" as Tweets on X, because users expect these to be shared only with their networks. They do not use DMs, but they do use posts to Instagram/Facebook/etc. [1]
> We use information that is publicly available online and licensed information. We also use information shared on Meta Products. This information could be things like posts or photos and their captions. We do not use the content of your private messages with friends and family to train our AIs unless you or someone in the chat chooses to share those messages with our AIs.
OpenAI uses conversations for training data by default. [2]
> When you use our services for individuals such as ChatGPT, Codex, and Sora, we may use your content to train our models.
> You can opt out of training through our privacy portal by clicking on “do not train on my content.” To turn off training for your ChatGPT conversations and Codex tasks, follow the instructions in our Data Controls FAQ. Once you opt out, new conversations will not be used to train our models.
Either we optimize for human interaction or for agentic use. Yes, we can do both, but realistically, once the focus shifts to agentic optimization, the human-focused side will slowly be sidelined and die off. Sounds like a pretty awful future.
It's one of those you get what you put in kind of deals.
If you spend a lot of time thinking about what you want, describing the inner workings, edge cases, architecture, and library choices, and put that into a thoughtful markdown file, then maybe after a couple of iterations you will get half-decent code. It certainly makes a difference compared to a short "implement X" prompt.
But it makes one think: at that point (writing a good prompt that is basically a spec), you've already solved the problem. So the LLM in this case is little more than a glorified electric typewriter. It types faster than you, but you did most of the thinking.
Right, and then after you do all the thinking and the specs, you have to read, understand, and own every single line it generated. And speaking for myself, I am nowhere near as good at thinking through code I am reviewing as code I am writing.
Other people will put up PRs full of code they don't understand. I'm not saying everyone who is reporting success with LLMs is doing that, but I hear it a lot. I call those people clowns, and I'd fire anyone who did that.
If it passes the unit tests I make it write and works for my sample manual cases, I absolutely will not spend time reading the implementation details unless and until something comes up. Sometimes garbage makes its way into git, but working code is better than no code, and the mess can be cleaned up later. If you have correctness at the interface and function level, you can get a lot done quickly. Technical debt is going to come out somewhere no matter what you do.
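To make "correctness at the interface and function level" concrete, here is a minimal sketch of what those interface-level tests look like; the slugify() helper and its body are hypothetical stand-ins for whatever the LLM generated, not anything from this thread.

    import re

    # Stand-in for an LLM-generated helper (hypothetical name and body);
    # in practice this is the part I don't read line by line.
    def slugify(text: str) -> str:
        words = re.findall(r"[a-z0-9]+", text.lower())
        return "-".join(words)

    # Interface-level tests: they pin down the observable contract
    # without caring how the body is written.
    def test_basic_phrase():
        assert slugify("Hello, World!") == "hello-world"

    def test_collapses_whitespace_and_punctuation():
        assert slugify("  Multiple   spaces!! ") == "multiple-spaces"

    def test_empty_string():
        assert slugify("") == ""

If these pass under pytest and a quick manual check looks right, I move on; the implementation only gets read if something breaks later.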
The trick is to not give a fuck. This works great in a lot of apps, which are useless to begin with. It may also be a reasonable strategy in an early-stage startup yet to achieve product-market fit, but your plan has to be to scrap it and rewrite it and we all know how that usually turns out.
This is an excellent point. Sure in an ideal world we should care very much about every line of code committed, but in the real world pushing garbage might be a valid compromise given things like crunch, sales pitches due tomorrow etc.
No, that's a much stronger statement. I'm not talking about ideals. I'm talking about running a business that is mature, growing and going to be around in five years. You could literally kill such a business running it on a pile of AI slop that becomes unmaintainable.
How much of the code do you review in a third party package installed through npm, pip, etc.? How many eyes other than the author’s have ever even looked at that code? I bet the answers have been “none” and “zero” for many HN readers at some point. I’m certainly not saying this is a great practice or the only way to productively use LLMs, just pointing out that we treat many things as a black box that “just works” till it doesn’t, and life somehow continues. LLM output doesn’t need to be an exception.
That's true; however, it's not as big an issue, because there's a kind of natural selection happening: if a package is popular, other people will eventually read (at least parts of) the code and catch the most egregious problems. Most packages will have "none" like you said, but they aren't used by that many people either, so that's OK.
Of course this also applies to hypothetical LLM-generated packages that become popular, but some new issues arise: the verbosity and sometimes baffling architecture choices of LLMs will certainly make third-party reviews harder and raise the popularity threshold needed to attract third-party attention.
> teach "how do you think and how do you decompose problems"
That's rich coming from AWS!
I think he meant "how do you think about adding unnecessary complexity to problems such that it can enable the maximum amount of meetings, design docs and promo packages for years to come"!
Some of them are very much like this. They think *intelligence* is a measure of your ability to regurgitate data that you have been fed. A genius is someone who wins on Jeopardy.
In engineering school, it was easy to spot the professors who taught this way. I avoided them like the plague.
Why. Just... why