The advice here is good, and I'm a big believer that the cream (e.g., sincerity and real opinions) rises to the top in writing. Still, I think folks dunk on these types of writing automation tools too much when, for many, they can be a gateway drug to consistent posting and finding your online voice.
That is to say, the whole post is a bit of an internet old-head complaint. Reminds me of baby boomers complaining about a "decline" in homeownership and having children without acknowledging the massive shifts in the economic accessibility that support these milestones.
It's easy to write a post like this when you've already built a following because you started when social media was a greenfield experience. It's much harder when you have to compete for signal while being pressured to build a brand and perform at your day job.
The obsession with constant content production, combined with algorithmic, feed-driven consumption frontends with terrible discoverability and intense bubblification, leads to today's screaming contest that ruins our sanity. On average I find it much worse than the old infosphere (TV+print+radio) used to be, for producers and consumers. It's quite tragic, really.
Though I also notice awareness around this issue is rising (e.g. smartphone bans in schools, initiatives like Bluesky), which is good, I guess. All of this is still a society-wide experiment without a control group.
Agreed. Discussions like these always remind me of some great research on how the old, more averaged, and less targeted infosphere used to support significantly more political cohesion before it was destroyed.
If you want to lose all hope, just read the top selling romance novels on the Kindle app. These people are raking in millions a year and it’s just absolutely awful.
"In further analyses, the researchers found that, for cancer patients, whether their HDHP had a health savings accounts (HSAs) did not make a difference."
An HDHP is a federal requirement if you want an HSA. Full stop. Everyone I work with and nearly everyone I know has one, and no one is dying. The headline is hyperbolic, but given the source I'm not surprised.
It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.
I care about -expected- performance when picking which model to use, not optimal benchmark performance.
The non-determinism means that even with a temperature of 0.0, you can’t expect the outputs to be the same across API calls.
In practice people tend to index to the best results they’ve experienced and view anything else as degradation. In practice it may just be randomness in either direction from the prompts. When you’re getting good results you assume it’s normal. When things feel off you think something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result.
This has nothing to do with overloading. The suspicion is that when there is too much demand (or they just want to save costs), Anthropic sometimes uses a less capable (quantized, distilled, etc) version of the model. People want to measure this so there is concrete evidence instead of hunches and feelings.
To say that this measurement is bad because the server might just be overloaded completely misses the point. The point is to see if the model sometimes silently performs worse. If I get a response from "Opus", I want a response from Opus. Or at least want to be told that I'm getting slightly-dumber-Opus this hour because the server load is too much.
The question I have now after reading this paper (which was really insightful) is do the models really get worse under load, or do they just have a higher variance? It seems like the latter is what we should expect, not it getting worse, but absent load data we can't really know.
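To make the "worse vs. higher variance" distinction concrete, here is a rough sketch using only Python's standard library. The numbers are made up purely for illustration (they are not measurements of any model), and `summarize`, `quiet_hours`, and `busy_hours` are hypothetical names. It shows two sets of runs with the same average score but very different spread, which is what "higher variance, not worse" would look like.

```python
import statistics


def summarize(pass_counts):
    """Return (mean, sample standard deviation) for a list of per-run scores."""
    return statistics.mean(pass_counts), statistics.stdev(pass_counts)


# Made-up numbers purely for illustration: tasks passed out of 100 per run.
quiet_hours = [70, 70, 70, 70]
busy_hours = [50, 90, 60, 80]

quiet_mean, quiet_sd = summarize(quiet_hours)
busy_mean, busy_sd = summarize(busy_hours)

# Same mean, different spread: the average did not get worse,
# but any single run is much less predictable.
print(quiet_mean, quiet_sd)
print(busy_mean, busy_sd)
```

If the mean dropped under load, that would point at "worse"; if only the spread grew, that would point at "noisier". You need both numbers, plus load data, to tell them apart.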
Explain this though. The code is deterministic, even if it relies on pseudorandom number generation. It doesn't just happen; someone has to make a conscious decision to force a different code path (or model) if the system is loaded.
It's not deterministic. Any individual floating point mul/add is deterministic, but in a GPU these are all happening in parallel and the accumulation is in the order they happen to complete.
When you add A then B then C, you can get a different answer than C then A then B, because floating point addition isn't associative (approximation error, subnormals, etc.).
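A minimal Python sketch of the ordering point, using plain CPU doubles rather than a GPU kernel (the helper name `sum_in_order` is made up for illustration). It adds the same three numbers in two different orders and gets two different totals.

```python
def sum_in_order(values, order):
    """Accumulate values left to right, visiting indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [1e16, 1.0, -1e16]

# 1e16 + 1.0 rounds back to 1e16 (doubles near 1e16 are spaced 2 apart),
# so the 1.0 is lost before the large terms cancel.
a_b_c = sum_in_order(values, [0, 1, 2])

# Cancelling the large terms first leaves the 1.0 intact.
a_c_b = sum_in_order(values, [0, 2, 1])

print(a_b_c, a_c_b)  # 0.0 1.0

# The same effect shows up with everyday decimals.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

This only shows that the result depends on accumulation order. It says nothing about whether the differences are big enough to change which token a model picks.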
It can be made deterministic. It's not trivial and can slow it down a bit (not much), but there are environment variables you can set to make your GPU computations bitwise reproducible. I have done this when training models with PyTorch.
For all practical purposes any code reliant on the output of a PRNG is non-deterministic in all but the most pedantic senses... And if the LLM temperature isn't set to 0 LLMs are sampling from a distribution.
If you're going to call a PRNG deterministic then the outcome of a complicated concurrent system with no guaranteed ordering is going to be deterministic too!
No, this isn't right. There are totally legitimate use cases for PRNGs as sources of random number sequences following a certain probability distribution where freezing the seed and getting reproducibility is actually required.
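For anyone who hasn't used this pattern, a small sketch with Python's standard `random` module (the function name `draw_samples` and the seed values are arbitrary choices for illustration). Freezing the seed gives the same "random" draws every time, which is what makes a simulation or test reproducible.

```python
import random


def draw_samples(seed, n=5):
    """Draw n values from a standard normal using a privately seeded generator."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


first = draw_samples(seed=1234)
second = draw_samples(seed=1234)
other = draw_samples(seed=4321)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```

Using a private `random.Random` instance rather than the module-level functions keeps the sequence from being disturbed by other code that also draws random numbers.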
How is this related to overloading? The nondeterminism should not be a function of overloading. It should just time out or reply slower. It will only be dumber if it gets rerouted to a dumber, faster model, e.g. a quantized one.
Just to make sure I got this right. They serve millions of requests a day & somehow catastrophic error accumulation is what is causing the 10% degradation & no one at Anthropic is noticing it. Is that the theory?
There are a million ways to make LLM inference more efficient at the cost of output quality, like using a smaller model, using quantized models, using speculative decoding with a more permissive rejection threshold, etc.
The primary (non-malicious, non-stupid) explanation given here is batching. But I think if you looked at large-scale inference, you would find that the batch sizes being run on any given rig are fairly static: for any given model part run individually there is a sweet spot between memory consumption and GPU utilization, and GPUs generally do badly at job parallelism.
I think the more likely explanation is again with the extremely heterogeneous compute platforms they run on.
I checked the link, and it never says that the model's predictions get lower in quality due to batching, just that they become nondeterministic. I don't understand why people conflate these things. Also, it's unlikely that they use smaller batch sizes when load is lower. They likely just spin GPU servers up and down based on demand or, more likely, reallocate servers and GPUs between different roles and tasks.
Excellent, level-headed read that appropriately acknowledges that we live in a world ultimately bounded by physics that (at some point) no amount of money or human attention can overcome.
A hacker seeking to change a political system-- independent of alignment-- would be well advised to take an approach that is almost the exact inverse of this project's.
The research on getting people to change political attitudes or engage in pro-social political behaviors says that public shame, especially amongst their friends/communities/families, is the most effective lever available.
So, instead of making a list of everyone who believes in $prosocial_behaviors_and_policies, create a publicly searchable and verifiable database of folks engaged in $anti_social_behaviors_and_policies that are destructive to their communities.
Better transparency into where everyone stands helps to prevent toxic policies and rhetoric that poison the commons and allows communities (teammates, employers, friends, and family-- both present and future) to then apply social pressure or the threat of ostracism in order to generate meaningful change.
There's a reason that bad actors (of all stripes and political affiliations) fear transparency! It's a highly effective tool for aligning behavior with societal/community values.