We don't vary our model quality with time of day or load (beyond negligible non-...

wasmainiac · 2026-02-05T22:50:53 1770331853

Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it’s faster than google in most case, often it is on point, but also it seems off at times, inaccurate or shallow maybe. In some cases I just end the session.

nl · 2026-02-05T23:02:33 1770332553

Usually I find this kind of variation is due to context management.

Accuracy can decreases at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.

If you are seeing this kind of thing start a new chat and re-run the same query. You'll usually see an improvement.

wasmainiac · 2026-02-06T06:14:09 1770358449

I don't think so. I am aware that large contexts impacts performance. In long chats an old topic will someone be brought up in new responses, and the direction of the mode is not as focused.

Regardless I tend to use new chats often.

repeekad · 2026-02-06T01:38:13 1770341893

This is called context rot

charcircuit · 2026-02-06T06:04:56 1770357896

I thought context rot was only for long distance queries.

GorbachevyChase · 2026-02-06T01:35:37 1770341737

Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.

smugtrain · 2026-02-07T00:56:51 1770425811

It will give the user lower quality if it finds them “distressed” however, choosing paternalistic safety over epistemic accuracy. As a user gets more frustrated with the system, it will pick up the distress signal even more so, a kind of feedback loop toward degraded service quality. In my experience.

Trufa · 2026-02-05T20:18:40 1770322720

Can you be more specific than this? does it vary in time from launch of a model to the next few months, beyond tinkering and optimization?

tedsanders · 2026-02-05T21:11:40 1770325900

Yeah, happy to be more specific. No intention of making any technically true but misleading statements.

The following are true:

- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)

- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.

- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.

ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

Codex changelog: https://developers.openai.com/codex/changelog/

Codex CLI commit history: https://github.com/openai/codex/commits/main/

Trufa · 2026-02-05T23:37:11 1770334631

I ask then unironically then, am I imagining that models are great when they start and degrade over time?

I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seem so in your face.

I don't discard being novelty plus getting used to it, plus psychological factors, do you have any takes on this?

jason_oster · 2026-02-06T01:21:22 1770340882

You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.

Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.

Just a guess! Not trying to psychoanalyze anyone.

wasmainiac · 2026-02-06T09:16:01 1770369361

I don’t think so. I notice the same thing, but I just use it like google most of the time, a service that used to be good. I’m not getting a dopamine rush off this, it’s just part of my day.

jychang · 2026-02-05T21:30:20 1770327020

What about the juice variable?

https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low...

tedsanders · 2026-02-05T21:43:21 1770327801

Yep, we recently sped up default thinking times in ChatGPT, as now documented in the release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.

If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).

tgrowazay · 2026-02-05T21:40:54 1770327654

Isn’t that just how many steps at most a reasoning model should do?

qingcharles · 2026-02-06T17:33:44 1770399224

Thank you for saying this publically.

I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site etc) half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.

newswasboring · 2026-02-06T09:53:57 1770371637

>there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware

Maybe a dumb question but does this mean model quality may vary based on which hardware your request gets routed to?

ComplexSystems · 2026-02-05T21:22:47 1770326567

Do you ever replace ChatGPT models with cheaper, distilled, quantized, etc ones to save cost?

tedsanders · 2026-02-06T06:52:18 1770360738

We do care about cost, of course. If money didn't matter, everyone would get infinite rate limits, 10M context windows, and free subscriptions. So if we make new models more efficient without nerfing them, that's great. And that's generally what's happened over the past few years. If you look at GPT-4 (from 2023), it was far less efficient than today's models, which meant it had slower latency, lower rate limits, and tiny context windows (I think it might have been like 4K originally, which sounds insanely low now). Today, GPT-5 Thinking is way more efficient than GPT-4 was, but it's also way more useful and way more reliable. So we're big fans of efficiency as long as it doesn't nerf the utility of the models. The more efficient the models are, the more we can crank up speeds and rate limits and context windows.

That said, there are definitely cases where we intentionally trade off intelligence for greater efficiency. For example, we never made GPT-4.5 the default model in ChatGPT, even though it was an awesome model at writing and other tasks, because it was quite costly to serve and the juice wasn't worth the squeeze for the average person (no one wants to get rate limited after 10 messages). A second example: in our API, we intentionally serve dumber mini and nano models for developers who prioritize speed and cost. A third example: we recently reduced the default thinking times in ChatGPT to speed up the times that people were having to wait for answers, which in a sense is a bit of a nerf, though this decision was purely about listening to feedback to make ChatGPT better and had nothing to do with cost (and for the people who want longer thinking times, they can still manually select Extended/Heavy).

I'm not going to comment on the specific techniques used to make GPT-5 so much more efficient than GPT-4, but I will say that we don't do any gimmicks like nerfing by time of day or nerfing after launch. And when we do make newer models more efficient than older models, it mostly gets returned to people in the form of better speeds, rate limits, context windows, and new features.

jghn · 2026-02-05T21:28:59 1770326939

He literally said no to this in his GP post

joshvm · 2026-02-05T20:58:50 1770325130

My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.

It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.

If you make raw API calls and see behavioural changes over time, that would be another concern.

zamadatix · 2026-02-05T21:43:38 1770327818

I appreciate you taking the time to respond to these kinds of questions the last few days.

Someone1234 · 2026-02-05T20:29:28 1770323368

Specifically including routing (i.e. which model you route to based on load/ToD)?

PS - I appreciate you coming here and commenting!

hhh · 2026-02-05T20:32:14 1770323534

There is no routing with API, or when you choose a specific model in chatGPT.

zwaps · 2026-02-06T06:09:51 1770358191

In the past it seemed there was routing based on context-length. So the model was always the same, but optimized for different lengths. Is this still the case?

derwiki · 2026-02-06T00:07:15 1770336435

Has this always been the case?

robertclaus · 2026-02-06T06:11:56 1770358316

Hi Ted! Small world to see you here!

fragmede · 2026-02-05T23:25:50 1770333950

I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.

clbrmbr · 2026-02-06T00:27:52 1770337672

I do wonder about reasoning effort.

hauntsaninja · 2026-02-06T08:40:16 1770367216

Reasoning effort is denominated in tokens, not time, so no difference beyond slowness at heavy load

(I work at OpenAI)

a456463 · 2026-02-06T16:32:25 1770395545

sure. we believe you