Hacker News | gck1's comments

OpenAI and Anthropic have realized that their entire business is exactly one open-weight model drop away from a Chinese lab releasing something that matches Opus 4.5 performance.

They've realized they have no product or ground to stand on. Once such a model drops, and once chip manufacturers catch up with demand, they are dead if their only product is inference.

So OpenAI decided to do weird things like buying up all the hardware that exists or will exist in the next 2 years, to buy time to build the product. Then launching things like Sora, ChatGPT shopping, ads, etc. They seem to be struggling with this.

Anthropic, being late to the game of hoarding hardware, decided to "buy time" by hiding CoT and implementing KYC (especially for Chinese users) to delay distillation efforts. The products they build in the interim are SaaS clones designed from the POV of AI agents, with tight integrations with their models.

And it seems like Google is just sitting on the sidelines, watching things unfold, since their business model doesn't stand on inference.

The most likely scenario is that OpenAI and Anthropic will still crash and burn when such an open model is released.

Figma's survival is still questionable though. The most likely scenario is that there's going to be an open source alternative with AI integration at the core, rather than as an afterthought.


Exactly this. There are now many great open source coding agents. All we lack is a good model to point them at.

Models are going to become commodities - just switch to the most affordable one.

Longer term, running models locally is going to become increasingly viable.


I'm curious, is there anything out there that will get you 80% of the way on logo work? Claude Opus always came closest for me as a non-designer, but the results were still far from usable.

Haven't found it yet. The component parts seem to be close. Opus is much better at drawing SVG than it used to be.

Get the draft using pixel generators and convert the end result with svgai.org.

My SO is a UX designer and uses Figma. She wanted to try out the Claude integration there, but was frustrated by its limitations - like why she can't export interactive elements to the Figma file format so they can be edited further.

So I helped her look into it, and I was shocked to find out that it's just a React slop generator, not a Figma file generator. And an extremely limited one at that.

Who is Figma targeting with this, exactly? Developers who are interested in React apps will simply use Claude Code, and UX designers don't really care about React apps.


I have a MacBook M3 Max with 128GB of RAM.

How close to Opus 4.6 can I get with this? Realistic, real-world usage: not sitting there for minutes waiting for the model to finish saying hello, and being able to use it for something more than a pelican riding a bicycle.

I'm asking because I keep seeing excited replies, getting excited myself, spending minutes to hours setting up the model, and then, after the first use, forgetting it exists for one reason or another.

Can I get any realistic use out of this?


It won't be a fair comparison against Opus 4.6, but it will run quite well on your machine. I've tested Qwen3.5 27B, Gemma4, MiniMax2.5 and GLM4.7 before on my M3 Ultra, and I'd say this is the first model that I'm able to use for full agentic sessions. Here is a pi session I just did, and it worked quite well, surprisingly: https://pi.dev/session/#c3d003becb1bfcc7ffbca04e89e1adf8

Thank you! That actually looks quite impressive.

What seems very promising is that the thinking blocks look coherent, for lack of a better word, and not that far from the thinking blocks (or rather, summaries) that I see from Claude models.

I think this could actually work for targeted worker agents that get explicit, detailed task instructions from better models.

I'll be trying this tomorrow in my workflow.


You'd be the best person in this thread to answer this question.

> maybe I broke DNS or something

I break my DNS very often, or at least often enough that it'd become a nuisance: I can't instantly recall the IP address of every machine in any of my 5 VLANs, AND type it in manually within 3 seconds.

With IPv6, I'd have to drop whatever I'm doing and fix my DNS first.


If you use SLAAC and don't use mDNS, I suppose, maybe? But if you break DNS often enough that you need to remember IP addresses, you can just do DHCPv6 if you want IPv4-like address allocation.

It'll be even easier because you can use numbers greater than 254 for your local devices, or l33t-style hex addresses, without setting up routed subnets when you exceed your /24 like on IPv4.
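To sketch what that could look like in practice, here's a hypothetical dnsmasq config (interface name, lease times, and the hex suffixes are all made up):

```
# /etc/dnsmasq.conf -- stateful DHCPv6 with fixed, memorable suffixes
enable-ra
dhcp-range=::100,::ffff,constructor:eth0,ra-names,12h

# Pin easy-to-remember hex suffixes per host; no /24 limit to outgrow
dhcp-host=nas,[::cafe]
dhcp-host=printer,[::beef]
```

The `constructor:eth0` bit derives the prefix from whatever your router delegates, so the memorable part (`::cafe`) survives a prefix change.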


What bothers me about Codex CLI is that it feels like it should be more observable: more open and verbose about what the model is doing at each step, given that it's an open source product and OpenAI is seemingly being actually open for once. But then it does a tool call, "Read $file", and I have no idea whether it read the entire file or a specific chunk of it. Claude Code shows you everything the model is doing unless it's in a subagent (which is why I never use subagents).

Not all email automation is spam, and if you want your emails to not end up in the spam folder, you pretty much have to go with Google Workspace and pay for essentially an entire business suite when you just want to send emails. I needed something like this for my project, and it was pretty much Google Workspace or nothing.

Cloudflare just filled a huge gap.


Such an interesting choice for a flag name. NO_BUG_PLEASE=1

I've always seen people complaining about a model getting dumber just before the new one drops, and I always thought this was confirmation bias. But today, several hours before the 4.7 release, Opus 4.6 was acting like it was Sonnet 2 or something from that era of models.

It didn't think at all, it was very verbose, extremely fast, and it was just... dumb.

So now I believe everyone who says models get nerfed without any notification, for whatever reasons Anthropic considers just.

So my question is: what is the actual reason Anthropic lobotomizes the model when the new one is about to be dropped?


I've noticed this and thought about it as well, I have a few suspicions:

Theory 1: Some increasingly large share of inference compute is moving over to serving the new model for internal users (or partners trialing the next models). This leaves less compute for the previous model while demand for it keeps growing. Providers may respond by using quantizations or distillations, compressing the k/v store, tweaking parameters, and/or changing system prompts to try to use fewer tokens.
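To make the quantization lever in Theory 1 concrete, here's a toy numpy sketch (nothing model-specific, and real serving stacks are far more sophisticated): rounding fp32 weights to int8 cuts memory per weight by 4x, in exchange for a bounded rounding error on every weight.

```python
import numpy as np

# Toy symmetric int8 post-training quantization of a weight vector.
rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)

scale = np.abs(w).max() / 127.0                           # one scale per tensor
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale                     # what inference sees

max_err = float(np.abs(w - w_dq).max())
print(f"memory: {w.nbytes} -> {w_q.nbytes} bytes, max error {max_err:.5f}")
```

The per-weight error is bounded by half a quantization step (`scale / 2`), which is why the degradation is subtle rather than catastrophic, and so hard to prove from the outside.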

Theory 2: Internal evals are obviously done using full-strength models with internally optimized system prompts. When models ship to production, the system prompt inherently needs changes. Each time a problematic issue rises to the attention of the team, there's a solid chance it results in a new sentence or two being added to the system prompt. These grow over time as bad things happen with the model in the real world. It doesn't even need to be a harmful case or buggy model behavior: even newer models' enhanced capabilities (e.g. mythos) may get protected against in the prompts used in agent harnesses (Claude Code) or as system prompts, making the system prompt more and more complex. This imposes something like a "cognitive burden" on the model, which diverges further and further from the eval.


> So my question is: what is the actual reason Anthropic lobotomizes the model when the new one is about to be dropped?

You can only fit one version of a model in VRAM at a time. When you have a fixed compute capacity for staging and production, you can put all of that towards production most of the time. When you need to deploy to staging to run all the benchmarks and make sure everything works before deploying to prod, you have to take some machines off the prod stack and onto the staging stack, but since you haven't yet deployed the new model to prod, all your users are now flooding that smaller prod stack.

So what everyone assumes is that they keep the same throughput with less compute by aggressively quantizing or other optimizations. When that isn't enough, you start getting first longer delays, then sporadic 500 errors, and then downtime.


So if I understand it right, in order to free up VRAM space for a new one, the model string in the API, like `opus-4.6-YYYYMMDD`, is not actually an identifier of the exact weights being served, but more like an ID for a group of weights ranging from heavily quantized to the real deal, all of which cost me the same?

How is this even legal?


> How is this even legal?

Because "opus-4.6-YYYYMMDD" is a marketing product name for a given price level. You consented to this in the terms and conditions; nothing in the contract you signed promises anything about weights, quantization, capability, or performance.

Wait until you hear about my ISPs that throttle my "unlimited" "gigabit" connection whenever they want, or my mobile provider that auto-compresses HD video on all platforms, or my local restaurant that just shrinkflationed how much food you get for the same price, or my gym where 'small group' personal trainer sessions went from 5 to 25 people per session, or this fruit basket company that went from 25% honeydew to 75% honeydew, or the literal origin of "your mileage may vary".

Vote with your wallet.


> Nothing in the contract promises capability or performance.

Taken to its conclusion, Anthropic could silently replace Opus with Haiku-quality internals and you'd have no recourse. If that sounds absurd, that's exactly where the legal argument lives. Mandatory consumer protection provisions, such as those on misleading omissions, cannot be waived by clicking "I agree." Withholding material information about a product you're paying a premium for isn't covered by T&Cs; it's the specific thing those laws were written to address.


Why does this comment appear every time someone complains about CoT becoming more and more inaccessible with Claude?

I have entire processes built on top of CoT summaries. They provide tremendous value, and no, I don't care if "the model still did the correct thing". Thinking blocks show me whether the model is confused, and they show me what alternative paths existed.

Besides, "correct thing" has a lot of meanings, and a decision by the model may be correct relative to the context it's in but completely wrong relative to what I intended.

The proof that thinking tokens are indeed useful is that Anthropic tries to hide them. If they were useless, why would they even try all of this?

Starting to feel PsyOp'd here.


Didn't you notice that the stream is incoherent and noisy? Sometimes it goes from thought A to thought B, then action C, but A was entirely unnecessary noise that had nothing to do with B and C. I also sometimes saw signals in the thinking output that were red flags, or as you said, the model got confused, but then it didn't matter at all. Now I just never look at the thinking tokens anymore, because I got bamboozled too often.

Perhaps when you summarize it you miss some of these, or you're doing things differently otherwise.


The usefulness of thinking tokens in my case might come down to the conditions I have claude working in.

I primarily use Claude for Rust, with what I call a masochistic lint config. Compiler and lint errors almost always trigger extended thinking when adaptive thinking is on, and that's where these tokens become a goldmine. They reveal whether the model actually considered the right way to fix the issue. Sometimes it recognizes that ownership needs to be refactored. Sometimes it identifies that the real problem lives in a crate that for some reason is "out of scope" even though it's right there in the workspace, and then concludes with something like "the pragmatic fix is to just duplicate it here for now."

So yes, the resulting code works, and by some definition the model did the correct thing. But to me, "correct" doesn't just mean working, it means maintainable. And on that question, the thinking tokens are almost never wrong or useless. Claude gets things done, but it's extremely "lazy".
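For context, a rough sketch of the kind of lint config I mean, using Cargo's `[lints]` table (the exact set is a matter of taste, this is just illustrative):

```toml
# Cargo.toml -- an aggressively strict ("masochistic") lint setup
[lints.clippy]
pedantic    = { level = "deny", priority = -1 }
nursery     = { level = "deny", priority = -1 }
unwrap_used = "deny"   # no .unwrap() shortcuts
expect_used = "deny"
panic       = "deny"

[lints.rust]
unsafe_code  = "deny"
missing_docs = "warn"
```

With something like this, nearly every "lazy" fix the model reaches for trips a lint, which is what keeps forcing it into extended thinking.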


Also, for anyone using Opus with Claude Code: they've again "broken" the thinking summaries, even if you had "showThinkingSummaries": true in your settings.json [1]

You have to pass `--thinking-display summarized` flag explicitly.

[1] https://github.com/anthropics/claude-code/issues/49268


I agree. Ever since the release of R1, it's like every single American AI company has realized that they actually do not want to show CoT, and then separately that they cannot actually run CoT models profitably. Ever since then, we've seen everyone implement a very bad dynamic-reasoning system that makes you feel like an ass for even daring to ask the model for more than 12 tokens of thought.
