More

ericol · 2026-04-24T22:07:56 1777068476

Stop hooks are a world of pain.

I recently went on a deep dive about them with sonnet / opus.

I wanted to detect if a file or an analysis was the result of the last turn and act upon that.

From my experience, 2 things stand out by looking at the data above:

1. They have changed the schema for the hook reply [1] if this is real stop hook users (And may be users of other hooks) are in for a world of pain (if these schema changes propagate)

2. Opus is caring f*ck all about the response from the hook, and that's not good. Sonnet / Opus 4.6 are very self conscious about the hooks, what they mean and how they should _ act / react_ on them, and because of how complex the hook I set up is I've seen turns with 4 stop hooks looping around until Claude decides to stop the loop.

[1] My comment is in the context of claude code. I cannot make if the post is about that or an API call.

ericol · 2026-04-24T02:14:24 1776996864

if "git versioned" means the .md files themselves, I'm sold. I am actually processing files using a git based workflow in order to tell claude what to look at.

I'll definitely give this a spin.

smadam9 · 2026-04-24T02:18:35 1776997115

Yes, the .md's are in their own repo, locally. The entire UI is a layer on top of that repo. The UI has some underlying mechanisms that abstract the git operations away from the user, but that doesn't stop a power user from jumping in the shell and accessing the repo directly.

The "magic" starts when Sig contributes to another, remote repo - a central knowledge base that all teammates' local Sig can pull from, and contribute toward.

ericol · 2026-04-23T15:52:28 1776959548

Books _are_ expensive. This article only looks at the monetary side of it.

The cost goes beyond the price tag. Books take up space, and that space compounds as you keep acquiring them. It's space you can't use for anything else, dedicated entirely to objects most people open once or twice and never touch again. And that cost doesn't stay abstract: at some point you're buying more bookshelves, upgrading to a larger one, or worst of all, dragging everything through a move. That last one hits harder the less stable your living situation is, and less stable living situations track pretty closely with lower salaries.

I'm talking about physical books specifically, since that's what the article seems to cover. Ebooks are a different matter.

palmotea · 2026-04-23T16:04:42 1776960282

> It's space you can't use for anything else, dedicated entirely to objects most people open once or twice and never touch again.

It worth noting that books can be decor. To the point where people who don't read buy them for decoration.

bombcar · 2026-04-23T19:48:34 1776973714

https://booksbythefoot.com is either the most horrible gore site you've ever seen, or interesting.

ericol · 2026-04-17T20:27:08 1776457628

I did some work yesterday with Opus and found it amazing.

Today we are almost on non-speaking terms. I'm asking it to do some simple stuff and he's making incredible stupid mistakes:

    This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?

and at the same time the compacting is firing like crazy. (What adds ~4 minute delays every 1 - 15 minutes)

  | # | Time     | Gap before | Session span | API calls |
  |---|----------|-----------|--------------|-----------|
  | 1 | 15:51:13 | 8s        | <1m          | 1         |
  | 2 | 15:54:35 | 48s       | 37m          | 51        |
  | 3 | 16:33:33 | 2s        | 19m          | 42        |
  | 4 | 16:53:44 | 1s        | 9m           | 30        |
  | 5 | 17:04:37 | 1s        | 17m          | 30        |
  # — sequential compaction event number, ordered by time.

  Time — timestamp of the first API call in the resumed session, i.e. when the new context (carrying the compaction summary) was first sent to the
  model.

  Gap before — time between the last API call of the prior session and the first call of this one. Includes any compaction processing time plus user
   think time between the two sessions.

  Session span — how long this compaction-resumed session ran, from its first API call to its last before the next compaction (or end of session).

  API calls — total number of API requests made during this resumed session. Each tool use, each reply, each intermediate step = one request.

Bottomline, I will probably stay on Sonnet until they fix all these issues.

losvedir · 2026-04-18T12:57:00 1776517020

> This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?

I don't know if you're giving this as something you've actually given Claude, but I don't think it's a good way of using Claude.

It's not a collaborator who's having a bad day where a little empathy might make him feel better and realize his error. It's a token generator based on a prompt which includes all chat history. If you have three examples of the bad approach in the history, in a format that looks like Claude doing work, it will totally pollute it! And even worse with auto-compaction where you don't know exactly what of those false starts is getting summarized into its context.

You have to treat this like a tool and understand how it works.

If Claude is going down a wrong path it's better to cancel and rewind and improve the previous addition to the prompt. You don't want it to generate a bunch of misleading tokens for itself and leave it in the context window indefinitely!

ericol · 2026-04-20T13:23:30 1776691410

> I don't know if you're giving this as something you've actually given Claude, but I don't think it's a good way of using Claude.

That wasn't the full prompt, I trimmed it for clarity, but I agree with everything you said and that's how I actually use it.

I have a proxy logging everything sent to and from Claude in a structured way, which is precisely what let me do that compaction analysis in the first place.

When Claude goes off track, I don't tell it "you did something wrong". I ask it to analyze the tool outputs and the exchange so far and let it reconcile the discrepancy itself. That tends to work better than narrating the error to it.

The venting messages like that one are honestly for me, not for Claude. I know it's a tool. But it also behaves and communicates like a person, and that's a design choice that comes from Anthropic, not from me. What I've found is that writing something like that and then following it with proper instructions works fine in practice: Claude either ignores the venting or briefly acknowledges it and moves on. The actual output isn't affected. It's just how I process frustration without breaking the workflow.

j_bum · 2026-04-18T14:05:52 1776521152

Yep I bewilders me when I see instructions like this. Go bad and edit your previous message if you didn’t et what you want!

I think this is a direct result of OpenAI and Anthropic humanizing these models too much.

I want C-3PO by my side helping me work, not a machine acting emotional.

But that’s what they’ve given us, and now a huge fraction of the username treats these tools like a human.

aulin · 2026-04-17T21:01:36 1776459696

They won't. These are not "issues", it's them trying to push the models to burn less compute. It will only get worse.

criemen · 2026-04-17T21:03:37 1776459817

> it's them trying to push the models to burn less compute

I'm curious, how does using more tokens save compute?

BoorishBears · 2026-04-18T00:01:42 1776470502

I'm 99.9% sure Opus 4.7 is a smaller model than 4.6.

Too many signs between the sudden jump in TPS (biggest smoking gun for me), new tokenenizer, commentary about Project Mythos from Ant employees, etc.

It looks like their new Sonnet was good enough to be labeled Opus and their new Opus was good enough to be labeled Mythos.

They'll probably continue post-training and release a more polished version as Opus 5

b65e8bee43c2ed0 · 2026-04-17T21:20:18 1776460818

productivity (tokens per second per hardware unit) increases at the cost of output quality, but the price remains the same.

both Anthropic and OpenAI quantize their models a few weeks after release. they'd never admit it out loud, but it's more or less common knowledge now. no one has enough compute.

sthimons · 2026-04-17T21:26:21 1776461181

Pretty bold claim - you have a source for that?

Rapzid · 2026-04-17T21:38:51 1776461931

There is no evidence TMK that the accuracy the models change due to release cycles or capacity issues. Only latency. Both Anthropic and OpenAI have stated they don't do any inference compute shenanigans due to load or post model release optimization.

Tons of conspiracy theories and accusations.

I've never seen any compelling studies(or raw data even) to back any of it up.

cebert · 2026-04-17T21:23:17 1776460997

Do you have a source for that claim?

b65e8bee43c2ed0 · 2026-04-17T21:39:08 1776461948

my source is that people have been noticing this since GPT4 days.

https://arxiv.org/pdf/2307.09009

but of course, this isn't a written statement by a corporate spokespersyn. I don't think that breweries make such statements when they water their beer either.

shortstuffsushi · 2026-04-17T21:11:31 1776460291

I think that the idea is each action uses more tokens, which means that users hit their limit sooner, and are consequently unable to burn more compute.

ryanschaefer · 2026-04-17T21:16:20 1776460580

What?

bloppe · 2026-04-17T21:12:57 1776460377

It could be the adaptive reasoning

rustyhancock · 2026-04-17T21:31:12 1776461472

If you've not seen Common People Black Mirror episode I strongly recommend it.

The only misprediction it makes is that AI is creating the brain dead user base...

You have to hook your customers before you reel them in!

https://www.netflix.com/gb/title/70264888?s=a&trkid=13747225...

cadamsdotcom · 2026-04-17T21:41:01 1776462061

> he’s making .. mistakes

Claude and other LLMs do not have a gender; they are not a “he”. Your LLM is a pile of weights, prompts, and a harness; anthropomorphising like this is getting in the way.

You’re experiencing what happens when you sample repeatedly from a distribution. Given enough samples the probability of an eventual bad session is 100%.

Just clear the context, roll back, and go again. This is part of the job.

yokoprime · 2026-04-17T21:55:19 1776462919

Why be so upset at someone using pronouns with a LLM?

ericol · 2026-04-18T01:55:48 1776477348

You are being downvoted but I actually agree with your statement.

solenoid0937 · 2026-04-18T15:54:29 1776527669

This is not how AI works man. Speaking condescendingly or sternly to it WILL result in worse output. Imagine if you spoke to an intern like that, would they make more or less mistakes after?

You should just revert the context and provide more detail and rationale in the message.

whalesalad · 2026-04-17T20:37:43 1776458263

I am having a shit experience lately. Opus 4.7, max effort.

> You're right, that was a shit explanation. Let me go look at what V1 MTBL actually is before I try again.

> Got it — I read the V1 code this time instead of guessing. Turns out my first take was wrong in an important way. Let me redo this in English.

:facepalm:

tremon · 2026-04-17T20:44:43 1776458683

> I read the V1 code this time instead of guessing

Does the LLM even keep a (self-accessible) record of previous internal actions to make this assertion believable, or is this yet another confabulation?

cheesecakegood · 2026-04-18T03:33:06 1776483186

No they do not (to be clear, not internal state, just the transcript). It’s entirely role-play. LLM apologies are meaningless because the models are mostly stateless. Every new response is a “what would a helpful assistant with XYZ prior context continue to say?”

johnmaguire · 2026-04-17T21:17:59 1776460679

Yes, the LLM is able to see the entire prior chat history including tool use. This type of interaction occurs when the LLM fails to read the file, but acts as though it had.

al_borland · 2026-04-17T20:43:49 1776458629

This seems like the experience I've had with every model I've tried over the last several years. It seems like an inherent limitation of the technology, despite the hyperbolic claims of those financially invested in all of this paying off.

smt88 · 2026-04-17T20:58:09 1776459489

Opus 4.6 pre-nerf was incredible, almost magical. It changed my understanding of how good models could be. But that's the only model that ever made me feel that way.

whalesalad · 2026-04-17T21:09:41 1776460181

Yes! I genuinely got a LOT of shit done with Opus 4.6 "pre nerf" with regular old out-of-the-box config, no crazy skills or hacks or memory tweaks or anything. The downfall is palpable. Textbook rugpull.

solenoid0937 · 2026-04-18T15:56:41 1776527801

There was no nerf - this meme needs to die.

smt88 · 2026-04-18T22:08:14 1776550094

What exactly happened then? How did we all have this collective hallucination?

solenoid0937 · 2026-04-18T22:12:44 1776550364

Collective hallucinations are common. Mandela effect, people thinking FB is listening to your microphone because they see relevant ads, etc

This is a common phenomenon that all humans pattern match to things we expect. When we learn a new vocabulary word you see it everywhere for the next two days. When we think Claude might be nerfed, we overindex on every instance of Claude underperforming.

The only way to account for this is credulous, hard data. Like benchmarks over time. To this day no one has provided evidence that Claude Code, when fixed to the same thinking level, has had degraded performance.

al_borland · 2026-04-19T00:00:36 1776556836

Are there any good ways to benchmark models over time that don't fall victim to Goodhart's law? It seems that once the benchmark is defined, the AI will train on it, and it will become effectively meaningless.

I read many articles about AIs doing extremely well on various tests in graduate or PhD level programs. But these tests are well defined. A professor put the same models though his freshman CS class and most of them failed.

solenoid0937 · 2026-04-19T13:17:09 1776604629

These models don't learn continuously, they are a static snapshot one training is finished. You only need a new benchmark once new models are published (or you need a private benchmark, in which case you don't need to update the benchmark at all)

ec109685 · 2026-04-18T15:26:54 1776526014

Did they nerf the model or was it changes to Claude code? I agree it got frustrating.

al_borland · 2026-04-17T21:51:58 1776462718

That was better, but still not to the point that I just let it go on my repo.

ed_elliott_asc · 2026-04-17T21:23:50 1776461030

If it isn’t working for you why don’t you choose an older model? 4.6

ericol · 2026-04-17T20:45:05 1776458705

Matches what I am experiencing. Makes incredible stupid mistakes.

The weird stuff is yesterday I asked it to test and report back on a 30+ commit branch for a PR and it did that flawlessly.

alphabettsy · 2026-04-17T20:58:01 1776459481

The docs suggest not using max effort in most cases to avoid overthinking :shrug:

whalesalad · 2026-04-17T21:11:32 1776460292

They've jumped the shark. I truly can't comprehend why all of these changes were necessary. They had a literal money printing machine that actually got real shit done, really well. Now it's a gamble every time and I am pulling back hard from Anthropic ecosystem.

solenoid0937 · 2026-04-18T15:58:07 1776527887

it's clearly all in your head. 4.6 is just as capable as it used to be. literally no one on the internet has managed to post credulous and real evidence of a nerf

this is just another trendy conspiracy theory that people reinforce because of selection/recency bias. you hear "nerf", your brain overindexes on the next time Claude does poorly. it is the same phenomenon when you notice a new vocabulary word all the time.

geraldwhen · 2026-04-17T21:43:39 1776462219

It seems clear that it was a money spending machine, not a money printing machine.

ericol · 2026-04-01T15:24:44 1775057084

This looks dangerously close to cmux but with a narrower focus (Just Claude code)

BTW, the claude app kind supports this with the /remote-control command, and that was what made me move away from cmux (I still have to start the sessions there)

ericol · 2026-03-09T16:04:06 1773072246

When my eldest daughter was in high school (~2010, Argentina) there was a provincial policy where if every single student had a result below a certain score in a test, the scores had to be re assessed against the maximum result.

The resulting situation here was that she was constantly bullied into underperforming. Both cases are actually similar in that each individual has a personal incentive to underperform - the difference is that in your friend's case the policy is granted at the company level so no single employee can defect and break it for the rest, while in my daughter's case one high scorer could invalidate the reassessment for everyone, which is exactly what made defection punishable and the bullying emerge naturally.

ecshafer · 2026-03-09T17:20:14 1773076814

This is the natural result of "equity" which is the academic jargon term for "forced equality of outcome". High achievers are attacked. People who push us forward are demonized. The low achievers are never pushed to be better. And the average drops.

MichaelZuo · 2026-03-10T00:24:25 1773102265

Can you link a source for it? That sounds too absurd to be true…

antonymoose · 2026-03-10T09:49:14 1773136154

It’s not that absurd and happens all over the world in university systems. I had a Comp. Sci. Professor that taught assembly and graded on a curve. As you might imagine the one guy that was a wizard at assembly caught flak from the unwashed masses.

I had another professor that not only did a curve but dropped statistical outliers to prevent this problem, he literally explained his system on Day 1 of the course. This was 15+ years ago and by no means a new idea.

ericol · 2026-03-10T03:50:25 1773114625

The future is not evenly distributed.

I tried to search for it, but even the 2 documents that superseded the one from around the time my daughter was at school at not available.

I mean, the site doesn't even have a valid secure certificate so...

In the site below (In Spanish) you can search for 10/2019 and a cursory translation of the document title will show that this is the proper document (For 2019 onwards, the replaced doc 04/2014 isn't available either)

https://koha.chubut.edu.ar/cgi-bin/koha/opac-search.pl?idx=k...

ericol · 2026-03-04T02:00:43 1772589643

Previous discussion: https://news.ycombinator.com/item?id=47230710

ericol · 2026-03-02T18:45:59 1772477159

> human intuition driving the exploration

This, a thousand times this.

For me, what AI brings is augmented humans. Just as we don't calculate on paper anymore, what is the reason of doing things by hand when a machine in X times better.

Want to code by hand, as artisans of old? Suit yourself.

I, for one, love the smell of burning chrome.

pklausler · 2026-03-02T18:49:39 1772477379

If "AI" were doing anything more than repeating content from the web without attribution, I might agree with you.

ericol · 2026-03-10T11:41:50 1773142910

It's not exactly that...

ericol · 2026-03-01T13:53:48 1772373228

I regularly (Say, once a month) do a comparison of results across all Claude, Gemini and ChatGPT. Just for reasons, not that I want to see if there's any benefit in changing.

It's not "fair" in that I pay for Claude [1] and not for the others, so models availability is not complete except for Claude.

So I did like things at time in the form of how they were presented, I came to really like Sonnet's "voice" a lot over the others.

Take into account Opus doesn't have the same voice, and I don't like it as much.

[1] I pay for the lower tier of their Max offering.

coldtrait · 2026-03-02T11:26:02 1772450762

Thanks for your perspective.

ericol · 2026-02-26T13:39:00 1772113140

I've had more than a few instances of this over the past 2 years, and my reply is exactly the above.

"What you are doing is against Github's TOS"