Hacker News | simonw's comments

I appreciated John Gruber's piece on this: https://daringfireball.net/2026/04/another_day_has_come

Accessed via OpenRouter, this one decided to wrap the SVG pelican in HTML with controls for the animation speed: https://gisthost.github.io/?ecaad98efe0f747e27bc0e0ebc669e94...

Transcript and HTML here: https://gist.github.com/simonw/ecaad98efe0f747e27bc0e0ebc669...


At this point drawing these pelicans must be in the training data sets.


I hereby certify that these are indeed the most perfect and precise SVG depictions of a pelican riding a bicycle, also known among biology scholars as pelycles

Just a few years ago, this would have been a meaningless repo.

That's truly a wonderful collection of pelicans riding bicycles.

Much Win! ;)


These are amazing. I smiled after I saw just how wonderfully rendered they are.

These pelicans are clearly indicative of good RL training algorithms.

I want to fly too

This is pretty funny

I love it!

love this adversarial work

yeah putting the captcha on there to thwart the LLMs ability to extract good pelicans was a really good idea

Shhhhh, they're going to be on to us.


> If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices.

This relies on the false premise that, if they included it in their training dataset, the result would be perfect. It only needs to be good enough and better than the others, not perfect.


I'm not sure if we can have a "perfect" Pelican riding a bicycle. Like, I could probably commission a highly experienced artist to draw one and I don't think it would be perfect. The legs would probably have to be too long, or pedals oddly placed, or handles strange, or wings with hands.

Based on the one Simon commented though, I'd say we're in decent territory to try the latter part of his hypothesis.


> The legs would probably have to be too long, or pedals oddly placed, or handles strange, or wings with hands.

In all seriousness, that's what makes it an interesting test: it's asking for something technically impossible, that requires artistic license to make coherent.

Making specific choices on where to bend reality (and where not to) is a big chunk of visual art.


Yes we all know that, but we still like to see the pelicans because it's a tradition more or less

Why no Utah Teapot!

Clearly not.

I mean the prompt was succinct and clear, as always - and it still decided to hallucinate multiple features (animation + controls) beyond the prompt.

I'd also like to point out that, to date, no drawing has actually been good from a quality perspective (as in comparable to what a decent designer would throw together).

They're always only "good" relative to being a one-shot, low-effort prompt. Very little content for training purposes.


The way I’ve come to think of LLMs is that what they produce in a single reply, even with thinking turned up, is akin to what you’d do in a single short session of work.

And so if you ask it to do something big it will do a very surface level implementation. But if you have it iterate many times, or give it small pieces each time, you’ll end up with something closer to what a human would do.

I imagine the pelican test but done in a harness that has the agents iterate 10+ times would be closer to what you’d expect, especially if a visual model was critiquing each time.
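The iterate-and-critique harness described above can be sketched in a few lines. Everything here is stubbed for illustration: `generate_svg` and `critique` are hypothetical placeholders where a real harness would call a drawing model and a vision model.

```python
def generate_svg(prompt, feedback):
    """Stub: a real harness would ask an LLM to (re)draw given the feedback."""
    detail = 0 if feedback is None else len(feedback)
    return f'<svg><!-- pelican, detail={detail} --></svg>'

def critique(svg, round_num):
    """Stub: a real harness would render the SVG and have a visual model
    score it and suggest fixes. Here quality just pretends to improve."""
    score = min(1.0, 0.3 + 0.2 * round_num)
    return score, f"round {round_num}: fix wing pivot, wheel spokes"

def iterate_pelican(prompt, max_rounds=10, target=0.9):
    """Loop: draw, critique, feed the critique back in, stop when good enough."""
    svg, feedback, score = "", None, 0.0
    for round_num in range(max_rounds):
        svg = generate_svg(prompt, feedback)
        score, feedback = critique(svg, round_num)
        if score >= target:
            break
    return svg, score

svg, score = iterate_pelican("Generate an SVG of a pelican riding a bicycle")
print(round(score, 2))  # the stub converges after a few rounds
```

The interesting design question is the stopping condition: a real critic model's scores are noisy, so a fixed round budget plus a "no improvement in N rounds" check tends to be more robust than a single threshold.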


Yeah, this is how I use AI. Instead of a single session one-shot, it's usually limited to single targeted edits, and then I steer it on each step. Takes longer but the output is actually what I want.

What does good even mean… I have no idea what a good “pelican on a bike” should look like. It’s a fun prompt because there are no good answers… at least so I thought.


There are countless examples of animals riding bicycles etc. in the comic books I grew up with.

It would always look goofy - by design, but it usually looked good.


I’m OK with a Chinese model getting the W. It’s ultimately good for all of us.

We got an overachiever, here. Kimi sounds like a teacher's pet kind of name.

Underappreciated comment

I was part of the beta; it's a properly good model, and in some sense I forgot I'm not on Opus or GPT. Opus is still better. GPT is the one struggling for me: it has some niche in backend work, but you can get the same with Opus with skills, and it's lacking in almost all other areas.

Funny, for me Opus is struggling since about February.

4.7 made no difference, so for the first time in many moons I am cancelling my subscription.


It looks like a drunk pelican rolling downhill on its bicycle

Too bad they didn't put equal effort into the pelican's legs and feet. Left leg paralyzed and not moving, and right ankle flipping around in alarming fashion!

[flagged]


It's a lighthearted, fun, visual benchmark that's not part of the standard benchmarks; and at least traditionally, it was not something that the labs trained on so it was something of a measure of how well the intelligence of the model generalized. Part of the idea of LLMs is that they pick up general knowledge and reasoning ability, beyond any tasks that they are specifically trained for, from the vast quantity of data that they are trained on.

Of course, a while back there was a Gemini release that I believe specifically called out its ability to produce SVGs, for illustration and diagramming purposes. So it's no longer necessarily the case that the labs aren't training on generating SVGs, and in fact, there's a good chance that even if they're not doing so explicitly, the RLVR process might be generating tasks like that as there is more and more focus on frontend and design in the LLM space. So while they might not be specifically training for a pelican riding a bicycle, they may actually be training on SVG diagram quality.


This isn't even a normal pelican image post: this one created an HTML control system that animates the distance the wing travels from its pivot in time with the wheel's rotation speed. Let's not pretend this is a solved problem and models are dumping out perfect pelicans on bikes one after another (or ever?).

Surely, you know someone makes the same post you did every time one is posted. Surely you see the answers and pushback, since you are familiar with these posts. Genuine question: did you expect a different answer this time?



It doesn't, I get that it's _a_ benchmark. It's just not a good or insightful one, and having it posted so often on HN feels like low quality spam at this point

The issue is that benchmarks that look insightful will end up being gamed by labs quickly (Goodharts law)

The best LLM benchmarks test around the margins of those behaviors, tasks that are difficult and correlate with usefulness while being removed enough to stay unpolluted


It's a great filter for people who take things far too seriously

It's tradition at this point. Based on the upvotes the comment receives, it looks like many readers find value in it.

Upvotes are cheap; the fact that something is upvoted doesn't mean it's valuable (see: Reddit). Another question is how insightful the discussion under a typical pelican comment is (and how much of it relates to the pelican versus how often it's just where the general discussion happens).

It means somebody likes it.

[flagged]


> Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills.

https://news.ycombinator.com/newsguidelines.html


Every forum gets regulars and their fan clubs. If you go to /r/comics and look at top for the month you'll see 4 out of 5 are pizzacakecomic. People on these forums sort of form a fanclub around 'their guy'. This forum's guy is this chap. Not much point being upset about it, tbh.

I, for one, find it entertaining.

[flagged]


Well clearly some people care.

I'd love it if that API (which I do not believe Anthropic charge anything for) worked without an API key.

Yeah that should work - it looks like the same pixel dimension image at smaller sizes has about the same token cost for 4.6 and 4.7, so the image cost increase only kicks in if you use larger images that 4.6 would have presumably resized before inspecting.
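That resizing behavior can be sketched as a rough cost estimator. This assumes Anthropic's documented approximation of roughly (width × height) / 750 tokens per image, with oversized images scaled down to a long edge of about 1568 px and the cost topping out near 1,600 tokens; treat the exact constants as assumptions rather than guarantees, and use the token-counting API for real numbers.

```python
import math

def estimate_image_tokens(width, height):
    """Rough image token estimate under the assumptions above:
    tokens ~= (w * h) / 750, long edge capped at 1568 px, ~1600 token ceiling."""
    long_edge = max(width, height)
    if long_edge > 1568:  # oversized images are scaled down before counting
        scale = 1568 / long_edge
        width, height = width * scale, height * scale
    return min(math.ceil(width * height / 750), 1600)

print(estimate_image_tokens(1000, 1000))  # ~1334: under the cap, cost scales with area
print(estimate_image_tokens(4000, 4000))  # 1600: resized down, hits the ceiling
```

This matches the observation above: below the resize threshold the token cost tracks pixel area, so a version that no longer resizes smaller images would only cost more for images large enough to have been scaled down before.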

Yes, in fact it has an entirely different system prompt from the ones that Anthropic publish on https://platform.claude.com/docs/en/release-notes/system-pro...

The Claude Code one isn't published anywhere but it's very easy to get hold of. One way to do that is to run Claude Code through a logging proxy - I was using a project called claude-trace for this last year but I'm not sure if it still works, I've not tried it in a while: https://simonwillison.net/2025/Jun/2/claude-trace/
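Once you have a logging proxy capturing request bodies, pulling the system prompt out is straightforward. This sketch assumes one JSON-encoded Messages API request body per line; the actual log format of any given proxy (claude-trace included) may differ. The `system` field is a documented top-level parameter that can be a string or a list of text blocks.

```python
import json

def extract_system_prompts(jsonl_text):
    """Pull the top-level "system" field out of logged Messages API
    request bodies, one JSON object per line (format is an assumption)."""
    prompts = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        body = json.loads(line)
        system = body.get("system")
        if isinstance(system, str):
            prompts.append(system)
        elif isinstance(system, list):  # list-of-blocks form
            prompts.append("".join(b.get("text", "") for b in system))
    return prompts

log = '{"model": "claude-x", "system": "You are Claude Code...", "messages": []}'
print(extract_system_prompts(log))  # ['You are Claude Code...']
```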


I thought that would happen after the first Trump term. It did not.

The second one has made an even stronger case for doing so though.


I'm fascinated by this idea of not reviewing AI generated code. On the surface it sounds absurd - we know these machines make mistakes all the time, so how could we ever responsibly move ahead with code they have written without closely reviewing every detail?

Then I remembered the times I've worked at large companies and depended on code written by other teams. I didn't review every line of code they had written - I'd trust that they had done a competent job, integrate with that code myself, and only dig into the details of their code if I ran into bugs or performance issues or other smells that something was wrong.

Trusting humans is obviously different from trusting AI - humans have reputations, and social contracts, and actual intelligence as opposed to multiplying matrices and rolling a dice. But... I do think an AI model can still earn trust over time. I've spent enough time with Opus 4.5 and 4.6 that I trust them not to make dumb mistakes with the common categories of code that I use them for. Of course now I need to rebuild that trust with 4.7!

I think the most interesting challenge here is to figure out how to have coding agents demonstrate that the code works without actually reading every line of it yourself - in the same way that I might ask an engineering team I haven't worked with before for a demo and then interrogate them about their testing strategy before relying on their work.


The distinction here that keeps getting glossed over in such comparisons is accountability.

If the engineering team fucks up somehow they can be held accountable. An AI cannot be held accountable.


100% agree. A human has to be accountable for the work.

As an engineering manager I can take accountability for the output of my team even if I don't review every line. Using coding agents feels similar.


Except that your reports are held accountable for the specific portions assigned to them.

When I delegate to the AI there’s no pressure to deliver good work. There’s no performance review, threat of firing, or performance bonuses.

All the responsibility and accountability flows upward to the person orchestrating, but there is no hierarchy of responsibility like there is with people.


People who use AI are responsible for what it does. IBM had it right in 1979:

A computer can never be held accountable

Therefore a computer must never make a management decision

https://simonwillison.net/2025/Feb/3/a-computer-can-never-be...


Do you review all machine code your compiler produces?

...how exactly do you think that's even remotely the same thing?

Compiler output is deterministic based on input code - which is typically reviewed before compiling by someone(s) who will be held accountable for it.


You have to copy data across, and confirm that everything worked correctly, and if you're being fancy about it you need to freeze writes to the old server while you are migrating and then unfreeze after you've directed traffic to the new server. It's not trivial.
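The ordering described above is the whole trick: freeze before copying, verify before switching, unfreeze only after traffic has moved. A minimal sketch with stubbed-out steps (the real freeze/copy/verify operations depend entirely on the database and traffic layer involved):

```python
# Each stub just records that it ran, so the ordering is visible.
steps = []

def freeze_writes(server):   steps.append(f"freeze:{server}")
def copy_data(src, dst):     steps.append(f"copy:{src}->{dst}")
def verify(src, dst):        steps.append(f"verify:{src}->{dst}")
def switch_traffic(dst):     steps.append(f"switch:{dst}")
def unfreeze_writes(server): steps.append(f"unfreeze:{server}")

def migrate(old, new):
    freeze_writes(old)    # no new writes can land on the old server
    copy_data(old, new)   # copy data across
    verify(old, new)      # confirm everything worked correctly
    switch_traffic(new)   # direct traffic to the new server
    unfreeze_writes(new)  # unfreeze only once traffic has moved

migrate("db-old", "db-new")
print(steps)
```

Get the ordering wrong (unfreezing before the switch, or copying before the freeze) and you silently lose the writes that arrive in the gap, which is exactly why it's not trivial.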

Claude is actually very good at SVGs, and it's genuinely useful. I have Claude knock out little SVG icons all the time.

Illustrations with SVGs of pelicans riding bicycles will never be useful, because pelicans can't ride bicycles.


If they're testing against it why do most of their attempts suck so much?
