uv tool install llm llm install llm-moonshot llm keys set moonshot # paste key l...

JJax7 · 2025-11-06T17:24:53 1762449893

Love seeing this benchmark become more iconic with each new model release. Still in disbelief at the GPT-5 variants' performance in comparison but its cool to see the new open source models get more ambitious with their attempts.

aqme28 · 2025-11-06T19:57:54 1762459074

Only until they start incorporating this test into their training data.

orbital-decay · 2025-11-06T22:32:34 1762468354

Dataset contamination alone won't get them good-looking SVG pelicans on bicycles though, they'll have to either cheat this particular question specifically or train it to make vector illustrations in general. At which point it can be easily swapped for another problem that wasn't in the data.

jug · 2025-11-07T00:07:01 1762474021

I like this one as an alternative, also requiring using a special representation to achieve a visual result: https://voxelbench.ai

What's more, this doesn't benchmark a singular prompt.

nwienert · 2025-11-07T01:21:30 1762478490

they can have some cheap workers make about 10 pelicans by hand in svg, fuzz them to generate thousands of variations and throw it in their training pool. don't need to 'get good at svgs' by any means.

an0malous · 2025-11-06T19:57:27 1762459047

Why is this a benchmark though? It doesn’t correlate with intelligence

simonw · 2025-11-06T20:20:57 1762460457

It started as a joke, but over time performance on this one weirdly appears to correlate to how good the models are generally. I'm not entirely sure why!

behnamoh · 2025-11-06T21:57:41 1762466261

it has to do with world model perception. these models don't have it but some can approximate it better than others.

dmonitor · 2025-11-06T20:24:44 1762460684

It's simple enough that a person can easily visualize the intended result, but weird enough that generative AI struggles with it

JJax7 · 2025-11-06T20:13:02 1762459982

I'm not saying its objective or quantitative, but I do think its an interesting task because it would be challenging for most humans to come up with a good design of a pelican riding a bicycle.

also: NITPICKER ALERT

beepbooptheory · 2025-11-06T20:16:37 1762460197

I think its cool and useful precisely because its not trying to correlate intelligence. It's a weird kind of niche thing that at least intuitively feels useful for judging llms in particular.

I'd much prefer a test which measures my cholesterol than one that would tell me whether I am an elf or not!

HighGoldstein · 2025-11-06T20:00:05 1762459205

What test would be better correlated with intelligence and why?

ok_dad · 2025-11-06T20:14:10 1762460050

When the machines become depressed and anxious we'll know they've achieved true intelligence. This is only partly a joke.

jiggawatts · 2025-11-06T20:21:53 1762460513

This already happens!

There have been many reports of CLI AI tools getting frustrated, giving up, and just deleting the whole codebase in anger.

lukan · 2025-11-06T22:09:48 1762466988

There are many reports of CLI AI tools displaying words that humans express when they are frustrated and about to give up. Just what they have been trained on. That does not mean they have emotions. And "deleting the whole codebase" sounds more interesting, but I assume is the same thing. "Frustrated" words lead to frustrated actions. Does not mean the LLM was frustrated. Just that in its training data those things happened so it copied them in that situation.

jiggawatts · 2025-11-06T22:17:20 1762467440

This is a fundamental philosophical issue with no clear resolution.

The same argument could be made about people, animals, etc...

lukan · 2025-11-06T22:24:48 1762467888

The difference is, people and animals have a body, nerve system and in general those mushy things we think are responsible for emotions.

Computers don't have any of that. And LLM's in particular neither. They were trained to simulate human text responses, that's all. How to get from there to emotions - where is the connection?

jiggawatts · 2025-11-07T01:09:12 1762477752

Don't confuse the medium with the picture it represents.

Porn is pornographic, whether it is a photo or an oil painting.

Feelings are feelings, whether they're felt by a squishy meat brain or a perfect atom-by-atom simulation of one in a computer. Or a less-than-perfect simulation of one. Or just a vaguely similar system that is largely indistinguishable from it, as observed from the outside.

Individual nerve cells don't have emotions! Ten wired together don't either. Or one hundred, or a thousand... by extension you don't have any feelings either.

See also: https://www.mit.edu/people/dpolicar/writing/prose/text/think...

lukan · 2025-11-07T08:36:43 1762504603

Do you think a simulation of a weather forcast is the same as the real weather?

(And science fiction .. is not necessarily science)

jiggawatts · 2025-11-09T11:10:52 1762686652

> Do you think a simulation of a weather forcast is the same as the real weather?

If sufficiently accurate... then yes. It is weather.

We are mere information, encoded in the ripples of the fabric of the universe, nothing more.

hellzbellz123 · 2025-11-06T23:02:23 1762470143

This only seems to be an issue for wishy washy types that insist gpt is alive.

an0malous · 2025-11-06T22:38:19 1762468699

A mathematical exam problem not in the training set because mathematical and logical reasoning are usually what people mean by intelligence.

I don’t think Einstein or von Neumann could do this SVG problem, does that mean they’re dumb?

K0balt · 2025-11-07T01:50:19 1762480219

I actually prefer ascii art diagrams as a benchmark for visual thinking, since it requires 2 stages, Like svg, and also can test imaginative repurposing of text elements.

mrbonner · 2025-11-07T00:06:34 1762473994

I suspect that the OpenRouter result originates from a quantized hosting provider. The difference compared to the direct API call from Moonshot is striking, almost like night and day. It creates a peculiar user and developer experience since OpenRouter enforces quantization restrictions only at the API level, rather than at the account settings level.

simonw · 2025-11-07T01:29:45 1762478985

OpenRouter are proxying directly through to Moonshot - they're currently the only provider listed on https://openrouter.ai/moonshotai/kimi-k2-thinking/providers

irthomasthomas · 2025-11-07T08:43:27 1762505007

That does include the Turbo endpoint, moonshotai/turbo. Add this to your prompt to only use the full-fat model:

-o provider '{ "only": ["moonshotai"] }'

ahmedfromtunis · 2025-11-06T17:12:41 1762449161

Where do you run a trillion-param model?

Gracana · 2025-11-06T18:22:35 1762453355

If you want to do it at home, ik_llama.cpp has some performance optimizations that make it semi-practical to run a model of this size on a server with lots of memory bandwidth and a GPU or two for offload. You can get 6-10 tok/s with modest hardware workstation hardware. Thinking chews up a lot of tokens though, so it will be a slog.

simonw · 2025-11-06T18:33:39 1762454019

What kind of server have you used to run a trillion parameter model? I'd love to dig more into this.

Gracana · 2025-11-06T21:01:49 1762462909

Hi Simon. I have a Xeon W5-3435X with a 768GB of DDR5 across 8 channels, iirc it's running at 5800MT/s. It also has 7x A4000s, water cooled to pack them into a desktop case. Very much a compromise build, and I wouldn't recommend Xeon sapphire rapids because the memory bandwidth you get in practice is less than half of what you'd calculate from the specs. If I did it again, I'd build an EPYC machine with 12 channels of DDR5 and put in a single rtx 6000 pro blackwell. That'd be a lot easier and probably a lot faster.

There's a really good thread on level1techs about running DeepSeek at home, and everything there more-or-less applies to Kimi K2.

https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-hom...

stronglikedan · 2025-11-06T19:50:01 1762458601

If I had to guess, I'd say it's one with lots of memory bandwidth and a GPU or two for offload. (sorry, I had to, happy Friday Jr.)

isoprophlex · 2025-11-06T17:23:06 1762449786

You let the people at openrouter worry about that for you

MurizS · 2025-11-06T18:01:16 1762452076

Which in turn lets the people at Moonshot AI worry about that for them, the only provider for this model as of now.

skeptrune · 2025-11-06T18:36:37 1762454197

Good people over there

lab · 2025-11-06T21:50:59 1762465859

Does the run pin the temperature to 0 for consistency?

skhameneh · 2025-11-07T00:20:03 1762474803

I've been under the impression most inference engines aren't fully deterministic with a temperature of 0 as some of the initial seed values can vary.

Note: I haven't tested this nor have I played with seed values. IIRC the inference engines I used support an explicit seed value, that is randomized by default.

simonw · 2025-11-06T22:19:56 1762467596

No, I've never tried that.