
Hey, congrats on the launch. Been thinking a lot about this space (wrote this back in August: https://martinalderson.com/posts/building-a-tax-agent-with-c...).

Would love to connect - my email's in my bio if you have time!


But who sets the algorithm? Whichever department or branch of government was in charge of that would gain enormous power, and the political pressure would simply shift there.

The same goes for the data that feeds the algorithm - if you can control that, you control interest rates.


Agreed, but we're already living that reality. Moving to an algorithmic approach provides a layer of transparency that makes manipulation easier to detect.

This actually matches my experience quite well. I use vision (often) to do 2 main things in Claude Code:

1) give it text data from something that is annoying to copy and paste (eg labels off a chart, or logs from a terrible web UI that makes selecting text a pain).

2) give it screenshots of bugs, especially UI glitches.

It's extremely good at 1); I can't remember it ever getting it wrong.
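
For anyone curious what 1) looks like outside of Claude Code, here's a minimal sketch using the Anthropic Python SDK (Claude Code does the equivalent for you when you paste an image) - the model id and file name are placeholders:

    # Sketch of use case 1 via the Anthropic Python SDK. Model id and
    # file name are placeholders.
    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("chart.png", "rb") as f:
        img_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-opus-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
                {"type": "text",
                 "text": "Transcribe the axis labels and data point values from this chart as CSV."},
            ],
        }],
    )
    print(message.content[0].text)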

On 2) it _really_ struggled until Opus 4.5, almost comically so - I'd post a screenshot and a description of the UI bug and it would tell me "great, it looks perfect! What next?"

With Opus 4.5 it's not quite as laughably bad, but it still often misses very obvious problems.

It's very interesting to see the rapid progression on these benchmarks, as it's probably a very good proxy for "agentic vision".

I've come to the conclusion that browser use without vision (eg based on the DOM or accessibility trees) is a dead end, simply because "modern" websites tend to use a comical number of tokens to render. So if this gets very good (close to human level/speed) then we have basically solved agents being able to browse any website/GUI effectively.
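
A quick way to see the token problem for yourself - sketch only: the URL is a placeholder, cl100k_base is just a stand-in tokenizer, and the per-screenshot figure is an assumed ballpark:

    # Rough check of how many tokens the raw DOM of a page costs vs a screenshot.
    import requests
    import tiktoken

    html = requests.get("https://example.com/some-heavy-spa").text  # placeholder URL
    enc = tiktoken.get_encoding("cl100k_base")
    dom_tokens = len(enc.encode(html))

    screenshot_tokens = 1500  # assumed per-screenshot budget for a vision model
    print(f"raw HTML: {dom_tokens:,} tokens vs ~{screenshot_tokens:,} for one screenshot")

A heavy SPA's markup can easily run to hundreds of thousands of tokens, which is why I think pixels win.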


The problem is essentially memory bandwidth, afaik. Simplifying a lot in my reply, but most NPUs (all?) do not have more memory bandwidth than the GPU. They were originally designed when ML models were megabytes, not gigabytes. They have a small amount of very fast SRAM (4MB, I want to say?). LLM models _do not_ fit into 4MB of SRAM :).

And LLM inference is heavily memory bandwidth bound (reading input tokens isn't though - so it _could_ be useful for this in theory, but usually on device prompts are very short).

So if you are memory bandwidth bound anyway and the NPU doesn't provide any speedup on that front, it's going to be no faster. And it has loads of other gotchas, so there's no real standard "SDK" format for them.

Note the idea isn't bad per se: it has real efficiencies once you start getting compute bound (eg running multiple parallel batches of inference at once) - this is basically what TPUs do (but with far higher memory bandwidth).
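
To put rough numbers on it (all figures below are illustrative assumptions, not measurements):

    # Back-of-envelope: every generated token streams (roughly) the whole model
    # through memory once, so bandwidth caps decode speed no matter how much
    # compute the NPU has. All figures are illustrative assumptions.
    model_bytes = 8e9        # e.g. an ~8 GB quantised model
    npu_sram_bytes = 4e6     # the ~4MB of fast SRAM mentioned above
    dram_bandwidth = 100e9   # ~100 GB/s of shared LPDDR, same pool as the GPU

    print(f"model / SRAM ratio: {model_bytes / npu_sram_bytes:,.0f}x")   # ~2,000x too big
    print(f"decode ceiling: ~{dram_bandwidth / model_bytes:.1f} tokens/s")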


NPUs are still useful for LLM pre-processing and other compute-bound tasks. They will waste memory bandwidth during the LLM generation phase (even in the best-case scenario where they aren't physically bottlenecked on bandwidth to begin with, compared to the iGPU), since they generally have to read padded/dequantized data from main memory and compute directly on that, as opposed to being able to unpack it in local registers like iGPUs can.

> usually on device prompts are very short

Sure, but that might change with better NPU support, making time-to-first-token quicker with larger prompts.


Yes, I said that in my comment. And yes, they might be useful for that - but once you get to prompts long enough to have any significant compute time, you are going to need far more RAM than these devices have.

Obviously this might change in the future. But as things stand, dedicated silicon for _just_ LLM prefill doesn't make a lot of sense imo.


You don't need much on-device RAM for compute-bound tasks, though. You just shuffle the data in and out, trading a bit of latency for an overall gain on power efficiency which will help whenever your computation is ultimately limited by power and/or thermals.
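
To put some rough, assumed numbers on where prefill flips from memory-bound to compute-bound (a sketch only, ignoring KV cache and activation traffic):

    # Weights are reused across every prompt token, so compute grows with prompt
    # length while the weight traffic stays roughly constant. All figures are assumptions.
    params = 4e9             # 4B-parameter model
    bytes_per_param = 1      # int8-ish weights
    dram_bandwidth = 100e9   # bytes/s
    npu_flops = 40e12        # a "40 TOPS"-class NPU

    def prefill_time(prompt_tokens):
        compute_s = 2 * params * prompt_tokens / npu_flops     # ~2 FLOPs/param/token
        memory_s = params * bytes_per_param / dram_bandwidth   # one pass over weights
        return compute_s, memory_s

    for n in (16, 256, 4096):
        c, m = prefill_time(n)
        bound = "compute" if c > m else "memory"
        print(f"{n:5d} prompt tokens: compute {c*1e3:6.1f} ms, weights {m*1e3:6.1f} ms -> {bound}-bound")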

The idea that tokenization is what they're for is absurd - you're talking a tenth of a thousandth of a millionth of a percent of efficiency gain in real world usage, if that, and only if someone bothers to implement it in software that actually gets used.

NPUs are racing stripes, nothing more. No killer features or utility, they probably just had stock and a good deal they could market and tap into the AI wave with.


NPUs aren't meant for LLMs. There's a lot more neural net tech out there than LLMs.

> NPUs aren't meant for LLMs. There's a lot more neural net tech out there than LLMs.

OK, but where can I find demo applications of these that will blow my mind (and make me want to buy a PC with an NPU)?


Apple demonstrates this far better. I use their Photos app to manage my family pictures. I can search my images by visible text, by facial recognition, or by description (vector search). It automatically composes "memories" which are little thematic video slideshows. The FaceTime camera automatically keeps my head in frame, and does software panning and zooming as necessary. Automatic caption generation.

This is normal, standard, expected behavior, not blow-your-mind stuff. Everyone is used to having it. But where do you think the computation is happening? There's a reason that a few years back Apple pushed to deprecate older systems that didn't have the NPU.
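
The "search by description" part is conceptually just embedding search. A minimal sketch of the idea, using CLIP via sentence-transformers as a stand-in (Apple's actual pipeline isn't public, and the photo files are placeholders):

    # Compare a text embedding against precomputed image embeddings with cosine similarity.
    import numpy as np
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")  # embeds both images and text

    photo_paths = ["beach.jpg", "birthday.jpg", "dog_park.jpg"]  # placeholder files
    photo_embs = np.stack([model.encode(Image.open(p)) for p in photo_paths])
    photo_embs /= np.linalg.norm(photo_embs, axis=1, keepdims=True)

    query = model.encode("kids playing on the beach at sunset")
    query /= np.linalg.norm(query)

    scores = photo_embs @ query  # cosine similarity
    for i in np.argsort(scores)[::-1]:
        print(f"{scores[i]:.3f}  {photo_paths[i]}")

The expensive part is embedding every photo in the background as it arrives - exactly the kind of continuous low-priority work the NPU is there for.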


I've yet to see any convincing benchmarks showing that NPUs are more efficient than normal GPUs (ones that don't ignore the possibility of downclocking the GPU to make it run slower but more efficiently).

NPUs are more energy efficient. There is no doubt that a systolic array uses fewer watts per computation than a tensor operation on a GPU, for these kinds of natural-fit applications.

Are they more performant? Hell no. But if you're going to do the calculation, and if you don't care about latency or throughput (e.g. batched processing of vector encodings), why not use the NPU?

Especially on mobile/edge consumer devices -- laptops or phones.


> NPUs are more energy efficient. There is no doubt

Maybe because they sleep all the time. To be able to use an NPU you need at least a compiler which generates code for this particular NPU and a CPU scheduler which can dispatch instructions to this NPU.



Best NPU app so far is Trex for Mac.

I think they were talking about prefill, which is typically compute-bound.

Tbh it's been the same on Windows PCs since forever. Like MMX in the Pentium 1 days - it was marketed as basically essential for anything "multimedia" but provided somewhere between no and minimal speedup (very little software was compiled for it).

It's quite similar with Apple's Neural Engine, which afaik is used very little for LLMs, even via Core ML. I don't think I ever saw it being used in asitop. And I'm sure whatever was using it (facial recognition?) could have easily run on the GPU with no real efficiency loss.


Apple's neural engine is used a lot by the non-LLM ML tasks all over the system like facial recognition in photos and the like. The point of it isn't to be some beefy AI co-processor but to be a low-power accelerator for background ML workloads.

The same workloads could use the GPU but it's more general purpose and thus uses more power for the same task. The same reason macOS uses hardware acceleration for video codecs and even JPEG, the work could be done on the CPU but cost more in terms of power. Using hardware acceleration helps with the 10+ hour lifetime on the battery.


Yes, of course, but it's basically a waste of silicon (which is very valuable) imo - you save a handful of watts on very few tasks. I would be surprised if, over the lifetime of my MacBook, the NPU has been utilised for more than 1% of the time the system is in use.

You still need a GPU regardless of whether JPEG and h264 decode happen on dedicated hardware - for games, animations, etc.


Do you use Apple's Photos app? Ever see those generated "memories," or search for photos by facial recognition? Where do you think that processing is being done?

Your macbook's NPU is probably active every moment that your computer is on, and you just didn't know about it.


How often is the device generating memories, or am I searching for photos? I don't use Apple Photos fwiw, but even if I did I doubt I'd be in that app for 1% of my total computer time, and only a fraction of that would be spent doing stuff on the ANE. I don't think searching for photos requires it anyway - if they're already indexed, it's just a vector search.

You can use asitop to see how often it's actually being used.

I'm not saying it's not ever used, I'm saying it's used so infrequently that any (tiny) efficiency gains do not trade off vs running it on the GPU.


Continuously in the background. There's basically a nonstop stream of ML tasks queued up to run on this energy-efficient processor, and you see the results as they come in. That indexing operation is slow, and runs continuously!

You also have Safari running OCR on every image and video on every webpage you load, to let you select and copy text.

I have to disagree with you about MMX. It's possible a lot of software didn't target it explicitly but on Windows MMX was very widely used as it was integrated into DirectX, ffmpeg, GDI, the initial MP3 libraries (l3codeca which was used by Winamp and other popular MP3 players) and the popular DIVX video codec.

Similar to AI PCs right now, very few consumers cared in the late 90s. The majority weren't power users creating/editing video/audio/graphics. Most consumers were just consuming, and they never had a need to seek out MMX for that; their main consumption bottleneck was likely bandwidth. If they used MMX indirectly in Winamp or DirectX, they probably had no clue.

Today, typical consumers aren't even using a ton of AI or enough to even make them think to buy specialized hardware for it. Maybe that changes but it's the current state.


MMX had a chicken/egg problem; it did take a while to "take off", so early adopters really didn't see much from it, but by the time it was commonplace it was doing some work.

ffmpeg didn't come out for 4 years after the MMX brand was introduced!

Of course MMX was widely used later but at the time it was complete marketing.


Using Vision OCR stuff on macOS spins my M4 ANE up from 0 to 1W, according to poweranalyzer.

The silicon is sitting idle in the case of most laptop NPUs. In my experience, embedded NPUs are very efficient, so there's theoretically real gains to be made if the cores were actually used.

Yes but you could use the space on die for GPU cores.

At least with the embedded platforms I'm familiar with, dedicated silicon to NPU is both faster and more power efficient than offloading to GPU cores.

If you're going to be doing ML at the edge, NPUs still seem like the most efficient use of die space to me.


I don't think it's that per se; it's just that Apple has a lot of resources to optimise/test a relatively small number of configurations.

The big "issue" with Linux on non-server workloads imo is a lack of testing like this - which is completely understandable. Afiak Microsoft runs millions of automated tests on various hardware configurations _a day_.

Intel does something similar for the Linux kernel, which no doubt explains the relative stability of Linux on servers vs the desktop (servers run far less "OS level" software in day-to-day use than desktops do).

The desktop experience itself needs more automated testing. There are so many bugs/regressions that I've noticed in eg GNOME which should have been caught by e2e testing - I do try to report them when I see them.

Doing a bit more digging, there seems to be some basic e2e testing for GNOME run nightly, but currently most tests are failing: https://openqa.gnome.org/tests/12128.

This isn't a criticism at all btw; it's quite boring and resource-intensive work for a project like GNOME to do. I hope some large corp soon decides to go all in on a real Linux desktop (not ChromeOS) and can devote resources to this.


The vertical integration is what makes for the small amount of configurations. The total count of OEMs they have to satisfy or work around is one.


Totally agree - wrote this over the holidays, which sums it all up pretty well: https://martinalderson.com/posts/why-im-building-my-own-clis...


Well, you can cache stuff and also use read replicas. But yes, you are correct: for writes it doesn't help much, to say the least. But some (most?) sites are 99.9% reads...
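
A minimal sketch of the read-replica half of that - hostnames are placeholders, and a real setup also has to deal with replication lag (read-your-own-writes):

    # Route SELECTs to a read replica and everything else to the primary.
    import psycopg2

    primary = psycopg2.connect("dbname=app host=db-primary.internal")    # placeholder
    replica = psycopg2.connect("dbname=app host=db-replica-1.internal")  # placeholder

    def run(sql, params=()):
        conn = replica if sql.lstrip().lower().startswith("select") else primary
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall() if cur.description else None

    run("INSERT INTO posts (title) VALUES (%s)", ("hello",))        # hits the primary
    print(run("SELECT title FROM posts ORDER BY id DESC LIMIT 5"))  # hits the replica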


I don't think the benchmarks catch this very well. Opus 4.5 is _significantly_ better than Sonnet 4.5 in my experience, far more than the SWE Bench scores would say. I can happily leave Opus 4.5 running for 20-30 minutes and come back to very high quality software on complex tasks/refactoring. Sonnet 4.5 would fall over within a couple of minutes on these tasks.


What does "very high quality" mean here?


Good catch, bad wording from me. Revised on the post.


