(The jury is still out for M3/M4 which currently have no Asahi support - thus, no current prospects for driving the ANE bare-metal. Note however that the M3/Pro/Max ANE reported performance numbers are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)
I wouldn't say that they aren't useful for inference (there are pretty clear performance improvements even from the Asahi effort you linked) - it's just that you have to convert the model ahead of time to be compatible with the ANE, as explained in the README docs for whisper.cpp that I linked above.
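For concreteness, the ahead-of-time step looks roughly like this with coremltools - a minimal sketch, not the actual whisper.cpp script; the module name and input shape are hypothetical:

```python
# Rough sketch of ahead-of-time conversion for the ANE using coremltools.
# MyEncoder and the input shape are illustrative placeholders.
import torch
import coremltools as ct

model = MyEncoder().eval()                      # hypothetical PyTorch module
example = torch.randn(1, 80, 3000)              # illustrative input shape
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="mel", shape=example.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,    # ask Core ML to schedule onto the ANE
)
mlmodel.save("encoder.mlpackage")
```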
I would say though that this likely excludes them from being useful for training purposes.
Note that I was only commenting on modern quantized LLMs that basically avoid formats like FP16 or INT8, preferring lower precision wherever feasible. When in-memory model values must be padded up to FP16/INT8, this slashes your effective use of memory bandwidth, which is what determines token-generation speed. So the only feasible benefits are really in the prompt pre-processing phase, and even then only in lower power use compared to the GPU, not really in higher speed.
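A rough way to see the cost (numbers assumed, not measured): decode speed is bounded by bandwidth divided by the bytes you stream per token, which is roughly the weight size - so padding 4-bit weights up to FP16 cuts the ceiling by ~4x:

```python
# Back-of-envelope: decode speed ~ bandwidth / bytes-per-token, where
# bytes-per-token is roughly the size of the weights you have to stream.
# Both numbers below are illustrative, not measured.
params = 8e9                      # 8B-parameter model
bandwidth = 100e9                 # ~100 GB/s of usable memory bandwidth (assumed)

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("Q4 (4-bit)", 0.5)]:
    weight_bytes = params * bytes_per_param
    print(f"{name:>10}: ~{bandwidth / weight_bytes:5.1f} tok/s upper bound")
```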
That's really interesting! I didn't know about the padding behavior here. I'm curious which models this would include. I know Gemma 3 raw is bf16 - are you just talking about the quantized versions of these? Or are models being released purely as quantized versions these days? I know Google just released a QAT (Quantization-Aware Training) version of Gemma 3 27B - but that base model was already released.
Models may be released as unquantized (and even then they are gradually shifting towards lower precisions over time), but most people are going to be running them in a quantized version simply because that gives you the best bang for your buck (you can fit more interesting models on the same hardware). Of course this is strictly about local LLM inference, though one may reasonably assume that the big players are also doing something similar.
What do you mean by less wide? The main bottleneck for transformers is memory bandwidth. ANE has a much lower ceiling than CPU/GPU (yes, despite unified memory).
Chunking is actually beneficial as long as all the chunks can fit into the ANE’s cache. It speeds up compilation for large network graphs and cached loads are negligible cost. On M1 the cache limit is 3-4GB, but it is higher on M2+.
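If you want to sanity-check whether your chunks fit, something like this quick sketch works - the ~3 GB figure is the M1 limit mentioned above, and the chunk paths are hypothetical:

```python
# Quick sanity check: do all the ANE chunks fit under the (assumed) ~3 GB
# M1 limit? Paths are hypothetical placeholders.
import os

chunks = ["model_chunk1.mlmodelc", "model_chunk2.mlmodelc"]
limit_bytes = 3 * 1024**3

def dir_size(path):
    # .mlmodelc is a directory bundle, so sum every file inside it
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )

total = sum(dir_size(c) for c in chunks)
print(f"total {total / 1024**3:.2f} GiB, fits: {total <= limit_bytes}")
```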
I was referring to both the lower memory bandwidth and lower FLOPs. The GPU can just do… more at once? For now. Or is that changing?
I had also assumed that loading a chunk from the cache was not free because I’ve seen cache eviction on my M1, but it’s good to know that it’s no longer as big of a limitation.
also, I’m a big fan of your work! I played around with your ModernBERT CoreML port a bit ago
For single-batch inference of anything remotely LLM-sized you'll hit the memory bound way before FLOPs, so I haven't actually looked at FLOPs much. For raw performance GPU is certainly better. ANE is more energy efficient, but you need larger batches to really benefit.
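A rough way to see it (illustrative numbers, not measurements): each decoding step does about 2*params FLOPs per sequence but streams the weights once, so arithmetic intensity only climbs with batch size:

```python
# Why single-batch decode is memory bound: FLOPs per step scale with batch
# size, but the weight bytes streamed per step do not. Numbers are illustrative.
params = 8e9
bytes_per_param = 0.5            # 4-bit quantized weights
peak_flops = 10e12               # assumed ~10 TFLOPS of usable compute
bandwidth = 100e9                # assumed ~100 GB/s of memory bandwidth

machine_balance = peak_flops / bandwidth         # FLOPs the hardware can do per byte
for batch in (1, 8, 64):
    flops_per_step = 2 * params * batch          # each weight used once per token, per sequence
    bytes_per_step = params * bytes_per_param    # weights streamed once per step
    intensity = flops_per_step / bytes_per_step
    bound = "memory" if intensity < machine_balance else "compute"
    print(f"batch {batch:>3}: {intensity:6.1f} FLOP/byte vs balance {machine_balance:.0f} -> {bound}-bound")
```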
Maybe cache is the wrong word. This is a limit to how much can be mmap'd for the ANE at once. It's not too hard to hit on M1 if your model is in the GB range. Chunking the model into smaller pieces makes it more likely to "fit", but if it doesn't fit you have to unmap/remap in each forward pass which will be noticeable.
Awesome to hear about ModernBERT! Big fan of your work as well :)
Right. I was thinking about it - you still need batched prefill; however, Apple Core ML tools were failing on attention activation quantization. For long context, prefill is still compute bound.
More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the Tinygrad folks, note that this is also similarly outdated) seems to basically confirm the above.