Faster compute helps for things like vision language models, which need a bigger context to be filled. My understanding is that the ANE is still optimized for convolution workloads and compute efficiency, while the new neural accelerators are optimized for flexibility and performance.
I am not an expert on the ANE, but I think it is related to the size of the register files, which is smaller than what we need for GEMM on modern transformers (especially the fat ones with MoE).
AIUI the ANE operates on data in unified memory, not in a register file, so this wouldn't be an inherent limitation. (OTOH, that's also why it wastes memory bandwidth on most newer transformer models, which use heavily quantized weights: the ANE has to read padded/unquantized values, and the fraction of memory bandwidth spent on that padding is pure waste.)
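To put rough numbers on that, here's a back-of-the-envelope sketch. The bit widths are my own assumptions for illustration (4-bit or 8-bit quantized weights being read back at a padded 16-bit width), not a statement of which formats the ANE actually supports:

```python
# Back-of-the-envelope: wasted memory bandwidth when quantized weights
# must be read at a wider, padded width. Illustrative only; the bit
# widths below are assumptions, not the ANE's actual supported formats.

def wasted_fraction(quantized_bits: int, read_bits: int) -> float:
    """Fraction of read bandwidth that carries no useful information."""
    return 1.0 - quantized_bits / read_bits

# 4-bit quantized weights read back as 16-bit values:
print(f"{wasted_fraction(4, 16):.0%} of that traffic is padding")  # 75%

# 8-bit quantized weights read back as 16-bit values:
print(f"{wasted_fraction(8, 16):.0%} of that traffic is padding")  # 50%
```

So under those assumptions, most of the bandwidth spent streaming the weights carries padding rather than data, which matters a lot when token generation is memory-bandwidth bound.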