In the macOS 26.2 (Tahoe) beta, Apple introduced a Thunderbolt 5 RDMA driver, enabling up to 80 Gb/s of bidirectional bandwidth for Mac clustering, which is ideal for distributed ML on Apple Silicon. The driver is optimized for low latency, delivering ~14 Gbps of throughput at 4K MTU.
My tests (M4 Pro to M3 Ultra): the stock ibv_uc_pingpong achieved ~14 µs round trips for 4K packets (it requires setting a GID index), while a custom C++ variant hit 6-13 µs/iter: https://x.com/anemll/status/1993192776897642942
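For reference, a stock run looks roughly like this (the device name, GID index, and iteration count are machine-specific placeholders; the flags are the standard ones from the libibverbs pingpong examples, with -s 4096 matching the 4K packet size above):

    # server side (list devices with ibv_devices)
    ibv_uc_pingpong -d <ib_dev> -g <gid_idx> -s 4096 -n 1000
    # client side, pointing at the server's hostname/IP
    ibv_uc_pingpong -d <ib_dev> -g <gid_idx> -s 4096 -n 1000 <server>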
Code and details:
https://github.com/Anemll/mlx-rdma/blob/anemll-rdma/ibv_roun... (includes steps to enable RDMA in the macOS Recovery OS terminal)
Theoretically, this accelerates pipeline parallelism (faster layer handoffs) and tensor parallelism (low-overhead sharding) on GPUs, with potential extensions to ANE for real-time AI workflows.
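As a rough illustration of the pipeline-parallel case, here is a minimal sketch using MLX's distributed send/recv, assuming one process per Mac; whether MLX's distributed backend actually rides the new RDMA path here is an assumption, and forward_my_stage is a hypothetical stand-in for one node's shard of layers:

    import mlx.core as mx

    group = mx.distributed.init()          # one rank per Mac in the cluster
    rank, size = group.rank(), group.size()

    def forward_my_stage(x):
        # hypothetical stand-in for this rank's shard of transformer layers
        return x * 2.0

    x = mx.zeros((1, 4096))                # hidden-state tile for one token
    if rank > 0:
        # receive activations handed off by the previous pipeline stage
        x = mx.distributed.recv_like(x, rank - 1)
    x = forward_my_stage(x)
    if rank < size - 1:
        # hand the activations off to the next stage
        x = mx.distributed.send(x, rank + 1)
    mx.eval(x)                             # MLX is lazy; force the transfer

Launched across hosts (e.g. with mlx.launch), each rank only ever touches its own layer shard; the per-handoff cost is exactly the kind of microsecond-scale round trip measured above.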
Right. I was thinking about it; you still need batch refill. However, Apple's Core ML tools were failing on attention-activation quantization, and with long contexts prefill is still compute-bound.
What hardware are you on? Most models are memory-bandwidth limited. The ANE was limited to 64 GB/s prior to the M3 Max and M4 Pro. If you are on an M1, the GPU will be significantly faster for 3-8B models due to memory bandwidth rather than ANE capabilities.
The M4 Max should run at 120 GB/s for the ANE and 500+ GB/s for the GPU, so the GPU will be 3-4 times faster for anything over 1-3B parameters. The ANE is likely as fast for prefill due to higher FLOPs.
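A back-of-envelope check of those numbers (illustrative assumptions: an 8B model quantized to 4 bits, so ~4 GB of weights streamed once per generated token):

    # decode is bandwidth-bound: tok/s upper bound ~= bandwidth / weight bytes
    def tok_per_s(bandwidth_gb_s, params_b, bytes_per_param):
        return bandwidth_gb_s / (params_b * bytes_per_param)

    for name, bw in [("ANE @ ~120 GB/s", 120), ("GPU @ ~500 GB/s", 500)]:
        # 8B params * 0.5 bytes (4-bit) = ~4 GB of weights per token
        print(f"{name}: ~{tok_per_s(bw, 8, 0.5):.0f} tok/s")
    # -> ANE ~30 tok/s vs GPU ~125 tok/s, i.e. roughly the 4x gap above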
Note the fast-sync workaround.