Hacker Newsnew | past | comments | ask | show | jobs | submit | anemll's commentslogin

Tensor Parallel test with RDMA last week https://x.com/anemll/status/1996349871260107102

Note fast sync workaround


In macOS 26.2 (Tahoe) beta, Apple introduced a low-latency Thunderbolt 5 RDMA driver, enabling up to 80 Gb/s bidirectional bandwidth for Mac clustering—ideal for distributed ML on Apple Silicon. It's optimized for low latency, delivering ~14 Gbps throughput at 4K MTU. My tests (M4 Pro to M3 Ultra): Stock ibv_uc_pingpong achieved ~14 µs round-trip for 4K packets (requires GID index setup). Custom C++ variant hit 6-13 µs/iter: https://x.com/anemll/status/1993192776897642942 Code and details: https://github.com/Anemll/mlx-rdma/blob/anemll-rdma/ibv_roun... https://github.com/Anemll/mlx-rdma/blob/anemll-rdma/ibv_roun... (includes steps to enable RDMA in macOS Recovery OS terminal) Theoretically, this accelerates pipeline parallelism (faster layer handoffs) and tensor parallelism (low-overhead sharding) on GPUs, with potential extensions to ANE for real-time AI workflows.


It’s also supported in Apple Neural Engine https://github.com/Anemll/Anemll


We can ran 2000 or 4000 context with ANE


Right.I was thinking about it, you still need batch refill, however, Apple Core ML tools were failing for attention activations quantization. Long context, pre-fill is still compute bound.


Yes for GPU, however ANE only supports FP16 plus integers. M4/A17 added accelerated int8 that is twice faster than FP16


Memory bandwidth is the main bottleneck. It got better with M3/M4. ANE is really fast in FLOPS but low in memory bandwidth.


What hardware are you on? Most models are memory bandwidth limited. ANE was limited to 64GB/s prior to M3 Max or M4 pro. If you are on M1, GPU will be significantly faster for 3-8B models due to memory bandwidth rather then ANE capabilities.


M4 Max with 128GB of memory.


M4 max should work at 120GB for ANE and 500+ for GPU. So GPU will be 3-4 times faster for anything over 1-3B. ANE is likely as fast for prefill due to higher FLOPs


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: