
1. In MPK, each task is mapped to an individual SM. The amount of work handled by a task is similar to that of a thread block in the traditional kernel-per-operator approach.

2. TL;DR: MPK automatically analyzes inter-task dependencies by tracking the input and output tensors associated with each task. A longer version: MPK uses imap, omap, and fmap (see Section 2 of the Mirage paper) to determine each task’s input and output tensors. A dependency is introduced between task A and task B if A produces any tensor elements that B consumes, that is, if A's outputs overlap with B's inputs.

> Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself an output of a prior matmul (+ nonlinearity), the dependees would be all of the output tile tasks corresponding to those M_BLOCK rows of the operator that produced A?

Exactly. In this case, all output tile tasks that consume those M_BLOCK rows of A will depend on all tasks responsible for producing the corresponding parts of A in the previous operator.
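For concreteness, here is a small Python sketch of that overlap rule (illustrative only, not MPK's actual code; the row_start/m_block fields are hypothetical): an edge is added from a producer task to a consumer task whenever the rows of A the consumer reads intersect the rows the producer wrote.

    # Hypothetical task descriptors: which M rows of A a task touches.
    def row_set(task):
        return set(range(task["row_start"], task["row_start"] + task["m_block"]))

    def depends_on(consumer, producer):
        # Overlap between what the consumer reads and what the producer wrote
        # introduces a dependency edge producer -> consumer.
        return bool(row_set(consumer) & row_set(producer))

    # Two matmul layers tiled with M_BLOCK = 64 over 256 rows.
    layer1 = [{"row_start": r, "m_block": 64} for r in range(0, 256, 64)]
    layer2 = [{"row_start": r, "m_block": 64} for r in range(0, 256, 64)]

    edges = [(i, j)
             for i, p in enumerate(layer1)
             for j, c in enumerate(layer2)
             if depends_on(c, p)]
    print(edges)  # aligned tiles: each layer-2 task depends on one layer-1 task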


Thanks for reproducing our results!


You are right that CUDA graphs can help reduce launch overhead, but they do not support overlapping computation/communication across layers, since data dependencies are described at the kernel level.
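A minimal PyTorch sketch for reference (assumes a CUDA device; illustrative only): the captured sequence replays with a single launch, but the graph still executes the original per-operator kernels in order, so one layer's communication cannot be overlapped with another layer's computation.

    import torch

    device = torch.device("cuda")
    x = torch.randn(1, 4096, device=device)
    w1 = torch.randn(4096, 4096, device=device)
    w2 = torch.randn(4096, 4096, device=device)

    def two_layers(inp):
        return torch.relu(inp @ w1) @ w2

    # Warm up on a side stream before capture, as required for graph capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        two_layers(x)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = two_layers(x)

    g.replay()  # one launch, but dependencies remain at kernel granularity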


Thanks for the great feedback! Stanford's MegaKernel project tackles a similar challenge but focuses on manual CUDA implementation, while MPK takes a compiler-driven approach: users express their LLMs at the PyTorch level, and MPK automatically compiles them into optimized megakernels. Our goal is to make programming megakernels much more accessible.

I completely agree that CUDA can be a limiting factor, especially for latency-sensitive workloads. As GPUs are becoming larger and faster, it's increasingly difficult to write standalone kernels that fully utilize hardware resources—particularly when optimizing for low latency with small batch sizes.

> What are the chances we see your work land in PyTorch as an experimental backend?

We're definitely excited about that direction. We believe MPK can help PyTorch support megakernel generation, and we’re actively exploring how to make that happen. Stay tuned!

> P.S. minor typo, your first two paragraphs under part 1 are nearly identical.

Thanks for pointing it out--I meant to remove the duplicate paragraph when finalizing the post.


Hi Author - thank you very much for the clear and relatively easy-to-understand MPK overview. Could you please also comment on the similarity of your project to Hidet (https://pytorch.org/blog/introducing-hidet/)?

Thank you!


Yes, it would be a lot of fun if MPK could enable torch.compile to generate megakernels. Torch-generated kernels are currently too slow for latency-sensitive workloads.


Thanks for the feedback! Yes, we believe the approach is general and applicable to other ML workloads.


The task implementations used by MPK are currently optimized for A100. While the Mirage compiler can generate task implementations for other architectures such as Hopper and Blackwell, we haven't integrated everything together yet. This is at the very top of our todo list. Stay tuned!


JAX's operator fusion (https://apxml.com/courses/advanced-jax/chapter-2-optimizing-...) can fuse a few local operators (e.g., matmul and elementwise computation) into a single kernel. But JAX's approach cannot fuse an entire LLM with hundreds of operators into a single kernel, because fusing many of those operators involves loop transformations.
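A minimal JAX example of that kind of local fusion (illustrative only): under jax.jit, XLA can typically fuse the bias add and ReLU into the matmul's epilogue, but each layer still lowers to its own kernel(s) rather than the whole model becoming one kernel.

    import jax
    import jax.numpy as jnp

    @jax.jit
    def layer(x, w, b):
        # matmul + elementwise ops: candidates for local fusion by XLA
        return jnp.maximum(x @ w + b, 0.0)

    x = jnp.ones((8, 512))
    w = jnp.ones((512, 512))
    b = jnp.zeros((512,))
    print(layer(x, w, b).shape)  # (8, 512)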

MPK takes a different approach where instead of incrementally fusing local operators, it decomposes operators into a task graph and builds a runtime system within a single kernel to execute all tasks specified in the task graph.
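Here is a conceptual sketch of that scheduling idea in plain Python (not MPK's actual runtime, which lives inside the megakernel and runs across SMs): tasks carry dependency counters, and a task becomes ready to execute once all of its producers have finished.

    from collections import deque

    def run_task_graph(tasks, edges):
        # tasks: {task_id: callable}; edges: (producer_id, consumer_id) pairs
        remaining = {t: 0 for t in tasks}
        consumers = {t: [] for t in tasks}
        for producer, consumer in edges:
            remaining[consumer] += 1
            consumers[producer].append(consumer)

        ready = deque(t for t, n in remaining.items() if n == 0)
        while ready:
            task = ready.popleft()
            tasks[task]()  # on a GPU this would be a tile of work on one SM
            for c in consumers[task]:
                remaining[c] -= 1
                if remaining[c] == 0:
                    ready.append(c)

    # Usage: run_task_graph({"a": lambda: None, "b": lambda: None}, [("a", "b")])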


The github repo includes a tutorial for using MPK: https://github.com/mirage-project/mirage/tree/mpk


Thanks a lot for your positive feedback! We believe that MPK can enhance existing LLM serving systems, especially for low-latency LLM serving. We are very excited about the opportunity to collaborate with others in this direction.

