How to optimize a CUDA matmul kernel for cuBLAS-like performance (2022) (siboehm.com)
103 points by mpweiher 9 months ago | 33 comments



Another point to consider here is that this project of writing a cuBLAS-level GEMM kernel becomes much more challenging if you are doing it with fp16, and are thus competing with the cuBLAS kernels that use tensor cores. The (theoretical) arithmetic throughput of tensor cores is ~8x that of fp32 math on the Turing arch; I don't know off the top of my head, but I think this ratio is the same or greater for Ampere/Hopper tensor cores.

This makes the project proportionally harder in my opinion because you need to be that much more efficient with moving data through the memory hierarchy. With tensor cores, to get anywhere close to cuBLAS, you need to start with something like the most efficient kernel in Simon's article, and then do stuff like shared memory swizzling, async global memory copies, double buffering, and writing a really efficient kernel epilogue to accumulate the C matrix into the product.
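In case anyone is curious, the double-buffering + async-copy part looks roughly like this with the CUDA pipeline primitives (a sketch only; tile sizes, names, and indexing are made up, not from the article):

  #include <cuda_pipeline.h>

  __global__ void gemm_double_buffered(const float4 *A, int num_k_tiles) {
    // Two shared-memory buffers: compute from one while filling the other.
    // Assumes a 256-thread block, each thread moving one float4 per tile.
    __shared__ float4 smem_a[2][256];

    int stage = 0;
    // Prefetch tile 0 into buffer 0.
    __pipeline_memcpy_async(&smem_a[0][threadIdx.x], &A[threadIdx.x], sizeof(float4));
    __pipeline_commit();

    for (int kt = 0; kt < num_k_tiles; ++kt) {
      if (kt + 1 < num_k_tiles) {
        // Kick off the copy of the *next* tile into the other buffer,
        // then wait only for the tile we are about to compute on.
        __pipeline_memcpy_async(&smem_a[stage ^ 1][threadIdx.x],
                                &A[(kt + 1) * 256 + threadIdx.x], sizeof(float4));
        __pipeline_commit();
        __pipeline_wait_prior(1);
      } else {
        __pipeline_wait_prior(0);   // last tile: wait for everything in flight
      }
      __syncthreads();

      // ... warp-level mma.sync / wmma math on smem_a[stage] goes here ...

      __syncthreads();   // don't overwrite this buffer until everyone is done with it
      stage ^= 1;
    }
  }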

I came across this article a while ago and it inspired me to take a stab at this^. As of now I have gotten to ~80% of the cuBLAS tensor core performance, where the kernel is mostly compute bound, and I am close to giving up on the last ~20%: I think I may need to write the inner loop in SASS to make sure the instruction mix between shared memory loads, mma instructions, and synchronizations is perfectly balanced so that none of the hardware pipelines get overloaded (see link below), and I have enough compassion for myself not to spend my free time doing stuff like that :). There are also certain things implemented in CUTLASS that seem important (look up serpentine traversal), but NVIDIA engineers won't talk about the hardware details required to understand why this helps.

Article on this is forthcoming

https://github.com/NervanaSystems/maxas/wiki/SGEMM


Re: serpentine traversal, this has to do with the .reuse suffix applied to register operands as mentioned in your link. We don’t really have control over it because it’s happening inside of ptxas during SASS generation, but when CUTLASS does serpentine traversal they’re suggesting an order of MMA instruction issues that would result in at least one operand being reused from one instruction to the next— clever stuff.
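For anyone curious, the issue order CUTLASS is hinting at is roughly the following (schematic only, not CUTLASS's actual code; mma_issue() and the fragment types Acc/FragA/FragB are hypothetical stand-ins for the real mma.sync inline PTX and per-thread register fragments):

  template <int WARP_TILE_M, int WARP_TILE_N>
  __device__ void warp_tile_mma(Acc (&acc)[WARP_TILE_M][WARP_TILE_N],
                                const FragA (&frag_a)[WARP_TILE_M],
                                const FragB (&frag_b)[WARP_TILE_N]) {
    for (int n = 0; n < WARP_TILE_N; ++n) {
      bool reverse = (n & 1);                       // flip direction every column
      for (int i = 0; i < WARP_TILE_M; ++i) {
        int m = reverse ? (WARP_TILE_M - 1 - i) : i;
        // frag_b[n] is held constant down the column, and at the turn-around
        // frag_a[m] repeats across columns, so ptxas can mark those operands
        // .reuse instead of re-reading the register bank.
        mma_issue(acc[m][n], frag_a[m], frag_b[n]);
      }
    }
  }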

I’d be so happy if SASS were documented and ptxas were open source, sometimes I spend entire days going through whitepapers and various sources of online documentation to get more hardware details…


Note that this is from 2022.

My guess is that people nowadays are gradually moving away from raw CUDA programming and moving towards things like Triton etc, and you won't be focusing on pure GEMM since you tend to do some fusion.

The Triton tutorial claims their performance is on par with cuBLAS.

https://triton-lang.org/main/getting-started/tutorials/03-ma...


> My guess is that people nowadays are gradually moving away from raw CUDA programming and moving towards things like Triton etc

Your guess is wrong. Besides the fact that there's much more to life than matmul (for which triton is just ok), the other obvious fact is that triton has exactly 1 frontend (python) and there's much more to life than that frontend.

I find that basically in every thread about low-level work there's someone making some weird comment about how triton or mojo or XYZ supplants CUDA or assembly or whatever. I can't understand how this comes about because absolutely no one working in these areas thinks XYZ is going to supplant anything. So it's invariably outsiders making these claims and I cannot fathom why any outsider would be motivated to make claims from the outside.


Great comment! My wrong comment at least served as McDonald's Theory https://jonbell.medium.com/mcdonalds-theory-9216e1c9da7d

As an outsider, CUDA is so intimidating that the promise of Triton etc. is very appealing, and I wanted to be sold on it.


FWIW Triton can actually be used pretty easily straight from C++ using MLIR. XLA does that. Maybe not exactly a frontend, but not that far off.


> FWIW Triton

I have PRs in Triton - I'm well familiar with the fact that Triton is an MLIR project.

> C++ straight using MLIR

That's like saying LLVM IR is usable through C++... or hell, that's like saying NVPTX is usable through C++. It's not just not a frontend, it's the exact opposite: it's emitting IR using IR builders.


But Triton is just an abstraction over CUDA for Python (like CuPy, Numba, etc.). If you need low-level access you will still use CUDA; if you want high level you can use Triton or Numba; even higher, you'll just use wrappers like PyTorch/JAX.


It is still an excellent article if you care about how the GPU works. Triton doesn't magically hide all the hardware details.


I’m sure they are using cool super-modern stuff but I wonder if Eigen can somehow be made to spit out CUDA code?


Since 2016, you can use Eigen in CUDA C++ code and it just works: https://eigen.tuxfamily.org/dox/TopicCUDA.html
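To be concrete about "just works": fixed-size Eigen types compile inside device code, so something like this sketch is possible (dynamic-size, heap-allocating matrices are not supported on the device):

  #include <Eigen/Dense>

  // Each thread multiplies one 3x3 pair, using Eigen expressions on the GPU.
  __global__ void batched_3x3_multiply(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Eigen::Map<const Eigen::Matrix3f> A(a + 9 * i);   // view over this thread's 3x3 block
    Eigen::Map<const Eigen::Matrix3f> B(b + 9 * i);
    Eigen::Map<Eigen::Matrix3f>       C(c + 9 * i);
    C = A * B;
  }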


Oh wow, that is cool as heck.


Year added above. Thanks!


In other discussions here, people asserted that a CUDA replacement was unworkable because you couldn't replace Nvidia's cuBLAS implementation. I'm not qualified to say whether this would give enough info for constructing an adequate replacement, but I'd be interested in people's opinions.


Yes and no. Read his notes about his goals and what he didn't implement, and note that his job when he wrote that was as part of a team that ports models to different hardware. That's a smaller job than writing a shippable framework, and Anthropic has a team of people doing it. You can read the cuDNN docs to get an idea of some of the other stuff you'd need; there's a lot there, but generally that's the point. So in one sense, yes, if you have a strong performance engineering team and a specific workload already, you can make use of a variety of hardware. In another sense, no, a small team isn't likely to be able to write a direct alternative for the CUDA ecosystem that reaches devex parity. Lots of people have a pretty clear idea of what it takes to do this, but none of the major vendors seem to be doing the work.

Knowing that reaching broad devex parity is very expensive I think the real win is figuring out what specific problem you have and building community and robust software support around that.


The problem with AMD isn't that they are only hitting 90% of the hardware capabilities vs Nvidia hitting 98%.

It's the fact that AMD doesn't prioritize the reliability of its hardware and software stack. If I run llama.cpp on Vulkan I get a reasonable speedup, but if I raise the batch size to 512, the GPU starts making strange noises and shuts the PC down midway. Very cool. 98% of zero is still zero.


cuBLAS is not really Nvidia's moat: every competitor has a workable BLAS implementation and some are even drop-in replacements for cuBLAS.

In fact cuBLAS and CUDA are kinda orthogonal in that you're either calling a pre-built cuBLAS kernel or writing your own CUDA kernel but not really combining the two.
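To make that concrete, the "pre-built kernel" path is a single host-side library call with no kernel code of your own, roughly like this (sketch; error handling omitted, d_A/d_B/d_C are device pointers in column-major order):

  #include <cublas_v2.h>

  void sgemm_via_cublas(cublasHandle_t handle,
                        const float *d_A, const float *d_B, float *d_C, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, all N x N with leading dimension N.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,
                &alpha, d_A, N, d_B, N, &beta, d_C, N);
  }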

I'd say CUDA shines more because of stability, documentation, community support + examples, and ability to use modern C++ features in GPU code.


> In other discussion here, people asserted that a CUDA replacement was unworkable because you couldn't replace Nvidia's CuBLAS implementation

Targeting nvidia GPUs? Or in general? For whom?

Building a performant BLAS library is hard but certainly not impossible. The tricks discussed in this post are hardly anything new either. Now, making a BLAS competitive with Nvidia's on its own GPUs is bound to be tough. But not technically unfeasible (after all, you can drop down to PTX if needed).
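For reference, "drop down to PTX" just means inline asm inside an otherwise normal CUDA kernel, along these lines (trivial sketch; in a real GEMM you'd use it for mma.sync, cp.async, ldmatrix, and friends):

  // Single fused multiply-add issued as inline PTX from CUDA C++.
  __device__ __forceinline__ float fma_ptx(float a, float b, float c) {
    float d;
    asm volatile("fma.rn.f32 %0, %1, %2, %3;" : "=f"(d) : "f"(a), "f"(b), "f"(c));
    return d;
  }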


When I was optimizing GPU kernels a few years back, Nvidia's own kernels were getting those last few percent of performance by making use of hardware-specific features (I remember operand caches being one) that are not available through PTX.


Beating CuBLAS is easy. I wrote a 30-line kernel that multiplies two 4096² matrices 14% faster than CuBLAS on a 4090. The question is how to earn money on that.


If this were true (and I highly doubt it) it's obvious how to make money from it: collect a 7 figure paycheck from Nvidia, AMD, or any FAANG.


I swapped one of the kernels in the code from the article for my kernel, and left only the multiplication of matrices of size 4096².

On average over 20 runs:

CuBLAS (./sgemm 0) has 50.9 TFLOPS.

My kernel has 61.8 TFLOPS, so it's actually +21% speedup in this benchmark.

How do I collect my paycheck?


I gotta see it to believe it ;)


For all doubters, I open-sourced it: https://github.com/arekpaterek/Faster_SGEMM_CUDA


Believe it or not.

On a 4090 gpu, average of 20 runs of SGEMM_CUDA:

  size    tflops_cublas  tflops_my  diff
  4096²   50.8-50.9      61.8       +21%
  8192²   56.3-56.4      67.1       +19%
  16384²  53.6           66.7       +24%
I guess the right thing to do now would be to hire a B2B salesman and figure out which company needs it.


Post the code and your CV


I'm a little sceptical of your claims. Care to share the kernel you wrote?


I would bet actual money that they are not doing an apples-to-apples comparison.

I have seen how those high-performance libraries are made and I'm still in awe at the quality and quantity of the staffing involved. Those were the smartest and most knowledgeable engineers I met in my career.


It is apples-to-apples. The code from the article runs a chosen kernel 50 times in a loop. In the case of CuBLAS it runs the function cublasGemmEx() 50 times.
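For context, the benchmark boils down to roughly this (paraphrased, not the repo's exact code): time the launches with CUDA events and convert to TFLOPS via ~2·N³ FLOPs per GEMM, so at N = 4096 that's ~137 GFLOP per run, i.e. ~2.7 ms per run at ~51 TFLOPS.

  #include <cublas_v2.h>

  // Times `runs` back-to-back GEMMs and returns the achieved TFLOPS.
  double bench_cublas_tflops(cublasHandle_t handle, const float *d_A,
                             const float *d_B, float *d_C, int N, int runs) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < runs; ++i) {
      cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                   &alpha, d_A, CUDA_R_32F, N, d_B, CUDA_R_32F, N,
                   &beta,  d_C, CUDA_R_32F, N,
                   CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return runs * 2.0 * N * N * N / (ms * 1e-3) / 1e12;   // FLOPs / seconds / 1e12
  }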


Does your implementation support the complete feature set of cublasGemmEx()? E.g. varying matrix sizes, etc. Does your implementation beat cublas on a wide range of cases, or only cherry-picked examples? Does the code from the article follow the best coding practices for cublas in the first place?

Generalizing from a micro benchmark is typically hubris.



As the article reports: "However, for smaller matrices, we’re doing poorly in comparison to Nvidia’s library! This happens because cuBLAS contains not one single implementation of SGEMM, but hundreds of them." So it would take considerable effort to replace cuBLAS.


Of course it can be done, it's just a lot of effort. You need parametric kernels so you can find optimal configurations for all hardware and input-size combinations.
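Concretely, "parametric kernels" means something like tile sizes as compile-time template parameters that an autotuner sweeps per GPU and per problem shape (sketch, names made up):

  // Tile sizes as template parameters so an autotuner can sweep them.
  template <int BM, int BN, int BK, int TM, int TN>
  __global__ void sgemm_tuned(int M, int N, int K,
                              const float *A, const float *B, float *C) {
    __shared__ float As[BM * BK];
    __shared__ float Bs[BK * BN];
    // ... blocktile loads and warptile/threadtile math as in the article ...
  }

  // The autotuner instantiates and times a grid of configurations, e.g.
  //   sgemm_tuned<128, 128,  8, 8, 8><<<grid, block>>>(M, N, K, A, B, C);
  //   sgemm_tuned< 64, 128, 16, 4, 8><<<grid, block>>>(M, N, K, A, B, C);
  // and picks the fastest per (GPU, M, N, K) bucket, which is roughly what
  // cuBLAS ships as its "hundreds of kernels".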

Then there are also numerics: being fast is not enough if your implementation accumulates a lot of rounding errors doing so. Floating point arithmetic can and will mess up your results in unexpected ways. -funsafe-math-optimizations famously is neither fun nor safe.
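A tiny illustration of the point: accumulation order alone already changes fp32 results, before any unsafe-math reordering enters the picture (sketch):

  #include <cstdio>

  int main() {
    const int K = 1 << 20;
    float forward = 0.0f, backward = 0.0f;
    for (int i = 1; i <= K; ++i)  forward  += 1.0f / i;   // big terms first
    for (int i = K; i >= 1; --i)  backward += 1.0f / i;   // small terms first
    // The two sums differ in the low bits; a tiled GEMM reorders its
    // additions in the same way, so "fast" only counts if the error
    // stays acceptable.
    std::printf("%.8f vs %.8f\n", forward, backward);
    return 0;
  }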

Maybe tooling will catch up and make it easier. Think tinygrad with beam search, Triton, or Halide.



