Modifying Custom Matmul CUDA Kernels – DeMoriarty – Beep Boop

2021-05-28 10:30:08

I started learning CUDA last year, and began writing matrix multiplication kernels as a learning project. After some struggle I got them to work, but was then disappointed to find that my kernels were 10 times slower than cuBLAS GEMM kernels. Maybe my expectations were a bit too high. I tried lots of open-source matmul kernels on GitHub, but the best one I found was still about 5 times slower (some of them were optimized for older architectures). So I started the journey of optimizing my own matmul kernel. After a few months of trial and error, my matmul kernel finally has speed comparable to cuBLAS GEMM.
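For context, the first kernel in a learning project like this usually looks like the naive one-thread-per-output-element version sketched below. This is an illustrative sketch, not the author's actual code: the kernel name, row-major layouts, and launch configuration are my assumptions. Kernels like this re-read whole rows and columns from global memory for every output element, which is a large part of why they start out many times slower than cuBLAS.

```cuda
// Naive matmul: one thread per element of C = A * B,
// where A is MxK, B is KxN, C is MxN, all row-major.
// Hypothetical sketch, not the author's kernel.
__global__ void naive_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        // Each thread walks a full row of A and column of B
        // straight out of global memory -- no tiling, no reuse.
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Example launch: 16x16 thread blocks tiling the MxN output.
// dim3 block(16, 16);
// dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
// naive_matmul<<<grid, block>>>(dA, dB, dC, M, N, K);
```

Closing the gap to cuBLAS from here is mostly about data reuse: shared-memory tiling, register blocking, and vectorized loads.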

Why would you need to write your own matmul kernel? The answer is: in most cases you don't need to, and also shouldn't. It simply isn't worth the effort when highly optimized cuBLAS kernels are available.

In this post, I mainly want to talk about the second point: some modifications that can be done on matmul kernels and their applications.
