This is a minimal but fast implementation of matrix multiplication in CUDA. The source code is a single 200-line file, gemm.cuh, which implements half-precision tensor core matrix multiplication optimised for the Turing (SM75) architecture.

The implementation builds on CuTe from CUTLASS, a low-level interface for tensor manipulation in CUDA. The code is well commented and is meant to be easy to read (only minimal CUDA/C++ background is required) and easy to hack on.
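For readers new to tensor cores, the sketch below shows roughly what a half-precision tensor core matrix multiplication looks like at its simplest. It is not taken from gemm.cuh and does not use CuTe: it uses the WMMA API from `mma.h` instead, with one warp per 16x16 output tile and no shared-memory tiling, so it will be far slower than the implementation described above. The kernel name `wmma_gemm`, the helper `wmma_gemm_max_error`, the memory layouts (A row-major, B column-major), and the choice of float accumulation are all assumptions made for illustration.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Illustrative kernel (not from gemm.cuh): each block holds one warp, and each
// warp computes one 16x16 tile of C = A * B on the tensor cores.
// Assumed layouts: A is M x K row-major, B is K x N column-major, C is M x N
// row-major. M, N and K must be multiples of 16.
__global__ void wmma_gemm(const half* A, const half* B, float* C, int N, int K) {
    int tile_m = blockIdx.y;  // tile row index in C
    int tile_n = blockIdx.x;  // tile column index in C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk along the K dimension in steps of 16, accumulating into c_frag.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + tile_n * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N,
                            wmma::mem_row_major);
}

// Hypothetical host helper: runs the kernel on small test data and returns the
// maximum absolute difference from a CPU reference, or -1 on a CUDA error.
float wmma_gemm_max_error(int M, int N, int K) {
    assert(M % 16 == 0 && N % 16 == 0 && K % 16 == 0);
    std::vector<half> hA(M * K), hB(K * N);
    // Small values that are exactly representable in half precision.
    for (int i = 0; i < M * K; ++i) hA[i] = __float2half((i % 7) * 0.125f);
    for (int i = 0; i < K * N; ++i) hB[i] = __float2half(((i % 5) - 2) * 0.25f);

    half *dA = nullptr, *dB = nullptr;
    float* dC = nullptr;
    if (cudaMalloc(&dA, hA.size() * sizeof(half)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dB, hB.size() * sizeof(half)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dC, M * N * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(half), cudaMemcpyHostToDevice);

    dim3 grid(N / 16, M / 16);
    wmma_gemm<<<grid, 32>>>(dA, dB, dC, N, K);  // 32 threads = one warp

    std::vector<float> hC(M * N);
    cudaError_t err = cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1.0f;

    // CPU reference using the same layouts as the kernel.
    float max_err = 0.0f;
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k)
                ref += __half2float(hA[m * K + k]) * __half2float(hB[n * K + k]);
            max_err = std::fmax(max_err, std::fabs(ref - hC[m * N + n]));
        }
    return max_err;
}
```

Calling `wmma_gemm_max_error(32, 32, 32)` from a `main` function and compiling with `nvcc -arch=sm_75` should be enough to try this on a Turing GPU. A tuned kernel would additionally stage tiles through shared memory and have each warp compute several output tiles, which is the kind of work a library such as CuTe helps to express.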

Running the code requires a working CUDA installation; see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/ for instructions. If you don't have a compatible GPU, you can run this in Colab:

Read more github.com/a...