I noticed there is very little information and few easy examples for using the SM-to-SM communication network on the Hopper architecture (Compute Capability 9.0 / sm_90), so I thought I would quickly give a super simple example.
I’ve also been working on a more extended write-up on using the distributed shared memory features for an N-body simulation kernel, but that will have to wait till I’m done with my exams 😿.
So what is distributed shared memory (DSMEM) and why do we want it? Most “interesting” GPU problems end up being very sensitive to memory characteristics (latency and bandwidth), so much so that in my GPU architecture course we ended up talking almost exclusively about CUDA’s memory model and its hardware implementation.
You have global memory, which is usually the number you see for GPU memory on the label. This ends up being DRAM, since we want loads of it to store our results and data (just look at the sizes of those Llama 3.1 models), but that also means we hit the ‘memory wall’ and end up being bottlenecked fairly early by memory speed.
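To give a flavour of where this is going before the full example, here is a minimal sketch (my own illustration, not the extended kernel mentioned above) of what DSMEM access looks like through the cooperative groups cluster API on sm_90: each block publishes a value in its own shared memory, and its neighbour in the cluster reads it directly over the SM-to-SM network instead of bouncing through global memory.

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Minimal DSMEM sketch: every block writes its cluster rank into its own
// shared memory, then reads the neighbouring block's value directly over
// the SM-to-SM fabric (no round trip through global DRAM).
__global__ void __cluster_dims__(2, 1, 1) neighbour_rank(int *out)
{
    __shared__ int my_rank;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        my_rank = cluster.block_rank();   // publish in our own shared memory
    cluster.sync();                       // wait until every block has written

    // Map the peer block's shared memory into our address space.
    unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *remote = cluster.map_shared_rank(&my_rank, peer);

    if (threadIdx.x == 0)
        out[blockIdx.x] = *remote;        // read over the SM-to-SM network

    cluster.sync();                       // keep remote shared memory alive until all reads finish
}

int main()
{
    int *out;
    cudaMallocManaged(&out, 4 * sizeof(int));
    neighbour_rank<<<4, 32>>>(out);       // grid size must be a multiple of the cluster size
    cudaDeviceSynchronize();
    for (int i = 0; i < 4; ++i)
        printf("block %d saw neighbour rank %d\n", i, out[i]);
    cudaFree(out);
}
```

Compile with `-arch=sm_90` or newer; the `__cluster_dims__` attribute and `map_shared_rank` are only available on Hopper-class hardware.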