parallelisms implements a character-level, autoregressive language model with an MLP backbone. The goal of this library is to demystify how to train LLMs in parallel by building everything from scratch in C (including gradient computations). While this is not a production library, the ideas presented here are the same ones used to train SOTA LLMs (like Llama 3).
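To make the model concrete, here is a minimal sketch (with hypothetical names and sizes, not this repo's actual API) of what one forward pass of a character-level autoregressive MLP looks like: embed a fixed window of previous characters, apply one tanh hidden layer, and produce a softmax distribution over the next character.

```c
/* Minimal sketch (hypothetical names, not the repo's API): one forward pass of a
 * character-level MLP that predicts the next character from a fixed context window. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define VOCAB 27     /* 'a'-'z' plus one separator token */
#define CONTEXT 3    /* previous characters fed to the MLP */
#define EMBED 8      /* embedding dimension per character */
#define HIDDEN 32    /* hidden layer width */

/* Embed the context, apply one tanh hidden layer, project to logits, softmax. */
static void forward(const float *emb, const float *w1, const float *b1,
                    const float *w2, const float *b2,
                    const int *ctx, float *probs) {
    float x[CONTEXT * EMBED], h[HIDDEN], logits[VOCAB];
    for (int t = 0; t < CONTEXT; t++)
        for (int e = 0; e < EMBED; e++)
            x[t * EMBED + e] = emb[ctx[t] * EMBED + e];
    for (int j = 0; j < HIDDEN; j++) {
        float s = b1[j];
        for (int i = 0; i < CONTEXT * EMBED; i++) s += x[i] * w1[i * HIDDEN + j];
        h[j] = tanhf(s);
    }
    float maxl = -1e30f, sum = 0.0f;
    for (int k = 0; k < VOCAB; k++) {
        float s = b2[k];
        for (int j = 0; j < HIDDEN; j++) s += h[j] * w2[j * VOCAB + k];
        logits[k] = s;
        if (s > maxl) maxl = s;
    }
    for (int k = 0; k < VOCAB; k++) sum += probs[k] = expf(logits[k] - maxl);
    for (int k = 0; k < VOCAB; k++) probs[k] /= sum;
}

int main(void) {
    /* Random weights only, to show the shapes involved; training would fit these
     * (and the backward pass would compute gradients for each of them). */
    float emb[VOCAB * EMBED], w1[CONTEXT * EMBED * HIDDEN], b1[HIDDEN];
    float w2[HIDDEN * VOCAB], b2[VOCAB], probs[VOCAB];
    srand(42);
    for (int i = 0; i < VOCAB * EMBED; i++) emb[i] = (rand() / (float)RAND_MAX - 0.5f) * 0.2f;
    for (int i = 0; i < CONTEXT * EMBED * HIDDEN; i++) w1[i] = (rand() / (float)RAND_MAX - 0.5f) * 0.2f;
    for (int i = 0; i < HIDDEN; i++) b1[i] = 0.0f;
    for (int i = 0; i < HIDDEN * VOCAB; i++) w2[i] = (rand() / (float)RAND_MAX - 0.5f) * 0.2f;
    for (int i = 0; i < VOCAB; i++) b2[i] = 0.0f;

    int ctx[CONTEXT] = {1, 2, 3}; /* the characters "abc" as token ids */
    forward(emb, w1, b1, w2, b2, ctx, probs);
    printf("p(next char = 'd' | \"abc\") = %f\n", probs[4]);
    return 0;
}
```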
The most interesting files to look at are the train_*.c files. train.c contains the reference implementation for single-threaded training, while the rest implement the individual parallelism methods. Finally, train_3d.c brings everything together to implement 3D parallel training; the individual parallelism implementations are modular enough that 3D parallelism "falls out" of composing them. distributed.c contains communication utilities and shared helpers used across multiple files.
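One way to picture why the composition "falls out": each worker's global rank can be decomposed into independent data-, pipeline-, and tensor-parallel coordinates, so each parallelism method only ever looks at its own axis. The sketch below is hypothetical (it is not taken from train_3d.c or distributed.c) but shows the idea for a 2x2x2 layout.

```c
/* Hypothetical sketch (not the repo's actual code) of mapping a global rank onto
 * data-parallel (dp), pipeline-parallel (pp), and tensor-parallel (tp) coordinates. */
#include <stdio.h>

#define DP 2  /* data-parallel replicas */
#define PP 2  /* pipeline stages        */
#define TP 2  /* tensor-parallel shards */

typedef struct { int dp, pp, tp; } Coords3D;

/* Treat the global rank as a number in mixed radix (DP, PP, TP). */
static Coords3D rank_to_coords(int rank) {
    Coords3D c;
    c.tp = rank % TP;
    c.pp = (rank / TP) % PP;
    c.dp = rank / (TP * PP);
    return c;
}

int main(void) {
    /* With DP*PP*TP = 8 workers, each rank gets a unique (dp, pp, tp) triple. */
    for (int rank = 0; rank < DP * PP * TP; rank++) {
        Coords3D c = rank_to_coords(rank);
        printf("rank %d -> dp=%d pp=%d tp=%d\n", rank, c.dp, c.pp, c.tp);
    }
    return 0;
}
```

Under this kind of decomposition, gradient averaging happens only across ranks that share the same (pp, tp) pair, pipeline sends/receives only cross the pp axis, and tensor-parallel reductions only cross the tp axis, which is what lets the individual methods compose without special-casing each other.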