Recently, a good friend suggested that, given my background and interest in infrastructure, I should explore ML training infra. I had the impression that this area is very academic and something I’d struggle to grasp without a refresher on linear algebra and ML frameworks. While that’s all true, one specific topic caught my attention: distributed checkpointing. After a bit of Googling, it turns out that aside from academic papers and a couple of blog posts by ML researchers, there isn’t much written on this topic by, or tailored to, infrastructure and systems engineers. So here I am, kicking off my Substack newsletter to discuss distributed checkpointing in LLM training workflows through the lens of an infrastructure engineer.
Checkpointing is a familiar mechanism for folks who have worked with stateful systems: store a snapshot of a system’s current state so that, if the system stops or crashes, it can be restored to that state. From database crash recovery, to save files in video games, to auto-save and version history in web-based tools like Figma, Notion, and Google Docs, checkpointing is a widely applied mechanism.
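To make the idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the file name, the JSON format, the toy loop) and a real training framework would serialize far more than a step counter, but the shape is the same: periodically snapshot state to durable storage, and on startup restore the last snapshot. One infra detail worth noticing even in a toy: writing to a temp file and renaming it means a crash mid-write never corrupts the last good checkpoint.

```python
import json
import os

CKPT_PATH = "checkpoint.json"  # hypothetical path for this toy example

def save_checkpoint(state: dict, path: str = CKPT_PATH) -> None:
    """Snapshot the current state atomically: write to a temp file,
    then rename, so a crash mid-write leaves the last checkpoint intact."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ensure bytes reach disk before the rename
    os.replace(tmp, path)     # atomic rename on POSIX filesystems

def load_checkpoint(path: str = CKPT_PATH) -> dict:
    """Restore the last saved state, or start fresh if none exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)

# Toy "training" loop: if the process crashes and restarts,
# it resumes from the last completed step instead of step 0.
state = load_checkpoint()
for step in range(state["step"], 10):
    state["step"] = step + 1   # ... do one unit of work ...
    save_checkpoint(state)     # snapshot so we can resume from here
```

Distributed checkpointing in LLM training is this same loop, except the “state” is hundreds of gigabytes of model and optimizer tensors sharded across thousands of GPUs, which is exactly where it gets interesting for infrastructure engineers.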