Checkpoint and restore functionality for CUDA is exposed through a command-line utility called cuda-checkpoint. This utility can be used to transparently checkpoint and restore CUDA state within a running Linux process. Combine it with CRIU (Checkpoint/Restore in Userspace), an open-source checkpointing utility, to fully checkpoint CUDA applications.
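As a rough illustration of how the two tools might be combined, the sketch below builds the command lines for one checkpoint/restore cycle and prints them rather than executing them. It assumes the `--toggle` and `--pid` options of `cuda-checkpoint` and the standard `criu dump` / `criu restore` options; the PID `1234`, the image directory `demo`, and the Python helper names are hypothetical, and `--shell-job` is only appropriate if the target process was started from a shell.

```python
import shlex


def cuda_toggle_cmd(pid):
    """Build the command that toggles CUDA state for `pid` (suspend or resume)."""
    return ["cuda-checkpoint", "--toggle", "--pid", str(pid)]


def criu_dump_cmd(pid, images_dir):
    """Build the CRIU command that dumps the process tree rooted at `pid`."""
    return ["criu", "dump", "--shell-job",
            "--images-dir", images_dir, "--tree", str(pid)]


def criu_restore_cmd(images_dir):
    """Build the CRIU command that restores a process tree from `images_dir`."""
    return ["criu", "restore", "--shell-job", "--restore-detached",
            "--images-dir", images_dir]


def checkpoint_plan(pid, images_dir):
    """Suspend CUDA state first, then let CRIU dump the process."""
    return [cuda_toggle_cmd(pid), criu_dump_cmd(pid, images_dir)]


def restore_plan(pid, images_dir):
    """Restore the process with CRIU first, then resume its CUDA state."""
    return [criu_restore_cmd(images_dir), cuda_toggle_cmd(pid)]


if __name__ == "__main__":
    # Hypothetical PID and image directory, used for illustration only.
    for cmd in checkpoint_plan(1234, "demo") + restore_plan(1234, "demo"):
        print(shlex.join(cmd))
```

The ordering is the point of the sketch: CUDA state is toggled before CRIU dumps the process and toggled again after CRIU restores it. Running these commands for real typically requires elevated privileges and a suitable driver, so treat the output as a starting point rather than a tested recipe.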
Transparent, per-process checkpointing offers a middle ground between virtual machine checkpointing and application-driven checkpointing. Per-process checkpointing can be used in combination with containers to checkpoint the state of a complex application, facilitating use cases such as the following:
CRIU (Checkpoint/Restore in Userspace) is an open-source checkpointing utility for Linux, maintained outside of NVIDIA, which can checkpoint and restore process trees.
CRIU exposes its functionality through a command-line program called criu and operates by checkpointing and restoring every kernel-mode resource associated with a process. These resources include: