Checkpointing
Some recipes and tips for checkpointing on Anunna.
Definition
Checkpointing is saving a program's state at a checkpoint so that it can be restarted from that point after a planned or unplanned stop or failure.
It is useful, for example, for long jobs that could be killed (voluntarily or not) or for jobs running on unstable systems.
Two types of program
The code is accessible
Modify the code to implement the following recipe:
1. Look for a state file that contains all the information required to restore the state the program was in when it stopped.
2. If it exists, read it and restore the state. Otherwise, create an initial state.
3. Periodically save the state to the state file.
This recipe applies to several languages (R, Python, MATLAB, Fortran, C, shell, …). Checkpointing a parallel program is easier right after a global synchronization, when all processes have reached the same point.
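The three steps of the recipe can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the file name state.json, the layout of the state dictionary, and the helper names load_state, save_state and run are all hypothetical, and the running sum stands in for the real computation. The state is written to a temporary file and then renamed, so a job killed in the middle of a write does not leave a truncated state file behind.

```python
import json
import os
import tempfile

STATE_FILE = "state.json"  # hypothetical file name


def load_state(path=STATE_FILE):
    """Steps 1 and 2: restore the saved state if it exists, else create an initial one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path=STATE_FILE):
    """Step 3: write to a temporary file, then rename it over the old state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # replaces the old file in one step


def run(n_steps, checkpoint_every=100, path=STATE_FILE):
    state = load_state(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step  # stand-in for the real work
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_state(state, path)
    save_state(state, path)  # final state, so a rerun has nothing left to do
    return state["total"]
```

If the job is stopped and the same script is submitted again, run() picks up at the step recorded in the state file instead of starting from zero.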
Checkpointing requires:
- some effort to write the additional code;
- some memory or disk space (for example for a state file), so be careful about what is saved;
- some time (for example to write the state file), so avoid checkpointing too often.
SLURM and other software features can also be used to checkpoint at specific points.
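One way to checkpoint at a specific point rather than on a fixed schedule is to save the state only when the job receives a signal. The sketch below assumes a Linux system, that the scheduler is asked to send SIGUSR1 shortly before the time limit (sbatch has a --signal option for this), and that the signal actually reaches the Python process, which depends on how the program is launched in the job script. The class CheckpointRequest, the function run and the save callback are hypothetical names used for illustration only.

```python
import os
import signal


class CheckpointRequest:
    """Records that a checkpoint signal arrived; the main loop does the saving."""

    def __init__(self, signum=signal.SIGUSR1):
        self.requested = False
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here: writing files inside a handler is fragile.
        self.requested = True


def run(n_steps, save, request):
    total = 0
    for step in range(n_steps):
        total += step  # stand-in for the real work
        if request.requested:
            request.requested = False
            # Save at a point where the state is consistent.
            save({"next_step": step + 1, "total": total})
    return total
```

Setting a flag in the handler and saving from the main loop keeps the state file consistent, because the save always happens between two complete steps.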
The code is not accessible
Many of the programs we use already support checkpointing:
- If the software supports it, adapt the SLURM script to use this feature.
- If the software is really not checkpointable, some tools can checkpoint existing programs. One of them is DMTCP (Distributed MultiThreaded CheckPointing). The SLURM script must be modified to use it efficiently on Anunna.