Checkpointing: Difference between revisions
Revision as of 14:26, 12 November 2015
Some recipes and tips for checkpointing on the B4F HPC
Definition
Checkpointing: Saving the program's state at a checkpoint so that the program can be restarted from that point after a planned or unplanned stop or a failure.
It is useful for, e.g., long jobs that could be killed (voluntarily or not), or jobs running on unstable computing systems.
Two types of program
The code is accessible
Recipe
Modify the code to implement the following recipe:
1. Look for a state file that includes all information required to restore the state the program was in when it was stopped.
2. If it exists, read it and restore the state. Otherwise, create an initial state.
3. Periodically save the state to the state file.
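The three steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the file name `state.pkl`, the helper names `load_state`, `save_state` and `run`, and the dictionary layout of the state are all assumptions made for the example.

```python
import os
import pickle

STATE_FILE = "state.pkl"  # hypothetical state file name


def load_state(path=STATE_FILE):
    """Steps 1 and 2: restore the state if a state file exists, else create an initial state."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "total": 0}


def save_state(state, path=STATE_FILE):
    """Step 3: write to a temporary file, then rename it over the old state file.

    If the job is killed during the write, the previous checkpoint stays intact.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def run(n_iterations, checkpoint_every=100, path=STATE_FILE):
    state = load_state(path)
    # Resume from the iteration stored in the state, not from zero.
    for i in range(state["iteration"], n_iterations):
        state["total"] += i            # placeholder for the real computation
        state["iteration"] = i + 1
        if state["iteration"] % checkpoint_every == 0:
            save_state(state, path)    # periodic checkpoint
    save_state(state, path)            # final state
    return state
```

Calling `run(1000)` a second time after an interruption would pick up at the last saved iteration instead of starting over. The `checkpoint_every` parameter is the knob that trades lost work against time spent writing the state file.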
This recipe is applicable to several languages (e.g., R, Python, Matlab, Fortran, C, shell, ...). Also, parallel programs are easier to checkpoint right after a global synchronization, when all processes are at a known, consistent point.
Checkpointing has a cost:
- some additional code must be written;
- some memory (e.g., for a state file). One must be careful about what is saved;
- some time (e.g., to write the state file). Therefore, checkpointing too often must be avoided. Also, SLURM and other software features can be used to checkpoint at specific points.
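As one illustration of checkpointing at a specific point, a program can save its state when it receives a signal instead of on every iteration. The sketch below assumes that the scheduler delivers `SIGUSR1` to the Python process shortly before the job ends (SLURM's `sbatch --signal` option can request such a signal, but whether it reaches the process depends on how the job script is written). The class `CheckpointOnSignal` and the `save_state` callback are hypothetical names introduced for the example.

```python
import signal


class CheckpointOnSignal:
    """Records that a checkpoint was requested; the main loop acts on it at a safe point."""

    def __init__(self, signum=signal.SIGUSR1):
        self.requested = False
        signal.signal(signum, self._handler)

    def _handler(self, signum, frame):
        # Only set a flag here: doing file I/O inside a signal handler is fragile.
        self.requested = True


def run(n_iterations, save_state, watcher):
    state = {"iteration": 0}
    for i in range(n_iterations):
        state["iteration"] = i + 1     # placeholder for the real computation
        if watcher.requested:
            save_state(state)          # checkpoint once, at a consistent point
            break                      # then stop cleanly before the job is killed
    return state
```

Setting a flag in the handler and saving from the main loop keeps the state file consistent, because the state is never written in the middle of an iteration.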