Checkpointing: Difference between revisions

From HPCwiki

Latest revision as of 14:42, 19 March 2019

Some recipes and tips for checkpointing on Anunna

Definition

Checkpointing: saving a program's state at a given point (a checkpoint) so that the program can be restarted from that point after a planned or unplanned stop or a failure.

It is useful for, e.g., long jobs that could be killed (deliberately or not), or jobs running on unstable computing systems.

Two types of program

The code is accessible

Recipe

Modify the code to implement the following recipe:

<source lang=text>
1. Look for a state file that includes all information required to restore the state when the program was stopped.
2. If it exists, read it and restore the state. Else, create an initial state.
3. Periodically save the state to the state file.
</source>
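The three steps above can be sketched in Python as follows. This is only a minimal illustration, not a reference implementation: the file name `state.json`, the JSON format, the function names and the `every` parameter are all assumptions made for the example, and the summation stands in for the real computation. The state is written to a temporary file and then renamed, so that a job killed in the middle of a save does not leave a truncated state file behind.

```python
import json
import os
import tempfile

STATE_FILE = "state.json"  # hypothetical file name, chosen for this example


def load_state(path=STATE_FILE):
    """Steps 1 and 2: restore the saved state if it exists, else create an initial one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path=STATE_FILE):
    """Step 3: write to a temporary file, then rename it over the state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic when both files are on the same filesystem


def run(n_steps, every=100, path=STATE_FILE):
    """Run (or resume) the loop, saving the state every `every` steps."""
    state = load_state(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step  # placeholder for the real work
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_state(state, path)
    save_state(state, path)
    return state["total"]
```

If the program is stopped and started again with the same state file, `run` continues from the last saved step instead of from step 0. At most `every` steps of work are lost.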

This recipe is applicable to several languages (e.g., R, Python, Matlab, Fortran, C, shell, ...). Also, checkpointing a parallel program is easier right after a global synchronization.
Checkpointing requires:

  • some effort to write additional code;
  • some memory (e.g., for a state file). One must be careful about what is saved;
  • some time (e.g., to write the state file). Therefore, checkpointing too often must be avoided. Also, SLURM and other software features can be used to checkpoint at specific points.
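As an illustration of the last point, the sketch below limits how often the state is saved and also saves it when a termination signal arrives. It assumes that the program itself receives `SIGTERM` before the job is killed, which depends on how the job is submitted and on the cluster configuration. The names `install_handler`, `request_stop`, `run`, `save` and `min_interval` are hypothetical and chosen only for this example.

```python
import signal
import time

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag, the main loop saves at a safe point."""
    global stop_requested
    stop_requested = True


def install_handler():
    """Call once at start-up, from the main thread."""
    signal.signal(signal.SIGTERM, request_stop)


def run(work_items, save, min_interval=600.0):
    """Process the items in order.

    `save(i)` is called at most once every `min_interval` seconds, and once
    more when a stop has been requested. Returns the index of the next item
    to process, which is the value to store in the state file.
    """
    last_save = time.monotonic()
    for i, item in enumerate(work_items):
        item()  # placeholder for one unit of work
        now = time.monotonic()
        if stop_requested or now - last_save >= min_interval:
            save(i + 1)
            last_save = now
        if stop_requested:
            return i + 1
    save(len(work_items))
    return len(work_items)
```

Saving from the main loop rather than from inside the signal handler keeps the saved state consistent, because the handler can interrupt the program in the middle of a work item.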

The code is not accessible

Many of the software packages we use are checkpointable:

  • If this is the case, the (SLURM) script should be adapted to take advantage of it.
  • If the software is really not checkpointable, tools are available to checkpoint existing programs. One of them is DMTCP (Distributed MultiThreaded CheckPointing; http://dmtcp.sourceforge.net/). The SLURM script must be modified to use it efficiently on Anunna.