Checkpointing: Difference between revisions

From HPCwiki
No edit summary
IA migration §8: polish — syntaxhighlight, typo fixes, MATLAB casing, See also (via update-page on MediaWiki MCP Server)
 
Line 1: Line 1:
Some recipes and tips for checkpointing on Anunna
Some recipes and tips for checkpointing on Anunna.


==Definition==
== Definition ==
'''Checkpointing''': Saving the program's state at a checkpoint with the aim to restart it from that point in case of (un)planned stop of failure.


It is interesting for, e.g., long jobs that could be (un)voluntary killed, or jobs running on unstable computing systems.
'''Checkpointing''' is saving a program's state at a checkpoint so that it can be restarted from that point after a planned or unplanned stop or failure.
 
It is useful for, for example, long jobs that could be killed (voluntarily or not), or jobs running on unstable systems.
 
== Two types of program ==
 
=== The code is accessible ===


==Two types of program==
===The code is accessible===
====Recipe====
Modify the code to implement the following recipe:
Modify the code to implement the following recipe:


<source lang=text>
<syntaxhighlight lang="text">
1. Look for a state file that includes all imformation required to restore the state when the program  
1. Look for a state file that contains all information required to restore the state
was stopped.
  from when the program was stopped.
2. If it exists, read it and restore the state. Else, create an intial state.
2. If it exists, read it and restore the state. Otherwise, create an initial state.
3. Periodically save the state to the state file.
3. Periodically save the state to the state file.
</source>
</syntaxhighlight>
 
This recipe applies to several languages (R, Python, MATLAB, Fortran, C, shell, …). Checkpointing parallel programs is easier after a global synchronization.
 
Checkpointing requires:
 
* some effort to write the additional code;
* some memory (for example for a state file) — be careful about what is saved;
* some time (for example to write the state file). Avoid checkpointing too often. SLURM and other software features can also be used to checkpoint at specific points.
 
=== The code is not accessible ===
 
Many of the programs we use are checkpointable:


This recipe is applicable to several languages (e.g., R, Python, Matlab, Fortran, C, shell,...). Also, the checkpointing of parallel programs is easier after a global synchronization.<br />
* If so, adapt the SLURM script to use this feature.
The checkpointing requires:<br />
* If the software is really not checkpointable, some tools can checkpoint existing programs. One of them is [http://dmtcp.sourceforge.net/ DMTCP] (Distributed MultiThreaded CheckPointing). The SLURM script must be modified to use it efficiently on Anunna.
* some efforts to write some additional code;
* some memory (e.g. for a state file). One must be careful to what is saved;
* some time (e.g., to write the state file). Therefore, checkpointing too often must be avoided. Also, SLURM and other software features can be used to checkpoint at some specific points.


===The code is not accessible===
== See also ==
Many software we used are checkpointable:<br />
* [[Batch Jobs]]
* If it is the case, the (SLURM) script should be adapt to use this advantage.
* [[Scheduler Overview (Slurm)]]
* If the software is really not checkpointable, some software are available to checkpoint existing programs. One of them is DMTCP (Distributed MultiThreaded CheckPointing; [http://dmtcp.sourceforge.net/  DMTCP]). The SLURM script must be modified to use it efficiently on Anunna.

Latest revision as of 12:58, 18 June 2026

Some recipes and tips for checkpointing on Anunna.

Definition

Checkpointing is saving a program's state at a checkpoint so that it can be restarted from that point after a planned or unplanned stop or failure.

It is useful, for example, for long jobs that could be killed (voluntarily or not) and for jobs running on unstable systems.

Two types of program

The code is accessible

Modify the code to implement the following recipe:

1. Look for a state file that contains all information required to restore the state
   from when the program was stopped.
2. If it exists, read it and restore the state. Otherwise, create an initial state.
3. Periodically save the state to the state file.

This recipe applies to several languages (R, Python, MATLAB, Fortran, C, shell, …). Checkpointing parallel programs is easier right after a global synchronization, because all processes are then at a known, consistent point.
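As an illustration, the three steps of the recipe could look like this in Python. This is a minimal sketch, not a reference implementation: the file name `state.json`, the layout of the state dictionary, and the function names are all hypothetical, and the loop body stands in for the real computation. The state is written to a temporary file and then renamed, so that a job killed during a write does not leave a truncated state file behind.

```python
import json
import os
import tempfile

STATE_FILE = "state.json"  # hypothetical name, chosen for this example


def load_state(path):
    """Steps 1 and 2: restore the saved state if it exists, else create an initial one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Step 3: write to a temporary file, then rename it over the state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # replaces the old state file in one step


def run(path=STATE_FILE, n_steps=1000, every=100):
    """Resume from the last saved step and checkpoint every `every` steps."""
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        if state["step"] % every == 0:
            save_state(state, path)
    save_state(state, path)  # final state, so a later run finds nothing left to do
    return state
```

Calling `run()` a second time after an interruption continues from the last saved step instead of starting from zero. The value of `every` sets the trade-off between the work lost after a kill and the time spent writing the state file.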

Checkpointing requires:

  • some effort to write the additional code;
  • some memory (for example for a state file), so be careful about what is saved;
  • some time (for example to write the state file). Avoid checkpointing too often. SLURM and other software features can also be used to checkpoint at specific points.
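One way to checkpoint at a specific point rather than on a fixed schedule is to react to a signal. The sketch below assumes that the Python process receives SIGTERM or SIGUSR1 shortly before the job is stopped; whether and how this happens depends on the job script (for example, sbatch has a `--signal` option) and is not described on this page. The class and function names are hypothetical, and `save` is any callable that writes the state.

```python
import signal


class CheckpointRequest:
    """Records that a signal arrived; the main loop acts on it at a safe point."""

    def __init__(self, signals=(signal.SIGTERM, signal.SIGUSR1)):
        self.requested = False
        for sig in signals:
            signal.signal(sig, self._handler)

    def _handler(self, signum, frame):
        # Only set a flag here; the actual saving happens in the main loop.
        self.requested = True


def work(n_steps, request, save):
    """Run n_steps steps, or save and stop early once a signal was received."""
    step = 0
    while step < n_steps:
        step += 1  # placeholder for the real work
        if request.requested:
            save(step)
            return step
    save(step)
    return step
```

Saving from the main loop instead of from inside the handler keeps the state file consistent, because the checkpoint is only written between two complete steps.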

The code is not accessible

Many of the programs we use have their own checkpointing feature:

  • If so, adapt the SLURM script to use this feature.
  • If the software is really not checkpointable, some tools can checkpoint existing programs. One of them is DMTCP (Distributed MultiThreaded CheckPointing). The SLURM script must be modified to use it efficiently on Anunna.

See also

  • Batch Jobs
  • Scheduler Overview (Slurm)