Checkpointing: Difference between revisions

Revision as of 14:21, 12 November 2015

Some recipes and tips for checkpointing on the B4F HPC

Definition

Checkpointing: Saving a program's state at a given point (a checkpoint) so that it can be restarted from that point after a planned or unplanned stop or failure.

It is useful for, e.g., long jobs that could be killed, intentionally or not, or jobs running on unstable computing systems.

Two types of program

The code is accessible

Recipe

Modify the code to implement the following recipe:

1. Look for a state file that includes all information required to restore the state the program was in when it stopped.
2. If it exists, read it and restore the state. Otherwise, create an initial state.
3. Periodically save the state to the state file.
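
The three steps above can be sketched in Python as follows. This is only a minimal illustration, not code from the B4F HPC documentation: the file name `state.json`, the functions `load_state`, `save_state` and `run`, and the dummy computation (a running sum) are all assumptions made for the example. The state is written to a temporary file and then renamed, so that a job killed in the middle of a save does not leave a truncated state file behind.

```python
import json
import os
import tempfile

STATE_FILE = "state.json"  # hypothetical file name, chosen for this example


def load_state(path):
    """Steps 1 and 2: restore the saved state if it exists, else create an initial one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Step 3: write to a temporary file, then rename it over the state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic on the same filesystem, so the old state file
    # stays valid until the new one is completely written.
    os.replace(tmp_path, path)


def run(path=STATE_FILE, n_steps=1000, save_every=100):
    """Run (or resume) the computation, saving the state every `save_every` steps."""
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # placeholder for the real computation
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_state(state, path)
    save_state(state, path)  # final save so a later run sees the finished state
    return state


if __name__ == "__main__":
    print(run())
```

If the job is killed and the script is started again, `run()` picks up from the last saved step instead of step 0. How often to save is a trade-off: frequent saves lose less work on a failure but cost more I/O.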

@@ Line 1: / Line 1: @@
-Some recipes and tips for checkpointing on the B4F HPC: [[Media:checkwiki.pdf]]
+Some recipes and tips for checkpointing on the B4F HPC
+==Definition==
+'''Checkpointing''': Saving a program's state at a given point (a checkpoint) so that it can be restarted from that point after a planned or unplanned stop or failure.
+It is useful for, e.g., long jobs that could be killed, intentionally or not, or jobs running on unstable computing systems.
+==Two types of program==
+===The code is accessible===
+====Recipe====
+Modify the code to implement the following recipe:
+. Look for a state file that includes all information required to restore the state the program was in when it stopped.<br />
+. If it exists, read it and restore the state. Otherwise, create an initial state.<br />
+. Periodically save the state to the state file.
+===The code is not accessible===
