Checkpointing lets you resume a simulation after a failure instead of starting it over. Use it for all long-running simulations to prevent data loss from hardware or software failures. To enable checkpointing, launch the simulation through DMTCP, which is available on every workstation and cluster node.

To checkpoint a simulation, run the command:

dmtcp_checkpoint -b -i 21600 -c [checkpoint-dir] [simulation_binary] 

This writes a checkpoint file every 21600 seconds (6 hours) and generates restart scripts in the directory [checkpoint-dir]. For a binary named my_binary, the directory then contains:

ckpt_my_binary_<id>.dmtcp
dmtcp_restart_script_<id>.sh
dmtcp_restart_script.sh
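
As a minimal sketch of how the launch command above is assembled, the following helper converts an interval given in hours into the seconds expected by -i and prints the resulting command line. The function name build_checkpoint_cmd, the directory ./checkpoints and the binary ./my_binary are assumptions chosen for illustration; they are not part of DMTCP.

```shell
#!/bin/sh
# Sketch only: the directory and binary names below are hypothetical.

# Build the dmtcp_checkpoint command line for an interval given in hours.
# Arguments: <interval-hours> <checkpoint-dir> <simulation-binary>
build_checkpoint_cmd() {
    hours=$1
    ckpt_dir=$2
    sim_bin=$3
    # -i expects seconds, so convert hours to seconds (6 h -> 21600 s).
    seconds=$((hours * 3600))
    echo "dmtcp_checkpoint -b -i $seconds -c $ckpt_dir $sim_bin"
}

# Dry run: print the command instead of executing it.
build_checkpoint_cmd 6 ./checkpoints ./my_binary
```

The helper only prints the command, so it can be checked before anything is started. Paths containing spaces would need extra quoting before the printed line is run.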

To continue the checkpointed simulation, run either the generated DMTCP restart script with the corresponding ID or the restart script without an ID (dmtcp_restart_script.sh), which points to the most recently written checkpoint.
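
The choice between the two kinds of restart script can be sketched as a small helper that returns the script path for a given checkpoint directory and an optional ID. The function name pick_restart_script, the directory ./checkpoints and the value EXAMPLE_ID are hypothetical placeholders; real IDs are whatever DMTCP wrote into the file names.

```shell
#!/bin/sh
# Sketch only: pick_restart_script is a hypothetical helper, not a DMTCP tool.

# Print the path of the restart script to use.
# Arguments: <checkpoint-dir> [id]
pick_restart_script() {
    ckpt_dir=$1
    id=${2:-}
    if [ -n "$id" ]; then
        # A specific checkpoint was requested by its ID.
        echo "$ckpt_dir/dmtcp_restart_script_${id}.sh"
    else
        # No ID given: use the script that points to the last checkpoint.
        echo "$ckpt_dir/dmtcp_restart_script.sh"
    fi
}

# Dry run: show which script would be used in each case.
pick_restart_script ./checkpoints
pick_restart_script ./checkpoints EXAMPLE_ID
```

To actually restart, run the script whose path is printed, for example with sh "$(pick_restart_script ./checkpoints)".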

dmtcp_checkpoint forks a coordinator process on the same CPU, which might delay your application (e.g. while checkpoints are being written). DMTCP only supports serial and OpenMP-based applications (not MPI). There is no guarantee that every simulation can be restarted. If you encounter any problems, let us know.

Flag                 Description
--interval, -i       Time in seconds between automatic checkpoints.
--batch, -b          Enable batch mode, i.e. start the coordinator on the same node.
--gzip, --no-gzip    Enable/disable compression of checkpoint images (default: enabled).
--ckptdir, -c        Directory to store checkpoint images (default: ./).
-q                   Quiet processing (no additional messages).
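
To illustrate how the flags in this table combine, the sketch below assembles an option string using the long flag names and optionally adds --no-gzip and -q. The helper name build_flags and its yes/no arguments are hypothetical, and the sketch assumes that the long flags take their value as the next argument, in the same way as the short forms.

```shell
#!/bin/sh
# Sketch only: build_flags is a hypothetical helper, not a DMTCP tool.

# Assemble dmtcp_checkpoint options from the flags listed in the table.
# Arguments: <interval-seconds> <checkpoint-dir> <compress: yes|no> <quiet: yes|no>
build_flags() {
    interval=$1
    ckpt_dir=$2
    compress=$3
    quiet=$4
    flags="--batch --interval $interval --ckptdir $ckpt_dir"
    # Compression is enabled by default, so only add a flag to turn it off.
    if [ "$compress" = "no" ]; then
        flags="$flags --no-gzip"
    fi
    # Suppress additional messages if requested.
    if [ "$quiet" = "yes" ]; then
        flags="$flags -q"
    fi
    echo "$flags"
}

# Dry run: hourly checkpoints, uncompressed images, quiet output.
build_flags 3600 ./checkpoints no yes
```

The printed options go between dmtcp_checkpoint and the simulation binary.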