Checkpointing with openMosix + chpox

Checkpointing is an interesting and useful technique in high-performance computing, so here is a short introduction and an example of how to use chpox on openMosix.

openMosix is not suitable for high-availability projects, but sometimes people want some form of redundancy to make sure that the results of a long-running process are not lost in a crash, or when a student reboots a machine in a multifunctional PC lab.

With checkpointing in place, we can take a process, checkpoint it at regular intervals, and resume it from its last checkpoint in the unlikely event of a crash.

  1. install chpox: Unpack the chpox source and compile the chpox modules against the openMosix kernel running on the cluster nodes. (If you would like to test it, use clusterKNOPPIX.)

  2. install/insmod the chpox_mod into the running kernel

       insmod chpox_mod
       

  3. register processes + their needed libraries (can be done automatically)

    There are two flavors: the /proc interface or the chpox command-line tools. To register a process you just need to write to the /proc/chpox/register file, e.g.

    echo "[PID]:31:1:/tmp/proc-dump" > /proc/chpox/register
    The same registration can also be done with the "chpoxctl" utility:
     chpoxctl add [PID] 31 1 /tmp/proc-dump
    This registers the process [PID] so that it can be checkpointed.

  4. Add required libs for your process

    Do not forget to register the required libs for your process(es). Restoring the registered and checkpointed process will only work if you tell chpox which libraries are required for restoring, starting and running the process.

     chpoxctl addlib [filename]

  5. Checkpoint the processes

    To checkpoint a process, just use the "kill" command and send signal 31:

     kill -31 [PID]
    This will "dump" the current state of the process [PID] to the /tmp/proc-dump file, which is used later by the "restore".

  6. Restore processes

    To restore a registered process, just pick its latest checkpoint dump file and execute:

     ld-chpox [process-dump-file]
    ... and the process is running again.
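
Steps 3 to 5 can be collected into a small shell helper. The following is only a sketch and not part of chpox: the function names and the CHPOX_SIGNAL variable are made up for illustration, and it assumes that the libraries reported by ldd for the program binary are the ones chpox needs to have registered. Apart from chpox_checkpoint_every, the functions only print the commands from the steps above so that they can be reviewed before being run.

```shell
#!/bin/sh
# Sketch only. All helper names and CHPOX_SIGNAL are hypothetical,
# they are not provided by chpox itself.

CHPOX_SIGNAL=31   # signal number used in the registration example

# Build the string that step 3 writes to /proc/chpox/register.
# $1 = PID, $2 = dump file. The third field ("1") is copied verbatim
# from the registration example.
chpox_register_line() {
    printf '%s:%s:1:%s\n' "$1" "$CHPOX_SIGNAL" "$2"
}

# Read ldd output on stdin and print one "chpoxctl addlib" command for
# every shared library that resolves to an absolute path.
# Assumption: these are the libraries chpox has to know about.
chpox_addlib_commands() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print "chpoxctl addlib " $3; next }
        $1 ~ /^\//               { print "chpoxctl addlib " $1 }
    '
}

# Print the commands for steps 3 to 5 without running them.
# $1 = PID, $2 = path to the program binary, $3 = dump file
chpox_plan() {
    echo "echo \"$(chpox_register_line "$1" "$3")\" > /proc/chpox/register"
    ldd "$2" | chpox_addlib_commands
    echo "kill -$CHPOX_SIGNAL $1"
}

# Send the checkpoint signal to process $1 every $2 seconds for as long
# as the process exists. Only use this on a process that is already
# registered: without chpox handling it, the default action of the
# signal will most likely terminate the process.
chpox_checkpoint_every() {
    while kill -0 "$1" 2>/dev/null; do
        kill -"$CHPOX_SIGNAL" "$1" || break
        sleep "$2"
    done
}
```

For example, "chpox_plan 1234 /usr/local/bin/myjob /tmp/proc-dump" (PID and binary path are placeholders) prints the registration line, one addlib line per library and the kill command for that process. After reviewing and running them, "chpox_checkpoint_every 1234 600" would request a checkpoint every ten minutes.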

Checkpointing might be problematic for parallel applications that spawn and run processes on remote hosts. chpox is (currently?) limited to non-interactive applications. The chpox developers are working on support for sockets, shared memory, IPC and threads. http://www.cluster.kiev.ua/tasks/chpx_eng.html