Checkpointing is a useful technique in high-performance computing, so here is a short introduction to chpox on openMosix, with an example of how to use it.
openMosix is not suitable for high-availability projects, but sometimes people want some form of redundancy to make sure that the results of a long-running process are not lost in a crash, or when a student reboots a machine in a multifunctional PC lab.
With checkpointing in place, we can take a process, checkpoint it at regular intervals and, in the unlikely event of a crash, continue it from its last checkpoint.
Install chpox: unzip the chpox source and compile the chpox modules against the openMosix kernel running on the cluster nodes. (If you just want to test it, use clusterKNOPPIX.)
Load (insmod) chpox_mod into the running kernel.
Register the processes and the libraries they need (this can be done automatically).
There are two ways to register a process: through the /proc interface or with the chpox command-line tools. With the /proc interface, you just need to write to the /proc/chpox/register file, e.g.
echo "[PID]:31:1:/tmp/proc-dump" > /proc/chpox/register
With the command-line tools, the equivalent is:
chpoxctl add [PID] 31 1 /tmp/proc-dump
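The /proc registration can be wrapped in a small shell function so that a mistyped PID or path is caught before anything is written. This is only a sketch: chpox_register is a hypothetical helper, not part of chpox, and the fields "31" and "1" are copied verbatim from the example rather than derived from any chpox documentation.

```shell
#!/bin/sh
# Hypothetical helper (not part of chpox): build a registration line in the
# format of the echo example and write it to the register file.
# Usage: chpox_register PID DUMPFILE [REGISTER_FILE]
chpox_register() {
    pid=$1
    dumpfile=$2
    # Default target is the /proc file named in the text; it can be
    # overridden with an ordinary file for a dry run.
    register_file=${3:-/proc/chpox/register}

    # Refuse anything that is not a plain decimal PID.
    case $pid in
        ''|*[!0-9]*)
            echo "chpox_register: invalid PID: $pid" >&2
            return 1
            ;;
    esac

    # A dump file path is required.
    if [ -z "$dumpfile" ]; then
        echo "chpox_register: missing dump file" >&2
        return 1
    fi

    # Fields 2 and 3 ("31" and "1") are copied verbatim from the example.
    echo "${pid}:31:1:${dumpfile}" > "$register_file"
}
```

For instance, chpox_register 1234 /tmp/proc-dump /tmp/register-test writes the line 1234:31:1:/tmp/proc-dump to /tmp/register-test, which lets you check the format without touching the kernel interface.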
Add required libs for your process
Do not forget to register the required libs for your process(es). Restoring the registered and checkpointed process will only work if you tell chpox which libraries are required for restoring, starting and running the process.
chpoxctl addlib [filename]
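For a dynamically linked program, one way to collect the library list is to parse the output of ldd and feed each path to chpoxctl addlib. The sketch below assumes that the libraries reported by ldd are the ones chpox needs, which is not confirmed here. ldd_paths and register_libs are hypothetical helper names, and DRY_RUN is a variable invented for this example.

```shell
#!/bin/sh
# Hypothetical helpers (not part of chpox).

# Read `ldd` output on stdin and print one absolute library path per line.
ldd_paths() {
    awk '
        # Lines of the form: libfoo.so => /lib/libfoo.so (0x...)
        $2 == "=>" && $3 ~ /^\// { print $3; next }
        # Lines of the form: /lib/ld-linux.so.2 (0x...)
        $1 ~ /^\// { print $1 }
    '
}

# Register every shared library of BINARY with `chpoxctl addlib`.
# With DRY_RUN=1 the commands are printed instead of executed.
# Usage: register_libs /path/to/binary
register_libs() {
    binary=$1
    ldd "$binary" | ldd_paths | sort -u | while read -r lib; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "chpoxctl addlib $lib"
        else
            chpoxctl addlib "$lib"
        fi
    done
}
```

Running DRY_RUN=1 register_libs /path/to/binary first shows which chpoxctl addlib commands would be issued before anything is actually registered.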
Checkpoint the processes
To checkpoint a process, just use the "kill" command and send signal 31:
kill -31 [PID]
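To checkpoint at regular intervals, the kill command can be put in a loop. The following is a minimal sketch, not something shipped with chpox: checkpoint_every is a hypothetical function name, and the optional limit and signal arguments exist only to make the loop easy to try out. Only point it at a process that has been registered as described above, since an unregistered process will most likely just be terminated by signal 31.

```shell
#!/bin/sh
# Hypothetical helper (not part of chpox): send the checkpoint signal to a
# registered process at a fixed interval until the process exits.
# Usage: checkpoint_every PID INTERVAL_SECONDS [MAX_CHECKPOINTS] [SIGNAL]
checkpoint_every() {
    pid=$1
    interval=$2
    max=${3:-0}    # 0 means "no limit" (a convention of this sketch)
    sig=${4:-31}   # signal 31 is the one used in the kill example
    count=0

    # kill -0 delivers no signal; it only tests that the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        # Stop if the signal can no longer be delivered.
        kill "-$sig" "$pid" 2>/dev/null || break
        count=$((count + 1))
        if [ "$max" -gt 0 ] && [ "$count" -ge "$max" ]; then
            break
        fi
        sleep "$interval"
    done

    # Report how many checkpoint signals were sent.
    echo "$count"
}
```

For example, checkpoint_every [PID] 600 would send signal 31 every ten minutes for as long as the process is alive, and print the number of signals sent once it is gone.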
To restore a process, pick the latest checkpoint dump file of the registered process and execute:
Checkpointing can be problematic for parallel applications that spawn and run processes on remote hosts. chpox is (currently?) limited to non-interactive applications. The chpox developers are working on support for sockets, shared memory, IPC and threads. http://www.cluster.kiev.ua/tasks/chpx_eng.html