openMosix Internally

Let's have a look inside. openMosix consists of a mechanism called Preemptive Process Migration (PPM) and a set of algorithms for adaptive resource sharing. A process can be split into two parts: the system context, called the deputy, which stays on the process's home node, and the working part of the process, which migrates to a remote node in the cluster.

Users can run parallel applications by initiating multiple processes on one node and then letting the cluster assign these processes to the best available resources at that time. If new resources become available while a process is executing, the resource-sharing algorithms are designed to utilize them by possibly reassigning processes amongst the nodes. This capability to assign and reassign processes during runtime is particularly important for ease of use and for providing an efficient multi-user, time-sharing execution environment.
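As a concrete illustration of that usage pattern, here is a minimal sketch (not taken from any openMosix example) in which a program forks several CPU-bound workers on one node; because openMosix balances at the process level, each child is a candidate for migration to a less loaded node. The worker function and the counts are arbitrary.

    import os

    def burn(n):
        # CPU-bound busywork; a long-running process like this is
        # exactly what the load balancer can migrate.
        total = 0
        for i in range(n):
            total += i * i
        return total

    children = []
    for _ in range(4):                 # spawn four independent processes
        pid = os.fork()
        if pid == 0:                   # child: do the work, then exit
            burn(10_000_000)
            os._exit(0)
        children.append(pid)           # parent: remember the child PIDs

    for pid in children:
        os.waitpid(pid, 0)             # wait for all workers to finish
    print("all workers done")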

As we have seen, every process has a unique home node (UHN). The user context, called the remote, contains the program code, stack, data, memory maps and registers of the process. The system context, called the deputy, encapsulates the process when it is running in the kernel. While the remote can migrate many times between different nodes, the deputy never migrates.

Both parts of the process, deputy and remote, communicate with each other through a well-defined API between the user context and the system context. It is therefore possible to intercept every interaction between these contexts and forward it across the network. Remote (guest) processes are not accessible to the other processes that run on the same node (whether local or originating from other nodes), and vice versa: they do not belong to any particular user on the remote node where they run, nor can they be sent signals or otherwise manipulated by local processes.
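The following toy model (our own illustration, not openMosix code) mimics that interception. The migrated part never performs a system-context operation itself; it ships a request to the deputy, which executes it on the home node and returns the result. The operation name and the JSON framing are invented for readability.

    import json
    import os
    import socket

    def deputy(conn):
        # Home-node side: execute forwarded requests in the real
        # system context and send the results back.
        while True:
            raw = conn.recv(4096)
            if not raw:                # remote side has exited
                break
            req = json.loads(raw.decode())
            if req["op"] == "getpid":  # stand-in for any intercepted syscall
                reply = {"result": os.getpid()}
            else:
                reply = {"error": "unsupported operation"}
            conn.sendall(json.dumps(reply).encode())

    def remote(conn):
        # Migrated side: user context only; syscalls become messages.
        conn.sendall(json.dumps({"op": "getpid"}).encode())
        print("deputy answered:", json.loads(conn.recv(4096).decode()))

    home, away = socket.socketpair()
    if os.fork() == 0:                 # child plays the migrated remote part
        home.close()
        remote(away)
        os._exit(0)
    away.close()
    deputy(home)
    os.wait()                          # reap the child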

openMosix has no central control and no master/slave relationship between the nodes. Scalability is achieved through randomness in the system control algorithms: no node attempts to determine the overall state of the cluster or of any particular node. Instead, each node sends information about its available resources to a randomly chosen subset of other nodes at regular intervals.

At the same time, each node maintains a small window holding the most recently received information. The mathematical model behind the scheduling algorithm comes from the field of economic research. The crucial observation is that the available resources in a cluster of Linux computers are heterogeneous.
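A small sketch of this dissemination scheme, with invented constants (window size, fanout, number of intervals) standing in for whatever the kernel actually uses:

    import random

    WINDOW_SIZE = 8        # recent load reports each node keeps
    GOSSIP_FANOUT = 2      # size of the random subset informed per interval

    class Node:
        def __init__(self, name, load):
            self.name = name
            self.load = load       # locally measured available resources
            self.window = []       # most recently received remote reports

        def receive(self, report):
            self.window.append(report)
            self.window = self.window[-WINDOW_SIZE:]   # keep only the newest

        def gossip(self, cluster):
            # Tell a few random peers about our load; no node ever
            # builds a picture of the whole cluster.
            peers = random.sample([n for n in cluster if n is not self],
                                  GOSSIP_FANOUT)
            for peer in peers:
                peer.receive((self.name, self.load))

    cluster = [Node("node%d" % i, random.random()) for i in range(10)]
    for _ in range(5):             # five "regular intervals"
        for node in cluster:
            node.gossip(cluster)
    print(cluster[0].window)       # node0's partial, recent view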

In effect, the costs of memory, CPU, process communication and so on are incomparable; they are not even measured in the same units. The latest algorithm employed by openMosix is interesting because it tries to reconcile these differences based on economic principles and competitive analysis.

The key idea behind this strategy is to convert the total usage of several heterogeneous resources, such as memory, CPU and so on, into a single homogeneous "cost".

Jobs are then assigned to the machine where they have the lowest cost, just as in a market-oriented economy. openMosix continuously attempts to optimize the resource allocation, and thanks to its decentralized algorithms it scales to well over a thousand nodes.
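The pricing function below is a plausible stand-in for this idea, not the actual openMosix formula: each node charges a job according to the relative pressure the job puts on each of the node's free resources, so a node becomes expensive as any single resource nears exhaustion, and the job simply goes to the cheapest node.

    def cost(node, job):
        # Charge the job its relative share of each free resource and
        # sum the shares into one comparable number.
        cpu_share = job["cpu"] / node["cpu_free"]
        mem_share = job["mem"] / node["mem_free"]
        return cpu_share + mem_share

    nodes = [
        {"name": "fast-big",  "cpu_free": 8.0, "mem_free": 16.0},
        {"name": "slow-big",  "cpu_free": 2.0, "mem_free": 16.0},
        {"name": "fast-tiny", "cpu_free": 8.0, "mem_free": 1.0},
    ]
    job = {"cpu": 1.0, "mem": 2.0}      # arbitrary demand vector

    best = min(nodes, key=lambda n: cost(n, job))
    print("assign to", best["name"])    # the lowest single cost wins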

Load balancing is configurable at runtime through the /proc/hpc interface. For example, openMosix uses a "speed" variable that the openMosix kernel calculates at boot time relative to the speed of a 1 GHz Pentium. This value can easily be adjusted via /proc/hpc, the command-line utilities or the GUI, and it directly affects the load balancing of the whole cluster.
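As an illustration, a root user could read and override the node's speed rating with a few lines of Python. The exact file name (/proc/hpc/admin/speed below) is an assumption that may vary between openMosix versions, and the value written is an arbitrary example.

    SPEED_FILE = "/proc/hpc/admin/speed"   # assumed path; check your version

    with open(SPEED_FILE) as f:
        print("current speed rating:", f.read().strip())

    # Claim a higher rating so the balancer considers this node faster
    # and sends it more work; 15000 is an arbitrary example value.
    with open(SPEED_FILE, "w") as f:
        f.write("15000\n")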