Checkpoint/restart feature

From Nekochan

IRIX Checkpoint and Restart (CPR) is a facility for saving the state of running processes and resuming execution later from the point where the checkpoint was taken. This means that you can suspend a process while it's running and restart it at some later time at the exact spot where it was suspended, without having to start all over again.

It comes in handy if you have a computationally intensive job running and have to shut down the machine -- you can simply checkpoint the process, shut down the computer and restart the process the next time you power it on.

The default behaviour is that the process is killed after it is checkpointed, but you can also make a checkpoint and keep the process running, which is useful if you expect any power failures -- you can make hourly checkpoints, for example.

CPR can also be used to move processes to another machine.

The two programs that provide an interface to this system are cpr (a command line tool) and cview (graphical interface) -- for usage information please consult the manpages, which are listed below under "Further reading".
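As a rough illustration of how cpr is typically invoked, the sketch below only builds the command lines and prints them; it does not run anything. The flags are assumptions taken from memory of the cpr manpage (-c statefile -p pid to checkpoint, -g to let the process keep running afterwards, -r statefile to restart), so please verify them against the manpage before relying on them. The helper names, the statefile path and the PID are hypothetical.

```python
import shlex


def checkpoint_cmd(statefile, pid, keep_running=False):
    """Build the argv for checkpointing a single process by PID.

    Assumed flags (check the cpr manpage): -c names the statefile,
    -p names the process, -g lets the process continue afterwards.
    """
    argv = ["cpr", "-c", statefile, "-p", str(pid)]
    if keep_running:
        argv.append("-g")
    return argv


def restart_cmd(statefile):
    """Build the argv for restarting from a statefile (assumed flag: -r)."""
    return ["cpr", "-r", statefile]


if __name__ == "__main__":
    # Hypothetical statefile path and PID, for illustration only.
    print(shlex.join(checkpoint_cmd("/var/tmp/job.ckpt", 1234, keep_running=True)))
    print(shlex.join(restart_cmd("/var/tmp/job.ckpt")))
```

On an IRIX machine the printed lines could be pasted into a shell, or the keep-running variant could be called from cron to get the periodic checkpoints described above.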

This feature does have a few limitations, though: processes using sockets, SVR4 semaphores, MPI (because it uses sockets), X11 and some other things cannot be checkpointed. See the cpr manpage for a more detailed list. Contrary to popular opinion, the application you are trying to checkpoint does not need to be aware of CPR in order for it to work. CPR does, however, offer a C interface which can, among other things, execute functions when the program is checkpointed.

Further reading

Techpubs: cpr manpage

Techpubs: IRIX Checkpoint and Restart Operation Guide TOC