TORQUE is built on the principles of the Portable Batch System (PBS), which exists in two versions: OpenPBS (open source but no longer maintained) and PBS Pro, a paid-for product. TORQUE is also open source and derives from OpenPBS, though many years of development now separate the two. TORQUE is currently one of the better ways to manage cluster job queues, especially given that Sun Grid Engine is no longer free.
TORQUE PBS Server
The pbs_server process controls which nodes are assigned idle jobs from the queue once a machine (node) finishes a calculation and becomes free. It occasionally runs into problems, in which case restarting it is useful; its init script is usually installed at /etc/init.d/pbs_server.
Restart PBS Server by issuing, as root (or with sudo):
/etc/init.d/pbs_server restart
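As a rough sketch, the restart can be wrapped in a small helper that first checks the init script exists and afterwards asks the server for its status. The function name restart_pbs_server and the INIT_SCRIPT variable are made up for this example; qstat -B is the standard PBS query for batch server status and is assumed to be on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): restart pbs_server and
# confirm that it answers afterwards.
# INIT_SCRIPT defaults to the path quoted above; adjust for your install.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/pbs_server}"

restart_pbs_server() {
    # Bail out early if the init script is not where we expect it.
    if [ ! -x "$INIT_SCRIPT" ]; then
        echo "init script not found or not executable: $INIT_SCRIPT" >&2
        return 1
    fi

    # Restart the service; needs root privileges.
    "$INIT_SCRIPT" restart || return 1

    # Ask the batch server for its status. This should fail if the
    # server did not come back up.
    qstat -B
}
```

Call restart_pbs_server as root after sourcing the file. If qstat -B reports an error, the server probably did not come back and its logs are the next place to look.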
The process can usually be restarted without issue. There are two options for how it stops, defined in TORQUE's config file:
- delay
- quick
If PBS_SERVER_STOP is set to "quick", running jobs are left to run without interaction until the server is back up. If it is set to "delay", the jobs are checkpointed or rerun, or pbs_server waits for them to finish before the service restarts.
If you're trying to sort out blocked resources, it is recommended to use "quick".
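The same two shutdown types can also be requested by hand through TORQUE's qterm command, which takes them via its -t option. The sketch below only builds and prints the command so it can be reviewed before running; the function name pbs_stop_command is made up for this example, and it assumes qterm is installed alongside pbs_server.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): print the qterm invocation
# for one of the two shutdown types discussed above.
pbs_stop_command() {
    mode="$1"
    case "$mode" in
        quick|delay)
            # qterm -t <type> asks pbs_server to shut down in that mode.
            echo "qterm -t $mode"
            ;;
        *)
            echo "unknown shutdown type: $mode" >&2
            return 1
            ;;
    esac
}
```

For example, pbs_stop_command quick prints qterm -t quick. Run the printed command as root, then start pbs_server again with the init script.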