Cluster - TORQUE Resource Manager Tutorial

The TORQUE resource manager is a complex piece of software, so a few quick tips can help you run it without downtime or lost jobs for your users. This page gathers some of the most important things to know about using TORQUE and will be expanded to cover the other important topics around this cluster management component.

TORQUE is built on the principles of the Portable Batch System (PBS), which exists in two versions: OpenPBS (open source but no longer maintained) and PBS Pro, a commercial product. TORQUE is also open source and derives from OpenPBS, though many years of development now separate the two. TORQUE is currently one of the better ways to manage cluster job queues, especially given that Sun Grid Engine is no longer free.

TORQUE PBS Server

The pbs_server process decides which nodes are assigned idle jobs from the queue once a machine (node) finishes a calculation and becomes free. It can occasionally run into problems, in which case it is useful to restart it. Its init script is usually installed at /etc/init.d/pbs_server.

Restart the PBS server by issuing, as root (or with sudo):

/etc/init.d/pbs_server restart
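
If you restart the server from a script, it can help to guard the call. The following is only a minimal sketch: the function name restart_pbs_server and the INIT_SCRIPT variable are made up for illustration, and the path is simply the one quoted above, which may differ on your distribution.

```shell
#!/bin/sh
# Path of the init script as quoted in this post; override it if yours differs.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/pbs_server}"

# Hypothetical helper: restart pbs_server only if the init script is present.
restart_pbs_server() {
    if [ ! -x "$INIT_SCRIPT" ]; then
        echo "init script not found or not executable: $INIT_SCRIPT" >&2
        return 1
    fi
    if [ "$(id -u)" -eq 0 ]; then
        # Already root: call the init script directly.
        "$INIT_SCRIPT" restart
    else
        # Not root: go through sudo, as suggested above.
        sudo "$INIT_SCRIPT" restart
    fi
}
```

Calling restart_pbs_server then either restarts the service or returns 1 with a message when the script is missing, instead of failing with a less obvious error.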

The process can usually be restarted without issue. There are two options for how it is stopped, defined in TORQUE's config file:
  • delay
  • quick
If PBS_SERVER_STOP is set to "quick", running jobs are left to run without interaction until the server is back up. If it is set to "delay", jobs are checkpointed or rerun, or pbs_server waits for them to finish before the service restarts.
If you're trying to sort out blocked resources, "quick" is recommended.
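
The same two modes can also be requested by hand through TORQUE's qterm command and its -t option. How PBS_SERVER_STOP is actually read depends on your init script, which is not shown here, so treat the sketch below as an illustration only. The stop_pbs_server function and the DRY_RUN variable are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Hypothetical helper: validate the stop mode, then shut pbs_server down via qterm.
# With DRY_RUN set to a non-empty value, the command is printed instead of run.
stop_pbs_server() {
    mode="${1:-quick}"
    case "$mode" in
        quick|delay) ;;
        *)
            echo "unknown stop mode: $mode (expected quick or delay)" >&2
            return 2
            ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        echo "qterm -t $mode"
    else
        qterm -t "$mode"
    fi
}
```

For example, stop_pbs_server quick leaves running jobs alone while the server goes down. The server then has to be started again, for instance with the init script's start action.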
