Grid Engine (GE) is a cluster queueing and scheduling system for handling simulation jobs. There is also a graphical user interface called qmon.

The sample scripts should work on all clusters running GE; just take into account the different queues, parallel environments and limitations described here.

To see all nodes and jobs of all users just enter:

qstat -f -u \* 

queuename qtype resv/used/tot. load_avg arch states 
--------------------------------------------------------------------------------- 
scc@scc005.cluster BIP 0/8/8 7.97 lx26-amd64
 4194 0.61000 al0.02_Bx0 gerlach r 02/15/2010 17:56:56 8
---------------------------------------------------------------------------------
... 

Here you see a job named "al0.02_Bx0..." with the Job-ID 4194, started on Feb. 15, 2010 by the user "gerlach", running on the node scc005 in the queue scc and using 8 slots (cores). This job uses all (8 of 8) slots and causes a load of almost 8 on this node.

The state of a job can be w(aiting), r(unning) or s(uspended). A waiting job is waiting for free resources (slots); a suspended job has been stopped for some reason and will continue sooner or later.

A list of all available queues is shown by:

qconf -sql

Jobs can be submitted to the queuing system by writing a job script named e.g. job.sh and using:

qsub job.sh 

If your job is rejected with "no suitable queues", please use the option "-w v" to see why qsub rejects your job.

Of course you can also delete (only your own) jobs. To delete the job with the Job-ID 123, just enter:

qdel 123

To start an interactive session with GE you can use qlogin, for example

qlogin -q scc

will start an interactive session on any node of the scc queue.

You can still change properties of a job after using qsub with qalter:

qalter -q long <Job-ID>

More options of all GE commands can be found in the man pages (man qstat, etc.).

Job scripts are just normal shell scripts with special GE options beginning with #$. All GE options can be passed either on the command line of qsub or in the job script.

Warning: If you write your job-script on Windows or Mac make sure the script has the correct text format. If you use "file job.sh" and see "job.sh: ASCII text, with CRLF line terminators" your script has the wrong (Windows) format. Use "dos2unix job.sh" to convert it to Linux format.
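If dos2unix is not available, stripping the carriage-return characters with tr works as well. A small sketch (the file names here are just examples):

```shell
#!/bin/bash
# Create a small script with Windows (CRLF) line endings for demonstration:
printf 'echo hello\r\n' > job.sh

# Remove the carriage returns to get Unix (LF) line endings:
tr -d '\r' < job.sh > job_fixed.sh

CONTENT=$(cat job_fixed.sh)
echo "$CONTENT"
```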

Here is a first example:

#!/bin/bash
#$ -N sim        # the name of the job

### uncomment to load certain modules
# module load intel mkl gsl

./simulation param.dat

This job script tells GE to start a job named "sim" in the default queues and to run it in the current working directory (the default). The output (stderr and stdout) is written to the files [job-name].o[jobid] and [job-name].e[jobid] in the current directory. You can change this by using the options -e/-o.
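A sketch of a job script using these output options (the directory name Logs/ is just an example and must exist before the job is submitted):

```shell
#!/bin/bash
# job name
#$ -N sim
# write the [job-name].o[jobid] and [job-name].e[jobid] files
# into Logs/ instead of the current directory
#$ -o Logs/
#$ -e Logs/

./simulation param.dat
```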

You can also receive e-mails when a job begins, ends or is aborted with the option "-m bea", using the option -M to set the e-mail address.

If you want the job to run only on dedicated nodes (which is normally not necessary) you can specify this, for instance, with "-q scc@scc100,scc@scc101". The "-q" option supports wildcards ("*") and logical expressions, so you can also exclude certain nodes ("-q scc&!scc066"). For more options check out the man page of qsub.

In the last line of the script the program simulation is called with the parameter file "param.dat". Replace these with your own program and parameter file.

Option                   Example                      Explanation
-cwd                     (default option)             Run job in the current working directory
-e/-o path               -e Logs/                     Save stderr (-e) or stdout (-o) in the given path. The directory must exist before the job is started!
-j y|n                   -j y                         Merge stdout and stderr
-l resource=value        -l h_rt=24:00:00,h_vmem=2G   Request resources (for details see here)
-m b|e|a|s|n             -m bea                       Send mail when the job b(egins), e(nds), is a(borted) or s(uspended); n(ot at all)
-M user[@host]           -M fritz                     Specify mail address for sending mails
-N name                  -N job                       Specify the name of the job
-p priority              -p -100                      Specify the priority of the job (-1023..1024, default: 0)
-pe parallel_env nslots  -pe mpi 8                    Use a parallel environment and reserve nslots slots
-q queue_list            -q long,all@host1            Specify queues for the job
-r y|n                   -r y                         Restart the job if aborted
-sync y|n                -sync y                      With -sync y, qsub waits until the job finishes
-t n[-m[:s]]             -t 1-10:2                    Start an array job (see below)
-v var[=value]           -v NUM=1                     Set an environment variable for the job
-hold_jid JID            -hold_jid 1234               Job dependency: start the job only after the job with the given job ID has completed
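Several of these options are typically combined in one job script. A sketch (the run time, memory and mail address are just example values):

```shell
#!/bin/bash
# job name
#$ -N long-sim
# request 24 hours of run time and 2 GB of memory (per slot!)
#$ -l h_rt=24:00:00
#$ -l h_vmem=2G
# merge stdout and stderr into one output file
#$ -j y
# send mail when the job begins, ends or is aborted
#$ -m bea
#$ -M fritz

./simulation param.dat
```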

If you have multiple similar jobs you should use so-called array jobs instead of many single jobs. This can be done with the GE parameter "-t" and the variable SGE_TASK_ID, for example:

#$ -t 1-100
./simulation $SGE_TASK_ID

This runs 100 jobs each with a different SGE_TASK_ID (here: SGE_TASK_ID = 1,2,3,...,100).

You can also specify other step sizes like "-t 1-100:2" running 50 jobs (SGE_TASK_ID=1,3,5,...,99). The step size is saved in SGE_STEP_SIZE and the first and last value in SGE_TASK_FIRST and SGE_TASK_LAST.
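Inside the script the task ID is typically mapped to an actual simulation parameter. A sketch (the step of 0.02 and the meaning of the parameter are just assumptions):

```shell
#!/bin/bash
#$ -t 1-50

# GE sets SGE_TASK_ID for every task; fall back to 1 so the script
# can also be tested outside of GE:
SGE_TASK_ID=${SGE_TASK_ID:-1}

# map task IDs 1,2,3,... to parameters 0.02, 0.04, 0.06, ...
PARAM=$(awk -v i="$SGE_TASK_ID" 'BEGIN { printf "%.2f", i * 0.02 }')
echo "task $SGE_TASK_ID runs with parameter $PARAM"

# ./simulation "$PARAM"   # hypothetical call of the simulation binary
```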

Single tasks of an array job can be deleted using:

qdel <Job-ID> -t <Task-ID>

You can limit the number of concurrent running tasks by using the "-tc" parameter. So

#$ -tc 5

would allow only 5 running tasks of an array job at a time.

For running jobs that use multiple cores (so-called parallel jobs) you need to use pre-defined parallel environments (specified with -pe) and let the queuing system reserve the slots for you. Normally there are two parallel environments: one for shared-memory jobs (called smp or openmp) and one for jobs using distributed memory (called mpi). The memory requested (with "-l h_vmem") counts per slot!

An SMP job may look like:

#!/bin/bash
#$ -N smp-job
#$ -pe smp 8

./program

This job reserves 8 slots (cores) in the parallel environment "smp". The number of threads used by OpenMP is automatically set to the number of slots. If you want to change it (for instance to benefit from Hyper-Threading), redefine the variable OMP_NUM_THREADS. You can also specify ranges with "-pe smp 4-20" to get at least 4 and at most 20 cores, depending on availability.
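A sketch of such an override (the factor of 2 per slot is just an assumption, suitable for a machine with two hardware threads per core):

```shell
#!/bin/bash
#$ -N smp-job
#$ -pe smp 8

# GE sets NSLOTS to the number of reserved slots; fall back to 8 so the
# script can also be tested outside of GE:
NSLOTS=${NSLOTS:-8}

# run two OpenMP threads per reserved slot (e.g. for Hyper-Threading):
export OMP_NUM_THREADS=$((2 * NSLOTS))
echo "running with $OMP_NUM_THREADS OpenMP threads"

# ./program   # the OpenMP binary from the example above
```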

An MPI job may look like

#!/bin/bash
#$ -N mpi-job
#$ -pe mpi 8

### uncomment for debugging
# echo "Got $NSLOTS processors."
# echo "Running on:"
# cat $TMPDIR/machines

### using openmpi module (with GE integration)
module load openmpi
mpirun -v --bind-to none ./program

### alternative for other MPI implementations (use only one mpirun line!)
# mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines ./program

This job reserves 8 slots in the parallel environment mpi. The machine file needed by MPI to distribute the tasks is automatically created by GE in

$TMPDIR/machines

and can be used by mpirun (option -machinefile). Again the number of slots is available with

 $NSLOTS

and is set by mpirun automatically (option -np). To allow mpirun to use the core binding from GE, you should always use "--bind-to none".

If your job fails and gets into an error state ("Eqw"):

796 0.60750 M19M2_C4 user Eqw 02/01/2010 10:55:48 8

please check the output of

qstat -j <jobid>

... 
error reason 1: 02/05/2010 16:26:09 [2099:17555]: error: can't chdir to /data/hydra/user/M19M2: No such file or d
...


Here it shows that a directory was missing. If you can't figure it out, please contact us. We have seen a lot of errors before :-)

After a job is finished you can check with

qacct -j <jobid>

which resources (run time, memory, etc.) were used. This can be useful when planning further jobs.