Brief Sun Grid Engine Guide

What and Why

The Sun Grid Engine (SGE) is a software package to facilitate "grid" computing. Ignoring the buzzwords, the Duke Math SGE cluster is a set of systems provided for doing large computations. The SGE software sets up a queue of waiting jobs that ensures your job will get an idle processor without you having to fuss or step on anyone's toes.

The actual machines are called grid1, grid2, ..., grid16, with a "head node" called grid (no number). Each machine has a dual-core processor and can run two jobs at full speed.

The SGE works better when everyone using the grid machines uses the job queue. Toward this end, I've written some scripts to make it easy to submit jobs -- nearly as easy as just running them locally. In order to do this, I've papered over almost all of the neat features of the SGE described here, such as email notification when a job finishes, parallel computing, and so on.

These scripts and examples assume that you and your data are on the Math computers' NFS. If you're on your own machine, you'll have to work harder.


How

  1. Download the provided shell scripts: grid-scripts.tar.gz
  2. Unpack them into a directory in your $PATH. (If you don't understand, see the shell guide for more info). For example:
    $ mkdir -p ~/.local/bin
    $ tar -xzvf grid-scripts.tar.gz
    $ cd grid
    $ mv -t ~/.local/bin/ myqsub qsub myqdel qdel myqstat qstat mlab
    $ chmod u+x ~/.local/bin/{myqsub,qsub,myqdel,qdel,myqstat,qstat,mlab}
    # bash users:
    $ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc; source ~/.bashrc
    # tcsh users:
    $ echo 'setenv PATH "${HOME}/.local/bin:${PATH}"' >> ~/.tcshrc; source ~/.tcshrc
  3. Submit a job with "qsub [exename]".
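
Before submitting anything, it can help to confirm that step 2 worked. The following is a minimal sketch, not part of the provided scripts: check_on_path is a hypothetical helper name, and it only uses the standard "command -v" lookup to report which commands the shell can find.

```shell
# Report which of the named commands the shell can find on $PATH.
# Returns non-zero if any of them is missing.
# (check_on_path is a hypothetical helper, not one of the grid scripts.)
check_on_path() {
  status=0
  for cmd in "$@"; do
    if location=$(command -v "$cmd"); then
      echo "found: $cmd -> $location"
    else
      echo "missing: $cmd"
      status=1
    fi
  done
  return "$status"
}

# Example call, using the script names from step 2:
# check_on_path myqsub qsub myqdel qdel myqstat qstat mlab
```

If any script is reported as missing, open a new terminal (so the updated startup file is read) and try again.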


Commands

There are three commands for using the SGE:

  qsub   submits a job
  qstat  shows the status of jobs in the queue
  qdel   cancels a submitted job

The use of qsub is foundational. qstat is used to see who is hogging all of the nodes, to see if your jobs have started, and to find out the job number of jobs you've submitted. qstat reports the state of each job as well; for example, "qw" means the job is queued and waiting to run. qdel is used to cancel submitted jobs (you know there's a bug, or you don't need the results anymore).
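
If you only care about one job, you can pull its state out of the qstat listing. This is a rough sketch that assumes the usual qstat column layout (job-ID, prior, name, user, state, ...) and job and user names without spaces; job_state is a hypothetical helper name, not one of the provided scripts.

```shell
# Read qstat output on stdin and print the state (column 5) of job JOBID.
# Assumes the column order: job-ID, prior, name, user, state, ...
# (job_state is a hypothetical helper, not one of the grid scripts.)
job_state() {
  awk -v id="$1" '$1 == id { print $5 }'
}

# Example call:
# qstat | job_state 12345
```

The function prints nothing if the job number does not appear in the listing.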

Examples

Basic example: You want to run the executable "longrun" on the grid.

me@tux$ qsub longrun
Your job 12345 ("longrun.job") has been submitted.
Connection to grid closed.
Your job will be assigned a number (say "12345"), and the output will appear in "longrun.job.o12345". You can check the status of your job with
me@tux$ qstat
job-ID  prior   name       user         state submit/start at     queue    slots ja-task-ID
-------------------------------------------------------------------------------------------
 12345  0.00000 longrun.jo me           qw    04/22/2008 14:48:27              1
Connection to grid closed.
Now you want to stop your job, so you do
me@tux$ qdel 12345
me has deleted job 12345
Connection to grid closed.
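
If you would rather wait for a job than check on it by hand, a polling loop is one option. The sketch below relies on the fact that a job disappears from the qstat listing once it finishes or is deleted. The names job_listed and wait_for_job are hypothetical, as is the 60-second default interval; none of this is part of the provided scripts.

```shell
# Read qstat output on stdin; succeed if JOBID appears in the first column.
# (job_listed and wait_for_job are hypothetical helpers.)
job_listed() {
  awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Poll qstat until the job no longer appears in the listing.
# Usage: wait_for_job JOBID [SECONDS_BETWEEN_CHECKS]
wait_for_job() {
  while qstat | job_listed "$1"; do
    sleep "${2:-60}"
  done
}

# Example call:
# wait_for_job 12345 && echo "job 12345 has left the queue"
```

Keep the interval generous: when run from your own desktop, every check opens a connection to grid.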

A Matlab example: You have a function my_func() defined in the m-file "my_func.m" that you'd like to run on the grid, passing in parameters 1 and 2.

me@tux$ qsub mlab -x "my_func(1,2)"
If you have a script instead of a function, you can do
me@tux$ qsub mlab my_script.m

A simple parameter study: Suppose my_prog can read in the time-step from a command-line argument. Normally, you run

me@tux$ ./my_prog -tstep 0.1
to have a time-step of 0.1. To run this on the grid, you can do
me@tux$ qsub ./my_prog -tstep 0.1
Suppose you want to verify that my_prog is O(\delta t^2) accurate. You want to do several runs with different time-steps and check the results (plotting on a log scale, using Aitken extrapolation, etc). You could do
me@tux$ for tstep in 0.1 0.05 0.025; do qsub ./my_prog -tstep $tstep; done
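
Once the runs have finished, the accuracy check itself is a short calculation. If err(dt) is the error of a run with time-step dt (measured against a reference solution that you supply), then for a method of order p the ratio err(dt)/err(dt/2) is roughly 2^p. The sketch below computes p from two such errors; observed_order is a hypothetical helper name, and the error values in the example are made up for illustration.

```shell
# Estimate the observed order of accuracy from the errors of two runs
# whose time-steps differ by a factor of two:
#   p = log(err_coarse / err_fine) / log(2)
# (observed_order is a hypothetical helper, not one of the grid scripts.)
observed_order() {
  awk -v coarse="$1" -v fine="$2" \
    'BEGIN { printf "%.2f\n", log(coarse / fine) / log(2) }'
}

# Example call with made-up errors for dt = 0.1 and dt = 0.05:
# observed_order 0.04 0.01
```

With the made-up errors above the result is 2.00, which is what a second-order method should give when the error drops by a factor of four each time the time-step is halved.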

Details

The scripts qsub, qdel, and qstat are wrappers that check if you are logged into "grid" or not. If you are, they run the native SGE versions of themselves. Otherwise, they run the "my" version. The "my" versions mostly just use ssh command execution to run the corresponding SGE command on grid, then log out. qsub is the exception, since it actually writes job files. myqsub is still useful on grid, since it saves you from writing a shell script.
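
To make the idea concrete, here is a rough sketch of that dispatch logic. It is not the contents of the provided scripts: the function names are hypothetical, and it assumes the head node's hostname is "grid", possibly followed by a domain.

```shell
# Hypothetical sketch of the wrapper logic; the real scripts may differ.
HEAD_NODE=grid

# Succeed if the given hostname is the head node. Any domain part is
# stripped first, so a fully qualified name also matches.
on_head_node() {
  [ "${1%%.*}" = "$HEAD_NODE" ]
}

# Run a command directly when on the head node, otherwise over ssh.
# A real wrapper named qstat would have to call the native qstat by its
# full path here, so that it does not end up calling itself.
grid_run() {
  if on_head_node "$(hostname)"; then
    "$@"
  else
    ssh "$HEAD_NODE" "$@"
  fi
}

# Example call:
# grid_run qstat
```

The "Connection to grid closed." line in the examples above suggests that this is the path taken when you run the commands from your own desktop.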

A "bonus" program -- mlab -- is included to facilitate Matlab computation on the SGE. You can't plot things while running scripts with mlab. Rather, save your data with "save" and load and plot it in Matlab after your job has finished.

See the README for more information.


Other Resources

Brian O'Meara has written a collection of similar scripts that includes the ability to create your own personal queue -- handy if you have a lot of small jobs to submit and don't want to hog all of the nodes. These scripts assume an 88-node grid, so they require a bit of modification to work here.




By Michael Gratton (04/17/2008) mgratton@math.duke.edu