Saturday, October 30, 2010

Sun Grid Engine

http://gridengine.sunsource.net/
Sun Grid Engine (SGE), now Oracle Grid Engine, previously known as CODINE (COmputing in DIstributed Networked Environments) or GRD (Global Resource Director), is an open source batch-queuing system, developed and supported by Sun Microsystems. Sun also sells a commercial product based on SGE, also known as N1 Grid Engine (N1GE).
http://www.cbi.utsa.edu/sge_tutorial
The Sun Grid Engine is a queue and scheduler that accepts jobs and runs them on the cluster for the user. Three types of jobs are available: interactive, batch, and parallel.


In this example, we will run a MATLAB script as a batch job:

  1. Create a directory to hold your job file and any associated data (MATLAB scripts, etc.).
  2. Open a new file; in this case we will call it matlab-test.job:
    #!/bin/bash
    # The name of the job; it can be anything and is only shown in the list of running jobs
    #$ -N matlab-test
    # The name of the output log file
    #$ -o matlabTest.log
    # Combine output and error messages into one file
    #$ -j y
    # Tell the queue system to use the current directory as the working directory;
    # otherwise the script may fail because it will execute in your top-level home directory, /home/username
    #$ -cwd
    # Now come the commands to be executed
    /share/apps/matlab/bin/matlab -nodisplay -nodesktop -nojvm -r matlab-test
    # Note: what follows -r is not the file name of the m-file but the name of the routine
    exit 0
  3. Save this job script and submit it to the queue with “qsub matlab-test.job”.
  4. Check the status of your job with “qstat”, which returns a list of your running and queued jobs.
    When the job has completed, you can check its output in the file named above, matlabTest.log.
    NOTE: You may see the following in the output:
    “Warning: no access to tty (Bad File descriptor).
    Thus no job control in this shell.”
    This is normal and can be ignored. In the case of MATLAB, you may also see a message about shopt; again, this is normal and can be ignored.
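
If these harmless warnings clutter the log, they can be filtered out when viewing it. The following is only a minimal sketch: the helper name show_log is made up for illustration, and the patterns assume the warning text matches what is quoted above.

```shell
#!/bin/bash
# show_log (hypothetical helper): print a job log without the two
# harmless "no tty" warning lines quoted in the note above.
show_log() {
    grep -v -e 'Warning: no access to tty' \
            -e 'Thus no job control in this shell' "$1"
}
```

With the function defined, “show_log matlabTest.log” prints everything in the log except those two lines.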

Attached are the sample job and MATLAB script.

Running Parallel Jobs with SGE:

A parallel job is a single job that runs on many nodes in an interconnected fashion, generally using MPI to communicate between the individual processes. If you are running the same program on the cluster as you would on your desktop, chances are you want a serial job, not a parallel job. Parallel jobs are generally only for specially designed programs that work only on machines with cluster management software installed.

Also, not just any program can run in parallel; it must be written as a parallel program and compiled against a particular MPI library. In this case we build a simple program that passes a message between processes and compile it against OpenMPI, the main MPI library of the cluster.
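
The compile step itself is not shown in this post. As a rough sketch only, assuming the source file is called mpi-ring.c (a hypothetical name) and that OpenMPI's mpicc compiler wrapper sits next to mpirun in /opt/openmpi/bin (also an assumption), building the program might look like this:

```shell
#!/bin/bash
# build_mpi_ring (hypothetical helper): compile mpi-ring.c with OpenMPI's
# mpicc wrapper. The compiler path and the source file name are assumptions;
# set MPICC to override the path.
build_mpi_ring() {
    cc="${MPICC:-/opt/openmpi/bin/mpicc}"
    if [ ! -x "$cc" ]; then
        echo "mpicc not found at $cc; set MPICC to its location" >&2
        return 1
    fi
    # Produces the mpi-ring executable that the job script starts with mpirun
    "$cc" -o mpi-ring mpi-ring.c
}
```

Run “build_mpi_ring” in the job directory before submitting, so that the mpi-ring executable is in place when the job starts.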

  1. Like the batch job, create a directory to hold this job and related files
  2. Open a new file and create the job script:
    #!/bin/bash
    #$ -N openmpi-test
    # Here we tell the queue that we want the orte parallel environment and request between 5 and 10 slots
    # This option takes the following form: -pe nameOfEnv min-max
    # where min and max are the minimum and maximum number of slots you request
    #$ -pe orte 5-10
    #$ -cwd
    #$ -j y
    /opt/openmpi/bin/mpirun -n $NSLOTS mpi-ring
    exit 0
  3. As above, use qsub to submit the job and qstat to check on it.
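
Because the request is a range, the number of slots actually granted is only known at run time. Grid Engine passes it to the job in the NSLOTS environment variable, which is why the script uses $NSLOTS rather than a fixed number. Here is a small sketch, with a made-up helper name, that also lets the script be tried by hand outside the queue, where NSLOTS is not set:

```shell
#!/bin/bash
# granted_slots (hypothetical helper): print the slot count Grid Engine
# granted via NSLOTS, or 1 when the script runs outside the queue.
granted_slots() {
    echo "${NSLOTS:-1}"
}

echo "Running with $(granted_slots) slot(s)"
```

In the job script, the mpirun line could then read: /opt/openmpi/bin/mpirun -n "$(granted_slots)" mpi-ring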

Notes:

There are a few queue commands to know:
  1. To list all running jobs for all users: “qstat -u \*”
  2. To list all running jobs per node: “qstat -u \* -f”
  3. To delete a job: “qdel jobID”
  4. To list any queue messages: “qstat -j”
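
qdel needs a job ID, and qstat -j can take one to narrow its output to a single job. qsub prints the ID in its confirmation line, which typically has the form: Your job 12345 ("matlab-test") has been submitted. The sketch below assumes that format; the helper name is made up, and array jobs print a different line that it does not handle.

```shell
#!/bin/bash
# job_id_from_qsub (hypothetical helper): read qsub's confirmation line on
# stdin and print just the numeric job ID. Assumes the line has the form
#   Your job 12345 ("matlab-test") has been submitted
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}
```

On the cluster this could be used as follows:

    id=$(qsub matlab-test.job | job_id_from_qsub)
    qstat -j "$id"
    qdel "$id"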
