
Running "Services on Sun Grid"

Posted by dhushon on March 20, 2006 at 10:07 AM PST

Sun Grid's resource management semantics dictate that a job be self-contained and that all of its processes terminate before the job can exit. Terminating processes in a grid context is not as simple as doing a PID trap on a single host. Instead, you need to use the qsub, qstat and qdel commands to manage your distributed jobs.
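For contrast, here is roughly what a PID trap looks like on a single host. This is only a sketch and not part of the listing further down: `sleep 300` stands in for a real server process, and the names `start_local_server` and `cleanup` are made up for illustration. On the grid the server processes live on other nodes, so the `kill` has to become a `qdel` against a job id.

```shell
#!/bin/bash
# Single-host cleanup: remember the server's PID and kill it on exit.
# "sleep 300" is a stand-in for a real server process (an assumption).

start_local_server() {
    sleep 300 &
    SERVER_PID=$!
}

cleanup() {
    # Kill the background server if it is still alive, then reap it.
    if kill -0 "$SERVER_PID" 2>/dev/null; then
        kill "$SERVER_PID"
        wait "$SERVER_PID" 2>/dev/null || true
    fi
}

start_local_server
trap cleanup EXIT   # runs cleanup whenever the script exits
echo "server running as PID $SERVER_PID"
```

The trap only works because the script and the server share a process table, which is exactly what a distributed job does not have.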

The pattern that I'd like to elaborate on is a “server/framework” that needs to run in order to support a client. Whether the server is a simple RMID or a more complex instance of a web server, app server or JavaSpace, the pattern is very similar. The developer wants to:

  1. Start up one or more servers (in our case 2, the httpd and the GigaSpaces Enterprise Server)
  2. Make sure that the servers are running
  3. Submit the client and wait for the client to complete
  4. Shut down the servers so that the Sun Grid job can terminate and stop the meter
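The four steps can be sketched as a skeleton before any grid commands are filled in. Everything here is a placeholder I made up for illustration (`start_servers`, `servers_ready`, `run_client`, `stop_servers`, `run_job`, and the `LOG` array); real versions would call qsub, qstat and qdel.

```shell
#!/bin/bash
# Skeleton of the four steps. Every function body is a placeholder
# (an assumption for illustration) that just records that it ran.

LOG=()

start_servers() { LOG+=("start"); }
servers_ready() { LOG+=("check"); return 0; }
run_client()    { LOG+=("client"); }
stop_servers()  { LOG+=("stop"); }

run_job() {
    start_servers              # step 1: start the servers
    until servers_ready; do    # step 2: poll until they are running
        sleep 10
    done
    local rc=0
    run_client "$@" || rc=$?   # step 3: run the client, remember its status
    stop_servers               # step 4: shut down even if the client failed
    return $rc
}

run_job
echo "${LOG[*]}"
```

Capturing the client's exit status before calling `stop_servers` matters on a metered grid: a failed client should still release the server nodes.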

First, some basic syntax:

  • #$ = new directives for SGE, embedded in the script, which do things like populate environment variables (-V)
  • qsub = submit this task to the grid for scheduling. We use a couple of options:
      • “-sync n” = fire and forget: qsub returns as soon as the job is submitted instead of waiting for it to complete
      • “-N <jobname>” = not required, but could be used for parsing qstat output. Unfortunately, qdel requires a job id instead of a job name (to keep you from shutting down similarly named jobs)
      • “-t 1” or “-t 1-4:1” = submit a job to one or multiple nodes; the range is written first-last:step
  • qstat = get the status of the SGE queue, which in the case of Sun Grid will only return the jobs that you own, for privacy purposes
      • “-s r” = only return the running jobs; jobs that are waiting (status “qw”) are excluded
  • qdel = delete / stop the specified jobs
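To make the directive syntax concrete, here is a small hypothetical job script. The job name `demo-array` and the helper `task_label` are made up for this example. Lines starting with `#$` are read by qsub as if the options had been given on the command line, and `JOB_ID` and `SGE_TASK_ID` are environment variables that SGE sets for each task. The fallback values only apply when the script runs outside the grid.

```shell
#!/bin/bash
#$ -V
#$ -N demo-array
#$ -t 1-4:1

# SGE sets JOB_ID and SGE_TASK_ID in the environment of each task;
# the "none" defaults are only used when running outside the grid.
task_label() {
    echo "task ${SGE_TASK_ID:-none} of job ${JOB_ID:-none}"
}

task_label
```

Submitted with qsub, a script like this would be scheduled as four tasks, each seeing its own `SGE_TASK_ID`.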

Now onto the listing:

#!/bin/bash
#$ -V

# if we are running against an older version of SGE, the "#$ -V" directive
# will not exist, so be sure that we source the SGETOOLS (or at least try to)
if [[ ${SGETOOLS:-"unset"} = "unset" ]]
then
    echo "setting SGETOOLS"
    # ... source the SGE tools settings here ...
fi

echo "Starting the GigaSpaces Servers"

GSC=$(qsub -sync n -N gsee-gsc -v GSEE_HOME=$GSEE_HOME -v GRID_HOME=$GRID_HOME -t 1-4:1 $GRID_HOME/bin/gsc)
GSM=$(qsub -sync n -N gsee-gsm -v GSEE_HOME=$GSEE_HOME -v GRID_HOME=$GRID_HOME -t 1 $GRID_HOME/bin/gsm $GRID_HOME/config/overrides/gsm-override.xml)
echo "${GSC}"
echo "${GSM}"

# SGE job return syntax is XXXX.X-X:X, i.e. $JobID.$requested_min-$max:$step
# so trim out just the first XXXX, which is a regex match on the 3rd field
MATCH="\(.*\) \(.*\) \([0-9]*\)\.\([0-9]*\)-\([0-9]*\):\([0-9]*\)" # simple match for multi-node job
MATCH2="\(.*\) \(.*\) \([0-9]*\) \(.*\)" # simple match for simple 1 node job

# the array assignment keeps only the first word (the job id) in $GSCparsed
GSCparsed=( $(echo "$GSC" | sed -n -e "s/${MATCH}/\3/p") )
if [[ ${GSCparsed:-"unset"} = "unset" ]]
then
    GSCparsed=( $(echo "$GSC" | sed -n -e "s/${MATCH2}/\3/p") )
fi

GSMparsed=( $(echo "$GSM" | sed -n -e "s/${MATCH}/\3/p") )
if [[ ${GSMparsed:-"unset"} = "unset" ]]
then
    GSMparsed=( $(echo "$GSM" | sed -n -e "s/${MATCH2}/\3/p") )
fi

echo "Jobs $GSCparsed and $GSMparsed submitted"

# wait for these jobs to show up in qstat
GSCstatus=0
GSMstatus=0
until [[ $GSMstatus -gt 0 && $GSCstatus -gt 0 ]]
do
    # evaluate the qstat -s r response (running jobs) to make sure that the
    # requisite jobs are running
    GSCstatus=$(qstat -s r | nawk '/'${GSCparsed}'/{var1+=1} END {print var1+0}')
    GSMstatus=$(qstat -s r | nawk '/'${GSMparsed}'/{var1+=1} END {print var1+0}')
    echo "GSCstatus = $GSCstatus"
    echo "GSMstatus = $GSMstatus"
    echo "Server status is $(qstat -s r)"
    sleep 10
done

# run our application - in this case, use multiple nodes to help us calculate prime factors
echo "crunching"
~/ $1
echo "done"

# clean up
# parse jobid's out of GSM and GSC
echo $(qdel $GSMparsed $GSCparsed)

# go ahead and print out the queue status on the way out to verify cleanup (optional)
sleep 10
echo "Leaving..."
echo $(qstat)
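The two fiddly parts of the listing, pulling the job id out of qsub's reply and counting running jobs, can also be written as small functions. This is a sketch under assumptions rather than a drop-in replacement: the helper names `extract_job_id` and `count_running` are made up, the message shapes in the comments are what I would expect qsub to print given the patterns the listing matches, and plain `awk` is used where the listing uses `nawk`.

```shell
#!/bin/bash
# Pull the numeric job id out of qsub's confirmation message.
# Assumed message shapes (based on the formats the listing matches):
#   Your job 4711 ("gsee-gsm") has been submitted
#   Your job-array 4712.1-4:1 ("gsee-gsc") has been submitted
extract_job_id() {
    local re='^Your job(-array)? ([0-9]+)'
    if [[ $1 =~ $re ]]; then
        echo "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

# Count the lines of "qstat -s r" whose first field is exactly the job id.
# Prints 0 when nothing matches, so the result is always a number.
count_running() {
    qstat -s r | awk -v id="$1" '$1 == id { n++ } END { print n + 0 }'
}
```

Comparing the first field exactly, instead of matching the id anywhere on the line, avoids counting job 4711 when you asked about job 47. A call such as `GSCparsed=$(extract_job_id "$GSC")` would take the place of the sed pipeline.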

Hopefully, this example sheds some light on some of the mechanisms that a developer might enlist in order to launch more complex, server-dependent applications against the Sun Grid. Please let me know if I need to elaborate further. I want to take this opportunity to recognize GigaSpaces, and specifically Dennis Reedy, for his help in putting together a grid job which could flex a couple of nodes against their GigaSpaces Enterprise Server 5.0 environment. I'd also like to thank Bill Meine and Fay Salwen for their scripting assistance.
