Posted by
dhushon on March 20, 2006 at 10:07 AM PST
Sun Grid's resource management semantics basically dictate that jobs be self-contained and that they terminate all of their processes in order to exit. Terminating processes in a grid context is not as simple as trapping a PID on a single host. Instead, you need to use the qsub, qstat and qdel commands to manage your distributed jobs.
The example pattern that I'd like to elaborate on is one of a "server/framework" which needs to run in order to support a client. Whether it is a simple RMID or a more complex instance of a web server, app server or JavaSpace, the pattern is very similar. The developer wants to:
- Start up one or more servers (in our case 2, the httpd and the GigaSpaces Enterprise Server)
- Make sure that the servers are running
- Submit the client and wait for the client to complete
- Shut down the servers so that the Sun Grid job can terminate and stop the meter
First some basic syntax:
- #$ = new directives for SGE which do things like populate environment variables (-V)
- qsub = submit this task to the grid for scheduling; we use a couple of options:
- "-sync n": fire and forget; qsub returns right away instead of waiting for the job to complete
- "-N <jobname>": not required, but could be used for parsing qstat. Unfortunately, qdel requires a job ID instead of a job name (to keep you from shutting down similarly named jobs)
- "-t 1" or "-t 1-4:1": submit a job to one or multiple nodes with a minimum
- qstat = get the status of the SGE queue, which in the case of Sun Grid will only return the jobs that you own for privacy purposes
- "-s r": only return the "running" jobs; jobs that are waiting (status="qw") are excluded
- qdel = delete / stop the specified jobs
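Since qdel is driven by job ID here, a wrapper script has to recover that ID from whatever qsub prints at submission time. The following is only a sketch of one way to do that with shell parameter expansion. The function name extract_job_id is hypothetical, and the two message shapes shown in the comments are my assumption about what qsub prints, so check them against your own SGE version.

```shell
# Sketch: pull the numeric job ID out of qsub's confirmation message.
# Assumed message shapes (not quoted from any official documentation):
#   Your job 1234 ("gsee-gsm") has been submitted
#   Your job-array 1235.1-4:1 ("gsee-gsc") has been submitted
extract_job_id() {
    local message rest id
    message=$1
    # Drop everything up to and including "job-array " or "job ".
    case $message in
        *"job-array "*) rest=${message#*job-array } ;;
        *"job "*)       rest=${message#*job } ;;
        *)              return 1 ;;   # not a submission message
    esac
    id=${rest%%[!0-9]*}   # keep only the leading run of digits
    [ -n "$id" ] || return 1
    printf '%s\n' "$id"
}
```

A script could then call something like GSMID=$(extract_job_id "$GSM") and stop early if the function returns a non-zero status, rather than passing an empty ID to qdel later on.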
Now onto the listing:
#! /bin/bash
#$ -V
# if we are running against an older version of SGE, the "#$ -V" directive
# will not exist, so be sure that we source the SGETOOLS (or at least try to)
if [[ ${SGETOOLS:-"unset"} = "unset" ]]
then
echo setting SGETOOLS
SGETOOLS=/home/sgeadmin/N1GE/bin/sol-amd64
export SGETOOLS
PATH=$SGETOOLS:$PATH
fi
echo "Starting the GigaSpaces Servers"
GSEE_HOME=GigaSpacesEE5.0
GRID_HOME=$GSEE_HOME/ServiceGrid
GSC=`qsub -sync n -N gsee-gsc -v GSEE_HOME=$GSEE_HOME -v GRID_HOME=$GRID_HOME -t 1-4:1 $GRID_HOME/bin/gsc`
GSM=`qsub -sync n -N gsee-gsm -v GSEE_HOME=$GSEE_HOME -v GRID_HOME=$GRID_HOME -t 1 $GRID_HOME/bin/gsm $GRID_HOME/config/overrides/gsm-override.xml`
echo ${GSC}
echo ${GSM}
#SGE Job return syntax is XXXX.X-X:X where $JobID.$requested_min-$max:$Actual_min
# so trim out just the first XXXX which is a regex matched from the 3rd field
MATCH="\(.*\) \(.*\) \([0-9]*\)\.\([0-9]*\)-\([0-9]*\):\([0-9]*\)" #simple match for multi-node job
MATCH2="\(.*\) \(.*\) \([0-9]*\) \(.*\)" #simple match for simple 1 node job
GSCparsed=( `echo $GSC | sed -n -e "s/${MATCH}/\3/p"` )
if [[ ${GSCparsed:-"unset"} = "unset" ]]; then
GSCparsed=( `echo $GSC | sed -n -e "s/${MATCH2}/\3/p"` )
fi
GSMparsed=( `echo $GSM | sed -n -e "s/${MATCH}/\3/p"` )
if [[ ${GSMparsed:-"unset"} = "unset" ]]; then
GSMparsed=( `echo $GSM | sed -n -e "s/${MATCH2}/\3/p"` )
fi
echo "Jobs $GSCparsed and $GSMparsed submitted"
# wait for these jobs to show up in qstat
GSMstatus=0
GSCstatus=0
# -gt compares numerically; inside [[ ]] a bare > would compare strings
until [[ "${GSMstatus:-0}" -gt 0 && "${GSCstatus:-0}" -gt 0 ]]
do
#evaluate the qstat -s r response (running jobs) to make sure that the
#requisite jobs are running
#(var1+0 makes nawk print 0 rather than an empty string when nothing matches)
GSCstatus=$(qstat -s r | nawk '/'${GSCparsed}'/{var1+=1} END {print var1+0}')
GSMstatus=$(qstat -s r | nawk '/'${GSMparsed}'/{var1+=1} END {print var1+0}')
echo "GSCstatus = $GSCstatus"
echo "GSMstatus = $GSMstatus"
echo Server status is $(qstat -s r)
sleep 10
done
#run our application - in this case, use multiple nodes to help us calculate prime factors
echo "crunching"
~/prime-crunch.sh "$1"
echo "done"
#clean up
#parse jobid's out of GSM and GSC
echo $(qdel $GSMparsed $GSCparsed)
#go ahead and print out the queue status on the way out to verify cleanup (optional)
sleep 10
echo "Leaving..."
echo $(qstat)
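The listing waits indefinitely for the servers and only reaches qdel if the client step returns, so a server that never starts or a client that dies early would leave the meter running. As a hedged sketch, and not part of the original script, the wait can be bounded and the cleanup moved into an EXIT trap. The names MAX_POLLS, POLL_INTERVAL, count_running, wait_for_running and cleanup_servers are hypothetical, and the code assumes that qstat prints the job ID as the first column of each job line.

```shell
# Sketch: bounded wait for the server jobs plus cleanup on any exit.
# All names below are illustrative and do not come from SGE itself.
MAX_POLLS=${MAX_POLLS:-30}          # give up after this many polls
POLL_INTERVAL=${POLL_INTERVAL:-10}  # seconds to sleep between polls

# Count lines on stdin whose first field equals the job ID in $1.
# Assumes qstat prints the job ID as the first column of each job line.
count_running() {
    local first rest n
    n=0
    while read -r first rest; do
        if [ "$first" = "$1" ]; then
            n=$((n + 1))
        fi
    done
    printf '%s\n' "$n"
}

# Return 0 once every job ID given as an argument appears in "qstat -s r";
# return 1 if that has not happened after MAX_POLLS polls.
wait_for_running() {
    local attempt id listing all_up
    attempt=1
    while [ "$attempt" -le "$MAX_POLLS" ]; do
        listing=$(qstat -s r)
        all_up=yes
        for id in "$@"; do
            if [ "$(printf '%s\n' "$listing" | count_running "$id")" -eq 0 ]; then
                all_up=no
            fi
        done
        if [ "$all_up" = yes ]; then
            return 0
        fi
        sleep "$POLL_INTERVAL"
        attempt=$((attempt + 1))
    done
    return 1
}

# Delete the given server jobs; intended to run from an EXIT trap.
cleanup_servers() {
    if [ "$#" -gt 0 ]; then
        qdel "$@"
    fi
}

# Possible use inside the wrapper script (left commented out here):
#   trap 'cleanup_servers "$GSMparsed" "$GSCparsed"' EXIT
#   wait_for_running "$GSMparsed" "$GSCparsed" || exit 1
```

Matching on the first column instead of anywhere in the line avoids counting a job whose ID happens to appear inside another field, such as a timestamp. The trap runs on every exit path, which is the grid equivalent of the single-host PID trap mentioned at the top of this post.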
Hopefully, this example sheds some light on some of the mechanisms that a developer might enlist in order to launch more complex, server-dependent applications against the Sun Grid. Please let me know if I need to elaborate further. I want to take this opportunity to recognize GigaSpaces, and specifically Dennis Reedy, for his help in putting together a grid job which could flex a couple of nodes against their GigaSpaces Enterprise Server 5.0 environment. I'd also like to thank Bill Meine and Fay Salwen for their scripting assistance.