Technical Note - Grid Computing: ECS Grid
This document and the underlying grid software changed in April 2012. As of December 2012, machines in the grid started to move to a 64-bit Operating System. The default version of Java on the School's machines changed again for 2015.
IMPORTANT
Please READ IT ALL AGAIN even if you have flown with us many times before.
If you follow the instructions here and something does not work, please let the programming staff know.
FOR EXISTING/RETURNING USERS
1) There are no longer any NetBSD machines in the ECS/SGE Grid. As of April 2012, this document no longer makes reference to that platform except in the line above.
2) The default SGE_ARCH string is now lx-amd64, not lx26-x86, nor lx-x86.
3) The use of /vol/grid/sgeusers should be avoided in favour of /vol/grid-solar/sgeusers. As of April 2012, this document no longer makes reference to the /vol/grid filesystem, and neither should you!
Summary
There didn't seem to be any ECS-focused documentation aimed at users wishing to run jobs on the ECS Grid, so here are the basics of job submission, this being the area in which things differ most from the provider's documentation, which used to be at
http://dlc.sun.com/pdf/820-0699/820-0699.pdf
Other aspects of job control are covered within that documentation.
Note: After Oracle bought Sun, they stopped providing the documentation at that link, so a local copy is attached.
Details
General
ECS administers two "Grids" known, in administration circles, as the ECS Grid and the SCS Grid. You may also find these referred to as the SGE Grid and the Condor Grid respectively. The two are separate.
The Tech Note for the SCS Grid is at:
http://ecs.victoria.ac.nz/Support/TechNoteScsGrid
The ECS Grid runs under the control of a descendant of the Sun Grid Engine (SGE) and exists to make use of the computing power of the School's ArchLinux machines at times when they are unused, ie when they should have no one logged in at the console, eg overnight.
Jobs, usually shell scripts wrapping a number of tasks, are submitted from any ECS workstation (if your workstation is a Windows machine or Mac, then you will need to log in to one of the School's servers) into a simple queuing system where they remain, in turn, until an "unused" machine is able to start them. (Note that being able to start them is not the same as being able to run them to completion.)
At present, users at the console of a machine have priority over Grid jobs running on the same machine, to the extent that a Grid job will be suspended on a machine where there is console activity, so users submitting Grid jobs should be aware that there is no guaranteed run time for any given task. Basically, it'll finish when it finishes.
There is a low-volume mailing list which is used to inform users of the ECS Grid of matters which may be of interest to them. You can subscribe your School or University email address here:
http://ecs.victoria.ac.nz/mailman/listinfo.cgi/grid-users
Setting up the environment
A single SGE instance can control a number of "Grids". In order to provide the SGE utilities with information about which Grid the user wishes to run their job within, a couple of environment variables need to be set up. This is achieved using the standard package system's
need pkgname
environment modifying process.
We'll be using the SGE Grid (ECS also maintains the SCS Grid, not accessible from here), so a simple
need sgegrid
suffices to set up the environment for job submission.
If ever, when trying to run an SGE command, you see this message
critical error: Please set the environment variable SGE_ROOT.
then the chances are that you have forgotten to type, or otherwise arranged to have run, the command:
need sgegrid
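As a minimal sketch, assuming the example submission script freds_test.sh described later in this document, a typical interactive session before submitting a job would therefore look like:
% need sgegrid
% qsub freds_test.sh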
Do I have a home on the Grid?
This is slightly quirky and not initially intuitive:
- Staff will not be able to access their home directories
- Students will
A user of an SGE-controlled Grid might expect to find that their jobs start to execute from a home directory within the overall system, that home directory being accessible to all machines within that system and, for the case of a grid utilising their everyday machines, being their normal working directory after logging in to any of those machines interactively. A simple
qsub submission_script_name
would then be enough to start the job off.
With the ECS Grid, however, despite all staff and students having a home directory no matter which ECS machine they might log in at interactively, home directories are only accessible to Grid jobs for student accounts.
Because the machines comprising the Grid system can be any of the School workstations, both individuals' office machines and public access lab machines, staff will not see their home directories accessible from a remote machine when running a Grid job and so will have to explicitly set an initial working directory elsewhere.
This can be achieved on the command line at submission time, by use of the
-wd path
option to qsub, though perhaps a better option for staff is to always place the equivalent SGE directive, pointing at a known path, at the top of the submission script:
#$ -wd path
Of course, non-staff users may also find this mechanism useful.
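For example, a staff member with a personal directory under /vol/grid-solar/sgeusers (described below) could, using the example user fred, either submit with
% qsub -wd /vol/grid-solar/sgeusers/fred freds_test.sh
or place the equivalent directive near the top of the script:
#$ -wd /vol/grid-solar/sgeusers/fred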
At the time of this revision of this document, there is an area of accessible filestore that could be used to create a standard area from which staff could reliably start SGE jobs. The
/vol/grid-solar/
filesystem is a mount of some filestore supplied by ITS, and users of the ECS Grid will find that they have a personal directory within that filesystem, accessible as:
/vol/grid-solar/sgeusers/username
If you don't have a directory there you can ask for one to be created by emailing a request to the ECS request tracking system at
jobs@ecs.vuw.ac.nz
The subject of the email should contain ECS Grid and in the text of your request you should tell us your ECS login name, which will be used as the name of the directory.
Note for Windows users
On a Windows system the /vol/grid-solar/ filesystem is accessible by mounting
\\smb\grid-solar
as a network drive.
Because this is a distributed batch processing environment, there's usually no clear indication as to which machine(s) your job(s) will end up running on. You thus need to give a little more thought to the location of input and output files than if you were simply running a job on your own workstation where everything is local to the machine (though, who knows, your Grid job might end up running on your workstation).
Once the job is running on the remote machine it will have access to many of the NFS-shared filesystems that the user would expect to see from their own workstation during an interactive session. This can be useful when large data sets need to be accessible for reading and the overhead of copying the data to each machine upon which the job is running is large, because they can be placed at known paths.
The NFS-shared filesystems are less of an advantage for writing out data:
- Writing over NFS is often slower than reading
- There is the potential for bottlenecks to occur where a user has each job writing to the same directory (or even file, if they get things wrong) over NFS, or where many users each have jobs writing to the same NFS partition.
It is thus advisable to arrange for any output from the program to be written to a directory local to the machine upon which the program is running and then, at the end of the job, to copy any output to filestore to which the user will have general access.
The area of filestore provided for this purpose upon every ECS Grid machine is the directory
/local/tmp
Note that this directory may well be being used by the user who normally sits at the console, and will almost certainly be used by Grid jobs that came before your current one and those that come after yours, and so there is no guarantee that a path or file name that you wish to create does not exist already.
To avoid any clashes, as a courtesy to other users, and to simplify the process of cleaning up afterwards, a directory below the path /local/tmp will be created that follows the convention
/local/tmp/[username]/$JOB_ID
where $JOB_ID is an environment variable maintained by the SGE for the duration of the job, and which will thus be available to your submission script and to any programs able to read the environment. Your submission scripts should ensure you either change to this directory or place any files you require for the job there.
Preserving results after execution
Once your job has run and the submission script has terminated, any output written to
/local/tmp/[username]/$JOB_ID
will be deleted.
In order to get your output back to somewhere more useful to you, there are a number of options:
- If you have direct access to your home directory path, you can copy directly to that. Staff can't do this.
- If you have write access to a shared filesystem, you can copy directly to that and then move files into your own filestore from your own machine.
Staff will need to exercise Option 2, as sketched below.
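A minimal sketch of Option 2, following the same pattern as the full example script later in this document; it assumes the example user fred has a personal directory under /vol/grid-solar/sgeusers, and my_results.dat is just a placeholder for whatever your job actually produced in its /local/tmp work area:
#
# At the end of the job, copy results from the local work area back
# to the shared grid-solar filestore, in a job-specific directory
#
mkdir -p /vol/grid-solar/sgeusers/fred/$JOB_ID
cp my_results.dat /vol/grid-solar/sgeusers/fred/$JOB_ID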
Where do stdin, stdout and stderr appear?
When one runs programs locally, program output and error messages will often appear on the console, or in the terminal emulator, and one can usually perform command line redirection for input.
When you are running a non-interactive job on a remote machine, however, it is likely that you aren't going to see any console output during the execution of the program. The SGE therefore redirects the stdout and stderr channels to file so that they may be inspected after the job has finished.
Typically, the default stdout and stderr naming conventions try to create files called
scriptname.o$JOB_ID
and
scriptname.e$JOB_ID
respectively, in the working directory of the task when it starts. (See the note about working directories for staff.)
The default location of these files can be altered by use of qsub command-line options or via the corresponding SGE directives being specified in the job submission script.
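For example, the -o and -e options to qsub set the destinations for stdout and stderr respectively; as a sketch, assuming you want both streams collected under the example user fred's grid-solar directory rather than the job's working directory, the directive form would be:
#
# Send stdout and stderr to my grid-solar directory
#
#$ -o /vol/grid-solar/sgeusers/fred/
#$ -e /vol/grid-solar/sgeusers/fred/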
Job submission script example
Basic jobs just need to run on some machine within the Grid. If you know that you want a certain type of processor or need a minimum amount of memory or disk space, then you probably know enough to create the relevant submission script, or at least be capable of reading the Sun documentation for more information.
The job submission script freds_test.sh, used as an example here, is effectively then just a simple test of the system. (Though running a simple test to check the basics, eg that the directories exist, before you submit 5000 jobs that write to them is a GOOD thing.)
There are, however, a couple of advancements over a simple fire and forget activity.
We've taken the view that you'll want to know when your job starts and finishes, and so the example job submission script will tell the Grid Engine to email you when it does.
We've taken the view that you'll want to script using the Bourne Shell, so we'll force the SGE to run your job submission script within that shell, because the initial line's
#!/bin/sh
may not be honoured.
We've taken the view that you'll be doing something more than just adding a couple of integers together in a loop (and, no, adding a couple of thousand integers together still doesn't count), so we'll try and access some areas of the filestore you will have access to when you run your jobs and move things around.
Finally, we've written the example for an ECS user with the username fred and the mail address Fred.Bloggs@ecs.vuw.ac.nz, so you might need to change a few things in the script if you are not Fred Bloggs and/or not in ECS. (HINT: Search for the strings fred, Fred.Bloggs, FRED and ecs.)
BIGGER HINT, ADDED after someone tried to mail Fred.Bloggs@ecs.vuw.ac.nz and Fred was not happy getting all the emails: search for the strings fred, Fred.Bloggs, FRED and ecs, and change them to match your username, email address and school.
And of course, this is just an example. Once you have modified the script to suit your needs and run a few tests to check things work as you expect, you will probably want to remove some, if not all, of the recording of the environment and so on - but that's up to you. Sometimes, having the info can be useful to a debugging exercise, eg when you are trying to invoke something not on the PATH because you are effectively logging into a non-interactive environment; sometimes, all the extra clutter makes it hard to see what's happening.
That "extra clutter" should also include the individual job emails you will get, so it is usually worth electing not to be informed when running large numbers of jobs.
IMPORTANT
Some users, mostly users new to the School, have not been in the habit of understanding the template for grid job submission before using it as the basis for their own job submission scripts, and in other cases have been using out-of-date and/or incorrect job submission scripts that they may have been given, or pointed to, by existing users who should really know better.
When you come to modify this template for your own use, then EXCEPT for your username, you should not change ANYTHING in the first 32 lines, ie above this line:
# Now we are in the job-specific directory so now can do something useful
These lines are our attempt to ensure that the grid is working as we expect it to, and there is no need to alter ANYTHING except your username.
Furthermore, even though they may appear to be so, not all of those lines are comments that can simply be removed; an understanding of the process of grid job submission would have alerted users to this, so, once again, you should not change anything in those first 32 lines EXCEPT for your username.
If you have a copy of the template script downloaded and you do a 'diff' of it against the job submission script that you are considering using, the only differences you should see, within the first 32 lines, should be of the following form:
% diff tech_note_template.sh my_job_submission_script.sh
9c9
< #$ -wd /vol/grid-solar/sgeusers/fred
---
> #$ -wd /vol/grid-solar/sgeusers/myusername
18,19c18,19
< if [ -d /local/tmp/fred/$JOB_ID ]; then
< cd /local/tmp/fred/$JOB_ID
---
> if [ -d /local/tmp/myusername/$JOB_ID ]; then
> cd /local/tmp/myusername/$JOB_ID
26,27c26,27
< echo "AND LOCAL TMP FRED "
< ls -la /local/tmp/fred
---
> echo "AND LOCAL TMP myusername "
> ls -la /local/tmp/myusername
...
and if your 'diff' suggests otherwise, you should go back and alter your script so that it matches the template apart from the username differences.
A basic job submission script
Some people have had problems when they cut and paste the following text on Windows systems. To convert a file from Windows format to UNIX format you can do the following at the command line
dos2unix < my_windows_file.txt > my_unix_file.sh
Alternatively, if you use the emacs editor, you can change the character set encoding by typing this useful key sequence
Ctrl-x RET f undecided-unix RET
where RET is the return key.
The example script tries to convert a JPEG file into one with a PNG format. You may wish to generate, or obtain, a small JPEG file for use with the example.
Here is the basic script:
#!/bin/sh
#
# Force Bourne Shell if not Sun Grid Engine default shell (you never know!)
#
#$ -S /bin/sh
#
# I know I have a directory here so I'll use it as my initial working directory
#
#$ -wd /vol/grid-solar/sgeusers/fred
#
# End of the setup directives
#
# Now let's do something useful, but first change into the job-specific directory that should
# have been created for us
#
# Check we have somewhere to work now and if we don't, exit nicely.
#
if [ -d /local/tmp/fred/$JOB_ID ]; then
cd /local/tmp/fred/$JOB_ID
else
echo "Uh oh ! There's no job directory to change into "
echo "Something is broken. I should inform the programmers"
echo "Save some information that may be of use to them"
echo "Here's LOCAL TMP "
ls -la /local/tmp
echo "AND LOCAL TMP FRED "
ls -la /local/tmp/fred
echo "Exiting"
exit 1
fi
#
# Now we are in the job-specific directory so now can do something useful
#
# Stdout from programs and shell echos will go into the file
# scriptname.o$JOB_ID
# so we'll put a few things in there to help us see what went on
#
echo ==UNAME==
uname -n
echo ==WHO AM I and GROUPS==
id
groups
echo ==SGE_O_WORKDIR==
echo $SGE_O_WORKDIR
echo ==/LOCAL/TMP==
ls -ltr /local/tmp/
echo ==/VOL/GRID-SOLAR==
ls -l /vol/grid-solar/sgeusers/
#
# OK, where are we starting from and what's the environment we're in
#
echo ==RUN HOME==
pwd
ls
echo ==ENV==
env
echo ==SET==
set
#
echo == WHATS IN LOCAL/TMP ON THE MACHINE WE ARE RUNNING ON ==
ls -ltra /local/tmp | tail
#
echo == WHATS IN LOCAL TMP FRED JOB_ID AT THE START==
ls -la
#
# Copy the input file to the local directory
#
cp /vol/grid-solar/sgeusers/fred/krb_tkt_flow.JPG .
echo ==WHATS THERE HAVING COPIED STUFF OVER AS INPUT==
ls -la
#
# Note that we need the full path to this utility, as it is not on the PATH
#
/usr/pkg/bin/convert krb_tkt_flow.JPG krb_tkt_flow.png
#
echo ==AND NOW, HAVING DONE SOMTHING USEFUL AND CREATED SOME OUTPUT==
ls -la
#
# Now we move the output to a place to pick it up from later
# (really should check that directory exists too, but this is just a test)
#
mkdir -p /vol/grid-solar/sgeusers/fred/$JOB_ID
cp krb_tkt_flow.png /vol/grid-solar/sgeusers/fred/$JOB_ID
#
echo "Ran through OK"
As some people on Windows machines have had problems cutting-and-pasting the above content, a downloadable version is available as this attachment:
- qstat : shows you the state of your jobs
- qstat -u \* : shows you the state of all jobs
- qsub script_name : submits the job defined in the script into the queuing system
- qdel job_number : deletes the job with the job_number from the queuing system
Note that if you wish to force a job deletion, you will need to run the qdel from greta-pt, the School's general purpose server, not your workstation.
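As a sketch, forcing the deletion of a stuck job would then look something like this (12345 is just a placeholder job number, and -f is the qdel option that requests a forced deletion):
% ssh greta-pt
% need sgegrid
% qdel -f 12345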
Emailed output
During the development phase of enabling an existing workflow to be run within the grid, especially when the grid's resources are busy, it can be useful to be notified by email that a job has started, failed or ended.
The qsub man page suggests that you can add various command line options to give you this functionality; however, in common with other command line options, you can place these "directives" inside the job submission script. An example of using submission script directives is:
#
# Mail me at the b(eginning) and e(nd) of the job
#
#$ -M Fred.Bloggs@ecs.vuw.ac.nz
#$ -m be
#
but be aware that trying to email too many messages out from your account may see you exceed overall mail quotas.
If Fred does choose to notify himself by email, he will see an email message like this when the job starts:
Subject: Job 341642 (freds_test.sh) Started
Job 341642 (freds_test.sh) Started
User = fred
Queue = GX755
Host = lumiere.ecs.vuw.ac.nz
Start Time = 03/18/2009 16:20:54
and one like this when it ends:
Subject: Job 341642 (freds_test.sh) Complete
Job 341642 (freds_test.sh) Complete
User = fred
Queue = GX755@lumiere.ecs.vuw.ac.nz
Host = lumiere.ecs.vuw.ac.nz
Start Time = 03/18/2009 16:20:54
End Time = 03/18/2009 16:20:55
User Time = 00:00:00
System Time = 00:00:00
Wallclock Time = 00:00:01
CPU = NA
Max vmem = NA
Exit Status = 0
Array Jobs (Task Array Jobs)
The previous example is fine for the submission of one-off jobs; however, there may be cases where you wish to submit the same process multiple times, but where each invocation of the process performs a different task, based on its place within the full set of jobs.
This can be realised by submitting multiple single jobs, but another way to do this is to submit an Array Job.
An "array job" is submitted by using the
-t F-L:S
(First-Last:Step) option, for example
qsub -t 1-10:2 my_submission_script.sh
submits 5 jobs into the grid.
Whilst this form of submission is referred to as a Job Array, the option used, -t, and the environment variable used to differentiate between the array jobs, $SGE_TASK_ID, suggest that it could be referred to as a Task Array Job.
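As a minimal sketch of how a submission script might use $SGE_TASK_ID to make each task do something different, here each task copies over and processes its own input file (the input_*.dat file names are just placeholders, and fred is the example user from earlier):
#
# Each task picks its own input file based on its task id
#
INPUT=input_$SGE_TASK_ID.dat
cp /vol/grid-solar/sgeusers/fred/$INPUT .
echo "Task $SGE_TASK_ID of job $JOB_ID is processing $INPUT"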
An example script, similar to the basic submission script example, but showing one way to differentiate the array job output, is available as this attachment:
Managing Array Jobs (Task Array Jobs)
As the single task submission notes above state, a temporary work area is created for each job, based on the environment variable $JOB_ID; however, in grid environments where more than one job can run on a single machine, the tasks from a Job Array submission would all have the same $JOB_ID, and so a second environment variable, $SGE_TASK_ID, is provided so as to differentiate between the tasks.
The temporary directory created for Array Jobs is thus of the form
/local/tmp/[username]/$JOB_ID.$SGE_TASK_ID
so if, for the example given using -t 1-10:2, the JOB_ID was 1234, and the user was fred, these directories would get created (in the case of the ECS Grid, on five different machines):
/local/tmp/fred/1234.1
/local/tmp/fred/1234.3
/local/tmp/fred/1234.5
/local/tmp/fred/1234.7
/local/tmp/fred/1234.9
and the combination of the two environment variables, eg $JOB_ID.$SGE_TASK_ID, can be used as a general construct for differentiating between the individual tasks.
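In practice this means that, in a Job Array submission script, the check-and-change-directory stanza from the basic template should use the task-specific work area instead; a minimal sketch for the example user fred:
#
# Change into the per-task work area rather than the per-job one
#
if [ -d /local/tmp/fred/$JOB_ID.$SGE_TASK_ID ]; then
cd /local/tmp/fred/$JOB_ID.$SGE_TASK_ID
else
echo "There's no task directory to change into"
exit 1
fi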
Specialised job submission
As previously detailed, basic jobs just need to run on some machine within the Grid.
There may, of course, be classes of job where ensuring that all tasks run on the same architecture is desirable, an example within ECS being a desire to ensure run timings were not influenced by differences in the model of machine that individual tasks were executed on.
Similarly, temporary resource partitioning requests that ensure students in a lab tutorial can target the machines in the lab that has been booked for them require a handle through which the user can access a subset of the full SGE Grid.
A number of the SGE utilities, including qsub, allow for a resource request list to be defined by use of the
-l resource=value
option, where the resources are maintained within the SGE Complex.
Currently, the local additions (some of which may not always be populated) to the SGE Complex are
ecs_df_local
ecs_model
ecs_netgroup
ecs_room
so a user wishing to target only those machines which are the model GX745 would need to add
-l ecs_model=GX745
to their SGE command.
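As with the other options shown in this document, the resource request can be given either on the command line or as a directive in the submission script; a small sketch, where your_script.sh is just a placeholder name:
% qsub -l ecs_model=GX745 your_script.sh
or, in the script itself:
#
# Only run on GX745 model machines
#
#$ -l ecs_model=GX745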
Targeting machine architectures
Whilst platform-neutral stuff, eg school-wide packages you run in batch mode, or Java programs, should not be affected, if at all, by differences in the underlying architecture of machines within the grid, it may be useful to specify the architecture you expect your jobs to run on, so as to future proof your job submission scripts.
To specifically request the OS you want when submitting the job you can use the
-l
argument to qsub. A qsub command targeting the 32-bit ArchLinux machines would be
qsub -l arch=lx-x86 your_script.sh
A second approach involves checking, within the submission script, for the OS your job ends up running on. The SGE will actually set an environment variable for you to test against, eg
SGE_ARCH=lx-x86
You should also have access to the utility that SGE uses to provide its own view of things, on all the machines in the grid, as
/usr/pkg/sge/util/arch
an invocation of which will return
lx-x86
So, for example, you might have, if you choose to use the value that SGE would return to differentiate, directories containing OS-specific binaries with the same name:
/vol/grid-solar/username/mycodes/bin/lx-x86/prog1
or programs where the name specifies the architecture:
/vol/grid-solar/username/mycodes/bin/prog1.lx-x86
Here is a template (in Bourne shell syntax) that will allow you to branch your submission script; your commands would go where the "I could run" echo statements appear:
if [ -z "$SGE_ARCH" ]; then
echo "Can't determine SGE ARCH"
else
if [ "$SGE_ARCH" = "lx-x86" ]; then
echo "I could run a Linux x86 binary"
fi
fi
and a similar version for the C shell syntax (though it is a good idea to write your job submission scripts in Bourne shell syntax):
if ( $?SGE_ARCH == 0 ) then
echo "Can't determine SGE ARCH"
else
if ( $SGE_ARCH == "lx-x86" ) then
echo "I could run a Linux x86 binary"
endif
endif
Compilation for the ArchLinux machines
If you are someone without access to the ECS lab machines, you'll not have access to an ArchLinux machine on which to compile code targeting those grid resources. In this case, however, you should find that binaries compiled for i386 (so not x86_64) on other GNU/Linux machines you have access to may work.
If you experience other problems in this area, please get in touch with us, whilst we await the deployment of a general purpose ArchLinux 32-bit server.
Running Java programs on the ECS/SGE Grid
Running Java programs on the ECS/SGE Grid has always been complicated by the fact that, to set up the full Java environment you would get when running interactively using a
need javaXYZ
command, you did not have access to the need facility from within the default Grid environment. It seems worth providing some basic guidelines that should allow many Java programs to operate.
With a bit of guinea-pigging by Roman Klapaukh, it would appear that the following stanza should allow one to submit Java programs to the ECS Grid, without worrying about the OS your job ends up running against.
Note that this solution uses the mechanism outlined above for determining the OS and thus, should you need to branch any other operations on that test, you could combine them.
Note also that you're probably not the user fred, so YOU WILL NEED TO EDIT THE SCRIPT.
#!/bin/sh
#
# Force Bourne Shell if not Sun Grid Engine default shell (you never know!)
#
#$ -S /bin/sh
#
# I know I have a directory here so I'll use it as my initial working directory
#
#$ -wd /vol/grid-solar/sgeusers/fred
#
# Now let's do something useful, but first change into the job-specific directory that should
# have been created for us
#
if [ -d /local/tmp/fred/$JOB_ID ]; then
cd /local/tmp/fred/$JOB_ID
else
echo "There's no job directory to change into "
echo "Something is broken. I should inform the programmers"
echo "Save some information that may be of use to them"
echo "Here's LOCAL TMP "
ls -la /local/tmp
echo "AND LOCAL TMP FRED "
ls -la /local/tmp/fred
echo "Exiting"
exit 1
fi
#
if [ -z "$SGE_ARCH" ]; then
echo "Can't determine SGE ARCH"
else
if [ "$SGE_ARCH" = "lx-amd64" ]; then
JAVA_HOME="/usr/pkg/java/sun-8"
fi
fi
if [ -z "$JAVA_HOME" ]; then
echo "Can't define a JAVA_HOME"
else
export JAVA_HOME
PATH="/usr/pkg/java/bin:${JAVA_HOME}/bin:${PATH}"; export PATH
java Hello
fi
Note that the JAVA_HOME path matches the default that you would get running a bare java command when sitting at an ECS workstation. The School maintains other Java installations, so if you wish to use those, you will need to edit the script accordingly.
You can't use need inside a basic job submission script
The Java example above highlights a wider issue for writing job submission scripts for the ECS Grid.
Some programs, when run in an interactive shell on ECS/MSOR machines, require the user to modify their shell environment by typing
need [pkgname]
however, the need() function is not available within the non-interactive shells that the grid jobs run in, which requires the user to be aware of what a
need [pkgname]
is actually doing and, should the user find that the program, when run inside a grid job, requires the functionality from the "need file", to replicate it within the script, as shown in the Java example above.
You can see what an invocation of
need [pkgname]
does if, from a shell, you look at the file
etc/pkgs/[pkgname].sh
It is likely that you won't want to bother replicating everything a "need file" would do (eg, setting the MANPATH for a non-interactive grid job) but you are unlikely to cause any issues if you do replicate everything.
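As a sketch, if the need file for a hypothetical package pkgname did little more than put the package's bin directory on the PATH, the equivalent lines in your job submission script might look like this (the /usr/pkg/pkgname/bin path is only a placeholder; check the real need file for the package you actually use):
#
# Replicate what 'need pkgname' would have done interactively:
# put the package's bin directory on the PATH (placeholder path)
#
PATH=/usr/pkg/pkgname/bin:${PATH}; export PATH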
If you do try to use
need [pkgname]
inside a job submission script, you are likely to see a message akin to this
"ERROR: configuration error -- this is the wrong version of need
You should never see this -- please report to bugs@ecs.vuw.ac.nz"
in which case, even if your current job does run, reporting to bugs might help us give you some extra information.
Using DRMAA with the ECS/SGE Grid
This section provides some information and typical commands required to compile and run codes making use of DRMAA.
Background
Some simple source code examples of using DRMAA via the C and Java bindings have been placed below
/vol/grid-solar/sgeusers/admin/DRMAA
as an introduction to users wishing to experiment with DRMAA codes.
The C example sources originally come from a Dr Dobbs Journal article (2004, Frederic Pariente)
http://www.ddj.com/184405932
which can seemingly still be found at:
http://www.drdobbs.com/184405932
though an archived version, less the Flash adverts, of the original article is provided locally:
/vol/grid-solar/sgeusers/admin/DRMAA/DrDobbs-2004-Article/
The Java example sources come from the SGE source code distribution, although a small change is required in order to have the codes work as expected.
C Bindings
The C bindings make use of the header file
/usr/pkg/sge/include/drmaa.h
and the default SGE shared library
/usr/pkg/sge/lib/lx-amd64/libdrmaa.so.1.0
Java Bindings
As you will have read above, these are not currently installed.
The Java bindings make use of a locally-compiled JAR-file and dynamic library
/vol/grid-solar/sgeusers/admin/DRMAA/lx-x86/drmaa.jar
/vol/grid-solar/sgeusers/admin/DRMAA/lx-x86/libdrmaa.so
which required a rebuild from the SGE sources.
Simple, proof of concept example: C Binding
Create a directory ~/DRMAA, change into that directory and copy the example codes provided over:
% cp /vol/grid-solar/sgeusers/admin/DRMAA/DrDobbs-Code/* .
Compile and link the proof of concept source
% gcc -c -I/usr/pkg/sge/include/ ListingOne.c
% gcc -o ListingOne \
-L/usr/pkg/sge/lib/lx-x86/ \
-Wl,-R/usr/pkg/sge/lib/lx-x86/ -ldrmaa ListingOne.o
You did remember to
% need sgegrid
Now we can test that things work
% ./ListingOne
Successfully started the DRMAA library
Spawning an actual job into the SGE: C Binding
The file
drdobbs-shell.c
is a slightly modified version of
ListingTwo.c
which allows one to specify the script to be executed as a command line argument and sets an SGE-native option required to tell SGE to "do the right thing".
Compile the source
% gcc -c -I/usr/pkg/sge/include/ drdobbs-shell.c
% gcc -o drdobbs-shell \
-L/usr/pkg/sge/lib/lx-amd64/ \
-Wl,-R/usr/pkg/sge/lib/lx-amd64/ -ldrmaa \
drdobbs-shell.o
Edit, or otherwise replace, the placeholder username "fred" used within the job submission script to match your username
% mv i_am_alive.sh i_am_alive.sh.orig
% sed -e "s/fred/yourusername/g" i_am_alive.sh.orig > i_am_alive.sh
% chmod u+x i_am_alive.sh
As written, the DRMAA code that will spawn the job will not pay attention to the directory you are in when you use the DRMAA executable to spawn your job script into the SGE, so we run as follows:
% ~/DRMAA/drdobbs-shell ~/DRMAA/i_am_alive.sh
Your job "/u/students/fred/DRMAA/i_am_alive.sh" has been submitted with id 000000
%
after which you should find the log files from the running of your script
in your home directory
% ls -ltr ~
...
drwx------ 2 fred students 512 Oct 6 12:02 DRMAA
-rw-r--r-- 1 fred students 0 Oct 6 12:20 i_am_alive.sh.e000000
-rw-r--r-- 1 fred students 29753 Oct 6 12:20 i_am_alive.sh.o000000
%
Note that, as written, the directive at the top of the job submission
script which requests an initial working directory
#$ -wd /vol/grid-solar/sgeusers/fred
has been ignored and inspection of the output file confirms this:
% cat ~/i_am_alive.sh.o000000
==UNAME==
breaker.msor.vuw.ac.nz
==WHO AM I and GROUPS==
uid=0000(fred) gid=25(students) groups=25(students),1500(c302t1)
students c302t1
==SGE_O_WORKDIR==
/home/rialto1/fred/DRMAA
...
Spawning an actual job into the SGE: Java Binding
As you will have read above, the supporting files for the Java Binding are not currently installed.
The modified version of the SGE-provided
Howto2.java
adds the setting of an SGE-native option required to tell SGE to "do the right thing" and removes the
package
qualification from the code.
Create a directory DRMAA, change into that directory and copy the example codes provided over:
cp /vol/grid-solar/sgeusers/admin/DRMAA/SGE-Code/* .
Make sure you have the "Native" version of Java
% need java2-native
Compile the Java source against the locally-built DRMAA JAR-file
% javac -cp /vol/grid-solar/sgeusers/admin/DRMAA/lx-amd64/drmaa.jar:. Howto2.java
As written, the DRMAA code will look for the hard-coded script it is going to launch (sleeper.sh) as the SGE job in your home directory, so we copy it there and ensure it is executable, before running the DRMAA code.
You did remember to
% need sgegrid
as well though?
Notice that we need to tell Java to use the locally built dynamic library as well, by defining the search path within the Java environment
% cp sleeper.sh ~
% chmod u+x ~/sleeper.sh
% java -Djava.library.path=/vol/grid-solar/sgeusers/admin/DRMAA/lx-amd64 \
-cp /vol/grid-solar/sgeusers/admin/DRMAA/lx-amd64/drmaa.jar:. Howto2
Your job has been submitted with id 000000
%
after which you should find the log files from the running of your script
in your home directory
% ls -ltr ~
...
-rwx------ 1 fred 1746 Sep 30 14:50 sleeper.sh
drwx------ 2 fred 512 Sep 30 15:25 DRMAA
-rw-r--r-- 1 fred 0 Sep 30 15:28 Sleeper.e000000
-rw-r--r-- 1 fred 99 Sep 30 15:28 Sleeper.o000000
% cat ~/Sleeper.o
Here I am. Sleeping now at: Tue Sep 30 15:28:06 NZDT 2014
Now it is: Tue Sep 30 15:28:11 NZDT 2014
%
Note that the job submission has taken notice of the script directive requesting that our job have the name "Sleeper"
#$ -N Sleeper
and that this is reflected in the names of the logfiles.
Caveats
Environment variables now have an SGE_ prefix, not GE_
As noticed by many, but formally pointed out by Kourosh Neshatian (and, belatedly, Lloyd Parkes):
The Sun documentation and man pages for the Sun Grid Engine (SGE) mention environment variables of the form
GE_SOME_THING
The docs are out of date with respect to current SGE implementations and users should be using environment variables of the form
SGE_SOME_THING
Jobs in Error states: unable to chdir
We have seen occurrences of jobs being unable to start, and thus entering an Error state (ie, showing as
Eqw
when the user does a qstat).
If, in response to a
qstat -explain E -j 12345
where 12345 is the job number, you are told that the error reason was an inability to change to a directory that you know to be there, then you may simply have been a victim of networking congestion on the fileserver at the time the job tried to start.
In this case, you should be able to clear the error condition by carrying out these steps
- ssh greta-pt
- need sgegrid
- qmod --clearjob 12345
If the job doesn't start then you should send an email to jobs@ecs.vuw.ac.nz giving us as much detail as possible.