Technical Note - Grid Computing: SCS Grid
The information in this technical note is obsolete, as the SCS Grid is no longer operational.
Summary
There doesn't seem to be any ECS-focused documentation aimed at users wishing to run jobs on the SCS Grid, so here are the basics.
A more complete description can be found in the provider's documentation.
Details
General
ECS administers two "Grids", known in administration circles as the ECS Grid and the SCS Grid.
You may also find these referred to as the SGE Grid and the Condor Grid respectively.
The two are separate.
The Tech Note for the ECS Grid is at:
http://ecs.victoria.ac.nz/Support/TechNoteEcsGrid
The SCS Grid runs under the control of a Condor instance and exists to make use of the computing power of the University's ITS-provided student computing service (SCS) lab machines at times when they are unused, i.e. when they should have no one logged in at the console.
Jobs, basically text files describing the tasks to be carried out and the files needed to do so, are submitted from the Condor master machine (users will thus need to obtain a login account for this machine with their ITS username) into the Condor queuing system, where they remain, in turn, until an "unused" machine is able to start them.
(Note that being able to start them is not the same as being able to run them to completion.)
At present, users at the console of a machine have priority over Grid jobs running on the same machine, to the extent that a Grid job will be suspended on a machine where there is console activity, so users submitting Grid jobs should be aware that there is no guaranteed run time for any given task.
Basically, it'll finish when it finishes.
There is a low-volume mailing list which is used to inform users of the SCS/Condor Grid of matters which may be of interest to them. You can subscribe your School or University email address here:
http://ecs.victoria.ac.nz/mailman/listinfo.cgi/scs-grid-users
The Condor Master
The ITS machine from which all Condor activities are carried out is vuwunicondgrd01.vuw.ac.nz.
As of November 2010, the machine may be accessed using the name condor01.vuw.ac.nz.
The Condor Master runs Red Hat Enterprise Linux (RHEL) and ssh access is via port 10 (this is an ITS standard, apparently).
If your username on the Condor Master does not match your ECS username, you will need to type:
ssh vuwunicondgrd01.vuw.ac.nz -p 10 -l username
to log on; otherwise, with matching usernames, a simple
ssh vuwunicondgrd01.vuw.ac.nz -p 10
will suffice.
Your home directory on the Condor Master will be of the form
/home/username
You can use scp to move files between your ECS filestore and the Condor Master, e.g.
scp -P 10 localfile username@vuwunicondgrd01.vuw.ac.nz:/path/to/remotedir/.
scp -P 10 username@vuwunicondgrd01.vuw.ac.nz:file_in_homedir.txt /path/to/local/dir/.
It is possible to use rsync to maintain a level of synchronisation between directories within the ECS filestore and on the Condor Master, e.g.
% hostname
some-machine.ecs.vuw.ac.nz
% cd ~/top/of/my/local/condorgrid/directories
% rsync -avi --rsh="ssh -p 10" vuwunicondgrd01.vuw.ac.nz:/home/username/remotedir/ ./remotedir/
Note that you need to tell rsync to talk ssh over port 10 via the --rsh option.
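Going the other way, pushing a local directory up to the Condor Master, would look something like this (the directory names are purely illustrative):
% rsync -avi --rsh="ssh -p 10" ./remotedir/ vuwunicondgrd01.vuw.ac.nz:/home/username/remotedir/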
Setting up the environment
The machines ITS provide for the SCS run a Windows operating system and are effectively installed using a single OS-plus-packages image rolled out from the centre.
ECS users will thus find that only a limited range of native software packages is accessible by default, these having been placed onto the SCS machines at the request of various academics across the University for use by students in teaching labs or self-study.
ITS does not seem to have a list of the user-accessible packages on the SCS available at the time of writing.
Where Grid jobs merely require such packages to be operated in a batch mode, such jobs can obviously make use of those packages.
In general, you will need to use the full path to the executables you wish to invoke, e.g.
c:\"program files"\r\r-2.6.0\bin\r.exe --no-save < myscript.r
With software packages in use within ECS but not available on the SCS, ITS may consent to installing the Windows version of that package, if available, within the image they roll out to the SCS machines.
Users new to the SCS machines should be aware that ITS are, understandably, reluctant to alter
their images once a teaching trimester is underway and there may well be issues in having multiple
versions of the same package installed at the same time. Planning ahead so as to liaise with those
who have requested packages be installed on the SCS is thus a good thing.
Users wishing to make use of the service to run bespoke codes or programs which normally operate within an ECS, or other, UNIX environment have two approaches:
- They can recompile sources, or otherwise ensure that any binaries run, against the Cygwin emulation software and then upload a matching Cygwin DLL as part of the job submission payload (see the sketch after this list). The ITS machines do not provide, and hence do not constrain the user to, any particular Cygwin DLL version.
- They can recompile sources using a Windows compiler to produce native binaries, though care should be taken to ensure that any run-time dependencies will be resolved when running on an SCS machine. The ITS machines do provide access to one Windows compiler suite.
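As a sketch of the Cygwin route (the source file name is purely illustrative, and this assumes a machine with Cygwin's gcc installed), a binary such as the hworld.exe used later in this note might be built and checked like so:
$ gcc -o hworld.exe hworld.c
$ cygcheck ./hworld.exe
cygcheck lists the DLLs the binary depends on; the cygwin1.dll it reports is the one to upload alongside the executable.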
Do I have a home on the Grid?
Not a permanent home, no: it's more like a rented bach, with the Condor Master as your trailer of stuff from home.
Basically, Condor would normally upload all the files needed for a task into a temporary directory, %_CONDOR_SCRATCH_DIR%, on each remote machine from a user's directory on the Condor Master.
At the end of the job execution, Condor can download files back to filestore areas on the Condor Master.
This means that a user might first need to copy files from their ECS filestore onto the Condor Master so as to then make them accessible to the Condor Grid, and copy files off the Condor Master back into their ECS filestore for post-processing.
To continue with the bach/trailer metaphor, you pack it before you go, unpack when you arrive, pack it again before you leave and unpack when you get back home.
Files in the temporary directory on machines the jobs were executed on
are not currently preserved between individual job executions.
(So someone tidies up the bach after you have been there, too).
It may be possible to have ITS load large static data sets onto the SCS image so that
such data appears at a constant path on any SCS machine, however, as mentioned
above, ITS are, understandably, reluctant to alter the image once a teaching trimester
is underway.
Because this is a distributed batch processing environment, there's usually no clear
indication as to which machine(s) your job(s) will end up running on.
You thus need to give a little more thought to the location of input and output files than
if you were simply running a job on your own workstation where everything is local to
the machine.
Within the SCS Grid, Condor can stage files and directories to and from the master
machine.
Preserving results after execution
In order to get your output back to somewhere more useful to you, you need to tell the
task that it should copy the files back. Condor will then, by default, copy back any altered
or newly created files from the remote execution node.
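In the submission scripts shown later in this note, that is done with the following pair of directives (reproduced here in isolation):
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT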
Cleaning up
With the temporary area allocated to your job being automatically removed once the
job has completed, there is no explicit cleaning required for simple job submission.
Where do stdin, stdout and stderr appear?
When one runs programs locally, program output and error messages will often appear on the console, or in the terminal emulator, and one can usually perform command-line redirection for input.
When you are running a non-interactive job on a remote machine, however, it is likely that you aren't going to see any console output during the execution of the program.
Condor can therefore be instructed to redirect the stdout and stderr channels to file so that
they may be inspected after the job has finished.
The submission script has directives that allow the user to name files for the redirection.
If you do not specify filenames then you implicitly get /dev/null, or the platform equivalent.
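For example (the filenames here are illustrative; the input directive names a file whose contents are fed to the program on stdin):
input = myinput.txt
output = myoutput.txt
error = myerror.txt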
In order to create unique filenames for each task submitted, use can be made of the fact that Condor provides for a number of variables to be expanded within the submission script.
Condor allows one to submit multiple jobs (tasks) within a single submission. The submission is referred to as a Cluster and, within each cluster, individual instances of the job are known as a Process, numbered starting from 0.
A basic script
Within a directory containing
cygwin1.dll hworld.cmd hworld.exe
with hworld.cmd being a very basic script (11 lines) like this:
universe = vanilla
environment = path=c:\WINDOWS\SYSTEM32
executable = hworld.exe
TransferInputFiles = cygwin1.dll
output = hworld.out.$(Cluster).$(Process)
error = hworld.err.$(Cluster).$(Process)
log = hworld.log.$(Cluster).$(Process)
Requirements = (OpSys == "WINNT51") && (Arch == "INTEL")
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 1
one can submit a job by typing
condor_submit hworld.cmd
If the job submission returns something like this
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1234.
then you would expect to see the following files at the end of the job,
along with any files created on the remote execution node(s).
hworld.err.1234.0
hworld.out.1234.0
hworld.log.1234.0
Emailed output
By default, Condor will send an email message to the users mailbox on the Condor Master,
on completion of the job.
This is an automated email from the Condor system
on machine "vuwunicondgrd01.vuw.ac.nz". Do not reply.
Condor job 1234.0
/home/username/hworld/hworld.exe
has exited normally with status 12
Submitted at: Mon Jul 13 12:37:48 2009
Completed at: Mon Jul 13 12:38:30 2009
Real Time: 0 00:00:42
Virtual Image Size: 0 Kilobytes
Statistics from last run:
Allocation/Run time: 0 00:00:02
Remote User CPU Time: 0 00:00:00
Remote System CPU Time: 0 00:00:00
Total Remote CPU Time: 0 00:00:00
Statistics totaled from all runs:
Allocation/Run time: 0 00:00:02
Network:
1.8 MB Run Bytes Received By Job
12.0 B Run Bytes Sent By Job
1.8 MB Total Bytes Received By Job
12.0 B Total Bytes Sent By Job
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: condor-admin@ecs.vuw.ac.nz
The Official Condor Homepage is http://www.cs.wisc.edu/condor
Various attributes of the notification process can be altered within the
job submission script.
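For instance, the standard Condor submit directives notification and notify_user control when mail is sent and where it goes; a sketch (the address is illustrative) might be:
notification = Error
notify_user = username@ecs.vuw.ac.nz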
Running Java programs
There have been a couple of people running these of late, so here's some basic info.
Let's assume that you will be running two instances of a Java program called myprog.jar which, when run on your own machine, reads from a datafile mydata.txt and produces an output file called myoutput.txt.
The job submission script, which will be called myprog.cmd, might be expected to look something like this:
universe = vanilla
environment = path=c:\WINDOWS\SYSTEM32
executable = myprog.bat
TransferInputFiles = myprog.jar mydata.txt
output = hworld.out.$(Cluster).$(Process)
error = hworld.err.$(Cluster).$(Process)
log = hworld.log.$(Cluster).$(Process)
Requirements = (OpSys == "WINNT51") && (Arch == "INTEL")
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
queue 2
Note, in comparison with the basic script, that rather than running the Java program directly, our executable is a DOS batch (.bat) file in which we will invoke Java to run our JAR file, with a command such as
java -jar myprog.jar
and that we copy both the JAR and our data file over using TransferInputFiles.
However, when we run the two instances of the program, both instances will, if the program runs, produce output files named myoutput.txt.
This represents a problem.
Condor will automatically fetch back, to the directory on the master from where we submitted the
job, all files that have changed on each execution host, which means it will overwrite the output
file.
What's needed here is a mechanism for ensuring that the files that get created are named uniquely.
In the same way that we can use the Condor information about the Cluster and Process to differentiate the stdout, stderr and log files, we can use it to solve the file name issue by making those values available as environment variables in the DOS environment.
This is done by modifying the environment value in the submission script so that it looks like this (note the quotes):
environment = "path=c:\WINDOWS\SYSTEM32 CONDOR_CLUSTER=$(Cluster) CONDOR_PROCESS=$(Process)"
Within our DOS batch file, we can access the DOS environment variables using the standard %VARNAME% syntax.
This allows us to either alter our program so that it accepts the Condor information as arguments and alters the output filenames internally, or simply rename the output file as part of the set of commands in the DOS batch file that will be run.
The latter approach works because Condor transfers the files after the executable has ended, which means after all the commands in the DOS batch file have run.
We might thus have a DOS batch file that looks like this:
java -jar myprog.jar
ren myoutput.txt myoutput.%CONDOR_CLUSTER%.%CONDOR_PROCESS%
Note: the DOS command ren is the equivalent of the UNIX mv.
Similarly, if we wish to have the program changing its operation based on those values, we might have a DOS batch file looking like this:
java -jar myprog.jar %CONDOR_CLUSTER% %CONDOR_PROCESS%
Note that the latter approach, making the executable aware of its place in the set of jobs, can be useful as a mechanism, e.g., for invoking different behaviours or defining initial conditions, such as seeding random number generation.
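As a sketch of that idea (the class name and argument handling here are purely illustrative, not taken from any existing program), the Java side might look something like this:
// Illustrative only: read the cluster and process numbers passed in by the
// DOS batch file and use the process number to seed the random number
// generator, so each instance in the cluster follows a different sequence.
import java.util.Random;

public class MyProg {
    public static void main(String[] args) {
        int cluster = Integer.parseInt(args[0]);  // %CONDOR_CLUSTER%
        int process = Integer.parseInt(args[1]);  // %CONDOR_PROCESS%
        Random rng = new Random(process);         // per-instance seed
        System.out.println("Instance " + cluster + "." + process
                + " first random value: " + rng.nextDouble());
    }
}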
One can submit a job by typing
condor_submit myprog.cmd
If the job submission returns something like this
Submitting job(s).
Logging submit event(s).
2 job(s) submitted to cluster 1234.
then you would expect to see the following files at the end of the job,
along with any files created on the remote execution node(s).
hworld.err.1234.0
hworld.err.1234.1
hworld.out.1234.0
hworld.out.1234.1
hworld.log.1234.0
hworld.log.1234.1
myoutput.1234.0
myoutput.1234.1
Java versions
As of Feb 2011, there were four versions of Java available on the SCS machines,
with one machine examined showing:
10/09/2008 08:34 a.m. <DIR> jdk1.5.0_09
10/09/2008 08:22 a.m. <DIR> jre1.5.0_06
10/09/2008 08:23 a.m. <DIR> jre1.6.0_07
09/03/2010 05:10 p.m. <DIR> jre6
The default version, which I am told is the one without any version information visible, is 1.6.0_18.
ECS machines had access to OpenJDK 7 at this time.
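If you want to know which Java a particular execution node will actually give you, one option (a sketch, not taken from an existing job) is to have your DOS batch file record it; java -version writes to stderr, so the information will end up in the file named by the error directive, e.g.
java -version
java -jar myprog.jar %CONDOR_CLUSTER% %CONDOR_PROCESS%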
Update May 2015
Someone in SEF who was looking to use Java on the SCS/Condor Grid was informed by ITS that their lab machines no longer had a deployment of Java. Despite this assertion, it was possible to find and run against the following Java versions (minimum number of machines with each version listed):
java version "1.6.0_17" 1
java version "1.6.0_18" 32
java version "1.6.0_24" 3
java version "1.6.0_25" 1
java version "1.7.0_07" 9
java version "1.7.0_51" 10
java version "1.8.0_25" 1
Given the ITS assertion, it is probably best not to plan on using Java on the SCS/Condor Grid.
Using Java to ease data transfer
Because of the way that the remote shell is set up when you run jobs on the SCS/Condor Grid, you may not have access to common utilities that you would expect to have in an interactive session.
Su Nguyen, from SECS, was looking to transfer a complete directory structure over but was unable to access the local machine's unzip utility, so he wrote some Java code that allowed him to do the packing and unpacking as part of his SCS/Condor Grid job.
Su says that his program is invoked as follows
java -jar ZIPJ.jar zip <dir_name> <zip_name>
java -jar ZIPJ.jar unzip <zip_name> <dir_name>
and the Java source is attached to this page.
You may need to rename the file in order to create the JAR-file.
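If you are rebuilding the JAR yourself, something like the following should work; this is a sketch only, and it assumes the renamed source file is ZIPJ.java and its main class is ZIPJ, neither of which has been checked against the attached code:
$ javac ZIPJ.java
$ jar cfe ZIPJ.jar ZIPJ ZIPJ*.class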
Delayed starting of SCS/Condor Grid jobs
This facility has been useful to some users of the SCS/Condor Grid so here it is.
OK, here's how to make the jobs start at, say, 1am, when you think you might get six or seven hours of uninterrupted computing on each machine you can get your hands on overnight.
It's a distillation of Section 2.12.1 (Time Scheduling for Job Execution) in the Condor Manual
http://www.cs.wisc.edu/condor/manual/
Choose the Stable Release, Online HTML to view the relevant stuff.
Basically, you need to add three lines to the submission file, e.g.
deferral_time = 1291291200
deferral_window = 18000
deferral_prep_time = 120
Those values are in seconds.
The reason the deferral_time is so large is that it's the number of seconds since midnight GMT on Jan 1, 1970.
What the above says is:
start my job (or jobs, if you queue more than one per submission file) at exactly 2010-12-03 01:00:00, or within a window of 5 hours afterwards; however, don't try and grab any resources until 2 minutes before the time I'd like it to start.
Note that you need to be specific about the date, not just the time of day, to remove possible confusion.
When you submit the job(s) they'll appear as "Idle" in the queue up until 2 minutes before the proposed start time when, assuming there are resources available, they'll start to grab resources.
The reason for having the 2 minutes is that without it, you might grab the resources straight away, i.e. hog the machine with nothing else running on it until the start time. This potentially wastes resources and defeats the object of the grid.
The reason for having the window is to allow your job to start if it can't get resources at the exact time specified.
Whilst I am sure you will be able to do the math each time you need to calculate the number of seconds since Jan 1, 1970, on most UNIX boxes you should be able to do the following
$ date -d "2010-12-03 01:00:00" +%s
and get back
1291291200
The Condor manual suggests you'd type
$ date --date "12/03/2010 01:00:00" +%s
with an American date style, though personally I get confused.
I am not sure how you would achieve that from a Windows box, but you have access to the Condor Master so you could do it there.