Grid Testing

Documentation of some grid computing experiments
Introduction

The inspiral code has a simple execution algorithm that should make it fairly easy to farm execution out to the grid using the Condor standalone checkpoint mechanism. The structure of the code is:
Most of the execution time is spent in step 4, during which there is no disk I/O. All open files have been closed, and the output files are not opened until all the data has been filtered. The ratio of FFT to I/O can be made (almost) arbitrarily large with little impact on memory by increasing the number of templates.

The idea is to use Condor's checkpoint mechanism to farm the CPU-intensive part out to the grid by creating a checkpoint file after the data has been read in at UWM and downsampled. The executable and checkpoint image can be sent out onto the grid using the Condor Globus universe. All the data that is needed is contained in the checkpoint file, so the remote cluster does not need access to the LIGO data. The code performs the filtering and writes the trigger file on the remote file system, from which it can be fetched back to UWM for post-processing.

Initial testing and problems found

I did a quick test of this principle using the Medusa cluster at UWM and the Pleiades cluster at Penn State University. Medusa has all the LIGO data needed for the search and runs a full installation of Condor. This is where I usually run the search, using Condor DAGMan to manage execution of the pipeline. The Pleiades cluster runs the PBS batch system and allows job submission via a Globus job manager, so jobs can be submitted using the Condor Globus universe (see the Pleiades job submission documentation for details).

I modified the inspiral code to call the functions init_image_with_file_name() and then ckpt_and_exit() from the Condor checkpoint library after all input files were closed. I ran this at UWM and then sent the executable and the checkpoint file off to PSU to be restarted with -_condor_restart. The problem I found was that the Condor checkpoint mechanism expects to be able to chdir() back to the directory that it was started in.
At UWM the code ran in the directory /home/duncan/projects/condor/psu/checkpoint, but at PSU it runs in a directory created by the Globus job manager, e.g. /usr3/home/ux001010/some_gram_scratch_dir. This means that the checkpointed code fails to restart with the message

Condor: Notice: Will restart from L1-INSPIRAL-733022064-2048.ckpt
Condor: Error: Couldn't move to '/home/duncan/projects/condor/psu/checkpoint' (No such file or directory). Please fix it.
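The condition behind this failure can be checked on the remote site before attempting a restart. The following is a minimal sketch, not part of Condor or the inspiral code: the helper name can_restart_in is hypothetical, and it only tests whether the directory recorded at checkpoint time can be entered on the current machine.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not a Condor tool): succeed only if the
# directory the job was checkpointed in can be entered on this machine.
can_restart_in() {
    # cd runs in a subshell, so the caller's working directory is unchanged.
    ( cd "$1" ) 2>/dev/null
}

# Directory the job ran in at UWM when the checkpoint was written.
start_dir=/home/duncan/projects/condor/psu/checkpoint

if can_restart_in "$start_dir"; then
    echo "restart possible: $start_dir exists here"
else
    echo "restart would fail: cannot chdir to $start_dir" >&2
fi
```

Run on the PSU side, a check like this would report the problem before the restart is attempted rather than after the job has been queued and started.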
Workaround and a successful example as proof of principle

I notified the Condor team of this, and they suggested using a common directory on both sites (e.g. /tmp). I set the initialdir=/tmp option in the Condor submit files at both sites. The inspiral code assumes that its output files should be written in the working directory of the code, so I had to modify the code to allow a full path to be specified for the checkpoint file and the output trigger file. This isn't a very "grid" solution, as it means I need to know the directories at both sites, but it's a workaround. I wrote a small DAG to execute the search which did the following:
This all worked successfully (the actual files are shown below) and the triggers were retrieved back to UWM.
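Because both jobs run in /tmp, every file has to be named by its full path at the site where it lives. As an illustration only, the sketch below builds those paths from a directory and the pieces of the file name. The helper product_path is hypothetical, and reading the two numbers in the file name as a GPS start time and a duration is an assumption based on the names used in this test.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the inspiral code.
# $1 directory, $2 interferometer, $3 GPS start, $4 duration, $5 extension
product_path() {
    # ${1%/} drops one trailing slash so the result has a single separator.
    printf '%s/%s-INSPIRAL-%s-%s.%s\n' "${1%/}" "$2" "$3" "$4" "$5"
}

# Checkpoint file written at UWM, trigger file written at PSU.
ckpt=$(product_path /home/duncan/projects/condor/psu/checkpoint L1 733022064 2048 ckpt)
xml=$(product_path /usr3/home/ux001010 L1 733022064 2048 xml)

echo "checkpoint: $ckpt"
echo "triggers:   $xml"
```

Keeping the directory separate from the file name means only the two directory arguments change if the test is moved to another pair of sites.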
Suggestions for future improvements

It would be really nice to have an option in the standalone checkpointing mechanism so the restarted code doesn't care what directory it is running in, as long as there are no open files when the code is checkpointed. This would mean that I wouldn't have to know anything about the file system on the remote site, and Condor could take care of fetching the result file and cleaning it up afterwards (since it would be written in the local output directory).

Condor DAG and submit files used with output

The simple DAG file to do the above was:

job localinspiral inspiral-l.sub
vars localinspiral lcldir="/home/duncan/projects/condor/psu/checkpoint" outdir="/usr3/home/ux001010"
script post localinspiral test.sh L1-INSPIRAL-733022064-2048.ckpt
job gridinspiral inspiral-g.sub
script post gridinspiral getoutput.sh /usr3/home/ux001010/L1-INSPIRAL-733022064-2048.xml /home/duncan/projects/condor/psu/checkpoint/L1-INSPIRAL-733022064-2048.xml
parent localinspiral child gridinspiral

This first ran the local inspiral code to get the data, template bank and calibration, and then checkpoint. The submit file inspiral-l.sub contains:

universe = vanilla
initialdir = /tmp
executable = $(lcldir)/lalapps_inspiral
arguments = --verbose --data-checkpoint \
  --checkpoint-path $(lcldir) \
  --output-path $(outdir) \
  --bank-file $(lcldir)/L1-TMPLTBANK-733022064-2048.xml \
  --frame-cache $(lcldir)/cache/L-733022056-733027646.cache \
  --calibration-cache $(lcldir)/cache_files/L1-CAL-V03-729273600-734367600.cache \
  --enable-output
output = $(lcldir)/inspiral-l.out
error = $(lcldir)/inspiral-l.err
log = /home/duncan/projects/condor/psu/checkpoint/inspiral.log
queue
I have deleted some of the options for brevity. This job then wrote a checkpoint file called L1-INSPIRAL-733022064-2048.ckpt. The post script test.sh

#!/bin/sh
test -f "${1}"

makes sure that the checkpoint file exists and prevents DAGMan from stopping at the abnormal exit from the inspiral code. Next everything is shipped off to PSU and executed with:

# run an inspiral job from a checkpointed image at psu
executable = lalapps_inspiral
transfer_executable = true
arguments = '-_condor_restart' 'L1-INSPIRAL-733022064-2048.ckpt'
environment = KMP_LIBRARY=serial;MKL_SERIAL=yes
globusscheduler = ligo-grid.aset.psu.edu/jobmanager-pbs
universe = globus
globusrsl = (job_type=single)(queue=lsc)
transfer_input_files = L1-INSPIRAL-733022064-2048.ckpt
stream_output = false
stream_error = false
when_to_transfer_output = on_exit
output = inspiral-g.out
error = inspiral-g.err
log = /home/duncan/projects/condor/psu/checkpoint/inspiral.log
queue

and finally the getoutput.sh post script fetches the output back to UWM and deletes the output file at PSU:

#!/bin/sh
globus-url-copy "gsiftp://ligo-grid.aset.psu.edu/${1}" "file:${2}" &&
globusrun -r ligo-grid.aset.psu.edu \
  "&(executable=/usr/bin/env)(arguments=rm -f ${1})" ||
exit 1
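A variant of the fetch-and-delete step with argument checking and a dry-run mode can make the post script easier to test without touching the remote site. This is a sketch under stated assumptions: the function names run and fetch_and_clean and the variables DRY_RUN and GRID_HOST are illustrative and do not appear in the original script. The globus-url-copy and globusrun invocations are the same ones used above.

```shell
#!/bin/sh
# Sketch of a getoutput.sh variant. DRY_RUN, GRID_HOST, run and
# fetch_and_clean are illustrative names, not part of the original script.
GRID_HOST=${GRID_HOST:-ligo-grid.aset.psu.edu}

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1 = path of the trigger file at the remote site, $2 = local destination.
fetch_and_clean() {
    if [ $# -ne 2 ]; then
        echo "usage: fetch_and_clean REMOTE_PATH LOCAL_PATH" >&2
        return 2
    fi
    # Copy the trigger file back first; delete the remote copy only on success.
    run globus-url-copy "gsiftp://${GRID_HOST}/$1" "file:$2" &&
    run globusrun -r "$GRID_HOST" \
        "&(executable=/usr/bin/env)(arguments=rm -f $1)"
}
```

Calling fetch_and_clean with DRY_RUN=1 prints the two commands instead of running them, which is a cheap way to check the paths passed in from the DAG file before submitting.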
Condor log file

000 (12684.000.000) 04/06 00:46:20 Job submitted from host: <129.89.57.47:47555>
DAG Node: localinspiral
...
001 (12684.000.000) 04/06 00:46:42 Job executing on host: <129.89.57.47:47554>
...
006 (12684.000.000) 04/06 00:46:50 Image size of job updated: 140100
...
005 (12684.000.000) 04/06 00:48:14 Job terminated.
(0) Abnormal termination (signal 12)
(0) No core file
Usr 0 00:01:03, Sys 0 00:00:02 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:01:03, Sys 0 00:00:02 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
016 (12684.000.000) 04/06 00:48:15 POST Script terminated.
(1) Normal termination (return value 0)
...
000 (12685.000.000) 04/06 00:48:26 Job submitted from host: <129.89.57.47:47555>
DAG Node: gridinspiral
...
017 (12685.000.000) 04/06 00:48:39 Job submitted to Globus
RM-Contact: ligo-grid.aset.psu.edu/jobmanager-pbs
JM-Contact: https://ligo-grid.aset.psu.edu:43283/2008/1081230511/
Can-Restart-JM: 1
...
001 (12685.000.000) 04/06 00:51:10 Job executing on host: ligo-grid.aset.psu.edu
...
005 (12685.000.000) 04/06 00:52:25 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
016 (12685.000.000) 04/06 00:52:35 POST Script terminated.
(1) Normal termination (return value 0)
...
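The log shows the local job ending with an abnormal termination (signal 12) when it checkpoints and exits, and the grid job ending normally. A short awk filter can pull just those outcomes out of a user log. This is a minimal sketch: the function name summarize_log is hypothetical, and it assumes the status line immediately follows each 005 "Job terminated" event, as it does in the log above.

```shell
#!/bin/sh
# Hypothetical helper: print one line per terminated job, giving the job id
# and the termination status taken from the line after each 005 event.
summarize_log() {
    awk '
        /^005 \(/ {
            job = $2                          # e.g. (12684.000.000)
            getline                           # status line follows the event
            sub(/^[ \t]*\([01]\) /, "")       # drop the leading (0) or (1) flag
            print job, $0
        }
    ' "$@"
}
```

For example, `summarize_log /home/duncan/projects/condor/psu/checkpoint/inspiral.log` would list the termination status of the localinspiral and gridinspiral jobs while skipping the POST script events, which use event code 016.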