
Grid Testing

 

Documentation of some grid computing experiments

 

Introduction

The inspiral code has a simple algorithm for execution that should make it fairly easy to farm execution out to the grid using the Condor standalone checkpoint mechanism. The structure of the code is:

  1. Read in some amount of interferometer data from files at 16,384 Hz.
  2. Resample the data to the requested rate (typically 2048 Hz).
  3. Read in a calibration and template bank file.
  4. Execute many FFTs on the data for each template and generate inspiral triggers in memory.
  5. Write out the triggers to disk and exit.
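
The effect of step 2 on the amount of data held in memory can be illustrated with a little arithmetic. This is only a sketch: the function names are made up for illustration, and the 2048-second stretch used in the usage note is a hypothetical value, not something read from the inspiral code.

```shell
#!/bin/sh
# Illustrative helpers only; these names are not part of the inspiral code.

# Integer decimation factor between the input and the requested sample rate.
decimation_factor() {
    # $1 = input rate in Hz, $2 = requested rate in Hz
    echo $(( $1 / $2 ))
}

# Number of samples in a stretch of data at a given sample rate.
sample_count() {
    # $1 = duration in seconds (hypothetical), $2 = sample rate in Hz
    echo $(( $1 * $2 ))
}
```

For example, `decimation_factor 16384 2048` prints 8, and for a hypothetical 2048-second stretch `sample_count 2048 16384` prints 33554432 while `sample_count 2048 2048` prints 4194304, so far fewer samples remain in memory once the data has been resampled.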

Most of the execution time is spent doing step 4 during which there is no disk I/O. All open files have been closed and the output files are not opened until all the data has been filtered. The ratio of FFT to I/O can be made (almost) arbitrarily large with little impact on memory by increasing the number of templates.

The idea is to use Condor's checkpoint mechanism to farm the CPU-intensive part out to the grid by creating a checkpoint file after the data has been read in at UWM and downsampled. The executable and checkpoint image can then be sent out onto the grid using the Condor Globus universe. All the data that is needed is contained in the checkpoint file, so the remote cluster does not need access to the LIGO data. The code performs the filtering and writes the trigger file to the remote file system, from where it can be fetched back to UWM for post-processing.

Initial testing and problems found

I did a quick test of this principle using the Medusa cluster at UWM and the Pleiades cluster at Penn State University. Medusa has all the LIGO data needed for the search and runs a full installation of Condor. This is where I usually run the search, using Condor DAGMan to manage execution of the pipeline. The Pleiades cluster runs the PBS batch system and allows job submission via a Globus job manager, so jobs can be submitted using the Condor Globus universe. (See the Pleiades job submission documentation for details.)

I modified the inspiral code to call the functions init_image_with_file_name() and then ckpt_and_exit() from the Condor checkpoint library after all input files were closed. I ran this at UWM and then sent the executable and the checkpoint file off to PSU to be restarted with -_condor_restart.

The problem that I found was that the Condor checkpoint mechanism expects to be able to chdir() back to the directory that it was started in. At UWM the code ran in the directory /home/duncan/projects/condor/psu/checkpoint, but at PSU it runs in a directory created by the Globus job manager, e.g. /usr3/home/ux001010/some_gram_scratch_dir. This means that the checkpointed code fails to restart, with the message:

Condor: Notice: Will restart from L1-INSPIRAL-733022064-2048.ckpt
Condor: Error: Couldn't move to '/home/duncan/projects/condor/psu/checkpoint'
(No such file or directory).  Please fix it.
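
The failure can be anticipated with a pre-flight check on the remote host. The following is a minimal sketch, not part of Condor: the function names and the messages they print are hypothetical, and the check only tests whether the directory recorded at checkpoint time can be entered.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not a Condor tool).

# Succeed if the given directory can be entered on this host.
can_return_to() {
    # cd runs in a subshell, so the caller's working directory is untouched.
    ( cd "$1" ) 2>/dev/null
}

# Print a one-line verdict and return non-zero if the directory is unusable.
check_restart_dir() {
    if can_return_to "$1"; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}
```

Running `check_restart_dir /home/duncan/projects/condor/psu/checkpoint` on a host without that directory prints a `missing:` line and returns a non-zero status, which is the situation the restarted job runs into.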

Workaround and a successful example as proof of principle

I notified the Condor team of this and they suggested using a common directory on both sites (e.g. /tmp). I set the initialdir=/tmp option in the Condor submit files at both sites. The inspiral code assumes that all output files should be written in the working directory of the code, so I had to modify the code to allow a full path to be specified for the checkpoint file and the output trigger file. This isn't a very "grid" solution, as it means I need to know the directories at both sites, but it's a workaround.
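
Since the workaround depends on the same directory being usable at both sites, it can help to test that directory on each host before submitting. A minimal sketch follows; `dir_usable` is a hypothetical helper name, and the check assumes that "usable" means the directory exists and can be entered and written to by the current user.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the directory exists and the current
# user can both enter it and create files in it.
dir_usable() {
    [ -d "$1" ] && [ -x "$1" ] && [ -w "$1" ]
}
```

For instance, `dir_usable /tmp || echo "/tmp is not usable here"` could be run by hand at each site before the DAG is submitted.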

I wrote a small DAG to execute the search which did the following:

  1. Run the inspiral code locally at UWM. The code is run in the vanilla universe so Condor doesn't interpret the checkpoint as a standard universe eviction.
  2. After the inspiral code calls ckpt_and_exit(), Condor reports an abnormal termination with signal 12, so I ran a post script that checks for the existence of the checkpoint file. If it's there, the post script returns success so the DAG can continue.
  3. Submit a job to the globus universe that schedules the job at PSU. The executable and checkpoint file are staged as input and the job runs.
  4. The job writes its output file to a directory that I have specified at PSU. Since this isn't in the GRAM scratch directory, Condor can't retrieve it for me, so I have to run a post script to fetch it (using globus-url-copy) and then delete the output file by running rm through globusrun.

This all worked successfully (the actual files are shown below) and the triggers were retrieved back to UWM.

Suggestions for future improvements

It would be really nice to have an option in the standalone checkpointing mechanism so the restarted code doesn't care what directory it is running in as long as there are no open files when the code is checkpointed. This would mean that I wouldn't have to know anything about the file system on the remote site and Condor could take care of fetching the result file and cleaning it up afterwards (since it would be written in the local output directory).

Condor DAG and submit files used with output

The simple DAG file to do the above was:

job localinspiral inspiral-l.sub
vars localinspiral lcldir="/home/duncan/projects/condor/psu/checkpoint" outdir="/usr3/home/ux001010"
script post localinspiral test.sh L1-INSPIRAL-733022064-2048.ckpt
job gridinspiral inspiral-g.sub
script post gridinspiral getoutput.sh /usr3/home/ux001010/L1-INSPIRAL-733022064-2048.xml /home/duncan/projects/condor/psu/checkpoint/L1-INSPIRAL-733022064-2048.xml
parent localinspiral child gridinspiral
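
The same base name appears several times in the DAG (checkpoint file, remote output, local output), so it is easy for the copies to drift apart. One way to keep them consistent is to build every name from the same fields. This is only a sketch: `product_name` is a hypothetical helper, and the field layout is inferred from the file names on this page rather than taken from the inspiral code.

```shell
#!/bin/sh
# Hypothetical helper: join the dash-separated fields seen in the file names
# on this page and append an extension.
product_name() {
    # $1 = interferometer, $2 = description, $3 and $4 = the two numeric
    # fields copied from the existing names, $5 = extension
    printf '%s-%s-%s-%s.%s\n' "$1" "$2" "$3" "$4" "$5"
}
```

With this, `product_name L1 INSPIRAL 733022064 2048 ckpt` and `product_name L1 INSPIRAL 733022064 2048 xml` give the checkpoint and trigger file names used in the DAG above.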

This first runs the inspiral code locally to read the data, template bank and calibration, and then checkpoint. The submit file inspiral-l.sub contains:

universe = vanilla
initialdir = /tmp
executable = $(lcldir)/lalapps_inspiral
arguments = --verbose --data-checkpoint \
  --checkpoint-path $(lcldir) \
  --output-path $(outdir) \
  --bank-file $(lcldir)/L1-TMPLTBANK-733022064-2048.xml \
  --frame-cache $(lcldir)/cache/L-733022056-733027646.cache \
  --calibration-cache \
    $(lcldir)/cache_files/L1-CAL-V03-729273600-734367600.cache \
  --enable-output
output = $(lcldir)/inspiral-l.out
error = $(lcldir)/inspiral-l.err
log = /home/duncan/projects/condor/psu/checkpoint/inspiral.log
queue

I have deleted some of the options for brevity.

This job then wrote a checkpoint file called L1-INSPIRAL-733022064-2048.ckpt. The post script test.sh

#!/bin/sh
# Exit status is 0 only if the file named by the first argument exists.
test -f "${1}"

makes sure that it exists and prevents DAGMan from stopping at the abnormal exit from the inspiral code.
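
A slightly more defensive variant of this check could also reject an empty checkpoint file, which an existence test alone would accept. This is a sketch only, not the script that was used in the test; `checkpoint_ok` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical stricter check: succeed only if the file exists and is not
# empty; otherwise print a diagnostic on stderr and fail.
checkpoint_ok() {
    if test -s "$1"; then
        return 0
    fi
    echo "checkpoint missing or empty: $1" >&2
    return 1
}
```

Used as the last command of a post script, for example `checkpoint_ok "$1"`, its exit status becomes the status reported for the script.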

Next everything is shipped off to PSU and executed with:

# run an inspiral job from a checkpointed image at psu
executable = lalapps_inspiral
transfer_executable = true
arguments = '-_condor_restart' 'L1-INSPIRAL-733022064-2048.ckpt'
environment = KMP_LIBRARY=serial;MKL_SERIAL=yes
globusscheduler = ligo-grid.aset.psu.edu/jobmanager-pbs
universe = globus
globusrsl = (job_type=single)(queue=lsc)
transfer_input_files = L1-INSPIRAL-733022064-2048.ckpt
stream_output = false
stream_error = false
when_to_transfer_output = on_exit
output = inspiral-g.out
error = inspiral-g.err
log = /home/duncan/projects/condor/psu/checkpoint/inspiral.log
queue

and finally the getoutput.sh post script fetches the output back to UWM and deletes the output file at PSU:

#!/bin/sh
# Copy the remote trigger file ($1) to the local path ($2), then remove the remote copy.
globus-url-copy gsiftp://ligo-grid.aset.psu.edu/${1} file:${2} &&
globusrun -r ligo-grid.aset.psu.edu \
"&(executable=/usr/bin/env)(arguments=rm -f ${1})" ||
exit 1
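
Because this script ends by deleting a remote file, it may be worth validating its two arguments before doing anything. The sketch below is an assumption about what such a check could look like, not part of the script that was run; `is_absolute` and `check_args` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical argument checks for a script like getoutput.sh.

# Succeed if the path starts with a slash.
is_absolute() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Require exactly two absolute paths: the remote file and the local target.
check_args() {
    [ "$#" -eq 2 ] || { echo "usage: getoutput.sh REMOTE LOCAL" >&2; return 2; }
    is_absolute "$1" || { echo "remote path must be absolute: $1" >&2; return 2; }
    is_absolute "$2" || { echo "local path must be absolute: $2" >&2; return 2; }
}
```

Calling `check_args "$@" || exit 2` at the top of the script would stop it before the copy or the remote rm is attempted with a missing or relative path.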

Condor log file

000 (12684.000.000) 04/06 00:46:20 Job submitted from host: <129.89.57.47:47555>
    DAG Node: localinspiral
...
001 (12684.000.000) 04/06 00:46:42 Job executing on host: <129.89.57.47:47554>
...
006 (12684.000.000) 04/06 00:46:50 Image size of job updated: 140100
...
005 (12684.000.000) 04/06 00:48:14 Job terminated.
        (0) Abnormal termination (signal 12)
        (0) No core file
                Usr 0 00:01:03, Sys 0 00:00:02  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:01:03, Sys 0 00:00:02  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
016 (12684.000.000) 04/06 00:48:15 POST Script terminated.
        (1) Normal termination (return value 0)
...
000 (12685.000.000) 04/06 00:48:26 Job submitted from host: <129.89.57.47:47555>
    DAG Node: gridinspiral
...
017 (12685.000.000) 04/06 00:48:39 Job submitted to Globus
    RM-Contact: ligo-grid.aset.psu.edu/jobmanager-pbs
    JM-Contact: https://ligo-grid.aset.psu.edu:43283/2008/1081230511/
    Can-Restart-JM: 1
...
001 (12685.000.000) 04/06 00:51:10 Job executing on host: ligo-grid.aset.psu.edu
...
005 (12685.000.000) 04/06 00:52:25 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
016 (12685.000.000) 04/06 00:52:35 POST Script terminated.
        (1) Normal termination (return value 0)
...