= Troubleshooting Condor Jobs =

The first step in troubleshooting a Condor job is to run:

{{{
~$ condor_q

-- Submitter: pongo.cacr.caltech.edu : <131.215.145.189:52215> : pongo.cacr.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  22.0   user            3/31 06:07   0+00:00:00 H  0   0.0  peakfinderBinningF
  23.0   user            3/31 06:08   0+00:00:00 H  0   0.0  peakfinderBinningF
}}}

The column labeled ST is the job state. The main states are:

 Running (R):: Your job is running somewhere.
 Idle (I):: Your job is waiting to be scheduled. [[#Troubleshooting Idle]]
 Held (H):: Your job hit a problem that prevents it from running. [[#Troubleshooting Held]]

= Troubleshooting Idle =

Condor can take up to a minute to get around to scheduling your job, so first be patient. If a job stays in the idle state for more than a few minutes, something is likely wrong.

The first thing to do is run {{{condor_status}}}.

{{{
~$ condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@mondom.cacr. LINUX      X86_64 Unclaimed Idle     7.020  47804  0+00:10:05
slot1@myogenin.cac LINUX      X86_64 Unclaimed Idle     15.090 127701 0+03:15:05

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     2     0       0         2       0          0        0
}}}

== No Output ==

If you run {{{condor_status}}} and get no output, something is wrong with Condor itself and you should contact a system administrator.

== Claimed ==

Condor may also be busy processing someone else's jobs. If all the slots are {{{Claimed}}}, you'll have to wait your turn.

== High Load Average ==

Some hosts aren't dedicated to Condor, so Condor is configured to play nice and not push the system load too high. If the number in the LoadAv column is above the number of CPUs on a machine, Condor won't try to run jobs on that shared host.

= Troubleshooting Held =

If your job is in the '''Held''' state, you need to investigate what's wrong.
{{{
~$ condor_q -better-analyze -global

-- Submitter: pongo.cacr.caltech.edu : <131.215.145.189:52215> : pongo.cacr.caltech.edu
---
022.000:  Request is held.

Hold reason: Error from slot1@myogenin.cacr.caltech.edu: Failed to execute '/woldlab/castor/data00/home/user/peakfinderBinningForChIP.py' with arguments files.txt: Permission denied
}}}

The end of the hold reason contains the likely error message. In the example above, the error is "Permission denied".

== Permission denied ==

If you get this error for a script, Condor checked whether the file is "executable" and it was not. An executable file has permissions that look like:

{{{
~$ ls -l script.py
-rwxrwxr-x 1 diane diane 103 2009-11-12 14:24 script.py*
}}}

(Note the 'x'es in the first column. They tell the operating system and Condor that the owning user (first x), the owning group (second x), and everyone else (third x) can run this script. However, if you just change the permissions, you're likely to run into the [[#Failed_to_execute_.3Cscript.3E|Failed to execute