How to monitor and manage jobs (linux)

Monitor jobs

qstat <arguments>

Arguments

Group   Argument                      Description                                     Example
Queues  -g c                          Show cluster queue summary                      qstat -g c
        -q <queue name>               Query the selected queue only                   qstat -q fastq -f
        -f                            List all details                                qstat -q fastq -f
        -F [<resource>[,<resource>]]  List all details and show (selected) resources  qstat -F h_cpu,h_rss
Jobs    -u <user ID>                  List jobs of a selected user                    qstat -u abcd123
        -j <job ID>                   List details of a selected job                  qstat -j 1234

Manage jobs

Rerun jobs after error

qmod -cj <job ID>

Run this only after the error has been resolved!
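
Before clearing an error it can help to see which of your jobs are actually in an error state. The following is a minimal sketch, not an official tool: it assumes the default qstat listing, in which the first two lines are the header and a separator and the job state is the fifth column. The function name error_job_ids is made up for this example.

```shell
#!/bin/bash
# Print the IDs of jobs whose state contains "E" (e.g. Eqw).
# Reads a qstat listing on standard input.
# Assumption: two header lines, job ID in column 1, state in column 5.
error_job_ids() {
    awk 'NR > 2 && $5 ~ /E/ { print $1 }'
}

if command -v qstat >/dev/null 2>&1; then
    # On the cluster: inspect the jobs of the current user.
    qstat -u "$(whoami)" | error_job_ids
else
    echo "qstat not found; run this where the Grid Engine tools are installed" >&2
fi
```

Each ID printed can then be inspected with qstat -j <job ID> before it is passed to qmod -cj.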

Delete job

qdel <job ID>

To delete all your jobs at once:

qselect -u `whoami` | xargs qdel

Local execution

Programs may produce two kinds of output:

  • error message, called standard error
  • result, called standard output

When a program is run locally, i.e. in a terminal, both are printed to the terminal. E.g. running FSL's bet (brain extraction tool) without configuring FSL writes the following to standard error:

bash: bet: command not found

After configuring FSL, bet writes some descriptive statistics of the extracted brain to standard output:

...
min 0 thresh2 0 thresh 97.8208 thresh98 978.208 max 2779
c-of-g 134.204 88.4442 150.463 mm
radius 90.7149 mm
median within-brain intensity 281
self-intersection total 307.343 (threshold=4000.0)
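
To see that the two streams really are separate, they can be redirected to different files in a local terminal. This is only an illustrative sketch: it uses echo and ls instead of bet, and the file names demo.out and demo.err are made up for the example.

```shell
#!/bin/bash
# Scratch directory for the two files (created only for this demonstration).
work_dir=$(mktemp -d)

# echo writes to standard output; ls on a missing path writes to standard error.
# ">" redirects standard output, "2>" redirects standard error.
{
    echo "this is a result"
    ls "$work_dir/does_not_exist" || true   # the failure is expected here
} > "$work_dir/demo.out" 2> "$work_dir/demo.err"

echo "standard output:"; cat "$work_dir/demo.out"
echo "standard error:";  cat "$work_dir/demo.err"
```

Nothing from the two commands appears in the terminal until the files are printed with cat, which is the same situation as for a job on the cluster.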

Job execution

Jobs running on the cluster are not interactive, i.e. their output is not sent to the terminal but redirected to files, which are saved in your home directory by default:

  • standard error: <home>/<script name>.e<job ID>[.<task ID>]
  • standard output: <home>/<script name>.o<job ID>[.<task ID>]

This also means that you should regularly check your home folder and delete these files.
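
A minimal sketch for finding such files, assuming they follow the default naming scheme given above. The function name list_job_logs is made up for this example, and the script only lists files; it does not delete anything.

```shell
#!/bin/bash
# List job log files (<script name>.o<job ID>[.<task ID>] and the .e
# equivalent) located directly in the given directory.
# Defaults to the home directory when no directory is given.
list_job_logs() {
    local dir="${1:-$HOME}"
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.o[0-9]*' -o -name '*.e[0-9]*' \) | sort
}

list_job_logs "$HOME"
```

The name patterns are deliberately loose and may also match unrelated files whose extension starts with "o" or "e" followed by a digit, so review the list before removing anything.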

You can change their location by specifying arguments -e (for standard error) and/or -o (for standard output) when submitting the job via qsub.
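
As a sketch of such a submission: the log directory cluster_logs and the job script myjob.sh are hypothetical names, and passing a directory to -o and -e is assumed to keep the default file names inside that directory. Off the cluster, or when the job script is missing, the script only prints the command it would run.

```shell
#!/bin/bash
# Hypothetical names: adjust the log directory and the job script to your setup.
log_dir="${LOG_DIR:-$HOME/cluster_logs}"
job_script="${1:-myjob.sh}"

# The directory must exist before the job starts; otherwise the job
# ends up in an error state.
mkdir -p "$log_dir"

# -o sets the location of standard output, -e that of standard error.
submit_cmd=(qsub -o "$log_dir" -e "$log_dir" "$job_script")

if command -v qsub >/dev/null 2>&1 && [ -f "$job_script" ]; then
    "${submit_cmd[@]}"
else
    # Not on the cluster, or the job script is missing: only show the command.
    echo "would run: ${submit_cmd[*]}"
fi
```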

Output folder for logfiles does not exist

  1. List all of my jobs
    [abcd123@psyclogin cluster]$ qstat -u abcd123
    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
    -----------------------------------------------------------------------------------------------------------------
       5157 0.05500 ls         abcd123      Eqw   07/12/2017 14:17:28                                    1        
       ... 

    The Eqw state (error, queued, waiting) means that the job could not be started because of an error and is still waiting in the queue.

  2. Check the error of job 5157
    [abcd123@psyclogin cluster]$ qstat -j 5157 
    ==============================================================
    job_number:          5157
    submission_time:     Wed Jul 12 14:20:44 2017
    sge_o_workdir:       /MRIWork/...
    stdout_path_list:    NONE:NONE:/MRIWork/.../outdir/Job1.out
    script_file:         ls
    error reason      1: 07/12/2017 15:20:56 [795053430:3198]: error: can't open output file "/MRIWork/.../outdir/Job1.out": No such file or directory
    scheduling info:     Job is in error state
    ...

    Output file outdir/Job1.out cannot be created. Does /MRIWork/…/outdir exist?

  3. Create /MRIWork/…/outdir
  4. Rerun job 5157
    qmod -cj 5157
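
The four steps above can be sketched as a small script. This is an illustration under assumptions, not a site-provided tool: the function name missing_output_file is made up, and the script relies on the error reason having exactly the wording shown in step 2. If no such line is found, it changes nothing.

```shell
#!/bin/bash
# Extract the file path from a "can't open output file" line.
# Reads the output of qstat -j on standard input; prints the first match only.
missing_output_file() {
    sed -n "s/.*can't open output file \"\([^\"]*\)\".*/\1/p" | head -n 1
}

job_id="${1:-5157}"   # job ID from the example above; pass your own as argument

if command -v qstat >/dev/null 2>&1; then
    out_file=$(qstat -j "$job_id" | missing_output_file)
    if [ -n "$out_file" ]; then
        # Step 3: create the missing folder. Step 4: clear the error state.
        mkdir -p "$(dirname "$out_file")"
        qmod -cj "$job_id"
    else
        echo "no missing output file reported for job $job_id" >&2
    fi
else
    echo "qstat not found; run this where the Grid Engine tools are installed" >&2
fi
```

Errors with a different cause are left untouched, so they still need to be checked by hand with qstat -j.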
