How to monitor and manage jobs (Linux)
Scheduler
Monitor jobs
qstat <arguments>
Arguments
Group | Argument | Description | Example |
---|---|---|---|
Queues | -g c | Show cluster queue summary | qstat -g c |
Queues | -q <queue name> | Query the selected queue only | see below |
Queues | -f | List all details | qstat -q fastq -f |
Queues | -F [<resource>[,<resource>]] | List all details and show (selected) resources | qstat -F h_cpu,h_rss |
Jobs | -u <user ID> | List jobs of a selected user | qstat -u abcd123 |
Jobs | -j <job ID> | List details of a selected job | qstat -j 1234 |
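For a quick overview, the job list printed by qstat -u can be summarised per state. The sketch below is only an illustration: count_job_states is a made-up helper name, and it assumes the default qstat layout with two header lines and the job state in the fifth column.

```shell
# count_job_states: hypothetical helper, not part of the scheduler.
# Reads a job list in the default `qstat -u <user ID>` layout from standard
# input and prints one "<state> <count>" line per job state.
# Assumption: two header lines, job state in the 5th column.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster (needs the scheduler tools, so left commented out):
# qstat -u "$(whoami)" | count_job_states
```

Lines with fewer than five fields, such as a trailing "...", are skipped.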
Manage jobs
Rerun jobs after error
qmod -cj <job ID>
Run this only after the error has been resolved!
Delete job
qdel <job ID>
To delete all your jobs at once:
qdel -u `whoami`
Log files
Local execution
Programs may produce two kinds of output:
- error message, called standard error
- result, called standard output
When a program is run locally, i.e. in a terminal, both kinds of output are printed to the terminal. For example, running FSL's bet (brain extraction tool) without configuring FSL first produces a message on standard error:
bash: bet: command not found
After configuring FSL, bet writes some descriptive statistics of the extracted brain to standard output:
...
min 0 thresh2 0 thresh 97.8208 thresh98 978.208 max 2779
c-of-g 134.204 88.4442 150.463 mm
radius 90.7149 mm
median within-brain intensity 281
self-intersection total 307.343 (threshold=4000.0)
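The two streams can be told apart by redirecting them separately. A minimal sketch, in which demo_streams is a made-up stand-in for a real program and out.log and err.log are hypothetical file names:

```shell
# demo_streams: hypothetical stand-in for a real program.
# It writes one line to standard output and one line to standard error.
demo_streams() {
    echo "result"                # goes to standard output
    echo "error message" >&2     # goes to standard error
}

# In a terminal both lines would appear together. Redirecting the streams
# to separate files shows which line belongs to which stream.
log_dir=$(mktemp -d)             # scratch folder for this demonstration
demo_streams > "$log_dir/out.log" 2> "$log_dir/err.log"

cat "$log_dir/out.log"           # shows the standard output line
cat "$log_dir/err.log"           # shows the standard error line
```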
Job execution
Jobs running on the cluster are not interactive, i.e. their output is not sent to the terminal but redirected to files, which are saved in your home folder by default:
- standard error: <home>/<script name>.e<job ID>[.<task ID>]
- standard output: <home>/<script name>.o<job ID>[.<task ID>]
This also means that you should regularly check your home folder and delete these files.
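To find these files, the naming scheme above can be turned into a search pattern. The following is a sketch only: list_job_logs is a made-up helper name, and it assumes that no other files in the folder end in .o<number> or .e<number>.

```shell
# list_job_logs: hypothetical helper, not part of the scheduler.
# Lists files directly inside a folder (default: home folder) whose names
# end in .o<job ID> or .e<job ID>, optionally followed by .<task ID>.
# Assumption: only job log files follow this naming pattern.
list_job_logs() {
    find "${1:-$HOME}" -maxdepth 1 -type f |
        grep -E '\.[oe][0-9]+(\.[0-9]+)?$' |
        sort
}

# Review the list before deleting anything, e.g.:
# list_job_logs
# list_job_logs | while IFS= read -r f; do rm -- "$f"; done
```

The function only lists files, so nothing is removed until the list has been checked.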
You can change their location by specifying arguments -e (for standard error) and/or -o (for standard output) when submitting the job via qsub.
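A possible way to keep log files out of the home folder is sketched below. The wrapper name submit_with_logs, the QSUB variable and the folder and script names are all made up for illustration. It also assumes that qsub accepts a folder for -o and -e and places the default file names inside it.

```shell
# submit_with_logs: hypothetical wrapper around qsub.
# Usage: submit_with_logs <log folder> <job script> [further qsub arguments]
# The log folder is created first, because a job cannot write its output
# if the folder does not exist.
submit_with_logs() {
    log_dir=$1
    script=$2
    shift 2
    mkdir -p "$log_dir" || return 1
    # QSUB is an assumption of this sketch: set it to "echo" for a dry run.
    # It defaults to the real qsub command.
    "${QSUB:-qsub}" -o "$log_dir" -e "$log_dir" "$@" "$script"
}

# Example with hypothetical folder and script names (left commented out):
# submit_with_logs "$HOME/joblogs" myscript.sh
```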
Use cases
Output folder for logfiles does not exist
- List all of my jobs
[abcd123@psyclogin cluster]$ qstat -u abcd123
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
5157 0.05500 ls abcd123 Eqw 07/12/2017 14:17:28 1
...
The Eqw state means that the job is still waiting in the queue (qw) and has been put into an error state (E), i.e. it could not be started.
- Check the error of job 5157
[abcd123@psyclogin cluster]$ qstat -j 5157
==============================================================
job_number: 5157
submission_time: Wed Jul 12 14:20:44 2017
sge_o_workdir: /MRIWork/...
stdout_path_list: NONE:NONE:/MRIWork/.../outdir/Job1.out
script_file: ls
error reason 1: 07/12/2017 15:20:56 [795053430:3198]: error: can't open output file "/MRIWork/.../outdir/Job1.out": No such file or directory
scheduling info: Job is in error state
...
Output file outdir/Job1.out cannot be created. Does /MRIWork/…/outdir exist?
- Create /MRIWork/…/outdir
- Rerun job 5157
qmod -cj 5157
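The details printed by qstat -j can be long, so it may help to show only the lines that explain the error. A minimal sketch, in which job_error_reasons is a made-up helper name and the "error reason" line prefix is taken from the output shown in this use case:

```shell
# job_error_reasons: hypothetical helper, not part of the scheduler.
# Reads `qstat -j <job ID>` output from standard input and prints only
# the lines starting with "error reason".
job_error_reasons() {
    grep '^error reason'
}

# Usage on the cluster (needs the scheduler tools, so left commented out):
# qstat -j 5157 | job_error_reasons
```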