SLURM Job Management¶
After a job is submitted to SLURM, users may check its status with the squeue command as described below.
Show all running/pending jobs¶
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
23598 shortq dphpcap_ p210006c R 13-23:48:34 1 compute26
23707 shortq dphpcap_ p210006c R 11-15:37:20 1 compute27
23708 shortq dphpcap_ p210006c R 11-13:07:41 1 compute17
23716 shortq ts1-opt p170072c R 11-08:32:55 1 compute06
23752 shortq dphpcap_ p210006c R 5-02:59:36 1 compute23
23872 shortq Ge-m3-bs p180028c R 7-16:37:59 1 compute12
23949 mediumq nfkappa8 p140114n R 5-07:31:59 3 compute[01-02,23]
23953 shortq as-origi p200019c R 6-02:59:31 1 compute05
23970 gpuq Sol-2 p180089p R 3-15:38:49 1 compute32
23976 mediumq Au8-O2 p180089p R 5-13:23:58 2 compute[05,25]
24004 shortq mix raghuc R 3-06:19:34 1 compute29
24020 gpuq e8r_fexo tomskari R 3-03:13:48 1 compute33
24023 gpuq 58RPE p210139b R 3-01:11:50 1 compute33
24038 shortq mix2 raghuc R 2-21:38:58 1 compute28
24039 shortq mix1 raghuc R 2-21:38:41 1 compute30
24055 mediumq mix3 raghuc R 1-13:28:53 2 compute[01-02]
24056 shortq Cu8_o21 p180089p R 1-13:28:53 1 compute03
24079 mediumq md2 p200028p R 1-20:38:10 2 compute[26-27]
24086 mediumq Ni_ads p220106p R 1-17:05:33 2 compute[17,24]
24119 shortq NEB1 p200028p R 14:49:51 1 compute31
24124 mediumq nfkappa8 p140114n R 17:44:16 2 compute[19,25]
24125 longq nfkappa8 p140114n R 17:39:01 5 compute[03,05,11,19-20]
24127 shortq elabqa p170030c R 5:43:32 1 compute20
24130 shortq zlabqa p170030c R 5:39:36 1 compute24
24131 shortq zlabqa p170030c R 5:39:11 1 compute11
24133 shortq bcc30 p190146m R 4:50:52 1 compute12
24146 shortq NEB2_2 p200028p R 35:49 1 compute09
24147 shortq visc8p2v p140114n R 1:27 1 compute13
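To list only your own jobs, squeue also accepts a user filter; <Username> below is a placeholder for your cluster username:
$ squeue -u <Username>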
Show a specific job¶
squeue -j <JobID>
$ squeue -j 123456
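For the full record of a single job (allocated nodes, time limit, working directory, and so on), scontrol can be used as well:
$ scontrol show job <JobID>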
Show jobs in a specific partition¶
squeue -p <partition>
$ squeue -p shortq
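Filters can be combined; for example, to show only the pending jobs in shortq:
$ squeue -p shortq -t PD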
Show running jobs¶
$ squeue -t R
Show pending jobs¶
$ squeue -t PD
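For pending jobs, squeue can also report the scheduler's estimated start times:
$ squeue -t PD --start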
Field | Description
---|---
JOBID | Job ID
PARTITION | Partition the job is assigned to
NAME | Job name given at submission
ST | Job status (see the status codes below)
TIME | Running time
NODELIST(REASON) | For running jobs, the list of nodes in use; for pending jobs, the reason explaining the current job status
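The columns above are squeue's defaults; custom columns can be selected with the -o option. As a sketch using standard format codes (%i job ID, %u user, %t status, %l time limit):
$ squeue -o "%.10i %.10u %.4t %.12l"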
Status | Description
---|---
R | Running
PD | Pending (queued)
CD | Completed (exit code 0, no error)
F | Failed (non-zero exit code)
DL | Terminated on reaching its deadline
Reason | Description
---|---
Priority | The job is waiting for higher-priority job(s) to complete
Dependency | The job is waiting for a job it depends on to complete
Resources | The job is waiting for resources to become available
MaxJobsPerUser | The user has reached the maximum number of simultaneously running jobs
Delete / cancel a job¶
$ scancel <JobID>
Delete / cancel all jobs for a user¶
$ scancel -u <Username>
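scancel can also filter on job state; for example, to cancel only a user's pending jobs while leaving running ones untouched:
$ scancel -u <Username> -t PENDING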
Update attributes of submitted jobs¶
Update the walltime request of a queued job (a pending job that has not yet started to run) to 1 hour.
$ scontrol update jobid=<JobID> TimeLimit=01:00:00
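The updated limit can be confirmed with squeue's long output, which includes a TIME_LIMIT column:
$ squeue -j <JobID> -l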
Check Partition/Node Usage¶
Users can check the status of partitions and nodes with the sinfo command
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
testq up 1:00:00 13 mix compute[01-03,06,11-12,15,19-20,23-25,27]
testq up 1:00:00 18 alloc compute[04-05,07-10,13-14,16-18,21-22,26,28-31]
shortq up infinite 13 mix compute[01-03,06,11-12,15,19-20,23-25,27]
shortq up infinite 18 alloc compute[04-05,07-10,13-14,16-18,21-22,26,28-31]
mediumq up infinite 13 mix compute[01-03,06,11-12,15,19-20,23-25,27]
mediumq up infinite 18 alloc compute[04-05,07-10,13-14,16-18,21-22,26,28-31]
longq up infinite 13 mix compute[01-03,06,11-12,15,19-20,23-25,27]
longq up infinite 18 alloc compute[04-05,07-10,13-14,16-18,21-22,26,28-31]
testgpuq up 1:00:00 1 mix compute32
testgpuq up 1:00:00 1 alloc compute33
gpuq up infinite 1 mix compute32
gpuq up infinite 1 alloc compute33
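For a per-node rather than per-partition view, sinfo also offers a node-oriented long listing:
$ sinfo -N -l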