Job Scheduling and Management¶
Schedule a Job¶
On our cluster, you control your jobs using a job scheduling system called Slurm, which allocates and manages compute resources for you. You can submit your jobs in one of two ways. For testing and small jobs you may want to run a job interactively, which lets you work directly on the compute node(s) in real time. The other way, which is preferred for multiple jobs or long-running jobs, is to write your job commands in a script and submit it to the job scheduler. Please see our Slurm documentation for more details.
Running Interactive Jobs¶
In general, jobs can be run either interactively or in batch mode. You can run an interactive job as follows:
The following command requests a single core in the testq partition for one hour with the default amount of memory.
srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --partition=testq --job-name=job-name --pty /usr/bin/bash
The command prompt will appear as soon as the job starts; it changes from the login node prompt to a prompt on the allocated compute node.
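What this looks like in practice is sketched below; the hostnames login01 and compute01 and the user user1 are placeholders, not the actual names on the cluster:

[user1@login01 ~]$ srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --partition=testq --job-name=test --pty /usr/bin/bash
[user1@compute01 ~]$ hostname
compute01
[user1@compute01 ~]$ exit        # exiting the shell ends the interactive job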
Exit the bash shell to end the job. The job will also be aborted if it exceeds the time or memory limits. Please note that the Madhava HPC Cluster is NOT meant for executing interactive jobs. However, interactive jobs are useful for quickly verifying that a job runs correctly before submitting a large batch job (with large iteration counts), and they can also be used for small jobs. Keep in mind that other users share these nodes, so it is prudent not to inconvenience them by running large jobs interactively.
Submit the Job¶
To submit a simple standalone job, write a job script (a sample is shown below) and submit it with sbatch:
sbatch job-script-name
A sample Slurm job script
#!/bin/sh
#SBATCH --nodes=1                    # number of nodes
#SBATCH --ntasks-per-node=40         # cores per node
#SBATCH --time=06:50:20              # maximum duration of the run
#SBATCH --job-name=lammps            # job name
#SBATCH --output=job_output.txt      # output file name
#SBATCH --partition=shortq           # queue (partition) name
#SBATCH --ntasks=10                  # total number of cores (tasks) required
##########################################################
echo "Job Submitted"

# load required modules
module load ...

#------------- run your commands here -------------------#
mpirun -n $SLURM_NTASKS ...

echo "Job finished successfully"
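Submitting the script returns a job id, which you can later pass to squeue, scancel, and scontrol; the script name job.sh and the job id 135 below are only illustrative:

$ sbatch job.sh
Submitted batch job 135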
Managing Jobs¶
The Madhava HPC Cluster makes extensive use of environment modules. The purpose of a module is to provide the production environment for a given application outside of the application itself, and to specify which version of the application is available in a given session. All applications and libraries are made available through module files; a user has to load the appropriate module from the available modules.
module avail # This command lists all the available modules
module load compilers/intel/parallel_studio_xe_2020.4.912 # This will load the Intel compilers into your environment
module unload compilers/intel/parallel_studio_xe_2020.4.912 # This will remove all environment settings related to the previously loaded Intel compiler
module list #This will list currently loaded modules.
List Partition¶
sinfo displays information about nodes and partitions(queues).
sinfo
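Its output looks roughly like the sample below; the partition names, time limits, node counts, and states shown here are only illustrative:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
testq        up    1:00:00      2   idle compute[01-02]
shortq*      up 3-00:00:00     10    mix compute[03-12]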
List jobs¶
Monitoring jobs on SLURM can be done using the command squeue. squeue is used to view job and job step information for jobs managed by SLURM.
squeue
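A sample of squeue output is shown below; the ST column gives the job state, most commonly R (running), PD (pending), and S (suspended). The jobs listed are only illustrative:

JOBID PARTITION     NAME     USER  ST  TIME  NODES NODELIST(REASON)
  135    shortq   lammps    user1   R  0:13      1 compute01
  136    shortq   lammps    user1  PD  0:00      1 (Resources)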
Get job details¶
scontrol can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration.
scontrol show node - shows detailed information about compute nodes.
scontrol show partition - shows detailed information about a specific partition.
scontrol show job - shows detailed information about a specific job, or about all jobs if no job id is given.
scontrol update job - changes attributes of a submitted job.
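For example, using the node, partition, and job id that appear elsewhere on this page (the new time limit is only illustrative):

scontrol show node compute01
scontrol show partition shortq
scontrol show job 135
scontrol update JobId=135 TimeLimit=08:00:00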
Suspend a job (root only):¶
# scontrol suspend 135
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
135 shortq simple.s user1 S 0:13 1 compute01
Resume a job (root only):¶
# scontrol resume 135
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
135 shortq simple.s user1 R 0:13 1 compute01
Kill a job:¶
Users can kill their own jobs; root can kill any job.
$ scancel 135
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Hold a job:¶
$ scontrol hold 139
Release a job:¶
$ scontrol release 139
More about Batch Jobs (SLURM)¶
SLURM (Simple Linux Utility for Resource Management) is a workload manager that provides a framework for job queues, allocation of compute nodes, and the start and execution of jobs.
It is important to note:
- Compilations are done on the login node. Only the execution is scheduled via SLURM on the compute/GPU nodes.
- Upon submission of a job script, each job gets a unique Job Id. This can be obtained from the 'squeue' command.
- The Job Id is also appended to the output and error filenames (see the example below).
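For instance, if no --output file is given, Slurm writes stdout and stderr to slurm-<jobid>.out by default, and the %j placeholder lets you embed the job id in filenames of your own choosing; the names below are only illustrative:

#SBATCH --output=lammps_%j.out       # becomes lammps_135.out for job 135
#SBATCH --error=lammps_%j.err        # becomes lammps_135.err for job 135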
Parameters used in SLURM job Script¶
The job flags are used with the sbatch command. The syntax for a SLURM directive in a script is "#SBATCH <flag>". Some of the flags can also be used with the srun and salloc commands.
Resource Flag | Syntax | Description |
---|---|---|
partition | --partition=partition-name | Partition (queue) for the job. |
time | --time=01:00:00 | Time limit for the job. |
nodes | --nodes=2 | Number of compute nodes for the job. |
cpus/cores | --ntasks-per-node=8 | Number of cores (tasks) used per compute node. |
job name | --job-name="job-1" | Name of the job. |
output file | --output=job-1.out | Name of the file for stdout. |
access | --exclusive | Exclusive access to compute nodes. |
resource | --gres=gpu:2 | Request use of GPUs on the compute node. |
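As a sketch of how these flags combine, the script below asks for one node with exclusive access and two GPUs; the partition name gpuq, the module, and the application command are placeholders to be replaced with what is actually available on the cluster (check sinfo for the real partition names):

#!/bin/sh
#SBATCH --job-name=gpu-test          # job name
#SBATCH --partition=gpuq             # GPU partition (placeholder, check sinfo)
#SBATCH --nodes=1                    # one compute node
#SBATCH --ntasks-per-node=8          # cores on that node
#SBATCH --gres=gpu:2                 # request two GPUs
#SBATCH --exclusive                  # exclusive access to the node
#SBATCH --time=02:00:00              # maximum duration of the run
#SBATCH --output=gpu-test_%j.out     # stdout file, %j expands to the job id

module load ...                      # load the required modules
mpirun -n $SLURM_NTASKS ...          # run your application here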
Some useful SLURM commands¶
Sl No | Purpose | Command |
---|---|---|
1 | To check the queue status | squeue |
2 | To check the status/availability of nodes | sinfo |
3 | To cancel a running job | scancel jobid (the job id can be obtained with squeue) |
4 | To check the jobs of a particular user | squeue -u username |
5 | To list all running jobs of a user | squeue -u username -t RUNNING |
6 | To list all pending jobs of a user | squeue -u username -t PENDING |
7 | To cancel all the pending jobs of a user | scancel -t PENDING -u username |