Job Scheduling and Management¶
Schedule a Job¶
On our cluster, you control your jobs using a job scheduling system called Slurm, which allocates and manages compute resources for you. You can submit your jobs in one of two ways. For testing and small jobs you may want to run a job interactively, which lets you work directly on the compute node(s) in real time. The other way, which is preferred for multiple jobs or long-running jobs, is to write your job commands in a script and submit it to the job scheduler. Please see our Slurm documentation for more details.
Running Interactive Jobs¶
In general, jobs can be run either interactively or in batch mode. You can run an interactive job as follows:
The following command requests a single core in the testq partition for one hour with the default amount of memory.
srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --partition=testq --job-name=job-name --pty /usr/bin/bash
The command prompt will appear as soon as the job starts; it changes from the login node prompt to a prompt on the allocated compute node.
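What this looks like in practice is sketched below; the hostnames login01 and compute01 and the user user1 are placeholders, not the actual names on the cluster:

[user1@login01 ~]$ srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --partition=testq --job-name=test --pty /usr/bin/bash
[user1@compute01 ~]$ hostname
compute01
[user1@compute01 ~]$ exit        # exiting the shell ends the interactive job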
Exit the bash shell to end the job. The job will also be aborted if it exceeds the time or memory limits. Please note that the Madhava HPC Cluster is NOT meant for executing interactive jobs. However, interactive jobs are useful for quickly verifying that a job runs correctly before submitting a large batch job (with large iteration counts), and they can also be used for small jobs. Keep in mind that other users share these nodes, so it is prudent not to inconvenience them by running large jobs interactively.
Submit the Job¶
To submit a simple standalone job, write a job script (a sample is shown below) and submit it with sbatch:
sbatch job-script-name
A sample Slurm job script
#!/bin/sh
#SBATCH --nodes=1                    # number of nodes
#SBATCH --ntasks-per-node=40         # cores per node
#SBATCH --time=06:50:20              # maximum duration of the run
#SBATCH --job-name=lammps            # job name
#SBATCH --output=job_output.txt      # output file name
#SBATCH --partition=shortq           # queue (partition) name
#SBATCH --ntasks=10                  # total number of cores (tasks) required
##########################################################
echo "Job Submitted"

# load required modules
module load ...

#------------- run your commands here -------------------#
mpirun -n $SLURM_NTASKS ...

echo "Job finished successfully"
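Submitting the script returns a job id, which you can later pass to squeue, scancel, and scontrol; the script name job.sh and the job id 135 below are only illustrative:

$ sbatch job.sh
Submitted batch job 135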
Managing Jobs¶
The Madhava HPC Cluster makes extensive use of environment modules. The purpose of a module is to provide the production environment for a given application outside of the application itself, and to specify which version of the application is available in a given session. All applications and libraries are made available through module files; a user has to load the appropriate module from the available modules.
module avail # This command lists all the available modules
module load compilers/intel/parallel_studio_xe_2020.4.912 # This will load the Intel compilers into your environment
module unload compilers/intel/parallel_studio_xe_2020.4.912 # This will remove all environment settings related to the previously loaded Intel compiler
module list #This will list currently loaded modules.
List Partition¶
sinfo displays information about nodes and partitions(queues).
sinfo
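Its output looks roughly like the sample below; the partition names, time limits, node counts, and states shown here are only illustrative:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
testq        up    1:00:00      2   idle compute[01-02]
shortq*      up 3-00:00:00     10    mix compute[03-12]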
List jobs¶
Monitoring jobs on SLURM can be done using the command squeue. squeue is used to view job and job step information for jobs managed by SLURM.
squeue
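A sample of squeue output is shown below; the ST column gives the job state, most commonly R (running), PD (pending), and S (suspended). The jobs listed are only illustrative:

JOBID PARTITION     NAME     USER  ST  TIME  NODES NODELIST(REASON)
  135    shortq   lammps    user1   R  0:13      1 compute01
  136    shortq   lammps    user1  PD  0:00      1 (Resources)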
Get job details¶
scontrol can be used to report more detailed information about nodes, partitions, jobs, job steps, and configuration.
scontrol show node - shows detailed information about compute nodes.
scontrol show partition - shows detailed information about a specific partition.
scontrol show job - shows detailed information about a specific job, or about all jobs if no job id is given.
scontrol update job - changes attributes of a submitted job.
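For example, using the node, partition, and job id that appear elsewhere on this page (the new time limit is only illustrative):

scontrol show node compute01
scontrol show partition shortq
scontrol show job 135
scontrol update JobId=135 TimeLimit=08:00:00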
Suspend a job (root only):¶
# scontrol suspend 135
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
135 shortq simple.s user1 S 0:13 1 compute01
Resume a job (root only):¶
# scontrol resume 135
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
135 shortq simple.s user1 R 0:13 1 compute01
Kill a job:¶
Users can kill their own jobs; root can kill any job.
$ scancel 135
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Hold a job:¶
$ scontrol hold 139
Release a job:¶
$ scontrol release 139
More about Batch Jobs (SLURM)¶
SLURM (Simple Linux Utility for Resource Management) is a workload manager that provides a framework for job queues, allocation of compute nodes, and the start and execution of jobs.
It is important to note:
- Compilations are done on the login node. Only the execution is scheduled via SLURM on the compute/GPU nodes.
- Upon submission of a job script, each job gets a unique Job Id. This can be obtained from the 'squeue' command.
- The Job Id is also appended to the output and error filenames (see the example below).
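For instance, if no --output file is given, Slurm writes stdout and stderr to slurm-<jobid>.out by default, and the %j placeholder lets you embed the job id in filenames of your own choosing; the names below are only illustrative:

#SBATCH --output=lammps_%j.out       # becomes lammps_135.out for job 135
#SBATCH --error=lammps_%j.err        # becomes lammps_135.err for job 135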
Parameters used in SLURM job Script¶
The job flags are used with the sbatch command. The syntax for a SLURM directive in a script is "#SBATCH <flag>". Some of the flags can also be used with the srun and salloc commands.
Resource Flag | Syntax | Description |
---|---|---|
partition | --partition=partition-name | Partition (queue) for the job. |
time | --time=01:00:00 | Time limit for the job. |
nodes | --nodes=2 | Number of compute nodes for the job. |
cpus/cores | --ntasks-per-node=8 | Number of cores (tasks) used per compute node. |
job name | --job-name="job-1" | Name of the job. |
output file | --output=job-1.out | Name of the file for stdout. |
access | --exclusive | Exclusive access to compute nodes. |
resource | --gres=gpu:2 | Request use of GPUs on the compute node. |
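As a sketch of how these flags combine, the script below asks for one node with exclusive access and two GPUs; the partition name gpuq, the module, and the application command are placeholders to be replaced with what is actually available on the cluster (check sinfo for the real partition names):

#!/bin/sh
#SBATCH --job-name=gpu-test          # job name
#SBATCH --partition=gpuq             # GPU partition (placeholder, check sinfo)
#SBATCH --nodes=1                    # one compute node
#SBATCH --ntasks-per-node=8          # cores on that node
#SBATCH --gres=gpu:2                 # request two GPUs
#SBATCH --exclusive                  # exclusive access to the node
#SBATCH --time=02:00:00              # maximum duration of the run
#SBATCH --output=gpu-test_%j.out     # stdout file, %j expands to the job id

module load ...                      # load the required modules
mpirun -n $SLURM_NTASKS ...          # run your application here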
Some useful SLURM commands¶
Sl No | Purpose | Command |
---|---|---|
1 | To check the queue status | squeue |
2 | To check the status/availability of nodes | sinfo |
3 | To cancel a running job | scancel jobid (the job id can be obtained with squeue) |
4 | To check the jobs of a particular user | squeue -u username |
5 | To list all running jobs of a user | squeue -u username -t RUNNING |
6 | To list all pending jobs of a user | squeue -u username -t PENDING |
7 | To cancel all the pending jobs of a user | scancel -t PENDING -u username |