SLURM Job script¶
To execute a program on the cluster system, a user writes a batch script and submits it to the SLURM job scheduler. Samples of general SLURM scripts are located in each user's hpc2021 home directory under ~/slurm-samples, and the user guide for individual software can be referenced.
Sample job script¶
The example job script (script.cmd) below requests the following resources and specifies the programs/commands to be executed:
- Name the job "pilot_study" for easy reference.
- Request the "shortq" partition.
- Request the "shortq" QoS.
- Request an allocation of 40 CPU cores, all from a single compute node.
- Request 10GB of physical RAM.
- Request a job execution time of 3 days and 10 hours (the job is terminated by SLURM after the specified amount of time, whether it has finished or not).
- Write the standard output and standard error to the files "pilot_study_2021.out" and "pilot_study_2021.err" respectively, under the folder where the job is submitted. The path supports the use of replacement symbols.
#!/bin/bash
#SBATCH --job-name=pilot_study        # 1. Job name
#SBATCH --partition=shortq            # 2. Request a partition
#SBATCH --ntasks=40                   # 3. Request total number of tasks (MPI workers)
#SBATCH --nodes=1                     # 4. Request number of node(s)
#SBATCH --mem=10G                     # 5. Request total amount of RAM
#SBATCH --time=3-10:00:00             # 6. Job execution duration limit day-hour:min:sec
#SBATCH --output=%x_%j.out            # 7. Standard output log as $job_name_$job_id.out
#SBATCH --error=%x_%j.err             # 8. Standard error log as $job_name_$job_id.err

## print the start time
date

command1 ...
command2 ...
command3 ...

## print the end time
date
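Once saved, the script can be submitted and tracked with the standard SLURM client commands. A minimal sketch is shown below; the job ID printed by sbatch (12345 here) is hypothetical, and the %x and %j symbols in the script expand to the job name and job ID respectively.

# Submit the job script to the scheduler; sbatch prints the assigned job ID
sbatch script.cmd
# Check the status of your pending and running jobs
squeue -u $USER
# Cancel the job if necessary (12345 is a placeholder for the real job ID)
scancel 12345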
SLURM Job Directives¶
A SLURM script includes a list of SLURM job directives at the top of the file, where each line starts with #SBATCH followed by option/value pairs that tell the job scheduler the resources the job requests.
Long Option | Short Option | Default value | Description
---|---|---|---
--job-name | -J | file name of job script | User-defined name to identify a job
--partition | -p | intel | Partition where the job is executed
--time | -t | 24:00:00 | Limit on the maximum execution time (walltime) of the job (D-HH:MM:SS). For example, -t 1- is one day, -t 6:00:00 is 6 hours
--nodes | -N | | Total number of node(s)
--ntasks | -n | 1 | Number of tasks (MPI workers)
--ntasks-per-node | | | Number of tasks per node
--cpus-per-task | -c | 1 | Number of CPUs required per task
--mem | | | Amount of memory allocated per node. Different units can be specified using the suffix [K\|M\|G\|T]
--mem-per-cpu | | | Amount of memory allocated per CPU core (for multi-core jobs). Different units can be specified using the suffix [K\|M\|G\|T]
--exclude | -x | | Explicitly exclude certain nodes from the resources granted to the job
More SLURM directives are available here
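As the table shows, most directives have an equivalent short form, and the two styles can be mixed freely. A minimal header sketch using short options (the resource values are illustrative only):

#!/bin/bash
#SBATCH -J pilot_study        # equivalent to --job-name=pilot_study
#SBATCH -p shortq             # equivalent to --partition=shortq
#SBATCH -n 1                  # equivalent to --ntasks=1
#SBATCH -c 4                  # equivalent to --cpus-per-task=4
#SBATCH -t 0-06:00:00         # equivalent to --time=0-06:00:00 (6 hours)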
Running Serial / Single Threaded Jobs using a CPU on a node¶
Serial or single-CPU-core jobs are those jobs that can only make use of one CPU on a node. The SLURM batch script below requests a single CPU on a node, with the default amount of RAM, for 30 minutes in the shortq partition.
#!/bin/bash
#SBATCH --job-name=test-job1
#SBATCH --ntasks=1
#SBATCH --partition=shortq
#SBATCH --time=00:30:00
command1 ...
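After such a job finishes, its recorded run time and memory usage can be reviewed with the standard SLURM accounting command sacct (assuming job accounting is enabled on the cluster); the job ID below is a placeholder.

# Show state, elapsed time and peak memory of a completed job (12345 is a placeholder job ID)
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS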
Running Multi-threaded Jobs using multiple CPU cores on a node¶
For jobs that can leverage multiple CPU cores on a node by creating multiple threads within a process (e.g. OpenMP), the SLURM batch script below may be used. It requests an allocation for one task with 8 CPU cores on a single node and 6GB RAM per core (6GB x 8 = 48GB RAM in total on the node) for 1 hour in the shortq partition.
Note
--cpus-per-task should be no more than the number of cores on a compute node you request. You may want to experiment with the number of threads for your job to determine the optimal number, as computational speed does not always increase with more threads. Note that if --cpus-per-task is fewer than the number of cores on a node, your job will not make full use of the node.
Note
For a program that can only use a single CPU, requesting more CPUs will NOT make it run faster but will likely make the queuing time longer.
Example:
#!/bin/bash
#SBATCH --job-name=pilot_study
#SBATCH --partition=shortq
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=6G
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=8
#For jobs supporting OpenMP, assign the number of requested CPU cores to the OMP_NUM_THREADS variable,
#which is automatically passed to your command supporting OpenMP
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
command1 ...
#For jobs not supporting OpenMP, supply the value of the requested CPU cores as command-line argument to the command
command2 -t ${SLURM_CPUS_PER_TASK} ...
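Before committing to a long batch run, the same multi-threaded request can be tried out interactively with srun, a standard SLURM command; this is a sketch assuming interactive sessions are allowed on the shortq partition.

# Request an interactive shell with the same resources as the batch script above
srun --partition=shortq --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=6G --time=01:00:00 --pty bash
# Inside the session, SLURM_CPUS_PER_TASK is set to 8 and can be passed to your program
echo ${SLURM_CPUS_PER_TASK}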
Running MPI jobs using multiple nodes¶
Message Passing Interface (MPI) is a standardized and portable message-passing standard designed to allow execution of programs using CPUs on multiple nodes, where CPUs across nodes communicate over the network. The MPI standard defines the syntax and semantics of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. Intel MPI and Open MPI are available on the Madhava HPC system, and SLURM jobs may make use of either MPI implementation.
Note
Requesting multiple nodes and/or loading MPI modules will not necessarily make your code faster; your code must be MPI-aware to use MPI. Even though running a non-MPI code with mpirun might succeed, you will most likely have every core assigned to your job running the exact same computation, duplicating each other's work and wasting resources.
Note
The version of the MPI commands you run must match the version of the MPI library used to compile your code, or your job is likely to fail. The version of the MPI daemons started on all the nodes for your job must also match. For example, an MPI program compiled with the Intel MPI compilers should be executed using the Intel MPI runtime instead of the Open MPI runtime.
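To confirm which MPI implementation your environment currently provides before submitting, you can inspect the loaded modules and the mpirun found on your PATH:

# List the environment modules currently loaded
module list
# Show which mpirun will be used and report its version (Intel MPI and Open MPI print different banners)
which mpirun
mpirun --version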
The SLURM batch script below requests an allocation of 40 tasks (MPI processes), each using a single core, spread over two nodes, with 3GB RAM per core, for 1 hour in the mediumq partition.
#!/bin/bash
#SBATCH --job-name=run_mpi
#SBATCH --partition=mediumq
#SBATCH --ntasks=40
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3G
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
cd ${SLURM_SUBMIT_DIR}
#Load the environment for Intel MPI
module load compilers/intel/parallel_studio_xe_2020.4.912
#run the program supporting MPI with the "mpirun" command
#The -n option is not required since mpirun will automatically determine it from the SLURM settings
mpirun ./program_mpi
This example makes use of all the cores on two 20-core nodes in the "mediumq" partition. If the same number of tasks (i.e. 40) is requested from the "amd" partition, you should set "--nodes=1" so that all 40 cores are allocated from a single AMD (64-core or 128-core) node. Otherwise, SLURM will assign the 40 CPUs from 2 compute nodes, which would induce unnecessary inter-node communication overhead.
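As an illustration of the advice above, the changed directives for the "amd" partition could look like the sketch below (only the directives that differ are shown; verify the partition name against the cluster's actual partition list).

#SBATCH --partition=amd       # AMD nodes with 64 or 128 cores each, as described above
#SBATCH --ntasks=40
#SBATCH --nodes=1             # keep all 40 tasks on a single node to avoid inter-node MPI traffic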
Running hybrid OpenMP/MPI jobs using multiple nodes¶
For jobs that support both OpenMP and MPI, a SLURM batch script may specify the number of MPI tasks to run and the number of CPU cores that each task should use. The SLURM batch script below requests an allocation of 2 nodes and 80 CPU cores in total for 1 hour in mediumq. Each compute node runs 2 MPI tasks, where each MPI task uses 20 CPU cores and each core uses 3GB RAM. This would make use of all the cores on two 40-core nodes in the "intel" partition.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=3G
#SBATCH --time=01:00:00
cd ${SLURM_SUBMIT_DIR}
#Load the environment for Intel MPI
module load compilers/intel/parallel_studio_xe_2020.4.912
#assign the value of the requested CPU cores per task to the OMP_NUM_THREADS variable
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# run the program supporting MPI with the "mpirun" command.
# The -n option is not required since mpirun will automatically determine it from the SLURM settings
mpirun ./program_mpi-omp
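To verify that the hybrid layout matches what was requested, the task and thread counts can be printed from inside the job, for example by adding the following lines to the script:

# Report how the allocation was split between nodes, MPI tasks and OpenMP threads
echo "Nodes: ${SLURM_JOB_NUM_NODES}, tasks per node: ${SLURM_NTASKS_PER_NODE}, threads per task: ${OMP_NUM_THREADS}"
# Count how many tasks were placed on each node
srun hostname | sort | uniq -c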
Sample MPI & hybrid MPI/OpenMP codes and the corresponding SLURM scripts are available at.............................
Running jobs using GPU¶
Note
❗Your code must be GPU-aware to benefit from nodes with GPUs; otherwise, a partition without GPUs should be used.
The SLURM batch script below requests 8 CPU cores and 2 GPU cards from one compute node in the "gpuq" partition.
#!/bin/bash
#SBATCH --job-name=run_gpu
#SBATCH --partition=gpuq
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2          # Request 2 GPU cards (assumed GRES syntax; check the cluster's GPU configuration)
## Load the environment module for Nvidia CUDA
module load compilers/nvidia/cuda/11.0
commands
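Inside a GPU job, it is worth confirming which cards SLURM has made visible; a minimal check using the NVIDIA nvidia-smi utility, typically placed before the main commands:

# GPUs allocated by SLURM are exposed through CUDA_VISIBLE_DEVICES
echo "CUDA_VISIBLE_DEVICES = ${CUDA_VISIBLE_DEVICES}"
# List the visible GPU cards and their current utilisation
nvidia-smi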