Using Slurm Workload Manager

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Nodes and partitions

Before submitting a job to Pollux, check which resources are available. The command sinfo shows information about Compute nodes and partitions. When run, it prints a summary of the partitions and the state of their nodes. For more information about Pollux’s node configuration, see node configuration.

The sinfo output introduces a new field, PARTITION. A partition is a named group of Compute nodes. The suffix “*” identifies the default partition. AVAIL shows a partition’s state, up or down, while CPUS(A/I/O/T) shows the number of CPUs by state in the form “allocated/idle/other/total”.
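As a rough sketch, the commands below show the default sinfo view and a custom view limited to the three fields discussed above. The format string is an illustrative choice, not Pollux’s own configuration, and the guard only keeps the snippet harmless on a machine without Slurm.

```shell
# Fields for a custom view: partition (%P), availability (%a), CPUs by state (%C).
sinfo_format="%P %a %C"

if command -v sinfo >/dev/null 2>&1; then
    sinfo                      # default summary of partitions and nodes
    sinfo -o "$sinfo_format"   # PARTITION, AVAIL and CPUS(A/I/O/T) only
else
    echo "sinfo not found: run this on a Slurm login node"
fi
```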

Basic job submission

We use the command sbatch followed by a batch script to submit a job to Slurm. sbatch exits immediately after the script is successfully transferred to the Slurm controller and assigned a Slurm job ID. The batch script is not necessarily granted resources immediately; it may sit in the queue of pending jobs for some time before its required resources become available.

The batch script may contain options, each preceded by #SBATCH, placed before any executable command in the script. For example, we create a simple batch script called task1.slurm that prints the string “Hello World!”. The file looks like this:
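A minimal sketch of such a script, written from the shell with a here-document. The --job-name value is an assumption, added only to show where #SBATCH options go.

```shell
# Create task1.slurm; #SBATCH lines must come before the first command.
cat > task1.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=task1

echo "Hello World!"
EOF

# Submit on the cluster with: sbatch task1.slurm
```

Because #SBATCH lines are ordinary comments to bash, the script can also be run directly (bash task1.slurm) to check it before submitting.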

After you submit with the command sbatch task1.slurm, the task runs as soon as a slot is free and exits almost instantly. The output file, slurm-%j.out, appears in the current working directory, where %j is replaced with the job allocation number. The words “Hello World!” appear in that file. By default, both standard output and standard error are directed to the same file.

Frequently used sbatch options

There are many options you can add to a script file. The frequently used ones are listed below. Each option must be preceded by #SBATCH. For the other available options, see the Slurm website or run sbatch -h or man sbatch.

Option                          Description
-J, --job-name=<name>           name of job
-N, --nodes=<N>                 number of nodes on which to run (N = min[-max])
-n, --ntasks=<count>            number of tasks to run
-c, --cpus-per-task=<ncpus>     number of cpus required per task
-e, --error=<err>               file for batch script's standard error
-o, --output=<out>              file for batch script's standard output
-p, --partition=<partition>     partition requested
-t, --time=<minutes>            time limit
--mem=<MB>                      minimum amount of real memory
--gres=<list>                   required generic resources

Slurm filename patterns

sbatch allows for a filename pattern to contain one or more replacement symbols, which are a percent sign “%” followed by a letter (e.g. %j).

Filename pattern    Description
\\                  Do not process any of the replacement symbols.
%%                  The character "%".
%A                  Job array's master job allocation number.
%a                  Job array ID (index) number.
%J                  jobid.stepid of the running job (e.g. "128.0").
%j                  jobid of the running job.
%N                  Short hostname. This will create a separate IO file per node.
%n                  Node identifier relative to current job (e.g. "0" is the first node of the running job). This will create a separate IO file per node.
%s                  stepid of the running job.
%t                  Task identifier (rank) relative to current job. This will create a separate IO file per task.
%u                  User name.
%x                  Job name.

A number placed between the percent character and the format specifier may be used to zero-pad the result in the IO filename. This number is ignored if the format specifier corresponds to non-numeric data (%N, for example). Some examples of how the format string may be used for a 4-task job step named “task1” with a job ID of 128 are included below:
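The expansions can be mimicked in plain shell to see which names result. The sketch below does not call Slurm; it reproduces with printf what sbatch would substitute for job ID 128, assuming a step ID of 0 and tasks numbered 0 to 3.

```shell
job_id=128    # job ID from the text
step_id=0     # assumed step ID, so %J expands to "128.0"

# job%J.out -> job128.0.out
name_J="job${job_id}.${step_id}.out"

# job%4j.out -> job0128.out (job ID zero-padded to 4 digits)
name_4j=$(printf 'job%04d.out' "$job_id")

# job%j-%2t.out -> one file per task, task rank zero-padded to 2 digits
names_t=""
for task in 0 1 2 3; do
    names_t="$names_t$(printf 'job%d-%02d.out' "$job_id" "$task") "
done

echo "$name_J"
echo "$name_4j"
echo "$names_t"
```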

Slurm environment variables

Input environment variables

Upon startup, sbatch reads and handles the options set in its input environment variables. Note that environment variables override any options set in a batch script, and command line options override any environment variables. The full details are in the sbatch manual (man sbatch), section “INPUT ENVIRONMENT VARIABLES”. For example, suppose we have a script named task1:

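The default partition is chalawan_cpu, but suppose we want to submit the job to chalawan_gpu instead. We can either set the corresponding input environment variable (SBATCH_PARTITION) before calling sbatch, or pass the option -p chalawan_gpu on the command line.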
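A minimal sketch of the two alternatives, assuming the variable in question is SBATCH_PARTITION, the input environment variable that corresponds to -p. The sbatch calls are left as comments so the snippet can be pasted without submitting anything.

```shell
partition=chalawan_gpu

# Either: set the input environment variable before calling sbatch.
export SBATCH_PARTITION="$partition"
# sbatch task1

# Or: leave the variable unset and pass the option on the command line,
# which takes precedence over both the variable and the script.
# sbatch -p "$partition" task1
```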

Output environment variables

The Slurm controller also sets output environment variables inside the batch script, e.g. SLURM_JOB_ID and SLURM_CPUS_ON_NODE. For the full details, see “OUTPUT ENVIRONMENT VARIABLES” in the sbatch manual (man sbatch). You may use them in your script for convenience. The example below shows the results when we print out some of these values.
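One possible script that prints a few of these variables is sketched here. The file name env_demo.slurm and the choice of variables are assumptions; the fallback text “not set” appears only when the script is run outside a Slurm job.

```shell
cat > env_demo.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=env_demo

# Slurm sets these variables inside a running job.
echo "Job ID: ${SLURM_JOB_ID:-not set}"
echo "CPUs on node: ${SLURM_CPUS_ON_NODE:-not set}"
echo "Node list: ${SLURM_JOB_NODELIST:-not set}"
EOF

# Submit on the cluster with: sbatch env_demo.slurm
```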

Job status

If there is no empty slot when you submit, the job is listed in the pending (PD) state. You can view it with the command squeue. The output column ST stands for state; a running job is displayed with the state R. The Compute node on which the job is running is shown in the last column.
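For instance, the queue can be listed in full or restricted to one user with the -u option. This is a small sketch; the guard only keeps it harmless on a machine without Slurm.

```shell
# Fall back to id(1) if USER is not set in this shell.
user_name="${USER:-$(id -un)}"

if command -v squeue >/dev/null 2>&1; then
    squeue                   # every job in the queue
    squeue -u "$user_name"   # only your own jobs
else
    echo "squeue not found: run this on a Slurm login node"
fi
```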

Single-core job

A single-core job requires only a few tweaks to the basic batch script. You may request more memory, a higher time limit, a specific partition, and so on.
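A sketch of such a script follows. The memory and time values, the file names and the program name my_serial_program are placeholders, not recommendations for Pollux.

```shell
cat > single_core.slurm <<'EOF'
#!/bin/bash
#SBATCH -J single_core
#SBATCH -p chalawan_cpu
# One task on one CPU.
#SBATCH -n 1
#SBATCH -c 1
# 4000 MB of memory and a 120-minute time limit.
#SBATCH --mem=4000
#SBATCH -t 120
#SBATCH -o single_core-%j.out

echo "Started on $(uname -n)"
# ./my_serial_program   (hypothetical executable; replace with your own)
EOF

# Submit on the cluster with: sbatch single_core.slurm
```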

Running a job on a GPU

To run a job on a GPU, select the partition chalawan_gpu (-p) and request the generic consumable resource gpu (--gres). Note that there are 4 GPUs on each Compute node and 12 GPUs in total.
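A sketch of a one-GPU request is shown below. The count in gpu:1, the file names and my_gpu_program are assumptions. The script also assumes Slurm exports CUDA_VISIBLE_DEVICES for the granted GPUs, which is common when GPUs are configured as generic resources but depends on the cluster.

```shell
cat > gpu_job.slurm <<'EOF'
#!/bin/bash
#SBATCH -J gpu_job
#SBATCH -p chalawan_gpu
#SBATCH --gres=gpu:1
#SBATCH -o gpu_job-%j.out

# Report which GPU indices the job may use.
echo "GPUs visible to this job: ${CUDA_VISIBLE_DEVICES:-none}"
# ./my_gpu_program   (hypothetical executable; replace with your own)
EOF

# Submit on the cluster with: sbatch gpu_job.slurm
```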

Parallel job

Shared memory

A shared memory job runs multiple threads or processes that share memory on one machine. Write the script so that it requests a single Compute node, with at most the following number of threads per machine:

  • 16 on the partition chalawan_cpu
  • 24 on the node pollux1
  • 28 on the nodes pollux2 and pollux3.

This kind of job is recommended for programs written with OpenMP directives or C/C++ multi-threading. An example script is displayed here:
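A sketch for an OpenMP program using all 16 threads allowed on chalawan_cpu. The program name my_openmp_program is hypothetical, and the thread count is taken from SLURM_CPUS_PER_TASK, which Slurm sets when -c is given.

```shell
cat > openmp_job.slurm <<'EOF'
#!/bin/bash
#SBATCH -J openmp_job
#SBATCH -p chalawan_cpu
# One node, one task, 16 CPUs for that task.
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 16

# Match the OpenMP thread count to the CPUs granted to the task.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Using $OMP_NUM_THREADS threads"
# ./my_openmp_program   (hypothetical executable; replace with your own)
EOF

# Submit on the cluster with: sbatch openmp_job.slurm
```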

Distributed memory

In a distributed memory job, each process has its own memory and does not share it with any other process. Such a job can run across multiple Compute nodes. It requires a program written against a parallel programming interface such as the Message Passing Interface (MPI), and it needs additional setup to scatter the processes over the Compute nodes. Suppose we want to run a job with 16 processes, 4 on each Compute node. We may write:
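One way this request might be written is sketched below, using --ntasks-per-node to place 4 tasks on each of 4 nodes. The program name my_mpi_program is hypothetical. Whether mpirun picks up the node list from Slurm on its own depends on how the MPI library was built, so a host file or other site-specific setup may still be needed.

```shell
cat > mpi_job.slurm <<'EOF'
#!/bin/bash
#SBATCH -J mpi_job
# 4 nodes x 4 tasks per node = 16 MPI processes.
#SBATCH -N 4
#SBATCH --ntasks-per-node=4

mpirun -np 16 ./my_mpi_program
EOF

# Submit on the cluster with: sbatch mpi_job.slurm
```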

Hybrid shared/distributed memory

A hybrid job suits a program that supports both MPI and OpenMP. The job spawns main (MPI) processes, each with its own memory. Each main process then spawns subprocesses (threads) that share its memory. The number of subprocesses is controlled with the environment variable OMP_NUM_THREADS. In addition, you need to add two more environment variables to the mpirun command: MV2_CPU_BINDING_POLICY hybrid and MV2_HYBRID_BINDING_POLICY linear. Suppose we want to run a job with 4 MPI tasks distributed equally across 2 Compute nodes, where each task spawns 12 OpenMP threads. We can write:
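A sketch of that layout: 2 nodes, 2 MPI tasks per node, 12 CPUs per task. It assumes a Hydra-style mpirun (as shipped with MVAPICH2 and MPICH), where -genv NAME VALUE passes an environment variable to every process; my_hybrid_program is a hypothetical executable. Each node must offer 24 CPUs, so choose a partition whose nodes are large enough.

```shell
cat > hybrid_job.slurm <<'EOF'
#!/bin/bash
#SBATCH -J hybrid_job
# 2 nodes x 2 MPI tasks per node, 12 CPUs (OpenMP threads) per task.
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH -c 12

export OMP_NUM_THREADS=12

mpirun -np 4 \
    -genv MV2_CPU_BINDING_POLICY hybrid \
    -genv MV2_HYBRID_BINDING_POLICY linear \
    ./my_hybrid_program
EOF

# Submit on the cluster with: sbatch hybrid_job.slurm
```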

Interactive job

Slurm supports running an interactive job on a Compute node via the command srun. You can request an amount of memory or a generic consumable resource such as a GPU. Here is an example:
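A sketch of an interactive request for one GPU and 4000 MB of memory. The values are illustrative, and --pty attaches your terminal to a shell on the allocated node.

```shell
interactive_mem_mb=4000   # illustrative memory request in MB

if command -v srun >/dev/null 2>&1; then
    srun -p chalawan_gpu --gres=gpu:1 --mem="$interactive_mem_mb" --pty bash
else
    echo "srun not found: run this on a Slurm login node"
fi
```

Leaving the shell with exit ends the job and releases the resources.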

If needed, you can request an interactive session that lets you disconnect and reconnect while the job is running. To do this, first use the command salloc to request a resource allocation, and then start an interactive session with the specific job ID.
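One way to do this is sketched below, assuming salloc’s --no-shell option (allocate and return without opening a shell) and srun’s --jobid option (run inside an existing allocation). The job ID 128 and the resource values are hypothetical; use the ID that salloc reports.

```shell
job_id=128   # hypothetical; replace with the ID printed by salloc

if command -v salloc >/dev/null 2>&1; then
    # Step 1: allocate resources without opening a shell.
    salloc -n 1 --mem=4000 --no-shell
    # Step 2: open (or later reopen) a shell inside that allocation.
    srun --jobid="$job_id" --pty bash
else
    echo "salloc not found: run this on a Slurm login node"
fi
```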

Job deletion

To remove a running job or a pending job from the queue, please use the command scancel followed by the job id.
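For example, with a hypothetical job ID of 128 (the real ID is printed by sbatch and listed by squeue):

```shell
job_id=128   # hypothetical job ID

if command -v scancel >/dev/null 2>&1; then
    scancel "$job_id"
else
    echo "scancel not found: run this on a Slurm login node"
fi
```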

Job history

sacct displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.
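A sketch of a query for one job, with the columns chosen through --format. The job ID 128 and the selection of fields are illustrative.

```shell
job_id=128   # hypothetical job ID
sacct_fields="JobID,JobName,Partition,State,Elapsed,ExitCode"

if command -v sacct >/dev/null 2>&1; then
    sacct -j "$job_id" --format="$sacct_fields"
else
    echo "sacct not found: run this on a Slurm login node"
fi
```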

Summary

All Slurm commands start with ‘s’ followed by an abbreviation of the action word. Here we list the basic commands for submitting or deleting a job and for querying information about it.

sacct is used to report job or job step accounting information about active or completed jobs.

salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

sinfo reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options.

squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

Frequently asked question

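Q: My job does not run and no error file is generated. What happened, and what should I do?
A: Most likely your script points to the wrong directory, or to a directory that has not been created yet. Check the paths used in the script, such as those given to --output and --error, and create any missing directory before submitting again.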