Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Nodes and partitions

Before submitting a job to Pollux, check which resources are available. The command sinfo reports the state of the Compute nodes and partitions; its output looks like the listing below. For more information about Pollux’s node configuration, see node configuration.

[tux@pollux]$ sinfo
HOSTNAMES PARTITION     AVAIL CPUS(A/I/O/T) CPU_LOAD ALLOCMEM FREE_MEM GRES    STATE TIMELIMIT
pollux1   chalawan_gpu  up    0/24/0/24     3.68     0        54028    gpu:4   idle  infinite
pollux2   chalawan_gpu  up    0/28/0/28     3.71     0        246330   gpu:4   idle  infinite
pollux3   chalawan_gpu  up    0/28/0/28     3.60     0        246343   gpu:4   idle  infinite
castor1   chalawan_cpu* up    0/16/0/16     0.01     0        55444    (null)  idle  infinite
castor2   chalawan_cpu* up    0/16/0/16     0.01     0        55434    (null)  idle  infinite
castor3   chalawan_cpu* up    0/16/0/16     0.01     0        55455    (null)  idle  infinite

The listing introduces a new field, PARTITION. A partition is a named group of Compute nodes, and the suffix “*” marks the default partition. AVAIL shows whether a partition is up or down, while CPUS(A/I/O/T) shows the number of CPUs on the node by state, in the form “allocated/idle/other/total”.
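
As a small illustration of reading the CPUS(A/I/O/T) column, the sketch below splits one value copied from the listing above into its four counts. The variable names are only illustrative; on a live system the value would come from sinfo itself.

```shell
#!/bin/bash
# Value copied from the pollux1 row of the sinfo listing above.
cpus_field="0/24/0/24"

# Split on "/" into the four per-state CPU counts.
IFS=/ read -r allocated idle other total <<EOF
$cpus_field
EOF

echo "allocated=$allocated idle=$idle other=$other total=$total"
```

Here all 24 CPUs of pollux1 are idle, so a job asking for up to 24 CPUs on that node would not have to wait for CPUs to free up.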

Basic job submission

We use the command sbatch followed by a batch script to submit a job to Slurm. sbatch exits immediately after the script has been transferred to the Slurm controller and assigned a Slurm job ID. The batch script is not necessarily granted resources immediately; it may sit in the queue of pending jobs for some time before its required resources become available.

[tux@pollux]$ sbatch [OPTIONS...] executable [args...]

The batch script may contain options preceded with #SBATCH, which must appear before any executable commands in the script. For example, we create a simple batch script called task1.slurm that prints the string “Hello World!”. The file looks like this:

#!/bin/bash

#SBATCH -J task1             # Job name
#SBATCH -t 00:01:00          # Run time (hh:mm:ss)

echo "Hello World!"

After you submit the script with sbatch task1.slurm, the job starts as soon as a slot is free and finishes almost instantly. You will find the output file slurm-%j.out in the current working directory, where %j is replaced with the job allocation number. The words “Hello World!” appear in that output file. By default, both standard output and standard error are directed to the same file.

[tux@pollux]$ sbatch ./task1.slurm
Submitted batch job 128
[tux@pollux]$ cat ./slurm-128.out
Hello World!

Frequently used sbatch options

There are many options you can add to a script file. The frequently used options are listed below. Inside a script, each option must be preceded with #SBATCH. For the other available options, see the Slurm website or run sbatch -h or man sbatch.

Option	Description
`-J, --job-name=<name>`	name of job
`-N, --nodes=<N>`	number of nodes on which to run (N = min[-max])
`-n, --ntasks=<count>`	number of tasks to run
`-c, --cpus-per-task=<ncpus>`	number of cpus required per task
`-e, --error=<err>`	file for batch script’s standard error
`-o, --output=<out>`	file for batch script’s standard output
`-p, --partition=<partition>`	partition requested
`-t, --time=<minutes>`	time limit
`--mem=<MB>`	minimum amount of real memory
`--gres=<list>`	required generic resources

Slurm filename patterns

sbatch allows for a filename pattern to contain one or more replacement symbols, which are a percent sign “%” followed by a letter (e.g. %j).

Filename pattern	Description
`\\`	Do not process any of the replacement symbols.
`%%`	The character “%”.
`%A`	Job array’s master job allocation number.
`%a`	Job array ID (index) number.
`%J`	jobid.stepid of the running job. (e.g. “128.0”)
`%j`	jobid of the running job.
`%N`	short hostname. This will create a separate IO file per node.
`%n`	Node identifier relative to current job (e.g. “0” is the first node of the running job) This will create a separate IO file per node.
`%s`	stepid of the running job.
`%t`	task identifier (rank) relative to current job. This will create a separate IO file per task.
`%u`	User name.
`%x`	Job name.

A number placed between the percent character and the format specifier may be used to zero-pad the result in the IO filename. This number is ignored if the format specifier corresponds to non-numeric data (%N for example). Some examples of how the format string may be used for a 4-task job step of a job named “task1” with a Job ID of 128 are included below:

#SBATCH -o %x.%4j.out         task1.0128.out
#SBATCH -o %x.%J.out          task1.128.0.out
#SBATCH -o %x.%j-%2t.out      task1.128-00.out, task1.128-01.out, ...
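
The zero-padding rule can be mimicked with printf, which may help when predicting the file names a pattern will produce. This is only an illustration of the naming scheme, not how Slurm expands patterns internally; the variables job_name and job_id are stand-ins for the values Slurm would substitute.

```shell
#!/bin/bash
job_name="task1"   # what %x would expand to
job_id=128         # what %j would expand to

# %x.%4j.out: job ID zero-padded to 4 digits
padded_name=$(printf '%s.%04d.out' "$job_name" "$job_id")
echo "$padded_name"

# %x.%j-%2t.out: one file per task, rank zero-padded to 2 digits
for rank in 0 1 2 3; do
    printf '%s.%d-%02d.out\n' "$job_name" "$job_id" "$rank"
done
```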

Slurm environment variables

Input environment variables

Upon startup, sbatch reads and handles the options set in its input environment variables. Note that environment variables override any options set in a batch script, and command line options override any environment variables. The full details are in the sbatch manual (man sbatch), section “INPUT ENVIRONMENT VARIABLES”. For example, suppose we have a script named task1.slurm:

#!/bin/bash

#SBATCH -J task1              # Job name
#SBATCH -t 00:01:00           # Run time (hh:mm:ss)

echo "Hello World!"

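The default partition is chalawan_cpu. To submit the job to chalawan_gpu instead, we can either pass the partition on the command line with -p or export the environment variable SBATCH_PARTITION before submitting: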

[tux@pollux]$ sbatch -p chalawan_gpu ./task1.slurm

[tux@pollux]$ export SBATCH_PARTITION="chalawan_gpu"
[tux@pollux]$ sbatch ./task1.slurm

Output environment variables

The Slurm controller also sets output environment variables inside the batch script, e.g., SLURM_JOB_ID and SLURM_CPUS_ON_NODE. For the full details, see “OUTPUT ENVIRONMENT VARIABLES” in the sbatch manual (man sbatch). You may use them in your script for convenience. The example below prints some of these values.

[tux@pollux]$ cat ./echo.slurm
#!/bin/bash

#SBATCH -J echo                 # Job name
#SBATCH -o %x-%j.out            # Name of stdout output file

echo "Job name: $SLURM_JOB_NAME"
echo "Job ID: $SLURM_JOB_ID"
[tux@pollux]$ sbatch ./echo.slurm
[tux@pollux]$ cat ./echo-130.out
Job name: echo
Job ID: 130
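
When a script relies on these variables, it can be convenient to give them fallback values so the same file can be dry-run outside Slurm. The sketch below uses only the two variables named above; the job name envdemo and the fallback values ("none" and 1) are arbitrary placeholders.

```shell
#!/bin/bash

#SBATCH -J envdemo              # Job name (hypothetical)

# Use the value set by Slurm, or a placeholder when run by hand.
job_id="${SLURM_JOB_ID:-none}"
cpus_on_node="${SLURM_CPUS_ON_NODE:-1}"

echo "Job ID: $job_id"
echo "CPUs on node: $cpus_on_node"
```

Run directly with bash, the script prints the placeholders; submitted with sbatch, it should print the values assigned to the job.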

Job status

If there is no empty slot, the job is listed in the pending (PD) state. You can view it with the command squeue. The output column ST stands for state; a running job is displayed with the state R. The Compute node on which a job is running is shown in the last column, and for a pending job that column shows the reason it is waiting.

[tux@pollux]$ squeue
JOBID PARTITION     NAME    USER ST    TIME NODES NODELIST(REASON)
1587  chalawan_    task2  gentoo PD    0:00     1  (Priority)
1585  chalawan_    task1     tux PD    0:00     1  (Resources)
1584  chalawan_    task0  adelie R  3:21:49     1  pollux3
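
To count jobs by state, the ST column can be filtered with awk. The sketch below runs on the sample listing above, stored in a variable so that it works anywhere; on Pollux you would presumably pipe the output of squeue into the same awk program.

```shell
#!/bin/bash
# Sample listing copied from the squeue output above.
squeue_output='JOBID PARTITION     NAME    USER ST    TIME NODES NODELIST(REASON)
1587  chalawan_    task2  gentoo PD    0:00     1  (Priority)
1585  chalawan_    task1     tux PD    0:00     1  (Resources)
1584  chalawan_    task0  adelie R  3:21:49     1  pollux3'

# Skip the header (NR > 1) and count rows whose fifth column (ST) is PD.
pending=$(printf '%s\n' "$squeue_output" |
    awk 'NR > 1 && $5 == "PD" { n++ } END { print n + 0 }')

echo "pending jobs: $pending"
```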

Single-core job

A single-core job needs only a few tweaks to the basic batch script. You may request more memory, a longer time limit, a specific partition, and so on.

#!/bin/bash

#SBATCH -J single_core     # Job name
#SBATCH -o ../../log/%x.%j.out  # Name of stdout output file (%x and %j
                           # expands to jobName and jobId)
#SBATCH -p chalawan_gpu    # Name of a particular partition
#SBATCH -w pollux1         # Name of a Compute node
#SBATCH -t 120:00:00       # Run time (hh:mm:ss)

<program> [ <arguments> ]

Running a job on a GPU

To run a job on a GPU, you must select the partition chalawan_gpu (-p) and request the generic consumable resource gpu (--gres). Note that there are 4 GPUs on each Compute node of that partition, 12 GPUs in total.

#SBATCH -p chalawan_gpu    # Name of gpu partition
#SBATCH --gres=gpu:1       # Name of the generic consumable resource

Parallel job

Shared memory

A shared memory job runs multiple processes or threads that share memory on one machine. Write the script so that the job runs on a single Compute node and uses no more than the following number of threads per machine:

16 on the partition chalawan_cpu
24 on the node pollux1
28 on the nodes pollux2 and pollux3.

It is also recommended that the program be written with OpenMP directives or C/C++ multi-threading. An example script is shown here:

#!/bin/bash

#SBATCH -J shared     # Job name
#SBATCH -N 1          # Total number of nodes requested
#SBATCH -n 16         # Total number of mpi tasks
#SBATCH -t 120:00:00  # Run time (hh:mm:ss)

mpirun -np 16 -ppn 1 [ options ] <program> [ <args> ]

Distributed memory

In a distributed memory job, each process has its own memory and does not share it with any other process. A distributed memory job can run across multiple Compute nodes. It requires a program written against a parallel programming interface, e.g. the Message Passing Interface (MPI). Moreover, it requires additional setup to scatter the processes over the Compute nodes. Suppose we want to run a job with 16 processes, 4 on each of 4 Compute nodes; we may write:

#!/bin/bash

#SBATCH -J distributed # Job name
#SBATCH -N 4           # Total number of nodes requested
#SBATCH -n 16          # Total number of mpi tasks
#SBATCH --ntasks-per-node=4 # Total number of tasks per one node
#SBATCH -t 120:00:00   # Run time (hh:mm:ss)

mpirun -np 16 -ppn 4 [ options ] <program> [ <args> ]
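
In the script above, the three task-related options describe the same layout: 4 nodes times 4 tasks per node gives the 16 tasks passed to mpirun. When changing one of them, a quick arithmetic check such as the following sketch may catch a mismatch before submission. The variable names are only illustrative.

```shell
#!/bin/bash
nodes=4            # value given to -N
ntasks=16          # value given to -n and to mpirun -np
tasks_per_node=4   # value given to --ntasks-per-node and to mpirun -ppn

# The layout is consistent when nodes x tasks per node equals the task count.
if [ $(( nodes * tasks_per_node )) -eq "$ntasks" ]; then
    layout="consistent"
else
    layout="inconsistent"
fi

echo "Task layout is $layout: $nodes nodes x $tasks_per_node tasks per node, $ntasks tasks requested"
```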

Hybrid shared/distributed memory

A hybrid job suits a program that supports both MPI and OpenMP. The job spawns main processes (MPI tasks), each with its own memory. Each main process then spawns subprocesses (OpenMP threads) that share its memory. The number of subprocesses can be controlled with the environment variable OMP_NUM_THREADS. Moreover, you need to pass 2 more environment variables to the mpirun command: MV2_CPU_BINDING_POLICY hybrid and MV2_HYBRID_BINDING_POLICY linear. Suppose we want to run a job with 4 MPI tasks distributed equally across 2 Compute nodes, where each task spawns 12 OpenMP threads. We can write:

#!/bin/bash

#SBATCH -J hybrid      # Job name
#SBATCH -N 2           # Total number of nodes requested
#SBATCH -n 4           # Total number of mpi tasks
#SBATCH --ntasks-per-node=2 # Number of tasks per each node
#SBATCH -c 12          # Total number of CPUs per task
#SBATCH -t 120:00:00   # Run time (hh:mm:ss)

mpirun -np 4 -ppn 2 -genv MV2_CPU_BINDING_POLICY hybrid -genv MV2_HYBRID_BINDING_POLICY linear -genv OMP_NUM_THREADS 12 <program> [ <args> ]
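
It can help to check how many CPUs such a request needs on each node before submitting. The sketch below redoes the arithmetic for the script above; the variable names are illustrative. It also shows one possible way to derive the thread count from SLURM_CPUS_PER_TASK, which Slurm sets when -c is given, with a fallback for runs outside Slurm.

```shell
#!/bin/bash
nodes=2            # value given to -N
ntasks=4           # value given to -n
cpus_per_task=12   # value given to -c

tasks_per_node=$(( ntasks / nodes ))
cpus_per_node=$(( tasks_per_node * cpus_per_task ))
total_cpus=$(( ntasks * cpus_per_task ))

echo "CPUs needed per node: $cpus_per_node"
echo "CPUs needed in total: $total_cpus"

# Inside a job, prefer the value Slurm exports; otherwise use the local one.
threads="${SLURM_CPUS_PER_TASK:-$cpus_per_task}"
export OMP_NUM_THREADS="$threads"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

With these numbers each node must supply 24 CPUs. Judging by the sinfo listing earlier, that fits the pollux nodes (24 or 28 CPUs) but not the 16-CPU castor nodes, so a partition or node selection may be needed for this particular request.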

Interactive job

Slurm supports running an interactive job on a Compute node via the command srun. You can request an amount of memory or a generic consumable resource, e.g. a GPU. Here is an example:

[tux@pollux]$ srun --mem=4096 -p chalawan_gpu --gres=gpu:1 --pty /bin/bash
[tux@pollux1]$

If needed, you can request an interactive session that supports disconnecting and reconnecting while the job is running. To do this, first use the command salloc to request a resource allocation, and then start an interactive session with the resulting job ID.

[tux@pollux]$ salloc --ntasks=1 --mem=4096 --time=01:00:00
salloc: Granted job allocation 129
[tux@pollux]$ srun --jobid=129 --pty /bin/bash
[tux@castor1]$

Job deletion

To remove a running or pending job from the queue, use the command scancel followed by the job ID.

Job history

sacct displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.

Summary

All Slurm commands start with ‘s’ followed by an abbreviation of the action word. Here we list the basic commands for submitting or deleting a job and for querying information about it.

sacct is used to report job or job step accounting information about active or completed jobs.

salloc is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.

sinfo reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options.

squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.

Frequently asked question

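Q: My job does not run and no error file is generated. What happened, and what should I do?
A: The output path in your script file most likely points to the wrong directory, or to a directory that has not been created yet. Slurm does not create missing directories for output files, so create the directory before submitting the job.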
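
One way to avoid this is to make sure the output directory exists before submitting. A minimal sketch, assuming a hypothetical log directory ./log next to the script:

```shell
#!/bin/bash
log_dir="./log"    # hypothetical output directory

# -p creates missing parent directories and does nothing if the directory exists.
mkdir -p "$log_dir"

# The job can then write its output there, for example:
#   sbatch -o "$log_dir/%x.%j.out" ./task1.slurm
echo "Log directory ready: $log_dir"
```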