Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions.
Nodes and partitions
Before submitting any job to Pollux, you must learn about the available resources. To do that, we use the command `sinfo`:
```
[tux@pollux]$ sinfo
HOSTNAMES  PARTITION      AVAIL  CPUS(A/I/O/T)  CPU_LOAD  ALLOCMEM  FREE_MEM  GRES    STATE  TIMELIMIT
pollux1    chalawan_gpu   up     0/24/0/24      3.68      0         54028     gpu:4   idle   infinite
pollux2    chalawan_gpu   up     0/28/0/28      3.71      0         246330    gpu:4   idle   infinite
pollux3    chalawan_gpu   up     0/28/0/28      3.60      0         246343    gpu:4   idle   infinite
castor1    chalawan_cpu*  up     0/16/0/16      0.01      0         55444     (null)  idle   infinite
castor2    chalawan_cpu*  up     0/16/0/16      0.01      0         55434     (null)  idle   infinite
castor3    chalawan_cpu*  up     0/16/0/16      0.01      0         55455     (null)  idle   infinite
```
Here we introduce a new field, PARTITION. A partition is a specific group of compute nodes; the suffix "*" identifies the default partition. AVAIL shows a partition's state (up or down), while CPUS(A/I/O/T) breaks down each node's CPUs into allocated/idle/other/total.
Basic job submission
We use the command `sbatch` to submit a batch script to Slurm:
```
[tux@pollux]$ sbatch [OPTIONS...] executable [args...]
```
The batch script may contain options, each preceded with `#SBATCH`, placed before any executable commands in the script. For example, we create a simple batch script that prints "Hello World!":
```
#!/bin/bash
#SBATCH -J task1        # Job name
#SBATCH -t 00:01:00     # Run time (hh:mm:ss)

echo "Hello World!"
```
After submission with the command `sbatch task1.slurm`, if there is an empty slot, your task will run and exit almost instantly. You will find the output file, slurm-%j.out, in the current working directory, where %j is replaced with the job allocation number. The words "Hello World!" appear inside that output file. By default, both standard output and standard error are directed to the same file.
```
[tux@pollux]$ sbatch ./task1.slurm
Submitted batch job 128
[tux@pollux]$ cat ./slurm-128.out
Hello World!
```
Frequently used sbatch options
There are many options you can add to a script file. The frequently used options are listed below; each must be preceded with `#SBATCH`. For other available options, see the Slurm website, run `sbatch -h`, or read `man sbatch`.
| Option | Description |
|---|---|
| `-J, --job-name=<name>` | name of job |
| `-N, --nodes=<N>` | number of nodes on which to run (N = min[-max]) |
| `-n, --ntasks=<count>` | number of tasks to run |
| `-c, --cpus-per-task=<ncpus>` | number of CPUs required per task |
| `-e, --error=<err>` | file for batch script's standard error |
| `-o, --output=<out>` | file for batch script's standard output |
| `-p, --partition=<partition>` | partition requested |
| `-t, --time=<minutes>` | time limit |
| `--mem=<MB>` | minimum amount of real memory |
| `--gres=<list>` | required generic resources |
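As a sketch, several of the options above can be combined in one batch script. The job name, file names, and resource amounts here are illustrative, not prescribed values:

```shell
#!/bin/bash
#SBATCH -J demo             # Job name
#SBATCH -N 1                # Number of nodes
#SBATCH -n 4                # Number of tasks
#SBATCH -o demo.%j.out      # Standard output file
#SBATCH -e demo.%j.err      # Standard error file
#SBATCH -p chalawan_cpu     # Partition
#SBATCH -t 00:10:00         # Time limit (hh:mm:ss)
#SBATCH --mem=4096          # Minimum real memory (MB)

echo "Running on $SLURM_JOB_NODELIST"
```

With `-o` and `-e` set separately, standard output and standard error go to different files instead of the default combined slurm-%j.out.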
Slurm filename patterns
| Filename pattern | Description |
|---|---|
| `\\` | Do not process any of the replacement symbols. |
| `%%` | The character "%". |
| `%A` | Job array's master job allocation number. |
| `%a` | Job array ID (index) number. |
| `%J` | jobid.stepid of the running job (e.g. "128.0"). |
| `%j` | jobid of the running job. |
| `%N` | Short hostname. This will create a separate IO file per node. |
| `%n` | Node identifier relative to current job (e.g. "0" is the first node of the running job). This will create a separate IO file per node. |
| `%s` | stepid of the running job. |
| `%t` | Task identifier (rank) relative to current job. This will create a separate IO file per task. |
| `%u` | User name. |
| `%x` | Job name. |
A number placed between the per cent character and format specifier may be used to zero-pad the result in the IO filename. This number is ignored if the format specifier corresponds to non-numeric data (%N for example). Some examples of how the format string may be used for a 4 task job step name “task1” with a Job ID of 128 are included below:
```
#SBATCH -o %x.%4j.out     # task1.0128.out
#SBATCH -o %x.%J.out      # task1.128.0.out
#SBATCH -o %x.%j-%2t.out  # task1.128-00.out, task1.128-01.out, ...
```
Slurm environment variables
Input environment variables
Upon startup, sbatch reads and handles the options set in its input environment variables. Note that environment variables will override any options set in a batch script, and command line options will override any environment variables. The full details are in the sbatch manual (`man sbatch`), section "INPUT ENVIRONMENT VARIABLES". For example, we have a script named task1.slurm:
```
#!/bin/bash
#SBATCH -J task1        # Job name
#SBATCH -t 00:01:00     # Run time (hh:mm:ss)

echo "Hello World!"
```
The default partition is chalawan_cpu, but if we want to submit the job to chalawan_gpu instead, we can do either
```
[tux@pollux]$ sbatch -p chalawan_gpu ./task1.slurm
```
or
```
[tux@pollux]$ export SBATCH_PARTITION="chalawan_gpu"
[tux@pollux]$ sbatch ./task1.slurm
```
Output environment variables
There are also output environment variables, set in the batch script's environment by the Slurm controller, e.g. SLURM_JOB_ID and SLURM_CPUS_ON_NODE. For the full details, see "OUTPUT ENVIRONMENT VARIABLES" in the sbatch manual (`man sbatch`). You may combine them with your script for convenience. The example below shows the results when we print out some of these values.
```
[tux@pollux]$ cat ./echo.slurm
#!/bin/bash
#SBATCH -J echo         # Job name
#SBATCH -o %x-%j.out    # Name of stdout output file

echo "Job name: $SLURM_JOB_NAME"
echo "Job ID: $SLURM_JOB_ID"
[tux@pollux]$ sbatch ./echo.slurm
Submitted batch job 130
[tux@pollux]$ cat ./echo-130.out
Job name: echo
Job ID: 130
```
Job status
However, if there is no empty slot, the job will be listed in a pending (PD) state. You can view the queue by using the `squeue` command:
```
[tux@pollux]$ squeue
 JOBID PARTITION  NAME   USER   ST  TIME     NODES NODELIST(REASON)
 1587  chalawan_  task2  gentoo PD  0:00     1     (Priority)
 1585  chalawan_  task1  tux    PD  0:00     1     (Resources)
 1584  chalawan_  task0  adelie R   3:21:49  1     pollux3
```
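squeue also accepts filters, which are handy on a busy cluster. A couple of sketches (the user name is illustrative):

```shell
# Show only your own jobs
squeue -u tux

# Show only pending jobs, with the reason they are waiting
squeue -t PD
```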
Single-core job
A single-core task requires a few tweaks to the basic batch script. The user may request more memory, a higher time limit, a specific partition, etc.
```
#!/bin/bash
#SBATCH -J single_core          # Job name
#SBATCH -o ../../log/%x.%j.out  # Name of stdout output file (%x and %j
                                # expand to jobName and jobId)
#SBATCH -p chalawan_gpu         # Name of a particular partition
#SBATCH -w pollux1              # Name of a compute node
#SBATCH -t 120:00:00            # Run time (hh:mm:ss)

<program> [ <arguments> ]
```
Running a job on a GPU
In order to run a job on a GPU, specify the GPU partition (`-p chalawan_gpu`) and select the generic consumable resource (`--gres`):
```
#SBATCH -p chalawan_gpu     # Name of gpu partition
#SBATCH --gres=gpu:1        # Name of the generic consumable resource
```
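Putting it together, a minimal GPU batch script might look like the sketch below. The job name and time limit are illustrative; `nvidia-smi` is used only as a quick check that a GPU was actually granted:

```shell
#!/bin/bash
#SBATCH -J gpu_demo         # Job name
#SBATCH -p chalawan_gpu     # GPU partition
#SBATCH --gres=gpu:1        # Request one GPU
#SBATCH -t 00:05:00         # Run time (hh:mm:ss)

# List the GPU(s) visible to this job; Slurm restricts the job
# to the devices granted by --gres
nvidia-smi
```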
Parallel job
A shared-memory job runs multiple processes or threads that share memory on one machine. The user should write a script that requests a single compute node, keeping within the maximum number of threads on each machine:
- 16 on the partition chalawan_cpu
- 24 on the node pollux1
- 28 on the nodes pollux2 and pollux3.
This mode is recommended for programs written with OpenMP directives or C/C++ multi-threading. An example script is displayed here:
```
#!/bin/bash
#SBATCH -J shared           # Job name
#SBATCH -N 1                # Total number of nodes requested
#SBATCH -n 16               # Total number of mpi tasks
#SBATCH -t 120:00:00        # Run time (hh:mm:ss)

mpirun -np 16 -ppn 16 [ options ] <program> [ <args> ]
```
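For a pure OpenMP program that does not use MPI at all, a sketch using `-c`/`--cpus-per-task` instead of multiple tasks may be cleaner. The program name `./omp_program` is hypothetical:

```shell
#!/bin/bash
#SBATCH -J omp_shared       # Job name
#SBATCH -N 1                # One node (shared memory)
#SBATCH -n 1                # One task
#SBATCH -c 16               # 16 CPUs for that task
#SBATCH -t 120:00:00        # Run time (hh:mm:ss)

# Match the thread count to the CPUs Slurm allocated,
# so the script stays consistent if -c changes
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omp_program
```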
Distributed memory
For distributed memory, each process has its own memory and does not share it with any other. A distributed-memory job can run across multiple compute nodes. It requires a program written with a specific parallel interface, e.g. the Message Passing Interface (MPI). Moreover, it requires additional setup to scatter the processes over compute nodes. Suppose we want to run a job with 16 processes, spawning 4 processes on each compute node; we may write:
```
#!/bin/bash
#SBATCH -J distributed      # Job name
#SBATCH -N 4                # Total number of nodes requested
#SBATCH -n 16               # Total number of mpi tasks
#SBATCH --ntasks-per-node=4 # Number of tasks per node
#SBATCH -t 120:00:00        # Run time (hh:mm:ss)

mpirun -np 16 -ppn 4 [ options ] <program> [ <args> ]
```
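Instead of repeating the numbers from the `#SBATCH` lines on the mpirun command, you can read them from Slurm's output environment variables, so the launch line never drifts out of sync with the resource request. A sketch:

```shell
# Inside the batch script: reuse the values Slurm derived
# from -n and --ntasks-per-node
mpirun -np $SLURM_NTASKS -ppn $SLURM_NTASKS_PER_NODE <program> [ <args> ]
```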
A hybrid job is suited for a program that supports both MPI and OpenMP. The job spawns main MPI processes that have memory allocated for them; these main processes spawn threads that share memory. The number of threads can be controlled with the environment variable OMP_NUM_THREADS. When launching with mpirun (MVAPICH2), set MV2_CPU_BINDING_POLICY to hybrid and MV2_HYBRID_BINDING_POLICY to linear so that each rank's threads are bound to consecutive cores:
```
#!/bin/bash
#SBATCH -J hybrid           # Job name
#SBATCH -N 2                # Total number of nodes requested
#SBATCH -n 4                # Total number of mpi tasks
#SBATCH --ntasks-per-node=2 # Number of tasks per node
#SBATCH -c 12               # Number of CPUs per task
#SBATCH -t 120:00:00        # Run time (hh:mm:ss)

mpirun -np 4 -ppn 2 \
       -genv MV2_CPU_BINDING_POLICY hybrid \
       -genv MV2_HYBRID_BINDING_POLICY linear \
       -genv OMP_NUM_THREADS 12 \
       <program> [ <args> ]
```
Interactive job
Slurm supports running an interactive job on a compute node via the command `srun`. The user can request an amount of memory or a generic consumable resource, e.g. a GPU. Here is an example:
```
[tux@pollux]$ srun --mem=4096 -p chalawan_gpu --gres=gpu:1 --pty /bin/bash
[tux@pollux1]$
```
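Once the prompt changes to the compute node, you can verify what the session was granted before starting real work, for example:

```shell
# Inside the interactive session on the compute node
echo $SLURM_JOB_ID   # job ID of this interactive session
nvidia-smi           # the GPU granted by --gres=gpu:1
exit                 # leave the session and release the resources
```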
If needed, the user can request an interactive session that supports disconnecting and reconnecting while the job is running. To do this, use the command `salloc` to request a resource allocation first, and then start an interactive session with the specific job ID:
```
[tux@pollux]$ salloc --ntasks=1 --mem=4096 --time=01:00:00
salloc: Granted job allocation 129
[tux@pollux]$ srun --jobid=129 --pty /bin/bash
[tux@castor1]$
```
Job deletion
To remove a running job or a pending job from the queue, use the `scancel` command with the job ID.
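For example (the job ID and user name here are illustrative):

```shell
# Cancel a single job by its job ID
scancel 128

# Cancel all jobs belonging to one user
scancel -u tux
```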
Job history
The command `sacct` displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.
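A couple of sketches (the job ID is illustrative); `--format` selects which accounting fields to print:

```shell
# Recent jobs for the current user
sacct

# Selected fields for one particular job
sacct -j 128 --format=JobID,JobName,Partition,State,Elapsed
```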
Summary
All Slurm commands start with 's':

- `sacct` is used to report job or job step accounting information about active or completed jobs.
- `salloc` is used to allocate resources for a job in real time. Typically this is used to allocate resources and spawn a shell. The shell is then used to execute srun commands to launch parallel tasks.
- `sbatch` is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
- `scancel` is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
- `sinfo` reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options.
- `squeue` reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
Frequently asked questions

Q: My job did not run and no error file was generated. What happened, and what should I do?

A: Your script file most likely points to an output directory that is wrong or has not been created yet. Slurm cannot create the file given by -o/-e inside a nonexistent directory, so the job produces no output at all; create the directory before submitting.