When running parallel jobs, SLURM automatically sets up a default process affinity. This means that every task spawned by srun (each MPI rank in an MPI run) is pinned to a specific core or set of cores within each compute node.

However, the default affinity may not be what you expect, and depending on the application it can have a significant impact on performance.

Slurm Reference

All the options for setting up affinity in SLURM are described in the official documentation: https://slurm.schedmd.com/mc_support.html

Below are some examples of how the affinity is set up by default in different cases.

Understanding CPU layout

Every node has a total of 256 virtual cores (128 physical). Every virtual core has an ID, and the two hardware threads of a physical core have IDs N and N+128: for example, IDs 0 and 128 are the two hardware threads on the same physical core. See Atos HPCF: System overview for all the details.
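The numbering above can be turned into a small helper when reading the affinity listings on this page. This is only a sketch: it assumes the layout is exactly as described (128 physical cores, with the hardware threads of core N exposed as N and N+128), and the function names are made up for illustration.

```shell
#!/bin/bash
# Assumption: 128 physical cores per node, with the two hardware threads of
# physical core N exposed as CPU IDs N and N+128.
PHYSICAL_CORES=128

# Print the CPU ID of the other hardware thread on the same physical core.
sibling_thread() {
    local cpu=$1
    echo $(( (cpu + PHYSICAL_CORES) % (2 * PHYSICAL_CORES) ))
}

# Print the physical core index (0-127) that a CPU ID belongs to.
physical_core() {
    local cpu=$1
    echo $(( cpu % PHYSICAL_CORES ))
}

sibling_thread 0     # prints 128
sibling_thread 130   # prints 2
physical_core 189    # prints 61
```

On a real node, the kernel's own view is available in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list (and likewise for the other CPUs), which is the place to check if the layout assumption is in doubt.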

MPI single threaded execution

Default setup

This is the simplest case. Slurm defines affinity at the level of physical cores, but allows each task to use both hardware threads of its core. In this example, we run a 128-task MPI job with default settings on a single node:

MPI single threaded with default options

#!/bin/bash
#SBATCH -q np
#SBATCH -n 128
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
srun check-affinity

Affinity obtained

[MPI rank    0 Thread   0 Node at1-1000.bullx] Core affinity: 0,128
[MPI rank    1 Thread   0 Node at1-1000.bullx] Core affinity: 1,129
[MPI rank    2 Thread   0 Node at1-1000.bullx] Core affinity: 2,130
[MPI rank    3 Thread   0 Node at1-1000.bullx] Core affinity: 3,131
...
[MPI rank  124 Thread   0 Node at1-1000.bullx] Core affinity: 124,252
[MPI rank  125 Thread   0 Node at1-1000.bullx] Core affinity: 125,253
[MPI rank  126 Thread   0 Node at1-1000.bullx] Core affinity: 126,254
[MPI rank  127 Thread   0 Node at1-1000.bullx] Core affinity: 127,255
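The listings on this page come from the check-affinity tool. Where that tool is not at hand, a rough equivalent for a single process is to read the Cpus_allowed_list field that the Linux kernel publishes in /proc/self/status. The helper below is a minimal sketch; its name and the optional file argument are illustrative only.

```shell
#!/bin/bash
# Print the list of CPUs a process may run on, as reported by the Linux kernel.
# By default this reads /proc/self/status, i.e. the status of the awk process
# itself, which inherits its affinity from the calling shell. The optional
# argument only exists so the parsing can be tried on a saved copy.
allowed_cpus() {
    local status_file=${1:-/proc/self/status}
    awk '/^Cpus_allowed_list:/ {print $2}' "$status_file"
}

# Possible use inside a job (one line per task, unlabelled):
#   srun grep Cpus_allowed_list /proc/self/status
```

Unlike check-affinity, this reports only the process-wide mask and says nothing about individual OpenMP threads.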

Disabling multithread use

If you want to restrict each process to just one of the hardware threads, you may use the --hint=nomultithread option.

MPI Single threaded using only physical cores

#!/bin/bash
#SBATCH -q np
#SBATCH -n 128
#SBATCH --hint=nomultithread
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
srun check-affinity

Affinity obtained

[MPI rank    0 Thread   0 Node at1-1000.bullx] Core affinity: 0
[MPI rank    1 Thread   0 Node at1-1000.bullx] Core affinity: 1
[MPI rank    2 Thread   0 Node at1-1000.bullx] Core affinity: 2
[MPI rank    3 Thread   0 Node at1-1000.bullx] Core affinity: 3
...
[MPI rank  124 Thread   0 Node at1-1000.bullx] Core affinity: 124
[MPI rank  125 Thread   0 Node at1-1000.bullx] Core affinity: 125
[MPI rank  126 Thread   0 Node at1-1000.bullx] Core affinity: 126
[MPI rank  127 Thread   0 Node at1-1000.bullx] Core affinity: 127

With the nomultithread hint, the maximum number of threads per node is halved from 256 to 128.

Hybrid MPI + OpenMP execution

Default setup - Least ideal

Slurm will allocate a number of cores for each task, and will bind each process to a group of cores equal to the number of threads. However, every thread of a process may run on any of the cores in that process's group. In the following example we run a 32-task MPI program, with every rank spawning 4 threads.

MPI + OpenMP with default options

#!/bin/bash
#SBATCH -q np
#SBATCH -n 32
#SBATCH -c 4
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
srun check-affinity

Affinity obtained

[MPI rank    0 Thread   0 Node at1-1000.bullx] Core affinity: 0,1,128,129
[MPI rank    0 Thread   1 Node at1-1000.bullx] Core affinity: 0,1,128,129
[MPI rank    0 Thread   2 Node at1-1000.bullx] Core affinity: 0,1,128,129
[MPI rank    0 Thread   3 Node at1-1000.bullx] Core affinity: 0,1,128,129
[MPI rank    1 Thread   0 Node at1-1000.bullx] Core affinity: 2,3,130,131
[MPI rank    1 Thread   1 Node at1-1000.bullx] Core affinity: 2,3,130,131
[MPI rank    1 Thread   2 Node at1-1000.bullx] Core affinity: 2,3,130,131
[MPI rank    1 Thread   3 Node at1-1000.bullx] Core affinity: 2,3,130,131
...
[MPI rank   30 Thread   0 Node at1-1000.bullx] Core affinity: 60,61,188,189
[MPI rank   30 Thread   1 Node at1-1000.bullx] Core affinity: 60,61,188,189
[MPI rank   30 Thread   2 Node at1-1000.bullx] Core affinity: 60,61,188,189
[MPI rank   30 Thread   3 Node at1-1000.bullx] Core affinity: 60,61,188,189
[MPI rank   31 Thread   0 Node at1-1000.bullx] Core affinity: 62,63,190,191
[MPI rank   31 Thread   1 Node at1-1000.bullx] Core affinity: 62,63,190,191
[MPI rank   31 Thread   2 Node at1-1000.bullx] Core affinity: 62,63,190,191
[MPI rank   31 Thread   3 Node at1-1000.bullx] Core affinity: 62,63,190,191

In some cases with a low number of tasks/threads, it may be necessary to force that binding by defining the hint:

#SBATCH --hint=multithread

Binding OpenMP threads to single cores - Best if hyper-threading desired

If you want to bind every thread to a single core, then you may use the OpenMP variable OMP_PLACES.

MPI + OpenMP, binding individual threads

#!/bin/bash
#SBATCH -q np
#SBATCH -n 32
#SBATCH -c 4
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_PLACES=threads
srun check-affinity

Affinity obtained

[MPI rank    0 Thread   0 Node at1-1000.bullx] Core affinity: 0
[MPI rank    0 Thread   1 Node at1-1000.bullx] Core affinity: 128
[MPI rank    0 Thread   2 Node at1-1000.bullx] Core affinity: 1
[MPI rank    0 Thread   3 Node at1-1000.bullx] Core affinity: 129
[MPI rank    1 Thread   0 Node at1-1000.bullx] Core affinity: 2
[MPI rank    1 Thread   1 Node at1-1000.bullx] Core affinity: 130
[MPI rank    1 Thread   2 Node at1-1000.bullx] Core affinity: 3
[MPI rank    1 Thread   3 Node at1-1000.bullx] Core affinity: 131
...
[MPI rank   30 Thread   0 Node at1-1000.bullx] Core affinity: 60
[MPI rank   30 Thread   1 Node at1-1000.bullx] Core affinity: 188
[MPI rank   30 Thread   2 Node at1-1000.bullx] Core affinity: 61
[MPI rank   30 Thread   3 Node at1-1000.bullx] Core affinity: 189
[MPI rank   31 Thread   0 Node at1-1000.bullx] Core affinity: 62
[MPI rank   31 Thread   1 Node at1-1000.bullx] Core affinity: 190
[MPI rank   31 Thread   2 Node at1-1000.bullx] Core affinity: 63
[MPI rank   31 Thread   3 Node at1-1000.bullx] Core affinity: 191

As you can see, the 128 threads are packed onto 64 physical cores (two threads per core), so this does not use all the physical cores in the node.

Disabling multithread use - Best if no hyper-threading desired

If you want to avoid having two threads sharing the same physical core and maximise the use of all physical cores in the node, you may use the --hint=nomultithread option.

MPI + OpenMP, binding individual threads to physical cores

#!/bin/bash
#SBATCH -q np
#SBATCH -n 32
#SBATCH -c 4
#SBATCH --hint=nomultithread
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_PLACES=threads
srun check-affinity

Affinity obtained

[MPI rank    0 Thread   0 Node at1-1000.bullx] Core affinity: 0
[MPI rank    0 Thread   1 Node at1-1000.bullx] Core affinity: 1
[MPI rank    0 Thread   2 Node at1-1000.bullx] Core affinity: 2
[MPI rank    0 Thread   3 Node at1-1000.bullx] Core affinity: 3
[MPI rank    1 Thread   0 Node at1-1000.bullx] Core affinity: 4
[MPI rank    1 Thread   1 Node at1-1000.bullx] Core affinity: 5
[MPI rank    1 Thread   2 Node at1-1000.bullx] Core affinity: 6
[MPI rank    1 Thread   3 Node at1-1000.bullx] Core affinity: 7
...
[MPI rank   30 Thread   0 Node at1-1000.bullx] Core affinity: 120
[MPI rank   30 Thread   1 Node at1-1000.bullx] Core affinity: 121
[MPI rank   30 Thread   2 Node at1-1000.bullx] Core affinity: 122
[MPI rank   30 Thread   3 Node at1-1000.bullx] Core affinity: 123
[MPI rank   31 Thread   0 Node at1-1000.bullx] Core affinity: 124
[MPI rank   31 Thread   1 Node at1-1000.bullx] Core affinity: 125
[MPI rank   31 Thread   2 Node at1-1000.bullx] Core affinity: 126
[MPI rank   31 Thread   3 Node at1-1000.bullx] Core affinity: 127
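The difference between the hybrid setups comes down to simple arithmetic on how many physical cores the threads occupy. The sketch below reproduces it under the assumption, taken from the listings on this page, that with multithreading each task fills both hardware threads of a core before moving to the next one. The function name is hypothetical.

```shell
#!/bin/bash
# Number of physical cores occupied by a job step on one node.
# Assumption: each physical core provides 2 hardware threads, and with
# multithreading enabled a task's CPUs are packed two per physical core.
physical_cores_used() {
    local tasks=$1 cpus_per_task=$2 multithread=$3   # multithread: yes|no
    if [ "$multithread" = "yes" ]; then
        # A task needs ceil(cpus_per_task / 2) physical cores.
        echo $(( tasks * ((cpus_per_task + 1) / 2) ))
    else
        # One physical core per CPU requested.
        echo $(( tasks * cpus_per_task ))
    fi
}

physical_cores_used 32 4 yes   # prints 64
physical_cores_used 32 4 no    # prints 128
```

With 32 tasks of 4 threads each, the multithreaded layout therefore leaves half of the 128 physical cores idle, while --hint=nomultithread spreads the same 128 threads over all of them.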

Further customisation

Only recommended for expert users who want to have full control of how the binding and distribution is done.

If you wish to further customise how the binding and task/thread distribution is done, check the man pages for sbatch and srun, or check the online documentation.

You may use most flags either as sbatch options/directives or directly on the srun line. If passed to srun, they will override the configuration coming from the job. Note that srun will fail if it tries to use more resources than are allocated to the job.

You may use the --cpu-bind option to fine-tune how the binding is done. All the possible values are defined in the official man pages. These are the main highlights:

Binding can be disabled altogether passing --cpu-bind=no to srun.
You may see the actual binding mask applied by passing --cpu-bind=verbose.
Custom masks/maps can be defined and passed to srun with --cpu-bind=map_cpu and --cpu-bind=mask_cpu.
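As an illustration of a custom map, the sketch below builds the comma-separated CPU list that map_cpu expects, giving each task one CPU at a fixed stride. The helper name and the stride scheme are assumptions made for this example; only the --cpu-bind syntax itself comes from Slurm.

```shell
#!/bin/bash
# Build a comma-separated CPU list for --cpu-bind=map_cpu, one CPU per task,
# starting at CPU "first" and advancing by "stride" for each task.
make_cpu_map() {
    local ntasks=$1 first=$2 stride=$3
    local map="" i=0
    while [ "$i" -lt "$ntasks" ]; do
        # Prepend a comma to every entry except the first one.
        map="${map}${map:+,}$(( first + i * stride ))"
        i=$(( i + 1 ))
    done
    echo "$map"
}

make_cpu_map 4 0 4   # prints 0,4,8,12

# Hypothetical use inside a job, also printing the masks Slurm applies:
#   srun -n 4 --cpu-bind=verbose,map_cpu:$(make_cpu_map 4 0 4) check-affinity
```

Combining verbose with the map is a convenient way to confirm that the binding applied is the one intended.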

You can also control how the processes/threads are distributed and bound with the -m or --distribution option in srun. The default is block:cyclic. The first element refers to the distribution of tasks among nodes; in this case, block distributes them in such a way that consecutive tasks share a node. The second element is how the CPUs are distributed within the node. By default, cyclic distributes the CPUs allocated for binding to a given task consecutively from the same socket, and takes those for the next task from the next consecutive socket, in a round-robin fashion across sockets.
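To make the node-level part of the distribution concrete, the sketch below computes which node a given task lands on under block and under cyclic placement. It is a simplified model that assumes the same number of tasks on every node, and the function names are invented for the example. It does not attempt to model the socket-level second element.

```shell
#!/bin/bash
# Node index (starting at 0) for a task under block distribution:
# consecutive ranks fill one node before moving on to the next.
node_for_task_block() {
    local rank=$1 tasks_per_node=$2
    echo $(( rank / tasks_per_node ))
}

# Node index for a task under cyclic distribution:
# consecutive ranks go to consecutive nodes, round-robin.
node_for_task_cyclic() {
    local rank=$1 nnodes=$2
    echo $(( rank % nnodes ))
}

node_for_task_block 129 128   # prints 1
node_for_task_cyclic 129 2    # prints 1
node_for_task_block 5 128     # prints 0
node_for_task_cyclic 5 2      # prints 1
```

For a 256-task job on two nodes, block keeps ranks 0 to 127 together on the first node, whereas cyclic would alternate even and odd ranks between the two nodes.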
