When running parallel jobs, Slurm automatically sets up a default process affinity. This means that every task spawned by srun
(each MPI rank in an MPI execution) is pinned to a specific core or set of cores within each compute node.
However, the default affinity may not be what you expect and, depending on the application, it can have a significant impact on performance.
All the options for setting up affinity in Slurm are described in the official documentation: https://slurm.schedmd.com/mc_support.html
Below are some examples of how the affinity is set up by default in different cases.
Every node has a total of 256 virtual cores (128 physical). Every virtual core has an ID, and IDs N and N+128 (for example 0 and 128) are the two hardware threads of the same physical core. See Atos HPCF: System overview for all the details.
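The numbering described above can be captured in a couple of small helpers. This is only an illustrative sketch: the function names physical_core and sibling_cpu are made up for this page, and the value 128 is the number of physical cores per node stated above.

```shell
#!/bin/bash
# Illustrative helpers (hypothetical names), assuming the node layout described
# above: 128 physical cores, with virtual cores N and N+128 on the same core.
PHYSICAL_CORES=128

# Print the physical core that a given virtual core ID belongs to.
physical_core() {
    echo $(( $1 % PHYSICAL_CORES ))
}

# Print the ID of the other hardware thread on the same physical core.
sibling_cpu() {
    echo $(( ($1 + PHYSICAL_CORES) % (2 * PHYSICAL_CORES) ))
}

physical_core 130   # prints 2
sibling_cpu 0       # prints 128
sibling_cpu 130     # prints 2
```

On a compute node, the kernel's own view can be compared with the contents of /sys/devices/system/cpu/cpu0/topology/thread_siblings_list, which should list both IDs if the numbering is as described.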
For these tests we will use David McKain's version of the Cray xthi code to visualise how processes and threads are placed.
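If xthi is not at hand, a rough equivalent for a quick check is to ask the Linux kernel which CPUs a process is allowed to run on. The sketch below is not part of xthi; show_affinity is a hypothetical helper name, and it assumes a Linux system exposing /proc/self/status. It only reports the allowed CPU list, not the CPU a thread is currently running on.

```shell
#!/bin/bash
# Print the CPUs the current process may run on, as seen by the Linux kernel.
# Assumption: Linux with /proc/self/status; otherwise "unknown" is reported.
show_affinity() {
    local cpus=""
    if [ -r /proc/self/status ]; then
        # awk is a child process and inherits the affinity of this shell.
        cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    fi
    # SLURM_PROCID is set by srun for each task; "?" is shown outside a job step.
    echo "Host=$(uname -n) Rank=${SLURM_PROCID:-?} CPU Affinity=${cpus:-unknown}"
}

show_affinity
```

Saved as a script (for example show_affinity.sh, a name chosen here for illustration), it could be launched in place of xthi with srun bash show_affinity.sh.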
This is the simplest case. Slurm will define affinity at the level of physical cores, but allow each task to use both hardware threads of its core. In this example, we run a 128-task MPI job with default settings, on a single node:
    #!/bin/bash
    #SBATCH -q np
    #SBATCH -n 128
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    ml xthi
    srun -c ${SLURM_CPUS_PER_TASK:-1} xthi

    Host=ac2-2083 MPI Rank= 0 CPU=128 NUMA Node=0 CPU Affinity= 0,128
    Host=ac2-2083 MPI Rank= 1 CPU= 1 NUMA Node=0 CPU Affinity= 1,129
    Host=ac2-2083 MPI Rank= 2 CPU=130 NUMA Node=0 CPU Affinity= 2,130
    Host=ac2-2083 MPI Rank= 3 CPU= 3 NUMA Node=0 CPU Affinity= 3,131
    ...
    Host=ac2-2083 MPI Rank=124 CPU=252 NUMA Node=7 CPU Affinity=124,252
    Host=ac2-2083 MPI Rank=125 CPU=125 NUMA Node=7 CPU Affinity=125,253
    Host=ac2-2083 MPI Rank=126 CPU=254 NUMA Node=7 CPU Affinity=126,254
    Host=ac2-2083 MPI Rank=127 CPU=127 NUMA Node=7 CPU Affinity=127,255
If you want to restrict each process to just one of the hardware threads, you may use the --hint=nomultithread option:
    #!/bin/bash
    #SBATCH -q np
    #SBATCH -n 128
    #SBATCH --hint=nomultithread
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    ml xthi
    srun -c ${SLURM_CPUS_PER_TASK:-1} xthi

    Host=ac1-2083 MPI Rank= 0 CPU= 0 NUMA Node=0 CPU Affinity= 0
    Host=ac1-2083 MPI Rank= 1 CPU= 1 NUMA Node=0 CPU Affinity= 1
    Host=ac1-2083 MPI Rank= 2 CPU= 2 NUMA Node=0 CPU Affinity= 2
    Host=ac1-2083 MPI Rank= 3 CPU= 3 NUMA Node=0 CPU Affinity= 3
    ...
    Host=ac1-2083 MPI Rank=124 CPU=124 NUMA Node=7 CPU Affinity=124
    Host=ac1-2083 MPI Rank=125 CPU=125 NUMA Node=7 CPU Affinity=125
    Host=ac1-2083 MPI Rank=126 CPU=126 NUMA Node=7 CPU Affinity=126
    Host=ac1-2083 MPI Rank=127 CPU=127 NUMA Node=7 CPU Affinity=127
With --hint=nomultithread, only one hardware thread per physical core is used, so the maximum number of tasks or threads per node is halved from 256 to 128.
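The effect of the hint on how many tasks fit on a node can be worked out with simple arithmetic. The helper below is a hypothetical illustration, not a Slurm tool; it assumes the 128 physical cores with 2 hardware threads each described on this page and ignores any other limits such as memory.

```shell
#!/bin/bash
# Illustrative arithmetic (hypothetical helper), assuming 128 physical cores
# with 2 hardware threads each, as on the nodes described on this page.
max_tasks_per_node() {
    local cpus_per_task=$1
    local hint=${2:-multithread}
    local usable_cpus=256
    if [ "$hint" = "nomultithread" ]; then
        usable_cpus=128          # one hardware thread per physical core
    fi
    echo $(( usable_cpus / cpus_per_task ))
}

max_tasks_per_node 1                  # prints 256
max_tasks_per_node 1 nomultithread    # prints 128
max_tasks_per_node 4 nomultithread    # prints 32
```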
Slurm will allocate a number of cores for each task, and will bind each process to a group of virtual cores whose size equals the number of threads. However, each thread may run on any of the cores in the group of its process. In the following example we run a 32-task MPI program, with every rank spawning 4 threads.
    #!/bin/bash
    #SBATCH -q np
    #SBATCH -n 32
    #SBATCH -c 4
    ml xthi
    srun -c ${SLURM_CPUS_PER_TASK:-1} xthi

    Host=ac1-1035 MPI Rank= 0 OMP Thread=0 CPU= 0 NUMA Node=0 CPU Affinity= 0,1,128,129
    Host=ac1-1035 MPI Rank= 0 OMP Thread=1 CPU=128 NUMA Node=0 CPU Affinity= 0,1,128,129
    Host=ac1-1035 MPI Rank= 0 OMP Thread=2 CPU= 0 NUMA Node=0 CPU Affinity= 0,1,128,129
    Host=ac1-1035 MPI Rank= 0 OMP Thread=3 CPU= 0 NUMA Node=0 CPU Affinity= 0,1,128,129
    Host=ac1-1035 MPI Rank= 1 OMP Thread=0 CPU= 2 NUMA Node=0 CPU Affinity= 2,3,130,131
    Host=ac1-1035 MPI Rank= 1 OMP Thread=1 CPU= 3 NUMA Node=0 CPU Affinity= 2,3,130,131
    Host=ac1-1035 MPI Rank= 1 OMP Thread=2 CPU=130 NUMA Node=0 CPU Affinity= 2,3,130,131
    Host=ac1-1035 MPI Rank= 1 OMP Thread=3 CPU=131 NUMA Node=0 CPU Affinity= 2,3,130,131
    ...
    Host=ac1-1081 MPI Rank=30 OMP Thread=0 CPU=124 NUMA Node=7 CPU Affinity=124,125,252,253
    Host=ac1-1081 MPI Rank=30 OMP Thread=1 CPU=253 NUMA Node=7 CPU Affinity=124,125,252,253
    Host=ac1-1081 MPI Rank=30 OMP Thread=2 CPU=252 NUMA Node=7 CPU Affinity=124,125,252,253
    Host=ac1-1081 MPI Rank=30 OMP Thread=3 CPU=125 NUMA Node=7 CPU Affinity=124,125,252,253
    Host=ac1-1081 MPI Rank=31 OMP Thread=0 CPU=255 NUMA Node=7 CPU Affinity=126,127,254,255
    Host=ac1-1081 MPI Rank=31 OMP Thread=1 CPU=126 NUMA Node=7 CPU Affinity=126,127,254,255
    Host=ac1-1081 MPI Rank=31 OMP Thread=2 CPU=127 NUMA Node=7 CPU Affinity=126,127,254,255
    Host=ac1-1081 MPI Rank=31 OMP Thread=3 CPU=254 NUMA Node=7 CPU Affinity=126,127,254,255
In some cases with a low number of tasks or threads, it may be necessary to force that binding by defining the hint:
|
If you want to bind every thread to a single core, then you may use the OpenMP variable OMP_PLACES.
    #!/bin/bash
    #SBATCH -q np
    #SBATCH -n 32
    #SBATCH -c 4
    export OMP_PLACES=threads
    ml xthi
    srun -c ${SLURM_CPUS_PER_TASK:-1} xthi

    Host=ac2-1078 MPI Rank= 0 OMP Thread=0 CPU= 0 NUMA Node=0 CPU Affinity= 0
    Host=ac2-1078 MPI Rank= 0 OMP Thread=1 CPU=128 NUMA Node=0 CPU Affinity=128
    Host=ac2-1078 MPI Rank= 0 OMP Thread=2 CPU= 1 NUMA Node=0 CPU Affinity= 1
    Host=ac2-1078 MPI Rank= 0 OMP Thread=3 CPU=129 NUMA Node=0 CPU Affinity=129
    Host=ac2-1078 MPI Rank= 1 OMP Thread=0 CPU= 2 NUMA Node=0 CPU Affinity= 2
    Host=ac2-1078 MPI Rank= 1 OMP Thread=1 CPU=130 NUMA Node=0 CPU Affinity=130
    Host=ac2-1078 MPI Rank= 1 OMP Thread=2 CPU= 3 NUMA Node=0 CPU Affinity= 3
    Host=ac2-1078 MPI Rank= 1 OMP Thread=3 CPU=131 NUMA Node=0 CPU Affinity=131
    ...
    Host=ac2-1078 MPI Rank=30 OMP Thread=0 CPU=116 NUMA Node=7 CPU Affinity=116
    Host=ac2-1078 MPI Rank=30 OMP Thread=1 CPU=244 NUMA Node=7 CPU Affinity=244
    Host=ac2-1078 MPI Rank=30 OMP Thread=2 CPU=117 NUMA Node=7 CPU Affinity=117
    Host=ac2-1078 MPI Rank=30 OMP Thread=3 CPU=245 NUMA Node=7 CPU Affinity=245
    Host=ac2-1078 MPI Rank=31 OMP Thread=0 CPU=118 NUMA Node=7 CPU Affinity=118
    Host=ac2-1078 MPI Rank=31 OMP Thread=1 CPU=246 NUMA Node=7 CPU Affinity=246
    Host=ac2-1078 MPI Rank=31 OMP Thread=2 CPU=119 NUMA Node=7 CPU Affinity=119
    Host=ac2-1078 MPI Rank=31 OMP Thread=3 CPU=247 NUMA Node=7 CPU Affinity=247
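As you can see, this does not use all the physical cores in the node: the four threads of each rank are pinned to both hardware threads of just two physical cores, so the 32 ranks occupy only 64 of the 128 physical cores.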
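One way to check this on real output is to count how many distinct physical cores appear in the CPU= column. The following is a minimal sketch, not part of xthi: count_physical_cores is a hypothetical helper, and it assumes the output format shown on this page and the numbering where virtual cores N and N+128 share a physical core.

```shell
#!/bin/bash
# Count distinct physical cores used in xthi-style output read from stdin.
# Assumption: 128 physical cores per node, siblings numbered N and N+128.
count_physical_cores() {
    awk -v ncores="${1:-128}" '
        match($0, /CPU= *[0-9]+/) {
            cpu = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", cpu)          # keep only the digits
            host = $1                         # first field, e.g. Host=ac2-1078
            seen[host ":" (cpu % ncores)] = 1 # one entry per host and core
        }
        END { n = 0; for (k in seen) n++; print n }
    '
}

# The four threads of rank 0 from the output above share two physical cores.
count_physical_cores <<'EOF'
Host=ac2-1078 MPI Rank= 0 OMP Thread=0 CPU= 0 NUMA Node=0 CPU Affinity= 0
Host=ac2-1078 MPI Rank= 0 OMP Thread=1 CPU=128 NUMA Node=0 CPU Affinity=128
Host=ac2-1078 MPI Rank= 0 OMP Thread=2 CPU= 1 NUMA Node=0 CPU Affinity= 1
Host=ac2-1078 MPI Rank= 0 OMP Thread=3 CPU=129 NUMA Node=0 CPU Affinity=129
EOF
```

For the four lines given here the function prints 2. In a job, the complete xthi output could be piped into it instead of the sample lines.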
If you want to avoid having two threads sharing the same physical core and maximise the use of all physical cores in the node, you may use the --hint=nomultithread option.
    #!/bin/bash
    #SBATCH -q np
    #SBATCH -n 32
    #SBATCH -c 4
    #SBATCH --hint=nomultithread
    export OMP_PLACES=threads
    ml xthi
    srun -c ${SLURM_CPUS_PER_TASK:-1} xthi

    Host=ac2-1078 MPI Rank= 0 OMP Thread=0 CPU= 0 NUMA Node=0 CPU Affinity= 0
    Host=ac2-1078 MPI Rank= 0 OMP Thread=1 CPU= 1 NUMA Node=0 CPU Affinity= 1
    Host=ac2-1078 MPI Rank= 0 OMP Thread=2 CPU= 2 NUMA Node=0 CPU Affinity= 2
    Host=ac2-1078 MPI Rank= 0 OMP Thread=3 CPU= 3 NUMA Node=0 CPU Affinity= 3
    Host=ac2-1078 MPI Rank= 1 OMP Thread=0 CPU= 4 NUMA Node=0 CPU Affinity= 4
    Host=ac2-1078 MPI Rank= 1 OMP Thread=1 CPU= 5 NUMA Node=0 CPU Affinity= 5
    Host=ac2-1078 MPI Rank= 1 OMP Thread=2 CPU= 6 NUMA Node=0 CPU Affinity= 6
    Host=ac2-1078 MPI Rank= 1 OMP Thread=3 CPU= 7 NUMA Node=0 CPU Affinity= 7
    ...
    Host=ac2-1078 MPI Rank=30 OMP Thread=0 CPU=120 NUMA Node=7 CPU Affinity=120
    Host=ac2-1078 MPI Rank=30 OMP Thread=1 CPU=121 NUMA Node=7 CPU Affinity=121
    Host=ac2-1078 MPI Rank=30 OMP Thread=2 CPU=122 NUMA Node=7 CPU Affinity=122
    Host=ac2-1078 MPI Rank=30 OMP Thread=3 CPU=123 NUMA Node=7 CPU Affinity=123
    Host=ac2-1078 MPI Rank=31 OMP Thread=0 CPU=124 NUMA Node=7 CPU Affinity=124
    Host=ac2-1078 MPI Rank=31 OMP Thread=1 CPU=125 NUMA Node=7 CPU Affinity=125
    Host=ac2-1078 MPI Rank=31 OMP Thread=2 CPU=126 NUMA Node=7 CPU Affinity=126
    Host=ac2-1078 MPI Rank=31 OMP Thread=3 CPU=127 NUMA Node=7 CPU Affinity=127
This is only recommended for expert users who want full control of how the binding and distribution are done.
If you wish to further customise how the binding and task/thread distribution is done, check the man pages for sbatch and srun, or check the online documentation.
You may also use most flags as both |
You may use the --cpu-bind option to fine-tune how the binding is done. All the possible values are defined in the official man pages. These are the main highlights:
To disable the CPU binding done by Slurm, pass --cpu-bind=no to srun.
To have the binding applied to each task reported, use --cpu-bind=verbose.
To bind tasks to explicit CPU IDs or CPU masks, call srun with --cpu-bind=map_cpu and --cpu-bind=mask_cpu.
You can also control how the processes/threads are distributed and bound with the -m or --distribution option in srun. The default is block:cyclic. The first element refers to the distribution of tasks among nodes; in this case, block distributes them so that consecutive tasks share a node. The second element controls how the CPUs are distributed within the node. The default, cyclic, distributes allocated CPUs for binding to a given task consecutively from the same socket, and from the next consecutive socket for the next task, in a round-robin fashion across sockets.
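To make the first element of the distribution more concrete, the sketch below prints which node each task would land on for a block and for a cyclic distribution among nodes. It is a simplified model written for this page, not Slurm's actual implementation: task_nodes is a hypothetical helper, nodes are numbered from 0, and the number of tasks is assumed to be a multiple of the number of nodes.

```shell
#!/bin/bash
# Simplified model (not Slurm's implementation) of the first element of
# --distribution: the node index each task is placed on.
# Assumption: the number of tasks is a multiple of the number of nodes.
task_nodes() {
    local method=$1 ntasks=$2 nnodes=$3
    local per_node=$(( ntasks / nnodes ))
    local task node out=""
    for (( task = 0; task < ntasks; task++ )); do
        if [ "$method" = "block" ]; then
            node=$(( task / per_node ))   # consecutive tasks share a node
        else
            node=$(( task % nnodes ))     # cyclic: round-robin over nodes
        fi
        out+="${node} "
    done
    echo "${out% }"
}

task_nodes block 8 2    # prints 0 0 0 0 1 1 1 1
task_nodes cyclic 8 2   # prints 0 1 0 1 0 1 0 1
```

In a real job, the placement that Slurm actually applies can be inspected by adding --cpu-bind=verbose to the srun call, together with the chosen --distribution value.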