...
Load the xthi module with:

module load xthi
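If you want to double-check that the module is loaded and the binary is available on your path, the standard modules commands work here (assuming the modules environment is initialised in your shell):

module list
which xthi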
Run the program interactively to familiarise yourself with the output:
$ xthi
Host=ac6-200 MPI Rank=0 CPU=128 NUMA Node=0 CPU Affinity=0,128
As you can see, only one process and one thread are run, and they may run on either of the two virtual cores assigned to my session (which correspond to the same physical CPU). If you try to run with 4 OpenMP threads, you will see them effectively fight each other for those same two cores, impacting the performance of your application but not anyone else on the login node:
$ OMP_NUM_THREADS=4 xthi
Host=ac6-200 MPI Rank=0 OMP Thread=0 CPU=128 NUMA Node=0 CPU Affinity=0,128
Host=ac6-200 MPI Rank=0 OMP Thread=1 CPU=  0 NUMA Node=0 CPU Affinity=0,128
Host=ac6-200 MPI Rank=0 OMP Thread=2 CPU=128 NUMA Node=0 CPU Affinity=0,128
Host=ac6-200 MPI Rank=0 OMP Thread=3 CPU=  0 NUMA Node=0 CPU Affinity=0,128
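You can also query your session's CPU affinity directly with the standard taskset tool from util-linux; a quick sketch, where the PID in the output is illustrative and your core list may differ:

$ taskset -cp $$
pid 12345's current affinity list: 0,128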
Create a new job script fractional.sh to run xthi with 2 MPI tasks and 2 OpenMP threads, submit it, and check the output to ensure the right number of tasks and threads were spawned. Here is a job template to start with:

fractional.sh:
#!/bin/bash
#SBATCH --output=fractional.out

# Add here the missing SBATCH directives for the relevant resources

module load xthi

# Add here the line to run xthi
# Hint: use srun
Solution

Using your favourite editor, create a file called fractional.sh with the following content:

fractional.sh:
#!/bin/bash
#SBATCH --output=fractional.out

# Add here the missing SBATCH directives for the relevant resources
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

module load xthi

# Add here the line to run xthi
# Hint: use srun
srun -c $SLURM_CPUS_PER_TASK xthi
You need to request 2 tasks and 2 CPUs per task in the job. Then we use srun to spawn the parallel run, which inherits the job geometry requested, except for cpus-per-task, which must be explicitly passed to srun. You can submit the job with sbatch:
sbatch fractional.sh
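As a side note, recent Slurm versions also honour the SRUN_CPUS_PER_TASK environment variable, which saves you passing -c on every srun call; a minimal sketch of that variant, assuming your Slurm version supports it:

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun xthi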
The job should run shortly. When finished, a new file called fractional.out should appear in the same directory. You can check the relevant output with:

grep -v ECMWF-INFO fractional.out
You should see an output similar to:
$ grep -v ECMWF-INFO fractional.out
Host=ad6-202 MPI Rank=0 OMP Thread=0 CPU=  5 NUMA Node=0 CPU Affinity=5,133
Host=ad6-202 MPI Rank=0 OMP Thread=1 CPU=133 NUMA Node=0 CPU Affinity=5,133
Host=ad6-202 MPI Rank=1 OMP Thread=0 CPU=137 NUMA Node=0 CPU Affinity=9,137
Host=ad6-202 MPI Rank=1 OMP Thread=1 CPU=  9 NUMA Node=0 CPU Affinity=9,137
Srun automatic CPU binding

You can see srun automatically ensures a certain binding of the cores to the tasks. If you were to instruct srun to avoid any CPU binding with --cpu-bind=none, you would see something like:

$ grep -v ECMWF-INFO fractional.out
Host=aa6-203 MPI Rank=0 OMP Thread=0 CPU=136 NUMA Node=0 CPU Affinity=4,8,132,136
Host=aa6-203 MPI Rank=0 OMP Thread=1 CPU=  8 NUMA Node=0 CPU Affinity=4,8,132,136
Host=aa6-203 MPI Rank=1 OMP Thread=0 CPU=132 NUMA Node=0 CPU Affinity=4,8,132,136
Host=aa6-203 MPI Rank=1 OMP Thread=1 CPU=  4 NUMA Node=0 CPU Affinity=4,8,132,136
Here all processes/threads could run on any of the cores assigned to the job, potentially hopping from CPU to CPU during the program's execution.
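If you are unsure what binding srun applied in a given run, you can ask it to report the masks it sets with its verbose binding option, for example:

srun --cpu-bind=verbose -c $SLURM_CPUS_PER_TASK xthi

The binding report is printed alongside the program's own output.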
Can you ensure each of the OpenMP threads runs on a single physical core, without exploiting hyperthreading, for optimal performance?
Solution

In order to ensure each thread gets its own core, you can use the environment variable OMP_PLACES=threads. Then, to make sure only physical cores are used for performance, we need to use the --hint=nomultithread directive:

fractional.sh:
#!/bin/bash
#SBATCH --output=fractional.out

# Add here the missing SBATCH directives for the relevant resources
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --hint=nomultithread

module load xthi

# Add here the line to run xthi
# Hint: use srun
export OMP_PLACES=threads
srun -c $SLURM_CPUS_PER_TASK xthi
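As an aside, the standard OpenMP variable OMP_PROC_BIND can be combined with OMP_PLACES to control how threads are distributed over those places; a minimal sketch, not required for this exercise:

export OMP_PLACES=threads
export OMP_PROC_BIND=close   # keep each task's threads on neighbouring places; "spread" would space them out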
You can submit the modified job with sbatch:

sbatch fractional.sh
You should see an output similar to the following, where each thread is on a different core, with a core number lower than 128:
$ grep -v ECMWF-INFO fractional.out
Host=ad6-201 MPI Rank=0 OMP Thread=0 CPU=18 NUMA Node=1 CPU Affinity=18
Host=ad6-201 MPI Rank=0 OMP Thread=1 CPU=20 NUMA Node=1 CPU Affinity=20
Host=ad6-201 MPI Rank=1 OMP Thread=0 CPU=21 NUMA Node=1 CPU Affinity=21
Host=ad6-201 MPI Rank=1 OMP Thread=1 CPU=22 NUMA Node=1 CPU Affinity=22
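To quickly verify that no virtual core (CPU number 128 or above) was used, you can extract the CPU column from the output with standard text tools; a small sketch:

grep -v ECMWF-INFO fractional.out | sed -n 's/.*CPU= *\([0-9]*\) .*/\1/p' | sort -n

All printed core numbers should be unique and below 128.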
...