...
Create a directory for this tutorial so all the exercises and outputs are contained inside:
No Format mkdir ~/batch_tutorial cd ~/batch_tutorial
Create and submit a job called
simplest.sh
with just default settings that runs the command hostname
. Can you find the output and inspect it? Where did your job run?Expand title Solution Using your favourite editor, create a file called
simplest.sh
with the following content: Code Block language bash title simplest.sh #!/bin/bash hostname
You can submit it with sbatch:
No Format sbatch simplest.sh
The job should be run shortly. When finished, a new file called
slurm-<jobid>.out
should appear in the same directory. You can check the output with:No Format $ cat $(ls -1 slurm-*.out | tail -n1) ab6-202.bullx [ECMWF-INFO -ecepilog] ---------------------------------------------------------------------------------------------------- [ECMWF-INFO -ecepilog] This is the ECMWF job Epilogue [ECMWF-INFO -ecepilog] +++ Please report issues using the Support portal +++ [ECMWF-INFO -ecepilog] +++ https://support.ecmwf.int +++ [ECMWF-INFO -ecepilog] ---------------------------------------------------------------------------------------------------- [ECMWF-INFO -ecepilog] Run at 2023-10-25T11:31:53 on ecs [ECMWF-INFO -ecepilog] JobName : simplest.sh [ECMWF-INFO -ecepilog] JobID : 64273363 [ECMWF-INFO -ecepilog] Submit : 2023-10-25T11:31:36 [ECMWF-INFO -ecepilog] Start : 2023-10-25T11:31:51 [ECMWF-INFO -ecepilog] End : 2023-10-25T11:31:53 [ECMWF-INFO -ecepilog] QueuedTime : 15.0 [ECMWF-INFO -ecepilog] ElapsedRaw : 2 [ECMWF-INFO -ecepilog] ExitCode : 0:0 [ECMWF-INFO -ecepilog] DerivedExitCode : 0:0 [ECMWF-INFO -ecepilog] State : COMPLETED [ECMWF-INFO -ecepilog] Account : myaccount [ECMWF-INFO -ecepilog] QOS : ef [ECMWF-INFO -ecepilog] User : user [ECMWF-INFO -ecepilog] StdOut : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out [ECMWF-INFO -ecepilog] StdErr : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out [ECMWF-INFO -ecepilog] NNodes : 1 [ECMWF-INFO -ecepilog] NCPUS : 2 [ECMWF-INFO -ecepilog] SBU : 0.011 [ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
You can then see that the script has run on a different node than the one you are on.
If you repeat the operation, you may get your job to run on a different node every time, whichever happens to be free at the time.
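If you want to double-check which node a given job ran on after it has left the queue, you can also ask the Slurm accounting database. This is a minimal sketch, where <jobid> is the id printed by sbatch:
No Format sacct -X -j <jobid> -o JobID,JobName,NodeList,State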
Configure your
simplest.sh
job to direct the output to simplest-<jobid>.out
, the error to simplest-<jobid>.err
both in the same directory, and the job name to just "simplest". Note you will need to use a special placeholder for the -<jobid>
.Expand title Solution Using your favourite editor, open the
simplest.sh
job script and add the relevant #SBATCH directives: Code Block language bash title simplest.sh #!/bin/bash #SBATCH --job-name=simplest #SBATCH --output=simplest-%j.out #SBATCH --error=simplest-%j.err hostname
You can submit it again with:
No Format sbatch simplest.sh
After a few moments, you should see the new files appear in your directory (job id will be different than the one displayed here):
No Format $ ls simplest-*.* simplest-64274497.err simplest-64274497.out
You can check that the job name was also changed in the end of job report:
No Format $ grep -i jobname $(ls -1 simplest-*.err | tail -n1) [ECMWF-INFO -ecepilog] JobName : simplest
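If the job has already left the queue, you can also confirm the new job name through the accounting records. A small sketch, filtering by the name we just set:
No Format sacct -X --name=simplest -o JobID,JobName,State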
From a terminal session outside the Atos HPCF or ECS, such as your VDI or your own computer, submit the
simplest.sh
job remotely. What hostname should you use? Expand title Solution You must use hpc-batch for remote submissions to the Atos HPCF, or ecs-batch for ECS:
No Format ssh hpc-batch "cd ~/batch_tutorial; sbatch simplest.sh"
No Format ssh ecs-batch "cd ~/batch_tutorial; sbatch simplest.sh"
Note the change of directory, so that the job script is found and both the working directory of the job and its outputs end up in the right place.
An alternative way of doing this without changing directory would be to tell sbatch to do it for you:
No Format ssh hpc-batch sbatch -D ~/batch_tutorial ~/batch_tutorial/simplest.sh
or for ECS:
No Format ssh ecs-batch sbatch -D ~/batch_tutorial ~/batch_tutorial/simplest.sh
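Once submitted remotely, you can monitor your jobs in the same fashion without opening a full interactive session. A sketch for the HPCF case (use ecs-batch for ECS):
No Format ssh hpc-batch squeue -u $USER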
...
Create a new job script
broken1.sh
with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?Code Block language bash title broken1.sh collapse true #SBATCH --job-name = broken 1 #SBATCH --output = broken1-%J.out #SBATCH --error = broken1-%J.out #SBATCH --qos = express #SBATCH --time = 00:05:00 echo "I was broken!"
Expand title Solution The job above has the following problems:
- There is no shebang at the beginning of the script.
- There should be no spaces in the directives
- There should be no spaces in the job name
- QoS "express" does not exist
Here is an amended version following best practices:
Code Block language bash title broken1_fixed.sh #!/bin/bash #SBATCH --job-name=broken1 #SBATCH --output=broken1-%J.out #SBATCH --error=broken1-%J.out #SBATCH --time=00:05:00 echo "I was broken!"
Note that the QoS line was removed, but you may also use the following if running on ECS:
No Format #SBATCH --qos=ef
or alternatively, if on the Atos HPCF:
No Format #SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
No Format $ grep -v ECMWF-INFO $(ls -1 broken1-*.out | tail -n1) I was broken!
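If you want to catch this kind of directive problem before running anything, you can ask sbatch to validate the script without submitting it. This is just a sketch; no job is actually queued when this option is used:
No Format sbatch --test-only broken1.sh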
Create a new job script
broken2.sh
with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?Code Block language bash title broken2.sh collapse true #!/bin/bash #SBATCH --job-name=broken2 #SBATCH --output=broken2-%J.out #SBATCH --error=broken2-%J.out #SBATCH --qos=ns #SBATCH --time=10-00 echo "I was broken!"
Expand title Solution The job above has the following problems:
- QoS "ns" does not exist. Either remove to use the default or use the corresponding QoS on ECS (ef) or HPCF (nf)
- The time requested is 10 days, which is longer than the maximum allowed. It was probably meant to be 10 minutes (see the reminder on Slurm time formats below)
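As a reminder of Slurm's wall clock syntax, which the amended job below relies on, here are some equivalent ways of writing time limits. The values are only illustrative:
No Format #SBATCH --time=10:00       # 10 minutes (minutes:seconds)
#SBATCH --time=01:30:00    # 1 hour 30 minutes (hours:minutes:seconds)
#SBATCH --time=1-00:00:00  # 1 day (days-hours:minutes:seconds)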
Here is an amended version:
Code Block language bash title broken2.sh #!/bin/bash #SBATCH --job-name=broken2 #SBATCH --output=broken2-%J.out #SBATCH --error=broken2-%J.out #SBATCH --time=10:00 echo "I was broken!"
Again, note that the QoS line was removed, but you may also use the following if running on ECS:
No Format #SBATCH --qos=ef
or alternatively, if on the Atos HPCF:
No Format #SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
No Format $ grep -v ECMWF-INFO $(ls -1 broken2-*.out | tail -n1) I was broken!
Create a new job script
broken3.sh
with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?Code Block language bash title broken3.sh collapse true #!/bin/bash #SBATCH --job-name=broken3 #SBATCH --chdir=$SCRATCH #SBATCH --output=broken3output/broken3-%J.out #SBATCH --error=broken3output/broken3-%J.out echo "I was broken!"
Expand title Solution The job above has the following problems:
- Variables are not expanded in job directives. You must specify your paths explicitly.
- The directory where the output and error files will go must exist before the job starts. Otherwise the job will fail, but you will not get any hint as to what may have happened to it. The only hint would be when checking sacct:
No Format $ sacct -X --name=broken3 JobID JobName QOS State ExitCode Elapsed NNodes NodeList ------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 64281800 broken3 ef FAILED 0:53 00:00:02 1 ad6-201
You will need to create the output directory with:
No Format mkdir -p $SCRATCH/broken3output/
Here is an amended version of the job:
Code Block language bash title broken3.sh #!/bin/bash #SBATCH --job-name=broken3 #SBATCH --chdir=/scratch/<your_user_id> #SBATCH --output=broken3output/broken3-%J.out #SBATCH --error=broken3output/broken3-%J.out echo "I was broken!"
Check that the job actually ran and generated the expected output:
No Format $ grep -v ECMWF-INFO $(ls -1 $SCRATCH/broken3output/broken3-*.out | tail -n1) I was broken!
You may clean up the output directory with
No Format rm -rf $SCRATCH/broken3output
Create a new job script
broken4.sh
with the contents below and try to submit the job. You should not see the message in the output. What happened? Can you fix the job and keep trying until it runs successfully? Code Block language bash title broken4.sh collapse true #!/bin/bash #SBATCH --job-name=broken4 #SBATCH --output=broken4-%J.out ls $FOO/bar echo "I should not be here"
Expand title Solution The job above has the following problems:
- The FOO variable is undefined when used. Undefined variables often lead to unexpected failures that are not always easy to spot.
- Even if FOO was defined as "", the ls command fails but the job keeps running, and it eventually appears to finish successfully from Slurm's point of view, when it should have failed and been interrupted on the first error.
Here is an amended version of the job following best practices:
Code Block language bash title broken4.sh #!/bin/bash #SBATCH --output=broken4-%J.out set -x # echo script lines as they are executed set -e # stop the shell on first error set -u # fail when using an undefined variable set -o pipefail # If any command in a pipeline fails, that return code will be used as the return code of the whole pipeline ls $FOO/bar echo "I should not be here"
With the extra shell options, we get some extra information in the output about the commands being executed, and we ensure that the job will stop when encountering the first error (non-zero exit code), as well as when an undefined variable is used.
Info title Best practices Even though most examples in this tutorial do not include the extra shell options for simplicity, you should always include them in your production jobs.
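As a reference, here is a minimal sketch of a job skeleton combining the directives and shell options seen so far in this tutorial. The job name, output pattern, wall time and payload are only placeholders:
Code Block language bash title skeleton.sh #!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.out
#SBATCH --time=10:00

set -x           # echo script lines as they are executed
set -e           # stop the shell on first error
set -u           # fail when using an undefined variable
set -o pipefail  # fail if any command in a pipeline fails

# Your actual commands go here
hostname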
...
Create a new job script
naughty.sh
with the following contents: Code Block language bash title naughty.sh #!/bin/bash #SBATCH --mem=100 #SBATCH --output=naughty.out MEM=300 perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"
Submit
naughty.sh
to the batch system and check its status. What happened to the job?Expand title Solution You can submit it with:
No Format sbatch naughty.sh
You can then monitor the state of your job with squeue:
No Format squeue -j <jobid>
After a few seconds of running, you may see that the job finishes and disappears. If we use sacct, we can see the job has failed, with a code of 9, which indicates it was killed:
No Format $ sacct -X --name naughty.sh JobID JobName QOS State ExitCode Elapsed NNodes NodeList ------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 64303470 naughty.sh ef FAILED 9:0 00:00:04 1 ac6-202
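You can also ask the accounting database how much memory the job actually used compared to what was requested. This is a sketch; note that MaxRSS is reported per job step rather than for the allocation itself:
No Format sacct -j <jobid> -o JobID,JobName,ReqMem,MaxRSS,State,ExitCode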
Inspecting the job output the reason becomes apparent:
No Format $ grep -v ECMWF-INFO naughty.out | head -n 22 __ __ _____ __ __ _ _____ _ _ __/\_| \/ | ____| \/ | |/ /_ _| | | | __/\__ \ / |\/| | _| | |\/| | ' / | || | | | \ / /_ _\ | | | |___| | | | . \ | || |___| |__/_ _\ \/ |_| |_|_____|_| |_|_|\_\___|_____|_____|\/ BEGIN OF ECMWF MEMKILL REPORT [ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 10:49:40 2023] [summary] job/session: 64303470 requested/default memory limit for job/session: 100MiB sum of active and inactive _anonymous memory of job/session: 301MiB ACTION: about to issue: 'kill -SIGKILL' to pid: 3649110 to-be-killed process: "perl -e $a="A"x(300*1024*1024/2); sleep", with resident-segment-size: 304MiB
The job had a limit of 100 MiB, but it tried to use up to 300 MiB, so the system killed the process. Edit
naughty.sh
to comment out the request for memory, and then play with the MEM value. Code Block language bash title naughty.sh #!/bin/bash #SBATCH --output=naughty.out ##SBATCH --mem=100 MEM=300 perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"
How high can you go with the default memory limit on the default QoS before the system kills it?
Expand title Solution With trial and error, you will see the system will kill your tasks that go over 8000 MiB:
Code Block language bash title naughty.sh #!/bin/bash #SBATCH --output=naughty.out ##SBATCH --mem=100 MEM=8000 perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"
Inspecting the job output will confirm that:
No Format $ grep -v ECMWF-INFO naughty.out | head -n 22 __ __ _____ __ __ _ _____ _ _ __/\_| \/ | ____| \/ | |/ /_ _| | | | __/\__ \ / |\/| | _| | |\/| | ' / | || | | | \ / /_ _\ | | | |___| | | | . \ | || |___| |__/_ _\ \/ |_| |_|_____|_| |_|_|\_\___|_____|_____|\/ BEGIN OF ECMWF MEMKILL REPORT [ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 11:16:43 2023] [summary] job/session: 64304303 requested/default memory limit for job/session: 8000MiB sum of active and inactive _anonymous memory of job/session: 8001MiB ACTION: about to issue: 'kill -SIGKILL' to pid: 4016899 to-be-killed process: "perl -e $a='A'x(8000*1024*1024/2); sleep", with resident-segment-size: 8004MiB
How could you have checked this beforehand instead of taking the trial and error approach?
Expand title Solution You could have checked HPC2020: Batch system, or you could also ask Slurm for this information. Default memory is defined per partition, so you can then do
No Format scontrol show partition
The field we are looking for is
DefMemPerNode
:No Format $ scontrol -o show partition | tr " " "\n" | grep -i -e "DefMem" -e "PartitionName"
Can you check, without trial and error this time, what is the maximum wall clock time, maximum number of CPUs, and maximum memory you can request from Slurm for each QoS?
Expand title Solution Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the QoS setup, so the command is
No Format sacctmgr show qos
The fields we are looking for this time are
MaxWall
andMaxTRES
:No Format sacctmgr -P show qos format=name,MaxWall,MaxTRES
If you run this on HPCF, you may notice there is no maximum limit set at the QoS level for the np parallel QoS, so you are bound by the maximum memory available in the node.
You can also see other limits such as the local SSD tmpdir space.
How many jobs could you potentially have running concurrently? How many jobs could you have in the system (pending or running), before a further submission fails?
Expand title Solution Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the Association setup, so the command is
No Format sacctmgr show assoc where user=$USER
The fields we are looking for are
MaxJobs
andMaxSubmit
:No Format sacctmgr show assoc user=$USER format=account,user,partition,maxjobs,maxsubmit
Remember that a Slurm Association is made of the user, project account and partition, and the limits are set at the association level.
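To see how many of those slots you are currently using, you can count your own pending and running jobs. A quick sketch:
No Format squeue -h -u $USER -t pending,running | wc -l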
...
For these tests we will use David McKain's version of the Cray xthi code to visualise how process and thread placement takes place.
Load the
xthi
module with: No Format module load xthi
Alternatively, you can download and compile the code yourself in your Atos HPCF or ECS shell session with the following commands: No Format module load prgenv/gnu hpcx-openmpi wget https://git.ecdf.ed.ac.uk/dmckain/xthi/-/raw/master/xthi.c mpicc -o xthi -fopenmp xthi.c -lnuma
Run the program interactively to familiarise yourself with the output:
No Format $ xthi Host=ac6-200 MPI Rank=0 CPU=128 NUMA Node=0 CPU Affinity=0,128
As you can see, only 1 process and 1 thread are run, and they may run on either of the two virtual cores assigned to your session (which correspond to the same physical CPU). If you try to run with 4 OpenMP threads, you will see that they effectively fight each other for those same two cores, impacting the performance of your application but not anyone else on the login node:
No Format $ OMP_NUM_THREADS=4 xthi Host=ac6-200 MPI Rank=0 OMP Thread=0 CPU=128 NUMA Node=0 CPU Affinity=0,128 Host=ac6-200 MPI Rank=0 OMP Thread=1 CPU= 0 NUMA Node=0 CPU Affinity=0,128 Host=ac6-200 MPI Rank=0 OMP Thread=2 CPU=128 NUMA Node=0 CPU Affinity=0,128 Host=ac6-200 MPI Rank=0 OMP Thread=3 CPU= 0 NUMA Node=0 CPU Affinity=0,128
Create a new job script
fractional.sh
to run xthi
with 2 MPI tasks and 2 OpenMP threads, submit it and check the output to ensure the right number of tasks and threads were spawned. Here is a job template to start with:
Code Block language bash title fractional.sh collapse true #!/bin/bash #SBATCH --output=fractional.out # TODO: Add here the missing SBATCH directives for the relevant resources # Define the number of OpenMP threads export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # Load xthi tool module load xthi # TODO: Add here the line to run xthi # Hint: use srun
Expand title Solution Using your favourite editor, create a file called
fractional.sh
with the following content: Code Block language bash title fractional.sh #!/bin/bash #SBATCH --output=fractional.out # Add here the missing SBATCH directives for the relevant resources #SBATCH --ntasks=2 #SBATCH --cpus-per-task=2 # Define the number of OpenMP threads export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # Load xthi tool module load xthi srun -c $SLURM_CPUS_PER_TASK xthi
You need to request 2 tasks, and 2 cpus per task in the job. Then we will use srun to spawn our parallel run, which should inherit the job geometry requested, except the
cpus-per-task
, which must be explicitly passed to srun. You can submit it with sbatch:
No Format sbatch fractional.sh
The job should be run shortly. When finished, a new file called
fractional.out
should appear in the same directory. You can check the relevant output with:No Format grep -v ECMWF-INFO fractional.out
You should see an output similar to:
No Format $ grep -v ECMWF-INFO fractional.out Host=ad6-202 MPI Rank=0 OMP Thread=0 CPU= 5 NUMA Node=0 CPU Affinity=5,133 Host=ad6-202 MPI Rank=0 OMP Thread=1 CPU=133 NUMA Node=0 CPU Affinity=5,133 Host=ad6-202 MPI Rank=1 OMP Thread=0 CPU=137 NUMA Node=0 CPU Affinity=9,137 Host=ad6-202 MPI Rank=1 OMP Thread=1 CPU= 9 NUMA Node=0 CPU Affinity=9,137
Info title Srun automatic cpu binding You can see that srun automatically ensures a certain binding of the cores to the tasks. If you were to instruct srun to avoid any cpu binding with
--cpu-bind=none
, you would see something like:No Format $ grep -v ECMWF-INFO fractional.out Host=aa6-203 MPI Rank=0 OMP Thread=0 CPU=136 NUMA Node=0 CPU Affinity=4,8,132,136 Host=aa6-203 MPI Rank=0 OMP Thread=1 CPU= 8 NUMA Node=0 CPU Affinity=4,8,132,136 Host=aa6-203 MPI Rank=1 OMP Thread=0 CPU=132 NUMA Node=0 CPU Affinity=4,8,132,136 Host=aa6-203 MPI Rank=1 OMP Thread=1 CPU= 4 NUMA Node=0 CPU Affinity=4,8,132,136
Here all processes/threads could run on any of the cores assigned to the job, potentially hopping from CPU to CPU during the program's execution.
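For reference, the unbound placement shown above could be obtained by replacing the srun line in the job with something like the following (just a sketch, not a recommendation):
No Format srun --cpu-bind=none -c $SLURM_CPUS_PER_TASK xthi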
Can you ensure each one of the OpenMP threads runs on a single physical core, without exploiting the hyperthreading, for optimal performance?
Expand title Solution In order to ensure each thread gets their own core, you can use the environment variable
OMP_PLACES=threads
.Then, to make sure only physical cores are used for performance, we need to use the
--hint=nomultithread
directive: Code Block language bash title fractional.sh #!/bin/bash #SBATCH --output=fractional.out # Add here the missing SBATCH directives for the relevant resources #SBATCH --ntasks=2 #SBATCH --cpus-per-task=2 #SBATCH --hint=nomultithread # Define the number of OpenMP threads export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # Ensure proper OpenMP thread CPU pinning export OMP_PLACES=threads # Load xthi tool module load xthi srun -c $SLURM_CPUS_PER_TASK xthi
You can submit the modified job with sbatch:
No Format sbatch fractional.sh
You should see an output similar to the following one, where each thread is in a different core with a number lower than 128:
No Format $ grep -v ECMWF-INFO fractional.out Host=ad6-201 MPI Rank=0 OMP Thread=0 CPU=18 NUMA Node=1 CPU Affinity=18 Host=ad6-201 MPI Rank=0 OMP Thread=1 CPU=20 NUMA Node=1 CPU Affinity=20 Host=ad6-201 MPI Rank=1 OMP Thread=0 CPU=21 NUMA Node=1 CPU Affinity=21 Host=ad6-201 MPI Rank=1 OMP Thread=1 CPU=22 NUMA Node=1 CPU Affinity=22
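If you ever need to check task affinity without xthi, a rough alternative is to print each task's CPU mask with taskset from util-linux. This is only a sketch, and its output format differs from xthi's:
No Format srun -c $SLURM_CPUS_PER_TASK bash -c 'taskset -cp $$'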
...
- If not already on HPCF, open a session on
hpc-login
. Create a new job script
parallel.sh
to run xthi
with 32 MPI tasks and 4 OpenMP threads, leaving hyperthreading enabled. Submit it and check the output to ensure the right number of tasks and threads were spawned. Take note of what CPUs are used, and how many SBUs you used. Here is a job template to start with:
Code Block language bash title parallel.sh collapse true #!/bin/bash #SBATCH --output=parallel-%j.out #SBATCH --qos=np # TODO: Add here the missing SBATCH directives for the relevant resources # Define the number of OpenMP threads export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # Ensure proper OpenMP thread CPU pinning export OMP_PLACES=threads # Load xthi tool module load xthi srun -c $SLURM_CPUS_PER_TASK xthi
Expand title Solution Using your favourite editor, create a file called parallel
.sh
with the following content: Code Block language bash title parallel.sh #!/bin/bash #SBATCH --output=parallel-%j.out #SBATCH --qos=np # Add here the missing SBATCH directives for the relevant resources #SBATCH --ntasks=32 #SBATCH --cpus-per-task=4 # Define the number of OpenMP threads export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # Ensure proper OpenMP thread CPU pinning export OMP_PLACES=threads # Load xthi tool module load xthi srun -c $SLURM_CPUS_PER_TASK xthi
You need to request 32 tasks, and 4 cpus per task in the job. Then we will use srun to spawn our parallel run, which should inherit the job geometry requested, except the
cpus-per-task
, which must be explicitly passed to srun.You can submit it with sbatch:
No Format sbatch parallel.sh
The job should be run shortly. When finished, a new file called
parallel-<jobid>.out
should appear in the same directory. You can check the relevant output with:No Format grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)
You should see an output similar to:
No Format Host=ac2-4046 MPI Rank= 0 OMP Thread=0 CPU= 0 NUMA Node=0 CPU Affinity= 0 Host=ac2-4046 MPI Rank= 0 OMP Thread=1 CPU=128 NUMA Node=0 CPU Affinity=128 Host=ac2-4046 MPI Rank= 0 OMP Thread=2 CPU= 1 NUMA Node=0 CPU Affinity= 1 Host=ac2-4046 MPI Rank= 0 OMP Thread=3 CPU=129 NUMA Node=0 CPU Affinity=129 Host=ac2-4046 MPI Rank= 1 OMP Thread=0 CPU= 2 NUMA Node=0 CPU Affinity= 2 Host=ac2-4046 MPI Rank= 1 OMP Thread=1 CPU=130 NUMA Node=0 CPU Affinity=130 Host=ac2-4046 MPI Rank= 1 OMP Thread=2 CPU= 3 NUMA Node=0 CPU Affinity= 3 Host=ac2-4046 MPI Rank= 1 OMP Thread=3 CPU=131 NUMA Node=0 CPU Affinity=131 ... Host=ac2-4046 MPI Rank=30 OMP Thread=0 CPU=116 NUMA Node=7 CPU Affinity=116 Host=ac2-4046 MPI Rank=30 OMP Thread=1 CPU=244 NUMA Node=7 CPU Affinity=244 Host=ac2-4046 MPI Rank=30 OMP Thread=2 CPU=117 NUMA Node=7 CPU Affinity=117 Host=ac2-4046 MPI Rank=30 OMP Thread=3 CPU=245 NUMA Node=7 CPU Affinity=245 Host=ac2-4046 MPI Rank=31 OMP Thread=0 CPU=118 NUMA Node=7 CPU Affinity=118 Host=ac2-4046 MPI Rank=31 OMP Thread=1 CPU=246 NUMA Node=7 CPU Affinity=246 Host=ac2-4046 MPI Rank=31 OMP Thread=2 CPU=119 NUMA Node=7 CPU Affinity=119 Host=ac2-4046 MPI Rank=31 OMP Thread=3 CPU=247 NUMA Node=7 CPU Affinity=247
Note the following facts:
- Both the main cores (0-127) and hyperthreads (128-255) were used.
- You get consecutive threads on the same physical CPU (0 with 128, 1 with 129...).
- There are physical CPUs entirely unused, since their CPU numbers do not show in the output.
In terms of SBUs, this job cost:
No Format $ grep SBU $(ls -1 parallel-*.out | tail -n1) [ECMWF-INFO -ecepilog] SBU : 6.051
Modify the
parallel.sh
job geometry (number of tasks, threads and use of hyperthreading) so that you fully utilise all the physical cores, and only those, i.e. 0-127. Expand title Solution Without using hyperthreading, an Atos HPCF node has 128 physical cores available. Any combination of tasks and threads that adds up to that figure will fill the node. Examples include 32 tasks x 4 threads, 64 tasks x 2 threads or 128 single-threaded tasks. For this example, we picked the first one:
Code Block language bash title parallel.sh #!/bin/bash #SBATCH --output=parallel-%j.out #SBATCH --qos=np # Add here the missing SBATCH directives for the relevant resources #SBATCH --ntasks=32 #SBATCH --cpus-per-task=4 #SBATCH --hint=nomultithread # Define the number of OpenMP threads export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1} # Ensure proper OpenMP thread CPU pinning export OMP_PLACES=threads # Load xthi tool module load xthi srun -c $SLURM_CPUS_PER_TASK xthi
You can submit it with sbatch:
No Format sbatch parallel.sh
The job should be run shortly. When finished, a new file called
parallel-<jobid>.out
should appear in the same directory. You can check the relevant output with:No Format grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)
You should see an output similar to:
No Format Host=ac3-2015 MPI Rank= 0 OMP Thread=0 CPU= 0 NUMA Node=0 CPU Affinity= 0 Host=ac3-2015 MPI Rank= 0 OMP Thread=1 CPU= 1 NUMA Node=0 CPU Affinity= 1 Host=ac3-2015 MPI Rank= 0 OMP Thread=2 CPU= 2 NUMA Node=0 CPU Affinity= 2 Host=ac3-2015 MPI Rank= 0 OMP Thread=3 CPU= 3 NUMA Node=0 CPU Affinity= 3 Host=ac3-2015 MPI Rank= 1 OMP Thread=0 CPU= 4 NUMA Node=0 CPU Affinity= 4 Host=ac3-2015 MPI Rank= 1 OMP Thread=1 CPU= 5 NUMA Node=0 CPU Affinity= 5 Host=ac3-2015 MPI Rank= 1 OMP Thread=2 CPU= 6 NUMA Node=0 CPU Affinity= 6 Host=ac3-2015 MPI Rank= 1 OMP Thread=3 CPU= 7 NUMA Node=0 CPU Affinity= 7 ... Host=ac3-2015 MPI Rank=30 OMP Thread=0 CPU=120 NUMA Node=7 CPU Affinity=120 Host=ac3-2015 MPI Rank=30 OMP Thread=1 CPU=121 NUMA Node=7 CPU Affinity=121 Host=ac3-2015 MPI Rank=30 OMP Thread=2 CPU=122 NUMA Node=7 CPU Affinity=122 Host=ac3-2015 MPI Rank=30 OMP Thread=3 CPU=123 NUMA Node=7 CPU Affinity=123 Host=ac3-2015 MPI Rank=31 OMP Thread=0 CPU=124 NUMA Node=7 CPU Affinity=124 Host=ac3-2015 MPI Rank=31 OMP Thread=1 CPU=125 NUMA Node=7 CPU Affinity=125 Host=ac3-2015 MPI Rank=31 OMP Thread=2 CPU=126 NUMA Node=7 CPU Affinity=126 Host=ac3-2015 MPI Rank=31 OMP Thread=3 CPU=127 NUMA Node=7 CPU Affinity=127
Note the following facts:
- Only the main cores (0-127) were used.
- Each thread gets one and only one cpu pinned to it.
- All the physical cores are in use.
In terms of SBUs, this job cost:
No Format $ grep SBU $(ls -1 parallel-*.out | tail -n1) [ECMWF-INFO -ecepilog] SBU : 5.379
Modify the
parallel.sh
job geometry so it still runs on the np QoS, but only with 2 tasks and 2 threads. Check the SBU cost. Since the execution is 32 times smaller, did it cost 32 times less than the previous one? Why? Expand title Solution Let's use the following job:
Code Block language bash title parallel.sh #!/bin/bash #SBATCH --output=parallel-%j.out #SBATCH --qos=np # Add here the missing SBATCH directives for the relevant resources #SBATCH --ntasks=2 #SBATCH --cpus-per-task=2 #SBATCH --hint=nomultithread module load xthi export OMP_PLACES=threads srun -c $SLURM_CPUS_PER_TASK xthi
You can submit it with sbatch:
No Format sbatch parallel.sh
The job should be run shortly. When finished, a new file called
parallel-<jobid>.out
should appear in the same directory. You can check the relevant output with:No Format grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)
You should see an output similar to:
No Format Host=ac2-3073 MPI Rank=0 OMP Thread=0 CPU= 0 NUMA Node=0 CPU Affinity= 0 Host=ac2-3073 MPI Rank=0 OMP Thread=1 CPU= 1 NUMA Node=0 CPU Affinity= 1 Host=ac2-3073 MPI Rank=1 OMP Thread=0 CPU=16 NUMA Node=1 CPU Affinity=16 Host=ac2-3073 MPI Rank=1 OMP Thread=1 CPU=17 NUMA Node=1 CPU Affinity=17
In terms of SBUs, this job cost:
No Format $ grep SBU $(ls -1 parallel-*.out | tail -n1) [ECMWF-INFO -ecepilog] SBU : 4.034
This is on a similar scale to the previous run, which was 32 times bigger. The reason behind it is that on the np QoS the allocation is done in full nodes. The SBU cost takes into account the nodes allocated for a given period of time, no matter how they are used.
You may compare the cost of your last parallel job and your last fractional job, which had the same geometry (2x2):
No Format $ grep -h SBU $(ls -1 parallel-*.out | tail -n1) fractional.out [ECMWF-INFO -ecepilog] SBU : 4.034
[ECMWF-INFO -ecepilog] SBU : 0.084