...
Create a new job script naughty.sh with the following contents:

naughty.sh:
#!/bin/bash
#SBATCH --mem=100
#SBATCH --output=naughty.out

MEM=300
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"
Submit naughty.sh to the batch system and check its status. What happened to the job?

Solution:
You can submit it with:

sbatch naughty.sh
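If you want to capture the job ID at submission time for later use with squeue or sacct, one small sketch is to use sbatch's --parsable option (on a non-federated cluster this prints just the numeric job ID):

# Submit the job and keep only its ID in a shell variable
jobid=$(sbatch --parsable naughty.sh)
echo "Submitted job $jobid"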
You can then monitor the state of your job with squeue:

squeue -j <jobid>
After a few seconds of running, you may see that the job finishes and disappears from the queue. If we use sacct, we can see the job has failed with an exit code of 9, which indicates it was killed:
$ sacct -X --name naughty.sh
JobID            JobName       QOS      State   ExitCode    Elapsed   NNodes   NodeList
------------ ------------ --------- ---------- -------- ---------- -------- ----------
64303470       naughty.sh        ef     FAILED      9:0   00:00:04        1    ac6-202
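Once you know the job ID, you could also ask sacct for the memory accounting of that specific job. A sketch using standard sacct fields (MaxRSS is reported per step and may be empty for very short jobs):

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS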
Inspecting the job output, the reason becomes apparent:
$ grep -v ECMWF-INFO naughty.out | head -n 22

          [ASCII-art "MEMKILL" banner]

  BEGIN OF ECMWF MEMKILL REPORT

[ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 10:49:40 2023]
[summary]
  job/session: 64303470
  requested/default memory limit for job/session: 100MiB
  sum of active and inactive _anonymous memory of job/session: 301MiB
  ACTION: about to issue: 'kill -SIGKILL' to pid: 3649110
  to-be-killed process: "perl -e $a="A"x(300*1024*1024/2); sleep", with resident-segment-size: 304MiB
The job had a limit of 100 MiB, but it tried to use up to 300 MiB, so the system killed the process.

Edit naughty.sh to comment out the memory request, and then play with the MEM value:

naughty.sh:
#!/bin/bash
#SBATCH --output=naughty.out
##SBATCH --mem=100

MEM=300
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"
How high can you go, with the default memory limit on the default queue, before the system kills your job?
Solution:
With trial and error, you will see that the system kills your tasks when they go over 8000 MiB:

naughty.sh:
#!/bin/bash
#SBATCH --output=naughty.out
##SBATCH --mem=100

MEM=8000
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"
Inspecting the job output will confirm that:
$ grep -v ECMWF-INFO naughty.out | head -n 22

          [ASCII-art "MEMKILL" banner]

  BEGIN OF ECMWF MEMKILL REPORT

[ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 11:16:43 2023]
[summary]
  job/session: 64304303
  requested/default memory limit for job/session: 8000MiB
  sum of active and inactive _anonymous memory of job/session: 8001MiB
  ACTION: about to issue: 'kill -SIGKILL' to pid: 4016899
  to-be-killed process: "perl -e $a='A'x(8000*1024*1024/2); sleep", with resident-segment-size: 8004MiB
How could you have checked this beforehand instead of taking the trial and error approach?
Solution:
You could have checked HPC2020: Batch system, or you could also ask Slurm for this information. The default memory is defined per partition, so you can do:

scontrol show partition

The field we are looking for is DefMemPerNode:

$ scontrol -o show partition | tr " " "\n" | grep -i -e "DefMem" -e "PartitionName"
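If you are only interested in a single partition, you can pass its name to scontrol. A small sketch, assuming the default ECS partition ef (use nf on HPCF):

# Show only the default memory per node for one partition
scontrol -o show partition ef | tr " " "\n" | grep -i "^DefMemPerNode"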
Can you check, without trial and error this time, the maximum wall clock time, maximum CPUs, and maximum memory you can request from Slurm for each QoS?
Solution:
Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the QoS setup, so the command is:

sacctmgr show qos

The fields we are looking for this time are MaxWall and MaxTRES:

sacctmgr -P show qos format=name,MaxWall,MaxTRES
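If you only care about one particular QoS, you can filter it by name. A sketch, assuming the QoS of interest is ef, as used earlier in this exercise:

# Limits for a single QoS
sacctmgr -P show qos where name=ef format=name,MaxWall,MaxTRES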
If you run this on HPCF, you may notice there is no maximum limit set at the QoS level for the np parallel queue, so you are bound by the maximum memory available in the node.
You can also see other limits such as the local SSD tmpdir space.
How many jobs could you potentially have running concurrently? How many jobs could you have in the system (pending or running), before a further submission fails?
Solution:
Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the Association setup, so the command is:

sacctmgr show assoc where user=$USER

The fields we are looking for are MaxJobs and MaxSubmit:

sacctmgr show assoc user=$USER format=account,user,partition,maxjobs,maxsubmit
Remember that a Slurm Association is made of the user, project account and partition, and the limits are set at the association level.
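To see how close you are to those limits at any given moment, you can count your own running and pending jobs with squeue. A small sketch using standard squeue options:

# Jobs of the current user that are running right now
squeue -u $USER -t RUNNING -h | wc -l
# Jobs of the current user still waiting in the queue
squeue -u $USER -t PENDING -h | wc -l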
Running small parallel jobs - fractional
So far we have only run serial jobs. You may also want to run small parallel jobs, using multiple threads, multiple processes, or both. Examples of this are MPI and OpenMP programs. We call this kind of small parallel job "fractional", because it runs on a fraction of a node, sharing it with other users.

If you have followed this tutorial so far, you will have realised that ECS users may run very small parallel jobs on the default ef queue, whereas HPCF users may run slightly bigger jobs (up to half a GPIL node) on the default nf queue.
For these tests we will use David McKain's version of the Cray xthi code to visualise how process and thread placement takes place.
Download and compile the code in your Atos HPCF or ECS shell session with the following commands:
module load prgenv/gnu hpcx-openmpi
wget https://git.ecdf.ed.ac.uk/dmckain/xthi/-/raw/master/xthi.c
mpicc -o xthi -fopenmp xthi.c -lnuma
Try to run the program interactively to familiarise yourself with the output:

$ ./xthi
Host=ac6-200  MPI Rank=0  CPU=128  NUMA Node=0  CPU Affinity=0,128
As you can see, only one process and one thread are run, and they may run on either of the two virtual cores assigned to your session (which correspond to the same physical CPU). If you try to run with 4 OpenMP threads, you will see that they effectively fight each other for those same two cores, impacting the performance of your application but not anyone else on the login node:
$ OMP_NUM_THREADS=4 ./xthi
Host=ac6-200  MPI Rank=0  OMP Thread=0  CPU=128  NUMA Node=0  CPU Affinity=0,128
Host=ac6-200  MPI Rank=0  OMP Thread=1  CPU=  0  NUMA Node=0  CPU Affinity=0,128
Host=ac6-200  MPI Rank=0  OMP Thread=2  CPU=128  NUMA Node=0  CPU Affinity=0,128
Host=ac6-200  MPI Rank=0  OMP Thread=3  CPU=  0  NUMA Node=0  CPU Affinity=0,128
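If you want to double-check which cores your interactive session is confined to before running anything, one option (assuming the standard util-linux taskset tool is available on the login nodes) is:

# Print the CPU affinity of the current shell
taskset -cp $$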
Create a new job script fractional.sh to run xthi with 2 MPI tasks and 2 OpenMP threads, submit it, and check the output to ensure the right number of tasks and threads were spawned. Here is a job template to start with:

fractional.sh (template):
#!/bin/bash
#SBATCH --output=fractional.out

# Add here the missing SBATCH directives for the relevant resources

# Add here the line to run xthi
# Hint: use srun
Solution:
Using your favourite editor, create a file called fractional.sh with the following content:

fractional.sh:
#!/bin/bash
#SBATCH --output=fractional.out

# Add here the missing SBATCH directives for the relevant resources
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

# Add here the line to run xthi
# Hint: use srun
srun -c $SLURM_CPUS_PER_TASK ./xthi
You need to request 2 tasks and 2 CPUs per task in the job. We then use srun to spawn the parallel run, which inherits the requested job geometry, except for cpus-per-task, which must be passed explicitly to srun.

You can submit it with sbatch:

sbatch fractional.sh
The job should run shortly. When it finishes, a new file called fractional.out should appear in the same directory. You can check the relevant output with:

grep -v ECMWF-INFO fractional.out
You should see an output similar to:
$ grep -v ECMWF-INFO fractional.out
Host=ad6-202  MPI Rank=0  OMP Thread=0  CPU=  5  NUMA Node=0  CPU Affinity=5,133
Host=ad6-202  MPI Rank=0  OMP Thread=1  CPU=133  NUMA Node=0  CPU Affinity=5,133
Host=ad6-202  MPI Rank=1  OMP Thread=0  CPU=137  NUMA Node=0  CPU Affinity=9,137
Host=ad6-202  MPI Rank=1  OMP Thread=1  CPU=  9  NUMA Node=0  CPU Affinity=9,137
Note - srun automatic CPU binding:
You can see that srun automatically applies some binding of cores to tasks, although perhaps not the best one. If you were to instruct srun to avoid any CPU binding with --cpu-bind=none, you would see something like:

$ grep -v ECMWF-INFO fractional.out
Host=aa6-203  MPI Rank=0  OMP Thread=0  CPU=136  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=0  OMP Thread=1  CPU=  8  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=0  OMP Thread=2  CPU=  8  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=0  OMP Thread=3  CPU=  4  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=1  OMP Thread=0  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=1  OMP Thread=1  CPU=  4  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=1  OMP Thread=2  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=1  OMP Thread=3  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
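For reference, the run line inside the job that would produce that kind of unbound layout could look like the following sketch (the rest of fractional.sh stays as above):

# Let the kernel schedule processes and threads freely instead of having srun bind them
srun -c $SLURM_CPUS_PER_TASK --cpu-bind=none ./xthi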
- Can you ensure each one of those processes and threads runs on a single physical core, without exploiting hyperthreading? One possible starting point is sketched below.
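A hint rather than a full solution: Slurm's --hint=nomultithread option asks for only one hardware thread per physical core to be used, which should place each process and thread on its own physical core. A minimal sketch of how fractional.sh might look with it:

#!/bin/bash
#SBATCH --output=fractional.out
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
# Do not use the second hardware thread of each physical core
#SBATCH --hint=nomultithread

srun -c $SLURM_CPUS_PER_TASK ./xthi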