...

Create a new job script broken1.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

Code Block

language	bash
title	broken1.sh
collapse	true

#SBATCH --job-name = broken 1
#SBATCH --output = broken1-%J.out
#SBATCH --error = broken1-%J.out
#SBATCH --qos = express
#SBATCH --time = 00:05:00 

echo "I was broken!"

Expand

title	Solution

The job above has the following problems:

There is no shebang at the beginning of the script.
There should be no spaces around the = sign in the directives.
There should be no space in the job name.
QoS "express" does not exist
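
The first two problems can be caught before submitting. The sketch below is a hypothetical helper (the name check_sbatch_script is an assumption, not part of Slurm) that only looks for a missing shebang and for spaces around the = sign in #SBATCH lines; it does not validate QoS names or any other option.

```bash
#!/bin/bash
# Hypothetical helper, not part of Slurm: flags a missing shebang and
# spaces around '=' in #SBATCH directives.
check_sbatch_script() {
    local script=$1 status=0
    # The first line must be a shebang such as #!/bin/bash
    if [[ $(head -n 1 "$script") != '#!'* ]]; then
        echo "$script: missing shebang on the first line"
        status=1
    fi
    # Directives must look like --option=value, with no spaces around '='
    if grep -nE '^#SBATCH.*([[:space:]]=|=[[:space:]])' "$script"; then
        echo "$script: spaces around '=' in the directives listed above"
        status=1
    fi
    return $status
}
```

For example, `check_sbatch_script broken1.sh && sbatch broken1.sh` would submit the job only if both checks pass.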

Here is an amended version:

Code Block

language	bash
title	broken1_fixed.sh

#!/bin/bash
#SBATCH --job-name=broken1
#SBATCH --output=broken1-%J.out
#SBATCH --error=broken1-%J.out
#SBATCH --time=00:05:00 

echo "I was broken!"

Note that the QoS line was removed, but you may also use the following if running on ECS:

No Format
#SBATCH --qos=ef

or alternatively, if on Atos HPCF:

No Format
#SBATCH --qos=nf

Check that the job actually ran and generated the expected output:

No Format
$ grep -v ECMWF-INFO $(ls -1 broken1-*.out | head -n1)
I was broken!

Create a new job script broken2.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

Code Block

language	bash
title	broken2.sh
collapse	true

#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --qos=ns
#SBATCH --time=10-00

echo "I was broken!"

Expand

title	Solution

The job above has the following problems:

QoS "ns" does not exist. Either remove the directive to use the default QoS, or use the corresponding QoS on ECS (ef) or HPCF (nf).
The time requested is 10 days, which is longer than the maximum allowed. It was probably meant to be 10 minutes.
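
Slurm reads a time limit as minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes or days-hours:minutes:seconds, which is why 10-00 means 10 days while 10:00 means 10 minutes. As an illustration only, the hypothetical function below (its name is an assumption, not a Slurm tool) converts such a value to seconds so the two can be compared; it assumes a well-formed value and does no validation.

```bash
#!/bin/bash
# Hypothetical helper: convert a Slurm time specification to seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0
    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        spec=${spec#*-}
        IFS=: read -r h m s <<< "$spec"
    else
        # minutes, minutes:seconds or hours:minutes:seconds
        local -a f
        IFS=: read -r -a f <<< "$spec"
        case ${#f[@]} in
            1) m=${f[0]} ;;
            2) m=${f[0]}; s=${f[1]} ;;
            3) h=${f[0]}; m=${f[1]}; s=${f[2]} ;;
        esac
    fi
    # 10# forces base 10 so that values such as "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#${h:-0} * 3600 + 10#${m:-0} * 60 + 10#${s:-0} ))
}
```

With this sketch, `slurm_time_to_seconds 10-00` prints 864000, whereas `slurm_time_to_seconds 10:00` prints 600.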

Here is an amended version:

Code Block

language	bash
title	broken2.sh

#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --time=10:00

echo "I was broken!"

Again, note that the QoS line was removed, but you may also use the following if running on ECS:

No Format
#SBATCH --qos=ef

or alternatively, if on Atos HPCF:

No Format
#SBATCH --qos=nf

Check that the job actually ran and generated the expected output:

No Format
$ grep -v ECMWF-INFO $(ls -1 broken2-*.out | head -n1)
I was broken!

Create a new job script broken3.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

Code Block

language	bash
title	broken3.sh
collapse	true

#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=$SCRATCH
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out

echo "I was broken!"

Expand

title	Solution

The job above has the following problems:

Variables are not expanded in job directives. You must specify your paths explicitly.

The directory where the output and error files will go must exist beforehand. Otherwise the job will fail, and you will not get any output explaining what happened to it. The only hint comes from checking sacct:

No Format

$ sacct -X --name=broken3
JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
64281800              broken3        ef     FAILED     0:53   00:00:02        1              ad6-201

You will need to create the output directory with:

No Format
mkdir -p $SCRATCH/broken3output/
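
To avoid forgetting this step, the directory can be created from the job script itself just before submission. The function below is a hypothetical sketch (prepare_output_dir is an assumed name, not a Slurm command). It assumes a literal --output=<path> directive whose directory part contains no filename patterns such as %J, and it takes the job's working directory as a second argument because variables in directives are not expanded.

```bash
#!/bin/bash
# Hypothetical helper: create the directory named in the --output directive.
# $1 = job script, $2 = working directory of the job (defaults to $PWD)
prepare_output_dir() {
    local script=$1 workdir=${2:-$PWD} out
    # Take the value of the first "#SBATCH --output=" line, if any
    out=$(sed -n 's/^#SBATCH[[:space:]]*--output=//p' "$script" | head -n 1)
    [[ -n $out ]] || return 0                 # no --output directive: nothing to do
    [[ $out == /* ]] || out=$workdir/$out     # relative paths start at the working directory
    mkdir -p "$(dirname "$out")"
}
```

For example, `prepare_output_dir broken3.sh "$SCRATCH" && sbatch broken3.sh`.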

Here is an amended version of the job:

Code Block

language	bash
title	broken3.sh

#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=/scratch/<your_user_id>
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out

echo "I was broken!"

Check that the job actually ran and generated the expected output:

No Format
$ grep -v ECMWF-INFO $(ls -1 $SCRATCH/broken3output/broken3-*.out | head -n1)
I was broken!

You may clean up the output directory with

No Format
rm -rf $SCRATCH/broken3output

...

Create a new job script naughty.sh with the following contents:

Code Block

language	bash
title	naughty.sh

#!/bin/bash
#SBATCH --mem=100
#SBATCH --output=naughty.out
MEM=300
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"

Submit naughty.sh to the batch system and check its status. What happened to the job?

Expand

title	Solution

You can submit it with:

No Format
sbatch naughty.sh

You can then monitor the state of your job with squeue:

No Format
squeue -j <jobid>

After a few seconds of running, you may see the job finishes and disappears. If we use sacct, we can see the job has failed, with an exit code of 9, which indicates it was killed:

No Format

$ sacct -X --name naughty.sh
JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
64303470           naughty.sh        ef     FAILED      9:0   00:00:04        1              ac6-202

Inspecting the job output the reason becomes apparent:

No Format

$ grep -v ECMWF-INFO naughty.out  | head -n 22




          __  __ _____ __  __ _  _____ _     _         
    __/\_|  \/  | ____|  \/  | |/ /_ _| |   | |  __/\__
    \    / |\/| |  _| | |\/| | ' / | || |   | |  \    /
    /_  _\ |  | | |___| |  | | . \ | || |___| |__/_  _\
      \/ |_|  |_|_____|_|  |_|_|\_\___|_____|_____|\/  

              BEGIN OF ECMWF MEMKILL REPORT


[ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 10:49:40 2023]

[summary]
job/session: 64303470
requested/default memory limit for job/session: 100MiB
sum of active and inactive _anonymous memory of job/session: 301MiB
ACTION: about to issue: 'kill -SIGKILL' to pid: 3649110
to-be-killed process: "perl -e $a="A"x(300*1024*1024/2); sleep", with resident-segment-size: 304MiB

The job had a limit of 100 MiB, but it tried to use up to 300 MiB, so the system killed the process.

Edit naughty.sh to comment out the request for memory, and then play with the MEM value.

Code Block

language	bash
title	naughty.sh

#!/bin/bash
#SBATCH --output=naughty.out
##SBATCH --mem=100
MEM=300
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"

How high can you go with the default memory limit on the default QoS before the system kills it?

Expand

title	Solution

With trial and error, you will see the system will kill your tasks that go over 8000 MiB:

Code Block

language	bash
title	naughty.sh

#!/bin/bash
#SBATCH --output=naughty.out
##SBATCH --mem=100
MEM=8000
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"

Inspecting the job output will confirm that:

No Format

$ grep -v ECMWF-INFO naughty.out  | head -n 22




          __  __ _____ __  __ _  _____ _     _         
    __/\_|  \/  | ____|  \/  | |/ /_ _| |   | |  __/\__
    \    / |\/| |  _| | |\/| | ' / | || |   | |  \    /
    /_  _\ |  | | |___| |  | | . \ | || |___| |__/_  _\
      \/ |_|  |_|_____|_|  |_|_|\_\___|_____|_____|\/  

              BEGIN OF ECMWF MEMKILL REPORT


[ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 11:16:43 2023]

[summary]
job/session: 64304303
requested/default memory limit for job/session: 8000MiB
sum of active and inactive _anonymous memory of job/session: 8001MiB
ACTION: about to issue: 'kill -SIGKILL' to pid: 4016899
to-be-killed process: "perl -e $a='A'x(8000*1024*1024/2); sleep", with resident-segment-size: 8004MiB

How could you have checked this beforehand instead of taking the trial and error approach?

Expand

title	Solution

You could have checked HPC2020: Batch system, or you could also ask Slurm for this information. Default memory is defined per partition, so you can then do

No Format
scontrol show partition

The field we are looking for is DefMemPerNode:

No Format
$ scontrol -o show partition | tr " " "\n" | grep -i -e "DefMem" -e "PartitionName"
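
If you prefer one line per partition, the same one-line-per-partition output of scontrol -o show partition can be summarised with awk. This is a minimal sketch: the function name is hypothetical, and it assumes each record is a series of space-separated Key=Value pairs.

```bash
#!/bin/bash
# Hypothetical helper: reads `scontrol -o show partition` output on stdin and
# prints "<partition> <DefMemPerNode>" for each partition.
partition_default_memory() {
    awk '{
        name = ""; mem = "not set"
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "PartitionName") name = kv[2]
            if (kv[1] == "DefMemPerNode") mem = kv[2]
        }
        if (name != "") print name, mem
    }'
}
```

It would be used as `scontrol -o show partition | partition_default_memory`.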

Can you check, without trial and error this time, what is the maximum wall clock time, maximum CPUs, and maximum memory you can request from Slurm for each QoS?

Expand

title	Solution

Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the QoS setup, so the command is

No Format
sacctmgr show qos

The fields we are looking for this time are MaxWall and MaxTRES:

No Format
sacctmgr -P show qos format=name,MaxWall,MaxTRES
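
The -P option produces pipe-separated output with a header line, and an empty field means that no limit is set at the QoS level. As a rough sketch, the hypothetical function below (not part of Slurm) reads that output on stdin and makes the empty fields explicit:

```bash
#!/bin/bash
# Hypothetical helper: reads the output of
#   sacctmgr -P show qos format=name,MaxWall,MaxTRES
# on stdin and prints tab-separated rows, marking empty limits as "not set".
qos_limits() {
    awk -F'|' 'NR > 1 {
        wall = ($2 == "" ? "not set" : $2)
        tres = ($3 == "" ? "not set" : $3)
        printf "%s\t%s\t%s\n", $1, wall, tres
    }'
}
```

It would be used as `sacctmgr -P show qos format=name,MaxWall,MaxTRES | qos_limits`.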

If you run this on HPCF, you may notice there is no maximum limit set at the QoS level for the np parallel QoS, so you are bound by the maximum memory available in the node.

You can also see other limits such as the local SSD tmpdir space.

How many jobs could you potentially have running concurrently? How many jobs could you have in the system (pending or running), before a further submission fails?

Expand

title	Solution

Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the Association setup, so the command is

No Format
sacctmgr show assoc where user=$USER

The fields we are looking for are MaxJobs and MaxSubmit:

No Format
sacctmgr show assoc user=$USER format=account,user,partition,maxjobs,maxsubmit

Remember that a Slurm Association is made of the user, project account and partition, and the limits are set at the association level.

...

If you have followed this tutorial so far, you will have realised that ECS users may run very small parallel jobs on the default ef QoS, whereas HPCF users may run slightly bigger jobs (up to half a GPIL node) on the default nf QoS.

For these tests we will use David McKain's version of the Cray xthi code to visualise how process and thread placement takes place.

Download and compile the code in your Atos HPCF or ECS shell session with the following commands:

No Format
module load prgenv/gnu hpcx-openmpi
wget https://git.ecdf.ed.ac.uk/dmckain/xthi/-/raw/master/xthi.c
mpicc -o xthi -fopenmp xthi.c -lnuma

Run the program interactively to familiarise yourself with the output:

No Format
$ ./xthi
Host=ac6-200  MPI Rank=0  CPU=128  NUMA Node=0  CPU Affinity=0,128

As you can see, only 1 process and 1 thread are run, and they may run on either of the two virtual cores assigned to your session (which correspond to the same physical CPU). If you try to run with 4 OpenMP threads, you will see they effectively fight each other for those same two cores, impacting the performance of your application but not that of anyone else on the login node:

No Format

$ OMP_NUM_THREADS=4 ./xthi
Host=ac6-200  MPI Rank=0  OMP Thread=0  CPU=128  NUMA Node=0  CPU Affinity=0,128
Host=ac6-200  MPI Rank=0  OMP Thread=1  CPU=  0  NUMA Node=0  CPU Affinity=0,128
Host=ac6-200  MPI Rank=0  OMP Thread=2  CPU=128  NUMA Node=0  CPU Affinity=0,128
Host=ac6-200  MPI Rank=0  OMP Thread=3  CPU=  0  NUMA Node=0  CPU Affinity=0,128

Create a new job script fractional.sh to run xthi with 2 MPI tasks and 2 OpenMP threads, submit it and check the output to ensure the right number of tasks and threads were spawned.

Here is a job template to start with:

Code Block

language	bash
title	fractional.sh
collapse	true

#!/bin/bash
#SBATCH --output=fractional.out
# Add here the missing SBATCH directives for the relevant resources

# Add here the line to run xthi
# Hint: use srun

Expand

title	Solution

Using your favourite editor, create a file called fractional.sh with the following content:

Code Block

language	bash
title	fractional.sh

#!/bin/bash
#SBATCH --output=fractional.out
# Add here the missing SBATCH directives for the relevant resources
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2

# Add here the line to run xthi
# Hint: use srun
srun -c $SLURM_CPUS_PER_TASK ./xthi

You need to request 2 tasks, and 2 cpus per task in the job. Then we will use srun to spawn our parallel run, which should inherit the job geometry requested, except the cpus-per-task, which must be explicitly passed to srun.

You can submit it with sbatch:

No Format
sbatch fractional.sh

The job should run shortly. When finished, a new file called fractional.out should appear in the same directory. You can check the relevant output with:

No Format
grep -v ECMWF-INFO fractional.out

You should see an output similar to:

No Format

$ grep -v ECMWF-INFO fractional.out
Host=ad6-202  MPI Rank=0  OMP Thread=0  CPU=  5  NUMA Node=0  CPU Affinity=5,133
Host=ad6-202  MPI Rank=0  OMP Thread=1  CPU=133  NUMA Node=0  CPU Affinity=5,133
Host=ad6-202  MPI Rank=1  OMP Thread=0  CPU=137  NUMA Node=0  CPU Affinity=9,137
Host=ad6-202  MPI Rank=1  OMP Thread=1  CPU=  9  NUMA Node=0  CPU Affinity=9,137
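
Rather than counting lines by eye, the geometry can be checked mechanically. The following is a minimal sketch with a hypothetical function name; it assumes the xthi line format shown above, with both an "MPI Rank=" and an "OMP Thread=" field on every line.

```bash
#!/bin/bash
# Hypothetical helper: reads xthi output on stdin and succeeds only if exactly
# ntasks x threads distinct (rank, thread) pairs are present.
# $1 = expected number of MPI tasks, $2 = expected OpenMP threads per task
check_geometry() {
    local ntasks=$1 threads=$2 found
    found=$(sed -n 's/.*MPI Rank=\([0-9]*\) *OMP Thread=\([0-9]*\).*/\1 \2/p' | sort -u | wc -l)
    echo "found $((found)) rank/thread pairs, expected $((ntasks * threads))"
    (( found == ntasks * threads ))
}
```

For the job above it would be used as `grep -v ECMWF-INFO fractional.out | check_geometry 2 2`.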

Info

title	Srun automatic cpu binding

You can see that srun automatically ensures a certain binding of the cores to the tasks, although perhaps not the best one. If you were to instruct srun to avoid any CPU binding with --cpu-bind=none, you would see something like:

No Format

$ grep -v ECMWF-INFO fractional.out
Host=aa6-203  MPI Rank=0  OMP Thread=0  CPU=136  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=0  OMP Thread=1  CPU=  8  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=1  OMP Thread=0  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
Host=aa6-203  MPI Rank=1  OMP Thread=1  CPU=  4  NUMA Node=0  CPU Affinity=4,8,132,136

Here all processes and threads could run on any of the cores assigned to the job, potentially hopping from CPU to CPU during the program's execution.

Can you ensure each one of the OpenMP threads runs on a single physical core, without exploiting the hyperthreading, for optimal performance?

Expand

title	Solution

In order to ensure each thread gets its own core, you can use the environment variable OMP_PLACES=threads.

Then, to make sure only physical cores are used for performance, we need to use the --hint=nomultithread directive:

Code Block

language	bash
title	fractional.sh

#!/bin/bash
#SBATCH --output=fractional.out
# Add here the missing SBATCH directives for the relevant resources
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --hint=nomultithread

# Add here the line to run xthi
# Hint: use srun
export OMP_PLACES=threads
srun -c $SLURM_CPUS_PER_TASK ./xthi

You can submit the modified job with sbatch:

No Format
sbatch fractional.sh

You should see an output similar to the following one, where each thread is in a different core with a number lower than 128:

No Format

$ grep -v ECMWF-INFO fractional.out
Host=ad6-201  MPI Rank=0  OMP Thread=0  CPU=18  NUMA Node=1  CPU Affinity=18
Host=ad6-201  MPI Rank=0  OMP Thread=1  CPU=20  NUMA Node=1  CPU Affinity=20
Host=ad6-201  MPI Rank=1  OMP Thread=0  CPU=21  NUMA Node=1  CPU Affinity=21
Host=ad6-201  MPI Rank=1  OMP Thread=1  CPU=22  NUMA Node=1  CPU Affinity=22
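
The "lower than 128" condition can also be checked with a short filter. This sketch uses a hypothetical function name and takes the threshold as a parameter; the default of 128 comes from the text above and may not apply to other node types.

```bash
#!/bin/bash
# Hypothetical helper: reads xthi output on stdin and fails if any thread ran
# on a CPU numbered at or above the given threshold (default 128).
check_physical_cores() {
    local ncores=${1:-128}
    awk -v n="$ncores" '
        # The first "CPU=" field on each line is the CPU the thread ran on
        match($0, /CPU= *[0-9]+/) {
            cpu = substr($0, RSTART + 4, RLENGTH - 4) + 0
            if (cpu >= n) { print "hyperthread in use: " $0; bad = 1 }
        }
        END { exit bad + 0 }
    '
}
```

It would be used as `grep -v ECMWF-INFO fractional.out | check_physical_cores`.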

Running parallel jobs - HPCF only

Info

title	Reference Documentation

HPC2020: Submitting a parallel job

HPC2020: Affinity

For bigger parallel executions, you will need to use the HPCF's parallel QoS, np, which gives access to the biggest partition of nodes in every complex.

When running in such a configuration, your job gets exclusive use of the nodes where it runs, so external interference is minimised. It is therefore important that the allocated resources are used efficiently.

Figure: Atos HPCF AMD Rome simplified architecture

...

So far we have only run serial jobs. You may also want to run small parallel jobs, using multiple threads, multiple processes, or both. Examples of this are MPI and OpenMP programs. We call this kind of small parallel job "fractional", because it runs on a fraction of a node, sharing it with other users.

...

Versions Compared

Old Version 15

New Version 16
