...

  1. Create a directory for this tutorial so all the exercises and outputs are contained inside:

    No Format
    mkdir ~/batch_tutorial
    cd ~/batch_tutorial


  2. Create and submit a job called simplest.sh with just default settings that runs the command hostname. Can you find the output and inspect it? Where did your job run?

    Expand
    titleSolution

    Using your favourite editor, create a file called simplest.sh with the following content

    Code Block
    languagebash
    titlesimplest.sh
    #!/bin/bash
    hostname

    You can submit it with sbatch:

    No Format
    sbatch simplest.sh

    The job should be run shortly. When finished, a new file called slurm-<jobid>.out should appear in the same directory. You can check the output with:

    No Format
    $ cat $(ls -1 slurm-*.out | tail -n1)
    ab6-202.bullx
    [ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
    [ECMWF-INFO -ecepilog] This is the ECMWF job Epilogue
    [ECMWF-INFO -ecepilog] +++ Please report issues using the Support portal +++
    [ECMWF-INFO -ecepilog] +++ https://support.ecmwf.int                     +++
    [ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
    [ECMWF-INFO -ecepilog] Run at 2023-10-25T11:31:53 on ecs
    [ECMWF-INFO -ecepilog] JobName                   : simplest.sh
    [ECMWF-INFO -ecepilog] JobID                     : 64273363
    [ECMWF-INFO -ecepilog] Submit                    : 2023-10-25T11:31:36
    [ECMWF-INFO -ecepilog] Start                     : 2023-10-25T11:31:51
    [ECMWF-INFO -ecepilog] End                       : 2023-10-25T11:31:53
    [ECMWF-INFO -ecepilog] QueuedTime                : 15.0
    [ECMWF-INFO -ecepilog] ElapsedRaw                : 2
    [ECMWF-INFO -ecepilog] ExitCode                  : 0:0
    [ECMWF-INFO -ecepilog] DerivedExitCode           : 0:0
    [ECMWF-INFO -ecepilog] State                     : COMPLETED
    [ECMWF-INFO -ecepilog] Account                   : myaccount
    [ECMWF-INFO -ecepilog] QOS                       : ef
    [ECMWF-INFO -ecepilog] User                      : user
    [ECMWF-INFO -ecepilog] StdOut                    : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
    [ECMWF-INFO -ecepilog] StdErr                    : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
    [ECMWF-INFO -ecepilog] NNodes                    : 1
    [ECMWF-INFO -ecepilog] NCPUS                     : 2
    [ECMWF-INFO -ecepilog] SBU                       : 0.011
    [ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------

    You can then see that the script has run on a different node than the one you are on.

    If you repeat the operation, you may get your job to run on a different node every time, whichever happens to be free at the time.
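
If you would rather not list the directory to find the newest file, one option is to take the job ID from the message sbatch prints on submission and build the output file name from it. The following is a minimal sketch under that assumption. The helper names job_id_from_sbatch and default_output_file are hypothetical, and the message is hard-coded so the snippet runs without a batch system.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from the message sbatch prints,
# for example "Submitted batch job 64273363" (the ID is the last word).
job_id_from_sbatch() {
    printf '%s\n' "$1" | awk '{print $NF}'
}

# Hypothetical helper: build the default output file name for a job ID.
default_output_file() {
    printf 'slurm-%s.out\n' "$1"
}

# Hard-coded message for illustration only
msg="Submitted batch job 64273363"
jobid=$(job_id_from_sbatch "$msg")
default_output_file "$jobid"   # prints slurm-64273363.out
```

On the system you would replace the hard-coded message with msg=$(sbatch simplest.sh).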


  3. Configure your simplest.sh job to direct the output to simplest-<jobid>.out, the error to simplest-<jobid>.err both in the same directory, and the job name to just "simplest". Note you will need to use a special placeholder for the -<jobid>.

    Expand
    titleSolution

    Using your favourite editor, open the simplest.sh job script and add the relevant #SBATCH directives:

    Code Block
    languagebash
    titlesimplest.sh
    #!/bin/bash
    #SBATCH --job-name=simplest
    #SBATCH --output=simplest-%j.out
    #SBATCH --error=simplest-%j.err
    hostname

    You can submit it again with:

    No Format
    sbatch simplest.sh

    After a few moments, you should see the new files appear in your directory (job id will be different than the one displayed here):

    No Format
    $ ls simplest-*.*
    simplest-64274497.err  simplest-64274497.out

    You can check that the job name was also changed in the end of job report:

    No Format
    $ grep -i jobname $(ls -1 simplest-*.err | tail -n1)
    [ECMWF-INFO -ecepilog] JobName                   : simplest



  4. From a terminal session outside the Atos HPCF or ECS, such as your VDI or computer, submit the simplest.sh job remotely. What hostname should you use?

    Expand
    titleSolution

    You must use hpc-batch for remote job submissions to HPCF, or ecs-batch for remote job submissions to ECS:

    No Format
    ssh hpc-batch "cd ~/batch_tutorial; sbatch simplest.sh"


    No Format
    ssh ecs-batch "cd ~/batch_tutorial; sbatch simplest.sh"

    Note the change of directory, so that the job script is found and the job's working directory and outputs end up in the right place.

    An alternative way of doing this without changing directory would be to tell sbatch to do it for you:

    No Format
    ssh hpc-batch sbatch -D ~/batch_tutorial ~/batch_tutorial/simplest.sh

    or for ECS:

    No Format
    ssh ecs-batch sbatch -D ~/batch_tutorial ~/batch_tutorial/simplest.sh



...

  1. Create a new job script broken1.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

    Code Block
    languagebash
    titlebroken1.sh
    collapsetrue
    #SBATCH --job-name = broken 1
    #SBATCH --output = broken1-%J.out
    #SBATCH --error = broken1-%J.out
    #SBATCH --qos = express
    #SBATCH --time = 00:05:00 
    
    echo "I was broken!"


    Expand
    titleSolution

    The job above has the following problems:

    • There is no shebang at the beginning of the script.
    • There should be no spaces around the = sign in the directives.
    • There should be no space in the job name.
    • QoS "express" does not exist
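
The first two problems in the list can be spotted before submission with a small check. This is only a sketch: check_job_script is a hypothetical helper, not part of Slurm, and it does not detect the space inside the job name or the nonexistent QoS.

```shell
#!/bin/bash
# Hypothetical helper: report a missing shebang and spaces around '='
# in #SBATCH directives. Returns non-zero if a problem is found.
check_job_script() {
    script=$1
    status=0
    # The first line must start with a shebang such as #!/bin/bash
    if ! head -n 1 "$script" | grep -q '^#!'; then
        echo "$script: no shebang on the first line"
        status=1
    fi
    # An option name followed by " =" or "= " has stray spaces
    if grep -nE '^#SBATCH[[:space:]]+--[a-z-]+([[:space:]]+=|=[[:space:]])' "$script"; then
        echo "$script: spaces around '=' in the directives listed above"
        status=1
    fi
    return $status
}
```

You could run it as check_job_script broken1.sh before calling sbatch.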

    Here is an amended version following best practices for the jobs:

    Code Block
    languagebash
    titlebroken1_fixed.sh
    #!/bin/bash
    #SBATCH --job-name=broken1
    #SBATCH --output=broken1-%J.out
    #SBATCH --error=broken1-%J.out
    #SBATCH --time=00:05:00 
    
    echo "I was broken!"

    Note that the QoS line was removed, but you may also use the following if running on ECS:

    No Format
    #SBATCH --qos=ef

    or alternatively, if on Atos HPCF:

    No Format
    #SBATCH --qos=nf

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 broken1-*.out | tail -n1)
    I was broken!



  2. Create a new job script broken2.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

    Code Block
    languagebash
    titlebroken2.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --job-name=broken2
    #SBATCH --output=broken2-%J.out
    #SBATCH --error=broken2-%J.out
    #SBATCH --qos=ns
    #SBATCH --time=10-00
    
    echo "I was broken!"


    Expand
    titleSolution

    The job above has the following problems:

    • QoS "ns" does not exist. Either remove to use the default or use the corresponding QoS on ECS (ef) or HPCF (nf)
    • The time requested is 10 days, which is longer than the maximum allowed. It was probably meant to be 10 minutes.
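
To see why 10-00 and 10:00 differ so much, it helps to convert both to seconds. The sketch below assumes the time formats accepted by sbatch are minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds. The helper name slurm_time_to_seconds is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert a Slurm time specification to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; h = 0; m = 0; s = 0
        if (index($0, "-") > 0) {
            # days-hours[:minutes[:seconds]]
            split($0, d, "-"); days = d[1]
            n = split(d[2], f, ":")
            h = f[1]
            if (n >= 2) m = f[2]
            if (n >= 3) s = f[3]
        } else {
            n = split($0, f, ":")
            if (n == 1)      { m = f[1] }                        # minutes
            else if (n == 2) { m = f[1]; s = f[2] }              # minutes:seconds
            else             { h = f[1]; m = f[2]; s = f[3] }    # hours:minutes:seconds
        }
        total = ((days * 24 + h) * 60 + m) * 60 + s
        print total
    }'
}

slurm_time_to_seconds 10-00   # 10 days:    prints 864000
slurm_time_to_seconds 10:00   # 10 minutes: prints 600
```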

    Here is an amended version:

    Code Block
    languagebash
    titlebroken2.sh
    #!/bin/bash
    #SBATCH --job-name=broken2
    #SBATCH --output=broken2-%J.out
    #SBATCH --error=broken2-%J.out
    #SBATCH --time=10:00
    
    echo "I was broken!"

    Again, note that the QoS line was removed, but you may also use the following if running on ECS:

    No Format
    #SBATCH --qos=ef

    or alternatively, if on Atos HPCF:

    No Format
    #SBATCH --qos=nf

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 broken2-*.out | tail -n1)
    I was broken!



  3. Create a new job script broken3.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

    Code Block
    languagebash
    titlebroken3.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --job-name=broken3
    #SBATCH --chdir=$SCRATCH
    #SBATCH --output=broken3output/broken3-%J.out
    #SBATCH --error=broken3output/broken3-%J.out
    
    echo "I was broken!"


    Expand
    titleSolution

    The job above has the following problems:

    • Variables are not expanded on job directives. You must specify your paths explicitly
    • The directory where the output and error files will go must exist beforehand. Otherwise the job will fail without leaving any output to explain why. The only hint comes from checking sacct:

      No Format
      $ sacct -X --name=broken3
      JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
      ------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
      64281800              broken3        ef     FAILED     0:53   00:00:02        1              ad6-201 


    You will need to create the output directory with:

    No Format
    mkdir -p $SCRATCH/broken3output/

    Here is an amended version of the job:

    Code Block
    languagebash
    titlebroken3.sh
    #!/bin/bash
    #SBATCH --job-name=broken3
    #SBATCH --chdir=/scratch/<your_user_id>
    #SBATCH --output=broken3output/broken3-%J.out
    #SBATCH --error=broken3output/broken3-%J.out
    
    echo "I was broken!"
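
If you prefer not to hard-code the path, one possible workaround is to let the shell expand the variable while generating the job script. This is a sketch rather than the tutorial's own method, and write_job_script is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the broken3 job script with the working
# directory already expanded, since #SBATCH lines are never expanded.
write_job_script() {
    workdir=$1
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=$workdir
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out

echo "I was broken!"
EOF
}
```

For example, write_job_script "$SCRATCH" > broken3.sh would produce a script with the explicit path. Passing the directory on the command line, as in sbatch --chdir="$SCRATCH" broken3.sh, should have a similar effect because the shell expands the variable before sbatch sees it.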

    Check that the job actually ran and generated the expected output:

    No Format
    $ grep -v ECMWF-INFO  $(ls -1 $SCRATCH/broken3output/broken3-*.out | tail -n1)
    I was broken!

    You may clean up the output directory with

    No Format
    rm -rf $SCRATCH/broken3output

...


  4. Create a new job script broken4.sh with the contents below and try to submit the job. You should not see the message in the output. What happened? Can you fix the job and keep trying until it runs successfully?

    Code Block
    languagebash
    titlebroken4.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --job-name=broken4
    #SBATCH --output=broken4-%J.out
    
    ls $FOO/bar
    echo "I should not be here"


    Expand
    titleSolution

    The job above has the following problems:

    • FOO variable is undefined when used. Undefined variables often lead to unexpected failures that are not always easy to spot.
    • Even if FOO were defined as "", the ls command would fail, yet the job would keep running and appear to finish successfully from Slurm's point of view. It should instead have failed and stopped at the first error.

    Here is an amended version of the job following best practices:

    Code Block
    languagebash
    titlebroken4.sh
    #!/bin/bash
    #SBATCH --output=broken4-%J.out
    
    set -x # echo script lines as they are executed
    set -e # stop the shell on first error
    set -u # fail when using an undefined variable
    set -o pipefail # If any command in a pipeline fails, that return code will be used as the return code of the whole pipeline
    
    ls $FOO/bar
    echo "I should not be here"

    With the extra shell options, we get extra information in the output about the commands being executed, and we ensure that the job stops at the first error (non-zero exit code) or when an undefined variable is used.
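
The effect of these options can be tried outside the batch system. The sketch below runs the same two commands in child shells, once without and once with set -eu, so the failure does not stop the calling script. The directory name is made up, and the function names are hypothetical.

```shell
#!/bin/bash
# The two commands from the job, with FOO deliberately unset.
# The directory name is a made-up example that should not exist.
commands='ls "$FOO/no_such_dir_for_this_sketch" 2>/dev/null; echo "I should not be here"'

# Without the options: ls fails, but the shell carries on and exits with 0
run_lenient() { ( unset FOO; bash -c "$commands" ); }

# With set -eu: the shell stops as soon as the undefined FOO is used
run_strict()  { ( unset FOO; bash -c "set -eu; $commands" ) 2>/dev/null; }

run_lenient
echo "lenient exit code: $?"
run_strict || echo "strict run stopped with a non-zero exit code"
```

In a real job you would keep the error messages rather than discarding them, since they tell you which variable or command caused the stop.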

    Info
    titleBest practices

    Even if most examples in this tutorial do not have the extra shell options for simplicity, you should always include those in your production jobs.



Understanding your limits

Although most limits are described in HPC2020: Batch system, you can also check them (or reach them) for yourself in the system.

  1. Create a new job script naughty.sh with the following contents:

    Code Block
    languagebash
    titlenaughty.sh
    #!/bin/bash
    #SBATCH --mem=100
    #SBATCH --output=naughty.out
    MEM=300
    perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"


  2. Submit  naughty.sh to the batch system and check its status. What happened to the job?

    Expand
    titleSolution

    You can submit it with:

    No Format
    sbatch naughty.sh

    You can then monitor the state of your job with  squeue:

    No Format
    squeue -j <jobid>

    After a few seconds of running, you may see the job finish and disappear. If we use sacct, we can see the job has failed, with an exit code of 9, which indicates it was killed:

    No Format
    $ sacct -X --name naughty.sh                                                                                                                                                       
    JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
    ------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
    64303470           naughty.sh        ef     FAILED      9:0   00:00:04        1              ac6-202 

    Inspecting the job output the reason becomes apparent:

    No Format
    $ grep -v ECMWF-INFO naughty.out  | head -n 22
    
    
    
    
              __  __ _____ __  __ _  _____ _     _         
        __/\_|  \/  | ____|  \/  | |/ /_ _| |   | |  __/\__
        \    / |\/| |  _| | |\/| | ' / | || |   | |  \    /
        /_  _\ |  | | |___| |  | | . \ | || |___| |__/_  _\
          \/ |_|  |_|_____|_|  |_|_|\_\___|_____|_____|\/  
    
    
    BEGIN OF ECMWF MEMKILL REPORT
    
    
    [ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 10:49:40 2023]
    
    [summary]
    job/session: 64303470
    requested/default memory limit for job/session: 100MiB
    sum of active and inactive _anonymous memory of job/session: 301MiB
    ACTION: about to issue: 'kill -SIGKILL' to pid: 3649110
    to-be-killed process: "perl -e $a="A"x(300*1024*1024/2); sleep", with resident-segment-size: 304MiB

    The job had a limit of 100 MiB, but it tried to use up to 300 MiB, so the system killed the process.


  3. Edit naughty.sh to comment out the request for memory, and then play with the MEM value.

    Code Block
    languagebash
    titlenaughty.sh
    #!/bin/bash
    #SBATCH --output=naughty.out
    ##SBATCH --mem=100
    MEM=300
    perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"
    1. How high can you go with the default memory limit on the default QoS before the system kills it?

      Expand
      titleSolution

      With trial and error, you will see the system will kill your tasks that go over 8000 MiB:

      Code Block
      languagebash
      titlenaughty.sh
      #!/bin/bash
      #SBATCH --output=naughty.out
      ##SBATCH --mem=100
      MEM=8000
      perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"

      Inspecting the job output will confirm that:

      No Format
      $ grep -v ECMWF-INFO naughty.out  | head -n 22
      
      
      
      
                __  __ _____ __  __ _  _____ _     _         
          __/\_|  \/  | ____|  \/  | |/ /_ _| |   | |  __/\__
          \    / |\/| |  _| | |\/| | ' / | || |   | |  \    /
          /_  _\ |  | | |___| |  | | . \ | || |___| |__/_  _\
            \/ |_|  |_|_____|_|  |_|_|\_\___|_____|_____|\/  
      
      
      BEGIN OF ECMWF MEMKILL REPORT
      
      
      [ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 11:16:43 2023]
      
      [summary]
      job/session: 64304303
      requested/default memory limit for job/session: 8000MiB
      sum of active and inactive _anonymous memory of job/session: 8001MiB
      ACTION: about to issue: 'kill -SIGKILL' to pid: 4016899
      to-be-killed process: "perl -e $a='A'x(8000*1024*1024/2); sleep", with resident-segment-size: 8004MiB



    2. How could you have checked this beforehand instead of taking the trial and error approach?

      Expand
      titleSolution

      You could have checked HPC2020: Batch system, or you could also ask Slurm for this information. Default memory is defined per partition, so you can then do

      No Format
      scontrol show partition

      The field we are looking for is DefMemPerNode:

      No Format
      $ scontrol -o show partition | tr " " "\n" | grep -i -e "DefMem" -e "PartitionName"
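
If you want one line per partition rather than a column of fields, the one-line records can be reduced with awk. This is a sketch that assumes each record is a single line of space-separated key=value fields. The helper name partition_default_mem is hypothetical and the sample record is made up purely to show the shape of the input.

```shell
#!/bin/bash
# Hypothetical helper: print "<partition> <default memory setting>" for each
# one-line partition record read from standard input.
partition_default_mem() {
    awk '{
        name = ""; mem = "(no DefMem field)"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^PartitionName=/) { name = $i; sub(/^PartitionName=/, "", name) }
            if ($i ~ /^DefMemPer/)      { mem = $i }
        }
        if (name != "") print name, mem
    }'
}

# Made-up sample record, only to show the shape of the input
printf 'PartitionName=example AllowGroups=ALL DefMemPerNode=8000 MaxTime=UNLIMITED\n' |
    partition_default_mem
```

On the system, the input would come from scontrol -o show partition instead of the sample record.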



    3. Can you check, without trial and error this time, what the maximum wall clock time, maximum CPUs, and maximum memory are that you can request from Slurm for each QoS?

      Expand
      titleSolution

      Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the QoS setup so the command is

      No Format
      sacctmgr show qos

      The fields we are looking for this time are MaxWall and MaxTRES:

      No Format
      sacctmgr -P show qos format=name,MaxWall,MaxTRES                                                                                             

      If you run this on HPCF, you may notice there is no maximum limit set at the QoS level for the np parallel QoS, so you are bound by the maximum memory available in the node.

      You can also see other limits such as the local SSD tmpdir space.


  4. How many jobs could you potentially have running concurrently? How many jobs could you have in the system (pending or running), before a further submission fails?

    Expand
    titleSolution

    Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the Association setup so the command is

    No Format
    sacctmgr show assoc where user=$USER

    The fields we are looking for are MaxJobs and MaxSubmit:

    No Format
    sacctmgr show assoc user=$USER format=account,user,partition,maxjobs,maxsubmit

    Remember that a Slurm Association is made of the user, project account and partition, and the limits are set at the association level.


Running small parallel jobs - fractional

Info
titleReference Documentation

HPC2020: Submitting a serial or small parallel job

HPC2020: Affinity

So far we have only run serial jobs. You may also want to run small parallel jobs, using multiple threads, multiple processes, or both. Examples of this are OpenMP and MPI programs. We call this kind of small parallel job "fractional", because it runs on a fraction of a node, sharing it with other users.

...

For these tests we will use David McKain's version of the Cray xthi code to visualise how the process and thread placement takes place.

  1. Load the xthi module with:

    No Format
    module load xthi


  2. Run the program interactively to familiarise yourself with the output:

    No Format
    $ xthi
    Host=ac6-200  MPI Rank=0  CPU=128  NUMA Node=0  CPU Affinity=0,128

    As you can see, only 1 process and 1 thread are run, and they may run on either of the two virtual cores assigned to your session (which correspond to the same physical core). If you try to run with 4 OpenMP threads, you will see they effectively fight each other for those same two cores, impacting the performance of your application but not that of anyone else on the login node:

    No Format
    $ OMP_NUM_THREADS=4 xthi
    Host=ac6-200  MPI Rank=0  OMP Thread=0  CPU=128  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=1  CPU=  0  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=2  CPU=128  NUMA Node=0  CPU Affinity=0,128
    Host=ac6-200  MPI Rank=0  OMP Thread=3  CPU=  0  NUMA Node=0  CPU Affinity=0,128


  3. Create a new job script fractional.sh to run xthi with 2 MPI tasks and 2 OpenMP threads, submit it and check the output to ensure the right number of tasks and threads were spawned. 

    Here is a job template to start with:

    Code Block
    languagebash
    titlefractional.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --output=fractional.out
    # TODO: Add here the missing SBATCH directives for the relevant resources
    
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Load xthi tool
    module load xthi
    
    # TODO: Add here the line to run xthi
    # Hint: use srun
    


    Expand
    titleSolution

    Using your favourite editor, create a file called fractional.sh with the following content:

    Code Block
    languagebash
    titlefractional.sh
    #!/bin/bash
    #SBATCH --output=fractional.out
    # Add here the missing SBATCH directives for the relevant resources
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2   
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Load xthi tool
    module load xthi
    
    srun -c $SLURM_CPUS_PER_TASK xthi

    You need to request 2 tasks, and 2 cpus per task in the job. Then we will use srun to spawn our parallel run, which should inherit the job geometry requested, except the cpus-per-task, which must be explicitly passed to srun.
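
Once the job has run, you can check the geometry from the xthi output rather than by eye. The sketch below assumes one output line per thread, each containing an "MPI Rank=" field as in the tutorial's own output. The helper names count_ranks and count_threads are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count distinct MPI ranks in xthi output on standard input
count_ranks() {
    grep -o 'MPI Rank= *[0-9][0-9]*' | tr -d ' ' | sort -u | wc -l | tr -d ' '
}

# Hypothetical helper: count output lines, one per thread across all ranks
count_threads() {
    grep -c 'MPI Rank='
}

# Sample lines in the format xthi prints for 2 tasks with 2 threads each
sample='Host=ad6-202  MPI Rank=0  OMP Thread=0  CPU=  5  NUMA Node=0  CPU Affinity=5,133
Host=ad6-202  MPI Rank=0  OMP Thread=1  CPU=133  NUMA Node=0  CPU Affinity=5,133
Host=ad6-202  MPI Rank=1  OMP Thread=0  CPU=137  NUMA Node=0  CPU Affinity=9,137
Host=ad6-202  MPI Rank=1  OMP Thread=1  CPU=  9  NUMA Node=0  CPU Affinity=9,137'

printf '%s\n' "$sample" | count_ranks     # prints 2
printf '%s\n' "$sample" | count_threads   # prints 4
```

With a real job you would feed in the file instead, for example grep -v ECMWF-INFO fractional.out | count_ranks.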

    You can submit it with sbatch:

    No Format
    sbatch fractional.sh

    The job should be run shortly. When finished, a new file called fractional.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO fractional.out

    You should see an output similar to:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=ad6-202  MPI Rank=0  OMP Thread=0  CPU=  5  NUMA Node=0  CPU Affinity=5,133
    Host=ad6-202  MPI Rank=0  OMP Thread=1  CPU=133  NUMA Node=0  CPU Affinity=5,133
    Host=ad6-202  MPI Rank=1  OMP Thread=0  CPU=137  NUMA Node=0  CPU Affinity=9,137
    Host=ad6-202  MPI Rank=1  OMP Thread=1  CPU=  9  NUMA Node=0  CPU Affinity=9,137


    Info
    titleSrun automatic cpu binding

    You can see srun automatically ensures certain binding of the cores to the tasks. If you were to instruct srun to avoid any cpu binding with --cpu-bind=none, you would see something like:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=aa6-203  MPI Rank=0  OMP Thread=0  CPU=136  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=0  OMP Thread=1  CPU=  8  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=1  OMP Thread=0  CPU=132  NUMA Node=0  CPU Affinity=4,8,132,136
    Host=aa6-203  MPI Rank=1  OMP Thread=1  CPU=  4  NUMA Node=0  CPU Affinity=4,8,132,136

    Here all processes and threads could run on any of the cores assigned to the job, potentially hopping from CPU to CPU during the program's execution.



  4. Can you ensure each one of the OpenMP threads runs on a single physical core, without exploiting the hyperthreading, for optimal performance?

    Expand
    titleSolution

    In order to ensure each thread gets their own core, you can use the environment variable OMP_PLACES=threads.

    Then, to make sure only physical cores are used for performance, we need to use the --hint=nomultithread directive:

    Code Block
    languagebash
    titlefractional.sh
    #!/bin/bash
    #SBATCH --output=fractional.out
    # Add here the missing SBATCH directives for the relevant resources
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2
    #SBATCH --hint=nomultithread
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi
     
    srun -c $SLURM_CPUS_PER_TASK xthi

    You can submit the modified job with sbatch:

    No Format
    sbatch fractional.sh

    You should see an output similar to the following one, where each thread is in a different core with a number lower than 128:

    No Format
    $ grep -v ECMWF-INFO fractional.out
    Host=ad6-201  MPI Rank=0  OMP Thread=0  CPU=18  NUMA Node=1  CPU Affinity=18
    Host=ad6-201  MPI Rank=0  OMP Thread=1  CPU=20  NUMA Node=1  CPU Affinity=20
    Host=ad6-201  MPI Rank=1  OMP Thread=0  CPU=21  NUMA Node=1  CPU Affinity=21
    Host=ad6-201  MPI Rank=1  OMP Thread=1  CPU=22  NUMA Node=1  CPU Affinity=22
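
The same check can be done mechanically by counting how many threads landed on a CPU numbered 128 or above. This sketch relies on the numbering seen in this tutorial's output, where CPU n and CPU n+128 share a physical core. The helper name count_hyperthread_cpus is hypothetical, and the threshold is a parameter in case the node layout differs.

```shell
#!/bin/bash
# Hypothetical helper: count xthi output lines whose CPU number is at or
# above the threshold (default 128), reading from standard input.
count_hyperthread_cpus() {
    sed -n 's/.*CPU= *\([0-9][0-9]*\).*/\1/p' |
        awk -v n="${1:-128}" '$1 >= n { c++ } END { print c + 0 }'
}

# Two lines in the format xthi prints, one on CPU 5 and one on CPU 133
printf '%s\n' \
    'Host=ad6-202  MPI Rank=0  OMP Thread=0  CPU=  5  NUMA Node=0  CPU Affinity=5,133' \
    'Host=ad6-202  MPI Rank=0  OMP Thread=1  CPU=133  NUMA Node=0  CPU Affinity=5,133' |
    count_hyperthread_cpus   # prints 1
```

A result of 0 for grep -v ECMWF-INFO fractional.out | count_hyperthread_cpus would indicate that no thread ran on a CPU numbered 128 or above.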



...

Here is a very simplified diagram of the Atos HPCF node that you should keep in mind when deciding your job geometries:

Gliffy Diagram
macroId152f57ca-cbad-43d6-a395-74d349c880c5
displayNameAtos HPCF AMD Rome simplified architecture
nameAtos HPCF AMD Rome simplified architecture
pagePin4

  1. If not already on HPCF, open a session on hpc-login.
  2. Create a new job script parallel.sh to run xthi with 32 MPI tasks and 4 OpenMP threads, leaving hyperthreading enabled. Submit it and check the output to ensure the right number of tasks and threads were spawned. Take note of which CPUs are used and how many SBUs you used.

    Here is a job template to start with:

    Code Block
    languagebash
    titleparallel.sh
    collapsetrue
    #!/bin/bash
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np 
    # TODO: Add here the missing SBATCH directives for the relevant resources
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi 
    
    srun -c $SLURM_CPUS_PER_TASK xthi 


    Expand
    titleSolution

    Using your favourite editor, create a file called parallel.sh with the following content:

    Code Block
    languagebash
    titleparallel.sh
    #!/bin/bash 
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np 
    #SBATCH --ntasks=32
    #SBATCH --cpus-per-task=4
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi
    
    srun -c $SLURM_CPUS_PER_TASK xthi

    You need to request 32 tasks, and 4 cpus per task in the job. Then we will use srun to spawn our parallel run, which should inherit the job geometry requested, except the cpus-per-task, which must be explicitly passed to srun.
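As a rough sketch of what this geometry means in CPU terms, the arithmetic below counts the logical CPUs requested and the physical cores they would occupy. The function name geometry_summary and the default of 2 hardware threads per core are assumptions made for illustration, and it further assumes that tasks do not share a physical core:

```shell
# Sketch only: summarise a tasks x cpus-per-task geometry.
# Arguments: number of tasks, CPUs per task, hardware threads per core.
geometry_summary() {
    ntasks=$1
    cpus_per_task=$2
    threads_per_core=${3:-2}   # assumption: 2 hardware threads per core
    logical=$(( ntasks * cpus_per_task ))
    # Round up per task, assuming a task does not share a core with another task.
    cores_per_task=$(( (cpus_per_task + threads_per_core - 1) / threads_per_core ))
    echo "logical_cpus=${logical} physical_cores=$(( ntasks * cores_per_task ))"
}

# Geometry used in this exercise, hyperthreading left enabled
geometry_summary 32 4
```

Passing 1 as the third argument gives the figure for a run without hyperthreading, where every logical CPU is a separate physical core.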

    You can submit it with sbatch:

    No Format
    sbatch parallel.sh

    The job should be run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)

    You should see an output similar to:

    No Format
    Host=ac2-4046  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0
    Host=ac2-4046  MPI Rank= 0  OMP Thread=1  CPU=128  NUMA Node=0  CPU Affinity=128
    Host=ac2-4046  MPI Rank= 0  OMP Thread=2  CPU=  1  NUMA Node=0  CPU Affinity=  1
    Host=ac2-4046  MPI Rank= 0  OMP Thread=3  CPU=129  NUMA Node=0  CPU Affinity=129
    Host=ac2-4046  MPI Rank= 1  OMP Thread=0  CPU=  2  NUMA Node=0  CPU Affinity=  2
    Host=ac2-4046  MPI Rank= 1  OMP Thread=1  CPU=130  NUMA Node=0  CPU Affinity=130
    Host=ac2-4046  MPI Rank= 1  OMP Thread=2  CPU=  3  NUMA Node=0  CPU Affinity=  3
    Host=ac2-4046  MPI Rank= 1  OMP Thread=3  CPU=131  NUMA Node=0  CPU Affinity=131
    ...
    Host=ac2-4046  MPI Rank=30  OMP Thread=0  CPU=116  NUMA Node=7  CPU Affinity=116
    Host=ac2-4046  MPI Rank=30  OMP Thread=1  CPU=244  NUMA Node=7  CPU Affinity=244
    Host=ac2-4046  MPI Rank=30  OMP Thread=2  CPU=117  NUMA Node=7  CPU Affinity=117
    Host=ac2-4046  MPI Rank=30  OMP Thread=3  CPU=245  NUMA Node=7  CPU Affinity=245
    Host=ac2-4046  MPI Rank=31  OMP Thread=0  CPU=118  NUMA Node=7  CPU Affinity=118
    Host=ac2-4046  MPI Rank=31  OMP Thread=1  CPU=246  NUMA Node=7  CPU Affinity=246
    Host=ac2-4046  MPI Rank=31  OMP Thread=2  CPU=119  NUMA Node=7  CPU Affinity=119
    Host=ac2-4046  MPI Rank=31  OMP Thread=3  CPU=247  NUMA Node=7  CPU Affinity=247

    Note the following facts:

    • Both the main cores (0-127) and the hyperthreads (128-255) were used.
    • Consecutive threads are placed on the same physical core (0 with 128, 1 with 129...).
    • Some physical cores are entirely unused, since their CPU numbers do not show in the output.
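If you want to check these points on your own output, a small filter along the following lines could count how many distinct physical cores appear. It is only a sketch: the function name count_physical_cores is made up for this example, and it assumes that logical CPUs n and n+128 share physical core n, as the listing above suggests.

```shell
# Sketch only: count distinct physical cores in xthi output read from stdin.
# Assumption: logical CPU n and n+128 sit on the same physical core n.
count_physical_cores() {
    awk '
        match($0, /CPU= *[0-9]+/) {
            cpu = substr($0, RSTART, RLENGTH)   # e.g. "CPU=128"
            sub(/CPU= */, "", cpu)              # keep only the number
            seen[cpu % 128] = 1                 # fold hyperthreads onto their core
        }
        END { n = 0; for (c in seen) n++; print n }
    '
}

# First four lines of the sample output above: 4 threads on 2 physical cores
count_physical_cores <<'EOF'
Host=ac2-4046  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0
Host=ac2-4046  MPI Rank= 0  OMP Thread=1  CPU=128  NUMA Node=0  CPU Affinity=128
Host=ac2-4046  MPI Rank= 0  OMP Thread=2  CPU=  1  NUMA Node=0  CPU Affinity=  1
Host=ac2-4046  MPI Rank= 0  OMP Thread=3  CPU=129  NUMA Node=0  CPU Affinity=129
EOF
```

On a real run you could pipe the filtered job output into it, for example with grep -v ECMWF-INFO on your parallel-<jobid>.out file.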

    In terms of SBUs, this job cost:

    No Format
    $ grep SBU $(ls -1 parallel-*.out | tail -n1)                                                                                                                                                                      
    [ECMWF-INFO -ecepilog] SBU                       : 6.051



  3. Modify the parallel.sh job geometry (number of tasks, threads and use of hyperthreading) so that you fully utilise all the physical cores, and only those, i.e. 0-127.

    Expand
    titleSolution

    Without hyperthreading, an Atos HPCF node has 128 physical cores available. Any combination of tasks and threads whose product is 128 will fill the node. Examples include 32 tasks x 4 threads, 64 tasks x 2 threads or 128 single-threaded tasks. For this example, we picked the first one:

    Code Block
    languagebash
    titleparallel.sh
    #!/bin/bash 
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np
    #SBATCH --ntasks=32
    #SBATCH --cpus-per-task=4
    #SBATCH --hint=nomultithread
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi
    
    srun -c $SLURM_CPUS_PER_TASK xthi
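The other geometries that fill the node can be enumerated with a short loop. This is a sketch for illustration only: the function name list_geometries is made up, and the default of 128 cores per node is taken from the figure quoted in this solution.

```shell
# Sketch only: list every tasks x threads geometry that exactly fills a node.
# Assumption: 128 physical cores per node, hyperthreading not used.
list_geometries() {
    cores=${1:-128}
    tasks=1
    while [ "$tasks" -le "$cores" ]; do
        # Keep only the task counts that divide the core count evenly
        if [ $(( cores % tasks )) -eq 0 ]; then
            echo "${tasks} tasks x $(( cores / tasks )) threads"
        fi
        tasks=$(( tasks + 1 ))
    done
}

list_geometries 128
```

Whether a given combination suits your application depends on how it scales with MPI tasks and OpenMP threads, which this listing does not address.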

    You can submit it with sbatch:

    No Format
    sbatch parallel.sh

    The job should be run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)

    You should see an output similar to:

    No Format
    Host=ac3-2015  MPI Rank= 0  OMP Thread=0  CPU=  0  NUMA Node=0  CPU Affinity=  0                                                                                                              
    Host=ac3-2015  MPI Rank= 0  OMP Thread=1  CPU=  1  NUMA Node=0  CPU Affinity=  1                                                                                                              
    Host=ac3-2015  MPI Rank= 0  OMP Thread=2  CPU=  2  NUMA Node=0  CPU Affinity=  2                                                                                                              
    Host=ac3-2015  MPI Rank= 0  OMP Thread=3  CPU=  3  NUMA Node=0  CPU Affinity=  3
    Host=ac3-2015  MPI Rank= 1  OMP Thread=0  CPU=  4  NUMA Node=0  CPU Affinity=  4
    Host=ac3-2015  MPI Rank= 1  OMP Thread=1  CPU=  5  NUMA Node=0  CPU Affinity=  5
    Host=ac3-2015  MPI Rank= 1  OMP Thread=2  CPU=  6  NUMA Node=0  CPU Affinity=  6
    Host=ac3-2015  MPI Rank= 1  OMP Thread=3  CPU=  7  NUMA Node=0  CPU Affinity=  7
    ...
    Host=ac3-2015  MPI Rank=30  OMP Thread=0  CPU=120  NUMA Node=7  CPU Affinity=120
    Host=ac3-2015  MPI Rank=30  OMP Thread=1  CPU=121  NUMA Node=7  CPU Affinity=121
    Host=ac3-2015  MPI Rank=30  OMP Thread=2  CPU=122  NUMA Node=7  CPU Affinity=122
    Host=ac3-2015  MPI Rank=30  OMP Thread=3  CPU=123  NUMA Node=7  CPU Affinity=123
    Host=ac3-2015  MPI Rank=31  OMP Thread=0  CPU=124  NUMA Node=7  CPU Affinity=124
    Host=ac3-2015  MPI Rank=31  OMP Thread=1  CPU=125  NUMA Node=7  CPU Affinity=125
    Host=ac3-2015  MPI Rank=31  OMP Thread=2  CPU=126  NUMA Node=7  CPU Affinity=126
    Host=ac3-2015  MPI Rank=31  OMP Thread=3  CPU=127  NUMA Node=7  CPU Affinity=127

    Note the following facts:

    • Only the main cores (0-127) were used.
    • Each thread gets one and only one CPU pinned to it.
    • All the physical cores are in use.

    In terms of SBUs, this job cost:

    No Format
    $ grep SBU $(ls -1 parallel-*.out | tail -n1)
    [ECMWF-INFO -ecepilog] SBU                       : 5.379



  4. Modify the parallel.sh job so it still runs on the np QoS, but only with 2 tasks and 2 threads. Check the SBU cost. Since the execution is 32 times smaller, did it cost 32 times less than the previous one? Why?

    Expand
    titleSolution

    Let's use the following job:

    Code Block
    languagebash
    titleparallel.sh
    #!/bin/bash
    #SBATCH --output=parallel-%j.out
    #SBATCH --qos=np
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2
    #SBATCH --hint=nomultithread
    
    # Define the number of OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # Ensure proper OpenMP thread CPU pinning
    export OMP_PLACES=threads
    
    # Load xthi tool
    module load xthi
    
    srun -c $SLURM_CPUS_PER_TASK xthi

    You can submit it with sbatch:

    No Format
    sbatch parallel.sh

    The job should be run shortly. When finished, a new file called parallel-<jobid>.out should appear in the same directory. You can check the relevant output with:

    No Format
    grep -v ECMWF-INFO $(ls -1 parallel-*.out | tail -n1)

    You should see an output similar to:

    No Format
    Host=ac2-3073  MPI Rank=0  OMP Thread=0  CPU= 0  NUMA Node=0  CPU Affinity= 0
    Host=ac2-3073  MPI Rank=0  OMP Thread=1  CPU= 1  NUMA Node=0  CPU Affinity= 1
    Host=ac2-3073  MPI Rank=1  OMP Thread=0  CPU=16  NUMA Node=1  CPU Affinity=16
    Host=ac2-3073  MPI Rank=1  OMP Thread=1  CPU=17  NUMA Node=1  CPU Affinity=17

    In terms of SBUs, this job cost:

    No Format
    $ grep SBU $(ls -1 parallel-*.out | tail -n1)                                                                                                                                                                      
    [ECMWF-INFO -ecepilog] SBU                       : 4.034

    This is on a similar scale to the previous job, which was 32 times bigger. The reason is that on the np QoS resources are allocated in full nodes. The SBU cost reflects the nodes allocated for a given period of time, no matter how they are used.
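As a rough check using two figures quoted in this section, the 32x4 job with hyperthreading cost 6.051 SBUs and this 2x2 job cost 4.034 SBUs. The helper below, whose name sbu_ratio is made up for this example, simply divides one by the other:

```shell
# Sketch only: ratio between two SBU figures.
sbu_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# 32x4 job versus 2x2 job, figures taken from the epilogue lines in this section
sbu_ratio 6.051 4.034   # prints 1.50
```

The ratio is nowhere near 32, which is what you would expect if the charge follows the nodes allocated rather than the CPUs actually used.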

    You may compare the cost of your last parallel job with that of your last fractional job, which had the same geometry (2x2):

    No Format
    $ grep -h SBU $(ls -1 parallel-*.out | tail -n1) fractional.out
    [ECMWF-INFO -ecepilog] SBU                       : 4.034
    [ECMWF-INFO -ecepilog] SBU                       : 0.084