Working in Batch - Atos HPCF and ECS Introduction Tutorial

Reference Documentation

HPC2020: Batch system

Let's explore how to use the Slurm batch system on the Atos HPCF or ECS.

Basic job submission

Access the default login node of the Atos HPCF or ECS.

Create a directory for this tutorial so all the exercises and outputs are contained inside:
```
mkdir ~/batch_tutorial
cd ~/batch_tutorial
```

Create and submit a job called simplest.sh with just default settings that runs the command hostname. Can you find the output and inspect it? Where did your job run?

Using your favourite editor, create a file called simplest.sh with the following content:

simplest.sh

#!/bin/bash
hostname

You can submit it with sbatch:

sbatch simplest.sh

The job should run shortly. When finished, a new file called slurm-<jobid>.out should appear in the same directory. You can check the output with:

$ cat $(ls -r1 slurm-*.out | head -n1)
ab6-202.bullx
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
[ECMWF-INFO -ecepilog] This is the ECMWF job Epilogue
[ECMWF-INFO -ecepilog] +++ Please report issues using the Support portal +++
[ECMWF-INFO -ecepilog] +++ https://support.ecmwf.int                     +++
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
[ECMWF-INFO -ecepilog] Run at 2023-10-25T11:31:53 on ecs
[ECMWF-INFO -ecepilog] JobName                   : simplest.sh
[ECMWF-INFO -ecepilog] JobID                     : 64273363
[ECMWF-INFO -ecepilog] Submit                    : 2023-10-25T11:31:36
[ECMWF-INFO -ecepilog] Start                     : 2023-10-25T11:31:51
[ECMWF-INFO -ecepilog] End                       : 2023-10-25T11:31:53
[ECMWF-INFO -ecepilog] QueuedTime                : 15.0
[ECMWF-INFO -ecepilog] ElapsedRaw                : 2
[ECMWF-INFO -ecepilog] ExitCode                  : 0:0
[ECMWF-INFO -ecepilog] DerivedExitCode           : 0:0
[ECMWF-INFO -ecepilog] State                     : COMPLETED
[ECMWF-INFO -ecepilog] Account                   : myaccount
[ECMWF-INFO -ecepilog] QOS                       : ef
[ECMWF-INFO -ecepilog] User                      : user
[ECMWF-INFO -ecepilog] StdOut                    : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
[ECMWF-INFO -ecepilog] StdErr                    : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
[ECMWF-INFO -ecepilog] NNodes                    : 1
[ECMWF-INFO -ecepilog] NCPUS                     : 2
[ECMWF-INFO -ecepilog] SBU                       : 0.011
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------

You can then see that the script has run on a different node than the one you are on.

If you repeat the operation, you may get your job to run on a different node every time, whichever happens to be free at the time.
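
Since the output file name depends on the job id, it can help to capture the id at submission time instead of listing the directory. The sketch below assumes sbatch prints its usual confirmation line, "Submitted batch job <jobid>". The helper name jobid_from_sbatch is made up for this example, and a canned message stands in for a real submission so the snippet runs anywhere.

```shell
#!/bin/bash
# Sketch: pull the job id out of sbatch's confirmation message.
# Assumption: sbatch prints "Submitted batch job <jobid>" on success.
jobid_from_sbatch() {
    awk '/^Submitted batch job/ {print $4}'
}

# On a system with Slurm you might run (not executed here):
#   jobid=$(sbatch simplest.sh | jobid_from_sbatch)
#   cat "slurm-${jobid}.out"

# Demonstration with a canned message, reusing the job id shown above.
jobid=$(echo "Submitted batch job 64273363" | jobid_from_sbatch)
outfile="slurm-${jobid}.out"
echo "$outfile"
```

Filtering on the confirmation line keeps the helper working even if the site prints additional informational lines at submission.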

Configure your simplest.sh job to direct the output to simplest-<jobid>.out, the error to simplest-<jobid>.err both in the same directory, and the job name to just "simplest". Note you will need to use a special placeholder for the -<jobid>.
Using your favourite editor, open the simplest.sh job script and add the relevant #SBATCH directives:
simplest.sh
#!/bin/bash
#SBATCH --job-name=simplest
#SBATCH --output=simplest-%j.out
#SBATCH --error=simplest-%j.err
hostname
You can submit it again with:
sbatch simplest.sh
After a few moments, you should see the new files appear in your directory (job id will be different than the one displayed here):
$ ls simplest-*.*
simplest-64274497.err  simplest-64274497.out
You can check that the job name was also changed in the end of job report:
$ grep -i jobname $(ls -r1 simplest-*.err | head -n1)
[ECMWF-INFO -ecepilog] JobName                   : simplest
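
To make the role of the %j placeholder concrete, the following sketch imitates the substitution with sed. It is only an illustration of the naming scheme: expand_pattern is a made-up helper, not a Slurm command, and the job id is the one from the listing above.

```shell
#!/bin/bash
# Illustration only: mimic how %j in a filename pattern becomes the job id.
# expand_pattern is a made-up helper, not a Slurm command.
expand_pattern() {
    printf '%s\n' "$1" | sed "s/%j/$2/g"
}

out=$(expand_pattern "simplest-%j.out" 64274497)
err=$(expand_pattern "simplest-%j.err" 64274497)
echo "$out $err"
```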
From a terminal session outside the Atos HPCF or ECS, such as your VDI or computer, submit the simplest.sh job remotely. What hostname should you use?
You must use hpc-batch for remote HPCF job submissions, or ecs-batch for remote ECS job submissions:
ssh hpc-batch "cd ~/batch_tutorial; sbatch simplest.sh"
ssh ecs-batch "cd ~/batch_tutorial; sbatch simplest.sh"
Note the change of directory, so that the job script is found and the job's working directory and outputs end up in the right place.
An alternative way of doing this without changing directory would be to tell sbatch to do it for you:
ssh hpc-batch sbatch -D ~/batch_tutorial ~/batch_tutorial/simplest.sh
or for ECS:
ssh ecs-batch sbatch -D ~/batch_tutorial ~/batch_tutorial/simplest.sh
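
If you submit remotely to both platforms, a small wrapper can pick the right batch host for you. The sketch below is a dry run under stated assumptions: remote_submit_cmd and its platform labels (hpcf, ecs) are invented for this example, and it only prints the command rather than opening a connection.

```shell
#!/bin/bash
# Sketch: choose the batch host by platform and print the ssh command.
# remote_submit_cmd is a hypothetical helper; it prints, it never connects.
remote_submit_cmd() {
    local platform=$1 workdir=$2 script=$3 host
    case "$platform" in
        hpcf) host=hpc-batch ;;
        ecs)  host=ecs-batch ;;
        *)    echo "unknown platform: $platform" >&2; return 1 ;;
    esac
    echo "ssh $host sbatch -D $workdir $workdir/$script"
}

# The tilde is quoted so that it is expanded on the remote side.
cmd=$(remote_submit_cmd ecs '~/batch_tutorial' simplest.sh)
echo "$cmd"
```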

Basic job management

Create a new job script sleepy.sh with the contents below:
sleepy.sh
```
#!/bin/bash
sleep 120
```
Submit sleepy.sh to the batch system and check its status. Once it is running, cancel it and inspect the output.
You can submit your job with:
sbatch sleepy.sh
You can then check the state of your job with squeue:
squeue -j <jobid>
if you use the <jobid> of the job you just submitted, or just:
squeue --me
to list all your jobs.
To cancel your job, just run scancel:
scancel <jobid>
If you inspect the output file from your last job, you will see a message like the following:
slurmstepd: error: *** JOB 64281137 ON ab6-202 CANCELLED AT 2023-10-25T15:40:51 ***
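
Checking squeue by hand works for a single job; in scripts you may prefer to wait until the job has left the queue. The sketch below polls squeue with the -h option, which suppresses the header so that empty output means the job is no longer listed. The helper name wait_for_job and the 10 second default interval are choices made for this example, not site recommendations.

```shell
#!/bin/bash
# Sketch: block until a job no longer appears in squeue.
# wait_for_job is a hypothetical helper; the default interval is arbitrary.
wait_for_job() {
    local jobid=$1 interval=${2:-10}
    # Errors (for example, an id Slurm no longer knows) are discarded,
    # so they also end the loop.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# On a system with Slurm you might run (not executed here):
#   jobid=$(sbatch --parsable sleepy.sh)
#   wait_for_job "$jobid"
#   sacct -X -j "$jobid"
```

Keep the polling interval generous; very frequent squeue calls add load to the scheduler.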
Can you get information about the jobs you have run so far today, including those that have finished already?
When jobs finish, they will not appear in the squeue output any longer. You can then check the Accounting Database with sacct:
sacct
With no arguments, this command will show you the list of all jobs run by you on this day.
In the output you may see three or more entries, such as:
JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList
------------ ---------------- --------- ---------- -------- ---------- -------- --------------------
...
64281137            sleepy.sh        ef CANCELLED+      0:0   00:00:16        1              ab6-202
64281137.ba+            batch            CANCELLED     0:15   00:00:17        1              ab6-202
64281137.ex+           extern            COMPLETED      0:0   00:00:16        1              ab6-202
The first one corresponds to the job itself. The second one (always named batch) corresponds to the actual job script and the third (named extern) corresponds to the external step used to generate the end of job information. You may have more lines if your job contains more steps, which typically correspond to srun parallel executions.
If you want to list just the entry for the job itself, you can do:
sacct -X
Can you get information of all the jobs run today by you that were cancelled?
You can filter jobs by state with the -s option. But if you run it naively:
sacct -X -s CANCELLED
you will get no output. That is because when filtering by state you must also specify the start and end times of your query period. You can then do something like:
sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S)
The default information shown on the screen when querying past jobs is limited. Can you extract the submit, start, and end times of your cancelled jobs today? What about their output and error path? Hint: use the corresponding man page for all the options.
You can use the following command to see all the possible output fields you can query for:
sacct -e
While there are dedicated fields for the job submit, start and end times, there is none for the output and error paths. However, the AdminComment field is used to carry that information. Since it is a long field, you may want to pass a length to the fieldname to avoid truncation:
sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S) -o jobid,jobname,state,submit,start,end,AdminComment%150
or you can also ask for a parsable output:
sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S) -o jobid,jobname,state,submit,start,end,AdminComment -p
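
The parsable format is convenient for further processing, because fields are separated by "|" and the first line carries the column names. The sketch below selects columns by header name with awk. The sample record is made up for illustration (real input would come from sacct with the -p option), and job_times is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: read pipe-delimited sacct output and pick columns by header name.
# The sample below is made up for illustration; real input would come from
#   sacct -X -p -o jobid,jobname,state,submit,start,end
sample='JobID|JobName|State|Submit|Start|End|
1001|sleepy.sh|CANCELLED|2023-10-25T15:40:30|2023-10-25T15:40:35|2023-10-25T15:40:51|'

# Print "<jobid> <start> <end>" for every data row.
job_times() {
    awk -F'|' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { print $col["JobID"], $col["Start"], $col["End"] }
    '
}

result=$(printf '%s\n' "$sample" | job_times)
echo "$result"
```

Looking columns up by name rather than by position keeps the script working if you later change the order of the fields passed to -o.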

Common pitfalls

Reference Documentation

HPC2020: Writing SLURM jobs

We will now attempt to troubleshoot some issues.

Create a new job script broken1.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
broken1.sh
```
#SBATCH --job-name = broken 1
#SBATCH --output = broken1-%J.out
#SBATCH --error = broken1-%J.out
#SBATCH --qos = express
#SBATCH --time = 00:05:00 

echo "I was broken!"
```
The job above has the following problems:
- There is no shebang at the beginning of the script.
- There should be no spaces around the "=" in the directives.
- There should be no space in the job name.
- QoS "express" does not exist
Here is an amended version:
broken1_fixed.sh
#!/bin/bash
#SBATCH --job-name=broken1
#SBATCH --output=broken1-%J.out
#SBATCH --error=broken1-%J.out
#SBATCH --time=00:05:00

echo "I was broken!"
Note that the QoS line was removed, but you may also use the following if running on ECS:
#SBATCH --qos=ef
or alternatively, if on Atos HPCF:
#SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
$ grep -v ECMWF-INFO $(ls -1 broken1-*.out | head -n1)
I was broken!
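
Two of the problems found in broken1.sh, the missing shebang and the spaces around "=", can be spotted mechanically before submitting. The following is a minimal sketch of such a check; check_job_script is a hypothetical helper rather than a Slurm tool, and it does not attempt to validate QoS names or any other directive values.

```shell
#!/bin/bash
# Sketch: flag a missing shebang and spaces around "=" in #SBATCH lines.
# check_job_script is a hypothetical helper, not part of Slurm.
check_job_script() {
    local script=$1 status=0
    # The first line must be a shebang such as #!/bin/bash.
    if ! head -n1 "$script" | grep -q '^#!'; then
        echo "missing shebang on first line"
        status=1
    fi
    # Directives must be written as --option=value, with no spaces around "=".
    if grep -Eq '^#SBATCH.*( =|= )' "$script"; then
        echo "spaces around = in #SBATCH directive"
        status=1
    fi
    return $status
}

# Try it on a two-line excerpt of the broken script.
tmp=$(mktemp)
printf '%s\n' '#SBATCH --job-name = broken 1' 'echo "I was broken!"' > "$tmp"
report=$(check_job_script "$tmp") || true
echo "$report"
rm -f "$tmp"
```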
Create a new job script broken2.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
broken2.sh
```
#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --qos=ns
#SBATCH --time=10-00

echo "I was broken!"
```
The job above has the following problems:
- QoS "ns" does not exist. Either remove the directive to use the default, or use the corresponding QoS on ECS (ef) or HPCF (nf).
- The time requested is 10 days, which is longer than the maximum allowed. It was probably meant to be 10 minutes.
Here is an amended version:
broken2.sh
#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --time=10:00

echo "I was broken!"
Again, note that the QoS line was removed, but you may also use the following if running on ECS:
#SBATCH --qos=ef
or alternatively, if on Atos HPCF:
#SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
$ grep -v ECMWF-INFO $(ls -1 broken2-*.out | head -n1)
I was broken!
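
The difference between 10-00 and 10:00 comes from how Slurm reads time specifications: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds. The sketch below converts such a string to seconds so you can sanity-check a value before using it. The function name slurm_time_to_seconds is invented for this example, and it does no validation of malformed input.

```shell
#!/bin/bash
# Sketch: convert a Slurm --time string to seconds.
# slurm_time_to_seconds is a hypothetical helper; input is not validated.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        has_days = index($0, "-") > 0
        days = 0; rest = $0
        if (has_days) { split($0, p, "-"); days = p[1]; rest = p[2] }
        n = split(rest, f, ":")
        if (has_days)    { h = f[1]; m = f[2]; s = f[3] }  # hours[:minutes[:seconds]]
        else if (n == 3) { h = f[1]; m = f[2]; s = f[3] }  # hours:minutes:seconds
        else if (n == 2) { h = 0;    m = f[1]; s = f[2] }  # minutes:seconds
        else             { h = 0;    m = f[1]; s = 0 }     # minutes
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

echo "10-00 is $(slurm_time_to_seconds 10-00) seconds"   # 10 days
echo "10:00 is $(slurm_time_to_seconds 10:00) seconds"   # 10 minutes
```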

Create a new job script broken3.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?

broken3.sh

#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=$SCRATCH
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out

echo "I was broken!"

The job above has the following problems:

Variables are not expanded in job directives. You must specify your paths explicitly.

The directory where the output and error files will go must exist beforehand. Otherwise the job will fail, and you will not get any hint as to what may have happened to it. The only hint would come from checking sacct:

$ sacct -X --name=broken3
JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
64281800              broken3        ef     FAILED     0:53   00:00:02        1              ad6-201

You will need to create the output directory with:

mkdir -p $SCRATCH/broken3output/

Here is an amended version of the job:

broken3.sh

#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=/scratch/<your_user_id>
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out

echo "I was broken!"

Check that the job actually ran and generated the expected output:

$ grep -v ECMWF-INFO  $(ls -1 $SCRATCH/broken3output/broken3-*.out | head -n1)
I was broken!

You may clean up the output directory with

rm -rf $SCRATCH/broken3output
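
Because a missing output directory makes the job fail silently, you may want to create it automatically before submitting. The sketch below reads the --chdir and --output directives from a job script and creates the directory they imply. It assumes both are written as explicit paths in the form shown in the amended broken3.sh; ensure_output_dir is a hypothetical helper, and a temporary directory stands in for your scratch space.

```shell
#!/bin/bash
# Sketch: create the directory named in --output before submitting.
# ensure_output_dir is a hypothetical helper; it assumes --chdir and --output
# appear as "#SBATCH --option=value" with explicit paths.
ensure_output_dir() {
    local script=$1 workdir output target
    workdir=$(sed -n 's/^#SBATCH --chdir=//p' "$script" | head -n1)
    output=$(sed -n 's/^#SBATCH --output=//p' "$script" | head -n1)
    # A relative --output path is resolved against the working directory.
    case "$output" in
        /*) target=$(dirname "$output") ;;
        *)  target=${workdir:-.}/$(dirname "$output") ;;
    esac
    mkdir -p "$target" && echo "$target"
}

demo=$(mktemp -d)   # stands in for your scratch directory
cat > "$demo/job.sh" <<EOF
#!/bin/bash
#SBATCH --chdir=$demo
#SBATCH --output=broken3output/broken3-%J.out
echo "I was broken!"
EOF
created=$(ensure_output_dir "$demo/job.sh")
exists=no
if [ -d "$created" ]; then exists=yes; fi
echo "$created exists: $exists"
rm -rf "$demo"
```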

Understanding your limits

Although most limits are described in HPC2020: Batch system, you can also check them (or reach them) for yourself in the system.

Create a new job script naughty.sh with the following contents:

naughty.sh

#!/bin/bash
#SBATCH --mem=100
#SBATCH --output=naughty.out
MEM=300
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"

Submit naughty.sh to the batch system and check its status. What happened to the job?

You can submit it with:

sbatch naughty.sh

You can then monitor the state of your job with squeue:

squeue -j <jobid>

After a few seconds of running, you may see the job finish and disappear. If we use sacct, we can see the job has failed with an exit code of 9, which indicates it was killed:

$ sacct -X --name naughty.sh                                                                                                                                                       
JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList 
------------ ---------------- --------- ---------- -------- ---------- -------- -------------------- 
64303470           naughty.sh        ef     FAILED      9:0   00:00:04        1              ac6-202

Inspecting the job output the reason becomes apparent:

$ grep -v ECMWF-INFO naughty.out  | head -n 22




          __  __ _____ __  __ _  _____ _     _         
    __/\_|  \/  | ____|  \/  | |/ /_ _| |   | |  __/\__
    \    / |\/| |  _| | |\/| | ' / | || |   | |  \    /
    /_  _\ |  | | |___| |  | | . \ | || |___| |__/_  _\
      \/ |_|  |_|_____|_|  |_|_|\_\___|_____|_____|\/  

              BEGIN OF ECMWF MEMKILL REPORT


[ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 10:49:40 2023]

[summary]
job/session: 64303470
requested/default memory limit for job/session: 100MiB
sum of active and inactive _anonymous memory of job/session: 301MiB
ACTION: about to issue: 'kill -SIGKILL' to pid: 3649110
to-be-killed process: "perl -e $a="A"x(300*1024*1024/2); sleep", with resident-segment-size: 304MiB

The job had a limit of 100 MiB, but it tried to use up to 300 MiB, so the system killed the process.

Edit naughty.sh to comment out the request for memory, and then play with the MEM value.

naughty.sh

#!/bin/bash
#SBATCH --output=naughty.out
##SBATCH --mem=100
MEM=300
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"

How high can you go with the default memory limit on the default queue before the system kills it?

With trial and error, you will see the system will kill your tasks that go over 8000 MiB:

naughty.sh

#!/bin/bash
#SBATCH --output=naughty.out
##SBATCH --mem=100
MEM=8000
perl -e "\$a='A'x($MEM*1024*1024/2);sleep 60"

Inspecting the job output will confirm that:

$ grep -v ECMWF-INFO naughty.out  | head -n 22




          __  __ _____ __  __ _  _____ _     _         
    __/\_|  \/  | ____|  \/  | |/ /_ _| |   | |  __/\__
    \    / |\/| |  _| | |\/| | ' / | || |   | |  \    /
    /_  _\ |  | | |___| |  | | . \ | || |___| |__/_  _\
      \/ |_|  |_|_____|_|  |_|_|\_\___|_____|_____|\/  

              BEGIN OF ECMWF MEMKILL REPORT


[ERROR__ECMWF_MEMORY_SUPERVISOR:JOB_OR_SESSION_OUT_OF_MEMORy,ac6-202.bullx:/usr/local/sbin/watch_cgroup:l454:Thu Oct 26 11:16:43 2023]

[summary]
job/session: 64304303
requested/default memory limit for job/session: 8000MiB
sum of active and inactive _anonymous memory of job/session: 8001MiB
ACTION: about to issue: 'kill -SIGKILL' to pid: 4016899
to-be-killed process: "perl -e $a='A'x(8000*1024*1024/2); sleep", with resident-segment-size: 8004MiB

How could you have checked this beforehand instead of taking the trial and error approach?
You could have checked HPC2020: Batch system, or you could also ask Slurm for this information. Default memory is defined per partition, so you can do:
scontrol show partition
The field we are looking for is DefMemPerNode:
$ scontrol -o show partition | tr " " "\n" | grep -i -e "DefMem" -e "PartitionName"
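
To see what the tr and grep pipeline does without access to the system, you can try it on a single record. The partition name and values below are placeholders invented for this sketch, not real settings. The -o option makes scontrol print each partition on one line, which is what lets tr split it into one key=value pair per line.

```shell
#!/bin/bash
# Sketch: the tr/grep pipeline applied to one made-up partition record.
# The partition name and values are placeholders, not real system settings.
sample='PartitionName=demo AllowGroups=ALL DefMemPerNode=1234 MaxMemPerNode=UNLIMITED'

# One key=value pair per line, then keep only the fields of interest.
fields=$(echo "$sample" | tr " " "\n" | grep -i -e "DefMem" -e "PartitionName")
echo "$fields"
```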
Can you check, without trial and error this time, what is the maximum wall clock time, maximum CPUs, and maximum memory you can request to Slurm for each QoS?
Again, you will find this information in HPC2020: Batch system, but you can also ask Slurm. These settings are part of the QoS setup, so the command is:
sacctmgr show qos
The fields we are looking for this time are MaxWall and MaxTRES:
sacctmgr -P show qos format=name,MaxWall,MaxTRES
If you run this on HPCF, you may notice there is no maximum limit set at the QoS level for the np parallel queue, so you are bound by the maximum memory available in the node.
You can also see other limits such as the local SSD tmpdir space.

How many jobs could you potentially have running concurrently? How many jobs could you have in the system (pending or running), before a further submission fails?
