...
Create a directory for this tutorial so all the exercises and outputs are contained inside:
No Format
mkdir ~/batch_tutorial
cd ~/batch_tutorial
Create and submit a job called simplest.sh with just default settings that runs the command hostname. Can you find the output and inspect it? Where did your job run?
Expand title Solution
Using your favourite editor, create a file called simplest.sh with the following content:
Code Block language bash title simplest.sh
#!/bin/bash
hostname
You can submit it with sbatch:
No Format sbatch simplest.sh
The job should run shortly. When finished, a new file called slurm-<jobid>.out should appear in the same directory. You can check the output with:
No Format
$ cat $(ls -r1 slurm-*.out | head -n1)
ab6-202.bullx
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
[ECMWF-INFO -ecepilog] This is the ECMWF job Epilogue
[ECMWF-INFO -ecepilog] +++ Please report issues using the Support portal +++
[ECMWF-INFO -ecepilog] +++ https://support.ecmwf.int +++
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
[ECMWF-INFO -ecepilog] Run at 2023-10-25T11:31:53 on ecs
[ECMWF-INFO -ecepilog] JobName : simplest.sh
[ECMWF-INFO -ecepilog] JobID : 64273363
[ECMWF-INFO -ecepilog] Submit : 2023-10-25T11:31:36
[ECMWF-INFO -ecepilog] Start : 2023-10-25T11:31:51
[ECMWF-INFO -ecepilog] End : 2023-10-25T11:31:53
[ECMWF-INFO -ecepilog] QueuedTime : 15.0
[ECMWF-INFO -ecepilog] ElapsedRaw : 2
[ECMWF-INFO -ecepilog] ExitCode : 0:0
[ECMWF-INFO -ecepilog] DerivedExitCode : 0:0
[ECMWF-INFO -ecepilog] State : COMPLETED
[ECMWF-INFO -ecepilog] Account : myaccount
[ECMWF-INFO -ecepilog] QOS : ef
[ECMWF-INFO -ecepilog] User : user
[ECMWF-INFO -ecepilog] StdOut : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
[ECMWF-INFO -ecepilog] StdErr : /etc/ecmwf/nfs/dh1_home_a/user/slurm-64273363.out
[ECMWF-INFO -ecepilog] NNodes : 1
[ECMWF-INFO -ecepilog] NCPUS : 2
[ECMWF-INFO -ecepilog] SBU : 0.011
[ECMWF-INFO -ecepilog] ----------------------------------------------------------------------------------------------------
You can then see that the script has run on a different node than the one you are on.
If you repeat the operation, you may get your job to run on a different node every time, whichever happens to be free at the time.
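If you would rather not pick the newest slurm-*.out file by listing the directory, the job ID printed by sbatch can be used to build the file name directly. The following is a minimal sketch: the helper names jobid_from_sbatch and default_output_name are made up for this illustration, and it assumes sbatch prints its usual "Submitted batch job <jobid>" message.

```bash
#!/bin/bash
# Hypothetical helpers, not part of Slurm or of the ECMWF tooling.

# Extract the job ID (the last word) from sbatch's
# "Submitted batch job <jobid>" message.
jobid_from_sbatch() {
    local message=$1
    echo "${message##* }"
}

# Name of the default output file for a given job ID.
default_output_name() {
    echo "slurm-$1.out"
}

# Only talk to the batch system if sbatch is available and the job script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f simplest.sh ]; then
    jobid=$(jobid_from_sbatch "$(sbatch simplest.sh)")
    echo "Output will appear in $(default_output_name "$jobid")"
fi
```

Once the job has finished, cat "$(default_output_name "$jobid")" shows the same content as the listing-based command.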
Configure your simplest.sh job to direct the output to simplest-<jobid>.out and the error to simplest-<jobid>.err, both in the same directory, and set the job name to just "simplest". Note you will need to use a special placeholder for the <jobid>.
Expand title Solution
Using your favourite editor, open the simplest.sh job script and add the relevant #SBATCH directives:
Code Block language bash title simplest.sh
#!/bin/bash
#SBATCH --job-name=simplest
#SBATCH --output=simplest-%j.out
#SBATCH --error=simplest-%j.err
hostname
You can submit it again with:
No Format sbatch simplest.sh
After a few moments, you should see the new files appear in your directory (the job ID will differ from the one displayed here):
No Format
$ ls simplest-*.*
simplest-64274497.err simplest-64274497.out
You can check that the job name was also changed in the end of job report:
No Format
$ grep -i jobname $(ls -r1 simplest-*.err | head -n1)
[ECMWF-INFO -ecepilog] JobName : simplest
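To make the role of the %j placeholder concrete, the sketch below mimics the substitution that the batch system performs on the file name pattern. The function expand_jobid_pattern is a hypothetical helper written for this illustration only. It handles just %j, whereas Slurm supports further placeholders.

```bash
#!/bin/bash
# Hypothetical helper: replace every %j in a file name pattern with a job ID,
# in the same way the batch system does for --output and --error.
expand_jobid_pattern() {
    local pattern=$1 jobid=$2
    printf '%s\n' "$pattern" | sed "s/%j/${jobid}/g"
}

expand_jobid_pattern "simplest-%j.out" 64274497   # prints simplest-64274497.out
expand_jobid_pattern "simplest-%j.err" 64274497   # prints simplest-64274497.err
```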
Basic job management
From a terminal session outside the Atos HPCF or ECS, such as your VDI or computer, submit the simplest.sh job remotely. What hostname should you use?
Expand title Solution
You must use hpc-batch for HPCF job submissions, or ecs-batch for ECS job submissions:
No Format ssh hpc-batch "cd ~/batch_tutorial; sbatch simplest.sh"
No Format ssh ecs-batch "cd ~/batch_tutorial; sbatch simplest.sh"
Note the change of directory, so that the job script is found and the job's working directory and outputs end up in the right place.
An alternative way of doing this without changing directory would be to tell sbatch to do it for you:
No Format ssh hpc-batch sbatch -D ~/batch_tutorial ~/batch_tutorial/simplest.sh
or for ECS:
No Format ssh ecs-batch sbatch -D ~/batch_tutorial ~/batch_tutorial/simplest.sh
Create a new job script sleepy.sh with the contents below:
Code Block language bash title sleepy.sh
#!/bin/bash
sleep 120
Submit sleepy.sh to the batch system and check its status. Once it is running, cancel it and inspect the output.
Expand title Solution
Using your favourite editor, create the sleepy.sh job script with the contents above. Then you can submit it with:
No Format sbatch sleepy.sh
You can then check the state of your job with squeue:
No Format squeue -j <jobid>
if you use the <jobid> of the job you just submitted, or just:
No Format squeue --me
to list all your jobs.
To cancel your job, just run scancel:
No Format scancel <jobid>
If you inspect the output file from your last job, you will see a message like the following:
No Format slurmstepd: error: *** JOB 64281137 ON ab6-202 CANCELLED AT 2023-10-25T15:40:51 ***
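Instead of re-running squeue by hand, you could poll it from a small loop. The sketch below assumes that squeue -h -j <jobid> -o %T prints only the job state; the function names is_active_state and wait_for_job, as well as the list of states treated as active, are choices made for this illustration.

```bash
#!/bin/bash
# Hypothetical helpers for polling a job; the names are made up for this sketch.

# Succeed if the given job state means the job is still queued or running.
is_active_state() {
    case $1 in
        PENDING|CONFIGURING|RUNNING|COMPLETING|SUSPENDED) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll squeue until the job is no longer active.
# Usage: wait_for_job <jobid> [interval_in_seconds]
wait_for_job() {
    local jobid=$1 interval=${2:-10} state
    while true; do
        # -h drops the header line, -o %T prints only the job state.
        state=$(squeue -h -j "$jobid" -o %T 2>/dev/null)
        is_active_state "$state" || break
        sleep "$interval"
    done
}
```

After wait_for_job returns, the job has left the queue, either because it finished or because it was cancelled, and its output file can be inspected.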
Can you get information about the jobs you have run so far today, including those that have finished already?
Expand title Solution
When jobs finish, they no longer appear in the squeue output. You can then check the accounting database with sacct:
No Format sacct
With no arguments, this command will show you the list of all jobs run by you on this day.
In the output you may see 3 or more entries per job, such as:
No Format
JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList
------------ ---------------- --------- ---------- -------- ---------- -------- --------------------
...
64281137            sleepy.sh        ef CANCELLED+      0:0   00:00:16        1              ab6-202
64281137.ba+            batch            CANCELLED     0:15   00:00:17        1              ab6-202
64281137.ex+           extern            COMPLETED      0:0   00:00:16        1              ab6-202
The first one corresponds to the job itself. The second one (always named batch) corresponds to the actual job script and the third (named extern) corresponds to the external step used to generate the end of job information. You may have more lines if your job contains more steps, which typically correspond to srun parallel executions.
If you want to list just the entry for the job itself, you can do:
No Format sacct -X
Can you get information about all the jobs you ran today that were cancelled?
Expand title Solution
You can filter jobs by state with the -s option. But if you run it naively:
No Format sacct -X -s CANCELLED
you will get no output. That is because when filtering by state you must also specify the start and end times of your query period. You can then do something like:
No Format sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S)
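If you run this kind of query often, the time window can be wrapped in a small function. This is only a sketch: today_start, now_stamp and jobs_today_in_state are hypothetical names, and the function simply assembles the same sacct options used above.

```bash
#!/bin/bash
# Hypothetical wrapper around the query shown above.

# Today's date, which sacct interprets as midnight at the start of the day.
today_start() { date +%Y-%m-%d; }

# Current date and time in the format accepted by -S and -E.
now_stamp() { date +%Y-%m-%dT%H:%M:%S; }

# List today's jobs in a given state (CANCELLED by default).
jobs_today_in_state() {
    local state=${1:-CANCELLED}
    sacct -X -s "$state" -S "$(today_start)" -E "$(now_stamp)"
}
```

For example, jobs_today_in_state FAILED would list the jobs that failed today instead of those that were cancelled.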
The default information shown on the screen when querying past jobs is limited. Can you extract the submit, start, and end times of your cancelled jobs today? What about their output and error path? Hint: use the corresponding man page for all the options.
Expand title Solution You can use the following command to see all the possible output fields you can query for:
No Format sacct -e
While there are dedicated fields for the job submit, start and end times, there is none for the output and error paths. However, the AdminComment field is used to carry that information. Since it is a long field, you may want to pass a length to the fieldname to avoid truncation:
No Format sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S) -o jobid,jobname,state,submit,start,end,AdminComment%150
or you can also ask for a parsable output:
No Format sacct -X -s CANCELLED -S $(date +%Y-%m-%d) -E $(date +%Y-%m-%dT%H:%M:%S) -o jobid,jobname,state,submit,start,end,AdminComment -p
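The parsable output is convenient for further processing because every field is delimited by "|". As a sketch of how one column could be extracted, the hypothetical helper parsable_column below looks up a column by its header name. The sample input is an illustrative placeholder in the same layout, not a real accounting record.

```bash
#!/bin/bash
# Hypothetical helper: print one named column from "sacct -p" style output
# read from standard input.
parsable_column() {
    awk -F'|' -v name="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { print $col }
    '
}

# Illustrative sample in the pipe-delimited layout produced by -p
# (placeholder values, not real accounting data).
sample='JobID|JobName|State|
64281137|sleepy.sh|CANCELLED|'

printf '%s\n' "$sample" | parsable_column State
```

With a real query you would pipe sacct into the helper, for example sacct -X -p -o jobid,jobname,state | parsable_column JobName.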
Common pitfalls
We will now attempt to troubleshoot some issues.
Create a new job script broken1.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
Code Block language bash title broken1.sh collapse true
#SBATCH --job-name = broken 1
#SBATCH --output = broken1-%J.out
#SBATCH --error = broken1-%J.out
#SBATCH --qos = express
#SBATCH --time = 00:05:00
echo "I was broken!"
Expand title Solution
The job above has the following problems:
- There is no shebang at the beginning of the script.
- There should be no spaces around the "=" in the directives.
- There should be no space in the job name.
- QoS "express" does not exist.
Here is an amended version:
Code Block language bash title broken1_fixed.sh
#!/bin/bash
#SBATCH --job-name=broken1
#SBATCH --output=broken1-%J.out
#SBATCH --error=broken1-%J.out
#SBATCH --time=00:05:00
echo "I was broken!"
Note that the QoS line was removed, but you may also use the following if running on ECS:
No Format #SBATCH --qos=ef
or alternatively, if on Atos HPCF:
No Format #SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
No Format
$ grep -v ECMWF-INFO $(ls -1 broken1-*.out | head -n1)
I was broken!
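Two of the problems listed above, the missing shebang and the spaces around "=", can be caught before submission with a simple check. The function check_job_script below is a hypothetical sketch that covers only those two mistakes. It does not validate QoS names or any other directive values.

```bash
#!/bin/bash
# Hypothetical pre-submission check; it only looks for a missing shebang
# and for spaces around "=" in #SBATCH directives.
check_job_script() {
    local script=$1 status=0
    # The first line must be a shebang such as #!/bin/bash.
    if ! head -n 1 "$script" | grep -q '^#!'; then
        echo "$script: missing shebang on the first line"
        status=1
    fi
    # Directives must be written as --option=value, with no spaces around "=".
    if grep -Eq '^#SBATCH.*( =|= )' "$script"; then
        echo "$script: spaces around '=' in an #SBATCH directive"
        status=1
    fi
    return $status
}
```

You could then run check_job_script broken1.sh && sbatch broken1.sh so that the job is only submitted when the check passes.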
Create a new job script broken2.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
Code Block language bash title broken2.sh collapse true
#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --qos=ns
#SBATCH --time=10-00
echo "I was broken!"
Expand title Solution
The job above has the following problems:
- QoS "ns" does not exist. Either remove it to use the default, or use the corresponding QoS on ECS (ef) or HPCF (nf).
- The time requested is 10 days, which is longer than the maximum allowed. It was probably meant to be 10 minutes.
Here is an amended version:
Code Block language bash title broken2.sh
#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --time=10:00
echo "I was broken!"
Again, note that the QoS line was removed, but you may also use the following if running on ECS:
No Format #SBATCH --qos=ef
or alternatively, if on Atos HPCF:
No Format #SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
No Format
$ grep -v ECMWF-INFO $(ls -1 broken2-*.out | head -n1)
I was broken!
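The difference between 10-00 and 10:00 is easy to miss, so it can help to convert a time specification into seconds before submitting. The sketch below assumes the time formats documented for sbatch (minutes, minutes:seconds, hours:minutes:seconds, and the days-hours variants); the function name slurm_time_to_seconds is made up for this illustration.

```bash
#!/bin/bash
# Hypothetical helper: convert a Slurm time limit specification into seconds.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0; spec = $0
        if (index(spec, "-") > 0) {
            # days-hours[:minutes[:seconds]]
            split(spec, d, "-"); days = d[1]; spec = d[2]
            n = split(spec, f, ":")
            h = f[1]; m = (n > 1) ? f[2] : 0; s = (n > 2) ? f[3] : 0
        } else {
            n = split(spec, f, ":")
            if (n == 1)      { h = 0; m = f[1]; s = 0 }        # minutes
            else if (n == 2) { h = 0; m = f[1]; s = f[2] }     # minutes:seconds
            else             { h = f[1]; m = f[2]; s = f[3] }  # hours:minutes:seconds
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 10-00   # 10 days:    prints 864000
slurm_time_to_seconds 10:00   # 10 minutes: prints 600
```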
Create a new job script broken3.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
Code Block language bash title broken3.sh collapse true
#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=$SCRATCH
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out
echo "I was broken!"
Expand title Solution
The job above has the following problems:
- Variables are not expanded in job directives. You must specify your paths explicitly.
- The directory where the output and error files will go must exist beforehand. Otherwise the job will fail, and you will not get any hint as to what happened to it. The only hint would come from checking sacct:
No Format
$ sacct -X --name=broken3
JobID                 JobName       QOS      State ExitCode    Elapsed   NNodes             NodeList
------------ ---------------- --------- ---------- -------- ---------- -------- --------------------
64281800              broken3        ef     FAILED     0:53   00:00:02        1              ad6-201
You will need to create the output directory with:
No Format mkdir -p $SCRATCH/broken3output/
Here is an amended version of the job:
Code Block language bash title broken3.sh
#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=/scratch/<your_user_id>
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out
echo "I was broken!"
Check that the job actually ran and generated the expected output:
No Format
$ grep -v ECMWF-INFO $(ls -1 $SCRATCH/broken3output/broken3-*.out | head -n1)
I was broken!
You may clean up the output directory with:
No Format rm -rf $SCRATCH/broken3output
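Because a job like broken3.sh fails without any visible message, it may be worth scanning a script for variables in its directives before submitting. The function directives_with_variables below is a hypothetical sketch that just lists the #SBATCH lines containing a "$" character. It does not check whether output directories exist.

```bash
#!/bin/bash
# Hypothetical check: print, with line numbers, the #SBATCH lines that contain
# a "$", since variables are not expanded in job directives.
directives_with_variables() {
    grep -n '^#SBATCH.*\$' "$1"
}
```

Running directives_with_variables broken3.sh on the original script would report the --chdir=$SCRATCH line, and it prints nothing for the amended version.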
Understanding your limits
Although most limits are described in HPC2020: Batch system, you can also check them (or reach them) for yourself in the system.
Create a new job script naughty.sh with a time limit of 1 minute:
Code Block language bash title naughty.sh
#!/bin/bash
#SBATCH --mem=100
perl -e '$a="A"x(300*1024*1024/2); sleep'
Submit naughty.sh to the batch system and check its status. Is it still running after one minute? Why?
Expand title Solution
You can submit it with:
No Format sbatch naughty.sh
You can then monitor the state of your job with watch and squeue:
No Format watch -n 10 squeue -j <jobid>
You can see that the job is not killed after one minute, and keeps going beyond that. The reason for that is in: