...
Create a new job script broken1.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
Code Block language bash title broken1.sh collapse true
#SBATCH --job-name = broken 1
#SBATCH --output = broken1-%J.out
#SBATCH --error = broken1-%J.out
#SBATCH --qos = express
#SBATCH --time = 00:05:00

echo "I was broken!"
Expand title Solution The job above has the following problems:
- There is no shebang at the beginning of the script.
- There should be no spaces around the "=" in the directives.
- There should be no space in the job name ("broken 1").
- QoS "express" does not exist.
Here is an amended version following best practices for the jobs:
Code Block language bash title broken1_fixed.sh
#!/bin/bash
#SBATCH --job-name=broken1
#SBATCH --output=broken1-%J.out
#SBATCH --error=broken1-%J.out
#SBATCH --time=00:05:00

echo "I was broken!"
Note that the QoS line was removed, but you may also use the following if running on ECS:
No Format #SBATCH --qos=ef
or alternatively, if on the Atos HPCF:
No Format #SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
No Format
$ grep -v ECMWF-INFO $(ls -1 broken1-*.out | tail -n1)
I was broken!
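Two of the problems in this exercise (the missing shebang and the spaces around "=") are purely syntactic, so they can be spotted before submission. The following is a minimal sketch of such a check. The function name `lint_job_script` is hypothetical and not part of Slurm, and sbatch itself remains the authority on what is accepted.

```bash
#!/bin/bash
# Hypothetical helper (not a Slurm tool): flag a missing shebang and
# spaces around "=" in #SBATCH directives before calling sbatch.
lint_job_script() {
    local script=$1 status=0 first_line=""
    # The first line must be a shebang such as #!/bin/bash
    IFS= read -r first_line < "$script" || true
    if [[ $first_line != '#!'* ]]; then
        echo "missing shebang on first line"
        status=1
    fi
    # Directives must look like --option=value, with no spaces around "="
    if grep -Eq '^#SBATCH[[:space:]]+--[[:alnum:]-]+([[:space:]]+=|=[[:space:]])' "$script"; then
        echo "space around '=' in an #SBATCH directive"
        status=1
    fi
    return $status
}
```

Running `lint_job_script broken1.sh` on the original script should report both problems, while the amended version should pass silently. The check does not know which QoS names exist on a given system.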
Create a new job script broken2.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
Code Block language bash title broken2.sh collapse true
#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --qos=ns
#SBATCH --time=10-00

echo "I was broken!"
Expand title Solution The job above has the following problems:
- QoS "ns" does not exist. Either remove the line to use the default, or use the corresponding QoS on ECS (ef) or HPCF (nf).
- The time requested is 10 days, which is longer than the maximum allowed. It was probably meant to be 10 minutes: Slurm reads "10-00" as days-hours, whereas "10:00" is minutes:seconds.
Here is an amended version:
Code Block language bash title broken2.sh
#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --time=10:00

echo "I was broken!"
Again, note that the QoS line was removed, but you may also use the following if running on ECS:
No Format #SBATCH --qos=ef
or alternatively, if on the Atos HPCF:
No Format #SBATCH --qos=nf
Check that the job actually ran and generated the expected output:
No Format
$ grep -v ECMWF-INFO $(ls -1 broken2-*.out | tail -n1)
I was broken!
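The difference between "10-00" and "10:00" is easy to miss, so it can help to convert a time limit to seconds before using it. The sketch below assumes the time formats documented for sbatch (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds). The function name `slurm_time_to_seconds` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: convert a Slurm time limit string to seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0 rest
    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        rest=${spec#*-}
        IFS=: read -r hours minutes seconds <<< "$rest"
    else
        # minutes | minutes:seconds | hours:minutes:seconds
        local -a parts
        IFS=: read -r -a parts <<< "$spec"
        case ${#parts[@]} in
            1) minutes=${parts[0]} ;;
            2) minutes=${parts[0]}; seconds=${parts[1]} ;;
            3) hours=${parts[0]}; minutes=${parts[1]}; seconds=${parts[2]} ;;
            *) echo "unrecognised time format: $spec" >&2; return 1 ;;
        esac
    fi
    # 10# forces base 10 so that values like "08" are not read as octal
    echo $(( (10#$days * 24 + 10#${hours:-0}) * 3600 + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}

slurm_time_to_seconds 10-00   # the broken request: 10 days
slurm_time_to_seconds 10:00   # the amended request: 10 minutes
```

The first call prints 864000 and the second prints 600, which makes the factor between the two requests obvious.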
Create a new job script broken3.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
Code Block language bash title broken3.sh collapse true
#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=$SCRATCH
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out

echo "I was broken!"
Expand title Solution The job above has the following problems:
- Variables are not expanded in job directives. You must specify your paths explicitly.
- The directory where the output and error files will go must exist beforehand. Otherwise the job will fail without any output file to tell you what happened. The only hint comes from checking sacct:
No Format
$ sacct -X --name=broken3
JobID        JobName          QOS       State      ExitCode Elapsed    NNodes   NodeList
------------ ---------------- --------- ---------- -------- ---------- -------- --------------------
64281800     broken3          ef        FAILED     0:53     00:00:02   1        ad6-201
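Because a job like this fails without leaving any output file, it can be useful to scan the accounting records for jobs that did not complete. The sketch below is an illustration only: the function name `report_not_completed` is hypothetical, and it assumes the input comes from `sacct -X --noheader --parsable2 --format=JobID,JobName,State,ExitCode`, which separates fields with "|".

```bash
#!/bin/bash
# Hypothetical helper: read "JobID|JobName|State|ExitCode" lines, assumed to come from
#   sacct -X --noheader --parsable2 --format=JobID,JobName,State,ExitCode
# and report every job whose state is not COMPLETED.
report_not_completed() {
    local jobid name state exitcode found=0
    while IFS='|' read -r jobid name state exitcode; do
        [[ -z $jobid || $state == COMPLETED ]] && continue
        echo "job $jobid ($name) ended in state $state with exit code $exitcode"
        found=1
    done
    return $found
}

# The job from the sacct listing above, rewritten by hand in the "|" layout
report_not_completed <<< '64281800|broken3|FAILED|0:53' || true
```

Note that jobs that are still pending or running are also reported, since the only test is that the state differs from COMPLETED.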
You will need to create the output directory with:
No Format mkdir -p $SCRATCH/broken3output/
Here is an amended version of the job:
Code Block language bash title broken3.sh
#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=/scratch/<your_user_id>
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out

echo "I was broken!"
Check that the job actually ran and generated the expected output:
No Format
$ grep -v ECMWF-INFO $(ls -1 $SCRATCH/broken3output/broken3-*.out | tail -n1)
I was broken!
You may clean up the output directory with
No Format rm -rf $SCRATCH/broken3output
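An alternative to hard-coding paths in the directives is to pass them on the sbatch command line, where the submitting shell expands `$SCRATCH` before sbatch sees it. The sketch below relies on command-line options taking precedence over the `#SBATCH` lines in the script. The function name `submit_broken3` and the `DRY_RUN` switch are hypothetical and exist only for this illustration.

```bash
#!/bin/bash
# Sketch: let the submitting shell expand $SCRATCH, since sbatch does not
# expand variables inside #SBATCH lines. DRY_RUN=1 (a hypothetical switch of
# this sketch) prints the command instead of running it.
submit_broken3() {
    local workdir=${SCRATCH:?SCRATCH is not set}
    local outdir=$workdir/broken3output
    # The output directory must exist before the job starts
    mkdir -p "$outdir"
    local cmd=(sbatch --chdir="$workdir"
               --output="$outdir/broken3-%J.out"
               --error="$outdir/broken3-%J.out"
               broken3.sh)
    if [[ ${DRY_RUN:-0} == 1 ]]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}
```

Calling `DRY_RUN=1 submit_broken3` creates the directory and shows the fully expanded command, which is a quick way to confirm the paths before submitting for real.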
Create a new job script broken4.sh with the contents below and try to submit the job. You should not see the message in the output. What happened? Can you fix the job and keep trying until it runs successfully?
Code Block language bash title broken4.sh collapse true
#!/bin/bash
#SBATCH --job-name=broken4
#SBATCH --output=broken4-%J.out

ls $FOO/bar
echo "I should not be here"
Expand title Solution The job above has the following problems:
- The FOO variable is undefined when used. Undefined variables often lead to unexpected failures that are not always easy to spot.
- The ls command fails (as it would even if FOO were defined as ""), but the job keeps running and eventually appears to finish successfully from Slurm's point of view. It should have failed and been interrupted at the first error.
Here is an amended version of the job following best practices:
Code Block language bash title broken4.sh
#!/bin/bash
#SBATCH --job-name=broken4
#SBATCH --output=broken4-%J.out

set -x # echo script lines as they are executed
set -e # stop the shell on first error
set -u # fail when using an undefined variable
set -o pipefail # If any command in a pipeline fails, that return code will be used as the return code of the whole pipeline

ls $FOO/bar
echo "I should not be here"
With the extra shell options, we get additional information in the output about the commands being executed, and we ensure that the job stops at the first error (non-zero exit code) or when an undefined variable is used.
Info title Best practices Even if most examples in this tutorial do not have the extra shell options for simplicity, you should always include those in your production jobs.
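The effect of these options can be tried locally without submitting a job. The following sketch runs the two commands from broken4.sh in a child shell, first without and then with `-e` and `-u`, and does the same for a failing pipeline with and without `pipefail`. The helper `run_and_report` is a hypothetical name, and the sketch assumes that `/bar` does not exist on your system.

```bash
#!/bin/bash
# The two commands from broken4.sh, run in a child shell with FOO unset.
snippet='ls $FOO/bar; echo "I should not be here"'

# Hypothetical helper: run a command with FOO unset and print its exit code.
run_and_report() {
    local label=$1 rc=0
    shift
    ( unset FOO; "$@" ) 2>/dev/null || rc=$?
    echo "$label exit code: $rc"
}

# Without the options, ls fails but the echo still runs and the exit code is 0.
run_and_report "lenient" bash -c "$snippet"

# With -e and -u the shell stops at the undefined variable with a non-zero code.
run_and_report "strict" bash -euc "$snippet"

# A pipeline normally returns the status of its last command only.
run_and_report "pipefail off" bash -c 'false | true'
run_and_report "pipefail on" bash -o pipefail -c 'false | true'
```

In the lenient run the message "I should not be here" is printed and the exit code is 0, which is exactly how the original job appeared successful to Slurm. In the strict run the message is absent and the exit code is non-zero.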
Understanding your limits
...