Reference Documentation
Let's explore how to use the Slurm Batch System on the ATOS HPCF or ECS.
Basic job submission
Access the default login node of the ATOS HPCF or ECS.
Create a directory for this tutorial so all the exercises and outputs are contained inside:
mkdir ~/batch_tutorial
cd ~/batch_tutorial
Create and submit a job called simplest.sh with just default settings that runs the command hostname. Can you find the output and inspect it? Where did your job run?
Configure your simplest.sh job to direct the output to simplest-<jobid>.out, the error to simplest-<jobid>.err, both in the same directory, and the job name to just "simplest". Note you will need to use a special placeholder for the -<jobid>.
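As a rough sketch of the second step, the snippet below writes a version of simplest.sh with the three directives and submits it. It assumes that the placeholder in question is Slurm's %j filename pattern, which expands to the job ID, and it guards the sbatch call so that nothing is submitted on a machine without Slurm. The file is written to the current directory.

```shell
# Write a minimal job script. The quoted 'EOF' keeps the shell from
# touching the contents; %j is expanded later by Slurm, not by bash.
cat > simplest.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=simplest
#SBATCH --output=simplest-%j.out
#SBATCH --error=simplest-%j.err
hostname
EOF

# Submit only if sbatch is available on this host.
if command -v sbatch > /dev/null 2>&1; then
    sbatch simplest.sh
else
    echo "sbatch not found: run this on a login node"
fi
```

After the job finishes, the two files named in the directives should appear in the directory from which the job was submitted.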
Basic job management
Create a new job script sleepy.sh with the contents below:
sleepy.sh

#!/bin/bash
sleep 120

Submit sleepy.sh to the batch system and check its status. Once it is running, cancel it and inspect the output.
Can you get information about the jobs you have run so far today, including those that have finished already?
Can you get information about all the jobs you ran today that were cancelled?
The default information shown on the screen when querying past jobs is limited. Can you extract the submit, start, and end times of your cancelled jobs today? What about their output and error path? Hint: use the corresponding man page for all the options.
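The questions above can be approached with squeue for jobs still in the queue and sacct for accounting records. The sketch below is one possible starting point, not a complete answer: the variable names today and fields are illustrative, and the output and error paths are left out because the field names for them depend on the Slurm version (sacct --helpformat lists what is available on your system).

```shell
# Start of the current day, in a format sacct accepts (YYYY-MM-DD).
today=$(date +%Y-%m-%d)

# Columns to print for past jobs; "sacct --helpformat" lists all fields.
fields="JobID,JobName,State,Submit,Start,End"

if command -v sacct > /dev/null 2>&1; then
    # Pending and running jobs of the current user.
    # (A single job can be cancelled with: scancel <jobid>)
    squeue -u "$USER"

    # All of today's jobs for the current user, finished ones included.
    # -X shows one line per job allocation and hides the job steps.
    sacct -X -S "$today"

    # Only today's cancelled jobs, with submit, start and end times.
    sacct -X -S "$today" --state=CANCELLED --format="$fields"
else
    echo "sacct not found: run this on a login node"
fi
```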
Common pitfalls
We will now attempt to troubleshoot some issues.
Create a new job script broken1.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
broken1.sh

#SBATCH --job-name = broken 1
#SBATCH --output = broken1-%J.out
#SBATCH --error = broken1-%J.out
#SBATCH --qos = express
#SBATCH --time = 00:05:00

# This is the job
echo "I was broken!"
sleep 30

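When a script is rejected at submission, a quick look at its header often helps. The function below is a hypothetical helper (check_job_script is not part of Slurm) that flags two common mistakes: a first line that is not a shebang, and whitespace around the "=" in #SBATCH directives. It is a simple pattern check under those assumptions and will not catch every problem.

```shell
# check_job_script FILE
# Hypothetical helper, not a Slurm command. Returns 0 if neither of the
# two checked mistakes is found, 1 otherwise.
check_job_script() {
    local file="$1"
    local status=0

    # The first line should name an interpreter, e.g. #!/bin/bash
    if ! head -n 1 "$file" | grep -q '^#!'; then
        echo "$file: first line is not a shebang (e.g. #!/bin/bash)"
        status=1
    fi

    # Directives are written as --option=value, with no spaces around "=".
    # grep -n prints the offending lines with their line numbers.
    if grep -nE '^#SBATCH.*([[:space:]]=|=[[:space:]])' "$file"; then
        echo "$file: spaces around '=' in the directives listed above"
        status=1
    fi

    return "$status"
}
```

It can be called as check_job_script broken1.sh before submitting; an exit status of 0 only means these two checks passed.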
Create a new job script broken2.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
broken2.sh

#!/bin/bash
#SBATCH --job-name=broken2
#SBATCH --output=broken2-%J.out
#SBATCH --error=broken2-%J.out
#SBATCH --qos=ns
#SBATCH --time=10-00

# This is the broken
echo "I was broken!"

Create a new job script broken3.sh with the contents below and try to submit the job. What happened? Can you fix the job and keep trying until it runs successfully?
broken3.sh

#!/bin/bash
#SBATCH --job-name=broken3
#SBATCH --chdir=$SCRATCH
#SBATCH --output=broken3output/broken3-%J.out
#SBATCH --error=broken3output/broken3-%J.out

# This is the job
echo "I was broken!"
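Two behaviours are worth keeping in mind for this last script: #SBATCH lines are comments as far as the shell is concerned, so variables inside them are not expanded, and Slurm does not create missing directories for output files. The sketch below shows one possible way around both, assuming that relative output paths are resolved against the job's working directory. The variable workdir is illustrative, and the fallback to the current directory is only there so the snippet also runs where SCRATCH is not defined.

```shell
# Resolve the working directory in the shell, where variables are expanded.
# Falls back to the current directory if SCRATCH is unset (assumption for
# illustration only).
workdir="${SCRATCH:-$PWD}"

# Create the output directory beforehand, since the job will not do it.
mkdir -p "$workdir/broken3output"

if command -v sbatch > /dev/null 2>&1 && [ -f broken3.sh ]; then
    # Options on the sbatch command line take precedence over the
    # #SBATCH directives inside the script.
    sbatch --chdir="$workdir" broken3.sh
else
    echo "sbatch or broken3.sh not found: nothing submitted"
fi
```

Replacing the variable in the directive with the literal path is an equally valid alternative to passing --chdir on the command line.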