With four identical Atos complexes (also known as clusters) installed in our Data Centre in Bologna (see Atos HPCF: System overview), we are now able to provide a more reliable computing service at ECMWF, including for batch work. For example, during a system session on one complex, we will submit batch jobs to a different complex. This enhanced batch service may, however, require the use of some ECMWF-customised SLURM commands.
Submitting a job: sbatch
The default PATH includes /usr/local/bin, which contains an ECMWF local version of sbatch that can submit batch jobs to a different complex. For example, before a system session on complex AA, we may decide to submit HPCF batch jobs to another complex, say AB. This will happen transparently for all our users.
sbatch
If you use the native SLURM sbatch command in /usr/bin, you will not benefit from cross-complex job submission. For example, under cron the default PATH only contains /usr/bin, so you will only submit jobs to the complex your cron entry is on.
All SLURM sbatch options are available with the ECMWF customised sbatch command.
Job IDs are unique across all complexes, so there is no risk of duplicates.
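Because cron starts with a PATH that only contains /usr/bin, a script run from cron has to put /usr/local/bin first itself if it is to pick up the ECMWF sbatch. The following is a minimal sketch of one way to do that; the helper name prepend_local_bin is hypothetical and not part of any ECMWF tooling.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not an ECMWF command):
# put /usr/local/bin at the front of PATH unless it is already first.
prepend_local_bin() {
    case "$PATH" in
        /usr/local/bin|/usr/local/bin:*) ;;        # already first, nothing to do
        *) PATH="/usr/local/bin:$PATH" ;;
    esac
    export PATH
}

prepend_local_bin

# Show which sbatch would now be run; on the HPCF this should be
# the one in /usr/local/bin.
command -v sbatch || echo "sbatch not found in PATH" >&2
```

Calling the helper at the top of a cron-driven script, before any sbatch call, keeps the rest of the script unchanged.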
Monitoring a job: ecsqueue
The default SLURM command 'squeue' lists jobs on the current complex only. To list the jobs running on another complex, or on all complexes, use the 'ecsqueue' command.
$ ecsqueue --help
usage: ecsqueue [-u USER] [-h] [-o FORMAT] [-O FORMAT] [-q QOS] [-j JOBID] [-M CLUSTERS]

$ ecsqueue -u $USER    # will show all the jobs running for you on the 4 Atos complexes.
ecsqueue
ecsqueue is located in /usr/local/bin. You may need to adapt your PATH.
Only a limited set of SLURM squeue options is available with ecsqueue.
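As an illustration of the -u and -M options listed in the usage line, the sketch below wraps ecsqueue in a small function that lists your jobs on all complexes or on a single one. The function name my_jobs is hypothetical, and the cluster name "ab" in the usage comment is an assumption about how complex AB is spelled on the command line.

```shell
#!/bin/sh
# Hypothetical wrapper (an assumption, not an ECMWF command):
# list the current user's jobs, optionally restricted to one complex.
my_jobs() {
    cluster="${1:-}"
    if ! command -v ecsqueue >/dev/null 2>&1; then
        echo "ecsqueue not found; is /usr/local/bin in your PATH?" >&2
        return 1
    fi
    if [ -n "$cluster" ]; then
        ecsqueue -u "$USER" -M "$cluster"    # one complex only
    else
        ecsqueue -u "$USER"                  # all complexes
    fi
}

# Usage:
#   my_jobs       # jobs on all complexes
#   my_jobs ab    # jobs on one complex ("ab" is an assumed cluster name)
```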
Deleting a job: ecscancel
The default SLURM command 'scancel' deletes a job on the current complex only. To delete a job running on another complex, use the ecscancel command:
$ ecscancel --help
usage: ecscancel [-h] [-u USER] [-t STATE] [-f] [-b] [-i] [-q QOS]
                 [-n JOBNAME] [-s SIGNAL] [-M CLUSTERS]
                 [jobid [jobid ...]]

positional arguments:
  jobid                 list of jobids

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  scancel for particular user
  -t STATE, --state STATE
                        scancel for particular state
  -f, --full            scancel full
  -b, --batch           scancel batch step
  -i, --interactive     scancel interactive
  -q QOS, --qos QOS     scancel qos
  -n JOBNAME, --jobname JOBNAME
                        scancel jobname
  -s SIGNAL, --signal SIGNAL
                        scancel with a signal
  -M CLUSTERS, --clusters CLUSTERS
                        scancel for particular cluster, or comma separated
                        list of clusters

$ ecscancel <jobid>    # will cancel job <jobid> on one of the four complexes.
ecscancel
ecscancel is located in /usr/local/bin. You may need to adapt your PATH.
Only a limited set of SLURM scancel options is available with ecscancel.
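The -u, -n and -M options shown in the help text can be combined to cancel jobs by name rather than by job ID. The sketch below assumes these filters combine in the same way as they do for the native scancel; the function name cancel_named, the job name and the cluster name are all hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper (an assumption, not an ECMWF command):
# cancel the current user's jobs with a given job name,
# optionally restricted to one complex.
cancel_named() {
    name="$1"
    cluster="${2:-}"
    if [ -n "$cluster" ]; then
        ecscancel -u "$USER" -n "$name" -M "$cluster"
    else
        ecscancel -u "$USER" -n "$name"
    fi
}

# Usage:
#   cancel_named myjob       # on whichever complex the jobs are
#   cancel_named myjob aa    # on one complex ("aa" is an assumed cluster name)
```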