Slurm is the batch system available. Any script can be submitted as a job with no changes, but you might want to see Writing SLURM jobs to customise it.
...
And cancel a job with
Note |
---|
Currently, the "scancel" command must be run on a login node of the same cluster where the job is running. |
See the Slurm documentation for more details on the different commands available to submit, query or cancel jobs.
...
QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory |
---|---|---|---|---|---|---|---|
nf | fractional | serial and small parallel jobs. It is the default | Yes | - | 2 day average runtime + standard deviation / 2 days | 1 / 64 | 8 GB / 128 GB |
ni | interactive | serial and small parallel interactive jobs | Yes | 1 | 1 day 12 hours / 7 days | 1 / 32 | 8 GB / 32 GB |
np | parallel | parallel jobs requiring more than half a node | No | - | average runtime + standard deviation / 2 days | - | 240 GB / 240 GB per node (all usable memory in a node) |
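As an illustration of how the limits in the table translate into a job, the following is a minimal sketch of a job script that selects the nf QoS explicitly and requests resources inside the limits listed for it. The job name and the requested values are hypothetical; only the QoS name and the maximums come from the table.

```shell
#!/bin/bash
#SBATCH --job-name=qos-demo        # hypothetical job name
#SBATCH --qos=nf                   # fractional QoS (the default, stated explicitly here)
#SBATCH --cpus-per-task=4          # within the 64-CPU maximum listed for nf
#SBATCH --mem=16G                  # within the 128 GB maximum listed for nf
#SBATCH --time=01:00:00            # explicit wall clock limit, below the 2-day maximum

# Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is set;
# fall back to 1 so the script also runs outside a job.
ncpus=${SLURM_CPUS_PER_TASK:-1}
echo "Running with ${ncpus} CPU(s)"
```

Setting `--time` explicitly means the job does not rely on the default wall clock limit of the QoS.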
GPU special Partition
On the AC complex there is also the ng queue, which gives access to the special partition with GPU-enabled nodes. See HPC2020: GPU usage for AI and Machine Learning for all the details on how to make use of those special resources.
QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory per node |
---|---|---|---|---|---|---|---|
ng | GPU | serial and small parallel jobs. It is the default | Yes | - | average runtime + standard deviation / 2 days | 1 / - |
...
ECS
For those using ECS, these are the different QoS (or queues) available for standard users of this service:
QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory |
---|---|---|---|---|---|---|---|
ef | fractional | serial and small parallel jobs - ECGATE service | Yes | - | 12 hours average job runtime + standard deviation / 2 days | 1 / 8 | 8 GB / 16 GB |
ei | interactive | serial and small parallel interactive jobs - ECGATE service | Yes | 1 | 12 hours / 7 days | 1 / 4 | 8 GB / 8 GB |
el | long | serial and small parallel interactive jobs - ECGATE service | Yes | - | 12 hours average job runtime + standard deviation / 7 days | 1 / 8 | 8 GB / 16 GB |
et | Time-critical Option 1 | serial and small parallel Time-Critical jobs. Only usable through ECACCESS Time Critical Option-1 | Yes | - | 12 hours average job runtime + standard deviation / 12 hours | 1 / 8 | 8 GB / 16 GB |
Info |
---|
title | Time limit management |
---|
|
See HPC2020: Job Runtime Management for more information on how the default Wall Clock Time limit is calculated. |
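To make the "average runtime + standard deviation" entries in the tables more concrete, here is a minimal arithmetic sketch. It is not the system's actual procedure: which earlier jobs are taken into account and which estimator is used are described on the linked page, so the sample runtimes and the use of the population standard deviation below are assumptions made purely for illustration.

```shell
#!/bin/sh
# Hypothetical runtimes (in seconds) of earlier, similar jobs.
runtimes="100 200 300"

# Mean plus population standard deviation, truncated to whole seconds.
# (Assumption: the real calculation may use a different estimator.)
default_limit=$(printf '%s\n' $runtimes | awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
        mean = sum / n
        sd = sqrt(sumsq / n - mean * mean)
        printf "%d", mean + sd
    }')

echo "Estimated default wall clock limit: ${default_limit} s"
```

With these sample values the mean is 200 s and the standard deviation is about 81.6 s, so the sketch prints an estimate of 281 s.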
Note |
---|
title | Limits are not set in stone |
---|
|
Different limits on the different QoSs may be introduced or changed as the system evolves. |
Tip |
---|
|
If you want to get all the details of a particular QoS on the system, you may run, for example:
No Format |
---|
sacctmgr list qos names=nf |
|
Submitting jobs remotely
If you are submitting jobs from a different platform via ssh, please use the *-batch dedicated nodes instead of the *-login equivalents:
- For generic remote job submission on HPCF: hpc-batch or hpc2020-batch
- For remote job submission on a specific HPCF complex: <complex_name>-batch
- For remote job submission to the ECS virtual complex: ecs-batch
For example, to submit a job from a remote platform onto the Atos HPCF:
No Format |
---|
ssh hpc-batch "sbatch myjob.sh" |
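The choice of host described in the list above can be wrapped in a small helper. This is only a sketch: the function names `batch_host` and `remote_submit` and the keywords `hpcf` and `ecs` are hypothetical, and any other argument is assumed to be a complex name that can be used directly as the `<complex_name>` part of the host name.

```shell
#!/bin/sh
# Map a target keyword to its dedicated *-batch node.
batch_host() {
    case "$1" in
        hpcf) echo "hpc-batch" ;;   # generic HPCF submission
        ecs)  echo "ecs-batch" ;;   # ECS virtual complex
        *)    echo "$1-batch" ;;    # assumed to be a specific complex name
    esac
}

# Submit a job script that already exists on the remote side.
remote_submit() {
    target=$1
    script=$2
    ssh "$(batch_host "$target")" "sbatch $script"
}
```

For instance, `remote_submit hpcf myjob.sh` would run the same command as the example above, while `remote_submit ecs myjob.sh` would go through ecs-batch instead.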
...