Slurm is the batch system available. Any script can be submitted as a job with no changes, but you might want to see Writing SLURM jobs to customise it.
...
And cancel a job with
Note |
---|
Currently, the "scancel" command must be executed on a login node of the same cluster where the job is running. |
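As a minimal sketch of the cancellation step, the wrapper below validates a job ID before passing it to `scancel`. The function name `cancel_job` is hypothetical and not part of Slurm, and the check assumes plain numeric job IDs (array tasks such as `123_4` would need a looser pattern). Per the note above, it would have to run on a login node of the cluster where the job is running.

```shell
# cancel_job <jobid>: hypothetical convenience wrapper around scancel.
# Assumption: job IDs are plain integers (no array-task suffix).
cancel_job() {
    jobid="$1"
    case "$jobid" in
        ''|*[!0-9]*)
            echo "usage: cancel_job <numeric-jobid>" >&2
            return 2
            ;;
    esac
    # scancel signals or cancels the job identified by the given ID
    scancel "$jobid"
}
```

For example, `cancel_job 12345` would cancel job 12345, where the ID is the one shown by `squeue -u $USER`.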
See the Slurm documentation for more details on the different commands available to submit, query or cancel jobs.
...
These are the different QoS (or queues) available for standard users on the four complexes:
QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory |
---|---|---|---|---|---|---|---|
nf | fractional | serial and small parallel jobs. It is the default | Yes | - | 2 day average runtime + standard deviation / 2 days | 1 / 64 | 8 GB / 128 GB |
ni | interactive | serial and small parallel interactive jobs | Yes | 1 | 1 day 12 hours / 7 days | 1 / 32 | 8 GB / 32 GB |
np | parallel | parallel jobs requiring more than half a node | No | - | average runtime + standard deviation / 2 days | - | 240 GB / 240 GB per node (all usable memory in a node) |
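To illustrate how the QoS table maps onto a job, here is a sketch of a job script that requests the nf QoS with resources inside the limits listed above. The job name, resource values and output file pattern are illustrative assumptions, not site recommendations; only the standard `sbatch` options `--qos`, `--cpus-per-task`, `--mem`, `--time`, `--job-name` and `--output` are used.

```shell
#!/bin/bash
# Illustrative values only: they sit within the nf limits in the table
# (up to 64 CPUs, 128 GB of memory and 2 days of wall clock time).
#SBATCH --job-name=qos-demo
#SBATCH --qos=nf
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out

# Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is set;
# fall back to 1 so the script also runs outside a job.
NCPUS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${NCPUS} CPU(s) on $(uname -n)"
```

Switching to another QoS would only require changing the `--qos` value and keeping the other requests within that QoS's limits.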
GPU special Partition
On the AC complex there is also the ng queue that gives access to the special partition with GPU-enabled nodes. See HPC2020: GPU usage for AI and Machine Learning for all the details on how to make use of those special resources.
QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory per node |
---|---|---|---|---|---|---|---|
ng | GPU | serial and small parallel jobs. It is the default | Yes | - | average runtime + standard deviation / 2 days | 1 / - |
...
ECS
For those using ECS, these are the different QoS (or queues) available for standard users of this service:
QoS name | Type | Suitable for... | Shared nodes | Maximum jobs per user | Default / Max Wall Clock Limit | Default / Max CPUs | Default / Max Memory |
---|---|---|---|---|---|---|---|
ef | fractional | serial and small parallel jobs - ECGATE service | Yes | - | 2 day average job runtime + standard deviation / 2 days | 1 / 8 | 8 GB / 16 GB |
ei | interactive | serial and small parallel interactive jobs - ECGATE service | Yes | 1 | 1 day 12 hours / 7 days | 1 / 4 | 8 GB / 8 GB |
el | long | serial and small parallel interactive jobs - ECGATE service | Yes | - | 7 day average job runtime + standard deviation / 7 days | 1 / 8 | 8 GB / 16 GB |
et | Time-critical Option 1 | serial and small parallel Time-Critical jobs. Only usable through ECACCESS Time Critical Option-1 | Yes | - | 12 hours average job runtime + standard deviation / 12 hours | 1 / 8 | 8 GB / 16 GB |
Info: Time limit management |
---|
See HPC2020: Job Runtime Management for more information on how the default Wall Clock Time limit is calculated. |
Note: Limits are not set in stone |
---|
Different limits on the different QoSs may be introduced or changed as the system evolves. |
Tip |
---|
|
To see all the details of a particular QoS on the system, you can run, for example:
No Format |
---|
sacctmgr list qos names=nf |
|
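The full `sacctmgr` listing is wide, so a narrower query can be easier to read. The sketch below wraps the same command in a hypothetical helper, `qos_limits`, that selects a few columns; the chosen columns (`Name`, `MaxWall`, `MaxJobsPU`, `MaxTRESPU`) are an assumption about which limits are of interest, and which of them are populated depends on how the site configures each QoS.

```shell
# qos_limits <qos>: hypothetical helper that prints selected limits of one QoS.
qos_limits() {
    if [ -z "$1" ]; then
        echo "usage: qos_limits <qos-name>" >&2
        return 2
    fi
    # -P produces pipe-delimited output without a trailing delimiter,
    # which is easier to read or post-process than the padded default.
    sacctmgr -P show qos names="$1" format=Name,MaxWall,MaxJobsPU,MaxTRESPU
}
```

For instance, `qos_limits nf` would print a header line followed by one line for the nf QoS.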
Submitting jobs remotely
If you are submitting jobs from a different platform via ssh, please use the dedicated *-batch nodes instead of their *-login equivalents:
- For generic remote job submission on HPCF: hpc-batch or hpc2020-batch
- For remote job submission on a specific HPCF complex: <complex_name>-batch
- For remote job submission to the ECS virtual complex: ecs-batch
For example, to submit a job from a remote platform onto the Atos HPCF:
No Format |
---|
ssh hpc-batch "sbatch myjob.sh" |
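The command above expects myjob.sh to already exist on the remote side. As a variation, and only a sketch, the hypothetical helper below streams a local script over ssh instead, relying on `sbatch` reading the job script from standard input when no file name is given. The `--parsable` option makes `sbatch` print just the job ID so it can be captured by the caller.

```shell
# submit_remote <batch-host> <local-script>: hypothetical helper, not a site tool.
submit_remote() {
    host="$1"
    script="$2"
    if [ ! -r "$script" ]; then
        echo "cannot read job script: $script" >&2
        return 1
    fi
    # With no script argument, sbatch takes the job script from stdin,
    # so the local file does not need to be copied to the cluster first.
    ssh "$host" "sbatch --parsable" < "$script"
}
```

A possible use would be `jobid=$(submit_remote hpc-batch myjob.sh)`, after which the captured ID can be passed to the query or cancel commands.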
...