
This is the User guide for the Atos Sequana XH2000 HPCF, installed in ECMWF's data centre in Bologna.

Below you will find some basic information on the different parts of the system. Please click on the headers or links to get all the details for the given topic.

News Feed

2023-11-22 Change of default versions of ECMWF software packages

When?

The changes will take place on Wednesday 22 November 2023 09:00 UTC

Do I need to do anything?

2023-05-31 Change of default versions of ECMWF and third-party software packages

When?

The changes will take place on Wednesday 31 May 2023 09:00 UTC

Do I need to do anything?

2023-03-27 Scratch automatic purge enabled

The automatic purge of unused files in SCRATCH is now enforced. Any file that has not been accessed in the previous 30 days will be deleted automatically. The purge runs regularly to keep usage of this filesystem within optimal parameters.

SCRATCH is designed to hold large temporary files and to act as the main storage and working filesystem for the input and output files of your jobs and experiments, but not to keep data for the long term.
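To see which files might be affected by the purge, a check along the following lines may help. This is only a sketch: the helper name `list_stale_files` is hypothetical, and it assumes that the access times reported by `find` reflect what the purge inspects, which this page does not confirm.

```shell
#!/bin/bash
# List regular files under a directory whose last access time is older
# than a given number of days (default 30, matching the purge window).
# Assumption: the filesystem records access times and the purge uses them;
# the real purge may apply different criteria.
list_stale_files() {
    local root="$1"
    local days="${2:-30}"
    # -atime +N matches files last accessed more than N whole days ago.
    find "$root" -type f -atime +"$days" -print
}
```

For example, `list_stale_files "$SCRATCH" 30` would list candidate files, assuming the `SCRATCH` environment variable points to your SCRATCH directory.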

2023-01-18 Improving the time and memory limit management for batch jobs

Explicit time limit honoured

ECMWF will kill jobs that reach their wall time limit if one was provided with the job, either through #SBATCH --time or the command line option --time.

Alternatively, ECMWF accepts jobs without #SBATCH --time or the command line option --time. In that case ECMWF will instead use the average runtime of previous "similar" jobs, identified by a job tag generated from the user, job name, job geometry and job output path.
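As an illustration, a job script that sets an explicit wall time limit could look like the sketch below. The job name, output file name and the 10-minute limit are placeholders chosen for this example, not recommended values.

```shell
#!/bin/bash
# Hypothetical job name, used only for this example.
#SBATCH --job-name=walltime-demo
# Explicit wall time limit in HH:MM:SS; the job is killed once it is reached.
#SBATCH --time=00:10:00
# %j in the output file name expands to the job ID.
#SBATCH --output=walltime-demo.%j.out

# Placeholder workload; replace with the real commands of your job.
job_id="${SLURM_JOB_ID:-not-under-slurm}"
echo "Job ${job_id} running on $(hostname)"
```

The same limit can be given at submission time instead, for example `sbatch --time=00:10:00 job.sh`, where `job.sh` is a placeholder for your script.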

2022-12-07 Important change in the new Slurm on the Atos HPC

Slurm on the Atos AD complex was updated to version 22.05. AD has since been the default cluster, with hpc-login and hpc-batch being aliases for nodes on the AD complex.

The same Slurm version, 22.05, has also been installed on the AA and AB complexes and will be installed on the AC complex.

2022-11-30 Unavailability of AA Atos cluster due to system update

At 08 UTC, AA, the default Atos cluster, will become unavailable for essential Slurm and security updates. In preparation for this session:

  • The default Atos login/batch cluster has been changed to AD at 09 UTC.
  • Batch work on AA will be drained, and jobs scheduled to finish after 06 UTC will be automatically redirected to other complexes.

2022-08-10 SSH host keys fixed on all nodes and Cron service

On 2022-08-10 we set up the same SSH host key across the 4 complexes, using the key of the ac-login node as the master. This change addresses the SSH errors about changed host keys that appeared when connecting to a given host after an update, such as:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ECDSA host key for hpc-login has changed,
and the key for the corresponding IP address 10.100.192.100
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:QdNPyN2jAR5m7ngLbtIUjc2JgzknvFP2flMOGbd1i5k.
Please contact your system administrator.
Add correct host key in /home/user/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/user/.ssh/known_hosts:4
ECDSA host key for hpc-login has changed and you have requested strict checking.
Host key verification failed.
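If your known_hosts file still holds an outdated entry from before this change, one way to clear it is with `ssh-keygen -R`. The sketch below wraps that command in a hypothetical helper, `forget_host`; only remove an entry when you are confident the key change is legitimate.

```shell
#!/bin/bash
# Remove all keys recorded for a host from a known_hosts file.
# forget_host is a hypothetical helper name; ssh-keygen -R does the work
# and keeps a backup of the original file with an .old suffix.
forget_host() {
    local host="$1"
    local known_hosts="${2:-$HOME/.ssh/known_hosts}"
    ssh-keygen -R "$host" -f "$known_hosts"
}
```

For example, `forget_host hpc-login` would drop the stale entry, and the next connection would prompt you to accept the current host key.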

2022-07-20 Atos ecgate cluster in Bologna available to all ECMWF Member State ecgate users

We are pleased to announce the availability of the general-purpose Atos computing service, named 'ecs', in Bologna, which will replace the ecgate service in Reading.

We invite you to start testing on 'ecs' in Bologna the activities you currently run on ecgate in Reading. To help you with this work, we have made available the Atos HPCF and ecgate Documentation. We strongly encourage you to read those pages carefully before you start your tests on 'ecs'. Interactive login access to the systems in Bologna will no longer be through ECaccess, but through Teleport.

2022-06-22 AA unavailability and new generic Atos HPCF host names

On 22 June 2022 at 10 UTC the AA complex became unavailable for essential hardware and software upgrades. In preparation for it, the complex was drained and jobs due to finish after 22 June 2022 at 10 UTC did not run. Those jobs still running on the system at the time would have been killed. You may continue to use the AB complex for your interactive and batch workloads.  

2022-03-01 Switch PERM to its final NFS location

In the early days of the Atos HPCF your PERM lived temporarily in Lustre while we waited for the storage infrastructure to be completed. On Tuesday 1 March 2022 at 09 UTC, during a 2-hour system session, we switched to the new permanent location based on TrueNAS NFS, with greater capacity.

This relocation should have been transparent to you, so you don't need to do anything. In the days before the session all your data was copied in the background from Lustre to the new PERM, with the final sync happening during the system session to ensure a clean switchover. Your new PERM is now automounted when you access a node, either interactively or via Slurm.
