# Table of contents
1. [RAMSES Specifications](#ramses-specs)
2. [Get access to the RAMSES cluster](#ramses-application)
3. [generate SSH Keypair](#ssh-gen)
4. [set up Cisco Duo](#2fa)
5. [Login](#login)
6. [Data transfer](#data-transfer)
7. [Filesystem](#filesystem)
8. [Submitting Jobs](#submitting)
9. [Backup options](#backup)
10. [Environment modules](#env-modules)
<br>
<br>
## RAMSES Specifications<a name="ramses-specs"></a>
(**R**esearch **A**ccelerator for **M**odeling and **S**imulation with **E**nhanced **S**ecurity)
- 164 compute nodes, 10 Kubernetes nodes
- 348 CPUs = 31576 Cores
- Accelerators
  - 40 NVIDIA Hopper H100 GPUs
  - 32 NVIDIA A30 GPUs
  - 2 AMD Instinct GPUs
  - 2 NEC Vector Engines
- Performance
  - CPU performance: 1.7 PFLOP/s
  - GPGPU performance: 3.1 PFLOP/s
  - total performance: 4.8 PFLOP/s
- Main memory
  - 167 TB
- Storage
  - 15 PB HDD storage
  - 940 TB NVMe SSD storage
- High-speed interconnect
  - HDR100 InfiniBand
<br>
<br>
## Get access<a name="ramses-application"></a>
To gain access to RAMSES you need to fulfill three requirements (in any order):
- apply for a project
- secure the connection with ssh keys
- setup a second authentication factor (2FA)
Apply for a **user account**:
- [Application form for ITCC projects](https://hpc-access.itcc.uni-koeln.de/jards/WEB/application/login.php?appkind=itcc)
New users can apply for a trial account with limited core/GPU hours without a project description. Applications for a full account need a project description to be reviewed. For up to 15 million core hours per project, a technical review (reasonable usage of resources) is sufficient; beyond that, a scientific review (importance of the research) becomes necessary.
### 2FA<a name="2fa"></a>
For security reasons, you can't log in with just a username/password. We use **Two-Factor Authentication** (2FA/MFA), meaning you need to prove your identity with two different (as in different systems/locations) 'factors':
- The first factor is an SSH public key. Please send your SSH *public* key to the HPC team ([general information on public key authentication](https://www.ssh.com/academy/ssh/public-key-authentication)).
- As the second factor we use Cisco Duo. To use it, you will need to enroll your account, see [cisco-duo-setup.pdf](/rpabel2/itcc-hpc-ramses/-/wikis/uploads/cd518a29f4362a9383c7345a975ed065/cisco-duo-setup.pdf).

If you own a [Yubikey](https://en.wikipedia.org/wiki/YubiKey) hardware token, you can also use it (in OTP mode) as the second authentication factor instead of Cisco Duo. If you are interested in using a Yubikey, please contact the [HPC-Team](mailto:hpc-mgr@uni-koeln.de).
Please note: we can't provide Yubikeys to users, but at about 50€ it could be a worthwhile investment.

After you have successfully enrolled in Duo and prepared your SSH key, please send your public key to the HPC team at [hpc-mgr@uni-koeln.de](mailto:hpc-mgr@uni-koeln.de).
### Generate SSH keys<a name="ssh-gen"></a>
Here is a quick intro to SSH keys: a key pair always consists of a private key (as in **private - don't share, don't give away**) and a public key. The public key (*.pub) is put into the file `~/.ssh/authorized_keys` on RAMSES. When you have the matching private key, this makes the login authentication work. Do not give away the private key, and secure it with a passphrase.
The keypairs are usually stored in a hidden directory (folder) named .ssh (same on Linux/Mac/WIN).
You can create a modern key (ed25519) using
```
ssh-keygen -t ed25519 -C "Your Name"
```
and it should be created as `~/.ssh/id_ed25519(.pub)`
You are asked for a file location, just press ENTER
```
Enter file in which to save the key (/home/<username>/.ssh/id_ed25519):
```
Next you **have to** enter a passphrase (you could use this [passphrase generator](https://www.tu-braunschweig.de/it-sicherheit/pwsec/pwgen)):
```
Enter passphrase (empty for no passphrase):
```
**Don't** leave it empty!! As usual, store your password in a secure place, use a password-manager ([e.g. KeePass](https://keepass.info/download.html)).
```
Enter same passphrase again:
Your identification has been saved in /home/<username>/.ssh/id_ed25519
Your public key has been saved in /home/<username>/.ssh/id_ed25519.pub
The key fingerprint is:
SHA256:mDIS+q3blablaBLABLAjePkEMEoR4sAIVumEoJiCXDNVs Your Name
The key's randomart image is:
+--[ED25519 256]--+
...
| .. .. |
+----[SHA256]-----+
```
You can ignore the rest of the output. The key pair is stored under `~/.ssh/id_ed25519(.pub)`. You can now send us the **public** key (id_ed25519**.pub**), either as a file or just copy/paste the key itself:
```
cat ~/.ssh/id_ed25519.pub
```
**Please send the public key to: [hpc-mgr@uni-koeln.de](mailto:hpc-mgr@uni-koeln.de)**
If `ssh` on your computer is old, it will not know the key type ed25519.
In this case use
```
ssh-keygen -t rsa -b 4096 -C "Your Name"
```
and send us the file `~/.ssh/id_rsa.pub` instead.
Please set a passphrase on the SSH key (it will ask you for one during `ssh-keygen`).
To avoid having to enter the passphrase every time you log in, you can load the key into memory using the `ssh-agent`.
On most Linux and Mac systems this is pre-installed; you can check with the command `ssh-add -l`. This should not return an error, but usually `The agent has no identities.` Otherwise you can start the ssh-agent:
```
eval "$(ssh-agent)"   # start the ssh-agent in the current shell
```
Then you add the private key you just created:
```
ssh-add ~/.ssh/id_ed25519   # or ~/.ssh/id_rsa
```
You can usually just run `ssh-add` without arguments, since `ssh-add` can find the default key files on its own. `ssh-add` asks for the passphrase you set in the `ssh-keygen` step, and afterwards `ssh-add -l` should list your key like this:
```
# ssh-add -l
4096 SHA256:RGqC9iR+ayXlLPXOSfRYWZ7yU8wnhG7iJ3KMzs7s7ao .ssh/id_rsa (RSA)
```
You can now use the key within your session without having to re-enter your SSH key passphrase.
If you have to use a Windows System: [Key-based authentication in OpenSSH for Windows](https://learn.microsoft.com/en-gb/windows-server/administration/openssh/openssh_keymanagement)
If you already have access to RAMSES but you are using the CHEOPS key, we advise you to create your own SSH key on your local machine/laptop and then add the public key to your `.ssh/authorized_keys` file in your home on RAMSES. Any text editor will work for this.
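Alternatively, the append step can be done in one go with `ssh-copy-id`, which ships with OpenSSH (a sketch; `<username>` is a placeholder for your RAMSES account, and it requires a working login, e.g. via the existing CHEOPS key):

```shell
# Append the local public key to ~/.ssh/authorized_keys on RAMSES.
# <username> is a placeholder for your account name.
ssh-copy-id -i ~/.ssh/id_ed25519.pub <username>@ramses1.itcc.uni-koeln.de
```

`ssh-copy-id` also creates the remote `~/.ssh` directory with the permissions SSH requires if it does not exist yet.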
**PLEASE NOTE**: Do not share SSH keys with other people and do not copy private keys to other computers. Just create new SSH key pairs on each computer you use regularly. You can also use [SSH Agent Forwarding](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/using-ssh-agent-forwarding), where an SSH key is taken along into an SSH session to a remote computer, eliminating the need to create many keys.
### LOGIN<a name="login"></a>
There are 4 login servers: ramses1.itcc.uni-koeln.de up to ramses4.itcc.uni-koeln.de.
Do not use ramses2 or ramses3, they are for internal use only for now.
When logging in to ramses1, the public key you sent us is authenticated against the private key on your computer (1st factor; you will be asked for the SSH passphrase). If successful, a verification request is automatically pushed to the Duo App on your device, where you confirm the login (2nd factor).
On your terminal you should see something like this:
```
Success. Logging you in...
rpabel2@ramses1:~>
```
Even though the message `Autopushing...` appears twice, only one push is executed and only one verification is needed.
On ramses4, you can choose between different Cisco Duo authenticators, if you have configured any:
```
rpabel2@soliton:~> ssh ramses4
Duo two-factor login for rpabel2
Enter a passcode or select one of the following options:
1. Duo Push to +XX XXX XXXX456
2. Duo Push to Android
Passcode or option (1-2): 1
Success. Logging you in...
```
In this example, if you choose '1', an authentication request is pushed to your phone and you just have to confirm it with a tap on the screen. Alternatively, instead of choosing a number in the above example, you could also open the Duo Mobile App on your device and enter the 6-digit passcode shown in the app. This code changes every 30 seconds.
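To shorten the login command, you can add a host alias to the `~/.ssh/config` file on your own computer (a sketch; `<username>` is a placeholder for your account name):

```
Host ramses
    HostName ramses1.itcc.uni-koeln.de
    User <username>
    IdentityFile ~/.ssh/id_ed25519
```

Afterwards, `ssh ramses` is enough; the Duo verification still follows as usual.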
Regarding 2FA login: If you own (*) a [Yubikey](https://www.yubico.com/de/product/yubikey-5-series/yubikey-5-nfc/) hardware token, it is also possible to use it (in OTP mode) as the second authentication factor instead of Duo Push. If you are interested in using a Yubikey, please contact us at [hpc-mgr](mailto:hpc-mgr@uni-koeln.de).
**PLEASE NOTE**: Be careful with scripted logins: any login attempt with your SSH key that triggers Duo Autopush is counted by Duo. If you don't respond in your app, your account will be blocked after 10 attempts (and has to be unlocked by an admin).
<br>
<br>
(*) We cannot supply Yubikeys to users, since they cost about 50€ per piece. Maybe ask your department head whether they are willing to order some for your work group. Nitrokeys are not supported yet, sadly.
## Data transfer<a name="data-transfer"></a>
To transfer your data to the cluster, we recommend using [scp](https://tldr.inbrowser.app/pages/common/scp) (**s**ecure **c**o**p**y) - either on the command line (CLI/Terminal) or with a graphical client (e.g. WinSCP).\
There is no automatic mechanism to sync/copy files between Cheops and Ramses. You have to copy your files yourself.
Please note: you can transfer data ONLY to the login nodes (ramses1 ... ramses4), not directly to compute nodes.
- for small numbers of files/folders:
```
scp local_file username@ramses1.itcc.uni-koeln.de:remote_destination_dir      # use . for your home folder
scp -r local_folder username@ramses1.itcc.uni-koeln.de:remote_destination_dir
```
- for huge amounts of (small) files use [tar](https://www.gnu.org/software/tar/manual/html_chapter/Tutorial.html#Tutorial) (or zip) to create an archive file before copying:
```
tar -czf name_of_archive.tar.gz files_or_folder_to_add   # create
tar -xvzf example.tar.gz                                 # extract
tar -tf <file>                                           # show contents
```
- you can also use [rsync](https://tldr.inbrowser.app/pages/common/rsync)
- if you prefer interactive transfer with a shiny GUI: e.g. [FileZilla (Linux/Mac/Win)](https://filezilla-project.org/), [WinSCP (Win only)](https://winscp.net/eng/download.php), [Cyberduck (Mac only)](https://cyberduck.io/download/)
<br>
<br>
## Filesystem<a name="filesystem"></a>
* "mpi" partition:
* when the memory specification is core oreiented (mem_per_cpu) and multiple tasks are specified
* when multiple nodes are specified
* "bigsmp”: when the requested memory exceeds 750GB per node
* "smp": in all other cases
The filesystem setup is exactly as on CHEOPS:
- /home/\<user\>
  - size per user 100GB / 100,000 files
  - directory 'user' automatically created
  - (backup - coming soon)
- /scratch/\<user\>
  - size per user 40TB
  - create the directory yourself! (You _could_ use any name you like, but for clarity we recommend choosing your login name.)
  - NO AUTOMATIC BACKUP; automatic deletion of files will be activated soon
  - typical usage: input data should be copied to the scratch partition only for running or soon-to-run jobs. Accordingly, input and temporary data on /scratch should be deleted and output data transferred to longer-term storage after job completion.
- /projects/\<user/group\>
  - created on request
  - NO AUTOMATIC BACKUP
<br>
<br>
### SUBMITTING JOBS<a name="submitting"></a>
There are several partitions/queues in slurm intended for general usage:
- _smp_
  - default partition, for single node jobs
  - 136 nodes
  - 768GB RAM / 192 cores per node
  - 960GB NVMe SSD
- _bigsmp_
  - for large single node jobs
  - 8 nodes
  - 3TB RAM / 192 cores per node
  - 960GB NVMe SSD
  - 15.36TB NVMe SSD
- _mpi_
  - same nodeset as smp but for MPI jobs
- _gpu_
  - 10 gpu nodes
  - 1500GB RAM / 96 cores per node
  - gpu types: h100:38, h100_1g.12gb:1, h100_2g.24gb:3, h100_3g.47gb:1, h100_4g.47gb:1
- _interactive_
  - for interactive usage
  - 8 nodes
  - 1536GB RAM / 192 cores per node
  - 960GB NVMe SSD
  - 12.8TB NVMe SSD
- _ft-instinct_
  - a partition with a single node that contains two AMD Instinct MI210 GPUs
- _ft-aurora_
  - a partition with a single node that contains two NEC SX Aurora Vector Engine Cards
When a partition isn't explicitly specified with the “-p” parameter, the automatic routing mechanism determines the right partition for the job:
- "mpi" partition:
  - when the memory specification is core oriented (`--mem-per-cpu`) and multiple tasks are specified
  - when multiple nodes are specified
- "bigsmp": when the requested memory exceeds 750GB per node
- "smp": in all other cases
In order to get access to GPU cards, make sure to specify the “gpu” partition as well as the type and number of GPU cards with the “-G” parameter, e.g. “-p gpu -G h100:2” in order to get 2x H100 GPU cards. Types like “h100_2g.24gb” are instances of the H100 card created by MIG partitioning; they behave like a separate device.
Each user has a default group account in Slurm which corresponds to their workgroup (not uniuser/hpcuser/smail). The right group account for each job should be specified with the “-A” parameter; without it, the default group account is chosen automatically. The default group account can be found by executing the following command:
```
sacctmgr show assoc -n user=$USER format=Account
```
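Putting the parameters together, a minimal job script might look like this (a sketch, not an official template; the account name `ag-example`, the resource numbers, and `./my_program` are placeholders you need to adapt):

```shell
#!/bin/bash
#SBATCH --job-name=example       # job name shown in the queue
#SBATCH --account=ag-example     # -A: your group account (see the sacctmgr command above)
#SBATCH --partition=gpu          # -p: smp/bigsmp/mpi/interactive/gpu/...
#SBATCH --gpus=h100:1            # -G: GPU type and count (gpu partition only)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G                # memory per node
#SBATCH --time=02:00:00          # walltime limit hh:mm:ss

srun ./my_program                # placeholder for your application
```

Submit it with `sbatch jobscript.sh`. For CPU-only jobs, drop the `--gpus` line; if `-p` is also omitted, the routing mechanism described above picks a partition.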
<br>
<br>
## Backup your data<a name="backup"></a>
[coming soon]
<br>
<br>
## Environment Modules<a name="env-modules"></a>
To avoid software conflicts (resulting from incompatibilities, versioning, dependencies...), software is provided as Environment Modules. By using Modules, it is possible to have different versions of software installed on the system.\
You can select the module(s) you need directly on the command line or in your scripts.
Basic commands are:
```
module avail             # list available modules
module whatis <module>   # show description
module load <module>     # load module
module list              # list loaded modules
module unload <module>   # unload module
module purge             # unload all loaded modules
```
To check the resulting software environment: `which <command>`, `echo $PATH`.
<br>
<br>
<br>
<br>
If you encounter any problems, please write to hpc-mgr@uni-koeln.de .