Overview
The cluster consists of several sections:
- MPI section for MPI-intensive applications
- MEM section for applications requiring large amounts of memory (in a single node)
- ACC section for applications that use accelerators
- TEST section for evaluating new hardware
The whole system is located in the HPC building (L5|08) on the Lichtwiese campus, and consists of several stages (“phases”) running concurrently.
Phase II of Lichtenberg II is currently in testing.
Phase I of Lichtenberg II became operational in December 2020 (in testing since September 2020).
Phase II of Lichtenberg I became operational in February 2015 (and was expanded at the end of 2015); it was decommissioned in May 2021.
Phase I of Lichtenberg I has been in operation since fall 2013 and was decommissioned in April 2020.
- Each node can be used as-is, i.e. “single node”, with either one large or several smaller jobs/programs
- Several nodes can be used concurrently, with interprocess communication (MPI) over InfiniBand
The distinct phases (expansion stages) of Lichtenberg II are still large islands with respect to their InfiniBand fabric: only the compute nodes of the same phase (expansion stage) can reach and talk to each other with almost the same speed and latency – their interconnect (inside one island/phase) is “non-blocking”.
In contrast, the bandwidth between distinct phases (islands) is limited.
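By way of illustration, a multi-node job on such a system is normally submitted through the batch system so that all of its nodes land in the same island. The sketch below assumes the batch system is Slurm (common on HPC clusters, but not stated on this page); the application binary is a placeholder:

```shell
#!/bin/bash
# Hypothetical Slurm batch script for a 2-node MPI job on the MPI section.
#SBATCH --nodes=2               # both nodes scheduled from the same phase/island
#SBATCH --ntasks-per-node=96    # one MPI rank per CPU core of an MPI-section node
#SBATCH --time=00:30:00

srun ./my_mpi_app               # placeholder binary; ranks communicate over InfiniBand
```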
643 Compute nodes and 8 Login nodes
- Processors: in total, ~4.5 PFlop/s computing power (double precision, theoretical peak)
- Realistically achieved: ca. 3.03 PFlop/s with the Linpack benchmark
- Accelerators: overall 424 TFlop/s computing power (double precision/FP64, theoretical peak) and ~6.8 Tensor PFlop/s (half precision/FP16)
- Memory: in total, ~250 TByte main memory
- All compute and accelerator nodes in one large island:
- MPI section: 630 nodes (each with 96 CPU cores and 384 GByte main memory)
- ACC section: 8 nodes (each with 96 CPU cores and 384 GByte main memory)
- 4 nodes with 4x Nvidia V100 GPUs each (total: 16)
- 4 nodes with 4x Nvidia A100 GPUs each (total: 16)
- MEM section: 2 nodes (each with 96 CPU cores and 1536 GByte main memory)
- NVIDIA DGX A100
- 3 nodes (each with 128 CPU cores, 1024 GByte main memory)
- 8x NVIDIA A100 Tensor Core GPUs (320 GByte GPU memory in total)
- Local storage: ca. 19 TByte (flash, NVMe)
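The per-section node counts and memory sizes listed above can be cross-checked against the “~250 TByte main memory” aggregate with a few lines of arithmetic (all figures taken from this page):

```python
# Cross-check of the "~250 TByte main memory" figure from the node counts
# and per-node memory sizes listed above.
sections = {
    "MPI": (630, 384),      # 630 nodes x 384 GByte
    "ACC": (8, 384),        # 8 nodes x 384 GByte
    "MEM": (2, 1536),       # 2 nodes x 1536 GByte
    "DGX A100": (3, 1024),  # 3 nodes x 1024 GByte
}
total_gb = sum(nodes * gb for nodes, gb in sections.values())
print(total_gb)  # 251136 GByte, i.e. ~250 TByte
```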
Processor and accelerator details can be found under “Operations”/“Hardware”.
The current storage system is an IBM/Lenovo “Elastic Storage System” (ESS) and was put into operation on 2022-12-20. The ESS consists entirely of NVMe flash drives (576 in total) instead of legacy (magnetic) hard disks. NVMe drives are solid state disks connected directly via PCI Express to the storage servers' CPUs, rather than through intermediary SAS or SATA controllers with their added latency.
The ESS is thus capable of providing far more storage bandwidth and throughput, as well as I/Os per second, than the former system.
In total, 2.1 PByte of storage capacity is available.
The high speed parallel file system is “IBM Storage Scale” (formerly known as General Parallel File System), well known for its parallel performance and flexibility.
The stored data is delivered to all cluster nodes via the fast interconnect, allowing all nodes concurrent read and write access.
One of the most notable features of the storage system is its constant distribution of all files and directories over all available disks and SSDs/NVMe drives. Unlike before, there is almost no performance difference any longer between e.g. /work/scratch and /home. In addition, any expansion in storage capacity also yields a substantial gain in storage performance.
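The capacity-to-performance scaling follows directly from the striping: with every file spread over all drives, peak sequential bandwidth grows roughly with the drive count. A minimal sketch of that reasoning (the per-drive bandwidth figure is an assumption for illustration, not a measured ESS value):

```python
# Illustration of why expanding capacity also raises performance when all
# files are striped over all drives: peak sequential bandwidth scales with
# the drive count. The 3 GByte/s per-drive figure is an assumption for
# illustration, not a measured value for the ESS.
def aggregate_bandwidth_gbps(n_drives, per_drive_gbps):
    # Striped reads/writes hit every drive at once, so their
    # bandwidths add up.
    return n_drives * per_drive_gbps

current = aggregate_bandwidth_gbps(576, 3.0)         # today's 576 NVMe drives
expanded = aggregate_bandwidth_gbps(576 + 144, 3.0)  # hypothetical expansion
print(current, expanded)  # 1728.0 2160.0
```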