LIZA
System overview
The Linux Innovation Zone Amsterdam (LIZA) is a computer cluster designed for experimenting with various hardware platforms. The cluster-based design of LIZA uses a classical batch scheduler, Slurm, which makes it easy for researchers to deploy their experiments and benchmarks. Unlike Snellius, LIZA is not a production cluster, which allows the ETP system administrators to adapt the system's hardware and software configuration to users' needs. Additionally, LIZA offers a broader variety of node types and architectures.
System specifications
# Nodes | Server | CPU | # Cores | Memory | # GB | Disk | # TB | Devices | Bus | Features |
---|---|---|---|---|---|---|---|---|---|---|
16 | Dell PowerEdge T640 | 2x Intel Xeon Gold 5118 | 24 | 12x DDR4-2400 16GB | 192 | NVMe | 1.5 | 4x NVIDIA Titan RTX 24GB | PCIe 3.0 x16 | hwperf, skylake, sse4, avx512, gold_5118, gpu_nvidia |
3 | Dell PowerEdge T640 | 2x Intel Xeon Gold 6230 | 40 | 24x DDR4-2933 64GB | 1,536 | NVMe | 1.5 | 2x NVIDIA Titan RTX 24GB | PCIe 3.0 x16 | hwperf, skylake, sse4, avx512, gold_6230, gpu_nvidia |
6 | Dell PowerEdge C6420 | 2x Intel Xeon Gold 6230R | 52 | 12x DDR4-2933 32GB | 384 | NVMe | 3.0 | | | hwperf, skylake, sse4, avx512, gold_6230R |
1 | Lenovo SR650V2 | 2x Xeon Platinum 8360Y | 72 | 16x DDR4-3200 32GB | 512 | | | 1x AMD Instinct MI210 64GB | PCIe 4.0 x16 | hwperf, skylake, sse4, avx512, platinum_8360, gpu_amd |
1 | | 2x Platinum 8480+ | 112 | 16x DDR5-4200 32GB | 512 | | | 2x Intel GPU Max 1100 48GB | PCIe 5.0 x16 | hwperf, sapphire_rapids, sse4, avx512, platinum_8480, gpu_intel |
1 | | 1x NVIDIA Grace | 72 | LPDDR5X | 480 | | | 1x NVIDIA Hopper H100 96GB | NVLink-C2C, PCIe 5.0 x16 | nvidia_grace, gpu_nvidia |
4 | Dell PowerEdge R7515 | 2x AMD EPYC 7702P | 128 | | 512 | NVMe | 5.8 | 1x HBA to Liqid | PCIe 4.0 x16 | |
Liqid Composable Infrastructure
"Liqid composable infrastructure leverages industry-standard data center components to deliver a flexible, scalable architecture built from pools of disaggregated resources. Compute, networking, storage, GPU, FPGA, and Intel® Optane™ memory devices are interconnected over intelligent fabrics to deliver dynamically configurable bare-metal servers, perfectly sized, with the exact physical resources required by each deployed application.
Our solutions and services enable infrastructure to adapt and approach full utilization. Processes can be automated to realize further efficiencies to address better data demand associated with next-generation applications in AI, IoT deployment, DevOps, Cloud and Edge computing, NVMe- and GPU-over-Fabric (NVMe-oF, GPU-oF) support, and beyond."
Slot | PCI box 1 | PCI box 2 |
---|---|---|
1 | NextSilicon Maverick v1 | |
2 | NextSilicon Maverick v1 | |
3 | Asus CRL-G116U-P3DF (16x Google Coral Edge TPUs) | |
4 | Asus CRL-G116U-P3DF (16x Google Coral Edge TPUs) | |
5 | Asus CRL-G116U-P3DF (16x Google Coral Edge TPUs) | |
6 | Asus CRL-G116U-P3DF (16x Google Coral Edge TPUs) | |
7 | AMD Instinct MI210 64GB | |
8 | AMD Instinct MI210 64GB | |
9 | | |
10 | | |
Usage
Connecting to LIZA
To connect to LIZA, you will need to use the SSH protocol, which encrypts all the data and passwords exchanged between your local system and the LIZA system. The way you connect will depend on the type of local system you are using. In all cases, you will access LIZA through one of the login nodes. These are publicly accessible nodes that you use as a stepping stone to work with the batch system and compute nodes.
No long-running processes on login nodes
The login nodes are intended to be used for tasks such as preparing and submitting jobs, checking on the status of running jobs, and transferring data to and from the system. It is not allowed to use the login nodes for running, testing, or debugging processes, as this could negatively impact the experience of other users. To ensure that the login nodes remain usable for everyone, there is an automatic cleanup feature that will terminate processes that consume excessive CPU time or memory.
Open a terminal and type:
```
ssh <username>@liza.surf.nl
```
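If you connect often, a host entry in your local SSH configuration can shorten the command. This is an optional convenience rather than an official requirement; the alias `liza` and the `<username>` placeholder below are illustrative:

```
# ~/.ssh/config -- illustrative host alias for LIZA
Host liza
    HostName liza.surf.nl
    User <username>
```

With this entry in place, `ssh liza` is enough to open a session.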
There is only one Slurm partition on LIZA, which contains all available nodes. To select specific types of nodes, use the Slurm `--constraint` option to request nodes with specific features. The table above lists the node features for each node type. The example below demonstrates how to run Slurm commands on different node types.
```
$ srun --constraint=gpu_amd --gpus=1 hostname
srun: job 358 queued and waiting for resources
srun: job 358 has been allocated resources
j14n2.mgt.liza.surf.nl
$ srun --constraint=gpu_intel --gpus=2 hostname
srun: job 359 queued and waiting for resources
srun: job 359 has been allocated resources
j16n1.mgt.liza.surf.nl
```
Note that no default values are applied: you get what you ask for. This means that users are responsible for specifying the required resources using a range of Slurm flags (see Slurm sbatch):
```
--nodes=<minnodes>[-maxnodes]|<size_string>
--ntasks=<number>
--cpus-per-task=<ncpus>
--mem=<size>[units]
--gpus=[type:]<number>
```
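As a sketch of how these flags fit together in a job script, the fragment below requests a single GPU node; the job name, feature, resource sizes, and time limit are illustrative assumptions, not site defaults:

```bash
#!/bin/bash
#SBATCH --job-name=liza-example      # illustrative name
#SBATCH --constraint=gpu_nvidia      # pick a node feature from the table above
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gpus=1
#SBATCH --time=00:10:00              # assumed walltime; adjust to your workload

# Print where the job runs and which GPU was allocated
hostname
nvidia-smi
```

Submit the script with `sbatch`, then follow its state with `squeue -u $USER`.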
Alternatively, you may consider using the `--exclusive` flag to allocate all CPUs and GRES on all nodes in the allocation. Note that by default, the `--exclusive` flag only allocates as much memory as requested; however, this behavior has been modified on LIZA to allocate all memory as well.
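For illustration, the following command combines `--exclusive` with a constraint from the table above to claim a whole node of one type; the chosen feature is just an example:

```
$ srun --exclusive --constraint=gold_6230R hostname
```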