Synopsis

This page gives an overview of the Lisa system and details the various types of file systems, nodes, and services available to end-users.

System overview

The Lisa system is a Beowulf cluster computer consisting of several hundred multi-core nodes running the Debian Linux operating system. The system is constantly evolving and growing to satisfy the needs of its users.

Node types

At the moment Lisa has the following configuration:

| # Nodes | Processor Type | Clock    | Scratch     | Memory                 | Sockets | Cache    | Cores | Accelerator(s)                    | Interconnect          |
|---------|----------------|----------|-------------|------------------------|---------|----------|-------|-----------------------------------|-----------------------|
| 23      | bronze_3104    | 1.70 GHz | 1.5 TB NVME | 256 GB (UPI 10.4 GT/s) | 2       | 8.25 MB  | 12    | 4 x GeForce 1080Ti, 11 GB GDDR5X  | 40 Gbit/s ethernet    |
| 2       | bronze_3104    | 1.70 GHz | 1.5 TB NVME | 256 GB (UPI 10.4 GT/s) | 2       | 8.25 MB  | 12    | 4 x Titan V, 12 GB HBM2           | 40 Gbit/s ethernet    |
| 29      | gold_5118      | 2.30 GHz | 1.5 TB NVME | 192 GB (UPI 10.4 GT/s) | 2       | 16.5 MB  | 24    | 4 x Titan RTX, 24 GB GDDR6        | 40 Gbit/s ethernet    |
| 192     | gold_6130      | 2.10 GHz | 1.7 TB      | 96 GB (UPI 10.4 GT/s)  | 1       | 22 MB    | 16    | -                                 | 10 Gbit/s ethernet    |
| 96      | silver_4110    | 2.10 GHz | 1.8 TB      | 96 GB (UPI 9.6 GT/s)   | 2       | 11 MB    | 16    | -                                 | 10 Gbit/s ethernet    |
| 1       | gold_6126      | 2.60 GHz | 11 TB       | 2 TB (UPI 10.4 GT/s)   | 4       | 19.25 MB | 48    | -                                 | 40 Gbit/s ethernet    |
| 6       | gold_6230R     | 2.10 GHz | 3 TB        | 376 GB (UPI 10.4 GT/s) | 2       | 35.75 MB | 52    | -                                 | 2 x 25 Gbit/s ethernet |

File systems 

There are several file systems accessible on Lisa.

| File system     | Quota            | Speed     | Shared between nodes          | Mountpoint               | Expiration                        | Backup              |
|-----------------|------------------|-----------|-------------------------------|--------------------------|-----------------------------------|---------------------|
| Home            | 200 GB           | Normal    | Yes                           | /home/<username>         | 15 weeks after project expiration | Nightly incremental |
| Scratch-local   | 1.5 - 1.7 TB     | Fast      | No                            | /scratch                 | End of job                        | No                  |
| Scratch-shared  | N.A. (size 3 TB) | Normal    | Yes                           | /nfs/scratch             | At most 14 days                   | No                  |
| Project         | Based on request | Normal    | Yes                           | /project/<project_name>  | Project duration                  | No                  |
| Archive Service | N.A.             | Very slow | Only available on login nodes | /archive/<username>      | Project duration                  | Nightly             |

The Home file system

When you log in to Lisa you are, by default, on the home file system. This is the regular file system where you can store your job scripts, datasets, etc. You can always access the home file system through the $HOME environment variable; for example, ls -als $HOME lists all the files and folders in your home folder.

The home file system contains the files you normally use. By default, your account is provisioned with a home folder of 200 GB. Your current usage is shown when you log in to Lisa, or you can check it with

quota -h

The home file system is a network file system that is available on all login and batch nodes, so your jobs can access it from every node. The downside is that the home file system is not particularly fast, especially in handling metadata: creating and deleting files, opening and closing files, many small updates to files, and so on.

Backup & restore

  • We do nightly incremental backups.
  • Files that are open at the time of backup will be skipped.
  • We can restore files and/or directories that you accidentally removed, up to 15 days back, provided they existed during the last successful backup.

The scratch file system

The scratch file system is intended as fast, temporary storage that can be used while running a job, and can be accessed by all users with a valid account on the system. Every compute node in the Lisa system contains a local disk for the scratch file system that can only be accessed by that particular node. There is no quota for the scratch file system; use of the scratch file system is eventually limited by the capacity of these disks (see table above). Scratch disks are not backed up and are cleaned at the end of a job.

Since the disks are local, read and write operations on the scratch file system are much faster than on the home file system. This makes the scratch file system very suitable for I/O-intensive operations.

The scratch disks in the GPU partition are NVMe SSDs, which are particularly fast and suitable for intensive I/O, making them very suitable for machine learning training sets.

You access the scratch file system by using the environment variable $TMPDIR: this points to an existing directory on the local disk of each node. For example: to create a directory 'work' on the scratch file system and copy a file from the home file system to that directory:

mkdir "$TMPDIR"/work
cp my-file "$TMPDIR"/work

Note the use of "$TMPDIR" (with quotes) rather than $TMPDIR (without quotes). The reason is that $TMPDIR can contain meta-characters (e.g. [ and ]); the quotes take care that the shell will leave those characters as-is.
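A small, runnable illustration of this (not Lisa-specific; the path below is made up to resemble a scratch path with metacharacters):

```shell
demo_dir='/tmp/scratch.[demo].1234'   # brackets, as a real $TMPDIR value may contain
mkdir -p "$demo_dir"/work             # quoted: the literal directory is created
touch "$demo_dir"/work/my-file
ls "$demo_dir"/work
```

With the quotes, the shell passes the path through verbatim instead of treating `[demo]` as a glob pattern.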

In addition to temporary storage that is local to each node (like scratch), you may need some temporary storage that is shared among nodes. For this we have a shared scratch disk accessible through

cd /nfs/scratch

The size of this shared scratch space is currently 1 TB and there is no quota for individual users. Note that this shared scratch has two disadvantages compared to the local scratch disk:

  • The speed of /nfs/scratch is similar to the home file system and thus slower than the local scratch disk at "$TMPDIR".
  • You share /nfs/scratch with all other users, so there may not be enough space to write all the files you want. Think carefully about how your job will behave if it tries to write to /nfs/scratch when there is insufficient space: it would be a waste of budget if the results of a long computation were lost because of it.
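One defensive pattern is to check the free space before writing large results. The sketch below is illustrative: the helper name, the threshold, and falling back to $HOME are all assumptions, not a prescribed policy.

```shell
# pick_scratch DIR NEEDED_KB: print DIR if it exists and has at least
# NEEDED_KB kilobytes free, otherwise fall back to $HOME.
pick_scratch() {
    dir=$1
    needed_kb=$2
    # df --output=avail prints the free space in 1K blocks (GNU coreutils)
    avail_kb=$(df --output=avail -k "$dir" 2>/dev/null | tail -n 1)
    if [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$needed_kb" ]; then
        echo "$dir"
    else
        echo "$HOME"
    fi
}

# On Lisa you might then pick a target for ~10 GB of results with:
#   target=$(pick_scratch /nfs/scratch $((10 * 1024 * 1024)))
```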

How to best use scratch

In general, the best way to use scratch is to copy your input files from your home to scratch at the start of a job, create all temporary files needed by your job on scratch (assuming they don't need to be shared with other nodes) and copy all output files at the end of a job back to the home file system. There are two things to note:

  • Don't forget to copy your results back to the home file system! Scratch will be cleaned after your job finishes and your results will be lost if you forget this step.
  • If you created files with the same filename on the scratch disk of different nodes, copying them back will result in a clash: you're trying to write two different files to the same filename on the home file system. Avoid this by including something unique to the host (e.g. the hostname, which you can retrieve with the hostname command) in the file or directory name.
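The steps above can be sketched as follows. The input data and the "computation" are trivial stand-ins so the sketch is runnable outside a batch job; in a real job, $TMPDIR is set by the batch system and the input would already live in your home folder.

```shell
set -e
: "${TMPDIR:=/tmp}"                    # fallback so this sketch runs anywhere

# 1. Stage input on node-local scratch (in a real job: cp from $HOME)
workdir=$(mktemp -d "$TMPDIR"/work.XXXXXX)
echo "example input" > "$workdir"/input.dat

# 2. Compute on scratch; a trivial stand-in for the real program
tr 'a-z' 'A-Z' < "$workdir"/input.dat > "$workdir"/output.dat

# 3. Copy results back, tagged with the hostname so that files from
#    different nodes cannot clash in the home file system
cp "$workdir"/output.dat ./output."$(hostname)".dat
```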

The project file system

A project file system can be used

  1. If you need additional storage space, but do not require a backup.
  2. If you need to share files within a collaboration.

By default, accounts on our systems are not provisioned with a project space. One can be requested when you apply for an account, or by contacting our service desk. Depending on the type of account, different conditions may apply; contact us to find out whether your account is eligible for a project space.

A project space can be accessed at the location /project/<project_space_name>. Project quotas are implemented as group quotas, not as user quotas. You can check how much free space you have using df -h /project/<project_space_name>.

To share files on the project file system, you need to make sure to write files with the correct file permissions. See the corresponding section in the documentation on how to do that.
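As a sketch, the usual ingredients are group read/write permissions plus the setgid bit on directories. It is demonstrated here on a temporary directory, since /project/<project_space_name> only exists on Lisa:

```shell
projdir=$(mktemp -d)                 # stand-in for /project/<project_space_name>
mkdir "$projdir"/data
touch "$projdir"/data/results.csv

chmod -R g+rwX "$projdir"   # group read/write; execute (search) bit on directories only
chmod g+s "$projdir"        # setgid: files created here inherit the directory's group
```

On the real project space you would run the same commands against /project/<project_space_name>, with the files owned by your project's group.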

Expiration of project space

The duration of a project space coincides with that of the allocation within which it was granted. When the agreed-upon period expires, the project space is made inaccessible. If we receive no further notice from the project space users, we are entitled to delete all files and directories in the project space after a grace period of an additional four weeks.

Backup & restore

We do not make backups of project spaces, and thus cannot restore data. Users are responsible for making their own backups, if needed.

The archive file system

The Data Archive is not a traditional file system, nor is it specific to Lisa. It is an independent facility for long-term storage of data, which uses a tape storage backend. It is accessible from other systems as well. For more information, see the separate page about the archive.

The archive service is intended for long-term storage of large amounts of data. Most of this data is (eventually) stored on tape, and accessing it may therefore take a while. If you are a user of the data archive, it is accessible at /archive/<username>. The archive system is designed to handle only large files efficiently. If you want to archive many smaller files, compress them into a single tar file first, before copying it to the archive. Never store a large number of small files on the archive: they may be scattered across different tapes, and retrieving them all at a later stage puts a large load on the archive. See this section of the documentation for more information on using the archive appropriately.
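For example, to pack a directory of many small result files into one compressed tar file before copying it to the archive (the directory name is illustrative; the final copy only works on a Lisa login node and is therefore shown commented out):

```shell
mkdir -p myresults
touch myresults/run1.log myresults/run2.log   # stand-ins for many small files

tar czf myresults.tar.gz myresults/           # one large file instead of many small ones
# cp myresults.tar.gz /archive/<username>/    # on a login node, with your own username
```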

Lisa Hostkey Fingerprints

When you log in to a new system for the first time with the SSH protocol, the system returns a hostkey fingerprint to you:

The authenticity of host 'lisa.surfsara.nl (145.101.35.179)' can't be established.
ED25519 key fingerprint is SHA256:UhVbfNE+O1oEjdLidcM9YS0hLHrO3tQYrVIo4BAqwNo.
ED25519 key fingerprint is MD5:a0:d5:6e:e6:41:41:8d:06:68:5a:1d:aa:03:7f:40:3b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'lisa.surfsara.nl' (ED25519) to the list of known hosts.

Before you type "yes" in answer to this question, you can verify the fingerprint against the list of correct fingerprints below to check that you are indeed logging in to the correct system.

ED25519
===
SHA256:UhVbfNE+O1oEjdLidcM9YS0hLHrO3tQYrVIo4BAqwNo
MD5:a0:d5:6e:e6:41:41:8d:06:68:5a:1d:aa:03:7f:40:3b
RSA
===
SHA256:8wVrNrBzU399UFktk3sNHvp6x2cjbhJBai5MRe10w8E
MD5:b0:69:85:a5:21:d6:43:40:bc:6c:da:e3:a2:cc:b5:8b
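You can print the fingerprint of any key file yourself with ssh-keygen. The demonstration below generates a throwaway key to run on; to check Lisa, you would instead fetch the host key with ssh-keyscan -t ed25519 lisa.surfsara.nl, save it to a file, and inspect that file the same way:

```shell
ssh-keygen -t ed25519 -N '' -f demo_key -q   # throwaway key, for demonstration only
ssh-keygen -lf demo_key.pub                  # SHA256 fingerprint (the default)
ssh-keygen -lf demo_key.pub -E md5           # the same key as an MD5 fingerprint
```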

Installation and maintenance of the system

Most of the software that is used to manage the Lisa system is Open Source. We are using the following software to manage the system:

SALI (Sara Automatic Linux Installer) is a tool for installing Linux on multiple machines at once. It supports several protocols, such as BitTorrent and rsync, for downloading the image used to install a machine. SALI originates from SystemImager and still follows the same philosophy: a scalable method for performing unattended installations. SALI is mostly used in cluster setups.