
As Snellius is a newly installed and configured system, some things might not be fully working yet and are still being set up. Here we list the issues that are known to us and that you do not have to report to the Service Desk. Of course, if you encounter issues on Snellius that are not listed here, please let us know (through the Service Desk).

Resolved issues (updated 20-11-22 10:13)

Project spaces

Project space data should be complete and writeable as of October 20, 10:30. Please check your data.

Remote visualization

The remote visualization stack is available as of November 22nd, 10:13. The relevant module is called remotevis/git on both Snellius and Lisa; it contains the vnc_desktop script to launch a remote VNC desktop.

The password needed to log into the VNC server on Snellius has changed compared to Cartesius. On both Snellius and Lisa the credentials used to log into the VNC server are your regular CUA username and password; there is no longer a separate VNC password.
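A typical session could look like the sketch below (the exact environment that needs to be loaded first, and any options that vnc_desktop accepts, are assumptions; check the module help for details):

# Minimal sketch: load the module and start a remote VNC desktop session
module load remotevis/git
vnc_desktop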

Accounting not active in October

Your jobs will not be accounted against your budget in the month of October. For jobs starting in November, accounting will be fully active; this also applies to jobs submitted in October that have not started before November 1st.

Infiniband and file system performance

We have found that the InfiniBand connections are not always stable and may underperform. Similarly, the GPFS file systems (home, project and scratch) are not yet performing as expected: reading and writing large files works well, but reading and writing many small files is slower than it should be. Among other things, this can affect the time it takes to start a binary, run commands, etc.

We are looking into both issues. 

Allocating multiple GPU nodes

Normally, batch scripts like

#!/bin/bash
#SBATCH -p gpu
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH --gpus=8
#SBATCH -t 20:00
#SBATCH --exclusive

module load ...

srun <my_executable>

should give you an allocation with 2 GPU nodes, 8 GPUs, and 4 MPI tasks per node. However, there is currently an issue related to specifying a total number of GPUs larger than 4: jobs with the above SBATCH arguments that use OpenMPI and call srun or mpirun will hang.

Instead of specifying the total number of GPUs, please specify the number of GPUs per node combined with the number of nodes, e.g.

#!/bin/bash
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH -t 20:00
#SBATCH --exclusive

module load ...

srun <my_executable>

This will give you the desired allocation of 2 GPU nodes, 8 GPUs in total, and 4 MPI tasks per node, and the srun (or mpirun) will not hang.
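If you want to double-check the allocation, a simple sanity check can be added to the job script (a minimal sketch; the expected output assumes the exact settings shown above):

echo $SLURM_JOB_NODELIST   # should list the 2 allocated GPU nodes
srun -l hostname           # should print 8 lines, 4 tasks on each of the 2 nodes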

myquota command not available

The myquota command, used to check the disk quota on the system, is not available on Snellius at the moment. We are currently working to port it to the new system. 


Project space quota exceeded on Snellius

During the data migration of the project space contents from Cartesius to Snellius, there was a considerable period (weeks) in which there was no freeze on the (Cartesius) source side. Users kept working, causing files to be created and sometimes also deleted. During the resync runs, the migration software kept things in sync by "soft deleting" files that had been migrated but at a later stage appeared to be no longer present on the Cartesius side. That is, rather than actually removing them, the AFM software moved them into a .ptrash directory.

Depending on the rate of data turnover of users in the last weeks of Cartesius, this can add up and cause "quota pressure" for users. The .ptrash directories are owned by root; Snellius system administration has to delete them, users cannot do it themselves. Several quota-related service desk tickets may have this as their root cause, but only for project spaces. Home directories were migrated in a different manner and do not have this issue.
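If you suspect this affects your project space, you can check for such a directory yourself (a minimal sketch; the project space path below is a placeholder, and du may fail because the directory is owned by root):

PROJDIR=/projects/0/myproject          # placeholder: use your own project space path
ls -ld "$PROJDIR"/.ptrash              # does a .ptrash directory exist?
du -sh "$PROJDIR"/.ptrash 2>/dev/null  # size, if readable; otherwise ask the Service Desk to remove it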

Missing software packages

A number of software packages from the 2021 environment, as well as some system tools, are still being installed; it might take a few days for them to become available (status as of 16/11/21). You can check whether a package has appeared in the meantime with a module query, as shown after the list below:

  • COMSOL-5.6.0.341
  • Darshan-3.3.1
  • DFlowFM
  • EIGENSOFT-7.2.1
  • FSL-6.0.4
  • LAMMPS-29Oct2020
  • swak4foam
  • Trilinos-13.0.1
  • ncview-2.1.7
  • nedit
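A query like the following shows whether a package is installed yet (a minimal sketch; the package name is taken from the list above, and module spider assumes the system uses Lmod):

module avail LAMMPS     # lists any installed LAMMPS modules, if present
module spider LAMMPS    # also searches modules only visible after loading a toolchain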

Cartopy: ibv_fork_init() warning

Users can encounter the following warning message when importing the "cartopy" and "netCDF4" modules in Python:

>>> import netCDF4 as nc4
>>> import cartopy.crs as ccrs
[1637231606.273759] [tcn1:3884074:0]          ib_md.c:1161 UCX  WARN  IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1637231606.273775] [tcn1:3884074:0]          ib_md.c:1162 UCX  WARN  IB: data corruption might occur when using registered memory.

The issue is similar to the one reported here. The warning disappears if "cartopy" is imported before "netCDF4".
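For example, the following quick check from the shell should run without the warning (a minimal sketch; it assumes a Python environment with both packages installed and the relevant modules loaded):

# importing cartopy before netCDF4 avoids the ibv_fork_init() warning
python -c "import cartopy.crs as ccrs; import netCDF4 as nc4; print('ok')"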

Intel toolchain on GPU nodes

Status as of 18/11/21: the "intel" toolchain from the 2021 software stack is currently not functioning on the GPU nodes.


OUT_OF_MEMORY problem for several software packages on Snellius

We have received several reports from users that packages like QuantumESPRESSO, VASP and CP2K sometimes crash with an OUT_OF_MEMORY message. We suspect that this is caused by a memory leak in a library routine, but we haven't pinpointed the cause yet.

We are investigating this matter.
