
As Snellius is a new system that was only recently installed and configured, some things might not be fully working yet and are still being set up. Here we list the issues that are known to us and that you do not need to report to the Service Desk. Of course, if you encounter an issue on Snellius that is not listed here, please let us know (through the Service Desk).

Resolved issues (updated 20-10-21 11:37)

Project spaces

Project space data should be complete and writeable as of October 20, 10:30. Please check your data.

Accounting not active in October

Your jobs will not get accounted against your budget in the month of October. For jobs starting in November accounting will be fully active, including jobs submitted in October which have not started before November 1st.

Remote visualization

The remote visualization stack is not available yet (status as of 18/10/21):

  • The remotevis module is not present in the 2021 environment
  • GPU-based accelerated rendering, e.g. in ParaView, is not available yet

Infiniband and file system performance

We have found that the InfiniBand connections are not always stable and may be underperforming. Similarly, the GPFS file systems (home, project, and scratch) are not performing as expected: reading/writing large files performs as expected, but reading/writing many small files is slower than it should be. Among other things, this can affect the time it takes to start a binary, run commands, etc.

We are looking into both issues. 

Allocating multiple GPU nodes

Normally, batch scripts like

#!/bin/bash
#SBATCH -p gpu
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH --gpus=8
#SBATCH -t 20:00
#SBATCH --exclusive

module load ...

srun <my_executable>

should get you an allocation with 2 GPU nodes, 8 GPUs in total, and 4 MPI tasks per node. However, there is currently an issue when specifying a number of GPUs larger than 4: jobs with the above SBATCH arguments that use OpenMPI and call srun or mpirun will hang.

Instead of specifying the total number of GPUs, please specify the number of GPUs per node, combined with the number of nodes. For example:

#!/bin/bash
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH -t 20:00
#SBATCH --exclusive

module load ...

srun <my_executable>

This will give you the desired allocation with a total of 2 GPU nodes, 8 GPUs, and 4 MPI tasks per node, and srun (or mpirun) will not hang.

myquota command not available

The myquota command, used to check disk quotas on the system, is not yet available on Snellius. We are currently working to port it to the new system.
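While myquota is being ported, standard tools can give a rough indication of your usage. This is only a sketch: these commands show what you are using, not your actual quota limits.

```shell
# Rough usage check while myquota is unavailable.
# df shows usage of the whole file system holding your home directory;
# du totals the size of the directory itself (can be slow with many small files).
df -h "$HOME"
du -sh "$HOME"
```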


Project space quota exceeded on Snellius

During the data migration of the project space contents from Cartesius to Snellius, there was a considerable period (weeks) in which there was no freeze on the (Cartesius) source side: users kept working, causing files to be created, and sometimes also deleted. During the resync runs, the migration software kept things in sync by "soft deleting" files that had been migrated but at a later stage appeared to be no longer present on the Cartesius side. That is, rather than actually removing them, the AFM software moved them into a .ptrash directory.

Depending on a user's rate of data turnover in the last weeks of Cartesius, this can add up and cause "quota pressure". The .ptrash directories are owned by root; Snellius system administration has to delete them, as users cannot do this themselves. Several quota-related Service Desk tickets may have this as their root cause, but only for project spaces. Home directories were migrated in a different manner and do not have this issue.
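If you suspect this is affecting your project space, you can check for .ptrash directories yourself before contacting the Service Desk. The project path below is a hypothetical example; substitute your own project space.

```shell
# /projects/0/myproject is a hypothetical path; substitute your own project space.
# Lists any .ptrash directories and the space they occupy.
# "|| true" keeps the command from failing if the path does not exist.
find /projects/0/myproject -type d -name ".ptrash" -exec du -sh {} + 2>/dev/null || true
```

Remember that you cannot delete these directories yourself; knowing their size and location simply helps the Service Desk resolve your ticket faster.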

Missing software packages

A number of software packages of the 2021 environment, as well as some system tools, are still being installed and might take a few days to become available (status as of 16/11/21):

  • BLIS-3.0
  • COMSOL-5.6.0.341
  • Darshan-3.3.1
  • DFlowFM
  • EIGENSOFT-7.2.1
  • Extrae-3.8.3
  • FSL-6.0.4
  • LAMMPS-29Oct2020
  • RStudio-Server-1.4.1717
  • swak4foam
  • Trilinos-13.0.1
  • ncview-2.1.7
  • gedit
  • nedit

Cartopy: ibv_fork_init() warning

Users can encounter the following warning message when importing the "cartopy" and "netCDF4" modules in Python:

>>> import netCDF4 as nc4
>>> import cartopy.crs as ccrs
[1637231606.273759] [tcn1:3884074:0]          ib_md.c:1161 UCX  WARN  IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1637231606.273775] [tcn1:3884074:0]          ib_md.c:1162 UCX  WARN  IB: data corruption might occur when using registered memory.

The issue is similar to the one reported here. The warning will disappear if "cartopy" is imported before "netCDF4".
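As a sketch of the workaround, reordering the imports avoids the warning:

```python
# Workaround sketch: import cartopy before netCDF4, so the InfiniBand/UCX
# initialization happens before netCDF4 is loaded and the ibv_fork_init()
# warning is not triggered. Requires the cartopy and netCDF4 packages.
import cartopy.crs as ccrs  # import cartopy first
import netCDF4 as nc4       # then netCDF4

# ... use ccrs and nc4 as usual ...
```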

Intel toolchain on GPU nodes

Status as of 18/11/21: the "intel" toolchain is currently not functioning on GPU nodes.
