
You can apply through our Service Desk, via the link "Small Applications". These requests are assessed by SURF on technical feasibility only and are usually handled within 2 weeks. The scope of these applications is listed in the table at the bottom of this page.

It is important to understand that the resources you request need to be justified: you need to detail how many SBUs you require and how you plan to use them. The limits in the table are only the maximum size of a small application and are not granted by default; the resources for each application are tailored to the project. To ensure a smooth application process, we provide a template below that you can use when you apply for a small request on either Snellius or Lisa. Please edit the template to fit your project.


Small Application Template

  • Provide a justification of the amount of requested SBUs in the form of a simple estimate. Something like "I'll need to do X runs using Y cores per run, with a runtime of Z hours, i.e. totaling X*Y*Z = XXX SBUs" is sufficient (a short worked example follows this list). If you do different types of runs, please specify X/Y/Z/XXX per type of run, as well as the total number of SBUs.
  • Will your computations have a high memory requirement? If so, please provide an estimate.
  • If you are requesting project space: what are your needs in terms of storage? What is the typical input and output size of your computations? Do you need long-term storage? Please have a look at the Snellius configuration page to check the differences between the file systems on Snellius, and explain why you need the requested project space (permanent, non-backed-up space on high-volume, high-throughput file systems that support parallel I/O). Do you have access to local storage where you can copy your data back after the project expires?
  • Do you have specific software needs?
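
As an illustration of the kind of SBU estimate we expect, here is a minimal sketch of the X*Y*Z calculation from the first bullet; the run count, core count and runtime are placeholders, not recommendations:

    # Placeholder numbers purely for illustration; substitute your own.
    runs = 20            # X: number of runs
    cores_per_run = 32   # Y: cores per run
    hours_per_run = 12   # Z: wall-clock hours per run

    # On the CPU partitions one core-hour corresponds to one SBU,
    # so the requested budget is simply X * Y * Z.
    total_sbu = runs * cores_per_run * hours_per_run
    print(total_sbu)     # 20 * 32 * 12 = 7,680 SBU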

Example application 1: training a neural network to detect pathologies in X-ray images

To make the above a bit more concrete, we include here an example application for an (imaginary) machine learning project:

Project description:

In this project, we aim to detect pathologies in chest X-rays using neural networks. For this purpose, we explore two neural network architectures (a ResNeXt and EfficientNet architecture). We train on the CheXpert dataset, which contains 224,316 images. For each network, we will do hyperparameter optimization using a grid-search approach. We will explore two optimizers (ADAM and a traditional momentum optimizer), 5 learning rates (0.001, 0.002, 0.005, 0.01 and 0.02) and 3 batch sizes (8, 16, 32), for a total of 2*5*3 = 30 hyperparameter settings per network architecture.
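
As a sketch, the snippet below shows how these 30 settings per architecture are enumerated (the variable names are illustrative; the actual training code is not part of this application):

    from itertools import product

    optimizers = ["adam", "momentum"]
    learning_rates = [0.001, 0.002, 0.005, 0.01, 0.02]
    batch_sizes = [8, 16, 32]

    # Grid search: every combination of optimizer, learning rate and batch size.
    grid = list(product(optimizers, learning_rates, batch_sizes))
    print(len(grid))  # 2 * 5 * 3 = 30 hyperparameter settings per architecture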

Requirements:

Compute:

Each run will take an estimated 100 epochs to converge. We have run a small test on an NVIDIA GeForce 1080Ti GPU. For both ResNeXt and EfficientNet, it took 2 hours to run 1 epoch. We estimate that the A100 GPUs in Snellius are approximately two times faster. Thus, a single run would take an estimated 100 hours to complete on a single A100 GPU. Since a single GPU in Snellius costs 128 SBU/hour, training a single network with a single hyperparameter setting costs an estimated 100h*128 SBU/h = 12,800 SBU. With 30 hyperparameter settings for each of the two neural networks, we need to do 60 runs in total. To allow for trial and error, we request an additional 5%. Thus, the total requested amount of compute is 60 * 12,800 * 1.05 = 806,400 SBU (on the GPU partition). 
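
The same estimate written out as a small calculation, using only the numbers given above:

    epochs_per_run = 100
    hours_per_epoch_1080ti = 2.0
    a100_speedup = 2.0        # assumption: the A100 is roughly twice as fast as the 1080Ti
    sbu_per_gpu_hour = 128    # cost of a single Snellius GPU

    hours_per_run = epochs_per_run * hours_per_epoch_1080ti / a100_speedup   # 100 h per run
    sbu_per_run = hours_per_run * sbu_per_gpu_hour                           # 12,800 SBU per run

    runs = 2 * 30             # 2 architectures x 30 hyperparameter settings
    margin = 1.05             # 5% extra for trial and error

    total_sbu = runs * sbu_per_run * margin
    print(total_sbu)          # 806,400 SBU on the GPU partition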

Memory:

Test runs on a 1080Ti showed that a batch size of 8 fit in the 10 GB of GPU memory of the 1080Ti. We therefore expect no problems running with a four times larger batch size of 32 on the A100 in Snellius: the memory requirement will be at most four times larger, and will thus fit in the 40 GB of GPU memory of an A100. We have no special requirements for CPU memory.
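
A quick check of this scaling argument (an upper bound, since the model weights do not grow with the batch size):

    memory_batch8_gb = 10      # GPU memory used by a batch size of 8 on the 1080Ti
    scale = 32 / 8             # largest batch size is four times larger
    a100_memory_gb = 40

    upper_bound_gb = memory_batch8_gb * scale   # at most 40 GB
    print(upper_bound_gb <= a100_memory_gb)     # True: batch size 32 should fit on an A100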

Storage:

We need to store three items:

  • The CheXpert dataset (440 GB)
  • Intermediate files (checkpoints, logs) of the training runs (10 GB per run)
  • Final checkpoint & logs of each training (1 GB per run)

The CheXpert dataset, final model checkpoints, and logs of each training will need to be stored for the duration of the project. The intermediate files generated during the run are temporary, and can be removed after some initial analysis. We expect we don't need to store the intermediate files for more than 10 runs at any given time. Thus, we need a total of 440 GB + 10 * 10 GB + 60 GB = 600 GB of storage. We therefore request 1 TB of Snellius project space (the minimum project space size).
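
The storage total as a simple sum over the three items listed above:

    dataset_gb = 440               # CheXpert dataset
    intermediate_gb_per_run = 10   # checkpoints and logs kept during a run
    concurrent_runs = 10           # at most 10 runs kept around at any time
    final_gb_per_run = 1           # final checkpoint & logs
    total_runs = 60

    total_gb = dataset_gb + concurrent_runs * intermediate_gb_per_run + total_runs * final_gb_per_run
    print(total_gb)                # 600 GB, covered by the 1 TB minimum project space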

Software:

We aim to use the PyTorch installation from the module environment to perform this training.
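
Once a PyTorch module has been loaded (the exact module name and version depend on what is installed at the time of the project, so this is only a sketch), a short script like the following can confirm that PyTorch sees the A100:

    # Assumes a PyTorch module from the Snellius module environment has been loaded
    # (check `module avail` for the actual module name; no specific version is assumed here).
    import torch

    print(torch.__version__)
    print(torch.cuda.is_available())      # should be True on a GPU node
    print(torch.cuda.get_device_name(0))  # expected to report an NVIDIA A100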

Scope of small applications

Lisa

  • Maximum of 100,000 CPU core-hours (SBU)
  • Maximum of 50 TiB of offline tape storage
  • By default: 200 GiB of home directory storage, 4 hours of support, and a project duration of 1 year

Snellius

  • Maximum of 1,000,000 CPU/GPU core-hours (SBU)
  • Maximum of 50 TiB of project space
  • Maximum of 50 TiB of offline tape storage
  • By default: 200 GiB of home directory storage, 4 hours of support, and a project duration of 1 year


