Synopsis

This section is aimed at people developing their own software on our HPC systems. It begins by introducing the module framework and explains how to use the software installed on the HPC systems. It then gives a brief outline of the concepts of code optimization and parallelization. If you only use existing software, you may skip this section.

Modules

The module framework allows users to manage which compilers, languages, tools, and other software they use on the HPC system. Sometimes, as a developer, you want to use a specific version of a library or compiler when you compile your software. The module framework allows many versions of the same library/compiler to be installed on a system, with users being able to load specific versions depending on their requirements. Before loading individual modules, it is important to first load a module environment. Currently, these are 2021 and 2022, with 2022 containing the newest versions of most programs and libraries and 2021 containing older, legacy versions. To load the 2022 module environment, use:

module load 2022

To see all available modules, use:

module avail

To see all available versions of a given module (e.g. the fftw3 library), use:

module avail FFTW

Currently, this shows the following output for the 2022 environment:

FFTW.MPI/3.3.10-gompi-2022a    FFTW/3.3.10-GCC-11.3.0    imkl-FFTW/2022.1.0-iimpi-2022a

As you can see, the fftw3 library is available in different versions and has been compiled with different compilers (the GNU and Intel compilers). It is recommended to use the version built with the compiler that you will also use to build your own software.

Generally, the software manual pages also list the module command that can be used to load particular libraries/compilers.

Loading/unloading modules

To load the fftw3 module that was compiled with the GNU compilers, you can use the command:

module load FFTW/3.3.10-GCC-11.3.0

When loading a module, a number of environment variables are set, such as paths that are searched for executables, libraries, etc.

Some modules may depend on other modules, e.g. because one library depends on another. You need to make sure to load the modules for these dependencies as well. Some modules will explicitly tell you if they have dependencies, but unfortunately that is not always the case. Check the software manual pages to see whether they list any dependencies for your module.
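For example, to build against the FFTW version shown above with the matching GNU toolchain, you would typically load the compiler module together with the library and then check what is loaded. The module names below are from the 2022 environment; the exact names on your system may differ:

module load 2022
module load GCC/11.3.0
module load FFTW/3.3.10-GCC-11.3.0
module list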

Code optimization

You can find some general tips on optimizing your code here. However, there are many things you can (and should) do to speed up your application before you start implementing manual optimizations in your code.

Programming language

Not all programming languages are equally fast. As a rule of thumb, high level programming languages require less code than low level programming languages, but the resulting application is generally slower. However, not all high level programming languages are equally slow, and not all low level programming languages are equally fast. Additionally, good code in a high level programming language may outperform poor code in a low level programming language.

The last point also makes it very hard to make a fair comparison between programming languages: you can implement the same benchmark in two programming languages, but how far do you go in optimizing both implementations? Nonetheless, people have attempted to compare the speed of programming languages by comparing their performance on a range of benchmarks.

So, should you choose a programming language based on pure speed and/or code size? The best answer is probably: it depends. Is performance super important for you, but is your code currently written in a high level language? Then rewriting it in a lower level language may provide you with a substantial speedup, but will take you a substantial amount of time. It depends on the size of the computational problem and the expected speedup whether that time investment is worth it.

Overall, we recommend considering the following points when choosing a programming language (in no particular order):

  • Intrinsic speed of the language
  • Time needed to implement the program in that language / size of the code (these are generally correlated)
  • Experience of the programmer in that language
  • Size of the computational problem
  • Is there a current implementation, i.e. do you already have the code available in a certain language, or do you start from scratch?

Note that high level languages are not necessarily slow for every program. Many of the native functions in high level languages like MATLAB - e.g. matrix multiplication, mtimes(), or inversion, inv() - are actually written in a low level language. Therefore, if your MATLAB code spends most of its time in native MATLAB functions, you may not be able to speed it up much by porting the code to e.g. C/C++/Fortran.

Existing libraries

Highly optimized libraries are available for a large number of mathematical routines. Libraries like BLAS, cuBLAS, LAPACK, MKL, and FFTW provide functions for things like matrix algebra, Fourier transforms, random number generation, etc. Even in a low level programming language like C/C++/Fortran, you will probably not be able to beat the performance of such optimized library functions, as these functions are optimized using assembly code and may contain different optimizations depending on the specific architecture on which the libraries are compiled.

This project illustrates the extensive effort needed to turn optimized C code for matrix multiplication into optimized assembly code. It also shows that such library functions are roughly a factor of 3 faster than the optimized C code, although this may depend on the performance of the compiler as well (better C compilers will be able to get closer to the optimized performance).

In summary, always try to use optimized libraries to perform standard mathematical routines: it takes relatively little effort to replace your own functions in the code with these library functions (or better yet: use library functions from the start), and doing so may provide a substantial speedup.
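As a minimal sketch, the program below multiplies two small matrices through the CBLAS interface instead of hand-written loops. The exact header name and link flags depend on the BLAS implementation you load (e.g. OpenBLAS or Intel MKL); the example assumes an OpenBLAS-style cblas.h.

/* Sketch: matrix multiplication via CBLAS.
   With OpenBLAS, compile for example with: gcc dgemm_example.c -lopenblas */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* 2x2 matrices stored in row-major order */
    double A[4] = {1.0, 2.0, 3.0, 4.0};
    double B[4] = {5.0, 6.0, 7.0, 8.0};
    double C[4] = {0.0, 0.0, 0.0, 0.0};

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}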

Compiler optimization

Compilers can do many things to produce optimized code, but in general, you have to tell them to do it. The most well-known (and essential) step is to set the optimization flags (-O1, -O2, or -O3, depending on the level of optimization that is required). These enable things like loop optimizations, such as loop unrolling, and will provide substantial performance benefits in almost all codes.

Such optimizations require the compiler to analyze and interpret your code, and not all compilers are equally good at this. It can be worth compiling your program with a different compiler to see if there is a performance benefit.

In addition to the regular compiler optimization, the compiler can do architecture-specific optimization. For example, most modern architectures support vectorization through the Advanced Vector Extensions (AVX) instruction set. However, by default, the compiler will not build code that uses such vectorization, as it will not run on processors that don't support it.

For the GNU compilers, architecture-specific optimization is performed if you specify the -march=cpu-type flag, where cpu-type is e.g. ivybridge, haswell, broadwell, skylake, knl, etc (see the GNU online documentation). Alternatively, you can specify -march=native to have the GNU compiler optimize for the CPU type of the host machine (i.e. the machine compiling the code).
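For example, to compile with a high optimization level and optimize for the CPU of the machine you are compiling on, a command along the following lines can be used (my_program.c is just a placeholder for your own source file):

gcc -O3 -march=native -o my_program my_program.c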

For Intel compilers, architecture-specific optimization can be performed using the -x<code> or -ax<code> flags, where <code> specifies the instruction set(s) that you want the optimization to target (see the Intel documentation). The resulting code may not run on CPUs that do not support all of the required instruction sets. Alternatively, you can specify the -xHOST flag to have the Intel compiler optimize the code for the CPU type of the host machine (i.e. the machine compiling the code). Intel also provides a good overview of the above optimization steps.
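An equivalent example for the Intel compiler, optimizing for the host CPU (again with my_program.c as a placeholder), could look like:

icc -O3 -xHOST -o my_program my_program.c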

Software policy Snellius

Intel HPC toolkit

AMD processors used in the Rome and Genoa partitions do not fulfil the official hardware requirements of the Intel® HPC Toolkit, which is therefore installed for local builds by experienced users only. In general, the "foss" toolchain (BLAS + ScaLAPACK + OpenMPI + GNU compilers) is recommended for building software on Snellius. For the Intel® HPC Toolkit, only limited support can be provided.


Additionally, the compiler can do link-time optimization (LTO) or interprocedural optimization (IPO, Intel's version of LTO). Finally, the compiler can do profile-guided optimization (PGO). Instructions for the Intel compiler can be found here. For GNU compilers, you can use the -fprofile-generate and -fprofile-use=path compiler options.
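A rough sketch of a profile-guided optimization workflow with the GNU compilers could look like this (my_program.c is a placeholder for your own source file):

gcc -O3 -fprofile-generate -o my_program my_program.c
./my_program        # run a representative workload to generate profile data
gcc -O3 -fprofile-use -o my_program my_program.c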

Profiling

Profiling is a form of program analysis in which you determine, for example, which parts of your program consume the most computation time. This helps you identify which parts of your code to focus on during optimization.

On Lisa, you can use the gperftools, gprof, valgrind (using the callgrind tool) or Intel VTune Amplifier profilers. Note that you generally need debugging symbols enabled (the -g compilation flag) for these tools. Additionally, gprof needs the -pg compilation flag, while gperftools requires your application to be linked against its profiler library (-lprofiler). Information on how to use these tools is widely available on the web, so we will not cover that here.
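As an illustration, a basic gprof workflow could look like this (my_program.c is again a placeholder):

gcc -g -pg -o my_program my_program.c
./my_program                        # running the program writes gmon.out
gprof my_program gmon.out > analysis.txt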

Parallelization

Functional vs data parallel

Two important classes of parallelism have emerged in the history of parallel programming: data parallelism and functional parallelism.

In functional parallelism (also known as task parallelism), you have different tasks that you want to perform, possibly on the same data. For example, you may want to apply two different filters to a single image. Then you could assign one node to do one task (e.g. apply filter A to the original image) and one node to do another task (e.g. apply filter B to the original image). Another form of functional parallelism is pipelining. For example, you want to apply filter A to the original image and subsequently apply filter B to the filtered image. You cannot do both at the same time, because filter B has to be applied after filter A. But when you have a set of images, you can apply filter B to the first (already filtered) image and simultaneously apply filter A to the second image. Such a procedure is similar to how processors perform instruction pipelining.

In data parallelism, you perform the same task on different elements of a dataset. For example, you want to apply filter A to each image in a dataset. Because these operations are independent, you can easily distribute the images over different processors and apply filter A to multiple images at the same time. Another example is the element-wise multiplication of two vectors of length N: you can easily split these vectors into (for example) 4 equal parts and perform the multiplications of each 'sub-vector' on different processor cores.

Shared vs distributed memory

On the hardware side, there are two basic systems to consider: shared memory systems and distributed memory systems.

On a shared memory system, all processors and processor cores that are part of the system have access to the same memory. A single node in Lisa is a shared memory system: it contains multiple physical processors, each processor containing multiple cores, but all processors and cores can access the same shared memory. Your own computer is probably also a shared memory system: you likely have a single CPU with multiple cores, each of which can access the same shared memory.

On a distributed memory system, processors do not all have access to the same memory. An example is a cluster computer like Lisa as a whole: the CPU in one node cannot directly access the memory of another node. That is not an issue if you run completely independent processes on the different nodes (e.g. one node applies filter A to images 1-100 in your dataset, while a second node applies the same filter to images 101-200). However, if your processes are not independent (e.g. you want to train a single regression model on a large dataset and want to distribute parts of this dataset over multiple nodes), some explicit communication between the nodes is needed.

OpenMP

OpenMP is a standard that supports parallel programming on shared memory systems in C/C++ and Fortran. OpenMP is defined as an extension to the C/C++ and Fortran languages, and thus needs compiler support in order to use it. The GNU and Intel compilers have supported OpenMP for a long time, but the version of the OpenMP standard that is supported depends on the version of the GNU/Intel compilers (newer compilers support newer OpenMP versions). To compile code with OpenMP support, you need to pass the appropriate flag to the compiler (-qopenmp for Intel compilers, -fopenmp for GNU compilers). E.g.

icc -qopenmp my_openmp_code.c

for Intel or

gcc -fopenmp my_openmp_code.c

for GNU.

OpenMP is most suitable for implementing data parallel paradigms.
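As a minimal sketch of the data parallel approach with OpenMP, the loop below is distributed over the available threads (compile with the OpenMP flag shown above, e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* Distribute the loop iterations over the available threads;
       the reduction clause safely combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (double)i;
    }

    printf("sum = %.0f (computed with up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}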

MPI

MPI (Message Passing Interface) is a standard that is designed for parallel programming on distributed memory systems, though it is also suitable for parallelization on shared memory systems. Several implementations of MPI exist, including:

  • MPICH
  • OpenMPI
  • Vendor-specific MPI implementations, such as Intel MPI.

MPI is a well-defined standard with which all implementations should comply. The fact that there are several competing implementations has led to very efficient implementations. MPI is the most common tool for parallelization of high performance applications on distributed and shared memory systems. It is suitable for both data parallel and functional parallel paradigms.
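As a minimal sketch, the program below prints a line from every MPI rank. It can typically be compiled with the mpicc wrapper provided by the MPI module and launched with mpirun or srun, although the exact commands depend on the MPI implementation and the batch system:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}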

We also have a short introduction to MPI. A more extensive MPI tutorial can be found here.

Hybrid OpenMP/MPI

You can in principle combine the OpenMP and MPI frameworks. This allows you to use MPI to communicate between nodes (or even between different CPUs) and use OpenMP to parallelize within a node (or CPU). Such a programming model may be slightly more efficient than using pure MPI (also for parallelization within a node/CPU). However, it is more complex to program than a pure MPI or pure OpenMP program. Generally, hybrid OpenMP/MPI programming is only used in applications that need to scale to a very large number of nodes and/or in applications for which even a few percent performance gain is worth the substantially more complex code.
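A minimal hybrid sketch, assuming an MPI implementation with thread support and an OpenMP-capable compiler wrapper (e.g. compiled with mpicc -fopenmp), could look like this:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* Request thread support from the MPI runtime */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each MPI rank spawns its own team of OpenMP threads */
    #pragma omp parallel
    {
        printf("Rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}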
