GPU Acceleration of VASP using Volta nodes on NUS HPC

Xie Weihang, Canepa Research Group https://caneparesearch.org, Department of Materials Science and Engineering, College of Design and Engineering

Introduction

The Vienna Ab initio Simulation Package (VASP) is a well-known and widely used software package for performing DFT calculations. VASP requires a significant amount of computing resources and ranks among the top 5 applications on NSCC (National Supercomputing Centre) in terms of CPU hours consumed. Starting from version 6.2.0, VASP officially supports GPU acceleration through an OpenACC port. Since the FP64 performance of the V100 GPUs (7.8 TFLOPS) on HPC@NUS is high compared to CPUs (3.7 TFLOPS for the currently fastest 64-core EPYC 7763), the GPU is expected to be a good accelerator.

Container

Enabling OpenACC acceleration in VASP requires the NVHPC compiler suite (previously PGI), which contains a full stack of compilers and libraries (C, C++, Fortran, MPI, etc.). Both NVHPC and VASP are actively updated and extremely picky about the environment. Therefore, using Singularity containers appears to be a more straightforward way to deal with dependencies and version control. You can start creating and using containers in NUS IT’s HPC environment by following the information found here: https://nusit.nus.edu.sg/technus/running-hpc-ai-applications-in-containers/. You can also find officially built NVHPC containers from Nvidia here: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc. NVHPC contains almost every library required by VASP except a Fast Fourier Transform library, which necessitates a separate installation of either FFTW or Intel MKL. FFTW can be built as follows:

# FFTW
tar -xzvf fftw-3.3.10.tar.gz
cd fftw-3.3.10
CC=mpicc CXX=mpicxx FC=mpifort ./configure --enable-openmp --enable-mpi --prefix=/opt/fftw
make -j32
make install
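
As a sketch of how this fits together, a minimal Singularity definition file along these lines could bootstrap from Nvidia's NVHPC image on NGC and run the FFTW build above in its %post section. The image tag, file paths, and the /opt/vasp install location are illustrative assumptions and should be adapted to your own setup:

Bootstrap: docker
From: nvcr.io/nvidia/nvhpc:22.3-devel-cuda11.6-ubuntu20.04   # illustrative tag; check the available tags on NGC

%files
    # copy the FFTW tarball from the host into the image
    fftw-3.3.10.tar.gz /opt/fftw-3.3.10.tar.gz

%post
    # build FFTW inside the container, mirroring the commands above
    cd /opt
    tar -xzvf fftw-3.3.10.tar.gz
    cd fftw-3.3.10
    CC=mpicc CXX=mpicxx FC=mpifort ./configure --enable-openmp --enable-mpi --prefix=/opt/fftw
    make -j32 && make install
    # the VASP build itself would follow here (see the build sketch below)

%environment
    # assuming the VASP binaries end up in /opt/vasp/bin
    export PATH=/opt/vasp/bin:$PATH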

My attempts started with NVHPC 20.9 and ended with NVHPC 22.3. Indeed, the “best match” between a specific VASP version and an NVHPC version, i.e. the combination that gives the best performance, is not known in advance. NVHPC 20.9 is the last version that supports CPU thread management via Intel OpenMP (which is much easier to handle than the OpenMP runtime in newer versions), while the latest NVHPC 22.3 ships CUDA 11.6 (which is faster than the CUDA 11.0 bundled with NVHPC 20.9). Another crucial metric to consider is the numerical accuracy of a compiled VASP binary in various environments. Old compiler suites may be slower but are typically well tested and validated, whereas new compiler suites may contain unknown bugs. For example, enabling any OpenMP threading with NVHPC 22.2 causes VASP to hang immediately after initialization.
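For completeness, the VASP compilation itself inside this NVHPC environment roughly follows the standard VASP 6.x procedure. A minimal sketch is given below, assuming the OpenACC makefile template shipped with recent 6.x releases; the template name and paths differ between VASP versions and should be checked against your source tree:

# inside the VASP source tree, with the NVHPC environment loaded (illustrative)
cp arch/makefile.include.nvhpc_acc makefile.include   # OpenACC template shipped with recent VASP 6.x
# point the FFTW (or MKL) paths in makefile.include to the container install, e.g. /opt/fftw
make DEPS=1 -j8 std                                   # builds vasp_std; gam/ncl can be added as further targets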

After successfully building the container, the submit script looks like the following (you need to modify some variables for your own setup):

#!/bin/bash

#PBS -P volta_pilot
#PBS -j oe
#PBS -N volta_gputest
#PBS -q volta_gpu
#PBS -l select=1:ncpus=5:mem=35gb:ngpus=1
#PBS -l walltime=72:00:00

export CONTAINER_PATH=YOUR_CONTAINER_PATH

singularity exec ${CONTAINER_PATH} mpirun -n 1 vasp_std
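
Note that, depending on how Singularity is configured on the cluster, the Nvidia driver bind may need to be requested explicitly. If the GPU is not visible inside the container, adding the --nv flag is worth trying:

singularity exec --nv ${CONTAINER_PATH} mpirun -n 1 vasp_std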

Since the V100 GPU has only 32 GB of video memory, VASP can generally occupy at most 32 GB of memory per GPU. Thus, OpenACC VASP cannot handle systems as large as pure-CPU VASP can.
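To check how close a given system comes to this limit, one can watch the GPU memory usage on the compute node while the job runs, for example:

# poll GPU memory usage every 5 seconds (run on the GPU node, outside the container)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5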

Tuning the number of CPU cores appears to be another difficult task. OpenACC VASP leverages CPU threading (which is actually OpenMP threading onto otherwise idle cores) to accelerate the calculation; this is different from MPI parallelization. Such threading is controlled through mpirun arguments. For example, in the case of NVHPC 20.9 mixed with Intel MKL, it is enabled by the “--map-by” keyword:

export OMP_NUM_THREADS=4

singularity exec ${CONTAINER_PATH} mpirun -n 1 --map-by socket:PE=$OMP_NUM_THREADS vasp_std
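
A multi-GPU run follows the same pattern, with one MPI rank per GPU. As a sketch, the “5omp@2mpi” configuration referred to in the Performance section below would look roughly like this (the ncpus, mem, and ngpus values in the PBS select line must be raised to match):

export OMP_NUM_THREADS=5
singularity exec ${CONTAINER_PATH} mpirun -n 2 --map-by socket:PE=$OMP_NUM_THREADS vasp_std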

One good feature of HPC@NUS is that an interactive node is provided so that you can test containers interactively. By executing `singularity exec ${CONTAINER_PATH} bash` on the Volta interactive node, one can get a shell inside the container and run some small tests to see whether VASP was compiled properly.
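
For example, a couple of quick sanity checks before submitting a full job might look like the following (nvaccelinfo is the NVHPC utility that lists the OpenACC-visible devices):

singularity exec ${CONTAINER_PATH} which vasp_std   # confirm the binary is on the PATH inside the container
singularity exec ${CONTAINER_PATH} nvaccelinfo      # should list the V100 if GPU passthrough works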

Performance

After some careful optimization of the parameters, together with help from HPC@NUS staff, a reduction in running time of 20% to 50% was observed with a single Volta GPU on the same VASP input set, compared to 2 CPU nodes on NSCC Aspire1 (i.e., 48 cores of Intel Xeon E5-2690v3). On the other hand, the parallelization across multiple GPUs depends largely on the input type, showing parallel efficiencies from 55% to 95%. Such results are encouraging, since the OpenACC port can be further optimized. The following figure shows the comparison of total elapsed time based on the same input set. “5omp@2mpi” means that there are 5 OpenMP threads per MPI rank and a total of 2 MPI ranks.

[Figure: comparison of total elapsed time for the same input set across CPU and GPU configurations]

The output shows no difference from pure-CPU VASP in terms of accuracy.

Acknowledgement

I would like to thank Dr Miguel Dias Costa for his guidance on tuning the parallelization parameters and for the invitation to write this summary. I would also like to express my gratitude to Dr Wang Junhong for guidance on resource allocation and benchmarking. HPC@NUS is an easy-to-use platform with comprehensive toolchains and regular updates, and I have always received prompt help from their friendly staff. I would also like to thank my colleague, Tim, for reviewing this summary. We thank NRF for sponsoring my PhD fellowship and the Canepa Research Group.