Get Better Performance for Your Machine Learning
Wang Junhong, Research Computing, NUS Information Technology
Abstract
The MLPerf benchmark was run on one of the central shared 4-GPU servers with different configurations of the number of GPUs and the input batch size. The computing performance under these configurations is compared. The GPU server demonstrates good performance when the MLPerf case runs on all 4 GPUs. Users are recommended to run a quick benchmark to find the GPU configuration and batch size that achieve optimal performance.
Background
Motivated in part by the Standard Performance Evaluation Corporation (SPEC) benchmarks for general-purpose (primarily CPU) computing, the MLPerf benchmark was developed to provide fair, useful and unbiased evaluations of training and inference performance for hardware, software, and services, all conducted under prescribed conditions. MLPerf is a benchmark suite widely accepted and adopted by the Artificial Intelligence (AI) community.
The MLPerf benchmark was carried out on a 4-GPU server with different configurations of 1) the number of GPUs and 2) the input batch size. The relationship between performance and these configurations is compared and discussed below.
Benchmark Environment and Configuration
The benchmark was carried out on one of the central shared 4-GPU servers, which are available to all NUS researchers. The server is equipped with 4 Nvidia Tesla V100-32GB cards and fast SSD-based storage.
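As a quick sanity check before benchmarking, the snippet below lists the GPUs visible to a job and their memory. It is a minimal sketch that assumes a PyTorch installation with CUDA support, which this article does not prescribe.

```python
# Minimal check of the GPUs visible to a job; assumes PyTorch with CUDA support.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:  ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # On this server each card should report as a Tesla V100 with about 32 GB of memory.
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
```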
MLPerf Benchmark Suite
The MLPerf Benchmark suite includes:
- Training set: 1000 categories and 1.2 million images
- Validation & test sets: 1000 categories and 150,000 images
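For readers preparing a similar image classification run, the sketch below shows one common way to load an ImageNet-style dataset with torchvision. The directory path and transform settings are illustrative assumptions, not the exact MLPerf reference pipeline.

```python
# Loading an ImageNet-style dataset (one sub-folder per category) with torchvision.
# The path "/data/imagenet/train" is a placeholder for illustration only.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),  # standard input size for ResNet-style models
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("/data/imagenet/train", transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256,    # per-step batch size; see the configuration table below
    shuffle=True,
    num_workers=8,     # parallel workers help keep the SSD-backed IO pipeline busy
    pin_memory=True,   # faster host-to-GPU copies
)
```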
Benchmark Configuration
On the 4-GPU server, the benchmark tests were carried out on 1 GPU, 2 GPUs and 4 GPUs for comparison.
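As an illustration of how the same model can be spread over 1, 2 or 4 cards, the sketch below uses PyTorch's DataParallel with a ResNet-50, a typical MLPerf image classification model. It is a simplified stand-in for comparing GPU configurations, not the MLPerf harness itself.

```python
# A simplified multi-GPU sketch: DataParallel splits each batch across the listed GPUs.
import torch
import torchvision

model = torchvision.models.resnet50()           # ResNet-50, a typical MLPerf image model
device_ids = [0, 1, 2, 3]                       # use [0] or [0, 1] to compare 1-GPU and 2-GPU runs
model = torch.nn.DataParallel(model.cuda(), device_ids=device_ids)

images = torch.randn(256, 3, 224, 224).cuda()   # synthetic batch, split across the GPUs above
logits = model(images)                          # forward pass runs on all GPUs in device_ids
print(logits.shape)                             # torch.Size([256, 1000])
```

For longer training runs, DistributedDataParallel generally scales better than DataParallel, but the simpler wrapper is sufficient for a quick comparison across GPU counts.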
The input batch size should be chosen to fully utilize the underlying computational resources, such as GPU memory and disk IO bandwidth, in order to achieve better performance. Given the GPU memory available on this server, the benchmark tests were carried out with the following batch sizes (a short timing sketch follows the table):
Number of GPUs in Use | Batch Sizes
1 GPU | 64, 128, 256
2 GPUs | 64, 128, 256, 512
4 GPUs | 64, 128, 256, 512, 1024
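A quick way to see the effect of batch size before committing to a long run is a short throughput sweep like the sketch below. It reuses the assumed PyTorch/ResNet-50 setup from the earlier sketch and times only the forward and backward passes, not a full MLPerf run; the larger batch sizes may exceed GPU memory when fewer GPUs are visible.

```python
# Rough throughput sweep over batch sizes on all visible GPUs (forward + backward only).
import time
import torch
import torchvision

model = torch.nn.DataParallel(torchvision.models.resnet50().cuda())
criterion = torch.nn.CrossEntropyLoss()

for batch_size in (64, 128, 256, 512, 1024):
    images = torch.randn(batch_size, 3, 224, 224).cuda()
    labels = torch.randint(0, 1000, (batch_size,)).cuda()

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):                          # a few iterations to average out noise
        model.zero_grad(set_to_none=True)
        loss = criterion(model(images), labels)
        loss.backward()
    torch.cuda.synchronize()
    elapsed = time.time() - start

    print(f"batch {batch_size}: {10 * batch_size / elapsed:.0f} images/s")
```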
Benchmark Results
The computing time taken for each of the benchmark cases is plotted in the graph below. From the results, it is observed that:
- The computing time is shortened significantly when running the benchmark on 2 GPUs and 4 GPUs, which are approximately 1.7X and 3.3X faster respectively.
- In general, the computing time is shortened by about 7% when the batch size is doubled, across all three GPU configurations.
- The performance improvement is minimal, or performance even degrades, when the batch size per GPU is increased to 256.
Conclusion and Recommendation
In this MLPerf benchmark test, the 4-GPU server demonstrates good performance when using multiple GPUs. The input batch size affects performance only marginally, with optimal performance reached around a batch size of 512, which may fully utilize the GPU resources.
The dataset used in this MLPerf benchmark is quite sizable. To get the best performance when running your own machine learning (ML) and deep learning (DL) workloads on the GPU server, you are recommended to do a quick benchmark comparing performance across different GPU configurations, batch sizes and other settings. If resources permit, consider running your ML and DL workloads with more GPUs and a sizable batch size to achieve optimal performance.
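One simple way to repeat the same script on different GPU counts, as suggested above, is to restrict the GPUs visible to the process. The sketch below assumes GPU indices 0 and 1 are free; the variable must be set before any CUDA call is made.

```python
# Restrict the benchmark to a subset of GPUs; set this before CUDA is initialized.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"      # e.g. "0" for 1 GPU, "0,1,2,3" for all 4 cards

import torch
print("GPUs visible to this run:", torch.cuda.device_count())   # prints 2 with the setting above
```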