HOW TO BE MORE PRODUCTIVE IN HPC
When you run research simulations on our HPC system, using as many CPU cores as you can get for a single job is not necessarily the most productive approach. This article looks at how to strike a balance between efficiency and speedup.
2014 Performance Indicators
In 2014, the central HPC resources at Computer Centre handled a total of 577,990 simulations. Although only 22% of these simulations ran in the parallel queues, they consumed 80% of the total CPU hours. This is encouraging, as the majority of the resources were used for parallel processing.
How about speedup, the acceleration of research computation? We allow each user up to 48 cores for parallel processing, and the average speedup recorded last year was 6.32X. Is this good enough?
Efficiency vs Speedup
Parallel applications are written to use multiple CPU or accelerator cores to achieve speedup, but not all applications scale well beyond a certain number of cores. Past that point, efficiency drops and speedup is gained at a slower pace. Figure 1 shows a typical scalability trend for an application that does not scale well.
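For reference, if T(1) is the runtime on one core and T(p) the runtime on p cores, speedup and parallel efficiency are conventionally defined as:

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
```

A perfectly scalable application keeps E(p) at 100% as cores are added; the application in Figure 1 falls well below that.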
For a large supercomputer dedicated to applications such as weather prediction, where shortening the prediction time is extremely important, operating in the low-efficiency range to gain higher speedup can be justified. For our shared HPC system, however, whose limited resources serve many researchers, efficiency is the priority. Assuming your application scales as shown in Figure 1, here are the benefits you may gain by using 8 CPU cores (~4.5X speedup) instead of the maximum allowed 48 cores (~6.5X speedup):
- Queuing time for CPU resources will be shorter, since freeing up a larger number of CPU cores takes longer.
- For licensed software, queuing time for licenses will be shorter, especially for software with a limited number of licenses.
- Faster overall turnaround time if you have many jobs to run. In this example, a single 8-thread job completes in 22 hours and a single 48-thread job in 15.5 hours. If you run six 8-thread jobs concurrently instead of six 48-thread jobs sequentially, you can potentially save up to 71 hours (see the worked arithmetic after this list).
- Better performance by running all threads within a single node (e.g. 12 threads on a server node with 12 CPU cores), which avoids network overhead.
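To make the trade-off concrete, the arithmetic behind the figures above works out as follows. Efficiency is speedup divided by core count, and the turnaround numbers are consistent with a serial runtime of roughly 100 hours (an inference from the quoted completion times, not a measured value):

```latex
\begin{aligned}
E(8)  &= 4.5 / 8 \approx 56\%, \qquad E(48) = 6.5 / 48 \approx 14\%,\\
\text{six 8-thread jobs, run concurrently (48 cores)} &\approx 22\ \text{h},\\
\text{six 48-thread jobs, run sequentially} &= 6 \times 15.5 = 93\ \text{h},\\
\text{time saved} &= 93 - 22 = 71\ \text{h}.
\end{aligned}
```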
However, in some cases efficiency and speedup are not the deciding factors. For example, some applications are simply too large to fit into a single server's memory and therefore have to be split across multiple server nodes, typically with MPI (a minimal sketch follows).
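Here is a minimal, illustrative MPI sketch of that idea: each rank allocates and processes only its share of a large array, so no single node has to hold all of it. The array size and the computation are placeholders, not taken from any particular application:

```c
/* Distribute a large array across MPI ranks so each node holds only
 * a fraction of it. Compile with mpicc, run with mpirun.          */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* A problem of N elements that would not fit on one node is
     * split so each rank allocates only N/nprocs elements.
     * (Remainder handling is omitted for brevity.)               */
    const long N = 100000000L;               /* illustrative size */
    long local_n = N / nprocs;
    double *local = malloc(local_n * sizeof(double));

    /* Each rank works on its own slice only. */
    double local_sum = 0.0;
    for (long i = 0; i < local_n; i++) {
        local[i] = (double)(rank * local_n + i);  /* placeholder  */
        local_sum += local[i];
    }

    /* Combine the partial results across all nodes. */
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %e\n", total);

    free(local);
    MPI_Finalize();
    return 0;
}
```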
Accelerating Research Computation
Here are some other options to consider if you are developing your own simulation codes:
- Parallelize it with OpenMP or MPI if possible (a minimal OpenMP sketch follows this list).
- Compile it with a higher-performance compiler such as the licensed Intel compiler (see the related article in this issue, “Achieving High Performance Through Benchmarking and Compilation Optimization”).
- Use optimized numerical libraries, such as the Intel Math Kernel Library that comes with the Intel compiler.
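As a starting point, here is a minimal OpenMP sketch that parallelizes a loop across the cores of a single node (MPI would be the choice for spanning multiple nodes). The loop body is a placeholder for real simulation work:

```c
/* Parallelize a loop with OpenMP across the cores of one node.
 * Compile with e.g. "gcc -fopenmp -O2" or "icc -qopenmp -O2".   */
#include <omp.h>
#include <stdio.h>

#define N 10000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Iterations are divided among threads; each thread keeps a
     * private partial sum, combined at the end by the reduction. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;    /* placeholder computation */
        sum += a[i];
    }

    printf("sum = %e, max threads available = %d\n",
           sum, omp_get_max_threads());
    return 0;
}
```

When you submit such a job, keep the thread count consistent with your allocation; for the 8-core recommendation above, request 8 cores from the scheduler and set OMP_NUM_THREADS=8 before running.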
If you are using off-the-shelf software packages and software licensing is a constraint, you can consider an open-source alternative. Here are some of the alternatives available on our system:
- Fluent (limited licenses) vs OpenFOAM (open-source)
- MATLAB (limited licenses) vs Octave (open-source)
- SAS (not available on our system) vs R (open-source)
Please contact us at ccehpc@nus.edu.sg if you would like to know more about the above options and alternatives.