SOLVING HIGH-DIMENSIONAL STOCHASTIC INVENTORY PROBLEMS WITH OPENMP ON HPC – PHASE II
How has high performance computing helped me address the review comments on my research paper?
Last year I wrote an article for HPC@NUS on how I utilized HPC resources to solve multi-dimensional dynamic programs for one of my research projects on stochastic inventory models. With the help of OpenMP, I was able to parallelize my C++ code and run it across multiple CPU cores. Together with other code optimizations (e.g. storing intermediate results in memory to avoid redundant calculations), the eventual speedup was about two orders of magnitude. Of the 510 scenarios I ran, most finished within several hours, although the worst cases took around 25 days.
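For readers curious what this looks like in code, here is a minimal, self-contained sketch of the pattern (the names and the dummy workload are illustrative, not my actual model): the value of each state in a dynamic-programming stage can be computed independently of the others, so a single OpenMP pragma spreads the state loop over all available cores, and the results stay in memory for reuse at the next stage.

    // Minimal OpenMP sketch; names and workload are illustrative, not my actual model.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Stand-in for an expensive per-state evaluation in one DP stage.
    double expensive_value(int state) {
        double v = 0.0;
        for (int k = 1; k <= 1000000; ++k)
            v += std::sin(state * 0.001 + k * 1e-6);
        return v;
    }

    int main() {
        const int n_states = 1000;
        std::vector<double> value(n_states);   // kept in memory for reuse downstream

        // The only change needed to go parallel: one pragma on the state loop.
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n_states; ++i)
            value[i] = expensive_value(i);

        std::printf("value[0] = %.6f\n", value[0]);
        return 0;
    }

Compiled with the OpenMP flag, this uses every core of the node; without the flag, the same source simply runs serially.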
After about two months of computing on the HPC clusters, we happily finished the project and submitted the resulting research paper to a journal. As any academic would expect, we received tons of review comments on how we could potentially improve the paper along various dimensions. The most devastating comments were not the theoretical ones but requests like “study a larger-scale problem for a robustness check” and “consider several more variations of the existing model and compare them numerically”. To follow the reviewers’ suggestions, we constructed a new set of problems with the same structure but more than 10 times the complexity. Since I could not afford years of computing, I sought the help of the HPC specialists, just like before.
Now, five months have passed since I started the second round of computing, and I am about to finish up, having consumed 300,000 CPU hours on the HPC clusters in the last month alone. Here is a quick summary of how we managed to complete this humongous task.
1. Use the Right Compiler
I code in C++ and had been using GCC as my compiler – a no-brainer choice – until I happened to read about Intel ICC one day. I did a quick benchmark on my Intel Xeon-based Linux workstation, which largely mimics the HPC environment. To my great surprise, ICC turned out to be at least 2 to 3 times faster than GCC, consistently across all the scenarios I tested. (I have no idea why this is the case, and honestly no interest in finding out.) I am very grateful to Srikanth Gumma and the other HPC specialists who spent numerous hours setting up and fine-tuning a working environment on the HPC clusters with GCC, ICC, and the Boost library; all three are required by my updated code, and the trio is somehow inter-dependent.
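Switching compilers is literally a one-line change in the build command. The commands below are typical examples rather than my exact build line (older versions of ICC spell the OpenMP flag -openmp instead of -qopenmp, and solver.cpp is a placeholder name):

    # GCC
    g++ -O3 -fopenmp -o solver solver.cpp
    # Intel ICC (use -openmp on older versions)
    icpc -O3 -qopenmp -o solver solver.cpp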
2. Use the /hpctmp High Performance File System
Another challenge that arose from the previous round of computing was checkpointing. To hedge against the risk of HPC unavailability (e.g. unexpected power disruptions, scheduled maintenance, etc.), I need to frequently “checkpoint” my programs by saving the intermediate computational results from memory into files on disk. A typical scenario can now easily eat up as much as 48GB of memory, and saving that to my local storage is impossible, not only because of the limited disk capacity but also because of the time it takes to write to the disks (the code stops computing while it writes the checkpoint files). /hpctmp is a perfect solution for me, as it is huge in size, quite reliable and, more importantly, incredibly fast for writes. Since I started using /hpctmp, I have never worried about checkpointing again.
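The checkpointing logic itself is simple; what matters is where the files go. Below is a minimal sketch of the pattern I mean (the path and file layout are hypothetical, not my actual code): dump the in-memory results to a binary file on /hpctmp every so often, and on restart, resume from that file if it exists.

    // Minimal checkpointing sketch; path and file layout are hypothetical.
    #include <cstdio>
    #include <vector>

    const char* CKPT = "/hpctmp/username/results.ckpt";  // fast scratch file system

    // Dump the whole result vector in one binary write.
    void save_checkpoint(const std::vector<double>& v) {
        if (FILE* f = std::fopen(CKPT, "wb")) {
            std::size_t n = v.size();
            std::fwrite(&n, sizeof n, 1, f);
            std::fwrite(v.data(), sizeof(double), n, f);
            std::fclose(f);
        }
    }

    // Returns true and fills v if a previous checkpoint exists.
    bool load_checkpoint(std::vector<double>& v) {
        FILE* f = std::fopen(CKPT, "rb");
        if (!f) return false;
        std::size_t n = 0;
        if (std::fread(&n, sizeof n, 1, f) == 1) {
            v.resize(n);
            std::fread(v.data(), sizeof(double), n, f);
        }
        std::fclose(f);
        return true;
    }

    int main() {
        std::vector<double> value;
        if (!load_checkpoint(value))
            value.assign(1000, 0.0);          // no checkpoint found: fresh start
        // ... main computation, calling save_checkpoint(value) periodically ...
        save_checkpoint(value);
        return 0;
    }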
3. Run the Big Jobs in a Distributed Way
My code is written for Shared Memory Parallelism (SMP) using OpenMP. Parallelization was relatively easy to implement, but it does not scale beyond the 12 cores of a single compute node, and I am sure that some of my big jobs would not finish within a year on only 12 CPU cores. For those big jobs, what I have implemented is a stone-age manual distributed computing system: in plainer language, I cut one big job into many small sub-jobs, submit them to multiple HPC queues, and run them simultaneously (see the sketch below). The big jobs themselves are not intrinsically separable, so there is some redundant computation, and the results of the sub-jobs have to be merged manually. I know this is not the best way to do it, but given the available HPC resources (more specifically, the “short” LSF queue, which can run 20 jobs simultaneously), I managed to achieve a more than 5-fold speed-up.
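To make the scheme concrete, here is a hypothetical sketch (the slicing rule and file names are illustrative, not my actual code): every sub-job runs the same executable and is told on the command line which slice of the work it owns; each sub-job writes its own result file, and the files are merged afterwards. Submitting twenty such sub-jobs to the “short” queue is then just a shell loop over bsub.

    // Hypothetical sketch of the manual job-splitting scheme.
    // Run as: ./solver <slice> <num_slices>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        if (argc != 3) {
            std::fprintf(stderr, "usage: %s <slice> <num_slices>\n", argv[0]);
            return 1;
        }
        int slice  = std::atoi(argv[1]);
        int slices = std::atoi(argv[2]);

        const int total = 1000;                  // total work items
        int begin = slice * total / slices;      // this sub-job's share
        int end   = (slice + 1) * total / slices;

        // Each sub-job writes its own result file; a separate script
        // merges result_0.txt ... result_<N-1>.txt afterwards.
        char name[64];
        std::snprintf(name, sizeof name, "result_%d.txt", slice);
        FILE* out = std::fopen(name, "w");
        if (!out) return 1;
        for (int i = begin; i < end; ++i) {
            // ... solve work item i (still OpenMP-parallel within the node) ...
            std::fprintf(out, "%d done\n", i);
        }
        std::fclose(out);
        return 0;
    }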
I anticipate another round of computing with even higher complexity, following another round of review of my research paper. If so, I will invest more effort in improving the distributed computing. The current solution is very primitive, and promising cloud computing technologies such as MapReduce and Hadoop might come into play here. It would be helpful to researchers if the Computer Centre could provide MapReduce and Hadoop support through its cloud computing service.