RE-ASSESSING HPC CLOUD
Having seen a breakout year for HPC Cloud in 2017, we are beginning to understand what made it work and how we can ride this fast-moving wave to advance computational research on campus.
Why HPC Cloud struggled to take off
While the Cloud was thriving on growing consumer and enterprise use, HPC adoption remained limited for several reasons. Apart from some embarrassingly parallel applications, Cloud performance was considered inadequate for HPC in the absence of bare-metal servers, a parallel file system and an InfiniBand network. Cost comparisons usually found the Cloud more expensive than an in-house HPC installation. And as most Cloud providers focused on the consumer and enterprise markets, there was little effort to develop an HPC-specific software ecosystem to entice users.
What has changed?
Widespread industry and enterprise adoption of Data Analytics and Machine Learning, coupled with the need to use HPC technologies and resources to accelerate research, has driven major Cloud Service Providers (CSPs) to add and expand resources and services relevant to HPC. These developments have made the Cloud more attractive in the following ways:
Competitive advantages
Most CSPs now offer reserved instances that are comparable in cost to an in-house system, provided you do not mind being locked in to a particular technology for a period of time, and the Cloud saves you the trouble of system maintenance. Where time-to-market is essential, the right combination of reserved, on-demand and spot instance[1] subscriptions offers a balance between cost savings and time-to-market advantages.
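As a rough illustration, here is a minimal sketch in Python of how the cost of such a mix might be estimated. The hourly rates and instance-hour figures are purely hypothetical, not actual CSP pricing.

    # Hypothetical hourly rates (USD per instance-hour); real CSP prices vary
    # by region, instance type and commitment term.
    RATES = {"reserved": 0.60, "on_demand": 1.00, "spot": 0.30}

    def monthly_cost(hours_by_model):
        """Estimate monthly spend for a given mix of purchasing models.

        hours_by_model maps 'reserved' / 'on_demand' / 'spot' to the
        instance-hours consumed in the month.
        """
        return sum(RATES[model] * hours for model, hours in hours_by_model.items())

    # Example: steady baseline on reserved capacity, bursts on spot,
    # and a small on-demand buffer for deadline-critical jobs.
    mix = {"reserved": 720, "spot": 400, "on_demand": 80}
    print(f"Estimated monthly cost: USD {monthly_cost(mix):.2f}")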
Better user experience
Even in the virtualized Cloud environment, the complexity of configuring, operating and maintaining an HPC system is still quite daunting for most users. The emergence of solutions such as Ronin.cloud and Rescale helps to hide this complexity and encourages adoption, especially for installations without internal IT support.
Researchers typically have a fixed computing budget to spend, so out-of-control spending in the Cloud is a legitimate concern. Some of these tools provide budget management capabilities to address it.
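For example, on AWS a per-project budget with an e-mail alert can be set up programmatically through the Budgets API. The sketch below uses boto3 with a placeholder account ID, budget amount and e-mail address; it illustrates one possible approach, not the specific mechanism used by the tools mentioned above.

    import boto3

    # Placeholder account ID, limit and e-mail address; adjust to your project.
    budgets = boto3.client("budgets")
    budgets.create_budget(
        AccountId="123456789012",
        Budget={
            "BudgetName": "hpc-project-monthly",
            "BudgetLimit": {"Amount": "2000", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[
            {
                # Alert when actual spend crosses 80% of the budget.
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 80.0,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": "researcher@example.edu"}
                ],
            }
        ],
    )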
Acceptable performance
On performance, some market studies indicate that for many HPC applications the performance drop after moving to the Cloud is not significant, probably due to improvements in virtualization technology. For example, a 2016 joint study by ANSYS and AWS[2] showed “near-ideal scalability past 1000 cores and a reduced overall solution time even beyond 2000 cores”. Even with a 10-20% performance drop in specific areas, researchers can still see a shorter turnaround time (or solution time) if the hours of waiting in the queue (not uncommon in an in-house shared environment) are eliminated. The value of almost instant provisioning in the Cloud becomes even more significant when researchers need to scale their simulations beyond the in-house capacity; the opportunity cost is very high if they have to wait months for the procurement and installation of new in-house capacity.
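A back-of-the-envelope comparison makes the point; the figures below are hypothetical and not taken from the ANSYS/AWS study.

    # Hypothetical figures: a 10-hour solve in-house behind a 6-hour queue,
    # versus a 15% slower solve in the Cloud with near-instant provisioning.
    inhouse_solve = 10.0                 # hours of compute in-house
    inhouse_queue = 6.0                  # hours waiting in a shared queue
    cloud_solve = inhouse_solve * 1.15   # assume a 15% performance drop
    cloud_provisioning = 0.25            # hours to provision Cloud instances

    inhouse_turnaround = inhouse_queue + inhouse_solve    # 16.00 h
    cloud_turnaround = cloud_provisioning + cloud_solve   # 11.75 h
    print(f"In-house turnaround: {inhouse_turnaround:.2f} h")
    print(f"Cloud turnaround:    {cloud_turnaround:.2f} h")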
Up-to-date options
When we purchase an in-house HPC system, we are committed to making full use of a specific technology and capability for at least 5 years. In the Cloud, we can find two generations of GPU (Graphics Processing Unit) technology introduced within a year. With technology moving this fast, it becomes more challenging for us to keep up through the traditional way of procuring, running and maintaining in-house hardware.
Besides access to the latest technologies, it is also important to have choices, since some applications make better use of one technology than another. Whether the choice is GPU or FPGA (Field Programmable Gate Array) devices now, or quantum computers in the future, they are more likely to be found in the Cloud when they become mainstream.
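As an illustration of how such choices can be surveyed, the following sketch uses boto3 (assuming standard AWS credentials are configured) to list the GPU-capable EC2 instance types visible in a region, together with the GPU model and count each one carries.

    import boto3

    # List EC2 instance types that carry GPUs, with GPU model and count.
    ec2 = boto3.client("ec2", region_name="ap-southeast-1")
    paginator = ec2.get_paginator("describe_instance_types")
    for page in paginator.paginate():
        for itype in page["InstanceTypes"]:
            gpu_info = itype.get("GpuInfo")
            if gpu_info:
                for gpu in gpu_info["Gpus"]:
                    print(itype["InstanceType"], gpu["Manufacturer"],
                          gpu["Name"], gpu["Count"])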
More application software bundles
Researchers have access to a wealth of free, open-source software. Even though no licensing cost is involved, there are hidden costs when researchers have to spend time troubleshooting installation and OS compatibility issues. With Cloud software marketplaces such as Alces Flight, AWS Marketplace, Azure Marketplace and Google Cloud Launcher, instances with optimized and up-to-date application software can be provisioned off-the-shelf.
Where commercial software is needed on an ad-hoc basis, the on-demand software licensing model in the Cloud can be more cost-effective than committing to a full year of licensing costs in-house.
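A simple break-even check shows when on-demand licensing pays off; the prices below are purely hypothetical, as actual licence terms vary by vendor and package.

    # Hypothetical prices: annual licence vs. per-hour on-demand licence fee.
    annual_licence = 12000.0      # USD per year, committed in-house
    on_demand_rate = 5.0          # USD per licensed hour in the Cloud

    breakeven_hours = annual_licence / on_demand_rate
    print(f"On-demand is cheaper below {breakeven_hours:.0f} licensed hours/year")

    # e.g. an ad-hoc project needing 300 hours of the package:
    print(f"On-demand cost for 300 h: USD {300 * on_demand_rate:.2f}")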
More scalable and flexible storage solutions
We often purchase storage capacity to cater and buffer for at least the next twelve months' requirements. The usage pattern is more likely to be cumulative than full utilisation from day one, so a potentially large portion of the in-house capacity may sit idle for long periods. The pay-as-you-use model in the Cloud will save cost for such a usage pattern, and with greater adoption the subscription rate is expected to fall.
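A rough comparison, with hypothetical prices and a linear growth pattern, between buying the full year's capacity up front and paying only for what is actually stored each month:

    # Hypothetical figures: 100 TB bought up front vs. usage growing
    # linearly to 100 TB over 12 months at a pay-as-you-use rate.
    upfront_cost_per_tb = 300.0          # USD per TB, purchased in-house
    cloud_rate_per_tb_month = 25.0       # USD per TB-month in the Cloud
    capacity_tb = 100.0

    upfront_total = capacity_tb * upfront_cost_per_tb
    monthly_usage = [capacity_tb * (m / 12) for m in range(1, 13)]  # cumulative growth
    cloud_total = sum(usage * cloud_rate_per_tb_month for usage in monthly_usage)

    print(f"Up-front purchase: USD {upfront_total:.0f}")   # 30000
    print(f"Pay-as-you-use:    USD {cloud_total:.0f}")     # ~16250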
More innovation opportunities
The Cloud provides a flexible environment for experimenting with and exploring ideas and solutions, so new systems and platforms can be built and reconfigured quickly. The pace at which Cloud Service Providers, as technology leaders, have been introducing new technologies has fueled further innovation among their users.
What do we have today and what can we look forward to?
For a more secure environment, we have established a private cloud environment on AWS to provide HPC resources. With this setup, the resources and services subscribed to are provisioned within a private and secure network for NUS users. NUS access control has also been integrated so that the NUS-ID and password can be used to access the Cloud resources. Three key use cases today are:
• provisioning of GPU computing resources and services to support AI/Deep Learning research
• provisioning for Proof-of-Concept (POC) projects and experimentation
• HPC software provisioning and performance studies
Today, HPC provisioning in the private cloud has to be performed by NUS IT staff. Moving forward, we plan to implement a solution that enables easy self-service and effective budget management for researchers.
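As an indication of what scripted, and eventually self-service, provisioning could look like, the sketch below launches a GPU instance into a private subnet with boto3. The AMI ID, subnet, security group, key pair and instance type are placeholders, and this is not the exact mechanism used in our private cloud.

    import boto3

    # Placeholder IDs; substitute the AMI, subnet, security group and key pair
    # provisioned for your own private network.
    ec2 = boto3.client("ec2", region_name="ap-southeast-1")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",        # e.g. a Deep Learning AMI
        InstanceType="p3.2xlarge",              # single-GPU instance type
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",    # private subnet, no public IP
        SecurityGroupIds=["sg-0123456789abcdef0"],
        KeyName="hpc-user-key",
        TagSpecifications=[
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "Project", "Value": "deep-learning-poc"}],
            }
        ],
    )
    print("Launched:", response["Instances"][0]["InstanceId"])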
It is reasonable to say that HPC has played a key role in driving the renaissance of AI research. We are interested in exploring how AI, in turn, can help drive the development of HPC to a new level, particularly in the Cloud. In the next stage of HPC Cloud development, we will be asking questions such as:
• How can we build intelligence into auto provisioning to minimise cost and maximise effectiveness?
• Is automatic code generation and optimisation possible?
• How can AI help to speed up research discovery?