New Computational Cluster in the Cloud
by Yeo Eng Hee, Research Computing, NUS Information Technology
What’s New?
Over the past few issues of the HPC Newsletter, I have been writing about Cloud resources and how computational jobs can be run in the Cloud. The Research Computing team here has been working hard to make Cloud resources available to registered HPC users in a secure and simple way, so that our HPC workloads can be run in the Cloud as well.
These are the highlights of using the Cloud:
● the ability to run far bigger jobs than are physically possible in our own data centres,
● the option to choose different computational hardware to suit a job’s requirements,
● access to the latest hardware without going through a long procurement process, and
● paying only for what we use.
NUS IT Research Computing’s new HPC Cloud was released in April 2020, with the following initial set of applications fully tested and more to be announced soon:
● Intel Compilers
● GNU Compilers
● Abaqus
● Ansys/Fluent
● Gaussian
A new way to run HPC jobs
In the new Cloud economy, we need to change some of the ways we work so that we can use the Cloud effectively without incurring more cost than necessary. Cloud usage is pay-as-you-use, and therefore every resource we are not using should be terminated or removed. This includes data files, which should be deleted once they are no longer needed.
As such, the home directory in the Cloud is also the workspace, and there is no longer a separate /hpctmp or /hpctmp2 workspace. Treat the home directory as a temporary workspace, not as permanent storage for your files (files older than 30 days are automatically purged from your home directory).
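As an illustration, before the purge kicks in you may wish to check which files are at risk and copy any results you still need back to on-premise storage. The commands below are a minimal sketch: the destination host and paths are placeholders, and they assume SSH access from the HPC Cloud login node to the on-premise cluster is permitted.
# list files in your home directory older than 30 days (purge candidates)
find $HOME -type f -mtime +30
# copy a results directory back to on-premise workspace (placeholder host and path)
scp -r ~/my_job_results myuser@onprem-login-host:/hpctmp/myuser/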
How do you start using it?
Instructions on how to log in to HPC Cloud are posted here. There is also a command-line help facility similar to the one we have in NUS HPC. Type: “hpc help”
Example:
hpc help
Use command as:
> hpc [Abaqus|Fluent|Gaussian|Intel|C|Fortran|GNU|Own] [job|script]
For example:
> hpc Abaqus job ==> to list sample Abaqus job submission script.
> hpc Fluent script ==> to list sample Fluent job submission script.
> hpc help ==> to list this help message.
HPC Cloud also uses the PBS Pro job scheduler to manage jobs in the cloud, so regular HPC users should already be familiar with the commands. Take note of the changes in the PBS Pro job scripts, though: we have trimmed down the options so that jobs can be submitted in a more uniform manner in the cloud. For example, there is no longer any need to specify a particular queue; users only need to specify the number of CPUs required. Advanced users may still specify other requirements, within the limits set in HPC Cloud’s job queue definition.
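As an illustration, a minimal HPC Cloud job script could look like the sketch below. The job name, CPU count and application command are placeholders; the samples printed by the “hpc” command remain the authoritative reference.
#!/bin/bash
#PBS -N example_job            # job name (placeholder)
#PBS -l select=1:ncpus=24      # request 24 CPUs; no queue needs to be specified
#PBS -j oe                     # merge standard output and error into one file
cd $PBS_O_WORKDIR              # run from the directory the job was submitted from
./my_solver input.dat          # placeholder for the actual application command
Submit the script with “qsub myjob.pbs” and monitor it with “qstat”, just as on the on-premise clusters.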
Optimizing your time in HPC Cloud
The HPC Cloud cluster has been configured to run HPC workloads optimally in the Cloud, including distributed MPI jobs across cloud instances. These settings include placing the compute instances within the same location in the data centre and making use of local SSD disks for faster I/O within each compute node. Here are the specifications for the Cloud compute instances:
Instance type model | c5d
Processor | 2nd generation Intel Scalable Processor (Cascade Lake), 3.0 GHz
Network bandwidth | 12 Gbit/sec Ethernet
One option that advanced users may want to consider is the PBS Pro placement parameter. By default, HPC Cloud’s placement flag is set to “-l place=free”, which allows job processes to fill up any available job slot within the cluster. This lets jobs start earlier, as soon as the requisite number of job slots is available, regardless of which node a slot is located on. Some jobs (for example Fluent) may require all slots to be within the same virtual machine. For this option, use the PBS option “-l place=pack”, as shown in the Fluent job script sample within the “hpc” command line (type: “hpc Fluent script”). This ensures that the job resides within a single virtual machine, but the drawback is that large jobs may need to wait until there are sufficient slots within the same computational node.
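For instance, assuming a 24-way MPI job split into two 12-CPU chunks, the two placement choices could be written as follows. The resource line itself is illustrative only; refer to the sample printed by “hpc Fluent script” for the recommended settings.
# default: chunks may land on any nodes with free slots
#PBS -l select=2:ncpus=12:mpiprocs=12
#PBS -l place=free
# pack: all chunks are allocated on the same virtual machine
#PBS -l select=2:ncpus=12:mpiprocs=12
#PBS -l place=pack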
In my benchmark tests with Abaqus, choosing “-l place=pack” reduced the overhead of running across multiple computational nodes by about 10%. It is therefore up to users to decide whether they prefer “-l place=pack” or “-l place=free”.
Conclusion
HPC Cloud is a new, cloud-only resource where HPC users can now run their HPC analysis jobs, in addition to the on-premise NUS HPC resources. Please send us feedback on usage and suggestions for improvement via nTouch (https://ntouch.nus.edu.sg/).