CHEAPER CLOUD RESOURCES WITH AWS SPOT INSTANCES
In my previous article, I wrote about the sheer scale of the resources now available from public cloud service providers such as Amazon AWS and Microsoft Azure. Their compute resources span huge data centres in different physical locations around the globe. It is hard to imagine the enormity of these resources and, consequently, the complexity of managing them.
One basic question for a service provider is: how can resource utilisation in the virtual cloud be increased? For any resource provider, utilisation should ideally be as close as possible to 100%. A cloud service provider, however, must deliver on-demand resources flexibly, so it inevitably sets aside spare capacity as buffer pools to meet unexpected spikes in demand. For such a provider, perhaps the more important question is: how can the buffer capacity be put to good use when demand is low?
AWS Spot Instances
Different cloud providers may approach this issue differently. AWS's answer was to introduce Spot Instances. This special kind of instance lets customers bid for compute resources by specifying the maximum price they are willing to pay per instance-hour. When the Spot price falls below a customer's bid, the customer is given instances to run their computational jobs. The catch is that when the Spot price rises above the bid (i.e. when the buffer pool is needed to run higher-priced, on-demand instances), the instances are terminated, with a two-minute notification alert.
What kind of jobs can benefit from Spot Instances?
Obviously, Spot Instances are not for all types of jobs. The dynamic nature of the resource means that long-running computations may be halted before they can complete successfully. Jobs with the following characteristics are suitable for Spot Instances:
• Jobs that can be checkpointed and restarted from the latest checkpoint – many software packages come with checkpoint capabilities; if you write your own code, you will need to save your analysis state to an intermediate working file (or restart file) that your program can read once it is restarted
• Jobs that can be broken down into small, independent sub-tasks whose results are aggregated into the final result – examples include Monte Carlo simulations or, more recently, big data analytics workloads (the analysis of multiple, large streams of data)
• Test jobs that are short and iterative in nature
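The first two characteristics can be illustrated together. The following is a minimal sketch, not a definitive implementation: it estimates pi with a Monte Carlo simulation split into independent batches and saves a checkpoint after each batch. The function names, the JSON checkpoint layout and the stop_after parameter (which only simulates an interruption) are all assumptions made for this illustration.

```python
import json
import os
import random
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_batch": 0, "hits": 0, "samples": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so an interruption
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_batch(batch_index, batch_size):
    """One independent sub-task: count random points that fall inside
    the unit quarter circle. Seeding by batch index makes a re-run
    batch reproduce the same result."""
    rng = random.Random(batch_index)
    return sum(
        1 for _ in range(batch_size)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )


def estimate_pi(path, total_batches, batch_size, stop_after=None):
    """Run the remaining batches, checkpointing after each one.
    stop_after (hypothetical) simulates an interruption after that many
    batches in this run; the function then returns None."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_batch"] < total_batches:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # interrupted before finishing
        state["hits"] += run_batch(state["next_batch"], batch_size)
        state["samples"] += batch_size
        state["next_batch"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    # Aggregate the sub-task results into the final estimate.
    return 4.0 * state["hits"] / state["samples"]
```

Calling estimate_pi again with the same checkpoint path after an interruption picks up at the first unfinished batch, so at most one batch of work is lost.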
What to expect when using Spot Instances?
When using Spot Instances, users cannot expect to just launch a job and leave it until it is done. Since the jobs may be interrupted due to fluctuations in Spot pricing, it is the users’ responsibility to:
• determine the optimal bid price for their Spot Instances, weighing factors such as their budget and the time they are willing to wait for the jobs to complete successfully
• determine the best way to respond to a termination alert
• determine how a job can be resumed when Spot Instances are available again
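The last two responsibilities can be sketched as a work loop that checks for a termination notice before each step, saves its position when a notice appears, and can later resume from that position. This is only a sketch under stated assumptions: it assumes the notice can be read from the EC2 instance metadata endpoint at the URL below without a session token, which should be verified against the current AWS documentation. The function names and the save_state callback are hypothetical.

```python
import urllib.error
import urllib.request

# Assumed metadata URL for the interruption notice; verify against the
# AWS documentation before relying on it.
NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_notice(url=NOTICE_URL, timeout=1.0):
    """Return the notice body as text, or None if no notice is posted
    or the endpoint cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except (urllib.error.URLError, OSError):
        return None


def run_until_interrupted(work_steps, save_state, start=0, fetch=fetch_notice):
    """Run work_steps[start:] one at a time, checking for a termination
    notice before each step. save_state is called with the number of
    completed steps, both on interruption and on normal completion.
    Returns the number of completed steps."""
    completed = start
    for step in work_steps[start:]:
        if fetch() is not None:
            save_state(completed)  # respond to the alert: persist progress
            return completed
        step()
        completed += 1
    save_state(completed)
    return completed
```

When Spot Instances become available again, passing the saved step count as start resumes the job where it stopped. Checking before every step is a simple design choice; for very short steps, checking every few seconds would put less load on the endpoint.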
AWS also provides tools that allow you to monitor and manage your Spot Instances. The AWS Spot Bid Advisor lets you analyse the Spot price history and determine a price that you are comfortable bidding.
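To make the idea of analysing price history concrete, here is a small sketch that works on a plain list of past prices. It does not use or imitate the Spot Bid Advisor; the function names and the sample prices are hypothetical, and it assumes the prices were sampled at equal time intervals, so that the fraction of samples at or below a bid approximates the fraction of time an instance would have kept running.

```python
def availability(prices, bid):
    """Fraction of observed prices that are at or below the bid."""
    return sum(1 for p in prices if p <= bid) / len(prices)


def lowest_bid_for(prices, target):
    """Smallest observed price that, used as a bid, would have met the
    target availability (a fraction between 0 and 1) over the history."""
    if not 0 < target <= 1:
        raise ValueError("target must be in the range (0, 1]")
    for candidate in sorted(set(prices)):
        if availability(prices, candidate) >= target:
            return candidate
    return max(prices)  # not reached for a valid target; kept as a safeguard


# Hypothetical price samples in USD per instance-hour, for illustration only.
history = [0.031, 0.030, 0.032, 0.045, 0.030,
           0.120, 0.033, 0.031, 0.030, 0.034]

print(availability(history, 0.034))   # share of samples a 0.034 bid covers
print(lowest_bid_for(history, 0.9))   # cheapest bid covering 90% of samples
```

A higher target pushes the bid towards the price spikes, which is the trade-off between budget and waiting time described above.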
Conclusion
Running and managing cloud instances, especially Spot Instances, is in some cases quite similar to running and managing jobs in an HPC environment. It is no surprise that many HPC jobs are also good candidates for Spot Instances. If you have been managing your HPC jobs with restarts and checkpoints, and have taken the extra effort to find the optimal queue for your jobs in our HPC clusters, you will be well equipped to write scripts to manage your jobs running on Spot Instances. If you have jobs that fit the profile described above, the potential cost savings may be a strong motivation to run them on Spot Instances.