Understanding PBS Job submission in HPC Cloud
Yeo Eng Hee, Research Computing, NUS Information Technology
Introduction
The PBS Pro job scheduler is the ubiquitous tool in our HPC clusters that manages all the HPC workloads running on the compute nodes. HPC users are familiar with the concept of queues in the scheduler and understand that jobs must be submitted to the appropriate queue to run on their preferred hardware. We have therefore created multiple queues to match the hardware capabilities of the physical clusters in our data centres.
With the introduction of the cloud, computational clusters have become more dynamic, and the cloud's flexibility and agility allow us to create clusters that are more homogeneous than the multi-generational hardware in our data centres. As such, a new way of defining queues has been introduced to make it easier for users to specify their HPC jobs in the cloud.
Which queue should I choose in the cloud?
In a departure from the usual practice of specifying a queue for your HPC jobs, the PBS Pro job scheduler in the cloud sends all jobs to a routing queue by default. A routing queue, as its name suggests, only routes jobs to other queues based on each job's resource requirements. This effectively allows you to submit a job to the scheduler without specifying any queue at all! So, the following job directive can simply be removed from your job scripts in the cloud:
#PBS -q <queue_name> ### You can omit this line in the cloud.
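After submission, you can check which execution queue the routing queue assigned your job to. A sketch (myjob.pbs is a placeholder script name, and 12345 stands in for the job ID that qsub prints):

```shell
# Submit without specifying a queue; the routing queue chooses the destination
qsub myjob.pbs

# Show the job's full status; the "queue" field is the destination queue
qstat -f 12345 | grep queue
```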
How do I specify my job requirements?
Basic specification
The most basic job specification is therefore reduced to a single resource: the number of CPUs (processing cores). In PBS, this is specified in a directive such as the one below:
#PBS -l select=1:ncpus=1
In fact, the directive above is the minimum you need to specify in the cloud to run a simple serial job on one processor core.
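The same resource request can also be given on the qsub command line instead of inside the script (a sketch; job.sh is a placeholder for your own job script):

```shell
qsub -l select=1:ncpus=1 job.sh
```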
The concept of a resource chunk
At this point, it is important to introduce the concept of a resource chunk in PBS Pro. A chunk is the basic set of resources allocated to a job, requested with the select=N directive. So, in the example:
#PBS -l select=1:ncpus=1
one resource chunk is requested. Implicit in this statement are the other default resources defined in a chunk, so it is equivalent to:
#PBS -l select=1:ncpus=1:mem=1950mb:mpiprocs=1:ompthreads=1
The previous two directives are identical, and PBS Pro will assign 1 CPU, 1950MB of memory, 1 MPI process and 1 OpenMP thread to the job (the last two are superfluous for a serial job). For a job that requires 2 processors, use the following:
#PBS -l select=2:ncpus=1:mem=1950mb:mpiprocs=1:ompthreads=1
which assigns a total of 2 CPUs, 3900MB of memory and 2 MPI processes to the job, with 1 OpenMP thread per process.
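For example, an MPI job spanning two chunks might be requested as follows (a sketch; the executable name and core counts are illustrative, and it assumes an MPI launcher such as mpirun is available on the cluster):

```shell
#PBS -l select=2:ncpus=12:mpiprocs=12

## 2 chunks x 12 MPI ranks each = 24 ranks in total
mpirun -np 24 ./my_mpi_prog.exe
```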
Can I select different combinations of “ncpus=” and “mem=” directives?
Yes, you can. The default chunk described in the previous paragraph matches the ratio of CPU to memory in a cloud instance (a cloud virtual server), and using the default chunk allows jobs to run optimally in the cloud. However, PBS Pro is flexible, and users can specify different combinations of CPUs and memory to suit their jobs, effectively creating their own customized resource chunks. This may be necessary at times, when a job fails to run using the default chunks. Users can try modifying the ncpus= and mem= values to see what works for them.
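For instance, a single-process job that needs more memory than the default chunk provides could request a customized chunk like this (the values are illustrative only):

```shell
## 1 chunk with 4 cores and 16GB of memory for one memory-hungry process
#PBS -l select=1:ncpus=4:mem=16gb
```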
I am ready to submit my jobs in the cloud
If you have not registered your account to use the HPC resources, go to our web page here. If you are new, the Introductory Guide for New HPC Users page will give you more details on how to start using our HPC resources.
The cloud resources are accessible via ssh to the HPC cloud login node:
prompt> ssh <username>@hpclogin
where <username> is your NUS user ID.
To create a job script, for example, for a serial job, use the command line:
prompt> hpc serial job
which will print out a full sample script that you can copy into your own job script file:
#!/bin/bash
#PBS -P myproj
#PBS -j oe
#PBS -N myprog
#PBS -l select=1:ncpus=1
#PBS -l place=free:shared
cd ${PBS_O_WORKDIR}; ## this line is needed, do not delete.
##--- Put your exec/application commands below ---
## Make a temporary scratch space (this should be on /scratch)
scratch=/scratch/${USER}/${PBS_JOBID}
export TMPDIR=$scratch
mkdir -p $scratch
./myprog.exe
status=$?   ## capture the program's exit code before cleanup overwrites $?
# Remove scratch space
rm -rf $scratch
exit $status
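The scratch-space pattern in the script above can be tried outside PBS as well. A minimal sketch, using /tmp and a fake job ID in place of the /scratch path and the PBS_JOBID that PBS provides on the cluster (`true` stands in for the real program):

```shell
# Simulate the scratch workflow locally; on the cluster PBS sets PBS_JOBID
PBS_JOBID=${PBS_JOBID:-demo.0}
scratch=/tmp/${USER:-user}/${PBS_JOBID}   # would be /scratch/... on the cluster

mkdir -p "$scratch"
export TMPDIR=$scratch

true                # stands in for ./myprog.exe
status=$?           # capture the program's exit code before cleanup

rm -rf "$scratch"   # remove the scratch space
echo "exit status: $status"
```

Keeping a copy of the program's exit code before the cleanup commands run ensures the job reports the program's status rather than the status of `rm`.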
Conclusion
The HPC cluster in the cloud is a new resource that brings the flexibility and agility of the cloud to our HPC users. To use the cloud resources optimally for HPC workloads, a new paradigm is adopted, whereby queues are no longer used to define the types of resources needed for each HPC job. Instead, you should use resource chunk definitions to specify more accurately the resources (minimally, CPUs and memory) needed to run your HPC jobs.
For any questions, send us a query via nTouch (https://ntouch.nus.edu.sg/).