TIPS TO IMPROVE PARALLEL JOB EFFICIENCY
Running jobs in parallel can speed up computations and simulations and make more efficient use of resources. A recent study of jobs completed over the past few months shows that most parallel jobs run very well. However, some jobs submitted to the parallel queues did not run in parallel mode or did not make full use of the allocated resources. We list here a few findings and some tips to improve parallel job efficiency:
1. Serial jobs are submitted to the parallel queues
Different LSF queues are configured for different kinds of jobs. Serial queues are for jobs that run in serial mode on one core only. Parallel queues are designed for jobs that run in parallel mode, using multiple threads or processes on all the processor cores allocated. If serial jobs are submitted to the parallel queues, they waste resources on a large scale. For example, when a serial job is submitted to a parallel queue with 12 processor cores specified, the job will run on only one core. Because a job in a parallel queue gets all 12 processor cores of the node, the other 11 cores are left idle, which means more than 90% of the allocated resources are wasted. The waste is especially significant if the serial job runs for a long time. Furthermore, the more serial jobs run in the parallel queues, the more parallel jobs may be held in pending status for a long time.
Tip: Submit serial jobs to serial queues (serial, large, short) only.
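A minimal sketch of a serial job script, for illustration (the queue name “serial” comes from the tip above; the output file and program names are hypothetical):

    #!/bin/bash
    #BSUB -q serial              # a serial queue: jobs run on one core only
    #BSUB -n 1                   # request a single processor core
    #BSUB -o serial.%J.out       # %J expands to the LSF job ID

    ./my_serial_program          # hypothetical serial executable

Submit it with “bsub < serial_job.lsf”; requesting only one core ensures no additional cores are reserved and left idle.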
2. Shared memory (OpenMP) parallel jobs are submitted using more than one node
Shared-memory parallel jobs such as OpenMP programs can run on one node only. Since each node has 12 processor cores, if you specify 24 cores in the job submission, LSF will allocate at least two nodes for the job. In this situation, only the 12 cores on the master node will be used to run the program and the remaining 12 cores will be left idle. Hence, half of the allocated resources are wasted.
Tip: Submit shared-memory OpenMP parallel jobs to queue “parallel8” with exactly 8 threads, or to queue “parallel” or queue “openmp” with exactly 12 threads.
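A minimal sketch of an OpenMP job script under these assumptions (the queue names come from the tip above; the program name is hypothetical):

    #!/bin/bash
    #BSUB -q openmp              # shared-memory queue; all cores on one node
    #BSUB -n 12                  # exactly 12 cores, matching the node size
    #BSUB -o openmp.%J.out

    export OMP_NUM_THREADS=12    # thread count matches the cores requested
    ./my_openmp_program          # hypothetical OpenMP executable

Keeping OMP_NUM_THREADS equal to the number of cores requested avoids both idle cores and oversubscription.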
3. Jobs fail to produce results after being dispatched
Some jobs can be dispatched and run, but no results are generated due to problems with the executable, its parameters or the options used. Normally, when users find their jobs are not running properly, they terminate and resubmit them, but by then a lot of compute resources and the users’ time have already been wasted.
Tip: Test and debug your programs and make sure they run successfully and produce the expected results by submitting jobs to queue “parallel_test” first. The queue caters for the testing and debugging of parallel jobs; its jobs are given high priority, so waiting time should be minimal. Use a small number of iterations, or terminate your jobs once you find they run properly. Refrain from submitting long-running jobs to this queue, as they will be terminated by the CPU limit set in the queue.
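A minimal sketch of such a test submission (the queue name comes from the tip above; the launcher, program name and iteration flag are hypothetical and depend on your application):

    #!/bin/bash
    #BSUB -q parallel_test       # high-priority test queue with a CPU limit
    #BSUB -n 12                  # a small core count is enough for a trial run
    #BSUB -o test.%J.out

    # Run only a few iterations to verify that output is produced correctly
    mpirun ./my_mpi_program --iterations 10

Once the short run produces sensible output, resubmit the full-length job to a regular parallel queue.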
4. Programs are not well parallelized
Parallel jobs can scale almost linearly if the code is well parallelized. Jobs that are not well parallelized waste resources, and it takes longer for you to get your results. The “hog factor” shown in the LSF job accounting gives you an indication of how well your code is parallelized and how efficiently you have used the compute resources. To check the “hog factor”, run “bacct -l JOBID” after the job has finished; the output includes, among other details, the CPU time consumed, the turnaround time and the wait time in the queue.
The Hog Factor can be calculated using this formula:
Hog Factor = CPU Time Consumed / (Turnaround Time - Wait Time in Queue)
For example, a job that consumed 168936.1 s of CPU time, with a turnaround time of 7672 s and a wait time in the queue of 4134.0 s, gives:
Hog Factor = 168936.1 / (7672 - 4134.0) = 47.7
As 48 processor cores were specified for this job, the hog factor of 47.7 indicates that the code is parallelized very well. If you find the hog factor is far less than the number of processor cores specified, you may need to check your code and find ways to optimize or re-parallelize it.
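As a quick sanity check, the arithmetic can be reproduced in the shell (the job ID is hypothetical; the figures are those of the example above):

    bacct -l 123456                             # full accounting record for job 123456
    echo "168936.1 / (7672 - 4134.0)" | bc -l   # prints 47.75..., the hog factor

A hog factor close to the number of cores requested means the cores were kept busy for most of the run time.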