NUSIT HPC

Tackling HPC issue for parallel computing in MATLAB

by Vamshidhar Gangu, Research Computing, NUS Information Technology

Introduction

Here, we address a problem that can occur when using the Parallel Computing Toolbox (PCT) on NUSIT-HPC. It appears when several PCT jobs are submitted to PBS at the same time. Typically the first job runs, while the subsequent jobs hang or crash with an error indicating that a second matlabpool cannot be opened.

Root Cause of the problem

When using the Parallel Computing Toolbox (PCT), MATLAB creates a separate matlabpool for each job. When you submit multiple PCT jobs, these matlabpools can interfere with one another, which can lead to errors and early termination of your scripts.

The PCT requires a temporary “Job Storage Location” where it stores information about the matlabpool that is in use. This is simply a directory on the file system to which MATLAB writes the files it uses to coordinate the workers of the matlabpool. By default, this information is stored in “/home/svu/YOURUSERNAME/.matlab/”. When multiple PCT jobs are submitted to the job scheduler (PBS), all of them attempt to use this default location, which creates a race condition in which one job modifies files created by another. This situation must be avoided.

Solution

The solution is to give each PCT job its own unique Job Storage Location. To do this, create a temporary directory in your job submission script before launching MATLAB, and in your MATLAB script create the matlabpool so that it explicitly uses this directory. An example job submission script is shown below. As good housekeeping practice, the temporary directory can be removed after the MATLAB script has run.

#!/bin/bash
#PBS -P Project_Name_of_Job

####--- For matlab job with parallel computing ----
#PBS -q parallel12
#PBS -l select=1:ncpus=12:mem=45GB
#PBS -j oe
#PBS -N Job_Name

cd $PBS_O_WORKDIR;   ## This line is needed, do not modify.

##--- Put your exec/application commands below ---
## If your matlab program is < my_matlab_prog.m >.

## create a temp directory, unique to this job
mkdir -p "${PBS_O_WORKDIR}/pct_${PBS_JOBID}"

## run your matlab pct script (my_matlab_prog.m)
matlab -nodisplay -r my_matlab_prog

## clean up temp directory
rm -rf "${PBS_O_WORKDIR}/pct_${PBS_JOBID}"

And the corresponding MATLAB script (my_matlab_prog.m) needs to include these lines:

% create a local cluster object
pc = parcluster('local');

% set the JobStorageLocation to the temp directory
% that was created in your job submission script (pct_<job id>)
pc.JobStorageLocation = fullfile(getenv('PBS_O_WORKDIR'), ['pct_' getenv('PBS_JOBID')]);

% start the parallel pool with one worker per detected core
% (12 workers on a 12-core node, matching ncpus=12 in the job script)
numcores = feature('numcores');
parpool(pc, numcores);
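The job submission script and the MATLAB script must agree on the name of the temporary directory, and the directory should be removed even if MATLAB exits with an error. The following is a minimal sketch of one way to do both in bash. It is not part of the NUSIT-HPC setup: the function name run_with_pct_dir and the environment variable PCT_JOB_DIR are hypothetical names chosen for illustration, and the fallbacks to the current directory and the shell PID are assumptions made so that the function can be tried outside PBS.

```shell
# Sketch: create a per-job PCT directory, expose its path to the wrapped
# command, and remove the directory when that command finishes.
run_with_pct_dir() {
    # Outside PBS these variables are unset; fall back to the current
    # directory and the shell PID (assumption, for local testing only).
    local workdir="${PBS_O_WORKDIR:-$PWD}"
    local jobid="${PBS_JOBID:-$$}"
    local pct_dir="${workdir}/pct_${jobid}"

    mkdir -p "${pct_dir}" || return 1
    (
        # The EXIT trap runs when this subshell ends, whether the wrapped
        # command succeeded or failed.
        trap 'rm -rf "${pct_dir}"' EXIT
        # Hypothetical variable: lets the wrapped command read the path
        # instead of rebuilding it from PBS_O_WORKDIR and PBS_JOBID.
        export PCT_JOB_DIR="${pct_dir}"
        "$@"
    )
}

# In a job script this could replace the mkdir / matlab / rm lines:
#   run_with_pct_dir matlab -nodisplay -r my_matlab_prog
```

With this variant, the MATLAB script could set pc.JobStorageLocation = getenv('PCT_JOB_DIR'), so the directory name is defined in one place only. The function returns the exit status of the wrapped command, which lets the job script detect a failed MATLAB run.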

Please contact us via nTouch if you need help with your HPC issues.
