EFFICIENT PROCESSING OF LARGE NEXT GENERATION SEQUENCING DATASETS
Next-generation sequencing (NGS) has made it routine to generate terabytes, and even petabytes, of data. When processing this much data, storage space and disk I/O (read/write speed) often become the limiting factors of a computing system. It is therefore important to design one’s analysis to minimize storage requirements and reduce the I/O overhead on the computing infrastructure, which in turn yields faster results.
In this article, we share some basic tips to help improve the speed and reduce the storage demands of your NGS applications.
Pipes are great for avoiding the need to generate a temporary intermediate file
Many NGS applications generate a temporary intermediate file. This file can be much bigger than the raw data file, wasting valuable storage space. In addition, writing the file to disk and reading it back strains the I/O of the computing infrastructure, which can become a bottleneck. Piping the output of one program directly into the next avoids both issues.
This is illustrated here using the alignment of a pair of FASTQ files with BWA as an example.
Method 1 (No Piping)
bwa mem -t 4 hg19.fa tumorSample.fq1.gz tumorSample.fq2.gz > tumorSample.sam
samtools view -bS tumorSample.sam > tumorSample.bam
Method 2 (With Piping)
bwa mem -t 4 hg19.fa tumorSample.fq1.gz tumorSample.fq2.gz | samtools view -bS - > tumorSample.bam
Metric | Method 1 | Method 2 | Space saved/Speed up |
---|---|---|---|
Total storage space used | 4183 MB | 906 MB | 78.3% |
Compute time | 15 min 31 secs | 13 min 16 secs | 1.16x |
In the first method above, the alignment result is written to disk as a “.sam” text file, which is then read back from disk and converted into a “.bam” file. Generating the intermediate “.sam” file is unnecessary. A more efficient alternative is to pipe the standard output of the alignment program directly into samtools, which converts it into a BAM file on the fly, as depicted in Method 2 above.
Based on the results above, piping the intermediate data in the second method reduces the storage footprint of the analysis by more than three quarters. The analysis also runs noticeably faster, because the conversion from SAM to BAM is done in parallel with the alignment.
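One caveat when replacing separate steps with a pipe: by default, bash reports only the exit status of the last command in a pipeline, so a crash in the aligner can go unnoticed if the downstream converter still exits cleanly. The sketch below assumes bash and uses a hypothetical stand-in function (failing_aligner) rather than a real aligner. It shows how the pipefail option exposes such failures.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for an aligner that writes some output and then fails.
failing_aligner() { echo "partial output"; return 1; }

# Default behaviour: the pipeline's status is that of the last command (cat),
# so the upstream failure is hidden.
set +o pipefail
failing_aligner | cat > /dev/null
status_default=$?

# With pipefail, the pipeline reports a non-zero status if any stage fails.
set -o pipefail
failing_aligner | cat > /dev/null
status_pipefail=$?

echo "default: $status_default, pipefail: $status_pipefail"
```

In a real script, placing “set -o pipefail” before the piped command from Method 2 should make a failed alignment surface as a non-zero exit status instead of silently producing a truncated BAM file.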
Redirecting multiple standard outputs
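It is also possible to feed the standard output of several commands into a single application. In bash, this is done with process substitution, written “<(command)”, which lets a program read a command’s output as if it were a file. We illustrate this using the VarScan somatic mutation caller.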
Method 1
samtools mpileup -q 1 -f hg19.fa norm.bam > norm.bam.pileup
samtools mpileup -q 1 -f hg19.fa tum.bam > tum.bam.pileup
java -jar VarScan.v2.3.9.jar somatic norm.bam.pileup tum.bam.pileup outputFile
Method 2
java -jar VarScan.v2.3.9.jar somatic <(samtools mpileup -q 1 -f hg19.fa norm.bam) <(samtools mpileup -q 1 -f hg19.fa tum.bam) outputFile
Metric | Method 1 | Method 2 | Space saved/Speed up |
---|---|---|---|
Storage space | 11175 MB | 1479 MB | 86.8% |
Compute time | 30 min 1 sec | 15 min 39 secs | 1.92x |
As the test results show, omitting the intermediate files and running the three steps in parallel (Method 2) saves a large amount of disk space and nearly halves the running time. A similar approach could also be
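The “<(command)” syntax used in Method 2 is bash process substitution: each substituted command runs concurrently, and the consuming program receives a file name (typically under /dev/fd) that it reads like an ordinary file. The following is a minimal sketch, assuming bash. It uses two hypothetical stand-in functions (normal_stream and tumor_stream) in place of the two samtools mpileup commands, and standard text tools in place of VarScan.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two pileup-producing commands. Each writes
# tab-separated chromosome, position and base to standard output.
normal_stream() { printf 'chr1\t100\tA\nchr1\t101\tC\n'; }
tumor_stream()  { printf 'chr1\t100\tA\nchr1\t101\tT\n'; }

# paste is handed two file names and reads both streams side by side, so no
# intermediate file is written to disk. awk keeps the rows whose base columns
# (fields 3 and 6 after pasting) differ, and wc counts them.
differing=$(paste <(normal_stream) <(tumor_stream) | awk -F'\t' '$3 != $6' | wc -l)

echo "positions where the bases differ: $differing"
```

Because the substituted inputs are pipes rather than regular files, this only suits programs that read each input once from start to finish. A tool that needs to seek within an input, or read it twice, would still need a real file.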
High speed read/writes using a ramdisk
Certain applications need faster I/O than a normal hard drive can provide. A possible solution is to mount a portion of the machine’s RAM as a ramdisk, which can then be used like a normal storage drive. Files that are copied into the ramdisk can be accessed just like normal files, but at much higher speeds.
The following command illustrates how we can create a 10GB ramdisk:
mkdir /media/ramdisk
mount -t tmpfs -o size=10G tmpfs /media/ramdisk/
The table below shows a roughly 11-fold speedup when copying 1 GB to a ramdisk compared with a normal disk.
Operation | Time (seconds) | Speedup |
---|---|---|
Time to copy 1 GB into normal drive | 8.917 | – |
Time to copy 1 GB into ramdisk | 0.819 | 10.9x |
NOTE: A ramdisk is only for temporary file storage. Files stored within a ramdisk are lost when the ramdisk is unmounted or the server is shut down.
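Because ramdisk contents are volatile, a job would typically stage its working files on the ramdisk, do the I/O-heavy work there, and copy the final results back to persistent storage before cleaning up. The sketch below assumes bash and the /media/ramdisk mount point created above. The RESULTS directory and the toy sort step are hypothetical placeholders for a real output location and a real analysis step.

```shell
#!/usr/bin/env bash
RAMDISK=${RAMDISK:-/media/ramdisk}   # mount point of the ramdisk
RESULTS=${RESULTS:-$(mktemp -d)}     # hypothetical persistent output directory

# Use the ramdisk as scratch space if it exists and is writable; otherwise
# fall back to the regular temporary directory so the job still runs.
if [ -d "$RAMDISK" ] && [ -w "$RAMDISK" ]; then
    scratch=$(mktemp -d "$RAMDISK/job.XXXXXX")
else
    echo "ramdisk not available; falling back to regular temporary space" >&2
    scratch=$(mktemp -d)
fi

# Toy stand-in for an I/O-heavy step working entirely inside the scratch space.
printf 'b\na\nc\n' > "$scratch/input.txt"
sort "$scratch/input.txt" > "$scratch/sorted.txt"

# Copy the result to persistent storage, then free the scratch space (and RAM).
cp "$scratch/sorted.txt" "$RESULTS/sorted.txt"
rm -rf "$scratch"

echo "results saved to $RESULTS/sorted.txt"
```

Removing the scratch directory as soon as the results are copied matters more on a ramdisk than on a normal drive, since files left there keep occupying memory that the analysis itself may need.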
Final word:
NGS applications are far more storage- and I/O-intensive than most other applications. The command “iostat -x 1” is also something that you might find useful for identifying I/O bottlenecks during your analysis. In all, designing NGS analyses and pipelines requires close attention to storage and I/O requirements, and we certainly hope you’ll find these tips useful in your own analyses.