High Performance Secondary Analysis of Sequencing Data

Date: 2018-11-13
Categories: blogs
Tags: featured

Genomic analysis is on the cusp of revolutionizing the understanding of diseases and the methods for their treatment and prevention. With the advancements in Next Generation Sequencing (NGS) technologies, the number of human genomes sequenced is predicted to double every year. This market growth is further fueled by the ongoing transition of NGS into the clinical market, where it is enabling personalized medicine, which promises to transform the diagnosis and treatment of diseases and lead to a disruptive change in modern medicine.

However, current DNA analysis is restricted to limited data because of the considerable time and cost of Whole Genome Sequencing (WGS). As biochemical sequencing gets faster and cheaper, the bottleneck becomes the analysis of the large volumes of data these technologies generate. Faster and cheaper computational processing is required to make genomic analysis available to the masses. Furthermore, pharmaceutical companies, consumer genomics companies, and research centers are currently processing hundreds of thousands of genomes at great cost and will benefit hugely from this improvement as well.

Parabricks brings high performance computing technologies tailored for NGS analyses and cuts the run time of standard NGS software from several days to approximately one hour. The accelerated software is a drop-in replacement for existing tools that does not sacrifice output accuracy or configurability. On POWER9 servers, Parabricks provides 30-36 times faster secondary analysis, from the FASTQ files coming out of the sequencer to the variant call files (VCFs) used for tertiary analysis. The standard pipeline shown below consists of three steps, as defined by the Genome Analysis Toolkit (GATK) best practices. Parabricks accelerates the existing GATK 4 best practices and generates results equivalent to the baseline. The image below shows the pipeline currently supported by Parabricks.

[Figure: secondary analysis pipeline currently supported by Parabricks]
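To make the data flow concrete, the pipeline can be written down as an ordered list of stages. This is only a sketch: the tool names come from the run-time table later in the post, while the grouping into three high-level steps, the step labels, and the intermediate file descriptions are assumptions made for illustration. The code does not invoke any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    step: str      # high-level step (grouping and labels are assumptions)
    tool: str      # tool or operation named in the post
    consumes: str  # input of the stage
    produces: str  # output of the stage


# FASTQ in, VCF out. Intermediate descriptions are illustrative.
PIPELINE = [
    Stage("alignment", "BWA-Mem", "FASTQ", "BAM"),
    Stage("BAM processing", "coordinate sorting", "BAM", "sorted BAM"),
    Stage("BAM processing", "marking duplicates", "sorted BAM", "deduplicated BAM"),
    Stage("BAM processing", "BQSR", "deduplicated BAM", "recalibration table"),
    Stage("BAM processing", "ApplyBQSR",
          "deduplicated BAM + recalibration table", "recalibrated BAM"),
    Stage("variant calling", "HaplotypeCaller", "recalibrated BAM", "VCF"),
]


def high_level_steps(pipeline):
    """Return the distinct high-level steps in first-seen order."""
    seen = []
    for stage in pipeline:
        if stage.step not in seen:
            seen.append(stage.step)
    return seen


if __name__ == "__main__":
    for stage in PIPELINE:
        print(f"{stage.step:15} {stage.tool:20} {stage.consumes} -> {stage.produces}")
```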

Power Hardware Configuration

The Power System AC922 server is co-designed with OpenPOWER Foundation ecosystem members for the demanding needs of deep learning and AI, high-performance analytics, and high-performance computing users. It is deployed in the most powerful supercomputers on the planet through a partnership between IBM, NVIDIA, and Mellanox, among others.

The IBM AC922 server is an accelerator-optimized server with support for four NVIDIA Tesla V100 GPUs, each connected to the POWER9 CPUs via NVLink 2.0 at 150 GB/s. The hardware and system software configurations are summarized below.

| Component | Specification |
|-----------|---------------|
| Server | IBM AC922 (8335-GTH) |
| Processor | 40-core at 2.4 GHz (3.0 GHz turbo) IBM POWER9 with NVLink 2.0 technology, 4x SMT |
| Memory | 512 GB DDR4 (8 channels), supporting up to 2 TB of memory |
| GPU | 4x NVIDIA V100-16GB HBM2, SXM2 |

Table 1 - Hardware configuration

Performance Evaluation

Secondary analysis of genomic data on CPUs is known to take a long time. For 30x WGS data, running the pipeline shown earlier with HaplotypeCaller for variant calling can take 30-40 hours. The table below lists the raw run times in minutes for the Parabricks software on a POWER9 server for 3 DNA samples with different coverages, including NA12878.

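| Benchmark | Coverage | CPU only (minutes) | BWA-Mem | Others* | HaplotypeCaller | Total Time (minutes) | Speedup |
|-----------|----------|--------------------|---------|---------|-----------------|----------------------|---------|
| S2 | 25x | 2,746 | 56.8 | 14.65 | 13.2 | 84.5 | 32.4 |
| NA12878 | 43x | 3,125 | 62.7 | 14.1 | 11.5 | 88.3 | 35.39 |
| NIST 12878 | 41x | 2,993 | 61.05 | 14.95 | 13.71 | 89.71 | 33.96 |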

Table 2 - Run times in minutes. *Others include coordinate sorting, marking duplicates, BQSR, and ApplyBQSR.
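The speedup column can be read as the CPU-only run time divided by the total accelerated run time. The sketch below recomputes that ratio from the minutes in Table 2; the dictionary layout and function names are illustrative, and the formula itself is an assumption about how the column was derived, so the recomputed ratios are not guaranteed to match the published values digit for digit.

```python
# Run times in minutes, copied from Table 2.
RUNS = {
    "S2": {"cpu_only": 2746.0, "bwa_mem": 56.8, "others": 14.65,
           "haplotype_caller": 13.2, "total": 84.5},
    "NA12878": {"cpu_only": 3125.0, "bwa_mem": 62.7, "others": 14.1,
                "haplotype_caller": 11.5, "total": 88.3},
    "NIST 12878": {"cpu_only": 2993.0, "bwa_mem": 61.05, "others": 14.95,
                   "haplotype_caller": 13.71, "total": 89.71},
}


def speedup(cpu_only_minutes: float, accelerated_minutes: float) -> float:
    """Ratio of CPU-only run time to accelerated run time (assumed definition)."""
    if accelerated_minutes <= 0:
        raise ValueError("accelerated run time must be positive")
    return cpu_only_minutes / accelerated_minutes


def stage_sum(run: dict) -> float:
    """Sum of the three per-stage run times listed for one benchmark."""
    return run["bwa_mem"] + run["others"] + run["haplotype_caller"]


if __name__ == "__main__":
    for name, run in RUNS.items():
        ratio = speedup(run["cpu_only"], run["total"])
        print(f"{name}: {run['cpu_only'] / 60:.1f} h on CPU, "
              f"{run['total'] / 60:.1f} h accelerated, {ratio:.2f}x")
```

Converting minutes to hours in the same loop also shows the scale of the change: each CPU-only run spans roughly two days, while each accelerated run finishes in about an hour and a half.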

Accuracy Evaluation

The accuracy of the Parabricks solution is compared against the GATK4 solution at two steps:

i) BAM after Marking Duplicates

ii) VCF after calling variants

Parabricks generates a BAM that is 100% equivalent to the one from the CPU-only solution, and its VCF has over 99.99% concordance with the CPU VCF.

| Benchmark | Coverage | BAM | VCF |
|-----------|----------|-----|-----|
| S2 | 25x | 100% | 99.998% |
| NA12878 | 43x | 100% | 99.996% |
| NIST 12878 | 41x | 100% | 99.996% |

Table 3
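The post does not spell out how VCF concordance was measured. One common reading, assumed here, is the percentage of variants in the baseline call set that also appear, with the same chromosome, position, reference allele, and alternate allele, in the accelerated call set. The sketch below implements that reading on a few made-up records; the records are toy data for illustration only and the function names are hypothetical.

```python
def variant_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # The first five VCF columns are CHROM, POS, ID, REF, ALT.
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def concordance(baseline, candidate):
    """Percentage of baseline variants also found in candidate (assumed definition)."""
    if not baseline:
        raise ValueError("baseline call set is empty")
    return 100.0 * len(baseline & candidate) / len(baseline)


HEADER = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]

# Toy records, not real benchmark data.
BASELINE_VCF = HEADER + [
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t250\t.\tC\tT\t50\tPASS\t.",
    "chr2\t75\t.\tG\tGA\t50\tPASS\t.",
    "chr2\t900\t.\tT\tC\t50\tPASS\t.",
]
CANDIDATE_VCF = HEADER + BASELINE_VCF[2:5]  # first three variants only

if __name__ == "__main__":
    score = concordance(variant_keys(BASELINE_VCF), variant_keys(CANDIDATE_VCF))
    print(f"concordance: {score:.3f}%")
```

A site-level comparison like this ignores genotype and annotation fields; a stricter check would compare those as well.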

Features of Parabricks software

  • 30-35 times faster analysis: Compared to a CPU-only solution, Parabricks accelerates secondary analysis by more than an order of magnitude.
  • 100% Deterministic and Reproducible: Regardless of platform and the number or type of resources, Parabricks software generates exactly the same results on every execution.
  • Equivalent Results: The Parabricks pipeline generates results equivalent to those of the reference Broad Institute GATK 4 best practices pipeline, as the same algorithm is used.
  • Up to Date Support of All Tool Versions: Parabricks accelerated software supports multiple versions of BWA-Mem, Picard, and GATK and will support all future versions of these tools.
  • Visualization: Parabricks generates several key visualizations in real time while performing secondary analysis, which can improve the user's understanding of the data.
  • Single Node Execution: The entire pipeline is run using one computing node and does not incur any overhead of distributing data and work across multiple servers.
  • Turnkey Solution: Parabricks software runs on standard CPU and GPU nodes available on the cloud or on-premise, and requires no additional setup steps by the user.
  • On-Premise and Cloud: Parabricks software can run on local servers, AWS, Google Cloud, and Azure.
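
The reproducibility claim in the list above lends itself to a simple check: hash the output of two runs and compare the digests. The following is a minimal sketch using only the Python standard library. It assumes the output files carry no run-specific metadata such as timestamps, and the file names in the demo are placeholders rather than actual Parabricks outputs.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(paths):
    """True when every file in paths has the same SHA-256 digest."""
    if not paths:
        raise ValueError("no output files given")
    return len({sha256_of(path) for path in paths}) == 1


if __name__ == "__main__":
    # Demo with placeholder files standing in for the outputs of two runs.
    with tempfile.TemporaryDirectory() as workdir:
        run_a = os.path.join(workdir, "run_a.vcf")
        run_b = os.path.join(workdir, "run_b.vcf")
        for path in (run_a, run_b):
            with open(path, "w") as handle:
                handle.write("chr1\t100\t.\tA\tG\n")
        print("identical:", outputs_identical([run_a, run_b]))
```

A byte-level comparison is the strictest form of this check; if the outputs embed run-specific headers, comparing the records alone would be the appropriate relaxation.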

Please contact info@parabricks.com for further inquiries.