synergy.cs.vt.edu
Using MPI One-sided Communication to Accelerate Bioinformatics Applications
Hao Wang ([email protected])
Department of Computer Science, Virginia Tech
Next-Generation Sequencing (NGS) Data Analysis
– DNA is isolated from normal tissue and blood
– DNA is fragmented and the captured DNA is washed and amplified
– DNA is sequenced and analyzed
– DNA is used for clinical trials, e.g., disease detection, personalized medicine, etc.
Hao Wang, MUG'17, Aug. 14-16, 2017
NGS Data Analysis
• Next-Generation Sequencing (NGS) has significantly reduced the cost per genome, and data analysis (rather than sequencing) is becoming the bottleneck
• The NGS data analysis market is booming and is predicted to exceed $1 billion by 2024
• NIH, “DNA Sequencing Costs: Data”, https://www.genome.gov/27541954/dna-sequencing-costs-data/
• Grand View Research, “NGS Data Analysis Market Analysis 2024”, http://www.grandviewresearch.com/industry-analysis/next-generation-sequencing-ngs-data-analysis-market
Irregular NGS Data Analysis Applications
• NGS applications can be characterized by
– Irregular memory accesses
– Irregular control flows
– Irregular communication patterns
• Many such applications exhibit irregularities
– Basic Local Alignment Search Tool (BLAST) for sequence search
• Heuristic algorithms
– BWA, Bowtie1/2, and SOAPaligner for short read mapping
• Compressed data structures
These applications have irregular communication patterns!
Outline
• Background
• Sequence Search
• Using one-sided communications for sequence search
• Evaluation (early stage)
• Summary and Future Work
Sequence Search
• Search for similarities between a query sequence and database sequences (i.e., subject sequences)
Query Sequence:
ADGIFAIDQFTKVLLNYTGHITWNPPAIFKSYCEIIVTYFPFDEQNCSMKLG…..

Database (Subject Sequences):
>gi|1703080|sp|P14144.2|ACHA_NATTE ADGIFAIDQFTKVLLNYTGHITWNPPAIFKSYCEIIVTYFPFDEQNCSMKLGTRTYDGTV.......
>gi|113073|sp|P14143.1|ACHA_NATTE MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFD….
>gi|113075|sp|P25108.1|ACHA_RAT MARVTVQDAVEKIGNRFDLVLVAARRARQIQSGGKDALVPEENDKVTVIALREIEEGLITNQILDVRERQEQQEQ….
>gi|113072|sp|P04756.1|ACHA_MOUSE ADGDFAIVKFTKVLLQYTGHITWTPPAIFKSYCEIIVTHFPFDEQNCSMKLGTWTYDGSV……
……

Output:
gi|1703080|sp|P14144.2|…. Score = 278.1
gi|113072|sp|P04756.1|… Score = 225.3
gi|113071|sp|P02708.2|… Score = 223.0
….
MPI Implementation
• Inter-node parallel implementation using MPI
– Partition the database D into j subsets D0, D1, …, Dj
– For a query sequence qi, search qi on each database subset Dj in parallel and get the local search result Rij
– Merge and sort all local search results Ri0 to Rij to get the final result Ri
[Figure: query sequence qi is sent to MPI Rank 0 through Rank j; each Rank k searches qi on Dk and gets Rik; {Ri0, Ri1, …, Rij} are then merged and sorted into Ri]
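The merge-and-sort step above can be sketched in Python; the hit format (subject id, score) and the score values mirror the example output on the previous slide but are otherwise illustrative:

```python
# Local results Ri0..Rij from the ranks, each a list of (subject_id, score).
# The final result Ri is their concatenation sorted by descending score.

def merge_and_sort(local_results):
    return sorted((hit for r in local_results for hit in r),
                  key=lambda hit: hit[1], reverse=True)

Ri0 = [("gi|1703080|sp|P14144.2|", 278.1), ("gi|113071|sp|P02708.2|", 223.0)]
Ri1 = [("gi|113072|sp|P04756.1|", 225.3)]
Ri = merge_and_sort([Ri0, Ri1])
```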
MPI Implementation
• Inter-node parallel implementation using MPI
– Partition the database D into j subsets D0, D1, …, Dj
– For a query sequence qi, search qi on each database subset Dj in parallel and get the local search result Rij
– Merge and sort all local search results Ri0 to Rij to get the final result Ri
[Figure: query batches {q0, q1, …, qi-1}, {qi, qi+1, …, q2i-1}, {q2i, q2i+1, …, q3i-1} are processed in a pipeline; each rank searches a batch on its subset Dj and gets {R0j, R1j, …, Ri-1j}; the results of each batch are then merged and sorted]
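The batching above amounts to slicing the query stream into fixed-size batches; a minimal sketch (the batch size and query names are arbitrary stand-ins):

```python
def make_batches(queries, size):
    """Split the query stream into batches {q0..qi-1}, {qi..q2i-1}, ..."""
    return [queries[k:k + size] for k in range(0, len(queries), size)]

queries = ["q%d" % n for n in range(7)]
batches = make_batches(queries, 3)
# the last batch may be shorter than the batch size
```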
mpiBLAST Implementations
• Characteristics
– Both the computation time and the data size of the compute nodes are highly diverse
• A dedicated MPI process serves as the master
1. All workers send metadata, i.e., query id, search score, and data size, to the master
2. The master merges and sorts the metadata, selects a worker for IO, and notifies all workers
3. All workers send their selected local results to the IO worker
4. The IO worker finally writes the data to disk
H. Lin, et al., "Coordinating Computation and I/O in Massively Parallel Sequence Search," IEEE Transactions on Parallel and Distributed Systems, 22(4): 529-543, 2011.
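Step 2 of the protocol above, where the master merges the workers' metadata and picks the IO worker, can be sketched as follows; the tuple layout and the policy of picking the worker holding the largest total result size are assumptions for illustration:

```python
# Each worker reports metadata tuples (query_id, score, data_size).
# The master merges them, sorts by score, and selects the IO worker;
# the "largest total data size" policy here is an assumed heuristic.

def select_io_worker(metadata_by_worker):
    merged = sorted(
        (m for meta in metadata_by_worker.values() for m in meta),
        key=lambda m: m[1], reverse=True)
    totals = {w: sum(m[2] for m in meta)
              for w, meta in metadata_by_worker.items()}
    io_worker = max(totals, key=totals.get)
    return merged, io_worker

meta = {
    0: [("q0", 278.1, 4096), ("q1", 223.0, 1024)],
    1: [("q0", 225.3, 8192)],
}
merged, io_worker = select_io_worker(meta)
```

Picking the worker that already holds the most result data minimizes the amount that has to move in step 3.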
Why Redesign Sequence Search
• Bottlenecks in previous sequence search tools
– Local search
– Disk IO
• New trends in sequence search
– Local search is much faster now, e.g., DIAMOND1
– Sequence search has become one stage of the NGS workflow, and search results reside in memory for reuse2
1. B. Buchfink, C. Xie, and D. Huson, "Fast and sensitive protein alignment using DIAMOND," Nature Methods 12, 59-60 (2015).
2. Genome Analysis Toolkit, https://software.broadinstitute.org/gatk/
Data communication is becoming a new performance bottleneck!
Outline
• Background
• Sequence Search
• Using one-sided communications for sequence search
• Evaluation (early stage)
• Summary and Future Work
Using MPI One-sided for Sequence Search
• Benefits of using MPI one-sided communication
– Express irregular communication patterns more economically
– Overlap communication and computation more efficiently
– Bypass the tag matching of two-sided communication
• Basic ideas
– Use MPI one-sided communications (put and get) to overlap communication and computation
– No dedicated MPI process is needed as a master to coordinate disk IO
Challenges
• Three one-sided synchronization modes: Fence (active target), Post-Start-Complete-Wait (active target), and Lock/Unlock (passive target). Which is better?
MPI Windows for Metadata and Local Results
[Figure: each MPI process (Rank 0 … Rank n-1) keeps a cyclic buffer pool; each buffer <addr, size> holds data for one query batch (qi, qi+1, qi+2) and is registered to an MPI window; e.g., Buf 0 on Worker0 and Buf 1 on Worker1 hold data for qi]
• Each MPI process registers two types of cyclic buffers to MPI windows, one for metadata and one for local search results
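The cyclic reuse of these registered buffers can be modeled with a modular slot index; a minimal sketch (the pool size and payloads are arbitrary stand-ins):

```python
class CyclicBufferPool:
    """Fixed pool of buffers reused round-robin, one slot per in-flight batch."""
    def __init__(self, nslots):
        self.slots = [None] * nslots

    def slot_for(self, batch_id):
        # batch b always lands in slot b mod pool size
        return batch_id % len(self.slots)

    def write(self, batch_id, payload):
        self.slots[self.slot_for(batch_id)] = payload

pool = CyclicBufferPool(3)
for b in range(5):
    pool.write(b, "results of batch %d" % b)
# slot 0 has been reused by batch 3, slot 1 by batch 4
```

Because the slots are fixed in memory, they can be registered to an MPI window once and targeted by remote puts for every batch.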
Use MPI_Put to Write Metadata
[Figure: after the local search, the per-query buffers hold local results, e.g., Buf 0 on Worker0 for qi (gi|1703080|sp|P14144.2|… Score = 278.7, …) and Buf 1 on Worker1 for qi (gi|1923080|sp|P17247.7|… Score = 282.1, …); each rank puts its metadata to the other ranks]
• After the local search of a batch of query sequences, an MPI process writes its metadata to the other processes with MPI_Put
Wait for the MPI_Put on the Previous Batch to Finish
[Figure: the same cyclic buffer pools; the metadata of the previous batch is merged and sorted on every rank]
• After issuing MPI_Put for the current batch, MPI processes will
– Wait for the MPI_Put of the previous batch to finish, e.g., via MPI_Win_fence()
– Merge and sort the metadata of the previous batch and select the final results (metadata)
One Process Gathers Data with MPI_Get
[Figure: the selected process gets the needed local results from the buffer pools of the other ranks with MPI_Get]
• One MPI process is selected to merge the final results, e.g., the one holding the most final results, and it gathers data from the other processes with MPI_Get
• The other processes continue the computation of the next batch
Summary of Our Method
create MPI windows for metadata and local results
for batch 0 : n-1
    Local search: run the sequence search on the local partition of the database
    Write metadata: MPI_Put() the metadata of the current batch to the other processes
    Wait: wait for the MPI_Put() of the previous batch to finish
    Merge & sort: merge and sort the metadata
    if (I’m the selected process)
        Get local results: MPI_Get() local results from all processes
        Generate output: sort and write the final results of the previous batch
    endif
endfor
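The loop above can be sketched as a single-process simulation in Python. The MPI windows are stood in for by plain dicts, the put/fence pair is simulated by copying the previous batch's metadata, and `local_search` with its toy scoring is a hypothetical stand-in, so this illustrates only the control flow, not real one-sided communication:

```python
# Single-process sketch of the batched pipeline: double-buffered metadata
# "windows", overlap of search (current batch) with merge (previous batch).
# All data and helper names here are hypothetical.

def local_search(batch, partition):
    # Stand-in for the real alignment; the score is just the query length.
    return [(q, len(q)) for q in batch]

def run_pipeline(batches, partitions):
    outputs = []
    meta_windows = [{}, {}]          # double-buffered metadata windows
    prev = None                      # completed metadata of the previous batch
    for b, batch in enumerate(batches):
        win = meta_windows[b % 2]    # alternate windows per batch
        win.clear()
        for rank, part in enumerate(partitions):
            win[rank] = local_search(batch, part)   # simulated MPI_Put
        if prev is not None:
            merged = sorted((hit for r in prev.values() for hit in r),
                            key=lambda h: h[1], reverse=True)
            outputs.append(merged)   # merge & sort the previous batch
        prev = dict(win)             # simulated fence: previous batch done
    merged = sorted((hit for r in prev.values() for hit in r),
                    key=lambda h: h[1], reverse=True)
    outputs.append(merged)           # drain the final batch
    return outputs

results = run_pipeline([["ACGT", "AC"], ["ACGTACGT"]], ["D0", "D1"])
```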
Implementations and Optimizations
• Double buffering
– Use two windows for the metadata so that, after issuing MPI_Put() for the current batch, a process can wait for the previous batch to finish
• Different types of synchronization methods
– Fence mechanism: MPI_Win_fence()
– Lock/unlock mechanism: MPI_Win_flush()
– PSCW mechanism: MPI_Win_wait()
Implementations and Optimizations
• One window vs. one window per rank
– Create one window per process to avoid unnecessary waiting
[Figure: a single MPI window; every rank registers its buffers to the one window]
Implementations and Optimizations
• One window vs. one window per rank
– Create one window per process to avoid unnecessary waiting
[Figure: n MPI windows; each rank's buffer is registered to its own window]
Outline
• Background
• Sequence Search
• Using one-sided communications for sequence search
• Evaluation (early stage)
• Summary and Future Work
Experimental Setups
• Hardware
– Up to 16 compute nodes, each with two Intel Xeon E5-2670 CPUs (Sandy Bridge EP, 16 cores in total)
– 64 GB main memory
– Mellanox ConnectX-3 MT27500
• Datasets
– env_nr and nr databases from NCBI GenBank
– Randomly select 10000 sequences from the target database as query sequences
• Data partitions
– Partition the databases evenly across the compute nodes
• Software
– DIAMOND (C++ threads) + MPI
– MVAPICH2 (version 2.2)
Breakdown
• Different MPI processes contribute different sizes of data to the final results
• Different MPI processes have different computation time in each batch
Setup: running on 8 nodes, 10000 query sequences in 10 batches
[Figures: contributed data size (MB) and computation time (sec) per rank (rank0-rank7) over batches 1-10]
Overall Performance on 8 nodes
• MPI_Win_fence() with multiple windows is best
– 1.4x and 1.32x speedup over two-sided w/ and w/o master, respectively
[Figure: normalized execution time (lower is better) vs. batch size (1000-3000) for Fence_mwins, Fence_1win, PSCW_mwins, PSCW_1win, LockFlush_mwins, LockFlush_1win, SendRecv_w/_master, and SendRecv_w/o_master]
Overall Performance on 16 nodes
• MPI_Win_fence() with multiple windows is best
– 1.42x and 1.19x speedup over two-sided w/ and w/o master, respectively
[Figure: normalized execution time (lower is better) vs. batch size (1000-3000) for Fence_mwins, Fence_1win, PSCW_mwins, PSCW_1win, LockFlush_mwins, LockFlush_1win, SendRecv_w/_master, and SendRecv_w/o_master]
Observations
• MPI fence exhibits better performance than MPI flush
– The metadata exchange has an all-to-all communication pattern
Summary and Future Work
• We use MPI one-sided communication to accelerate sequence search on InfiniBand clusters
• The experimental results show up to 1.42x speedup over two-sided communication
• We are analyzing performance numbers of different one-sided synchronization mechanisms
• We are collecting more application performance numbers for mpiBLAST, DIAMOND, and pBWA
• We would like to check application performance with MVAPICH2-2.3b