synergy.cs.vt.edu
Using MPI One-sided Communication to Accelerate Bioinformatics Applications
Hao Wang ([email protected])
Department of Computer Science, Virginia Tech
Next-Generation Sequencing (NGS) Data Analysis
– DNA is isolated from normal tissue and blood
– DNA is fragmented and the captured DNA is washed and amplified
– DNA is sequenced and analyzed
– DNA is used for clinical trials, e.g., disease detection, personalized medicine, etc.
Hao Wang, MUG'17, Aug. 14-16, 2017
NGS Data Analysis
• Next-Generation Sequencing (NGS) has significantly reduced the cost per genome, and data analysis (rather than sequencing) is becoming the bottleneck
• The NGS data analysis market is booming and is predicted to exceed $1 billion by 2024
• NIH, “DNA Sequencing Costs: Data”, https://www.genome.gov/27541954/dna-sequencing-costs-data/
• Grand View Research, “NGS Data Analysis Market Analysis 2024”, http://www.grandviewresearch.com/industry-analysis/next-generation-sequencing-ngs-data-analysis-market
Irregular NGS Data Analysis Applications
• NGS applications can be characterized by
– Irregular memory accesses
– Irregular control flows
– Irregular communication patterns
• Many such applications exhibit irregularities
– Basic Local Alignment Search Tool (BLAST) for sequence search
• Heuristic algorithms
– BWA, Bowtie1/2, and SOAPaligner for short read mapping
• Compressed data structures
These applications have irregular communication patterns!
Outline
• Background
• Sequence Search
• Using one-sided communications for sequence search
• Evaluation (early stage)
• Summary and Future Work
Sequence Search
• Search for similarities between a query sequence and database sequences (i.e., subject sequences)
Query Sequence:
ADGIFAIDQFTKVLLNYTGHITWNPPAIFKSYCEIIVTYFPFDEQNCSMKLG…..

Database (Subject Sequences):
>gi|1703080|sp|P14144.2|ACHA_NATTE ADGIFAIDQFTKVLLNYTGHITWNPPAIFKSYCEIIVTYFPFDEQNCSMKLGTRTYDGTV.......
>gi|113073|sp|P14143.1|ACHA_NATTE MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFD….
>gi|113075|sp|P25108.1|ACHA_RAT MARVTVQDAVEKIGNRFDLVLVAARRARQIQSGGKDALVPEENDKVTVIALREIEEGLITNQILDVRERQEQQEQ….
>gi|113072|sp|P04756.1|ACHA_MOUSE ADGDFAIVKFTKVLLQYTGHITWTPPAIFKSYCEIIVTHFPFDEQNCSMKLGTWTYDGSV……
……

Output:
gi|1703080|sp|P14144.2|…. Score = 278.1
gi|113072|sp|P04756.1|… Score = 225.3
gi|113071|sp|P02708.2|… Score = 223.0
….
MPI Implementation
• Inter-node parallel implementation using MPI
– Partition the database D into j subsets D0, D1, …, Dj
– For a query sequence qi, search qi on each database subset Dj in parallel and get the local search result Rij
– Merge and sort all local search results Ri0 to Rij to get the final result Ri
[Figure: query sequence qi is sent to MPI Rank 0 through Rank j; each Rank k searches qi on Dk and gets Rik; {Ri0, Ri1, …, Rij} are then merged and sorted into Ri]
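The merge-and-sort step above can be sketched in Python; the hit format (subject id, score) and the score values mirror the example output on the previous slide but are otherwise illustrative:

```python
# Local results Ri0..Rij from the ranks, each a list of (subject_id, score).
# The final result Ri is their concatenation sorted by descending score.

def merge_and_sort(local_results):
    return sorted((hit for r in local_results for hit in r),
                  key=lambda hit: hit[1], reverse=True)

Ri0 = [("gi|1703080|sp|P14144.2|", 278.1), ("gi|113071|sp|P02708.2|", 223.0)]
Ri1 = [("gi|113072|sp|P04756.1|", 225.3)]
Ri = merge_and_sort([Ri0, Ri1])
```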
MPI Implementation
• Inter-node parallel implementation using MPI
– Partition the database D into j subsets D0, D1, …, Dj
– For a query sequence qi, search qi on each database subset Dj in parallel and get the local search result Rij
– Merge and sort all local search results Ri0 to Rij to get the final result Ri
[Figure: query batches {q0, q1, …, qi-1}, {qi, qi+1, …, q2i-1}, {q2i, q2i+1, …, q3i-1} are processed in a pipeline; each rank searches a batch on its subset Dj and gets {R0j, R1j, …, Ri-1j}; the results of each batch are then merged and sorted]
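The batching above amounts to slicing the query stream into fixed-size batches; a minimal sketch (the batch size and query names are arbitrary stand-ins):

```python
def make_batches(queries, size):
    """Split the query stream into batches {q0..qi-1}, {qi..q2i-1}, ..."""
    return [queries[k:k + size] for k in range(0, len(queries), size)]

queries = ["q%d" % n for n in range(7)]
batches = make_batches(queries, 3)
# the last batch may be shorter than the batch size
```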
mpiBLAST Implementations
• Characteristics
– Both the computation time and the data size of the compute nodes are highly diverse
• A dedicated MPI process serves as the master
1. All workers send metadata, i.e., query id, search score, and data size, to the master
2. The master merges and sorts the metadata, selects a worker for IO, and notifies all workers
3. All workers send their selected local results to the IO worker
4. The IO worker finally writes the data to disk
H. Lin, et al., "Coordinating Computation and I/O in Massively Parallel Sequence Search," IEEE Transactions on Parallel and Distributed Systems, 22(4): 529-543, 2011.
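Step 2 of the protocol above, where the master merges the workers' metadata and picks the IO worker, can be sketched as follows; the tuple layout and the policy of picking the worker holding the largest total result size are assumptions for illustration:

```python
# Each worker reports metadata tuples (query_id, score, data_size).
# The master merges them, sorts by score, and selects the IO worker;
# the "largest total data size" policy here is an assumed heuristic.

def select_io_worker(metadata_by_worker):
    merged = sorted(
        (m for meta in metadata_by_worker.values() for m in meta),
        key=lambda m: m[1], reverse=True)
    totals = {w: sum(m[2] for m in meta)
              for w, meta in metadata_by_worker.items()}
    io_worker = max(totals, key=totals.get)
    return merged, io_worker

meta = {
    0: [("q0", 278.1, 4096), ("q1", 223.0, 1024)],
    1: [("q0", 225.3, 8192)],
}
merged, io_worker = select_io_worker(meta)
```

Picking the worker that already holds the most result data minimizes the amount that has to move in step 3.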
Why Redesign Sequence Search
• Bottlenecks in previous sequence search tools
– Local search
– Disk IO
• New trends in sequence search
– Local search is much faster now, e.g., DIAMOND1
– Sequence search has become one stage of the NGS workflow, and search results reside in memory for reuse2
1. B. Buchfink, C. Xie, and D. Huson, "Fast and sensitive protein alignment using DIAMOND," Nature Methods 12, 59-60 (2015).
2. Genome Analysis Toolkit, https://software.broadinstitute.org/gatk/
Data communication is becoming a new performance bottleneck!
Outline
• Background
• Sequence Search
• Using one-sided communications for sequence search
• Evaluation (early stage)
• Summary and Future Work
Using MPI One-sided for Sequence Search
• Benefits of using MPI one-sided communication
– Express irregular communication patterns more economically
– Overlap communication and computation more efficiently
– Bypass the tag matching of two-sided communication
• Basic ideas
– Use MPI one-sided communications (put and get) to overlap communication and computation
– No dedicated MPI process is needed as a master to coordinate disk IO
Challenges
• Three one-sided synchronization modes: Fence (active target), Post-Start-Complete-Wait (active target), and Lock/Unlock (passive target). Which is better?
MPI Windows for Metadata and Local Results
[Figure: each MPI process (Rank 0 … Rank n-1) keeps a cyclic buffer pool; each buffer <addr, size> holds data for one query batch (qi, qi+1, qi+2) and is registered to an MPI window; e.g., Buf 0 on Worker0 and Buf 1 on Worker1 hold data for qi]
• Each MPI process registers two types of cyclic buffers to MPI windows, one for metadata and one for local search results
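The cyclic reuse of these registered buffers can be modeled with a modular slot index; a minimal sketch (the pool size and payloads are arbitrary stand-ins):

```python
class CyclicBufferPool:
    """Fixed pool of buffers reused round-robin, one slot per in-flight batch."""
    def __init__(self, nslots):
        self.slots = [None] * nslots

    def slot_for(self, batch_id):
        # batch b always lands in slot b mod pool size
        return batch_id % len(self.slots)

    def write(self, batch_id, payload):
        self.slots[self.slot_for(batch_id)] = payload

pool = CyclicBufferPool(3)
for b in range(5):
    pool.write(b, "results of batch %d" % b)
# slot 0 has been reused by batch 3, slot 1 by batch 4
```

Because the slots are fixed in memory, they can be registered to an MPI window once and targeted by remote puts for every batch.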
Use MPI_Put to Write Metadata
[Figure: after the local search, the per-query buffers hold local results, e.g., Buf 0 on Worker0 for qi (gi|1703080|sp|P14144.2|… Score = 278.7, …) and Buf 1 on Worker1 for qi (gi|1923080|sp|P17247.7|… Score = 282.1, …); each rank puts its metadata to the other ranks]
• After the local search of a batch of query sequences, an MPI process writes its metadata to the other processes with MPI_Put
Wait for the MPI_Put on the Previous Batch to Finish
[Figure: the same cyclic buffer pools; the metadata of the previous batch is merged and sorted on every rank]
• After issuing MPI_Put for the current batch, MPI processes will
– Wait for the MPI_Put of the previous batch to finish, e.g., via MPI_Win_fence()
– Merge and sort the metadata of the previous batch and select the final results (metadata)
One Process Gathers Data with MPI_Get
[Figure: the selected process gets the needed local results from the buffer pools of the other ranks with MPI_Get]
• One MPI process is selected to merge the final results, e.g., the one holding the most final results, and it gathers data from the other processes with MPI_Get
• The other processes continue the computation of the next batch
Summary of Our Method
create MPI windows for metadata and local results
for batch 0 : n-1
    Local search: run the sequence search on the local partition of the database
    Write metadata: MPI_Put() the metadata of the current batch to the other processes
    Wait: wait for the MPI_Put() of the previous batch to finish
    Merge & sort: merge and sort the metadata
    if (I’m the selected process)
        Get local results: MPI_Get() local results from all processes
        Generate output: sort and write the final results of the previous batch
    endif
endfor
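The loop above can be sketched as a single-process simulation in Python. The MPI windows are stood in for by plain dicts, the put/fence pair is simulated by copying the previous batch's metadata, and `local_search` with its toy scoring is a hypothetical stand-in, so this illustrates only the control flow, not real one-sided communication:

```python
# Single-process sketch of the batched pipeline: double-buffered metadata
# "windows", overlap of search (current batch) with merge (previous batch).
# All data and helper names here are hypothetical.

def local_search(batch, partition):
    # Stand-in for the real alignment; the score is just the query length.
    return [(q, len(q)) for q in batch]

def run_pipeline(batches, partitions):
    outputs = []
    meta_windows = [{}, {}]          # double-buffered metadata windows
    prev = None                      # completed metadata of the previous batch
    for b, batch in enumerate(batches):
        win = meta_windows[b % 2]    # alternate windows per batch
        win.clear()
        for rank, part in enumerate(partitions):
            win[rank] = local_search(batch, part)   # simulated MPI_Put
        if prev is not None:
            merged = sorted((hit for r in prev.values() for hit in r),
                            key=lambda h: h[1], reverse=True)
            outputs.append(merged)   # merge & sort the previous batch
        prev = dict(win)             # simulated fence: previous batch done
    merged = sorted((hit for r in prev.values() for hit in r),
                    key=lambda h: h[1], reverse=True)
    outputs.append(merged)           # drain the final batch
    return outputs

results = run_pipeline([["ACGT", "AC"], ["ACGTACGT"]], ["D0", "D1"])
```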
Implementations and Optimizations
• Double buffering
– Use two windows for the metadata so that, after issuing MPI_Put() for the current batch, a process can wait for the previous batch to finish
• Different types of synchronization methods
– Fence mechanism: MPI_Win_fence()
– Lock/unlock mechanism: MPI_Win_flush()
– PSCW mechanism: MPI_Win_wait()
Implementations and Optimizations
• One window vs. one window per rank
– Create one window per process to avoid unnecessary waiting
[Figure: a single MPI window; every rank registers its buffers to the one window]
Implementations and Optimizations
• One window vs. one window per rank
– Create one window per process to avoid unnecessary waiting
[Figure: n MPI windows; each rank's buffer is registered to its own window]
Outline
• Background
• Sequence Search
• Using one-sided communications for sequence search
• Evaluation (early stage)
• Summary and Future Work
Experimental Setups
• Hardware
– Up to 16 compute nodes, each with two Intel Xeon E5-2670 CPUs (Sandy Bridge EP, 16 cores in total)
– 64 GB main memory
– Mellanox ConnectX-3 MT27500
• Datasets
– env_nr and nr databases from NCBI GenBank
– Randomly select 10000 sequences from the target database as query sequences
• Data partitions
– Partition the databases evenly across the compute nodes
• Software
– DIAMOND (C++ threads) + MPI
– MVAPICH2 (version 2.2)
Breakdown
• Different MPI processes contribute different sizes of data to the final results
• Different MPI processes have different computation time in each batch
Setup: running on 8 nodes, 10000 query sequences in 10 batches
[Figures: contributed data size (MB) and computation time (sec) per rank (rank0-rank7) over batches 1-10]
Overall Performance on 8 nodes
• MPI_Win_fence() with multiple windows is best
– 1.4x and 1.32x speedup over two-sided w/ and w/o master, respectively
[Figure: normalized execution time (lower is better) vs. batch size (1000-3000) for Fence_mwins, Fence_1win, PSCW_mwins, PSCW_1win, LockFlush_mwins, LockFlush_1win, SendRecv_w/_master, and SendRecv_w/o_master]
Overall Performance on 16 nodes
• MPI_Win_fence() with multiple windows is best
– 1.42x and 1.19x speedup over two-sided w/ and w/o master, respectively
[Figure: normalized execution time (lower is better) vs. batch size (1000-3000) for Fence_mwins, Fence_1win, PSCW_mwins, PSCW_1win, LockFlush_mwins, LockFlush_1win, SendRecv_w/_master, and SendRecv_w/o_master]
Observations
• MPI fence exhibits better performance than MPI flush
– The metadata exchange has an all-to-all communication pattern
Summary and Future Work
• We use MPI one-sided communication to accelerate sequence search on InfiniBand clusters
• The experimental results show up to 1.42x speedup over two-sided communication
• We are analyzing performance numbers of different one-sided synchronization mechanisms
• We are collecting more application performance numbers for mpiBLAST, DIAMOND, and pBWA
• We would like to check application performance with MVAPICH2-2.3b