
Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs

Anas Abu-Doleh1,2, Erik Saule1, Kamer Kaya1 and Ümit V. Çatalyürek1,2

1 Department of Biomedical Informatics
2 Department of Electrical and Computer Engineering

The Ohio State University

A. Abu-Doleh, “Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs”, ACM-BCB13, 23 Sep 2013

Outline

I. Introduction

• Motivation

• Contribution

• Related Work

II. Masher Workflow

• Index Construction

• Mapping

III. Experiments and Results

IV. Conclusion and Future Work

Motivation

• The read length of next generation sequencing (NGS) devices is continuously increasing, so there is wide interest in efficient and accurate mapping of long(er) reads.

• Utilizing the powerful capabilities of GPUs to improve the mapping of NGS reads.


Related Work and Contributions

Related Work

• Burrows-Wheeler Transform (BWT)
    o Bowtie2
    o CUSHAW2
    o SOAP3-dp

• Hash Indexing
    o SeqAlto
    o BFAST

Contribution

• A novel hash-based indexing technique by which:
    o For large genomes, the memory footprint is small enough to be stored in a restricted-memory device such as a GPU.
    o The index data structure is more suitable for GPU parallelization.


Masher workflow


Index Construction

Processing genome file

• Convert base pairs to 2-bit format.

• Replace each N with A.


Indexing

• Seed length, LS

• Indexing step size, ∆G
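The two preprocessing steps and the seed sampling above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the code assignment A=0, C=1, G=2, T=3 and the helper names `seed_hash` and `indexed_seeds` are hypothetical and do not come from the slides, which only state the 2-bit format, the N to A replacement, the seed length LS and the step ∆G.

```python
# Assumed 2-bit codes; N is stored as A, as stated on the slide.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 0}

def seed_hash(sequence, start, seed_len):
    """Pack seed_len bases starting at `start` into one integer, 2 bits per base."""
    value = 0
    for base in sequence[start:start + seed_len]:
        value = (value << 2) | CODE[base]
    return value

def indexed_seeds(genome, seed_len, step):
    """Yield (position, seed) for the seeds taken every `step` bases (the role of ∆G)."""
    for pos in range(0, len(genome) - seed_len + 1, step):
        yield pos, seed_hash(genome, pos, seed_len)
```

With LS = 15 each seed fits in 30 bits, so the packed value can be used directly as an array index.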


Index Construction

Index arrays - Locations array

• Genome length, N
• Stores the indexed locations in order for each seed
• Location array size = $\log_2(N) \times N/\Delta_G$
• Size ≈ 2.9 GB for hg19 with ∆G = 4


Index Construction

Index arrays - Count array

• Stores the number of occurrences for each seed
• Size = $4^{L_S} \times \log_2(N/\Delta_G)$
• Stores at most 255 locations per seed. If a seed appears more than 255 times, a uniform selection of its locations is kept.
• Size = 1 GB for LS = 15.


Index Construction

Index arrays - Ptrs array

• Stores the starting index in the locs array for a group of seeds
• Seed group size, δ. Group id = seed/δ
• Size = $4^{L_S}/\delta \times \log_2(N/\Delta_G)$
• Size = 0.5 GB for δ = 8, ∆G = 4.


Index Construction

Index arrays

• LS = 15, ∆G = 4, δ = 8, hg19
• Total indexing arrays size = 2.9 + 1 + 0.5 = 4.4 GB.
• Space–time tradeoff
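The sizes quoted for the three arrays can be checked with a short calculation. The sketch below makes several assumptions that are not stated on the slides: the hg19 length is taken as roughly 3.1 × 10^9 bases, each entry is rounded up to a whole number of bytes, and "GB" is read as 2^30 bytes. The function names are hypothetical.

```python
import math

def bytes_per_entry(max_value):
    """Smallest whole number of bytes able to hold values up to max_value (assumed rounding)."""
    bits = max(1, math.ceil(math.log2(max_value + 1)))
    return math.ceil(bits / 8)

def index_sizes(genome_len, seed_len, step, group_size, max_count=255):
    """Return the sizes in bytes of the locs, count and ptrs arrays."""
    n_locs = genome_len // step        # one stored location per indexed position
    n_seeds = 4 ** seed_len            # number of distinct seeds of length LS
    locs = n_locs * bytes_per_entry(genome_len)
    count = n_seeds * bytes_per_entry(max_count)
    ptrs = (n_seeds // group_size) * bytes_per_entry(n_locs)
    return locs, count, ptrs

# Assumed genome length of about 3.1e9 bases for hg19.
locs, count, ptrs = index_sizes(3_100_000_000, seed_len=15, step=4, group_size=8)
GIB = 2 ** 30
print(round(locs / GIB, 1), count / GIB, ptrs / GIB)
```

Under these assumptions the locations array takes 4 bytes per entry, the count array 1 byte (because of the 255 cap), and the ptrs array 4 bytes, which is consistent with the 2.9 GB, 1 GB and 0.5 GB figures. Raising ∆G or δ shrinks the index at the cost of more work per lookup, which is the space–time tradeoff the slide mentions.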


Index Construction

Accessing the Index

Count array
• Assume seed = i + 4
• Belongs to seed group (i, i + δ − 1), with δ = 8 and i mod δ = 0.
• Seed index in group, k = (i + 4) mod δ
• $C_k$ = count[i + 4], with k = 4

Ptrs array
• j = seed/δ
• Locs group index (Lgi) = ptrs[j]
• Locs seed index (Lsi) = Lgi + $\sum_{n=0}^{k-1} C_n$

Locs array
• Extract locations from (Lsi, Lsi + $C_k$ − 1)
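The access pattern above can be illustrated with a small Python sketch that builds the three arrays for a toy set of seeds and then looks one up. It is a minimal illustration, not Masher's GPU code: the function names are hypothetical, plain Python lists stand in for the packed arrays, and the 255-location cap is left out for brevity.

```python
def build_index(pairs, n_seeds, group_size):
    """pairs: list of (location, seed). Returns the count, ptrs and locs arrays."""
    count = [0] * n_seeds
    for _, seed in pairs:
        count[seed] += 1
    # locs keeps the locations ordered by seed, then by position.
    locs = [loc for loc, _ in sorted(pairs, key=lambda p: (p[1], p[0]))]
    # ptrs[j] is the offset in locs where seed group j starts.
    ptrs, running = [], 0
    for first in range(0, n_seeds, group_size):
        ptrs.append(running)
        running += sum(count[first:first + group_size])
    return count, ptrs, locs

def lookup(seed, count, ptrs, locs, group_size):
    """Return the stored locations of `seed`."""
    j, k = divmod(seed, group_size)          # group id and seed index in group
    lgi = ptrs[j]                            # Locs group index
    first = j * group_size                   # first seed of the group (i on the slide)
    lsi = lgi + sum(count[first:first + k])  # skip the k seeds that precede ours
    return locs[lsi:lsi + count[seed]]

pairs = [(0, 5), (4, 2), (8, 5), (12, 9), (16, 2), (20, 5)]
count, ptrs, locs = build_index(pairs, n_seeds=16, group_size=4)
print(lookup(5, count, ptrs, locs, group_size=4))
```

Storing one pointer per group of δ seeds instead of one per seed is what makes the ptrs array δ times smaller; the price is the short prefix sum over at most δ − 1 counts on every lookup.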


Index Construction

[Figure: cumulative distribution of seed counts, Pr(count <= x), for x from 1 to 56; the vertical axis runs from 0.5 to 1.]


Mapping

Seed & hash

• Read step size, ∆R
• Read length, LR
• Nseeds = ∆G × (LR − LS)/∆R

Locate candidate alignment locations (CALs)

• Each thread is assigned to a specific seed.
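A sequential Python sketch may help to picture this step. Several details are assumptions rather than facts from the slides: the seed layout (∆G consecutive offsets at every ∆R bases, so that one of them lines up with an indexed genome position), the use of a plain dictionary as the index, and the choice to weight a candidate by the number of seeds that vote for it. On the GPU each seed would be handled by its own thread; here a loop plays that role.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 0}  # assumed 2-bit codes

def seed_hash(seq, start, seed_len):
    """Pack seed_len bases into one integer, 2 bits per base."""
    value = 0
    for base in seq[start:start + seed_len]:
        value = (value << 2) | CODE[base]
    return value

def read_seed_offsets(read_len, seed_len, genome_step, read_step):
    """Offsets of the seeds taken from a read under the assumed layout."""
    offsets = []
    for block in range(0, read_len - seed_len + 1, read_step):
        for shift in range(genome_step):
            if block + shift + seed_len <= read_len:
                offsets.append(block + shift)
    return offsets

def candidate_locations(read, index, seed_len, genome_step, read_step):
    """Count, for each candidate read start in the genome, the seeds voting for it."""
    votes = Counter()
    for off in read_seed_offsets(len(read), seed_len, genome_step, read_step):
        for loc in index.get(seed_hash(read, off, seed_len), []):
            votes[loc - off] += 1  # genome position where the read would start
    return votes
```

The number of offsets produced this way is close to ∆G × (LR − LS)/∆R, which matches the Nseeds expression on the slide up to rounding.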


Mapping

Merge CALs and weights

• When merging CALs, if two CALs are within a threshold distance, the weight of the second is added to the weight of the first.
• For efficiency, Masher consists of two main loops.
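The merging rule can be written as a single pass over the CALs sorted by location. The sketch below is one possible reading: it measures the distance from the first CAL of each merged group, which is an assumption, and the function name is hypothetical.

```python
def merge_cals(cals, threshold):
    """cals: list of (location, weight). Returns the merged CALs sorted by location.

    A CAL within `threshold` of the CAL currently being kept adds its weight
    to that CAL instead of being kept on its own.
    """
    merged = []
    for loc, weight in sorted(cals):
        if merged and loc - merged[-1][0] <= threshold:
            merged[-1][1] += weight
        else:
            merged.append([loc, weight])
    return [tuple(cal) for cal in merged]

print(merge_cals([(100, 2), (103, 1), (500, 1), (98, 1)], threshold=8))
```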


Mapping

Sorting and Batching CALs

• The CALs are sorted and grouped into batches according to their weights.
• At this stage, a filter that drops CALs with low weight can be applied.
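As a rough illustration, sorting, filtering and batching can be combined in one small function. The batch size and the weight cutoff are hypothetical parameters; the slides do not give their values.

```python
def batch_cals(cals, batch_size, min_weight=0):
    """Drop CALs lighter than min_weight, sort the rest by decreasing weight,
    and split them into consecutive batches of at most batch_size CALs."""
    kept = sorted((cal for cal in cals if cal[1] >= min_weight),
                  key=lambda cal: cal[1], reverse=True)
    return [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]

cals = [(10, 1), (20, 5), (30, 3), (40, 2), (50, 4)]
print(batch_cals(cals, batch_size=2, min_weight=2))
```

Processing the heaviest batch first means the most promising locations reach the alignment stage before the others.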


Mapping

Bounded local alignment

• A parameterized variant of the Smith-Waterman (SW) algorithm supporting affine gap scoring.
• Bounded alignment: only the matrix cells (i, j) where |i - j| <= w are visited and scored.
• Masher does two passes and sets w to 4 and 16, respectively.
• Each GPU block performs multiple SWs in parallel.
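A CPU-side sketch of a banded Smith-Waterman with affine gaps is given below to make the bound concrete. The scoring values (match 2, mismatch −3, gap open −5, gap extend −2) are placeholders chosen for the example, not Masher's parameters, and the full matrices are allocated for clarity even though only the band is filled; a tuned implementation would store just the band.

```python
NEG = float("-inf")

def banded_sw(read, ref, w, match=2, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Best local alignment score, visiting only cells (i, j) with |i - j| <= w.

    A gap of length g costs gap_open + g * gap_extend (affine gap scoring).
    """
    n, m = len(read), len(ref)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in the read
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in the reference
    best = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            # Neighbours outside the band must not contribute.
            if abs(i - (j - 1)) <= w:
                E[i][j] = max(E[i][j - 1] + gap_extend,
                              H[i][j - 1] + gap_open + gap_extend)
            if abs((i - 1) - j) <= w:
                F[i][j] = max(F[i - 1][j] + gap_extend,
                              H[i - 1][j] + gap_open + gap_extend)
            score = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + score, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

The band limits the work per alignment to about (2w + 1) × LR cells instead of LR × LR. A two-pass scheme like the one on the slide could call this function with w = 4 first and retry with w = 16 only for candidates that did not reach the score threshold; that control flow is an assumption about how the passes are used.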


Experiments and Results

Platform

• Intel Core i7-960 CPU clocked at 3.2 GHz, 4 Hyper-Threading cores, 24 GB of DDR3 memory.
• Tesla K20c GPU, 4.8 GB of global memory.
• CUDA 5.0 and GCC 4.2.4.

Human genome and simulated reads

• Human genome hg19.
• Wgsim simulator: 100K reads of length 100, 300, 500, and 1000 with error rates 2%, 4%, 6%, and 8%.


Experiments and Results

Metrics for comparison

• Sensitivity is the percentage of reads that are aligned.
• Accuracy is the percentage of reads correctly aligned to the simulator locations among all aligned reads.
• Execution time: only alignment time was measured.
• The lower bound for a valid alignment score is set to scoreLB = LR × (1.9 − 0.5 × Error Rate).
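The two quality metrics can be computed from a table of reported locations and the simulator's ground truth. In this sketch the dictionaries, the function name and the `tolerance` parameter (how far a reported location may be from the simulated one and still count as correct) are assumptions made for illustration.

```python
def sensitivity_and_accuracy(alignments, truth, tolerance=0):
    """alignments: read id -> reported location, for aligned reads only.
    truth: read id -> simulated location, for every read.
    Returns (sensitivity %, accuracy %)."""
    aligned = len(alignments)
    correct = sum(1 for read_id, loc in alignments.items()
                  if abs(loc - truth[read_id]) <= tolerance)
    sensitivity = 100.0 * aligned / len(truth)
    accuracy = 100.0 * correct / aligned if aligned else 0.0
    return sensitivity, accuracy

truth = {"r1": 100, "r2": 200, "r3": 300, "r4": 400}
found = {"r1": 100, "r2": 205, "r3": 300}
print(sensitivity_and_accuracy(found, truth))
```

Note that accuracy is relative to the aligned reads, not to all reads, so a mapper can trade sensitivity for accuracy by refusing to report uncertain alignments.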

Two modes of Masher

• Normal mode, ∆R = 0.7 LR
• Fast mode, ∆R = LR

Comparison with

• Bowtie2 (sensitive and fast), 8 threads
• SOAP3-dp
• CUSHAW2-GPU


Experiments and Results

LR = 100 bps.

[Figure: bar charts of sensitivity % and accuracy % versus error rate. Bar labels, listed in legend order:]

Sensitivity %    2%      4%      6%      8%
Masher           99.23   99.44   99.36   98.87
Masher-fast      97.55   96.81   94.5    89.83
Bowtie2          98.8    98      96      93.15
Bowtie2-fast     98      94.63   88.8    80.6
SOAP3-dp         98.5    92.5    81.7    67.7
CUSHAW2-GPU      99.9    99.9    98.8    96.2

Accuracy %       2%      4%      6%      8%
Masher           95.01   93.82   92.42   90.84
Masher-fast      95.49   94.44   93.07   91.49
Bowtie2          95.2    94      92.6    91.1
Bowtie2-fast     95      93.78   91.7    89.47
SOAP3-dp         96.2    95.5    94.5    93
CUSHAW2-GPU      95.2    94.3    93.2    91.9

Experiments and Results

LR = 500 bps.

[Figure: bar charts of sensitivity % and accuracy % versus error rate. Bar labels, listed in legend order:]

Sensitivity %    2%      4%      6%      8%
Masher           99.89   99.84   99.74   99.62
Masher-fast      99.89   99.78   99.51   98.93
Bowtie2          99.9    99.9    99.9    99.9
Bowtie2-fast     99.9    99.8    99.34   97.7
SOAP3-dp         99.2    94.3    75.3    48.6

Accuracy %       2%      4%      6%      8%
Masher           97.69   97.2    96.83   96.25
Masher-fast      97.78   97.19   96.83   96.15
Bowtie2          98.2    98      97.8    97.4
Bowtie2-fast     98.1    97.8    97.6    97
SOAP3-dp         98.8    98.5    98.3    98

Experiments and Results

LR = 1000 bps.

[Figure: bar charts of sensitivity % and accuracy % versus error rate. Bar labels, listed in legend order:]

Sensitivity %    2%      4%      6%      8%
Masher           100     100     100     100
Masher-fast      99.8    99.73   99.53   98.93
Bowtie2          99.99   99.9    99.9    99.8
Bowtie2-fast     99.9    99.9    99.8    99.5
SOAP3-dp         99.3    98.7    91.4    68.9

Accuracy %       2%      4%      6%      8%
Masher           98.5    98.28   97.86   97.41
Masher-fast      98.25   97.78   97.24   96.66
Bowtie2          98.5    98.3    97.5    96.43
Bowtie2-fast     98.5    98.1    97.3    96.1
SOAP3-dp         98.9    98.5    97.8    96

Experiments and Results

LR = 100 bps.

[Figure: bar chart of execution time (sec., log scale) versus error rate. Bar labels, listed in legend order:]

Time (sec.)      2%      4%      6%      8%
Masher           9.1     8.6     9.3     9.4
Masher-fast      5       4.9     5.3     5.5
Bowtie2          11      10      9       9
Bowtie2-fast     5       5       5       5
SOAP3-dp         7.3     8.3     6.6     6.6
CUSHAW2-GPU      9.3     10.5    14.9    11.8

Experiments and Results

LR = 500 bps.

[Figure: bar chart of execution time (sec., log scale) versus error rate. Bar labels, listed in legend order:]

Time (sec.)      2%      4%      6%      8%
Masher           15.1    19.1    24      31.7
Masher-fast      9.9     8.2     11      11.7
Bowtie2          134     160     165     180
Bowtie2-fast     100     111     117     123
SOAP3-dp         1010    734     522     333

Experiments and Results

LR = 1000 bps.

[Figure: bar chart of execution time (sec., log scale) versus error rate. Bar labels, listed in legend order:]

Time (sec.)      2%      4%      6%      8%
Masher           17.6    18.5    20.4    21.8
Masher-fast      15.5    17.5    20.2    22
Bowtie2          456     567     662     752
Bowtie2-fast     345     403     452     491
SOAP3-dp         5607    4600    3206    2027

Experiments and Results

LR = 1000 bps, Error rate 2%

[Figure: two scatter plots of execution time (sec., log scale, 1 to 10000) against accuracy % and against sensitivity % (both 90 to 100) for Masher, Masher-fast, Bowtie2, Bowtie2-fast, and SOAP3-dp.]

Conclusion and future work

Conclusion

• Masher is a fast and accurate short/long read mapper that uses a memory-efficient indexing scheme to reduce the size of a human genome index so that it fits in the memory of a GPU.
• The results show that Masher produces accurate alignments.
• Its speed is competitive with the tested state-of-the-art tools for reads of length less than 500 and an order of magnitude faster when the reads are longer than 500.

Future work

• Making the software publicly available.
• Improving Masher’s performance further by using GPU-specific optimizations and better CPU/GPU pipelining.
• Adding new features such as support for paired-end sequences or the fastq format.

• For more information

• Visit http://bmi.osu.edu/hpc

• Acknowledgement of Support


Thanks