25
CS 6293 Advanced Topics: Translational Bioinformatics Next-generation sequencing - Mapping short reads

CS 6293 Advanced Topics: Translational Bioinformatics

  • Upload
    meghan

  • View
    51

  • Download
    1

Embed Size (px)

DESCRIPTION

CS 6293 Advanced Topics: Translational Bioinformatics. Next-generation sequencing - Mapping short reads. Short read mapping. Input: A reference genome A collection of many 25-100bp tags (reads) User-specified parameters Output: One or more genomic coordinates for each tag - PowerPoint PPT Presentation

Citation preview

Page 1: CS 6293 Advanced Topics: Translational Bioinformatics

CS 6293 Advanced Topics: Translational Bioinformatics

Next-generation sequencing -Mapping short reads

Page 2: CS 6293 Advanced Topics: Translational Bioinformatics

Short read mapping

• Input:– A reference genome– A collection of many 25-100bp tags (reads)– User-specified parameters

• Output:– One or more genomic coordinates for each tag

• In practice, only 70-75% of tags successfully map to the reference genome. Why?

Page 3: CS 6293 Advanced Topics: Translational Bioinformatics

Multiple mapping

• A single tag may occur more than once in the reference genome.

• The user may choose to ignore tags that appear more than n times.

• As n gets large, you get more data, but also more noise in the data.

Page 4: CS 6293 Advanced Topics: Translational Bioinformatics

Inexact matching

• An observed tag may not exactly match any position in the reference genome.

• Sometimes, the tag almost matches one or more positions.• Such mismatches may represent a SNP (single-nucleotide

polymorphism, see wikipedia) or a bad read-out.• The user can specify the maximum number of mismatches, or a

phred-style quality score threshold.• As the number of allowed mismatches goes up, the number of

mapped tags increases, but so does the number of incorrectly mapped tags.

?

Page 5: CS 6293 Advanced Topics: Translational Bioinformatics

Read Length is Not As Important For Resequencing

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

8 10 12 14 16 18 20

Length of K-mer Reads (bp)

% o

f Pai

red

K-m

ers

with

Uni

quel

y As

sign

able

Loc

atio

n

E.COLI

HUMAN

Jay Shendure

Page 6: CS 6293 Advanced Topics: Translational Bioinformatics

Mapping Reads Back• Hash Table (Lookup table)

– FAST, but requires perfect matches. [O(m n + N)]• Array Scanning

– Can handle mismatches, but not gaps. [O(m N)]• Dynamic Programming (Smith Waterman)

– Indels– Mathematically optimal solution– Slow (most programs use Hash Mapping as a prefilter) [O(mnN)]

• Burrows-Wheeler Transform (BW Transform)– FAST. [O(m + N)] (without mismatch/gap)– Memory efficient.– But for gaps/mismatches, it lacks sensitivity

Page 7: CS 6293 Advanced Topics: Translational Bioinformatics

Spaced seed alignment

• Tags and tag-sized pieces of reference are cut into small “seeds.”

• Pairs of spaced seeds are stored in an index.

• Look up spaced seeds for each tag.

• For each “hit,” confirm the remaining positions.

• Report results to the user.

Page 8: CS 6293 Advanced Topics: Translational Bioinformatics

Burrows-Wheeler

• Store entire reference genome.

• Align tag base by base from the end.

• When tag is traversed, all active locations are reported.

• If no match is found, then back up and try a substitution.

Page 9: CS 6293 Advanced Topics: Translational Bioinformatics

Why Burrows-Wheeler?

• BWT very compact:– Approximately ½ byte per base– As large as the original text, plus a few “extras”– Can fit onto a standard computer with 2GB of

memory

• Linear-time search algorithm– proportional to length of query for exact

matches

Page 10: CS 6293 Advanced Topics: Translational Bioinformatics

Burrows-Wheeler Transform (BWT)

acaacg$

$acaacgaacg$acacaacg$acg$acacaacg$acg$acaag$acaac

gc$aaac

Burrows-Wheeler Matrix (BWM)

BWT

Page 11: CS 6293 Advanced Topics: Translational Bioinformatics

Burrows-Wheeler Matrix

$acaacgaacg$acacaacg$acg$acacaacg$acg$acaag$acaac

Page 12: CS 6293 Advanced Topics: Translational Bioinformatics

Burrows-Wheeler Matrix

$acaacgaacg$acacaacg$acg$acacaacg$acg$acaag$acaac

See the suffix array?

314256

Page 13: CS 6293 Advanced Topics: Translational Bioinformatics

Key observation1$acaacg1

2aacg$ac1

1acaacg$1

3acg$aca2

1caacg$a1

2cg$acaa3

1g$acaac2

a1c1a2a3c2g1$1

“last first (LF) mapping”

The i-th occurrence of character X in the last column corresponds tothe same text character as the i-th occurrence of X in the first column.

Page 14: CS 6293 Advanced Topics: Translational Bioinformatics

56

4

56

3

4

56

3

4256

314256

6

Recover text

Page 15: CS 6293 Advanced Topics: Translational Bioinformatics

Exact match

314256

Page 16: CS 6293 Advanced Topics: Translational Bioinformatics

Exact match (another example)

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

BWT(agcagcagact) = tgcc$ggaaaac

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

Search for pattern: gca

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

gcagcagcagca

Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/

Page 17: CS 6293 Advanced Topics: Translational Bioinformatics

Auxiliary data structures

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

a c g T

rank 1 5 8 11

9

7

4

1

6

3

10

8

5

2

11

1

2

3

4

5

6

7

8

9

10

11

SA t

g

c

c

$

g

g

a

a

a

a

c

Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.

BWT

a c g t

0 0 1 1

0 1 1 1

0 2 1 1

0 2 1 1

0 2 2 1

0 2 3 1

1 2 3 1

2 2 3 1

3 2 3 1

4 2 3 1

4 3 3 1

FM indices

Page 18: CS 6293 Advanced Topics: Translational Bioinformatics

Auxiliary data structures

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

a c g t

rank 1 5 8 11

9

7

4

1

6

3

10

8

5

2

11

1

2

3

4

5

6

7

8

9

10

11

SA t

g

c

c

$

g

g

a

a

a

a

c

Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.

BWT

a c g t

0 0 1 1

0 1 1 1

0 2 1 1

0 2 1 1

0 2 2 1

0 2 3 1

1 2 3 1

2 2 3 1

3 2 3 1

4 2 3 1

4 3 3 1

FM indices

gca

Next block:From 1 + 0 = 1 to 1 + (4-1) = 4

Page 19: CS 6293 Advanced Topics: Translational Bioinformatics

Auxiliary data structures

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

a c g T

rank 1 5 8 11

9

7

4

1

6

3

10

8

5

2

11

1

2

3

4

5

6

7

8

9

10

11

SA t

g

c

c

$

g

g

a

a

a

a

c

Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.

BWT

a c g t

0 0 1 1

0 1 1 1

0 2 1 1

0 2 1 1

0 2 2 1

0 2 3 1

1 2 3 1

2 2 3 1

3 2 3 1

4 2 3 1

4 3 3 1

FM indices

gca

Next block:From 5 + 0 = 5 to 5 + (2-1) = 6

Page 20: CS 6293 Advanced Topics: Translational Bioinformatics

Auxiliary data structures

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

a c g T

rank 1 5 8 11

9

7

4

1

6

3

10

8

5

2

11

1

2

3

4

5

6

7

8

9

10

11

SA t

g

c

c

$

g

g

a

a

a

a

c

Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.

BWT

a c g t

0 0 1 1

0 1 1 1

0 2 1 1

0 2 1 1

0 2 2 1

0 2 3 1

1 2 3 1

2 2 3 1

3 2 3 1

4 2 3 1

4 3 3 1

FM indices

gca

Next block:From 8 + 1 = 9 to 8 + (3-1) = 10

Page 21: CS 6293 Advanced Topics: Translational Bioinformatics

Inexact match

Page 22: CS 6293 Advanced Topics: Translational Bioinformatics

Main advantage of BWT against suffix array

• BWT needs less memory than suffix array• For human genome m = 3 * 109 :

– Suffix array: mlog2(m) bits = 4m bytes = 12GB– BWT: m/4 bytes plus extras = 1 - 2 GB

• m/4 bytes to store BWT (2 bits per char)• Suffix array and occurrence counts array take 5 m

log2 m bits = 20 n bytes• In practice, SA and OCC only partially stored, most

elements are computed on demand (takes time!)• Tradeoff between time and space

Page 23: CS 6293 Advanced Topics: Translational Bioinformatics

ComparisonBurrows-Wheeler• Requires <2Gb of

memory.• Runs 30-fold faster.• Is much more

complicated to program.

Bowtie

Spaced seeds• Requires ~50Gb of

memory.• Runs 30-fold slower.• Is much simpler to

program.

MAQ

Page 24: CS 6293 Advanced Topics: Translational Bioinformatics

Short-read mapping software       

Software Technique Developer LicenseEland Hashing reads Illumnia ?SOAP Hashing refs BGI AcademicMaq Hashing reads Sanger (Li, Heng) GNUPLBowtie BWT Salzberg/UMD GNUPLBWA BWT Sanger (Li, Heng) GNUPLSOAP2 BWT & hashing BGI Academic       

http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

Page 25: CS 6293 Advanced Topics: Translational Bioinformatics

References• (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences

to the human genome, Langmead et al, Genome Biology 2009, 10:R25  • SOAP: short oligonucleotide alignment, Ruiqiang Li et al. Bioinformatics

(2008) 24: 713-4• (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler

Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760• SOAP2: an improved ultrafast tool for short read alignment, Ruiqiang Li,

(2009) 25: 1966–1967• (MAQ) Mapping short DNA sequencing reads and calling variants using

mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. (2008) 18:1851-8.

• Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009)

• http://www.allisons.org/ll/AlgDS/Strings/BWT/