CS 6293 Advanced Topics: Translational Bioinformatics

CS 6293 Advanced Topics: Translational Bioinformatics

Next-generation sequencing -Mapping short reads

Short read mapping

• Input:– A reference genome– A collection of many 25-100bp tags (reads)– User-specified parameters

• Output:– One or more genomic coordinates for each tag

• In practice, only 70-75% of tags successfully map to the reference genome. Why?

Multiple mapping

• A single tag may occur more than once in the reference genome.

• The user may choose to ignore tags that appear more than n times.

• As n gets large, you get more data, but also more noise in the data.

Inexact matching

• An observed tag may not exactly match any position in the reference genome.

• Sometimes, the tag almost matches one or more positions.• Such mismatches may represent a SNP (single-nucleotide

polymorphism, see wikipedia) or a bad read-out.• The user can specify the maximum number of mismatches, or a

phred-style quality score threshold.• As the number of allowed mismatches goes up, the number of

mapped tags increases, but so does the number of incorrectly mapped tags.

?

http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism

Read Length is Not As Important For Resequencing

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

8 10 12 14 16 18 20

Length of K-mer Reads (bp)

% o

f Pai

red

K-m

ers

with

Uni

quel

y As

sign

able

Loc

atio

n

E.COLI

HUMAN

Jay Shendure

Mapping Reads Back• Hash Table (Lookup table)

– FAST, but requires perfect matches. [O(m n + N)]• Array Scanning

– Can handle mismatches, but not gaps. [O(m N)]• Dynamic Programming (Smith Waterman)

– Indels– Mathematically optimal solution– Slow (most programs use Hash Mapping as a prefilter) [O(mnN)]

• Burrows-Wheeler Transform (BW Transform)– FAST. [O(m + N)] (without mismatch/gap)– Memory efficient.– But for gaps/mismatches, it lacks sensitivity

Spaced seed alignment

• Tags and tag-sized pieces of reference are cut into small “seeds.”

• Pairs of spaced seeds are stored in an index.

• Look up spaced seeds for each tag.

• For each “hit,” confirm the remaining positions.

• Report results to the user.

Burrows-Wheeler

• Store entire reference genome.

• Align tag base by base from the end.

• When tag is traversed, all active locations are reported.

• If no match is found, then back up and try a substitution.

Why Burrows-Wheeler?

• BWT very compact:– Approximately ½ byte per base– As large as the original text, plus a few “extras”– Can fit onto a standard computer with 2GB of

memory

• Linear-time search algorithm– proportional to length of query for exact

matches

Burrows-Wheeler Transform (BWT)

acaacg$

$acaacgaacg$acacaacg$acg$acacaacg$acg$acaag$acaac

gc$aaac

Burrows-Wheeler Matrix (BWM)

BWT

Burrows-Wheeler Matrix


Burrows-Wheeler Matrix


See the suffix array?

314256

Key observation1$acaacg1

2aacg$ac1

1acaacg$1

3acg$aca2

1caacg$a1

2cg$acaa3

1g$acaac2

a1c1a2a3c2g1$1

“last first (LF) mapping”

The i-th occurrence of character X in the last column corresponds tothe same text character as the i-th occurrence of X in the first column.

56

4

56

3

4

56

3

4256

314256

6

Recover text

Exact match

314256

Exact match (another example)

$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac

BWT(agcagcagact) = tgcc$ggaaaac


Search for pattern: gca



gcagcagcagca

Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/

http://www.allisons.org/ll/AlgDS/Strings/BWT/



Auxiliary data structures


a c g T

rank 1 5 8 11

9

7

4

1

6

3

10

8

5

2

11

1

2

3

4

5

6

7

8

9

10

11

SA t

g

c

c

$

g

g

a

a

a

a

c

Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.

BWT

a c g t

0 0 1 1

0 1 1 1

0 2 1 1

0 2 1 1

0 2 2 1

0 2 3 1

1 2 3 1

2 2 3 1

3 2 3 1

4 2 3 1

4 3 3 1

FM indices



a c g t

rank 1 5 8 11

9

7

4

1

6

3

10

8

5

2

11

1

2

3

4

5

6

7

8

9

10

11

SA t

g

c

c

$

g

g

a

a

a

a

c


BWT

a c g t

0 0 1 1

0 1 1 1

0 2 1 1

0 2 1 1

0 2 2 1

0 2 3 1

1 2 3 1

2 2 3 1

3 2 3 1

4 2 3 1

4 3 3 1

FM indices

gca

Next block:From 1 + 0 = 1 to 1 + (4-1) = 4



a c g T

rank 1 5 8 11

9

7

4

1

6

3

10

8

5

2

11

1

2

3

4

5

6

7

8

9

10

11

SA t

g

c

c

$

g

g

a

a

a

a

c


BWT

a c g t

0 0 1 1

0 1 1 1

0 2 1 1

0 2 1 1

0 2 2 1

0 2 3 1

1 2 3 1

2 2 3 1

3 2 3 1

4 2 3 1

4 3 3 1

FM indices

gca




a c g T

rank 1 5 8 11

9

7

4

1

6

3

10

8

5

2

11

1

2

3

4

5

6

7

8

9

10

11

SA t

g

c

c

$

g

g

a

a

a

a

c


BWT

a c g t

0 0 1 1

0 1 1 1

0 2 1 1

0 2 1 1

0 2 2 1

0 2 3 1

1 2 3 1

2 2 3 1

3 2 3 1

4 2 3 1

4 3 3 1

FM indices

gca


Inexact match

Main advantage of BWT against suffix array

• BWT needs less memory than suffix array• For human genome m = 3 * 109 :

– Suffix array: mlog2(m) bits = 4m bytes = 12GB– BWT: m/4 bytes plus extras = 1 - 2 GB

• m/4 bytes to store BWT (2 bits per char)• Suffix array and occurrence counts array take 5 m

log2 m bits = 20 n bytes• In practice, SA and OCC only partially stored, most

elements are computed on demand (takes time!)• Tradeoff between time and space

ComparisonBurrows-Wheeler• Requires <2Gb of

memory.• Runs 30-fold faster.• Is much more

complicated to program.

Bowtie

Spaced seeds• Requires ~50Gb of

memory.• Runs 30-fold slower.• Is much simpler to

program.

MAQ

Short-read mapping software

Software Technique Developer LicenseEland Hashing reads Illumnia ?SOAP Hashing refs BGI AcademicMaq Hashing reads Sanger (Li, Heng) GNUPLBowtie BWT Salzberg/UMD GNUPLBWA BWT Sanger (Li, Heng) GNUPLSOAP2 BWT & hashing BGI Academic

http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

References• (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences

to the human genome, Langmead et al, Genome Biology 2009, 10:R25 • SOAP: short oligonucleotide alignment, Ruiqiang Li et al. Bioinformatics

(2008) 24: 713-4• (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler

Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760• SOAP2: an improved ultrafast tool for short read alignment, Ruiqiang Li,

(2009) 25: 1966–1967• (MAQ) Mapping short DNA sequencing reads and calling variants using

mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. (2008) 18:1851-8.

• Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009)

• http://www.allisons.org/ll/AlgDS/Strings/BWT/




Documents

CS 6293 Advanced Topics: Translational Bioinformatics