Upload
meghan
View
51
Download
1
Tags:
Embed Size (px)
DESCRIPTION
CS 6293 Advanced Topics: Translational Bioinformatics. Next-generation sequencing - Mapping short reads. Short read mapping. Input: A reference genome A collection of many 25-100bp tags (reads) User-specified parameters Output: One or more genomic coordinates for each tag - PowerPoint PPT Presentation
Citation preview
CS 6293 Advanced Topics: Translational Bioinformatics
Next-generation sequencing -Mapping short reads
Short read mapping
• Input:– A reference genome– A collection of many 25-100bp tags (reads)– User-specified parameters
• Output:– One or more genomic coordinates for each tag
• In practice, only 70-75% of tags successfully map to the reference genome. Why?
Multiple mapping
• A single tag may occur more than once in the reference genome.
• The user may choose to ignore tags that appear more than n times.
• As n gets large, you get more data, but also more noise in the data.
Inexact matching
• An observed tag may not exactly match any position in the reference genome.
• Sometimes, the tag almost matches one or more positions.• Such mismatches may represent a SNP (single-nucleotide
polymorphism, see wikipedia) or a bad read-out.• The user can specify the maximum number of mismatches, or a
phred-style quality score threshold.• As the number of allowed mismatches goes up, the number of
mapped tags increases, but so does the number of incorrectly mapped tags.
?
Read Length is Not As Important For Resequencing
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
8 10 12 14 16 18 20
Length of K-mer Reads (bp)
% o
f Pai
red
K-m
ers
with
Uni
quel
y As
sign
able
Loc
atio
n
E.COLI
HUMAN
Jay Shendure
Mapping Reads Back• Hash Table (Lookup table)
– FAST, but requires perfect matches. [O(m n + N)]• Array Scanning
– Can handle mismatches, but not gaps. [O(m N)]• Dynamic Programming (Smith Waterman)
– Indels– Mathematically optimal solution– Slow (most programs use Hash Mapping as a prefilter) [O(mnN)]
• Burrows-Wheeler Transform (BW Transform)– FAST. [O(m + N)] (without mismatch/gap)– Memory efficient.– But for gaps/mismatches, it lacks sensitivity
Spaced seed alignment
• Tags and tag-sized pieces of reference are cut into small “seeds.”
• Pairs of spaced seeds are stored in an index.
• Look up spaced seeds for each tag.
• For each “hit,” confirm the remaining positions.
• Report results to the user.
Burrows-Wheeler
• Store entire reference genome.
• Align tag base by base from the end.
• When tag is traversed, all active locations are reported.
• If no match is found, then back up and try a substitution.
Why Burrows-Wheeler?
• BWT very compact:– Approximately ½ byte per base– As large as the original text, plus a few “extras”– Can fit onto a standard computer with 2GB of
memory
• Linear-time search algorithm– proportional to length of query for exact
matches
Burrows-Wheeler Transform (BWT)
acaacg$
$acaacgaacg$acacaacg$acg$acacaacg$acg$acaag$acaac
gc$aaac
Burrows-Wheeler Matrix (BWM)
BWT
Burrows-Wheeler Matrix
$acaacgaacg$acacaacg$acg$acacaacg$acg$acaag$acaac
Burrows-Wheeler Matrix
$acaacgaacg$acacaacg$acg$acacaacg$acg$acaag$acaac
See the suffix array?
314256
Key observation1$acaacg1
2aacg$ac1
1acaacg$1
3acg$aca2
1caacg$a1
2cg$acaa3
1g$acaac2
a1c1a2a3c2g1$1
“last first (LF) mapping”
The i-th occurrence of character X in the last column corresponds tothe same text character as the i-th occurrence of X in the first column.
56
4
56
3
4
56
3
4256
314256
6
Recover text
Exact match
314256
Exact match (another example)
$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac
BWT(agcagcagact) = tgcc$ggaaaac
$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac
Search for pattern: gca
$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac
$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac
gcagcagcagca
Test with your own seq and pattern at: http://www.allisons.org/ll/AlgDS/Strings/BWT/
Auxiliary data structures
$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac
a c g T
rank 1 5 8 11
9
7
4
1
6
3
10
8
5
2
11
1
2
3
4
5
6
7
8
9
10
11
SA t
g
c
c
$
g
g
a
a
a
a
c
Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.
BWT
a c g t
0 0 1 1
0 1 1 1
0 2 1 1
0 2 1 1
0 2 2 1
0 2 3 1
1 2 3 1
2 2 3 1
3 2 3 1
4 2 3 1
4 3 3 1
FM indices
Auxiliary data structures
$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac
a c g t
rank 1 5 8 11
9
7
4
1
6
3
10
8
5
2
11
1
2
3
4
5
6
7
8
9
10
11
SA t
g
c
c
$
g
g
a
a
a
a
c
Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.
BWT
a c g t
0 0 1 1
0 1 1 1
0 2 1 1
0 2 1 1
0 2 2 1
0 2 3 1
1 2 3 1
2 2 3 1
3 2 3 1
4 2 3 1
4 3 3 1
FM indices
gca
Next block:From 1 + 0 = 1 to 1 + (4-1) = 4
Auxiliary data structures
$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac
a c g T
rank 1 5 8 11
9
7
4
1
6
3
10
8
5
2
11
1
2
3
4
5
6
7
8
9
10
11
SA t
g
c
c
$
g
g
a
a
a
a
c
Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.
BWT
a c g t
0 0 1 1
0 1 1 1
0 2 1 1
0 2 1 1
0 2 2 1
0 2 3 1
1 2 3 1
2 2 3 1
3 2 3 1
4 2 3 1
4 3 3 1
FM indices
gca
Next block:From 5 + 0 = 5 to 5 + (2-1) = 6
Auxiliary data structures
$agcagcagactact$agcagcagagact$agcagcagcagact$agcagcagcagact$cagact$agcagcagcagact$agct$agcagcagagact$agcagcagcagact$agcagcagcagact$at$agcagcagac
a c g T
rank 1 5 8 11
9
7
4
1
6
3
10
8
5
2
11
1
2
3
4
5
6
7
8
9
10
11
SA t
g
c
c
$
g
g
a
a
a
a
c
Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space.
BWT
a c g t
0 0 1 1
0 1 1 1
0 2 1 1
0 2 1 1
0 2 2 1
0 2 3 1
1 2 3 1
2 2 3 1
3 2 3 1
4 2 3 1
4 3 3 1
FM indices
gca
Next block:From 8 + 1 = 9 to 8 + (3-1) = 10
Inexact match
Main advantage of BWT against suffix array
• BWT needs less memory than suffix array• For human genome m = 3 * 109 :
– Suffix array: mlog2(m) bits = 4m bytes = 12GB– BWT: m/4 bytes plus extras = 1 - 2 GB
• m/4 bytes to store BWT (2 bits per char)• Suffix array and occurrence counts array take 5 m
log2 m bits = 20 n bytes• In practice, SA and OCC only partially stored, most
elements are computed on demand (takes time!)• Tradeoff between time and space
ComparisonBurrows-Wheeler• Requires <2Gb of
memory.• Runs 30-fold faster.• Is much more
complicated to program.
Bowtie
Spaced seeds• Requires ~50Gb of
memory.• Runs 30-fold slower.• Is much simpler to
program.
MAQ
Short-read mapping software
Software Technique Developer LicenseEland Hashing reads Illumnia ?SOAP Hashing refs BGI AcademicMaq Hashing reads Sanger (Li, Heng) GNUPLBowtie BWT Salzberg/UMD GNUPLBWA BWT Sanger (Li, Heng) GNUPLSOAP2 BWT & hashing BGI Academic
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
References• (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences
to the human genome, Langmead et al, Genome Biology 2009, 10:R25 • SOAP: short oligonucleotide alignment, Ruiqiang Li et al. Bioinformatics
(2008) 24: 713-4• (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler
Transform, Li Heng and Richard Durbin, (2009) 25:1754–1760• SOAP2: an improved ultrafast tool for short read alignment, Ruiqiang Li,
(2009) 25: 1966–1967• (MAQ) Mapping short DNA sequencing reads and calling variants using
mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. (2008) 18:1851-8.
• Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6 - S12 (2009)
• http://www.allisons.org/ll/AlgDS/Strings/BWT/