Achim Tresch, Computational Biology
‘Omics’ - Analysis of high dimensional Data
Next Generation Sequencing
Today:
Illumina NGS platform, Fastq files
Sequence bioinformatics: hash tables, suffix arrays, Burrows-Wheeler transform
Illumina
Slides from Kurt Strueber, Genome Center MPIPZ Cologne
Short Read Applications
• Genotyping. Goal: identify variations
• RNA-seq, ChIP-seq, Methyl-seq. Goal: classify, measure significant peaks
[Figure: short reads tiled along the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
Slides taken from Michael Main, University of Colorado
Hash tables
• The simplest kind of hash table is an array of records.
• This example has 701 records.
[Figure: an array of records, indexed [0], [1], [2], …, [700]]
Hash tables
• Each record has a special field, called its key.
• In this example, the key is a long integer field called Number.
[Figure: the record at index [4] has Number 506643548]
Hash tables
• The number might be a person's identification number, and the rest of the record has information about the person.
Hash tables
• When a hash table is in use, some spots contain valid records, and other spots are "empty".
[Figure: the array with four valid records: Number 281942902 at [1], Number 233667136 at [2], Number 506643548 at [4], and Number 155778322 at [700]; all other spots are empty]
Hash tables
• In order to insert a new record, the key must somehow be converted to an array index.
• The index is called the hash value of the key.
[Figure: the same array, with a new record, Number 580625685, to be inserted]
In our case, the keys are short sequences, and the records contain their locations in the genome.
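A minimal sketch of that idea, assuming fixed-length keys (k-mers) and using Python's built-in dict, which is itself a hash table. The function name build_kmer_index and the choice k=4 are illustrative assumptions, not part of the slides; the reference string is the fragment shown in the short-read figure.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every length-k substring of genome to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

# Reference fragment taken from the short-read figure in these slides.
reference = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC"
kmer_index = build_kmer_index(reference, k=4)  # k=4 is an arbitrary choice for illustration

tata_hits = kmer_index["TATA"]            # every position where TATA starts
gcgc_hits = kmer_index["GCGC"]
absent_hits = kmer_index.get("AAAA", [])  # a k-mer that does not occur in the fragment
print(tata_hits, gcgc_hits, absent_hits)
```

A key that occurs several times simply stores several locations, which is why each record here is a list.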
Inserting a new record
• Typical way to create a hash value: (Number mod 701)
• What is (580625685 mod 701)? Answer: 3
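The remainder can be checked directly. The short sketch below applies the same rule to the key numbers that appear in these slides; the function name hash_value is an illustrative assumption.

```python
TABLE_SIZE = 701  # number of records in the example array

def hash_value(key, table_size=TABLE_SIZE):
    """Hash an integer key to an array index by taking the remainder."""
    return key % table_size

# Key numbers that appear in the example slides.
keys = [506643548, 233667136, 281942902, 155778322, 580625685]
for key in keys:
    print(key, "->", hash_value(key))
```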
Inserting a new record
• The hash value is used for the location of the new record.
[Figure: the new record, Number 580625685, is stored at index [3]]
Inserting a new record
• Here is another new record to insert, with a hash value of 2.
[Figure: the new record, Number 701466868, announces "My hash value is [2]."]
Collisions
• This is called a collision, because there is already another valid record at [2].
[Figure: Number 701466868 cannot be stored at [2], which is already occupied]
When a collision occurs, move forward until you find an empty spot.
Collisions
• The new record goes in the empty spot.
[Figure: after moving forward past [2], [3], and [4], Number 701466868 is stored at index [5]]
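The "move forward until you find an empty spot" rule is commonly called linear probing. A minimal sketch, assuming the table stores only the keys and that probing wraps around from [700] to [0] (the wrap-around is an assumption, the slides do not show it):

```python
TABLE_SIZE = 701

def insert(table, key):
    """Insert key using linear probing; return the index where it was stored."""
    index = key % len(table)
    for _ in range(len(table)):
        if table[index] is None:            # empty spot found
            table[index] = key
            return index
        index = (index + 1) % len(table)    # move forward, wrapping around (assumption)
    raise RuntimeError("hash table is full")

table = [None] * TABLE_SIZE
for key in (506643548, 233667136, 281942902, 155778322, 580625685):
    insert(table, key)

slot = insert(table, 701466868)  # hash value 2, but [2], [3] and [4] are occupied
print(slot)
```

For brevity the sketch does not check whether the key is already present.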
A Quiz
If the keys were short sequences, where would you place the sequence ATACCG? (NB: this is an ill-posed question)
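The question is presumably ill-posed because no hash function for sequences has been fixed yet. One possible choice, shown here purely as an assumption, is to read the sequence as a base-4 number (A=0, C=1, G=2, T=3) and then reuse the mod 701 rule. The names BASE_CODE, sequence_to_number and sequence_hash are illustrative.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding, not given in the slides

def sequence_to_number(seq):
    """Interpret a DNA sequence as a base-4 integer."""
    number = 0
    for base in seq:
        number = number * 4 + BASE_CODE[base]
    return number

def sequence_hash(seq, table_size=701):
    """Hash a sequence by converting it to a number and taking the remainder."""
    return sequence_to_number(seq) % table_size

number = sequence_to_number("ATACCG")
slot = sequence_hash("ATACCG")
print(number, slot)
```

A different encoding or table size would give a different slot, which is the point of the remark.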
Searching for a Key
• The data that's attached to a key can be found fairly quickly.
• Calculate the hash value.
• Check that location of the array for the key.
• Keep moving forward until you find the key, or you reach an empty spot.
• When the item is found, the information can be copied to the necessary location.
[Figure: searching for Number 701466868, whose hash value is [2]: the records at [2], [3], and [4] answer "Not me.", and the record at [5] answers "Yes!"]
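The search steps can be sketched as follows. The table layout is filled in by hand and is an assumption reconstructed from the hash values of the example numbers; as before, only keys are stored and probing wraps around.

```python
TABLE_SIZE = 701

def find(table, key):
    """Return the index holding key, or None if an empty spot is reached first."""
    index = key % len(table)
    for _ in range(len(table)):
        if table[index] is None:          # empty spot: the key is not in the table
            return None
        if table[index] == key:           # "Yes!"
            return index
        index = (index + 1) % len(table)  # "Not me.": keep moving forward
    return None

# Assumed layout of the example table after all insertions.
table = [None] * TABLE_SIZE
occupied = {1: 281942902, 2: 233667136, 3: 580625685,
            4: 506643548, 5: 701466868, 700: 155778322}
for index, key in occupied.items():
    table[index] = key

found_at = find(table, 701466868)
missing = find(table, 702)  # hashes to [1], probes [1]..[5], then hits the empty [6]
print(found_at, missing)
```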
Summary
• Hash tables store a collection of records with keys.
• The location of a record depends on the hash value of the record's key.
• When a collision occurs, the next available location is used.
• Searching for a particular key is generally quick.
Suffix Arrays
• Suffix arrays were introduced by Manber and Myers (conference paper 1990, journal version 1993)
• More space efficient than suffix trees
• A suffix array for a string x of length m is an array of size m that specifies the lexicographic ordering of the suffixes of x.
Idea: Every substring is a prefix of a suffix
Example of a suffix array for acaaacatat$:
3 4 1 5 7 9 2 6 8 10 11
Each entry is the starting position (counted from 1) of that suffix in the search string. In this example $ sorts after every letter.
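The example can be reproduced with a naive sort of all suffixes. This is a sketch, not an efficient construction. It assumes that the terminator $ ranks after every letter, which is the ordering that yields the numbers on the slide, and it reports 1-based positions.

```python
def suffix_array(text, terminator="$"):
    """Return the 1-based start positions of the suffixes of text in sorted order.

    Assumption: the terminator ranks after every other character.
    """
    def sort_key(start):
        # (False, ch) sorts before (True, ch), so the terminator comes last.
        return [(ch == terminator, ch) for ch in text[start:]]

    order = sorted(range(len(text)), key=sort_key)
    return [start + 1 for start in order]

sa = suffix_array("acaaacatat$")
print(sa)
```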
Suffix Array Construction
• Naive in-place construction
– Similar to insertion sort
– Insert all the suffixes into the array one by one, making sure that each newly inserted suffix is in its correct place
– Running time complexity: O(m^2) suffix comparisons, where m is the length of the string
• Manber and Myers give an O(m log m) construction in their 1993 paper.
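A faster construction can be sketched with prefix doubling, the idea underlying the Manber and Myers algorithm: suffixes are sorted by their first 1, 2, 4, 8, … characters, reusing the ranks from the previous round. The sketch below is not their algorithm as published. It uses Python's comparison sort in every round, so it runs in O(m log^2 m) rather than O(m log m), and it returns 0-based positions with $ sorting before the letters (Python's default character order).

```python
def suffix_array_doubling(text):
    """Suffix array (0-based start positions) built by prefix doubling."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(ch) for ch in text]   # rank by first character
    order = list(range(n))
    k = 1
    while True:
        def sort_key(i):
            # Rank of the first k characters, then rank of the next k characters.
            return (rank[i], rank[i + k] if i + k < n else -1)

        order.sort(key=sort_key)
        new_rank = [0] * n
        for prev, cur in zip(order, order[1:]):
            new_rank[cur] = new_rank[prev] + (sort_key(prev) != sort_key(cur))
        rank = new_rank
        if rank[order[-1]] == n - 1:  # all ranks distinct: fully sorted
            return order
        k *= 2

print(suffix_array_doubling("mississippi$"))
```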
Suffix Array Construction
• O(n) space, where n is the size of the database string
• Space efficient. However, there’s an increase in query time
• Lookup query
– Binary search
– O(m log n) time; m is the size of the query
– Can reduce time to O(m + log n) using a more efficient implementation
Suffix Array Search
find(Pattern P in SuffixArray A of string S):
  lo = 0, hi = length(A)
  for 0 <= i < length(P):
    binary search for x, y in lo <= x <= y <= hi
      such that P[i] = S[A[j]+i] exactly for x <= j < y
    lo = x, hi = y
  return {A[lo], A[lo+1], ..., A[hi-1]}
Suffix Array Search
Search ‘is’ in mississippi$
Index  Position  Suffix
0      11        i$
1      8         ippi$
2      5         issippi$
3      2         ississippi$
4      1         mississippi$
5      10        pi$
6      9         ppi$
7      7         sippi$
8      4         sissippi$
9      6         ssippi$
10     3         ssissippi$
11     12        $
(In this table the lone $ is listed last, while the other suffixes are ordered as if $ sorted before every letter.)
Examine the pattern letter by letter, reducing the range of occurrence each time.
- First letter i: occurs in indices from 0 to 3
- Second letter s: occurs in indices from 2 to 3
Done. Output: issippi$ and ississippi$
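The letter-by-letter narrowing can be sketched in Python as below. Two assumptions differ from the slide: Python sorts $ before every letter, so $ occupies index 0 and the index ranges are shifted by one relative to the table above, and the array is built by a naive sort. The reported text positions are 1-based, as on the slide.

```python
def build_suffix_array(text):
    """Naive suffix array: 0-based start positions in sorted suffix order."""
    return sorted(range(len(text)), key=lambda start: text[start:])

def find(pattern, text, sa):
    """Return the 1-based start positions of pattern in text, in suffix array order."""
    def char_at(row, offset):
        pos = sa[row] + offset
        return text[pos] if pos < len(text) else ""  # "" sorts before any character

    lo, hi = 0, len(sa)
    for offset, ch in enumerate(pattern):
        # Lower bound: first row in [lo, hi) whose character at offset is >= ch.
        left, right = lo, hi
        while left < right:
            mid = (left + right) // 2
            if char_at(mid, offset) < ch:
                left = mid + 1
            else:
                right = mid
        new_lo = left
        # Upper bound: first row in [new_lo, hi) whose character at offset is > ch.
        right = hi
        while left < right:
            mid = (left + right) // 2
            if char_at(mid, offset) <= ch:
                left = mid + 1
            else:
                right = mid
        lo, hi = new_lo, left
    return [sa[row] + 1 for row in range(lo, hi)]

text = "mississippi$"
sa = build_suffix_array(text)
hits = find("is", text, sa)
print(hits)
```

The number of occurrences is simply the size of the final range, which is how a counting query such as "how many times does ATG appear?" can be answered.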
Summary
• It can be built very fast.
• It can answer queries very fast:
– How many times does ATG appear?
• Disadvantages:
– Can’t do approximate matching
– Hard to insert new stuff dynamically (need to rebuild the array)
Links
• http://pauillac.inria.fr/~quercia/documents-info/Luminy-98/albert/JAVA+html/SuffixTreeGrow.html
• http://home.in.tum.de/~maass/suffix.html
• http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
• http://homepage.usask.ca/~ctl271/810/approximate_matching.shtml
• http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
• http://dogma.net/markn/articles/suffixt/suffixt.htm
• http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/
Bowtie: A Highly Scalable Tool for Post-Genomic Datasets
(Slides by Ben Langmead)
Short Read Applications
Finding the alignments is typically the performance bottleneck
Short Read Alignment
• Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists
– Approximate answer to: where in the genome did the read originate?
• What is “good”? For now, we concentrate on:
– Fewer mismatches is better
  …TGATCATA… / GATCAA is better than …TGATCATA… / GAGAAT
– Failing to align a low-quality base is better than failing to align a high-quality base
  …TGATATTA… / GATcaT is better than …TGATcaTA… / GTACAT
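Both criteria can be illustrated with a toy penalty that counts mismatches and sums the base qualities at the mismatched positions. This is only a sketch of the idea and not Bowtie's actual scoring. The function name mismatch_penalty and all quality values are made up for illustration, and the read is assumed to be aligned without gaps to a reference window of equal length.

```python
def mismatch_penalty(read, qualities, ref_window):
    """Return (mismatch count, summed quality of the mismatched read bases)."""
    if not (len(read) == len(qualities) == len(ref_window)):
        raise ValueError("read, qualities and ref_window must have equal length")
    mismatched = [q for base, q, ref in zip(read, qualities, ref_window) if base != ref]
    return len(mismatched), sum(mismatched)

# Fewer mismatches is better (hypothetical uniform quality of 30).
few = mismatch_penalty("GATCAA", [30] * 6, "GATCAT")
many = mismatch_penalty("GAGAAT", [30] * 6, "GATCAT")

# Same number of mismatches, but at low-quality versus high-quality bases.
quals = [30, 30, 30, 5, 5, 30]  # hypothetical: positions 3 and 4 are low quality
low_q = mismatch_penalty("GATCAT", quals, "GATATT")   # mismatches at low-quality bases
high_q = mismatch_penalty("GTACAT", quals, "GATCAT")  # mismatches at high-quality bases
print(few, many, low_q, high_q)
```

Under this toy penalty a smaller tuple is a better alignment, so low_q would be preferred over high_q.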
Indexing
• Genomes and reads are too large for direct approaches like dynamic programming
• Indexing is required
• Choice of index is key to performance
Suffix tree, suffix array, seed hash tables (many variants, incl. spaced seeds)
Indexing
• Genome indices can be big. For human:
– Suffix tree: > 35 GB
– Suffix array: > 12 GB
– Seed hash tables: > 12 GB
• Large indices necessitate painful compromises
1. Require big-memory machine
2. Use secondary storage
3. Build new index each run
4. Subindex and do multiple passes
Burrows-Wheeler Transform
Text T: acaacg$
Rotate the string by one position in each row, then sort the rows lexicographically ($ sorts before every letter):
Burrows-Wheeler Matrix
$ a c a a c g
a a c g $ a c
a c a a c g $
a c g $ a c a
c a a c g $ a
c g $ a c a a
g $ a c a a c
BWT(T) is the last column: g c $ a a a c
The last column contains the characters preceding the characters in the first column.
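The matrix construction can be written down almost literally. This is a sketch for illustration only: it materializes all rotations, which is far too expensive for a genome, and it assumes the text ends in a unique $ that sorts before every letter.

```python
def bwt(text):
    """Burrows-Wheeler transform via the sorted rotation matrix.

    Assumes text ends in a unique '$' that sorts before every other character.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

transformed = bwt("acaacg$")
print(transformed)
```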
Burrows-Wheeler Transform
• Reversible permutation used originally in compression
• Once BWT(T) is built, all else shown here is discarded
– Matrix will be shown for illustration only
• In long texts, BWT(T) contains more runs of repeated characters than the original text, which makes it easier to compress
[Figure: T, its Burrows-Wheeler Matrix, and BWT(T) as the last column]
Burrows-Wheeler Transform
• Property that makes BWT(T) reversible is “LF Mapping”
– The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column
[Figure: Burrows-Wheeler Matrix and BWT(T); a character of rank 2 in the last column is linked to the character of rank 2 in the first column]
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994
Burrows-Wheeler Transform
• To recreate T from BWT(T), repeatedly apply the rule: T = BWT[ LF(i) ] + T; i = LF(i)
– Where LF(i) maps row i to the row whose first character corresponds to i’s last per LF Mapping
• Could be called “unpermute” or “walk-left” algorithm
[Figure: the walk through the matrix, ending with the final T]
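A minimal sketch of the walk-left idea, assuming the text ends in a unique $ that sorts before every letter, so that row 0 of the matrix starts with $. The helper names are illustrative. LF(i) is computed from the rank of the character occurrence in the last column plus the row at which that character's block starts in the first column.

```python
def inverse_bwt(last):
    """Recover T from BWT(T) by repeatedly following the LF mapping."""
    # Rank of each occurrence within the last column (0 for the first 'a', 1 for the second, ...).
    counts = {}
    ranks = []
    for ch in last:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1

    # Row at which each character's block starts in the (sorted) first column.
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    row = 0        # row 0 starts with '$', so its last character precedes '$' in T
    text = "$"
    while last[row] != "$":
        text = last[row] + text                    # prepend the preceding character
        row = first_row[last[row]] + ranks[row]    # LF mapping: jump to that character's row
    return text

recovered = inverse_bwt("gc$aaac")
print(recovered)
```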
BWT in Bioinformatics
• Oligomer counting
– Healy J et al: Annotating large genomes with exact word matches. Genome Res 2003, 13(10):2306-2315.
• Whole-genome alignment
– Lippert RA: Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J Comp Bio 2005, 12(4):407-415.
• Smith-Waterman alignment to large reference
– Lam TW et al: Compressed indexing and local alignment of DNA. Bioinformatics 2008, 24(6):791-797.
Comparison to Maq & SOAP
• PC: 2.4 GHz Intel Core 2, 2 GB RAM
• Server: 2.4 GHz AMD Opteron, 32 GB RAM
• Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
• SOAP not run on PC due to memory constraints
• Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
• Reference: Human (NCBI 36.3, contigs)

Program               | CPU time    | Wall clock time | Reads per hour | Peak virtual memory footprint | Bowtie speedup | Reads aligned (%)
Bowtie -v 2 (server)  | 15m:07s     | 15m:41s         | 33.8 M         | 1,149 MB                      | -              | 67.4
SOAP (server)         | 91h:57m:35s | 91h:47m:46s     | 0.08 M         | 13,619 MB                     | 351x           | 67.3
Bowtie (PC)           | 16m:41s     | 17m:57s         | 29.5 M         | 1,353 MB                      | -              | 71.9
Maq (PC)              | 17h:46m:35s | 17h:53m:07s     | 0.49 M         | 804 MB                        | 59.8x          | 74.7
Bowtie (server)       | 17m:58s     | 18m:26s         | 28.8 M         | 1,353 MB                      | -              | 71.9
Maq (server)          | 32h:56m:53s | 32h:58m:39s     | 0.27 M         | 804 MB                        | 107x           | 74.7
• Bowtie delivers about 30 million alignments per CPU hour
TopHat: Bowtie for RNA-seq
• TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads using Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
– Contact: Cole Trapnell ([email protected])
– http://tophat.cbcb.umd.edu
Acknowledgements
NGS Exercises were designed by Nicolas Delhomme (EMBL Heidelberg, University of Umeå)