1. Genome & Exome Sequencing Read Mapping
   Xiaole Shirley Liu
   STAT115, STAT215, BIO298, BIST520

2. Whole Genome Sequencing
   Usually needs 30-50X coverage (~3 lanes of 100bp PE HiSeq2000 sequencing)

3. Exome Sequencing 2011

4. Exome Sequencing
   Solution Hybrid Selection: probes in solution can capture all exons (the exome) for high-throughput sequencing
   1-2% of whole genome seq
   Easily multiplex 20 samples in one lane

5. Comparative Sequencing
   Somatic mutation detection between normal / cancer pairs
   WGS or WES
   More mutation yield and better causal gene identification than Mendelian disorders
   Meyerson et al, Nat Rev Genet 2010

6. Hallmark of Mendelian Disease Gene Discovery
   Gilissen, Genome Biol 2011

7. Hallmark of Mendelian Disease Gene Discovery
   Gilissen, Genome Biol 2011

8. Mutation Targets vs Disorder Frequency
   Rarer disorders are focused on fewer mutated genes
   Gilissen, Genome Biol 2011

9. Whole Genome or Exome Seq?
   Enabling technologies: NGS machines, open-source algorithms, capture reagents, lowering cost, big sample collections
   Exomes are more cost effective: sequence patient DNA and filter common SNPs; compare parent-child trios; compare paired normal / cancer samples
   Challenges:
     Still cannot interpret many Mendelian disorders
     Rare variants need large sample sizes
     Exome might miss regions (e.g. novel non-coding genes)
     Unsuccessful at using exome-seq to interpret clinical data
   Shendure, Genome Biol 2011

10. Read Mapping
   Mapping hundreds of millions of reads back to the reference genome is CPU and RAM intensive, and slow
   Read quality decreases with length (small single nucleotide mismatches or indels)
   Very few mappers deal with indels, and they often allow ~2 mismatches within the first 30bp (the remaining 28 matched bases give 4^28 combinations, which can still uniquely identify most 30bp sequences in a 3 Gb genome)
   Mapping output: SAM (BAM) or BED

11. Spaced seed alignment
   Tags and tag-sized pieces of reference are cut into small "seeds".
   Pairs of spaced seeds are stored in an index.
   Look up spaced seeds for each tag.
   For each "hit", confirm the remaining positions.
   Report results to the user.

12.
Burrows-Wheeler
   Store entire reference genome.
   Align tag base by base from the end.
   When tag is traversed, all active locations are reported.
   If no match is found, then back up and try a substitution.
   Trapnell & Salzberg, Nat Biotech 2009

13. Burrows-Wheeler Transform
   Reversible permutation used originally in compression
   Once BWT(T) is built, all else shown here is discarded
   Matrix will be shown for illustration only
   [Figure: T, Burrows-Wheeler Matrix, last column BWT(T)]
   Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA; Technical Report 124; 1994
   Slides from Ben Langmead

14. Burrows-Wheeler Transform
   Property that makes BWT(T) reversible is "LF Mapping"
   The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column
   [Figure: T, BWT(T), Burrows-Wheeler Matrix; labels "Rank: 2", "Rank: 2"]
   Slides from Ben Langmead

15. Burrows-Wheeler Transform
   To recreate T from BWT(T), repeatedly apply rule:
     T = BWT[ LF(i) ] + T; i = LF(i)
   where LF(i) maps row i to the row whose first character corresponds to i's last character per LF Mapping
   [Figure: Final T]
   Slides from Ben Langmead

16. Exact Matching with FM Index
   To match Q in T using BWT(T), repeatedly apply rule:
     top = LF(top, qc); bot = LF(bot, qc)
   where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i's last character as if it were qc
   Slides from Ben Langmead

17. Exact Matching with FM Index
   In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q
   Slides from Ben Langmead

18. Exact Matching with FM Index
   If the range becomes empty (top = bot), the query suffix (and therefore the query) does not occur in the text
   Slides from Ben Langmead

19.
Backtracking
   Consider an attempt to find Q = "agc" in T = "acaacg":
   "gc" does not occur in the text
   Instead of giving up, try to "backtrack" to a previous position and try a different base (much slower)
   For 50bp reads, need to have ~25bp perfect match
   Slides from Ben Langmead

20. Seq Files
   Raw FASTQ
     Sequence ID, sequence
     Quality ID, quality score
   Mapped SAM
     Map: 0 OK, 4 unmapped, 16 mapped reverse strand
     XA (mapper-specific)
     MD: mismatch info
     NM: number of mismatches
   Mapped BED
     Chr, start, end, strand

   FASTQ example:
     @HWI-EAS305:1:1:1:991#0/1
     GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
     +HWI-EAS305:1:1:1:991#0/1
     MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
     @HWI-EAS305:1:1:1:201#0/1
     AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
     +HWI-EAS305:1:1:1:201#0/1
     PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB

   SAM example:
     HWUSI-EAS366_0112:6:1:1298:18828#0/1 16 chr9 98116600 255 38M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y]bc^dab [_UU`^`LbTUTccLbbYaY`cWLYW^ XA:i:1 MD:Z:3C30T3 NM:i:2
     HWUSI-EAS366_0112:6:1:1257:18819#0/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ece^dddTcT^c`a`ccdKc^^__]Yb_cKS^_W XM:i:1
     HWUSI-EAS366_0112:6:1:1315:19529#0/1 16 chr9 102610263 255 38M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC ^c_YcLcb`bbYdTadd`dda`cddYddd^cT` XA:i:0 MD:Z:38 NM:i:0

   BED example:
     chr1 123450 123500 +
     chr5 28374615 28374615 -

21. Data Analysis
   Heuristic filtering to identify novel genes for Mendelian disorders
   Stitziel et al, Genome Biol 2011

22. Genomic Structural Variation
   altered genome found in a sample is shown at the bottom.
   B) Inversion (INV) has reciprocal join in opposite orientations.
   C) Intra-chromosome translocation (ITX) has unilateral join in opposite orientation.
   D) Deletion (DEL) has two breakpoints joined in ascending order of genomic coordinates in the same orientation.
   E) Insertion (INS) has two breakpoints joined in descending order of genomic coordinates in the same orientation.
   Baker et al, Nat Meth 2012

23. Structural Variation Detection
   BreakDancer (Chen et al, Nat Meth 2009)
   Only look at anomalous read pairs

24.
Structural Variation Detection
   Crest (Wang et al, Nat Meth 2011)
   Use soft-clipped reads, kind of like bidir-blast

25. Copy Number Variation Detection
   Change in read coverage

26. Representation: VCF Format

27. Summary
   Whole genome and whole exome sequencing
     Solution hybrid selection
     Specific locus for rare diseases
   Bioinformatics issues:
     Read mapping
     SNP, indel detection
     Heuristic filtering
     Structural variation detection
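
Worked example for the read mapping slides (Burrows-Wheeler Transform, LF Mapping, Exact Matching with FM Index): a minimal Python sketch, using the backtracking slide's text T = acaacg. The function names (bwt, build_fm_index, count_matches, inverse_bwt) are illustrative assumptions and not the API of any mapper. The BWT is built here by naively sorting all rotations, which is only practical for tiny strings; the range [top, bot) is half-open, so top = bot means "empty", as on the slide.

```python
def bwt(text):
    """BWT of text + '$', built by sorting all rotations (illustration only)."""
    t = text + "$"                       # '$' sorts before a, c, g, t
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)   # last column of the matrix


def build_fm_index(bw):
    """Return (C, occ) for a BWT string.

    C[c]      = number of characters in the text smaller than c
                (the first row of the matrix that starts with c)
    occ[c][i] = number of occurrences of c in bw[:i]
    """
    counts = {}
    for ch in bw:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bw) + 1) for ch in counts}
    for i, ch in enumerate(bw):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_matches(query, bw, C, occ):
    """Exact matching: apply top = LF(top, qc); bot = LF(bot, qc), right to left."""
    top, bot = 0, len(bw)                # start with every row of the matrix
    for qc in reversed(query):
        if qc not in C:
            return 0
        top = C[qc] + occ[qc][top]
        bot = C[qc] + occ[qc][bot]
        if top >= bot:                   # empty range: this suffix is not in T
            return 0
    return bot - top                     # rows that begin with the query


def inverse_bwt(bw, C, occ):
    """Recreate T by repeatedly prepending the last-column character, i = LF(i)."""
    i, chars = 0, []                     # row 0 is the rotation starting with '$'
    while bw[i] != "$":
        chars.append(bw[i])
        i = C[bw[i]] + occ[bw[i]][i]     # LF mapping
    return "".join(reversed(chars))


if __name__ == "__main__":
    T = "acaacg"
    bw = bwt(T)
    C, occ = build_fm_index(bw)
    print(bw)                                    # gc$aaac
    print(count_matches("aac", bw, C, occ))      # 1
    print(count_matches("agc", bw, C, occ))      # 0 ("gc" is not in T)
    print(inverse_bwt(bw, C, occ))               # acaacg
```

In this sketch a query that fails, such as agc, simply returns 0; a mapper that tolerates mismatches would instead backtrack at that point and try a different base, as the Backtracking slide describes.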