Page 1: Exome Sequencing

Genome & Exome Sequencing

Read Mapping

Xiaole Shirley Liu

STAT115, STAT215, BIO298, BIST520

Page 2: Exome Sequencing

Whole Genome Sequencing

• Usually need 30-50X coverage (~3 lanes of 100bp PE HiSeq2000 sequencing)
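The coverage figure follows from a simple ratio: total sequenced bases divided by genome size. A minimal sketch, assuming a hypothetical yield of about 180 million read pairs per lane (an illustrative number, not taken from the slide) and a 3 Gb genome:

```python
def mean_coverage(n_reads, read_len, genome_size):
    """Expected mean depth: total sequenced bases / genome size."""
    return n_reads * read_len / genome_size

# All values below are assumptions for illustration only.
PAIRS_PER_LANE = 180e6   # hypothetical read pairs per lane
LANES = 3
READ_LEN = 100           # 100bp paired-end: two reads per pair
GENOME_SIZE = 3.0e9      # approximate human genome length in bp

coverage = mean_coverage(LANES * PAIRS_PER_LANE * 2, READ_LEN, GENOME_SIZE)
print(f"~{coverage:.0f}X mean coverage")
```

With these assumed yields, three lanes land inside the 30-50X range quoted above; a different per-lane yield shifts the result proportionally.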

Page 3: Exome Sequencing

Exome Sequencing

• 2011


Page 4: Exome Sequencing

Exome Sequencing

• Solution Hybrid Selection: Probes in solution can capture all exons (exome) for high throughput sequencing

• 1-2% of whole genome seq

• Easily multiplex 20 samples in one lane


Page 5: Exome Sequencing

Comparative Sequencing

• Somatic mutation detection between normal / cancer pairs

• WGS or WES

• Higher mutation yield and better causal gene identification than in Mendelian disorders

Meyerson et al, Nat Rev Genet 2010

Page 6: Exome Sequencing

Hallmark of Mendelian Disease Gene Discovery

Gilissen, Genome Biol 2011

Page 7: Exome Sequencing

Hallmark of Mendelian Disease Gene Discovery

Gilissen, Genome Biol 2011

Page 8: Exome Sequencing

Mutation Targets vs Disorder Frequency

Rarer disorders are focused on fewer mutated genes

Gilissen, Genome Biol 2011

Page 9: Exome Sequencing

Whole Genome or Exome Seq?

• Enabling technologies: NGS machines, open-source algorithms, capture reagents, lowering cost, big sample collections

• Exomes more cost effective: Sequence patient DNA and filter common SNPs; compare parent-child trios; compare paired normal/cancer samples

• Challenges:

– Still can’t interpret many Mendelian disorders

– Rare variants need large sample sizes

– Exome might miss regions (e.g. novel non-coding genes)

– Unsuccessful at using exome-seq to interpret clinical data

Shendure, Genome Biol 2011

Page 10: Exome Sequencing

Read Mapping

• Mapping hundreds of millions of reads back to the reference genome is CPU and RAM intensive, and slow

• Read quality decreases with length (small single nucleotide mismatches or indels)

• Very few mappers deal with indels, and most allow ~2 mismatches within the first 30bp (the 4^28 possible sequences of the remaining 28 matching bases could still uniquely identify most 30bp sequences in a 3Gb genome)

• Mapping output: SAM (BAM) or BED
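The 4^28 argument can be checked with a few lines of arithmetic. This is a rough sketch that assumes a uniformly random genome, which real genomes are not (repeats make many seeds non-unique); the variable names are illustrative.

```python
GENOME_SIZE = 3_000_000_000   # approximate human genome length in bp
SEED_LEN = 30                 # first 30bp of the read
MISMATCHES = 2                # mismatches tolerated in the seed

exact_positions = SEED_LEN - MISMATCHES      # 28 bases must still match
distinct_kmers = 4 ** exact_positions        # number of possible 28-mers
# Expected number of chance occurrences of one fixed 28-mer
# in a random genome of this size (assumption: uniform bases).
expected_chance_hits = GENOME_SIZE / distinct_kmers

print(f"4^28 = {distinct_kmers:.2e}")
print(f"expected chance hits = {expected_chance_hits:.1e}")
```

Because 4^28 is about seven orders of magnitude larger than the number of positions in the genome, a chance match of 28 fixed bases is very unlikely under this model.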

Page 11: Exome Sequencing

Spaced seed alignment

• Tags and tag-sized pieces of reference are cut into small “seeds.”

• Pairs of spaced seeds are stored in an index.

• Look up spaced seeds for each tag.

• For each “hit,” confirm the remaining positions.

• Report results to the user.
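The steps above can be sketched as follows. This is a simplified illustration, not any particular mapper's implementation: it assumes each tag is cut into four equal seeds, so that with at most two mismatches at least one pair of seeds must match exactly. The function names and the toy reference are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cut_seeds(s, n_seeds=4):
    """Cut a tag (or tag-sized reference window) into equal-length seeds."""
    k = len(s) // n_seeds
    return [s[i * k:(i + 1) * k] for i in range(n_seeds)]

def build_index(ref, tag_len, n_seeds=4):
    """Store every pair of seeds of every tag-sized reference window."""
    index = defaultdict(list)
    for pos in range(len(ref) - tag_len + 1):
        parts = cut_seeds(ref[pos:pos + tag_len], n_seeds)
        for i, j in combinations(range(n_seeds), 2):
            index[(i, j, parts[i], parts[j])].append(pos)
    return index

def align(tag, ref, index, max_mismatches=2, n_seeds=4):
    """Look up each seed pair, then confirm the remaining positions."""
    parts = cut_seeds(tag, n_seeds)
    hits = set()
    for i, j in combinations(range(n_seeds), 2):
        for pos in index.get((i, j, parts[i], parts[j]), []):
            window = ref[pos:pos + len(tag)]
            mismatches = sum(a != b for a, b in zip(tag, window))
            if mismatches <= max_mismatches:
                hits.add((pos, mismatches))
    return sorted(hits)

# Toy example: a 16bp tag taken from position 8 with two substitutions.
ref = "ACGTACGTTTGACCAGTAGGCATCGATCGGATTACA"
tag = list(ref[8:24])
tag[1], tag[9] = "A", "T"
tag = "".join(tag)
index = build_index(ref, len(tag))
print(align(tag, ref, index))
```

The index here grows with six seed pairs per reference window, which hints at why this family of methods needs a lot of memory on a full genome.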

Page 12: Exome Sequencing

Burrows-Wheeler

• Store entire reference genome.

• Align tag base by base from the end.

• When tag is traversed, all active locations are reported.

• If no match is found, then back up and try a substitution.

Trapnell & Salzberg, Nat Biotech 2009

Page 13: Exome Sequencing

Burrows-Wheeler Transform

• Reversible permutation used originally in compression

• Once BWT(T) is built, all else shown here is discarded

– Matrix will be shown for illustration only

[Figure: T, its Burrows-Wheeler matrix, and the last column, BWT(T)]

Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994

Slides from Ben Langmead
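For illustration, the transform can be computed by building the matrix explicitly. This naive sketch sorts all rotations of T followed by a `$` terminator (assumed to sort before every other character); practical tools build the BWT from a suffix array instead, because the full matrix is far too large for a genome.

```python
def bwt(text):
    """Burrows-Wheeler transform via the explicit sorted-rotation matrix."""
    t = text + "$"                       # terminator, smallest character
    matrix = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in matrix)   # last column

print(bwt("acaacg"))
```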

Page 14: Exome Sequencing

Burrows-Wheeler Transform

• Property that makes BWT(T) reversible is “LF Mapping”

– The ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column

[Figure: T, BWT(T), and the Burrows-Wheeler matrix, with matching occurrences in the First and Last columns labeled “Rank: 2”]

Slides from Ben Langmead

Page 15: Exome Sequencing

Burrows-Wheeler Transform

• To recreate T from BWT(T), repeatedly apply rule: T = BWT[ LF(i) ] + T; i = LF(i)

– Where LF(i) maps row i to the row whose first character corresponds to i’s last, per LF Mapping

[Figure: final T]

Slides from Ben Langmead
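A minimal sketch of this reconstruction, assuming the text was terminated with `$` before the transform. It prepends the last-column character of the current row and then follows the LF mapping, which is the same walk as the rule above written in a slightly different order; the helper names are illustrative.

```python
def inverse_bwt(bw):
    """Recover T from BWT(T) by walking the LF mapping from the '$' row."""
    # Rank of each last-column character among equal characters seen so far.
    seen, ranks = {}, []
    for c in bw:
        ranks.append(seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    # Row where each character's block starts in the (sorted) first column.
    first, total = {}, 0
    for c in sorted(seen):
        first[c] = total
        total += seen[c]

    i, text = 0, ""          # row 0 starts with '$', so it ends with T's last char
    while bw[i] != "$":
        text = bw[i] + text
        i = first[bw[i]] + ranks[i]      # LF(i)
    return text

print(inverse_bwt("gc$aaac"))
```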

Page 16: Exome Sequencing

Exact Matching with FM Index

• To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc)

– Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc

Slides from Ben Langmead

Page 17: Exome Sequencing

Exact Matching with FM Index

• In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q

Slides from Ben Langmead

Page 18: Exome Sequencing

Exact Matching with FM Index

• If range becomes empty (top = bot) the query suffix (and therefore the query) does not occur in the text

Slides from Ben Langmead
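The three slides above can be condensed into a short backward-search sketch. It assumes a half-open row range [top, bot) and counts occurrences by scanning BWT(T) directly; a real FM index stores sampled occurrence counts so that LF takes constant time. Function and variable names are illustrative.

```python
def fm_count(bw, query):
    """Count occurrences of query in T, given only bw = BWT(T + '$')."""
    # first[c]: row where the block of rows starting with c begins.
    first, total = {}, 0
    for c in sorted(set(bw)):
        first[c] = total
        total += bw.count(c)

    def lf(i, qc):
        # Row reached from row boundary i as if the last character were qc.
        return first[qc] + bw[:i].count(qc)

    top, bot = 0, len(bw)                # all rows match the empty suffix
    for qc in reversed(query):           # right-to-left through Q
        if qc not in first:
            return 0
        top, bot = lf(top, qc), lf(bot, qc)
        if top >= bot:                   # empty range: suffix absent from T
            return 0
    return bot - top

bw = "gc$aaac"                           # BWT of "acaacg$"
print(fm_count(bw, "aac"), fm_count(bw, "ac"), fm_count(bw, "agc"))
```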

Page 19: Exome Sequencing

Backtracking

• Consider an attempt to find Q = “agc” in T = “acaacg”:

[Figure: matching “c”, then “g”; “gc” does not occur in the text]

• Instead of giving up, try to “backtrack” to a previous position and try a different base (much slower)

• For 50bp reads, need to have ~25bp perfect match

Slides from Ben Langmead
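A simplified sketch of the idea, using the same Q and T: a depth-first search that tries the exact base first and, within a mismatch budget, backs up and substitutes other bases. This is an illustration under stated assumptions, not the strategy of any specific mapper (which would typically also use base qualities and require an exact-match region to limit the search).

```python
def build_fm(text):
    """Return BWT(text + '$') and the first-column start row of each character."""
    t = text + "$"
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    bw = "".join(row[-1] for row in rows)
    first, total = {}, 0
    for c in sorted(set(bw)):
        first[c] = total
        total += bw.count(c)
    return bw, first

def search(text, query, max_mismatches=1):
    """Map each matched substring of text to its number of occurrences."""
    bw, first = build_fm(text)
    alphabet = [c for c in first if c != "$"]
    found = {}

    def lf(i, c):
        return first[c] + bw[:i].count(c)

    def extend(pos, top, bot, used, matched):
        if top >= bot:                   # dead end: back up to the caller
            return
        if pos < 0:                      # whole query consumed
            found[matched] = bot - top
            return
        # Try the exact base first, then each substitution.
        for c in sorted(alphabet, key=lambda b: b != query[pos]):
            cost = 0 if c == query[pos] else 1
            if used + cost <= max_mismatches:
                extend(pos - 1, lf(top, c), lf(bot, c), used + cost, c + matched)

    extend(len(query) - 1, 0, len(bw), 0, "")
    return found

print(search("acaacg", "agc", max_mismatches=1))
```

With one substitution allowed, “agc” aligns to “aac”; with none, the search returns nothing, as on the slide.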

Page 20: Exome Sequencing

Seq Files

• Raw FASTQ

– Sequence ID, sequence

– Quality ID, quality score

• Mapped SAM

– Map flag: 0 OK, 4 unmapped, 16 mapped reverse strand

– XA (mapper-specific)

– MD: mismatch info

– NM: number of mismatches

• Mapped BED

– Chr, start, end, strand

@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB

HWUSI-EAS366_0112:6:1:1298:18828#0/1    16      chr9    98116600        255     38M     *       0       0       TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG  Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^  XA:i:1  MD:Z:3C30T3     NM:i:2

HWUSI-EAS366_0112:6:1:1257:18819#0/1    4       *       0       0       *       *       0       0       AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG  ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\  XM:i:1

HWUSI-EAS366_0112:6:1:1315:19529#0/1    16      chr9    102610263       255     38M     *       0       0       GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC  ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT`  XA:i:0  MD:Z:38 NM:i:0

chr1 123450 123500 +
chr5 28374615 28374615 -

http://samtools.sourceforge.net/SAM1.pdf
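As an illustration of the SAM fields listed above, the sketch below parses the first mapped record from this slide, assuming the columns are tab-separated as the SAM format requires. The function name and returned dictionary layout are hypothetical, and only the two flag bits mentioned on the slide (4 and 16) are decoded.

```python
def parse_sam_line(line):
    """Split one SAM alignment line into named fields and optional tags."""
    f = line.rstrip("\n").split("\t")
    rec = {
        "qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
        "mapq": int(f[4]), "cigar": f[5], "seq": f[9], "qual": f[10],
    }
    rec["unmapped"] = bool(rec["flag"] & 4)    # flag bit 4: read unmapped
    rec["reverse"] = bool(rec["flag"] & 16)    # flag bit 16: reverse strand
    tags = {}
    for tag in f[11:]:                         # optional TAG:TYPE:VALUE fields
        name, typ, value = tag.split(":", 2)
        tags[name] = int(value) if typ == "i" else value
    rec["tags"] = tags
    return rec

example = "\t".join([
    "HWUSI-EAS366_0112:6:1:1298:18828#0/1", "16", "chr9", "98116600", "255",
    "38M", "*", "0", "0",
    "TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG",
    r"Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^",
    "XA:i:1", "MD:Z:3C30T3", "NM:i:2",
])
rec = parse_sam_line(example)
print(rec["rname"], rec["pos"], rec["reverse"], rec["tags"])
```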

Page 21: Exome Sequencing

Data Analysis

• Heuristic filtering to identify novel genes for Mendelian disorders

Stitziel et al, Genome Biol 2011
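One common form of such filtering can be sketched as follows: discard known common variants, keep protein-altering ones, and intersect the remaining genes across affected individuals. This is a toy illustration; the record fields, effect labels, and gene names are all hypothetical and do not come from the cited paper.

```python
# Hypothetical effect labels treated as protein-altering.
DAMAGING = {"missense", "nonsense", "frameshift", "splice_site"}

def candidate_genes(variants_by_patient, common_variant_ids):
    """Genes carrying a rare, protein-altering variant in every patient."""
    gene_sets = []
    for variants in variants_by_patient.values():
        genes = {
            v["gene"] for v in variants
            if v["id"] not in common_variant_ids and v["effect"] in DAMAGING
        }
        gene_sets.append(genes)
    return set.intersection(*gene_sets) if gene_sets else set()

# Toy data: identifiers and genes are made up.
patients = {
    "P1": [
        {"id": "v1", "gene": "GENE_A", "effect": "missense"},
        {"id": "v2", "gene": "GENE_B", "effect": "synonymous"},
        {"id": "v3", "gene": "GENE_C", "effect": "nonsense"},
    ],
    "P2": [
        {"id": "v4", "gene": "GENE_A", "effect": "frameshift"},
        {"id": "v3", "gene": "GENE_C", "effect": "nonsense"},
    ],
}
common = {"v3"}          # e.g. variants already catalogued as common SNPs
print(candidate_genes(patients, common))
```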

Page 22: Exome Sequencing

Genomic Structural Variation

Baker et al, Nat Meth 2012

Page 23: Exome Sequencing

Structural Variation Detection

BreakDancer

Chen et al, Nat Meth 2009

Only look at anomalous read pairs
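To make "anomalous read pairs" concrete, the sketch below labels pairs whose insert size or orientation departs from the bulk of the library. It is a generic illustration and not BreakDancer's actual algorithm; the median/MAD cutoff, the "FR" orientation code, and the labels are assumptions.

```python
import statistics

def classify_pairs(pairs, n_dev=3.0):
    """Label each (insert_size, orientation) pair; 'FR' is the expected orientation."""
    sizes = [size for size, orient in pairs if orient == "FR"]
    center = statistics.median(sizes)
    mad = statistics.median([abs(s - center) for s in sizes])
    scale = max(1.4826 * mad, 1.0)       # robust spread estimate, floor of 1bp
    labels = []
    for size, orient in pairs:
        if orient != "FR":
            labels.append("abnormal_orientation")   # e.g. inversion candidate
        elif size > center + n_dev * scale:
            labels.append("deletion")               # mates map too far apart
        elif size < center - n_dev * scale:
            labels.append("insertion")              # mates map too close
        else:
            labels.append("normal")
    return labels

pairs = [(300, "FR"), (310, "FR"), (290, "FR"), (305, "FR"),
         (295, "FR"), (5000, "FR"), (300, "FF")]
print(classify_pairs(pairs))
```

A real caller would then cluster anomalous pairs that support the same event rather than judging each pair alone.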

Page 24: Exome Sequencing

Structural Variation Detection

• CREST (Wang et al, Nat Meth 2011)

– Use soft-clipped reads, somewhat like a bidirectional BLAST
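Soft-clipped reads are recorded in the SAM CIGAR string with the `S` operation. The sketch below only shows how such reads might be picked out as breakpoint candidates; it is not CREST's method, and the `min_clip` threshold and function names are assumptions.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clips(cigar):
    """Return (left, right) soft-clipped lengths from a CIGAR string."""
    # Hard clips (H) can flank soft clips, so drop them first.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return 0, 0
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right

def is_breakpoint_candidate(cigar, min_clip=10):
    """True when either end has at least min_clip soft-clipped bases."""
    return max(soft_clips(cigar)) >= min_clip

for cigar in ["38M", "12S26M", "30M8S", "5H10S30M"]:
    print(cigar, soft_clips(cigar), is_breakpoint_candidate(cigar))
```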

Page 25: Exome Sequencing

Copy Number Variation Detection

• Change in read coverage

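A change in read coverage is often summarized as a log2 ratio of normalized read counts per genomic bin, for example tumor against matched normal. The sketch below is a bare-bones illustration: the bin counts are made up, the pseudocount and the ±0.3 threshold are arbitrary assumptions, and real methods also correct for GC content and segment neighboring bins.

```python
import math

def log2_ratios(tumor, normal, pseudocount=0.5):
    """Per-bin log2 ratio of library-size-normalized read counts."""
    t_total, n_total = sum(tumor), sum(normal)
    return [
        math.log2(((t + pseudocount) / t_total) / ((n + pseudocount) / n_total))
        for t, n in zip(tumor, normal)
    ]

def call_bins(ratios, threshold=0.3):
    """Label each bin as gain, loss, or neutral by a fixed threshold."""
    return ["gain" if r > threshold else "loss" if r < -threshold else "neutral"
            for r in ratios]

tumor_counts = [100, 100, 200, 100, 50]    # toy read counts per bin
normal_counts = [100, 100, 100, 100, 100]
ratios = log2_ratios(tumor_counts, normal_counts)
print(call_bins(ratios))
```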

Page 26: Exome Sequencing

Representation: VCF Format

• http://www.1000genomes.org/node/101
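A minimal sketch of reading one VCF data line, covering only the eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name and the returned layout are illustrative; for real files a dedicated VCF library is the safer choice.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a VCF data line into a dict."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True   # flags have no value
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","),                      # may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt, "info": info_dict,
    }

line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_vcf_line(line)
print(record["chrom"], record["pos"], record["ref"], record["alt"], record["info"])
```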


Page 27: Exome Sequencing

Summary

• Whole genome and whole exome sequencing

– Solution hybrid selection

– Specific locus for rare diseases

• Bioinformatics issues:

– Read mapping

– SNP, indel detection

– Heuristic filtering

– Structural variation detection

