Whole Exome Sequencing for Variant Discovery and Prioritisation
First, a recap. What have we learned?
• NGS platforms – short and long reads
• What the data looks like
• How to QC data
• General procedures in processing data
• How to find biological signal in data - RNA-Seq lectures + practical (in progress)
• There’s a LOT more, but it’s not necessarily more complex or very different!
Exomes: Publication Trends
[Figure: papers per year, 2005 to 2013 (x-axis: Year; y-axis: Papers). Total: 925 (Oct 2012); 2013: ~800 papers; 2014: ~1200 papers. Source: Forero DA, 2012]
NGS Variation Discovery Workflow (resequencing based)
Variant Discovery Application: Disease
• Printed as text, the genome would fill almost 2000 books of 1.5 million letters each (average books with 200 pages)!
• This information is contained in any single cell of the body.
Monogenic Diseases
• Single mutation
• How do we find it in all those ‘books’?
• A bioinformatics challenge
• NGS sequencers can only read small portions
• So, the library is fragments of pages of the books!
Mendelian Disease Gene Discovery
Gilissen, Genome Biol 2011
Mendelian Disease Gene Discovery
Gilissen, Genome Biol 2011
Opportunities and Challenges
• Enabling technologies: NGS machines, open-source algorithms, capture reagents, lowering cost, big sample collections
• Exomes are more cost-effective: sequence patient DNA and filter common SNPs; compare parent-child trios; compare paired normal-cancer samples
• Challenges:
– Still can’t interpret many Mendelian disorders
– Rare variants need large sample sizes
– Exome might miss regions (e.g. novel non-coding genes)
Shendure, Genome Biol 2011
Why exome sequencing?
• WGS still too costly & added value of intergenic mutations is low
• WES: targeted sequencing of coding regions (~1% of human genome)
• Mendelian disorders disrupt protein-coding sequences (mostly)
• Large fraction of rare non-synonymous variants in human genome are predicted to be deleterious
• Splice sites also enriched for highly functional variation
• The exome represents a highly enriched subset of the genome in which to search for variants with large effect sizes
A representation of the relationship between the size of the mutational target and the frequency of disease for disorders caused by de novo mutations
Gilissen, Genome Biol 2011
Majewski, J Med Genet 2011
Bamshad, Nat Rev Genet 2011
Maximizing chances of finding disease-causing rare variants using exome sequencing
Example: Comparative Sequencing
• Somatic mutation detection between normal / cancer pairs
• More mutation yield and better causal gene identification than Mendelian disorders
Meyerson et al, Nat Rev Genet 2010
Pierce, Am J Hum Genet 2010
Perrault syndrome (HSD17B4)
BUT Exome Analysis for single patient can be informative
Exome sequencing procedure
Read Mapping
• Mapping hundreds of millions of reads to the reference genome is CPU and RAM intensive, and ‘slow’
• Read quality decreases along the length of the read (are small single-nucleotide mismatches or indels real or artifacts?)
• Very few mappers appropriately deal with indels
• Mapping output: SAM (BAM) or BED
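To make the cost of mapping concrete, here is a minimal sketch of seed-based mapping with a k-mer index. It is a toy under stated assumptions: the function names, the value of k and the mismatch limit are all invented for illustration, the alignment is ungapped (so it cannot place reads containing indels), and real mappers such as BWA or Bowtie use compressed indexes instead of a plain dictionary.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Return (0-based start, mismatches) of the best ungapped hit, or None."""
    best = None
    # Try every k-mer of the read as a seed, so that a mismatch inside one
    # seed does not stop the read from being placed.
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best

# Hypothetical reference and reads, for illustration only.
reference = "ACGTTGCAAGCTTAGGCTAACGTAGC"
index = build_kmer_index(reference, k=4)
exact = map_read("GCTTAGGC", reference, index, k=4)    # perfect match
one_off = map_read("GCTTCGGC", reference, index, k=4)  # one substitution
```

Even in this toy, every seed hit triggers a full comparison of the read against the reference window, which hints at why mapping hundreds of millions of reads is expensive.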
Mapped Data: SAM specification
• Generic sequence alignment format
• Describes alignment of reads to a reference
• Flexible - stores all the alignment information
• Simple enough to be easily generated or converted from other existing alignment formats
• Keeps track of chromosome position, alignment quality and alignment features (extended cigar)
• Includes mate pair / paired end information
• Original FASTQ data can be reproduced from SAM (and BAM)
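As a rough illustration of the last point, the sketch below turns one SAM record back into a FASTQ entry. It assumes a single primary record per read with no hard clipping; the function name and the example record are made up. Reads aligned to the reverse strand (flag bit 0x10) are stored reverse-complemented in SAM, so the sketch undoes that.

```python
# Translation table for complementing bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_record_to_fastq(sam_line):
    """Rebuild a four-line FASTQ entry from one tab-separated SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    qname, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]
    if flag & 0x10:
        # Reverse-strand alignment: restore the orientation of the original read.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{qname}\n{seq}\n+\n{qual}\n"

# Hypothetical record for a read mapped to the reverse strand (flag 16).
line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIH#"
fastq = sam_record_to_fastq(line)
```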
SAM FIELDS
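Each SAM alignment line has 11 mandatory tab-separated fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL. The sketch below parses them and interprets the CIGAR string; the class and function names and the example record are illustrative assumptions, and optional tags after the 11th column are ignored.

```python
import re
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag (paired, reverse strand, ...)
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # extended CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (Phred+33)

def parse_sam_line(line):
    """Parse the 11 mandatory columns of one alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

# Hypothetical paired-end record with one insertion and one deletion.
record = parse_sam_line(
    "read1\t99\tchr1\t100\t60\t3M1I4M2D2M\t=\t300\t250\tACGTACGTAC\tIIIIIIIIII"
)
is_paired = bool(record.flag & 0x1)
is_reverse = bool(record.flag & 0x10)
span = reference_span(record.cigar)
```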
BAM format
• Binary version of SAM - more compact
• Makes downstream analysis independent from the mapping program
• Allows most operations on the alignment to work on a stream without loading the whole alignment into memory
• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
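The benefit of position-sorted, indexed alignments can be sketched with a binary search over start coordinates. This is only a conceptual toy: the class name and data are hypothetical, it handles a single chromosome held in memory, and the real BAM index uses a binning scheme over compressed blocks rather than this approach.

```python
from bisect import bisect_left, bisect_right

class PositionIndex:
    """Toy index over alignments on one chromosome (0-based, half-open)."""

    def __init__(self, alignments):
        # alignments: iterable of (start, end, name); sorting by start is
        # what makes binary search possible.
        self.alignments = sorted(alignments)
        self.starts = [a[0] for a in self.alignments]
        self.max_len = max((end - start for start, end, _ in self.alignments),
                           default=0)

    def overlapping(self, pos):
        """Names of alignments covering position pos."""
        # A read covering pos must start no earlier than pos - max_len + 1
        # and no later than pos, so only that slice needs to be inspected.
        lo = bisect_left(self.starts, pos - self.max_len + 1)
        hi = bisect_right(self.starts, pos)
        return [name for start, end, name in self.alignments[lo:hi] if end > pos]

# Hypothetical alignments, for illustration only.
idx = PositionIndex([(400, 450, "r4"), (100, 150, "r1"),
                     (160, 210, "r3"), (120, 170, "r2")])
hits = idx.overlapping(165)
```

Only the reads near the query position are touched, which is the point of indexing: the rest of the file never has to be read.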
VCF format
• Emerging standard for storing variant data
• Originally designed for SNPs and short INDELs, it also works for structural variations
• Consists of header and data sections
• The data section is TAB delimited with each line consisting of at least 8 mandatory fields
VCF FIELDS
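The 8 mandatory columns of a VCF data line are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. A minimal parser might look like the sketch below; the function names, the variant ID and the INFO values in the example are made up, and any genotype columns after the 8th are ignored.

```python
def parse_info(info):
    """Turn 'DP=30;DB' into {'DP': '30', 'DB': True}; '.' means no INFO."""
    if info == ".":
        return {}
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # flags have no '=value' part
    return result

def parse_vcf_line(line):
    """Parse the 8 mandatory columns of one VCF data line."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),                        # may list several alleles
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": parse_info(info),
    }

def read_vcf(lines):
    """Yield parsed records, skipping header lines (they start with '#')."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)

# Hypothetical two-line header plus one data line.
vcf_text = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\trs123\tA\tG,T\t50.0\tPASS\tDP=30;DB",
]
records = list(read_vcf(vcf_text))
```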
Variant filtering
Variant Prioritization
• Heuristic filtering to identify novel genes for Mendelian disorders
Stitziel et al, Genome Biol 2011
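A heuristic filter of this kind can be sketched as a chain of simple tests: drop common variants, keep those predicted to alter the protein, and require segregation with the phenotype. Everything below is an assumption made for illustration (the field names, the consequence labels, the 1% frequency cutoff, the gene names), and the sketch ignores zygosity and the inheritance model, which a real analysis would need.

```python
# Consequence labels treated as protein-altering in this sketch (hypothetical).
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}

def prioritise(variants, affected, unaffected, max_af=0.01):
    """Keep rare, protein-altering variants carried by all affected samples
    and by no unaffected sample."""
    kept = []
    for v in variants:
        if v["population_af"] > max_af:
            continue                      # too common in the population
        if v["consequence"] not in PROTEIN_ALTERING:
            continue                      # unlikely to change the protein
        if not affected <= v["carriers"]:
            continue                      # missing in an affected sample
        if v["carriers"] & unaffected:
            continue                      # present in an unaffected sample
        kept.append(v)
    return kept

# Made-up variants in a made-up family.
variants = [
    {"gene": "GENE_A", "consequence": "missense", "population_af": 0.0,
     "carriers": {"child1", "child2"}},
    {"gene": "GENE_B", "consequence": "synonymous", "population_af": 0.0,
     "carriers": {"child1", "child2"}},
    {"gene": "GENE_C", "consequence": "nonsense", "population_af": 0.2,
     "carriers": {"child1", "child2"}},
    {"gene": "GENE_D", "consequence": "frameshift", "population_af": 0.001,
     "carriers": {"child1", "child2", "mother"}},
    {"gene": "GENE_E", "consequence": "missense", "population_af": 0.0,
     "carriers": {"child1"}},
]
candidates = prioritise(variants,
                        affected={"child1", "child2"},
                        unaffected={"mother", "father"})
candidate_genes = [v["gene"] for v in candidates]
```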
More than just SNVs and ‘short’ indels
Structural Variation
BreakDancer
Chen et al, Nat Meth 2009
Only looks at anomalous read pairs
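What makes a read pair "anomalous" can be sketched as a simple classification on chromosome, orientation and insert size. This is not BreakDancer's actual algorithm; the function name, the labels and the library parameters (mean 300 bp, SD 30 bp, 3 SD cutoff) are assumptions chosen for illustration.

```python
def classify_pair(chrom, mate_chrom, insert_size, inward_facing,
                  mean_insert=300.0, sd_insert=30.0, n_sd=3.0):
    """Label a read pair as normal or as one of several anomaly types."""
    if chrom != mate_chrom:
        return "interchromosomal"    # may indicate a translocation
    if not inward_facing:
        return "orientation"         # may indicate an inversion
    if insert_size > mean_insert + n_sd * sd_insert:
        return "insert_too_large"    # may indicate a deletion in the sample
    if insert_size < mean_insert - n_sd * sd_insert:
        return "insert_too_small"    # may indicate an insertion in the sample
    return "normal"

# Hypothetical pairs: (chrom, mate chrom, insert size, inward-facing?)
pairs = [
    ("chr1", "chr1", 310, True),
    ("chr1", "chr1", 1500, True),
    ("chr1", "chr5", 0, True),
    ("chr2", "chr2", 290, False),
    ("chr2", "chr2", 100, True),
]
labels = [classify_pair(*p) for p in pairs]
```

A caller built on this idea would then look for clusters of pairs with the same label near the same locus, since a single anomalous pair is often just a mapping artifact.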
Copy Number Variation Detection
Change in read coverage
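Coverage-based detection can be sketched as a log2 ratio of normalised read counts between a sample and a control, computed per window or capture target. The function names, the pseudocount and the ±0.4 thresholds are assumptions for illustration; a real caller would also correct for GC content and capture efficiency.

```python
import math

def log2_ratios(sample_counts, control_counts, pseudocount=0.5):
    """Per-window log2 ratio of sample to control read counts, after
    scaling each library by its total so that sequencing depth cancels out."""
    sample_total = sum(sample_counts)
    control_total = sum(control_counts)
    ratios = []
    for s, c in zip(sample_counts, control_counts):
        s_frac = (s + pseudocount) / sample_total
        c_frac = (c + pseudocount) / control_total
        ratios.append(math.log2(s_frac / c_frac))
    return ratios

def call_windows(ratios, loss_below=-0.4, gain_above=0.4):
    """Label each window as loss, gain or neutral using fixed thresholds."""
    calls = []
    for r in ratios:
        if r < loss_below:
            calls.append("loss")
        elif r > gain_above:
            calls.append("gain")
        else:
            calls.append("neutral")
    return calls

# Made-up read counts in four windows.
control = [100, 100, 100, 100]
sample = [100, 50, 100, 150]
ratios = log2_ratios(sample, control)
calls = call_windows(ratios)
```

In a diploid genome, losing one of two copies gives a ratio near log2(1/2) = -1 and gaining a third copy gives a ratio near log2(3/2) ≈ 0.58, which is why thresholds somewhat inside those values are used here.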
Example WES-based variant discovery workflow
1. Map the reads to a reference genome
   1. Index the reference genome
   2. Map (BWA, Bowtie, Novoalign, etc.)
2. Sort BAM file
3. Remove PCR duplicates
4. Realign around indels (‘optional’)
5. Call variants
6. Recalibrate quality scores (‘optional’)
7. Filter variants
8. Basic variant annotation
9. Biological interpretation only starts here
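As one example from the workflow above, PCR duplicate removal can be sketched as grouping reads that share chromosome, start position and strand, and keeping one read per group. The function name and record layout are hypothetical, and keeping the read with the highest total base quality is a choice made for this sketch; production tools also consider mate positions and soft clipping.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, pos, strand), preferring the highest
    summed Phred+33 base quality."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["reverse"])
        score = sum(ord(c) - 33 for c in read["qual"])
        if key not in best or score > best[key][0]:
            best[key] = (score, read)
    return [read for _, read in best.values()]

# Made-up reads: r1 and r2 are duplicates; r3 is on the opposite strand.
reads = [
    {"name": "r2", "chrom": "chr1", "pos": 100, "reverse": False, "qual": "####"},
    {"name": "r1", "chrom": "chr1", "pos": 100, "reverse": False, "qual": "IIII"},
    {"name": "r3", "chrom": "chr1", "pos": 100, "reverse": True, "qual": "IIII"},
    {"name": "r4", "chrom": "chr1", "pos": 250, "reverse": False, "qual": "IIII"},
]
unique_names = [r["name"] for r in remove_duplicates(reads)]
```

Sorting the BAM file first (step 2) is what makes this practical at scale, because duplicates then sit next to each other in the file.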