Whole Exome Sequencing for Variant Discovery and Prioritisation
First, a recap. What have we learned?
NGS platforms: short and long reads
What the data looks like
How to QC data
General procedures in processing data
How to find biological signal in data - RNA-Seq lectures + practical (in progress)
There's a LOT more, but it's not necessarily more complex or very different!
Exomes: Publication Trends
Total: 925 (Oct 2012)
2013: ~800 papers
2014: ~1200 papers
Forero DA, 2012
NGS Variation Discovery Workflow (resequencing based)
Variant Discovery Application: Disease
Printed as books, the genome would amount to almost 2000 volumes of 1.5 million letters each (average books of 200 pages)!
This information is contained in any single cell of the body.
Monogenic Diseases
Single mutation
How do we find it in all those books?
A bioinformatics challenge
NGS sequencers can only read small portions
So, the library is fragments of pages of the books!
Mendelian Disease Gene Discovery
Gilissen, Genome Biol 2011
Opportunities and Challenges
Enabling technologies: NGS machines, open-source algorithms, capture reagents, lowering cost, big sample collections
Exomes are more cost-effective: sequence patient DNA and filter out common SNPs; compare parent-child trios; compare paired normal/cancer samples
Challenges:
Still can't interpret many Mendelian disorders
Rare variants need large sample sizes
Exome might miss regions (e.g. novel non-coding genes)
Shendure, Genome Biol 2011
Why exome sequencing?
WGS still too costly & added value of intergenic mutations is low
WES: targeted sequencing of coding regions (~1% of human genome)
Mendelian disorders disrupt protein-coding sequences (mostly)
Large fraction of rare non-synonymous variants in human genome are predicted to be deleterious
Splice sites also enriched for highly functional variation
The exome represents a highly enriched subset of the genome in which to search for variants with large effect sizes
A representation of the relationship between the size of the mutational target and the frequency of disease for disorders caused by de novo mutations
Gilissen, Genome Biol 2011
Majewski, J Med Genet 2011
Bamshad, Nat Rev Genet 2011
Maximizing chances of finding disease-causing rare variants using exome sequencing
Example: Comparative Sequencing
Somatic mutation detection between normal / cancer pairs
Higher mutation yield and better causal gene identification than in Mendelian disorders
Meyerson et al, Nat Rev Genet 2010
Pierce, Am J Hum Genet 2010
Perrault syndrome (HSD17B4)
BUT exome analysis for a single patient can be informative
Exome sequencing procedure
Read Mapping
Mapping hundreds of millions of reads to the reference genome is CPU- and RAM-intensive, and slow
Read quality decreases with length (are small single-nucleotide mismatches or indels real or artifacts?)
Very few mappers appropriately deal with indels
Mapping output: SAM (BAM) or BED
Mapped Data: SAM specification
Generic sequence alignment format
Describes alignment of reads to a reference
Flexible - stores all the alignment information
Simple enough to be easily generated or converted from other existing alignment formats
Keeps track of chromosome position, alignment quality and alignment features (extended CIGAR)
Includes mate pair / paired end information
Original FASTQ data can be reproduced from SAM (and BAM)
SAM FIELDS
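The field table from this slide did not survive extraction, so here is a minimal Python sketch of how one SAM alignment line can be split into its 11 mandatory fields and how the extended CIGAR can be read. The names SamRecord, parse_sam_line, parse_cigar and reference_span are illustrative, and the example read is a made-up toy record, not data from the lecture.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields, in file order."""
    qname: str   # read name
    flag: int    # bitwise flag (paired, mapped, strand, ...)
    rname: str   # reference sequence (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # extended CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-delimited alignment line (optional tags are ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    q, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(q, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str):
    """Turn '5M1I4M' into [(5, 'M'), (1, 'I'), (4, 'M')]."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


# Toy record: 13 read bases with one insertion and a 2 bp deletion.
example = ("read1\t99\tchr1\t100\t60\t5M1I4M2D3M\t=\t250\t165\t"
           "ACGTACGTACGTA\tIIIIIIIIIIIII")
record = parse_sam_line(example)
print(record.rname, record.pos, parse_cigar(record.cigar),
      reference_span(record.cigar))
```

In practice a library such as pysam or the samtools command line would normally be used instead; the sketch only shows what the format stores.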
BAM format
Binary version of SAM - more compact
Makes downstream analysis independent from the mapping program
Allows most operations on the alignment to work on a stream without loading the whole alignment into memory
Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
VCF format
Emerging standard for storing variant data
Originally designed for SNPs and short INDELs, it also works for structural variations
Consists of header and data sections
The data section is TAB delimited with each line consisting of at least 8 mandatory fields
VCF FIELDS
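As a rough illustration of the header/data split and the 8 mandatory columns, the following Python sketch reads VCF text line by line. The function names and the two example variants are assumptions made for illustration; real analyses would normally rely on an established parser.

```python
# The 8 mandatory, tab-delimited columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> dict:
    """Split 'DP=30;DB' into {'DP': '30', 'DB': True} (flags have no value)."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_vcf(lines):
    """Return the data records; header lines (starting with '#') are skipped."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError("a VCF data line needs at least 8 fields")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])
        record["ALT"] = record["ALT"].split(",")  # several ALT alleles possible
        record["INFO"] = parse_info(record["INFO"])
        records.append(record)
    return records


# Toy input: a SNV and a 1 bp deletion (made-up positions and values).
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;DB",
    "chr2\t500\t.\tCT\tC\t12.5\tq10\tDP=8",
]
for rec in parse_vcf(example_vcf):
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["FILTER"])
```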
Variant filtering
Variant Prioritization
Heuristic filtering to identify novel genes for Mendelian disorders
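One common reading of heuristic filtering is: drop common variants, keep only protein-altering ones, then look for genes hit in every affected individual. The Python sketch below follows that reading. The field names (gene, consequence, population_af), the 1% frequency cut-off, the consequence labels and the GENE_A to GENE_D data are all hypothetical and are not taken from the cited figure.

```python
from collections import defaultdict

# Consequence classes treated as potentially damaging (an assumption).
DAMAGING = {"missense", "nonsense", "frameshift", "splice_site"}


def filter_variants(variants, max_af=0.01):
    """Keep rare variants that are predicted to alter the protein."""
    return [
        v for v in variants
        if v["consequence"] in DAMAGING and v["population_af"] <= max_af
    ]


def shared_candidate_genes(per_patient_variants, max_af=0.01):
    """Genes carrying a filtered variant in every affected individual."""
    carriers = defaultdict(set)
    for patient, variants in per_patient_variants.items():
        for v in filter_variants(variants, max_af):
            carriers[v["gene"]].add(patient)
    n_patients = len(per_patient_variants)
    return sorted(g for g, who in carriers.items() if len(who) == n_patients)


# Toy data for two unrelated patients with the same disorder.
patients = {
    "P1": [
        {"gene": "GENE_A", "consequence": "missense", "population_af": 0.0},
        {"gene": "GENE_B", "consequence": "synonymous", "population_af": 0.0},
        {"gene": "GENE_C", "consequence": "missense", "population_af": 0.2},
    ],
    "P2": [
        {"gene": "GENE_A", "consequence": "frameshift", "population_af": 0.001},
        {"gene": "GENE_C", "consequence": "missense", "population_af": 0.2},
        {"gene": "GENE_D", "consequence": "nonsense", "population_af": 0.0},
    ],
}
print(shared_candidate_genes(patients))
```

Here GENE_C is removed because its variant is common, GENE_B because the change is synonymous, and GENE_D because only one patient carries it.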
Stitziel et al, Genome Biol 2011
More than just SNVs and short indels
Structural Variation
BreakDancer
Chen et al, Nat Meth 2009
Only looks at anomalous read pairs
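To make "anomalous read pairs" concrete, here is a simplified Python sketch that labels a pair by chromosome, orientation and insert size. It is not BreakDancer's actual algorithm: the categories, the 3-standard-deviation cut-off and the dictionary keys are assumptions chosen for illustration.

```python
def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label one read pair as normal or as a type of anomaly.

    pair is a dict with chrom1, pos1, strand1, chrom2, pos2, strand2
    (hypothetical keys); mean_insert and sd_insert describe the library.
    """
    if pair["chrom1"] != pair["chrom2"]:
        return "interchromosomal"
    if pair["strand1"] == pair["strand2"]:
        # Properly paired reads map to opposite strands.
        return "inversion-like"
    insert = abs(pair["pos2"] - pair["pos1"])
    if insert > mean_insert + n_sd * sd_insert:
        return "deletion-like"   # mates map too far apart
    if insert < mean_insert - n_sd * sd_insert:
        return "insertion-like"  # mates map too close together
    return "normal"


# Toy pairs for a library with 300 bp inserts (sd 30 bp).
pairs = [
    {"chrom1": "chr1", "pos1": 1000, "strand1": "+",
     "chrom2": "chr1", "pos2": 1300, "strand2": "-"},
    {"chrom1": "chr1", "pos1": 1000, "strand1": "+",
     "chrom2": "chr1", "pos2": 6000, "strand2": "-"},
    {"chrom1": "chr1", "pos1": 1000, "strand1": "+",
     "chrom2": "chr1", "pos2": 1100, "strand2": "-"},
    {"chrom1": "chr1", "pos1": 1000, "strand1": "+",
     "chrom2": "chr1", "pos2": 1300, "strand2": "+"},
    {"chrom1": "chr1", "pos1": 1000, "strand1": "+",
     "chrom2": "chr7", "pos2": 1300, "strand2": "-"},
]
labels = [classify_pair(p, mean_insert=300, sd_insert=30) for p in pairs]
print(labels)
```

A real caller would also require several supporting pairs clustered at the same locus before reporting a structural variant.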
Copy Number Variation Detection
Change in read coverage
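A change in read coverage is often expressed as a log2 ratio between a sample and a reference, per captured target. The Python sketch below is a minimal version of that idea under stated assumptions: depths are normalised by the total depth over all targets, a small pseudocount avoids division by zero, and the gain/loss thresholds (0.4 and -0.6) are illustrative rather than recommended values. The exon names and depths are made up.

```python
import math


def log2_ratios(sample_depth, reference_depth, pseudocount=0.5):
    """Per-target log2(sample / reference) after library-size normalisation."""
    sample_total = sum(sample_depth.values())
    reference_total = sum(reference_depth.values())
    ratios = {}
    for target, depth in sample_depth.items():
        sample_norm = (depth + pseudocount) / sample_total
        reference_norm = (reference_depth[target] + pseudocount) / reference_total
        ratios[target] = math.log2(sample_norm / reference_norm)
    return ratios


def call_cnv(ratios, gain=0.4, loss=-0.6):
    """Report only targets whose log2 ratio crosses a threshold."""
    calls = {}
    for target, ratio in ratios.items():
        if ratio >= gain:
            calls[target] = "gain"
        elif ratio <= loss:
            calls[target] = "loss"
    return calls


# Toy depths: exon3 has half the expected coverage in the sample.
sample = {"exon1": 100, "exon2": 100, "exon3": 50, "exon4": 100, "exon5": 100}
reference = {"exon1": 100, "exon2": 100, "exon3": 100, "exon4": 100, "exon5": 100}
calls = call_cnv(log2_ratios(sample, reference))
print(calls)
```

Exome capture efficiency varies strongly between targets, which is why the comparison is made against a reference processed the same way rather than against a flat expected depth.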
Example WES-based variant discovery workflow
Map the reads to a reference genome
Index the reference genome
Map (BWA, Bowtie, Novoalign, etc.)
Sort BAM file
Remove PCR duplicates
Realign around indels (optional)
Call variants
Recalibrate quality scores (optional)
Filter variants
Basic variant annotation
Biological interpretation only starts here
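The workflow steps can be sketched as an ordered list of shell commands. This Python snippet only builds and prints the commands; it does not run them. The choice of samtools and bcftools for sorting and calling is an assumption (the slides only name the mappers), the file names are placeholders, and steps whose tool is left open are listed with no command.

```python
import shlex


def workflow_commands(reference, reads1, reads2, prefix):
    """Return (step, command) pairs in workflow order; command may be None."""
    ref, r1, r2 = (shlex.quote(p) for p in (reference, reads1, reads2))
    sam = shlex.quote(prefix + ".sam")
    bam = shlex.quote(prefix + ".sorted.bam")
    vcf = shlex.quote(prefix + ".vcf.gz")
    return [
        ("Index the reference genome", f"bwa index {ref}"),
        ("Map", f"bwa mem {ref} {r1} {r2} > {sam}"),
        ("Sort BAM file", f"samtools sort -o {bam} {sam}"),
        ("Remove PCR duplicates", None),               # tool choice left open
        ("Realign around indels (optional)", None),    # tool choice left open
        ("Call variants",
         f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -Oz -o {vcf}"),
        ("Recalibrate quality scores (optional)", None),
        ("Filter variants", None),
        ("Basic variant annotation", None),
    ]


steps = workflow_commands("ref.fa", "r1.fq", "r2.fq", "sample")
for name, command in steps:
    print(f"{name}: {command if command else '(tool not specified here)'}")
```

As written, the variant caller would see the sorted but not yet de-duplicated BAM file; in a real pipeline the output of the duplicate-removal step would be passed on instead.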
Chart data: papers per year
Year    Papers
2006    0
2007    0
2008    4
2009    7
2010    66
2011    370
2012    531
2013    (no value)