Whole Exome Sequencing for Variant Discovery and Prioritisation

Page 2: Whole Exome Sequencing for Variant Discovery and Prioritisation

First, a recap. What have we learned?

• NGS platforms – short and long reads

• What the data looks like

• How to QC data

• General procedures in processing data

• How to find biological signal in data - RNA-Seq lectures + practical (in progress)

• There’s a LOT more, but it’s not necessarily more complex or very different!

Page 3: Whole Exome Sequencing for Variant Discovery and Prioritisation

Exomes: Publication Trends

Figure: number of exome papers per year, 2005 to 2013 (x-axis: Year; y-axis: Papers, 0 to 600). Total: 925 (Oct 2012). 2013: ~800 papers; 2014: ~1200 papers.

Forero DA, 2012

Page 4: Whole Exome Sequencing for Variant Discovery and Prioritisation

NGS Variation Discovery Workflow (resequencing based)

Page 5: Whole Exome Sequencing for Variant Discovery and Prioritisation

Variant Discovery Application: Disease

• Written out, the genome would fill almost 2000 books of 1.5 million letters each (an average book of 200 pages), about 3 billion letters in total!

• This information is contained in almost every single cell of the body.

Page 6: Whole Exome Sequencing for Variant Discovery and Prioritisation

Monogenic Diseases

• Single mutation

• How do we find it in all those ‘books’?

• A bioinformatics challenge

• NGS sequencers can only read small portions

• So, the library is fragments of pages of the books!

Page 7: Whole Exome Sequencing for Variant Discovery and Prioritisation

Mendelian Disease Gene Discovery

Gilissen, Genome Biol 2011

Page 8: Whole Exome Sequencing for Variant Discovery and Prioritisation

Mendelian Disease Gene Discovery

Gilissen, Genome Biol 2011

Page 9: Whole Exome Sequencing for Variant Discovery and Prioritisation

Opportunities and Challenges

• Enabling technologies: NGS machines, open-source algorithms, capture reagents, falling costs, large sample collections

• Exomes are more cost-effective: sequence patient DNA and filter out common SNPs; compare parent-child trios; compare paired normal/cancer samples

• Challenges:

– Still can’t interpret many Mendelian disorders

– Rare variants need large sample sizes

– Exome might miss the causal region (e.g. novel non-coding genes)

Shendure, Genome Biol 2011

Page 10: Whole Exome Sequencing for Variant Discovery and Prioritisation

Why exome sequencing?

• WGS still too costly & added value of intergenic mutations is low

• WES: targeted sequencing of coding regions (~1% of human genome)

• Mendelian disorders disrupt protein-coding sequences (mostly)

• Large fraction of rare non-synonymous variants in human genome are predicted to be deleterious

• Splice sites also enriched for highly functional variation

• The exome represents a highly enriched subset of the genome in which to search for variants with large effect sizes

Page 11: Whole Exome Sequencing for Variant Discovery and Prioritisation

A representation of the relationship between the size of the mutational target and the frequency of disease for disorders caused by de novo mutations

Gilissen, Genome Biol 2011

Page 12: Whole Exome Sequencing for Variant Discovery and Prioritisation

Majewski, J Med Genet 2011

Page 13: Whole Exome Sequencing for Variant Discovery and Prioritisation

Bamshad, Nat Rev Genet 2011

Maximizing chances of finding disease-causing rare variants using exome sequencing

Page 14: Whole Exome Sequencing for Variant Discovery and Prioritisation

Example: Comparative Sequencing

• Somatic mutation detection between normal / cancer pairs

• Higher mutation yield and better causal gene identification than for Mendelian disorders

Meyerson et al, Nat Rev Genet 2010

Page 15: Whole Exome Sequencing for Variant Discovery and Prioritisation

Pierce, Am J Hum Genet 2010

Perrault syndrome (HSD17B4)

BUT exome analysis of a single patient can be informative

Page 16: Whole Exome Sequencing for Variant Discovery and Prioritisation

Exome sequencing procedure

Page 17: Whole Exome Sequencing for Variant Discovery and Prioritisation

Read Mapping

• Mapping hundreds of millions of reads to the reference genome is CPU and RAM intensive, and ‘slow’

• Read quality decreases along the length of the read (are single-nucleotide mismatches or small indels real or artifacts?)

• Very few mappers appropriately deal with indels

• Mapping output: SAM (BAM) or BED

Page 18: Whole Exome Sequencing for Variant Discovery and Prioritisation

Mapped Data: SAM specification

• Generic sequence alignment format

• Describes alignment of reads to a reference

• Flexible - stores all the alignment information

• Simple enough to be easily generated or converted from other existing alignment formats

• Keeps track of chromosome position, alignment quality and alignment features (extended cigar)

• Includes mate pair / paired end information

• Original FASTQ data can be reproduced from SAM (and BAM)
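The "extended cigar" mentioned above is easiest to understand by parsing one. The following is a minimal sketch, assuming the standard CIGAR operation letters from the SAM specification; the function names `parse_cigar` and `reference_span` are made up for this illustration.

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    if cigar == "*":  # "*" means the CIGAR is unavailable
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to catch anything the regex silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) covered on the reference."""
    length = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
    return pos, pos + length - 1


# 5 soft-clipped bases, 40 aligned, 2 inserted, 30 aligned, 1 deleted, 23 aligned.
print(parse_cigar("5S40M2I30M1D23M"))
print(reference_span(100, "5S40M2I30M1D23M"))
```

Soft clips and insertions consume read bases but not reference bases, which is why the read above is 100 bases long yet covers only 94 reference positions.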

Page 19: Whole Exome Sequencing for Variant Discovery and Prioritisation

SAM FIELDS
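As a rough illustration of the field layout, the sketch below splits one SAM alignment line into its 11 mandatory tab-separated fields and decodes the bitwise FLAG. The field names follow the SAM specification; the example read and the helper names (`parse_sam_line`, `decode_flag`) are invented for this sketch.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # extended CIGAR string
    rnext: str   # reference name of the mate ("=" means same as rname)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)
    tags: tuple  # optional TAG:TYPE:VALUE fields


FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse", 0x20: "mate_reverse",
    0x40: "first_in_pair", 0x80: "second_in_pair", 0x100: "secondary",
    0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Invented example read, for illustration only.
line = "read1\t99\tchr1\t100\t60\t10M\t=\t250\t160\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, decode_flag(rec.flag))
```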

Page 20: Whole Exome Sequencing for Variant Discovery and Prioritisation

BAM format

• Binary version of SAM - more compact

• Makes downstream analysis independent from the mapping program

• Allows most operations on alignments to work on a stream without loading the whole alignment into memory

• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
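To make "binary version of SAM" concrete, here is a hedged sketch that reads only the header section of a BAM stream using the Python standard library. It assumes the header layout given in the SAM/BAM specification (magic bytes, header text, reference list). Real BAM files use BGZF, a blocked gzip variant that Python's `gzip` module can decompress sequentially; the toy file built here is ordinary gzip and exists only so the example is self-contained. In practice a library such as pysam or the samtools command line would be used instead.

```python
import gzip
import io
import struct


def read_bam_header(stream):
    """Return (header_text, [(ref_name, ref_length), ...]) from a BAM stream."""
    with gzip.open(stream, "rb") as fh:
        if fh.read(4) != b"BAM\x01":
            raise ValueError("not a BAM stream (bad magic bytes)")
        (l_text,) = struct.unpack("<I", fh.read(4))
        text = fh.read(l_text).decode("ascii")
        (n_ref,) = struct.unpack("<I", fh.read(4))
        refs = []
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<I", fh.read(4))
            name = fh.read(l_name).rstrip(b"\x00").decode("ascii")
            (l_ref,) = struct.unpack("<I", fh.read(4))
            refs.append((name, l_ref))
    return text, refs


def build_toy_bam(header_text, references):
    """Build a header-only, gzip-compressed toy BAM (illustration helper)."""
    text = header_text.encode("ascii")
    payload = b"BAM\x01" + struct.pack("<I", len(text)) + text
    payload += struct.pack("<I", len(references))
    for name, length in references:
        name_bytes = name.encode("ascii") + b"\x00"  # names are NUL-terminated
        payload += struct.pack("<I", len(name_bytes)) + name_bytes
        payload += struct.pack("<I", length)
    return gzip.compress(payload)


toy = build_toy_bam("@HD\tVN:1.6\tSO:coordinate\n", [("chr1", 1000), ("chr2", 800)])
text, refs = read_bam_header(io.BytesIO(toy))
print(text.strip())
print(refs)
```

Because the reference names and lengths sit in the file itself, downstream tools do not need to know which mapper produced the alignments.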

Page 21: Whole Exome Sequencing for Variant Discovery and Prioritisation

VCF format

• Emerging standard for storing variant data

• Originally designed for SNPs and short INDELs, it also works for structural variations

• Consists of header and data sections

• The data section is TAB delimited with each line consisting of at least 8 mandatory fields

Page 22: Whole Exome Sequencing for Variant Discovery and Prioritisation

VCF FIELDS
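The eight mandatory columns of a VCF data line are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below parses one such line into a dictionary; the example variant and the function names are invented for illustration, and per-sample genotype columns are kept as raw strings rather than interpreted.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_info(info):
    """Turn 'DP=30;DB' into {'DP': '30', 'DB': True}."""
    if info == ".":
        return {}
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True  # keys without '=' are flags
    return out


def parse_vcf_line(line):
    """Parse one VCF data line (not a header line starting with '#')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 fields")
    rec = dict(zip(VCF_COLUMNS, fields[:8]))
    rec["POS"] = int(rec["POS"])
    rec["ALT"] = rec["ALT"].split(",")  # several ALT alleles are comma-separated
    rec["QUAL"] = None if rec["QUAL"] == "." else float(rec["QUAL"])
    rec["INFO"] = parse_info(rec["INFO"])
    rec["SAMPLES"] = fields[8:]  # FORMAT and genotype columns, if present
    return rec


# Invented example variant, for illustration only.
rec = parse_vcf_line("chr1\t12345\trs1\tA\tG,T\t50.0\tPASS\tDP=30;DB")
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["INFO"])
```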

Page 23: Whole Exome Sequencing for Variant Discovery and Prioritisation
Page 24: Whole Exome Sequencing for Variant Discovery and Prioritisation

Variant filtering
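A simple form of variant filtering is a hard cut-off on call quality and read depth. The sketch below is only an illustration: the thresholds, the dictionary keys (`qual`, `depth`) and the example calls are assumptions made for this example, not values taken from the slides or from any particular caller.

```python
def passes_hard_filter(variant, min_qual=30.0, min_depth=10):
    """Keep a call only if both its quality and its read depth are high enough."""
    qual = variant.get("qual")
    depth = variant.get("depth")
    if qual is None or depth is None:
        return False  # missing evidence is treated as a failure here
    return qual >= min_qual and depth >= min_depth


# Invented calls, for illustration only.
calls = [
    {"id": "v1", "qual": 55.0, "depth": 40},
    {"id": "v2", "qual": 12.0, "depth": 35},    # low quality
    {"id": "v3", "qual": 80.0, "depth": 4},     # too few reads
    {"id": "v4", "qual": None, "depth": 20},    # no quality reported
]

kept = [c["id"] for c in calls if passes_hard_filter(c)]
print(kept)
```

Suitable thresholds depend on the sequencing depth and the variant caller, so they are best treated as tunable parameters.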

Page 25: Whole Exome Sequencing for Variant Discovery and Prioritisation

Variant Prioritization

• Heuristic filtering to identify novel genes for Mendelian disorders

Stitziel et al, Genome Biol 2011
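Heuristic filtering of this kind can be sketched as a chain of simple rules: discard common variants, keep protein-altering ones, and require the genotypes in a gene to fit the assumed inheritance model. The code below is a toy version under stated assumptions: the consequence labels, the 1% frequency cut-off, the record keys and the example data are all hypothetical, and it is not the procedure of the cited paper.

```python
from collections import defaultdict

# Hypothetical consequence labels considered protein-altering.
DAMAGING = {"missense", "nonsense", "frameshift", "splice_site"}


def candidate_genes(variants, max_af=0.01, model="recessive"):
    """Return genes whose rare, damaging variants fit the inheritance model."""
    per_gene = defaultdict(list)
    for v in variants:
        if v["pop_af"] > max_af:             # filter out common variants
            continue
        if v["consequence"] not in DAMAGING:  # keep protein-altering changes
            continue
        per_gene[v["gene"]].append(v)

    hits = []
    for gene, vs in per_gene.items():
        has_hom = any(v["genotype"] == "hom" for v in vs)
        n_het = sum(v["genotype"] == "het" for v in vs)
        if model == "dominant":
            ok = has_hom or n_het >= 1
        elif model == "recessive":
            # homozygous, or two hits that may be compound heterozygous
            ok = has_hom or n_het >= 2
        else:
            raise ValueError(f"unknown model: {model!r}")
        if ok:
            hits.append(gene)
    return sorted(hits)


# Invented patient variants, for illustration only.
patient = [
    {"gene": "GENE_A", "consequence": "missense", "pop_af": 0.0001, "genotype": "het"},
    {"gene": "GENE_A", "consequence": "nonsense", "pop_af": 0.0, "genotype": "het"},
    {"gene": "GENE_B", "consequence": "missense", "pop_af": 0.20, "genotype": "hom"},
    {"gene": "GENE_C", "consequence": "synonymous", "pop_af": 0.0, "genotype": "hom"},
    {"gene": "GENE_D", "consequence": "frameshift", "pop_af": 0.001, "genotype": "het"},
]

print(candidate_genes(patient, model="recessive"))
print(candidate_genes(patient, model="dominant"))
```

Two heterozygous hits in one gene are only a compound-heterozygous candidate; confirming that they lie on different parental copies needs phasing or parental data.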

Page 26: Whole Exome Sequencing for Variant Discovery and Prioritisation

More than just SNVs and ‘short’ indels

Page 27: Whole Exome Sequencing for Variant Discovery and Prioritisation

Structural Variation

BreakDancer

Chen et al, Nat Meth 2009

Only looks at anomalous read pairs
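The idea of an "anomalous read pair" can be illustrated with a toy classifier: pairs whose insert size is far from the library's typical value, whose orientation is unexpected, or whose mates map to different chromosomes hint at a structural variant. This is a simplified sketch and not BreakDancer's actual statistical model; the record keys, the median/MAD cut-off and the example pairs are assumptions for illustration.

```python
from statistics import median


def classify_pairs(pairs, n_mad=5.0):
    """Label each read pair as 'normal' or as a structural-variant hint."""
    sizes = [abs(p["tlen"]) for p in pairs
             if p["same_chrom"] and p["orientation"] == "FR"]
    if not sizes:
        raise ValueError("need at least one same-chromosome FR pair")
    med = median(sizes)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(s - med) for s in sizes]) or 1.0
    low, high = med - n_mad * mad, med + n_mad * mad

    labels = []
    for p in pairs:
        size = abs(p["tlen"])
        if not p["same_chrom"]:
            labels.append("translocation_candidate")
        elif p["orientation"] != "FR":
            labels.append("inversion_candidate")
        elif size > high:
            labels.append("deletion_candidate")   # mates further apart than expected
        elif size < low:
            labels.append("insertion_candidate")  # mates closer than expected
        else:
            labels.append("normal")
    return labels


# Invented read pairs, for illustration only.
pairs = [
    {"tlen": 300, "same_chrom": True, "orientation": "FR"},
    {"tlen": 310, "same_chrom": True, "orientation": "FR"},
    {"tlen": 290, "same_chrom": True, "orientation": "FR"},
    {"tlen": 305, "same_chrom": True, "orientation": "FR"},
    {"tlen": 295, "same_chrom": True, "orientation": "FR"},
    {"tlen": 2000, "same_chrom": True, "orientation": "FR"},
    {"tlen": 0, "same_chrom": False, "orientation": "FR"},
    {"tlen": 300, "same_chrom": True, "orientation": "FF"},
]

for pair, label in zip(pairs, classify_pairs(pairs)):
    print(pair["tlen"], label)
```

A real caller would require several supporting pairs clustered at the same locus before reporting a variant, since single anomalous pairs are often mapping artifacts.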

Page 28: Whole Exome Sequencing for Variant Discovery and Prioritisation

Copy Number Variation Detection

Change in read coverage
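A change in read coverage is commonly summarised as a log2 ratio between a sample and a control over the same target regions. The sketch below normalises each count by the library total, takes the log2 ratio per region and applies simple cut-offs. The thresholds, pseudocount and counts are illustrative assumptions, not values from the slides.

```python
import math


def log2_ratios(sample_counts, control_counts, pseudocount=0.5):
    """Per-region log2(sample / control) after library-size normalisation."""
    if len(sample_counts) != len(control_counts):
        raise ValueError("sample and control must cover the same regions")
    s_total = sum(sample_counts)
    c_total = sum(control_counts)
    ratios = []
    for s, c in zip(sample_counts, control_counts):
        s_norm = (s + pseudocount) / s_total  # pseudocount avoids log of zero
        c_norm = (c + pseudocount) / c_total
        ratios.append(math.log2(s_norm / c_norm))
    return ratios


def call_cnv(ratios, gain=0.4, loss=-0.6):
    """Label each region as 'gain', 'loss' or 'neutral' (illustrative cut-offs)."""
    labels = []
    for r in ratios:
        if r >= gain:
            labels.append("gain")
        elif r <= loss:
            labels.append("loss")
        else:
            labels.append("neutral")
    return labels


# Invented read counts per exon, for illustration only.
sample = [100, 100, 50, 150, 100]
control = [100, 100, 100, 100, 100]

ratios = log2_ratios(sample, control)
print([round(r, 2) for r in ratios])
print(call_cnv(ratios))
```

For a diploid genome, losing one copy gives an expected ratio near log2(1/2) = -1 and gaining one copy near log2(3/2) ≈ 0.58, which is why the cut-offs sit between those values and zero.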

Page 29: Whole Exome Sequencing for Variant Discovery and Prioritisation

Example WES-based variant discovery workflow

1. Map the reads to a reference genome
   1. Index the reference genome
   2. Map (BWA, Bowtie, Novoalign, etc.)
2. Sort BAM file
3. Remove PCR duplicates
4. Realign around indels (‘optional’)
5. Call variants
6. Recalibrate quality scores (‘optional’)
7. Filter variants
8. Basic variant annotation
9. Biological interpretation only starts here
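The first steps of this workflow can be sketched as a small command plan. The example below only builds and prints the command lines; it does not run them. It assumes BWA as the mapper and samtools for sorting, which is one possible choice among the tools listed. The file names are hypothetical, and the later steps (duplicate removal, calling, filtering, annotation) are left out because their commands depend on the toolkit chosen.

```python
import shlex


def mapping_plan(reference, fastq1, fastq2, prefix):
    """Return (label, argv, stdout_file) tuples for indexing, mapping and sorting."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    return [
        ("1.1 index the reference genome", ["bwa", "index", reference], None),
        # bwa mem writes SAM to standard output, so it is redirected to a file.
        ("1.2 map paired-end reads", ["bwa", "mem", reference, fastq1, fastq2], sam),
        ("2 sort the alignments", ["samtools", "sort", "-o", bam, sam], None),
    ]


def render(plan):
    """Format the plan as shell lines for review before running anything."""
    lines = []
    for label, argv, stdout_file in plan:
        command = shlex.join(argv)
        if stdout_file is not None:
            command += f" > {shlex.quote(stdout_file)}"
        lines.append(f"# {label}")
        lines.append(command)
    return "\n".join(lines)


# Hypothetical input file names.
plan = mapping_plan("ref.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample")
print(render(plan))
```

Keeping the plan as data makes it easy to review or log the exact commands before handing them to `subprocess.run` or a workflow manager.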