
Mutation detection using whole genome sequencing

2017 Winter School in Mathematical and Computational Biology

Ann-Marie Patch

3rd July 2017

Mutation Detection success depends on previous steps

Sample preparation
Library preparation
Sequence generation
Initial data processing
Data analysis

•DNA
•RNA
•miRNA
•BAC
•Fragmentation
•Size selection
•Target enrichment
•Indexing
•Platform
•Sequence length
•Base calling
•Quality assessment
•De novo assembly
•Alignment
•Mutation detection
•Annotation
•Biological interpretation

Generalised sequencing workflow

Whole genome paired-end sequencing process recap - library preparation

Genomic DNA

Sample preparation is key to getting good results

High molecular weight DNA is required (not just DNA quantification)

10 000bp

Smears indicate degraded samples

Genomic DNA

Fragment DNA

Clean-up DNA fragments

Consistent fragment size distribution across all your samples


Adaptors added

Sequence reads produced from both ends of each fragment

The distance from the ends of the reads should follow the DNA size distribution

Clean-up DNA fragments


HiSeq ~300 bp

HiSeq X Ten ~400 bp

FFPE (formalin fixed paraffin embedded) DNA samples have a high degree of fragmentation.

This produces a shorter TLEN, so think about the read length of the sequencing or you could end up paying to sequence the adapters.

[Figure: a fragment of median length 150bp flanked by Adapter 1 and Adapter 2, with Read 1 and Read 2 from HiSeq X Ten sequencing 2x 150bp]
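The point about paying to sequence adapters can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the 100 bp example fragment are assumptions, not values from the slides. It shows how many bases of each read run past the end of the fragment and how much of the fragment is covered twice by the pair.

```python
def adapter_bases_per_read(fragment_len: int, read_len: int) -> int:
    """Bases at the end of each read that extend past the fragment into the adapter."""
    return max(0, read_len - fragment_len)


def read_pair_overlap(fragment_len: int, read_len: int) -> int:
    """Bases of the fragment that are sequenced by both reads of the pair."""
    usable = min(read_len, fragment_len)  # only fragment-derived bases count
    return max(0, 2 * usable - fragment_len)


READ_LEN = 150  # 2x 150bp sequencing, as on the slide

# Fragment lengths: 100 is a hypothetical short FFPE fragment,
# 150 is the FFPE median from the slide, 400 is the HiSeq X Ten figure.
for fragment_len in (100, 150, 400):
    print(
        fragment_len,
        adapter_bases_per_read(fragment_len, READ_LEN),
        read_pair_overlap(fragment_len, READ_LEN),
    )
```

With a 150 bp median, about half of the fragments are shorter than the read length, so those reads contain adapter sequence, and even at exactly 150 bp both reads cover the same bases.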

Library production is key for successful mutation detection

Reference genome

Paired-end sequence alignment to a reference genome

Paired-end sequences mapped to genome

Read depth

Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine variations and structural rearrangements

SNV/indels

Detection software pinpoints differences in your sample compared to a reference

deletions

amplifications

translocations

Variants are recorded as positional information and read counts generated by detection software

Software tools are usually designed to detect one type of variant

Choosing mutation calling software

Choice can be guided by

Type of data

The biological question

Available computing resources

Past experience

Related literature

QIMR DNA variant detection

•Substitutions
    •qSNP (QIMR)
    •GATK UnifiedGenotyper (Broad)
•Small insertions and deletions
    •GATK UnifiedGenotyper (Broad)
•Large structural variations
    •qSV (QIMR)
•Copy number aberrations
    •ASCAT-ngs

Identifying software to try

One of many on-line resources

https://omictools.com/whole-genome-resequencing-category

Just remember:

Each piece of software will give different results

Results depend on the quality of the starting material

DNA/RNA quality

disease or organism type

Evaluate the output from each tool for your data and research question
Visualising the variants detected

Robinson et al 2011

XTen WGS 150bp paired end data

Below is a 404bp region from human chromosome 3

Grey blocks: base pairs matching reference

Small coloured blocks indicate a change from the reference

Reference sequence

A pair of reads

Identifying signal from noise

Robinson et al 2011

Below is a 40bp region over exon 14 of BAP1

Reference sequence

Signal – consistent variants per position across a number of reads

Noise – random variants per position in only a few reads

Total read count = 37

No of reference (T) = 19

No of alternate (C) = 18

51% T : 49% C

In a diploid organism this equates to a heterozygous call
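The read-count arithmetic above can be reproduced in a few lines. This is a minimal sketch, not how any particular caller works: the function names and the 25%/75% genotype thresholds are assumptions chosen for illustration.

```python
def allele_fractions(ref_count: int, alt_count: int) -> tuple[float, float]:
    """Return (reference fraction, alternate fraction) at one position."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads cover this position")
    return ref_count / total, alt_count / total


def naive_diploid_call(alt_fraction: float, het_low: float = 0.25, het_high: float = 0.75) -> str:
    """Very naive genotype call; the thresholds are hypothetical."""
    if alt_fraction < het_low:
        return "homozygous reference"
    if alt_fraction > het_high:
        return "homozygous alternate"
    return "heterozygous"


# Counts from the BAP1 example: 19 reference (T) and 18 alternate (C) reads
ref_fraction, alt_fraction = allele_fractions(19, 18)
print(f"{ref_fraction:.0%} T : {alt_fraction:.0%} C -> {naive_diploid_call(alt_fraction)}")
```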

Ensure coverage is not a limiting factor

In germline sequencing most homozygous SNVs are detected at a 15X average depth but an average depth of 33X is required to reproducibly detect the same proportion of heterozygous SNVs.

Bentley, et al. Nature (2008).
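One way to see why heterozygous sites need more depth is a simple binomial model. The sketch below is an illustration under stated assumptions, not the analysis from the cited paper: it assumes error-free reads, a fixed local depth, an alternate allele fraction of 0.5 at heterozygous sites and 1.0 at homozygous sites, and a hypothetical requirement of at least 4 alternate reads to make a call.

```python
from math import comb


def prob_at_least(depth: int, min_alt_reads: int, alt_fraction: float) -> float:
    """P(at least min_alt_reads alternate reads) when each of `depth` reads
    independently carries the alternate allele with probability alt_fraction."""
    return sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads, depth + 1)
    )


MIN_ALT_READS = 4  # hypothetical calling threshold

for depth in (5, 10, 15, 33):
    het = prob_at_least(depth, MIN_ALT_READS, 0.5)
    hom = prob_at_least(depth, MIN_ALT_READS, 1.0)
    print(f"depth {depth:>2}: heterozygous {het:.3f}, homozygous {hom:.3f}")
```

In a real genome the depth itself varies from position to position, so the average depth needed is higher than this fixed-depth picture suggests.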

Uninformative reads impact on the final coverage and ultimately the mutation detection sensitivity

Sims et al Nat Rev Genet. 2014

What about when the noise level is high?

Is 22% of the alternate allele enough to make a good call?

Same XTen data as before but a different region

Understand the characteristics of the sequencing reads

Detection tools read in the alignment files (BAM files)

Samtools can help you investigate the reads that provide evidence of the variant

SAM fields

1    QNAME  Query template/pair NAME
2    FLAG   bitwise FLAG
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost mapping POSition (first aligned base, after any clipping)
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  extended CIGAR string
7    RNEXT  Reference sequence name of the mate ('=' if same as RNAME; called MRNM in older versions of the SAM format)
8    PNEXT  1-based POSition of the mate (called MPOS in older versions of the SAM format)
9    TLEN   inferred Template LENgth (insert size)
10   SEQ    query SEQuence on the same strand as the reference
11   QUAL   query QUALity (ASCII-33 gives the Phred base quality)
12+  OPT    variable OPTional fields in the format TAG:VTYPE:VALUE
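To make the field list concrete, here is a minimal sketch that splits one SAM alignment line into the eleven mandatory fields, decodes the FLAG bits and converts QUAL to Phred scores. It uses only the Python standard library. The example read is invented, and fields 7 and 8 use the current names RNEXT and PNEXT (MRNM and MPOS in older versions of the format).

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# FLAG bit meanings as defined in the SAM specification
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line: str) -> dict:
    """Split a tab-delimited SAM alignment line into named fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPT"] = cols[11:]  # optional TAG:VTYPE:VALUE fields, left as strings
    return record


def decode_flag(flag: int) -> list[str]:
    """List the names of the bits that are set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def phred_qualities(qual: str) -> list[int]:
    """Convert a QUAL string to Phred scores (ASCII code minus 33)."""
    return [ord(char) - 33 for char in qual]


# Invented example read, for illustration only
example = "read1\t99\tchr3\t1000\t60\t10M\t=\t1200\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
print(phred_qualities(record["QUAL"]))
```

In practice the same information is usually read from BAM files with samtools or a library built on it; the hand-rolled parser here is only meant to show what the columns contain.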

Mononucleotide runs are particularly error prone

Inconsistent small deletions of 2 to 7 bp at a homopolymer run of A’s

Polymerase slippage during sequencing?
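Calls that fall inside mononucleotide runs can be flagged by scanning the reference sequence for homopolymers. The sketch below is one simple way to do that; the function name, the minimum run length of 6 and the example sequence are assumptions for illustration.

```python
from itertools import groupby


def homopolymer_runs(sequence: str, min_length: int = 6) -> list[tuple[int, str, int]]:
    """Return (0-based start, base, run length) for each run of a single base
    that is at least min_length long."""
    runs = []
    position = 0
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


# Invented sequence containing a run of eight A's
print(homopolymer_runs("ACG" + "A" * 8 + "TTGC"))
```

Indel calls that start inside or next to a reported run are candidates for closer manual review.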

Sources of bias in sequencing data

Changes in expected proportions can be due to:

• Sample purity/integrity and heterogeneity

• Stochastic sampling/low coverage depth

• Capture or enrichment bias

• Alignment/mapping bias

• Repetitive regions of the reference sequence

• Sequencing error / Platform related artefact

How should we determine a good call from error?

Detection tools often attempt to provide a confidence level for the variant call

qSNP: a sensitive, in-house, rules-based heuristic tool (Kassahn et al 2013)

GATK (UnifiedGenotyper): a Bayesian tool (McKenna et al 2010)

Tool    Raw germline    Filtered germline
qSNP    4,180,630       3,698,034
GATK    4,945,990       4,069,314

A simplified view states humans are 99.9% identical

We therefore expect ~3,000,000 single nucleotide variants per person (1000 per Mb or 0.1%)
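The expectation can be checked against the call counts quoted above. The short sketch below assumes a rounded genome size of 3,000 Mb; the variable and function names are illustrative.

```python
GENOME_SIZE_MB = 3_000   # rounded human genome size in Mb (assumption)
SNVS_PER_MB = 1_000      # 0.1% of positions, as on the slide

expected_snvs = GENOME_SIZE_MB * SNVS_PER_MB

# (raw germline calls, filtered germline calls) from the slide
calls = {
    "qSNP": (4_180_630, 3_698_034),
    "GATK": (4_945_990, 4_069_314),
}


def filtering_summary(raw: int, filtered: int, expected: int) -> tuple[float, float]:
    """Return (fraction of raw calls removed, filtered calls relative to expectation)."""
    return (raw - filtered) / raw, filtered / expected


print(f"expected SNVs: {expected_snvs:,}")
for tool, (raw, filtered) in calls.items():
    removed, ratio = filtering_summary(raw, filtered, expected_snvs)
    print(f"{tool}: {removed:.1%} of raw calls removed, {ratio:.2f}x the expected count remain")
```

Both tools report more raw calls than the simplified expectation, which is the motivation for the filtering step.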

Filtering of results from mutation detection tools is necessary

Strategy for identifying and filtering substitution variants

Quality filter the reads that are used by the detection software

•Remove duplicate reads
•Require a minimum mapping quality for reads e.g. >10
•Impose a maximum number of mismatches allowed in a read e.g. <=3
•Require a minimum number of consecutive matched bases in a read e.g. >=34
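The read-level filters listed above can be sketched as a single predicate. This is an illustration, not the filter used by any specific tool: the `Read` class is hypothetical, the mismatch count is assumed to come from something like the NM tag (which also counts indel bases), and the longest CIGAR `M`/`=` block is used as an approximation of consecutive matched bases (an `M` block can still contain mismatches).

```python
import re
from dataclasses import dataclass

DUPLICATE_FLAG = 0x400  # SAM FLAG bit for PCR or optical duplicates
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


@dataclass
class Read:
    """Hypothetical minimal view of an aligned read."""
    flag: int
    mapq: int
    cigar: str
    mismatches: int  # assumed to be taken from e.g. the NM tag


def longest_match_run(cigar: str) -> int:
    """Length of the longest single M or = block in the CIGAR string."""
    runs = [int(length) for length, op in CIGAR_OP.findall(cigar) if op in "M="]
    return max(runs, default=0)


def passes_read_filters(read: Read, min_mapq_exclusive: int = 10,
                        max_mismatches: int = 3, min_matched: int = 34) -> bool:
    """Apply the four example filters from the slide."""
    return (
        not read.flag & DUPLICATE_FLAG
        and read.mapq > min_mapq_exclusive
        and read.mismatches <= max_mismatches
        and longest_match_run(read.cigar) >= min_matched
    )


reads = [
    Read(flag=99, mapq=60, cigar="150M", mismatches=1),
    Read(flag=99 | DUPLICATE_FLAG, mapq=60, cigar="150M", mismatches=0),
    Read(flag=99, mapq=10, cigar="150M", mismatches=0),
    Read(flag=99, mapq=60, cigar="30M2D30M", mismatches=2),
]
print([passes_read_filters(read) for read in reads])
```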

Understand the characteristics that influence the confidence of a variant call

•What is the minimum number/proportion of variant-containing reads required?
•Is there a minimum read depth for a good call?
•Are the base qualities for variant positions taken into account?

Look for potential weakness in calls by adding your own annotation

•Position of the variant within the reads: are they all at the ends of reads?
•Is the variant identified in reads sequenced in both directions?
•Is the variant identified in the majority of your samples, so could be an artefact?
•Is the variant in a repetitive area of the reference?
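Two of the annotations above, strand support and position within the read, are easy to compute once the supporting reads are known. The sketch below assumes a hypothetical `VariantRead` record per supporting read and an arbitrary 5-base window for "end of read"; neither comes from the slides.

```python
from dataclasses import dataclass


@dataclass
class VariantRead:
    """Hypothetical record for one read that carries the variant allele."""
    position_in_read: int  # 0-based offset of the variant base within the read
    read_length: int
    is_reverse: bool


def annotate_support(reads: list[VariantRead], end_window: int = 5) -> dict:
    """Summarise strand support and whether every supporting read has the
    variant within end_window bases of either read end."""
    forward = sum(1 for read in reads if not read.is_reverse)
    reverse = len(reads) - forward
    near_end = sum(
        1 for read in reads
        if min(read.position_in_read, read.read_length - 1 - read.position_in_read) < end_window
    )
    return {
        "forward": forward,
        "reverse": reverse,
        "both_strands": forward > 0 and reverse > 0,
        "all_near_read_ends": bool(reads) and near_end == len(reads),
    }


supporting = [
    VariantRead(position_in_read=2, read_length=150, is_reverse=False),
    VariantRead(position_in_read=148, read_length=150, is_reverse=False),
    VariantRead(position_in_read=1, read_length=150, is_reverse=False),
]
print(annotate_support(supporting))
```

A variant seen only on one strand and only at read ends, as in this invented example, would be a candidate artefact.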

Download an appropriate published dataset to compare your output with what was published.

For human germline sequencing there are standard datasets

• Genome in a bottle (National Institute of Standards and Technology)

• Platinum genomes from CEPH family (Illumina)

• For cancer COLO-829 is often used

Verification

Use a different technology or source material to test a selection of your variant calls

Benchmark your processes and verify your findings

Detect mutations Examine

Manual IGV review

Identify patterns and modify filtering strategies

•PCR and capillary sequencing

•PCR and deep MiSeq sequencing

•SOLiD sequencing

•mRNA sequencing

Mutation cataloguing aids understanding of cancer genetics

Our Projects

Ovarian

Pancreatic

Melanoma

International Cancer Genome Consortium projects

ICGC patient summary of mutations identified

Circos, Krzywinski et al 2009

Chromosomes

SNP array track that shows copy number gain in red and loss in green and regions of loss of heterozygosity

Structural variants in centre

Coding small mutations with amino acid change

ICGC data portal: https://dcc.icgc.org/



Cancer sample sequencing involves the parallel analysis of at least two samples for each patient

Inherited genome sample

Germline

Tumour genome sample

Tumour

This sample contains a mixed population of normal and tumour cells

Also subpopulations of different tumour cells

Data analysis

•Mutation detection
•Annotation
•Biological interpretation

Germline variants: seen in both samples

Somatic mutations: specific to tumour sample
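The germline versus somatic distinction can be expressed as set operations on the variants called in each sample. This is a deliberately simplified sketch with invented variants: real somatic callers compare read-level evidence in both samples rather than final call lists, because a variant can be missed in the normal sample through low coverage.

```python
# A variant is identified here by (chromosome, position, reference allele, alternate allele)
Variant = tuple[str, int, str, str]


def classify_variants(normal_calls: set[Variant], tumour_calls: set[Variant]) -> dict[str, set[Variant]]:
    """Split calls into germline (both samples), somatic (tumour only)
    and normal-only (worth checking for coverage or artefact problems)."""
    return {
        "germline": normal_calls & tumour_calls,
        "somatic": tumour_calls - normal_calls,
        "normal_only": normal_calls - tumour_calls,
    }


# Invented example calls
normal = {("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T")}
tumour = {("chr1", 1000, "A", "G"), ("chr3", 7500, "G", "A")}

for label, variants in classify_variants(normal, tumour).items():
    print(label, sorted(variants))
```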

Cancer genomics studies identify both germline and somatic changes

Aim: to identify the changes that only occur in the tumour cells

[Figure: reads aligned for Normal/Germline DNA and Tumour DNA, marking a germline SNV, somatic SNV, somatic deletion, somatic amplification and somatic translocation]

For tumour data low frequency variants may be clinically relevant

Ovarian cancer patient with a deleterious germline BRCA2 deletion

[Figure: reads over BRCA2 exon 9 and BRCA2 exon 10 in the normal control sample, Metastasis 1 and Metastasis 2, showing the 5 bp germline deletion and a high quality 3 bp somatic deletion]

Six deletion reversion mutations identified within BRCA2 from a single rapid autopsy case

[Figure: reads over BRCA2 exon 9 and BRCA2 exon 10 in the normal control sample, Metastasis 1 and Metastasis 2, showing the 5 bp germline deletion, low frequency reversion deletions, a high frequency reversion deletion and evidence of a 2 exon deletion]

Different reversion deletions could be identified in differing proportions at multiple metastasis sites

Events 1-3 and 9 are found in many abdominal deposits whereas 5, 7 and 14 are only identified in one

Patch et al Nature 2015

Large genomic structural variants need different detection strategies

Sample

Reference

Ovarian cancer genomes have high instability and are highly rearranged

Structural variants underlie copy number changes

Spectral Karyotype from HGS OvCa Cell line Ouellet et al 2008 BMC Cancer

Deletion Duplication/Insertion Translocation Low resolution

Alkan, Coe and Eichler 2011

There are 4 main methods for SV detection in WG sequencing (read pair, read depth, split read and assembly)

Some tools only use one detection method but there are multi-method tools now available

Tumour

Germline

Visualising structural variants

Submicroscopic homozygous deletion in a tumour sample

Robinson et al 2011

Chromosome 13: 1.3 kb somatic deletion including exon 17 of RB1 gene

Tumour

Germline

Discordantly mapped read pairs mark rearrangements

Detection tools identify clusters of read pairs with similar characteristics e.g. BreakDancer (Chen et al 2009)

[Figure: read pairs aligned to the reference that are too far apart, too close together or in the wrong orientation; example with >1.3kb insert size]

Typical aligned read-pair insert size distribution visualised by qProfiler

[Figure: DNA fragment size distribution, base pairs against log count, with a 300bp median for normally mapped reads]

Insert size estimation is key for detection with discordantly mapped read pairs

Insert size depends on DNA fragmentation step

[Figure: paired-end reads and the insert size of aligned pairs, ~300 bp]
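The idea of flagging discordant pairs from an estimated insert size distribution can be sketched as follows. This is a simplified illustration, not the algorithm of BreakDancer or qSV: the bounds are set at the median plus or minus a multiple of the median absolute deviation, the multiplier of 10 is an arbitrary assumption, and pairs are assumed to come from the same chromosome in the usual forward/reverse orientation.

```python
from statistics import median


def insert_size_bounds(insert_sizes: list[int], n_mads: float = 10) -> tuple[float, float]:
    """Estimate the expected insert size range from normally mapped pairs."""
    centre = median(insert_sizes)
    mad = median([abs(size - centre) for size in insert_sizes])
    return centre - n_mads * mad, centre + n_mads * mad


def classify_pair(insert_size: int, left_is_reverse: bool, right_is_reverse: bool,
                  lower: float, upper: float) -> str:
    """Classify one read pair; expects the leftmost read on the forward
    strand and the rightmost read on the reverse strand."""
    if left_is_reverse or not right_is_reverse:
        return "wrong orientation"
    if insert_size > upper:
        return "too far apart"
    if insert_size < lower:
        return "too close together"
    return "concordant"


# Invented insert sizes centred on the ~300 bp median mentioned above
lower, upper = insert_size_bounds([280, 290, 300, 310, 320])
for size, left_rev, right_rev in [(300, False, True), (1300, False, True),
                                  (120, False, True), (300, True, False)]:
    print(size, classify_pair(size, left_rev, right_rev, lower, upper))
```

A detection tool would then look for clusters of pairs with the same classification at the same location rather than acting on single pairs.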

Changes in coverage support rearrangements

Clear drop in coverage over the region in the tumour sample

Tumour

Germline

Tools are available that identify copy number variants from read depth partitioning, with GC content and mappability correction, plus allele frequency analysis

Genomic position

Titan (Ha et al 2014 Genome Res)

Changes in coverage can be interpreted as copy number and can mark rearrangement breakpoints

[Figure labels: Fewer reads mapped, More reads mapped, Deletion; panels: CN by Coverage, Allele frequency]

Clusters of soft clipping indicate rearrangement break points

Alignment software that performs soft clipping can reveal exact positions of the break points
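Soft-clip positions can be read straight from POS and the CIGAR string, and positions shared by several reads are candidate break points. The sketch below is a minimal illustration with invented reads; the function names and the minimum of 2 supporting reads are assumptions, and real tools such as CREST or qSV go on to realign or assemble the clipped sequence.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_sites(pos: int, cigar: str) -> list[int]:
    """Return 1-based reference positions of the aligned base next to each
    soft clip: POS for a leading clip, the last aligned base for a trailing clip."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar) if op != "H"]
    sites = []
    if ops and ops[0][1] == "S":
        sites.append(pos)
    if len(ops) > 1 and ops[-1][1] == "S":
        # M, D, N, = and X are the operations that consume reference bases
        ref_span = sum(length for length, op in ops if op in "MDN=X")
        sites.append(pos + ref_span - 1)
    return sites


def clustered_clip_sites(reads: list[tuple[int, str]], min_reads: int = 2) -> dict[int, int]:
    """Count soft-clip sites across (POS, CIGAR) pairs and keep those seen in
    at least min_reads reads."""
    counts = Counter(site for pos, cigar in reads for site in soft_clip_sites(pos, cigar))
    return {site: n for site, n in sorted(counts.items()) if n >= min_reads}


# Invented reads: two are clipped after reference position 1059
reads = [(1000, "60M40S"), (1010, "50M50S"), (1060, "30S70M"), (900, "100M")]
print(clustered_clip_sites(reads))
```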

Split reads and assembled contigs reveal microhomology

Further realignment of the clipped sequences reveals split reads

Reads with soft clipping and unmapped reads can be assembled into contigs that span breakpoints

Patterns of microhomology can be obtained from these data

CREST Wang et al 2012

qSV : Detecting Somatic Structural Variants

qSV detects 3 types of supporting evidence

Resolves all lines of evidence to identify breakpoints to base pair resolution

Felicity Newell

http://sourceforge.net/projects/adamajava/



Associating structural variants with proximal genes

How do the breakpoints and rearrangements affect the underlying genes?


SVs may promote tumour development

Oncogenes can be amplified by rearrangements resulting in gain of function

Chou et al 2013 Genome Med

Amplification of HER2 (ERBB2) in pancreatic adenocarcinoma

[Figure labels: Gene 1, Gene 2, Duplication of Gene 2]

Cancer molecular subtyping with cohort studies

Take a group of samples with the same disease and look for the same gene/pathway being altered - Molecular subtyping

X 100’s

Single patient Disease specific cohort

80% 15% 4% 1%

Cancer molecular subtyping with cohort studies

Molecular subtyping can be performed using, and by integrating, different data sources

Mutations

Gene expression

Methylation

Copy number

Structural rearrangements

450 pancreatic cancers

Bailey et al Nature 2016 Waddell et al Nature 2015

100 WGS pancreatic cancers

Mutational signatures

Pan-cancer molecular subtyping

The Cancer Genome Atlas Pan-Cancer analysis project

Nature Genetics 45, 1113–1120 (2013)

Leukaemia

Lung adenocarcinoma

Lung squamous

Kidney

Bladder

Endometrial

Glioblastoma

Head and neck

Breast

Ovarian

Colon

Rectum

Pan-Cancer analysis to identify common molecular features of tumours

International consortia make pan-cancer studies possible, extending the cohort approach

Pan-cancer studies are indicating that existing treatment options can be repurposed for other cancer types

X 100’s

X 1000’s

Single patient

Disease specific cohort

Multiple cohorts

Personalised treatment selection

Acknowledgements:

Genome informatics:

John Pearson

Conrad Leonard

Oliver Holmes

Qinying Xu

Scott Wood

Sean Grimmond

National Health and Medical Research Council

Australian Government

Anna deFazio

Catherine Kennedy

Yoke-Eng Chiew

Jillian Hung

Clinicians and patients

Medical genomics:

Nicola Waddell

Katia Nones

Felicity Newell

Stephen Kazakoff

Martha Zakrzewski

Venkateswar Addala

Andrew Biankin

David Chang

Peter Bailey

Jianmin Wu

Jeremy Humphris

Mark Pinese

Angela Chou

Mark Cowley

APGI collaborators http://www.pancreaticcancer.net.au/apgi/collaborators

Including: John Fawcett, O’Rourke, Andrew Barbour,

Henry Tang, Kelly Slater, Nik Zeps

Amber Johns

Anthony Gill

Scott Mead

Skye Simpson

Marc Jones

David Bowtell

Dariush Etemadmoghadam

Elizabeth Christie

Dale Garsed

Joshy George

Sian Fereday

Laura Galletta

Kathryn Alsop

Nadia Traficante

Thank you

Email:

[email protected]

www.qimrberghofer.edu.au
