Meredith Ashby, August 2015 · cgs.hku.hk/portal/files/GRC/Events/Seminars/2015...

FIND MEANING IN COMPLEXITY

© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.

Meredith Ashby, August 2015

The most comprehensive view of the human genome


Human Genome

The Human Genome contains:
• Over 3 billion DNA base pairs
• Organized into 23 chromosomes
• With 2 copies of each chromosome (one maternal, one paternal)
• Carrying 20,000 genes
• Each encoding an average of 3 proteins

Accessing variation in the human genome enables discovery research.

“Much of the missing heritability (the 'dark matter' of the genome) will probably turn up as the technology advances.” - Francis Collins

Nature 464, 674-675 (1 April 2010)

Source: NHGRI Fact Sheet

Human Genetics Research from Past to Future

Technology Waves after Initial Human Genome Sequencing (past to future):

• 1st: Genome-Wide Association Studies (GWAS). Finding disease genes by associating SNPs
• 2nd: Exome (Re-)Sequencing. Finding disease genes by identifying protein-coding SNPs
• 3rd: Genome (Re-)Sequencing. Finding disease genes by identifying SNPs genome-wide
• 4th: More Comprehensive Genome (Re-)Sequencing. Comprehensive profiling of disease genes (SNPs, SVs, haplotypes, splice variants)
• 5th: Whole-Genome de novo Sequencing. Comprehensive profiling of the entire genome (SNPs, SVs, haplotypes, splice variants, epigenetics)

Draft genomes, even with trio data, provide insufficient information to resolve the cause of Mendelian disorders in 40% of cases


“Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for Mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.”

“To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants.”

Eichler et al., Nature Reviews Genetics 11, 446-450 (June 2010) | doi:10.1038/nrg2809

Levy et al. (2007) PLoS Biology 5: e254

Structural Variation is the Most Predominant Form of Sequence Polymorphism in the Human Genome

“Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure.”


Whole vs. Hole Genome Sequencing


Short Reads and the Cost per Genome Dilemma: Quantity vs. Quality

Sequencing cost is down, but assembling a de novo human genome that meets the same scientific standard as the initial work does NOT follow Moore’s law.

Contig N50 by assembly (chart labels):
• HGP: ~100 kb
• NCBI-34: 29 Mb
• HuRef: 107 kb
• BGI YH: 7.4 kb
• KB1: 5.5 kb
• NA12878: 24 kb
• CHM1: 144 kb
• RP11: 127 kb

According to the NHGRI website, the definition of “sequencing a genome” changed in 2008, when resequencing became favored, trading quality for lower cost.

The 1000 Genomes Project started in 2008.

Why are High Quality Genomes Important?

“To get to a medical grade genome, or a genome that can be used for clinical diagnostic purposes, we need to have the most accurate and complete genome for each individual. We believe that the PacBio SMRT machines will help us reach this goal.”

For the Asian Genome Project, Macrogen and GMI investigators are seeking a more complete Asian reference genome to pursue detailed analyses of the populations in Asia.

"Our goal is to make a complete Asian reference genome for future medical practice," Macrogen's Seo said, noting that the team is pursuing a "medical grade" genome sequence that is highly accurate and can serve as a reference in both research and clinical settings.


Figure 2. Worldwide Frequency Distribution of APOBEC3B Deletion

PLoS Genet. 2007 Apr; 3(4): e63.

Using a population specific reference improves the mapping accuracy of short read data

PacBio SMRT Sequencing Has Unique Strengths

1. Contiguity

– Average read lengths up to 12 kb

– Some reads >40 kb

2. Uniformity

– Lack of GC content or sequence complexity bias

3. Accuracy

– Achieves >99.999% (QV50)

– Lack of systematic sequencing errors

4. Native DNA

– No DNA amplification

– Epigenome characterization
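The pairing of ">99.999%" with "QV50" follows the standard Phred convention, in which a quality value Q corresponds to an error probability of 10^(-Q/10). A minimal sketch of that conversion is shown here; the function names are illustrative and are not taken from any PacBio tool.

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert an accuracy (e.g. 0.99999) to a Phred-scaled quality value."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0 for a finite QV")
    return -10.0 * math.log10(error)


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value back to an accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    print(f"accuracy 0.99999 -> QV {accuracy_to_qv(0.99999):.1f}")
    print(f"QV 50 -> accuracy {qv_to_accuracy(50):.5f}")
```

On this scale, each additional 10 QV units corresponds to a tenfold reduction in the error rate.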

Latest P6-C4 Chemistry Read Length Performance


Data per SMRT Cell: 0.5 – 1 Gb

• 20 kb size-selected human library
• 4 hour movie
• P6-C4 chemistry

PCR-free Sample Preparation Workflow Means Less Bias

Ross et al. (2013) Characterizing and measuring bias in sequence data. Genome Biology, May 29;14(5):R51

“Pacific Biosciences coverage levels are the least biased”

SMRT® Sequencing Accuracy – Errors are Random

1st & 2nd gen (conceptual)

SMRT sequencing


Long, Unbiased, Highly Accurate Reads Resolve ‘Difficult-to-Sequence’ Regions

• Resolve long palindromes
• Identify structural variants
• Obtain accurate microsatellite lengths
• Span homopolymeric, low-complexity, and highly repetitive regions
• Delineate tandem repeats


Loomis et al. (2013) Sequencing the unsequenceable: Expanded CGG-repeat alleles of the fragile X gene. Genome Research, 23(1):121-8

Fragile X gene with >2 kb of repeat regions

PacBio reads span extreme CGG repeats and AT-rich regions

Human Genome De Novo Assemblies Comparison

Year  Technology             Assembler         Sample      Contig N50 (Mb)
2007  ABI 3730               Celera            HuRef       0.11
2009  Illumina GA            SOAP de novo      BGI YH      0.007
2010  454 GS Flx Titanium    Newbler           KB1         0.006
2010  Illumina GA            ALLPATHS-LG       NA12878     0.024
2013  454 GS, HiSeq, MiSeq   Newbler           RP11_0.7    0.13
2014  HiSeq, BAC clones      Reference-guided  CHM1        0.14
2014  PacBio RS II           FALCON            CHM1        4.38
2015  PacBio RS II           FALCON            CHM13       12.98
2015  PacBio RS II           FALCON            AK1         7.28
2015  PacBio RS II           FALCON            HuRef       10.38
2015  PacBio RS II           FALCON            PC-9*       3.58
2015  PacBio RS II           FALCON            SK-BR-3*    2.56

*cancer cell lines

Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table 3); CHM1 Illumina (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)
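Contig N50, the metric used in this comparison, is the contig length L such that contigs of length L or longer together contain at least half of the assembled bases. The sketch below shows one common way to compute it; the contig lengths in the example are made up for illustration and do not come from any of the assemblies listed.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the assembly size.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in base pairs.
    example = [8, 8, 4, 3, 3, 2, 2, 2]
    print("N50 =", contig_n50(example))
```

Because N50 is weighted by length, a few very long contigs raise it far more than many short ones, which is why fragmented short-read assemblies score in the kilobase range.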

Short Read Data Contains Coverage Gaps that Thwart Assembly

[Figure tracks: PacBio coverage, PacBio reads, HiSeq coverage, HiSeq reads, chromosome, genes]

PacBio RS II data at http://datasets.pacb.com/2014/Human54x/fast.html

Illumina HiSeq data at http://www.ncbi.nlm.nih.gov/sra/SRR642636

SHANK3 – Involved in Autism

HiSeq data miss 1st exon and upstream region entirely

[Figure tracks: PacBio coverage, PacBio reads, HiSeq coverage, HiSeq reads, chromosome, genes]

SHANK3 – Involved in Autism

85% GC-rich

CCCCGTCACAGCCCCCCAGACCCCCGCCCCGTGGCTCGGCCCCCGCCCTCCGCACACACCT

CCCGCCCCCACCCGGGACCCCGCAAGTAACCCCCCAGCACTGGCCCTGAGCCCTCCCGGCC

CCCGCCTCCGGCGCAGCCCCCTCGCCACCCCCGCTTCCCTCCCGTCTCAGGCCCCCTCCCCC

CGCCGCCCCCGCCCCCGGGGAAGGCAGGCGCCGAGCTGAGCCGGGGCCGATGCAGCTG

AGCCGCGCCGCCGCCGCCGCCGCCGCCGCCCCTGCGGAGCCCCCGGAGCCGCTGTCCCCC

GCGCCGGCCCCGGCCCCGGCCCCCCCCGGCCCCCTCCCGCGCAGCGCGGCCGACGGGGC

TCCGGCGGGGGGGAAGGGGGGGCCGGGGCGCCGCGCGCGGAGTCCCCGGGCGCTCCG

TTCCCCGGCGCGAGCGGCCCCGGCCCGGGCCCCGGCGCGGGGATGGACGGCCCCGGGG

CCAGCGCCGTGGTCGTGCGCGTCGGCATCCCGGACCTGCAGCAGACGGTGAGCCCCG
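A GC percentage such as the one quoted for this region is the share of G and C bases in the sequence. The following sketch computes it for a whole sequence and for sliding windows; the function names and window sizes are illustrative assumptions, not part of the original analysis.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C, ignoring whitespace and case."""
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    gc_count = sum(1 for base in cleaned if base in "GC")
    return gc_count / len(cleaned)


def gc_windows(sequence: str, window: int, step: int):
    """Return (start, GC fraction) for each full window along the sequence."""
    cleaned = "".join(sequence.split()).upper()
    return [
        (start, gc_fraction(cleaned[start:start + window]))
        for start in range(0, len(cleaned) - window + 1, step)
    ]


if __name__ == "__main__":
    # Short hypothetical sequence; paste a real region in its place.
    demo = "GGGGCCCCATAT"
    print(f"overall GC: {gc_fraction(demo):.2f}")
    print(gc_windows(demo, window=4, step=4))
```

Windowed values make it easier to see where within a locus the GC content peaks, which is where amplification-based short-read coverage tends to drop.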

[Figure tracks: PacBio coverage, PacBio reads, HiSeq coverage, HiSeq reads, chromosome, genes]


Uncovering Structural Variation in Gold Standard Genomes

Types of Structural Variation (SV) in a Human Genome

Hurles et al. (2008) Trends in Genetics 24: 238-245

Behavioral Diseases Associated with Structural Variation

Girirajan & Eichler (2010) Human Molecular Genetics 19: R176-187

Pang et al. (2014) G3 (Bethesda) 4: 63-65

Structural Variation Detection by 2nd Gen Technologies

“Generating longer reads … can mitigate these shortcomings.”

“We observed that current high-throughput sequencing approaches only detected a fraction of the full size-spectrum of insertions, deletions, and copy number variants compared with a previously published, Sanger-sequenced human genome.”

Human Genetic Variation Sequencing with PacBio Long Reads

[Figure: variant types plotted against size of variant; axis ticks 1, 10, 100, 1 kb, 10 kb, 100 kb, 1 Mb, 10 Mb, 100 Mb]

Variant types: SNPs; small indels; STRs & VNTRs; large insertions, deletions, CNV; mobile elements; complex variants; phased SVs

One PacBio read spans most variants. Assembled PacBio reads span euchromatic genome variation.

Other figure labels: indels; repeat expansions; structural variants; phased alleles; complex regions; L1, Alu, SVA; copy number variation; inversions; phased SNVs; very large SVs; phased haplotypes; large structural rearrangement

Human Genome Sequencing with the PacBio RS II

Chaisson et al. (2014) Nature doi:10.1038/nature13907

Updated CHM1 Assembly With PacBio’s Newest Chemistry


Hydatidiform Mole (CHM1) Assembly Contig Alignments to Selected Chromosomes (Chr. 2, Chr. 6, Chr. 20, Chr. 21)

From ~50x CHM1 data with the latest P6 chemistry, we get an assembly of:

Total = 2.9 Gbp

N50 = 27.9 Mbp

Max = 109.3 Mb

Two contigs ~ 109Mb

Assembly contigs ~ Chromosome arms

Assembler code available at https://github.com/PacificBiosciences/FALCON

PacBio CHM1 Data vs. GRCh37 & 1000 Genomes Project

• Resolved 26,079 euchromatic structural variants at the base-pair level
• ~22,000 (85%) of these are novel
• 6,796 of the events map within 3,418 genes
• Validation rate of 97%, only a fraction of which are detectable with short reads
• Closes/extends 55% of the remaining gaps in the human reference genome


PacBio CHM1 Data vs. GRCh37 & 1000 Genomes Project

• The percentage repeat composition (x axis) of 1-kb sequences flanking insertion sites for Alu, L1 and SVA mobile element insertions.

• Insertion calls from the 1000 Genomes Project (pink) compared to calls from CHM1 using PacBio reads (blue) show short read data misses repeat-rich insertion events.


SV Insertion Sequence Data Missing from GRCh37 Reference


A net insertional bias of 3.9 megabases (Mb) of additional sequence is revealed in the PacBio CHM1 assembly when compared with the human reference.


Short Tandem Repeats (STR’s) Easier to Detect with Long Reads

• 6,007 STR’s

• 2,285 of these expanded STRs occur within genes

• 2,760 Tandem Repeats

Neurologic Diseases Associated with STRs Occurring within Genes

http://www.dialogues-cns.com/publication/a-glossary-of-relevant-genetic-terms/


Revealing the Complexity of Cancer Genomes


SK-BR3 Cancer Genome Sequencing, Dick McCombie, CSHL

Davidson et al, 2000

Cancer genomes can have tremendous structural variation, with chromosomal fragmentation, rearrangement, and large duplication events.

Most commonly used Her2-amplified breast cancer cell line

W. Richard McCombie, CSHL http://schatzlab.cshl.edu/presentations/2015/2015.02.27.AGBT%20PacBio%20SKBR3.pdf

Genome-wide Alignment Coverage of PacBio Reads Reveals the Highly Aneuploid Nature of this Cancer Genome

• Genome-wide coverage averages around 54X
• Coverage per chromosome varies greatly, as expected in this highly aneuploid genome

Her2


Total Assembly: 2.64 Gbp
Contig N50: 2.56 Mbp
Max Contig: 23.5 Mbp

Complex Structural Variant Discovery is Facilitated By Long Reads

[Figure: read alignments between Chromosome A and Chromosome B]

• Alignment-based split read analysis efficiently captures most structural variation events

• Follow up with the Parliament workflow is planned


A High Quality Assembly Enables the Direct Detection of Interchromosomal Translocations

• 94 high-confidence inter-chromosomal translocations were identified in the SKBR3 genome assembly
• 1,200 intra-chromosomal translocations were found

[Figure labels: Her2, PacBio, chr17]

A Gold Standard Assembly Enables Cancer Lesion Reconstruction

By comparing the proportion of reads that are spanning or split at breakpoints we can begin to infer the history of the genetic lesions.

1. Healthy diploid genome
2. Original translocation into chromosome 8
3. Duplication, inversion, and inverted duplication within chromosome 8
4. Final duplication from within chromosome 8
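The comparison of spanning and split reads described above can be sketched as a simple proportion per breakpoint. The breakpoint names and read counts below are hypothetical, and the slide does not specify how the proportions are translated into an event order, so this only computes and ranks the fractions.

```python
def split_fraction(spanning_reads: int, split_reads: int) -> float:
    """Share of reads at a breakpoint that are split rather than spanning."""
    total = spanning_reads + split_reads
    if total == 0:
        raise ValueError("no reads cover this breakpoint")
    return split_reads / total


def rank_breakpoints(read_counts):
    """Order breakpoint names from highest to lowest split-read fraction.

    read_counts maps a breakpoint name to (spanning_reads, split_reads).
    """
    return sorted(
        read_counts,
        key=lambda name: split_fraction(*read_counts[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    counts = {"bp_a": (30, 10), "bp_b": (5, 15), "bp_c": (20, 20)}
    for name in rank_breakpoints(counts):
        print(name, f"{split_fraction(*counts[name]):.2f}")
```

Interpreting the ranking as a lesion history requires additional assumptions about copy number and clonality that are outside this sketch.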


Gold Standard Assemblies Reveal the Locations of Oncogene Duplications and Gene Fusion Events

Oncogene amplifications
ErbB2 (Her2/neu)   ≈20X
MYC                ≈27X
MET                ≈8X

Known gene fusions              Confirmed by PacBio reads?
TATDN1            GSDMB         Yes
RARA              PKIA          Yes
ANKHD1            PCDH1         Yes
CCDC85C           SETD3         Yes
SUMF1             LRRFIP2       Yes
WDR67 (TBC1D31)   ZNF704        Yes
DHX35             ITCH          Yes
NFS1              PREX1         Yes *read-through transcription
CYTH1             EIF3H         Yes *nested inside 2 translocations


Her2+ Breast Cancer Reference Genome is Available for Download

http://schatzlab.cshl.edu/data/skbr3/

Available now under the Toronto Agreement:
• Fastq & BAM files of aligned reads
• Interactive Coverage Analysis with BAM.IOBIO

Available soon:
• Whole-genome assembly and methylation analysis
• Comparison to single cell analysis of >100 individual cells
• Iso-Seq data


Comprehensive View of a Cancer Genome & Epigenome

• PC-9 non-small cell lung cancer cell line

• Apply PacBio sequencing for:
– De novo long-read assembly of a cancer genome
– Characterization of gene fusions
– Genome-wide methylome characterization
• Sequenced both drug-sensitive and drug-resistant samples

[Figure labels: Drug-sensitive; Sustained treatment with erlotinib; Drug-resistant; Stably drug-resistant; Mutations; Drug-sensitive; MT inhibitor]

Collaboration with M. Classon, V. Janakiraman, E. Stawiski, S. Durinck, S. Seshagiri (Genentech) & Y. Suzuki, S. Morishita (U of Tokyo)


PC-9 Cancer Genome De Novo Assembly

                    PacBio      Short-read [1]   Improvement
# of contigs        12,359      424,605          34x
Contig N50          1.044 Mb    0.018 Mb         58x
Max contig length   26.6 Mb     0.28 Mb          95x
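The improvement column is a ratio of the two assemblies' values, taken so that a larger number always means the PacBio assembly is better. A short sketch of that arithmetic using the values from the table follows; the helper name is illustrative.

```python
def fold_improvement(pacbio: float, short_read: float, lower_is_better: bool = False) -> float:
    """Ratio between the two assemblies, oriented so that >1 favors PacBio."""
    if lower_is_better:
        return short_read / pacbio
    return pacbio / short_read


# (PacBio value, short-read value, whether a lower value is better)
rows = {
    "# of contigs": (12_359, 424_605, True),
    "Contig N50 (Mb)": (1.044, 0.018, False),
    "Max contig length (Mb)": (26.6, 0.28, False),
}

if __name__ == "__main__":
    for name, (pacbio, short_read, lower) in rows.items():
        print(f"{name}: {fold_improvement(pacbio, short_read, lower):.0f}x")
```

Rounded to whole numbers, these ratios reproduce the 34x, 58x, and 95x figures in the table.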

PacBio assembly performed by J. Chin. Collaboration with M. Classon, V. Janakiraman, E. Stawiski, S. Durinck, S. Seshagiri (Genentech) & Y. Suzuki, S. Morishita (U of Tokyo)

[1] Nagarajan et al. (2012) Whole-genome reconstruction and mutational signatures in gastric cancer. Genome Biology 13: R115.

Mapping against Human Reference Sequence

• Observing unusual sequencing coverage across chromosomes in both samples (CHM1 as control):

• Example: chromosome 3:

PC-9: CHM1:


Note: the gap in the center is the centromere (no reference sequence available)

Long Reads Reveal Gene Fusions

• Example: CPSF3-ASAP2

CPSF3


Gene Fusion Mapping Result: CPSF3-ASAP2

• Exact break points identified:

[Figure: the same read, split-mapped to ASAP2 and CPSF3]


Break point 1: Chr2: 9,391,888

Break point 2: Chr2: 9,438,556
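One way to picture how a split-mapped read yields exact break points: order the read's alignment segments by their position in the read, then report a junction wherever consecutive segments land on non-adjacent reference positions. The sketch below assumes forward-strand segments only and uses hypothetical read coordinates and segment lengths; only the two break point positions are taken from the slide. Production split-read callers also handle strand, clipping, and mapping quality.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned piece of a read (forward strand assumed)."""
    read_start: int
    read_end: int
    chrom: str
    ref_start: int
    ref_end: int


def find_junctions(segments, max_gap=1000):
    """Return ((chrom, pos), (chrom, pos)) pairs where the reference jumps."""
    ordered = sorted(segments, key=lambda seg: seg.read_start)
    junctions = []
    for left, right in zip(ordered, ordered[1:]):
        different_chrom = left.chrom != right.chrom
        jump = abs(right.ref_start - left.ref_end)
        if different_chrom or jump > max_gap:
            junctions.append(((left.chrom, left.ref_end), (right.chrom, right.ref_start)))
    return junctions


if __name__ == "__main__":
    # Hypothetical 10 kb read split into two 5 kb segments.
    read_segments = [
        Segment(0, 5_000, "chr2", 9_386_888, 9_391_888),
        Segment(5_000, 10_000, "chr2", 9_438_556, 9_443_556),
    ]
    print(find_junctions(read_segments))
```

The `max_gap` threshold is an illustrative parameter that separates ordinary small indels from rearrangement junctions.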

Detection of DNA Base Modifications Using Kinetics

Example: N6-methyladenine

Flusberg et al. (2010) Nature Methods 7: 461-465

[Figure: template strand containing N6-methyladenine (mA) compared with an unmodified template strand]

• SMRT Sequencing uses kinetic information from each nucleotide addition to call bases

• This same information can be used to distinguish modified and native bases by comparing results of SMRT Sequencing to an in silico kinetic reference for incorporation dynamics without modifications.

Detection of DNA Base Modifications by SMRT Sequencing

Flusberg et al. (2010) Nature Methods 7: 461-465

SMRT Portal v1.3.3+ can recognize and annotate multi-site modified-base signatures.

Modified bases shown: 5-mC, 4-mC, 6-mA

Calculation of IPD ratios across the reference gives information about base modification at every position.
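An IPD (interpulse duration) ratio compares the observed polymerase kinetics at a position with the kinetics expected for unmodified DNA. The sketch below is a simplified illustration: it divides the mean observed IPD at each position by a control value and flags positions above a cutoff. The function names, the input layout, and the 2.0 cutoff are assumptions for illustration; real pipelines use statistical tests rather than a fixed threshold.

```python
from statistics import mean


def ipd_ratios(observed_ipds, control_ipds):
    """Per-position ratio of mean observed IPD to the control IPD.

    observed_ipds: list of lists, one list of IPD measurements per position.
    control_ipds: list of expected IPDs for unmodified DNA, one per position.
    Positions with no observations or a non-positive control yield None.
    """
    if len(observed_ipds) != len(control_ipds):
        raise ValueError("observed and control must cover the same positions")
    ratios = []
    for observed, expected in zip(observed_ipds, control_ipds):
        if not observed or expected <= 0:
            ratios.append(None)
        else:
            ratios.append(mean(observed) / expected)
    return ratios


def flag_positions(ratios, threshold=2.0):
    """Indices whose IPD ratio meets or exceeds the (hypothetical) threshold."""
    return [i for i, r in enumerate(ratios) if r is not None and r >= threshold]


if __name__ == "__main__":
    # Hypothetical IPD values in arbitrary time units.
    observed = [[1.0, 1.0], [4.0, 6.0], []]
    control = [1.0, 2.0, 1.0]
    ratios = ipd_ratios(observed, control)
    print(ratios, flag_positions(ratios))
```

More reads per position tighten the mean IPD estimate, which is why weaker signals need deeper coverage.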

Detecting Hypo-methylated CpG Islands in Eukaryotic DNA

• 16-fold per-strand coverage maximizes the accuracy of the results
• Algorithm is freely available at https://github.com/hacone/AgIn


• The signal strength for 5mC is weaker than for other methylated bases and so requires higher coverage to identify. However, in eukaryotes methylation is a regional phenomenon.

• Prof. Shinichi Morishita developed a method to differentiate hypo- and hyper-methylated regions by integrating the signals across CpG islands

http://www.ashg.org/2014meeting/abstracts/fulltext/f140120432.htm

Epigenome Characterization

• Global methylation status:
• More hypermethylated CpG islands in drug-resistant sample

[Figure: % hypermethylated CpG islands (y axis, 60 to 85) by chromosome (x axis, 1 to 22), drug-sensitive vs. drug-resistant]


• Methylation status of CpG islands (https://github.com/hacone/AgIn)

• Chr4: ANKRD17 (breast cancer)




• Chr4: WHSC1 (myelomas)




• Chr4: FGFR3 (fibroblast growth factor receptor 3)




• Chr4: RASSF6 (tumor suppressor gene)


PacBio Long-read Sequencing Allows for Detailed, De Novo Characterization of Cancer Genome & Epigenome


Take Structural Variant Discovery to ‘11’ with Low PacBio Coverage


Hybrid Structural Variant Calling Using Multiple Data Sources

Data sources include whole-genome PacBio Long-Read sequencing at 10x coverage (10,000 bp read length)

English et al (2015). Assessing structural variation in a personal genome—towards a human reference diploid genome BMC Genomics

Structural Variant Discovery with 10X PacBio Coverage

PBHoney alone, with only 10x PacBio coverage, identifies 4,268 SVs supported by hybrid assembly, representing events “invisible” to PE data.

English et al (2015). Assessing structural variation in a personal genome—towards a human reference diploid genome BMC Genomics

“Applying multiple Parliament workflows, we demonstrate that while method integration is optimal for SV detection in Illumina paired-end data, the addition of long-read data can more than triple the number of SVs detectable in a personal genome.”

SV Calling: Illumina-only method comparison

Limitations of Illumina-only methods for SV Calling:
• High False Discovery Rate (FDR): ranges from 3% to >80%
• Low Sensitivity: ranges from <8% to a maximum of 57%
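False discovery rate and sensitivity are both derived from counts of true positives, false positives, and false negatives against a truth set. The sketch below shows the standard definitions; the counts in the example are hypothetical and are not taken from English et al.

```python
def false_discovery_rate(true_positives: int, false_positives: int) -> float:
    """Fraction of reported calls that are wrong: FP / (TP + FP)."""
    calls = true_positives + false_positives
    if calls == 0:
        raise ValueError("no calls were made")
    return false_positives / calls


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of true variants that were found: TP / (TP + FN)."""
    truth = true_positives + false_negatives
    if truth == 0:
        raise ValueError("truth set is empty")
    return true_positives / truth


if __name__ == "__main__":
    # Hypothetical caller results against a truth set of 100 SVs.
    tp, fp, fn = 57, 20, 43
    print(f"FDR: {false_discovery_rate(tp, fp):.1%}")
    print(f"Sensitivity: {sensitivity(tp, fn):.1%}")
```

The two measures trade off against each other: loosening a caller's filters usually raises sensitivity at the cost of a higher FDR.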

English et al (2015) BMC Genomics

“Despite these benefits of a multi-algorithm approach, Illumina-only discovery still only recovers approximately half of the 9,777 SVs identified by multi-source Parliament”

PacBio Provides the Most Comprehensive View of Genetic Variation in a Human Genome

“We now have access to a whole new realm of genetic variation that was opaque to us before.”

“Knowing all the variation is going to be a game changer.”

- Professor Evan Eichler, University of Washington

Source: http://www.sciencenewsline.com/articles/2014111017120055.html

For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
