FIND MEANING IN COMPLEXITY
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
Meredith Ashby, August 2015
The most comprehensive view of the human genome
Human Genome
The Human Genome contains:
• Over 3 billion DNA base pairs
• Organized into 23 chromosomes
• With 2 copies of each chromosome (one maternal, one paternal)
• Carrying 20,000 genes
• Each encoding an average of 3 proteins
Accessing variation in the human genome enables discovery research.
“Much of the missing heritability (the 'dark matter' of the genome) will probably turn up as the technology advances.” - Francis Collins
Nature 464, 674-675 (1 April 2010)
Source: NHGRI Fact Sheet
Human Genetics Research from Past to Future
Technology Waves after Initial Human Genome Sequencing (Past to Future):
1st: Genome-Wide Association Studies (GWAS). Finding disease genes by associating SNPs
2nd: Exome (Re-)Sequencing. Finding disease genes by identifying protein-coding SNPs
3rd: Genome (Re-)Sequencing. Finding disease genes by identifying SNPs genome-wide
4th: More Comprehensive Genome (Re-)Sequencing. Comprehensive profiling of disease genes (SNPs, SVs, haplotypes, splice variants)
5th: Whole-Genome de novo Sequencing. Comprehensive profiling of the entire genome (SNPs, SVs, haplotypes, splice variants, epigenetics)
Draft genomes, even with trio data, provide insufficient information to resolve the cause of Mendelian disorders in 40% of cases
“Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for Mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.”
“To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants.”
Eichler et al., Nature Reviews Genetics 11, 446-450 (June 2010) | doi:10.1038/nrg2809
Levy et al. (2007) PLoS Biology 5: e254
Structural Variation is the Most Predominant Form of Sequence Polymorphism in the Human Genome
“Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure.”
Whole vs. Hole Genome Sequencing
Short Reads and the Cost per Genome Dilemma: Quantity vs. Quality
Sequencing cost is down, but assembling a de novo human genome that meets the same scientific standard as the initial work does NOT follow Moore’s law.
Contig N50 by assembly:
HGP: ~100 kb
NCBI-34: 29 Mb
HuRef: 107 kb
BGI YH: 7.4 kb
KB1: 5.5 kb
NA12878: 24 kb
CHM1: 144 kb
RP11: 127 kb
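Contig N50 is the metric used throughout these slides to compare assemblies: it is the length of the contig at which the cumulative length of the contigs, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative assumptions, not data from the slides.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly (total 290 bp), not taken from the slides.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(contig_n50(toy_contigs))  # prints 70
```

Because N50 weights long contigs heavily, a small number of very long contigs can raise it sharply, which is why it is a convenient single-number summary of contiguity.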
According to the NHGRI website, the definition of “sequencing a genome” changed in 2008, when resequencing became favored, trading quality for lower cost.
The 1000 Genomes Project starts in 2008.
Why are High Quality Genomes Important?
“To get to a medical grade genome, or a genome that can be used for clinical diagnostic purposes, we need to have the most accurate and complete genome for each individual. We believe that the PacBio SMRT machines will help us reach this goal.”
For the Asian Genome Project, Macrogen and GMI investigators are seeking a more complete Asian reference genome to pursue detailed analyses of the populations in Asia.
"Our goal is to make a complete Asian reference genome for future medical practice," Macrogen's Seo said, noting that the team is pursuing a "medical grade" genome sequence that is highly accurate and can serve as a reference in both research and clinical settings.
Figure 2. Worldwide Frequency Distribution of APOBEC3B Deletion
PLoS Genet. 2007 Apr; 3(4): e63.
Using a population specific reference improves the mapping accuracy of short read data
PacBio SMRT Sequencing Has Unique Strengths
1. Contiguity
– Average read lengths up to 12 kb
– Some reads >40 kb
2. Uniformity
– Lack of GC content or sequence complexity bias
3. Accuracy
– Achieves >99.999% (QV50)
– Lack of systematic sequencing errors
4. Native DNA
– No DNA amplification
– Epigenome characterization
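The accuracy bullet pairs a percentage with a quality value (QV). On the standard Phred scale, QV = -10 · log10(error probability), so QV50 corresponds to one error in 100,000 bases, or 99.999% accuracy. A small sketch of the conversion, with function names chosen here for illustration:

```python
import math


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to a fractional accuracy."""
    return 1.0 - 10 ** (-qv / 10)


def accuracy_to_qv(accuracy):
    """Convert a fractional accuracy (0 <= accuracy < 1) to a Phred-scaled QV."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1.0 - accuracy)


print(f"QV50 -> {qv_to_accuracy(50):.5%} accuracy")
print(f"99.9% accuracy -> QV{accuracy_to_qv(0.999):.0f}")
```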
Latest P6-C4 Chemistry Read Length Performance
Data per SMRT Cell: 0.5 – 1 Gb
20 kb size-selected human library, 4 hour movie, P6-C4 chemistry
PCR-free Sample Preparation Workflow Means Less Bias
Ross et al. (2013) Characterizing and measuring bias in sequence data. Genome Biology, May 29;14(5):R51
“Pacific Biosciences coverage levels are the least biased”
SMRT® Sequencing Accuracy – Errors are Random
1st & 2nd gen (conceptual)
SMRT sequencing
Long, Unbiased, Highly Accurate Reads Resolve ‘Difficult-to-Sequence’ Regions
• Resolve long palindromes
• Identify structural variants
• Obtain accurate microsatellite lengths
• Span homopolymeric, low-complexity, and highly repetitive regions
• Delineate tandem repeats
Loomis et al. (2013) Sequencing the unsequenceable: Expanded CGG-repeat alleles of the fragile X gene. Genome Research, 23(1):121-8
Fragile X gene with >2 kb of repeat regions
PacBio reads span extreme CGG repeats and AT-rich regions
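When a single read spans a repeat expansion, the number of repeat units can be counted directly from that read. The sketch below finds the longest uninterrupted run of a repeat unit in a sequence. It is only an illustration: the function name and the toy read are assumptions, and real reads would also need tolerance for sequencing errors, which this exact-match version does not have.

```python
import re


def longest_tandem_run(sequence, unit="CGG"):
    """Return (start, copies) for the longest uninterrupted run of `unit`.

    Returns (-1, 0) when the unit does not occur at all.
    """
    pattern = re.compile("(?:%s)+" % re.escape(unit))
    best_start, best_copies = -1, 0
    for match in pattern.finditer(sequence.upper()):
        copies = (match.end() - match.start()) // len(unit)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies


# Hypothetical toy read: 5 CGG copies, an interruption, then 3 more copies.
toy_read = "ATGC" + "CGG" * 5 + "AGG" + "CGG" * 3 + "TTA"
print(longest_tandem_run(toy_read))  # prints (4, 5)
```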
Human Genome De Novo Assemblies Comparison

Year | Technology | Assembler | Sample | Contig N50 (Mb)
2007 | ABI 3730 | Celera | HuRef | 0.11
2009 | Illumina GA | SOAPdenovo | BGI YH | 0.007
2010 | 454 GS FLX Titanium | Newbler | KB1 | 0.006
2010 | Illumina GA | ALLPATHS-LG | NA12878 | 0.024
2013 | 454 GS, HiSeq, MiSeq | Newbler | RP11_0.7 | 0.13
2014 | HiSeq, BAC clones | Reference-guided | CHM1 | 0.14
2014 | PacBio RS II | FALCON | CHM1 | 4.38
2015 | PacBio RS II | FALCON | CHM13 | 12.98
2015 | PacBio RS II | FALCON | AK1 | 7.28
2015 | PacBio RS II | FALCON | HuRef | 10.38
2015 | PacBio RS II | FALCON | PC-9* | 3.58
2015 | PacBio RS II | FALCON | SK-BR-3* | 2.56

*cancer cell lines
Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table 3); CHM1 Illumina (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)
Short Read Data Contains Coverage Gaps that Thwart Assembly
[Figure tracks: PacBio coverage, PacBio reads, HiSeq coverage, HiSeq reads, chromosome, genes]
PacBio RS II data at http://datasets.pacb.com/2014/Human54x/fast.html
Illumina HiSeq data at http://www.ncbi.nlm.nih.gov/sra/SRR642636
SHANK3 – Involved in Autism
HiSeq data miss 1st exon and upstream region entirely
[Figure tracks: PacBio coverage, PacBio reads, HiSeq coverage, HiSeq reads, chromosome, genes]
SHANK3 – Involved in Autism
85% GC-rich
CCCCGTCACAGCCCCCCAGACCCCCGCCCCGTGGCTCGGCCCCCGCCCTCCGCACACACCT
CCCGCCCCCACCCGGGACCCCGCAAGTAACCCCCCAGCACTGGCCCTGAGCCCTCCCGGCC
CCCGCCTCCGGCGCAGCCCCCTCGCCACCCCCGCTTCCCTCCCGTCTCAGGCCCCCTCCCCC
CGCCGCCCCCGCCCCCGGGGAAGGCAGGCGCCGAGCTGAGCCGGGGCCGATGCAGCTG
AGCCGCGCCGCCGCCGCCGCCGCCGCCGCCCCTGCGGAGCCCCCGGAGCCGCTGTCCCCC
GCGCCGGCCCCGGCCCCGGCCCCCCCCGGCCCCCTCCCGCGCAGCGCGGCCGACGGGGC
TCCGGCGGGGGGGAAGGGGGGGCCGGGGCGCCGCGCGCGGAGTCCCCGGGCGCTCCG
TTCCCCGGCGCGAGCGGCCCCGGCCCGGGCCCCGGCGCGGGGATGGACGGCCCCGGGG
CCAGCGCCGTGGTCGTGCGCGTCGGCATCCCGGACCTGCAGCAGACGGTGAGCCCCG
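The "85% GC-rich" label is a statement about base composition, which can be checked by counting G and C bases. A minimal sketch follows; the function name is an assumption, and the input is only the first line of the sequence shown above, so its result is not expected to reproduce the slide's figure for the whole region.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C among the A/C/G/T bases of `sequence`."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc_count = sum(1 for base in bases if base in "GC")
    return gc_count / len(bases)


# First line of the SHANK3 excerpt shown on the slide.
first_line = "CCCCGTCACAGCCCCCCAGACCCCCGCCCCGTGGCTCGGCCCCCGCCCTCCGCACACACCT"
print(f"GC fraction of first line: {gc_fraction(first_line):.2f}")
```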
[Figure tracks: genes, PacBio coverage, PacBio reads, HiSeq coverage, HiSeq reads, chromosome]
Uncovering Structural Variation in Gold Standard Genomes
Types of Structural Variation (SV) in a Human Genome
Hurles et al. (2008) Trends in Genetics 24: 238-245
Behavioral Diseases Associated with Structural Variation
Girirajan & Eichler (2010) Human Molecular Genetics 19: R176-187
Pang et al. (2014) G3 (Bethesda) 4: 63-65
Structural Variation Detection by 2nd Gen Technologies
“Generating longer reads … can mitigate these shortcomings.”
“We observed that current high-throughput sequencing approaches only detected a fraction of the full size-spectrum of insertions, deletions, and copy number variants compared with a previously published, Sanger-sequenced human genome.”
Human Genetic Variation Sequencing with PacBio Long Reads
[Figure: variant type plotted against size of variant, on an axis from 1 bp to 100 Mb]
One PacBio Read Spans Most Variants
Assembled PacBio Reads Span Euchromatic Genome Variation
Variant types shown: SNPs; small indels; indels; STRs & VNTRs; repeat expansions; large insertions, deletions, CNV; copy number variation; inversions; mobile elements (L1, Alu, SVA); complex variants; complex regions; structural variants; phased SVs; phased SNVs; phased alleles; phased haplotypes; very large SVs; large structural rearrangement
Human Genome Sequencing with the PacBio RS II
Chaisson et al. (2014) Nature doi:10.1038/nature13907
Updated CHM1 Assembly With PacBio’s Newest Chemistry
[Figure: Hydatidiform Mole (CHM1) Assembly Contig Alignments to Selected Chromosomes (Chr. 2, Chr. 6, Chr. 20, Chr. 21)]
From ~50x CHM1 data with the latest P6 chemistry, we get an assembly of:
Total = 2.9 Gbp
N50 = 27.9 Mbp
Max = 109.3 Mbp
Two contigs ~ 109Mb
Assembly contigs ~ Chromosome arms
Assembler code available at https://github.com/PacificBiosciences/FALCON
PacBio CHM1 Data vs. GRCh37 & 1000 Genomes Project
• Resolved 26,079 euchromatic structural variants at the base-pair level
• ~22,000 (85%) of these are novel
• 6,796 of the events map within 3,418 genes
• Validation rate of 97%, only a fraction of which are detectable with short reads
• Closes/extends 55% of the remaining gaps in human reference genome
Chaisson et al. (2014) Nature doi:10.1038/nature13907
PacBio CHM1 Data vs. GRCh37 & 1000 Genomes Project
• The percentage repeat composition (x axis) of 1-kb sequences flanking insertion sites for Alu, L1 and SVA mobile element insertions.
• Insertion calls from the 1000 Genomes Project (pink) compared to calls from CHM1 using PacBio reads (blue) show short read data misses repeat-rich insertion events.
Chaisson et al. (2014) Nature doi:10.1038/nature13907
SV Insertion Sequence Data Missing from GRCh37 Reference
An additional net insertional bias of 3.9 megabases (Mb) of sequence is revealed in the PacBio CHM1 data when compared to the human reference.
Chaisson et al. (2014) Nature doi:10.1038/nature13907
Short Tandem Repeats (STRs) Easier to Detect with Long Reads
• 6,007 STRs
• 2,285 of these expanded STRs occur within genes
• 2,760 Tandem Repeats
Neurologic Diseases Associated with STRs Occurring within Genes
http://www.dialogues-cns.com/publication/a-glossary-of-relevant-genetic-terms/
Revealing the Complexity of Cancer Genomes
SK-BR3 Cancer Genome Sequencing, Dick McCombie, CSHL
Davidson et al, 2000
Cancer genomes can have tremendous structural variation, with chromosomal fragmentation, rearrangement, and large duplication events.
Most commonly used Her2-amplified breast cancer cell line
W. Richard McCombie, CSHL, http://schatzlab.cshl.edu/presentations/2015/2015.02.27.AGBT%20PacBio%20SKBR3.pdf
Genome-wide Alignment Coverage of PacBio Reads Reveals the Highly Aneuploid Nature of this Cancer Genome
• Genome-wide coverage averages around 54X
• Coverage per chromosome varies greatly, as expected in this highly aneuploid genome
Her2
W. Richard McCombie, CSHL, http://schatzlab.cshl.edu/presentations/2015/2015.02.27.AGBT%20PacBio%20SKBR3.pdf
Total Assembly: 2.64 Gbp
Contig N50: 2.56 Mbp
Max Contig: 23.5 Mbp
Complex Structural Variant Discovery is Facilitated By Long Reads
[Figure labels: Chromosome A, Chromosome B]
• Alignment-based split read analysis efficiently captures most structural variation events
• Follow up with the Parliament workflow is planned
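Alignment-based split-read analysis can be pictured as follows: a read whose two parts align to different places marks a candidate junction, and junctions supported by several reads are reported. The sketch below is a simplified illustration under stated assumptions, not the workflow used in the presentation. The `Segment` type, the `call_junctions` function, the fixed-size binning of breakpoints, and the toy coordinates are all hypothetical.

```python
from collections import Counter, namedtuple

# One aligned piece of a split read: chromosome and breakpoint coordinate.
Segment = namedtuple("Segment", "chrom breakpoint")


def call_junctions(split_reads, bin_size=100, min_support=2):
    """Group split reads by binned breakpoint pair and report supported junctions.

    split_reads: iterable of (Segment, Segment), one pair per split-mapped read.
    Returns tuples (chrom_a, pos_a, chrom_b, pos_b, kind, supporting_reads).
    """
    support = Counter()
    for seg_a, seg_b in split_reads:
        # Order the two sides so A->B and B->A reads count as one junction.
        left, right = sorted([seg_a, seg_b])
        key = (left.chrom, left.breakpoint // bin_size,
               right.chrom, right.breakpoint // bin_size)
        support[key] += 1

    calls = []
    for (chrom_a, bin_a, chrom_b, bin_b), count in sorted(support.items()):
        if count >= min_support:
            kind = ("inter-chromosomal" if chrom_a != chrom_b
                    else "intra-chromosomal")
            calls.append((chrom_a, bin_a * bin_size,
                          chrom_b, bin_b * bin_size, kind, count))
    return calls


# Made-up coordinates, for illustration only.
toy_reads = [
    (Segment("chr17", 1050), Segment("chr8", 2010)),
    (Segment("chr8", 2040), Segment("chr17", 1075)),
    (Segment("chr8", 5000), Segment("chr8", 9000)),
]
print(call_junctions(toy_reads))
```

Fixed bins are the simplest way to tolerate small differences in breakpoint position between reads, but two reads that fall on either side of a bin boundary are not merged; a clustering step would handle that more gracefully.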
W. Richard McCombie, CSHL, http://schatzlab.cshl.edu/presentations/2015/2015.02.27.AGBT%20PacBio%20SKBR3.pdf
• 94 high-confidence inter-chromosomal translocations were identified in the SKBR3 genome assembly
• 1,200 intra-chromosomal translocations were found
A High Quality Assembly Enables the Direct Detection of Interchromosomal Translocations
Her2
W. Richard McCombie, CSHL, http://schatzlab.cshl.edu/presentations/2015/2015.02.27.AGBT%20PacBio%20SKBR3.pdf
[Figure labels: PacBio, Her2, chr17]
A Gold Standard Assembly Enables Cancer Lesion Reconstruction
By comparing the proportion of reads that are spanning or split at breakpoints we can begin to infer the history of the genetic lesions.
1. Healthy diploid genome
2. Original translocation into chromosome 8
3. Duplication, inversion, and inverted duplication within chromosome 8
4. Final duplication from within chromosome 8
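The comparison of spanning and split reads at each breakpoint can be expressed as a simple fraction. The sketch below ranks breakpoints by the share of reads that are split across them. Treating a higher split fraction as a hint that a rearrangement is carried on more copies, and so may have arisen earlier, is an assumed heuristic used here for illustration; the slides do not spell out the inference procedure. Function names, breakpoint names, and read counts are hypothetical.

```python
def split_fraction(spanning_reads, split_reads):
    """Fraction of reads at a breakpoint that are split rather than spanning."""
    total = spanning_reads + split_reads
    if total == 0:
        raise ValueError("no reads cover this breakpoint")
    return split_reads / total


def rank_breakpoints(read_counts):
    """Sort breakpoint names by descending split fraction.

    read_counts: dict mapping breakpoint name -> (spanning_reads, split_reads).
    """
    return sorted(read_counts,
                  key=lambda name: split_fraction(*read_counts[name]),
                  reverse=True)


# Made-up read counts, for illustration only.
toy_counts = {"bp_A": (10, 30), "bp_B": (30, 10), "bp_C": (20, 20)}
print(rank_breakpoints(toy_counts))  # prints ['bp_A', 'bp_C', 'bp_B']
```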
W. Richard McCombie, CSHL, http://schatzlab.cshl.edu/presentations/2015/2015.02.27.AGBT%20PacBio%20SKBR3.pdf
Oncogene amplifications:
ErbB2 (Her2/neu): ≈20X
MYC: ≈27X
MET: ≈8X

Known gene fusions | Confirmed by PacBio reads?
TATDN1-GSDMB | Yes
RARA-PKIA | Yes
ANKHD1-PCDH1 | Yes
CCDC85C-SETD3 | Yes
SUMF1-LRRFIP2 | Yes
WDR67 (TBC1D31)-ZNF704 | Yes
DHX35-ITCH | Yes
NFS1-PREX1 | Yes *read-through transcription
CYTH1-EIF3H | Yes *nested inside 2 translocations
Gold Standard Assemblies Reveal the Locations of Oncogene Duplications and Gene Fusion Events
W. Richard McCombie, CSHL, http://schatzlab.cshl.edu/presentations/2015/2015.02.27.AGBT%20PacBio%20SKBR3.pdf
Her2+ Breast Cancer Reference Genome is Available for Download
http://schatzlab.cshl.edu/data/skbr3/
Available now under the Toronto Agreement:
• Fastq & BAM files of aligned reads
• Interactive Coverage Analysis with BAM.IOBIO
Available soon:
• Whole-genome assembly and methylation analysis
• Comparison to single cell analysis of >100 individual cells
• Iso-Seq data
W. Richard McCombie, CSHL, http://schatzlab.cshl.edu/presentations/2015/2015.02.27.AGBT%20PacBio%20SKBR3.pdf
Comprehensive View of a Cancer Genome & Epigenome
• PC-9 non-small cell lung cancer cell line
• Apply PacBio sequencing for:
– De novo long-read assembly of a cancer genome
– Characterization of gene fusions
– Genome-wide methylome characterization
• Sequenced both drug-sensitive and drug-resistant samples
[Figure labels: Drug-sensitive; Sustained treatment with erlotinib; Drug-resistant; Stably drug-resistant; Mutations; Drug-sensitive; MT inhibitor]
Collaboration with M. Classon, V. Janakiraman, E. Stawiski, S. Durinck, S. Seshagiri(Genentech) & Y. Suzuki, S. Morishita (U of Tokyo)
PC-9 Cancer Genome De Novo Assembly
PacBio
# of contigs 12,359
Contig N50 1.044 Mb
Max contig length 26.6 Mb
PacBio assembly performed by J. Chin. Collaboration with M. Classon, V. Janakiraman, E. Stawiski, S. Durinck, S. Seshagiri (Genentech) & Y. Suzuki, S. Morishita (U of Tokyo)
PC-9 Cancer Genome De Novo Assembly
Metric | PacBio | Short-read [1] | Improvement
# of contigs | 12,359 | 424,605 | 34x
Contig N50 | 1.044 Mb | 0.018 Mb | 58x
Max contig length | 26.6 Mb | 0.28 Mb | 95x
PacBio assembly performed by J. Chin. Collaboration with M. Classon, V. Janakiraman, E. Stawiski, S. Durinck, S. Seshagiri (Genentech) & Y. Suzuki, S. Morishita (U of Tokyo)
[1] Nagarajan et al. (2012) Whole-genome reconstruction and mutational signatures in gastric cancer. Genome Biology 13: R115.
Mapping against Human Reference Sequence
• Observing unusual sequencing coverage across chromosomes in both samples (CHM1 as control):
• Example: chromosome 3:
PC-9: CHM1:
Note: the gap in the center is the centromere (no reference sequence available)
Long Reads Reveal Gene Fusions
• Example: CPSF3-ASAP2
CPSF3
Gene Fusion Mapping Result: CPSF3-ASAP2
• Exact break points identified:
[Figure labels: ASAP2, CPSF3]
Same read, split-mapped
Break point 1: Chr2: 9,391,888
Break point 2: Chr2: 9,438,556
Detection of DNA Base Modifications Using Kinetics
Example: N6-methyladenine
Flusberg et al. (2010) Nature Methods 7: 461-465
A G C T mA G T T Template strand
C G A G C T AG TTC A T G T Template strand
• SMRT Sequencing uses kinetic information from each nucleotide addition to call bases
• This same information can be used to distinguish modified and native bases by comparing results of SMRT Sequencing to an in silico kinetic reference for incorporation dynamics without modifications.
Detection of DNA Base Modifications by SMRT Sequencing
Flusberg et al. (2010) Nature Methods 7: 461-465
SMRT Portal v1.3.3+ can recognize and annotate multi-site modified-base signatures.
5-mC
4-mC
6-mA
Calculation of IPD ratios across the reference gives information about base modification at every position.
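The IPD (interpulse duration) ratio at a position is the observed mean IPD divided by the IPD expected for unmodified DNA, here taken from a control such as the in silico kinetic reference mentioned above. The sketch below computes per-position ratios and flags positions above a cutoff. The function names, the toy values, and the threshold of 2.0 are illustrative assumptions; production tools use statistical tests rather than a fixed cutoff.

```python
def ipd_ratios(observed_ipds, reference_ipds):
    """Per-position ratio of mean observed IPD to the reference (control) IPD.

    observed_ipds: list of lists, the IPD values from all reads at each position.
    reference_ipds: list of expected IPDs for unmodified DNA at each position.
    Positions with no reads or a non-positive reference value yield None.
    """
    if len(observed_ipds) != len(reference_ipds):
        raise ValueError("observed and reference must cover the same positions")
    ratios = []
    for reads, expected in zip(observed_ipds, reference_ipds):
        if not reads or expected <= 0:
            ratios.append(None)
            continue
        mean_ipd = sum(reads) / len(reads)
        ratios.append(mean_ipd / expected)
    return ratios


def flag_positions(ratios, threshold=2.0):
    """Return indices whose IPD ratio meets or exceeds the (assumed) threshold."""
    return [i for i, ratio in enumerate(ratios)
            if ratio is not None and ratio >= threshold]


# Made-up kinetic values, for illustration only.
observed = [[0.5, 0.5], [2.0, 2.0, 2.0], [], [1.0]]
reference = [0.5, 0.5, 0.5, 1.0]
ratios = ipd_ratios(observed, reference)
print(ratios)                  # prints [1.0, 4.0, None, 1.0]
print(flag_positions(ratios))  # prints [1]
```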
Detecting Hypo-methylated CpG Islands in Eukaryotic DNA
• 16-fold per-strand coverage maximizes the accuracy of the results
• Algorithm is freely available at https://github.com/hacone/AgIn
• The signal strength for 5mC is weaker than for other methylated bases and so requires higher coverage to identify. However, in eukaryotes methylation is a regional phenomenon.
• Prof. Shinichi Morishita developed a method to differentiate hypo- and hyper-methylated regions by integrating the signals across CpG islands
http://www.ashg.org/2014meeting/abstracts/fulltext/f140120432.htm
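The idea of integrating signals across a CpG island can be illustrated with a much simpler stand-in: pool weak per-site evidence over all CpG sites in a region and classify the region as a whole. The sketch below is not the AgIn algorithm. It assumes each CpG site already has a hypothetical methylation score between 0 and 1, and the function name, threshold, and minimum site count are invented for illustration.

```python
def classify_island(cpg_scores, threshold=0.5, min_sites=3):
    """Classify a CpG island from per-site methylation scores (assumed 0..1).

    Averaging across sites pools weak single-site evidence into a
    region-level call; islands with too few scored sites are not called.
    """
    if len(cpg_scores) < min_sites:
        return "insufficient data"
    mean_score = sum(cpg_scores) / len(cpg_scores)
    return "hyper-methylated" if mean_score >= threshold else "hypo-methylated"


# Made-up scores, for illustration only.
print(classify_island([0.2, 0.1, 0.3, 0.15]))  # prints hypo-methylated
print(classify_island([0.8, 0.6, 0.9]))        # prints hyper-methylated
print(classify_island([0.9]))                  # prints insufficient data
```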
Epigenome Characterization
• Global methylation status:
• More hypermethylated CpG islands in drug-resistant sample
[Figure: % hypermethylated CpG islands (y axis, 60 to 85) by chromosome (x axis, 1 to 22), drug-sensitive vs. drug-resistant]
Epigenome Characterization
• Methylation status of CpG islands (https://github.com/hacone/AgIn)
• Chr4: ANKRD17 (breast cancer)
Epigenome Characterization
• Methylation status of CpG islands (https://github.com/hacone/AgIn)
• Chr4: WHSC1 (myelomas)
Epigenome Characterization
• Methylation status of CpG islands (https://github.com/hacone/AgIn)
• Chr4: FGFR3 (fibroblast growth factor receptor 3)
Epigenome Characterization
• Methylation status of CpG islands (https://github.com/hacone/AgIn)
• Chr4: RASSF6 (tumor suppressor gene)
PacBio Long-read Sequencing Allows for Detailed, De Novo Characterization of Cancer Genome & Epigenome
Take Structural Variant Discovery to ‘11’ with Low PacBio Coverage
Hybrid Structural Variant Calling Using Multiple Data Sources
Data sources include whole-genome PacBio Long-Read sequencing at 10x coverage (10,000 bp read length)
English et al (2015). Assessing structural variation in a personal genome—towards a human reference diploid genome BMC Genomics
Structural Variant Discovery with 10X PacBio Coverage
PBHoney alone, with only 10x PacBio coverage, identifies 4,268 SVs supported by hybrid assembly, representing events “invisible” to PE data.
English et al (2015). Assessing structural variation in a personal genome—towards a human reference diploid genome BMC Genomics
“Applying multiple Parliament workflows, we demonstrate that while method integration is optimal for SV detection in Illumina paired-end data, the addition of long-read data can more than triple the number of SVs detectable in a personal genome.”
SV Calling: Illumina-only method comparison
Limitations of Illumina-only methods for SV Calling:
High False Discovery Rate (FDR):
- Ranges from 3% to > 80%
Low Sensitivity:
- Ranges from < 8% to a max of 57%
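Both figures above come from comparing a call set against a truth set: the false discovery rate is the share of calls that are wrong, FP / (TP + FP), and sensitivity is the share of true events that were found, TP / (TP + FN). A minimal sketch follows, assuming each SV can be reduced to a hashable identifier. The function name and toy identifiers are hypothetical, and real SV comparison needs fuzzy matching of breakpoints rather than exact identity.

```python
def evaluate_calls(called, truth):
    """Return (false_discovery_rate, sensitivity) for a set of SV calls."""
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    false_pos = len(called - truth)
    false_neg = len(truth - called)
    fdr = false_pos / (true_pos + false_pos) if called else 0.0
    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    return fdr, sensitivity


# Made-up call and truth sets, for illustration only.
toy_truth = {"sv1", "sv2", "sv3", "sv4", "sv5", "sv6", "sv7", "sv8"}
toy_calls = {"sv1", "sv2", "sv3", "sv9"}
print(evaluate_calls(toy_calls, toy_truth))  # prints (0.25, 0.375)
```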
English et al (2015) BMC Genomics
“Despite these benefits of a multi-algorithm approach, Illumina-only discovery still only recovers approximately half of the 9,777 SVs identified by multi-source Parliament”
PacBio Provides the Most Comprehensive View of Genetic Variation in a Human Genome
“We now have access to a whole new realm of genetic variation that was opaque to us before.”
“Knowing all the variation is going to be a game changer.”
- Professor Evan Eichler, University of Washington
Source: http://www.sciencenewsline.com/articles/2014111017120055.html
For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.