Genome in a Bottle Consortium January 2015
Stanford University
Reference Materials for Clinical Applications of Human Genome Sequencing
Marc Salit, Ph.D. and Justin Zook, Ph.D.
National Institute of Standards and Technology
Advances in Biological/Medical Measurement Science (ABMS @ Stanford)
GIAB Scope
• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.
• A principal motivation for this consortium is to enable performance assessment of sequencing and science-based regulatory oversight of clinical sequencing.
Genome in a Bottle Consortium Development
• NIST met with sequencing technology developers to assess standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small, invitational workshop at NIST to develop consortium for human genome reference materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
• Website
– www.genomeinabottle.org
Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
– integration of data from multiple platforms
– sequencing and analysis
• Enable regulated applications
[Figure: Comparison of SNP calls for NA12878 on 2 platforms, 3 analysis methods]
Measurement Process
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials will be developed to characterize performance of a part of process
– materials will be certified for their variants against a reference sequence, with confidence estimates
[Diagram label: generic measurement process]
Putting “Genomes” in Bottles
• NIST working with GIAB to select genomes
• Current plan
– NA12878 HapMap sample as Pilot sample
• part of 17-member pedigree
– trios from PGP as more complete set
• 2 trios, focus on children
• varying biogeographic ancestry
[Figure: CEPH Utah Pedigree 1463. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. 11 children (birth order redacted): 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893]
Overview of NIST RM Development
Timeline columns: Q4 2014, Q1 2015, Q2 2015, Q3 2015, Q4 2015 (milestones for each genome are listed in the order they appear on the timeline)
HG-001/NA12878 (“Pilot” Genome)
– Release NIST RM8398; preliminary large deletions
– Refined structural variants
HG-002 to HG-004 (Ashkenazim trio)
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– Preliminary SNPs/indels; 120x-150x PacBio data; “moleculo”; mate-pair; CG-LFR
– Refined SNPs/indels; preliminary SVs
– Refined structural variants
– NIST RMs 8391/8392 release
HG-005 (son in Asian trio)
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– “moleculo”; mate-pair; CG-LFR
– Preliminary SNPs/indels
– Refined SNPs/indels; refined structural variants
– NIST RM8393 release
Genome in a Bottle Working Groups
Reference Material Selection & Design
Andrew Grupe, Celera
• Develop prioritized list of whole human genomes for Reference Materials
• Identify candidate approaches and materials for artificial RMs
• Develop prioritized list
Measurements for Reference Material Characterization
Mike Eberle, Illumina
• Develop consensus plan for experimental characterization of Reference Materials
Bioinformatics, Data Integration, and Data Representation
Chunlin Xiao, NCBI
• Develop plan for integrating experimental data and forming consensus variant calls and confidence estimates
• Develop consensus plan for data representation
Performance Metrics & Figures of Merit
Deanna Church, Personalis
• User interface to the Genome-in-a-Bottle Reference Material
• “Dashboard”
• what an end user will see and report to understand and describe the performance of their experiment
• variant call accuracy
• process performance measures to enable optimization
Update
Zook et al., Nature Biotechnology, 2014.
• methods to develop SNP/indel call set described in manuscript
• broad and quick adoption of call set for benchmarking
– struck nerve
Preliminary uses of high-confidence NIST-GIAB genotypes for NA12878
• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
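As a rough illustration of what benchmarking against a high-confidence call set can look like, the sketch below compares a query call set with a truth set, restricted to high-confidence regions, and reports precision and recall. It is a minimal sketch, not the GIAB or NIST method: the names `benchmark`, `in_regions`, and `Benchmark` are hypothetical, variants are assumed to be (chrom, 1-based pos, ref, alt) tuples, regions are assumed to be BED-style 0-based half-open intervals, and matching is by exact tuple equality, which ignores differences in how the same indel can be represented.

```python
from typing import Iterable, List, NamedTuple, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, 1-based pos, ref, alt); assumed layout
Region = Tuple[str, int, int]        # (chrom, start, end); assumed BED-style, 0-based half-open


class Benchmark(NamedTuple):
    tp: int
    fp: int
    fn: int
    precision: float
    recall: float


def in_regions(variant: Variant, regions: List[Region]) -> bool:
    """True if the variant's 1-based position falls inside any region."""
    chrom, pos = variant[0], variant[1]
    # A 0-based half-open interval [start, end) covers 1-based positions start+1..end.
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def benchmark(calls: Iterable[Variant], truth: Iterable[Variant],
              regions: List[Region]) -> Benchmark:
    """Count TP/FP/FN by exact match, considering only high-confidence regions."""
    calls_hc = {v for v in calls if in_regions(v, regions)}
    truth_hc = {v for v in truth if in_regions(v, regions)}
    tp = len(calls_hc & truth_hc)
    fp = len(calls_hc - truth_hc)
    fn = len(truth_hc - calls_hc)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Benchmark(tp, fp, fn, precision, recall)
```

Calls outside the high-confidence regions are not counted either way, which is why the fraction of the genome covered by those regions matters when interpreting the resulting metrics.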
Highlights
This workshop
• Pilot genome release and use
• Coordinating analyses for PGP GIAB Trios
• Working groups
– Spike-in mutation interlab, FFPE
– FTP site, analysis coordination
– GA4GH
• GIAB papers
Future GIAB work
• Beyond support, improvement/development and maintenance of existing GIAB products…
– What future work should GIAB do that would uniquely take advantage of the momentum we’ve built?
Agenda
Thursday
• Breakfast
• Welcome and Status Update
• Using the Pilot RM
• Break
• Coordination of PGP analyses
• Lunch (provided)
• Working Group Breakout Discussions
• Break
• Discussion about Planned GIAB papers
• Informal discussions
• Reception
Friday
• Breakfast
• Working Group leaders present plans and discussion
• Break
• Future GIAB work
• Lunch (provided)
• Steering committee meeting
Agenda
Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy' variants and regions of the genome
– Topic #2: Selecting future genomes for Reference Materials
Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot Reference Material
• Discussion of plans to release pilot Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans and discussion
• Steering committee Overview
• First meeting of the Steering Committee (others adjourn)
Please Note
Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).
Tweets are welcome unless the speaker requests
otherwise. Please use #giab as the hashtag.
What’s the future of GIAB?
• What is GIAB uniquely positioned to do?
– how will we know when we’re done?
• If we do other stuff, are we the best cohort to do it?
• Other biogeographical ancestry groups?
• Cancer?
– spike-in controls
– whole-genomes
• tumor/normal?
• Create list of mutations for spike-ins for germline
• Somatic genomes other than cancer
• Prenatal
• Forensics – decay of DNA
• Transcriptome?
• Epigenome?
• Interpretation standards?
– functional
– clinical
Others working in this space…
Well-characterized genomes
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Human Longevity, Inc.
• Hydatidiform mole haploid cell line
• Genome Reference Consortium
Performance Metrics
• Global Alliance for Genomics and Health Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
Data Release Plans
Individual Datasets
• Uploaded to GIAB FTP site as it is collected
• May include raw reads, aligned reads, and variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel, and homozygous reference calls
• Then develop SV and non-SV calls
• Released calls are versioned
• Preliminary callsets will be made available to be critiqued
Pilot RM (NA12878)
• Developing reproducible methods for new integrated high-confidence SNPs/indels
• Illumina Platinum Genomes released phased pedigree calls in Dec 2014
– Blog will be posted
– also working on SVs
• Developing SV calls
– High-confidence deletions and pre-print will be released Feb 2015
• Planned release as NIST RM8398 in April 2015
Ashkenazim PGP trio
Short reads
• Completed
– 300x Illumina paired end on trio
– Complete Genomics
– Ion exome
• Scheduled
– Illumina mate-pair
– possibly SOLiD
Long reads
• Completed
– 20x/8x/8x PacBio
– BioNano Genomics
• Scheduled
– 60x/30x/30x PacBio (or more)
– custom moleculo
Ashkenazim Jewish PGP RM Trio
Dataset | Characteristics | Coverage | Availability | Good for…
Illumina Paired-end | 150x150bp | ~300x/individual | Fastq on ftp | SNPs/indels/some SVs
Illumina Long Mate pair | ~6000 bp insert | ~40x/individual | Feb-Mar 2015 | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | Feb-Mar 2015 | SVs/phasing/assembly
Complete Genomics | | 100x/individual | On ftp | SNPs/indels/some SVs
Complete Genomics | LFR | | ?? | SNPs/indels/phasing
Ion Proton | Exome | 1000x/individual | On SRA | SNPs/indels in exome
BioNano Genomics | | | Feb 2015 | SVs/assembly
PacBio | ~10kb reads | ~120-150x on AJ trio | Finished ~Mar 2015 | SVs/phasing/assembly/STRs
SNP/Indel Integration Method Update
• Implementing new integration methods on DNAnexus
– Easier for others to reproduce results
– Easier to apply same methods to new genomes
• First, analyzing NA12878 RM data with new methods to ensure they work well
• Then, apply to PGP trios
SV Integration Methods
[Diagram: SVClassify workflow]
• Inputs: reference genome and RepeatMasker data; aligned sequence data (BAM file); list of structural variants (BED file)
• SVClassify: up to 180 annotations per SV
• Up to 35 selected annotations per SV
• Methods shown: one-class methods, unsupervised clustering, support vector machine, L1 distance
Multidimensional scaling plot for visualizing the 8 clusters. We use a 3 dimensional representation of the data space which associates 3 MDS coordinates to each site, one for each dimension. This figure plots MDS-3 against MDS-1
Multi-dimensional scaling showing separation of 8 clusters
Number of sites from each candidate callset that have k=3 L1 Classification scores in each range, where the score is the proportion p of random sites that are closer to the center than each candidate site. These numbers are after filtering sites for which the flanking regions have low mapping quality or high coverage.
Callset | <0.68 | 0.68-0.90 | 0.90-0.99 | >0.99
Random | 2,599 | 773 | 279 | 17
Personalis | 4 | 4 | 182 | 1,783
1000 Genomes | 38 | 65 | 557 | 1,493
One-class scores for Random Non-SVs and “Validated” Callsets
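The score definition above (the proportion of random sites that lie closer to the center than a candidate site) can be sketched in a few lines. This is an illustrative sketch only, not the SVClassify implementation: the function names are hypothetical, the center is assumed to be the per-annotation median of the random sites, annotations are assumed to be already scaled to comparable ranges, and the handling of bin boundaries is an assumption.

```python
from statistics import median
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute per-annotation differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def center_of(sites: List[Sequence[float]]) -> List[float]:
    """Assumed center: the median of each annotation across the random sites."""
    return [median(column) for column in zip(*sites)]


def one_class_score(candidate: Sequence[float],
                    random_sites: List[Sequence[float]]) -> float:
    """Proportion of random sites closer (in L1 distance) to the center than the candidate."""
    center = center_of(random_sites)
    candidate_distance = l1_distance(candidate, center)
    closer = sum(1 for site in random_sites
                 if l1_distance(site, center) < candidate_distance)
    return closer / len(random_sites)


def score_bin(score: float) -> str:
    """Assign a score to one of the table's ranges (boundary handling is an assumption)."""
    if score < 0.68:
        return "<0.68"
    if score < 0.90:
        return "0.68-0.90"
    if score <= 0.99:
        return "0.90-0.99"
    return ">0.99"
```

Under this definition a score near 1 means the candidate looks unlike almost all random non-SV sites, which matches the pattern in the table: most random sites score below 0.68, while most sites from the validated callsets score above 0.99.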
Parliament SV Integration Pipeline
From Baylor College of Medicine
[Diagram: pipeline stages Sample, Data, Discovery, Merge, Evaluate, Results. Labels shown: Personal Genome; Illumina, PacBio, Nextera, aCGH, Irys; Breakdancer, Delly, CNVnator, Pindel, Crest, SV-STAT, Tiresias, Honey; Multi-Source Reduce & Cluster; Annotation Sources; Discordant Loci Database; Hybrid Assembly; SVachra; PacBio Force Calling; Heuristics; Putative SVs; yes/no decision branches]
Data | Type | Resolution
WGS Illumina HiSeq | NGS | 48X 100x100 bp paired-end
WGS Illumina Nextera | NGS | ~2X 100x100 bp mate-pair
WGS SOLiD | NGS | 3X 35 bp fragment; 10X 25x25 bp paired-end; 17X 50x50 bp paired-end
WGS PacBio | Long-Read | 10X ~10,000 bp
Agilent 1M | aCGH | 1-million-probe oligo array
NimbleGen 2.1M | aCGH | 2.1-million-probe oligo array
Custom Agilent Array | aCGH | 44,000 neuropathy-specific oligo array
BioNano Irys | Genome Mapping | Whole genome architecture
Sanger-Validated Deletions | Manual | 42 fully resolved deletions
Program | Method
BreakDancer | Paired End
Crest | Split Read
Pindel | Paired End
Delly | Paired End / Split Read
CNVnator | Read Depth
Tiresias | Consensus Sequence
SV-STAT | Split Read
SVachra | Paired-End
PBHoney | Errors, Tail Mapping, Assembly
pb-jelly.sourceforge.net
Potential SV Integration Approach
• GIAB members generate candidate SV calls
• Use SVClassify and Parliament to classify candidate calls as likely TPs or FPs
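Before candidate calls from different groups are classified, it can help to see which calls agree with each other. The sketch below counts which callsets support a candidate SV using reciprocal overlap. It is a generic illustration under stated assumptions and is not how SVClassify or Parliament work: the function names, the (chrom, start, end) tuple layout, the caller names in the usage example, and the 50% threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end); assumed half-open coordinates


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Smaller of the two overlap fractions; 0.0 if the intervals do not overlap."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return 0.0
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def supporting_callers(candidate: Interval,
                       callsets: Dict[str, List[Interval]],
                       min_overlap: float = 0.5) -> List[str]:
    """Names of callsets with at least one call reciprocally overlapping the candidate."""
    return sorted(
        name for name, calls in callsets.items()
        if any(reciprocal_overlap(candidate, call) >= min_overlap for call in calls)
    )


# Usage with made-up caller names and coordinates.
example_callsets = {
    "callerA": [("1", 1000, 2000)],
    "callerB": [("1", 1100, 2100)],
    "callerC": [("1", 5000, 6000)],
}
print(supporting_callers(("1", 1000, 2000), example_callsets))
```

A support count like this is only one possible input; the classification step described in the bullets above relies on annotations of the aligned sequence data for each site.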
Analysis Coordinator(s)
• “Face” of the group
• Maintain table of groups doing different types of analyses
• Recruit groups to do missing analyses
• Make workplan and timeline
• Follow-up with analysis groups
• Coordinate comparisons and integration of analyses
• Coordinate writing of papers
Ashkenazim Jewish PGP RM Trio
Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end | 150x150bp | ~300x/individual | Fastq on ftp | SNPs/indels/some SVs
Illumina Long Mate pair | ~6000 bp insert | ~40x/individual | Feb-Mar 2015 | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | Feb-Mar 2015 | SVs/phasing/assembly
Complete Genomics | | 100x/individual | On ftp | SNPs/indels/some SVs
Complete Genomics | LFR | | ?? | SNPs/indels/phasing
Ion Proton | Exome | 1000x/individual | On SRA | SNPs/indels in exome
BioNano Genomics | Long optical map reads | | Feb 2015 | SVs/assembly
PacBio | ~10kb reads | ~120-150x on AJ trio | Finished ~Mar 2015 | SVs/phasing/assembly/STRs