What’s Genome in a Bottle?
• Authoritative Characterization of Human Genomes
– enduring commitment to resource availability
• Samples
• Data
– widely available open resources
– no restrictions on use or distribution
• Enable technology and tool-building with benchmark samples and methods for…
– development
– optimization
– demonstration
• Germline samples available now
• Developing capacity for somatic sample development
What GIAB Isn’t
• Population genetics
• Disease-specific
• Many clinical samples
• Non-human
• Genome, not transcriptome, epigenome, proteome, metabolome…
Prior workshop takeaways
http://jimb.stanford.edu/giabworkshops
Benchmarking strides forward
• Draft GA4GH manuscript describing best practices for benchmarking germline small variants
– >15 co-authors actively editing manuscript
• Robust, sophisticated benchmarking tools publicly available
– GitHub
– PrecisionFDA
High-confidence calls are in use
• 286 citations of 2014 paper
• PrecisionFDA challenges
• Clinical labs
• Demonstration of new variant callers
https://blog.dnanexus.com/2017-12-05-evaluating-deepvariant-googles-machine-learning-variant-caller/
Clinical Community adopting GIAB
• Justin in a 2-year term on Association for Molecular Pathology Clinical Practice Committee
• Monica Basehore appointed GIAB/AMP liaison
• GIAB derived products meeting needs in clinical labs
– AMP RM Forum hosted by CDC
• Somatic variants
• ctDNA
• Difficult variants
Derived Products are on the market
• 31 products from 3 companies now available based on GIAB PGP cell lines
• DNA + spike-ins
– Clinical variants
– Somatic variants
– Difficult variants
• FFPE
• ctDNA
GIAB Developing New Data
• 10X Genomics
– Chinese trio now available
• PacBio Sequel of Chinese trio with Mt Sinai
– Read insert N50: ~15kb
– 202Gb son and 98-116Gb on each parent
– Data undergoing QC
• BioNano
– New DLS labeling method
• Complete Genomics/BGI
– stLFR linked reads
• Oxford Nanopore
– NIST/Birmingham/Nottingham Ultra-long reads
• Starting soon
• Very preliminarily 80-90kb N50
– Max reads >1Mb!
• Current throughput will give ~40x total on AJ trio, but may improve
• Strand-seq
– Collaboration with Korbel lab
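The N50 figures quoted above (about 15 kb for PacBio Sequel, 80-90 kb for the preliminary ultra-long nanopore runs) are length-weighted medians: half of all sequenced bases sit in reads of at least that length. A minimal sketch of the calculation follows. The function name and the read lengths are illustrative assumptions, not GIAB data or GIAB code.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the largest length L such that reads of length >= L
    together contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first read that pushes the running sum past half is the N50 read.
        if running >= half_total:
            return length


# Hypothetical read lengths in bases, chosen only to illustrate the idea.
example_lengths = [2_000, 5_000, 8_000, 15_000, 20_000, 100_000]
print(read_n50(example_lengths))
```

A single very long read can dominate the statistic, which is why N50 is usually reported together with total yield and maximum read length, as on this slide.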
Progress on New Samples
Germline Samples
• Performed rfmix ancestry analysis on 6 PGP individuals with WGS and cell lines
– 2 differently admixed Hispanic
– 1 76% African + 24% European
– 1 84% European + 15% South Asian
– 1 77% European + 21% East Asian
– 1 99% European
– PGP1
– 1 self-reported Chinese/Filipino
• Working on MTA for open dissemination
Somatic Samples
• Discussion Friday Morning
Goals for This Workshop
• Update consortium and onboard new members
• Review progress on SVs
• Demo Manual Curation App
• Learn about new methods for characterizing difficult regions
• Review, revisit, and update Principles for Dissemination of GIAB Samples
• Discuss plans for new Germline and Tumor Samples
Workshop Agenda
• THURSDAY, JANUARY 25, 2018
• 9:00 AM - 10:30 AM: Welcome, Onboarding, and GIAB Progress Update
• 10:30 AM - 11:00 AM: Break
• 11:00 AM - 12:15 PM: Training and Trial Manual Curation of Structural Variants
• 12:15 PM - 1:45 PM: Lunch
• 1:45 PM - 3:15 PM: Feedback about v0.5 Draft Structural Variant Benchmark Set
• 3:15 PM - 3:45 PM: Break
• 3:45 PM - 5:00 PM: New Approaches to Characterizing Difficult Variants and Regions
• 5:00 PM - 5:30 PM: Discussion of Future Work
• 5:30 PM - 6:30 PM: Happy Hour, Sponsored by PacBio and Invitae
• FRIDAY, JANUARY 26, 2018
• 9:00 AM - 10:45 AM: Panel Discussion about Principles for Dissemination of GIAB Samples
• 10:45 AM - 11:15 AM: Break
• 11:15 AM - 12:00 PM: Discussion about Future Germline and Tumor Samples
• 12:00 PM - 1:00 PM: Break
• 1:00 PM - 2:30 PM: GIAB Steering Committee Meeting
Steering Committee Agenda
• Roadmap
• Next 2-3 workshops
• Resourcing GIAB work
• Communications
– Best practices
– Ways of working together
• Liaisons
– HGSVC
– Clinical labs
• Samples, consents, repository relationships
– NIST RMs needed?
– Distribute all NIST RMs together
– Cells instead of DNA
– GIAB Imprimatur
• Research v. Standards-making
– Tool development vs reference sample development
Unique GIAB roles in genomics
• Develop open-access samples and data for broad uses in industry, academia, and government
• Convene community of experts to characterize genomes -> GIAB/NIST integrates results to form benchmarks
• Develop tools to calculate accurate and standardized performance metrics
Progress Update
• New sequencing with long and linked reads
• Developing plan for open access to GIAB materials
• Selecting samples from new ancestries
• Developing cancer samples for somatic benchmarking
• Stay for Friday morning!
Ongoing and Future Work
• Developing cancer samples for somatic benchmarking
• Draft publications about high-confidence calls and benchmarking methods
Progress Update
• Best methods agree on 99.9%+ of “easy” calls
• Evaluating “straw man” large indel/SV callsets
Ongoing and Future Work
• Characterize challenging 10-20% of genome
• New methods for reference characterization of somatic genomes
• Refining principled integration methods
• Assembly metrology
Progress Update
• GIAB Analysis Team focused on large indels and SVs
Ongoing and Future Work
• Individual collaborations exploring expanding calls for other variant types
Progress Update
• Released 3 “straw man” sequence-resolved benchmark callsets >=20bp
• Analysis Team gave critical feedback in each round
• v0.5.0 released Jan 2018
Ongoing and Future Work
• Evaluate v0.5.0
• Write manuscript
• Manual Curation
• Resolve clusters of variants
• Integrate new technologies and methods
Our SV Integration Strategy
Collect many candidate calls for AJ Trio
• Gather candidate calls from a variety of approaches
– Many technologies
• Short, linked, and long reads
• Optical and nanopore mapping
– Many approaches
• Small variant callers
• Structural variant callers
• Local and global de novo assemblies
• Community submitted >1 million calls from 30+ methods using 5+ technologies
Refine/evaluate/genotype candidates
• Obtain sequence-resolved calls as often as possible using assembly-based approaches
• Compare sequence predictions of candidate calls and merge similar calls
• Determine raw data’s support of each sequence-resolved call and its genotype
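The "merge similar calls" step can be pictured with a small sketch. This is only an illustration under stated assumptions: it groups candidates by type, position, and size, whereas the slides describe comparing sequence predictions. The `Call` fields, the thresholds, and the technology labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    chrom: str
    pos: int     # start position of the candidate call
    svtype: str  # "DEL" or "INS"
    size: int    # variant length in bp, assumed to be > 0
    tech: str    # technology that produced the call


def similar(a, b, max_dist=1000, min_size_ratio=0.8):
    """Hypothetical similarity rule: same type, nearby, comparable size."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    return min(a.size, b.size) / max(a.size, b.size) >= min_size_ratio


def cluster_calls(calls):
    """Greedy clustering: each call joins the first cluster whose
    first member (the representative) it resembles."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.pos)):
        for cluster in clusters:
            if similar(cluster[0], call):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters


def multitech_sites(clusters, min_techs=2):
    """Keep clusters supported by at least min_techs distinct technologies."""
    return [c for c in clusters if len({call.tech for call in c}) >= min_techs]


# Toy candidate calls, not real AJ Trio data.
calls = [
    Call("1", 10_000, "DEL", 300, "short_read"),
    Call("1", 10_020, "DEL", 310, "long_read"),
    Call("1", 10_050, "INS", 300, "long_read"),
    Call("2", 5_000, "DEL", 1_200, "linked_read"),
]
clusters = cluster_calls(calls)
supported = multitech_sites(clusters)
print(len(clusters), len(supported))
```

Comparing against a single representative keeps the sketch short; a production pipeline would more likely compare the predicted sequences themselves and then check the raw data's support for each merged call.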
Evolution of SV calls for AJ Trio
v0.2.0
• Only deletions
• Overlap and size-based clustering
• Output sites with multitech support
v0.3.0
• New calling methods
• Deletions and insertions
• Sequence-resolved calls
• Sequence-based clustering
• Output sites with multitech support
v0.4.0
• Include some single tech calls
• Evaluate read support to remove some false positives
• Add genotypes for trio
v0.5.0
• Better calling methods, especially for large insertions
• Include more single tech calls
• Add some phasing info
Future
• Resolve clusters of differing calls
• Improve phasing
• Add new data types
• Improve sequence resolution
• Collaborate with HGSVC?
Progress Update
• Initiated discussions with several groups working on phasing and calling variants in difficult-to-map regions
• Similar data and methods used for both problems
Ongoing and Future Work
• Work with several groups developing new methods
• Integrate difficult-to-map variants into high-confidence calls
• Integrate phasing into high-confidence calls
Progress Update
• Initiated discussions with several groups working on short tandem repeats and complex variants
• Explored using RTG vcfeval and varmatch to harmonize multiple VCFs for integration
Ongoing and Future Work
• Add STR methods into integration methods
• Test variant harmonization methods for integration
• Find collaborators for HLA and ALT loci characterization (e.g., graph-based methods, linked/long reads)
Progress Update
• Draft manuscript for v3.3.2 small variants
• Preliminary machine learning methods can reproduce SV genotypes from svviz
• Demo of SV manual curation web app (see next session!)
Ongoing and Future Work
• Crowd-source manual curation of SVs
• Use crowd-sourced labels for machine learning
Progress Update
• Using assemblies to call and refine structural variants
Ongoing and Future Work
• Need to develop integration methods for all types of somatic variants
• Need to develop methods to integrate and benchmark diploid assemblies
Progress Update
• GA4GH made available sophisticated, standardized tools for benchmarking small variants
• “Best practices” manuscript for small variant benchmarking
Ongoing and Future Work
• Develop new methods for structural variant benchmarking
• Develop new methods for somatic variant benchmarking
• Predict performance on clinically interesting variants
Benchmarking Best Practices Manuscript
• Focus on germline small variants
• Describe benchmark callsets
• Define performance metrics at different stringency levels
• Sophisticated comparison tools are important
• Stratify performance by variant type and genome context
• Tools available on GitHub and PrecisionFDA
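To make the stratification point concrete, here is a minimal sketch of per-stratum precision, recall, and F1. It assumes a separate comparison tool has already labelled each variant as TP, FP, or FN, and it uses a single TP count per stratum, which is a simplification. The function name, stratum labels, and counts are hypothetical and do not come from the GA4GH tools.

```python
from collections import Counter


def stratified_metrics(records):
    """Compute precision, recall and F1 per stratum.

    records: iterable of (stratum, outcome) pairs, where outcome is
    "TP", "FP" or "FN" as assigned by a separate comparison tool.
    """
    counts = Counter(records)
    strata = sorted({stratum for stratum, _ in counts})
    results = {}
    for stratum in strata:
        tp = counts[(stratum, "TP")]
        fp = counts[(stratum, "FP")]
        fn = counts[(stratum, "FN")]
        # Leave a metric undefined (None) when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        if precision and recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = None
        results[stratum] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Toy outcome labels, invented for illustration only.
records = (
    [("SNP", "TP")] * 98 + [("SNP", "FP")] * 2 + [("SNP", "FN")] * 2
    + [("INDEL", "TP")] * 8 + [("INDEL", "FP")] * 2 + [("INDEL", "FN")] * 2
)
metrics = stratified_metrics(records)
for stratum, values in metrics.items():
    print(stratum, values)
```

Reporting the strata separately keeps a large, easy category such as SNPs from hiding weaker performance on a small, harder one such as indels.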
The road ahead...
2018
• Large variants
• Difficult small variants
• Phasing
2019
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
2020+
• Diploid assembly
• Somatic structural variation
• Segmental duplications
• Centromere/telomere
• ...
Outstanding work summary
• Many variant calls cannot be assessed by comparison to current benchmark callsets (>20% of SNPs, >50% of indels, ~100% of SVs outside our high-confidence regions)
– Currently mostly assessing “easy” things
• No broadly consented tumor-normal cell lines are available
• Benchmarking tools for SVs are not standardized
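The percentages in the first bullet describe calls that fall outside the high-confidence regions. A minimal sketch of how such a fraction could be computed for one chromosome is shown below, assuming BED-style half-open, 0-based intervals that are sorted and non-overlapping. The function name, the intervals, and the positions are toy values, not GIAB data.

```python
from bisect import bisect_right


def fraction_outside(positions, regions):
    """Fraction of 0-based positions not covered by any region.

    positions: non-empty list of variant positions on one chromosome.
    regions: sorted, non-overlapping half-open (start, end) intervals,
    as in a BED file.
    """
    starts = [start for start, _ in regions]
    outside = 0
    for pos in positions:
        # Index of the last region starting at or before pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        covered = i >= 0 and pos < regions[i][1]
        if not covered:
            outside += 1
    return outside / len(positions)


# Toy high-confidence intervals and variant positions.
regions = [(100, 200), (500, 800)]
positions = [50, 150, 199, 200, 600, 900]
print(fraction_outside(positions, regions))
```

Position 200 counts as outside because the end coordinate of a half-open interval is excluded.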
GIAB Roadmap
[Figure: roadmap chart with columns 2018, 2019, 2020 and rows Genome Measurement Science, Germline Samples, Somatic Samples, Benchmarking, Publications]
Milestones shown in the chart:
• IRB approval
• Strategy for cell line developer/distributor
• Using variant calls to benchmark assemblies
• Identify cell line developer/distributor
• Small repeats
• Difficult to map w/ phasing
• Initial large indels/SVs
• Challenging large indels/SVs
• Non-variant regions for large indels/SVs
• More difficult variant calls in all samples
• Machine learning to integrate indels/SVs
• Further automate arbitration/integration for new techs and difficult variants
• X/Y
• Complex variants
• Select samples for new ancestries
• Diploid assemblies are important part of integration
• SV comparison tools integrated into Benchmarking framework
• Benchmarking/new integrated calls
• SV Integration
• Machine learning
• Paper with HGSVC?
• Predict performance for clinical variants of interest
• Establish cell lines
• Characterize cell lines
• Develop integration methods for somatic variants
• Implement SV callers
• Ultralong read science