Jan2015 GIAB intro, Update, and Data Analysis Planning

Genome in a Bottle Consortium January 2015

Stanford University

Reference Materials for Clinical Applications of Human Genome Sequencing

Marc Salit, Ph.D. and Justin Zook, Ph.D.

National Institute of Standards and Technology

Advances in Biological/Medical Measurement Science (ABMS @ Stanford)

GIAB Scope

• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.

• A principal motivation for this consortium is to enable performance assessment of sequencing and science-based regulatory oversight of clinical sequencing.

Genome in a Bottle Consortium Development

• NIST met with sequencing technology developers to assess standards needs

– Stanford, June 2011

• Open, exploratory workshop

– ASHG, Montreal, Canada

– October 2011

• Small, invitational workshop at NIST to develop consortium for human genome reference materials

– FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others

– developed draft work plan

– April 2012

• Open, public meetings of GIAB

– August 2012 at NIST

– March 2013 at Xgen

– August 2013 at NIST

– January 2014 at Stanford

– August 2014 at NIST

– January 2015 at Stanford

• Website– www.genomeinabottle.org

Well-characterized, stable RMs

• Obtain metrics for validation, QC, QA, PT

• Determine sources and types of bias/error

• Learn to resolve difficult structural variants

• Improve reference genome assembly

• Optimization

– integration of data from multiple platforms

– sequencing and analysis

• Enable regulated applications

Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods

Measurement Process

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials will be developed to characterize performance of a part of the process

– materials will be certified for their variants against a reference sequence, with confidence estimates

(figure label: generic measurement process)

• NIST working with GIAB to select genomes

• Current plan

– NA12878 HapMap sample as Pilot sample

  • part of 17-member pedigree

– trios from PGP as more complete set

  • 2 trios, focus on children

  • varying biogeographic ancestry

CEPH Utah Pedigree 1463 (pedigree figure)

Grandparents: 12889, 12890, 12891, 12892

Parents: 12877, 12878

11 children, birth order redacted: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893

Putting “Genomes” in Bottles

Overview of NIST RM Development

Milestones for each genome are listed in chronological order; the timeline spans Q4 2014, Q1 2015, Q2 2015, Q3 2015, and Q4 2015.

HG-001/NA12878 (“Pilot” Genome)

– Release NIST RM8398; Preliminary large deletions

– Refined Structural Variants

HG-002 to HG-004 (Ashkenazim trio)

– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability

– Preliminary SNPs/indels; 120x-150x PacBio data; “moleculo”; mate-pair; CG-LFR

– Refined SNPs/indels; Preliminary SVs

– Refined Structural Variants

– NIST RMs 8391/8392 release

HG-005 (son in Asian trio)

– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability

– “moleculo”; mate-pair; CG-LFR

– Preliminary SNPs/indels

– Refined SNPs/indels; Refined Structural Variants

– NIST RM8393 release

Genome in a Bottle Working Groups

Reference Material Selection & Design

Andrew Grupe, Celera

•Develop prioritized list of whole human genomes for Reference Materials

•Identify candidate approaches and materials for artificial RMs

•Develop prioritized list

Measurements for Reference Material Characterization

Mike Eberle, Illumina

•Develop consensus plan for experimental characterization of Reference Materials

Bioinformatics, Data Integration, and Data Representation

Chunlin Xiao, NCBI

•Develop plan for integrating experimental data and forming consensus variant calls and confidence estimates

•Develop consensus plan for data representation

Performance Metrics & Figures of Merit

Deanna Church, Personalis

•User interface to the Genome-in-a-Bottle Reference Material

•“Dashboard”

•what an end user will see and report to understand and describe the performance of their experiment

•variant call accuracy

•process performance measures to enable optimization

Update

Zook et al., Nature Biotechnology, 2014.

• methods to develop SNP/indel call set described in manuscript

• broad and quick adoption of call set for benchmarking

– struck a nerve

Preliminary uses of high-confidence NIST-GIAB genotypes for NA12878

• NIST has released several versions of high-confidence genotypes for its pilot RM

• These data are presently being used for benchmarking

– prior to release of RMs

– SNPs & indels

  • ~77% of the genome
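As a rough illustration of this kind of benchmarking, the sketch below compares a query call set with a set of high-confidence genotypes, counting only sites that fall inside the high-confidence regions. The variant tuples, the region list, and the function names are hypothetical and are not taken from the NIST-GIAB files or tools. Real comparisons also have to reconcile different representations of the same variant, which this sketch ignores.

```python
from bisect import bisect_right

def build_index(regions):
    """Group half-open (chrom, start, end) regions by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in sorted(regions):
        index.setdefault(chrom, []).append((start, end))
    return index

def in_regions(index, chrom, pos):
    """True if 0-based position pos lies in a region (regions assumed non-overlapping)."""
    intervals = index.get(chrom, [])
    # Find the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

def benchmark(truth, query, regions):
    """Count exact matches of (chrom, pos, ref, alt, genotype) tuples inside regions."""
    index = build_index(regions)
    truth_hc = {v for v in truth if in_regions(index, v[0], v[1])}
    query_hc = {v for v in query if in_regions(index, v[0], v[1])}
    tp = len(truth_hc & query_hc)   # called and in the truth set
    fp = len(query_hc - truth_hc)   # called but not in the truth set
    fn = len(truth_hc - query_hc)   # in the truth set but not called
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
    }

# Toy data (made up for illustration only).
regions = [("chr1", 100, 200), ("chr1", 500, 600)]
truth = {("chr1", 150, "A", "G", "0/1"), ("chr1", 550, "C", "T", "1/1"),
         ("chr1", 300, "G", "A", "0/1")}
query = {("chr1", 150, "A", "G", "0/1"), ("chr1", 560, "T", "C", "0/1"),
         ("chr1", 300, "G", "A", "0/1")}
result = benchmark(truth, query, regions)
```

In the toy data, the site at position 300 lies outside the high-confidence regions, so it is not counted as a match or a miss. This is why the fraction of the genome covered by the high-confidence calls matters when interpreting the metrics.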

Highlights

This workshop

• Pilot genome release and use

• Coordinating analyses for PGP GIAB Trios

• Working groups

– Spike-in mutation interlab, FFPE

– FTP site, analysis coordination

– GA4GH

• GIAB papers

Future GIAB work

• Beyond support, improvement/development and maintenance of existing GIAB products…

– What future work should GIAB do that would uniquely take advantage of the momentum we’ve built?

Agenda

Thursday

• Breakfast

• Welcome and Status Update

• Using the Pilot RM

• Break

• Coordination of PGP analyses

• Lunch (provided)

• Working Group Breakout Discussions

• Break

• Discussion about Planned GIAB papers

• Informal discussions

• Reception

Friday

• Breakfast

• Working Group leaders present plans and discussion

• Break

• Future GIAB work

• Lunch (provided)

• Steering committee meeting

Agenda

Monday

• Breakfast and registration

• Welcome and Context Setting

• NIST RM Update and Status Report

• Charge to Working Groups

• Coffee Break

• Working Group Breakout Discussions

• Lunch (provided)

• Informal Working Group Reports

• Coffee Break

• Breakout Topical Discussions

– Topic #1: Moving beyond the 'easy' variants and regions of the genome

– Topic #2: Selecting future genomes for Reference Materials

Tuesday

• Breakfast and registration

• Use cases: Experiences using the pilot Reference Material

• Discussion of plans to release pilot Reference Material

• Coffee Break

• Working Group Breakout discussions

• Lunch (provided)

• Working Group leaders present plans and discussion

• Steering committee Overview

• First meeting of the Steering Committee (others adjourn)

Please Note

Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).

Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.

What’s the future of GIAB?

• What is GIAB uniquely positioned to do?

– how will we know when we’re done?

• If we do other stuff, are we the best cohort to do it?

• Other biogeographical ancestry groups?

• Cancer?

– spike-in controls

– whole-genomes

  • tumor/normal?

• Create list of mutations for spike-ins for germline

• Somatic genomes other than cancer

• Prenatal

• Forensics – decay of DNA

• Transcriptome?

• Epigenome?

• Interpretation standards?

– functional

– clinical

Others working in this space…

Well-characterized genomes

• Illumina Platinum Genomes

• CDC GeT-RM

• Korean Genome Project

• Human Longevity, Inc.

• Hydatidiform mole haploid cell line

• Genome Reference Consortium

Performance Metrics

• Global Alliance for Genomics and Health Benchmarking Team

• NCBI/CDC GeT-RM Browser

• GCAT website

Plan for analyses of new PGP RM Trio data

January 2015

Data Release Plans

Individual Datasets

• Uploaded to GIAB FTP site as it is collected

• May include raw reads, aligned reads, and variant/reference calls

Integrated High-confidence Calls

• First develop SNP, indel, and homozygous reference calls

• Then develop SV and non-SV calls

• Released calls are versioned

• Preliminary callsets will be made available to be critiqued

Pilot RM (NA12878)

• Developing reproducible methods for new integrated high-confidence SNPs/indels

• Illumina Platinum Genomes released phased pedigree calls in Dec 2014

– Blog will be posted

– also working on SVs

• Developing SV calls

– High-confidence deletions and pre-print will be released Feb 2015

• Planned release as NIST RM8398 in April 2015

Ashkenazim PGP trio

Short reads

• Completed

– 300x Illumina paired end on trio

– Complete Genomics

– Ion exome

• Scheduled

– Illumina mate-pair

– possibly SOLiD

Long reads

• Completed

– 20x/8x/8x PacBio

– BioNano Genomics

• Scheduled

– 60x/30x/30x PacBio (or more)

– custom moleculo

Ashkenazim Jewish PGP RM Trio

| Dataset | Characteristics | Coverage | Availability | Good for… |
|---|---|---|---|---|
| Illumina Paired-end | 150x150bp | ~300x/individual | Fastq on ftp | SNPs/indels/some SVs |
| Illumina Long Mate pair | ~6000 bp insert | ~40x/individual | Feb-Mar 2015 | SVs |
| Illumina “moleculo” | Custom library | ~30x by long fragments | Feb-Mar 2015 | SVs/phasing/assembly |
| Complete Genomics | | 100x/individual | On ftp | SNPs/indels/some SVs |
| Complete Genomics | LFR | ?? | | SNPs/indels/phasing |
| Ion Proton | Exome | 1000x/individual | On SRA | SNPs/indels in exome |
| BioNano Genomics | | | Feb 2015 | SVs/assembly |
| PacBio | ~10kb reads | ~120-150x on AJ trio | Finished ~Mar 2015 | SVs/phasing/assembly/STRs |

Asian PGP trio

• Similar sequencing to Ashkenazim trio except for PacBio

• Only son will be NIST RM

SNP/Indel Integration Method Update

• Implementing new integration methods on DNAnexus

– Easier for others to reproduce results

– Easier to apply same methods to new genomes

• First, analyzing NA12878 RM data with new methods to ensure they work well

• Then, apply to PGP trios

SV Integration Methods

(SVClassify workflow figure)

Inputs: reference genome and RepeatMasker data; aligned sequence data (BAM file); list of structural variants (BED file)

SVClassify: up to 180 annotations per SV

Up to 35 selected annotations per SV

Classification approaches shown: one-class methods, unsupervised clustering, support vector machine, L1 distance

Multidimensional scaling plot for visualizing the 8 clusters. We use a 3 dimensional representation of the data space which associates 3 MDS coordinates to each site, one for each dimension. This figure plots MDS-3 against MDS-1

Multi-dimensional scaling showing separation of 8 clusters

ROC curves for One-class Classification

One-class scores for Random Non-SVs and “Validated” Callsets

Number of sites from each candidate callset that have k=3 L1 Classification scores in each range, where the score is the proportion p of random sites that are closer to the center than each candidate site. These numbers are after filtering sites for which the flanking regions have low mapping quality or high coverage.

| Callset | <0.68 | 0.68-0.90 | 0.90-0.99 | >0.99 |
|---|---|---|---|---|
| Random | 2,599 | 773 | 279 | 17 |
| Personalis | 4 | 4 | 182 | 1,783 |
| 1000 Genomes | 38 | 65 | 557 | 1,493 |
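The score defined in the caption can be sketched in a few lines: for each candidate site, compute its L1 distance to the center of the random (non-SV) sites, then report the proportion of random sites that lie closer to that center. This is a minimal sketch under stated assumptions, not the SVClassify implementation. The function names are hypothetical, the center is assumed to be the per-annotation median of the random sites, annotations are assumed to be already scaled, and the meaning of "k=3" in the caption is not modeled.

```python
from bisect import bisect_left
from statistics import median

def l1_distance(a, b):
    """Sum of absolute coordinate differences between two annotation vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def one_class_scores(random_sites, candidate_sites):
    """Score each candidate by the proportion of random sites closer to the center."""
    # Assumption: the center is the per-annotation median of the random sites.
    center = [median(column) for column in zip(*random_sites)]
    random_dists = sorted(l1_distance(site, center) for site in random_sites)
    scores = []
    for site in candidate_sites:
        d = l1_distance(site, center)
        # bisect_left counts the random distances strictly smaller than d.
        closer = bisect_left(random_dists, d)
        scores.append(closer / len(random_dists))
    return scores

# Toy annotation vectors (made up for illustration only).
random_sites = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 2]]
candidates = [[10, 10], [1, 1], [1, 2.5]]
scores = one_class_scores(random_sites, candidates)
```

A score near 1 means the candidate looks unlike almost all random sites, which is the pattern the table shows for most Personalis and 1000 Genomes sites and for very few random sites.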

Parliament SV Integration Pipeline

From Baylor College of Medicine

(pipeline workflow figure)

Stages: Sample, Data, Discovery, Merge, Evaluate, Results

Sample: Personal Genome

Data sources: Illumina, PacBio, Nextera, aCGH, Irys

Discovery tools: Breakdancer, Delly, CNVnator, Pindel, Crest, SV-STAT, Tiresias, Honey

Other figure elements: Multi-Source Reduce & Cluster; Annotation Sources; Discordant Loci Database; Hybrid Assembly; SVatchra; PacBio Force Calling; Heuristics; Putative SVs (with Yes/No decision branches)

| Data | Type | Resolution |
|---|---|---|
| WGS Illumina HiSeq | NGS | 48X 100x100 bp paired-end |
| WGS Illumina Nextera | NGS | ~2X 100x100 bp mate-pair |
| WGS SOLiD | NGS | 3X 35 bp fragment; 10X 25x25 bp paired-end; 17X 50x50 bp paired-end |
| WGS PacBio | Long-Read | 10X ~10,000 bp |
| Agilent 1M | aCGH | 1-million-probe oligo array |
| NimbleGen 2.1M | aCGH | 2.1-million-probe oligo array |
| Custom Agilent Array | aCGH | 44,000 neuropathy-specific oligo array |
| BioNano Irys | Genome Mapping | Whole genome architecture |
| Sanger-Validated Deletions | Manual | 42 fully resolved deletions |

| Program | Method |
|---|---|
| BreakDancer | Paired End |
| Crest | Split Read |
| Pindel | Paired End |
| Delly | Paired End / Split Read |
| CNVnator | Read Depth |
| Tiresias | Consensus Sequence |
| SV-STAT | Split Read |
| SVatchra | Paired-End |
| PBHoney | Errors, Tail Mapping, Assembly |

pb-jelly.sourceforge.net

Potential SV Integration Approach

• GIAB members generate candidate SV calls

• Use SVClassify and Parliament to classify candidate calls as likely TPs or FPs
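Before candidate calls from several groups can be classified, calls that describe the same event usually need to be pooled. One simple, commonly used way to do that for deletions is to cluster calls by reciprocal overlap. The sketch below is only an illustration of that idea: it is not how SVClassify or Parliament merges calls, and the 50% threshold, the call tuples, and the function names are assumptions.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for half-open intervals a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def merge_candidates(calls, min_ro=0.5):
    """Greedily cluster (chrom, start, end, source) deletion calls.

    Simplification: each call is compared only with the representative
    (first) call of the most recent cluster on the same chromosome.
    """
    clusters = []
    for chrom, start, end, source in sorted(calls):
        last = clusters[-1] if clusters else None
        if (last is not None and last["chrom"] == chrom and
                reciprocal_overlap((last["start"], last["end"]),
                                   (start, end)) >= min_ro):
            last["sources"].add(source)  # same event seen by another group
        else:
            clusters.append({"chrom": chrom, "start": start, "end": end,
                             "sources": {source}})
    return clusters

# Toy candidate calls (made up for illustration only).
calls = [
    ("chr1", 1000, 2000, "groupA"),
    ("chr1", 1100, 2050, "groupB"),
    ("chr1", 5000, 5300, "groupA"),
    ("chr2", 1000, 2000, "groupC"),
]
clusters = merge_candidates(calls)
```

Each resulting cluster records which groups supported it. That support could serve as one input, alongside the annotation-based scores, when deciding whether a candidate is a likely TP or FP.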

Analysis Coordinator(s)

• “Face” of the group

• Maintain table of groups doing different types of analyses

• Recruit groups to do missing analyses

• Make workplan and timeline

• Follow-up with analysis groups

• Coordinate comparisons and integration of analyses

• Coordinate writing of papers

Ashkenazim Jewish PGP RM Trio

| Dataset | Characteristics | Coverage | Availability | Most useful for… |
|---|---|---|---|---|
| Illumina Paired-end | 150x150bp | ~300x/individual | Fastq on ftp | SNPs/indels/some SVs |
| Illumina Long Mate pair | ~6000 bp insert | ~40x/individual | Feb-Mar 2015 | SVs |
| Illumina “moleculo” | Custom library | ~30x by long fragments | Feb-Mar 2015 | SVs/phasing/assembly |
| Complete Genomics | | 100x/individual | On ftp | SNPs/indels/some SVs |
| Complete Genomics | LFR | ?? | | SNPs/indels/phasing |
| Ion Proton | Exome | 1000x/individual | On SRA | SNPs/indels in exome |
| BioNano Genomics | Long optical map reads | | Feb 2015 | SVs/assembly |
| PacBio | ~10kb reads | ~120-150x on AJ trio | Finished ~Mar 2015 | SVs/phasing/assembly/STRs |
