140127 rtg phased pedigree analyses


Development & applications of a segregation-phasing ground truth

Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine

In collaboration with Real Time Genomics, Inc.

Genome-in-a-Bottle Workshop

Evaluating Variant Calls

O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 5, 28 (2013).

Beyond Venn Diagrams

• Experimental validation (e.g. Sanger, qPCR)
  - Expensive
  - Limited by platform success
  - Statistical sample
• Reference orthogonal data available for some genomes
  - SNP array data: sparse
  - Fosmid sequencing data: incomplete
• Reference genomes sequenced by multiple platforms
  - Arbitration methods (e.g. NIST, Genome-in-a-Bottle)
  - Low FP, but unknown FN (genome-wide)
  - Biases?

Mendelian segregation as “ground truth”

CEPH/Utah Pedigree 1463

• Sequenced by CGI and Illumina (Platinum Genomes)
• Started with 2x100bp 50X WGS Illumina Platinum data
• Aligned & variant called with rtgVariant 1.1, filtered by quality score (AVR≥0.15)

across the samples, excluding problematic sites

Example: Heterozygous variant segregation

Segregation of heterozygous variants to offspring

[Figure: four histograms (panels: SNV, MNP, indel, All Variants). x-axis: # of offspring segregating (1 to 11); y-axis: variant count for each panel.]
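The quantity on the x-axis of these histograms can be sketched in a few lines. This is a minimal illustration, not the presenter's code: it assumes unphased genotype strings such as "0/1", and the function names and toy data are hypothetical.

```python
from collections import Counter

def offspring_carrying(allele, child_genotypes):
    """Count the children whose genotype contains the given allele."""
    return sum(allele in g.split("/") for g in child_genotypes)

def segregation_histogram(variants, allele="1"):
    """Histogram: number of offspring carrying the allele -> number of variants.

    `variants` is a list of per-variant child genotype lists (hypothetical layout).
    """
    return Counter(offspring_carrying(allele, children) for children in variants)

# Toy data: three variants genotyped in three children.
toy_variants = [
    ["0/1", "0/0", "0/1"],
    ["0/0", "0/0", "0/1"],
    ["0/1", "0/1", "0/0"],
]
histogram = segregation_histogram(toy_variants)
print(sorted(histogram.items()))  # [(1, 1), (2, 2)]
```

For a true heterozygous variant in one parent, each child inherits the allele with probability 1/2, so the histogram across many variants should look roughly binomial; a distribution far from that shape hints at calling errors.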

Steps for haplotype phasing in large family

Check calls vs haplotype framework

Connect haplotype islands

Phase contiguity extension

Identify crossovers

Phasing labels given parent and child genotypes

Father (fa/fb)   Mother (ma/mb)   Child genotype   Phasing label(s)
0/0              0/1              0/0              fa/ma, fb/ma
                                  0/1              fa/mb, fb/mb
0/1              0/1              0/0              fa/ma
                                  0/1              fb/ma, fa/mb
                                  1/1              fb/mb
0/0              1/2              0/1              fa/ma, fb/ma
                                  0/2              fa/mb, fb/mb
0/1              1/2              0/1              fa/ma
                                  0/2              fa/mb
                                  1/1              fb/ma
                                  1/2              fb/mb
0/1              2/3              0/2              fa/ma
                                  0/3              fa/mb
                                  1/2              fb/ma
                                  1/3              fb/mb
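The label assignments in this table follow mechanically from the parental genotypes. As a minimal sketch (the function name is hypothetical, and it assumes each parental genotype string is written in fa/fb or ma/mb order), every child genotype can be mapped to the haplotype pairs that produce it:

```python
from collections import defaultdict

def phasing_labels(father, mother):
    """Map each possible child genotype to the parental haplotype pairs giving it.

    Assumes `father` is written as fa/fb and `mother` as ma/mb, e.g. "0/1".
    """
    labels = defaultdict(list)
    for f_name, f_allele in zip(("fa", "fb"), father.split("/")):
        for m_name, m_allele in zip(("ma", "mb"), mother.split("/")):
            # Child genotypes are unphased, so report alleles in sorted order.
            low, high = sorted((int(f_allele), int(m_allele)))
            labels[f"{low}/{high}"].append(f"{f_name}/{m_name}")
    return dict(labels)

print(phasing_labels("0/1", "0/1"))
# {'0/0': ['fa/ma'], '0/1': ['fa/mb', 'fb/ma'], '1/1': ['fb/mb']}
```

Genotypes with more than one label (such as 0/1 from two heterozygous parents) are ambiguous on their own; they only become informative once neighbouring variants fix the haplotype framework.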

Identification of recombination crossovers

Chr 1, Mother

Chr 6, Mother
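One way to picture crossover identification: walk along a chromosome and record where the parental haplotype transmitted to a given child switches. The sketch below is an illustration under assumptions, not the method used in the talk. It assumes each informative site already carries a label ("ma" or "mb") for the maternal haplotype the child inherited, with None at uninformative sites; the function name and data are hypothetical.

```python
def find_crossovers(positions, labels):
    """Return (left, right) position intervals in which the inherited haplotype switches.

    `labels[i]` is the parental haplotype inherited at `positions[i]`,
    or None where the site is uninformative (assumed representation).
    """
    crossovers = []
    prev_pos, prev_label = None, None
    for pos, label in zip(positions, labels):
        if label is None:
            continue  # skip uninformative sites
        if prev_label is not None and label != prev_label:
            crossovers.append((prev_pos, pos))
        prev_pos, prev_label = pos, label
    return crossovers

positions = [100, 200, 300, 400, 500]
labels = ["ma", "ma", None, "mb", "mb"]
print(find_crossovers(positions, labels))  # [(200, 400)]
```

A crossover is only localized to the interval between the two flanking informative sites, which is why dense, accurate calls narrow the breakpoints.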

Recombination crossovers statistics

[Figure: bar chart of recombination crossover counts per chromosome (1 to 22), shown separately for Father and Mother; y-axis from 0 to 45.]

Total: 686

Linking of phased regions

Chr 1, Mother

Chr 6, Mother

Testing for Phase Consistency

                 Father   Mother   Offspring 1   Offspring 2   Offspring 3   Offspring 4
Phasing labels   fa fb    ma mb    fa ma         fa mb         fb ma         fb mb

Genotypes        0/1      0/1      0/1           0/0           1/1           0/1
Phasings         0 1      0 1      0 0           0 1           1 0           1 1
                 0 1      1 0      0 1           0 0           1 1           1 0
                 1 0      0 1      1 0           1 1           0 0           0 1
                 1 0      1 0      1 1           1 0           0 1           0 0

Genotypes        0/0      0/1      0/0           0/1           0/0           0/1
Phasings         0 0      0 1      0 0           0 1           0 0           0 1
                 0 0      1 0      0 1           0 0           0 1           0 0

Example with 4 offspring
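The check in this example can be sketched as a brute-force search: try every orientation of the two parental genotypes, derive each child's alleles from its phasing labels, and keep the orientations that reproduce all observed child genotypes. This is a minimal sketch with hypothetical names, not the RTG implementation.

```python
from itertools import product

def orientations(genotype):
    """All ordered (haplotype a, haplotype b) assignments of an unphased genotype."""
    a, b = genotype.split("/")
    return sorted({(a, b), (b, a)})

def consistent_phasings(father, mother, children, labels):
    """Return the parental phasings that reproduce every observed child genotype.

    `labels[i]` is the (paternal, maternal) haplotype pair inherited by child i,
    e.g. ("fa", "mb"), as fixed by the haplotype framework.
    """
    hits = []
    for (fa, fb), (ma, mb) in product(orientations(father), orientations(mother)):
        hap = {"fa": fa, "fb": fb, "ma": ma, "mb": mb}
        if all(sorted((hap[p], hap[m])) == sorted(child.split("/"))
               for child, (p, m) in zip(children, labels)):
            hits.append(hap)
    return hits

labels = [("fa", "ma"), ("fa", "mb"), ("fb", "ma"), ("fb", "mb")]
print(consistent_phasings("0/1", "0/1", ["0/1", "0/0", "1/1", "0/1"], labels))
# [{'fa': '0', 'fb': '1', 'ma': '1', 'mb': '0'}]
```

A call set is phase-consistent when at least one phasing survives; an empty result flags a site where at least one genotype call conflicts with the inheritance pattern.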

Probability of a set of genotypes being phase-consistent by chance

Given d different genotypes across the parents and children, with genotype i occurring n_i times, the probability that the set is phase-consistent by chance is given by the formula in:

Cleary, J. G., et al. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data. bioRxiv (2014). doi:10.1101/001958

Probability of a set of genotypes being phase-consistent by chance – some examples

Genotype counts
0/0   0/1   1/1   0/2   1/2   Probability
-     -     13    -     -     1
-     13    -     -     -     3.01 x 10^-1
6     7     -     -     -     1.01 x 10^-2
1     12    -     -     -     1.11 x 10^-1
1     11    1     -     -     1.36 x 10^-2
4     4     5     -     -     5.53 x 10^-4
-     3     3     3     4     6.13 x 10^-5
-     1     3     3     12    3.68 x 10^-1
1     5     -     6     1     2.75 x 10^-4
1     11    -     13    1     7.46 x 10^-2

Phasing consistent variants

Call set                       Raw                     AVR > 0.15
                               n            %          n            %
Phase consistent               5,224,138    77.35      4,606,574    99.28
Phase inconsistent             1,329,189    19.68      13,951       0.30
Repaired                       200,450      2.96       19,197       0.41
Calls inside phased segments   6,753,777    99.99      4,639,722    99.99

Illumina 2x100 bp 50X WGS Data, RTG Trio Calls

Y-chromosome excluded

Phasing consistent variants

Call set                       Raw                     VQSR 1st tranche
                               n            %          n            %
Phase consistent               6,941,213    68.34      5,863,035    96.00
Phase inconsistent             2,263,975    22.29      184,169      3.01
Repaired                       951,682      9.36       59,592       0.97
Calls inside phased segments   10,156,870   99.53      6,106,796    99.98

Illumina 2x100 bp 50X WGS Data, BWA/GATK UG v1.7 Calls

Y-chromosome excluded

ROC curve: NA12878 vs Phased-Consistent

RTG sorted by AVR; GATK sorted by VQSLOD (1st tranche)
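An ROC curve of this kind can be built by sorting calls from highest to lowest score and accumulating true and false positives. The sketch below assumes each call is a (score, is_true) pair, where is_true would mean the call is phase-consistent; it illustrates the general technique and is not the vcfeval implementation. The function name and scores are hypothetical.

```python
def roc_points(calls):
    """Cumulative (false positives, true positives) as the score threshold is lowered.

    `calls` is an iterable of (score, is_true) pairs (assumed representation).
    """
    points = []
    tp = fp = 0
    for _score, is_true in sorted(calls, key=lambda c: c[0], reverse=True):
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points

toy_calls = [(0.9, True), (0.3, True), (0.8, True), (0.4, False)]
print(roc_points(toy_calls))  # [(0, 1), (0, 2), (1, 2), (1, 3)]
```

Plotting true positives against false positives for two callers, each sorted by its own score (AVR or VQSLOD), puts them on a common footing even though the scores themselves are not comparable.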

NIST GiaB arbitration vs Phase-Consistent

Confident regions

Genome-wide

Assessment of score recalibration models

rtgVariant v 1.1; NA12878


Assessment of MNP & indel calling (rtgVariant 1.0)

• In rtgVariant 1.0, longer insertions have a higher FP rate than small insertions and deletions.

• More FPs in MNPs

• Improvements in aligner for v1.2

[Figure: percentage of phase-inconsistent calls, with panels for deletions, insertions, and SNV/MNPs; axis tick at 0.5%.]

rtgVariant v 1.0; NA12878

Summary & Perspectives

• Genetic segregation in a large family offers a unique opportunity to identify “true” sets of variants

• Requires collecting data for the whole family as new chemistries and platforms become available (e.g. 2x250bp, Moleculo reads)

• Data from multiple platforms can be merged to create a comprehensive phase-consistent ground truth

• Allows rational assessment of variant pipelines and improvement of algorithms

• Some issues that need to be dealt with: cell line artifacts, CNVs, systematic errors, SVs.

rtgTools v1.0

A toolkit to compare and analyze VCFs

• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files
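To make the idea behind a Mendelian inheritance check concrete, here is a minimal sketch for diploid trio genotypes. It illustrates the concept only and is not the rtgTools code; the function names are hypothetical, and it ignores missing genotypes, sex chromosomes, and de novo mutations.

```python
def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can be formed from one allele of each parent."""
    child_alleles = sorted(child.split("/"))
    return any(
        sorted((f_allele, m_allele)) == child_alleles
        for f_allele in father.split("/")
        for m_allele in mother.split("/")
    )

def count_mendelian_errors(trios):
    """Count (father, mother, child) genotype triples that violate Mendelian inheritance."""
    return sum(not is_mendelian_consistent(*trio) for trio in trios)

toy_trios = [
    ("0/0", "0/1", "0/1"),  # consistent
    ("0/0", "0/0", "0/1"),  # error: allele 1 is absent from both parents
    ("0/1", "0/1", "1/1"),  # consistent
]
print(count_mendelian_errors(toy_trios))  # 1
```

A trio check like this catches fewer errors than phase consistency across a large family: a miscalled 0/1 child of two heterozygous parents passes here, but can still conflict with the haplotypes that child is known to have inherited.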

Java compiled code freely available at GiaB repository:

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/

http://biorxiv.org/content/early/2014/01/24/001958

Acknowledgements

• RTG, Hamilton, New Zealand: John Cleary, Ross Braithwaite, Len Trigg
• RTG, San Bruno, CA: Sahar Malakshah, Minita Shah
• Michael Eberle, Illumina, Inc. – Platinum Project data
• Complete Genomics, Inc. – CEPH pedigree data
• Justin Zook – NIST

Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)

This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA.

© 2014 Real Time Genomics, Inc. All rights reserved.