© 2013 Real Time Genomics, Inc.
NA12878 Trio/Pedigree Analysis
Francisco M. De La Vega, D.Sc. VP Genome Science
Leveraging trio information
• GiaB has selected reference materials in the form of father, mother, offspring trios
• The goal was to leverage Mendelian inheritance patterns to:
  – Identify variant genotype errors that are inconsistent with Mendelian inheritance
  – Remove these errors from the reference baseline calls
• However, if variant identification methods do not use pedigree information directly and jointly analyze the trio alignments, an opportunity to improve the genotype calls is missed
• We focused on using the RTG Family caller to better leverage the shared information in the trios and improve the call set, while reducing Mendelian-inconsistent genotype errors
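To make the idea of a Mendelian-inconsistent genotype concrete, here is a minimal sketch of a per-site check. It assumes a diploid autosomal site and genotypes given as unordered allele pairs; the function name and the toy sites are hypothetical and are not taken from the RTG tools.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """Return True if the child's unordered genotype can be formed by taking
    one allele from the father and one allele from the mother."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((f_allele, m_allele))) == child_sorted
        for f_allele, m_allele in product(father, mother)
    )


# Toy genotype calls (father, mother, child); illustrative values only.
trio_calls = {
    "site_1": (("A", "A"), ("A", "C"), ("A", "C")),  # child can inherit A and C
    "site_2": (("C", "C"), ("C", "C"), ("A", "C")),  # allele A is in neither parent
}

inconsistent_sites = [
    site
    for site, (father, mother, child) in trio_calls.items()
    if not is_mendelian_consistent(father, mother, child)
]
print(inconsistent_sites)
```

A check like this only flags an inconsistency after the fact; it does not say which of the three genotypes is wrong, which is the motivation for calling the trio jointly.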
Variant calling can be improved by jointly analyzing related samples
[Figure: aligned reads with a genotype call for each sample; annotation: Shared haplotypes]
Variant calling can be improved by jointly analyzing related samples
[Figure: aligned reads with a genotype call for each sample; annotations: Mendelian variant segregation, Shared haplotypes]
Mendelian inconsistency
[Figure: aligned reads with a genotype call for each sample; one genotype is flagged as low QV]
Joint trio analysis corrects Mendelian errors
[Figure: aligned reads with a genotype call for each sample; one genotype is flagged as good QV]
NA12878 calls from trio calling
• Comparing offspring variants from singleton vs. pedigree calling
  – Both show good quality metrics
• Using family information, more good calls can be made and dubious calls are downgraded
Variant Statistics

NA12878 call set   SNVs        Indels    MNPs     SNV Het/Hom   Ti/Tv   % dbSNP (r129)
RTG single         3,329,797   558,242   31,070   1.55          2.11    90.8%
RTG trio           3,363,619   595,030   33,686   1.57          2.11    90.4%
GATK/VQSR          3,263,289   610,837   N/A      1.51          2.09    91.7%

Data: WGS 2x100bp >50X Illumina Platinum Genomes data (ENA Acc. No. ERP001960). RTG AVR score cut-off 0.15; GATK v1.7 & BWA 0.6.1.
[Figure: Family vs. Singleton call sets; values shown: 142,848; 68,000; 3,849,457]
[Figure: pedigree of NA12878 with parents NA12891 and NA12892]
NA12878 vs reference datasets
NA12878 call set   1kP OMNI Poly (TP%)   1kP OMNI Mono (FP%)   Get-RM¶ (TP%)   GiaB (TP%)   GiaB-BED (TP%)
RTG single         97.5%                 0.10%                 97.4%           N/A          N/A
RTG trio           97.5%                 0.24%                 97.0%           90.5%        94.1%
GATK/VQSR          97.8%                 0.17%                 87.8%           88.4%        92.5%

§ Relative to dbSNP 137; statistics for SNVs only. ¶ Get-RM consistent high-quality variants; n=498
[Figure: pedigree of NA12878 with parents NA12891 and NA12892]
– 1000 Genomes Illumina OMNI SNP array
  • Polymorphic sites – TP proxy
  • Monomorphic sites – FP proxy
– Get-RM high confidence call set
– GiaB high confidence calls in BED region
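The two array-based proxies can be sketched as simple set overlaps. This is an illustration under stated assumptions, not the evaluation actually used for the slide: sites are keyed by (chromosome, position), "polymorphic" is taken to mean array sites where a variant is expected in the sample, "monomorphic" to mean array sites where no variant is expected, and the function name and toy coordinates are hypothetical.

```python
def array_proxy_rates(called_sites, polymorphic_sites, monomorphic_sites):
    """Return (tp_pct, fp_pct).

    tp_pct: share of polymorphic array sites that the call set reports as variant.
    fp_pct: share of monomorphic array sites that the call set reports as variant.
    """
    tp_pct = 100.0 * len(called_sites & polymorphic_sites) / len(polymorphic_sites)
    fp_pct = 100.0 * len(called_sites & monomorphic_sites) / len(monomorphic_sites)
    return tp_pct, fp_pct


# Toy data: (chromosome, position) keys, illustrative only.
called = {("1", 100), ("1", 200), ("1", 300)}
polymorphic = {("1", 100), ("1", 200), ("1", 400), ("1", 500)}
monomorphic = {("1", 300), ("1", 600)}

tp_pct, fp_pct = array_proxy_rates(called, polymorphic, monomorphic)
print(f"TP proxy: {tp_pct:.1f}%  FP proxy: {fp_pct:.1f}%")
```

A real comparison would also have to decide whether a site counts as matched only when the genotype agrees, which this sketch ignores.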
ROC Trio calls vs. GiaB baseline (BED)
RTG snpsimeval tool; SNV/indel/MNP; zygosity match
ROC Trio calls vs. GiaB baseline
RTG snpsimeval tool; SNV/indel/MNP; zygosity match
ROC Trio calls vs. CGI baseline
RTG snpsimeval tool; SNV/indel/MNP; zygosity match
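For readers unfamiliar with how ROC curves of this kind are built, the following is a generic sketch and not the snpsimeval implementation. It assumes each call carries a quality score and a flag saying whether it matched the baseline (which, for a zygosity match, would require the genotype to agree as well); the function name and toy calls are hypothetical.

```python
def roc_points(calls):
    """calls: iterable of (score, matches_baseline) pairs.

    Returns a list of (threshold, false_positives, true_positives), one entry per
    distinct score, counting every call with score >= threshold.
    """
    ordered = sorted(calls, key=lambda call: call[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Calls that share a score are added together so ties give one point.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((threshold, fp, tp))
    return points


toy_calls = [(0.9, True), (0.8, True), (0.8, False), (0.4, True), (0.1, False)]
for threshold, fp, tp in roc_points(toy_calls):
    print(threshold, fp, tp)
```

Plotting true positives against false positives from these points gives the curve, and each point corresponds to one candidate score cut-off.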
Mendelian inconsistency errors
The RTG family caller reduces Mendelian inheritance errors (MIE) more than 60-fold vs. RTG singleton calling (more than 70-fold vs. GATK/VQSR)
[Figure: MIE counts on a log scale. RTG single: 335,625; RTG trio: 4,870; GATK/VQSR: 351,904]
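The fold reductions quoted on this slide can be checked directly from the three MIE counts. The snippet below is only that arithmetic; the variable names are illustrative.

```python
# MIE counts as shown on the slide.
mie_counts = {"RTG single": 335_625, "RTG trio": 4_870, "GATK/VQSR": 351_904}

# Ratio of each other call set's MIE count to the RTG trio count.
fold_vs_trio = {
    name: count / mie_counts["RTG trio"]
    for name, count in mie_counts.items()
    if name != "RTG trio"
}

for name, fold in fold_vs_trio.items():
    print(f"{name}: {fold:.1f}x the MIE count of RTG trio")
```

The ratios come out at roughly 68.9 and 72.3, in line with the "over 60X" and "over 70X" statements.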
Pattern #1: Heterozygous variant
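Trio Calling
[Figure: pedigree with genotype labels. NA12878 (parents NA12892, NA12891) and NA12877 (parents NA12889, NA12890), with 11 offspring: NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, NA12893. Genotype labels as extracted: 0/1; 0/1, 0/0; offspring row: 0/0, 0/0, 0/0, 0/0, 0/0, 0/1, 0/1, 0/1, 0/1, 0/1]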
Segregation of heterozygous variants
[Figure: four panels (SNV, MNP, indel, All Variants) showing variant count vs. number of offspring segregating (1 to 11)]
Segregation of NA12878 heterozygous variants called as family, GQ>50, homozygous reference in other parent.
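The quantity plotted in these panels, the number of offspring in which a variant segregates, can be tallied as in the sketch below. It assumes VCF-style genotype strings for the offspring at each site; the helper names and toy sites are hypothetical, and the GQ and parental-genotype filters from the caption are not applied here.

```python
from collections import Counter


def carries_alt(genotype):
    """True if a VCF-style genotype string such as '0/1' has a non-reference allele."""
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


def segregation_histogram(sites):
    """sites: iterable of per-site lists of offspring genotypes.

    Returns a Counter mapping 'number of offspring carrying the variant'
    to 'number of sites with that count'.
    """
    return Counter(
        sum(carries_alt(genotype) for genotype in offspring) for offspring in sites
    )


# Toy example with three offspring per site.
toy_sites = [
    ["0/0", "0/1", "0/1"],
    ["0/1", "0/1", "0/1"],
    ["0/0", "0/0", "0/1"],
    ["0/1", "0/0", "0/1"],
]
print(sorted(segregation_histogram(toy_sites).items()))
```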
Pattern #2: Homozygous-alt variant
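Trio Calling
[Figure: pedigree with genotype labels. NA12878 (parents NA12892, NA12891) and NA12877 (parents NA12889, NA12890), with 11 offspring: NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, NA12893. Genotype labels as extracted: 0/1; 1/1, 0/0; offspring row: 0/1 repeated 10 times]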
Segregation of homo-alt variants
[Figure: four panels (SNV, MNP, indel, All Variants) showing variant count vs. number of offspring segregating (1 to 11)]
Segregation of NA12878 homozygous alternative variants called as family, GQ>50, homozygous reference in other parent.
False positive estimate by segregation

GT Type                 All variants   SNV      MNP     indel
Het        TP (10-11)   123672         110262   693     12717
Het        FP (1-8)     1901           1000     47      854
Het        FP%          1.40%          0.88%    1.42%   5.67%
Homo-alt   TP (2-10)    373260         329642   2258    41360
Homo-alt   FP (1,11)    4457           3672     36      749
Homo-alt   FP%          1.18%          1.10%    1.57%   1.78%
Overall    TP           496932         439904   2951    54077
Overall    FP           6358           4672     83      1603
Overall    FP%          1.26%          1.05%    2.74%   2.88%
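The slide does not state how FP% is defined. A small recomputation, assuming FP% = FP / (TP + FP), reproduces the Homo-alt and Overall percentages from the printed counts, and the Overall counts are the column sums of the two groups. Under the same assumption the group printed as Het recomputes to about 1.51%, 0.90%, 6.35% and 6.29%, which differs from the printed 1.40%, 0.88%, 1.42% and 5.67%, so that group may use a different denominator or have been transcribed differently. The dictionary layout and function name below are illustrative.

```python
columns = ["All variants", "SNV", "MNP", "indel"]

# Counts copied from the table, in the column order above.
counts = {
    "Het": {"TP": [123672, 110262, 693, 12717], "FP": [1901, 1000, 47, 854]},
    "Homo-alt": {"TP": [373260, 329642, 2258, 41360], "FP": [4457, 3672, 36, 749]},
}


def fp_percent(tp, fp):
    """Assumed definition: false positives as a percentage of all calls."""
    return 100.0 * fp / (tp + fp)


# Overall counts are the per-column sums of the two genotype groups.
counts["Overall"] = {
    kind: [het + hom for het, hom in zip(counts["Het"][kind], counts["Homo-alt"][kind])]
    for kind in ("TP", "FP")
}

recomputed = {
    group: [round(fp_percent(tp, fp), 2) for tp, fp in zip(rows["TP"], rows["FP"])]
    for group, rows in counts.items()
}

for group, values in recomputed.items():
    print(group, dict(zip(columns, values)))
```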
Data imputation by pedigree caller
• For genomes with no data, use population priors
  – With care, can iterate over the offspring and then over each of the parents independently
  – Avoids exponential explosion, so the whole extended family can be done in one calling step
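As a rough illustration of imputing a family member with no data from a population prior, the sketch below computes a posterior for one unsequenced parent at a biallelic site. It is not the RTG pedigree caller: it assumes a Hardy-Weinberg prior from an alternate allele frequency, treats the other parent's and the offspring's genotypes as known exactly, and codes genotypes as alternate allele counts (0, 1, 2). All function names are hypothetical.

```python
def transmit_prob(parent_gt):
    """Probability that a parent with 0, 1 or 2 alt alleles passes on the alt allele."""
    return parent_gt / 2.0


def child_prob(child_gt, parent_a_gt, parent_b_gt):
    """P(child genotype | both parental genotypes) under Mendelian inheritance."""
    pa = transmit_prob(parent_a_gt)
    pb = transmit_prob(parent_b_gt)
    return {
        0: (1 - pa) * (1 - pb),
        1: pa * (1 - pb) + (1 - pa) * pb,
        2: pa * pb,
    }[child_gt]


def hwe_prior(alt_freq):
    """Hardy-Weinberg genotype prior for a given alternate allele frequency."""
    ref_freq = 1.0 - alt_freq
    return {0: ref_freq ** 2, 1: 2 * ref_freq * alt_freq, 2: alt_freq ** 2}


def impute_parent(known_parent_gt, offspring_gts, alt_freq):
    """Posterior over the unsequenced parent's genotype.

    The loop over offspring is a product of per-child terms, so the cost grows
    linearly with the number of offspring.
    """
    unnormalised = {}
    for gt, prior in hwe_prior(alt_freq).items():
        likelihood = 1.0
        for child_gt in offspring_gts:
            likelihood *= child_prob(child_gt, known_parent_gt, gt)
        unnormalised[gt] = prior * likelihood
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("offspring genotypes are impossible for this known parent")
    return {gt: value / total for gt, value in unnormalised.items()}


# Known parent is homozygous reference; offspring are a mix of 0/1 and 0/0.
posterior = impute_parent(known_parent_gt=0, offspring_gts=[1, 0, 1, 0], alt_freq=0.1)
print(posterior)
```

In this toy case only a heterozygous missing parent can explain both the carrier and non-carrier offspring, so the posterior concentrates on genotype 1 regardless of the prior.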
Imputation of family members with no data
[Figure: simulated data; True Positives vs. False Positives for 1 offspring, 2 offspring, 4 offspring, and 4 offspring + father]
ROC vs NA12878 imputed baseline
RTG snpsimeval tool; SNV/indel/MNP; zygosity match
de novo mutation identification
de novo Mutation Accuracy (NA12878)

Call set          de novo candidates   de novo germline*   de novo somatic*   TP/FP
Singleton calls   16,902               49 (100%)           941 (99%)          1:17
Trio calls        2,205                49 (100%)           941 (99%)          1:2.2

* Sensitivity vs. Conrad et al. (2011) validated dataset of germline and somatic cell line de novo mutations.
– Uses the parental genomes to identify & score de novo mutations in offspring
– Greater than 7X improvement in precision to find de novo mutations vs. naïve methods
[Figure: pedigree of NA12878 with parents NA12891 and NA12892]
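A naive de novo filter of the kind the trio caller is compared against can be sketched as "the child has an allele seen in neither parent". This is an illustrative baseline only, with a hypothetical function name, and it does not score candidates the way a joint caller would. The second part checks the "greater than 7X" statement from the table: both rows report the same validated germline and somatic mutations, so the precision ratio equals the ratio of candidate counts.

```python
def is_de_novo_candidate(father, mother, child):
    """True if the child carries an allele that is present in neither parent."""
    parental_alleles = set(father) | set(mother)
    return any(allele not in parental_alleles for allele in child)


# Toy genotypes as allele pairs; illustrative only.
print(is_de_novo_candidate(("A", "A"), ("A", "A"), ("A", "C")))  # new allele C
print(is_de_novo_candidate(("A", "C"), ("A", "A"), ("A", "C")))  # C inherited

# Candidate counts from the table above.
singleton_candidates = 16_902
trio_candidates = 2_205
candidate_ratio = singleton_candidates / trio_candidates
print(f"Singleton calling yields {candidate_ratio:.2f}x as many candidates")
```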
Status
• Working through the complete trio datasets to produce joint pedigree calls for the NA12878 trio
  – Aiming for a trio call set and another that includes full Platinum pedigree data
  – There is disproportionately more data for NA12878 than for her parents or offspring
• Comprehensive segregation analysis that includes all Mendelian patterns
• Phasing analysis to identify variants that are inconsistent with transmitted phases
Issues
• How to integrate pedigree calls with other data?
  – Variants that segregate appropriately are candidates for inclusion in the baseline
  – Variants that don't segregate appropriately are candidates for removal from the baseline
  – Improvement of baseline genotypes using pedigree-based genotypes
• Use of the imputed NA12878 baseline
• Creation of a more inclusive baseline for ROC curves to compare new methods and select thresholds
Acknowledgements
• RTG team at Hamilton, New Zealand
  – Led by John Cleary, CTO
• RTG team at San Bruno, CA
  – Sahar Malakshah
  – Minita Shah
  – Brian Hilbush
• Michael Eberle, Illumina, Inc. – Platinum Data
• Justin Zook, NIST
• 1000 Genomes Project
© 2013 Real Time Genomics, Inc. All rights reserved. US Patent 7,640,256. Other patents pending. For research use only. Not for diagnostic applications.