Genotype Refinement Workflow
Using additional data to improve genotype calls and likelihoods
You are here in the GATK Best Practices workflow for germline variant discovery
[Figure: GATK Best Practices workflow in three phases. Data Pre-processing: Raw Reads, Map to Reference (BWA mem), Mark Duplicates & Sort (Picard), Indel Realignment, Base Recalibration, Analysis-Ready Reads. Variant Discovery: Var. Calling (HC in ERC mode), Genotype Likelihoods, Joint Genotyping, Raw Variants (SNPs, Indels), Variant Recalibration (separately per variant type), Analysis-Ready Variants (SNPs & Indels). Callset Refinement: Genotype Refinement, Variant Annotation, Variant Evaluation (look good? use in project, otherwise troubleshoot). Legend: Non-GATK.]
Why care about genotypes?
• Medical geneticists need genotypes for patients
– Do any patients have two copies of a LOF mutation?
– Are the parents of a diseased child likely to have more afflicted children?
• Population geneticists need genotypes for association studies
– How does the number of copies of an allele affect the phenotype?
Variant call vs. Genotype call
• Variant call = there is variation at this site in one or more samples
• Genotype call = these are the haplotypes present in this sample
Genotype call quality is important!
• Some sites/samples have poor genotype calls
– Can be ambiguous due to low confidence
– Might be entirely wrong!
• Can additional (independent) data improve genotype calls?
– Use high quality data (like 1000G) as priors
– Use pedigree (if available)
– Calculate posterior genotype probabilities
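Review of Bayes's Rule
Given that your coworker just walked in with an umbrella, what is the probability that it is raining?
• Observation = umbrella
• Θ = probability of rain
\[ P(\Theta \mid \text{Obs}) = \frac{P(\text{Obs} \mid \Theta)\, P(\Theta)}{P(\text{Obs})} \]
posterior probability = likelihood × prior, divided by P(Obs) to normalize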
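The umbrella question can be worked through numerically. The following is a minimal sketch; the three probabilities are made-up illustrative values, not figures from the talk, and the function name `posterior` is hypothetical.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes's rule for a yes/no hypothesis: returns P(hypothesis | observation)."""
    # Unnormalized posterior mass for each alternative: likelihood x prior.
    mass_true = likelihood_if_true * prior
    mass_false = likelihood_if_false * (1.0 - prior)
    # Normalize so that the two alternatives sum to 1.
    return mass_true / (mass_true + mass_false)


# Illustrative assumptions only (not from the slides).
P_RAIN = 0.2                 # prior: how often it rains
P_UMBRELLA_IF_RAIN = 0.9     # likelihood of the observation when it rains
P_UMBRELLA_IF_DRY = 0.1      # likelihood of the observation when it is dry

p_rain_given_umbrella = posterior(P_RAIN, P_UMBRELLA_IF_RAIN, P_UMBRELLA_IF_DRY)
print(f"P(rain | umbrella) = {p_rain_given_umbrella:.3f}")
```

With these assumed inputs the observation raises the probability of rain well above the 0.2 prior, which is the same mechanism the genotype posteriors rely on.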
Genotype Refinement Workflow
[Figure: Recalibrated Variants → CalculateGenotypePosteriors (using Population Priors and Family Priors) → Variants with Posterior Qualities → VariantFiltration → High Quality Variants → VariantAnnotator → De Novo Variants]
CalculateGenotypePosteriors
java -jar GenomeAnalysisTK.jar \
    -T CalculateGenotypePosteriors \
    -R reference.fasta \
    -V input.vcf \
    -ped family.ped \
    -supporting population.vcf \
    -o output.vcf
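The `-ped` argument expects a pedigree file. A minimal sketch of what a trio might look like in the common six-column PED layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) is shown below. The family and sample names are hypothetical, and the individual IDs would need to match the sample names in the VCF.

```python
# Six PED columns: family ID, individual ID, paternal ID, maternal ID, sex, phenotype.
# Sex: 1 = male, 2 = female. Phenotype: 1 = unaffected, 2 = affected.
# A parent ID of "0" marks a founder (parents not in the file).
# All names below are hypothetical placeholders.
TRIO = [
    ("FAM01", "father", "0", "0", 1, 1),
    ("FAM01", "mother", "0", "0", 2, 1),
    ("FAM01", "child", "father", "mother", 2, 2),
]


def ped_lines(rows):
    """Format pedigree rows as tab-separated PED lines."""
    return ["\t".join(str(field) for field in row) for row in rows]


# Print the file content; redirect to family.ped to use it with -ped.
print("\n".join(ped_lines(TRIO)))
```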
Case 1: HOM_VAR call w/ low frequency priors
1) Baseline HOM_VAR call
2) Priors w/ low allele frequency applied
3) Posterior genotype called HET
4) In agreement w/ NIST and BAMs
Likelihoods × Priors = Posterior Probabilities (Phred-scaled, in the order [HOM_REF, HET, HOM_VAR]):
[895,3,0] × (AF=0.002) = [868,0,27]
Genotype corrected; confidence improved from Q3 to Q27
Case 2: HET call with high frequency priors
1) Baseline HET call
2) Priors w/ high allele frequency applied
3) Posterior genotype called HOM_VAR
4) In agreement w/ NIST and BAMs
Likelihoods × Priors = Posterior Probabilities (Phred-scaled, in the order [HOM_REF, HET, HOM_VAR]):
[894,0,0] × (AF=0.987) = [932,16,0]
Genotype corrected; confidence improved from Q0 to Q16
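The arithmetic behind both cases can be sketched in a few lines. This sketch assumes the prior for each genotype comes from the alt allele frequency under Hardy-Weinberg proportions; that is a simplifying assumption for illustration, and the prior calculation inside CalculateGenotypePosteriors may differ. The function names are hypothetical. In Phred space, multiplying by a prior becomes adding -10·log10(prior), and normalizing becomes subtracting the smallest value.

```python
import math


def hardy_weinberg_priors(alt_af):
    """Genotype priors [HOM_REF, HET, HOM_VAR] for an alt allele frequency in (0, 1)."""
    ref_af = 1.0 - alt_af
    return [ref_af ** 2, 2 * ref_af * alt_af, alt_af ** 2]


def posterior_pls(likelihood_pls, alt_af):
    """Apply allele-frequency priors to Phred-scaled likelihoods and renormalize."""
    priors = hardy_weinberg_priors(alt_af)
    # Likelihood x prior is a sum in Phred space.
    raw = [pl - 10.0 * math.log10(p) for pl, p in zip(likelihood_pls, priors)]
    # Normalize so that the most probable genotype has value 0.
    best = min(raw)
    return [round(value - best) for value in raw]


def genotype_quality(pls):
    """GQ taken as the gap between the best and second-best genotype."""
    return sorted(pls)[1]


case1 = posterior_pls([895, 3, 0], alt_af=0.002)
case2 = posterior_pls([894, 0, 0], alt_af=0.987)
print("Case 1:", case1, "GQ", genotype_quality(case1))
print("Case 2:", case2, "GQ", genotype_quality(case2))
```

Under this assumption the sketch reproduces the posterior values and qualities quoted in the two cases, which suggests the slides illustrate the same likelihood-times-prior step.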
Population priors improve genotype confidence
Baseline HomRef calls are underconfident, but posterior calls are more accurate
Baseline HomVar calls are overconfident, but posterior calls are improved
[Figure panels: HomRef, Het, HomVar]
Assessing confidence and correctness
[Figure: Homozygous Reference Calls. Axes: Baseline Genotype Quality vs. Posterior Genotype Quality. Linear fit: Intercept = 9.9612, Slope = 0.9302]
Average Q10 increase for correct calls, Q≤30
Incorrect calls stay about the same
Parental genotypes inform child genotypes
[Equation: a conditional probability of one term given three others; symbols not recoverable]
Genotype configurations consistent with inheritance (HR = HomRef, HET = Het, HV = HomVar):
Child   Mother  Father
HR      HR      HR
HR      HR      HET
HR      HET     HR
HR      HET     HET
HET     HET     HET
HET     HR      HET
HET     HET     HR
HET     HV      HET
HET     HET     HV
HET     HR      HV
HET     HV      HR
HV      HV      HV
HV      HV      HET
HV      HET     HV
HV      HET     HET
• Child can only inherit alleles present in parents
• Parent genotypes determine possible child genotypes (assuming no mutations)
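• HaplotypeCaller gives genotype likelihoods: \( P(D \mid G) \)
• Given trio data we can derive genotype posteriors:
\[ P(G \mid D) = \frac{P(D \mid G)\, P(G)}{\sum_{G'} P(D \mid G')\, P(G')} \]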
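The list of allowed child/mother/father configurations can be derived rather than memorized. The sketch below assumes a biallelic, diploid site and no mutations; the names `ALT_COUNT`, `PASSES_ON`, and `is_mendelian` are introduced here for illustration only.

```python
from itertools import product

# Number of alt alleles carried by each genotype.
ALT_COUNT = {"HR": 0, "HET": 1, "HV": 2}
# Number of alt alleles (0 or 1) a parent with that alt count can pass on.
PASSES_ON = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian(child, mother, father):
    """True if the child genotype can be built from one allele of each parent."""
    from_mother = PASSES_ON[ALT_COUNT[mother]]
    from_father = PASSES_ON[ALT_COUNT[father]]
    possible = {m + f for m in from_mother for f in from_father}
    return ALT_COUNT[child] in possible


# Enumerate all 27 (child, mother, father) combinations and keep the consistent ones.
CONSISTENT = [trio for trio in product(ALT_COUNT, repeat=3) if is_mendelian(*trio)]

for child in ALT_COUNT:
    rows = [(m, f) for c, m, f in CONSISTENT if c == child]
    print(child, len(rows), rows)
```

Of the 27 possible configurations, 15 are consistent with inheritance; the remaining 12 would require at least one new mutation.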
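Bayesian priors applied to trios
• Recall Bayes's Rule: posterior = likelihood × prior, then normalize
• Establish genotype configuration probabilities (μ = mutation prior, MV = Mendelian violation)
\[ P(G_c, G_m, G_f) = \begin{cases} \mu, & 1\ \text{MV} \\ \mu^2, & 2\ \text{MV} \\ 1 - 10\mu - 2\mu^2, & \text{no MV} \end{cases} \]
• Apply family priors
\[ P(G_c = g_c \mid D) = \frac{\sum_{g_m, g_f} P(g_c, g_m, g_f)\, P(D_c \mid g_c)\, P(D_m \mid g_m)\, P(D_f \mid g_f)}{\sum_{g_c', g_m, g_f} P(g_c', g_m, g_f)\, P(D_c \mid g_c')\, P(D_m \mid g_m)\, P(D_f \mid g_f)} \]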
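The family-prior step can be sketched as follows for a biallelic site. Several details here are assumptions made for illustration rather than statements about the tool: genotypes are encoded as alt allele counts 0/1/2, the number of violations is taken as the fewest new alleles needed to explain the child, the mass 1 − 10μ − 2μ² is shared evenly by the 15 consistent configurations, and μ = 1e-6 is an illustrative default. All function names are hypothetical.

```python
from itertools import product

# Alt alleles (0 or 1) a parent with a given alt count can pass on.
PASSES_ON = {0: (0,), 1: (0, 1), 2: (1,)}


def violations(child, mother, father):
    """Fewest new mutations needed to explain the child's alt count."""
    return min(abs(child - (m + f))
               for m in PASSES_ON[mother] for f in PASSES_ON[father])


def config_prior(child, mother, father, mu):
    """Prior probability of one (child, mother, father) configuration."""
    n = violations(child, mother, father)
    if n == 0:
        # Assumption: remaining mass split evenly over the 15 consistent configurations.
        return (1.0 - 10 * mu - 2 * mu ** 2) / 15
    return mu ** n


def phred_to_prob(pls, genotype):
    """Convert a Phred-scaled likelihood to a linear-scale value."""
    return 10 ** (-pls[genotype] / 10.0)


def child_posterior(pl_child, pl_mother, pl_father, mu=1e-6):
    """Posterior over the child's genotype [HomRef, Het, HomVar] given trio PLs."""
    mass = [0.0, 0.0, 0.0]
    for c, m, f in product(range(3), repeat=3):
        # Apply prior: configuration prior x the three per-sample likelihoods.
        mass[c] += (config_prior(c, m, f, mu)
                    * phred_to_prob(pl_child, c)
                    * phred_to_prob(pl_mother, m)
                    * phred_to_prob(pl_father, f))
    total = sum(mass)  # normalize
    return [value / total for value in mass]


# Ambiguous child call (HomRef vs. Het) with two confidently HomRef parents.
post = child_posterior([0, 3, 40], [0, 60, 600], [0, 60, 600])
print([f"{p:.6f}" for p in post])
```

In this example the parents' confident HomRef calls pull the child's ambiguous call strongly toward HomRef, since a Het child would require a new mutation.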
Family priors improve genotype confidence
Baseline HomRef calls are underconfident, but posterior calls are more accurate
Posterior HomRef and HomVar calls are higher confidence
[Figure panels: HomRef, Het, HomVar]
Assessing confidence and correctness
[Figure: Homozygous Reference Calls. Axes: Baseline Genotype Quality vs. Posterior Genotype Quality. Linear fit: Intercept = 12.831, Slope = 1.238]
Average Q13 increase for correct calls, Q≤30
Incorrect calls stay about the same
Filter low confidence GQs
• Use VariantFiltration to filter ambiguous, low-confidence calls
• Recommended threshold is GQ = 20
– GQ20 is Phred-scaled 99% confidence
• Restrict further analysis to high-quality data
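The link between a Phred-scaled quality and a confidence level is a one-line conversion, sketched here with hypothetical helper names.

```python
import math


def phred_to_confidence(q):
    """Probability that a call with Phred quality q is correct."""
    return 1.0 - 10 ** (-q / 10.0)


def confidence_to_phred(confidence):
    """Phred quality corresponding to a confidence below 1."""
    return -10.0 * math.log10(1.0 - confidence)


for q in (10, 20, 30):
    print(f"GQ{q}: {phred_to_confidence(q):.1%} confidence")
```

Each 10-point step in GQ corresponds to a tenfold drop in the error probability, so GQ 20 means a 1-in-100 chance that the genotype call is wrong.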
java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R reference.fasta \
    -V input.vcf \
    --genotypeFilterExpression "GQ < 20" \
    --genotypeFilterName "lowGQ" \
    -o output.vcf
VariantAnnotator
java -jar GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R reference.fasta \
    -V input.vcf \
    -A PossibleDeNovo \
    -ped family.ped \
    -o output.vcf
What are De Novo mutations?
• Culprits in many rare Mendelian disorders
• ~30 de novo mutations occur per human genome
[Figure: trio pedigree. Parents are homozygous reference; child is het (one copy of alt allele)]
Properties of sequenced De Novos
• Novelty
– Child has the only alt allele in the trio; it is not inherited
• Rarity
– Allele frequency across all samples sequenced is low
• Confidence
– Set GQ threshold for parents and child
– (GQ improvement tools help A LOT here!)
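The three properties can be combined into a simple screen. This is a minimal sketch, not the logic of the PossibleDeNovo annotation: the `Call` type, the function name, and the `max_af` cutoff of 0.01 are hypothetical choices for illustration, while the GQ threshold of 20 follows the recommendation in the talk.

```python
from typing import NamedTuple


class Call(NamedTuple):
    alt_count: int  # 0 = HomRef, 1 = Het, 2 = HomVar
    gq: int         # genotype quality (Phred-scaled)


def is_candidate_de_novo(child, mother, father, cohort_af,
                         min_gq=20, max_af=0.01):
    """Screen one site in one trio for novelty, rarity, and confidence."""
    # Novelty: the child carries one alt allele that neither parent has.
    novel = child.alt_count == 1 and mother.alt_count == 0 and father.alt_count == 0
    # Rarity: the allele is uncommon across all sequenced samples.
    rare = cohort_af <= max_af
    # Confidence: every member of the trio passes the GQ threshold.
    confident = min(child.gq, mother.gq, father.gq) >= min_gq
    return novel and rare and confident


print(is_candidate_de_novo(Call(1, 45), Call(0, 60), Call(0, 52), cohort_af=0.001))
print(is_candidate_de_novo(Call(1, 45), Call(0, 12), Call(0, 52), cohort_af=0.001))
```

The second call fails only because one parent's genotype is low confidence, which is why improving GQ through posteriors changes how many candidates survive.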
Example of a clinical case
• Real clinical data
• Suspected de novo mutation in offspring
417 de novos from raw GT calls
17 de novos based on posterior GTs
8 high confidence de novos after GQ filtering
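Priors can be tuned for sensitivity
Sensitivity and specificity can be tuned as in VQSR
Mutation prior is a parameter in genotype configuration probability (μ = mutation prior, MV = Mendelian violation):
\[ P(G_c, G_m, G_f) = \begin{cases} \mu, & 1\ \text{MV} \\ \mu^2, & 2\ \text{MV} \\ 1 - 10\mu - 2\mu^2, & \text{no MV} \end{cases} \]
[Figure: arrow labeled Increasing Sensitivity]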
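One way to see the effect of the mutation prior is to express it as a Phred-scaled penalty that the read evidence must overcome before a single-violation configuration beats a consistent one. The sketch below assumes the no-violation mass 1 − 10μ − 2μ² is shared evenly by the 15 consistent configurations; the function names and the μ values swept are illustrative assumptions.

```python
import math


def de_novo_prior_odds(mu):
    """Prior odds of a one-violation configuration vs. one consistent configuration."""
    consistent = (1.0 - 10 * mu - 2 * mu ** 2) / 15  # assumption: even split
    return mu / consistent


def de_novo_penalty(mu):
    """Phred-scaled evidence needed to offset the prior against a de novo call."""
    return -10.0 * math.log10(de_novo_prior_odds(mu))


for mu in (1e-8, 1e-6, 1e-4):
    print(f"mu = {mu:g}: penalty = {de_novo_penalty(mu):.1f}")
```

A larger μ lowers the penalty, so more sites are called de novo (higher sensitivity) at the cost of more false positives.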
Genotype refinement yields more high-quality genotypes
• Initial genotype calls may be ambiguous or wrong
• Applying population + family priors improves confidence
• More high confidence genotypes -> more data for downstream analysis!
[Figure panel: HomVar]
Further reading
http://www.broadinstitute.org/gatk/guide/
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors
http://www.broadinstitute.org/gatk/guide/article?id=4723
http://www.broadinstitute.org/gatk/guide/article?id=4726