Ryan Poplin, on behalf of the Genome Sequencing and Analysis Group, Program in Medical and Population Genetics. August 16, 2012
Understanding sources of bias and error from a prospective Reference Material (NA12878)
NA12878 is a wonderful reference sample
• Unrestricted cell lines
• Extensive pedigree available
• Extensively sequenced and genotyped at the Broad and elsewhere
  – All Broad techs (both production and experimental)
  – Fosmids
  – Many library designs and sample prep protocols
[Figure: the three-phase framework, each phase with its input and output.]
Phase 1: NGS data processing (typically by lane)
  Raw reads → Mapping → Local realignment → Duplicate marking → Base quality recalibration → Analysis-ready reads
Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
  Sample 1 reads ... Sample N reads → SNPs, Indels, Structural variation (SV) → Raw variants (raw SNPs, raw indels, raw SVs)
Phase 3: Integrative analysis
  Raw variants plus external data (pedigrees, known variation, known genotypes, population structure) → Variant quality recalibration and genotype refinement → Analysis-ready variants
Our framework for variation discovery
DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
Lots of work is required to turn raw sequencing reads into something useful
Phase 1: NGS data processing
Raw reads (input) → Mapping → Local realignment → Duplicate marking → Base quality recalibration → Analysis-ready reads (output)
Desired properties of analysis-ready reads:
• Unbiased sampling of alleles
• Calibrated mapping quality scores
• Indels have correct and consistent alignment in reads
• Duplicate molecules shouldn’t count as extra evidence for an event
• Calibrated base quality scores for base substitutions, base insertions, and base deletions
Indels have correct and consistent alignment in reads through multiple sequence local realignment
[Figure: Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589. Panels: 1,000 Genomes Pilot 2 data, raw MAQ alignments; 1,000 Genomes Pilot 2 data, after MSA; HiSeq data, raw BWA alignments; HiSeq data, after MSA. Labeled dbSNP sites: rs28782535, rs28783181, rs28788974, rs34877486.]
[Figure: base quality score recalibration across sequencing technologies (SLX GA, 454, SOLiD, HiSeq, Complete Genomics). Each panel compares original and recalibrated qualities. Three panel types are shown: Empirical Quality vs. Reported Quality (both axes 0 to 40); Accuracy (Empirical − Reported Quality) vs. Machine Cycle, with first-of-pair and second-of-pair reads plotted on negative and positive cycles for paired data; and Accuracy (Empirical − Reported Quality) vs. Dinucleotide (AA through TG). The panel-to-technology assignment is not recoverable from the extracted text, so the RMSE values below are listed in extraction order.]
RMSE, original → recalibrated:
• Empirical vs. Reported Quality: 5.242 → 0.196; 2.556 → 0.213; 1.215 → 0.756; 4.479 → 0.235; 5.634 → 0.135
• Accuracy vs. Machine Cycle: 2.207 → 0.186; 1.784 → 0.136; 1.688 → 0.213; 2.679 → 0.182; 2.609 → 0.089
• Accuracy vs. Dinucleotide: 2.598 → 0.052; 2.169 → 0.135; 1.656 → 0.088; 3.503 → 0.06; 2.469 → 0.083
Base Quality Score Recalibration provides a calibrated error model from which to make mutation calls
Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project
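To make the calibration plots concrete, here is a minimal sketch of the quantity they compare. It bins bases by reported quality, counts mismatches at sites assumed to be non-variant, and converts the observed error rate into a Phred-scaled empirical quality. The function names, the add-one smoothing, the base-count weighting of the RMSE, and the toy data are all assumptions for illustration; this is not the GATK implementation.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(error_rate)


def empirical_qualities(observations):
    """Map each reported quality to (number of bases, empirical quality).

    observations: iterable of (reported_quality, is_mismatch) pairs,
    one per sequenced base at a site assumed to be non-variant.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [bases, mismatches]
    for reported, is_mismatch in observations:
        counts[reported][0] += 1
        counts[reported][1] += int(is_mismatch)
    # Add-one smoothing (an assumption here) keeps the rate away from zero.
    return {
        q: (n, phred((mism + 1) / (n + 2)))
        for q, (n, mism) in counts.items()
    }


def calibration_rmse(table):
    """Base-count-weighted RMSE between reported and empirical quality."""
    total = sum(n for n, _ in table.values())
    squared = sum(n * (q - emp) ** 2 for q, (n, emp) in table.items())
    return math.sqrt(squared / total)


# Toy data: bases reported as Q30 that actually err about 1% of the time.
toy = [(30, i % 100 == 0) for i in range(10_000)]
table = empirical_qualities(toy)
rmse = calibration_rmse(table)
```

In this toy case the empirical quality is close to Q20, so the reported Q30 is overconfident by roughly ten Phred units; recalibration aims to drive that gap toward zero.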
Per-base indel error rate also varies by lane, sequence context, and sequencing technology
[Figure: empirical gap open penalty (y-axis, 0 to 50) by 3-base context (x-axis labeled "AAAAA context suffix", AAA through TTT), one point per read group: 20FUK.1 through 20FUK.8 and PacBio. Annotated groups: HiSeq and PacBio.]
• The AAAAA + AAA context is errorful in HiSeq
• PacBio error rate is 1000x higher but unbiased
[Figure: BQSRv2 calibration plots. Each plot has three panels (Base Substitution, Base Insertion, Base Deletion), a "Recalibration" legend with entries Recalibrated and BQSRv2, and point size scaled by log10(nBases).
  1. Empirical Quality Score vs. Reported Quality Score (both axes 10 to 50)
  2. Quality Score Accuracy vs. Cycle Covariate (cycles −100 to 100)
  3. Quality Score Accuracy vs. Context Covariate (2- and 3-base contexts, AA through TTT)]
UnifiedGenotyper used a flat Q45 in its indel model
Latest version of Base Quality Score Recalibrator empirically estimates the base insertion and deletion error rates in addition to substitutions
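The contrast between a flat Q45 and empirically estimated indel qualities can be sketched as follows. The code tallies events per preceding sequence context and converts each smoothed event rate into a Phred-scaled penalty. The event labels, the 3-base context length, the smoothing, and the toy counts are assumptions for illustration only.

```python
import math
from collections import Counter

FLAT_INDEL_QUAL = 45  # the flat value the slide attributes to UnifiedGenotyper


def context_penalties(events, context_len=3):
    """Estimate a Phred-scaled penalty per (event type, sequence context).

    events: iterable of (preceding_bases, event), where event is one of
    "match", "substitution", "insertion", "deletion".
    """
    totals = Counter()
    errors = Counter()
    for preceding, event in events:
        context = preceding[-context_len:]
        totals[context] += 1
        if event != "match":
            errors[(event, context)] += 1
    penalties = {}
    for (event, context), n_err in errors.items():
        rate = (n_err + 1) / (totals[context] + 2)  # smoothed, an assumption
        penalties[(event, context)] = -10.0 * math.log10(rate)
    return penalties


# Toy observations: insertions are common after a poly-A run, rare after ACG.
toy = (
    [("AAAAA", "insertion")] * 10 + [("AAAAA", "match")] * 990
    + [("TTACG", "insertion")] * 1 + [("TTACG", "match")] * 99_999
)
penalties = context_penalties(toy)
poly_a = penalties[("insertion", "AAA")]
acg = penalties[("insertion", "ACG")]
```

With these made-up counts the poly-A context earns a penalty near Q20, far below the flat Q45, while the ACG context stays above it. A flat value would treat both contexts identically.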
[Figure: Empirical Genotype Quality vs. Reported Genotype Quality (both axes 0 to 30). Columns: UnifiedGenotyper; HaplotypeCaller Original Quality; HaplotypeCaller Calibrated Quality. Rows and colors (pGGivenDType): Hom Ref, Het, Hom Var. Point size: log10(Sum), 1.5 to 4.0.]
Making use of these calibrated quality scores improves indel likelihood calibration
Evaluated calls: per-read-group downsampled (~4x coverage) NA12878 indel calls
Truth: GATK-bundle indel “gold standard” truth sites with high-confidence genotypes from the deep-coverage CEU trio
Another source of bias: we often find consistent (artifactual) alleles at the sites of larger events because they cannot be properly modeled by the mappers
[Figure: validated 30bp deletion at Chr12:15296246, GTGTGTATGTAAATATATACATACACACAT/-. Panels: original BWA alignments; alignments showing the actual allele.]
• Multiple called artifacts that are hard to filter out, since they are well supported by read data
• BAM read bases are all identical; individual alignments differ based on the whims of the mapper
Allele determination is much more accurate through local assembly of candidate haplotypes
[Figure: original BWA alignments.]
• -assembly: 1 multi-allelic SNP and two 1bp indels are called
• +assembly: only the complex substitution (TT to TAC) is called
As an added bonus we now get physical phasing for free, which allows us to distinguish between e.g. MNPs and compound hets
[Figure: CEU trio alignments: Daughter, Father, Mother.]
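The MNP versus compound-het distinction comes down to whether two nearby alt alleles sit on the same haplotype. A minimal sketch, assuming the assembled haplotypes are already summarized as sets of alt-carrying positions (the function name and that representation are hypothetical):

```python
def classify_adjacent_snps(hap1_variants, hap2_variants, pos_a, pos_b):
    """Classify two nearby SNPs using phased haplotypes.

    hap1_variants / hap2_variants: sets of positions carrying the alt
    allele on each of the two haplotypes of a diploid sample.
    """
    pair = {pos_a, pos_b}
    if pair <= hap1_variants or pair <= hap2_variants:
        return "cis"    # same haplotype: candidate MNP
    if pair <= (hap1_variants | hap2_variants):
        return "trans"  # opposite haplotypes: compound het
    return "absent"     # at least one SNP is on neither haplotype


mnp_case = classify_adjacent_snps({100, 101}, set(), 100, 101)
compound_het_case = classify_adjacent_snps({100}, {101}, 100, 101)
```

Unphased genotypes alone (0/1 at both sites) cannot separate these two cases, which is why the phasing from assembly matters.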
Conclusions
• NA12878 (and potentially more of the CEU pedigree) is a great reference sample
• Read data must be in an analysis-ready form that passes a multi-faceted battery of tests related to statistical bias, going beyond sample QC metrics
• Local de novo assembly around mutations is necessary to avoid the allele biases that result from a myopic view
Appendix
Multiple sequence alignment itself is not enough for calling indels
• MSA is now a standard piece of the BAM processing pipeline and works well for getting consistent alignments of previously seen indels
• However, it is not empowered to discover large novel indels or more complex alleles
• What is needed is haplotype reconstruction and then calling of variants from the candidate haplotypes
• Several groups (Oxford, Sanger, Broad) are actively working on assembly-based approaches (both global and local) for haplotype-level calling
Step 1: propose haplotypes with a local de novo assembler via De Bruijn graphs
Traverse the graph to enumerate the possible haplotypes. Each edge is weighted by the number of reads that gave evidence for that k-mer.
[Figure source: Schatz. Assembly of large genomes using second-generation sequencing. Genome Research. 2010.]
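A toy version of this step is sketched below. It builds a De Bruijn graph whose edges are k-mers weighted by read support, then enumerates source-to-sink paths as candidate haplotypes. The function names, the minimum-weight pruning, and the use of the weakest edge as a path's support are assumptions for illustration. A production assembler also handles sequencing errors, cycles, and reference anchoring.

```python
from collections import Counter


def build_debruijn(reads, k):
    """Edge weights: (prefix (k-1)-mer, suffix (k-1)-mer) -> k-mer count."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def enumerate_haplotypes(edges, source, sink, min_weight=1):
    """Depth-first walk from source to sink, skipping weak edges.

    Returns (haplotype, minimum edge weight on the path) pairs. Each edge
    is used at most once per path, which keeps the walk finite.
    """
    adjacency = {}
    for (u, v), w in edges.items():
        if w >= min_weight:
            adjacency.setdefault(u, []).append((v, w))
    results = []

    def walk(node, seq, used, support):
        if node == sink:
            results.append((seq, support))
            return
        for nxt, w in sorted(adjacency.get(node, [])):
            if (node, nxt) not in used:
                walk(nxt, seq + nxt[-1], used | {(node, nxt)}, min(support, w))

    walk(source, source, frozenset(), float("inf"))
    return results


# Toy reads: three support the reference, two carry an A>T substitution.
reads = ["ACGTACCGGA"] * 3 + ["ACGTTCCGGA"] * 2
edges = build_debruijn(reads, k=4)
haplotypes = enumerate_haplotypes(edges, source="ACG", sink="GGA")
```

The graph forks after CGT and rejoins at CCG, so the walk yields two candidate haplotypes. Raising `min_weight` to 3 would prune the alternate branch.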
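Bayesian model

4 SNP calling

4.1 Simple genotype likelihoods for presentations

$$\Pr\{G \mid D\} = \frac{\Pr\{G\}\,\Pr\{D \mid G\}}{\sum_i \Pr\{G_i\}\,\Pr\{D \mid G_i\}} \qquad \text{[Bayes' rule]}$$

$$\Pr\{D \mid G\} = \prod_j \left( \frac{\Pr\{D_j \mid H_1\}}{2} + \frac{\Pr\{D_j \mid H_2\}}{2} \right) \qquad \text{where } G = H_1 H_2$$

$\Pr\{D \mid H\}$ is the haploid likelihood function.

Slide annotations: $\Pr\{G\}$ is the prior of the genotype, $\Pr\{D \mid G\}$ is the likelihood of the genotype, and the two half-weighted terms reflect the diploid assumption.

4.1.1 SNP haploid likelihood

$$\Pr\{D_j \mid H\} = \Pr\{D_j \mid b\} \qquad \text{[single base pileup]}$$

$$\Pr\{D_j \mid b\} = \begin{cases} 1 - \epsilon_j & D_j = b, \\ \epsilon_j & \text{otherwise.} \end{cases}$$

4.1.2 Indel haploid likelihood

$$\Pr\{D_j \mid H\} = \sum_{\text{alignments } \pi \text{ of } D_j \text{ to } H} \Pr\{D_j, \pi\}$$

4.2 Genotype likelihoods

$$\Pr\{D_i \mid GT_i\} = \prod_j \Pr\{D_{i,j} \mid GT_i\}$$

$$\Pr\{D_{i,j} \mid GT_i = AB\} = \left( \Pr\{D_{i,j} \mid A\} + \Pr\{D_{i,j} \mid B\} \right) / 2$$

$$\Pr\{D_{i,j} \mid B\} = \begin{cases} 1 - \epsilon_{i,j} & D_{i,j} = B, \\ \epsilon_{i,j} \cdot \Pr\{B \text{ is true} \mid D_{i,j} \text{ is miscalled}\} & \text{otherwise.} \end{cases}$$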
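The SNP case of the model above can be written out directly. The sketch below follows sections 4.1 and 4.1.1: a haploid likelihood of 1 − ε or ε per read base, a diploid likelihood that averages the two haplotypes, and Bayes' rule over candidate genotypes. The function names, the toy pileup, and the prior values are made up for illustration.

```python
def haploid_likelihood(base, allele, error_rate):
    """Pr{D_j | b}: 1 - eps if the read base matches the allele, else eps."""
    return 1.0 - error_rate if base == allele else error_rate


def genotype_likelihood(pileup, genotype):
    """Pr{D | G} for G = H1H2 under the diploid assumption.

    pileup: list of (base, phred_quality) observations at one site.
    """
    h1, h2 = genotype
    likelihood = 1.0
    for base, qual in pileup:
        eps = 10.0 ** (-qual / 10.0)
        likelihood *= (haploid_likelihood(base, h1, eps) / 2.0
                       + haploid_likelihood(base, h2, eps) / 2.0)
    return likelihood


def genotype_posteriors(pileup, priors):
    """Pr{G | D} via Bayes' rule over the candidate genotypes in priors."""
    joint = {g: p * genotype_likelihood(pileup, g) for g, p in priors.items()}
    norm = sum(joint.values())
    return {g: v / norm for g, v in joint.items()}


# Toy pileup: 6 reads support A and 5 support G, all at Q30.
pileup = [("A", 30)] * 6 + [("G", 30)] * 5
# Made-up priors favoring the homozygous reference genotype.
priors = {("A", "A"): 0.998, ("A", "G"): 0.001, ("G", "G"): 0.001}
posteriors = genotype_posteriors(pileup, priors)
best = max(posteriors, key=posteriors.get)
```

Even with a strong prior toward A/A, the balanced pileup makes the heterozygous genotype overwhelmingly likely. Real implementations work in log space to avoid underflow at high depth.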
Step 2: evaluate candidate haplotype likelihoods with Pair HMM
Empirical gap penalties are derived from the data using the new BQSR. Base mismatch penalties are the base quality scores.
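For the indel haploid likelihood, the sum over alignments can be computed with the forward algorithm of a pair HMM. The sketch below is a simplified textbook version, not GATK's implementation. It assumes a single gap-open and gap-extension quality for the whole read (the slide describes penalties estimated from the data), a uniform start position along the haplotype, and a 0.25 emission probability for inserted bases.

```python
def pair_hmm_likelihood(read, quals, haplotype, gap_open_q=45, gap_ext_q=10):
    """Forward-algorithm estimate of Pr{read | haplotype}.

    States: match, insertion (extra read base), deletion (skipped
    haplotype base). Base qualities set the mismatch probability.
    """
    n, m = len(read), len(haplotype)
    delta = 10.0 ** (-gap_open_q / 10.0)  # gap-open probability
    eps = 10.0 ** (-gap_ext_q / 10.0)     # gap-extension probability
    match = [[0.0] * (m + 1) for _ in range(n + 1)]
    ins = [[0.0] * (m + 1) for _ in range(n + 1)]
    dele = [[0.0] * (m + 1) for _ in range(n + 1)]
    # The read may start at any haplotype position with equal probability.
    for j in range(m + 1):
        dele[0][j] = 1.0 / m
    for i in range(1, n + 1):
        err = 10.0 ** (-quals[i - 1] / 10.0)
        for j in range(1, m + 1):
            same = read[i - 1] == haplotype[j - 1]
            emit = 1.0 - err if same else err / 3.0
            match[i][j] = emit * (
                (1.0 - 2.0 * delta) * match[i - 1][j - 1]
                + (1.0 - eps) * (ins[i - 1][j - 1] + dele[i - 1][j - 1])
            )
            ins[i][j] = 0.25 * (delta * match[i - 1][j] + eps * ins[i - 1][j])
            dele[i][j] = delta * match[i][j - 1] + eps * dele[i][j - 1]
    # The read may end anywhere along the haplotype.
    return sum(match[n][j] + ins[n][j] for j in range(1, m + 1))


read, quals = "GTTCGT", [30] * 6
lik_alt = pair_hmm_likelihood(read, quals, "ACGTTCGTAC")
lik_ref = pair_hmm_likelihood(read, quals, "ACGTACGTAC")
```

The read matches the alternate haplotype exactly but needs one mismatch against the reference, so `lik_alt` exceeds `lik_ref` by several orders of magnitude. These per-read, per-haplotype likelihoods are what feed the genotype likelihoods.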
The indel size distribution is more accurate when using local assembly of candidate haplotypes
[Figure: indel size distribution. Key: -assembly; +assembly; fosmid data (truth).]
Larger events are missing with previous methods
Variant annotation statistics provide signal with which to evaluate callsets of putative mutations
22 49582364 . A G 198.96 . AB=0.67; AC=3; AF=0.50; AN=6; DP=87; Dels=0.00; HRun=1; MQ=71.31; MQ0=22; QD=2.29; SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78
VCF record for an A/G SNP at 22:49582364
INFO field   Meaning
AC           No. of chromosomes carrying alt allele
AN           Total no. of chromosomes
AF           Allele frequency
DP           Depth of coverage
QD           QUAL score over depth
AB           Allele balance of ref/alt in hets
HRun         Length of longest contiguous homopolymer
MQ           RMS MAPQ of all reads
MQ0          No. of MAPQ 0 reads at locus
SB           Evidence for strand bias
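As a quick check on how these annotations relate, the sketch below parses the example record and recomputes AF and QD from the other fields. The parser is a hypothetical helper for illustration. It assumes whitespace-delimited columns and an INFO field without embedded spaces, so the record is written here in that compact form. Real VCF files are tab-delimited and start with header lines.

```python
RECORD = (
    "22 49582364 . A G 198.96 . "
    "AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;"
    "MQ=71.31;MQ0=22;QD=2.29;SB=-31.76 "
    "GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78"
)


def parse_vcf_line(line):
    """Split one VCF data line into fixed columns, INFO, and samples."""
    fields = line.split()
    chrom, pos, _id, ref, alt, qual, _filter, info, fmt = fields[:9]
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value
    keys = fmt.split(":")
    samples = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return {
        "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
        "qual": float(qual), "info": info_dict, "samples": samples,
    }


rec = parse_vcf_line(RECORD)
info = rec["info"]
allele_frequency = int(info["AC"]) / int(info["AN"])  # AF = AC / AN
qual_by_depth = rec["qual"] / int(info["DP"])         # QD = QUAL / DP
```

For this record AC/AN gives 0.5 and QUAL/DP gives about 2.29, matching the reported AF and QD. All three samples are heterozygous (0/1).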
Variant Quality Score Recalibration (VQSR): modeling the error properties of real polymorphism to determine the probability that novel sites are real
• The HapMap3 sites from NA12878 HiSeq calls are used to train the GMM. Shown here is the 2D plot of strand bias vs. the variant quality / depth for those sites.
• Variants are scored based on their fit to the Gaussians. The variants (here just the novels) clearly separate into good and bad clusters.
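The scoring idea can be illustrated with a deliberately reduced stand-in: a single Gaussian with diagonal covariance in place of the Gaussian mixture the slide describes. The model is fit to annotation vectors from trusted training sites, and a candidate is scored by its log density under that fit. The function names and all annotation values below are made up.

```python
import math


def fit_diagonal_gaussian(points):
    """Per-dimension mean and variance of the training annotation vectors."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    variances = [
        sum((p[d] - means[d]) ** 2 for p in points) / n for d in range(dims)
    ]
    return means, variances


def log_density(point, model):
    """Log density of one annotation vector under the fitted Gaussian."""
    means, variances = model
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
        for x, mu, var in zip(point, means, variances)
    )


# Hypothetical (QD, SB) annotations at training sites.
training = [(20.0, -50.0), (22.0, -60.0), (18.0, -40.0),
            (21.0, -55.0), (19.0, -45.0)]
model = fit_diagonal_gaussian(training)
score_typical = log_density((20.5, -52.0), model)
score_outlier = log_density((3.0, -1.0), model)
```

A candidate whose annotations resemble the training sites scores far higher than one with low QD and little strand balance. A mixture of several full-covariance Gaussians captures multiple clusters and correlated annotations that this single-Gaussian sketch cannot.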
[Figure: histograms of the number of SNPs by Fisher Strand Bias Score (FS: Fisher Exact Test of Read Strand; x-axis 0 to 30) and by Variant Quality / Depth (QD; x-axis 10 to 40), split into novels and knowns (dbSNP 132), retained vs. filtered out.]
Lots of the filtered-out strand-biased variants show up at the centromere. They are very unlikely to be real SNP mutations.
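A strand bias score of this kind can be sketched as a Phred-scaled Fisher exact test on the 2x2 table of reference and alternate read counts by strand. The two-sided p-value sums every table that is no more probable than the observed one. The function names and read counts are made up, and `math.comb` requires Python 3.8 or later.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return (math.comb(row1, x) * math.comb(row2, col1 - x)
                / math.comb(n, col1))

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(
        prob(x) for x in range(lo, hi + 1)
        if prob(x) <= observed * (1.0 + 1e-9)
    )
    return min(p, 1.0)


def strand_bias_score(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled p-value: larger means stronger evidence of strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10.0 * math.log10(max(p, 1e-300))


balanced = strand_bias_score(10, 10, 10, 10)
biased = strand_bias_score(20, 20, 20, 0)  # alt allele seen on one strand only
```

The balanced table scores essentially zero, while an alt allele seen on only one strand scores above 40. That separation is the kind of signal the FS histogram is meant to show.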