Ryan Poplin - Sources of Bias

Ryan Poplin, on behalf of the Genome Sequencing and Analysis Group, Program in Medical and Population Genetics. August 16, 2012

Understanding sources of bias and error from a prospective Reference Material (NA12878)

NA12878 is a wonderful reference sample!

• Unrestricted cell lines
• Extensive pedigree available
• Extensively sequenced and genotyped at the Broad and elsewhere
  – All Broad techs (both production and experimental)
  – Fosmids
  – Many library designs and sample prep protocols

[Figure: pipeline diagram of the three-phase framework]

Phase 1: NGS data processing (typically by lane)
  Input: Raw reads
  Steps: Mapping → Local realignment → Duplicate marking → Base quality recalibration
  Output: Analysis-ready reads

Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
  Input: Sample 1 reads ... Sample N reads
  Steps: SNPs, Indels, Structural variation (SV)
  Output: Raw variants (raw SNPs, raw indels, raw SVs)

Phase 3: Integrative analysis
  Input: Raw variants
  External data: Pedigrees, Known variation, Known genotypes, Population structure
  Steps: Variant quality recalibration, Genotype refinement
  Output: Analysis-ready variants

Our framework for variation discovery

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

Lots of work required to turn raw sequencing reads into something that is useful!

Phase 1: NGS data processing

[Diagram: Raw reads (input) → Mapping → Local realignment → Duplicate marking → (output)]

Desired properties of analysis-ready reads:

• Unbiased sampling of alleles
• Calibrated mapping quality scores
• Indels have correct and consistent alignment in reads
• Duplicate molecules shouldn’t count as extra evidence for event
• Calibrated base quality scores for base substitutions, base insertions, and base deletions

Indels have correct and consistent alignment in reads through multiple sequence local realignment

[Figure: Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589. Panels: 1,000 Genomes Pilot 2 data, raw MAQ alignments; 1,000 Genomes Pilot 2 data, after MSA; HiSeq data, raw BWA alignments; HiSeq data, after MSA. Labeled sites: rs28782535, rs28783181, rs28788974, rs34877486.]

Source: DePristo et al. (2011), Nat Genet.


[Figure: base quality score recalibration across sequencing technologies (SLX GA, 454, SOLiD, HiSeq, Complete Genomics), with first-of-pair and second-of-pair reads shown separately. Row 1: Empirical Quality vs. Reported Quality (both axes 0-40), original and recalibrated; legible annotations include "Original, RMSE = 5.242", "Recalibrated, RMSE = 0.196" and "Recalibrated, RMSE = 0.213". Row 2: Accuracy (Empirical − Reported Quality, −10 to 10) vs. Machine Cycle. Row 3: Accuracy (Empirical − Reported Quality, −10 to 10) vs. Dinucleotide (AA, AG, CA, CG, GA, GG, TA, TG, ...).]

Base Quality Score Recalibration provides a calibrated error model from which to make mutation calls

Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!
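The comparison of reported and empirical quality shown in these plots can be sketched in a few lines. This is a minimal illustration, not the GATK implementation. It assumes each observed base has already been labeled as an error or not (for example, a mismatch to the reference at a site that is not a known variant), that counts are smoothed with +1/+2 pseudocounts, and that RMSE is an unweighted average over quality bins. The slides do not state how their RMSE was weighted, so that last choice in particular is a guess.

```python
import math
from collections import defaultdict


def empirical_quality_by_bin(observations):
    """Map each reported Phred score to its empirical Phred score.

    observations: iterable of (reported_q, is_error) pairs, one per base.
    The +1 / +2 pseudocounts are an assumption made for this sketch so that
    bins with zero observed errors still get a finite quality.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [bases, errors]
    for reported_q, is_error in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_error)
    return {
        q: -10.0 * math.log10((errors + 1) / (bases + 2))
        for q, (bases, errors) in counts.items()
    }


def rmse(empirical):
    """Unweighted root-mean-square gap between reported and empirical Q."""
    squared = [(emp - rep) ** 2 for rep, emp in empirical.items()]
    return math.sqrt(sum(squared) / len(squared))


# Toy data (invented for illustration): 998 bases reported as Q30,
# of which 9 are errors, so the empirical quality is about Q20.
toy = [(30, True)] * 9 + [(30, False)] * 989
table = empirical_quality_by_bin(toy)
print(table, rmse(table))
```

A well-calibrated lane would give an RMSE near zero; the toy lane above is overconfident by roughly ten Phred units.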

[Figure: Empirical gap open penalty (0-50) vs. AAAAA context suffix (three-base suffixes from AAA to TTT), colored by ReadGroup (20FUK.1 through 20FUK.8, and PacBio), with the HiSeq and PacBio series labeled. Annotation: AAAAA + AAA context is errorful in HiSeq.]

Per-base indel error rate also varies by lane, sequence context and sequencing technology

PacBio error rate is 1000x higher but unbiased

[Figure: BQSRv2 recalibration results for three event types (Base Substitution, Base Insertion, Base Deletion). Top: Empirical Quality Score vs. Reported Quality Score (10-50), legend labels "Recalibrated" and "BQSRv2", shaded by log10(nBases). Middle: Quality Score Accuracy vs. Cycle Covariate (−100 to 100). Bottom: Quality Score Accuracy vs. Context Covariate (AA, AAA, AAC, ..., TTT), shaded by log10(nBases).]

UnifiedGenotyper used a flat Q45 in its indel model

Latest version of Base Quality Score Recalibrator empirically estimates the base insertion and deletion error rates in addition to substitutions
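One way to picture the move from a flat indel penalty to empirically estimated ones is a lookup table keyed by event type and sequence context, falling back to the flat value when a context was never observed. The sketch below is illustrative only: the function names, the tally format, and the pseudocount smoothing are assumptions, and the counts are invented.

```python
import math
from collections import defaultdict

FLAT_INDEL_Q = 45.0  # the flat penalty the slide attributes to UnifiedGenotyper


def build_indel_table(tallies):
    """Estimate a Phred-scaled error rate per (event, context).

    tallies: iterable of (event, context, n_bases, n_errors) tuples, where
    event is "insertion" or "deletion". Hypothetical input format.
    """
    counts = defaultdict(lambda: [0, 0])
    for event, context, n_bases, n_errors in tallies:
        counts[(event, context)][0] += n_bases
        counts[(event, context)][1] += n_errors
    return {
        key: -10.0 * math.log10((errors + 1) / (bases + 2))
        for key, (bases, errors) in counts.items()
    }


def gap_open_penalty(table, event, context, default=FLAT_INDEL_Q):
    """Empirical penalty if the context was seen, else the flat default."""
    return table.get((event, context), default)


# Invented counts: a homopolymer context with many insertion errors.
table = build_indel_table([("insertion", "AAA", 99998, 99)])
print(gap_open_penalty(table, "insertion", "AAA"))
print(gap_open_penalty(table, "deletion", "CGT"))
```

In the toy table the error-prone homopolymer context gets a much lower penalty than the flat default, which is the kind of context dependence the plots above show.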

[Figure: Empirical Genotype Quality vs. Reported Genotype Quality (both axes 0-30) in three panels: UnifiedGenotyper, HaplotypeCaller Original Quality, HaplotypeCaller Calibrated Quality. Rows by genotype: Hom Ref, Het, Hom Var. Color: pGGivenDType (Hom Ref, Het, Hom Var). Point size: log10(Sum), 1.5 to 4.0.]

Making use of these calibrated quality scores improves indel likelihood calibration

Evaluated calls: Per-read-group downsampled (~4x coverage) NA12878 indel calls
Truth: GATK-bundle indel “gold standard” truth sites with high confidence genotypes from deep coverage CEU trio

Another source of bias: We often find consistent (artifactual) alleles at the sites of larger events because they cannot be properly modeled by the mappers

[Figure: validated 30bp deletion at Chr12:15296246 (GTGTGTATGTAAATATATACATACACACAT/-). Top: original BWA alignments. Bottom: alignments showing the actual allele.]

Multiple called artifacts that are hard to filter out, since they are well supported by read data

BAM read bases are all identical; individual alignments differ based on the whims of the mapper

Allele determination is much more accurate through local assembly of candidate haplotypes

[Figure: original BWA alignments at an example site]

-assembly: 1 multi-allelic SNP and two 1bp indels are called
+assembly: Only the complex substitution (TT to TAC) is called

As an added bonus we now get physical phasing for free, which allows us to distinguish between e.g. MNPs and compound hets

[Figure: CEU Trio alignments, one track each for Daughter, Father, and Mother]

Conclusions

• NA12878 (and potentially more of the CEU pedigree) is a great reference sample

• Read data must be in an analysis-ready form which passes a multi-faceted battery of tests related to statistical bias that go beyond sample QC metrics

• Local de novo assembly around mutations is necessary to avoid biases in alleles resulting from a myopic view

Appendix

Multiple sequence alignment itself is not enough for calling indels

• MSA is now a standard piece of the BAM processing pipeline and works well for previously seen indels in getting consistent alignments

• However, it is not empowered to discover large novel indels or more complex alleles

• What is needed is haplotype reconstruction and then calling of variants from the candidate haplotypes

• Several groups (Oxford, Sanger, Broad) are actively working on assembly-based approaches (both global and local) for haplotype-level calling

Assembly of large genomes using second-generation sequencing. Schatz. Genome Research. 2010.

Traverse the graph to enumerate the possible haplotypes. Each edge is weighted by the number of reads which gave evidence for that k-mer.

Step 1: propose haplotypes with local de novo assembler via de Bruijn graphs
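The graph construction and traversal described above can be sketched as follows. This is a toy illustration under stated assumptions, not the assembler used in the talk: nodes are (k-1)-mers, each edge is a k-mer weighted by how many times it was observed, the source and sink nodes are supplied by the caller, and a path is scored by the sum of its edge weights. The function names and the scoring rule are choices made for this example.

```python
from collections import defaultdict


def build_debruijn(reads, k):
    """Return {(prefix, suffix): weight}, where each edge is one k-mer."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1  # one more observation
    return dict(edges)


def enumerate_haplotypes(edges, source, sink, min_weight=1):
    """Depth-first walk from source to sink; each edge is used at most once.

    Returns (sequence, total_edge_weight) pairs, best supported first.
    """
    adjacency = defaultdict(list)
    for (u, v), weight in edges.items():
        if weight >= min_weight:  # drop weakly supported k-mers
            adjacency[u].append((v, weight))

    found = []

    def walk(node, sequence, total, used):
        if node == sink:
            found.append((sequence, total))
            return
        for nxt, weight in adjacency[node]:
            if (node, nxt) not in used:
                walk(nxt, sequence + nxt[-1], total + weight,
                     used | {(node, nxt)})

    walk(source, source, 0, frozenset())
    return sorted(found, key=lambda item: -item[1])


# Toy reads: three copies of a reference-like sequence, one carrying a SNP.
reads = ["ACGTTGCA"] * 3 + ["ACGATGCA"]
graph = build_debruijn(reads, k=4)
for haplotype, support in enumerate_haplotypes(graph, "ACG", "GCA"):
    print(haplotype, support)
```

Each enumerated path is a candidate haplotype; deciding between them is left to the likelihood step.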

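Bayesian model

4 SNP calling

4.1 Simple genotype likelihoods for presentations

\[ \Pr\{G \mid D\} = \frac{\Pr\{G\}\,\Pr\{D \mid G\}}{\sum_i \Pr\{G_i\}\,\Pr\{D \mid G_i\}} \qquad \text{[Bayes' rule]} \]

\[ \Pr\{D \mid G\} = \prod_j \left( \frac{\Pr\{D_j \mid H_1\}}{2} + \frac{\Pr\{D_j \mid H_2\}}{2} \right) \qquad \text{where } G = H_1 H_2 \]

Pr{D|H} is the haploid likelihood function

4.1.1 SNP haploid likelihood

\[ \Pr\{D_j \mid H\} = \Pr\{D_j \mid b\} \qquad \text{[single base pileup]} \]

\[ \Pr\{D_j \mid b\} = \begin{cases} 1 - \epsilon_j & D_j = b, \\ \epsilon_j & \text{otherwise.} \end{cases} \]

4.1.2 Indel haploid likelihood

\[ \Pr\{D_j \mid H\} = \sum_{\text{alignments } \pi \text{ of } D_j \text{ to } H} \Pr\{D_j, \pi\} \]

4.2 Genotype likelihoods

\[ \Pr\{D_i \mid GT_i\} = \prod_j \Pr\{D_{i,j} \mid GT_i\} \]

\[ \Pr\{D_{i,j} \mid GT_i = AB\} = \left( \Pr\{D_{i,j} \mid A\} + \Pr\{D_{i,j} \mid B\} \right) / 2 \]

\[ \Pr\{D_{i,j} \mid B\} = \begin{cases} 1 - \epsilon_{i,j} & D_{i,j} = B, \\ \epsilon_{i,j} \cdot \Pr\{B \text{ is true} \mid D_{i,j} \text{ is miscalled}\} & \text{otherwise.} \end{cases} \]

Annotations: Pr{G} is the prior of the genotype; Pr{D|G} is the likelihood of the genotype; the equal 1/2 weighting of the two haplotypes is the diploid assumption.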
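The SNP case of these formulas (sections 4.1 and 4.1.1) can be written out directly. The sketch below is a minimal illustration that assumes a pileup of (base, Phred quality) pairs at one site and a caller-supplied prior over genotypes; the function names and the flat prior in the example are not from the slides.

```python
def haploid_likelihood(base, qual, allele):
    """Pr{D_j | b}: 1 - eps_j if the read base matches b, eps_j otherwise."""
    eps = 10.0 ** (-qual / 10.0)
    return 1.0 - eps if base == allele else eps


def genotype_likelihood(pileup, genotype):
    """Pr{D | G} for G = H1H2: product over reads of the 50/50 mixture."""
    h1, h2 = genotype
    likelihood = 1.0
    for base, qual in pileup:
        likelihood *= (0.5 * haploid_likelihood(base, qual, h1)
                       + 0.5 * haploid_likelihood(base, qual, h2))
    return likelihood


def genotype_posteriors(pileup, priors):
    """Pr{G | D} by Bayes' rule, normalized over the genotypes in priors."""
    joint = {g: prior * genotype_likelihood(pileup, g)
             for g, prior in priors.items()}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


# Toy site: five A reads and five G reads, all Q30, with a flat prior.
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
flat_prior = {("A", "A"): 1 / 3, ("A", "G"): 1 / 3, ("G", "G"): 1 / 3}
posteriors = genotype_posteriors(pileup, flat_prior)
print(max(posteriors, key=posteriors.get))
```

For long pileups the product underflows in floating point, so a practical version would work in log space; the linear form is kept here to mirror the equations.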

Empirical gap penalties derived from data using new BQSR. Base mismatch penalties are the base quality scores.

Step 2: evaluate candidate haplotype likelihoods with Pair HMM
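A pair HMM scores a read against a candidate haplotype by summing over all alignments, which is the indel haploid likelihood defined in 4.1.2. The forward recursion below is a simplified sketch and not the production implementation. It assumes one constant gap-open and gap-extension penalty for the whole read (the empirical per-base penalties would replace these), a uniform start position along the haplotype, an emission probability of 1 for inserted bases, and a miscalled base being equally likely to be any of the three other bases.

```python
def phred_to_prob(q):
    return 10.0 ** (-q / 10.0)


def pair_hmm_likelihood(read, quals, haplotype,
                        gap_open_q=45.0, gap_ext_q=10.0):
    """Forward algorithm: Pr{read | haplotype} summed over alignments.

    gap_open_q and gap_ext_q are Phred-scaled and constant here, which is
    an assumption of this sketch.
    """
    n, m = len(read), len(haplotype)
    delta = phred_to_prob(gap_open_q)  # probability of opening a gap
    eps = phred_to_prob(gap_ext_q)     # probability of extending a gap

    # Rows index read positions, columns index haplotype positions.
    match = [[0.0] * (m + 1) for _ in range(n + 1)]
    ins = [[0.0] * (m + 1) for _ in range(n + 1)]
    dele = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dele[0][j] = 1.0 / m  # the read may start anywhere on the haplotype

    for i in range(1, n + 1):
        err = phred_to_prob(quals[i - 1])
        for j in range(1, m + 1):
            same = read[i - 1] == haplotype[j - 1]
            emit = 1.0 - err if same else err / 3.0
            match[i][j] = emit * (
                (1.0 - 2.0 * delta) * match[i - 1][j - 1]
                + (1.0 - eps) * (ins[i - 1][j - 1] + dele[i - 1][j - 1])
            )
            ins[i][j] = delta * match[i - 1][j] + eps * ins[i - 1][j]
            dele[i][j] = delta * match[i][j - 1] + eps * dele[i][j - 1]

    # The read may end anywhere on the haplotype.
    return sum(match[n][j] + ins[n][j] for j in range(1, m + 1))


haplotype = "TTACGTTT"
print(pair_hmm_likelihood("ACGT", [30] * 4, haplotype))
print(pair_hmm_likelihood("ACCT", [30] * 4, haplotype))
```

The read that matches the haplotype exactly scores higher than the one carrying a mismatch, and comparing such scores across candidate haplotypes is what feeds the genotype likelihoods.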


The indel size distribution is more accurate when using local assembly of candidate haplotypes

Key: • -assembly • +assembly • fosmid data (truth)

Larger events are missing with previous methods

Variant annotation statistics provide signal with which to evaluate callsets of putative mutations

22 49582364 . A G 198.96 . AB=0.67; AC=3; AF=0.50; AN=6; DP=87; Dels=0.00; HRun=1; MQ=71.31; MQ0=22; QD=2.29; SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78

VCF record for an A/G SNP at 22:49582364

INFO field   Meaning
AC           No. chromosomes carrying alt allele
AN           Total no. of chromosomes
AF           Allele frequency
DP           Depth of coverage
QD           QUAL score over depth
AB           Allele balance of ref/alt in hets
HRun         Length of longest contiguous homopolymer
MQ           RMS MAPQ of all reads
MQ0          No. of MAPQ 0 reads at locus
SB           Evidence for strand bias
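To make the annotations concrete, the record above can be pulled apart with a few lines of code. This is a minimal sketch, not a full VCF parser: it assumes a single data line with the eight fixed columns followed by FORMAT and sample columns, and it splits on whitespace so it accepts tab-delimited input as well. The record is written here without the spaces that the slide shows after each semicolon.

```python
RECORD = (
    "22 49582364 . A G 198.96 . "
    "AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;"
    "MQ=71.31;MQ0=22;QD=2.29;SB=-31.76 "
    "GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78"
)


def parse_vcf_line(line):
    """Split one VCF data line into fixed fields, INFO map, and samples."""
    fields = line.split()
    chrom, pos, _id, ref, alt, qual, _filter, info, fmt = fields[:9]
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")  # flag entries get value ""
        info_map[key] = value
    keys = fmt.split(":")
    samples = [dict(zip(keys, sample.split(":"))) for sample in fields[9:]]
    return {
        "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
        "qual": float(qual), "info": info_map, "samples": samples,
    }


record = parse_vcf_line(RECORD)
info = record["info"]
# AF is AC / AN, and QD is QUAL divided by the INFO depth.
print(int(info["AC"]) / int(info["AN"]))
print(round(record["qual"] / int(info["DP"]), 2))
```

For this record the two derived values agree with the AF and QD stored in the INFO column.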

Variant Quality Score Recalibration (VQSR): modeling error properties of real polymorphism to determine the probability that novel sites are real

The HapMap3 sites from NA12878 HiSeq calls are used to train the GMM. Shown here is the 2D plot of strand bias vs. the variant quality / depth for those sites.

Variants are scored based on their fit to the Gaussians. The variants (here just the novels) clearly separate into good and bad clusters.

Source: DePristo et al. (2011), Nat Genet.

[Figure: histograms of # of SNPs by Fisher Strand Bias Score (0-30) and by Variant Quality / Depth (10-40), split into novels and knowns (dbSNP 132), and into retained and filtered out.]

FS: Fisher Exact Test of Read Strand

QD: Variant Quality / Depth
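The strand bias annotation can be illustrated with a small Fisher's exact test on the 2x2 table of reference/alternate allele by forward/reverse strand. The sketch below assumes the score is the Phred-scaled two-sided p-value; the function names and the read counts in the example are made up for illustration.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return (math.comb(row1, x) * math.comb(row2, col1 - x)
                / math.comb(total, col1))

    observed = table_prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(
        table_prob(x) for x in range(low, high + 1)
        if table_prob(x) <= observed * (1.0 + 1e-7)
    )
    return min(1.0, p_value)


def fisher_strand_score(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled p-value: larger means stronger evidence of strand bias."""
    p_value = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10.0 * math.log10(max(p_value, 1e-320))


# Balanced strands versus an alternate allele seen only on forward reads.
print(fisher_strand_score(10, 10, 10, 10))
print(fisher_strand_score(20, 20, 15, 0))
```

A site whose alternate allele appears on only one strand gets a high score, which is the pattern the filtered-out variants in the histogram show.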

Lots of the filtered out strand biased variants show up at the centromere. Very unlikely to be real SNP mutations.
