Indel Calling Pipeline in the GATK Guillermo del Angel, Ph.D. Genome Sequencing and Analysis Group Medical and Population Genetics Program Feb 17, 2011


Indel Calling Pipeline in the GATK

Guillermo del Angel, Ph.D.

Genome Sequencing and Analysis Group Medical and Population Genetics Program

Feb 17, 2011


What are the GATK’s indel processing abilities?

GATK tools and their functions:

•  IndelRealigner: Runs multiple sequence alignment on reads and forms consensus indels suitable for variant genotyping.

•  UnifiedGenotyper: Determines consensus alternate alleles, optimal allele frequency distribution, determines whether sites should be called, assigns genotypes and annotations.

•  VariantFiltration: Filters calls based on given expressions.

•  VariantEval: Indel metrics and stratifications for analysis.


Step 1: BAM data processing

[Diagram: Input BAMs → Indel Realignment → BAMs used for calling]

•  Indel realignment is a critical step in preparing BAMs for indel calling.

•  We recommend full indel realignment (Smith-Waterman) at all sites; realignment using only known sites is not enough!

Note: Exome BAMs coming out of Picard have already been fully indel-realigned!


Step 2: Indel discovery

[Pipeline diagram: BAMs used for calling → Unified Genotyper (Genotype Likelihoods Calculation → Allele Frequency Calculation) → Hard-filters → Beagle. Diagram annotations: the Genotype Likelihoods Calculation is the only difference with SNP calling; the Allele Frequency Calculation is the same computation as with SNPs.]

•  The genotype likelihoods calculation is inspired by Dindel (with kind permission from C Albers and R Durbin).

•  Typical command line: java -jar GenomeAnalysisTK.jar -R ref.fasta -T UnifiedGenotyper -L mytargets.list -I myreads.bam -o mycalls.vcf -B:dbsnp,VCF dbsnp.vcf -glm DINDEL


Some details and caveats…

•  All standard parameters used in UG for SNP calling are also valid for indels!

–  E.g. -stand_call_conf for a calling threshold.

•  Heuristic for controlling sensitivity:

–  We’ll only consider indels for genotyping if they are present in N reads, controlled by the -minIndelCnt parameter. Default value: 5; you may want a lower value for higher sensitivity in lowpass samples.

•  Limitations:

–  Only bi-allelic sites are considered. If more than 2 alt. alleles are detected at a site, the one with the most supporting reads is taken.

•  NOTE: Application of BAQ will severely degrade indel caller performance. Make sure the -baq argument is either not included or set to OFF!
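The read-count heuristic and the bi-allelic limitation above can be sketched in a few lines of Python. This is only an illustration of the selection rule as described on the slide, not the UnifiedGenotyper's actual code: the function name `pick_candidate_indel`, the allele labels, and the tie-breaking rule are assumptions, and `min_indel_count` stands in for -minIndelCnt.

```python
from typing import Dict, Optional


def pick_candidate_indel(support: Dict[str, int],
                         min_indel_count: int = 5) -> Optional[str]:
    """Return the single alternate allele to genotype at a site, or None.

    `support` maps each candidate indel allele (hypothetical labels such as
    "+A" or "-CT") to the number of reads that contain it.
    """
    if not support:
        return None
    # Bi-allelic limitation: keep only the allele with the most supporting
    # reads. Ties are broken alphabetically, which is an arbitrary choice.
    best = min(support, key=lambda allele: (-support[allele], allele))
    # Sensitivity heuristic: require at least `min_indel_count` reads.
    return best if support[best] >= min_indel_count else None
```

Lowering `min_indel_count` lets weakly supported candidates through, which mirrors the sensitivity trade-off mentioned for lowpass samples.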


Step 3: variant filtration (indels)

[Pipeline diagram: BAMs used for calling → Unified Genotyper (Genotype Likelihoods Calculation → Allele Frequency Calculation) → Hard-filters → Beagle]

•  Hard filters are needed for eliminating calls coming from read artifacts.

•  This is an ongoing area of improvement; stay tuned on the GATK Wiki for best practice recommendations!

•  Example command line with current best practice:

java -jar ./dist/GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta -o out.vcf -B:variant,VCF input.vcf \
  --filterExpression "QUAL<30.0" --filterName "LowQual" \
  --filterExpression "SB>=-1.0" --filterName "StrandBias" \
  --filterExpression "QD<1.0" --filterName "QualByDepth" \
  --filterExpression "(MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1))" --filterName "HARD_TO_VALIDATE" \
  --filterExpression "HRun>=15" --filterName "HomopolymerRun"
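To make the filter logic easier to read, the five expressions from the command line can be restated as plain Python. This is a sketch for reasoning about thresholds, not a replacement for VariantFiltration: the function name `indel_hard_filters` is hypothetical, and it assumes every annotation (QUAL, SB, QD, MQ0, DP, HRun) is present for the call.

```python
from typing import Dict, List


def indel_hard_filters(ann: Dict[str, float]) -> List[str]:
    """Return the names of the hard filters that a call fails.

    `ann` holds the annotations used in the filter expressions:
    QUAL, SB, QD, MQ0, DP and HRun (all assumed to be present).
    """
    failed = []
    if ann["QUAL"] < 30.0:
        failed.append("LowQual")
    if ann["SB"] >= -1.0:
        failed.append("StrandBias")
    if ann["QD"] < 1.0:
        failed.append("QualByDepth")
    # The DP > 0 guard is an addition of this sketch to avoid dividing by zero.
    if ann["MQ0"] >= 4 and ann["DP"] > 0 and ann["MQ0"] / (1.0 * ann["DP"]) > 0.1:
        failed.append("HARD_TO_VALIDATE")
    if ann["HRun"] >= 15:
        failed.append("HomopolymerRun")
    return failed
```

A call passes when the returned list is empty; otherwise the list gives the filter names that would be written to the FILTER field.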


Step 4 (Optional): Genotype refinement

[Pipeline diagram: BAMs used for calling → Unified Genotyper (Genotype Likelihoods Calculation → Allele Frequency Calculation) → Hard-filters → Beagle]

•  Beagle can be used to refine genotypes of indel calls. The current recommended best practice is to merge indel and SNP calls and run Beagle on the combined set. More details are on our Wiki page.


Assessing indel callsets

•  How do we know if the callset that we have is of high sensitivity and high specificity?

•  How many variants should we typically get?

•  How should indels be distributed in size, allele frequency and types of indels?


VariantEval’s support for Indels

java -jar GenomeAnalysisTK.jar -B:eval,VCF mycalls.vcf -T VariantEval -R reffile.fasta -EV IndelMetricsByAC -EV IndelStatistics -B:dbsnp,VCF dbsnp.vcf -o output.txt

This produces a GATK report file with aggregated statistics.

##:GATKReport.v0.1 CountVariants : Counts different classes of variants in the sample

CountVariants table, shown with one column per report row. In every row CompRod=dbsnp, EvalRod=eval, JexlExpression=none and nProcessedLoci=63025520.

CpG                      CpG              CpG              CpG              all
Novelty                  all              known            novel            all
nCalledLoci              1215             1000             215              13580
nRefLoci                 0                0                0                0
nVariantLoci             1215             1000             215              13580
variantRate              0.00001928       0.00001587       0.00000341       0.00021547
variantRatePerBp         51872.00000000   63025.00000000   293141.00000000  4641.00000000
nSNPs                    0                0                0                0
nInsertions              611              491              120              6649
nDeletions               604              509              95               6931
nComplex                 0                0                0                0
nNoCalls                 0                0                0                0
nHets                    724              567              157              8852
nHomRef                  0                0                0                0
nHomVar                  491              433              58               4728
nSingletons              0                0                0                0
heterozygosity           0.00001149       0.00000900       0.00000249       0.00014045
heterozygosityPerBp      87051.00000000   111156.00000000  401436.00000000  7119.00000000
hetHomRatio              1.47454175       1.30946882       2.70689655       1.87225042
indelRate                0.00001928       0.00001587       0.00000341       0.00021547
indelRatePerBp           51872.00000000   63025.00000000   293141.00000000  4641.00000000
deletionInsertionRatio   0.98854337       1.03665988       0.79166667       1.04241239

Key module! Produces indel size distributions as well as classification tables
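As a sanity check on a report like the one above, several of the derived columns can be recomputed from the raw counts. The formulas in this sketch are inferred from the numbers shown in the report (for example, the "PerBp" columns look like truncated integer quotients), not taken from the VariantEval source, and the function name `derived_indel_metrics` is hypothetical.

```python
from typing import Dict


def derived_indel_metrics(n_processed_loci: int, n_variant_loci: int,
                          n_insertions: int, n_deletions: int,
                          n_hets: int, n_hom_var: int) -> Dict[str, float]:
    """Recompute derived CountVariants columns from the raw counts."""
    return {
        "variantRate": n_variant_loci / n_processed_loci,
        # Integer division: the report appears to truncate this value.
        "variantRatePerBp": n_processed_loci // n_variant_loci,
        "heterozygosity": n_hets / n_processed_loci,
        "heterozygosityPerBp": n_processed_loci // n_hets,
        "hetHomRatio": n_hets / n_hom_var,
        "deletionInsertionRatio": n_deletions / n_insertions,
    }


# Counts taken from the CpG=all, Novelty=all row of the report.
all_sites = derived_indel_metrics(63025520, 13580, 6649, 6931, 8852, 4728)
```

Comparing these values with the report columns is a quick way to confirm that a report was parsed with the right column alignment.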


How many indels should I get?

Lowpass example AC distribution (Chr20 only):

Exome example AC distribution:

High heterozygosity in lowpass call set is leading us to focus on improving specificity of our calls.

[Figure: Total indels by Allele Count, target captured exomes, N=96, 1/Het=65916.9. x-axis: Allele count; y-axis: Number of variants. Series: GATK UnifiedGenotyper; Neutral expectation ( 0.9*0.000015*32950014.0*(1/c(1:192)) )]

High number of singletons: maybe also a lot of false positives?

[Figure: Total indels by Allele Count, pop= ASN, N=266, 1/Het=8000.0. x-axis: Allele count; y-axis: Number of variants. Series: GATK UnifiedGenotyper; Neutral expectation ( 0.9*0.000125*63025520.0*(1/c(1:532)) )]

Published estimates in whole genome: ~1 indel/8000 bp

Empirical exome estimate: ~500 indels/exome (33 Mbp)
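The "Neutral expectation" curves in the two plot legends are given as R expressions of the form 0.9*Het*L*(1/c(1:2N)). A Python equivalent is sketched below, assuming that reading of the legend. The function name `neutral_expectation` is hypothetical, the 0.9 factor is copied from the legends without further interpretation, and the upper limits 192 and 532 are simply twice the sample counts N=96 and N=266.

```python
from typing import List


def neutral_expectation(het: float, n_bases: float, n_chromosomes: int,
                        scale: float = 0.9) -> List[float]:
    """Expected number of variants at each allele count 1..n_chromosomes.

    Mirrors the R expression scale * het * n_bases * (1 / c(1:n_chromosomes)).
    """
    return [scale * het * n_bases / k for k in range(1, n_chromosomes + 1)]


# Parameters copied from the two plot legends.
lowpass = neutral_expectation(0.000125, 63025520.0, 532)  # pop=ASN, N=266
exome = neutral_expectation(0.000015, 32950014.0, 192)    # exomes, N=96
```

Under this model the expected count falls off as 1/k with allele count k, so an observed excess at k=1 relative to the curve is what the slide flags as a possible sign of false-positive singletons.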


A typical plot of indel size distribution in whole genome sets

[Figure: Indel Size Distribution for low-pass 1000G samples, GATK, pop = ASN. x-axis: Event Size (-: deletion, +: insertion), from -30 to 30; y-axis: Event Count per sample.]

Notice that:

•  Counts are very consistent among samples

•  Counts are mostly symmetric between insertions and deletions

•  Even counts higher than odd counts


Indel size distribution in exomes

[Figure: Indel Size Distribution for 96 exome capture 1000G samples, GATK. x-axis: Event Size (-: deletion, +: insertion), from -30 to 30; y-axis: Event Count per sample.]

•  Strong bias against frameshift indels!

•  Count per sample much lower than in whole genome (~500 indels per sample)

•  More deletions than insertions

•  We believe many of the 1-bp events should be artifacts
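A size distribution like the ones plotted can be tallied directly from the REF and ALT alleles of a callset. The sketch below assumes VCF-style allele strings and uses hypothetical function names (`indel_size_distribution`, `frameshift_fraction`). It treats any event whose length is not a multiple of 3 as a frameshift, which is only meaningful for indels inside coding sequence.

```python
from collections import Counter
from typing import Iterable, Tuple


def indel_size_distribution(alleles: Iterable[Tuple[str, str]]) -> Counter:
    """Count events by signed size: negative = deletion, positive = insertion."""
    sizes: Counter = Counter()
    for ref, alt in alleles:
        size = len(alt) - len(ref)
        if size != 0:  # skip records that do not change length
            sizes[size] += 1
    return sizes


def frameshift_fraction(sizes: Counter) -> float:
    """Fraction of events whose length is not a multiple of 3."""
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    shifted = sum(n for size, n in sizes.items() if size % 3 != 0)
    return shifted / total
```

A low frameshift fraction in an exome callset would be consistent with the bias against frameshift indels noted above, while an excess of size ±1 events may point to the 1-bp artifacts the slide mentions.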


Different indel types come at different rates

[Figure: Total indels by Indel type, pop= ASN, N=266. y-axis: # of Indels. x-axis categories: Novel (A), Novel (C), Novel (G), Novel (T), Novel L=1 through Novel L=10+, RepExp (A), RepExp (C), RepExp (G), RepExp (T), RepExp (AC), RepExp (AG), RepExp (AT), RepExp (CA), RepExp (CG), RepExp (CT), RepExp (GA), RepExp (GC), RepExp (GT), RepExp (TA), RepExp (TC), RepExp (TG), RepExp L=1 through RepExp L=10+, Other.]

Consistent difference in the number of A,T vs. C,G repeat expansions: Ratio of (A+T)/(C+G) expansions can be used as a specificity metric!

These counts, for the Phase 1 1000 Genomes samples, are shown for illustration only.
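The (A+T)/(C+G) ratio proposed above as a specificity metric is simple to compute once repeat expansions have been counted per base. The sketch below assumes a mapping from the expanded base to its count, as in the RepExp (A), (C), (G), (T) categories of the plot. The function name `at_cg_expansion_ratio` is hypothetical, and the slides do not state what ratio value to expect.

```python
from typing import Mapping


def at_cg_expansion_ratio(rep_exp_counts: Mapping[str, int]) -> float:
    """Ratio of (A+T) to (C+G) single-base repeat expansion counts."""
    at = rep_exp_counts.get("A", 0) + rep_exp_counts.get("T", 0)
    cg = rep_exp_counts.get("C", 0) + rep_exp_counts.get("G", 0)
    if cg == 0:
        raise ValueError("no C or G repeat expansions to compare against")
    return at / cg


# Toy counts for illustration only; they are not read from the plot.
ratio = at_cg_expansion_ratio({"A": 30, "T": 30, "C": 5, "G": 5})
```

Tracking this ratio across filtering thresholds gives one number to compare callsets with, in the spirit of the slide's suggestion.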


Other callers

– Aside from the GATK, SAMTools and DINDEL can alternatively be used for indel calling.

– Example command line using SAMTools’ mpileup caller: samtools mpileup -ugf ref.fasta reads.bam | ../samtools/bcftools/bcftools view -vc - > myout.vcf

– More info at:

•  http://samtools.sourceforge.net/mpileup.shtml

•  http://www.sanger.ac.uk/resources/software/dindel/