Whole-Exome Sequencing: Technical Details

Jim Mullikin
Director, NIH Intramural Sequencing Center
Head, Comparative Genomics Unit

Whole Exome Sequencing, Why?

• Focuses on the part of the genome we understand best, the exons of genes
• Exomes are ideal to help us understand high-penetrance allelic variation and its relationship to phenotype
• A whole exome is 1/6 the cost of whole genome and 1/15 the amount of data

Biesecker et al. Genome Biology 2011, 12:128

Twinbrook Research Building
5625 Fishers Lane, Rockville, MD
NISC occupies entire 5th floor

NISC Sequence Production, Feb 2010

Sequencing applications processed through NGS production pipeline from April 2009 to August 2011

[Figure: Number of Samples per sequencing application]

Total FY11 Sequenced: 169 billion reads
Total NISC Lifetime Sequenced: 255 billion reads

Cimarron Software based NextGen LIMS

Computational Resources for 6 GAiiX and 3 HiSeq 2000

• Linux cluster
  – 1000 cores
    • 250 for production
  – 900TB disk
    • 250TB for production with 75TB available
    • 15TB/month long term storage
  – Network
    • 1 and 10 Gigabit-Ethernet


Exome Sequencing Pipeline

• Sample DNA Fragmentation
• Illumina Library Preparation
• Exome Enrichment
• Cluster Generation
• Sequencing and Basecalling
• Sequence Read Alignment
• Variation Detection

Library Preparation

http://www.illumina.com/support/literature.ilmn

Agilent Technologies SureSelect method

Whole-exome kit, 38Mb and 50Mb

http://cp.literature.agilent.com/litweb/pdf/5990-3532EN.pdf

• Illumina TruSeq Exome Enrichment
• 62Mb of exome targeted

http://www.illumina.com/support/literature.ilmn

NISC Exome Process

• Indexed Libraries (Tag 1, Tag 2, Tag 3, Tag 4, Tag 5, Tag 6)
• Balance and Pool
• Illumina TruSeq Exome Enrichment
• Sequence on two lanes
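Because six indexed libraries share the same lanes, each read has to be assigned back to its sample by index tag after sequencing. The Python sketch below shows the general idea only; the tag sequences, sample names, and the `demultiplex` function are hypothetical placeholders, and exact-match lookup is a simplifying assumption rather than a description of the production software.

```python
from collections import defaultdict

def demultiplex(reads, tag_to_sample):
    """Group read sequences by sample using an exact match on the index read.

    reads: iterable of (index_read, sequence) pairs.
    tag_to_sample: dict mapping an index tag to a sample name.
    Reads whose index matches no known tag go to "undetermined".
    """
    bins = defaultdict(list)
    for index_read, sequence in reads:
        sample = tag_to_sample.get(index_read, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)

# Hypothetical tags and reads, for illustration only.
tags = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
reads = [
    ("ACGTAC", "TTGACC"),
    ("TGCATG", "GGCATA"),
    ("NNNNNN", "ACGTAA"),  # unreadable index, cannot be assigned
    ("ACGTAC", "CCATGG"),
]
bins = demultiplex(reads, tags)
for sample, sequences in sorted(bins.items()):
    print(sample, len(sequences))
```

A production demultiplexer would usually also tolerate a small number of mismatches in the index read; that is left out here to keep the sketch short.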

Cluster Generation

http://www.illumina.com/support/literature.ilmn

Sequencing

http://www.illumina.com/support/literature.ilmn

HiSeq Flow Cell

• 8 lanes
• 1.5 billion clusters
• Up to 300Gb per flow cell

Sequencing and Data Processing

• ~150M clusters per lane
• 2x100 base paired-end reads and the index tag
• Demultiplex (Tag 1, Tag 2, Tag 3, Tag 4, Tag 5, Tag 6)
• Alignment: ELAND to Human Genome
• Total sequence per sample: ~10Gb from two lanes
• Over 100X coverage of targeted regions
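The ~10Gb per sample figure follows from simple arithmetic, sketched below in Python. The cluster count, read length, lane count, and six-sample pool come from the slides; the 62 Mb target is the TruSeq size quoted earlier, and the function names are illustrative. The depth computed here is a raw upper bound, since it ignores reads lost to quality filtering, duplicates, and off-target capture.

```python
def raw_bases_per_sample(clusters_per_lane, read_length, reads_per_cluster,
                         lanes, samples):
    """Raw bases per sample when a pool is split evenly across samples."""
    total_bases = clusters_per_lane * read_length * reads_per_cluster * lanes
    return total_bases / samples

def fold_coverage(bases, target_size):
    """Average depth if every base fell on the target."""
    return bases / target_size

# ~150M clusters per lane, 2x100 paired-end reads, two lanes, six tagged samples.
per_sample = raw_bases_per_sample(150_000_000, 100, 2, 2, 6)
raw_depth = fold_coverage(per_sample, 62_000_000)  # 62 Mb TruSeq target

print(f"{per_sample / 1e9:.1f} Gb per sample")
print(f"{raw_depth:.0f}x raw depth before any losses")
```

The gap between this raw bound and the reported depth of just over 100X is the combined cost of filtering, alignment, and capture specificity.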

[Figure: Read Depth-of-Coverage across ITGA6, two panels, y-axis up to 400]

[Figure: Targeted Regions Depth-of-Coverage Histogram. x-axis: Depth of Coverage (0 to 200); y-axis: Fraction per Bin (0 to 0.02); series: TruSeq_example, Poisson]
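The histogram overlays the observed depth distribution on a Poisson reference curve. A minimal Python sketch of how such a comparison could be set up is shown below; the mean depth of 100 and the toy depth values are assumptions for illustration, not data from the talk, and the function names are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """P(depth == k) under a Poisson model, computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def depth_fractions(depths, max_depth):
    """Fraction of positions observed at each depth 0..max_depth.

    Depths above max_depth still count toward the denominator
    but are not given a bin.
    """
    counts = [0] * (max_depth + 1)
    for d in depths:
        if 0 <= d <= max_depth:
            counts[d] += 1
    return [c / len(depths) for c in counts]

mean_depth = 100  # assumed mean, for illustration
reference = [poisson_pmf(k, mean_depth) for k in range(201)]

toy_depths = [98, 100, 100, 103]  # toy per-position depths
observed = depth_fractions(toy_depths, 200)
print(observed[100], round(reference[100], 4))
```

Capture efficiency tends to vary from target to target, so an observed curve need not follow the Poisson reference closely.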

Refining the Alignment (diagCM)

• ELAND is part of the standard pipeline
• ELAND accurately places reads in the correct genomic location
• Use cross-match, a Smith-Waterman aligner, to improve local alignment
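For readers unfamiliar with Smith-Waterman, the sketch below computes the best local alignment score between two sequences using the textbook recurrence. It is not cross-match's code: the scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions chosen to keep the example small.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# A read carrying a one-base deletion relative to the reference:
# seven matches and one gap score 7 - 2 = 5.
print(smith_waterman_score("ACGTACGT", "ACGTCGT"))
```

Because gaps are scored explicitly, this kind of realignment can place small insertions and deletions that a mismatch-oriented aligner might represent as a run of substitutions.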

Eland version XXX

[Figure: ELAND Aligned Reads (eland_ms_32, CASAVA 1.7)]

[Figure: Cross-match Improved Alignment]

[Figure: Comparison of Aligners (simulated exome data). x-axis: Variant Size (-6 to 6); y-axis: Percentage Correct (85 to 100)]

Bayesian Genotype Calling: Most Probable Genotype

[Figure: reads aligned over an example site]

P(Genotype|Data), then converted to natural log probabilities and sorted:

Genotype  P(Genotype|Data)      Genotype  ln P (sorted)
AA        10^-6                 AT            0
AC        10^-15                AA          -14
AG        10^-15                AC          -34
AT        0.99999981            AG          -34
CC        10^-75                TT         -124
CG        10^-75                CT         -138
CT        10^-60                GT         -138
GG        10^-75                CC         -172
GT        10^-60                CG         -172
TT        10^-54                GG         -172

MPG Score = 14: the difference between the most probable and next most probable genotype
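The score on this slide can be reproduced from the ten posterior probabilities: take natural logs, sort, and subtract the second-best value from the best. The Python sketch below is a reading of the slide, not the MPG source code; the function name `mpg_call` is hypothetical.

```python
import math

# Posterior probabilities P(Genotype|Data) as shown on the slide.
posteriors = {
    "AA": 1e-6, "AC": 1e-15, "AG": 1e-15, "AT": 0.99999981,
    "CC": 1e-75, "CG": 1e-75, "CT": 1e-60,
    "GG": 1e-75, "GT": 1e-60, "TT": 1e-54,
}

def mpg_call(probs):
    """Return (most probable genotype, score).

    The score is the gap in natural-log probability between the
    best and second-best genotypes.
    """
    ranked = sorted(((math.log(p), g) for g, p in probs.items()), reverse=True)
    (best_ln, best_genotype), (next_ln, _) = ranked[0], ranked[1]
    return best_genotype, best_ln - next_ln

genotype, score = mpg_call(posteriors)
print(genotype, round(score))
```

Working in log space keeps very small probabilities such as 10^-75 comparable without underflow, and the difference reads directly as a confidence margin.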

MPG of Haploid Regions

• Human autosomes are normally diploid
• MPG is designed to call two alleles for the autosomes, and for the X chromosome if the sample is from a female
• For samples from males, MPG is run in haploid mode on the non-pseudoautosomal regions of X and Y
• Thus only testing for the four nucleotides
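The difference between the two modes is the size of the hypothesis space: ten unordered genotypes for a diploid site, four for a haploid one. The Python sketch below enumerates both. The `ploidy_for` helper and its `in_par` flag (whether the site lies in a pseudoautosomal region) are hypothetical names used for illustration, not part of MPG.

```python
from itertools import combinations_with_replacement

NUCLEOTIDES = "ACGT"

def genotype_hypotheses(ploidy):
    """All unordered genotypes to test at a site for the given ploidy."""
    return ["".join(g) for g in combinations_with_replacement(NUCLEOTIDES, ploidy)]

def ploidy_for(chrom, sex, in_par=False):
    """Haploid for male X and Y outside pseudoautosomal regions, else diploid.

    chrY in female samples is not handled here.
    """
    if sex == "male" and chrom in ("chrX", "chrY") and not in_par:
        return 1
    return 2

print(genotype_hypotheses(2))
print(genotype_hypotheses(ploidy_for("chrX", "male")))
```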

[Figure: Exome Coverage versus Input Sequence]

[Figure: Whole Exomes Processed at NISC. x-axis: Mar 2010 to Sep 2011; y-axis: Number of Samples (0 to 1000); series: SureSelect 38Mb, SureSelect 50Mb, Illumina TruSeq 62Mb]

[Figure: NBEAL2]

[Figure: NBEAL2 5']

Exome Variation Statistics: TruSeq 62Mb, Male Sample

Type    Total Genotype Calls    SNVs      Within-sample Heterozygosity
Total   133,047,403             142,361   0.00072
Auto    125,491,045             139,295   0.00076
chrX    6,842,299               2,600     NA
chrY    710,243                 1,435     NA

Exome Variation Statistics: TruSeq 62Mb, Female Sample

Type    Total Genotype Calls    SNVs      Within-sample Heterozygosity
Total   125,681,915             136,993   0.00075
Auto    120,559,746             132,616   0.00076
chrX    5,096,376               2,701     0.00034

[Figure: Example Heterozygous SNV]

[Figure: Example Heterozygous Deletion]

Coverage of MPG >= 10 Genotype Calls

                       Total Raw   Aligned          Genotype calls   Genotype calls
                       Sequence    Sequence         CCDS             UCSC coding
SureSelect 38Mb        6.7 Gb      5.0 Gb (131x)    89%              74%
SureSelect 50Mb        10.5 Gb     6.1 Gb (122x)    89%              85%
TruSeq 62Mb            9.0 Gb      7.1 Gb (114x)    91%              89%
Whole Genome Shotgun   192 Gb      133 Gb (44x)     86%              83%
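The fold-coverage values in parentheses are aligned bases divided by target size. The Python sketch below reproduces them, reading the aligned-sequence column as 5.0, 6.1, 7.1, and 133 Gb. The 3.0 Gb genome size used for the whole-genome row is an assumption, since the slides do not state it.

```python
# (aligned bases, target size in bases, fold coverage quoted on the slide)
runs = {
    "SureSelect 38Mb": (5.0e9, 38e6, 131),
    "SureSelect 50Mb": (6.1e9, 50e6, 122),
    "TruSeq 62Mb": (7.1e9, 62e6, 114),
    "Whole Genome Shotgun": (133e9, 3.0e9, 44),  # genome size is an assumption
}

# Computed fold coverage per design.
fold = {name: aligned / target for name, (aligned, target, _) in runs.items()}

for name, (_, _, quoted) in runs.items():
    print(f"{name}: computed {fold[name]:.1f}x, slide {quoted}x")
```

The computed values agree with the slide to within the rounding of the tabulated Gb figures.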

Genotype Concordance

                       Total Agreement with Genotype Chip (CCDS)
Whole Genome Shotgun   99.908%
SureSelect 38Mb        99.910%
SureSelect 50Mb        99.857%
TruSeq 62Mb            99.865%
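Concordance of this kind is the percentage of sites genotyped by both methods where the two calls agree. The Python sketch below shows one way to compute it; the site keys, genotype strings, and the `concordance` function are illustrative assumptions, and the toy data are not from the talk.

```python
def concordance(seq_calls, chip_calls):
    """Percent agreement over sites called by both methods.

    Genotypes are compared as unordered allele pairs, so "AG" equals "GA".
    """
    shared = [site for site in seq_calls if site in chip_calls]
    if not shared:
        raise ValueError("no sites in common")
    agree = sum(
        1 for site in shared
        if sorted(seq_calls[site]) == sorted(chip_calls[site])
    )
    return 100.0 * agree / len(shared)

# Toy calls keyed by (chromosome, position).
sequencing = {("chr1", 100): "AG", ("chr1", 200): "CC",
              ("chr2", 50): "TT", ("chr2", 75): "AC"}
chip = {("chr1", 100): "GA", ("chr1", 200): "CC",
        ("chr2", 50): "CT", ("chr2", 75): "AC", ("chr3", 10): "GG"}

pct = concordance(sequencing, chip)
print(f"{pct:.1f}% agreement")
```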

Whole Exome Sequencing

• Being applied to
  – Undiagnosed Diseases Program (100's of samples)
  – ClinSeq (>1000 samples)
  – Variety of other PI driven projects (e.g. cancer)
• Data generation rate per year
  – 200 exomes per GAiiX
  – 1200 exomes per HiSeq 2000
• Analysis results
  – Genotype data for 90% of consensus coding exon bases (CCDS)
  – Accuracy of genotype calls over 99.5%


Variant Annotation and Working With Whole-Exome Data

• One sample produces > 100k variants
• One hundred samples gives rise to 600k or more
• How does one work with such large datasets?
• The next speaker, Dr. Jamie Teer, will address these next steps

Acknowledgements

NIH Intramural Sequencing Center

• Mullikin Lab – Nancy Hansen
• Sequencing Operations – Pedro Cruz – Bob Blakesley – Praveen Cherukuri – Alice Young – Lab Staff
• Biesecker Lab
• Bioinformatics – Jamie Teer – Gerry Bouffard – Baishali Maskeri – Jenny McDowell – Meg Vemulapalli
• IT Linux Support – Jesse Becker – Matt Lesko

http://research.nhgri.nih.gov

Page 2: Whole-Exome Sequencing: Technical Details

Twinbrook ResearchBuilding

5625FishersLaneRockvilleMD

NISCoccupies entire5th floor

NISCSequenceProduction

Feb 2010

2

Sequencing applications processed through NGS production pipeline from April 2009 to August 2011

Number of Samples

Total FY11 Sequenced 169 billion reads

Total NISC Lifetime Sequenced 255 billion reads

4

CimarronSoftwarebasedNextGen LIMS

Computational Resources for 6 GAiiX and 3 HiSeq2000

bull Linux cluster ndash 1000 cores

bull 250 for production ndash 900TB disk

bull 250TB for production with 75TB available bull 15TBmonth long term storage

ndash Network bull 1 and 10 Gigabit-Ethernet

5

Number of Samples

Sequencing applications processed through NGS production pipeline from April 2009 to August 2011

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

6

Library Preparation

httpwwwilluminacomsupportliteratureilmn

AgilentTechnologies SureSelect method

WholeͲexome kit 38Mband50Mb

httpcpliteratureagilentcomlitwebpdf5990-3532ENpdf

7

bull Illumina TruSeq Exome Enrichment bull 62Mb of exome targeted

httpwwwilluminacomsupportliteratureilmn

NISC Exome Process Tag 1

Indexed Libraries Tag 2 Tag 3 Tag 4 Tag 5 Tag 6 Balance

and Pool

Illumina TruSeq Exome Enrichment

Sequence on two lanes

8

Cluster Generation

httpwwwilluminacomsupportliteratureilmn

Sequencing

httpwwwilluminacomsupportliteratureilmn

9

HiSeq Flow Cell

bull 8 lanes bull 15 billion clusters bull Up to 300Gb per flow cell

Sequencing and Data Processing ~150M clusters per lane

2x100 base paired-end reads and the index tag

Tag 1 Tag 2 Tag 3 Tag 4 Tag 5 Tag 6

Demultiplex

Alignment ELAND to Human Genome

Total sequence per sample ~10Gb from two lanes Over 100X coverage of targeted regions

10

Read Depth-of-Coverage

400

ITGA6

Read Depth-of-Coverage

400

ITGA6

11

Targeted Regions Depth-of-Coverage Histogram

0

0005

001

0015

002

0 50 100 150 200

Depth of Coverage

Frac

tion

per B

in

TruSeq_example Poisson

Refining the Alignment (diagCM)

bull ELAND is part of the standard pipeline

bull ELAND accurately places reads in the correct genomic location

bull Use cross-match a Smith-Waterman aligner to improve local alignment

12

Eland version XXX

ELAND Aligned Reads

eland_ms_32 Casava 17

Cross-match Improved Alignment

13

Comparison of Aligners (simulated exome data)

Perc

enta

ge C

orre

ct

85

90

9510

0

-6 -4 -2 0 2 4 6

Variant Size

BayesianGenotypeCalling MostProbableGenotype

ACCTTCCTCGAGGCTAGCCTAGCCGGGGTGGGCAAGGCCTTCCGGGGGGGTAACCTCCTCCAAAACCTCCCTATGCCCCCCCAATTTTTTATATATATCCTCCTCCAA A

A

A A

A

A

A A

A A

A A

P(Genotype|Data)

AT 0 AA -14

Difference between most probable AA 10-6

10-15 and next most probable genotype AC 10-15AG AC -34

AT 099999981 AG -34 10-75CC TT -124 MPG Score = 14 10-75 CT -138CG 10-60CT Convert to natural Log GT -138

GG 10-75 Probabilities and Sort CC -172 10-60GT CG -172 10-54TT GG -172

14

MPG of Haploid Regions

bull Human autosomes are normally diploid bull MPG is designed to call two alleles for the

autosomes and X chromosome if the sample is from a female

bull For samples from males MPG is run in haploid mode on the non-sudoautosomalregions of X and Y

bull Thus only testing for the four nucleotides

Exome Coverage versus Input Sequence

15

Num

ber o

f Sam

ples

SureSelect 38Mb

SureSelect 50Mb

Illumina TruSeq 62Mb

0 20

0 40

0 60

0 80

0 10

00

Whole Exomes Processed at NISC

Mar 2010 Dec 2010 Sep 2011

NBEAL2

16

NBEAL2 5rsquo

Exome Variation Statistics TruSeq 62Mb Male Sample

Type

Total Genotype

Calls SNVs Within-sample Heterozygosity

Total 133047403 142361 000072

Auto 125491045 139295 000076

chrX 6842299 2600 NA

chrY 710243 1435 NA

17

Exome Variation Statistics TruSeq 62Mb Female Sample

Type

Total Genotype

Calls SNVs Within-sample Heterozygosity

Total 125681915 136993 000075

Auto 120559746 132616 000076

chrX 5096376 2701 000034

Example Heterozygous SNV

18

Example Heterozygous Deletion

Coverage of MPG gt= 10 Genotype Calls

Total Raw Sequence

Aligned Sequence

Genotype calls CCDS

Genotype calls UCSC coding

SureSelect 38Mb 67 Gb

50 Gb (131x)

89 74

SureSelect 50Mb 105 Gb

61 Gb (122x)

89 85

TruSeq 62Mb 90 Gb

71 Gb (114x)

91 89

Whole Genome Shotgun

192 Gb 133 Gb (44x)

86 83

19

GenotypeConcordance

Total Agreement with Genotype Chip (CCDS)

Whole Genome Shotgun 99908

SureSelect 38Mb 99910

SureSelect 50Mb 99857

TruSeq 62Mb 99865

Whole Exome Sequencing bull Being applied to

ndash Undiagnosed Diseases Program (100rsquos of samples)

ndash ClinSeq (gt1000 samples) ndash Variety of other PI driven projects (eg cancer)

bull Data generation rate per year ndash 200 exomes per GAiiX ndash 1200 exomes per HiSeq 2000

bull Analysis results ndash Genotype data for 90 of consensus coding exon

bases (CCDS) ndash Accuracy of genotype calls over 995

20

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

Variant Annotation and Working With Whole-Exome Data

bull One sample produces gt 100k variants

bull One hundred samples gives rise to 600k or more

bull How does one work with such large datasets

bull The next speaker Dr Jamie Teer will address these next steps

21

Acknowledgements NIH Intramural Sequencing bull Mullikin Lab

Center ndash Nancy Hansen

bull Sequencing Operations ndash Pedro Cruz ndash Bob Blakesley ndash Praveen Cherukuri ndash Alice Young ndash Lab Staff bull Biesecker Lab

bull Bioinformatics ndash Jamie Teer ndash Gerry Bouffard ndash Baishali Maskeri ndash Jenny McDowell ndash Meg Vemulapalli

bull IT Linux Support ndash Jesse Becker ndash Matt Lesko

httpresearchnhgrinihgov

22

23

Page 3: Whole-Exome Sequencing: Technical Details

Sequencing applications processed through NGS production pipeline from April 2009 to August 2011

Number of Samples

Total FY11 Sequenced 169 billion reads

Total NISC Lifetime Sequenced 255 billion reads

4

CimarronSoftwarebasedNextGen LIMS

Computational Resources for 6 GAiiX and 3 HiSeq2000

bull Linux cluster ndash 1000 cores

bull 250 for production ndash 900TB disk

bull 250TB for production with 75TB available bull 15TBmonth long term storage

ndash Network bull 1 and 10 Gigabit-Ethernet

5

Number of Samples

Sequencing applications processed through NGS production pipeline from April 2009 to August 2011

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

6

Library Preparation

httpwwwilluminacomsupportliteratureilmn

AgilentTechnologies SureSelect method

WholeͲexome kit 38Mband50Mb

httpcpliteratureagilentcomlitwebpdf5990-3532ENpdf

7

bull Illumina TruSeq Exome Enrichment bull 62Mb of exome targeted

httpwwwilluminacomsupportliteratureilmn

NISC Exome Process Tag 1

Indexed Libraries Tag 2 Tag 3 Tag 4 Tag 5 Tag 6 Balance

and Pool

Illumina TruSeq Exome Enrichment

Sequence on two lanes

8

Cluster Generation

httpwwwilluminacomsupportliteratureilmn

Sequencing

httpwwwilluminacomsupportliteratureilmn

9

HiSeq Flow Cell

bull 8 lanes bull 15 billion clusters bull Up to 300Gb per flow cell

Sequencing and Data Processing ~150M clusters per lane

2x100 base paired-end reads and the index tag

Tag 1 Tag 2 Tag 3 Tag 4 Tag 5 Tag 6

Demultiplex

Alignment ELAND to Human Genome

Total sequence per sample ~10Gb from two lanes Over 100X coverage of targeted regions

10

Read Depth-of-Coverage

400

ITGA6

Read Depth-of-Coverage

400

ITGA6

11

Targeted Regions Depth-of-Coverage Histogram

0

0005

001

0015

002

0 50 100 150 200

Depth of Coverage

Frac

tion

per B

in

TruSeq_example Poisson

Refining the Alignment (diagCM)

bull ELAND is part of the standard pipeline

bull ELAND accurately places reads in the correct genomic location

bull Use cross-match a Smith-Waterman aligner to improve local alignment

12

Eland version XXX

ELAND Aligned Reads

eland_ms_32 Casava 17

Cross-match Improved Alignment

13

Comparison of Aligners (simulated exome data)

Perc

enta

ge C

orre

ct

85

90

9510

0

-6 -4 -2 0 2 4 6

Variant Size

BayesianGenotypeCalling MostProbableGenotype

ACCTTCCTCGAGGCTAGCCTAGCCGGGGTGGGCAAGGCCTTCCGGGGGGGTAACCTCCTCCAAAACCTCCCTATGCCCCCCCAATTTTTTATATATATCCTCCTCCAA A

A

A A

A

A

A A

A A

A A

P(Genotype|Data)

AT 0 AA -14

Difference between most probable AA 10-6

10-15 and next most probable genotype AC 10-15AG AC -34

AT 099999981 AG -34 10-75CC TT -124 MPG Score = 14 10-75 CT -138CG 10-60CT Convert to natural Log GT -138

GG 10-75 Probabilities and Sort CC -172 10-60GT CG -172 10-54TT GG -172

14

MPG of Haploid Regions

bull Human autosomes are normally diploid bull MPG is designed to call two alleles for the

autosomes and X chromosome if the sample is from a female

bull For samples from males MPG is run in haploid mode on the non-sudoautosomalregions of X and Y

bull Thus only testing for the four nucleotides

Exome Coverage versus Input Sequence

15

Num

ber o

f Sam

ples

SureSelect 38Mb

SureSelect 50Mb

Illumina TruSeq 62Mb

0 20

0 40

0 60

0 80

0 10

00

Whole Exomes Processed at NISC

Mar 2010 Dec 2010 Sep 2011

NBEAL2

16

NBEAL2 5rsquo

Exome Variation Statistics TruSeq 62Mb Male Sample

Type

Total Genotype

Calls SNVs Within-sample Heterozygosity

Total 133047403 142361 000072

Auto 125491045 139295 000076

chrX 6842299 2600 NA

chrY 710243 1435 NA

17

Exome Variation Statistics TruSeq 62Mb Female Sample

Type

Total Genotype

Calls SNVs Within-sample Heterozygosity

Total 125681915 136993 000075

Auto 120559746 132616 000076

chrX 5096376 2701 000034

Example Heterozygous SNV

18

Example Heterozygous Deletion

Coverage of MPG gt= 10 Genotype Calls

Total Raw Sequence

Aligned Sequence

Genotype calls CCDS

Genotype calls UCSC coding

SureSelect 38Mb 67 Gb

50 Gb (131x)

89 74

SureSelect 50Mb 105 Gb

61 Gb (122x)

89 85

TruSeq 62Mb 90 Gb

71 Gb (114x)

91 89

Whole Genome Shotgun

192 Gb 133 Gb (44x)

86 83

19

GenotypeConcordance

Total Agreement with Genotype Chip (CCDS)

Whole Genome Shotgun 99908

SureSelect 38Mb 99910

SureSelect 50Mb 99857

TruSeq 62Mb 99865

Whole Exome Sequencing bull Being applied to

ndash Undiagnosed Diseases Program (100rsquos of samples)

ndash ClinSeq (gt1000 samples) ndash Variety of other PI driven projects (eg cancer)

bull Data generation rate per year ndash 200 exomes per GAiiX ndash 1200 exomes per HiSeq 2000

bull Analysis results ndash Genotype data for 90 of consensus coding exon

bases (CCDS) ndash Accuracy of genotype calls over 995

20

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

Variant Annotation and Working With Whole-Exome Data

bull One sample produces gt 100k variants

bull One hundred samples gives rise to 600k or more

bull How does one work with such large datasets

bull The next speaker Dr Jamie Teer will address these next steps

21

Acknowledgements NIH Intramural Sequencing bull Mullikin Lab

Center ndash Nancy Hansen

bull Sequencing Operations ndash Pedro Cruz ndash Bob Blakesley ndash Praveen Cherukuri ndash Alice Young ndash Lab Staff bull Biesecker Lab

bull Bioinformatics ndash Jamie Teer ndash Gerry Bouffard ndash Baishali Maskeri ndash Jenny McDowell ndash Meg Vemulapalli

bull IT Linux Support ndash Jesse Becker ndash Matt Lesko

httpresearchnhgrinihgov

22

23

Page 4: Whole-Exome Sequencing: Technical Details

CimarronSoftwarebasedNextGen LIMS

Computational Resources for 6 GAiiX and 3 HiSeq2000

bull Linux cluster ndash 1000 cores

bull 250 for production ndash 900TB disk

bull 250TB for production with 75TB available bull 15TBmonth long term storage

ndash Network bull 1 and 10 Gigabit-Ethernet

5

Number of Samples

Sequencing applications processed through NGS production pipeline from April 2009 to August 2011

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

6

Library Preparation

httpwwwilluminacomsupportliteratureilmn

AgilentTechnologies SureSelect method

WholeͲexome kit 38Mband50Mb

httpcpliteratureagilentcomlitwebpdf5990-3532ENpdf

7

bull Illumina TruSeq Exome Enrichment bull 62Mb of exome targeted

httpwwwilluminacomsupportliteratureilmn

NISC Exome Process Tag 1

Indexed Libraries Tag 2 Tag 3 Tag 4 Tag 5 Tag 6 Balance

and Pool

Illumina TruSeq Exome Enrichment

Sequence on two lanes

8

Cluster Generation

httpwwwilluminacomsupportliteratureilmn

Sequencing

httpwwwilluminacomsupportliteratureilmn

9

HiSeq Flow Cell

bull 8 lanes bull 15 billion clusters bull Up to 300Gb per flow cell

Sequencing and Data Processing ~150M clusters per lane

2x100 base paired-end reads and the index tag

Tag 1 Tag 2 Tag 3 Tag 4 Tag 5 Tag 6

Demultiplex

Alignment ELAND to Human Genome

Total sequence per sample ~10Gb from two lanes Over 100X coverage of targeted regions

10

Read Depth-of-Coverage

400

ITGA6

Read Depth-of-Coverage

400

ITGA6

11

Targeted Regions Depth-of-Coverage Histogram

0

0005

001

0015

002

0 50 100 150 200

Depth of Coverage

Frac

tion

per B

in

TruSeq_example Poisson

Refining the Alignment (diagCM)

bull ELAND is part of the standard pipeline

bull ELAND accurately places reads in the correct genomic location

bull Use cross-match a Smith-Waterman aligner to improve local alignment

12

Eland version XXX

ELAND Aligned Reads

eland_ms_32 Casava 17

Cross-match Improved Alignment

13

Comparison of Aligners (simulated exome data)

Perc

enta

ge C

orre

ct

85

90

9510

0

-6 -4 -2 0 2 4 6

Variant Size

BayesianGenotypeCalling MostProbableGenotype

ACCTTCCTCGAGGCTAGCCTAGCCGGGGTGGGCAAGGCCTTCCGGGGGGGTAACCTCCTCCAAAACCTCCCTATGCCCCCCCAATTTTTTATATATATCCTCCTCCAA A

A

A A

A

A

A A

A A

A A

P(Genotype|Data)

AT 0 AA -14

Difference between most probable AA 10-6

10-15 and next most probable genotype AC 10-15AG AC -34

AT 099999981 AG -34 10-75CC TT -124 MPG Score = 14 10-75 CT -138CG 10-60CT Convert to natural Log GT -138

GG 10-75 Probabilities and Sort CC -172 10-60GT CG -172 10-54TT GG -172

14

MPG of Haploid Regions

bull Human autosomes are normally diploid bull MPG is designed to call two alleles for the

autosomes and X chromosome if the sample is from a female

bull For samples from males MPG is run in haploid mode on the non-sudoautosomalregions of X and Y

bull Thus only testing for the four nucleotides

Exome Coverage versus Input Sequence

15

Num

ber o

f Sam

ples

SureSelect 38Mb

SureSelect 50Mb

Illumina TruSeq 62Mb

0 20

0 40

0 60

0 80

0 10

00

Whole Exomes Processed at NISC

Mar 2010 Dec 2010 Sep 2011

NBEAL2

16

NBEAL2 5rsquo

Exome Variation Statistics TruSeq 62Mb Male Sample

Type

Total Genotype

Calls SNVs Within-sample Heterozygosity

Total 133047403 142361 000072

Auto 125491045 139295 000076

chrX 6842299 2600 NA

chrY 710243 1435 NA

17

Exome Variation Statistics TruSeq 62Mb Female Sample

Type

Total Genotype

Calls SNVs Within-sample Heterozygosity

Total 125681915 136993 000075

Auto 120559746 132616 000076

chrX 5096376 2701 000034

Example Heterozygous SNV

18

Example Heterozygous Deletion

Coverage of MPG gt= 10 Genotype Calls

Total Raw Sequence

Aligned Sequence

Genotype calls CCDS

Genotype calls UCSC coding

SureSelect 38Mb 67 Gb

50 Gb (131x)

89 74

SureSelect 50Mb 105 Gb

61 Gb (122x)

89 85

TruSeq 62Mb 90 Gb

71 Gb (114x)

91 89

Whole Genome Shotgun

192 Gb 133 Gb (44x)

86 83

19

GenotypeConcordance

Total Agreement with Genotype Chip (CCDS)

Whole Genome Shotgun 99908

SureSelect 38Mb 99910

SureSelect 50Mb 99857

TruSeq 62Mb 99865

Whole Exome Sequencing bull Being applied to

ndash Undiagnosed Diseases Program (100rsquos of samples)

ndash ClinSeq (gt1000 samples) ndash Variety of other PI driven projects (eg cancer)

bull Data generation rate per year ndash 200 exomes per GAiiX ndash 1200 exomes per HiSeq 2000

bull Analysis results ndash Genotype data for 90 of consensus coding exon

bases (CCDS) ndash Accuracy of genotype calls over 995

20

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

Variant Annotation and Working With Whole-Exome Data

bull One sample produces gt 100k variants

bull One hundred samples gives rise to 600k or more

bull How does one work with such large datasets

bull The next speaker Dr Jamie Teer will address these next steps

21

Acknowledgements NIH Intramural Sequencing bull Mullikin Lab

Center ndash Nancy Hansen

bull Sequencing Operations ndash Pedro Cruz ndash Bob Blakesley ndash Praveen Cherukuri ndash Alice Young ndash Lab Staff bull Biesecker Lab

bull Bioinformatics ndash Jamie Teer ndash Gerry Bouffard ndash Baishali Maskeri ndash Jenny McDowell ndash Meg Vemulapalli

bull IT Linux Support ndash Jesse Becker ndash Matt Lesko

httpresearchnhgrinihgov

22

23

Page 5: Whole-Exome Sequencing: Technical Details

Number of Samples

Sequencing applications processed through NGS production pipeline from April 2009 to August 2011

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

6

Library Preparation

httpwwwilluminacomsupportliteratureilmn

AgilentTechnologies SureSelect method

WholeͲexome kit 38Mband50Mb

httpcpliteratureagilentcomlitwebpdf5990-3532ENpdf

7

bull Illumina TruSeq Exome Enrichment bull 62Mb of exome targeted

httpwwwilluminacomsupportliteratureilmn

NISC Exome Process Tag 1

Indexed Libraries Tag 2 Tag 3 Tag 4 Tag 5 Tag 6 Balance

and Pool

Illumina TruSeq Exome Enrichment

Sequence on two lanes

8

Cluster Generation

httpwwwilluminacomsupportliteratureilmn

Sequencing

httpwwwilluminacomsupportliteratureilmn

9

HiSeq Flow Cell

bull 8 lanes bull 15 billion clusters bull Up to 300Gb per flow cell

Sequencing and Data Processing ~150M clusters per lane

2x100 base paired-end reads and the index tag

Tag 1 Tag 2 Tag 3 Tag 4 Tag 5 Tag 6

Demultiplex

Alignment ELAND to Human Genome

Total sequence per sample ~10Gb from two lanes Over 100X coverage of targeted regions

10

Read Depth-of-Coverage

400

ITGA6

Read Depth-of-Coverage

400

ITGA6

11

Targeted Regions Depth-of-Coverage Histogram

0

0005

001

0015

002

0 50 100 150 200

Depth of Coverage

Frac

tion

per B

in

TruSeq_example Poisson

Refining the Alignment (diagCM)

bull ELAND is part of the standard pipeline

bull ELAND accurately places reads in the correct genomic location

bull Use cross-match a Smith-Waterman aligner to improve local alignment

12

Eland version XXX

ELAND Aligned Reads

eland_ms_32 Casava 17

Cross-match Improved Alignment

13

Comparison of Aligners (simulated exome data)

Perc

enta

ge C

orre

ct

85

90

9510

0

-6 -4 -2 0 2 4 6

Variant Size

BayesianGenotypeCalling MostProbableGenotype

ACCTTCCTCGAGGCTAGCCTAGCCGGGGTGGGCAAGGCCTTCCGGGGGGGTAACCTCCTCCAAAACCTCCCTATGCCCCCCCAATTTTTTATATATATCCTCCTCCAA A

A

A A

A

A

A A

A A

A A

P(Genotype|Data)

AT 0 AA -14

Difference between most probable AA 10-6

10-15 and next most probable genotype AC 10-15AG AC -34

AT 099999981 AG -34 10-75CC TT -124 MPG Score = 14 10-75 CT -138CG 10-60CT Convert to natural Log GT -138

GG 10-75 Probabilities and Sort CC -172 10-60GT CG -172 10-54TT GG -172

14

MPG of Haploid Regions

bull Human autosomes are normally diploid bull MPG is designed to call two alleles for the

autosomes and X chromosome if the sample is from a female

bull For samples from males MPG is run in haploid mode on the non-sudoautosomalregions of X and Y

bull Thus only testing for the four nucleotides

Exome Coverage versus Input Sequence

15

Num

ber o

f Sam

ples

SureSelect 38Mb

SureSelect 50Mb

Illumina TruSeq 62Mb

0 20

0 40

0 60

0 80

0 10

00

Whole Exomes Processed at NISC

Mar 2010 Dec 2010 Sep 2011

NBEAL2

16

NBEAL2 5rsquo

Exome Variation Statistics TruSeq 62Mb Male Sample

Type

Total Genotype

Calls SNVs Within-sample Heterozygosity

Total 133047403 142361 000072

Auto 125491045 139295 000076

chrX 6842299 2600 NA

chrY 710243 1435 NA

17

Exome Variation Statistics TruSeq 62Mb Female Sample

Type

Total Genotype

Calls SNVs Within-sample Heterozygosity

Total 125681915 136993 000075

Auto 120559746 132616 000076

chrX 5096376 2701 000034

Example Heterozygous SNV

18

Example Heterozygous Deletion

Coverage of MPG gt= 10 Genotype Calls

Total Raw Sequence

Aligned Sequence

Genotype calls CCDS

Genotype calls UCSC coding

SureSelect 38Mb 67 Gb

50 Gb (131x)

89 74

SureSelect 50Mb 105 Gb

61 Gb (122x)

89 85

TruSeq 62Mb 90 Gb

71 Gb (114x)

91 89

Whole Genome Shotgun

192 Gb 133 Gb (44x)

86 83

19

GenotypeConcordance

Total Agreement with Genotype Chip (CCDS)

Whole Genome Shotgun 99908

SureSelect 38Mb 99910

SureSelect 50Mb 99857

TruSeq 62Mb 99865

Whole Exome Sequencing bull Being applied to

ndash Undiagnosed Diseases Program (100rsquos of samples)

ndash ClinSeq (gt1000 samples) ndash Variety of other PI driven projects (eg cancer)

bull Data generation rate per year ndash 200 exomes per GAiiX ndash 1200 exomes per HiSeq 2000

bull Analysis results ndash Genotype data for 90 of consensus coding exon

bases (CCDS) ndash Accuracy of genotype calls over 995

20

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

Variant Annotation and Working With Whole-Exome Data

bull One sample produces gt 100k variants

bull One hundred samples gives rise to 600k or more

bull How does one work with such large datasets

bull The next speaker Dr Jamie Teer will address these next steps

21

Acknowledgements NIH Intramural Sequencing bull Mullikin Lab

Center ndash Nancy Hansen

bull Sequencing Operations ndash Pedro Cruz ndash Bob Blakesley ndash Praveen Cherukuri ndash Alice Young ndash Lab Staff bull Biesecker Lab

bull Bioinformatics ndash Jamie Teer ndash Gerry Bouffard ndash Baishali Maskeri ndash Jenny McDowell ndash Meg Vemulapalli

bull IT Linux Support ndash Jesse Becker ndash Matt Lesko

httpresearchnhgrinihgov

22

23

Page 6: Whole-Exome Sequencing: Technical Details

Library Preparation

httpwwwilluminacomsupportliteratureilmn

AgilentTechnologies SureSelect method

WholeͲexome kit 38Mband50Mb

httpcpliteratureagilentcomlitwebpdf5990-3532ENpdf

7

bull Illumina TruSeq Exome Enrichment bull 62Mb of exome targeted

httpwwwilluminacomsupportliteratureilmn

NISC Exome Process Tag 1

Indexed Libraries Tag 2 Tag 3 Tag 4 Tag 5 Tag 6 Balance

and Pool

Illumina TruSeq Exome Enrichment

Sequence on two lanes

8

Cluster Generation

httpwwwilluminacomsupportliteratureilmn

Sequencing

httpwwwilluminacomsupportliteratureilmn

9

HiSeq Flow Cell

bull 8 lanes bull 15 billion clusters bull Up to 300Gb per flow cell

Sequencing and Data Processing ~150M clusters per lane

2x100 base paired-end reads and the index tag

Tag 1 Tag 2 Tag 3 Tag 4 Tag 5 Tag 6

Demultiplex

Alignment ELAND to Human Genome

Total sequence per sample ~10Gb from two lanes Over 100X coverage of targeted regions

10

Read Depth-of-Coverage

400

ITGA6

Read Depth-of-Coverage

400

ITGA6

11

Targeted Regions Depth-of-Coverage Histogram

0

0005

001

0015

002

0 50 100 150 200

Depth of Coverage

Frac

tion

per B

in

TruSeq_example Poisson

Refining the Alignment (diagCM)

bull ELAND is part of the standard pipeline

bull ELAND accurately places reads in the correct genomic location

bull Use cross-match a Smith-Waterman aligner to improve local alignment

12

Eland version XXX

ELAND Aligned Reads

eland_ms_32 Casava 17

Cross-match Improved Alignment

13

Comparison of Aligners (simulated exome data)

Perc

enta

ge C

orre

ct

85

90

9510

0

-6 -4 -2 0 2 4 6

Variant Size

BayesianGenotypeCalling MostProbableGenotype

ACCTTCCTCGAGGCTAGCCTAGCCGGGGTGGGCAAGGCCTTCCGGGGGGGTAACCTCCTCCAAAACCTCCCTATGCCCCCCCAATTTTTTATATATATCCTCCTCCAA A

A

A A

A

A

A A

A A

A A

P(Genotype|Data)

AT 0 AA -14

Difference between most probable AA 10-6

10-15 and next most probable genotype AC 10-15AG AC -34

AT 099999981 AG -34 10-75CC TT -124 MPG Score = 14 10-75 CT -138CG 10-60CT Convert to natural Log GT -138

GG 10-75 Probabilities and Sort CC -172 10-60GT CG -172 10-54TT GG -172

14

MPG of Haploid Regions

• Human autosomes are normally diploid
• MPG is designed to call two alleles for the autosomes, and for the X chromosome if the sample is from a female
• For samples from males, MPG is run in haploid mode on the non-pseudoautosomal regions of X and Y
• Thus only testing for the four nucleotides
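The difference between the two modes comes down to which genotype hypotheses are tested. The following sketch enumerates them; `candidate_genotypes` and `ploidy_for` are hypothetical helpers written for illustration, and the `in_par` flag (whether a position lies in a pseudoautosomal region) is assumed to be supplied by the caller.

```python
from itertools import combinations_with_replacement

NUCLEOTIDES = "ACGT"

def candidate_genotypes(ploidy):
    """All unordered genotypes of the given ploidy over A, C, G, T."""
    return ["".join(g) for g in combinations_with_replacement(NUCLEOTIDES, ploidy)]

def ploidy_for(chrom, sex, in_par=False):
    """Haploid only for male chrX/chrY outside the pseudoautosomal regions."""
    if sex == "male" and chrom in ("chrX", "chrY") and not in_par:
        return 1
    return 2

diploid = candidate_genotypes(2)   # ten genotypes: AA, AC, AG, AT, CC, ...
haploid = candidate_genotypes(1)   # four genotypes: A, C, G, T
```

In diploid mode ten hypotheses compete at each position, while haploid mode reduces the test to the four single nucleotides, as the last bullet states.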

Exome Coverage versus Input Sequence

[Figure: exome coverage versus input sequence]

Whole Exomes Processed at NISC

[Figure: number of samples (0 to 1000) from Mar 2010 to Sep 2011, by capture kit: SureSelect 38Mb, SureSelect 50Mb, Illumina TruSeq 62Mb]

NBEAL2

[Figure: NBEAL2]

NBEAL2 5'

[Figure: NBEAL2 5' region]

Exome Variation Statistics: TruSeq 62Mb, Male Sample

Type    Total Genotype Calls   SNVs     Within-sample Heterozygosity
Total   133047403              142361   0.00072
Auto    125491045              139295   0.00076
chrX    6842299                2600     NA
chrY    710243                 1435     NA

Exome Variation Statistics: TruSeq 62Mb, Female Sample

Type    Total Genotype Calls   SNVs     Within-sample Heterozygosity
Total   125681915              136993   0.00075
Auto    120559746              132616   0.00076
chrX    5096376                2701     0.00034
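The slides do not define within-sample heterozygosity. A minimal sketch, assuming it is the fraction of called positions whose two alleles differ, is shown below; under that assumption the NA entries for chrX and chrY in the male sample follow from haploid calling, where a single allele cannot be heterozygous. The function name and the string encoding of genotypes are hypothetical.

```python
def within_sample_heterozygosity(genotypes):
    """Fraction of called diploid genotypes that are heterozygous.

    genotypes: iterable of strings such as "AA" or "AT"; None marks no call.
    Returns None (reported as NA) when there are no calls or calls are haploid.
    """
    called = [g for g in genotypes if g is not None]
    if not called or any(len(g) == 1 for g in called):
        return None
    heterozygous = sum(1 for g in called if g[0] != g[1])
    return heterozygous / len(called)

diploid_rate = within_sample_heterozygosity(["AA", "AT", "CC", "GG", None])
haploid_rate = within_sample_heterozygosity(["A", "C", "G"])
```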

Example Heterozygous SNV

Example Heterozygous Deletion

Coverage of MPG >= 10 Genotype Calls

Capture                Total Raw Sequence   Aligned Sequence   Genotype calls CCDS   Genotype calls UCSC coding
SureSelect 38Mb        6.7 Gb               5.0 Gb (131x)      89%                   74%
SureSelect 50Mb        10.5 Gb              6.1 Gb (122x)      89%                   85%
TruSeq 62Mb            9.0 Gb               7.1 Gb (114x)      91%                   89%
Whole Genome Shotgun   192 Gb               133 Gb (44x)       86%                   83%
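The fold-coverage values in parentheses follow from dividing aligned sequence by the size of the region it is spread over. The sketch below recomputes them; the target sizes are taken from the kit names, and the 3.0 Gb genome size used for whole genome shotgun is an assumption, not a figure from the slides.

```python
# (aligned bases, target size in bases) for each row of the table.
rows = {
    "SureSelect 38Mb": (5.0e9, 38e6),
    "SureSelect 50Mb": (6.1e9, 50e6),
    "TruSeq 62Mb": (7.1e9, 62e6),
    "Whole Genome Shotgun": (133e9, 3.0e9),  # assumed ~3.0 Gb genome
}

def fold_coverage(aligned_bases, target_bases):
    """Average depth if aligned bases were spread evenly over the target."""
    return aligned_bases / target_bases

fold = {name: fold_coverage(*values) for name, values in rows.items()}

for name, depth in fold.items():
    print(f"{name}: {depth:.1f}x")
```

The exome captures reach well over 100x with a small fraction of the sequence that whole genome shotgun needs for about 44x, which is the trade-off this table illustrates.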

Genotype Concordance

Method                 Total Agreement with Genotype Chip (CCDS)
Whole Genome Shotgun   99.908%
SureSelect 38Mb        99.910%
SureSelect 50Mb        99.857%
TruSeq 62Mb            99.865%
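A concordance figure of this kind can be computed by comparing sequencing-derived genotypes with chip genotypes at shared sites. This is a sketch under stated assumptions: genotypes are unordered two-letter strings keyed by a site identifier, sites without a sequencing call are excluded from the denominator, and the function name is hypothetical. The slides do not say how NISC handled no-calls or strand orientation.

```python
def genotype_concordance(seq_calls, chip_calls):
    """Percent of chip sites with a sequencing call where both genotypes agree."""
    shared = [site for site in chip_calls if seq_calls.get(site) is not None]
    if not shared:
        return None
    # Sort the alleles so that "AT" and "TA" count as the same genotype.
    agree = sum(1 for site in shared
                if sorted(seq_calls[site]) == sorted(chip_calls[site]))
    return 100.0 * agree / len(shared)

# Toy data: rs4 has no sequencing call, rs3 disagrees.
seq = {"rs1": "AT", "rs2": "CC", "rs3": "GG", "rs4": None}
chip = {"rs1": "TA", "rs2": "CC", "rs3": "AG", "rs4": "TT"}
concordance = genotype_concordance(seq, chip)
```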

Whole Exome Sequencing

• Being applied to
  – Undiagnosed Diseases Program (100's of samples)
  – ClinSeq (>1000 samples)
  – Variety of other PI driven projects (e.g. cancer)
• Data generation rate per year
  – 200 exomes per GAiiX
  – 1200 exomes per HiSeq 2000
• Analysis results
  – Genotype data for 90% of consensus coding exon bases (CCDS)
  – Accuracy of genotype calls over 99.5%

Exome Sequencing Pipeline

Sample DNA Fragmentation

Illumina Library Preparation

Exome Enrichment

Cluster Generation

Sequencing and Basecalling

Sequence Read Alignment

Variation Detection

Variant Annotation

Working With Whole-Exome Data

• One sample produces >100k variants
• One hundred samples give rise to 600k or more
• How does one work with such large datasets?
• The next speaker, Dr. Jamie Teer, will address these next steps

Acknowledgements

NIH Intramural Sequencing Center
• Sequencing Operations
  – Bob Blakesley
  – Alice Young
  – Lab Staff
• Bioinformatics
  – Gerry Bouffard
  – Baishali Maskeri
  – Jenny McDowell
  – Meg Vemulapalli
• IT Linux Support
  – Jesse Becker
  – Matt Lesko

• Mullikin Lab
  – Nancy Hansen
  – Pedro Cruz
  – Praveen Cherukuri
• Biesecker Lab
  – Jamie Teer

http://research.nhgri.nih.gov

Page 14: Whole-Exome Sequencing: Technical Details

MPG of Haploid Regions

• Human autosomes are normally diploid
• MPG is designed to call two alleles for the autosomes, and for the X chromosome if the sample is from a female
• For samples from males, MPG is run in haploid mode on the non-pseudoautosomal regions of X and Y
• Haploid mode therefore tests only the four nucleotides (A, C, G, T)
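The difference between the two modes can be illustrated by counting genotype hypotheses. The sketch below is not MPG's code; the function names, the `in_par` flag, and the restriction to single-nucleotide genotypes are assumptions made for illustration.

```python
from itertools import combinations_with_replacement

NUCLEOTIDES = "ACGT"

def genotype_hypotheses(ploidy):
    """Return every unordered genotype of the given ploidy over A, C, G, T."""
    return ["".join(g) for g in combinations_with_replacement(NUCLEOTIDES, ploidy)]

def hypotheses_for_region(chrom, sex, in_par=False):
    """Pick the hypothesis set for a region.

    Simplified: male X and Y outside the pseudoautosomal regions are haploid,
    everything else is treated as diploid. Indels, chrM and aneuploidy are ignored.
    """
    haploid = sex == "male" and chrom in ("chrX", "chrY") and not in_par
    return genotype_hypotheses(1 if haploid else 2)

diploid = genotype_hypotheses(2)   # AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
haploid = genotype_hypotheses(1)   # A, C, G, T
```

Under these assumptions a diploid site has ten candidate genotypes and a haploid site has four, which is the reduction the last bullet describes.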

[Figure: Exome Coverage versus Input Sequence]

[Figure: Whole Exomes Processed at NISC. Y axis: Number of Samples (0 to 1000); X axis: Mar 2010 to Sep 2011; series: SureSelect 38Mb, SureSelect 50Mb, Illumina TruSeq 62Mb]

[Figure: NBEAL2]

[Figure: NBEAL2 5']

Exome Variation Statistics: TruSeq 62Mb, Male Sample

Type    Total Genotype Calls    SNVs       Within-sample Heterozygosity
Total   133,047,403             142,361    0.00072
Auto    125,491,045             139,295    0.00076
chrX      6,842,299               2,600    NA
chrY        710,243               1,435    NA

Exome Variation Statistics: TruSeq 62Mb, Female Sample

Type    Total Genotype Calls    SNVs       Within-sample Heterozygosity
Total   125,681,915             136,993    0.00075
Auto    120,559,746             132,616    0.00076
chrX      5,096,376               2,701    0.00034
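A short calculation relates the columns of the two tables. It reads the autosomal heterozygosity entries as 0.00076 and assumes that within-sample heterozygosity means heterozygous calls divided by total genotype calls; the slides do not state the definition, so the implied counts are only indicative.

```python
# Autosomal rows from the male and female TruSeq 62Mb tables.
samples = {
    "male": {"calls": 125_491_045, "snvs": 139_295, "het_rate": 0.00076},
    "female": {"calls": 120_559_746, "snvs": 132_616, "het_rate": 0.00076},
}

def summarize(row):
    """Return (SNVs per genotype call, implied number of heterozygous sites)."""
    snv_rate = row["snvs"] / row["calls"]
    # Assumption: heterozygosity = heterozygous calls / total genotype calls.
    implied_het_sites = row["het_rate"] * row["calls"]
    return snv_rate, implied_het_sites

summary = {name: summarize(row) for name, row in samples.items()}
for name, (snv_rate, het_sites) in summary.items():
    print(f"{name}: {snv_rate:.5f} SNVs per call, ~{het_sites:,.0f} implied het sites")
```

On this reading, both samples show roughly one SNV per 900 called autosomal bases.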

[Figure: Example Heterozygous SNV]

[Figure: Example Heterozygous Deletion]

Coverage of MPG ≥ 10 Genotype Calls

Method                 Total Raw Sequence   Aligned Sequence   Genotype calls CCDS   Genotype calls UCSC coding
SureSelect 38Mb        6.7 Gb               5.0 Gb (131x)      89%                   74%
SureSelect 50Mb        10.5 Gb              6.1 Gb (122x)      89%                   85%
TruSeq 62Mb            9.0 Gb               7.1 Gb (114x)      91%                   89%
Whole Genome Shotgun   192 Gb               133 Gb (44x)       86%                   83%
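The depth figures in parentheses can be checked with simple arithmetic. The sketch below reads the aligned-sequence entries as 5.0, 6.1 and 7.1 Gb and assumes depth is aligned bases divided by the nominal target size; both readings are assumptions, and real captures also place some reads off target.

```python
# (aligned bases, nominal target size in bases) for the three exome kits.
runs = {
    "SureSelect 38Mb": (5.0e9, 38e6),
    "SureSelect 50Mb": (6.1e9, 50e6),
    "TruSeq 62Mb": (7.1e9, 62e6),
}

def mean_depth(aligned_bases, target_bases):
    """Average depth if every aligned base fell on the target."""
    return aligned_bases / target_bases

depths = {kit: mean_depth(*values) for kit, values in runs.items()}
for kit, depth in depths.items():
    print(f"{kit}: about {depth:.1f}x")
```

The results fall within one unit of the 131x, 122x and 114x shown on the slide; the small differences are consistent with the inputs being rounded to one decimal place.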

Genotype Concordance

Method                 Total Agreement with Genotype Chip (CCDS)
Whole Genome Shotgun   99.908%
SureSelect 38Mb        99.910%
SureSelect 50Mb        99.857%
TruSeq 62Mb            99.865%
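A minimal sketch of how such an agreement figure could be computed follows. It is not the NISC procedure: the function name, the site keys and the toy genotypes are hypothetical, and it simply compares sequence-derived calls with chip calls at sites genotyped by both.

```python
def concordance(seq_calls, chip_calls):
    """Fraction of sites genotyped by both methods where the genotypes agree.

    Genotypes are compared as unordered allele pairs, so "AG" equals "GA".
    """
    shared = seq_calls.keys() & chip_calls.keys()
    if not shared:
        raise ValueError("no sites in common")
    agree = sum(sorted(seq_calls[s]) == sorted(chip_calls[s]) for s in shared)
    return agree / len(shared)

# Toy data keyed by (chromosome, position); values are diploid genotypes.
seq = {("chr1", 1000): "AG", ("chr1", 2000): "CC", ("chr2", 500): "TT", ("chr3", 42): "AC"}
chip = {("chr1", 1000): "GA", ("chr1", 2000): "CC", ("chr2", 500): "CT", ("chr9", 7): "GG"}

rate = concordance(seq, chip)  # 2 of the 3 shared sites agree
```

Sites called by only one method are excluded from the denominator, so the figure measures accuracy where a call was made and says nothing about coverage.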

Whole Exome Sequencing

• Being applied to
  – Undiagnosed Diseases Program (100's of samples)
  – ClinSeq (>1000 samples)
  – Variety of other PI-driven projects (e.g., cancer)
• Data generation rate per year
  – 200 exomes per GAiiX
  – 1200 exomes per HiSeq 2000
• Analysis results
  – Genotype data for 90% of consensus coding exon bases (CCDS)
  – Accuracy of genotype calls over 99.5%

Exome Sequencing Pipeline

Sample DNA Fragmentation
Illumina Library Preparation
Exome Enrichment
Cluster Generation
Sequencing and Basecalling
Sequence Read Alignment
Variation Detection
Variant Annotation

Working With Whole-Exome Data

• One sample produces > 100k variants
• One hundred samples give rise to 600k or more
• How does one work with such large datasets?
• The next speaker, Dr. Jamie Teer, will address these next steps
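The cohort total grows more slowly than the per-sample count times the number of samples because many variants are shared between individuals. A toy sketch of merging per-sample variant sets is shown below; the `(chrom, pos, ref, alt)` key and the sample data are illustrative assumptions, not the format used at NISC.

```python
from collections import Counter

def merge_variants(per_sample):
    """Count how many samples carry each distinct variant.

    per_sample maps sample name -> set of (chrom, pos, ref, alt) tuples.
    """
    carriers = Counter()
    for variants in per_sample.values():
        carriers.update(variants)
    return carriers

cohort = {
    "s1": {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T"), ("chr3", 7, "G", "A")},
    "s2": {("chr1", 100, "A", "G"), ("chr2", 50, "C", "T"), ("chr5", 9, "T", "C")},
    "s3": {("chr1", 100, "A", "G"), ("chr4", 12, "G", "C")},
}

carriers = merge_variants(cohort)
total_calls = sum(len(v) for v in cohort.values())    # 8 variant calls in total
distinct = len(carriers)                              # 5 distinct variants
private = [v for v, n in carriers.items() if n == 1]  # seen in one sample only
```

Keeping a carrier count per variant also makes it easy to separate common variants from those private to a single sample.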

Acknowledgements

NIH Intramural Sequencing Center
• Sequencing Operations
  – Bob Blakesley
  – Alice Young
  – Lab Staff
• Bioinformatics
  – Gerry Bouffard
  – Baishali Maskeri
  – Jenny McDowell
  – Meg Vemulapalli
• IT Linux Support
  – Jesse Becker
  – Matt Lesko

• Mullikin Lab
  – Nancy Hansen
  – Pedro Cruz
  – Praveen Cherukuri
• Biesecker Lab
  – Jamie Teer

http://research.nhgri.nih.gov







