Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Jorge Duitama 1,2, Thomas Huebsch 1, Gayle McEwen 1, Sabrina Schulz 1, Eun-Kyung Suk 1, Margret R. Hoehe 1

1. Max Planck Institute for Molecular Genetics, Berlin, Germany
2. Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA



Page 2: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

MHC: Key Region for Common Diseases & Transplant Medicine

[Figure: map of the MHC showing the MHC class I, MHC class III and MHC class II regions, with boundary positions 29.74, 31.59, 32.34 and 33.21]

Page 3: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing


MHC: Variation amongst Haplotypes

Variation and annotation map for eight MHC haplotypes, Horton et al. Immunogenetics (2008) 60,1-18

Variation amongst 8 MHC Haplotypes:
• 37,451 Substitutions
• 7,093 Short Indels

[Figure: Variation of MHC Haplotypes against PGF reference. 7 further MHC haplotype sequences plotted against the PGF reference sequence across MHC class III and MHC class II; marked features: RCCX CNV, HLA-DRB CNV]

Page 4: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Experimental Approach

[Workflow diagram; labels recovered from the slide:]
• SNP Mapping for Prioritization of MHC Informative Pools
• SOLiD NGS Platform
• Shotgunning complete 40 kb fosmids
• 40 kb haploid molecules
• 5000 fosmids
• One pool 3x96-well = 288 fosmid pools
• Targeted Enrichment
• Complete Fosmid Pool
• 100 Individuals, 100 Libraries
• Identification of 40 kb fosmid sequences
• Data Analysis Pipeline
• Haplotype A, Haplotype B
• Phasing molecular fosmid sequences
• Contiguous MHC haplotype sequence

Page 5: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Data Analysis Pipeline

SOLiD Standard Pipeline
  • Read Alignment against Genome
  • Pairing
  • Consensus Calling
  • SNP Analysis

In House Project Specific Analysis Pipeline
  • Fosmid Detection Program
  • Fosmid Specific Matching Algorithm
  • Fosmid Sequences Based Phasing
  • Visualization & MHC Database


Page 7: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Mapping real data

Pool of 15,000 fosmids, 22 million reads, 50 bp

[Bar chart: mapped reads %, unique mapped reads % and multiple hits % (axis 0 to 70) for Bioscope classic, Bioscope local repeat 40.3 and Bioscope local repeat 45.3]


Page 9: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

SNP calls: Haploid fosmids vs. genomic DNA

gDNA

# cov   ref   consen   F3        coord
335     C     Y        177/17    62511614
3345    T     C        3191/56   62512095
875     G     A        862/25    62513689
1795    G     K        722/23    62513754
707     C     S        528/13    62515375
2643    C     Y        1391/20   62517737
643     C     Y        417/23    62518998
1074    A     R        554/21    62522445
606     C     S        226/21    62524689
639     A     M        167/15    62532474
158     G     R        89/14     62533464
1032    A     R        443/26    62534973
7       A     G        7/4       62537153
775     T     G        742/26    62540402
10      G     C        10/5      62540465
698     G     C        684/29    62541769
40      C     T        40/4      62542550
94      C     G        93/9      62542574
286     C     T        283/16    62543011
194     C     A        190/22    62543067

Fosmid

# cov   ref   consen   F3        coord
595     C     T        572/91    62511614
3418    T     C        3278/98   62512095
2089    G     A        2048/98   62513689
2238    G     T        2194/98   62513754
1134    C     G        1107/73   62515375
3104    C     T        2922/98   62517737
1033    C     T        1014/83   62518998
1799    A     G        1753/98   62522445
1053    C     G        1049/83   62524689
54      G     A        39/22     62527964
32      A     C        27/23     62529870
1374    A     C        1355/95   62532474
973     G     A        946/97    62533464
2850    A     G        2745/98   62534973
49      A     G        48/33     62537153
1888    T     G        1845/95   62540402
37      G     C        36/20     62540465
923     G     C        901/97    62541769
8411    A     W        2006/78   62542258
253     C     T        253/47    62542550
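The consensus columns above use IUPAC ambiguity codes, so a two-base code such as Y or K in the gDNA table marks a heterozygous call, while the fosmid table mostly shows a single base at the same coordinate. A minimal sketch of that reading, assuming only the standard IUPAC code table; the function name `classify_call` and the choice of rows are illustrative and not part of the presented pipeline:

```python
# Standard IUPAC nucleotide codes; two-base codes denote a heterozygous consensus.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

def classify_call(ref, consensus):
    """Describe a consensus call relative to the reference base (hypothetical helper)."""
    alleles = IUPAC[consensus]
    if len(alleles) == 2:
        return "heterozygous " + "/".join(alleles)
    if consensus == ref:
        return "reference"
    return "non-reference " + consensus

# (coord, ref, gDNA consensus, fosmid consensus), copied from the two tables.
ROWS = [
    (62511614, "C", "Y", "T"),
    (62513754, "G", "K", "T"),
    (62515375, "C", "S", "G"),
    (62522445, "A", "R", "G"),
]

for coord, ref, gdna, fosmid in ROWS:
    print(coord, "gDNA:", classify_call(ref, gdna), "| fosmid:", classify_call(ref, fosmid))
```

For these four positions the diploid sample reports both alleles, whereas the fosmid pool reports only the non-reference allele.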

Page 10: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

SNP Calling Accuracy in the MHC

– Affymetrix genotype information for 1583 SNP positions as reference standard:
  • Homozygous identical with reference: 957
  • Heterozygous: 562
  • Homozygous different from reference: 64
– Compared to variants called from the SOLiD sequenced genomic DNA sample (15x average read coverage)
– Percentage of error in genotype calling: 3.66%
– False positive rate: 0.1%
– False negative rate: 9.25%


Page 12: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Fosmids Detection

Fosmid Detection Algorithm
1. Assign each read to a single 1 kb long bin. Select bins with more than 5 reads
2. Perform allele calls for each heterozygous SNP. Mark bins with heterozygous calls
3. Cluster adjacent bins as belonging to the same fosmid if:
   i. The gap distance between them is less than 10 kb and
   ii. There are no bins with heterozygous SNPs between them
4. Keep fosmids with lengths between 3 kb and 60 kb

[Figure: UCSC Genome browser view, http://genome.ucsc.edu/, Kent et al. 2002 Genome Res. 12(6):996-1006.]
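The four steps of the detection algorithm can be sketched as follows. This is an illustrative reading of the slide, not the authors' program: it assumes reads are binned by start position, that the set of heterozygous bins is computed elsewhere, and that a heterozygous bin is left out of clusters and blocks merging across it. The name `detect_fosmids` and its parameters are hypothetical.

```python
from collections import Counter

BIN_SIZE = 1_000                   # step 1: 1 kb bins
MIN_READS = 5                      # step 1: keep bins with more than 5 reads
MAX_GAP = 10_000                   # step 3i: gap must be below 10 kb
MIN_LEN, MAX_LEN = 3_000, 60_000   # step 4: accepted fosmid lengths

def detect_fosmids(read_starts, het_bins=frozenset()):
    """Return (start, end) intervals of putative fosmids on one chromosome.

    read_starts: mapped read start positions (assumed input format).
    het_bins: indices of bins carrying heterozygous calls (assumed precomputed).
    """
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    covered = sorted(b for b, n in counts.items() if n > MIN_READS)

    clusters, current = [], None
    for b in covered:
        if b in het_bins:
            continue  # assumption: heterozygous bins never join a cluster
        if current is not None:
            gap = (b - current[1] - 1) * BIN_SIZE
            blocked = any(current[1] < h < b for h in het_bins)
            if gap < MAX_GAP and not blocked:
                current[1] = b
                continue
            clusters.append(current)
        current = [b, b]
    if current is not None:
        clusters.append(current)

    fosmids = []
    for first, last in clusters:
        start, end = first * BIN_SIZE, (last + 1) * BIN_SIZE
        if MIN_LEN <= end - start <= MAX_LEN:
            fosmids.append((start, end))
    return fosmids

# Toy data: a 40 kb covered stretch plus a 2 kb stretch that is too short to keep.
reads = [b * BIN_SIZE + i for b in range(100, 140) for i in range(6)]
reads += [b * BIN_SIZE + i for b in (200, 201) for i in range(6)]
print(detect_fosmids(reads))
print(detect_fosmids(reads, het_bins={120}))
```

In the second call the heterozygous bin splits the 40 kb stretch into two shorter intervals, which is how step 3ii separates overlapping fosmids from different haplotypes.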

Page 13: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Fosmids Detection

Size distribution of read-contigs

[Histogram: number of contigs (axis 0 to 3500) versus contig length in kb; annotated range: fosmid sized contigs, 20 – 50 kb]


Page 15: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Haplotyping

Locus   Event       Alleles
1       SNV         C,T
2       Deletion    C,-
3       SNV         A,G
4       Insertion   -,GC

Locus   Event       Alleles Hap 1   Alleles Hap 2
1       SNV         T               C
2       Deletion    C               -
3       SNV         A               G
4       Insertion   -               GC

The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping

Page 16: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Single Individual Haplotyping

• Input: Matrix M of m fragments covering n loci

Locus   1   2   3   4   5   ...   n
f1      -   0   1   1   0   ...   0
f2      1   1   0   -   1   ...   1
f3      0   0   0   1   1   ...   -
...
fm      -   -   1   -   1   ...   1


Page 20: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

ReFHap Problem Formulation

For two alleles a1, a2

For two rows i1, i2 of M

f1      -   0   1   1   0
f2      1   1   1   -   1
Score   0   1  -1   0   1

s(M,1,2) = 1
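The formulas for the two scores did not survive in this transcript, but the worked example is consistent with a simple rule: a locus scores 0 if either fragment has no call, +1 if the two alleles differ, and -1 if they agree, and the score of two rows is the sum over loci. A minimal sketch under that inferred rule (the function names are illustrative):

```python
def allele_score(a1, a2):
    """Per-locus score inferred from the slide's example: 0 if uncovered, +1 if different, -1 if equal."""
    if a1 == "-" or a2 == "-":
        return 0
    return 1 if a1 != a2 else -1

def fragment_score(row1, row2):
    """Score of two rows of M: the sum of per-locus scores."""
    return sum(allele_score(a, b) for a, b in zip(row1, row2))

f1 = "-0110"
f2 = "111-1"
print([allele_score(a, b) for a, b in zip(f1, f2)])  # [0, 1, -1, 0, 1]
print(fragment_score(f1, f2))                        # 1
```

A positive score suggests the two fragments come from different haplotypes, a negative score that they come from the same one.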

Page 21: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing


ReFHap Problem Formulation

For a cut I of rows of M

Page 22: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

ReFHap Algorithm

Locus   1   2   3   4   5
f1      -   0   1   1   0
f2      1   1   0   -   1
f3      1   -   -   0   -
f4      -   0   0   -   1

[Figure: graph with one vertex per fragment (1 to 4) and edge weights 3, 1, 1, -1, -1]

h1 00110
h2 11001

• Reduce the problem to Max-Cut.
• Solve Max-Cut
• Build haplotypes according to the cut
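The three bullets can be traced on the four-fragment example. The sketch below is not ReFHap's solver: it uses exhaustive search, which is only feasible for a handful of fragments, to find the maximum cut, and it builds the haplotypes by a majority vote in which fragments on the far side of the cut vote for the complementary allele. The pairwise score (+1 differ, -1 agree, 0 uncovered), the vote, and the tie-break towards "0" are assumptions chosen to be consistent with the slide's numbers.

```python
from itertools import combinations

M = ["-0110", "110-1", "1--0-", "-00-1"]  # fragments f1..f4 from the slide

def score(r1, r2):
    """Pairwise fragment score: +1 per differing locus, -1 per agreeing locus, 0 if uncovered."""
    return sum(0 if "-" in (a, b) else (1 if a != b else -1) for a, b in zip(r1, r2))

def cut_weight(M, side):
    """Total score of fragment pairs separated by the cut."""
    return sum(score(M[i], M[j]) for i in side for j in range(len(M)) if j not in side)

def best_cut(M):
    """Exhaustive Max-Cut (assumption: tiny input); row 0 is pinned to avoid mirror-image cuts."""
    best, best_w = None, None
    for k in range(1, len(M)):
        for subset in combinations(range(len(M)), k):
            if 0 not in subset:
                continue
            w = cut_weight(M, set(subset))
            if best_w is None or w > best_w:
                best, best_w = set(subset), w
    return best, best_w

def build_haplotypes(M, side):
    """Majority vote per locus; fragments outside `side` vote for the complement."""
    h1 = []
    for j in range(len(M[0])):
        votes = 0
        for i, row in enumerate(M):
            if row[j] == "-":
                continue
            bit = int(row[j]) if i in side else 1 - int(row[j])
            votes += 1 if bit == 1 else -1
        h1.append("1" if votes > 0 else "0")  # ties resolved to "0" (assumption)
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2

side, weight = best_cut(M)
print(side, weight)               # {0} 5
print(build_haplotypes(M, side))  # ('00110', '11001')
```

The best cut separates f1 from f2, f3 and f4, and the resulting pair matches h1 and h2 on the slide.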

Page 23: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

ReFHap Algorithm

1. Build G=(V,E,w) from M
2. Sort E from largest to smallest weight
3. Init I with a random subset of V
4. For each e in the first k edges
   a) I’ ← GreedyInit(G,e)
   b) I’ ← GreedyImprovement(G,I’)
   c) If s(M, I) < s(M, I’) then I ← I’
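The loop above can be sketched in a few lines, with several stand-ins that the slides do not specify. Here `greedy_init` places the endpoints of the seed edge on opposite sides and assigns each remaining vertex to the side that cuts more weight, `greedy_improvement` flips single vertices while the cut grows (a simpler substitute for the edge flipping named on a later slide), and s(M, I) is taken to be the weight of the cut. All function names and the default `k` are hypothetical.

```python
import random

def score(r1, r2):
    """Pairwise fragment score: +1 per differing locus, -1 per agreeing locus, 0 if uncovered."""
    return sum(0 if "-" in (a, b) else (1 if a != b else -1) for a, b in zip(r1, r2))

def build_graph(M):
    """Step 1: edges between fragment pairs with non-zero score (assumption: zero edges omitted)."""
    edges = {}
    for i in range(len(M)):
        for j in range(i + 1, len(M)):
            w = score(M[i], M[j])
            if w != 0:
                edges[(i, j)] = w
    return edges

def cut_weight(edges, side):
    return sum(w for (i, j), w in edges.items() if (i in side) != (j in side))

def weight_to(edges, x, group):
    """Sum of edge weights between vertex x and the vertices in group."""
    return sum(w for (i, j), w in edges.items()
               if (i == x and j in group) or (j == x and i in group))

def greedy_init(n, edges, seed_edge):
    u, v = seed_edge
    side, other = {u}, {v}
    for x in range(n):
        if x in side or x in other:
            continue
        # Joining `side` cuts the edges towards `other`, and vice versa.
        if weight_to(edges, x, other) > weight_to(edges, x, side):
            side.add(x)
        else:
            other.add(x)
    return side

def greedy_improvement(n, edges, side):
    side, improved = set(side), True
    while improved:
        improved = False
        for x in range(n):
            flipped = side ^ {x}
            if cut_weight(edges, flipped) > cut_weight(edges, side):
                side, improved = flipped, True
    return side

def refhap_like_cut(M, k=3, seed=0):
    n, edges = len(M), build_graph(M)
    ranked = sorted(edges, key=edges.get, reverse=True)          # step 2
    rng = random.Random(seed)
    best = {x for x in range(n) if rng.random() < 0.5}           # step 3
    for e in ranked[:k]:                                         # step 4
        cand = greedy_improvement(n, edges, greedy_init(n, edges, e))
        if cut_weight(edges, best) < cut_weight(edges, cand):
            best = cand
    return best, cut_weight(edges, best)

print(refhap_like_cut(["-0110", "110-1", "1--0-", "-00-1"]))
```

On the four-fragment example from the previous slide this reaches a cut of weight 5, the same value an exhaustive search gives; on larger inputs a heuristic of this kind carries no such guarantee.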

Page 24: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

ReFHap Algorithm

• Classical greedy algorithm

[Figure: two drawings of a graph on vertices 1 to 4]

Page 25: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

ReFHap Algorithm

• Edge flipping

[Figure: two drawings of a graph on vertices 1 to 4; vertices 1 and 2 swap positions between the drawings]

Page 26: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS

                          Mixed Diploid                        Fosmid-Based                   Fosmid-Based relative to Mixed Diploid
Libraries                 Mate Pair & Paired End Genomic DNA   Paired End 16 Barcoded Pools
Uniquely Mapped           47 Gb                                15 Gb                          1/3rd
Number of Blocks          407                                  40                             1/10th
Av. Block Length          438 bp                               85 kb                          194 x
Max. Block Length         3.7 kb                               691 kb                         186 x
Total Length all Blocks   178 kb                               3.4 Mb                         19 x
% of Phased SNPs          12 %                                 66 %                           5 x
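The fold changes in the last column follow directly from the two value columns once both are expressed in base pairs. A small check, assuming 1 kb = 1,000 bp and 1 Mb = 1,000,000 bp; the helper name `fold_change` is illustrative:

```python
def fold_change(diploid_bp, fosmid_bp):
    """Ratio of the fosmid-based value to the mixed-diploid value."""
    return fosmid_bp / diploid_bp

ROWS = {
    "Av. Block Length": (438, 85_000),
    "Max. Block Length": (3_700, 691_000),
    "Total Length all Blocks": (178_000, 3_400_000),
}

for name, (diploid_bp, fosmid_bp) in ROWS.items():
    print(f"{name}: {fold_change(diploid_bp, fosmid_bp):.1f} x")
```

The results agree with the 194 x, 186 x and 19 x reported on the slide after dropping the decimals.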

Page 27: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Phasing MHC: Preliminary Results

• Number of blocks: 8
• N50 block length: 793 kb
• Maximum block length: 1.6 Mb
• Total extent of all blocks: 3.8 Mb
• Fraction of MHC phased into haplotype blocks: 95%
• Number of heterozygous SNPs: 8030
• Fraction of SNPs phased: 86%
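For readers unfamiliar with the N50 statistic quoted above: it is the length of the block at which the running total, taken from the longest block downwards, first reaches half of the summed length. A minimal sketch using toy lengths that are not the block lengths from this study:

```python
def n50(lengths):
    """Length L such that blocks of length >= L cover at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

# Toy values for illustration only.
print(n50([10, 20, 30, 40]))  # 30
```

With lengths 40, 30, 20 and 10 the running total reaches 70 of 100 at the second block, so the N50 is 30.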

Page 28: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Acknowledgements

The Life Tech Team:
Kevin McKernan, Alexander Sartori, Clarence Lee, Dustin Holloway, Jessica Spangler, Heather Peckham, Tristen Weaver, Stephen McLaughlin, Tamara Gilbert, Tim Harkins

Anita Suk, Sabrina Schulz, Steffi Palczewski, Britta Horstmann, Margret Hoehe, Gayle McEwen, Roger Horton, Thomas Hübsch

Thank You!

Page 29: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Comparison of mapping algorithms

COX Haplotype simulated reads

[Bar chart: mapped reads %, unique mapped reads % and multiple hits % (axis 0 to 120) for Bioscope classic, Bioscope classic iub, Bioscope local, Bioscope local iub, Bioscope local repeat schema and Bfast]

Page 30: Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing


Phasing MHC