Max-Planck-Institute for Molecular Genetics
Bioinformatics Pipeline for Fosmid-Based Molecular Haplotype Sequencing
Jorge Duitama (1,2), Thomas Huebsch (1), Gayle McEwen (1), Sabrina Schulz (1), Eun-Kyung Suk (1), Margret R. Hoehe (1)
1. Max Planck Institute for Molecular Genetics, Berlin, Germany
2. Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
MHC: Key Region for Common Diseases & Transplant Medicine
[Figure: map of the MHC with regions labelled MHC class I, MHC class III, and MHC class II; position markers 29.74, 31.59, 32.34, 33.21]
MHC: Variation amongst Haplotypes
Variation and annotation map for eight MHC haplotypes, Horton et al. Immunogenetics (2008) 60,1-18
Variation amongst 8 MHC Haplotypes:
• 37,451 Substitutions
• 7,093 Short Indels
[Figure: Variation of MHC Haplotypes against PGF reference. Tracks: 7 further MHC haplotype sequences over the PGF reference sequence. Labels: RCCX CNV, HLA-DRB CNV, MHC class III, MHC class II]
Experimental Approach
[Figure: workflow diagram; recoverable labels listed below]
• 100 Individuals, 100 Libraries
• 40 kb fosmids = 40 kb haploid molecules
• One pool: 5000 fosmids; 3 x 96-well = 288 fosmid pools
• SNP Mapping for Prioritization of MHC Informative Pools
• SOLiD NGS Platform: Targeted Enrichment / Shotgunning complete Fosmid Pool
• Identification of 40 kb fosmid sequences
• Data Analysis Pipeline
• Phasing molecular fosmid sequences: Haplotype A, Haplotype B
• Contiguous MHC haplotype sequence
Data Analysis Pipeline
SOLiD Standard Pipeline:
• Read Alignment against Genome
• Pairing
• Consensus Calling
• SNP Analysis
In-House Project-Specific Analysis Pipeline:
• Fosmid Detection Program
• Fosmid-Specific Matching Algorithm
• Fosmid Sequences Based Phasing
• Visualization & MHC Database
Mapping real data
Pool of 15,000 fosmids, 22 million reads, 50 bp
[Figure: bar chart (axis 0 to 70) of mapped reads %, unique mapped reads %, and multiple hits % for Bioscope classic, Bioscope local repeat 40.3, and Bioscope local repeat 45.3]
SNP calls: Haploid fosmids vs. genomic DNA

gDNA
cov    ref  consen  F3        coord
335    C    Y       177/17    62511614
3345   T    C       3191/56   62512095
875    G    A       862/25    62513689
1795   G    K       722/23    62513754
707    C    S       528/13    62515375
2643   C    Y       1391/20   62517737
643    C    Y       417/23    62518998
1074   A    R       554/21    62522445
606    C    S       226/21    62524689
639    A    M       167/15    62532474
158    G    R       89/14     62533464
1032   A    R       443/26    62534973
7      A    G       7/4       62537153
775    T    G       742/26    62540402
10     G    C       10/5      62540465
698    G    C       684/29    62541769
40     C    T       40/4      62542550
94     C    G       93/9      62542574
286    C    T       283/16    62543011
194    C    A       190/22    62543067

Fosmid
cov    ref  consen  F3        coord
595    C    T       572/91    62511614
3418   T    C       3278/98   62512095
2089   G    A       2048/98   62513689
2238   G    T       2194/98   62513754
1134   C    G       1107/73   62515375
3104   C    T       2922/98   62517737
1033   C    T       1014/83   62518998
1799   A    G       1753/98   62522445
1053   C    G       1049/83   62524689
54     G    A       39/22     62527964
32     A    C       27/23     62529870
1374   A    C       1355/95   62532474
973    G    A       946/97    62533464
2850   A    G       2745/98   62534973
49     A    G       48/33     62537153
1888   T    G       1845/95   62540402
37     G    C       36/20     62540465
923    G    C       901/97    62541769
8411   A    W       2006/78   62542258
253    C    T       253/47    62542550
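The consensus column uses IUPAC ambiguity codes (for example Y = C or T), so a heterozygous gDNA call can be compared with the single allele seen in the haploid fosmid pool. The following is a minimal sketch of that comparison; the function names are hypothetical and only a handful of rows from the two tables are copied in.

```python
# IUPAC nucleotide codes that appear in the consensus columns
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}

def alleles(code):
    """Set of bases represented by one IUPAC code."""
    return IUPAC[code.upper()]

def is_heterozygous(code):
    """A two-base ambiguity code indicates a heterozygous call."""
    return len(alleles(code)) == 2

def fosmid_consistent(gdna_code, fosmid_code):
    """True if every base in the fosmid call is also in the gDNA call."""
    return alleles(fosmid_code) <= alleles(gdna_code)

# (coord, ref, gDNA consensus, fosmid consensus), copied from the tables
rows = [
    (62511614, "C", "Y", "T"),
    (62513754, "G", "K", "T"),
    (62515375, "C", "S", "G"),
    (62522445, "A", "R", "G"),
    (62532474, "A", "M", "C"),
]
for coord, ref, gdna, fos in rows:
    print(coord, sorted(alleles(gdna)), "->", fos, fosmid_consistent(gdna, fos))
```

In each of these rows the gDNA call is heterozygous and the fosmid call is one of its two alleles, which is what haploid input would be expected to produce.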
SNP Calling Accuracy in the MHC
• Affymetrix genotype information for 1583 SNP positions as reference standard:
   - Homozygous identical with reference: 957
   - Heterozygous: 562
   - Homozygous different from reference: 64
• Compared to variants called from the SOLiD-sequenced genomic DNA sample (15x average read coverage)
• Percentage of error in genotype calling: 3.66%
• False positive rate: 0.1%
• False negative rate: 9.25%
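The slide does not say how the three rates are defined, so the sketch below uses assumed definitions that are spelled out in the docstring. The genotype labels, the function name, and the five-position toy data are all hypothetical; only the three category counts in the first line come from the slide.

```python
# Category counts from the slide add up to the 1583 reference positions
assert 957 + 562 + 64 == 1583

def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0

def genotype_metrics(truth, calls):
    """Compare called genotypes with a reference standard.

    truth, calls: dict position -> "hom_ref", "het" or "hom_alt".
    Assumed definitions (not stated on the slide):
      error_rate:          positions whose called genotype differs / all positions
      false_positive_rate: hom_ref positions called as variant / hom_ref positions
      false_negative_rate: variant positions called hom_ref / variant positions
    """
    shared = [p for p in truth if p in calls]
    ref_sites = [p for p in shared if truth[p] == "hom_ref"]
    var_sites = [p for p in shared if truth[p] != "hom_ref"]
    return {
        "error_rate": ratio(sum(truth[p] != calls[p] for p in shared), len(shared)),
        "false_positive_rate": ratio(sum(calls[p] != "hom_ref" for p in ref_sites), len(ref_sites)),
        "false_negative_rate": ratio(sum(calls[p] == "hom_ref" for p in var_sites), len(var_sites)),
    }

# Toy data, NOT the Affymetrix or SOLiD genotypes
truth = {1: "hom_ref", 2: "hom_ref", 3: "het", 4: "het", 5: "hom_alt"}
calls = {1: "hom_ref", 2: "het", 3: "het", 4: "hom_ref", 5: "hom_alt"}
print(genotype_metrics(truth, calls))
```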
Fosmids Detection
UCSC Genome browser http://genome.ucsc.edu/ Kent et al. 2002 Genome Res. 12(6):996-1006.
Fosmid Detection Algorithm
1. Assign each read to a single 1 kb long bin. Select bins with more than 5 reads
2. Perform allele calls for each heterozygous SNP. Mark bins with heterozygous calls
3. Cluster adjacent bins as belonging to the same fosmid if:
   i. The gap distance between them is less than 10 kb and
   ii. There are no bins with heterozygous SNPs between them
4. Keep fosmids with lengths between 3 kb and 60 kb
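The four steps of the detection algorithm can be sketched as follows. This is an illustration under stated assumptions, not the in-house program: the function name and inputs are hypothetical, the allele-calling step is replaced by a precomputed set of heterozygous bin indices, and heterozygous bins are treated as breakpoints that are left out of the clusters.

```python
from collections import Counter

BIN_SIZE = 1_000                   # 1 kb bins
MIN_READS = 5                      # keep bins with more than 5 reads
MAX_GAP = 10_000                   # cluster bins less than 10 kb apart
MIN_LEN, MAX_LEN = 3_000, 60_000   # keep fosmids of 3 kb to 60 kb

def detect_fosmids(read_starts, het_bins=frozenset()):
    """Return (start, end) intervals of candidate fosmids on one chromosome.

    read_starts: mapped read start coordinates.
    het_bins:    indices of bins with heterozygous calls (assumed to be
                 computed elsewhere; allele calling is not shown).
    """
    # Step 1: one bin per read, keep well-covered bins
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    covered = sorted(b for b, n in counts.items()
                     if n > MIN_READS and b not in het_bins)
    # Steps 2 and 3: join neighbouring bins unless the gap is too large
    # or a heterozygous bin lies between them
    clusters = []
    for b in covered:
        if clusters:
            last = clusters[-1][-1]
            gap = (b - last - 1) * BIN_SIZE
            blocked = any(last < h < b for h in het_bins)
            if gap < MAX_GAP and not blocked:
                clusters[-1].append(b)
                continue
        clusters.append([b])
    # Step 4: length filter
    fosmids = []
    for c in clusters:
        start, end = c[0] * BIN_SIZE, (c[-1] + 1) * BIN_SIZE
        if MIN_LEN <= end - start <= MAX_LEN:
            fosmids.append((start, end))
    return fosmids

# Toy input: a 40 kb stretch of covered bins and a 2 kb stretch
reads = [b * BIN_SIZE + i for b in range(100, 140) for i in range(6)]
reads += [b * BIN_SIZE + i for b in range(200, 202) for i in range(6)]
print(detect_fosmids(reads))
```

With this toy input only the 40 kb stretch survives, because the 2 kb stretch fails the length filter.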
Fosmids Detection
Size distribution of read-contigs
[Figure: histogram of number of contigs (y axis, 0 to 3500) by contig length in kb (x axis, 1 kb steps); fosmid sized contigs: 20 to 50 kb]
Haplotyping
Locus  Event      Alleles
1      SNV        C,T
2      Deletion   C,-
3      SNV        A,G
4      Insertion  -,GC

Locus  Event      Alleles Hap 1  Alleles Hap 2
1      SNV        T              C
2      Deletion   C              -
3      SNV        A              G
4      Insertion  -              GC

The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.
Single Individual Haplotyping
• Input: Matrix M of m fragments covering n loci

Locus  1  2  3  4  5  ...  n
f1     -  0  1  1  0  ...  0
f2     1  1  0  -  1  ...  1
f3     0  0  0  1  1  ...  -
...
fm     -  -  1  -  1  ...  1
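ReFHap Problem Formulation
For two alleles a1, a2:
   s(a1, a2) = 0 if a1 or a2 is "-", 1 if a1 ≠ a2, -1 if a1 = a2
For two rows i1, i2 of M:
   s(M, i1, i2) = sum over all loci j of s(M[i1, j], M[i2, j])

Locus  1  2  3  4  5
f1     -  0  1  1  0
f2     1  1  1  -  1
Score  0  1 -1  0  1

s(M,1,2) = 1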
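The per-locus scores in the worked example (0, 1, -1, 0, 1) suggest a simple rule: 0 when either fragment has a gap, +1 when the two calls differ, -1 when they agree. The sketch below encodes that reading of the example and reproduces the slide's numbers; the function names are hypothetical.

```python
def allele_score(a1, a2):
    """0 if either call is a gap, +1 if the calls differ, -1 if they agree."""
    if a1 == "-" or a2 == "-":
        return 0
    return 1 if a1 != a2 else -1

def fragment_score(row1, row2):
    """Sum of per-locus scores over two fragments of equal length."""
    return sum(allele_score(a, b) for a, b in zip(row1, row2))

f1 = "-0110"
f2 = "111-1"
print([allele_score(a, b) for a, b in zip(f1, f2)])  # [0, 1, -1, 0, 1]
print(fragment_score(f1, f2))                        # 1
```

A positive score means the two fragments mostly disagree where they overlap, so they more likely come from different chromosome copies.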
ReFHap Problem Formulation
For a cut I of rows of M:
   s(M, I) = sum of s(M, i1, i2) over all rows i1 in I and i2 not in I
ReFHap Algorithm
• Reduce the problem to Max-Cut
• Solve Max-Cut
• Build haplotypes according to the cut

Locus  1  2  3  4  5
f1     -  0  1  1  0
f2     1  1  0  -  1
f3     1  -  -  0  -
f4     -  0  0  -  1

[Figure: graph on fragments 1 to 4 with edge weights 3, 1, 1, -1, -1]

h1 00110
h2 11001
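The three steps can be tried on the four-fragment example. The sketch below makes several assumptions: edge weights use the +1 / -1 / 0 scoring inferred from the problem formulation example, Max-Cut is solved by exhaustive search (feasible only for tiny inputs, and not the ReFHap heuristic), and haplotypes are built by a per-locus majority vote, which is one plausible reading of "according to the cut". All function names are hypothetical.

```python
from itertools import combinations

M = ["-0110", "110-1", "1--0-", "-00-1"]  # fragments f1..f4 from the slide

def fragment_score(r1, r2):
    return sum(0 if "-" in (a, b) else (1 if a != b else -1)
               for a, b in zip(r1, r2))

def build_graph(M):
    """Weighted edges between fragments that share at least one called locus."""
    w = {}
    for i, j in combinations(range(len(M)), 2):
        if any(a != "-" and b != "-" for a, b in zip(M[i], M[j])):
            w[(i, j)] = fragment_score(M[i], M[j])
    return w

def cut_weight(w, I):
    """Total weight of edges with exactly one endpoint in I."""
    return sum(wt for (i, j), wt in w.items() if (i in I) != (j in I))

def brute_force_max_cut(M, w):
    """Exhaustive search; fragment 0 is fixed in I to break the symmetry."""
    best = {0}
    for k in range(len(M)):
        for extra in combinations(range(1, len(M)), k):
            I = {0, *extra}
            if cut_weight(w, I) > cut_weight(w, best):
                best = I
    return best

def build_haplotypes(M, I):
    """Fragments in I vote for h1 directly, the others vote with the complement."""
    h1 = []
    for j in range(len(M[0])):
        votes = 0
        for i, row in enumerate(M):
            if row[j] == "-":
                continue
            allele = int(row[j]) if i in I else 1 - int(row[j])
            votes += 1 if allele == 1 else -1
        h1.append("1" if votes > 0 else "0")  # ties fall back to 0
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2

w = build_graph(M)
I = brute_force_max_cut(M, w)
print(sorted(w.values(), reverse=True))  # [3, 1, 1, -1, -1]
print(build_haplotypes(M, I))            # ('00110', '11001')
```

The edge weights and the two haplotypes agree with the values shown on the slide; the best cut separates f1 from the other three fragments.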
ReFHap Algorithm
1. Build G=(V,E,w) from M
2. Sort E from largest to smallest weight
3. Init I with a random subset of V
4. For each e in the first k edges
   a) I' ← GreedyInit(G,e)
   b) I' ← GreedyImprovement(G,I')
   c) If s(M, I) < s(M, I') then I ← I'
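The loop above can be sketched in a few lines. The slides do not spell out GreedyInit or GreedyImprovement, so both helpers below are simplified stand-ins: the first seeds the two sides with the endpoints of the edge and places the remaining vertices greedily, the second flips single vertices while the cut weight increases. s(M, I) is taken to be the weight of the cut, and the +1 / -1 / 0 scoring is the one inferred from the problem formulation example. All names are hypothetical.

```python
import random
from itertools import combinations

def score(r1, r2):
    return sum(0 if "-" in (a, b) else (1 if a != b else -1)
               for a, b in zip(r1, r2))

def build_graph(M):
    return {(i, j): score(M[i], M[j])
            for i, j in combinations(range(len(M)), 2)
            if any(a != "-" and b != "-" for a, b in zip(M[i], M[j]))}

def cut_weight(w, I):
    return sum(wt for (i, j), wt in w.items() if (i in I) != (j in I))

def greedy_init(n, w, edge):
    """Stand-in: put the edge's endpoints on opposite sides, then add each
    remaining vertex to the side where it contributes more cut weight."""
    u, v = edge
    I, J = {u}, {v}
    for x in range(n):
        if x in (u, v):
            continue
        gain_I = sum(wt for (i, j), wt in w.items()
                     if (i == x and j in J) or (j == x and i in J))
        gain_J = sum(wt for (i, j), wt in w.items()
                     if (i == x and j in I) or (j == x and i in I))
        (I if gain_I >= gain_J else J).add(x)
    return I

def greedy_improvement(n, w, I):
    """Stand-in: flip single vertices as long as the cut weight increases."""
    I = set(I)
    improved = True
    while improved:
        improved = False
        for x in range(n):
            candidate = I ^ {x}
            if cut_weight(w, candidate) > cut_weight(w, I):
                I, improved = candidate, True
    return I

def refhap_like(M, k=3, seed=0):
    rng = random.Random(seed)
    n = len(M)
    w = build_graph(M)                               # step 1
    edges = sorted(w, key=w.get, reverse=True)       # step 2
    I = {x for x in range(n) if rng.random() < 0.5}  # step 3
    for e in edges[:k]:                              # step 4
        cand = greedy_improvement(n, w, greedy_init(n, w, e))
        if cut_weight(w, I) < cut_weight(w, cand):
            I = cand
    return I, cut_weight(w, I)

M = ["-0110", "110-1", "1--0-", "-00-1"]  # four-fragment example
print(refhap_like(M))
```

On the four-fragment example the heaviest edge joins f1 and f2, and starting from it the sketch reaches a cut of weight 5 that separates f1 from the rest.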
ReFHap Algorithm
• Classical greedy algorithm
[Figure: graph on nodes 1, 2, 3, 4, drawn twice]
ReFHap Algorithm
• Edge flipping
[Figure: graph on nodes 1 to 4; the layout 1 2 / 3 4 becomes 2 1 / 3 4]
Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS
                          Mixed Diploid                         Fosmid-Based                    Ratio
Libraries                 Mate Pair & Paired End Genomic DNA    Paired End, 16 Barcoded Pools
Uniquely Mapped           47 Gb                                 15 Gb                           1/3rd
Number of Blocks          407                                   40                              1/10th
Av. Block Length          438 bp                                85 kb                           194 x
Max. Block Length         3.7 kb                                691 kb                          186 x
Total Length all Blocks   178 kb                                3.4 Mb                          19 x
% of Phased SNPs          12 %                                  66 %                            5 x
Phasing MHC: Preliminary Results
• Number of blocks: 8
• N50 block length: 793 kb
• Maximum block length: 1.6 Mb
• Total extent of all blocks: 3.8 Mb
• Fraction of MHC phased into haplotype blocks: 95%
• Number of heterozygous SNPs: 8030
• Fraction of SNPs phased: 86%
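For readers unfamiliar with the N50 statistic: it is the length L such that blocks of length L or longer together cover at least half of the total block length. A minimal sketch follows; the block lengths are hypothetical and are not the eight MHC blocks reported above.

```python
def n50(lengths):
    """Length of the block at which the running total, taken from the
    longest block downward, first reaches half of the overall total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no blocks given")

# Hypothetical block lengths in kb, for illustration only
blocks_kb = [900, 700, 400, 300, 200]
print(n50(blocks_kb))  # 700
```

Here the total is 2500 kb; the longest block alone covers 900 kb, and adding the second reaches 1600 kb, which passes the halfway mark, so the N50 is 700 kb.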
Acknowledgements
The Life Tech Team:
Kevin McKernan, Alexander Sartori, Clarence Lee, Dustin Holloway, Jessica Spangler, Heather Peckham, Tristen Weaver, Stephen McLaughlin, Tamara Gilbert, Tim Harkins

Anita Suk
Sabrina Schulz
Steffi Palczewski
Britta Horstmann
Margret Hoehe
Gayle McEwen
Roger Horton
Thomas Hübsch
Thank You!
Comparison Mapping algos
COX Haplotype simulated reads
[Figure: bar chart (axis 0 to 120) of mapped reads %, unique mapped reads %, and multiple hits % for Bioscope classic, Bioscope classic iub, Bioscope local, Bioscope local iub, Bioscope local repeat schema, and Bfast]
Phasing MHC