Organizational Heterogeneity of Human Genome:
Significant variation of recombination rate of 100 kbp sequences within GC ranges
Svetlana Frenkel, Valery Kirzhner, Abraham Korol
Department of Evolutionary and Environmental Biology, Institute of Evolution
University of Haifa
Some aspects of intra-genome heterogeneity
• Varying gene density
• Clusters of tissue-specific and housekeeping genes
• Linkage disequilibrium (LD) blocks
• Mutation and recombination rates
• Conserved and ultraconserved segments
• Localization of inversions, deletions, insertions and duplications
Genome Heterogeneity: GC content
From: Costantini, M., Clay, O., Auletta, F., Bernardi, G. (2006) An isochore map of human chromosomes. Genome Res., 16, 536-541.
From: UHN Microarray Centre's CpG Island Database http://data.microarrays.ca/cpg/index.htm
The level of redness denotes the relative number of CpG islands that can be located on the chromosome in that region
Genome Signature (Samuel Karlin et al., 1997)
Local:
• preliminary searches of candidates for gene alignment
• detecting candidate regulatory signals
• detecting promoter regions
• detecting repetitive elements
• duplications of genomic
• horizontal gene transfer
Genome-wide:
• phylogenetic analysis
• species recognition
• whole-genome sequence comparisons
Linguistic-like methods
• Detecting all “words” up to a certain maximal length
• Characterizing the sequence “vocabulary”
• Scoring the occurrences of fixed-length “words” from a predefined “vocabulary”
• Comparison of “word” frequencies obtained from different sequences
• Comparison of the “vocabularies” of different sequences
Compositional Spectra Analysis
Compositional Spectra
A linguistic-like method of genome analysis based on occurrences of “words” in the A, C, G, T alphabet.
A compositional spectrum (CS) is measured as a histogram of imperfect word occurrences.
From: V. Kirzhner et al., 2002-2005
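The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the use of Hamming distance to define an "imperfect" occurrence, and the `max_mismatches` parameter are all assumptions made for the example.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def compositional_spectrum(sequence, words, max_mismatches=1):
    """Histogram of imperfect word occurrences (illustrative sketch).

    For every word, count the windows of the sequence that match it
    with at most `max_mismatches` substitutions. All words are assumed
    to have the same length; the mismatch threshold is a hypothetical
    parameter, not a value taken from the slides.
    """
    k = len(words[0])
    # Count each distinct window once, then reuse the counts for all words.
    windows = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [
        sum(count for window, count in windows.items()
            if hamming(window, word) <= max_mismatches)
        for word in words
    ]


if __name__ == "__main__":
    spectrum = compositional_spectrum("ACGTACGT", ["ACG", "TTT"], max_mismatches=1)
    print(spectrum)
```

Counting distinct windows first keeps the cost proportional to the number of different windows rather than to the sequence length for each word.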
Methods: calculating distances
[Figure: spectra F(Si, W), F(S’i, W), F(Sj, W), F(S’j, W), F(Si, W’) and F(Sj, W’) for strands labeled 5’ and 3’, with distances d1, d’1, d2 and d’2 drawn between them.]
Distance measures:
• Manhattan (city block) distance
• Spearman rank correlation ρ (d = 1 − ρ)
• Kendall distance τ
d = min(di, d’i, dj, d’j)
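One way to read the slide is that spectra are compared for both strand orientations and the smallest distance is kept. The sketch below illustrates that idea with the Manhattan distance. Exactly which pairs the slide's di, d’i, dj and d’j denote cannot be recovered from the extracted text, so the four pairings used here, the exact-match word counts, and all function names are assumptions.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Complementary strand read 5' to 3'; unknown bases become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))


def all_words(k):
    """Every word of length k over the A, C, G, T alphabet."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def word_frequencies(seq, words):
    """Relative frequencies of exact word occurrences (simplified spectrum)."""
    k = len(words[0])
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(p, q):
    """City-block distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def min_strand_distance(seq_i, seq_j, words):
    """Smallest Manhattan distance over the four strand pairings (assumed)."""
    f_i = word_frequencies(seq_i, words)
    f_i_rc = word_frequencies(reverse_complement(seq_i), words)
    f_j = word_frequencies(seq_j, words)
    f_j_rc = word_frequencies(reverse_complement(seq_j), words)
    return min(manhattan(f_i, f_j), manhattan(f_i, f_j_rc),
               manhattan(f_i_rc, f_j), manhattan(f_i_rc, f_j_rc))
```

Taking the minimum makes the distance insensitive to which strand of a segment happened to be sequenced: a segment and its reverse complement end up at distance 0.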
Methods: Detection of Organizational Pattern groups of segments
[Figure: neighbor-joining clustering tree over genome segment number, with a Low-to-High scale. Labels: relative distance between two clusters, maximal distance between segments, “adaptive cutoff”.]
Analysis of Organizational Pattern groups of segments
Significant variation of evolutionary features of 100 kbp sequences within GC ranges
Testing for potential association between the genome-wide distribution of organizational patterns (OPs) and various evolutionary and structural features reveals inter-OP heterogeneity in SNP and indel frequency, recombination rate, number of segmental duplications, size of linkage disequilibrium blocks, and proportion of evolutionarily conserved sequence.
Estimation of heterogeneity between OP groups
[Figure: recombination rate plotted against GC.]
Estimation of heterogeneity between OP groups
[Figure: −log(FDR-corrected p-value) plotted against GC.]
Values printed on the slide, first row: 0.22, 8.8×10⁻⁵, 8.8×10⁻⁵, 8.8×10⁻⁵, 8.8×10⁻⁵, 8.8×10⁻⁵, 8.8×10⁻⁵, 8.8×10⁻⁵, 8.8×10⁻⁵, 0.03, 1.9×10⁻³, 0.01, 0.11, 3.9×10⁻³
Values printed on the slide, second row: 2.3, 5.1, 86.1, 48.6, 81.9, 35.7, 21.0, 26.0, 46.7, 36.6, 13.6, 15.7, 15.5, 16.9
• Kruskal–Wallis non-parametric rank test
• 10,000 segment reshuffles to estimate the test critical value
• FDR correction for multiple comparisons
• Reshuffled sequences within every segment as control
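The testing procedure named on this slide can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the H statistic is computed without tie correction, the reshuffling is assumed to permute segments among OP groups, and the FDR step is assumed to be Benjamini-Hochberg (the slide does not say which FDR procedure was used). All function names are hypothetical.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction in this sketch)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def reshuffle_p_value(groups, n_shuffles=10000, seed=0):
    """Share of reshuffles whose H is at least the observed H."""
    rng = random.Random(seed)
    observed = kruskal_h(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # assumed: segments are permuted among groups
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if kruskal_h(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In use, one p-value per GC range would be obtained with `reshuffle_p_value` on the recombination rates of that range's OP groups, and the resulting list passed to `benjamini_hochberg`.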
Detecting the words related to recombination rate
GC, %   Average RR in the compared OPGs   Proportion of correct classifications of segments to OP groups, %
        low RR     high RR                all words    set of 47 words    set of 8 words
35      0.82       0.93                   98.60        98.62              76.03
36      0.62       1.16                   98.40        96.56              82.34
37      0.83       1.28                   94.10        93.88              80.47
38      0.80       1.46                   99.58        99.17              98.33
39      0.91       1.59                   97.32        97.32              96.55
40      0.96       1.50                   100.0        100.0              100.0
41      1.13       1.81                   98.80        98.50              98.50
42      1.05       1.80                   100.0        100.0              99.62
43      1.29       1.99                   97.48        96.98              95.46
44      1.44       1.83                   99.01        99.21              98.81
45      1.35       2.06                   100          98.93              98.22
46      1.30       1.88                   98.53        98.53              97.35
47      1.15       1.74                   94.62        94.61              91.48
48      1.33       2.04                   98.78        98.77              97.55
Oligonucleotides that showed high importance in more than half of the OPG comparisons when classifying 100 kbp segments by high and low recombination rate
Columns: Oligonucleotide; GC, %; Appeared in the list of 10 most important variables (times); Appeared as the most important variable (times); Previously described pattern (pattern / aligned oligonucleotide); Reference

Oligonucleotide   GC, %   In top 10   Most important   Previously described pattern          Reference
CAGCCAGGTT        60      11          4                -CCNCCNTNNCCNC- / -CAGCCAGGTT----     Myers et al. 2008
GACCGGACTG        70      10          1                ---CCTCCCT-- / -GACCGGACTG-           Myers et al. 2005
                                                       -CCNCCNTNNCCNC- / ---GACCGGACTG--     Myers et al. 2008
CGCCGGGACT        80      10          3                -CCNCCNTNNCCNC- / --CGCCGGGACT---     Myers et al. 2008
GCGTAGGCTA        60      9           0                -CCNCCNTNNCCNC- / ---GCGTAGGCTA--     Myers et al. 2008
TGGGCCCGGC        90      8           4                n/a
GGCGTGCGCG        90      8           1                -GGNGGNAGGGG- / -GGCGTGCGCG--         Zheng et al. 2010
                                                       -CCNCCNTNNCCNC- / ---GGCGTGCGCG--     Myers et al. 2008
CCCGGTATCG        70      8           0                -CCNCCNTNNCCNC- / --CCCGGTATCG---     Myers et al. 2008
GCCCTTTCCT        60      7           0                ---CCTCCCT-- / -GCCCTTTCCT-           Myers et al. 2005
                                                       -CCNCCNTNNCCNC- / ---GCCCTTTCCT--     Myers et al. 2008
                                                       -CCTCCCTNNCCAC- / ---GCCCTTTCCT--     Myers et al. 2008
Functionally related genes tend to reside in organizationally similar genomic regions
GO enrichment of the genes in the four organizational pattern clusters that showed the most significant GO enrichments:
• L2-a cluster is enriched for the “mitochondrion”, “intracellular non-membrane-bounded organelle”, “nuclear envelope” and “ribonucleoprotein complex” GO terms;
• L2-h cluster is enriched for the “G-protein-coupled receptor protein signaling pathway” and “sensory perception of smell” GO terms;
• H1-i cluster is enriched for the “epithelial cell differentiation” and “epithelium development” GO terms;
• H2-a cluster is enriched for the “skeletal system development” GO term.
Paz A, Frenkel S, Snir S, Kirzhner V, Korol A. 2014. BMC Genomics 15:252.
Thank you for your attention
Acknowledgments
Dr. Valery Kirzhner, Prof. Abraham Korol, Prof. Edward Trifonov, Dr. Arnon Paz and Dr. Zeev Frenkel
This work was supported by:
• The Israeli Ministry of Immigrant Absorption
• The Israel Council for Higher Education
Calculating compositional spectra
…AGTAGTTACACTACTATAGTGACGACTCCATCGTCGTCGAGAACGTACCTTCTATATCCAAGGTACTACACTCGCGACCG…
3676  CTACTATAGT
… CTACTATAGT, CTACTAAAGT, CTAGTAAAGT, CTAGTAAAGT, CTAGTAACGT, CGCCTAAAGT, CCACTAAGGT …
256 × 3676 = 941056; 86.7%
Additional slide
Spearman's rank correlation coefficient rho
Spearman's rank correlation coefficient is a non-parametric measure of correlation. When all ranks are distinct, ρ is given by:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$$
where:
• Di = xi − yi = the difference between the ranks of corresponding values Xi and Yi, and
• n = the number of values in each data set (same for both sets).
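A small sketch of this coefficient and of the derived distance d = 1 − ρ, assuming no tied values in either data set (the textbook rank-difference formula is exact only in that case). The function names are illustrative.

```python
def simple_ranks(values):
    """1-based ranks; assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_rho(x, y):
    """Spearman's rho from rank differences D_i (no-ties formula)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two data sets of equal size, n >= 2")
    n = len(x)
    rank_x, rank_y = simple_ranks(x), simple_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def spearman_distance(x, y):
    """Distance d = 1 - rho, ranging from 0 (same order) to 2 (reversed)."""
    return 1 - spearman_rho(x, y)
```

For two compositional spectra, `x` and `y` would be the word counts over the same word set, so identical word rankings give a distance of 0.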
The Kendall tau distance
The Kendall tau distance is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are.
The Kendall tau distance between two lists τ1 and τ2 is
$$K(\tau_1, \tau_2) = \left|\{(i, j) : i < j,\ (\tau_1(i) < \tau_1(j) \wedge \tau_2(i) > \tau_2(j)) \vee (\tau_1(i) > \tau_1(j) \wedge \tau_2(i) < \tau_2(j))\}\right|$$
where τ1(i) and τ2(i) are the ranks of element i in the two lists.
K(τ1,τ2) is 0 if the two lists are identical and n(n − 1) / 2 (where n is the list size) if one list is the reverse of the other. The Kendall tau distance is often normalized by dividing by n(n − 1) / 2, so that a value of 1 indicates maximum disagreement. The normalized Kendall tau distance therefore lies in the interval [0,1].
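The pairwise-disagreement count can be sketched directly from the definition. This is a simple quadratic-time illustration with hypothetical names, where each input holds the rank of every element in one of the two lists.

```python
from itertools import combinations


def kendall_tau_distance(ranks_a, ranks_b, normalized=False):
    """Number of element pairs that the two rankings order differently.

    ranks_a[i] and ranks_b[i] are the ranks of element i in each list.
    With normalized=True the count is divided by n(n - 1) / 2.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both rankings must cover the same elements")
    n = len(ranks_a)
    disagreements = sum(
        1
        for i, j in combinations(range(n), 2)
        # A negative product means the pair is ordered oppositely.
        if (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j]) < 0
    )
    if normalized:
        max_pairs = n * (n - 1) / 2
        return disagreements / max_pairs if max_pairs else 0.0
    return disagreements
```

Checking every pair costs O(n²) comparisons, which is acceptable for short word lists; larger rankings would call for a merge-sort-based inversion count instead.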