Paul Dan CRISTEA
Bio-Medical Engineering Center, “Politehnica” University of Bucharest
Spl. Independentei 313, 060042 Bucharest, Romania
Phone: +40-21-4114437, +40-745-117062
e-mail: [email protected]
Morphological and Functional Statistical Features of DNA Molecules
National Symposium on Theoretical Electrical Engineering (Simpozionul Naţional de Electrotehnică Teoretică)
22-23 October 2004, Bucharest, UPB
Anniversary edition marking 100 years since the birth of the great Romanian scientist Remus Răduleţ
PAPER OUTLINE
1. Introduction - DNA Structure
2. Nucleotide Representation
3. Phase Analysis of Genomic Signals
4. Reorienting DNA Segments
5. HIV Genomic Signal Analysis
6. Representability of Data
7. Conclusions
1. DNA Structure
[Figure: double-stranded DNA. The two antiparallel sugar-phosphate backbones run in the 5' to 3' direction between their 5' end and 3' end; the strands are held together by the A-T and G-C base pairs.]
Nucleoside Triphosphate
[Figure: nucleoside triphosphate. Labels: 5' end, 3' end, triphosphate, deoxyribose (carbons 1' to 5'), nitrogenous base at the 1' carbon, phosphoanhydride bonds, phosphoester bonds.]
ATP, ADP, AMP
[Figure: Adenosine triphosphate (ATP): adenine attached to deoxyribose and three phosphate groups; adenosine diphosphate (ADP); adenosine monophosphate (AMP). Labels on the conversions: Krebs cycle, hydrolysis, pyrophosphate.]
2. NUCLEOTIDE REPRESENTATION
[Figure: the four nucleotides and their classes.]
A (adenine), C (cytosine), G (guanine), T (thymine)
R (purines): A, G
Y (pyrimidines): C, T
S (strong link): C, G
W (weak link): A, T
M (amino): A, C
K (keto): G, T
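Vector mapping of IUPAC symbols
[Figure: tetrahedral vector mapping of the IUPAC symbols.]

Vector representation of FASTA
Nucleotides (vertices of the tetrahedron), with $\hat\imath, \hat\jmath, \hat k$ the unit vectors of the axes:
$$\vec a = \hat\imath + \hat\jmath + \hat k,\qquad \vec c = -\hat\imath + \hat\jmath - \hat k,\qquad \vec g = -\hat\imath - \hat\jmath + \hat k,\qquad \vec t = \hat\imath - \hat\jmath - \hat k.$$
Two-nucleotide classes:
$$\vec w = \frac{\vec a + \vec t}{2} = \hat\imath,\qquad \vec s = \frac{\vec c + \vec g}{2} = -\hat\imath,$$
$$\vec m = \frac{\vec a + \vec c}{2} = \hat\jmath,\qquad \vec k = \frac{\vec g + \vec t}{2} = -\hat\jmath,$$
$$\vec r = \frac{\vec a + \vec g}{2} = \hat k,\qquad \vec y = \frac{\vec c + \vec t}{2} = -\hat k.$$
Three-nucleotide classes:
$$\vec b = \frac{\vec c + \vec g + \vec t}{3} = -\frac{\vec a}{3},\qquad \vec d = \frac{\vec a + \vec g + \vec t}{3} = -\frac{\vec c}{3},$$
$$\vec h = \frac{\vec a + \vec c + \vec t}{3} = -\frac{\vec g}{3},\qquad \vec v = \frac{\vec a + \vec c + \vec g}{3} = -\frac{\vec t}{3}.$$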
Quadrantal complex mapping of FASTA symbols
[Figure: quadrantal complex mapping of the FASTA symbols.]

Complex representation of FASTA
A = 1 + j, C = −1 − j, G = −1 + j, T = 1 − j
W = 1, S = −1, R = j, Y = −j
K = M = N = 0
$$B = -\tfrac{1}{3}(1 + j),\qquad D = \tfrac{1}{3}(1 + j),\qquad H = \tfrac{1}{3}(1 - j),\qquad V = \tfrac{1}{3}(-1 + j)$$
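The complex mapping above can be tried out with a few lines of Python. This is only a minimal sketch: the names `NUCLEOTIDE`, `IUPAC_CLASSES`, `symbol_to_complex` and `genomic_signal` are hypothetical, and the values of the degenerate symbols are computed here as the average of the nucleotides they stand for, which reproduces the values listed on the slide.

```python
# Quadrantal complex mapping of the four nucleotides, as listed on the slide.
NUCLEOTIDE = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}

# IUPAC degenerate symbols and the nucleotides each one stands for.
IUPAC_CLASSES = {
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def symbol_to_complex(symbol):
    """Complex value of one FASTA symbol (assumption: class value = mean of members)."""
    symbol = symbol.upper()
    if symbol in NUCLEOTIDE:
        return NUCLEOTIDE[symbol]
    members = IUPAC_CLASSES[symbol]
    return sum(NUCLEOTIDE[b] for b in members) / len(members)


def genomic_signal(sequence):
    """Convert a nucleotide string into a list of complex samples."""
    return [symbol_to_complex(ch) for ch in sequence]


if __name__ == "__main__":
    print(genomic_signal("ACGTWN"))
```

Averaging makes K, M and N vanish because their member nucleotides cancel in the complex plane.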
Codon tetrahedron representation
[Figure: codon tetrahedron with vertices Adenine, Cytosine, Guanine and Thymine. Axes: x = Weak − Strong bonds, y = Amino − Keto, z = Purines − Pyrimidines. The codons AAA, AAC, AAT and AAG are marked near the Adenine vertex.]
$$\vec x = 2^2\,\vec b_2 + 2^1\,\vec b_1 + 2^0\,\vec b_0,\qquad \vec b_i \in \{\vec a, \vec c, \vec g, \vec t\},\quad i = 0, 1, 2.$$
Codon complex representation
[Figure: codon complex plane with corners Adenine, Cytosine, Guanine and Thymine. Axes: Re = W − S (weak bonds − strong bonds), Im = R − Y (purines − pyrimidines), marked at ±4 and ±4j. The codons AAA and AAG (Lysine) and AAT and AAC (Asparagine) are marked.]
$$x = 2^2\,b_2 + 2^1\,b_1 + 2^0\,b_0,\qquad b_i \in \{a, c, g, t\},\quad i = 0, 1, 2.$$
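As an illustration of the codon formula, the sketch below maps a three-letter codon to a single complex number. It assumes that b2 is the first nucleotide of the codon and b0 the last (this ordering is read off the figure, where AAA, AAC, AAT and AAG sit close together); `BASE` and `codon_to_complex` are hypothetical names.

```python
from itertools import product

# Complex values of the four nucleotides (quadrantal mapping).
BASE = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}


def codon_to_complex(codon):
    """x = 4*b2 + 2*b1 + b0, assuming b2 is the first base of the codon."""
    if len(codon) != 3:
        raise ValueError("a codon has exactly three nucleotides")
    b2, b1, b0 = (BASE[ch] for ch in codon.upper())
    return 4 * b2 + 2 * b1 + b0


if __name__ == "__main__":
    codons = ["".join(p) for p in product("ACGT", repeat=3)]
    values = {codon_to_complex(c) for c in codons}
    # Every codon gets its own point in the complex plane.
    print(len(codons), len(values))
```

With the weights 4, 2 and 1 the third base only shifts the point slightly, so codons that differ in the last position stay clustered.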
Genetic Code
[Figure: the 64 codons placed on the codon tetrahedron (vertices Adenine, Cytosine, Guanine, Thymine), each region labelled with the amino acid it encodes.]
Ala – Alanine
Arg – Arginine
Asn – Asparagine
Asp – Aspartic acid
Cys – Cysteine
Gln – Glutamine
Glu – Glutamic acid
Gly – Glycine
Ile – Isoleucine
His – Histidine
Leu – Leucine
Lys – Lysine
Met – Methionine
Phe – Phenylalanine
Pro – Proline
Ser – Serine
Thr – Threonine
Trp – Tryptophan
Tyr – Tyrosine
Val – Valine
Ter – Terminator
Genetic Code Table
[Figure: genetic code table in the codon complex plane (corners A, C, G, T; axes Re = W − S, Im = R − Y), with each codon region labelled by its amino acid or by Ter.]
3. PHASE ANALYSIS OF GENOMIC SIGNALS
The phase of a complex number is a periodic magnitude: the complex number does not change when any multiple of 2π is added to or subtracted from its phase. To remove the ambiguity, the standard mathematical convention restricts the phase of a complex number to the domain (−π, π], which covers exactly once all the possible orientations of the associated vector in the complex plane. For the genomic signals obtained by using the nucleotide complex mapping in the figure, the phases of the nucleotide representations have the values
$$\left\{-\frac{3\pi}{4},\ -\frac{\pi}{4},\ \frac{\pi}{4},\ \frac{3\pi}{4}\right\}\ \text{radians}.$$
The cumulated phase is the sum of the phases of the complex numbers in a sequence, from the first element in the sequence up to the current element:
$$\theta_c = \frac{\pi}{4}\left[3\,(n_G - n_C) + (n_A - n_T)\right],$$
where n_A, n_C, n_G and n_T are the numbers of adenine, cytosine, guanine and thymine nucleotides in the sequence from the first to the current location. The slope s_c of the cumulated phase along the DNA strand at a certain location is:
$$s_c = \frac{\pi}{4}\left[3\,(f_G - f_C) + (f_A - f_T)\right],$$
where f_A, f_C, f_G and f_T are the nucleotide occurrence frequencies.
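The cumulated phase can be computed either by summing the phases sample by sample or from the nucleotide counts. The sketch below does both so that the two can be compared; `cumulated_phase` and `cumulated_phase_from_counts` are hypothetical names, and only the four plain nucleotides A, C, G, T are handled.

```python
import cmath
import math
from collections import Counter

# Quadrantal complex mapping of the nucleotides.
MAPPING = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}


def cumulated_phase(sequence):
    """Running sum of the phases, one value per position in the sequence."""
    total = 0.0
    running = []
    for ch in sequence:
        total += cmath.phase(MAPPING[ch])  # phase lies in (-pi, pi]
        running.append(total)
    return running


def cumulated_phase_from_counts(sequence):
    """Closed form: (pi/4) * [3 (nG - nC) + (nA - nT)]."""
    n = Counter(sequence)
    return math.pi / 4 * (3 * (n["G"] - n["C"]) + (n["A"] - n["T"]))


if __name__ == "__main__":
    seq = "AGGCTTAACG"
    print(cumulated_phase(seq)[-1], cumulated_phase_from_counts(seq))
```

Because the cumulated phase depends only on the counts, it reflects first-order statistics (nucleotide frequencies) and is insensitive to the order of the nucleotides.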
The unwrapped phase is the corrected phase of the elements in a sequence, such that the absolute value of the difference between the phase of each element and the phase of its preceding element is kept smaller than π. The unwrapped phase for a sequence of nucleotides is:
$$\theta_u = \frac{\pi}{2}\,(n_+ - n_-),$$
where n_+ is the number of positive transitions A→G, G→C, C→T, T→A, and n_- is the number of negative transitions A→T, T→C, C→G, G→A.
The exact neutral transitions A↔A, C↔C, G↔G, T↔T and the “on the average” neutral transitions A→C, C→A, G→T, T→G do not contribute to the unwrapped phase. The slope s_u of the variation of the unwrapped phase along a DNA strand is given by the relation:
$$s_u = \frac{\pi}{2}\,(f_+ - f_-),$$
where f_+ and f_- are the frequencies of the positive and negative transitions. A statistically constant slope of the unwrapped phase corresponds to a statistically helicoidal wrapping of the complex representations of the nucleotides along the DNA strand. The step of the helix, i.e., the spatial period over which the helix completes a turn, is:
$$L = \frac{2\pi}{s_u}.$$
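A minimal sketch of the transition-counting form of the unwrapped phase follows. The function names are hypothetical, the initial phase of the first nucleotide is left out (as in the formula), and the helix step is returned as a positive length by taking the absolute value of the slope, which is an assumption made here for convenience.

```python
import math

POSITIVE = {"AG", "GC", "CT", "TA"}   # each adds +pi/2
NEGATIVE = {"AT", "TC", "CG", "GA"}   # each adds -pi/2


def transition_counts(sequence):
    """Return (n_plus, n_minus) for successive nucleotide pairs."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    n_plus = sum(p in POSITIVE for p in pairs)
    n_minus = sum(p in NEGATIVE for p in pairs)
    return n_plus, n_minus


def unwrapped_phase(sequence):
    """theta_u = (pi/2) * (n_plus - n_minus)."""
    n_plus, n_minus = transition_counts(sequence)
    return math.pi / 2 * (n_plus - n_minus)


def unwrapped_slope(sequence):
    """s_u = (pi/2) * (f_plus - f_minus), in rad per transition."""
    n_plus, n_minus = transition_counts(sequence)
    return math.pi / 2 * (n_plus - n_minus) / (len(sequence) - 1)


def helix_step(slope):
    """Spatial period of the helix, L = 2*pi / |s_u|."""
    return 2 * math.pi / abs(slope)


if __name__ == "__main__":
    print(unwrapped_phase("AGCTA"))   # four positive transitions
    print(helix_step(-0.057))         # slope quoted for B. subtilis
```

For the slope of −0.057 rad/bp quoted later for Bacillus subtilis, `helix_step` gives about 110 bp, in line with the value on that slide.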
[Figure: Cumulated and unwrapped phase along the concatenated contigs of Homo sapiens chr. 11. s_u = 0.0667 rad/bp (min: 0.047 rad/bp = 2.7 degree/bp, max: 0.120 rad/bp = 6.9 degree/bp).]
[Figure: Homo sapiens chr 1, all contigs (phase 3, length 228507674 bp).]
Phase Analysis of the Genome of Escherichia coli K12
[Figure: cumulated and unwrapped phase along the genome.]
Origin of replication: 3,923,500 bp. Cumulated phase minimum: 3,923,225 bp.
Terminus of replication: 1,588,800 bp. Cumulated phase maximum: 1,550,413 bp.
Phase Analysis of the Genome of Bacillus subtilis
[Figure: Cumulated phase and unwrapped phase along the chromosome of BS (NC-000964).]
s_u = −0.057 rad/bp, L = 110 bp,
s_c+ = 0.106 rad/bp, l+ ≈ 1.945 Mbp, s_c− = −0.0965 rad/bp, l− ≈ 2.270 Mbp,
$$\Delta f^{+}_{RhK} = 6.75\ \%/\mathrm{bp},\qquad \Delta f^{-}_{RhK} = -6.14\ \%/\mathrm{bp}.$$

Phase Analysis of the Genome of Yersinia pestis
[Figure: cumulated and unwrapped phase along the genome.]
4. REORIENTING DNA SEGMENTS
[Figure: Initial state.]
[Figure: Hypothetic segment reversal.]
[Figure: Joint direction reversal and strand switching.]
[Figure: Effect of segment reversal and strand switching on positive and negative nucleotide-to-nucleotide transitions.]
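The effect summarized in the last figure can be checked numerically. The sketch below (hypothetical helper names) counts the balance n+ − n− of a sequence and compares it with the balance after reversal alone, after complementing alone, and after the joint operation of reversal and strand switching (the reverse complement).

```python
POSITIVE = {"AG", "GC", "CT", "TA"}
NEGATIVE = {"AT", "TC", "CG", "GA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(sequence):
    """Direction reversal only."""
    return sequence[::-1]


def complement(sequence):
    """Strand switching only (A<->T, C<->G), order unchanged."""
    return sequence.translate(COMPLEMENT)


def reorient(sequence):
    """Joint direction reversal and strand switching (reverse complement)."""
    return complement(sequence)[::-1]


def transition_balance(sequence):
    """n_plus - n_minus over successive nucleotide pairs."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    return sum(p in POSITIVE for p in pairs) - sum(p in NEGATIVE for p in pairs)


if __name__ == "__main__":
    seq = "AGCTAAGGTCA"
    for name, f in [("original", str), ("reversed", reverse),
                    ("complemented", complement), ("reoriented", reorient)]:
        print(name, transition_balance(f(seq)))
```

Reversal alone and complementing alone each turn positive transitions into negative ones, so the balance changes sign; applying both restores the original sign, which is why re-oriented segments keep the slope of the unwrapped phase.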
[Figure: Phase analysis of the 4288 concatenated re-oriented coding regions of the complete genome of E. coli.]
[Figure: Phase analysis of the concatenated re-oriented coding regions of the complete genome of Yersinia pestis.]
[Figure: Phase analysis of the complete genome and of the 1839 concatenated re-oriented coding regions of Aeropyrum pernix K.]

Agrobacterium tumefaciens strain C58 Cereon, circular chromosome (accession number AE007869)
[Figure] Complete sequence: 2,841,581 bp. 2722 re-oriented coding regions: 2,539,788 bp.

Agrobacterium tumefaciens strain C58 Cereon, linear chromosome (accession number AE007870)
[Figure] Complete sequence: 2,074,782 bp. 1834 re-oriented coding regions: 1,897,575 bp.

Agrobacterium tumefaciens strain C58, plasmid AT (accession number AE007872)
[Figure] Complete sequence: 542,869 bp. 547 re-oriented coding regions: 465,906 bp.

Agrobacterium tumefaciens strain C58 Cereon, plasmid Ti (accession number AE007871)
[Figure] Complete sequence: 214,233 bp. 198 re-oriented coding regions: 188,667 bp.
5. REPRESENTABILITY OF DATA
When can data be represented properly by a line?
[Figure: a screen of N_x by N_y pixels with the signal span V_y inside one pixel column of width P_x and pixel height P_y.]
$$\frac{V_y}{P_y} = f(P_x)$$
S_x - samples on a screen (N_x pixels)
P_x - pixel width in number of samples
S_y - data span on a screen (N_y pixels)
P_y - pixel height in data units
Pixel size for well-fitted screens
- Horizontal screen size in data samples = sequence length:
$$S_x = L$$
- Horizontal pixel size in data samples (pixel width):
$$P_x = \frac{S_x}{N_x}$$
- Vertical screen size in data units = signal span in the screen:
$$S_y = \max_{i \in I_S}\,(s[i]) - \min_{i \in I_S}\,(s[i]),\qquad I_S = \{1, \ldots, S_x\}$$
- Vertical pixel size in data units (pixel height):
$$P_y = \frac{S_y}{N_y}$$
Data scattering ratio
- Signal span for a pixel:
$$V_y(h) = \max_{i \in I_{Ph}}\,(s[i]) - \min_{i \in I_{Ph}}\,(s[i]),\qquad I_{Ph} = \{(h-1)P_x + 1, \ldots, hP_x\},\quad h = 1, \ldots, N_x$$
- Data scattering ratio for pixel h, h = 1, …, N_x:
$$Q(h) = \frac{V_y(h)}{P_y}$$
- Average vertical signal span on a pixel:
$$\tilde V_y = \operatorname*{mean}_{h = 1, \ldots, N_x}\bigl(V_y(h)\bigr)$$
- Average data scattering ratio for a pixel:
$$\tilde Q = \frac{\tilde V_y}{P_y}$$
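The definitions above translate directly into code for a single well-fitted screen. This is a minimal sketch with a hypothetical function name; it assumes that the number of samples is a multiple of the number of pixel columns N_x and that the signal is not constant (otherwise the pixel height P_y is zero and the ratio is undefined).

```python
def average_scattering_ratio(signal, n_x, n_y):
    """Average data scattering ratio Q~ = V~_y / P_y for one well-fitted screen."""
    s_x = len(signal)                       # horizontal screen size in samples
    if s_x % n_x != 0:
        raise ValueError("len(signal) must be a multiple of n_x in this sketch")
    p_x = s_x // n_x                        # pixel width in samples
    s_y = max(signal) - min(signal)         # signal span in the screen
    p_y = s_y / n_y                         # pixel height in data units
    spans = []
    for h in range(n_x):                    # one span per pixel column
        column = signal[h * p_x:(h + 1) * p_x]
        spans.append(max(column) - min(column))
    v_mean = sum(spans) / n_x               # average vertical span on a pixel
    return v_mean / p_y


if __name__ == "__main__":
    ramp = list(range(16))                  # a monotonous test signal
    print(average_scattering_ratio(ramp, n_x=4, n_y=4))
```

A ratio of about one or less means that the samples in a pixel column fit, on average, within one pixel height, so the data can be drawn as a line; much larger ratios indicate that a column hides a spread of values.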
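Representability Diagram
Average data scattering vs pixel width
$$\tilde Q = \frac{\tilde V_y}{P_y} = f(P_x)$$
- Initial pixel width:
$$P_x^{(1)} = 1$$
- Step k of doubling the pixel size:
$$P_x^{(k)} = 2^{k-1},\qquad k = 1, \ldots, k_{\max}$$
- Initial screen size in data samples:
$$S_x^{(1)} = N_x$$
- Screen size at step k:
$$S_x^{(k)} = 2^{k-1} N_x,\qquad k = 1, \ldots, k_{\max}$$
$$k_{\max} = \log_2\frac{L}{N_x},\qquad S_x^{(k_{\max})} = L,\qquad L = 2^m,\qquad N_x = 2^s$$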
Monotonous Signals
Representability Best Case
- Size of well-fitted screens:
$$S_y^{(k)}(j) = s\bigl[j\,S_x^{(k)}\bigr] - s\bigl[(j-1)\,S_x^{(k)} + 1\bigr]$$
- Average size of pixels:
$$\tilde P_y^{(k)} = \frac{\tilde S_y^{(k)}}{N_y}$$
- Signal span per pixel:
$$V_y^{(k)}(h) = s\bigl[h\,P_x^{(k)}\bigr] - s\bigl[(h-1)\,P_x^{(k)} + 1\bigr]$$
Monotonous Signals
Representability Diagram
Average data scattering ratio:
$$\tilde Q^{(k)} = \frac{N_y}{N_x}\cdot\frac{s[L] - s[1] - \bigl(N_P^{(k)} - 1\bigr)\operatorname{mean}\bigl(d_{\downarrow P_x^{(k)}}\bigr)}{s[L] - s[1] - \bigl(N_S^{(k)} - 1\bigr)\operatorname{mean}\bigl(d_{\downarrow S_x^{(k)}}\bigr)}$$
where $N_P^{(k)}$ is the total number of pixels needed to represent the input sequence s[i], i = 1, ..., L, for a horizontal pixel size $P_x^{(k)} = 2^{k-1}$; $N_S^{(k)} = N_P^{(k)} / N_x$ is the total number of screens necessary to represent the data at resolution k; and $\operatorname{mean}(d_{\downarrow D})$ is the average of the absolute values of the signal variation between successive samples, d[i] = s[i+1] − s[i], down-sampled with step D.
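If the digital signal can be considered as resulting from the sampling of a continuous and differentiable function, and if both $N_P^{(k)}$ and $N_S^{(k)}$ are large enough, then:
$$\operatorname{mean}\bigl(d_{\downarrow P_x^{(k)}}\bigr) \approx \operatorname{mean}\bigl(d_{\downarrow S_x^{(k)}}\bigr) \approx \frac{s[L] - s[1]}{L - 1}$$
so that the representability characteristic is approximately the same for any monotonous signal:
$$\tilde Q^{(k)} = \frac{\tilde V_y^{(k)}}{\tilde P_y^{(k)}} = \frac{N_y}{N_x}\cdot\frac{P_x^{(k)} - 1}{P_x^{(k)} - \dfrac{1}{N_x}}$$
Asymptotic value:
$$\tilde Q^{(k)} \xrightarrow[\;2^{k-1} \gg 1\;]{} \frac{N_y}{N_x}$$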
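The step from the general expression of the average data scattering ratio to the closed form for monotonous signals can be written out as follows. This is a sketch under the additional assumption that L is a multiple of $N_x P_x^{(k)}$, so that $N_P^{(k)} = L / P_x^{(k)}$ and $N_S^{(k)} = L / (N_x P_x^{(k)})$; the shorthand Δ is introduced here only for brevity.

```latex
\begin{align*}
\Delta &:= s[L] - s[1], \qquad
\operatorname{mean}\bigl(d_{\downarrow P_x^{(k)}}\bigr) \approx
\operatorname{mean}\bigl(d_{\downarrow S_x^{(k)}}\bigr) \approx \frac{\Delta}{L-1}, \\
\Delta - \bigl(N_P^{(k)} - 1\bigr)\frac{\Delta}{L-1}
  &= \Delta\,\frac{L - N_P^{(k)}}{L-1}, \qquad
\Delta - \bigl(N_S^{(k)} - 1\bigr)\frac{\Delta}{L-1}
  = \Delta\,\frac{L - N_S^{(k)}}{L-1}, \\
\tilde Q^{(k)} &\approx \frac{N_y}{N_x}\cdot\frac{L - N_P^{(k)}}{L - N_S^{(k)}}
  = \frac{N_y}{N_x}\cdot\frac{1 - 1/P_x^{(k)}}{1 - 1/\bigl(N_x P_x^{(k)}\bigr)}
  = \frac{N_y}{N_x}\cdot\frac{P_x^{(k)} - 1}{P_x^{(k)} - 1/N_x}.
\end{align*}
```

For $P_x^{(k)} = 1$ the ratio is zero (each pixel holds one sample), and for large pixel widths it tends to $N_y / N_x$.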
[Figure: Representability diagram for monotonous signals.]
[Figure: Q = f(P_x) for monotonous signals.]
[Figure: 3D representability diagram for monotonous signals.]
Uniformly distributed random signals
Representability Practical Worst Case
$$\tilde Q^{(k)} = \frac{\tilde V_y^{(k)}}{\tilde P_y^{(k)}} = N_y\,\frac{P_x^{(k)} - 1}{P_x^{(k)} + 1}$$
Asymptotic value:
$$\tilde Q^{(k)} \xrightarrow[\;2^{k-1} \gg 1\;]{} N_y$$
The expectation of ordered uniformly distributed random variables
Let X_h, h = 1, ..., n, be n statistically independent random variables uniformly distributed on the interval [0, 1], and {x_h : h = 1, ..., n} the values of one realization.
Find the expectation of the random variable X whose value x is the k-th element of {x_h}, ordered starting with the smallest.
The cumulative distribution function of X, defined as the probability that ξ, the k-th of the ordered random variables, is less than or equal to x, can be computed as the sum of the probabilities of the disjoint cases:
$$P(x) = P[\xi \le x] = \sum_{h=k}^{n} \binom{n}{h}\, x^{h} (1 - x)^{n-h}$$
where h variables are smaller than or equal to x, i.e., have values in the interval [0, x], while the other n − h variables exceed x, i.e., have values in the interval (x, 1], with h being at least k.
The expectation of X results from the chain of relations:
$$\operatorname{mean}(x) = \int_0^1 x\,\frac{dP(x)}{dx}\,dx = 1 - \int_0^1 P(x)\,dx = 1 - \sum_{h=k}^{n}\binom{n}{h} B(h+1,\, n-h+1) = 1 - \frac{n-k+1}{n+1} = \frac{k}{n+1},$$
where B is Euler’s beta function.
The average of the difference between the maximum and the minimum in the set of n random variables is given by:
$$\operatorname{mean}(x_{\max} - x_{\min}) = \bar x_{\max} - \bar x_{\min} = \frac{n}{n+1} - \frac{1}{n+1} = \frac{n-1}{n+1}.$$
[Figure: Representability diagram for a uniformly distributed random signal.]
[Figure: 3D representability diagram for a uniformly distributed random signal.]
[Figure: Representability of sine signals.]

Representability of Genomic Phase Signals
[Figure: Contig NT 004424 of Homo sapiens chromosome 1.]
[Figure: 3D representability diagram of the unwrapped phase of the Homo sapiens chr. 1 contig NT 004424.]
[Figure: Representability diagram for the cumulated and unwrapped phase of Homo sapiens chr. 21, contig NT 011515.]
[Figure: Representability diagram of the unwrapped phase for the first 1048576 bp from the circular chromosome of Agrobacterium tumefaciens (AE008688).]
[Figure: 3D representability diagram of the unwrapped phase for the first 1048576 bp from the circular chromosome of Agrobacterium tumefaciens (AE008688).]
6. CONCLUSIONS
• DNA sequences can be converted into genomic signals by using a nucleotide complex representation derived from the nucleotide tetrahedral representation.
• There are large-scale second-order statistical regularities that generalize Chargaff’s law: the difference between the frequency of occurrence of positive nucleotide-to-nucleotide transitions (A→G, G→C, C→T, T→A) and that of negative transitions (the opposite ones) along a strand of DNA tends to be small, constant, and taxon- and chromosome-specific.
• The reorientation of DNA segments involves the simultaneous reversal of the order and the complementing of the nucleotides (A with T and C with G) in the inverse coding regions.
• The regularity shown by the nucleotide sequences obtained after concatenating the reoriented coding regions suggests the existence of a putative primary ancestral genomic material having a quite uniform large-scale statistical structure. This feature is observed only in the chromosomes, and is not found in the plasmids.
• Combining DNA segments of opposite orientation to generate certain slopes of the phase, i.e., certain densities of the first- and second-order repartition of nucleotides, has an important functional role at the level of the chromosome, most probably linked to the crossing-over / recombination process, the identification of interacting regions of chromosomes, and the separation of the species.
• The cumulated phase and the unwrapped phase can be represented adequately as simple graphic lines at very small and very large scales, while for medium scales (thousands to tens of thousands of base pairs) statistical descriptions have to be used.