Methods of identification and localization of the DNA coding sequences

Jacek Leluk
Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University


Identification of coding/non-coding sequences in genome

Measures dependent on a model of coding DNA
  - based on oligonucleotide counts: Codon usage, Amino acid usage, Codon preference, Hexamer usage
  - based on base compositional bias between codon positions: Codon prototype
  - based on dependence between nucleotide positions: Markov models

Measures independent of a model of coding DNA
  - based on base compositional bias between codon positions: Position asymmetry
  - based on periodic correlation between nucleotide positions: Periodic asymmetry index, Average mutual information, Fourier spectrum


The notation used

$S$ – DNA sequence of length $l$; $S_i$ ($i = 1, \ldots, l$) denotes the individual nucleotides

$C$ – sequence of codons; $C_j$ – the codon occupying position $j$ in the sequence

$C^i$ – the sequence of codons that results when the grouping of nucleotides from sequence $S$ into codons starts at nucleotide $i$

$C^i_j$ – the codon occupying position $j$ in the decomposition $i$ of the sequence $S$

$C_j[k]$ – the nucleotide occupying position $k$ in the codon $C_j$


Examples
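As an illustration of the notation, the following minimal Python sketch splits a sequence into codons for each of the three possible starting nucleotides. The function name `decompose` and the sample sequence are hypothetical and chosen only for this illustration; incomplete trailing triplets are dropped, which is an assumption.

```python
def decompose(seq, i):
    """Return the codons obtained when grouping starts at nucleotide i (1-based).

    Trailing nucleotides that do not fill a whole codon are ignored (assumption).
    """
    start = i - 1
    return [seq[j:j + 3] for j in range(start, len(seq) - 2, 3)]


S = "ATGGCCAAGT"  # hypothetical sequence, length l = 10
for i in (1, 2, 3):
    print(i, decompose(S, i))
```

The codon at position j of decomposition i is then `decompose(S, i)[j - 1]`, and the nucleotide at position k of that codon is its character with index `k - 1`.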


The notation used

Measures based on a model of coding DNA

$P^i(S)$ – probability of the sequence of nucleotides $S$, given that $S$ is coding in frame $i$ ($i = 1, 2, 3$)

$P_0(S)$ – probability of $S$, given that $S$ is non-coding DNA (randomly generated)

Likelihood ratio

$$\frac{P^i(S)}{P_0(S)}$$

The ratio of the probability of finding the sequence of nucleotides $S$ if $S$ is coding in frame $i$ to the probability of finding the sequence of nucleotides $S$ if $S$ is non-coding


The notation used

Measures based on a model of coding DNA

Log-likelihood ratio

$$LP^i(S) = \log \frac{P^i(S)}{P_0(S)}$$

$LP^i(S)$ – coding potential of sequence $S$ in frame $i$ given the model of coding DNA; $P^i(S)$ is the probability of $S$ if it is coding in frame $i$, and $P_0(S)$ is its probability if it is non-coding

$LP^i(S) > 0$ – the probability of the sequence of nucleotides $S$ is higher assuming that $S$ is coding in frame $i$ than assuming that $S$ is non-coding

$LP^i(S) < 0$ – the probability of $S$ is higher assuming that $S$ is non-coding than assuming that $S$ is coding in frame $i$

The log-likelihood ratio is computed for all three possible frames. If the sequence is coding, the log-likelihood ratio will be larger for one of the frames than for the other two.


Codon usage

Measures based on a model of coding DNA
Measures based on oligonucleotide counts

$F(C)$ – frequency (probability) of codon $C$ in the genes of the considered species (the codon usage table)

$P(C) = F(C_1)\, F(C_2) \cdots F(C_m)$ – probability of finding the sequence of codons $C = C_1 C_2 \ldots C_m$, knowing that $C$ codes for a protein

$P_0(C) = (1/64)^m$ – probability of finding the same sequence of $m$ codons if it is non-coding (all 64 triplets equally likely)
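The codon usage measure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the tiny training set and the pseudocount are hypothetical, sequences are assumed to contain only the letters A, C, G and T, and the score is the log of the codon probability product divided by (1/64)^m, evaluated for each of the three frames.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 triplets


def codon_usage(training_genes, pseudocount=1.0):
    """Estimate a codon usage table from in-frame coding sequences.

    The pseudocount (an assumption of this sketch) keeps unseen codons
    from getting probability zero.
    """
    counts = Counter()
    for gene in training_genes:
        for j in range(0, len(gene) - 2, 3):
            counts[gene[j:j + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def log_likelihood_ratio(seq, usage, frame):
    """Sum of log(F(codon) / (1/64)) over the codons of the given frame (1, 2 or 3)."""
    score = 0.0
    for j in range(frame - 1, len(seq) - 2, 3):
        score += math.log(usage[seq[j:j + 3]] * 64)
    return score


# Hypothetical toy data, for illustration only.
usage = codon_usage(["ATGGCTGCTGCTTAA"])
for frame in (1, 2, 3):
    print(frame, log_likelihood_ratio("ATGGCTGCT", usage, frame))
```

A positive score for one frame, clearly above the other two, is the pattern expected for a coding sequence read in its correct frame.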


Amino acid usage

Measures based on a model of coding DNA
Measures based on oligonucleotide counts

$FA(C)$ – the observed probability of the amino acid encoded by codon $C$ in the existing proteins

This value can be derived directly from a codon usage table $F$ by summing the probabilities of synonymous codons:

$$FA(C) = \sum_{C' \equiv C} F(C')$$

where $C' \equiv C$ means that $C'$ is synonymous to $C$

$P(C) = FA(C_1)\, FA(C_2) \cdots FA(C_m)$ – probability of finding the amino acid sequence that results from translating the sequence in the coding open reading frame

$FA_0(C) = n_C / 64$ – frequency of the "non-coding amino acids"; $n_C$ – number of codons synonymous to $C$ (including $C$ itself)
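A minimal Python sketch of this derivation is given below. It assumes the standard genetic code, treats the three stop codons as one synonymous family (an assumption of this sketch), and uses hypothetical function names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def amino_acid_usage(codon_usage):
    """Amino acid probability per codon: sum of the usage of all synonymous codons."""
    per_amino_acid = {}
    for codon, freq in codon_usage.items():
        aa = GENETIC_CODE[codon]
        per_amino_acid[aa] = per_amino_acid.get(aa, 0.0) + freq
    return {codon: per_amino_acid[GENETIC_CODE[codon]] for codon in codon_usage}


def noncoding_amino_acid_usage():
    """Non-coding frequency n_C / 64, where n_C is the size of the synonymous family."""
    family_size = {}
    for aa in GENETIC_CODE.values():
        family_size[aa] = family_size.get(aa, 0) + 1
    return {codon: family_size[aa] / 64 for codon, aa in GENETIC_CODE.items()}


fa0 = noncoding_amino_acid_usage()
print(fa0["CTG"], fa0["ATG"])  # leucine has six codons, methionine has one
```

With a uniform codon usage table (every codon at 1/64), `amino_acid_usage` reproduces the non-coding frequencies, which is a quick consistency check of the two definitions.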


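Codon preference

Measures based on a model of coding DNA
Measures based on oligonucleotide counts

$FP(C) = F(C) / FA(C)$ – relative probability in coding regions of codon $C$ among codons synonymous to $C$, where $F(C)$ is the codon usage and $FA(C)$ the summed usage of all codons synonymous to $C$

$P^i(S) = FP(C^i_1)\, FP(C^i_2) \cdots FP(C^i_m)$ – probability of the sequence $S$ encoding the particular amino acid sequence in frame $i$

$F_0(C) = 1/64$ – probability of codon $C$ in non-coding DNA

In non-coding regions there is no preference between "synonymous codons". Then, with $n_C$ the number of codons synonymous to $C$:

$$FP_0(C) = \frac{F_0(C)}{FA_0(C)} = \frac{1/64}{n_C/64} = \frac{1}{n_C}$$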
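The relative probability of a codon within its synonymous family can be sketched in Python as below. The standard genetic code is assumed, stop codons are treated as one family (an assumption), and `codon_preference` is a hypothetical name. Applying the function to a uniform table illustrates the non-coding case.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_preference(codon_usage):
    """Usage of each codon divided by the total usage of its synonymous family.

    Assumes every family has non-zero total usage.
    """
    family_total = {}
    for codon, freq in codon_usage.items():
        aa = GENETIC_CODE[codon]
        family_total[aa] = family_total.get(aa, 0.0) + freq
    return {c: f / family_total[GENETIC_CODE[c]] for c, f in codon_usage.items()}


# Non-coding model: every codon has probability 1/64.
uniform = {codon: 1 / 64 for codon in GENETIC_CODE}
preference = codon_preference(uniform)
print(preference["CTG"], preference["ATG"])
```

In the uniform case each codon's preference equals one over the size of its family, so a leucine codon scores 1/6 and the single methionine codon scores 1.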


Hexamer usage

Measures based on a model of coding DNA
Measures based on oligonucleotide counts

This approach is based on the hexamer usage table $F(h_i)$ for $i = 1, 2, 3, \ldots, 4096$ (there are $4^6 = 4096$ possible hexamers). In this case there are six reading frames to be analyzed.

The probability of a sequence of hexanucleotides, $H = H_1 H_2 \cdots H_m$, in the coding frame of a coding sequence is

$$P(H) = F(H_1)\, F(H_2) \cdots F(H_m)$$
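A hedged Python sketch of hexamer scoring over the six frames follows. The names, the pseudocount and the toy training data are hypothetical. The comparison against a uniform non-coding model (1/4096 per hexamer) is an assumption made by analogy with the codon usage measure and is not stated in the slides.

```python
import math
from collections import Counter
from itertools import product

HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]  # 4096 entries


def hexamer_usage(training_seqs, pseudocount=1.0):
    """Estimate a hexamer usage table from in-frame coding sequences."""
    counts = Counter()
    for s in training_seqs:
        for j in range(0, len(s) - 5, 6):
            counts[s[j:j + 6]] += 1
    total = sum(counts[h] for h in HEXAMERS) + pseudocount * len(HEXAMERS)
    return {h: (counts[h] + pseudocount) / total for h in HEXAMERS}


def hexamer_score(seq, usage, frame):
    """Log-ratio of the hexamer product to a uniform model, for frame 1..6."""
    score = 0.0
    for j in range(frame - 1, len(seq) - 5, 6):
        score += math.log(usage[seq[j:j + 6]] * 4096)
    return score


# Hypothetical toy data, for illustration only.
table = hexamer_usage(["ATGGCTATGGCTATGGCT"])
frame_scores = {i: hexamer_score("ATGGCTATGGCT", table, i) for i in range(1, 7)}
print(max(frame_scores, key=frame_scores.get))
```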


Codon prototype

Measures based on a model of coding DNA
Measures based on base compositional bias between codon positions

Let $f(b,r)$ be the probability of nucleotide $b$ at codon position $r$, as estimated from known coding regions. Then:

$$P^1(S) = f(S_1,1)\, f(S_2,2)\, f(S_3,3)\, f(S_4,1) \cdots$$

$P^2(S)$ and $P^3(S)$ are computed in a similar way.

$F(c) = f(c[1],1)\, f(c[2],2)\, f(c[3],3)$ is the probability of codon $c$ in coding regions, assuming independence between adjacent nucleotides

$F_0(c) = 1/64$ – probability for all triplets $c$ in non-coding DNA

Example:
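A small Python sketch of the codon prototype measure, under stated assumptions: the position-specific probabilities are estimated from a hypothetical toy training set with a pseudocount of 1, the function names are illustrative, and each frame is scored as a log-ratio against 1/4 per nucleotide (equivalent to 1/64 per triplet).

```python
import math


def position_frequencies(coding_seqs):
    """Estimate f(b, r): probability of nucleotide b at codon position r."""
    counts = {(b, r): 1.0 for b in "ACGT" for r in (1, 2, 3)}  # pseudocount of 1
    for s in coding_seqs:
        for k, b in enumerate(s):
            counts[(b, k % 3 + 1)] += 1
    f = {}
    for r in (1, 2, 3):
        total = sum(counts[(b, r)] for b in "ACGT")
        for b in "ACGT":
            f[(b, r)] = counts[(b, r)] / total
    return f


def prototype_score(seq, f, frame):
    """Log-ratio score when nucleotide number `frame` is taken as codon position 1."""
    offset = frame - 1
    return sum(math.log(4 * f[(b, (k - offset) % 3 + 1)])
               for k, b in enumerate(seq) if k >= offset)


# Hypothetical toy data, for illustration only.
f = position_frequencies(["GCAGCAGCAGCA"])
for frame in (1, 2, 3):
    print(frame, prototype_score("GCAGCAGCA", f, frame))
```

Only the frame in which G, C and A fall on codon positions 1, 2 and 3 agrees with the training data, so that frame receives the highest score.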


Markov Models

Measures based on a model of coding DNA
Measures based on dependence between nucleotide positions

In Markov models, the probability of a nucleotide at a particular codon position depends on the nucleotide(s) preceding it.

The Markov model of order 1 is the simplest of the Markov models: the probability of a nucleotide depends only on the preceding nucleotide. In this case, the model of coding DNA is based on the probabilities of the four nucleotides at each codon position, conditional on the nucleotide occurring at the preceding codon position (technically called the transition probabilities). Thus, instead of one single matrix, as in Codon Prototype, three 4×4 matrices (the transition matrices) $F_1$, $F_2$ and $F_3$ are required, each one corresponding to a different codon position.

Markov models of order 1 to 5 are used.
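The order-1 model described above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the names, the pseudocount and the training sequence are hypothetical, and the probability of the very first nucleotide is ignored so that all three frames are scored over the same transitions.

```python
import math

BASES = "ACGT"


def transition_matrices(coding_seqs, pseudocount=1.0):
    """Estimate three 4x4 matrices, one per codon position r.

    matrices[r][(a, b)] is the probability of b at codon position r
    given that nucleotide a precedes it.
    """
    counts = {r: {(a, b): pseudocount for a in BASES for b in BASES}
              for r in (1, 2, 3)}
    for s in coding_seqs:
        for k in range(1, len(s)):
            counts[k % 3 + 1][(s[k - 1], s[k])] += 1
    matrices = {}
    for r in (1, 2, 3):
        matrices[r] = {}
        for a in BASES:
            row_total = sum(counts[r][(a, b)] for b in BASES)
            for b in BASES:
                matrices[r][(a, b)] = counts[r][(a, b)] / row_total
    return matrices


def markov_log_probability(seq, matrices, frame):
    """Log-probability of the transitions in seq if nucleotide `frame` is codon position 1."""
    offset = frame - 1
    total = 0.0
    for k in range(1, len(seq)):
        r = (k - offset) % 3 + 1  # codon position of nucleotide k in this frame
        total += math.log(matrices[r][(seq[k - 1], seq[k])])
    return total


# Hypothetical toy data, for illustration only.
model = transition_matrices(["ATGATGATGATG"])
for frame in (1, 2, 3):
    print(frame, markov_log_probability("ATGATGATG", model, frame))
```

A higher-order model would condition on the preceding two to five nucleotides instead of one, at the cost of correspondingly larger tables.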


Position asymmetry

Measures independent of a model of coding DNA
Measures based on base compositional bias between codon positions

The goal is to measure how asymmetric the distribution of nucleotides is at the three triplet positions in the sequence.

$f(b,r)$ – the relative frequency of nucleotide $b$ at codon position $r$ in the sequence $S$, as calculated from one of the three decompositions of $S$ into codons (any of them)

$\mu(b) = \frac{1}{3} \sum_{r=1}^{3} f(b,r)$ – average frequency of nucleotide $b$ at the three codon positions

$asym(b) = \sum_{r=1}^{3} \left( f(b,r) - \mu(b) \right)^2$ – asymmetry in the distribution of nucleotide $b$


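Position asymmetry (continued)

Measures independent of a model of coding DNA
Measures based on base compositional bias between codon positions

Position Asymmetry of the sequence, obtained by summing the asymmetry $asym(b)$ of the four nucleotides:

$$PA(S) = asym(A) + asym(C) + asym(G) + asym(T)$$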
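A minimal Python sketch of the measure is shown below. It assumes that the asymmetry of a nucleotide is the sum of squared deviations of its three position-specific frequencies from their mean, and that the sequence score is the sum over the four nucleotides. The function name is hypothetical, the decomposition starting at the first nucleotide is used, and the sequence must contain at least one full triplet.

```python
def position_asymmetry(seq):
    """Position asymmetry of seq (letters A, C, G, T; length >= 3)."""
    n_codons = len(seq) // 3
    # f[(b, r)]: relative frequency of nucleotide b at codon position r
    f = {(b, r): 0.0 for b in "ACGT" for r in (1, 2, 3)}
    for k in range(3 * n_codons):
        f[(seq[k], k % 3 + 1)] += 1 / n_codons
    total = 0.0
    for b in "ACGT":
        mu = sum(f[(b, r)] for r in (1, 2, 3)) / 3  # mean over the three positions
        total += sum((f[(b, r)] - mu) ** 2 for r in (1, 2, 3))
    return total


print(position_asymmetry("AAAAAA"))  # no positional bias
print(position_asymmetry("ACGACG"))  # each nucleotide confined to one position
```

No model of coding DNA is needed here: the score depends only on the sequence being analyzed.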


Periodic asymmetry index

Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions

This approach considers three distinct probabilities:
- the probability $P_{in}$ of finding pairs of the same nucleotide at distances k = 2, 5, 8, ...
- the probability $P^1_{out}$ of finding pairs of the same nucleotide at distances k = 0, 3, 6, ...
- the probability $P^2_{out}$ of finding pairs of the same nucleotide at distances k = 1, 4, 7, ...

The tendency to cluster homogeneous di-nucleotides in a 3-base periodic pattern can be measured by the Periodic Asymmetry Index, which is computed from these three probabilities.
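The three probabilities can be estimated with a short Python sketch. Several points are assumptions of this sketch and not taken from the slides: distance k is read as the number of nucleotides lying between the two members of a pair (so k = 2 pairs positions j and j + 3), the probabilities are pooled over all distances below a hypothetical cutoff `max_k`, and the function names are illustrative. The index itself is not computed here.

```python
def same_nucleotide_probability(seq, distances):
    """Fraction of pairs with equal nucleotides, pooled over the given distances."""
    same = 0
    total = 0
    for k in distances:
        step = k + 1  # k nucleotides lie between the two members of the pair
        for j in range(len(seq) - step):
            total += 1
            same += seq[j] == seq[j + step]
    return same / total if total else 0.0


def periodic_probabilities(seq, max_k=30):
    """Return the in-frame and the two out-of-frame probabilities."""
    p_in = same_nucleotide_probability(seq, range(2, max_k, 3))
    p_out1 = same_nucleotide_probability(seq, range(0, max_k, 3))
    p_out2 = same_nucleotide_probability(seq, range(1, max_k, 3))
    return p_in, p_out1, p_out2


print(periodic_probabilities("ACG" * 10))
```

For a perfectly 3-periodic sequence all identical pairs fall at in-frame distances, which is the extreme case of the clustering the index is meant to detect.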


Average mutual information

Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions

$n_{ij}(k)$ – absolute number of times when nucleotide $i$ is followed by nucleotide $j$ at a distance of $k$ positions

$p_{ij}(k)$ – probability that nucleotide $i$ is followed by nucleotide $j$ at a distance of $k$ positions

Correlation between nucleotides $i$ and $j$ at a distance of $k$ positions:

$$C_{ij}(k) = p_{ij}(k) - p_i\, p_j$$

where $p_i$ and $p_j$ are the probabilities of occurrence of nucleotides $i$ and $j$ in sequence $S$


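Average mutual information (continued)

Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions

Mutual Information function

$$I(k) = \sum_{i,j} p_{ij}(k) \log_2 \frac{p_{ij}(k)}{p_i\, p_j}$$

quantifies the amount of information that can be obtained from one nucleotide about another nucleotide at a distance $k$; $p_{ij}(k)$ is the probability that nucleotide $i$ is followed by nucleotide $j$ at distance $k$, and $p_i$, $p_j$ are the probabilities of the individual nucleotides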


Average mutual information (continued)

Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions

Average Mutual Information

$I_{in}$ – the in-frame mutual information at distances k = 2, 5, 8, ...

$I_{out}$ – the out-frame mutual information at distances k = 0, 1, 3, 4, ...
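A hedged Python sketch of the mutual information function and of its in-frame and out-frame averages follows. Assumptions of this sketch: distance k is read as the number of nucleotides between the two members of a pair (k = 2 pairs positions j and j + 3), pair probabilities are plain relative frequencies, the averages are taken over all distances below a hypothetical cutoff `max_k`, and the function names are illustrative. How the two averages are combined into the final measure is not shown.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Mutual information (in bits) between nucleotides separated by k positions."""
    step = k + 1
    pairs = Counter((seq[j], seq[j + step]) for j in range(len(seq) - step))
    n_pairs = sum(pairs.values())
    singles = Counter(seq)
    info = 0.0
    for (a, b), n in pairs.items():
        p_ab = n / n_pairs
        p_a = singles[a] / len(seq)
        p_b = singles[b] / len(seq)
        info += p_ab * math.log2(p_ab / (p_a * p_b))
    return info


def in_and_out_frame_information(seq, max_k=30):
    """Average the function over in-frame (k = 2, 5, 8, ...) and all other distances."""
    in_ks = [k for k in range(max_k) if k % 3 == 2]
    out_ks = [k for k in range(max_k) if k % 3 != 2]
    i_in = sum(mutual_information(seq, k) for k in in_ks) / len(in_ks)
    i_out = sum(mutual_information(seq, k) for k in out_ks) / len(out_ks)
    return i_in, i_out


print(mutual_information("ACG" * 20, 2))
print(in_and_out_frame_information("ATGGCTGCTAAGGCTGCTAAGGAAGCTTAA"))
```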


Fourier analysis

Measures independent of a model of coding DNA
Measures based on periodic correlation between nucleotide positions

DNA coding regions reveal the characteristic periodicity of 3 as a distinct peak at frequency $f = 1/3$. No such "peak" is apparent for non-coding sequences.

The partial spectrum of a DNA sequence $S$ of length $l$ corresponding to nucleotide $b$ is defined as:

$$S_b(f) = \frac{1}{l^2} \left| \sum_{j=1}^{l} U_b(S_j)\, e^{2 \pi i f j} \right|^2$$

where $U_b(S_j) = 1$ if $S_j = b$ and 0 otherwise, and $f$ is the discrete frequency, $f = k/l$, for $k = 1, 2, \ldots, l/2$
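The period-3 peak can be illustrated with a small Python sketch. The normalization by the squared sequence length and the summation of the four partial spectra into a total spectrum are assumptions of this sketch, and the function names are hypothetical. The direct sum is used for clarity; a fast Fourier transform would be preferable for long sequences.

```python
import cmath


def partial_spectrum(seq, base, k):
    """Partial spectrum of `base` at the discrete frequency f = k / len(seq)."""
    length = len(seq)
    f = k / length
    total = sum(cmath.exp(2j * cmath.pi * f * j)
                for j, s in enumerate(seq, start=1) if s == base)
    return abs(total) ** 2 / length ** 2


def total_spectrum(seq, k):
    """Sum of the partial spectra of the four nucleotides (assumed convention)."""
    return sum(partial_spectrum(seq, b, k) for b in "ACGT")


seq = "ACG" * 12            # hypothetical, perfectly 3-periodic sequence
peak_k = len(seq) // 3      # frequency f = 1/3
print(total_spectrum(seq, peak_k), total_spectrum(seq, 5))
```

In practice the value at f = 1/3 is compared with the average of the spectrum over the other frequencies, so that a peak stands out relative to the background.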


Summary of results


List of Gene Identification programs and Internet access (part 1)


List of Gene Identification programs and Internet access (part 2)


Multiple alignment of 52 Bowman-Birk type proteinase inhibitors, prepared with the genetic semihomology algorithm. Conserved and typical residues are shown in white letters on a black background. A gray background indicates semihomologous amino acids.

Alignment position markers: 3, 10, 20, 30, 40, 50, 60

P01055    ESSKPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CKP
P01057    ESSKPCCDECACTKSIPPQCRCTDVRLNSCHSACSSCVCTFSIPAQCV-CVDMKDFCYAP-CKS
P01056    QSSKPCCBHCACTKSIPPQCRCTDLRLDSCHSACKSCICTLSIPAQCV-CBBIBDFCYEP-CKS
P01058    ESSKPCCDQCSCTKSMPPKCRCSDIRLNSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
P01059    ESSKPCCDLCTCTKSIPPQCHCNDMRLNSCHSACKSCICALSEPAQCF-CVDTTDFCYKS-CHN
P01063    ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
P17734    QSSKPCCRQCACTKSIPPQCRCSQVRLNSCHSACKSCACTFSIPAQCF-CGBIBBFCYKP-CKS
P81483    -SSKPCCBHCACTKSIPPQCRCSBLRLNSCHSECKGCICTFSIPAQCI-CTDTNNFCYEP-CKS
P81484    -SSKPCCBHCACTKSIPPQCRCSBLRLNSCHSECKGCICTFSIPAQCI-CTDTNNFCYEP-CKS
P16343    ESSKPCCSSC-CTRSRPPQCQCTDVRLNSCHSACKSCMCTFSDPGMCS-CLDVTDFCYKP-CKS
P01064    EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
P82469    -SSGPCCDRCRCTKSEPPQCQCQDVRLNSCHSACEACVCSHSMPGLCS-CLDITHFCHEP-CKS
P01061    ESSHPCCDLCLCTKSIPPQCQCADIRLDSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
P01062    ESSEPCCDSCDCTKSIPPECHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
P01060    QSSPPCCBICVCTASIPPQCVCTBIRLBSCHSACKSCMCTRSMPGKCR-CLBTTBYCYKS-CKS
1BBI:     ESSKPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CKP
1D6R:I    ---KPCCDQCACTKSNPPQCRCSDMRLNSCHSACKSCICALSYPAQCF-CVDITDFCYEP-CK-
1DF9:C    ESSEPCCDSCDCTKSIPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
1PI2:     EYSKPCCDLCMCTRSMPPQCSCED-RINSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
1PBI:A    DVKSACCDTCLCTKSNPPTCRCVDVGET-CHSACLSCICAYSNPPKCQ-CFDTQKFCYKQ-CHN
AAB4719   ESSKPCCDQCTCTKSIPPQCRCTDVRLNSCHSACSSCVCTFSIPAQCV-CVDMKDFCYAP-CKS
TISYC2    ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
JC2225    ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
TIZB2     ESSKPCCDQC-CTKSMPPKCRCSDIRLDSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
JC2073    ESSKPCCDECKCTKSEPPQCQCVDTRLESCHSACKLCLCALSFPAKCR-CVDTTDFCYKP-CKS
JC2072    ESSKPCCDECKCTKSEPPQCQCVDTRLESCHSACKLCLCALSFPAKCR-CVDTTDFCYKP-CKS
0506164   ESSKPCCDQC-CTKSMPPKCRCSDIRLDSCHSACKSCACTYSIPAKCF-CTDINDFCYEP-CKS
0401177   ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
763679A   ESSKPCCDLCMCTASMPPQCHCADIRLNSCHSACDRCACTRSMPGQCR-CLDTTDFCYKP-CKS
TISYD2    EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
0907248   ESSEPCCDSCRCTKSIPPQCHCADIRLNSCHSACKSCMCTRSMPGKCR-CLDTDDFCYKP-CES
1102213   ESSEPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCH-CLDTHDFCHKP-CKS
1102213   ESSEPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
0404180   EYSKPCCDLCMCTRSMPPQCSCEDIRLNSCHSDCKSCMCTRSQPGQCR-CLDTNDFCYKP-CKS
TIZB1B    ESSHPCCDLCLCTKSIPPQCQCADIRLDSCHSACKSCMCTRSMPGQCH-CLDTHDFCHKP-CKS
TIMB      ESSEPCCDSCDCTKSKPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCYKP-CES
TIZB1P    ESSHPCCDLCLCTKSIPPQCQCADIRLNSCHSACKSCMCTRSMPGQCR-CLDTHDFCHKP-CKS
JC1066    ESSEPCCDSCDCTKSKPPQCHCANIRLNSCHSACKSCICTRSMPGKCR-CLDTDDFCTKP-CES
Q41066    DVKSACCDTCLCTKSDPPTCRCVDVGET-CHSACDSCICALSYPPQCQ-CFDTHKFCYKA-CHN
P80321    STTTACCDFCPCTRSIPPQCQCTDVREK-CHSACKSCLCTLSIPPQCH-CYDITDFCYPS-CR-
Q41065    DVKSACCDTCLCTKSNPPTCRCVDVRET-CHSACDSCICAYSNPPKCQ-CFDTHKFCYKA-CHN
P81705    --TSACCDKCFCTKSNPPICQCRDVGET-CHSACKFCICALSYPAQCH-CLDQNTFCYDK-CDS
P56679    DVKSACCDTCLCTKSNPPTCRCVDVGET-CHSACLSCICAYSNPPKCQ-CFDTQKFCYKA-CHN
P16346    --TTACCNFCPCTRSIPPQCRCTDIGET-CHSACKTCLCTKSIPPQCH-CADITNFCYPK-CN-
P01065    DVKSACCDTCLCTRSQPPTCRCVDVGER-CHSACNHCVCNYSNPPQCQ-CFDTHKFCYKA-CHS
P24661    DVKSACCDTCLCTKSEPPTCRCVDVGER-CHSACNSCVCRYSNPPKCQ-CFDTHKFCYKS-CHN
P07679    KRPWECCDIAMCTRSIPPICRCVDKVDR-CSDACKDCEETEDN--RHV-CFDTYIGDPGPTCHD
P19860    ERPWKCCDLQTCTKSIPAFCRCRDLLEQ-CSDACKECGKVRDSDPPRYICQDVYRGIPAPMCHE
P22737    ERPWKCCDLQTCTKSIPAFCRCRDLLEQ-CSDACKECGKVRDSDPPRYICQDVYRGIPAPMCHE
220645    ES-EGCCDRCICTKSMPPQCHCHDVRLDSCHSDCETCICTRSYPAQCR-CADTTDFCYKP-C-S
P09864    TRPWKCCDRAICTKSFPPMCRCMDMVEQ-CAATCKKCGPATSDSSRRV-CEDXY-----------
P09863    KRPWKCCDQAVCTRSIPPICRCMDQVFE-CPSTCKACGPSVGDPSRRV-CQDQYV----------
CONSENSUS ESSKPCCDXCXCTKSIPPQCRCXDXRLNSCHSACKSCXCTRSXPXQCX-CXDTXDFCYKP-CKS

Thank you for your attention