



Reporte Interno CIC:

Results on Using Grammatical Inference to Solve

the Cleavage Site Prediction Problem

Gloria Ines Alvarez V., Jorge Hernan Victoria M.,

Carlos Arbey Mejıa M., Enrique Bravo M.

August 2011

Abstract

The goal of this research is to evaluate the performance of two grammatical inference algorithms on the cleavage site prediction problem, both in Potyviridae polyprotein sequences and in signal peptides. We plan to compare the performance of our algorithms with other approaches available at this time. Cleavage site prediction is a well-known problem in bioinformatics: it consists in predicting the places on a sequence of amino acids where it is cut. We found that our algorithms behave well on this task. In fact, they are able to achieve similar or higher recognition rates than other current approaches.

Introduction

A formal language is a subset of the chains produced by concatenation of symbols from a given alphabet. The set of amino acids is finite, so the names of the amino acids can be considered an alphabet. Grammatical inference is useful to learn inductively a formal language over a given alphabet. Sequences of amino acids are chains of symbols over such an alphabet, and the subsequences of amino acids corresponding to the final part of any segment are a subset of them, so these subsequences can be considered a language. As a consequence, grammatical inference can be applied to the problem of distinguishing subsequences which correspond to the end of a segment (cleavage sites) from other subsequences.

Our first goal is to apply our grammatical inference algorithms to solve the cleavage site prediction problem and then to compare their performance against other well-known techniques. Detecting cleavage sites is part of the


path to developing new drugs which prevent protein maturation and inhibit its effects, and it is a well-known problem in bioinformatics.

Grammatical inference is a technique of inductive learning, belonging to the syntactic approach to machine learning. We apply two grammatical inference algorithms to predict cleavage sites in two contexts: polyproteins from the translation of Potyvirus genomes, and signal peptides. These real problems allow us to evaluate the behaviour of the HyRPNI and OIL algorithms under real-world conditions, which differ from synthetic data tests.

This report is organised as follows: in Section 1 we explain the cleavage site prediction problem, and in Section 2 we describe some important facts about Potyviruses. Then we survey other methods applied to solve the cleavage site prediction problem (Section 3). In Section 4, we present our approach of using grammatical inference to solve the problem. In Sections 5 and 6, we describe our experiments on polyproteins from Potyvirus genomes and on signal peptides, reporting both experimental design and results. Section 7 reports our results on comparing the OIL algorithm with SignalP 3.0. Finally, Section 8 presents our conclusions and future work.

1 The Cleavage Site Prediction Problem

Cleavage site prediction is a well-known problem in bioinformatics. It consists in finding the places where macromolecules are cut by specific peptidases and proteinases during maturation processes to obtain functional products. We can predict cleavage sites for signal peptides, viral coding segments and other biological patterns; thus the generic problem is present in any genome, from viruses to human beings. Cleavage sites may be visualized in the primary, secondary and tertiary structure of the protein, as we can see in Figure 1: in part a) we see the amino acid chain with the cleavage site marked by a red line between the two amino acids T and E; in part b) the cleavage site is surrounded by a black circle marking the place of the cut on the secondary structure; part c) marks in blue the point of interaction between the protein and the enzyme which will produce the cut. Although it is possible to locate the place in any of the representations of proteins, currently the prediction is done from the primary structure, because this representation is the simplest and it seems to contain enough information to reach a solution. Besides, because of the complexity of cleavage site sequences, the use of algorithms makes the detection of the specific features of those points easier. The prediction of cleavage sites allows isolating specific segments to be studied and facilitates the analysis and annotation of the data obtained


experimentally, and their comparison with those existing in databases such as GenBank or PDB.

(a) Primary Structure (b) Secondary Structure (c) Tertiary Structure

Figure 1: Cleavage sites in the primary, secondary and tertiary structure of proteins.

Biologists have advanced in the discovery of the patterns that mark many events in biosequences. In some cases the patterns are almost fixed; for example, the termination codon for the translation of proteins from messenger RNA is one of the following three: UAA, UAG, or UGA. However, other patterns, such as cleavage sites, are more variable, and the best we can get with manual methods is a list of possible configurations of the pattern. As the list of possibilities grows longer, the utility of the pattern for the prediction problem decreases.

2 The Potyviridae Family

Potyviridae is a family of plant viruses. Among their main features: the virion is non-enveloped, the genome is a linear positive-sense ssRNA, and replication occurs in the cytoplasm. With respect to their size, the nucleocapsid is filamentous, 11-20 nm in diameter, and the genome ranges from 9000 to 12000 nt. Figure 2 shows a microscope image of a potato virus which belongs to this family.¹

We are interested in cleavage site prediction for Potyviruses, since they are pathogenic for many important crop plants such as beans, potato,

¹ Image from jpkc.ynau.edu.cn, visited 2010-10-22. The image must not be used for commercial purposes without the consent of the copyright owners. The images are not in the public domain. The images can be freely used for educational purposes.


Figure 2: Microscope image of a potato virus from the Potyviridae family.

soybean, sugarcane and tobacco, among others, which have a large economic and alimentary impact in South America. Many potyviruses cause serious diseases in plants, such as a mosaic effect on the leaves which decreases the plant's activity and its production, even until the death of the plant. Figure 3 shows these effects on several fruits and leaves: beans, watermelon² and papaya³.

(a) Beans (b) Watermelon (c) Papaya

Figure 3: Consequences of the infection by potyvirus in several crop plants.

Prediction of cleavage sites may facilitate the understanding of the molecular mechanisms underlying the diseases caused by potyviruses. Researchers in the region have studied this family of viruses [BCM08], and more than fifty viruses have been sequenced around the world. The potyviral genome is expressed through the translation of a polyprotein which is cut by virus-encoded proteinases at specific sites in the sequence of amino acids, resulting

² Images by T. A. Zitter, from vegetablemdonline.ppath.cornell.edu/photopages/Cucurbit/CucViruses

³ Image by A. A. Seif, from www.infonet-biovision.org/print/images/133/crops


in 10 functionally mature proteins responsible for infection and virus replication, called P1, HCPro, P3, 6K1, CI, 6K2, VPg, NIa, NIb and CP. The genome organisation of a typical member of the family is shown in Figure 4, which indicates the 10 mature proteins and the nine cleavage sites (arrowed). Each cleavage site is identified by the names of the segments it separates.

Figure 4: Map of a typical member of the Potyviridae family.

Patterns signalling cleavage sites change from one family to another, and within the same family they may change between individuals because of mutation processes. In spite of that, there exists some experimental information about the shape of these patterns for potyvirus. Table 1 is a fragment of the possibilities for the first segment cleavage site, taken from www.dpvweb.net/potycleavage/index.html: column Site shows the pattern marking the cleavage site with a slash (/), column Nos of Sequences presents the number of different sequences found to have exactly this sequence near the cleavage site, and column Nos of Viruses presents the number of viruses with this kind of pattern. This table reflects the pattern that biologists have recognised but, as we said previously, this pattern has high variability, so some valid sequences will have their cleavage site outside these patterns.

3 Methods to Solve the Cleavage Site Prediction Problem

Cleavage site prediction has been a research problem over the last decade. Until now, the most successful computational approaches have been machine learning ones [LN06], especially neural networks [HGS+04], Bayesian networks [May05], hidden Markov models [JDBB04], or combinations of some of them.


Table 1: Example of patterns for the cleavage site between segments P1 and HC-PRO

Cleavage sites are present in the amino acid sequences of any living being. There are research results on viruses, bacteria, animals and human beings. Notice that many papers report results on one specific cleavage site, and the goal of their research points to different uses of the information obtained, both biological and computational. For example, from the biological point of view there are publications related to: the application of variable context Markov chains for HIV protease cleavage site prediction [Ogu09]; determining signal peptides [May05], which are useful to direct proteins to the correct destination within the cell [HGS+04]; Support Vector Machine-based prediction of caspase substrate cleavage sites [WTR07] (caspases belong to a


unique class of cysteine proteases that function as critical effectors of apoptosis, inflammation and other important cellular processes); or prediction of neuropeptide cleavage sites in insects [SSRZ08]. Some results pointing to computational issues are: a comparison of machine learning methods for detection of HIV-1 protease cleavage sites [LN06], and support vector machine prediction of signal peptide cleavage sites using a new class of kernels for strings [Ver02].

Wang's paper [WYC05] presents the application of support vector machines to several bioinformatics problems, one of them being the cleavage site prediction problem in human signal peptides. The Danish Center for Biological Sequence Analysis has been working since 1997 on a tool for cleavage site prediction called SignalP. This tool uses hidden Markov models and artificial neural networks to detect signal peptide cleavage sites in three contexts: eukaryotes, Gram-positive bacteria and Gram-negative bacteria. Their results are the highest reported in recent years, and they also make comparisons against many other available tools [NEBvH97] [NBvH99] [JDBB04] [EBvHN07] [LdM+09]. Yang [YC04] proposes a modified support vector machine which replaces the kernel functions with matrices of similarity among amino acids; the new model is called a bio-support vector machine and has been applied to the prediction of cleavage sites for the HIV protease. This approach yields models of reduced complexity and higher robustness.

Other software tools predict cleavage sites as well: [HGS+04] proposes PrediSi, which uses improved weight matrices, and Mayo [May05] uses Bayesian networks. Using artificial neural networks, [THY04] applies bio-basis functions involving similarity matrices; the authors state that this method provides more efficient prediction, higher robustness, and reduced temporal complexity in the learning process. Markov chains are used to solve the cleavage prediction problem for the HIV-1 protease [Ogu09], and support vector machines for caspase substrates [Wee07]. Considering the species studied, there are reports on Picornavirus [BHBB96], the HIV-1 protease [RY04], insects [SSRZ08] and mycobacteria [LdSM+09]. Some of them use biological methods and others use computational ones.

Although grammatical inference is being used to solve several bioinformatics problems, we are not aware of any attempt to use it to solve the cleavage site prediction problem. Some problems which are being solved by this technique are: learning non-deterministic automata to characterize protein families [CK05] [CK06], detection of coil/non-coil zones [PLCS06], and transmembrane proteins [PLC07].


4 The Grammatical Inference Approach to Solve the Cleavage Site Prediction Problem

There are two main reasons to try grammatical inference on the cleavage site prediction problem. First, this technique does not make any assumption about dependence or independence among symbols. Other approaches, such as hidden Markov models or Bayesian networks, depend on an independence principle which may not be fulfilled in this context, introducing from the beginning a source of distance between the models and the real systems; as we said, in grammatical inference this is not the case. Another important reason to apply grammatical inference is the ease of representing the inputs of the problem: the languages to learn are defined in terms of the alphabet formed by all the amino acids (or nucleotides, depending on the sequences to be studied), so the sequence itself is the input for the algorithm. On the contrary, hidden Markov models and artificial neural networks require a change of representation which may increase the length of the input twenty times or more. Finally, the common strategy to test grammatical inference algorithms uses synthetic data; it would be interesting to test these algorithms with real data to determine whether this approach is ready to solve real-world problems like this one. Two grammatical inference algorithms will be used in this experimental research, both developed with the participation of some of the authors of this report: HyRPNI and OIL.
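As an informal illustration of the representation argument, the following Python sketch (with an invented five-residue window) contrasts the direct string representation used by grammatical inference with a typical one-hot encoding for neural networks, which expands each residue into twenty values:

    # Amino acid alphabet: one letter per residue (20 symbols).
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    window = "VRHQY"  # hypothetical 5-residue window around a cleavage site

    # Grammatical inference input: the window itself, a word over the alphabet.
    gi_input = window
    print(len(gi_input))   # 5 symbols

    # Typical neural-network input: one 20-component one-hot block per residue.
    one_hot = []
    for residue in window:
        block = [1 if aa == residue else 0 for aa in AMINO_ACIDS]
        one_hot.extend(block)
    print(len(one_hot))    # 100 values, a twenty-fold expansion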

The algorithms HyRPNI and OIL have in common that they build finite state automata as models of a target language and they proceed by merging states, starting from an initial representation of the training sample. They differ in the kind of automaton produced (deterministic in HyRPNI, non-deterministic in OIL) and in the method used to build it: HyRPNI is a deterministic method while OIL is a non-deterministic one. Both algorithms identify a model of the target language in the limit; HyRPNI identifies the minimum automaton and OIL identifies one automaton. Besides, both have polynomial complexity, as we will see later in this section. In the rest of the section we give some mathematical foundations, describe both algorithms in some detail, and discuss their convergence and complexity.

4.1 Theoretical Preliminaries

In this section, we state the definitions used to describe basic concepts in grammatical inference, together with some explanations about our notation. Definitions


that are not in this section can be found in [HMU01].

Let Σ be a finite alphabet and Σ∗ the free monoid⁴ generated by Σ, with concatenation as the internal operation and ε as the neutral element. A language L over Σ is a subset of Σ∗. The elements of L are called words. Let w be a word, that is, w ∈ Σ∗; the length of w is denoted |w|.

Given x ∈ Σ∗, if x = uv with u, v ∈ Σ∗, then u (resp. v) is called a prefix (resp. suffix) of x. Pr(L) (resp. Suf(L)) denotes the set of prefixes (resp. suffixes) of L. The product of two languages L1, L2 ⊆ Σ∗ is defined as L1 · L2 = {u1u2 | u1 ∈ L1 ∧ u2 ∈ L2}. Sometimes L1 · L2 will be denoted simply L1L2.

Definition 1 The lexicographical order on Σ∗ will be denoted ≪. Supposing that < is a total order on Σ and given a, b ∈ Σ∗ with a = a1 · · · am and b = b1 · · · bn, a ≪ b if and only if (|a| < |b|) or (|a| = |b| and ∃j, 1 ≤ j ≤ m, such that ∀i, 1 ≤ i < j, ai = bi and aj < bj).
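Definition 1 can be read as the shortlex order; a minimal Python sketch, assuming the symbol order is the usual character order:

    def shortlex_less(a, b):
        # a << b: shorter words come first; within equal length, the first
        # differing symbol decides (Definition 1).
        if len(a) != len(b):
            return len(a) < len(b)
        for x, y in zip(a, b):
            if x != y:
                return x < y
        return False  # a == b

    assert shortlex_less("b", "aa")    # shorter word precedes longer ones
    assert shortlex_less("ab", "ba")   # equal length: compare symbolwise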

Definition 2 A Deterministic Finite Automaton (DFA) is a 5-tuple A = (Q, Σ, δ, q0, F) where Q is a finite set of states, Σ is an alphabet, δ : Q × Σ → Q is the transition function, q0 is the initial state and F ⊆ Q is the set of final states.

The transition function δ is defined on symbols, but it can be extended to words as follows: let q ∈ Q and s ∈ Σ∗ with |s| > 1, that is, s = s1 · · · sn; then δ(q, s) = δ(δ(q, s1 · · · sn−1), sn). A word x is accepted by A if δ(q0, x) ∈ F. The set of words accepted by A is denoted L(A) and is called the language of A.
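A small sketch of Definition 2 and of the extension of δ to words, using Python dictionaries; the two-state automaton below is an invented example:

    # DFA A = (Q, Sigma, delta, q0, F) over Sigma = {'a', 'b'}.
    delta = {(0, 'a'): 1, (0, 'b'): 0,
             (1, 'a'): 0, (1, 'b'): 1}
    q0, F = 0, {1}

    def extended_delta(q, word):
        # delta(q, s1...sn) = delta(delta(q, s1...sn-1), sn), unfolded left to right.
        for symbol in word:
            q = delta[(q, symbol)]
        return q

    def accepts(word):
        return extended_delta(q0, word) in F

    print(accepts("ab"))   # True: the run ends in state 1
    print(accepts("aa"))   # False: the run ends back in state 0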

Definition 3 Two automata are equivalent if they recognise the same language.

Definition 4 Given a finite set of words D+, the augmented prefix tree acceptor of D+ is defined as the automaton A = (Q, Σ, δ, q0, F) where Q = Pr(D+), q0 = ε, F = D+ and δ(u, a) = ua, ∀u, ua ∈ Q.

Definition 5 A Deterministic Moore Machine is a 6-tuple M = (Q, Σ, B, δ, q0, Φ), where Q is a set of states, Σ is an alphabet, B is the alphabet of the output function Φ, δ : Q × Σ → Q is the transition function, q0 is the initial state and Φ : Q → B is the output function.

⁴ The free monoid is a set with a binary operation over it, such that applying the operation to two elements of the set produces another element of the set; in addition, the binary operation must be associative and the set must have a neutral element.


Definition 6 The DFA associated to a Moore machine M = (Q, Σ, {0, 1, ?}, δ, q0, Φ) is A = (Q, Σ, δ, q0, F) with F = {q ∈ Q | Φ(q) = 1}. Given a DFA A = (Q, Σ, δ, q0, F), the Moore machine associated to A is M = (Q, Σ, B, δ, q0, Φ) with B = {0, 1}, satisfying that ∀q ∈ Q, Φ(q) = 1 if and only if q ∈ F.

Definition 7 Given a Moore machine M = (Q, Σ, B, δ, q0, Φ) with B = {0, 1, ?}, the behaviour of M is given by the function tM : Σ∗ → B defined as tM(x) = Φ(δ(q0, x)) for all x ∈ Σ∗ such that δ(q0, x) exists. M is consistent with (D+, D−) if ∀x ∈ D+, tM(x) = 1 and ∀x ∈ D−, tM(x) = 0. A word x is accepted by M if tM(x) = 1. The set of words accepted by M is denoted L(M).

Definition 8 Given two finite disjoint sets D+ and D−, we define the (D+, D−)-Augmented Moore Prefix Tree Acceptor, denoted AMPAT(D+, D−), as the Moore machine M = (Q, Σ, B, δ, q0, Φ) with B = {0, 1, ?}, Q = Pr(D+ ∪ D−), q0 = ε, and δ(u, a) = ua if u, ua ∈ Q and a ∈ Σ. For each state u, the value of the output function at u is 1, 0 or ? depending on whether u belongs to D+, D− or Σ∗ − (D+ ∪ D−), respectively.

A word x is accepted by M if Φ(x) = 1. The set of words accepted by M is denoted L(M). The size of the sample (D+, D−) is ∑w∈D+∪D− |w|, the sum of the lengths of all its words.
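The AMPAT of Definition 8 can be sketched directly from the definition; the sketch below (a minimal illustration, not the implementation used in the experiments) identifies states with prefixes and assigns outputs 1, 0 or '?':

    def build_ampat(D_pos, D_neg):
        # States are the prefixes of D+ union D-; the empty prefix is q0.
        words = set(D_pos) | set(D_neg)
        Q = {w[:i] for w in words for i in range(len(w) + 1)}
        # delta(u, a) = ua whenever both u and ua are states.
        delta = {(u2[:-1], u2[-1]): u2 for u2 in Q if u2}
        # Output function: 1 on D+, 0 on D-, '?' elsewhere.
        Phi = {q: 1 if q in D_pos else 0 if q in D_neg else '?' for q in Q}
        return Q, delta, '', Phi

    Q, delta, q0, Phi = build_ampat({'ab', 'abb'}, {'a'})
    print(Phi['ab'], Phi['a'], Phi[''])   # 1 0 ?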

Definition 9 Let M be an AMPAT defined as M = (Q, Σ, B, δ, q0, Φ). The merge tree associated to p, q ∈ Q is the set of pairs (x, y) corresponding to the states that can be reached from p and q when moving simultaneously from both of them using the same symbol [?].

To illustrate Definition 9, consider the AMPAT M = (Q, Σ, B, δ, q0, Φ) and the merge tree MT shown in Figure 5. The graphical conventions are: solid nodes represent states with Φ(qi) = 1, thin nodes are states with Φ(qi) = 0, and dashed nodes are states with Φ(qi) = ?. In this example, MT is built starting from the pair of states (0, 1) in M. A transition in MT connects, with a symbol x ∈ Σ, a pair of states (p1, q1) with another pair (p2, q2) such that δ(p1, x) = p2 ∧ δ(q1, x) = q2. In this case, MT = {(0, 1), (1, 3), (2, 4), (5, 7), (6, 8)}.

Definition 10 Given a language L and a word a, the residual language of L with respect to a is a−1L = {b ∈ Σ∗ : ab ∈ L}.

Definition 11 Let M = (Q, Σ, B, δ, q0, Φ) be an AMPAT and S ⊆ Q. The frontier T of S is the set T = {q ∈ Q | δ(p, a) = q, q ∉ S, a ∈ Σ, p ∈ S}. In other words, the frontier of S consists of all the successors of states in S which do not themselves belong to S.


(a) AMPAT M (b) Merge Tree MT

Figure 5: Merge Tree MT built from the pair of states (0,1) of AMPAT M.
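The merge-tree construction of Definition 9 amounts to a breadth-first pairing of states reachable with the same symbols; a sketch, assuming the dictionary-based representation of the previous sketches (delta maps (state, symbol) pairs to states):

    from collections import deque

    def merge_tree(p, q, delta, alphabet):
        # Collect all pairs of states reachable from (p, q) by moving with
        # the same symbol in both components (Definition 9).
        seen = {(p, q)}
        queue = deque([(p, q)])
        while queue:
            a, b = queue.popleft()
            for s in alphabet:
                if (a, s) in delta and (b, s) in delta:
                    pair = (delta[(a, s)], delta[(b, s)])
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
        return seen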

Definition 12 If M = (Q, Σ, B, δ, q0, Φ) is a Moore machine and q ∈ Q, we define the language accepted by M from state q as L(M, q) = {v ∈ Σ∗ : δ(q, v) ∈ F}, where F is the set of final states of the associated DFA (Definition 6).

Definition 13 Let M = (Q, Σ, B, δ, q0, Φ) be a Moore machine and p, q ∈ Q. We say p ≺ q (p is included in q) if L(M, p) ⊆ L(M, q).

Definition 14 Given a Moore machine M = (Q, Σ, B, δ, q0, Φ) with B = {0, 1, ?} and p, q ∈ Q, we say p and q are distinguished if there exists u ∈ Σ∗ such that Φ(δ(p, u)) = 1 ∧ Φ(δ(q, u)) = 0, or Φ(δ(p, u)) = 0 ∧ Φ(δ(q, u)) = 1. If no such word u exists, we say p and q are indistinguishable.
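Combining Definitions 9 and 14: two states are distinguished exactly when their merge tree contains a pair of states with opposite outputs. A sketch reusing the merge_tree function from the previous sketch:

    def distinguished(p, q, delta, Phi, alphabet):
        # p and q are distinguished if some common suffix leads them to
        # states with opposite outputs (Definition 14); '?' never conflicts.
        for a, b in merge_tree(p, q, delta, alphabet):
            if {Phi[a], Phi[b]} == {0, 1}:
                return True
        return False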

4.2 Algorithm HyRPNI

Grammatical inference is an inductive form of machine learning. We follow Gold's model of identification in the limit [Gol67] as our learning model, and we focus on learning automata by a state merging approach. The inputs are two sets of words labeled with respect to a target language: D+ contains words that belong to the language to be learned and D− contains words that belong to its complement. The output of the inference is a DFA consistent with D+ and D− which is a model of the target language. The state merging approach consists of detecting indistinguishable states in the automaton and merging them to obtain the smallest automaton consistent with the training data. Figure 6 shows the input, the process and the output of this algorithm: the tagged samples are the input, they are represented with


Figure 6: General approach of algorithm HyRPNI.

a prefix tree Moore machine which is modified by merging states until the final hypothesis is reached.

Hy-RPNI (Hybrid Regular Positive and Negative Inference) consists of two phases: in the first phase, a heuristic based on a score is used to choose the pairs of states to be merged; the second phase chooses the pairs of states to merge in lexicographical order (Definition 1). Algorithm 1 shows the general strategy of the inference process. It receives a set of positive (D+) and negative (D−) samples and the size of the first phase (phaseOneSize) as input parameters, and returns the smallest DFA (Definition 2) consistent with the training samples. List S, defined in line 2, will store the states that will be in the final hypothesis. List T, defined in line 3, will store the states in the frontier of S (Definition 11). In line 4, S is initialized with q0, the initial state of the AMPAT M (Definition 8). The function calculateFrontier (lines 5, 17) assigns to list T the frontier of S. The counter countPhase is defined in line 6 starting from zero, and it later increases to detect the end of the first phase. The while statement starts in line 7 and stops when T becomes empty and the final hypothesis is obtained. Line 9 checks the current phase of the algorithm and applies the corresponding method: heuristic or lexicographical. In both cases, countPhase is incremented by one only if a merge is done. The algorithm returns the DFA associated to the final Moore machine (Definition 6).

Merging of states must preserve the consistency of the machine with respect to the training samples. In our approach, two states can be merged if


Algorithm 1 Hy-RPNI(D+, D−, phaseOneSize)

1: M = AMPAT(D+, D−) // M = (Q, Σ, B, δ, q0, Φ)
2: S = [ ] // the list of states in the hypothesis
3: T = [ ] // the list of states in the frontier
4: S.append(q0)
5: T = calculateFrontier(M, S)
6: countPhase = 0 // counter of merges done in the current phase
7: while T is not empty do
8:     merged = False
9:     if countPhase < phaseOneSize then
10:        merged = HeuristicMerge(M, S, T)
11:    else
12:        merged = LexicographicalMerge(M, S, T)
13:    end if
14:    if merged then
15:        countPhase = countPhase + 1
16:    end if
17:    T = calculateFrontier(M, S)
18: end while
19: return the DFA associated to M


they are indistinguishable (Definition 14). The merging functions of the algorithm first determine whether two states can be merged and proceed to merge them when possible. Function HeuristicMerge explores all candidate pairs of states to merge, formed by a state from set S and one from the frontier T. Each mergeable pair is scored to reflect the degree of support, in the training sample, for the belief that this merge is correct, that is, that it will lead us to a right model of the target language. Function LexicographicalMerge merges the first pair of states, one from set S and the other from set T, in lexicographical order. In both alternatives, if a state from T cannot be merged with any state from S, this state is promoted to set S, that is, the state is inserted into S and deleted from T.

An important issue for this algorithm is to determine how many merges are done in the first phase (heuristic selection). If the first phase is too short, the computational time decreases but the recognition rate may be low. If the first phase is too long, the results may improve but at a higher computational cost.

This algorithm gives priority to promoting states: only if there is no state to promote does the algorithm merge the pair with the best score. The priority of promotions will become clear in Algorithm 2.

The HeuristicMerge function selects the highest-rated pair of states to merge based on a score function; it receives the AMPAT M and the S and T lists of Algorithm 1 as input parameters. In line 1, the ScoreTable list is defined; it will store the pairs of states that can be merged, together with their respective scores. In line 2, PromoteTable is defined; it will store the states that cannot be merged, from which one is later chosen for promotion to S, that is, to the final hypothesis. A double for statement in lines 3 and 5 iterates over the T and S lists, pairing all the states of S and T to calculate merge scores; variable p iterates over T and q over S. In line 6, the merge tree Mt associated to states q and p is built using the AMPAT M (see Definition 9). Using Mt, in line 7, Cm stores, via the function countCoincidences, the number of pairs of states (qi, qj) ∈ Mt satisfying, for a symbol a ∈ Σ, (Φ(δ(qi, a)) = Φ(δ(qj, a))) ∧ (Φ(δ(qi, a)) ≠ ?). Line 8 verifies whether the merge is possible with the isMerged function. In line 9, we calculate the score following the same formula used by Lang in the Red-blue algorithm [LPP98]. The score depends on the variable Cm and on the depth of state q, accessed as depth(q). Notice that coincidences in the merge tree are the main criterion to increase the score, while the depth of the state is secondary, as the constants used in the formula reflect. Observe that this formula returns higher scores when Cm increases, but also when the depth of q decreases,


Algorithm 2 HeuristicMerge(M, S, T)

Require: M = (Q, Σ, {0, 1, ?}, δ, q0, Φ)
1: ScoreTable = [ ]
2: PromoteTable = [ ]
3: for p ∈ T do
4:     merged = False
5:     for q ∈ S do
6:         Mt = MergeTree(q, p, M)
7:         Cm = countCoincidences(Mt)
8:         if isMerged(Mt) then
9:             score = 100 ∗ Cm + (99 − depth(q))
10:            ScoreTable.append((q, p, score))
11:            merged = True
12:        end if
13:    end for
14:    if ¬merged then
15:        PromoteTable.append(p)
16:    end if
17: end for
18: if PromoteTable ≠ [ ] then
19:    S.append(PromoteTable[0])
20:    return False
21: else
22:    (a, b) = getBestMerge(ScoreTable)
23:    MergeStates(M, a, b)
24:    return True
25: end if

that is, when q is closer to the root of the AMPAT. The constants 100 and 99 in the formula are used to amplify the range of returned values and are totally independent of the problem size. In line 10, the pair of states and their score are added to ScoreTable. When a state of T cannot be merged with any state of the S list, it is added to the PromoteTable list in line 14. Line 18 verifies whether PromoteTable is empty; if not, the first state in PromoteTable is promoted in line 19. At this point the algorithm gives priority to promoting states to the final hypothesis. On the other hand, if PromoteTable is empty in line 18, the function getBestMerge in line 22 returns the pair of states with the best score stored in the ScoreTable list. To break ties among the highest scores,


Figure 7: Difference between RPNI and HyRPNI.

the pair of states chosen is the one that, once the merge is done, yields the smaller automaton. If a tie persists, we choose one randomly. The algorithm returns False to signal that a state was promoted, or True when states were merged.
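The score of line 9 of Algorithm 2 can be written directly from the text; the coincidence counts in the example calls below are inferred from the scores reported in the worked example of Section 4.2.2:

    def score(Cm, depth_q):
        # Line 9 of Algorithm 2: coincidences dominate, shallower states
        # of S are preferred; 100 and 99 only widen the value range.
        return 100 * Cm + (99 - depth_q)

    print(score(1, 0))  # 199, the score of pair (0, 1) in Section 4.2.2
    print(score(4, 0))  # 499, the score of pair (0, 2) in Section 4.2.2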

The difference between HyRPNI and its predecessor RPNI is that the former uses two criteria to select the states to merge, while the latter uses a single criterion, the lexicographical one, during the whole inference process. Figure 7 shows how this difference is meaningful, in the sense that it changes the final hypothesis and may lead to shorter inference processes or smaller final hypotheses in some cases.

4.2.1 Properties

HyRPNI is a deterministic, polynomial algorithm whose time complexity is upper bounded by the expression O(kn⁴ + (m − k)n³), where k is a constant representing the size of the first phase in terms of the number of merges done, m is the total number of merges done, and n is the size of the initial AMPAT, that is, the number of states of the initial model. This complexity expression has two terms, corresponding to the two inference phases: kn⁴ corresponds to the cost of applying the HeuristicMerge algorithm (with complexity O(n³)) k times inside the main cycle starting at line 7 of the Hy-RPNI algorithm. The second term, (m − k)n³, corresponds to the cost of performing the remaining merges, each with a cost of n², inside the same main cycle. The worst case of this algorithm happens when k = m and all the merges are done in the first phase; in this case the complexity is the same as that of the Red-blue algorithm, which computes scores all the time during the


Figure 8: AMPAT to be processed with RPNI and Hy-RPNI.

(a) Merging states 0,1 (b) Merging states 0,6 (c) Merging states 0,7 and 0,8

Figure 9: The RPNI algorithm process. a) The states 0 and 1 were merged. b) The states 0 and 6 were merged. c) The states 0 and 7 were merged, and then the states 0 and 8.

inference process. The best case happens when k = 0, where the complexity is the same as that of the RPNI algorithm. The convergence of this algorithm is guaranteed because it has been proved that the convergence of a state merging inference algorithm is obtained independently of the order in which merges are done [?] when inferring DFAs, as in this case. Notice that changing the criteria to merge states establishes just a different order of merging.

4.2.2 Example

Notice that only the first merge of Hy-RPNI will be done with the heuristic approach, and the rest in lexicographical order. Our goal with this example is to show that by doing only one merge with the heuristic approach we can obtain different results.


(a) AMPAT from D+ and D− (b) Merging states 0,2 (c) Merging states 0,3 and 0,4

Figure 10: HyRPNI process. a) Initial AMPAT. b) Merge of states 0 and 2. c) Merges of states (0,3) and (0,4).

As we said previously, RPNI uses the two lists S and T. In this case, S and T start with values S = {0} and T = {1, 2}. The process begins by comparing states in lexicographical order, so the first states to compare are 0 and 1. These states can be merged because they are not distinguished, and the result of that merge is shown in Figure 9a. The lists S and T are updated to S = {0} and T = {2}. States 0 and 2 cannot be merged because they are distinguishable, so 2 is promoted and the S and T lists change to S = {0, 2} and T = {5, 6}. After that, we check whether 0 with 5 and 2 with 5 can be merged, but Φ(0) ≠ Φ(5) and Φ(2) ≠ Φ(5), so 5 is promoted to S. Now S = {0, 2, 5} and T = {6}. Checking the states 0 and 6 in lexicographical order leads to another merge; Figure 9b shows the result of that merge. After that, S and T change to S = {0, 2, 5} and T = {7, 8}; the merge of 0 with 7 is possible, and also the merge of 0 with 8. Figure 9c summarizes both merges. Finally S = {0, 2, 5} and T = {}. At this point the algorithm ends its execution, and the inferred automaton is the one in Figure 9c.

Consider the AMPAT shown in Figure 10a. When HyRPNI is used, it starts with the first phase, which applies the heuristic method; the S and T lists are set to S = {0} and T = {1, 2}. At this point the heuristic function assigns two scores, one for the pair of states (0, 1) and the other for (0, 2). Using the heuristic function, the scores are (0, 1) = 199 and (0, 2) = 499. This means we should merge states 0 and 2. After that merge, shown in Figure 10b, the first phase ends and the process continues in lexicographical order. The next merges are of states 0, 3 and 0, 4, in that order. The inferred automaton


using Hy-RPNI is shown in Figure 10c.

4.3 Algorithm OIL

This algorithm was first published in [GdPAR08]. The Order Independent Language (OIL) inference algorithm is a non-deterministic approach to grammatical inference for regular languages. Algorithm 3 below presents the OIL strategy: the positive and negative samples are sorted in lexicographical order (lines 1, 2). At the beginning, the hypothesis M is empty (line 3). Then the positive samples are considered one by one (line 4). If the current hypothesis M accepts a positive sample pS, M remains unchanged. If hypothesis M rejects it (line 5), a new automaton M′ is built to accept pS and it is added to M (lines 6, 7). The elements of M′ are defined in the following way: Q′ = Pref(pS), δ′ = {(u, v) | u ∈ Pref(pS), v = ua, a ∈ Σ, ua ∈ Pref(pS)}, q′0 = ε, and finally Φ′ is defined by ∀w ∈ (Pref(pS) − {pS}), Φ′(w) = ? and Φ′(pS) = 1. In line 8, M is modified by merging as many states as possible. The states to be merged are selected randomly. Once a merge is completed, the negative samples are checked against the new model M; if there is any inconsistency, the merge is undone. When all the positive samples have been processed, the algorithm ends and the final value of M is the learned model. OIL is a convergent algorithm; the proof is in [?]. Notice that every run of OIL may produce a different model because it is a non-deterministic algorithm. For this reason, we compute a group of models from a given training sample. To test the algorithm, several heuristics may be applied to get a final response: for example, we can test with the smallest model (the one with fewest states), or apply a voting method among the models to tag the test samples.

Algorithm 3 OIL(D+, D−)

1: posSample = sort(D+) (in lexicographical order)
2: negSample = sort(D−) (in lexicographical order)
3: M = (Q, Σ, {0, 1, ?}, δ, q0, Φ) (empty automaton)
4: for pS in posSample do
5:     if M does not accept pS then
6:         M′ = (Q′, Σ, {0, 1, ?}, δ′, q′0, Φ′) (M′ accepts only pS)
7:         M = M ∪ M′
8:         M = DoAllMergesPossible(M, negSample)
9:     end if
10: end for
11: return M


5 Experimentation on Polyproteins from Potyvirus Genomes

In this section we report the results of applying the HyRPNI and OIL algorithms to the cleavage site prediction problem on polyproteins of potyvirus genomes. Our purpose is to learn a model for recognizing each of the nine cleavage sites present in the coding portion of the viral genome once it is translated. The rest of this section describes the experimental design, the performance measures and the results obtained.

5.1 Experimental Design

The training sample was obtained from sequences published at www.dpvweb.net/potycleavage/index.html, approximately 50 samples for each cleavage site. As HyRPNI and OIL need negative samples, we used the positive samples of the other sites as negative samples for each given model. We have approximately one positive sample for every ten negative ones, both in the training and the test data sets. We trained with cross validation of four blocks. The amino acid sequence is considered one window at a time; three window lengths were explored: 5, 15 and 20. In the first case we suppose the cleavage site is located between the fourth and fifth symbols, and for this reason we refer to this window as 4 1; in a similar way, we experiment with windows 14 1 and 10 10 (the window extraction is sketched right after the list below). We train nine models, one for each cleavage site. Applying the test samples to the models allows us to quantify the number of samples in each of the following categories, from which we compute four measures: sensitivity, specificity, accuracy and the Matthews correlation coefficient:

• tp (true positive): the test sample is positive and it is tagged as positive by our model.

• tn (true negative): the test sample is negative and it is tagged as negative by our model.

• fp (false positive): the test sample is negative but it is tagged as positive by our model.

• fn (false negative): the test sample is positive but it is tagged as negative by our model.
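As anticipated above, a sketch of the window extraction used to build the samples; the sequence, cut position and window names are illustrative assumptions:

    def window(sequence, cut_pos, before, after):
        # cut_pos: index of the first residue after the cleavage site;
        # before=4, after=1 yields a "4 1" window.
        return sequence[cut_pos - before:cut_pos + after]

    seq = "MKAIFVLKGSLDRNVRHQYTE"   # hypothetical polyprotein fragment
    print(window(seq, 14, 4, 1))    # 'LDRNV': window around a cut at position 14
    print(window(seq, 8, 4, 1))     # 'FVLKG': window away from the cut (negative)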


To calculate the performance measures we use these formulas:

• Sensitivity = tp / (tp + fn)

• Specificity = tn / (tn + fp)

• Accuracy = (tp + tn) / (tp + tn + fp + fn)

• Matthews correlation coefficient: MCC = ((tp · tn) − (fp · fn)) / √((tp + fn)(tp + fp)(tn + fp)(tn + fn))

Sensitivity counts the number of well-classified positive test samples with respect to the total of positive test samples; it is one if all the positive test samples were well classified, and decreases if some positive samples were misclassified. Specificity counts the number of well-classified negative test samples with respect to the total of negative samples; it is one if all the negative samples were well classified, and decreases if some negative samples were misclassified. Accuracy counts the fraction of test samples well classified by the model, both positive and negative. Finally, the Matthews correlation coefficient (MCC) varies from -1 to 1 and reflects the quality of the prediction made by the model: a value of one means a perfect prediction, -1 means the worst prediction, and zero means that the prediction is no better than chance [KV10].
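The four measures can be computed directly from the four counts; a small sketch (the guard avoids division by zero when a category is empty):

    import math

    def measures(tp, tn, fp, fn):
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        acc = (tp + tn) / (tp + tn + fp + fn)
        denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
        mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
        return sens, spec, acc, mcc

    # Example: 45 of 50 positives and 480 of 500 negatives correctly tagged.
    print(measures(45, 480, 20, 5))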

Since no optimal value for the size of the first phase of the HyRPNI algorithm is known, three values were tried: 5, 10 and 25 merges. We selected these values considering the results of former experiments and estimations of the total number of merges in the inference process. In the case of the OIL algorithm, we compute teams of hypotheses of sizes 5, 11 and 15.

5.2 Experimental Results

The following tables present the results obtained with HyRPNI for each cleavage site. We report all combinations of window sizes and parameter values for the algorithm. Each entry of the tables is the average over the four executions of cross validation. Notice that the execution time is around 30 seconds with window 4 1, but increases to several minutes with the 14 1 and 10 10 windows; in spite of this, all experiments take acceptable time.

Table 2 presents the results for the P1-HCPro site. Notice that the accuracy values for window 4 1 are the best ones, beyond 98 percent. Looking at the balance between sensitivity and specificity, window 4 1 has the best performance


Table 2: HyRPNI Algorithm Applied to Prediction of Site 1: P1-HCPro

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             0:6.18s     0.9835    0.91         0.83         0.87
4 1     10            0:14.13s    0.9865    1.00         0.81         0.89
4 1     25            0:30.26s    0.9835    0.97         0.79         0.87
14 1    5             0:50.52     0.9448    0.57         0.90         0.69
14 1    10            1:10.54m    0.9753    0.95         0.69         0.79
14 1    25            3:35.92m    0.9833    1.00         0.76         0.86
10 10   5             1:14.14m    0.9575    1.00         0.40         0.61
10 10   10            1:31.83m    0.9575    0.86         0.48         0.62
10 10   25            5:13.35     0.9658    0.86         0.63         0.71

Table 3: HyRPNI Algorithm Applied to Prediction of Site 2: HCPro-P3

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             0:7.59      0.98      0.83         0.95         0.88
4 1     10            0:19.39     0.98      0.84         0.93         0.88
4 1     25            0:50.07     0.98      0.82         0.93         0.86
14 1    5             0:45.69     0.97      0.96         0.65         0.78
14 1    10            1:13.45     0.98      0.97         0.76         0.85
14 1    25            2:39.77     0.98      0.94         0.76         0.84
10 10   5             1:10.87     0.92      0.65         0.56         0.52
10 10   10            1:34.32     0.94      0.73         0.55         0.58
10 10   25            3:40.63     0.96      0.94         0.55         0.69

again. Among the parameters of the HyRPNI algorithm for window 4 1, a first phase of length 10 has the best correlation coefficient. Following these data, we conclude that the model to recognize the first cleavage site should be trained using 10 merges in the first phase of the algorithm and a 4 1 window.

According to the correlation coefficient in Table 3, window size 10 10 is not able to learn any pattern properly; fortunately, window size 4 1 shows a better behaviour in all the measures for the second cleavage site, HCPro-P3.

Sites 3, 4, 5 and 6 show correlation coefficient values between 0.5 and 0.7; values near the bottom of this range indicate that the algorithm is learning only a weak pattern. This behavior could be caused by the small number of samples available to train our models. However, notice that the accuracy measures

Table 4: HyRPNI Algorithm Applied to Prediction of Site 3: P3-6K1

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             00:13.31    0.96      0.83         0.60         0.69
4 1     10            00:27.80    0.96      0.79         0.60         0.66
4 1     25            01:11.92    0.95      0.68         0.64         0.64
14 1    5             00:45.26    0.96      0.83         0.53         0.64
14 1    10            01:12.56    0.92      0.50         0.57         0.48
14 1    25            05:14.99    0.93      0.62         0.50         0.51
10 10   5             00:59.15    0.94      0.60         0.63         0.58
10 10   10            01:22.81    0.95      0.73         0.55         0.59
10 10   25            04:23.97    0.95      0.73         0.64         0.66


Table 5: HyRPNI Algorithm Applied to Prediction of Site 4: 6K1-CI

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             0:14.64     0.95      0.69         0.65         0.64
4 1     10            0:46.56     0.95      0.66         0.65         0.62
4 1     25            1:45.69     0.94      0.63         0.67         0.62
14 1    5             0:33.54     0.95      0.73         0.49         0.57
14 1    10            0:48.64     0.95      0.75         0.49         0.58
14 1    25            2:38.09     0.93      0.58         0.51         0.51
10 10   5             0:44.91     0.97      0.94         0.60         0.73
10 10   10            1:6.07      0.97      0.94         0.60         0.73
10 10   25            3:23.48     0.95      0.67         0.60         0.60

Table 6: HyRPNI Algorithm Applied to Prediction of Site 5: CI-6K2

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             00:17.6     0.93      0.54         0.58         0.52
4 1     10            00:30.5     0.93      0.52         0.58         0.50
4 1     25            01:29.4     0.93      0.56         0.60         0.54
14 1    5             00:34.1     0.94      0.70         0.53         0.56
14 1    10            00:47.6     0.93      0.56         0.62         0.54
14 1    25            02:28.8     0.96      0.84         0.60         0.69
10 10   5             00:51.9     0.92      0.62         0.51         0.50
10 10   10            01:14.3     0.89      0.46         0.48         0.40
10 10   25            04:06.8     0.96      0.78         0.60         0.65

Table 7: HyRPNI Algorithm Applied to Prediction of Site 6: 6K2-VPg

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             00:11.7     0.95      0.68         0.65         0.64
4 1     10            00:33.6     0.94      0.69         0.67         0.64
4 1     25            01:15.4     0.95      0.69         0.71         0.67
14 1    5             00:33.4     0.94      0.60         0.58         0.55
14 1    10            00:46.6     0.95      0.70         0.48         0.55
14 1    25            02:02.7     0.94      0.71         0.48         0.55
10 10   5             00:43.8     0.98      1.00         0.72         0.83
10 10   10            01:06.6     0.97      0.87         0.79         0.81
10 10   25            03:20.1     0.96      0.74         0.79         0.74


Table 8: HyRPNI Algorithm Applied to Prediction of Site 7: VPg-Nia

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             00:10.8     0.98      0.85         0.83         0.83
4 1     10            00:19.9     0.98      0.86         0.90         0.87
4 1     25            01:01.6     0.97      0.83         0.83         0.81
14 1    5             00:31.0     0.97      0.84         0.69         0.75
14 1    10            00:46.3     0.95      0.75         0.66         0.66
14 1    25            02:49.7     0.96      0.92         0.55         0.69
10 10   5             00:45.7     0.98      0.98         0.77         0.85
10 10   10            01:07.1     0.98      0.98         0.77         0.85
10 10   25            02:38.6     0.98      0.98         0.71         0.82

Table 9: HyRPNI Algorithm Applied to Prediction of Site 8: Nia-Nib

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             00:10.0     0.96      0.76         0.75         0.73
4 1     10            00:28.6     0.96      0.76         0.68         0.70
4 1     25            01:07.4     0.96      0.79         0.75         0.75
14 1    5             00:33.2     0.95      0.73         0.68         0.67
14 1    10            00:45.5     0.95      0.81         0.60         0.65
14 1    25            02:12.3     0.96      0.91         0.59         0.72
10 10   5             00:41.0     0.95      0.88         0.57         0.66
10 10   10            00:55.2     0.96      0.88         0.53         0.66
10 10   25            01:36.1     0.98      0.97         0.73         0.83

are high, so this measure should not be analysed in isolation. Table 4, Table 5, Table 6 and Table 7 show these values in detail.

Sites 7, 8 and 9 show correlation coefficient values between 0.6 and 0.8, so we can state that these models grasp the patterns in a better way. Tables 8 and 10 show that for sites 7 and 9 the best choice is 10 merges in the first phase and window 4 1. For site 8, window 10 10 and 25 merges in the first phase achieve the highest results (see Table 9).

Sensitivity vs. specificity graphics for HyRPNI allow us to visualize differences in performance for the different sites. In Figure 11 we can see how for site 9 both measures are near 1 for the three window sizes, although window 4 1 seems to be the best. Other cases, such as site 4, show a different behavior, with dots spread over the graphic, indicating poorer learning. In spite of the

Table 10: HyRPNI Algorithm Applied to Prediction of Site 9: Nib-CP

Window  Size 1st ph.  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     5             00:14.5     0.91      0.90         0.91         0.82
4 1     10            00:42.2     0.94      0.94         0.92         0.87
4 1     25            01:38.4     0.93      0.93         0.92         0.86
14 1    5             00:35.9     0.91      0.92         0.88         0.82
14 1    10            00:53.6     0.83      0.77         0.88         0.67
14 1    25            04:15.8     0.86      0.82         0.88         0.73
10 10   5             00:45.6     0.91      0.95         0.85         0.82
10 10   10            01:03.0     0.91      0.96         0.85         0.83
10 10   25            03:32.1     0.92      0.95         0.86         0.84


Table 11: Best HyRPNI Parameter Value and Window Size for the Nine Cleavage Sites

Site  Name      Parameter  Window Size
1     P1-HcPRO  10         4 1
2     HcPRO-P3  10         4 1
3     P3-6K1    5          14 1
4     6K1-CI    10         10 10
5     CI-6K2    25         10 10
6     6K2-VPg   5          10 10
7     VPg-Nia   5          10 10
8     Nia-Nib   25         4 1
9     Nib-CP    25         4 1

Table 12: OIL Algorithm Applied to Prediction of Site 1: P1-HCPro

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:02.6     0.98      0.87         0.81         0.83
14 1    00:37.0     0.95      0.70         0.70         0.67
10 10   01:09.1     0.95      0.78         0.63         0.66

differences, in each graphic there is at least one window with a behaviour near the upper-right corner, showing a good degree of learning. Table 11 summarizes the best window size and parameter value to predict each of the cleavage sites. Notice that the first and last sites are better learned using the short window (4 1), but the middle sites are better learned with the longer, symmetric window (10 10); window 14 1 was the best option only for the third site. The value of the parameter is quite variable: each value is the best for three sites, and there is no visible correlation between parameter value and window size.

We ran the OIL algorithm on the same data looking for an improvement in our performance measures. Three values were tried for the voting system: 5, 11 and 15 hypotheses. The first results reported are for the run with 11 hypotheses voting to tag the test samples: for sites 1, 2 and 6, Tables 12, 13 and 17 show better values in all the measures for window size 4 1. However, other sites, such as 3, 4, 5, 8 and 9, behave better with the 10 10 window, as we can see in Tables 14, 15, 16, 19 and 20. Finally, Table 18 shows that site 7 works better with window 14 1.

Although there are similarities, notice that these results are different


(a) Site 1 P1-HcPRO (b) Site 2 HCPro-P3 (c) Site 3 P3-6K1 (d) Site 4 6K1-CI (e) Site 5 CI-6K2 (f) Site 6 6K2-VPg (g) Site 7 VPg-Nia (h) Site 8 Nia-Nib (i) Site 9 Nib-CP

Figure 11: HyRPNI graphics of sensitivity vs. specificity for the nine cleavage sites.


Table 13: OIL Algorithm Applied to Prediction of Site 2: HCPro-P3

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:01.3     0.98      0.92         0.90         0.90
14 1    00:28.6     0.96      0.81         0.69         0.73
10 10   00:38.0     0.97      0.88         0.76         0.80

Table 14: OIL Algorithm Applied to Prediction of Site 3: P3-6K1

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:05.8     0.96      0.81         0.67         0.72
14 1    00:38.76    0.95      0.68         0.60         0.61
10 10   01:04.0     0.96      0.77         0.74         0.73

Table 15: OIL Algorithm Applied to Prediction of Site 4: 6K1-CI

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:05.9     0.96      0.70         0.74         0.70
14 1    00:29.2     0.95      0.65         0.63         0.61
10 10   00:35.8     0.97      0.81         0.76         0.77

Table 16: OIL Algorithm Applied to Prediction of Site 5: CI-6K2

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:06.5     0.96      0.73         0.76         0.73
14 1    00:23.5     0.96      0.77         0.67         0.69
10 10   00:57.6     0.98      0.90         0.76         0.81

Table 17: OIL Algorithm Applied to Prediction of Site 6: 6K2-VPg

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:06.9     0.95      0.72         0.74         0.71
14 1    00:31.5     0.95      0.66         0.67         0.64
10 10   00:39.9     0.96      0.74         0.65         0.67

Table 18: OIL Algorithm Applied to Prediction of Site 7: VPg-Nia

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:03.5     0.98      0.83         0.86         0.82
14 1    00:17.3     0.98      0.95         0.79         0.85
10 10   00:27.7     0.98      0.87         0.83         0.83

Table 19: OIL Algorithm Applied to Prediction of Site 8: NIa-Nib

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:07.7     0.96      0.77         0.75         0.74
14 1    00:26.9     0.95      0.64         0.68         0.64
10 10   00:43.0     0.96      0.78         0.75         0.74


Table 20: OIL Algorithm Applied to Prediction of Site 9: NIb-CP

Window  Exec. time  Accuracy  Sensitivity  Specificity  Corr. coef.
4 1     00:19.6     0.93      0.92         0.92         0.85
14 1    04:26.4     0.89      0.83         0.93         0.77
10 10   02:57.0     0.93      0.92         0.93         0.87

Table 21: Best Algorithm to Predict Each of the Nine Cleavage Sites

Site  Name      Algorithm
1     P1-HcPRO  HyRPNI
2     HcPRO-P3  OIL
3     P3-6K1    Both
4     6K1-CI    OIL
5     CI-6K2    OIL
6     6K2-VPg   HyRPNI
7     VPg-Nia   OIL
8     Nia-Nib   HyRPNI
9     Nib-CP    OIL

from those obtained with the HyRPNI algorithm. Considering the OIL algorithm, learning improves at some sites, as we can see in Figure 12, where both algorithms, HyRPNI and OIL, are compared using accuracy as the performance measure. In general, the relationship between the algorithms' performance holds even as the window size changes. However, we cannot say that one algorithm is the best, because at sites 1, 6 and 8 HyRPNI outperforms OIL, at sites 2, 4, 5, 7 and 9 OIL wins, and at site 3 there is a tie. Table 21 summarizes the best algorithm to predict each cleavage site. In Table 22 we consider the best window size for the OIL algorithm; window 10 10 seems to be the best for six of the nine sites.

The improvements achieved with the OIL algorithm make us wonder whether a higher parameter value would improve performance even more. To answer this question, we ran a new experiment using a team of 15 hypotheses. We compare its results with the previous ones in Figure 13. We can see that increasing the size of the voting set increases recognition rates significantly for sites 4, 5, 7 and 9, and leaves the rates similar for sites 1 and 2. In sites 3, 6 and 8 the rates improve or decrease depending on the window size. With respect to Table 21 this experiment does not change the conclusion: OIL 15 outperforms OIL 11 for the sites where OIL won, but it is not able to go beyond HyRPNI in the sites where HyRPNI wins.



Figure 12: Comparison of HyRPNI and OIL Algorithms Using Accuracy as Performance Measure. Panels: (a) Site 1 P1-HCPro, (b) Site 2 HCPro-P3, (c) Site 3 P3-6K1, (d) Site 4 6K1-CI, (e) Site 5 CI-6K2, (f) Site 6 6K2-VPg, (g) Site 7 VPg-NIa, (h) Site 8 NIa-NIb, (i) Site 9 NIb-CP.



Table 22: Best Window Size to Predict Each of the Nine Cleavage Sites

Site  Name      Window Size
1     P1-HCPro  4 1
2     HCPro-P3  4 1
3     P3-6K1    All
4     6K1-CI    10 10
5     CI-6K2    10 10
6     6K2-VPg   10 10
7     VPg-NIa   10 10
8     NIa-NIb   10 10
9     NIb-CP    10 10

The best window size depends on the case: sites 2 and 3 work better with window 4 1, while sites 4, 5 and 9 work better with window 10 10, using the OIL algorithm and 15 hypotheses voting to tag the test samples.

Table 23 presents detailed performance measures of the best model for each one of the nine cleavage sites.

From this experiment we see that the OIL algorithm seems to behave better than HyRPNI for predicting cleavage sites. However, we have no evidence from other methods applied to these data for the same problem, so our results are not enough to validate the usefulness of grammatical inference with respect to other techniques. To be able to compare our method with others, we must train with data previously used by other authors to predict cleavage sites. Such data come from signal peptides; the next section presents our experiment and the results of the comparison with SignalP 3.0.

6 Experimentation on Signal Peptides

The purpose of this experimentation is to evaluate the performance of the algorithms HyRPNI and OIL on signal peptide data for five species: Ecoli, Euk, Gram-negative bacteria, Gram-positive bacteria and Human.



Figure 13: Results for OIL Accuracy Including Three Parameter Values: 5, 11 and 15 for the Nine Cleavage Sites. Panels: (a) Site 1 P1-HCPro, Site 2 HCPro-P3; (b) Site 3 P3-6K1, Site 4 6K1-CI; (c) Site 5 CI-6K2, Site 6 6K2-VPg; (d) Site 7 VPg-NIa, Site 8 NIa-NIb; (e) Site 9 NIb-CP.



Table 23: Best Results to Predict the Nine Cleavage Sites for the Polyprotein of Potyviruses Using Grammatical Inference Algorithms HyRPNI and OIL

               4 1                 14 1                10 10
Cleavage site  Sens. Spec. Acc.    Sens. Spec. Acc.    Sens. Spec. Acc.
P1-HCPro       0.81  0.85  0.98    0.63  0.84  0.97    0.58  0.78  0.96
HCPro-P3       0.88  0.97  0.99    0.74  0.86  0.97    0.67  0.88  0.97
P3-6K1         0.65  0.80  0.96    0.61  0.74  0.96    0.65  0.76  0.96
6K1-CI         0.74  0.76  0.97    0.72  0.66  0.95    0.79  0.87  0.98
CI-6K2         0.70  0.65  0.95    0.72  0.74  0.96    0.79  0.92  0.98
6K2-VPg        0.74  0.73  0.96    0.63  0.69  0.95    0.60  0.74  0.96
VPg-NIa        0.81  0.81  0.97    0.81  0.92  0.98    0.72  0.91  0.98
NIa-NIb        0.73  0.77  0.96    0.71  0.73  0.96    0.76  0.77  0.97
NIb-CP         0.94  0.91  0.93    0.96  0.87  0.92    0.95  0.95  0.95

6.1 Experimental Design

Samples were taken from the CBS (Center for Biological Sequence Analysis) repository http://www.cbs.dtu.dk/ftp/signalp/ver1/. For each species there are three kinds of samples available: cytoplasmic protein sequences (CYT is the extension of the file), nuclear protein sequences (NUC extension) and signal peptide sequences (SIG extension). SIG files contain positive samples while the others contain negative ones.

The file structure is shown in Figures 14 and 15. Each sample is represented by three lines: the first indicates the length of the sequence, the Swiss-Prot identifier, the length of the signal peptide and its description. The second line contains the sequence itself, codified with the one-symbol representation of amino acids. The third line associates a tag to each symbol of the sequence; the tag may be M, S or C. M means that the corresponding amino acid belongs to the mature protein, S means the amino acid belongs to the signal peptide, and C indicates that the cleavage site is located just before this amino acid, so this amino acid is the first one of the mature protein. Figure 14 shows an entry corresponding to a sample with a cleavage site (from this sequence we may obtain one positive sample and several negative ones), and Figure 15 shows an entry corresponding to a sample without a cleavage site (from this we may obtain several negative samples).

From the sequences, we obtain positive and negative samples using three window sizes: 4 1, 10 10 and 14 1, where 4 1 means to build the sample with 4 amino acids before the cleavage site and one after it; looking at the tags, such a sequence would be SSSSC. Likewise, samples for the 10 10 window contain ten amino acids before the cleavage site and ten after it; considering tags, they would be SSSSSSSSSSCMMMMMMMMM.



Figure 14: Format of .SIG Files

Figure 15: Format of .CYT Files

We work in an analogous way for window 14 1. From these sequences, considering other locations of the cleavage site with respect to the window, we may obtain negative samples. Besides, from the CYT and NUC files we may obtain more negative samples. To present these samples to our programs implementing the grammatical inference algorithms, it is necessary to codify the amino acid alphabet with numbers instead of letters; Table 24 shows the codification applied. In Figure 16 we can see our samples in the .sample format. The file has a header line with the number of samples in the file and the size of the alphabet; each of the remaining lines has three elements: the tag of the sample (1 means it is positive and 0 means it is negative), the size of the sample and the sample itself.
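To make the window extraction and codification concrete, the following sketch shows one way they could be implemented. It is a minimal illustration under our assumptions, not the original clasificar.py; the helper names and the toy entry are ours, and the numeric codes follow Table 24.

# Minimal sketch, not the original clasificar.py: extract the window
# around a cleavage site and emit one line of a .sample file.
CODE = {"A": 0, "C": 1, "D": 2, "E": 3, "F": 4, "G": 5, "H": 6, "I": 7,
        "K": 8, "L": 9, "M": 10, "N": 11, "P": 12, "Q": 13, "R": 14,
        "S": 15, "T": 16, "V": 17, "W": 18, "Y": 20}   # codes from Table 24

def positive_window(seq, tags, left, right):
    """Window with `left` residues before the cleavage site and `right`
    residues starting at it; None if no site or the window does not fit."""
    cut = tags.find("C")               # first residue of the mature protein
    if cut < left or cut + right > len(seq):
        return None
    return seq[cut - left:cut + right]

def sample_line(window, label):
    """One line of a .sample file: tag (1/0), sample size, coded symbols."""
    return f"{label} {len(window)} " + " ".join(str(CODE[a]) for a in window)

seq  = "MKAKLLVLLCTFTATYADTI"          # toy sequence, 17-residue signal peptide
tags = "S" * 17 + "CMM"                # C tags the first mature residue
w = positive_window(seq, tags, 4, 1)   # window 4 1, i.e. tag pattern SSSSC
print(sample_line(w, 1))               # prints: 1 5 0 16 20 0 2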

We performed the encoding with the script clasificar.py using the following command:

cd /home/carlos/destino2010/SCRIPTS
python clasificar.py file.red filename.sample option_window

Where file.red is a file of samples encoded in the red format (the format used in the CBS database), together with the directory where these samples are located; filename.sample is the path and name of the file of encoded samples (in the sample format); and option_window is the window size to be used.

Table 24: Numerical Codification of Aminoacids

Letter Number Letter Number Letter Number Letter Number

A 0 G 5 M 10 S 15

C 1 H 6 N 11 T 16

D 2 I 7 P 12 V 17

E 3 K 8 Q 13 W 18

F 4 L 9 R 14 Y 20



Figure 16: Format settings used in the experimentation (extension .sample)

The window size options are: 4 1 (represented by option 5), 14 1 (represented by option 15) and 10 10 (represented by option 20). This script encodes the positive and negative samples, i.e. the SIG, NUC and CYT files. It is stored in the directory:

/home/carlos/destino2010/SCRIPTS

An example of the execution of this script is presented below:

cd /home/carlos/destino2010/SCRIPTS
python clasificar.py ../DATA_SIGNALP_2.0/Muestras_originales/ECOLISIG.red \
    ../DATA_SIGNALP_2.0/MUESTRAS/4_1/ECOLISIG_4_1.sample 5

In this example, we encoded the samples file ECOLISIG.red, the codified sequences were stored in the file ECOLISIG_4_1.sample, and the window option is five, corresponding to the 4 1 kind of window.

We execute clasificar.py through the script clasificar.bash, whose function is to call clasificar.py with all the .red sample files obtained and to place the positive and negative samples in the respective window folder (4 1, 10 10 or 14 1). We used this script in the following way:

cd /home/carlos/destino2010/SCRIPTS
bash clasificar.bash window option_window

Where window is the window type that was used (4 1, 10 10 or 14 1) and option_window is the corresponding window option (5, 20 or 15). For example:

cd /home/carlos/destino2010/SCRIPTS
bash clasificar.bash 4_1 5

The encoded sample files generated by the script clasificar.py were stored in different directories whose names match the selected windows. For example, the directory named 4 1 stores all files created with the window 4 1 for Ecoli, Euk, Gram-negative bacteria, Gram-positive bacteria and Human, and likewise for the directories 10 10 and 14 1. The positive sample files were named nameSpecieSIG_window.sample, and the negative sample files were named nameSpecieCYT_window.sample and nameSpecieNUC.sample. The positive samples were stored in the folder MUESTRAS, which is in the directory:

/home/carlos/destino2010/DATA_SIGNALP_2.0/MUESTRAS/4_1

As mentioned above, the folder 4 1 stores the positive samples with window 4 1; the same holds for the folders 10 10 and 14 1. The negative samples were stored in the folder MUESTRAS_NEGATIVAS, which is in the directory:

/home/carlos/destino2010/DATA_SIGNALP_2.0/MUESTRAS_NEGATIVAS/4_1

Again, the folder 4 1 stores the negative samples with window 4 1, and the same scheme applies to the folders 10 10 and 14 1.

The number of samples used was the same for every window: for Ecoli we have 105 positive samples and 119 negative samples; for Euk, 1011 positive and 820 negative; for Gram-negative bacteria, 266 positive and 186 negative; for Gram-positive bacteria, 141 positive and 64 negative; and for Human, 416 positive and 251 negative. Notice that the negative samples are the combination of the negative samples of the NUC and CYT files of Ecoli, Euk, Gram-negative bacteria, Gram-positive bacteria and Human. To combine negative samples, we used the script combin.py:

cd /home/carlos/destino2010/SCRIPTS
python combin.py fileNUC.sample fileCYT.sample

Where fileNUC.sample is the filename of the NUC samples file (encoded in the .sample format) with its path, and fileCYT.sample is the filename of the CYT samples file (encoded in the .sample format) with its path. The function of this script is to combine the negative samples of both files into one. An example of how to use this script is shown below:

cd /home/carlos/destino2010/SCRIPTS
python combin.py ../DATA_SIGNALP_2.0/MUESTRAS_NEGATIVAS/4_1/EUKNUC_4_1.sample \
    ../DATA_SIGNALP_2.0/MUESTRAS_NEGATIVAS/4_1/EUKCYT_4_1.sample



Here, we combined the samples of the files EUKNUC_4_1.sample and EUKCYT_4_1.sample, and the resulting filename was EUKCYT_4_1_EUKNUC_4_1.sample. We executed combin.py through the script preparacion_muestra.bash, whose function is to combine the files (NUC and CYT) and to prepare the file with the script preparacion.py. The results were stored in the respective directories: the combined NUC and CYT file was stored in the negative samples directory, and the file prepared for the experimentation was stored in the directory Experimentacion, in the respective window folder (4 1, 10 10 or 14 1). We executed preparacion_muestra.bash in this form:

cd /home/carlos/destino2010/SCRIPTS
bash preparacion_muestra.bash window

The parameter window is the type of window you want to use (4 1, 10 10 or 14 1), for example:

cd /home/carlos/destino2010/SCRIPTS
bash preparacion_muestra.bash 4_1

Here, we prepared the samples for the experiment with window 4 1.

With these encoded samples, we generated training and test files for each species with the script preparacion.py. The parameters of the program are two files encoded in the .sample format, one containing the positive samples and the other containing the negative samples. This script creates the experimentation files using cross validation with four blocks. We used this script in the following way:

cd /home/carlos/destino2010/SCRIPTS
preparacion.py positivesSamples negativesSamples directoryPositives \
    directoryNegatives directory_training_and_test

The parameter positivesSamples is the name of the positive samples file, negativesSamples is the name of the negative samples file, directoryPositives is the path where the positive samples are, directoryNegatives is the path where the negative samples are, and directory_training_and_test is the directory where the training and test samples are stored. For example:

cd /home/carlos/destino2010/SCRIPTS
preparacion.py EUKSIG_4_1.sample EUKCYT_4_1_EUKNUC_4_1.sample \
    ../DATA_SIGNALP_2.0/MUESTRAS \
    ../DATA_SIGNALP_2.0/MUESTRAS_NEGATIVAS \
    ../DATA_SIGNALP_2.0/Experimentacion/4_1

We prepared the samples of the species EUK, where EUKSIG_4_1.sample is the filename of the positive samples, EUKCYT_4_1_EUKNUC_4_1.sample is the filename of the negative samples, ../DATA_SIGNALP_2.0/MUESTRAS is the directory where the positive samples were stored, ../DATA_SIGNALP_2.0/MUESTRAS_NEGATIVAS is the directory where the negative samples were stored, and ../DATA_SIGNALP_2.0/Experimentacion/4_1 is the directory where the training and test samples were stored. With this script, we generated 4 training files and 4 test files for each species, named namePotyvirus_id_training.sample and namePotyvirus_id_test.sample respectively, where id identifies the cross validation block for the file.

Table 25: Number of samples in the training and test files.

Training Test

Species No Positives No Negatives No Positives No Negatives

Ecoli 79 90 26 29

Euk 759 615 252 205

Gram-negative 200 140 66 46

Gram-positive 106 48 16 35

Human 312 189 104 62

Table 25 shows the number of samples in the training and test files of the species Ecoli, Euk, Gram-negative, Gram-positive and Human obtained with the preparacion.py script. Notice that approximately 80% of the samples are used in training and 20% in test.
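For reference, a minimal sketch of such a four-block split is given below. It is a hypothetical reimplementation of what preparacion.py is described to do, not the original script.

import random

def four_blocks(samples, seed=0):
    """Shuffle the samples and cut them into 4 blocks of near-equal size."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::4] for i in range(4)]

def folds(positives, negatives):
    """Yield (training, test) pairs: each fold tests on one block (about
    25% of the data) and trains on the remaining three (about 75%)."""
    pos, neg = four_blocks(positives), four_blocks(negatives)
    for i in range(4):
        test = pos[i] + neg[i]
        train = [s for j in range(4) if j != i for s in pos[j] + neg[j]]
        yield train, test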

Once the data were ready, we learned automata with the programs oilnsym and rpni2Train, corresponding respectively to the OIL and HyRPNI algorithms. Oilnsym has three parameters: the training file, the output automaton file name and a control number that selects the method used to assign identifiers to the states of the maximal automaton: 0 for random order, 1 for ascending order or 2 for descending order. The value of the control number will be zero in all our experiments to cause non-determinism, which is a key feature in this experimentation. Thus, we used this program:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes/OIL-nSym
./oilnsym trainingFilename.sample automatonName.auto NumberControl

Where trainingFilename.sample is the path and file name of the training samples, automatonName.auto is the path and file name of the resulting automaton and NumberControl is the control number used to build the maximal automaton. This program was stored in the directory:

/home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes/OIL-nSym

An example of the execution of the program oilnsym is as follows:



cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes/OIL-nSym
./oilnsym ../../4_1/ECOLI_1_TRAINING.SAMPLE \
    ../../4_1/oilnsym_ECOLI_1_4_1.auto 0

The parameter ECOLI_1_TRAINING.SAMPLE is the name of the training samples of ECOLI and oilnsym_ECOLI_1_4_1.auto is the name of the resulting automaton; as we can see, it adds the name of the program that trained it. The number 0 selects random order. Notice that the program should always be used in random order to cause non-determinism in the execution.

We executed oilnsym through the script Oilnsym.bash. This script creates an automaton for each training file of the species Ecoli, Euk, Gram-negative, Gram-positive and Human; therefore it generates 4 automata stored in the same directory as the training samples. Besides, this script executes the script oilTest.py, storing the results in a .res file in the folder resultados, which is in the directory Experimentacion. For example:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts
bash Oilnsym.bash window

The parameter window is the window type (4 1, 10 10 or 14 1). This script was stored in the directory:

/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts

A real example is:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts
bash Oilnsym.bash 4_1

Here, we trained the samples of window 4 1.

For the execution of the program rpni2Train we use the following command:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes/Hy-RPNI-nSym
python rpni2Train.py trainingFilename.sample automatonName.auto numIterationHyRPNI

The parameter trainingFilename.sample is the filename of the training samples together with the directory where these samples are located, automatonName.auto is the name of the resulting automaton together with the directory where it is stored, and numIterationHyRPNI is the length of the first phase. This program was stored in the directory:

/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes/Hy-RPNI-nSym



To show how to use the program rpni2Train we present the following example:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes/Hy-RPNI-nSym
python rpni2Train.py ../../4_1/ECOLI_1_TRAINING.SAMPLE \
    ../../4_1/hyrpni_ECOLI_1_4_1.auto 5

ECOLI_1_TRAINING.SAMPLE is the file name of the training samples of the species ECOLI and hyrpni_ECOLI_1_4_1.auto is the name of the resulting automaton; as we can see, it adds the name of the program that trained it, and the length of the first phase was 5.

We executed this program through the script hyrpni.bash, which creates an automaton for each cross validation block of each species. The execution of this script for each species generated 4 automata stored in the same directory as the training samples; this script also executes the script hyrpniTest_nSym.py, storing the results in a .res file in the folder resultados, which is in the directory Experimentacion. For example:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts
bash hyrpni.bash window numberIterationHyRPNI

The parameter window is the window type (4 1, 10 10 or 14 1) and numberIterationHyRPNI is the length of the first phase. An example of the execution of hyrpni.bash is shown below:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts
bash hyrpni.bash 4_1 5

Here, we trained on the samples of window 4 1; the first 5 steps use a criterion based on scores and the rest use grammatical inference. This script was stored in the directory:

/home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts

As mentioned, oilnsym and rpni2Train.py have companion scripts (oilTest.py and hyrpniTest_nSym.py respectively) to evaluate the generated automaton with test samples. Thus, we used these scripts:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes
python oilTest.py automatonName.auto testFilename
python hyrpniTest_nSym.py automatonName.auto testFilename

The two scripts have the same input parameters, but oilTest.py only evaluates the results of oilnsym and hyrpniTest_nSym.py only evaluates the results of rpni2Train.py. The parameter automatonName.auto is the name of the automaton to be evaluated with the test samples together with the directory where it is located, and testFilename is the filename of the test samples together with the directory where they are located. Below we show an example for each script:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes
python oilTest.py ../4_1/oilnsym_ECOLI_1_4_1.auto ../4_1/ECOLI_1_test.sample
python hyrpniTest_nSym.py ../4_1/hyrpni_ECOLI_1_4_1.auto ../4_1/ECOLI_1_test.sample

Both scripts evaluated automata of the species ECOLI with window 4 1 and four-block cross validation (block 1). In oilTest.py the names of the resulting files are built with the name of the program that created them, in this case oilnsym, and in hyrpniTest_nSym.py with the hyrpni name. These results files were stored in the directory /DATA_SIGNALP_2.0/Experimentacion/resultados.

6.2 Performance measures

Once the automata are learned and evaluated on test samples with the programs hyrpniTest_nSym.py and oilTest.py, we calculate the performance measures (see Section 5.1): sensitivity, specificity and accuracy, using the script genResultados.py. With these results we generate graphics and tables. This script has as parameters a results file (with extension .res) and the path and name of the output file. We executed this script with the following line:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts
python genResultados.py file.res resultsName

Where file.res is the path and file name of the results obtained with the hyrpniTest_nSym.py or oilTest.py scripts, and resultsName is the path and name of the output file. An example of this script is:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts
python genResultados.py ../resultados/hyrpni_14_1.res \
    ../resultados/tabla_hyrpni_14_1.res.tex

Here we calculated the performance measures for all the species with window 14 1 and stored them in the file tabla_hyrpni_14_1.res.tex. We executed genResultados.py through the script genResultados.bash, which passes the results for the given window to genResultados.py and stores them in the folder resultados. We used this script in the following way:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/scripts
bash genResultados.bash
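Since Section 5.1 is not reproduced in this part of the report, the following sketch states the definitions we assume for the three measures, with the correlation coefficient of Tables 13 to 20 taken to be Matthews' coefficient (an assumption on our part):

from math import sqrt

def measures(tp, tn, fp, fn):
    """Standard confusion-matrix measures; corr is assumed to be the
    Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)                  # recognized positives
    specificity = tn / (tn + fp)                  # recognized negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # overall recognition rate
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    corr = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, corr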



6.3 Results

The results were generated from the sets of hypotheses for Ecoli, Euk, Gram-negative bacteria, Gram-positive bacteria and Human. Each species has four training and test files because of the cross validation. Each training file feeds both oilnsym and rpni2Train, generating automata as hypotheses; the name of the automaton file contains the name of the algorithm used to generate it. For example, oilnsym_ECOLI_1_4_1.auto and hyrpni_ECOLI_1_4_1.auto are names of automata learned with the two algorithms; the first number identifies the cross-validation block and the last two give the window size used in the training samples. After the experiment, eight automata were obtained (4 oilnsym automata and 4 rpni2Train automata) for each window (4 1, 10 10 and 14 1). They were tested with their corresponding test samples using the program genResultados.py.

Results for Ecoli are reported in Tables 26, 27 and 28, corresponding to window sizes 4 1, 10 10 and 14 1 respectively. They report results for each one of the four blocks of cross validation and the average. Notice that OIL performance is clearly better than HyRPNI's in all three measures, especially in window 4 1, which yields the best results for this species.

Table 26: Performance Values for Window 4 1 of the Ecoli Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.90 1.0 1.0 0.93 0.95 0.97

2 0.87 1.0 1.0 1.0 0.93 1.0

3 0.81 1.0 1.0 0.92 0.89 0.96

4 0.85 0.77 1.0 0.96 0.92 0.85

AVERAGE: 0.86 0.94 1.0 0.95 0.92 0.95

Table 27: Performance Values for Window 10 10 of the Ecoli Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.53 0.92 0.82 0.79 0.58 0.86

2 0.44 0.5 0.62 0.85 0.45 0.53

3 0.48 0.89 0.77 1.0 0.49 0.95

4 0.51 0.89 0.82 0.86 0.55 0.88

AVERAGE: 0.49 0.80 0.76 0.88 0.52 0.81

Considering sensitivity against specificity (see Figure 17), the best window is 4 1 for both algorithms; with respect to accuracy (see Figure 17), OIL has better scores than HyRPNI in all the window sizes tested. In short, the best performance was obtained with OIL and window size 4 1.



Table 28: Performance Values for Window 14 1 of the Ecoli Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.71 0.86 0.86 0.86 0.76 0.86

2 0.52 0.92 0.92 0.88 0.56 0.91

3 0.63 1.0 0.85 0.85 0.69 0.93

4 0.62 0.87 0.82 0.75 0.68 0.83

AVERAGE: 0.62 0.91 0.86 0.84 0.67 0.88

Figure 17: Results with HyRPNI and OIL Algorithms in Ecoli Data. Panels: (a) Sensitivity vs. Specificity; (b) Accuracy.

With the eukaryote data we observed the same behavior as for Ecoli: OIL had the best performance with window 4 1, as we can see in Figure 18. The values of sensitivity, specificity and accuracy of OIL were close to one in window 4 1, but for windows 10 10 and 14 1 they were below 0.9. HyRPNI got lower rates of specificity, sensitivity and accuracy for windows 10 10 and 14 1, where these values were below 0.7 (see Tables 29, 30 and 31).

Table 29: Performance Values for Window 4 1 of the Euk Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.88 0.98 0.98 0.99 0.92 0.98

2 0.92 0.98 0.98 0.99 0.94 0.98

3 0.86 0.98 0.96 0.98 0.89 0.98

4 0.89 0.98 0.97 0.98 0.92 0.98

AVERAGE: 0.89 0.98 0.97 0.98 0.92 0.98

Learning on the Gram+ data produced the results reported in Tables 32, 33 and 34. They show again that OIL with window 4 1 is the best option, as we can see in Figure 19.



Table 30: Performance Values for Window 10 10 of the Euk Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.58 0.58 0.51 0.80 0.52 0.57

2 0.59 0.90 0.58 0.97 0.55 0.92

3 0.55 0.94 0.52 0.96 0.51 0.95

4 0.59 0.60 0.52 0.84 0.54 0.61

AVERAGE: 0.58 0.76 0.53 0.89 0.53 0.76

Table 31: Performance Values for Window 14 1 of the Euk Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.59 0.75 0.70 0.81 0.57 0.74

2 0.63 0.75 0.69 0.91 0.61 0.78

3 0.59 0.74 0.69 0.85 0.56 0.75

4 0.62 0.77 0.67 0.93 0.59 0.81

AVERAGE: 0.61 0.75 0.69 0.88 0.58 0.77


Table 32: Performance Values for Window 4 1 of the Gram-Positive Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.92 0.97 0.97 1.0 0.93 0.98

2 0.89 1.0 1.0 1.0 0.92 1.0

3 0.92 0.95 1.0 1.0 0.94 0.96

4 1.0 1.0 0.95 0.97 0.96 0.98

AVERAGE: 0.93 0.98 0.98 0.99 0.94 0.98

Table 33: Performance Values for Window 10 10 of the Gram-Positive Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.77 0.74 0.81 0.78 0.70 0.67

2 0.72 0.95 0.89 1.0 0.69 0.96

3 0.69 0.89 0.86 0.86 0.63 0.83

4 0.72 0.89 0.76 0.95 0.63 0.89

AVERAGE: 0.73 0.87 0.83 0.89 0.66 0.84

In the results for the Gram- data we observed that HyRPNI was very close to OIL (see Figure 20), although OIL still obtains the best performance with window 4 1, as we can see in Tables 35, 36 and 37. For windows 10 10 and 14 1, OIL is clearly superior to HyRPNI.

Finally, for the Human data we obtained the following results: the best performance is achieved by OIL with window 4 1, as we can see in Figure 21.



Figure 18: Results with HyRPNI and OIL Algorithms in Eukaryote Data. Panels: (a) Sensitivity vs. Specificity; (b) Accuracy.

Figure 19: Results with HyRPNI and OIL Algorithms in Gram+ Data. Panels: (a) Sensitivity vs. Specificity; (b) Accuracy.

Figure 20: Results with HyRPNI and OIL Algorithms in Gram- Data. Panels: (a) Sensitivity vs. Specificity; (b) Accuracy.



Table 34: Performance Values for Window 14 1 of the Gram-Positive Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.79 0.92 0.89 0.95 0.76 0.91

2 0.82 0.85 0.94 0.94 0.82 0.84

3 0.77 0.89 0.92 0.89 0.75 0.85

4 0.74 0.94 1.0 0.86 0.76 0.87

AVERAGE: 0.78 0.90 0.94 0.91 0.77 0.88

Table 35: Performance Values for Window 4 1 of the Gram-Negative Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.94 1.0 0.96 0.97 0.94 0.98

2 0.96 0.97 1.0 0.98 0.97 0.97

3 0.93 0.97 0.97 0.97 0.94 0.96

4 0.92 0.93 0.99 0.96 0.94 0.93

AVERAGE: 0.94 0.97 0.98 0.97 0.95 0.96

Table 36: Performance Values for Window 10 10 of the Gram-Negative Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.60 1.0 0.69 0.94 0.55 0.97

2 0.62 0.75 0.74 0.91 0.58 0.77

3 0.65 0.95 0.73 0.95 0.61 0.95

4 0.59 0.72 0.65 0.87 0.53 0.72

AVERAGE: 0.62 0.86 0.70 0.92 0.57 0.85

Table 37: Performance Values for Window 14 1 of the Gram-Negative Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.72 0.92 0.71 0.87 0.66 0.88

2 0.67 0.85 0.69 0.94 0.62 0.87

3 0.74 0.81 0.68 0.88 0.67 0.80

4 0.62 0.82 0.78 0.91 0.59 0.83

AVERAGE: 0.69 0.85 0.72 0.90 0.64 0.85

Sensitivity, specificity and accuracy of OIL were close to one in window 4 1 (see Table 38); for window 10 10 the rates were below 0.9, except specificity, which was 0.99, and in window 14 1 the rates were also below 0.9. HyRPNI got the worst values of specificity, sensitivity and accuracy for windows 10 10 and 14 1, with values below 0.7 (see Tables 39 and 40).

In conclusion, the best performance is achieved with the OIL algorithm and window size 4 1 for the five species in the task of signal peptide cleavage site prediction.



Table 38: Performance Values for Window 4 1 of the Human Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.92 1.0 0.95 0.96 0.92 0.98

2 0.92 0.99 0.98 1.0 0.93 0.99

3 0.87 1.0 0.94 0.96 0.87 0.98

4 0.94 1.0 0.99 0.97 0.95 0.98

AVERAGE: 0.91 0.99 0.97 0.97 0.92 0.98

Table 39: Performance Values for Window 10 10 of the Human Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.58 0.96 0.48 1.0 0.46 0.98

2 0.75 0.97 0.76 1.0 0.69 0.98

3 0.65 0.64 0.49 0.96 0.51 0.64

4 0.66 0.99 0.73 1.0 0.61 0.99

AVERAGE: 0.66 0.89 0.62 0.99 0.57 0.89

Table 40: Performance Values for Window 14 1 of the Human Species

sensitivity specificity accuracy

Group HyRPNI OIL HyRPNI OIL HyRPNI OIL

1 0.67 0.77 0.69 0.84 0.59 0.75

2 0.64 0.81 0.83 0.95 0.60 0.83

3 0.62 0.79 0.69 0.79 0.54 0.74

4 0.61 0.96 0.69 0.97 0.54 0.96

AVERAGE: 0.64 0.83 0.73 0.89 0.57 0.82

In general, the sensitivity and specificity measures are balanced and performance is higher than 0.95 in all the cases. These results suggest that the OIL algorithm is a competitive method to solve this task. Our next step will be to compare OIL performance with other techniques running on the same data.

7 Comparison Between OIL Algorithm and SignalP 3.0

SignalP is a program to predict signal peptide cleavage sites which consists of two different predictors, based on artificial neural networks and hidden Markov models. SignalP has been compared with more than ten other programs for the same task [JDBB04] and it seems to be the best. Besides, the data used to train this tool are public, so it is possible to run a comparison between the OIL algorithm and SignalP 3.0; it allows us to establish how competitive grammatical inference is with respect to several other approaches to this prediction problem.



Figure 21: Results with HyRPNI and OIL Algorithms in Human Data. Panels: (a) Sensitivity vs. Specificity; (b) Accuracy.


7.1 Experimental Design

Experimentation samples were taken from the PrediSi (Prediction of SIgnal peptides) database http://www.predisi.de/download.html; they are in FASTA format. We keep the same window sizes used in SignalP training as they are reported in [?]: for Eukaryota samples we consider 20 positions to the left and 4 to the right of the cleavage site, and for Gram+ and Gram- we consider 11 positions to the left and 3 to the right of the cleavage site. Figure 22 shows the structure of the FASTA format: the first line contains the amino acid label and the position of the cleavage site; the second line contains the sequence itself, codified with the one-symbol representation of amino acids.

Figure 22: FASTA file encoding
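A small sketch of how such an entry can be read is shown below; the exact header layout is our assumption (identifier first, cleavage position as the last whitespace-separated field), so the snippet is illustrative only.

def read_fasta_entry(header_line, sequence_line):
    """Parse one PrediSi-style entry: the cleavage position is assumed to
    be the last whitespace-separated field of the header line."""
    cleavage_pos = int(header_line.split()[-1])
    return sequence_line.strip(), cleavage_pos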



To be processed by our algorithms, samples have to be codified with numbers representing the amino acids instead of letters. Table 41 shows the codification that we applied.

Table 41: Codification Table to Aminoacids

Letter Number Letter Number Letter Number Letter Number

A 0 G 5 M 10 S 15

C 1 H 6 N 11 T 16

D 2 I 7 P 12 V 17

E 3 K 8 Q 13 W 18

F 4 L 9 R 14 Y 20

X 19

Figure 23: Format specification used for experimentation

This codification was made with the program codificarExtraerVentana.py using the following command:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python codificarExtraerVentana.py FastaFile species lowRange upRange alphabet

Where FastaFile is a file in FASTA format with its path; species is the name of the data set to be processed: eukaryotes, Gram-negative or Gram-positive; lowRange and upRange define the window size, lowRange being the number of positions to the left of the cleavage site and upRange the number of positions to the right of it; finally, alphabet is the size of the alphabet in the samples. This program encodes and stores samples in a file: if the input is a positive samples file, it generates both positive and negative samples according to the position of the window on the sequence; if the input is a negative samples file, it generates only negative samples. The sample files are placed in the directory DATA_SIGNALP_2.0/experimentacion_2010_mayo/species/datosIntermedios, where species may be euk, gram- or gram+. The resulting sample files are saved in the same directory; it is important to know that this program generates 2 negative sample files whose names end in negative1 and negative2 respectively.

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python codificarExtraerVentana.py ../gram-/datosIntermedios/GRAMN.FASTA gram- 11 3 21

This is an example of the coding of samples: here we encode the positive samples of gram- with window 11 3, and the alphabet size is 21.
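Negative samples can be obtained by placing the same window at positions other than the real cleavage site. A minimal sketch of that idea, not the original codificarExtraerVentana.py, is:

def negative_windows(seq, cleavage_pos, left, right):
    """Every placement of the window whose boundary is not the true
    cleavage site yields a negative sample."""
    out = []
    for pos in range(left, len(seq) - right + 1):
        if pos != cleavage_pos:                  # skip the true site
            out.append(seq[pos - left:pos + right])
    return out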

From the encoded samples (negative and positive) we choose the number of samples to use in the experimentation. This selection was made with the program MuestrasNegativaPositiva.py using the following command:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python MuestrasNegativaPositiva.py speciesNegative1 \
    speciesNegative2 negaNumber species posiNumber

Where speciesNegative1 is the original negative samples file with its path, speciesNegative2 is the negative samples file obtained from the positive samples file with its path, negaNumber is the number of samples to be taken from the files negative1 and negative2 (half from one file and half from the other), species is the file name of the positive samples with its path, and posiNumber is the number of positive samples to be taken. Notice that this program generates a single negative samples file, because it takes 50% of the negative samples from negative samples file 1 and the other 50% from negative samples file 2 and combines them into one file. An example of the execution of this script is:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python MuestrasNegativaPositiva.py ../gram+/datosIntermedios/gram-_11_3_negative1.sample \
    ../gram+/datosIntermedios/gram-_11_3_negative2.sample 6000 \
    ../gram+/datosIntermedios/gram-_11_3.sample 556

The number of samples used for each species varies: for the eukaryotes we used 500 positive samples and 4500 negative samples, for Gram- we used 556 positive samples and 6000 negative samples, and for Gram+ we used 236 positive samples and 3000 negative samples. As mentioned above, this selection was done with the program MuestrasNegativaPositiva.py, and its execution was performed through the script clasificar_negativas_positivas.bash, which is used in the following way:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
bash clasificar_negativas_positivas.bash species lowRange upRange \
    posiNumber negaNumber



Where species is the name of the data set to be trained, lowRange is the number of positions to the left of the cleavage site, upRange is the number of positions to the right of the cleavage site, posiNumber is the number of positive samples to be used and negaNumber is the number of negative samples to be used. This is an example of the use of clasificar_negativas_positivas.bash; here, we classify the samples of the species gram-:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
bash clasificar_negativas_positivas.bash gram- 11 3 556 6000

With these encoded samples, the training and test files for each species were generated with the program preparacion.py using the following command:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python preparacion.py positivesSamples NegativesSamples species

Where positivesSamples is the filename of the positive samples with its path, NegativesSamples is the filename of the negative samples with its path, and species is the name of the data set to be used (this parameter serves to locate the species directory). The purpose of this program is to take 75% of the positive samples and 75% of the negative samples and combine them into a single training file; the remaining 25% of both files is used to create the test file. These files are stored in the respective directory for each species; the training file was named nameSpecie_window_training.sample and the test file nameSpecie_window_test.sample. An example of the execution of this script, where we prepare the samples for the gram+ species, is:

python preparacion.py ../gram+/datosIntermedios/gram+_11_3.sample \
    ../gram+/datosIntermedios/gram+_11_3_negative_experiment.sample gram+

Table 42 shows the number of samples in the training and test files for the Euk, Gram-negative and Gram-positive data sets obtained with the preparacion.py script. Notice that each file contains both positive and negative samples.

Table 42: Number of samples in the training and test files.

Training Test

Species No Positives No Negatives No Positives No Negatives

Eukaryotes 375 3375 125 1125

Gram-negative 417 4500 139 1500

Gram-positive 177 2250 59 750



Once the data are prepared, the experiment consists in learning automata from them using the OIL and HyRPNI algorithms; in this step we use our implementations of those algorithms, called oilnsym and rpni2Train respectively. Oilnsym has as parameters a training file, the output file name and the control number for the order of selection of the states to be merged; its possible values are 0 for random order, 1 for ascending order and 2 for descending order. Remember that random order (0) is always selected, to cause non-determinism in the execution. The invocation of the oilnsym program is:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/fuentes/OIL-nSym
./oilnsym trainingFilename.sample automatonName.auto NumberControl

trainingFilename.sample is the file name of the training samples with its path, automatonName.auto is the name of the resulting automaton with its path and NumberControl is the control number, whose value must be zero. This program is stored in the directory:

/home/carlos/destino2010/DATA_SIGNALP_2.0/Experimentacion/fuentes/OIL-nSym

An example of the execution of the program oilnsym is shown below:

cd /home/carlos/destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/fuentes/OIL-nSym
./oilnsym ../gram+/gram+_11_3_training.sample ../gram+/oilnsym_gram+_1.auto 0

The parameter gram+_11_3_training.sample is the name of the training samples of gram+ and oilnsym_gram+_1.auto is the name of the resulting automaton; as we can see, it adds the name of the program that trained it. The number 0 means random order. We executed oilnsym through the script entrena_oil_signalp.bash, which creates 51 automata for each species; the resulting automata were stored in the respective directory of each species. Thus, we used this script:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
bash entrena_oil_signalp.bash species lowRange upRange numberHypotheses

Where species is the name of the data set to be used, lowRange is the number of positions to the left of the cleavage site, upRange is the number of positions to the right of the cleavage site and numberHypotheses is the number of hypotheses to be generated. An example of the execution of the program entrena_oil_signalp.bash is:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
bash entrena_oil_signalp.bash gram+ 11 3 51



Here, we generated 51 hypotheses for the gram+ species. Once we obtained the automata for each species, we evaluated these automata with the test samples using the script oilv4.py, as we can see in the following command:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python oilv4.py testFile numberAlternatives directory species

Where testFile is the test filename with its path, numberAlternatives is the number of hypotheses to be used in the voting system (each hypothesis has a number that identifies it), directory is the place where the automata are located and species is the name of the data set to be used. An example of the execution of oilv4.py is shown below:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python oilv4.py ../gram+/gram+_11_3_test.sample 3 ../gram+ gram+

Here, we evaluated 3 automata with the gram+ test file (the automata with ids 1, 2 and 3).

We executed oilv4.py through the script oilv4.bash, which applies a voting system to a set of automata trained to recognize the same language. Each automaton classifies all the test samples, and a given sample is tagged depending on the number of automata which classified it as positive or negative: the sample tag will be that of the majority of the automata. Testing is done with this command:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
bash oilv4.bash species lowRange upRange numberAlternatives

Where species is the name of the data set to be used, lowRange is the number of positions to the left of the cleavage site, upRange is the number of positions to the right of the cleavage site and numberAlternatives is the number of hypotheses to be tested with samples from the test file (each hypothesis has a number that identifies it). An example of the execution of the script oilv4.bash is:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
bash oilv4.bash gram+ 11 3 3

We evaluated 3 automata of the species with the gram+ test file; the results are stored in the directory /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/resultados.
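The voting scheme itself is simple. Below is a sketch under the assumption that each learned automaton exposes an accepts(sample) predicate; this interface is hypothetical and the snippet is not the actual oilv4.py code.

def vote(automata, sample):
    """Tag a sample with the class chosen by the majority of the team."""
    positives = sum(1 for a in automata if a.accepts(sample))
    return 1 if positives > len(automata) / 2 else 0

def tag_test_set(automata, test_samples):
    return [vote(automata, s) for s in test_samples]

Note that the team sizes we use are odd (1, 3, ..., 51), so a tie between positive and negative votes cannot occur.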

Once the results on the test samples were obtained, we calculated the performance measures (sensitivity, specificity and accuracy) with the script genResultados.py, and we use these measures to generate graphics and tables. This script has as parameter a results file (with extension .res). The measures are defined in Section 5.1. We executed the script with the following command:



cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python genResultados.py file.res resultsName lowRange upRange

Where file.res is the name of a results file created with the oilv4.py script with its path, resultsName is the name of the file where the performance values are stored with its path, lowRange is the number of positions to the left of the cleavage site and upRange is the number of positions to the right of the cleavage site. An example of this script is:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
python genResultados.py ../resultados/Tabla_Datos_gram+_11_3.res \
    ../resultados/Datos_Tabla_Datos_gram+_11_3.res 11 3

Here, we calculated the performance measures of the species gram+ and stored them in the file Datos_Tabla_Datos_gram+_11_3.res.

We executed genResultados.py through the script genResultados.bash. We used this script in the following way:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
bash genResultados.bash species lowRange upRange

Where species is the name of the data set to be used, lowRange is the number of positions to the left of the cleavage site and upRange is the number of positions to the right of the cleavage site. This is an example of the use of genResultados.bash; here we calculated the performance measures of the species gram+:

cd /destino2010/DATA_SIGNALP_2.0/experimentacion_2010_mayo/scripts
bash genResultados.bash gram+ 11 3

7.2 Results

Table 43 shows results for the three species considering the different sizes of the voting team and the performance measures sensitivity, specificity and accuracy. For eukaryotes the best result is obtained with a team of 33 automata, reaching a sensitivity of 0.72, specificity of 0.95 and accuracy of 0.92. For Gram-negative bacteria the best group size is 43, reaching a sensitivity of 0.62, specificity of 0.99 and accuracy of 0.96. Finally, for Gram-positive bacteria we obtain 0.57, 0.99 and 0.95. Notice that once the algorithm reaches its best performance, the rates stay essentially unchanged in spite of the growth of the team.

The best results obtained with the OIL algorithm are compared with the SignalP 3.0 results as reported in [JDBB04]; the comparison is presented in Table 44.



Figure 24: Results with OIL Algorithm in Euk, Gram+ and Gram- Data. Panels: (a) Sensitivity vs. Specificity in Euk Data; (b) Accuracy in Euk Data; (c) Sensitivity vs. Specificity in Gram- Data; (d) Accuracy in Gram- Data; (e) Sensitivity vs. Specificity in Gram+ Data; (f) Accuracy in Gram+ Data.



Table 43: In the tables the column Hypothesis is the number of models used in each row (we consider sets of increasing size); the columns Se, Sp, Ac are sensitivity, specificity and accuracy respectively.

(a) Results on the species Euk, Gram-negative and Gram-positive

EUK Gram-negative Gram-positive

Hypothesis Se Sp Ac Se Sp Ac Se Sp Ac

1 0.62 0.84 0.82 0.54 0.94 0.91 0.33 0.91 0.86

3 0.69 0.90 0.88 0.6 0.97 0.94 0.37 0.95 0.90

5 0.69 0.91 0.89 0.56 0.99 0.95 0.47 0.97 0.93

7 0.67 0.92 0.89 0.59 0.99 0.95 0.45 0.97 0.93

9 0.70 0.92 0.90 0.58 0.99 0.95 0.47 0.97 0.94

11 0.71 0.93 0.91 0.58 0.99 0.95 0.52 0.98 0.95

13 0.71 0.93 0.91 0.59 0.99 0.96 0.45 0.98 0.94

15 0.71 0.94 0.91 0.59 0.99 0.96 0.52 0.99 0.95

17 0.70 0.93 0.91 0.59 0.99 0.96 0.55 0.99 0.95

19 0.72 0.93 0.91 0.61 0.99 0.96 0.55 0.99 0.95

21 0.71 0.93 0.91 0.6 0.99 0.96 0.55 0.98 0.95

23 0.74 0.93 0.91 0.59 0.99 0.96 0.53 0.99 0.95

25 0.74 0.94 0.92 0.6 0.99 0.96 0.53 0.98 0.95

27 0.72 0.94 0.92 0.6 0.99 0.96 0.55 0.98 0.95

29 0.74 0.94 0.92 0.6 0.99 0.96 0.57 0.99 0.95

31 0.71 0.94 0.92 0.6 0.99 0.96 0.55 0.99 0.95

33 0.72 0.95 0.92 0.6 0.99 0.96 0.53 0.98 0.95

35 0.73 0.94 0.92 0.61 0.99 0.96 0.55 0.98 0.95

37 0.73 0.94 0.92 0.61 0.99 0.96 0.53 0.98 0.95

39 0.73 0.94 0.92 0.61 0.99 0.96 0.5 0.98 0.94

41 0.71 0.95 0.92 0.61 0.99 0.96 0.53 0.98 0.95

43 0.71 0.95 0.92 0.62 0.99 0.96 0.55 0.98 0.95

45 0.71 0.95 0.92 0.61 0.99 0.96 0.53 0.98 0.95

47 0.69 0.95 0.92 0.61 0.99 0.96 0.53 0.98 0.95

49 0.71 0.95 0.92 0.62 0.99 0.96 0.53 0.98 0.95

51 0.70 0.95 0.92 0.62 0.99 0.96 0.53 0.99 0.95

AVERAGE: 0.71 0.93 0.91 0.59 0.99 0.96 0.51 0.98 0.94

Notice that OIL specificity is higher than SignalP's, but in sensitivity SignalP is better. In accuracy, OIL results are better for eukaryotes and Gram-negative bacteria; SignalP is better for Gram-positive bacteria. Viewing the results as a whole, we claim that grammatical inference is a competitive method to predict signal peptide cleavage sites, although some work must be done to clearly overcome SignalP performance in all three measures.

Table 44: Results of OIL against SignalP. The columns Se, Sp, Ac are sensitivity, specificity and accuracy respectively.

SignalP3.0 OIL

Species Se Sp Ac Se Sp Ac

Eukaryotes 0.99 0.85 0.93 0.72 0.95 0.92

Gram+ 0.98 0.98 0.98 0.57 0.99 0.95

Gram- 0.94 0.88 0.95 0.62 0.99 0.96



In the table above we can see that the accuracy of OIL is very similar to the accuracy of the program SignalP, unlike the sensitivity, which is low for the OIL algorithm.

7.3 False Positive Experimentation

We performed another test feeding our best models, generated with the OIL algorithm, with only negative samples; the purpose of this experiment is to assess experimentally the models' tendency to produce false positives. The new samples were 625075 for eukaryotes, 112646 for Gram-positive and 220800 for Gram-negative bacteria. Table 45 shows the percentage of samples accepted by our models. Notice that the ratio of false positives decreases quickly as the size of the team increases, and it is lower than three percent for bacteria and lower than six percent for eukaryotes.
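The Fp column can be read as the share of negative-only samples that a voting team wrongly accepts; a sketch, reusing the hypothetical accepts() interface from the voting sketch in Section 7.1, is:

def false_positive_percentage(automata, negative_samples):
    """Percentage of negative-only samples accepted by the majority of
    the voting team; every acceptance here is a false positive."""
    accepted = 0
    for s in negative_samples:
        votes = sum(1 for a in automata if a.accepts(s))
        if votes > len(automata) / 2:
            accepted += 1
    return 100.0 * accepted / len(negative_samples)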

Table 45: Percentage of false positives produced by the OIL models on negative-only samples. The column Hypothesis is the number of models in the voting team and the column Fp is the percentage of false positives.

Euk Gram + Gram -

Hypothesis Fp Fp Fp

1 14.6 10.3 6.89

3 10.7 6.62 3.82

5 8.71 5.05 3.08

7 8.24 4.42 2.92

9 7.86 3.76 2.79

11 7.48 3.41 2.60

13 7.35 3.39 2.56

15 7.00 3.23 2.53

17 6.86 3.15 2.53

19 6.69 3.02 2.49

21 6.49 2.94 2.49

23 6.52 3.07 2.49

25 6.36 3.02 2.49

27 6.23 2.89 2.53

29 6.12 2.84 2.55

31 6.02 2.79 2.52

33 6.04 2.75 2.55

35 6.05 2.76 2.50

37 6.04 2.77 2.51

39 5.99 2.70 2.51

41 5.99 2.71 2.51

43 6.00 2.72 2.52

45 5.94 2.67 2.51

47 5.89 2.63 2.52

49 5.90 2.69 2.53

51 5.87 2.67 2.55

In conclusion, we report the results of applying a grammatical inference algorithm called OIL to the prediction of signal peptide cleavage sites. We have shown that our algorithm produces recognition rates comparable with those obtained by artificial neural networks and hidden Markov models. We yield better results in specificity for the three species, and in accuracy for two of them. Our future work will be to explore the reasons for the low sensitivity rates we are obtaining and to improve them.

8 Conclusions and Future Work

In this internal report we have registered two years of experimentation applying two grammatical inference algorithms to solve the cleavage site prediction problem in two versions: polyprotein and signal peptide cleavage sites. Our results are consistent in both tasks; recognition rates may be considered high and are competitive with other techniques. We should pay more attention to the balance of positive and negative samples to avoid biases in our models and to yield more balanced sensitivity and specificity rates. Our purpose is to optimize our implementations and produce a software tool for prediction available to the bioinformatics community.

References

[BCM08] Enrique Bravo, Lee A. Calvert, and Francisco J. Morales. The complete nucleotide sequence of the genomic RNA of bean common mosaic virus strain NL4. Revista de la Academia Colombiana de Ciencias Exactas, Fisicas y Naturales, XXXII(122):37–46, 2008.

[BHBB96] Nikolaj Blom, Jan Hansen, Dieter Blaas, and Soren Brunak. Cleavage site analysis in picornaviral polyproteins: discovering cellular targets by neural networks. Protein Science, 5:2203–2216, 1996.

[CK05] Francois Coste and Goulven Kerbellec. A Similar Fragments Merging Approach to Learn Automata on Proteins. IRISA Publication Interne, 1735, 2005.

[CK06] Francois Coste and Goulven Kerbellec. Learning automata on protein sequences. In Journees Ouvertes en Biologie Informatique Mathematiques JOBIM'06, 2006.

[EBvHN07] Olof Emanuelsson, Soren Brunak, Gunnar von Heijne, and Henrik Nielsen. Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protocols, 2(4):953–971, 2007.



[GdPAR08] Pedro Garcia, Manuel Vazquez de Parga, Gloria Alvarez, and Jose Ruiz. Learning regular languages using nondeterministic finite automata. Lecture Notes in Computer Science, 5148:92–101, 2008.

[Gol67] E. Mark Gold. Language Identification in the Limit. Information and Control, 10(5):447–474, 1967.

[HGS+04] Karsten Hiller, Andreas Grote, Maurice Scheer, Richard Munch, and Dieter Jahn. PrediSi: prediction of signal peptides and their cleavage positions. Nucleic Acids Research, 32:W375–W379, 2004.

[HMU01] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. Pearson Education. Addison Wesley, http://www-db.stanford.edu/~ullman/ialc.html, second edition, 2001.

[JDBB04] Jannick Dyrlov Bendtsen, Henrik Nielsen, Gunnar von Heijne, and Soren Brunak. Improved prediction of signal peptides: SignalP 3.0. Journal of Molecular Biology, 340:783–795, 2004.

[KV10] Sofia Khan and Mauno Vihinen. Performance of protein stability predictors. Human Mutation, 31:675–684, 2010.

[LdM+09] Nils Leversen, Gustavo deSouza, Hiwa Malen, Swati Prasad, Inge Jonassen, and Harald Wiker. Evaluation of signal peptide prediction algorithms for identification of mycobacterial signal peptides using sequence data from proteomic methods. Microbiology, 155:2375–2383, 2009.

[LdSM+09] Nils Anders Leversen, Gustavo A. de Souza, Hiwa Malen, Swati Prasad, Inge Jonassen, and Harald G. Wiker. Evaluation of signal peptide prediction algorithms for identification of mycobacterial signal peptides using sequence data from proteomic methods. Microbiology, 155:2375–2383, 2009.

[LN06] Alessandra Lumini and Loris Nanni. Machine learning for HIV-1 protease cleavage site prediction. Pattern Recognition Letters, 27:1537–1544, 2006.

[LPP98] Kevin J. Lang, Barak A. Pearlmutter, and Rodney A. Price. Results of the Abbadingo One DFA Learning Competition and a New Evidence-driven State Merging Algorithm. Lecture Notes in Artificial Intelligence, 1433:1–12, 1998.

[May05] Michael Mayo. Bayesian sequence learning for predcting proteincleavage points. Lecture Notes in Computer Science, 3518:192–202, 2005.

[NBvH99] H. Nielsen, Soren Brunak, and Gunnar von Heijne. Machinelearning approaches for the prediction of signal peptides andother protein sorting signals. Protein Engineering, 12(1):3–9,1999.

[NEBvH97] H. Nielsen, J. Engelbretch, Soren Brunak, and Gunnar vonHeijne. Identification of prokariotic and eukariotic signal pep-tides and prediction of their cleavage sites. Protein Engineering,10(1):1–6, 1997.

[Ogu09] Hasan Ogul. Variable context markov chains for hiv proteasecleavage site prediction. BioSystems, 96:246–250, 2009.

[PLC07] Piedachu Peris, Damin Lpez, and Marcelino Campos. A gram-matical inference approach to transmembrane domain predic-tion. Proceedings of Seventh International Workshop on A Se-mantic Web for Bioinformatics, pages 121–129, 2007.

[PLCS06] Piedachu Peris, Damin Lpez, Marcelino Campos, and Jos M.Sempere. Protein motif prediction by grammatical inference.LNAI, 4201:175–187, 2006.

[RY04] Thorsteinn Rognvaldsson and Liwen You. Why neural networksshould not be used for hiv-1 protease cleavage site prediction.Bioinformatics, 20(11):1702–1709, 2004.

[SSRZ08] Bruce R. Southey, Jonathan V. Sweedler, and Sandra L.Rodriguez-Zas. Prediction of neuropeptide cleavage sites in in-sects. Bioinformatics Advance Access, 2008.

[THY04] Rebecca Thompson, T. Charles Hodgman, and Zheng RongYang. Characterizing proteolytic cleavage site activity us-ing bio-basis function neural networks. Bioinformatics,19(14):1741–1747, 2004.

59

Page 60: Reporte Interno CIC: Results on Using Grammatical ...cic.puj.edu.co/wiki/lib/exe/fetch.php?media=grupos:... · Results on Using Grammatical Inference to Solve the Cleavage Site Prediction

[Ver02] J.P. Vert. Support vector machine prediction of signal peptidecleavage site using a new class of kernels for strings. PacificSymposium on Biocomputing, 7:649–660, 2002.

[WTR07] Laurence J.K. Wee, Tin Wee Tan, and Shoba Ranganathan.Casvm: web server for svm-based prediction of caspasesubstrates cleavage sites. Bioinformatics applications note,23:3241–3243, 2007.

[WYC05] M. Wang, J. Yang, and K. Chou. Using string kernel to predictsignal peptide cleavage site based on subsite coupling model.Amino Acids, 28:395–402, 2005.

[YC04] Z. Yang and K. Chou. Biosupport vector machines for compu-tational proteomics. Bioinformatics, 20(5):735–741, 2004.

60