10

Learning Local Languages and Its Application to Protein alpha-Chain Identification]

  • Upload
    waseda

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Learning Local Languages and Its Application to

Protein �-Chain Identi�cation]

Takashi Yokomori Nobuyuki Ishida Satoshi Kobayashi

Department of Computer Science and Information Mathematics

University of Electro-Communications

Chofu, Tokyo 182, Japan

October 7, 1996

Abstract

This paper concerns an e�cient algorithm for

learning in the limit a special type of regular lan-

guages called locally testable languages from positive

data, and its application to identifying the protein

�-chain region in amino acid sequences. First, we

present a linear time algorithm that, given a locally

testable language, learns (identi�es) its deterministic

�nite state automaton in the limit from only positive

data. This provides us with a practical and e�cient

learning method for a speci�c domain of symbolic anal-

ysis.

We then describe several experimental results using

the learning algorithm developed above. Following a

theoretical observation which strongly suggests that a

certain type of amino acid sequences can be expressed

by a locally testable language, we apply the learning

algorithm to identifying the protein �-chain region in

amino acid sequences for hemoglobin. Experimental

scores show an overall success rate of 95 % correct

identi�cation for positive data, and 96 % for negative

data.

1 Introduction

In the study of inductive inference of formal lan-

guages, Gold [6] showed that the class of languages

containing all �nite sets and one in�nite set is not

learnable in the limit from positive data only. This

fact was shocking in a sense because a simple implica-

tion is that even the class of regular languages is not

learnable in the limit from positive data.

Angluin [1] has given several conditions for the class

of languages to be learnable in the limit from posi-

tive data, and presented some examples of learnable

classes. She has also proposed subclasses of regular

languages called k-reversible languages for each k � 0

and shown these classes are learnable in the limit from

positive data with the conjectures updated in polyno-

mial time([2]). Here, the learnability from only pos-

itive data as well as polynomial-time e�ciency is of

great importance from the practical viewpoint.

On the other hand, the notion of a splicing sys-

tem was introduced by Head in [9] as a mathematical

model of restriction enzyme digestion and subsequent

religation in the recombination of DNA molecules. He

showed an interesting relationship between splicing

languages generated by splicing systems and regular

languages. In particular, one of the most signi�cant

results for our purpose is the equivalence between a

certain type of splicing languages, called persistent

splicing languages, and a subclass of regular languages

called locally testable languages. This result is cru-

cially instructive in that, as we will show, it can bridge

the gap between mathematical analysis in molecule bi-

ology and formal language theory in computer science.

In this paper, we will �rst show that, using deter-

ministic �nite state automata(abb., DFAs), for each

k � 1, the class of k-testable languages is learnable

in the limit from positive data with the conjectures

updated in linear time. The class of locally testable

languages consists of all classes of k-testable languages

for every k � 1.

Then, motivated by a theoretical result due to

Head([9]) mentioned above which strongly suggests

that a certain type of amino acid sequences can be

expressed by a locally testable language, we describe

a simple experimental result via a machine identi�-

cation system based on the learning algorithm devel-

oped above. We apply the system to identifying the

protein �-chain region in amino acid sequences for

hemoglobin. Some experimental data show that the

system achieved an overall success rate of 95 % correct

identi�cation for positive data, and 96 % for negative

data. Thus, the main experimental result stated in

this article suggests the great potential of a successful

formal method based on the automata-theoretic ap-

proach to DNA sequence analysis in general.

This article is organized as follows. After preparing

basic de�nitions and notations in Section 2, we intro-

duce the notion of a k-testable language and study its

formal properties in Section 3. A problem of learning

k-testable languages in the limit from positive data

is formalized and a linear time algorithm for solving

the problem is presented. Section 4 in turn deals with

an application issue of the algorithm developed in the

previous section. First, we state a theoretical result on

relationship between a formal model for describing the

behaivors of recombination of DNA molecules, called

splicing system, and a k-testable language. Then, we

apply the learning algorithm to an identi�cation prob-

lem of protein �-chain region in amino acid sequences,

to obtain high scores of overall success rates in several

experiments.

2 Preliminaries

2.1 Basic De�nitions and Notations for

Formal Languages

We assume the reader to be familiar with the rudi-

ments of automata and formal language theory. (For

notions and notations not stated here, see, e.g., [8].)

Let � be a �xed �nite alphabet and �

be the set

of all �nite-length strings over �. Further, let �

+

=

� f�g, where � is the null string. By lg(u) we

denote the length of string u. A language over � is a

subset of �

. The cardinality of a set S is denoted by

jSj.

Let L

k

(w) and R

k

(w) be the pre�x and the su�x

of w of length k, respectively. Further, let I

k

(w) be

the set of interior solid substrings of w of length k.

These are de�ned only when w has length k or more.

If w has length k, then L

k

(w) = R

k

(w) = w, while if

w has length k or k + 1, then I

k

(w) is empty. Let x

be a string over � and L be a language over �. The

left-quotient of L and x, denoted by xnL, is de�ned by

xnL = fy 2 �

jxy 2 Lg.

A determinisitic �nite state automaton (DFA) is a

5-tuple M = (Q;�; �; p

0

; F ), where Q is a �nite set of

states, � is a �nite alphabet of input symbols, p

0

(2 Q)

is the initial state, F (� Q) is the set of �nal(accepting)

states, and � is a state transition mapping such that

for all p 2 Q and a 2 �, �(p; a) is in Q or unde�ned.

A mapping � is extended to Q��

as follows: for all

p 2 Q, �(p; �) = p, and for all p 2 Q, x 2 �

and

a 2 �, �(p; xa) = �(�(p; x); a). A string w over � is

accepted byM if and only if �(p

0

; w) 2 F . A language

L is accepted byM , denoted by L = L(M), if and only

if L is exactly the set of all strings accepted by M .

2.2 Learning in the limit from Positive

Data

Let L be a class of languages to be learned over a

�xed alphabet �. Further, let M be a class of DFAs

such that for all L 2 L, there exists M 2 M such that

L = L(M).

For a given M 2 M, a positive presentation of

L(M) is any in�nite sequence of examples such that

every w 2 L(M) occurs at least once in the sequence

and no other examples not in L(M ) appear in the se-

quence.

LetM be a DFA inM accepting a given L(i.e., L =

L(M)). An algorithm A is said to learn(or identify)

a language L in the limit from positive data using M

if and only if for any positive presentation of L, the

in�nite sequence of DFAs M

i

in M produced by A

satis�es the property that there exists a DFAM

0

inM

such that for all su�ciently large i, the i-th conjecture

(DFA) M

i

is identical to M

0

and L(M

0

) = L(M).

A class of languages L is learnable in the limit from

positive data (using M) if and only if there exists an

algorithm A that, given an L in L, learns L in the

limit from positive data using M.

Let A be an algorithm for learning a language class

L in the limit from positive data. A class L is learnable

in the limit from positive data with the conjectures

updated in polynomial time if and only if there exists

an algorithmA for learning L in the limit from positive

data with the property that there exists a polynomial

p such that for any n, for any L in L for which a correct

DFA is of size n, and for any positive presentation of L,

the time used byA between receiving the i-th example

w

i

and outputting the i-th conjectured DFA M

i

is at

most p(n;m

1

+ � � �+m

i

), where m

j

= lg(w

j

)(1 � j �

i).

3 Learning Locally Testable Languages

from Positive Data

3.1 Locally Testable Languages

Given a positive integer k, let �

k

= fw 2

jlg(w) = kg (i.e., the set of all strings over � of

length k). A language L over � is k-testable if and

only if there exist �nite sets A;B, and C such that

A;B;C � �

k

, and for all w with lg(w) � k, w 2 L if

and only if L

k

(w) 2 A, R

k

(w) 2 B, and I

k

(w) � C.

(In this case, (A;B;C) is called a triple for L.) A

language L is locally testable if and only if there exists

an integer k � 1 such that L is k-testable ([11]).

Note 1 (1) In the literature [11], a k-testable lan-

guage in this paper is called a strictly k-testable lan-

guage.

(2) If L is k-testable, then L is k

0

-testable for all

k

0

> k. Further, note that the de�nition of \k-

testable" says nothing about the strings of length k�1

or less.

(3) In the sequel, we will omit the proofs for all

formal results simply because of the space limit, and

the reader interested in those is suggested to refer to

[20].

In order to discuss the relationship between the

class of locally testable languages and that of re-

versible languages, we give the de�nition of re-

versible languages based on the language-theoretic

characterizations([2]). Let k be a non-negative inte-

ger. A regular language L is k-reversible if and only

if whenever u

1

vw and u

2

vw are in L and lg(v) = k,

it holds that u

1

vnL = u

2

vnL. (When k = 0, we of-

ten write \zero-reversible" rather than \0-reversible".)

A regular language is reversible if and only if it is k-

reversible for some k � 0.

Example 1 Consider a language L consisting of all

�nite-length strings over �(= fa; bg) that begin with

a, end with a and contain no occurrence of a as a

strictly interior subword, that is, L is the language

denoted by the regular expression a+ ab

a.

Let A = B = fag; and C = fbg. Then, it is easily

seen that for all w 2 �

with lg(w) � 1, w 2 L if and

only if L

1

(w) 2 A;R

1

(w) 2 B, and I

1

(w) � C. Hence,

L is 1-testable. It is seen that L is not a zero-reversible

language.

We have the following theorem.

Theorem 1 For each k � 1, the class of k-testable

languages is properly included in the class of (k + 1)-

reversible languages, but incomparable to the class of

zero-reversible languages.

In fact, we can show the following result that provides

a formal method for expressing k-testable languages

in more explicit fashion.

Lemma 2 Let L be a language such that L � �

k

.

Then, L is k-testable if and only if there exists a new

triple [A

0

; B

0

;C

0

] such that A

0

; B

0

; C

0

� �

k

and L =

A

0

\�

B

0

� �

+

C

0

+

.

Thus, in what follows, we assume the following:

[CONVENTION]

We adopt the following de�nition for a k-testable lan-

guage: L is k-testable if and only if (1) there exists

a triple S = [A;B;C] such that A;B;C � �

k

and

L = L(S), and (2) for no k

0

less than k, L is k

0

-

testable, where L(S) = A�

\�

B � �

+

C�

+

.

3.2 K-testable Language Associated with

a Finite Set

In this section, we will show that given a �nite set

of strings T , one can e�ectively construct a k-testable

language L

T

which is the smallest k-testable language

containing T in the sense that any other k-testable

language containing T is a superset of L

T

.

Let T be a �nite subset of �

k

. Then, construct

�nite sets A

T

, B

T

, and C

T

as follows:

A

T

= fL

k

(w)jw 2 Tg

B

T

= fR

k

(w)jw 2 Tg

C

T

= �

k

� I

k

(T ); where I

k

(T ) =

[

w2T

I

k

(w):

Consider a triple S

T

= [A

T

; B

T

; C

T

] and a language

L(S

T

) = A

T

\�

B

T

��

+

C

T

+

:

Then, obviously A

T

; B

T

; C

T

� �

k

. A set L(S

T

) is

called a k-testable language associated with T. A lan-

guage L(S

T

) is the smallest k-testable language con-

taining T , which is formally stated as follows:

Lemma 3 Let L be an arbitrary k-testable language

such that L � �

k

. Then, T � L implies that

L(S

T

) � L.

Note that L(S

T

) can be completely characterized

by T . In fact, we can generalize the above result so

that any k-testable language may have its �nite subset

with such a property.

Let L be a k-testable language such that L � �

k

.

A �nite subset R of �

k

is called a characteristic

sample for L if and only if L is the smallest k-testable

language containing R. The next lemma proves the

usefulness of the notion of a characteristic sample for

a language.

Lemma 4 Suppose that R � T � L, where R and T

are �nite, L is a k-testable language such that L �

k

. If R is a characteristic sample for L, then L =

L(S

R

) = L(S

T

) hold.

Thus, the existence of a characteristic sample for a

language L assures the existence of a correct learning

strategy for L.

In fact, we can prove the following.

Lemma 5 For any k-testable language L, there e�ec-

tively exists a characteristic sample R

L;k

for L.

3.3 Constructing a DFA associated with

a Triple

Let L be a k-testable language such that L =

L(S)(= A�

\ �

B � �

+

C�

+

), for some triple S =

[A;B;C]. The purpose of this section is to show how

one can construct a characteristic sample for L. First,

consider a DFA M

S

= (Q;�; �; p

0

; F ) constructed

from S = [A;B;C] in the following way :

Q =

Q

1

[ f[bx]jx 2 A \B \ Cg ( if A \ B \ C 6= ;)

Q

1

( otherwise )

where

Q

1

= f[�]; [a

1

]; [a

1

a

2

]; :::; [a

1

� � �a

k�1

]ja

1

� � �a

k

2 Ag

[f[x]jx 2 (�

k

� C) [ A [Bg;

p

0

= [�]

F and � are de�ned as follows:

F =

f[�]j� 2 Bg [ f[bx]jx 2 A \B \ Cg( if A \B \ C 6= ;)

f[�]j� 2 Bg (otherwise)

(1) for all � = a

1

� � � a

k

2 A;

�(p

0

; a

1

) = [a

1

]

�([w

i

]; a

i+1

) = [w

i

a

i+1

]

(where w

i

= a

1

� � � a

i

(1 � 8i � k � 2) )

�([w

k�1

]; a

k

) =

[b�] ( if � 2 B \ C)

[�] (otherwise)

(where w

k�1

= a

1

� � � a

k�1

)

(2) for all [ax]; [xb] 2 Q such that lg(x) = k � 1;

ax 2 A; xb 2 (�

k

� C) [ B;

�([cax]; b) = [xb] (if ax 2 B \ C)

�([ax]; b) = [xb] (otherwise)

(3) for all [ax]; [xb] 2 Q such that lg(x) = k � 1;

ax 2 (�

k

� C); xb 2 (�

k

� C) [ B

�([ax]; b) = [xb]:

We call M

S

the canonical DFA associated with S. (See

Example 2 and Figure 1. It should be noted that

the terminology \canonical" is used here in a manner

di�erent from the usual sense.)

We observe that, for a given triple S for a k-testable

language L, the canonical DFA M

S

= (Q;�; �; p

0

; F )

a

ab

b

a a a

ab

b

[a] [aa] [aa]

[ab] [ba]

[bb]

Figure 1

&K[ ]

Canonical DFA M associated with S in Example 2

S

associated with S is not always minimum. However,

by applying an appropriate state minimization algo-

rithm to M

S

, we could obtain the minimum DFA

M

L

= (Q

L

;�; �

L

; p

0

; F

L

) exactly accepting L, if nec-

essary.

Example 2 Consider a 2-testable language L de�ned

by a triple S = [A;B;C] as follows : L = A�

\

B��

+

C�

+

, where � = fa; bg; A = B = C = faag.

Note that L is the language denoted by the regular

expression: aa+ aa(b

+

a)

a.

We construct the canonical DFA M

S

= (Q;

�; p

0

; �; F ) associated with S = [A;B;C], where Q =

f[�]; [a]; [aa]; [ab]; [ba]; [bb]g [ f[caa]g, p

0

= [�], and

F = f[caa]; [aa]g. Further, � is de�ned by

�([�]; a) = [a]; �([a]; a) =

d

[aa], �(

d

[aa]; a) = [aa]

�(

d

[aa]; b) = [ab]; �([ab]; a) = [ba]; �([ab]; b) = [bb]

�([ba]; a) = [aa]; �([ba]; b) = [ab]; �([bb]; a) = [ba]

�([bb]; b) = [bb].

The state-transition graph of M

S

is given in Figure 1.

The canonical DFAM

S

associated with S is equivalent

to S in the following sense.

Lemma 6 For any w 2 �

k

, w is in L(S) if and

only if it is accepted by M

S

.

3.4 Learning Algorithm : LA

We now present a learning algorithm LA. The

learning protocol for LA consists of only the posi-

tive presentation of an unknown k-testable language

U over the �xed alphabet �. It is shown that LA given

below eventually learns in the limit a DFA M

E

such

that U = L(M

E

), where L(M

E

) is the language ac-

cepted by M

E

. More speci�cally, from Lemma 4 and

Lemma 5, we can show the following :

Lemma 7 Let M

E

0

;M

E

1

; :::;M

E

i

; ::: be the sequence

of DFAs produced by LA, where E

i

is the set of pos-

itive data at the i-th stage. Then, (i) for all i � 0,

L(M

E

i

) � L(M

E

i+1

) � U , and (ii) there exists r � 0

such that for all i � 0, M

E

r

= M

E

r+i

and L(M

E

r

) =

U .

Thus, we have the main theorem.

Theorem 8 Given an unknown k-testable language

U, the algorithm LA learns in the limit a DFA M

E

such that U = L(M

E

) from positive data.

[ALGORITHM LA]

Input: a positive integer k and a positive

presentation of a target k-testable

language U

Output: a sequence of DFAs for k-testable

languages

Procedure

initialize E

0

= ; ;

let S

E

0

= [;; ;;�

k

] be the initial triple ;

construct DFA M

E

0

accepting E

0

(= ;) ;

repeat (forever)

let M

E

i

= (Q

E

i

;�; �

E

i

; p

0

; F

E

i

) be

the current DFA ;

read the next positive example w

i+1

;

if w

i+1

2 L(M

E

i

), then output M

E

i+1

(= M

E

i

);

else

scan w

i+1

to compute

L

k

(w

i+1

); R

k

(w

i+1

); I

k

(w

i+1

) ;

construct the canonical DFA

M

E

i+1

associated with

S

E

i+1

= [A

E

i+1

; B

E

i+1

; C

E

i+1

] ;

(where A

E

i+1

= A

E

i

[ fL

k

(w

i+1

)g,

B

E

i+1

= B

E

i

[ fR

k

(w

i+1

)g,

C

E

i+1

= C

E

i

� I

k

(w

i+1

)

and E

i+1

= E

i

[ fw

i+1

g)

output M

E

i+1

;

[Time Analysis of LA]

We analyse the time complexity of the algorithm

LA needed to update a conjecture.

For every i � 0, each time a new positive example

w

i+1

is given, we observe :

(1) In the �rst phase, testing whether or not w

i+1

2

L(M

E

i

) is done in time O(lg(w

i+1

)).

(2) In the second phase, computing L

k

(w

i+1

);

R

k

(w

i+1

), and I

k

(w

i+1

) is done in time O(klg(w

i+1

)).

At the same time, �

E

i

is updated in time

O(kj�jlg(w

i+1

)). Hence, updating a conjecture M

E

i

requires at most O(kj�jlg(w

i+1

)).

Thus, we have the following theorem.

Theorem 9 The algorithm LA requires at most

O(j�jkm) time for updating a conjecture, where m is

the maximum length among all positive data provided

in the learning process.

4 Machine Identi�cation { An Experi-

mental Result

In the �rst half of the present paper, we have de-

veloped an e�cient algorithm LA for learning locally

testable languages in the limit from positive data, and

discussed some of the formal aspects of both the lan-

guage class and the learning algorithm LA.

The second half will discuss a possible application

of such theoretical results to a practical problem in

the domain of DNA sequence analysis.

4.1 A formal model for DNA Splicing Se-

quences

In [9] Head proposed a formal system called a splic-

ing system and studied the relationship between for-

mal language theory to infomational molecule biology.

A splicing system is a 4-tuple S = (�; I; B; C),

where � is an �nite alphabet, I is a �nite set of initial

strings, B and C are �nite sets of triples (a; x; b), called

patterns, with a; x; b 2 �

. For each such a triple, the

string axb is called a site and the string x is called a

crossing. Patterns in B and C are called the left(resp.,

right) patterns. For uaxbv and wcxdz in �

with pat-

terns (a; x; b) and (c; x; d) in the same hand(either B or

C), two new strings uaxdz and wcxbv are constructed

by splicing at the crossing x. (In a biological inter-

pretation, one may take I as the initial set of DNA

molecule sequences, B and C as the sets of splicing

rules, speci�ed by restricted enzymes and a ligase, that

produce 5

0

overhangs or blunt ends, and produce 3

0

overhangs, respectively.)

The language L = L(S) generated by S consists of

the strings in I and all strings that can be obtained

by adjoining to L uaxdz and wcxbv, whenever uaxbv

and wcxdz are in L and both patterns (a; x; b) and

(c; x; d) are in the same hand. (In other words, L(S)

is the minimal subset of �

which contains I and is

closed under the operation of splicing.) A language L

over � is a splicing language if there exists a splicing

system S such that L = L(S).

Here we are concerned with a speci�c type of splic-

ing systems. A splicing system is said to be persistent

if for each pairs of strings uaxbv and wcxdz in �

with (a; x; b) and (c; x; d) from the same hand, if y is a

subsegment of uax(resp., xdz) that is the crossing of

a site in uaxbv(resp., wcxdz), then this same subseg-

ment y of uaxdz contains an occurrence of the crossing

of a site in uaxdz. (Intuitively, in a persistent splic-

ing system, it is possible to apply consecutive splicing

operations in�nitely many times.)

It is known that if we specify B and C by choosing

restriction enzymes from actual biological world, then

the resulting systems are often persistent([9]). (For ex-

ample, in order to obtain persistent splicing systems,

one may choose restriction enzymes from Appendix

of the literature [19]. Further, any splicing system

containing only one restriction enzyme is always per-

sistent, even if the single enzyme is HgiAI.)

An important result for our purpose is the follow-

ing.

Proposition 10 ([9]) For a language L, the follow-

ings are equivalent:

1. L is a locally testable language,

2. L is generated by a persistent splicing system.

Example 3 ([9]) (1) Consider a splicing system

S = (fa; bg; fababag; f(a; b; a)g; ;). It is easy to see

that S is persistent and L(S) = faba(ba)

n

jn � 0g. A

biochemical interpretation of this system is that tak-

ing a = [A=T ][T=A] and b = [C=G][G=C], the result-

ing language L(S) consists of all molecules that can

be potentially arised from the action of ClaI (with an

appropriate ligase) on (copies of) the molecule :

[

A

T

][

T

A

][

C

G

][

G

C

][

A

T

][

T

A

][

C

G

][

G

C

][

A

T

][

T

A

](= ababa):

A language L(S) is in fact a 2-testable langauge that

is accepted by a DFA pictured in Figure 2.

(2) Let S = (fa; c; g; tg; I ;B;C), where a =

[A=T ]; t = [T=A]; c = [C=G]; g = [G=C], I is un-

speci�ed here, B consists of the patterns speci�ed by

EcoRI, TaqI and AluI and C contains the single pat-

tern by HhaI. The patterns provided by these four

enzymes are as follows:

a

a

b

[a]&K[ ] [ab] [ba]b

Figure 2

(g; aatt; c) for EcoRI, (t; cg; a) for TaqI

(ag; �; ct) for AluI, (g; cg; c) for HhaI.

A careful examination proves that this system is per-

sistent.

We call fa; c; g; tg the standard 4 letter alphabet and

denote it by S

4

. Note that in order to have a 2-testable

language L(S), we make use of a symbolic replacement

in (1) of Example 3 above.

Such a translation is frequently used in analysing

DNA sequences. In fact, we will describe several ex-

perimental results in the next section where a letter-

to-letter translation (i.e., a coding) is used to reduce

the size of the problem in question smaller, leading to

a feasible solution.

4.2 Protein �-chain Identi�cation

From Proposition 10, it may be justi�ed that we

assume the following :

[AWorking Assumption] Some types of amino acid

sequences are generated by persistent splicing systems,

and hence, they are characterized by locally testable

languages.

Under this assumption, in this section, we describe

some experimental results that concern identifying a

protein �-chain region. We have implemented the

learning algorithm for locally testable languages de-

veloped in the previous section on a workstation, and

applied it to the problem of identifying the �-chain

region in amino acid sequences for hemoglobin.

(1) Goal : The purpose of this experiment is to �nd a

DFA that identi�es the �-chain region in amino acid

sequences for hemoglobin.(One hemoglobin molecule

contains two �-chains and two �-chains, and the �-

chain is a polypeptide normally comprising 141 amino

acids.)

(2) Method : The positive data of �-chain in amino

acid sequences for hemoglobin have been drawn from

PRF protein sequence database([13]). Let us denote

the set of those raw positive data by POS

raw

that is a

�nite set of strings over the alphabet of 20 symbols,

A

20

, each of which represents each amino acid residue.

We used two kinds of letter-to-letter translations

: one is from A

20

to the 6 letter alphabet D

6

(=

fa; b; c; d; e; fg) speci�ed by Dayho�'s coding method,

the other from A

20

to a binary alphabet B(= f0; 1g)

used in [12], shown in Table 1. (A sample of a raw

positive data and its translation results via Dayho�'s

coding and binary coding are given in Figure 3.)

We now describe the details of our experiments.

Let's denote by POS the set of positive data obtained

from POS

raw

in terms of a translation via either Day-

ho�'s coding or binary coding in Table 1. In our ex-

periments, the cardinality of POS was 123.

We have also prepared a set of negative data NEG,

consisting of 2567 sequences. NEG was constructed

by randomly choosing from all protein data but

hemoglobin data in PRF database. The length of data

ranges from 138 to 142 for NEG and 141 or 142 for POS.

By Exp(k; x) we denote an experiment performed

by the method described below, where k is the pa-

rameter for k-testability and x is the cardinality of

Sample, a set of training data randomly selected from

POS.

4.2.1 Experiments Based on Dayho�'s coding

We now have the new set of positive data POS over D

6

Here, we �xed a parameter k as k = 2.

[An experiment Exp(2; x)] For each x(=10, 15,

20, 25, 30), we constructed Sample consisting of x se-

quences that were randomly selected from POS. Then,

using Sample, the learning algorithm LA described

in Section 3 produced a DFA M

2;x

that speci�es a

2-testable language containing all training data in

Sample. We then evaluated the performance of the

obtained DFA, using the set of test data Test and

NEG, where POS=Sample [ Test(disjoint union). The

success rates of correct identi�cation for positive and

negative data are de�ned by:

R

pos

=

jTest \ L(M

2;x

)j

jTestj

; R

neg

=

jNEG \L(M

2;x

)j

jNEGj

;

respectively, where L(M

2;x

) denotes the complement

of L(M

2;x

) with respect to D

6

(that is, the set of

strings over D

6

rejected by M

2;x

).

For each x(= 10; 15; 20; 25; 30), we have iteratevely

made an experiment Exp(2; x) 100 times. Figure 4

illustrates a series of experiments Exp(2; x) where the

�gures indicate the average scores on each experiment.

Table 2 summarizes the experimental results in which

Dayhoff’s coding

(coded data)

binary coding

. . .

VLSPADKTNVKAAWGKVGAHAGEY . . .

eebbbcdbdedbbfbdebbdbbcf

. . . 000101100010000100010010

A raw data from human hemoglobin is coded into a new data over D or 6 B

A part of -chain in human hemoglobin (raw data)&A

Figure 3

(I) shows that the DFA M

2;30

achieved a success rate

of 93 % correct identi�cation for (positive) test data,

and 99 % for negative data.

4.2.2 Experiments Based on Binary Coding

Using a binary coding in [12], we have a set of positive

data POS over B which is obtained from POS

raw

.

By �xing x = 20, we �rst made experiments

Exp(k; 20) for each k = 2; 3; 4; 5; and 6. In each case,

Exp(k; 20) has been iteratively performed 100 times.

The �nal results are presented in (II) of Table 2, where

the �gures indicate the average scores. For example,

we observe that an experiment Exp(3; 20) achieves the

overall best score : a success rate of 96% correct iden-

ti�cation for (positive) test data and 92% correct iden-

ti�cation for negative data.

Based on this observation, we then made experi-

ments Exp(3; x) for each x = 6; 7; 8; 9; and 10. Again,

for each x, Exp(3; x) has been performed 100 times.

Figure 5 illustrates the results of a series of Exp(3; x),

where the �gures indicate the average.

4.3 Discussion

First, we will mention some theoretical considera-

tion, immediately derived from our previous results,

and its molecule biological implication.

Suppose that a representation relation L = h(L

0

)

holds, where L is a k-testable language over D

6

and h

is a letter-to-letter coding from A

20

to D

6

. Then, we

note that L

0

is contained in a k-testable language over

A

20

.

Further, if a set of amino acid sequences is a k-

testable language over A

20

, then it is regarded as 3k-

testable language over the standard 4 letter alphabet

S

4

(= fa; c; g; tg).

From these consideration, the following theorem

holds :

Theorem 11 If L is a k-testable language over D

6

,

then a language L

0

such that L = h(L

0

) is contained

in a 3k-testable language over S

4

.

Thus, for example, using a result obtained in

Exp(2;30) with k = 2, one may make a conjecture that

a set of DNA sequences is contained in a 6-testable

language over S

4

, which is strongly supported by the

theoretical discussion made in Section 4.1.

Table 1. Translation Tables

(I) Dayho�'s Coding([4])

Amino Acids Properties New Symbol

C sulfur polymerization a

S, T, P, A, G small b

N, D, E, Q acid & amide c

H, R, K basic d

M, I, L, V hydrophobic e

F, Y, W aromaticity f

(II) Binary Coding([12])

Amino Acids Hydropathy Index New Symbol

A, C , F, G, I, L, M,

N, S, T, V, W, Y High 0

D, E, H, K, P, Q, R Low 1

Table 2. Experimental Results | Best Scores

(I) (Via Dayho� coding)

(k = 2) Exp(2; 30)

R

pos

93%

R

neg

99%

(II) (Via Binary Coding based on Hydropathy Index)

Exp(k,20) k = 2 k = 3 k = 4 k = 5 k = 6

R

pos

97.4% 96,2% 92.6% 87.9% 81.5%

R

neg

83.3% 92.3% 93.6% 94.9% 99.2%

[�-Chain Automata]

Through the experiments, several DFAs have been

identi�ed from sets of training data that have been se-

lected in a random manner from a set of positive data.

It is expected that these DFAs have some common fea-

ture, because they all have been obtained as the best

candidate DFA for specifying the same virtual target

of a locally testable language(i.e., a persistent splicing

langauge). However, since it is hard to compare DFAs

with distinct alphabets, let us �rst focus on the se-

ries of DFAs over B, M

k;20

(k = 2; 3; 4; 5; 6), obtained

in Exp(k; 20). Provided that the overall performance

score of a DFA is evaluated by a simple summation :

R

pos

+R

neg

, we observe thatM

3;20

might be called the

best DFA over B. We also note from Figure 5 that all

DFAs M

3;x

(for x = 6; 7; 8; 9; 10) equally achieve high

scores in overall performance evaluation. This sug-

gests one aspect of the stability of the obtained DFA

family. Further, it is somehow surprising that only 6

(and at most 10) training data randomly selected are

su�cient for constructing such a good DFA character-

izing �-chain regin in binary representation.

On the other hand, suppose that we compare two

DFAs M

3;6

over B and M

2;30

over D

6

. We then �nd

it di�cult to discuss one's superiority over the other.

Because, we note that these experiments have been

performed based on the distinct training data sets

randomly chosen under the di�erent coding methods.

Further, two codings we used are basically indepen-

dent of each other. Thus, it seems that at present

we have no justi�able way to select one DFA as the

best automaton for explaining �-chain in amino acid

sequences. Nevertheless, it may be claimed that each

DFA M

2;30

(M

3;x

for x = 6; 7; 8; 9; 10 or some other

x � 20 ) captures (at least) a part of the essential char-

acteristic of �-chain region for hemoglobin. In fact, in

the course of performing some supplemental experi-

ments whose purpose was to �nd a minimum DFA

with the same high level of identi�cation scores (over

90%) as above, we have obtained a DFA M

3;12

over

B that exactly accepts the language 000(0 + 1)

101

denoted by the regular expression([21]). Thus, an in-

teresting fact is that with high probability of over 90%,

the structure of �-chain region of amino acid sequence

can be characterized by only its pre�x and su�x of

length 3, in binary representation.

In summing-up, using the present method for learn-

ing locally testable languages, we could obtain a spe-

ci�c type of DFA M

that one might call an �-chain

DFA for hemoglobin. A machine M

will then be ap-

plicable to identifying the �-chain region for unknown

amino acid sequences with a very high success rate.

Finally, from the other supplemental experiments,

we conjecture that a DFA with a similarly high score

could be obtained from a set of positive raw data over

A

20

, i.e., POS

raw

if the training set is su�ciently large.

Here, we could eventually collect only 123 positive raw

data in total from the PRF databse, and as a result,

Success Rate

1.0

0.5

0

*** *

.1

.2.3

.4

.6

.7

.8

.9

Best Score

*posR

Rneg

An Experiment D 6over

*

No. of Training Data

Figure 4.

x1510 20 25 30

.803.854

.887 .915

.997 .995 .994 .992

=0.931

=0.991

.991

.931

Exp(2,x)

Overall

we could not help using two codings to obtain our

main results.

5 Conclusions

We have shown that the class of k-testable lan-

guages is learnable in the limit from positive data, and

presented an algorithm which learns any k-testable

language in the limit from positive data using DFAs.

There exist a few subclasses of regular languages

known to be learnable in the limit from positive data

in polynomial time in some sense([10, 14, 18]). It seems

that further research in this direction should be made

for enlarging the boundary of polynomial-time learn-

ability from positive data.

Since the notion of a splicing system was intro-

duced by Head in [9] as a mathematical model of re-

striction enzyme digestion and subsequent religation

in the recombination of DNA molecules, there have

been reported several works concerning the formal

characterizations of splicing systems and their learn-

ing problem([3, 5, 15, 17]). In particular, one of the

most signi�cant results for our purpose was the equiv-

alence relationship between a certain type of splicing

languages, called persistent splicing languages, and the

class of locally testable languages.

Motivated by this result due to Head([9]) which

strongly suggests that a certain type of amino acid

sequences can be expressed by a locally testable lan-

Success Rate

1.0

0.5

0 6 8 10

*** * *

.1

.2.3

.4

.6

.7

.8

.9

Best Score

*posR

Rneg

An Experiment Bover

No. of Training Data

Figure 5.

7 9

.949 .948 .951

.961 .961 .943

x

Exp(3,x)

Overall

=0.961

=0.949

.953

.951.950

.950

guage, we have described some experimental results

via a machine identi�cation system based on the learn-

ing algorithm developed above. In a usual problem

setting, the identi�cation of protein sequence families

is performed by the homology search technique to con-

struct a \template" which gives a favorable score(e.g.,

[7, 16]). Our method is unique in that the learning al-

gorithm in this paper may provide an automatic way

of deriving such a template. We applied the system

to the problem of identifying the protein �-chain in

amino acid sequences for hemoglobin. The main ex-

perimental data showed that the system achieved an

overall success rate of 95 % correct identi�cation for

positive data, and 96 % for negative data. Under the

same working assumption made here, the prediction

problem of other domains such as protein �-helix re-

gion in amino acid squences could be also attacked in

terms of the same identi�cation strategy, which we are

working on.

Finally, it is known that locally testable languages

have some connections to a certain subclass of �rst-

order logic formulas. (In other words, they can provide

a certain class of logical formulas for describing the ge-

netic information.) In this sense, the main experimen-

tal result stated in this article has a deep implication

to the bio-informatics aspect of DNA sequences, and

also suggests the great potential of a successful formal

method based on the automata-theoretic approach to

DNA sequence analysis in general.

Acknowledgements

The authors are grateful to anonymous referees for

their useful suggestions which greatly improved the

draft of this paper.

This work is supported in part by Grants-in-Aid for

Scienti�c Research No.04229105 from the Ministry of

Education, Science and Culture, Japan.

References

[1] D. Angluin. Inductive inference of formal lan-

guages from positive data. Information and Con-

trol, 45:117{135, 1980.

[2] D. Angluin. Inference of reversible languages.

Journal of the ACM, 29:741{765, 1982.

[3] K. Culik II and T. Harju. Dominoes and the regu-

larity of DNA splicing languages. In K. Mehlhorn,

editor, Proceedings of ICALP'89, LNCS 182,

Springer-Verlag, pages 222{233, 1989.

[4] H. Dayho� and H. Calderone. Composition of

proteins. Altas of Protein Sequence and Struc-

ture, 5(3):363{373, 1978.

[5] R.W. Gatterdam. Splicing systems and regular-

ity. International Journal of Computer Mathe-

matics, 31:63{67, 1989.

[6] E. M. Gold. Language identi�cation in the limit.

Information and Control, 10:447{474, 1967.

[7] M. Gribskov, A.D. McLachlan, and D. Eisenberg.

Pro�le analysis: Detection of distantly related

proteins. Proc. of Natl. Acad. Sci. USA, 84:4355{

4358, 1987.

[8] M.A. Harrison. Introduction to Formal Language

Theory. Addison-Wesley, Reading, MA, 1978.

[9] T. Head. Formal language theory and DNA :

An analysis of the generative capacity of speci�c

recombinant behaviors. Bulletin of Mathematical

Biology, 49:737{759, 1987.

[10] O.H. Ibarra and T. Jiang. Learning regular lan-

guages from counterexamples. In Proceedings of

1st Workshop on Computational Learning The-

ory, pages 337{351, 1988.

[11] R. McNaughton and S. Papert. Counter-Free Au-

tomata. MIT Press, Cambridge, MA, 1971.

[12] S. Miyano, A. Shinohara, S. Arikawa, S. Shimo-

zono, T. Shinohara, and S. Kuhara. Knowledge

acquisition from amino acid sequences by decision

trees and indexing. In Proceedings of Genome In-

formatics Workshop III, pages 69{72, 1992.

[13] PRF(Protein Research Foundation) Database.

[14] T. Shinohara. Polynomial time inference of ex-

tended regular pattern languages. In Proceedings

of RIMS Symposia on Software Science and Engi-

neering, LNCS 147, Springer-Verlag, pages 115{

127, 1983.

[15] R. Siromoney, K.G. Subramanian, and V.R.

Dare. Circular DNA and splicing systems. In

Proceedings of Intern. Conference on Parallel Im-

age Analysis, Lecture Notes in Computer Science

654, Springer-Verlag, pages 260{273, 1992.

[16] G.D. Stormo and G.W. Hartzell III. Identifying

protein-binding sites from unaligned DNA frag-

ments. Proc. of Natl. Acad. Sci. USA, 86:1183{

1187, 1989.

[17] Y. Takada and R. Siromoney. On identify-

ing DNA splicing systems from examples. In

P.K.Jantke, editor, Proceedings of AII'92, LNAI

642, Springer-Verlag, pages 305{319, 1992.

[18] N. Tanida and T. Yokomori. Polynomial-time

identi�cation of strictly regular languages in the

limit. IEICE Transactions on Information and

Systems, E75-D:125{132, 1992.

[19] J.D. Watson, J. Tooze, and D.T. Kurtz. Recombi-

nant DNA: A Short Course. Freeman, New York,

1983.

[20] T. Yokomori. A note on the polynomial-time

identi�cation of strictly locally testable languages

in the limit. Report CSIM 90-03, Dept. of

Comupt. Sci. and Inf. Math., Univ. of Elect.-

Communi., 1990.

[21] T. Yokomori and N. Ishida. Learning local

languages and its applications to protein �-

chain prediction. Report CSIM 93-03, Dept. of

Comupt. Sci. and Inf. Math., Univ. of Elect.-

Communi., 1993.