Upload
waseda
View
0
Download
0
Embed Size (px)
Citation preview
Learning Local Languages and Its Application to
Protein �-Chain Identi�cation]
Takashi Yokomori Nobuyuki Ishida Satoshi Kobayashi
Department of Computer Science and Information Mathematics
University of Electro-Communications
Chofu, Tokyo 182, Japan
October 7, 1996
Abstract
This paper concerns an e�cient algorithm for
learning in the limit a special type of regular lan-
guages called locally testable languages from positive
data, and its application to identifying the protein
�-chain region in amino acid sequences. First, we
present a linear time algorithm that, given a locally
testable language, learns (identi�es) its deterministic
�nite state automaton in the limit from only positive
data. This provides us with a practical and e�cient
learning method for a speci�c domain of symbolic anal-
ysis.
We then describe several experimental results using
the learning algorithm developed above. Following a
theoretical observation which strongly suggests that a
certain type of amino acid sequences can be expressed
by a locally testable language, we apply the learning
algorithm to identifying the protein �-chain region in
amino acid sequences for hemoglobin. Experimental
scores show an overall success rate of 95 % correct
identi�cation for positive data, and 96 % for negative
data.
1 Introduction
In the study of inductive inference of formal lan-
guages, Gold [6] showed that the class of languages
containing all �nite sets and one in�nite set is not
learnable in the limit from positive data only. This
fact was shocking in a sense because a simple implica-
tion is that even the class of regular languages is not
learnable in the limit from positive data.
Angluin [1] has given several conditions for the class
of languages to be learnable in the limit from posi-
tive data, and presented some examples of learnable
classes. She has also proposed subclasses of regular
languages called k-reversible languages for each k � 0
and shown these classes are learnable in the limit from
positive data with the conjectures updated in polyno-
mial time([2]). Here, the learnability from only pos-
itive data as well as polynomial-time e�ciency is of
great importance from the practical viewpoint.
On the other hand, the notion of a splicing sys-
tem was introduced by Head in [9] as a mathematical
model of restriction enzyme digestion and subsequent
religation in the recombination of DNA molecules. He
showed an interesting relationship between splicing
languages generated by splicing systems and regular
languages. In particular, one of the most signi�cant
results for our purpose is the equivalence between a
certain type of splicing languages, called persistent
splicing languages, and a subclass of regular languages
called locally testable languages. This result is cru-
cially instructive in that, as we will show, it can bridge
the gap between mathematical analysis in molecule bi-
ology and formal language theory in computer science.
In this paper, we will �rst show that, using deter-
ministic �nite state automata(abb., DFAs), for each
k � 1, the class of k-testable languages is learnable
in the limit from positive data with the conjectures
updated in linear time. The class of locally testable
languages consists of all classes of k-testable languages
for every k � 1.
Then, motivated by a theoretical result due to
Head([9]) mentioned above which strongly suggests
that a certain type of amino acid sequences can be
expressed by a locally testable language, we describe
a simple experimental result via a machine identi�-
cation system based on the learning algorithm devel-
oped above. We apply the system to identifying the
protein �-chain region in amino acid sequences for
hemoglobin. Some experimental data show that the
system achieved an overall success rate of 95 % correct
identi�cation for positive data, and 96 % for negative
data. Thus, the main experimental result stated in
this article suggests the great potential of a successful
formal method based on the automata-theoretic ap-
proach to DNA sequence analysis in general.
This article is organized as follows. After preparing
basic de�nitions and notations in Section 2, we intro-
duce the notion of a k-testable language and study its
formal properties in Section 3. A problem of learning
k-testable languages in the limit from positive data
is formalized and a linear time algorithm for solving
the problem is presented. Section 4 in turn deals with
an application issue of the algorithm developed in the
previous section. First, we state a theoretical result on
relationship between a formal model for describing the
behaivors of recombination of DNA molecules, called
splicing system, and a k-testable language. Then, we
apply the learning algorithm to an identi�cation prob-
lem of protein �-chain region in amino acid sequences,
to obtain high scores of overall success rates in several
experiments.
2 Preliminaries
2.1 Basic De�nitions and Notations for
Formal Languages
We assume the reader to be familiar with the rudi-
ments of automata and formal language theory. (For
notions and notations not stated here, see, e.g., [8].)
Let � be a �xed �nite alphabet and �
�
be the set
of all �nite-length strings over �. Further, let �
+
=
�
�
� f�g, where � is the null string. By lg(u) we
denote the length of string u. A language over � is a
subset of �
�
. The cardinality of a set S is denoted by
jSj.
Let L
k
(w) and R
k
(w) be the pre�x and the su�x
of w of length k, respectively. Further, let I
k
(w) be
the set of interior solid substrings of w of length k.
These are de�ned only when w has length k or more.
If w has length k, then L
k
(w) = R
k
(w) = w, while if
w has length k or k + 1, then I
k
(w) is empty. Let x
be a string over � and L be a language over �. The
left-quotient of L and x, denoted by xnL, is de�ned by
xnL = fy 2 �
�
jxy 2 Lg.
A determinisitic �nite state automaton (DFA) is a
5-tuple M = (Q;�; �; p
0
; F ), where Q is a �nite set of
states, � is a �nite alphabet of input symbols, p
0
(2 Q)
is the initial state, F (� Q) is the set of �nal(accepting)
states, and � is a state transition mapping such that
for all p 2 Q and a 2 �, �(p; a) is in Q or unde�ned.
A mapping � is extended to Q��
�
as follows: for all
p 2 Q, �(p; �) = p, and for all p 2 Q, x 2 �
�
and
a 2 �, �(p; xa) = �(�(p; x); a). A string w over � is
accepted byM if and only if �(p
0
; w) 2 F . A language
L is accepted byM , denoted by L = L(M), if and only
if L is exactly the set of all strings accepted by M .
2.2 Learning in the limit from Positive
Data
Let L be a class of languages to be learned over a
�xed alphabet �. Further, let M be a class of DFAs
such that for all L 2 L, there exists M 2 M such that
L = L(M).
For a given M 2 M, a positive presentation of
L(M) is any in�nite sequence of examples such that
every w 2 L(M) occurs at least once in the sequence
and no other examples not in L(M ) appear in the se-
quence.
LetM be a DFA inM accepting a given L(i.e., L =
L(M)). An algorithm A is said to learn(or identify)
a language L in the limit from positive data using M
if and only if for any positive presentation of L, the
in�nite sequence of DFAs M
i
in M produced by A
satis�es the property that there exists a DFAM
0
inM
such that for all su�ciently large i, the i-th conjecture
(DFA) M
i
is identical to M
0
and L(M
0
) = L(M).
A class of languages L is learnable in the limit from
positive data (using M) if and only if there exists an
algorithm A that, given an L in L, learns L in the
limit from positive data using M.
Let A be an algorithm for learning a language class
L in the limit from positive data. A class L is learnable
in the limit from positive data with the conjectures
updated in polynomial time if and only if there exists
an algorithmA for learning L in the limit from positive
data with the property that there exists a polynomial
p such that for any n, for any L in L for which a correct
DFA is of size n, and for any positive presentation of L,
the time used byA between receiving the i-th example
w
i
and outputting the i-th conjectured DFA M
i
is at
most p(n;m
1
+ � � �+m
i
), where m
j
= lg(w
j
)(1 � j �
i).
3 Learning Locally Testable Languages
from Positive Data
3.1 Locally Testable Languages
Given a positive integer k, let �
k
= fw 2
�
�
jlg(w) = kg (i.e., the set of all strings over � of
length k). A language L over � is k-testable if and
only if there exist �nite sets A;B, and C such that
A;B;C � �
k
, and for all w with lg(w) � k, w 2 L if
and only if L
k
(w) 2 A, R
k
(w) 2 B, and I
k
(w) � C.
(In this case, (A;B;C) is called a triple for L.) A
language L is locally testable if and only if there exists
an integer k � 1 such that L is k-testable ([11]).
Note 1 (1) In the literature [11], a k-testable lan-
guage in this paper is called a strictly k-testable lan-
guage.
(2) If L is k-testable, then L is k
0
-testable for all
k
0
> k. Further, note that the de�nition of \k-
testable" says nothing about the strings of length k�1
or less.
(3) In the sequel, we will omit the proofs for all
formal results simply because of the space limit, and
the reader interested in those is suggested to refer to
[20].
In order to discuss the relationship between the
class of locally testable languages and that of re-
versible languages, we give the de�nition of re-
versible languages based on the language-theoretic
characterizations([2]). Let k be a non-negative inte-
ger. A regular language L is k-reversible if and only
if whenever u
1
vw and u
2
vw are in L and lg(v) = k,
it holds that u
1
vnL = u
2
vnL. (When k = 0, we of-
ten write \zero-reversible" rather than \0-reversible".)
A regular language is reversible if and only if it is k-
reversible for some k � 0.
Example 1 Consider a language L consisting of all
�nite-length strings over �(= fa; bg) that begin with
a, end with a and contain no occurrence of a as a
strictly interior subword, that is, L is the language
denoted by the regular expression a+ ab
�
a.
Let A = B = fag; and C = fbg. Then, it is easily
seen that for all w 2 �
�
with lg(w) � 1, w 2 L if and
only if L
1
(w) 2 A;R
1
(w) 2 B, and I
1
(w) � C. Hence,
L is 1-testable. It is seen that L is not a zero-reversible
language.
We have the following theorem.
Theorem 1 For each k � 1, the class of k-testable
languages is properly included in the class of (k + 1)-
reversible languages, but incomparable to the class of
zero-reversible languages.
In fact, we can show the following result that provides
a formal method for expressing k-testable languages
in more explicit fashion.
Lemma 2 Let L be a language such that L � �
k
�
�
.
Then, L is k-testable if and only if there exists a new
triple [A
0
; B
0
;C
0
] such that A
0
; B
0
; C
0
� �
k
and L =
A
0
�
�
\�
�
B
0
� �
+
C
0
�
+
.
Thus, in what follows, we assume the following:
[CONVENTION]
We adopt the following de�nition for a k-testable lan-
guage: L is k-testable if and only if (1) there exists
a triple S = [A;B;C] such that A;B;C � �
k
and
L = L(S), and (2) for no k
0
less than k, L is k
0
-
testable, where L(S) = A�
�
\�
�
B � �
+
C�
+
.
3.2 K-testable Language Associated with
a Finite Set
In this section, we will show that given a �nite set
of strings T , one can e�ectively construct a k-testable
language L
T
which is the smallest k-testable language
containing T in the sense that any other k-testable
language containing T is a superset of L
T
.
Let T be a �nite subset of �
k
�
�
. Then, construct
�nite sets A
T
, B
T
, and C
T
as follows:
A
T
= fL
k
(w)jw 2 Tg
B
T
= fR
k
(w)jw 2 Tg
C
T
= �
k
� I
k
(T ); where I
k
(T ) =
[
w2T
I
k
(w):
Consider a triple S
T
= [A
T
; B
T
; C
T
] and a language
L(S
T
) = A
T
�
�
\�
�
B
T
��
+
C
T
�
+
:
Then, obviously A
T
; B
T
; C
T
� �
k
. A set L(S
T
) is
called a k-testable language associated with T. A lan-
guage L(S
T
) is the smallest k-testable language con-
taining T , which is formally stated as follows:
Lemma 3 Let L be an arbitrary k-testable language
such that L � �
k
�
�
. Then, T � L implies that
L(S
T
) � L.
Note that L(S
T
) can be completely characterized
by T . In fact, we can generalize the above result so
that any k-testable language may have its �nite subset
with such a property.
Let L be a k-testable language such that L � �
k
�
�
.
A �nite subset R of �
k
�
�
is called a characteristic
sample for L if and only if L is the smallest k-testable
language containing R. The next lemma proves the
usefulness of the notion of a characteristic sample for
a language.
Lemma 4 Suppose that R � T � L, where R and T
are �nite, L is a k-testable language such that L �
�
k
�
�
. If R is a characteristic sample for L, then L =
L(S
R
) = L(S
T
) hold.
Thus, the existence of a characteristic sample for a
language L assures the existence of a correct learning
strategy for L.
In fact, we can prove the following.
Lemma 5 For any k-testable language L, there e�ec-
tively exists a characteristic sample R
L;k
for L.
3.3 Constructing a DFA associated with
a Triple
Let L be a k-testable language such that L =
L(S)(= A�
�
\ �
�
B � �
+
C�
+
), for some triple S =
[A;B;C]. The purpose of this section is to show how
one can construct a characteristic sample for L. First,
consider a DFA M
S
= (Q;�; �; p
0
; F ) constructed
from S = [A;B;C] in the following way :
Q =
�
Q
1
[ f[bx]jx 2 A \B \ Cg ( if A \ B \ C 6= ;)
Q
1
( otherwise )
where
Q
1
= f[�]; [a
1
]; [a
1
a
2
]; :::; [a
1
� � �a
k�1
]ja
1
� � �a
k
2 Ag
[f[x]jx 2 (�
k
� C) [ A [Bg;
p
0
= [�]
F and � are de�ned as follows:
F =
�
f[�]j� 2 Bg [ f[bx]jx 2 A \B \ Cg( if A \B \ C 6= ;)
f[�]j� 2 Bg (otherwise)
(1) for all � = a
1
� � � a
k
2 A;
�(p
0
; a
1
) = [a
1
]
�([w
i
]; a
i+1
) = [w
i
a
i+1
]
(where w
i
= a
1
� � � a
i
(1 � 8i � k � 2) )
�([w
k�1
]; a
k
) =
�
[b�] ( if � 2 B \ C)
[�] (otherwise)
(where w
k�1
= a
1
� � � a
k�1
)
(2) for all [ax]; [xb] 2 Q such that lg(x) = k � 1;
ax 2 A; xb 2 (�
k
� C) [ B;
�([cax]; b) = [xb] (if ax 2 B \ C)
�([ax]; b) = [xb] (otherwise)
(3) for all [ax]; [xb] 2 Q such that lg(x) = k � 1;
ax 2 (�
k
� C); xb 2 (�
k
� C) [ B
�([ax]; b) = [xb]:
We call M
S
the canonical DFA associated with S. (See
Example 2 and Figure 1. It should be noted that
the terminology \canonical" is used here in a manner
di�erent from the usual sense.)
We observe that, for a given triple S for a k-testable
language L, the canonical DFA M
S
= (Q;�; �; p
0
; F )
a
ab
b
a a a
ab
b
[a] [aa] [aa]
[ab] [ba]
[bb]
Figure 1
&K[ ]
Canonical DFA M associated with S in Example 2
S
associated with S is not always minimum. However,
by applying an appropriate state minimization algo-
rithm to M
S
, we could obtain the minimum DFA
M
L
= (Q
L
;�; �
L
; p
0
; F
L
) exactly accepting L, if nec-
essary.
Example 2 Consider a 2-testable language L de�ned
by a triple S = [A;B;C] as follows : L = A�
�
\
�
�
B��
+
C�
+
, where � = fa; bg; A = B = C = faag.
Note that L is the language denoted by the regular
expression: aa+ aa(b
+
a)
�
a.
We construct the canonical DFA M
S
= (Q;
�; p
0
; �; F ) associated with S = [A;B;C], where Q =
f[�]; [a]; [aa]; [ab]; [ba]; [bb]g [ f[caa]g, p
0
= [�], and
F = f[caa]; [aa]g. Further, � is de�ned by
�([�]; a) = [a]; �([a]; a) =
d
[aa], �(
d
[aa]; a) = [aa]
�(
d
[aa]; b) = [ab]; �([ab]; a) = [ba]; �([ab]; b) = [bb]
�([ba]; a) = [aa]; �([ba]; b) = [ab]; �([bb]; a) = [ba]
�([bb]; b) = [bb].
The state-transition graph of M
S
is given in Figure 1.
The canonical DFAM
S
associated with S is equivalent
to S in the following sense.
Lemma 6 For any w 2 �
k
�
�
, w is in L(S) if and
only if it is accepted by M
S
.
3.4 Learning Algorithm : LA
We now present a learning algorithm LA. The
learning protocol for LA consists of only the posi-
tive presentation of an unknown k-testable language
U over the �xed alphabet �. It is shown that LA given
below eventually learns in the limit a DFA M
E
such
that U = L(M
E
), where L(M
E
) is the language ac-
cepted by M
E
. More speci�cally, from Lemma 4 and
Lemma 5, we can show the following :
Lemma 7 Let M
E
0
;M
E
1
; :::;M
E
i
; ::: be the sequence
of DFAs produced by LA, where E
i
is the set of pos-
itive data at the i-th stage. Then, (i) for all i � 0,
L(M
E
i
) � L(M
E
i+1
) � U , and (ii) there exists r � 0
such that for all i � 0, M
E
r
= M
E
r+i
and L(M
E
r
) =
U .
Thus, we have the main theorem.
Theorem 8 Given an unknown k-testable language
U, the algorithm LA learns in the limit a DFA M
E
such that U = L(M
E
) from positive data.
[ALGORITHM LA]
Input: a positive integer k and a positive
presentation of a target k-testable
language U
Output: a sequence of DFAs for k-testable
languages
Procedure
initialize E
0
= ; ;
let S
E
0
= [;; ;;�
k
] be the initial triple ;
construct DFA M
E
0
accepting E
0
(= ;) ;
repeat (forever)
let M
E
i
= (Q
E
i
;�; �
E
i
; p
0
; F
E
i
) be
the current DFA ;
read the next positive example w
i+1
;
if w
i+1
2 L(M
E
i
), then output M
E
i+1
(= M
E
i
);
else
scan w
i+1
to compute
L
k
(w
i+1
); R
k
(w
i+1
); I
k
(w
i+1
) ;
construct the canonical DFA
M
E
i+1
associated with
S
E
i+1
= [A
E
i+1
; B
E
i+1
; C
E
i+1
] ;
(where A
E
i+1
= A
E
i
[ fL
k
(w
i+1
)g,
B
E
i+1
= B
E
i
[ fR
k
(w
i+1
)g,
C
E
i+1
= C
E
i
� I
k
(w
i+1
)
and E
i+1
= E
i
[ fw
i+1
g)
output M
E
i+1
;
[Time Analysis of LA]
We analyse the time complexity of the algorithm
LA needed to update a conjecture.
For every i � 0, each time a new positive example
w
i+1
is given, we observe :
(1) In the �rst phase, testing whether or not w
i+1
2
L(M
E
i
) is done in time O(lg(w
i+1
)).
(2) In the second phase, computing L
k
(w
i+1
);
R
k
(w
i+1
), and I
k
(w
i+1
) is done in time O(klg(w
i+1
)).
At the same time, �
E
i
is updated in time
O(kj�jlg(w
i+1
)). Hence, updating a conjecture M
E
i
requires at most O(kj�jlg(w
i+1
)).
Thus, we have the following theorem.
Theorem 9 The algorithm LA requires at most
O(j�jkm) time for updating a conjecture, where m is
the maximum length among all positive data provided
in the learning process.
4 Machine Identi�cation { An Experi-
mental Result
In the �rst half of the present paper, we have de-
veloped an e�cient algorithm LA for learning locally
testable languages in the limit from positive data, and
discussed some of the formal aspects of both the lan-
guage class and the learning algorithm LA.
The second half will discuss a possible application
of such theoretical results to a practical problem in
the domain of DNA sequence analysis.
4.1 A formal model for DNA Splicing Se-
quences
In [9] Head proposed a formal system called a splic-
ing system and studied the relationship between for-
mal language theory to infomational molecule biology.
A splicing system is a 4-tuple S = (�; I; B; C),
where � is an �nite alphabet, I is a �nite set of initial
strings, B and C are �nite sets of triples (a; x; b), called
patterns, with a; x; b 2 �
�
. For each such a triple, the
string axb is called a site and the string x is called a
crossing. Patterns in B and C are called the left(resp.,
right) patterns. For uaxbv and wcxdz in �
�
with pat-
terns (a; x; b) and (c; x; d) in the same hand(either B or
C), two new strings uaxdz and wcxbv are constructed
by splicing at the crossing x. (In a biological inter-
pretation, one may take I as the initial set of DNA
molecule sequences, B and C as the sets of splicing
rules, speci�ed by restricted enzymes and a ligase, that
produce 5
0
overhangs or blunt ends, and produce 3
0
overhangs, respectively.)
The language L = L(S) generated by S consists of
the strings in I and all strings that can be obtained
by adjoining to L uaxdz and wcxbv, whenever uaxbv
and wcxdz are in L and both patterns (a; x; b) and
(c; x; d) are in the same hand. (In other words, L(S)
is the minimal subset of �
�
which contains I and is
closed under the operation of splicing.) A language L
over � is a splicing language if there exists a splicing
system S such that L = L(S).
Here we are concerned with a speci�c type of splic-
ing systems. A splicing system is said to be persistent
if for each pairs of strings uaxbv and wcxdz in �
�
with (a; x; b) and (c; x; d) from the same hand, if y is a
subsegment of uax(resp., xdz) that is the crossing of
a site in uaxbv(resp., wcxdz), then this same subseg-
ment y of uaxdz contains an occurrence of the crossing
of a site in uaxdz. (Intuitively, in a persistent splic-
ing system, it is possible to apply consecutive splicing
operations in�nitely many times.)
It is known that if we specify B and C by choosing
restriction enzymes from actual biological world, then
the resulting systems are often persistent([9]). (For ex-
ample, in order to obtain persistent splicing systems,
one may choose restriction enzymes from Appendix
of the literature [19]. Further, any splicing system
containing only one restriction enzyme is always per-
sistent, even if the single enzyme is HgiAI.)
An important result for our purpose is the follow-
ing.
Proposition 10 ([9]) For a language L, the follow-
ings are equivalent:
1. L is a locally testable language,
2. L is generated by a persistent splicing system.
Example 3 ([9]) (1) Consider a splicing system
S = (fa; bg; fababag; f(a; b; a)g; ;). It is easy to see
that S is persistent and L(S) = faba(ba)
n
jn � 0g. A
biochemical interpretation of this system is that tak-
ing a = [A=T ][T=A] and b = [C=G][G=C], the result-
ing language L(S) consists of all molecules that can
be potentially arised from the action of ClaI (with an
appropriate ligase) on (copies of) the molecule :
[
A
T
][
T
A
][
C
G
][
G
C
][
A
T
][
T
A
][
C
G
][
G
C
][
A
T
][
T
A
](= ababa):
A language L(S) is in fact a 2-testable langauge that
is accepted by a DFA pictured in Figure 2.
(2) Let S = (fa; c; g; tg; I ;B;C), where a =
[A=T ]; t = [T=A]; c = [C=G]; g = [G=C], I is un-
speci�ed here, B consists of the patterns speci�ed by
EcoRI, TaqI and AluI and C contains the single pat-
tern by HhaI. The patterns provided by these four
enzymes are as follows:
a
a
b
[a]&K[ ] [ab] [ba]b
Figure 2
(g; aatt; c) for EcoRI, (t; cg; a) for TaqI
(ag; �; ct) for AluI, (g; cg; c) for HhaI.
A careful examination proves that this system is per-
sistent.
We call fa; c; g; tg the standard 4 letter alphabet and
denote it by S
4
. Note that in order to have a 2-testable
language L(S), we make use of a symbolic replacement
in (1) of Example 3 above.
Such a translation is frequently used in analysing
DNA sequences. In fact, we will describe several ex-
perimental results in the next section where a letter-
to-letter translation (i.e., a coding) is used to reduce
the size of the problem in question smaller, leading to
a feasible solution.
4.2 Protein �-chain Identi�cation
From Proposition 10, it may be justi�ed that we
assume the following :
[AWorking Assumption] Some types of amino acid
sequences are generated by persistent splicing systems,
and hence, they are characterized by locally testable
languages.
Under this assumption, in this section, we describe
some experimental results that concern identifying a
protein �-chain region. We have implemented the
learning algorithm for locally testable languages de-
veloped in the previous section on a workstation, and
applied it to the problem of identifying the �-chain
region in amino acid sequences for hemoglobin.
(1) Goal : The purpose of this experiment is to �nd a
DFA that identi�es the �-chain region in amino acid
sequences for hemoglobin.(One hemoglobin molecule
contains two �-chains and two �-chains, and the �-
chain is a polypeptide normally comprising 141 amino
acids.)
(2) Method : The positive data of �-chain in amino
acid sequences for hemoglobin have been drawn from
PRF protein sequence database([13]). Let us denote
the set of those raw positive data by POS
raw
that is a
�nite set of strings over the alphabet of 20 symbols,
A
20
, each of which represents each amino acid residue.
We used two kinds of letter-to-letter translations
: one is from A
20
to the 6 letter alphabet D
6
(=
fa; b; c; d; e; fg) speci�ed by Dayho�'s coding method,
the other from A
20
to a binary alphabet B(= f0; 1g)
used in [12], shown in Table 1. (A sample of a raw
positive data and its translation results via Dayho�'s
coding and binary coding are given in Figure 3.)
We now describe the details of our experiments.
Let's denote by POS the set of positive data obtained
from POS
raw
in terms of a translation via either Day-
ho�'s coding or binary coding in Table 1. In our ex-
periments, the cardinality of POS was 123.
We have also prepared a set of negative data NEG,
consisting of 2567 sequences. NEG was constructed
by randomly choosing from all protein data but
hemoglobin data in PRF database. The length of data
ranges from 138 to 142 for NEG and 141 or 142 for POS.
By Exp(k; x) we denote an experiment performed
by the method described below, where k is the pa-
rameter for k-testability and x is the cardinality of
Sample, a set of training data randomly selected from
POS.
4.2.1 Experiments Based on Dayho�'s coding
We now have the new set of positive data POS over D
6
Here, we �xed a parameter k as k = 2.
[An experiment Exp(2; x)] For each x(=10, 15,
20, 25, 30), we constructed Sample consisting of x se-
quences that were randomly selected from POS. Then,
using Sample, the learning algorithm LA described
in Section 3 produced a DFA M
2;x
that speci�es a
2-testable language containing all training data in
Sample. We then evaluated the performance of the
obtained DFA, using the set of test data Test and
NEG, where POS=Sample [ Test(disjoint union). The
success rates of correct identi�cation for positive and
negative data are de�ned by:
R
pos
=
jTest \ L(M
2;x
)j
jTestj
; R
neg
=
jNEG \L(M
2;x
)j
jNEGj
;
respectively, where L(M
2;x
) denotes the complement
of L(M
2;x
) with respect to D
�
6
(that is, the set of
strings over D
6
rejected by M
2;x
).
For each x(= 10; 15; 20; 25; 30), we have iteratevely
made an experiment Exp(2; x) 100 times. Figure 4
illustrates a series of experiments Exp(2; x) where the
�gures indicate the average scores on each experiment.
Table 2 summarizes the experimental results in which
Dayhoff’s coding
(coded data)
binary coding
. . .
VLSPADKTNVKAAWGKVGAHAGEY . . .
eebbbcdbdedbbfbdebbdbbcf
. . . 000101100010000100010010
A raw data from human hemoglobin is coded into a new data over D or 6 B
A part of -chain in human hemoglobin (raw data)&A
Figure 3
(I) shows that the DFA M
2;30
achieved a success rate
of 93 % correct identi�cation for (positive) test data,
and 99 % for negative data.
4.2.2 Experiments Based on Binary Coding
Using a binary coding in [12], we have a set of positive
data POS over B which is obtained from POS
raw
.
By �xing x = 20, we �rst made experiments
Exp(k; 20) for each k = 2; 3; 4; 5; and 6. In each case,
Exp(k; 20) has been iteratively performed 100 times.
The �nal results are presented in (II) of Table 2, where
the �gures indicate the average scores. For example,
we observe that an experiment Exp(3; 20) achieves the
overall best score : a success rate of 96% correct iden-
ti�cation for (positive) test data and 92% correct iden-
ti�cation for negative data.
Based on this observation, we then made experi-
ments Exp(3; x) for each x = 6; 7; 8; 9; and 10. Again,
for each x, Exp(3; x) has been performed 100 times.
Figure 5 illustrates the results of a series of Exp(3; x),
where the �gures indicate the average.
4.3 Discussion
First, we will mention some theoretical considera-
tion, immediately derived from our previous results,
and its molecule biological implication.
Suppose that a representation relation L = h(L
0
)
holds, where L is a k-testable language over D
6
and h
is a letter-to-letter coding from A
20
to D
6
. Then, we
note that L
0
is contained in a k-testable language over
A
20
.
Further, if a set of amino acid sequences is a k-
testable language over A
20
, then it is regarded as 3k-
testable language over the standard 4 letter alphabet
S
4
(= fa; c; g; tg).
From these consideration, the following theorem
holds :
Theorem 11 If L is a k-testable language over D
6
,
then a language L
0
such that L = h(L
0
) is contained
in a 3k-testable language over S
4
.
Thus, for example, using a result obtained in
Exp(2;30) with k = 2, one may make a conjecture that
a set of DNA sequences is contained in a 6-testable
language over S
4
, which is strongly supported by the
theoretical discussion made in Section 4.1.
Table 1. Translation Tables
(I) Dayho�'s Coding([4])
Amino Acids Properties New Symbol
C sulfur polymerization a
S, T, P, A, G small b
N, D, E, Q acid & amide c
H, R, K basic d
M, I, L, V hydrophobic e
F, Y, W aromaticity f
(II) Binary Coding([12])
Amino Acids Hydropathy Index New Symbol
A, C , F, G, I, L, M,
N, S, T, V, W, Y High 0
D, E, H, K, P, Q, R Low 1
Table 2. Experimental Results | Best Scores
(I) (Via Dayho� coding)
(k = 2) Exp(2; 30)
R
pos
93%
R
neg
99%
(II) (Via Binary Coding based on Hydropathy Index)
Exp(k,20) k = 2 k = 3 k = 4 k = 5 k = 6
R
pos
97.4% 96,2% 92.6% 87.9% 81.5%
R
neg
83.3% 92.3% 93.6% 94.9% 99.2%
[�-Chain Automata]
Through the experiments, several DFAs have been
identi�ed from sets of training data that have been se-
lected in a random manner from a set of positive data.
It is expected that these DFAs have some common fea-
ture, because they all have been obtained as the best
candidate DFA for specifying the same virtual target
of a locally testable language(i.e., a persistent splicing
langauge). However, since it is hard to compare DFAs
with distinct alphabets, let us �rst focus on the se-
ries of DFAs over B, M
k;20
(k = 2; 3; 4; 5; 6), obtained
in Exp(k; 20). Provided that the overall performance
score of a DFA is evaluated by a simple summation :
R
pos
+R
neg
, we observe thatM
3;20
might be called the
best DFA over B. We also note from Figure 5 that all
DFAs M
3;x
(for x = 6; 7; 8; 9; 10) equally achieve high
scores in overall performance evaluation. This sug-
gests one aspect of the stability of the obtained DFA
family. Further, it is somehow surprising that only 6
(and at most 10) training data randomly selected are
su�cient for constructing such a good DFA character-
izing �-chain regin in binary representation.
On the other hand, suppose that we compare two
DFAs M
3;6
over B and M
2;30
over D
6
. We then �nd
it di�cult to discuss one's superiority over the other.
Because, we note that these experiments have been
performed based on the distinct training data sets
randomly chosen under the di�erent coding methods.
Further, two codings we used are basically indepen-
dent of each other. Thus, it seems that at present
we have no justi�able way to select one DFA as the
best automaton for explaining �-chain in amino acid
sequences. Nevertheless, it may be claimed that each
DFA M
2;30
(M
3;x
for x = 6; 7; 8; 9; 10 or some other
x � 20 ) captures (at least) a part of the essential char-
acteristic of �-chain region for hemoglobin. In fact, in
the course of performing some supplemental experi-
ments whose purpose was to �nd a minimum DFA
with the same high level of identi�cation scores (over
90%) as above, we have obtained a DFA M
3;12
over
B that exactly accepts the language 000(0 + 1)
�
101
denoted by the regular expression([21]). Thus, an in-
teresting fact is that with high probability of over 90%,
the structure of �-chain region of amino acid sequence
can be characterized by only its pre�x and su�x of
length 3, in binary representation.
In summing-up, using the present method for learn-
ing locally testable languages, we could obtain a spe-
ci�c type of DFA M
�
that one might call an �-chain
DFA for hemoglobin. A machine M
�
will then be ap-
plicable to identifying the �-chain region for unknown
amino acid sequences with a very high success rate.
Finally, from the other supplemental experiments,
we conjecture that a DFA with a similarly high score
could be obtained from a set of positive raw data over
A
20
, i.e., POS
raw
if the training set is su�ciently large.
Here, we could eventually collect only 123 positive raw
data in total from the PRF databse, and as a result,
Success Rate
1.0
0.5
0
*** *
.1
.2.3
.4
.6
.7
.8
.9
Best Score
*posR
Rneg
An Experiment D 6over
*
No. of Training Data
Figure 4.
x1510 20 25 30
.803.854
.887 .915
.997 .995 .994 .992
=0.931
=0.991
.991
.931
Exp(2,x)
Overall
we could not help using two codings to obtain our
main results.
5 Conclusions
We have shown that the class of k-testable lan-
guages is learnable in the limit from positive data, and
presented an algorithm which learns any k-testable
language in the limit from positive data using DFAs.
There exist a few subclasses of regular languages
known to be learnable in the limit from positive data
in polynomial time in some sense([10, 14, 18]). It seems
that further research in this direction should be made
for enlarging the boundary of polynomial-time learn-
ability from positive data.
Since the notion of a splicing system was intro-
duced by Head in [9] as a mathematical model of re-
striction enzyme digestion and subsequent religation
in the recombination of DNA molecules, there have
been reported several works concerning the formal
characterizations of splicing systems and their learn-
ing problem([3, 5, 15, 17]). In particular, one of the
most signi�cant results for our purpose was the equiv-
alence relationship between a certain type of splicing
languages, called persistent splicing languages, and the
class of locally testable languages.
Motivated by this result due to Head([9]) which
strongly suggests that a certain type of amino acid
sequences can be expressed by a locally testable lan-
Success Rate
1.0
0.5
0 6 8 10
*** * *
.1
.2.3
.4
.6
.7
.8
.9
Best Score
*posR
Rneg
An Experiment Bover
No. of Training Data
Figure 5.
7 9
.949 .948 .951
.961 .961 .943
x
Exp(3,x)
Overall
=0.961
=0.949
.953
.951.950
.950
guage, we have described some experimental results
via a machine identi�cation system based on the learn-
ing algorithm developed above. In a usual problem
setting, the identi�cation of protein sequence families
is performed by the homology search technique to con-
struct a \template" which gives a favorable score(e.g.,
[7, 16]). Our method is unique in that the learning al-
gorithm in this paper may provide an automatic way
of deriving such a template. We applied the system
to the problem of identifying the protein �-chain in
amino acid sequences for hemoglobin. The main ex-
perimental data showed that the system achieved an
overall success rate of 95 % correct identi�cation for
positive data, and 96 % for negative data. Under the
same working assumption made here, the prediction
problem of other domains such as protein �-helix re-
gion in amino acid squences could be also attacked in
terms of the same identi�cation strategy, which we are
working on.
Finally, it is known that locally testable languages
have some connections to a certain subclass of �rst-
order logic formulas. (In other words, they can provide
a certain class of logical formulas for describing the ge-
netic information.) In this sense, the main experimen-
tal result stated in this article has a deep implication
to the bio-informatics aspect of DNA sequences, and
also suggests the great potential of a successful formal
method based on the automata-theoretic approach to
DNA sequence analysis in general.
Acknowledgements
The authors are grateful to anonymous referees for
their useful suggestions which greatly improved the
draft of this paper.
This work is supported in part by Grants-in-Aid for
Scienti�c Research No.04229105 from the Ministry of
Education, Science and Culture, Japan.
References
[1] D. Angluin. Inductive inference of formal lan-
guages from positive data. Information and Con-
trol, 45:117{135, 1980.
[2] D. Angluin. Inference of reversible languages.
Journal of the ACM, 29:741{765, 1982.
[3] K. Culik II and T. Harju. Dominoes and the regu-
larity of DNA splicing languages. In K. Mehlhorn,
editor, Proceedings of ICALP'89, LNCS 182,
Springer-Verlag, pages 222{233, 1989.
[4] H. Dayho� and H. Calderone. Composition of
proteins. Altas of Protein Sequence and Struc-
ture, 5(3):363{373, 1978.
[5] R.W. Gatterdam. Splicing systems and regular-
ity. International Journal of Computer Mathe-
matics, 31:63{67, 1989.
[6] E. M. Gold. Language identi�cation in the limit.
Information and Control, 10:447{474, 1967.
[7] M. Gribskov, A.D. McLachlan, and D. Eisenberg.
Pro�le analysis: Detection of distantly related
proteins. Proc. of Natl. Acad. Sci. USA, 84:4355{
4358, 1987.
[8] M.A. Harrison. Introduction to Formal Language
Theory. Addison-Wesley, Reading, MA, 1978.
[9] T. Head. Formal language theory and DNA :
An analysis of the generative capacity of speci�c
recombinant behaviors. Bulletin of Mathematical
Biology, 49:737{759, 1987.
[10] O.H. Ibarra and T. Jiang. Learning regular lan-
guages from counterexamples. In Proceedings of
1st Workshop on Computational Learning The-
ory, pages 337{351, 1988.
[11] R. McNaughton and S. Papert. Counter-Free Au-
tomata. MIT Press, Cambridge, MA, 1971.
[12] S. Miyano, A. Shinohara, S. Arikawa, S. Shimo-
zono, T. Shinohara, and S. Kuhara. Knowledge
acquisition from amino acid sequences by decision
trees and indexing. In Proceedings of Genome In-
formatics Workshop III, pages 69{72, 1992.
[13] PRF(Protein Research Foundation) Database.
[14] T. Shinohara. Polynomial time inference of ex-
tended regular pattern languages. In Proceedings
of RIMS Symposia on Software Science and Engi-
neering, LNCS 147, Springer-Verlag, pages 115{
127, 1983.
[15] R. Siromoney, K.G. Subramanian, and V.R.
Dare. Circular DNA and splicing systems. In
Proceedings of Intern. Conference on Parallel Im-
age Analysis, Lecture Notes in Computer Science
654, Springer-Verlag, pages 260{273, 1992.
[16] G.D. Stormo and G.W. Hartzell III. Identifying
protein-binding sites from unaligned DNA frag-
ments. Proc. of Natl. Acad. Sci. USA, 86:1183{
1187, 1989.
[17] Y. Takada and R. Siromoney. On identify-
ing DNA splicing systems from examples. In
P.K.Jantke, editor, Proceedings of AII'92, LNAI
642, Springer-Verlag, pages 305{319, 1992.
[18] N. Tanida and T. Yokomori. Polynomial-time
identi�cation of strictly regular languages in the
limit. IEICE Transactions on Information and
Systems, E75-D:125{132, 1992.
[19] J.D. Watson, J. Tooze, and D.T. Kurtz. Recombi-
nant DNA: A Short Course. Freeman, New York,
1983.
[20] T. Yokomori. A note on the polynomial-time
identi�cation of strictly locally testable languages
in the limit. Report CSIM 90-03, Dept. of
Comupt. Sci. and Inf. Math., Univ. of Elect.-
Communi., 1990.
[21] T. Yokomori and N. Ishida. Learning local
languages and its applications to protein �-
chain prediction. Report CSIM 93-03, Dept. of
Comupt. Sci. and Inf. Math., Univ. of Elect.-
Communi., 1993.