Grammatical Inference
Francois Coste
SML, Master SIF
2020-2021
F. Coste (Inria) Grammatical Inference SML 2020-2021 1 / 123
Grammatical Inference
Learn the grammar of a language from correct (and incorrect) sentences
N. Chomsky, Syntactic Structures, Mouton, 1957; PhD thesis, MIT, 1955
E. M. Gold, Language Identification in the Limit, Information and Control, 1967
. . .
(targeted) Applications
Syntactic pattern recognition [Fu, 1982]
Natural language, molecular biology, structured texts, Web, action planning, intrusion detection . . .
Field
Theoretical (learnability) and practical (algorithms)
Formal languages theory
Sequence of symbols s1s2 . . . sp: word
Set of words {m1,m2, . . .}: language
Set of production rules generating a language: grammar
Learning a grammar by induction: Grammatical Inference
(covers more broadly inductive learning of languages, even if the representation is not
grammatical)
Grammar
Grammar: G = 〈Σ, N, S, R〉
Σ: finite set of terminals (a, b, c, . . . )
N: finite set of non-terminals (S, T, U, . . . )
S (∈ N): axiom (start symbol)
R: set of rewriting rules; each rule is written as:
α → β, α ∈ (N ∪ Σ)∗N(N ∪ Σ)∗, β ∈ (N ∪ Σ)∗
When several rules share the same left-hand side, we write:
α → β1 | β2 | · · ·
Grammars and languages
Elementary derivation: ⇒G :
µαδ ⇒G µβδ iff ∃ α→ β ∈ R, µ, δ ∈ (N ∪ Σ)∗
Derivation ⇒∗G : finite sequence of elementary derivations
Language generated by a grammar G, L(G) :
L(G) = {m ∈ Σ∗|S ⇒∗G m}
Free Monoid Σ∗ : set of all the words on Σ
Empty word: ε or λ
Empty language: ∅ (≠ {ε})
Example: the Dyck1 grammar (balanced parentheses)
G = 〈Σ, N, S, R〉, Σ = {a, b}, N = {S}, R = {S → aSbS, S → ε}
Derivation
S ⇒ aSbS ⇒ aaSbSbS ⇒ aabSbS ⇒ aabbS ⇒ aabb
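As a quick sanity check, membership in the language of this grammar can be tested with a counter; `in_dyck1` is an illustrative name of ours, and the counter characterisation (no prefix with more b's than a's, equal counts overall) is a classical equivalent of the grammar above.

```python
def in_dyck1(word: str) -> bool:
    """Membership in L(G) for G: S -> aSbS | eps.

    A word over {a, b} is generated by this grammar exactly when no prefix
    has more b's than a's and the two counts are equal overall
    (a = open parenthesis, b = close parenthesis).
    """
    depth = 0
    for c in word:
        depth += 1 if c == "a" else -1
        if depth < 0:          # a closing 'b' with no matching 'a'
            return False
    return depth == 0

print(in_dyck1("aabb"))   # True: S => aSbS => aaSbSbS => ... => aabb
print(in_dyck1("abba"))   # False
```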
Exercises
Find the grammars generating the following languages:
- {aaba, aaa}
- All the words on {a, b} (Σ∗)
- Words on {a, b} beginning with a
- Codons on {a, c, g, t} (letter count is a multiple of 3)
- Palindromes on {a, b}: R = {S → aSa | bSb | a | b | ε}
  Biological palindromes (on {a, c, g, t}, a–t, c–g): exercise. . .
- {aⁿbⁿcⁿ | n ≥ 1}: R = {S → abc | aSAc, bA → bb, cA → Ac}
  S ⇒ aSAc ⇒ aabcAc ⇒ aabAcc ⇒ aabbcc
- Copy: {ww | w ∈ {a, b}∗}: exercise. . .
Chomsky Hierarchy
Hierarchy of grammars (and of the recursively enumerable languages they generate):
0. Unrestricted
1. Context-sensitive (French: grammaires contextuelles): α → β, |α| ≤ |β|
2. Context-free (French: grammaires algébriques): A → β, A ∈ N
3. Regular (French: grammaires régulières; automata): A → aB or A → a, with A, B ∈ N, a ∈ Σ ∪ {ε}
The Chomsky Hierarchy
Regular languages are worth inferring
For practical applications, powerful recursive models may not be required
Regular languages can account for short-term dependencies (like N-grams), but also for some long-term dependencies.
Any language can be approximated by a regular language (each finite language is regular!).
Properties of regular languages are well studied; this makes the development of inference methods easier.
Simple and efficient parsing of strings (O(|m|) for DFA).
Outline
1. Learning automata
   Definitions
   Learning automata from positive and negative examples
   Learning automata from positive examples
Automata
A = 〈Σ, Q,Q0, QF , δ〉
Example: multiples of 3 (binary):
Σ finite alphabet: {0, 1}
Q finite set of states: {q0, q1, q2}
Q0 (⊆ Q) initial states: {q0}
QF (⊆ Q) final states: {q0}
δ transition function: Q × Σ → P(Q)
(δ∗ : P(Q) × Σ∗ → P(Q) denotes the extension of δ to words)
Language accepted by A:
L(A) = {m ∈ Σ∗ | δ∗(Q0, m) ∩ QF ≠ ∅}
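Since the "multiples of 3" automaton above is deterministic, δ can be stored as a plain table; a minimal sketch (state ids 0, 1, 2 stand for q0, q1, q2; `DELTA` and `accepts` are our names, not from the slides):

```python
# Transition table: on reading bit b in state q, the value read so far
# becomes 2*value + b, so the new state is (2*q + b) mod 3.
DELTA = {(q, b): (2 * q + b) % 3 for q in range(3) for b in (0, 1)}

def accepts(word: str) -> bool:
    state = 0                       # q0: the initial state
    for c in word:
        state = DELTA[(state, int(c))]
    return state == 0               # q0 is also the only final state

print(accepts("110"))   # True  (6 is a multiple of 3)
print(accepts("101"))   # False (5 is not)
```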
Automata and languages
Language accepted/recognized by an automaton: regular language (equivalently denoted by regular expressions, built with +, ∗ and parentheses)
Exercises
Find automata on Σ = {a, b} recognizing:
- {abba, aab} (show that each finite language is regular)
- all the words on Σ: (a + b)∗ = {a, b}∗ = Σ∗
- all the words containing the motif aa
- all the words with 3 letters (extension to codons?)
- all the words with an even number of a's
Deterministic finite-state automata (DFA): |δ(q, a)| ≤ 1
Any non-deterministic automaton (NFA) can be determinized
⇒ L(NFA) = L(DFA)
Canonical automaton of L, A(L): smallest DFA accepting L
Can we learn regular languages from positive examples only?
Theoretical framework: identification in the limit [Gold67]
Presentation: infinite sequence of examples
P: x1 x2 x3 . . . xk . . . xi . . .
    ↓   ↓   ↓         ↓          ↓
   H1 H2 H3 . . . Hk . . . Hi ≡ Hk ≡ H0
Identification in the limit of H0:
∀P, ∃k, ∀i > k: Hi ≡ H0
Let’s try!
a, aa, aaa . . .
Limit point
If a limit point exists:
L1 ⊂ L2 ⊂ L3 ⊂ · · · ⊂ L∞ = ⋃ᵢ Lᵢ
Then
The class of languages is not identifiable in the limit from positive examples
Results [Gold67]
No superfinite class of languages (⊃ regular) can be identified in the limit from text (i.e. positive examples only)
The class of primitive recursive functions can be identified in the limit from an informant (examples and counter-examples)
(False for the class of total recursive functions)
→ rationale for using counter-examples
Time needed for learning?
Polynomial Time and Data Identification in the Limit[Gold 78] [Pitt 89] [Higuera 95]
Identification in the limit from Polynomial Time and Data (IPTD)
A representation class R is identifiable in the limit from polynomial time and data iff there exist two polynomials p and q and a learning algorithm A s.t.:
- Given any sample S = 〈S+, S−〉 of size m, A returns a representation R in R compatible with S in p(m) time
- For each representation R of size n, there exists a characteristic sample of size less than q(n)
Characteristic sample CS = 〈CS+, CS−〉: for any S = 〈S+, S−〉 s.t. CS+ ⊆ S+, CS− ⊆ S−, A returns a representation R′ equivalent to R
Are automata IPTD?
1. Learning automata
   Definitions
   Learning automata from positive and negative examples
     Problem definition
     RPNI
     Structural completeness hypothesis
     Utility of counter-examples
     EDSM heuristic
   Learning automata from positive examples
Remark: given a sample S = 〈S+, S−〉, an infinite number of automata are compatible with S

Searching for the smallest compatible DFA

Smallest compatible DFA problem
Given S+ ⊂ Σ∗ (examples) and S− ⊂ Σ∗ (counter-examples),
find the smallest DFA A s.t. S+ ⊆ L(A) and S− ∩ L(A) = ∅

Application of Occam's razor
Canonical automaton of a language . . .

NP-complete problem [Gold78] [Angluin78]
Proof: reduction from SAT
Finding a DFA (only) polynomially bigger than the smallest DFA compatible with 〈S+, S−〉 is NP-complete [Pitt, Warmuth 93]
PAC-learning DFA is as hard as breaking the RSA cryptosystem [Pitt, Warmuth 88] [Kearns, Valiant 89]
PAC (Probably Approximately Correct) learning [Valiant 84]
Approximately correct: error upper bound ε
Rreal(h) = P(h(o) ≠ f(o)) < ε
For any concept f in F, for any error ε and any confidence 1 − δ, there exists Nε,δ such that for any hypothesis h learnt from Nε,δ examples:
P(Rreal(h) < ε) > 1 − δ
RPNI: Regular Positive Negative Inference [Oncina, Garcia 1992], [Lang 1992]
S+ = {aaa, bba, baaa} S− = {aaaa, baab, bbabab}
Maximal Canonical Automaton MCA(S+); determinisation. . .
Prefix Tree Automaton PTA(S+): rote learning! Generalisation through state merging under the control of S−. Merge 0 and 1.
Result of merging 0 and 1: a non-deterministic automaton! Merging for determinisation. . .
Merging for determinisation: how to consider only DFAs

Merging for determinisation
∀q ∈ Q, ∀a ∈ Σ, ∀s1, s2 ∈ δ(q, a), Merge(s1, s2)
(≠ the determinisation algorithm for NFAs: the language can grow here!)
→ PTA(S+) = merging for determinisation of MCA(S+)
→ Deterministic merge: merging states + merging for determinisation
RPNI example (continued), S+ = {aaa, bba, baaa}, S− = {aaaa, baab, bbabab}:
After merging for determinisation: a counter-example is accepted!
Backtrack: merge 0 and 2 instead
0 and 2 merged; merging for determinisation. . .
After merging for determinisation: merge 0 and 3
0 and 3 merged; merging for determinisation. . .
After merging for determinisation: no more possible merging. . . Solution!
RPNI: Regular Positive Negative Inference [Oncina, Garcia 1992], [Lang 1992]

RPNI
A ← PTA(S+)
for all (p, q) in standard order¹ do
  A′ ← Deterministic merge(A, p, q)
  if A′ accepts no counter-example from S− then
    A ← A′
  end if
end for

Complexity: O((|S+| + |S−|) · |S+|²)
¹ Standard order u ≺ v: (|u| < |v|) ∨ (|u| = |v| ∧ ∃k, ∀i < k: ui = vi ∧ uk < vk)
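The loop above can be sketched in Python. All names (`build_pta`, `det_merge`, `rpni`) are ours, and state ids created in sample order stand in for the standard order, so this is a toy illustration rather than a faithful RPNI implementation:

```python
from itertools import count

def build_pta(positives):
    """Prefix Tree Automaton of S+: state 0 is the root (empty prefix)."""
    delta, finals, ids = {}, set(), count(1)
    for w in sorted(positives, key=lambda w: (len(w), w)):
        q = 0
        for a in w:
            if (q, a) not in delta:
                delta[(q, a)] = next(ids)
            q = delta[(q, a)]
        finals.add(q)
    return delta, finals

def det_merge(delta, finals, p, q):
    """Merge q into p, then keep merging targets until deterministic again."""
    parent = {}
    def find(s):
        while s in parent:
            s = parent[s]
        return s
    stack = [(p, q)]
    while stack:
        a, b = map(find, stack.pop())
        if a == b:
            continue
        parent[b] = a
        out = {}
        for (s, c), t in delta.items():        # scan the merged class's arcs
            if find(s) == a:
                if c in out and find(out[c]) != find(t):
                    stack.append((out[c], t))  # conflict: merge targets too
                out.setdefault(c, t)
    new_delta = {(find(s), c): find(t) for (s, c), t in delta.items()}
    return new_delta, {find(s) for s in finals}, find(0)

def accepts(delta, finals, start, word):
    q = start
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals

def rpni(positives, negatives):
    delta, finals = build_pta(positives)
    start, dead = 0, set()
    states = [start] + sorted(set(delta.values()))
    for q in states[1:]:
        if q in dead:
            continue
        for p in states:
            if p == q:
                break
            if p in dead:
                continue
            d2, f2, s2 = det_merge(delta, finals, p, q)
            if not any(accepts(d2, f2, s2, w) for w in negatives):
                alive = {s for (s, _) in d2} | set(d2.values()) | f2 | {s2}
                dead |= set(states) - alive    # those ids were merged away
                delta, finals, start = d2, f2, s2
                break
    return delta, finals, start

pos, neg = ["aaa", "bba", "baaa"], ["aaaa", "baab", "bbabab"]
delta, finals, start = rpni(pos, neg)
print(all(accepts(delta, finals, start, w) for w in pos))   # True
print(any(accepts(delta, finals, start, w) for w in neg))   # False
```

On the slide's sample this merges the PTA down to a two-state automaton, consistent with the trace above; merging can only grow the language, so the positives stay accepted while the negatives guard each merge.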
Success rate vs. amount of sequences in the training sample
fig. from [Lang, 1992]
Identification?
Requirements for finding the solution with RPNI:
1. The target automaton has to be in the search space
and
2. The right merges have to be chosen
Structural completeness hypothesis
S+ is structurally complete wrt A if there exists an acceptance of S+ by A s.t.:
every transition of A is used
every final state of A is used to accept some example
S+ = {aaa, bba, baaa} A =
Maximal Canonical Automaton
Rote learning of S+ = {aaa, bba, baaa}
Union :
MCA(S+)
Only one initial state (classical but not required):
MCA(S+)
Merging states
Language generalisation operator
Preserve structural completeness
Merging states
Language generalisation operator
Preservation of structural completeness

Theorem
Every automaton A s.t. S+ is structurally complete wrt A can be built by merging states of MCA(S+)
Search space
DFA search space
operator: deterministic merge
Theorem
Every automaton A s.t. S+ is structurally complete wrt A can be built by deterministic merges of states in MCA(S+) (or PTA(S+))
Limiting generalisation with a set of counter-examples S−
Border set: set of the most general elements (greatest generalisations under the control of S−)
Occam's razor → looking for the smallest automaton
S− also guides the search. . .
Characteristic sample for RPNI
How to ensure that RPNI returns A(L)?
Ideas:
the sample has to be structurally complete wrt A(L)
the sample is informative enough to prevent merging distinct states
Characteristic sample for RPNI: short prefixes and kernel
Let Pr(L) denote the set of prefixes of a language L: Pr(L) = {u ∈ Σ∗ : ∃v ∈ Σ∗, uv ∈ L}
Short prefixes
Smallest sequences enabling to reach each state of the target:
Sp(L) = {u ∈ Pr(L) : ∄v ∈ Pr(L), v < u and δA(L)(q0, v) = δA(L)(q0, u)}
Kernel
Sequences of Sp(L) concatenated with one letter, reaching a new state (exercising all the possible transitions):
N(L) = {ua ∈ Pr(L) : u ∈ Sp(L), a ∈ Σ} ∪ {ε}
What would be N(L) for the following DFA target?
Characteristic sample for RPNI
S = 〈S+, S−〉 is a characteristic sample of A(L) for RPNI if:
∀x ∈ N(L): ∃u ∈ Σ∗, xu ∈ S+ (u = ε if x ∈ L)
∀x, y ∈ N(L) s.t. δA(L)(q0, x) ≠ δA(L)(q0, y):
∃u ∈ Σ∗, ((xu ∈ S+ and yu ∈ S−) or (xu ∈ S− and yu ∈ S+))
What would be a characteristic sample for ?
Is the characteristic sample unique for an automaton?
It can be shown that:
- adding new examples to the characteristic sample does not change the automaton returned by RPNI
- for each A(L), there exists a characteristic sample of size O(|A(L)|²)
What about merging states in random order? Trakhtenbrot and Barzdin 1973
Algorithm: deterministic merges, in random order, of pairs of states that do not result in an incompatible automaton
Algorithm complexity? At most |PTA| · |A|² [Lang92] (where A is the target automaton)
Characteristic sample? {w ∈ Σ∗ : |w| ≤ d + 1 + ρ}
d: depth of the automaton
ρ: distinguishability degree (length of suffix required to distinguish pairs of states, i.e. allowing to reach a final state and a non-final state)
Worst case: d = ρ = |A| − 1
On average, ρ = log|Σ| log2 |A| and d = C log|Σ| |A| (where C is a constant wrt Σ)
For |Σ| = 2, the average size is ∼ 16|A|² − 1: |A| = 32 → 16 383 seq., 65 → 67 599, 506 → 4 096 575 . . .
RPNI
The solution returned by RPNI is:
- a DFA belonging to the border set
- the canonical automaton of the language it accepts
- if the sample is characteristic, the smallest compatible DFA
  (Contradiction with the NP-completeness of the problem? No: the sample has to be characteristic!)
Complexity: O((|S+| + |S−|) · |S+|²)
Characteristic sample size: O(n²)
⇒ DFA are identifiable in the limit from polynomial time and data (IPTD)
Positive results
Deterministic automata are IPTD
⇒ Even linear grammars [Takada 88,94], [Sempere, García 94], [Mäkinen 96]
⇒ Subsequential transducers [Oncina, García, Vidal 93]
⇒ Context-free grammars from structure [Sakakibara 90]
⇒ Tree automata [Knuutila 93]
Simple PAC
[Denis, D’Halluin, Gilleron 96]
PAC learning, but for "simple distributions" only
Simple examples have a higher probability in the training sample, so unseen simple examples can serve as counter-examples
DFA are Simple-PAC learnable [Parekh, Honavar 97]
DFA are Simple-PAC learnable from positive examples [Denis 98]
Negative results
The classes below are not IPTD for |Σ| ≥ 2 :
Context-free grammars
Linear grammars
Non-deterministic automata
Unbiased/symmetrical learning
Defining a regular language ⇔ defining its complementary language
[Alquezar, Sanfeliu 95]:
Consider S+ and S− symmetrically → learn L+ and L−
Classification of words: +, − or ?
Related to learning Mealy and Moore finite-state machines [Biermann, Feldman 72], and automata [Lang 92, Oncina, Garcia 92]
Maximal Canonical Automaton
S+ = {aaa, bba, baaa} ; S− = {aaaa, baab, bbabab}
MCA(S+, S−):
Rote learning
EDSM heuristic
Evidence-Driven State Merging, R. Price, K. Lang, Abbadingo One, 1998
Data-driven heuristic
Dynamic choice of the best pair of states to merge at each step, according to the evidence of a good merge
Evidence measure: maximise the count of final states merged during merging for determinisation
(Remark: → similarity between subtrees)
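A minimal sketch of such an evidence score on an augmented PTA (one that also stores rejecting states for S−). `edsm_score` and the dict encoding are our own simplification: it compares the two subtrees path by path, rather than simulating the full merge as real EDSM does.

```python
NEG_INF = float("-inf")

def edsm_score(delta, pos_states, neg_states, p, q):
    """Evidence for merging p and q: label agreements in their overlapping
    subtrees; -inf when an accepting state would merge with a rejecting one."""
    if (p in pos_states and q in neg_states) or (p in neg_states and q in pos_states):
        return NEG_INF                     # accept/reject conflict
    same_label = (p in pos_states and q in pos_states) or \
                 (p in neg_states and q in neg_states)
    score = 1 if same_label else 0
    for a in {a for (s, a) in delta if s == p}:
        if (q, a) in delta:                # both subtrees continue with letter a
            sub = edsm_score(delta, pos_states, neg_states,
                             delta[(p, a)], delta[(q, a)])
            if sub == NEG_INF:
                return NEG_INF
            score += sub
    return score

# toy augmented PTA for S+ = {a, aa}, S- = {b}
delta = {(0, "a"): 1, (1, "a"): 2, (0, "b"): 3}
pos_states, neg_states = {1, 2}, {3}
print(edsm_score(delta, pos_states, neg_states, 1, 2))  # 1 (both accepting)
print(edsm_score(delta, pos_states, neg_states, 1, 3))  # -inf (conflict)
```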
Example
S+ = {a, aaa, ba, baaa} S− = {aab, baab, baba}
PTA(S+, S−)
f(1,2) = 0, f(0,2) = 3, . . . , f(3,8) = 2, . . . , f(8,9) = −∞, . . .
After merging: f(0 2, 1 4) = −∞, f(0 2, 3 8) = 1, . . .
Then: f(1 4 6 11, 9) = 1, . . .
EDSM: a good heuristic . . . for automata randomly and uniformly generated
fig. from "Merge order count", K. Lang, 1997
Abbadingo: 506-state problem, 60 000 seq. (R. Price)
would require ∼ 100 000 seq. with RPNI
. . . but expensive
Evaluation of O(n²) merges at each step of the algorithm!
Remark: scores for merging states far from the root are smaller
→ window w (Lang, Price?)
→ Blue-Fringe. . .
Blue Fringe, H. Juille, Abbadingo One, 1998
fig. from Faster algorithms for finding minimal consistent DFA, K. Lang, 1999
Any state of B not mergeable with any state of R is promoted to R
Merge pairs of states in B × R
Easy to implement: the states of B are roots of subtrees
Blue-Fringe + EDSM (+ SAGE, H. Juillé)
Abbadingo: 65-state problem, 1 521 seq.
would require ∼ 4 000 seq. with RPNI
Learning from positive and negative examples
[Gold 67]:
No superfinite class of languages can be identified in the limit from positive examples only
The class of primitive recursive functions can be identified in the limit from positive and negative examples
Efficient learning
DFA are IPTD from positive and negative examples (RPNI)
Extension to some closely related classes
NFA are not! CFG neither. . .
A heuristic (EDSM) that seems to perform better. . . (?)
What if negative examples are not available?
Learning from positive examples (only)
Statistical criteria for not merging pairs of states: ALERGIA
"Characterizable" methods: k-RI, k-testable languages
Heuristic methods: ECGI
1. Learning automata
   Definitions
   Learning automata from positive and negative examples
   Learning automata from positive examples
     ALERGIA
     k-reversible languages
     ECGI
ALERGIA
[Carrasco, Oncina 99]
Input: S+, precision parameter α
Output: (probabilistic) DFA A
A ← PPTA(S+)
for all (p, q) in standard order do
  if compatible(p, q, α) then
    A ← deterministic merge(A, p, q)
  end if
end for
ALERGIA
Compatibility between two states q1 and q2:
Transition probabilities are similar enough: ∀a ∈ Σ ∪ {#},
|C(q1, a)/C(q1) − C(q2, a)/C(q2)| < √(½ ln(2/α)) · (1/√C(q1) + 1/√C(q2))
Compatibility of the successors:
∀a ∈ Σ, δ(q1, a) and δ(q2, a) are α-compatible
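The bound (a Hoeffding-style inequality) takes a few lines to check; `hoeffding_different` is our name, and the counts below are made up for illustration:

```python
from math import log, sqrt

def hoeffding_different(c1, n1, c2, n2, alpha):
    """True when the observed frequencies c1/n1 and c2/n2 differ by more
    than the bound allows, i.e. the two states should NOT be merged."""
    bound = sqrt(0.5 * log(2.0 / alpha)) * (1 / sqrt(n1) + 1 / sqrt(n2))
    return abs(c1 / n1 - c2 / n2) > bound

# a letter used 40/100 times from one state vs 10/100 from the other:
print(hoeffding_different(40, 100, 10, 100, alpha=0.05))  # True: do not merge
print(hoeffding_different(30, 100, 25, 100, alpha=0.05))  # False: compatible so far
```

Note how the bound loosens when the counts C(q1), C(q2) are small: states visited rarely are almost always judged compatible, which is the known weak spot of this local test.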
ALERGIA
Local measure of suffix-language similarity
Other measures. . .
→ Learning probabilistic automata
→ Identification of probability distributions on words
See:
- PAC-learnability of Probabilistic Deterministic Finite State Automata, A. Clark and F. Thollard, Journal of Machine Learning Research, 2004
- Towards feasible PAC-learning probabilistic deterministic finite automata, J. C. Rabal and R. Gavalda, ICGI 2008
- Learning Rational Stochastic Languages, F. Denis, Y. Esposito, A. Habrard, COLT 2006
- Spectral learning of weighted automata: a forward-backward perspective, B. Balle, X. Carreras, F. M. Luque, A. Quattoni, Machine Learning, 2014
Characterizable learning
The negative result of [Gold67] applies to superfinite classes of languages.
To avoid over-generalization, an approach performing a minimal generalisation at each step ensures identification for particular classes of languages.
0-reversible languages
0-reversible automaton: a deterministic automaton whose mirror is deterministic
0-reversible language: a language recognized by a 0-reversible automaton
Learnable from positive samples [Angluin 82]
k-reversible languages
k-reversible automaton: a deterministic automaton A whose reverse Ar is deterministic with look-ahead k:
∀q, q′ ∈ Q, q ≠ q′, ((q, q′ ∈ Q0) ∨ (q, q′ ∈ δ(q′′, a))) ⇒ ∄u ∈ Σk : (δ(q, u) ≠ ∅) ∧ (δ(q′, u) ≠ ∅)
A language is k-reversible iff some k-reversible automaton recognizes it
(⇔ u1vw, u2vw ∈ L and |v| = k ⇒ SL(u1v) = SL(u2v))
A: Ar: A is 1-reversible
Are the following languages 0-reversible, 1-reversible?
Σ∗, 1∗01, 0∗1+, 11∗
Find a non-1-reversible language. . . Do there exist non-reversible languages (i.e. non-k-reversible for all k)? 0∗(1 + ε)0∗
k-RI [Angluin82]
Input: k, S+
Output: Ak, canonical automaton accepting the smallest k-reversible language including S+
A ← PTA(S+)
while ∃(q1, q2) ← non-k-reversible(A) do
  A ← deterministic merge(A, q1, q2)
end while
Time complexity: O(|Σ|^k · |S+|^(k+3)) (source: [TD2013])
Memory complexity: O(|S+|)
Non-incremental algorithm
[TD2013] How Symbolic Learning Can Help Statistical Learning (and vice versa),I. Tellier and Y. Dupont, RANLP 2013
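For k = 0 the loop above admits a compact standalone sketch (all names are ours): build the PTA, then merge states until the automaton is deterministic and its reverse is too, i.e. there is a single final state and no two distinct states reach the same state by the same letter.

```python
def zero_reversible(sample):
    """Sketch of k-RI for k = 0: PTA, then forced merges until the
    automaton and its reverse are both deterministic."""
    delta, finals, next_id = {}, set(), 1          # PTA, root = state 0
    for w in sample:
        q = 0
        for a in w:
            if (q, a) not in delta:
                delta[(q, a)] = next_id
                next_id += 1
            q = delta[(q, a)]
        finals.add(q)
    parent = {}
    def find(s):
        while s in parent:
            s = parent[s]
        return s
    changed = True
    while changed:
        changed = False
        f = sorted({find(s) for s in finals})
        if len(f) > 1:                 # reverse automaton needs one initial state
            parent[f[1]] = f[0]
            changed = True
            continue
        fwd = {}
        for (s, a), t in delta.items():
            fwd.setdefault((find(s), a), set()).add(find(t))
        for ts in fwd.values():
            if len(ts) > 1:            # forward non-determinism: merge targets
                x, y = sorted(ts)[:2]
                parent[y] = x
                changed = True
                break
        if changed:
            continue
        back = {}
        for (s, a), ts in fwd.items():
            back.setdefault((next(iter(ts)), a), set()).add(s)
        for ss in back.values():
            if len(ss) > 1:            # reverse non-determinism: merge sources
                x, y = sorted(ss)[:2]
                parent[y] = x
                changed = True
                break
    new_delta = {(find(s), a): find(t) for (s, a), t in delta.items()}
    return new_delta, {find(s) for s in finals}, find(0)

def accepts(delta, finals, start, word):
    q = start
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals

sample = ["", "aa", "bb", "aaaa", "abab", "abba", "baba"]
d, f, s0 = zero_reversible(sample)
print(all(accepts(d, f, s0, w) for w in sample))   # True
print(accepts(d, f, s0, "ab"))                     # False
```

Every merge here is forced, so the result generalises no further than the smallest 0-reversible language containing the sample: on this sample (even numbers of a's and b's are all 0-reversible-compatible) the word ab stays rejected.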
k-RI example, k = 0
S = {ε, aa, bb, aaaa, abab, abba, baba}
Steps:
- Prefix tree acceptor (PTA)
- Merging all final states
- Merging for determinisation of states B
- B, predecessor of A by a, and D, predecessor of A by b, have to be merged
- C, predecessor of B by b, to merge
- Solution
k-RI
Remark: returns the smallest language, not the smallest automaton!
[Angluin82]
The class Ck−rev is identifiable from positive examples (proof: existence of a characteristic sample)
See also: distinguishing functions [Fernau2000]
Choice of k? Pertinence of the subclass for the application?
Exercise: automaton returned for S = {a, aa, aaa} (k = 0)
ECGI Heuristic
Error Correcting GI [Rulot, Vidal 88]
Learns regular grammars that are non-deterministic and without cycles, s.t. ∀A, B, C ∈ N, ∀a, b ∈ Σ:
if (B → aA) ∈ P and (C → bA) ∈ P then b = a
Positive examples
Incremental algorithm:
- first grammar G0 = first example s0
- minimal modification of Gi−1 to accept the new example si
ECGI
Error rules:
Insertion of a: A → aA, ∀(A → bB) ∈ P, ∀a ∈ Σ
Substitution of b by a: A → aB, ∀(A → bB) ∈ P, ∀a ∈ Σ; A → a, ∀(A → b) ∈ P, ∀a ∈ Σ
Deletion of b: A → B, ∀(A → bB) ∈ P; A → ε, ∀(A → b) ∈ P
By extending Gi−1 with these error rules, one can compute (by dynamic programming) the optimal error-correcting parsing of si (using a minimal number of error rules)
Gi is Gi−1 extended with the minimal set of rules required to parse si
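The dynamic programme can be sketched on an automaton view of Gi−1 (here just a chain for a single example); `correction_cost` and the dict encoding are our own illustration, with unit cost per error rule:

```python
from functools import lru_cache

def correction_cost(delta, finals, x, start=0):
    """Minimal number of error rules (insertion / deletion / substitution,
    cost 1 each) needed to parse word x in an acyclic automaton."""
    @lru_cache(maxsize=None)
    def cost(q, i):
        best = 0 if (q in finals and i == len(x)) else float("inf")
        if i < len(x):
            best = min(best, cost(q, i + 1) + 1)                # insertion of x[i]
        for (s, a), t in delta.items():
            if s != q:
                continue
            if i < len(x):
                best = min(best, cost(t, i + 1) + (a != x[i]))  # match / substitution
            best = min(best, cost(t, i) + 1)                    # deletion of a
        return best
    return cost(start, 0)

# chain automaton for the single example "aabb"
chain = {(0, "a"): 1, (1, "a"): 2, (2, "b"): 3, (3, "b"): 4}
print(correction_cost(chain, {4}, "aabb"))   # 0
print(correction_cost(chain, {4}, "abbb"))   # 1 (one substitution)
```

On a chain automaton this reduces to plain edit distance; ECGI then adds to the grammar exactly the rules used along one optimal correcting parse.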
ECGI
Example for S+ = {aabb, abbb, abbab, bbb} :
ECGI
Input: I+
Output: a grammar "ECGI" G compatible with I+
x ← I1+ ; n ← |x|
N ← {A0, . . . , An−1} ; Σ ← {a1, . . . , an}
P ← {(Ai−1 → aiAi), i = 1, . . . , n − 1} ∪ {An−1 → an}
S ← A0 ; G1 ← (N, Σ, P, S)
for i = 2 to |I+| do
  G ← Gi−1 ; x ← Ii+
  P′ ← optimal derivation(x, Gi−1)
  for j = 1 to |P′| do
    G ← extend gram(G, pj)
  end for
  Gi ← G
end for
Return G
ECGI
No recursion
Heuristic capturing the variations of a family of sequences
The order of the examples can change the result
Easy extension to stochastic grammars
Is there a link between the structural completeness hypothesis and ECGI?
Learning languages: conclusion and perspective
Learning to classify sequences:
Classical machine learning approach
Transformation into attribute-value representations, then use of classical ML.
Word embeddings in multiple dimensions. . .
Learning automata
Well studied. Established learnability results.
Recent advances on learning regular distributions. . .
Learning grammars
Hot topic nowadays. Substitutability as a central concept for practical algorithms and learnability results, even beyond CFG (mildly context-sensitive languages).
Learning graphs is an emerging domain. . .
Some references
Inférence grammaticale régulière : fondements théoriques et principaux algorithmes, Dupont, Miclet, RR-INRIA 3449, 1998
Recent advances of grammatical inference, Sakakibara, TCS vol. 185, pp. 15-45, 1997
A bibliographical study of Grammatical Inference, de la Higuera, 2002, http://pagesperso.lina.univ-nantes.fr/~cdlh/papers/bibliography_survey.pdf
Grammatical Inference, Colin de la Higuera
Inférence grammaticale, Chap. 7 of Laurent Miclet's lecture notes, ftp://ftp.irisa.fr/local/cordial/polyAC0304.ps
Learnable classes of categorial grammars, Kanazawa, Cambridge University Press, 1998
Topics in Grammatical Inference, Jeffrey Heinz and José M. Sempere (Eds.), Springer, 2016
Grammatical Inference Homepage: http://www.grammarlearning.org/
Biological palindrome: S → aSt | cSg | tSa | gSc | ε
Derivation tree of atgttcgaacat?
Consequence of adding a new rewriting rule: S → SS | aSt | cSg | tSa | gSc | ε?
Derivation tree of caaatcgatcatcgaagagctcttgttg? Of gaatattcgaatattc?
Copy
S → AaS | CcS | GgS | TtS | X
X → ε
Aa → aA ; Ac → cA ; Ag → gA ; At → tA
Ca → aC ; Cc → cC ; Cg → gC ; Ct → tC
Ga → aG ; Gc → cG ; Gg → gG ; Gt → tG
Ta → aT ; Tc → cT ; Tg → gT ; Tt → tT
AX → Xa ; CX → Xc ; GX → Xg ; TX → Xt
Derivation tree of ctaacctaac ?
What we have seen in SML so far
Introduction to machine learning
Generalisation, necessity of a bias. . .
How to properly define a machine learning problem: choice of the object description, choice of the hypothesis space, choice of the 'best' hypothesis, i.e. setting biases
Exploration of the search space
Evaluation of the risk
Learning on sequences
Vectorization of texts and Naive Bayes
Automata and learnability
Next: state-of-the-art algorithms for attribute-value representations of instances. . .