Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes


Acknowledgements

Laurent Miclet, Jose Oncina, Tim Oates, Rafael Carrasco, Paco Casacuberta, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Thierry Murgue, Franck Thollard, Enrique Vidal, Frédéric Tantini

The list is necessarily incomplete. Apologies to those who have been forgotten.

http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/

Chapters 5 and 16

Outline

1. PFA
2. Distances between distributions
3. FFA
4. Basic elements for learning PFA
5. ALERGIA
6. MDI and DSAI
7. Open questions

1. PFA: Probabilistic finite (state) automata

Practical motivations

(Computational biology, speech recognition, web services, automatic translation, image processing, …)

- A lot of positive data
- Not necessarily any negative data
- No ideal target
- Noise

The grammar induction problem, revisited

- The data consists of positive strings, «generated» following an unknown distribution
- The goal is now to find (learn) this distribution
- Or the grammar/automaton that is used to generate the strings

Success of the probabilistic models

- n-grams
- Hidden Markov Models
- Probabilistic grammars

DPFA: Deterministic Probabilistic Finite Automaton

[Figure: an example DPFA over the alphabet {a, b}, with probabilities on states and transitions.]

[Figure: the same DPFA, used to parse the string abab.]

$\Pr_A(abab) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{3} \times \frac{2}{3} \times \frac{3}{4} = \frac{1}{24}$

[Figure: a DPFA over {a, b} with decimal probabilities (0.1, 0.3, 0.35, 0.65, 0.7, 0.9) on its states and transitions.]

PFA: Probabilistic Finite (state) Automaton

[Figure: an example non-deterministic PFA over {a, b}.]

λ-PFA: Probabilistic Finite (state) Automaton with λ-transitions

[Figure: an example PFA over {a, b} with λ-transitions.]

How useful are these automata?

- They can define a distribution over $\Sigma^*$
- They do not tell us if a string belongs to a language
- They are good candidates for grammar induction
- There is (was) not that much written theory

Basic references

- The HMM literature
- Azaria Paz, 1973: Introduction to probabilistic automata
- Chapter 5 of my book
- Probabilistic Finite-State Machines: Vidal, Thollard, cdlh, Casacuberta & Carrasco
- Grammatical Inference papers

Automata definitions

Let D be a distribution over $\Sigma^*$:

$0 \le \Pr_D(w) \le 1$

$\sum_{w \in \Sigma^*} \Pr_D(w) = 1$

A Probabilistic Finite (state) Automaton is a tuple $\langle Q, \Sigma, I_P, F_P, \delta_P \rangle$

- $Q$: set of states
- $I_P : Q \to [0,1]$
- $F_P : Q \to [0,1]$
- $\delta_P : Q \times \Sigma \times Q \to [0,1]$

What does a PFA do?

It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:

$\Pr_A(w) = \sum_{\pi_i \in \mathrm{paths}(w)} \Pr(\pi_i)$, where $\pi_i = q_{i_0} a_{i_1} q_{i_1} a_{i_2} \ldots a_{i_n} q_{i_n}$

$\Pr(\pi_i) = I_P(q_{i_0}) \cdot F_P(q_{i_n}) \cdot \prod_{j=1}^{n} \delta_P(q_{i_{j-1}}, a_{i_j}, q_{i_j})$

Note that if λ-transitions are allowed the sum may be infinite.
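The path-sum definition can be tried directly on a small automaton. The sketch below is only an illustration: the two-state PFA, its names (`STATES`, `INITIAL`, `FINAL`, `DELTA`) and its probabilities are hypothetical and are not the automaton drawn on the slides. It enumerates every state sequence of the right length, which is exponential and only sensible for tiny examples without λ-transitions.

```python
from itertools import product


def string_probability(w, states, initial, final, delta):
    """Sum the probabilities of all paths reading w (brute force)."""
    total = 0.0
    for path in product(states, repeat=len(w) + 1):
        # Initial probability of the first state, final probability of the last.
        p = initial.get(path[0], 0.0) * final.get(path[-1], 0.0)
        for j, symbol in enumerate(w):
            # Missing transitions count as probability 0.
            p *= delta.get((path[j], symbol, path[j + 1]), 0.0)
        total += p
    return total


# Hypothetical PFA, chosen so that each state's probabilities sum to 1.
STATES = ["q0", "q1"]
INITIAL = {"q0": 1.0}
FINAL = {"q0": 0.2, "q1": 0.5}
DELTA = {
    ("q0", "a", "q0"): 0.3,
    ("q0", "a", "q1"): 0.5,
    ("q1", "b", "q0"): 0.5,
}

print(string_probability("a", STATES, INITIAL, FINAL, DELTA))
```

Two paths read "a" here (q0 to q0 and q0 to q1), so the result is the sum of two products, exactly as in the definition.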

$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b} in which two paths read aba.]

- Non-deterministic PFA: many initial states / only one initial state
- A λ-PFA: a PFA with λ-transitions and perhaps many initial states
- DPFA: a deterministic PFA

Consistency

A PFA is consistent if $\Pr_A(\Sigma^*) = 1$ and $\forall x \in \Sigma^*,\ 0 \le \Pr_A(x) \le 1$

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

$\forall q \in Q,\quad F_P(q) + \sum_{q' \in Q,\ a \in \Sigma} \delta_P(q, a, q') = 1$
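The numerical half of the theorem is easy to check mechanically. The following sketch assumes a PFA stored as plain dictionaries (a hypothetical representation, not one given in the slides) and tests only the per-state sum; it does not check that every state is accessible and co-accessible, which the theorem also requires.

```python
from collections import defaultdict


def locally_normalised(states, final, delta, tol=1e-9):
    """True if, for every state, final + outgoing probabilities sum to 1."""
    outgoing = defaultdict(float)
    for (q, _symbol, _target), p in delta.items():
        outgoing[q] += p
    return all(abs(final.get(q, 0.0) + outgoing[q] - 1.0) <= tol for q in states)


# Hypothetical two-state PFA.
STATES = ["q0", "q1"]
FINAL = {"q0": 0.2, "q1": 0.5}
DELTA = {
    ("q0", "a", "q0"): 0.3,
    ("q0", "a", "q1"): 0.5,
    ("q1", "b", "q0"): 0.5,
}

print(locally_normalised(STATES, FINAL, DELTA))
```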

Equivalence between models

- Equivalence between PFA and HMM…
- But the HMM usually define distributions over each $\Sigma^n$

A football HMM

[Figure: an HMM whose states each emit win, draw or lose with different probabilities.]

Equivalence between PFA with λ-transitions and PFA without λ-transitions

cdlh 2003, Hanneforth & cdlh 2009

- Many initial states can be transformed into one initial state with λ-transitions
- λ-transitions can be removed in polynomial time
- Strategy: number the states; eliminate first the λ-loops, then the transitions with highest ranking arrival state

PFA are strictly more powerful than DPFA

Folk theorem

(and) You can't even tell in advance if you are in a good case or not (see Denis & Esposito 2004)

Example

[Figure: a non-deterministic PFA over the one-letter alphabet {a}.]

This distribution cannot be modelled by a DPFA

What does a DPFA over Σ = {a} look like?

[Figure: a DPFA over {a}, drawn as a sequence of a-transitions (a … a).]

And with this architecture you cannot generate the previous one

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case
- Simple: apply definitions
- Technically, rather sum up logs: this is easier, safer and cheaper
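As a rough illustration of the "sum up logs" advice, here is a sketch for the deterministic case. The DPFA, its state names and its numbers are hypothetical; transitions are stored as `(state, symbol) -> (next_state, probability)`, which is one possible encoding and not necessarily the one used by the author.

```python
import math


def dpfa_log_probability(w, initial_state, final, delta):
    """Log-probability of w in a DPFA; -inf if w has probability 0."""
    log_p = 0.0
    q = initial_state
    for symbol in w:
        if (q, symbol) not in delta:
            return -math.inf
        q, p = delta[(q, symbol)]
        if p == 0.0:
            return -math.inf
        log_p += math.log(p)  # adding logs avoids underflow on long strings
    f = final.get(q, 0.0)
    return log_p + math.log(f) if f > 0.0 else -math.inf


# Hypothetical DPFA: q0 --a(0.6)--> q1, q1 --b(0.5)--> q0.
FINAL = {"q0": 0.4, "q1": 0.5}
DELTA = {("q0", "a"): ("q1", 0.6), ("q1", "b"): ("q0", 0.5)}

print(math.exp(dpfa_log_probability("ab", "q0", FINAL, DELTA)))
```

Because there is a single path per string, one pass over the string suffices, and a missing transition immediately yields probability 0.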

$\Pr(aba) = 0.7 \times 0.9 \times 0.35 \times 0 = 0$

$\Pr(abb) = 0.7 \times 0.9 \times 0.65 \times 0.3 = 0.12285$

[Figure: the DPFA over {a, b} with decimal probabilities used for these two computations.]

Non-deterministic case

$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: the non-deterministic PFA over {a, b} in which two paths read aba.]

In the literature

- The computation of the probability of a string is by dynamic programming: $O(n^2 m)$
- 2 algorithms: Backward and Forward
- If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm

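Forward algorithm

- $A[i,j] = \Pr(q_i \mid a_1 \ldots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \ldots a_j$)
- $A[i,0] = I_P(q_i)$
- $A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot \delta_P(q_k, a_{j+1}, q_i)$
- $\Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$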
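The recurrence above translates almost line for line into code. This is a minimal sketch under assumptions: the PFA is a hypothetical two-state example stored in dictionaries, there are no λ-transitions, and only the current column of the table A is kept.

```python
def forward_probability(w, states, initial, final, delta):
    """Forward algorithm: probability of w summed over all paths."""
    # Column j = 0: A[i, 0] = I_P(q_i).
    column = {q: initial.get(q, 0.0) for q in states}
    for symbol in w:
        # A[i, j+1] = sum over k of A[k, j] * delta_P(q_k, a_{j+1}, q_i).
        column = {
            q: sum(column[k] * delta.get((k, symbol, q), 0.0) for k in states)
            for q in states
        }
    # Pr(w) = sum over k of A[k, n] * F_P(q_k).
    return sum(column[q] * final.get(q, 0.0) for q in states)


# Hypothetical PFA (not the one from the slides).
STATES = ["q0", "q1"]
INITIAL = {"q0": 1.0}
FINAL = {"q0": 0.2, "q1": 0.5}
DELTA = {
    ("q0", "a", "q0"): 0.3,
    ("q0", "a", "q1"): 0.5,
    ("q1", "b", "q0"): 0.5,
}

print(forward_probability("aa", STATES, INITIAL, FINAL, DELTA))
```

Each symbol costs one pass over all pairs of states, which is where the quadratic factor in the number of states comes from.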

2. Distances: what for?

- Estimate the quality of a language model
- Have an indicator of the convergence of learning algorithms
- Construct kernels

2.1 Entropy

- How many bits do we need to correct our model?
- Two distributions over $\Sigma^*$: D and D'
- Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} \Pr_D(w) \cdot \left( \log \Pr_D(w) - \log \Pr_{D'}(w) \right)$
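For distributions with finite support the divergence can be computed by direct summation. The sketch below assumes both distributions are given as dictionaries from strings to probabilities (an assumption made for illustration; for PFA the sum ranges over infinitely many strings) and uses base-2 logarithms so that the result is in bits.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q, in bits (finite support)."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue  # terms with Pr_D(w) = 0 contribute nothing
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf  # p gives mass to a string that q forbids
        total += pw * (math.log2(pw) - math.log2(qw))
    return total


# Hypothetical distributions over two strings.
D = {"a": 0.5, "b": 0.5}
D_PRIME = {"a": 0.25, "b": 0.75}

print(kl_divergence(D, D_PRIME))
```

The divergence is not symmetric, and it is infinite as soon as the second distribution assigns probability 0 to a string the first one can produce.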

2.2 Perplexity

- The idea is to allow the computation of the divergence, but relatively to a test set (S)
- An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set

$\left( \prod_{w \in S} \Pr_D(w) \right)^{-1/|S|} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$

Problem if some probability is null!
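Perplexity as the inverse geometric mean can be sketched as follows. The function name `perplexity`, the toy model and the test strings are hypothetical; the computation is done in log space, and a null probability is reported as infinite perplexity, which is the problem noted above.

```python
import math


def perplexity(test_set, prob):
    """Inverse of the geometric mean of prob(w) over the test set."""
    log_sum = 0.0
    for w in test_set:
        p = prob(w)
        if p == 0.0:
            return math.inf  # a single unseen string ruins the score
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_set))


# Hypothetical model given as a finite table of string probabilities.
MODEL = {"": 0.25, "a": 0.5, "b": 0.25}


def model_prob(w):
    return MODEL.get(w, 0.0)


print(perplexity(["a", "b", ""], model_prob))
```

In practice the null-probability case is usually avoided by smoothing the model, a topic the slides do not cover here.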

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S

Why multiply? (2)

- Suppose we have two predictors for a coin toss
  - Predictor 1: heads 60%, tails 40%
  - Predictor 2: heads 100%
- The tests are H: 6, T: 4
- Arithmetic mean
  - P1: 0.36 + 0.16 = 0.52
  - P2: 0.6
- Predictor 2 is the better predictor ;-)
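The coin example can be replayed numerically to see why the geometric mean is the sensible aggregate. The helper names below are hypothetical; the numbers are those of the example (6 heads, 4 tails).

```python
import math


def arithmetic_mean(probs):
    return sum(probs) / len(probs)


def geometric_mean(probs):
    if any(p == 0.0 for p in probs):
        return 0.0  # one impossible outcome makes the whole product 0
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


TESTS = ["H"] * 6 + ["T"] * 4
PREDICTOR_1 = {"H": 0.6, "T": 0.4}
PREDICTOR_2 = {"H": 1.0, "T": 0.0}

scores_1 = [PREDICTOR_1[t] for t in TESTS]
scores_2 = [PREDICTOR_2[t] for t in TESTS]

print(arithmetic_mean(scores_1), arithmetic_mean(scores_2))
print(geometric_mean(scores_1), geometric_mean(scores_2))
```

With the arithmetic mean the predictor that declares tails impossible wins, although tails occurred four times; with the geometric mean its score drops to 0.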

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left( \Pr_D(w) - \Pr_{D'}(w) \right)^2}$

- Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002)
- This also means that equivalence of PFA is in P

3. FFA: Frequency Finite (state) Automata

- A learning sample is a multiset
- Strings appear with a frequency (or multiplicity)
- S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, for every transition, and for entering the initial state, such that
- the sum of what enters is equal to what exits, and
- the sum of what halts is equal to what starts

Example

[Figure: a three-state DFFA entered with frequency 6. Halting frequencies: 3, 1 and 2. Transition frequencies: a 5, b 5; a 2, b 3; a 1, b 4.]

From a DFFA to a DPFA

[Figure: the same automaton with relative frequencies. Entering: 6/6. Halting: 3/13, 1/6 and 2/7. Transitions: a 5/13, b 5/13; a 2/6, b 3/6; a 1/7, b 4/7.]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies
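The normalisation step is a one-liner per state. In the sketch below the state names `q1`, `q2`, `q3` are hypothetical, and the grouping of frequencies by state is read off the denominators 13, 6 and 7 of the example; transition targets are left out because they do not affect the division.

```python
from fractions import Fraction


def dffa_to_dpfa(halting, transitions):
    """Divide each frequency by the total frequency exiting its state."""
    totals = dict(halting)
    for (q, _symbol), freq in transitions.items():
        totals[q] = totals.get(q, 0) + freq
    final = {q: Fraction(f, totals[q]) for q, f in halting.items()}
    probs = {(q, a): Fraction(f, totals[q]) for (q, a), f in transitions.items()}
    return final, probs


# Frequencies of the example DFFA (state names are assumptions).
HALTING = {"q1": 3, "q2": 1, "q3": 2}
TRANSITIONS = {
    ("q1", "a"): 5, ("q1", "b"): 5,
    ("q2", "a"): 2, ("q2", "b"): 3,
    ("q3", "a"): 1, ("q3", "b"): 4,
}

final, probs = dffa_to_dpfa(HALTING, TRANSITIONS)
print(final["q1"], probs[("q3", "b")])
```

Using `Fraction` keeps the results exact, so each state's probabilities sum to exactly 1.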

From a DFA and a sample to a DFFA

[Figure: the same three-state DFFA, with frequencies obtained by parsing the sample.]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}

Note

- Another sample may lead to the same DFFA
- Doing the same with an NFA is a much harder problem
- Typically what the Baum-Welch (EM) algorithm was invented for…

The frequency prefix tree acceptor

- The data is a multiset
- The FTA is the smallest tree-like FFA consistent with the data
- Can be transformed into a PFA if needed

From the sample to the FTA

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S), the frequency prefix tree acceptor of S, entered with frequency 14.]
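Building the frequency prefix tree amounts to counting, for every prefix, how many strings of the multiset pass through it and how many stop there. The sketch below identifies each state with its prefix; the function name and the two-counter representation are choices made for illustration only.

```python
from collections import Counter


def frequency_prefix_tree(sample):
    """sample maps each string to its multiplicity.

    Returns (through, halting): through[u] is the frequency of entering the
    state reached by prefix u, halting[u] the frequency of stopping there.
    """
    through = Counter()
    halting = Counter()
    for w, n in sample.items():
        halting[w] += n
        for i in range(len(w) + 1):
            through[w[:i]] += n
    return through, halting


# The sample from the slide; "" stands for the empty string λ.
S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}

through, halting = frequency_prefix_tree(S)
print(through[""], through["a"], through["aa"], through["b"])
```

The frequency of the transition leading into a state is the same number as `through` for that state, since every state of a tree has exactly one incoming transition.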

Red, Blue and White states

[Figure: an automaton over {a, b} with its states coloured Red, Blue and White.]

- Red states are confirmed states
- Blue states are the (non Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does

Merge and fold

[Figure sequence: an FFA entered with frequency 100, shown in four steps.]

- Suppose we decide to merge a Blue state with a Red state
- First disconnect the Blue state and reconnect its incoming transition to the Red state
- Then fold
- After folding

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t_0
    if ∃ p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

The real question

How do we decide if d(A_p, A_q) is small?
- Use a distance…
- Be able to compute this distance
- If possible, update the computation easily
- Have properties related to this distance

Deciding if two distributions are similar

- If the two distributions are known, equality can be tested
- Distance (L2 norm) between distributions can be exactly computed
- But what if the two distributions are unknown?

Taking decisions

[Figure: the FFA of the merge-and-fold example, entered with frequency 100.]

- Suppose we want to merge a Blue state with a Red state
- Yes, if the two distributions induced are similar

5. Alergia

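Alergia's test

- $D_1 \approx D_2$ if $\forall x,\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$
- Easier to test:
  - $\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$
  - $\forall a \in \Sigma,\ \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$
- And do this recursively!
- Of course, do it on frequencies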

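Hoeffding bounds

$\gamma = \left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right|$

$\gamma < \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}}$

γ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close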
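The Hoeffding-based compatibility test is short enough to write out. This sketch assumes the usual form of the bound, with a confidence parameter called `alpha` here; the function name is hypothetical. The usage lines reuse frequency pairs that appear later in the run of Alergia.

```python
import math


def alergia_compatible(f1, n1, f2, n2, alpha=0.05):
    """True if f1/n1 and f2/n2 are close enough under the Hoeffding bound."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) * math.sqrt(
        0.5 * math.log(2 / alpha)
    )
    return gamma < bound


# Frequency pairs taken from the run of Alergia in this deck.
print(alergia_compatible(490, 1000, 128, 257))
print(alergia_compatible(660, 1341, 225, 340))
```

A smaller `alpha` widens the bound and makes merges more likely; small counts n1 or n2 do the same, which is one reason low-frequency states are handled separately.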

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

- Parameter α is arbitrarily set to 0.05. We choose 30 as a value for the threshold t_0.
- Note that for the Blue states whose frequency is less than the threshold, a special merging operation takes place

[Figure: FTA(S), the frequency prefix tree acceptor of the multisample, entered with frequency 1000.]

Can we merge λ and a?

- Compare λ and a, a and aa, b and ab
- 490/1000 with 128/257
- 257/1000 with 64/257
- 253/1000 with 65/257
- All tests return true

Merge…

[Figure: the same FTA, with the states λ and a being merged.]

And fold

[Figure: the automaton after folding. The Red state halts with frequency 660, has an a-loop of frequency 341 and a b-transition of frequency 340 to a state that halts with frequency 225.]

Next: merge λ with b?

Can we merge λ and b?

- Compare λ and b, a and ba, b and bb
- 660/1341 and 225/340 are different (giving γ = 0.162)
- On the other hand, $\left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} = 0.111$

Promotion

[Figure: the state reached by b is promoted to Red.]

Merge, and fold; merge, and fold

[Figures: the remaining Blue states are merged and folded in two further rounds. The result is a two-state DFFA entered with frequency 1000, with halting frequencies 698 and 302 and transition frequencies a 354, b 351, a 96 and b 49.]

As a PFA

[Figure: the same two-state automaton with frequencies converted to probabilities.]

Conclusion and logic

- Alergia builds a DFFA in polynomial time
- Alergia can identify DPFA in the limit with probability 1
- No good definition of Alergia's properties

6. DSAI and MDI

Why not change the criterion?

Criterion for DSAI

- Using a distinguishable string
- Use norm $L_\infty$
- Two distributions are different if there is a string with a very different probability
- Such a string is called μ-distinguishable
- Question becomes: is there a string x such that $|\Pr_{A_q}(x) - \Pr_{A_{q'}}(x)| > \mu$?
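For distributions with finite support, the distinguishability question reduces to a maximum over the strings either distribution can produce. The sketch below assumes dictionaries from strings to probabilities and a threshold called `mu`; both the representation and the example numbers are hypothetical, and the real algorithm works on estimates computed from the sample.

```python
def l_infinity(p, q):
    """Largest absolute difference in probability over all strings."""
    support = set(p) | set(q)
    return max((abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support), default=0.0)


def mu_distinguishable(p, q, mu):
    """True if some string has probabilities differing by more than mu."""
    return l_infinity(p, q) > mu


# Hypothetical distributions induced by two states.
D_Q = {"a": 0.5, "b": 0.5}
D_Q_PRIME = {"a": 0.2, "b": 0.5, "ab": 0.3}

print(mu_distinguishable(D_Q, D_Q_PRIME, 0.25))
```

Two states are kept apart when such a string exists and become merge candidates otherwise.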

(Much more to DSAI)

- D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.
- PAC learnability results, in the case where targets are acyclic graphs

Criterion for MDI

- MDL-inspired heuristic
- Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?
- F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

7. Conclusion and open questions

- A good candidate to learn NFA is DEES
- Never has been a challenge, so the state of the art is still unclear
- Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

By estimation: the edge is used 1501 times out of 3000 passages through the state

[Figure: a state with frequency 3000 and an outgoing a-transition labelled 1501/3000.]

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Idea

Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value
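The search can be sketched as a walk down the part of the tree between 0/1 and 1/1, taking the «mean» of the two current bounds at each step. Everything named here is an assumption made for illustration: the function name, the `tolerance` parameter (the slides suggest a bound that shrinks with n rather than a fixed value) and the restriction to values between 0 and 1 given as `Fraction`.

```python
from fractions import Fraction


def simple_approximation(x, tolerance):
    """First fraction met on the Stern-Brocot walk within tolerance of x.

    x is expected to be a Fraction with 0 <= x <= 1.
    """
    lo, hi = (0, 1), (1, 1)
    for n, d in (lo, hi):
        if abs(x - Fraction(n, d)) <= tolerance:
            return Fraction(n, d)
    while True:
        # The «mean» of a/b and c/d is (a + c)/(b + d).
        n, d = lo[0] + hi[0], lo[1] + hi[1]
        mediant = Fraction(n, d)
        if abs(x - mediant) <= tolerance:
            return mediant
        if x < mediant:
            hi = (n, d)
        else:
            lo = (n, d)


# The estimate from the example: 1501 uses out of 3000 passages.
print(simple_approximation(Fraction(1501, 3000), Fraction(1, 100)))
```

Fractions met earlier on the walk have smaller denominators, so the first one within the tolerance is a simple candidate for the true probability.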

Iterated Logarithm

With probability 1, for a co-finite number of values of n we have:

$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log \log n}{n}}, \quad \forall \lambda > 1$

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 2: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 2

Acknowledgements Laurent Miclet Jose Oncina Tim Oates Rafael Carrasco

Paco Casacuberta Reacutemi Eyraud Philippe Ezequel Henning Fernau Thierry Murgue Franck Thollard Enrique Vidal Freacutedeacuteric Tantini

List is necessarily incomplete Excuses to those that have been forgotten

httppagespersolinauniv-nantesfr~cdlhslides

Chapters 5 and 16

Zadar August 2010 3

Outline

1 PFA2 Distances between distributions3 FFA4 Basic elements for learning PFA5 ALERGIA6 MDI and DSAI7 Open questions

Zadar August 2010 4

1 PFAProbabilistic finite (state)

automata

Zadar August 2010 5

Practical motivations

(Computational biology speech recognition web services automatic translation image processing hellip)

A lot of positive data Not necessarily any negative data No ideal target Noise

Zadar August 2010 6

The grammar induction problem revisitedThe data consists of positive strings

laquogeneratedraquo following an unknown distribution

The goal is now to find (learn) this distribution

Or the grammarautomaton that is used to generate the strings

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

41

31

21

21

21

32

a

b

a

b

a

b

43

21

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

41

31

21

21

21

32

a

b

a

b

a

b

43

21

PrA(abab)= 241

43

32

31

21

21

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

41

31

21

21

21

32

b

b

a

a

a

b

43

21

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

41

31

21

21

21

32

b

a

b

43

21

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01] FP Q[01] P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i) i=qi0ai1qi1ai2hellipainqin

Pr(i)=IP(qi0)FP(qin) aij P (qij-1aijqij)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between modelsEquivalence between PFA and

HMMhellipBut the HMM usually define

distributions over each n

Zadar August 2010 23

41

21

win draw lose win draw lose win draw lose

21

21

34

12

14

14

34

14

14

41

41

41

41

41

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

31

21

21

21

21

32

a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a Compare and a a and aa b

and ab 4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b Compare and b a and ba b

and bb 6601341 and 225340 are different

(giving = 0162) On the other hand

11102ln2111

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic Alergia builds a DFFA in polynomial

time Alergia can identify DPFA in the limit

with probability 1 No good definition of Alergiarsquos

properties

Zadar August 2010 76

6 DSAI and MDIWhy not change the

criterion


Outline

1. PFA
2. Distances between distributions
3. FFA
4. Basic elements for learning PFA
5. ALERGIA
6. MDI and DSAI
7. Open questions

1. PFA: Probabilistic finite (state) automata

Practical motivations

(Computational biology, speech recognition, web services, automatic translation, image processing, …)

A lot of positive data
Not necessarily any negative data
No ideal target
Noise

The grammar induction problem, revisited

The data consists of positive strings, «generated» following an unknown distribution.
The goal is now to find (learn) this distribution,
or the grammar/automaton that is used to generate the strings.

Success of the probabilistic models

n-grams
Hidden Markov Models
Probabilistic grammars

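[Figure: example automaton with transitions labelled a and b and fractional probabilities (1/4, 1/3, 1/2, 2/3, 3/4)]
DPFA: Deterministic Probabilistic Finite Automaton

[Figure: the same DPFA]
$\Pr_A(abab) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{3} \times \frac{2}{3} \times \frac{3}{4} = \frac{1}{24}$

[Figure: a DPFA with transitions labelled a and b and decimal probabilities (0.1, 0.3, 0.35, 0.65, 0.7, 0.9)]

[Figure: example automaton with transitions labelled a and b and fractional probabilities]
PFA: Probabilistic Finite (state) Automaton

[Figure: example automaton with transitions labelled a and b and fractional probabilities]
λ-PFA: Probabilistic Finite (state) Automaton with λ-transitions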

How useful are these automata?

They can define a distribution over Σ*.
They do not tell us if a string belongs to a language.
They are good candidates for grammar induction.
There is (was) not that much written theory.

Basic references

The HMM literature
Azaria Paz 1973: Introduction to probabilistic automata
Chapter 5 of my book
Probabilistic Finite-State Machines: Vidal, Thollard, cdlh, Casacuberta & Carrasco
Grammatical Inference papers

Automata definitions

Let D be a distribution over Σ*:

$0 \le \Pr_D(w) \le 1$

$\sum_{w \in \Sigma^*} \Pr_D(w) = 1$

A Probabilistic Finite (state) Automaton is a $\langle Q, \Sigma, I_P, F_P, \delta_P \rangle$:
Q: set of states
$I_P : Q \to [0,1]$
$F_P : Q \to [0,1]$
$\delta_P : Q \times \Sigma \times Q \to [0,1]$

What does a PFA do?

It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:

$\Pr_A(w) = \sum_{\pi_i \in paths(w)} \Pr(\pi_i)$, where $\pi_i = q_{i_0} a_{i_1} q_{i_1} a_{i_2} \ldots a_{i_n} q_{i_n}$

$\Pr(\pi_i) = I_P(q_{i_0}) \cdot F_P(q_{i_n}) \cdot \prod_{j=1}^{n} \delta_P(q_{i_{j-1}}, a_{i_j}, q_{i_j})$

Note that if λ-transitions are allowed the sum may be infinite.

$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b} in which two paths read aba]

Non-deterministic PFA: many initial states, or only one initial state
λ-PFA: a PFA with λ-transitions and perhaps many initial states
DPFA: a deterministic PFA

Consistency

A PFA is consistent if $\Pr_A(\Sigma^*) = 1$ and $\forall x \in \Sigma^*,\ 0 \le \Pr_A(x) \le 1$.

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

$\forall q \in Q,\quad F_P(q) + \sum_{q' \in Q,\ a \in \Sigma} \delta_P(q, a, q') = 1$
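The sum condition of the consistency theorem is easy to check mechanically. The following is a minimal sketch, assuming a dictionary encoding of the automaton that is not in the slides: `final` maps each state to its halting probability and `delta` maps `(source, symbol, target)` triples to transition probabilities. The two-state PFA is hypothetical, and the usefulness of states (accessible and co-accessible) is deliberately not checked here.

```python
def satisfies_sum_condition(states, final, delta, tol=1e-9):
    """Check that F_P(q) plus all outgoing transition probabilities is 1.

    final: dict state -> halting probability F_P(q)
    delta: dict (source, symbol, target) -> transition probability
    Usefulness of states is NOT checked; only the sum condition is.
    """
    for q in states:
        outgoing = sum(p for (src, _sym, _dst), p in delta.items() if src == q)
        if abs(final[q] + outgoing - 1.0) > tol:
            return False
    return True


# Hypothetical two-state PFA, used only for illustration.
STATES = [0, 1]
FINAL = {0: 0.2, 1: 0.5}
DELTA = {(0, "a", 0): 0.3, (0, "a", 1): 0.5, (1, "b", 0): 0.5}

print(satisfies_sum_condition(STATES, FINAL, DELTA))
```

A tolerance is used because the probabilities are floats; exact fractions would allow an exact comparison.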

Equivalence between models

Equivalence between PFA and HMM…
But the HMM usually define distributions over each $\Sigma^n$.

[Figure: an HMM whose states each emit win, draw or lose, with fractional probabilities (1/4, 1/2, 3/4)]
A football HMM

Equivalence between PFA with λ-transitions and PFA without λ-transitions

cdlh 2003, Hanneforth & cdlh 2009
Many initial states can be transformed into one initial state with λ-transitions.
λ-transitions can be removed in polynomial time.
Strategy:
  number the states
  eliminate first λ-loops, then the transitions with highest ranking arrival state

PFA are strictly more powerful than DPFA

Folk theorem
(and) You can't even tell in advance if you are in a good case or not
(see Denis & Esposito 2004)

Example

[Figure: a PFA over {a} with fractional probabilities (1/3, 1/2, 2/3)]
This distribution cannot be modelled by a DPFA.

What does a DPFA over Σ = {a} look like?

[Figure: states linked by a-transitions (a … a)]
And with this architecture you cannot generate the previous one.

Parsing issues

Computation of the probability of a string or of a set of strings.

Deterministic case:
Simple: apply the definitions.
Technically, rather sum up logs: this is easier, safer and cheaper.
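The advice to sum logs in the deterministic case can be sketched as follows. This is an illustration under assumptions, not the slides' own code: the DPFA encoding (`delta` mapping `(state, symbol)` to `(next_state, probability)`) and the two-state automaton are hypothetical.

```python
import math


def dpfa_log_prob(word, start, delta, final):
    """Log-probability of `word` in a DPFA, accumulated as a sum of logs.

    delta: dict (state, symbol) -> (next_state, probability)
    final: dict state -> halting probability
    Returns -inf when the string has probability 0.
    """
    state, logp = start, 0.0
    for sym in word:
        if (state, sym) not in delta:
            return -math.inf
        state, p = delta[(state, sym)]
        if p <= 0.0:
            return -math.inf
        logp += math.log(p)
    if final[state] <= 0.0:
        return -math.inf
    return logp + math.log(final[state])


# Hypothetical DPFA, for illustration only.
DELTA = {(0, "a"): (1, 0.7), (0, "b"): (0, 0.1),
         (1, "a"): (1, 0.3), (1, "b"): (0, 0.3)}
FINAL = {0: 0.2, 1: 0.4}

print(math.exp(dpfa_log_prob("ab", 0, DELTA, FINAL)))
```

Working in log space avoids underflow on long strings, since a product of many small probabilities becomes a sum of moderate negative numbers.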

$\Pr(aba) = 0.7 \times 0.9 \times 0.35 \times 0 = 0$
$\Pr(abb) = 0.7 \times 0.9 \times 0.65 \times 0.3 = 0.12285$

[Figure: a DPFA with transitions labelled a and b and decimal probabilities (0.1, 0.3, 0.35, 0.65, 0.7, 0.9)]

Non-deterministic case

$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b} in which two paths read aba]

In the literature

The computation of the probability of a string is by dynamic programming: $O(n^2 m)$.
2 algorithms: Backward and Forward.
If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.

Forward algorithm

$A[i,j] = \Pr(q_i \mid a_1 \ldots a_j)$
(The probability of being in state $q_i$ after having read $a_1 \ldots a_j$)
$A[i,0] = I_P(q_i)$
$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot \delta_P(q_k, a_{j+1}, q_i)$
$\Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$
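The forward recurrence above translates almost line for line into code. The sketch below keeps only the current column of the table A; the dictionary encoding and the small PFA are assumptions made for the example, not part of the slides.

```python
def forward_probability(word, states, initial, final, delta):
    """Probability of `word` in a PFA without lambda-transitions.

    initial: dict state -> I_P(q)
    final:   dict state -> F_P(q)
    delta:   dict (source, symbol, target) -> transition probability
    """
    # Column j = 0 of the table: A[i, 0] = I_P(q_i).
    column = {q: initial.get(q, 0.0) for q in states}
    for sym in word:
        # A[i, j+1] = sum over k of A[k, j] * delta_P(q_k, a_{j+1}, q_i).
        nxt = {q: 0.0 for q in states}
        for (src, a, dst), p in delta.items():
            if a == sym:
                nxt[dst] += column[src] * p
        column = nxt
    # Pr(a_1 ... a_n) = sum over k of A[k, n] * F_P(q_k).
    return sum(column[q] * final[q] for q in states)


# Hypothetical non-deterministic PFA, for illustration only.
STATES = [0, 1]
INITIAL = {0: 1.0}
FINAL = {0: 0.2, 1: 0.5}
DELTA = {(0, "a", 0): 0.3, (0, "a", 1): 0.5, (1, "b", 0): 0.5}

print(forward_probability("ab", STATES, INITIAL, FINAL, DELTA))
```

Replacing the sum over predecessors by a maximum would give the Viterbi score of the most probable path instead.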

2. Distances: what for?

Estimate the quality of a language model
Have an indicator of the convergence of learning algorithms
Construct kernels

2.1 Entropy

How many bits do we need to correct our model?
Two distributions over Σ*: D and D'
Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} \Pr_D(w) \left( \log \Pr_D(w) - \log \Pr_{D'}(w) \right)$

2.2 Perplexity

The idea is to allow the computation of the divergence, but relatively to a test set (S).
An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

$\left( \prod_{w \in S} \Pr_D(w) \right)^{-\frac{1}{|S|}} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$

Problem if some probability is null!
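As a small illustration of the perplexity formula, the sketch below computes the inverse geometric mean in log space. The function name and the convention of returning infinity when a test string has probability zero are choices made for this example, not something the slides prescribe.

```python
import math


def perplexity(probabilities):
    """Inverse of the geometric mean of the probabilities of the test strings.

    Returns infinity if some string has probability 0 (the problem case
    mentioned on the slide).
    """
    if any(p <= 0.0 for p in probabilities):
        return math.inf
    mean_log = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-mean_log)


# Two test strings with probabilities 1/2 and 1/8: geometric mean 1/4.
print(perplexity([0.5, 0.125]))
```

In practice a model is usually smoothed before evaluation so that no test string receives probability zero.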

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S.

Why multiply? (2)

Suppose we have two predictors for a coin toss:
Predictor 1: heads 60%, tails 40%
Predictor 2: heads 100%
The tests are H: 6, T: 4
Arithmetic mean:
P1: 0.36 + 0.16 = 0.52
P2: 0.6
Predictor 2 is the better predictor ;-)
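The coin example can be replayed numerically to see why the geometric mean is preferred. This is a sketch with helper names chosen for the illustration; it scores each predictor on the ten test tosses with both means.

```python
import math


def arithmetic_mean(probs):
    return sum(probs) / len(probs)


def geometric_mean(probs):
    # A single zero probability makes the whole product zero.
    if any(p == 0.0 for p in probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


outcomes = ["H"] * 6 + ["T"] * 4
predictor1 = {"H": 0.6, "T": 0.4}
predictor2 = {"H": 1.0, "T": 0.0}

p1 = [predictor1[o] for o in outcomes]
p2 = [predictor2[o] for o in outcomes]

print(arithmetic_mean(p1), arithmetic_mean(p2))
print(geometric_mean(p1), geometric_mean(p2))
```

The arithmetic mean ranks predictor 2 first, while the geometric mean gives it a score of zero because it declared the observed tails impossible.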

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left( \Pr_D(w) - \Pr_{D'}(w) \right)^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).
This also means that equivalence of PFA is in P.

3. FFA: Frequency Finite (state) Automata

A learning sample is a multiset.
Strings appear with a frequency (or multiplicity).
S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that
the sum of what enters is equal to what exits, and
the sum of what halts is equal to what starts.

Example

[Figure: a DFFA with entering frequency 6 and three states, with halting frequencies 3, 1 and 2 and outgoing transitions (a: 5, b: 5), (a: 2, b: 3) and (a: 1, b: 4) respectively]

From a DFFA to a DPFA

[Figure: the same automaton as a DPFA, with initial probability 6/6, halting probabilities 3/13, 1/6 and 2/7, and outgoing transitions (a: 5/13, b: 5/13), (a: 2/6, b: 3/6) and (a: 1/7, b: 4/7)]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies.
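The normalisation step can be sketched in a few lines. The frequencies below are those of the example DFFA; the state names `q1`, `q2`, `q3` are hypothetical, and the targets of the transitions are left out because they play no role in the division.

```python
from fractions import Fraction


def dffa_to_dpfa(halting, transitions):
    """Turn frequencies into probabilities, state by state.

    halting:     dict state -> halting frequency
    transitions: dict state -> {symbol: transition frequency}
    Each frequency is divided by the total frequency exiting the state
    (halting frequency plus all outgoing transition frequencies).
    """
    final_prob, trans_prob = {}, {}
    for q in halting:
        total = halting[q] + sum(transitions[q].values())
        final_prob[q] = Fraction(halting[q], total)
        trans_prob[q] = {a: Fraction(c, total) for a, c in transitions[q].items()}
    return final_prob, trans_prob


# Frequencies of the example DFFA (state names are made up).
HALTING = {"q1": 3, "q2": 1, "q3": 2}
TRANSITIONS = {"q1": {"a": 5, "b": 5},
               "q2": {"a": 2, "b": 3},
               "q3": {"a": 1, "b": 4}}

final_prob, trans_prob = dffa_to_dpfa(HALTING, TRANSITIONS)
print(final_prob)
print(trans_prob)
```

Exact fractions are used so that the result can be compared directly with the values shown for the DPFA.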

From a DFA and a sample to a DFFA

[Figure: the DFFA of the previous example (entering frequency 6; halting frequencies 3, 1 and 2)]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}

Note

Another sample may lead to the same DFFA.
Doing the same with a NFA is a much harder problem:
typically what the Baum-Welch algorithm (EM) has been invented for…

The frequency prefix tree acceptor

The data is a multi-set.
The FTA is the smallest tree-like FFA consistent with the data.
Can be transformed into a PFA if needed.

From the sample to the FTA

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S), a tree with entering frequency 14; the root has halting frequency 3 and outgoing transitions a: 7 and b: 4]
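Building the frequency prefix tree acceptor amounts to counting, for every prefix, how many sample strings pass through it and how many halt there. The sketch below is one possible encoding, assuming states are identified with prefixes; the function and variable names are illustrative.

```python
from collections import Counter


def build_fta(sample):
    """Frequency prefix tree acceptor of a multiset of strings.

    sample: dict string -> multiplicity
    Returns (passing, halting): passing[u] is the frequency entering the
    state reached by prefix u, halting[u] the frequency of strings equal to u.
    """
    passing, halting = Counter(), Counter()
    for word, count in sample.items():
        halting[word] += count
        for i in range(len(word) + 1):
            passing[word[:i]] += count
    return passing, halting


# The multiset used on the slide; "" stands for the empty string.
SAMPLE = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
passing, halting = build_fta(SAMPLE)

print(passing[""], passing["a"], passing["aa"], passing["b"])
```

The frequency of the transition from prefix u to prefix ua is simply `passing[ua]`, so the tree's edge labels can be read off the same table.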

Red, Blue and White states

[Figure: an automaton over {a, b} with its states coloured Red, Blue and White]

Red states are confirmed states.
Blue states are the (non Red) successors of the Red states.
White states are the others.

Same as with DFA and what RPNI does.

Merge and fold

Suppose we decide to merge two states of the automaton.
[Figure: a DFFA with entering frequency 100]

First disconnect one of the two states and reconnect its incoming transition to the other.
[Figure: the DFFA after reconnecting]

Then fold.
[Figure: the DFFA during folding]

After folding.
[Figure: the DFFA after folding, with entering frequency 100]

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red: d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

The real question

How do we decide if d(A_p, A_q) is small?
Use a distance…
Be able to compute this distance
If possible, update the computation easily
Have properties related to this distance

Deciding if two distributions are similar

If the two distributions are known, equality can be tested.
Distance (L2 norm) between distributions can be exactly computed.
But what if the two distributions are unknown?

Taking decisions

Suppose we want to merge two states.
[Figure: the DFFA of the merge-and-fold example, with entering frequency 100]

Yes, if the two distributions induced are similar.
[Figure: the same DFFA, with the parts reachable from the two candidate states]

5. Alergia

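Alergia's test

$D_1 \approx D_2$ if $\forall x\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$
Easier to test:
$\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$
$\forall a \in \Sigma,\ \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$
And do this recursively!
Of course, do it on frequencies.

Hoeffding bounds

$\gamma = \left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right|$

$\gamma < \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}}$

$\gamma$ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.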
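The Hoeffding-based comparison of two relative frequencies can be written as a short function. This is a sketch of the single-frequency test only (the recursive comparison over successors is not shown), and the function names are chosen for the illustration.

```python
import math


def hoeffding_bound(n1, n2, alpha):
    """Right-hand side of the test: (1/sqrt(n1) + 1/sqrt(n2)) * sqrt(ln(2/alpha) / 2)."""
    return (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) * math.sqrt(0.5 * math.log(2 / alpha))


def frequencies_compatible(f1, n1, f2, n2, alpha):
    """True if f1/n1 and f2/n2 are close enough to be considered equal."""
    gamma = abs(f1 / n1 - f2 / n2)
    return gamma < hoeffding_bound(n1, n2, alpha)


# Halting frequencies 490 out of 1000 against 128 out of 257, alpha = 0.05.
print(frequencies_compatible(490, 1000, 128, 257, 0.05))
```

A smaller alpha widens the bound, so fewer pairs of states are declared different and more merges are accepted.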

A run of Alergia

Our learning multisample:

S = {λ (490), a (128), b (170), aa (31), ab (42), ba (38), bb (14), aaa (8), aab (10), aba (10), abb (4), baa (9), bab (4), bba (3), bbb (6), aaaa (2), aaab (2), aaba (3), aabb (2), abaa (2), abab (2), abba (2), abbb (1), baaa (2), baab (2), baba (1), babb (1), bbaa (1), bbab (1), bbba (1), aaaaa (1), aaaab (1), aaaba (1), aabaa (1), aabab (1), aabba (1), abbaa (1), abbab (1)}

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the blue states which have a frequency less than the threshold, a special merging operation takes place.

[Figure: FTA(S) for the learning multisample, with entering frequency 1000; the root has halting frequency 490 and an outgoing transition a: 257]

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
490/1000 with 128/257
257/1000 with 64/257
253/1000 with 65/257
All tests return true.

[Figure: FTA(S) again, with entering frequency 1000]
Merge…

[Figure: the DFFA after merging, with entering frequency 1000; the initial state has halting frequency 660 and outgoing transitions a: 341 and b: 340; state b has halting frequency 225 and outgoing transitions a: 77 and b: 38]
And fold

[Figure: the same DFFA]
Next merge: λ with b?

Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*:
660/1341 and 225/340 are different (giving γ = 0.162).
On the other hand,

$\left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} = 0.111$

[Figure: the same DFFA, with entering frequency 1000]
Promotion

[Figure: the DFFA after promoting state b]
Merge

[Figure: the DFFA after the merge, with entering frequency 1000]
And fold

[Figure: the folded DFFA, with entering frequency 1000]
Merge

[Figure: the resulting DFFA, with entering frequency 1000, two states of halting frequencies 698 and 302, and transitions a: 354, b: 351, a: 96, b: 49]
And fold

[Figure: the same automaton with its frequencies turned into probabilities]
As a PFA

Conclusion and logic

Alergia builds a DFFA in polynomial time.
Alergia can identify DPFA in the limit with probability 1.
No good definition of Alergia's properties.

6. DSAI and MDI

Why not change the criterion?

Criterion for DSAI

Using a distinguishable string
Use norm $L_\infty$
Two distributions are different if there is a string with a very different probability.
Such a string is called μ-distinguishable.
Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

Criterion for MDI

MDL-inspired heuristic
Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

7. Conclusion and open questions

A good candidate to learn NFA is DEES.
Never has been a challenge, so state of the art is still unclear.
Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars.

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing transition a: 1501/3000]

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

0/1   1/0
1/1
1/2   2/1
1/3   2/3   3/2   3/1
1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
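The search for a simple approximation can be sketched as a descent in the Stern-Brocot tree, taking the «mean» (mediant) of the two current bounds at each step. The stopping rule used here, a fixed tolerance `eps`, is an assumption made for the example; the slides do not say how close is close enough.

```python
from fractions import Fraction


def mediant(left, right):
    """«Mean» of two fractions given as (numerator, denominator) pairs."""
    return (left[0] + right[0], left[1] + right[1])


def simplest_approximation(x, eps):
    """Descend the Stern-Brocot tree until a fraction within eps of x is found.

    x and eps are Fractions with x >= 0 and eps > 0.
    """
    low, high = (0, 1), (1, 0)  # the two starting fractions 0/1 and 1/0
    while True:
        mid = mediant(low, high)
        value = Fraction(mid[0], mid[1])
        if abs(value - x) <= eps:
            return value
        if value < x:
            low = mid
        else:
            high = mid


# The estimate 1501/3000 is replaced by the simpler fraction 1/2.
print(simplest_approximation(Fraction(1501, 3000), Fraction(1, 100)))
```

Because fractions nearer the root of the tree have smaller denominators, the first fraction found within the tolerance is also a simple one.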

Iterated Logarithm

With probability 1, for a co-finite number of values of n we have:

$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log \log n}{n}}, \quad \forall \lambda > 1$

Page 4: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 4

1 PFAProbabilistic finite (state)

automata

Zadar August 2010 5

Practical motivations

(Computational biology speech recognition web services automatic translation image processing hellip)

A lot of positive data Not necessarily any negative data No ideal target Noise

Zadar August 2010 6

The grammar induction problem revisitedThe data consists of positive strings

laquogeneratedraquo following an unknown distribution

The goal is now to find (learn) this distribution

Or the grammarautomaton that is used to generate the strings

Zadar August 2010 7

Success of the probabilistic models

n-gramsHidden Markov ModelsProbabilistic grammars

Zadar August 2010 8

41

31

21

21

21

32

a

b

a

b

a

b

43

21

DPFA Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

41

31

21

21

21

32

a

b

a

b

a

b

43

21

PrA(abab)= 241

43

32

31

21

21

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

41

31

21

21

21

32

b

b

a

a

a

b

43

21

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

41

31

21

21

21

32

b

a

b

43

21

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01] FP Q[01] P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i) i=qi0ai1qi1ai2hellipainqin

Pr(i)=IP(qi0)FP(qin) aij P (qij-1aijqij)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between modelsEquivalence between PFA and

HMMhellipBut the HMM usually define

distributions over each n

Zadar August 2010 23

41

21

win draw lose win draw lose win draw lose

21

21

34

12

14

14

34

14

14

41

41

41

41

41

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

31

21

21

21

21

32

a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a Compare and a a and aa b

and ab 4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b Compare and b a and ba b

and bb 6601341 and 225340 are different

(giving = 0162) On the other hand

11102ln2111

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic Alergia builds a DFFA in polynomial

time Alergia can identify DPFA in the limit

with probability 1 No good definition of Alergiarsquos

properties

Zadar August 2010 76

6 DSAI and MDIWhy not change the

criterion

Zadar August 2010 77

Criterion for DSAI Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI) D Ron Y Singer and N Tishby On the

learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI MDL inspired heuristic Criterion is does the reduction of the size of the

automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

AppendixStern Brocot trees

Identification of probabilitiesIf we were able to discover the

structure how do we identify the probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea Instead of returning c(x)n search the

Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have


Zadar August 2010 5

Practical motivations

(Computational biology, speech recognition, web services, automatic translation, image processing, …)

A lot of positive data
Not necessarily any negative data
No ideal target
Noise

Zadar August 2010 6

The grammar induction problem, revisited

The data consists of positive strings, «generated» following an unknown distribution.

The goal is now to find (learn) this distribution,

or the grammar/automaton that is used to generate the strings.

Zadar August 2010 7

Success of the probabilistic models

n-grams
Hidden Markov Models
Probabilistic grammars

Zadar August 2010 8

[Figure: a DPFA over {a, b}; probabilities shown in the diagram: 1/4, 1/3, 1/2, 1/2, 1/2, 2/3, 3/4, 1/2.]

DPFA: Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

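[Figure: the same DPFA, used to parse the string abab.]

$\Pr_A(abab) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{3} \times \frac{2}{3} \times \frac{3}{4} = \frac{1}{24}$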

Zadar August 2010 10

[Figure: a DPFA over {a, b} with decimal probabilities; values shown in the diagram: 0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3, 0.7.]

Zadar August 2010 11

[Figure: a non-deterministic PFA over {a, b}; probabilities shown in the diagram: 1/4, 1/3, 1/2, 1/2, 1/2, 2/3, 3/4, 1/2.]

PFA: Probabilistic Finite (state) Automaton

Zadar August 2010 12

[Figure: a PFA over {a, b} in which some transitions are λ-transitions.]

λ-PFA: Probabilistic Finite (state) Automaton with λ-transitions

Zadar August 2010 13

How useful are these automata?

They can define a distribution over $\Sigma^*$.

They do not tell us if a string belongs to a language.

They are good candidates for grammar induction.

There is (was) not that much written theory.

Zadar August 2010 14

Basic references

The HMM literature
Azaria Paz 1973, Introduction to probabilistic automata
Chapter 5 of my book
Probabilistic Finite-State Machines, Vidal, Thollard, cdlh, Casacuberta & Carrasco
Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over $\Sigma^*$:

$0 \le \Pr_D(w) \le 1$

$\sum_{w \in \Sigma^*} \Pr_D(w) = 1$

Zadar August 2010 16

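A Probabilistic Finite (state) Automaton is a tuple $\langle Q, \Sigma, I_P, F_P, \delta_P \rangle$:

Q: set of states
$I_P : Q \to [0,1]$
$F_P : Q \to [0,1]$
$\delta_P : Q \times \Sigma \times Q \to [0,1]$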

Zadar August 2010 17

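What does a PFA do?

It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:

$\Pr_A(w) = \sum_{\pi_i \in paths(w)} \Pr(\pi_i)$, where $\pi_i = q_{i_0} a_{i_1} q_{i_1} a_{i_2} \ldots a_{i_n} q_{i_n}$

$\Pr(\pi_i) = I_P(q_{i_0}) \cdot F_P(q_{i_n}) \cdot \prod_{a_{i_j}} \delta_P(q_{i_{j-1}}, a_{i_j}, q_{i_j})$

Note that if λ-transitions are allowed, the sum may be infinite.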
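The definition above can be turned directly into a (deliberately naive) program. The sketch below enumerates every state sequence of the right length and sums the path probabilities; the two-state automaton and the names `STATES`, `INITIAL`, `FINAL` and `DELTA` are hypothetical, chosen only for illustration, and no λ-transitions are assumed.

```python
from itertools import product


def path_sum_probability(word, states, initial, final, delta):
    """Sum, over every state sequence reading `word`, of the path probability."""
    total = 0.0
    for path in product(states, repeat=len(word) + 1):
        # Initial probability of the first state times halting probability of the last.
        p = initial.get(path[0], 0.0) * final.get(path[-1], 0.0)
        # Multiply by the probability of each transition along the path.
        for j, symbol in enumerate(word):
            p *= delta.get((path[j], symbol, path[j + 1]), 0.0)
        total += p
    return total


# Hypothetical PFA (not one of the automata drawn on the slides).
STATES = ["q0", "q1"]
INITIAL = {"q0": 1.0}
FINAL = {"q0": 0.2, "q1": 0.5}
DELTA = {
    ("q0", "a", "q0"): 0.3,
    ("q0", "a", "q1"): 0.5,
    ("q1", "b", "q0"): 0.5,
}

if __name__ == "__main__":
    print(path_sum_probability("a", STATES, INITIAL, FINAL, DELTA))
```

The enumeration is exponential in the length of the string, so it only serves to make the definition concrete.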

Zadar August 2010 18

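$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b}; values shown in the diagram: 0.2, 0.1, 1, 0.45, 0.35, 0.4, 0.7, 0.3, 0.1, 0.4.]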

Zadar August 2010 19

Non-deterministic PFA: many initial states / only one initial state

λ-PFA: a PFA with λ-transitions and perhaps many initial states

DPFA: a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if $\Pr_A(\Sigma^*) = 1$ and $\forall x \in \Sigma^*,\ 0 \le \Pr_A(x) \le 1$.

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

$\forall q \in Q:\ F_P(q) + \sum_{q' \in Q,\ a \in \Sigma} \delta_P(q, a, q') = 1$
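The two conditions of the theorem are easy to check mechanically. The following is a minimal sketch, assuming a PFA stored as plain dictionaries (a representation chosen here, not taken from the slides); both example automata are hypothetical.

```python
def reachable(start, edges):
    """States reachable from `start` by following `edges` (a dict state -> set of states)."""
    seen, stack = set(start), list(start)
    while stack:
        q = stack.pop()
        for nxt in edges.get(q, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def satisfies_consistency_conditions(states, initial, final, delta, tol=1e-9):
    """True if every state is useful and each state's outgoing mass sums to 1."""
    forward, backward = {}, {}
    for (src, _symbol, dst), p in delta.items():
        if p > 0:
            forward.setdefault(src, set()).add(dst)
            backward.setdefault(dst, set()).add(src)
    accessible = reachable({q for q in states if initial.get(q, 0.0) > 0}, forward)
    co_accessible = reachable({q for q in states if final.get(q, 0.0) > 0}, backward)
    if any(q not in accessible or q not in co_accessible for q in states):
        return False
    for q in states:
        mass = final.get(q, 0.0) + sum(p for (src, _s, _d), p in delta.items() if src == q)
        if abs(mass - 1.0) > tol:
            return False
    return True


# Hypothetical PFA in which both conditions hold.
GOOD_STATES = ["q0", "q1"]
GOOD_INITIAL = {"q0": 1.0}
GOOD_FINAL = {"q0": 0.2, "q1": 0.5}
GOOD_DELTA = {("q0", "a", "q0"): 0.3, ("q0", "a", "q1"): 0.5, ("q1", "b", "q0"): 0.5}

# Hypothetical PFA with a trap state q2: locally normalised, but q2 is not co-accessible.
TRAP_STATES = ["q0", "q1", "q2"]
TRAP_FINAL = {"q0": 0.2, "q1": 0.5}
TRAP_DELTA = {
    ("q0", "a", "q0"): 0.3, ("q0", "a", "q1"): 0.3, ("q0", "b", "q2"): 0.2,
    ("q1", "b", "q0"): 0.5, ("q2", "a", "q2"): 1.0,
}
```

The trap example shows why the usefulness condition matters: the local sums are all 1, yet probability mass leaks into a state from which no string can ever be completed.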

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMM…

But the HMM usually define distributions over each $\Sigma^n$.

Zadar August 2010 23

[Figure: a football HMM whose states each emit win, draw or lose; probabilities shown in the diagram include 1/4, 1/2 and 3/4.]

A football HMM

Zadar August 2010 24

Equivalence between PFA with λ-transitions and PFA without λ-transitions

cdlh 2003, Hanneforth & cdlh 2009

Many initial states can be transformed into one initial state with λ-transitions.
λ-transitions can be removed in polynomial time.
Strategy: number the states; eliminate first the λ-loops, then the transitions with highest ranking arrival state.

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You can't even tell in advance if you are in a good case or not

(see Denis & Esposito 2004)

Zadar August 2010 26

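Example

[Figure: a non-deterministic PFA over the single letter a; probabilities shown in the diagram: 1/3, 1/2, 1/2, 1/2, 1/2, 2/3.]

This distribution cannot be modelled by a DPFA.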

Zadar August 2010 27

What does a DPFA over $\Sigma = \{a\}$ look like?

[Figure: states linked by a-transitions (a … a, a, a).]

And with this architecture you cannot generate the previous one.

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case: simple, apply the definitions.
Technically, rather sum up logs: this is easier, safer and cheaper.
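In the deterministic case there is a single path per string, so parsing is one pass over the string while adding logarithms. A minimal sketch follows; the dictionary layout (`delta` mapping a state and a symbol to a next state and a probability) and the small DPFA are assumptions made for the example.

```python
import math


def dpfa_log_probability(word, initial_state, final, delta):
    """Natural log of Pr(word) in a DPFA; -inf if the string has probability 0.

    `delta` maps (state, symbol) to (next_state, probability).
    """
    state, log_p = initial_state, 0.0
    for symbol in word:
        if (state, symbol) not in delta:
            return -math.inf
        state, p = delta[(state, symbol)]
        if p == 0.0:
            return -math.inf
        log_p += math.log(p)  # adding logs avoids underflow on long strings
    halt = final.get(state, 0.0)
    return log_p + math.log(halt) if halt > 0.0 else -math.inf


# Hypothetical DPFA: q0 halts with 0.2, q1 halts with 0.6.
DPFA_FINAL = {"q0": 0.2, "q1": 0.6}
DPFA_DELTA = {
    ("q0", "a"): ("q0", 0.3),
    ("q0", "b"): ("q1", 0.5),
    ("q1", "a"): ("q0", 0.4),
}

if __name__ == "__main__":
    print(math.exp(dpfa_log_probability("ab", "q0", DPFA_FINAL, DPFA_DELTA)))
```

Returning `-inf` for impossible strings keeps the function total, which is convenient when the log-probabilities are later summed over a test set.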

Zadar August 2010 29

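$\Pr(aba) = 0.7 \times 0.9 \times 0.35 \times 0 = 0$
$\Pr(abb) = 0.7 \times 0.9 \times 0.65 \times 0.3 = 0.12285$

[Figure: a DPFA over {a, b}; values shown in the diagram: 0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3, 0.7.]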

Zadar August 2010 30

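Non-deterministic case

$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b}; values shown in the diagram: 0.2, 0.1, 1, 0.45, 0.35, 0.4, 0.7, 0.3, 0.1, 0.4.]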

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming: $O(n^2 m)$.

2 algorithms: Backward and Forward.
If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.

Zadar August 2010 32

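Forward algorithm

$A[i,j] = \Pr(q_i \mid a_1 \ldots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \ldots a_j$)

$A[i,0] = I_P(q_i)$
$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot \delta_P(q_k, a_{j+1}, q_i)$
$\Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$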
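The recurrence above can be sketched in a few lines. Only the current column of the table A is kept, since each column depends on the previous one alone; the automaton is a hypothetical two-state PFA without λ-transitions, and all names are illustrative.

```python
def forward_probability(word, states, initial, final, delta):
    """Probability of `word` in a PFA, computed column by column."""
    # Column j holds, for each state, the probability of having read the
    # first j symbols and being in that state.
    column = [initial.get(q, 0.0) for q in states]
    for symbol in word:
        column = [
            sum(column[k] * delta.get((states[k], symbol, states[i]), 0.0)
                for k in range(len(states)))
            for i in range(len(states))
        ]
    # Finish by weighting each state with its halting probability.
    return sum(column[k] * final.get(states[k], 0.0) for k in range(len(states)))


# Hypothetical PFA (not one of the automata drawn on the slides).
FWD_STATES = ["q0", "q1"]
FWD_INITIAL = {"q0": 1.0}
FWD_FINAL = {"q0": 0.2, "q1": 0.5}
FWD_DELTA = {
    ("q0", "a", "q0"): 0.3,
    ("q0", "a", "q1"): 0.5,
    ("q1", "b", "q0"): 0.5,
}

if __name__ == "__main__":
    print(forward_probability("aab", FWD_STATES, FWD_INITIAL, FWD_FINAL, FWD_DELTA))
```

Each symbol costs one pass over all pairs of states, which is where the quadratic dependence on the number of states comes from.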

Zadar August 2010 33

2 Distances

What for?

Estimate the quality of a language model
Have an indicator of the convergence of learning algorithms
Construct kernels

Zadar August 2010 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over $\Sigma^*$: D and D'.
Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} \Pr_D(w) \cdot \left( \log \Pr_D(w) - \log \Pr_{D'}(w) \right)$
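For distributions with a finite support the sum can be computed term by term. The sketch below is only an illustration under that assumption (distributions given as dictionaries from strings to probabilities, logarithms in base 2 so that the result is in bits); it is not the polynomial-time computation on automata.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence (in bits) between two finite-support distributions."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue  # a string of probability 0 under p contributes nothing
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf  # p gives mass to a string that q forbids
        total += pw * (math.log2(pw) - math.log2(qw))
    return total


# Two hypothetical distributions over the strings "" and "a".
D = {"": 0.5, "a": 0.5}
D_PRIME = {"": 0.25, "a": 0.75}

if __name__ == "__main__":
    print(kl_divergence(D, D_PRIME))
```

The infinite value returned when the second distribution assigns probability 0 to a string is the reason smoothing matters in practice.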

Zadar August 2010 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relatively to a test set (S).

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

Zadar August 2010 36

\[ \left( \prod_{w \in S} \Pr_D(w) \right)^{-\frac{1}{|S|}} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}} \]

Problem if some probability is null.
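As a small illustration of the formula, perplexity can be computed through a sum of logarithms rather than a raw product. The model below is a hypothetical lookup table standing in for a PFA, and the function name is an assumption of this sketch.

```python
import math


def perplexity(test_set, prob):
    """Inverse of the geometric mean of prob(w) over the strings of `test_set`."""
    log_sum = 0.0
    for w in test_set:
        p = prob(w)
        if p == 0.0:
            return math.inf  # a single null probability makes the perplexity infinite
        log_sum += math.log2(p)
    return 2.0 ** (-log_sum / len(test_set))


# Hypothetical model given as a table of string probabilities.
MODEL = {"": 0.25, "a": 0.5, "b": 0.25}


def model_prob(w):
    return MODEL.get(w, 0.0)


if __name__ == "__main__":
    print(perplexity(["a", "a", "b", ""], model_prob))
```

The early return makes the problem mentioned on the slide concrete: one unseen string in the test set is enough to ruin the score, whatever the quality of the model elsewhere.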

Zadar August 2010 37

Why multiply (1)?

We are trying to compute the probability of independently drawing the different strings in set S.

Zadar August 2010 38

Why multiply (2)?

Suppose we have two predictors for a coin toss:
Predictor 1: heads 60%, tails 40%
Predictor 2: heads 100%

The tests are H: 6, T: 4.
Arithmetic mean:
P1: 0.36 + 0.16 = 0.52
P2: 0.6

Predictor 2 is the better predictor ;-)
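The coin example can be replayed numerically. This sketch compares the arithmetic mean used on the slide with the geometric mean (the one perplexity is built on); the variable names are illustrative.

```python
import math

# Test data from the example: 6 heads and 4 tails.
OUTCOMES = ["H"] * 6 + ["T"] * 4

PREDICTOR_1 = {"H": 0.6, "T": 0.4}
PREDICTOR_2 = {"H": 1.0, "T": 0.0}


def arithmetic_mean(predictor, outcomes):
    """Average probability the predictor gave to what actually happened."""
    return sum(predictor[o] for o in outcomes) / len(outcomes)


def geometric_mean(predictor, outcomes):
    """Geometric mean of the same probabilities (|S|-th root of their product)."""
    return math.prod(predictor[o] for o in outcomes) ** (1.0 / len(outcomes))


if __name__ == "__main__":
    for name, pred in (("P1", PREDICTOR_1), ("P2", PREDICTOR_2)):
        print(name, arithmetic_mean(pred, OUTCOMES), geometric_mean(pred, OUTCOMES))
```

With the arithmetic mean the overconfident predictor wins (0.6 against 0.52); with the geometric mean it scores 0, because it declared an observed outcome impossible.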

Zadar August 2010 39

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left( \Pr_D(w) - \Pr_{D'}(w) \right)^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).

This also means that equivalence of PFA is in P.

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset

Strings appear with a frequency (or multiplicity):

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, for every transition, and for entering the initial state, such that:

the sum of what enters is equal to what exits, and

the sum of what halts is equal to what starts.

Zadar August 2010 43

Example

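[Figure: a three-state DFFA; halting frequencies 3, 1 and 2; transition frequencies b 5, b 3, a 1, a 2, a 5, b 4; frequency entering the initial state 6.]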

Zadar August 2010 44

From a DFFA to a DPFA

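[Figure: the same automaton as a DPFA; halting probabilities 3/13, 1/6 and 2/7; transition probabilities b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7; initial probability 6/6.]

Frequencies become relative frequencies by dividing by the sum of the exiting frequencies.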

Zadar August 2010 45

From a DFA and a sample to a DFFA

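[Figure: the three-state DFFA obtained by parsing the sample; halting frequencies 3, 1 and 2; transition frequencies b 5, b 3, a 1, a 2, a 5, b 4; frequency entering the initial state 6.]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}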
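Both steps (counting frequencies by parsing the sample, then normalising) fit in a short sketch. The transition structure `DFA` below is an assumption: the diagram did not survive extraction, and this is one structure that reproduces exactly the frequencies quoted for this example when the sample is taken to contain the empty string once (written `""` here).

```python
from collections import Counter
from fractions import Fraction

# Assumed structure, reconstructed to be consistent with the quoted frequencies.
DFA = {
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q3",
    ("q3", "a"): "q2", ("q3", "b"): "q3",
}
SAMPLE = ["", "aaaa", "ab", "babb", "bbbb", "bbbbaa"]


def count_frequencies(dfa, start, sample):
    """Parse every string and count how often each state halts and each transition fires."""
    halting, transitions = Counter(), Counter()
    for word in sample:
        state = start
        for symbol in word:
            transitions[(state, symbol)] += 1
            state = dfa[(state, symbol)]
        halting[state] += 1
    return halting, transitions


def to_dpfa(halting, transitions):
    """Divide each frequency by the total frequency leaving (or halting in) its state."""
    totals = Counter(halting)
    for (state, _symbol), freq in transitions.items():
        totals[state] += freq
    final = {q: Fraction(halting[q], totals[q]) for q in totals}
    delta = {key: Fraction(freq, totals[key[0]]) for key, freq in transitions.items()}
    return final, delta


HALTING, TRANSITIONS = count_frequencies(DFA, "q1", SAMPLE)
FINAL_P, DELTA_P = to_dpfa(HALTING, TRANSITIONS)
```

Using `Fraction` keeps the relative frequencies exact, so they can be compared directly with the fractions shown on the slides (for instance 3/13 for halting in the initial state).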

Zadar August 2010 46

Note

Another sample may lead to the same DFFA.
Doing the same with an NFA is a much harder problem.
Typically what the Baum-Welch algorithm (EM) has been invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multiset.
The FTA is the smallest tree-like FFA consistent with the data.
Can be transformed into a PFA if needed.

Zadar August 2010 48

From the sample to the FTA

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S), the frequency prefix tree acceptor of S; frequency entering the root 14, with transitions a 7 and b 4 leaving the root.]
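Building the frequency prefix tree acceptor amounts to counting prefixes. In the sketch below each state is named by the prefix that reaches it, which is a representation chosen for this example; the multiset is the one from the slide, with the empty string written `""`.

```python
from collections import Counter


def build_fpta(multiset):
    """Return (halting, transitions) for the frequency prefix tree acceptor.

    States are prefixes; `transitions[(prefix, symbol)]` is the number of
    sample strings (with multiplicity) that pass through that edge.
    """
    halting, transitions = Counter(), Counter()
    for word, count in multiset.items():
        for i, symbol in enumerate(word):
            transitions[(word[:i], symbol)] += count
        halting[word] += count
    return halting, transitions


S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
FPTA_HALTING, FPTA_TRANSITIONS = build_fpta(S)

if __name__ == "__main__":
    print(sum(S.values()), FPTA_TRANSITIONS[("", "a")], FPTA_TRANSITIONS[("", "b")])
```

The flow condition of a DFFA holds by construction: what enters a prefix equals what halts there plus what leaves through its outgoing edges.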

Zadar August 2010 49

Red, Blue and White states

[Figure: an automaton over {a, b} with its states coloured Red, Blue and White.]

Red states are confirmed states.
Blue states are the (non-Red) successors of the Red states.
White states are the others.

Same as with DFA and what RPNI does.

Zadar August 2010 50

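Merge and fold

Suppose we decide to merge a Blue state with a Red state.

[Figure: a DFFA with frequency 100 entering the initial state; state frequencies 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6.]

Zadar August 2010 51

Merge and fold

First disconnect the Blue state and reconnect its incoming transition to the Red state.

[Figure: the same DFFA with the transition redirected.]

Zadar August 2010 52

Merge and fold

Then fold.

[Figure: the subtree of the Blue state being folded into the Red state.]

Zadar August 2010 53

Merge and fold

After folding.

[Figure: the resulting DFFA; frequency 100 entering the initial state; state frequencies 60, 10, 9, 10, 11, 4; transitions a 26, a 4, a 10, a 10, b 24, b 9, b 30.]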

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red such that d(A_p, A_q) is small
        then A = merge_and_fold(A, p, q)
        else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red, a ∈ Σ} − Red

Zadar August 2010 55

The real question

How do we decide if d(A_p, A_q) is small?

Use a distance…
Be able to compute this distance
If possible, update the computation easily
Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested.
Distance (L2 norm) between distributions can be exactly computed.
But what if the two distributions are unknown?

Zadar August 2010 57

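Taking decisions

Suppose we want to merge a Blue state with a Red state.

[Figure: the DFFA of the merge-and-fold example; frequency 100 entering the initial state; state frequencies 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6.]

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar.

[Figure: the same DFFA, with the two sub-automata rooted at the candidate states shown side by side.]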

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

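Alergia's test

$D_1 \approx D_2$ if $\forall x,\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$

Easier to test:
$\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$
$\forall a \in \Sigma,\ \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$

And do this recursively.
Of course, do it on frequencies.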

Zadar August 2010 61

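Hoeffding bounds

\[ \gamma = \left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right| \qquad \gamma < \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \]

$\gamma$ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.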
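The Hoeffding-based compatibility test is a one-line comparison once the two frequencies and their totals are known. A minimal sketch, assuming the usual form of the bound (difference of relative frequencies against the sum of inverse square roots times a factor depending on α); the function name is illustrative.

```python
import math


def alergia_compatible(f1, n1, f2, n2, alpha):
    """True if the relative frequencies f1/n1 and f2/n2 are close enough to be merged."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) * math.sqrt(0.5 * math.log(2 / alpha))
    return gamma < bound


if __name__ == "__main__":
    # Frequencies taken from the run of Alergia in this deck, with alpha = 0.05.
    print(alergia_compatible(490, 1000, 128, 257, 0.05))
    print(alergia_compatible(660, 1341, 225, 340, 0.05))
```

A smaller α widens the bound, so fewer pairs are declared different and more merges are accepted; in the full algorithm the test is applied recursively to the successors of the two states.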

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the Blue states that have a frequency less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: the frequency prefix tree acceptor of S; frequency 1000 entering the root, halting frequency 490 at the root, a-transition from the root with frequency 257; the remaining state and transition frequencies are not recoverable from the extraction.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
490/1000 with 128/257
257/1000 with 64/257
253/1000 with 65/257

All tests return true.

Zadar August 2010 66

Merge…

[Figure: the same frequency prefix tree acceptor, with the states λ and a about to be merged.]

Zadar August 2010 67

And fold

[Figure: the DFFA after merging and folding; frequency 1000 entering the initial state; frequencies shown include 660, 225, 52, a 341, b 340, a 77, b 38.]

Zadar August 2010 68

Next merge: λ with b?

[Figure: the same DFFA, with the states λ and b as merge candidates.]

Zadar August 2010 69

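Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*:
660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand,

\[ \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} = 0.111 \]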
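The value 0.111 can be checked by hand. The worked computation below assumes $n_1 = 1341$, $n_2 = 340$ (the two denominators compared in this test) and $\alpha = 0.05$ (the value chosen for this run):

```latex
\[
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}} = \sqrt{\tfrac{1}{2}\ln 40} \approx \sqrt{1.844} \approx 1.358
\]
\[
\frac{1}{\sqrt{1341}} + \frac{1}{\sqrt{340}} \approx 0.0273 + 0.0542 = 0.0815
\]
\[
0.0815 \times 1.358 \approx 0.111
\]
```

Since the quoted γ is larger than this bound, the test fails and the two states are not merged.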

Zadar August 2010 70

Promotion

[Figure: the DFFA with frequency 1000 entering the initial state; frequencies shown include 660, 225, 52, a 341, b 340, a 77, b 38.]

Zadar August 2010 71

Merge

[Figure: the same DFFA, with the next pair of states to be merged.]

Zadar August 2010 72

And fold

[Figure: the DFFA after folding; frequencies shown include 660, a 341, b 340, a 95, b 49, a 11, b 9.]

Zadar August 2010 73

Merge

[Figure: the same DFFA, with the next pair of states to be merged.]

Zadar August 2010 74

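And fold

[Figure: the final two-state DFFA; frequency 1000 entering the initial state; state frequencies 698 and 302; transitions a 354, a 96, b 351, b 49.]

As a PFA

[Figure: the same automaton with the frequencies converted to probabilities.]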

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time.
Alergia can identify DPFA in the limit with probability 1.
No good definition of Alergia's properties.

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string
Use norm $L_\infty$

Two distributions are different if there is a string with a very different probability.

Such a string is called μ-distinguishable.
Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI

MDL-inspired heuristic
Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES.
Never has been a challenge, so the state of the art is still unclear.
Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars.

Zadar August 2010 82

Appendix

Stern-Brocot trees
Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing a-transition labelled 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

\[ \frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d} \]

Zadar August 2010 85

[Figure: the first levels of the Stern-Brocot tree]

0/1  1/0
1/1
1/2  2/1
1/3  2/3  3/2  3/1
1/4  2/5  3/5  3/4  4/3  5/3  5/2  4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
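The search can be sketched as a descent in the tree: start from the boundaries 0/1 and 1/0, take the mediant, and move left or right until the mediant is close enough to the observed relative frequency. The stopping rule used below (a fixed `tolerance`) is an assumption of this sketch, as is the function name.

```python
from fractions import Fraction


def simplest_fraction_near(value, tolerance):
    """Descend the Stern-Brocot tree until a fraction within `tolerance` of `value` is met.

    `value` must be non-negative; `tolerance` must be positive unless `value`
    is itself a fraction, in which case 0 is allowed.
    """
    lo_n, lo_d = 0, 1  # left boundary 0/1
    hi_n, hi_d = 1, 0  # right boundary 1/0 (stands for infinity)
    while True:
        med_n, med_d = lo_n + hi_n, lo_d + hi_d  # the «mean» of the two boundaries
        mediant = Fraction(med_n, med_d)
        if abs(mediant - value) <= tolerance:
            return mediant
        if mediant < value:
            lo_n, lo_d = med_n, med_d  # go right
        else:
            hi_n, hi_d = med_n, med_d  # go left


if __name__ == "__main__":
    # 1501 uses of the edge out of 3000 passages, with a hypothetical tolerance of 1/100.
    print(simplest_fraction_near(Fraction(1501, 3000), Fraction(1, 100)))
```

Because fractions with small denominators sit near the top of the tree, the first fraction met within the tolerance is also a simple one: here the estimate 1501/3000 is replaced by 1/2.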

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n we have:

\[ \left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log \log n}{n}} \qquad \forall \lambda > 1 \]

Page 7: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 7

Success of the probabilistic models

n-grams

Hidden Markov Models

Probabilistic grammars

Zadar August 2010 8

[Figure: a DPFA over {a, b}; the fractions labelling its states and transitions are not recoverable from the extraction.]

DPFA: Deterministic Probabilistic Finite Automaton

Zadar August 2010 9

[Figure: the same DPFA over {a, b}.]

$Pr_A(abab) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{3} \times \frac{2}{3} \times \frac{3}{4} = \frac{1}{24}$

Zadar August 2010 10

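[Figure: a DPFA over {a, b} labelled with the decimal probabilities 0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3 and 0.7; the layout is not recoverable from the extraction.]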

Zadar August 2010 11

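[Figure: a non-deterministic PFA over {a, b}; the fractions labelling its states and transitions are not recoverable from the extraction.]

PFA: Probabilistic Finite (state) Automaton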

Zadar August 2010 12

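[Figure: a PFA over {a, b} with λ-transitions; the fractions labelling its states and transitions are not recoverable from the extraction.]

λ-PFA: Probabilistic Finite (state) Automaton with λ-transitions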

Zadar August 2010 13

How useful are these automata?

They can define a distribution over $\Sigma^*$.

They do not tell us if a string belongs to a language.

They are good candidates for grammar induction.

There is (was) not that much written theory.

Zadar August 2010 14

Basic references

The HMM literature

Azaria Paz, 1973: Introduction to probabilistic automata

Chapter 5 of my book

Probabilistic Finite-State Machines: Vidal, Thollard, cdlh, Casacuberta & Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over $\Sigma^*$:

$0 \le Pr_D(w) \le 1$

$\sum_{w \in \Sigma^*} Pr_D(w) = 1$

Zadar August 2010 16

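A Probabilistic Finite (state) Automaton is a tuple $\langle Q, \Sigma, I_P, F_P, \delta_P \rangle$

Q: set of states

$I_P : Q \to [0,1]$

$F_P : Q \to [0,1]$

$\delta_P : Q \times \Sigma \times Q \to [0,1]$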

Zadar August 2010 17

What does a PFA do?

It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:

$Pr_A(w) = \sum_{\pi_i \in paths(w)} Pr(\pi_i)$, where $\pi_i = q_{i_0} a_{i_1} q_{i_1} a_{i_2} \ldots a_{i_n} q_{i_n}$

$Pr(\pi_i) = I_P(q_{i_0}) \cdot F_P(q_{i_n}) \cdot \prod_{a_{i_j}} \delta_P(q_{i_{j-1}}, a_{i_j}, q_{i_j})$

Note that if λ-transitions are allowed the sum may be infinite.
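The definition can be applied literally by enumerating every state sequence of the right length. The sketch below does exactly that; it is exponential in the length of the string and only meant to mirror the formula. The two-state automaton (`STATES`, `INIT`, `FINAL`, `DELTA`) is a hypothetical example invented for this illustration, not one of the automata drawn in the slides, and λ-transitions are assumed absent.

```python
from itertools import product

# Hypothetical PFA over {a, b}; every state satisfies final + outgoing = 1.
STATES = [0, 1]
INIT = {0: 1.0}                      # initial probabilities
FINAL = {0: 0.1, 1: 0.5}             # halting probabilities
DELTA = {                            # (source, symbol, target) -> probability
    (0, "a", 0): 0.3,
    (0, "a", 1): 0.6,
    (1, "b", 0): 0.2,
    (1, "b", 1): 0.3,
}


def path_sum_probability(word, states, init, final, delta):
    """Sum, over all state sequences reading `word`, of
    init(first) * final(last) * product of the transition probabilities."""
    total = 0.0
    for path in product(states, repeat=len(word) + 1):
        p = init.get(path[0], 0.0) * final.get(path[-1], 0.0)
        for j, symbol in enumerate(word):
            p *= delta.get((path[j], symbol, path[j + 1]), 0.0)
        total += p
    return total


pr_ab = path_sum_probability("ab", STATES, INIT, FINAL, DELTA)
```

Two paths read ab in this example (0, 1, 0 and 0, 1, 1), contributing 0.012 and 0.09.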

Zadar August 2010 18

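$Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b} in which two different paths read aba; the layout is not recoverable from the extraction.]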

Zadar August 2010 19

Non-deterministic PFA: many initial states / only one initial state

A λ-PFA: a PFA with λ-transitions and perhaps many initial states

DPFA: a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if $Pr_A(\Sigma^*) = 1$ and $\forall x \in \Sigma^*,\ 0 \le Pr_A(x) \le 1$

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

$\forall q \in Q,\quad F_P(q) + \sum_{q' \in Q,\ a \in \Sigma} \delta_P(q, a, q') = 1$
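The two conditions of the theorem are easy to check mechanically. The sketch below is one possible way to do it, under the assumption that the automaton is stored in plain dictionaries; the function name, the tolerance `tol` and the two tiny automata are hypothetical and only serve the illustration.

```python
def satisfies_consistency_condition(states, init, final, delta, tol=1e-9):
    """Check the sufficient condition: every state is normalised
    (final + outgoing = 1) and every state is accessible and co-accessible."""
    for q in states:
        outgoing = sum(p for (src, _, _), p in delta.items() if src == q)
        if abs(final.get(q, 0.0) + outgoing - 1.0) > tol:
            return False

    # Successor and predecessor relations restricted to positive transitions.
    succ = {q: set() for q in states}
    pred = {q: set() for q in states}
    for (src, _, dst), p in delta.items():
        if p > 0:
            succ[src].add(dst)
            pred[dst].add(src)

    def reachable(start, edges):
        seen, stack = set(start), list(start)
        while stack:
            for nxt in edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    accessible = reachable({q for q in states if init.get(q, 0.0) > 0}, succ)
    co_accessible = reachable({q for q in states if final.get(q, 0.0) > 0}, pred)
    return accessible == set(states) and co_accessible == set(states)


# Hypothetical automaton that meets the condition.
GOOD = dict(
    states=[0, 1],
    init={0: 1.0},
    final={0: 0.1, 1: 0.5},
    delta={(0, "a", 0): 0.3, (0, "a", 1): 0.6, (1, "b", 0): 0.2, (1, "b", 1): 0.3},
)
# Hypothetical automaton with a normalised but non co-accessible state 1.
BAD = dict(
    states=[0, 1],
    init={0: 1.0},
    final={0: 0.5},
    delta={(0, "a", 1): 0.5, (1, "a", 1): 1.0},
)
```

In the second automaton every state is normalised, yet state 1 can never halt, so half of the probability mass is lost in it.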

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMM…

But HMMs usually define distributions over each $\Sigma^n$.

Zadar August 2010 23

[Figure: an HMM whose states emit win, draw or lose; the transition and emission probabilities are not recoverable from the extraction.]

A football HMM

Zadar August 2010 24

Equivalence between PFA with λ-transitions and PFA without λ-transitions

cdlh 2003, Hanneforth & cdlh 2009

Many initial states can be transformed into one initial state with λ-transitions.

λ-transitions can be removed in polynomial time.

Strategy: number the states; eliminate first the λ-loops, then the transitions with highest ranking arrival state.

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You can't even tell in advance if you are in a good case or not

(see Denis & Esposito 2004)

Zadar August 2010 26

Example

[Figure: a PFA over the one-letter alphabet {a}; its probabilities are not recoverable from the extraction.]

This distribution cannot be modelled by a DPFA.

Zadar August 2010 27

What does a DPFA over $\Sigma = \{a\}$ look like?

[Figure: a sequence of states linked by a-transitions (a … a).]

And with this architecture you cannot generate the previous one.

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case: simple, apply the definitions.

Technically, rather sum up logs: this is easier, safer and cheaper.
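In the deterministic case there is a single path per string, so parsing is one left-to-right pass. The following sketch sums logarithms as suggested above. The automaton (`INITIAL`, `FINAL`, `DELTA`) is a hypothetical two-state DPFA made up for the illustration, and the function name is an assumption.

```python
import math

# Hypothetical DPFA over {a, b}: (state, symbol) -> (next state, probability).
INITIAL = 0
FINAL = {0: 0.2, 1: 0.4}
DELTA = {
    (0, "a"): (1, 0.5),
    (0, "b"): (0, 0.3),
    (1, "a"): (1, 0.1),
    (1, "b"): (0, 0.5),
}


def dpfa_log_probability(word, initial_state, final, delta):
    """Return log Pr(word) by following the unique path and adding logs;
    -inf stands for probability 0."""
    state = initial_state
    log_p = 0.0
    for symbol in word:
        if (state, symbol) not in delta:
            return -math.inf
        state, p = delta[(state, symbol)]
        if p == 0:
            return -math.inf
        log_p += math.log(p)
    halting = final.get(state, 0.0)
    return log_p + math.log(halting) if halting > 0 else -math.inf


log_pr_ab = dpfa_log_probability("ab", INITIAL, FINAL, DELTA)
```

Working with logs avoids underflow on long strings; the probability itself, when needed, is `math.exp(log_pr_ab)`.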

Zadar August 2010 29

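$Pr(aba) = 0.7 \times 0.9 \times 0.35 \times 0 = 0$

$Pr(abb) = 0.7 \times 0.9 \times 0.65 \times 0.3 = 0.12285$

[Figure: the DPFA with decimal probabilities (0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3, 0.7) on which these two strings are parsed.]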

Zadar August 2010 30

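Non-deterministic case

$Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: the non-deterministic PFA over {a, b} in which two different paths read aba.]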

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming: $O(n^2 m)$

2 algorithms: Backward and Forward

If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.

Zadar August 2010 32

Forward algorithm

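$A[i,j] = Pr(q_i \mid a_1 \ldots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \ldots a_j$)

$A[i,0] = I_P(q_i)$

$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot \delta_P(q_k, a_{j+1}, q_i)$

$Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$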
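The recurrences translate directly into code. Below is a minimal sketch that keeps only the current column of the table A; the automaton is a hypothetical two-state PFA invented for the illustration (not one drawn in the slides), and λ-transitions are assumed absent.

```python
# Hypothetical PFA over {a, b}.
STATES = [0, 1]
INIT = {0: 1.0}
FINAL = {0: 0.1, 1: 0.5}
DELTA = {                      # (source, symbol, target) -> probability
    (0, "a", 0): 0.3,
    (0, "a", 1): 0.6,
    (1, "b", 0): 0.2,
    (1, "b", 1): 0.3,
}


def forward_probability(word, states, init, final, delta):
    """Forward algorithm: column[q] is the probability of having read the
    prefix processed so far and being in state q."""
    column = {q: init.get(q, 0.0) for q in states}           # A[., 0]
    for symbol in word:
        column = {                                            # A[., j+1]
            q: sum(column[k] * delta.get((k, symbol, q), 0.0) for k in states)
            for q in states
        }
    return sum(column[k] * final.get(k, 0.0) for k in states)


pr_ab = forward_probability("ab", STATES, INIT, FINAL, DELTA)
```

Each symbol costs one pass over all pairs of states, which is where the quadratic factor in the number of states comes from.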

Zadar August 2010 33

2 Distances

What for?

Estimate the quality of a language model

Have an indicator of the convergence of learning algorithms

Construct kernels

Zadar August 2010 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over $\Sigma^*$: D and D'

Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} Pr_D(w) \times \left(\log Pr_D(w) - \log Pr_{D'}(w)\right)$
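For intuition, the divergence can be computed by hand when both distributions have finite support. The sketch below makes that simplifying assumption (distributions defined by PFA generally have infinite support), uses base-2 logarithms to count bits, and works on two made-up distributions `P` and `Q`.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits between two finite distributions
    given as dictionaries string -> probability."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0:
            continue                      # 0 * log 0 is taken as 0
        qw = q.get(w, 0.0)
        if qw == 0:
            return math.inf               # q gives probability 0 to a possible string
        total += pw * (math.log2(pw) - math.log2(qw))
    return total


# Hypothetical distributions over the strings a and b.
P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.25, "b": 0.75}
divergence = kl_divergence(P, Q)
```

The divergence is not symmetric and becomes infinite as soon as the model gives probability 0 to a string that the true distribution can generate.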

Zadar August 2010 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relative to a test set (S).

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

Zadar August 2010 36

$\prod_{w \in S} Pr_D(w)^{-\frac{1}{|S|}} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} Pr_D(w)}}$

Problem if some probability is null!
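A small sketch of the computation, assuming the probabilities of the test strings have already been obtained from the model. It works in log space to avoid underflow and returns infinity when a probability is null, which is the problem mentioned above. The function name and the sample values are illustrative only.

```python
import math


def perplexity(probabilities):
    """Inverse of the geometric mean of the given probabilities."""
    if any(p == 0 for p in probabilities):
        return math.inf                       # a null probability breaks the measure
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


# Two test strings with probabilities 1/2 and 1/8: geometric mean 1/4.
example = perplexity([0.5, 0.125])
```

This is why language models are usually smoothed before being evaluated: a single unseen test string would otherwise make the perplexity infinite.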

Zadar August 2010 37

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S.

Zadar August 2010 38

Why multiply? (2)

Suppose we have two predictors for a coin toss:

Predictor 1: heads 60%, tails 40%

Predictor 2: heads 100%

The tests are H: 6, T: 4

Arithmetic mean:

P1: 0.36 + 0.16 = 0.52

P2: 0.6

Predictor 2 is the better predictor ;-)
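The coin example can be replayed numerically. The sketch below computes both means for the two predictors; the constant and function names are chosen here for the illustration.

```python
import math

TESTS = ["H"] * 6 + ["T"] * 4            # 6 heads and 4 tails
PREDICTOR_1 = {"H": 0.6, "T": 0.4}
PREDICTOR_2 = {"H": 1.0, "T": 0.0}


def arithmetic_mean(predictor, tests):
    """Average probability given to the observed outcomes."""
    return sum(predictor[t] for t in tests) / len(tests)


def geometric_mean(predictor, tests):
    """|S|-th root of the product of the probabilities of the outcomes."""
    return math.prod(predictor[t] for t in tests) ** (1 / len(tests))


arith_1 = arithmetic_mean(PREDICTOR_1, TESTS)
arith_2 = arithmetic_mean(PREDICTOR_2, TESTS)
geo_1 = geometric_mean(PREDICTOR_1, TESTS)
geo_2 = geometric_mean(PREDICTOR_2, TESTS)
```

With the arithmetic mean the predictor that never announces tails wins, whereas the geometric mean gives it a score of 0 because it considers four of the observed outcomes impossible.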

Zadar August 2010 39

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left(Pr_D(w) - Pr_{D'}(w)\right)^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset.

Strings appear with a frequency (or multiplicity):

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, for every transition, and for entering the initial state, such that:

the sum of what enters is equal to what exits, and

the sum of what halts is equal to what starts.

Zadar August 2010 43

Example

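[Figure: a DFFA with three states of halting frequencies 3, 1 and 2, transitions labelled b 5, b 3, a 1, a 2, a 5 and b 4, and initial frequency 6.]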

Zadar August 2010 44

From a DFFA to a DPFA

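[Figure: the same automaton as a DPFA, with halting probabilities 3/13, 1/6 and 2/7, transitions b 5/13, b 3/6, a 1/7, a 2/6, a 5/13 and b 4/7, and initial probability 6/6.]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies.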
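The normalisation is a one-liner per state. In the sketch below the frequencies are those of the example, but the target state of each transition is an assumption: the drawing is not recoverable, so the arrows were chosen here only so that what enters each state equals what exits. State names `q1`, `q2`, `q3` are likewise hypothetical.

```python
from fractions import Fraction

# Halting frequencies of the three states.
HALTING = {"q1": 3, "q2": 1, "q3": 2}
# (state, symbol) -> (assumed target state, frequency)
TRANSITIONS = {
    ("q1", "a"): ("q2", 5),
    ("q1", "b"): ("q3", 5),
    ("q2", "a"): ("q3", 2),
    ("q2", "b"): ("q1", 3),
    ("q3", "a"): ("q2", 1),
    ("q3", "b"): ("q1", 4),
}


def dffa_to_dpfa(halting, transitions):
    """Divide every frequency by the total frequency exiting its state
    (halting frequency plus outgoing transition frequencies)."""
    totals = dict(halting)
    for (src, _), (_, freq) in transitions.items():
        totals[src] = totals.get(src, 0) + freq
    final = {q: Fraction(halting.get(q, 0), totals[q]) for q in totals}
    delta = {
        (src, sym): (dst, Fraction(freq, totals[src]))
        for (src, sym), (dst, freq) in transitions.items()
    }
    return final, delta


final_p, delta_p = dffa_to_dpfa(HALTING, TRANSITIONS)
```

Exact fractions are used so that the result can be compared with the values 3/13, 1/6, 2/7 and so on; `Fraction` reduces 3/6 to 1/2.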

Zadar August 2010 45

From a DFA and a sample to a DFFA

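[Figure: the DFFA of the previous example, with halting frequencies 3, 1 and 2, transitions b 5, b 3, a 1, a 2, a 5 and b 4, and initial frequency 6.]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}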

Zadar August 2010 46

Note

Another sample may lead to the same DFFA.

Doing the same with an NFA is a much harder problem.

Typically what the Baum-Welch (EM) algorithm has been invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set.

The FTA is the smallest tree-like FFA consistent with the data.

Can be transformed into a PFA if needed.

Zadar August 2010 48

From the sample to the FTA

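S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S), the frequency prefix tree acceptor of S, with initial frequency 14; the root halts 3 times, its a-transition carries frequency 7 and its b-transition carries frequency 4.]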
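Building the FTA amounts to counting, for every prefix, how many strings pass through it, halt in it, and leave it by each symbol. The sketch below identifies each state with the prefix that reaches it; this representation and the function name are choices made for the illustration.

```python
from collections import Counter

# The multiset of the slides; the empty string stands for λ.
SAMPLE = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}


def build_fta(sample):
    """Return three counters: passages through each state (prefix),
    halting frequencies, and transition frequencies (prefix, symbol)."""
    state_freq, halt_freq, trans_freq = Counter(), Counter(), Counter()
    for word, count in sample.items():
        for i in range(len(word) + 1):
            state_freq[word[:i]] += count        # every prefix is visited
        halt_freq[word] += count                 # the string ends here
        for i, symbol in enumerate(word):
            trans_freq[(word[:i], symbol)] += count
    return state_freq, halt_freq, trans_freq


state_freq, halt_freq, trans_freq = build_fta(SAMPLE)
```

The counters reproduce the labels that survive from the figure: 14 strings enter the root, 7 of them continue with a and 4 with b.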

Zadar August 2010 49

Red Blue and White states

[Figure: an automaton over {a, b} whose states are coloured Red, Blue or White.]

Red states are confirmed states.

Blue states are the (non-Red) successors of the Red states.

White states are the others.

Same as with DFA and what RPNI does.

Zadar August 2010 50

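Merge and fold

Suppose we decide to merge one state with another.

[Figure: a DFFA fragment with initial frequency 100; state frequencies 60, 9, 10, 11, 6 and 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9 and b 6. The names of the two states are lost in the extraction.]

Zadar August 2010 51

Merge and fold

First disconnect the state being merged, and reconnect its incoming transition to the other state.

[Figure: the same fragment with the transition redirected.]

Zadar August 2010 52

Merge and fold

Then fold.

[Figure: the same fragment during folding.]

Zadar August 2010 53

Merge and fold

After folding.

[Figure: the folded automaton with initial frequency 100; frequencies of superposed states and transitions are added (for instance a 6 and a 4 become a 10, b 24 and b 6 become b 30).]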
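The two steps (redirect, then fold recursively while adding frequencies) can be sketched on a prefix tree. The code below is an illustration under simplifying assumptions, not the implementation behind the slides: states are identified with prefixes, the merged state is the root of a tree, and the function names are chosen here. The data is the 1000-string multisample used in the Alergia run of this deck, with the empty string standing for λ.

```python
SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3, "bbb": 6,
    "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2, "abab": 2, "abba": 2,
    "abbb": 1, "baaa": 2, "baab": 2, "baba": 1, "babb": 1, "bbaa": 1, "bbab": 1,
    "bbba": 1, "aaaaa": 1, "aaaab": 1, "aaaba": 1, "aabaa": 1, "aabab": 1,
    "aabba": 1, "abbaa": 1, "abbab": 1,
}


def build_fta(sample):
    """halt: state -> halting frequency; delta: (state, symbol) -> (target, frequency)."""
    halt, delta = {}, {}
    for word, count in sample.items():
        halt[word] = halt.get(word, 0) + count
        for i, symbol in enumerate(word):
            src, dst = word[:i], word[: i + 1]
            halt.setdefault(src, 0)
            _, freq = delta.get((src, symbol), (dst, 0))
            delta[(src, symbol)] = (dst, freq + count)
    return halt, delta


def fold(halt, delta, red, blue):
    """Superpose the tree rooted at `blue` onto `red`, adding frequencies."""
    halt[red] = halt.get(red, 0) + halt.pop(blue, 0)
    for key in [k for k in delta if k[0] == blue]:
        dst_blue, freq = delta.pop(key)
        symbol = key[1]
        if (red, symbol) in delta:
            dst_red, freq_red = delta[(red, symbol)]
            delta[(red, symbol)] = (dst_red, freq_red + freq)
            fold(halt, delta, dst_red, dst_blue)      # keep the automaton deterministic
        else:
            delta[(red, symbol)] = (dst_blue, freq)   # simply re-attach the subtree


def merge_and_fold(halt, delta, red, blue):
    """Redirect the transition entering `blue` towards `red`, then fold."""
    for key, (dst, freq) in list(delta.items()):
        if dst == blue:
            delta[key] = (red, freq)
    fold(halt, delta, red, blue)


halt, delta = build_fta(SAMPLE)
merge_and_fold(halt, delta, "", "a")      # merge state a into the root λ
```

After this first merge the root halts 660 times and carries an a-loop of frequency 341 and a b-transition of frequency 340, while state b halts 225 times; these are the labels that survive in the figure shown after the first merge of the run.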

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}

While Blue ≠ ∅ do

    choose q from Blue such that Freq(q) ≥ t0

    if ∃p ∈ Red : d(A_p, A_q) is small then A = merge_and_fold(A, p, q)

    else Red = Red ∪ {q}

    Blue = {δ(q, a) : q ∈ Red} − Red

Zadar August 2010 55

The real question: how do we decide if d(A_p, A_q) is small?

Use a distance…

Be able to compute this distance

If possible, update the computation easily

Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested.

The distance (L2 norm) between distributions can be exactly computed.

But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

Suppose we want to merge one state with another.

[Figure: the DFFA fragment of the merge-and-fold example, with initial frequency 100.]

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar.

[Figure: the same fragment, with the sub-automata rooted at the two candidate states shown for comparison.]

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

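Alergia's test

$D_1 \approx D_2$ if $\forall x,\ Pr_{D_1}(x) \approx Pr_{D_2}(x)$

Easier to test:

$Pr_{D_1}(\lambda) = Pr_{D_2}(\lambda)$

$\forall a \in \Sigma,\ Pr_{D_1}(a\Sigma^*) = Pr_{D_2}(a\Sigma^*)$

And do this recursively!

Of course, do it on frequencies.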

Zadar August 2010 61

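Hoeffding bounds

$\gamma = \left|\frac{f_1}{n_1} - \frac{f_2}{n_2}\right| < \left(\sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}}\right) \cdot \sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}$

γ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.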
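The test is a few lines of arithmetic. The sketch below implements the comparison for one pair of frequencies (the recursion over successor states is left out); the function names are chosen here, and the default α = 0.05 is the value used in the run that follows in the deck.

```python
import math


def hoeffding_bound(n1, n2, alpha=0.05):
    """Right-hand side of the test for sample sizes n1 and n2."""
    return (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) * math.sqrt(0.5 * math.log(2 / alpha))


def alergia_compatible(f1, n1, f2, n2, alpha=0.05):
    """True when the relative frequencies f1/n1 and f2/n2 are close enough."""
    gamma = abs(f1 / n1 - f2 / n2)
    return gamma < hoeffding_bound(n1, n2, alpha)


# Frequencies taken from the run of Alergia in this deck.
first_check = alergia_compatible(490, 1000, 128, 257)   # root against state a
second_check = alergia_compatible(660, 1341, 225, 340)  # merged root against state b
bound_second = hoeffding_bound(1341, 340)
```

The first pair is accepted, the second is rejected: the bound for sample sizes 1341 and 340 is about 0.111, which the difference between 660/1341 and 225/340 exceeds.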

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the blue states which have a frequency less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: FTA(S) for the learning multisample, with initial frequency 1000; the root halts 490 times and its a-transition carries frequency 257; state a halts 128 times and state b halts 170 times. The remaining labels are not recoverable from the extraction.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:

490/1000 with 128/257

257/1000 with 64/257

253/1000 with 65/257

All tests return true.

Zadar August 2010 66

[Figure: FTA(S) again, with initial frequency 1000 and the states λ and a selected for merging.]

Merge…

Zadar August 2010 67

[Figure: the automaton after merging and folding, with initial frequency 1000; the merged state halts 660 times, with transitions a 341 and b 340; state b halts 225 times, with transitions a 77 and b 38.]

And fold

Zadar August 2010 68

[Figure: the same automaton, with state b as the next candidate.]

Next: merge λ with b?

Zadar August 2010 69

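Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*:

660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand:

$\left(\sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}}\right) \cdot \sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} = 0.111$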

Zadar August 2010 70

[Figure: the same automaton (initial frequency 1000), with state b promoted to Red.]

Promotion

Zadar August 2010 71

[Figure: the automaton with the next pair of states selected for merging.]

Merge

Zadar August 2010 72

[Figure: the automaton after this merge has been folded (initial frequency 1000); visible transition labels include a 341, a 95, b 340, b 49 and a 11.]

And fold

Zadar August 2010 73

[Figure: the automaton with the next pair of states selected for merging (initial frequency 1000).]

Merge

Zadar August 2010 74

[Figure: the final DFFA, with two states of halting frequencies 698 and 302, transitions a 354, a 96, b 351 and b 49, and initial frequency 1000.]

And fold

[Figure: the same two-state automaton with its frequencies converted to probabilities.]

As a PFA

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time.

Alergia can identify DPFA in the limit with probability 1.

No good definition of Alergia's properties.

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string

Use norm $L_\infty$

Two distributions are different if there is a string with a very different probability.

Such a string is called μ-distinguishable.

Question becomes: is there a string x such that $|Pr_{A,q}(x) - Pr_{A,q'}(x)| > \mu$?
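On finite (for instance empirical) distributions the question reduces to a maximum over the strings seen. The sketch below makes that simplifying assumption; the function names, the two distributions and the value of μ are made up for the illustration and are not the procedure of DSAI itself.

```python
def linf_distance(p, q):
    """Largest absolute difference of probability over all strings seen."""
    return max((abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q)), default=0.0)


def mu_distinguishable(p, q, mu):
    """True when some string has probabilities differing by more than mu."""
    return linf_distance(p, q) > mu


# Hypothetical suffix distributions of two states (empty string = λ).
DIST_Q = {"": 0.5, "a": 0.5}
DIST_Q_PRIME = {"": 0.1, "a": 0.6, "b": 0.3}
distance = linf_distance(DIST_Q, DIST_Q_PRIME)
```

Here the empty string alone separates the two states: its probability differs by 0.4, so they are distinguishable for any μ below that value.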

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic

Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.





Zadar August 2010 9

41

31

21

21

21

32

a

b

a

b

a

b

43

21

PrA(abab)= 241

43

32

31

21

21

Zadar August 2010 10

01

03

a

b

a

b

a

b

065

03509

07

03

07

Zadar August 2010 11

41

31

21

21

21

32

b

b

a

a

a

b

43

21

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

41

31

21

21

21

32

b

a

b

43

21

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01] FP Q[01] P QQ [01]

Zadar August 2010 17

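What does a PFA do?

It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:

$\Pr_A(w) = \sum_{\pi_i \in \mathrm{paths}(w)} \Pr(\pi_i)$, where $\pi_i = q_{i_0} a_{i_1} q_{i_1} a_{i_2} \ldots a_{i_n} q_{i_n}$

$\Pr(\pi_i) = I_P(q_{i_0}) \cdot F_P(q_{i_n}) \cdot \prod_{j=1}^{n} \delta_P(q_{i_{j-1}}, a_{i_j}, q_{i_j})$

Note that if λ-transitions are allowed, the sum may be infinite.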

Zadar August 2010 18

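$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b} with probabilities 0.2, 0.1, 1, 0.45, 0.35, 0.4, 0.7, 0.3, 0.1 and 0.4]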

Zadar August 2010 19

- Non-deterministic PFA: many initial states / only one initial state
- A λ-PFA: a PFA with λ-transitions and perhaps many initial states
- DPFA: a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if $\Pr_A(\Sigma^*) = 1$ and $\forall x \in \Sigma^*,\ 0 \le \Pr_A(x) \le 1$

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

$\forall q \in Q,\ F_P(q) + \sum_{q' \in Q,\ a \in \Sigma} \delta_P(q, a, q') = 1$
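The local condition of the theorem can be checked mechanically. The following is a minimal sketch, assuming a PFA stored as two dictionaries (`final` for F_P and `delta` for δ_P); the function name and the two-state automaton are hypothetical, and the check covers only the sum condition, not the usefulness of the states.

```python
from collections import defaultdict
from math import isclose

def locally_normalised(states, final, delta, tol=1e-9):
    """Check that F_P(q) + sum of delta_P(q, a, q') equals 1 for every state q."""
    out_mass = defaultdict(float)
    for (q, _symbol, _target), p in delta.items():
        out_mass[q] += p  # accumulate the probability leaving q by transitions
    return all(
        isclose(final.get(q, 0.0) + out_mass[q], 1.0, abs_tol=tol)
        for q in states
    )

# Hypothetical two-state PFA over {a, b} (not taken from the slides)
states = ["q0", "q1"]
final = {"q0": 0.2, "q1": 0.5}
delta = {
    ("q0", "a", "q0"): 0.3,
    ("q0", "a", "q1"): 0.5,
    ("q1", "b", "q0"): 0.5,
}
print(locally_normalised(states, final, delta))
```

Lowering any single probability without compensating elsewhere makes the check fail for the corresponding state.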

Zadar August 2010 22

Equivalence between models

- Equivalence between PFA and HMM…
- But the HMM usually define distributions over each $\Sigma^n$

Zadar August 2010 23

[Figure: a football HMM whose states emit win, draw or lose, with fractional probabilities such as 1/4, 1/2 and 3/4]

A football HMM

Zadar August 2010 24

Equivalence between PFA with λ-transitions and PFA without λ-transitions (cdlh 2003; Hanneforth & cdlh 2009)

- Many initial states can be transformed into one initial state with λ-transitions
- λ-transitions can be removed in polynomial time
- Strategy: number the states; eliminate first the λ-loops, then the transitions with the highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA (folk theorem)

(and) You can't even tell in advance if you are in a good case or not (see Denis & Esposito 2004)

Zadar August 2010 26

Example

[Figure: a non-deterministic PFA over {a} with fractional probabilities such as 1/3, 1/2 and 2/3]

This distribution cannot be modelled by a DPFA

Zadar August 2010 27

What does a DPFA over Σ = {a} look like?

[Figure: a sequence of states linked by a-transitions]

And with this architecture you cannot generate the previous one

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case:

- Simple: apply definitions
- Technically, rather sum up logs: this is easier, safer and cheaper
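Summing logarithms instead of multiplying probabilities can be sketched as follows. This is only an illustration: the dictionary representation, the function name and the chain-shaped automaton fragment are assumptions, with probabilities chosen so that the product for "abb" is 0.7 × 0.9 × 0.65 × 0.3.

```python
import math

def dpfa_log_prob(word, initial, trans, final):
    """Return log Pr(word) in a DPFA, or -inf if the word has probability 0.

    trans maps (state, symbol) -> (next_state, probability);
    final maps state -> halting probability.
    """
    state, log_p = initial, 0.0
    for symbol in word:
        step = trans.get((state, symbol))
        if step is None or step[1] <= 0.0:
            return -math.inf  # no usable transition: the string cannot be generated
        state, p = step
        log_p += math.log(p)  # add logs instead of multiplying probabilities
    stop = final.get(state, 0.0)
    return log_p + math.log(stop) if stop > 0.0 else -math.inf

# Hypothetical fragment of a DPFA (topology assumed for the example)
trans = {
    ("q0", "a"): ("q1", 0.7),
    ("q1", "b"): ("q2", 0.9),
    ("q2", "b"): ("q3", 0.65),
}
final = {"q3": 0.3}
print(math.exp(dpfa_log_prob("abb", "q0", trans, final)))
```

Working in log space avoids underflow on long strings, which is presumably what "safer" refers to.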

Zadar August 2010 29

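$\Pr(aba) = 0.7 \times 0.9 \times 0.35 \times 0 = 0$

$\Pr(abb) = 0.7 \times 0.9 \times 0.65 \times 0.3 = 0.12285$

[Figure: a DPFA over {a, b} with probabilities 0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3 and 0.7]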

Zadar August 2010 30

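Non-deterministic case

$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b} with probabilities 0.2, 0.1, 1, 0.45, 0.35, 0.4, 0.7, 0.3, 0.1 and 0.4]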

Zadar August 2010 31

In the literature

- The computation of the probability of a string is by dynamic programming: $O(n^2 m)$
- 2 algorithms: Backward and Forward
- If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

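$A[i, j] = \Pr(q_i \mid a_1 \ldots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \ldots a_j$)

$A[i, 0] = I_P(q_i)$

$A[i, j+1] = \sum_{k \le |Q|} A[k, j] \cdot \delta_P(q_k, a_{j+1}, q_i)$

$\Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k, n] \cdot F_P(q_k)$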
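The recurrence can be sketched in a few lines. This is a minimal illustration for a PFA without λ-transitions, assuming dictionaries for I_P, F_P and δ_P; the function name and the small automaton are hypothetical. Only the current column of the table A is kept.

```python
def forward_probability(word, states, initial, delta, final):
    """Pr_A(word) for a lambda-free PFA, following the forward recurrence.

    initial: state -> I_P(q); final: state -> F_P(q);
    delta: (q, symbol, q') -> probability.
    """
    # alpha[q] plays the role of A[q, j] for the prefix read so far
    alpha = {q: initial.get(q, 0.0) for q in states}
    for symbol in word:
        nxt = {q: 0.0 for q in states}
        for (src, a, dst), p in delta.items():
            if a == symbol:
                nxt[dst] += alpha[src] * p
        alpha = nxt
    # Close every path with the halting probability of its last state
    return sum(alpha[q] * final.get(q, 0.0) for q in states)

# Hypothetical non-deterministic PFA over {a, b}
states = ["q0", "q1"]
initial = {"q0": 1.0}
final = {"q0": 0.2, "q1": 0.5}
delta = {
    ("q0", "a", "q0"): 0.3,
    ("q0", "a", "q1"): 0.5,
    ("q1", "b", "q0"): 0.5,
}
print(forward_probability("a", states, initial, delta, final))
```

For the string "a" the two paths contribute 0.3 × 0.2 and 0.5 × 0.5, which the recurrence sums without enumerating paths explicitly.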

Zadar August 2010 33

2 Distances

What for?

- Estimate the quality of a language model
- Have an indicator of the convergence of learning algorithms
- Construct kernels

Zadar August 2010 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over Σ*: D and D'

Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} \Pr_D(w) \, (\log \Pr_D(w) - \log \Pr_{D'}(w))$
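For distributions with finite support the divergence can be evaluated directly. The sketch below assumes each distribution is a dictionary from strings to probabilities and uses base-2 logarithms so that the result is in bits; both choices, and the function name, are assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """Sum over w of p(w) * (log2 p(w) - log2 q(w)); inf if q misses a string of p."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue  # terms with p(w) = 0 contribute nothing
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf  # p gives mass to a string that q forbids
        total += pw * (math.log2(pw) - math.log2(qw))
    return total

# Hypothetical distributions over two strings
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q))
```

The divergence is not symmetric, and it is infinite as soon as the second distribution assigns probability 0 to a string the first one can generate.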

Zadar August 2010 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relative to a test set (S).

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

Zadar August 2010 36

$\left( \prod_{w \in S} \Pr_D(w) \right)^{-\frac{1}{|S|}} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$

Problem if some probability is null
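The perplexity formula can be sketched as below, computed in log space to avoid underflow. The test set is assumed to be a list of strings (repetitions allowed) and `prob` is any function returning the model probability of a string; both names are hypothetical.

```python
import math

def perplexity(test_set, prob):
    """Inverse geometric mean of the model probabilities over the test set."""
    log_sum = 0.0
    for w in test_set:
        p = prob(w)
        if p <= 0.0:
            return math.inf  # a single null probability makes the perplexity infinite
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_set))

# Hypothetical model giving probability 0.5 to "a" and 0.25 to "b"
model = {"a": 0.5, "b": 0.25}
print(perplexity(["a", "b"], lambda w: model.get(w, 0.0)))
```

The early return makes the "null probability" problem explicit: in practice some smoothing of the model is needed before evaluating on unseen strings.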

Zadar August 2010 37

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply? (2)

Suppose we have two predictors for a coin toss:

- Predictor 1: heads 60%, tails 40%
- Predictor 2: heads 100%

The tests are H: 6, T: 4.

Arithmetic mean:

- P1: 0.36 + 0.16 = 0.52
- P2: 0.6

Predictor 2 is the better predictor ;-)
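The coin example can be replayed numerically. The sketch below (helper names are hypothetical) scores both predictors with the arithmetic mean and with the geometric mean: the arithmetic mean prefers Predictor 2, while the geometric mean gives it a score of 0 because it assigns probability 0 to tails.

```python
def arithmetic_mean(probs):
    return sum(probs) / len(probs)

def geometric_mean(probs):
    product = 1.0
    for p in probs:
        product *= p
    return product ** (1.0 / len(probs))

# Six heads and four tails, as in the example
tests = ["H"] * 6 + ["T"] * 4
predictor1 = {"H": 0.6, "T": 0.4}
predictor2 = {"H": 1.0, "T": 0.0}

# Probability each predictor assigns to every observed outcome
scores1 = [predictor1[t] for t in tests]
scores2 = [predictor2[t] for t in tests]

print(arithmetic_mean(scores1), arithmetic_mean(scores2))
print(geometric_mean(scores1), geometric_mean(scores2))
```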

Zadar August 2010 39

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} (\Pr_D(w) - \Pr_{D'}(w))^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset

Strings appear with a frequency (or multiplicity)

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

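[Figure: a three-state DFFA with initial frequency 6, halting frequencies 3, 1 and 2, and transition frequencies a:5, b:5, a:2, b:3, a:1 and b:4]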

Zadar August 2010 44

From a DFFA to a DPFA

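[Figure: the same automaton as a DPFA, with initial probability 6/6, halting probabilities 3/13, 1/6 and 2/7, and transition probabilities a:5/13, b:5/13, a:2/6, b:3/6, a:1/7 and b:4/7]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies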
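The normalisation step can be sketched as follows. The frequencies are those of the example (halting frequencies 3, 1, 2 and the six transition frequencies), but the target state of each transition is an assumption chosen so that what enters each state equals what exits; the dictionary layout and function name are hypothetical too.

```python
from fractions import Fraction

def dffa_to_dpfa(final_freq, trans_freq):
    """Divide every frequency by the total frequency exiting its state.

    final_freq: state -> number of strings halting there;
    trans_freq: (state, symbol) -> (target, frequency).
    Returns (final_prob, trans_prob) as exact fractions.
    """
    total = dict(final_freq)
    for (q, _a), (_target, f) in trans_freq.items():
        total[q] = total.get(q, 0) + f
    final_prob = {q: Fraction(f, total[q]) for q, f in final_freq.items()}
    trans_prob = {
        (q, a): (t, Fraction(f, total[q])) for (q, a), (t, f) in trans_freq.items()
    }
    return final_prob, trans_prob

final_freq = {1: 3, 2: 1, 3: 2}
trans_freq = {  # targets are assumed, frequencies come from the example
    (1, "a"): (2, 5), (1, "b"): (3, 5),
    (2, "a"): (3, 2), (2, "b"): (1, 3),
    (3, "a"): (2, 1), (3, "b"): (1, 4),
}
final_prob, trans_prob = dffa_to_dpfa(final_freq, trans_freq)
print(final_prob[1], trans_prob[(3, "b")][1])
```

Using `Fraction` keeps the relative frequencies exact, so each state's probabilities sum to exactly 1.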

Zadar August 2010 45

From a DFA and a sample to a DFFA

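[Figure: a three-state DFFA with initial frequency 6, halting frequencies 3, 1 and 2, and transition frequencies a:5, b:5, a:2, b:3, a:1 and b:4]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}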

Zadar August 2010 46

Note

- Another sample may lead to the same DFFA
- Doing the same with an NFA is a much harder problem
- Typically what the Baum-Welch (EM) algorithm has been invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

- The data is a multiset
- The FTA is the smallest tree-like FFA consistent with the data
- Can be transformed into a PFA if needed

Zadar August 2010 48

From the sample to the FTA

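S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S), a tree with root frequency 14 whose transitions carry frequencies such as a:7, a:6, a:4, a:2, b:4, b:4, b:2 and several transitions of frequency 1]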
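Building the FTA from the multiset can be sketched as below. States are identified with prefixes, which is a common convention assumed here rather than taken from the slides; the function and variable names are hypothetical. The frequency of the transition from prefix u reading symbol a is then `passing[u + a]`.

```python
from collections import defaultdict

def build_fta(sample):
    """Frequency prefix tree acceptor of a multiset {string: multiplicity}.

    Returns (passing, halting): passing[u] counts the strings having u as a
    prefix, halting[u] counts the strings equal to u.
    """
    passing = defaultdict(int)
    halting = defaultdict(int)
    for word, count in sample.items():
        halting[word] += count
        for i in range(len(word) + 1):
            passing[word[:i]] += count  # every prefix of the word is visited
    return dict(passing), dict(halting)

# The sample of the example; the empty string stands for lambda
S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
passing, halting = build_fta(S)
print(passing[""], passing["a"], passing["aa"], passing["b"])
```

The root is passed by all 14 strings, 7 of them start with a and 4 with b, which matches the frequencies recoverable from the figure.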

Zadar August 2010 49

Red Blue and White states

[Figure: an automaton over {a, b} with its states coloured Red, Blue or White]

- Red states are confirmed states
- Blue states are the (non-Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

Suppose we decide to merge with state

[Figure: a DFFA with state frequencies 100, 60, 11, 10, 9, 6 and 4, and transition frequencies a:26, a:10, a:6, a:4, a:4, b:24, b:24, b:9 and b:6]

Zadar August 2010 51

Merge and fold

First disconnect and reconnect to

[Figure: the same DFFA with the incoming transition of the merged state redirected]

Zadar August 2010 52

Merge and fold

Then fold

[Figure: the same DFFA while the subtree of the merged state is being folded]

Zadar August 2010 53

Merge and fold

After folding

[Figure: the resulting DFFA, with transition frequencies a:26, a:4, a:10, a:10, b:24, b:9 and b:30]

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

Zadar August 2010 55

The real question: how do we decide if d(A_p, A_q) is small?

- Use a distance…
- Be able to compute this distance
- If possible, update the computation easily
- Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

- If the two distributions are known, equality can be tested
- Distance (L2 norm) between distributions can be exactly computed
- But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

Suppose we want to merge with state

[Figure: a DFFA with state frequencies 100, 60, 11, 10, 9, 6 and 4, and transition frequencies a:26, a:10, a:6, a:4, a:4, b:24, b:24, b:9 and b:6]

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar

[Figure: the DFFA together with the two sub-automata being compared]

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

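Alergia's test

$D_1 \equiv D_2$ if $\forall x,\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$

Easier to test:

- $\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$
- $\forall a \in \Sigma,\ \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$

And do this recursively. Of course, do it on frequencies.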

Zadar August 2010 61

Hoeffding bounds

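$\gamma = \left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right|$

$\gamma < \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}}$

γ indicates if the relative frequencies $f_1/n_1$ and $f_2/n_2$ are sufficiently close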
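The test can be written as a small function. This is a sketch with a hypothetical function name; the frequencies in the usage lines are the ones compared in the run of Alergia in these slides, with α = 0.05.

```python
import math

def alergia_compatible(f1, n1, f2, n2, alpha):
    """True if the relative frequencies f1/n1 and f2/n2 are sufficiently close."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(
        0.5 * math.log(2 / alpha)
    )
    return gamma < bound

# Comparisons taken from the run: root against state a, then root against state b
print(alergia_compatible(490, 1000, 128, 257, 0.05))
print(alergia_compatible(660, 1341, 225, 340, 0.05))
```

A smaller α widens the bound and therefore makes merges more likely; the bound also widens when either state has been visited only a few times.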

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the Blue states whose frequency is less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: FTA(S) for the learning multisample, with root frequency 1000 and the frequency of every state and transition]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:

- 490/1000 with 128/257
- 257/1000 with 64/257
- 253/1000 with 65/257

All tests return true

Zadar August 2010 66

Merge…

[Figure: FTA(S) with root frequency 1000, at the moment of the merge]

Zadar August 2010 67

And fold

[Figure: the DFFA after folding, with initial frequency 1000, halting frequency 660 at the root, a loop a:341 and a transition b:340 to a state with halting frequency 225]

Zadar August 2010 68

Next merge: λ with b

[Figure: the same DFFA, with initial frequency 1000]

Zadar August 2010 69

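Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*:

660/1341 and 225/340 are different (giving γ = 0.162)

On the other hand, $\left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} = 0.111$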
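As a worked check of this comparison, the two sides of the test can be evaluated by hand, assuming α = 0.05 as set for this run and rounding to three decimals. Note that evaluating the difference of the two fractions directly gives about 0.170 rather than the quoted 0.162; either value exceeds the bound, so the outcome is unchanged.

```latex
\begin{align*}
\gamma &= \left| \frac{660}{1341} - \frac{225}{340} \right|
        \approx \left| 0.492 - 0.662 \right| \approx 0.170 \\
\text{bound} &= \left( \frac{1}{\sqrt{1341}} + \frac{1}{\sqrt{340}} \right)
        \sqrt{\frac{1}{2} \ln \frac{2}{0.05}}
        \approx (0.0273 + 0.0542) \times 1.358 \approx 0.111 \\
\gamma &> \text{bound} \quad \Rightarrow \quad \text{the two states are not merged}
\end{align*}
```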

Zadar August 2010 70

Promotion

[Figure: the same DFFA, in which the state reached by b is promoted]

Zadar August 2010 71

Merge

[Figure: the current DFFA at the moment of the next merge]

Zadar August 2010 72

And fold

[Figure: the DFFA after folding, with initial frequency 1000 and transitions a:341, a:95, b:340, b:49 and a:11 among others]

Zadar August 2010 73

Merge

[Figure: the same DFFA at the moment of the next merge, with initial frequency 1000]

Zadar August 2010 74

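And fold

[Figure: the final two-state DFFA with initial frequency 1000, halting frequencies 698 and 302, and transition frequencies a:354, a:96, b:351 and b:49]

As a PFA

[Figure: the same two-state automaton with its frequencies converted to probabilities]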

Zadar August 2010 75

Conclusion and logic

- Alergia builds a DFFA in polynomial time
- Alergia can identify DPFA in the limit with probability 1
- No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

- Using a distinguishable string
- Use norm $L_\infty$
- Two distributions are different if there is a string with a very different probability
- Such a string is called μ-distinguishable
- Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

- MDL-inspired heuristic
- Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

- A good candidate to learn NFA is DEES
- Never has been a challenge, so the state of the art is still unclear
- Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state

[Figure: a state with frequency 3000 and an outgoing transition a labelled 1501/3000]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value
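The search can be sketched as a descent of the tree using the «mean» operation. The code below is an illustration only: it assumes the estimate lies in [0, 1] and that a tolerance `eps` is supplied by the caller (how to choose it is not specified here); the function name is hypothetical.

```python
from fractions import Fraction

def simplest_fraction(x, eps):
    """Descend the Stern-Brocot tree until a fraction lies within eps of x.

    Assumes 0 <= x <= 1 and eps > 0.
    """
    lo_n, lo_d = 0, 1  # left bound 0/1
    hi_n, hi_d = 1, 1  # right bound 1/1
    for n, d in ((lo_n, lo_d), (hi_n, hi_d)):
        if abs(x - n / d) <= eps:
            return Fraction(n, d)
    while True:
        med_n, med_d = lo_n + hi_n, lo_d + hi_d  # the "mean" of the two bounds
        if abs(x - med_n / med_d) <= eps:
            return Fraction(med_n, med_d)
        if x < med_n / med_d:
            hi_n, hi_d = med_n, med_d  # continue in the left subtree
        else:
            lo_n, lo_d = med_n, med_d  # continue in the right subtree

# The estimate 1501/3000 is replaced by a simpler fraction
print(simplest_fraction(1501 / 3000, 0.01))
```

With a tolerance of 0.01 the estimate 1501/3000 is returned as 1/2, which is the kind of simple value the idea aims for.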

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have

$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log \log n}{n}}$, with $\lambda > 1$


Page 11: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 11

41

31

21

21

21

32

b

b

a

a

a

b

43

21

PFA Probabilistic Finite (state) Automaton

Zadar August 2010 12

41

31

21

21

21

32

b

a

b

43

21

-PFA Probabilistic Finite (state) Automaton with -transitions

Zadar August 2010 13

How useful are these automata They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01] FP Q[01] P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i) i=qi0ai1qi1ai2hellipainqin

Pr(i)=IP(qi0)FP(qin) aij P (qij-1aijqij)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between modelsEquivalence between PFA and

HMMhellipBut the HMM usually define

distributions over each n

Zadar August 2010 23

41

21

win draw lose win draw lose win draw lose

21

21

34

12

14

14

34

14

14

41

41

41

41

41

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

31

21

21

21

21

32

a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

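Non-deterministic case

$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: the non-deterministic PFA over {a, b} used in this computation; the string aba is read along two paths.]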

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming: $O(n^2 m)$

2 algorithms: Backward and Forward

If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

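$A[i,j] = \Pr(q_i \mid a_1 \ldots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \ldots a_j$)

$A[i,0] = I_P(q_i)$

$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot P(q_k, a_{j+1}, q_i)$

$\Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$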
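The recurrence above translates almost line for line into code. The following Python sketch assumes a PFA without λ-transitions stored as dictionaries; `forward_probability` and the small automaton are illustrative names and values, not from the slides. Only the current column of the table A is kept.

```python
def forward_probability(word, states, init_prob, trans, final_prob):
    """Pr(word) in a (possibly non-deterministic) PFA without lambda-transitions.

    trans maps (q, a, q') -> P(q, a, q').
    """
    # A[q]: probability of having read the prefix so far and being in state q.
    A = {q: init_prob.get(q, 0.0) for q in states}
    for a in word:
        A = {
            q2: sum(A[q1] * trans.get((q1, a, q2), 0.0) for q1 in states)
            for q2 in states
        }
    return sum(A[q] * final_prob.get(q, 0.0) for q in states)

# Hypothetical two-state PFA over {a, b}.
states = [0, 1]
init_prob = {0: 1.0}
trans = {(0, "a", 0): 0.3, (0, "a", 1): 0.1, (0, "b", 1): 0.4, (1, "b", 0): 0.5}
final_prob = {0: 0.2, 1: 0.5}
print(forward_probability("ab", states, init_prob, trans, final_prob))
```

Each letter costs one pass over all pairs of states, so this sketch runs in time proportional to the length of the string times the square of the number of states.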

Zadar August 2010 33

2 Distances: What for?

Estimate the quality of a language model

Have an indicator of the convergence of learning algorithms

Construct kernels

Zadar August 2010 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over $\Sigma^*$: D and D'

Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} \Pr_D(w) \cdot \left( \log \Pr_D(w) - \log \Pr_{D'}(w) \right)$
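A small Python sketch of the divergence, under the simplifying assumption that both distributions have finite support and are given as dictionaries (the slides sum over all strings). The function name `kl_divergence` and the example distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits for distributions given as dicts over a finite support."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue              # 0 * log 0 is taken as 0
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf       # p gives mass to a string that q excludes
        total += pw * (math.log2(pw) - math.log2(qw))
    return total

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ: the divergence is not symmetric, so it is not a distance in the strict sense.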

Zadar August 2010 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relatively to a test set (S)

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

$\left( \prod_{w \in S} \Pr_D(w) \right)^{-1/|S|} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$

Problem if some probability is null
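A minimal Python sketch of the perplexity of a test set, computed through logs. The function name `perplexity`, the callable `prob` and the toy model are assumptions for this example. A null probability is reported as infinite perplexity, which is the problem noted above.

```python
import math

def perplexity(sample, prob):
    """Inverse geometric mean of prob(w) over the strings w of the test sample."""
    log_sum = 0.0
    for w in sample:
        p = prob(w)
        if p == 0.0:
            return math.inf       # a single null probability ruins the score
        log_sum += math.log(p)
    return math.exp(-log_sum / len(sample))

# Hypothetical model giving probabilities to two strings only.
model = {"a": 0.5, "b": 0.125}
print(perplexity(["a", "b"], lambda w: model.get(w, 0.0)))
```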

Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss

Predictor 1: heads 60%, tails 40%

Predictor 2: heads 100%

The tests are H: 6, T: 4

Arithmetic mean: P1: 0.36 + 0.16 = 0.52; P2: 0.6

Predictor 2 is the better predictor ;-)
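The coin example can be replayed in a few lines of Python. This sketch uses the figures of the slide (six heads, four tails); the helper names are illustrative, and `math.prod` requires Python 3.8 or later.

```python
import math

tests = ["H"] * 6 + ["T"] * 4
predictor1 = {"H": 0.6, "T": 0.4}
predictor2 = {"H": 1.0, "T": 0.0}

def arithmetic_mean(pred):
    """Average probability the predictor gave to the observed outcomes."""
    return sum(pred[t] for t in tests) / len(tests)

def geometric_mean(pred):
    """|tests|-th root of the product of the probabilities of the outcomes."""
    return math.prod(pred[t] for t in tests) ** (1 / len(tests))

print(arithmetic_mean(predictor1), arithmetic_mean(predictor2))
print(geometric_mean(predictor1), geometric_mean(predictor2))
```

With the arithmetic mean the overconfident predictor 2 wins; with the geometric mean it scores 0, because it gave probability 0 to outcomes that did occur.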

Zadar August 2010 39

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left( \Pr_D(w) - \Pr_{D'}(w) \right)^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA: Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset

Strings appear with a frequency (or multiplicity)

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

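[Figure: a DFFA with three states. 6 strings enter the initial state; the states have final frequencies 3, 1 and 2; the transition frequencies are a 5 and b 5, a 2 and b 3, a 1 and b 4.]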

Zadar August 2010 44

From a DFFA to a DPFA

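[Figure: the same automaton as a DPFA. Initial probability 6/6; final probabilities 3/13, 1/6 and 2/7; transition probabilities a 5/13 and b 5/13, a 2/6 and b 3/6, a 1/7 and b 4/7.]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies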
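The normalisation step can be sketched in Python as follows. The frequencies are those readable on the slide; the state names `q1`, `q2`, `q3`, the function name `dffa_to_dpfa` and the dictionary layout are assumptions, and transition targets are left out because they play no role in the division.

```python
from fractions import Fraction

def dffa_to_dpfa(final_freq, trans_freq):
    """Turn frequencies into probabilities, state by state.

    final_freq maps q -> number of strings ending in q;
    trans_freq maps (q, a) -> number of times the transition is used.
    """
    total = dict(final_freq)
    for (q, _a), f in trans_freq.items():
        total[q] = total.get(q, 0) + f          # sum of what exits state q
    final_prob = {q: Fraction(final_freq.get(q, 0), total[q]) for q in total}
    trans_prob = {(q, a): Fraction(f, total[q]) for (q, a), f in trans_freq.items()}
    return final_prob, trans_prob

final_freq = {"q1": 3, "q2": 1, "q3": 2}
trans_freq = {("q1", "a"): 5, ("q1", "b"): 5,
              ("q2", "a"): 2, ("q2", "b"): 3,
              ("q3", "a"): 1, ("q3", "b"): 4}
final_prob, trans_prob = dffa_to_dpfa(final_freq, trans_freq)
print(final_prob, trans_prob)
```

`Fraction` keeps the values exact, so 3/6 is printed in lowest terms as 1/2.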

Zadar August 2010 45

From a DFA and a sample to a DFFA

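[Figure: the same DFFA as in the previous example (6 strings entering; final frequencies 3, 1 and 2), obtained by parsing the sample S through the DFA and counting.]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}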

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with a NFA is a much harder problem

Typically what algorithm Baum-Welch (EM) has been invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set

The FTA is the smallest tree-like FFA consistent with the data

Can be transformed into a PFA if needed

Zadar August 2010 48

From the sample to the FTA

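S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S). 14 strings enter the root, which has final frequency 3; the transitions leaving the root are a 7 and b 4, followed by a 6, a 4, b 4 and so on along the prefixes of S.]

FTA(S)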
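Building the frequency prefix tree amounts to counting, for each prefix, how many strings of the multiset pass through it and how many end there. A minimal Python sketch, where states are identified with prefixes and the names `build_fpta`, `through` and `final` are illustrative:

```python
from collections import defaultdict

def build_fpta(sample):
    """Build a frequency prefix tree from a multiset given as {string: count}.

    States are prefixes. Returns (through, final): how many strings pass
    through each prefix and how many end exactly there.
    """
    through = defaultdict(int)
    final = defaultdict(int)
    for w, count in sample.items():
        final[w] += count
        for i in range(len(w) + 1):
            through[w[:i]] += count   # every prefix of w, including "" and w
    return dict(through), dict(final)

# The learning sample of the slides; "" stands for the empty string.
S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
through, final = build_fpta(S)
print(through[""], through["a"], through["aa"], through["b"], final[""])
```

The frequency of the transition entering a prefix u is `through[u]`, so the a-transition from the root carries 7 and the b-transition carries 4.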

Zadar August 2010 49

Red Blue and White states

[Figure: an automaton over {a, b} whose states are coloured Red, Blue or White.]

Red states are confirmed states

Blue states are the (non Red) successors of the Red states

White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

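[Figure: a DFFA with 100 strings entering the initial state; two of its states are marked as candidates for merging.]

Suppose we decide to merge the two marked states

Zadar August 2010 51

Merge and fold

[Figure: the same DFFA; the transition entering the first marked state is redirected to the second.]

First disconnect the first state and reconnect to the second

Zadar August 2010 52

Merge and fold

[Figure: the same DFFA; the subtree below the disconnected state is folded into the automaton.]

Then fold

Zadar August 2010 53

Merge and fold

[Figure: the DFFA after folding; frequencies of merged states and transitions have been added, for example b 24 and b 6 become b 30.]

After folding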

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}

While Blue ≠ ∅ do

    choose q from Blue such that Freq(q) ≥ t_0

    if ∃p ∈ Red : d(A_p, A_q) is small then A = merge_and_fold(A, p, q)

    else Red = Red ∪ {q}

    Blue = {δ(q, a) : q ∈ Red} − Red

Zadar August 2010 55

The real question

How do we decide if d(A_p, A_q) is small?

Use a distance…

Be able to compute this distance

If possible, update the computation easily

Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

[Figure: the DFFA of the merge-and-fold example, with 100 strings entering the initial state and two states marked.]

Suppose we want to merge the two marked states

Zadar August 2010 58

Taking decisions

[Figure: the same DFFA, with the two sub-automata rooted at the marked states shown side by side.]

Yes, if the two distributions induced are similar

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

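Alergia's test

$D_1 \approx D_2$ if $\forall x,\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$

Easier to test:

$\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$

$\forall a \in \Sigma,\ \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$

And do this recursively

Of course, do it on frequencies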

Zadar August 2010 61

Hoeffding bounds

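$\gamma \leftarrow \left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right|$

$\gamma < \left( \sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}}$

γ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close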
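The bound is easy to evaluate numerically. Below is a minimal Python sketch of the compatibility test for a single pair of frequencies; the function name `alergia_compatible` is an assumption, and the counts in the usage lines are the ones compared in the run of Alergia later in these slides, with α = 0.05.

```python
import math

def alergia_compatible(f1, n1, f2, n2, alpha):
    """Hoeffding-style test: True if f1/n1 and f2/n2 are close enough."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / alpha))
    return gamma < bound

print(alergia_compatible(490, 1000, 128, 257, 0.05))   # frequencies are close
print(alergia_compatible(660, 1341, 225, 340, 0.05))   # frequencies differ too much
```

In the full algorithm this test is applied recursively to the final frequencies and to each outgoing symbol of the two states being compared; the sketch covers one comparison only.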

Zadar August 2010 62

A run of Alergia

Our learning multisample

S = {λ (490), a (128), b (170), aa (31), ab (42), ba (38), bb (14), aaa (8), aab (10), aba (10), abb (4), baa (9), bab (4), bba (3), bbb (6), aaaa (2), aaab (2), aaba (3), aabb (2), abaa (2), abab (2), abba (2), abbb (1), baaa (2), baab (2), baba (1), babb (1), bbaa (1), bbab (1), bbba (1), aaaaa (1), aaaab (1), aaaba (1), aabaa (1), aabab (1), aabba (1), abbaa (1), abbab (1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t_0.

Note that for the Blue states that have a frequency less than the threshold, a special merging operation takes place

Zadar August 2010 64

[Figure: FTA(S) for the learning multisample, with 1000 strings entering the root; the root has final frequency 490 and its a-transition is used 257 times.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*

490/1000 with 128/257

257/1000 with 64/257

253/1000 with 65/257

All tests return true

Zadar August 2010 66

[Figure: FTA(S) again, with the states λ and a about to be merged.]

Merge…

Zadar August 2010 67

[Figure: the automaton while the subtree of a is being folded into the root; the root now has final frequency 660, with transitions a 341 and b 340.]

And fold

Zadar August 2010 68

[Figure: the automaton after folding, with 1000 strings entering the root; state b has final frequency 225.]

Next merge λ with b

Zadar August 2010 69

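Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*

660/1341 and 225/340 are different (giving γ ≈ 0.170)

On the other hand, $\left( \sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} = 0.111$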

Zadar August 2010 70

[Figure: the same automaton; state b cannot be merged with λ and becomes Red.]

Promotion

Zadar August 2010 71

[Figure: the same automaton, with the next pair of states selected for merging.]

Merge

Zadar August 2010 72

[Figure: the automaton while folding; transition frequencies now include a 341, b 340, a 95 and b 49.]

And fold

Zadar August 2010 73

[Figure: the automaton after folding, with the next pair of states selected for merging.]

Merge

Zadar August 2010 74

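[Figure: the final DFFA, with 1000 strings entering the initial state: two states with final frequencies 698 and 302, and transitions a 354, b 351, a 96 and b 49.]

And fold

[Figure: the same automaton as a PFA, with labels 698, 302, a 354, a 096, b 351 and b 049.]

As a PFA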

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string

Use norm $L_\infty$

Two distributions are different if there is a string with a very different probability

Such a string is called μ-distinguishable

Question becomes: is there a string x such that $\left| \Pr_{A,q}(x) - \Pr_{A,q'}(x) \right| > \mu$?

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995

PAC learnability results, in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic

Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES

Never has been a challenge, so state of the art is still unclear

Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state

[Figure: a state with frequency 3000 and an outgoing a-transition labelled 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation

$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Zadar August 2010 86

Idea: Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value
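The search can be sketched in Python as a descent from the two roots 0/1 and 1/0, taking the mean (mediant) of the current bounds at each step. The function name `simplest_fraction_near` and the tolerance `eps` are assumptions of this sketch; the slides do not say here how close the approximation has to be.

```python
from fractions import Fraction

def simplest_fraction_near(x, eps):
    """Descend the Stern-Brocot tree from 0/1 and 1/0 until within eps of x.

    Assumes x >= 0 and eps > 0. Returns the first mediant m with |m - x| <= eps.
    """
    lo_n, lo_d = 0, 1     # left bound 0/1
    hi_n, hi_d = 1, 0     # right bound 1/0, standing for infinity
    while True:
        med = Fraction(lo_n + hi_n, lo_d + hi_d)
        if abs(med - x) <= eps:
            return med
        if med < x:
            lo_n, lo_d = med.numerator, med.denominator
        else:
            hi_n, hi_d = med.numerator, med.denominator

# The estimate from the slides: an edge used 1501 times out of 3000.
print(simplest_fraction_near(Fraction(1501, 3000), Fraction(1, 100)))
```

With a tolerance of 1/100 the estimate 1501/3000 is replaced by 1/2, the simple value one would expect the target automaton to use.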

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n we have:

$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log \log n}{n}}, \quad \forall \lambda > 1$


Zadar August 2010 12

[Figure: a PFA over {a, b} with λ-transitions; probabilities are fractions such as 1/4, 1/3, 1/2, 2/3 and 3/4.]

λ-PFA: Probabilistic Finite (state) Automaton with λ-transitions

Zadar August 2010 13

How useful are these automata?

They can define a distribution over $\Sigma^*$

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references

The HMM literature

Azaria Paz 1973: Introduction to probabilistic automata

Chapter 5 of my book

Probabilistic Finite-State Machines, Vidal, Thollard, cdlh, Casacuberta & Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over $\Sigma^*$

$0 \le \Pr_D(w) \le 1$

$\sum_{w \in \Sigma^*} \Pr_D(w) = 1$

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a $\langle Q, \Sigma, I_P, F_P, P \rangle$

Q: set of states

$I_P : Q \to [0;1]$

$F_P : Q \to [0;1]$

$P : Q \times \Sigma \times Q \to [0;1]$

Zadar August 2010 17

What does a PFA do?

It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities

$\Pr_A(w) = \sum_{\pi_i \in paths(w)} \Pr(\pi_i)$

$\pi_i = q_{i_0} a_{i_1} q_{i_1} a_{i_2} \ldots a_{i_n} q_{i_n}$

$\Pr(\pi_i) = I_P(q_{i_0}) \cdot F_P(q_{i_n}) \cdot \prod_{a_{i_j}} P(q_{i_{j-1}}, a_{i_j}, q_{i_j})$

Note that if λ-transitions are allowed the sum may be infinite

Page 13: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 13

How useful are these automata They can define a distribution over

They do not tell us if a string belongs to a language

They are good candidates for grammar induction

There is (was) not that much written theory

Zadar August 2010 14

Basic references The HMM literature Azaria Paz 1973 Introduction to

probabilistic automata Chapter 5 of my book Probabilistic Finite-State Machines

Vidal Thollard cdlh Casacuberta amp Carrasco

Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over

0PrD(w)1

w PrD(w)=1

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a ltQ IP FP Pgt Q set of states IP Q[01] FP Q[01] P QQ [01]

Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i) i=qi0ai1qi1ai2hellipainqin

Pr(i)=IP(qi0)FP(qin) aij P (qij-1aijqij)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between modelsEquivalence between PFA and

HMMhellipBut the HMM usually define

distributions over each n

Zadar August 2010 23

41

21

win draw lose win draw lose win draw lose

21

21

34

12

14

14

34

14

14

41

41

41

41

41

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

31

21

21

21

21

32

a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA
[Figure: the same DFFA as in the Example slide (6 strings entering; halting frequencies 3, 1 and 2).]
S = {λ, aaaa, ab, babb, bbbb, bbbbaa}

Zadar August 2010 46

Note
Another sample may lead to the same DFFA.
Doing the same with a NFA is a much harder problem.
Typically what the Baum-Welch (EM) algorithm has been invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set
The FTA is the smallest tree-like FFA consistent with the data
Can be transformed into a PFA if needed

Zadar August 2010 48

From the sample to the FTA
S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}
[Figure: FTA(S), the frequency prefix tree of S. 14 strings enter the root, where 3 halt; a:7 and b:4 leave the root.]
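Building the frequency prefix tree amounts to counting, for every prefix, how many strings of the multiset pass through it and how many stop there. The sketch below identifies states with prefixes; the function name `build_fta` and the two-counter representation are choices made for this example.

```python
from collections import Counter

def build_fta(sample):
    """Return (passing, halting) counters indexed by prefix.

    sample maps each string to its multiplicity. passing[u] counts the
    strings that go through state u, halting[u] those that end there.
    The frequency of the transition from u by symbol a is passing[u + a].
    """
    passing, halting = Counter(), Counter()
    for w, count in sample.items():
        halting[w] += count
        for i in range(len(w) + 1):
            passing[w[:i]] += count
    return passing, halting

S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
passing, halting = build_fta(S)
```

With this sample, 14 strings enter the root, 7 follow the a-transition and 4 the b-transition, as in the figure.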

Zadar August 2010 49

Red, Blue and White states
[Figure: an automaton with its states coloured Red, Blue and White.]
- Red states are confirmed states
- Blue states are the (non Red) successors of the Red states
- White states are the others
Same as with DFA and what RPNI does

Zadar August 2010 50

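Merge and fold
Suppose we decide to merge two states (both are marked in the figure).
[Figure: the DFFA before the merge; 100 strings enter the initial state.]

Zadar August 2010 51

Merge and fold
First disconnect the state being merged from its predecessor, and reconnect the incoming transition to the state it is merged with.
[Figure: the same DFFA with the transition redirected.]

Zadar August 2010 52

Merge and fold
Then fold
[Figure: the disconnected subtree is folded into the automaton, adding up frequencies.]

Zadar August 2010 53

Merge and fold
after folding
[Figure: the resulting DFFA; 100 strings still enter the initial state.]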

Zadar August 2010 54

State merging algorithm
A = FTA(S); Blue = {δ(qI, a) : a ∈ Σ}; Red = {qI}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(Ap, Aq) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

Zadar August 2010 55

The real question
How do we decide if d(Ap, Aq) is small?
Use a distance…
Be able to compute this distance
If possible update the computation easily
Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar
If the two distributions are known, equality can be tested
Distance (L2 norm) between distributions can be exactly computed
But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions
Suppose we want to merge two states (both are marked in the figure).
[Figure: the DFFA of the merge and fold example; 100 strings enter the initial state.]

Zadar August 2010 58

Taking decisions
Yes, if the two distributions induced are similar.
[Figure: the same DFFA, with the two sub-automata rooted at the candidate states shown side by side.]

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

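Alergia's test
$D_1 \approx D_2$ if $\forall x \; \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$
Easier to test:
$\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$
$\forall a \in \Sigma \; \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$
And do this recursively!
Of course, do it on frequencies.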

Zadar August 2010 61

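Hoeffding bounds
$\gamma \leftarrow \left|\dfrac{f_1}{n_1} - \dfrac{f_2}{n_2}\right|$
$\gamma < \sqrt{\dfrac{1}{2}\ln\dfrac{2}{\alpha}} \left(\dfrac{1}{\sqrt{n_1}} + \dfrac{1}{\sqrt{n_2}}\right)$
γ indicates if the relative frequencies $f_1/n_1$ and $f_2/n_2$ are sufficiently close.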
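The Hoeffding test on two relative frequencies can be written directly from the bound. This is a minimal sketch: the function names `hoeffding_bound` and `sufficiently_close` are assumptions made for the example.

```python
import math

def hoeffding_bound(n1, n2, alpha):
    """sqrt(1/2 * ln(2/alpha)) * (1/sqrt(n1) + 1/sqrt(n2))."""
    return math.sqrt(0.5 * math.log(2.0 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))

def sufficiently_close(f1, n1, f2, n2, alpha):
    """True when |f1/n1 - f2/n2| is below the Hoeffding bound."""
    gamma = abs(f1 / n1 - f2 / n2)
    return gamma < hoeffding_bound(n1, n2, alpha)
```

For example, with alpha = 0.05 the frequencies 490/1000 and 128/257 pass the test, whereas 660/1341 and 225/340 do not. A smaller alpha widens the bound and therefore accepts more merges.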

Zadar August 2010 62

A run of Alergia
Our learning multisample:
S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.
Note that for the Blue states which have a frequency less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: the FTA built from S. 1000 strings enter the root, where 490 halt; a:257 leads to state a (128 halting) and b:253 to state b (170 halting).]

Zadar August 2010 65

Can we merge λ and a?
Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
490/1000 with 128/257
257/1000 with 64/257
253/1000 with 65/257
All tests return true.
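The three comparisons can be recomputed from the learning multisample. The sketch below only covers this first level of the test (halting, then continuing by a, then by b), not the full recursive procedure. The names `SAMPLE`, `passing`, `close` and `one_level_tests` are introduced for the example, with α = 0.05 as in the run.

```python
import math
from collections import Counter

SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3, "bbb": 6,
    "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2, "abab": 2, "abba": 2,
    "abbb": 1, "baaa": 2, "baab": 2, "baba": 1, "babb": 1, "bbaa": 1, "bbab": 1,
    "bbba": 1, "aaaaa": 1, "aaaab": 1, "aaaba": 1, "aabaa": 1, "aabab": 1,
    "aabba": 1, "abbaa": 1, "abbab": 1,
}
ALPHA = 0.05

# passing[u]: number of sample strings having u as a prefix.
passing = Counter()
for w, count in SAMPLE.items():
    for i in range(len(w) + 1):
        passing[w[:i]] += count

def close(f1, n1, f2, n2, alpha=ALPHA):
    bound = math.sqrt(0.5 * math.log(2 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) < bound

def one_level_tests(u, v):
    """Compare the halting and one-symbol continuation frequencies of u and v."""
    nu, nv = passing[u], passing[v]
    pairs = [(SAMPLE.get(u, 0), SAMPLE.get(v, 0))]
    pairs += [(passing[u + a], passing[v + a]) for a in "ab"]
    return [close(fu, nu, fv, nv) for fu, fv in pairs]
```

Calling `one_level_tests("", "a")` performs exactly the three comparisons listed above.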

Zadar August 2010 66

[Figure: the same FTA, with states λ and a about to be merged.]
Merge…

Zadar August 2010 67

[Figure: the DFFA after merging a into λ. 1000 strings enter the initial state, which now has halting frequency 660 and an a-loop of frequency 341; b:340 leads to state b (225 halting).]
And fold

Zadar August 2010 68

[Figure: the same DFFA as on the previous slide.]
Next: merge λ with b?

Zadar August 2010 69

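Can we merge λ and b?
Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*:
660/1341 and 225/340 are different (giving γ ≈ 0.170).
On the other hand, $\sqrt{\dfrac{1}{2}\ln\dfrac{2}{\alpha}} \left(\dfrac{1}{\sqrt{n_1}} + \dfrac{1}{\sqrt{n_2}}\right) = 0.111$
Since γ exceeds this bound, the test fails.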

Zadar August 2010 70

[Figure: the same DFFA; state b becomes Red.]
Promotion

Zadar August 2010 71

[Figure: the same DFFA, with the next pair of states to merge marked.]
Merge

Zadar August 2010 72

[Figure: the DFFA after this merge; 1000 strings enter the initial state.]
And fold

Zadar August 2010 73

[Figure: the current DFFA, with the next pair of states to merge marked.]
Merge

Zadar August 2010 74

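[Figure: the final two-state DFFA. 1000 strings enter the initial state; the halting frequencies are 698 and 302; the transition frequencies are a:354, b:351, a:96 and b:49.]
And fold
[Figure: the same automaton with the frequencies replaced by probabilities.]
As a PFA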

Zadar August 2010 75

Conclusion and logic
Alergia builds a DFFA in polynomial time
Alergia can identify DPFA in the limit with probability 1
No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI
Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI
Using a distinguishable string
Use norm $L_\infty$
Two distributions are different if there is a string with a very different probability
Such a string is called μ-distinguishable
Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?

Zadar August 2010 78

(much more to DSAI)
D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.
PAC learnability results, in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI
MDL inspired heuristic
Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?
F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES
Never has been a challenge, so state of the art is still unclear
Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees
Identification of probabilities
If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.
[Figure: a state with frequency 3000 and an outgoing a-edge labelled 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)
Can be constructed from two simple adjacent fractions by the «mean» operation:
$\dfrac{a}{b} \; m \; \dfrac{c}{d} = \dfrac{a+c}{b+d}$

Zadar August 2010 85

0/1   1/0
1/1
1/2   2/1
1/3   2/3   3/2   3/1
1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
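One possible way to search the Stern-Brocot tree is sketched below: starting from the bounds 0/1 and 1/0, it takes successive mediants and stops at the first one that lies within a given tolerance of the estimate. The function name `simplest_fraction` and the fixed `tolerance` parameter are assumptions made for this example; the slides do not say how the stopping criterion is chosen.

```python
from fractions import Fraction

def simplest_fraction(value, tolerance):
    """Descend the Stern-Brocot tree until a node lies within tolerance of value.

    value must be non-negative and tolerance positive. The current node is
    the mediant (a+c)/(b+d) of the bounds a/b and c/d, which start at 0/1 and 1/0.
    """
    a, b, c, d = 0, 1, 1, 0
    while True:
        m_num, m_den = a + c, b + d
        mediant = Fraction(m_num, m_den)
        if abs(mediant - value) <= tolerance:
            return mediant
        if value < mediant:
            c, d = m_num, m_den     # continue in the left subtree
        else:
            a, b = m_num, m_den     # continue in the right subtree
```

With the estimate 1501/3000 and a tolerance of 1/100, the search stops at 1/2 after two steps.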

Zadar August 2010 87

Iterated Logarithm
With probability 1, for a co-finite number of values of n, we have:
$\left|\dfrac{c(x)}{n} - \dfrac{a}{b}\right| < \sqrt{\dfrac{\lambda \log\log n}{n}}, \quad \forall \lambda > 1$


Zadar August 2010 14

Basic references
The HMM literature
Azaria Paz 1973: Introduction to probabilistic automata
Chapter 5 of my book
Probabilistic Finite-State Machines: Vidal, Thollard, cdlh, Casacuberta & Carrasco
Grammatical Inference papers

Zadar August 2010 15

Automata definitions

Let D be a distribution over $\Sigma^*$
$0 \le \Pr_D(w) \le 1$
$\sum_{w \in \Sigma^*} \Pr_D(w) = 1$

Zadar August 2010 16

A Probabilistic Finite (state) Automaton is a $\langle Q, \Sigma, I_P, F_P, \delta_P \rangle$
Q: set of states
$I_P : Q \to [0;1]$
$F_P : Q \to [0;1]$
$\delta_P : Q \times \Sigma \times Q \to [0;1]$

Zadar August 2010 17

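What does a PFA do?
It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities.
$\Pr_A(w) = \sum_{\pi_i \in \mathrm{paths}(w)} \Pr(\pi_i)$
$\pi_i = q_{i_0} a_{i_1} q_{i_1} a_{i_2} \ldots a_{i_n} q_{i_n}$
$\Pr(\pi_i) = I_P(q_{i_0}) \cdot F_P(q_{i_n}) \cdot \prod_{j=1}^{n} \delta_P(q_{i_{j-1}}, a_{i_j}, q_{i_j})$
Note that if λ-transitions are allowed the sum may be infinite.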

Zadar August 2010 18

Pr(aba) = 0.7·0.4·0.1·1 + 0.7·0.4·0.45·0.2 = 0.028 + 0.0252 = 0.0532
[Figure: a non-deterministic PFA over {a, b}; two paths read aba.]

Zadar August 2010 19

non deterministic PFA: many initial states / only one initial state
a λ-PFA: a PFA with λ-transitions and perhaps many initial states
DPFA: a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if $\Pr_A(\Sigma^*) = 1$ and $\forall x \in \Sigma^*, \; 0 \le \Pr_A(x) \le 1$

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and
$\forall q \in Q, \; F_P(q) + \sum_{q' \in Q,\, a \in \Sigma} \delta_P(q, a, q') = 1$

Zadar August 2010 22

Equivalence between models
Equivalence between PFA and HMM…
But the HMM usually define distributions over each $\Sigma^n$

Zadar August 2010 23

A football HMM
[Figure: a hidden Markov model whose states emit win, draw or lose.]

Zadar August 2010 24

Equivalence between PFA with λ-transitions and PFA without λ-transitions
cdlh 2003, Hanneforth & cdlh 2009
Many initial states can be transformed into one initial state with λ-transitions
λ-transitions can be removed in polynomial time
Strategy:
number the states
eliminate first λ-loops, then the transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You can't even tell in advance if you are in a good case or not
(see Denis & Esposito 2004)

Zadar August 2010 26

Example
[Figure: a non-deterministic PFA over the one-letter alphabet {a}.]
This distribution cannot be modelled by a DPFA

Zadar August 2010 27

What does a DPFA over Σ = {a} look like?
[Figure: a chain of a-transitions closed by a loop.]
And with this architecture you cannot generate the previous one

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings
Deterministic case
Simple: apply definitions
Technically, rather sum up logs: this is easier, safer and cheaper
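Summing logs in the deterministic case can be sketched as follows. The two-state DPFA below is hypothetical (it is not the automaton of the slides), as are the names `FINAL`, `DELTA` and `log_prob`.

```python
import math

# Hypothetical two-state DPFA over {a, b}; state 0 is the initial state.
FINAL = {0: 0.1, 1: 0.5}
DELTA = {  # (state, symbol) -> (next state, probability)
    (0, "a"): (1, 0.6), (0, "b"): (0, 0.3),
    (1, "a"): (0, 0.2), (1, "b"): (1, 0.3),
}

def log_prob(w, initial=0):
    """Log-probability of w: the sum of the logs along its unique path."""
    state, total = initial, 0.0
    for symbol in w:
        if (state, symbol) not in DELTA:
            return -math.inf          # no path: probability 0
        state, p = DELTA[(state, symbol)]
        total += math.log(p)
    if FINAL[state] == 0.0:
        return -math.inf
    return total + math.log(FINAL[state])
```

On long strings the plain product underflows to 0.0 in floating point, whereas the sum of logs stays finite, which is one reason the log form is safer.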

Zadar August 2010 29

Pr(aba) = 0.7·0.9·0.35·0 = 0
Pr(abb) = 0.7·0.9·0.65·0.3 = 0.12285
[Figure: a DPFA over {a, b}.]

Zadar August 2010 30

Non-deterministic case
Pr(aba) = 0.7·0.4·0.1·1 + 0.7·0.4·0.45·0.2 = 0.028 + 0.0252 = 0.0532
[Figure: the non-deterministic PFA; two paths read aba.]

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming: $O(n^2 m)$
2 algorithms: Backward and Forward
If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

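$A[i,j] = \Pr(q_i \mid a_1 \ldots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \ldots a_j$)
$A[i,0] = I_P(q_i)$
$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot \delta_P(q_k, a_{j+1}, q_i)$
$\Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$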
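The Forward recurrence can be sketched as below, keeping only one column of the table A at a time. The two-state PFA is hypothetical (it is not the automaton of the slides), and the brute-force path enumeration is included only as a cross-check. The names `STATES`, `IP`, `FP`, `DELTA`, `forward` and `brute_force` are assumptions of the example.

```python
from itertools import product

# Hypothetical two-state PFA over {a, b}.
STATES = [0, 1]
IP = {0: 1.0, 1: 0.0}
FP = {0: 0.2, 1: 0.5}
# DELTA[(q, a, q')] = probability of reading a while moving from q to q'.
DELTA = {
    (0, "a", 0): 0.3, (0, "a", 1): 0.2, (0, "b", 1): 0.3,
    (1, "a", 0): 0.1, (1, "b", 1): 0.4,
}

def forward(w):
    """Pr(w), computed column by column as in the recurrence for A[i, j]."""
    column = {q: IP[q] for q in STATES}                  # A[., 0]
    for symbol in w:
        column = {
            q: sum(column[k] * DELTA.get((k, symbol, q), 0.0) for k in STATES)
            for q in STATES
        }                                                # A[., j+1]
    return sum(column[q] * FP[q] for q in STATES)

def brute_force(w):
    """Same probability by enumerating every path, for checking only."""
    total = 0.0
    for path in product(STATES, repeat=len(w) + 1):
        p = IP[path[0]] * FP[path[-1]]
        for j, symbol in enumerate(w):
            p *= DELTA.get((path[j], symbol, path[j + 1]), 0.0)
        total += p
    return total
```

The enumeration grows exponentially with the length of the string, while the Forward computation needs one pass over the string with a quadratic amount of work per symbol.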

Zadar August 2010 33

2 Distances
What for?
Estimate the quality of a language model
Have an indicator of the convergence of learning algorithms
Construct kernels

Zadar August 2010 75

Conclusion and logic Alergia builds a DFFA in polynomial

time Alergia can identify DPFA in the limit

with probability 1 No good definition of Alergiarsquos

properties

Zadar August 2010 76

6 DSAI and MDIWhy not change the

criterion

Zadar August 2010 77

Criterion for DSAI Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI) D Ron Y Singer and N Tishby On the

learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI MDL inspired heuristic Criterion is does the reduction of the size of the

automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

AppendixStern Brocot trees

Identification of probabilitiesIf we were able to discover the

structure how do we identify the probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea Instead of returning c(x)n search the

Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

c(x) - a lt log log n n b n

gt1

With probability 1 for a co-finite number of values of n we have


A Probabilistic Finite (state) Automaton is a tuple $\langle Q, \Sigma, I_P, F_P, \delta_P \rangle$:
Q is a set of states
$I_P : Q \to [0,1]$
$F_P : Q \to [0,1]$
$\delta_P : Q \times \Sigma \times Q \to [0,1]$

What does a PFA do?
It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:
$\Pr_A(w) = \sum_{\pi_i \in \mathrm{paths}(w)} \Pr(\pi_i)$
$\pi_i = q_{i_0} a_{i_1} q_{i_1} a_{i_2} \ldots a_{i_n} q_{i_n}$
$\Pr(\pi_i) = I_P(q_{i_0}) \cdot F_P(q_{i_n}) \cdot \prod_{a_{i_j}} \delta_P(q_{i_{j-1}}, a_{i_j}, q_{i_j})$
Note that if λ-transitions are allowed the sum may be infinite.

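Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.45 × 0.2 = 0.028 + 0.0252 = 0.0532
[Figure: the PFA over {a, b} used for this computation; the labels visible in the drawing are 0.7, 0.4, 0.45, 0.35, 0.3, 0.2, 0.1 and 1.]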

Non-deterministic PFA: many initial states, or only one initial state
A λ-PFA: a PFA with λ-transitions and perhaps many initial states
DPFA: a deterministic PFA

Consistency
A PFA is consistent if $\Pr_A(\Sigma^*) = 1$ and $\forall x \in \Sigma^*,\ 0 \le \Pr_A(x) \le 1$.

Consistency theorem
A is consistent if every state is useful (accessible and co-accessible) and
$\forall q \in Q,\ F_P(q) + \sum_{q' \in Q,\, a \in \Sigma} \delta_P(q, a, q') = 1$

Equivalence between models
Equivalence between PFA and HMM…
But the HMM usually define distributions over each $\Sigma^n$.

A football HMM
[Figure: an HMM whose states each emit win, draw or lose.]

Equivalence between PFA with λ-transitions and PFA without λ-transitions
(cdlh 2003, Hanneforth & cdlh 2009)
Many initial states can be transformed into one initial state with λ-transitions.
λ-transitions can be removed in polynomial time.
Strategy: number the states; eliminate first the λ-loops, then the transitions with highest ranking arrival state.

PFA are strictly more powerful than DPFA
Folk theorem
(And) you can't even tell in advance if you are in a good case or not
(see Denis & Esposito 2004)

Example
[Figure: a PFA whose transitions are all labelled a.]
This distribution cannot be modelled by a DPFA.

What does a DPFA over Σ = {a} look like?
[Figure: states linked by a-transitions.]
And with this architecture you cannot generate the previous one.

Parsing issues
Computation of the probability of a string or of a set of strings
Deterministic case: simple, apply definitions.
Technically, rather sum up logs: this is easier, safer and cheaper.

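Pr(aba) = 0.7 × 0.9 × 0.35 × 0 = 0
Pr(abb) = 0.7 × 0.9 × 0.65 × 0.3 = 0.12285
[Figure: the DPFA over {a, b} used for these computations; the labels visible in the drawing are 0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3 and 0.7.]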
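The advice to sum logs in the deterministic case can be sketched as follows. This is a minimal illustration, not code from the slides: the state numbers and the `TRANS` and `FINAL` tables are hypothetical and cover only the fragment of a DPFA needed to reproduce the two products Pr(aba) and Pr(abb).

```python
import math

# Hypothetical fragment of a DPFA: (state, symbol) -> (next state, probability).
TRANS = {
    (0, "a"): (1, 0.7),
    (1, "b"): (2, 0.9),
    (2, "a"): (3, 0.35),
    (2, "b"): (4, 0.65),
}
# Hypothetical halting probabilities for the states of the fragment.
FINAL = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.3}

def log_prob(word, start=0):
    """Return log Pr(word) by summing logs along the unique path."""
    state, total = start, 0.0
    for symbol in word:
        if (state, symbol) not in TRANS:
            return -math.inf  # no transition: probability 0
        state, p = TRANS[(state, symbol)]
        total += math.log(p)
    halt = FINAL[state]
    return total + math.log(halt) if halt > 0 else -math.inf
```

Returning `-math.inf` for a zero probability avoids calling `math.log(0)`, which raises an error in Python.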

Non-deterministic case
Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.45 × 0.2 = 0.028 + 0.0252 = 0.0532
[Figure: the non-deterministic PFA over {a, b} used for this computation.]

In the literature
The computation of the probability of a string is by dynamic programming: $O(n^2 m)$
2 algorithms: Backward and Forward
If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.

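Forward algorithm
$A[i,j] = \Pr(q_i \mid a_1 \ldots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \ldots a_j$)
$A[i,0] = I_P(q_i)$
$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot \delta_P(q_k, a_{j+1}, q_i)$
$\Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$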
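The recurrence can be sketched in a few lines. This is an illustrative sketch under assumptions: the dictionary encoding and the names `forward`, `init`, `final` and `delta` are not from the slides, and the small automaton is a hypothetical one chosen only so that its two paths for aba carry the products 0.7 × 0.4 × 0.1 × 1 and 0.7 × 0.4 × 0.45 × 0.2.

```python
from collections import defaultdict

def forward(word, init, final, delta):
    """Return Pr(word) for a PFA.

    init:  state -> initial probability
    final: state -> halting probability
    delta: (state, symbol) -> {next state: transition probability}
    """
    alpha = dict(init)  # column j of the table A, starting with j = 0
    for symbol in word:
        nxt = defaultdict(float)
        for q, mass in alpha.items():
            for q2, p in delta.get((q, symbol), {}).items():
                nxt[q2] += mass * p
        alpha = nxt
    return sum(mass * final.get(q, 0.0) for q, mass in alpha.items())

# Hypothetical non-deterministic PFA: state 2 has two a-successors.
init = {0: 1.0}
final = {3: 1.0, 4: 0.2}
delta = {
    (0, "a"): {1: 0.7},
    (1, "b"): {2: 0.4},
    (2, "a"): {3: 0.1, 4: 0.45},
}
```

Keeping only one column of the table at a time is enough when just the probability of the whole string is needed.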

2 Distances
What for?
Estimate the quality of a language model
Have an indicator of the convergence of learning algorithms
Construct kernels

2.1 Entropy
How many bits do we need to correct our model?
Two distributions over Σ*: D and D'
Kullback-Leibler divergence (or relative entropy) between D and D':
$\sum_{w \in \Sigma^*} \Pr_D(w) \cdot \left(\log \Pr_D(w) - \log \Pr_{D'}(w)\right)$
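For distributions with finite support the divergence can be computed directly. The sketch below is an assumption-laden illustration: the function name `kl_divergence` and the dictionary representation are hypothetical, and base-2 logarithms are chosen so that the result reads as a number of bits.

```python
import math

def kl_divergence(d, d_prime):
    """Kullback-Leibler divergence, in bits, between two finite distributions.

    Both arguments map strings to probabilities.
    """
    total = 0.0
    for w, p in d.items():
        if p == 0.0:
            continue  # a term with Pr_D(w) = 0 contributes nothing
        q = d_prime.get(w, 0.0)
        if q == 0.0:
            return math.inf  # D gives weight to a string that D' excludes
        total += p * (math.log2(p) - math.log2(q))
    return total
```

For example, `kl_divergence({"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75})` is about 0.21 bits, while the divergence of a distribution from itself is 0.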

2.2 Perplexity
The idea is to allow the computation of the divergence, but relatively to a test set (S).
An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

$\left(\prod_{w \in S} \Pr_D(w)\right)^{-1/|S|} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$
Problem if some probability is null!
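A small sketch of the perplexity computation, working with logs to avoid underflow on long test sets. The function name `perplexity` and the choice to return infinity when a probability is null are assumptions made for this illustration.

```python
import math

def perplexity(probabilities):
    """Inverse of the geometric mean of the probabilities of a test set."""
    if any(p == 0.0 for p in probabilities):
        return math.inf  # the "null probability" problem
    mean_log = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-mean_log)
```

In practice a model is usually smoothed so that no string of the test set receives probability 0.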

Why multiply (1)
We are trying to compute the probability of independently drawing the different strings in set S.

Why multiply (2)
Suppose we have two predictors for a coin toss:
Predictor 1: heads 60%, tails 40%
Predictor 2: heads 100%
The tests are H: 6, T: 4
Arithmetic mean: P1: 36% + 16% = 0.52; P2: 0.6
Predictor 2 is the better predictor ;-)
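As a worked check (not part of the original slides), the same test set can be scored with the geometric mean instead of the arithmetic mean:

```latex
\[
P_1:\ \left(0.6^{6}\cdot 0.4^{4}\right)^{1/10} \approx 0.51,
\qquad
P_2:\ \left(1^{6}\cdot 0^{4}\right)^{1/10} = 0 .
\]
```

With the product, Predictor 2 is penalised for giving probability 0 to the four tails that were observed, so Predictor 1 comes out ahead.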

2.3 Distance d2
$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left(\Pr_D(w) - \Pr_{D'}(w)\right)^2}$
Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).
This also means that equivalence of PFA is in P.

3 FFA
Frequency Finite (state) Automata

A learning sample is a multiset
Strings appear with a frequency (or multiplicity)
S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

DFFA
A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that:
the sum of what enters is equal to what exits, and
the sum of what halts is equal to what starts.

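Example
[Figure: a DFFA with three states. Initial frequency 6; halting frequencies 3, 1 and 2; transition frequencies a: 5 and b: 5, b: 3 and a: 2, a: 1 and b: 4.]

From a DFFA to a DPFA
[Figure: the same automaton as a DPFA. Initial probability 6/6; halting probabilities 3/13, 1/6 and 2/7; transition probabilities a: 5/13 and b: 5/13, b: 3/6 and a: 2/6, a: 1/7 and b: 4/7.]
Frequencies become relative frequencies by dividing by the sum of exiting frequencies.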
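The normalisation step can be sketched as below. The state names `q1`, `q2`, `q3` and the nested-dictionary layout are hypothetical; the frequencies are grouped per state according to the denominators 13, 6 and 7 of the example, and the transition targets are left out because they do not matter for the division.

```python
from fractions import Fraction

# Hypothetical encoding: per state, halting frequency and outgoing frequencies.
dffa = {
    "q1": {"halt": 3, "out": {"a": 5, "b": 5}},
    "q2": {"halt": 1, "out": {"a": 2, "b": 3}},
    "q3": {"halt": 2, "out": {"a": 1, "b": 4}},
}

def to_probabilities(automaton):
    """Divide every frequency of a state by the total that exits the state."""
    result = {}
    for state, freq in automaton.items():
        total = freq["halt"] + sum(freq["out"].values())
        result[state] = {
            "halt": Fraction(freq["halt"], total),
            "out": {a: Fraction(c, total) for a, c in freq["out"].items()},
        }
    return result

dpfa = to_probabilities(dffa)
```

Using `Fraction` keeps the relative frequencies exact, so halting and outgoing probabilities of each state sum to exactly 1.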

From a DFA and a sample to a DFFA
[Figure: the DFFA of the previous example, obtained by parsing the sample with the DFA.]
S = {λ, aaaa, ab, babb, bbbb, bbbbaa}

Note
Another sample may lead to the same DFFA.
Doing the same with a NFA is a much harder problem.
Typically what algorithm Baum-Welch (EM) has been invented for…

The frequency prefix tree acceptor
The data is a multi-set.
The FTA is the smallest tree-like FFA consistent with the data.
Can be transformed into a PFA if needed.

From the sample to the FTA
S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}
[Figure: FTA(S). 14 strings enter the root, which has halting frequency 3; the transitions out of the root are a: 7 and b: 4.]
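Building the frequencies of FTA(S) amounts to counting, for every prefix, how many strings of the multiset pass through it and how many halt there. The sketch below assumes a hypothetical representation in which each state is named by its prefix; `build_fta`, `passing` and `halting` are illustrative names.

```python
from collections import Counter

def build_fta(sample):
    """Return (passing, halting) frequency counters indexed by prefix."""
    passing, halting = Counter(), Counter()
    for word, count in sample.items():
        halting[word] += count
        # every prefix of the word, including the empty one and the word itself
        for i in range(len(word) + 1):
            passing[word[:i]] += count
    return passing, halting

# The multiset of the slides; "" stands for the empty string λ.
S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
passing, halting = build_fta(S)
```

The frequency of the transition into a state is the `passing` count of that state, so `passing["a"]` is the frequency of the a-transition leaving the root.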

Red, Blue and White states
[Figure: an automaton over {a, b} with its states coloured Red, Blue and White.]
- Red states are confirmed states
- Blue states are the (non Red) successors of the Red states
- White states are the others
Same as with DFA and what RPNI does

Merge and fold
[Figure sequence over four slides: an automaton with 100 strings entering the initial state. The names of the two states being merged did not survive extraction.]
Suppose we decide to merge [state] with state [state].
First disconnect [state] and reconnect to [state].
Then fold.
[Figure: the automaton after folding.]

State merging algorithm
A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(A_p, A_q) is small
        then A = merge_and_fold(A, p, q)
        else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

The real question
How do we decide if d(A_p, A_q) is small?
Use a distance…
Be able to compute this distance
If possible update the computation easily
Have properties related to this distance

Deciding if two distributions are similar
If the two distributions are known, equality can be tested.
Distance (L2 norm) between distributions can be exactly computed.
But what if the two distributions are unknown?

Taking decisions
[Figure: the automaton of the merge-and-fold example, with 100 strings entering the initial state.]
Suppose we want to merge [state] with state [state].
Yes, if the two distributions induced are similar.
[Figure: the two induced distributions being compared.]

5 Alergia

Alergia's test
$D_1 \approx D_2$ if $\forall x\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$
Easier to test:
$\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$
$\forall a \in \Sigma,\ \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$
And do this recursively!
Of course, do it on frequencies.

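Hoeffding bounds
$\gamma = \left|\frac{f_1}{n_1} - \frac{f_2}{n_2}\right|$
$\gamma < \left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right) \cdot \sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}$
γ indicates if the relative frequencies $f_1/n_1$ and $f_2/n_2$ are sufficiently close.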
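The test is short enough to write out. In this sketch the function name `hoeffding_close` and its argument order are assumptions; the formula is the bound stated here, with f1 out of n1 and f2 out of n2 as the two observed frequencies and `alpha` as the confidence parameter.

```python
import math

def hoeffding_close(f1, n1, f2, n2, alpha):
    """True if f1/n1 and f2/n2 are close enough under the Hoeffding bound."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) * math.sqrt(
        0.5 * math.log(2 / alpha)
    )
    return gamma < bound
```

With the counts that appear in the run of Alergia in these slides and `alpha = 0.05`, `hoeffding_close(490, 1000, 128, 257, 0.05)` accepts and `hoeffding_close(660, 1341, 225, 340, 0.05)` rejects.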

A run of Alergia
Our learning multisample:
S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.
Note that for the blue states which have a frequency less than the threshold, a special merging operation takes place.

[Figure: the FTA built from S. 1000 strings enter the initial state, which has halting frequency 490; the state reached by a has halting frequency 128.]

Can we merge λ and a?
Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
490/1000 with 128/257
257/1000 with 64/257
253/1000 with 65/257
All tests return true.

[Figure: the same FTA, with the states λ and a about to be merged.]
Merge…

[Figure: the automaton after the merge; the initial state now has halting frequency 660 and an a-loop of frequency 341, and the b-transition of frequency 340 leads to a state with halting frequency 225.]
And fold

[Figure: the same automaton.]
Next: merge λ with b?

Can we merge λ and b?
Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*
660/1341 and 225/340 are different (giving γ = 0.162)
On the other hand $\left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right)\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} = 0.111$
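The two sides of the test can be checked by hand, taking $n_1 = 1341$, $n_2 = 340$ and $\alpha = 0.05$ as set for this run. This is a worked check added for illustration; computed directly from the two fractions, the difference comes out near 0.170 rather than the 0.162 quoted, which does not change the outcome since both values exceed the bound.

```latex
\[
\gamma = \left|\frac{660}{1341} - \frac{225}{340}\right|
       \approx \left|0.4922 - 0.6618\right| \approx 0.170
\]
\[
\left(\frac{1}{\sqrt{1341}} + \frac{1}{\sqrt{340}}\right)
\sqrt{\frac{1}{2}\ln\frac{2}{0.05}}
\approx (0.0273 + 0.0542) \times 1.358 \approx 0.111
\]
```

Since γ is larger than the bound, the test fails and λ and b are not merged.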

[Figure: the automaton of the previous step.]
Promotion

[Figure: the same automaton, with the next candidate states marked.]
Merge

[Figure: the automaton after the merge.]
And fold

[Figure: the automaton before the last merge.]
Merge

[Figure: the final DFFA, with two states. 1000 strings enter the initial state; halting frequencies 698 and 302; transition frequencies a: 354, b: 351, a: 96 and b: 49.]
And fold

[Figure: the same two-state automaton drawn with probabilities instead of frequencies.]
As a PFA

Conclusion and logic
Alergia builds a DFFA in polynomial time.
Alergia can identify DPFA in the limit with probability 1.
No good definition of Alergia's properties.

6 DSAI and MDI
Why not change the criterion?

Criterion for DSAI
Using a distinguishable string
Use norm $L_\infty$
Two distributions are different if there is a string with a very different probability.
Such a string is called μ-distinguishable.
Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?

(much more to DSAI)
D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.
PAC learnability results, in the case where targets are acyclic graphs.

Criterion for MDI
MDL inspired heuristic
Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?
F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

7 Conclusion and open questions

A good candidate to learn NFA is DEES.
Never has been a challenge, so state of the art is still unclear.
Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars.

Appendix
Stern-Brocot trees
Identification of probabilities
If we were able to discover the structure, how do we identify the probabilities?

By estimation: the edge is used 1501 times out of 3000 passages through the state.
[Figure: a state with frequency 3000 and an outgoing edge labelled a, 1501/3000.]

Stern-Brocot trees (Stern 1858, Brocot 1860)
Can be constructed from two simple adjacent fractions by the «mean» operation:
$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

0/1   1/0
1/1
1/2   2/1
1/3   2/3   3/2   3/1
1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Idea
Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
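One way to sketch this search is to walk down the tree from 0/1 and 1/0, taking the «mean» of the two current bounds at each step, and to stop at the first fraction that is close enough. The function name `simple_fraction` and the explicit `tolerance` argument are assumptions of this sketch; the slides do not say how the stopping rule is chosen.

```python
from fractions import Fraction

def simple_fraction(value, tolerance):
    """First fraction on the Stern-Brocot path to value within tolerance.

    value must be a non-negative Fraction.
    """
    lo_n, lo_d = 0, 1  # left bound 0/1
    hi_n, hi_d = 1, 0  # right bound 1/0, standing for infinity
    while True:
        med_n, med_d = lo_n + hi_n, lo_d + hi_d  # the «mean» of the bounds
        mediant = Fraction(med_n, med_d)
        if abs(mediant - value) <= tolerance:
            return mediant
        if mediant < value:
            lo_n, lo_d = med_n, med_d  # go right
        else:
            hi_n, hi_d = med_n, med_d  # go left
```

For the estimate 1501/3000 and a tolerance of 1/100, the walk visits 1/1 and then stops at 1/2.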

Iterated Logarithm
With probability 1, for a co-finite number of values of n, we have:
$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \sqrt{\frac{\lambda \log\log n}{n}}$
$\forall \lambda > 1$


Zadar August 2010 17

What does a PFA do It defines the probability of each string w as the sum (over all

paths reading w) of the products of the probabilities

PrA(w)=ipaths(w)Pr(i) i=qi0ai1qi1ai2hellipainqin

Pr(i)=IP(qi0)FP(qin) aij P (qij-1aijqij)

Note that if -transitions are allowed the sum may be infinite

Zadar August 2010 18

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 19

non deterministic PFA many initial statesonly one initial state

an -PFA a PFA with -transitions and perhaps many initial states

DPFA a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between modelsEquivalence between PFA and

HMMhellipBut the HMM usually define

distributions over each n

Zadar August 2010 23

41

21

win draw lose win draw lose win draw lose

21

21

34

12

14

14

34

14

14

41

41

41

41

41

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

31

21

21

21

21

32

a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a Compare and a a and aa b

and ab 4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b Compare and b a and ba b

and bb 6601341 and 225340 are different

(giving = 0162) On the other hand

11102ln2111

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic Alergia builds a DFFA in polynomial

time Alergia can identify DPFA in the limit

with probability 1 No good definition of Alergiarsquos

properties

Zadar August 2010 76

6 DSAI and MDIWhy not change the

criterion

Zadar August 2010 77

Criterion for DSAI Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI) D Ron Y Singer and N Tishby On the

learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI MDL inspired heuristic Criterion is does the reduction of the size of the

automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

AppendixStern Brocot trees

Identification of probabilitiesIf we were able to discover the

structure how do we identify the probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea Instead of returning c(x)n search the

Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of $n$, we have:

$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \sqrt{\frac{\lambda \log\log n}{n}}, \quad \forall \lambda > 1$

Zadar August 2010 18

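$Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b}; two paths read aba, with the transition and final probabilities used in the computation above]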

Zadar August 2010 19

Non-deterministic PFA: many initial states / only one initial state

A λ-PFA: a PFA with λ-transitions and perhaps many initial states

DPFA: a deterministic PFA

Zadar August 2010 20

Consistency

A PFA is consistent if $Pr_A(\Sigma^*) = 1$ and $\forall x \in \Sigma^*,\ 0 \le Pr_A(x) \le 1$.

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

$\forall q \in Q,\quad F_P(q) + \sum_{q' \in Q,\, a \in \Sigma} P(q, a, q') = 1$
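
The numerical half of this condition is easy to check mechanically. The sketch below assumes a PFA stored as two dictionaries (final probabilities per state, and probabilities per transition triple); this encoding and the example automaton are hypothetical. It does not test usefulness of states, which needs a separate reachability check.

```python
from math import isclose


def satisfies_local_condition(final_prob, transitions, tol=1e-9):
    """Check F_P(q) + sum of outgoing transition probabilities == 1 for every state."""
    totals = dict(final_prob)
    for (q, _symbol, _q_next), p in transitions.items():
        totals[q] = totals.get(q, 0.0) + p
    return all(isclose(total, 1.0, abs_tol=tol) for total in totals.values())


# Hypothetical two-state PFA.
final_prob = {0: 0.2, 1: 0.5}
transitions = {(0, "a", 0): 0.3, (0, "b", 1): 0.5, (1, "a", 0): 0.5}
ok = satisfies_local_condition(final_prob, transitions)
```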

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMM...

But HMMs usually define distributions over each $\Sigma^n$.

Zadar August 2010 23

[Figure: a football HMM; each state emits win, draw or lose, and the transitions and emissions carry probabilities such as 1/4, 1/2 and 3/4]

A football HMM

Zadar August 2010 24

Equivalence between PFA with λ-transitions and PFA without λ-transitions

cdlh 2003, Hanneforth & cdlh 2009

Many initial states can be transformed into one initial state with λ-transitions.

λ-transitions can be removed in polynomial time.

Strategy: number the states; eliminate first the λ-loops, then the transitions with the highest-ranking arrival state.

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You can't even tell in advance if you are in a good case or not

(see Denis & Esposito 2004)

Zadar August 2010 26

Example

[Figure: a non-deterministic PFA over the alphabet {a}]

This distribution cannot be modelled by a DPFA.

Zadar August 2010 27

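What does a DPFA over $\Sigma = \{a\}$ look like?

[Figure: a sequence of states linked by a-transitions, a ... a, ending in a cycle of a-transitions]

And with this architecture you cannot generate the previous one.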

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case: simple, apply the definitions.

Technically, rather sum up logs: this is easier, safer and cheaper.
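
A minimal sketch of the deterministic case with summed logs. The dictionary encoding of the DPFA and the two-state example are assumptions made for illustration; a missing transition or a zero probability yields a log-probability of minus infinity.

```python
import math


def dpfa_log_probability(word, initial, delta, prob, final_prob):
    """Log-probability of word in a DPFA; returns -inf when the probability is 0."""
    state = initial
    log_p = 0.0
    for symbol in word:
        if (state, symbol) not in delta or prob[(state, symbol)] == 0.0:
            return -math.inf
        log_p += math.log(prob[(state, symbol)])  # add logs instead of multiplying
        state = delta[(state, symbol)]
    if final_prob.get(state, 0.0) == 0.0:
        return -math.inf
    return log_p + math.log(final_prob[state])


# Hypothetical DPFA: state 0 halts with 0.1, state 1 halts with 0.5.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0}
prob = {(0, "a"): 0.6, (0, "b"): 0.3, (1, "a"): 0.5}
final_prob = {0: 0.1, 1: 0.5}
log_p_aa = dpfa_log_probability("aa", 0, delta, prob, final_prob)
log_p_ab = dpfa_log_probability("ab", 0, delta, prob, final_prob)
```

Working in log space avoids underflow on long strings, at the price of handling the zero-probability case explicitly.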

Zadar August 2010 29

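$Pr(aba) = 0.7 \times 0.9 \times 0.35 \times 0 = 0$

$Pr(abb) = 0.7 \times 0.9 \times 0.65 \times 0.3 = 0.12285$

[Figure: a DPFA over {a, b} with the transition and final probabilities used in the computations above]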

Zadar August 2010 30

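Non-deterministic case

$Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b}; two paths read aba, with the transition and final probabilities used in the computation above]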

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming: $O(n^2 m)$.

2 algorithms: Backward and Forward.

If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.

Zadar August 2010 32

Forward algorithm

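$A[i,j] = Pr(q_i \mid a_1 \cdots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \cdots a_j$)

$A[i,0] = I_P(q_i)$

$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot P(q_k, a_{j+1}, q_i)$

$Pr(a_1 \cdots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$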
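
The recurrence can be sketched directly in Python. Only one column of the table A is kept at a time; the dictionary encoding and the three-state example PFA are assumptions made for illustration, and λ-transitions are assumed to have been removed.

```python
from math import isclose


def forward_probability(word, states, initial_prob, transitions, final_prob):
    """Pr(word) in a PFA without lambda-transitions, by the forward recurrence.

    transitions maps (q, symbol, q_next) to a probability; missing entries are 0.
    """
    # column[q] plays the role of A[q, j] for the prefix read so far.
    column = {q: initial_prob.get(q, 0.0) for q in states}
    for symbol in word:
        column = {
            q_next: sum(
                column[q] * transitions.get((q, symbol, q_next), 0.0) for q in states
            )
            for q_next in states
        }
    return sum(column[q] * final_prob.get(q, 0.0) for q in states)


# Hypothetical PFA with a non-deterministic choice on a from state 0.
states = [0, 1, 2]
initial_prob = {0: 1.0}
transitions = {(0, "a", 1): 0.5, (0, "a", 2): 0.5, (1, "b", 1): 0.4, (2, "b", 2): 0.2}
final_prob = {1: 0.6, 2: 0.8}
p_ab = forward_probability("ab", states, initial_prob, transitions, final_prob)
```

Each symbol costs one pass over all pairs of states, which is where the quadratic factor in the number of states comes from.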

Zadar August 2010 33

2 Distances

What for?

Estimate the quality of a language model

Have an indicator of the convergence of learning algorithms

Construct kernels

Zadar August 2010 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over $\Sigma^*$: D and D'

Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} Pr_D(w) \cdot \left(\log Pr_D(w) - \log Pr_{D'}(w)\right)$
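
A small sketch of this sum, assuming both distributions have finite support and are given as dictionaries (distributions defined by PFA range over all of Σ*, so this is only an illustration). Base-2 logarithms are used so that the result reads in bits.

```python
import math


def kl_divergence(p, q):
    """Sum over w of p(w) * (log2 p(w) - log2 q(w)), in bits.

    Strings with p(w) == 0 contribute nothing; q(w) == 0 with p(w) > 0 gives infinity.
    """
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf
        total += pw * (math.log2(pw) - math.log2(qw))
    return total


# Hypothetical distributions over two strings.
D = {"a": 0.5, "b": 0.5}
D_prime = {"a": 0.25, "b": 0.75}
divergence = kl_divergence(D, D_prime)
```

The divergence is not symmetric, and it becomes infinite as soon as the model gives probability 0 to a string the target can produce.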

Zadar August 2010 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relatively to a test set (S).

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

Zadar August 2010 36

$\left(\prod_{w \in S} Pr_D(w)\right)^{-\frac{1}{|S|}} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} Pr_D(w)}}$

Problem if some probability is null.
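
A minimal sketch of perplexity computed through logs, assuming the model is available as a function from strings to probabilities (the name `model_prob` and the constant toy model are hypothetical). A null probability makes the perplexity infinite, which is the problem mentioned above.

```python
import math


def perplexity(model_prob, test_set):
    """Inverse geometric mean of the model probabilities of the test strings."""
    log_sum = 0.0
    for w in test_set:
        p = model_prob(w)
        if p == 0.0:
            return math.inf  # one null probability ruins the whole product
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_set))


# A toy model giving probability 1/4 to every string.
uniform_quarter = lambda w: 0.25
value = perplexity(uniform_quarter, ["a", "b", "ab"])
```

In practice models are smoothed so that no test string receives probability 0.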

Zadar August 2010 37

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S.

Zadar August 2010 38

Why multiply? (2)

Suppose we have two predictors for a coin toss:

Predictor 1: heads 60%, tails 40%
Predictor 2: heads 100%

The tests are H: 6, T: 4.

Arithmetic mean:

P1: 0.36 + 0.16 = 0.52
P2: 0.6

Predictor 2 is the better predictor ;-)
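
The coin example can be replayed in a few lines. The sketch below computes both the arithmetic mean used on the slide and the geometric mean that perplexity is built on; the helper names are chosen here for illustration.

```python
import math


def arithmetic_mean(probs):
    return sum(probs) / len(probs)


def geometric_mean(probs):
    if any(p == 0.0 for p in probs):
        return 0.0  # a single impossible outcome sends the product to 0
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


tests = ["H"] * 6 + ["T"] * 4
predictor1 = {"H": 0.6, "T": 0.4}
predictor2 = {"H": 1.0, "T": 0.0}
scores1 = [predictor1[t] for t in tests]
scores2 = [predictor2[t] for t in tests]
```

Under the arithmetic mean Predictor 2 wins (0.6 against 0.52), although it declares four of the ten observed outcomes impossible. Under the geometric mean its score drops to 0 and Predictor 1 wins, which is the behaviour one wants from a quality measure.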

Zadar August 2010 39

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left(Pr_D(w) - Pr_{D'}(w)\right)^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).

This also means that equivalence of PFA is in P.

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset.

Strings appear with a frequency (or multiplicity):

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, for every transition and for entering the initial state, such that:

the sum of what enters is equal to what exits, and

the sum of what halts is equal to what starts.

Zadar August 2010 43

Example

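[Figure: a DFFA with three states; 6 strings enter the initial state; halting frequencies 3, 1 and 2; transition frequencies a 5, b 5, a 2, b 3, a 1 and b 4]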

Zadar August 2010 44

From a DFFA to a DPFA

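[Figure: the same automaton as a DPFA; halting probabilities 3/13, 1/6 and 2/7; transition probabilities a 5/13, b 5/13, a 2/6, b 3/6, a 1/7 and b 4/7; initial probability 6/6]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies.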
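
The normalisation can be sketched as follows. The frequencies are those of the example, grouped by state according to the denominators 13, 6 and 7; the state labels 1, 2, 3 are chosen here, and transition targets are left out because they do not affect the division.

```python
from fractions import Fraction


def dffa_to_dpfa(final_freq, trans_freq):
    """Divide every frequency leaving a state by the total frequency leaving it."""
    totals = dict(final_freq)
    for (q, _symbol), f in trans_freq.items():
        totals[q] = totals.get(q, 0) + f
    final_prob = {q: Fraction(f, totals[q]) for q, f in final_freq.items()}
    trans_prob = {(q, a): Fraction(f, totals[q]) for (q, a), f in trans_freq.items()}
    return final_prob, trans_prob


# Halting and transition frequencies of the example (state labels are arbitrary).
final_freq = {1: 3, 2: 1, 3: 2}
trans_freq = {
    (1, "a"): 5, (1, "b"): 5,
    (2, "a"): 2, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 4,
}
final_prob, trans_prob = dffa_to_dpfa(final_freq, trans_freq)
```

Exact fractions are used so that the probabilities leaving each state sum to 1 without rounding error.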

Zadar August 2010 45

From a DFA and a sample to a DFFA

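[Figure: the same DFFA as in the previous example, obtained by parsing the sample through the DFA; 6 strings enter the initial state]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}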
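
Counting is all it takes in the deterministic case: each string follows exactly one path. The sketch below parses the sample through a DFA and counts how often each transition and each halting state is used. The two-state DFA (tracking the parity of the number of b's) is a hypothetical stand-in, since the transition structure of the automaton in the figure cannot be read from the slide.

```python
from collections import Counter


def count_frequencies(initial, delta, sample):
    """Parse each string of the sample through the DFA and count what is used."""
    trans_freq = Counter()
    final_freq = Counter()
    for word in sample:
        state = initial
        for symbol in word:
            trans_freq[(state, symbol)] += 1
            state = delta[(state, symbol)]
        final_freq[state] += 1
    return len(sample), final_freq, trans_freq


# Hypothetical complete DFA: state 0 = even number of b's, state 1 = odd.
delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0}
sample = ["", "aaaa", "ab", "babb", "bbbb", "bbbbaa"]
entering, final_freq, trans_freq = count_frequencies(0, delta, sample)
```

The resulting counts satisfy the two DFFA conditions by construction: what enters a state equals what leaves it, and the halting frequencies add up to the size of the sample.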

Zadar August 2010 46

Note

Another sample may lead to the same DFFA.

Doing the same with an NFA is a much harder problem.

Typically what the Baum-Welch (EM) algorithm has been invented for...

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set.

The FTA is the smallest tree-like FFA consistent with the data.

Can be transformed into a PFA if needed.

Zadar August 2010 48

From the sample to the FTA

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S), the frequency prefix tree acceptor of S; 14 strings enter the root, which halts 3 times; the root has outgoing edges a 7 and b 4]
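
Building the FTA from the multiset takes one pass over the strings. In the sketch below states are identified with prefixes, and the empty string plays the role of the root; this encoding and the function name `build_fpta` are choices made here for illustration.

```python
def build_fpta(multiset):
    """Build the frequency prefix tree acceptor of a multiset {string: count}.

    States are prefixes. Returns (state_freq, final_freq, trans_freq), where
    state_freq[u] counts the strings of the sample having u as a prefix.
    """
    state_freq, final_freq, trans_freq = {"": 0}, {}, {}
    for word, count in multiset.items():
        state_freq[""] += count
        for i, symbol in enumerate(word):
            prefix, longer = word[:i], word[:i + 1]
            trans_freq[(prefix, symbol)] = trans_freq.get((prefix, symbol), 0) + count
            state_freq[longer] = state_freq.get(longer, 0) + count
        final_freq[word] = final_freq.get(word, 0) + count
    return state_freq, final_freq, trans_freq


# The sample of the slide; "" stands for the empty string.
S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
state_freq, final_freq, trans_freq = build_fpta(S)
```

On this sample the root is entered 14 times and its two outgoing edges carry the frequencies 7 (for a) and 4 (for b).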

Zadar August 2010 49

Red Blue and White states

[Figure: an automaton over {a, b} whose states are coloured Red, Blue and White]

- Red states are confirmed states
- Blue states are the (non Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

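Suppose we decide to merge two states (marked in the figure).

[Figure: a DFFA entered by 100 strings; state frequencies 60, 9, 10, 11, 6 and 4; transition frequencies a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9 and b 6]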

Zadar August 2010 51

Merge and fold

First disconnect the state to be merged, and reconnect its incoming transition to the other state.

[Figure: the same DFFA, with the incoming transition of the merged state redirected]

Zadar August 2010 52

Merge and fold

Then fold

[Figure: the same DFFA, with the subtree of the merged state about to be folded into the rest of the automaton]

Zadar August 2010 53

Merge and fold

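after folding

[Figure: the DFFA after folding; frequencies of folded transitions have been added, so that b 24 and b 6 become b 30, and a 6 and a 4 become a 10; 100 strings still enter the initial state]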
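
The two steps shown on these slides (redirect, then fold) can be sketched recursively. The code below is a minimal illustration under stated assumptions, not the reference implementation: the DFFA is stored as dictionaries, the merged state is the root of a tree (as Blue states are), and the names `fold` and `merge_and_fold` as well as the tiny example are chosen here.

```python
def fold(delta, trans_freq, final_freq, keep, drop):
    """Fold the subtree rooted in `drop` into `keep`, adding up frequencies."""
    final_freq[keep] = final_freq.get(keep, 0) + final_freq.pop(drop, 0)
    for state, symbol in [key for key in delta if key[0] == drop]:
        child = delta.pop((state, symbol))
        freq = trans_freq.pop((state, symbol))
        if (keep, symbol) in delta:
            # Both states have this transition: add frequencies, fold the targets.
            trans_freq[(keep, symbol)] += freq
            fold(delta, trans_freq, final_freq, delta[(keep, symbol)], child)
        else:
            # Only the dropped state has it: graft the subtree as it is.
            delta[(keep, symbol)] = child
            trans_freq[(keep, symbol)] = freq


def merge_and_fold(delta, trans_freq, final_freq, red, blue):
    """Redirect the transition entering `blue` to `red`, then fold `blue` into `red`."""
    for key, target in delta.items():
        if target == blue:
            delta[key] = red
    fold(delta, trans_freq, final_freq, red, blue)


# Tiny tree-shaped DFFA (states are prefixes, "" is the root), 5 strings in total.
delta = {("", "a"): "a", ("a", "a"): "aa", ("a", "b"): "ab"}
trans_freq = {("", "a"): 3, ("a", "a"): 1, ("a", "b"): 1}
final_freq = {"": 2, "a": 1, "aa": 1, "ab": 1}
merge_and_fold(delta, trans_freq, final_freq, red="", blue="a")
```

After the call the root carries an a-loop of frequency 4 and a b-transition of frequency 1, and the flow conditions of a DFFA still hold.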

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red, a ∈ Σ} − Red

Zadar August 2010 55

The real question: how do we decide if d(A_p, A_q) is small?

Use a distance...

Be able to compute this distance.

If possible, update the computation easily.

Have properties related to this distance.

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested.

Distance (L2 norm) between distributions can be exactly computed.

But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

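Suppose we want to merge two states (marked in the figure).

[Figure: a DFFA entered by 100 strings; state frequencies 60, 9, 10, 11, 6 and 4; transition frequencies a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9 and b 6]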

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar.

[Figure: the same DFFA, next to the two sub-automata rooted in the states considered for merging]

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

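Alergia's test

$D_1 \approx D_2$ if $\forall x,\ Pr_{D_1}(x) \approx Pr_{D_2}(x)$

Easier to test:

$Pr_{D_1}(\lambda) = Pr_{D_2}(\lambda)$

$\forall a \in \Sigma,\ Pr_{D_1}(a\Sigma^*) = Pr_{D_2}(a\Sigma^*)$

And do this recursively. Of course, do it on frequencies.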

Zadar August 2010 61

Hoeffding bounds

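$\gamma = \left|\frac{f_1}{n_1} - \frac{f_2}{n_2}\right| < \left(\sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}}\right) \cdot \sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}$

γ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.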
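
The test is a one-liner once the bound is written down. The sketch below uses the frequencies that appear in the run of Alergia later in this deck, with α = 0.05; the function names are chosen here for illustration.

```python
import math


def hoeffding_bound(n1, n2, alpha):
    """Right-hand side of the test: (sqrt(1/n1) + sqrt(1/n2)) * sqrt(ln(2/alpha) / 2)."""
    return (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / alpha))


def alergia_test(f1, n1, f2, n2, alpha):
    """True when the relative frequencies f1/n1 and f2/n2 are sufficiently close."""
    gamma = abs(f1 / n1 - f2 / n2)
    return gamma < hoeffding_bound(n1, n2, alpha)


# Frequencies taken from the run: 490/1000 against 128/257, then 660/1341 against 225/340.
first = alergia_test(490, 1000, 128, 257, alpha=0.05)
second = alergia_test(660, 1341, 225, 340, alpha=0.05)
bound = hoeffding_bound(1341, 340, alpha=0.05)
```

The first pair passes the test and the second does not; for the second pair the bound evaluates to about 0.111.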

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ (490), a (128), b (170), aa (31), ab (42), ba (38), bb (14), aaa (8), aab (10), aba (10), abb (4), baa (9), bab (4), bba (3), bbb (6), aaaa (2), aaab (2), aaba (3), aabb (2), abaa (2), abab (2), abba (2), abbb (1), baaa (2), baab (2), baba (1), babb (1), bbaa (1), bbab (1), bbba (1), aaaaa (1), aaaab (1), aaaba (1), aabaa (1), aabab (1), aabba (1), abbaa (1), abbab (1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the Blue states which have a frequency less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: FTA(S) for the multisample above; 1000 strings enter the initial state, which halts 490 times and has an outgoing a-transition of frequency 257]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, $a\Sigma^*$ and $aa\Sigma^*$, $b\Sigma^*$ and $ab\Sigma^*$:

490/1000 with 128/257
257/1000 with 64/257
253/1000 with 65/257

All tests return true.

Zadar August 2010 66

[Figure: FTA(S) again, with the states λ and a about to be merged; 1000 strings enter the initial state]

Merge...

Zadar August 2010 67

[Figure: the automaton after merging λ and a; 1000 strings enter the initial state, which now halts 660 times and carries an a-transition of frequency 341 and a b-transition of frequency 340]

And fold

Zadar August 2010 68

[Figure: the same automaton; the initial state halts 660 times and the state b halts 225 times]

Next: merge λ with b?

Zadar August 2010 69

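Can we merge λ and b?

Compare λ and b, $a\Sigma^*$ and $ba\Sigma^*$, $b\Sigma^*$ and $bb\Sigma^*$.

660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand, $\left(\sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}}\right)\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} = 0.111$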

Zadar August 2010 70

[Figure: the same automaton, with the state b promoted to Red]

Promotion

Zadar August 2010 71

[Figure: the same automaton, with the next Blue state about to be merged]

Merge

Zadar August 2010 72

[Figure: the automaton after this merge; transition frequencies include a 341, a 95, b 340, b 49 and a 11; 1000 strings enter the initial state]

And fold

Zadar August 2010 73

[Figure: the same automaton, with the next Blue state about to be merged; 1000 strings enter the initial state]

Merge

Zadar August 2010 74

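And fold

[Figure: the resulting two-state DFFA; halting frequencies 698 and 302; transition frequencies a 354, b 351, a 96 and b 49; 1000 strings enter the initial state]

As a PFA

[Figure: the corresponding two-state PFA]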

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time.

Alergia can identify DPFA in the limit with probability 1.

No good definition of Alergia's properties.

Page 20: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 20

Consistency

A PFA is consistent if PrA()=1x 0PrA(x)1

Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

qQFP(q) + qrsquoQa P (qaqrsquo)= 1

Zadar August 2010 22

Equivalence between modelsEquivalence between PFA and

HMMhellipBut the HMM usually define

distributions over each n

Zadar August 2010 23

41

21

win draw lose win draw lose win draw lose

21

21

34

12

14

14

34

14

14

41

41

41

41

41

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

31

21

21

21

21

32

a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA: Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset.

Strings appear with a frequency (or multiplicity).

S = {λ(3), aaa(4), aaba(2), ababa(1), bb(3), bbaaa(1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that:
- the sum of what enters is equal to what exits, and
- the sum of what halts is equal to what starts.

Zadar August 2010 43

Example

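[Figure: a three-state DFFA. Labels recovered from the diagram: state (halting) frequencies 3, 1 and 2; transition frequencies b 5, b 3, a 1, a 2, a 5, b 4; frequency entering the initial state 6]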

Zadar August 2010 44

From a DFFA to a DPFA

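[Figure: the same automaton as a DPFA. Labels recovered from the diagram: final probabilities 3/13, 1/6 and 2/7; transitions b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7; initial probability 6/6]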

Frequencies become relative frequencies by dividing by sum of exiting frequencies
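The normalisation can be sketched as follows. The frequencies are those of the example; the transition targets are an assumption (they are one structure consistent with the frequencies, since the drawing itself is not recoverable here), and the names `FINAL_FREQ`, `TRANS_FREQ` and `to_dpfa` are hypothetical.

```python
from fractions import Fraction

# Halting frequency of each state (from the example).
FINAL_FREQ = {1: 3, 2: 1, 3: 2}
# (state, symbol) -> (target, frequency); the targets are assumed, not read off the figure.
TRANS_FREQ = {(1, "a"): (1, 5), (1, "b"): (2, 5),
              (2, "a"): (1, 2), (2, "b"): (3, 3),
              (3, "a"): (2, 1), (3, "b"): (3, 4)}


def to_dpfa(final_freq, trans_freq):
    """Divide every frequency by the total frequency leaving its state."""
    total = dict(final_freq)                       # halting counts as exiting
    for (state, _symbol), (_target, freq) in trans_freq.items():
        total[state] += freq
    final_prob = {q: Fraction(f, total[q]) for q, f in final_freq.items()}
    trans_prob = {key: (target, Fraction(freq, total[key[0]]))
                  for key, (target, freq) in trans_freq.items()}
    return final_prob, trans_prob
```

Exact fractions are used so that the results can be compared directly with the values 3/13, 5/13, 4/7 and so on.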

Zadar August 2010 45

From a DFA and a sample to a DFFA

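[Figure: the same three-state DFFA as in the previous example, with state frequencies 3, 1 and 2, transition frequencies b 5, b 3, a 1, a 2, a 5, b 4, and frequency 6 entering the initial state]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}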
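Counting the frequencies amounts to parsing each string of the sample and incrementing a counter on every transition used and on the state where the string halts. In the sketch below, the transition function `DELTA` is an assumption: it is a three-state DFA that reproduces the frequencies of the example on this sample, not a structure read from the figure.

```python
from collections import Counter

# Assumed DFA (state 1 is initial); chosen because it reproduces the example's counts.
DELTA = {(1, "a"): 1, (1, "b"): 2,
         (2, "a"): 1, (2, "b"): 3,
         (3, "a"): 2, (3, "b"): 3}
SAMPLE = ["", "aaaa", "ab", "babb", "bbbb", "bbbbaa"]   # "" stands for the empty string


def count_frequencies(delta, initial, sample):
    """Parse every string and count halting states and transitions used."""
    final_freq = Counter()
    trans_freq = Counter()
    for string in sample:
        state = initial
        for symbol in string:
            trans_freq[(state, symbol)] += 1
            state = delta[(state, symbol)]
        final_freq[state] += 1
    return final_freq, trans_freq
```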

Zadar August 2010 46

Note

Another sample may lead to the same DFFA.

Doing the same with an NFA is a much harder problem.

Typically what the Baum-Welch (EM) algorithm was invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multiset.

The FTA is the smallest tree-like FFA consistent with the data.

Can be transformed into a PFA if needed.

Zadar August 2010 48

From the sample to the FTA

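S = {λ(3), aaa(4), aaba(2), ababa(1), bb(3), bbaaa(1)}

[Figure: FTA(S), the frequency prefix tree acceptor of S. 14 strings enter the initial state, which has halting frequency 3; from the root, the transition a has frequency 7 and is followed by a with frequency 6, and the transition b has frequency 4 and is followed by b with frequency 4]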
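A frequency prefix tree acceptor can be sketched by identifying each state with a prefix of the sample. In the hypothetical helper below, `reach[u]` is the frequency of the transition entering state `u` (for the empty prefix, the frequency entering the initial state) and `final[u]` is the number of strings halting in `u`; this encoding is an assumption made for compactness.

```python
S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}


def build_fta(sample):
    """Return (final, reach) for the frequency prefix tree acceptor of sample."""
    final = {}
    reach = {"": 0}
    for string, multiplicity in sample.items():
        # every prefix of the string is a state the string passes through
        for i in range(len(string) + 1):
            prefix = string[:i]
            reach[prefix] = reach.get(prefix, 0) + multiplicity
        final[string] = final.get(string, 0) + multiplicity
    return final, reach
```

On the sample S this gives a tree with 15 states, one per distinct prefix.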

Zadar August 2010 49

Red, Blue and White states

[Figure: an automaton over {a, b} whose states are coloured Red, Blue and White]

- Red states are confirmed states
- Blue states are the (non Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does.

Zadar August 2010 50

Merge and fold

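Suppose we decide to merge two states (marked in the figure).

[Figure: a DFFA before the merge. 100 strings enter the initial state; labels recovered from the diagram: state frequencies 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6]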

Zadar August 2010 51

Merge and fold

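First disconnect one of the two states and reconnect its incoming transition to the other.

[Figure: the same DFFA with the incoming transition redirected; labels as on the previous slide]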

Zadar August 2010 52

Merge and fold

Then fold.

[Figure: the disconnected subtree being folded into the rest of the automaton; labels as on the previous slide]

Zadar August 2010 53

Merge and fold

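After folding:

[Figure: the DFFA after folding. 100 strings enter the initial state; transition labels recovered from the diagram: a 26, a 4, a 10, a 10, b 24, b 9, b 30]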
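The merge-and-fold step can be sketched on a frequency prefix tree. Everything below is an illustrative assumption (the `FFA` class, the list-based encoding, the function names); the sample is the learning multisample used in the run of Alergia later in these slides, with `""` standing for the empty string.

```python
SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3,
    "bbb": 6, "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2,
    "abab": 2, "abba": 2, "abbb": 1, "baaa": 2, "baab": 2, "baba": 1,
    "babb": 1, "bbaa": 1, "bbab": 1, "bbba": 1, "aaaaa": 1, "aaaab": 1,
    "aaaba": 1, "aabaa": 1, "aabab": 1, "aabba": 1, "abbaa": 1, "abbab": 1,
}


class FFA:
    """Deterministic frequency automaton; state 0 is the initial state."""

    def __init__(self):
        self.final = [0]      # final[q]: number of strings halting in q
        self.trans = [{}]     # trans[q][symbol] = [target, frequency]

    def add_state(self):
        self.final.append(0)
        self.trans.append({})
        return len(self.final) - 1


def build_fta(sample):
    ffa = FFA()
    for string, multiplicity in sample.items():
        q = 0
        for symbol in string:
            if symbol not in ffa.trans[q]:
                ffa.trans[q][symbol] = [ffa.add_state(), 0]
            edge = ffa.trans[q][symbol]
            edge[1] += multiplicity
            q = edge[0]
        ffa.final[q] += multiplicity
    return ffa


def fold(ffa, p, q):
    """Add the frequencies of the tree rooted in q onto state p."""
    ffa.final[p] += ffa.final[q]
    for symbol, (q_next, freq) in list(ffa.trans[q].items()):
        if symbol in ffa.trans[p]:
            edge = ffa.trans[p][symbol]
            edge[1] += freq
            fold(ffa, edge[0], q_next)           # keep folding along shared paths
        else:
            ffa.trans[p][symbol] = [q_next, freq]  # graft the remaining subtree


def merge_and_fold(ffa, p, parent, symbol):
    """Merge the target of (parent, symbol) into state p."""
    q = ffa.trans[parent][symbol][0]
    ffa.trans[parent][symbol][0] = p             # disconnect q, reconnect to p
    fold(ffa, p, q)
```

Merging the state reached by `a` into the initial state, with `merge_and_fold(ffa, 0, 0, "a")`, gives the initial state a halting frequency of 660, an a-loop of frequency 341 and an outgoing b-transition of frequency 340.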

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃ p ∈ Red : d(A_p, A_q) is small
        then A = merge_and_fold(A, p, q)
        else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

Zadar August 2010 55

The real question: how do we decide if d(A_p, A_q) is small?

- Use a distance…
- Be able to compute this distance
- If possible, update the computation easily
- Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested.

Distance (L2 norm) between distributions can be exactly computed.

But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

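Suppose we want to merge two states (marked in the figure).

[Figure: the DFFA of the merge-and-fold example. 100 strings enter the initial state; labels recovered from the diagram: state frequencies 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6]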

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar.

[Figure: the same DFFA, with the two sub-automata rooted in the candidate states shown side by side for comparison]

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

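Alergia's test

$D_1 \approx D_2$ if $\forall x\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$

Easier to test:
- $\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$
- $\forall a \in \Sigma$: $\Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$

And do this recursively! Of course, do it on frequencies.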
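The recursive test can be sketched directly on the sample, since in a prefix tree the frequency of a state is the number of sample strings having that prefix. The code below is an assumption-laden illustration: the function names are hypothetical, states are named by their prefixes, closeness of two relative frequencies is decided with the Hoeffding-style bound used by Alergia, and the small-frequency threshold is ignored. The sample is the learning multisample of the run shown later in these slides.

```python
import math

SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3,
    "bbb": 6, "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2,
    "abab": 2, "abba": 2, "abbb": 1, "baaa": 2, "baab": 2, "baba": 1,
    "babb": 1, "bbaa": 1, "bbab": 1, "bbba": 1, "aaaaa": 1, "aaaab": 1,
    "aaaba": 1, "aabaa": 1, "aabab": 1, "aabba": 1, "abbaa": 1, "abbab": 1,
}


def different(f1, n1, f2, n2, alpha):
    """True if f1/n1 and f2/n2 are further apart than the bound allows."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = math.sqrt(0.5 * math.log(2 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return gamma > bound


def prefix_count(sample, prefix):
    """Number of sample strings (with multiplicity) starting with prefix."""
    return sum(m for w, m in sample.items() if w.startswith(prefix))


def compatible(sample, u, v, alpha, alphabet="ab"):
    """Recursive compatibility test between the states reached by u and v."""
    n1, n2 = prefix_count(sample, u), prefix_count(sample, v)
    # compare the halting frequencies
    if different(sample.get(u, 0), n1, sample.get(v, 0), n2, alpha):
        return False
    for a in alphabet:
        f1, f2 = prefix_count(sample, u + a), prefix_count(sample, v + a)
        # compare the frequencies of leaving by a, then recurse on the successors
        if different(f1, n1, f2, n2, alpha):
            return False
        if f1 > 0 and f2 > 0 and not compatible(sample, u + a, v + a, alpha, alphabet):
            return False
    return True
```

With α = 0.05, `compatible(SAMPLE, "", "a", 0.05)` accepts, while `compatible(SAMPLE, "", "b", 0.05)` already fails on the halting frequencies 490/1000 and 170/253.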

Zadar August 2010 61

Hoeffding bounds

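$\gamma = \left|\frac{f_1}{n_1} - \frac{f_2}{n_2}\right|$

$\gamma < \sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} \left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right)$

indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.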
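As a numerical check of the bound, the sketch below evaluates it on frequencies that appear in the run of Alergia later in these slides, with α = 0.05. The function names are hypothetical. Recomputing |660/1341 − 225/340| in floating point gives roughly 0.17, slightly different from the value quoted in the run, but well above the bound of about 0.111 either way.

```python
import math


def hoeffding_bound(n1, n2, alpha):
    """Right-hand side of the test: sqrt(1/2 ln(2/alpha)) (1/sqrt(n1) + 1/sqrt(n2))."""
    return math.sqrt(0.5 * math.log(2 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))


def sufficiently_close(f1, n1, f2, n2, alpha):
    """True if the relative frequencies f1/n1 and f2/n2 pass the test."""
    gamma = abs(f1 / n1 - f2 / n2)
    return gamma < hoeffding_bound(n1, n2, alpha)
```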

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the Blue states whose frequency is less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: FTA(S) for the learning multisample. 1000 strings enter the initial state, whose halting frequency is 490; the transition a leaving the root has frequency 257 and leads to a state with halting frequency 128; the state reached by b has halting frequency 170]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, a and aa, b and ab:
- 490/1000 with 128/257
- 257/1000 with 64/257
- 253/1000 with 65/257

All tests return true.

Zadar August 2010 66

Merge…

[Figure: FTA(S) again, with the state reached by a about to be merged with the initial state; 1000 strings enter the initial state]

Zadar August 2010 67

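And fold

[Figure: the automaton after merging and folding. 1000 strings enter the initial state, which now has halting frequency 660, an a-loop of frequency 341 and a b-transition of frequency 340; the state reached by b has halting frequency 225 and outgoing transitions a 77 and b 38]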

Zadar August 2010 68

Next: merge λ with b?

[Figure: the same automaton as on the previous slide, with the state reached by b as the next candidate]

Zadar August 2010 69

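Can we merge λ and b?

Compare λ and b, a and ba, b and bb.

660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand, $\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} \left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right) = 0.111$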

Zadar August 2010 70

Promotion

[Figure: the same automaton; the state reached by b is promoted to Red]

Zadar August 2010 71

Merge

[Figure: the same automaton, with the next Blue state about to be merged]

Zadar August 2010 72

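And fold

[Figure: the automaton after this merge and fold; 1000 strings enter the initial state. Labels as extracted from the diagram: 660, 291, a 341, a 95, b 340, b 49, a 11, 29, 7, 8, b 9, 2, 1, 2, a 2, a 1, b 2]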

Zadar August 2010 73

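Merge

[Figure: the automaton before the last merge; 1000 strings enter the initial state. Labels as extracted from the diagram: 660, 225, a 341, a 95, b 340, b 49, a 11, 29, 7, 8, b 9, 2, 1, 2, a 2, a 1, b 2]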

Zadar August 2010 74

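And fold

[Figure: the final two-state DFFA. 1000 strings enter the initial state; halting frequencies 698 and 302; transitions a 354, a 96, b 351, b 49]

As a PFA

[Figure: the same automaton drawn as a PFA; labels as extracted from the diagram: 698, 302, a 354, a 096, b 351, b 049]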

Zadar August 2010 75

Conclusion and logic

- Alergia builds a DFFA in polynomial time
- Alergia can identify DPFA in the limit with probability 1
- No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string. Use norm $L_\infty$.

Two distributions are different if there is a string with a very different probability.

Such a string is called μ-distinguishable.

Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI

MDL-inspired heuristic.

Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

- A good candidate to learn NFA is DEES
- There has never been a challenge, so the state of the art is still unclear
- Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing transition a labelled 1501/3000]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

Zadar August 2010 85

[Figure: the first levels of the Stern-Brocot tree]

0/1, 1/0
1/1
1/2, 2/1
1/3, 2/3, 3/2, 3/1
1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
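The search can be sketched as a descent in the tree using the «mean» operation, stopping at the first fraction close enough to the estimate. The tolerance `eps` and the function name are assumptions made for this illustration; how close is close enough is exactly what a confidence bound has to decide.

```python
from fractions import Fraction


def stern_brocot_approx(x, eps):
    """First fraction met on the Stern-Brocot path to x that is within eps of x.

    x is a non-negative Fraction and eps a non-negative tolerance.
    """
    lo_n, lo_d = 0, 1          # left boundary 0/1
    hi_n, hi_d = 1, 0          # right boundary 1/0
    while True:
        mediant = Fraction(lo_n + hi_n, lo_d + hi_d)
        if abs(mediant - x) <= eps:
            return mediant
        if mediant < x:
            lo_n, lo_d = mediant.numerator, mediant.denominator
        else:
            hi_n, hi_d = mediant.numerator, mediant.denominator
```

With the estimate 1501/3000 and a tolerance of 1/100, the search stops at 1/2 after two steps, which is the kind of simple value one would expect the true probability to be.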

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have:

$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \lambda \sqrt{\frac{\log \log n}{n}}, \quad \lambda > 1$


Zadar August 2010 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and

$\forall q \in Q, \quad F_P(q) + \sum_{q' \in Q,\, a \in \Sigma} P(q, a, q') = 1$
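The numerical half of this condition is easy to check mechanically. The sketch below tests only the sum-to-one equation for each state; it does not test that every state is useful. The two-state PFA and all names are hypothetical, chosen for illustration.

```python
import math

# Hypothetical two-state PFA, for illustration only.
STATES = [0, 1]
FINAL = {0: 0.2, 1: 0.5}                              # F_P(q)
TRANS = {(0, "a", 0): 0.3, (0, "a", 1): 0.5,          # P(q, a, q')
         (1, "b", 1): 0.5}


def sums_to_one(states, final, trans, tol=1e-9):
    """Check F_P(q) + sum of outgoing transition probabilities = 1 for every q."""
    for q in states:
        outgoing = sum(p for (src, _a, _dst), p in trans.items() if src == q)
        if not math.isclose(final.get(q, 0.0) + outgoing, 1.0, abs_tol=tol):
            return False
    return True
```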

Zadar August 2010 22

Equivalence between models

Equivalence between PFA and HMM…

But the HMM usually define distributions over each $\Sigma^n$

Zadar August 2010 23

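A football HMM

[Figure: a hidden Markov model whose states each emit win, draw or lose; the transition and emission probabilities are fractions such as 1/4, 1/2 and 3/4, whose placement is not recoverable from the extraction]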

Zadar August 2010 24

Equivalence between PFA with λ-transitions and PFA without λ-transitions

cdlh 2003, Hanneforth & cdlh 2009

- Many initial states can be transformed into one initial state with λ-transitions
- λ-transitions can be removed in polynomial time
- Strategy: number the states; eliminate first λ-loops, then the transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You can't even tell in advance if you are in a good case or not

(see Denis & Esposito 2004)

Zadar August 2010 26

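Example

[Figure: a PFA over the one-letter alphabet {a}; its probability labels are fractions whose placement is not recoverable from the extraction]

This distribution cannot be modelled by a DPFA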

Zadar August 2010 27

What does a DPFA over Σ = {a} look like?

[Figure: a chain of states linked by a-transitions (a … a), ending in a cycle of a-transitions]

And with this architecture you cannot generate the previous one

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case:
- Simple: apply definitions
- Technically, rather sum up logs: this is easier, safer and cheaper
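In the deterministic case there is a single path to follow, so summing logs is a short loop. The DPFA below and the names used are hypothetical, for illustration only; a string that cannot be parsed gets log-probability minus infinity.

```python
import math

# Hypothetical DPFA: (state, symbol) -> (next state, probability); state 0 is initial.
TRANS = {(0, "a"): (1, 0.7), (0, "b"): (0, 0.2),
         (1, "b"): (1, 0.7)}
FINAL = {0: 0.1, 1: 0.3}


def dpfa_log_probability(string, initial, final, trans):
    """Natural log of Pr(string), following the unique path in a DPFA."""
    state = initial
    log_p = 0.0
    for symbol in string:
        if (state, symbol) not in trans:
            return -math.inf                 # no transition: probability 0
        state, p = trans[(state, symbol)]
        if p == 0.0:
            return -math.inf
        log_p += math.log(p)
    halting = final.get(state, 0.0)
    if halting == 0.0:
        return -math.inf
    return log_p + math.log(halting)
```

Working with logs avoids the underflow that a long product of small probabilities would cause.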



Zadar August 2010 22

Equivalence between modelsEquivalence between PFA and

HMMhellipBut the HMM usually define

distributions over each n

Zadar August 2010 23

41

21

win draw lose win draw lose win draw lose

21

21

34

12

14

14

34

14

14

41

41

41

41

41

A football HMM

Zadar August 2010 24

Equivalence between PFA with -transitions and PFA without -transitions

cdlh 2003 Hanneforth amp cdlh 2009 Many initial states can be transformed into

one initial state with -transitions -transitions can be removed in

polynomial time Strategy

number the states eliminate first -loops then the

transitions with highest ranking arrival state

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

31

21

21

21

21

32

a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a Compare and a a and aa b

and ab 4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b Compare and b a and ba b

and bb 6601341 and 225340 are different

(giving = 0162) On the other hand

11102ln2111

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

[Figure: the same automaton (initial state: halting 660, a 341, b 340; state b: halting 225, a 77, b 38), before the next merge.]

Merge

Zadar August 2010 72

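[Figure: the automaton after this merge and fold; 1000 strings enter the initial state. Labels appearing on the slide: 660, 291, a 341, a 95, b 340, b 49, a 11, b 9.]

And fold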

Zadar August 2010 73

[Figure: the automaton before the last merge; 1000 strings enter the initial state. Labels appearing on the slide: 660, 225, a 341, a 95, b 340, b 49, a 11, b 9.]

Merge

Zadar August 2010 74

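[Figure: the final two-state DFFA; 1000 strings enter the initial state. Halting frequencies 698 and 302; transitions a 354, a 96, b 351, b 49.]

And fold

[Figure: the same automaton as a PFA. Labels as they appear on the slide: 0.698, 0.302, a 0.354, a 0.096, b 0.351, b 0.049.]

As a PFA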

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time.

Alergia can identify DPFA in the limit with probability 1.

No good definition of Alergia's properties.

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string. Use norm $L_\infty$.

Two distributions are different if there is a string with a very different probability.

Such a string is called μ-distinguishable.

Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI

MDL-inspired heuristic.

Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES.

Never has been a challenge, so state of the art is still unclear.

Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars.

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing edge labelled a, 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation m:

$\frac{a}{b} \mathbin{m} \frac{c}{d} = \frac{a+c}{b+d}$

Zadar August 2010 85

[Figure: the first levels of the Stern-Brocot tree.]

0/1, 1/0

1/1

1/2, 2/1

1/3, 2/3, 3/2, 3/1

1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
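The search can be sketched as a descent through mediants. This is an illustration under our own assumptions: the value is a probability, so the walk starts between 0/1 and 1/1, and the tolerance `tol` is a free parameter chosen by the caller. The name `simple_fraction` is hypothetical.

```python
from fractions import Fraction

def simple_fraction(x, tol):
    """Descend the Stern-Brocot tree towards x in [0, 1]; stop at the first
    fraction within tol of x."""
    lo, hi = Fraction(0, 1), Fraction(1, 1)
    for end in (lo, hi):
        if abs(x - end) <= tol:
            return end
    while True:
        # Mediant of the two current neighbours: (a + c) / (b + d)
        mid = Fraction(lo.numerator + hi.numerator,
                       lo.denominator + hi.denominator)
        if abs(x - mid) <= tol:
            return mid
        if x < mid:
            hi = mid
        else:
            lo = mid

estimate = Fraction(1501, 3000)          # c(x)/n from the estimation slide
print(simple_fraction(estimate, Fraction(1, 100)))
```

With the estimate 1501/3000 and a tolerance of 0.01, the first mediant visited is 1/2, which is already close enough.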

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n we have:

$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \sqrt{\frac{\lambda \log\log n}{n}} \qquad \forall \lambda > 1$


Zadar August 2010 23

A football HMM

[Figure: a hidden Markov model whose states each emit win, draw or lose, with fractional probabilities on the transitions and emissions.]

Zadar August 2010 24

Equivalence between PFA with λ-transitions and PFA without λ-transitions

(cdlh 2003, Hanneforth & cdlh 2009)

Many initial states can be transformed into one initial state with λ-transitions.

λ-transitions can be removed in polynomial time.

Strategy: number the states; eliminate first the λ-loops, then the transitions with highest ranking arrival state.

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You can't even tell in advance if you are in a good case or not

(see Denis & Esposito 2004)

Zadar August 2010 26

Example

[Figure: a PFA over the single letter a; the fractions 1/3, 1/2 and 2/3 appear on the slide.]

This distribution cannot be modelled by a DPFA

Zadar August 2010 27

What does a DPFA over Σ = {a} look like?

[Figure: a chain of states linked by a-transitions (a … a), followed by further a-transitions.]

And with this architecture you cannot generate the previous one

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case: simple, apply definitions.

Technically, rather sum up logs: this is easier, safer and cheaper.

Zadar August 2010 29

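Pr(aba) = 0.7 × 0.9 × 0.35 × 0 = 0

Pr(abb) = 0.7 × 0.9 × 0.65 × 0.3 = 0.12285

[Figure: a DPFA over {a, b}; the probabilities 0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3 and 0.7 appear on the slide.]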
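The advice to sum logs rather than multiply can be illustrated on these two products. In this sketch the factor lists are simply read off the two equations, and the helper name `path_log_prob` is ours.

```python
import math

def path_log_prob(factors):
    """Log-probability of a deterministic parse: the sum of the logs of its factors.
    A zero factor makes the whole string impossible (log-probability -inf)."""
    total = 0.0
    for p in factors:
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

log_abb = path_log_prob([0.7, 0.9, 0.65, 0.3])
log_aba = path_log_prob([0.7, 0.9, 0.35, 0.0])
print(math.exp(log_abb), math.exp(log_aba))
```

Working in log space avoids underflow on long strings, and comparing two strings only needs a comparison of sums.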

Zadar August 2010 30

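Non-deterministic case

Pr(aba) = 0.7 × 0.4 × 0.1 × 1 + 0.7 × 0.4 × 0.45 × 0.2 = 0.028 + 0.0252 = 0.0532

[Figure: a non-deterministic PFA over {a, b}; the probabilities 0.2, 0.1, 1, 0.45, 0.35, 0.4, 0.7, 0.3, 0.1 and 0.4 appear on the slide.]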

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming: $O(n^2 m)$.

2 algorithms: Backward and Forward.

If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.

Zadar August 2010 32

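Forward algorithm

$A[i,j] = \Pr(q_i \mid a_1 \cdots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \cdots a_j$)

$A[i,0] = I_P(q_i)$

$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot P(q_k, a_{j+1}, q_i)$

$\Pr(a_1 \cdots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$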
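The recurrence translates almost line by line into code. The sketch below keeps only the current column of A; the two-state PFA used to exercise it is a made-up toy (not one of the automata in the slides), and the dictionary encoding of P is our own choice.

```python
def forward_probability(string, initial, final, trans):
    """Forward algorithm for a PFA without lambda-transitions.
    initial[i] = I_P(q_i), final[i] = F_P(q_i),
    trans[(k, a, i)] = P(q_k, a, q_i); missing keys count as 0."""
    n_states = len(initial)
    column = list(initial)                      # A[., 0]
    for a in string:
        column = [
            sum(column[k] * trans.get((k, a, i), 0.0) for k in range(n_states))
            for i in range(n_states)
        ]                                       # A[., j+1] from A[., j]
    return sum(column[k] * final[k] for k in range(n_states))

# Hypothetical toy PFA: each state's halting and outgoing probabilities sum to 1.
initial = [1.0, 0.0]
final = [0.2, 0.5]
trans = {(0, "a", 0): 0.3, (0, "a", 1): 0.5, (1, "b", 1): 0.5}

print(forward_probability("ab", initial, final, trans))
```

For this toy automaton the string "a" has two parses (staying in state 0 or moving to state 1), and the forward pass sums both.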

Zadar August 2010 33

2 Distances

What for?

Estimate the quality of a language model

Have an indicator of the convergence of learning algorithms

Construct kernels

Zadar August 2010 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over Σ*: D and D'

Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} \Pr_D(w)\,\bigl(\log \Pr_D(w) - \log \Pr_{D'}(w)\bigr)$

Zadar August 2010 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relatively to a test set (S).

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

Zadar August 2010 36

$\prod_{w \in S} \Pr_D(w)^{-\frac{1}{|S|}} = \dfrac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$

Problem if some probability is null
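A minimal sketch of this definition, computed through logs so that long test sets do not underflow. The function name and the choice to return infinity when a test string has probability 0 are our own conventions for the "null probability" problem noted above.

```python
import math

def perplexity(probabilities):
    """Inverse of the geometric mean of the probabilities of the test strings."""
    if any(p == 0.0 for p in probabilities):
        return float("inf")            # a null probability makes perplexity unbounded
    mean_log = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-mean_log)

print(perplexity([0.25, 1.0]))         # geometric mean is 0.5
print(perplexity([0.5, 0.0]))
```

In practice the zero case is usually avoided by smoothing the model before evaluating it.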

Zadar August 2010 37

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S.

Zadar August 2010 38

Why multiply? (2)

Suppose we have two predictors for a coin toss:

Predictor 1: heads 60%, tails 40%
Predictor 2: heads 100%

The tests are H: 6, T: 4

Arithmetic mean:

P1: 0.36 + 0.16 = 0.52
P2: 0.6

Predictor 2 is the better predictor ;-)
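The point of the example shows up as soon as both means are computed side by side. This sketch uses the numbers from the slide; the variable and function names are ours.

```python
import math

tests = {"H": 6, "T": 4}
predictors = {
    "P1": {"H": 0.6, "T": 0.4},
    "P2": {"H": 1.0, "T": 0.0},
}
total = sum(tests.values())

def arithmetic_mean(pred):
    return sum(count * pred[outcome] for outcome, count in tests.items()) / total

def geometric_mean(pred):
    # A single impossible observation drives the product, and the mean, to 0.
    if any(pred[outcome] == 0.0 for outcome, count in tests.items() if count > 0):
        return 0.0
    return math.exp(
        sum(count * math.log(pred[outcome]) for outcome, count in tests.items()) / total
    )

for name, pred in predictors.items():
    print(name, arithmetic_mean(pred), geometric_mean(pred))
```

The arithmetic mean ranks Predictor 2 first, while the geometric mean gives it 0 because it assigns probability 0 to the four observed tails.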

Zadar August 2010 39

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \bigl(\Pr_D(w) - \Pr_{D'}(w)\bigr)^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).

This also means that equivalence of PFA is in P.

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset.

Strings appear with a frequency (or multiplicity).

S = {λ(3), aaa(4), aaba(2), ababa(1), bb(3), bbaaa(1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that:

the sum of what enters is equal to what exits, and

the sum of what halts is equal to what starts.

Zadar August 2010 43

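Example

[Figure: a three-state DFFA. 6 strings enter the initial state; halting frequencies 3, 1 and 2; transitions b 5, b 3, a 1, a 2, a 5, b 4.]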

Zadar August 2010 44

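From a DFFA to a DPFA

[Figure: the same automaton with relative frequencies. Entering the initial state: 6/6; halting probabilities 3/13, 1/6 and 2/7; transitions b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7.]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies.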
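The normalisation is a one-liner per state. In this sketch the state names q1, q2, q3 are hypothetical, the grouping of frequencies per state is inferred from the denominators 13, 6 and 7 on the slide, and transition targets are left out because the division does not need them.

```python
from fractions import Fraction

# Per state: halting frequency and outgoing transition frequencies.
dffa = {
    "q1": {"halt": 3, "out": {"a": 5, "b": 5}},
    "q2": {"halt": 1, "out": {"a": 2, "b": 3}},
    "q3": {"halt": 2, "out": {"a": 1, "b": 4}},
}

def to_dpfa(freq_automaton):
    """Divide every frequency by the total frequency leaving its state
    (halting counts as one way of leaving)."""
    dpfa = {}
    for state, info in freq_automaton.items():
        total = info["halt"] + sum(info["out"].values())
        dpfa[state] = {
            "halt": Fraction(info["halt"], total),
            "out": {a: Fraction(f, total) for a, f in info["out"].items()},
        }
    return dpfa

dpfa = to_dpfa(dffa)
print(dpfa["q1"]["halt"], dpfa["q2"]["out"]["b"], dpfa["q3"]["out"]["b"])
```

Using exact fractions keeps the result directly comparable with the values printed on the slide (3/6 is shown reduced as 1/2).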

Zadar August 2010 45

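From a DFA and a sample to a DFFA

[Figure: the same three-state automaton, annotated with the frequencies obtained by parsing the sample: 6 strings entering; halting frequencies 3, 1 and 2; transitions b 5, b 3, a 1, a 2, a 5, b 4.]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}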

Zadar August 2010 46

Note

Another sample may lead to the same DFFA.

Doing the same with an NFA is a much harder problem.

Typically what the Baum-Welch (EM) algorithm was invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multiset.

The FTA is the smallest tree-like FFA consistent with the data.

Can be transformed into a PFA if needed.

Zadar August 2010 48

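From the sample to the FTA

S = {λ(3), aaa(4), aaba(2), ababa(1), bb(3), bbaaa(1)}

[Figure: FTA(S). 14 strings enter the initial state, which has halting frequency 3 and transitions a 7 and b 4; below it, a 6 then a 4 and b 2 on the a-branch, and b 4 on the b-branch.]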
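Building the FTA amounts to counting, for every prefix, how many strings pass through it and how many stop there. This is a minimal sketch with states named by their prefix; the function name `build_fta` and the flat dictionary encoding are our own.

```python
from collections import defaultdict

def build_fta(sample):
    """Return (halt, edge): halt[prefix] is the number of strings ending there,
    edge[(prefix, symbol)] the number of strings using that transition."""
    halt = defaultdict(int)
    edge = defaultdict(int)
    for word, count in sample.items():
        for i, symbol in enumerate(word):
            edge[(word[:i], symbol)] += count
        halt[word] += count
    return dict(halt), dict(edge)

S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
halt, edge = build_fta(S)
print(sum(S.values()), halt[""], edge[("", "a")], edge[("", "b")])
```

The counts at the root (14 entering, 3 halting, a 7, b 4) agree with the figure.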

Zadar August 2010 49

Red, Blue and White states

[Figure: an automaton over {a, b} with its states coloured Red, Blue and White.]

- Red states are confirmed states
- Blue states are the (non-Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

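Merge and fold

Suppose we decide to merge one state with another (the two states are highlighted in the figure).

[Figure: a frequency automaton with 100 strings entering the initial state; halting frequencies 60, 9, 10, 11, 6 and 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6.]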

Zadar August 2010 51

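Merge and fold

First disconnect the state to be merged and reconnect its incoming transition to the other state.

[Figure: the same automaton (100 strings entering; halting frequencies 60, 9, 10, 11, 6 and 4), with the incoming transition redirected.]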

Zadar August 2010 52

Merge and fold

Then fold

[Figure: the same automaton during folding (100 strings entering; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6).]

Zadar August 2010 53

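Merge and fold

After folding

[Figure: the folded automaton, with 100 strings entering; halting frequencies 60, 10, 9, 10, 11 and 4; transitions a 26, a 4, a 10, a 10, b 24, b 9, b 30.]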

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

Zadar August 2010 55

The real question

How do we decide if d(A_p, A_q) is small?

Use a distance…

Be able to compute this distance

If possible, update the computation easily

Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested.

Distance (L2 norm) between distributions can be exactly computed.

But what if the two distributions are unknown?

Zadar August 2010 57

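Taking decisions

Suppose we want to merge one state with another (the two states are highlighted in the figure).

[Figure: the frequency automaton from the merge-and-fold example, with 100 strings entering the initial state; halting frequencies 60, 9, 10, 11, 6 and 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6.]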

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar.

[Figure: the same automaton, with the two sub-automata rooted in the candidate states shown side by side for comparison.]

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

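Alergia's test

$D_1 \approx D_2$ if $\forall x\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$

Easier to test:

$\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$

$\forall a \in \Sigma\quad \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$

And do this recursively!

Of course, do it on frequencies.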

Zadar August 2010 61

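Hoeffding bounds

$\gamma = \left|\frac{f_1}{n_1} - \frac{f_2}{n_2}\right|$

$\gamma < \sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}\left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right)$

γ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.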



Page 25: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 25

PFA are strictly more powerful than DPFA

Folk theorem

(and) You canrsquot even tell in advance if you are in a good case or not

(see Denis amp Esposito 2004)

Zadar August 2010 26

31

21

21

21

21

32

a

a

a

a

This distribution cannot be modelled by a DPFA

Example

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S.

Why multiply? (2)

Suppose we have two predictors for a coin toss:
- Predictor 1: heads 60%, tails 40%
- Predictor 2: heads 100%

The tests are H: 6, T: 4.

Arithmetic mean:
- P1: 0.36 + 0.16 = 0.52
- P2: 0.6

Predictor 2 is the better predictor ;-)
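The coin-toss example can be checked numerically. The sketch below (variable and function names are made up) computes both the arithmetic and the geometric mean for the two predictors: with the geometric mean, Predictor 2 scores 0 because it gives probability 0 to the four tails.

```python
import math

outcomes = ["H"] * 6 + ["T"] * 4               # the test set: 6 heads, 4 tails
predictor1 = {"H": 0.6, "T": 0.4}
predictor2 = {"H": 1.0, "T": 0.0}

def arithmetic_mean(predictor):
    return sum(predictor[o] for o in outcomes) / len(outcomes)

def geometric_mean(predictor):
    return math.prod(predictor[o] for o in outcomes) ** (1 / len(outcomes))
```

The arithmetic mean ranks Predictor 2 first (0.6 against 0.52), while the geometric mean ranks Predictor 1 first, which is the point of the slide's irony.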

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left( \Pr_D(w) - \Pr_{D'}(w) \right)^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).

This also means that equivalence of PFA is in P.

3 FFA

Frequency Finite (state) Automata

A learning sample is a multiset. Strings appear with a frequency (or multiplicity):

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, for every transition, and for entering the initial state, such that:
- the sum of what enters is equal to what exits, and
- the sum of what halts is equal to what starts.

Example

[Figure: example DFFA with three states; entering frequency 6; halting frequencies 3, 1, 2; transitions b 5, b 3, a 1, a 2, a 5, b 4]

From a DFFA to a DPFA

[Figure: the same automaton as a DPFA; entering probability 6/6; halting probabilities 3/13, 1/6, 2/7; transitions b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies.
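The normalisation step is easy to sketch. The frequencies below are those of the example (grouped by the denominators 13, 6 and 7 shown on the slide); the state names `q1`, `q2`, `q3` and the function name are made up, and transition targets are left out because they play no role in the normalisation.

```python
from fractions import Fraction

# Exiting frequencies of each state: halting frequency plus one entry per symbol.
DFFA = {
    "q1": {"halt": 3, "a": 5, "b": 5},
    "q2": {"halt": 1, "a": 2, "b": 3},
    "q3": {"halt": 2, "a": 1, "b": 4},
}

def to_relative_frequencies(dffa):
    """Divide every exiting frequency of a state by the sum of its exiting frequencies."""
    dpfa = {}
    for state, freqs in dffa.items():
        total = sum(freqs.values())
        dpfa[state] = {label: Fraction(f, total) for label, f in freqs.items()}
    return dpfa

DPFA = to_relative_frequencies(DFFA)
```

`Fraction` reduces 3/6 to 1/2 automatically; the values are otherwise the ones shown on the slide.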

From a DFA and a sample to a DFFA

[Figure: the same DFFA as in the example; entering frequency 6; halting frequencies 3, 1, 2; transitions b 5, b 3, a 1, a 2, a 5, b 4]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}
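Counting the frequencies amounts to parsing each string of the sample and incrementing a counter on every transition used and on the state where the string halts. In the sketch below the transition function `DELTA` is an assumption: the arrows of the figure are not recoverable from the text, so a structure was chosen that reproduces the frequencies of the example. State and function names are made up.

```python
from collections import Counter

# Assumed DFA structure, chosen to be consistent with the frequencies of the example.
DELTA = {
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q3",
    ("q3", "a"): "q2", ("q3", "b"): "q3",
}
INITIAL = "q1"
SAMPLE = ["", "aaaa", "ab", "babb", "bbbb", "bbbbaa"]   # "" stands for the empty string

def count_frequencies(sample):
    """Parse every string and count entries, transition uses and halts."""
    entering = 0
    halt = Counter()
    trans = Counter()
    for word in sample:
        state = INITIAL
        entering += 1
        for symbol in word:
            trans[(state, symbol)] += 1
            state = DELTA[(state, symbol)]
        halt[state] += 1
    return entering, halt, trans

entering, halt, trans = count_frequencies(SAMPLE)
```

With a deterministic automaton each string has a single parse, so the counts are unambiguous; with an NFA the same string may follow several paths, which is what makes that case harder.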

Note

- Another sample may lead to the same DFFA.
- Doing the same with an NFA is a much harder problem.
- Typically what the Baum-Welch (EM) algorithm has been invented for…

The frequency prefix tree acceptor

- The data is a multiset.
- The FTA is the smallest tree-like FFA consistent with the data.
- Can be transformed into a PFA if needed.

From the sample to the FTA

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S); entering frequency 14; halting frequencies 3, 2, 4, 3, 1, 1; transitions a 7, a 6, a 4, a 2, b 4, b 4, b 2, and a 1 or b 1 on the remaining branches]
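Building the FTA from the multiset can be sketched as below, identifying each state with the prefix that leads to it. The function name and the representation (two counters) are choices made for this illustration.

```python
from collections import Counter

# The sample of the slide; "" stands for the empty string.
SAMPLE = Counter({"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1})

def build_fta(sample):
    """Frequency prefix tree acceptor: states are prefixes of the sample strings."""
    halt = Counter()     # prefix -> number of strings ending in that state
    trans = Counter()    # (prefix, symbol) -> number of strings using that edge
    for word, count in sample.items():
        for i, symbol in enumerate(word):
            trans[(word[:i], symbol)] += count
        halt[word] += count
    return halt, trans

halt, trans = build_fta(SAMPLE)
```

The counts agree with the figure: 14 strings enter the root, 3 halt there, 7 follow a and 4 follow b.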

Red, Blue and White states

[Figure: an automaton over {a, b} whose states are coloured Red, Blue and White]

- Red states are confirmed states
- Blue states are the (non Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does.

Merge and fold

Suppose we decide to merge one state with another (the two states are marked in the figure).

[Figure: DFFA before the merge; entering frequency 100; numeric state labels 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6]

First disconnect, and reconnect to the other state.

[Figure: the same DFFA with the incoming transition redirected]

Then fold.

[Figure: the same DFFA during folding]

After folding:

[Figure: DFFA after folding; entering frequency 100; numeric state labels 60, 10, 9, 10, 11, 4; transitions a 26, a 4, a 10, a 10, b 24, b 9, b 30]
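The two steps can be sketched on a tiny example. Everything here is an assumption made for illustration: the dictionary-based representation, the helper `_fold`, and the three-state FTA used as input (it is not the automaton of the figures). The key point is that folding adds frequencies, as on the slide where a 6 and a 4 become a 10 and b 24 and b 6 become b 30.

```python
def merge_and_fold(halt, trans, red, blue):
    """Merge state blue (root of a tree) into state red, in place.

    halt:  state -> halting frequency
    trans: (state, symbol) -> (target state, frequency)
    """
    # Disconnect blue and reconnect its incoming transition to red.
    for key, (target, freq) in list(trans.items()):
        if target == blue:
            trans[key] = (red, freq)
    _fold(halt, trans, red, blue)

def _fold(halt, trans, keep, drop):
    """Add the frequencies of the subtree rooted at drop into keep, recursively."""
    halt[keep] += halt.pop(drop)
    for (state, symbol), (target, freq) in list(trans.items()):
        if state != drop:
            continue
        del trans[(drop, symbol)]
        if (keep, symbol) in trans:
            kept_target, kept_freq = trans[(keep, symbol)]
            trans[(keep, symbol)] = (kept_target, kept_freq + freq)
            _fold(halt, trans, kept_target, target)
        else:
            trans[(keep, symbol)] = (target, freq)   # graft the subtree as is

# Hypothetical FTA for the multiset {empty string (3), a (2), aa (1)}.
halt = {"": 3, "a": 2, "aa": 1}
trans = {("", "a"): ("a", 3), ("a", "a"): ("aa", 1)}
merge_and_fold(halt, trans, red="", blue="a")
```

After the call a single state remains, halting 6 times with an a-loop of frequency 4: what enters (6 + 4) still equals what exits.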

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃ p ∈ Red such that d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red, a ∈ Σ} − Red

The real question

How do we decide if d(A_p, A_q) is small?
- Use a distance…
- Be able to compute this distance
- If possible, update the computation easily
- Have properties related to this distance

Deciding if two distributions are similar

- If the two distributions are known, equality can be tested.
- Distance (L2 norm) between distributions can be exactly computed.
- But what if the two distributions are unknown?

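Taking decisions

Suppose we want to merge one state with another (the two states are marked in the figure).

[Figure: the DFFA of the merge-and-fold example; entering frequency 100; numeric state labels 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6]

Yes, if the two distributions induced are similar.

[Figure: the same DFFA, with the parts rooted at the two candidate states; extra labels 9, 11, 4, a 4, a 4, b 24, b 9]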

5 Alergia

Alergia's test

$D_1 \approx D_2$ if $\forall x \; \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$

Easier to test:
- $\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$
- $\forall a \in \Sigma \; \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$

And do this recursively! Of course, do it on frequencies.

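Hoeffding bounds

$\gamma = \left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right|$

$\gamma < \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right)$

γ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.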
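The test translates directly into code. The sketch below assumes the usual form of the bound, with a confidence parameter α; the function name and the numbers in the usage note are made up.

```python
import math

def sufficiently_close(f1, n1, f2, n2, alpha):
    """Hoeffding-style test: are the relative frequencies f1/n1 and f2/n2 compatible?"""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = math.sqrt(0.5 * math.log(2 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return gamma < bound
```

With α = 0.05 and n1 = n2 = 100 the bound is about 0.27, so 50/100 and 52/100 are judged compatible while 10/100 and 90/100 are not. The bound shrinks as the counts grow, so the test becomes more demanding with more data.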

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

- Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.
- Note that for the Blue states which have a frequency less than the threshold, a special merging operation takes place.

[Figure: FTA(S) for the multisample; 1000 strings enter the root, 490 halt there, 257 follow a and 253 follow b; the a-state halts 128 times and continues with a 64 and b 65; the b-state halts 170 times and continues with a 57 and b 26]

Can we merge λ and a?

Compare λ and a, a and aa, b and ab:
- 490/1000 with 128/257
- 257/1000 with 64/257
- 253/1000 with 65/257

All tests return true.
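These three comparisons can be reproduced from the multisample. The sketch below recomputes the prefix counts and applies a Hoeffding-style test with α = 0.05; the function names are made up, and only the top-level comparisons are shown (the real test continues recursively on the successors).

```python
import math
from collections import Counter

# The learning multisample; "" stands for the empty string.
SAMPLE = Counter({
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3, "bbb": 6,
    "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2, "abab": 2, "abba": 2,
    "abbb": 1, "baaa": 2, "baab": 2, "baba": 1, "babb": 1, "bbaa": 1, "bbab": 1,
    "bbba": 1, "aaaaa": 1, "aaaab": 1, "aaaba": 1, "aabaa": 1, "aabab": 1,
    "aabba": 1, "abbaa": 1, "abbab": 1,
})
ALPHA = 0.05

def prefix_count(prefix):
    """Number of sample strings (with multiplicity) that start with prefix."""
    return sum(count for word, count in SAMPLE.items() if word.startswith(prefix))

def close_enough(f1, n1, f2, n2, alpha=ALPHA):
    gamma = abs(f1 / n1 - f2 / n2)
    bound = math.sqrt(0.5 * math.log(2 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return gamma < bound

n_root, n_a = prefix_count(""), prefix_count("a")
tests = [
    close_enough(SAMPLE[""], n_root, SAMPLE["a"], n_a),                # halting
    close_enough(prefix_count("a"), n_root, prefix_count("aa"), n_a),  # a-successors
    close_enough(prefix_count("b"), n_root, prefix_count("ab"), n_a),  # b-successors
]
```

The recomputed counts are 1000, 257, 253, 64 and 65, matching the fractions on the slide.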

Merge…

[Figure: the same FTA(S), entering frequency 1000, about to be merged]

And fold

[Figure: the DFFA after merging λ with a and folding; entering frequency 1000; the merged state halts 660 times, with an a-loop of frequency 341 and a b-transition of frequency 340 to a state halting 225 times; further transitions a 77, b 38, a 16, a 10, b 9, b 8]

Next, merge λ with b?

[Figure: the same DFFA, entering frequency 1000]

Can we merge λ and b?

Compare λ and b, a and ba, b and bb.

660/1341 and 225/340 are different (giving γ ≈ 0.170).

On the other hand, $\sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \approx 0.111$
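For reference, the two sides of this comparison can be worked out by hand, assuming the Hoeffding-style bound with α = 0.05, n1 = 1341 and n2 = 340:

```latex
\gamma = \left| \frac{660}{1341} - \frac{225}{340} \right|
       \approx \left| 0.4922 - 0.6618 \right| \approx 0.170

\sqrt{\tfrac{1}{2} \ln \tfrac{2}{0.05}}
  \left( \frac{1}{\sqrt{1341}} + \frac{1}{\sqrt{340}} \right)
  \approx 1.358 \times (0.0273 + 0.0542) \approx 0.111
```

Since γ exceeds the bound, the test fails and the two states are not merged.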

Promotion

[Figure: the same DFFA, entering frequency 1000, with state b promoted to Red]

Merge

[Figure: the same DFFA, showing the next merge]

And fold

[Figure: DFFA after folding; entering frequency 1000; transitions a 341, a 95, b 340, b 49, a 11, b 9]

Merge

[Figure: the same DFFA, showing the next merge]

And fold

[Figure: final DFFA with two states; entering frequency 1000; halting frequencies 698 and 302; transitions a 354, a 96, b 351, b 49]

As a PFA

[Figure: the same two-state automaton as a PFA; labels as extracted: 698, 302, a 354, a 096, b 351, b 049]

Conclusion and logic

- Alergia builds a DFFA in polynomial time.
- Alergia can identify DPFA in the limit with probability 1.
- No good definition of Alergia's properties.

6 DSAI and MDI

Why not change the criterion?

Criterion for DSAI

- Using a distinguishable string
- Use norm L∞
- Two distributions are different if there is a string with a very different probability.
- Such a string is called μ-distinguishable.
- Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?
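On finite distributions the criterion is simple to state in code. This is only an illustration of the definition, not of the DSAI algorithm itself, which works on estimates from samples; the function names and the two example distributions are made up.

```python
def linf_distance(d1, d2):
    """Largest difference in probability over all strings of the two supports."""
    support = set(d1) | set(d2)
    return max((abs(d1.get(w, 0.0) - d2.get(w, 0.0)) for w in support), default=0.0)

def distinguishing_string(d1, d2, mu):
    """Return a shortest string whose probabilities differ by more than mu, or None."""
    for w in sorted(set(d1) | set(d2), key=lambda s: (len(s), s)):
        if abs(d1.get(w, 0.0) - d2.get(w, 0.0)) > mu:
            return w
    return None

d1 = {"": 0.5, "a": 0.3, "b": 0.2}
d2 = {"": 0.5, "a": 0.1, "b": 0.4}
```

Here the two distributions agree on the empty string but differ by 0.2 on both a and b, so they are distinguishable for μ = 0.15 and not for μ = 0.25.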

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

Criterion for MDI

- MDL inspired heuristic
- Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

7 Conclusion and open questions

- A good candidate to learn NFA is DEES.
- Never has been a challenge, so state of the art is still unclear.
- Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars.

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing a-transition labelled 1501/3000]

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

0/1, 1/0
1/1
1/2, 2/1
1/3, 2/3, 3/2, 3/1
1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1

Idea

Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
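A minimal sketch of such a search: walk down the tree from the bounds 0/1 and 1/0, taking the «mean» of the current bounds at each step, and stop at the first fraction that is close enough to the observed value. The tolerance `tol` is a hypothetical parameter (the slides do not fix how it is chosen here), and the function name is made up.

```python
from fractions import Fraction

def simplest_fraction(value, tol):
    """First fraction met in the Stern-Brocot tree within tol of value (value > 0)."""
    lo_num, lo_den = 0, 1                      # left bound 0/1
    hi_num, hi_den = 1, 0                      # right bound 1/0
    while True:
        med_num, med_den = lo_num + hi_num, lo_den + hi_den   # the «mean» operation
        mediant = Fraction(med_num, med_den)
        if abs(mediant - value) <= tol:
            return mediant
        if mediant < value:
            lo_num, lo_den = med_num, med_den  # go right in the tree
        else:
            hi_num, hi_den = med_num, med_den  # go left in the tree

estimate = simplest_fraction(Fraction(1501, 3000), Fraction(1, 100))
```

With the counts of the example, 1501/3000 is replaced by the much simpler 1/2, reached after two steps (1/1, then 1/2).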

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have:

$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log \log n}{n}}, \quad \lambda > 1$

Page 27: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 27

What does a DPFA over =a look like

And with this architecture you cannot generate the previous one

a hellip a

a

a

Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case Simple apply definitions Technically rather sum up logs this is

easier safer and cheaper

Zadar August 2010 29

Pr(aba) = 07090350 = 0Pr(abb) = 070906503 = 012285

01

03

a

b

a

b

a

b

065

035

09

07

03

07

Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a Compare and a a and aa b

and ab 4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b Compare and b a and ba b



Zadar August 2010 28

Parsing issues

Computation of the probability of a string or of a set of strings

Deterministic case

Simple: apply the definitions.

Technically, rather sum up logs: this is easier, safer and cheaper.

Zadar August 2010 29

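$\Pr(aba) = 0.7 \times 0.9 \times 0.35 \times 0 = 0$

$\Pr(abb) = 0.7 \times 0.9 \times 0.65 \times 0.3 = 0.12285$

[Figure: a deterministic PFA over {a, b}; the recoverable labels are the probabilities 0.1, 0.3, 0.65, 0.35, 0.9, 0.7, 0.3 and 0.7.]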
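A minimal sketch of the deterministic case, summing logarithms instead of multiplying probabilities. The transition table is hypothetical: it is only chosen so that the paths of abb and aba carry the factors 0.7, 0.9, 0.65, 0.3 and 0.7, 0.9, 0.35, 0, and it is not a reconstruction of the automaton drawn on the slide.

```python
import math

# Hypothetical DPFA (an assumption, not the automaton of the figure).
# TRANS[(state, symbol)] = (next_state, probability); FINAL[state] = halting probability.
TRANS = {
    (0, "a"): (1, 0.7), (0, "b"): (0, 0.3),
    (1, "b"): (2, 0.9),
    (2, "b"): (3, 0.65), (2, "a"): (4, 0.35),
    (3, "a"): (3, 0.7),
    (4, "b"): (0, 1.0),
}
FINAL = {0: 0.0, 1: 0.1, 2: 0.0, 3: 0.3, 4: 0.0}


def log_prob(word, start=0):
    """Return log Pr(word), or -inf when the string has probability 0."""
    state, total = start, 0.0
    for symbol in word:
        if (state, symbol) not in TRANS:
            return -math.inf
        state, p = TRANS[(state, symbol)]
        total += math.log(p)  # add logs instead of multiplying probabilities
    halt = FINAL[state]
    return total + math.log(halt) if halt > 0 else -math.inf
```

Working in log space avoids underflow on long strings; a null probability is represented by minus infinity rather than by a product that silently collapses to 0.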

Zadar August 2010 30

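Non-deterministic case

$\Pr(aba) = 0.7 \times 0.4 \times 0.1 \times 1 + 0.7 \times 0.4 \times 0.45 \times 0.2 = 0.028 + 0.0252 = 0.0532$

[Figure: a non-deterministic PFA over {a, b}; recoverable labels: 0.2, 0.1, 1, 0.45, 0.35, 0.4, 0.7, 0.3, 0.1 and b 0.4.]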

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming: $O(n^2 m)$

Two algorithms: Backward and Forward

If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm
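As an illustration of the last point, here is a hedged sketch of the Viterbi recurrence, which keeps a maximum where the Forward algorithm keeps a sum. The five-state PFA and the names INITIAL, FINAL and TRANS are assumptions made for this example; they do not come from the slides.

```python
# Hypothetical non-deterministic PFA (illustration only).
INITIAL = [1.0, 0.0, 0.0, 0.0, 0.0]
FINAL = [0.3, 0.6, 0.45, 1.0, 0.2]
# TRANS[(source, symbol)] = list of (target, probability)
TRANS = {
    (0, "a"): [(1, 0.7)],
    (1, "b"): [(2, 0.4)],
    (2, "a"): [(3, 0.1), (4, 0.45)],
    (4, "b"): [(0, 0.8)],
}


def viterbi(word):
    """Probability of the most probable accepting path for word."""
    best = list(INITIAL)  # best[i]: best path probability ending in state i
    for symbol in word:
        nxt = [0.0] * len(best)
        for (src, sym), edges in TRANS.items():
            if sym != symbol:
                continue
            for dst, p in edges:
                nxt[dst] = max(nxt[dst], best[src] * p)
        best = nxt
    return max(b * f for b, f in zip(best, FINAL))
```

On this toy automaton the string aba has two accepting paths, of probability 0.028 and 0.0252, so `viterbi("aba")` keeps the larger one instead of adding them.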

Zadar August 2010 32

Forward algorithm

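$A[i,j] = \Pr(q_i \mid a_1 \dots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \dots a_j$)

$A[i,0] = I_P(q_i)$

$A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot P(q_k, a_{j+1}, q_i)$

$\Pr(a_1 \dots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$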
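The recurrence can be sketched in a few lines of Python. The PFA below is hypothetical (it is not the automaton of the non-deterministic example), but it is built so that aba has two accepting paths with probabilities 0.7 × 0.4 × 0.1 × 1 and 0.7 × 0.4 × 0.45 × 0.2.

```python
# Hypothetical PFA: INITIAL plays the role of I_P, FINAL of F_P, TRANS of P.
INITIAL = [1.0, 0.0, 0.0, 0.0, 0.0]
FINAL = [0.3, 0.6, 0.45, 1.0, 0.2]
# TRANS[(k, symbol)] = list of (i, probability of going from q_k to q_i on symbol)
TRANS = {
    (0, "a"): [(1, 0.7)],
    (1, "b"): [(2, 0.4)],
    (2, "a"): [(3, 0.1), (4, 0.45)],
    (4, "b"): [(0, 0.8)],
}


def forward(word):
    """Pr(word), computed with the table A[i][j] of the Forward algorithm."""
    n = len(INITIAL)
    A = [[0.0] * (len(word) + 1) for _ in range(n)]
    for i in range(n):
        A[i][0] = INITIAL[i]
    for j, symbol in enumerate(word):
        for (k, sym), edges in TRANS.items():
            if sym == symbol:
                for i, p in edges:
                    A[i][j + 1] += A[k][j] * p
    return sum(A[k][len(word)] * FINAL[k] for k in range(n))
```

The loop visits only the transitions that exist, which is equivalent to the sum over all states $k$ in the recurrence because missing transitions have probability 0.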

Zadar August 2010 33

2 Distances

What for?

Estimate the quality of a language model

Have an indicator of the convergence of learning algorithms

Construct kernels

Zadar August 2010 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over $\Sigma^*$: $D$ and $D'$

Kullback-Leibler divergence (or relative entropy) between $D$ and $D'$:

$\sum_{w \in \Sigma^*} \Pr_D(w) \cdot \left( \log \Pr_D(w) - \log \Pr_{D'}(w) \right)$
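A small sketch of this sum, under two assumptions that are not on the slide: both distributions have finite support and are given as Python dictionaries, and logarithms are taken in base 2 so that the result reads in bits.

```python
import math


def kl_divergence(p, q):
    """Sum over w of p(w) * (log2 p(w) - log2 q(w)); inf if q misses a string of p."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue  # terms with p(w) = 0 contribute nothing
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf
        total += pw * (math.log2(pw) - math.log2(qw))
    return total
```

The divergence is not symmetric, and it becomes infinite as soon as the second distribution gives probability 0 to a string the first one can generate.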

Zadar August 2010 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relative to a test set ($S$)

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

$\left( \prod_{w \in S} \Pr_D(w) \right)^{-1/|S|} = \dfrac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$

Problem if some probability is null!
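A hedged sketch of the perplexity of a model over a test set. The function name and its arguments are illustrative; `prob` stands for any function returning the model's probability of a string.

```python
import math


def perplexity(prob, test_set):
    """Inverse of the geometric mean of prob(w) over the strings of test_set."""
    log_sum = 0.0
    for w in test_set:
        p = prob(w)
        if p == 0.0:
            return math.inf  # one null probability makes the perplexity unbounded
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_set))
```

The product is accumulated as a sum of logarithms, and the null-probability case is made explicit, which is why language models are usually smoothed before being evaluated this way.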

Zadar August 2010 37

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set $S$

Zadar August 2010 38

Why multiply? (2)

Suppose we have two predictors for a coin toss

Predictor 1: heads 60%, tails 40%

Predictor 2: heads 100%

The tests are H: 6, T: 4

Arithmetic mean

P1: 0.36 + 0.16 = 0.52

P2: 0.6

Predictor 2 is the better predictor ;-)
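The coin example can be checked directly. The sketch below compares the arithmetic and geometric means of the probabilities that each predictor gives to the ten test outcomes; the variable names are illustrative.

```python
import math

tests = ["H"] * 6 + ["T"] * 4
predictor1 = {"H": 0.6, "T": 0.4}
predictor2 = {"H": 1.0, "T": 0.0}


def arithmetic_mean(pred):
    return sum(pred[t] for t in tests) / len(tests)


def geometric_mean(pred):
    # math.prod requires Python 3.8 or later
    return math.prod(pred[t] for t in tests) ** (1 / len(tests))
```

With the arithmetic mean Predictor 2 scores 0.6 against 0.52, although it declares tails impossible. With the geometric mean Predictor 2 falls to 0 while Predictor 1 stays at about 0.51, which is the behaviour one wants from a quality measure.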

Zadar August 2010 39

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w} \left( \Pr_D(w) - \Pr_{D'}(w) \right)^2}$

Can be computed in polynomial time if $D$ and $D'$ are given by PFA (Carrasco & de la Higuera 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset

Strings appear with a frequency (or multiplicity)

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, for every transition, and for entering the initial state, such that:

the sum of what enters is equal to what exits, and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

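[Figure: a three-state DFFA. Recoverable labels: halting frequencies 3, 1 and 2; transitions b 5, b 3, a 1, a 2, a 5, b 4; frequency 6 entering the initial state.]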

Zadar August 2010 44

From a DFFA to a DPFA

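[Figure: the same automaton as a DPFA. Recoverable labels: halting probabilities 3/13, 1/6 and 2/7; transitions b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7; initial probability 6/6.]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies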
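A sketch of this normalisation, using the frequencies of the example (3, 5, 5 for the first state, 1, 2, 3 for the second, 2, 1, 4 for the third). The state names q1, q2, q3 are placeholders, and the target states of the transitions are left out because they play no role in the division.

```python
from fractions import Fraction

# Per state: halting frequency and frequency of each outgoing symbol.
DFFA = {
    "q1": {"final": 3, "out": {"a": 5, "b": 5}},
    "q2": {"final": 1, "out": {"a": 2, "b": 3}},
    "q3": {"final": 2, "out": {"a": 1, "b": 4}},
}


def to_dpfa(dffa):
    """Divide every frequency by the total frequency leaving its state."""
    dpfa = {}
    for state, freq in dffa.items():
        total = freq["final"] + sum(freq["out"].values())  # halting counts as exiting
        dpfa[state] = {
            "final": Fraction(freq["final"], total),
            "out": {a: Fraction(f, total) for a, f in freq["out"].items()},
        }
    return dpfa
```

Exact fractions are used so that the probabilities leaving each state sum to exactly 1; note that `Fraction` reduces 3/6 to 1/2 and 2/6 to 1/3.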

Zadar August 2010 45

From a DFA and a sample to a DFFA

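[Figure: the DFFA of the previous example. Recoverable labels: halting frequencies 3, 1 and 2; transitions b 5, b 3, a 1, a 2, a 5, b 4; frequency 6 entering the initial state.]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}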
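The counting can be sketched as follows. The structure of the DFA is an assumption: the figure did not survive extraction, so the transition function below is simply one three-state DFA for which parsing the six strings of the sample (the empty string included) gives the frequencies listed for the example.

```python
from collections import Counter

# Hypothetical DFA with initial state "A" (an assumption, see above).
DELTA = {
    ("A", "a"): "A", ("A", "b"): "B",
    ("B", "a"): "A", ("B", "b"): "C",
    ("C", "a"): "B", ("C", "b"): "C",
}
SAMPLE = ["", "aaaa", "ab", "babb", "bbbb", "bbbbaa"]  # "" is the empty string


def count_frequencies(delta, initial, sample):
    """Parse every string and count transition uses and halting states."""
    trans, final = Counter(), Counter()
    for word in sample:
        state = initial
        for symbol in word:
            trans[(state, symbol)] += 1
            state = delta[(state, symbol)]
        final[state] += 1
    return trans, final
```

Because the automaton is deterministic, each string follows exactly one path, so a single pass over the sample is enough.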

Zadar August 2010 46

Note

Another sample may lead to the same DFFA

Doing the same with an NFA is a much harder problem

Typically what the Baum-Welch (EM) algorithm was invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multiset

The FTA is the smallest tree-like FFA consistent with the data

Can be transformed into a PFA if needed

Zadar August 2010 48

From the sample to the FTA

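S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S). 14 strings enter the root and 3 halt there; the transitions leaving the root are a 7 and b 4. Other recoverable labels: a 6, a 4, a 2, b 4, b 2, a 1, b 1, and halting frequencies 2, 4, 3, 1, 1.]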
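Building the frequency prefix tree acceptor is a single pass over the multiset. In this sketch, which is one possible encoding and not the author's, a state is named by the prefix that reaches it and every transition stores its target together with its frequency.

```python
SAMPLE = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}


def build_fta(sample):
    """Return (final, trans): final[q] is the halting frequency of state q,
    trans[q][a] = [target, frequency] for the transition leaving q on a."""
    final, trans = {"": 0}, {"": {}}
    for word, count in sample.items():
        state = ""
        for symbol in word:
            child = state + symbol
            if symbol not in trans[state]:
                trans[state][symbol] = [child, 0]
                trans[child], final[child] = {}, 0
            trans[state][symbol][1] += count
            state = child
        final[state] += count
    return final, trans
```

On the sample above, 7 strings leave the root by a and 4 by b, as in the figure.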

Zadar August 2010 49

Red, Blue and White states

[Figure: an automaton over {a, b} whose states are coloured Red, Blue or White.]

Red states are confirmed states

Blue states are the (non-Red) successors of the Red states

White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

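Suppose we decide to merge two states (marked in the figure)

[Figure: a frequency automaton with 100 strings entering the initial state. Recoverable labels: halting frequencies 60, 9, 10, 11, 6 and 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6.]

Zadar August 2010 51

Merge and fold

First disconnect one state and reconnect to the other

[Figure: the same automaton, with the transition entering the merged state redirected.]

Zadar August 2010 52

Merge and fold

Then fold

[Figure: the same automaton during folding.]

Zadar August 2010 53

Merge and fold

After folding

[Figure: the automaton after folding, still with 100 strings entering the initial state. Recoverable labels: halting frequencies 60, 10, 9, 10, 11 and 4; transitions a 26, a 4, a 10, a 10, b 24, b 9, b 30.]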

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(A_p, A_q) is small then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

Zadar August 2010 55

The real question

How do we decide if $d(A_p, A_q)$ is small?

Use a distance…

Be able to compute this distance

If possible, update the computation easily

Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

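Suppose we want to merge two states (marked in the figure)

[Figure: the frequency automaton of the merge-and-fold example. Recoverable labels: halting frequencies 60, 9, 10, 11, 6 and 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6; 100 strings entering the initial state.]

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar

[Figure: the same automaton, with the two sub-automata rooted at the candidate states shown side by side. Recoverable labels of the extracted part: 9, 11, 4, a 4, a 4, b 24, b 9.]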

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

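Alergia's test

$D_1 \approx D_2$ if $\forall x,\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$

Easier to test:

$\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$

$\forall a \in \Sigma,\ \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$

And do this recursively!

Of course, do it on frequencies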

Zadar August 2010 61

Hoeffding bounds

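$\gamma \leftarrow \left| \dfrac{f_1}{n_1} - \dfrac{f_2}{n_2} \right|$

$\gamma < \left( \sqrt{\dfrac{1}{n_1}} + \sqrt{\dfrac{1}{n_2}} \right) \cdot \sqrt{\dfrac{1}{2} \ln \dfrac{2}{\alpha}}$

indicates if the relative frequencies $f_1/n_1$ and $f_2/n_2$ are sufficiently close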
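The test is a one-liner once the bound is isolated. The sketch below uses illustrative function names; `alpha` is the confidence parameter of the bound.

```python
import math


def hoeffding_bound(n1, n2, alpha):
    """Right-hand side of the test for two samples of sizes n1 and n2."""
    return (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / alpha))


def alergia_test(f1, n1, f2, n2, alpha):
    """True when the relative frequencies f1/n1 and f2/n2 are sufficiently close."""
    gamma = abs(f1 / n1 - f2 / n2)
    return gamma < hoeffding_bound(n1, n2, alpha)
```

For instance, with alpha = 0.05, the frequencies 490/1000 and 128/257 pass the test, whereas 660/1341 and 225/340 do not: the bound for those sample sizes is about 0.111.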

Zadar August 2010 62

A run of Alergia

Our learning multisample

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the Blue states whose frequency is less than the threshold, a special merging operation takes place

Zadar August 2010 64

[Figure: the frequency prefix tree acceptor FTA(S) built from the multisample. 1000 strings enter the root and 490 halt there; the transitions leaving the root are a 257 and b 253; the states a and b have halting frequencies 128 and 170.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, a and aa, b and ab

490/1000 with 128/257

257/1000 with 64/257

253/1000 with 65/257

All tests return true
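The recursive comparison can be sketched on the prefix tree directly, since in a tree the frequency of a state is the number of sample strings that start with its prefix. This is a simplified sketch under stated assumptions: the threshold t0 is ignored, the alphabet is {a, b}, and the function names are illustrative.

```python
import math

SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3, "bbb": 6,
    "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2, "abab": 2, "abba": 2,
    "abbb": 1, "baaa": 2, "baab": 2, "baba": 1, "babb": 1, "bbaa": 1, "bbab": 1,
    "bbba": 1, "aaaaa": 1, "aaaab": 1, "aaaba": 1, "aabaa": 1, "aabab": 1,
    "aabba": 1, "abbaa": 1, "abbab": 1,
}


def passing(prefix):
    """Number of strings of the sample that go through state `prefix`."""
    return sum(c for w, c in SAMPLE.items() if w.startswith(prefix))


def close(f1, n1, f2, n2, alpha):
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / alpha))
    return abs(f1 / n1 - f2 / n2) < bound


def compatible(u, v, alpha=0.05, alphabet="ab"):
    """Compare halting and transition frequencies of u and v, then recurse."""
    nu, nv = passing(u), passing(v)
    if nu == 0 or nv == 0:
        return True  # nothing left to compare below an unseen state
    if not close(SAMPLE.get(u, 0), nu, SAMPLE.get(v, 0), nv, alpha):
        return False
    for a in alphabet:
        if not close(passing(u + a), nu, passing(v + a), nv, alpha):
            return False
        if not compatible(u + a, v + a, alpha, alphabet):
            return False
    return True
```

`compatible("", "a")` compares λ with a, then a with aa, b with ab, and so on down the tree. Deep in the tree the frequencies are so small that the bound exceeds 1 and the test cannot fail, which is one reason for the threshold t0.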

Zadar August 2010 66

[Figure: FTA(S) as on the previous figure, with the states λ and a about to be merged; 1000 strings enter the root.]

Merge…

Zadar August 2010 67

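[Figure: the automaton after merging λ and a. Recoverable labels near the root: halting frequency 660, a 341, b 340; for the state reached by b: halting frequency 225, a 77, b 38; below it: 52, 12, a 16, a 10, 20, b 9, b 8; 1000 strings enter the initial state.]

And fold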
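The merge of λ and a, followed by the fold, can be replayed on the multisample of the run. This is a sketch with an encoding chosen for the example (states named by prefixes, transitions stored as `[target, frequency]`), not the author's implementation.

```python
SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3, "bbb": 6,
    "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2, "abab": 2, "abba": 2,
    "abbb": 1, "baaa": 2, "baab": 2, "baba": 1, "babb": 1, "bbaa": 1, "bbab": 1,
    "bbba": 1, "aaaaa": 1, "aaaab": 1, "aaaba": 1, "aabaa": 1, "aabab": 1,
    "aabba": 1, "abbaa": 1, "abbab": 1,
}


def build_fta(sample):
    final, trans = {"": 0}, {"": {}}
    for word, count in sample.items():
        state = ""
        for symbol in word:
            child = state + symbol
            if symbol not in trans[state]:
                trans[state][symbol] = [child, 0]
                trans[child], final[child] = {}, 0
            trans[state][symbol][1] += count
            state = child
        final[state] += count
    return final, trans


def fold(final, trans, p, q):
    """Add the subtree rooted at q onto state p, summing the frequencies."""
    final[p] += final.pop(q)
    for symbol, (target, count) in trans.pop(q).items():
        if symbol in trans[p]:
            trans[p][symbol][1] += count
            fold(final, trans, trans[p][symbol][0], target)
        else:
            trans[p][symbol] = [target, count]  # p had no such transition: adopt it


def merge_and_fold(final, trans, p, q):
    for edges in trans.values():  # redirect the transition that entered q
        for edge in edges.values():
            if edge[0] == q:
                edge[0] = p
    fold(final, trans, p, q)
```

After `merge_and_fold(final, trans, "", "a")` the root carries a loop on a and has absorbed every string of the form a…a, which gives the frequencies 660, 341 and 340 of the figure.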

Zadar August 2010 68

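[Figure: the automaton after folding, as on the previous figure: halting frequency 660 at the root, a 341, b 340, and 225, a 77, b 38 for the state reached by b; 1000 strings enter the initial state.]

Next: merge λ with b?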

Zadar August 2010 69

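Can we merge λ and b?

Compare λ and b, a and ba, b and bb

660/1341 and 225/340 are different (giving γ ≈ 0.170)

On the other hand

$\left( \sqrt{\dfrac{1}{n_1}} + \sqrt{\dfrac{1}{n_2}} \right) \cdot \sqrt{\dfrac{1}{2} \ln \dfrac{2}{\alpha}} = 0.111$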

Zadar August 2010 70

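[Figure: the automaton of the previous figure, with the state reached by b promoted to Red; 1000 strings enter the initial state.]

Promotion

Zadar August 2010 71

[Figure: the same automaton, with the next pair of states selected for merging. Recoverable labels: 660, 52, 225, a 341, a 77, b 340, b 38.]

Merge

Zadar August 2010 72

[Figure: the automaton after this merge. Recoverable labels: 660, 291, a 341, a 95, b 340, b 49, a 11, 29, 7, 8, b 9; 1000 strings enter the initial state.]

And fold

Zadar August 2010 73

[Figure: the automaton with the next pair of states selected for merging. Recoverable labels: 660, 225, a 341, a 95, b 340, b 49, a 11, 29, 7, 8, b 9; 1000 strings enter the initial state.]

Merge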

Zadar August 2010 74

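[Figure: the final two-state DFFA. Recoverable labels: halting frequencies 698 and 302; transitions a 354, a 96, b 351, b 49; 1000 strings enter the initial state.]

And fold

[Figure: the same automaton drawn as a PFA. Labels as extracted: 698, 302, a 354, a 096, b 351, b 049.]

As a PFA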

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string

Use norm $L_\infty$

Two distributions are different if there is a string with a very different probability

Such a string is called μ-distinguishable

Question becomes: is there a string $x$ such that $|\Pr_{A_q}(x) - \Pr_{A_{q'}}(x)| > \mu$?
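A minimal sketch of this criterion, assuming that the two distributions have finite support and are given as dictionaries (in DSAI they would be estimated from the sample; the function names are illustrative).

```python
def linf_distance(p, q):
    """Largest difference of probability over the strings of either support."""
    support = set(p) | set(q)
    return max(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def mu_distinguishable(p, q, mu):
    """True when some string has probabilities differing by more than mu."""
    return linf_distance(p, q) > mu
```

Unlike the recursive frequency test of Alergia, a single string with a large enough gap is sufficient to refuse a merge.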

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL-inspired heuristic

Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES

Never has been a challenge, so state of the art is still unclear

Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix

Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state

[Figure: a state with frequency 3000 and an outgoing a-transition labelled 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation

$\dfrac{a}{b} \; m \; \dfrac{c}{d} = \dfrac{a+c}{b+d}$

Zadar August 2010 85

0/1, 1/0

1/1

1/2, 2/1

1/3, 2/3, 3/2, 3/1

1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1

Zadar August 2010 86

Idea

Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value
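The search can be sketched as a walk down the tree from 0/1 and 1/0, taking the «mean» of the two current bounds at each step. The tolerance `eps` is an assumed parameter of this sketch; how close is close enough is exactly what has to be decided as a function of n.

```python
from fractions import Fraction


def simplest_fraction(x, eps):
    """First fraction met in the Stern-Brocot tree within eps of x (x >= 0, eps > 0)."""
    lo_n, lo_d = 0, 1  # left bound 0/1
    hi_n, hi_d = 1, 0  # right bound 1/0, standing for infinity
    while True:
        n, d = lo_n + hi_n, lo_d + hi_d  # the "mean" of the two bounds
        m = Fraction(n, d)
        if abs(m - x) <= eps:
            return m
        if m < x:
            lo_n, lo_d = n, d
        else:
            hi_n, hi_d = n, d
```

With the estimate 1501/3000 and a tolerance of 1/100, the walk visits 1/1 and then stops at 1/2, a much more plausible transition probability than 1501/3000.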

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have:

$\left| \dfrac{c(x)}{n} - \dfrac{a}{b} \right| < \lambda \sqrt{\dfrac{\log \log n}{n}}$, with $\lambda > 1$




Zadar August 2010 30

Pr(aba) = 0704011 +070404502 = 0028+00252=00532

Non-deterministic case

02

01

1

a

b

a

a a

b

045

03504

07

03

01

b 04

Zadar August 2010 31

In the literature

The computation of the probability of a string is by dynamic programming O(n2m)

2 algorithms Backward and Forward If we want the most probable derivation to

define the probability of a string then we can use the Viterbi algorithm

Zadar August 2010 32

Forward algorithm

A[ij]=Pr(qi|a1aj)(The probability of being in state qi

after having read a1aj)A[i0]=IP(qi)A[ij+1]= k|Q|A[kj] P(qkaj+1qi)Pr(a1an)= k|Q|A[kn] FP(qk)

Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

[Figure: FTA(S), the frequency prefix tree acceptor built from the multisample S; 1000 strings enter the initial state.]

Zadar August 2010 65

Can we merge λ and a?

- Compare λ and a, a and aa, b and ab
- 490/1000 with 128/257
- 257/1000 with 64/257
- 253/1000 with 65/257
- …

All tests return true.
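As a worked check of these three comparisons (read as 490/1000 with 128/257, 257/1000 with 64/257 and 253/1000 with 65/257), the bound can be evaluated by hand for α = 0.05, n₁ = 1000 and n₂ = 257. The rounded values below were computed for this note and are not taken from the slides.

```latex
\[
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}}
\left(\tfrac{1}{\sqrt{1000}} + \tfrac{1}{\sqrt{257}}\right)
\approx 1.358 \times 0.0940 \approx 0.128
\]
\[
\left|\tfrac{490}{1000} - \tfrac{128}{257}\right| \approx 0.008, \qquad
\left|\tfrac{257}{1000} - \tfrac{64}{257}\right| \approx 0.008, \qquad
\left|\tfrac{253}{1000} - \tfrac{65}{257}\right| \approx 0.0001
\]
```

Each gap is well below 0.128, which is why every test accepts.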

Zadar August 2010 66

[Figure: FTA(S) once more, before the merge; 1000 strings enter the initial state.]

Merge…

Zadar August 2010 67

[Figure: the automaton obtained after the merge; the initial state now has halting frequency 660 and an a-transition of frequency 341, and its b-transition has frequency 340.]

And fold

Zadar August 2010 68

[Figure: the same folded automaton.]

Next: merge λ with b?

Zadar August 2010 69

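Can we merge λ and b?

- Compare λ and b, a and ba, b and bb
- 660/1341 and 225/340 are different (giving γ = 0.162)
- On the other hand:

$$\sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \cdot \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) = 0.111$$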

Zadar August 2010 70

[Figure: the same automaton; the merge with b having been rejected, state b is kept as a separate state.]

Promotion

Zadar August 2010 71

[Figure: the automaton after the promotion, before the next merge.]

Merge

Zadar August 2010 72

[Figure: the automaton after that merge has been folded.]

And fold

Zadar August 2010 73

[Figure: the automaton before the last merge.]

Merge

Zadar August 2010 74

[Figure: final DFFA with two states; halting frequencies 698 and 302; transition frequencies a 354, b 351, a 96, b 49; 1000 strings enter the initial state.]

And fold

[Figure: the same two-state automaton drawn as a PFA, with labels 0.698, 0.302, a 0.354, a 0.096, b 0.351, b 0.049.]

As a PFA

Zadar August 2010 75

Conclusion and logic

- Alergia builds a DFFA in polynomial time
- Alergia can identify DPFA in the limit with probability 1
- No good definition of Alergia's properties

Zadar August 2010 76

6. DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

- Using a distinguishable string
- Use norm $L_\infty$
- Two distributions are different if there is a string with a very different probability
- Such a string is called $\mu$-distinguishable
- Question becomes: is there a string x such that $|\Pr_{A,q}(x) - \Pr_{A,q'}(x)| > \mu$?

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI

- MDL inspired heuristic
- Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7. Conclusion and open questions

Zadar August 2010 81

- A good candidate to learn NFA is DEES
- There has never been a challenge, so the state of the art is still unclear
- Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing edge labelled a: 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation m:

$$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$$

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
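The search can be sketched as a descent of the tree using the mean (mediant) operation. This is an illustrative sketch only: the function name `stern_brocot_approx` and the `tolerance` parameter are assumptions introduced here, and how the tolerance should depend on n is not specified by this example.

```python
from fractions import Fraction


def stern_brocot_approx(target: Fraction, tolerance: Fraction) -> Fraction:
    """Descend the Stern-Brocot tree until a node lies within `tolerance` of `target`."""
    if target < 0 or tolerance < 0:
        raise ValueError("target and tolerance must be non-negative")
    lo_n, lo_d = 0, 1  # left bound 0/1
    hi_n, hi_d = 1, 0  # right bound 1/0, standing for infinity
    while True:
        # The mean of the two current bounds is the next node of the tree.
        med = Fraction(lo_n + hi_n, lo_d + hi_d)
        if abs(med - target) <= tolerance:
            return med
        if target < med:
            hi_n, hi_d = med.numerator, med.denominator  # go left
        else:
            lo_n, lo_d = med.numerator, med.denominator  # go right


# The estimate 1501/3000 is replaced by the much simpler 1/2.
print(stern_brocot_approx(Fraction(1501, 3000), Fraction(1, 100)))  # 1/2
```

Nodes near the root have small numerators and denominators, so the first node that falls inside the tolerance is a simple fraction close to the estimate.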

Zadar August 2010 87

Iterated logarithm

With probability 1, for a co-finite number of values of n, we have:

$$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log \log n}{n}}, \qquad \lambda > 1$$


Zadar August 2010 31

In the literature

- The computation of the probability of a string is by dynamic programming: $O(n^2 m)$
- 2 algorithms: Backward and Forward
- If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm

Zadar August 2010 32

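Forward algorithm

- $A[i,j] = \Pr(q_i \mid a_1 \ldots a_j)$ (the probability of being in state $q_i$ after having read $a_1 \ldots a_j$)
- $A[i,0] = I_P(q_i)$
- $A[i,j+1] = \sum_{k \le |Q|} A[k,j] \cdot P(q_k, a_{j+1}, q_i)$
- $\Pr(a_1 \ldots a_n) = \sum_{k \le |Q|} A[k,n] \cdot F_P(q_k)$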
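The recurrence above can be sketched directly in Python. The representation is an assumption made for this example: states are numbered from 0, `initial` and `final` are lists holding I_P and F_P, and `transitions` maps a triple (source, symbol, target) to its probability. The two-state automaton in the usage lines is hypothetical.

```python
def forward_probability(string, initial, final, transitions):
    """Probability of `string` under a PFA, computed with the forward recurrence.

    initial[i] = I_P(q_i), final[i] = F_P(q_i),
    transitions[(k, symbol, i)] = P(q_k, symbol, q_i).
    """
    n_states = len(initial)
    column = list(initial)  # column j of the table A, starting with j = 0
    for symbol in string:
        nxt = [0.0] * n_states  # column j + 1
        for (k, a, i), p in transitions.items():
            if a == symbol:
                nxt[i] += column[k] * p
        column = nxt
    return sum(column[k] * final[k] for k in range(n_states))


# Hypothetical PFA: q0 halts with 0.25, loops on a with 0.25, goes to q1 on b with 0.5;
# q1 halts with 0.5 and loops on a with 0.5.
initial = [1.0, 0.0]
final = [0.25, 0.5]
transitions = {(0, "a", 0): 0.25, (0, "b", 1): 0.5, (1, "a", 1): 0.5}

print(forward_probability("ab", initial, final, transitions))  # 0.0625
print(forward_probability("ba", initial, final, transitions))  # 0.125
```

Only one column of the table is kept at a time, since each column depends on the previous one alone.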

Zadar August 2010 33

2. Distances

What for?
- Estimate the quality of a language model
- Have an indicator of the convergence of learning algorithms
- Construct kernels

Zadar August 2010 34

2.1 Entropy

- How many bits do we need to correct our model?
- Two distributions over $\Sigma^*$: D and D'
- Kullback-Leibler divergence (or relative entropy) between D and D':

$$\sum_{w \in \Sigma^*} \Pr_D(w) \cdot \left( \log \Pr_D(w) - \log \Pr_{D'}(w) \right)$$
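For distributions with finite support the divergence can be computed term by term. The sketch below assumes each distribution is given as a dictionary from strings to probabilities (a simplification, since distributions defined by a PFA generally have infinite support) and uses base-2 logarithms so that the result is in bits. The name `kl_divergence` and the two example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Sum over the support of p of p(w) * (log2 p(w) - log2 q(w))."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue  # a term with p(w) = 0 contributes nothing
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf  # p gives mass to a string that q rules out
        total += pw * (math.log2(pw) - math.log2(qw))
    return total


d = {"a": 0.5, "b": 0.5}
d_prime = {"a": 0.25, "b": 0.75}
print(kl_divergence(d, d_prime))
print(kl_divergence(d, d))  # 0.0
```

The infinite result when q assigns probability zero to a string of p is the same difficulty that motivates smoothing before computing perplexity.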

Zadar August 2010 35

2.2 Perplexity

- The idea is to allow the computation of the divergence, but relatively to a test set (S)
- An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

$$\left( \prod_{w \in S} \Pr_D(w) \right)^{-\frac{1}{|S|}} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$$

Problem if some probability is null!
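A minimal sketch of this quantity, computed in log space to avoid underflow on long test sets. The function name `perplexity`, the callable `prob`, and the toy model `toy_prob` (which gives the string aⁿ probability 0.5ⁿ⁺¹) are assumptions of this example.

```python
import math


def perplexity(test_set, prob):
    """Inverse of the geometric mean of prob(w) over the strings of test_set."""
    if not test_set:
        raise ValueError("the test set must not be empty")
    log_sum = 0.0
    for w in test_set:
        p = prob(w)
        if p == 0.0:
            return math.inf  # the null-probability problem: one zero ruins the score
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_set))


def toy_prob(w):
    """Hypothetical model over {a}*: Pr(a^n) = 0.5 ** (n + 1)."""
    return 0.5 ** (len(w) + 1) if set(w) <= {"a"} else 0.0


print(perplexity(["", "a", "aa"], toy_prob))  # probabilities 1/2, 1/4, 1/8
print(perplexity(["a", "b"], toy_prob))       # inf, since Pr(b) = 0
```

The first call multiplies 1/2, 1/4 and 1/8, whose geometric mean is 1/4, so the perplexity is 4 up to floating-point rounding.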

Zadar August 2010 37

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S.

Zadar August 2010 38

Why multiply? (2)

- Suppose we have two predictors for a coin toss
  - Predictor 1: heads 60%, tails 40%
  - Predictor 2: heads 100%
- The tests are H: 6, T: 4
- Arithmetic mean
  - P1: 0.36 + 0.16 = 0.52
  - P2: 0.6
- Predictor 2 is the better predictor ;-)
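The coin example can be replayed numerically to see why the geometric mean is preferred. The sketch below uses the two predictors and the ten test outcomes from the slide; the helper names are introduced here for illustration.

```python
import math

tests = ["H"] * 6 + ["T"] * 4
predictor_1 = {"H": 0.6, "T": 0.4}
predictor_2 = {"H": 1.0, "T": 0.0}


def arithmetic_mean(predictor, outcomes):
    """Average probability that the predictor assigns to the observed outcomes."""
    return sum(predictor[o] for o in outcomes) / len(outcomes)


def geometric_mean(predictor, outcomes):
    """|S|-th root of the probability of the whole sequence of outcomes."""
    return math.prod(predictor[o] for o in outcomes) ** (1.0 / len(outcomes))


print(arithmetic_mean(predictor_1, tests), arithmetic_mean(predictor_2, tests))
print(geometric_mean(predictor_1, tests), geometric_mean(predictor_2, tests))
```

Under the arithmetic mean Predictor 2 scores higher (about 0.6 against 0.52), whereas under the geometric mean it scores exactly 0, because it declared the four observed tails impossible.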

Zadar August 2010 39

2.3 Distance d2

$$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left( \Pr_D(w) - \Pr_{D'}(w) \right)^2}$$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).

This also means that equivalence of PFA is in P.

Zadar August 2010 40

3. FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset

Strings appear with a frequency (or multiplicity):

S = {λ(3), aaa(4), aaba(2), ababa(1), bb(3), bbaaa(1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that:
- the sum of what enters is equal to what exits, and
- the sum of what halts is equal to what starts.

Zadar August 2010 43

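Example

[Figure: DFFA with three states; halting frequencies 3, 1 and 2; transition frequencies b 5, b 3, a 1, a 2, a 5, b 4; frequency 6 entering the initial state.]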

Zadar August 2010 44

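From a DFFA to a DPFA

[Figure: the corresponding DPFA; halting probabilities 3/13, 1/6 and 2/7; transition probabilities b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7; initial probability 6/6.]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies.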
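The normalisation can be sketched as follows. The state names `q1`, `q2`, `q3` are hypothetical, and the grouping of frequencies per state is inferred from the denominators 13, 6 and 7 in the example; the target state of each transition is not recoverable from the extracted slide, so it is left out of this sketch.

```python
from fractions import Fraction


def dffa_to_dpfa(halting, outgoing):
    """Divide every frequency by the total frequency leaving its state.

    halting[q]          : number of strings ending in state q
    outgoing[q][symbol] : frequency of the transition leaving q on symbol
    """
    dpfa = {}
    for q in halting:
        total = halting[q] + sum(outgoing[q].values())  # everything that exits q
        dpfa[q] = {
            "final": Fraction(halting[q], total),
            "next": {a: Fraction(f, total) for a, f in outgoing[q].items()},
        }
    return dpfa


halting = {"q1": 3, "q2": 1, "q3": 2}
outgoing = {
    "q1": {"a": 5, "b": 5},
    "q2": {"a": 2, "b": 3},
    "q3": {"a": 1, "b": 4},
}
dpfa = dffa_to_dpfa(halting, outgoing)
print(dpfa["q1"]["final"], dpfa["q3"]["next"]["b"])  # 3/13 4/7
```

`Fraction` reduces its results, so the 3/6 and 2/6 of the example are reported as 1/2 and 1/3.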

Zadar August 2010 45

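From a DFA and a sample to a DFFA

[Figure: the DFA annotated with the frequencies obtained by parsing the sample; halting frequencies 3, 1 and 2; transition frequencies b 5, b 3, a 1, a 2, a 5, b 4; frequency 6 entering the initial state.]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}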

Zadar August 2010 46

Note

- Another sample may lead to the same DFFA
- Doing the same with an NFA is a much harder problem
- Typically what the Baum-Welch (EM) algorithm has been invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

- The data is a multiset
- The FTA is the smallest tree-like FFA consistent with the data
- Can be transformed into a PFA if needed

Zadar August 2010 48

From the sample to the FTA

S = {λ(3), aaa(4), aaba(2), ababa(1), bb(3), bbaaa(1)}

[Figure: FTA(S), the frequency prefix tree acceptor of S; 14 strings enter the initial state, 7 follow the edge a and 4 follow the edge b.]
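Building the frequency prefix tree from a multisample amounts to counting, for every prefix, how many strings pass through it and how many stop there. In the sketch below each state is identified by the prefix that reaches it and the empty string stands for λ; the function name `build_fta` and the two-counter representation are choices made for this illustration.

```python
from collections import Counter


def build_fta(sample):
    """Return (entering, halting) frequency counters keyed by prefix.

    sample maps each string to its multiplicity. In a tree, the frequency
    of the edge leading to a prefix equals entering[prefix].
    """
    entering = Counter()  # strings that reach the state of this prefix
    halting = Counter()   # strings that stop exactly in that state
    for string, count in sample.items():
        halting[string] += count
        for i in range(len(string) + 1):
            entering[string[:i]] += count
    return entering, halting


S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
entering, halting = build_fta(S)
print(entering[""], entering["a"], entering["aa"], entering["b"])  # 14 7 6 4
```

The counts 14, 7, 6 and 4 follow directly from the multiplicities in S: for instance, the strings beginning with a are aaa(4), aaba(2) and ababa(1).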

Zadar August 2010 49

Red, Blue and White states

[Figure: automaton with its states coloured Red, Blue and White.]

- Red states are confirmed states
- Blue states are the (non-Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does.

Zadar August 2010 50

Merge and fold

Suppose we decide to merge one state with another (both marked in the figure).

[Figure: frequency automaton showing the two states to be merged.]

Zadar August 2010 51

Merge and fold

First disconnect the first state and reconnect its incoming edge to the second.

[Figure: the same automaton with the incoming edge redirected.]

Zadar August 2010 52

Merge and fold

Then fold.

[Figure: the subtree of the disconnected state being folded into the automaton.]





Zadar August 2010 33

2 DistancesWhat for

Estimate the quality of a language modelHave an indicator of the convergence of learning algorithmsConstruct kernels

Zadar August 2010 34

21 EntropyHow many bits do we need to correct

our modelTwo distributions over D et Drsquo Kullback Leibler divergence (or

relative entropy) between D and Drsquow

PrD(w) log PrD(w)-log PrDrsquo (w)

Zadar August 2010 35

22 PerplexityThe idea is to allow the computation

of the divergence but relatively to a test set (S)

An approximation (sic) is perplexity inverse of the geometric mean of the probabilities of the elements of the test set

Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question: how do we decide if d(Ap, Aq) is small?

Use a distance…

Be able to compute this distance

If possible, update the computation easily

Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested.

The distance (L2 norm) between distributions can be computed exactly.

But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

Suppose we want to merge two of the states.

[Figure: the DFFA of the merge-and-fold example, with initial frequency 100.]

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar.

[Figure: the same DFFA, together with the two sub-automata whose induced distributions are compared.]

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

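Alergia's test

$D_1 \approx D_2$ if $\forall x,\ \Pr_{D_1}(x) \approx \Pr_{D_2}(x)$

Easier to test:

$\Pr_{D_1}(\lambda) = \Pr_{D_2}(\lambda)$

$\forall a \in \Sigma,\ \Pr_{D_1}(a\Sigma^*) = \Pr_{D_2}(a\Sigma^*)$

And do this recursively.

Of course, do it on frequencies.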

Zadar August 2010 61

Hoeffding bounds

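$\gamma \leftarrow \left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right|$

$\gamma < \left( \sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}}$

indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.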
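
The test can be written as a short Python function. This is a minimal sketch: the function name and the example counts are made up for illustration, and only the inequality itself comes from the slide.

```python
import math

def hoeffding_close(f1, n1, f2, n2, alpha):
    """True when f1/n1 and f2/n2 are sufficiently close for parameter alpha."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / alpha))
    return gamma < bound

# Hypothetical counts, for illustration only.
similar = hoeffding_close(50, 100, 55, 100, alpha=0.05)        # small gap, small samples
different = hoeffding_close(100, 1000, 600, 1000, alpha=0.05)  # large gap, larger samples
```

The bound shrinks as n1 and n2 grow, so the same gap between relative frequencies can be accepted on small counts and rejected on large ones.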

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as the value for the threshold t0.

Note that for the blue states whose frequency is less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: the frequency prefix tree acceptor FTA(S), with initial frequency 1000. The root λ halts 490 times, state a 128 times and state b 170 times; the a-transition out of the root has frequency 257.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, a and aa, b and ab:

490/1000 with 128/257

257/1000 with 64/257

253/1000 with 65/257

All tests return true.
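
The three comparisons can be reproduced from the multisample by counting prefixes. This is a sketch under assumptions: the helper names are made up here, α = 0.05 as on the earlier slide, and only the top-level tests for the pair (λ, a) are run; the recursive tests on the successor pairs are left out.

```python
import math

# The learning multisample; "" stands for the empty string λ.
S = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3, "bbb": 6,
    "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2, "abab": 2, "abba": 2,
    "abbb": 1, "baaa": 2, "baab": 2, "baba": 1, "babb": 1, "bbaa": 1, "bbab": 1,
    "bbba": 1, "aaaaa": 1, "aaaab": 1, "aaaba": 1, "aabaa": 1, "aabab": 1,
    "aabba": 1, "abbaa": 1, "abbab": 1,
}

def prefix_count(sample, prefix):
    """Number of strings of the multisample that start with the given prefix."""
    return sum(count for word, count in sample.items() if word.startswith(prefix))

def hoeffding_close(f1, n1, f2, n2, alpha):
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / alpha))
    return gamma < bound

alpha = 0.05
n_root = prefix_count(S, "")  # frequency of state λ
n_a = prefix_count(S, "a")    # frequency of state a
tests = [
    (S[""], n_root, S["a"], n_a),                                  # halting
    (prefix_count(S, "a"), n_root, prefix_count(S, "aa"), n_a),    # a-transitions
    (prefix_count(S, "b"), n_root, prefix_count(S, "ab"), n_a),    # b-transitions
]
results = [hoeffding_close(*t, alpha) for t in tests]
```

The counts obtained this way are the ones quoted on the slide (1000, 257, 64, 253, 65), and each of the three comparisons stays under the bound.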

Zadar August 2010 66

[Figure: the same FTA(S), with initial frequency 1000, before states λ and a are merged.]

Merge…

Zadar August 2010 67

[Figure: the automaton after merging a into λ and folding, with initial frequency 1000: the root now halts 660 times and carries an a-loop of frequency 341 and a b-transition of frequency 340.]

And fold

Zadar August 2010 68

[Figure: the same automaton (root halting 660 times, a-loop 341, b-transition 340, initial frequency 1000).]

Next merge: λ with b

Zadar August 2010 69

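Can we merge λ and b?

Compare λ and b, a and ba, b and bb.

660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand,

$\left( \sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} = 0.111$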
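
The arithmetic behind this rejection can be written out, taking n1 = 1341, n2 = 340 and α = 0.05. Recomputing the gap by hand gives about 0.170, slightly larger than the value printed on the slide; either value exceeds the bound, so the conclusion is unchanged.

```latex
\gamma = \left| \frac{660}{1341} - \frac{225}{340} \right|
       \approx \left| 0.4922 - 0.6618 \right| \approx 0.170

\left( \frac{1}{\sqrt{1341}} + \frac{1}{\sqrt{340}} \right)
\sqrt{\frac{1}{2} \ln \frac{2}{0.05}}
       \approx (0.0273 + 0.0542) \times 1.358 \approx 0.111
```

Since γ is larger than the bound, the test fails and the two states are not merged.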

Zadar August 2010 70

[Figure: the same automaton, with initial frequency 1000; state b becomes a Red state.]

Promotion

Zadar August 2010 71

[Figure: the same automaton, before the next merge.]

Merge

Zadar August 2010 72

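[Figure: the automaton after this merge, with initial frequency 1000; the labels shown include 660 and 291 on states and a 341, a 95, b 340, b 49, a 11 on transitions.]

And fold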

Zadar August 2010 73

[Figure: the same automaton, with initial frequency 1000, before the next merge.]

Merge

Zadar August 2010 74

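[Figure: the final two-state DFFA, with initial frequency 1000, halting frequencies 698 and 302, and transition frequencies a 354, a 96, b 351, b 49.]

And fold

[Figure: the same two-state automaton shown as a PFA.]

As a PFA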

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time.

Alergia can identify DPFA in the limit with probability 1.

No good definition of Alergia's properties.

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string: use norm L∞.

Two distributions are different if there is a string with a very different probability.

Such a string is called μ-distinguishable.

The question becomes: is there a string x such that $|\Pr_{A_q}(x) - \Pr_{A_{q'}}(x)| > \mu$?
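
The question can be illustrated on two small distributions. This is only a sketch: it assumes both distributions have finite support and are given explicitly as dictionaries, which is a simplification chosen here; the function name and the example values are made up.

```python
def mu_distinguishing_string(d1, d2, mu):
    """Return a shortest string whose probabilities differ by more than mu, or None.

    d1, d2: dicts mapping strings to probabilities (missing strings have probability 0).
    """
    support = set(d1) | set(d2)
    for x in sorted(support, key=lambda s: (len(s), s)):
        if abs(d1.get(x, 0.0) - d2.get(x, 0.0)) > mu:
            return x
    return None

# Hypothetical distributions; "" stands for the empty string λ.
d1 = {"": 0.5, "a": 0.3, "b": 0.2}
d2 = {"": 0.5, "a": 0.1, "b": 0.4}
witness = mu_distinguishing_string(d1, d2, mu=0.15)
no_witness = mu_distinguishing_string(d1, d2, mu=0.25)
```

With μ = 0.15 the string a separates the two distributions; with μ = 0.25 no string does, and under this criterion the two states would be considered mergeable.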

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI

MDL-inspired heuristic.

The criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES.

There has never been a challenge, so the state of the art is still unclear.

Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars.

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing edge labelled a with relative frequency 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
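
The search can be sketched as a walk down the tree using the «mean» operation. This is an illustrative sketch, not the author's procedure: the function name is made up, and the tolerance of 1/100 is an arbitrary assumption that must be strictly positive for the loop to be guaranteed to stop.

```python
from fractions import Fraction

def simplest_fraction_near(target, tolerance):
    """Return the first Stern-Brocot node within `tolerance` of `target`.

    The walk starts between 0/1 and 1/0 and moves left or right at each node,
    so fractions with small denominators are met first.
    """
    lo_n, lo_d = 0, 1  # left bound 0/1
    hi_n, hi_d = 1, 0  # right bound 1/0 (never turned into a Fraction)
    while True:
        m_n, m_d = lo_n + hi_n, lo_d + hi_d  # the «mean» of the two bounds
        mediant = Fraction(m_n, m_d)
        if abs(mediant - target) <= tolerance:
            return mediant
        if target < mediant:
            hi_n, hi_d = m_n, m_d
        else:
            lo_n, lo_d = m_n, m_d

# The estimate of the earlier slide: 1501 uses out of 3000 passages.
estimate = Fraction(1501, 3000)
approx = simplest_fraction_near(estimate, Fraction(1, 100))
```

With this tolerance the walk stops at 1/2 after two steps, which is the kind of simple value the idea above is after.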

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have:

$\forall \lambda > 1,\quad \left| \frac{c(x)}{n} - \frac{a}{b} \right| < \lambda \sqrt{\frac{\log \log n}{n}}$


Zadar August 2010 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over Σ*: D and D'

Kullback-Leibler divergence (or relative entropy) between D and D':

$\sum_{w \in \Sigma^*} \Pr_D(w) \left( \log \Pr_D(w) - \log \Pr_{D'}(w) \right)$
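
For distributions with a small finite support the sum can be computed directly. This is a minimal sketch under assumptions: the distributions are given as dictionaries, logarithms are taken in base 2 so that the result is in bits, and the example values are made up.

```python
import math

def kl_divergence(d, d_prime):
    """Kullback-Leibler divergence of d from d_prime, in bits."""
    total = 0.0
    for w, p in d.items():
        if p == 0:
            continue  # terms with Pr_D(w) = 0 contribute nothing
        q = d_prime.get(w, 0.0)
        if q == 0:
            return math.inf  # D gives weight to a string that D' rules out
        total += p * (math.log2(p) - math.log2(q))
    return total

# Hypothetical distributions over two strings.
D = {"a": 0.5, "b": 0.5}
D_prime = {"a": 0.25, "b": 0.75}
bits = kl_divergence(D, D_prime)
```

The divergence is zero when the two distributions coincide and becomes infinite as soon as D' assigns probability zero to a string that D can generate.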

Zadar August 2010 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relative to a test set (S).

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

Zadar August 2010 36

$\left( \prod_{w \in S} \Pr_D(w) \right)^{-\frac{1}{|S|}} = \frac{1}{\sqrt[|S|]{\prod_{w \in S} \Pr_D(w)}}$

Problem if some probability is null!
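
Perplexity can be computed from the probabilities the model assigns to the test strings. This is a small sketch: working with logarithms instead of the raw product is a choice made here to avoid numerical underflow on long test sets, and returning infinity for a null probability is one possible convention, not something the slides prescribe.

```python
import math

def perplexity(probabilities):
    """Inverse of the geometric mean of the given probabilities."""
    if any(p == 0 for p in probabilities):
        return math.inf  # a single null probability makes the product null
    mean_log = sum(math.log(p) for p in probabilities) / len(probabilities)
    return math.exp(-mean_log)

# Hypothetical probabilities assigned to a two-string test set.
value = perplexity([0.25, 1.0])
broken = perplexity([0.5, 0.0])
```

The second call shows the problem noted above: one unseen string is enough to make the perplexity infinite, which is why smoothing is usually needed in practice.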

Zadar August 2010 37

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in set S.

Zadar August 2010 38

Why multiply? (2)

Suppose we have two predictors for a coin toss:

Predictor 1: heads 60%, tails 40%

Predictor 2: heads 100%

The tests are H: 6, T: 4.

Arithmetic mean:

P1: 0.36 + 0.16 = 0.52

P2: 0.6

Predictor 2 is the better predictor ;-)
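
The coin example can be checked numerically, and the geometric mean computed alongside for comparison. This is an illustrative sketch; the variable and function names are made up here.

```python
import math

# Six heads and four tails, as in the example.
tests = ["H"] * 6 + ["T"] * 4
predictor1 = {"H": 0.6, "T": 0.4}
predictor2 = {"H": 1.0, "T": 0.0}

def arithmetic_mean(predictor, outcomes):
    return sum(predictor[o] for o in outcomes) / len(outcomes)

def geometric_mean(predictor, outcomes):
    return math.prod(predictor[o] for o in outcomes) ** (1 / len(outcomes))

arith1 = arithmetic_mean(predictor1, tests)  # about 0.52
arith2 = arithmetic_mean(predictor2, tests)  # 0.6
geo1 = geometric_mean(predictor1, tests)     # about 0.51
geo2 = geometric_mean(predictor2, tests)     # 0.0
```

The arithmetic mean ranks predictor 2 first even though it gives probability zero to every tail in the test set; the geometric mean, which multiplies the probabilities, ranks it last.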

Zadar August 2010 39

2.3 Distance d2

$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left( \Pr_D(w) - \Pr_{D'}(w) \right)^2}$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002).

This also means that equivalence of PFA is in P.

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset.

Strings appear with a frequency (or multiplicity).

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, for every transition, and for entering the initial state, such that:

- the sum of what enters is equal to what exits, and
- the sum of what halts is equal to what starts.

Zadar August 2010 43

Example

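[Figure: a three-state DFFA with initial frequency 6; halting frequencies 3, 1, 2; transition frequencies b 5, b 3, a 1, a 2, a 5, b 4.]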





Zadar August 2010 36

wS PrD(w)-1S

=1

wS PrD(w)

Problem if some probability is null

S

Zadar August 2010 37

Why multiply (1) We are trying to compute the

probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2) Suppose we have two predictors for a

coin toss Predictor 1 heads 60 tails 40 Predictor 2 heads 100

The tests are H 6 T 4 Arithmetic mean

P1 36+16=052 P2 06

Predictor 2 is the better predictor -)

Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested

Distance (L2 norm) between distributions can be exactly computed

But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

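Suppose we want to merge one state with another (both are marked in the figure)

[Figure: a DFFA with 100 strings entering the initial state; state frequencies 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6.]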

Zadar August 2010 58

Taking decisions

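Yes, if the two distributions induced are similar

[Figure: the same DFFA (state frequencies 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6), with the sub-automata induced by the two candidate states drawn separately; labels of the second drawing: 9, 11, 4, a 4, a 4, b 24, b 9.]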

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

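Alergia's test

$D_1 \approx D_2$ if $\forall x \; Pr_{D_1}(x) \approx Pr_{D_2}(x)$

Easier to test:

$Pr_{D_1}(\lambda) = Pr_{D_2}(\lambda)$

$\forall a \in \Sigma \; Pr_{D_1}(a\Sigma^*) = Pr_{D_2}(a\Sigma^*)$

And do this recursively!

Of course, do it on frequencies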

Zadar August 2010 61

Hoeffding bounds

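$$\left|\frac{f_1}{n_1} - \frac{f_2}{n_2}\right| < \sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}\left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right)$$

indicates if the relative frequencies $f_1/n_1$ and $f_2/n_2$ are sufficiently close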
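The Hoeffding-based comparison of two relative frequencies can be sketched as below. This is a minimal sketch, assuming the confidence parameter is called `alpha` and that the test accepts when the difference is strictly below the bound; the function names and the counts in the usage lines are hypothetical.

```python
import math

def hoeffding_bound(n1, n2, alpha):
    """Right-hand side of the test: shrinks as the sample sizes n1, n2 grow."""
    return math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )

def sufficiently_close(f1, n1, f2, n2, alpha):
    """True when the relative frequencies f1/n1 and f2/n2 pass the test."""
    return abs(f1 / n1 - f2 / n2) < hoeffding_bound(n1, n2, alpha)

# Hypothetical counts: a small gap on small samples, a large gap on large ones.
print(sufficiently_close(50, 100, 48, 100, 0.05))
print(sufficiently_close(900, 1000, 500, 1000, 0.05))
```

Note the asymmetry this creates: with few observations the bound is loose and almost any two states look compatible, which is one reason a frequency threshold is applied before testing Blue states.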

Zadar August 2010 62

A run of Alergia

Our learning multisample

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0

Note that for the blue states which have a frequency less than the threshold, a special merging operation takes place

Zadar August 2010 64

[Figure: FTA(S) for the multisample above. 1000 strings enter the initial state, which has final frequency 490; its a-transition has frequency 257 and leads to a state of final frequency 128.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*

490/1000 with 128/257, 257/1000 with 64/257, 253/1000 with 65/257

All tests return true
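The three comparisons can be reproduced from the multisample itself. The sketch below is illustrative only: it assumes the first count (490) belongs to the empty string, takes α = 0.05 as in the run, and counts prefix frequencies directly on the sample rather than on an automaton. `SAMPLE`, `prefix_count` and `close` are hypothetical names, and only the top-level tests are checked, not the recursive ones.

```python
import math

SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3,
    "bbb": 6, "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2,
    "abab": 2, "abba": 2, "abbb": 1, "baaa": 2, "baab": 2, "baba": 1,
    "babb": 1, "bbaa": 1, "bbab": 1, "bbba": 1, "aaaaa": 1, "aaaab": 1,
    "aaaba": 1, "aabaa": 1, "aabab": 1, "aabba": 1, "abbaa": 1, "abbab": 1,
}

def prefix_count(sample, prefix):
    """Number of strings in the multiset that start with `prefix`."""
    return sum(c for w, c in sample.items() if w.startswith(prefix))

def close(f1, n1, f2, n2, alpha=0.05):
    """Hoeffding-style test on two relative frequencies."""
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )
    return abs(f1 / n1 - f2 / n2) < bound

n_root = prefix_count(SAMPLE, "")   # strings passing through the root
n_a = prefix_count(SAMPLE, "a")     # strings passing through state a
tests = [
    close(SAMPLE[""], n_root, SAMPLE["a"], n_a),                               # halting
    close(prefix_count(SAMPLE, "a"), n_root, prefix_count(SAMPLE, "aa"), n_a),  # via a
    close(prefix_count(SAMPLE, "b"), n_root, prefix_count(SAMPLE, "ab"), n_a),  # via b
]
print(n_root, n_a, tests)
```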

Zadar August 2010 66

[Figure: the same FTA(S) as on the previous figure slide; 1000 strings enter the initial state.]

Merge…

Zadar August 2010 67

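[Figure: the automaton after the merge. 1000 strings enter the initial state, which has final frequency 660, an a-transition of frequency 341 and a b-transition of frequency 340; the state reached by b has final frequency 225, with a 77 and b 38 leaving it.]

And fold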

Zadar August 2010 68

[Figure: the same automaton (initial state: final frequency 660, a 341, b 340; state reached by b: final frequency 225, a 77, b 38); 1000 strings enter the initial state.]

Next: merge λ with b

Zadar August 2010 69

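Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*

660/1341 and 225/340 are different (giving 0.162)

On the other hand

$$\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}\left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right) = 0.111$$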
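As a check of the right-hand side, the bound can be evaluated with the values of this run, assuming α = 0.05, n1 = 1341 and n2 = 340. The arithmetic below is a worked sketch, rounded to three decimals:

```latex
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}}
\left(\tfrac{1}{\sqrt{1341}} + \tfrac{1}{\sqrt{340}}\right)
\approx 1.358 \times (0.0273 + 0.0542)
\approx 0.111
```

Evaluated directly, |660/1341 − 225/340| is about 0.170, close to the 0.162 quoted on the slide; either value exceeds 0.111, so the test rejects the merge.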

Zadar August 2010 70

[Figure: the same automaton (initial state: final frequency 660, a 341, b 340; state reached by b: final frequency 225, a 77, b 38); 1000 strings enter the initial state.]

Promotion

Zadar August 2010 71

[Figure: the same automaton after the promotion (initial state: final frequency 660, a 341, b 340; state reached by b: final frequency 225, a 77, b 38).]

Merge

Zadar August 2010 72

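[Figure: the automaton after this merge; labels as extracted: 660, 291, a 341, a 95, b 340, b 49, a 11, 297, 8, b 9, 2, 1, 2, a 2, a 1, b 2; 1000 strings enter the initial state.]

And fold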

Zadar August 2010 73

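[Figure: the automaton before the next merge; labels as extracted: 660, 225, a 341, a 95, b 340, b 49, a 11, 297, 8, b 9, 2, 1, 2, a 2, a 1, b 2; 1000 strings enter the initial state.]

Merge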

Zadar August 2010 74

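[Figure: the resulting two-state DFFA. 1000 strings enter the initial state; final frequencies 698 and 302; transitions a 354, b 351, a 96, b 49.]

And fold

[Figure: the same two-state automaton drawn as a PFA; labels as extracted: 698, 302, a 354, a 096, b 351, b 049.]

As a PFA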

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time

Alergia can identify DPFA in the limit with probability 1

No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string

Use norm L∞

Two distributions are different if there is a string with a very different probability

Such a string is called μ-distinguishable

Question becomes: is there a string x such that $|Pr_{A,q}(x) - Pr_{A,q'}(x)| > \mu$
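On distributions with finite support the question can be answered by direct enumeration. The sketch below is only an illustration of the criterion, not DSAI itself: it assumes each distribution is a Python dict from strings to probabilities and that the threshold is passed as `mu`; both function names and the toy distributions are hypothetical.

```python
def linf_distance(dist1, dist2):
    """Largest absolute difference in probability over all strings."""
    support = set(dist1) | set(dist2)
    return max(
        (abs(dist1.get(w, 0.0) - dist2.get(w, 0.0)) for w in support),
        default=0.0,
    )

def is_mu_distinguishable(dist1, dist2, mu):
    """True when some string has probabilities differing by more than mu."""
    return linf_distance(dist1, dist2) > mu

# Hypothetical toy distributions ("" is the empty string).
p = {"": 0.5, "a": 0.25, "b": 0.25}
q = {"": 0.5, "a": 0.5}
print(linf_distance(p, q))
```

With these toy values the strings a and b each differ by 0.25, so the two distributions are distinguishable for a threshold of 0.1 but not for 0.3.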

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995

PAC learnability results, in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL inspired heuristic

Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES

Never has been a challenge, so state of the art is still unclear

Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state

[Figure: a state with frequency 3000 and an outgoing a-transition labelled 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation

$$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$$

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value
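The search can be sketched as a walk down the tree using the «mean» operation, stopping at the first fraction close enough to the estimate. This is a sketch under assumptions: the stopping rule is a fixed `tolerance` chosen by the caller (hypothetical, not the criterion of the talk), the value is non-negative and the tolerance is positive. The function name is hypothetical.

```python
from fractions import Fraction

def stern_brocot_approx(value, tolerance):
    """Return the first Stern-Brocot fraction within `tolerance` of `value`.

    Starts from the bounds 0/1 and 1/0 and repeatedly takes the mediant
    (a+c)/(b+d), moving the lower or upper bound towards `value`.
    Assumes value >= 0 and tolerance > 0.
    """
    lo_n, lo_d = 0, 1   # lower bound 0/1
    hi_n, hi_d = 1, 0   # upper bound 1/0 (stands for infinity)
    while True:
        med_n, med_d = lo_n + hi_n, lo_d + hi_d
        mediant = Fraction(med_n, med_d)
        if abs(mediant - value) <= tolerance:
            return mediant
        if mediant < value:
            lo_n, lo_d = med_n, med_d
        else:
            hi_n, hi_d = med_n, med_d

# The estimate from the earlier slide: 1501 uses out of 3000 passages.
estimate = Fraction(1501, 3000)
print(stern_brocot_approx(estimate, Fraction(1, 100)))
```

Because fractions nearer the root have smaller denominators, the first hit is a simple fraction; here the walk visits 1/1 and then stops at 1/2.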

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n we have

$$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \sqrt{\frac{\lambda \log\log n}{n}} \qquad \forall \lambda > 1$$


Zadar August 2010 37

Why multiply (1)

We are trying to compute the probability of independently drawing the different strings in set S

Zadar August 2010 38

Why multiply (2)

Suppose we have two predictors for a coin toss

Predictor 1: heads 60%, tails 40%

Predictor 2: heads 100%

The tests are H: 6, T: 4

Arithmetic mean: P1: 0.36 + 0.16 = 0.52; P2: 0.6

Predictor 2 is the better predictor ;-)
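The contrast between averaging and multiplying can be made concrete. The sketch below assumes each predictor is a dict from outcome to probability and the tests a dict from outcome to count; the function names are hypothetical. It reproduces the arithmetic means of the slide and then computes the product of the probabilities of the ten independent tosses.

```python
import math

def arithmetic_mean(predictor, tests):
    """Average probability assigned to the observed outcomes."""
    total = sum(tests.values())
    return sum(predictor.get(o, 0.0) * c for o, c in tests.items()) / total

def likelihood(predictor, tests):
    """Probability of the whole test sequence, assuming independent draws."""
    return math.prod(predictor.get(o, 0.0) ** c for o, c in tests.items())

predictor1 = {"H": 0.6, "T": 0.4}
predictor2 = {"H": 1.0}
tests = {"H": 6, "T": 4}

for name, pred in (("P1", predictor1), ("P2", predictor2)):
    print(name, arithmetic_mean(pred, tests), likelihood(pred, tests))
```

Under the arithmetic mean Predictor 2 wins, yet it gives probability zero to the observed sequence because it rules out tails; the product penalises this, which is the reason for multiplying.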

Zadar August 2010 39

2.3 Distance d2

$$d_2(D, D') = \sqrt{\sum_{w \in \Sigma^*} \left(Pr_D(w) - Pr_{D'}(w)\right)^2}$$

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002)

This also means that equivalence of PFA is in P
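For intuition only, the distance can be evaluated directly when both distributions have finite support, for instance two empirical distributions. The sketch below assumes distributions given as Python dicts from strings to probabilities (a hypothetical representation); it does not implement the polynomial-time computation on PFA mentioned above.

```python
import math

def d2(dist1, dist2):
    """Euclidean (L2) distance between two finite-support distributions."""
    support = set(dist1) | set(dist2)
    return math.sqrt(
        sum((dist1.get(w, 0.0) - dist2.get(w, 0.0)) ** 2 for w in support)
    )

# Hypothetical toy distributions over strings ("" is the empty string).
p = {"": 0.5, "a": 0.25, "b": 0.25}
q = {"": 0.5, "a": 0.5}
distance = d2(p, q)
print(distance)
```

A distance of zero between two distributions means they agree on every string, which is why a procedure for d2 also decides equivalence.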

Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset

Strings appear with a frequency (or multiplicity)

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts
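The two conditions can be checked mechanically. The sketch below is a minimal illustration, assuming a hypothetical encoding: `final_freq` maps each state to its halting frequency, `trans` maps (state, symbol) to (target, frequency), and `initial_freq` is the frequency entering the initial state. The example automaton is made up for the purpose.

```python
def is_consistent_dffa(initial, initial_freq, final_freq, trans):
    """Check that flow is conserved in every state and overall."""
    for state in final_freq:
        # What enters: the initial frequency (if any) plus incoming edges.
        entering = initial_freq if state == initial else 0
        entering += sum(f for (dst, f) in trans.values() if dst == state)
        # What exits: strings halting here plus outgoing edges.
        exiting = final_freq[state]
        exiting += sum(f for (src, _), (_, f) in trans.items() if src == state)
        if entering != exiting:
            return False
    # What halts overall must equal what starts.
    return sum(final_freq.values()) == initial_freq

# Hypothetical two-state example: 5 strings start in q0.
final_freq = {"q0": 2, "q1": 3}
trans = {("q0", "a"): ("q1", 3), ("q1", "b"): ("q1", 4)}
print(is_consistent_dffa("q0", 5, final_freq, trans))
```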

Zadar August 2010 43

Example

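[Figure: a DFFA with three states; final frequencies 3, 1 and 2; transitions b 5, b 3, a 1, a 2, a 5, b 4; 6 strings enter the initial state.]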

Zadar August 2010 44

From a DFFA to a DPFA

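[Figure: the same automaton as a DPFA. Final probabilities 3/13, 1/6, 2/7; transitions b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7; initial probability 6/6.]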

Frequencies become relative frequencies by dividing by sum of exiting frequencies
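The normalisation can be sketched as follows, using the frequencies of the example. The grouping of transitions by state is read off the denominators 13, 6 and 7 of the figure; the state names q1, q2, q3 and the dict layout are hypothetical, and transition targets are left out since they do not affect the probabilities.

```python
from fractions import Fraction

def dffa_to_dpfa(final_freq, out_freq):
    """Divide every frequency by the total frequency leaving its state."""
    probabilities = {}
    for state, halting in final_freq.items():
        total = halting + sum(out_freq[state].values())
        probabilities[state] = {
            "final": Fraction(halting, total),
            "out": {s: Fraction(f, total) for s, f in out_freq[state].items()},
        }
    return probabilities

# Frequencies of the example; state names are assumptions.
final_freq = {"q1": 3, "q2": 1, "q3": 2}
out_freq = {
    "q1": {"a": 5, "b": 5},
    "q2": {"a": 2, "b": 3},
    "q3": {"a": 1, "b": 4},
}
dpfa = dffa_to_dpfa(final_freq, out_freq)
print(dpfa["q1"]["final"], dpfa["q3"]["out"]["b"])
```

`Fraction` keeps the values exact, so 3/6 is stored in lowest terms as 1/2, and the probabilities leaving each state sum to exactly 1.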



Zadar August 2010 39

23 Distance d2

d2(D Drsquo)= w (PrD(w)-PrDrsquo(w))2

Can be computed in polynomial time if D and Drsquo are given by PFA (Carrasco amp cdlh 2002)

This also means that equivalence of PFA is in P

Zadar August 2010 40

3 FFAFrequency Finite (state)

Automata

Zadar August 2010 41

A learning sample is a multiset Strings appear with a frequency (or

multiplicity) S= (3) aaa (4) aaba (2) ababa (1)

bb (3) bbaaa (1)

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
- 490/1000 with 128/257
- 257/1000 with 64/257
- 253/1000 with 65/257

All tests return true.
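The three comparisons above can be reproduced directly from the multisample S. The sketch below is only an illustration under stated assumptions: the helper names (`halting`, `passing`, `close`, `compare`) are hypothetical, the empty string stands for λ, and only the top level of the recursive test is shown.

```python
import math

# The learning multisample S from the slides; "" stands for the empty string.
S = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3, "bbb": 6,
    "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2, "abab": 2, "abba": 2,
    "abbb": 1, "baaa": 2, "baab": 2, "baba": 1, "babb": 1, "bbaa": 1, "bbab": 1,
    "bbba": 1, "aaaaa": 1, "aaaab": 1, "aaaba": 1, "aabaa": 1, "aabab": 1,
    "aabba": 1, "abbaa": 1, "abbab": 1,
}

def halting(u: str) -> int:
    """Number of strings in S equal to u."""
    return S.get(u, 0)

def passing(u: str) -> int:
    """Number of strings in S that have u as a prefix."""
    return sum(count for w, count in S.items() if w.startswith(u))

def close(f1: int, n1: int, f2: int, n2: int, alpha: float = 0.05) -> bool:
    """Hoeffding-style test on two relative frequencies."""
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / alpha))
    return abs(f1 / n1 - f2 / n2) < bound

def compare(u: str, v: str, alphabet: str = "ab"):
    """Top-level tests for merging the states reached by u and v."""
    n_u, n_v = passing(u), passing(v)
    pairs = [(halting(u), halting(v))]
    pairs += [(passing(u + a), passing(v + a)) for a in alphabet]
    return [((f1, n_u), (f2, n_v), close(f1, n_u, f2, n_v)) for f1, f2 in pairs]

for left, right, ok in compare("", "a"):
    print(left, right, ok)
```

A full implementation would repeat the same comparison on the successor states, as the "do this recursively" remark in Alergia's test indicates.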

Zadar August 2010 66

[Figure: FTA(S) with the states λ and a selected for merging.]

Merge…

Zadar August 2010 67

[Figure: the automaton after merging λ and a. The initial state has halting frequency 660 and outgoing transitions a 341 and b 340; state b has halting frequency 225; 1000 strings enter the initial state.]

And fold

Zadar August 2010 68

[Figure: the same automaton after folding.]

Next: merge λ with b

Zadar August 2010 69

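Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*.

660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand:

$$\left(\sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}}\right)\cdot\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} = 0.111$$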

Zadar August 2010 70

[Figure: the same automaton; the merge is rejected and state b is promoted.]

Promotion

Zadar August 2010 71

[Figure: the automaton after promotion, with the next pair of states selected.]

Merge

Zadar August 2010 72

[Figure: the automaton after this merge; recoverable labels: 660, 291, 297, 8, a 341, a 95, b 340, b 49, a 11, b 9; 1000 strings enter the initial state.]

And fold

Zadar August 2010 73

[Figure: the folded automaton, with the next pair of states selected.]

Merge

Zadar August 2010 74

[Figure: the final DFFA with two states: halting frequencies 698 and 302; transitions a 354, a 96, b 351, b 49; 1000 strings enter the initial state.]

And fold

[Figure: the same two-state automaton with frequencies converted to probabilities.]

As a PFA

Zadar August 2010 75

Conclusion and logic

- Alergia builds a DFFA in polynomial time
- Alergia can identify DPFA in the limit with probability 1
- No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

- Using a distinguishable string
- Use norm $L_\infty$
- Two distributions are different if there is a string with a very different probability
- Such a string is called μ-distinguishable
- Question becomes: is there a string x such that $|Pr_{A,q}(x) - Pr_{A,q'}(x)| > \mu$?
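The criterion can be illustrated on two small, fully known distributions. This is a sketch under assumptions: the function names and the two example distributions are hypothetical, and DSAI itself works from samples rather than from exact probabilities.

```python
def linf_distance(p: dict, q: dict) -> float:
    """L-infinity distance: largest probability gap over all strings."""
    support = set(p) | set(q)
    return max(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def mu_distinguishable(p: dict, q: dict, mu: float) -> bool:
    """True if some string x has |p(x) - q(x)| > mu."""
    return linf_distance(p, q) > mu

# Hypothetical distributions over a few strings ("" is the empty string).
p = {"": 0.5, "a": 0.3, "b": 0.2}
q = {"": 0.5, "a": 0.1, "b": 0.4}

print(mu_distinguishable(p, q, mu=0.1))
print(mu_distinguishable(p, q, mu=0.25))
```

With this criterion two states are kept apart as soon as a single string separates them by more than μ, whereas Alergia's test compares prefix frequencies symbol by symbol.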

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

- MDL inspired heuristic
- Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

- A good candidate to learn NFA is DEES
- Never has been a challenge, so state of the art is still unclear
- Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing edge labelled a: 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation m:

$$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$$

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1
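The rows of the tree can be generated by repeatedly inserting the «mean» of every pair of adjacent fractions. The sketch below is an illustration only: the function names are assumptions, and fractions are kept as (numerator, denominator) pairs so that the boundary 1/0 can be represented.

```python
def next_row(row):
    """Insert (a+c)/(b+d) between every pair of adjacent fractions a/b and c/d."""
    out = [row[0]]
    for (a, b), (c, d) in zip(row, row[1:]):
        out.append((a + c, b + d))
        out.append((c, d))
    return out

def stern_brocot_levels(depth):
    """Return the fractions newly created at each of the first `depth` levels."""
    row = [(0, 1), (1, 0)]           # the two starting fractions 0/1 and 1/0
    levels = []
    for _ in range(depth):
        row = next_row(row)
        levels.append(row[1::2])     # the new fractions sit at the odd positions
    return levels

for level in stern_brocot_levels(4):
    print("  ".join(f"{n}/{d}" for n, d in level))
```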

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
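One way to picture this search is to walk down the tree from 0/1 and 1/0 and stop at the first fraction that is close enough to c(x)/n. This is a hedged sketch: the function name and the fixed tolerance `tol` are assumptions made for illustration; the slides do not specify how "good" is measured at this point.

```python
from fractions import Fraction

def simplest_fraction(value: Fraction, tol: Fraction) -> Fraction:
    """First fraction met on the Stern-Brocot path to `value` within `tol` of it.

    Assumes value >= 0 and tol > 0.
    """
    lo_n, lo_d = 0, 1    # left boundary 0/1
    hi_n, hi_d = 1, 0    # right boundary 1/0
    while True:
        mid_n, mid_d = lo_n + hi_n, lo_d + hi_d   # the "mean" of the two boundaries
        mid = Fraction(mid_n, mid_d)
        if abs(mid - value) <= tol:
            return mid
        if mid < value:
            lo_n, lo_d = mid_n, mid_d
        else:
            hi_n, hi_d = mid_n, mid_d

# The edge used 1501 times out of 3000 passages through the state.
estimate = Fraction(1501, 3000)
print(simplest_fraction(estimate, tol=Fraction(1, 100)))
```

Fractions near the top of the tree have small denominators, so stopping early returns a simple value such as 1/2 instead of 1501/3000.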

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have:

$$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \sqrt{\frac{\lambda \log\log n}{n}}, \qquad \lambda > 1$$


Zadar August 2010 40

3 FFA

Frequency Finite (state) Automata

Zadar August 2010 41

A learning sample is a multiset

Strings appear with a frequency (or multiplicity)

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that:
- the sum of what enters is equal to what exits, and
- the sum of what halts is equal to what starts

Zadar August 2010 43

Example

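[Figure: example DFFA with three states (halting frequencies 3, 1, 2), transitions b 5, b 3, a 1, a 2, a 5, b 4, and 6 strings entering the initial state.]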

Zadar August 2010 44

From a DFFA to a DPFA

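[Figure: the corresponding DPFA with halting probabilities 3/13, 1/6, 2/7, transitions b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7, and initial probability 6/6.]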

Frequencies become relative frequencies by dividing by sum of exiting frequencies
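For a single state the conversion is one division per outgoing quantity. The sketch below is illustrative only: the function name is an assumption, and the example uses the state of the figure with halting frequency 3 and two outgoing transitions of frequency 5.

```python
from fractions import Fraction

def to_probabilities(halting: int, transitions: dict):
    """Divide the halting and transition frequencies of a state by their sum."""
    total = halting + sum(transitions.values())   # everything that exits the state
    p_halt = Fraction(halting, total)
    p_trans = {symbol: Fraction(freq, total) for symbol, freq in transitions.items()}
    return p_halt, p_trans

p_halt, p_trans = to_probabilities(3, {"a": 5, "b": 5})
print(p_halt, p_trans["a"], p_trans["b"])
```

Because the frequencies leaving a state are divided by their own sum, the halting probability and the transition probabilities of each state add up to 1.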

Zadar August 2010 45

From a DFA and a sample to a DFFA

[Figure: the same DFFA as in the example, obtained by parsing the sample S with the DFA and counting.]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}

Zadar August 2010 46

Note

- Another sample may lead to the same DFFA
- Doing the same with an NFA is a much harder problem
- Typically what the Baum-Welch algorithm (EM) has been invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

- The data is a multi-set
- The FTA is the smallest tree-like FFA consistent with the data
- Can be transformed into a PFA if needed

Zadar August 2010 48

From the sample to the FTA

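S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

[Figure: FTA(S). 14 strings enter the initial state, which has halting frequency 3; the transitions on the path of each string carry the number of strings using them, e.g. a 7 and b 4 out of the initial state.]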
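The construction can be sketched as follows. This is an illustration under assumptions: states are named by the prefix that reaches them, the empty string stands for λ, and the function names are hypothetical.

```python
from collections import defaultdict

def build_fta(sample: dict):
    """Build the frequency prefix tree acceptor of a multiset {string: count}."""
    halting = defaultdict(int)   # state (prefix) -> number of strings ending there
    trans = defaultdict(int)     # (prefix, symbol) -> number of strings using the edge
    for w, count in sample.items():
        halting[w] += count
        for i in range(len(w)):
            trans[(w[:i], w[i])] += count
    return dict(halting), dict(trans)

def is_consistent(halting: dict, trans: dict, n_strings: int) -> bool:
    """Check that what enters each state equals what exits it (halting included)."""
    states = {""} | {u + a for (u, a) in trans}
    for q in states:
        entering = n_strings if q == "" else trans[(q[:-1], q[-1])]
        exiting = halting.get(q, 0) + sum(f for (u, _), f in trans.items() if u == q)
        if entering != exiting:
            return False
    return True

S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
halting, trans = build_fta(S)
print(trans[("", "a")], trans[("", "b")], is_consistent(halting, trans, sum(S.values())))
```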

Zadar August 2010 49

Red, Blue and White states

[Figure: an automaton whose states are colored Red, Blue or White; transitions are labelled a and b.]

- Red states are confirmed states
- Blue states are the (non Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

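Merge and fold

Suppose we decide to merge one state with another state.

[Figure: frequency automaton with 100 strings entering the initial state; state frequencies 60, 9, 10, 11, 6, 4; transitions a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6.]

Zadar August 2010 51

Merge and fold

First disconnect the state being merged and reconnect its incoming transition to the other state.

[Figure: the same automaton with the incoming transition redirected.]

Zadar August 2010 52

Merge and fold

Then fold.

[Figure: the same automaton while the disconnected subtree is folded into the remaining states.]

Zadar August 2010 53

Merge and fold

After folding:

[Figure: the resulting automaton; state frequencies 60, 10, 9, 10, 11, 4; transitions a 26, a 4, a 10, a 10, b 24, b 9, b 30; 100 strings enter the initial state.]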
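The disconnect, reconnect and fold steps can be sketched on a frequency prefix tree. This is a minimal illustration, not the author's implementation: the representation (`fin` for halting frequencies, `delta` for transitions), the function names, and the choice of sample (the multiset S used for the FTA slides, with "" for λ) are assumptions, and states made unreachable by the merge are simply left in the dictionaries.

```python
def build_fta(sample: dict):
    """Frequency prefix tree: fin[state] and delta[state][symbol] = (target, freq)."""
    fin, delta = {"": 0}, {"": {}}
    for w, count in sample.items():
        for i in range(len(w)):
            u, a, v = w[:i], w[i], w[:i + 1]
            if v not in fin:
                fin[v], delta[v] = 0, {}
            target, freq = delta[u].get(a, (v, 0))
            delta[u][a] = (target, freq + count)
        fin[w] += count
    return fin, delta

def fold(fin, delta, p, q):
    """Add the frequencies of the subtree rooted at q onto the automaton at p."""
    fin[p] += fin[q]
    for a, (q_next, f) in list(delta[q].items()):
        if a in delta[p]:
            p_next, f_p = delta[p][a]
            delta[p][a] = (p_next, f_p + f)       # same symbol: add the frequencies
            fold(fin, delta, p_next, q_next)      # and keep folding below
        else:
            delta[p][a] = (q_next, f)             # new symbol: attach the subtree

def merge_and_fold(fin, delta, p, q):
    """Merge state q (root of a subtree) into state p."""
    for out in delta.values():                    # disconnect q, reconnect to p
        for a, (target, f) in list(out.items()):
            if target == q:
                out[a] = (p, f)
    fold(fin, delta, p, q)

S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
fin, delta = build_fta(S)
merge_and_fold(fin, delta, "", "a")               # merge state a into the initial state
print(fin[""], delta[""])
```

After the merge the initial state still balances: 14 strings enter from outside plus 17 through the new loop on a, and 7 + 17 + 7 exit.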

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(qI, a) : a ∈ Σ}; Red = {qI}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(Ap, Aq) is small
        then A = merge_and_fold(A, p, q)
        else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

Page 42: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state and every transition and for entering the initial state such that

the sum of what enters is equal to what exits and

the sum of what halts is equal to what starts

Zadar August 2010 43

Example

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

Zadar August 2010 44

From a DFFA to a DPFA

313 16 27b 513 b 36

a 17 a 26

a 513 b 47

66

Frequencies become relative frequencies by dividing by sum of exiting frequencies

Zadar August 2010 45

From a DFA and a sample to a DFFA

3 1 2

b 5 b 3

a 1 a 2

a 5 b 4

6

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a Compare and a a and aa b

and ab 4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b Compare and b a and ba b

and bb 6601341 and 225340 are different

(giving = 0162) On the other hand

11102ln2111

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic Alergia builds a DFFA in polynomial

time Alergia can identify DPFA in the limit

with probability 1 No good definition of Alergiarsquos

properties

Zadar August 2010 76

6 DSAI and MDIWhy not change the

criterion

Zadar August 2010 77

Criterion for DSAI Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI) D Ron Y Singer and N Tishby On the

learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI MDL inspired heuristic Criterion is does the reduction of the size of the

automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing edge labelled a 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1
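The rows of the tree can be regenerated by repeatedly applying the «mean» operation to adjacent fractions. A minimal sketch, with fractions stored as (numerator, denominator) pairs and a hypothetical function name:

```python
def stern_brocot_levels(depth):
    """Return the first `depth` levels of the tree as lists of (num, den) pairs."""
    row = [(0, 1), (1, 0)]            # the two boundary fractions 0/1 and 1/0
    levels = []
    for _ in range(depth):
        new_row, level = [row[0]], []
        for left, right in zip(row, row[1:]):
            # The «mean» of a/b and c/d is (a+c)/(b+d).
            mediant = (left[0] + right[0], left[1] + right[1])
            level.append(mediant)
            new_row += [mediant, right]
        levels.append(level)
        row = new_row
    return levels


levels = stern_brocot_levels(4)
for level in levels:
    print("   ".join(f"{n}/{d}" for n, d in level))
```

Each new level holds exactly the fractions inserted between the neighbours of the previous rows, which is why the fourth level has eight entries.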

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
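One way to carry out this search is to walk down the tree from 1/1, going left or right until the current fraction is close enough to the observed ratio. The sketch below is an assumption about how that could be done: the tolerance is a free parameter here, and the function name is hypothetical.

```python
from fractions import Fraction


def simplest_approximation(value, tolerance):
    """Walk down the Stern-Brocot tree until a fraction within `tolerance` is met.

    `value` must be non-negative and `tolerance` strictly positive.
    """
    lo_n, lo_d = 0, 1      # left bound 0/1
    hi_n, hi_d = 1, 0      # right bound 1/0
    while True:
        mid = Fraction(lo_n + hi_n, lo_d + hi_d)   # «mean» of the two bounds
        if abs(mid - value) <= tolerance:
            return mid
        if value > mid:
            lo_n, lo_d = mid.numerator, mid.denominator
        else:
            hi_n, hi_d = mid.numerator, mid.denominator


# The edge used 1501 times out of 3000 passages through the state.
estimate = simplest_approximation(Fraction(1501, 3000), Fraction(1, 100))
```

With a tolerance of 1/100 the walk visits 1/1 and then stops at 1/2, a simpler value than 1501/3000.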

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have

$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log\log n}{n}}, \quad \lambda > 1$


Zadar August 2010 43

Example

[Figure: an example DFFA. State frequencies 3, 1, 2 and 6; transitions b 5, b 3, a 1, a 2, a 5, b 4.]

Zadar August 2010 44

From a DFFA to a DPFA

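[Figure: the example automaton as a DPFA. State probabilities 3/13, 1/6, 2/7 and 6/6; transitions b 5/13, b 3/6, a 1/7, a 2/6, a 5/13, b 4/7.]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies.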
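The normalisation can be sketched as follows. The frequencies are those of the example, but the slide's drawing is not recoverable from the transcript, so the state names and the targets of the transitions below are assumptions made only so that the code runs.

```python
from fractions import Fraction


def dffa_to_dpfa(final_freq, trans_freq):
    """Divide every frequency leaving a state by the total leaving that state.

    final_freq: {state: count of strings ending in the state}
    trans_freq: {state: {symbol: (target, count)}}
    """
    final_prob, trans_prob = {}, {}
    for q, fin in final_freq.items():
        out = trans_freq.get(q, {})
        total = fin + sum(count for _, count in out.values())
        final_prob[q] = Fraction(fin, total)
        trans_prob[q] = {a: (t, Fraction(c, total)) for a, (t, c) in out.items()}
    return final_prob, trans_prob


# Hypothetical topology; only the counts come from the example.
final_freq = {"q1": 3, "q2": 1, "q3": 2, "q4": 6}
trans_freq = {
    "q1": {"a": ("q2", 5), "b": ("q3", 5)},   # total 13
    "q2": {"a": ("q4", 2), "b": ("q1", 3)},   # total 6
    "q3": {"a": ("q1", 1), "b": ("q4", 4)},   # total 7
}
final_prob, trans_prob = dffa_to_dpfa(final_freq, trans_freq)
```

Exact fractions are used so that the probabilities leaving each state sum to exactly 1.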

Zadar August 2010 45

From a DFA and a sample to a DFFA

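[Figure: the DFFA obtained from the DFA and the sample. State frequencies 3, 1, 2 and 6; transitions b 5, b 3, a 1, a 2, a 5, b 4.]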

S = aaaa ab babb bbbb bbbbaa

Zadar August 2010 46

Note

- Another sample may lead to the same DFFA
- Doing the same with an NFA is a much harder problem
- Typically what the Baum-Welch (EM) algorithm has been invented for…

Zadar August 2010 47

The frequency prefix tree acceptor

- The data is a multi-set
- The FTA is the smallest tree-like FFA consistent with the data
- Can be transformed into a PFA if needed

Zadar August 2010 48

From the sample to the FTA

S = {λ(3), aaa(4), aaba(2), ababa(1), bb(3), bbaaa(1)}

[Figure: FTA(S). The root is reached by all 14 strings and has final frequency 3; its outgoing transitions are a 7 and b 4.]
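Building the FTA amounts to counting, for every prefix, how many strings of the multi-set pass through it and how many end there. A minimal sketch, assuming states are named by the prefix that reaches them (the function name is hypothetical):

```python
from collections import Counter


def build_fta(sample):
    """sample: {string: multiplicity}. Returns three counters keyed by prefix.

    freq[u]       number of strings passing through state u
    final[u]      number of strings ending in state u
    trans[(u, a)] number of strings using the edge from u labelled a
    """
    freq, final, trans = Counter(), Counter(), Counter()
    for w, k in sample.items():
        final[w] += k
        for i in range(len(w) + 1):
            freq[w[:i]] += k
        for i in range(len(w)):
            trans[(w[:i], w[i])] += k
    return freq, final, trans


# The sample of this slide; the empty string stands for λ.
S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
freq, final, trans = build_fta(S)
```

The counts agree with the labels that survive from the figure: 14 strings at the root, 3 of them ending there, 7 leaving by a and 4 by b.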

Zadar August 2010 49

Red, Blue and White states

[Figure: an automaton whose states are coloured Red, Blue and White.]

- Red states are confirmed states
- Blue states are the (non Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

[Figure: a DFFA with total frequency 100, before the merge.]

Suppose we decide to merge with state

Zadar August 2010 51

Merge and fold

[Figure: the same DFFA at the disconnect-and-reconnect step.]

First disconnect and reconnect to

Zadar August 2010 52

Merge and fold

[Figure: the same DFFA, about to be folded.]

Then fold

Zadar August 2010 53

Merge and fold

[Figure: the DFFA after folding; total frequency 100.]

after folding
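The two steps, reconnecting the edge and then folding the subtree, can be sketched on a frequency prefix tree. This is an illustration under assumptions: states are named by prefixes, the automaton is stored in two dictionaries, and the sample is the small multi-set used for the FTA earlier in the deck rather than the automaton drawn on these slides.

```python
def build_fta(sample):
    """final[q] = strings ending in q; trans[q][a] = [target, count]."""
    final, trans = {"": 0}, {"": {}}
    for w, k in sample.items():
        q = ""
        for a in w:
            if a not in trans[q]:
                trans[q][a] = [q + a, 0]
                final[q + a], trans[q + a] = 0, {}
            trans[q][a][1] += k
            q = q + a
        final[q] += k
    return final, trans


def fold(final, trans, q, r):
    """Fold the subtree rooted at r into q, adding up the frequencies."""
    final[q] += final[r]
    for a, (target, count) in trans[r].items():
        if a in trans[q]:
            trans[q][a][1] += count
            fold(final, trans, trans[q][a][0], target)
        else:
            trans[q][a] = [target, count]   # graft the whole subtree
    del final[r], trans[r]


def merge_and_fold(final, trans, q, r):
    """Redirect the edge entering r towards q, then fold r into q."""
    for edges in trans.values():
        for edge in edges.values():
            if edge[0] == r:
                edge[0] = q
    fold(final, trans, q, r)


final, trans = build_fta({"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1})
merge_and_fold(final, trans, "", "a")   # merge state a into the root
```

Folding keeps the automaton deterministic: whenever both states already have an edge for the same symbol, the counts are added and the two targets are folded in turn.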

Zadar August 2010 54

State merging algorithm

A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
While Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red} − Red

Zadar August 2010 55

The real question

- How do we decide if d(A_p, A_q) is small?
- Use a distance…
- Be able to compute this distance
- If possible, update the computation easily
- Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

- If the two distributions are known, equality can be tested
- Distance (L2 norm) between distributions can be exactly computed
- But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

[Figure: a DFFA with total frequency 100, before the merge.]

Suppose we want to merge with state

Zadar August 2010 58

Taking decisions

[Figure: the same DFFA, with the two sub-automata rooted at the candidate states shown side by side.]

Yes, if the two distributions induced are similar

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

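Alergia's test

- $D_1 \approx D_2$ if $\forall x \; Pr_{D_1}(x) \approx Pr_{D_2}(x)$
- Easier to test: $Pr_{D_1}(\lambda) = Pr_{D_2}(\lambda)$ and $\forall a \in \Sigma \; Pr_{D_1}(a\Sigma^*) = Pr_{D_2}(a\Sigma^*)$
- And do this recursively!
- Of course, do it on frequencies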

Zadar August 2010 61

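Hoeffding bounds

$\left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right| < \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right)$

indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close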
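The test translates directly into code. A minimal sketch, with a hypothetical function name and illustrative counts that are not taken from the slides:

```python
import math


def hoeffding_close(f1, n1, f2, n2, alpha):
    """True if f1/n1 and f2/n2 are sufficiently close for confidence parameter alpha."""
    gamma = abs(f1 / n1 - f2 / n2)
    bound = math.sqrt(0.5 * math.log(2 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return gamma < bound


# Small samples: a gap of 0.05 is well inside the bound (about 0.27).
small = hoeffding_close(50, 100, 55, 100, alpha=0.05)

# Larger samples: a gap of 0.2 exceeds the bound (about 0.09).
large = hoeffding_close(500, 1000, 700, 1000, alpha=0.05)
```

The bound shrinks as the counts grow, so the same gap can pass on little data and fail once more evidence is available.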

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

- Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.
- Note that for the blue states which have a frequency less than the threshold, a special merging operation takes place

Zadar August 2010 64

[Figure: the frequency prefix tree acceptor built from the learning multisample; 1000 strings in total.]

Zadar August 2010 65

Can we merge λ and a?

- Compare λ and a, a and aa, b and ab
- 490/1000 with 128/257, 257/1000 with 64/257, 253/1000 with 65/257
- All tests return true
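The counts used in these three comparisons can be recomputed from the multisample, and the tests rerun with α = 0.05. The sketch below covers only these top-level comparisons, not the recursive part of the test; the helper names are hypothetical.

```python
import math
from collections import Counter

SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3, "bbb": 6,
    "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2, "abab": 2, "abba": 2,
    "abbb": 1, "baaa": 2, "baab": 2, "baba": 1, "babb": 1, "bbaa": 1, "bbab": 1,
    "bbba": 1, "aaaaa": 1, "aaaab": 1, "aaaba": 1, "aabaa": 1, "aabab": 1,
    "aabba": 1, "abbaa": 1, "abbab": 1,
}


def prefix_counts(sample):
    """Number of strings of the multisample passing through each prefix."""
    passing = Counter()
    for w, k in sample.items():
        for i in range(len(w) + 1):
            passing[w[:i]] += k
    return passing


def close(f1, n1, f2, n2, alpha=0.05):
    """Hoeffding test on the relative frequencies f1/n1 and f2/n2."""
    bound = math.sqrt(0.5 * math.log(2 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) < bound


passing = prefix_counts(SAMPLE)
n_root, n_a = passing[""], passing["a"]
tests = [
    close(SAMPLE[""], n_root, SAMPLE["a"], n_a),       # 490/1000 with 128/257
    close(passing["a"], n_root, passing["aa"], n_a),   # 257/1000 with 64/257
    close(passing["b"], n_root, passing["ab"], n_a),   # 253/1000 with 65/257
]
```

The empty string stands for λ; the three tests compare the final frequency and the two outgoing frequencies of the root with those of state a.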

Zadar August 2010 66

[Figure: the frequency prefix tree acceptor built from the learning multisample; 1000 strings in total.]

Merge…

Zadar August 2010 67

[Figure: the automaton at this step of the run; 1000 strings in total.]

And fold

Zadar August 2010 68

[Figure: the automaton at this step of the run; 1000 strings in total.]

Next: merge λ with b

Zadar August 2010 69

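Can we merge λ and b?

- Compare λ and b, a and ba, b and bb
- 660/1341 and 225/340 are different (giving a difference of γ = 0.162)
- On the other hand, $\sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) = 0.111$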
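The value of the bound can be checked by hand, taking α = 0.05 as set for this run and n1 = 1341, n2 = 340 as the two denominators of the compared frequencies:

```latex
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}}
\left(\frac{1}{\sqrt{1341}} + \frac{1}{\sqrt{340}}\right)
\approx 1.358 \times (0.0273 + 0.0542)
\approx 0.111
```

The difference reported on the slide is larger than this bound, so the test fails and the two states are not merged.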

Zadar August 2010 70

[Figure: the automaton at this step of the run; 1000 strings in total.]

Promotion

Zadar August 2010 71

[Figure: the automaton at this step of the run.]

Merge

Zadar August 2010 46

Note Another sample may lead to the same

DFFA Doing the same with a NFA is a much

harder problem Typically what algorithm Baum-Welch

(EM) has been invented forhellip

Zadar August 2010 47

The frequency prefix tree acceptor

The data is a multi-set The FTA is the smallest tree-like FFA

consistent with the data Can be transformed into a PFA if

needed

Zadar August 2010 48

From the sample to the FTA

3

24

3

a7

a6a4 a2

b4b4

b2

1

a1a1

a1

b1 1a1 b1

a1

S= (3) aaa (4) aaba (2) ababa (1) bb (3) bbaaa (1)

FTA(S)

14

Zadar August 2010 49

Red Blue and White states

a

a

abb

b

a

ba

-Red states are confirmed states-Blue states are the (non Red) successors of the Red states-White states are the others

Same as with DFA and what RPNI does

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the blue states that have a frequency less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: FTA(S) for the multisample above, with 1000 strings entering the initial state. The initial state λ has frequency 490; its a-transition carries 257 strings to state a (frequency 128), whose a- and b-transitions carry 64 and 65 strings.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, a and aa, b and ab:
- 490/1000 with 128/257
- 257/1000 with 64/257
- 253/1000 with 65/257

All tests return true.
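As a check, the three comparisons can be worked out against the Hoeffding bound. The computation below assumes that the parameter set to 0.05 for this run is the α of the bound, and that n1 = 1000 and n2 = 257 are the frequencies of the two states being compared; the rounding is approximate.

```latex
\begin{align*}
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}}\left(\tfrac{1}{\sqrt{1000}}+\tfrac{1}{\sqrt{257}}\right)
  &\approx 1.358 \times (0.0316 + 0.0624) \approx 0.128 \\
\left|\tfrac{490}{1000}-\tfrac{128}{257}\right| &\approx |0.4900 - 0.4981| \approx 0.008 < 0.128 \\
\left|\tfrac{257}{1000}-\tfrac{64}{257}\right| &\approx |0.2570 - 0.2490| \approx 0.008 < 0.128 \\
\left|\tfrac{253}{1000}-\tfrac{65}{257}\right| &\approx |0.2530 - 0.2529| \approx 0.0001 < 0.128
\end{align*}
```

Each difference is far below the bound, which is why the merge is accepted.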

Zadar August 2010 66

Merge…

[Figure: FTA(S), unchanged from the previous figure, with 1000 strings entering the initial state.]

Zadar August 2010 67

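And fold.

[Figure: the automaton after merging and folding, with 1000 strings entering the initial state. The initial state now has frequency 660, an a-loop carrying 341 strings, and a b-transition carrying 340 strings to state b (frequency 225).]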

Zadar August 2010 68

Next: merge λ with b?

[Figure: the same automaton as after the previous fold: initial state with frequency 660, a-loop 341, b-transition 340 to state b (frequency 225).]

Zadar August 2010 69

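Can we merge λ and b?

Compare λ and b, a and ba, b and bb.

660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand,

$$\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}\,\left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right) = 0.111$$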
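The two quantities can be recomputed directly, assuming α = 0.05, n1 = 1341 and n2 = 340. The recomputed difference comes out near 0.170 rather than the quoted 0.162; either value exceeds the bound of about 0.111, so the test fails and the merge is rejected.

```latex
\begin{align*}
\left|\tfrac{660}{1341}-\tfrac{225}{340}\right| &\approx |0.4922 - 0.6618| \approx 0.170 \\
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}}\left(\tfrac{1}{\sqrt{1341}}+\tfrac{1}{\sqrt{340}}\right)
  &\approx 1.358 \times (0.0273 + 0.0542) \approx 0.111
\end{align*}
```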

Zadar August 2010 70

Promotion

[Figure: the same automaton, with state b promoted; 1000 strings enter the initial state.]

Zadar August 2010 71

Merge

[Figure: the same automaton, before the next merge.]

Zadar August 2010 72

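And fold.

[Figure: the automaton after this merge and fold; the initial state (frequency 660) keeps its a-loop (341) and its b-transition (340); 1000 strings enter the initial state.]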

Zadar August 2010 73

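Merge

[Figure: the automaton before the next merge; the initial state (frequency 660) keeps its a-loop (341) and its b-transition (340); 1000 strings enter the initial state.]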

Zadar August 2010 74

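And fold.

[Figure: the final two-state DFFA, with 1000 strings entering the initial state; state frequencies 698 and 302; transition labels a 354, a 96, b 351, b 49.]

As a PFA:

[Figure: the same two-state automaton with labels .698, .302, a .354, a .096, b .351, b .049.]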

Zadar August 2010 75

Conclusion and logic
- Alergia builds a DFFA in polynomial time
- Alergia can identify DPFA in the limit with probability 1
- No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI
- Using a distinguishable string
- Use norm L∞
- Two distributions are different if there is a string with a very different probability
- Such a string is called μ-distinguishable
- Question becomes: is there a string x such that |Pr_{A,q}(x) − Pr_{A,q'}(x)| > μ?
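The question can be illustrated with a small Python sketch. It is only an illustration of the L∞ criterion, not of DSAI itself: it assumes the two distributions are estimated from multisets of strings observed from each state, and the function names and counts are hypothetical.

```python
def linf_distance(counts_q, counts_r):
    """Largest gap between the empirical probabilities of any single string."""
    n_q = sum(counts_q.values())
    n_r = sum(counts_r.values())
    strings = set(counts_q) | set(counts_r)
    return max(
        abs(counts_q.get(x, 0) / n_q - counts_r.get(x, 0) / n_r) for x in strings
    )


def mu_distinguishable(counts_q, counts_r, mu):
    """True when some string has probabilities differing by more than mu."""
    return linf_distance(counts_q, counts_r) > mu


# Hypothetical suffix counts observed from two states ("" is the empty string).
at_q = {"": 6, "a": 3, "b": 1}
at_r = {"": 2, "a": 7, "b": 1}
print(linf_distance(at_q, at_r))                # about 0.4
print(mu_distinguishable(at_q, at_r, mu=0.2))   # the states should not be merged
```

Here the empty string and the string a each differ by about 0.4 in estimated probability, so for μ = 0.2 the two states are distinguishable.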

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI
- MDL-inspired heuristic
- Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

- A good candidate to learn NFA is DEES
- Never has been a challenge, so the state of the art is still unclear
- Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state traversed 3000 times, with an outgoing a-transition labelled 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation:

$$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$$

Zadar August 2010 85

[Figure: the first levels of the Stern-Brocot tree, one level per row]

0/1, 1/0
1/1
1/2, 2/1
1/3, 2/3, 3/2, 3/1
1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
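One way to carry out such a search is sketched below in Python. This is a sketch under assumptions: the function name `stern_brocot_approx` and the tolerance parameter `eps` (which must be positive) are hypothetical, and the stopping rule "first fraction on the path within `eps` of the target" is one simple choice, not necessarily the one intended by the author.

```python
from fractions import Fraction


def stern_brocot_approx(count, n, eps):
    """Descend the Stern-Brocot tree until a fraction is within eps of count/n."""
    target = Fraction(count, n)
    lo_num, lo_den = 0, 1   # left bound 0/1
    hi_num, hi_den = 1, 0   # right bound 1/0, standing for infinity
    while True:
        # The «mean» of the two bounds: (a + c) / (b + d).
        num, den = lo_num + hi_num, lo_den + hi_den
        candidate = Fraction(num, den)
        if abs(candidate - target) <= eps:
            return candidate
        if candidate < target:
            lo_num, lo_den = num, den   # go right
        else:
            hi_num, hi_den = num, den   # go left


# The edge used 1501 times out of 3000, with a tolerance of 1/100.
print(stern_brocot_approx(1501, 3000, Fraction(1, 100)))
```

For 1501/3000 the search visits 1/1 and then 1/2, which is already within the tolerance, so the estimate returned is 1/2 rather than 1501/3000.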

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have:

$$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \lambda\sqrt{\frac{\log\log n}{n}}, \qquad \lambda > 1$$

Zadar August 2010 47

The frequency prefix tree acceptor
- The data is a multi-set
- The FTA is the smallest tree-like FFA consistent with the data
- Can be transformed into a PFA if needed
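The transformation into a PFA amounts to dividing each count by the total count leaving its state. The Python sketch below shows this under assumptions: the dictionary representation, the function name `ffa_to_pfa`, and the state names are hypothetical, and every state is assumed to have a positive total count.

```python
from fractions import Fraction


def ffa_to_pfa(final, trans):
    """Normalise the counts leaving each state into probabilities."""
    # Total count of a state = strings ending there + strings leaving it.
    total = dict(final)
    for (state, _symbol), (_target, count) in trans.items():
        total[state] = total.get(state, 0) + count
    final_p = {q: Fraction(final.get(q, 0), total[q]) for q in total}
    trans_p = {
        key: (target, Fraction(count, total[key[0]]))
        for key, (target, count) in trans.items()
    }
    return final_p, trans_p


# Illustrative counts: 3 strings end in q0, 7 leave by a, 4 leave by b.
final = {"q0": 3, "q1": 7, "q2": 4}
trans = {("q0", "a"): ("q1", 7), ("q0", "b"): ("q2", 4)}
final_p, trans_p = ffa_to_pfa(final, trans)
print(final_p["q0"], trans_p[("q0", "a")][1], trans_p[("q0", "b")][1])
```

State q0 is traversed 14 times, so its halting probability is 3/14 and its a- and b-transitions get probabilities 1/2 and 2/7.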

Zadar August 2010 48

From the sample to the FTA

S = {λ(3), aaa(4), aaba(2), ababa(1), bb(3), bbaaa(1)}

[Figure: FTA(S), with 14 strings entering the initial state; the a-transition from the initial state carries 7 strings and the b-transition carries 4.]
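The construction of the FTA from a multiset can be sketched in Python as follows. The representation is an assumption made for illustration: states are named by the prefix that reaches them, and the function name `build_fta` is hypothetical.

```python
from collections import Counter


def build_fta(sample):
    """Build a frequency prefix tree acceptor from a multiset of strings.

    Returns (final, trans): final[p] counts the strings that end at prefix p,
    and trans[(p, a)] counts the strings that go from prefix p to prefix p + a.
    """
    final = Counter()
    trans = Counter()
    for word, count in sample.items():
        for i, symbol in enumerate(word):
            trans[(word[:i], symbol)] += count
        final[word] += count
    return final, trans


# The multiset S from the slide; "" stands for the empty string.
S = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
final, trans = build_fta(S)
print(sum(final.values()), trans[("", "a")], trans[("", "b")])
```

On this sample the 14 strings split at the root into 3 that end there, 7 that start with a, and 4 that start with b.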

Zadar August 2010 49

Red, Blue and White states

[Figure: an automaton over {a, b} with its states coloured Red, Blue and White.]

- Red states are confirmed states
- Blue states are the (non-Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does.

Zadar August 2010 50

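Merge and fold

Suppose we decide to merge one state with another state.

[Figure: frequency automaton with 100 strings entering the initial state; state frequencies 60, 9, 10, 11, 6, 4; transition labels a 26, a 4, a 6, a 4, a 10, b 24, b 24, b 9, b 6.]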

Zadar August 2010 50

Merge and fold

60

9

10

11

6

4

Suppose we decide to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 51

Merge and fold

60

9

10

11

6

4

First disconnectand reconnect to

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b

a

Zadar August 2010 52

Merge and fold

60

9

10

11

6 4

Then fold

a26

a4

a6a4

a10

b24

b24

b9

b6

100

Zadar August 2010 53

Merge and fold

60

10

9

1011

4

after folding

a26

a4a10a10

b24

b9

b30

100

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested.
The distance (L2 norm) between distributions can be exactly computed.
But what if the two distributions are unknown?

Zadar August 2010 57

Taking decisions

Suppose we want to merge one state with another.

[Figure: frequency automaton with transition and final-state counts; 100 strings enter the initial state.]

Zadar August 2010 58

Taking decisions

Yes, if the two distributions induced are similar.

[Figure: the same automaton, with the two sub-automata rooted at the candidate states shown side by side.]

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergia's test

D1 ≈ D2 if ∀x PrD1(x) ≈ PrD2(x)
Easier to test:
  PrD1(λ) = PrD2(λ)
  ∀a ∈ Σ: PrD1(aΣ*) = PrD2(aΣ*)
And do this recursively.
Of course, do it on frequencies.

Zadar August 2010 61

Hoeffding bounds

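$\gamma = \left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right|$

$\gamma < \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}}$

indicates if the relative frequencies $f_1/n_1$ and $f_2/n_2$ are sufficiently close.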
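The Hoeffding test above can be written as a short Python function. This is a minimal sketch: the function name `hoeffding_compatible` and its argument order are illustrative assumptions, and the counts in the usage lines are taken from the run of Alergia shown in these slides (α = 0.05).

```python
import math


def hoeffding_compatible(f1, n1, f2, n2, alpha=0.05):
    """Return True when f1/n1 and f2/n2 are close enough under the Hoeffding bound.

    f1, f2 are observed counts; n1, n2 are the numbers of trials (both > 0).
    """
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) * math.sqrt(0.5 * math.log(2 / alpha))
    return gamma < bound


# Counts from the run: 490/1000 against 128/257, then 660/1341 against 225/340.
first_test = hoeffding_compatible(490, 1000, 128, 257)
second_test = hoeffding_compatible(660, 1341, 225, 340)
print(first_test, second_test)
```

The first pair is accepted as compatible and the second is rejected, which matches the decisions taken in the run.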

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the blue states whose frequency is less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: frequency prefix tree acceptor for S, with 1000 strings entering the initial state.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
  490/1000 with 128/257
  257/1000 with 64/257
  253/1000 with 65/257

All tests return true.

Zadar August 2010 66

Merge…

[Figure: the same frequency prefix tree acceptor, with the states λ and a about to be merged.]

Zadar August 2010 67

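And fold

[Figure: the automaton after folding; the initial state has final count 660, an a-transition with count 341 and a b-transition with count 340 leading to a state with final count 225. 1000 strings enter the initial state.]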
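The merge and fold steps of this run can be sketched in Python as follows. This is a minimal sketch under assumptions that are not in the slides: states are named by the prefix that first reaches them, the automaton is held in two dictionaries (`final` and `trans`), and the names `build_fpta`, `fold` and `merge_and_fold` are illustrative.

```python
# Learning multisample from the run (the empty string stands for lambda).
SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3,
    "bbb": 6, "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2,
    "abab": 2, "abba": 2, "abbb": 1, "baaa": 2, "baab": 2, "baba": 1,
    "babb": 1, "bbaa": 1, "bbab": 1, "bbba": 1, "aaaaa": 1, "aaaab": 1,
    "aaaba": 1, "aabaa": 1, "aabab": 1, "aabba": 1, "abbaa": 1, "abbab": 1,
}


def build_fpta(sample):
    """Build the frequency prefix tree: final[q] and trans[q][symbol] = (target, count)."""
    final, trans = {"": 0}, {"": {}}
    for word, count in sample.items():
        state = ""
        for symbol in word:
            nxt = state + symbol
            if nxt not in final:
                final[nxt], trans[nxt] = 0, {}
            target, seen = trans[state].get(symbol, (nxt, 0))
            trans[state][symbol] = (target, seen + count)
            state = nxt
        final[state] += count
    return final, trans


def fold(final, trans, p, q):
    """Add the counts of the subtree rooted at q into the automaton at p."""
    final[p] += final[q]
    for symbol, (q_next, count) in trans[q].items():
        if symbol in trans[p]:
            p_next, seen = trans[p][symbol]
            trans[p][symbol] = (p_next, seen + count)
            fold(final, trans, p_next, q_next)
        else:
            # p has no such transition yet: graft the whole subtree of q.
            trans[p][symbol] = (q_next, count)


def merge_and_fold(final, trans, p, q, parent, symbol):
    """Redirect the transition parent --symbol--> q to p, then fold q into p."""
    target, count = trans[parent][symbol]
    assert target == q
    trans[parent][symbol] = (p, count)
    fold(final, trans, p, q)


final, trans = build_fpta(SAMPLE)
merge_and_fold(final, trans, p="", q="a", parent="", symbol="a")
print(final[""], trans[""]["a"], trans[""]["b"], final["b"])
```

Under these assumptions, merging λ with a gives a final count of 660 at the initial state, an a-loop with count 341 and a b-transition with count 340 to a state with final count 225, which are the counts used in the next compatibility test (660/1341 and 225/340).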

Zadar August 2010 68

Next: merge λ with b?

[Figure: the automaton obtained after the first merge and fold; 1000 strings enter the initial state.]

Zadar August 2010 69

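Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*.

660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand,

$\left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right) \cdot \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} = 0.111$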
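For reference, the value of the bound can be worked out step by step. This is a sketch of the arithmetic, assuming $n_1 = 1341$ and $n_2 = 340$ (the total frequencies of the two states being compared) and $\alpha = 0.05$ as set earlier; intermediate values are rounded.

```latex
\begin{align*}
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{\alpha}}
  &= \sqrt{\tfrac{1}{2}\ln 40} \approx \sqrt{1.8444} \approx 1.358 \\
\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}
  &= \frac{1}{\sqrt{1341}} + \frac{1}{\sqrt{340}} \approx 0.0273 + 0.0542 = 0.0815 \\
\left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right)\sqrt{\tfrac{1}{2}\ln\tfrac{2}{\alpha}}
  &\approx 0.0815 \times 1.358 \approx 0.111
\end{align*}
```

Since the observed difference between the two relative frequencies exceeds this bound, the test rejects the merge.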

Zadar August 2010 70

Promotion

[Figure: the same automaton, with state b promoted to Red.]

Zadar August 2010 71

Merge

[Figure: the automaton with the next pair of states about to be merged.]

Zadar August 2010 72

And fold

[Figure: the automaton after this merge has been folded; 1000 strings enter the initial state.]

Zadar August 2010 73

Merge

[Figure: the automaton with the next pair of states about to be merged; 1000 strings enter the initial state.]

Zadar August 2010 74

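And fold

[Figure: the resulting two-state DFFA, with final counts 698 and 302 and transition counts a: 354, a: 96, b: 351, b: 49; 1000 strings enter the initial state.]

As a PFA

[Figure: the same two-state automaton with the frequencies replaced by probabilities.]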

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time.
Alergia can identify DPFA in the limit with probability 1.
No good definition of Alergia's properties.

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string.
Use norm L∞.
Two distributions are different if there is a string with a very different probability.
Such a string is called μ-distinguishable.
Question becomes: is there a string x such that |PrA,q(x) − PrA,q'(x)| > μ?
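The L∞ criterion can be illustrated on empirical distributions. This is a minimal sketch and not the DSAI algorithm itself: it assumes each state is represented by a non-empty multiset of the strings (suffixes) observed from it, stored as a `Counter`, and the names `linf_distance` and `mu_distinguishable` and the toy counts are illustrative.

```python
from collections import Counter


def linf_distance(sample1, sample2):
    """Largest gap between the relative frequencies of any string in two multisets."""
    n1, n2 = sum(sample1.values()), sum(sample2.values())
    strings = set(sample1) | set(sample2)
    # A Counter returns 0 for strings it has never seen.
    return max(abs(sample1[x] / n1 - sample2[x] / n2) for x in strings)


def mu_distinguishable(sample1, sample2, mu):
    """True when some string separates the two empirical distributions by more than mu."""
    return linf_distance(sample1, sample2) > mu


# Toy multisets of suffixes seen from two states (hypothetical counts).
from_q = Counter({"": 6, "a": 3, "b": 1})
from_q_prime = Counter({"": 2, "a": 2, "b": 6})
print(linf_distance(from_q, from_q_prime))
print(mu_distinguishable(from_q, from_q_prime, mu=0.2))
```

Here the string b has relative frequency 0.1 from one state and 0.6 from the other, so the two states are distinguishable for any μ below that gap.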

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI

MDL-inspired heuristic.
Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES.
There has never been a challenge, so the state of the art is still unclear.
Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars.

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state with frequency 3000 and an outgoing edge labelled a with estimated probability 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation m:

a/b m c/d = (a+c)/(b+d)

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
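The search described above can be sketched as a walk down the Stern-Brocot tree using the «mean» operation. This is a minimal sketch: the function name `stern_brocot_approx` and the `tolerance` parameter are assumptions introduced for illustration, and the tolerance of 1/100 in the usage lines is an arbitrary choice, not a value from the slides.

```python
from fractions import Fraction


def stern_brocot_approx(value, tolerance):
    """Return the first fraction met while descending the Stern-Brocot tree
    that lies within `tolerance` of `value` (both positive Fractions)."""
    if value <= 0 or tolerance <= 0:
        raise ValueError("value and tolerance must be positive")
    lo = (0, 1)  # left boundary 0/1
    hi = (1, 0)  # right boundary 1/0, standing for infinity
    while True:
        # The «mean» of a/b and c/d is (a+c)/(b+d).
        num, den = lo[0] + hi[0], lo[1] + hi[1]
        mediant = Fraction(num, den)
        if abs(mediant - value) <= tolerance:
            return mediant
        if mediant < value:
            lo = (num, den)  # target lies to the right of the mediant
        else:
            hi = (num, den)  # target lies to the left of the mediant


# The edge was used 1501 times out of 3000 passages through the state.
estimate = Fraction(1501, 3000)
simple = stern_brocot_approx(estimate, Fraction(1, 100))
print(simple)  # prints 1/2
```

Fractions nearer the root of the tree have smaller denominators, so stopping at the first fraction within the tolerance favours simple values such as 1/2 over 1501/3000.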

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have

$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \lambda \sqrt{\frac{\log \log n}{n}} \qquad \forall \lambda > 1$




  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Slide 69
  • Slide 70
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • Slide 76
  • Slide 77
  • Slide 78
  • Slide 79
  • Slide 80
  • Slide 81
  • Slide 82
  • Slide 83
  • Slide 84
  • Slide 85
  • Slide 86
  • Slide 87
Page 54: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 54

State merging algorithmA=FTA(S) Blue =(qIa) a Red =qIWhile Blue do

choose q from Blue such that Freq(q)t0

if pRed d(ApAq) is small then A = merge_and_fold(Apq)

else Red = Red qBlue = (qa) qRed ndash Red

Zadar August 2010 55

The real question How do we decide if d(ApAq) is small Use a distancehellip Be able to compute this distance If possible update the computation

easily Have properties related to this distance

Zadar August 2010 56

Deciding if two distributions are similar If the two distributions are known

equality can be tested Distance (L2 norm) between

distributions can be exactly computed But what if the two distributions are

unknown

Zadar August 2010 57

Taking decisions

60

9

10

11

6

4

Suppose we want to merge with state

a26

a4

a6

a4

a10

b24

b24b9

b6

100

b a

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

[Figure: the frequency prefix tree acceptor FTA(S) built from the multisample, annotated with state and transition counts; 1000 strings enter the initial state.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, a and aa, b and ab:
490/1000 with 128/257
257/1000 with 64/257
253/1000 with 65/257

All tests return true
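As a worked check of these three comparisons, here is the arithmetic written out. It assumes the Hoeffding-style bound of the earlier slide with α = 0.05, n1 = 1000 and n2 = 257; values are rounded to three decimals.

```latex
\begin{align*}
\text{bound} &= \sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}}
  \left(\tfrac{1}{\sqrt{1000}} + \tfrac{1}{\sqrt{257}}\right)
  \approx 1.358 \times 0.094 \approx 0.128 \\
\left|\tfrac{490}{1000} - \tfrac{128}{257}\right| &\approx |0.490 - 0.498| = 0.008 < 0.128 \\
\left|\tfrac{257}{1000} - \tfrac{64}{257}\right| &\approx |0.257 - 0.249| = 0.008 < 0.128 \\
\left|\tfrac{253}{1000} - \tfrac{65}{257}\right| &< 0.001 < 0.128
\end{align*}
```

All three differences are far below the bound, which is why the tests at this level succeed.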

Zadar August 2010 66

[Figure: the same frequency prefix tree acceptor, annotated with state and transition counts.]

Merge…

Zadar August 2010 67

[Figure: the automaton after merging λ and a; the initial state now has final count 660, an a-loop with count 341 and a b-transition with count 340.]

And fold

Zadar August 2010 68

[Figure: the folded automaton of the previous slide.]

Next: merge λ with b

Zadar August 2010 69

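Can we merge λ and b?

Compare λ and b, a and ba, b and bb:
660/1341 and 225/340 are different (giving γ = 0.162)

On the other hand

$$\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}}\left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right) = 0.111$$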
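The value 0.111 can be retraced step by step, assuming α = 0.05 (the parameter chosen for this run), n1 = 1341 and n2 = 340 (the denominators of the two fractions on this slide). Evaluating 660/1341 ≈ 0.492 and 225/340 ≈ 0.662 directly gives a difference of about 0.170 rather than the 0.162 printed on the slide; either value exceeds the bound, so the test fails and the merge is rejected.

```latex
\begin{align*}
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}} &\approx 1.358 \\
\tfrac{1}{\sqrt{1341}} + \tfrac{1}{\sqrt{340}} &\approx 0.0273 + 0.0542 = 0.0815 \\
1.358 \times 0.0815 &\approx 0.111
\end{align*}
```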

Zadar August 2010 70

[Figure: the same automaton; the tested state is promoted.]

Promotion

Zadar August 2010 71

[Figure: the same automaton, before the next merge.]

Merge

Zadar August 2010 72

[Figure: the automaton after this merge, annotated with updated counts.]

And fold

Zadar August 2010 73

[Figure: the folded automaton, before the next merge.]

Merge

Zadar August 2010 74

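[Figure: the resulting two-state automaton; labels as extracted: state counts 698 and 302, transitions a 354, a 96, b 351, b 49.]

And fold

[Figure: the same two-state automaton, shown as a PFA.]

As a PFA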

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time
Alergia can identify DPFA in the limit with probability 1
No good definition of Alergia's properties

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string
Use norm L∞
Two distributions are different if there is a string with a very different probability
Such a string is called μ-distinguishable
Question becomes: is there a string x such that |Pr_{A,q}(x) − Pr_{A,q'}(x)| > μ?
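To make the criterion concrete, here is a minimal sketch that looks for a μ-distinguishing string between two empirical distributions. It is an illustration only: the function names are hypothetical, and the distributions are assumed to be estimated from two finite lists of strings, whereas DSAI itself works on the distributions attached to the states of the automaton.

```python
from collections import Counter


def empirical(strings):
    """Relative frequency of each string in a list of observed strings."""
    counts = Counter(strings)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


def linf_distance(d1, d2):
    """Largest absolute difference in probability over all observed strings."""
    support = set(d1) | set(d2)
    return max((abs(d1.get(x, 0.0) - d2.get(x, 0.0)) for x in support),
               default=0.0)


def distinguishing_string(d1, d2, mu):
    """Return a string whose probabilities differ by more than mu, else None."""
    for x in sorted(set(d1) | set(d2)):
        if abs(d1.get(x, 0.0) - d2.get(x, 0.0)) > mu:
            return x
    return None
```

Two states would then be kept apart whenever `distinguishing_string` returns a string, and considered for merging otherwise.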

Zadar August 2010 78

(much more to DSAI)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI

MDL-inspired heuristic
Criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.
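One possible reading of this trade-off is to accept a merge when the increase in Kullback-Leibler divergence per removed state stays below a threshold. The sketch below rests on strong assumptions: distributions are plain dictionaries over a finite set of strings (the real algorithm compares automata), and `accept_merge`, its parameters and the threshold are hypothetical names, not taken from the slides.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits, over finite supports."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf     # q gives no mass to a string that p produces
        total += px * math.log2(px / qx)
    return total


def accept_merge(reference, before, after, size_before, size_after, threshold):
    """Accept if the divergence increase per removed state is below threshold."""
    increase = kl_divergence(reference, after) - kl_divergence(reference, before)
    reduction = size_before - size_after
    return reduction > 0 and increase / reduction < threshold
```

Here `reference` stands for the distribution of the data, while `before` and `after` are the distributions of the automaton before and after the candidate merge.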

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES
Never has been a challenge, so state of the art is still unclear
Lots of room for improvement towards probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix: Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state

[Figure: a state with frequency 3000 and an outgoing a-transition labelled 1501/3000.]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation

$$\frac{a}{b} \; m \; \frac{c}{d} = \frac{a+c}{b+d}$$

Zadar August 2010 85

0/1   1/0

1/1

1/2   2/1

1/3   2/3   3/2   3/1

1/4   2/5   3/5   3/4   4/3   5/3   5/2   4/1
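The rows of the tree can be regenerated with a few lines of Python, by repeatedly inserting the «mean» (mediant) between adjacent fractions. This is a sketch with a hypothetical function name; the boundary 1/0 is kept as a pair of integers because it is not a proper fraction.

```python
from fractions import Fraction


def stern_brocot_levels(depth):
    """Return the first `depth` levels of the tree, built with the mediant."""
    # Boundary pairs (numerator, denominator): 0/1 and the formal 1/0.
    row = [(0, 1), (1, 0)]
    levels = []
    for _ in range(depth):
        # Mediant of every pair of adjacent fractions in the current row.
        new = [(a + c, b + d) for (a, b), (c, d) in zip(row, row[1:])]
        levels.append([Fraction(n, d) for n, d in new])
        # Interleave the new mediants with the fractions already present.
        merged = [row[0]]
        for mediant, right in zip(new, row[1:]):
            merged += [mediant, right]
        row = merged
    return levels
```

Calling `stern_brocot_levels(4)` yields the four levels under the boundary fractions, the last one being 1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1.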

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value
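The search can be sketched as a walk down the tree that stops at the first fraction close enough to c(x)/n. The function name and the `tolerance` parameter are assumptions made for this illustration; how the tolerance should depend on n is the subject of the iterated logarithm slide.

```python
from fractions import Fraction


def simplest_fraction(count, n, tolerance):
    """Walk down the Stern-Brocot tree until within `tolerance` of count/n."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    target = Fraction(count, n)
    lo, hi = (0, 1), (1, 0)          # boundary fractions 0/1 and 1/0
    while True:
        num, den = lo[0] + hi[0], lo[1] + hi[1]   # mediant of lo and hi
        candidate = Fraction(num, den)
        if abs(candidate - target) <= tolerance:
            return candidate
        if candidate < target:
            lo = (num, den)          # target lies in the right subtree
        else:
            hi = (num, den)          # target lies in the left subtree
```

With the counts of the estimation example, `simplest_fraction(1501, 3000, Fraction(1, 100))` returns 1/2 instead of 1501/3000.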

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have

$$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \sqrt{\frac{\lambda \log\log n}{n}}, \qquad \lambda > 1$$

Page 58: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 58

Taking decisions

60

9

10

11

6

4

Yes if the two distributions induced are similara26

a4

a6

a4

a10

b24

b24b9

b6

911

4a4a4

b24b9

Zadar August 2010 59

5 Alergia

Zadar August 2010 60

Alergiarsquos test

D1 D2 if x PrD1(x) PrD2(x)Easier to testPrD1 ()=PrD2()a PrD1 (a)=PrD2(a)

And do this recursivelyOf course do it on frequencies

Zadar August 2010 61

Hoeffding bounds

1

1

nf

2

2

1

1

nf

nf

2ln

2111

21

nn

indicates if the relative frequencies and are sufficiently close 2

2

nf

Zadar August 2010 62

A run of Alergia Our learning multisample

S=(490) a(128) b(170) aa(31) ab(42) ba(38) bb(14) aaa(8) aab(10) aba(10) abb(4) baa(9) bab(4) bba(3) bbb(6) aaaa(2) aaab(2) aaba(3) aabb(2) abaa(2) abab(2) abba(2) abbb(1) baaa(2) baab(2) baba(1) babb(1) bbaa(1) bbab(1) bbba(1) aaaaa(1) aaaab(1) aaaba(1) aabaa(1) aabab(1) aabba(1) abbaa(1) abbab(1)

Zadar August 2010 63

Parameter is arbitrarily set to 005 We choose 30 as a value for threshold t0

Note that for the blue state who have a frequency less than the threshold a special merging operation takes place

Zadar August 2010 64

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

1000

Zadar August 2010 65

Can we merge and a Compare and a a and aa b

and ab 4901000 with 128257 2571000 with 64257 2531000 with 65257

All tests return true

Zadar August 2010 66

490

10

8

31

128

38

170

a 257

a 57

a 15

b 170

b 26

b 18

9a 13

a 5

4

42b 65 a 14

b 9

14

a 64

10

4

3

6

b 6

b 7

3

2

2

2

2

1

1

1

2

2

2

1

2

1

1

1

1

1

1

1

1

1

1

a 4

a 5

a 2

a 4

a 2

a 1

a 1

a 1b 1

b 1

b 2

b 1

b 2

b 3

b 3

a

b

b

b

a

a

a

a

Mergehellip

1000

Zadar August 2010 67

660

52

225

a341

a 77

b 340

b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

And fold

1000

Zadar August 2010 68

660

52

225

a341

a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Next merge with b

1000

Zadar August 2010 69

Can we merge and b Compare and b a and ba b

and bb 6601341 and 225340 are different

(giving = 0162) On the other hand

11102ln2111

21

nn

Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic Alergia builds a DFFA in polynomial

time Alergia can identify DPFA in the limit

with probability 1 No good definition of Alergiarsquos

properties

Zadar August 2010 76

6 DSAI and MDIWhy not change the

criterion

Zadar August 2010 77

Criterion for DSAI Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI) D Ron Y Singer and N Tishby On the

learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI MDL inspired heuristic Criterion is does the reduction of the size of the

automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

AppendixStern Brocot trees

Identification of probabilitiesIf we were able to discover the

structure how do we identify the probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea Instead of returning c(x)n search the

Stern-Brocot tree to find a good simple approximation of this value

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have

$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \lambda\sqrt{\frac{\log\log n}{n}} \qquad \forall \lambda > 1$


Zadar August 2010 59

5 Alergia

Zadar August 2010 60

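Alergia's test

$D_1 \equiv D_2$ if $\forall x \; Pr_{D_1}(x) = Pr_{D_2}(x)$

Easier to test:

$Pr_{D_1}(\lambda) = Pr_{D_2}(\lambda)$

$\forall a \in \Sigma \; Pr_{D_1}(a\Sigma^*) = Pr_{D_2}(a\Sigma^*)$

And do this recursively! Of course, do it on frequencies.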

Zadar August 2010 61

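Hoeffding bounds

$\gamma = \left|\frac{f_1}{n_1} - \frac{f_2}{n_2}\right|$

$\gamma < \sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} \left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right)$

$\gamma$ indicates if the relative frequencies $\frac{f_1}{n_1}$ and $\frac{f_2}{n_2}$ are sufficiently close.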
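The test on two relative frequencies can be written directly from the bound. This is a minimal sketch: `f1` out of `n1` and `f2` out of `n2` are the two observed counts, `alpha` is the confidence parameter, and the function names are chosen here for illustration.

```python
import math


def hoeffding_bound(n1, n2, alpha):
    """sqrt(1/2 * ln(2/alpha)) * (1/sqrt(n1) + 1/sqrt(n2))."""
    return math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )


def frequencies_are_close(f1, n1, f2, n2, alpha):
    """True if the gap between f1/n1 and f2/n2 is below the Hoeffding bound."""
    gamma = abs(f1 / n1 - f2 / n2)
    return gamma < hoeffding_bound(n1, n2, alpha)
```

A smaller `alpha` widens the bound and so makes the test more willing to declare two frequencies compatible; small counts `n1` or `n2` have the same effect.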

Zadar August 2010 62

A run of Alergia

Our learning multisample:

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}
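The frequencies attached to the prefix tree can be recomputed from this multisample. The sketch below counts, for every prefix, how many strings of S pass through it; the empty string is written `""`, and the names `SAMPLE` and `prefix_frequencies` are illustrative.

```python
from collections import Counter

# The multisample from the slide: string -> number of occurrences.
SAMPLE = {
    "": 490, "a": 128, "b": 170, "aa": 31, "ab": 42, "ba": 38, "bb": 14,
    "aaa": 8, "aab": 10, "aba": 10, "abb": 4, "baa": 9, "bab": 4, "bba": 3,
    "bbb": 6, "aaaa": 2, "aaab": 2, "aaba": 3, "aabb": 2, "abaa": 2,
    "abab": 2, "abba": 2, "abbb": 1, "baaa": 2, "baab": 2, "baba": 1,
    "babb": 1, "bbaa": 1, "bbab": 1, "bbba": 1, "aaaaa": 1, "aaaab": 1,
    "aaaba": 1, "aabaa": 1, "aabab": 1, "aabba": 1, "abbaa": 1, "abbab": 1,
}


def prefix_frequencies(sample):
    """For each prefix, the number of sample strings that start with it."""
    passing = Counter()
    for word, count in sample.items():
        for i in range(len(word) + 1):
            passing[word[:i]] += count
    return passing


passing = prefix_frequencies(SAMPLE)
```

The sample holds 1000 strings in total, of which 257 start with a and 253 start with b; the count stored in `SAMPLE` itself is the number of strings that end at that prefix.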

Zadar August 2010 63

Parameter α is arbitrarily set to 0.05. We choose 30 as a value for threshold t0.

Note that for the blue states that have a frequency less than the threshold, a special merging operation takes place.

Zadar August 2010 64

[Figure: the frequency prefix tree acceptor built from S. 1000 strings enter the initial state; each state carries its termination frequency and each edge its symbol and frequency.]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, a and aa, b and ab:

490/1000 with 128/257

257/1000 with 64/257

253/1000 with 65/257

All tests return true.
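As a check on the arithmetic, assuming α = 0.05 (the value chosen for this run) and the Hoeffding-style bound used by Alergia's test with n1 = 1000 and n2 = 257, the three comparisons work out roughly as follows:

```latex
\begin{align*}
\sqrt{\tfrac{1}{2}\ln\tfrac{2}{0.05}}
  \left(\tfrac{1}{\sqrt{1000}} + \tfrac{1}{\sqrt{257}}\right)
  &\approx 1.358 \times (0.0316 + 0.0624) \approx 0.128 \\
\left|\tfrac{490}{1000} - \tfrac{128}{257}\right| &\approx 0.0081 < 0.128 \\
\left|\tfrac{257}{1000} - \tfrac{64}{257}\right| &\approx 0.0080 < 0.128 \\
\left|\tfrac{253}{1000} - \tfrac{65}{257}\right| &\approx 0.0001 < 0.128
\end{align*}
```

Each gap is far below the bound, which is why all three tests succeed.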

Zadar August 2010 66

[Figure: the same frequency prefix tree acceptor, with 1000 strings entering the initial state. Caption: Merge…]

Zadar August 2010 67

[Figure: the automaton after merging λ and a. The initial state now has termination frequency 660, an a-loop of frequency 341 and a b-edge of frequency 340; 1000 strings enter the initial state. Caption: And fold]

Zadar August 2010 68

[Figure: the same automaton, with 1000 strings entering the initial state. Caption: Next merge λ with b]

Zadar August 2010 69

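Can we merge λ and b?

Compare λ and b, a and ba, b and bb:

660/1341 and 225/340 are different (giving $\gamma = 0.162$).

On the other hand:

$\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} \left(\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}\right) = 0.111$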

Zadar August 2010 70

[Figure: the same automaton, with 1000 strings entering the initial state. Caption: Promotion]

Zadar August 2010 71

[Figure: the same automaton after the promotion. Caption: Merge]

Zadar August 2010 72

[Figure: the automaton after this merge, with 1000 strings entering the initial state. Caption: And fold]

Zadar August 2010 73

[Figure: the automaton after folding, with 1000 strings entering the initial state. Caption: Merge]

Zadar August 2010 74

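[Figure: the resulting DFFA after the last fold. Two states with termination frequencies 698 and 302; edges a 354, a 96, b 351 and b 49; 1000 strings enter the initial state. Caption: And fold]

[Figure: the same automaton with its frequencies replaced by probabilities. Caption: As a PFA]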

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time.

Alergia can identify DPFA in the limit with probability 1.

No good definition of Alergia's properties.

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Page 64: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 64

[Figure: frequency prefix tree built from a sample of 1000 strings; each state carries its final count (490 at the root, 128 at state a) and each edge its symbol and count (a 257 from the root, a 64 and b 65 from state a).]

Zadar August 2010 65

Can we merge λ and a?

Compare λ and a, λa and aa, λb and ab:

490/1000 with 128/257
257/1000 with 64/257
253/1000 with 65/257

All tests return true.
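The three comparisons can be replayed with a short script. This is a sketch under assumptions: it uses the Hoeffding-style bound (1/√n1 + 1/√n2)·√(½ ln(2/α)) with α = 0.05, a value that is not stated on this slide, and the function names are hypothetical.

```python
import math

def hoeffding_bound(n1: int, n2: int, alpha: float) -> float:
    """Largest gap between two observed frequencies still judged compatible."""
    return (1 / math.sqrt(n1) + 1 / math.sqrt(n2)) * math.sqrt(0.5 * math.log(2 / alpha))

def frequencies_compatible(f1: int, n1: int, f2: int, n2: int, alpha: float = 0.05) -> bool:
    """True when f1/n1 and f2/n2 differ by less than the bound."""
    return abs(f1 / n1 - f2 / n2) < hoeffding_bound(n1, n2, alpha)

# (f1, n1, f2, n2) for the three comparisons listed on the slide.
tests = [(490, 1000, 128, 257), (257, 1000, 64, 257), (253, 1000, 65, 257)]
results = [frequencies_compatible(*t) for t in tests]
print(results)
```

Under these assumptions every comparison passes, which agrees with the slide.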

Zadar August 2010 66

Merge…

[Figure: the same frequency prefix tree over 1000 strings, with states λ and a about to be merged.]

Zadar August 2010 67

And fold

[Figure: the automaton after merging and folding; the merged state has final count 660 with edges a 341 and b 340, and state b has final count 225 with edges a 77 and b 38.]
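Merging and folding can be sketched on a small frequency automaton. The representation below (three dictionaries) and the function names are hypothetical, and the tree is a truncated toy version of the sample, cut at depth 2, so its counts after folding differ from those in the figure. The sketch assumes the state being merged away is the root of a tree.

```python
def fold(final, count, delta, q, q_prime):
    """Add the counts of the tree rooted at q_prime onto q, recursively."""
    final[q] += final.pop(q_prime)
    for key in [k for k in delta if k[0] == q_prime]:
        symbol = key[1]
        target = delta.pop(key)
        c = count.pop(key)
        if (q, symbol) in delta:
            # q already has this edge: add the counts and fold the subtrees.
            count[(q, symbol)] += c
            fold(final, count, delta, delta[(q, symbol)], target)
        else:
            # q has no such edge: attach the subtree as it is.
            delta[(q, symbol)] = target
            count[(q, symbol)] = c

def merge_and_fold(final, count, delta, parent, symbol, q, q_prime):
    """Redirect the edge (parent, symbol) from q_prime to q, then fold."""
    delta[(parent, symbol)] = q
    fold(final, count, delta, q, q_prime)

# Toy tree (assumption): states 0=λ, 1=a, 2=b, 3=aa, 4=ab, 5=ba, 6=bb.
final = {0: 490, 1: 128, 2: 170, 3: 64, 4: 65, 5: 57, 6: 26}
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 3, (1, "b"): 4, (2, "a"): 5, (2, "b"): 6}
count = {(0, "a"): 257, (0, "b"): 253, (1, "a"): 64, (1, "b"): 65, (2, "a"): 57, (2, "b"): 26}

merge_and_fold(final, count, delta, parent=0, symbol="a", q=0, q_prime=1)
print(final, count, delta)
```

After the call, the a-edge of λ loops back to λ, and the counts of the former states a, aa and ab have been added to λ and b.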

Zadar August 2010 68

Next: merge λ with b?

[Figure: the same automaton as on the previous slide, with states λ and b as the next candidate pair.]

Zadar August 2010 69

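Can we merge λ and b?

Compare λ and b, λa and ba, λb and bb.

660/1341 and 225/340 are different (giving a value of 0.162).

On the other hand

$\left(\sqrt{\frac{1}{n_1}} + \sqrt{\frac{1}{n_2}}\right)\sqrt{\frac{1}{2}\ln\frac{2}{\alpha}} = 0.111$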
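The bound can be checked by hand, taking n1 = 1341 and n2 = 340 from the two fractions being compared. The value α = 0.05 is an assumption: it is not stated on this slide, but it reproduces the quoted 0.111.

```latex
\left(\frac{1}{\sqrt{1341}} + \frac{1}{\sqrt{340}}\right)
\sqrt{\frac{1}{2}\ln\frac{2}{0.05}}
\approx (0.0273 + 0.0542) \times 1.358
\approx 0.111
```

Since the value 0.162 quoted on the slide exceeds 0.111, the test fails and the two states are not merged.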

Zadar August 2010 70

Promotion

[Figure: the same automaton, with state b promoted.]

Zadar August 2010 71

Merge

[Figure: the same automaton, with the next pair of states about to be merged.]

Zadar August 2010 72

And fold

[Figure: the smaller automaton obtained after folding.]

Zadar August 2010 73

Merge

[Figure: the same automaton, with the next pair of states about to be merged.]

Zadar August 2010 74

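And fold

[Figure: the final two-state automaton with frequencies; states labelled 698 and 302, edges labelled a 354, b 351, a 96 and b 49.]

As a PFA

[Figure: the same two-state automaton with the frequencies rewritten as probabilities.]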
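Turning frequencies into probabilities is usually a local normalization: each count at a state is divided by the total number of passages through that state. A minimal sketch follows, with a hypothetical function name, using the root counts of the initial sample tree (490 final, 257 on a, 253 on b) because the edge layout of the final automaton cannot be read back from the figure.

```python
def to_probabilities(final_count, out_counts):
    """Divide the final count and each outgoing count by the state's total."""
    total = final_count + sum(out_counts.values())
    p_final = final_count / total
    p_out = {symbol: c / total for symbol, c in out_counts.items()}
    return p_final, p_out

p_final, p_out = to_probabilities(490, {"a": 257, "b": 253})
print(p_final, p_out)
```

The resulting values sum to 1 at the state, which is what a PFA requires.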

Zadar August 2010 75

Conclusion and logic

Alergia builds a DFFA in polynomial time.

Alergia can identify DPFA in the limit with probability 1.

No good definition of Alergia's properties.

Zadar August 2010 76

6 DSAI and MDI

Why not change the criterion?

Zadar August 2010 77

Criterion for DSAI

Using a distinguishable string.

Use norm $L_\infty$.

Two distributions are different if there is a string with a very different probability.

Such a string is called μ-distinguishable.

The question becomes: is there a string x such that $|Pr_{A,q}(x) - Pr_{A,q'}(x)| > \mu$?
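The question can be made concrete with a small sketch. It assumes the two distributions are given as finite dictionaries from strings to probabilities (for instance empirical estimates), which is a simplification, and the function names are hypothetical.

```python
def linf_distance(p, q):
    """Largest absolute difference in probability over all listed strings."""
    support = set(p) | set(q)
    return max((abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support), default=0.0)

def mu_distinguishable(p, q, mu):
    """True when some string has probabilities differing by more than mu."""
    return linf_distance(p, q) > mu

# Two toy distributions over the strings "", "a" and "b" (made-up values).
p = {"": 0.5, "a": 0.3, "b": 0.2}
q = {"": 0.5, "a": 0.1, "b": 0.4}
print(mu_distinguishable(p, q, 0.1), mu_distinguishable(p, q, 0.25))
```

Here the largest gap is 0.2, so the two toy distributions are distinguishable for μ = 0.1 but not for μ = 0.25.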

Zadar August 2010 78

(There is much more to DSAI.)

D. Ron, Y. Singer and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40, 1995.

PAC learnability results in the case where targets are acyclic graphs.

Zadar August 2010 79

Criterion for MDI

MDL-inspired heuristic.

The criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.
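One plausible reading of this trade-off is a ratio test: accept a merge when the increase in divergence (or perplexity) per state removed stays below a threshold. The sketch below is an assumption made for illustration, not the exact score from the cited paper; the function name and the threshold value are hypothetical.

```python
def mdi_accepts_merge(divergence_increase, size_before, size_after, threshold):
    """Accept when the divergence increase per removed state is below threshold."""
    reduction = size_before - size_after
    if reduction <= 0:
        return False                 # the merge did not shrink the automaton
    return divergence_increase / reduction < threshold

# Made-up numbers: removing 2 states for a small loss, or 1 state for a large one.
print(mdi_accepts_merge(0.02, 10, 8, 0.05))
print(mdi_accepts_merge(0.50, 10, 9, 0.05))
```

With these made-up numbers the first merge is accepted and the second is rejected.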







Zadar August 2010 70

660

52

225

a341

a 77

b 340b 38

12a 16

a1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Promotion

1000

Zadar August 2010 71

660

52

225

a 341 a 77

b 340b 38

12a 16

a 1020

7

6

7

b 9

b 8

2

1

1

1

2

1

1

a 2

a 1

a 1

a 1b 1

b 1

b 2

Merge

Zadar August 2010 72

660291

a341 a 95

b 340b 49 a 11

297

8b 9

2

1

2

a 2

a 1

b 2

And fold

1000

Zadar August 2010 73

660225

a341 a 95

b 340

b 49a 11

297

8b 9

2

1

2

a 2

a 1

b 2

Merge

1000

Zadar August 2010 74

698 302

a 354 a 96

b 351

b 49

And fold

1000

698 302

a 354 a 096b 351

b 049

As a PFA

Zadar August 2010 75

Conclusion and logic Alergia builds a DFFA in polynomial

time Alergia can identify DPFA in the limit

with probability 1 No good definition of Alergiarsquos

properties

Zadar August 2010 76

6 DSAI and MDIWhy not change the

criterion

Zadar August 2010 77

Criterion for DSAI Using a distinguishable string Use norm L

Two distributions are different if there is a string with a very different probability

Such a string is called -distinguishable Question becomes

Is there a string x such that |PrAq(x)-PrAqrsquo(x)|gt

Zadar August 2010 78

(much more to DSAI) D Ron Y Singer and N Tishby On the

learnability and usage of acyclic probabilistic finite automata In Proceedings of Colt 1995 pages 31ndash40 1995

PAC learnability results in the case where targets are acyclic graphs

Zadar August 2010 79

Criterion for MDI MDL inspired heuristic Criterion is does the reduction of the size of the

automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

Appendix
Stern-Brocot trees

Identification of probabilities

If we were able to discover the structure, how do we identify the probabilities?

Zadar August 2010 83

By estimation: the edge is used 1501 times out of 3000 passages through the state

[Figure: state with count 3000 and outgoing edge a labelled 1501/3000]

Zadar August 2010 84

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» operation

$\frac{a}{b} \;\mathrm{m}\; \frac{c}{d} = \frac{a+c}{b+d}$

Zadar August 2010 85

0/1  1/0

1/1

1/2  2/1

1/3  2/3  3/2  3/1

1/4  2/5  3/5  3/4  4/3  5/3  5/2  4/1
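The «mean» operation and the levels of the tree can be reproduced with a few lines. This is a minimal sketch in which fractions are (numerator, denominator) pairs so that 1/0 can be represented. The function names `mediant` and `stern_brocot_levels` are hypothetical.

```python
def mediant(left, right):
    """The «mean» of a/b and c/d: (a + c) / (b + d)."""
    (a, b), (c, d) = left, right
    return (a + c, b + d)

def stern_brocot_levels(depth):
    """Return the fractions created at each of the first `depth` levels."""
    sequence = [(0, 1), (1, 0)]  # the two starting fractions 0/1 and 1/0
    levels = []
    for _ in range(depth):
        # One new fraction between every pair of adjacent fractions.
        new = [mediant(l, r) for l, r in zip(sequence, sequence[1:])]
        levels.append(new)
        # Interleave old and new fractions to keep the sequence sorted.
        merged = []
        for old, fresh in zip(sequence, new):
            merged += [old, fresh]
        merged.append(sequence[-1])
        sequence = merged
    return levels

levels = stern_brocot_levels(4)
```

Each entry of `levels` lists the fractions created at one level, from left to right.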

Zadar August 2010 86

Idea

Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value
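The search can be sketched as a walk down the tree that stops at the first fraction close enough to the observed frequency. This is a minimal sketch, assuming a strictly positive tolerance chosen by the caller. The function name `simplest_fraction_near` and the tolerance value in the example are hypothetical.

```python
from fractions import Fraction

def simplest_fraction_near(target, tolerance):
    """Walk down the Stern-Brocot tree towards `target` (a Fraction >= 0).

    Returns the first fraction on the path whose distance to `target`
    is at most `tolerance`. Requires tolerance > 0 so the walk terminates.
    """
    lo_n, lo_d = 0, 1   # left bound 0/1
    hi_n, hi_d = 1, 0   # right bound 1/0
    while True:
        mid_n, mid_d = lo_n + hi_n, lo_d + hi_d  # «mean» of the two bounds
        mid = Fraction(mid_n, mid_d)
        if abs(mid - target) <= tolerance:
            return mid
        if mid < target:
            lo_n, lo_d = mid_n, mid_d   # go right
        else:
            hi_n, hi_d = mid_n, mid_d   # go left

# The edge used 1501 times out of 3000 passages.
estimate = simplest_fraction_near(Fraction(1501, 3000), Fraction(1, 100))
```

With a tolerance of 1/100 the walk visits 1/1 and then 1/2, which is returned in place of 1501/3000.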

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n, we have

$\left|\frac{c(x)}{n} - \frac{a}{b}\right| < \lambda \sqrt{\frac{\log \log n}{n}}, \quad \lambda > 1$
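The bound suggests how large a gap between the observed frequency and a candidate fraction can be tolerated. This is a minimal sketch, assuming the bound reads |c(x)/n − a/b| < λ·sqrt(log log n / n) with λ > 1 and natural logarithms. The function names and the default `lam=1.01` are hypothetical choices.

```python
import math

def iterated_log_tolerance(n, lam=1.01):
    """lam * sqrt(log(log(n)) / n), defined for n >= 3 and lam > 1."""
    if n < 3:
        raise ValueError("log log n is only positive for n >= 3")
    return lam * math.sqrt(math.log(math.log(n)) / n)

def within_bound(count, n, a, b, lam=1.01):
    """True if count/n lies within the tolerance of the candidate a/b."""
    return abs(count / n - a / b) < iterated_log_tolerance(n, lam)

# 1501 uses out of 3000 passages: 1/2 is a plausible value, 1/3 is not.
half_ok = within_bound(1501, 3000, 1, 2)
third_ok = within_bound(1501, 3000, 1, 3)
```

The tolerance shrinks as n grows, so with more data fewer simple fractions remain compatible with the observed frequency.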

Page 79: Zadar, August 2010 1 Learning probabilistic finite automata Colin de la Higuera University of Nantes

Zadar August 2010 79

Criterion for MDI MDL inspired heuristic Criterion is does the reduction of the size of the

automaton compensate for the increase in preplexity

F Thollard P Dupont and C de la Higuera Probabilistic Dfa inference using Kullback-Leibler divergence and minimality In Proceedings of the 17th International Conference on Machine Learning pages 975ndash982 Morgan Kaufmann San Francisco CA 2000

Zadar August 2010 80

7 Conclusion and open questions

Zadar August 2010 81

A good candidate to learn NFA is DEES Never has been a challenge so state of

the art is still unclear Lots of room for improvement towards

probabilistic transducers and probabilistic context-free grammars

Zadar August 2010 82

AppendixStern Brocot trees

Identification of probabilitiesIf we were able to discover the

structure how do we identify the probabilities

Zadar August 2010 83

By estimation the edge is used 1501 times out of 3000 passages through the state

3000

a15013000

Zadar August 2010 84

Stern-Brocot trees (Stern 1858 Brocot 1860)

Can be constructed from two simple adjacent fractions by the laquomeanraquo operation

a c a+cb d b+d=

m

Zadar August 2010 85

0 11 0

1 1

1 22 1

1 2 3 33 3 2 1

1 2 3 3 4 5 5 44 5 5 4 3 3 2 1

Zadar August 2010 86

Idea: instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
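A minimal sketch of such a search, assuming that "good" means "within a given tolerance of c(x)/n". The function name `simplest_fraction_near` and the `tolerance` parameter are illustrative assumptions; the slide does not say how the tolerance is chosen.

```python
from fractions import Fraction


def simplest_fraction_near(count, n, tolerance):
    """Descend the Stern-Brocot tree and return the first fraction
    found within `tolerance` of count/n."""
    target = Fraction(count, n)
    if count == 0:
        return Fraction(0)  # 0/1 is a boundary of the tree, not an inner node
    lo, hi = (0, 1), (1, 0)  # boundary fractions 0/1 and 1/0
    while True:
        num, den = lo[0] + hi[0], lo[1] + hi[1]  # mediant of the two bounds
        candidate = Fraction(num, den)
        if abs(candidate - target) <= tolerance:
            return candidate
        if candidate < target:
            lo = (num, den)  # target lies in the right subtree
        else:
            hi = (num, den)  # target lies in the left subtree
```

With the counts used earlier, `simplest_fraction_near(1501, 3000, 0.01)` stops at the second node visited and returns `Fraction(1, 2)`. Fractions nearer the root have smaller denominators, so the search favours simple values.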

Zadar August 2010 87

Iterated Logarithm

With probability 1, for a co-finite number of values of n we have:

$$\left| \frac{c(x)}{n} - \frac{a}{b} \right| < \sqrt{\frac{\lambda \log \log n}{n}}, \qquad \forall \lambda > 1$$
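As a rough numerical illustration, assume the bound has the form sqrt(λ · log log n / n) with λ > 1. The sketch below checks whether a candidate fraction a/b is compatible with an observed frequency c(x)/n. The function names and the default `lam=2.0` are assumptions made for the example.

```python
import math


def iterated_log_bound(n, lam=2.0):
    """Deviation allowed after n observations: sqrt(lam * log(log(n)) / n)."""
    if n < 3:
        raise ValueError("log(log(n)) is only positive for n >= 3")
    return math.sqrt(lam * math.log(math.log(n)) / n)


def is_compatible(count, n, a, b, lam=2.0):
    """True when a/b lies strictly within the bound around count/n."""
    return abs(count / n - a / b) < iterated_log_bound(n, lam)
```

For n = 3000 and λ = 2 the bound is about 0.037, so 1/2 is compatible with the observed 1501/3000 while 1/3 is not.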
