Stochastic Definite Clause Grammars
InterLogOnt, Nov 24, Saarbrücken
Christian Theil Have, [email protected]
What and why?
● DCG syntax
  – Convenient
  – Expressive
  – Flexible
● Probabilistic model
  – Polynomial parsing
  – Parameter learning
  – Robust
DCG grammar rules

Simple DCG grammar:

sentence --> subject(N), verb(N), object.
subject(sing) --> [he].
subject(plur) --> [they].
object --> [cake].
object --> [food].
verb(sing) --> [eats].
verb(plur) --> [eat].

Difference list representation:

sentence(L1,L4) :-
    subject(N,L1,L2), verb(N,L2,L3), object(L3,L4).
subject(sing,[he|R],R).
...
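As a quick illustration (standard Prolog/DCG usage, not specific to SDCG), the grammar can be queried directly via phrase/2:

| ?- phrase(sentence, [he,eats,cake]).
yes
| ?- phrase(sentence, [they,eats,cake]).   % number agreement fails by unification
no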
● Definite Clause Grammars
  – A grammar formalism on top of Prolog
  – Production rules with unification variables
  – Context-sensitive (in fact stronger; see the sketch below)
  – Exploits the unification semantics of Prolog
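A minimal sketch of why DCGs go beyond context-free power: an argument that counts occurrences handles the classic context-sensitive language a^n b^n c^n:

abc --> as(N), bs(N), cs(N).
as(zero) --> [].
as(succ(N)) --> [a], as(N).
bs(zero) --> [].
bs(succ(N)) --> [b], bs(N).
cs(zero) --> [].
cs(succ(N)) --> [c], cs(N).

| ?- phrase(abc, [a,a,b,b,c,c]).
yes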
Stochastic Definite Clause Grammars
Compilation: (S)DCG → PRISM program
● Implemented as a DCG compiler
  – With some extensions to DCG syntax
● Transforms a DCG (grammar) into a stochastic logic program implemented in PRISM.
● Probabilistic inference and parameter learning are then performed using PRISM.
Compilation process
● PRISM: http://sato-www.cs.titech.ac.jp/prism/
● Extends Prolog with random variables (msws in PRISM lingo)
● Performs probabilistic inferences over such programs:
  – Probability calculation: the probability of a derivation
  – Viterbi: find the most probable derivation
  – EM learning: learn parameters from a set of example goals
PRISM program example: Bernoulli trials
target(ber,2).
values(coin,[heads,tails]).
:- set_sw(coin, 0.6+0.4).

ber(N,[R|Y]) :-
    N > 0,
    msw(coin,R),    % Probabilistic choice
    N1 is N - 1,
    ber(N1,Y).      % Recursion
ber(0,[]).
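A possible session with this program (sample/1 and prob/2 are standard PRISM built-ins; the probability follows from the 0.6/0.4 switch setting):

| ?- sample(ber(3,Xs)).
Xs = [heads,tails,heads]   % one possible random outcome
| ?- prob(ber(2,[heads,heads]),P).
P = 0.36                   % 0.6 * 0.6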
The probabilistic model
SDCG rules:

s(N) ==> np(N).
s(N) ==> np(N),vp(N).

Compiled PRISM program:

target(s,2).
values(s,[s1,s2]).

s(A,B) :- msw(s,Outcome), s(Outcome,A,B).
s(s1,A,B) :- np(_,A,B).
s(s2,A,B) :- np(N,A,D), vp(N,D,B).
One random variable (msw) encodes the probability of expansion for all rules with the same functor/arity. The choice is made by a selection rule (s/2 above), and the selected rule is then invoked through unification (the implementation rules s(s1,...) and s(s2,...)).
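Following the same scheme, two hypothetical np rules (np(N) ==> n(N). and np(N) ==> det, n(N). — these are not part of the slide's grammar) would compile roughly as:

values(np,[np1,np2]).

np(N,A,B) :- msw(np,Outcome), np(Outcome,N,A,B).   % selection rule
np(np1,N,A,B) :- n(N,A,B).                         % implementation rules
np(np2,N,A,B) :- det(A,C), n(N,C,B).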
Unification failure

Since SDCG embodies unification constraints, some derivations may fail.
(Diagram: the space of all derivations, with the failed derivations as a subset.)
We only observe the successful derivations in sample data.
If the training algorithm only considers successful derivations, it will converge to a wrong probability distribution (missing probability mass).
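Concretely, the observed strings follow the conditional distribution P(x | success) = P(x) / (1 - P(failure)); an EM learner that ignores P(failure) maximizes the wrong objective.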
In PRISM this is handled using the fgEM algorithm, which is based on Cussens' Failure-Adjusted Maximization (FAM) algorithm.

A “failure program”, which traces all derivations, is derived using First Order Compilation (FOC), and the probabilities of failed derivations are estimated as part of the fgEM algorithm.
Unification failure issues

Infinite/long derivation paths:
● Impossible/difficult to derive the failure program.
● Workaround: SDCG has an option which limits the depth of derivation.
● Still: the size of the failure program is very much an issue.
FOC requires “universally quantified clauses”:
● Not the case with difference lists: 'C'([X|Y],X,Y).
● Workaround 1:
  – Trick the first order compiler by manually adding implications after the program is partly compiled.
  – Works empirically, but may be dubious.
● Workaround 2:
  – Append-based grammar (sketched below).
  – Works, but has inherent inefficiencies.
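A minimal sketch of the append-based idea (an illustrative encoding, not SDCG's actual output): each nonterminal binds a complete list instead of a difference list, so all clauses are universally quantified, but append/3 makes parsing traverse the input repeatedly:

sentence(L) :-
    append(L1,Rest,L), append(L2,L3,Rest),   % enumerate splits of the input
    subject(N,L1), verb(N,L2), object(L3).
subject(sing,[he]).
verb(sing,[eats]).
object([cake]).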
Syntax extensions
● SDCG extends the usual DCG syntax
  – Compatible with DCG (a superset)
● Extensions:
  – Regular expression operators
    ● Convenient rule recursion
  – “Macros”
    ● Allow writing rules as templates which are filled out according to certain rules
  – Conditioning
    ● Convenient expression of higher-order HMMs
    ● Lexicalization
Regular expression operators
Regular expression operators can be associated with rule constituents:

name ==> ?(title), +(firstname), *(lastname).

Meaning:
?   may be repeated zero or one times
*   may be repeated zero or more times
+   may be repeated one or more times

The constituent in the original rule is replaced with a substitute which refers to intermediary rules that implement the regular expression:

regex_sub ==> [].
regex_sub ==> original_constituent.
regex_sub ==> regex_sub, regex_sub.

Different subsets of these intermediary rules implement ?, + and *.

Limitation: cannot be used in rules with unification variables.
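For instance, +(firstname) could expand along these lines (the rule name is illustrative; the compiler generates its own names):

firstname_plus ==> firstname.                   % one occurrence
firstname_plus ==> firstname_plus, firstname_plus.  % or more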
Template macros
Special goals prefixed with @ are treated as macros. Grammar rules with macros are dynamically expanded.

Example:

word(he,sg,masc).
word(she,sg,fem).

number(Word,Number) :- word(Word,Number,_).
gender(Word,Gender) :- word(Word,_,Gender).
wordlist(X,[X]).

A rule using macros:

word(@number(Word,N), @gender(Word,G)) ==> @wordlist(Word,WordList).

expand_mode(number(-,+)).
expand_mode(gender(-,+)).
expand_mode(wordlist(-,+)).

expand_mode determines which variables to keep (+) in the expanded rule and which to remove (-).

A meta rule is created and called to find all answers:

exp(Word,N,G,WordList) :-
    number(Word,N), gender(Word,G), wordlist(Word,WordList).

Resulting grammar:

word(sg,masc) ==> [ he ].
word(sg,fem) ==> [ she ].
Conditioning

A conditioned rule takes the form:
name(F1,F2,...,Fn) | V1,V2,...,Vn ==> C1,C2,...,Cn.
It is possible to specify which variables must unify using a condition_mode:
condition_mode(n(+,+,-)).
n(A,B,C) | x,y ==> c1, c2.
Conditioned rules are grouped by non-terminal name and arity, and all rules in a group have the same number of conditions.
Probabilistic semantics: A distinct probability distribution for each distinct set of conditions.
The | operator can be seen as a guard that ensures the rule is only expanded if the conditions V1..Vn unify with F1..Fn (at the positions marked + in the condition_mode).
Conditioning semantics
Model without conditioning:

n ==> n1.
n ==> n2.
n1 ==> ...
...

Model with conditioning:

n | a ==> n1(X).
n | a ==> n2(X).
n | b ==> n1(X).
n | b ==> n2(X).
...

(Diagram: without conditioning there is a single stochastic choice point n over n1 and n2; with conditioning the compiler creates separate copies n1_1, n2_1 and n1_2, n2_2, one group per condition.)
(Diagram: selection using unification first picks the condition group, n|a or n|b; stochastic selection then chooses a rule within that group.)
Example: a simple toy grammar
start ==> s(N).
s(N) ==> np(N).
s(N) ==> np(N),vp(N).
np(N) ==> n(sg),n(N).
np(N) ==> n(N).
vp(N) ==> v(N),np(N).
vp(N) ==> v(N).

n(sg) ==> [time].
n(pl) ==> [flies].
v(sg) ==> [flies].
v(sg) ==> [crawls].
v(pl) ==> [fly].

Probability of a sentence:

| ?- prob(start([time,flies],[],Tree), P).
P = 0.083333333333333 ?
yes

The most probable parse:

| ?- viterbig(start([time,flies],[],Tree), P).
Tree = [start,[[s(pl),[[np(pl),[[n(sg),[[]]],[n(pl),[[]]]]]]]]]
P = 0.0625 ?
yes

The most probable parses (indeed, all two of them):

| ?- n_viterbig(10,start([time,flies],[],Tree), P).
Tree = [start,[[s(pl),[[np(pl),[[n(sg),[[]]],[n(pl),[[]]]]]]]]]
P = 0.0625 ?;
Tree = [start,[[s(sg),[[np(sg),[[n(sg),[[]]]]],[vp(sg),[[v(sg),[[]]]]]]]]]
P = 0.020833333333333 ?;
no
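These numbers can be checked by hand, assuming uniform switch probabilities (1/2 per s, np, vp and n rule; 1/3 per v rule): the first parse uses s→np (1/2), np→n(sg),n(N) (1/2), n(sg)→[time] (1/2) and n(pl)→[flies] (1/2), giving 1/16 = 0.0625; the second uses s→np,vp (1/2), np→n (1/2), n(sg)→[time] (1/2), vp→v (1/2) and v(sg)→[flies] (1/3), giving 1/48 ≈ 0.0208. The sentence probability is their sum, 1/16 + 1/48 = 1/12 ≈ 0.0833.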
More interesting example

A simple part-of-speech tagger: a fully connected first-order HMM.
Some tags:

tag(none).
tag(det).
tag(noun).
tag(verb).
tag(modalverb).

Some words:

word(the).
word(can).
word(will).
word(rust).
consume_word([Word]) :- word(Word).
conditioning_mode(tag_word(+,-,-)).
start(TagList) ==> tag_word(none,_,TagList).
tag_word(Previous, @tag(Current), [Current|TagsRest]) | @tag(SomeTag) ==>
    @consume_word(W),
    ?(tag_word(Current,_,TagsRest)).
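Assuming the compiled tagger follows the same calling convention as the toy grammar above (the nonterminal's own argument, then the difference-list pair, then the parse tree), the most probable tag sequence could be queried along these lines (illustrative; output omitted):

| ?- viterbig(start(Tags,[the,can,will,rust],[],_Tree)).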
Questions?