Stochastic Definite Clause Grammars
InterLogOnt, Nov 24, Saarbrücken
Christian Theil Have, [email protected]
What and why?
● DCG syntax
  – Convenient
  – Expressive
  – Flexible
● Probabilistic model
  – Polynomial parsing
  – Parameter learning
  – Robust
DCG grammar rules

Simple DCG grammar:

sentence --> subject(N), verb(N), object.
subject(sing) --> [he].
subject(plur) --> [they].
object --> [cake].
object --> [food].
verb(sing) --> [eats].
verb(plur) --> [eat].

Difference list representation:

sentence(L1,L4) :-
    subject(N,L1,L2), verb(N,L2,L3), object(L3,L4).
subject(sing,[he|R],R).
...
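As a quick illustration (standard Prolog/DCG usage, not specific to SDCG), the grammar can be queried directly via phrase/2:

| ?- phrase(sentence, [he,eats,cake]).
yes
| ?- phrase(sentence, [they,eats,cake]).   % number agreement fails by unification
no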
● Definite Clause Grammars
  – A grammar formalism on top of Prolog
  – Production rules with unification variables
  – Context-sensitive (in fact stronger; see the sketch below)
  – Exploits the unification semantics of Prolog
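A minimal sketch of why DCGs go beyond context-free power: an argument that counts occurrences handles the classic context-sensitive language a^n b^n c^n:

abc --> as(N), bs(N), cs(N).
as(zero) --> [].
as(succ(N)) --> [a], as(N).
bs(zero) --> [].
bs(succ(N)) --> [b], bs(N).
cs(zero) --> [].
cs(succ(N)) --> [c], cs(N).

| ?- phrase(abc, [a,a,b,b,c,c]).
yes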
Stochastic Definite Clause Grammars
Compilation: (S)DCG → PRISM program
● Implemented as a DCG compiler
  – With some extensions to DCG syntax
● Transforms a DCG (grammar) into a stochastic logic program implemented in PRISM.
● Probabilistic inference and parameter learning are then performed using PRISM.
Compilation process
● PRISM: http://sato-www.cs.titech.ac.jp/prism/
● Extends Prolog with random variables (msws in PRISM lingo)
● Performs probabilistic inferences over such programs:
  – Probability calculation: the probability of a derivation
  – Viterbi: find the most probable derivation
  – EM learning: learn parameters from a set of example goals
PRISM program example: Bernoulli trials
target(ber,2).
values(coin,[heads,tails]).
:- set_sw(coin, 0.6+0.4).

ber(N,[R|Y]) :-
    N > 0,
    msw(coin,R),    % Probabilistic choice
    N1 is N - 1,
    ber(N1,Y).      % Recursion
ber(0,[]).
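A possible session with this program (sample/1 and prob/2 are standard PRISM built-ins; the probability follows from the 0.6/0.4 switch setting):

| ?- sample(ber(3,Xs)).
Xs = [heads,tails,heads]   % one possible random outcome
| ?- prob(ber(2,[heads,heads]),P).
P = 0.36                   % 0.6 * 0.6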
The probabilistic model
SDCG rules:

s(N) ==> np(N).
s(N) ==> np(N),vp(N).

Compiled PRISM program:

target(s,2).
values(s,[s1,s2]).

s(A,B) :- msw(s,Outcome), s(Outcome,A,B).
s(s1,A,B) :- np(_,A,B).
s(s2,A,B) :- np(N,A,D), vp(N,D,B).
One random variable (msw) encodes the probability of expansion for all rules with the same functor/arity. The choice is made by a selection rule (s/2 above), and the selected rule is then invoked through unification (the implementation rules s(s1,...) and s(s2,...)).
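Following the same scheme, two hypothetical np rules (np(N) ==> n(N). and np(N) ==> det, n(N). — these are not part of the slide's grammar) would compile roughly as:

values(np,[np1,np2]).

np(N,A,B) :- msw(np,Outcome), np(Outcome,N,A,B).   % selection rule
np(np1,N,A,B) :- n(N,A,B).                         % implementation rules
np(np2,N,A,B) :- det(A,C), n(N,C,B).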
Unification failure

Since SDCG embodies unification constraints, some derivations may fail.
(Diagram: the space of all derivations, with the failed derivations as a subset.)
We only observe the successful derivations in sample data.
If the training algorithm only considers successful derivations, it will converge to a wrong probability distribution (missing probability mass).
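Concretely, the observed strings follow the conditional distribution P(x | success) = P(x) / (1 - P(failure)); an EM learner that ignores P(failure) maximizes the wrong objective.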
In PRISM this is handled using the fgEM algorithm, which is based on Cussens' Failure-Adjusted Maximization (FAM) algorithm.

A “failure program”, which traces all derivations, is derived using First Order Compilation (FOC), and the probabilities of failed derivations are estimated as part of the fgEM algorithm.
Unification failure issues

Infinite/long derivation paths:
● Impossible/difficult to derive the failure program.
● Workaround: SDCG has an option which limits the depth of derivation.
● Still: the size of the failure program is very much an issue.
FOC requires “universally quantified clauses”:
● Not the case with difference lists: 'C'([X|Y],X,Y).
● Workaround 1:
  – Trick the first order compiler by manually adding implications after the program is partly compiled.
  – Works empirically, but may be dubious.
● Workaround 2:
  – Append-based grammar (sketched below).
  – Works, but has inherent inefficiencies.
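A minimal sketch of the append-based idea (an illustrative encoding, not SDCG's actual output): each nonterminal binds a complete list instead of a difference list, so all clauses are universally quantified, but append/3 makes parsing traverse the input repeatedly:

sentence(L) :-
    append(L1,Rest,L), append(L2,L3,Rest),   % enumerate splits of the input
    subject(N,L1), verb(N,L2), object(L3).
subject(sing,[he]).
verb(sing,[eats]).
object([cake]).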
Syntax extensions
● SDCG extends the usual DCG syntax
  – Compatible with DCG (a superset)
● Extensions:
  – Regular expression operators
    ● Convenient rule recursion
  – “Macros”
    ● Allow writing rules as templates which are filled out according to certain rules
  – Conditioning
    ● Convenient expression of higher-order HMMs
    ● Lexicalization
Regular expression operators
Regular expression operators can be associated with rule constituents:

name ==> ?(title), +(firstname), *(lastname).

Meaning:
?   may be repeated zero or one times
*   may be repeated zero or more times
+   may be repeated one or more times

The constituent in the original rule is replaced with a substitute which refers to intermediary rules that implement the regular expression:

regex_sub ==> [].
regex_sub ==> original_constituent.
regex_sub ==> regex_sub, regex_sub.

Different subsets of these intermediary rules implement ?, + and *.

Limitation: cannot be used in rules with unification variables.
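For instance, +(firstname) could expand along these lines (the rule name is illustrative; the compiler generates its own names):

firstname_plus ==> firstname.                   % one occurrence
firstname_plus ==> firstname_plus, firstname_plus.  % or more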
Template macros
Special goals prefixed with @ are treated as macros. Grammar rules with macros are dynamically expanded.

Example:

word(he,sg,masc).
word(she,sg,fem).

number(Word,Number) :- word(Word,Number,_).
gender(Word,Gender) :- word(Word,_,Gender).
wordlist(X,[X]).

A rule using macros:

word(@number(Word,N), @gender(Word,G)) ==> @wordlist(Word,WordList).

expand_mode(number(-,+)).
expand_mode(gender(-,+)).
expand_mode(wordlist(-,+)).

expand_mode determines which variables to keep (+) in the expanded rule and which to remove (-).

A meta rule is created and called to find all answers:

exp(Word,N,G,WordList) :-
    number(Word,N), gender(Word,G), wordlist(Word,WordList).

Resulting grammar:

word(sg,masc) ==> [ he ].
word(sg,fem) ==> [ she ].
Conditioning

A conditioned rule takes the form:
name(F1,F2,...,Fn) | V1,V2,...,Vn ==> C1,C2,...,Cn.
It is possible to specify which variables must unify using a condition_mode:
condition_mode(n(+,+,-)).
n(A,B,C) | x,y ==> c1, c2.
Conditioned rules are grouped by non-terminal name and arity, and all rules in a group have the same number of conditions.
Probabilistic semantics: A distinct probability distribution for each distinct set of conditions.
The | operator can be seen as a guard that ensures the rule is only expanded if the conditions V1..Vn unify with F1..Fn (at the positions marked + in the condition_mode).
Conditioning semantics
Model without conditioning:

n ==> n1.
n ==> n2.
n1 ==> ...
...

Model with conditioning:

n | a ==> n1(X).
n | a ==> n2(X).
n | b ==> n1(X).
n | b ==> n2(X).
...

(Diagram: without conditioning there is a single stochastic choice point n over n1 and n2; with conditioning the compiler creates separate copies n1_1, n2_1 and n1_2, n2_2, one group per condition.)
(Diagram: selection using unification first picks the condition group, n|a or n|b; stochastic selection then chooses a rule within that group.)
Example: a simple toy grammar
start ==> s(N).
s(N) ==> np(N).
s(N) ==> np(N),vp(N).
np(N) ==> n(sg),n(N).
np(N) ==> n(N).
vp(N) ==> v(N),np(N).
vp(N) ==> v(N).

n(sg) ==> [time].
n(pl) ==> [flies].
v(sg) ==> [flies].
v(sg) ==> [crawls].
v(pl) ==> [fly].

Probability of a sentence:

| ?- prob(start([time,flies],[],Tree), P).
P = 0.083333333333333 ?
yes

The most probable parse:

| ?- viterbig(start([time,flies],[],Tree), P).
Tree = [start,[[s(pl),[[np(pl),[[n(sg),[[]]],[n(pl),[[]]]]]]]]]
P = 0.0625 ?
yes

The most probable parses (indeed, all two of them):

| ?- n_viterbig(10,start([time,flies],[],Tree), P).
Tree = [start,[[s(pl),[[np(pl),[[n(sg),[[]]],[n(pl),[[]]]]]]]]]
P = 0.0625 ?;
Tree = [start,[[s(sg),[[np(sg),[[n(sg),[[]]]]],[vp(sg),[[v(sg),[[]]]]]]]]]
P = 0.020833333333333 ?;
no
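These numbers can be checked by hand, assuming uniform switch probabilities (1/2 per s, np, vp and n rule; 1/3 per v rule): the first parse uses s→np (1/2), np→n(sg),n(N) (1/2), n(sg)→[time] (1/2) and n(pl)→[flies] (1/2), giving 1/16 = 0.0625; the second uses s→np,vp (1/2), np→n (1/2), n(sg)→[time] (1/2), vp→v (1/2) and v(sg)→[flies] (1/3), giving 1/48 ≈ 0.0208. The sentence probability is their sum, 1/16 + 1/48 = 1/12 ≈ 0.0833.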
More interesting example

A simple part-of-speech tagger: a fully connected first-order HMM.
Some tags:

tag(none).
tag(det).
tag(noun).
tag(verb).
tag(modalverb).

Some words:

word(the).
word(can).
word(will).
word(rust).
consume_word([Word]) :- word(Word).
conditioning_mode(tag_word(+,-,-)).
start(TagList) ==> tag_word(none,_,TagList).
tag_word(Previous, @tag(Current), [Current|TagsRest]) | @tag(SomeTag) ==>
    @consume_word(W),
    ?(tag_word(Current,_,TagsRest)).
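Assuming the compiled tagger follows the same calling convention as the toy grammar above (the nonterminal's own argument, then the difference-list pair, then the parse tree), the most probable tag sequence could be queried along these lines (illustrative; output omitted):

| ?- viterbig(start(Tags,[the,can,will,rust],[],_Tree)).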
Questions?