Statistical Decision-Tree Models for Parsing
NLP lab, POSTECH
김 지 협
CS730B
Contents
Abstract
Introduction
Decision-Tree Modeling
SPATTER Parsing
Statistical Parsing Models
Decision-Tree Growing & Smoothing
Decision-Tree Training
Experiment Results
Conclusion
Abstract
Syntactic NL parsers: not adequate for highly ambiguous, large-vocabulary text (e.g., the Wall Street Journal)
Premises for developing a new parser:
grammars are too complex to develop manually for most domains
parsing models must rely heavily on contextual information
existing n-gram models are inadequate for parsing
SPATTER: a statistical parser based on decision-tree models, better than a grammar-based parser
Introduction
Parsing as making a sequence of disambiguation decisions
The probability of a complete parse tree T of a sentence S (see the formula below)
Automatically discovering the rules for disambiguation
Producing a parser without a complicated grammar
Long-distance lexical information is crucial to disambiguate interpretations accurately

P(T \mid S) = \prod_i P(d_i \mid d_1 d_2 \cdots d_{i-1}, S), where d_1 d_2 \cdots is the sequence of decisions in the derivation of T
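A minimal sketch of this decomposition in Python: parsing reduces to multiplying the probabilities of successive decisions given their history. The p_decision callback is a hypothetical stand-in for the decision-tree models introduced below.

from typing import Callable, List, Sequence

def parse_probability(
    decisions: Sequence[str],
    sentence: str,
    p_decision: Callable[[str, Sequence[str], str], float],
) -> float:
    """P(T|S) as the product of P(d_i | d_1 ... d_{i-1}, S)."""
    prob = 1.0
    history: List[str] = []
    for d in decisions:
        prob *= p_decision(d, tuple(history), sentence)
        history.append(d)
    return prob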
Decision-Tree Modeling
Comparison
Grammarian: two crucial tasks for parsing
identifying the features relevant to each decision
deciding which choice to select based on the values of the features
Decision tree: the above 2 tasks + a 3rd task
assigning a probability distribution to the possible choices, and providing a ranking system
Continued
What is a Statistical Decision Tree?
A decision-making device assigning a probability to each of the possible choices based on the context of the decision:
P(f | h), where f: an element of the future vocabulary, h: a history (the context of the decision)
The probability is determined by asking a sequence of questions
The i-th question is determined by the answers to the i-1 previous questions
Example: part-of-speech tagging problem (Figure 1)
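The question-by-question computation of P(f | h) can be sketched as a tree walk; the Node layout and the dict-based history are assumptions for illustration.

from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Node:
    # Internal nodes ask a question about the history h and branch on the
    # answer; leaves hold a probability distribution over the futures f.
    question: Optional[Callable[[dict], str]] = None
    children: Optional[Dict[str, "Node"]] = None
    distribution: Optional[Dict[str, float]] = None

def probability(tree: Node, f: str, h: dict) -> float:
    """P(f | h): walk from the root to a leaf, one question per level."""
    node = tree
    while node.distribution is None:
        answer = node.question(h)      # the i-th question's answer depends
        node = node.children[answer]   # on the path taken so far
    return node.distribution.get(f, 0.0)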
Continued
Decision Trees vs. n-grams
Equivalent to an interpolated n-gram model in expressive power
Model Parameterization
n-gram model: P(f \mid h_1 h_2 \cdots h_{n-1}); number of parameters: |F| \cdot |H|^{n-1}
An n-gram model can be represented by a decision-tree model (n-1 questions)
Example: part-of-speech tagging, P(t_i \mid w_i\, t_{i-1}\, t_{i-2}) as a 4-gram model:
1. What is the word being tagged?
2. What is the tag of the previous word?
3. What is the tag of the word two words back?
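The three questions above, written as history accessors that could drive the tree-walk sketch; the history keys are illustrative.

# The 4-gram tagging model P(t_i | w_i, t_{i-1}, t_{i-2}) as a decision
# tree with n-1 = 3 questions.
def q1(h): return h["w_i"]    # What is the word being tagged?
def q2(h): return h["t_i-1"]  # What is the tag of the previous word?
def q3(h): return h["t_i-2"]  # What is the tag of the word two words back?

QUESTIONS = [q1, q2, q3]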
Continued
Model Estimation
n-gram model
1st: counting relative frequencies:

P(f \mid h_1 h_2 \cdots h_{n-1}) = \frac{\mathrm{Count}(h_1 h_2 \cdots h_{n-1}\, f)}{\mathrm{Count}(h_1 h_2 \cdots h_{n-1})}

2nd: smoothing by interpolation (shown for a three-component history):

\tilde{P}(f \mid h_1 h_2 h_3) = \lambda_1(h_1 h_2 h_3) P(f \mid h_1 h_2 h_3) + \lambda_2(h_1 h_2 h_3) P(f \mid h_1 h_2) + \lambda_3(h_1 h_2 h_3) P(f \mid h_1 h_3) + \lambda_4(h_1 h_2 h_3) P(f \mid h_2 h_3) + \lambda_5(h_1 h_2 h_3) P(f \mid h_1) + \lambda_6(h_1 h_2 h_3) P(f \mid h_2) + \lambda_7(h_1 h_2 h_3) P(f \mid h_3)
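A minimal sketch of the smoothed estimate, assuming hypothetical accessors p(f, subhistory) for the raw relative-frequency estimates and lam(i, h) for the trained weights lambda_1..lambda_7.

def smoothed_prob(f, h1, h2, h3, p, lam):
    """P~(f | h1 h2 h3): interpolate the relative-frequency estimates of
    every non-empty subset of the history, weighted by lambda_1..lambda_7."""
    subhistories = [
        (h1, h2, h3), (h1, h2), (h1, h3), (h2, h3), (h1,), (h2,), (h3,),
    ]
    return sum(
        lam(i + 1, (h1, h2, h3)) * p(f, sub)
        for i, sub in enumerate(subhistories)
    )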
Continued
decision-tree model
A decision-tree model can be represented by an interpolated n-gram model:

\tilde{P}(f \mid h_1 h_2 \cdots h_{n-1}) = \sum \lambda_{k_1 k_2 \cdots k_m}(h_1 h_2 \cdots h_{n-1})\, P(f \mid h_{k_1} h_{k_2} \cdots h_{k_m}), where m \le n, k_i \le n

h_{k_i}: the answer to one of the questions asked on the path from the root to a leaf
\lambda_{k_1 k_2 \cdots k_m}(h_1 h_2 \cdots h_{n-1}) = 1 if h_{k_1} h_{k_2} \cdots h_{k_m} is a leaf, 0 otherwise
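The indicator weights can be sketched directly; leaf_histories and the 1-based question indices are assumptions for illustration.

def lambda_weight(leaf_histories, k_indices, h):
    """lambda_{k_1..k_m}(h_1..h_{n-1}): 1 iff the answers selected by the
    question indices k_1..k_m form a leaf of the tree, else 0."""
    selected = tuple(h[k - 1] for k in k_indices)  # 1-based indices into h
    return 1.0 if selected in leaf_histories else 0.0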
ContinuedContinued
Why use decision-tree?Why use decision-tree?
As n grows, the parameter space for an n-gram model grows
exponentially
On the other hand, the decision-tree learning algorithm increases the
size of a model only as the training data allows
So, it can consider much contextual information
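A quick illustration of that growth rate, with purely illustrative vocabulary sizes:

# |F| * |H|**(n-1) parameters for an n-gram model; sizes are illustrative.
F, H = 50, 1000
for n in (2, 3, 4):
    print(n, F * H ** (n - 1))  # 50000, 50000000, 50000000000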
SPATTER Parsing
SPATTER Representation
Parse: as a geometric pattern
4 features in each node: words, tags, labels, and extensions (Figure 3)
The Parsing Algorithm
Starting with the sentence's words as leaves (Figure 3)
Gradually tagging, labeling, and extending nodes
Constraints:
bottom-up, left-to-right
no new node is constructed until its children are complete
the number of active nodes is restricted using derivational window constraints (DWC)
A single-rooted, labeled tree is constructed
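A minimal sketch of the node representation with its four features; the field names are assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ParseNode:
    """A SPATTER parse-tree node and its four features."""
    word: Optional[str] = None       # head word
    tag: Optional[str] = None        # part-of-speech tag
    label: Optional[str] = None      # constituent label
    extension: Optional[str] = None  # how the node attaches to its parent
    children: List["ParseNode"] = field(default_factory=list)

# A sentence starts as a sequence of word-only leaves, which the parser
# then tags, labels, and extends until a single root remains.
leaves = [ParseNode(word=w) for w in "the cat sat".split()]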
Statistical Parsing Models
The Tagging Model

P(t_i \mid \mathrm{context}) = P(t_i \mid w_i\, w_{i-1}\, w_{i-2}\, w_{i+1}\, w_{i+2},\ t_{i-1}\, t_{i-2}\, t_{i+1}\, t_{i+2},\ N_{k-1}\, N_{k-2}\, N_{k+1}\, N_{k+2})

The Extension Model

P(N_k^e \mid \mathrm{context}) = P(N_k^e \mid N_k^w\, N_k^t\, N_k^l,\ N_{k-1}\, N_{k-2}\, N_{k+1}\, N_{k+2},\ N_{c_1}\, N_{c_2}\, N_{c_3}\, N_{c_4})

The Label Model

P(N_k^l \mid \mathrm{context}) = P(N_k^l \mid Q,\ N_{k-1}\, N_{k-2}\, N_{k+1}\, N_{k+2},\ N_{c_1}\, N_{c_2}\, N_{c_3}\, N_{c_4})

The Derivation Model

P(\mathrm{active} \mid \mathrm{context}) = P(\mathrm{active} \mid Q,\ N_k,\ N_{k-1}\, N_{k-2}\, N_{k+1}\, N_{k+2})

The Parsing Model

P(T, d \mid S) = \prod_{j \le |d|,\ N_j \in T} P(\mathrm{active} = N_j \mid \mathrm{context}(d_j))\, P(N_j^{x_j} \mid \mathrm{context}(d_j))

P(T \mid S) = \sum_d P(T, d \mid S)
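A sketch of the parsing model as code, summing over derivations and multiplying the derivation-model and feature-model probabilities at each step; the callback signatures are hypothetical.

from typing import Callable, Iterable, Sequence, Tuple

def parse_prob(
    derivations: Iterable[Sequence[Tuple[object, str]]],
    p_active: Callable[[object, int], float],
    p_feature: Callable[[object, str, int], float],
) -> float:
    """P(T|S) = sum over derivations d of the product over steps j of
    P(active = N_j | context(d_j)) * P(N_j^{x_j} | context(d_j))."""
    total = 0.0
    for d in derivations:
        prob = 1.0
        for j, (node, feature_value) in enumerate(d):
            prob *= p_active(node, j) * p_feature(node, feature_value, j)
        total += prob
    return total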
Decision-Tree Growing & Smoothing
3 main models (tagging, extension, and label)
Dividing the training corpus into 2 sets: 90% for growing, 10% for smoothing
Growing & Smoothing Algorithm (Figure 3.5)
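A sketch of one greedy growing step on the 90% set, assuming the standard entropy-reduction criterion for choosing questions (the exact splitting rule is not spelled out on the slide):

from collections import Counter
from math import log2

def entropy(futures):
    counts = Counter(futures)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def best_question(events, questions):
    """Pick the question whose split most reduces the average entropy of
    the future distribution; events are (future, history) pairs."""
    base = entropy([f for f, _ in events])
    best, best_gain = None, 0.0
    for q in questions:
        buckets = {}
        for f, h in events:
            buckets.setdefault(q(h), []).append(f)
        avg = sum(len(b) / len(events) * entropy(b) for b in buckets.values())
        if base - avg > best_gain:
            best, best_gain = q, base - avg
    return best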
Decision-Tree Training
The parsing model cannot be estimated by direct frequency counts because the model contains a hidden component: the derivation model
The corpus contains no information about the order of derivations
So the training process must discover which derivations assign higher probability to the parses
Forward-backward re-estimation is used
Continued
Training Algorithm
s, s': states in the state lattice, where s precedes s'
f(s, s'): the feature-value assignment made to get from state s to state s'
\alpha(s): forward probability of reaching s; \beta(s): backward probability of reaching the goal state from s

\mathrm{count}(f(s, s')) = \frac{\alpha(s)\, P(f(s, s') \mid s)\, \beta(s')}{\alpha(s_{\mathrm{goal}})}

p_{\mathrm{new}}(f \mid h) = \frac{\sum_{(s, s') \in h} \mathrm{count}(f(s, s'))}{\sum_{(s, s') \in h} \sum_{f'} \mathrm{count}(f'(s, s'))}
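A minimal sketch of one forward-backward pass over a derivation lattice; the edge-list encoding and the assumption that edges arrive topologically sorted are illustrative choices, not the slide's own data structures.

from collections import defaultdict

def forward_backward_counts(edges, start, goal, p):
    """Expected counts of feature assignments over a derivation lattice.

    edges: topologically ordered list of (s, s2, f) transitions;
    p(f, s): current model probability of assignment f in state s.
    """
    alpha = defaultdict(float)  # alpha(s): probability of reaching s
    alpha[start] = 1.0
    for s, s2, f in edges:
        alpha[s2] += alpha[s] * p(f, s)
    beta = defaultdict(float)   # beta(s): probability of reaching goal from s
    beta[goal] = 1.0
    for s, s2, f in reversed(edges):
        beta[s] += p(f, s) * beta[s2]
    counts = defaultdict(float)
    for s, s2, f in edges:
        # count(f(s, s2)) = alpha(s) * P(f(s,s2) | s) * beta(s2) / alpha(goal)
        counts[(s, f)] += alpha[s] * p(f, s) * beta[s2] / alpha[goal]
    return counts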
Experiment Results
IBM Computer Manual
annotated by the University of Lancaster
195 part-of-speech tags and 19 non-terminal labels
trained on 30,800 sentences, and tested on 1,473 new sentences
0-crossing-brackets score:
IBM's rule-based, unification-style PCFG parser: 69%
SPATTER: 76%
Continued
Wall Street Journal
To test the ability to accurately parse a highly ambiguous, large-vocabulary domain
Annotated in the Penn Treebank, version 2
46 part-of-speech tags, and 27 non-terminal labels
Trained on 40,000 sentences, and tested on 1,920 new sentences
Evaluated using PARSEVAL:

\mathrm{Precision} = \frac{\text{no. of correct constituents in SPATTER parse}}{\text{no. of constituents in SPATTER parse}}

\mathrm{Recall} = \frac{\text{no. of correct constituents in SPATTER parse}}{\text{no. of constituents in treebank parse}}

Crossing Brackets: no. of constituents in the SPATTER parse that cross treebank constituent boundaries
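The precision and recall measures as a sketch over constituent sets, with each constituent taken as a (label, start, end) triple (an assumed encoding):

def parseval(spatter, treebank):
    """PARSEVAL precision and recall from two sets of constituents."""
    correct = len(set(spatter) & set(treebank))
    precision = correct / len(spatter)
    recall = correct / len(treebank)
    return precision, recall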
Conclusion
Large amounts of contextual information can be incorporated into a statistical model by applying decision-tree learning algorithms
Automatically discovering disambiguation rules is possible