Part of Speech Tagging in Context month day, year Alex Cheng Ling 575 Winter 08 Michele Banko, Robert Moore


Part of Speech Tagging in Context

month day, year

Alex Cheng

Ling 575 Winter 08

Michele Banko, Robert Moore

Overview

• Comparison of previous methods
• Using context from both sides
• Lexicon construction
• Sequential EM for tag sequence and lexical probabilities
• Discussion questions

Previous methods

• Trigram model P(t_i | t_{i-1}, t_{i-2})
• Kupiec (1992): divide the lexicon into word classes
  – Words in the same equivalence class possess the same set of POS tags
• Brill (1995): UTBL (unsupervised transformation-based learning)
  – Uses information from the distribution of unambiguously tagged data to make labeling decisions
  – Considers both left and right context
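The word-class idea attributed to Kupiec above can be illustrated with a minimal sketch: words are grouped by the exact set of tags the lexicon allows for them. The function name `build_word_classes` and the toy lexicon are assumptions made for illustration and do not come from the slides or the cited paper.

```python
from collections import defaultdict

def build_word_classes(lexicon):
    """Group words that share exactly the same set of possible tags."""
    classes = defaultdict(set)
    for word, tags in lexicon.items():
        # A frozenset of tags acts as the identifier of the equivalence class.
        classes[frozenset(tags)].add(word)
    return dict(classes)

# Toy lexicon (hypothetical entries, for illustration only).
toy_lexicon = {
    "walk": {"NN", "VB"},
    "run": {"NN", "VB"},
    "the": {"DT"},
    "a": {"DT"},
    "quickly": {"RB"},
}
word_classes = build_word_classes(toy_lexicon)
for tag_set, words in word_classes.items():
    print(sorted(tag_set), sorted(words))
```

Parameters can then be estimated per class rather than per word, which reduces the number of lexical parameters.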

• Toutanova (2003): conditional Markov model
  – Supervised learning method
  – Increases accuracy from 96.10% to 96.55%
• Lafferty (2001)
  – Compared HMM with MEMM and CRF

Contextualized HMM

• Estimate the probability of a word w_i based on t_{i-1}, t_i, and t_{i+1}
• Leads to higher dimensionality in the parameters
• Smoothed with a standard absolute discounting scheme
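A minimal sketch of a contextualized lexical model P(w_i | t_{i-1}, t_i, t_{i+1}) with absolute discounting follows. Several things here are assumptions rather than details from the slides: the counts come from a small tagged sample (in unsupervised training they would be expected counts from EM), the backoff distribution is P(w_i | t_i), the discount is 0.5, and the class name `ContextualEmission` is hypothetical.

```python
from collections import Counter, defaultdict

class ContextualEmission:
    """P(word | prev tag, tag, next tag) with absolute discounting (sketch)."""

    def __init__(self, tagged_sentences, discount=0.5):
        self.discount = discount                 # assumed value, 0 < discount <= 1
        self.ctx_word = defaultdict(Counter)     # (t_prev, t, t_next) -> word counts
        self.tag_word = defaultdict(Counter)     # t -> word counts (backoff model)
        for sent in tagged_sentences:
            tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
            for i, (word, tag) in enumerate(sent):
                # tags is padded by one, so the word's own tag is tags[i + 1].
                self.ctx_word[(tags[i], tag, tags[i + 2])][word] += 1
                self.tag_word[tag][word] += 1

    def backoff(self, word, tag):
        """Lower-order estimate P(word | tag)."""
        counts = self.tag_word.get(tag, Counter())
        total = sum(counts.values())
        return counts[word] / total if total else 0.0

    def prob(self, word, t_prev, tag, t_next):
        counts = self.ctx_word.get((t_prev, tag, t_next))
        if not counts:
            # Unseen tag context: rely entirely on the backoff model.
            return self.backoff(word, tag)
        total = sum(counts.values())
        discounted = max(counts[word] - self.discount, 0.0) / total
        # Mass removed from seen words is redistributed via the backoff model.
        reserved = self.discount * len(counts) / total
        return discounted + reserved * self.backoff(word, tag)

sample = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
    [("a", "DT"), ("dog", "NN")],
]
model = ContextualEmission(sample)
print(model.prob("dog", "DT", "NN", "VB"))
```

The table is indexed by tag triples rather than single tags, which is the source of the higher dimensionality and the reason smoothing is needed.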

Lexicon construction

• Lexicons provided for both testing and training
• Initialize with a uniform distribution over all possible tags for each word
• Experiments with using word classes as in the Kupiec model
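The uniform initialization above can be sketched in a few lines. The function name and the toy lexicon are illustrative assumptions.

```python
def uniform_tag_distribution(lexicon):
    """Give every tag allowed for a word the same initial probability."""
    return {
        word: {tag: 1.0 / len(tags) for tag in sorted(tags)}
        for word, tags in lexicon.items()
    }

# Hypothetical lexicon entries, for illustration only.
initial = uniform_tag_distribution({"walk": {"NN", "VB"}, "the": {"DT"}})
print(initial)
```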

Problems

• Limiting the possible tags per lexicon entry
  – Tags that appeared less than X% of the time for each word are omitted.
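A sketch of this kind of lexicon filtering, assuming tag counts are available from tagged data. The threshold `min_fraction` stands in for the unspecified X% on the slide, and the fallback that keeps the most frequent tag when nothing passes the threshold is an added assumption, not something the slides state.

```python
from collections import Counter, defaultdict

def ablate_lexicon(tagged_tokens, min_fraction):
    """Drop tags that account for less than min_fraction of a word's occurrences."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    lexicon = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        kept = {t for t, c in tag_counts.items() if c / total >= min_fraction}
        if not kept:
            # Assumed fallback: never leave a word without any tag.
            kept = {tag_counts.most_common(1)[0][0]}
        lexicon[word] = kept
    return lexicon

# Hypothetical counts: 'a' is almost always a determiner.
tokens = [("a", "DT")] * 99 + [("a", "LS")] + [("walk", "NN")] * 3 + [("walk", "VB")] * 2
print(ablate_lexicon(tokens, min_fraction=0.1))
```

Because the filter needs per-word tag frequencies, it depends on tagged data.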

HMM Model Training

• Extracting non-ambiguous tag sequences
  – Use these n-grams and their counts to bias the initial estimate of state transitions in the HMM
• Sequential training
  – Train the transition model probabilities first, keeping the lexical probabilities constant.
  – Then train the lexical probabilities, keeping the transition probabilities constant.
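The first step above, collecting tag n-grams from words whose lexicon entry has a single tag, could look roughly like the following. This sketch uses bigrams for brevity, and the add-one pseudo-count used to turn counts into initial transition probabilities is an assumption. Function names are hypothetical.

```python
from collections import Counter

def unambiguous_tag_bigrams(sentences, lexicon):
    """Count tag bigrams over adjacent words that each have exactly one possible tag."""
    counts = Counter()
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            tags1, tags2 = lexicon.get(w1, ()), lexicon.get(w2, ())
            if len(tags1) == 1 and len(tags2) == 1:
                counts[(next(iter(tags1)), next(iter(tags2)))] += 1
    return counts

def initial_transitions(counts, tagset, pseudo=1.0):
    """Bias an otherwise uniform transition table with the unambiguous counts."""
    table = {}
    for prev in tagset:
        row = {t: counts[(prev, t)] + pseudo for t in tagset}
        total = sum(row.values())
        table[prev] = {t: v / total for t, v in row.items()}
    return table

# Hypothetical data: 'walks' is ambiguous, so bigrams touching it are skipped.
lexicon = {"the": {"DT"}, "a": {"DT"}, "dog": {"NN"}, "walks": {"VB", "NNS"}}
sentences = [["the", "dog", "walks"], ["a", "dog"]]
counts = unambiguous_tag_bigrams(sentences, lexicon)
print(initial_transitions(counts, ["DT", "NN", "VB", "NNS"])["DT"])
```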

Discussion

• Sequential training of the HMM trains the parameters separately. Is there any theoretical significance? Computational cost?
• What are the effects if we model the tag context differently, using p(t_i | t_{i-1}, t_{i+1})?
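To make the first question concrete, here is a minimal NumPy sketch of sequential training: EM iterations that re-estimate only the transition matrix, followed by iterations that re-estimate only the lexical (emission) matrix. The iteration counts, the fixed initial-state distribution `pi`, and all function names are assumptions; this is not the authors' implementation.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward: state posteriors, summed pair posteriors, log-likelihood."""
    T, K = len(obs), len(pi)
    alpha, beta, scale = np.zeros((T, K)), np.ones((T, K)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    xi = np.zeros((K, K))
    for t in range(T - 1):
        xi += alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / scale[t + 1]
    return gamma, xi, np.log(scale).sum()

def em_update(sequences, pi, A, B, train):
    """One EM iteration that updates only the block named by `train`."""
    trans, emit, loglik = np.zeros_like(A), np.zeros_like(B), 0.0
    for obs in sequences:
        gamma, xi, ll = forward_backward(obs, pi, A, B)
        trans += xi
        for t, o in enumerate(obs):
            emit[:, o] += gamma[t]
        loglik += ll
    if train == "transition":
        A = trans / trans.sum(axis=1, keepdims=True)
    else:
        B = emit / emit.sum(axis=1, keepdims=True)
    return A, B, loglik

def sequential_em(sequences, pi, A, B, iters=10):
    """Train transitions with lexical probabilities fixed, then the reverse."""
    history = []
    for phase in ("transition", "lexical"):
        for _ in range(iters):
            A, B, ll = em_update(sequences, pi, A, B, phase)
            history.append(ll)  # log-likelihood before this update
    return A, B, history

rng = np.random.default_rng(0)
A0 = rng.random((2, 2)); A0 /= A0.sum(axis=1, keepdims=True)
B0 = rng.random((2, 3)); B0 /= B0.sum(axis=1, keepdims=True)
A1, B1, history = sequential_em([[0, 1, 2, 1, 0], [2, 2, 1, 0]], np.array([0.5, 0.5]), A0, B0)
print(history[0], history[-1])
```

Each iteration still runs a full forward-backward pass, so the per-iteration cost matches joint EM. Updating one parameter block at a time is a partial M-step, so the log-likelihood should not decrease from one iteration to the next in this sketch.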


Improved Estimation for Unsupervised POS Tagging

month day, year

Alex Cheng

Ling 575 Winter 08

Qin Iris Wang, Dale Schuurmans

Overview

• Focus on parameter estimation
  – Considering only simple models with limited context (a standard bigram HMM)
• Constraint on marginal tag probabilities
• Smooth lexical parameters using word similarities
• Discussion questions

Parameter Estimation

• Banko and Moore (2004) reduce the error rate from 22.8% to 4.1% by reducing the set of possible tags for each word.
  – Requires tagged data to find the artificially reduced lexicon.
• EM is guaranteed to converge to a local maximum.
• HMMs tend to have multiple local maxima.
  – As a result, the quality of the estimated parameters may depend more on the initial parameter estimates than on the EM procedure itself.

Estimation problems

• Using the standard model
  – Tag -> tag: uniform over all tags
  – Tag -> word: uniform over all possible tags for the word (as specified in the complete lexicon)
• Estimated parameters of the transition probabilities are quite poor.
  – ‘a’ is always tagged LS.
• Estimated parameters of the lexical probabilities are also quite poor.
  – Each parameter b_{t,w1}, b_{t,w2} is treated as independent.
  – EM tends to over-fit the lexical model and ignore similarity between words.

Marginally Constrained HMMs
Tag -> Tag probabilities

• Maintain a specific marginal distribution over the tag probabilities.
  – Assumes we are given a target distribution over tags (raw tag frequency)
    • Can be obtained from tagged data
    • Can be approximated (see Toutanova, 2003)
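One way to picture a marginal constraint on the tag -> tag probabilities is to adjust expected bigram counts so that both the previous-tag and next-tag marginals match the target distribution, then read off the transition matrix. The sketch below uses iterative proportional fitting for this; that choice is an assumption made for illustration and is not claimed to be the estimation procedure in the paper. The counts and target values are made up.

```python
import numpy as np

def fit_to_marginal(bigram_counts, target, iters=200):
    """Rescale a positive tag-bigram table so both marginals approach `target`."""
    joint = np.asarray(bigram_counts, dtype=float)
    joint = joint / joint.sum()
    for _ in range(iters):
        joint *= (target / joint.sum(axis=1))[:, None]  # match previous-tag marginal
        joint *= (target / joint.sum(axis=0))[None, :]  # match next-tag marginal
    # Row-normalize the fitted joint table to obtain P(t_i | t_{i-1}).
    return joint / joint.sum(axis=1, keepdims=True)

# Hypothetical expected bigram counts and target tag frequencies.
counts = np.array([[5.0, 1.0, 2.0], [1.0, 4.0, 1.0], [2.0, 2.0, 6.0]])
target = np.array([0.5, 0.3, 0.2])
A = fit_to_marginal(counts, target)
print(target @ A)  # should be close to the target distribution
```

With this construction the target distribution is (approximately) stationary under the transition matrix, so the tag marginal is preserved from one position to the next.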

Similarity-based Smoothing
Tag -> Word probabilities

• Uses a feature vector f for each word w, built from the context (left and right neighboring words) of w.
• The 100,000 most frequent words are used as features.
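A minimal sketch of such context feature vectors: for each word, count which frequent words appear immediately to its left and right, then compare two words by the similarity of those counts. Cosine similarity is an assumption here, since the slides do not name the similarity measure, and `num_features` plays the role of the 100,000-word feature set. Function names and sentences are illustrative.

```python
from collections import Counter
import math

def context_vectors(sentences, num_features):
    """Map each word to counts of its left/right neighbors among frequent words."""
    freq = Counter(w for sent in sentences for w in sent)
    features = {w for w, _ in freq.most_common(num_features)}
    vectors = {}
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        for i in range(1, len(padded) - 1):
            vec = vectors.setdefault(padded[i], Counter())
            left, right = padded[i - 1], padded[i + 1]
            if left in features:
                vec[("L", left)] += 1
            if right in features:
                vec[("R", right)] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical sentences: 'dog' and 'cat' share contexts, 'dog' and 'the' do not.
sents = [["the", "dog", "runs"], ["the", "cat", "runs"], ["a", "dog", "sleeps"]]
vecs = context_vectors(sents, num_features=10)
print(cosine(vecs["dog"], vecs["cat"]), cosine(vecs["dog"], vecs["the"]))
```

Words with similar contexts can then share evidence when the tag -> word parameters are smoothed, instead of being treated as independent.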


Result

Discussion

• Compared to Banko and Moore, are the methods used here “more or less” unsupervised?
  – Banko and Moore use lexicon ablation
  – Here, we use raw tag frequencies