Part of Speech Tagging in Context month day, year Alex Cheng Ling 575 Winter 08 Michele Banko, Robert Moore

Part of Speech Tagging in Context


Alex Cheng, Ling 575 Winter 08

Michele Banko, Robert Moore

Overview

• Comparison of previous methods
• Using context from both sides
• Lexicon construction
• Sequential EM for tag sequence and lexical probabilities
• Discussion questions

Previous methods

• Trigram model P(t_i | t_{i-1}, t_{i-2})
• Kupiec (1992): divides the lexicon into word classes
  – Words in the same equivalence class possess the same set of possible POS tags
• Brill (1995) UTBL
  – Uses information from the distribution of unambiguously tagged data to make labeling decisions
  – Considers both left and right context
• Toutanova (2003) Conditional MM
  – Supervised learning method
  – Increases accuracy from 96.10% to 96.55%
• Lafferty (2001)
  – Compared HMMs with MEMMs and CRFs

Contextualized HMM

• Estimate the probability of a word w_i based on t_{i-1}, t_i and t_{i+1}
• Leads to higher dimensionality in the parameters
• Smoothed with a standard absolute discounting scheme
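To make the contextualized lexical model concrete, here is a minimal sketch that estimates P(w_i | t_{i-1}, t_i, t_{i+1}) with absolute discounting. Several details are assumptions for illustration and are not given on the slide: the counts come from a small tagged sample (in the unsupervised setting they would be expected counts from EM), the discount is d = 0.5, the freed probability mass backs off to P(w_i | t_i), and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_counts(tagged_sents):
    """Collect word counts per tag context (t_prev, t, t_next) and per tag t."""
    ctx_counts = defaultdict(Counter)
    tag_counts = defaultdict(Counter)
    for sent in tagged_sents:
        # Pad the tag sequence so every word has a left and right tag.
        tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
        for i, (w, t) in enumerate(sent):
            ctx_counts[(tags[i], t, tags[i + 2])][w] += 1
            tag_counts[t][w] += 1
    return ctx_counts, tag_counts

def p_word_given_tag(w, t, tag_counts):
    """Unsmoothed back-off distribution P(w | t) (assumed back-off choice)."""
    total = sum(tag_counts[t].values())
    return tag_counts[t][w] / total if total else 0.0

def p_word_given_context(w, ctx, ctx_counts, tag_counts, d=0.5):
    """Absolute discounting: subtract d from every seen count and hand the
    freed mass to the back-off distribution P(w | t_i)."""
    backoff = p_word_given_tag(w, ctx[1], tag_counts)
    counts = ctx_counts.get(ctx)
    if not counts:
        return backoff  # unseen context: rely on the back-off alone
    total = sum(counts.values())
    discounted = max(counts[w] - d, 0.0) / total
    reserved = d * len(counts) / total  # mass freed by discounting
    return discounted + reserved * backoff

data = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
ctx_counts, tag_counts = train_counts(data)
ctx = ("DT", "NN", "VBZ")
p_dog = p_word_given_context("dog", ctx, ctx_counts, tag_counts)
p_cat = p_word_given_context("cat", ctx, ctx_counts, tag_counts)
```

Because each context (t_{i-1}, t_i, t_{i+1}) gets its own distribution over words, the number of parameters grows with the cube of the tagset size, which is why some form of smoothing is needed.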

Lexicon construction

• Lexicons provided for both testing and training
• Initialize with a uniform distribution over all possible tags for each word
• Experiments with using word classes, as in the Kupiec model

Problems

• Limiting the possible tags per word in the lexicon
  – Tags that appear less than X% of the time for a given word are omitted.
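The tag-limiting step can be sketched as a simple frequency filter. This is an illustrative sketch only: the threshold `min_share` stands in for the unspecified X%, the function name is hypothetical, and the fallback of keeping the most frequent tag when nothing passes the threshold is an assumption added so that no word ends up without a tag.

```python
from collections import Counter, defaultdict

def build_ablated_lexicon(tagged_words, min_share=0.1):
    """Map each word to the tags that account for at least `min_share`
    of its occurrences in the tagged data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    lexicon = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        kept = {t for t, c in tag_counts.items() if c / total >= min_share}
        if not kept:
            # Assumed fallback: never leave a word with an empty tag set.
            kept = {tag_counts.most_common(1)[0][0]}
        lexicon[word] = kept
    return lexicon

sample = [("run", "VB")] * 9 + [("run", "NN")] + [("dog", "NN")] * 5
lexicon = build_ablated_lexicon(sample, min_share=0.2)
```

Note that the filter needs tag counts per word, so it relies on tagged data even though the tagger itself is trained without supervision.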

HMM Model Training

• Extracting unambiguous tag sequences
  – Use these n-grams and their counts to bias the initial estimate of state transitions in the HMM
• Sequential training
  – Train the transition probabilities first, keeping the lexical probabilities constant.
  – Then train the lexical probabilities, keeping the transition probabilities constant.
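The sequential training schedule can be sketched with a small Baum-Welch implementation in which each EM iteration re-estimates only one parameter set. This is a toy sketch under stated assumptions, not the authors' implementation: tags and words are integer indices, the initial distribution `pi` is held fixed, the number of iterations per phase is arbitrary, and all names are hypothetical.

```python
import math

def normalize(row, fallback):
    """Turn expected counts into probabilities; keep the old row if all are zero."""
    s = sum(row)
    return [x / s for x in row] if s > 0 else list(fallback)

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns (gamma, summed xi, log-likelihood)."""
    T, N = len(obs), len(pi)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[j] * B[j][obs[0]] for j in range(N)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                   for j in range(N)]
        c = sum(row)
        scale.append(c)
        alpha.append([x / c for x in row])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N)) / scale[t + 1]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    xi = [[0.0] * N for _ in range(N)]
    for t in range(T - 1):
        for i in range(N):
            for j in range(N):
                xi[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                             * beta[t + 1][j] / scale[t + 1])
    return gamma, xi, sum(math.log(c) for c in scale)

def em_step(seqs, pi, A, B, update_transitions, update_lexical):
    """One EM iteration that re-estimates only the selected parameter set."""
    N, V = len(A), len(B[0])
    a_num = [[0.0] * N for _ in range(N)]
    b_num = [[0.0] * V for _ in range(N)]
    log_lik = 0.0
    for obs in seqs:
        gamma, xi, ll = forward_backward(obs, pi, A, B)
        log_lik += ll
        for i in range(N):
            for j in range(N):
                a_num[i][j] += xi[i][j]
            for t, w in enumerate(obs):
                b_num[i][w] += gamma[t][i]
    new_A = [normalize(r, old) for r, old in zip(a_num, A)] if update_transitions else A
    new_B = [normalize(r, old) for r, old in zip(b_num, B)] if update_lexical else B
    return new_A, new_B, log_lik

def sequential_em(seqs, pi, A, B, iters=5):
    """Phase 1 trains transitions with B fixed; phase 2 trains B with A fixed."""
    history = []
    for _ in range(iters):
        A, B, ll = em_step(seqs, pi, A, B, True, False)
        history.append(ll)
    for _ in range(iters):
        A, B, ll = em_step(seqs, pi, A, B, False, True)
        history.append(ll)
    return A, B, history

# Toy setup: 2 tags, 3 words; zeros in B0 encode the lexicon constraints.
pi = [0.5, 0.5]
A0 = [[0.6, 0.4], [0.3, 0.7]]
B0 = [[0.5, 0.5, 0.0], [0.0, 0.4, 0.6]]
seqs = [[0, 1, 2, 1], [2, 1, 0], [1, 1, 2, 0]]
A, B, history = sequential_em(seqs, pi, A0, B0, iters=5)
```

Each phase is still a valid (partial) EM update, so the log-likelihood recorded in `history` should not decrease, and word/tag pairs that the lexicon rules out keep probability zero throughout.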

Discussion

• Sequential training of the HMM trains the two parameter sets separately. Is there any theoretical significance? Computational cost?
• What are the effects if we model the tag context differently, using p(t_i | t_{i-1}, t_{i+1})?

Improved Estimation for Unsupervised POS Tagging


Alex Cheng, Ling 575 Winter 08

Qin Iris Wang, Dale Schuurmans

Overview

• Focus on parameter estimation
  – Considering only simple models with limited context (a standard bigram HMM)
• Constraint on marginal tag probabilities
• Smooth lexical parameters using word similarities
• Discussion questions

Parameter Estimation

• Banko and Moore (2004) reduce the error rate from 22.8% to 4.1% by reducing the set of possible tags for each word.
  – Requires tagged data to find the artificially reduced lexicon.
• EM is only guaranteed to converge to a local maximum.
• The HMM likelihood tends to have multiple local maxima.
  – As a result, the quality of the learned parameters may have more to do with the initial parameter estimates than with the EM procedure itself.

Estimation problems

• Using the standard model
  – Tag -> tag: uniform over all tags
  – Tag -> word: uniform over all possible tags for the word (as specified in the complete lexicon)
• The estimated transition probabilities are quite poor.
  – ‘a’ is always tagged LS (list item marker).
• The estimated lexical probabilities are also quite poor.
  – Each parameter b_t_w1, b_t_w2 is treated as independent.
  – EM tends to over-fit the lexical model and ignore similarity between words.
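The uniform starting point of the standard model can be sketched as follows. This is one reading of the slide, not a confirmed recipe: each word spreads a unit of mass evenly over the tags its lexicon entry allows, and the result is then renormalized per tag to obtain P(word | tag). The renormalization step, the toy lexicon, and the function name are assumptions.

```python
def uniform_init(lexicon, tags):
    """Return (transitions, emissions) initialized uniformly under the lexicon."""
    n = len(tags)
    # Tag -> tag: uniform over all tags.
    trans = {t: {u: 1.0 / n for u in tags} for t in tags}
    # Tag -> word: each word gives equal weight to each of its allowed tags.
    weights = {t: {} for t in tags}
    for word, allowed in lexicon.items():
        for t in allowed:
            weights[t][word] = 1.0 / len(allowed)
    # Assumed step: renormalize per tag so that P(word | tag) sums to one.
    emit = {}
    for t, row in weights.items():
        total = sum(row.values())
        emit[t] = {w: x / total for w, x in row.items()} if total else {}
    return trans, emit

toy_lexicon = {
    "the": {"DT"},
    "a": {"DT", "LS"},
    "run": {"VB", "NN"},
    "dog": {"NN"},
}
trans, emit = uniform_init(toy_lexicon, ["DT", "LS", "VB", "NN"])
```

In this toy lexicon the rare tag LS admits only the word 'a', so P('a' | LS) starts at 1.0 while P('a' | DT) starts at 1/3. That imbalance hints at how a frequent word can end up attached to a rare tag when nothing constrains how often each tag is used.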

Marginally Constrained HMMs
Tag -> Tag probabilities

• Maintain a specific marginal distribution over the tag probabilities.
  – Assumes we are given a target distribution over tags (raw tag frequencies)
    • Can be obtained from tagged data
    • Can be approximated (see Toutanova, 2003)
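One way to see what a marginal constraint on the transition model means is to rescale a table of tag-bigram counts until both its row and column sums match the target tag distribution, then read the transition probabilities off the rows. The sketch below uses iterative proportional fitting for this. It is a swapped-in illustration, not the estimation procedure from the paper, and the counts, target distribution, and function names are all hypothetical.

```python
def fit_marginals(counts, target, iters=200):
    """Rescale a positive bigram table so that row and column sums both
    approach `target` (iterative proportional fitting)."""
    n = len(target)
    joint = [[float(c) for c in row] for row in counts]
    for _ in range(iters):
        for i in range(n):  # match row sums
            s = sum(joint[i])
            joint[i] = [x * target[i] / s for x in joint[i]]
        for j in range(n):  # match column sums
            s = sum(joint[i][j] for i in range(n))
            for i in range(n):
                joint[i][j] *= target[j] / s
    return joint

def transitions_from_joint(joint):
    """Condition on the previous tag: a_ij = joint[i][j] / sum_j joint[i][j]."""
    return [[x / sum(row) for x in row] for row in joint]

counts = [[5, 2, 1], [1, 4, 3], [2, 2, 6]]  # hypothetical expected bigram counts
target = [0.5, 0.3, 0.2]                    # assumed target tag distribution
trans = transitions_from_joint(fit_marginals(counts, target))
# Tag marginal after one transition step, starting from the target distribution.
implied = [sum(target[i] * trans[i][j] for i in range(3)) for j in range(3)]
```

After fitting, starting from the target distribution and taking one transition step returns (approximately) the target distribution again, so no tag can be used much more or less often than its target frequency allows.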

Similarity-based Smoothing
Tag -> Word probabilities

• Uses a feature vector f for each word w, built from the context (left and right neighboring words) of w.

• The 100,000 most frequent words are used as features.
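The context feature vectors can be sketched as counts of left and right neighbors restricted to a set of feature words. The slide does not say how similarity is computed from these vectors, so the cosine measure below is an assumption, as are the toy corpus, the tiny feature set standing in for the 100,000 most frequent words, and the function names.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, feature_words):
    """Count each word's left and right neighbors, keeping only feature words."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        for i in range(1, len(padded) - 1):
            left, word, right = padded[i - 1], padded[i], padded[i + 1]
            if left in feature_words:
                vecs[word][("L", left)] += 1
            if right in feature_words:
                vecs[word][("R", right)] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(c * v[k] for k, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [
    ["the", "dog", "runs"],
    ["the", "cat", "runs"],
    ["a", "dog", "sleeps"],
]
features = {"the", "a", "runs", "sleeps"}
vecs = context_vectors(corpus, features)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
```

Words that occur in similar contexts, such as 'cat' and 'dog' here, receive a high similarity score, which is the signal a similarity-based smoother can use to share lexical probability mass between them.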

Result

Discussion

• Compared to Banko and Moore, are the methods used here “more or less” unsupervised?
  – Banko and Moore use lexicon ablation
  – Here, raw tag frequencies are used
