
Behavioral Data Mining

Lecture 16

Natural Language Processing I

Outline

• Parts-of-Speech and POS taggers

• HMMs

• Named Entity Recognition

• Conditional Random Fields (CRFs)

Part-Of-Speech Tags

• Part-of-speech tags provide limited but useful information about word type (e.g. local sentiment attachment).

• They are widely used as an alternative to full parsing, which can be very slow.

• While our intuition about tags tends to be semantic (nouns are "people, places, things"), classification and inference are based on morphology and context.

Penn Treebank Tagset

Part-Of-Speech Tags

POS tagging normally follows tokenization, which includes punctuation:

Sheena | leads | , | Sheila | needs | .
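As a concrete illustration of this tokenize-then-tag pipeline, here is a minimal sketch using NLTK's off-the-shelf Penn Treebank tagger (resource names and the exact output tags may vary across NLTK versions):

```python
import nltk

# One-time downloads (names differ slightly across NLTK versions):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Sheena leads, Sheila needs.")
print(tokens)                 # ['Sheena', 'leads', ',', 'Sheila', 'needs', '.']
print(nltk.pos_tag(tokens))   # e.g. [('Sheena', 'NNP'), ('leads', 'VBZ'), (',', ','),
                              #       ('Sheila', 'NNP'), ('needs', 'VBZ'), ('.', '.')]
```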

POS tags

• Morphology:

– “liked,” “follows,” “poke”

• Context: “can” can be:

– “trash can,” “can do,” “can it”

• Therefore taggers should look at the neighborhood of a word.

Constraint-Based Tagging

Based on a table of words with morphological and context features:
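The table itself is not reproduced here; as an illustrative stand-in (not the original table), a constraint-based tagger can be sketched as a lexicon of candidate tags per word plus hand-written context rules that eliminate candidates:

```python
# Toy sketch of constraint-based tagging. The words, tags, and rules below are
# made up for illustration; they are not the original slide's table.
LEXICON = {
    "can":   {"MD", "NN", "VB"},   # modal ("can do"), noun ("trash can"), verb ("can it")
    "trash": {"NN"},
    "do":    {"VB", "VBP"},
}

def tag_with_constraints(tokens):
    candidates = [set(LEXICON.get(t, {"NN"})) for t in tokens]
    for i, t in enumerate(tokens):
        # Context rule: "can" right after a noun is itself a noun ("trash can").
        if t == "can" and i > 0 and candidates[i - 1] == {"NN"}:
            candidates[i] &= {"NN"}
        # Context rule: "can" immediately before a verb is a modal ("can do").
        if t == "can" and i + 1 < len(tokens) and "VB" in candidates[i + 1]:
            candidates[i] &= {"MD"}
    return [sorted(c) for c in candidates]

print(tag_with_constraints(["trash", "can"]))   # [['NN'], ['NN']]
print(tag_with_constraints(["can", "do"]))      # [['MD'], ['VB', 'VBP']]
```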

Outline

• Parts-of-Speech and POS taggers

• HMMs

• Named Entity Recognition

• Conditional Random Fields (CRFs)

Markov Models (notes due to David Hall)

• A Markov model is a chain-structured Bayes Net

– Each node is identically distributed (stationarity)

– Value of X at a given time is called the state

– As a Bayes Net:


– Parameters: called transition probabilities or dynamics, specify how the state evolves over time (also, initial probs)

[Diagram: chain-structured Bayes net X1 → X2 → X3 → X4]

Hidden Markov Models

• Markov chains not so useful for most applications

– Eventually you don't know anything anymore

– Need observations to update your beliefs

• Hidden Markov models (HMMs)

– Underlying Markov chain over states S

– You observe outputs (effects) at each time step

– As a Bayes’ net:

[Diagram: hidden chain X1 → X2 → X3 → X4 → X5, each state emitting an observation E1 ... E5]

Example

• An HMM is defined by:

– Initial distribution: P(X1)

– Transitions: P(Xt | Xt-1)

– Emissions: P(Et | Xt)
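A minimal sketch of these three parameter sets as numpy arrays; the two-state rain/sun HMM here is a generic textbook stand-in, not the example from the original slide:

```python
import numpy as np

states = ["rain", "sun"]                     # hidden states X
obs_symbols = ["umbrella", "no umbrella"]    # possible emissions E

pi = np.array([0.5, 0.5])        # initial distribution P(X1)
A = np.array([[0.7, 0.3],        # transitions P(Xt | Xt-1); row = previous state
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],        # emissions P(Et | Xt); row = state, column = symbol
              [0.2, 0.8]])
```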

Conditional Independence

• HMMs have two important independence properties:

– Markov hidden process, future depends on past via the present

– Current observation independent of all else given current state

• Does this mean that observations are independent given no evidence?

– [No, correlated by the hidden state]


Real HMM Examples

• POS tagging

• DNA alignment: observations are nucleotides (ACGT); transitions are substitutions, insertions, and deletions

• Named Entity Recognition / field segmentation: observations are words (tens of thousands); states are "Person", "Place", "Other", etc. (usually 3-20)

• Ad targeting: observations are click streams; states are interests, or likelihood to buy

Named Entity Recognition

• People, places, (specific) things

– "Chez Panisse": Restaurant

– "Berkeley, CA": Place

– "Churchill-Brenneis Orchards": Company(?)

– Not "Page" or "Medjool": part of a non-rigid designator

Named Entity Recognition

• Why a sequence model?

– "Page" can be "Larry Page" or "Page mandarins" or just a page.

– "Berkeley" can be a person.

Named Entity Recognition

• Chez Panisse, Berkeley, CA - A bowl of Churchill-Brenneis Orchards Page mandarins and Medjool dates

• States:

– Restaurant, Place, Company, GPE, Movie, "Outside"

– Usually use BeginRestaurant, InsideRestaurant. Why?

• Emissions: words

– Estimating these probabilities is harder.

Filtering / Monitoring

• Filtering, or monitoring, is the task of tracking the distribution B(X) (the belief state) over time

• We start with B(X) in an initial setting, usually uniform

• As time passes, or we get observations, we update B(X)

• The Kalman filter (a continuous HMM) was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program

Passage of Time

Assume we have the current belief B(Xt) = P(Xt | e1:t), the probability of the state given the evidence to date.

Then, after one time step passes:

P(Xt+1 | e1:t) = Σ_xt P(Xt+1 | xt) P(xt | e1:t)

Or, compactly:

B'(Xt+1) = Σ_xt P(Xt+1 | xt) B(xt)

Basic idea: beliefs get "pushed" through the transitions
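A minimal numpy sketch of this update, reusing the illustrative transition matrix A from the HMM sketch above:

```python
import numpy as np

A = np.array([[0.7, 0.3],        # A[i, j] = P(Xt+1 = j | Xt = i)
              [0.3, 0.7]])
belief = np.array([1.0, 0.0])    # B(Xt): currently certain we are in state 0

belief_next = A.T @ belief       # B'(Xt+1) = Σ_xt P(Xt+1 | xt) B(xt)
print(belief_next)               # [0.7 0.3] -- the belief gets "pushed" through the transitions
```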

Example HMM

The Forward Algorithm

We are given evidence at each time step and want to know the belief P(Xt | e1:t).

We can derive the following update:

P(xt | e1:t) ∝ P(et | xt) Σ_xt-1 P(xt | xt-1) P(xt-1 | e1:t-1)

We can normalize as we go if we want to have P(x | e) at each time step, or just once at the end.

The Forward Algorithm

• Expressible as repeated matrix multiplication

• Careful: can get numerical underflow very quickly!

– Use logs or renormalize after each product

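A minimal sketch of the forward algorithm with per-step renormalization; the parameters are the illustrative rain/sun HMM defined earlier, not anything from the slides:

```python
import numpy as np

def forward(pi, A, B, observations):
    """Return P(Xt | e1..t) for each t, renormalizing at every step to avoid underflow."""
    alpha = pi * B[:, observations[0]]
    alpha /= alpha.sum()
    beliefs = [alpha]
    for e in observations[1:]:
        alpha = (A.T @ alpha) * B[:, e]   # push through transitions, then weight by the emission
        alpha /= alpha.sum()              # renormalize (or work in log space instead)
        beliefs.append(alpha)
    return np.array(beliefs)

pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 0, 1]))       # belief over {rain, sun} after each observation
```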

Most Likely Sequence

The Viterbi Algorithm

Just like the forward algorithm, with max instead of sum:

Vt(xt) = P(et | xt) max_xt-1 P(xt | xt-1) Vt-1(xt-1)

Compare to the forward algorithm, which sums over xt-1 instead of maximizing.

How do we actually extract a max sequence from V?

– Store back pointers to the best predecessor at each step
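A minimal Viterbi sketch in log space, with back pointers to recover the max sequence (same illustrative parameters as above):

```python
import numpy as np

def viterbi(pi, A, B, observations):
    """Most likely state sequence, computed in log space with back pointers."""
    logA, logB = np.log(A), np.log(B)
    V = np.log(pi) + logB[:, observations[0]]
    back = []
    for e in observations[1:]:
        scores = V[:, None] + logA          # scores[i, j] = V[i] + log P(j | i)
        back.append(scores.argmax(axis=0))  # best predecessor for each state j
        V = scores.max(axis=0) + logB[:, e]
    # Follow back pointers from the best final state.
    path = [int(V.argmax())]
    for ptrs in reversed(back):
        path.append(int(ptrs[path[-1]]))
    return list(reversed(path))

pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1]))   # [0, 0, 1] -- state indices of the best path
```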

Estimating HMMs: Forward-Backward

• Given observation sequences for an HMM (the hidden states are not observed), the forward-backward algorithm (aka Baum-Welch) produces an estimate of the model parameters.

• Baum-Welch is a special case of EM (Expectation-Maximization), and requires iteration to find the best parameters.

• Convergence is generally quite fast (tens of iterations for speech HMMs).
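For reference, a sketch of the standard Baum-Welch updates in the notation used above (each iteration alternates the two steps):

– E-step: with the current parameters, compute γt(i) = P(Xt = i | e1:T) and ξt(i, j) = P(Xt = i, Xt+1 = j | e1:T) using a forward pass and a backward pass.

– M-step: re-estimate P(j | i) ← Σ_t ξt(i, j) / Σ_t γt(i) and P(e | j) ← Σ_{t: et = e} γt(j) / Σ_t γt(j).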

Higher-Order Markov Models

• Sometimes, first-order history isn't enough

– Part-of-Speech Tagging

• Need to model P(xt | xt-1, xt-2)

• Claim: can use a first-order model

– How?

• Solution: product states (see the sketch below)

• Estimating probabilities is a little more subtle
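A minimal sketch of the product-state construction: pairs of tags become the states of an equivalent first-order chain (the tag names are just Penn Treebank examples, not from the original slide):

```python
from itertools import product

tags = ["NN", "VB", "DT"]

# Second-order model: P(t_i | t_{i-1}, t_{i-2}).
# Product states: each first-order state is a pair (t_{i-1}, t_i).
product_states = list(product(tags, repeat=2))           # 9 states for 3 tags

def allowed_transition(prev_state, next_state):
    # A transition (a, b) -> (b', c) is only allowed when b == b',
    # so the first-order chain over pairs encodes the second-order history.
    return prev_state[1] == next_state[0]

print(len(product_states))                                # 9
print(allowed_transition(("DT", "NN"), ("NN", "VB")))     # True
print(allowed_transition(("DT", "NN"), ("VB", "VB")))     # False
```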

Outline

• Parts-of-Speech and POS taggers

• HMMs

• Named Entity Recognition

• Conditional Random Fields (CRFs)

Conditional Random Fields

• HMMs are generative models of sequences.

• Hidden states X

• Observations E

• Two kinds of parameters:

– Transitions: P(X | X)

– Emissions: P(E | X)

[Diagram: linear chain X1 X2 X3 X4 with nodes F1 F2 F3 F4 attached to each state]

Generative Models

• Generative models have both a prior term P(X) and a likelihood term P(E | X).

• By Bayes' Rule:

– P(X | E) = P(X) P(E | X) / P(E)

• We always observe E...

• Why do we waste time modeling E?


Generative Models

• Forward algorithm: its updates are built from the generative terms P(X | X) and P(E | X)

Discriminative Models

• Key Idea: Model P(X|E) directly!

– Consequence: P(E) and P(E | X) no longer (exactly) make sense.

• Use a linear score function, just like SVMs or regression:

score(X, E) = Σ_t w · f(xt-1, xt, E)

• (Generic version:)

– If you have a dynamic program in a generative model with P(...|...), just replace those probabilities with a score function.

CRF intuition

– Naïve Bayes : HMMs :: Logistic Regression : CRFs

– HMMs and CRFs are the sequence versions of Naïve Bayes and Logistic Regression; Logistic Regression and CRFs are the discriminative (conditional) versions.

Conditional Random Fields

• P(X | E) = exp(score(X, E)) / Z(E)

• Z(E) = Σ_X' exp(score(X', E)) is called the "partition function" or normalizing constant.

– It's like P(E)

CRFs vs HMMs

• Easy to map HMMs to CRFs

• How?

– One feature for each transition

– One feature for each emission
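A minimal sketch of that mapping, reusing the pi/A/B notation from the HMM sketches above: indicator features per transition pair and per emission pair, with weights equal to the corresponding log-probabilities, make exp(score) proportional to the HMM joint probability.

```python
import numpy as np

def hmm_as_crf_score(states, observations, pi, A, B):
    """Linear score whose exponential equals the HMM joint P(states, observations).

    Implicit features: an indicator per transition pair (i, j) with weight log A[i, j],
    an indicator per emission pair (j, e) with weight log B[j, e], and one per
    initial state i with weight log pi[i].
    """
    score = np.log(pi[states[0]]) + np.log(B[states[0], observations[0]])
    for prev, cur, e in zip(states, states[1:], observations[1:]):
        score += np.log(A[prev, cur]) + np.log(B[cur, e])
    return score
```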

Named Entity Recognition

• People, places, (specific) things

– "Chez Panisse": Restaurant

– "Berkeley, CA": Place

– "Churchill-Brenneis Orchards": Company(?)

– Not "Page" or "Medjool": part of a non-rigid designator

Named Entity Recognition

• With an HMM, we have to estimate the probability P(E | X)

– But what's P(Panisse | Restaurant)?

• If you've never seen "Panisse" before?

• Hard to estimate!

CRF features

• NER:

– Features to detect names (sketched in code after this list):

• Word on the left is “Mr.” or “Ms.”

• Word two back is “Mr.” or “Ms.”

• Word is capitalized.

• Caution: If you have features over an “arbitrary” window size, complexity guarantees of forward algorithm go out the window!
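A minimal sketch of such features as a Python feature function (the feature names and honorific list are illustrative, not from the slides):

```python
def name_features(tokens, i):
    """Binary features for position i, in the spirit of the list above (illustrative only)."""
    word = tokens[i]
    return {
        "is_capitalized": word[:1].isupper(),
        "prev_is_honorific": i >= 1 and tokens[i - 1] in {"Mr.", "Ms."},
        "prev2_is_honorific": i >= 2 and tokens[i - 2] in {"Mr.", "Ms."},
        "word=" + word.lower(): True,          # lexical identity feature
    }

print(name_features(["Ms.", "Sheila", "leads"], 1))
# {'is_capitalized': True, 'prev_is_honorific': True, 'prev2_is_honorific': False, 'word=sheila': True}
```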


CRFs vs HMMs

• Not possible to do unsupervised learning

– You need a generative model, or an MRF

• Slower to train

• Much easier to "overfit"

– Use regularization!

Training CRFs

• Needs gradient updates (SGD/LBFGS)

• Optimize the log-likelihood:

ℓ(w) = Σ_i [ score(x(i), E(i)) − log Z(E(i)) ]

∇ℓ(w) = Σ_i [ f(x(i), E(i)) − Σ_x' P(x' | E(i)) f(x', E(i)) ]

(here f(X, E) denotes the summed feature vector of a whole sequence, so score(X, E) = w · f(X, E); the second term is an expectation over all candidate sequences x')


Training CRFs

• Interpretation: the derivative pushes the model to close the gap between the ground-truth features (the first term) and the expected features of the model's guess (the second term).
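A minimal sketch of one stochastic gradient step built on that interpretation; feature_vector and expected_feature_vector are hypothetical helpers (the expectation would be computed with a forward-backward pass under the current weights):

```python
import numpy as np

def crf_sgd_step(w, x_true, e, feature_vector, expected_feature_vector, lr=0.1, l2=1e-3):
    """One gradient-ascent step on the L2-regularized CRF log-likelihood (sketch)."""
    # Gradient = ground-truth features minus the model's expected features, minus the L2 penalty.
    grad = feature_vector(x_true, e) - expected_feature_vector(w, e) - l2 * w
    return w + lr * grad
```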

Training CRFs

• The CRF log-likelihood is concave (equivalently, the negative log-likelihood is convex)

• Any local optimum is therefore a global optimum

CRF Viterbi

• HMM Viterbi:

Vt(xt) = P(et | xt) max_xt-1 P(xt | xt-1) Vt-1(xt-1)

• CRF Viterbi: the same recurrence with the probabilities replaced by scores (max-plus in log space):

Vt(xt) = max_xt-1 [ w · f(xt-1, xt, E) + Vt-1(xt-1) ]

Summary

• Parts-of-Speech and POS taggers

• HMMs

• Named Entity Recognition

• Conditional Random Fields (CRFs)

Recommended