Text Models
Why?
• To “understand” text
• To assist in text search & ranking
• For autocompletion
• Part of Speech Tagging
Simple application: spelling suggestions
• Say that we have a dictionary of words
– A real dictionary, or the result of crawling
– Sentences instead of words
• Now we are given a word w not in the dictionary
• How can we correct it to something in the dictionary
String editing
• Given two strings (sequences) the “distance” between the two strings is defined by the minimum number of “character edit operations” needed to turn one sequence into the other.
• Edit operations: delete, insert, modify (a character)
– A cost is assigned to each operation (e.g. uniform cost = 1)
Edit distance
• Already a simple model for languages
• Modeling the creation of strings (and errors in them) through simple edit operations
Distance between strings
• Edit distance between strings = minimum number of edit operations that can be used to get from one string to the other
– Symmetric because of the particular choice of edit operations and uniform cost
• distance(“Willliam Cohon”,“William Cohen”)
• 2
Finding the edit distance
• An “alignment” problem
• Deciding how to align the two strings
• Can we try all alignments?
• How many (reasonable options) are there?
Dynamic Programming
• An umbrella name for a collection of algorithms
• Main idea: reuse computation for sub-problems, combined in different ways
Example: Fibonacci
def fib(n):
    if n == 0 or n == 1: return n
    else: return fib(n-1) + fib(n-2)
Exponential time!
Fib with Dynamic Programming
table = {}
def fib(n):
    global table
    if n in table:
        return table[n]
    if n == 0 or n == 1:
        table[n] = n
        return n
    else:
        value = fib(n-1) + fib(n-2)
        table[n] = value
        return value
Using a partial solution
• Partial solution:
– Alignment of s up to location i, with t up to location j
• How to reuse?
• Try all options for the “last” operation
• Base case: D(i,0) = i, D(0,i) = i, for i deletions / insertions
• Easy to generalize to arbitrary cost functions!
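A minimal sketch of this dynamic program in Python, with uniform cost 1 per operation (the function name and table layout are illustrative, not from the slides):

def edit_distance(s, t):
    # D[i][j] = minimum number of edits turning s[:i] into t[:j]
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i                          # i deletions
    for j in range(len(t) + 1):
        D[0][j] = j                          # j insertions
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = 0 if s[i-1] == t[j-1] else 1
            D[i][j] = min(D[i-1][j] + 1,     # delete
                          D[i][j-1] + 1,     # insert
                          D[i-1][j-1] + sub) # modify (or match)
    return D[len(s)][len(t)]

For example, edit_distance("Willliam Cohon", "William Cohen") returns 2.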
Models
• Bag-of-words
• N-grams
• Hidden Markov Models
• Probabilistic Context Free Grammar
Bag-of-words
• Every document is represented as a bag of the words it contains
• Bag means that we keep the multiplicity (=number of occurrences) of each word
• Very simple, but we lose all track of structure
n-grams
• Limited structure
• Sliding window of n words
n-gram model
How would we infer the probabilities?
• Issues:
– Overfitting
– Probability 0 for unseen n-grams
How would we infer the probabilities?
• Maximum Likelihood:
"add-one" (Laplace) smoothing
• V = Vocabulary size
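As a minimal sketch of what these estimates look like for bigrams (the counting code and the names train_bigrams, p_mle, p_add_one are illustrative assumptions, not from the slides):

from collections import Counter

def train_bigrams(tokens):
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return unigram, bigram

def p_mle(w_prev, w, unigram, bigram):
    # Maximum likelihood: count(w_prev w) / count(w_prev); 0 for unseen bigrams
    # (assumes w_prev occurs in the training data)
    return bigram[(w_prev, w)] / unigram[w_prev]

def p_add_one(w_prev, w, unigram, bigram, V):
    # "Add-one" (Laplace) smoothing, V = vocabulary size
    return (bigram[(w_prev, w)] + 1) / (unigram[w_prev] + V)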
Good-Turing Estimate
Good-Turing
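The estimate itself did not survive extraction; for reference, the standard Good-Turing formulation (not necessarily the slide's exact notation), where N_r is the number of n-grams occurring exactly r times and N is the total number of n-gram tokens:

r* = (r + 1) · N_{r+1} / N_r
P(n-gram seen r times) = r* / N
Total probability mass reserved for unseen n-grams: N_1 / N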
More than a fixed n: Linear Interpolation
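The interpolation formula was likewise lost; the usual trigram version (a standard formulation, not necessarily the slide's) is:

P(w_i | w_{i-2}, w_{i-1}) = λ1 · P(w_i) + λ2 · P(w_i | w_{i-1}) + λ3 · P(w_i | w_{i-2}, w_{i-1}),  with λ1 + λ2 + λ3 = 1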
Precision vs. Recall
Richer Models
• HMM
• PCFG
Motivation: Part-of-Speech Tagging
– Useful for ranking
– For machine translation
– For Word-Sense Disambiguation
– …
Part-of-Speech Tagging
• Tag this word. This word is a tag.
• He dogs like a flea
• The can is in the fridge
• The sailor dogs me every day
A Learning Problem
• Training set: tagged corpus
– Most famous is the Brown Corpus, with about 1M words
– The goal is to learn a model from the training set, and then perform tagging of untagged text
– Performance tested on a test-set
Simple Algorithm
• Assign to each word its most popular tag in the training set
• Problem: Ignores context
• “Dogs” and “tag” will always be tagged as nouns…
• “Can” will be tagged as a verb
• Still, achieves around 80% correctness for real-life test sets
– Goes as high as 90% when combined with some simple rules
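A minimal sketch of this “most popular tag” baseline in Python (function and variable names are illustrative assumptions):

from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    # tagged_corpus: list of (word, tag) pairs from the training set
    per_word = defaultdict(Counter)
    for word, tag in tagged_corpus:
        per_word[word][tag] += 1
    # fall back to the overall most common tag for unseen words
    default_tag = Counter(tag for _, tag in tagged_corpus).most_common(1)[0][0]
    best = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lambda word: best.get(word, default_tag)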
(HMM) Hidden Markov Model
• Model: sentences are generated by a probabilistic process
• In particular, a Markov Chain whose states correspond to Parts-of-Speech
• Transitions are probabilistic
• In each state a word is output
– The output word is again chosen probabilistically, based on the state
HMM
• An HMM consists of:
– A set of N states
– A set of M symbols (words)
– An N×N matrix of transition probabilities Ptrans
– A vector of size N of initial state probabilities Pstart
– An N×M matrix of emission probabilities Pout
• “Hidden” because we see only the outputs, not the sequence of states traversed
Example
3 Fundamental Problems
1) Compute the probability of a given observation sequence (= sentence)
2) Given an observation sequence, find the most likely hidden state sequence (this is tagging)
3) Given a training set, find the model that would make the observations most likely
Tagging
• Find the most likely sequence of states that led to an observed output sequence
• Problem: exponentially many possible sequences!
Viterbi Algorithm
• Dynamic Programming
• Vt,k is the probability of the most probable state sequence
– Generating the first t + 1 observations (X0, …, Xt)
– And terminating at state k
• V0,k = Pstart(k) · Pout(k, X0)
• Vt,k = Pout(k, Xt) · max_k’ { Vt-1,k’ · Ptrans(k’, k) }
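A minimal sketch of this recursion in Python (Pstart, Ptrans, Pout follow the slide's names; representing them as nested dictionaries is an assumption):

def viterbi(X, states, Pstart, Ptrans, Pout):
    # V[t][k] = probability of the most probable state sequence
    #           generating X[0..t] and terminating at state k
    V = [{k: Pstart[k] * Pout[k][X[0]] for k in states}]
    back = [{}]
    for t in range(1, len(X)):
        V.append({})
        back.append({})
        for k in states:
            prev = max(states, key=lambda kp: V[t-1][kp] * Ptrans[kp][k])
            V[t][k] = Pout[k][X[t]] * V[t-1][prev] * Ptrans[prev][k]
            back[t][k] = prev
    # follow the argmax pointers backwards to recover the most likely tag sequence
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for t in range(len(X) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))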
Finding the path
• Note that we are interested in the most likely path, not only in its probability
• So we need to keep track at each point of the argmax
– Combine them to form a sequence
• What about top-k?
Complexity
• O(T*|S|^2)
• Where T is the sequence (=sentence) length, |S| is the number of states (= number of possible tags)
Computing the probability of a sequence
• Forward probabilities: αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
• Backward probabilities: βt(k) is the probability of seeing the sequence Xt+1…Xn given that the Markov process is at state k at time t
Computing the probabilities
Forward algorithm:
α0(k) = Pstart(k) · Pout(k, X0)
αt(k) = Pout(k, Xt) · Σk’ { αt-1(k’) · Ptrans(k’, k) }
P(X0, …, Xn) = Σk αn(k)
Backward algorithm:
βt(k) = P(Xt+1 … Xn | state at time t is k)
βt(k) = Σk’ { Ptrans(k, k’) · Pout(k’, Xt+1) · βt+1(k’) }
βn(k) = 1 for all k
P(X0, …, Xn) = Σk Pstart(k) · Pout(k, X0) · β0(k) = Σk α0(k) · β0(k)
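A sketch of the forward pass in Python, following the indexing above (the backward pass is symmetric; the dictionary layout of the parameters is an assumption):

def forward(X, states, Pstart, Ptrans, Pout):
    # alpha[t][k] = P(X0..Xt, state at time t is k)
    alpha = [{k: Pstart[k] * Pout[k][X[0]] for k in states}]
    for t in range(1, len(X)):
        alpha.append({k: Pout[k][X[t]] *
                         sum(alpha[t-1][kp] * Ptrans[kp][k] for kp in states)
                      for k in states})
    return alpha   # P(X) = sum(alpha[-1][k] for k in states)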
Learning the HMM probabilities
• Expectation-Maximization Algorithm
1. Start with initial probabilities
2. Compute Eij, the expected number of transitions from i to j while generating a sequence, for each i, j (see next)
3. Set the probability of transition from i to j to be Eij / (Σk Eik)
4. Similarly for the emission probabilities
5. Repeat 2-4 using the new model, until convergence
Estimating the expectations
• By sampling
– Re-run a random execution of the model 100 times
– Count transitions
• By analysis
– Use Bayes’ rule on the formula for sequence probability
– Called the Forward-Backward algorithm
Accuracy
• Tested experimentally
• Exceeds 96% for the Brown corpus
– Trained on one half and tested on the other half
• Compare with the 80-90% by the trivial algorithm
• The hard cases are few, but they are very hard…
NLTK
• http://www.nltk.org/
• Natural Language Toolkit
• Open-source Python modules for NLP tasks
– Including stemming, POS tagging, and much more
Context Free Grammars
• Context Free Grammars are a more natural model for Natural Language
• Syntax rules are very easy to formulate using CFGs
• Provably more expressive than Finite State Machines
– E.g., can check for balanced parentheses
Context Free Grammars
• Non-terminals
• Terminals
• Production rules
– V → w, where V is a non-terminal and w is a sequence of terminals and non-terminals
Context Free Grammars
• Can be used as acceptors
• Can be used as a generative model
• Similarly to the case of Finite State Machines
• How long can a string generated by a CFG be?
Stochastic Context Free Grammar
• Non-terminals
• Terminals
• Production rules, each associated with a probability
– V → w, where V is a non-terminal and w is a sequence of terminals and non-terminals
Chomsky Normal Form
• Every rule is of the form
– V → V1 V2, where V, V1, V2 are non-terminals
– V → t, where V is a non-terminal and t is a terminal
• Every (S)CFG can be written in this form
• Makes designing many algorithms easier
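For illustration, a tiny stochastic grammar in Chomsky Normal Form (a made-up example reusing words from the tagging slides; it is not from the original deck):

S → NP VP (1.0)
NP → Det N (0.7)    NP → dogs (0.3)
VP → V NP (0.6)     VP → barks (0.4)
Det → the (1.0)
N → fridge (1.0)
V → likes (1.0)

Every rule rewrites a non-terminal either to two non-terminals or to a single terminal, and the rule probabilities for each non-terminal sum to 1.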
Questions
• What is the probability of a string?
– Defined as the sum of probabilities of all possible derivations of the string
• Given a string, what is its most likely derivation?
– Also called the Viterbi derivation or parse
– An easy adaptation of the Viterbi Algorithm for HMMs
• Given a training corpus and a CFG (no probabilities), learn the probabilities on the derivation rules
• Inside probability: the probability of generating w_p…w_q from non-terminal N^j
• Outside probability: the total probability of beginning with the start symbol N^1 and generating N^j_pq and everything outside w_p…w_q
Inside-outside probabilities
• Outside probability: α_j(p,q) = P(w_{1(p-1)}, N^j_pq, w_{(q+1)m})
• Inside probability: β_j(p,q) = P(w_pq | N^j_pq)
CYK algorithm
• Inside probabilities are computed bottom-up:
– Base case: β_j(k,k) = P(N^j → w_k)
– Recursion: β_j(p,q) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q)
(Figure: N^j covering w_p…w_q, split into N^r over w_p…w_d and N^s over w_{d+1}…w_q)
CYK algorithm
• Outside probabilities are computed top-down; when N^j is the left child of its parent N^f, the contribution is:
Σ_{f,g} Σ_{e=q+1}^{m} α_f(p,e) · P(N^f → N^j N^g) · β_g(q+1,e)
(Figure: N^1 spans w_1…w_m; N^f dominates N^j over w_p…w_q and N^g over w_{q+1}…w_e)
CYK algorithm
• When N^j is the right child of its parent N^f, the contribution is:
Σ_{f,g} Σ_{e=1}^{p-1} α_f(e,q) · P(N^f → N^g N^j) · β_g(e,p-1)
(Figure: N^1 spans w_1…w_m; N^f dominates N^g over w_e…w_{p-1} and N^j over w_p…w_q)
Outside probability
α_j(p,q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p,e) · P(N^f → N^j N^g) · β_g(q+1,e)
         + Σ_{f,g} Σ_{e=1}^{p-1} α_f(e,q) · P(N^f → N^g N^j) · β_g(e,p-1)
• Base case: α_1(1,m) = 1, and α_j(1,m) = 0 for j ≠ 1
Probability of a sentence
• P(w_{1m}) = β_1(1,m)
• P(w_{1m}) = Σ_j α_j(k,k) · P(N^j → w_k), for any k
• P(w_{1m}, N^j_pq) = α_j(p,q) · β_j(p,q)
The probability that a binary rule is used
• For a fixed span (p,q):
P(N^j_pq, N^j → N^r N^s | w_{1m}) = [ Σ_{d=p}^{q-1} α_j(p,q) · P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q) ] / P(w_{1m})
• Summing over all spans:
P(N^j → N^r N^s, N^j used | w_{1m}) = [ Σ_{p=1}^{m} Σ_{q=p}^{m} Σ_{d=p}^{q-1} α_j(p,q) · P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q) ] / P(w_{1m})    (1)
The probability that Nj is used
• P(w_{1m}, N^j_pq) = α_j(p,q) · β_j(p,q), so
P(N^j_pq | w_{1m}) = α_j(p,q) · β_j(p,q) / P(w_{1m})
• Summing over all spans:
P(N^j used | w_{1m}) = [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p,q) · β_j(p,q) ] / P(w_{1m})    (2)
• Dividing (1) by (2) gives the re-estimated binary rule probability:
P̂(N^j → N^r N^s) = [ Σ_{p=1}^{m} Σ_{q=p}^{m} Σ_{d=p}^{q-1} α_j(p,q) · P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q) ] / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p,q) · β_j(p,q) ]
The probability that a unary rule is used
• For a terminal w^k:
P(N^j → w^k, N^j used | w_{1m}) = [ Σ_{h=1}^{m} α_j(h,h) · β_j(h,h) · δ(w_h, w^k) ] / P(w_{1m})    (3)
where δ(w_h, w^k) = 1 if the word at position h is w^k, and 0 otherwise
• Dividing (3) by (2) gives the re-estimated unary rule probability:
P̂(N^j → w^k) = [ Σ_{h=1}^{m} α_j(h,h) · β_j(h,h) · δ(w_h, w^k) ] / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p,q) · β_j(p,q) ]
Multiple training sentences
• With several training sentences W_1, …, W_K, compute the expectations separately for each sentence and sum them:
h_i(p,q,j) = α_j(p,q) · β_j(p,q) / P(W_i)    (the quantity in (2), for sentence W_i)
f_i(p,q,j,r,s) = [ Σ_{d=p}^{q-1} α_j(p,q) · P(N^j → N^r N^s) · β_r(p,d) · β_s(d+1,q) ] / P(W_i)    (the quantity in (1))
• Re-estimated rule probability:
P̂(N^j → N^r N^s) = [ Σ_i Σ_{p,q} f_i(p,q,j,r,s) ] / [ Σ_i Σ_{p,q} h_i(p,q,j) ]