
Structure and Predictions

Machine Learning: Jordan Boyd-Graber
University of Maryland
INTRODUCTION


Today

• Perceptron
• Structured Perceptron

1. Good ML analysis, standard NLP problem
2. Uses structure and representation


Most supervised algorithms are ...

Logistic Regression

p(y | x) = σ(∑_i β_i x_i)

SVM

sign(w · x + b)

• What statistical property do these (and many others) share?

• Hint: p(y_i, y_j | x_i, x_j) = p(y_i | x_i) p(y_j | x_j)

• Independent!
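
To see the factorization concretely, here is a minimal sketch (my own illustration with made-up scores, not something from the slides): two logistic-regression predictions made independently, whose joint probability is just the product of the two marginals.

```python
import math

def sigmoid(score: float) -> float:
    """Logistic function sigma(score), as in p(y | x) = sigma(sum_i beta_i x_i)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical per-example scores beta . x for two separate inputs.
p_y1 = sigmoid(1.2)   # p(y_1 = 1 | x_1)
p_y2 = sigmoid(-0.4)  # p(y_2 = 1 | x_2)

# Independence: p(y_1, y_2 | x_1, x_2) = p(y_1 | x_1) * p(y_2 | x_2).
# Neither prediction can be influenced by the other label.
print(p_y1 * p_y2)
```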


Is this how the world works?

Also particularly relevant for 2016: correlated voting patterns


POS Tagging: Task Definition

• Annotate each word in a sentence with a part-of-speech marker.

• Lowest level of syntactic analysis.

  John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN

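Note that the same word type "saw" needs two different tags in this sentence (VBD as a verb, NN as a noun). A minimal sketch of why that matters (my own illustration; the list below just restates the example from the slide):

```python
# The annotated example sentence as (word, tag) pairs.
gold = [("John", "NNP"), ("saw", "VBD"), ("the", "DT"), ("saw", "NN"),
        ("and", "CC"), ("decided", "VBD"), ("to", "TO"), ("take", "VB"),
        ("it", "PRP"), ("to", "IN"), ("the", "DT"), ("table", "NN")]

# Collect the tags each word type needs.
needed = {}
for word, tag in gold:
    needed.setdefault(word, set()).add(tag)

print(needed["saw"])  # {'VBD', 'NN'}
print(needed["to"])   # {'TO', 'IN'}
# A tagger that sees only the word itself must give both "saw" tokens the
# same tag, so it gets at least one of them wrong: context has to matter.
```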

Typical Features (φ)

Assume K parts of speech, a lexicon size of V, a series of observations {x_1, ..., x_N}, and a series of unobserved states {z_1, ..., z_N}.

π  Start state scores (vector of length K): π_i = log p(z_1 = i)
θ  Transition matrix (matrix of size K by K): θ_{i,j} = log p(z_n = j | z_{n-1} = i)
β  An emission matrix (matrix of size K by V): β_{j,w} = log p(x_n = w | z_n = j)

Score

f(x, z) ≡ ∑_i w_i φ_i(x, z)    (1)

Here f(x, z) is the total score of hypothesis z given input x, each w_i is a feature weight, and each φ_i(x, z) indicates whether feature i is present (binary).

Two problems: How do we move from data to an algorithm? (Estimation: HMM) How do we move from a model and unlabeled data to labeled data? (Inference)
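
To connect the π, θ, β tables above to the generic score f(x, z) = ∑_i w_i φ_i(x, z), here is a minimal sketch (my own illustration; the weights are made-up log-probabilities, not values from the slides). Each start, transition, and emission feature φ_i counts how often the corresponding event occurs in (x, z), and its weight w_i is the matching log-probability.

```python
from collections import Counter

def features(words, tags):
    """phi(x, z): how often each start/transition/emission feature fires."""
    phi = Counter()
    phi[("start", tags[0])] += 1
    for prev, cur in zip(tags, tags[1:]):
        phi[("trans", prev, cur)] += 1
    for word, tag in zip(words, tags):
        phi[("emit", tag, word)] += 1
    return phi

# Hypothetical weights w_i; for an HMM these are the log-probabilities
# pi_i, theta_{i,j}, and beta_{j,w} (a full model would have one per feature).
w = {
    ("start", "V"): -1.45,
    ("trans", "V", "PRO"): -1.20,
    ("emit", "V", "come"): -2.11,
    ("emit", "PRO", "it"): -1.70,
}

def score(words, tags):
    """f(x, z) = sum_i w_i * phi_i(x, z)  (Eq. 1)."""
    return sum(w.get(f, 0.0) * count for f, count in features(words, tags).items())

print(score(["come", "it"], ["V", "PRO"]))  # -6.46 with the toy weights above
```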


Viterbi Algorithm

• Given an observed sequence of length L, {x_1, ..., x_L}, we want to find a sequence {z_1, ..., z_L} with the highest score.
• It is infeasible to enumerate all K^L possibilities.
• So, we use dynamic programming to compute the most likely tags for each token subsequence from 0 to t that ends in state k.
• Memoization: fill a table of solutions to sub-problems.
• Solve larger problems by composing sub-solutions.
• Base case:
  f_1(k) = π_k + β_{k,x_1}    (2)
• Recursion:
  f_n(k) = max_j [ f_{n-1}(j) + θ_{j,k} ] + β_{k,x_n}    (3)

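A minimal sketch of the base case and recursion in Eqs. (2) and (3) (my own illustration, not code from the slides): `pi`, `theta`, and `beta` are assumed to be dictionaries holding the log scores defined on the features slide, and `tag_set` is the list of K tags.

```python
def viterbi_scores(words, tag_set, pi, theta, beta):
    """Fill the table f[n][k]: best score of any tag sequence for
    words[0..n] that ends in tag k (Eqs. 2 and 3)."""
    f = [{} for _ in words]
    # Base case: f_1(k) = pi_k + beta_{k, x_1}
    for k in tag_set:
        f[0][k] = pi[k] + beta[(k, words[0])]
    # Recursion: f_n(k) = max_j [f_{n-1}(j) + theta_{j, k}] + beta_{k, x_n}
    for n in range(1, len(words)):
        for k in tag_set:
            f[n][k] = (max(f[n - 1][j] + theta[(j, k)] for j in tag_set)
                       + beta[(k, words[n])])
    return f
```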

• The complexity of this is now K²L.
• Garden-path sentences like "the old man the boats" require all cells.
• But just computing the max isn't enough. We also have to remember where we came from. (Breadcrumbs from the best previous state.)

  Ψ_n = argmax_j f_{n-1}(j) + θ_{j,k}    (4)

• Let's do that for the sentence "come and get it".

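As a quick back-of-the-envelope check of the K²L claim (my own numbers, using the seven tags and four words of the example below):

```python
K, L = 7, 4       # tags {MOD, DET, CONJ, N, PREP, PRO, V}; words "come and get it"
print(K ** L)     # 2401 candidate tag sequences for brute-force enumeration
print(K * L, "table cells,", K ** 2 * L, "predecessor comparisons")  # 28 cells, 196 comparisons
```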

POS    π_k        β_{k,x_1}   f_1(k)
MOD    log 0.234  log 0.024   -5.18
DET    log 0.234  log 0.032   -4.89
CONJ   log 0.234  log 0.024   -5.18
N      log 0.021  log 0.016   -7.99
PREP   log 0.021  log 0.024   -7.59
PRO    log 0.021  log 0.016   -7.99
V      log 0.234  log 0.121   -3.56

come and get it (with HMM probabilities)

Why logarithms?

1. More interpretable than a float with lots of zeros.

2. Underflow is less of an issue

3. Generalizes to linear models (next!)

4. Addition is cheaper than multiplication

log(ab) = log(a) + log(b)    (5)

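As a sanity check of the f_1 column (my own computation; the slide's numbers are natural logs), Eq. (2) reproduces the f_1(V) entry:

```python
import math

pi_V = math.log(0.234)         # start score for V, about -1.45
beta_V_come = math.log(0.121)  # emission score for ("V", "come"), about -2.11
print(round(pi_V + beta_V_come, 2))  # -3.56, matching the table above
```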

POS    f_1(j)   f_1(j) + θ_{j,CONJ}   f_2(CONJ)
MOD    -5.18    -8.48
DET    -4.89    -7.72
CONJ   -5.18    -8.47                 -6.02
N      -7.99    ≤ -7.99
PREP   -7.59    ≤ -7.59
PRO    -7.99    ≤ -7.99
V      -3.56    -5.21

come and get it

f_1(V) + θ_{V,CONJ} = -3.56 + (-1.65) = -5.21
f_2(CONJ) = -5.21 + β_{CONJ,and} = -5.21 - 0.64

POS    f_1(k)   f_2(k)   b_2    f_3(k)   b_3     f_4(k)   b_4
MOD    -5.18    -0.00    X      -0.00    X       -0.00    X
DET    -4.89    -0.00    X      -0.00    X       -0.00    X
CONJ   -5.18    -6.02    V      -0.00    X       -0.00    X
N      -7.99    -0.00    X      -0.00    X       -0.00    X
PREP   -7.59    -0.00    X      -0.00    X       -0.00    X
PRO    -7.99    -0.00    X      -0.00    X       -14.6    V
V      -3.56    -0.00    X      -9.03    CONJ    -0.00    X
WORD   come     and             get             it

The filled-in backpointers trace out come/V → and/CONJ → get/V → it/PRO: b_2 = V for CONJ, b_3 = CONJ for V, and b_4 = V for PRO.
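
To recover a path like the one above programmatically, the breadcrumbs of Eq. (4) can either be stored during the forward pass or, as in this sketch (my own illustration, not code from the slides), recomputed from the finished score table; `f` is the table returned by the earlier `viterbi_scores` sketch and `theta` holds the transition scores.

```python
def backtrace(f, tag_set, theta):
    """Follow the breadcrumbs Psi_n = argmax_j f_{n-1}(j) + theta_{j,k} (Eq. 4)
    backwards from the best final state to recover the best tag sequence."""
    z = [max(tag_set, key=lambda k: f[-1][k])]  # best tag for the last word
    for n in range(len(f) - 1, 0, -1):
        k = z[-1]                               # tag already chosen at position n
        z.append(max(tag_set, key=lambda j: f[n - 1][j] + theta[(j, k)]))
    return list(reversed(z))

# Usage with the earlier sketch:
#   best_tags = backtrace(viterbi_scores(words, tag_set, pi, theta, beta), tag_set, theta)
```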