Structure and Predictions

Machine Learning: Jordan Boyd-Graber
University of Maryland

INTRODUCTION
Today

• Perceptron
• Structured Perceptron

1. Good ML analysis, standard NLP problem
2. Uses structure and representation
Most supervised algorithms are . . .

Logistic Regression

    p(y \mid x) = \sigma\Big(\sum_i \beta_i x_i\Big)

SVM

    \operatorname{sign}(\vec{w} \cdot \vec{x} + b)

• What statistical property do these (and many others) share?
• Hint: p(y_i, y_j \mid x_i, x_j) = p(y_i \mid x_i)\, p(y_j \mid x_j)
• Independent!
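As a concrete illustration (not from the slides; the weights and inputs are made up), a per-example classifier like logistic regression has no term coupling y_i and y_j, so the joint probability factorizes exactly as in the hint:

```python
import math

# Toy check of the independence property: each example is scored on its
# own, so the joint probability is a product of per-example terms.
def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

beta = [1.2, -0.7]                         # assumed feature weights

def p(y, x):
    """p(y | x) = sigma(sum_i beta_i x_i) for y = 1, else the complement."""
    s = sigma(sum(b * xi for b, xi in zip(beta, x)))
    return s if y == 1 else 1.0 - s

x_i, x_j = [1.0, 0.5], [0.2, 1.0]
joint = p(1, x_i) * p(0, x_j)              # p(y_i, y_j | x_i, x_j)
print(joint == p(1, x_i) * p(0, x_j))      # True: nothing couples y_i and y_j
```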
Is this how the world works?

Also particularly relevant for 2016: correlated voting patterns
POS Tagging: Task Definition

• Annotate each word in a sentence with a part-of-speech marker.
• Lowest level of syntactic analysis.

    John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN
Typical Features (φ)

Assume K parts of speech, a lexicon size of V, a series of observations {x_1, ..., x_N}, and a series of unobserved states {z_1, ..., z_N}.

π  Start state scores (vector of length K): \pi_i = \log p(z_1 = i)
θ  Transition matrix (matrix of size K by K): \theta_{i,j} = \log p(z_n = j \mid z_{n-1} = i)
β  An emission matrix (matrix of size K by V): \beta_{j,w} = \log p(x_n = w \mid z_n = j)

Score

    f(x, z) \equiv \sum_i w_i \phi_i(x, z)    (1)

This is the total score of hypothesis z given input x: each w_i is a feature weight, and each \phi_i(x, z) indicates whether the corresponding feature is present (binary).

Two problems: How do we move from data to algorithm? (Estimation: HMM) How do we move from a model and unlabeled data to labeled data? (Inference)
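A minimal Python sketch (not the lecture's code) of Equation (1) specialized to these HMM features: the score of a hypothesis z is the sum of the start, transition, and emission log-probabilities along the sequence.

```python
import math

def hmm_score(x, z, pi, theta, beta):
    """Total score of tag sequence z for observed words x (log space)."""
    score = pi[z[0]] + beta[z[0]][x[0]]      # start state + first emission
    for n in range(1, len(x)):
        score += theta[z[n - 1]][z[n]]       # transition z_{n-1} -> z_n
        score += beta[z[n]][x[n]]            # emission of word x_n
    return score

# Numbers pulled from the worked example later in the deck:
pi = {"V": math.log(0.234)}
theta = {"V": {"CONJ": -1.65}}
beta = {"V": {"come": math.log(0.121)}, "CONJ": {"and": -0.64}}
print(hmm_score(["come", "and"], ["V", "CONJ"], pi, theta, beta))  # ~ -5.85
```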
Viterbi Algorithm

• Given an observed sequence of length L, {x_1, ..., x_L}, we want to find a state sequence {z_1, ..., z_L} with the highest score.
• Enumerating all K^L possibilities is computationally infeasible.
• So we use dynamic programming to compute the most likely tags for each token subsequence from 0 to t that ends in state k.
• Memoization: fill a table of solutions of sub-problems
• Solve larger problems by composing sub-solutions
• Base case:

      f_1(k) = \pi_k + \beta_{k, x_1}    (2)

• Recursion:

      f_n(k) = \max_j \left[ f_{n-1}(j) + \theta_{j,k} \right] + \beta_{k, x_n}    (3)
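A sketch of this forward pass in Python (the data layout is assumed: log-space dicts pi, theta, beta keyed by state and word, as in the feature slide):

```python
def viterbi_scores(x, states, pi, theta, beta):
    # f[t][k] is the best score of any tagging of the first t+1 tokens
    # that ends in state k (zero-based version of f_{t+1}(k)).
    f = [{k: pi[k] + beta[k][x[0]] for k in states}]          # base case (2)
    for n in range(1, len(x)):
        f.append({k: max(f[n - 1][j] + theta[j][k] for j in states)
                     + beta[k][x[n]]                          # recursion (3)
                  for k in states})
    return f
```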
• The complexity of this is now K^2 L.
• Garden-path sentences like "the old man the boats" require all cells.
• But just computing the max isn't enough. We also have to remember where we came from (breadcrumbs from the best previous state):

      \Psi_n(k) = \operatorname{argmax}_j \, f_{n-1}(j) + \theta_{j,k}    (4)

• Let's do that for the sentence "come and get it".
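Extending the sketch above with the breadcrumbs of Equation (4) — again an illustrative implementation, not the lecture's code:

```python
def viterbi(x, states, pi, theta, beta):
    f = [{k: pi[k] + beta[k][x[0]] for k in states}]
    psi = [{}]                                   # no predecessor at n = 1
    for n in range(1, len(x)):
        f.append({}); psi.append({})
        for k in states:
            best_j = max(states, key=lambda j: f[n - 1][j] + theta[j][k])
            psi[n][k] = best_j                   # breadcrumb, Equation (4)
            f[n][k] = f[n - 1][best_j] + theta[best_j][k] + beta[k][x[n]]
    z = [max(states, key=lambda k: f[-1][k])]    # best final state
    for n in range(len(x) - 1, 0, -1):           # follow breadcrumbs back
        z.append(psi[n][z[-1]])
    return list(reversed(z))
```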
POS    π_k        β_{k,x_1}   f_1(k)
MOD    log 0.234  log 0.024   -5.18
DET    log 0.234  log 0.032   -4.89
CONJ   log 0.234  log 0.024   -5.18
N      log 0.021  log 0.016   -7.99
PREP   log 0.021  log 0.024   -7.59
PRO    log 0.021  log 0.016   -7.99
V      log 0.234  log 0.121   -3.56

come and get it (with HMM probabilities)

Why logarithms?

1. More interpretable than a float with lots of zeros.
2. Underflow is less of an issue.
3. Generalizes to linear models (next!)
4. Addition is cheaper than multiplication:

      \log(ab) = \log(a) + \log(b)    (5)
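A quick check of the f_1 column via Equation (2), assuming natural logarithms (which reproduces the table's values):

```python
import math

# f_1(k) = log(pi_k) + log(beta_{k, x_1}) for the first word "come"
for pos, p_start, p_emit in [("DET", 0.234, 0.032), ("V", 0.234, 0.121)]:
    print(pos, round(math.log(p_start) + math.log(p_emit), 2))
# DET -4.89
# V   -3.56
```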
POS    f_1(j)   f_1(j) + θ_{j,CONJ}   f_2(CONJ)
MOD    -5.18    -8.48
DET    -4.89    -7.72
CONJ   -5.18    -8.47                 -6.02
N      -7.99    ≤ -7.99
PREP   -7.59    ≤ -7.59
PRO    -7.99    ≤ -7.99
V      -3.56    -5.21

come and get it

    f_1(V) + θ_{V,CONJ} = -3.56 + (-1.65) = -5.21

    f_2(CONJ) = f_1(V) + θ_{V,CONJ} + β_{CONJ,and} = -5.21 - 0.64
POS    f_1(k)   f_2(k)  b_2    f_3(k)  b_3    f_4(k)  b_4
MOD    -5.18    -0.00   X      -0.00   X      -0.00   X
DET    -4.89    -0.00   X      -0.00   X      -0.00   X
CONJ   -5.18    -6.02   V      -0.00   X      -0.00   X
N      -7.99    -0.00   X      -0.00   X      -0.00   X
PREP   -7.59    -0.00   X      -0.00   X      -0.00   X
PRO    -7.99    -0.00   X      -0.00   X      -14.6   V
V      -3.56    -0.00   X      -9.03   CONJ   -0.00   X
WORD   come     and            get            it
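Reading off the answer (a sketch; it treats the X cells as not on any competitive path, so the best final cell is f_4(PRO) = -14.6):

```python
# Follow the b_n breadcrumbs from the table backwards from f_4(PRO).
b = {4: {"PRO": "V"}, 3: {"V": "CONJ"}, 2: {"CONJ": "V"}}

tags = ["PRO"]                       # best state at n = 4
for n in (4, 3, 2):
    tags.append(b[n][tags[-1]])      # predecessor at step n - 1
print(list(reversed(tags)))          # ['V', 'CONJ', 'V', 'PRO']
```

So the decoded tagging is come/V and/CONJ get/V it/PRO.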