
Structured Belief Propagation for NLP (Matthew R. Gormley & Jason Eisner, ACL '14 Tutorial, June 22, 2014)




Slide 2: Structured Belief Propagation for NLP
Matthew R. Gormley & Jason Eisner
ACL '14 Tutorial, June 22, 2014
For the latest version of these slides, please visit: http://www.cs.jhu.edu/~mrg/bp-tutorial/

Slide 3: Language has a lot going on at once
Structured representations of utterances
Structured knowledge of the language
Many interacting parts for BP to reason about!

Slides 4-11: Outline
Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What's going on here? Can we trust BP's estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions, or truly improve the approximate predictions.
7. Software: Build the model you want!
Slide 12: Section 1: Introduction
Modeling with Factor Graphs

Slide 13: Sampling from a Joint Distribution
[Figure: a chain-structured factor graph over tag variables X0-X5 for the sentence "time flies like an arrow", with six sampled tag sequences such as "n v p d n" and "n n v d n".]
A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

Slide 14: Sampling from a Joint Distribution
[Figure: a factor graph over variables X1-X7, with four sampled assignments.]
A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

Slide 15: Sampling from a Joint Distribution
[Figure: the tagging factor graph extended with word variables W1-W5; each sample now pairs a tag sequence with a word sequence, e.g. "n v p d n" with "time flies like an arrow".]
A joint distribution defines a probability p(x) for each assignment of values x to variables X. This gives the proportion of samples that will equal x.

Slide 16: Factors have local opinions (≥ 0)
Each black box looks at some of the tags X_i and words W_i.
Note: We chose to reuse the same factors at different positions in the sentence.

Slide 17: Factors have local opinions (≥ 0)
Each black box looks at some of the tags X_i and words W_i.
p(n, v, p, d, n, time, flies, like, an, arrow) = ?

Slide 18: Global probability = product of local opinions
Each black box looks at some of the tags X_i and words W_i.
p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)
Uh-oh! The probabilities of the various assignments sum up to Z > 1. So divide them all by Z.

Slide 19: Markov Random Field (MRF)
p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)
Joint distribution over tags X_i and words W_i. The individual factors aren't necessarily probabilities.

Slide 20: Hidden Markov Model (HMM)
But sometimes we choose to make the factors probabilities: constrain each row of a factor to sum to one. Now Z = 1.
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
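The contrast on slides 18-20 (an arbitrary product of nonnegative factor values must be divided by Z, whereas locally normalized HMM-style factors give Z = 1) can be checked directly on a toy model. The sketch below is illustrative only, not code from the tutorial; the two-tag, two-word vocabulary, the initial-tag factor standing in for the start-of-sentence factor, and all of the numbers are made-up assumptions.

from itertools import product

TAGS, VOCAB, N = ["n", "v"], ["time", "flies"], 2   # tiny made-up model

def Z(init, trans, emit):
    """Sum the product of local factor values over every (tags, words) assignment."""
    total = 0.0
    for tags in product(TAGS, repeat=N):
        for words in product(VOCAB, repeat=N):
            score = init[tags[0]] * emit[(tags[0], words[0])]
            for i in range(1, N):
                score *= trans[(tags[i-1], tags[i])] * emit[(tags[i], words[i])]
            total += score
    return total

# MRF-style factors: arbitrary nonnegative "opinions", so Z is generally not 1
# and every product must be divided by it (slides 18-19).
phi_init  = {"n": 4.0, "v": 2.0}
phi_trans = {("n","n"): 1.0, ("n","v"): 8.0, ("v","n"): 2.0, ("v","v"): 1.0}
phi_emit  = {("n","time"): 4.0, ("n","flies"): 3.0,
             ("v","time"): 0.5, ("v","flies"): 6.0}
print(Z(phi_init, phi_trans, phi_emit))   # some number other than 1

# HMM-style factors: every row is a conditional distribution summing to one,
# so the same computation already gives Z = 1 (slide 20).
p_init  = {"n": 0.7, "v": 0.3}
p_trans = {("n","n"): 0.3, ("n","v"): 0.7, ("v","n"): 0.6, ("v","v"): 0.4}
p_emit  = {("n","time"): 0.8, ("n","flies"): 0.2,
           ("v","time"): 0.1, ("v","flies"): 0.9}
print(Z(p_init, p_trans, p_emit))         # 1.0 (up to floating-point rounding)

For a CRF (slide 22), the same computation would hold the words fixed, so Z would sum only over tag sequences and would be specific to the sentence.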
Slide 21: Markov Random Field (MRF)
p(n, v, p, d, n, time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)
Joint distribution over tags X_i and words W_i.

Slide 22: Conditional Random Field (CRF)
Conditional distribution over tags X_i given words w_i. The factors and Z are now specific to the sentence w.
p(n, v, p, d, n | time, flies, like, an, arrow) = (4 * 8 * 5 * 3 * …)

Slide 23: How General Are Factor Graphs?
Factor graphs can be used to describe:
- Markov Random Fields (undirected graphical models), i.e., log-linear models over a tuple of variables
- Conditional Random Fields
- Bayesian Networks (directed graphical models)
Inference treats all of these interchangeably: convert your model to a factor graph first. Pearl (1988) gave key strategies for exact inference:
- Belief propagation, for inference on acyclic graphs
- The junction tree algorithm, for making any graph acyclic (by merging variables and factors, which blows up the runtime)

Slide 24: Object-Oriented Analogy
What is a sample? A datum: an immutable object that describes a linguistic structure.
What is the sample space? The class of all possible sample objects.
What is a random variable? An accessor method of the class, e.g., one that returns a certain field. It will give different values when called on different random samples.

class Tagging:
    int n;                           // length of sentence
    Word[] w;                        // array of n words (values w_i)
    Tag[] t;                         // array of n tags (values t_i)
    Word W(int i) { return w[i]; }   // random var W_i
    Tag T(int i) { return t[i]; }    // random var T_i
    String S(int i) {                // random var S_i
        return suffix(w[i], 3);
    }

Random variable W_5 takes the value w_5 == "arrow" in this sample.

Slide 25: Object-Oriented Analogy
What is a sample? A datum: an immutable object that describes a linguistic structure.
What is the sample space? The class of all possible sample objects.
What is a random variable? An accessor method of the class, e.g., one that returns a certain field.
A model is represented by a different object. What is a factor of the model? A method of the model that computes a number ≥ 0 from a sample, based on the sample's values of a few random variables and on parameters stored in the model.
What probability does the model assign to a sample? A product of its factors (rescaled), e.g., uprob(tagging) / Z().
How do you find the scaling factor? Add up the probabilities of all possible samples. If the result Z != 1, divide the probabilities by that Z.

class TaggingModel:
    float transition(Tagging tagging, int i) {   // tag-tag bigram
        return tparam[tagging.T(i-1)][tagging.T(i)];
    }
    float emission(Tagging tagging, int i) {     // tag-word bigram
        return eparam[tagging.T(i)][tagging.W(i)];
    }
    float uprob(Tagging tagging) {               // unnormalized prob
        float p = 1;
        for (int i = 1; i <= tagging.n; i++)
            p *= transition(tagging, i) * emission(tagging, i);
        return p;
    }
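As a rough, runnable rendering of the object-oriented analogy on slides 24-25, the Python sketch below mirrors the Tagging / TaggingModel pseudocode above. It is not the tutorial's own implementation; the parameter dictionaries, the default value 0.1 for unseen bigrams, and the brute-force Z() method are assumptions made for illustration.

from itertools import product

class Tagging:
    """A sample: an immutable object describing one linguistic structure (slide 24)."""
    def __init__(self, words, tags):
        self.n, self.w, self.t = len(words), tuple(words), tuple(tags)
    # Accessor methods play the role of random variables.
    def W(self, i): return self.w[i]        # random var W_i
    def T(self, i): return self.t[i]        # random var T_i
    def S(self, i): return self.w[i][-3:]   # random var S_i (3-letter suffix)

class TaggingModel:
    """A model object whose factor methods each return a number >= 0 for a sample."""
    def __init__(self, tparam, eparam):
        self.tparam, self.eparam = tparam, eparam
    def transition(self, tagging, i):        # tag-tag bigram factor
        return self.tparam.get((tagging.T(i-1), tagging.T(i)), 0.1)
    def emission(self, tagging, i):          # tag-word bigram factor
        return self.eparam.get((tagging.T(i), tagging.W(i)), 0.1)
    def uprob(self, tagging):                # unnormalized probability of a sample
        p = self.emission(tagging, 0)
        for i in range(1, tagging.n):
            p *= self.transition(tagging, i) * self.emission(tagging, i)
        return p
    def Z(self, words, tagset):
        # "Add up the probabilities of all possible samples" (slide 25);
        # brute-force enumeration, feasible only for tiny examples.
        return sum(self.uprob(Tagging(words, tags))
                   for tags in product(tagset, repeat=len(words)))

# Usage: the normalized probability the model assigns to one sample,
# i.e. uprob(tagging) / Z() in the slide's notation.
model = TaggingModel(
    tparam={("n","v"): 8.0, ("v","p"): 5.0, ("p","d"): 3.0, ("d","n"): 7.0},
    eparam={("n","time"): 4.0, ("v","flies"): 6.0, ("p","like"): 2.0,
            ("d","an"): 9.0, ("n","arrow"): 5.0})
sample = Tagging("time flies like an arrow".split(), ["n","v","p","d","n"])
print(model.uprob(sample) / model.Z(sample.w, ["n","v","p","d"]))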