
Page 1: Artificial Intelligence - UCSB Computer Science

Artificial Intelligence CS 165A

Jan 23 2020

Instructor: Prof. Yu-Xiang Wang

→ Bayesian Network (Ch 14)

Page 2: Artificial Intelligence - UCSB Computer Science

Announcement

• Extension of the HW1 submission deadline to Friday 23:59.
– No extensions will be given for future homework assignments.

• HW2 will be released tonight.

• Homework is the most important component of this course.
– Research shows that active learning is more effective.
– It is expected to be long and challenging.
– You have two full weeks of time. Start early!

2

Page 3: Artificial Intelligence - UCSB Computer Science

Student feedback

• Feedback from the last lecture
– Lecture is going at a “great pace”

– “Ease your assumption on prior knowledge”

– “Review conditional independence … how it reduces # of parameters in discussion sections”

– “Color-blindness”

• On Piazza:
– “Can we go over HW1 Q6 in the lecture please?”

3

Page 4: Artificial Intelligence - UCSB Computer Science

Color-blindness-accessible palettes

4
https://davidmathlogic.com/colorblind/

Page 5: Artificial Intelligence - UCSB Computer Science

Homework 1 Q6

• “SGD does not converge, results look random”
– There are so many steps:
• Creating the vocabulary
• Feature extraction
• Gradient implementation
• Optimization
• Hyperparameter choices

• “What did I do wrongly?”

• These are exactly what you will run into in practice!

5

Page 6: Artificial Intelligence - UCSB Computer Science

Gradient Descent and Stochastic Gradient Descent

• Optimization problem to solve in Q6

• From Slide 36 of Lecture 4

6

However, simple as it is, cross-validation is easy to misuse, especially when there are data-preprocessing steps that are needed before training a classifier. For instance, in this example of text classification, the feature extraction step requires constructing a vocabulary as well as some normalization steps (if TF-IDF is used). Discuss what can go wrong if these feature extractors are created based on all data, rather than just the training data.
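To make the pitfall concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer (an assumption for illustration; the homework may require a from-scratch extractor). The safe pattern is to fit the extractor on the training split only:

```python
# Sketch of avoiding train/test leakage in feature extraction. Assumes
# scikit-learn is available; the homework's own extractor may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

docs = ["the cat sat", "a dog barked", "the cat barked", "a dog sat"]  # toy data
labels = [0, 1, 0, 1]
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=0)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # vocabulary and IDF from train only
X_test = vectorizer.transform(test_docs)        # reuse them; never fit on test data
```

Fitting on all data instead would leak the test set's vocabulary and document frequencies into training, making cross-validation estimates optimistically biased.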

Question 6. (40 points) (Implement logistic regression classifier) In this question, we will take the TF-IDF feature vector from the previous question and train a multiclass logistic regression classifier.

The logistic regression classifier [4] is a popular machine learning model for predicting the probability of individual labels. It has a parameter matrix $\Theta \in \mathbb{R}^{d \times k}$, where $k$ is the number of classes (in this case $k = 15$) and $d$ is the dimension of the feature vector. These are the “free parameters” in the logistic regression. Each specific choice of $\Theta$ corresponds to a classifier.

Learning amounts to finding a specific parameter $\Theta$ that has low classification error. Inference is about using this classifier to predict the label.

Given a feature vector $x = (x_1, x_2, \ldots, x_d)^T$, the prediction generated by a logistic regression classifier is

$$\hat{y} = \mathrm{softargmax}(\Theta^T x),$$

where softargmax is a vector-valued function that maps scores to a probability distribution via $\mathrm{softargmax}(z)_i = e^{z_i} / \sum_{j=1}^{k} e^{z_j}$.

The predicted label of a logistic regression classifier is then taken as $\arg\max_i \hat{y}[i]$.

Ultimately, we will evaluate the classifier using the classification error on the hold-out data; however, these quantities involve indicators, which are not differentiable and thus not easy to optimize.

The training of a logistic regression classifier thus involves minimizing a surrogate loss function which is continuously differentiable. In particular, the multiclass logistic loss function for a data point $(x, y)$ that we use is the following:

$$L(\Theta, (x, y)) := -\sum_{j=1}^{k} y[j] \log(\hat{y}[j]).$$

This is also called the Cross-Entropy loss.

Here, we used the one-hot encoding of the label $y$: if the true label is 5 (the fifth author wrote this piece), $y \in \mathbb{R}^k$ is a vector with only the 5th element being 1 and all other elements zero.

Finally, the continuous optimization problem that we solve to “train” a logistic regression classifier using $(x_1, y_1), \ldots, (x_n, y_n)$ is

$$\min_{\Theta}\ \sum_{i=1}^{n} L(\Theta, (x_i, y_i)) + \lambda \|\Theta\|_F^2. \qquad (1)$$

The Frobenius norm $\|\Theta\|_F = \sqrt{\sum_{i,j} \Theta[i,j]^2}$.
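As a concrete reference point, here is a minimal numpy sketch of the loss above and its gradient at a single data point. The names (`Theta`, `lam`) and the per-point regularizer are illustrative simplifications, not the homework's required interface; in (1) the regularizer appears once over the whole sum.

```python
import numpy as np

def softargmax(z):
    """Map a score vector z in R^k to a probability distribution."""
    z = z - z.max()              # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(Theta, x, y_onehot, lam):
    """Cross-entropy loss (plus L2 term) and its gradient at one point (x, y)."""
    y_hat = softargmax(Theta.T @ x)                          # shape (k,)
    loss = -np.sum(y_onehot * np.log(y_hat + 1e-12)) + lam * np.sum(Theta ** 2)
    grad = np.outer(x, y_hat - y_onehot) + 2 * lam * Theta   # shape (d, k)
    return loss, grad
```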

7

Page 7: Artificial Intelligence - UCSB Computer Science

(Slide 41 of Lecture 4) How to choose the step sizes / learning rates?

• In theory:

– Gradient descent: $1/L$, where $L$ is the Lipschitz constant of the function we minimize.

– SGD: any step-size sequence with $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$

• e.g., $\eta_t \in [1/t,\ 1/\sqrt{t})$

• In practice:
– Use cross-validation on a subsample of the data.
– A fixed learning rate for SGD is usually fine.
– If it diverges, decrease the learning rate.
– If it still diverges even for an extremely small learning rate, check whether your gradient implementation is correct (see the sketch below).
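A standard way to run that last check is a finite-difference comparison. This is a generic sketch, not the course's prescribed procedure; it works with any `loss_and_grad(Theta)` closure returning a (loss, gradient) pair:

```python
import numpy as np

def grad_check(loss_and_grad, Theta, i, j, eps=1e-6):
    """Compare the analytic gradient entry (i, j) against a central difference."""
    E = np.zeros_like(Theta)
    E[i, j] = eps
    loss_plus, _ = loss_and_grad(Theta + E)
    loss_minus, _ = loss_and_grad(Theta - E)
    numeric = (loss_plus - loss_minus) / (2 * eps)
    _, grad = loss_and_grad(Theta)
    return numeric, grad[i, j]    # these should agree to several digits
```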

7


Page 8: Artificial Intelligence - UCSB Computer Science

Learning rate η choices in GD and SGD

8

[Figure: three 1-D gradient descent runs: an OK-ish learning rate, a learning rate that is too small, and a learning rate that is too large.]

http://d2l.ai/chapter_optimization/gd.html#gradient-descent-in-one-dimension

1D-example from D2L book, Chapter 11.

Page 9: Artificial Intelligence - UCSB Computer Science

Recap of the last lecture

• Probability notations
– Distinguish between events and random variables; apply the rules of probability

• Representing a joint distribution
– Number of parameters exponential in the number of variables
– Calculating marginals and conditionals from the joint distribution

• Conditional independences and factorization of joint distributions
– Saves parameters, often exponential improvements

9

Page 10: Artificial Intelligence - UCSB Computer Science

10

Tradeoffs in our model choices

Expressiveness

Space / computation efficiency

Fully independent: P(X1, X2, …, XN) = P(X1) P(X2) … P(XN). Space/computation: $O(N)$.

Fully general: P(X1, X2, …, XN). Space/computation: $O(e^N)$.

Idea:
1. Independent groups of variables?
2. Conditional independences?

Page 11: Artificial Intelligence - UCSB Computer Science

11

Benefit of conditional independence

• If some variables are conditionally independent, the joint probability can be specified with many fewer than $2^N - 1$ numbers (or $3^N - 1$, or $10^N - 1$, or …)

• For example (for binary variables W, X, Y, Z):
– P(W,X,Y,Z) = P(W) P(X|W) P(Y|W,X) P(Z|W,X,Y)
• 1 + 2 + 4 + 8 = 15 numbers to specify
– But if Y and W are independent given X, and Z is independent of W and X given Y, then
• P(W,X,Y,Z) = P(W) P(X|W) P(Y|X) P(Z|Y)
– 1 + 2 + 2 + 2 = 7 numbers

• This is often the case in real problems.
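The counting argument can be checked in a few lines (a sketch: each CPT over binary variables with $p$ parents needs $2^p$ numbers):

```python
# Number of parents of W, X, Y, Z in each factorization (binary variables).
full_chain = [0, 1, 2, 3]    # P(W) P(X|W) P(Y|W,X) P(Z|W,X,Y)
with_ci    = [0, 1, 1, 1]    # P(W) P(X|W) P(Y|X)   P(Z|Y)
print(sum(2 ** p for p in full_chain))  # 15
print(sum(2 ** p for p in with_ci))     # 7
```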

Page 12: Artificial Intelligence - UCSB Computer Science

Today

• Bayesian Networks

• Examples

• d-separation, Bayes Ball algorithm

• Examples

12

Page 13: Artificial Intelligence - UCSB Computer Science

Graphical models come out of the marriage of graph theory and probability theory

13

Directed graph => Bayesian networks / belief networks
Undirected graph => Markov random fields

Page 14: Artificial Intelligence - UCSB Computer Science

Used as a modeling tool. Many applications!

14

Reasoning under uncertainty!

© Eric Xing @ CMU, 2005-2014

Speech recognition

Information retrieval

Computer vision

Robotic control

Planning

Games

Evolution

Pedigree

7

(Slides from Prof. Eric Xing)

Page 15: Artificial Intelligence - UCSB Computer Science

Two ways to think about Graphical Models

• A particular factorization of a joint distribution

– P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)

• A collection of conditional independences
– { X ⟂ Z | Y, … }

15

Represented using a graph!

Page 16: Artificial Intelligence - UCSB Computer Science

16

Belief Networks

a.k.a. Probabilistic networks, Belief nets, Bayes nets, etc.

• Belief network

– A data structure (depicted as a graph) that represents the dependence among variables and allows us to concisely specify the joint probability distribution

– The graph itself is known as an “influence diagram”

• A belief network is a directed acyclic graph where:
– The nodes represent the set of random variables (one node per random variable)
– Arcs between nodes represent influence, or causality

• A link from node X to node Y means that X “directly influences” Y

– Each node has a conditional probability table (CPT) that defines P(node | parents)

Page 17: Artificial Intelligence - UCSB Computer Science

17

Example

• Random variables X and Y
– X: it is raining
– Y: the grass is wet

• X has an effect on Y; or, Y is a symptom of X

• Draw two nodes and link them

• Define the CPT for each node
− P(X) and P(Y | X)

• Typical use: we observe Y and we want to query P(X | Y)
− Y is an evidence variable

− X is a query variable

[Figure: node X (with CPT P(X)) → node Y (with CPT P(Y|X)).]

Page 18: Artificial Intelligence - UCSB Computer Science

18

We can write everything we want as a function of the CPTs. Try it!

• What is P(X | Y)?

– Given that we know the CPTs of each node in the graph:

$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)} = \frac{P(Y \mid X)\, P(X)}{\sum_X P(X, Y)} = \frac{P(Y \mid X)\, P(X)}{\sum_X P(Y \mid X)\, P(X)}$$

Page 19: Artificial Intelligence - UCSB Computer Science

19

Belief nets represent the joint probability

• The joint probability function can be calculated directly from the network
– It’s the product of the CPTs of all the nodes
– $P(\mathrm{var}_1, \ldots, \mathrm{var}_N) = \prod_i P(\mathrm{var}_i \mid \mathrm{Parents}(\mathrm{var}_i))$

[Figure: X → Y with CPTs P(X), P(Y|X), so P(X,Y) = P(X) P(Y|X). Figure: X → Z ← Y with CPTs P(X), P(Y), P(Z|X,Y), so P(X,Y,Z) = P(X) P(Y) P(Z|X,Y).]
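A minimal sketch of “the joint is the product of the CPTs” for the two-node network, with illustrative numbers (not from the slides):

```python
P_X = {True: 0.2, False: 0.8}                    # CPT for P(X)
P_Y_given_X = {True: {True: 0.9, False: 0.1},    # CPT for P(Y|X); outer key is X
               False: {True: 0.3, False: 0.7}}

def joint(x, y):
    """P(X=x, Y=y) = P(X=x) * P(Y=y | X=x)."""
    return P_X[x] * P_Y_given_X[x][y]

print(joint(True, True))   # 0.2 * 0.9 = 0.18
```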

Page 20: Artificial Intelligence - UCSB Computer Science

Three steps in modelling with Belief Networks

1. Choose variables in the environment and represent them as nodes.

2. Connect the variables by inspecting “direct influence”: cause and effect.

3. Fill in the probabilities in the CPTs.

20

Page 21: Artificial Intelligence - UCSB Computer Science

21

Example: Modelling with Belief Net

I’m at work and my neighbor John called to say my home alarm is ringing, but my neighbor Mary didn’t call. The alarm is sometimes triggered by minor earthquakes. Was there a burglar at my house?

• Random (boolean) variables:
– JohnCalls, MaryCalls, Earthquake, Burglar, Alarm

• The belief net shows the causal links
• This defines the joint probability

– P(JohnCalls, MaryCalls, Earthquake, Burglar, Alarm)

• What do we want to know? P(B | J, ¬M)

Page 22: Artificial Intelligence - UCSB Computer Science

22

How should we connect the nodes? (3 min discussion)

Links and CPTs?

Page 23: Artificial Intelligence - UCSB Computer Science

23

What are the CPTs? What are their dimensions?

[Figure: the burglary network B → A ← E, A → J, A → M, with empty CPTs to be filled in: P(B), P(E), P(A | B, E), P(J | A), P(M | A).]

Question: how do we fill in the values of these CPTs?
Answer: specify them by hand, or learn them from data (e.g., via MLE).

Page 24: Artificial Intelligence - UCSB Computer Science

24

Example

Joint probability? P(J, ¬M, A, B, ¬E)?

Page 25: Artificial Intelligence - UCSB Computer Science

25

Calculate P(J, ¬M, A, B, ¬E)

Read the joint probability from the graph:
P(J, M, A, B, E) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)

Plug in the desired values:
P(J, ¬M, A, B, ¬E) = P(B) P(¬E) P(A|B,¬E) P(J|A) P(¬M|A)
= 0.001 × 0.998 × 0.94 × 0.9 × 0.3 = 0.0002532924
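A one-line check of the arithmetic (values as given on the slide):

```python
p = 0.001 * 0.998 * 0.94 * 0.9 * 0.3   # P(B) P(¬E) P(A|B,¬E) P(J|A) P(¬M|A)
print(p)                                # ≈ 0.0002532924
```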

How about P(B | J, ¬M)?

Remember, this means P(B=true | J=true, M=false)

Page 26: Artificial Intelligence - UCSB Computer Science

26

Calculate P(B | J, ¬M)

By marginalization:

$$P(B \mid J, \neg M) = \frac{P(B, J, \neg M)}{P(J, \neg M)} = \frac{\sum_i \sum_j P(J, \neg M, A_i, B, E_j)}{\sum_i \sum_j \sum_k P(J, \neg M, A_i, B_k, E_j)} = \frac{\sum_i \sum_j P(B)\, P(E_j)\, P(A_i \mid B, E_j)\, P(J \mid A_i)\, P(\neg M \mid A_i)}{\sum_i \sum_j \sum_k P(B_k)\, P(E_j)\, P(A_i \mid B_k, E_j)\, P(J \mid A_i)\, P(\neg M \mid A_i)},$$

where $i$, $j$, $k$ range over the values of the unobserved variables $A$, $E$, and $B$ respectively.
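The same computation by brute-force enumeration, as a sketch; the CPT values below are the standard AIMA burglary-network numbers, consistent with the 0.998, 0.94, 0.9, and 0.3 used on the previous slide:

```python
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | A)

def bern(p_true, v):
    """P(Var = v) given P(Var = true) = p_true."""
    return p_true if v else 1 - p_true

def joint(b, e, a, j, m):
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

num = sum(joint(True, e, a, True, False) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, False) for b, e, a in product([True, False], repeat=3))
print(num / den)   # ≈ 0.0051: a burglary is still unlikely given J and ¬M
```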

Page 27: Artificial Intelligence - UCSB Computer Science

Quick checkpoint

• Belief Net as a modelling tool

• By inspecting the cause-effect relationships, we can draw directed edges based on our domain knowledge

• The product of the CPTs gives the joint distribution
– We can calculate P(A | B) for any A and B
– The factorization makes it computationally more tractable

27

What else can we get?

Page 28: Artificial Intelligence - UCSB Computer Science

28

Example: Conditional Independence

• Conditional independence is seen here
– P(JohnCalls | MaryCalls, Alarm, Earthquake, Burglary) = P(JohnCalls | Alarm)
– So JohnCalls is independent of MaryCalls, Earthquake, and Burglary, given Alarm
• Does this mean that an earthquake or a burglary does not influence whether or not John calls?
– No, but the influence is already accounted for in the Alarm variable
– JohnCalls is conditionally independent of Earthquake, but not absolutely independent of it

*This conclusion holds regardless of the values in the CPTs!

Page 29: Artificial Intelligence - UCSB Computer Science

Question

If X and Y are independent, are they therefore independent given any variable(s)?

I.e., if P(X, Y) = P(X) P(Y) [ i.e., if P(X|Y) = P(X) ], can we conclude that P(X | Y, Z) = P(X | Z)?

Page 30: Artificial Intelligence - UCSB Computer Science

Question

If X and Y are independent, are they therefore independent given any variable(s)?

I.e., if P(X, Y) = P(X) P(Y) [ i.e., if P(X|Y) = P(X) ], can we conclude that P(X | Y, Z) = P(X | Z)?

The answer is no, and here’s a counterexample:

[Figure: X → Z ← Y, where X = weight of person A, Y = weight of person B, Z = their combined weight.]

P(X | Y) = P(X), but P(X | Y, Z) ≠ P(X | Z)

30

Note: Even though Z is a deterministic function of X and Y, it is still a random variable with a probability distribution

*Again: this conclusion holds regardless of the values in the CPTs!
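The same phenomenon can be checked numerically with coin flips in place of weights (an illustrative simplification): X and Y are independent fair bits and Z = X + Y.

```python
from itertools import product

# Joint distribution over (x, y, z) with z = x + y; each (x, y) has mass 1/4.
table = {(x, y, x + y): 0.25 for x, y in product([0, 1], repeat=2)}

def p(pred):
    """Total probability of the event pred(x, y, z)."""
    return sum(v for k, v in table.items() if pred(*k))

# Marginally independent: P(X=1 | Y=0) equals P(X=1).
print(p(lambda x, y, z: x == 1 and y == 0) / p(lambda x, y, z: y == 0))  # 0.5
print(p(lambda x, y, z: x == 1))                                          # 0.5
# Conditioning on Z couples them: P(X=1 | Y=0, Z=1) = 1.0, yet P(X=1 | Z=1) = 0.5.
print(p(lambda x, y, z: x == 1 and y == 0 and z == 1)
      / p(lambda x, y, z: y == 0 and z == 1))                             # 1.0
```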

Page 31: Artificial Intelligence - UCSB Computer Science

Big question: is there a general way to answer questions about conditional independence just by inspecting the graph?

• Turns out the answer is “Yes!”

31

Directed Acyclic Graph

The set of conditional independences that hold for all CPTs

Factorization of Joint Distribution

By definition

By traversing the DAG

By d-separation (Bayes Ball alg.)

Page 32: Artificial Intelligence - UCSB Computer Science

Intuition: the graph and its edges control the information flow; if there is no path along which information can flow from one node to another, we say these two nodes are independent.

[Figure: a six-node example DAG over X1, …, X6, with some nodes shaded.]

From Chapter 2 of the Jordan book, “Introduction to Probabilistic Graphical Models”.

“Shading” denotes “observing” or “conditioning on” that variable.

Page 33: Artificial Intelligence - UCSB Computer Science

d-separation in three canonical graphs

33

[Figure: the three canonical three-node graphs: (a) chain X → Y → Z; (b) fork X ← Y → Z; (c) collider X → Y ← Z, illustrated by aliens → late ← watch.]

For the chain and the fork: X ⟂ Z | Y. “X and Z are d-separated by the observation of Y.”

For the collider: X ⟂ Z. “X and Z are d-separated by NOT observing Y nor any descendants of Y.”
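For the chain, the independence can be verified algebraically from the factorization (a worked step, not on the slide):

$$P(X, Z \mid Y) = \frac{P(X)\, P(Y \mid X)\, P(Z \mid Y)}{P(Y)} = \frac{P(X, Y)}{P(Y)}\, P(Z \mid Y) = P(X \mid Y)\, P(Z \mid Y).$$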

Page 34: Artificial Intelligence - UCSB Computer Science

34

Examples

R → G → W: Rain → Wet Grass → Worms

P(W | R, G) = P(W | G)

T ← F → C: Tired ← Flu → Cough

P(T | C, F) = P(T | F)

W → M ← I: Work → Money ← Inherit

P(W | I, M) ≠ P(W | M)

P(W | I) = P(W)

Page 35: Artificial Intelligence - UCSB Computer Science

The Bayes Ball algorithm

• Let X, Y, Z be “groups” of nodes / sets / subgraphs.

• Shade the nodes in Y
• Place a “ball” at each node in X
• Bounce the balls around the graph according to rules

• If no ball reaches any node in Z, then declare

X ⟂ Z | Y

35
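A compact sketch of this procedure (the “reachable” algorithm from Koller & Friedman’s PGM book, which implements the Bayes Ball bounce rules; the variable names are mine, not from the slides), run on the burglary network:

```python
def d_separated(parents, x, z_set, y):
    """True iff x and y are d-separated given the observed set z_set."""
    children = {n: set() for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].add(n)
    # Observed nodes and their ancestors (needed for the collider rule).
    anc, stack = set(), list(z_set)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])
    # Search over (node, direction): 'up' = arrived from a child,
    # 'down' = arrived from a parent.
    visited, frontier, reachable = set(), [(x, 'up')], set()
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in z_set:
            reachable.add(node)
        if direction == 'up' and node not in z_set:
            frontier += [(p, 'up') for p in parents[node]]
            frontier += [(c, 'down') for c in children[node]]
        elif direction == 'down':
            if node not in z_set:                  # chain/fork: pass through
                frontier += [(c, 'down') for c in children[node]]
            if node in anc:                        # collider with observed descendant
                frontier += [(p, 'up') for p in parents[node]]
    return y not in reachable

parents = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
print(d_separated(parents, 'B', set(), 'E'))   # True: independent a priori
print(d_separated(parents, 'B', {'A'}, 'E'))   # False: observing A couples B and E
print(d_separated(parents, 'J', {'A'}, 'M'))   # True: J and M independent given A
```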

Page 36: Artificial Intelligence - UCSB Computer Science

The Ten Rules of Bayes Ball Algorithm

36

Page 37: Artificial Intelligence - UCSB Computer Science

37

Examples

Example 1 (fork: X ← Z → Y): X – wet grass, Y – rainbow, Z – rain.
P(X, Y) ≠ P(X) P(Y)
P(X | Y, Z) = P(X | Z)

Example 2 (collider: X → Z ← Y, with descendant W): X – rain, Y – sprinkler, Z – wet grass, W – worms.
P(X, Y) = P(X) P(Y)
P(X | Y, Z) ≠ P(X | Z)
P(X | Y, W) ≠ P(X | W)

Are X and Y independent? Are X and Y conditionally independent given…?

Page 38: Artificial Intelligence - UCSB Computer Science

38

Examples

[Figure: two variants of a belief net over X – rain, Y – sprinkler, Z – rainbow, W – wet grass.]

First network: yes, P(X, Y) = P(X) P(Y); yes, P(X | Y, Z) = P(X | Z).
Second network: no, P(X, Y) ≠ P(X) P(Y); no, P(X | Y, Z) ≠ P(X | Z).

Are X and Y independent?

Are X and Y conditionally independent given Z?

Page 39: Artificial Intelligence - UCSB Computer Science

39

Conditional Independence

• Where are conditional independences here?

[Figure: a belief net over Battery, Radio, Ignition, Gas, Starts, Moves.]

Radio and Ignition, given Battery? Yes
Radio and Starts, given Ignition? Yes
Gas and Radio, given Battery? Yes
Gas and Radio, given Starts? No
Gas and Radio, given nil? Yes
Gas and Battery, given Moves? No (why?)

Page 40: Artificial Intelligence - UCSB Computer Science

Summary of today

• Encode knowledge / structures using a DAG

• How to check conditional independences algebraically from the factorization

• How to read off conditional independences from a DAG
– d-separation, Bayes Ball algorithm

(More examples in supplementary slides, and in the discussion: Hidden Markov Models, AIMA 15.3)

40

Page 41: Artificial Intelligence - UCSB Computer Science

Additional resources about PGM

• Recommended: Ch.2 Jordan book. AIMA Ch. 13-14.

• More readings: – Koller’s PGM book: https://www.amazon.com/Probabilistic-

Graphical-Models-Daphne-Koller/dp/B007YXTT12– Probabilistic programming: http://probabilistic-

programming.org/wiki/Home

• Software for PGMs, modeling, and inference:
– Stan: https://mc-stan.org/
– JAGS: http://mcmc-jags.sourceforge.net/

41

Page 42: Artificial Intelligence - UCSB Computer Science

Upcoming lectures

• Jan 28: Finish graphical models. Start “Search”
• Jan 30: Search algorithms
• Feb 4: Minimax search and game playing
• Feb 6: Finish “Search” + midterm review. HW2 due.

• Recommended readings on search: – AIMA Ch 3, Ch 5.1-5.3.

42

Page 43: Artificial Intelligence - UCSB Computer Science

-------------- Supplementary Slides --------------

• More examples

• Practice questions

43

Page 44: Artificial Intelligence - UCSB Computer Science

44

Representing any inference as CPTs

[Figure: X (Raining) → Y (Wet grass), with CPTs P(X) and P(Y|X); ASK P(X|Y).]

[Figure: X (Rained) → Y (Wet grass) → Z (Worm sighting), with CPTs P(X), P(Y|X), and P(Z|Y); ASK P(X|Z).]

Page 45: Artificial Intelligence - UCSB Computer Science

45

Quiz

1. What is the joint probability distribution of the random variables described by this belief net?
– I.e., what is P(U, V, W, X, Y, Z)?

2. Variables W and X are
a) Independent
b) Independent given U
c) Independent given Y
(choose one)

3. If you know the CPTs, is it possible to compute P(Z | U)? P(U | Z)?

[Figure: a belief net with edges U → W, U → X, V → X, W → Y, X → Y, X → Z.]

Page 46: Artificial Intelligence - UCSB Computer Science

46

Review

[Figure: the same belief net, annotated with CPTs P(U), P(V), P(W|U), P(X|U,V), P(Y|W,X), P(Z|X).]

Given this Bayesian network:
1. What are the CPTs?
2. What is the joint probability distribution of all the variables?
3. How would we calculate P(X | W, Y, Z)?

P(U,V,W,X,Y,Z) = product of the CPTs

= P(U) P(V) P(W|U) P(X|U,V) P(Y|W,X) P(Z|X)

Page 47: Artificial Intelligence - UCSB Computer Science

47

Example: Flu and measles

[Figure: Flu → Fever ← Measles and Measles → Spots, with CPTs P(Flu), P(Measles), P(Fever | Flu, Measles), P(Spots | Measles).]

To create the belief net:
• Choose variables (evidence and query)
• Choose an ordering and create links (direct influences)
• Fill in probabilities (CPTs)

Page 48: Artificial Intelligence - UCSB Computer Science

48

Example: Flu and measles

[Figure: the same network: Flu → Fever ← Measles, Measles → Spots.]

CPTs:
P(Flu) = 0.01, P(Measles) = 0.001
P(Spots | Measles) = [0, 0.9]
P(Fever | Flu, Measles) = [0.01, 0.8, 0.9, 1.0]

Compute P(Flu | Fever) and P(Flu | Fever, Spots). Are they equivalent?
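A sketch answering the question by enumeration, assuming the CPT vectors are listed in the order (¬Flu, ¬Measles), (¬Flu, Measles), (Flu, ¬Measles), (Flu, Measles) (an assumption about the slide's layout):

```python
from itertools import product

P_flu, P_measles = 0.01, 0.001
P_spots = {False: 0.0, True: 0.9}                    # P(Spots=true | Measles)
P_fever = {(False, False): 0.01, (False, True): 0.8,
           (True, False): 0.9, (True, True): 1.0}    # P(Fever=true | Flu, Measles)

def bern(p_true, v):
    return p_true if v else 1 - p_true

def joint(flu, m, fever, spots):
    return (bern(P_flu, flu) * bern(P_measles, m)
            * bern(P_fever[(flu, m)], fever) * bern(P_spots[m], spots))

vals = [True, False]
p1 = (sum(joint(True, m, True, s) for m, s in product(vals, repeat=2))
      / sum(joint(f, m, True, s) for f, m, s in product(vals, repeat=3)))
p2 = (sum(joint(True, m, True, True) for m in vals)
      / sum(joint(f, m, True, True) for f, m in product(vals, repeat=2)))
print(p1, p2)   # not equivalent: Spots points to Measles, which explains away Flu
```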

Page 49: Artificial Intelligence - UCSB Computer Science

Practical uses of Belief Nets

• Uses of belief nets:
– Calculating probabilities of causes, given symptoms
– Calculating probabilities of symptoms, given multiple causes
– Calculating probabilities to be combined with an agent’s utility function
– Explaining the results of probabilistic inference to the user
– Deciding which additional evidence variables should be observed in order to gain useful information
• Is it worth the extra cost to observe additional evidence?
– P(X | Y) vs. P(X | Y, Z)
– Performing sensitivity analysis to understand which aspects of the model affect the queries the most
• How accurate must various evidence variables be? How will input errors affect the reasoning?

49