
    6.867 Lecture Notes: Section 1: Introduction

    Reading

Lecture 1: Murphy: Chapters 1 (Introduction), 2 (Good probability review; pay attention to multinomial, Gaussian, Beta, and Dirichlet distributions)

    Contents

1 Intro

2 Problem class
   2.1 Supervised learning
       2.1.1 Classification
       2.1.2 Regression
   2.2 Unsupervised learning
       2.2.1 Density estimation
       2.2.2 Clustering
       2.2.3 Dimensionality reduction
   2.3 Reinforcement learning
   2.4 Other settings

3 Assumptions

4 Evaluation criteria

5 Model type
   5.1 No model
   5.2 Prediction rule
   5.3 Probabilistic model
       5.3.1 Fitting a probabilistic model
       5.3.2 Decision theoretic prediction
       5.3.3 Benefits of using a probabilistic model
   5.4 Distribution over models

6 Model class and parameter fitting
   6.1 Fitting maximum-likelihood probabilistic models
       6.1.1 Gaussian with fixed variance
       6.1.2 Gaussian
       6.1.3 Uniform over a continuous interval
       6.1.4 Uniform over a finite set of points
       6.1.5 Mixture of Gaussians
       6.1.6 Maximum likelihood and binary data
   6.2 Model selection over classes of probabilistic models
       6.2.1 Bias-Variance decomposition
       6.2.2 Regularization and model selection
   6.3 Bayesian inference from prior to posterior over models
       6.3.1 Bernoulli/Bernoulli
       6.3.2 Beta/Binomial
       6.3.3 Gaussian with fixed variance
       6.3.4 Finding conjugate families
   6.4 Bayesian model selection

7 Algorithm

8 Recap

MIT 6.867 Fall 2014

    2 Problem class

There are many different problem classes in machine learning. They vary according to what kind of data is provided and what kind of conclusions are to be drawn from it. We will go through several of the standard problem classes here, and establish some notation and terminology.

    2.1 Supervised learning

    2.1.1 Classification

Training data D is in the form of a set of pairs {(x(1), y(1)), . . . , (x(n), y(n))} where x(i) represents an object to be classified, most typically a D-dimensional vector of real and/or discrete values, and y(i) is an element of a discrete set of values. The y values are sometimes called target values.

(Our textbook uses xi and yi instead of x(i) and y(i). We find that notation somewhat difficult to manage when x(i) is itself a vector and we need to talk about its elements. The notation we are using is standard in some other parts of the machine-learning literature.)

A classification problem is binary or two-class if y(i) is drawn from a set of two possible values; otherwise, it is called multi-class.

The goal in a classification problem is ultimately, given a new input value x(n+1), to predict the value of y(n+1).

Classification problems are a kind of supervised learning, because the desired output (or class) y(i) is specified for each of the training examples x(i).

We typically assume that the elements of D are independent and identically distributed according to some distribution Pr(X, Y).
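To make the notation concrete, here is a tiny sketch in Python of a training set D of (x, y) pairs and a prediction rule; the data and the choice of a nearest-neighbor predictor are invented for illustration only:

```python
import math

# Hypothetical training set D: pairs (x, y) with x a 2-dimensional real
# vector and y a discrete class label, as in the text.
D = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.1), 1), ((0.9, 1.0), 1)]

def predict(x_new):
    """Predict y(n+1) for a new input x(n+1) by 1-nearest neighbor."""
    nearest = min(D, key=lambda pair: math.dist(pair[0], x_new))
    return nearest[1]

print(predict((0.95, 1.05)))  # -> 1: the closest training points are class 1
```

This is only one of many possible prediction rules; later sections discuss more principled, model-based alternatives.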

    2.1.2 Regression

Regression is like classification, except that y(i) ∈ R^k.

    2.2 Unsupervised learning

    2.2.1 Density estimation

Given samples y(1), . . . , y(n) ∈ R^D drawn IID from some distribution Pr(Y), the goal is to predict the probability Pr(y(n+1)) of an element drawn from the same distribution. Density estimation often plays a role as a subroutine in the overall learning method for supervised learning, as well.

This is a type of unsupervised learning, because it doesn't involve learning a function from inputs to outputs based on a set of input-output pairs.
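As a sketch of what a density estimator looks like in code, here is a maximum-likelihood fit of a one-dimensional Gaussian; the model choice and the sample values are illustrative assumptions, not the only option:

```python
import math

# Hypothetical 1-D samples y(1), ..., y(n), assumed drawn IID from Pr(Y).
samples = [2.1, 1.9, 2.4, 2.0, 1.6]

# Maximum-likelihood Gaussian fit: sample mean and (biased) sample variance.
n = len(samples)
mu = sum(samples) / n
var = sum((y - mu) ** 2 for y in samples) / n

def density(y_new):
    """Estimated density Pr(y(n+1)) of a new point under the fitted Gaussian."""
    return math.exp(-(y_new - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(density(2.0))  # a point near the sample mean gets high estimated density
```

Section 6.1 derives why the sample mean and variance are the maximum-likelihood parameters for this model class.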

    2.2.2 Clustering

Given samples x(1), . . . , x(n) ∈ R^D, the goal is to find a partition of the samples that groups together samples that are similar. There are many different objectives, depending on the definition of the similarity between samples and exactly what criterion is to be used (e.g., minimize the average distance between elements inside a cluster and maximize the average distance between elements across clusters). Other methods perform a soft clustering, in which samples may be assigned 0.9 membership in one cluster and 0.1 in another. Clustering is sometimes used as a step in density estimation, and sometimes to find useful structure in data. This is also unsupervised learning.
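One standard way to locally optimize a hard-clustering objective of this kind is k-means, which alternates the two steps sketched below; the 1-D data, k = 2, and the crude initialization are made up for illustration:

```python
# Minimal k-means sketch on 1-D data; the data and k = 2 are illustrative only.
samples = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]
centers = [samples[0], samples[3]]  # crude initialization: two seed points

for _ in range(10):  # a few alternating assignment/update iterations
    # Assignment step: each sample joins the cluster with the nearest center.
    clusters = [[], []]
    for y in samples:
        k = min(range(2), key=lambda j: abs(y - centers[j]))
        clusters[k].append(y)
    # Update step: each center moves to the mean of its assigned samples.
    centers = [sum(c) / len(c) for c in clusters]

print(centers)  # the centers settle near 0.1 and 5.03, one per visible group
```

A real implementation would handle empty clusters and run to convergence rather than for a fixed number of iterations.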


    2.2.3 Dimensionality reduction

Given samples x(1), . . . , x(n) ∈ R^D, the problem is to re-represent them as points in a d-dimensional space, where d < D. The goal is typically to retain information in the data set that will, e.g., allow elements of one class to be discriminated from another.

Standard dimensionality reduction is particularly useful for visualizing or understanding high-dimensional data. If the goal is ultimately to perform regression or classification on the data after the dimensionality is reduced, it is usually best to articulate an objective for the overall prediction problem rather than to first do dimensionality reduction without knowing which dimensions will be important for the prediction task.
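A common instance of this problem is principal component analysis: project the centered samples onto the direction of maximal variance. Here is a hand-rolled sketch for D = 2, d = 1, using the closed-form eigendecomposition of a symmetric 2x2 covariance matrix; the data are invented:

```python
import math

# Hypothetical 2-D samples lying roughly along a line; we reduce D=2 to d=1
# by projecting onto the direction of maximal variance.
X = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]

n = len(X)
mx = sum(x for x, _ in X) / n
my = sum(y for _, y in X) / n

# Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
a = sum((x - mx) ** 2 for x, _ in X) / n
b = sum((x - mx) * (y - my) for x, y in X) / n
c = sum((y - my) ** 2 for _, y in X) / n

# Leading eigenvalue/eigenvector of a symmetric 2x2 matrix, in closed form.
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
vx, vy = b, lam - a                # eigenvector (unnormalized); assumes b != 0
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# The d=1 representation: signed coordinate of each sample along that direction.
Z = [(x - mx) * vx + (y - my) * vy for x, y in X]
print(Z)
```

The variance of the 1-D coordinates equals the leading eigenvalue, which is what "retaining information" means in the PCA objective.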

    2.3 Reinforcement learning

In reinforcement learning, the goal is to learn a mapping from input values x to output values y, but without a direct supervision signal to specify which output values y are best for a particular input. There is no training set specified a priori. Instead, the learning problem is framed as an agent interacting with an environment, in the following setting:

  • The agent observes the current state, x(0).

  • It selects an action, y(0).

  • It receives a reward, r(0), which depends on x(0) and possibly y(0).

  • The environment transitions probabilistically to a new state, x(1), with a distribution that depends only on x(0) and y(0).

  • The agent observes the current state, x(1).

  • . . .

The goal is to find a policy π, mapping x to y (that is, states to actions), such that some long-term sum or average of rewards r is maximized.

This setting is very different from either supervised learning or unsupervised learning, because the agent's action choices affect both its reward and its ability to observe the environment. It requires careful consideration of the long-term effects of actions, as well as all of the other issues that pertain to supervised learning.
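The interaction loop above can be sketched directly. Everything here, the two-state environment, the reward function, and the (unlearned, uniformly random) policy, is an invented stand-in; a reinforcement-learning algorithm would improve the policy using the observed rewards:

```python
import random

random.seed(0)

# Hypothetical environment: states 0/1, actions 0/1.
def reward(x, y):
    return 1.0 if x == y else 0.0      # r(t) depends on x(t) and y(t)

def transition(x, y):
    # Next state depends stochastically only on the current state and action.
    return y if random.random() < 0.9 else 1 - y

def policy(x):
    return random.choice([0, 1])       # a (poor) policy pi: state -> action

x = 0                                  # the agent observes x(0)
total = 0.0
for t in range(100):
    y = policy(x)                      # it selects an action y(t)
    total += reward(x, y)              # it receives a reward r(t)
    x = transition(x, y)               # the environment moves to x(t+1)

print("average reward:", total / 100)  # the quantity a learner would maximize
```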

    2.4 Other settings

There are many other problem settings. Here are a few.

In semi-supervised learning, we have a supervised-learning training set, but there may be an additional set of x(i) values with no known y(i). These values can still be used to improve learning performance if they are drawn from the Pr(X) that is the marginal of the Pr(X, Y) that governs the rest of the data set.

In active learning, it is assumed to be expensive to acquire a label y(i) (imagine asking a human to read an x-ray image), so the learning algorithm can sequentially ask for particular inputs x(i) to be labeled, and must carefully select queries in order to learn as effectively as possible while minimizing the cost of labeling.

In transfer learning, there are multiple tasks, with data drawn from different, but related, distributions. The goal is for experience with previous tasks to apply to learning a current task in a way that requires decreased experience with the new task.


We will explore the trade-offs between using generative and discriminative models in more detail in later sections. In general, we find that generative models can be learned somewhat more efficiently and are highly effective if their modeling assumptions match the process that generated the data. Discriminative models generally give the best performance on the prediction task in the limit of a large amount of training data, because they focus their modeling effort based on the problem of making the prediction and not on modeling the distribution over possible query points x(n+1).

We can't take this approach any further until we consider the particular class of models we are going to fit. We'll do some examples in section 6.

    5.3.2 Decision theoretic prediction

Now, imagine that you have found some parameters θ, so that you have a probability distribution Pr(Y | X; θ) or Pr(X, Y; θ). How can you combine this with a loss function to make predictions?

The optimal prediction g*, in terms of minimizing expected loss, for our simple prediction problem, is

    g* = argmin_g ∫_y Pr(y; θ) L(g, y) dy .

For a regression or classification problem with a conditional probabilistic model, it would be a function of a new input x:

    g*(x) = argmin_g ∫_y Pr(y | x; θ) L(g, y) dy .

In general, we have to know both the form of the probabilistic model Pr(y | x; θ) and the loss function L(g, a). But we can get some insight into the optimal decision problem in the simple but very prevalent special case of squared error loss. In this case, for the simple prediction problem, we have:

    g* = argmin_g ∫_y Pr(y; θ) (g - y)^2 dy .

We can optimize this by taking the derivative of the risk with respect to g, setting it to zero, and solving for g. In this case, we have

    d/dg ∫_y Pr(y; θ) (g - y)^2 dy = 0
    ∫_y Pr(y; θ) 2(g - y) dy = 0
    ∫_y Pr(y; θ) g dy = ∫_y Pr(y; θ) y dy
    g = E[y; θ]

Of course, for different choices of loss function, the risk-minimizing prediction will have a different form.
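The conclusion g = E[y; θ] for squared error loss is easy to check numerically: the estimated risk of predicting the mean is lower than that of any shifted prediction. A Monte Carlo sketch, with an arbitrary Gaussian standing in for Pr(y; θ):

```python
import random

random.seed(1)

# Samples from an illustrative model Pr(y; theta): Gaussian, mean 3, std 2.
ys = [random.gauss(3.0, 2.0) for _ in range(100000)]

def risk(g):
    """Monte Carlo estimate of the expected squared error loss of predicting g."""
    return sum((g - y) ** 2 for y in ys) / len(ys)

mean = sum(ys) / len(ys)
print(risk(mean), risk(mean + 0.5), risk(mean - 0.5))
# The risk at the mean is lower than at either perturbed prediction;
# in fact risk(mean + d) = risk(mean) + d^2 exactly for this sample.
```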

    5.3.3 Benefits of using a probabilistic model

This approach allows us to separate the probability model from the loss function; so we could learn one model and use it in different decision-making contexts. Models that make probabilistic predictions can be easily combined using probabilistic inference: it is much harder to know how to combine multiple direct prediction rules.


    5.4 Distribution over models

(There are a lot of different methods and quantities within machine learning that are called "Bayesian foo". Sometimes people just use it to mean that there is an application of Bayes' rule going on somewhere. We will try to use it to refer to keeping distributions over models or hypotheses.)

In the distribution over models approach, which we'll often call the Bayesian approach, we treat the model parameters θ as random variables, and use probabilistic inference to update a distribution over θ based on the observed data D. We:

1. Begin with a prior distribution over possible probabilistic models of the data, Pr(θ);

2. Perform Bayesian inference to combine the data with the prior distribution to find a posterior distribution over models of the data, Pr(θ | D);

3. (Typically) integrate out the posterior distribution over models in the context of decision-theoretic reasoning, using a loss function to select the best prediction.

In the Bayesian approach, we do not have a single model that we will use to make predictions: instead, we will integrate over possible models to select the minimum expected loss prediction. For the simple prediction problem this is:

    g* = argmin_g ∫_θ ∫_y Pr(y | θ) Pr(θ | D) L(g, y) dy dθ .

This integral can sometimes be difficult to evaluate. We can evaluate it approximately, using sampling techniques.

The Bayesian approach lets us maintain an explicit representation of our uncertainty about the model and take it into account when making predictions. Imagine a situation in which the most likely model would make a prediction g1, but all of the other models, which are not that much less likely, would predict g2: in such situations g2 is the choice that is more likely to be correct. Or, it might be that there is a prediction that is not the best, but has relatively small loss in all the models: that might be the choice that minimizes risk, even though it is not the best choice in any of the models.
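The first situation described above can be made concrete with a toy discrete posterior over three hypothetical models (all names and numbers invented): the single most likely model predicts g1, but integrating a 0/1 loss over the posterior favors g2:

```python
# Toy posterior over three hypothetical models theta1..theta3.
# Each model deterministically predicts a label; losses are 0/1.
posterior = {"theta1": 0.4, "theta2": 0.3, "theta3": 0.3}
prediction = {"theta1": "g1", "theta2": "g2", "theta3": "g2"}

def expected_loss(g):
    # Risk of predicting g, integrating (here: summing) over the posterior,
    # with 0/1 loss against each model's own prediction.
    return sum(p for m, p in posterior.items() if prediction[m] != g)

best = min(["g1", "g2"], key=expected_loss)
print(best)  # -> g2: the most likely model says g1, but averaging prefers g2
```

Here expected_loss("g1") = 0.6 while expected_loss("g2") = 0.4, so the posterior-averaged decision overrules the single most probable model.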

The Bayesian approach can also provide elegant solutions to the problem of model selection.