Lecture 2: Statistical learning primer for biologists
Alan Qi
Purdue Statistics and CS
Jan. 15, 2009
Outline
• Basics of probability
• Regression
• Graphical models: Bayesian networks and Markov random fields
• Unsupervised learning: K-means and expectation maximization
Probability Theory
• Sum Rule
• Product Rule
The Rules of Probability
• Sum Rule
• Product Rule
Bayes’ Theorem
posterior ∝ likelihood × prior
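In standard notation (with X and Y generic random variables, a labeling assumed here), the two rules and Bayes' theorem read:

p(X) = \sum_Y p(X, Y)   (sum rule)
p(X, Y) = p(Y \mid X)\, p(X)   (product rule)
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_Y p(X \mid Y)\, p(Y)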
Probability Density & Cumulative Distribution Functions
Expectations
Conditional Expectation (discrete)
Approximate Expectation (discrete and continuous)
Variances and Covariances
The Gaussian Distribution
Gaussian Mean and Variance
The Multivariate Gaussian
Gaussian Parameter Estimation
Likelihood function
Maximum (Log) Likelihood
Properties of \mu_{ML} and \sigma^2_{ML}
• \mu_{ML} is unbiased
• \sigma^2_{ML} is biased
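For a univariate Gaussian fit to observations x_1, ..., x_N (notation assumed here), the maximum likelihood estimates and their expectations are:

\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
E[\mu_{ML}] = \mu \ \text{(unbiased)}, \qquad E[\sigma^2_{ML}] = \frac{N-1}{N}\, \sigma^2 \ \text{(biased)}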
Curve Fitting Re-visited
Maximum Likelihood
Determine w_{ML} by minimizing the sum-of-squares error E(w).
Predictive Distribution
MAP: A Step towards Bayes
Determine w_{MAP} by minimizing the regularized sum-of-squares error \widetilde{E}(w).
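For a curve y(x, w) fit to targets t_1, ..., t_N, a common form of these two error functions is (symbols assumed here; λ is the regularization coefficient):

E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
\widetilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \| w \|^2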
Bayesian Curve Fitting
Bayesian Networks
• Directed Acyclic Graph (DAG)
Bayesian Networks
General Factorization
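For a DAG over variables x_1, ..., x_K, the general factorization is (pa_k denotes the parents of node k):

p(x) = \prod_{k=1}^{K} p(x_k \mid pa_k)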
Generative Models
• Causal process for generating images
Discrete Variables (1)
• General joint distribution: K^2 − 1 parameters
• Independent joint distribution: 2(K-1) parameters
Discrete Variables (2)
General joint distribution over M variables: K^M − 1 parameters
M node Markov chain: K-1+(M-1)K(K-1) parameters
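The chain count can be read off the factorization below (a brief justification, assuming a first-order chain): the first node needs K − 1 free probabilities, and each of the remaining M − 1 conditional tables needs K(K − 1).

p(x_1, \ldots, x_M) = p(x_1) \prod_{m=2}^{M} p(x_m \mid x_{m-1})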
Discrete Variables: Bayesian Parameters (1)
Discrete Variables: Bayesian Parameters (2)
Shared prior
Parameterized Conditional Distributions
If x_1, \ldots, x_M are discrete K-state variables, p(y \mid x_1, \ldots, x_M) in general has O(K^M) parameters.
The parameterized form
requires only M + 1 parameters
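One such parameterized form, shown here as a logistic-sigmoid example for binary variables (the specific functional form is an assumption, not taken from the slide), is:

p(y = 1 \mid x_1, \ldots, x_M) = \sigma\left( w_0 + \sum_{i=1}^{M} w_i x_i \right), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}

which has exactly the M + 1 parameters w_0, ..., w_M.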
Conditional Independence
• a is independent of b given c
• Equivalently
• Notation
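In equations (standard notation):

p(a \mid b, c) = p(a \mid c)
p(a, b \mid c) = p(a \mid c)\, p(b \mid c)
a \perp b \mid c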
Conditional Independence: Example 1
Conditional Independence: Example 1
Conditional Independence: Example 2
Conditional Independence: Example 2
Conditional Independence: Example 3
• Note: this is the opposite of Example 1, with c unobserved.
Conditional Independence: Example 3
Note: this is the opposite of Example 1, with c observed.
“Am I out of fuel?”
B = Battery (0 = flat, 1 = fully charged)
F = Fuel Tank (0 = empty, 1 = full)
G = Fuel Gauge Reading (0 = empty, 1 = full)
And hence
“Am I out of fuel?”
Probability of an empty tank is increased by observing G = 0.
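The update follows Bayes' theorem, marginalizing over the unobserved battery state (formula structure only; the numerical values from the original slide are not reproduced here):

p(F = 0 \mid G = 0) = \frac{p(G = 0 \mid F = 0)\, p(F = 0)}{p(G = 0)}, \qquad p(G = 0 \mid F = 0) = \sum_{B} p(G = 0 \mid B, F = 0)\, p(B)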
“Am I out of fuel?”
Probability of an empty tank is reduced by observing B = 0. This is referred to as “explaining away”.
The Markov Blanket
Factors independent of x_i cancel between numerator and denominator.
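For a directed graph this conditional can be written as follows, and the factors that survive the cancellation define the Markov blanket of x_i (its parents, children, and co-parents):

p(x_i \mid x_{\{j \ne i\}}) = \frac{\prod_k p(x_k \mid pa_k)}{\sum_{x_i} \prod_k p(x_k \mid pa_k)}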
Cliques and Maximal Cliques
Clique
Maximal Clique
Joint Distribution
• where \psi_C(x_C) is the potential over clique C and
• Z is the normalization coefficient; note: with M K-state variables, Z contains K^M terms.
• Energies and the Boltzmann distribution
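Written out (standard notation), the joint distribution, normalization coefficient, and energy-based potentials are:

p(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C), \qquad Z = \sum_{x} \prod_{C} \psi_C(x_C), \qquad \psi_C(x_C) = \exp\{ -E(x_C) \}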
Illustration: Image De-Noising (1)
(Figure: original image and noisy image)
Illustration: Image De-Noising (2)
Illustration: Image De-Noising (3)
(Figure: noisy image and restored image, obtained with ICM)
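A minimal sketch of ICM for a binary de-noising model of this kind, assuming pixels x_i, y_i ∈ {−1, +1} and an energy E(x, y) = h Σ_i x_i − β Σ_{i,j} x_i x_j − η Σ_i x_i y_i; the function name and coefficient values are illustrative, not taken from the slides.

import numpy as np

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, n_sweeps=10):
    """Iterated conditional modes for binary image de-noising.

    y : 2-D array with entries in {-1, +1} (the noisy image).
    Returns a restored image x that locally minimizes the energy
    E(x, y) = h*sum(x_i) - beta*sum_neighbors(x_i x_j) - eta*sum(x_i y_i).
    """
    x = y.copy()
    rows, cols = x.shape
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                # Sum of the 4-connected neighbours of pixel (i, j).
                nb = 0.0
                if i > 0:        nb += x[i - 1, j]
                if i < rows - 1: nb += x[i + 1, j]
                if j > 0:        nb += x[i, j - 1]
                if j < cols - 1: nb += x[i, j + 1]
                # Local energy for x_ij = +1 and x_ij = -1; keep the lower one.
                e_plus = h - beta * nb - eta * y[i, j]
                e_minus = -h + beta * nb + eta * y[i, j]
                x[i, j] = 1 if e_plus < e_minus else -1
    return x

# Toy usage: flip 10% of the pixels of a simple binary image, then restore.
rng = np.random.default_rng(0)
clean = np.ones((32, 32), dtype=int)
clean[8:24, 8:24] = -1
noise = rng.random(clean.shape) < 0.1
noisy = np.where(noise, -clean, clean)
restored = icm_denoise(noisy)
print("pixels differing from clean image:", int((restored != clean).sum()))

Each sweep visits every pixel and keeps whichever of its two states gives the lower local energy, so the total energy never increases.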
Converting Directed to Undirected Graphs (1)
Converting Directed to Undirected Graphs (2)
• Additional links: “marrying parents”, i.e., moralization
Directed vs. Undirected Graphs (2)
Inference on a Chain
Computational time increases exponentially with N.
Inference on a Chain
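For a chain of N discrete nodes with K states each, passing messages forward and backward replaces the exponential-cost sum with a linear-cost one (standard chain-inference result; ψ denotes the pairwise potentials):

\mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\, \mu_\alpha(x_{n-1}), \qquad \mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\, \mu_\beta(x_{n+1})
p(x_n) = \frac{1}{Z}\, \mu_\alpha(x_n)\, \mu_\beta(x_n)

The cost is O(N K^2) rather than the O(K^N) of direct marginalization.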
Supervised Learning
• Supervised learning: learning with examples or labels, e.g., classification and regression
• Examples: linear regression (the example we just saw), generalized linear models (e.g., probit classification), support vector machines, Gaussian process classification, etc.
• Take CS590M Machine Learning in fall 2009.
Unsupervised Learning
• Supervised learning: learning with examples or labels, e.g., classification and regression
• Unsupervised learning: learning without examples or labels, e.g., clustering, mixture models, PCA, non-negative matrix factorization
K-means Clustering: Goal
Cost Function
Two Stage Updates
Optimizing Cluster Assignment
Optimizing Cluster Centers
Convergence of Iterative Updates
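In equations (standard K-means notation, assumed here), the cost and the two alternating updates are:

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| x_n - \mu_k \|^2
r_{nk} = 1 \ \text{if} \ k = \arg\min_j \| x_n - \mu_j \|^2, \ \text{else} \ 0
\mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}

Each step can only decrease J (or leave it unchanged), which is why the iterative updates converge to a local optimum.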
Example of K-Means Clustering
Mixture of Gaussians
• Mixture of Gaussians:
• Introduce latent variables:
• Marginal distribution:
Conditional Probability
• Responsibility that component k takes for explaining the observation.
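Written out (standard notation), the mixture marginal and the responsibilities are:

p(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \Sigma_k)
\gamma(z_k) = p(z_k = 1 \mid x) = \frac{\pi_k\, N(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, N(x \mid \mu_j, \Sigma_j)}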
Maximum Likelihood
• Maximize the log likelihood function \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k)
Maximum Likelihood Conditions (1)
• Setting the derivative of the log likelihood with respect to \mu_k to zero:
Maximum Likelihood Conditions (2)
• Setting the derivative of the log likelihood with respect to \Sigma_k to zero:
Maximum Likelihood Conditions (3)
• Lagrange function:
• Setting its derivative to zero and using the normalization constraint, we obtain:
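Collecting the three conditions (the standard mixture-of-Gaussians maximum likelihood results), with N_k = \sum_n \gamma(z_{nk}):

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T, \qquad \pi_k = \frac{N_k}{N}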
Expectation Maximization for Mixture Gaussians
• Although the previous conditions do not provide closed-form solutions, we can use them to construct iterative updates:
• E step: compute responsibilities \gamma(z_{nk}).
• M step: compute new means \mu_k, covariances \Sigma_k, and mixing coefficients \pi_k.
• Loop over the E and M steps until the log likelihood stops increasing (a code sketch follows below).
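A minimal sketch of this EM loop for a Gaussian mixture, assuming numpy/scipy and a data matrix X of shape (N, D); the function name, initialization, and stopping tolerance are illustrative choices, not taken from the lecture.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a K-component Gaussian mixture; returns (pi, means, covs, log_liks)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: random data points as means, shared covariance, uniform weights.
    means = X[rng.choice(N, K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    log_liks = []
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] under the current parameters.
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, means[k], covs[k])
                         for k in range(K)], axis=1)          # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: update means, covariances, and mixing coefficients.
        Nk = gamma.sum(axis=0)                                  # shape (K,)
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Track the log likelihood and stop once it no longer increases.
        log_lik = np.log(dens.sum(axis=1)).sum()
        log_liks.append(log_lik)
        if len(log_liks) > 1 and log_lik - log_liks[-2] < 1e-6:
            break
    return pi, means, covs, log_liks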
Example
• EM on the Old Faithful data set.
General EM Algorithm
EM as Lower Bounding Methods
• Goal: maximize \ln p(X \mid \theta) = \ln \sum_Z p(X, Z \mid \theta)
• Define: L(q, \theta) = \sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}, \quad KL(q \| p) = -\sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}
• We have \ln p(X \mid \theta) = L(q, \theta) + KL(q \| p)
Lower Bound
• L(q, \theta) is a functional of the distribution q(Z).
• Since KL(q \| p) \ge 0 and \ln p(X \mid \theta) = L(q, \theta) + KL(q \| p), L(q, \theta) is a lower bound of the log likelihood function \ln p(X \mid \theta).
Illustration of Lower Bound
Lower Bound Perspective of EM
• Expectation step: maximizing the functional lower bound L(q, \theta) over the distribution q(Z).
• Maximization step: maximizing the lower bound L(q, \theta) over the parameters \theta.
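Concretely (the standard EM updates implied by this view, with \theta^{old} the current parameter values):

\text{E step:} \quad q(Z) = p(Z \mid X, \theta^{old})
\text{M step:} \quad \theta^{new} = \arg\max_{\theta} \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)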
Illustration of EM Updates