Announcements
• Homework 3 due tonight!
• HW 4 will be posted tonight. Start early.
Clustering
Machine Learning – CSE546, Kevin Jamieson, University of Washington
November 21, 2016
©Kevin Jamieson 2017
Clustering images
[Goldberger et al.]
Set of Images
Clustering web search results
Hierarchical Clustering
Pick one:
- Bottom up: start with every point as a cluster and merge
- Top down: start with a single cluster containing all points and split
Different rules for splitting/merging are possible; there is no single “right answer”.
Hierarchical clustering gives an apparently interpretable tree representation. Warning, however: even random data with no structure will produce a tree that “appears” to be structured.
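As a concrete sketch (not from the slides): a minimal bottom-up clustering in SciPy, where the toy data, the "average" linkage rule, and the cut into two flat clusters are all arbitrary illustrative choices.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data: two loose blobs in 2-D (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    # Bottom up: start with every point as a cluster and repeatedly merge
    # the two closest clusters; "average" linkage is one of several merge rules.
    Z = linkage(X, method="average")

    # Cut the resulting tree to get flat cluster labels (here, 2 clusters).
    labels = fcluster(Z, t=2, criterion="maxclust")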
Some Data
K-means
1. Ask user how many clusters they’d like. (e.g. k=5)
K-means
1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
K-means
1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)
K-means
1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it’s closest to.
4. Each Center finds the centroid of the points it owns
K-means
1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it’s closest to.
4. Each Center finds the centroid of the points it owns…
5. …and jumps there
6. …Repeat until terminated!
K-means
■ Randomly initialize k centers: µ^{(0)} = µ_1^{(0)}, …, µ_k^{(0)}
■ Classify: assign each point j ∈ {1, …, N} to the nearest center:
  C(j) ← argmin_i ||µ_i − x_j||²
■ Recenter: µ_i becomes the centroid of the points it owns:
  µ_i ← argmin_µ Σ_{j : C(j)=i} ||µ − x_j||²
  Equivalent to µ_i ← average of its points!
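A minimal NumPy sketch of this loop (the initialization by sampling data points and the convergence test are illustrative choices, not the only options):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialize k centers by picking k distinct data points.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Classify: assign each point to its nearest center.
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            # Recenter: each center jumps to the centroid of the points it owns
            # (empty clusters keep their old center).
            new_centers = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                    else centers[i] for i in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, assign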
What is K-means optimizing?
■ Potential function F(µ, C) of centers µ and point allocations C:
  F(µ, C) = Σ_{j=1}^N ||µ_{C(j)} − x_j||²
■ Optimal K-means: min_µ min_C F(µ, C)
Does K-means converge??? Part 1
■ Optimize the potential function: min_µ min_C F(µ, C)
■ Fix µ, optimize C
Does K-means converge??? Part 2
■ Optimize the potential function: min_µ min_C F(µ, C)
■ Fix C, optimize µ
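Putting the two parts together (a standard coordinate-descent argument in the notation above):

    Fix µ:  min_C F(µ, C) is solved pointwise by C(j) = argmin_i ||µ_i − x_j||²  (assign each point to its nearest center).
    Fix C:  min_µ F(µ, C) decouples over clusters; setting the gradient of Σ_{j : C(j)=i} ||µ_i − x_j||² to zero gives µ_i = average of the points with C(j) = i.

Each step can only decrease F, and there are finitely many allocations C, so K-means converges, though only to a local optimum of F, not necessarily the global one.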
Vector Quantization, Fisher Vectors
1. Represent the image as a grid of patches.
2. Run k-means on the patches to build a code book.
3. Represent each patch as a code word.
Vector Quantization (for compression)
A similar reduced representation can be used as a feature vector.
Coates, Ng, Learning Feature Representations with K-means, 2012
Typical output of k-means on patches
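A minimal sketch of the patch-quantization pipeline (the patch size, the 16-word code book, the random stand-in image, and the reuse of the kmeans function sketched earlier are all illustrative assumptions):

    import numpy as np

    def extract_patches(img, p=8):
        # Cut a grayscale image (H x W array) into non-overlapping p x p patches,
        # each flattened to a vector.
        H, W = img.shape
        return np.array([img[i:i + p, j:j + p].ravel()
                         for i in range(0, H - p + 1, p)
                         for j in range(0, W - p + 1, p)])

    img = np.random.rand(64, 64)              # stand-in for a real image
    patches = extract_patches(img, p=8)

    # Build the code book with k-means, then represent every patch by the
    # index of its nearest code word.
    codebook, _ = kmeans(patches, k=16)
    codes = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)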
Spectral Clustering
Adjacency matrix: W
W_{i,j} = weight of edge (i, j)
Given feature vectors, could construct:
- a k-nearest neighbor graph with weights in {0, 1}
- a weighted graph with arbitrary similarities, e.g. W_{i,j} = e^{−γ ||x_i − x_j||²}
D_{i,i} = Σ_{j=1}^n W_{i,j},  L = D − W
Let f ∈ ℝ^n be a function over the nodes.
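One reason to introduce f (a standard identity for symmetric W, stated here for context): the Laplacian quadratic form measures how much f varies across the edges of the graph,

    f^T L f = f^T (D − W) f = (1/2) Σ_{i,j} W_{i,j} (f_i − f_j)²

so a small value of f^T L f means f is smooth over the graph. This is what makes L useful both for clustering and as a regularizer.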
Spectral Clustering
Adjacency matrix: W
W_{i,j} = weight of edge (i, j)
Given feature vectors, could construct a (k=10)-nearest neighbor graph with weights in {0, 1}.
D_{i,i} = Σ_{j=1}^n W_{i,j},  L = D − W
Popular to use the Laplacian L or its normalized form L̃ = I − D^{−1} W as a regularizer for learning over graphs.
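A minimal spectral-clustering sketch consistent with these definitions (the Gaussian similarity, the value of γ, the use of the unnormalized Laplacian, and the reuse of the kmeans sketch above are illustrative choices):

    import numpy as np

    def spectral_clusters(X, k=2, gamma=1.0):
        # Weighted graph: W_ij = exp(-gamma * ||x_i - x_j||^2), zero diagonal.
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-gamma * sq)
        np.fill_diagonal(W, 0.0)
        D = np.diag(W.sum(axis=1))
        L = D - W                      # unnormalized graph Laplacian
        # Eigenvectors of L with the smallest eigenvalues give a low-dimensional
        # embedding of the nodes; cluster that embedding with k-means.
        vals, vecs = np.linalg.eigh(L)
        embedding = vecs[:, :k]
        _, labels = kmeans(embedding, k)
        return labels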
Mixtures of Gaussians
Machine Learning – CSE546, Kevin Jamieson, University of Washington
November 21, 2016
(One) bad case for k-means
■ Clusters may overlap
■ Some clusters may be “wider” than others
Mixture models
Z = {y_i}_{i=1}^n is the observed data.
If φ_θ(x) is the Gaussian density with parameters θ = (µ, σ²), then
ℓ(θ; Z) = Σ_{i=1}^n log[(1 − π) φ_{θ1}(y_i) + π φ_{θ2}(y_i)]
Mixture models
Δ = {Δ_i}_{i=1}^n is the unobserved data (which component generated each y_i). Then
ℓ(θ; y_i, Δ_i = 0) = log[(1 − π) φ_{θ1}(y_i)]
ℓ(θ; y_i, Δ_i = 1) = log[π φ_{θ2}(y_i)]
Mixture models
The complete-data log-likelihood is
ℓ(θ; Z, Δ) = Σ_{i=1}^n [ (1 − Δ_i) log[(1 − π) φ_{θ1}(y_i)] + Δ_i log[π φ_{θ2}(y_i)] ]
If we knew Δ, how would we choose θ?
If we knew θ, how would we choose Δ?
γ_i(θ) = E[Δ_i | θ, Z] = π φ_{θ2}(y_i) / [ (1 − π) φ_{θ1}(y_i) + π φ_{θ2}(y_i) ]
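A minimal EM sketch for this two-component model on 1-D data (the quartile-based initialization and the fixed number of iterations are arbitrary illustrative choices):

    import numpy as np

    def gaussian_pdf(y, mu, var):
        return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def em_two_gaussians(y, n_iters=100):
        # Crude initialization from the data.
        mu1, mu2 = np.percentile(y, 25), np.percentile(y, 75)
        var1 = var2 = np.var(y)
        pi = 0.5
        for _ in range(n_iters):
            # E step: responsibilities gamma_i = E[Delta_i | theta, Z].
            p1 = (1 - pi) * gaussian_pdf(y, mu1, var1)
            p2 = pi * gaussian_pdf(y, mu2, var2)
            gamma = p2 / (p1 + p2)
            # M step: weighted means, variances, and mixing weight.
            mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
            mu2 = np.sum(gamma * y) / np.sum(gamma)
            var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
            var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
            pi = gamma.mean()
        return (mu1, var1), (mu2, var2), pi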
Gaussian Mixture Example: Start
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Some Bio Assay data
GMM clustering of the assay data
Resulting Density Estimator
Expectation Maximization Algorithm
The iterative Gaussian mixture model (GMM) fitting algorithm is a special case of EM:
Z is the observed data
Δ is the unobserved data
T = (Z, Δ)
Missing data example
x_i ∼ N(µ, Σ), but suppose some entries of x_i are missing.
ℓ(θ; T) = −(1/2) Σ_{i=1}^n [ log(2π|Σ|) + (x_i − µ)^T Σ^{−1} (x_i − µ) ]
Z is the observed data, Δ is the unobserved data, T = (Z, Δ)
E Step: E[ℓ(θ′; T) | Z, θ̂^{(j)}]
Natural choice for θ̂^{(0)}?
M Step: θ̂^{(j+1)} = argmax_{θ′} E[ℓ(θ′; T) | Z, θ̂^{(j)}]
For jointly Gaussian (X, Y), the conditional mean used to fill in missing entries is E[Y | X = x] = µ_Y + Σ_{YX} Σ_{XX}^{−1} (x − µ_X)
Connection to matrix factorization?
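A small sketch of the conditional-mean imputation used inside this EM (the partition into observed and missing blocks is done per row; a full EM would re-estimate µ and Σ from the filled-in data and iterate):

    import numpy as np

    def impute_missing_row(x, mu, Sigma):
        # x: 1-D array with np.nan marking missing entries.
        # Fill the missing block Y given the observed block X using
        # E[Y | X = x_obs] = mu_Y + Sigma_YX Sigma_XX^{-1} (x_obs - mu_X).
        miss = np.isnan(x)
        obs = ~miss
        filled = x.copy()
        if not miss.any():
            return filled
        Sigma_YX = Sigma[np.ix_(miss, obs)]
        Sigma_XX = Sigma[np.ix_(obs, obs)]
        filled[miss] = mu[miss] + Sigma_YX @ np.linalg.solve(Sigma_XX, x[obs] - mu[obs])
        return filled

    # A natural choice for theta_hat^(0): per-coordinate means of the observed
    # entries and, e.g., a complete-case estimate of the covariance.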
Density Estimation
Machine Learning – CSE546, Kevin Jamieson, University of Washington
November 21, 2016
Kernel Density Estimation
A very “lazy” GMM
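A minimal 1-D kernel density estimate in this spirit, i.e. one small Gaussian centered at every data point (the bandwidth h and the toy data are arbitrary illustrative choices):

    import numpy as np

    def kde(x_query, data, h=0.5):
        # p_hat(x) = (1/n) * sum_i N(x; x_i, h^2)
        diffs = (x_query[:, None] - data[None, :]) / h
        kernel = np.exp(-0.5 * diffs ** 2) / (h * np.sqrt(2 * np.pi))
        return kernel.mean(axis=1)

    data = np.concatenate([np.random.normal(-2, 1, 100), np.random.normal(3, 0.5, 100)])
    xs = np.linspace(-6, 6, 200)
    density = kde(xs, data, h=0.3)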
Kernel Density Estimation
Predict argmax_m r̂_{im}
What is the Bayes optimal classification rule?
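For reference, a standard statement of that rule (not taken from the slides): with class priors P(Y = m) and class-conditional densities p(x | Y = m),

    ŷ(x) = argmax_m P(Y = m | X = x) = argmax_m P(Y = m) p(x | Y = m)

Plugging in estimated priors and per-class density estimates (for example, a KDE or a GMM fit to each class) gives a generative plug-in classifier, which leads into the generative vs. discriminative comparison that follows.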
Generative vs Discriminative