Lecture #16: K-means, Principal Component Analysis, and Autoencoders
Mat Kallada
Introduction to Data Mining


Page 1:

Lecture #16: K-means, Principal Component Analysis, and Autoencoders

Mat Kallada

Introduction to Data Mining

Page 2:

Clustering? Why bother?

Page 3:

Clustering Motivation
● Finding different customer segments
● Finding different groups of things

Page 4:

DBSCAN Review
Someone tell me how it works!

Secondly, tell me the pitfalls.

Page 5:

DBSCAN
Pros:

● Can find non-linear, weird-looking clusters
● Can discover new clusters on its own (the number of clusters doesn’t need to be fixed in advance)

Cons:

● Suffers from the curse of dimensionality
● Sometimes doesn’t work if two clusters are meshed together

Page 6:

The K-means Method for Clustering
1. Randomly place K centroid vectors in our feature space
2. Assign each feature vector to its nearest centroid and compute the cost
3. Use gradient descent to move the centroids so the cost decreases; repeat steps 2-3 until convergence

Cost = Σᵢ ‖xᵢ − μ_c(i)‖², the total squared distance from each feature vector xᵢ to its assigned centroid μ_c(i)

Page 7:

The K-means Method for Clustering

Page 8:

Why Does Gradient Descent Suck?
Same reason as with neural networks: it can get stuck in local minima.

Page 9:

Convergence of Cluster Centroids
● K-means converges, but it finds a local minimum of the cost function
● The resulting cluster assignments may not be optimal

Page 10:

Page 11:

K-means in R

DataJoy Link: https://www.getdatajoy.com/examples/56e18779697743550fc1c3bc
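The DataJoy link may no longer resolve, so here is a minimal sketch of what such a script plausibly looks like (my own example on R's built-in iris data, not the original): base R's kmeans function, with the nstart argument rerunning the algorithm from several random initial centroids to dodge bad local minima.

# K-means in R on the four numeric iris columns.
x <- iris[, 1:4]

# K = 3 clusters; nstart = 20 reruns from 20 random initialisations and
# keeps the best result, since K-means only finds a local minimum.
set.seed(42)
fit <- kmeans(x, centers = 3, nstart = 20)

fit$cluster        # cluster assignment for each feature vector
fit$tot.withinss   # the cost: total within-cluster sum of squares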

Page 12:

K-Means vs DBSCAN: Some Key Points
K-Means is much faster than DBSCAN

DBSCAN can learn non-linear clusters

Page 13:

Dimensionality Reduction
We’ve looked at the unsupervised learning task known as clustering - but another problem we can tackle with unsupervised learning is dimensionality reduction.

TL;DR: reduce the number of columns somehow.

Page 14:

Dimensionality Reduction
“Dimensionality reduction” sounds like a weird term - but you actually already know what it is, and you use it almost every day when you take a picture with your phone.

Let’s consider the scenario where we take a picture of an object.

Page 15:

Dimensionality Reduction: High-dimensional Photography
Usually when we take a photograph, we want to angle the camera so that most of the interesting parts are captured and we can cherish them as memories.

Page 16:

When you think about it, taking a picture effectively takes a 3D environment and transforms it into a 2D one.

Using the two-dimensional photograph alone, we’ll already get a good sense of the 3D environment.

Page 17:

Here’s the question: why couldn’t we do this with data?

In a seven-dimensional feature space, it would be great if we could take a snapshot that lets us understand the data from a two- or three-dimensional perspective.

Page 18:

Taking a 2D picture in 3D

Page 19:

Why do we need high-dimensional photographs?
That is exactly what an unsupervised machine learning algorithm called principal component analysis (often referred to as PCA) does.

It takes a snapshot of our data in some higher-dimensional space and projects it onto a lower-dimensional one.

Why would this be of use to us?

Page 20:

Dimensionality Reduction = Column Compression
Visualization: We can use dimensionality reduction algorithms such as PCA on large feature spaces (above 3 dimensions) so that we can visualize and understand the structure of our data better.

Compression: Some learning algorithms take a very long time to train on high-dimensional data (e.g. neural networks, since there are lots of weights to learn); compressing the data beforehand with dimensionality reduction can make things run a lot faster.

Page 21:

Principal Component Analysis
● PCA works without looking at the labels, since it is an unsupervised process (the photographer doesn’t need to know what the labels are; he just wants the best-positioned picture)
● PCA takes a given dataset and linearly transforms it into some lower-dimensional space

Page 22:

Principal Component Analysis
● Just like a photographer who tries to find the best angle for the shot, PCA tries to keep as much of the information as possible while projecting to the new space; in other words, PCA tries to capture the principal components of the dataset

Page 23:

Principal Component Analysis Extracts a Linear Plane

Page 24:

How to pick this Linear Plane?

Similar to a photographer, we want to take the picture that captures most of the action.

We want to maximize the variance captured by our extracted plane.

Page 25:

PCA in Action
Our raw data lives in a three-dimensional vector space.

Page 26:

PCA in a Nutshell
We can apply our PCA transformation to some data using the transform function. In most cases we want to apply the transformation to the original data.

Once we have our transformed data, we can easily graph the newly created feature space (since it is two-dimensional), which gives us some general intuition about how our data was organized in its original three-dimensional space.
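As a concrete sketch in R (my own example on the built-in iris data, not the lecture's DataJoy script), we can fit PCA with prcomp and project the original data onto the first two principal components; in R, predict plays the role of the transform function.

# Fit PCA on the four numeric iris columns and project 4D -> 2D.
x <- iris[, 1:4]
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Apply the learned transformation to the original data and keep
# the first two principal components.
z <- predict(pca, x)[, 1:2]

# Graph the newly created two-dimensional feature space.
plot(z, xlab = "Principal Component #1", ylab = "Principal Component #2")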

Page 27:

Our data after PCA
Re-represent our data using a better set of axes.

[Scatter plot of the transformed data; axes: Principal Component #1 and Principal Component #2]

Page 28:

How well does our newly transformed data actually capture the original data?
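One standard way to answer this (a sketch of my own, not the lecture's code) is to look at the proportion of variance each principal component explains, which prcomp reports through summary.

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# The "Proportion of Variance" row shows how much of the original
# variance each principal component captures; the "Cumulative
# Proportion" row shows how much the first k components keep.
summary(pca)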

Page 29:

Limitations of PCA
PCA only extracts linearly derived features.

These are latent variables of our original feature space.

Page 30:

Autoencoders: Simple Neural Models for Compression

Core Idea: Train a neural network where the target output is the SAME INPUT it receives

Page 31:

Motivation of Autoencoders
Automatic feature extraction.

Page 32:

Automated Feature Extraction from Raw Features

Page 33:

We’re capturing a non-linear representation of the input data.

We’re taking advantage of the non-linear dimensionality-reduction steps of neural networks.

Page 34:

Non-linear Dimensionality Reduction
One popular way to do feature learning is through a type of neural network called an autoencoder.

These special types of neural networks try to reconstruct the raw input data with hopes that a hidden layer will capture interesting relationships about the raw data itself.

In other words, we will train a normal feed-forward neural network where the ‘labels’ in our case are the same as the input data.
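As a minimal sketch of that idea in R (my own example using the nnet package, an assumption on my part rather than the lecture's code), we fit a small feed-forward network whose targets are its own inputs; the two-unit hidden layer then serves as the compressed representation.

library(nnet)

# Scale the four numeric iris columns to [0, 1] so the network's
# logistic output units can reproduce them.
x <- as.matrix(iris[, 1:4])
x <- apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))

# Train a 4-2-4 autoencoder: the targets are the inputs themselves.
set.seed(42)
fit <- nnet(x, x, size = 2, maxit = 500, trace = FALSE)

# Reconstruction error: how closely the network reproduces its input.
mean((x - predict(fit, x))^2)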

Page 35:

An autoencoder can construct an optimal feature representation

Page 36:

Issue with this method?
The method itself is pretty straightforward: just train a neural network where the target label is the input itself.

The hidden layer should capture a non-linear compression of the input.

Page 37:

What if I learn a useless transformation?
A transformation that literally multiplies each feature by 1 (the identity) has zero reconstruction error, yet it compresses nothing.

The fix: we add random noise to the input data, so the network has to learn real structure in order to reconstruct the original.
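Here is a hedged sketch of that fix, continuing the nnet example above (the noise level of 0.1 is an arbitrary choice of mine, not from the lecture): we corrupt the inputs with Gaussian noise but keep the clean inputs as the targets.

library(nnet)

x <- as.matrix(iris[, 1:4])
x <- apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))

# Corrupt the inputs; the targets stay clean, so the useless identity
# transformation no longer achieves zero reconstruction error.
set.seed(42)
x_noisy <- x + matrix(rnorm(length(x), sd = 0.1), nrow = nrow(x))

denoiser <- nnet(x_noisy, x, size = 2, maxit = 500, trace = FALSE)
mean((x - predict(denoiser, x_noisy))^2)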

Page 38:

Extracted Feature Detectors for Images
Once we’ve trained the network to reconstruct the raw data, the hidden layer itself will have learned transformations which capture interesting features of our images.

Page 39:

Deeper Feature Representations
But we don’t need to stop here; we can train yet another autoencoder on top of this transformation to learn an even higher-level feature representation.

Page 40:

Page 41:

As shown above, we can sequentially stack autoencoders to create higher and higher levels of feature abstraction.

That is, we train an autoencoder and, once it is complete, use its learned transformation as the input to a subsequent autoencoder.

This effectively learns feature transformations which can be thrown into any off-the-shelf learning algorithm we’ve covered in previous lectures; see the sketch below.
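A rough sketch of stacking in R, again with nnet (two assumptions of mine: the helper name encode is hypothetical, and it relies on nnet's documented layout of the wts vector, which stores each hidden unit's bias followed by its input weights, with logistic hidden activations).

library(nnet)

# Hypothetical helper: recover the hidden-layer activations (the codes)
# of a trained nnet autoencoder.
encode <- function(fit, x) {
  p <- fit$n[1]                                   # number of inputs
  h <- fit$n[2]                                   # number of hidden units
  W <- matrix(fit$wts[1:((p + 1) * h)], nrow = p + 1)
  1 / (1 + exp(-cbind(1, x) %*% W))               # logistic activations
}

x <- as.matrix(iris[, 1:4])
x <- apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))

set.seed(42)
ae1 <- nnet(x, x, size = 3, maxit = 500, trace = FALSE)    # first autoencoder
h1  <- encode(ae1, x)                                      # level-1 features

ae2 <- nnet(h1, h1, size = 2, maxit = 500, trace = FALSE)  # stacked on top
h2  <- encode(ae2, h1)                                     # level-2 features

# Throw the learned features into an off-the-shelf algorithm, e.g. K-means:
kmeans(h2, centers = 3, nstart = 20)$cluster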

Page 42:

That’s all for today.