24
Conceptual and Empirical Exploration of t-SNE Lucy Chen, Allyson Ling, Shuting Zang Advisor: Xiaodong Li Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li) Conceptual and Empirical Exploration of t-SNE 1 / 24

Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Conceptual and Empirical Exploration of t-SNE

Lucy Chen, Allyson Ling, Shuting Zang

Advisor: Xiaodong Li

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 1 / 24

Page 2: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Seeds Dataset

From UC Irvine Machine Learning Repository

Geometric features of different wheat seeds

210 observations, 7 attributes, 3 classes

Link for dataset: https://archive.ics.uci.edu/ml/datasets/seeds

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 2 / 24

Page 3: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Seeds Dataset

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 3 / 24

Page 4: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Seeds Dataset

Figure: t-SNE Figure: PCA

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 4 / 24

Page 5: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

What is t-SNE?

t-distributed stochastic neighbor embedding

t-SNE is a non-linear dimension reduction technique

Steps:X → D → P → Q → Y

X = Data matrix (≤ 40 attributes)D = Pairwise distance matrixP = Probability matrix in high dimensionQ = Probability matrix in low dimensionY = Data matrix in low dimension

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 5 / 24

Page 6: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

X → Pairwise Distance Matrix → P

Perp(Pi ) = eH(Pi ) H(Pi ) = −∑j

pj |i log(pj |i )

pj |i =−exp(||xi − xj ||2/2σ2i )∑k 6=i −exp(||xi − xk ||2/2σ2i )

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 6 / 24

Page 7: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Perplexity

Perp(Pi ) = eH(Pi )

Measure of effective number of neighbors

Function of Shannon entropy

Measure of uncertainty of an outcome from a set of possible eventsNumber of bits needed to store information from dataset

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 7 / 24

Page 8: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Pairwise Distance Matrix → P

Perp(Pi ) = eH(Pi ), H(Pi ) = −∑

j pj |i log(pj |i ) , D = { p1∑i pi, . . . , pk∑

i pi}

pj = e−βjdij

H(D) = −∑j

pj∑i pi

logpj∑i pi

= −∑j

pj∑i pi

(logpj − log∑i

pi )

= −∑j

pj∑i pi

logpj + log(∑i

pi )pj∑i pi

= log∑i

pi −1∑i pi

∑j

pj logpj

= log∑i

pi −1∑i pi

∑j

βjdje−βjdj

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 8 / 24

Page 9: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

P → Q → Y

First set pij =pj|i+pi|j

2n and diagonal entries to 0

Then use gradient descent algorithm

C = KL(P||Q) =∑i

∑j

pij logpijqij

δC

δyi= 4

∑j

(pij − qij)(yi − yj)(1 + ||yi − yj ||2)−1

qij =(1 + ||yi − yj ||2)−1∑k 6=i (1 + ||yk − yi ||2)−1

Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 9 / 24

Page 10: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Real Data Implementation

Real Data Implementation 10 / 24

Page 11: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Wisconsin Breast Cancer Dataset

From UC Irvine Machine Learning Repository

Breast Cancer Diagnosis: 357 benign, 212 malignant

Ten real-valued features are computed for each cell nucleus, andattributes include the mean, standard error, and ”worst” or largest(mean of the three largest values) of these features

Real Data Implementation 11 / 24

Page 12: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Wisconsin Breast Cancer Dataset

Figure: Original Figure: Scaled

Real Data Implementation 12 / 24

Page 13: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Wisconsin Breast Cancer Dataset

Figure: 2D PCA

Real Data Implementation 13 / 24

Page 14: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Wisconsin Breast Cancer Dataset

Finding the proper perplexity

For this dataset, higher perplexity looks better for clustering

Real Data Implementation 14 / 24

Page 15: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Wisconsin Breast Cancer Dataset

Find the anomalies, and trace them back to the original dataset

Blue for benign, orange for malignant

Real Data Implementation 15 / 24

Page 16: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Wisconsin Breast Cancer Dataset

Real Data Implementation 16 / 24

Page 17: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Wisconsin Breast Cancer Dataset

By the visualized clusters, we would like the doctor to reconsider thediagnosis, since the patient may be misdiagnosed.

Mean/SD/Worse of radiusMean/SD/Worse of perimeter → Features about size of cancer cellMean/SD/Worse of area

Real Data Implementation 17 / 24

Page 18: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Vertebral Column Dataset

From UC Irvine Machine Learning Repository

Features distinguishing spinal disorders

310 observations, 6 attributes, 3 classes

Link for dataset:https://archive.ics.uci.edu/ml/datasets/Vertebral+Column

Real Data Implementation 18 / 24

Page 19: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Vertebral Column Dataset

Figure: Disk Hernia and Normal Figure: Spondylolisthesis

Real Data Implementation 19 / 24

Page 20: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Vertebral Column Dataset

Figure: t-SNE Figure: PCA

Real Data Implementation 20 / 24

Page 21: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Vertebral Column Dataset

Figure: Density plots of each attribute

Real Data Implementation 21 / 24

Page 22: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Vertebral Column Dataset

Real Data Implementation 22 / 24

Page 23: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

Vertebral Column Dataset

Figure: Density plots of each attribute with anomalies

Real Data Implementation 23 / 24

Page 24: Conceptual and Empirical Exploration of t-SNE · X = Data matrix ( 40 attributes) D = Pairwise distance matrix P = Probability matrix in high dimension Q = Probability matrix in low

References

Maaten, Laurens van der. Hinton, Geoffrey. Visualizing Data using t-SNE.https://lvdmaaten.github.io/publications/papers/JMLR 2008.pdf

Shannon, Claude E. A Mathematical Theory of Communication.http://math.harvard.edu/ ctm/home/text/others/shannon/entropy/entropy.pdf

Datasets:

Seeds: https://archive.ics.uci.edu/ml/datasets/seeds

Breast Cancer: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Vertebral Column:https://archive.ics.uci.edu/ml/datasets/Vertebral+Column

Real Data Implementation 24 / 24