Representation learning
Deep learning overview, representation learning methods in detail (Sammon's map, t-SNE), the backprop algorithm in detail, and regularization and its impact on optimization.
● Dimensionality reduction
● Word2vec
● PCA
● Sammon's map
● Regularization
● t-SNE
● Factorized Embeddings
● Latent variable models
● ALI/BiGAN
Slides by Joseph Paul Cohen 2020
Dim reduction overview
We would like a mapping from R^100 (or anything) to R^2 to visualize it. Also known as: encoding, code, embedding, representation.
Invertible vs. non-invertible: if we allow loss of information, we cannot have a lossless reconstruction. Some methods just preserve distances; no mapping is learned.
PCA for dim reduction (quick recap)
In 2D data, this vector (the first principal component) captures the most variance.
A contains the eigenvectors of the covariance of X such that x = AAᵀx; A's columns are orthogonal, and they form a basis ordered by how much variance they encode.
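A minimal PCA sketch via the SVD in PyTorch (toy data; names and sizes are illustrative): keeping all columns of A gives x = AAᵀx exactly on centered data, and keeping only the top columns gives the best low-rank reconstruction.

import torch

X = torch.randn(500, 100)                 # toy data: 500 points in R^100
Xc = X - X.mean(dim=0)                    # center the data
U, S, Vh = torch.linalg.svd(Xc, full_matrices=False)
A = Vh.T[:, :2]                           # columns: top-2 eigenvectors of the covariance
Z = Xc @ A                                # the 2D codes (the embedding)
X_hat = Z @ A.T + X.mean(dim=0)           # best rank-2 linear reconstruction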
word2vec
Presented at NeurIPS 2013
Strategy to learn representations for word tokens given their context.
Relevant to problems where context defines concepts (with redundancy)
Tomas Mikolov
What to do with word embeddings?
● We can compose them to create paragraph embeddings (bag of embeddings).
● Use in place of words for an RNN.
● Augment learned representations on small datasets.
[Cultural Shift or Linguistic Drift, Hamilton, 2016]
Study how the meaning of words varies between two texts (or between hospitals, or doctors).
[Pennington, 2014][Mikolov, 2013]
Study the compositionality of the learned latent space
Token representations
One-hot encoding: binary vector per token

Example:
cat = [0 0 1 0 0 0 0 0 0 0 0 0 0 0 … 0]
dog = [0 0 0 0 1 0 0 0 0 0 0 0 0 0 … 0]
house = [1 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0]
Note!
If x is one-hot, the product Wx selects a single column of W:
(M × N)(N × 1) = (M × 1)
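A quick numerical check of this fact (toy sizes, illustrative names):

import torch

M, N = 4, 6
W = torch.randn(M, N)                     # M x N weight matrix
x = torch.zeros(N); x[2] = 1.0            # one-hot vector selecting index 2
print(torch.allclose(W @ x, W[:, 2]))     # True: Wx is exactly column 2 of W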
word2vec
[Figure: skip-gram setup on the sentence "involving respiratory system and other chest symptoms" — a target word surrounded by its context words.]
Mikolov, Efficient Estimation of Word Representations in Vector Space, 2013
1. Each word is a training example
2. Each word is used in many contexts
3. The context defines each word
[Figure: one-hot vocabulary vectors (involving, respiratory, doctor, chest, system) marking which words appear in the context of the target word "system".]
Context window = 2
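A minimal skip-gram training sketch in PyTorch following the three points above. This is a sketch, not the paper's implementation: it uses a full softmax over a tiny illustrative corpus, whereas the original uses hierarchical softmax or negative sampling on large corpora.

import torch
import torch.nn as nn

corpus = "involving respiratory system and other chest symptoms".split()
vocab = {w: i for i, w in enumerate(sorted(set(corpus)))}
V, D, window = len(vocab), 16, 2                     # context window = 2

emb_in = nn.Embedding(V, D)                          # target-word vectors (the embeddings we keep)
emb_out = nn.Embedding(V, D)                         # context-word vectors
opt = torch.optim.SGD(list(emb_in.parameters()) + list(emb_out.parameters()), lr=0.05)

for epoch in range(100):
    for i, w in enumerate(corpus):                   # each word is a training example
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue                             # the surrounding words are the context
            scores = emb_in(torch.tensor([vocab[w]])) @ emb_out.weight.T
            loss = nn.functional.cross_entropy(scores, torch.tensor([vocab[corpus[j]]]))
            opt.zero_grad(); loss.backward(); opt.step()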
Learning in progress
king + (woman - man) = ?
The closest point is queen!
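A sketch of trying this with pretrained vectors via gensim (assumes the gensim package and its downloadable "word2vec-google-news-300" model are available):

import gensim.downloader

model = gensim.downloader.load("word2vec-google-news-300")
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ...)] -- the nearest point to king + (woman - man)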
Try it yourself!
https://colab.research.google.com/drive/1VU4mm_DThBaQc9t0Cf6ajjHQDEw-Q1H2
Sammon's map
Described by John W. Sammon in 1969
Method of non-linear dim reduction based on gradient descent.
Basic method of preserving distances in a low dim space.
John W. Sammon
Colab Notebook
https://colab.research.google.com/drive/1FDJ2FlVfN5PYYrNKEW2w48_BuSknhKif
First: a basic non-linear dimensionality reduction

Learn a representation that maintains pairwise distances:

E = Σ_{i<j} (d*_ij − d_ij)²

d_ij = distance computed between each learned representation
d*_ij = distance function (or matrix) that you want to represent

Sammon's map additionally scales each discrepancy by the true distance, so small distances are more important:

E = (1 / Σ_{i<j} d*_ij) · Σ_{i<j} (d*_ij − d_ij)² / d*_ij

The leading factor is a constant: it depends only on the input distances, not on the learned representation.
source_d = torch.pdist(source)   # distances in the data space
target_d = torch.pdist(target)   # distances in the embedding space

stress = (((target_d - source_d)**2) / (source_d + 1)).sum()   # the +1 avoids division by zero (a simplification of Sammon's 1/d* weighting)
torch.pdist returns the condensed (non-redundant) pairwise distances: the upper triangle of the distance matrix flattened into a vector.
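For example, a quick shape check:

import torch

x = torch.randn(5, 3)        # 5 vectors
d = torch.pdist(x)           # condensed distances: 5*4/2 = 10 values
print(d.shape)               # torch.Size([10])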
source = torch.Tensor(data.values)                             # data: the high-dimensional points (e.g. a pandas DataFrame)
target = torch.randn(source.shape[0], 2, requires_grad=True)   # the 2D representation we will learn

optimizer = torch.optim.SGD([target], lr=0.5)

optimizer.zero_grad()              # get ready for new gradients
source_d = torch.pdist(source)     # compute distances
target_d = torch.pdist(target)     # compute distances
stress = (((target_d - source_d)**2) / (source_d + 1)).sum()
stress.backward()                  # compute gradients for target
optimizer.step()                   # adjust the target tensor (repeat this step until the stress converges)
This paper calls the learning rate the "magic factor"!
Exercise (regularization)
How to control the representation learned?
Adjust the objective function so that the minimum has the property you want.
stress += 0.01 * torch.pdist(target[label == 6]).mean()   # penalize the spread of points with label 6, pulling them together
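A sketch of where this plugs into the training step above (assuming `label` is a tensor of class labels aligned with the rows of `target`, e.g. MNIST digit labels):

optimizer.zero_grad()
source_d = torch.pdist(source)
target_d = torch.pdist(target)
stress = (((target_d - source_d)**2) / (source_d + 1)).sum()
stress = stress + 0.01 * torch.pdist(target[label == 6]).mean()   # regularizer: shrink class-6 spread
stress.backward()
optimizer.step()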
Discussion
What are simple cases? What are hard cases?
What does a learning rate over 1 mean?
What are the drawbacks of Sammon's map?
What does regularization change about training?
t-SNE
TL;DR: Sammon's map, but distances decay exponentially.
p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)
= conditional probability that x_j is next to x_i, given a Gaussian centered at x_i (data space)

q_{ij} = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k − y_l‖²)⁻¹   (embedding space, Student-t kernel)

C = KL(P‖Q) = Σ_{i,j} p_ij log(p_ij / q_ij)

The cost is a ratio between the two distributions weighted by the source-data affinity: it drives p and q to be equal, but only for nearby points.
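A sketch of these quantities in PyTorch (a single shared σ here for brevity; real t-SNE searches a per-point σ_i, described next):

import torch

def sq_dists(Z):                                     # squared pairwise distances, autograd-safe
    s = (Z ** 2).sum(dim=1)
    return s[:, None] + s[None, :] - 2 * Z @ Z.T

def p_cond(X, sigma=1.0):                            # rows are p_{j|i}
    logits = -sq_dists(X) / (2 * sigma ** 2)
    logits.fill_diagonal_(-float("inf"))             # p_{i|i} = 0
    return torch.softmax(logits, dim=1)

def q_joint(Y):
    w = 1.0 / (1.0 + sq_dists(Y))                    # Student-t kernel in the embedding space
    w = w * (1.0 - torch.eye(len(Y)))                # q_{ii} = 0
    return w / w.sum()

X = torch.randn(100, 50)                             # data space
Y = torch.randn(100, 2, requires_grad=True)          # embedding space
P = p_cond(X)
P = (P + P.T) / (2 * len(X))                         # symmetrize to a joint p_{ij}
Q = q_joint(Y)
kl = (P * (torch.log(P + 1e-12) - torch.log(Q + 1e-12))).sum()   # KL(P || Q)
kl.backward()                                        # gradient w.r.t. the embedding Y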
Setting the σ (perplexity)
t-SNE performs a binary search for the value of σ_i that produces a P_i with a fixed perplexity: Perp(P_i) = 2^{H(P_i)}, where H is the Shannon entropy.
Data space (but in 2D)
[Figure: a small σ gives a narrow Gaussian neighborhood; a large σ gives a wide one. Perp = 30 → H ≈ 4.9. Image from: https://www.dataminingapps.com/2019/11/a-refresher-on-t-sne/]
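A sketch of that binary search for a single point i (assuming d2_i holds the squared distances from x_i to the other points): perplexity grows with σ, so we shrink σ when it is too high and widen it when too low.

import torch

def perplexity(p):                                    # Perp(P_i) = 2^{H(P_i)}
    h = -(p * torch.log2(p + 1e-12)).sum()            # entropy in bits; Perp = 30 -> H ~ 4.9
    return 2.0 ** h

def find_sigma(d2_i, target_perp=30.0, lo=1e-3, hi=1e3, iters=50):
    for _ in range(iters):
        sigma = (lo + hi) / 2
        p = torch.softmax(-d2_i / (2 * sigma ** 2), dim=0)
        if perplexity(p) > target_perp:
            hi = sigma                                # neighborhood too wide: shrink sigma
        else:
            lo = sigma                                # too narrow: widen sigma
    return sigma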
Discussion
How does t-SNE differ from Sammon's map?
Which distances are meaningful?
Factorized Embeddings
TL;DR: two spaces of non-linear embeddings, conditioned on each other to predict data.
GTEx Dataset: 8,910 samples, 56,000 genes
[Trofimov et al. ICML WCB 2017]
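A hedged sketch of the idea (dimensions, layer sizes, and names are illustrative, not the paper's): a sample embedding and a gene embedding are concatenated and fed to an MLP that predicts the expression value.

import torch
import torch.nn as nn

n_samples, n_genes, d = 8910, 56000, 2               # GTEx-sized, 2D for visualization

sample_emb = nn.Embedding(n_samples, d)              # one learned point per sample
gene_emb = nn.Embedding(n_genes, d)                  # one learned point per gene
mlp = nn.Sequential(nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, 1))

def predict(sample_idx, gene_idx):
    z = torch.cat([sample_emb(sample_idx), gene_emb(gene_idx)], dim=-1)
    return mlp(z).squeeze(-1)                        # predicted expression value

# Training would minimize e.g. MSE against observed (sample, gene, expression) triples.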
Evaluating the sample space
[Figure: each sample is a point colored by tissue; color can also represent the expression value predicted for the entire space when conditioned on a gene.]
Latent variable models
We learn a mapping from a latent variable z to a complicated x:

p(x) = ∫ p(x|z) p(z) dz

where p(z) is something simple like a Gaussian. The conditional probability p(x|z) is modeled by a neural network, and the latent space is a distribution we understand.
Image from Ward, A. D, 3D Surface Parameterization Using Manifold Learning for Medial Shape Representation
ALI/BiGAN
ICLR 2017. Two papers, same idea.
Matches joint distribution p(x,z).
Trains an encoder and decoder.
Vincent Dumoulin Jeff Donahue
ALI/BiGAN
Matches the joint distribution p(z,x) with q(z,x) using an adversarial loss.

[Figure: the decoder produces pairs (G(z), z) ~ p(z,x); the encoder produces pairs (x, E(x)) ~ q(z,x). The classifier D learns to tell the difference between the two kinds of inputs; E and G are updated so D cannot distinguish them.]
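A minimal sketch of this setup (architectures, sizes, and the alternating update scheme are illustrative, not the papers' exact training procedure):

import torch
import torch.nn as nn

dx, dz = 100, 2
E = nn.Sequential(nn.Linear(dx, 64), nn.ReLU(), nn.Linear(64, dz))       # encoder x -> z
G = nn.Sequential(nn.Linear(dz, 64), nn.ReLU(), nn.Linear(64, dx))       # decoder z -> x
D = nn.Sequential(nn.Linear(dx + dz, 64), nn.ReLU(), nn.Linear(64, 1))   # classifies (x, z) pairs

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_eg = torch.optim.Adam(list(E.parameters()) + list(G.parameters()), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

x = torch.randn(32, dx)                              # a batch of real data
z = torch.randn(32, dz)                              # samples from the prior p(z)
pair_q = torch.cat([x, E(x)], dim=1)                 # (x, E(x)) ~ q(z, x)
pair_p = torch.cat([G(z), z], dim=1)                 # (G(z), z) ~ p(z, x)

# Discriminator step: learn to tell the two joint distributions apart.
d_loss = bce(D(pair_q.detach()), torch.ones(32, 1)) + bce(D(pair_p.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Encoder/generator step: swap the labels so D cannot distinguish the pairs.
eg_loss = bce(D(pair_q), torch.zeros(32, 1)) + bce(D(pair_p), torch.ones(32, 1))
opt_eg.zero_grad(); eg_loss.backward(); opt_eg.step()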
Quick intro to adversarial distribution matching

[Figure: noise (typically a Gaussian) feeds the Generator, which outputs a fake image; the Discriminator receives fake and real images and answers "real or fake?". The generator learns to match the target distribution to fool the discriminator.]
Homework
1) Find a single/multi-cell RNA-seq dataset; compute a PCA, Sammon's map, and t-SNE. Color points by some relevant value.
2) What are the challenges for representation learning?
3)