Representation learning
Deep learning overview, representation learning methods in detail (Sammon's map, t-SNE), the backprop algorithm in detail, and regularization and its impact on optimization.
● Dimensionality reduction
● Word2vec
● PCA
● Sammon's map
● Regularization
● t-SNE
● Factorized Embeddings
● Latent variable models
● ALI/BiGAN
Slides by Joseph Paul Cohen 2020
Dim reduction overview
We would like a mapping from R^100 (or anything) to R^2 to visualize it. Also known as: encoding, code, embedding, representation.
Invertible vs. non-invertible: if we allow loss of information, we cannot have a lossless reconstruction. Some methods just preserve distances; no mapping is learned.
PCA for dim reduction (quick recap)
In 2D data, this vector (the first principal component) captures the most variance.
A contains the eigenvectors of the covariance of X such that x = AAᵀx; A's columns are orthogonal, and they form a basis ordered by how much variance they encode.
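A minimal PCA sketch via the SVD in PyTorch (toy data; names and sizes are illustrative): keeping all columns of A gives x = AAᵀx exactly on centered data, and keeping only the top columns gives the best low-rank reconstruction.

import torch

X = torch.randn(500, 100)                 # toy data: 500 points in R^100
Xc = X - X.mean(dim=0)                    # center the data
U, S, Vh = torch.linalg.svd(Xc, full_matrices=False)
A = Vh.T[:, :2]                           # columns: top-2 eigenvectors of the covariance
Z = Xc @ A                                # the 2D codes (the embedding)
X_hat = Z @ A.T + X.mean(dim=0)           # best rank-2 linear reconstruction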
word2vec
Presented at NeurIPS 2013
Strategy to learn representations for word tokens given their context.
Relevant to problems where context defines concepts (with redundancy)
Tomas Mikolov
What to do with word embeddings?
● We can compose them to create paragraph embeddings (bag of embeddings).
● Use in place of words for an RNN.
● Augment learned representations on small datasets.
[Cultural Shift or Linguistic Drift, Hamilton, 2016]
Study how the meaning of words varies between two texts (or between hospitals, or doctors).
[Pennington, 2014][Mikolov, 2013]
Study the compositionality of the learned latent space
Token representations
One-hot encoding: binary vector per token

Example:
cat = [0 0 1 0 0 0 0 0 0 0 0 0 0 0 … 0]
dog = [0 0 0 0 1 0 0 0 0 0 0 0 0 0 … 0]
house = [1 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0]
Note!
If x is one-hot, the product Wx selects a single column of W:
(M × N)(N × 1) = (M × 1)
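A quick numerical check of this fact (toy sizes, illustrative names):

import torch

M, N = 4, 6
W = torch.randn(M, N)                     # M x N weight matrix
x = torch.zeros(N); x[2] = 1.0            # one-hot vector selecting index 2
print(torch.allclose(W @ x, W[:, 2]))     # True: Wx is exactly column 2 of W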
word2vec
[Figure: skip-gram setup on the sentence "involving respiratory system and other chest symptoms" — a target word surrounded by its context words.]
Mikolov, Efficient Estimation of Word Representations in Vector Space, 2013
1. Each word is a training example
2. Each word is used in many contexts
3. The context defines each word
[Figure: one-hot vocabulary vectors (involving, respiratory, doctor, chest, system) marking which words appear in the context of the target word "system".]
Context window = 2
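A minimal skip-gram training sketch in PyTorch following the three points above. This is a sketch, not the paper's implementation: it uses a full softmax over a tiny illustrative corpus, whereas the original uses hierarchical softmax or negative sampling on large corpora.

import torch
import torch.nn as nn

corpus = "involving respiratory system and other chest symptoms".split()
vocab = {w: i for i, w in enumerate(sorted(set(corpus)))}
V, D, window = len(vocab), 16, 2                     # context window = 2

emb_in = nn.Embedding(V, D)                          # target-word vectors (the embeddings we keep)
emb_out = nn.Embedding(V, D)                         # context-word vectors
opt = torch.optim.SGD(list(emb_in.parameters()) + list(emb_out.parameters()), lr=0.05)

for epoch in range(100):
    for i, w in enumerate(corpus):                   # each word is a training example
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue                             # the surrounding words are the context
            scores = emb_in(torch.tensor([vocab[w]])) @ emb_out.weight.T
            loss = nn.functional.cross_entropy(scores, torch.tensor([vocab[corpus[j]]]))
            opt.zero_grad(); loss.backward(); opt.step()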
Learning in progress
king + (woman - man) = ?
The closest point is queen!
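A sketch of trying this with pretrained vectors via gensim (assumes the gensim package and its downloadable "word2vec-google-news-300" model are available):

import gensim.downloader

model = gensim.downloader.load("word2vec-google-news-300")
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', ...)] -- the nearest point to king + (woman - man)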
Try it yourself!
https://colab.research.google.com/drive/1VU4mm_DThBaQc9t0Cf6ajjHQDEw-Q1H2
Sammon's map
Described by John W. Sammon in 1969
Method of non-linear dim reduction based on gradient descent.
Basic method of preserving distances in a low dim space.
John W. Sammon
Colab Notebook
https://colab.research.google.com/drive/1FDJ2FlVfN5PYYrNKEW2w48_BuSknhKif
First: a basic non-linear dimensionality reduction

Learn a representation that maintains pairwise distances:

E = Σ_{i<j} (d*_ij − d_ij)²

d_ij = distance computed between each learned representation
d*_ij = distance function (or matrix) that you want to represent

Sammon's map additionally scales each discrepancy by the true distance, so small distances are more important:

E = (1 / Σ_{i<j} d*_ij) · Σ_{i<j} (d*_ij − d_ij)² / d*_ij

The leading factor is a constant: it depends only on the input distances, not on the learned representation.
source_d = torch.pdist(source)   # distances in the data space
target_d = torch.pdist(target)   # distances in the embedding space

stress = (((target_d - source_d)**2) / (source_d + 1)).sum()   # the +1 avoids division by zero (a simplification of Sammon's 1/d* weighting)
torch.pdist returns the condensed (non-redundant) pairwise distances: the upper triangle of the distance matrix flattened into a vector.
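For example, a quick shape check:

import torch

x = torch.randn(5, 3)        # 5 vectors
d = torch.pdist(x)           # condensed distances: 5*4/2 = 10 values
print(d.shape)               # torch.Size([10])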
source = torch.Tensor(data.values)                             # data: the high-dimensional points (e.g. a pandas DataFrame)
target = torch.randn(source.shape[0], 2, requires_grad=True)   # the 2D representation we will learn

optimizer = torch.optim.SGD([target], lr=0.5)

optimizer.zero_grad()              # get ready for new gradients
source_d = torch.pdist(source)     # compute distances
target_d = torch.pdist(target)     # compute distances
stress = (((target_d - source_d)**2) / (source_d + 1)).sum()
stress.backward()                  # compute gradients for target
optimizer.step()                   # adjust the target tensor (repeat this step until the stress converges)
This paper calls the learning rate the "magic factor"!
Exercise (regularization)
How to control the representation learned?
Adjust the objective function so that the minimum has the property you want.
stress += 0.01 * torch.pdist(target[label == 6]).mean()   # penalize the spread of points with label 6, pulling them together
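A sketch of where this plugs into the training step above (assuming `label` is a tensor of class labels aligned with the rows of `target`, e.g. MNIST digit labels):

optimizer.zero_grad()
source_d = torch.pdist(source)
target_d = torch.pdist(target)
stress = (((target_d - source_d)**2) / (source_d + 1)).sum()
stress = stress + 0.01 * torch.pdist(target[label == 6]).mean()   # regularizer: shrink class-6 spread
stress.backward()
optimizer.step()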
Discussion
What are simple cases? What are hard cases?
What does a learning rate over 1 mean?
What are the drawbacks of Sammon's map?
What does regularization change about training?
t-SNE
TL;DR: Sammon's map, but distances decay exponentially.
p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)
= conditional probability that x_j is next to x_i, given a Gaussian centered at x_i (data space)

q_{ij} = (1 + ‖y_i − y_j‖²)⁻¹ / Σ_{k≠l} (1 + ‖y_k − y_l‖²)⁻¹   (embedding space, Student-t kernel)

C = KL(P‖Q) = Σ_{i,j} p_ij log(p_ij / q_ij)

The cost is a ratio between the two distributions weighted by the source-data affinity: it drives p and q to be equal, but only for nearby points.
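A sketch of these quantities in PyTorch (a single shared σ here for brevity; real t-SNE searches a per-point σ_i, described next):

import torch

def sq_dists(Z):                                     # squared pairwise distances, autograd-safe
    s = (Z ** 2).sum(dim=1)
    return s[:, None] + s[None, :] - 2 * Z @ Z.T

def p_cond(X, sigma=1.0):                            # rows are p_{j|i}
    logits = -sq_dists(X) / (2 * sigma ** 2)
    logits.fill_diagonal_(-float("inf"))             # p_{i|i} = 0
    return torch.softmax(logits, dim=1)

def q_joint(Y):
    w = 1.0 / (1.0 + sq_dists(Y))                    # Student-t kernel in the embedding space
    w = w * (1.0 - torch.eye(len(Y)))                # q_{ii} = 0
    return w / w.sum()

X = torch.randn(100, 50)                             # data space
Y = torch.randn(100, 2, requires_grad=True)          # embedding space
P = p_cond(X)
P = (P + P.T) / (2 * len(X))                         # symmetrize to a joint p_{ij}
Q = q_joint(Y)
kl = (P * (torch.log(P + 1e-12) - torch.log(Q + 1e-12))).sum()   # KL(P || Q)
kl.backward()                                        # gradient w.r.t. the embedding Y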
Setting the σ (perplexity)
t-SNE performs a binary search for the value of σ_i that produces a P_i with a fixed perplexity: Perp(P_i) = 2^{H(P_i)}, where H is the Shannon entropy.
Data space (but in 2D)
[Figure: a small σ gives a narrow Gaussian neighborhood; a large σ gives a wide one. Perp = 30 → H ≈ 4.9. Image from: https://www.dataminingapps.com/2019/11/a-refresher-on-t-sne/]
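A sketch of that binary search for a single point i (assuming d2_i holds the squared distances from x_i to the other points): perplexity grows with σ, so we shrink σ when it is too high and widen it when too low.

import torch

def perplexity(p):                                    # Perp(P_i) = 2^{H(P_i)}
    h = -(p * torch.log2(p + 1e-12)).sum()            # entropy in bits; Perp = 30 -> H ~ 4.9
    return 2.0 ** h

def find_sigma(d2_i, target_perp=30.0, lo=1e-3, hi=1e3, iters=50):
    for _ in range(iters):
        sigma = (lo + hi) / 2
        p = torch.softmax(-d2_i / (2 * sigma ** 2), dim=0)
        if perplexity(p) > target_perp:
            hi = sigma                                # neighborhood too wide: shrink sigma
        else:
            lo = sigma                                # too narrow: widen sigma
    return sigma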
Discussion
How does t-SNE differ from Sammon's map?
Which distances are meaningful?
Factorized Embeddings
TL;DR: two spaces of non-linear embeddings, conditioned on each other to predict data.
GTEx Dataset: 8,910 samples, 56,000 genes
[Trofimov et al. ICML WCB 2017]
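A hedged sketch of the idea (dimensions, layer sizes, and names are illustrative, not the paper's): a sample embedding and a gene embedding are concatenated and fed to an MLP that predicts the expression value.

import torch
import torch.nn as nn

n_samples, n_genes, d = 8910, 56000, 2               # GTEx-sized, 2D for visualization

sample_emb = nn.Embedding(n_samples, d)              # one learned point per sample
gene_emb = nn.Embedding(n_genes, d)                  # one learned point per gene
mlp = nn.Sequential(nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, 1))

def predict(sample_idx, gene_idx):
    z = torch.cat([sample_emb(sample_idx), gene_emb(gene_idx)], dim=-1)
    return mlp(z).squeeze(-1)                        # predicted expression value

# Training would minimize e.g. MSE against observed (sample, gene, expression) triples.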
Evaluating the sample space
[Figure: each sample is a point colored by tissue; color can also represent the expression value predicted for the entire space when conditioned on a gene.]
Latent variable models
We learn a mapping from a latent variable z to a complicated x:

p(x) = ∫ p(x|z) p(z) dz

where p(z) is something simple like a Gaussian. The conditional probability p(x|z) is modeled by a neural network, and the latent space is a distribution we understand.
Image from Ward, A. D, 3D Surface Parameterization Using Manifold Learning for Medial Shape Representation
ALI/BiGAN
ICLR 2017. Two papers, same idea.
Matches joint distribution p(x,z).
Trains an encoder and decoder.
Vincent Dumoulin Jeff Donahue
ALI/BiGAN
Matches the joint distribution p(z,x) with q(z,x) using an adversarial loss.

[Figure: the decoder produces pairs (G(z), z) ~ p(z,x); the encoder produces pairs (x, E(x)) ~ q(z,x). The classifier D learns to tell the difference between the two kinds of inputs; E and G are updated so D cannot distinguish them.]
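A minimal sketch of this setup (architectures, sizes, and the alternating update scheme are illustrative, not the papers' exact training procedure):

import torch
import torch.nn as nn

dx, dz = 100, 2
E = nn.Sequential(nn.Linear(dx, 64), nn.ReLU(), nn.Linear(64, dz))       # encoder x -> z
G = nn.Sequential(nn.Linear(dz, 64), nn.ReLU(), nn.Linear(64, dx))       # decoder z -> x
D = nn.Sequential(nn.Linear(dx + dz, 64), nn.ReLU(), nn.Linear(64, 1))   # classifies (x, z) pairs

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_eg = torch.optim.Adam(list(E.parameters()) + list(G.parameters()), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

x = torch.randn(32, dx)                              # a batch of real data
z = torch.randn(32, dz)                              # samples from the prior p(z)
pair_q = torch.cat([x, E(x)], dim=1)                 # (x, E(x)) ~ q(z, x)
pair_p = torch.cat([G(z), z], dim=1)                 # (G(z), z) ~ p(z, x)

# Discriminator step: learn to tell the two joint distributions apart.
d_loss = bce(D(pair_q.detach()), torch.ones(32, 1)) + bce(D(pair_p.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Encoder/generator step: swap the labels so D cannot distinguish the pairs.
eg_loss = bce(D(pair_q), torch.zeros(32, 1)) + bce(D(pair_p), torch.ones(32, 1))
opt_eg.zero_grad(); eg_loss.backward(); opt_eg.step()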
Quick intro to adversarial distribution matching

[Figure: noise (typically a Gaussian) feeds the Generator, which outputs a fake image; the Discriminator receives fake and real images and answers "real or fake?". The generator learns to match the target distribution to fool the discriminator.]
Homework
1) Find a single/multi-cell RNA-seq dataset; compute a PCA, Sammon's map, and t-SNE. Color points by some relevant value.
2) What are the challenges for representation learning?
3)