
Page 1

Unsupervised Deep Learning Tutorial – Part 1

Alex Graves

NeurIPS, 3 December 2018

Marc’Aurelio Ranzato

Page 2

Part 1 – Alex Graves

● Introduction to unsupervised learning
● Autoregressive models
● Representation learning
● Unsupervised reinforcement learning
● 10-15 minute break

Page 3

Part 2 – Marc’Aurelio Ranzato

● Practical Recipes of Unsupervised Learning
● Learning representations
● Learning to generate samples
● Learning to map between two domains
● Open Research Problems
● 10-15 minutes questions (both presenters)

Page 4

Introduction to Unsupervised Learning

Page 5

Types of Learning

                With Teacher                        Without Teacher
Passive         Supervised Learning                 Unsupervised Learning
Active          Reinforcement Learning /            Intrinsic Motivation /
                Active Learning                     Exploration


Page 7

Why Learn Without a Teacher?

If our goal is to create intelligent systems that can succeed at a wide variety of tasks (RL or supervised), why not just teach them those tasks directly?

1. Targets / rewards can be difficult to obtain or define
2. Want rapid generalisation to new tasks and situations
3. Unsupervised learning is interesting


Page 9

Why Learn Without a Teacher?

If our goal is to create intelligent systems that can succeed at a wide variety of tasks (RL or supervised), why not just teach them those tasks directly?

1. Targets / rewards can be difficult to obtain or define
2. Unsupervised learning feels more human
3. Want rapid generalisation to new tasks and situations


Page 11

Transfer Learning

● Teaching on one task and transferring to another (multi-task learning, one-shot learning…) kind of works
● E.g. retraining speech recognition systems from a language with lots of data can improve performance on a related language with little data
● But never seems to transfer as far or as fast as we want it to
● Maybe there just isn’t enough information in the targets/rewards to learn transferable skills?

Stop learning tasks, start learning skills – Satinder Singh

Page 12

The Cherry on the Cake

● The targets for supervised learning contain far less information than the input data
● RL reward signals contain even less
● Unsupervised learning gives us an essentially unlimited supply of information about the world: surely we should exploit that?

If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake.

– Yann LeCun

Page 13

Example

● ImageNet training set contains ~1.28M images, each assigned one of 1000 labels
● If labels are equally probable, the complete set of randomly shuffled labels contains ~log2(1000) × 1.28M ≈ 12.8 Mbits
● The complete set of images, uncompressed at 128 x 128, contains ~500 Gbits: > 4 orders of magnitude more
● A large conv net (~30M weights) can memorise randomised ImageNet labellings. Could it memorise randomised pixels?

Understanding Deep Learning Requires Rethinking Generalization, Zhang et al. 2016
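A quick check of the arithmetic above (a sketch in Python; it assumes 3 colour channels at 8 bits each, which the slide leaves implicit):

    import math

    num_images = 1.28e6      # ImageNet training set size
    num_classes = 1000

    label_bits = num_images * math.log2(num_classes)   # ~12.8 Mbits
    pixel_bits = num_images * 128 * 128 * 3 * 8         # ~500 Gbits uncompressed

    print(f"labels: {label_bits / 1e6:.1f} Mbits")
    print(f"pixels: {pixel_bits / 1e9:.0f} Gbits")
    print(f"ratio:  {pixel_bits / label_bits:.0f}x")     # > 4 orders of magnitude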

Page 14

Supervised Learning

● Given a dataset D of inputs x labelled with targets y, learn to predict y from x, typically with maximum likelihood:
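In symbols (a standard way of writing the objective the slide refers to):

    \theta^{*} = \arg\max_{\theta} \sum_{(x, y) \in D} \log p_{\theta}(y \mid x)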

● (Still) the dominant paradigm in deep learning: image classification, speech recognition, translation…

Page 15

Unsupervised Learning

● Given a dataset D of inputs x, learn to predict… what?
● The basic challenge of unsupervised learning is that the task is undefined
● Want a single task that will allow the network to generalise to many other tasks (which ones?)

Page 16

Density Modelling

● Simplest approach: do maximum likelihood on the data instead of the targets
● Goal is to learn the ‘true’ distribution from which the data was drawn
● Means attempting to learn everything about the data
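In symbols, the corresponding maximum-likelihood objective over the data itself is:

    \theta^{*} = \arg\max_{\theta} \sum_{x \in D} \log p_{\theta}(x)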

Page 17

Where to Look

Not everyone agrees that trying to understand everything is a good idea. Shouldn’t we instead focus on things that we believe will one day be useful for us?

… we lived our lives under the constantly changing sky without sparing it a glance or a thought. And why indeed should we? If the various formations had had some meaning, if, for example, there had been concealed signs and messages for us which it was important to decode correctly, unceasing attention to what was happening would have been inescapable…

– Karl Ove Knausgaard, A Death in the Family

Page 18

Problems with Density Modelling

● First problem: density modelling is hard! From having too few bits to learn from, we now have too many (e.g. video, audio), and we have to deal with complex interactions between variables (curse of dimensionality)

● Second problem: not all bits are created equal. Log-likelihoods depend much more on low-level details (pixel correlations, word n-grams) than on high-level structure (image contents, semantics)

● Third problem: even if we learn the underlying structure, it isn’t always clear how to access and exploit that knowledge for future tasks (representation learning)

Page 19

Generative Models

● Modelling densities also gives us a generative model of the data (as long as we can draw samples)
● Allows us to ‘see’ what the model has and hasn’t learned
● Can also use generative models to imagine possible scenarios, e.g. for model-based RL

What I cannot create, I do not understand. – Richard Feynman

Page 20

Autoregressive Models

Page 21

The Chain Rule for Probabilities

Slide Credit: Piotr Mirowski
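The chain rule the slide title refers to: any joint distribution over a D-dimensional x factorises exactly into a product of one-dimensional conditionals,

    p(x_1, \dots, x_D) = \prod_{i=1}^{D} p(x_i \mid x_1, \dots, x_{i-1})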

Page 22

Autoregressive Networks

● Basic trick: split high dimensional data up into a sequence of small pieces, predict each piece from those before (curse of dimensionality)

● Conditioning on past is done via network state (LSTM/GRU, masked convolutions, transformers…), output layer parameterises predictions
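A minimal sketch of such a model, assuming PyTorch and a discrete (e.g. byte-valued) alphabet; the class and function names here are illustrative, not from the tutorial:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AutoregressiveLSTM(nn.Module):
        def __init__(self, vocab_size=256, embed_dim=64, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)   # parameterises p(x_i | x_<i)

        def forward(self, x):              # x: (batch, length) integer tokens
            h, _ = self.rnn(self.embed(x))
            return self.out(h)             # logits for the next token at every position

    def nll(model, x):
        # Teacher forcing: condition on the true prefix, score the true next token.
        logits = model(x[:, :-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), x[:, 1:].reshape(-1))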

Page 23

[Figure slides: an autoregressive network unrolled step by step, each output conditioned on the previously generated elements. Slide credit: Piotr Mirowski]

Page 28

Advantages of Autoregressive Models

● Simple to define: just have to pick an ordering
● Easy to generate samples: just sample from each predictive distribution, then feed in the sample at the next step as if it’s real data (dreaming for neural networks?)
● Best log-likelihoods for many types of data: images, audio, video, text…
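A sketch of that sampling loop, assuming PyTorch and a next-step model like the one sketched above (returning logits over the next symbol at every position):

    import torch

    @torch.no_grad()
    def sample(model, prefix, steps):
        x = prefix.clone()                     # (1, prefix_length) integer tokens
        for _ in range(steps):
            logits = model(x)[:, -1]           # predictive distribution for the next step
            probs = torch.softmax(logits, dim=-1)
            nxt = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, nxt], dim=1)     # feed the sample back in as if it were real data
        return x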

Page 29

Disadvantages of Autoregressive Models

● Very expensive for high-dimensional data (e.g. millions of predictions per second for video); can mitigate with parallelisation during training, but generation is still slow

● Order dependent: get very different results depending on the order in which predictions are made, and can’t easily impute out of order

● Teacher forcing: only learning to predict one step ahead, not many (potentially brittle generation and myopic representations)

Page 30

Language Modelling

Some of the obese people lived five to eight years longer than others.

Abu Dhabi is going ahead to build solar city and no pollution city.

Or someone who exposes exactly the truth while lying.

VIERA , FLA . -- Sometimes, Rick Eckstein dreams about baseball swings.

For decades, the quintessentially New York city has elevated its streets to the status of an icon.

The lawsuit was captioned as United States ex rel.

R. Jozefowicz et al., Exploring the Limits of Language Modeling (2016)

Page 31

WaveNets

van den Oord, A., et al. “WaveNet: A Generative Model for Raw Audio.” arXiv (2016).

Page 32

PixelRNN - Model

● Fully visible

● Model pixels with Softmax

● ‘Language model’ for images

van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).

Page 33

Pixel RNN - Samples

van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).

Page 34

Conditional Pixel CNN

van den Oord, A., et al. “Conditional Pixel CNN.” NIPS (2016).

Page 35

[Figure: the pixels of an image divided into interleaved slices (source and target grids), numbered by generation order]

Autoregressive over slices, then pixels within a slice

J. Menick et al., Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling (2018)

Page 36

256 x 256 CelebA-HQ

J. Menick et al., Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling (2018)

Page 37

128 x 128 ImageNet

J. Menick et al., Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling (2018)

Page 38

Video Pixel Network (VPN)

Kalchbrenner, N., et al. “Video Pixel Networks.” ICML (2017).

Page 39

Handwriting Synthesis

A. Graves, Generating Sequences with Recurrent Neural Networks (2013)

Page 40

Autoregressive Mixture Models

[Figure: co-ordinate density and mixture component weights]
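Schematically, the predictive distribution at each step is a mixture density whose weights and component parameters are all output by the network (a sketch; the handwriting model of Graves, 2013 uses bivariate Gaussians for pen offsets plus a Bernoulli end-of-stroke variable):

    p(x_t \mid x_{<t}) = \sum_{j=1}^{M} \pi_j(x_{<t}) \, \mathcal{N}\!\big(x_t ;\, \mu_j(x_{<t}), \Sigma_j(x_{<t})\big), \qquad \sum_{j=1}^{M} \pi_j(x_{<t}) = 1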

Page 41

Distribution over Sequences

Carter et al., Experiments in Handwriting with a Neural Network (2016)

Page 42

Representation Learning

Page 43

The Language of Neural Networks

● Deep networks work by learning complex, often hierarchical internal representations of input data

● These form a kind of language the network uses to describe the data

● Language can emerge from tasks like object recognition: has pointy ears, whiskers, tail => cat (c.f. Wittgenstein)

Page 44

C. Olah et al., Feature Visualization, Distill (2018)

Page 45

Unsupervised Representations

● Task-driven representations are limited by the requirements of the task: e.g. don’t need to internalise the laws of physics to recognise objects

● Unsupervised representations should be more general: as long as the laws of physics help to model observations in the world, they are worth representing

Page 46

Reading the Latent Language

● We want neural networks to describe the data to us (image captioning without the captions?)
● Then we can re-use the descriptions to plan, reason, and generalise at a more abstract level
● Good density models must learn a rich internal language, but we can’t read it (distil for WaveNet?): we need to break open the black box
● One way to make representations more accessible is to force them through a bottleneck

Page 47

Autoencoder

[Figure: an encoder maps the input to a latent representation and a decoder maps it back to a reconstruction; trained with a reconstruction cost. Slide: Irina Higgins, Loïc Matthey]


Page 49

Variational AutoEncoder

[Figure: the encoder now outputs a latent distribution rather than a point; the loss is a reconstruction cost plus a coding cost. Kingma et al., 2014; Rezende et al., 2014. Slide: Irina Higgins, Loïc Matthey]
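In symbols, the VAE loss for a datapoint x is the sum of those two costs (the negative evidence lower bound):

    \mathcal{L}(x) = \mathbb{E}_{q_{\theta}(z \mid x)}\big[-\log p(x \mid z)\big] \;+\; \mathrm{KL}\big(q_{\theta}(z \mid x) \,\|\, p(z)\big)

where the first term is the reconstruction cost and the second the coding cost.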

Page 50

Minimum Description Length for VAE

● Alice wants to transmit x as compactly as possible to Bob, who knows only the prior p(z) and the decoder weights
● The coding cost is the number of bits required for Alice to transmit a sample from qθ(z|x) to Bob (e.g. bits-back coding)
● The reconstruction cost measures the number of additional error bits Alice will need to send to Bob to reconstruct the data given the latent sample (e.g. arithmetic coding)
● The sum of the two costs is the total length of the message Alice needs to send to Bob to allow him to recover x (c.f. variational inference)

Chen et al., Variational Lossy Autoencoder (2017)

Page 51

Code Collapse

● Ideally a VAE would put high-level information in the codes, leave low-level information to the decoder
● But when the decoder is sufficiently powerful (e.g. autoregressive) the coding distribution tends to ‘collapse’ to the prior p(z)
● This means no information is passed through the bottleneck and no latent representation is learned
● MDL suggests a reason: a powerful decoder can implicitly learn p(z), meaning that if each x is independently transmitted, the number of bits saved by the decoder by conditioning on z ≈ the cost of transmitting z

Page 52

Thought Experiments

● Experiment 1: An MNIST Decoder learns a uniform mixture over 10 disjoint models. Prior is uniform over 10 classes. Conditioning on the image class saves ~ log2(10) bits, encoding the class costs ~ log2(10) bits

● Experiment 2: Pick 100 character strings at random from an encyclopedia. The context from the paragraph, article etc. is missing. Is it worth appending that information to each of the strings?

Page 53

Learn the Dataset, Not the Datapoints

● Suggests a fundamental flaw with using log-likelihoods to find representations: never worth encoding high-level information
● Example: conditioning on ImageNet labels makes a huge difference to samples, tiny difference to log-probs (≈ log2(1000) bits)
● But one label applies to many data, so worth encoding high-level information if we only encode it once for the whole dataset (≈ 1000 x log2(1000) bits)
● Want to amortise the coding cost over the whole dataset
● Use high level information to organise low level data, not annotate it

…one must take seriously the idea of working with datasets, rather than datapoints, as the key objects to model. – Edwards & Storkey, Towards a Neural Statistician (2017)

Page 54

Associative Compression Networks

● ACNs modify the VAE loss by replacing the unconditional prior p(z) with a conditional prior p(z|z’), where z’ is the latent representation of an associated data point (one of the K nearest Euclidean neighbours to z)
● p(z|z’) – parameterised by an MLP – models only part of the latent space, rather than the whole thing, which greatly reduces the coding cost
● Implicit amortisation: the more clustered the codes, the cheaper they are
● Result: rich, informative codes are learned, even with powerful decoders.

Graves et al., Associative Compression Networks for Representation Learning (2018)
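Schematically, the ACN objective keeps the VAE reconstruction term and swaps the prior inside the KL (a sketch based on the description above; see the paper for the exact parameterisation):

    \mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}\big[-\log p(x \mid z)\big] \;+\; \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid z')\big)

where z' is the code of an associated datapoint (one of the K nearest neighbours of z).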

Page 55

MDL for ACN

● Alice now wants to transmit the entire dataset to Bob, in any order (justified for IID data?)
● Bob has the weights of the associative prior, decoder and encoder
● Alice chooses an ordering for the data that minimises total coding cost (travelling salesman) and sends the data to Bob one at a time
● After receiving each latent code + error bits, he decodes the datapoint, then re-encodes it and uses the result to determine the associative prior for the next code

Page 56

[Figure: the ACN loss; the parts highlighted in red differ from a standard VAE, the rest is the same]

Page 58

[Plot: Unordered: KL from the unconditional prior; Ordered: KL from the conditional ACN prior]

Page 59

Binary MNIST reconstructions: the leftmost column shows test set images

Page 60

CelebA Reconstructions: leftmost column from test set

Page 61

‘Daydream’ sampling: encode data, sample latent from conditional prior, generate new data conditioned on latent, repeat

Page 62

Mutual Information

● Want codes that ‘describe’ the data as well as possible
● Mathematically, we want to maximise the mutual information between the code z and the data x
● For an autoencoder, the difference between decoding x with z and (optimally) decoding without z is a lower bound on MI(x, z), so minimising the reconstruction cost maximises MI
● But decoding is very expensive if we just want codes
● Are there other ways to maximise MI?

Page 63

Contrastive Predictive Coding

van den Oord et al., Representation Learning with Contrastive Predictive Coding (2018)
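A rough sketch of the contrastive (InfoNCE) loss at the heart of CPC: score a context against the code of the true future and a set of negative codes, and train a classifier to pick out the true one. Assumes PyTorch; the bilinear score, shapes and function name are illustrative rather than the paper's exact formulation:

    import torch
    import torch.nn.functional as F

    def info_nce(context, positive, negatives, W):
        # context: (B, Dc), positive: (B, Dz), negatives: (B, N, Dz), W: (Dz, Dc)
        pred = context @ W.T                                    # predicted future code, (B, Dz)
        pos = (pred * positive).sum(-1, keepdim=True)           # (B, 1) score for the true future
        neg = torch.einsum('bd,bnd->bn', pred, negatives)       # (B, N) scores for the negatives
        logits = torch.cat([pos, neg], dim=1)
        labels = torch.zeros(logits.size(0), dtype=torch.long)  # the positive sits at index 0
        return F.cross_entropy(logits, labels)                  # minimising this maximises a lower bound on MI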

Pages 64–73: [figure slides building up the Contrastive Predictive Coding architecture and training objective; van den Oord et al., Representation Learning with Contrastive Predictive Coding (2018); Gutmann et al., Noise-Contrastive Estimation (2009)]

Page 74

Speech - LibriSpeech

Representation Learning with Contrastive Predictive Coding

t-SNE on codes coloured by speaker identity

van den Oord et al., Representation Learning with Contrastive Predictive Coding (2018)

Page 75

Images - ImageNet

Representation Learning with Contrastive Predictive Coding

Page 76

NLP - BookCorpus

Representation Learning with Contrastive Predictive Coding

Page 77

Unsupervised Reinforcement Learning

Page 78

Auxiliary Tasks

● How can unsupervised learning help reinforcement learning?
● Simplest way is as an auxiliary task: maximise reward and minimise unsupervised loss with the same network
● Hope is that the representations learned for the unsupervised task will help with the RL task
● Also applies to supervised learning (e.g. semi-supervised learning, unsupervised pre-training)
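Schematically, the network is trained on a weighted sum of the two objectives (λ is a hypothetical mixing coefficient, not from the tutorial):

    \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{RL}} + \lambda \, \mathcal{L}_{\text{unsupervised}}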

Page 79

UNREAL Agent

Pixel Control – auxiliary policies are trained to maximise change in pixel intensity of different regions of the input

Reward Prediction – given three recent frames, the network must predict the reward that will be obtained in the next unobserved timestep

M. Jaderberg et al., Reinforcement Learning with Unsupervised Auxiliary Tasks (2016)

Page 80

Unsupervised RL Baselines

M. Jaderberg et al., Reinforcement Learning with Unsupervised Auxiliary Tasks (2016)

Page 81

Sparse Rewards? More Cherries!

[Figure: a single scalar reward signal vs. many reward signals]

Page 82

Auxiliary Losses

Reinforcement Learning on DM-Lab

● Auxiliary loss is on-policy
● Predict 30 steps in the future

Representation Learning with Contrastive Predictive Coding

Page 83

Reinforcement Learning on DM-Lab

[Plot legend: Batched A2C; Aux loss]

Representation Learning with Contrastive Predictive Coding

Page 84

Intrinsic Motivation

● Unsupervised learning can guide the policy of an RL agent as well as shaping the representations

● Agent becomes intrinsically motivated to discover or control aspects of the environment, with or without an extrinsic reward

● Many variants, no consensus…

Page 85

Curious Agents

Can reward the agent’s curiosity by guiding it towards ‘novel’ observations from which it can rapidly learn. Many curiosity signals can be used:

● Prediction Error: choose actions to maximise prediction error in observations. Problem is noise addiction: inherently unpredictable environments become unreasonably interesting. One solution is to make predictions in latent space instead: network doesn’t import noise into latent representations, only useful structure

Pathak et al., Curiosity-driven Exploration by Self-supervised Prediction (2017)

Page 86

Curious Agents (contd.)

● Bayesian Surprise: maximise KL between posterior (after seeing observation) and prior (before seeing it)
  Baldi et al., Bayesian Surprise Attracts Human Attention (2005)

● Prediction Gain: maximise change in prediction error before and after seeing an observation. Approximates Bayesian surprise.
  Bellemare et al., Unifying Count-Based Exploration and Intrinsic Motivation (2016)

● Complexity Gain: maximise increase in complexity of (regularised) predictive model. Assumes a parsimonious model will only increase complexity if it discovers a meaningful regularity. Needs a way of measuring complexity (e.g. VI).
  Graves et al., Automated Curriculum Learning for Neural Networks (2017)
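Schematically, the first two signals can be written as intrinsic rewards for an observation x (notation varies across the cited papers): Bayesian surprise compares the model's parameter posterior before and after the observation, while prediction gain compares the model's loss on the observation before and after updating on it:

    r_{\text{surprise}}(x) = \mathrm{KL}\big(p(\theta \mid h, x) \,\|\, p(\theta \mid h)\big), \qquad r_{\text{gain}}(x) = L(x; \theta_{\text{before}}) - L(x; \theta_{\text{after}})

where h is the history observed so far and θ the model parameters.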

Page 87

Prediction Gain Syllabus

Graves et al., Automated Curriculum Learning for Neural Networks (2017)

Page 88

Curiouser and Curiouser…

● Complexity Gain: seek out data that maximise the decrease in bits of everything the agent has ever observed (!). In other words find (or create) the thing that makes the most sense of the agent’s life so far: science, art, music, jokes…

Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes, Schmidhuber, 2008

Page 89

Empowered Agents

Instead of curiosity, the agent can be motivated by empowerment: attempt to maximise the mutual information between the agent’s actions and the consequences of its actions (e.g. the state the actions will lead to). The agent wants to have as much control as possible over its future.

Klyubin et al., Empowerment: A Universal Agent-Centric Measure of Control (2005)

One way to maximise mutual information is to classify the high level ‘option’ that determined the actions from the final state (while keeping the option entropy high): contrastive estimation again?

Gregor et al., Variational Intrinsic Control (2016)
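Schematically, empowerment at a state s is the maximum mutual information (channel capacity) between the agent's actions and the states they lead to (a sketch; formulations differ across the cited papers):

    \mathcal{E}(s) = \max_{p(a)} I\big(A ;\, S' \mid s\big)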

Page 90

Conclusions

● Unsupervised learning gives us much more signal to learn from
● But it isn’t clear what the learning objective should be
● Density modelling is one option
● Autoregressive neural networks are a powerful family of density models
● Methods such as autoencoding and predictive coding can yield useful latent representations
● RL can benefit from unsupervised learning as an auxiliary loss, and from intrinsic motivation signals such as curiosity