
Page 1:

Deep Learning (Part I)

Matt Gormley
Lecture 16
October 24, 2016

School of Computer Science

Readings: Nielsen (online book), Neural Networks and Deep Learning

10-601B Introduction to Machine Learning

Page 2:

Reminders

• Midsemester grades released today

Page 3:

Outline

• Deep Neural Networks (DNNs)
  – Three ideas for training a DNN
  – Experiments: MNIST digit classification
  – Autoencoders
  – Pretraining

• Convolutional Neural Networks (CNNs)
  – Convolutional layers
  – Pooling layers
  – Image recognition

• Recurrent Neural Networks (RNNs)
  – Bidirectional RNNs
  – Deep Bidirectional RNNs
  – Deep Bidirectional LSTMs
  – Connection to forward-backward algorithm

Part I

Part II

Page 4:

PRE-TRAINING FOR DEEP NETS

Page 5:

A Recipe for Machine Learning

Background

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

Goals for Today's Lecture

1. Explore a new class of decision functions (Deep Neural Networks)
2. Consider variants of this recipe for training
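
To make the recipe concrete, here is a minimal sketch in code, assuming logistic regression as the decision function and log loss as the goal; the synthetic data and hyperparameters are illustrative, not from the slides.

import numpy as np

# Sketch of the recipe: 1. data, 2. decision function + loss, 3. goal, 4. SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # 1. training data (features)
y = (X @ rng.normal(size=20) > 0).astype(float)      #    and binary labels

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
w = np.zeros(20)                                     # 2. decision function: sigmoid(w @ x)

lr = 0.1                                             # 3. goal: minimize the average log loss
for epoch in range(5):                               # 4. train with SGD
    for i in rng.permutation(len(X)):
        p = sigmoid(w @ X[i])                        # predicted probability
        grad = (p - y[i]) * X[i]                     # gradient of the log loss w.r.t. w
        w -= lr * grad                               # small step opposite the gradient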

Page 6:

Idea #1: No pre-training

Training

• Idea #1: (Just like a shallow network)
  – Compute the supervised gradient by backpropagation.
  – Take small steps in the direction opposite the gradient (SGD)
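
As a concrete (hypothetical) sketch of Idea #1, the loop below trains a two-hidden-layer sigmoid network end-to-end from a random initialization: a forward pass, a backward pass that applies the chain rule layer by layer, and an SGD step. Sizes, data, and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

sizes = [20, 16, 16, 1]                              # input, two hidden layers, output
W = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

lr = 0.5
for epoch in range(20):
    for i in rng.permutation(len(X)):
        # forward pass, caching every layer's activation
        acts = [X[i].reshape(-1, 1)]
        for Wl in W:
            acts.append(sigmoid(Wl @ acts[-1]))
        # backward pass (backpropagation): chain rule from output to input
        delta = acts[-1] - y[i]                      # log-loss delta at the output
        for l in reversed(range(len(W))):
            grad = delta @ acts[l].T
            delta = (W[l].T @ delta) * acts[l] * (1 - acts[l])   # through the sigmoid
            W[l] -= lr * grad                        # SGD step for this layer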

Page 7:

Comparison on MNIST

Training

[Figure: bar chart of % error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training); y-axis roughly 1.0–2.5]

• Results from Bengio et al. (2006) on MNIST digit classification task
• Percent error (lower is better)

Page 8:

Comparison on MNIST

Training

[Figure: bar chart of % error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training); y-axis roughly 1.0–2.5]

• Results from Bengio et al. (2006) on MNIST digit classification task
• Percent error (lower is better)

Page 9:

Idea #1: No pre-training

Training

• Idea #1: (Just like a shallow network)
  – Compute the supervised gradient by backpropagation.
  – Take small steps in the direction opposite the gradient (SGD)

• What goes wrong?
  A. Gets stuck in local optima
     • Nonconvex objective
     • Usually start at a random (bad) point in parameter space
  B. Gradient is progressively getting more dilute
     • "Vanishing gradients"

Page 10:

Problem A: Nonconvexity

Training

• Where does the nonconvexity come from?
• Even a simple quadratic z = xy objective is nonconvex:

[Figure: 3D surface plot of the saddle-shaped objective z = xy (axes x, y, z)]
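
A quick way to see it: the Hessian of z = xy is constant and indefinite (eigenvalues +1 and −1), so the surface is a saddle rather than a bowl.

% Hessian of z = xy has eigenvalues +1 and -1 (indefinite) => saddle point, hence nonconvex
\[
z = xy, \qquad
\nabla z = \begin{pmatrix} y \\ x \end{pmatrix}, \qquad
\nabla^2 z = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\qquad \text{eigenvalues } \pm 1 .
\]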

Page 11:

Problem A: Nonconvexity

Training

• Where does the nonconvexity come from?
• Even a simple quadratic z = xy objective is nonconvex:

[Figure: 3D surface plot of the saddle-shaped objective z = xy (axes x, y, z), shown from another angle]

Page 12:

Problem A: Nonconvexity

Training

Stochastic Gradient Descent…
…climbs to the top of the nearest hill…

Page 13:

Problem A: Nonconvexity

Training

Stochastic Gradient Descent…
…climbs to the top of the nearest hill…

Page 14:

Problem A: Nonconvexity

Training

Stochastic Gradient Descent…
…climbs to the top of the nearest hill…

Page 15:

Problem A: Nonconvexity

Training

Stochastic Gradient Descent…
…climbs to the top of the nearest hill…

Page 16:

Problem A: Nonconvexity

Training

Stochastic Gradient Descent…
…climbs to the top of the nearest hill…
…which might not lead to the top of the mountain

Page 17:

Problem B: Vanishing Gradients

Training

The gradient for an edge at the base of the network depends on the gradients of many edges above it. The chain rule multiplies many of these partial derivatives together.

[Figure: deep network — Input, several Hidden Layers, Output]

Page 18:

Problem B: Vanishing Gradients

Training

The gradient for an edge at the base of the network depends on the gradients of many edges above it. The chain rule multiplies many of these partial derivatives together.

[Figure: deep network — Input, several Hidden Layers, Output]

Page 19:

Problem B: Vanishing Gradients

Training

The gradient for an edge at the base of the network depends on the gradients of many edges above it. The chain rule multiplies many of these partial derivatives together.

[Figure: deep network — Input, several Hidden Layers, Output — with example per-edge partial derivatives 0.1, 0.3, 0.2, 0.7 along one path]
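
To see the effect numerically: multiplying the example edge derivatives in the figure already gives a number much smaller than any single factor, and with sigmoid units the per-layer derivative factor is at most 0.25, so the product shrinks geometrically with depth. A tiny illustrative sketch (the depth and values are assumptions for illustration):

import numpy as np

# The chain rule multiplies per-edge partial derivatives along a path.
path = [0.1, 0.3, 0.2, 0.7]          # the example values from the figure above
print(np.prod(path))                 # 0.0042 -- already much smaller than any factor

# Sigmoid units make this worse: h'(z) = h(z)(1 - h(z)) <= 0.25, so the
# activation-derivative factors alone along a 10-layer path are at most:
print(0.25 ** 10)                    # ~9.5e-07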

Page 20:

Idea #1: No pre-training

Training

• Idea #1: (Just like a shallow network)
  – Compute the supervised gradient by backpropagation.
  – Take small steps in the direction opposite the gradient (SGD)

• What goes wrong?
  A. Gets stuck in local optima
     • Nonconvex objective
     • Usually start at a random (bad) point in parameter space
  B. Gradient is progressively getting more dilute
     • "Vanishing gradients"

Page 21:

Idea #2: Supervised Pre-training

Training

• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

1. Supervised Pre-training
   – Use labeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task
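
A schematic, runnable sketch of these two steps (greedy supervised pre-training, then fine-tuning): each stage trains one new hidden layer plus a temporary output unit on the labels, then freezes that hidden layer. All data, sizes, and hyperparameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

X = rng.normal(size=(500, 20))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def train_layer_supervised(H, y, n_hidden, lr=0.5, epochs=20):
    """Train one hidden layer + output unit on the labels; return the hidden weights."""
    W = rng.normal(scale=0.1, size=(n_hidden, H.shape[1]))
    v = np.zeros(n_hidden)
    for _ in range(epochs):
        for i in rng.permutation(len(H)):
            h = sigmoid(W @ H[i])
            p = sigmoid(v @ h)
            g_out = p - y[i]                          # output delta (log loss)
            g_hid = g_out * v * h * (1 - h)           # hidden delta
            v -= lr * g_out * h
            W -= lr * np.outer(g_hid, H[i])
    return W

# 1. Supervised pre-training: bottom-up, train then freeze each hidden layer.
weights, H = [], X
for n_hidden in [16, 16]:
    W = train_layer_supervised(H, y, n_hidden)
    weights.append(W)
    H = sigmoid(H @ W.T)                              # frozen features feed the next stage

# 2. Supervised fine-tuning: run Idea #1 (full backprop over all layers plus a
#    fresh output layer), but initialized from `weights` instead of random values.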

Page 22:

Idea #2: Supervised Pre-training

Training

• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

[Figure: network with Input, Hidden Layer 1, and Output]

Page 23:

Idea #2: Supervised Pre-training

Training

• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

[Figure: network with Input, Hidden Layer 1, Hidden Layer 2, and Output]

Page 24:

Idea #2: Supervised Pre-training

Training

• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

[Figure: network with Input, Hidden Layers 1–3, and Output]

Page 25:

Idea #2: Supervised Pre-training

Training

• Idea #2: (Two Steps)
  – Train each level of the model in a greedy way
  – Then use our original idea

[Figure: network with Input, Hidden Layers 1–3, and Output]

Page 26:

Comparison on MNIST

Training

[Figure: bar chart of % error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training); y-axis roughly 1.0–2.5]

• Results from Bengio et al. (2006) on MNIST digit classification task
• Percent error (lower is better)

Page 27:

Comparison on MNIST

Training

[Figure: bar chart of % error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training); y-axis roughly 1.0–2.5]

• Results from Bengio et al. (2006) on MNIST digit classification task
• Percent error (lower is better)

Page 28:

Idea #3: Unsupervised Pre-training

Training

• Idea #3: (Two Steps)
  – Use our original idea, but pick a better starting point
  – Train each level of the model in a greedy way

1. Unsupervised Pre-training
   – Use unlabeled data
   – Work bottom-up
     • Train hidden layer 1. Then fix its parameters.
     • Train hidden layer 2. Then fix its parameters.
     • …
     • Train hidden layer n. Then fix its parameters.
2. Supervised Fine-tuning
   – Use labeled data to train following "Idea #1"
   – Refine the features by backpropagation so that they become tuned to the end-task

Page 29:

The solution: Unsupervised pre-training

[Figure: network with Input, Hidden Layer, and Output]

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
  • The input!

This topology defines an Auto-encoder.

Page 30:

The solution: Unsupervised pre-training

Unsupervised pre-training of the first layer:
• What should it predict?
• What else do we observe?
  • The input!

This topology defines an Auto-encoder.

[Figure: auto-encoder — Input at the bottom, Hidden Layer in the middle, reconstructed "Input" (x′) at the top]

Page 31:

Auto-Encoders

Key idea: Encourage z to give small reconstruction error:
– x′ is the reconstruction of x
– Loss = ||x − DECODER(ENCODER(x))||²
– Train with the same backpropagation algorithm for 2-layer Neural Networks, with x_m as both input and output.

[Figure: auto-encoder — Input at the bottom, Hidden Layer in the middle, reconstructed "Input" (x′) at the top]

ENCODER: z = h(Wx)
DECODER: x′ = h(W′z)

Slide adapted from Raman Arora
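
A minimal runnable sketch of exactly this setup: a sigmoid encoder z = h(Wx), a sigmoid decoder x′ = h(W′z), squared reconstruction error, and backprop/SGD with x as both input and target. The data, layer sizes, and learning rate are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

X = rng.random(size=(500, 20))                # unlabeled inputs in [0, 1]
d, k = 20, 8                                  # input size, hidden (code) size
W  = rng.normal(scale=0.1, size=(k, d))       # encoder weights
Wp = rng.normal(scale=0.1, size=(d, k))       # decoder weights (W')

lr = 0.5
for epoch in range(50):
    for i in rng.permutation(len(X)):
        x  = X[i]
        z  = sigmoid(W @ x)                   # ENCODER:  z  = h(W x)
        xr = sigmoid(Wp @ z)                  # DECODER:  x' = h(W' z)
        # backprop of the reconstruction error 0.5 * ||x - x'||^2
        d_out = (xr - x) * xr * (1 - xr)
        d_hid = (Wp.T @ d_out) * z * (1 - z)
        Wp -= lr * np.outer(d_out, z)
        W  -= lr * np.outer(d_hid, x)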

Page 32:

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.

[Figure: auto-encoder for the first layer — Input, Hidden Layer, reconstructed "Input" (x′)]

Page 33:

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.

[Figure: Input, Hidden Layer 1, Hidden Layer 2, and a reconstruction layer (′)]

Page 34:

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.

[Figure: Input, Hidden Layers 1–3, and a reconstruction layer (′)]

Page 35:

The solution: Unsupervised pre-training

Unsupervised pre-training
• Work bottom-up
  – Train hidden layer 1. Then fix its parameters.
  – Train hidden layer 2. Then fix its parameters.
  – …
  – Train hidden layer n. Then fix its parameters.

Supervised fine-tuning
• Backprop and update all parameters

[Figure: full network — Input, Hidden Layers 1–3, Output]
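
Putting the pieces together, a schematic, runnable sketch of Idea #3: greedily train one auto-encoder per hidden layer on unlabeled data, keep each encoder and freeze it, then hand the pre-trained weights to supervised fine-tuning (which is just the Idea #1 backprop loop, started from this better initialization). All data, sizes, and hyperparameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(H, k, lr=0.5, epochs=30):
    """Greedy step: learn an encoder with k hidden units by reconstructing H."""
    d = H.shape[1]
    W  = rng.normal(scale=0.1, size=(k, d))
    Wp = rng.normal(scale=0.1, size=(d, k))
    for _ in range(epochs):
        for i in rng.permutation(len(H)):
            x  = H[i]
            z  = sigmoid(W @ x)
            xr = sigmoid(Wp @ z)
            d_out = (xr - x) * xr * (1 - xr)
            d_hid = (Wp.T @ d_out) * z * (1 - z)
            Wp -= lr * np.outer(d_out, z)
            W  -= lr * np.outer(d_hid, x)
    return W                                   # keep the encoder, discard the decoder

X_unlabeled = rng.random(size=(1000, 20))      # plentiful unlabeled data
X_labeled   = rng.random(size=(300, 20))       # smaller labeled set
y = (X_labeled[:, 0] > 0.5).astype(float)

# Unsupervised pre-training: bottom-up, one auto-encoder per hidden layer.
weights, H = [], X_unlabeled
for k in [16, 16, 16]:
    W = train_autoencoder(H, k)
    weights.append(W)
    H = sigmoid(H @ W.T)                       # frozen features feed the next layer

# Supervised fine-tuning: add an output layer and backprop through ALL of
# `weights` on (X_labeled, y), i.e. Idea #1, but from a good starting point.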

Page 36:

Deep Network Training

• Idea #1:
  1. Supervised fine-tuning only

• Idea #2:
  1. Supervised layer-wise pre-training
  2. Supervised fine-tuning

• Idea #3:
  1. Unsupervised layer-wise pre-training
  2. Supervised fine-tuning

Page 37:

Comparison on MNIST

Training

[Figure: bar chart of % error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training); y-axis roughly 1.0–2.5]

• Results from Bengio et al. (2006) on MNIST digit classification task
• Percent error (lower is better)

Page 38:

Comparison on MNIST

Training

[Figure: bar chart of % error for Shallow Net, Idea #1 (Deep Net, no pre-training), Idea #2 (Deep Net, supervised pre-training), and Idea #3 (Deep Net, unsupervised pre-training); y-axis roughly 1.0–2.5]

• Results from Bengio et al. (2006) on MNIST digit classification task
• Percent error (lower is better)

Page 39:

Is layer-wise pre-training always necessary?

Training

In 2010, a record on a hand-writing recognition task was set by standard supervised backpropagation (our Idea #1).

How? A very fast implementation on GPUs.

See Ciresan et al. (2010)

Page 40:

Deep Learning

• Goal: learn features at different levels of abstraction
• Training can be tricky due to…
  – Nonconvexity
  – Vanishing gradients
• Unsupervised layer-wise pre-training can help with both!