
Page 1

Neural Networks

10-601 Introduction to Machine Learning

Matt Gormley

Lecture 12

Feb. 24, 2020

Machine Learning Department
School of Computer Science
Carnegie Mellon University

Page 2

Reminders

• Homework 4: Logistic Regression
  – Out: Wed, Feb. 19
  – Due: Fri, Feb. 28 at 11:59pm

• Today's In-Class Poll
  – http://p12.mlcourse.org

• Swapped lecture/recitation:
  – Lecture 14: Fri, Feb. 28
  – Recitation HW5: Mon, Mar. 02

Page 3

Q&A


Page 4

NEURAL NETWORKS


Page 5

A Recipe for Machine Learning

Background

1. Given training data:

2. Choose each of these:
– Decision function
  Examples: Linear regression, Logistic regression, Neural Network
– Loss function
  Examples: Mean-squared error, Cross Entropy

[Figure: example images labeled "Face", "Face", "Not a face"]
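In symbols, a standard way to write the four steps of this recipe (the slide's own notation may differ slightly) is:

```latex
\begin{align*}
&\text{1. Training data: } \mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N} \\
&\text{2. Decision function } \hat{y} = f_\theta(x^{(i)}) \text{ and loss function } \ell(\hat{y}, y^{(i)}) \in \mathbb{R} \\
&\text{3. Goal: } \theta^{*} = \arg\min_{\theta} \textstyle\sum_{i=1}^{N} \ell\big(f_\theta(x^{(i)}),\, y^{(i)}\big) \\
&\text{4. SGD update: } \theta^{(t+1)} = \theta^{(t)} - \eta_t \,\nabla_\theta\, \ell\big(f_\theta(x^{(i)}),\, y^{(i)}\big)
\end{align*}
```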

Page 6

A Recipe for Machine Learning

Background

1. Given training data:

2. Choose each of these:
– Decision function
– Loss function

3. Define goal:

4. Train with SGD:
(take small steps opposite the gradient)

Page 7

A Recipe for Machine Learning

Background

1. Given training data:

2. Choose each of these:
– Decision function
– Loss function

3. Define goal:

4. Train with SGD:
(take small steps opposite the gradient)

Gradients

Backpropagation can compute this gradient!

And it’s a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

Page 8

A Recipe for Machine Learning

Background

1. Given training data:

2. Choose each of these:
– Decision function
– Loss function

3. Define goal:

4. Train with SGD:
(take small steps opposite the gradient)

Goals for Today’s Lecture

1. Explore a new class of decision functions (Neural Networks)

2. Consider variants of this recipe for training

Page 9

Linear Regression

Decision Functions

Output

Input

[Figure: a single output unit connected to the inputs by weights θ1, θ2, θ3, …, θM]

y = h_θ(x) = σ(θᵀx), where σ(a) = a

Page 10

Logistic Regression

Decision Functions

Output

Input

[Figure: a single output unit connected to the inputs by weights θ1, θ2, θ3, …, θM]

y = h_θ(x) = σ(θᵀx), where σ(a) = 1 / (1 + exp(−a))
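The linear-regression and logistic-regression decision functions on the last two slides differ only in the choice of σ. A minimal sketch (the example input and weights below are hypothetical):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def h(theta, x, activation):
    # y = sigma(theta^T x); the choice of sigma is the only difference
    return activation(theta @ x)

x = np.array([1.0, 2.0, 0.5])        # hypothetical input
theta = np.array([0.1, -0.3, 0.8])   # hypothetical weights

y_linear = h(theta, x, lambda a: a)  # linear regression: sigma(a) = a
y_logistic = h(theta, x, sigmoid)    # logistic regression: sigma(a) = 1/(1+exp(-a))
print(y_linear, y_logistic)
```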

Page 11

Perceptron

Decision Functions

Output

Input

[Figure: a single output unit connected to the inputs by weights θ1, θ2, θ3, …, θM]

y = h_θ(x) = σ(θᵀx), where σ(a) = sign(a)

Page 12

Neural Network

Decision Functions

Output

Hidden Layer

Input

[Figure: a two-layer network; the inputs feed a hidden layer, and the hidden layer feeds a single output unit]

Page 13

Neural Network Model

[Figure: a neural network predicting the "Probability of being Alive" (output 0.6) from the independent variables Age = 34, Gender = 2, Stage = 4. The inputs connect through one layer of weights to two sigmoid hidden units (values .4 and .2), which connect through a second layer of weights to a sigmoid output unit.]

© Eric Xing @ CMU, 2006-2011

Page 14

"Combined logistic models"

[Figure: the same network, with the weights into a single sigmoid hidden unit highlighted; each hidden unit is itself a logistic model of the inputs.]

© Eric Xing @ CMU, 2006-2011

Page 15

[Figure: the same network, with a different subset of the weights highlighted; another combined logistic model feeding the "Probability of being Alive" output of 0.6.]

© Eric Xing @ CMU, 2006-2011

Page 16

[Figure: the same network with all weights shown (here with Gender = 1), the sigmoid hidden layer feeding the sigmoid output unit: "Probability of being Alive" = 0.6.]

© Eric Xing @ CMU, 2006-2011

Page 17

[Figure: the same network once more, with the hidden layer highlighted.]

Not really, no target for hidden units... (the hidden units cannot each be trained as their own logistic regression, since the training data provide no labels for the hidden units)

© Eric Xing @ CMU, 2006-2011

Page 18

From Biological to Artificial

The motivation for Artificial Neural Networks comes from biology…

Biological "Model"
• Neuron: an excitable cell
• Synapse: connection between neurons
• A neuron sends an electrochemical pulse along its synapses when a sufficient voltage change occurs
• Biological Neural Network: collection of neurons along some pathway through the brain

Artificial Model
• Neuron: node in a directed acyclic graph (DAG)
• Weight: multiplier on each edge
• Activation Function: nonlinear thresholding function, which allows a neuron to "fire" when the input value is sufficiently high
• Artificial Neural Network: collection of neurons into a DAG, which defines some differentiable function

Biological "Computation"
• Neuron switching time: ~0.001 sec
• Number of neurons: ~10¹⁰
• Connections per neuron: ~10⁴ to 10⁵
• Scene recognition time: ~0.1 sec

Artificial Computation
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processes

Slide adapted from Eric Xing

Page 19

Neural Networks

Chalkboard (a minimal sketch of these forward computations follows below):
– Example: Neural Network w/ 1 Hidden Layer
– Example: Neural Network w/ 2 Hidden Layers
– Example: Feed Forward Neural Network
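The chalkboard examples themselves are not reproduced in the slides; the sketch below shows what the feed-forward computation for one and two hidden layers might look like, with sigmoid activations throughout. All sizes and weights are hypothetical, and bias terms are omitted for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_1_hidden(x, alpha, beta):
    # one hidden layer: z = sigma(alpha x), y = sigma(beta^T z)
    z = sigmoid(alpha @ x)
    return sigmoid(beta @ z)

def forward_2_hidden(x, alpha1, alpha2, beta):
    # two hidden layers: apply the hidden transformation twice
    z1 = sigmoid(alpha1 @ x)
    z2 = sigmoid(alpha2 @ z1)
    return sigmoid(beta @ z2)

# Hypothetical sizes: M = 3 inputs, D = 4 units per hidden layer
rng = np.random.default_rng(0)
x = rng.normal(size=3)
alpha1 = rng.normal(size=(4, 3))
alpha2 = rng.normal(size=(4, 4))
beta = rng.normal(size=4)
print(forward_1_hidden(x, alpha1, beta))
print(forward_2_hidden(x, alpha1, alpha2, beta))
```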

Page 20

Logistic Regression

Decision Functions

y = h_θ(x) = σ(θᵀx), where σ(a) = 1 / (1 + exp(−a))

Output

Input

[Figure: a single output unit connected to the inputs by weights θ1, θ2, θ3, …, θM, with example images labeled "Face", "Face", "Not a face"]

Page 21

Logistic Regression

Decision Functions

y = h_θ(x) = σ(θᵀx), where σ(a) = 1 / (1 + exp(−a))

Output

Input

[Figure: a single output unit connected to the inputs by weights θ1, θ2, θ3, …, θM]

In-Class Example

x1 | x2 | y
 1 |  1 | 0
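For the in-class example point (x1 = 1, x2 = 1, y = 0), the prediction and its cross-entropy loss can be computed as below; the weights are invented purely for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, 1.0, 1.0])        # [bias, x1, x2] for the example point
theta = np.array([-1.0, 2.0, -0.5])  # hypothetical weights, purely for illustration

p = sigmoid(theta @ x)               # predicted P(y = 1 | x)
loss = -np.log(1.0 - p)              # cross-entropy loss for the true label y = 0
print(p, loss)
```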

Page 22

Neural Networks

Chalkboard:
– From linear regression to logistic regression
– From logistic regression to a neural network

Page 23

Logistic Regression

Decision Functions

y = h_θ(x) = σ(θᵀx), where σ(a) = 1 / (1 + exp(−a))

Output

Input

[Figure: a single output unit connected to the inputs by weights θ1, θ2, θ3, …, θM, with example images labeled "Face", "Face", "Not a face"]

Page 24

Logistic Regression

Decision Functions

y = h_θ(x) = σ(θᵀx), where σ(a) = 1 / (1 + exp(−a))

Output

Input

[Figure: a single output unit connected to the inputs by weights θ1, θ2, θ3, …, θM]

In-Class Example

x1 | x2 | y
 1 |  1 | 0

Page 25

Neural Network

Decision Functions

Neural Network for Classification

Output

Hidden Layer

Input

(A) Input: given x_i, ∀i
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(F) Loss: J = ½ (y − y⁽ᵈ⁾)²
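A direct sketch of quantities (A) through (F) in NumPy, with hypothetical sizes and weights; x[0] = 1 and z[0] = 1 serve as the bias terms implied by the sums starting at index 0:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, alpha, beta, y_d):
    # (A) input: x, with x[0] = 1 as the bias feature
    a = alpha @ x                             # (B) hidden (linear)
    z = np.concatenate(([1.0], sigmoid(a)))   # (C) hidden (sigmoid); z[0] = 1 is the bias unit
    b = beta @ z                              # (D) output (linear)
    y = sigmoid(b)                            # (E) output (sigmoid)
    J = 0.5 * (y - y_d) ** 2                  # (F) loss
    return y, J

# Hypothetical sizes and weights: M = 2 real inputs, D = 2 hidden units
x = np.array([1.0, 0.5, -1.2])                # [1, x1, x2]
alpha = np.array([[ 0.1, 0.8, -0.4],          # alpha[j, i]: one row per hidden unit
                  [-0.3, 0.2,  0.9]])
beta = np.array([0.05, 1.1, -0.7])            # beta[0] multiplies the hidden bias unit
print(forward(x, alpha, beta, y_d=1.0))
```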

Page 26

Neural Network Parameters

Question: Suppose you are training a one-hidden-layer neural network with sigmoid activations for binary classification.

True or False: There is a unique set of parameters that maximize the likelihood of the dataset above.

Answer: False. For example, permuting the hidden units (together with their incoming and outgoing weights) yields a different parameter vector with exactly the same likelihood, so the maximizer is not unique.

Page 27

ARCHITECTURES


Page 28

Neural Network Architectures

Even for a basic Neural Network, there are many design decisions to make (a sketch follows below):

1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
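These four design decisions map onto four knobs in code. A minimal sketch, with hypothetical names, sizes, and weights, and biases omitted for brevity:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    return np.maximum(0.0, v)

def mlp_forward(x, weights, activation=sigmoid):
    # decisions 1 & 2: depth = len(weights) - 1 hidden layers; width = rows of each matrix
    h = x
    for W in weights[:-1]:
        h = activation(W @ h)        # decision 3: the nonlinearity is swappable
    return sigmoid(weights[-1] @ h)  # sigmoid output for binary classification

def cross_entropy(y_hat, y):
    # decision 4: one possible form of the objective function
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical architecture: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
x = rng.normal(size=3)
y_hat = mlp_forward(x, weights, activation=relu)
print(y_hat, cross_entropy(y_hat, 1.0))
```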

Page 29

Building a Neural Net

Output

Hidden Layer (D = M)

Input

Q: How many hidden units, D, should we use?

Page 30

Building a Neural Net

Output

Hidden Layer (D = M)

Input

Q: How many hidden units, D, should we use?

Page 31

Building a Neural Net

Output

Hidden Layer (D < M)

Input

Q: How many hidden units, D, should we use?

What method(s) is this setting similar to?

Page 32

Building a Neural Net

Output

Hidden Layer (D > M)

Input

Q: How many hidden units, D, should we use?

What method(s) is this setting similar to?

Page 33

Deeper Networks

Output

Hidden Layer 1

Input

Q: How many layers should we use?

Page 34

Deeper Networks

Output

Hidden Layer 2

Hidden Layer 1

Input

Q: How many layers should we use?

Page 35

Deeper Networks

Output

Hidden Layer 3

Hidden Layer 2

Hidden Layer 1

Input

Q: How many layers should we use?

Page 36

Deeper Networks

Output

Hidden Layer 1

Input

Q: How many layers should we use?

• Theoretical answer:
– A neural network with 1 hidden layer is a universal function approximator
– Cybenko (1989): For any continuous function g(x) and any ϵ > 0, there exists a 1-hidden-layer neural net h_θ(x) s.t. |h_θ(x) − g(x)| < ϵ for all x, assuming sigmoid activation functions

• Empirical answer:
– Before 2006: "Deep networks (e.g. 3 or more hidden layers) are too hard to train"
– After 2006: "Deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems"

Big caveat: You need to know and use the right tricks.
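For reference, a more careful statement of the theorem; the restriction to a compact domain (here the unit cube) is part of the result, even though the slide's informal version elides it:

```latex
\textbf{Theorem (Cybenko, 1989).}\ Let $\sigma$ be a continuous sigmoidal function. Then for any
continuous $g : [0,1]^{M} \to \mathbb{R}$ and any $\epsilon > 0$, there exist $D \in \mathbb{N}$,
$\beta_j, b_j \in \mathbb{R}$, and $\alpha_j \in \mathbb{R}^{M}$ such that
\[
  h_\theta(x) \;=\; \sum_{j=1}^{D} \beta_j\, \sigma\!\big(\alpha_j^{\top} x + b_j\big)
  \qquad\text{satisfies}\qquad
  \big|\, h_\theta(x) - g(x) \,\big| < \epsilon \quad \text{for all } x \in [0,1]^{M}.
\]
```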