CMPUT 651 (Fall 2019)
Last Lecture: Classification
• Logistic regression
  $y^{(j)} = \frac{1}{1 + e^{-(\theta_0 + \theta^\top x^{(j)})}}$
• Loss
  $J = \sum_{j=1}^{n} \left[ -t^{(j)} \log y^{(j)} - (1 - t^{(j)}) \log(1 - y^{(j)}) \right]$
• Optimization: gradient descent
  $\theta_i \leftarrow \theta_i - \alpha \frac{\partial J}{\partial \theta_i}$, where $\frac{\partial J}{\partial \theta_i} = (y - t) x_i$
• Softmax
  $p(Y = i) \triangleq y_i \propto \exp\{w_i^\top x\}$, i.e., $y_i = \frac{\exp\{w_i^\top x\}}{\sum_{i'} \exp\{w_{i'}^\top x\}}$
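A minimal NumPy sketch of this recap (binary logistic regression, cross-entropy loss, plain gradient-descent updates); the toy data, variable names, and learning rate are illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: 100 samples, 3 features, binary targets t in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # one row per sample x^(j)
t = (X[:, 0] + X[:, 1] > 0).astype(float)

theta0, theta = 0.0, np.zeros(3)         # bias and weight vector
alpha = 0.01                             # learning rate

for _ in range(500):
    y = sigmoid(theta0 + X @ theta)      # y^(j) = sigma(theta0 + theta^T x^(j))
    y = np.clip(y, 1e-12, 1 - 1e-12)     # numerical safety for the logs
    J = np.sum(-t * np.log(y) - (1 - t) * np.log(1 - y))   # cross-entropy loss
    grad_theta = X.T @ (y - t)           # dJ/dtheta_i = sum_j (y^(j) - t^(j)) x_i^(j)
    grad_theta0 = np.sum(y - t)
    theta -= alpha * grad_theta          # theta_i <- theta_i - alpha dJ/dtheta_i
    theta0 -= alpha * grad_theta0
```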
Last Lecture: NLP Tasks
• Part-of-speech tagging
• Chunking
• Dependency parsing
• Constituency parsing
[Figure: an example sentence (placeholder tokens "xxx xxx xxx xxx xxx") annotated with part-of-speech tags such as DT, NN, VB and with chunk/parse brackets]
Drawbacks
• Classification is linear
  - Deep learning
• Not modeling the relationship of labels within one data sample
  - Structured prediction
Decision Boundary
• Input feature: $x \in \mathbb{R}^n$
• Output candidate labels: $0, 1, \ldots, m - 1$
• Q: In which region of $\mathbb{R}^n$ does the model predict class $i$?
• The model outputs $i$ $\iff$ $i = \arg\max_{i'} p(Y = i' \mid x)$
• Considering class $i$ and class $j$ ($i \ne j$):
  - Class $i$ is preferred over class $j$ on the region $\{x \in \mathbb{R}^n : p(Y = i \mid x) > p(Y = j \mid x)\}$
  - $\frac{\exp\{w_i^\top x + b_i\}}{\sum_k \exp\{w_k^\top x + b_k\}} > \frac{\exp\{w_j^\top x + b_j\}}{\sum_k \exp\{w_k^\top x + b_k\}} \iff (w_i - w_j)^\top x + (b_i - b_j) > 0$
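A quick NumPy check of this claim (not from the slides): for random softmax parameters, comparing class probabilities agrees exactly with the sign of the linear function $(w_i - w_j)^\top x + (b_i - b_j)$.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
m, n = 4, 5                          # number of classes, feature dimension
W = rng.normal(size=(m, n))          # row i is w_i
b = rng.normal(size=m)
i, j = 0, 1

for _ in range(1000):
    x = rng.normal(size=n)
    p = softmax(W @ x + b)
    assert (p[i] > p[j]) == ((W[i] - W[j]) @ x + (b[i] - b[j]) > 0)
print("the decision boundary between classes i and j is linear in x")
```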
Decision Boundary
[Figure: illustration of the linear decision boundaries between classes]
XOR Problem
• Two classes: + (label 1) and − (label 0)
[Figure: the four XOR points in the $(x_1, x_2)$ plane, labeled + and −]
• Logistic regression:
  $y = 1 \iff \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}} \ge 0.5 \iff w_0 + w_1 x_1 + w_2 x_2 \ge 0$
• Hypothesis class: $\mathcal{H} = \{\text{all half-planes}\}$
Non-Linear Classification
• Non-linear mapping to some other space by human engineering
  - Input feature: $x \in \mathbb{R}^n$
  - Construct additional features: $x_i^2$, $x_i x_j$, $\sin(x_i)$, etc.
  - Apply linear models in the extended feature space $\tilde{x}$
  - Nonlinear in the original feature space $x$
Non-Linear Classification
• Non-linear mapping to some other space by human engineering
• Non-linear mapping to some other space by kernels
  - Define the “inner product” of all pairs of data samples in a way you believe in
  - Mercer’s Theorem: $K(\cdot, \cdot)$ is PSD $\iff$ $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ for some $\phi(\cdot)$
  - E.g., $K(x_i, x_j) = x_i^\top x_j$ $\Longrightarrow$ the original feature space
  - E.g., $K(x_i, x_j) = \exp\left\{ -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right\}$ $\Longrightarrow$ mapped to an “infinite-dimensional space”
  - Separability is good, but this relies too heavily on the choice of kernel.
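A small NumPy illustration of the two example kernels (a sketch, not course code): the linear kernel is the ordinary inner product in the original space, while the RBF kernel fills a Gram matrix of pairwise similarities that a kernel method would consume.

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi^T xj: the original feature space
    return xi @ xj

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)): implicit "infinite-dimensional" map
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 data samples, 3 features

# Gram matrix of "inner products" between all pairs of data samples
K = np.array([[rbf_kernel(a, c) for c in X] for a in X])
print(K.shape)                       # (5, 5); symmetric and PSD, per Mercer's theorem
```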
Non-Linear Classification
• Non-linear mapping to some other space by human engineering
• Non-linear mapping to some other space by kernels
• Non-linear mapping to some other space by learnable composite functions
  - Neural networks, or deep learning
XOR Problem
• What if we allow stacking logistic regression classifiers?
• Can we “program” in the “language” of LR stacks?
[Figure: the XOR points in the $(x_1, x_2)$ plane, as before]
XOR Problem
[Figure: the XOR points in the $(x_1, x_2)$ plane]
Point:        (0,0)  (0,1)  (1,0)  (1,1)
Classifier 1:   1      0      0      0
Classifier 2:   0      0      0      1
XOR Problem
Point:        (0,0)  (0,1)  (1,0)  (1,1)
Classifier 1:   1      0      0      0
Classifier 2:   0      0      0      1
[Figure: the same points re-plotted in the $(c_1, c_2)$ space of the two classifiers' outputs: $(0,1)$ and $(1,0)$ both map to $(0,0)$, $(0,0)$ maps to $(1,0)$, and $(1,1)$ maps to $(0,1)$, so the two classes become linearly separable]
LR Stack We Programmed
[Diagram: inputs $x_1$ and $x_2$ feed the two classifiers $c_1$ and $c_2$, whose outputs feed a final logistic unit $y$]
• Ungraded homework: Assign the actual weights (one possible assignment is sketched below)
• Student: I refuse to do the homework because
  - Programming is tedious
  - It is only feasible for simple problems
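For the ungraded homework, one possible assignment of the weights (my own choice of values, not an official solution): $c_1$ fires only at $(0,0)$, $c_2$ fires only at $(1,1)$, and the output unit fires when neither does, which is exactly XOR.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lr_stack(x1, x2):
    c1 = sigmoid(10 - 20 * x1 - 20 * x2)      # >= 0.5 only at (0, 0)
    c2 = sigmoid(-30 + 20 * x1 + 20 * x2)     # >= 0.5 only at (1, 1)
    y = sigmoid(10 - 20 * c1 - 20 * c2)       # fires when neither c1 nor c2 fires
    return y

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(lr_stack(x1, x2)))  # -> 0, 1, 1, 0
```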
Can We Learn the Weights?
• Yes, we can. Still by gradient descent.
Can We Compute the Gradient?
• Yes, $y$ is a differentiable function of the weights:
  $y = \frac{1}{1 + e^{-(w_0 + w_1 c_1 + w_2 c_2)}} = \frac{1}{1 + e^{-\left( w_0 + w_1 \frac{1}{1 + e^{-(w_{1,0} + w_{1,1} x_1 + w_{1,2} x_2)}} + w_2 \frac{1}{1 + e^{-(w_{2,0} + w_{2,1} x_1 + w_{2,2} x_2)}} \right)}}$
  with weights $w_0, w_1, w_2$ and $w_{1,0}, w_{1,1}, w_{1,2}, w_{2,0}, w_{2,1}, w_{2,2}$
• Problem: We need a systematic way of defining a deep architecture
Artificial Neural Network
• Perceptron [Rosenblatt, 1958]
  - $y = f(w^\top x + b)$
  - $f$: binary thresholding function
• Perceptron-like neuron/unit/node
  - $y = f(w^\top x + b)$
  - $f$: activation function (usually point-wise)
    - sigmoid
    - tanh
    - ReLU
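A minimal sketch of such a unit with the three activations listed above (the input, weights, and bias values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, f):
    # y = f(w^T x + b), with the activation f applied pointwise
    return f(w @ x + b)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
for f in (sigmoid, np.tanh, relu):
    print(f.__name__, neuron(x, w, b, f))
```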
Artificial Neural Network
• Neurons in our brain
[Image source: https://en.wikipedia.org/wiki/Neuron]
Artificial Neural Network
• Layer-wise fully connected neural networks
  - A layer of neurons
  - Multilayer perceptrons
A Note on Model Capacity
• Layer-wise fully connected neural networks
  - “Arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity.”
  Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), pp.303-314.
Forward Propagation
• Given input $x^{(0)}$, compute the output (say, of the $L$-th layer)
• Recursion
  - Initialization: the input $x^{(0)}$ is known
  - Recursive step: for each layer $l$, given $x^{(l-1)}$ we can compute $x^{(l)} = f(W^{(l)} x^{(l-1)} + b^{(l)})$
  - Termination: $l = L$
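A minimal NumPy sketch of this recursion (layer sizes, names, and the tanh activation are my own; the lecture's batched, vectorized conventions appear later in the slides):

```python
import numpy as np

def forward(x0, weights, biases, f=np.tanh):
    """Return x^(L) via the recursion x^(l) = f(W^(l) x^(l-1) + b^(l))."""
    x = x0                               # initialization: the input is known
    for W, b in zip(weights, biases):    # recursive step over layers l = 1, ..., L
        x = f(W @ x + b)
    return x                             # termination: l = L

# Example: a 3 -> 4 -> 2 network applied to a single input vector.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(rng.normal(size=3), weights, biases))
```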
Backward Propagation
• BP, Backprop, Back-propagation
• Compute derivatives (from output to input)
• Recursion (on what?)
  - On $\frac{\partial J}{\partial x^{(l)}}$
Backward Propagation
• BP, Backprop, Back-propagation
• Compute derivatives (from output to input)
• Recursion on $\frac{\partial J}{\partial x^{(l)}}$
  - Initialization: $\frac{\partial J}{\partial x^{(L)}}$ is given by the loss
Backward Propagation
• BP, Backprop, Back-propagation
• Compute derivatives (from output to input)
• Recursion
  - Initialization: $\frac{\partial J}{\partial x^{(L)}}$ is given by the loss
  - Recursive step: suppose $\frac{\partial J}{\partial x^{(l)}}$ is known; we compute $\frac{\partial J}{\partial x^{(l-1)}}$
Chain Rule
• Recursive step: suppose the gradient w.r.t. a layer's output, $\frac{\partial J}{\partial x^{(l+1)}}$, is known; we compute the gradient w.r.t. its input, $\frac{\partial J}{\partial x^{(l)}}$
[Diagram: one layer with inputs $x_i$ ($i = 1, \cdots, n$), outputs $y_j$ ($j = 1, \cdots, m$), weights $w_{ji}$ on the edges, and the loss $J$ above, where $x \triangleq x^{(l)}$ and $y \triangleq x^{(l+1)}$]
  - Forward: $z_j = w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jn} x_n + b_j$, $\quad y_j = f(z_j)$
  - Backward (along a single connection through $z_j$; summed over $j$ below): $\frac{\partial J}{\partial x_i} = \frac{\partial J}{\partial z_j} \frac{\partial z_j}{\partial x_i}$
Chain Rule
• Suppose $\frac{\partial J}{\partial x^{(l+1)}}$ is known; we compute $\frac{\partial J}{\partial x^{(l)}}$
  - Forward: $z_j = w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jn} x_n + b_j$, $\quad y_j = f(z_j)$
  - Backward: $\frac{\partial J}{\partial x_i} = \sum_j \frac{\partial J}{\partial z_j} \frac{\partial z_j}{\partial x_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i}$, if $y = f(z)$ is pointwise
Chain Rule
• Backward: $\frac{\partial J}{\partial x_i} = \sum_j \frac{\partial J}{\partial z_j} \frac{\partial z_j}{\partial x_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i}$, if $y = f(z)$ is pointwise
• Softmax derivative
  - $t$: one-hot representation of the ground truth
  - $\frac{\partial J}{\partial z_i} = y_i - t_i$
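For completeness, a short derivation of this identity (standard, not spelled out on the slide). With cross-entropy loss $J = -\sum_k t_k \log y_k$ and softmax $y_k = \exp\{z_k\} / \sum_{k'} \exp\{z_{k'}\}$, we have $\frac{\partial y_k}{\partial z_i} = y_k(\delta_{ki} - y_i)$, so

$\frac{\partial J}{\partial z_i} = -\sum_k \frac{t_k}{y_k} \frac{\partial y_k}{\partial z_i} = -\sum_k t_k (\delta_{ki} - y_i) = y_i \sum_k t_k - t_i = y_i - t_i,$

using that $t$ is one-hot, so $\sum_k t_k = 1$.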
Chain Rule
• Suppose $\frac{\partial J}{\partial x^{(l+1)}}$ is known; we compute $\frac{\partial J}{\partial x^{(l)}}$
  - Forward: $z_j = w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jn} x_n + b_j$, $\quad y_j = f(z_j)$
  - Backward: $\frac{\partial J}{\partial y_j}$ is good (previous slide), and
    $\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial z_j} x_i$, $\qquad \frac{\partial J}{\partial b_j} = \frac{\partial J}{\partial z_j}$
Back Propagation
• Initialization: $\frac{\partial J}{\partial x^{(L)}}$ is given by the loss
• Recursive step: $\frac{\partial J}{\partial y_j}$, $\frac{\partial J}{\partial w_{ji}}$, $\frac{\partial J}{\partial b_j}$ all good
• Termination: $\frac{\partial J}{\partial w_{ji}}$, $\frac{\partial J}{\partial b_j}$ done for all layers
Vectorized Implementation
• Forward propagation: $z = xW + b$, $\quad y = f(z)$
• Cheatsheet
  - $x \in \mathbb{R}^{N_{\text{data}} \times N_{\text{in}}}$, i.e., $x$: ⟨data × in⟩
  - $z, y$: ⟨data × out⟩
  - $W$: ⟨in × out⟩
  - $b$: ⟨out⟩
Vectorized Implementation
• Backward propagation (assuming the loss is a sum of per-sample losses):
  - $\frac{\partial J}{\partial z} = \frac{\partial J}{\partial y} \mathbin{.*} \left( \frac{\partial y}{\partial z} \right)_{\text{data} \times \text{out}}$ (pointwise product, not a Jacobian product)
  - $\frac{\partial J}{\partial x} = \frac{\partial J}{\partial z} W^\top$
  - $\frac{\partial J}{\partial W} = x^\top \cdot \frac{\partial J}{\partial z}$
  - $\frac{\partial J}{\partial b} = \left( \frac{\partial J}{\partial z} \right)^\top \cdot \mathbf{1}_{\text{data} \times 1}$
• Shapes: $x$: ⟨data × in⟩; $z, y$: ⟨data × out⟩; $W$: ⟨in × out⟩; $b$: ⟨out⟩
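A compact NumPy sketch of this cheatsheet for one fully connected layer with a pointwise activation (a sketch under the lecture's shape conventions; the tanh choice and the variable names are mine):

```python
import numpy as np

def layer_forward(x, W, b, f=np.tanh):
    # x: <data x in>,  W: <in x out>,  b: <out>
    z = x @ W + b                      # z: <data x out>
    y = f(z)                           # y: <data x out>
    return y, z

def layer_backward(dJ_dy, x, W, z):
    dy_dz = 1.0 - np.tanh(z) ** 2      # pointwise derivative of tanh, not a Jacobian
    dJ_dz = dJ_dy * dy_dz              # ".*": elementwise product, <data x out>
    dJ_dx = dJ_dz @ W.T                # <data x in>
    dJ_dW = x.T @ dJ_dz                # <in x out>
    dJ_db = dJ_dz.sum(axis=0)          # (dJ/dz)^T 1: sum over the data axis, <out>
    return dJ_dx, dJ_dW, dJ_db

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))            # 8 samples, 3 input features
W, b = rng.normal(size=(3, 4)), np.zeros(4)
y, z = layer_forward(x, W, b)
# Pretend the loss is J = sum(y), so dJ/dy is all ones.
dJ_dx, dJ_dW, dJ_db = layer_backward(np.ones_like(y), x, W, z)
```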
A Few More Thoughts
• Non-layerwise connections
  - Topological sort
• Multiple losses
  - BP is a linear system
• Tied weights
  - Total derivative
Autodiff in General
• Input: ⟨Layers, Edges, Losses⟩
• Algorithm:
  - Topological sort of all layers
  - Apply losses at their respective layers
  - For layer $L$ from last to first:
      for each lower layer $\ell$ of $L$:
        $\ell$.gradient += BP from $L$
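A toy Python sketch of this algorithm (my own simplification: a "layer" records its lower layers and a gradient accumulator, and `bp_edge` stands in for the per-edge backprop computation):

```python
class Layer:
    def __init__(self, lower=()):
        self.lower = list(lower)   # edges to lower layers
        self.gradient = 0.0        # accumulator for dJ/d(this layer)

def backprop(layers_in_topological_order, losses, bp_edge):
    # Apply losses at their respective layers.
    for layer, grad in losses.items():
        layer.gradient += grad
    # For each layer from last to first, push its gradient to its lower layers.
    for L in reversed(layers_in_topological_order):
        for l in L.lower:
            l.gradient += bp_edge(L, l)   # l.gradient += BP from L

# Tiny usage: a chain a -> b -> c with identity edges and a loss applied at c.
a = Layer(); b = Layer(lower=[a]); c = Layer(lower=[b])
backprop([a, b, c], losses={c: 1.0}, bp_edge=lambda upper, lower: upper.gradient)
print(a.gradient, b.gradient, c.gradient)   # 1.0 1.0 1.0
```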
Numerical Gradient Checking
• How can I know if my BP is correct?
• Definition of the partial derivative:
  $\frac{\partial}{\partial x_i} f(x_1, \cdots, x_i, \cdots, x_n) \overset{\text{def}}{=} \lim_{\delta \to 0} \frac{f(x_1, \cdots, x_i + \delta, \cdots, x_n) - f(x_1, \cdots, x_i, \cdots, x_n)}{\delta}$
• Numerical gradient checking:
  $\frac{\partial}{\partial x_i} f(x_1, \cdots, x_i, \cdots, x_n) \approx \frac{f(x_1, \cdots, x_i + \delta, \cdots, x_n) - f(x_1, \cdots, x_i - \delta, \cdots, x_n)}{2\delta}$
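A minimal NumPy sketch of this check (the tolerance, $\delta$, and the test function are my own choices): compare an analytic gradient against the central-difference estimate coordinate by coordinate.

```python
import numpy as np

def numerical_gradient(f, x, delta=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = delta
        # Central difference: (f(x + delta e_i) - f(x - delta e_i)) / (2 delta)
        grad[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return grad

# Example: the analytic gradient of f(x) = ||x||^2 is 2x.
x = np.array([0.3, -1.2, 2.0])
f = lambda v: np.sum(v ** 2)
assert np.allclose(numerical_gradient(f, x), 2 * x, atol=1e-6)
print("analytic gradient matches the numerical estimate")
```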
Practical Guide of Training DNN
• Weight initialization
• Batch normalization [Ioffe & Szegedy, 2015]
• Layer normalization [Ba et al., 2016]
• Dropout [Srivastava et al., 2014]
• Optimization algorithms (e.g., Adam) [Kingma & Ba, 2014]
• Bias-variance tradeoff for DL [Zhang et al., 2016]
References
Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Ba, J.L., Kiros, J.R. and Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), pp.1929-1958.
Kingma, D.P. and Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
Coding Assignment #2
• Due: Monday, Sep 30 (acceptable until Oct 7)
• Adapt the logistic regression in Assignment #1 to a two-layer neural network

Thank you! Q&A