CMPUT 651 (Fall 2019)
Last Lecture: Classification
• Logistic regression
  $y^{(j)} = \frac{1}{1 + e^{-(\theta_0 + \theta^\top x^{(j)})}}$
• Loss
  $J = \sum_{j=1}^{n} \left[ -t^{(j)} \log y^{(j)} - (1 - t^{(j)}) \log(1 - y^{(j)}) \right]$
• Optimization: gradient descent
  $\theta_i \leftarrow \theta_i - \alpha \frac{\partial J}{\partial \theta_i}$, where $\frac{\partial J}{\partial \theta_i} = (y - t) x_i$
• Softmax
  $p(Y = i) \triangleq y_i \propto \exp\{w_i^\top x\}$, i.e., $y_i = \frac{\exp\{w_i^\top x\}}{\sum_{i'} \exp\{w_{i'}^\top x\}}$
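A minimal NumPy sketch of this recap (binary logistic regression, cross-entropy loss, plain gradient-descent updates); the toy data, variable names, and learning rate are illustrative, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: 100 samples, 3 features, binary targets t in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # one row per sample x^(j)
t = (X[:, 0] + X[:, 1] > 0).astype(float)

theta0, theta = 0.0, np.zeros(3)         # bias and weight vector
alpha = 0.01                             # learning rate

for _ in range(500):
    y = sigmoid(theta0 + X @ theta)      # y^(j) = sigma(theta0 + theta^T x^(j))
    y = np.clip(y, 1e-12, 1 - 1e-12)     # numerical safety for the logs
    J = np.sum(-t * np.log(y) - (1 - t) * np.log(1 - y))   # cross-entropy loss
    grad_theta = X.T @ (y - t)           # dJ/dtheta_i = sum_j (y^(j) - t^(j)) x_i^(j)
    grad_theta0 = np.sum(y - t)
    theta -= alpha * grad_theta          # theta_i <- theta_i - alpha dJ/dtheta_i
    theta0 -= alpha * grad_theta0
```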
Last Lecture: NLP Tasks
• Part-of-speech tagging
• Chunking
• Dependency parsing
• Constituency parsing
[Figure: an example sentence (placeholder tokens "xxx xxx xxx xxx xxx") annotated with part-of-speech tags such as DT, NN, VB and with chunk/parse brackets]
Drawbacks
• Classification is linear
  - Deep learning
• Not modeling the relationship of labels within one data sample
  - Structured prediction
Decision Boundary
• Input feature: $x \in \mathbb{R}^n$
• Output candidate labels: $0, 1, \ldots, m - 1$
• Q: In which region of $\mathbb{R}^n$ does the model predict class $i$?
• The model outputs $i$ $\iff$ $i = \arg\max_{i'} p(Y = i' \mid x)$
• Considering class $i$ and class $j$ ($i \ne j$):
  - Class $i$ is preferred over class $j$ on the region $\{x \in \mathbb{R}^n : p(Y = i \mid x) > p(Y = j \mid x)\}$
  - $\frac{\exp\{w_i^\top x + b_i\}}{\sum_k \exp\{w_k^\top x + b_k\}} > \frac{\exp\{w_j^\top x + b_j\}}{\sum_k \exp\{w_k^\top x + b_k\}} \iff (w_i - w_j)^\top x + (b_i - b_j) > 0$
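A quick NumPy check of this claim (not from the slides): for random softmax parameters, comparing class probabilities agrees exactly with the sign of the linear function $(w_i - w_j)^\top x + (b_i - b_j)$.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
m, n = 4, 5                          # number of classes, feature dimension
W = rng.normal(size=(m, n))          # row i is w_i
b = rng.normal(size=m)
i, j = 0, 1

for _ in range(1000):
    x = rng.normal(size=n)
    p = softmax(W @ x + b)
    assert (p[i] > p[j]) == ((W[i] - W[j]) @ x + (b[i] - b[j]) > 0)
print("the decision boundary between classes i and j is linear in x")
```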
Decision Boundary
[Figure: illustration of the linear decision boundaries between classes]
XOR Problem
• Two classes: + (label 1) and − (label 0)
[Figure: the four XOR points in the $(x_1, x_2)$ plane, labeled + and −]
• Logistic regression:
  $y = 1 \iff \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}} \ge 0.5 \iff w_0 + w_1 x_1 + w_2 x_2 \ge 0$
• Hypothesis class: $\mathcal{H} = \{\text{all half-planes}\}$
Non-Linear Classification
• Non-linear mapping to some other space by human engineering
  - Input feature: $x \in \mathbb{R}^n$
  - Construct additional features: $x_i^2$, $x_i x_j$, $\sin(x_i)$, etc.
  - Apply linear models in the extended feature space $\tilde{x}$
  - Nonlinear in the original feature space $x$
Non-Linear Classification
• Non-linear mapping to some other space by human engineering
• Non-linear mapping to some other space by kernels
  - Define the “inner product” of all pairs of data samples in a way you believe in
  - Mercer’s Theorem: $K(\cdot, \cdot)$ is PSD $\iff$ $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ for some $\phi(\cdot)$
  - E.g., $K(x_i, x_j) = x_i^\top x_j$ $\Longrightarrow$ the original feature space
  - E.g., $K(x_i, x_j) = \exp\left\{ -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right\}$ $\Longrightarrow$ mapped to an “infinite-dimensional space”
  - Separability is good, but this relies too heavily on the choice of kernel.
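A small NumPy illustration of the two example kernels (a sketch, not course code): the linear kernel is the ordinary inner product in the original space, while the RBF kernel fills a Gram matrix of pairwise similarities that a kernel method would consume.

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi^T xj: the original feature space
    return xi @ xj

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)): implicit "infinite-dimensional" map
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 data samples, 3 features

# Gram matrix of "inner products" between all pairs of data samples
K = np.array([[rbf_kernel(a, c) for c in X] for a in X])
print(K.shape)                       # (5, 5); symmetric and PSD, per Mercer's theorem
```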
Non-Linear Classification
• Non-linear mapping to some other space by human engineering
• Non-linear mapping to some other space by kernels
• Non-linear mapping to some other space by learnable composite functions
  - Neural networks, or deep learning
XOR Problem
• What if we allow stacking logistic regression classifiers?
• Can we “program” in the “language” of LR stacks?
[Figure: the XOR points in the $(x_1, x_2)$ plane, as before]
XOR Problem
[Figure: the XOR points in the $(x_1, x_2)$ plane]
Point:        (0,0)  (0,1)  (1,0)  (1,1)
Classifier 1:   1      0      0      0
Classifier 2:   0      0      0      1
XOR Problem
Point:        (0,0)  (0,1)  (1,0)  (1,1)
Classifier 1:   1      0      0      0
Classifier 2:   0      0      0      1
[Figure: the same points re-plotted in the $(c_1, c_2)$ space of the two classifiers' outputs: $(0,1)$ and $(1,0)$ both map to $(0,0)$, $(0,0)$ maps to $(1,0)$, and $(1,1)$ maps to $(0,1)$, so the two classes become linearly separable]
LR Stack We Programmed
[Diagram: inputs $x_1$ and $x_2$ feed the two classifiers $c_1$ and $c_2$, whose outputs feed a final logistic unit $y$]
• Ungraded homework: Assign the actual weights (one possible assignment is sketched below)
• Student: I refuse to do the homework because
  - Programming is tedious
  - It is only feasible for simple problems
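For the ungraded homework, one possible assignment of the weights (my own choice of values, not an official solution): $c_1$ fires only at $(0,0)$, $c_2$ fires only at $(1,1)$, and the output unit fires when neither does, which is exactly XOR.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lr_stack(x1, x2):
    c1 = sigmoid(10 - 20 * x1 - 20 * x2)      # >= 0.5 only at (0, 0)
    c2 = sigmoid(-30 + 20 * x1 + 20 * x2)     # >= 0.5 only at (1, 1)
    y = sigmoid(10 - 20 * c1 - 20 * c2)       # fires when neither c1 nor c2 fires
    return y

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(lr_stack(x1, x2)))  # -> 0, 1, 1, 0
```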
Can We Learn the Weights?
• Yes, we can. Still by gradient descent.
Can We Compute the Gradient?
• Yes, $y$ is a differentiable function of the weights:
  $y = \frac{1}{1 + e^{-(w_0 + w_1 c_1 + w_2 c_2)}} = \frac{1}{1 + e^{-\left( w_0 + w_1 \frac{1}{1 + e^{-(w_{1,0} + w_{1,1} x_1 + w_{1,2} x_2)}} + w_2 \frac{1}{1 + e^{-(w_{2,0} + w_{2,1} x_1 + w_{2,2} x_2)}} \right)}}$
  with weights $w_0, w_1, w_2$ and $w_{1,0}, w_{1,1}, w_{1,2}, w_{2,0}, w_{2,1}, w_{2,2}$
• Problem: We need a systematic way of defining a deep architecture
Artificial Neural Network
• Perceptron [Rosenblatt, 1958]
  - $y = f(w^\top x + b)$
  - $f$: binary thresholding function
• Perceptron-like neuron/unit/node
  - $y = f(w^\top x + b)$
  - $f$: activation function (usually point-wise)
    - sigmoid
    - tanh
    - ReLU
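A minimal sketch of such a unit with the three activations listed above (the input, weights, and bias values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, f):
    # y = f(w^T x + b), with the activation f applied pointwise
    return f(w @ x + b)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
for f in (sigmoid, np.tanh, relu):
    print(f.__name__, neuron(x, w, b, f))
```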
Artificial Neural Network
• Neurons in our brain
[Image source: https://en.wikipedia.org/wiki/Neuron]
Artificial Neural Network
• Layer-wise fully connected neural networks
  - A layer of neurons
  - Multilayer perceptrons
A Note on Model Capacity
• Layer-wise fully connected neural networks
  - “Arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity.”
  Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), pp.303-314.
Forward Propagation
• Given input $x^{(0)}$, compute the output (say, of the $L$-th layer)
• Recursion
  - Initialization: the input $x^{(0)}$ is known
  - Recursive step: for each layer $l$, given $x^{(l-1)}$ we can compute $x^{(l)} = f(W^{(l)} x^{(l-1)} + b^{(l)})$
  - Termination: $l = L$
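A minimal NumPy sketch of this recursion (layer sizes, names, and the tanh activation are my own; the lecture's batched, vectorized conventions appear later in the slides):

```python
import numpy as np

def forward(x0, weights, biases, f=np.tanh):
    """Return x^(L) via the recursion x^(l) = f(W^(l) x^(l-1) + b^(l))."""
    x = x0                               # initialization: the input is known
    for W, b in zip(weights, biases):    # recursive step over layers l = 1, ..., L
        x = f(W @ x + b)
    return x                             # termination: l = L

# Example: a 3 -> 4 -> 2 network applied to a single input vector.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(rng.normal(size=3), weights, biases))
```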
Backward Propagation
• BP, Backprop, Back-propagation
• Compute derivatives (from output to input)
• Recursion (on what?)
  - On $\frac{\partial J}{\partial x^{(l)}}$
Backward Propagation
• BP, Backprop, Back-propagation
• Compute derivatives (from output to input)
• Recursion on $\frac{\partial J}{\partial x^{(l)}}$
  - Initialization: $\frac{\partial J}{\partial x^{(L)}}$ is given by the loss
Backward Propagation
• BP, Backprop, Back-propagation
• Compute derivatives (from output to input)
• Recursion
  - Initialization: $\frac{\partial J}{\partial x^{(L)}}$ is given by the loss
  - Recursive step: suppose $\frac{\partial J}{\partial x^{(l)}}$ is known; we compute $\frac{\partial J}{\partial x^{(l-1)}}$
Chain Rule
• Recursive step: suppose the gradient w.r.t. a layer's output, $\frac{\partial J}{\partial x^{(l+1)}}$, is known; we compute the gradient w.r.t. its input, $\frac{\partial J}{\partial x^{(l)}}$
[Diagram: one layer with inputs $x_i$ ($i = 1, \cdots, n$), outputs $y_j$ ($j = 1, \cdots, m$), weights $w_{ji}$ on the edges, and the loss $J$ above, where $x \triangleq x^{(l)}$ and $y \triangleq x^{(l+1)}$]
  - Forward: $z_j = w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jn} x_n + b_j$, $\quad y_j = f(z_j)$
  - Backward (along a single connection through $z_j$; summed over $j$ below): $\frac{\partial J}{\partial x_i} = \frac{\partial J}{\partial z_j} \frac{\partial z_j}{\partial x_i}$
Chain Rule
• Suppose $\frac{\partial J}{\partial x^{(l+1)}}$ is known; we compute $\frac{\partial J}{\partial x^{(l)}}$
  - Forward: $z_j = w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jn} x_n + b_j$, $\quad y_j = f(z_j)$
  - Backward: $\frac{\partial J}{\partial x_i} = \sum_j \frac{\partial J}{\partial z_j} \frac{\partial z_j}{\partial x_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i}$, if $y = f(z)$ is pointwise
Chain Rule
• Backward: $\frac{\partial J}{\partial x_i} = \sum_j \frac{\partial J}{\partial z_j} \frac{\partial z_j}{\partial x_i} = \sum_j \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i}$, if $y = f(z)$ is pointwise
• Softmax derivative
  - $t$: one-hot representation of the ground truth
  - $\frac{\partial J}{\partial z_i} = y_i - t_i$
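For completeness, a short derivation of this identity (standard, not spelled out on the slide). With cross-entropy loss $J = -\sum_k t_k \log y_k$ and softmax $y_k = \exp\{z_k\} / \sum_{k'} \exp\{z_{k'}\}$, we have $\frac{\partial y_k}{\partial z_i} = y_k(\delta_{ki} - y_i)$, so

$\frac{\partial J}{\partial z_i} = -\sum_k \frac{t_k}{y_k} \frac{\partial y_k}{\partial z_i} = -\sum_k t_k (\delta_{ki} - y_i) = y_i \sum_k t_k - t_i = y_i - t_i,$

using that $t$ is one-hot, so $\sum_k t_k = 1$.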
Chain Rule
• Suppose $\frac{\partial J}{\partial x^{(l+1)}}$ is known; we compute $\frac{\partial J}{\partial x^{(l)}}$
  - Forward: $z_j = w_{j1} x_1 + w_{j2} x_2 + \cdots + w_{jn} x_n + b_j$, $\quad y_j = f(z_j)$
  - Backward: $\frac{\partial J}{\partial y_j}$ is good (previous slide), and
    $\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial z_j} x_i$, $\qquad \frac{\partial J}{\partial b_j} = \frac{\partial J}{\partial z_j}$
Back Propagation
• Initialization: $\frac{\partial J}{\partial x^{(L)}}$ is given by the loss
• Recursive step: $\frac{\partial J}{\partial y_j}$, $\frac{\partial J}{\partial w_{ji}}$, $\frac{\partial J}{\partial b_j}$ all good
• Termination: $\frac{\partial J}{\partial w_{ji}}$, $\frac{\partial J}{\partial b_j}$ done for all layers
Vectorized Implementation
• Forward propagation: $z = xW + b$, $\quad y = f(z)$
• Cheatsheet
  - $x \in \mathbb{R}^{N_{\text{data}} \times N_{\text{in}}}$, i.e., $x$: ⟨data × in⟩
  - $z, y$: ⟨data × out⟩
  - $W$: ⟨in × out⟩
  - $b$: ⟨out⟩
Vectorized Implementation
• Backward propagation (assuming the loss is a sum of per-sample losses):
  - $\frac{\partial J}{\partial z} = \frac{\partial J}{\partial y} \mathbin{.*} \left( \frac{\partial y}{\partial z} \right)_{\text{data} \times \text{out}}$ (pointwise product, not a Jacobian product)
  - $\frac{\partial J}{\partial x} = \frac{\partial J}{\partial z} W^\top$
  - $\frac{\partial J}{\partial W} = x^\top \cdot \frac{\partial J}{\partial z}$
  - $\frac{\partial J}{\partial b} = \left( \frac{\partial J}{\partial z} \right)^\top \cdot \mathbf{1}_{\text{data} \times 1}$
• Shapes: $x$: ⟨data × in⟩; $z, y$: ⟨data × out⟩; $W$: ⟨in × out⟩; $b$: ⟨out⟩
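A compact NumPy sketch of this cheatsheet for one fully connected layer with a pointwise activation (a sketch under the lecture's shape conventions; the tanh choice and the variable names are mine):

```python
import numpy as np

def layer_forward(x, W, b, f=np.tanh):
    # x: <data x in>,  W: <in x out>,  b: <out>
    z = x @ W + b                      # z: <data x out>
    y = f(z)                           # y: <data x out>
    return y, z

def layer_backward(dJ_dy, x, W, z):
    dy_dz = 1.0 - np.tanh(z) ** 2      # pointwise derivative of tanh, not a Jacobian
    dJ_dz = dJ_dy * dy_dz              # ".*": elementwise product, <data x out>
    dJ_dx = dJ_dz @ W.T                # <data x in>
    dJ_dW = x.T @ dJ_dz                # <in x out>
    dJ_db = dJ_dz.sum(axis=0)          # (dJ/dz)^T 1: sum over the data axis, <out>
    return dJ_dx, dJ_dW, dJ_db

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))            # 8 samples, 3 input features
W, b = rng.normal(size=(3, 4)), np.zeros(4)
y, z = layer_forward(x, W, b)
# Pretend the loss is J = sum(y), so dJ/dy is all ones.
dJ_dx, dJ_dW, dJ_db = layer_backward(np.ones_like(y), x, W, z)
```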
A Few More Thoughts
• Non-layerwise connections
  - Topological sort
• Multiple losses
  - BP is a linear system
• Tied weights
  - Total derivative
Autodiff in General
• Input: ⟨Layers, Edges, Losses⟩
• Algorithm:
  - Topological sort of all layers
  - Apply losses at their respective layers
  - For layer $L$ from last to first:
      for each lower layer $\ell$ of $L$:
        $\ell$.gradient += BP from $L$
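A toy Python sketch of this algorithm (my own simplification: a "layer" records its lower layers and a gradient accumulator, and `bp_edge` stands in for the per-edge backprop computation):

```python
class Layer:
    def __init__(self, lower=()):
        self.lower = list(lower)   # edges to lower layers
        self.gradient = 0.0        # accumulator for dJ/d(this layer)

def backprop(layers_in_topological_order, losses, bp_edge):
    # Apply losses at their respective layers.
    for layer, grad in losses.items():
        layer.gradient += grad
    # For each layer from last to first, push its gradient to its lower layers.
    for L in reversed(layers_in_topological_order):
        for l in L.lower:
            l.gradient += bp_edge(L, l)   # l.gradient += BP from L

# Tiny usage: a chain a -> b -> c with identity edges and a loss applied at c.
a = Layer(); b = Layer(lower=[a]); c = Layer(lower=[b])
backprop([a, b, c], losses={c: 1.0}, bp_edge=lambda upper, lower: upper.gradient)
print(a.gradient, b.gradient, c.gradient)   # 1.0 1.0 1.0
```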
Numerical Gradient Checking
• How can I know if my BP is correct?
• Definition of the partial derivative:
  $\frac{\partial}{\partial x_i} f(x_1, \cdots, x_i, \cdots, x_n) \overset{\text{def}}{=} \lim_{\delta \to 0} \frac{f(x_1, \cdots, x_i + \delta, \cdots, x_n) - f(x_1, \cdots, x_i, \cdots, x_n)}{\delta}$
• Numerical gradient checking:
  $\frac{\partial}{\partial x_i} f(x_1, \cdots, x_i, \cdots, x_n) \approx \frac{f(x_1, \cdots, x_i + \delta, \cdots, x_n) - f(x_1, \cdots, x_i - \delta, \cdots, x_n)}{2\delta}$
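A minimal NumPy sketch of this check (the tolerance, $\delta$, and the test function are my own choices): compare an analytic gradient against the central-difference estimate coordinate by coordinate.

```python
import numpy as np

def numerical_gradient(f, x, delta=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = delta
        # Central difference: (f(x + delta e_i) - f(x - delta e_i)) / (2 delta)
        grad[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return grad

# Example: the analytic gradient of f(x) = ||x||^2 is 2x.
x = np.array([0.3, -1.2, 2.0])
f = lambda v: np.sum(v ** 2)
assert np.allclose(numerical_gradient(f, x), 2 * x, atol=1e-6)
print("analytic gradient matches the numerical estimate")
```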
Practical Guide of Training DNN
• Weight initialization
• Batch normalization [Ioffe & Szegedy, 2015]
• Layer normalization [Ba et al., 2016]
• Dropout [Srivastava et al., 2014]
• Optimization algorithms (e.g., Adam) [Kingma & Ba, 2014]
• Bias-variance tradeoff for DL [Zhang et al., 2016]
References
Ioffe, S. and Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Ba, J.L., Kiros, J.R. and Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), pp.1929-1958.
Kingma, D.P. and Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
Coding Assignment #2
• Due: Monday, Sep 30 (acceptable until Oct 7)
• Adapt the logistic regression in Assignment #1 to a two-layer neural network

Thank you! Q&A