Lecture 4: Backpropagation and Automatic Differentiation
CSE599W: Spring 2018
Announcement
• Assignment 1 is out today, due in 2 weeks (Apr 19th, 5pm)
Model Training Overview
[figure: layer1 extractor → layer2 extractor → predictor, trained against an objective]
Symbolic Differentiation
• Input formula is a symbolic expression tree (computation graph).
• Implement differentiation rules, e.g., the sum rule, product rule, and chain rule:

  d(f + g)/dx = df/dx + dg/dx
  d(f·g)/dx = (df/dx)·g + f·(dg/dx)
  dh(x)/dx = (df(g(x))/dg(x)) · (dg(x)/dx),  where h(x) = f(g(x))

✘ For complicated functions, the resulting expression can be exponentially large.
✘ Wasteful to keep intermediate symbolic expressions around if we only need a numeric value of the gradient in the end.
✘ Prone to error.
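The differentiation rules above can be sketched as a tiny recursive pass over an expression tree. This is only an illustration in plain Python (nested tuples as the tree; names are not the assignment's API), not the lecture's framework:

```python
# A minimal symbolic differentiator over expression trees, applying the
# sum and product rules from the slide. Expressions are nested tuples
# like ('+', a, b); this representation is illustrative only.

def diff(e, x):
    """Return the symbolic derivative of expression e w.r.t. variable x."""
    if isinstance(e, (int, float)):          # constant rule: dc/dx = 0
        return 0
    if isinstance(e, str):                   # variable: dx/dx = 1, dy/dx = 0
        return 1 if e == x else 0
    op, a, b = e
    if op == '+':                            # sum rule
        return ('+', diff(a, x), diff(b, x))
    if op == '*':                            # product rule
        return ('+', ('*', diff(a, x), b), ('*', a, diff(b, x)))
    raise ValueError(f"unknown op {op!r}")

def evaluate(e, env):
    """Numerically evaluate an expression tree given variable values."""
    if isinstance(e, (int, float)):
        return e
    if isinstance(e, str):
        return env[e]
    op, a, b = e
    va, vb = evaluate(a, env), evaluate(b, env)
    return va + vb if op == '+' else va * vb

# d/dx (x*x + 3*x) = 2x + 3, which is 7 at x = 2
expr = ('+', ('*', 'x', 'x'), ('*', 3, 'x'))
print(evaluate(diff(expr, 'x'), {'x': 2.0}))  # → 7.0
```

Note that even here the derivative tree is larger than the input tree; composing a few more products makes it blow up, which is the expression-swell problem the slide points out.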
Numerical Differentiation
• We can approximate the gradient using

  ∂f(θ)/∂θᵢ ≈ lim_{h→0} [f(θ + h·eᵢ) − f(θ)] / h

• Reduce the truncation error by using the center difference

  ∂f(θ)/∂θᵢ ≈ lim_{h→0} [f(θ + h·eᵢ) − f(θ − h·eᵢ)] / (2h)

[figure: numerical example perturbing one entry of θ in f(x, θ) = x ⋅ θ]
✘ Bad: rounding error, and slow to compute
✓ A powerful tool to check the correctness of an implementation; usually use h = 1e-6.
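A gradient check in the spirit of the slide can be sketched as follows; the test function and its analytic gradient are illustrative, and the center difference uses the suggested h = 1e-6:

```python
# Numerical gradient check: compare an analytic gradient against a
# center difference. The function f and analytic_grad are example
# choices, not from the lecture.
import math

def f(theta):
    x0, x1 = theta
    return math.sin(x0) * x1 + x1 ** 2

def analytic_grad(theta):
    x0, x1 = theta
    return [math.cos(x0) * x1, math.sin(x0) + 2 * x1]

def numeric_grad(f, theta, h=1e-6):
    grad = []
    for i in range(len(theta)):
        plus = list(theta); plus[i] += h        # theta + h*e_i
        minus = list(theta); minus[i] -= h      # theta - h*e_i
        grad.append((f(plus) - f(minus)) / (2 * h))  # center difference
    return grad

theta = [0.5, -1.3]
for g_a, g_n in zip(analytic_grad(theta), numeric_grad(f, theta)):
    assert abs(g_a - g_n) < 1e-6, (g_a, g_n)
print("gradient check passed")
```

Each coordinate needs two function evaluations, which is why this is slow as a gradient method but cheap enough as a correctness check.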
Backpropagation
[figure: operator f with inputs x, y and output z = f(x, y)]

Given the adjoint ∂J/∂z of the output, each operator computes

  ∂J/∂x = (∂J/∂z)(∂z/∂x)
  ∂J/∂y = (∂J/∂z)(∂z/∂y)

Computing the gradient becomes a local computation at each operator.
Backpropagation simple example

  f = 1 / (1 + e^{−(w1·x1 + w2·x2 + w3)})

[figure: computation graph — w1·x1 and w2·x2 each feed a multiply, their sum is added to w3, then ∗(−1), exp, +1, and 1/x produce the output; forward values and backward adjoints (1.0, −0.53, −0.53, −0.20, 0.20, …) are annotated on the edges, built up step by step across the slides]

Local gradient rules used at each node:
• f(x) = 1/x   → ∂f/∂x = −1/x²,  so ∂J/∂x = (∂J/∂f)·(−1/x²)
• f(x) = x + 1 → ∂f/∂x = 1
• f(x) = eˣ    → ∂f/∂x = eˣ,     so ∂J/∂x = (∂J/∂f)·eˣ
• f(x, y) = x·y → ∂f/∂x = y,  ∂f/∂y = x

Any problem? Can we do better?
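The node-by-node procedure in the figure can be written out directly. This sketch uses illustrative input values (the slide's own numbers are only partially legible in this transcript) and applies exactly the local rules listed above:

```python
# Forward and backward pass for f = 1/(1 + exp(-(w1*x1 + w2*x2 + w3))),
# one line per graph node. Input values are illustrative.
import math

w1, x1, w2, x2, w3 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass: keep every intermediate value.
a = w1 * x1          # mul
b = w2 * x2          # mul
c = a + b            # add
d = c + w3           # add
e = -d               # * -1
g = math.exp(e)      # exp
h = g + 1.0          # +1
f = 1.0 / h          # 1/x  -> final output

# Backward pass: each step multiplies the upstream adjoint by the
# node's local derivative (chain rule).
df = 1.0
dh = df * (-1.0 / h ** 2)    # d(1/x)/dx = -1/x^2
dg = dh * 1.0                # d(x+1)/dx = 1
de = dg * math.exp(e)        # d(e^x)/dx = e^x
dd = de * (-1.0)             # d(-x)/dx = -1
dc = dd * 1.0
dw3 = dd * 1.0
da = dc * 1.0
db = dc * 1.0
dw1, dx1 = da * x1, da * w1  # d(xy)/dx = y, d(xy)/dy = x
dw2, dx2 = db * x2, db * w2

print(round(f, 2), round(dw1, 2), round(dw2, 2))
```

With these inputs the output is 0.73 and dw1 ≈ −0.20; note how every forward intermediate (a, b, …, h) had to stay alive for the backward pass, which is exactly the memory problem raised on the next slide.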
Problems of backpropagation
• You always need to keep intermediate data in memory during the forward pass, in case it is needed during backpropagation.
• Lack of flexibility, e.g., computing the gradient of a gradient.
Automatic Differentiation (autodiff)
• Create a computation graph for the gradient computation

[figure: the forward graph for f = 1/(1 + e^{−(w1·x1 + w2·x2 + w3)}) is extended step by step with new gradient nodes rather than traversed in reverse — a −1/x² node for the 1/x step, ∗1 for the +1 step, a multiply for the exp step, ∗(−1) for the negation, and multiplies by the opposite operand for each x·y step, yielding graph outputs ∂y/∂w1 and ∂y/∂w2]

The same local rules as before (1/x → −1/x², x + 1 → 1, eˣ → eˣ, x·y → y and x) are applied, but each rule now adds nodes to the graph instead of immediately producing numbers.
AutoDiff Algorithm
[figure: softmax classifier graph — x and W feed matmult, then softmax, log, mul with the label y_, mean, and cross_entropy; the generated gradient path applies mul 1/batch_size, log-grad, softmax-grad, and matmult-transpose to produce W_grad]
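For this particular graph, the chain of generated gradient nodes composes into a closed form: W_grad = xᵀ(softmax(xW) − y_)/batch_size. A numpy sketch (shapes and data illustrative) checks this against the center-difference method from the earlier slide:

```python
# Softmax-classifier gradient: verify the composed gradient path
# (matmult-transpose, softmax-grad, log-grad, mul 1/batch_size)
# against a numerical center difference. Data is random/illustrative.
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 4, 5, 3
x = rng.normal(size=(batch_size, n_in))
W = rng.normal(size=(n_in, n_out))
y_ = np.eye(n_out)[rng.integers(n_out, size=batch_size)]  # one-hot labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    p = softmax(x @ W)
    return -np.mean(np.sum(y_ * np.log(p), axis=1))  # mean cross-entropy

# Analytic gradient: x^T (softmax(xW) - y_) / batch_size
W_grad = x.T @ (softmax(x @ W) - y_) / batch_size

# Numerical check with a center difference, h = 1e-6.
h = 1e-6
num = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += h; Wm[idx] -= h
    num[idx] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.max(np.abs(W_grad - num)))  # max discrepancy
```

The discrepancy should be tiny relative to the gradient magnitudes, confirming the generated gradient graph computes the right thing.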
AutoDiff Algorithm

def gradient(out):
    node_to_grad[out] = 1
    nodes = get_node_list(out)
    for node in reverse_topo_order(nodes):
        grad ← sum partial adjoints from output edges
        input_grads ← node.op.gradient(input, grad) for input in node.inputs
        add input_grads to node_to_grad
    return node_to_grad

[figure: the algorithm traced step by step on the graph x2 = exp(x1), x3 = x2 + x1, x4 = x2 × x3:
• node_to_grad starts as {x4: x̄4} with x̄4 = 1
• the × node emits x̄3 = x̄4 · x2 and a partial adjoint x̄2′ = x̄4 · x3
• the + node passes x̄3 through (id), giving a second partial adjoint x̄2″ for x2 and a partial adjoint for x1
• the partial adjoints for x2 are summed (a + node), x̄2 = x̄2′ + x̄2″, before the exp node contributes x̄2 × x2 to x̄1
• final node_to_grad: {x4: x̄4, x3: x̄3, x2: x̄2′ + x̄2″, x1: x̄1}]
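The pseudocode above can be fleshed out into a runnable sketch on the same example graph. One simplification to flag: the lecture's version (and Assignment 1) emits gradient *graph nodes*, while this sketch evaluates the adjoints numerically; class and method names here are illustrative, not the assignment's API:

```python
# Minimal reverse-mode autodiff in the style of gradient(out): walk the
# graph in reverse topological order, sum partial adjoints, and emit
# per-input gradients via each op's local rule.
import math
from collections import defaultdict

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, list(inputs), value

    def compute(self):
        """Forward evaluation of this node from its inputs' values."""
        vals = [n.value for n in self.inputs]
        if self.op == 'exp': self.value = math.exp(vals[0])
        elif self.op == 'add': self.value = vals[0] + vals[1]
        elif self.op == 'mul': self.value = vals[0] * vals[1]

    def input_grads(self, grad):
        """Adjoint contributions to each input (node.op.gradient)."""
        a = [n.value for n in self.inputs]
        if self.op == 'exp': return [grad * math.exp(a[0])]
        if self.op == 'add': return [grad, grad]           # identity to both
        if self.op == 'mul': return [grad * a[1], grad * a[0]]
        return []                                          # leaf input

def topo_order(out):
    order, seen = [], set()
    def visit(n):
        if id(n) in seen: return
        seen.add(id(n))
        for i in n.inputs: visit(i)
        order.append(n)
    visit(out)
    return order

def gradient(out):
    node_to_grad = defaultdict(float)      # summed partial adjoints
    node_to_grad[id(out)] = 1.0
    for node in reversed(topo_order(out)):
        grad = node_to_grad[id(node)]
        for inp, g in zip(node.inputs, node.input_grads(grad)):
            node_to_grad[id(inp)] += g     # add partial adjoint
    return node_to_grad

# The example graph from the slides: x2 = exp(x1), x3 = x2 + x1, x4 = x2 * x3.
x1 = Node('input', value=1.0)
x2 = Node('exp', [x1]); x2.compute()
x3 = Node('add', [x2, x1]); x3.compute()
x4 = Node('mul', [x2, x3]); x4.compute()

grads = gradient(x4)
print(grads[id(x1)])
```

At x1 = 1 the analytic derivative of x4 = eˣ(eˣ + x) is eˣ(eˣ + x) + eˣ(eˣ + 1) = 2e(e + 1) ≈ 20.21, which the traversal reproduces; note how x2's two partial adjoints (from the × and + nodes) are summed before being propagated through exp.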
Backpropagation vs AutoDiff
[figure: the same graph (x1 → exp → x2; x2, x1 → + → x3; x2, x3 → × → x4) shown side by side — backpropagation runs backward through the original nodes, while autodiff extends the graph with explicit gradient nodes (×, id, +) whose outputs are the gradients]
Recap
• Numerical differentiation
  • Tool to check the correctness of an implementation
• Backpropagation
  • Easy to understand and implement
  • Bad for memory use and schedule optimization
• Automatic differentiation
  • Generates gradient computation for the entire computation graph
  • Better for system optimization
References
• Automatic differentiation in machine learning: a survey. https://arxiv.org/abs/1502.05767
• CS231n backpropagation: http://cs231n.github.io/optimization-2/