Variants of RMSProp and Adagrad with Logarithmic Regret Bounds

Mahesh Chandra Mukkamala (1, 2), Matthias Hein (1)
(1) Saarland University, (2) Max Planck Institute for Informatics

Contributions

Motivation:

• Use RMSProp (Hinton et al., 2012) in the Online Convex Optimization framework.

• Use optimal algorithms for strongly convex problems to train Deep Neural Networks.

Main Contributions:

• Analyzed RMSProp (Hinton et al., 2012).
• Equivalence of RMSProp and Adagrad.
• Proposed SC-Adagrad and SC-RMSProp with log T-type optimal regret bounds (Hazan et al., 2007) for strongly convex problems.
• Better test accuracy on various Deep Nets.

Online convex optimization

Notation: In $\mathbb{R}^d$, $(a \odot b)_i = a_i b_i$ for $i = 1, \dots, d$, and $0 \in \mathbb{R}^d$. Let $A \succeq 0$, $z \in \mathbb{R}^d$, and let $C$ be a convex set; then
$P^A_C(z) = \arg\min_{x \in C} \|x - z\|_A^2$, where $\|x - z\|_A^2 = \langle x - z, A(x - z) \rangle$.

Online Learning setup: Let $C$ be a convex set.
for $t = 1, 2, \dots, T$ do
• We predict $\theta_t \in C$.
• The adversary gives $f_t : C \to \mathbb{R}$ (continuous convex).
• We suffer the loss $f_t(\theta_t)$ and update $\theta_t$ using $g_t \in \partial f_t(\theta_t)$.
Goal: To perform well w.r.t. $\theta^* = \arg\min_{\theta \in C} \sum_{t=1}^{T} f_t(\theta)$ and bound the regret $R(T) = \sum_{t=1}^{T} \big( f_t(\theta_t) - f_t(\theta^*) \big)$.

$\mu$-strongly convex function $f : C \to \mathbb{R}$: there exists $\mu \in \mathbb{R}^d_+$ such that for all $x, y \in C$,
$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \|y - x\|^2_{\operatorname{diag}(\mu)}$.
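The protocol as code, for intuition (a minimal sketch; `loss_grad_stream`, `project`, and `step` are our own placeholders, not part of the poster):

```python
import numpy as np

def online_loop(project, step, loss_grad_stream, theta, T):
    """Generic online learning loop: predict, suffer loss, update.

    loss_grad_stream(t, theta) returns (f_t(theta), g_t) for the adversary's
    convex loss f_t; step(t, theta, g) is the learner's update rule
    (e.g. online gradient descent); project maps the iterate back into C.
    """
    losses = []
    for t in range(1, T + 1):
        f_val, g = loss_grad_stream(t, theta)   # adversary reveals f_t; we suffer f_t(theta_t)
        losses.append(f_val)
        theta = project(step(t, theta, g))      # update theta_t using g_t in the subgradient of f_t
    return theta, np.sum(losses)                # regret = sum_t f_t(theta_t) - min_theta sum_t f_t(theta)
```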

Online Gradient Descent: $\theta_{t+1} = P_C(\theta_t - \alpha_t g_t)$
Convex $f_t$ (Zinkevich, 2003): $\alpha_t = O(1/\sqrt{t})$ gives $\sqrt{T}$-type optimal data-independent regret bounds.
Strongly convex $f_t$ (Hazan et al., 2007): $\alpha_t = O(1/t)$ gives $\log T$-type optimal data-independent regret bounds.

Adagrad (Duchi et al., 2011): $v_0 = 0$, $\alpha, \delta > 0$
$v_t = v_{t-1} + (g_t \odot g_t)$, $\quad A_t = \operatorname{diag}(\sqrt{v_t}) + \delta I$
$\theta_{t+1} = P^{A_t}_C(\theta_t - \alpha A_t^{-1} g_t)$
Main Idea: Adaptivity, with an effective step-size of $O(1/\sqrt{t})$.
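A minimal NumPy sketch of one Adagrad step for the unconstrained case $C = \mathbb{R}^d$ (the weighted projection $P^{A_t}_C$ is omitted; function name and defaults are ours):

```python
import numpy as np

def adagrad_step(theta, g, v, alpha=0.1, delta=1e-8):
    """One Adagrad update (Duchi et al., 2011), unconstrained case C = R^d."""
    v = v + g * g                   # v_t = v_{t-1} + g_t (element-wise squared gradient)
    A = np.sqrt(v) + delta          # diagonal of A_t = diag(sqrt(v_t)) + delta * I
    theta = theta - alpha * g / A   # theta_{t+1} = theta_t - alpha * A_t^{-1} g_t (projection omitted)
    return theta, v
```

Since v only grows, the per-coordinate step-size alpha / (sqrt(v) + delta) shrinks roughly like $1/\sqrt{t}$.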

Effective Step-size

Adagrad (Duchi et al., 2011):

$\alpha (A_T^{-1})_{ii} = \dfrac{\alpha}{\sqrt{\sum_{t=1}^{T} g_{t,i}^2} + \delta} = \dfrac{\alpha}{\sqrt{T}} \cdot \dfrac{1}{\sqrt{\frac{1}{T}\sum_{t=1}^{T} g_{t,i}^2} + \frac{\delta}{\sqrt{T}}}$

SC-Adagrad (Ours):

$\alpha (A_T^{-1})_{ii} = \dfrac{\alpha}{\sum_{t=1}^{T} g_{t,i}^2 + \delta_{T,i}} = \dfrac{\alpha}{T} \cdot \dfrac{1}{\frac{1}{T}\sum_{t=1}^{T} g_{t,i}^2 + \frac{\delta_{T,i}}{T}}$
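A tiny numeric check of the two decay rates (our own illustration with synthetic scalar gradients and a constant $\delta$ for simplicity; not from the poster):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.uniform(-1.0, 1.0, size=10_000)       # a bounded stream of scalar gradients
alpha, delta = 0.1, 1e-8
cum = np.cumsum(g * g)                        # running sum of squared gradients

adagrad_eff = alpha / (np.sqrt(cum) + delta)  # effective step-size of Adagrad
sc_adagrad_eff = alpha / (cum + delta)        # effective step-size of SC-Adagrad (constant delta)

T = np.array([10, 100, 1000, 10_000])
print(adagrad_eff[T - 1])                     # shrinks roughly like 1/sqrt(T)
print(sc_adagrad_eff[T - 1])                  # shrinks roughly like 1/T
```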

SC-Adagrad
With $\theta_1 \in C$, $\delta_0 > 0$, $v_0 = 0$, $\alpha > 0$
for $t = 1$ to $T$ do
  $g_t \in \partial f_t(\theta_t)$, $\quad v_t = v_{t-1} + (g_t \odot g_t)$
  Choose $0 < \delta_t \le \delta_{t-1}$ element-wise
  $A_t = \operatorname{diag}(v_t + \delta_t)$, $\quad \theta_{t+1} = P^{A_t}_C(\theta_t - \alpha A_t^{-1} g_t)$
end for
The decay scheme varies per dimension, since $\delta_t \in \mathbb{R}^d$.
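A minimal NumPy sketch of the SC-Adagrad loop for $C = \mathbb{R}^d$ (projection omitted), using the exponential decay scheme $\delta_t = \xi_2 e^{-\xi_1 v_t}$ described in the SC-RMSProp box below; `grad_fn` and the defaults are our own placeholders:

```python
import numpy as np

def sc_adagrad(grad_fn, theta, T, alpha=1.0, xi1=0.1, xi2=1.0):
    """Minimal SC-Adagrad loop, unconstrained case C = R^d (projection omitted)."""
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad_fn(theta)                 # g_t in the subgradient of f_t at theta_t
        v = v + g * g                      # v_t = v_{t-1} + squared gradient
        delta = xi2 * np.exp(-xi1 * v)     # 0 < delta_t <= delta_{t-1} element-wise (v is non-decreasing)
        A = v + delta                      # diagonal of A_t = diag(v_t + delta_t)  (no square root)
        theta = theta - alpha * g / A      # theta_{t+1} = theta_t - alpha * A_t^{-1} g_t
    return theta
```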

Logarithmic Regret Bounds

Let $\sup_{t \ge 1} \|g_t\|_\infty \le G_\infty$, $\sup_{t \ge 1} \|\theta_t - \theta^*\|_\infty \le D_\infty$, let every $f_t : C \to \mathbb{R}$ be $\mu$-strongly convex, and let $\alpha \ge \max_{i=1,\dots,d} \frac{G_\infty^2}{2\mu_i}$. Then the regret of SC-Adagrad for $T \ge 1$ satisfies

$R(T) \le \dfrac{D_\infty^2 \operatorname{tr}(\operatorname{diag}(\delta_1))}{2\alpha} + \dfrac{\alpha}{2} \sum_{i=1}^{d} \log\!\Big(\dfrac{v_{T,i} + \delta_{T,i}}{\delta_{1,i}}\Big) + \dfrac{1}{2} \sum_{i=1}^{d} \inf_{t \in [T]} \Big(\dfrac{(\theta_{t,i} - \theta_i^*)^2}{\alpha} - \dfrac{\alpha}{v_{t,i} + \delta_{t,i}}\Big)(\delta_{T,i} - \delta_{1,i})$

Data-dependent log T-type optimal regret bounds.
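For intuition, here is our own specialization (not stated on the poster): with a constant regularizer $\delta_t \equiv \delta \mathbf{1}$ the last term vanishes, and bounded gradients give

```latex
R(T) \;\le\; \frac{d\,\delta\, D_\infty^2}{2\alpha}
      \;+\; \frac{\alpha}{2} \sum_{i=1}^{d} \log\!\Big(\frac{v_{T,i} + \delta}{\delta}\Big)
      \;=\; O(d \log T),
\qquad \text{since } v_{T,i} \le T\, G_\infty^2 .
```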

RMSProp

RMSProp (Hinton et al., 2012): the most popular adaptive gradient method used in deep learning.
Idea: keep a moving average of the squared gradients.
Can we use RMSProp for online learning?

RMSProp (Ours): $v_0 = 0$, $\alpha, \delta > 0$, $0 < \gamma \le 1$
With $\beta_t = 1 - \frac{\gamma}{t}$, $\varepsilon_t = \delta/\sqrt{t}$, $\alpha_t = \alpha/\sqrt{t}$:
$v_t = \beta_t v_{t-1} + (1 - \beta_t)(g_t \odot g_t)$
$A_t = \operatorname{diag}(\sqrt{v_t}) + \varepsilon_t I$, $\quad \theta_{t+1} = P^{A_t}_C(\theta_t - \alpha_t A_t^{-1} g_t)$

For convex problems: $\sqrt{T}$-type regret bounds.
The original RMSProp is recovered by setting $\beta_t = 0.9$ and $\alpha_t = \alpha > 0$.
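A minimal NumPy sketch of RMSProp (Ours) for $C = \mathbb{R}^d$ (projection omitted; `grad_fn` and the defaults are ours):

```python
import numpy as np

def rmsprop_ours(grad_fn, theta, T, alpha=0.01, delta=1e-8, gamma=0.9):
    """Minimal RMSProp (Ours) loop, unconstrained case C = R^d (projection omitted)."""
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad_fn(theta)                       # g_t in the subgradient of f_t at theta_t
        beta = 1.0 - gamma / t                   # beta_t = 1 - gamma/t
        v = beta * v + (1.0 - beta) * g * g      # moving average of the squared gradient
        eps = delta / np.sqrt(t)                 # eps_t = delta / sqrt(t)
        step = alpha / np.sqrt(t)                # alpha_t = alpha / sqrt(t)
        A = np.sqrt(v) + eps                     # diagonal of A_t = diag(sqrt(v_t)) + eps_t * I
        theta = theta - step * g / A             # theta_{t+1} = theta_t - alpha_t * A_t^{-1} g_t
    return theta
```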

SC-RMSProp
SC-Adagrad + RMSProp = SC-RMSProp
We modify RMSProp (Ours) by:

• using $\varepsilon_t = \delta_t/t$ with $\delta_0 > 0$, where $0 < \delta_t \le \delta_{t-1}$ element-wise;
• setting $A_t = \operatorname{diag}(v_t + \varepsilon_t)$ and $\alpha_t = \alpha/t$.

Idea: the effective step-size is $O(1/t)$, which yields a $\log T$ regret bound.

New decay scheme: for SC-RMSProp choose $\delta_t = \xi_2 e^{-\xi_1 t v_t}$, and for SC-Adagrad $\delta_t = \xi_2 e^{-\xi_1 v_t}$ (element-wise).
Pros: enhanced adaptivity; stabilizes the quadratic growth of $v_t$ in $g_t$; exponential decay in $v_t$.
Rule of Thumb: $\xi_1 = 0.1$, $\xi_2 = 1$ for convex problems and $\xi_1 = 0.1$, $\xi_2 = 0.1$ for deep learning.
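A minimal NumPy sketch of SC-RMSProp with this decay scheme, again for $C = \mathbb{R}^d$ (projection omitted; `grad_fn` and the defaults are ours):

```python
import numpy as np

def sc_rmsprop(grad_fn, theta, T, alpha=0.01, gamma=0.9, xi1=0.1, xi2=1.0):
    """Minimal SC-RMSProp loop, unconstrained case C = R^d (projection omitted)."""
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad_fn(theta)                       # g_t in the subgradient of f_t at theta_t
        beta = 1.0 - gamma / t                   # beta_t = 1 - gamma/t
        v = beta * v + (1.0 - beta) * g * g      # moving average of the squared gradient
        delta = xi2 * np.exp(-xi1 * t * v)       # decay scheme delta_t = xi2 * exp(-xi1 * t * v_t)
        eps = delta / t                          # eps_t = delta_t / t
        step = alpha / t                         # alpha_t = alpha / t
        A = v + eps                              # diagonal of A_t = diag(v_t + eps_t)
        theta = theta - step * g / A             # theta_{t+1} = theta_t - alpha_t * A_t^{-1} g_t
    return theta
```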

Interesting Phenomenon

Choosing $\beta_t = 1 - \frac{1}{t}$ (i.e. $\gamma = 1$), the update steps coincide:

RMSProp (Ours) ≡ Adagrad
SC-RMSProp ≡ SC-Adagrad

This follows from a simple telescoping sum of $v_t$.
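The telescoping argument in one line (shown here for the RMSProp (Ours) / Adagrad pair with $v_0 = 0$; the SC-pair is analogous):

```latex
t\, v_t = t\Big(\tfrac{t-1}{t}\, v_{t-1} + \tfrac{1}{t}\, g_t \odot g_t\Big)
        = (t-1)\, v_{t-1} + g_t \odot g_t
        = \sum_{s=1}^{t} g_s \odot g_s ,
\quad\text{so}\quad
\frac{\alpha_t}{\sqrt{v_{t,i}} + \tfrac{\delta}{\sqrt{t}}}
  = \frac{\alpha/\sqrt{t}}{\sqrt{v_{t,i}} + \tfrac{\delta}{\sqrt{t}}}
  = \frac{\alpha}{\sqrt{\sum_{s=1}^{t} g_{s,i}^2} + \delta}.
```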

Experimental Setup:

• Only one parameter is varied: the step-size $\alpha$, chosen from {1, 0.1, 0.01, 0.001, 0.0001}.
• No method has an advantage just because it has more hyperparameters.
• For each method, the step-size that achieves the best performance (on the metric under consideration) at the end of training is reported.

Results of Residual Network, CNN and Softmax Regression

Algorithms: SGD (Bottou, 2010) (step-size $O(1/t)$ for strongly convex problems), Adam (step-size $O(1/\sqrt{t})$ for strongly convex problems), Adagrad, and RMSProp with $\beta_t = 0.9$ for all $t \ge 1$. With $\gamma = 0.9$ we use RMSProp (Ours) and SC-RMSProp (Ours), and finally SC-Adagrad (Ours). [CODE: github.com/mmahesh]

Figure 1: Plots of an 18-layer Residual Network (ResNet-18) on the CIFAR10 dataset. (a) Training Objective vs. Epoch; (b) Test Accuracy vs. Epoch.

Figure 2: Test Accuracy vs. Number of Epochs for a 4-layer Convolutional Neural Network. (a) CIFAR10; (b) MNIST.

Figure 3: Regret (log scale) vs. Dataset Proportion for Online L2-Regularized Softmax Regression. (a) CIFAR10; (b) CIFAR100.

References:
• Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011 (also COLT 2010).
• Hinton, G., Srivastava, N., and Swersky, K. Lecture 6d: A separate, adaptive learning rate for each connection. Slides of the lecture "Neural Networks for Machine Learning", 2012.
• Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3):169–192, 2007.