Variants of RMSProp and Adagrad with Logarithmic Regret Bounds

Mahesh Chandra Mukkamala (1, 2), Matthias Hein (1)
(1) Saarland University, (2) Max Planck Institute for Informatics

Contributions

Motivation:

• Use RMSProp (Hinton et al., 2012) in the Online Convex Optimization framework.

• Use optimal algorithms for strongly convex problems to train Deep Neural Networks.

Main Contributions:

• Analyzed RMSProp (Hinton et al., 2012).
• Equivalence of RMSProp and Adagrad.
• Proposed SC-Adagrad and SC-RMSProp with log T-type optimal regret bounds (Hazan et al., 2007) for strongly convex problems.
• Better test accuracy on various Deep Nets.

Online convex optimization

Notation: In $\mathbb{R}^d$, $(a \odot b)_i = a_i b_i$ for $i = 1, \dots, d$, and $0 \in \mathbb{R}^d$. Let $A \succeq 0$, $z \in \mathbb{R}^d$, and let $C$ be a convex set; then
$P^A_C(z) = \arg\min_{x \in C} \|x - z\|_A^2$, where $\|x - z\|_A^2 = \langle x - z, A(x - z) \rangle$.

Online Learning setup: Let $C$ be a convex set.
for $t = 1, 2, \dots, T$ do
• We predict $\theta_t \in C$.
• The adversary gives $f_t : C \to \mathbb{R}$ (continuous convex).
• We suffer the loss $f_t(\theta_t)$ and update $\theta_t$ using $g_t \in \partial f_t(\theta_t)$.
Goal: To perform well w.r.t. $\theta^* = \arg\min_{\theta \in C} \sum_{t=1}^{T} f_t(\theta)$ and bound the regret $R(T) = \sum_{t=1}^{T} \big( f_t(\theta_t) - f_t(\theta^*) \big)$.

$\mu$-strongly convex function $f : C \to \mathbb{R}$: there exists $\mu \in \mathbb{R}^d_+$ such that for all $x, y \in C$,
$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \|y - x\|^2_{\operatorname{diag}(\mu)}$.
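The protocol as code, for intuition (a minimal sketch; `loss_grad_stream`, `project`, and `step` are our own placeholders, not part of the poster):

```python
import numpy as np

def online_loop(project, step, loss_grad_stream, theta, T):
    """Generic online learning loop: predict, suffer loss, update.

    loss_grad_stream(t, theta) returns (f_t(theta), g_t) for the adversary's
    convex loss f_t; step(t, theta, g) is the learner's update rule
    (e.g. online gradient descent); project maps the iterate back into C.
    """
    losses = []
    for t in range(1, T + 1):
        f_val, g = loss_grad_stream(t, theta)   # adversary reveals f_t; we suffer f_t(theta_t)
        losses.append(f_val)
        theta = project(step(t, theta, g))      # update theta_t using g_t in the subgradient of f_t
    return theta, np.sum(losses)                # regret = sum_t f_t(theta_t) - min_theta sum_t f_t(theta)
```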

Online Gradient Descent: $\theta_{t+1} = P_C(\theta_t - \alpha_t g_t)$
Convex $f_t$ (Zinkevich, 2003): $\alpha_t = O(1/\sqrt{t})$ gives $\sqrt{T}$-type optimal data-independent regret bounds.
Strongly convex $f_t$ (Hazan et al., 2007): $\alpha_t = O(1/t)$ gives $\log T$-type optimal data-independent regret bounds.

Adagrad (Duchi et al., 2011): $v_0 = 0$, $\alpha, \delta > 0$
$v_t = v_{t-1} + (g_t \odot g_t)$, $\quad A_t = \operatorname{diag}(\sqrt{v_t}) + \delta I$
$\theta_{t+1} = P^{A_t}_C(\theta_t - \alpha A_t^{-1} g_t)$
Main Idea: Adaptivity, with an effective step-size of $O(1/\sqrt{t})$.
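A minimal NumPy sketch of one Adagrad step for the unconstrained case $C = \mathbb{R}^d$ (the weighted projection $P^{A_t}_C$ is omitted; function name and defaults are ours):

```python
import numpy as np

def adagrad_step(theta, g, v, alpha=0.1, delta=1e-8):
    """One Adagrad update (Duchi et al., 2011), unconstrained case C = R^d."""
    v = v + g * g                   # v_t = v_{t-1} + g_t (element-wise squared gradient)
    A = np.sqrt(v) + delta          # diagonal of A_t = diag(sqrt(v_t)) + delta * I
    theta = theta - alpha * g / A   # theta_{t+1} = theta_t - alpha * A_t^{-1} g_t (projection omitted)
    return theta, v
```

Since v only grows, the per-coordinate step-size alpha / (sqrt(v) + delta) shrinks roughly like $1/\sqrt{t}$.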

Effective Step-size

Adagrad (Duchi et al., 2011):

$\alpha (A_T^{-1})_{ii} = \dfrac{\alpha}{\sqrt{\sum_{t=1}^{T} g_{t,i}^2} + \delta} = \dfrac{\alpha}{\sqrt{T}} \cdot \dfrac{1}{\sqrt{\frac{1}{T}\sum_{t=1}^{T} g_{t,i}^2} + \frac{\delta}{\sqrt{T}}}$

SC-Adagrad (Ours):

$\alpha (A_T^{-1})_{ii} = \dfrac{\alpha}{\sum_{t=1}^{T} g_{t,i}^2 + \delta_{T,i}} = \dfrac{\alpha}{T} \cdot \dfrac{1}{\frac{1}{T}\sum_{t=1}^{T} g_{t,i}^2 + \frac{\delta_{T,i}}{T}}$
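A tiny numeric check of the two decay rates (our own illustration with synthetic scalar gradients and a constant $\delta$ for simplicity; not from the poster):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.uniform(-1.0, 1.0, size=10_000)       # a bounded stream of scalar gradients
alpha, delta = 0.1, 1e-8
cum = np.cumsum(g * g)                        # running sum of squared gradients

adagrad_eff = alpha / (np.sqrt(cum) + delta)  # effective step-size of Adagrad
sc_adagrad_eff = alpha / (cum + delta)        # effective step-size of SC-Adagrad (constant delta)

T = np.array([10, 100, 1000, 10_000])
print(adagrad_eff[T - 1])                     # shrinks roughly like 1/sqrt(T)
print(sc_adagrad_eff[T - 1])                  # shrinks roughly like 1/T
```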

SC-Adagrad
With $\theta_1 \in C$, $\delta_0 > 0$, $v_0 = 0$, $\alpha > 0$
for $t = 1$ to $T$ do
  $g_t \in \partial f_t(\theta_t)$, $\quad v_t = v_{t-1} + (g_t \odot g_t)$
  Choose $0 < \delta_t \le \delta_{t-1}$ element-wise
  $A_t = \operatorname{diag}(v_t + \delta_t)$, $\quad \theta_{t+1} = P^{A_t}_C(\theta_t - \alpha A_t^{-1} g_t)$
end for
The decay scheme varies per dimension, since $\delta_t \in \mathbb{R}^d$.
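A minimal NumPy sketch of the SC-Adagrad loop for $C = \mathbb{R}^d$ (projection omitted), using the exponential decay scheme $\delta_t = \xi_2 e^{-\xi_1 v_t}$ described in the SC-RMSProp box below; `grad_fn` and the defaults are our own placeholders:

```python
import numpy as np

def sc_adagrad(grad_fn, theta, T, alpha=1.0, xi1=0.1, xi2=1.0):
    """Minimal SC-Adagrad loop, unconstrained case C = R^d (projection omitted)."""
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad_fn(theta)                 # g_t in the subgradient of f_t at theta_t
        v = v + g * g                      # v_t = v_{t-1} + squared gradient
        delta = xi2 * np.exp(-xi1 * v)     # 0 < delta_t <= delta_{t-1} element-wise (v is non-decreasing)
        A = v + delta                      # diagonal of A_t = diag(v_t + delta_t)  (no square root)
        theta = theta - alpha * g / A      # theta_{t+1} = theta_t - alpha * A_t^{-1} g_t
    return theta
```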

Logarithmic Regret Bounds

Let $\sup_{t \ge 1} \|g_t\|_\infty \le G_\infty$, $\sup_{t \ge 1} \|\theta_t - \theta^*\|_\infty \le D_\infty$, let every $f_t : C \to \mathbb{R}$ be $\mu$-strongly convex, and let $\alpha \ge \max_{i=1,\dots,d} \frac{G_\infty^2}{2\mu_i}$. Then the regret of SC-Adagrad for $T \ge 1$ satisfies

$R(T) \le \dfrac{D_\infty^2 \operatorname{tr}(\operatorname{diag}(\delta_1))}{2\alpha} + \dfrac{\alpha}{2} \sum_{i=1}^{d} \log\!\Big(\dfrac{v_{T,i} + \delta_{T,i}}{\delta_{1,i}}\Big) + \dfrac{1}{2} \sum_{i=1}^{d} \inf_{t \in [T]} \Big(\dfrac{(\theta_{t,i} - \theta_i^*)^2}{\alpha} - \dfrac{\alpha}{v_{t,i} + \delta_{t,i}}\Big)(\delta_{T,i} - \delta_{1,i})$

Data-dependent log T-type optimal regret bounds.
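For intuition, here is our own specialization (not stated on the poster): with a constant regularizer $\delta_t \equiv \delta \mathbf{1}$ the last term vanishes, and bounded gradients give

```latex
R(T) \;\le\; \frac{d\,\delta\, D_\infty^2}{2\alpha}
      \;+\; \frac{\alpha}{2} \sum_{i=1}^{d} \log\!\Big(\frac{v_{T,i} + \delta}{\delta}\Big)
      \;=\; O(d \log T),
\qquad \text{since } v_{T,i} \le T\, G_\infty^2 .
```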

RMSProp

RMSProp (Hinton et al., 2012): the most popular adaptive gradient method used in deep learning.
Idea: keep a moving average of the squared gradients.
Can we use RMSProp for online learning?

RMSProp (Ours): $v_0 = 0$, $\alpha, \delta > 0$, $0 < \gamma \le 1$
With $\beta_t = 1 - \frac{\gamma}{t}$, $\varepsilon_t = \delta/\sqrt{t}$, $\alpha_t = \alpha/\sqrt{t}$:
$v_t = \beta_t v_{t-1} + (1 - \beta_t)(g_t \odot g_t)$
$A_t = \operatorname{diag}(\sqrt{v_t}) + \varepsilon_t I$, $\quad \theta_{t+1} = P^{A_t}_C(\theta_t - \alpha_t A_t^{-1} g_t)$

For convex problems: $\sqrt{T}$-type regret bounds.
The original RMSProp is recovered by setting $\beta_t = 0.9$ and $\alpha_t = \alpha > 0$.
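A minimal NumPy sketch of RMSProp (Ours) for $C = \mathbb{R}^d$ (projection omitted; `grad_fn` and the defaults are ours):

```python
import numpy as np

def rmsprop_ours(grad_fn, theta, T, alpha=0.01, delta=1e-8, gamma=0.9):
    """Minimal RMSProp (Ours) loop, unconstrained case C = R^d (projection omitted)."""
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad_fn(theta)                       # g_t in the subgradient of f_t at theta_t
        beta = 1.0 - gamma / t                   # beta_t = 1 - gamma/t
        v = beta * v + (1.0 - beta) * g * g      # moving average of the squared gradient
        eps = delta / np.sqrt(t)                 # eps_t = delta / sqrt(t)
        step = alpha / np.sqrt(t)                # alpha_t = alpha / sqrt(t)
        A = np.sqrt(v) + eps                     # diagonal of A_t = diag(sqrt(v_t)) + eps_t * I
        theta = theta - step * g / A             # theta_{t+1} = theta_t - alpha_t * A_t^{-1} g_t
    return theta
```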

SC-RMSProp
SC-Adagrad + RMSProp = SC-RMSProp
We modify RMSProp (Ours) by:

• using $\varepsilon_t = \delta_t/t$ with $\delta_0 > 0$, where $0 < \delta_t \le \delta_{t-1}$ element-wise;
• setting $A_t = \operatorname{diag}(v_t + \varepsilon_t)$ and $\alpha_t = \alpha/t$.

Idea: the effective step-size is $O(1/t)$, which yields a $\log T$ regret bound.

New decay scheme: for SC-RMSProp choose $\delta_t = \xi_2 e^{-\xi_1 t v_t}$, and for SC-Adagrad $\delta_t = \xi_2 e^{-\xi_1 v_t}$ (element-wise).
Pros: enhanced adaptivity; stabilizes the quadratic growth of $v_t$ in $g_t$; exponential decay in $v_t$.
Rule of Thumb: $\xi_1 = 0.1$, $\xi_2 = 1$ for convex problems and $\xi_1 = 0.1$, $\xi_2 = 0.1$ for deep learning.
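A minimal NumPy sketch of SC-RMSProp with this decay scheme, again for $C = \mathbb{R}^d$ (projection omitted; `grad_fn` and the defaults are ours):

```python
import numpy as np

def sc_rmsprop(grad_fn, theta, T, alpha=0.01, gamma=0.9, xi1=0.1, xi2=1.0):
    """Minimal SC-RMSProp loop, unconstrained case C = R^d (projection omitted)."""
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad_fn(theta)                       # g_t in the subgradient of f_t at theta_t
        beta = 1.0 - gamma / t                   # beta_t = 1 - gamma/t
        v = beta * v + (1.0 - beta) * g * g      # moving average of the squared gradient
        delta = xi2 * np.exp(-xi1 * t * v)       # decay scheme delta_t = xi2 * exp(-xi1 * t * v_t)
        eps = delta / t                          # eps_t = delta_t / t
        step = alpha / t                         # alpha_t = alpha / t
        A = v + eps                              # diagonal of A_t = diag(v_t + eps_t)
        theta = theta - step * g / A             # theta_{t+1} = theta_t - alpha_t * A_t^{-1} g_t
    return theta
```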

Interesting Phenomenon

Choosing $\beta_t = 1 - \frac{1}{t}$ (i.e. $\gamma = 1$), the update steps coincide:

RMSProp (Ours) ≡ Adagrad
SC-RMSProp ≡ SC-Adagrad

This follows from a simple telescoping sum of $v_t$.
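The telescoping argument in one line (shown here for the RMSProp (Ours) / Adagrad pair with $v_0 = 0$; the SC-pair is analogous):

```latex
t\, v_t = t\Big(\tfrac{t-1}{t}\, v_{t-1} + \tfrac{1}{t}\, g_t \odot g_t\Big)
        = (t-1)\, v_{t-1} + g_t \odot g_t
        = \sum_{s=1}^{t} g_s \odot g_s ,
\quad\text{so}\quad
\frac{\alpha_t}{\sqrt{v_{t,i}} + \tfrac{\delta}{\sqrt{t}}}
  = \frac{\alpha/\sqrt{t}}{\sqrt{v_{t,i}} + \tfrac{\delta}{\sqrt{t}}}
  = \frac{\alpha}{\sqrt{\sum_{s=1}^{t} g_{s,i}^2} + \delta}.
```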

Experimental Setup:

• Only one parameter is varied: the step-size $\alpha$, chosen from {1, 0.1, 0.01, 0.001, 0.0001}.
• No method has an advantage just because it has more hyperparameters.
• For each method, the step-size that achieves the best performance (on the metric under consideration) at the end of training is reported.

Results of Residual Network, CNN and Softmax Regression

Algorithms: SGD (Bottou, 2010) (step-size $O(1/t)$ for strongly convex problems), Adam (step-size $O(1/\sqrt{t})$ for strongly convex problems), Adagrad, and RMSProp with $\beta_t = 0.9$ for all $t \ge 1$. With $\gamma = 0.9$ we use RMSProp (Ours) and SC-RMSProp (Ours), and finally SC-Adagrad (Ours). [CODE: github.com/mmahesh]

Figure 1: Plots of an 18-layer Residual Network (ResNet-18) on the CIFAR10 dataset. (a) Training Objective vs. Epoch; (b) Test Accuracy vs. Epoch.

Figure 2: Test Accuracy vs. Number of Epochs for a 4-layer Convolutional Neural Network. (a) CIFAR10; (b) MNIST.

Figure 3: Regret (log scale) vs. Dataset Proportion for Online L2-Regularized Softmax Regression. (a) CIFAR10; (b) CIFAR100.

References:
• Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011 (also COLT 2010).
• Hinton, G., Srivastava, N., and Swersky, K. Lecture 6d: A separate, adaptive learning rate for each connection. Slides of the lecture "Neural Networks for Machine Learning", 2012.
• Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2–3):169–192, 2007.