ARUBA: Efficient and Adaptive Meta-Learning with Provable Guarantees

Misha Khodak · June 15, 2019
Based on joint work with: Nina Balcan, Ameet Talwalkar; Jeff Li, Sebastian Caldas, Ameet Talwalkar

Slides (69 pages): …mkhodak/docs/AMTL2019Slides.pdf · 2019. 6. 22.


Page 1:

Misha Khodak June 15, 2019

ARUBA: Efficient and Adaptive Meta-Learning with Provable Guarantees

Based on joint work with: •Nina Balcan, Ameet Talwalkar • Jeff Li, Sebastian Caldas, Ameet Talwalkar

Page 2:

Success of Gradient-Based Meta-Learning (GBML)

Few-Shot Learning

Federated Learning with Personalization

Meta Reinforcement Learning

Page 3:

GBML is simple & flexible

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_GBML for few-shot test task

Page 4:

GBML is simple & flexible…What is it doing?

Why/when do GBML methods work?

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_GBML for few-shot test task

Page 5:

GBML is simple & flexible…What is it doing?

Why/when do GBML methods work? ‣ Can we develop improved training algorithms? ‣ Can we personalize models while preserving privacy?

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_GBML for few-shot test task

Page 6:

Many GBML methods are Online Gradient Descent, twice

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_GBML for few-shot test task

Page 7:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

Many GBML methods are Online Gradient Descent, twice

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_Reptile for few-shot test task

e.g. Reptile [Nichol-Achiam-Schulman]

Page 8:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_Reptile for few-shot test task

e.g. Reptile [Nichol-Achiam-Schulman]
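The within-task/across-task loop above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' code: it assumes quadratic per-task losses ℓ_{t,i}(θ) = ½∥θ − w_t∥² so each task's gradients are easy to write down, and the names `within_task_ogd`, `reptile`, and `meta_lr` are ours.

```python
import numpy as np

def within_task_ogd(phi, w, m=10, eta=0.1):
    # OGD on the quadratic task loss l_i(theta) = 0.5*||theta - w||^2,
    # whose gradient at theta is (theta - w); start from initialization phi
    theta = phi.copy()
    for _ in range(m):
        theta -= eta * (theta - w)
    return theta

def reptile(task_opts, meta_lr=0.5, m=10):
    # Across-task loop: after each task, move phi toward that task's final iterate
    phi = np.zeros_like(task_opts[0])
    for w in task_opts:
        theta_hat = within_task_ogd(phi, w, m=m)
        phi = phi + meta_lr * (theta_hat - phi)  # update phi_{t+1} using theta_hat
    return phi

rng = np.random.default_rng(0)
center = np.array([1.0, -2.0])
tasks = [center + 0.1 * rng.standard_normal(2) for _ in range(50)]
phi = reptile(tasks)  # phi drifts toward the center of the task optima
```

When the task optima cluster tightly, the learned ϕ lands near their center, which is exactly the regime where meta-learned initializations help.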

Page 9:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_MAML for few-shot test task

MAML [Finn-Abbeel-Levine] ‣ Replace OGD by GD ‣ Update involves holdout data

Page 10:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_FedAvg for few-shot test task

MAML [Finn-Abbeel-Levine]: replace OGD by GD; update involves holdout data

FedAvg [McMahan et al.] ‣ Process k tasks in parallel ‣ Update aggregates over k tasks

Page 11:

GBML through the Lens of Online Learning

Online Learning

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 12:

GBML through the Lens of Online Learning

Online Learning: measure per-task performance via Regret

R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 13:

GBML through the Lens of Online Learning

Non-IID Data / Tasks: Models realistic settings (e.g. mobile, RL data; lifelong learning)

Online Learning: measure per-task performance via Regret

R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 14:

GBML through the Lens of Online Learning

Non-IID Data / Tasks: Models realistic settings (e.g. mobile, RL data; lifelong learning) IID Implications: Online-to-batch conversion results

Online Learning: measure per-task performance via Regret

R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 15:

GBML through the Lens of Online Learning

Non-IID Data / Tasks: Models realistic settings (e.g. mobile, RL data; lifelong learning) IID Implications: Online-to-batch conversion results Generality: Can adapt / generalize numerous online learning results to GBML setup

Online Learning: measure per-task performance via Regret

R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 16:

ARUBA: A Novel Theoretical Framework for GBML

Nina Balcan Ameet Talwalkar

Page 17:

Provides regret bounds for a sequence of online learning problems

ARUBA: A Novel Theoretical Framework for GBML

Page 18:

Provides regret bounds for a sequence of online learning problems

Theoretical focus on online convex optimization

ARUBA: A Novel Theoretical Framework for GBML

Page 19:

Provides regret bounds for a sequence of online learning problems

Theoretical focus on online convex optimization

Practical implications in nonconvex settings

ARUBA: A Novel Theoretical Framework for GBML

Page 20:

ARUBA Framework
‣ Few-Shot Learning and GBML
‣ An Illustrative Result

Applications

Page 21:

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i), where L : 𝒴 × 𝒴 ↦ ℝ

Single-Task Few-Shot Learning

Page 22:

Online Gradient Descent (OGD)
randomly initialize θ_1 ∈ Θ, η > 0
for i = 1, …, m:
  θ_{i+1} ← θ_i − η ∇ℓ_i(θ_i)

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i), where L : 𝒴 × 𝒴 ↦ ℝ

Single-Task Few-Shot Learning

Page 23:

Online Gradient Descent (OGD)
randomly initialize θ_1 ∈ Θ, η > 0
for i = 1, …, m:
  θ_{i+1} ← θ_i − η ∇ℓ_i(θ_i)

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i), where L : 𝒴 × 𝒴 ↦ ℝ

Cannot hope to do well when m is small

Single-Task Few-Shot Learning
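A tiny numeric check of the protocol above, under illustrative assumptions (not from the slides): one-dimensional quadratic losses ℓ_i(θ) = ½(θ − z_i)², for which the best fixed action in hindsight is the mean of the z_i. With η ∝ 1/√m, the realized regret stays far below what a badly placed fixed action would suffer.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 400
z = rng.uniform(-1.0, 1.0, size=m)  # loss l_i(theta) = 0.5*(theta - z_i)^2
eta = 1.0 / np.sqrt(m)

theta, losses = 0.0, []
for zi in z:
    losses.append(0.5 * (theta - zi) ** 2)  # suffer loss at the played action
    theta -= eta * (theta - zi)             # OGD step on the revealed gradient

best_fixed = z.mean()                        # best fixed action in hindsight
regret = sum(losses) - np.sum(0.5 * (best_fixed - z) ** 2)
# regret grows like sqrt(m), consistent with the O(D*sqrt(m)) bound
```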

Page 24:

Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

Online Gradient Descent (OGD)
randomly initialize θ_1 ∈ Θ, η > 0
for i = 1, …, m:
  θ_{i+1} ← θ_i − η ∇ℓ_i(θ_i)
  suffer ℓ_i(θ_i)

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

Page 25:

Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

Size of Action Space: D = diam(Θ)
OGD upper bound: R = 𝒪(D√m)

Online Gradient Descent (OGD)
Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

[Abernethy-Bartlett-Rakhlin-Tewari]

Page 26:

Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

Size of Action Space: D = diam(Θ)
OGD upper bound: R = 𝒪(D√m)
Matching lower bound (for any algorithm): R = Ω(D√m)

Online Gradient Descent (OGD)
Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

[Abernethy-Bartlett-Rakhlin-Tewari]

Page 27:

Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

Size of Action Space: D = diam(Θ)
OGD upper bound: R = 𝒪(D√m)
Matching lower bound (for any algorithm): R = Ω(D√m)

Key Question: can GBML do better?

Online Gradient Descent (OGD)
Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

[Abernethy-Bartlett-Rakhlin-Tewari]

Page 28:

Online Gradient Descent (OGD)
meta-initialize ϕ_t ∈ Θ, η > 0
for i = 1, …, m:
  θ_{i+1} ← θ_i − η ∇ℓ_i(θ_i)

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

Learn an initialization sequentially from the previous t tasks
GBML Meta-Testing, i.e., using ϕ_GBML

Key Question: can GBML do better on-average across tasks?

Page 29:

Average Regret and Task Similarity

Training Data: (x_{1,1}, y_{1,1}), …, (x_{T,m}, y_{T,m})
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_{t,i}(θ) = L(f_θ(x_{t,i}), y_{t,i})

Page 30:

Average Regret and Task Similarity

Training Data: (x_{1,1}, y_{1,1}), …, (x_{T,m}, y_{T,m})
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_{t,i}(θ) = L(f_θ(x_{t,i}), y_{t,i})

Average Regret: R̄ = (1/T) ∑_{t=1}^T R_t = (1/T) ∑_{t=1}^T ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t)

Page 31:

Average Regret and Task Similarity

Training Data: (x_{1,1}, y_{1,1}), …, (x_{T,m}, y_{T,m})
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_{t,i}(θ) = L(f_θ(x_{t,i}), y_{t,i})

Average Regret: R̄ = (1/T) ∑_{t=1}^T R_t = (1/T) ∑_{t=1}^T ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t)

Task Similarity: V² = min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂
V is small when the optimal task parameters are close together
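The task-similarity V above has a closed form: the minimizing ϕ is the mean of the task optima (the first-order condition of the quadratic). A small sketch with illustrative names:

```python
import numpy as np

def task_similarity(task_opts):
    # V^2 = min_phi (1/T) * sum_t ||theta*_t - phi||_2^2;
    # the minimizer is the mean of the task optima
    thetas = np.stack(task_opts)
    phi = thetas.mean(axis=0)
    V2 = np.mean(np.sum((thetas - phi) ** 2, axis=1))
    return np.sqrt(V2), phi

V, phi = task_similarity([np.array([0.0, 0.0]), np.array([2.0, 0.0])])
# phi = [1, 0] and V = 1 for these two task optima
```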

Page 32:

Task Similarity V

When optimal task parameters are close together, GBML leads to much better average performance

Our Guarantee: R̄ = 𝒪((V + (log T)/(V T)) √m)

Single-Task Lower Bound: R_t = Ω(D√m)
Multi-Task Lower Bound: R̄ = Ω(V√m)

ARUBA: An Illustrative Result

Page 33:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Recall: Reptile Algorithm

Page 34:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ*_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Recall: Reptile Algorithm

assume oracle access to optimum in hindsight

can be relaxed under a nondegeneracy assumption

Page 35:

Main Observation: single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.

Page 36:

Main Observation

e.g. OGD from ϕ (regret-upper-bound, RUB):

R_t = ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t) ≤ R̂_t(ϕ) = ∥θ*_t − ϕ∥²₂/(2η) + ηm

Single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.
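A quick numeric check of the trade-off inside the RUB: with the distance term equal to V, the choice η = V/√m used later in the talk balances the two terms and yields the (3/2)V√m quantity that appears in the final bound. The function name is illustrative.

```python
import numpy as np

def rub(dist, eta, m):
    # OGD regret-upper-bound: ||theta* - phi||^2 / (2*eta) + eta*m
    return dist ** 2 / (2 * eta) + eta * m

V, m = 0.5, 100
val = rub(V, V / np.sqrt(m), m)  # the talk's choice eta = V/sqrt(m)
# val equals 1.5 * V * sqrt(m): V*sqrt(m)/2 from the distance term, V*sqrt(m) from eta*m
```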

Page 37:

Main Observation

e.g. OGD from ϕ (regret-upper-bound, RUB):

R_t = ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t) ≤ R̂_t(ϕ) = ∥θ*_t − ϕ∥²₂/(2η) + ηm

Reduces GBML to OCO over a sequence of regret-upper-bounds

Single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.

Page 38:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

Step 1: Substitute RUB
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Page 39:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

Step 1: Substitute RUB (addition/subtraction of the best-in-hindsight RUB)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Page 40:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

OGD Regret-Upper-Bound (RUB): R̂(ϕ) = ∥θ* − ϕ∥²₂/(2η) + ηm

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Page 41:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: Across-task OGD
= (1/T) (∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm

Page 42:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: Across-task OGD
= (1/T) (∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm
= 𝒪((log T)/(V T)) √m + (3/2) V √m

Page 43:

Step 3: Impact of Task Relatedness

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: Across-task OGD (the task-similarity V enters here)
= (1/T) (∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm
= 𝒪((log T)/(V T)) √m + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm

Page 44:

Step 3: Impact of Task Relatedness

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: Across-task OGD
= (1/T) (∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm

Substitute η = V/√m:
= 𝒪((log T)/(V T)) √m + (3/2) V √m

Page 45:

Task Similarity V

Recap: What have we achieved?

When optimal task parameters are close together, GBML leads to much better average performance

Our Guarantee: R̄ = 𝒪((V + (log T)/(V T)) √m)

Single-Task Lower Bound: R_t = Ω(D√m)
Multi-Task Lower Bound: R̄ = Ω(V√m)

Page 46:

What can we get by applying ARUBA?

Adaptivity
‣ Learn any base-learner parameter from data
‣ e.g., improved training algorithms by learning ϕ and η simultaneously

Generality
‣ Low-dynamic-regret algorithms for changing task-environments
‣ Stronger online-to-batch conversions for faster statistical rates
‣ Specialized within-task algorithms, e.g., satisfying privacy guarantees


Page 48:

Excess transfer risk bounds on regularized SGD via online-to-batch conversion [Denevi-Ciliberto-Grazzi-Pontil]

Online learnability of MAML via Follow-the-Leader [Finn-Rajeswaran-Kakade-Levine]

Average regret bounds for an Online Mirror Descent meta-algorithm under a max-deviation assumption on task parameters [K-Balcan-Talwalkar]

ARUBA in Context of Three ICML’19 Papers

Page 49:

ARUBA Framework
Applications
‣ Adaptivity for Improved Training
‣ Federated Learning & Privacy-Preserving GBML

Page 50:

θ_{t,i+1} = θ_{t,i} − η_{t,i} ⊙ ∇_{t,i}
where ∇_{t,i} is the gradient on sample i from task t and η_{t,i} is a per-coordinate learning rate

Is learning an initialization good enough?
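The ⊙ in the update above is an elementwise (Hadamard) product: each coordinate gets its own step size. A one-line sketch with toy numbers of our own choosing:

```python
import numpy as np

theta = np.array([1.0, -1.0])
eta = np.array([0.1, 0.01])   # per-coordinate learning rate
grad = np.array([2.0, 2.0])   # gradient on one sample
theta = theta - eta * grad    # elementwise product: each coordinate steps differently
# theta is now [0.8, -1.02]
```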

Page 51:

Adaptively preconditioned gradient descent is popular in GBML

‣ Reptile uses Adam with accumulated gradient information

‣ Meta-SGD learns a per-coordinate learning rate for MAML [Li et al.]

Can we do this rigorously?

θ_{t,i+1} = θ_{t,i} − η_{t,i} ⊙ ∇_{t,i}   (∇_{t,i}: gradient on sample i from task t; η_{t,i}: per-coordinate learning rate)

Is learning an initialization good enough?


Page 53:

Applying ARUBA: the regret-upper-bound

η-preconditioned OGD from ϕ:

R̂_t(ϕ, η) = ∥θ*_t − ϕ∥²_{1/(2η)} + ∑_{i=1}^m ∥∇_{t,i}∥²_η   (Mahalanobis norms)

Single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.

Page 54:

Setting the learning rate

Learning rate for coordinate j at task t:  η_{t,j} = √(B_{t,j} + ε_t) / √(G_{t,j} + ζ_t)

B_{t,j} = (1/2) ∑_{s<t} (ϕ_{s,j} − θ*_{s,j})²   (sum of squared distances from initialization)
G_{t,j} = ∑_{s<t} ∑_{i=1}^m ∇²_{s,i,j}   (sum of squared gradients)
ε_t = o(t), ζ_t = o(t)·m   (smoothing terms)

Adaptive ARUBA
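The statistics above can be accumulated directly. This sketch assumes the reconstruction η_{t,j} = √(B_{t,j} + ε_t)/√(G_{t,j} + ζ_t) shown on the slide; the function name and toy inputs are illustrative, not the authors' implementation.

```python
import numpy as np

def adaptive_lr(phis, opts, grads, eps=0.0, zeta=0.0):
    # B_j: half the sum over past tasks s of squared distances (phi_s - theta*_s)_j^2
    B = 0.5 * np.sum((np.stack(phis) - np.stack(opts)) ** 2, axis=0)
    # G_j: sum over past tasks s and samples i of squared gradient coordinates
    G = sum(np.sum(g ** 2, axis=0) for g in grads)   # each g has shape (m, d)
    return np.sqrt(B + eps) / np.sqrt(G + zeta)

eta = adaptive_lr(
    phis=[np.zeros(2)],
    opts=[np.array([1.0, 2.0])],
    grads=[np.array([[1.0, 0.0], [0.0, 2.0]])],
)
# B = [0.5, 2.0] and G = [1.0, 4.0], so eta = sqrt([0.5, 0.5]) in both coordinates
```

Coordinates where past initializations already land close to the task optima (small B) automatically get small step sizes, which is the adaptivity the slide is after.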

Page 55:

Setting the learning rate

Learning rate for coordinate j at task t:  η_{t,j} = √(B_{t,j} + ε_t) / √(G_{t,j} + ζ_t)

B_{t,j} = (1/2) ∑_{s<t} (ϕ_{s,j} − θ*_{s,j})²   (sum of squared distances from initialization)
G_{t,j} = ∑_{s<t} ∑_{i=1}^m ∇²_{s,i,j}   (sum of squared gradients)
ε_t = o(t), ζ_t = o(t)·m   (smoothing terms)

Guarantee: 𝒪̃(1/T^{2/5}) convergence of the average regret to that of always using the optimal initialization and diagonal preconditioner

Adaptive ARUBA

Page 56:

Applying result to practical GBML with OGD base learners:

Connection to AdaGrad

ARUBA: η_{t,j} = √(B_{t,j} + ε_t) / √(G_{t,j} + ζ_t), where B_{t,j} is the sum of squared distances traveled on each task

Compare to AdaGrad: η_{t,j} = 1 / √(G_{t,j} + δ), with smoothing term δ = O(1), where G_{t,j} is the sum of all squared gradients across tasks

Page 57:

Reptile: preconditioning with ARUBA instead of Adam

Results on Omniglot character classification task, averaged over three runs

Improved Training via Adaptive ARUBA


Page 59:

ARUBA Framework
Applications
‣ Adaptivity for Improved Training
‣ Federated Learning & Privacy-Preserving GBML

Page 60:

Personalized Federated Learning

‣ Massively distributed ‣ Small sample sizes ‣ Privacy concerns ‣ Non-IID data and tasks ‣ Underlying task similarity

Page 61:

FedAvg ≈ Reptile [with a batch-averaged meta-update]

Page 62:

FedAvg ≈ Reptile [with a batch-averaged meta-update]

‣ Most popular algorithm in federated learning

‣ Usually run without personalization - just use the meta-initialization within-task
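One round of the FedAvg ≈ Reptile correspondence can be sketched as below: k clients run local gradient steps from the shared initialization, and the server averages the resulting models (a batch-averaged meta-update). The quadratic client losses and all names are illustrative assumptions, not the FedAvg paper's code.

```python
import numpy as np

def fedavg_round(phi, client_opts, local_steps=10, eta=0.1):
    # Each client runs local GD on its own loss 0.5*||theta - w||^2 starting
    # from the shared phi; the server then averages the k resulting models
    updated = []
    for w in client_opts:
        theta = phi.copy()
        for _ in range(local_steps):
            theta -= eta * (theta - w)
        updated.append(theta)
    return np.mean(updated, axis=0)

phi = fedavg_round(np.zeros(2), [np.array([2.0, 0.0]), np.array([0.0, 2.0])])
# each client moves a fraction (1 - 0.9**10) of the way to its own optimum,
# so the averaged phi is symmetric across the two coordinates
```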

Page 63:

Personalization in Federated Learning via Adaptive ARUBA

Results on Shakespeare next-character prediction task

‣ Meta-training: run FedAvg with ARUBA optimizer within-task

Page 64:

Personalization in Federated Learning via Adaptive ARUBA

Results on Shakespeare next-character prediction task

‣ Meta-training: run FedAvg with ARUBA optimizer within-task

‣ Meta-testing: use (preconditioned) OGD to learn a personalized model for each user

Page 65:

Private and Low-Risk Federated Learning via ARUBA

Ameet TalwalkarJeff Li Sebastian Caldas

Page 66:

Private and Low-Risk Federated Learning via ARUBA

Approach:
‣ ARUBA handles approximate meta-updates, e.g. due to noise from differential privacy
‣ Use private OCO algorithms within-task to get per-sample privacy
‣ Analysis of their regret-upper-bounds yields a GBML method that is private even to the central aggregator and has risk bounds improving with task similarity


Page 68:

Summary

ARUBA: a framework to obtain GBML algorithms with provable and mathematically interpretable guarantees via reduction to OCO.

New methods for meta-training, federated learning, private GBML.

Future directions:

• Beyond adversarial analysis within-task — can the base-learners be statistical or RL algos?

• Better multi-task optimizers, e.g. for regimes beyond few-shot learning

Page 69:

Thank You!

More Info: http://www.cs.cmu.edu/~mkhodak/

ARUBA: https://arxiv.org/abs/1906.02717