ARUBA: Efficient and Adaptive Meta-Learning with Provable Guarantees

Misha Khodak · June 15, 2019
Based on joint work with: Nina Balcan, Ameet Talwalkar; Jeff Li, Sebastian Caldas, Ameet Talwalkar

Slides (69 pages): …mkhodak/docs/AMTL2019Slides.pdf · 2019. 6. 22.


Page 1:

Misha Khodak June 15, 2019

ARUBA: Efficient and Adaptive Meta-Learning with Provable Guarantees

Based on joint work with: •Nina Balcan, Ameet Talwalkar • Jeff Li, Sebastian Caldas, Ameet Talwalkar

Page 2:

Success of Gradient-Based Meta-Learning (GBML)

Few-Shot Learning

Federated Learning with Personalization

Meta Reinforcement Learning

Page 3:

GBML is simple & flexible

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_GBML for few-shot test task

Page 4:

GBML is simple & flexible…What is it doing?

Why/when do GBML methods work?

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_GBML for few-shot test task

Page 5:

GBML is simple & flexible…What is it doing?

Why/when do GBML methods work? ‣ Can we develop improved training algorithms? ‣ Can we personalize models while preserving privacy?

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_GBML for few-shot test task

Page 6:

Many GBML methods are Online Gradient Descent, twice

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_GBML for few-shot test task

Page 7:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

Many GBML methods are Online Gradient Descent, twice

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_Reptile for few-shot test task

e.g. Reptile [Nichol-Achiam-Schulman]

Page 8:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_Reptile for few-shot test task

e.g. Reptile [Nichol-Achiam-Schulman]
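The within-task/across-task loop above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' code: it assumes quadratic per-task losses ℓ_{t,i}(θ) = ½∥θ − w_t∥² so each task's gradients are easy to write down, and the names `within_task_ogd`, `reptile`, and `meta_lr` are ours.

```python
import numpy as np

def within_task_ogd(phi, w, m=10, eta=0.1):
    # OGD on the quadratic task loss l_i(theta) = 0.5*||theta - w||^2,
    # whose gradient at theta is (theta - w); start from initialization phi
    theta = phi.copy()
    for _ in range(m):
        theta -= eta * (theta - w)
    return theta

def reptile(task_opts, meta_lr=0.5, m=10):
    # Across-task loop: after each task, move phi toward that task's final iterate
    phi = np.zeros_like(task_opts[0])
    for w in task_opts:
        theta_hat = within_task_ogd(phi, w, m=m)
        phi = phi + meta_lr * (theta_hat - phi)  # update phi_{t+1} using theta_hat
    return phi

rng = np.random.default_rng(0)
center = np.array([1.0, -2.0])
tasks = [center + 0.1 * rng.standard_normal(2) for _ in range(50)]
phi = reptile(tasks)  # phi drifts toward the center of the task optima
```

When the task optima cluster tightly, the learned ϕ lands near their center, which is exactly the regime where meta-learned initializations help.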

Page 9:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_MAML for few-shot test task

MAML [Finn-Abbeel-Levine] ‣ Replace OGD by GD ‣ Update involves holdout data

Page 10:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Input: T few-shot training tasks {𝒟_t}_{t=1}^T

Algorithm: General; only assumes gradient updates

Output: Initialization ϕ_FedAvg for few-shot test task

MAML [Finn-Abbeel-Levine]: replace OGD by GD; update involves holdout data

FedAvg [McMahan et al.] ‣ Process k tasks in parallel ‣ Update aggregates over k tasks

Page 11:

GBML through the Lens of Online Learning

Online Learning

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 12:

GBML through the Lens of Online Learning

Online Learning: measure per-task performance via Regret

R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 13:

GBML through the Lens of Online Learning

Non-IID Data / Tasks: Models realistic settings (e.g. mobile, RL data; lifelong learning)

Online Learning: measure per-task performance via Regret

R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 14:

GBML through the Lens of Online Learning

Non-IID Data / Tasks: Models realistic settings (e.g. mobile, RL data; lifelong learning) IID Implications: Online-to-batch conversion results

Online Learning: measure per-task performance via Regret

R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 15:

GBML through the Lens of Online Learning

Non-IID Data / Tasks: Models realistic settings (e.g. mobile, RL data; lifelong learning) IID Implications: Online-to-batch conversion results Generality: Can adapt / generalize numerous online learning results to GBML setup

Online Learning: measure per-task performance via Regret

R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

for i = 1, …, m:
  pick action θ_i ∈ Θ
  suffer loss ℓ_i(θ_i)

Page 16:

ARUBA: A Novel Theoretical Framework for GBML

Nina Balcan Ameet Talwalkar

Page 17:

Provides regret bounds for a sequence of online learning problems

ARUBA: A Novel Theoretical Framework for GBML

Page 18:

Provides regret bounds for a sequence of online learning problems

Theoretical focus on online convex optimization

ARUBA: A Novel Theoretical Framework for GBML

Page 19:

Provides regret bounds for a sequence of online learning problems

Theoretical focus on online convex optimization

Practical implications in nonconvex settings

ARUBA: A Novel Theoretical Framework for GBML

Page 20:

ARUBA Framework
‣ Few-Shot Learning and GBML
‣ An Illustrative Result

Applications

Page 21:

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i), where L : 𝒴 × 𝒴 ↦ ℝ

Single-Task Few-Shot Learning

Page 22:

Online Gradient Descent (OGD)
randomly initialize θ_1 ∈ Θ, η > 0
for i = 1, …, m:
  θ_{i+1} ← θ_i − η ∇ℓ_i(θ_i)

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i), where L : 𝒴 × 𝒴 ↦ ℝ

Single-Task Few-Shot Learning

Page 23:

Online Gradient Descent (OGD)
randomly initialize θ_1 ∈ Θ, η > 0
for i = 1, …, m:
  θ_{i+1} ← θ_i − η ∇ℓ_i(θ_i)

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i), where L : 𝒴 × 𝒴 ↦ ℝ

Cannot hope to do well when m is small

Single-Task Few-Shot Learning
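A tiny numeric check of the protocol above, under illustrative assumptions (not from the slides): one-dimensional quadratic losses ℓ_i(θ) = ½(θ − z_i)², for which the best fixed action in hindsight is the mean of the z_i. With η ∝ 1/√m, the realized regret stays far below what a badly placed fixed action would suffer.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 400
z = rng.uniform(-1.0, 1.0, size=m)  # loss l_i(theta) = 0.5*(theta - z_i)^2
eta = 1.0 / np.sqrt(m)

theta, losses = 0.0, []
for zi in z:
    losses.append(0.5 * (theta - zi) ** 2)  # suffer loss at the played action
    theta -= eta * (theta - zi)             # OGD step on the revealed gradient

best_fixed = z.mean()                        # best fixed action in hindsight
regret = sum(losses) - np.sum(0.5 * (best_fixed - z) ** 2)
# regret grows like sqrt(m), consistent with the O(D*sqrt(m)) bound
```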

Page 24:

Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

Online Gradient Descent (OGD)
randomly initialize θ_1 ∈ Θ, η > 0
for i = 1, …, m:
  θ_{i+1} ← θ_i − η ∇ℓ_i(θ_i)
  suffer ℓ_i(θ_i)

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

Page 25:

Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

Size of Action Space: D = diam(Θ)
OGD upper bound: R = 𝒪(D√m)

Online Gradient Descent (OGD)
Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

[Abernethy-Bartlett-Rakhlin-Tewari]

Page 26:

Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

Size of Action Space: D = diam(Θ)
OGD upper bound: R = 𝒪(D√m)
Matching lower bound (for any algorithm): R = Ω(D√m)

Online Gradient Descent (OGD)
Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

[Abernethy-Bartlett-Rakhlin-Tewari]

Page 27:

Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*), where θ* is the best fixed action in hindsight

Size of Action Space: D = diam(Θ)
OGD upper bound: R = 𝒪(D√m)
Matching lower bound (for any algorithm): R = Ω(D√m)

Key Question: can GBML do better?

Online Gradient Descent (OGD)
Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

[Abernethy-Bartlett-Rakhlin-Tewari]

Page 28:

Online Gradient Descent (OGD)
meta-initialize ϕ_t ∈ Θ, η > 0
for i = 1, …, m:
  θ_{i+1} ← θ_i − η ∇ℓ_i(θ_i)

Training Data: (x_1, y_1), …, (x_m, y_m)
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_i(θ) = L(f_θ(x_i), y_i)

Learn an initialization sequentially from the previous t tasks
GBML Meta-Testing, i.e., using ϕ_GBML

Key Question: can GBML do better on-average across tasks?

Page 29:

Average Regret and Task Similarity

Training Data: (x_{1,1}, y_{1,1}), …, (x_{T,m}, y_{T,m})
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_{t,i}(θ) = L(f_θ(x_{t,i}), y_{t,i})

Page 30:

Average Regret and Task Similarity

Training Data: (x_{1,1}, y_{1,1}), …, (x_{T,m}, y_{T,m})
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_{t,i}(θ) = L(f_θ(x_{t,i}), y_{t,i})

Average Regret: R̄ = (1/T) ∑_{t=1}^T R_t = (1/T) ∑_{t=1}^T ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t)

Page 31:

Average Regret and Task Similarity

Training Data: (x_{1,1}, y_{1,1}), …, (x_{T,m}, y_{T,m})
Hypothesis Class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss Function: ℓ_{t,i}(θ) = L(f_θ(x_{t,i}), y_{t,i})

Average Regret: R̄ = (1/T) ∑_{t=1}^T R_t = (1/T) ∑_{t=1}^T ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t)

Task Similarity: V² = min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂
V is small when the optimal task parameters are close together
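The task-similarity V above has a closed form: the minimizing ϕ is the mean of the task optima (the first-order condition of the quadratic). A small sketch with illustrative names:

```python
import numpy as np

def task_similarity(task_opts):
    # V^2 = min_phi (1/T) * sum_t ||theta*_t - phi||_2^2;
    # the minimizer is the mean of the task optima
    thetas = np.stack(task_opts)
    phi = thetas.mean(axis=0)
    V2 = np.mean(np.sum((thetas - phi) ** 2, axis=1))
    return np.sqrt(V2), phi

V, phi = task_similarity([np.array([0.0, 0.0]), np.array([2.0, 0.0])])
# phi = [1, 0] and V = 1 for these two task optima
```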

Page 32:

Task Similarity V

When optimal task parameters are close together, GBML leads to much better average performance

Our Guarantee: R̄ = 𝒪((V + (log T)/(V T)) √m)

Single-Task Lower Bound: R_t = Ω(D√m)
Multi-Task Lower Bound: R̄ = Ω(V√m)

ARUBA: An Illustrative Result

Page 33:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ̂_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Recall: Reptile Algorithm

Page 34:

for task t = 1, …, T:
  θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
  update ϕ_{t+1} using θ*_t

ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Recall: Reptile Algorithm

assume oracle access to optimum in hindsight

can be relaxed under a nondegeneracy assumption

Page 35:

Main Observation: single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.

Page 36:

Main Observation

e.g. OGD from ϕ (regret-upper-bound, RUB):

R_t = ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t) ≤ R̂_t(ϕ) = ∥θ*_t − ϕ∥²₂/(2η) + ηm

Single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.
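A quick numeric check of the trade-off inside the RUB: with the distance term equal to V, the choice η = V/√m used later in the talk balances the two terms and yields the (3/2)V√m quantity that appears in the final bound. The function name is illustrative.

```python
import numpy as np

def rub(dist, eta, m):
    # OGD regret-upper-bound: ||theta* - phi||^2 / (2*eta) + eta*m
    return dist ** 2 / (2 * eta) + eta * m

V, m = 0.5, 100
val = rub(V, V / np.sqrt(m), m)  # the talk's choice eta = V/sqrt(m)
# val equals 1.5 * V * sqrt(m): V*sqrt(m)/2 from the distance term, V*sqrt(m) from eta*m
```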

Page 37:

Main Observation

e.g. OGD from ϕ (regret-upper-bound, RUB):

R_t = ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t) ≤ R̂_t(ϕ) = ∥θ*_t − ϕ∥²₂/(2η) + ηm

Reduces GBML to OCO over a sequence of regret-upper-bounds

Single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.

Page 38:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

Step 1: Substitute RUB
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Page 39:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

Step 1: Substitute RUB (addition/subtraction of the best-in-hindsight RUB)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Page 40:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

OGD Regret-Upper-Bound (RUB): R̂(ϕ) = ∥θ* − ϕ∥²₂/(2η) + ηm

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Page 41:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: Across-task OGD
= (1/T) (∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm

Page 42:

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: Across-task OGD
= (1/T) (∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm
= 𝒪((log T)/(V T)) √m + (3/2) V √m

Page 43:

Step 3: Impact of Task Relatedness

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: Across-task OGD (the task-similarity V enters here)
= (1/T) (∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm
= 𝒪((log T)/(V T)) √m + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm

Page 44:

Step 3: Impact of Task Relatedness

3 Key Steps of ARUBA Framework
1. Within-task OGD: RUB controls performance as a data-dependent (θ*) function of ϕ
2. Across-task OGD: Choose initializations ϕ_t such that RUB is small on average
3. Task-Relatedness: Analyze its impact on the resulting bound

Step 1: Substitute RUB (addition/subtraction)
(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t) = (1/T) (∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: Across-task OGD
= (1/T) (∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm

Substitute η = V/√m:
= 𝒪((log T)/(V T)) √m + (3/2) V √m

Page 45:

Task Similarity V

Recap: What have we achieved?

When optimal task parameters are close together, GBML leads to much better average performance

Our Guarantee: R̄ = 𝒪((V + (log T)/(V T)) √m)

Single-Task Lower Bound: R_t = Ω(D√m)
Multi-Task Lower Bound: R̄ = Ω(V√m)

Page 46:

What can we get by applying ARUBA?

Adaptivity
‣ Learn any base-learner parameter from data
‣ e.g., improved training algorithms by learning ϕ and η simultaneously

Generality
‣ Low-dynamic-regret algorithms for changing task-environments
‣ Stronger online-to-batch conversions for faster statistical rates
‣ Specialized within-task algorithms, e.g., satisfying privacy guarantees


Page 48:

Excess transfer risk bounds on regularized SGD via online-to-batch conversion [Denevi-Ciliberto-Grazzi-Pontil]

Online learnability of MAML via Follow-the-Leader [Finn-Rajeswaran-Kakade-Levine]

Average regret bounds for an Online Mirror Descent meta-algorithm under a max-deviation assumption on task parameters [K-Balcan-Talwalkar]

ARUBA in Context of Three ICML’19 Papers

Page 49:

ARUBA Framework
Applications
‣ Adaptivity for Improved Training
‣ Federated Learning & Privacy-Preserving GBML

Page 50:

θ_{t,i+1} = θ_{t,i} − η_{t,i} ⊙ ∇_{t,i}
where ∇_{t,i} is the gradient on sample i from task t and η_{t,i} is a per-coordinate learning rate

Is learning an initialization good enough?
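The ⊙ in the update above is an elementwise (Hadamard) product: each coordinate gets its own step size. A one-line sketch with toy numbers of our own choosing:

```python
import numpy as np

theta = np.array([1.0, -1.0])
eta = np.array([0.1, 0.01])   # per-coordinate learning rate
grad = np.array([2.0, 2.0])   # gradient on one sample
theta = theta - eta * grad    # elementwise product: each coordinate steps differently
# theta is now [0.8, -1.02]
```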

Page 51:

Adaptively preconditioned gradient descent is popular in GBML

‣ Reptile uses Adam with accumulated gradient information

‣ Meta-SGD learns a per-coordinate learning rate for MAML [Li et al.]

Can we do this rigorously?

θ_{t,i+1} = θ_{t,i} − η_{t,i} ⊙ ∇_{t,i}   (∇_{t,i}: gradient on sample i from task t; η_{t,i}: per-coordinate learning rate)

Is learning an initialization good enough?


Page 53:

Applying ARUBA: the regret-upper-bound

η-preconditioned OGD from ϕ:

R̂_t(ϕ, η) = ∥θ*_t − ϕ∥²_{1/(2η)} + ∑_{i=1}^m ∥∇_{t,i}∥²_η   (Mahalanobis norms)

Single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.

Page 54:

Setting the learning rate

Learning rate for coordinate j at task t:  η_{t,j} = √(B_{t,j} + ε_t) / √(G_{t,j} + ζ_t)

B_{t,j} = (1/2) ∑_{s<t} (ϕ_{s,j} − θ*_{s,j})²   (sum of squared distances from initialization)
G_{t,j} = ∑_{s<t} ∑_{i=1}^m ∇²_{s,i,j}   (sum of squared gradients)
ε_t = o(t), ζ_t = o(t)·m   (smoothing terms)

Adaptive ARUBA
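The statistics above can be accumulated directly. This sketch assumes the reconstruction η_{t,j} = √(B_{t,j} + ε_t)/√(G_{t,j} + ζ_t) shown on the slide; the function name and toy inputs are illustrative, not the authors' implementation.

```python
import numpy as np

def adaptive_lr(phis, opts, grads, eps=0.0, zeta=0.0):
    # B_j: half the sum over past tasks s of squared distances (phi_s - theta*_s)_j^2
    B = 0.5 * np.sum((np.stack(phis) - np.stack(opts)) ** 2, axis=0)
    # G_j: sum over past tasks s and samples i of squared gradient coordinates
    G = sum(np.sum(g ** 2, axis=0) for g in grads)   # each g has shape (m, d)
    return np.sqrt(B + eps) / np.sqrt(G + zeta)

eta = adaptive_lr(
    phis=[np.zeros(2)],
    opts=[np.array([1.0, 2.0])],
    grads=[np.array([[1.0, 0.0], [0.0, 2.0]])],
)
# B = [0.5, 2.0] and G = [1.0, 4.0], so eta = sqrt([0.5, 0.5]) in both coordinates
```

Coordinates where past initializations already land close to the task optima (small B) automatically get small step sizes, which is the adaptivity the slide is after.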

Page 55:

Setting the learning rate

Learning rate for coordinate j at task t:  η_{t,j} = √(B_{t,j} + ε_t) / √(G_{t,j} + ζ_t)

B_{t,j} = (1/2) ∑_{s<t} (ϕ_{s,j} − θ*_{s,j})²   (sum of squared distances from initialization)
G_{t,j} = ∑_{s<t} ∑_{i=1}^m ∇²_{s,i,j}   (sum of squared gradients)
ε_t = o(t), ζ_t = o(t)·m   (smoothing terms)

Guarantee: 𝒪̃(1/T^{2/5}) convergence of the average regret to that of always using the optimal initialization and diagonal preconditioner

Adaptive ARUBA

Page 56:

Applying result to practical GBML with OGD base learners:

Connection to AdaGrad

ARUBA: η_{t,j} = √(B_{t,j} + ε_t) / √(G_{t,j} + ζ_t), where B_{t,j} is the sum of squared distances traveled on each task

Compare to AdaGrad: η_{t,j} = 1 / √(G_{t,j} + δ), with smoothing term δ = O(1), where G_{t,j} is the sum of all squared gradients across tasks

Page 57:

Reptile: preconditioning with ARUBA instead of Adam

Results on Omniglot character classification task, averaged over three runs

Improved Training via Adaptive ARUBA


Page 59:

ARUBA Framework
Applications
‣ Adaptivity for Improved Training
‣ Federated Learning & Privacy-Preserving GBML

Page 60:

Personalized Federated Learning

‣ Massively distributed ‣ Small sample sizes ‣ Privacy concerns ‣ Non-IID data and tasks ‣ Underlying task similarity

Page 61:

FedAvg ≈ Reptile [with a batch-averaged meta-update]

Page 62:

FedAvg ≈ Reptile [with a batch-averaged meta-update]

‣ Most popular algorithm in federated learning

‣ Usually run without personalization - just use the meta-initialization within-task
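One round of the FedAvg ≈ Reptile correspondence can be sketched as below: k clients run local gradient steps from the shared initialization, and the server averages the resulting models (a batch-averaged meta-update). The quadratic client losses and all names are illustrative assumptions, not the FedAvg paper's code.

```python
import numpy as np

def fedavg_round(phi, client_opts, local_steps=10, eta=0.1):
    # Each client runs local GD on its own loss 0.5*||theta - w||^2 starting
    # from the shared phi; the server then averages the k resulting models
    updated = []
    for w in client_opts:
        theta = phi.copy()
        for _ in range(local_steps):
            theta -= eta * (theta - w)
        updated.append(theta)
    return np.mean(updated, axis=0)

phi = fedavg_round(np.zeros(2), [np.array([2.0, 0.0]), np.array([0.0, 2.0])])
# each client moves a fraction (1 - 0.9**10) of the way to its own optimum,
# so the averaged phi is symmetric across the two coordinates
```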

Page 63:

Personalization in Federated Learning via Adaptive ARUBA

Results on Shakespeare next-character prediction task

‣ Meta-training: run FedAvg with ARUBA optimizer within-task

Page 64:

Personalization in Federated Learning via Adaptive ARUBA

Results on Shakespeare next-character prediction task

‣ Meta-training: run FedAvg with ARUBA optimizer within-task

‣ Meta-testing: use (preconditioned) OGD to learn a personalized model for each user

Page 65:

Private and Low-Risk Federated Learning via ARUBA

Ameet TalwalkarJeff Li Sebastian Caldas

Page 66:

Private and Low-Risk Federated Learning via ARUBA

Approach:
‣ ARUBA handles approximate meta-updates, e.g. due to noise from differential privacy
‣ Use private OCO algorithms within-task to get per-sample privacy
‣ Analysis of their regret-upper-bounds yields a GBML method that is private even to the central aggregator and has risk bounds improving with task similarity


Page 68:

Summary

ARUBA: a framework to obtain GBML algorithms with provable and mathematically interpretable guarantees via reduction to OCO.

New methods for meta-training, federated learning, private GBML.

Future directions:

• Beyond adversarial analysis within-task — can the base-learners be statistical or RL algos?

• Better multi-task optimizers, e.g. for regimes beyond few-shot learning

Page 69:

Thank You!

More Info: http://www.cs.cmu.edu/~mkhodak/

ARUBA: https://arxiv.org/abs/1906.02717