Misha Khodak
June 15, 2019

ARUBA: Efficient and Adaptive Meta-Learning with Provable Guarantees

Based on joint work with:
• Nina Balcan, Ameet Talwalkar
• Jeff Li, Sebastian Caldas, Ameet Talwalkar
Success of Gradient-Based Meta-Learning (GBML)
‣ Few-Shot Learning
‣ Federated Learning with Personalization
‣ Meta Reinforcement Learning
GBML is simple & flexible
Input: T few-shot training tasks {𝒟_t}_{t=1}^T
Algorithm: general; only assumes gradient updates
Output: initialization ϕ_GBML for a few-shot test task

GBML is simple & flexible… but what is it doing?
‣ Why/when do GBML methods work?
‣ Can we develop improved training algorithms?
‣ Can we personalize models while preserving privacy?
Many GBML methods are Online Gradient Descent, twice

for task t = 1,…,T:
    θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
    update ϕ_{t+1} using θ̂_t

e.g. Reptile [Nichol-Achiam-Schulman]: ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

MAML [Finn-Abbeel-Levine]
‣ Replaces OGD by GD within-task
‣ Meta-update involves holdout data

FedAvg [McMahan et al.]
‣ Processes k tasks in parallel
‣ Meta-update aggregates over the k tasks
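To make the two-level structure concrete, here is a minimal NumPy sketch of the generic loop above with a Reptile-style meta-update; the linear model, squared loss, and step sizes are illustrative assumptions, not any paper's exact implementation.

    import numpy as np

    def loss_grad(theta, x, y):
        # Gradient of the squared loss for a linear model; stands in for ∇ℓ_i(θ).
        return 2 * (theta @ x - y) * x

    def within_task_ogd(task_data, phi, lr=0.01):
        # Within-task OGD: one gradient step per sample, starting from the initialization ϕ.
        theta = phi.copy()
        for x, y in task_data:
            theta = theta - lr * loss_grad(theta, x, y)
        return theta

    def meta_train(tasks, d, meta_lr=0.1):
        # Across-task update: nudge the initialization toward each task's final
        # iterate, i.e. a Reptile-style update of ϕ_{t+1} using θ̂_t.
        phi = np.zeros(d)
        for task_data in tasks:
            theta_hat = within_task_ogd(task_data, phi)
            phi = phi + meta_lr * (theta_hat - phi)
        return phi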
GBML through the Lens of Online Learning

Online Learning protocol:
for i = 1,…,m: pick action θ_i ∈ Θ, then suffer loss ℓ_i(θ_i)

Measure per-task performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*),   where θ* is the best fixed action in hindsight

‣ Non-IID Data / Tasks: models realistic settings (e.g. mobile and RL data; lifelong learning)
‣ IID Implications: online-to-batch conversion results
‣ Generality: can adapt/generalize numerous online learning results to the GBML setup
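As a toy sketch of the regret definition (an illustration, not from the talk): run OGD on one-dimensional squared losses ℓ_i(θ) = (θ − y_i)², for which the best fixed action in hindsight is the mean of the y_i.

    import numpy as np

    rng = np.random.default_rng(0)
    ys = rng.normal(loc=1.0, scale=0.5, size=200)

    theta, lr, losses = 0.0, 0.05, []
    for y in ys:
        losses.append((theta - y) ** 2)   # suffer loss ℓ_i(θ_i)
        theta -= lr * 2 * (theta - y)     # OGD: θ_{i+1} ← θ_i − η∇ℓ_i(θ_i)

    theta_star = ys.mean()                # best fixed action in hindsight
    regret = sum(losses) - np.sum((ys - theta_star) ** 2)
    print(f"regret after {len(ys)} rounds: {regret:.2f}")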
ARUBA: Novel Theoretical Framework for GBML
with Nina Balcan and Ameet Talwalkar

‣ Provides regret bounds for a sequence of online learning problems
‣ Theoretical focus on online convex optimization
‣ Practical implications in nonconvex settings
ARUBA Framework
‣ Few-Shot Learning and GBML
‣ An Illustrative Result
Applications
Single-Task Few-Shot Learning

Training data: (x_1, y_1), …, (x_m, y_m)
Hypothesis class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss function: ℓ_i(θ) = L(f_θ(x_i), y_i), where L : 𝒴 × 𝒴 ↦ ℝ

Online Gradient Descent (OGD):
randomly initialize θ_1 ∈ Θ, pick η > 0
for i = 1,…,m: θ_{i+1} ← θ_i − η∇ℓ_i(θ_i)

Cannot hope to do well when m is small.
Single-Task Regret

Measure performance via Regret:
R = ∑_{i=1}^m ℓ_i(θ_i) − ℓ_i(θ*),   θ* the best fixed action in hindsight

OGD upper bound: R = 𝒪(D√m), where D = diam(Θ) is the size of the action space
Matching lower bound (for any algorithm): R = Ω(D√m) [Abernethy-Bartlett-Rakhlin-Tewari]

Key Question: can GBML do better?
GBML Meta-Testing, i.e., using ϕ_GBML

Learn an initialization sequentially from the previous t tasks:
meta-initialize ϕ_t ∈ Θ, pick η > 0
for i = 1,…,m: θ_{i+1} ← θ_i − η∇ℓ_i(θ_i), starting from θ_1 = ϕ_t

Key Question: can GBML do better on average across tasks?
Average Regret and Task Similarity

Training data: (x_{t,1}, y_{t,1}), …, (x_{t,m}, y_{t,m}) for each task t = 1,…,T
Hypothesis class: {f_θ : 𝒳 ↦ 𝒴 : θ ∈ Θ ⊂ ℝ^d}
Loss function: ℓ_{t,i}(θ) = L(f_θ(x_{t,i}), y_{t,i})

Average Regret:
R̄ = (1/T) ∑_{t=1}^T R_t = (1/T) ∑_{t=1}^T ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t)

Task Similarity:
V² = min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂
V is small when the optimal task parameters are close together.
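A small sketch of how V could be computed from the task optima (assuming, for illustration, that the θ*_t are known); the minimizer of the average squared distance is simply their mean.

    import numpy as np

    def task_similarity(theta_stars):
        # V² = min_ϕ (1/T) Σ_t ∥θ*_t − ϕ∥²₂; the minimizing ϕ is the mean of the θ*_t.
        theta_stars = np.asarray(theta_stars)  # shape (T, d)
        phi = theta_stars.mean(axis=0)
        return np.sqrt(np.mean(np.sum((theta_stars - phi) ** 2, axis=1)))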
Task Similarity V

When the optimal task parameters are close together, GBML yields much better average performance.

Our Guarantee: R̄ = 𝒪((V + log T / (VT)) √m)
Single-Task Lower Bound: R_t = Ω(D√m)
Multi-Task Lower Bound: R̄ = Ω(V√m)
ARUBA: An Illustrative Result

Recall: Reptile Algorithm
for task t = 1,…,T:
    θ̂_t = Within-task-OGD(𝒟_t, ϕ_t)
    update ϕ_{t+1} using θ*_t
ϕ_Reptile = Across-task-OGD({𝒟_t}_{t=1}^T)

Here we assume oracle access to the optimum in hindsight θ*_t; this can be relaxed under a nondegeneracy assumption.
Main Observation

Single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.

e.g. OGD from initialization ϕ has the regret-upper-bound (RUB):
R_t = ∑_{i=1}^m ℓ_{t,i}(θ_{t,i}) − ℓ_{t,i}(θ*_t) ≤ R̂_t(ϕ) = ∥θ*_t − ϕ∥²₂ / (2η) + ηm

This reduces GBML to OCO over a sequence of regret-upper-bounds.
3 Key Steps of ARUBA Framework

1. Within-task OGD: the RUB controls performance as a data-dependent (θ*_t) function of ϕ
2. Across-task OGD: choose the initializations ϕ_t so that the RUB is small on average
3. Task-Relatedness: analyze its impact on the resulting bound

OGD Regret-Upper-Bound (RUB): R̂_t(ϕ) = ∥θ*_t − ϕ∥²₂ / (2η) + ηm

Step 1: Substitute the RUB, then add and subtract its best value in hindsight:

(1/T) ∑_{t=1}^T R_t ≤ (1/T) ∑_{t=1}^T R̂_t(ϕ_t)
  = (1/T)(∑_{t=1}^T R̂_t(ϕ_t) − min_{ϕ∈Θ} ∑_{t=1}^T R̂_t(ϕ)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T R̂_t(ϕ)

Step 2: The first term is the across-task regret of OGD on the sequence of RUBs:

  = (1/T)(∑_{t=1}^T ∥θ*_t − ϕ_t∥²₂/(2η) − min_{ϕ∈Θ} ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η)) + min_{ϕ∈Θ} (1/T) ∑_{t=1}^T ∥θ*_t − ϕ∥²₂/(2η) + ηm
  = 𝒪(log T / (VT)) √m + (3/2) V √m

Step 3: Impact of Task-Relatedness: by the definition of V the last term equals V²/(2η) + ηm, so substituting η = V/√m turns it into (3/2) V √m, while across-task OGD makes the first term 𝒪(log T / (VT)) √m.
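To make Step 2 concrete, here is a small sketch of across-task OGD on the sequence of RUBs, in the oracle setting above where θ*_t is available. Note that ∇_ϕ R̂_t(ϕ) = (ϕ − θ*_t)/η, so the meta-update moves ϕ toward θ*_t, exactly the Reptile-style update.

    import numpy as np

    def rub(phi, theta_star, eta, m):
        # OGD regret-upper-bound R̂_t(ϕ) = ∥θ*_t − ϕ∥²₂ / (2η) + ηm.
        return np.sum((theta_star - phi) ** 2) / (2 * eta) + eta * m

    def across_task_ogd(theta_stars, eta, m, meta_lr):
        # Online gradient descent on the RUBs: ∇_ϕ R̂_t(ϕ) = (ϕ − θ*_t)/η.
        phi = np.zeros_like(theta_stars[0])
        avg_rub = 0.0
        for t, theta_star in enumerate(theta_stars, start=1):
            avg_rub += (rub(phi, theta_star, eta, m) - avg_rub) / t  # running average
            phi = phi - meta_lr * (phi - theta_star) / eta           # meta-update
        return phi, avg_rub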
Recap: What have we achieved?

When the optimal task parameters are close together (small task similarity V), GBML yields much better average performance.

Our Guarantee: R̄ = 𝒪((V + log T / (VT)) √m)
Single-Task Lower Bound: R_t = Ω(D√m)
Multi-Task Lower Bound: R̄ = Ω(V√m)
What can we get by applying ARUBA?

Adaptivity
‣ Learn any base-learner parameter from data
‣ e.g., improved training algorithms by learning ϕ and η simultaneously

Generality
‣ Low-dynamic-regret algorithms for changing task-environments
‣ Stronger online-to-batch conversions for faster statistical rates
‣ Specialized within-task algorithms, e.g., satisfying privacy guarantees
ARUBA in Context of Three ICML'19 Papers

‣ Excess transfer risk bounds for regularized SGD via online-to-batch conversion [Denevi-Ciliberto-Grazzi-Pontil]
‣ Online learnability of MAML via Follow-the-Leader [Finn-Rajeswaran-Kakade-Levine]
‣ Average regret bounds for an Online Mirror Descent meta-algorithm under a max-deviation assumption on task parameters [K-Balcan-Talwalkar]
ARUBA Framework
Applications
‣ Adaptivity for Improved Training
‣ Federated Learning & Privacy-Preserving GBML
Is learning an initialization good enough?

Adaptively preconditioned gradient descent is popular in GBML:
θ_{t,i+1} = θ_{t,i} − η_{t,i} ⊙ ∇_{t,i}
where ∇_{t,i} is the gradient on sample i from task t and η_{t,i} is a per-coordinate learning rate.

‣ Reptile uses Adam with accumulated gradient information
‣ Meta-SGD learns a per-coordinate learning rate for MAML [Li et al.]

Can we do this rigorously?
Applying ARUBA: the regret-upper-bound

Single-task regret guarantees are often nice, data-dependent functions of the algorithm parameters.

For η-preconditioned OGD from ϕ, with ∥·∥_a the Mahalanobis norm (∥x∥²_a = ∑_j a_j x_j²):
R̂_t(ϕ, η) = ∥θ*_t − ϕ∥²_{1/(2η)} + ∑_{i=1}^m ∥∇_{t,i}∥²_η
Adaptive ARUBA: setting the learning rate

For coordinate j at task t, set
η_{t,j} = √((B_{t,j} + ε_t) / (G_{t,j} + ζ_t))
where
B_{t,j} = ½ ∑_{s<t} (ϕ_{s,j} − θ*_{s,j})²   (sum of squared distances from the initialization)
G_{t,j} = ∑_{s<t} ∑_{i=1}^m ∇²_{s,i,j}      (sum of squared gradients)
ε_t = o(t), ζ_t = o(t)m                      (smoothing terms)

Guarantee: 𝒪̃(1/T^{2/5}) convergence of the average regret to that of always using the optimal initialization and diagonal preconditioner.
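A sketch of the per-coordinate rule above, assuming (as in the oracle setting earlier) that the previous tasks' optima and within-task gradients have been recorded; the array shapes are illustrative.

    import numpy as np

    def adaptive_lr(phi_hist, theta_star_hist, grad_hist, eps, zeta):
        # η_{t,j} = sqrt((B_{t,j} + ε_t) / (G_{t,j} + ζ_t)), estimated from tasks s < t.
        # phi_hist, theta_star_hist: (t-1, d) arrays of initializations and task optima.
        # grad_hist: (t-1, m, d) array of within-task gradients.
        B = 0.5 * np.sum((phi_hist - theta_star_hist) ** 2, axis=0)  # sq. distances
        G = np.sum(grad_hist ** 2, axis=(0, 1))                      # sq. gradients
        return np.sqrt((B + eps) / (G + zeta))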
Connection to AdaGrad

Applying the result to practical GBML with OGD base-learners:
η_{t,j} = √((B_{t,j} + ε_t) / (G_{t,j} + ζ_t))
‣ numerator B_{t,j}: sum of squared distances traveled on each task
‣ denominator G_{t,j}: sum of all squared gradients across tasks

Compare to AdaGrad: η_{t,j} = 1 / (√G_{t,j} + δ), with smoothing term δ = O(1)
Improved Training via Adaptive ARUBA

Reptile: preconditioning with ARUBA instead of Adam.
[Figure: results on the Omniglot character classification task, averaged over three runs]
ARUBA Framework
Applications
‣ Adaptivity for Improved Training
‣ Federated Learning & Privacy-Preserving GBML
Personalized Federated Learning

‣ Massively distributed
‣ Small sample sizes
‣ Privacy concerns
‣ Non-IID data and tasks
‣ Underlying task similarity

FedAvg ≈ Reptile [with a batch-averaged meta-update]
‣ Most popular algorithm in federated learning
‣ Usually run without personalization: the meta-initialization is simply used within-task
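A minimal sketch of the FedAvg ≈ Reptile correspondence (an illustration under the same assumptions as the earlier sketch, whose within_task_ogd helper is reused): each of k clients adapts from the shared initialization, and the server's batch-averaged meta-update is the mean of their final iterates.

    import numpy as np

    def fedavg_round(phi, client_tasks, within_task_ogd, lr=0.01):
        # One FedAvg round: k clients run within-task (O)GD from ϕ in parallel;
        # the server averages their final iterates, a batch-averaged Reptile update.
        finals = [within_task_ogd(task, phi, lr) for task in client_tasks]
        return np.mean(finals, axis=0)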
Personalization in Federated Learning via Adaptive ARUBA

‣ Meta-training: run FedAvg with the ARUBA optimizer within-task
‣ Meta-testing: use (preconditioned) OGD to learn a personalized model for each user
[Figure: results on the Shakespeare next-character prediction task]
Private and Low-Risk Federated Learning via ARUBA
with Jeff Li, Sebastian Caldas, and Ameet Talwalkar

Approach:
‣ ARUBA handles approximate meta-updates, e.g. due to noise from differential privacy
‣ Use private OCO algorithms within-task to get per-sample privacy
‣ Analyzing their regret-upper-bounds yields a GBML method that is private even to the central aggregator and has risk bounds improving with task similarity
Summary

ARUBA: a framework for obtaining GBML algorithms with provable and mathematically interpretable guarantees via reduction to OCO.

New methods for meta-training, federated learning, and private GBML.

Future directions:
• Beyond adversarial analysis within-task: can the base-learners be statistical or RL algorithms?
• Better multi-task optimizers, e.g. for regimes beyond few-shot learning
Thank You!
More Info: http://www.cs.cmu.edu/~mkhodak/
ARUBA: https://arxiv.org/abs/1906.02717