The Sample Complexity in Data(Sample)-Driven Optimization
Department of Management Science and Engineering and ICME
Stanford University
Yinyu Ye
Joint work with many others
Outline
1. Sample Complexity in Reinforcement Learning and Stochastic Game
• Reinforcement Learning and MDP
• The Sample-Complexity Problem in RL
• The Near Sample-Optimal Algorithm
• Variance Reduction
• Monotonicity Analysis
2. Sample Average Approximation with Sparsity-Inducing Penalty for High-Dimensional Stochastic Programming/Learning
• Regularized sample average approximation (RSAA)
• Theoretical generalizations
• Theoretical applications:
• High-dimensional statistical learning
• Deep neural network learning
I. What Is Reinforcement Learning?
• Use samples/experiences to control an unknown system
(Diagram: the agent takes actions on an unknown system and receives state observations and rewards.)
Samples/Experiences: previous observations
Maze with random punishment/bonus
Play game, learn to drive
Reinforcement learning achieves phenomenal empirical successes
(Figure: median human-normalized score over 57 Atari games vs. the human baseline; Hessel et al. 2017.)
What if sample/trial is costly and limited?
Precision Medicine · Biological Experiments · Crowdsourcing
Sample Size/Complexity is essential!
“What If” Model: Markov Decision Process
• States: 𝒮; Actions: 𝒜
• Reward: 𝑟(𝑠, 𝑎)
• State transition: 𝑠′ ∼ 𝑃(⋅|𝑠, 𝑎), random
• Policy: 𝜋: 𝒮 → 𝒜
• Optimal policy & value: 𝜋∗, 𝑣∗
• 𝜖-optimal policy 𝜋: 𝑣𝜋 ≥ 𝑣∗ − 𝜖
When the model is known, solve via Linear Programming
Effective Horizon: 𝐻 = (1 − 𝛾)^{−1}
Examples
• Three-state MDP
− Optimal policy: 𝜋∗(1) = 𝐿
− 0.1-optimal policy: 𝜋(1) = 𝑅
(Figure: three-state MDP with states 1, 2, 3, actions 𝐿 and 𝑅, and rewards +1 and +0.999.)
• Pacman
− How to get a 0.1-optimal policy?
− Learn the policy from sample data
− Short episodes
The Sample-Complexity Problem
For an unknown discounted MDP, how many samples are sufficient/necessary to learn a 0.1-optimal policy w. p. ≥ 0.9?
Data/sample: transition examples from every (s,a)
• Ω((1 − 𝛾)^{−3}𝜖^{−2}) samples per (s,a) are necessary! [Azar et al (2013)]
• No previous algorithm achieves this sample complexity
• Sufficient side: what is the best way of learning a policy?
Empirical Risk Minimization and Value-Iteration
❑Collect m samples for each (𝑠, 𝑎)
❑Construct an empirical model
❑Solve the empirical MDP via Linear Programming or Simply Value-Iteration:
𝑣(𝑠) ← max𝑎 [ 𝑟(𝑠,𝑎) + 𝛾 ∑𝑠′ 𝑃(𝑠′|𝑠,𝑎) 𝑣(𝑠′) ] for all 𝑠
Sample-First and Solve-Second
• Simple and intuitive
• Space: model-based, all data
• Samples: [Azar et al’ 2013]
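The sample-first, solve-second pipeline above can be sketched in a few lines. This is an illustrative stand-in on a tabular MDP (function and variable names are my own, not the paper's implementation):

```python
import numpy as np

def empirical_value_iteration(counts, rewards, gamma, iters):
    """Sample-first, solve-second: build the empirical model P_hat from
    transition counts, then run value iteration on the empirical MDP.

    counts[s, a, s2]: number of sampled transitions (s, a) -> s2
    rewards[s, a]:    reward for taking action a in state s
    """
    P_hat = counts / counts.sum(axis=2, keepdims=True)   # empirical model
    v = np.zeros(rewards.shape[0])
    for _ in range(iters):
        # Bellman backup: v(s) <- max_a [ r(s,a) + gamma * sum_s' P_hat(s'|s,a) v(s') ]
        v = np.max(rewards + gamma * P_hat @ v, axis=1)
    policy = np.argmax(rewards + gamma * P_hat @ v, axis=1)
    return v, policy
```

The space cost is the whole empirical model (model-based), which is exactly the drawback the slide notes.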
Iterative Q-learning: Solving while Sampling
• 𝑄(𝑠,𝑎): quality of (𝑠,𝑎), initialized to 0
• Each iteration updates 𝑄 with an empirical estimator of the expected future value (approximate DP)
• After 𝑅 = (1 − 𝛾)^{−1} iterations: converging to 𝑄∗
• #Samples for 0.1-opt. policy w.h.p.: see the summary table of previous methods
Why Iterative Q-Learning Fails
• Runs for 𝑅 = (1 − 𝛾)^{−1} iterations
• Each iteration uses the same number of new samples for its empirical estimator of the expected future value
• Each iteration gets only “a little closer” to 𝑄∗
(Figure: iterates 𝑄1 → 𝑄2 → 𝑄3 → 𝑄4 → 𝑄5 inching toward 𝑄∗.)
Too many samples for too little improvement
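For contrast, here is a minimal sketch of the naive “fresh minibatch every iteration” scheme the slide criticizes; the names and the sampling interface are my assumptions, not the original code:

```python
import numpy as np

def phased_q_learning(sample, rewards, gamma, S, A, m, iters):
    """Each of the iterations draws a FRESH minibatch of m next-state
    samples for EVERY (s, a): total cost m * iters samples per (s, a),
    even though each iteration moves Q only "a little closer" to Q*.

    sample(s, a, m): array of m next states drawn from the unknown P(.|s, a).
    """
    Q = np.zeros((S, A))
    for _ in range(iters):
        v = Q.max(axis=1)                    # current value estimate
        for s in range(S):
            for a in range(A):
                ns = sample(s, a, m)         # fresh minibatch, then discarded
                # empirical estimator of E[max_a' Q(s', a')]
                Q[s, a] = rewards[s, a] + gamma * v[ns].mean()
    return Q
```

Every minibatch is thrown away after one contraction step, which is the waste that variance reduction later removes.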
How many samples per (s,a) are sufficient/necessary to obtain a 0.1-optimal policy w. p. > 0.9?
Summary of Previous Methods
Previous Algorithms | #Samples/(s,a) | Ref
Phased Q-learning | 10^{14} | Kearns & Singh’ 1999
ERM | 10^{10} | Azar et al’ 2013
Randomized LP | 10^{8} | Wang’ 2017
Randomized Val. Iter. | 10^{8} | Sidford et al’ ICML 2018
• Previous best algorithm: 10^{8} samples per (s,a). Sufficient?
• Information-theoretic lower bound: Ω((1 − 𝛾)^{−3}𝜖^{−2}) per (s,a). Necessary?
Theorem (informal):
For an unknown discounted MDP with model size |𝒮|×|𝒜|, we give an algorithm (vQVI) that
• Recovers w.h.p. a 0.1-optimal policy with Õ((1 − 𝛾)^{−3}) samples per (s,a)
• Time: Õ(#samples)
• Space: model-free
The first sample-optimal* algorithm for learning a policy of a DMDP.
The Near Sample-Optimal Algorithm
[Sidford, Wang, Wu, Yang, and Ye, NeurIPS (2018)]
Analysis Overview
• Q-Learning + Variance Reduction → Variance-Reduced Q-Learning (utilizing the previous estimate)
• Monotonicity analysis → guarantees a good policy
• Bernstein inequality + Law of total variance → precise control of error
Variance Reduction
• Original Q-Learning: the empirical estimator is computed many times; each time uses a small number of samples
• Use a milestone point to reduce the samples used: compute once; reuse for many iterations
• Each step still gets “a little closer” to 𝑄∗
Algorithm: vQVI (sketch)
• 𝐻 = (1 − 𝛾)^{−1} outer iterations
• At each check-point: a large number of samples
• log(𝜖^{−1}) inner iterations: a small number of samples each, yielding a “big improvement”
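A toy rendering of the milestone idea may help; this is my schematic of the variance-reduction structure, not the paper's vQVI with its Bernstein-based batch sizes:

```python
import numpy as np

def vr_q_iteration(sample, rewards, gamma, S, A, M, m, epochs, inner):
    """Per epoch: one LARGE batch (size M) per (s, a) estimates E[v0(s')]
    at a milestone v0, computed once and reused; each inner step then
    needs only a SMALL batch (size m) for the correction E[v(s') - v0(s')],
    whose variance is small because v stays close to v0."""
    Q = np.zeros((S, A))
    for _ in range(epochs):
        v0 = Q.max(axis=1)                          # milestone value
        anchor = np.empty((S, A))                   # expensive, reused
        for s in range(S):
            for a in range(A):
                anchor[s, a] = v0[sample(s, a, M)].mean()
        for _ in range(inner):
            v = Q.max(axis=1)
            for s in range(S):
                for a in range(A):
                    ns = sample(s, a, m)            # cheap correction batch
                    Q[s, a] = rewards[s, a] + gamma * (
                        anchor[s, a] + (v[ns] - v0[ns]).mean())
    return Q
```

The per-epoch cost is one large batch plus many small ones, roughly a single minibatch overall, instead of a fresh minibatch per contraction step.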
Compare to Q-Learning
❑ Q-learning variants: #Samples = (minibatch size) × 𝑅
❑ Variance-reduced Q-Learning: #Samples ≈ a single minibatch
(Figure: iterate sequences 𝑄1, …, 𝑄5 for both methods.)
Precise Control of Error Accumulation
• Each modified Q-learning step has estimation error 𝜖𝑖
• Errors accumulate across iterations
• Bernstein inequality + Law of total variance:
∑𝑖 𝜖𝑖 ∼ (Total variance of MDP)/𝑚 ≤ (1 − 𝛾)^{−3}/𝑚
the intrinsic complexity of the MDP
Extension to Two-Person Stochastic Games
[Sidford, Wang, Yang, Ye (2019)]: Õ((1 − 𝛾)^{−3}) samples per (s,a) to learn a 0.1-opt strategy w.h.p.
• MDPs and stochastic games are fundamentally different computation classes, even in the discounted case.
• But they achieve the same sample-complexity lower bound in the discounted case!
(Figure: a minimax game tree; RL has a single maximizing agent, while the game alternates MAX and MIN levels.)
II. Stochastic Programming
• Consider a Stochastic Programming (SP) problem
min{𝐅(𝒙) := 𝔼ℙ 𝑓(𝒙,𝑊) : 𝒙 ∈ ℜ^𝑝, 0 ≤ 𝒙 ≤ 𝑅}
• 𝑓 is deterministic. For every feasible 𝒙, the function 𝑓(𝒙, ⋅) is measurable and 𝔼ℙ 𝑓(𝒙,𝑊) is finite. 𝒲 is the support of 𝑊.
• Commonly solved by Sample Average Approximation (SAA), a.k.a. the Monte Carlo (sampling) method.
Sample Average Approximation
• To solve
𝒙^{𝑡𝑟𝑢𝑒} ∈ argmin{𝐅(𝒙) : 0 ≤ 𝒙 ≤ 𝑅}
• Solve instead an approximation problem
min{𝐹𝑛(𝒙) ≔ 𝑛^{−1} ∑_{𝑖=1}^{𝑛} 𝑓(𝒙,𝑊𝑖) : 𝒙 ∈ ℜ^𝑝, 0 ≤ 𝒙 ≤ 𝑅}
• 𝑊1, 𝑊2, …, 𝑊𝑛 is a sequence of samples of 𝑊
• Simple to implement
• Often tractably computable
SAA is a popular way of solving SP
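SAA in miniature: replace the expectation by an empirical mean and minimize that instead. The quadratic loss and grid search below are my illustrative choices to keep the sketch self-contained:

```python
import numpy as np

def saa_solve(f, sampler, n, grid):
    """Sample Average Approximation: draw W_1, ..., W_n once, form
    F_n(x) = (1/n) * sum_i f(x, W_i), and minimize F_n instead of
    the intractable F(x) = E[f(x, W)] (here by brute force on a grid)."""
    W = sampler(n)
    F_n = lambda x: np.mean([f(x, w) for w in W])
    return min(grid, key=F_n)

# Toy problem: f(x, w) = (x - w)^2, so x_true = E[W] = 2.
rng = np.random.default_rng(0)
x_saa = saa_solve(lambda x, w: (x - w) ** 2,
                  lambda n: rng.normal(2.0, 1.0, size=n),
                  n=2000, grid=np.linspace(0.0, 4.0, 81))
```

As 𝑛 grows, 𝐹𝑛 concentrates around 𝐅 and the SAA minimizer approaches 𝒙^{𝑡𝑟𝑢𝑒}.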
Equivalence between SAA and M-Estimation
Stochastic Programming ↔ Statistical Learning
• SAA = Model-fitting formulation for M-Estimation
• RSAA = Regularized M-Estimation
• Suboptimality gap 𝜖 = Excess risk: 𝐅(⋅) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒})
• Minimizer 𝒙^{𝑡𝑟𝑢𝑒} = True parameters 𝒙^{𝑡𝑟𝑢𝑒}
• Objective function = Statistical loss
Efficacy of SAA
• Assumptions:
• (a) The sequence {𝑓(𝒙,𝑊1), 𝑓(𝒙,𝑊2), …, 𝑓(𝒙,𝑊𝑛)} is i.i.d.
• (b) Subgaussian (can be generalized to sub-exponential)
• (c) Lipschitz-like condition
• Number 𝑛 of samples required to achieve 𝜖 accuracy with probability 1 − 𝛼 in solving a 𝑝-dimensional problem:
ℙ(𝐅(𝒙_{𝑆𝐴𝐴}) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼
• 𝑛 is large enough if it satisfies
𝑛 ≳ (𝑝/𝜖^{2}) ln(1/𝜖) + (1/𝜖^{2}) ln(1/𝛼)
Shapiro et al., 2009; Shapiro, 2003; Shapiro and Xu, 2008
Sample size 𝑛 grows polynomially as the number of dimensions 𝑝 increases
High-Dimensional SP
• What if 𝑝 is large, but acquiring new samples is prohibitively costly?
• What if most dimensions are (approximately) ineffective, e.g., 𝒙^{𝑡𝑟𝑢𝑒} = [80; 131; 100; 0; 0; …; 0]?
• Consider a problem with 𝑝 = 100,000
• SAA requires 𝑛 ≈ 100,000 ∼ 10,000,000
• If only it were known which dimensions (e.g., three of them) are nonzero, the original problem could be reduced to 𝑝 = 3, requiring only 𝑛 ≈ 10 ∼ 1,000
Conventional SAA is not effective in solving high-dimensional SP
Convex and Nonconvex Regularization Schemes
Tibshirani, 1996. J R Stat Soc B
Frank and Friedman, 1993.Technometrics
Fan and Li, 2001. JASA
Zhang, 2010. Ann. Stat
Raskutti et al., 2011, IEEE Trans. Inf. Theory
Bian, Chen, Y, 2011, Math. Progrm
• ℓ𝑞-norm (0 < 𝑞 ≤ 1 or 𝑞 = 2): biased or non-sparse
• 𝑃𝜆(𝑡) ≔ 𝜆 ⋅ 𝑡^{𝑞}
• ℓ0-norm: not continuous
• 𝑃𝜆(𝑡) ≔ 𝜆 ⋅ 𝕝(𝑡 ≠ 0)
• Folded concave penalty (FCP) in the form of the minimax concave penalty (MCP)
• 𝑃𝜆(𝑡) ≔ ∫_{0}^{𝑡} (𝑎𝜆 − 𝑠)₊/𝑎 𝑑𝑠
FCP entails unbiasedness, sparsity, and continuity
(Figure: 𝑃𝜆(𝑡) vs. 𝑡 for the ℓ𝑞-norm with 𝑞 = 0.5, the ℓ0-norm with 𝜆 = 1, and FCP with 𝜆 = 2, 𝑎 = 0.5.)
RSAA formulation: min{𝐹𝑛,𝜆(𝒙) ≔ 𝐹𝑛(𝒙) + ∑_{𝑗=1}^{𝑝} 𝑃𝜆(|𝑥𝑗|) : 0 ≤ 𝒙 ≤ 𝑅}
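The MCP integral has a simple closed form, which a short function makes concrete (the function name is mine):

```python
import numpy as np

def mcp_penalty(t, lam, a):
    """Minimax concave penalty:
    P_lam(t) = integral_0^|t| (a*lam - s)_+ / a ds
             = lam*|t| - t^2/(2a)  for |t| <= a*lam
             = a*lam^2 / 2         for |t| >  a*lam
    The flat tail leaves large coefficients unpenalized (unbiasedness),
    the slope lam near 0 induces sparsity, and the function is continuous.
    """
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), a * lam ** 2 / 2)
```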
Existing Results on Convex Regularization
• Lasso (Tibshirani, 1996), compressive sensing (Donoho, 2006), and Dantzig
selector (Candes & Tao, 2007):
• Pro:
• Computationally tractable
• Generalization error and oracle inequalities (e.g., ℓ1, ℓ2-loss): Bunea et al.
(2007), Zou (2006), Bickel et al. (2009), Van de Geer (2008), Negahban et
al. (2010), etc.
• Cons:
• Requirement of RIP, restricted eigenvalue (RE), or RSC
• Absence of (strong) oracle property: Fan and Li (2001)
• Biased: Fan and Li (2001)
• Remark
• Excess risk bound available for linear regression beyond RIP, RE, or RSC
(e.g., Zhang et al., 2017)
Existing Results on (Folded) Concave Penalty
• SCAD: Fan and Li (2001) and MCP: Zhang (2010)
• Pro:
• (Nearly) Unbiased: Fan and Li (2001), Zhang (2010)
• Generalization error (ℓ1, ℓ2-loss) and oracle inequalities: Loh and Wainwright
(2015), Wang et al. (2014), Zhang & Zhang (2012), etc.
• Oracle and Strong oracle property: Fan & Li (2001), Fan et al. (2014).
• Cons:
• Strongly NP-hard; requirement of RSC, except for the special case of linear regression
• Remark:
• Chen, Fu and Y (2010): threshold lower-bound property at every local solution;
• Zhang and Zhang (2012): global optimality results in good statistical quality;
• Fan and Lv (2011): sufficient condition results in desirable statistical property;
• Though strongly NP-hard to find global solution (Chen et al., 2017), theories on
local schemes available: Loh & Wainwright (2015), Wang, Liu, & Zhang (2014),
Wang, Kim & Li (2013), Fan et al. (2014), Liu et al. (2017)
S3ONC Solutions Admit an FPTAS
Assumption (d): 𝑓( ⋅ , 𝑧) is continuously differentiable for a.e. 𝑧 ∈ 𝒲, and
|∂𝑓(𝒙, 𝑧)/∂𝑥𝑗 |_{𝒙=𝒙+𝛿𝒆𝑗} − ∂𝑓(𝒙, 𝑧)/∂𝑥𝑗 |_{𝒙=𝒙}| ≤ 𝑈𝐿 ⋅ |𝛿|
• First-order necessary condition (FONC): the solution 𝒙∗ ∈ [0, 𝑅]^𝑝 satisfies
⟨𝛻𝐹𝑛(𝒙∗) + (𝑃𝜆′(𝑥𝑗∗) : 1 ≤ 𝑗 ≤ 𝑝), 𝒙 − 𝒙∗⟩ ≥ 0, ∀𝒙 ∈ [0, 𝑅]^𝑝
• Significant-subspace second-order necessary condition (S3ONC): FONC holds and, for all 𝑗 = 1, …, 𝑝 with 𝑥𝑗∗ ∈ (0, 𝑎𝜆),
𝑈𝐿 + 𝑃𝜆′′(𝑥𝑗∗) ≥ 0 or 𝛻²𝐹𝑛,𝑗𝑗(𝒙∗) + 𝑃𝜆′′(𝑥𝑗∗) ≥ 0
• S3ONC is weaker than the canonical second-order KKT condition
• Computable within pseudo-polynomial time
Y (1998), Liu et al. (2017), Haeser et al. (2018), Liu, Y (2019) https://arxiv.org/abs/1903.00616
Result 1: Efficacy of RSAA at a Global Minimum
Assumptions:
• (a) – (d)
THEOREM 1
Consider a global minimal solution 𝒙∗ to RSAA. If the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{2.5}/𝜖^{3}) (ln(𝑝/𝜖))^{1.5} + (1/𝜖^{2}) ln(1/𝛼),
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• Better dependence on 𝑝 than SAA: (𝑝/𝜖^{2}) ln(1/𝜖) + (1/𝜖^{2}) ln(1/𝛼)
• However, generating a global solution is generally computationally prohibitive
Result 2: Efficacy of RSAA at a Local Solution
Assumptions:
• (a) – (d)
• 𝑓( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
THEOREM 2
Consider an S3ONC solution 𝒙∗ to RSAA. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝟎) a.s., and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{2.5}/𝜖^{4}) (ln(𝑝/𝜖))^{2} + (1/𝜖^{2}) ln(1/𝛼),
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• Better dependence on 𝑝 than SAA: (𝑝/𝜖^{2}) ln(1/𝜖) + (1/𝜖^{2}) ln(1/𝛼)
• Much easier to solve than a global minimizer: pseudo-polynomial-time algorithms exist
Result 3: Efficacy of RSAA Under Additional Assumptions
Assumptions:
• (a) – (d)
• 𝑓( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
• 𝐅 is strongly convex and differentiable (∗)
THEOREM 3
Consider an S3ONC solution 𝒙∗. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝟎) a.s., and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{1.5}/𝜖^{3}) (ln(𝑝/𝜖))^{1.5} + (1/𝜖^{2}) ln(1/𝛼),
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• For some problems, while 𝑓( ⋅ ,𝑊) is only convex, 𝐅 can be strongly convex, e.g., high-dimensional linear regression with a random design matrix
• Under the additional assumption (∗), Theorem 3 shows an improved sample complexity over Theorem 2
Result 4: Efficacy of RSAA Under Additional Assumptions
Assumptions:
• (a) – (d)
• 𝑓( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
• 𝐅 is strongly convex and differentiable (∗)
• min{|𝑥𝑗^{𝑡𝑟𝑢𝑒}| : 𝑥𝑗^{𝑡𝑟𝑢𝑒} ≠ 0, 𝑗 = 1, …, 𝑝} ≥ Threshold (∗∗)
THEOREM 4
Consider an S3ONC solution 𝒙∗. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝟎) a.s., and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠/𝜖^{2}) ln(𝑝/𝜖) + (1/𝜖^{2}) ln(1/𝛼),
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
Under additional assumptions (∗) and (∗∗), no compromise on the rate in 𝜖 compared to SAA: (𝑝/𝜖^{2}) ln(1/𝜖) + (1/𝜖^{2}) ln(1/𝛼)
Numerical Experiments
Liu, Wang, Yao, Li, Ye, 2018. Math Program
(Figures: left, fix sample size 𝑛 = 100 and increase dimensionality; right, fix dimension 𝑝 = 100 and increase sample size.)
Generalizations and Applications
• Theoretical Generalizations
• From subgaussian distribution to subexponentiality
• From smooth to nonsmooth cost function
• Theoretical Applications
• High-Dimensional Statistical Learning (HDSL)
• Deep Neural Networks
Theoretical Generalizations
Assumption (e):
𝑓(𝒙,𝑊) is subexponential for all 𝒙 ∈ [0, 𝑅]^𝑝
Assumption (f):
𝑓(𝒙,𝑊) = 𝑓1(𝒙,𝑊) + max_{𝒖∈ℋ} {𝒖^⊤𝑨(𝑊)𝒙 − 𝜙(𝒖,𝑊)},
where 𝑨(𝑊) ∈ ℜ^{𝑚×𝑝} is a linear operator, 𝜙 is a measurable, deterministic function, 𝑓1( ⋅ ,𝑊) is continuously differentiable with Lipschitz continuous gradient, and ℋ ⊂ ℜ^𝑚 is a convex and compact set.
• Generalization from subgaussian distributions to subexponentiality
• Generalization from a smooth cost function 𝑓 to a non-smooth cost function
Result 5: Generalization to Subexponentiality
Assumptions: (a), (c), (d), (e) subexponentiality; 𝑓1( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
THEOREM 5
Consider an S3ONC solution 𝒙∗. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝒙^{ℓ1}) a.s., where 𝒙^{ℓ1} ≔ argmin{𝐹𝑛(𝒙) + 𝜆 ∑_{𝑗=1}^{𝑝} |𝑥𝑗| : 𝟎 ≤ 𝒙 ≤ 𝑅}, and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{3}/𝜖^{3}) (ln(𝑝/𝜖))^{1.5} + ln(𝑝/𝛼) + (ln(1/𝛼))^{3},
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• Better dependence on 𝜖 than Theorem 2: (𝑠^{2.5}/𝜖^{4}) (ln(𝑝/𝜖))^{2} + (1/𝜖^{2}) ln(1/𝛼)
• The desired solution can be generated by solving for an S3ONC solution initialized with 𝒙^{ℓ1}
Result 6: Generalization to Non-smoothness
Assumptions: (a), (c), (e) subexponentiality; (f); 𝑓( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
THEOREM 6
Consider an S3ONC solution 𝒙∗. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝒙^{ℓ1}) a.s., where 𝒙^{ℓ1} ≔ argmin{𝐹𝑛(𝒙) + 𝜆 ∑_{𝑗=1}^{𝑝} |𝑥𝑗| : 𝟎 ≤ 𝒙 ≤ 𝑅}, and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{4}/𝜖^{4}) (ln(𝑝/𝜖))^{2} + ln(𝑝/𝛼) + (ln(1/𝛼))^{2},
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑚𝑖𝑛}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• Assumption (d) is replaced by (f)
• The desired solution can be generated by solving for an S3ONC solution initialized with 𝒙^{ℓ1}
Theoretical Application 1: High-Dimensional M-Estimation
• Given knowledge of a statistical loss function 𝑓, and hypothetically
𝒙^{𝑡𝑟𝑢𝑒} ∈ argmin 𝐅(𝒙) := 𝔼 𝑓(𝒙,𝑊)
𝒙^{𝑡𝑟𝑢𝑒}: true parameters governing the data-generation model
• Estimate 𝒙^{𝑡𝑟𝑢𝑒} ∈ ℜ^𝑝 given only 𝑓 and samples 𝑊1, 𝑊2, …, 𝑊𝑛
• Assume that 𝑓(𝒙,𝑊𝑖), 𝑖 = 1, …, 𝑛, are i.i.d. subexponential
• Traditional approach: minimize the empirical risk to obtain an unbiased estimator
𝒙̂ = argmin 𝐹𝑛(𝒙) := 𝑛^{−1} ∑_{𝑖=1}^{𝑛} 𝑓(𝒙,𝑊𝑖)
• Under the high-dimensional setting, 𝑝 ≫ 𝑛
• Consider folded concave penalized (FCP) learning
min 𝐹𝑛,𝜆(𝒙) := 𝑛^{−1} ∑_{𝑖=1}^{𝑛} 𝑓(𝒙,𝑊𝑖) + ∑_{𝑗=1}^{𝑝} 𝑃𝜆(|𝑥𝑗|)
Result 7: Learning Under Approximate Sparsity
THEOREM 7
For any 𝛤 ≥ 0, consider an S3ONC solution 𝒙∗ with 𝐹𝑛,𝜆(𝒙∗) ≤ min𝒙 𝐹𝑛,𝜆(𝒙) + 𝛤, a.s. Assume (a) subexponentiality; (b) Lipschitz-like continuity of 𝐅 and 𝛻𝐅; (c) approximate sparsity. Then
Excess risk of 𝒙∗ ≤ Õ(1) ⋅ [𝑠 ln 𝑝 / 𝑛^{1/3} + (𝛤 + 𝜀̂)/𝑛^{1/3} + 𝛤 + 𝜀̂]
with probability 1 − 𝑂(1) ⋅ exp(−𝑂(1) 𝑛^{1/3} ln 𝑝)
• First explication under assumptions more general than exact sparsity
• 𝛤 can be interpreted as the sub-optimality gap of 𝒙∗
• No convexity or RSC assumed
• Approximate sparsity: existence of 𝒙̄ such that 𝐅(𝒙̄) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜀̂ and 𝑠 := ‖𝒙̄‖_0 ≪ 𝑝
Result 8: FCP+LASSO
Removing 𝜞 via Wise Initialization for Convex Loss
Step 1: Solve the LASSO program
𝒙^{ℓ1} ≔ argmin𝒙 𝐹𝑛(𝒙) + 𝜆 ∑_{𝑗=1}^{𝑝} |𝑥𝑗|
Step 2: Solve for an S3ONC stationary point initialized with 𝒙^{ℓ1}
• A modified first-order method, ADMM
Algorithm-independent, fully polynomial-time approximation scheme: Complexity ≤ Poly(𝑝, desired accuracy)
THEOREM 8
Consider an S3ONC solution 𝒙∗ with 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝒙^{ℓ1}), a.s. Assume (a) subexponentiality; (b) Lipschitz-like continuity of 𝐅 and 𝛻𝐅; (c) sparsity (i.e., 𝜀̂ = 0 in the assumption of approximate sparsity). Then
Excess risk of 𝒙∗ ≤ Õ(1) ⋅ [𝑠 ‖𝒙^{𝑡𝑟𝑢𝑒}‖∞ ln 𝑝 / 𝑛^{1/3} + 𝑠 ln 𝑝 / 𝑛^{2/3}]
with probability 1 − 𝑂(1) ⋅ exp(−𝑂(1) 𝑛^{1/3} ln 𝑝)
A pseudo-polynomial-time computable theory without RSC
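The two-stage recipe can be sketched for a least-squares loss, using ISTA for Step 1 and a proximal-gradient pass with the closed-form MCP prox for Step 2. The talk's actual solver (a modified first-order method / ADMM) differs; all names and parameters here are illustrative:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the l1 prox."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def mcp_prox(z, lam, a, eta):
    """Closed-form prox of the MCP penalty (requires step size eta < a)."""
    return np.where(np.abs(z) <= a * lam,
                    soft(z, lam * eta) / (1.0 - eta / a),  # shrunk region
                    z)                                     # flat tail: untouched

def fcp_plus_lasso(A, b, lam, a=3.0, iters=500):
    """Step 1: LASSO solution x_l1 by ISTA on F_n(x) = ||Ax - b||^2 / (2n).
    Step 2: refine x_l1 toward an S3ONC point of the MCP-penalized problem."""
    n, p = A.shape
    eta = n / np.linalg.norm(A, 2) ** 2          # 1/L for grad F_n
    grad = lambda x: A.T @ (A @ x - b) / n
    x = np.zeros(p)
    for _ in range(iters):                       # Step 1: ISTA (LASSO)
        x = soft(x - eta * grad(x), lam * eta)
    for _ in range(iters):                       # Step 2: MCP refinement
        x = mcp_prox(x - eta * grad(x), lam, a, eta)
    return x
```

Step 2 leaves large coefficients unshrunk (the MCP flat tail), which is how the refinement removes the LASSO bias while keeping the sparsity found in Step 1.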
Comparing This Research with Existing Results
• New insights on local solutions for folded concave penalty
• Pros:
• No requirement of RSC, RIP, RE, or alike
• Applicable to non-smooth learning and deep learning
• Applicable to high-dimensional stochastic programming
• Generalization error bounds on excess risk
• Cons:
• Slower rate than alternative results that require RSC
Rates on the excess risk:
Under RSC (e.g., Loh & Wainwright 2015) | 𝑂(1/𝑛)
Without RSC, linear regression only (e.g., Zhang et al., 2017) | 𝑂(1/𝑛)
This research: M-estimation without RSC | 𝑂(1/𝑛^{1/3})
Theoretical Application 2: Deep Learning
Data generation: 𝑏𝑖 = 𝑔(𝒂𝑖) + 𝑤𝑖, 𝑖 = 1, …, 𝑛, for some unknown 𝑔: ℜ^𝑑 → ℜ
Traditional training model:
min𝒙 (1/2𝑛) ∑_{𝑖=1}^{𝑛} (𝑏𝑖 − 𝐹𝑁𝑁(𝒂𝑖, 𝒙))^{2}
• Expressive power: Yarotsky (2017), Mhaskar (1996), DeVore et al. (1989), Mhaskar and Poggio (2016)
• Two-layer, polynomial increase in dimensions, or sometimes exponential growth in layer number: Arora et al. (2018), Barron and Klusowski (2018), Bartlett et al. (2017), Golowich et al. (2017), Neyshabur et al. (2015, 2017, 2018)
Scarcity of theoretical guarantees on generalization performance under over-parameterization
Approximate Sparsity and Regularization in Neural Networks
Neural networks may not obey RSC, but they innately entail approximate sparsity
• Proposed training model: find an S3ONC solution to the formulation below
min𝒙 (1/2𝑛) ∑_{𝑖=1}^{𝑛} (𝑏𝑖 − 𝐹𝑁𝑁(𝒂𝑖, 𝒙))^{2} + ∑_{𝑗=1}^{𝑝} 𝑃𝜆(|𝑥𝑗|)
• Expressive power of NNs implies approximate sparsity
• E.g., a ReLU network with 𝑐𝑛^{1/3} ⋅ ln(𝑛^{𝑟/(3𝑑)}) + 1 many weights approximates any 𝑟-th order weakly-differentiable function with accuracy 𝑂(1)𝑛^{−𝑟/(3𝑑)} (Yarotsky, 2017)
• So 𝑠 = 𝑐𝑛^{1/3} ⋅ ln(𝑛^{𝑟/(3𝑑)}) + 1 and 𝜀̂ = 𝑂(1)𝑛^{−𝑟/(3𝑑)} in approximate sparsity
• Empirical evidence on sparse neural networks
• E.g., Dropout: Srivastava et al. (2014); DropConnect: Wan et al. (2013); deep compression (remove links with small weights and retrain): Han et al. (2016)
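To make the proposed training model concrete, here is a toy one-hidden-layer ReLU network trained on squared loss plus an MCP penalty on all weights, via plain (sub)gradient descent. The architecture, hyperparameters, and optimizer are my illustrative choices, not the talk's setup:

```python
import numpy as np

def mcp_grad(x, lam, a):
    """(Sub)gradient of the MCP penalty: lam * sign(x) * (1 - |x|/(a*lam))_+
    (zero in the flat tail, so large weights are not shrunk)."""
    return lam * np.sign(x) * np.maximum(1.0 - np.abs(x) / (a * lam), 0.0)

def train_fcp_nn(A, b, hidden=16, lam=0.01, a=3.0, lr=0.05, steps=2000, seed=0):
    """Squared loss + MCP penalty on all weights of a 1-hidden-layer ReLU net."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    W1 = rng.normal(0, 1 / np.sqrt(d), (d, hidden))
    w2 = rng.normal(0, 1 / np.sqrt(hidden), hidden)
    for _ in range(steps):
        H = np.maximum(A @ W1, 0.0)              # ReLU features
        res = H @ w2 - b                         # residuals
        g2 = H.T @ res / n + mcp_grad(w2, lam, a)
        gH = np.outer(res, w2) * (H > 0)         # backprop through ReLU
        g1 = A.T @ gH / n + mcp_grad(W1, lam, a)
        W1 -= lr * g1
        w2 -= lr * g2
    return W1, w2
```

The penalty's zero gradient on large weights leaves the fitted structure intact while pushing near-zero weights to exactly the sparse regime the theory exploits.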
Generalization Performance of Deep Learning in Special Case 1
For 𝑔: ℜ^𝑑 → ℜ that has an 𝑟-th order weak derivative; assume that the PDF of the inputs to neurons has a continuous density in a neighborhood of 0.
Excess risk ≤ Õ(1/𝑛^{1/3} + 1/𝑛^{1/6+𝑟/(6𝑑)} + 1/𝑛^{𝑟/(3𝑑)} + 𝛤/𝑛^{1/3}) ⋅ ln 𝑝 + 𝛤
Poly-logarithmic in dimensionality
Generalization Performance of Deep Learning in Special Case 2
For 𝑔: ℜ^𝑑 → ℜ being a polynomial function
Excess risk ≤ Õ(1/𝑛^{1/3} + 𝛤/𝑛^{1/3}) ⋅ ln 𝑝 + 𝛤
Poly-logarithmic in dimensionality
Numerical Experiment 1: MNIST
Handwriting recognition problem:
• 28 x 28-pixel images of handwritten digits
• 60,000 training data
• 10,000 test data
Numerical Experiment 1: MNIST
• 800 × 800 + softmax
• Fully connected
• No preprocessing/distortion
• 500 × 150 × 10 ×… × 10 × 150 + softmax
• Fully connected
• No preprocessing/distortion
Numerical Experiment 2: Matrix Completion
Ratings of customers for products/restaurants/movies often constitute an incomplete matrix
𝑝 = 100 × 100 = 10,000, 𝑛 = 2,000
Numerical Experiment 2: Matrix Completion
𝑝 = 100 × 100 = 10,000, 𝑛 = 2,000
Method | Sub. | Nuclear norm (Lasso) | S3ONC
Error | 85.7699 | 0.1072 | 0.0005
Summary
• Sample-complexity analysis is an essential problem in optimization driven by data and in high-dimensional/deep learning
• Theories can be developed to match a sample-size upper bound to the information-theoretic lower bound, as in RL via smart sampling
• Based on the structure of problems/solutions, adding certain regularizations can greatly reduce the sample complexity; e.g., RSAA achieves poly-logarithmic sample complexity at any S3ONC solution under standard statistical and sparsity assumptions
• Although the RSAA problem becomes nonconvex when the regularization is concave, an S3ONC solution can be efficiently computed in both theory and practice
• Future Research: further theoretical developments and more comprehensive
numerical results