The Sample Complexity in Data(Sample)-Driven Optimization
Department of Management Science and Engineering and ICME
Stanford University
Yinyu Ye
Joint work with many others
Outline
1. Sample Complexity in Reinforcement Learning and Stochastic Game
• Reinforcement Learning and MDP
• The Sample-Complexity Problem in RL
• The Near Sample-Optimal Algorithm
• Variance Reduction
• Monotonicity Analysis
2. Sample Average Approximation with Sparsity-Inducing Penalty for High-Dimensional Stochastic Programming/Learning
• Regularized sample average approximation (RSAA)
• Theoretical generalizations
• Theoretical applications:
• High-dimensional statistical learning
• Deep neural network learning
I. What Is Reinforcement Learning?
• Use samples/experiences to control an unknown system
(Diagram: the agent takes actions on an unknown system and receives state observations and rewards.)
Samples/Experiences: previous observations
Maze with random punishment/bonus
Play game, learn to drive
Reinforcement learning achieves phenomenal empirical successes
(Figure: median human-normalized score over 57 Atari games vs. the human baseline; Hessel et al. 2017.)
What if sample/trial is costly and limited?
Precision Medicine · Biological Experiments · Crowdsourcing
Sample Size/Complexity is essential!
“What If” Model: Markov Decision Process
• States: 𝒮; Actions: 𝒜
• Reward: 𝑟(𝑠, 𝑎)
• State transition: 𝑠′ ∼ 𝑃(⋅|𝑠, 𝑎), random
• Policy: 𝜋: 𝒮 → 𝒜
• Optimal policy & value: 𝜋∗, 𝑣∗
• 𝜖-optimal policy 𝜋: 𝑣𝜋 ≥ 𝑣∗ − 𝜖
When the model is known, solve via Linear Programming
Effective Horizon: 𝐻 = (1 − 𝛾)^{−1}
Examples
• Three-state MDP
− Optimal policy: 𝜋∗(1) = 𝐿
− 0.1-optimal policy: 𝜋(1) = 𝑅
(Figure: three-state MDP with states 1, 2, 3, actions 𝐿 and 𝑅, and rewards +1 and +0.999.)
• Pacman
− How to get a 0.1-optimal policy?
− Learn the policy from sample data
− Short episodes
The Sample-Complexity Problem
For an unknown discounted MDP, how many samples are sufficient/necessary to learn a 0.1-optimal policy w. p. ≥ 0.9?
Data/sample: transition examples from every (s,a)
• Ω((1 − 𝛾)^{−3}𝜖^{−2}) samples per (s,a) are necessary! [Azar et al (2013)]
• No previous algorithm achieves this sample complexity
• Sufficient side: what is the best way of learning a policy?
Empirical Risk Minimization and Value-Iteration
❑Collect m samples for each (𝑠, 𝑎)
❑Construct an empirical model
❑Solve the empirical MDP via Linear Programming or Simply Value-Iteration:
𝑣(𝑠) ← max𝑎 [ 𝑟(𝑠,𝑎) + 𝛾 ∑𝑠′ 𝑃(𝑠′|𝑠,𝑎) 𝑣(𝑠′) ] for all 𝑠
Sample-First and Solve-Second
• Simple and intuitive
• Space: model-based, all data
• Samples: [Azar et al’ 2013]
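The sample-first, solve-second pipeline above can be sketched in a few lines. This is an illustrative stand-in on a tabular MDP (function and variable names are my own, not the paper's implementation):

```python
import numpy as np

def empirical_value_iteration(counts, rewards, gamma, iters):
    """Sample-first, solve-second: build the empirical model P_hat from
    transition counts, then run value iteration on the empirical MDP.

    counts[s, a, s2]: number of sampled transitions (s, a) -> s2
    rewards[s, a]:    reward for taking action a in state s
    """
    P_hat = counts / counts.sum(axis=2, keepdims=True)   # empirical model
    v = np.zeros(rewards.shape[0])
    for _ in range(iters):
        # Bellman backup: v(s) <- max_a [ r(s,a) + gamma * sum_s' P_hat(s'|s,a) v(s') ]
        v = np.max(rewards + gamma * P_hat @ v, axis=1)
    policy = np.argmax(rewards + gamma * P_hat @ v, axis=1)
    return v, policy
```

The space cost is the whole empirical model (model-based), which is exactly the drawback the slide notes.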
Iterative Q-learning: Solving while Sampling
• 𝑄(𝑠,𝑎): quality of (𝑠,𝑎), initialized to 0
• Each iteration updates 𝑄 with an empirical estimator of the expected future value (approximate DP)
• After 𝑅 = (1 − 𝛾)^{−1} iterations: converging to 𝑄∗
• #Samples for 0.1-opt. policy w.h.p.: see the summary table of previous methods
Why Iterative Q-Learning Fails
• Runs for 𝑅 = (1 − 𝛾)^{−1} iterations
• Each iteration uses the same number of new samples for its empirical estimator of the expected future value
• Each iteration gets only “a little closer” to 𝑄∗
(Figure: iterates 𝑄1 → 𝑄2 → 𝑄3 → 𝑄4 → 𝑄5 inching toward 𝑄∗.)
Too many samples for too little improvement
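For contrast, here is a minimal sketch of the naive “fresh minibatch every iteration” scheme the slide criticizes; the names and the sampling interface are my assumptions, not the original code:

```python
import numpy as np

def phased_q_learning(sample, rewards, gamma, S, A, m, iters):
    """Each of the iterations draws a FRESH minibatch of m next-state
    samples for EVERY (s, a): total cost m * iters samples per (s, a),
    even though each iteration moves Q only "a little closer" to Q*.

    sample(s, a, m): array of m next states drawn from the unknown P(.|s, a).
    """
    Q = np.zeros((S, A))
    for _ in range(iters):
        v = Q.max(axis=1)                    # current value estimate
        for s in range(S):
            for a in range(A):
                ns = sample(s, a, m)         # fresh minibatch, then discarded
                # empirical estimator of E[max_a' Q(s', a')]
                Q[s, a] = rewards[s, a] + gamma * v[ns].mean()
    return Q
```

Every minibatch is thrown away after one contraction step, which is the waste that variance reduction later removes.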
How many samples per (s,a) are sufficient/necessary to obtain a 0.1-optimal policy w. p. > 0.9?
Summary of Previous Methods
Previous Algorithms | #Samples/(s,a) | Ref
Phased Q-learning | 10^{14} | Kearns & Singh’ 1999
ERM | 10^{10} | Azar et al’ 2013
Randomized LP | 10^{8} | Wang’ 2017
Randomized Val. Iter. | 10^{8} | Sidford et al’ ICML 2018
• Previous best algorithm: 10^{8} samples per (s,a). Sufficient?
• Information-theoretic lower bound: Ω((1 − 𝛾)^{−3}𝜖^{−2}) per (s,a). Necessary?
Theorem (informal):
For an unknown discounted MDP with model size |𝒮|×|𝒜|, we give an algorithm (vQVI) that
• Recovers w.h.p. a 0.1-optimal policy with Õ((1 − 𝛾)^{−3}) samples per (s,a)
• Time: Õ(#samples)
• Space: model-free
The first sample-optimal* algorithm for learning a policy of a DMDP.
The Near Sample-Optimal Algorithm
[Sidford, Wang, Wu, Yang, and Ye, NeurIPS (2018)]
Analysis Overview
• Q-Learning + Variance Reduction → Variance-Reduced Q-Learning (utilizing the previous estimate)
• Monotonicity analysis → guarantees a good policy
• Bernstein inequality + Law of total variance → precise control of error
Variance Reduction
• Original Q-Learning: the empirical estimator is computed many times; each time uses a small number of samples
• Use a milestone point to reduce the samples used: compute once; reuse for many iterations
• Each step still gets “a little closer” to 𝑄∗
Algorithm: vQVI (sketch)
• 𝐻 = (1 − 𝛾)^{−1} outer iterations
• At each check-point: a large number of samples
• log(𝜖^{−1}) inner iterations: a small number of samples each, yielding a “big improvement”
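A toy rendering of the milestone idea may help; this is my schematic of the variance-reduction structure, not the paper's vQVI with its Bernstein-based batch sizes:

```python
import numpy as np

def vr_q_iteration(sample, rewards, gamma, S, A, M, m, epochs, inner):
    """Per epoch: one LARGE batch (size M) per (s, a) estimates E[v0(s')]
    at a milestone v0, computed once and reused; each inner step then
    needs only a SMALL batch (size m) for the correction E[v(s') - v0(s')],
    whose variance is small because v stays close to v0."""
    Q = np.zeros((S, A))
    for _ in range(epochs):
        v0 = Q.max(axis=1)                          # milestone value
        anchor = np.empty((S, A))                   # expensive, reused
        for s in range(S):
            for a in range(A):
                anchor[s, a] = v0[sample(s, a, M)].mean()
        for _ in range(inner):
            v = Q.max(axis=1)
            for s in range(S):
                for a in range(A):
                    ns = sample(s, a, m)            # cheap correction batch
                    Q[s, a] = rewards[s, a] + gamma * (
                        anchor[s, a] + (v[ns] - v0[ns]).mean())
    return Q
```

The per-epoch cost is one large batch plus many small ones, roughly a single minibatch overall, instead of a fresh minibatch per contraction step.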
Compare to Q-Learning
❑ Q-learning variants: #Samples = (minibatch size) × 𝑅
❑ Variance-reduced Q-Learning: #Samples ≈ a single minibatch
(Figure: iterate sequences 𝑄1, …, 𝑄5 for both methods.)
Precise Control of Error Accumulation
• Each modified Q-learning step has estimation error 𝜖𝑖
• Errors accumulate across iterations
• Bernstein inequality + Law of total variance:
∑𝑖 𝜖𝑖 ∼ (Total variance of MDP)/𝑚 ≤ (1 − 𝛾)^{−3}/𝑚
the intrinsic complexity of the MDP
Extension to Two-Person Stochastic Games
[Sidford, Wang, Yang, Ye (2019)]: Õ((1 − 𝛾)^{−3}) samples per (s,a) to learn a 0.1-opt strategy w.h.p.
• MDPs and stochastic games are fundamentally different computation classes, even in the discounted case.
• But they achieve the same sample-complexity lower bound in the discounted case!
(Figure: a minimax game tree; RL has a single maximizing agent, while the game alternates MAX and MIN levels.)
II. Stochastic Programming
• Consider a Stochastic Programming (SP) problem
min{𝐅(𝒙) := 𝔼ℙ 𝑓(𝒙,𝑊) : 𝒙 ∈ ℜ^𝑝, 0 ≤ 𝒙 ≤ 𝑅}
• 𝑓 is deterministic. For every feasible 𝒙, the function 𝑓(𝒙, ⋅) is measurable and 𝔼ℙ 𝑓(𝒙,𝑊) is finite. 𝒲 is the support of 𝑊.
• Commonly solved by Sample Average Approximation (SAA), a.k.a. the Monte Carlo (sampling) method.
Sample Average Approximation
• To solve
𝒙^{𝑡𝑟𝑢𝑒} ∈ argmin{𝐅(𝒙) : 0 ≤ 𝒙 ≤ 𝑅}
• Solve instead an approximation problem
min{𝐹𝑛(𝒙) ≔ 𝑛^{−1} ∑_{𝑖=1}^{𝑛} 𝑓(𝒙,𝑊𝑖) : 𝒙 ∈ ℜ^𝑝, 0 ≤ 𝒙 ≤ 𝑅}
• 𝑊1, 𝑊2, …, 𝑊𝑛 is a sequence of samples of 𝑊
• Simple to implement
• Often tractably computable
SAA is a popular way of solving SP
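SAA in miniature: replace the expectation by an empirical mean and minimize that instead. The quadratic loss and grid search below are my illustrative choices to keep the sketch self-contained:

```python
import numpy as np

def saa_solve(f, sampler, n, grid):
    """Sample Average Approximation: draw W_1, ..., W_n once, form
    F_n(x) = (1/n) * sum_i f(x, W_i), and minimize F_n instead of
    the intractable F(x) = E[f(x, W)] (here by brute force on a grid)."""
    W = sampler(n)
    F_n = lambda x: np.mean([f(x, w) for w in W])
    return min(grid, key=F_n)

# Toy problem: f(x, w) = (x - w)^2, so x_true = E[W] = 2.
rng = np.random.default_rng(0)
x_saa = saa_solve(lambda x, w: (x - w) ** 2,
                  lambda n: rng.normal(2.0, 1.0, size=n),
                  n=2000, grid=np.linspace(0.0, 4.0, 81))
```

As 𝑛 grows, 𝐹𝑛 concentrates around 𝐅 and the SAA minimizer approaches 𝒙^{𝑡𝑟𝑢𝑒}.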
Equivalence between SAA and M-Estimation
Stochastic Programming ↔ Statistical Learning
• SAA = Model-fitting formulation for M-Estimation
• RSAA = Regularized M-Estimation
• Suboptimality gap 𝜖 = Excess risk: 𝐅(⋅) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒})
• Minimizer 𝒙^{𝑡𝑟𝑢𝑒} = True parameters 𝒙^{𝑡𝑟𝑢𝑒}
• Objective function = Statistical loss
Efficacy of SAA
• Assumptions:
• (a) The sequence {𝑓(𝒙,𝑊1), 𝑓(𝒙,𝑊2), …, 𝑓(𝒙,𝑊𝑛)} is i.i.d.
• (b) Subgaussian (can be generalized to sub-exponential)
• (c) Lipschitz-like condition
• Number 𝑛 of samples required to achieve 𝜖 accuracy with probability 1 − 𝛼 in solving a 𝑝-dimensional problem:
ℙ(𝐅(𝒙_{𝑆𝐴𝐴}) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼
• 𝑛 is large enough if it satisfies
𝑛 ≳ (𝑝/𝜖^{2}) ln(1/𝜖) + (1/𝜖^{2}) ln(1/𝛼)
Shapiro et al., 2009; Shapiro, 2003; Shapiro and Xu, 2008
Sample size 𝑛 grows polynomially as the number of dimensions 𝑝 increases
High-Dimensional SP
• What if 𝑝 is large, but acquiring new samples is prohibitively costly?
• What if most dimensions are (approximately) ineffective, e.g., 𝒙^{𝑡𝑟𝑢𝑒} = [80; 131; 100; 0; 0; …; 0]?
• Consider a problem with 𝑝 = 100,000
• SAA requires 𝑛 ≈ 100,000 ∼ 10,000,000
• If only it were known which dimensions (e.g., three of them) are nonzero, the original problem could be reduced to 𝑝 = 3, requiring only 𝑛 ≈ 10 ∼ 1,000
Conventional SAA is not effective in solving high-dimensional SP
Convex and Nonconvex Regularization Schemes
Tibshirani, 1996. J R Stat Soc B
Frank and Friedman, 1993.Technometrics
Fan and Li, 2001. JASA
Zhang, 2010. Ann. Stat
Raskutti et al., 2011, IEEE Trans. Inf. Theory
Bian, Chen, Y, 2011, Math. Progrm
• ℓ𝑞-norm (0 < 𝑞 ≤ 1 or 𝑞 = 2): biased or non-sparse
• 𝑃𝜆(𝑡) ≔ 𝜆 ⋅ 𝑡^{𝑞}
• ℓ0-norm: not continuous
• 𝑃𝜆(𝑡) ≔ 𝜆 ⋅ 𝕝(𝑡 ≠ 0)
• Folded concave penalty (FCP) in the form of the minimax concave penalty (MCP)
• 𝑃𝜆(𝑡) ≔ ∫_{0}^{𝑡} (𝑎𝜆 − 𝑠)₊/𝑎 𝑑𝑠
FCP entails unbiasedness, sparsity, and continuity
(Figure: 𝑃𝜆(𝑡) vs. 𝑡 for the ℓ𝑞-norm with 𝑞 = 0.5, the ℓ0-norm with 𝜆 = 1, and FCP with 𝜆 = 2, 𝑎 = 0.5.)
RSAA formulation: min{𝐹𝑛,𝜆(𝒙) ≔ 𝐹𝑛(𝒙) + ∑_{𝑗=1}^{𝑝} 𝑃𝜆(|𝑥𝑗|) : 0 ≤ 𝒙 ≤ 𝑅}
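The MCP integral has a simple closed form, which a short function makes concrete (the function name is mine):

```python
import numpy as np

def mcp_penalty(t, lam, a):
    """Minimax concave penalty:
    P_lam(t) = integral_0^|t| (a*lam - s)_+ / a ds
             = lam*|t| - t^2/(2a)  for |t| <= a*lam
             = a*lam^2 / 2         for |t| >  a*lam
    The flat tail leaves large coefficients unpenalized (unbiasedness),
    the slope lam near 0 induces sparsity, and the function is continuous.
    """
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), a * lam ** 2 / 2)
```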
Existing Results on Convex Regularization
• Lasso (Tibshirani, 1996), compressive sensing (Donoho, 2006), and Dantzig
selector (Candes & Tao, 2007):
• Pro:
• Computationally tractable
• Generalization error and oracle inequalities (e.g., ℓ1, ℓ2-loss): Bunea et al.
(2007), Zou (2006), Bickel et al. (2009), Van de Geer (2008), Negahban et
al. (2010), etc.
• Cons:
• Requirement of RIP, restricted eigenvalue (RE), or RSC
• Absence of (strong) oracle property: Fan and Li (2001)
• Biased: Fan and Li (2001)
• Remark
• Excess risk bound available for linear regression beyond RIP, RE, or RSC
(e.g., Zhang et al., 2017)
Existing Results on (Folded) Concave Penalty
• SCAD: Fan and Li (2001) and MCP: Zhang (2010)
• Pro:
• (Nearly) Unbiased: Fan and Li (2001), Zhang (2010)
• Generalization error (ℓ1, ℓ2-loss) and oracle inequalities: Loh and Wainwright
(2015), Wang et al. (2014), Zhang & Zhang (2012), etc.
• Oracle and Strong oracle property: Fan & Li (2001), Fan et al. (2014).
• Cons:
• Strongly NP-hard; requirement of RSC, except for the special case of linear regression
• Remark:
• Chen, Fu and Y (2010): threshold lower-bound property at every local solution;
• Zhang and Zhang (2012): global optimality results in good statistical quality;
• Fan and Lv (2011): sufficient condition results in desirable statistical property;
• Though strongly NP-hard to find global solution (Chen et al., 2017), theories on
local schemes available: Loh & Wainwright (2015), Wang, Liu, & Zhang (2014),
Wang, Kim & Li (2013), Fan et al. (2014), Liu et al. (2017)
S3ONC Solutions Admit an FPTAS
Assumption (d): 𝑓( ⋅ , 𝑧) is continuously differentiable for a.e. 𝑧 ∈ 𝒲, and
|∂𝑓(𝒙, 𝑧)/∂𝑥𝑗 |_{𝒙=𝒙+𝛿𝒆𝑗} − ∂𝑓(𝒙, 𝑧)/∂𝑥𝑗 |_{𝒙=𝒙}| ≤ 𝑈𝐿 ⋅ |𝛿|
• First-order necessary condition (FONC): the solution 𝒙∗ ∈ [0, 𝑅]^𝑝 satisfies
⟨𝛻𝐹𝑛(𝒙∗) + (𝑃𝜆′(𝑥𝑗∗) : 1 ≤ 𝑗 ≤ 𝑝), 𝒙 − 𝒙∗⟩ ≥ 0, ∀𝒙 ∈ [0, 𝑅]^𝑝
• Significant-subspace second-order necessary condition (S3ONC): FONC holds and, for all 𝑗 = 1, …, 𝑝 with 𝑥𝑗∗ ∈ (0, 𝑎𝜆),
𝑈𝐿 + 𝑃𝜆′′(𝑥𝑗∗) ≥ 0 or 𝛻²𝐹𝑛,𝑗𝑗(𝒙∗) + 𝑃𝜆′′(𝑥𝑗∗) ≥ 0
• S3ONC is weaker than the canonical second-order KKT condition
• Computable within pseudo-polynomial time
Y (1998), Liu et al. (2017), Haeser et al. (2018), Liu, Y (2019) https://arxiv.org/abs/1903.00616
Result 1: Efficacy of RSAA at a Global Minimum
Assumptions:
• (a) – (d)
THEOREM 1
Consider a global minimal solution 𝒙∗ to RSAA. If the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{2.5}/𝜖^{3}) (ln(𝑝/𝜖))^{1.5} + (1/𝜖^{2}) ln(1/𝛼),
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• Better dependence on 𝑝 than SAA: (𝑝/𝜖^{2}) ln(1/𝜖) + (1/𝜖^{2}) ln(1/𝛼)
• However, generating a global solution is generally computationally prohibitive
Result 2: Efficacy of RSAA at a Local Solution
Assumptions:
• (a) – (d)
• 𝑓( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
THEOREM 2
Consider an S3ONC solution 𝒙∗ to RSAA. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝟎) a.s., and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{2.5}/𝜖^{4}) (ln(𝑝/𝜖))^{2} + (1/𝜖^{2}) ln(1/𝛼),
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• Better dependence on 𝑝 than SAA: (𝑝/𝜖^{2}) ln(1/𝜖) + (1/𝜖^{2}) ln(1/𝛼)
• Much easier to solve than a global minimizer: pseudo-polynomial-time algorithms exist
Result 3: Efficacy of RSAA Under Additional Assumptions
Assumptions:
• (a) – (d)
• 𝑓( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
• 𝐅 is strongly convex and differentiable (∗)
THEOREM 3
Consider an S3ONC solution 𝒙∗. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝟎) a.s., and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{1.5}/𝜖^{3}) (ln(𝑝/𝜖))^{1.5} + (1/𝜖^{2}) ln(1/𝛼),
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• For some problems, while 𝑓( ⋅ ,𝑊) is only convex, 𝐅 can be strongly convex, e.g., high-dimensional linear regression with a random design matrix
• Under the additional assumption (∗), Theorem 3 shows an improved sample complexity over Theorem 2
Result 4: Efficacy of RSAA Under Additional Assumptions
Assumptions:
• (a) – (d)
• 𝑓( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
• 𝐅 is strongly convex and differentiable (∗)
• min{|𝑥𝑗^{𝑡𝑟𝑢𝑒}| : 𝑥𝑗^{𝑡𝑟𝑢𝑒} ≠ 0, 𝑗 = 1, …, 𝑝} ≥ Threshold (∗∗)
THEOREM 4
Consider an S3ONC solution 𝒙∗. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝟎) a.s., and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠/𝜖^{2}) ln(𝑝/𝜖) + (1/𝜖^{2}) ln(1/𝛼),
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
Under additional assumptions (∗) and (∗∗), no compromise on the rate in 𝜖 compared to SAA: (𝑝/𝜖^{2}) ln(1/𝜖) + (1/𝜖^{2}) ln(1/𝛼)
Numerical Experiments
Liu, Wang, Yao, Li, Ye, 2018. Math Program
(Figures: left, fix sample size 𝑛 = 100 and increase dimensionality; right, fix dimension 𝑝 = 100 and increase sample size.)
Generalizations and Applications
• Theoretical Generalizations
• From subgaussian distribution to subexponentiality
• From smooth to nonsmooth cost function
• Theoretical Applications
• High-Dimensional Statistical Learning (HDSL)
• Deep Neural Networks
Theoretical Generalizations
Assumption (e):
𝑓(𝒙,𝑊) is subexponential for all 𝒙 ∈ [0, 𝑅]^𝑝
Assumption (f):
𝑓(𝒙,𝑊) = 𝑓1(𝒙,𝑊) + max_{𝒖∈ℋ} {𝒖^⊤𝑨(𝑊)𝒙 − 𝜙(𝒖,𝑊)},
where 𝑨(𝑊) ∈ ℜ^{𝑚×𝑝} is a linear operator, 𝜙 is a measurable, deterministic function, 𝑓1( ⋅ ,𝑊) is continuously differentiable with Lipschitz continuous gradient, and ℋ ⊂ ℜ^𝑚 is a convex and compact set.
• Generalization from subgaussian distributions to subexponentiality
• Generalization from a smooth cost function 𝑓 to a non-smooth cost function
Result 5: Generalization to Subexponentiality
Assumptions: (a), (c), (d), (e) subexponentiality; 𝑓1( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
THEOREM 5
Consider an S3ONC solution 𝒙∗. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝒙^{ℓ1}) a.s., where 𝒙^{ℓ1} ≔ argmin{𝐹𝑛(𝒙) + 𝜆 ∑_{𝑗=1}^{𝑝} |𝑥𝑗| : 𝟎 ≤ 𝒙 ≤ 𝑅}, and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{3}/𝜖^{3}) (ln(𝑝/𝜖))^{1.5} + ln(𝑝/𝛼) + (ln(1/𝛼))^{3},
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• Better dependence on 𝜖 than Theorem 2: (𝑠^{2.5}/𝜖^{4}) (ln(𝑝/𝜖))^{2} + (1/𝜖^{2}) ln(1/𝛼)
• The desired solution can be generated by solving for an S3ONC solution initialized with 𝒙^{ℓ1}
Result 6: Generalization to Non-smoothness
Assumptions: (a), (c), (e) subexponentiality; (f); 𝑓( ⋅ , 𝑧) is convex for almost every 𝑧 ∈ 𝒲
THEOREM 6
Consider an S3ONC solution 𝒙∗. If 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝒙^{ℓ1}) a.s., where 𝒙^{ℓ1} ≔ argmin{𝐹𝑛(𝒙) + 𝜆 ∑_{𝑗=1}^{𝑝} |𝑥𝑗| : 𝟎 ≤ 𝒙 ≤ 𝑅}, and if the sample size 𝑛 satisfies
𝑛 ≳ (𝑠^{4}/𝜖^{4}) (ln(𝑝/𝜖))^{2} + ln(𝑝/𝛼) + (ln(1/𝛼))^{2},
then ℙ(𝐅(𝒙∗) − 𝐅(𝒙^{𝑚𝑖𝑛}) ≤ 𝜖) ≥ 1 − 𝛼, for any 𝛼 > 0 and 𝜖 ∈ (0, 0.5).
• Assumption (d) is replaced by (f)
• The desired solution can be generated by solving for an S3ONC solution initialized with 𝒙^{ℓ1}
Theoretical Application 1: High-Dimensional M-Estimation
• Given knowledge of a statistical loss function 𝑓, and hypothetically
𝒙^{𝑡𝑟𝑢𝑒} ∈ argmin 𝐅(𝒙) := 𝔼 𝑓(𝒙,𝑊)
𝒙^{𝑡𝑟𝑢𝑒}: true parameters governing the data-generation model
• Estimate 𝒙^{𝑡𝑟𝑢𝑒} ∈ ℜ^𝑝 given only 𝑓 and samples 𝑊1, 𝑊2, …, 𝑊𝑛
• Assume that 𝑓(𝒙,𝑊𝑖), 𝑖 = 1, …, 𝑛, are i.i.d. subexponential
• Traditional approach: minimize the empirical risk to obtain an unbiased estimator
𝒙̂ = argmin 𝐹𝑛(𝒙) := 𝑛^{−1} ∑_{𝑖=1}^{𝑛} 𝑓(𝒙,𝑊𝑖)
• Under the high-dimensional setting, 𝑝 ≫ 𝑛
• Consider folded concave penalized (FCP) learning
min 𝐹𝑛,𝜆(𝒙) := 𝑛^{−1} ∑_{𝑖=1}^{𝑛} 𝑓(𝒙,𝑊𝑖) + ∑_{𝑗=1}^{𝑝} 𝑃𝜆(|𝑥𝑗|)
Result 7: Learning Under Approximate Sparsity
THEOREM 7
For any 𝛤 ≥ 0, consider an S3ONC solution 𝒙∗ with 𝐹𝑛,𝜆(𝒙∗) ≤ min𝒙 𝐹𝑛,𝜆(𝒙) + 𝛤, a.s. Assume (a) subexponentiality; (b) Lipschitz-like continuity of 𝐅 and 𝛻𝐅; (c) approximate sparsity. Then
Excess risk of 𝒙∗ ≤ Õ(1) ⋅ [𝑠 ln 𝑝 / 𝑛^{1/3} + (𝛤 + 𝜀̂)/𝑛^{1/3} + 𝛤 + 𝜀̂]
with probability 1 − 𝑂(1) ⋅ exp(−𝑂(1) 𝑛^{1/3} ln 𝑝)
• First explication under assumptions more general than exact sparsity
• 𝛤 can be interpreted as the sub-optimality gap of 𝒙∗
• No convexity or RSC assumed
• Approximate sparsity: existence of 𝒙̄ such that 𝐅(𝒙̄) − 𝐅(𝒙^{𝑡𝑟𝑢𝑒}) ≤ 𝜀̂ and 𝑠 := ‖𝒙̄‖_0 ≪ 𝑝
Result 8: FCP+LASSO
Removing 𝜞 via Wise Initialization for Convex Loss
Step 1: Solve the LASSO program
𝒙^{ℓ1} ≔ argmin𝒙 𝐹𝑛(𝒙) + 𝜆 ∑_{𝑗=1}^{𝑝} |𝑥𝑗|
Step 2: Solve for an S3ONC stationary point initialized with 𝒙^{ℓ1}
• A modified first-order method, ADMM
Algorithm-independent, fully polynomial-time approximation scheme: Complexity ≤ Poly(𝑝, desired accuracy)
THEOREM 8
Consider an S3ONC solution 𝒙∗ with 𝐹𝑛,𝜆(𝒙∗) ≤ 𝐹𝑛,𝜆(𝒙^{ℓ1}), a.s. Assume (a) subexponentiality; (b) Lipschitz-like continuity of 𝐅 and 𝛻𝐅; (c) sparsity (i.e., 𝜀̂ = 0 in the assumption of approximate sparsity). Then
Excess risk of 𝒙∗ ≤ Õ(1) ⋅ [𝑠 ‖𝒙^{𝑡𝑟𝑢𝑒}‖∞ ln 𝑝 / 𝑛^{1/3} + 𝑠 ln 𝑝 / 𝑛^{2/3}]
with probability 1 − 𝑂(1) ⋅ exp(−𝑂(1) 𝑛^{1/3} ln 𝑝)
A pseudo-polynomial-time computable theory without RSC
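The two-stage recipe can be sketched for a least-squares loss, using ISTA for Step 1 and a proximal-gradient pass with the closed-form MCP prox for Step 2. The talk's actual solver (a modified first-order method / ADMM) differs; all names and parameters here are illustrative:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the l1 prox."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def mcp_prox(z, lam, a, eta):
    """Closed-form prox of the MCP penalty (requires step size eta < a)."""
    return np.where(np.abs(z) <= a * lam,
                    soft(z, lam * eta) / (1.0 - eta / a),  # shrunk region
                    z)                                     # flat tail: untouched

def fcp_plus_lasso(A, b, lam, a=3.0, iters=500):
    """Step 1: LASSO solution x_l1 by ISTA on F_n(x) = ||Ax - b||^2 / (2n).
    Step 2: refine x_l1 toward an S3ONC point of the MCP-penalized problem."""
    n, p = A.shape
    eta = n / np.linalg.norm(A, 2) ** 2          # 1/L for grad F_n
    grad = lambda x: A.T @ (A @ x - b) / n
    x = np.zeros(p)
    for _ in range(iters):                       # Step 1: ISTA (LASSO)
        x = soft(x - eta * grad(x), lam * eta)
    for _ in range(iters):                       # Step 2: MCP refinement
        x = mcp_prox(x - eta * grad(x), lam, a, eta)
    return x
```

Step 2 leaves large coefficients unshrunk (the MCP flat tail), which is how the refinement removes the LASSO bias while keeping the sparsity found in Step 1.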
Comparing This Research with Existing Results
• New insights on local solutions for folded concave penalty
• Pros:
• No requirement of RSC, RIP, RE, or alike
• Applicable to non-smooth learning and deep learning
• Applicable to high-dimensional stochastic programming
• Generalization error bounds on excess risk
• Cons:
• Slower rate than alternative results that require RSC
Rates on the excess risk:
Under RSC (e.g., Loh & Wainwright 2015) | 𝑂(1/𝑛)
Without RSC, linear regression only (e.g., Zhang et al., 2017) | 𝑂(1/𝑛)
This research: M-estimation without RSC | 𝑂(1/𝑛^{1/3})
Theoretical Application 2: Deep Learning
Data generation: 𝑏𝑖 = 𝑔(𝒂𝑖) + 𝑤𝑖, 𝑖 = 1, …, 𝑛, for some unknown 𝑔: ℜ^𝑑 → ℜ
Traditional training model:
min𝒙 (1/2𝑛) ∑_{𝑖=1}^{𝑛} (𝑏𝑖 − 𝐹𝑁𝑁(𝒂𝑖, 𝒙))^{2}
• Expressive power: Yarotsky (2017), Mhaskar (1996), DeVore et al. (1989), Mhaskar and Poggio (2016)
• Two-layer, polynomial increase in dimensions, or sometimes exponential growth in layer number: Arora et al. (2018), Barron and Klusowski (2018), Bartlett et al. (2017), Golowich et al. (2017), Neyshabur et al. (2015, 2017, 2018)
Scarcity of theoretical guarantees on generalization performance under over-parameterization
Approximate Sparsity and Regularization in Neural Networks
Neural networks may not obey RSC, but they innately entail approximate sparsity
• Proposed training model: find an S3ONC solution to the formulation below
min𝒙 (1/2𝑛) ∑_{𝑖=1}^{𝑛} (𝑏𝑖 − 𝐹𝑁𝑁(𝒂𝑖, 𝒙))^{2} + ∑_{𝑗=1}^{𝑝} 𝑃𝜆(|𝑥𝑗|)
• Expressive power of NNs implies approximate sparsity
• E.g., a ReLU network with 𝑐𝑛^{1/3} ⋅ ln(𝑛^{𝑟/(3𝑑)}) + 1 many weights approximates any 𝑟-th order weakly-differentiable function with accuracy 𝑂(1)𝑛^{−𝑟/(3𝑑)} (Yarotsky, 2017)
• So 𝑠 = 𝑐𝑛^{1/3} ⋅ ln(𝑛^{𝑟/(3𝑑)}) + 1 and 𝜀̂ = 𝑂(1)𝑛^{−𝑟/(3𝑑)} in approximate sparsity
• Empirical evidence on sparse neural networks
• E.g., Dropout: Srivastava et al. (2014); DropConnect: Wan et al. (2013); deep compression (remove links with small weights and retrain): Han et al. (2016)
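To make the proposed training model concrete, here is a toy one-hidden-layer ReLU network trained on squared loss plus an MCP penalty on all weights, via plain (sub)gradient descent. The architecture, hyperparameters, and optimizer are my illustrative choices, not the talk's setup:

```python
import numpy as np

def mcp_grad(x, lam, a):
    """(Sub)gradient of the MCP penalty: lam * sign(x) * (1 - |x|/(a*lam))_+
    (zero in the flat tail, so large weights are not shrunk)."""
    return lam * np.sign(x) * np.maximum(1.0 - np.abs(x) / (a * lam), 0.0)

def train_fcp_nn(A, b, hidden=16, lam=0.01, a=3.0, lr=0.05, steps=2000, seed=0):
    """Squared loss + MCP penalty on all weights of a 1-hidden-layer ReLU net."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    W1 = rng.normal(0, 1 / np.sqrt(d), (d, hidden))
    w2 = rng.normal(0, 1 / np.sqrt(hidden), hidden)
    for _ in range(steps):
        H = np.maximum(A @ W1, 0.0)              # ReLU features
        res = H @ w2 - b                         # residuals
        g2 = H.T @ res / n + mcp_grad(w2, lam, a)
        gH = np.outer(res, w2) * (H > 0)         # backprop through ReLU
        g1 = A.T @ gH / n + mcp_grad(W1, lam, a)
        W1 -= lr * g1
        w2 -= lr * g2
    return W1, w2
```

The penalty's zero gradient on large weights leaves the fitted structure intact while pushing near-zero weights to exactly the sparse regime the theory exploits.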
Generalization Performance of Deep Learning in Special Case 1
For 𝑔: ℜ^𝑑 → ℜ that has an 𝑟-th order weak derivative; assume that the PDF of the inputs to neurons has a continuous density in a neighborhood of 0.
Excess risk ≤ Õ(1/𝑛^{1/3} + 1/𝑛^{1/6+𝑟/(6𝑑)} + 1/𝑛^{𝑟/(3𝑑)} + 𝛤/𝑛^{1/3}) ⋅ ln 𝑝 + 𝛤
Poly-logarithmic in dimensionality
Generalization Performance of Deep Learning in Special Case 2
For 𝑔: ℜ^𝑑 → ℜ being a polynomial function
Excess risk ≤ Õ(1/𝑛^{1/3} + 𝛤/𝑛^{1/3}) ⋅ ln 𝑝 + 𝛤
Poly-logarithmic in dimensionality
Numerical Experiment 1: MNIST
Handwriting recognition problem:
• 28 x 28-pixel images of handwritten digits
• 60,000 training data
• 10,000 test data
Numerical Experiment 1: MNIST
• 800 × 800 + softmax
• Fully connected
• No preprocessing/distortion
• 500 × 150 × 10 ×… × 10 × 150 + softmax
• Fully connected
• No preprocessing/distortion
Numerical Experiment 2: Matrix Completion
Ratings of customers for products/restaurants/movies often constitute an incomplete matrix
𝑝 = 100 × 100 = 10,000, 𝑛 = 2,000
Numerical Experiment 2: Matrix Completion
𝑝 = 100 × 100 = 10,000, 𝑛 = 2,000
Method | Sub. | Nuclear norm (Lasso) | S3ONC
Error | 85.7699 | 0.1072 | 0.0005
Summary
• Sample-complexity analysis is an essential problem in optimization driven by data and in high-dimensional/deep learning
• Theories can be developed to match a sample-size upper bound to the information-theoretic lower bound, as in RL via smart sampling
• Based on the structure of problems/solutions, adding certain regularizations can greatly reduce the sample complexity; e.g., RSAA achieves poly-logarithmic sample complexity at any S3ONC solution under standard statistical and sparsity assumptions
• Although the RSAA problem becomes nonconvex when the regularization is concave, an S3ONC solution can be efficiently computed in both theory and practice
• Future Research: further theoretical developments and more comprehensive
numerical results