
Page 1

Normal-bundle Bootstrap 

Ruda Zhang*

[email protected] [email protected]

Roger Ghanem

[email protected]

Jul 29, 2020

1 / 18

Page 2

Talk outline

Overview: probabilistic distributions with geometric structure

Related literature

Bootstrap variants for regression

Normal-bundle bootstrap: algorithm, statistical theory

Experiments:

1. Inference: confidence set of density ridge
2. Data augmentation: regression by deep neural net

More on ridge estimation

Discussion

Ongoing research

arxiv.org/abs/2007.13869 2 / 18

Page 3

When data sets are modelled as multivariate probability distributions, such distributions often have salient geometric structure.

Examples:

Regression; topological data analysis, incl. manifold learning; deep learning.

Manifold distribution hypothesis:

Natural high-dimensional data concentrate close to a nonlinear low-dimensional manifold.

Goal: generate new data which preserve the geometric structure of a probability distribution modelling a given data set.

This paper: normal-bundle bootstrap, a variant of the bootstrap resampling method.

Applications:

1. Inference of statistical estimators.
2. Data augmentation: increase training data diversity to reduce overfitting, without collecting new data.

Inspirations:

Algorithms for nonlinear dimensionality reduction: subspace-constrained mean shift (SCMS) for density ridge estimation;
Constructions in differential geometry: fiber bundle.

Overview

arxiv.org/abs/2007.13869 3 / 18

Page 4

Related literature

Probabilistic learning on manifolds:

Soize and Ghanem, Data-driven probability concentration and sampling on manifold, Journal of Computational Physics, 2016
Soize and Ghanem, Probabilistic learning on manifolds, arXiv, 2020
Zhang, Wingo, Duran, Rose, Bauer, and Ghanem, Environmental economics and uncertainty: Review and a machine learning outlook, Oxford Research Encyclopedia of Environmental Science, 2020

Ridge estimation:

Hastie and Stuetzle, Principal curves, Journal of the American Statistical Association, 1989
Ozertem and Erdogmus, Locally defined principal curves and surfaces, Journal of Machine Learning Research, 2011
Genovese, Perone-Pacifico, Verdinelli, and Wasserman, Nonparametric ridge estimation, Annals of Statistics, 2014
Chen, Genovese, and Wasserman, Asymptotic theory for density ridges, Annals of Statistics, 2015
Chen, Genovese, Ho, and Wasserman, Optimal ridge detection using coverage risk, in Advances in Neural Information Processing Systems, 2015

Statistics on manifolds:

Chikuse, Statistics on Special Manifolds, Springer, 2003
Bhattacharya and Bhattacharya, Nonparametric Inference on Manifolds: With Applications to Shape Spaces, Cambridge University Press, 2012
Patrangenaru and Ellingson, Nonparametric Statistics on Manifolds and Their Applications to Object Data Analysis, CRC Press, 2015
Chen, Solution manifold and its statistical applications, arXiv, 2020

arxiv.org/abs/2007.13869 4 / 18

Page 5

Data generating mechanism: $y = f(x) + \varepsilon(x)$, $\mu_\varepsilon = 0$

Fitted (regression) model: $\hat{y} = \hat{f}(x)$

Residuals: $u_i = y_i - \hat{y}_i$, $i \in \{1, \ldots, N\}$

Residual bootstrap (global): $\tilde{y}_i = \hat{y}_i + u_j$, $j \sim U\{1, \ldots, N\}$

Wild bootstrap @WuCF1986: $\tilde{y}_i = \hat{y}_i + w u_i$, $w \sim U\{-1, 1\}$ (more generally, $\mu_w = 0$, $\sigma_w = 1$)

(A minimal sketch of both variants follows this slide.)

Replacing the regression model with a dimensionality-reduction model, we can obtain other bootstrap variants:

              | Regression        | Dimensionality Reduction
Linear        | Linear Regression | Principal Component Analysis
Nonparametric | Smooth Regression | Principal Curve / Ridge Estimation

Dimensionality reduction vs. regression: linear and nonparametric methods. Adapted from [@Hastie1989, Fig 1].

Bootstrap variants for regression

arxiv.org/abs/2007.13869

Chien-Fu Jeff Wu, Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis, Annals of Statistics, 1986
Hastie and Stuetzle, Principal curves, Journal of the American Statistical Association, 1989
5 / 18
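To make the two regression bootstrap variants above concrete, here is a minimal NumPy sketch; the data-generating function and the cubic polynomial fit are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data y = f(x) + noise (f and the noise level are
# illustrative assumptions).
N = 200
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, N)

# Fitted model y_hat = f_hat(x): here a cubic polynomial stands in for any fit.
coef = np.polyfit(x, y, 3)
y_hat = np.polyval(coef, x)
u = y - y_hat                        # residuals u_i

# Residual bootstrap (global): add a residual drawn uniformly from all residuals.
j = rng.integers(0, N, N)            # j ~ U{1, ..., N}
y_resid = y_hat + u[j]

# Wild bootstrap (Wu 1986): keep each residual local, multiply by a random
# weight w with mean 0 and variance 1 (here w ~ U{-1, 1}).
w = rng.choice([-1.0, 1.0], N)
y_wild = y_hat + w * u
```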

Page 6

For a 2d Gaussian PDF (blue contours), its 1d density ridge (bold line) is its 1st principal component line, where the normal spaces (thin lines) are parallel to the 2nd principal component. Probability density on the normal spaces (right margin) declines faster than that on the ridge (top margin).

In general, a density ridge is nonlinear, and its normal bundle (the collection of normal spaces) decomposes the original distribution into one on the ridge and one on each normal space.

Density ridge and its normal bundle

arxiv.org/abs/2007.13869 6 / 18

Page 7

NBB algorithm (marked steps are optional; a minimal sketch follows this slide):

NBB(X, d, α, k):

1. kernel bandwidth selection: $h \leftarrow \alpha \arg\max_h \sum_{i=1}^{N} \log p_{h,-i}(x_i)$
2. ridge estimation: $(r_i, V_{c,i}) \leftarrow \mathrm{SCMS}(x_i; \log p_h, d)$, for $i \in N$
3. align bottom-$c$ eigenvectors: $E \leftarrow \mathrm{SmoothFrame}(r, V_c, n - d)$
4. coordinates of normal vectors: $n_i \leftarrow E_i^T (x_i - r_i)$, for $i \in N$
5. $k$-nearest neighbors on ridge: $K \leftarrow \mathrm{KNN}(r, k)$
6. construct new data: $\tilde{x}_{ij} \leftarrow r_i + E_i\, n_{K(i,j)}$, for $i \in N$, $j \in k$

The algorithm moves data points (blue) to the ridge (red) and, for each ridge point, picks neighboring ridge points (shade) and adds the projection vectors (dashed line) to construct new data points (hollow). Using a smooth frame can keep the constructed points in the normal space.

B, basin of attraction; R, density ridge; π, projection; U, a neighborhood in density ridge.

Normal-bundle bootstrap

arxiv.org/abs/2007.13869

7 / 18
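Below is a minimal NumPy sketch of the NBB construction for the simplest case described on page 6, where the data are roughly Gaussian and the density ridge is the 1st principal component line; for a nonlinear ridge the PCA step would be replaced by SCMS. The function name and the example covariance are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def nbb_gaussian(X, k):
    """Normal-bundle bootstrap sketch for the Gaussian / linear-ridge case,
    where the density ridge is the 1st principal component line.
    Steps 4-6 of NBB; ridge estimation here is simply PCA, not SCMS."""
    N, n = X.shape
    mu = X.mean(axis=0)
    # Ridge estimation: principal component line through the mean (d = 1).
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    tangent, normal = Vt[0], Vt[1:]           # tangent and c = n - 1 normals
    t = (X - mu) @ tangent                    # tangential coordinates
    R = mu + np.outer(t, tangent)             # ridge points r_i
    nu = (X - R) @ normal.T                   # normal coordinates n_i
    # k-nearest neighbors of each ridge point, measured along the ridge.
    order = np.argsort(np.abs(t[:, None] - t[None, :]), axis=1)
    K = order[:, 1:k + 1]                     # exclude the point itself
    # Construct new data: ridge point + a neighbor's normal vector.
    X_new = R[:, None, :] + nu[K] @ normal    # shape (N, k, n)
    return X_new.reshape(-1, n)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4, 1.5], [1.5, 1]], size=128)
X_aug = nbb_gaussian(X, k=8)                  # 128 * 8 constructed points
```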

Page 8

Parabola with heteroscedastic error.

Data generating mechanism:

$x \sim U(0, 1)$, $y = x^2$
$(\tilde{x}, \tilde{y}) \sim (x, y) + \sigma(x)\, N(0, I)$, $\sigma(x) = (1 + x)/20$

(left) data generation: noiseless model (black), data (blue).

(right) ridge estimation: estimated ridge (red).

(bottom) "residual" plot: projection vectors (black), neighborhood of a random ridge point (red). Neighborhood size k = N / 16. (Relative size: 6.25%)

Example

arxiv.org/abs/2007.13869

8 / 18
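A minimal sketch of this data-generating mechanism (the sample size N is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Parabola with heteroscedastic error: sigma(x) = (1 + x) / 20.
N = 256
x = rng.uniform(0, 1, N)
y = x ** 2
sigma = (1 + x) / 20
data = np.column_stack([x, y]) + sigma[:, None] * rng.standard_normal((N, 2))
```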

Page 9

Assumption: In a neighborhood of the ridge $R$,

$p(x)$ is three times differentiable;
$p(x)$ is sharply curved in normal spaces: $\lambda_c < -\beta$ and $\lambda_c < \lambda_{c+1} - \beta$, where $\beta > 0$;
trajectories $\phi_x(t)$ are not too wiggly and tangential gradients $U(x) g(x)$ are not too large: $\|U(x) g(x)\| \max_{i,j,k} |\partial H_{ij} / \partial x_k|(x) < \beta^2 / (2 n^{3/2})$.

Theorem (consistency): Let the assumptions hold for the measure $\mu$ in the basin of attraction $B$, and let the conditional measure $\mu_{F_r}$ vary slowly over the ridge $R$. Then for each estimated ridge point $r = \pi_N(x) = \phi_N(x)$, as sample size $N \to \infty$, the distributions of the constructed data points $\tilde{x}_j$, $j \le k$, converge to the distribution restricted to the fiber of the estimated ridge point: $\tilde{x}_j \mid r \xrightarrow{d} x \mid F_r \sim \mu_{F_r}$.

Finite-sample behavior is also very good:

As soon as the estimated ridge becomes close enough to the true ridge such that the conditional measures over the estimated ridge vary slowly, the conditional measures on neighboring fibers become similar to each other: $\mu_{F_{r_j}} \approx \mu_{F_r}$. This suffices to make the constructed data distribute similarly to the original measure restricted to a fiber: $\tilde{x}_j \mid r \sim x \mid F_{\hat{r}}$.

Even if the estimated ridge has a finite bias relative to the true ridge, it would not affect the conclusion.

Statistical theory

arxiv.org/abs/2007.13869

9 / 18

Page 10

NBB confidence set (confidence level $1 - \alpha$):

$\hat{C}^{NBB}_N = \hat{R}_N \oplus D_\alpha = \{r + n : r \in \hat{R}_N,\ n \in D_\alpha(r)\}$
$D_\alpha(\hat{r}) = \hat{m} \oplus \varepsilon_\alpha = \{\hat{n} \in N_{\hat{r}} \hat{R}_N : d(\hat{n}, \hat{m}) < \varepsilon_\alpha\}$

For $r_i$, $\hat{m}_i$ is the mode estimated from the constructed points $\tilde{x}_{ij}$, and $\hat{\varepsilon}_{\alpha,i}$ is the $\alpha$-upper quantile of $\{d(\hat{m}_{i,b}, \hat{m}_i)\}_{b=1}^{B}$.

Bootstrap confidence set @ChenYC2015:

$\hat{C}^{B}_N = \hat{R}_h \oplus \varepsilon_\alpha = \{x \in \mathbb{R}^n : d(x, \hat{R}_h) < \varepsilon_\alpha\}$

Here, $\hat{\varepsilon}_\alpha$ is the $\alpha$-upper quantile of $\{d(r_{i,b}, \hat{R}_h)\}_{b=1,\ldots,B;\ i=1,\ldots,N}$.

Inference: confidence set of density ridge

Experiment: $x = r e^{i\theta}$, $\theta \sim U[0, 2\pi)$, $r \sim N(1, 0.2^2)$; the true ridge is approximately the circle $r \approx 1$. Data (blue), estimated ridge (red), true ridge (black), 90% confidence sets of true ridge (gray) by NBB (a) vs. bootstrap (b). N = 128.

arxiv.org/abs/2007.13869

Chen, Genovese, and Wasserman, Asymptotic theory for density ridges, Annals of Statistics, 2015 10 / 18
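A rough per-ridge-point sketch of the NBB interval $D_\alpha(r)$ in a one-dimensional normal space (c = 1). The KDE-based mode estimator and the bootstrap of mode deviations are assumptions of this sketch; the slide only specifies that $\hat{m}_i$ is a mode estimate and $\hat{\varepsilon}_{\alpha,i}$ an $\alpha$-upper quantile of bootstrap mode deviations:

```python
import numpy as np

def mode_1d(v, grid=256):
    """Crude 1-D mode estimate: argmax of a Gaussian KDE on a grid
    (the choice of mode estimator is an assumption of this sketch)."""
    h = 1.06 * v.std() * len(v) ** -0.2               # Silverman's rule
    xs = np.linspace(v.min(), v.max(), grid)
    dens = np.exp(-0.5 * ((xs[:, None] - v[None, :]) / h) ** 2).sum(axis=1)
    return xs[np.argmax(dens)]

def nbb_interval(nu, alpha=0.1, B=200, rng=None):
    """Interval D_alpha(r) in normal coordinates at one ridge point: the mode
    of the constructed normal vectors nu, widened by the alpha-upper quantile
    of bootstrap mode deviations d(m_hat_b, m_hat)."""
    rng = rng or np.random.default_rng()
    m_hat = mode_1d(nu)
    m_boot = np.array([mode_1d(rng.choice(nu, nu.size)) for _ in range(B)])
    eps = np.quantile(np.abs(m_boot - m_hat), 1 - alpha)
    return m_hat - eps, m_hat + eps
```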

Page 11

[Figure: (c) coverage rate of the 90% confidence set (%) and (d) average compute time (s), vs. sample size (256, 512, 1024); legend: Bootstrap, Normal-bundle bootstrap.]

Metrics of NBB (orange) and bootstrap (blue) over an ensemble of samples: (c) coverage rate, mean (solid line) and 90% prediction interval (shade); (d) average computation time.

$\hat{C}^{NBB}$ is valid throughout the range of sample sizes computed, while the validity of $\hat{C}^{B}$ slowly improves. Moreover, $\hat{C}^{B}$ is computationally costlier than $\hat{C}^{NBB}$, due to repeated ridge estimation. Although repeated mode estimation is also costly, it is faster than ridge estimation of the same problem size.

Note that other population parameters like the mean and quantiles can be estimated much faster than the mode, so the related inference using NBB will be much faster than in this example, such as confidence sets of principal manifolds [@Hastie1989].

Inference: metrics on NBB vs. bootstrap

arxiv.org/abs/2007.13869

Hastie and Stuetzle, Principal curves, Journal of the American Statistical Association, 1989 11 / 18

Page 12

The neural network is a sequential model with 4 densely connected hidden layers, which have 256, 128, 64, and 32 units respectively and use the ReLU activation function; the output layer has 16 units. We train the network to minimize mean squared error (a minimal sketch follows this slide). Set $k = 16$ in NBB.

Without augmentation the network starts to overfit around epoch 100, while with augmentation the network trains faster, continues to improve over time, and has a lower error.

[Figure: (left) theta vs. x1, legend: original, augmented; (right) Mean Absolute Error vs. Epoch, legend: train, validate, NBB train, NBB validate.]

Data augmentation. (left) original and augmented data, showing $(\theta, x_1)$ only. Noiseless true model in black line. (right) training and validation error with and without NBB.

Data augmentation: regression by deep neural net

True model: $(x_j, y_j)(\theta) = (\cos(\pi(\theta + \gamma_j)), \sin(\pi(\theta + \gamma_j)))$, where $\theta \in [0, 2)$ and $\{\gamma_j\}_{j=1}^{l} \subset [0, 2)$.

Observation: $(\tilde{\theta}, (\tilde{x}_j, \tilde{y}_j)_{j=1}^{l}) \sim (\theta, (x_j(\theta), y_j(\theta))_{j=1}^{l}) + N(0, 0.2^2 I)$.

Training data: $((\tilde{\theta}_i, (\tilde{x}_{ij}, \tilde{y}_{ij})_{j=1}^{l}))_{i=1}^{N}$. Let $l = 8$ and $N = 32$, so the training data is a $32 \times 17$ matrix.

arxiv.org/abs/2007.13869

12 / 18
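A minimal Keras sketch of the stated architecture (four dense hidden layers of 256, 128, 64, and 32 ReLU units; a 16-unit output layer; mean squared error). The optimizer and the one-dimensional input (θ mapped to the 16 coordinates) are assumptions of this sketch, not specified on the slide:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(1,)),           # theta (assumed input)
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16),                    # the 16 coordinates (x_j, y_j)
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Training on the original or the NBB-augmented table (theta -> 16 outputs):
# model.fit(train_theta, train_xy, epochs=1000, validation_split=0.2)
```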

Page 13

Definition: The ridge of dimension $d \in \{0, \ldots, n\}$ for a twice differentiable function $f : \mathbb{R}^n \mapsto \mathbb{R}$, denoted as $\mathrm{Ridge}(f, d)$, is the set of points where the $c = n - d$ smallest eigenvalues of the Hessian are negative, and the span of their eigenspaces is orthogonal to the gradient:

$\mathrm{Ridge}(f, d) := \{x \in \mathbb{R}^n : \lambda_c < 0,\ L g = 0\}$

Here, the Hessian $H = \nabla \nabla f$ has an eigen-decomposition $H = V \Lambda V^T$, where $\Lambda = \mathrm{diag}(\lambda)$ and $\lambda = (\lambda_i)_{i=1}^{n}$ is in increasing order. Let $V = (V_c; V_d)$, where $V_c$ and $V_d$ are column matrices of $c$ and $d$ eigenvectors respectively. Denote projection matrices $U = V_d V_d^T$, $L = V_c V_c^T = I - U$, and gradient $g = \nabla f$.

Assumption: Let $D = \{x \in \mathbb{R}^n : p(x) > 0\}$, assume that:

1. $p|_D \in C^2(D, \mathbb{R}_{>0})$;
2. for some $d \in \{1, \ldots, n-1\}$, $\mathrm{Ridge}(p, d) \subset D$ is an embedded $d$-dimensional submanifold of $\mathbb{R}^n$.

Estimation by subspace-constrained mean shift (SCMS): @Ozertem2011

subspace-constrained gradient flow: $v = L g$

SCMS update: $x_{t+1} = x_t + \kappa_t L g_h(x_t)$, $\kappa_t = h^2 / \hat{p}_h(x_t)$

Asymptotic theory:

Geometric and topological consistency and convergence rates of estimated ridge to true ridge and hidden manifold. @Genovese2014
Bootstrap gives consistent confidence sets of smoothed ridge. @ChenYC2015

Density ridge

arxiv.org/abs/2007.13869

Ozertem and Erdogmus, Locally defined principal curves and surfaces, Journal of Machine Learning Research, 2011.
Genovese, Perone-Pacifico, Verdinelli, and Wasserman, Nonparametric ridge estimation, Annals of Statistics, 2014.
Chen, Genovese, and Wasserman, Asymptotic theory for density ridges, Annals of Statistics, 2015.
13 / 18
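A minimal SCMS sketch with a Gaussian KDE, following the update rule above; the bandwidth, stopping tolerance, and example data are assumptions of this sketch:

```python
import numpy as np

def scms(X, x0, d=1, h=0.3, max_iter=500, tol=1e-8):
    """Project the point x0 onto the estimated d-dimensional density ridge of
    the Gaussian KDE of X, by subspace-constrained mean shift:
    x <- x + (h^2 / p_h(x)) * L * grad p_h(x)."""
    N, n = X.shape
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        diff = X - x                                            # (N, n)
        w = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / h ** 2)   # kernel weights
        p = w.sum() / N                                         # KDE value (unnormalized)
        grad = (w[:, None] * diff).sum(axis=0) / (N * h ** 2)   # KDE gradient
        hess = (np.einsum("i,ij,ik->jk", w, diff, diff) / h ** 4
                - np.eye(n) * w.sum() / h ** 2) / N             # KDE Hessian
        # Hessian of the log-density, used to pick the normal subspace.
        hess_log = hess / p - np.outer(grad, grad) / p ** 2
        eigval, eigvec = np.linalg.eigh(hess_log)               # ascending eigenvalues
        Vc = eigvec[:, : n - d]                                 # bottom-c eigenvectors
        L = Vc @ Vc.T                                           # normal-space projector
        step = (h ** 2 / p) * (L @ grad)                        # subspace-constrained mean shift
        x += step
        if np.linalg.norm(step) < tol:
            break
    return x

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.1 * rng.standard_normal((200, 2))
ridge_point = scms(X, X[0])        # move one data point to the estimated ridge
```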

Page 14

Density ridge | Gradient flow | Subspace-constrained gradient flow

Gradient field: $g = \nabla \log p_h$

Subspace-constrained gradient field: $v = L g$, $L = V_c V_c^T$

Subspace-constrained gradient flow as projection to estimated density ridge. Data (blue points); true (gray curve) and estimated (red curve) density ridge; trajectories (orange curves), pointing towards estimated ridge. (a) True ridge is the unit circle, a manifold without boundary; the estimated ridge is also without boundary. (b) True ridge is a parabola segment, a manifold with boundary; the estimated ridge is unbounded.

Subspace-constrained gradient flow

arxiv.org/abs/2007.13869

14 / 18

Page 15

With a stronger smoothness and manifold assumption, under the subspace-constrained gradient flow, each data point $x_i$ converges to an estimated ridge point $r_i$ with probability one:

Proposition (flow): If $p(x)$ has bounded super-level sets $B_c = \{x \in \mathbb{R}^n : p(x) \ge c\}$ for all $c > 0$, then $v(x)$ generates a semi-flow $\phi : \mathbb{R}_{\ge 0} \times D \mapsto D$. If $p(x)$ has a compact support $\bar{D}$, let $v(x) = 0$, $\forall x \in \partial D$, then $v(x)$ generates a flow $\phi : \mathbb{R} \times \bar{D} \mapsto \bar{D}$. Moreover, if $v(x)$ is locally Lipschitz or $C^k$, $k \ge 1$, then $\phi$ is locally Lipschitz or $C^k$, respectively.

Proposition (convergence): If $p(x)$ is analytic and has bounded super-level sets, then every forward trajectory converges to a fixed point: $\forall x \in \mathbb{R}^n$, $\exists x^* \in v^{-1}(0)$, $\lim_{t \to +\infty} \phi_x(t) = x^*$.

Proposition (basin): If $p(x)$ is analytic and has a compact support $\bar{D}$, and $A_u = \{x \in D : v = 0, \lambda_c > 0\}$ and $A_c = \{x \in D : v = 0, \lambda_c = 0\}$ are, respectively, embedded $d$- and $(d-1)$-submanifolds of $\mathbb{R}^n$, then $B$ is a subset of full Lebesgue measure and therefore has probability one: $\lambda(D \setminus B) = 0$, $\mu(B) = 1$, where $\lambda$ is the Lebesgue measure on $\mathbb{R}^n$.

Qualitative properties of dynamical system

arxiv.org/abs/2007.13869

15 / 18

Page 16

Kernel bandwidth selection:

Maximum likelihood bandwidth tends to be too small, such that the estimated ridge often has isolated points. We use an oversmoothing parameter α, usually between 2 and 4, and good estimates can often be obtained across a wide range of α values. @ChenYC2015b gave a method to select h that minimizes coverage risk estimates.

Acceleration of SCMS:

A naive implementation of SCMS would be linearly convergent, and have a computational complexity of $O(N^2 n^3)$ per iteration, where the $O(N^2)$ part comes from computing the density estimate at each update point $x_t$ using all data points, and the $O(n^3)$ part comes from eigen-decomposition of the Hessian.

To reduce computation per iteration to $O(k N d n^2)$:

local KDE: use the $k$-nearest data points for density estimation;
partial eigen-decomposition: compute only the top-$d$ eigen-pairs (a sketch follows this slide).

To reduce the number of iterations, use Newton's method for root finding to obtain quadratic convergence:

$x_{t+1} = x_t + L_t \delta_t$, where $\delta_t$ solves $L_0 H_t L_0 \delta_t = -L_0 g_t$ or $L_t H_t L_t \delta_t = -L_t g_t$.

The former only requires (partial) eigen-decomposition at the first step, and the latter has a larger convergence region. @ZhangRD2020nr

Discussion

arxiv.org/abs/2007.13869

Chen, Genovese, Ho, and Wasserman, Optimal ridge detection using coverage risk, in Advances in Neural Information Processing Systems, 2015.
R. Zhang, Newton retraction as approximate geodesics on submanifolds, arXiv, 2020.
16 / 18
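A small sketch of the partial eigen-decomposition speedup mentioned above: the normal-space projector $L = I - V_d V_d^T$ needs only the top-$d$ eigenpairs of the Hessian, which scipy.linalg.eigh can compute via an index subset. The example matrix is an illustrative assumption:

```python
import numpy as np
from scipy.linalg import eigh

def normal_projector(H, d):
    """Normal-space projector L = I - V_d V_d^T from only the top-d eigenpairs
    of the symmetric Hessian H, avoiding a full eigen-decomposition."""
    n = H.shape[0]
    _, Vd = eigh(H, subset_by_index=[n - d, n - 1])   # d largest eigenpairs
    return np.eye(n) - Vd @ Vd.T

# Example: project a gradient g onto the normal subspace for n = 5, d = 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = (A + A.T) / 2
g = rng.standard_normal(5)
v = normal_projector(H, d=1) @ g      # subspace-constrained gradient
```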

Page 17

Ongoing research

Probabilistic learning on manifolds (theory):

Normal-bundle bootstrap.
Kernel density estimation and sampling on submanifolds.
Manifold-based joint probabilistic models.

Applications:

Surrogate modeling, with application to oil spill models.

Numerical procedures:

Newton retraction as approximate geodesics on submanifolds.
Acceleration of SCMS.

arxiv.org/abs/2007.13869 17 / 18

Page 18

The authors thank many people for support and valuable discussions:

David Banks (Duke)
Mansoor Haider (NCSU)
Emily Griffith (NCSU)
Ryan Murray (NCSU)
Greg Forest (UNC Chapel Hill)
Michael E Taylor (UNC Chapel Hill)
Jeremy Louis Marzuola (UNC Chapel Hill)
Ernest Fokoue (RIT)

This work was supported, in part, by the National Science Foundation grant DMS-1638521.

Acknowledgment

arxiv.org/abs/2007.13869 18 / 18