29
Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint work with Lindsey Dietz, Megan Heyman, Subhabrata (Subho) Majumdar, and Ujjal Mukherjee April 25, 2018

Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Geometric Statistics for High-DimensionalData Analysis

Snigdhansu Chatterjee

School of Statistics, University of Minnesota

Joint work with Lindsey Dietz, Megan Heyman, Subhabrata (Subho) Majumdar,and Ujjal Mukherjee

April 25, 2018

Page 2: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Major contributors

Page 3: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Outline

Quantiles: univariate, multivariate

Geometric quantiles for classification

The Indian Summer Monsoons: GSQ for feature selection

fMRI data: GSQ for spatio-temporal modeling

Page 4: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles

I Suppose X ∈ R is a random variable.I For any α ∈ (0,1), the αth quantile Qα is the number below

which X is observed with probability α, i .e.Qα = inf{q : P [X ≤ q] ≥ α.

Theorem

If X is (absolutely) continuous with cumulative distributionfunction F (·), then F (X ) ∼ Uniform(0,1), and there is aone-to-one relationship between α and Qα.

Page 5: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.

I (An extension) The αth quantile Qα is the (unique)minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.I Define quantiles in any inner-product space as minimizers

of Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 6: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.I (An extension) The αth quantile Qα is the (unique)

minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.I Define quantiles in any inner-product space as minimizers

of Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 7: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.I (An extension) The αth quantile Qα is the (unique)

minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.I Define quantiles in any inner-product space as minimizers

of Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 8: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.I (An extension) The αth quantile Qα is the (unique)

minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.

I Define quantiles in any inner-product space as minimizersof Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 9: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.I (An extension) The αth quantile Qα is the (unique)

minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.I Define quantiles in any inner-product space as minimizers

of Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 10: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate to multivariate quantiles

Univariate quantiles:

For every u ∈ {x : ||x || < 1} ⊂ R, Q(u) minimizesΨu(q) = E [||X − q||+ < u,X − q >].

Write x = xuu/||u||+ xu⊥ .Generally, for some λ ≥ 0, the generalized spatial quantile(GSQ) are:

1. indexed by vectors in the unit ball u ∈ Bp = {x : ||x || < 1},and

2. the u-th quantile Q(u) is the minimizer of

Ψuλ(q) = E[|Xu − qu|

{1 + λ(Xu − qu)−2||Xu⊥ − qu⊥ ||2

}1/2

+||u||(Xu − qu)].

Page 11: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Bivariate quantiles

−4 −2 0 2 4

−4−2

02

4

Support of Distn

Q(u)

−1.0 −0.5 0.0 0.5 1.0

Domain

u

Page 12: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Bahadur representation of generalized spatial quantiles

TheoremThe following asymptotic Bahadur-type representation holdswith probability 1 for any u:

n1/2(Q(u)−Q(u)) = −n−1/2H−1Sn + O(n−(1+s)/4(log n)1/2(log log n)(1+s)/4)

as n→∞.

(Apologies for not including the details.)

Page 13: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Projection quantiles

Generalized spatial quantiles minimize:

Ψuλ(q) = E[|Xu − qu|

{1 + λ(Xu − qu)−2||Xu⊥ − qu⊥ ||2

}1/2+ ||u||(Xu − qu)

].

Set λ = 0 to get projection quantiles.

I Computationally extremely simple, no limitations fromsample size and dimension (high p, low n allowed).

I Projection quantiles based confidence sets have exactcoverage.

I Works on infinite-dimensional spaces.

Page 14: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Projection quantiles

Theorem

Projection quantiles have a one-to-one relationship with the unitball, like univariate quantiles.

Page 15: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Example: simulated data plots

Figure: Simulated data with a few GSQ (covered areas are deliberatelydifferent)

Page 16: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Outline

Quantiles: univariate, multivariate

Geometric quantiles for classification

The Indian Summer Monsoons: GSQ for feature selection

fMRI data: GSQ for spatio-temporal modeling

Page 17: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

GSQ-depths are great for classification

Figure: A simulated 2-class classification problem with GSQ-depth classifier

Page 18: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

GSQ-depth based classification: some results

Method CPU Time AccuracyGSQ 3.67 0.925Random Forest 16714.20 0.895SVM 966.86 0.842LDA 0.28 0.74Logit 0.35 0.69

Table: Arcene classification without feature selection (neural nets didnot converge)

Page 19: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Outline

Quantiles: univariate, multivariate

Geometric quantiles for classification

The Indian Summer Monsoons: GSQ for feature selection

fMRI data: GSQ for spatio-temporal modeling

Page 20: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

The data on monsoons

Figure: Air from the eastern Indian Ocean (yellow) and airdescending over Arabia (blue) converge in the Somali jet. Lowpressure at 30S. {Courtesy: UMn Climate Expeditions team.}

Page 21: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Variable dropped en(S−j )- Tmax 0.1490772- X120W 0.2190159- ELEVATION 0.2288938- X120E 0.2290021- ∆TT_Deg_Celsius 0.2371846- X80E 0.2449195- LATITUDE 0.2468698- TNH 0.2538924- Nino34 0.2541503- X10W 0.2558397- LONGITUDE 0.2563105- X100E 0.2565388- EAWR 0.2565687- X70E 0.2596766- v_wind_850 0.2604214- X140E 0.2609039- X40W 0.261159- SolarFlux 0.2624313- X160E 0.2626321- EPNP 0.2630901- TempAnomaly 0.2633658- u_wind_850 0.2649837- WP 0.2660394<none> 0.2663496- POL 0.2677756- Tmin 0.268231- X20E 0.2687891- EA 0.2690791- u_wind_200 0.2692731- u_wind_600 0.2695297- SCA 0.2700276- DMI 0.2700579- PNA 0.2715089- v_wind_200 0.2731708- v_wind_600 0.2748239- NAO 0.2764488

Table: Ordered values of en(S−j ) after dropping the j-th variable fromthe full model in the Indian summer precipitation data

Page 22: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

2004 2006 2008 2010 2012

−3

−2

−1

01

23

year

Bia

s

● ●●

Full modelReduced model

● ●

2004 2006 2008 2010 2012

02

46

8

year

MS

E

● ● ●● ●

●●

Full modelReduced model

(a) (b)

Figure: Comparing full model rolling predictions with reducedmodels: (a) Bias across years, (b) MSE across years.

Page 23: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

−2 0 2 4 6 8 10

0.0

0.1

0.2

0.3

0.4

0.5

Year 2012

log(PRCP+1)

dens

ity

TruthFull model predReduced model pred

2012

●●

●●

Positive residnegative resid

(c) (d)

Figure: Comparing full model rolling predictions with reducedmodels: (c) density plots for 2012, (d) stationwise residuals for 2012

Page 24: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Outline

Quantiles: univariate, multivariate

Geometric quantiles for classification

The Indian Summer Monsoons: GSQ for feature selection

fMRI data: GSQ for spatio-temporal modeling

Page 25: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

A brief outline

I We consider 19 tests subjects, with 2 kinds of visualstasks.

I Each subject went through 9 runs, where they saw faces orscrambled images, and had to react.

I We fit a spatio-temporal model. Temporally, we fit a AR(5)with quadratic drift. Spatially, we consider different layersnearest neighbor voxels.

I We measure the degree of spatial dependency in differentregions of the brain.

I The figures below are for one subject in one run.

Page 26: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

x = 48

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

y = 7

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

z = 12

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

z = 8

Figure: Plot of significant p-values at 95% confidence level at thespecified cross-sections.

Page 27: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Figure: A smoothed surface obtained from the p-values clearlyshows high spatial dependence in right optic nerve, auditory nerves,auditory cortex and left visual cortex areas

Page 28: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Acknowledgment:

I This research is partially supported by the NationalScience Foundation (NSF) under grants # DMS-1622483,# DMS-1737918, and by the National Aeronautics andSpace Administration (NASA).

I This research is partially supported by the Institute on theEnvironment (IonE).

Page 29: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Thank you