Page 1

Clustering by Important Features PCA (IF-PCA)

Rare/Weak Signals and Phase Diagrams

Jiashun Jin, CMU

David Donoho (Stanford)
Zheng Tracy Ke (Univ. of Chicago)

Wanjie Wang (Univ. of Pennsylvania)

August 12, 2015

Page 2

Clustering subjects using microarray data

#   Data Name         Source                      K   n (# of subjects)   p (# of genes)
1   Brain             Pomeroy (02)                5   42                  5597
2   Breast Cancer     Wang et al. (05)            2   276                 22215
3   Colon Cancer      Alon et al. (99)            2   62                  2000
4   Leukemia          Golub et al. (99)           2   72                  3571
5   Lung Cancer       Gordon et al. (02)          2   181                 12533
6   Lung Cancer(2)    Bhattacharjee et al. (01)   2   203                 12600
7   Lymphoma          Alizadeh et al. (00)        3   62                  4062
8   Prostate Cancer   Singh et al. (02)           2   136                 6033
9   SRBCT             Kahn (01)                   4   63                  2308
10  Su-Cancer         Su et al. (01)              2   174                 7909

Goal. Predict class labels

Page 3

Left/right singular vectors

[Schematic: data matrix (samples x features) -> selected features -> PC's]

- Left singular vector: (n-dimensional) eigenvector of $XX'$
- Right singular vector: (p-dimensional) eigenvector of $X'X$
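As a quick numerical confirmation of these two facts (my own example, not from the talk), the output of an SVD routine can be checked directly against the eigen-relations for $XX'$ and $X'X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 10
X = rng.standard_normal((n, p))

# SVD: X = U diag(s) V'
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Columns of U are eigenvectors of XX' (n x n) with eigenvalues s^2;
# columns of V are eigenvectors of X'X (p x p) with the same eigenvalues.
assert np.allclose(X @ X.T @ U, U * s**2)
assert np.allclose(X.T @ X @ Vt.T, Vt.T * s**2)
```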

Page 4

Principal Component Analysis (PCA)

Karl Pearson (1857-1936)

Idea:

- Transformation
- Dimension reduction while keeping the main information
- Data = signal + noise (signal matrix: low-rank)

Microarray data (after standardization):

many columns of the signal matrix are 0

Page 5

Important Features PCA (IF-PCA)

Idea: PCA applied to a small fraction of carefully selected features:

- Rank features by the Kolmogorov-Smirnov statistic
- Select those with the largest KS-scores
- Apply PCA to the post-selection data matrix

[Schematic: data matrix (samples x features) -> selected features -> PC's]

Azizyan et al (2013), Chan and Hall (2010), Fan and Lv (2008)

Page 6

IF-PCA (microarray data)

$W_i(j) = [X_i(j) - \bar{X}(j)]/\hat{\sigma}(j)$ : feature-wise normalization

$W = [w_1, \ldots, w_p] = [W_1', \ldots, W_n']'$,  $F_{n,j}(t) = \frac{1}{n}\sum_{i=1}^{n} 1\{W_i(j) \le t\}$

1. Rank features with Kolmogorov-Smirnov (KS) scores

   $\psi_{n,j} = \sqrt{n}\cdot \sup_{-\infty < t < \infty} |F_{n,j}(t) - \Phi(t)|$  ($\Phi$: CDF of $N(0,1)$)

2. Renormalize the KS scores by (Efron's empirical null)

   $\psi^*_{n,j} = \frac{\psi_{n,j} - \text{mean of all } p \text{ different KS-scores}}{\text{SD of all } p \text{ different KS-scores}}$

3. Fix $t > 0$. $U^{(t)} \in \mathbb{R}^{n,K-1}$: first $(K-1)$ left singular vectors of the post-selection data matrix $[w_j : \psi^*_{n,j} \ge t]$

4. Apply the classical k-means algorithm to $U^{(t)} \in \mathbb{R}^{n,K-1}$

IF-PCA-HCT: $t = t_p^{HC}$, the Higher Criticism threshold (TBA)
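The four steps translate almost line-for-line into code. Below is a minimal NumPy/scikit-learn sketch (mine, not the authors' implementation); as a simplification it computes P-values from the standard-normal survival function rather than from the simulated theoretic null used in the talk, and the names are my own.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def if_pca_hct(X, K):
    """Sketch of IF-PCA-HCT for an n-by-p data matrix X with K clusters."""
    n, p = X.shape
    # Feature-wise normalization: W_i(j) = [X_i(j) - mean_j] / sd_j
    W = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # KS scores: psi_j = sqrt(n) * sup_t |F_{n,j}(t) - Phi(t)|
    Ws = np.sort(W, axis=0)
    grid = np.arange(1, n + 1)[:, None] / n        # empirical CDF at sorted points
    Phi = norm.cdf(Ws)
    psi = np.sqrt(n) * np.maximum(np.abs(grid - Phi),
                                  np.abs(grid - 1 / n - Phi)).max(axis=0)

    # Efron's empirical-null renormalization
    psi_star = (psi - psi.mean()) / psi.std(ddof=1)

    # P-values (crude stand-in: N(0,1) survival function; the talk
    # simulates the theoretic null density of psi directly)
    pvals = np.sort(norm.sf(psi_star))

    # Higher Criticism threshold over k <= p/2 with pi_(k) > log(p)/p
    k = np.arange(1, p + 1)
    hc = np.sqrt(p) * (k / p - pvals) / np.sqrt(
        k / p + np.maximum(np.sqrt(n) * (k / p - pvals), 0))
    ok = (k <= p / 2) & (pvals > np.log(p) / p)
    k_hat = k[ok][np.argmax(hc[ok])]
    t_hc = np.sort(psi_star)[::-1][k_hat - 1]      # k_hat-th largest score

    # Post-selection PCA + k-means on the first K-1 left singular vectors
    U, _, _ = np.linalg.svd(W[:, psi_star >= t_hc], full_matrices=False)
    return KMeans(n_clusters=K, n_init=10).fit_predict(U[:, :K - 1])
```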

Page 7

The blessing of feature selection

x-axis: $1, 2, \ldots, n$; y-axis: entries of $U^{HC}$ ($U^{(t)}$ for $t = t_p^{HC}$)

Left: plot of $U^{HC}$ (Lung Cancer; $K = 2$, so $U^{HC} \in \mathbb{R}^n$ is a vector)

Right: counterpart of $U^{HC}$ without feature selection

[Two panels plotting the entries of U^HC against sample index, labeled by true class (ADCA vs. MPM): left with feature selection, right without]

Page 8

Efron's null correction (Lung Cancer)

Efron's theoretic null: density of $\psi_{n,j}$ if $X_i(j) \overset{iid}{\sim} N(u_j, \sigma_j^2)$ (does not depend on $(j, u_j, \sigma_j)$; easy to simulate). The theoretic null is a bad fit to $\psi_{n,j}$ (top) but a nice fit to $\psi^*_{n,j}$ (bottom)

[Histograms of the KS-scores with the simulated theoretic null density overlaid; top: raw scores psi_{n,j}, bottom: renormalized scores psi*_{n,j}]
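Since the theoretic null is "easy to simulate", here is a minimal Monte Carlo sketch of it (my own code; `reps` and the choice n = 181 to match the Lung Cancer data are illustrative):

```python
import numpy as np
from scipy.stats import norm

def ks_score(x):
    """sqrt(n) * sup_t |F_n(t) - Phi(t)| after feature-wise normalization."""
    n = len(x)
    w = np.sort((x - x.mean()) / x.std(ddof=1))
    i = np.arange(1, n + 1)
    phi = norm.cdf(w)
    return np.sqrt(n) * max(np.max(i / n - phi), np.max(phi - (i - 1) / n))

rng = np.random.default_rng(1)
n, reps = 181, 5000
# Normalization makes the score invariant to (u_j, sigma_j), so N(0,1)
# draws suffice to sample the theoretic null.
null_psi = np.array([ks_score(rng.standard_normal(n)) for _ in range(reps)])
# Compare the histogram of null_psi to the observed psi_{n,j} before and
# after Efron's renormalization.
```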

Page 9

How to set the threshold t?

- CV: not directly implementable (class labels unknown)
- FDR: needs tuning and targets Rare/Strong signals [Benjamini and Hochberg (1995), Efron (2010)]

t (threshold)   #{selected features}   feature-FDR   errors
.0280           12529                  1.00          22
.1595           2523                   1.00          28
.2814           299                    .538          4
.2862           280                    .50           5
.3331           132                    .25           6
.3469           106                    .20           43
.3622           86                     .15           38
.4009           32                     .10           38
.4207           27                     .06           37

Page 10

Tukey’s Higher Criticism

John W. Tukey (1915-2002)

Page 11

Higher Criticism (HC)

Review papers: Donoho and Jin (2015), Jin and Ke (2015)

- Proposed by Donoho and Jin (2004) for sparse signal detection
- Found useful in GWAS, DNA Copy Number Variants (CNV), Cosmology and Astronomy, Disease surveillance
- Extended in many different directions: Innovated HC (for colored noise), signal detection in a regression model, HC when the noise distribution is unknown/non-Gaussian, estimating the proportion of non-null effects
- Threshold choice in classification [Donoho and Jin (2008, 2009), Fan et al (2013), Jin (2009)]

Page 12

Threshold choice by HC for IF-PCA

Jin and Wang (2015), Jin, Ke and Wang (2015)

$t_p^{HC} = \mathrm{argmax}_t\{HC_p(t)\}$,

$HC_p(t) = \frac{\sqrt{p}\,[G_p(t) - F_0(t)]}{\sqrt{\sqrt{n}\,[G_p(t) - F_0(t)] + G_p(t)}}$  (new)

- $F_0$: survival function of Efron's theoretic null
- $G_p$: empirical survival function of the renormalized KS-scores $\psi^*_{n,j}$

Page 13

HCT: implementation

- Compute P-values: $\pi_j = F_0(\psi^*_{n,j})$, $1 \le j \le p$
- Sort P-values: $\pi_{(1)} < \pi_{(2)} < \ldots < \pi_{(p)}$
- Define the HC functional by

  $HC_{p,k} = \frac{\sqrt{p}\,(k/p - \pi_{(k)})}{\sqrt{k/p + \max\{\sqrt{n}\,(k/p - \pi_{(k)}),\ 0\}}}$

Let $\hat{k} = \mathrm{argmax}_{\{1 \le k \le p/2,\ \pi_{(k)} > \log(p)/p\}}\{HC_{p,k}\}$. The HC threshold $t_p^{HC}$ is the $\hat{k}$-th largest KS-score:

$HC_p(t)\big|_{t = \psi^*_{(\hat{k})}} = \frac{\sqrt{p}\,[\hat{k}/p - \pi_{(\hat{k})}]}{\sqrt{\hat{k}/p + \sqrt{n}\,(\hat{k}/p - \pi_{(\hat{k})})}}$;  $\psi^*_{(k)}$: sorted KS-scores
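Isolating just this slide from the fuller sketch earlier, a direct transliteration of the HC functional (my function name; it assumes the P-values $\pi_j = F_0(\psi^*_{n,j})$ have already been computed):

```python
import numpy as np

def hc_threshold_index(pvals, n):
    """Return k_hat maximizing HC_{p,k} over k <= p/2 with pi_(k) > log(p)/p."""
    p = len(pvals)
    pi = np.sort(pvals)                        # pi_(1) <= ... <= pi_(p)
    k = np.arange(1, p + 1)
    hc = np.sqrt(p) * (k / p - pi) / np.sqrt(
        k / p + np.maximum(np.sqrt(n) * (k / p - pi), 0.0))
    ok = (k <= p / 2) & (pi > np.log(p) / p)
    return k[ok][np.argmax(hc[ok])]            # t_p^HC = k_hat-th largest KS-score
```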

Page 14

Illustration

[Three panels against the same x-axis: the ordered KS-scores, the ordered p-values, and the HC objective function]

Page 15

Illustration, II (Lung Cancer)

[Figure: clustering error rate of IF-PCA as a function of the number of selected features, with the HC threshold marked]

x-axis: # of selected features; y-axis: error rates by IF-PCA

Page 16

Comparison

$r = \frac{\text{error rate of IF-PCA-HCT}}{\text{minimum error rate of all other methods}}$

#   Data set          K   kmeans   kmeans++    Hier   SpecGem*   IF-PCA-HCT   r
1   Brain             5   .286     .427(.09)   .524   .143       .262         1.83
2   Breast Cancer     2   .442     .430(.05)   .500   .438       .406         .94
3   Colon Cancer      2   .443     .460(.07)   .387   .484       .403         1.04
4   Leukemia          2   .278     .257(.09)   .278   .292       .069         .27
5   Lung Cancer       2   .116     .196(.09)   .177   .122       .033         .29
6   Lung Cancer(2)    2   .436     .439(.00)   .301   .434       .217         .72
7   Lymphoma          3   .387     .317(.13)   .468   .226       .065         .29
8   Prostate Cancer   2   .422     .432(.01)   .480   .422       .382         .91
9   SRBCT             4   .556     .524(.06)   .540   .508       .444         .87
10  SuCancer          2   .477     .459(.05)   .448   .489       .333         .74

*: SpecGem is classical PCA [Lee et al (2010)]; kmeans++: Arthur and Vassilvitskii (2007)

Page 17

Sparse PCA and variants of IF-PCA

[Schematic: data matrix (samples x features) -> selected features -> PC's]

#   Data set          K   Clu-sPCA*   IF-kmeans   IF-Hier   IF-PCA-HCT
1   Brain             5   .172        .302        .476      .262
2   Breast Cancer     2   .438        .378        .351      .406
3   Colon Cancer      2   .404        .396        .371      .403
4   Leukemia          2   .292        .114        .250      .069
5   Lung Cancer       2   .110        .180        .177      .033
6   Lung Cancer(2)    2   .434        .226        .227      .217
7   Lymphoma          3   .055        .138        .355      .065
8   Prostate Cancer   2   .422        .382        .412      .382
9   SRBCT             4   .428        .417        .603      .444
10  SuCancer          2   .466        .430        .500      .333

*: project to the estimated feature space (sparse PCA) and then cluster; unclear how to set λ (the ideal λ is used above); clustering ≠ feature estimation

Zou et al (2006), Witten and Tibshirani (2010)

Page 18

Summary (so far)

- IF-PCA-HCT consists of three simple steps:
  - Marginal screening (KS)
  - Threshold choice (empirical null + HCT)
  - Post-selection PCA
- Tuning-free, fast, and yet effective
- Easily extendable and adaptable

Next: theoretical reasons for HCT in Rare/Weak (RW) settings

Page 19

RW viewpoint

In many types of "Big Data", signals of interest are not only sparse (rare) but also individually weak, and we have no prior knowledge of where these RW signals are.

- "Large p, small n" (e.g., genetics and genomics):

  $(\text{Signal strength})^{\alpha} \propto n \propto \$$ or labor;  clustering: $\alpha = 6$; classification: $\alpha = 2$

- Technical limitation (e.g., astronomy)
- Early detection (e.g., disease surveillance)

Page 20

RW model and Phase Diagram

Many methods/theories target Rare/Strong signals ("if conditions XX hold and all signals are sufficiently strong ...")

Our proposal:

- RW model: a parsimonious model capturing the main factors (sparsity and signal strength)
- Phase Diagram:
  - provides sharp results that characterize when the desired goal is impossible or possible to achieve
  - an approach to distinguish non-optimal and optimal procedures

Page 21

Sparse signal detection (global testing)

Donoho and Jin (2004), Ingster (1997, 1999), Jin (2004)

$X = \mu + Z$,  $X \in \mathbb{R}^p$,  $Z \sim N(0, I_p)$

$H_0^{(p)}: \mu = 0$,  vs.  $H_1^{(p)}: \mu(j) \overset{iid}{\sim} (1-\varepsilon_p)\nu_0 + \varepsilon_p\,\nu_{\tau_p}$

Calibration and subtlety of the problem:

$\varepsilon_p = p^{-\beta}$,  $\tau_p = \sqrt{2r\log(p)}$,  $0 < \beta < 1$,  $r > 0$

- Rare: only a small fraction of the means are nonzero
- Weak: signals only moderately strong
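To make the calibration concrete, here is a small simulation of the RW model (my own sketch; the HC statistic shown is one common variant from Donoho and Jin (2004), included only for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
p, beta, r = 10**5, 0.6, 0.3                 # r above the detection boundary
eps, tau = p**(-beta), np.sqrt(2 * r * np.log(p))

mu = tau * (rng.random(p) < eps)             # rare/weak mean vector
X = mu + rng.standard_normal(p)

pi = np.sort(norm.sf(X))[: p // 2]           # smallest half of the p-values
k = np.arange(1, p // 2 + 1)
hc = np.sqrt(p) * (k / p - pi) / np.sqrt((k / p) * (1 - k / p))
print(hc.max())                              # large values => reject H_0
```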

Page 22

Phase Diagram (signal detection/recovery)

standard phase function: $r = \rho(\beta)$;

$\rho(\beta) = \begin{cases} 0, & 0 < \beta < 1/2 \\ \beta - \frac{1}{2}, & 1/2 < \beta < 3/4 \\ (1 - \sqrt{1-\beta})^2, & 3/4 < \beta < 1 \end{cases}$

[Left panel (detection): the (beta, r) plane splits into Undetectable, Detectable, and Estimable regions along r = rho(beta). Right panel (recovery): Undetectable, Detectable, Almost Full Recovery, and Exact Recovery regions]
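For reference when reproducing such diagrams, the phase function transcribes directly (my helper, reused in later sketches):

```python
import numpy as np

def rho(beta):
    """Standard detection boundary r = rho(beta), 0 < beta < 1."""
    if beta < 0.5:
        return 0.0
    if beta < 0.75:
        return beta - 0.5
    return (1 - np.sqrt(1 - beta))**2
```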

Page 23

RW settings: why we care?

- Growing awareness of irreproducibility
- Many methods/theories focus on Rare/Strong signals and do not cope well with RW settings:
  - FDR-controlling methods (shown before)
  - L0/L1-penalization methods
  - Minimax framework

Ioannidis (2005), "Why most published research findings are false"; Donoho and Stark (1989); Chen et al (1995); Tibshirani (1996); Abramovich et al (2006); Jin et al (2015); Ke et al (2015)

Page 24

A two-class model for clustering

Jin, Ke & Wang (2015)

$X_i = \ell_i \mu + Z_i$,  $Z_i \overset{iid}{\sim} N(0, I_p)$,  $i = 1, \ldots, n$  ($p \gg n$)

- $\ell_i = \pm 1$: unknown class labels (main interest)
- $\mu \in \mathbb{R}^p$: feature vector
- RW: only a small fraction of the entries of $\mu$ are nonzero, and each contributes weakly to clustering

Goal. Theoretical insight on HCT (and more)

Page 25

IF-PCA simplified to the two-class model

[Schematic: data matrix (samples x features) -> selected features -> PC's]

step                     microarray                                 two-class model
pre-normalization        yes                                        skipped
feature-wise screening   Kolmogorov-Smirnov                         chi-square
                         $\psi_j = \sup_t |F_{n,j}(t) - \Phi(t)|$   $\psi_j = (\|x_j\|^2 - n)/\sqrt{2n}$
re-normalization         Efron's null correction                    skipped
threshold choice         HCT                                        same
post-selection PCA       same                                       same

Page 26

Asymptotic Rare/Weak (ARW) model

$X = \ell \mu' + Z \in \mathbb{R}^{n,p}$,  $Z$: iid $N(0,1)$ entries

$\ell_i = \pm 1$ with equal prob.,  $\mu(j) \overset{iid}{\sim} (1-\varepsilon_p)\nu_0 + \varepsilon_p \nu_{\tau_p}$

- "large p, small n":  $n = p^{\theta}$,  $0 < \theta < 1$
- RW signal:  $\varepsilon_p = p^{-\beta}$,  $\tau_p = [(4/n)\, r \log(p)]^{1/4}$

Note: $\psi_j = (\|x_j\|^2 - n)/\sqrt{2n} \approx N(0,1)$ or $N(\sqrt{2r\log(p)}, 1)$
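A quick numerical check of the "Note" above (my own sketch): under this calibration, the chi-square score of a null column is roughly N(0,1), while a signal column scores near $\sqrt{2r\log(p)}$:

```python
import numpy as np

rng = np.random.default_rng(3)
p, theta, r = 10**4, 0.6, 0.5
n = int(p**theta)
tau = ((4 / n) * r * np.log(p))**0.25        # ARW signal strength

ell = rng.choice([-1.0, 1.0], size=n)
null_col = rng.standard_normal(n)            # feature with mu(j) = 0
sig_col = ell * tau + rng.standard_normal(n) # feature with mu(j) = tau

psi = lambda x: (np.sum(x**2) - n) / np.sqrt(2 * n)
print(psi(null_col), psi(sig_col), np.sqrt(2 * r * np.log(p)))
```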

Page 27

Three functions: HC, idealHC, SNR

Given $X = \ell \mu' + Z$, we retain feature $j$ if $\psi_j \ge t$

- $F_0(t)$: survival function of the normalized $\chi^2_n(0)$
- $G_p(t)$: empirical survival function of the $\psi_j$
- $U^{(t)}$: first left singular vector of $[x_j : \psi_j \ge t]$

1. $HC(t) = \sqrt{p}\,[G_p(t) - F_0(t)] \big/ \big[\sqrt{n}\,[G_p(t) - F_0(t)] + G_p(t)\big]^{1/2}$

2. $idealHC(t)$: $HC(t)$ with $G_p(t)$ replaced by $\bar{G}_p(t)$, its non-stochastic counterpart

3. $SNR(t)$: defined through $U^{(t)} \propto SNR(t)\cdot\ell + z + \mathrm{rem}$,  $z \sim N(0, I_n)$

Page 28

Three thresholds: $t_p^{HC} \approx t_p^{idealHC} \approx t_p^{ideal}$

[Figure: HC, IdealHC, and SNR curves plotted against the threshold t, with HCT marked near the common maximizer]

$SNR(t) = \frac{E[\|\mu^{(t)}\|^2]}{\sqrt{E[\|\mu^{(t)}\|^2] + E[|S(t)|]/n}} \propto \frac{\sqrt{p}\,(\bar{G}_p(t) - F_0(t))}{\sqrt{\sqrt{n}\,[\bar{G}_p(t) - F_0(t)] + \bar{G}_p(t)}}$

- $\mu^{(t)}$: $\mu$ restricted to $S(t)$ (index set of all retained features)
- $E\|\mu^{(t)}\|^2 \approx \tau_p^2\, p\,[\bar{G}_p(t) - F_0(t)]$
- $E\|\mu^{(t)}\|^2 + E[|S(t)|]/n \propto \tau_p^2\,[\bar{G}_p(t) - F_0(t)] + \bar{G}_p(t)/n$

Page 29

Impossibility

Let $\rho(\beta)$ be the standard phase function (before). Define

$\rho_{\theta}(\beta) = (1-\theta)\,\rho\Big(\frac{1}{2} + \frac{\beta - \frac{1}{2}}{1-\theta}\Big)$;  recalling $n = p^{\theta}$

Consider IF-PCA with $t > 0$. Let $U^{(t)} \in \mathbb{R}^n$ be the first left singular vector of the post-selection data matrix $[x_j : \psi_j \ge t]$.

Consider ARW with $n = p^{\theta}$, $\varepsilon_p = p^{-\beta}$, $\tau_p = [(4/n)\, r \log(p)]^{1/4}$.

If $r < \rho_{\theta}(\beta)$, then for any threshold $t$,

$\mathrm{Cos}(U^{(t)}, \ell) \le c_0 < 1$  (IF-PCA partially fails)
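The boundary transcribes directly (my helper; it reuses the `rho()` function from the earlier sketch):

```python
def rho_theta(beta, theta):
    """IF-PCA boundary r = rho_theta(beta), with n = p**theta."""
    return (1 - theta) * rho(0.5 + (beta - 0.5) / (1 - theta))
```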

Page 30

Possibility

If $r > \rho_{\theta}(\beta)$, then

- For some $t$, $\mathrm{Cos}(U^{(t)}, \ell) \to 1$
- There is a non-stochastic function $SNR(t)$ such that for some $t$, $SNR(t) \gg 1$ and

  $U^{(t)} \propto SNR(t)\,\ell + z + \mathrm{rem}$,  $z \sim N(0, I_n)$

- HCT yields the right threshold choice:

  $t_p^{HC}/t_p^{ideal} \to 1$ in prob.;  $t_p^{ideal} = \mathrm{argmax}_t\{SNR(t)\}$

- IF-PCA-HCT yields successful clustering:

  $\mathrm{Hamm}_p(\hat{\ell}^{HCT}; \beta, r, \theta) \to 0$ †

†: $\mathrm{Hamm}_p(\hat{\ell}; \beta, r, \theta) = (n/2)^{-1} \min_{b = \pm\mathrm{sgn}(\hat{\ell})} \Big\{\sum_{i=1}^n P\big(b_i \ne \mathrm{sgn}(\ell_i)\big)\Big\}$

Page 31

Phase Diagram (IF-PCA)

#(useful features) $\approx p^{1-\beta}$,  $\tau_p = [(4/n)\, r \log(p)]^{1/4}$;  $n = p^{\theta}$ ($\theta = .6$)

[Two phase diagrams in the (beta, r) plane for theta = .6: a Failure region below the boundary r = rho_theta(beta) and a Success region above it; the right panel zooms in on 0.5 < beta < 0.85]

Page 32

Summary (so far)

- Big Data are here, but signals are RW
- RW model and Phase Diagrams: a theoretical framework specifically for RW settings, allowing for insights that would otherwise be overlooked
- HC provides the right threshold choice

Limitation: IF-PCA only works at a specific signal strength and in a limited sparsity range

Want a more complete story: statistical limits for clustering under ARW

Page 33

ARW for statistical limits (a slight change)

- $X = \ell \mu' + Z$
- $\ell_i = \pm 1$ with equal prob.
- $\mu(j) \overset{iid}{\sim} (1-\varepsilon_p)\nu_0 + \varepsilon_p \nu_{\tau_p}$
- $n = p^{\theta}$,  $\varepsilon_p = p^{-\beta}$

[Phase diagram in the (beta, alpha) plane, beta increasing toward "sparser" and alpha toward "weaker", with the lines tau = n^{-1/4}, s = p^{1/2}, and s = n^{1/2} marking the most interesting range for IF-PCA]

                      For statistical limits                 For IF-PCA
$\tau_p$              $\tau_p = p^{-\alpha}$, $\alpha > 0$   $\tau_p = [(4/n)\, r \log(p)]^{1/4}$
$s := \#\{signals\}$  $1 \ll s \ll p$                        $\sqrt{n} \ll s \ll \sqrt{p}$

Page 34

Aggregation methods

- Simple Aggregation: $\hat{\ell}^{sa}_{*} = \mathrm{sgn}(\bar{x})$
- Sparse Aggregation: $\hat{\ell}^{sa}_{N} = \mathrm{sgn}(\bar{x}^S)$, where

  $S = S(N) = \mathrm{argmax}_{\{S: |S| = N\}}\{\|\bar{x}^S\|_1\}$

[Schematic: for simple aggregation, x-bar averages all p feature columns x_1, ..., x_p of the samples-by-features matrix; for sparse aggregation, x-bar^S averages only the columns in S]
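A toy illustration of the two rules (my own code; the exhaustive subset search is only feasible for tiny p, which is exactly the computational issue flagged on the next page):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, p, N, tau = 50, 8, 2, 1.0
ell = rng.choice([-1.0, 1.0], size=n)
mu = np.zeros(p)
mu[:N] = tau                                 # N useful features
X = np.outer(ell, mu) + rng.standard_normal((n, p))

ell_simple = np.sign(X.mean(axis=1))         # simple aggregation: sgn(x-bar)

best = max(combinations(range(p), N),        # sparse aggregation: best subset
           key=lambda S: np.abs(X[:, S].mean(axis=1)).sum())
ell_sparse = np.sign(X[:, best].mean(axis=1))

err = lambda b: min(np.mean(b != ell), np.mean(b != -ell))  # up to label flip
print(err(ell_simple), err(ell_sparse))
```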

Page 35

Comparison of methods

Method         Simple Agg.             PCA                     Sparse Agg.                        IF-PCA
               $\hat{\ell}^{sa}_{*}$   $\hat{\ell}^{if}_{*}$   $\hat{\ell}^{sa}_{N}$ ($N \ll p$)  $\hat{\ell}^{if}_{t}$ ($t > 0$)
Sparsity       dense                   dense                   sparse                             sparse
Strength       weak                    weak                    strong*                            strong
F. Selection   No                      No                      Yes                                Yes
Complexity     Poly.                   Poly.                   NP-hard                            Poly.
Tuning         No                      No                      Yes                                Yes**

- Notation: $\hat{\ell}^{if}_{t}$: IF-PCA adapted to the two-class model
- Notation: $\hat{\ell}^{if}_{*}$: classical PCA (a special case)

*: signals are comparably stronger but still weak
**: a tuning-free version exists

Page 36

Statistical limits (clustering)

$\mathrm{Hamm}_p(\hat{\ell}; \beta, \alpha, \theta) = (n/2)^{-1} \min_{b = \pm\mathrm{sgn}(\hat{\ell})} \Big\{\sum_{i=1}^n P\big(b_i \ne \mathrm{sgn}(\ell_i)\big)\Big\}$

$\eta^{clu}_{\theta}(\beta) = \begin{cases} (1-2\beta)/2, & 0 < \beta < \frac{1-\theta}{2} \\ \theta/2, & \frac{1-\theta}{2} < \beta < 1-\theta \\ (1-\beta)/2, & \beta > 1-\theta \end{cases}$

Consider ARW where $n = p^{\theta}$, $\varepsilon_p = p^{-\beta}$, $\tau_p = p^{-\alpha}$.

- When $\alpha > \eta^{clu}_{\theta}(\beta)$, $\mathrm{Hamm}_p(\hat{\ell}; \beta, \alpha, \theta) \gtrsim 1$
- When $\alpha < \eta^{clu}_{\theta}(\beta)$:
  - $\mathrm{Hamm}_p(\hat{\ell}^{sa}_{*}; \beta, \alpha, \theta) \to 0$ for $\beta < \frac{1-\theta}{2}$
  - $\mathrm{Hamm}_p(\hat{\ell}^{sa}_{N}; \beta, \alpha, \theta) \to 0$ for $\beta > \frac{1-\theta}{2}$  ($N = p^{1-\beta}$)
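The clustering boundary transcribes directly (my helper, in the same style as `rho()` and `rho_theta()` above, handy for plotting the diagrams on the next page):

```python
def eta_clu(beta, theta):
    """alpha = eta_clu(beta) separating possible from impossible clustering."""
    if beta < (1 - theta) / 2:
        return (1 - 2 * beta) / 2
    if beta < 1 - theta:
        return theta / 2
    return (1 - beta) / 2
```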

Page 37

Phase Diagram (clustering; θ = 0.6)

                    For statistical limits                 For IF-PCA
$\tau_p$            $\tau_p = p^{-\alpha}$, $\alpha > 0$   $\tau_p = [(4/n)\, r \log(p)]^{1/4}$
Range of $\beta$    $0 < \beta < 1$                        $1/2 < \beta < 1 - \theta/2$

[Left panel: the (beta, alpha) plane splits into Impossible, Simple Aggregation, and Sparse Aggregation (NP-hard) regions, with boundary segments tau = n^{-1/2}, tau = p^{1/2}/s, and tau = s^{-1/2}. Right panel: the same plane marking where Simple Aggregation, Classical PCA, and IF-PCA succeed, the NP-hard region, and the Impossible region]

Page 38

Two closely related problems

$X = \ell \mu' + Z$,  $Z$: iid $N(0,1)$ entries

- (sig). Estimate the support of $\mu$ (signal recovery)

  $\mathrm{Hamm}_p(\hat{\mu}; \beta, r, \theta) = (p\varepsilon_p)^{-1} \sum_{j=1}^p P\big(\mathrm{sgn}(\hat{\mu}_j) \ne \mathrm{sgn}(\mu_j)\big)$

- (hyp). Test $H_0^{(p)}$ that $X = Z$ against the alternative $H_1^{(p)}$ that $X = \ell\mu' + Z$ (global hypothesis testing)

  $\mathrm{TestErr}_p(T; \beta, r, \theta) = P_{H_0^{(p)}}(\text{Reject } H_0^{(p)}) + P_{H_1^{(p)}}(\text{Accept } H_0^{(p)})$

Arias-Castro & Verzelen (2015), Johnstone & Lu (2001), Rigollet & Berthet (2013)

Page 39

Statistical limits (three problems)

$\mu(j) \overset{iid}{\sim} (1-\varepsilon_p)\nu_0 + \varepsilon_p \nu_{\tau_p}$,

$n = p^{\theta}$,  $s \equiv \#\{\text{useful features}\} \approx p^{1-\beta}$,  signal strength $\tau_p = p^{-\alpha}$

[Phase diagram in the (beta, alpha) plane overlaying the limits for Clustering, Signal Recovery, and Hypothesis Testing; boundary segments are marked tau = (sn)^{-1/4}, tau = n^{-1/4} p^{1/2}/s, tau = p^{1/2}/s, tau = n^{-1/2}, tau = s^{-1/2}, tau = n^{-1/4}, and s = n]

Page 40

Computable upper bounds (two problems)

Left: signal recovery. Right: (global) hypothesis testing

$\mu(j) \overset{iid}{\sim} (1-\varepsilon_p)\nu_0 + \varepsilon_p \nu_{\tau_p}$,

$n = p^{\theta}$,  $s \equiv \#\{\text{useful features}\} \approx p^{1-\beta}$,  signal strength $\tau_p = p^{-\alpha}$

[Left panel: phase diagram with curves for Clustering, Signal Recovery (stat), Hypothesis Testing, and Signal Recovery (compu); boundary segments marked tau = n^{-1/2}, tau = (p/n)^{1/4} s^{-1/2}, tau = n^{-1/4}, and s = p^{1/2}. Right panel: curves for Clustering, Signal Recovery, Hypothesis Testing (stat), and Hypothesis Testing (compu); boundary segments marked tau = n^{-1/4}, tau = n^{-1/4}(sqrt(p)/s), and s = p^{1/2}]

Page 41

Take-home message

- "Big Data" is here, but your signals are RW; we need to do many things very differently
- RW model and Phase Diagrams are a theoretical framework specifically designed for RW settings, and expose many insights we do not see with more traditional frameworks
- IF-PCA and Higher Criticism are simple and easy-to-adapt methods which are provably effective for analyzing real data, especially in Rare/Weak settings

Page 42

Acknowledgements

Many thanks to David Donoho, Stephen Fienberg, Peter Hall, Jon Wellner and many others for inspiration, collaboration, and encouragement!

Main references.
Jin, J. and Wang, W. (2015). Important Features PCA for high dimensional clustering. arXiv:1407.5241.
Jin, J., Ke, Z. and Wang, W. (2015). Phase transitions for high dimensional clustering and related problems. arXiv:1502.06952.

Two review papers on HC.
Donoho, D. and Jin, J. (2015). Higher Criticism for large-scale inference, especially for rare and weak effects. Statistical Science, 30(1), 1-25.
Jin, J. and Ke, Z. (2015). Rare and weak effects in large-scale inference: methods and phase diagrams. arXiv:1410.4578.
