A model for interpretable high dimensional interactions

  • View
    130

  • Download
    3

  • Category

    Science

Preview:

Citation preview

A Model for Interpretable High Dimensional

Interactions

Sahir Rai Bhatnagar

Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood

Poster Number 67

Motivation

one predictor variable at a time

Predictor Variable Phenotype

Test 1

Test 2

Test 3

Test 4

Test 5

1

one predictor variable at a time

Predictor Variable Phenotype

Test 1

Test 2

Test 3

Test 4

Test 5

1

a network based view

Predictor Variable Phenotype

Test 1

2

a network based view

Predictor Variable Phenotype

Test 1

2

a network based view

Predictor Variable Phenotype

Test 1

2

system level changes due to environment

Predictor Variable PhenotypeEnvironment

A

B

Test 1

3

system level changes due to environment

Predictor Variable PhenotypeEnvironment

A

B

Test 1

3

Motivating Dataset: Newborn epigenetic adaptations to gesta-

tional diabetes exposure (Luigi Bouchard, Sherbrooke)

Environment

Gestational

Diabetes

Large Data

Child’s epigenome

(p ≈ 450k)

Phenotype

Obesity measures

4

Differential Correlation between environments

(a) Gestational diabetes affected pregnancy (b) Controls

5

formal statement of initial problem

• n: number of subjects

• p: number of predictor variables

• Xn×p: high dimensional data set (p >> n)

• Yn×1: phenotype

• En×1: environmental factor that has widespread effect on X and can

modify the relation between X and Y

Objective

• Which elements of X that are associated with Y , depend on E?

6

formal statement of initial problem

• n: number of subjects

• p: number of predictor variables

• Xn×p: high dimensional data set (p >> n)

• Yn×1: phenotype

• En×1: environmental factor that has widespread effect on X and can

modify the relation between X and Y

Objective

• Which elements of X that are associated with Y , depend on E?

6

formal statement of initial problem

• n: number of subjects

• p: number of predictor variables

• Xn×p: high dimensional data set (p >> n)

• Yn×1: phenotype

• En×1: environmental factor that has widespread effect on X and can

modify the relation between X and Y

Objective

• Which elements of X that are associated with Y , depend on E?

6

formal statement of initial problem

• n: number of subjects

• p: number of predictor variables

• Xn×p: high dimensional data set (p >> n)

• Yn×1: phenotype

• En×1: environmental factor that has widespread effect on X and can

modify the relation between X and Y

Objective

• Which elements of X that are associated with Y , depend on E?

6

formal statement of initial problem

• n: number of subjects

• p: number of predictor variables

• Xn×p: high dimensional data set (p >> n)

• Yn×1: phenotype

• En×1: environmental factor that has widespread effect on X and can

modify the relation between X and Y

Objective

• Which elements of X that are associated with Y , depend on E?

6

formal statement of initial problem

• n: number of subjects

• p: number of predictor variables

• Xn×p: high dimensional data set (p >> n)

• Yn×1: phenotype

• En×1: environmental factor that has widespread effect on X and can

modify the relation between X and Y

Objective

• Which elements of X that are associated with Y , depend on E?

6

Methods

ECLUST - our proposed method: 3 phases

Original Data

E = 0

1) Gene Similarity

E = 1

2) Cluster

Representation

n × 1 n × 1

3) Penalized

Regression

Yn×1∼ + ×E

7

ECLUST - our proposed method: 3 phases

Original Data

E = 0

1) Gene Similarity

E = 1

2) Cluster

Representation

n × 1 n × 1

3) Penalized

Regression

Yn×1∼ + ×E

7

ECLUST - our proposed method: 3 phases

Original Data

E = 0

1) Gene Similarity

E = 1

2) Cluster

Representation

n × 1 n × 1

3) Penalized

Regression

Yn×1∼ + ×E

7

ECLUST - our proposed method: 3 phases

Original Data

E = 0

1) Gene Similarity

E = 1

2) Cluster

Representation

n × 1 n × 1

3) Penalized

Regression

Yn×1∼ + ×E

7

ECLUST - our proposed method: 3 phases

Original Data

E = 0

1) Gene Similarity

E = 1

2) Cluster

Representation

n × 1 n × 1

3) Penalized

Regression

Yn×1∼ + ×E

7

ECLUST - our proposed method: 3 phases

Original Data

E = 0

1) Gene Similarity

E = 1

2) Cluster

Representation

n × 1 n × 1

3) Penalized

Regression

Yn×1∼ + ×E

7

the objective of statistical

methods is the reduction of data.

A quantity of data . . . is to be

replaced by relatively few quantities

which shall adequately represent

. . . the relevant information

contained in the original data.

- Sir R. A. Fisher, 1922

7

Model

g(µ) =β0 + β1X1 + · · ·+ βpXp + βEE︸ ︷︷ ︸main effects

+α1E (X1E ) + · · ·+ αpE (XpE )︸ ︷︷ ︸interactions

Reparametrization1: αjE = γjEβjβE .

Strong heredity principle2:

αjE 6= 0 ⇒ βj 6= 0 and βE 6= 0

1Choi et al. 2010, JASA2Chipman 1996, Canadian Journal of Statistics

8

Model

g(µ) =β0 + β1X1 + · · ·+ βpXp + βEE︸ ︷︷ ︸main effects

+α1E (X1E ) + · · ·+ αpE (XpE )︸ ︷︷ ︸interactions

Reparametrization1: αjE = γjEβjβE .

Strong heredity principle2:

αjE 6= 0 ⇒ βj 6= 0 and βE 6= 0

1Choi et al. 2010, JASA2Chipman 1996, Canadian Journal of Statistics

8

Model

g(µ) =β0 + β1X1 + · · ·+ βpXp + βEE︸ ︷︷ ︸main effects

+α1E (X1E ) + · · ·+ αpE (XpE )︸ ︷︷ ︸interactions

Reparametrization1: αjE = γjEβjβE .

Strong heredity principle2:

αjE 6= 0 ⇒ βj 6= 0 and βE 6= 0

1Choi et al. 2010, JASA2Chipman 1996, Canadian Journal of Statistics

8

Strong Heredity Model with Penalization

arg minβ0,β,γ

1

2‖Y − g(µ)‖2 +

λβ (w1β1 + · · ·+ wqβq + wEβE ) +

λγ (w1Eγ1E + · · ·+ wqEγqE )

wj =

∣∣∣∣∣ 1

βj

∣∣∣∣∣ , wjE =

∣∣∣∣∣ βj βEαjE

∣∣∣∣∣

9

Results

Simulation Study: Jaccard Index and test set MSE

10

Open source software

• Software implementation in R: http://sahirbhatnagar.com/eclust/

• Allows user specified interaction terms

• Automatically determines the optimal tuning parameters through

cross validation

• Can also be applied to genetic data

11

Conclusions

Conclusions and Contributions

• Large system-wide changes are observed in many environments

• This assumption can possibly be exploited to aid analysis of large

data

• We develop and implement a multivariate penalization procedure for

predicting a continuous or binary disease outcome while detecting

interactions between high dimensional data (p >> n) and an

environmental factor.

• Dimension reduction is achieved through leveraging the

environmental-class-conditional correlations

• Also, we develop and implement a strong heredity framework

within the penalized model

• R software: http://sahirbhatnagar.com/eclust/

12

Conclusions and Contributions

• Large system-wide changes are observed in many environments

• This assumption can possibly be exploited to aid analysis of large

data

• We develop and implement a multivariate penalization procedure for

predicting a continuous or binary disease outcome while detecting

interactions between high dimensional data (p >> n) and an

environmental factor.

• Dimension reduction is achieved through leveraging the

environmental-class-conditional correlations

• Also, we develop and implement a strong heredity framework

within the penalized model

• R software: http://sahirbhatnagar.com/eclust/

12

Conclusions and Contributions

• Large system-wide changes are observed in many environments

• This assumption can possibly be exploited to aid analysis of large

data

• We develop and implement a multivariate penalization procedure for

predicting a continuous or binary disease outcome while detecting

interactions between high dimensional data (p >> n) and an

environmental factor.

• Dimension reduction is achieved through leveraging the

environmental-class-conditional correlations

• Also, we develop and implement a strong heredity framework

within the penalized model

• R software: http://sahirbhatnagar.com/eclust/

12

Conclusions and Contributions

• Large system-wide changes are observed in many environments

• This assumption can possibly be exploited to aid analysis of large

data

• We develop and implement a multivariate penalization procedure for

predicting a continuous or binary disease outcome while detecting

interactions between high dimensional data (p >> n) and an

environmental factor.

• Dimension reduction is achieved through leveraging the

environmental-class-conditional correlations

• Also, we develop and implement a strong heredity framework

within the penalized model

• R software: http://sahirbhatnagar.com/eclust/

12

Conclusions and Contributions

• Large system-wide changes are observed in many environments

• This assumption can possibly be exploited to aid analysis of large

data

• We develop and implement a multivariate penalization procedure for

predicting a continuous or binary disease outcome while detecting

interactions between high dimensional data (p >> n) and an

environmental factor.

• Dimension reduction is achieved through leveraging the

environmental-class-conditional correlations

• Also, we develop and implement a strong heredity framework

within the penalized model

• R software: http://sahirbhatnagar.com/eclust/

12

Conclusions and Contributions

• Large system-wide changes are observed in many environments

• This assumption can possibly be exploited to aid analysis of large

data

• We develop and implement a multivariate penalization procedure for

predicting a continuous or binary disease outcome while detecting

interactions between high dimensional data (p >> n) and an

environmental factor.

• Dimension reduction is achieved through leveraging the

environmental-class-conditional correlations

• Also, we develop and implement a strong heredity framework

within the penalized model

• R software: http://sahirbhatnagar.com/eclust/

12

Limitations

• There must be a high-dimensional signature of the exposure

• Clustering is unsupervised

• Two tuning parameters

• Need more samples . . . Got data? (Poster 67)

13

Limitations

• There must be a high-dimensional signature of the exposure

• Clustering is unsupervised

• Two tuning parameters

• Need more samples . . . Got data? (Poster 67)

13

Limitations

• There must be a high-dimensional signature of the exposure

• Clustering is unsupervised

• Two tuning parameters

• Need more samples . . . Got data? (Poster 67)

13

Limitations

• There must be a high-dimensional signature of the exposure

• Clustering is unsupervised

• Two tuning parameters

• Need more samples . . . Got data? (Poster 67)

13

acknowledgements

• Dr. Celia Greenwood

• Dr. Blanchette and Dr. Yang

• Dr. Luigi Bouchard, Andre Anne

Houde

• Dr. Steele, Dr. Kramer,

Dr. Abrahamowicz

• Maxime Turgeon, Kevin

McGregor, Lauren Mokry,

Dr. Forest

• Greg Voisin, Dr. Forgetta,

Dr. Klein

• Mothers and children from the

study

14

Recommended