Analysis of DNA Methylation and Gene Expression Data in Placenta Tissue to Predict Childhood Obesity

An Integrative Approach

Bhatnagar SR [1,2], Houde A [4,5], Voisin G [2], Bouchard L [4,5], Greenwood CMT [1,2,3]

[1] Department of Epidemiology, Biostatistics and Occupational Health, McGill University
[2] Lady Davis Institute, Jewish General Hospital, Montreal, QC
[3] Departments of Oncology and Human Genetics, McGill University
[4] Department of Biochemistry, Université de Sherbrooke, QC
[5] ECOGENE-21 and Lipid Clinic, Chicoutimi Hospital, QC

sahirbhatnagar.com/talks
Poster Session B, # 56

Motivating Question # 1

sahirbhatnagar.com | Data Integration | CHSGM 2015

Motivation

- 1 in 4 adult Canadians and 1 in 10 children are clinically obese.

- Events during pregnancy are suspected to play a role in childhood obesity → the mechanisms involved are not known

- Children born to women who had a pregnancy affected by gestational diabetes mellitus are more likely to be overweight or obese

- Evidence suggests that epigenetic factors are an important piece of the puzzle


Motivating Question # 2


Motivating Question

[Figure: axes "sample size" (25, 50) and "genomic data"; boxes labelled Gene Expression and DNA Methylation, with question marks over the alternatives.]



The Data


At birth: Placenta, n = 45 (X)
- Expression: HT-12 v4, p = 46,889
- Methylation: Illumina 450k, p = 375,561
- Gestational diabetes: n = 45, GD = 29

At age 5: Child, n = 23, GD = 16 (Y)
- 7 fat measures

X → Y?




Summarizing Expression, Methylation and Gestational Diabetes Phenotype in Placenta Tissue


Sparse Canonical Correlation Analysis (sCCA)

- CCA requires calculation of $(X^T X)^{-1}$ and $(Y^T Y)^{-1}$

- When $p + q \gg n$, these matrices are singular

- sCCA applies an L1 penalty to the canonical vectors to obtain sparse solutions (Witten et al., 2009; Parkhomenko et al., 2009)

- Assumes $X^T X = I_p$, $Y^T Y = I_q$

\[
\underset{u,\,v}{\text{maximize}} \quad u^T X^T Y v
\quad \text{subject to} \quad \|u\|_2^2 \le 1,\ \|v\|_2^2 \le 1,\ P_1(u) \le \lambda_1,\ P_2(v) \le \lambda_2
\]
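The optimization above can be sketched with alternating soft-thresholded updates, in the spirit of the penalized matrix decomposition of Witten et al. (2009). This is a minimal illustration, not the implementation used for these slides: it assumes column-centred X and Y, takes the penalties P1 and P2 to be L1 norms, and replaces the search for thresholds that meet the constraints λ1 and λ2 with fixed soft-threshold levels. The names `sparse_cca`, `delta_u`, `delta_v` and the toy data are hypothetical.

```python
import math

def soft_threshold(a, delta):
    """Elementwise soft-thresholding: sign(x) * max(|x| - delta, 0)."""
    return [math.copysign(max(abs(x) - delta, 0.0), x) for x in a]

def unit_norm(a):
    """Scale a to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a] if norm > 0 else list(a)

def sparse_cca(X, Y, delta_u, delta_v, n_iter=50):
    """First pair of sparse canonical vectors for X (n x p) and Y (n x q).

    Assumption: fixed thresholds delta_u, delta_v stand in for the
    constraint levels lambda_1, lambda_2 of the slide.
    """
    n, p, q = len(X), len(X[0]), len(Y[0])
    # K = X^T Y, the p x q cross-product matrix
    K = [[sum(X[i][j] * Y[i][k] for i in range(n)) for k in range(q)]
         for j in range(p)]
    v = unit_norm([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        # u-update: soft-threshold K v, then rescale to unit length
        u = unit_norm(soft_threshold(
            [sum(K[j][k] * v[k] for k in range(q)) for j in range(p)], delta_u))
        # v-update: soft-threshold K^T u, then rescale to unit length
        v = unit_norm(soft_threshold(
            [sum(K[j][k] * u[j] for j in range(p)) for k in range(q)], delta_v))
    return u, v

# Hypothetical toy data: one shared signal z plus small centred noise columns.
z = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
a = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
b = [0.1, 0.1, -0.1, -0.1, 0.1, -0.1]
c = [-0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
X_toy = [list(row) for row in zip(z, a, b, c)]   # 6 samples x 4 features
Y_toy = [list(row) for row in zip(z, c, a)]      # 6 samples x 3 features
u, v = sparse_cca(X_toy, Y_toy, delta_u=1.0, delta_v=1.0)
```

On the toy data only the shared signal column keeps a non-zero weight in `u` and `v`; the noise columns are thresholded to zero.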


Supervised Sparse CCA

Main idea:

1. The features that are most associated with the outcome Q are identified to form the reduced matrices X and Y

2. sCCA is performed on X and Y
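Step 1 can be sketched as a simple screening pass. The slides do not say which association measure was used, so this example assumes a binary outcome Q (such as gestational diabetes status) and a Welch two-sample t-statistic per feature. The function names, the threshold and the toy data are hypothetical.

```python
import math

def welch_t(g1, g0):
    """Welch two-sample t-statistic for two lists of numbers."""
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    v1 = sum((x - m1) ** 2 for x in g1) / (len(g1) - 1)
    v0 = sum((x - m0) ** 2 for x in g0) / (len(g0) - 1)
    return (m1 - m0) / math.sqrt(v1 / len(g1) + v0 / len(g0))

def screen_features(M, q, threshold):
    """Indices of columns of M whose |t| against the binary outcome q exceeds threshold."""
    keep = []
    for j in range(len(M[0])):
        g1 = [row[j] for row, label in zip(M, q) if label == 1]
        g0 = [row[j] for row, label in zip(M, q) if label == 0]
        if abs(welch_t(g1, g0)) > threshold:
            keep.append(j)
    return keep

def reduce_matrix(M, keep):
    """Keep only the screened columns; step 2 (sCCA) would run on the result."""
    return [[row[j] for j in keep] for row in M]

# Hypothetical toy data: column 0 differs between groups, column 1 does not.
M_toy = [[5.0, 1.0], [5.5, 2.0], [6.0, 3.0],
         [1.0, 1.5], [1.5, 2.5], [2.0, 2.0]]
q_toy = [1, 1, 1, 0, 0, 0]
keep = screen_features(M_toy, q_toy, threshold=2.0)
M_reduced = reduce_matrix(M_toy, keep)
```

The same screening would be applied separately to the expression and the methylation matrices before sCCA is run on the two reduced matrices.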


Importance of Gestational Diabetes Phenotype

[Figure: two panels, "Gestational Diabetes Status Used in Sparse CCA" and "Gestational Diabetes Status Not Used". Each shows the correlation (scale 0.88 to 0.98) against the # non-0 methylation probes and the # non-0 expression probes.]


GO Stat Analysis for Enrichment

- Enrichment analysis based on the non-zero elements of the 1st component from the supervised sCCA analysis

- The enriched terms point to genes associated with inflammatory processes

Table: Top list of enriched GO terms

GOBPID   FDR        OR   E.Count  Count  Size  Term
0002376  < 10^-14   2.1  131.6    227    2178  immune system process
0006955  < 10^-13   2.3  78.7     153    1303  immune response
0002252  < 10^-9    2.7  34.1     80     565   immune effector process
0045087  < 10^-8    2.3  49.0     99     811   innate immune response
0002682  < 10^-8    2.1  66.56    122    1102  regulation of immune system process
0002684  < 10^-8    2.4  40.1     84     664   positive regulation of immune system process
0006952  < 10^-8    1.9  84.5     144    1399  defense response
0050776  < 10^-8    2.3  44.5     90     738   regulation of immune response
0050778  < 10^-7    2.6  28.5     65     473   positive regulation of immune response
0006950  < 10^-7    1.6  196.8    271    3258  response to stress

OR = odds ratio; E.Count = expected count; Count = observed count; Size = number of genes annotated to the term.
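GO term enrichment of this kind is commonly assessed with a hypergeometric test, which compares the observed count for a term with the count expected by chance. The sketch below assumes that setting. The function name and the toy numbers are hypothetical, and the gene universe size and number of selected genes for this study are not given on the slide.

```python
from math import comb

def hypergeom_enrichment(universe, term_size, selected, count):
    """Expected count and upper-tail p-value P(X >= count) for one GO term.

    universe  : number of genes considered in total
    term_size : genes in the universe annotated to the term ("Size")
    selected  : genes in the selected list
    count     : selected genes annotated to the term ("Count")
    """
    expected = selected * term_size / universe
    total = comb(universe, selected)
    upper = min(term_size, selected)
    # Sum hypergeometric probabilities from the observed count upwards
    p_value = sum(comb(term_size, i) * comb(universe - term_size, selected - i)
                  for i in range(count, upper + 1)) / total
    return expected, p_value

# Hypothetical toy numbers, not taken from the study.
expected, p_value = hypergeom_enrichment(universe=20, term_size=5,
                                         selected=6, count=3)
```

In the table, E.Count divided by Size is about 0.06 in every row (for example 131.6 / 2178), which is what one would see if the expected count is Size times the fraction of the gene universe that was selected.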


Summarizing Bodyfat Measures


Cluster 6 Bodyfat measures in 2 groups

[Figure: clustered heatmap of 23 children (columns) by 6 bodyfat measures (rows: Zscore BMI, percent fat, subscapularis, bicep, tricep, iliacus); color key from −2 to 2.]


Circle of Correlations

[Figure: Variables factor map (PCA). Dim 1 (50.68%) against Dim 2 (15.41%), both axes from −1.0 to 1.0; variables shown: Zscore BMI, percent fat, tricep, bicep, subscapularis, iliacus.]
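A first principal component of the bodyfat measures, with the share of variance it explains (the percentage attached to Dim 1), can be sketched as follows. This is an illustration only: it assumes the measures are standardized before PCA and uses power iteration on the correlation matrix, which is not necessarily how the slides' PCA was computed. The function names and toy data are hypothetical.

```python
import math

def standardize(M):
    """Centre each column and scale it to unit sample standard deviation."""
    n, p = len(M), len(M[0])
    cols = []
    for j in range(p):
        col = [row[j] for row in M]
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        cols.append([(x - mean) / sd for x in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]

def first_pc(M, n_iter=200):
    """Loadings, scores and explained-variance share of the 1st principal component."""
    Z = standardize(M)
    n, p = len(Z), len(Z[0])
    # Correlation matrix of the standardized measures
    R = [[sum(Z[i][j] * Z[i][k] for i in range(n)) / (n - 1) for k in range(p)]
         for j in range(p)]
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        # Power iteration: multiply by R and rescale to unit length
        Rw = [sum(R[j][k] * w[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in Rw))
        w = [x / norm for x in Rw]
    eigenvalue = sum(w[j] * sum(R[j][k] * w[k] for k in range(p)) for j in range(p))
    scores = [sum(Z[i][j] * w[j] for j in range(p)) for i in range(n)]
    # The trace of a correlation matrix is p, so eigenvalue / p is the variance share
    return w, scores, eigenvalue / p

# Hypothetical toy data: 5 children, 3 strongly correlated measures.
fat_toy = [[1.0, 2.0, 1.5], [2.0, 3.0, 2.4], [3.0, 4.1, 3.6],
           [4.0, 5.0, 4.4], [5.0, 6.2, 5.5]]
loadings, pc1_scores, explained = first_pc(fat_toy)
```

The `pc1_scores` vector gives one value per child and can serve as a single summary outcome in place of the individual measures.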


Combining Both Data


Regression via Elastic Net


[Diagram repeated from "The Data": X (placenta, n = 45; methylation, Illumina 450k, p = 375,561) → Y? (child, n = 23, GD = 16; 7 fat measures).]
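Elastic net regression evaluated by leave-one-out cross-validation (LOO CV) can be sketched as below. This is a minimal coordinate-descent version under the objective (1/2n)·||y − b0 − Xb||² + lam·(alpha·||b||₁ + (1 − alpha)/2·||b||²). It is not the software used for the slides, and the values of `lam` and `alpha` and the toy data are hypothetical.

```python
import math

def soft(z, g):
    """Soft-thresholding operator."""
    return math.copysign(max(abs(z) - g, 0.0), z)

def elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for the elastic net; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]      # residual for beta = 0
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [Xc[i][j] for i in range(n)]
            # Covariance of feature j with the residual that excludes feature j
            rho = sum(col[i] * (resid[i] + col[i] * beta[j]) for i in range(n)) / n
            denom = sum(c * c for c in col) / n + lam * (1 - alpha)
            new = soft(rho, lam * alpha) / denom if denom > 0 else 0.0
            resid = [resid[i] - col[i] * (new - beta[j]) for i in range(n)]
            beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta

def loo_cv_mse(X, y, lam, alpha):
    """Leave-one-out cross-validated mean squared error."""
    errors = []
    for i in range(len(y)):
        # Fit on all samples except i, then predict sample i
        b0, beta = elastic_net(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam, alpha)
        pred = b0 + sum(b * x for b, x in zip(beta, X[i]))
        errors.append((y[i] - pred) ** 2)
    return sum(errors) / len(errors)

# Hypothetical toy data: y depends on the first feature only.
X_toy = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5], [5.0, 0.5], [6.0, -0.5]]
y_toy = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
b0, beta = elastic_net(X_toy, y_toy, lam=0.1, alpha=0.5)
mse = loo_cv_mse(X_toy, y_toy, lam=0.1, alpha=0.5)
```

With only 23 children, LOO CV uses each child once as the test case, which is why it is a natural choice over a held-out test set at this sample size.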


1st PC as Summary Bodyfat Measure

[Figure: LOO CV mean squared error by data used to predict the 1st PC of bodyfat measures. Data types: Canonical Variables; Expr+Methy, Expr, and Methy non-0 CCA factors; Expr+Methy Filter, Expr Filter low means, Methy Filter low var; Expr+Methy, Expr, and Methy Filter low+t.test; Expr+Methy, Expr, and Methy Filter t.test.]


Ward Clustering Groups

[Figure: LOO CV misclassification error (scale 0.0 to 0.5) by data used to predict Ward clustering groups. Data types: Canonical Variables; Expr+Methy, Expr, and Methy non-0 CCA factors; Expr+Methy Filter, Expr Filter low means, Methy Filter low var; Expr+Methy, Expr, and Methy Filter low+t.test; Expr+Methy, Expr, and Methy Filter t.test.]


Neuropeptide Y Receptor (NPY1R)

From OMIM:

- Neuropeptide Y is one of the most abundant neuropeptides in the mammalian nervous system

- It exhibits a diverse range of important physiologic activities, including effects on food intake

- Its receptors have been identified in a variety of tissues, including placenta (Herzog et al., 1992).


Motivating Question #2: My Answer

[Figure: axes "sample size" (25, 50) and "genomic data"; boxes labelled Gene Expression and DNA Methylation.]


Big Data
Data Integration
Machine Learning

Small n Data


Acknowledgements

- Celia Greenwood and Mathieu Blanchette

- Greg Voisin, Andree-Anne Houde, Luigi Bouchard

- All the mothers and children that took part in this study

- You


References

Principal component analysis plots and beamer template. URL http://gastonsanchez.com/.

Elena Parkhomenko, David Tritchler, and Joseph Beyene. Sparse canonical correlation analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1):1–34, 2009.

Daniela M Witten and Robert J Tibshirani. Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1):1–27, 2009.

Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, page kxp008, 2009.

