
OMICS A Journal of Integrative Biology
Volume 13, Number 3, 2009
© Mary Ann Liebert, Inc.
DOI: 10.1089/omi.2009.0003

Correlation between Gene Expression and Clinical Data through Linear and Nonlinear Principal Components Analyses: Muscular Dystrophies as Case Studies

Chiara Romualdi,1 Alessandro Giuliani,2 Caterina Millino,1 Barbara Celegato,1 Romualdo Benigni,2 and Gerolamo Lanfranchi1

Abstract

The large dimension of microarray data and the complex dependence structure among genes make data analysis extremely challenging. In the last decade several statistical techniques have been proposed to tackle genome-wide expression data; however, clinical and molecular data associated with pathologies have often been considered as separate dimensions of the same phenomenon, especially when clinical variables lie in a multidimensional space. A better comprehension of the relationships between clinical and molecular data can be obtained if both data types are combined and integrated. In this work we adopt a multidimensional correlation strategy, together with linear and nonlinear principal component analysis, to integrate genetic and clinical information obtained from two sets of dystrophic patients. With this approach we decompose different aspects of clinical manifestations and correlate these features with the corresponding patterns of differential gene expression.

Introduction

With the advent of the genomic era, various technologies have become available for monitoring the expression profiles of roughly the entire set of genes present in a given organism or cell. The most widely used are those based on oligonucleotide and cDNA microarray platforms, which provide rapid and parallel quantification of the expression pattern of thousands of genes in a single experiment (Schena et al., 1995). A very powerful application of transcriptional profiling is the study of pattern similarity across many experiments representing a wide variety of conditions. In particular, microarray technology has been successfully applied in tumor characterization, leading to the identification of unknown cancer subclasses or groups of markers with high prognostic value (Khan et al., 2001; Ross et al., 2000; Shipp et al., 2002; Van't Veer et al., 2002).

The large dimension of data generated by microarray experiments and the complex dependence structure among genes make the analysis of expression data very challenging. In the last decade several statistical techniques have been proposed to tackle the management of microarray data (Baty et al., 2005; Ben-Dor et al., 2000; Dudoit et al., 2000; Eisen et al., 1998; Guan et al., 2005; Li et al., 2001; Pan et al., 2002; Ramoni et al., 2002; Romualdi et al., 2003; Tamayo et al., 1999; Tang et al., 2006; Teschendorff et al., 2005; Wang and Gehan, 2005). In the application of the microarray approach to human pathologies, clinical and expression data have been considered as separate dimensions of the same phenomenon. In particular, some statistical techniques based on penalized regression (Li and Gui, 2004; Gui and Li, 2005; van Houwelingen, 2006; Nguyen and Rocke, 2002) and principal components (Tan et al., 2005) have been proposed to predict a single clinical variable (e.g., pathology classification, survival time, pharmacological treatment) from expression data. Little effort has been devoted to developing adequate statistical procedures for correlating expression data with a large series of clinical traits and variables. Especially when the molecular processes involved in a specific pathological trait are almost unknown, a better comprehension of the results could derive from an analytical methodology that approaches and combines both data types. In fact, even when the principal molecular cause is simple (a single gene mutation, a virus, or a bacterium), diseases remain complex entities in their clinical description and gene expression signatures. This complexity arises from the boundary conditions surrounding the basic etiological cause and makes the individual response to the same pathological status highly variable. As a matter of fact, variability in response is the basic material of any effort to describe pathological entities.

1 CRIBI Biotechnology Centre and Dipartimento di Biologia, Università degli Studi di Padova, Padova, Italy.
2 Department of Environment and Health, Istituto Superiore di Sanità, Rome, Italy.

The statistical approach to the description of a disease usually results in a data matrix where the single patients are considered statistical units and the different descriptors of these units (clinical traits, biological end-points, levels of gene expression, etc.) are the statistical variables. In microarray studies, only the gene expression dimension is analyzed in its intrinsic complexity, while the complexity of clinical data is usually collapsed to a single main dimension (prognosis, subclasses of the same pathological entity, survival, response to a given therapy).

The strong imbalance between the number of observations and the number of variables diminishes the specificity of analyses performed by classical hypothesis testing (detecting differentially expressed genes one by one), with the consequent generation of a large number of false positives. Therefore, dimension reduction techniques such as principal component analysis, which avoid gene selection, are better suited to this context. In this study we tried to maintain the natural complexity of the diseases by (1) analyzing clinical and gene expression variables without any a priori data filtering, and (2) adopting the same unsupervised approach (principal component analysis, PCA) for both the clinical (nonlinear PCA) and molecular data (linear PCA) of the analyzed pathologies. The link between clinical and molecular levels is given by the mutual correlation between the principal component scores arising from the correlation structures of the clinical and gene expression data sets, respectively. Our approach shares similarities with nonlinear canonical correlation analysis (NLCCA), which corresponds to categorical canonical correlation analysis with optimal scaling (Gifi, 1990). However, NLCCA applied to genomic data (characterized by a huge dimensionality) will lead to multicollinearity and/or overfitting problems. A dimension reduction technique like PCA is necessary in this context. Furthermore, we are convinced that performing PCA and NLPCA separately on molecular and clinical data, respectively (independently of the correlation that links these batches of new variables), should lead to more easily interpretable results and to fewer false positive correlations. Our strategy, while avoiding the risk of chance correlation deriving from the extremely high dimensionality of the gene expression data, allows a data-driven and unbiased appreciation of the links between the spontaneous order parameters shaping the clinical and molecular sides of the studied diseases.

The results of microarray experiments on a given pathology usually consist of very long lists of differentially expressed or coexpressed genes that could be involved in a mixture of different biological processes. Understanding and separating these aspects a posteriori is extremely difficult, especially when the pathology is rare and mostly unknown. Our method allows the assignment of deregulated genes to different clinical aspects of the pathology, improving the interpretation of results, the comprehension of the pathology, and prognostic prediction. We applied our strategy to two different gene expression datasets: the first obtained from two muscular dystrophies, limb-girdle muscular dystrophy type 2B (LGMD2B) and congenital muscular dystrophy (MDC) (Campanaro et al., 2002; Millino et al., 2006); the second obtained from facioscapulohumeral muscular dystrophy (FSHD) (Celegato et al., 2006). In these muscle disorders the biological processes involved still have to be clarified. These specific datasets were selected for their comprehensive clinical characterization of patients. Our approach needs an adequately large number of clinical variables for each patient; in particular, the larger the number of clinical variables, the more efficient the separation of different pathological aspects. In this work we show how the proposed approach allows a link to be established between clinical and gene expression information.

Materials and Methods

Expression data

Three gene expression datasets of distinct muscle disorders obtained by cDNA microarrays were tested: (1) the LGMD2B dataset (Campanaro et al., 2002) (GEO ID GSE3022); (2) the MDC dataset (Millino et al., 2006), available at http://muscle.cribi.unipd.it/microarrays/MDC/; (3) the FSHD dataset (GEO ID GSE2820) (Celegato et al., 2006). The LGMD2B and MDC datasets were obtained with the same platform (GPL2677), in the same laboratory, with the same protocols and control reference, and with RNA derived from biopsies of the same muscle type (quadriceps femoralis). For these reasons LGMD2B and MDC were integrated into one single dataset. All experiments were normalized with global and then LOWESS statistical procedures (Yang et al., 2002) using the MIDAW web tool (Romualdi et al., 2005).

Clinical data

Clinical, histopathological, and immunological traits of all patients whose biopsies were used for microarray experiments were collected to form a multidimensional matrix (Table 1).

For details on the degree of the dystrophic/myopathic process and other clinical information on LGMD2B, MDC, and FSHD patients, see Supplementary Appendix 1 (see online supplementary material at www.liebertonline.com).

Statistical analysis

The clinical data matrix is characterized by binary, polychotomous, or numerical variables. In this case, simple principal component analysis (PCA) could generate artifactual factors because the categories are converted into a quantitative scale. Therefore, for the analysis of categorical data through a nonlinear variety of classical multivariate analysis, we used the Gifi system (Gifi, 1990). This system is characterized by the optimal scaling of categorical variables, implemented through alternating least squares algorithms. In the Gifi system, nonlinear PCA (NLPCA) is derived as homogeneity analysis with particular restrictions (de Leeuw et al., 1980).

The clinical data matrix was then analyzed by means of nonlinear principal component analysis, and the scores of the components significantly different from the noise floor were used to place patients in the clinical space. The same procedure was applied to the microarray space, with the difference that, given the numerical nature of the gene expression matrix, molecular data were analyzed by means of linear PCA. In this case, however, the principal components were extracted from the transposed matrix, having genes as statistical units and patients as variables. Patients were then defined in terms of their component loadings (instead of component scores), and these component loadings were correlated with the corresponding scores of the clinical space. This inversion was necessary because of the degeneration of the data structure when the number of genes largely exceeds the number of patients: in this case, modeling the data using patients as statistical units could generate a high risk of chance correlations (Topliss and Edwards, 1979). To avoid this risk and to establish a robust statistical basis for our analysis, we inverted the roles of statistical units and variables: patients are thus defined in the space of component loadings, which is a perfectly legitimate representation of patients in terms of similarities in the gene expression space. Genes, on the other hand, are defined in the space of component scores, allowing a biological association of components with the groups of genes having the highest absolute scores.

TABLE 1. SUMMARY OF CLINICAL, HISTOPATHOLOGICAL, AND IMMUNOLOGICAL TRAITS OF THE LGMD2B, MDC, AND FSHD PATIENTS ANALYZED IN THIS STUDY

LGMD2B + MDC clinical data

Pt.  Sex  Age at    Foetal myosin   Type 1      Creatine kinase  Related    Path.    Path. sev.
          biopsy    positive fib.   fibers      in U/L (age)     protein %           score
          (years)   (%) (IHC)       (%) (IHC)
A    F    5         8.9             5.0         50 (5)           normal     MDC      Mild
B    F    4         4.9             5.0         50 (4)           normal     MDC      Mild
C    M    0.1       0.2             5.0         917 (0.1)        normal     MDC      Advanced
D    M    13.5      1.9             5.0         400 (5)          normal     MDC      Moderate
E    F    2         20.0            80.0        1560 (3)         normal     MDC      Advanced
F    F    10        0.9             5.0         760 (3)          normal     MDC      active
G    F    0.3       20.0            75.0        1500 (0.3)       absent     MDC      Advanced
H    F    5         5.7             5.0         950 (0.1)        traces     MDC      active
I    M    0.2       30.0            30.0        1817 (0.2)       absent     MDC      Advanced
L    F    3         5.4             5.0         1431 (3)         absent     MDC      Mild
1    F    19        24.6            45.4        3642 (34)        absent     LGMD2B   Active
2    M    30        25.7            18.4        1520 (30)        traces     LGMD2B   Active
4    F    36        0.4             47.1        3921 (26)        absent     LGMD2B   Mild
5    M    34        14.8            42.8        20790 (19)       absent     LGMD2B   Active
7    F    2         7.6             22.0        1880 (36)        traces     LGMD2B   Moderate

Family history, macrophages (IHC), and necrosis (H&E) were scored with graded symbols that are not reproduced here.

FSHD clinical data

Pt.  Sex  Age at   Inflammation  Regeneration  % fib 1  % fib 2  Fiber     Pat. severity  Fragment
          biopsy                                                 diameter  score          length (kb)
1    M    69       No            No            51       49       68        1              27
2    M    51       No            rare          45       55       76        2              27
3    M    20       No            No            56       44       60        2              26
4    M    27       No            No            40       60       73        2              23
5    F    53       No            rare          53       47       59        1              23
6    M    30       No            No            45       55       58        1              21
7    M    15       mild          No            57       43       67        2              19
8    M    32       No            No            42       58       55        2              16
9    M    13       mild          No            67       33       46        1              10
10   F    8        mild          rare          72       28       30        2              10
11   M    21       No            No            62       38       54        1              10

(IHC: by immunohistochemical reaction. H&E: by hematoxylin-eosin staining.) For more details, see the Methods section.

PCs and factors are found through mathematical modeling of the original variables (clinical and/or molecular data). However, the way they are constructed is based on the intrinsic structure of variability underlying the data. The intrinsic data variability basically reflects the similarity or dissimilarity of observations according to specific variables. This variability is usually shared by variables that are proxies of a specific biological process. PCA and NLPCA are constructed to capture this variability through orthogonal linear combinations of the original variables. Thus, it should not be a surprise that usually (but not always) the first PCs or factors (capturing the major source of variance) are easy to interpret, while the remaining ones are a mixture of variables that is difficult to interpret. However, there may be cases where data variability is widely spread across experiments; in these cases neither our unsupervised approach nor classical hypothesis testing procedures (whose power is strongly influenced by variability) will be easy to interpret.
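The transposed PCA described in this section, with genes as statistical units and patients as variables, can be sketched in a few lines. This is a minimal Python illustration and not the authors' R code. It assumes standardized patient columns (so the PCA runs on the patient-by-patient correlation matrix) and extracts only the first component by power iteration, a choice made here to avoid external libraries. The function name `first_component` and the toy matrix are hypothetical.

```python
from statistics import mean, pstdev

def first_component(expr, n_iter=200):
    """First principal component of a genes x patients matrix.

    Genes are the statistical units (rows) and patients the variables
    (columns). Returns (patient loadings, gene scores, explained fraction).
    Assumes no patient column is constant.
    """
    n_genes, n_pat = len(expr), len(expr[0])
    # Standardize each patient column so the PCA runs on correlations.
    z = []
    for j in range(n_pat):
        col = [row[j] for row in expr]
        m, s = mean(col), pstdev(col)
        z.append([(v - m) / s for v in col])
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n_genes
             for j in range(n_pat)] for i in range(n_pat)]
    # Power iteration for the leading eigenvector (its sign is arbitrary).
    vec = [1.0] * n_pat
    for _ in range(n_iter):
        new = [sum(c * v for c, v in zip(row, vec)) for row in corr]
        norm = sum(v * v for v in new) ** 0.5
        vec = [v / norm for v in new]
    eigenvalue = sum(v * sum(c * w for c, w in zip(row, vec))
                     for v, row in zip(vec, corr))
    loadings = [v * eigenvalue ** 0.5 for v in vec]        # one per patient
    scores = [sum(z[j][g] * vec[j] for j in range(n_pat))  # one per gene
              for g in range(n_genes)]
    return loadings, scores, eigenvalue / n_pat

# Toy data: 5 genes x 3 patients; patient 3 mirrors patients 1 and 2.
toy = [[1, 1, -1], [2, 2, -2], [-1, -1, 1], [0, 0, 0], [3, 3, -3]]
loadings, scores, explained = first_component(toy)
```

In this layout each patient receives one loading per component (its correlation with the component), and each gene receives one score, which matches the roles the two quantities play in the text.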

Gene Ontology annotation was performed using the DAVID Web tool (Dennis et al., 2003) (available at http://david.niaid.nih.gov). The statistical significance of the enrichment of all categories was calculated according to the EASE score (Hosack et al., 2003). It is well known that multiple testing may produce a large number of false positives if the type I error (the alpha level commonly used in statistical tests) is applied to each test. False discovery rate (FDR), a multiple test correction widely used in microarray experiments, was therefore applied. FDR or Q-value is defined as the expected proportion of false positives in a list of enriched categories (Storey, 2002). The Q-value for each class is defined as Q = (p × n)/i, where p is the p-value of the class obtained from the hypergeometric distribution, n is the total number of tests performed, and i is the number of classes with a p-value at or below p.
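The Q-value formula above is short enough to work through directly. The sketch below is a hedged Python illustration: `hypergeom_pvalue` computes a plain upper-tail hypergeometric p-value (it does not reproduce the EASE score adjustment), and `q_values` applies Q = (p × n)/i as written in the text. Both function names and the example p-values are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n_list, K, N):
    """P(X >= k): chance of seeing at least k category genes in a list of
    n_list genes drawn from N genes, K of which belong to the category."""
    upper = min(n_list, K)
    hits = sum(comb(K, x) * comb(N - K, n_list - x) for x in range(k, upper + 1))
    return hits / comb(N, n_list)

def q_values(p_values):
    """Q = (p * n) / i, with n tests and i = number of p-values <= p.

    The raw formula is not capped at 1, so large p-values can give Q > 1.
    """
    n = len(p_values)
    return [p * n / sum(1 for other in p_values if other <= p)
            for p in p_values]

# Hypothetical p-values for four categories.
example_q = q_values([0.01, 0.04, 0.03, 0.5])
```

With four tests, the smallest p-value (0.01) has i = 1 and is multiplied by the full number of tests, while the largest (0.5) has i = 4 and is left unchanged.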

We used a stepwise procedure to identify the best set of PCs as linear predictors of the Factors derived from clinical data. Many multiple regression models contain variables whose t statistics have nonsignificant p-values. These variables do not display statistically significant predictive capability in the presence of the other predictors. The question is whether some variables can be removed from the model to generate a more parsimonious and powerful model. Variable selection procedures include forward selection, backward elimination, and stepwise regression. They add or remove variables one at a time until some stopping rule is satisfied. Forward selection starts with an empty model. The first variable placed in the model is the one that has the smallest p-value when it is the only predictor in the regression equation. Each subsequent step adds the variable that has the smallest p-value in the presence of the predictors already in the equation. Variables are added one at a time as long as their p-values are small enough, typically less than 0.05 or 0.10. Backward elimination starts with all of the predictors in the model. The variable that is least significant (that is, the one with the largest p-value) is removed and the model is refitted. Each subsequent step removes the least significant variable in the model until all remaining variables have individual p-values smaller than some defined value, such as 0.05 or 0.10. Stepwise regression is similar to forward selection except that variables are removed from the model if they become nonsignificant as other predictors are added.
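The forward selection step described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the procedure run in R for the paper: it covers forward selection only (no removal step), and it replaces the p-value stopping rule with a partial F-to-enter threshold. Within one step all candidates share the same residual degrees of freedom, so the candidate with the largest F is also the one with the smallest p-value; the default threshold of 4.0 is only a rough stand-in for a 0.05 level. The names `rss` and `forward_select` are hypothetical, and predictors are assumed not to be collinear.

```python
def rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[r] for col in columns] for r in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, solved by Gaussian elimination.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         for i in range(k)]
    c = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        c[i], c[piv] = c[piv], c[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for j in range(i, k):
                A[r][j] -= f * A[i][j]
            c[r] -= f * c[i]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return sum((y[r] - sum(X[r][j] * b[j] for j in range(k))) ** 2
               for r in range(n))

def forward_select(predictors, y, f_to_enter=4.0):
    """Add predictors one at a time while the best partial F exceeds f_to_enter."""
    chosen, current = [], rss([], y)
    while len(chosen) < len(predictors):
        df = len(y) - len(chosen) - 2          # residual df after adding one
        if df <= 0:
            break
        trials = {name: rss([predictors[m] for m in chosen + [name]], y)
                  for name in predictors if name not in chosen}
        best = min(trials, key=trials.get)     # largest drop in RSS
        f_stat = (current - trials[best]) / max(trials[best] / df, 1e-12)
        if f_stat < f_to_enter:
            break
        chosen.append(best)
        current = trials[best]
    return chosen

# Hypothetical data: the response depends on "PC1" only.
noise = [0.1, 0.1, -0.1, -0.1, -0.1, -0.1, 0.1, 0.1]
pcs = {"PC1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
       "PC2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]}
response = [2.0 * x + e for x, e in zip(pcs["PC1"], noise)]
selected = forward_select(pcs, response)
```

A full stepwise procedure would add, after each entry, a check that removes any previously entered predictor whose contribution has become nonsignificant.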

After stepwise regression, a power analysis was performed on each final linear model. The software developed by Dunlap et al. (2004), available at http://www.tulane.edu/~dunlap/dunlap.html, was used for power calculation (with alpha level equal to 0.05 and number of predictors as obtained by the stepwise analysis).

To compare the list of genes selected through the PCA scores with the results obtained by the original analyses of Campanaro et al. (2002) and Millino et al. (2006), we applied the SAM test (Tusher et al., 2001) to identify marker genes between MDC and LGMD2B and between MDC1 and MDC2.

All statistical analyses were performed with R software (http://www.r-project.org). Nonlinear PCA was performed with the homals R package.

Results

LGMD2B and MDC datasets

Background. LGMD2B is caused by mutations in the human dysferlin gene and usually affects the proximal muscles of the arms at an early stage. The dysferlin gene product is a membrane-associated protein whose function is still under investigation (Han and Campbell, 2007).

The MDC clinical phenotype is characterized by neonatal hypotonia and by histological changes in skeletal muscle tissues with many features of a true dystrophic process. The most common form in the Caucasian population is MDC1A, caused by mutations in the laminin α-2 chain (LAMA2).

Clinical traits of the cohort of patients described in these studies are summarized in Table 1. For details on experimental plans and protocols see Supplementary Appendix 1.

Nonlinear PCA on clinical data. NLPCA shows that a four-component solution explains more than 80% of data variability (Supplementary Fig. 1A–B). Table 2 reports the component loadings (correlation coefficients between components and original variables). Bolded values correspond to the variables highly correlated with the components, and thus more relevant for the definition of the factor meaning. Factor 1 can be interpreted as a general measure of the individual biological response to the disease: all the most important biological markers of the pathology are significantly loaded on the component (Table 2). Looking at the signs of the loadings within Factor 1, it is clear that the trait "low level of Type 1 muscle fibers" (negative correlation with Factor 1) is connected to relatively high values of fetal myosin, macrophages, necrosis, and creatine kinase, pointing to Factor 1 as a global measure of the "general inflammatory and muscle wasting process" linked to the disease. We will further refer to Factor 1 as "biological effect: inflammatory process."

Since PCs are independent by construction, we can consider Factor 2 as an aspect of the pathology distinct from Factor 1, pointing to separate features of the patients' clinical description. Table 2 clearly shows that the leading variable of Factor 2 can be considered a proxy of disease progression. Low values of the trait "age at biopsy" are related to high values of the severity score and to a prevalence of MDC pathology. This is in agreement with the established greater severity of MDC (Millino et al., 2006).

Sex and familiarity have a lower impact on the general clinical description of these dystrophies, as shown by the lower percentage of variability explained by Factors 3 and 4, which are associated with sex and familiarity, respectively.

The two dystrophies are well separated in the Factor 1–Factor 2 plane, as shown in Figure 1A (lower triangle). Factor 1 ("inflammatory processes") supports the more marked "biological effects" of LGMD2B (right side of Factor 1) and of MDC patients I and G. On the other hand, Factor 2 ("general severity score") separates MDC patients E, G, C, and I (high level of Factor 2) from the other MDC patients, as well as LGMD2B patient 2 from the other LGMD2B patients. Factors 3 and 4, instead, do not clearly separate groups of patients.

PCA on gene expression data. The first two components explain approximately 60% of the total variability (Supplementary Fig. 1C–D), and after the fourth component the residual variability appears to be negligible.

According to the loading levels (see Supplementary Table 1 and the upper triangle of Fig. 1), it is possible to separate LGMD2B from MDC, and MDC into two separate groups (MDC1 and MDC2), in agreement with Millino et al. (2006). MDC1 contains four LAMA2-positive patients (A, B, D, and F), whereas MDC2 includes both LAMA2-positive (C and E) and LAMA2-deficient (G, I, H, and L) patients. Surprisingly, the two MDC groups only partially correspond to the clinical classification (positive and negative for LAMA2), reflecting instead the mild and the severe dystrophic phenotypes. On the other hand, the two LGMD2B groups (patients 4 and 7 vs. patients 1, 2, and 5) are characterized by different severity of the muscle disorder (Campanaro et al., 2001).

Supplementary Table 1 and Figure 1 show that the first two components allow an almost perfect discrimination of MDC from LGMD2B and of MDC1 from MDC2. This implies that the LGMD2B/MDC (by PC1) and MDC1/MDC2 (by PC2) separations are the two most important order parameters shaping gene expression data.

Correlation between clinical data and gene expression. Using the information carried by principal components, we are able to define linear models that predict the clinical factors well. We used a stepwise regression analysis where Factor i, with i = 1, . . . , 3, represents the dependent variable, while the first 10 PCs are initially tested as potential independent variables. The model becomes:

$$\mathrm{Factor}_{ji} = \alpha_{ji} + \sum_{k=1}^{p} \beta_{kji}\, l(\mathrm{PC}_k)_j + \varepsilon_{ji},$$

(with i = 1, . . . , 3 and j = 1, . . . , n) where n is the total number of patients (n = 15), l(PC_k)_j is the jth value of the loading vector of the kth component, p is the optimal number of components selected by stepwise regression, and ε is an error term.
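For the simplest case of the model above, p = 1 (one PC retained for a given factor), the fit reduces to ordinary simple regression of the clinical factor scores on the patients' loadings. The sketch below is a minimal Python illustration; the function name `fit_factor_on_loading` and all numeric values are hypothetical and are not taken from the paper.

```python
from statistics import mean

def fit_factor_on_loading(loadings, factor):
    """Least-squares fit factor_j = a + b * loading_j; returns (a, b, R)."""
    mx, my = mean(loadings), mean(factor)
    sxx = sum((x - mx) ** 2 for x in loadings)
    syy = sum((y - my) ** 2 for y in factor)
    sxy = sum((x - mx) * (y - my) for x, y in zip(loadings, factor))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = abs(sxy) / (sxx * syy) ** 0.5   # with one predictor, multiple R = |r|
    return intercept, slope, r

# Hypothetical values for five patients (not taken from the paper).
pc1_loadings = [0.9, 0.7, 0.1, -0.4, -0.8]
factor1_scores = [1.2, 0.8, 0.1, -0.6, -1.1]
a, b, r = fit_factor_on_loading(pc1_loadings, factor1_scores)
```

With more than one retained component the same idea applies, with the slope replaced by one coefficient per selected PC.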

TABLE 2. FACTOR LOADINGS OBTAINED THROUGH NLPCA ON LGMD2B + MDC AND FSHD CLINICAL DATA AFTER NUMERICAL CATEGORIZATION (FOR MORE DETAILS SEE APPENDIX 1)

                            Factor loadings
Orig. variables             1        2        3        4

LGMD2B + MDC
Sex                        −0.06    −0.17     0.18     0.05
Family history             −0.02    −0.17     0.02    −0.22
Age                         0.13    −0.20    −0.14    −0.05
Myosin                      0.22     0.10     0.12    −0.05
Type1                      −0.20    −0.08     0.12    −0.03
Macro                       0.21     0.07     0.14    −0.11
Necrosis                    0.23    −0.05     0.12     0.08
Ck                          0.26    −0.03     0.00     0.08
Proteine level             −0.25     0.05    −0.02    −0.02
Pathology type             −0.23     0.13     0.12     0.00
Pathology score             0.10     0.20    −0.06    −0.10

FSHD
Age                        −0.29    −0.14    −0.12     0.04
Sex                         0.17    −0.29    −0.01    −0.06
Pathology score            −0.02     0.00     0.34    −0.07
Inflammation                0.29     0.06     0.10     0.08
Regeneration                0.07    −0.33     0.07     0.04
Type 1 fibers               0.34     0.02    −0.04     0.07
Type 2 fibers              −0.34    −0.02     0.04    −0.07
Average fiber diameter     −0.27    −0.02     0.13     0.18

In bold are reported the highest loadings (absolute value) per factor.

Stepwise regression results (Table 3) and correlation coefficients among clinical factors and molecular PCs (Supplementary Table 2) underline that Factor 1 is well predicted by PC 1, Factor 2 by PC 9, and Factor 3 by PC 3. The regression model for Factor 1 is the best model in terms of the determination coefficient R (0.85) and statistical power (0.99). The other models, though characterized by highly significant regression coefficients (t test, p-value < 0.05), show lower levels of the determination coefficient R and of statistical power. Panels A, B, and C of Figure 2 compare fitted and observed values of the above models.

It is worth noting that PC4 clearly separates patients C and D from all the others (upper triangle of Fig. 1). However, PC4 does not show significant correlation with any of the easily interpretable clinical factors (neither through correlation analysis nor with stepwise regression). A possible interpretation is that PC4 reflects a structure of molecular similarity between patients C and D that unfortunately cannot be explained through the available clinical variables. In this case, therefore, we are unable to suggest a biological interpretation of this result.

We used the same approach to predict the pathology severity score (PSS). The results show that PSS can be well predicted using only PC 2 (p-value < 0.01), with R = 0.64 and power = 0.78. Figure 2D shows a good correlation between fitted and observed pathology severity scores, apart from MDC patient L and LGMD2B patient 4.

Selection of relevant genes linked to different clinical traits. Gene scores obtained by PCA can be used to select relevant genes as "representative" of each component. As a rule, we decided to select genes with a standardized absolute score greater than the 5% quantile (|q0.05| = 1.6) of each distribution. In this way, 275 genes were selected for PC 1, 235 for PC 2, 226 for PC 3, 243 for PC 4, 260 for PC 5, 272 for PC 6, and 275 for PC 10. According to previous results, PC 1 and PC 2 should be associated with the general inflammatory processes that typically accompany the response to muscle wasting. PC 1 is enriched in a large and significant cluster of genes linked to protein synthesis (e.g., EEF2, EIF2B1, EEF1A1, H3F3B, AKT1) and ribosome structure (33 ribosomal proteins), to ATP production (ATP5D, ATP5O, ATP5L, ATP2A1, ANT3), and to immune response (HLA-1, HLA-2, CFD, C1R, IFITM1, IFITM2). On the other hand, PC 2 is characterized by a relevant cluster of genes involved in the regulation of muscle contraction (TNNI1, TNNT3, TTN, MYL1, MYL2, MYH7, MYOM1, MYOM2), in sarcomere development (e.g., MYBPC1, MYBPC2, MYOZ2) and, interestingly, by a cluster of genes with a role in motor neuron development and differentiation (FEZ2, UBA52, APBB1, MTCH1, CNP). This last result is in agreement with the correlation of PC 2 with the pathology severity score defined by histological analysis of dystrophic muscle sections (Campanaro et al., 2002).
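The selection rule stated above (standardized absolute score greater than 1.6) can be sketched as follows. This is a minimal Python illustration that assumes scores are standardized with the population standard deviation; the function name `select_genes` and the gene labels are hypothetical.

```python
from statistics import mean, pstdev

def select_genes(scores, threshold=1.6):
    """Return the genes whose standardized component score exceeds the
    threshold in absolute value. `scores` maps gene name -> component score."""
    mu = mean(scores.values())
    sd = pstdev(scores.values())
    return sorted(g for g, s in scores.items() if abs((s - mu) / sd) > threshold)

# Hypothetical scores on one component: eight flat genes and two extremes.
toy_scores = {f"gene{i}": 0.0 for i in range(8)}
toy_scores["up"] = 10.0
toy_scores["down"] = -10.0
representative = select_genes(toy_scores)
```

Genes with large positive and large negative scores are both retained, since the rule is applied to the absolute value.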

PC 9 is highly enriched in transcripts involved in the mTOR signalling pathway (PDK2, AKT1, EIF4EBP1, EIF3S10, TSC1, EIF4G1), which is known to lead to an atrophic process in response to oxidative stress, catabolism, and ATP production. These results are consistent with the interpretation of Factor 2 as “general severity of the pathology.”

PC 3 is enriched in genes involved in muscle development (CAV3, MYOD1, ANKRD2, NEB, TTN), calcium binding (S100A1, S100A4, S100A6, S100A13, MRLC2, LAMB2, SLC26A6), and positive cell-cycle regulation (CCND3, CCNI, IGFBP7, TERF2).

FIG. 1. (A) Pairwise scatterplots among the first four factors derived from the LGMD2B + MDC clinical dataset (lower triangle, dashed) and the first four PCs derived from the LGMD2B + MDC molecular data (upper triangle, dotted). Empty circles represent LGMD2B patients, while filled circles represent MDC patients. (B) Pairwise scatterplots among the first four factors derived from the FSHD clinical dataset (lower triangle, dashed) and the first four PCs derived from the FSHD molecular data (upper triangle, dotted). Numbers represent the patients' D4Z4 fragment lengths.

To compare the results of our methodology with those obtained with standard statistical techniques for the analysis of microarray data, we applied a two-class SAM test to the original datasets (LGMD2B vs. MDC), obtaining a list of 173 differentially expressed genes (103 upregulated and 70 downregulated in LGMD2B with respect to MDC). One hundred ten of these (110/173 = 64%) overlap with those selected by the highest scores in PC 1 and PC 2. In particular, 46 genes (66%) with negative PC scores among the 70 downregulated represent genes downregulated in LGMD2B, and 65 (63%) with positive PC scores among the 103 upregulated represent genes upregulated in LGMD2B. Because PC 2 appears to be responsible for the separation between MDC1 and the group MDC2 + LGMD2B (Fig. 2B), we further applied a two-class SAM test to compare MDC1 versus MDC2 + LGMD2B. The result is a list of 156 differentially expressed genes (92 upregulated and 64 downregulated in MDC2 + LGMD2B with respect to MDC1). One hundred thirty-four of these (86%) overlap with those found by selecting on PC 1, PC 2, PC 5, PC 6, and PC 10 scores. In particular, 57 genes (89%) with negative PC scores among the 64 downregulated represent genes downregulated in MDC2 + LGMD2B with respect to MDC1, and 77 (85%) with positive PC scores among the 90 upregulated are upregulated in MDC2 + LGMD2B with respect to MDC1.
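The overlap percentages quoted above are simple set ratios. The sketch below shows one way such a comparison could be computed, assuming that both gene lists are available as sets of identifiers; the function name and the toy identifiers are hypothetical.

```python
def overlap_fraction(reference, selected):
    """Fraction of the `reference` gene list that also appears in `selected`."""
    return len(reference & selected) / len(reference)


if __name__ == "__main__":
    # Toy gene lists, for illustration only.
    sam_genes = {"a", "b", "c", "d", "e"}
    pc_genes = {"b", "c", "d", "z"}
    print(overlap_fraction(sam_genes, pc_genes))

    # Arithmetic behind the reported percentages: 110 of 173 and 134 of 156.
    print(round(100 * 110 / 173), round(100 * 134 / 156))
```

The denominator is the SAM list, so the ratio reads as "the share of SAM-detected genes recovered by the PC-based selection", which is how the percentages are reported in the text.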

FSHD dataset

Background. The major locus associated with FSHD is the D4Z4 low-copy GC-rich repeat, consisting of a perfect array of 3.3-kb KpnI units (U). This polymorphic array varies between 11 and 150 U in the general population, whereas in affected individuals it is reduced to 1–10 U (Celegato et al., 2006). There is an inverse relationship among the residual repeat size, the severity, and the age at onset of FSHD. The 11 patients of the examined study had a number of D4Z4 repeats ranging from 10 to 27 (Table 1). Given the association between pathology severity and the size of the repeat, we decided to exclude this variable from the PCA and to use it in a further validation step. For details on experimental plans and protocols see Supplementary Appendix 1.

Nonlinear PCA on clinical data. Three components explain about 90% of the total variability (Supplementary Fig. 1E–F). Factor 1 is characterized by clinical variables associated with a general muscle “inflammatory response” (Table 2). As expected, inflammation and the percentage of type 1 fibers show an inverse correlation with the percentage of type 2 fibers, fiber diameter, and patient age. Factor 2 is associated with the “regeneration process.” Factor 3 seems to be connected only to the pathology severity score. Unlike in the previous example, age (linked to Factor 1) and sex (linked to Factor 2) appear to be directly related to the process of muscle response to the pathology.

Interestingly, Factor 1 separates patients carrying a small D4Z4 fragment (10 and 19 repeats, with the exclusion of 16) from all other patients (Fig. 1B, lower triangle). These individuals are affected by a more severe FSHD phenotype; our results therefore support the interpretation of Factor 1 as “response to muscle wasting.” Factor 2 is linked to muscle regeneration and seems to separate one patient with mild FSHD (repeat length of 27 U) and two patients with 10 and 23 U from the others. This cluster, in fact, is composed of patients with age at biopsy greater than 50 years, who probably differ from the others in muscle regeneration rate.

TABLE 3. ESTIMATED REGRESSION COEFFICIENTS, DETERMINANT COEFFICIENT R2, AND REGRESSION POWER OF THE LINEAR MODELS OBTAINED BY STEPWISE REGRESSION ANALYSES WHERE FACTOR 1, 2, 3, AND VARIABLE “PATHOLOGY SEVERITY SCORE” FOR LGMD2B + MDC DATA AND FACTOR 1, 2, 3, AND VARIABLE “FRAGMENT LENGTH” FOR FSHD DATA REPRESENT DEPENDENT VARIABLES WHILE PCS REPRESENT COVARIATES

LGMD2B + MDC

Dependent variable | Term | Param | Se | Sig.
Factor 1 | Constant | 0.12 | 0.02 | 0.0003
Factor 1 | PC1 | 0.54 | 0.09 | 0.0001
Factor 2 | Constant | 0.008 | 0.02 | 0.66
Factor 2 | PC9 | 0.16 | 0.07 | 0.04
Factor 3 | Constant | 0.001 | 0.02 | 0.97
Factor 3 | PC3 | 0.15 | 0.07 | 0.045
Pathology severity score | Constant | 2.48 | 0.25 | 0.0001
Pathology severity score | PC2 | 2.88 | 0.96 | 0.010

Dependent variable | R | Power
Factor 1 | 0.85 | 0.99
Factor 2 | 0.54 | 0.60
Factor 3 | 0.51 | 0.52
Pathology severity score | 0.64 | 0.78

FSHD

Dependent variable | Term | Param | Se | Sig.
Factor 1 | Constant | 0.029 | 0.03 | 0.36
Factor 1 | PC3 | -0.24 | 0.09 | 0.037
Factor 2 | Constant | -0.16 | 0.02 | 0.55
Factor 2 | PC1 | 0.2 | 0.08 | 0.04
Factor 2 | PC9 | 0.19 | 0.08 | 0.04
Factor 3 | Constant | -0.006 | 0.03 | 0.812
Factor 3 | PC5 | 0.24 | 0.09 | 0.02
Fragment length | Constant | 16.97 | 1.13 | 0.0001
Fragment length | PC3 | 19.44 | 4.35 | 0.002

Dependent variable | R | Power
Factor 1 | 0.63 | 0.60
Factor 2 | 0.80 | 0.81
Factor 3 | 0.70 | 0.73
Fragment length | 0.83 | 0.94

PCs from PC1 to PC10 that are not listed for a model have no coefficient reported.

PCA on gene expression data. The first four PCs explain approximately 60% of the total variability, and a marked elbow point can be identified at the third component (Supplementary Fig. 1G–H). The distribution of patients obtained by PCA (Fig. 1B and Supplementary Table 1) is in agreement with that obtained by Celegato et al. (2006) through cluster analysis. PC 1 separates patients with age at biopsy greater than 50 years (fragment lengths of 27 and 23 U). Furthermore, PC 2 separates patients with D4Z4 fragments of 26, 23, and 21 U, while PC 3 separates patients with fragments of 10 and 16 U from the others.
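Statements such as "the first four PCs explain approximately 60% of the total variability" rest on the cumulative share of variance carried by the leading eigenvalues. The Python sketch below shows that bookkeeping under the assumption that the eigenvalues of the PCA are already available, sorted in decreasing order; the eigenvalues used in the example are hypothetical and are not those of the FSHD dataset.

```python
from itertools import accumulate


def cumulative_explained(eigenvalues):
    """Cumulative fraction of the total variance explained by successive PCs."""
    total = sum(eigenvalues)
    return [running / total for running in accumulate(eigenvalues)]


def components_needed(eigenvalues, target):
    """Smallest number of leading PCs whose cumulative fraction reaches `target`."""
    for k, fraction in enumerate(cumulative_explained(eigenvalues), start=1):
        if fraction >= target:
            return k
    return len(eigenvalues)


if __name__ == "__main__":
    # Hypothetical eigenvalues, sorted in decreasing order.
    eig = [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]
    print(cumulative_explained(eig))
    print(components_needed(eig, 0.75))
```

An elbow point, as mentioned in the text, corresponds to the component after which successive eigenvalues stop dropping sharply; it is usually read off a scree plot rather than from a fixed threshold.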

Correlation between clinical data and gene expression. Stepwise regression (Table 3) and correlation analyses among clinical factors and PCs (Supplementary Table 3) show that Factor 1 is well predicted by PC 3, Factor 2 by PC 1 plus PC 9, and Factor 3 only by PC 5. Furthermore, “fragment length” seems to be well predicted by PC 3. These linear models are characterized by high statistical power, except for the first one (power of 60%). In Figure 2E–I we compare the predicted and observed values of the factors as obtained by the linear models.
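The correlation analysis between a clinical factor and a PC reduces to a Pearson correlation computed across patients. The following Python sketch is a plain implementation of that coefficient; the per-patient values in the example are hypothetical. For a linear model with a single covariate, the multiple correlation R equals the absolute value of this coefficient.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-patient values of one clinical factor and one PC.
    factor = [0.2, -0.1, 0.4, -0.3, 0.0]
    pc = [0.5, -0.2, 0.9, -0.8, 0.1]
    print(round(pearson(factor, pc), 3))
```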

Selection of relevant genes linked to different clinical traits. As described for the previous dataset, the 5% quantile (|q0.05| = 1.3) of the standardized PC scores was used as the rule for gene selection. In this way 339 genes were selected for PC 1, 545 for PC 3, 724 for PC 4, 783 for PC5, 260 for PC5, and 847 for PC 9. Factor 1 (“inflammatory process”) is highly correlated with PC 3, which is enriched in genes involved in immune response (CTSG, CD59, CFB, HLA-DPB1), response to stress (CSDA, HSPB1, MAPK12, IRAK1), and regulation of apoptosis (BAG1, BNIP1, NOL3, NGFRAP1). On the other hand, Factor 2 (“regeneration process”) is highly correlated with PC 1, which shows an enrichment of genes involved in muscle cell development and differentiation (ACTA1, UBA52, FHL1, FHL3, MEF2C) and nervous system development (NAGLU, PTN, UBE3A). PC 5, highly correlated with Factor 3 (associated with the pathology severity score), is enriched in genes involved in striated muscle structure (MYH7, MYBPC1, MYBPC3, MYH1, MYH11, TPM3, TNNT3), regulation of muscle contraction (MYL2, TTN, TNNT1, MYBPC3, TPM3, S100B), transcriptional regulation (FHL2, NCOA3, NCOA4, NFATC4, NCOR1, SMARCA4), and neuron development (APOE, CNP, RTN4).

Discussion

Muscular dystrophies are heterogeneous disorders characterized by progressive degenerative changes in skeletal muscle fibers (such as variation in fiber size, muscle fiber necrosis, proliferation of connective tissues, activation of apoptotic pathways, etc.). Despite the knowledge of the primary genetic defects, the molecular pathways leading to muscle cell degeneration are poorly understood. Clinical and molecular data of a given pathology are different levels of the same biological problem. We think that a better comprehension of a pathological phenotype can be obtained only if both types of data are combined and integrated, in order to dissect a complex picture into its single components.


FIG. 2. LGMD2B + MDC dataset. Scatterplots of observed and fitted values for the first (A), the second (B), and the third (C) factor of clinical data and for the pathology severity score (D), obtained through the stepwise regression analysis. Black circles represent MDC patients; gray squares represent LGMD2B patients. FSHD dataset. Scatterplots of observed and fitted values for the first (E), the second (F), and the third (G) factor of clinical data and for the D4Z4 fragment length (H), obtained through the stepwise regression analysis. Numbers represent the patients' fragment lengths.

Our approach has been tested on two different gene expression datasets. The first is derived from the integration of gene expression data obtained from muscles affected by congenital muscular dystrophy (MDC) and limb-girdle muscular dystrophy type 2B (LGMD2B). The second dataset is derived from a gene expression study on facioscapulohumeral dystrophy (FSHD).

In both clinical datasets we found an interesting separation between biological effects such as the inflammatory muscle response to the disease and the severity of the disease progression. These factors correlate well only with some of the principal components identified in the corresponding gene expression datasets. This correlation allows the identification of genes involved separately in the above factors. A careful investigation of the highly scored features reveals groups of genes that are involved in muscle inflammation and groups of genes responsible for the progression of disease severity. Most of these genes are known to be markers of the pathology, validating our approach; however, we also detect further genes involved in the muscle pathology, which help to complete the picture of transcriptomic deregulation in the pathologies studied. Compared with a general supervised inferential approach (with a statistical test such as SAM), our methodology better characterized different physiopathological aspects of the dystrophies, aspects that cannot be separately identified by a classical supervised approach.

Conclusions

In this work we show that an unsupervised approach joined with a correlation analysis can be very useful for the separation of different clinical traits of muscular dystrophies. At odds with the great majority of the medical applications of microarray data, the link between clinical and molecular spaces is obtained without any prefiltering of the data: both the clinical and the genetic components are obtained by unsupervised approaches, and the correlation between the two spaces is applied only at the end of the dimensionality reduction procedures, avoiding any possible overfitting bias. With the proposed methodology we were able to select groups of genes that are representative of the principal components and that could at the same time be responsible for different aspects of the pathological phenotypes. We think that this approach can be of great help, especially when the biological processes involved in the pathogenesis of diseases are not completely understood.

Acknowledgments

The work is partially supported by the grant “Identification of the autoantibody signature in inflammatory myopathies” from Association Française contre les Myopathies (AFM), call 2008, to G.L., and by the grant “A computational approach to the study of skeletal muscle genomic expression in health and disease,” supported by Fondazione CARIPARO, to C.R. Furthermore, we gratefully acknowledge the financial support of the Italian Ministry of University and Scientific Research (PRIN 2006), the Program Azione Biotech 2 (2005-2006 Veneto Region), and the University of Padova (CPDA075919 to C.R.).

Authors Disclosure Statement

The authors declare that there are no conflicts of interest.

References

Baty, F., Bihl, M.P., Perriere, G., Culhane, A.C., and Brutsche, M.H. (2005). Optimized between-group classification: a new jackknife-based gene selection procedure for genome-wide expression data. BMC Bioinformatics 28, 239.

Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., and Yakhini, Z. (2000). Tissue classification with gene expression profiles. J Comput Biol 7, 559–583.

Campanaro, S., Romualdi, C., Fanin, M., Celegato, B., Pacchioni, B., Trevisan, S., et al. (2002). Gene expression profiling in dysferlinopathies using a dedicated muscle microarray. Hum Mol Genet 11, 3283–3298.

Celegato, B., Capitanio, D., Pescatori, M., Romualdi, C., Pacchioni, B., Cagnin, S., et al. (2006). Parallel protein and transcript profiles of FSHD patient muscles correlate to the D4Z4 arrangement and reveal a common impairment of slow to fast fibre differentiation and a general deregulation of MyoD-dependent genes. Proteomics 6, 5303–5321.

De Leeuw, J., and Van Rijckevorsel, J. (1980). Homals and princals. Some generalizations of principal components analysis. In Data Analysis and Informatics II, Diday et al., eds. (North Holland, Amsterdam), pp. 231–242.

Dennis, G., Sherman, B.T., Hosack, D.A., Yang, J., Gao, W., Lane, H.C., et al. (2003). DAVID: database for annotation, visualization and integrated discovery. Genome Biol 4, P3.

Dudoit, S., Fridlyand, J., and Speed, T. (2000). Comparison of discrimination methods for the classification of tumours by using gene expression data. J Am Stat Assoc 97, 77–87.

Dunlap, W.P., Xin, X., and Myers, L. (2004). Computing aspects of power for multiple regression. Behav Res Methods, Instrum Comput 36, 695–701.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression pattern. Proc Natl Acad Sci USA 95, 14863–14868.

Gifi, A. (1990). Non-linear Multivariate Analysis. (Wiley & Sons, Chichester).

Guan, Z., and Zhao, H. (2005). A semiparametric approach for marker gene selection based on gene expression data. Bioinformatics 21, 529–536.

Gui, J., and Li, H. (2005). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 21, 3001–3008.

Han, R., and Campbell, K.P. (2007). Dysferlin and muscle membrane repair. Curr Opin Cell Biol 19, 409–416.

Hosack, D.A., Dennis, G. Jr., Sherman, B.T., Lane, H.C., and Lempicki, R.A. (2003). Identifying biological themes within lists of genes with EASE. Genome Biol 4, P4.

Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7, 673–679.

Li, H., and Gui, J. (2004). Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics 20(Suppl. 1), I208–I215.

Li, L., Weinberg, C.R., Darden, T.A., and Pedersen, L.G. (2001). Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17, 1131–1142.

Millino, C., Bellin, M., Fanin, M., Romualdi, C., Pegoraro, E., Angelini, C., et al. (2006). Expression profiling characterization of laminin alpha-2 positive MDC. Biochem Biophys Res Commun 350, 345–351.

Nguyen, D.V., and Rocke, D.M. (2002). Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 18, 1625–1632.

Pan, W., Lin, J., and Le, C.T. (2002). Model-based cluster analysis of microarray gene-expression data. Genome Biol 3, RESEARCH0009.

Ramoni, M.F., Sebastiani, P., and Kohane, I.S. (2002). Cluster analysis of gene expression dynamics. Proc Natl Acad Sci USA 99, 9121–9126.

Romualdi, C., Campanaro, S., Campagna, D., Celegato, B., Cannata, N., Toppo, S., et al. (2003). Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Hum Mol Genet 12, 823–836.

Romualdi, C., Vitulo, N., Del Favero, M., and Lanfranchi, G. (2005). MIDAW: a web tool for statistical analysis of microarray data. Nucleic Acids Res 33, W644–W649.

Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., et al. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 24, 227–235.

Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.

Shipp, M.A., Ross, K.N., Tamayo, P., Weng, A.P., Kutok, J.L., Aguiar, R.C., et al. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 8, 68–74.

Storey, J.D. (2002). A direct approach to false discovery rates. J R Stat Soc B 64, 479–498.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., et al. (1999). Interpreting patterns of gene expression with self-organizing maps, methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96, 2907–2912.

Tan, Y., Shi, L., Tong, W., and Wang, C. (2005). Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Res 33, 56–65.

Tang, E.K., Suganthan, P.N., and Yao, X. (2006). Gene selection algorithms for microarray data based on least squares support vector machine. BMC Bioinformatics 27, 95.

Teschendorff, A.E., Wang, Y., Barbosa-Morais, N.L., Brenton, J.D., and Caldas, C. (2005). A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics 21, 3025–3033.

Topliss, J.G., and Edwards, R.P. (1979). Chance factors in studies of quantitative structure-activity relationships. J Med Chem 22, 1238–1244.

Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98, 5116–5121.

Van Houwelingen, H.C., Bruinsma, T., Hart, A.A., Van’t Veer, L.J., and Wessels, L.F. (2006). Cross-validated Cox regression on microarray gene expression data. Stat Med 25, 3201–3216.

Van’t Veer, L.J., Dai, H., Van de Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.

Wang, A., and Gehan, E.A. (2005). Gene selection for microarray data analysis using principal component analysis. Stat Med 24, 2069–2087.

Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., et al. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30, e15.

Address reprint requests to:
Chiara Romualdi, Ph.D.
CRIBI Biotechnology Centre and Dipartimento di Biologia
Università degli Studi di Padova
Via Ugo Bassi 58/B, 35121 Padova, Italy

E-mail: [email protected]


Appendix 1

Experimental Details as Described in the Original Papers

MDC dataset (Millino et al., 2006)

DNA analysis of MDC patients. Patients A, B, C, D, E, and F were investigated for the most common genetic defects involved in MDC. Analysis of both the LAMA2 and the Fukutin-related protein (FKRP) genes, performed by SSCP and DHPLC, did not reveal any change in the patients' DNA. Patient G showed a single nucleotide deletion causing a premature termination at nucleotide position 7484–7485 (2500X) of the LAMA-2 coding sequence; the other allele was not mutated. Patient H carries two nonsense mutations: C–T at nucleotide position 4694 (R1549X) and C–T at nucleotide position 7196 (R2383X) of the LAMA-2 coding sequence. Patient I showed one nonsense mutation (C–T) at nucleotide position 7196 (R2382X) of the LAMA-2 coding sequence; the other allele was not mutated. Patient L revealed a two-base pair deletion at nucleotide position 8075–8076 (8075–8076delGT), resulting in a premature stop codon at position 2706 of the LAMA2 mRNA. This deletion is located in exon 56 of the LAMA2 gene.

Microarray preparation

Total RNA was purified according to the Trizol standard protocol. RNA was retro-transcribed and labeled using the MICROMAX TSA labeling kit (Perkin-Elmer, Wellesley, MA). The labeled cDNA of patients was used in competitive hybridization with cDNA from normal muscle. Patient and control biopsies were both taken from the quadriceps femoralis muscle. Two replicates of each experiment were done.

LGMD2B Dataset (Campanaro et al., 2002)

Clinical data

The present study involved eight Italian patients (three females and five males), including two pairs of siblings (1, 5 and 7, 8), one of which was born to consanguineous parents. Seven patients presented with distal Miyoshi myopathy and one had an LGMD phenotype; they had disease onset between 11 and 33 years of age (mean 19.5) and underwent muscle biopsy between 19 and 37 years of age (mean 29.5). The time lapse between the age of onset and the muscle biopsy (disease duration) ranged from 0 to 24 years (mean 10). The clinical course or disease progression was intermediate in four cases and rapid in the remaining four, causing the loss of independent ambulation after the age of 35 in the second group. All the biopsies considered in our experiments were obtained from the quadriceps femoralis except for patient 7, for whom the biopsy was obtained from the deltoid (first column).

DNA analysis of LGMD2B patients

Western blot analysis with specific antibodies was used to determine the dysferlin protein content among the patients in this study. Dysferlin protein was completely absent in four cases (patients 1, 4, 5), barely detectable in one case (patient 2, <5% of control), and markedly reduced in one case (10% of control; patient 7). Dysferlin gene mutations have been identified in four cases (patient 1, TG3817-8AA, Y1148X, exon 32, homozygous; patient 4, T4454C, C1361R, exon 38, heterozygous; patient 5, TG3817-8AA, Y1148X, exon 32, homozygous; patient 7, C5358G, T1662R, exon 45, G2234A, G618R, exon 20): one pair of siblings were compound heterozygotes for two missense mutations in exons 20 and 45; another pair of siblings, born to consanguineous parents, were homozygous for a nonsense mutation in exon 32. In another sporadic patient only one mutant allele was identified, with a missense mutation in exon 38. None of these four mutations has been reported previously in other dysferlinopathy patients.


RNA purification and labeling

Frozen patient biopsies were weighed and immediately homogenized for 3–5 min using an ultraturrax-T8 blender (IKA-Werke, Staufen, Germany) in 5 vol of TRIZOL reagent (Invitrogen/Life Technologies, Carlsbad, CA). Total RNA was purified following the TRIZOL standard protocol. A small aliquot of RNA was then used for quantification and quality control using the RNA 6000 LabChip kit and the Agilent Bioanalyzer 2100 (Agilent Technologies, Palo Alto, CA). We routinely obtain a mean quantity of 0.5 μg of RNA per mg of homogenized muscle tissue. RNA was retro-transcribed and labeled using a MICROMAX TSA labeling kit (Perkin-Elmer). Two micrograms of total RNA were used in each reaction, but only half of the labeled cDNA was hybridized to the microarray.

Microarray hybridization

Microarray hybridization was carried out in a dual slide chamber (HybChamber, GeneMachines, San Carlos, CA) humidified using 100 μL of 3× SSC. Labeled cDNA was dissolved in 20 μL of hybridization buffer, denatured at 90°C for 2 min in a thermal cycler, and applied directly to the slides. Microarrays were covered with a 22 × 22 mm coverslip and hybridized overnight at 65°C by immersion in a high-precision water bath (W28, Grant, Cambridge, UK). Posthybridization washing was performed according to the MICROMAX TSA Detection kit (Perkin-Elmer). Two replicates of each experiment were done using different microarray slides, in which the sample and reference RNA, labeled with either Cy3 or Cy5 fluorochromes, were crossed in both combinations.

Pathology severity score codification

The degree of the dystrophic/myopathic process, called the pathology severity score, was based on microscopy inspection of stained muscle sections and defined as: (1) active dystrophic process (marked increase of fiber size variability, active degeneration and regeneration, marked increase of connective tissue); (2) moderate dystrophic process (marked increase of fiber size variability, increased central nuclei, few degenerating and regenerating fibers, slight increase of connective tissue); and (3) mild myopathic picture (moderate increase of fiber size variability, increased central nuclei).

LGMD2B and MDC

Muscle degeneration is evaluated by the presence of necrosis, macrophages (absent or normal −, slightly increased +, moderately increased ++, markedly increased +++), and creatine kinase, while regeneration is evaluated by the presence of fetal myosin. On the basis of the above parameters, the degree of the dystrophic process and the muscle histopathology severity were classified into four different categories: mild myopathic picture, moderate dystrophic process, active dystrophic process, and advanced-stage dystrophic process. The presence of affected relatives (family history, + yes, − no), the type of disorder (MDC and LGMD2B), the presence of the associated protein (dysferlin and laminin alpha 2 for LGMD2B and MDC, respectively), sex (M male, F female), and age at biopsy of each patient were also recorded.

FSHD Dataset (Celegato et al., 2006)

Patient selection

We analyzed 11 FSHD patients carrying different numbers of KpnI repeat units on the pathogenic 4q35 allele (nine males and two females), whose age at biopsy ranged from 8 to 69 years. The investigation was carried out in accordance with the principles outlined in the 1989 declaration of Helsinki after approval of the ethical committees of the collaborating institutions. Muscle biopsies from subjects in whom a muscle disease was excluded by both clinical and histopathological criteria were used as controls. Diagnostic criteria for FSHD followed the guidelines proposed by the European Expert Group on FMD. Neurological examination was performed in all patients by E.R. at the Institute of Neurology of the Catholic University of Rome or at the Center for Neuromuscular Diseases (UILDM Sezione Laziale, Rome). For clinical classification we adopted a scale of clinical severity taking into account the extent of weakness in various muscular regions and the spread of symptoms to pelvic and leg muscles; higher scores were assigned to patients with involvement of pelvic and proximal lower limb muscles.

Tissue collection and storage

After obtaining written consent, 30 mg of muscle tissue was obtained by surgical biopsy from the midportion of the left deltoideus muscle. This muscle was chosen because it is relatively unaffected by the dystrophic process in FSHD. Therefore, differences between healthy and affected individuals would probably reflect the primary defect rather than secondary consequences of muscle degeneration.

Molecular characterization of patients

Lymphocytes were isolated from human peripheral blood and embedded in agarose plugs before DNA extraction. The size of the 4q35-EcoRI fragments was determined by pulsed-field gel electrophoresis (PFGE) and Southern hybridization. To distinguish between the fragments of EcoRI (BlnI resistant) and 10q26 (BlnI sensitive), the DNA was digested with EcoRI/HindIII and EcoRI/BlnI. The Southern blot was hybridized with a p13E-11 probe labeled with P32 dATP.

RNA purification and labeling

Frozen patient biopsies were homogenized for 3–5 min using an ultraturrax-T8 blender (IKA-Werke, Staufen, Germany) in 5 vol of TRIZOL reagent (Invitrogen/Life Technologies). Total RNA was purified using the RNeasy Mini Kit (Qiagen, Hilden, Germany). A 100-ng aliquot of total RNA was used for quantification and quality control using the RNA 6000 LabChip kit and the Agilent Bioanalyzer 2100 (Agilent Technologies, Palo Alto, CA). Linear amplification of mRNA starting from 100 ng of total RNA was carried out using the Message-Amp-aRNA kit (Ambion, Austin, TX) with two consecutive amplification steps according to the manufacturer’s recommendations. Fluorescent cDNA targets were prepared by direct labelling of 1.5 μg of aRNA from each sample, performing a retro-transcription reaction in 30 μL of final volume with 1 μL of Cy3 or Cy5 deoxyribonucleotides (Amersham Pharmacia Biotech, Barcelona, Spain).

Immunohistochemical study of muscle biopsies was performed in this study (Celegato et al., 2006). To characterize the distribution of fiber types, at least 1,000 fibers from eight microscopic fields were analysed in each patient and control muscle section. The mean fiber diameter of each fiber type was determined by computational analysis of at least 200 fibers using the software MCID Basic (Version 7.0, Imaging Research, St. Catharines, ON, Canada).

Clinical severity scale adopted in the FSHD study. 0.5, facial weakness; 1, mild scapular involvement without limitation of arm abduction; no awareness of disease symptoms is possible; 1.5, moderate involvement of scapular and arm muscles or both (arm abduction >60° and strength 3 in arm muscles); no involvement of pelvic and leg muscles; 2, severe scapular involvement (arm abduction <60° on at least one side); strength <3 in at least one muscular district of the arms; no involvement of pelvic and leg muscles; 2.5, tibioperoneal weakness; no weakness of pelvic and proximal leg muscles; 3, mild weakness of pelvic and proximal leg muscles or both (strength 4 in all these muscles); able to stand up from a chair without support; 3.5, moderate weakness of pelvic and proximal leg muscles or both (strength 3 in all these muscles); able to stand up from a chair with monolateral support; 4, severe weakness of pelvic and proximal leg muscles or both (strength <3 in at least one of these muscles); able to stand up from a chair with double support; able to walk unaided; 4.5, unable to stand up from a chair; walking limited to several steps with support; may use wheelchair for most activities; 5, wheelchair bound. Muscle strength was evaluated by using the Manual Muscle Testing Scale.

Statistical Analysis

Stepwise regression

Stepwise regression is a regression model in which the choice of predictive variables is carried out by an automatic procedure. At each step, the independent variable not in the equation that has the smallest probability of F (the observed significance level associated with the statistical test performed on the parameter of the variable) is entered, if that probability is sufficiently small. Variables already in the regression equation are removed if their probability of F becomes sufficiently large. The procedure terminates when no more variables are eligible for inclusion or removal. Stepwise regression is useful to create parsimonious models, avoiding overfitting and loss of power.

We used a stepwise procedure to identify the best set of linear predictor PCs for the factors derived from clinical data. Many multiple regression models contain variables whose t statistics have nonsignificant p-values. These variables do not display statistically significant predictive capability in the presence of the other predictors. The question is whether some variables can be removed from the model, generating a more parsimonious and powerful model. Variable selection procedures include forward selection, backward elimination, and stepwise regression. They add or remove variables one at a time until some stopping rule is satisfied. Forward selection starts with an empty model. The variable that has the smallest p-value when it is the only predictor in the regression equation is placed in the model. Each subsequent step adds the variable that has the smallest p-value in the presence of the predictors already in the equation. Variables are added one at a time as long as their p-values are small enough, typically less than 0.05 or 0.10. Backward elimination starts with all of the predictors in the model. The variable that is least significant (that is, the one with the largest p-value) is removed and the model is refitted. Each subsequent step removes the least significant variable in the model until all remaining variables have individual p-values smaller than some defined value, such as 0.05 or 0.10. Stepwise regression is similar to forward selection except that variables are removed from the model if they become nonsignificant as other predictors are added.
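The forward part of the procedure described above can be sketched in Python for the special case relevant here, where the covariates are PC scores and are therefore centered and mutually orthogonal. Two simplifications are assumptions of this sketch and not the original implementation: entry is decided by a fixed partial F cutoff (the hypothetical parameter `f_to_enter`) instead of a p-value, and the removal step is omitted, because with orthogonal predictors the sum of squares explained by a variable does not change when other variables enter the model.

```python
def forward_select(pcs, y, f_to_enter=4.0):
    """Forward selection for centered, mutually orthogonal predictors.

    pcs: dict mapping a predictor name to its list of scores.
    y:   list of responses, same length as each score list.
    Returns the names of the predictors entered, in order of entry.
    """
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    rss = sum(v * v for v in yc)  # residual SS of the intercept-only model
    # Sum of squares explained by each predictor; with orthogonal predictors
    # this quantity is the same whichever other predictors are in the model.
    explained = {
        name: sum(a * b for a, b in zip(x, yc)) ** 2 / sum(a * a for a in x)
        for name, x in pcs.items()
    }
    selected = []
    # Candidates are visited from the largest to the smallest explained SS,
    # so the first one that fails the test ends the search.
    for name in sorted(explained, key=explained.get, reverse=True):
        df_resid = n - (len(selected) + 1) - 1
        new_rss = rss - explained[name]
        if df_resid <= 0 or new_rss <= 0:
            break
        f_stat = explained[name] / (new_rss / df_resid)  # partial F statistic
        if f_stat < f_to_enter:
            break
        selected.append(name)
        rss = new_rss
    return selected


if __name__ == "__main__":
    # Hypothetical orthogonal, centered score vectors for eight patients.
    pc1 = [1, 1, 1, 1, -1, -1, -1, -1]
    pc2 = [1, 1, -1, -1, 1, 1, -1, -1]
    pc3 = [1, -1, 1, -1, 1, -1, 1, -1]
    # Response built from PC1 and PC2, a negligible PC3 term, and a small
    # disturbance that is orthogonal to all three predictors.
    y = [5 + 2 * a + b + 0.05 * c + 0.3 * a * b * c
         for a, b, c in zip(pc1, pc2, pc3)]
    print(forward_select({"PC1": pc1, "PC2": pc2, "PC3": pc3}, y))
```

In this toy example PC1 and PC2 enter the model and PC3 does not, which mirrors the way only a few PCs are retained for each clinical factor in Table 3. With correlated predictors the explained sums of squares would have to be recomputed at every step, and the removal check described in the text would become necessary.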
