Dimension reduction techniques for the integrative analysis of multi-omics data

Chen Meng*, Oana A. Zeleznik*, Gerhard G. Thallinger, Bernhard Kuster, Amin M. Gholami and Aedín C. Culhane

Corresponding author: Aedín C. Culhane, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02115, USA. Tel.: +1 (617) 632 2468; Fax: +1 (617) 582 7760; E-mail: [email protected]
*These authors contributed equally to this work.

Abstract

State-of-the-art next-generation sequencing, transcriptomics, proteomics and other high-throughput 'omics' technologies enable the efficient generation of large experimental data sets. These data may yield unprecedented knowledge about molecular pathways in cells and their role in disease. Dimension reduction approaches have been widely used in exploratory analysis of single omics data sets. This review will focus on dimension reduction approaches for simultaneous exploratory analyses of multiple data sets. These methods extract the linear relationships that best explain the correlated structure across data sets and the variability both within and between variables (or observations), and may highlight data issues such as batch effects or outliers. We explore dimension reduction techniques as one of the emerging approaches for data integration, and how these can be applied to increase our understanding of biological systems in normal physiological function and disease.

Key words: multivariate analysis; multi-omics data integration; dimension reduction; integrative genomics; exploratory data analysis; multi-assay

Introduction

Technological advances and lower costs have resulted in studies using multiple comprehensive molecular profiling or omics assays on each biological sample. Large national and international consortia, including The Cancer Genome Atlas (TCGA) and The International Cancer Genome Consortium, have profiled thousands of biological samples, assaying multiple different molecular profiles per sample, including mRNA, microRNA, methylation, DNA sequencing and proteomics. These data have the potential to reveal great insights into the mechanisms of disease and to discover novel biomarkers; however, statistical methods for integrative analysis of multi-omics (or multi-assay) data are only emerging.

Chen Meng is a senior graduate student currently completing his thesis entitled 'Application of multivariate methods to the integrative analysis of omics data' in Bernhard Kuster's group at Technische Universität München, Germany.
Oana Zeleznik has recently defended her thesis 'Integrative Analysis of Omics Data: Enhancement of Existing Methods and Development of a Novel Gene Set Enrichment Approach' in Gerhard Thallinger's group at the Graz University of Technology, Austria.
Gerhard Thallinger is a principal investigator at Graz University of Technology, Austria. His research interests include analysis of next-generation sequencing, microbiome and lipidomics data. He supervised Oana Zeleznik's graduate studies.
Bernhard Kuster is Full Professor for Proteomics and Bioanalytics at the Technical University Munich and Co-Director of the Bavarian Biomolecular Mass Spectrometry Center. His research focuses on mass spectrometry-based proteomics and chemical biology. He is supervising the graduate studies of Chen Meng.
Amin M. Gholami is a bioinformatics scientist in the Division of Vaccine Discovery at the La Jolla Institute for Allergy and Immunology. He is working on big data integrative and comparative analysis of multi-omics data. He was previously a bioinformatics group leader at Technische Universität München, where he supervised Chen Meng.
Aedín Culhane is a research scientist in Biostatistics and Computational Biology at Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health. She develops and applies methods for integrative analysis of omics data in cancer. Dr Zeleznik was a visiting student at the Dana-Farber Cancer Institute in Dr Culhane's lab during her PhD. Dr Culhane co-supervises Dr Zeleznik and Mr Meng on several projects.

Submitted: 8 June 2015; Received (in revised form): 26 October 2015

© The Author 2016. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Briefings in Bioinformatics, 17(4), 2016, 628-641. doi: 10.1093/bib/bbv108. Advance Access Publication Date: 11 March 2016.


Exploratory data analysis (EDA) is an important early step in omics data analysis [1]. It summarizes the main characteristics of data and may identify potential issues such as batch effects [2] and outliers. Techniques for EDA include cluster analysis and dimension reduction. Both have been widely applied to transcriptomics data analysis [1], but there are advantages to dimension reduction approaches when integrating multi-assay data. While cluster analysis generally investigates pairwise distances between objects, looking for fine relationships, dimension reduction or latent variable methods consider the global variance of the data set, highlighting general gradients or patterns in the data [3].

Biological data frequently have complex phenotypes and, depending on the subset of variables analyzed, multiple valid clustering classifications may co-exist. Dimension reduction approaches decompose the data into a few new variables (called components) that explain most of the differences in observations. For example, a recent dimension reduction analysis of bladder cancers identified components associated with batch effects and GC content in the RNA sequencing data, in addition to seven components that were specific to tumor cells and three components associated with tumor stroma [4]. By contrast, most clustering approaches are optimized for discovery of discrete clusters, where each observation or variable is assigned to only one cluster. Limitations of clustering were observed when the cluster-of-cluster assignments method was applied to TCGA pan-cancer multi-omics data of 3527 specimens from 12 cancer types [5]. Tumors were assigned to one cluster, and these clusters grouped largely by anatomical origin and failed to identify clusters associated with known cancer pathways [5]. However, a dimension reduction analysis across 10 different cancers identified novel and known cancer-specific pathways, in addition to pathways such as cell cycle, mitochondria, gender, interferon response and immune response that were common among different cancers [4].

Overlapping clusters have been identified in many tumors, including glioblastoma and serous ovarian cancer [6, 7]. Gusenleitner and colleagues [8] found that k-means or hierarchical clustering failed to identify the correct cluster structure in simulated data with multiple overlapping clusters. Clustering methods may also falsely discover clusters in unimodal data. For example, Senbabaoglu et al. [7] applied consensus clustering to randomly generated unimodal data and found it divided the data into apparently stable clusters for a range of K, where K is a predefined number of clusters. However, principal component analysis (PCA) did not identify these clusters.

In this article, we first introduce linear dimension reduction of a single data set, describing the fundamental concepts and terminology that are needed to understand its extensions to multiple matrices. Then we review multivariate dimension reduction approaches that can be applied to the integrative exploratory analysis of multi-omics data. To demonstrate the application of these methods, we apply multiple co-inertia analysis (MCIA) to EDA of mRNA, miRNA and proteomics data of a subset of 60 cell lines studied at the National Cancer Institute (NCI-60).

Introduction to dimension reduction

Dimension reduction methods arose in the early 20th century [9, 10] and have continued to evolve, often independently in multiple fields, giving rise to a myriad of associated terminology. Wikipedia lists over 10 different names for PCA, the most widely used dimension reduction approach. Therefore, we provide a glossary (Table 1) and tables of methods (Tables 2-4) to assist beginners to the field. Each of these is a dimension reduction technique, whether applied to one (Table 2) or multiple (Tables 3 and 4) data sets. We start by introducing the central concepts of dimension reduction.

We denote matrices with boldface uppercase letters. The rows of a matrix contain the observations, while the columns hold the variables. In an omics study, the variables (also referred to as features) generally measure tissue or cell attributes, including the abundance of mRNAs, proteins and metabolites. All vectors are column vectors and are denoted with boldface lowercase letters. Scalars are indicated by italic letters.

Given an omics data set X, which is an n × p matrix of n observations and p variables, it can be represented by

    X = [x_1, x_2, ..., x_p]   (1)

where the x_j are vectors of length n and are measurements of mRNA or other biological variables for n observations (samples). In a typical omics study, p ranges from several hundred to millions; therefore, observations (samples) are represented in large dimensional spaces R^p. The goal of dimension reduction is to identify a (set of) new variable(s) using a linear combination of the original variables, such that the number of new variables is much smaller than p. An example of such a linear combination is shown in Equation (2):

    f = q_1 x_1 + q_2 x_2 + ... + q_p x_p   (2)

or, expressed in matrix form,

    f = Xq   (3)

In Equations (2) and (3), f is a new variable, which is often called a latent variable or a component. Depending on the scientific field, f may also be called a principal axis, eigenvector or latent factor. q = (q_1, q_2, ..., q_p)^T is a p-length vector of scalar coefficients, in which at least one of the coefficients is different from zero. These coefficients are also called 'loadings'. Dimension reduction analysis introduces constraints to obtain a meaningful solution: we find the set of q's that maximize the variance of the components f. In doing so, a smaller number of variables f capture most of the variance in the data. Different optimization and constraint criteria distinguish between different dimension reduction methods. Table 2 provides a nonexhaustive list of these methods, which includes PCA, linear discriminant analysis and factor analysis.

Principal component analysis

PCA is one of the most widely used dimension reduction methods [20]. Given a column-centered and scaled (unit variance) matrix X, PCA finds a set of new variables f_i = Xq_i, where i indexes the ith component and q_i is the variable loading vector for the ith principal component (PC; the index i denotes the component or dimension). The variance of f_i is maximized, that is,

    argmax_{q_i} var(Xq_i)   (4)

with the constraints that ||q_i|| = 1 and each pair of components (f_i, f_j) is orthogonal (or uncorrelated, i.e. f_i^T f_j = 0 for j ≠ i). PCA can be computed using different algorithms, including eigen analysis, latent variable analysis, factor analysis, singular value decomposition (SVD) [21] or linear regression [3].
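As a concrete illustration, the sketch below runs a PCA on a small synthetic expression matrix with the base-R prcomp function; the matrix and its sample and gene names are invented for the example and are not data from this review.

# Minimal sketch: PCA of a synthetic expression matrix (20 samples x 100 genes).
# prcomp() centers/scales the columns and returns scores (PCs) and loadings.
set.seed(1)
X <- matrix(rnorm(20 * 100), nrow = 20,
            dimnames = list(paste0("sample", 1:20), paste0("gene", 1:100)))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

head(pca$x[, 1:3])         # component scores f = Xq for the first three PCs
head(pca$rotation[, 1:3])  # loadings q (linear combination coefficients)
summary(pca)               # standard deviation and proportion of variance per PC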


Among them, SVD is the most widely used approach. Given X, an n × p matrix with rank r (r ≤ min[n, p]), SVD decomposes X into three matrices:

    X = U S Q^T, subject to the constraint that U^T U = Q^T Q = I   (5)

where U is an n × r matrix and Q is a p × r matrix. The columns of U and Q are the orthogonal left and right singular vectors, respectively. S is an r × r diagonal matrix of singular values, which are proportional to the standard deviations associated with the r singular vectors. The singular vectors are ordered such that their associated variances are monotonically decreasing. In a PCA of X, the PCs comprise an n × r matrix F, which is defined as

    F = U S = U S Q^T Q = X Q   (6)

where the columns of F are the PCs and the matrix Q is called the loadings matrix; it contains the linear combination coefficients of the variables for each PC (q in Equation (3)). Therefore, we represent the variance of X in a lower dimension r. The above formula also emphasizes that Q is a matrix that projects the observations in X onto the PCs.
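The correspondence between SVD and PCA in Equations (5) and (6) can be checked directly in R; the following is an illustrative sketch on a synthetic centered and scaled matrix, not code from the paper.

# Minimal sketch: PCA via SVD, X = U S Q^T, with PCs recovered as F = U S = X Q.
set.seed(1)
X  <- matrix(rnorm(20 * 100), nrow = 20)
Xc <- scale(X, center = TRUE, scale = TRUE)

dec     <- svd(Xc)                   # dec$u = U, dec$d = singular values, dec$v = Q
Fscores <- dec$u %*% diag(dec$d)     # F = U S

all.equal(Fscores, Xc %*% dec$v, check.attributes = FALSE)      # F = X Q
all.equal(abs(Fscores[, 1]), abs(prcomp(Xc)$x[, 1]),
          check.attributes = FALSE)                             # matches PC1 up to sign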

Table 1. Glossary

Variance: The variance of a random variable measures the spread (variability) of its realizations (values of the random variable). The variance is always a positive number. If the variance is small, the values of the random variable are close to the mean of the random variable (the spread of the data is low). A high variance is equivalent to widely spread values of the random variable. See [11].

Standard deviation: The standard deviation of a random variable measures the spread (variability) of its realizations. It is defined as the square root of the variance. The standard deviation has the same units as the random variable, in contrast to the variance. See [11].

Covariance: The covariance is an unstandardized measure of the tendency of two random variables to vary together. See [12].

Correlation: The correlation of two random variables is defined as the covariance of the two random variables normalized by the product of their standard deviations. It measures the linear relationship between the two random variables. The correlation coefficient ranges between -1 and +1. See [12].

Inertia: Inertia is a measure of the variability of the data. The inertia of a set of points relative to one point P is defined as the weighted sum of the squared distances between each considered point and the point P. Correspondingly, the inertia of a centered matrix (mean equal to zero) is simply the sum of its squared elements. The inertia of the matrix X defined by the metrics L and D is the weighted sum of its squared values. The inertia is equal to the total variance of X when X is centered, L is the Euclidean metric and D is a diagonal matrix with diagonal elements equal to 1/n. See [13].

Co-inertia: The co-inertia is a global measure of the co-variability of two data sets (for example, two high-dimensional random variables). If the data sets are centered, the co-inertia is the sum of squared covariances. When coupling a pair of data sets, the co-inertia between two matrices X and Y is calculated as trace(X L X^T D Y R Y^T D). See [13].

Orthogonal: Two vectors are called orthogonal if they form an angle of 90 degrees. Generally, two vectors are orthogonal if their inner product is equal to zero. Two orthogonal vectors are always linearly independent. See [12].

Independent: In linear algebra, two vectors are called linearly independent if their linear combination is equal to zero only when all constants of the linear combination are equal to zero. See [14]. In statistics, two random variables are called statistically independent if the distribution of one of them does not affect the distribution of the other. If two independent random variables are added, then the mean of the sum is the sum of the two means; this is also true for the variance. The covariance of two independent variables is equal to zero. See [11].

Eigenvector, eigenvalue: An eigenvector of a matrix is a vector that does not change its direction after a linear transformation. The vector v is an eigenvector of the matrix A if Av = λv; λ is the eigenvalue associated with the eigenvector v, and it reflects the stretch of the eigenvector following the linear transformation. The most popular way to compute eigenvectors and eigenvalues is the SVD. See [14].

Linear combination: A mathematical expression calculated by multiplying variables with constants and adding the individual multiplication results. A linear combination of the variables x and y is ax + by, where a and b are the constants. See [15].

Omics: The study of biological molecules in a comprehensive fashion. Examples of omics data types include genomics, transcriptomics, proteomics, metabolomics and epigenomics [16].

Dimension reduction: Dimension reduction is the mapping of data to a lower dimensional space such that redundant variance in the data is reduced or discarded, enabling a lower-dimensional representation without significant loss of information. See [17].

Exploratory data analysis: EDA is the application of statistical techniques that summarize the main characteristics of data, often with visual methods. In contrast to statistical hypothesis testing (confirmatory data analysis), EDA can help to generate hypotheses. See [18].

Sparse vector: A sparse vector is a vector in which most elements are zero. A sparse loadings matrix in PCA or related methods reduces the number of features contributing to a PC. The variables with nonzero entries are the 'selected features'. See [19].


The sum of the squared values of the columns in U equals 1 (Equation (5)); therefore, the variance of the ith PC, d_i^2, can be calculated as

    d_i^2 = s_i^2 / (n - 1)   (7)

where s_i is the ith diagonal element of S. The variance reflects the amount of information (underlying structure) captured by each PC. The squared correlations between PCs and the original variables are informative and are often illustrated using a correlation circle plot. These can be calculated by

    C = Q D   (8)

where D = diag(d_1, d_2, ..., d_r) is a diagonal matrix of the standard deviations of the PCs.
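Continuing the SVD sketch above, Equations (7) and (8) translate directly into a few lines of R; again this is an illustrative sketch on synthetic data, not the code used in the paper.

# Minimal sketch: PC variances d_i^2 = s_i^2 / (n - 1) and the variable-PC
# correlation matrix C = Q D used for a correlation circle plot.
set.seed(1)
Xc  <- scale(matrix(rnorm(20 * 100), nrow = 20))   # centered, unit-variance columns
dec <- svd(Xc)

n <- nrow(Xc)
d <- dec$d / sqrt(n - 1)        # standard deviations of the PCs
d^2 / sum(d^2)                  # proportion of variance captured by each PC

C <- dec$v %*% diag(d)          # correlations between variables and PCs (Equation 8)
summary(rowSums(C^2))           # sums to 1 over all components, as described above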

In contrast to SVD, which calculates all PCs simultaneously, PCA can also be calculated using the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm.

Table 2. Dimension reduction methods for one data set (Method | Description | R function {package})

PCA | Principal component analysis | prcomp {stats}, princomp {stats}, dudi.pca {ade4}, pca {vegan}, PCA {FactoMineR}, principal {psych}
CA, COA | Correspondence analysis | ca {ca}, CA {FactoMineR}, dudi.coa {ade4}
NSC | Nonsymmetric correspondence analysis | dudi.nsc {ade4}
PCoA, MDS | Principal co-ordinate analysis / multidimensional scaling | cmdscale {stats}, dudi.pco {ade4}, pcoa {ape}
NMF | Nonnegative matrix factorization | nmf {NMF}
nmMDS | Nonmetric multidimensional scaling | metaMDS {vegan}
sPCA, nsPCA, pPCA | Sparse PCA, nonnegative sparse PCA, penalized PCA (PCA with feature selection) | SPC {PMA}, spca {mixOmics}, nsprcomp {nsprcomp}, PMD {PMA}
NIPALS PCA | Nonlinear iterative partial least squares analysis (PCA on data with missing values) | nipals {ade4}, pca {pcaMethods}(a), nipals {mixOmics}
pPCA, bPCA | Probabilistic PCA, Bayesian PCA | pca {pcaMethods}(a)
MCA | Multiple correspondence analysis | dudi.acm {ade4}, mca {MASS}
ICA | Independent component analysis | fastICA {fastICA}
sIPCA | Sparse independent PCA (combines sPCA and ICA) | sipca {mixOmics}, ipca {mixOmics}
Plots | Graphical resources: R packages including scatterplot3d, ggord(b), ggbiplot(c), plotly(d), explor

(a) Available in Bioconductor.
(b) On GitHub: devtools::install_github('fawda123/ggord').
(c) On GitHub: devtools::install_github('ggbiplot', 'vqv').
(d) On GitHub: devtools::install_github('ropensci/plotly').

Table 3. Dimension reduction methods for pairs of data sets (Method | Description | Feature selection | R function {package})

CCA(a) | Canonical correlation analysis; limited to n > p | No | cc {CCA}, CCorA {vegan}
CCA(a) | Canonical correspondence analysis, a constrained correspondence analysis popular in ecology | No | cca {ade4}, cca {vegan}, cancor {stats}
RDA | Redundancy analysis, a constrained PCA; popular in ecology | No | rda {vegan}
Procrustes | Procrustes rotation rotates a matrix to maximum similarity with a target matrix, minimizing the sum of squared differences | No | procrustes {vegan}, procuste {ade4}
rCCA | Regularized canonical correlation analysis | No | rcc {CCA}
sCCA | Sparse CCA | Yes | CCA {PMA}
pCCA | Penalized CCA | Yes | spCCA {spCCA} (supervised version)
WAPLS | Weighted averaging PLS regression | No | WAPLS {rioja}, wapls {paltran}
PLS | Partial least squares of K-tables (multi-block PLS) | No | mbpls {ade4}, plsda {caret}
sPLS, pPLS | Sparse PLS, penalized PLS | Yes | spls {spls}, spls {mixOmics}, ppls {ppls}
sPLS-DA | Sparse PLS-discriminant analysis | Yes | splsda {mixOmics}, splsda {caret}
cPCA | Consensus PCA | No | cpca {mogsa}
CIA | Co-inertia analysis | No | coinertia {ade4}, cia {made4}

(a) A source of confusion: CCA is widely used as an acronym for both canonical 'correspondence' analysis and canonical 'correlation' analysis. Throughout this article we use CCA for canonical 'correlation' analysis. Both methods search for multivariate relationships between two data sets. Canonical 'correspondence' analysis is an extension and constrained form of 'correspondence' analysis [22]. Both canonical 'correlation' analysis and RDA assume a linear model; however, RDA is a constrained PCA (and assumes one matrix is the dependent variable and one independent), whereas canonical correlation analysis considers both equally. See [23] for more explanation.


NIPALS uses an iterative regression procedure to calculate the PCs. Computation can be performed on data with missing values, and it is faster than SVD when applied to large matrices. Furthermore, NIPALS may be generalized to discover the correlated structure in more than one data set (see the sections on the analysis of multi-omics data sets). Please refer to the Supplementary Information for additional details on NIPALS.
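A bare-bones version of the NIPALS iteration for the first component is sketched below; it is a didactic illustration of the alternating-regression idea, not the implementation in pcaMethods, ade4 or mixOmics.

# Minimal sketch of NIPALS: alternate regressions of loadings on scores and
# scores on loadings until convergence; deflate X and repeat for further PCs.
nipals_pc1 <- function(X, tol = 1e-9, max_iter = 500) {
  f <- X[, which.max(apply(X, 2, var))]      # initialize score with a high-variance column
  for (iter in seq_len(max_iter)) {
    q <- crossprod(X, f) / sum(f^2)          # regress each column of X on the score
    q <- q / sqrt(sum(q^2))                  # normalize the loading vector
    f_new <- drop(X %*% q)                   # updated score
    if (sum((f_new - f)^2) < tol) break
    f <- f_new
  }
  list(score = f_new, loading = drop(q))
}

set.seed(1)
X   <- scale(matrix(rnorm(20 * 100), nrow = 20))
pc1 <- nipals_pc1(X)
abs(cor(pc1$score, prcomp(X)$x[, 1]))        # close to 1: agrees with the SVD-based PC1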

Visualizing and interpreting results of dimension reduction analysis

We present an example to illustrate how to interpret the results of a PCA. PCA was applied to analyze mRNA gene expression data of a subset of cell lines from the NCI-60 panel: those of melanoma, leukemia and central nervous system (CNS) tumors. The results of PCA can be easily interpreted by visualizing the observations and variables in the same space using a biplot. Figure 1A is a biplot of the first (PC1) and second (PC2) PCs, where points and arrows from the plot origin represent observations and genes, respectively. Cell lines (points) with correlated gene expression profiles have similar scores and are projected close to each other on PC1 and PC2. We see that cell lines from the same anatomical location cluster together.

Both the direction and length of the mRNA gene expression vectors can be interpreted. Gene expression vectors point in the direction of the latent variable (PC) to which they are most similar (squared correlation). Gene vectors with the same direction (e.g. FAM69B, PNPLA2, PODXL2) have similar gene expression profiles. The length of a gene expression vector is proportional to the squared multiple correlation between the fitted values for the variable and the variable itself.

A gene expression vector and a cell line projected in the same direction from the origin are positively associated. For example, in Figure 1A, FAM69B, PNPLA2 and PODXL2 are active (have higher gene expression) in melanoma cell lines. Similarly, genes DPPA5 and RPL34P1 are among those that are highly expressed in most leukemia cell lines. By contrast, genes WASL and HEXB point in the opposite direction to most leukemia cell lines, indicating low association. In Figure 1B, it is clear that these genes are not expressed in the leukemia cell lines and are colored blue in the heatmap (dark gray in grayscale).

The sum of the squared correlation coefficients between a variable and all the components (calculated in Equation (8)) equals 1. Therefore, variables are often shown within a correlation circle (Figure 1C). Variables positioned on the unit circle are perfectly represented by the two dimensions displayed; those not on the unit circle may require additional components to be represented.

In most analyses, only the first few PCs are plotted and studied, as these explain the most variant trends in the data. Generally, the selection of components is subjective and depends on the purpose of the EDA. An informal elbow test may help to determine the number of PCs to retain and examine [24, 25]. From the scree plot of PC eigenvalues in Figure 1D, we might decide to examine the first three PCs because the decrease in PC variance becomes relatively moderate after PC3. Another widely used approach is to include (or retain) PCs that cumulatively capture a certain proportion of variance; for example, 70% of variance is modeled with three PCs. If a parsimonious model is preferred, the variance proportion cutoff can be as low as 50% [24]. More formal tests, including the Q2 statistic, are also available (for details see [24]). In practice, the first component might explain most of the variance and the remaining axes may simply be attributed to noise from technical or biological sources in a study with low complexity (e.g. cell line replicates of controls and one treatment condition), whereas a complex data set (for example, a set of heterogeneous tumors) may require multiple PCs to capture the same amount of variance.
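The scree plot, cumulative-variance rule and biplot described above can be produced with base R; the following is a small sketch on synthetic data, with the 70% cutoff taken from the example in the text.

# Minimal sketch: scree plot for an informal elbow test, cumulative variance
# cutoff and biplot of observations and variables.
set.seed(1)
X   <- matrix(rnorm(20 * 100), nrow = 20,
              dimnames = list(paste0("s", 1:20), paste0("g", 1:100)))
pca <- prcomp(X, center = TRUE, scale. = TRUE)

var_expl <- pca$sdev^2 / sum(pca$sdev^2)
barplot(var_expl[1:10], names.arg = paste0("PC", 1:10),
        ylab = "Proportion of variance")        # scree plot
which(cumsum(var_expl) >= 0.70)[1]              # smallest number of PCs capturing 70%

biplot(pca, choices = 1:2, cex = 0.6)           # observations and variables together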

Different dimension reduction approaches are optimized for different data

There are many dimension reduction approaches related to PCA (Table 2), including principal co-ordinate analysis (PCoA), correspondence analysis (CA) and nonsymmetrical correspondence analysis (NSCA). These may be computed by SVD but differ in how the data are transformed before decomposition [21, 26, 27], and therefore each is optimized for specific data properties. PCoA (also known as classical multidimensional scaling) is versatile: it is an SVD of a distance matrix and can be applied to decompose distance matrices of binary, count or continuous data. It is frequently applied in the analysis of microbiome data [28].

Table 4. Dimension reduction methods for multiple (more than two) data sets (Method | Description | Feature selection | Matched cases | R function {package})

MCIA | Multiple co-inertia analysis | No | No | mcia {omicade4}, mcoa {ade4}
gCCA | Generalized CCA | No | No | regCCA {dmt}
rGCCA | Regularized generalized CCA | No | No | regCCA {dmt}, rgcca {RGCCA}, wrapper.rgcca {mixOmics}
sGCCA | Sparse generalized canonical correlation analysis | Yes | No | sgcca {RGCCA}, wrapper.sgcca {mixOmics}
STATIS | Structuration des Tableaux à Trois Indices de la Statistique (STATIS); a family of methods that includes X-statis | No | No | statis {ade4}
CANDECOMP/PARAFAC, Tucker3 | Higher-order generalizations of SVD and PCA; require matched variables and cases | No | Yes | CP {ThreeWay}, T3 {ThreeWay}, PCAn {PTAk}, CANDPARA {PTAk}
PTA | Partial triadic analysis | No | Yes | pta {ade4}
STATICO | STATIS and CIA (finds structure between two pairs of K-tables) | No | No | statico {ade4}



PCA is designed for the analysis of multi-normally distributed data. If data are strongly skewed or extreme outliers are present, the first few axes may only separate those objects with extreme values instead of displaying the main axes of variation. If data are unimodal or display nonlinear trends, one may see distortions or artifacts in the resulting plots, in which the second axis is an arched function of the first axis. In PCA, this is called the horseshoe effect, and it is well described, including illustrations, in Legendre and Legendre [3]. Both nonmetric multidimensional scaling (MDS) and CA perform better than PCA in these cases [26, 29]. Unlike PCA, CA can be applied to sparse count data with many zeros. Although designed for contingency tables of nonnegative count data, CA and NSCA decompose a chi-squared matrix [30, 31] but have been successfully applied to continuous data, including gene expression and protein profiles [32, 33]. As described by Fellenberg et al. [33], gene and protein expression can be seen as an approximation of the number of corresponding molecules present in the cell during a certain measured condition. Additionally, Greenacre [27] emphasized that the descriptive nature of CA and NSCA allows their application to data tables in general, not only count data. These two arguments support the suitability of CA and NSCA as analysis methods for omics data. While CA investigates symmetric associations between two variables, NSCA captures asymmetric relations between variables. Spectral map analysis is related to CA and performs comparably with CA, each outperforming PCA in the identification of clusters of leukemia gene expression profiles [26]. All dimension reduction methods can be formulated in terms of the duality diagram; details on this powerful framework are included in the Supplementary Information.
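For completeness, here is a hedged sketch of how CA and PCoA are typically run with the ade4 functions listed in Table 2, using simulated count data rather than the NCI-60 data discussed in this review.

# Minimal sketch: correspondence analysis of a nonnegative count table and
# PCoA (classical MDS) of a distance matrix, using ade4.
library(ade4)

set.seed(2)
counts <- matrix(rpois(20 * 50, lambda = 5), nrow = 20)        # 20 samples x 50 features

ca <- dudi.coa(as.data.frame(counts), scannf = FALSE, nf = 2)  # CA of the count table
head(ca$li)                                                    # sample coordinates

pco <- dudi.pco(dist(scale(counts)), scannf = FALSE, nf = 2)   # PCoA of Euclidean distances
head(pco$li)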

Nonnegative matrix factorization (NMF) [34] forces a positive or nonnegative constraint on the resulting data matrices and, similar to independent component analysis (ICA) [35], there is no requirement for orthogonality or independence in the components.

Figure 1. Results of a PCA of mRNA gene expression data of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines from the NCI-60 panel. All variables were centered and scaled. (A) A biplot in which observations (cell lines) are points and gene expression profiles are arrows. (B) A heatmap showing the expression of the same 20 genes in the cell lines; the red-to-blue scale represents high to low gene expression (light to dark gray in the grayscale version). (C) The correlation circle. (D) A barplot of the variances of the first ten PCs. To improve readability, some variable (gene) labels in (A) have been moved slightly. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org.


The nonnegative constraint guarantees that only additive combinations of latent variables are allowed. This may be more intuitive in biology, where many biological measurements (e.g. protein concentrations, count data) are represented by positive values. NMF is described in more detail in the Supplementary Information. ICA was recently applied to molecular subtype discovery in bladder cancer [4]. Biton et al. [4] applied ICA to gene expression data of 198 bladder cancers and examined 20 components. ICA successfully decomposed and extracted multiple layers of signal from the data. The first two components were associated with batch effects, but other components revealed new biology about tumor cells and the tumor microenvironment. They also applied ICA to non-bladder cancers and, by comparing the correlation between components, were able to identify a set of bladder cancer-specific components and their associated genes.

As omics data sets tend to have high dimensionality (p >> n), it is often useful to reduce the number of variables. Several recent extensions of PCA include variable selection, often via a regularization step or L1 penalization (e.g. the Least Absolute Shrinkage and Selection Operator, LASSO) [36]. The NIPALS algorithm uses an iterative regression approach to calculate the components and loadings, and it is easily extended to include a sparse operator during the regression on the component [37]. A cross-validation approach can be used to determine the level of sparsity. Sparse, penalized and regularized extensions of PCA and related methods have been described recently [36, 38-41].
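A sparse PCA can be obtained, for example, with the SPC function of the PMA package listed in Table 2; the sketch below assumes its penalized matrix decomposition interface, in which the sumabsv argument controls the L1 penalty and hence how many loadings are nonzero.

# Minimal sketch (assumed PMA::SPC interface): sparse PCA on a synthetic matrix.
library(PMA)

set.seed(1)
Xc <- scale(matrix(rnorm(20 * 100), nrow = 20))

sp <- SPC(Xc, sumabsv = 3, K = 2)   # two sparse components; smaller sumabsv = sparser
colSums(sp$v != 0)                  # number of variables retained on each component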

Integrative analysis of two data sets

One-table dimension reduction methods have been extended to the EDA of two matrices and can simultaneously decompose and integrate a pair of matrices that measure different variables on the same observations (Table 3). Methods include generalized SVD [42], co-inertia analysis (CIA) [43, 44], and sparse or penalized extensions of partial least squares (PLS), canonical correspondence analysis (CCA) and canonical correlation analysis (CCA) [36, 45-47]. Note that both canonical correspondence analysis and canonical correlation analysis are referred to by the acronym CCA. Canonical correspondence analysis is a constrained form of CA that is widely used in ecological statistics [46]; however, it has yet to be adopted by the genomics community in the analysis of pairs of omics data. By contrast, several groups have applied extensions of canonical correlation analysis to omics data integration. Therefore, in this review, we use CCA to describe canonical correlation analysis.

Canonical correlation analysis

Two omics data sets X (dimension n × p_x) and Y (dimension n × p_y) can be expressed by the following latent component decomposition problem:

    X = F_x Q_x^T + E_x
    Y = F_y Q_y^T + E_y   (9)

where F_x and F_y are n × r matrices whose r columns are components that explain the co-structure between X and Y, the columns of Q_x and Q_y are variable loading vectors for X and Y, respectively, and E_x and E_y are error terms.

Proposed by Hotelling in 1936 [47], CCA searches for associations or correlations among the variables of X and Y [47] by maximizing the correlation between Xq_x^i and Yq_y^i:

    for the ith component, argmax_{q_x^i, q_y^i} cor(Xq_x^i, Yq_y^i)   (10)

In CCA, the components Xq_x^i and Yq_y^i are called canonical variates, and their correlations are the canonical correlations.

Sparse canonical correlation analysis

The main limitation of applying CCA to omics data is that it requires an inversion of the correlation or covariance matrix [38, 49, 50], which cannot be calculated when the number of variables exceeds the number of observations [46]. In high-dimensional omics data, where p >> n, application of these methods requires a regularization step. This may be accomplished by adding a ridge penalty, that is, adding a multiple of the identity matrix to the covariance matrix [51]. A sparse solution of the loading vectors (Q_x and Q_y) filters the number of variables and simplifies the interpretation of results. For this purpose, penalized CCA [52], sparse CCA [53], CCA-l1 [54], CCA elastic net (CCA-EN) [45] and CCA-group sparse [55] have been introduced and applied to the integrative analysis of two omics data sets. Witten et al. [36] provided an elegant comparison of various CCA extensions, accompanied by a unified approach to compute both penalized CCA and sparse PCA. In addition, Witten and Tibshirani [54] extended sparse CCA into a supervised framework, which allows the integration of two data sets with a quantitative phenotype, for example, selecting variables from both genomics and transcriptomics data and linking them to drug sensitivity data.
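As an illustration, the penalized CCA of Witten et al. is available in the PMA package (Table 3); the sketch below assumes its CCA interface, with penaltyx and penaltyz controlling the sparsity of the two loading vectors, applied to synthetic matched-sample matrices rather than real omics data.

# Minimal sketch (assumed PMA::CCA interface): sparse/penalized CCA of two
# data sets with matched rows (observations).
library(PMA)

set.seed(3)
X1 <- matrix(rnorm(20 * 100), nrow = 20)   # e.g. mRNA:  20 samples x 100 genes
X2 <- matrix(rnorm(20 * 40),  nrow = 20)   # e.g. miRNA: 20 samples x 40 features

cc <- CCA(X1, X2, typex = "standard", typez = "standard",
          penaltyx = 0.3, penaltyz = 0.3, K = 2)

sum(cc$u[, 1] != 0)   # variables selected from X1 on the first canonical variate
sum(cc$v[, 1] != 0)   # variables selected from X2
cc$cors               # canonical correlations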

Partial least squares analysis

PLS is an efficient dimension reduction method for the analysis of high-dimensional omics data. PLS maximizes the covariance, rather than the correlation, between components, making it more resistant to outliers than CCA. Additionally, PLS does not suffer from the p >> n problem as CCA does. Nonetheless, a sparse solution is desired in some cases. For example, Le Cao et al. [56] proposed a sparse PLS method for feature selection by introducing a LASSO penalty on the loading vectors. In a recent comparison, sPLS performed similarly to sparse CCA [45]. There are many implementations of PLS, which optimize different objective functions with different constraints; several of these are described by Boulesteix et al. [57].
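A corresponding sparse PLS fit with the mixOmics spls function (Table 3) might look as follows; keepX and keepY give the per-component numbers of selected variables, and the synthetic matrices stand in for a pair of omics assays.

# Minimal sketch (assumed mixOmics interface): sparse PLS between two data
# sets with matched samples.
library(mixOmics)

set.seed(3)
X1 <- matrix(rnorm(20 * 100), nrow = 20, dimnames = list(NULL, paste0("gene", 1:100)))
X2 <- matrix(rnorm(20 * 40),  nrow = 20, dimnames = list(NULL, paste0("mir", 1:40)))

fit <- spls(X1, X2, ncomp = 2, keepX = c(20, 20), keepY = c(10, 10))
colSums(fit$loadings$X != 0)   # variables kept from X1 on each component
colSums(fit$loadings$Y != 0)   # variables kept from X2 on each component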

Co-inertia analysis

CIA is a descriptive, nonconstrained approach for coupling pairs of data matrices. It was originally proposed to link two ecological tables [13, 58] but has been successfully applied to integrative analysis of omics data [32, 43]. CIA is implemented and formulated under the duality diagram framework (Supplementary Information). CIA is performed in two steps: (i) application of a dimension reduction technique such as PCA, CA or NSCA to the initial data sets, depending on the type of data (binary, categorical, discrete counts or continuous data); and (ii) constraining the projections of the orthogonal axes such that they are maximally covariant [43, 58]. CIA does not require an inversion step of the correlation or covariance matrix; thus, it can be applied to high-dimensional genomics data without regularization or penalization.
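The two-step procedure described above maps directly onto the ade4 workflow (Table 3): a one-table ordination of each data set followed by coinertia(). The following is an illustrative sketch on synthetic data.

# Minimal sketch: co-inertia analysis of two PCA ordinations with matched samples.
library(ade4)

set.seed(3)
X1 <- as.data.frame(matrix(rnorm(20 * 100), nrow = 20))
X2 <- as.data.frame(matrix(rnorm(20 * 40),  nrow = 20))

pca1 <- dudi.pca(X1, scannf = FALSE, nf = 2)   # step (i): one-table dimension reduction
pca2 <- dudi.pca(X2, scannf = FALSE, nf = 2)

cia <- coinertia(pca1, pca2, scannf = FALSE, nf = 2)  # step (ii): maximally covariant axes
cia$RV                                                # global similarity (RV coefficient)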


Though closely related to CCA [49], CIA maximizes the squared covariance between the linear combinations of the preprocessed matrices, that is,

    for the ith dimension, argmax_{q_x^i, q_y^i} cov^2(Xq_x^i, Yq_y^i)   (11)

Equation (11) can be decomposed as

    cov^2(Xq_x^i, Yq_y^i) = cor^2(Xq_x^i, Yq_y^i) × var(Xq_x^i) × var(Yq_y^i)   (12)

The CIA decomposition of covariance maximizes both the variance and the correlation between matrices and thus is less sensitive to outliers. The relationships between CIA, Procrustes analysis [13] and CCA have been well described [49]. A comparison between sCCA (with elastic net penalty), sPLS and CIA is provided by Le Cao et al. [45]. In summary, CIA and sPLS both maximize the covariance between eigenvectors and efficiently identify joint and individual variance in paired data. In contrast, CCA-EN maximizes the correlation between eigenvectors and will discover effects present in both data sets but may fail to discover strong individual effects [45]. Both sCCA and sPLS are sparse methods that select similar subsets of variables, whereas CIA does not include a feature selection step; thus, in terms of feature selection, results of CIA are more likely to contain redundant information in comparison with sparse methods [45].

Similar to classical dimension reduction approaches, the number of dimensions to be examined needs to be considered and can be visualized using a scree plot (similar to Figure 1D). Components may be evaluated by their associated variance [25], the elbow test or the Q2 statistic, as described previously. For example, the Q2 statistic was applied to select the number of dimensions in the predictive mode of PLS [56]. In addition, when a sparse factor is introduced in the loading vectors, cross-validation approaches may be used to determine the number of variables selected from each pair of components. Selection of the number of components and optimization of these parameters remains an open question and an active area of research.

Integrative analysis of multi-assay data

There is a growing need to integrate more than two data sets in genomics. Generalizations of dimension reduction methods to three or more data sets are sometimes called K-table methods [59-61], and a number of them have been applied to multi-assay data (Table 4). Simultaneous decomposition and integration of multiple matrices is more complex than analysis of a single data set or paired data, because each data set may have different numbers of variables, scales or internal structure, and thus different variance. This might produce global scores that are dominated by one or a few data sets; therefore, data are preprocessed before decomposition. Preprocessing is often performed on two levels. On the variable level, variables are often centered and normalized so that their sum of squared values or variance equals 1. This procedure enables all the variables to contribute equally to the total inertia (sum of squares of all elements) of a data set. However, the number of variables may vary between data sets, or filtering and preprocessing steps may generate data sets that have a higher variance contribution to the final result; therefore, a data-set-level normalization is also required. In the simplest K-table analysis, all matrices have equal weights. More commonly, data sets are weighted by their expected contribution or expected data quality, for example by the square root of their total inertia or by the square root of the number of columns of each data set [62]. Alternatively, greater weights can be given to smaller or less redundant matrices, to matrices that have more stable predictive information or to those that share more information with other matrices. Such weighting approaches are implemented in multiple-factor analysis (MFA), principal covariates regression [63] and STATIS.

The simplest multi-omics data integration arises when the data sets have the same variables and observations, that is, matched rows and matched columns. In genomics, these could be produced when variables from different multi-assay data sets are mapped to a common set of genomic coordinates or gene identifiers, thus generating data sets with matched variables and matched observations. Alternatively, a repeated or longitudinal analysis of the same samples and the same variables could produce such data (one should note that these dimension reduction approaches do not model the time correlation between different data sets). There is a history of such analyses in ecology, where counts of species and environmental variables are measured over different seasons [49, 64, 65], and in psychology, where different standardized tests are measured multiple times on study populations [66, 67]. Analysis of such variables x samples x time data is called three-mode decomposition, triadic cube or three-way table analysis, tensor decomposition, three-way PCA, three-mode PCA, three-mode factor analysis, the Tucker-3 model, Tucker3, TUCKALS3 or multi-block analysis, among other names (Table 4). The relationships between various tensor or higher-order decompositions for multi-block analysis are reviewed by Kolda and Bader [68].

More frequently, we need to find associations in multi-assay data that have matched observations but different and unmatched numbers of variables. For example, TCGA generated miRNA and mRNA transcriptome (RNA-seq, microarray), DNA copy number, DNA mutation, DNA methylation and proteomics molecular profiles on each tumor. The NCI-60 and Cancer Cell Line Encyclopedia projects have measured pharmacological compound profiles in addition to exome sequencing and transcriptomic profiles. Methods that can be applied to EDA of K-table multi-assay data with different variables include MCIA, MFA, generalized CCA (GCCA) and consensus PCA (CPCA).

The K-table methods can be generally expressed by the following model:

    X_1 = F Q_1^T + E_1
    ...
    X_k = F Q_k^T + E_k
    ...
    X_K = F Q_K^T + E_K   (13)

where there are K matrices or omics data sets X_1, ..., X_K. For convenience, we assume that the rows of the X_k share a common set of observations, but the columns of each X_k may be different variables. F is the 'global score' matrix; its columns are the PCs and are interpreted similarly to PCs from a PCA of a single data set. The global score matrix, which is identical in all decompositions, integrates information from all data sets. Therefore, it is not specific to any single data set; rather, it represents the common pattern defined by all data sets. The matrices Q_k, with k ranging from 1 to K, are the loadings or coefficient matrices. A high positive value indicates a strong positive contribution of the corresponding variable to the 'global score'.


While the above methods are formulated for multiple data sets with different variables but the same observations, most can be similarly formulated for multiple data sets with the same variables but different observations [69].

Multiple co-inertia analysis

MCIA is an extension of CIA that aims to analyze multiple matrices by optimizing a covariance criterion [60, 70]. MCIA simultaneously projects K data sets into the same dimensional space. Instead of maximizing the covariance between scores from two data sets, as in CIA, the optimization problem used by MCIA is, for dimension i,

    argmax_{q_1^i, ..., q_k^i, ..., q_K^i}  Σ_{k=1}^{K} cov^2(X_k q_k^i, X q^i)   (14)

with the constraints that ||q_k^i|| = var(X q^i) = 1, where X = (X_1 | ... | X_k | ... | X_K) is the column-wise concatenation of the data sets and q^i holds the corresponding loading values (the 'global' loading vector) [60, 69, 70].

MCIA derives a set of 'block scores' X_k q_k^i using linear combinations of the original variables from each individual matrix. The global score X q^i is then further defined as a linear combination of the block scores. In practice, the global scores represent a correlated structure defined by multiple data sets, whereas the concordance and discrepancy between these data sets may be revealed by the block scores (for detail, see the 'Example case study' section). MCIA may be calculated with an ad hoc extension of the NIPALS PCA [71]. This algorithm starts with an initialization step in which the global scores and the block loadings for the first dimension are computed. The residual matrices are calculated in an iterative step by removing the variance induced by the variable loadings (the 'deflation' step). For higher order solutions, the same procedure is applied to the residual matrices and re-iterated until the desired number of dimensions is reached. Therefore, the computation time strongly depends on the number of desired dimensions. MCIA is implemented in the R package omicade4 and has been applied to the integrative analysis of transcriptomic and proteomic data sets from the NCI-60 cell lines [60].
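In practice, an MCIA of NCI-60 multi-omics tables looks roughly as follows with omicade4; the sketch assumes the package's documented interface and its bundled NCI60_4arrays example list, and argument or data names may change between package versions.

# Minimal sketch (assumed omicade4 interface): MCIA of several omics tables
# measured on the same cell lines.
library(omicade4)

data(NCI60_4arrays)            # list of data.frames: variables x cell lines
sapply(NCI60_4arrays, dim)     # different numbers of variables, matched cell lines

fit <- mcia(NCI60_4arrays, cia.nf = 2)                        # two global components
cancer_type <- substr(colnames(NCI60_4arrays$agilent), 1, 2)  # tissue-of-origin labels
plot(fit, axes = 1:2, phenovec = cancer_type, sample.lab = FALSE, df.color = 1:4)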

Generalized canonical correlation analysis

GCCA [71] is a generalization of CCA to K-table analysis [73-75]. It has also been applied to the analysis of omics data [36, 76]. Typically, MCIA and GCCA will produce similar results (for a more detailed comparison, see [60]). GCCA uses a different deflation strategy than MCIA: it calculates the residual matrices by removing the variance with respect to the 'block scores' (instead of the 'variable loadings' used by MCIA or the 'global scores' used by CPCA; see later). When applied to omics data where p >> n, a variable selection step is often integrated within the GCCA approach, which cannot be done in the case of MCIA. In addition, as block scores are better representations of a single data set (in contrast to the global score), GCCA is more likely to find common variables across data sets. Witten and Tibshirani [54] applied sparse multiple CCA to analyze gene expression and copy number variation (CNV) data from diffuse large B-cell lymphoma patients and successfully identified 'cis interactions' that are up-regulated in both the CNV and mRNA data.

Consensus PCA

CPCA is closely related to GCCA and MCIA but has had less exposure in the omics data community. CPCA optimizes the same criterion as GCCA and MCIA and is subject to the same constraints as MCIA [71]. The deflation step of CPCA relies on the 'global score'. As a result, it guarantees the orthogonality of the global score only and tends to find common patterns in the data sets. This characteristic makes it more suitable for the discovery of joint patterns in multiple data sets, such as the joint clustering problem.

Regularized generalized canonical correlation analysis

Recently, Tenenhaus and Tenenhaus [69, 74] proposed regularized generalized CCA (RGCCA), which provides a unified framework for different K-table multivariate methods. The RGCCA model introduces some extra parameters, in particular a shrinkage parameter and a linkage parameter. The linkage parameter is defined so that the connections between matrices may be customized. The shrinkage parameter ranges from 0 to 1: setting this parameter to 0 will force the correlation criterion (the criterion used by GCCA), whereas a shrinkage parameter of 1 will apply the covariance criterion (used by MCIA and CPCA). A value between 0 and 1 leads to a compromise between the two options. In practice, the correlation criterion is better at explaining the correlated structure across data sets while discarding the variance within each individual data set. The introduction of these extra parameters makes RGCCA highly versatile: GCCA, CIA and CPCA can be described as special cases of RGCCA (see [69] and Supplementary Information). In addition, RGCCA also integrates a feature selection procedure, named sparse GCCA (SGCCA). Tenenhaus et al. [76] applied SGCCA to combine gene expression, comparative genomic hybridization and a qualitative phenotype measured on a set of 53 children with glioma. Sparse multiple CCA [54] and SGCCA [76] are available in the R packages PMA and RGCCA, respectively. Similarly, a higher-order implementation of sparse PLS is described in Zhao et al. [77].
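Table 4 also lists wrapper.sgcca in mixOmics as one implementation of SGCCA; a hedged sketch is shown below, noting that argument names (e.g. penalty, design) differ between package versions, so it should be read as an outline of the interface rather than a definitive call.

# Minimal sketch (assumed mixOmics wrapper): sparse GCCA of three blocks with
# a fully connected design (linkage) matrix.
library(mixOmics)

set.seed(3)
blocks <- list(
  mrna  = matrix(rnorm(20 * 100), nrow = 20, dimnames = list(NULL, paste0("g", 1:100))),
  mirna = matrix(rnorm(20 * 40),  nrow = 20, dimnames = list(NULL, paste0("m", 1:40))),
  prot  = matrix(rnorm(20 * 60),  nrow = 20, dimnames = list(NULL, paste0("p", 1:60))))
design <- matrix(1, 3, 3) - diag(3)        # every pair of blocks connected

fit <- wrapper.sgcca(X = blocks, design = design, ncomp = 2,
                     penalty = c(0.3, 0.3, 0.3))
lapply(fit$loadings, function(q) colSums(q != 0))   # selected variables per block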

Joint NMF

NMF has also been extended to jointly factorize multiple matrices. In joint NMF, the values in the global score F and in the coefficient matrices (Q_1, ..., Q_K) are nonnegative, and there is no explicit definition of the block loadings. An optimization algorithm is applied to minimize an objective function, typically the sum of squared errors, i.e.

    Σ_{k=1}^{K} ||E_k||^2

This approach can be considered a nonnegative implementation of PARAFAC, although it has also been implemented using the Tucker model [78-80]. Zhang et al. [81] apply joint NMF to a three-way analysis of DNA methylation, gene expression and miRNA expression data to identify modules in each of these regulatory layers that are associated with each other.
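One simple way to realize the shared-F factorization of Equation (13) under a nonnegativity constraint is to column-bind the (nonnegative) data sets and run a standard NMF, so that the basis matrix is shared across blocks and the coefficient matrix splits into per-block loadings. The sketch below uses the NMF package on simulated data; it is an illustrative shortcut, not the algorithm of Zhang et al. [81].

# Minimal sketch: a simple joint NMF via column-wise concatenation, X_k ~ F Q_k^T.
library(NMF)

set.seed(4)
A1 <- matrix(rexp(20 * 100), nrow = 20)    # nonnegative data set 1 (matched rows)
A2 <- matrix(rexp(20 * 40),  nrow = 20)    # nonnegative data set 2

fit <- nmf(cbind(A1, A2), rank = 3)        # shared basis across both blocks
F_global <- basis(fit)                     # 20 x 3 global scores (F)
H        <- coef(fit)                      # 3 x 140 coefficients, i.e. [Q_1^T | Q_2^T]
Q1t <- H[, 1:100]; Q2t <- H[, 101:140]     # per-block coefficient matrices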

Advantages of dimension reduction when integrating multi-assay data

Dimension reduction or latent variable approaches provide EDA, integrate multi-assay data, highlight global correlations across data sets, and discover outliers or batch effects in individual data sets. Dimension reduction approaches also facilitate downstream analysis of both observations and variables (genes). Compared with cluster analysis of individual data sets, cluster analysis of the global score matrix (the F matrix) is robust, as it aggregates observations across data sets and is less likely to reflect a technical or batch effect of a single data set.


Similarly, dimension reduction of multi-assay data facilitates downstream gene set, pathway and network analysis of variables. MCIA transforms the variables from each data set onto the same scale, and their loadings (the Q matrix) rank the variables by their contribution to the global data structure (Figure 2). Meng et al. report that pathway or gene set enrichment analysis (GSEA) of the transformed variables is more sensitive than GSEA of each individual data set. This is both because of the re-weighting and transformation of variables and because GSEA on the combined data has greater coverage of variables (genes), thus compensating for missing or unreliable information in any single data set. For example, in Figure 2B, we integrate and transform mRNA, proteomics and miRNA data onto the same scale, allowing us to extract and study the union of all variables.

Example case study

To demonstrate the integration of multiple data sets using dimension reduction, we applied MCIA to analyze mRNA, miRNA and proteomics expression profiles of melanoma, leukemia and CNS cell lines from the NCI-60 panel. The graphical output from this analysis, namely plots of the sample space, the variable space and the data weighting space, is provided in Figure 2A, B and E. The eigenvalues can be interpreted similarly to PCA: a higher eigenvalue contributes more information to the global score. As with PCA, researchers may be subjective in their selection of the number of components [24]. The scree plot in Figure 2D shows the eigenvalues of each global score. In this case, the first two eigenvalues were markedly larger, so we visualized the cell lines and variables on PC1 and PC2.

In the observation space (Figure 2A), assay data (mRNA, miRNA and proteomics) are distinguished by shape. The coordinates of each cell line (F_k in Figure 2A) are connected by lines to the global scores (F). Short lines between points and cell line global scores reflect high concordance of the cell line data. Most cell lines have concordant information between data sets (mRNA, miRNA, protein), as indicated by relatively short lines. In addition, the RV coefficient [82, 83], which is a generalized Pearson correlation coefficient for matrices, may be used to estimate the correlation between two transformed data sets. The RV coefficient takes values between 0 and 1, where a higher value indicates higher co-structure.

Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) The first two components in sample space (sample 'type' is coded by point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a 'star' in which the three omics data points for that cell line are connected by lines to a center point, the global score (F) for that cell line; the shorter the lines, the higher the level of concordance between the data types and the global structure. (B) The variable space of MCIA. A variable that is highly expressed in a cell line is projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are strongly associated with the cancer tissue of origin. (C) The correlation coefficients of the proteome profile of SR with those of the other cell lines. The proteome profile of the SR cell line is more correlated with the melanoma cell lines; there may be a technical issue with the LE_SR proteomics data. (D) A scree plot of the eigenvalues and (E) a plot of the data weighting space. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org.


In this example, we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].
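The unmodified RV coefficient is straightforward to compute for two column-centered matrices with matched observations, as in the minimal sketch below; ready-made implementations are also available, for example coeffRV() in the FactoMineR package. The objects mrna and prot in the usage line are hypothetical.

# RV coefficient between two data sets with matched observations (rows).
rv_coef <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)   # column-center each data set
  Y <- scale(Y, center = TRUE, scale = FALSE)
  A <- tcrossprod(X)                            # X %*% t(X): n x n configuration of X
  B <- tcrossprod(Y)                            # Y %*% t(Y): n x n configuration of Y
  sum(A * B) / sqrt(sum(A * A) * sum(B * B))    # trace(AB) / sqrt(trace(AA) trace(BB))
}
# usage, with samples in rows: rv_coef(t(mrna), t(prot))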

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two groups of cell lines (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to the melanoma rather than the leukemia cell lines. We examined within-tumor-type correlations to the SR cell line (Figure 2C) and observed that the SR proteomics data had a higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data on the SR cell line (Figure 2A and C).

MCIA projects all variables into the same space. The variable space (Q1, ..., QK) is visualized in Figure 2B. Variables and samples projected in the same direction are associated. This allows one to select the variables most strongly associated with specific observations from each data set for subsequent analysis. In our previous study [85], we showed that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched in melanogenesis functions, and the genes and proteins highly weighted toward the leukemia cell lines are highly enriched in T-cell or immune-related functions.
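A package-agnostic way to pull out such variables is to rank the rows of a loadings matrix on the axis of interest. The sketch below assumes Q is a (variables x components) matrix with variable names as row names, extracted from whichever decomposition was used; the function name is illustrative.

# Return the variables with the most extreme weights on one component.
top_variables <- function(Q, comp = 1, n = 10) {
  ord <- order(Q[, comp])                          # ascending order of loadings
  list(negative = rownames(Q)[head(ord, n)],       # most negative weights
       positive = rownames(Q)[rev(tail(ord, n))])  # most positive weights, largest first
}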

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86–88], had high weights on the positive end of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development, and miR-223 is regulated by the Notch and NF-kB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA with the strongest association with the CNS cell lines was miR-409. This miRNA is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, the CNS cell lines are characterized by a more pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (which corresponds to the melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506, which are strongly associated with the melanoma cell lines and are reported to initiate melanocyte transformation and promote melanoma growth [85].

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. These data are biologically complex. In addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for complex multi-omics data are needed.

We describe emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in the variable loadings may help in the selection of variables for downstream analysis. Few comparisons of different methods exist, and the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful, but this remains an open area of research.
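As one simple and widely used heuristic, the proportion of variance explained can be inspected and components retained up to an elbow or a cumulative-variance cutoff. The sketch below applies this to PCA via prcomp on a hypothetical genes x samples matrix expr; the same logic can be applied to the (pseudo-)eigenvalues returned by multi-table methods.

pc <- prcomp(t(expr), center = TRUE, scale. = TRUE)   # expr: genes x samples (hypothetical)
prop_var <- pc$sdev^2 / sum(pc$sdev^2)                # variance explained per component
plot(prop_var, type = "b", xlab = "Component", ylab = "Proportion of variance")  # scree plot
n_keep <- which(cumsum(prop_var) >= 0.70)[1]          # e.g. retain components up to 70% cumulative variance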

Another limitation of these methods is that, although variables may vary among data sets, the observations need to be matchable. Therefore, researchers need to plan the experimental design carefully in the early stages of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines a Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].

The number of variables in genomics data is a challenge for traditional EDA visualization tools, most of which were designed for data sets with far fewer variables. Within R, new packages, including ggord, are being developed, and dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, interpretation of long lists of biological variables (genes, proteins, miRNAs) is challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotations, such as Gene Ontology or Reactome, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60], as sketched below.
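A crude version of this projection simply averages the loadings of each gene set's member genes on every component; dedicated implementations (for example in the mogsa Bioconductor package) use a proper supplementary projection, so the sketch below is only meant to convey the idea, and Q and gene_sets are hypothetical objects.

# Score each gene set on each component by averaging the loadings of its member genes.
# Q: (genes x components) loadings matrix with gene identifiers as row names.
# gene_sets: named list of character vectors of gene identifiers.
gene_set_scores <- function(Q, gene_sets) {
  t(sapply(gene_sets, function(g) {
    members <- intersect(g, rownames(Q))     # genes of the set present in the data
    colMeans(Q[members, , drop = FALSE])     # average loading per component
  }))
}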

Conclusion

Dimension reduction methods have a long history. Many similar methods have been developed in parallel by multiple fields. In this review, we provided an overview of dimension reduction techniques that are both well established and perhaps new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension reduction methods, which can be applied to exploratory data analysis of a single data set or to integrated analysis of a pair or of multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.


• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNAs, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of the consensus and incongruence within and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-data set methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway and other downstream analyses.

R Supplement

R code to re-generate all figures in this article is available as a Supplementary File. Data and code are also available on the GitHub repository https://github.com/aedin/NCI60Example.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

G.G.T. was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. A.C. was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program through the Breast Cancer Research Program under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

References

1. Brazma A, Culhane AC. Algorithms for gene expression analysis. In: Jorde LB, Little PFR, Dunn MJ, Subramaniam S (eds). Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. London: John Wiley & Sons, 2005, 3148–59.
2. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733–9.
3. Legendre P, Legendre LFJ. Numerical Ecology, 3rd edn. Amsterdam: Elsevier Science, 2012.
4. Biton A, Bernard-Pierrot I, Lou Y, et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep 2014;9:1235–45.
5. Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929–44.
6. Verhaak RG, Tamayo P, Yang JY, et al. Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest 2013;123:517–25.
7. Senbabaoglu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep 2014;4:6207.
8. Gusenleitner D, Howe EA, Bentink S, et al. iBBiG: iterative binary bi-clustering of gene sets. Bioinformatics 2012;28:2484–92.
9. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Magazine Series 1901;6(2):559–72.
10. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933;24:417.
11. Bland M. An Introduction to Medical Statistics, 3rd edn. Oxford; New York: Oxford University Press, 2000, xvi, 405.
12. Warner RM. Applied Statistics: From Bivariate through Multivariate Techniques, 2nd edn. London: SAGE Publications, 2012, 1004.
13. Dray S, Chessel D, Thioulouse J. Co-inertia analysis and the linking of ecological data tables. Ecology 2003;84:3078–89.
14. Golub GH, Loan CF. Matrix Computations. Baltimore: JHU Press, 2012.
15. Anthony M, Harvey M. Linear Algebra: Concepts and Methods, 1st edn. Cambridge: Cambridge University Press, 2012, 149–60.
16. McShane LM, Cavenagh MM, Lively TG, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med 2013;11:220.
17. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
18. Tukey JW. Exploratory Data Analysis. Boston: Addison-Wesley Publishing Company, 1977.
19. Hogben L. Handbook of Linear Algebra, 1st edn. Boca Raton: Chapman & Hall/CRC Press, 2007, 20.
20. Ringner M. What is principal component analysis? Nat Biotechnol 2008;26:303–4.
21. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003, 91–109.
22. ter Braak CJF, Prentice IC. A theory of gradient analysis. Adv Ecol Res 1988;18:271–313.
23. ter Braak CJF, Smilauer P. CANOCO Reference Manual and User's Guide to Canoco for Windows: Software for Canonical Community Ordination (version 4). Ithaca, NY, USA: Microcomputer Power, http://www.microcomputerpower.com, 1998, 352.
24. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2010;2(4):433–59.
25. Dray S. On the number of principal components: a test of dimensionality based on measurements of similarity between matrices. Comput Stat Data Anal 2008;52:2228–37.
26. Wouters L, Gohlmann HW, Bijnens L, et al. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 2003;59:1131–9.
27. Greenacre M. Correspondence Analysis in Practice, 2nd edn. Boca Raton: Chapman and Hall/CRC, 2007, 201–11.
28. Goodrich JK, Di Rienzi SC, Poole AC, et al. Conducting a microbiome study. Cell 2014;158:250–62.
29. Fasham MJR. A comparison of nonmetric multidimensional scaling, principal components and reciprocal averaging for the ordination of simulated coenoclines and coenoplanes. Ecology 1977;58:551–61.


30. Beh EJ, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies, 1st edn. Hoboken: John Wiley & Sons, 2014, 120–76.
31. Greenacre MJ. Theory and Applications of Correspondence Analysis, 3rd edn. Waltham, MA: Academic Press, 1993, 83–125.
32. Fagan A, Culhane AC, Higgins DG. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics 2007;7:2162–71.
33. Fellenberg K, Hauser NC, Brors B, et al. Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 2001;98:10781–6.
34. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.
35. Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287–314.
36. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515–34.
37. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 2008;99:1015–34.
38. Lee S, Epstein MP, Duncan R, et al. Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet Epidemiol 2012;36:293–302.
39. Sill M, Saadati M, Benner A. Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics 2015;31:2683–90.
40. Zhao J. Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC. IEEE Trans Cybern 2014;44:1871–83.
41. Zhao Q, Shi X, Huang J, et al. Integrative analysis of '-omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat 2015;7:10.
42. Alter O, Brown PO, Botstein D. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 2003;100:3351–6.
43. Culhane AC, Perriere G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003;4:59.
44. Dray S. Analysing a pair of tables: coinertia analysis and duality diagrams. In: Blasius J, Greenacre M (eds). Visualization and Verbalization of Data. Boca Raton: CRC Press, 2014, 289–300.
45. Le Cao KA, Martin PG, Robert-Granie C, et al. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009;10:34.
46. ter Braak CJF. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 1986;67:1167–79.
47. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77.
48. McGarigal K, Landguth E, Stafford S. Multivariate Statistics for Wildlife and Ecology Research. New York: Springer, 2002.
49. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: a comparison of several methods. Ann Appl Stat 2011;5:2300–25.
50. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
51. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London: Academic Press, 1979.
52. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics 2009;10:315.
53. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:article 1.
54. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:article 28.
55. Lin D, Zhang J, Li J, et al. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013;14:245.
56. Le Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011;12:253.
57. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007;8:32–44.
58. Doledec S, Chessel D. Co-inertia analysis: an alternative method for studying species–environment relationships. Freshwater Biol 1994;31:277–94.
59. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
60. Meng C, Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics 2014;15:162.
61. de Tayrac M, Le S, Aubry M, et al. Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: multiple factor analysis approach. BMC Genomics 2009;10:32.
62. Abdi H, Williams LJ, Valentin D. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdiscip Rev Comput Stat 2012;4:124–67.
63. De Jong S, Kiers HAL. Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems 1992;14:155–64.
64. Benasseni J, Dosse MB. Analyzing multiset data by the power STATIS-ACT method. Adv Data Anal Classification 2012;6:49–65.
65. Escoufier Y. The duality diagram: a means for better practical applications. Devel Num Ecol 1987;14:139–56.
66. Giordani P, Kiers HAL, Ferraro DMA. Three-way component analysis using the R package ThreeWay. J Stat Software 2014;57:7.
67. Leibovici DG. Spatio-temporal multiway decompositions using principal tensor analysis on k-modes: the R package PTAk. J Stat Software 2010;34(10):1–34.
68. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455–500.
69. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res 2014;238:391–403.
70. Chessel D, Hanafi M. Analyses de la co-inertie de K nuages de points. Rev Stat Appl 1996;44:35–60.
71. Hanafi M, Kohler A, Qannari EM. Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometrics and Intelligent Laboratory Systems 2011;106(1):37–40.
72. Carroll JD. Generalization of canonical correlation analysis to three or more sets of variables. Proc 76th Convent Am Psych Assoc 1968;3:227–8.
73. Takane Y, Hwang H, Abdi H. Regularized multiple-set canonical correlation analysis. Psychometrika 2008;73:753–75.


74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257–84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569–83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660–73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Dublin, 2007, 1147–51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112–31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572–80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379–91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769–77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768–78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401–5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558–70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and over-expression of miR-223. Haematologica 2010;95:1114–21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432–8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672–6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-kB signaling pathways regulate miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324–35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636–46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846–53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082–90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.



Exploratory data analysis (EDA) is an important early step inomics data analysis [1] It summarizes the main characteristics ofdata and may identify potential issues such as batch effects [2]and outliers Techniques for EDA include cluster analysis and di-mension reduction Both have been widely applied to transcrip-tomics data analysis [1] but there are advantages to dimensionreduction approaches when integrating multi-assay data Whilecluster analysis generally investigates pairwise distances betweenobjects looking for fine relationships dimension reduction or la-tent variable methods consider the global variance of the data sethighlighting general gradients or patterns in the data [3]

Biological data frequently have complex phenotypes and de-pending on the subset of variables analyzed multiple valid clus-tering classifications may co-exist Dimension reductionapproaches decompose the data into a few new variables (calledcomponents) that explain most of the differences in observa-tions For example a recent dimension reduction analysis ofbladder cancers identified components associated with batcheffects GC content in the RNA sequencing data in addition toseven components that were specific to tumor cells and threecomponents associated with tumor stroma [4] By contrastmost clustering approaches are optimized for discovery of dis-crete clusters where each observation or variable is assigned toonly one cluster Limitations of clustering were observed whenthe method cluster-of-cluster assignments was applied toTCGA pan-cancer multi-omics data of 3527 specimens from 12cancer type sources [5] Tumors were assigned to one clusterand these clusters grouped largely by anatomical origin andfailed to identify clusters associated with known cancer path-ways [5] However a dimension reduction analysis across 10 dif-ferent cancers identified novel and known cancer-specificpathways in addition to pathways such as cell cycle mitochon-dria gender interferon response and immune response thatwere common among different cancers [4]

Overlapping clusters have been identified in many tumorsincluding glioblastoma and serous ovarian cancer [6 7]Gusenleitner and colleagues [8] found that k-means or hierarch-ical clustering failed to identify the correct cluster structure insimulated data with multiple overlapping clusters Clusteringmethods may also falsely discover clusters in unimodal dataFor example Senbabaoglu et al [7] applied consensus clusteringto randomly generated unimodal data and found it divided thedata into apparently stable clusters for a range of K where K is apredefined number of clusters However principal componentanalysis (PCA) did not identify these clusters

In this article we first introduce linear dimension reduction ofa single data set describing the fundamental concepts and ter-minology that are needed to understand its extensions to mul-tiple matrices Then we review multivariate dimension reductionapproaches which can be applied to the integrative exploratoryanalysis of multi-omics data To demonstrate the application ofthese methods we apply multiple co-inertia analysis (MCIA) toEDA of mRNA miRNA and proteomics data of a subset of 60 celllines studied at the National Cancer Institute (NCI-60)

Introduction to dimension reduction

Dimension reduction methods arose in the early 20th century[9 10] and have continued to evolve often independently inmultiple fields giving rise to a myriad of associated termin-ology Wikipedia lists over 10 different names for PCA the mostwidely used dimension reduction approach Therefore weprovide a glossary (Table 1) and tables of methods (Tables 2ndash4)to assist beginners to the field Each of these are dimension

reduction techniques whether they are applied to one (Table 2)or multiple (Tables 3 and 4) data sets We start by introducingthe central concepts of dimension reduction

We denote matrices with boldface uppercase letters Therows of a matrix contain the observations while the columnshold the variables In an omics study the variables (alsoreferred to as features) generally measure tissue or cell attri-butes including abundance of mRNAs proteins and metabol-ites All vectors are columns vectors and are denoted withboldface lowercase letters Scalars are indicated by italic letters

Given an omics data set X which is an np matrix of nobservations and p variables it can be represented by

X frac14 x1x2 xp

(1)

where x are vectors of length n and are measurements ofmRNA or other biological variables for n observations (samples)In a typical omics study p ranges from several hundred to mil-lions Therefore observations (samples) are represented in largedimensional spaces Rp The goal of dimension reduction is toidentify a (set of) new variable(s) using a linear combination ofthe original variables such that the number of new variables ismuch smaller than p An example of such a linear combinationis shown in Equation (2)

f frac14 q1x1 thorn q2x2 thorn thorn qpxp (2)

or expressed in a matrix form

f frac14 Xq (3)

In Equations (2) and (3) f is a new variable which is often calleda latent variable or a component Depending on the scientificfield f may also be called principal axis eigenvector or latentfactor qfrac14(q1 q2 qp)T is a p-length vector of coefficients ofscalar values in which at least one of the coefficients is differentfrom zero These coefficients are also called lsquoloadingsrsquoDimension reduction analysis introduces constraints to obtaina meaningful solution we find the set of qrsquos that maximize thevariance of components frsquos In doing so a smaller number ofvariables f capture most of the variance in the data Differentoptimization and constraint criteria distinguish between differ-ent dimension reduction methods Table 2 provides a nonex-haustive list of these methods which includes PCA lineardiscriminant analysis and factor analysis

Principal component analysis

PCA is one of the most widely used dimension reductionmethods [20] Given a column centered and scaled (unit variance)matrix X PCA finds a set of new variables fifrac14Xqi where i is the ithcomponent and qi is the variable loading for the ith principalcomponent (PC superscript denotes the component or the di-mension) The variance of fi is maximized that is

arg maxqi varethXqiTHORN (4)

with the constraints that jjqijj frac14 1 and each pair of components(f if j) are orthogonal to each other (or uncorrelated ie f iTf j frac14 0for j = i)PCA can be computed using different algorithms includingeigen analysis latent variable analysis factor analysis singularvalue decomposition (SVD) [21] or linear regression [3] Among

Dimension reduction techniques | 629

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

them SVD is the most widely used approach Given X an npmatrix with rank r (r ltfrac14min[n p]) SVD decomposes X intothree matrices

X frac14 USQTsubject to the constraint that UTU frac14 QTQ frac14 I (5)

where U is an nr matrix and Q is a pr matrix The columns ofU and Q are the orthogonal left and right singular vectors re-spectively S is an rr diagonal matrix of singular values whichare proportional to the standard deviations associated with rsingular vectors The singular vectors are ordered such that

their associated variances are monotonically decreasingIn a PCA of X the PCs comprise an nr matrix F which isdefined as

F frac14 US frac14 USQTQ frac14 XQ (6)

where the columns of matrix F are the PCs and the matrix Q iscalled the loadings matrix and contains the linear combinationcoefficients of the variables for each PC (q in Equation (3))Therefore we represent the variance of X in lower dimension rThe above formula also emphasizes that Q is a matrix that

Table 1 Glossary

Term Definition

Variance The variance of a random variable measures the spread (variability) of its realizations (values of the random vari-able) The variance is always a positive number If the variance is small the values of the random variable areclose to the mean of the random variable (the spread of the data is low) A high variance is equivalent to widelyspread values of the random variable See [11]

Standard deviation The standard deviation of a random variable measures the spread (variability) of its realizations (values of the ran-dom variable) It is defined as the square root of the variance The standard deviation will have the same units asthe random variable in contrast to the variance See [11]

Covariance The covariance is an unstandardized measure about the tendency of two random variables to vary together See[12]

Correlation The correlation of two random variables is defined by the covariance of the two random variables normalized bythe product between their standard deviations It measures the linear relationship between the two randomvariables The correlation coefficient ranges between 1 and thorn1 See [12]

Inertia Inertia is a measure for the variability of the data The inertia of a set of points relative to one point P is defined bythe weighted sum of the squared distances between each considered point and the point P Correspondingly theinertia of a centered matrix (mean is equal to zero) is simply the sum of the squared matrix elements The inertiaof the matrix X defined by the metrics L and D is the weighted sum of its squared values The inertia is equal thetotal variance of X when X is centered L is the Euclidean metric and D is a diagonal matrix with diagonal elementsequal to 1n See [13]

Co-inertia The co-inertia is a global measure for the co-variability of two data sets (for example two high-dimensional randomvariables) If the data sets are centered the co-inertia is the sum of squared covariances When coupling a pair ofdata sets the co-inertia between two matrices X and Y is calculated as trace (XLXTDYRYTD) See [13]

Orthogonal Two vectors are called orthogonal if they form an angle that measures 90 degrees Generally two vectors are orthog-onal if their inner product is equal to zero Two orthogonal vectors are always linearly independent See [12]

Independent In linear algebra two vectors are called linearly independent if their liner combination is equal to zero only when allconstants of the linear combination are equal to zero See [14] In statistics two random variables are called statis-tically independent if the distribution of one of them does not affect the distribution of the other If two independ-ent random variables are added then the mean of the sum is the sum of the two mean values This is also true forthe variance The covariance of two independent variables is equal to zero See [11]

Eigenvector eigenvalue An eigenvector of a matrix is a vector that does not change its direction after a linear transformation The vector v isan eigenvector of the matrix A if Av frac14 kv k is the eigenvalue associated with the eigenvector v and it reflects thestretch of the eigenvector following the linear transformation The most popular way to compute eigenvectorsand eigenvalues is the SVD See [14]

Linear combination Mathematical expression calculated through the multiplication of variables with constants and adding the individ-ual multiplication results A linear combination of the variables x and y is ax thorn by where a and b are the constantsSee [15]

Omics The study of biological molecules in a comprehensive fashion Examples of omics data types include genomicstranscriptomics proteomics metabolomics and epigenomics [16]

Dimension reduction Dimension reduction is the mapping of data to a lower dimensional space such that redundant variance in the datais reduced or discarded enabling a lower-dimensional representation without significant loss of informationSee [17]

Exploratory data analysis EDA is the application of statistical techniques that summarize the main characteristics of data often with visualmethods In contrast to statistical hypothesis testing (confirmatory data analysis) EDA can help to generatehypotheses See [18]

Sparse vector A sparse vector is a vector in which most elements are zero A sparse loadings matrix in PCA or related methodsreduce the number of features contributing to a PC The variables with nonzero entries (features) are the lsquoselectedfeaturesrsquo See [19]

630 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

projects the observations in X onto the PCs The sum of squaredvalues of the columns in U equals 1 (Equation (5)) therefore thevariance of the ith PC di2 can be calculated as

di2 frac14 si2

n 1(7)

where si is the ith diagonal element in S The variance reflectsthe amount of information (underling structure) captured byeach PC The squared correlations between PCs and the original

variables are informative and often illustrated using a correl-ation circle plot These can be calculated by

C frac14 QD (8)

where D frac14 (d1 d2 dr)T is a diagonal matrix of the standarddeviation of the PCs

In contrast to SVD which calculates all PCs simultaneouslyPCA can also be calculated using the Nonlinear Iterative PartialLeast Squares (NIPALS) algorithm which uses an iterative

Table 2 Dimension reduction methods for one data set

Method Description Name of R function R package

PCA Principal component analysis prcompstats princompstats dudipcaade4 pcaveganPCAFactoMineR principalpsych

CA COA Correspondence analysis caca CAFactoMineR dudicoaade4NSC Nonsymmetric correspondence analysis dudinscade4PCoA MDS Principal co-ordinate analysismultiple dimensional

scalingcmdscalestats dudipcoade4 pcoaape

NMF Nonnegative matrix factorization nmfnmfnmMDS Nonmetric multidimensional scaling metaMDSvegansPCA nsPCA pPCA Sparse PCA nonnegative sparse PCA penalized PCA

(PCA with feature selection)SPCPMA spcamixOmics nsprcompnsprcomp

PMDPMANIPALS PCA Nonlinear iterative partial least squares analysis (PCA

on data with missing values)nipalsade4 pcapcaMethodsa nipalsmixOmics

pPCA bPCA Probabilistic PCA Bayesian PCA pcapcaMethodsa

MCA Multiple correspondence analysis dudiacmade4 mcaMASSICA Independent component analysis fastICAFastICAsIPCA Sparse independent PCA (combines sPCA and ICA) sipcamixOmics ipcamixOmicsplots Graphical resources R packages including scatterplot3d ggordb ggbiplotc

plotlyd explor

aAvailable in BioconductorbOn github devtoolsinstall_github (lsquofawda123ggordrsquo)cOn github devtoolsinstall_github (lsquoggbiplotrsquo lsquovqvrsquo)dOn github devtoolsinstall_github (lsquoropensciplotlyrsquo)

Table 3 Dimension reduction methods for pairs of data sets

Method Description Featureselection

R Function package

CCAa Canonical correlation analysis Limited to ngtpa No cccca CCorAveganCCAa Canonical correspondence analysis is a constrained correspondence

analysis which is popular in ecologya

No ccaade4 ccavegan cancorstats

RDA Redundancy analysis is a constrained PCA Popular in ecology No rdaveganProcrutes Procrutes rotation rotates a matrix to maximum similarity with a tar-

get matrix minimizing sum of squared differencesNo procrustesvegan procusteade4

rCCA Regularized canonical correlation No rccccasCCA Sparse CCA Yes CCApmapCCA Penalized CCA Yes spCCAspCCA supervised versionWAPLS Weighted averaging PLS regression No WAPLSrioja waplspaltranPLS Partial least squares of K-tables (multi-block PLS) No mbplsade4 plsdacaretsPLS pPLS Sparse PLS Penalized PLS Yes splsspls splsmixOmics pplspplssPLS-DA Sparse PLS-discriminant analysis Yes splsdamixOmics splsdacaretcPCA Consensus PCA No cpcamogsaCIA Coinertia analysis No coinertiaade4 ciamade4

aA source for confusion CCA is widely used as an acronym for both Canonical lsquoCorrespondencersquo Analysis and Canonical lsquoCorrelationrsquo Analysis Throughout this article

we use CCA for canonical lsquocorrelationrsquo analysis Both methods search for the multivariate relationships between two data sets Canonical lsquocorrespondencersquo analysis is

an extension and constrained form of lsquocorrespondencersquo analysis [22] Both canonical lsquocorrelationrsquo analysis and RDA assume a linear model however RDA is a con-

strained PCA (and assumes one matrix is the dependent variable and one independent) whereas canonical correlation analysis considers both equally See [23] for

more explanation

Dimension reduction techniques | 631

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

regression procedure to calculate PCs Computation can be per-formed on data with missing values and it is faster than SVDwhen applied to large matrices Furthermore NIPALS may begeneralized to discover the correlated structure in more thanone data set (see sections on the analysis of multi-omics datasets) Please refer to the Supplementary Information for add-itional details on NIPALS

Visualizing and interpreting results ofdimension reduction analysis

We present an example to illustrate how to interpret results of aPCA PCA was applied to analyze mRNA gene expression data ofa subset of cell lines from the NCI-60 panel those of melanomaleukemia and central nervous system (CNS) tumors The resultsof PCA can be easily interpreted by visualizing the observationsand variables in the same space using a biplot Figure 1A is abiplot of the first (PC1) and second PC (PC2) where points andarrows from the plot origin represent observations and genesrespectively Cell lines (points) with correlated gene expressionprofiles have similar scores and are projected close to eachother on PC1 and PC2 We see that cell lines from the same ana-tomical location are clustered

Both the direction and length of the mRNA gene expressionvectors can be interpreted Gene expression vectors point in thedirection of the latent variable (PC) to which it is most similar(squared correlation) Gene vectors with the same direction (egFAM69B PNPLA2 PODXL2) have similar gene expression pro-files The length of the gene expression vector is proportional tothe squared multiple correlation between the fitted values forthe variable and the variable itself

A gene expression vector and a cell line projected in the samedirection from the origin are positively associated For example inFigure 1A FAM69B PNPLA2 PODXL2 are active (have higher geneexpression) in melanoma cell lines Similarly genes DPPA5 andRPL34P1 are among those genes that are highly expressed inmost leukemia cell lines By contrast genes WASL and HEXBpoint in the opposite direction to most leukemia cell lines indicat-ing low association In Figure 1B it is clear that these genes arenot expressed in leukemia cell lines and are colored blue in theheatmap (these are colored dark gray in grayscale)

The sum of the squared correlation coefficients between avariable and all the components (calculated in equation 8)equals 1 Therefore variables are often shown within a correl-ation circle (Figure 1C) Variables positioned on the unit circlerepresent variables that are perfectly represented by the two di-mensions displayed Those not on the unit circle may requireadditional components to be represented

In most analyses only the first few PCs are plotted andstudied as these explain the most variant trends in the dataGenerally the selection of components is subjective anddepends on the purpose of the EDA An informal elbow test mayhelp to determine the number of PCs to retain and examine [2425] From the scree plot of PC eigenvalues in Figure 1D we mightdecide to examine the first three PCs because the decrease inPC variance becomes relatively moderate after PC3 Anotherapproach that is widely used is to include (or retain) PCs thatcumulatively capture a certain proportion of variance for ex-ample 70 of variance is modeled with three PCs If a parsi-mony model is preferred the variance proportion cutoff can beas low as 50 [24] More formal tests including the Q2 statisticare also available (for details see [24]) In practice the first com-ponent might explain most of the variance and the remainingaxes may simply be attributed to noise from technical or biolo-gical sources in a study with low complexity (eg cell line repli-cates of controls and one treatment condition) However acomplex data set (for example a set of heterogeneous tumors)may require multiple PCs to capture the same amount ofvariance

Different dimension reduction approaches areoptimized for different data

There are many dimension reduction approaches related to PCA(Table 2) including principal co-ordinate analysis (PCoA) cor-respondence analysis (CA) and nonsymmetrical correspond-ence analysis (NSCA) These may be computed by SVD butdiffer in how the data are transformed before decomposition[21 26 27] and therefore each is optimized for specific dataproperties PCoA (also known as Classical MultidimensionalScaling) is versatile as it is a SVD of a distance matrix that canbe applied to decompose distance matrices of binary count or

Table 4 Dimension reduction methods for multiple (more than two) data sets

Method Description Featureselection

Matchedcases

R Function package

MCIA Multiple coinertia analysis No No mciaomicade4 mcoaade4gCCA Generalized CCA No No regCCAdmtrGCCA Regularized generalized CCA No No regCCAdmt rgccargcca

wrapperrgccamixOmicssGCCA Sparse generalized canonical

correlation analysisYes No sgccargcca wrappersgccamixOmics

STATIS Structuration des Tableaux aTrois Indices de la Statistique(STATIS) Family of methods whichinclude X-statis

No No statisade4

CANDECOMPPARAFAC Tucker3

Higher order generalizations of SVDand PCA Require matchedvariables and cases

No Yes CPThreeWay T3ThreeWay PCAnPTaKCANDPARAPTaK

PTA Partial triadic analysis No Yes ptaade4statico Statis and CIA (find structure

between two pairs of K-tables)No No staticoade4

632 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

continuous data It is frequently applied in the analysis ofmicrobiome data [28]

PCA is designed for the analysis of multi-normal distributeddata If data are strongly skewed or extreme outliers are pre-sent the first few axes may only separate those objects with ex-treme values instead of displaying the main axes of variation Ifdata are unimodal or display nonlinear trends one may see dis-tortions or artifacts in the resulting plots in which the secondaxis is an arched function of the first axis In PCA this is calledthe horseshoe effect and it is well described including illustra-tions in Legendre and Legendre [3] Both nonmetric Multi-Dimensional Scaling (MDS) and CA perform better than PCA inthese cases [26 29] Unlike PCA CA can be applied to sparsecount data with many zeros Although designed for contingencytables of nonnegative count data CA and NSCA decomposea chi-squared matrix [30 31] but have been successfullyapplied to continuous data including gene expression and pro-tein profiles [32 33] As described by Fellenberg et al [33] gene

and protein expression can be seen as an approximation of thenumber of corresponding molecules present in the cell during acertain measured condition Additionally Greenacre [27]emphasized that the descriptive nature of CA and NSCA allowstheir application on data tables in general not only on countdata These two arguments support the suitability of CA andNSCA as analysis methods for omics data While CA investi-gates symmetric associations between two variables NSCA cap-tures asymmetric relations between variables Spectral mapanalysis is related to CA and performs comparably with CAeach outperforming PCA in the identification of clusters of leu-kemia gene expression profiles [26] All dimension reductionmethods can be formulated in terms of the duality diagramDetails on this powerful framework are included in theSupplementary Information

Nonnegative matrix factorization (NMF) [34] forces a positiveor nonnegative constraint on the resulting data matrices andsimilar to Independent Component Analysis (ICA) [35] there is no

Figure 1 Results of a PCA analysis of mRNA gene expression data of melanoma (ME) leukemia (LE) and central nervous system (CNS) cell lines from the NCI-60 cell

line panel All variables were centered and scaled Results show (A) a biplot where observations (cell lines) are points and gene expression profiles are arrows (B) a

heatmap showing the gene expression of the same 20 genes in the cell lines red to blue scale represent high to low gene expression (light to dark gray represent high

to low gene expression on the black and white figure) (C) correlation circle (D) variance barplot of the first ten PCs To improve the readability of the biplot some labels

of the variables (genes) in (A) have been moved slightly A colour version of this figure is available online at BIB online httpbiboxfordjournalsorg

Dimension reduction techniques | 633

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

requirement for orthogonality or independence in the compo-nents The nonnegative constraint guarantees that only the addi-tive combinations of latent variables are allowed This may bemore intuitive in biology where many biological measurements(eg protein concentrations count data) are represented by posi-tive values NMF is described in more detail in the SupplementalInformation ICA was recently applied to molecular subtype dis-covery in bladder cancer [4] Biton et al [4] applied ICA to geneexpression data of 198 bladder cancers and examined 20 compo-nents ICA successfully decomposed and extracted multiplelayers of signal from the data The first two components wereassociated with batch effects but other components revealed newbiology about tumor cells and the tumor microenvironmentThey also applied ICA to non-bladder cancers and by comparingthe correlation between components were able to identify a setof bladder cancer-specific components and their associatedgenes

As omics data sets tend to have high dimensionality (p n)it is often useful to reduce the number of variables Several re-cent extensions of PCA include variable selection often via aregularization step or L-1 penalization (eg Least AbsoluteShrinkage and Selection Operator LASSO) [36] The NIPALS algo-rithm uses an iterative regression approach to calculate thecomponents and loadings which is easily extended to have asparse operator that can be included during regression on thecomponent [37] A cross-validation approach can be used to de-termine the level of sparsity Sparse penalized and regularizedextensions of PCA and related methods have been describedrecently [36 38ndash41]

Integrative analysis of two data sets

One-table dimension reduction methods have been extended tothe EDA of two matrices and can simultaneously decomposeand integrate a pair of matrices that measure different variableson the same observations (Table 3) Methods include general-ized SVD [42] Co-Inertia Analysis (CIA) [43 44] sparse or penal-ized extensions of Partial Least Squares (PLS) CanonicalCorrespondence analysis (CCA) and Canonical CorrelationAnalysis (CCA) [36 45ndash47] Note both canonical correspondenceanalysis and canonical correlation analysis are referred to bythe acronym CCA Canonical correspondence analysis is a con-strained form of CA that is widely used in ecological statistics[46] however it is yet to be adopted by the genomics commu-nity in analysis of pairs of omics data By contrast severalgroups have applied extensions of canonical correlation ana-lysis to omics data integration Therefore in this review we useCCA to describe canonical correlation analysis

Canonical correlation analysis

Two omics data sets X (dimension npx) and Y (dimension npy)can be expressed by the following latent component decompos-ition problem

X frac14 FxQTx thorn Ex

Yfrac14 FyQTy thorn Ey

(9)

where Fx and Fy are nr matrices with r columns of compo-nents that explain the co-structure between X and Y The col-umns of Qx and Qy are variable loading vectors for X and Yrespectively Ex and Ey are error terms

Proposed by Hotelling in 1936 [47] CCA searches for associ-ations or correlations among the variables of X and Y [47] bymaximizing the correlation between Xqx

i and Yqyi

for the ith component arg maxqixqi

ycorethXqi

x YqiyTHORN (10)

In CCA the components Xqxi and Yqy

i are called canonical vari-ates and their correlations are the canonical correlations

Sparse canonical correlation analysis

The main limitation of applying CCA to omics data is that itrequires an inversion of the correlation or covariance matrix[38 49 50] which cannot be calculated when the number ofvariables exceeds the number of observations [46] In high-dimensional omics data where p n application of these meth-ods requires a regularization step This may be accomplished byadding a ridge penalty that is adding a multiple of the identitymatrix to the covariance matrix [51] A sparse solution of theloading vectors (Qx and Qy) filters the number of variables andsimplifies the interpretation of results For this purpose penal-ized CCA [52] sparse CCA [53] CCA-l1 [54] CCA elastic net (CCA-EN) [45] and CCA-group sparse [55] have been introduced andapplied to the integrative analysis of two omics data setsWitten et al [36] provided an elegant comparison of various CCAextensions accompanied by a unified approach to compute bothpenalized CCA and sparse PCA In addition Witten andTibshirani [54] extended sparse CCA into a supervised frame-work which allows the integration of two data sets with aquantitative phenotype for example selecting variables fromboth genomics and transcriptomics data and linking them todrug sensitivity data

Partial least squares analysis

PLS is an efficient dimension reduction method in the analysisof high-dimensional omics data PLS maximizes the covariancerather than the correlation between components making itmore resistant to outliers than CCA Additionally PLS does notsuffer from the p n problem as CCA does Nonetheless asparse solution is desired in some cases For example Le Caoet al [56] proposed a sparse PLS method for the feature selectionby introducing a LASSO penalty for the loading vectors In a re-cent comparison sPLS performed similarly to sparse CCA [45]There are many implementations of PLS which optimize differ-ent objective functions with different constraints several ofwhich are described by Boulesteix et al [57]

Co-Inertia analysis

CIA is a descriptive nonconstrained approach for coupling pairsof data matrices It was originally proposed to link two ecolo-gical tables [13 58] but has been successfully applied in integra-tive analysis of omics data [32 43] CIA is implemented andformulated under the duality diagram framework(Supplementary Information) CIA is performed in two steps (i)application of a dimension reduction technique such as PCACA or NSCA to the initial data sets depending on the type ofdata (binary categorical discrete counts or continuous data)and (ii) constraining the projections of the orthogonal axes suchthat they are maximally covariant [43 58] CIA does not requirean inversion step of the correlation or covariance matrix thusit can be applied to high-dimensional genomics data withoutregularization or penalization

634 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

Though closely related to CCA [49] CIA maximizes thesquared covariance between the linear combination of the pre-processed matrix that is

for the ith dimension argmaxqixqi

ycov2ethXqi

xYqiyTHORN (11)

Equation (11) can be decomposed as

cov2 XqxiYqy

i

frac14 cor2 XqxiYqy

i

var Xqxi

var Xqy

i

(12)

CIA decomposition of covariance maximizes the varianceand the correlation between matrices and thus is less sensitiveto outliers The relationship between CIA Procrustes analysis[13] and CCA have been well described [49] A comparison be-tween sCCA (with elastic net penalty) sPLS and CIA is providedby Le Cao et al [45] In summary CIA and sPLS both maximizethe covariance between eigenvectors and efficiently identifyjoint and individual variance in paired data In contrast CCA-EN maximizes the correlation between eigenvectors and willdiscover effects present in both data sets but may fail to dis-cover strong individual effects [45] Both sCCA and sPLS aresparse methods that select similar subsets of variables whereasCIA does not include a feature selection step thus in terms offeature selection results of CIA are more likely to contain re-dundant information in comparison with sparse methods [45]

Similar to classical dimension reduction approaches thenumber of dimensions to be examined needs to be consideredand can be visualized using a scree plot (similar to Figure 1D)Components may be evaluated by their associated variance [25]the elbow test or Q2 statistics as described previously Forexample the Q2 statistic was applied to select the number ofdimensions in the predictive mode of PLS [56] In additionwhen a sparse factor is introduced in the loading vectors cross-validation approaches may be used to determine the number ofvariables selected from each pair of components Selection ofthe number of components and optimization of these param-eters is still an open research question and is an active area ofresearch

Integrative analysis of multi-assay data

There is a growing need to integrate more than two data sets in genomics. Generalizations of dimension reduction methods to three or more data sets are sometimes called K-table methods [59–61], and a number of them have been applied to multi-assay data (Table 4). Simultaneous decomposition and integration of multiple matrices is more complex than the analysis of a single data set or of paired data, because each data set may have a different number of variables, scale or internal structure, and thus a different variance. This might produce global scores that are dominated by one or a few data sets. Therefore, data are preprocessed before decomposition. Preprocessing is often performed on two levels. On the variable level, variables are often centered and normalized so that their sum of squared values or variance equals 1; this enables all variables to contribute equally to the total inertia (the sum of squares of all elements) of a data set. However, the number of variables may vary between data sets, or filtering/preprocessing steps may generate data sets that have a higher variance contribution to the final result. Therefore, a data set-level normalization is also required. In the simplest K-table analysis, all matrices have equal weights. More commonly, data sets are weighted by their expected contribution or expected data quality, for example by the square root of their total inertia or by the square root of the number of columns of each data set [62]. Alternatively, greater weights can be given to smaller or less redundant matrices, to matrices that have more stable predictive information or to those that share more information with other matrices. Such weighting approaches are implemented in multiple factor analysis (MFA), principal covariates regression [63] and STATIS. A sketch of the two preprocessing levels is given below.
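The following is a minimal sketch (not the paper's code) of the two preprocessing levels: variable-level centering and scaling followed by weighting each table by the square root of its total inertia.

```r
## Sketch only: two-level preprocessing before a K-table decomposition.
preprocess_tables <- function(tables) {
  lapply(tables, function(x) {
    x <- scale(x, center = TRUE, scale = TRUE)  # variable level: unit variance per column
    x / sqrt(sum(x^2))                          # table level: divide by sqrt of total inertia
  })
}

set.seed(1)
k_tables <- list(mrna    = matrix(rnorm(30 * 200), nrow = 30),
                 protein = matrix(rnorm(30 * 80),  nrow = 30))
k_tables <- preprocess_tables(k_tables)
sapply(k_tables, function(x) sum(x^2))          # each table now has total inertia 1
```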

The simplest multi-omics data integration arises when the data sets have the same variables and observations, that is, matched rows and matched columns. In genomics, such data could be produced when variables from different multi-assay data sets are mapped to a common set of genomic coordinates or gene identifiers, thus generating data sets with matched variables and matched observations. Alternatively, a repeated or longitudinal analysis of the same samples and the same variables could produce such data (note that these dimension reduction approaches do not model the time correlation between the different data sets). There is a history of such analyses in ecology, where counts of species and environmental variables are measured over different seasons [49, 64, 65], and in psychology, where different standardized tests are administered multiple times to study populations [66, 67]. Analysis of such variables × samples × time data is called, among other names, three-mode decomposition, triadic cube or three-way table analysis, tensor decomposition, three-way PCA, three-mode PCA, three-mode factor analysis, the Tucker-3 model, Tucker3, TUCKALS3 or multi-block analysis (Table 4). The relationships between the various tensor or higher-order decompositions for multi-block analysis are reviewed by Kolda and Bader [68].

More frequently, we need to find associations in multi-assay data that have matched observations but different and unmatched sets of variables. For example, TCGA generated miRNA and mRNA transcriptome (RNA-seq, microarray), DNA copy number, DNA mutation, DNA methylation and proteomics molecular profiles on each tumor. The NCI-60 and Cancer Cell Line Encyclopedia projects have measured pharmacological compound profiles in addition to exome sequencing and transcriptomic profiles. Methods that can be applied to EDA of a K-table of multi-assay data with different variables include MCIA, MFA, generalized CCA (GCCA) and consensus PCA (CPCA).

The K-table methods can be generally expressed by the following model:

\begin{aligned}
X_1 &= F Q_1^T + E_1,\\
&\;\;\vdots\\
X_k &= F Q_k^T + E_k,\\
&\;\;\vdots\\
X_K &= F Q_K^T + E_K
\end{aligned} \qquad (13)

where there are K matrices or omics data sets X_1, …, X_K. For convenience, we assume that the rows of the X_k share a common set of observations, but the columns of each X_k may correspond to different variables. F is the 'global score' matrix; its columns are the PCs and are interpreted similarly to PCs from a PCA of a single data set. The global score matrix, which is identical in all K decompositions, integrates information from all data sets; it is therefore not specific to any single data set but rather represents the common pattern defined by all data sets. The matrices Q_k, with k ranging from 1 to K, are the loading or coefficient matrices; a high positive value indicates a strong positive contribution of the corresponding variable to the 'global score'. While the above methods are formulated for multiple data sets with different variables but the same observations, most can be similarly formulated for multiple data sets with the same variables but different observations [69].
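To make the structure of Equation (13) concrete, here is an illustrative simulation (not taken from the paper) in which K tables share a common global score matrix but carry their own loadings and noise.

```r
## Sketch only: simulate K tables that follow X_k = F Q_k^T + E_k with a
## shared global score matrix (30 observations, 2 latent dimensions).
set.seed(1)
n <- 30; r <- 2
Fg <- matrix(rnorm(n * r), nrow = n)          # shared global scores F
p  <- c(100, 50, 20)                          # different numbers of variables per table

X <- lapply(p, function(pk) {
  Qk <- matrix(rnorm(pk * r), nrow = pk)      # table-specific loadings Q_k
  Fg %*% t(Qk) + matrix(rnorm(n * pk, sd = 0.5), n, pk)   # plus noise E_k
})
sapply(X, dim)                                # matched rows, unmatched columns
```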

Multiple co-inertia analysis

MCIA is an extension of CIA that analyzes multiple matrices by optimizing a covariance criterion [60, 70]. MCIA simultaneously projects K data sets into the same dimensional space. Instead of maximizing the covariance between scores from two data sets, as in CIA, the optimization problem solved by MCIA is

\arg\max_{q_1^i, \ldots, q_k^i, \ldots, q_K^i} \; \sum_{k=1}^{K} \operatorname{cov}^2\!\left(X_k^i q_k^i,\; X^i q^i\right) \qquad (14)

for dimension i, with the constraints that \lVert q_k^i \rVert = \operatorname{var}\!\left(X^i q^i\right) = 1,

where X = (X_1 \,|\, \cdots \,|\, X_k \,|\, \cdots \,|\, X_K) and q^i holds the corresponding loading values (the 'global' loading vector) [60, 69, 70].

MCIA derives a set of 'block scores' X_k^i q_k^i using linear combinations of the original variables from each individual matrix. The global score X^i q^i is then defined as a linear combination of the 'block scores'. In practice, the global scores represent the correlated structure defined by multiple data sets, whereas the concordance and discrepancy between these data sets may be revealed by the block scores (for details see the 'Example case study' section). MCIA may be calculated with an ad hoc extension of the NIPALS PCA [71]. This algorithm starts with an initialization step in which the global scores and the block loadings for the first dimension are computed. The residual matrices are then calculated in an iterative step by removing the variance explained by the variable loadings (the 'deflation' step). For higher-order solutions, the same procedure is applied to the residual matrices and reiterated until the desired number of dimensions is reached; the computation time therefore depends strongly on the number of desired dimensions. MCIA is implemented in the R package omicade4 and has been applied to the integrative analysis of transcriptomic and proteomic data sets from the NCI-60 cell lines [60].
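A minimal sketch of an MCIA run with omicade4 is shown below; it assumes the Bioconductor package and its bundled NCI60_4arrays example data, with arguments as given in the package documentation (check them against the installed version).

```r
## Sketch only: MCIA on a list of omics tables whose sample columns are matched.
library(omicade4)

data(NCI60_4arrays)                        # example list of omics tables (features x samples)
mcoin <- mcia(NCI60_4arrays, cia.nf = 2)   # MCIA with two global dimensions
plot(mcoin)                                # sample space, variable space and eigenvalue plots
```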

Generalized canonical correlation analysis

GCCA [71] is a generalization of CCA to K-table analysis [73–75] and has also been applied to the analysis of omics data [36, 76]. Typically, MCIA and GCCA will produce similar results (for a more detailed comparison see [60]). GCCA uses a different deflation strategy than MCIA: it calculates the residual matrices by removing the variance with respect to the 'block scores' (instead of the 'variable loadings' used by MCIA or the 'global scores' used by CPCA; see later). When applied to omics data where p ≫ n, a variable selection step is often integrated within the GCCA approach, which cannot be done in the case of MCIA. In addition, as block scores are better representations of a single data set (in contrast to the global score), GCCA is more likely to find variables common across data sets. Witten and Tibshirani [54] applied sparse multiple CCA to analyze gene expression and copy number variation (CNV) data from diffuse large B-cell lymphoma patients and successfully identified 'cis interactions' that are up-regulated in both the CNV and mRNA data.

Consensus PCA

CPCA is closely related to GCCA and MCIA but has had less exposure in the omics data community. CPCA optimizes the same criterion as GCCA and MCIA and is subject to the same constraints as MCIA [71]. The deflation step of CPCA relies on the 'global score'; as a result, it guarantees the orthogonality of the global scores only and tends to find common patterns across the data sets. This characteristic makes it more suitable for the discovery of joint patterns in multiple data sets, such as the joint clustering problem.

Regularized generalized canonical correlation analysis

Recently, Tenenhaus and Tenenhaus [69, 74] proposed regularized generalized CCA (RGCCA), which provides a unified framework for the different K-table multivariate methods. The RGCCA model introduces extra parameters, in particular a shrinkage parameter and a linkage parameter. The linkage parameter is defined so that the connections between matrices can be customized. The shrinkage parameter ranges from 0 to 1: setting it to 0 enforces the correlation criterion (the criterion used by GCCA), whereas a value of 1 applies the covariance criterion (used by MCIA and CPCA); values between 0 and 1 lead to a compromise between the two. In practice, the correlation criterion is better at explaining the correlated structure across data sets while discarding the variance within each individual data set. The extra parameters make RGCCA highly versatile: GCCA, CIA and CPCA can be described as special cases of RGCCA (see [69] and Supplementary Information). In addition, RGCCA integrates a feature selection procedure named sparse GCCA (SGCCA). Tenenhaus et al. [76] applied SGCCA to combine gene expression, comparative genomic hybridization and a qualitative phenotype measured on a set of 53 children with glioma. Sparse multiple CCA [54] and SGCCA [76] are available in the R packages PMA and RGCCA, respectively. Similarly, a higher-order implementation of sparse PLS is described in Zhao et al. [77].
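A hedged sketch of an RGCCA call on three blocks is given below. The block sizes, the linkage (design) matrix and the shrinkage values are illustrative assumptions, and the argument names follow the original rgcca() interface, which may differ in newer versions of the RGCCA package.

```r
## Sketch only: RGCCA on three blocks with matched samples (rows).
library(RGCCA)

set.seed(1)
n <- 40
A <- list(mrna  = matrix(rnorm(n * 150), nrow = n),
          cgh   = matrix(rnorm(n * 80),  nrow = n),
          pheno = matrix(rnorm(n * 5),   nrow = n))

C <- matrix(c(0, 0, 1,      # linkage: mrna and cgh each connected to pheno only
              0, 0, 1,
              1, 1, 0), nrow = 3, byrow = TRUE)

tau <- c(1, 1, 0)           # shrinkage per block: 1 ~ covariance criterion, 0 ~ correlation criterion

fit <- rgcca(A, C, tau = tau, ncomp = rep(2, 3), scheme = "centroid")
```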

Joint NMF

NMF has also been extended to jointly factorize multiple matrices. In joint NMF, the values in the global score F and in the coefficient matrices (Q_1, …, Q_K) are nonnegative, and there is no explicit definition of the block loadings. An optimization algorithm is applied to minimize an objective function, typically the sum of squared errors, \sum_{k=1}^{K} \lVert E_k \rVert^2. This approach can be considered a nonnegative implementation of PARAFAC, although it has also been implemented using the Tucker model [78–80]. Zhang et al. [81] applied joint NMF to a three-way analysis of DNA methylation, gene expression and miRNA expression data to identify modules in each of these regulatory layers that are associated with each other.
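The hedged sketch below conveys the idea of a shared nonnegative score matrix: the K nonnegative tables (with matched rows) are concatenated column-wise and factorized with the NMF package, so that W is common to all tables and the coefficient matrix splits into per-table blocks. Dedicated joint NMF implementations add per-table scaling and sparsity options that are omitted here.

```r
## Sketch only: a joint NMF-style factorization via column-wise concatenation.
library(NMF)

set.seed(1)
n  <- 30
X1 <- matrix(rexp(n * 100), nrow = n)    # e.g. methylation (nonnegative values)
X2 <- matrix(rexp(n * 60),  nrow = n)    # e.g. miRNA expression (nonnegative values)

fit <- nmf(cbind(X1, X2), rank = 3)      # shared factorization of both tables
W   <- basis(fit)                        # global nonnegative scores (n x 3)
H   <- coef(fit)                         # coefficients; columns 1-100 belong to X1, the rest to X2
```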

Advantages of dimension reduction when integrating multi-assay data

Dimension reduction or latent variable approaches provide EDA, integrate multi-assay data, highlight global correlations across data sets and discover outliers or batch effects in individual data sets. They also facilitate downstream analysis of both observations and variables (genes). Compared with cluster analysis of individual data sets, cluster analysis of the global score matrix (the F matrix) is robust, as it aggregates observations across data sets and is less likely to reflect a technical or batch effect of a single data set. Similarly, dimension reduction of multi-assay data facilitates downstream gene set, pathway and network analysis of the variables. MCIA transforms the variables from each data set onto the same scale, and their loadings (the Q matrices) rank the variables by their contribution to the global data structure (Figure 2). Meng et al. report that pathway or gene set enrichment analysis (GSEA) of the transformed variables is more sensitive than GSEA of each individual data set. This is both because of the re-weighting and transformation of variables and because GSEA on the combined data has greater coverage of variables (genes), thus compensating for missing or unreliable information in any single data set. For example, in Figure 2B we integrate and transform mRNA, proteomics and miRNA data onto the same scale, allowing us to extract and study the union of all variables.

Example case study

To demonstrate the integration of multiple data sets using dimension reduction, we applied MCIA to mRNA, miRNA and proteomics expression profiles of melanoma, leukemia and CNS cell lines from the NCI-60 panel. The graphical output from this analysis, comprising plots of the sample space, variable space and data weighting space, is provided in Figure 2A, B and E. The eigenvalues can be interpreted similarly to those of PCA: a higher eigenvalue contributes more information to the global score. As with PCA, researchers may be subjective in their selection of the number of components [24]. The scree plot in Figure 2D shows the eigenvalue of each global score. In this case, the first two eigenvalues were markedly larger, so we visualized the cell lines and variables on PC1 and PC2.

In the observation space (Figure 2A), assay data (mRNA, miRNA and proteomics) are distinguished by shape. The coordinates of each cell line in each assay (F_k in Figure 2A) are connected by lines to the global scores (F); short lines between the points and a cell line's global score reflect high concordance in that cell line's data. Most cell lines have concordant information between data sets (mRNA, miRNA, protein), as indicated by relatively short lines. In addition, the RV coefficient [82, 83], a generalized Pearson correlation coefficient for matrices, may be used to estimate the correlation between two transformed data sets.

Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) Plot of the first two components in sample space (sample 'type' is coded by point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a 'star' in which the three omics data points for that cell line are connected by lines to a center point, the global score (F) for that cell line; the shorter the lines, the higher the concordance between the data types and the global structure. (B) Variable space of MCIA. A variable that is highly expressed in a cell line is projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are strongly associated with the cancer tissue of origin. (C) Correlation coefficients of the proteome profile of SR with those of the other cell lines; the SR proteome profile is more correlated with the melanoma cell lines, suggesting a possible technical issue with the LE_SR proteomics data. (D) Scree plot of the eigenvalues. (E) Plot of the data weighting space. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org


The RV coefficient takes values between 0 and 1, where a higher value indicates greater co-structure. In this example we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].
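For reference, the RV coefficient between two column-centered matrices with matched rows can be computed directly from their configuration (cross-product) matrices; the short function below is a generic sketch rather than the implementation used in the paper (ade4's RV.rtest offers a tested alternative with a permutation test).

```r
## Sketch only: RV coefficient between two matrices with matched rows.
rv_coef <- function(X, Y) {
  X  <- scale(X, center = TRUE, scale = FALSE)
  Y  <- scale(Y, center = TRUE, scale = FALSE)
  Sx <- tcrossprod(X)                    # n x n configuration matrix X X^T
  Sy <- tcrossprod(Y)
  sum(Sx * Sy) / sqrt(sum(Sx * Sx) * sum(Sy * Sy))
}

set.seed(1)
rv_coef(matrix(rnorm(30 * 100), 30), matrix(rnorm(30 * 50), 30))
```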

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two groups (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to the melanoma than to the leukemia cell lines. We therefore examined within-tumor-type correlations to the SR cell line (Figure 2C) and observed that the SR proteomics data had a higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data on the SR cell line (Figure 2A and C).

MCIA projects all variables into the same space. The variable space (Q_1, …, Q_K) is visualized in Figure 2B. Variables and samples projected in the same direction are associated, which allows one to select, from each data set, the variables most strongly associated with specific observations for subsequent analysis. In our previous study [85], we showed that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched for melanogenesis functions, and that the genes/proteins highly weighted on the protein side are highly enriched for T-cell or immune-related functions.

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86–88], had high weights on the positive ends of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development, and miR-223 is regulated by the Notch and NF-kB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA most strongly associated with the CNS cell lines was miR-409, which is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, the CNS cell lines are characterized by a pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (corresponding to the melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506 to be strongly associated with the melanoma cell lines; these miRNAs are reported to initiate melanocyte transformation and promote melanoma growth [85].

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. These data are biologically complex; in addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for complex multi-omics data are therefore needed.

We have described emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in the variable loadings may help in the selection of variables for downstream analysis. Few comparisons of the different methods exist, and the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful for this, but it remains an open area of research.

Another limitation of these methods is that, although the variables may differ among data sets, the observations need to be matchable. Therefore, researchers need a careful experimental design in the early stages of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines a Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].

The number of variables in genomics data is a challenge to traditional EDA visualization tools, most of which were designed for data sets with far fewer variables. Within R, new packages, including ggord, are being developed, and dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, interpretation of long lists of biological variables (genes, proteins, miRNAs) remains challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotation, such as Gene Ontology or Reactome, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60].
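As an illustration of that last point, the toy sketch below projects a binary gene-by-gene set annotation matrix onto PCA variable loadings to obtain a per-component score for each gene set; the gene sets and data are simulated, and the mogsa Bioconductor package implements a complete multi-omics version of this idea.

```r
## Sketch only: scoring gene sets by projecting annotations onto PCA loadings.
set.seed(1)
genes <- paste0("g", 1:200)
expr  <- matrix(rnorm(30 * 200), nrow = 30, dimnames = list(NULL, genes))

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

## Binary annotation matrix: genes (rows) x gene sets (columns), two toy sets
gs <- cbind(setA = genes %in% genes[1:40],
            setB = genes %in% genes[41:90]) * 1
rownames(gs) <- genes

## Average the loadings of each set's member genes on the first three PCs
gs_scores <- t(sweep(gs, 2, colSums(gs), "/")) %*% pca$rotation[, 1:3]
gs_scores
```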

Conclusion

Dimension reduction methods have a long history, and many similar methods have been developed in parallel by multiple fields. In this review we provided an overview of dimension reduction techniques that are well established but may be new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension reduction methods, which can be applied to exploratory data analysis of a single data set or to integrated analysis of a pair or multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.


• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNA, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of consensus and incongruence within and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-data set methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway analysis and other downstream analyses.

R Supplement

R code to re-generate all figures in this article is available as a Supplementary File. Data and code are also available from the GitHub repository https://github.com/aedin/NCI60Example

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

GGT was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. AC was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program through the Breast Cancer Research Program under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

References

1 Brazma A Culhane AC Algorithms for gene expression

analysis In Jorde LB Little PFR Dunn MJ Subramaniam S(eds) Encyclopedia of Genetics Genomics Proteomics andBioinformatics London John Wiley amp Sons 2005 3148ndash59

2 Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

3 Legendre P Legendre LFJ Numerical Ecology AmsterdamElsevier Science 3rd edition 2012

4 Biton A Bernard-Pierrot I Lou Y et al Independent compo-nent analysis uncovers the landscape of the bladder tumortranscriptome and reveals insights into luminal and basalsubtypes Cell Rep 201491235ndash45

5 Hoadley KA Yau C Wolf DM et al Multiplatform analysis of12 cancer types reveals molecular classification within andacross tissues of origin Cell 2014158929ndash44

6 Verhaak RG Tamayo P Yang JY et al Prognostically relevantgene signatures of high-grade serous ovarian carcinomaJ Clin Invest 2013123517ndash25

7 Senbabaoglu Y Michailidis G Li JZ Critical limitations of con-sensus clustering in class discovery Sci Rep 201446207

8 Gusenleitner D Howe EA Bentink S et al iBBiG iterativebinary bi-clustering of gene sets Bioinformatics 2012282484ndash92

9 Pearson K On lines and planes of closest fit to systems ofpoints in space Philos Magazine Series 19016(2)559ndash72

10Hotelling H Analysis of a complex of statistical variables intoprincipal components J Educ Psychol 193324417

11Bland M An Introduction to Medical Statistics 3rd edn OxfordNew York Oxford University Press 2000 xvi 405

12Warner RM Applied Statistics From Bivariate throughMultivariate Techniques London SAGE Publications Inc2nd Edition 2012 1004

13Dray S Chessel D Thioulouse J Co-inertia analysis and thelinking of ecological data tables Ecology 2003843078ndash89

14Golub GH Loan CF Matrix Computations Baltimore JHU Press2012

15Anthony M Harvey M Linear Algebra Concepts and MethodsCambridge Cambridge University Press 1st edition 2012149ndash60

16McShane LM Cavenagh MM Lively TG et al Criteria for theuse of omics-based predictors in clinical trials explanationand elaboration BMC Med 201311220

17Guyon I Elisseeff A An introduction to variable and featureselection J Mach Learn Res 200331157ndash82

18Tukey JW Exploratory Data Analysis Boston Addison-WesleyPublishing Company 1977

19Hogben L Handbook of Linear Algebra Boca Raton Chapman ampHallCRC Press 1st edition 2007 20

20Ringner M What is principal component analysis NatBiotechnol 200826303ndash4

21Wall ME Rechtsteiner A Rocha LM (2003) Singular value de-composition and principal component analysis In A PracticalApproach to Microarray Data Analysis Berrar DP DubitzkyW Granzow M (eds) Norwell MA Kluwer 2003 91ndash109

22 terBraak CJF Prentice IC A theory of gradient analysis AdvEcol Res 198818271ndash313

23 terBraak CJF Smilauer P CANOCO Reference Manual andUserrsquos Guide to Canoco for Windows Software for CanonicalCommunity Ordination (version 4) Microcomputer Power(Ithaca NY USA) httpwwwmicrocomputerpowercom1998352

24Abdi H Williams LJ Principal Component Analysis HobokenJohn Wiley amp SonsInterdisciplinary Reviews ComputationalStatistics 20102(4)433ndash59

25Dray S On the number of principal components a test ofdimensionality based on measurements of similarity be-tween matrices Comput Stat Data Anal 2008522228ndash37

26Wouters L Gohlmann HW Bijnens L et al Graphical explor-ation of gene expression data a comparative study of threemultivariate methods Biometrics 2003591131ndash9

27Greenacre M Correspondence Analysis in Practice Boca RatonChapman and HallCRC 2 edition 2007 201ndash11

28Goodrich JK Rienzi DSC Poole AC et al Conducting a micro-biome study Cell 2014158250ndash62

29Fasham MJR A comparison of nonmetric multidimensionalscaling principal components and reciprocal averaging forthe ordination of simulated coenoclines and coenoplanesEcology 197758551ndash61


30Beh EJ Lombardo R Correspondence Analysis Theory Practiceand New Strategies Hoboken John Wiley amp Sons 1st edition2014 120ndash76

31Greenacre MJ Theory and Applications of Correspondence AnalysisWaltham MA Academic Press 3rd Edition 1993 83ndash125

32Fagan A Culhane AC Higgins DG A multivariate analysis ap-proach to the integration of proteomic and gene expressiondata Proteomics 200772162ndash71

33Fellenberg K Hauser NC Brors B et al Correspondence ana-lysis applied to microarray data Proc Natl Acad Sci USA20019810781ndash6

34Lee DD Seung HS Learning the parts of objects by non-negative matrix factorization Nature 1999401788ndash91

35Comon P Independent component analysis a new conceptSignal Processing 199436287ndash314

36Witten DM Tibshirani R Hastie T A penalized matrix decom-position with applications to sparse principal componentsand canonical correlation analysis Biostatistics 200910515ndash34

37Shen H Huang JZ Sparse principal component analysis viaregularized low rank matrix approximation J Multivar Anal2008991015ndash34

38Lee S Epstein MP Duncan R et al Sparse principal componentanalysis for identifying ancestry-informative markers in gen-ome-wide association studies Genet Epidemiol 201236293ndash302

39Sill M Saadati M Benner A Applying stability selection toconsistently estimate sparse principal components in high-dimensional molecular data Bioinformatics 2015312683ndash90

40Zhao J Efficient model selection for mixtures of probabilisticPCA via hierarchical BIC IEEE Trans Cybern 2014441871ndash83

41Zhao Q Shi X Huang J et al Integrative analysis of lsquo-omicsrsquodata using penalty functions Wiley Interdiscip Rev Comput Stat2015710

42Alter O Brown PO Botstein D Generalized singular valuedecomposition for comparative analysis of genome-scaleexpression data sets of two different organisms Proc NatlAcad Sci USA 20031003351ndash6

43Culhane AC Perriere G Higgins DG Cross-platform compari-son and visualisation of gene expression data using co-inertia analysis BMC Bioinformatics 2003459

44Dray S Analysing a pair of tables coinertia analysis and dual-ity diagrams In Blasius J Greenacre M (eds) Visualization andverbalization of data Boca Raton CRC Press 2014 289ndash300

45Le Cao KA Martin PG Robert-Granie C et al Sparse canonicalmethods for biological data integration application to across-platform study BMC Bioinformatics 20091034

46Braak CJF Canonical correspondence analysis a new eigen-vector technique for multivariate direct gradient analysisEcology 1986671167ndash79

47Hotelling H Relations between two sets of variates Biometrika193628321ndash77

48McGarigal K Landguth E Stafford S Multivariate Statistics forWildlife and Ecology Research New York Springer 2002

49Thioulouse J Simultaneous analysis of a sequence of pairedecological tables a comparison of several methods Ann ApplStat 201152300ndash25

50Hong S Chen X Jin L et al Canonical correlation analysisfor RNA-seq co-expression networks Nucleic Acids Res201341e95

51Mardia KV Kent JT Bibby JM Multivariate Analysis LondonNew York NY Toronto Sydney San Francisco CA AcademicPress 1979

52Waaijenborg S Zwinderman AH Sparse canonical correl-ation analysis for identifying connecting and completinggene-expression networks BMC Bioinformatics 200910315

53Parkhomenko E Tritchler D Beyene J Sparse canonical cor-relation analysis with application to genomic data integra-tion Stat Appl Genet Mol Biol 20098article 1

54Witten DM Tibshirani RJ Extensions of sparse canonical cor-relation analysis with applications to genomic data Stat ApplGenet Mol Biol 20098article 28

55Lin D Zhang J Li J et al Group sparse canonical correlationanalysis for genomic data integration BMC Bioinformatics201314245

56Le Cao KA Boitard S Besse P Sparse PLS discriminantanalysis biologically relevant feature selection and graph-ical displays for multiclass problems BMC Bioinform 201112253

57Boulesteix AL Strimmer K Partial least squares a versatiletool for the analysis of high-dimensional genomic data BriefBioinform 2007832ndash44

58Doledec S Chessel D Co-inertia analysis an alternativemethod for studying speciesndashenvironment relationshipsFreshwater Biol 199431277ndash94

59Ponnapalli SP Saunders MA Van Loan CF et al A higher-order generalized singular value decomposition for compari-son of global mRNA expression from multiple organismsPLoS One 20116e28072

60Meng C Kuster B Culhane AC et al A multivariate approachto the integration of multi-omics datasets BMC Bioinformatics201415162

61de Tayrac M Le S Aubry M et al Simultaneous analysis ofdistinct Omics data sets with integration of biological know-ledge multiple factor analysis approach BMC Genomics20091032

62Abdi H Williams LJ Valentin D Statis and Distatis OptimumMultitable Principal Component Analysis and Three Way MetricMultidimensional Scaling Wiley Interdisciplinary 20124124ndash67

63De Jong S Kiers HAL Principal covariates regression Part ITheory Chemometrics and Intelligent Laboratory Systems199214155ndash64

64Benasseni J Dosse MB Analyzing multiset data by the powerSTATIS-ACT method Adv Data Anal Classification 2012649ndash65

65Escoufier Y The duality diagram a means for better practicalapplications Devel Num Ecol 198714139ndash56

66Giordani P Kiers HAL Ferraro DMA Three-way componentanalysis using the R package threeway J Stat Software2014577

67Leibovici DG Spatio-temporal multiway decompositionsusing principal tensor analysis on k-modes the R packagePTAk J Stat Software 201034(10)1ndash34

68Kolda TG Bader BW Tensor decompositions and applica-tions SIAM Rev 200951455ndash500

69Tenenhaus A Tenenhaus M Regularized generalized canon-ical correlation analysis for multiblock or multigroup dataanalysis Eur J Oper Res 2014238391ndash403

70Chessel D Hanafi M Analyses de la co-inertie de K nuages depoints Rev Stat Appl 19964435ndash60

71Hanafi M Kohler A Qannari EM Connections betweenmultiple co-inertia analysis and consensus principal compo-nent analysis Chemometrics and Intelligent Laboratory2011106(1)37ndash40

72Carroll JD Generalization of canonical correlation analysis tothree or more sets of variables Proc 76th Convent Am PsychAssoc 19683227ndash8

73Takane Y Hwang H Abdi H Regularized multiple-set canon-ical correlation analysis Psychometrika 200873753ndash75


74Tenenhaus A Tenenhaus M Regularized generalized canon-ical correlation analysis Psychometrika 201176257ndash84

75van de Velden M On generalized canonical correlation ana-lysis In Proceedings of the 58th World Statistical CongressDublin 2011

76Tenenhaus A Philippe C Guillemot V et al Variable selectionfor generalized canonical correlation analysis Biostatistics201415569ndash83

77Zhao Q Caiafa CF Mandic DP et al Higher order partial leastsquares (HOPLS) a generalized multilinear regressionmethod IEEE Trans Pattern Anal Mach Intell 2013351660ndash73

78Kim H Park H Elden L Non-negative tensor factorizationbased on alternating large-scale non-negativity-constrainedleast squares In Proceedings of the 7th IEEE Symposium onBioinformatics and Bioengineering (BIBE) Dublin 2007 1147ndash51

79Morup M Hansen LK Arnfred SM Algorithms for sparse nonneg-ative Tucker decompositions Neural Comput 2008202112ndash31

80Wang HQ Zheng CH Zhao XM jNMFMA a joint non-negativematrix factorization meta-analysis of transcriptomics dataBioinformatics 201531572ndash80

81Zhang S Liu CC Li W et al Discovery of multi-dimensionalmodules by integrative analysis of cancer genomic dataNucleic Acids Res 2012409379ndash91

82Lv M Zhang X Jia H et al An oncogenic role of miR-142-3p inhuman T-cell acute lymphoblastic leukemia (T-ALL) by tar-geting glucocorticoid receptor-alpha and cAMPPKA path-ways Leukemia 201226769ndash77

83Pulikkan JA Dengler V Peramangalam PS et al Cell-cycleregulator E2F1 and microRNA-223 comprise an autoregula-tory negative feedback loop in acute myeloid leukemia Blood20101151768ndash78

84Smilde AK Kiers HA Bijlsma S et al Matrix correlations forhigh-dimensional data the modified RV-coefficientBioinformatics 200925401ndash5

85Streicher KL Zhu W Lehmann KP et al A novel oncogenicrole for the miRNA-506-514 cluster in initiating melanocytetransformation and promoting melanoma growth Oncogene2012311558ndash70

86Chiaretti S Messina M Tavolaro S et al Gene expressionprofiling identifies a subset of adult T-cell acute lympho-blastic leukemia with myeloid-like gene features and over-expression of miR-223 Haematologica 2010951114ndash21

87Dahlhaus M Roolf C Ruck S et al Expression and prognosticsignificance of hsa-miR-142-3p in acute leukemiasNeoplasma 201360432ndash8

88Eyholzer M Schmid S Schardt JA et al Complexity of miR-223 regulation by CEBPA in human AML Leuk Res201034672ndash6

89Kumar V Palermo R Talora C et al Notch and NF-kB signal-ing pathways regulate miR-223FBXW7 axis in T-cell acutelymphoblastic leukemia Leukemia 2014282324ndash35

90 Josson S Gururajan M Hu P et al miR-409-3p-5p promotestumorigenesis epithelial-to-mesenchymal transition andbone metastasis of human prostate cancer Clin Cancer Res2014204636ndash46

91Alizadeh AA Aranda V Bardelli A et al Toward understand-ing and exploiting tumor heterogeneity Nat Med 201521846ndash53

92Cancer Genome Atlas Research Network Weinstein JNCollisson EA Mills GB et al The Cancer Genome Atlas Pan-Cancer analysis project Nat Genet 2013451113ndash20

93Gholami AM Fellenberg K Cross-species common regulatorynetwork inference without requirement for prior gene affilia-tion Bioinformatics 201026(8)1082ndash90

94Langfelder P Horvath S WGCNA an R package forweighted correlation network analysis BMC Bioinformatics20089559


them SVD is the most widely used approach Given X an npmatrix with rank r (r ltfrac14min[n p]) SVD decomposes X intothree matrices

X frac14 USQTsubject to the constraint that UTU frac14 QTQ frac14 I (5)

where U is an nr matrix and Q is a pr matrix The columns ofU and Q are the orthogonal left and right singular vectors re-spectively S is an rr diagonal matrix of singular values whichare proportional to the standard deviations associated with rsingular vectors The singular vectors are ordered such that

their associated variances are monotonically decreasingIn a PCA of X the PCs comprise an nr matrix F which isdefined as

F frac14 US frac14 USQTQ frac14 XQ (6)

where the columns of matrix F are the PCs and the matrix Q iscalled the loadings matrix and contains the linear combinationcoefficients of the variables for each PC (q in Equation (3))Therefore we represent the variance of X in lower dimension rThe above formula also emphasizes that Q is a matrix that

Table 1 Glossary

Term Definition

Variance The variance of a random variable measures the spread (variability) of its realizations (values of the random vari-able) The variance is always a positive number If the variance is small the values of the random variable areclose to the mean of the random variable (the spread of the data is low) A high variance is equivalent to widelyspread values of the random variable See [11]

Standard deviation The standard deviation of a random variable measures the spread (variability) of its realizations (values of the ran-dom variable) It is defined as the square root of the variance The standard deviation will have the same units asthe random variable in contrast to the variance See [11]

Covariance The covariance is an unstandardized measure about the tendency of two random variables to vary together See[12]

Correlation The correlation of two random variables is defined by the covariance of the two random variables normalized bythe product between their standard deviations It measures the linear relationship between the two randomvariables The correlation coefficient ranges between 1 and thorn1 See [12]

Inertia Inertia is a measure for the variability of the data The inertia of a set of points relative to one point P is defined bythe weighted sum of the squared distances between each considered point and the point P Correspondingly theinertia of a centered matrix (mean is equal to zero) is simply the sum of the squared matrix elements The inertiaof the matrix X defined by the metrics L and D is the weighted sum of its squared values The inertia is equal thetotal variance of X when X is centered L is the Euclidean metric and D is a diagonal matrix with diagonal elementsequal to 1n See [13]

Co-inertia The co-inertia is a global measure for the co-variability of two data sets (for example two high-dimensional randomvariables) If the data sets are centered the co-inertia is the sum of squared covariances When coupling a pair ofdata sets the co-inertia between two matrices X and Y is calculated as trace (XLXTDYRYTD) See [13]

Orthogonal Two vectors are called orthogonal if they form an angle that measures 90 degrees Generally two vectors are orthog-onal if their inner product is equal to zero Two orthogonal vectors are always linearly independent See [12]

Independent In linear algebra two vectors are called linearly independent if their liner combination is equal to zero only when allconstants of the linear combination are equal to zero See [14] In statistics two random variables are called statis-tically independent if the distribution of one of them does not affect the distribution of the other If two independ-ent random variables are added then the mean of the sum is the sum of the two mean values This is also true forthe variance The covariance of two independent variables is equal to zero See [11]

Eigenvector eigenvalue An eigenvector of a matrix is a vector that does not change its direction after a linear transformation The vector v isan eigenvector of the matrix A if Av frac14 kv k is the eigenvalue associated with the eigenvector v and it reflects thestretch of the eigenvector following the linear transformation The most popular way to compute eigenvectorsand eigenvalues is the SVD See [14]

Linear combination Mathematical expression calculated through the multiplication of variables with constants and adding the individ-ual multiplication results A linear combination of the variables x and y is ax thorn by where a and b are the constantsSee [15]

Omics The study of biological molecules in a comprehensive fashion Examples of omics data types include genomicstranscriptomics proteomics metabolomics and epigenomics [16]

Dimension reduction Dimension reduction is the mapping of data to a lower dimensional space such that redundant variance in the datais reduced or discarded enabling a lower-dimensional representation without significant loss of informationSee [17]

Exploratory data analysis EDA is the application of statistical techniques that summarize the main characteristics of data often with visualmethods In contrast to statistical hypothesis testing (confirmatory data analysis) EDA can help to generatehypotheses See [18]

Sparse vector A sparse vector is a vector in which most elements are zero A sparse loadings matrix in PCA or related methodsreduce the number of features contributing to a PC The variables with nonzero entries (features) are the lsquoselectedfeaturesrsquo See [19]

630 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

projects the observations in X onto the PCs The sum of squaredvalues of the columns in U equals 1 (Equation (5)) therefore thevariance of the ith PC di2 can be calculated as

di2 frac14 si2

n 1(7)

where si is the ith diagonal element in S The variance reflectsthe amount of information (underling structure) captured byeach PC The squared correlations between PCs and the original

variables are informative and often illustrated using a correl-ation circle plot These can be calculated by

C frac14 QD (8)

where D frac14 (d1 d2 dr)T is a diagonal matrix of the standarddeviation of the PCs

In contrast to SVD which calculates all PCs simultaneouslyPCA can also be calculated using the Nonlinear Iterative PartialLeast Squares (NIPALS) algorithm which uses an iterative

Table 2 Dimension reduction methods for one data set

Method Description Name of R function R package

PCA Principal component analysis prcompstats princompstats dudipcaade4 pcaveganPCAFactoMineR principalpsych

CA COA Correspondence analysis caca CAFactoMineR dudicoaade4NSC Nonsymmetric correspondence analysis dudinscade4PCoA MDS Principal co-ordinate analysismultiple dimensional

scalingcmdscalestats dudipcoade4 pcoaape

NMF Nonnegative matrix factorization nmfnmfnmMDS Nonmetric multidimensional scaling metaMDSvegansPCA nsPCA pPCA Sparse PCA nonnegative sparse PCA penalized PCA

(PCA with feature selection)SPCPMA spcamixOmics nsprcompnsprcomp

PMDPMANIPALS PCA Nonlinear iterative partial least squares analysis (PCA

on data with missing values)nipalsade4 pcapcaMethodsa nipalsmixOmics

pPCA bPCA Probabilistic PCA Bayesian PCA pcapcaMethodsa

MCA Multiple correspondence analysis dudiacmade4 mcaMASSICA Independent component analysis fastICAFastICAsIPCA Sparse independent PCA (combines sPCA and ICA) sipcamixOmics ipcamixOmicsplots Graphical resources R packages including scatterplot3d ggordb ggbiplotc

plotlyd explor

aAvailable in BioconductorbOn github devtoolsinstall_github (lsquofawda123ggordrsquo)cOn github devtoolsinstall_github (lsquoggbiplotrsquo lsquovqvrsquo)dOn github devtoolsinstall_github (lsquoropensciplotlyrsquo)

Table 3 Dimension reduction methods for pairs of data sets

Method Description Featureselection

R Function package

CCAa Canonical correlation analysis Limited to ngtpa No cccca CCorAveganCCAa Canonical correspondence analysis is a constrained correspondence

analysis which is popular in ecologya

No ccaade4 ccavegan cancorstats

RDA Redundancy analysis is a constrained PCA Popular in ecology No rdaveganProcrutes Procrutes rotation rotates a matrix to maximum similarity with a tar-

get matrix minimizing sum of squared differencesNo procrustesvegan procusteade4

rCCA Regularized canonical correlation No rccccasCCA Sparse CCA Yes CCApmapCCA Penalized CCA Yes spCCAspCCA supervised versionWAPLS Weighted averaging PLS regression No WAPLSrioja waplspaltranPLS Partial least squares of K-tables (multi-block PLS) No mbplsade4 plsdacaretsPLS pPLS Sparse PLS Penalized PLS Yes splsspls splsmixOmics pplspplssPLS-DA Sparse PLS-discriminant analysis Yes splsdamixOmics splsdacaretcPCA Consensus PCA No cpcamogsaCIA Coinertia analysis No coinertiaade4 ciamade4

aA source for confusion CCA is widely used as an acronym for both Canonical lsquoCorrespondencersquo Analysis and Canonical lsquoCorrelationrsquo Analysis Throughout this article

we use CCA for canonical lsquocorrelationrsquo analysis Both methods search for the multivariate relationships between two data sets Canonical lsquocorrespondencersquo analysis is

an extension and constrained form of lsquocorrespondencersquo analysis [22] Both canonical lsquocorrelationrsquo analysis and RDA assume a linear model however RDA is a con-

strained PCA (and assumes one matrix is the dependent variable and one independent) whereas canonical correlation analysis considers both equally See [23] for

more explanation

Dimension reduction techniques | 631

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

regression procedure to calculate PCs Computation can be per-formed on data with missing values and it is faster than SVDwhen applied to large matrices Furthermore NIPALS may begeneralized to discover the correlated structure in more thanone data set (see sections on the analysis of multi-omics datasets) Please refer to the Supplementary Information for add-itional details on NIPALS

Visualizing and interpreting results ofdimension reduction analysis

We present an example to illustrate how to interpret results of aPCA PCA was applied to analyze mRNA gene expression data ofa subset of cell lines from the NCI-60 panel those of melanomaleukemia and central nervous system (CNS) tumors The resultsof PCA can be easily interpreted by visualizing the observationsand variables in the same space using a biplot Figure 1A is abiplot of the first (PC1) and second PC (PC2) where points andarrows from the plot origin represent observations and genesrespectively Cell lines (points) with correlated gene expressionprofiles have similar scores and are projected close to eachother on PC1 and PC2 We see that cell lines from the same ana-tomical location are clustered

Both the direction and length of the mRNA gene expressionvectors can be interpreted Gene expression vectors point in thedirection of the latent variable (PC) to which it is most similar(squared correlation) Gene vectors with the same direction (egFAM69B PNPLA2 PODXL2) have similar gene expression pro-files The length of the gene expression vector is proportional tothe squared multiple correlation between the fitted values forthe variable and the variable itself

A gene expression vector and a cell line projected in the samedirection from the origin are positively associated For example inFigure 1A FAM69B PNPLA2 PODXL2 are active (have higher geneexpression) in melanoma cell lines Similarly genes DPPA5 andRPL34P1 are among those genes that are highly expressed inmost leukemia cell lines By contrast genes WASL and HEXBpoint in the opposite direction to most leukemia cell lines indicat-ing low association In Figure 1B it is clear that these genes arenot expressed in leukemia cell lines and are colored blue in theheatmap (these are colored dark gray in grayscale)

The sum of the squared correlation coefficients between avariable and all the components (calculated in equation 8)equals 1 Therefore variables are often shown within a correl-ation circle (Figure 1C) Variables positioned on the unit circlerepresent variables that are perfectly represented by the two di-mensions displayed Those not on the unit circle may requireadditional components to be represented

In most analyses only the first few PCs are plotted andstudied as these explain the most variant trends in the dataGenerally the selection of components is subjective anddepends on the purpose of the EDA An informal elbow test mayhelp to determine the number of PCs to retain and examine [2425] From the scree plot of PC eigenvalues in Figure 1D we mightdecide to examine the first three PCs because the decrease inPC variance becomes relatively moderate after PC3 Anotherapproach that is widely used is to include (or retain) PCs thatcumulatively capture a certain proportion of variance for ex-ample 70 of variance is modeled with three PCs If a parsi-mony model is preferred the variance proportion cutoff can beas low as 50 [24] More formal tests including the Q2 statisticare also available (for details see [24]) In practice the first com-ponent might explain most of the variance and the remainingaxes may simply be attributed to noise from technical or biolo-gical sources in a study with low complexity (eg cell line repli-cates of controls and one treatment condition) However acomplex data set (for example a set of heterogeneous tumors)may require multiple PCs to capture the same amount ofvariance

Different dimension reduction approaches areoptimized for different data

There are many dimension reduction approaches related to PCA(Table 2) including principal co-ordinate analysis (PCoA) cor-respondence analysis (CA) and nonsymmetrical correspond-ence analysis (NSCA) These may be computed by SVD butdiffer in how the data are transformed before decomposition[21 26 27] and therefore each is optimized for specific dataproperties PCoA (also known as Classical MultidimensionalScaling) is versatile as it is a SVD of a distance matrix that canbe applied to decompose distance matrices of binary count or

Table 4 Dimension reduction methods for multiple (more than two) data sets

Method Description Featureselection

Matchedcases

R Function package

MCIA Multiple coinertia analysis No No mciaomicade4 mcoaade4gCCA Generalized CCA No No regCCAdmtrGCCA Regularized generalized CCA No No regCCAdmt rgccargcca

wrapperrgccamixOmicssGCCA Sparse generalized canonical

correlation analysisYes No sgccargcca wrappersgccamixOmics

STATIS Structuration des Tableaux aTrois Indices de la Statistique(STATIS) Family of methods whichinclude X-statis

No No statisade4

CANDECOMPPARAFAC Tucker3

Higher order generalizations of SVDand PCA Require matchedvariables and cases

No Yes CPThreeWay T3ThreeWay PCAnPTaKCANDPARAPTaK

PTA Partial triadic analysis No Yes ptaade4statico Statis and CIA (find structure

between two pairs of K-tables)No No staticoade4

632 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

continuous data It is frequently applied in the analysis ofmicrobiome data [28]

PCA is designed for the analysis of multi-normal distributeddata If data are strongly skewed or extreme outliers are pre-sent the first few axes may only separate those objects with ex-treme values instead of displaying the main axes of variation Ifdata are unimodal or display nonlinear trends one may see dis-tortions or artifacts in the resulting plots in which the secondaxis is an arched function of the first axis In PCA this is calledthe horseshoe effect and it is well described including illustra-tions in Legendre and Legendre [3] Both nonmetric Multi-Dimensional Scaling (MDS) and CA perform better than PCA inthese cases [26 29] Unlike PCA CA can be applied to sparsecount data with many zeros Although designed for contingencytables of nonnegative count data CA and NSCA decomposea chi-squared matrix [30 31] but have been successfullyapplied to continuous data including gene expression and pro-tein profiles [32 33] As described by Fellenberg et al [33] gene

and protein expression can be seen as an approximation of thenumber of corresponding molecules present in the cell during acertain measured condition Additionally Greenacre [27]emphasized that the descriptive nature of CA and NSCA allowstheir application on data tables in general not only on countdata These two arguments support the suitability of CA andNSCA as analysis methods for omics data While CA investi-gates symmetric associations between two variables NSCA cap-tures asymmetric relations between variables Spectral mapanalysis is related to CA and performs comparably with CAeach outperforming PCA in the identification of clusters of leu-kemia gene expression profiles [26] All dimension reductionmethods can be formulated in terms of the duality diagramDetails on this powerful framework are included in theSupplementary Information

Nonnegative matrix factorization (NMF) [34] forces a positiveor nonnegative constraint on the resulting data matrices andsimilar to Independent Component Analysis (ICA) [35] there is no

Figure 1 Results of a PCA analysis of mRNA gene expression data of melanoma (ME) leukemia (LE) and central nervous system (CNS) cell lines from the NCI-60 cell

line panel All variables were centered and scaled Results show (A) a biplot where observations (cell lines) are points and gene expression profiles are arrows (B) a

heatmap showing the gene expression of the same 20 genes in the cell lines red to blue scale represent high to low gene expression (light to dark gray represent high

to low gene expression on the black and white figure) (C) correlation circle (D) variance barplot of the first ten PCs To improve the readability of the biplot some labels

of the variables (genes) in (A) have been moved slightly A colour version of this figure is available online at BIB online httpbiboxfordjournalsorg

Dimension reduction techniques | 633

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

requirement for orthogonality or independence in the compo-nents The nonnegative constraint guarantees that only the addi-tive combinations of latent variables are allowed This may bemore intuitive in biology where many biological measurements(eg protein concentrations count data) are represented by posi-tive values NMF is described in more detail in the SupplementalInformation ICA was recently applied to molecular subtype dis-covery in bladder cancer [4] Biton et al [4] applied ICA to geneexpression data of 198 bladder cancers and examined 20 compo-nents ICA successfully decomposed and extracted multiplelayers of signal from the data The first two components wereassociated with batch effects but other components revealed newbiology about tumor cells and the tumor microenvironmentThey also applied ICA to non-bladder cancers and by comparingthe correlation between components were able to identify a setof bladder cancer-specific components and their associatedgenes

As omics data sets tend to have high dimensionality (p >> n, many more variables than observations), it is often useful to reduce the number of variables. Several recent extensions of PCA include variable selection, often via a regularization step or L1 penalization (e.g. Least Absolute Shrinkage and Selection Operator, LASSO) [36]. The NIPALS algorithm uses an iterative regression approach to calculate the components and loadings, and it is easily extended to include a sparse operator during the regression on each component [37]. A cross-validation approach can be used to determine the level of sparsity. Sparse, penalized and regularized extensions of PCA and related methods have been described recently [36, 38–41].
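As a hedged sketch of such a sparse PCA, the code below uses spca() from mixOmics (Table 2); the matrix X (samples in rows, genes in columns) is hypothetical, and keepX, the number of variables retained per component, is an illustrative value that would normally be tuned, for example by cross-validation.

    library(mixOmics)
    # sparse PCA keeping 25 variables on each of the first three components
    spca.fit <- spca(X, ncomp = 3, keepX = c(25, 25, 25))
    selectVar(spca.fit, comp = 1)$name   # variables retained on the first sparse component
    plotVar(spca.fit, comp = c(1, 2))    # correlation-circle style plot of the selected variables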

Integrative analysis of two data sets

One-table dimension reduction methods have been extended to the EDA of two matrices and can simultaneously decompose and integrate a pair of matrices that measure different variables on the same observations (Table 3). Methods include generalized SVD [42], Co-Inertia Analysis (CIA) [43, 44], sparse or penalized extensions of Partial Least Squares (PLS), Canonical Correspondence analysis (CCA) and Canonical Correlation Analysis (CCA) [36, 45–47]. Note that both canonical correspondence analysis and canonical correlation analysis are referred to by the acronym CCA. Canonical correspondence analysis is a constrained form of CA that is widely used in ecological statistics [46]; however, it is yet to be adopted by the genomics community in the analysis of pairs of omics data. By contrast, several groups have applied extensions of canonical correlation analysis to omics data integration. Therefore, in this review we use CCA to describe canonical correlation analysis.

Canonical correlation analysis

Two omics data sets X (dimension n x p_x) and Y (dimension n x p_y) can be expressed by the following latent component decomposition problem:

$$X = F_x Q_x^T + E_x, \qquad Y = F_y Q_y^T + E_y \qquad (9)$$

where F_x and F_y are n x r matrices whose r columns are components that explain the co-structure between X and Y, the columns of Q_x and Q_y are variable loading vectors for X and Y, respectively, and E_x and E_y are error terms.

Proposed by Hotelling in 1936 [47], CCA searches for associations or correlations among the variables of X and Y by maximizing the correlation between Xq_x^i and Yq_y^i; for the ith component:

$$\arg\max_{q_x^i,\, q_y^i} \; \mathrm{cor}\left(Xq_x^i, Yq_y^i\right) \qquad (10)$$

In CCA, the components Xq_x^i and Yq_y^i are called canonical variates, and their correlations are the canonical correlations.
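When n exceeds the number of variables in each table, classical CCA can be computed directly in base R with cancor() from the stats package (Table 3); the matrices X and Y below are hypothetical, with matched rows (observations).

    cc <- cancor(X, Y)                # classical CCA; requires n > px and n > py
    cc$cor                            # canonical correlations
    # canonical variates (scores) for each table, using the centring returned by cancor()
    U <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef
    V <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef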

Sparse canonical correlation analysis

The main limitation of applying CCA to omics data is that it requires an inversion of the correlation or covariance matrix [38, 49, 50], which cannot be calculated when the number of variables exceeds the number of observations [46]. In high-dimensional omics data, where p >> n, application of these methods requires a regularization step. This may be accomplished by adding a ridge penalty, that is, adding a multiple of the identity matrix to the covariance matrix [51]. A sparse solution of the loading vectors (Q_x and Q_y) filters the number of variables and simplifies the interpretation of results. For this purpose, penalized CCA [52], sparse CCA [53], CCA-l1 [54], CCA elastic net (CCA-EN) [45] and CCA-group sparse [55] have been introduced and applied to the integrative analysis of two omics data sets. Witten et al. [36] provided an elegant comparison of various CCA extensions, accompanied by a unified approach to compute both penalized CCA and sparse PCA. In addition, Witten and Tibshirani [54] extended sparse CCA into a supervised framework, which allows the integration of two data sets with a quantitative phenotype, for example selecting variables from both genomics and transcriptomics data and linking them to drug sensitivity data.
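A minimal sketch of penalized (sparse) CCA with the PMA package of Witten et al. [36] follows; X and Y are hypothetical matched matrices (samples in rows), and the lasso penalties are fixed here only for brevity, whereas in practice they would be chosen with the package's permutation-based tuning.

    library(PMA)
    # CCA.permute() can be used to choose penaltyx/penaltyz; fixed values are shown for brevity
    scca <- CCA(x = X, z = Y, typex = "standard", typez = "standard",
                penaltyx = 0.3, penaltyz = 0.3, K = 2)
    scca$u      # sparse loading vectors for X (many entries are exactly zero)
    scca$v      # sparse loading vectors for Y
    scca$cors   # correlations of the resulting canonical variates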

Partial least squares analysis

PLS is an efficient dimension reduction method in the analysis of high-dimensional omics data. PLS maximizes the covariance, rather than the correlation, between components, making it more resistant to outliers than CCA. Additionally, PLS does not suffer from the p >> n problem as CCA does. Nonetheless, a sparse solution is desired in some cases. For example, Le Cao et al. [56] proposed a sparse PLS (sPLS) method for feature selection by introducing a LASSO penalty on the loading vectors. In a recent comparison, sPLS performed similarly to sparse CCA [45]. There are many implementations of PLS, which optimize different objective functions under different constraints; several of these are described by Boulesteix et al. [57].
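A minimal sparse PLS sketch using spls() from mixOmics (Table 3); X and Y are hypothetical omics matrices with matched samples in rows, and keepX/keepY (the numbers of variables retained per component) are illustrative values that should be tuned, for example with the package's cross-validation utilities.

    library(mixOmics)
    spls.fit <- spls(X, Y, ncomp = 2, keepX = c(50, 50), keepY = c(20, 20))
    selectVar(spls.fit, comp = 1)   # variables from X and Y selected on the first component
    plotIndiv(spls.fit)             # samples projected onto the sPLS components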

Co-Inertia analysis

CIA is a descriptive, nonconstrained approach for coupling pairs of data matrices. It was originally proposed to link two ecological tables [13, 58] but has been successfully applied in integrative analysis of omics data [32, 43]. CIA is implemented and formulated under the duality diagram framework (Supplementary Information). CIA is performed in two steps: (i) application of a dimension reduction technique such as PCA, CA or NSCA to the initial data sets, depending on the type of data (binary, categorical, discrete counts or continuous data); and (ii) constraining the projections of the orthogonal axes such that they are maximally covariant [43, 58]. CIA does not require an inversion step of the correlation or covariance matrix; thus it can be applied to high-dimensional genomics data without regularization or penalization.


Though closely related to CCA [49], CIA maximizes the squared covariance between the linear combinations of the preprocessed matrices, that is, for the ith dimension:

$$\arg\max_{q_x^i,\, q_y^i} \; \mathrm{cov}^2\left(Xq_x^i, Yq_y^i\right) \qquad (11)$$

Equation (11) can be decomposed as:

$$\mathrm{cov}^2\left(Xq_x^i, Yq_y^i\right) = \mathrm{cor}^2\left(Xq_x^i, Yq_y^i\right)\,\mathrm{var}\left(Xq_x^i\right)\,\mathrm{var}\left(Yq_y^i\right) \qquad (12)$$

The CIA decomposition of the covariance thus maximizes both the variance and the correlation between matrices and is therefore less sensitive to outliers. The relationships between CIA, Procrustes analysis [13] and CCA have been well described [49]. A comparison between sCCA (with elastic net penalty), sPLS and CIA is provided by Le Cao et al. [45]. In summary, CIA and sPLS both maximize the covariance between eigenvectors and efficiently identify joint and individual variance in paired data. In contrast, CCA-EN maximizes the correlation between eigenvectors and will discover effects present in both data sets but may fail to discover strong individual effects [45]. Both sCCA and sPLS are sparse methods that select similar subsets of variables, whereas CIA does not include a feature selection step; thus, in terms of feature selection, results of CIA are more likely to contain redundant information in comparison with the sparse methods [45].
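A minimal CIA sketch with cia() from the made4 package (Table 3); following that package's convention, the two hypothetical tables df1 and df2 have variables (e.g. genes, proteins) in rows and the same, matched samples in columns.

    library(made4)
    cia.fit <- cia(df1, df2, cia.nf = 2)   # co-inertia analysis, keeping two axes
    plot(cia.fit)                          # paired sample projections and the two variable spaces
    summary(cia.fit$coinertia)             # eigenvalues and the RV coefficient (global similarity of the tables)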

As with classical dimension reduction approaches, the number of dimensions to be examined needs to be considered and can be visualized using a scree plot (similar to Figure 1D). Components may be evaluated by their associated variance [25], the elbow test or the Q2 statistic, as described previously. For example, the Q2 statistic was applied to select the number of dimensions in the predictive mode of PLS [56]. In addition, when a sparse factor is introduced in the loading vectors, cross-validation approaches may be used to determine the number of variables selected from each pair of components. Selection of the number of components and optimization of these parameters remains an open question and an active area of research.

Integrative analysis of multi-assay data

There is a growing need to integrate more than two data sets in genomics. Generalizations of dimension reduction methods to three or more data sets are sometimes called K-table methods [59–61], and a number of them have been applied to multi-assay data (Table 4). Simultaneous decomposition and integration of multiple matrices is more complex than analysis of a single data set or paired data because each data set may have different numbers of variables, scales or internal structure, and thus different variance. This may produce global scores that are dominated by one or a few data sets. Therefore, data are preprocessed before decomposition. Preprocessing is often performed on two levels. On the variable level, variables are often centered and normalized so that their sum of squared values or variance equals 1. This procedure enables all variables to contribute equally to the total inertia (the sum of squares of all elements) of a data set. However, the number of variables may vary between data sets, or filtering/preprocessing steps may generate data sets that have a higher variance contribution to the final result. Therefore, a data set-level normalization is also required. In the simplest K-table analysis, all matrices have equal weights. More commonly, data sets are weighted by their expected contribution or expected data quality, for example by the square root of their total inertia or by the square root of the number of columns of each data set [62]. Alternatively, greater weights can be given to smaller or less redundant matrices, to matrices that have more stable predictive information or to those that share more information with other matrices. Such weighting approaches are implemented in multiple-factor analysis (MFA), principal covariates regression [63] and STATIS.

The simplest multi-omics data integration arises when the data sets have the same variables and observations, that is, matched rows and matched columns. In genomics, these could be produced when variables from different multi-assay data sets are mapped to a common set of genomic coordinates or gene identifiers, thus generating data sets with matched variables and matched observations. Alternatively, a repeated or longitudinal analysis of the same samples and the same variables could produce such data (note that these dimension reduction approaches do not model the time correlation between different data sets). There is a history of such analyses in ecology, where counts of species and environmental variables are measured over different seasons [49, 64, 65], and in psychology, where different standardized tests are measured multiple times on study populations [66, 67]. Analysis of such variables x samples x time data is called three-mode decomposition, triadic cube or three-way table analysis, tensor decomposition, three-way PCA, three-mode PCA, three-mode Factor Analysis, Tucker-3 model, Tucker3, TUCKALS3 or multi-block analysis, among other names (Table 4). The relationships between various tensor or higher-order decompositions for multi-block analysis are reviewed by Kolda and Bader [68].

More frequently, we need to find associations in multi-assay data that have matched observations but different and unmatched numbers of variables. For example, TCGA generated miRNA and mRNA transcriptome (RNA-seq, microarray), DNA copy number, DNA mutation, DNA methylation and proteomics molecular profiles on each tumor. The NCI-60 and Cancer Cell Line Encyclopedia projects have measured pharmacological compound profiles in addition to exome sequencing and transcriptomic profiles. Methods that can be applied to EDA of K-table multi-assay data with different variables include MCIA, MFA, Generalized CCA (GCCA) and Consensus PCA (CPCA).

The K-table methods can be generally expressed by the following model:

$$X_1 = FQ_1^T + E_1,\;\; \ldots,\;\; X_k = FQ_k^T + E_k,\;\; \ldots,\;\; X_K = FQ_K^T + E_K \qquad (13)$$

where there are K matrices or omics data sets X_1, ..., X_K. For convenience, we assume that the rows of each X_k share a common set of observations, but the columns of each X_k may contain different variables. F is the 'global score' matrix. Its columns are the PCs and are interpreted similarly to PCs from a PCA of a single data set. The global score matrix, which is identical in all decompositions, integrates information from all data sets. Therefore, it is not specific to any single data set; rather, it represents the common pattern defined by all data sets. The matrices Q_k, with k ranging from 1 to K, are the loading or coefficient matrices. A high positive value indicates a strong positive contribution of the corresponding variable to the 'global score'. While the above methods are formulated for multiple data sets with different variables but the same observations, most can be similarly formulated for multiple data sets with the same variables but different observations [69].
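To make the notation of Equation (13) concrete, the following small simulation (purely illustrative, not data from this article) builds K = 3 tables that share one global score matrix F but have their own loadings and noise:

    set.seed(42)
    n <- 20; r <- 2                          # matched observations and latent components
    Fglob <- matrix(rnorm(n * r), n, r)      # global scores shared by all tables
    p <- c(100, 50, 30)                      # each table has a different number of variables
    Xk <- lapply(p, function(pk) {
      Qk <- matrix(rnorm(pk * r), pk, r)     # table-specific loadings
      Ek <- matrix(rnorm(n * pk, sd = 0.5), n, pk)   # table-specific noise
      Fglob %*% t(Qk) + Ek                   # X_k = F Q_k^T + E_k
    })
    sapply(Xk, dim)                          # 20 matched rows, different column counts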

Multiple co-inertia analysis

MCIA is an extension of CIA that analyzes multiple matrices by optimizing a covariance criterion [60, 70]. MCIA simultaneously projects K data sets into the same dimensional space. Instead of maximizing the covariance between scores from two data sets, as in CIA, the optimization problem used by MCIA is, for dimension i:

$$\arg\max_{q_1^i, \ldots, q_k^i, \ldots, q_K^i} \; \sum_{k=1}^{K} \mathrm{cov}^2\left(X_k q_k^i, Xq^i\right) \qquad (14)$$

with the constraints that $\lVert q_k^i \rVert = \mathrm{var}\left(Xq^i\right) = 1$, where X = (X_1 | ... | X_k | ... | X_K) is the merged table and q^i holds the corresponding loading values (the 'global' loading vector) [60, 69, 70].

MCIA derives a set of 'block scores' X_k q_k^i using linear combinations of the original variables from each individual matrix. The global score Xq^i is then defined as the linear combination of the 'block scores'. In practice, the global scores represent a correlated structure defined by multiple data sets, whereas the concordance and discrepancy between these data sets may be revealed by the block scores (for detail, see the 'Example case study' section). MCIA may be calculated with an ad hoc extension of the NIPALS PCA [71]. This algorithm starts with an initialization step in which the global scores and the block loadings for the first dimension are computed. The residual matrices are calculated in an iterative step by removing the variance induced by the variable loadings (the 'deflation' step). For higher-order solutions, the same procedure is applied to the residual matrices and re-iterated until the desired number of dimensions is reached. Therefore, the computation time depends strongly on the number of desired dimensions. MCIA is implemented in the R package omicade4 and has been applied to the integrative analysis of transcriptomic and proteomic data sets from the NCI-60 cell lines [60].
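A minimal MCIA sketch with the omicade4 package follows; it uses the NCI-60 example data distributed with that package (a list of transcriptomic tables with variables in rows and matched cell lines in columns) as a stand-in for the mRNA/miRNA/proteomics lists discussed here, and the plotting arguments shown are illustrative.

    library(omicade4)
    data(NCI60_4arrays)                      # example list of data.frames bundled with omicade4
    sapply(NCI60_4arrays, dim)               # matched samples in columns, different variables in rows
    mcia.fit <- mcia(NCI60_4arrays, cia.nf = 2)
    # colour samples by tissue of origin, encoded in the column names of the tables
    cancer.type <- sapply(strsplit(colnames(NCI60_4arrays[[1]]), split = "\\."), "[", 1)
    plot(mcia.fit, axes = 1:2, phenovec = cancer.type, sample.lab = FALSE)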

Generalized canonical correlation analysis

GCCA [71] is a generalization of CCA to K-table analysis [73–75]. It has also been applied to the analysis of omics data [36, 76]. Typically, MCIA and GCCA will produce similar results (for a more detailed comparison, see [60]). GCCA uses a different deflation strategy from MCIA: it calculates the residual matrices by removing the variance with respect to the 'block scores' (instead of the 'variable loadings' used by MCIA or the 'global scores' used by CPCA; see later). When applied to omics data where p >> n, a variable selection step is often integrated within the GCCA approach, which cannot be done in the case of MCIA. In addition, as block scores are better representations of a single data set (in contrast to the global score), GCCA is more likely to find common variables across data sets. Witten and Tibshirani [54] applied sparse multiple CCA to analyze gene expression and Copy Number Variation (CNV) data from diffuse large B-cell lymphoma patients and successfully identified 'cis interactions' that are up-regulated in both the CNV and mRNA data.
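Sparse multiple CCA of more than two tables is available as MultiCCA() in the PMA package; in this hedged sketch, xlist is a hypothetical list of matrices with matched rows (samples), and the penalties are tuned with the package's permutation procedure.

    library(PMA)
    xlist <- list(X1, X2, X3)                       # hypothetical matched-sample matrices
    perm <- MultiCCA.permute(xlist)                 # permutation-based choice of penalties
    mcca <- MultiCCA(xlist, penalty = perm$bestpenalties, ncomponents = 2)
    lapply(mcca$ws, head)                           # sparse weight (loading) vectors, one matrix per data set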

Consensus PCA

CPCA is closely related to GCCA and MCIA but has had less exposure in the omics data community. CPCA optimizes the same criterion as GCCA and MCIA and is subject to the same constraints as MCIA [71]. The deflation step of CPCA relies on the 'global scores'. As a result, it guarantees the orthogonality of the global scores only and tends to find common patterns in the data sets. This characteristic makes it more suitable for the discovery of joint patterns in multiple data sets, such as the joint clustering problem.

Regularized generalized canonical correlation analysis

Recently, Tenenhaus and Tenenhaus [69, 74] proposed regularized generalized CCA (RGCCA), which provides a unified framework for different K-table multivariate methods. The RGCCA model introduces some extra parameters, in particular a shrinkage parameter and a linkage parameter. The linkage parameter is defined so that the connections between matrices may be customized. The shrinkage parameter ranges from 0 to 1. Setting this parameter to 0 forces the correlation criterion (the criterion used by GCCA), whereas a shrinkage parameter of 1 applies the covariance criterion (used by MCIA and CPCA); a value between 0 and 1 leads to a compromise between the two options. In practice, the correlation criterion is better at explaining the correlated structure across data sets while discarding the variance within each individual data set. The introduction of these extra parameters makes RGCCA highly versatile: GCCA, CIA and CPCA can be described as special cases of RGCCA (see [69] and the Supplementary Information). In addition, RGCCA integrates a feature selection procedure, named sparse GCCA (SGCCA). Tenenhaus et al. [76] applied SGCCA to combine gene expression, comparative genomic hybridization and a qualitative phenotype measured on a set of 53 children with glioma. Sparse multiple CCA [54] and SGCCA [76] are available in the R packages PMA and RGCCA, respectively. Similarly, a higher-order implementation of sparse PLS is described in Zhao et al. [77].
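A hedged sketch of RGCCA through the mixOmics wrappers named in Table 4 (wrapper.rgcca/wrapper.sgcca) is shown below; data.list is a hypothetical list of matched-sample matrices, the design matrix plays the role of the linkage parameter, tau is the shrinkage parameter discussed above, and argument and accessor names may differ between package versions.

    library(mixOmics)
    data.list <- list(mrna = X1, prot = X2, mirna = X3)   # hypothetical matched-sample tables
    design <- matrix(1, 3, 3) - diag(3)                   # linkage: connect every pair of blocks
    # tau = 1 for every block applies the covariance criterion (as in MCIA/CPCA);
    # tau = 0 would instead give the correlation (GCCA-like) criterion
    rgcca.fit <- wrapper.rgcca(X = data.list, design = design, tau = c(1, 1, 1), ncomp = 2)
    lapply(rgcca.fit$variates, head)                      # block scores for each data set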

Joint NMF

NMF has also been extended to jointly factorize multiple matrices. In joint NMF, the values in the global score F and in the coefficient matrices (Q_1, ..., Q_K) are nonnegative, and there is no explicit definition of the block loadings. An optimization algorithm is applied to minimize an objective function, typically the sum of squared errors, $\sum_{k=1}^{K}\lVert E_k \rVert^2$. This approach can be considered a nonnegative implementation of PARAFAC, although it has also been implemented using the Tucker model [78–80]. Zhang et al. [81] applied joint NMF to a three-way analysis of DNA methylation, gene expression and miRNA expression data to identify modules in each of these regulatory layers that are associated with each other.
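The shared-score idea behind joint NMF can be illustrated in a simple (if naive) way by factorizing the column-bound tables with a standard NMF; this is only a sketch of the model, not the algorithm of Zhang et al. [81], and X1 to X3 are hypothetical nonnegative matrices with matched rows (samples).

    library(NMF)
    Xall <- cbind(X1, X2, X3)       # samples x (p1 + p2 + p3), all values nonnegative
    jfit <- nmf(Xall, rank = 3)
    Fhat <- basis(jfit)             # shared nonnegative 'global scores' (samples x factors)
    Qhat <- coef(jfit)              # factors x variables; its column blocks correspond to Q1..Q3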

Advantages of dimension reduction when integrating multi-assay data

Dimension reduction or latent variable approaches provide EDA, integrate multi-assay data, highlight global correlations across data sets and discover outliers or batch effects in individual data sets. Dimension reduction approaches also facilitate downstream analysis of both observations and variables (genes). Compared with cluster analysis of individual data sets, cluster analysis of the global score matrix (the F matrix) is robust, as it aggregates observations across data sets and is less likely to reflect a technical or batch effect of a single data set. Similarly, dimension reduction of multi-assay data facilitates downstream gene set, pathway and network analysis of variables. MCIA transforms variables from each data set onto the same scale, and their loadings (the Q matrix) rank the variables by their contribution to the global data structure (Figure 2). Meng et al. report that pathway or gene set enrichment analysis (GSEA) of the transformed variables is more sensitive than GSEA of each individual data set. This is both because of the re-weighting and transformation of the variables and because GSEA on the combined data has greater coverage of variables (genes), thus compensating for missing or unreliable information in any single data set. For example, in Figure 2B we integrate and transform mRNA, proteomics and miRNA data onto the same scale, allowing us to extract and study the union of all variables.

Example case study

To demonstrate the integration of multiple data sets using dimension reduction, we applied MCIA to analyze mRNA, miRNA and proteomics expression profiles of melanoma, leukemia and CNS cell lines from the NCI-60 panel. The graphical output from this analysis, namely plots of the sample space, the variable space and the data weighting space, is provided in Figure 2A, B and E. The eigenvalues can be interpreted similarly to PCA: a higher eigenvalue contributes more information to the global score. As with PCA, researchers may be subjective in their selection of the number of components [24]. The scree plot in Figure 2D shows the eigenvalue of each global score. In this case, the first two eigenvalues were substantially larger, so we visualized the cell lines and variables on PC1 and PC2.

Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) Plot of the first two components in sample space (sample 'type' is coded by point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a 'star', in which the three omics data points for that cell line are connected by lines to a center point, the global score (F) for that cell line; the shorter the lines, the higher the concordance between the data types and the global structure. (B) The variable space of MCIA. A variable that is highly expressed in a cell line is projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are strongly associated with the cancer tissue of origin. (C) Correlation coefficients of the proteome profile of SR with the other cell lines. The proteome profile of the SR cell line is more correlated with the melanoma cell lines; there may be a technical issue with the LE_SR proteomics data. (D) Scree plot of the eigenvalues, and (E) plot of the data weighting space. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org/.

In the observation space (Figure 2A), assay data (mRNA, miRNA and proteomics) are distinguished by shape. The coordinates of each cell line (F_k in Figure 2A) are connected by lines to the global scores (F). Short lines between points and cell line global scores reflect high concordance in the cell line data. Most cell lines have concordant information between data sets (mRNA, miRNA, protein), as indicated by relatively short lines. In addition, the RV coefficient [82, 83], which is a generalized Pearson correlation coefficient for matrices, may be used to estimate the correlation between two transformed data sets. The RV coefficient takes values between 0 and 1, where a higher value indicates higher co-structure. In this example, we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].
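The RV coefficient itself is straightforward to compute in R, for example with RV.rtest() from ade4; X and Y below are hypothetical preprocessed tables with matched samples in rows, and the modified RV coefficient of [84] is available in other packages (for example as RV2() in MatrixCorrelation, an assumption worth checking against that package's documentation).

    library(ade4)
    rv <- RV.rtest(as.data.frame(scale(X)), as.data.frame(scale(Y)), nrepet = 99)
    rv$obs      # the RV coefficient (0 = no co-structure, 1 = identical structure)
    rv$pvalue   # permutation p-value for the association between the two tables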

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two groups (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to the melanoma rather than the leukemia cell lines. We examined within-tumor-type correlations to the SR cell line (Figure 2C) and observed that the SR proteomics data had a higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data on the SR cell line (Figure 2A and C).

MCIA projects all variables into the same space. The variable space (Q_1, ..., Q_K) is visualized in Figure 2B. Variables and samples projected in the same direction are associated. This allows one to select the variables most strongly associated with specific observations from each data set for subsequent analysis. In our previous study [85], we showed that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched in melanogenesis functions, and the genes/proteins highly weighted on the leukemia side are highly enriched in T-cell or immune-related functions.

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86–88], had high weights on the positive ends of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development, and miR-223 is regulated by the Notch and NF-kB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA most strongly associated with the CNS cell lines was miR-409. This miRNA is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, the CNS cell lines are characterized by a more pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (which correspond to the melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506 strongly associated with the melanoma cell lines; these miRNAs are reported to initiate melanocyte transformation and promote melanoma growth [85].

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. These data are biologically complex. In addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for complex multi-omics data are needed.

We have described emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in the variable loadings may help in the selection of variables for downstream analysis. Few comparisons of the different methods exist, and the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful for this, but it remains an open area of research.

Another limitation of these methods is that, although the variables may vary among data sets, the observations need to be matchable. Therefore, researchers need careful experimental design in the early stages of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines the Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].

The number of variables in genomics data is a challenge to traditional EDA visualization tools. Most visualization approaches were designed for data sets with fewer variables. Within R, new packages, including ggord, are being developed, and dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, interpretation of long lists of biological variables (genes, proteins, miRNAs) is challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotation, such as Gene Ontology or Reactome, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60].

Conclusion

Dimension reduction methods have a long history, and many similar methods have been developed in parallel by multiple fields. In this review, we provided an overview of dimension reduction techniques that are well established, as well as some that may be new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension reduction methods, which can be applied to exploratory data analysis of a single data set or to integrated analysis of a pair of data sets or multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.


• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNAs, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of the consensus and incongruence in and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-data set methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway and other downstream analyses.

R Supplement

R code to re-generate all figures in this article is available as a Supplementary File. Data and code are also available on the GitHub repository https://github.com/aedin/NCI60Example.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

GGT was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. AC was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program through the Breast Cancer Research Program under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

References

1. Brazma A, Culhane AC. Algorithms for gene expression analysis. In: Jorde LB, Little PFR, Dunn MJ, Subramaniam S (eds). Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. London: John Wiley & Sons, 2005, 3148–59.
2. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733–9.
3. Legendre P, Legendre LFJ. Numerical Ecology, 3rd edn. Amsterdam: Elsevier Science, 2012.
4. Biton A, Bernard-Pierrot I, Lou Y, et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep 2014;9:1235–45.
5. Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929–44.
6. Verhaak RG, Tamayo P, Yang JY, et al. Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest 2013;123:517–25.
7. Senbabaoglu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep 2014;4:6207.
8. Gusenleitner D, Howe EA, Bentink S, et al. iBBiG: iterative binary bi-clustering of gene sets. Bioinformatics 2012;28:2484–92.
9. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Magazine Series 1901;6(2):559–72.
10. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933;24:417.
11. Bland M. An Introduction to Medical Statistics, 3rd edn. Oxford; New York: Oxford University Press, 2000, xvi, 405.
12. Warner RM. Applied Statistics: From Bivariate through Multivariate Techniques, 2nd edn. London: SAGE Publications Inc, 2012, 1004.
13. Dray S, Chessel D, Thioulouse J. Co-inertia analysis and the linking of ecological data tables. Ecology 2003;84:3078–89.
14. Golub GH, Loan CF. Matrix Computations. Baltimore: JHU Press, 2012.
15. Anthony M, Harvey M. Linear Algebra: Concepts and Methods, 1st edn. Cambridge: Cambridge University Press, 2012, 149–60.
16. McShane LM, Cavenagh MM, Lively TG, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med 2013;11:220.
17. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
18. Tukey JW. Exploratory Data Analysis. Boston: Addison-Wesley Publishing Company, 1977.
19. Hogben L. Handbook of Linear Algebra, 1st edn. Boca Raton: Chapman & Hall/CRC Press, 2007, 20.
20. Ringner M. What is principal component analysis? Nat Biotechnol 2008;26:303–4.
21. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003, 91–109.
22. ter Braak CJF, Prentice IC. A theory of gradient analysis. Adv Ecol Res 1988;18:271–313.
23. ter Braak CJF, Smilauer P. CANOCO Reference Manual and User's Guide to Canoco for Windows: Software for Canonical Community Ordination (version 4). Ithaca, NY, USA: Microcomputer Power, http://www.microcomputerpower.com, 1998, 352.
24. Abdi H, Williams LJ. Principal component analysis. Hoboken: John Wiley & Sons, Interdisciplinary Reviews: Computational Statistics 2010;2(4):433–59.
25. Dray S. On the number of principal components: a test of dimensionality based on measurements of similarity between matrices. Comput Stat Data Anal 2008;52:2228–37.
26. Wouters L, Gohlmann HW, Bijnens L, et al. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 2003;59:1131–9.
27. Greenacre M. Correspondence Analysis in Practice, 2nd edn. Boca Raton: Chapman and Hall/CRC, 2007, 201–11.
28. Goodrich JK, Di Rienzi SC, Poole AC, et al. Conducting a microbiome study. Cell 2014;158:250–62.
29. Fasham MJR. A comparison of nonmetric multidimensional scaling, principal components and reciprocal averaging for the ordination of simulated coenoclines and coenoplanes. Ecology 1977;58:551–61.


30. Beh EJ, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies, 1st edn. Hoboken: John Wiley & Sons, 2014, 120–76.
31. Greenacre MJ. Theory and Applications of Correspondence Analysis, 3rd edn. Waltham, MA: Academic Press, 1993, 83–125.
32. Fagan A, Culhane AC, Higgins DG. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics 2007;7:2162–71.
33. Fellenberg K, Hauser NC, Brors B, et al. Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 2001;98:10781–6.
34. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.
35. Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287–314.
36. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515–34.
37. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 2008;99:1015–34.
38. Lee S, Epstein MP, Duncan R, et al. Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet Epidemiol 2012;36:293–302.
39. Sill M, Saadati M, Benner A. Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics 2015;31:2683–90.
40. Zhao J. Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC. IEEE Trans Cybern 2014;44:1871–83.
41. Zhao Q, Shi X, Huang J, et al. Integrative analysis of '-omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat 2015;7:10.
42. Alter O, Brown PO, Botstein D. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 2003;100:3351–6.
43. Culhane AC, Perriere G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003;4:59.
44. Dray S. Analysing a pair of tables: coinertia analysis and duality diagrams. In: Blasius J, Greenacre M (eds). Visualization and Verbalization of Data. Boca Raton: CRC Press, 2014, 289–300.
45. Le Cao KA, Martin PG, Robert-Granie C, et al. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009;10:34.
46. ter Braak CJF. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 1986;67:1167–79.
47. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77.
48. McGarigal K, Landguth E, Stafford S. Multivariate Statistics for Wildlife and Ecology Research. New York: Springer, 2002.
49. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: a comparison of several methods. Ann Appl Stat 2011;5:2300–25.
50. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
51. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London; New York; Toronto; Sydney; San Francisco: Academic Press, 1979.
52. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics 2009;10:315.
53. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:article 1.
54. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:article 28.
55. Lin D, Zhang J, Li J, et al. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013;14:245.
56. Le Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011;12:253.
57. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007;8:32–44.
58. Doledec S, Chessel D. Co-inertia analysis: an alternative method for studying species–environment relationships. Freshwater Biol 1994;31:277–94.
59. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
60. Meng C, Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics 2014;15:162.
61. de Tayrac M, Le S, Aubry M, et al. Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: multiple factor analysis approach. BMC Genomics 2009;10:32.
62. Abdi H, Williams LJ, Valentin D. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdiscip Rev Comput Stat 2012;4:124–67.
63. De Jong S, Kiers HAL. Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems 1992;14:155–64.
64. Benasseni J, Dosse MB. Analyzing multiset data by the power STATIS-ACT method. Adv Data Anal Classification 2012;6:49–65.
65. Escoufier Y. The duality diagram: a means for better practical applications. Devel Num Ecol 1987;14:139–56.
66. Giordani P, Kiers HAL, Ferraro DMA. Three-way component analysis using the R package ThreeWay. J Stat Software 2014;57(7).
67. Leibovici DG. Spatio-temporal multiway decompositions using principal tensor analysis on k-modes: the R package PTAk. J Stat Software 2010;34(10):1–34.
68. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455–500.
69. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res 2014;238:391–403.
70. Chessel D, Hanafi M. Analyses de la co-inertie de K nuages de points. Rev Stat Appl 1996;44:35–60.
71. Hanafi M, Kohler A, Qannari EM. Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometrics and Intelligent Laboratory Systems 2011;106(1):37–40.
72. Carroll JD. Generalization of canonical correlation analysis to three or more sets of variables. Proc 76th Convent Am Psych Assoc 1968;3:227–8.
73. Takane Y, Hwang H, Abdi H. Regularized multiple-set canonical correlation analysis. Psychometrika 2008;73:753–75.


74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257–84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569–83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660–73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Dublin, 2007, 1147–51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112–31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572–80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379–91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769–77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768–78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401–5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558–70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and overexpression of miR-223. Haematologica 2010;95:1114–21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432–8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672–6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-kB signaling pathways regulate the miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324–35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636–46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846–53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082–90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.




PCA is designed for the analysis of multi-normal distributeddata If data are strongly skewed or extreme outliers are pre-sent the first few axes may only separate those objects with ex-treme values instead of displaying the main axes of variation Ifdata are unimodal or display nonlinear trends one may see dis-tortions or artifacts in the resulting plots in which the secondaxis is an arched function of the first axis In PCA this is calledthe horseshoe effect and it is well described including illustra-tions in Legendre and Legendre [3] Both nonmetric Multi-Dimensional Scaling (MDS) and CA perform better than PCA inthese cases [26 29] Unlike PCA CA can be applied to sparsecount data with many zeros Although designed for contingencytables of nonnegative count data CA and NSCA decomposea chi-squared matrix [30 31] but have been successfullyapplied to continuous data including gene expression and pro-tein profiles [32 33] As described by Fellenberg et al [33] gene

and protein expression can be seen as an approximation of thenumber of corresponding molecules present in the cell during acertain measured condition Additionally Greenacre [27]emphasized that the descriptive nature of CA and NSCA allowstheir application on data tables in general not only on countdata These two arguments support the suitability of CA andNSCA as analysis methods for omics data While CA investi-gates symmetric associations between two variables NSCA cap-tures asymmetric relations between variables Spectral mapanalysis is related to CA and performs comparably with CAeach outperforming PCA in the identification of clusters of leu-kemia gene expression profiles [26] All dimension reductionmethods can be formulated in terms of the duality diagramDetails on this powerful framework are included in theSupplementary Information

Nonnegative matrix factorization (NMF) [34] forces a positiveor nonnegative constraint on the resulting data matrices andsimilar to Independent Component Analysis (ICA) [35] there is no

Figure 1. Results of a PCA analysis of mRNA gene expression data of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines from the NCI-60 cell line panel. All variables were centered and scaled. Results show (A) a biplot where observations (cell lines) are points and gene expression profiles are arrows, (B) a heatmap showing the gene expression of the same 20 genes in the cell lines, where a red to blue scale represents high to low gene expression (light to dark gray represent high to low gene expression in the black and white figure), (C) a correlation circle and (D) a variance barplot of the first ten PCs. To improve the readability of the biplot, some labels of the variables (genes) in (A) have been moved slightly. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org/.
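A figure of this kind can be sketched with base R alone; the code below is only indicative, and the simulated matrix expr stands in for the NCI-60 expression data used in the article.

```r
# Hedged sketch of a PCA biplot and scree/variance barplot similar to Figure 1A and 1D.
set.seed(1)
expr <- matrix(rnorm(15 * 20), nrow = 15,
               dimnames = list(paste0("cell", 1:15), paste0("gene", 1:20)))
pca <- prcomp(expr, center = TRUE, scale. = TRUE)   # variables centered and scaled
biplot(pca)                                         # observations as points, variables as arrows
screeplot(pca, npcs = 10, type = "barplot")         # variance of the first ten PCs
```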


Nonnegative matrix factorization (NMF) [34] forces a positive or nonnegative constraint on the resulting data matrices and, similar to independent component analysis (ICA) [35], there is no requirement for orthogonality or independence in the components. The nonnegative constraint guarantees that only additive combinations of latent variables are allowed. This may be more intuitive in biology, where many biological measurements (e.g. protein concentrations, count data) are represented by positive values. NMF is described in more detail in the Supplementary Information. ICA was recently applied to molecular subtype discovery in bladder cancer [4]. Biton et al. [4] applied ICA to gene expression data of 198 bladder cancers and examined 20 components. ICA successfully decomposed and extracted multiple layers of signal from the data. The first two components were associated with batch effects, but other components revealed new biology about tumor cells and the tumor microenvironment. They also applied ICA to non-bladder cancers and, by comparing the correlation between components, were able to identify a set of bladder cancer-specific components and their associated genes.
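As a rough illustration of these two decompositions (not the analysis of Biton et al.), the sketch below runs NMF and ICA on a simulated nonnegative matrix using the CRAN packages NMF and fastICA; the rank and number of components are arbitrary choices.

```r
# Illustrative NMF and ICA decompositions of a simulated nonnegative matrix.
library(NMF)       # provides nmf(), basis(), coef()
library(fastICA)   # provides fastICA()
set.seed(1)
X <- matrix(rpois(30 * 100, lambda = 5), nrow = 30)  # e.g. 30 samples x 100 genes (counts)
fit_nmf <- nmf(X, rank = 3)         # nonnegative basis (W) and coefficient (H) matrices
W <- basis(fit_nmf); H <- coef(fit_nmf)
fit_ica <- fastICA(X, n.comp = 3)   # independent components; no nonnegativity constraint
str(fit_ica$S)                      # estimated component (source) matrix
```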

As omics data sets tend to have high dimensionality (p >> n), it is often useful to reduce the number of variables. Several recent extensions of PCA include variable selection, often via a regularization step or L1 penalization (e.g. the Least Absolute Shrinkage and Selection Operator, LASSO) [36]. The NIPALS algorithm uses an iterative regression approach to calculate the components and loadings, which is easily extended to have a sparse operator that can be included during regression on the component [37]. A cross-validation approach can be used to determine the level of sparsity. Sparse, penalized and regularized extensions of PCA and related methods have been described recently [36, 38-41].
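A minimal sketch of sparse PCA with an L1 constraint is given below, using PMA::SPC as one possible implementation (mixOmics::spca is an alternative); the sparsity parameter sumabsv is illustrative and would normally be tuned, for example by cross-validation with PMA::SPC.cv.

```r
# Sketch: sparse PCA on a p >> n matrix; sumabsv controls the L1 budget of each loading vector.
library(PMA)
set.seed(1)
X <- scale(matrix(rnorm(40 * 500), nrow = 40))  # 40 samples, 500 variables
fit <- SPC(X, sumabsv = 5, K = 3)               # 3 sparse components
colSums(fit$v != 0)                             # number of variables retained per component
```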

Integrative analysis of two data sets

One-table dimension reduction methods have been extended to the EDA of two matrices and can simultaneously decompose and integrate a pair of matrices that measure different variables on the same observations (Table 3). Methods include generalized SVD [42], Co-Inertia Analysis (CIA) [43, 44], and sparse or penalized extensions of Partial Least Squares (PLS), Canonical Correspondence Analysis (CCA) and Canonical Correlation Analysis (CCA) [36, 45-47]. Note that both canonical correspondence analysis and canonical correlation analysis are referred to by the acronym CCA. Canonical correspondence analysis is a constrained form of CA that is widely used in ecological statistics [46]; however, it has yet to be adopted by the genomics community in the analysis of pairs of omics data. By contrast, several groups have applied extensions of canonical correlation analysis to omics data integration. Therefore, in this review, we use CCA to describe canonical correlation analysis.

Canonical correlation analysis

Two omics data sets X (dimension n × p_x) and Y (dimension n × p_y) can be expressed by the following latent component decomposition problem:

X = F_x Q_x^T + E_x
Y = F_y Q_y^T + E_y        (9)

where F_x and F_y are n × r matrices whose r columns are components that explain the co-structure between X and Y, the columns of Q_x and Q_y are variable loading vectors for X and Y, respectively, and E_x and E_y are error terms.

Proposed by Hotelling in 1936 [47], CCA searches for associations or correlations among the variables of X and Y [47] by maximizing the correlation between X q_x^i and Y q_y^i; for the ith component:

arg max_{q_x^i, q_y^i} cor(X q_x^i, Y q_y^i)        (10)

In CCA, the components X q_x^i and Y q_y^i are called canonical variates and their correlations are the canonical correlations.
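For completeness, classical CCA is available in base R as cancor, but only when the number of observations exceeds the number of variables in each table, as in the toy sketch below; it is not applicable to p >> n omics data, which motivates the sparse variants discussed next.

```r
# Toy classical CCA on low-dimensional simulated data (n much larger than p).
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)   # first data set, 5 variables
Y <- matrix(rnorm(n * 4), n, 4)   # second data set, 4 variables, same observations
cc <- cancor(X, Y)
cc$cor                            # canonical correlations
```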

Sparse canonical correlation analysis

The main limitation of applying CCA to omics data is that it requires an inversion of the correlation or covariance matrix [38, 49, 50], which cannot be calculated when the number of variables exceeds the number of observations [46]. In high-dimensional omics data, where p >> n, application of these methods requires a regularization step. This may be accomplished by adding a ridge penalty, that is, adding a multiple of the identity matrix to the covariance matrix [51]. A sparse solution of the loading vectors (Q_x and Q_y) filters the number of variables and simplifies the interpretation of results. For this purpose, penalized CCA [52], sparse CCA [53], CCA-l1 [54], CCA elastic net (CCA-EN) [45] and CCA-group sparse [55] have been introduced and applied to the integrative analysis of two omics data sets. Witten et al. [36] provided an elegant comparison of various CCA extensions, accompanied by a unified approach to compute both penalized CCA and sparse PCA. In addition, Witten and Tibshirani [54] extended sparse CCA into a supervised framework, which allows the integration of two data sets with a quantitative phenotype, for example, selecting variables from both genomics and transcriptomics data and linking them to drug sensitivity data.
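The sketch below indicates how penalized/sparse CCA can be run with PMA::CCA [36]; the two simulated blocks and the penalty values are placeholders, and in practice the penalties would be tuned (for example with PMA::CCA.permute).

```r
# Hedged sketch of sparse CCA on two simulated p >> n blocks measured on the same samples.
library(PMA)
set.seed(1)
n <- 30
X <- matrix(rnorm(n * 200), n, 200)   # e.g. transcriptomics
Z <- matrix(rnorm(n * 150), n, 150)   # e.g. proteomics, same observations
fit <- CCA(X, Z, typex = "standard", typez = "standard",
           penaltyx = 0.3, penaltyz = 0.3, K = 2)   # 2 sparse canonical variate pairs
sum(fit$u[, 1] != 0); sum(fit$v[, 1] != 0)          # variables selected on the first pair
```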

Partial least squares analysis

PLS is an efficient dimension reduction method in the analysis of high-dimensional omics data. PLS maximizes the covariance, rather than the correlation, between components, making it more resistant to outliers than CCA. Additionally, PLS does not suffer from the p >> n problem as CCA does. Nonetheless, a sparse solution is desired in some cases. For example, Le Cao et al. [56] proposed a sparse PLS method for feature selection by introducing a LASSO penalty on the loading vectors. In a recent comparison, sPLS performed similarly to sparse CCA [45]. There are many implementations of PLS, which optimize different objective functions with different constraints, several of which are described by Boulesteix et al. [57].
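A hedged sketch of sparse PLS using mixOmics::spls follows; keepX and keepY (the number of variables retained per component in each block) are illustrative values that would usually be chosen by cross-validation.

```r
# Sketch: sparse PLS coupling two simulated omics blocks on the same samples.
library(mixOmics)
set.seed(1)
n <- 30
X <- matrix(rnorm(n * 500), n, 500)
Y <- matrix(rnorm(n * 100), n, 100)
fit <- spls(X, Y, ncomp = 2, keepX = c(25, 25), keepY = c(10, 10))
plotVar(fit)   # correlation-circle style plot of the selected variables
```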

Co-Inertia analysis

CIA is a descriptive, nonconstrained approach for coupling pairs of data matrices. It was originally proposed to link two ecological tables [13, 58] but has been successfully applied in integrative analysis of omics data [32, 43]. CIA is implemented and formulated under the duality diagram framework (Supplementary Information). CIA is performed in two steps: (i) application of a dimension reduction technique such as PCA, CA or NSCA to the initial data sets, depending on the type of data (binary, categorical, discrete counts or continuous data); and (ii) constraining the projections of the orthogonal axes such that they are maximally covariant [43, 58]. CIA does not require an inversion step of the correlation or covariance matrix; thus it can be applied to high-dimensional genomics data without regularization or penalization.
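The two-step procedure just described can be sketched with the ade4 package: a one-table ordination of each matrix (step i), followed by coinertia() to couple the two ordinations (step ii). The simulated data below are placeholders.

```r
# Sketch of co-inertia analysis with ade4 on two simulated tables sharing the same samples.
library(ade4)
set.seed(1)
n <- 25
X <- as.data.frame(matrix(rnorm(n * 100), n, 100))  # e.g. mRNA
Y <- as.data.frame(matrix(rnorm(n * 60),  n, 60))   # e.g. proteins, same samples
dudiX <- dudi.pca(X, scale = TRUE, scannf = FALSE, nf = 3)   # step (i) for each table
dudiY <- dudi.pca(Y, scale = TRUE, scannf = FALSE, nf = 3)
cia   <- coinertia(dudiX, dudiY, scannf = FALSE, nf = 2)     # step (ii): maximal covariance
summary(cia)   # includes the RV coefficient between the two tables
```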


Though closely related to CCA [49], CIA maximizes the squared covariance between the linear combinations of the preprocessed matrices; that is, for the ith dimension:

arg max_{q_x^i, q_y^i} cov^2(X q_x^i, Y q_y^i)        (11)

Equation (11) can be decomposed as

cov^2(X q_x^i, Y q_y^i) = cor^2(X q_x^i, Y q_y^i) var(X q_x^i) var(Y q_y^i)        (12)

The CIA decomposition of covariance maximizes both the variance and the correlation between matrices and thus is less sensitive to outliers. The relationships between CIA, Procrustes analysis [13] and CCA have been well described [49]. A comparison between sCCA (with elastic net penalty), sPLS and CIA is provided by Le Cao et al. [45]. In summary, CIA and sPLS both maximize the covariance between eigenvectors and efficiently identify joint and individual variance in paired data. In contrast, CCA-EN maximizes the correlation between eigenvectors and will discover effects present in both data sets but may fail to discover strong individual effects [45]. Both sCCA and sPLS are sparse methods that select similar subsets of variables, whereas CIA does not include a feature selection step; thus, in terms of feature selection, results of CIA are more likely to contain redundant information in comparison with sparse methods [45].

Similar to classical dimension reduction approaches, the number of dimensions to be examined needs to be considered and can be visualized using a scree plot (similar to Figure 1D). Components may be evaluated by their associated variance [25], the elbow test or the Q2 statistic, as described previously. For example, the Q2 statistic was applied to select the number of dimensions in the predictive mode of PLS [56]. In addition, when a sparse factor is introduced in the loading vectors, cross-validation approaches may be used to determine the number of variables selected from each pair of components. Selection of the number of components and optimization of these parameters is still an open question and an active area of research.

Integrative analysis of multi-assay data

There is a growing need to integrate more than two data sets in genomics. Generalizations of dimension reduction methods to three or more data sets are sometimes called K-table methods [59-61], and a number of them have been applied to multi-assay data (Table 4). Simultaneous decomposition and integration of multiple matrices is more complex than the analysis of a single data set or paired data, because each data set may have different numbers of variables, scales or internal structure and thus a different variance. This might produce global scores that are dominated by one or a few data sets. Therefore, data are preprocessed before decomposition. Preprocessing is often performed on two levels. On the variable level, variables are often centered and normalized so that their sum of squared values or variance equals 1. This procedure enables all the variables to have an equal contribution to the total inertia (sum of squares of all elements) of a data set. However, the number of variables may vary between data sets, or filtering/preprocessing steps may generate data sets that have a higher variance contribution to the final result. Therefore, a data-set-level normalization is also required. In the simplest K-table analysis, all matrices have equal weights. More commonly, data sets are weighted by their expected contribution or expected data quality, for example by the square root of their total inertia or by the square root of the number of columns of each data set [62]. Alternatively, greater weights can be given to smaller or less redundant matrices, to matrices that have more stable predictive information or to those that share more information with other matrices. Such weighting approaches are implemented in multiple factor analysis (MFA), principal covariates regression [63] and STATIS.
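A minimal sketch of one such data-set-level normalization is given below: each centered block is rescaled so that its total inertia equals 1 before a joint decomposition. This is only one of the weighting schemes mentioned above, and normalize_block is a hypothetical helper written for illustration.

```r
# Sketch: scale each (column-centered) block so its total inertia (sum of squares) is 1.
normalize_block <- function(X) {
  X <- scale(X, center = TRUE, scale = FALSE)
  X / sqrt(sum(X^2))                         # total inertia of the returned block is 1
}
blocks <- list(mrna = matrix(rnorm(200), 10, 20),
               prot = matrix(rnorm(80),  10, 8))
blocks <- lapply(blocks, normalize_block)
sapply(blocks, function(B) sum(B^2))         # both equal 1
```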

The simplest multi-omics data integration is when the data sets have the same variables and observations, that is, matched rows and matched columns. In genomics, these could be produced when variables from different multi-assay data sets are mapped to a common set of genomic coordinates or gene identifiers, thus generating data sets with matched variables and matched observations. Alternatively, a repeated analysis or longitudinal analysis of the same samples and the same variables could produce such data (one should note that these dimension reduction approaches do not model the time correlation between different data sets). There is a history of such analyses in ecology, where counts of species and environment variables are measured over different seasons [49, 64, 65], and in psychology, where different standardized tests are measured multiple times on study populations [66, 67]. Analysis of such variables × samples × time data is called three-mode decomposition, triadic cube or three-way table analysis, tensor decomposition, three-way PCA, three-mode PCA, three-mode factor analysis, Tucker-3 model, Tucker3, TUCKALS3 or multi-block analysis, among other names (Table 4). The relationships between various tensor or higher-order decompositions for multi-block analysis are reviewed by Kolda and Bader [68].

More frequently, we need to find associations in multi-assay data that have matched observations but different and unmatched numbers of variables. For example, TCGA generated miRNA and mRNA transcriptome (RNA-seq, microarray), DNA copy number, DNA mutation, DNA methylation and proteomics molecular profiles on each tumor. The NCI-60 and the Cancer Cell Line Encyclopedia projects have measured pharmacological compound profiles in addition to exome sequencing and transcriptomic profiles. Methods that can be applied to EDA of a K-table of multi-assay data with different variables include MCIA, MFA, generalized CCA (GCCA) and consensus PCA (CPCA).

The K-table methods can be generally expressed by the following model:

X_1 = F Q_1^T + E_1
...
X_k = F Q_k^T + E_k
...
X_K = F Q_K^T + E_K        (13)

where there are K matrices or omics data sets X_1, ..., X_K. For convenience, we assume that the rows of X_k share a common set of observations, but the columns of X_k may each have different variables. F is the 'global score' matrix. Its columns are the PCs and are interpreted similarly to PCs from a PCA of a single data set. The global score matrix, which is identical in all decompositions, integrates information from all data sets. Therefore, it is not specific to any single data set; rather, it represents the common pattern defined by all data sets. The matrices Q_k, with k ranging from 1 to K, are the loading or coefficient matrices. A high positive value indicates a strong positive contribution of the corresponding variable to the 'global score'.


While the above methods are formulated for multiple data sets with different variables but the same observations, most can be similarly formulated for multiple data sets with the same variables but different observations [69].

Multiple co-inertia analysis

MCIA is an extension of CIA that aims to analyze multiple matrices through optimizing a covariance criterion [60, 70]. MCIA simultaneously projects K data sets into the same dimensional space. Instead of maximizing the covariance between scores from two data sets, as in CIA, the optimization problem used by MCIA is, for dimension i:

arg max_{q_1^i, ..., q_k^i, ..., q_K^i} Σ_{k=1}^{K} cov^2(X_k q_k^i, X q^i)        (14)

with the constraints that ||q_k^i|| = var(X q^i) = 1, where X = (X_1 | ... | X_k | ... | X_K) and q^i holds the corresponding loading values (the 'global' loading vector) [60, 69, 70].

MCIA derives a set of 'block scores' X_k q_k^i using linear combinations of the original variables from each individual matrix. The global score X q^i is then further defined as the linear combination of 'block scores'. In practice, the global scores represent a correlated structure defined by multiple data sets, whereas the concordance and discrepancy between these data sets may be revealed by the block scores (for details, see the 'Example case study' section). MCIA may be calculated with an ad hoc extension of the NIPALS PCA [71]. This algorithm starts with an initialization step in which the global scores and the block loadings for the first dimension are computed. The residual matrices are calculated in an iterative step by removing the variance induced by the variable loadings (the 'deflation' step). For higher-order solutions, the same procedure is applied to the residual matrices and re-iterated until the desired number of dimensions is reached. Therefore, the computation time strongly depends on the number of desired dimensions. MCIA is implemented in the R package omicade4 and has been applied to the integrative analysis of transcriptomic and proteomic data sets from the NCI-60 cell lines [60].
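The sketch below runs MCIA with omicade4 on the NCI60_4arrays example data shipped with that package (four transcriptomic platforms, rather than the mRNA/miRNA/proteomics combination analyzed in the case study below); the plotting arguments follow the package's documented interface and may differ between versions.

```r
# Sketch of MCIA on the multi-platform NCI-60 example data from omicade4.
library(omicade4)
data(NCI60_4arrays)                  # named list of data frames: genes in rows, cell lines in columns
sapply(NCI60_4arrays, dim)           # matched cell lines, different numbers of variables
fit <- mcia(NCI60_4arrays, cnf = 2)  # two global components
cancer_type <- substr(colnames(NCI60_4arrays[[1]]), 1, 2)   # tissue prefix of each cell line name
plot(fit, axes = 1:2, phenovec = cancer_type, sample.lab = FALSE)
```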

Generalized canonical correlation analysis

GCCA [71] is a generalization of CCA to K-table analysis [73-75]. It has also been applied to the analysis of omics data [36, 76]. Typically, MCIA and GCCA will produce similar results (for a more detailed comparison, see [60]). GCCA uses a different deflation strategy from MCIA: it calculates the residual matrices by removing the variance with respect to the 'block scores' (instead of the 'variable loadings' used by MCIA or the 'global scores' used by CPCA, see later). When applied to omics data where p >> n, a variable selection step is often integrated within the GCCA approach, which cannot be done in the case of MCIA. In addition, as block scores are better representations of a single data set (in contrast to the global score), GCCA is more likely to find common variables across data sets. Witten and Tibshirani [54] applied sparse multiple CCA to analyze gene expression and copy number variation (CNV) data from diffuse large B-cell lymphoma patients and successfully identified 'cis interactions' that are both up-regulated in CNV and mRNA data.

Consensus PCA

CPCA is closely related to GCCA and MCIA but has had less exposure in the omics data community. CPCA optimizes the same criterion as GCCA and MCIA and is subject to the same constraints as MCIA [71]. The deflation step of CPCA relies on the 'global score'. As a result, it guarantees the orthogonality of the global scores only and tends to find common patterns in the data sets. This characteristic makes it more suitable for the discovery of joint patterns in multiple data sets, such as in joint clustering problems.

Regularized generalized canonical correlation analysis

Recently, Tenenhaus and Tenenhaus [69, 74] proposed regularized generalized CCA (RGCCA), which provides a unified framework for different K-table multivariate methods. The RGCCA model introduces some extra parameters, in particular a shrinkage parameter and a linkage parameter. The linkage parameter is defined so that the connection between matrices may be customized. The shrinkage parameter ranges from 0 to 1. Setting this parameter to 0 will force the correlation criterion (the criterion used by GCCA), whereas a shrinkage parameter of 1 will apply the covariance criterion (used by MCIA and CPCA). A value between 0 and 1 leads to a compromise between the two options. In practice, the correlation criterion is better at explaining the correlated structure across data sets while discarding the variance within each individual data set. The introduction of the extra parameters makes RGCCA highly versatile: GCCA, CIA and CPCA can be described as special cases of RGCCA (see [69] and the Supplementary Information). In addition, RGCCA also integrates a feature selection procedure, named sparse GCCA (SGCCA). Tenenhaus et al. [76] applied SGCCA to combine gene expression, comparative genomic hybridization and a qualitative phenotype measured on a set of 53 children with glioma. Sparse multiple CCA [54] and SGCCA [76] are available in the R packages PMA and RGCCA, respectively. Similarly, a higher-order implementation of sparse PLS is described in Zhao et al. [77].
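The sketch below assumes the classic interface of the RGCCA package, rgcca(A, C, tau, ncomp, scheme); argument names have changed in later releases, so this is indicative only. Setting the shrinkage parameter tau to 1 for every block corresponds to the covariance criterion discussed above.

```r
# Hedged sketch of RGCCA on three simulated blocks with matched observations.
library(RGCCA)
set.seed(1)
n <- 30
A <- list(mrna = matrix(rnorm(n * 200), n, 200),
          cnv  = matrix(rnorm(n * 150), n, 150),
          prot = matrix(rnorm(n * 50),  n, 50))
C <- 1 - diag(3)                          # fully connected design (linkage) between blocks
fit <- rgcca(A, C, tau = rep(1, 3),       # tau = 1: covariance criterion; tau = 0: correlation
             ncomp = rep(2, 3), scheme = "centroid")
```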

Joint NMF

NMF has also been extended to jointly factorize multiple matrices. In joint NMF, the values in the global score F and in the coefficient matrices (Q_1, ..., Q_K) are nonnegative, and there is no explicit definition of the block loadings. An optimization algorithm is applied to minimize an objective function, typically the sum of squared errors, i.e. Σ_{k=1}^{K} ||E_k||^2. This approach can be considered to be a nonnegative implementation of PARAFAC, although it has also been implemented using the Tucker model [78-80]. Zhang et al. [81] apply joint NMF to a three-way analysis of DNA methylation, gene expression and miRNA expression data to identify modules in each of these regulatory layers that are associated with each other.
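As a rough illustration of the shared-score model in Equation (13), the sketch below couples two nonnegative blocks by factorizing their column-wise concatenation with the NMF package; this simple construction shares the basis matrix F across blocks, but it is not the algorithm of Zhang et al. [81], and block-level scaling is omitted.

```r
# Sketch: a simple joint NMF by factorizing the concatenated blocks with a shared basis.
library(NMF)
set.seed(1)
n  <- 30
X1 <- matrix(rpois(n * 100, 5), n, 100)   # e.g. gene expression (nonnegative)
X2 <- matrix(rpois(n * 40,  5), n, 40)    # e.g. miRNA expression (nonnegative)
fit <- nmf(cbind(X1, X2), rank = 3)
F_global <- basis(fit)                    # shared nonnegative score matrix F (n x 3)
Q1 <- t(coef(fit)[, 1:100])               # block coefficient matrices, X1 ~ F t(Q1)
Q2 <- t(coef(fit)[, 101:140])             #                              X2 ~ F t(Q2)
```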

Advantages of dimension reduction when integrating multi-assay data

Dimension reduction or latent variable approaches provide EDA, integrate multi-assay data, highlight global correlations across data sets and discover outliers or batch effects in individual data sets. Dimension reduction approaches also facilitate downstream analysis of both observations and variables (genes). Compared with cluster analysis of individual data sets, cluster analysis of the global score matrix (F matrix) is robust, as it aggregates observations across data sets and is less likely to reflect a technical or batch effect of a single data set. Similarly, dimension reduction of multi-assay data facilitates downstream gene set, pathway and network analysis of variables. MCIA transforms variables from each data set onto the same scale, and their loadings (Q matrix) rank the variables by their contribution to the global data structure (Figure 2). Meng et al. report that pathway or gene set enrichment analysis (GSEA) of the transformed variables is more sensitive than GSEA of each individual data set. This is both because of the re-weighting and transformation of variables and because GSEA on the combined data has greater coverage of variables (genes), thus compensating for missing or unreliable information in any single data set. For example, in Figure 2B, we integrate and transform mRNA, proteomics and miRNA data onto the same scale, allowing us to extract and study the union of all variables.

Example case study

To demonstrate the integration of multiple data sets using dimension reduction, we applied MCIA to analyze mRNA, miRNA and proteomics expression profiles of melanoma, leukemia and CNS cell lines from the NCI-60 panel. The graphical output from this analysis, a plot of the sample space, variable space and data weighting space, is provided in Figure 2A, B and E. The eigenvalues can be interpreted similarly to PCA: a higher eigenvalue contributes more information to the global score. As with PCA, researchers may be subjective in their selection of the number of components [24]. The scree plot in Figure 2D shows the eigenvalues of each global score. In this case, the first two eigenvalues were markedly larger, so we visualized the cell lines and variables on PC1 and PC2.

In the observation space (Figure 2A), assay data (mRNA, miRNA and proteomics) are distinguished by shape. The coordinates of each cell line (F_k in Figure 2A) are connected by lines to the global scores (F); short lines between points and cell line global scores reflect high concordance in the cell line data. Most cell lines have concordant information between data sets (mRNA, miRNA, protein), as indicated by relatively short lines. In addition, the RV coefficient [82, 83], which is a generalized Pearson correlation coefficient for matrices, may be used to estimate the correlation between two transformed data sets. The RV coefficient takes values between 0 and 1, where a higher value indicates higher co-structure.

Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) shows a plot of the first two components in sample space (sample 'type' is coded by the point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a 'star', where the three omics data for each cell line are connected by lines to a center point, which is the global score (F) for that cell line; the shorter the line, the higher the level of concordance between the data types and the global structure. (B) shows the variable space of MCIA. A variable that is highly expressed in a cell line will be projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are strongly associated with cancer tissue of origin. (C) shows the correlation coefficients of the proteome profiling of SR with the other cell lines; the proteome profile of the SR cell line is more correlated with melanoma cell lines, and there may be a technical issue with the LE_SR proteomics data. (D) A scree plot of the eigenvalues and (E) a plot of the data weighting space. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org/.


In this example, we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].
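The RV coefficient between two matched tables, together with a permutation test, can be computed with ade4::RV.rtest, as in the sketch below; the simulated tables are stand-ins for the transformed data sets compared here, so the values are illustrative only.

```r
# Sketch: RV coefficient (and permutation test) between two tables with matched rows.
library(ade4)
set.seed(1)
n <- 21
A <- as.data.frame(matrix(rnorm(n * 50), n, 50))  # e.g. mRNA, cell lines in rows
B <- as.data.frame(matrix(rnorm(n * 30), n, 30))  # e.g. proteomics, same cell lines
rv <- RV.rtest(A, B, nrepet = 99)
rv$obs                                            # observed RV coefficient
```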

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two groups of cell lines (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to melanoma than to leukemia cell lines. We examined within-tumor-type correlations to the SR cell line (Figure 2C) and observed that the SR proteomics data had a higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data on the SR cell line (Figure 2A and C).

MCIA projects all variables into the same space. The variable space (Q_1, ..., Q_K) is visualized in Figure 2B. Variables and samples projected in the same direction are associated. This allows one to select the variables most strongly associated with specific observations from each data set for subsequent analysis. In our previous study [85], we have shown that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched with melanogenesis functions, and genes/proteins highly weighted on the protein side are highly enriched in T-cell or immune-related functions.

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86-88], had high weights on the positive end of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development. miR-223 is regulated by the Notch and NF-kB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA with the strongest association to CNS cell lines was miR-409. This miRNA is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, CNS cell lines are characterized by a more pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (which corresponds to melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506 strongly associated with melanoma cell lines; these miRNAs are reported to initiate melanocyte transformation and promote melanoma growth [85].

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. The data are biologically complex. In addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for complex multi-omics data are needed.

We describe emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in variable loadings may help in the selection of variables for downstream analysis. Few comparisons of different methods exist, and the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful, but this still remains an open area of research.

Another limitation of these methods is that, although variables may vary among data sets, the observations need to be matchable. Therefore, researchers need careful experimental design in the early stages of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines a Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].

The number of variables in genomics data is a challenge to traditional EDA visualization tools. Most visualization approaches were designed for data sets with fewer variables. Within R, new packages, including ggord, are being developed, and dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, interpretation of long lists of biological variables (genes, proteins, miRNAs) is challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotation, such as Gene Ontology or Reactome, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60].

Conclusion

Dimension reduction methods have a long history. Many similar methods have been developed in parallel by multiple fields. In this review, we provided an overview of dimension reduction techniques that are well established but may be new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension reduction methods, which can be applied to exploratory data analysis of a single data set or integrated analysis of a pair or multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.


• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNA, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of consensus and incongruence in and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-data set methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway and downstream analyses.

R Supplement

R code to re-generate all figures in this article is available as a Supplementary File. Data and code are also available on the GitHub repository https://github.com/aedin/NCI60Example.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

GGT was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. AC was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program through the Breast Cancer Research Program under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

References

1. Brazma A, Culhane AC. Algorithms for gene expression analysis. In: Jorde LB, Little PFR, Dunn MJ, Subramaniam S (eds). Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. London: John Wiley & Sons, 2005, 3148–59.
2. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733–9.
3. Legendre P, Legendre LFJ. Numerical Ecology. Amsterdam: Elsevier Science, 3rd edition, 2012.
4. Biton A, Bernard-Pierrot I, Lou Y, et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep 2014;9:1235–45.
5. Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929–44.
6. Verhaak RG, Tamayo P, Yang JY, et al. Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest 2013;123:517–25.
7. Senbabaoglu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep 2014;4:6207.
8. Gusenleitner D, Howe EA, Bentink S, et al. iBBiG: iterative binary bi-clustering of gene sets. Bioinformatics 2012;28:2484–92.
9. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Magazine Series 6 1901;2:559–72.
10. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933;24:417.
11. Bland M. An Introduction to Medical Statistics. 3rd edn. Oxford, New York: Oxford University Press, 2000, xvi, 405.
12. Warner RM. Applied Statistics: From Bivariate through Multivariate Techniques. London: SAGE Publications, Inc, 2nd edition, 2012, 1004.
13. Dray S, Chessel D, Thioulouse J. Co-inertia analysis and the linking of ecological data tables. Ecology 2003;84:3078–89.
14. Golub GH, Loan CF. Matrix Computations. Baltimore: JHU Press, 2012.
15. Anthony M, Harvey M. Linear Algebra: Concepts and Methods. Cambridge: Cambridge University Press, 1st edition, 2012, 149–60.
16. McShane LM, Cavenagh MM, Lively TG, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med 2013;11:220.
17. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
18. Tukey JW. Exploratory Data Analysis. Boston: Addison-Wesley Publishing Company, 1977.
19. Hogben L. Handbook of Linear Algebra. Boca Raton: Chapman & Hall/CRC Press, 1st edition, 2007, 20.
20. Ringner M. What is principal component analysis? Nat Biotechnol 2008;26:303–4.
21. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003, 91–109.
22. ter Braak CJF, Prentice IC. A theory of gradient analysis. Adv Ecol Res 1988;18:271–313.
23. ter Braak CJF, Smilauer P. CANOCO Reference Manual and User's Guide to Canoco for Windows: Software for Canonical Community Ordination (version 4). Ithaca, NY, USA: Microcomputer Power, http://www.microcomputerpower.com, 1998, 352.
24. Abdi H, Williams LJ. Principal component analysis. Hoboken: John Wiley & Sons, Wiley Interdisciplinary Reviews: Computational Statistics 2010;2(4):433–59.
25. Dray S. On the number of principal components: a test of dimensionality based on measurements of similarity between matrices. Comput Stat Data Anal 2008;52:2228–37.
26. Wouters L, Gohlmann HW, Bijnens L, et al. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 2003;59:1131–9.
27. Greenacre M. Correspondence Analysis in Practice. Boca Raton: Chapman and Hall/CRC, 2nd edition, 2007, 201–11.
28. Goodrich JK, Di Rienzi SC, Poole AC, et al. Conducting a microbiome study. Cell 2014;158:250–62.
29. Fasham MJR. A comparison of nonmetric multidimensional scaling, principal components and reciprocal averaging for the ordination of simulated coenoclines and coenoplanes. Ecology 1977;58:551–61.


30. Beh EJ, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies. Hoboken: John Wiley & Sons, 1st edition, 2014, 120–76.
31. Greenacre MJ. Theory and Applications of Correspondence Analysis. Waltham, MA: Academic Press, 3rd edition, 1993, 83–125.
32. Fagan A, Culhane AC, Higgins DG. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics 2007;7:2162–71.
33. Fellenberg K, Hauser NC, Brors B, et al. Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 2001;98:10781–6.
34. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.
35. Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287–314.
36. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515–34.
37. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 2008;99:1015–34.
38. Lee S, Epstein MP, Duncan R, et al. Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet Epidemiol 2012;36:293–302.
39. Sill M, Saadati M, Benner A. Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics 2015;31:2683–90.
40. Zhao J. Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC. IEEE Trans Cybern 2014;44:1871–83.
41. Zhao Q, Shi X, Huang J, et al. Integrative analysis of '-omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat 2015;7:10.
42. Alter O, Brown PO, Botstein D. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 2003;100:3351–6.
43. Culhane AC, Perriere G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003;4:59.
44. Dray S. Analysing a pair of tables: coinertia analysis and duality diagrams. In: Blasius J, Greenacre M (eds). Visualization and Verbalization of Data. Boca Raton: CRC Press, 2014, 289–300.
45. Le Cao KA, Martin PG, Robert-Granie C, et al. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009;10:34.
46. ter Braak CJF. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 1986;67:1167–79.
47. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77.
48. McGarigal K, Landguth E, Stafford S. Multivariate Statistics for Wildlife and Ecology Research. New York: Springer, 2002.
49. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: a comparison of several methods. Ann Appl Stat 2011;5:2300–25.
50. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
51. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London, New York, Toronto, Sydney, San Francisco: Academic Press, 1979.
52. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics 2009;10:315.
53. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:Article 1.
54. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:Article 28.
55. Lin D, Zhang J, Li J, et al. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013;14:245.
56. Le Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011;12:253.
57. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007;8:32–44.
58. Doledec S, Chessel D. Co-inertia analysis: an alternative method for studying species–environment relationships. Freshwater Biol 1994;31:277–94.
59. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
60. Meng C, Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics 2014;15:162.
61. de Tayrac M, Le S, Aubry M, et al. Simultaneous analysis of distinct omics data sets with integration of biological knowledge: multiple factor analysis approach. BMC Genomics 2009;10:32.
62. Abdi H, Williams LJ, Valentin D. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics 2012;4:124–67.
63. De Jong S, Kiers HAL. Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems 1992;14:155–64.
64. Benasseni J, Dosse MB. Analyzing multiset data by the power STATIS-ACT method. Adv Data Anal Classification 2012;6:49–65.
65. Escoufier Y. The duality diagram: a means for better practical applications. Devel Num Ecol 1987;14:139–56.
66. Giordani P, Kiers HAL, Ferraro DMA. Three-way component analysis using the R package ThreeWay. J Stat Software 2014;57(7).
67. Leibovici DG. Spatio-temporal multiway decompositions using principal tensor analysis on k-modes: the R package PTAk. J Stat Software 2010;34(10):1–34.
68. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455–500.
69. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res 2014;238:391–403.
70. Chessel D, Hanafi M. Analyses de la co-inertie de K nuages de points. Rev Stat Appl 1996;44:35–60.
71. Hanafi M, Kohler A, Qannari EM. Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometrics and Intelligent Laboratory Systems 2011;106(1):37–40.
72. Carroll JD. Generalization of canonical correlation analysis to three or more sets of variables. Proc 76th Convent Am Psych Assoc 1968;3:227–8.
73. Takane Y, Hwang H, Abdi H. Regularized multiple-set canonical correlation analysis. Psychometrika 2008;73:753–75.


74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257–84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569–83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660–73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Dublin, 2007, 1147–51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112–31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572–80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379–91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769–77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768–78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401–5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558–70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and over-expression of miR-223. Haematologica 2010;95:1114–21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432–8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672–6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-kB signaling pathways regulate miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324–35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636–46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846–53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082–90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.


  • bbv108-TF1
  • bbv108-TF2
  • bbv108-TF3
  • bbv108-TF4
  • bbv108-TF5
Page 5: Dimension reduction techniques for the integrative ... · Amin M. Gholami is a bioinformatics scientist in the Division of Vaccine Discovery at La Jolla Institute for Allergy and

regression procedure to calculate PCs Computation can be per-formed on data with missing values and it is faster than SVDwhen applied to large matrices Furthermore NIPALS may begeneralized to discover the correlated structure in more thanone data set (see sections on the analysis of multi-omics datasets) Please refer to the Supplementary Information for add-itional details on NIPALS

Visualizing and interpreting results ofdimension reduction analysis

We present an example to illustrate how to interpret results of aPCA PCA was applied to analyze mRNA gene expression data ofa subset of cell lines from the NCI-60 panel those of melanomaleukemia and central nervous system (CNS) tumors The resultsof PCA can be easily interpreted by visualizing the observationsand variables in the same space using a biplot Figure 1A is abiplot of the first (PC1) and second PC (PC2) where points andarrows from the plot origin represent observations and genesrespectively Cell lines (points) with correlated gene expressionprofiles have similar scores and are projected close to eachother on PC1 and PC2 We see that cell lines from the same ana-tomical location are clustered

Both the direction and length of the mRNA gene expressionvectors can be interpreted Gene expression vectors point in thedirection of the latent variable (PC) to which it is most similar(squared correlation) Gene vectors with the same direction (egFAM69B PNPLA2 PODXL2) have similar gene expression pro-files The length of the gene expression vector is proportional tothe squared multiple correlation between the fitted values forthe variable and the variable itself

A gene expression vector and a cell line projected in the samedirection from the origin are positively associated For example inFigure 1A FAM69B PNPLA2 PODXL2 are active (have higher geneexpression) in melanoma cell lines Similarly genes DPPA5 andRPL34P1 are among those genes that are highly expressed inmost leukemia cell lines By contrast genes WASL and HEXBpoint in the opposite direction to most leukemia cell lines indicat-ing low association In Figure 1B it is clear that these genes arenot expressed in leukemia cell lines and are colored blue in theheatmap (these are colored dark gray in grayscale)

The sum of the squared correlation coefficients between avariable and all the components (calculated in equation 8)equals 1 Therefore variables are often shown within a correl-ation circle (Figure 1C) Variables positioned on the unit circlerepresent variables that are perfectly represented by the two di-mensions displayed Those not on the unit circle may requireadditional components to be represented

In most analyses only the first few PCs are plotted andstudied as these explain the most variant trends in the dataGenerally the selection of components is subjective anddepends on the purpose of the EDA An informal elbow test mayhelp to determine the number of PCs to retain and examine [2425] From the scree plot of PC eigenvalues in Figure 1D we mightdecide to examine the first three PCs because the decrease inPC variance becomes relatively moderate after PC3 Anotherapproach that is widely used is to include (or retain) PCs thatcumulatively capture a certain proportion of variance for ex-ample 70 of variance is modeled with three PCs If a parsi-mony model is preferred the variance proportion cutoff can beas low as 50 [24] More formal tests including the Q2 statisticare also available (for details see [24]) In practice the first com-ponent might explain most of the variance and the remainingaxes may simply be attributed to noise from technical or biolo-gical sources in a study with low complexity (eg cell line repli-cates of controls and one treatment condition) However acomplex data set (for example a set of heterogeneous tumors)may require multiple PCs to capture the same amount ofvariance

Different dimension reduction approaches areoptimized for different data

There are many dimension reduction approaches related to PCA(Table 2) including principal co-ordinate analysis (PCoA) cor-respondence analysis (CA) and nonsymmetrical correspond-ence analysis (NSCA) These may be computed by SVD butdiffer in how the data are transformed before decomposition[21 26 27] and therefore each is optimized for specific dataproperties PCoA (also known as Classical MultidimensionalScaling) is versatile as it is a SVD of a distance matrix that canbe applied to decompose distance matrices of binary count or

Table 4 Dimension reduction methods for multiple (more than two) data sets

Method Description Featureselection

Matchedcases

R Function package

MCIA Multiple coinertia analysis No No mciaomicade4 mcoaade4gCCA Generalized CCA No No regCCAdmtrGCCA Regularized generalized CCA No No regCCAdmt rgccargcca

wrapperrgccamixOmicssGCCA Sparse generalized canonical

correlation analysisYes No sgccargcca wrappersgccamixOmics

STATIS Structuration des Tableaux aTrois Indices de la Statistique(STATIS) Family of methods whichinclude X-statis

No No statisade4

CANDECOMPPARAFAC Tucker3

Higher order generalizations of SVDand PCA Require matchedvariables and cases

No Yes CPThreeWay T3ThreeWay PCAnPTaKCANDPARAPTaK

PTA Partial triadic analysis No Yes ptaade4statico Statis and CIA (find structure

between two pairs of K-tables)No No staticoade4

632 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

continuous data It is frequently applied in the analysis ofmicrobiome data [28]

PCA is designed for the analysis of multi-normal distributeddata If data are strongly skewed or extreme outliers are pre-sent the first few axes may only separate those objects with ex-treme values instead of displaying the main axes of variation Ifdata are unimodal or display nonlinear trends one may see dis-tortions or artifacts in the resulting plots in which the secondaxis is an arched function of the first axis In PCA this is calledthe horseshoe effect and it is well described including illustra-tions in Legendre and Legendre [3] Both nonmetric Multi-Dimensional Scaling (MDS) and CA perform better than PCA inthese cases [26 29] Unlike PCA CA can be applied to sparsecount data with many zeros Although designed for contingencytables of nonnegative count data CA and NSCA decomposea chi-squared matrix [30 31] but have been successfullyapplied to continuous data including gene expression and pro-tein profiles [32 33] As described by Fellenberg et al [33] gene

and protein expression can be seen as an approximation of thenumber of corresponding molecules present in the cell during acertain measured condition Additionally Greenacre [27]emphasized that the descriptive nature of CA and NSCA allowstheir application on data tables in general not only on countdata These two arguments support the suitability of CA andNSCA as analysis methods for omics data While CA investi-gates symmetric associations between two variables NSCA cap-tures asymmetric relations between variables Spectral mapanalysis is related to CA and performs comparably with CAeach outperforming PCA in the identification of clusters of leu-kemia gene expression profiles [26] All dimension reductionmethods can be formulated in terms of the duality diagramDetails on this powerful framework are included in theSupplementary Information

Nonnegative matrix factorization (NMF) [34] forces a positiveor nonnegative constraint on the resulting data matrices andsimilar to Independent Component Analysis (ICA) [35] there is no

Figure 1 Results of a PCA analysis of mRNA gene expression data of melanoma (ME) leukemia (LE) and central nervous system (CNS) cell lines from the NCI-60 cell

line panel All variables were centered and scaled Results show (A) a biplot where observations (cell lines) are points and gene expression profiles are arrows (B) a

heatmap showing the gene expression of the same 20 genes in the cell lines red to blue scale represent high to low gene expression (light to dark gray represent high

to low gene expression on the black and white figure) (C) correlation circle (D) variance barplot of the first ten PCs To improve the readability of the biplot some labels

of the variables (genes) in (A) have been moved slightly A colour version of this figure is available online at BIB online httpbiboxfordjournalsorg

Dimension reduction techniques | 633

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

requirement for orthogonality or independence in the compo-nents The nonnegative constraint guarantees that only the addi-tive combinations of latent variables are allowed This may bemore intuitive in biology where many biological measurements(eg protein concentrations count data) are represented by posi-tive values NMF is described in more detail in the SupplementalInformation ICA was recently applied to molecular subtype dis-covery in bladder cancer [4] Biton et al [4] applied ICA to geneexpression data of 198 bladder cancers and examined 20 compo-nents ICA successfully decomposed and extracted multiplelayers of signal from the data The first two components wereassociated with batch effects but other components revealed newbiology about tumor cells and the tumor microenvironmentThey also applied ICA to non-bladder cancers and by comparingthe correlation between components were able to identify a setof bladder cancer-specific components and their associatedgenes

As omics data sets tend to have high dimensionality (p n)it is often useful to reduce the number of variables Several re-cent extensions of PCA include variable selection often via aregularization step or L-1 penalization (eg Least AbsoluteShrinkage and Selection Operator LASSO) [36] The NIPALS algo-rithm uses an iterative regression approach to calculate thecomponents and loadings which is easily extended to have asparse operator that can be included during regression on thecomponent [37] A cross-validation approach can be used to de-termine the level of sparsity Sparse penalized and regularizedextensions of PCA and related methods have been describedrecently [36 38ndash41]

Integrative analysis of two data sets

One-table dimension reduction methods have been extended to the EDA of two matrices and can simultaneously decompose and integrate a pair of matrices that measure different variables on the same observations (Table 3). Methods include generalized SVD [42], Co-Inertia Analysis (CIA) [43, 44], and sparse or penalized extensions of Partial Least Squares (PLS), Canonical Correspondence Analysis (CCA) and Canonical Correlation Analysis (CCA) [36, 45–47]. Note that both canonical correspondence analysis and canonical correlation analysis are referred to by the acronym CCA. Canonical correspondence analysis is a constrained form of CA that is widely used in ecological statistics [46]; however, it is yet to be adopted by the genomics community in the analysis of pairs of omics data. By contrast, several groups have applied extensions of canonical correlation analysis to omics data integration. Therefore, in this review, we use CCA to describe canonical correlation analysis.

Canonical correlation analysis

Two omics data sets X (dimension n x p_x) and Y (dimension n x p_y) can be expressed by the following latent component decomposition problem:

X = F_x Q_x^T + E_x
Y = F_y Q_y^T + E_y        (9)

where F_x and F_y are n x r matrices with r columns of components that explain the co-structure between X and Y. The columns of Q_x and Q_y are variable loading vectors for X and Y, respectively; E_x and E_y are error terms.

Proposed by Hotelling in 1936 [47], CCA searches for associations or correlations among the variables of X and Y by maximizing the correlation between Xq_x^i and Yq_y^i:

for the ith component:  argmax_{q_x^i, q_y^i} cor(X q_x^i, Y q_y^i)        (10)

In CCA, the components Xq_x^i and Yq_y^i are called canonical variates, and their correlations are the canonical correlations.
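In the simple case where both tables have far fewer variables than observations, classical CCA is available in base R through cancor(); the two matrices below are simulated placeholders.

  # X, Y: hypothetical tables with the same 100 observations (rows)
  set.seed(1)
  X <- matrix(rnorm(100 * 5), nrow = 100)
  Y <- matrix(rnorm(100 * 4), nrow = 100)

  cc <- cancor(X, Y)      # classical canonical correlation analysis
  cc$cor                  # canonical correlations, one per component

  # canonical variates (components) on the centered data
  U <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef
  V <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef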

Sparse canonical correlation analysis

The main limitation of applying CCA to omics data is that it requires an inversion of the correlation or covariance matrix [38, 49, 50], which cannot be calculated when the number of variables exceeds the number of observations [46]. In high-dimensional omics data, where p >> n, application of these methods requires a regularization step. This may be accomplished by adding a ridge penalty, that is, adding a multiple of the identity matrix to the covariance matrix [51]. A sparse solution of the loading vectors (Q_x and Q_y) filters the number of variables and simplifies the interpretation of results. For this purpose, penalized CCA [52], sparse CCA [53], CCA-l1 [54], CCA elastic net (CCA-EN) [45] and CCA-group sparse [55] have been introduced and applied to the integrative analysis of two omics data sets. Witten et al. [36] provided an elegant comparison of various CCA extensions, accompanied by a unified approach to compute both penalized CCA and sparse PCA. In addition, Witten and Tibshirani [54] extended sparse CCA into a supervised framework, which allows the integration of two data sets with a quantitative phenotype, for example, selecting variables from both genomics and transcriptomics data and linking them to drug sensitivity data.
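A hedged sketch of penalized (sparse) CCA using the CCA function in the PMA package [36, 54] follows; the matrices, penalties and number of components are placeholders, and the tuning step assumes the CCA.permute helper shipped with PMA.

  library(PMA)

  # x, z: hypothetical omics matrices on the same 50 samples (rows)
  set.seed(1)
  x <- matrix(rnorm(50 * 1000), nrow = 50)   # e.g. mRNA expression
  z <- matrix(rnorm(50 * 300), nrow = 50)    # e.g. methylation or proteomics

  # choose the l1 penalties by permutation
  perm <- CCA.permute(x, z, typex = "standard", typez = "standard")

  # sparse CCA: u and v contain sparse canonical loading vectors
  scca <- CCA(x, z, typex = "standard", typez = "standard",
              penaltyx = perm$bestpenaltyx, penaltyz = perm$bestpenaltyz, K = 2)
  sum(scca$u[, 1] != 0)   # variables of x retained on the first component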

Partial least squares analysis

PLS is an efficient dimension reduction method for the analysis of high-dimensional omics data. PLS maximizes the covariance rather than the correlation between components, making it more resistant to outliers than CCA. Additionally, PLS does not suffer from the p >> n problem as CCA does. Nonetheless, a sparse solution is desired in some cases. For example, Le Cao et al. [56] proposed a sparse PLS method for feature selection by introducing a LASSO penalty on the loading vectors. In a recent comparison, sPLS performed similarly to sparse CCA [45]. There are many implementations of PLS, which optimize different objective functions under different constraints; several of these are described by Boulesteix et al. [57].
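As an illustration, sparse PLS as implemented in the mixOmics package (the spls function used by Le Cao et al. [56]) can be sketched as follows; the data matrices and the numbers of variables kept per component are hypothetical choices, and the helper names follow the mixOmics documentation.

  library(mixOmics)

  # X, Y: hypothetical matrices measured on the same 60 samples (rows)
  set.seed(1)
  X <- matrix(rnorm(60 * 2000), nrow = 60)   # e.g. transcriptomics
  Y <- matrix(rnorm(60 * 150), nrow = 60)    # e.g. proteomics

  # sparse PLS with 2 components, keeping 50 X-variables and 20 Y-variables each
  fit <- spls(X, Y, ncomp = 2, keepX = c(50, 50), keepY = c(20, 20))
  plotVar(fit)                           # correlation circle of selected variables
  head(selectVar(fit, comp = 1)$X$name)  # X variables retained on component 1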

Co-Inertia analysis

CIA is a descriptive, nonconstrained approach for coupling pairs of data matrices. It was originally proposed to link two ecological tables [13, 58] but has been successfully applied in integrative analysis of omics data [32, 43]. CIA is implemented and formulated under the duality diagram framework (Supplementary Information). CIA is performed in two steps: (i) application of a dimension reduction technique such as PCA, CA or NSCA to the initial data sets, depending on the type of data (binary, categorical, discrete counts or continuous data); and (ii) constraining the projections of the orthogonal axes such that they are maximally covariant [43, 58]. CIA does not require an inversion step of the correlation or covariance matrix; thus it can be applied to high-dimensional genomics data without regularization or penalization.


Though closely related to CCA [49], CIA maximizes the squared covariance between linear combinations of the preprocessed matrices, that is,

for the ith dimension:  argmax_{q_x^i, q_y^i} cov^2(X q_x^i, Y q_y^i)        (11)

Equation (11) can be decomposed as

cov^2(X q_x^i, Y q_y^i) = cor^2(X q_x^i, Y q_y^i) var(X q_x^i) var(Y q_y^i)        (12)

This CIA decomposition of the covariance maximizes both the variance and the correlation between matrices and thus is less sensitive to outliers. The relationships between CIA, Procrustes analysis [13] and CCA have been well described [49]. A comparison between sCCA (with elastic net penalty), sPLS and CIA is provided by Le Cao et al. [45]. In summary, CIA and sPLS both maximize the covariance between eigenvectors and efficiently identify joint and individual variance in paired data. In contrast, CCA-EN maximizes the correlation between eigenvectors and will discover effects present in both data sets but may fail to discover strong individual effects [45]. Both sCCA and sPLS are sparse methods that select similar subsets of variables, whereas CIA does not include a feature selection step; thus, in terms of feature selection, results of CIA are more likely to contain redundant information in comparison with sparse methods [45].
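A minimal sketch of CIA with the ade4 package, on which omicade4 builds: each table is first reduced with a one-table method (here PCA) and the two ordinations are then coupled with coinertia(); the data frames are hypothetical and must describe the same observations (rows) with identical row weights.

  library(ade4)

  # two hypothetical tables on the same 40 observations (rows)
  set.seed(1)
  tab1 <- as.data.frame(matrix(rnorm(40 * 500), nrow = 40))   # e.g. mRNA
  tab2 <- as.data.frame(matrix(rnorm(40 * 120), nrow = 40))   # e.g. proteins

  # step (i): one-table ordinations (centered and scaled PCA)
  pca1 <- dudi.pca(tab1, scannf = FALSE, nf = 5)
  pca2 <- dudi.pca(tab2, scannf = FALSE, nf = 5)

  # step (ii): co-inertia analysis, maximizing the squared covariance
  cia <- coinertia(pca1, pca2, scannf = FALSE, nf = 2)
  cia$RV          # RV coefficient, a global measure of co-structure
  plot(cia)       # paired display of the two tables in the co-inertia space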

Similar to classical dimension reduction approaches, the number of dimensions to be examined needs to be considered and can be visualized using a scree plot (similar to Figure 1D). Components may be evaluated by their associated variance [25], the elbow test or the Q2 statistic, as described previously. For example, the Q2 statistic was applied to select the number of dimensions in the predictive mode of PLS [56]. In addition, when a sparse factor is introduced in the loading vectors, cross-validation approaches may be used to determine the number of variables selected from each pair of components. Selection of the number of components and optimization of these parameters remains an open question and an active area of research.
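For example, a quick way to inspect a scree plot and the cumulative variance explained in R is sketched below (base R only; the data matrix is a placeholder):

  set.seed(1)
  X <- matrix(rnorm(50 * 200), nrow = 50)      # hypothetical samples x variables

  pca <- prcomp(X, center = TRUE, scale. = TRUE)
  eig <- pca$sdev^2                            # eigenvalues (component variances)

  barplot(eig[1:10], names.arg = paste0("PC", 1:10),
          ylab = "Variance", main = "Scree plot")
  cumsum(eig) / sum(eig)                       # cumulative proportion of variance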

Integrative analysis of multi-assay data

There is a growing need to integrate more than two data sets in genomics. Generalizations of dimension reduction methods to three or more data sets are sometimes called K-table methods [59–61], and a number of them have been applied to multi-assay data (Table 4). Simultaneous decomposition and integration of multiple matrices is more complex than the analysis of a single data set or of paired data, because each data set may have different numbers of variables, scales or internal structure and thus a different variance. This might produce global scores that are dominated by one or a few data sets. Therefore, data are preprocessed before decomposition. Preprocessing is often performed on two levels. On the variable level, variables are often centered and normalized so that their sum of squared values or variance equals 1. This procedure enables all the variables to contribute equally to the total inertia (sum of squares of all elements) of a data set. However, the number of variables may vary between data sets, or filtering/preprocessing steps may generate data sets that have a higher variance contribution to the final result. Therefore, a data set-level normalization is also required. In the simplest K-table analysis, all matrices have equal weights. More commonly, data sets are weighted by their expected contribution or expected data quality, for example by the square root of their total inertia or by the square root of the number of columns of each data set [62]. Alternatively, greater weights can be given to smaller or less redundant matrices, matrices that have more stable predictive information or those that share more information with other matrices. Such weighting approaches are implemented in multiple-factor analysis (MFA), principal covariates regression [63] and STATIS.
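The two levels of preprocessing described above can be sketched in base R as follows: variables are centered and scaled to unit sum of squares, and each table is then down-weighted by its total inertia (one common choice among the weighting schemes mentioned; the list of tables is a hypothetical placeholder).

  # tables: hypothetical list of matrices with matched rows (observations)
  set.seed(1)
  tables <- list(
    mrna    = matrix(rnorm(30 * 1000), nrow = 30),
    protein = matrix(rnorm(30 * 200),  nrow = 30),
    mirna   = matrix(rnorm(30 * 50),   nrow = 30)
  )

  # variable level: center each column and scale it to unit sum of squares
  scale_cols <- function(X) {
    X <- scale(X, center = TRUE, scale = FALSE)
    apply(X, 2, function(v) v / sqrt(sum(v^2)))
  }
  tables <- lapply(tables, scale_cols)

  # data set level: weight each table so that its total inertia equals 1
  inertia <- sapply(tables, function(X) sum(X^2))
  tables  <- mapply(function(X, w) X / sqrt(w), tables, inertia, SIMPLIFY = FALSE)
  sapply(tables, function(X) sum(X^2))   # each table now contributes equally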

The simplest multi-omics data integration is when the data sets have the same variables and observations, that is, matched rows and matched columns. In genomics, these could be produced when variables from different multi-assay data sets are mapped to a common set of genomic coordinates or gene identifiers, thus generating data sets with matched variables and matched observations. Alternatively, a repeated or longitudinal analysis of the same samples and the same variables could produce such data (one should note that these dimension reduction approaches do not model the time correlation between different data sets). There is a history of such analyses in ecology, where counts of species and environment variables are measured over different seasons [49, 64, 65], and in psychology, where different standardized tests are measured multiple times on study populations [66, 67]. Analyses of such variables x samples x time data are called three-mode decomposition, triadic cube or three-way table analysis, tensor decomposition, three-way PCA, three-mode PCA, three-mode Factor Analysis, Tucker-3 model, Tucker3, TUCKALS3 or multi-block analysis, among others (Table 4). The relationships between various tensor or higher-order decompositions for multi-block analysis are reviewed by Kolda and Bader [68].

More frequently, we need to find associations in multi-assay data that have matched observations but different and unmatched numbers of variables. For example, TCGA generated miRNA and mRNA transcriptome (RNA-seq, microarray), DNA copy number, DNA mutation, DNA methylation and proteomics molecular profiles on each tumor. The NCI-60 and the Cancer Cell Line Encyclopedia projects have measured pharmacological compound profiles in addition to exome sequencing and transcriptomic profiles. Methods that can be applied to EDA of a K-table of multi-assay data with different variables include MCIA, MFA, Generalized CCA (GCCA) and Consensus PCA (CPCA).

The K-table methods can be generally expressed by the following model:

X_1 = F Q_1^T + E_1
...
X_k = F Q_k^T + E_k
...
X_K = F Q_K^T + E_K        (13)

where there are K matrices or omics data sets X_1, ..., X_K. For convenience, we assume that the rows of each X_k share a common set of observations, but the columns of each X_k may be different variables. F is the 'global score' matrix. Its columns are the PCs and are interpreted similarly to PCs from a PCA of a single data set. The global score matrix, which is identical in all decompositions, integrates information from all data sets. Therefore, it is not specific to any single data set; rather, it represents the common pattern defined by all data sets. The matrices Q_k, with k ranging from 1 to K, are the loading or coefficient matrices. A high positive value indicates a strong positive contribution of the corresponding variable to the 'global score'. While the above methods are formulated for multiple data sets with different variables but the same observations, most can be similarly formulated for multiple data sets with the same variables but different observations [69].

Multiple co-inertia analysis

MCIA is an extension of CIA that aims to analyze multiple matrices by optimizing a covariance criterion [60, 70]. MCIA simultaneously projects K data sets into the same dimensional space. Instead of maximizing the covariance between scores from two data sets as in CIA, the optimization problem used by MCIA is as follows:

for dimension i:  argmax_{q_1^i, ..., q_k^i, ..., q_K^i} \sum_{k=1}^{K} cov^2(X_k q_k^i, X q^i)        (14)

with the constraints that ||q_k^i|| = var(X q^i) = 1, where X = (X_1 | ... | X_k | ... | X_K) and q^i holds the corresponding loading values (the 'global' loading vector) [60, 69, 70].

MCIA derives a set of 'block scores' X_k q_k^i using linear combinations of the original variables from each individual matrix. The global score X q^i is then further defined as the linear combination of the 'block scores'. In practice, the global scores represent a correlated structure defined by multiple data sets, whereas the concordance and discrepancy between these data sets may be revealed by the block scores (for details see the 'Example case study' section). MCIA may be calculated with an ad hoc extension of the NIPALS PCA [71]. This algorithm starts with an initialization step in which the global scores and the block loadings for the first dimension are computed. The residual matrices are calculated in an iterative step by removing the variance induced by the variable loadings (the 'deflation' step). For higher-order solutions, the same procedure is applied to the residual matrices and re-iterated until the desired number of dimensions is reached. Therefore, the computation time strongly depends on the number of desired dimensions. MCIA is implemented in the R package omicade4 and has been applied to the integrative analysis of transcriptomic and proteomic data sets from the NCI-60 cell lines [60].
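A hedged sketch of MCIA with the omicade4 package [60] follows; the data list is a placeholder, the call assumes the convention of the omicade4 documentation (features in rows, matched samples in columns, identical column order across tables), and the names of the returned components follow ade4 conventions, which may vary between versions.

  library(omicade4)

  # hypothetical list of omics tables: features in rows, the same 21 samples in columns
  set.seed(1)
  sample_ids <- paste0("cell", 1:21)
  make_tab <- function(p, prefix) {
    as.data.frame(matrix(rnorm(p * 21), ncol = 21,
                         dimnames = list(paste0(prefix, 1:p), sample_ids)))
  }
  data_list <- list(mrna    = make_tab(1000, "g"),
                    mirna   = make_tab(300,  "mir"),
                    protein = make_tab(500,  "p"))

  mcoin <- mcia(data_list, cia.nf = 2)   # MCIA with two global components
  plot(mcoin)                            # sample space, variable space and eigenvalues
  head(mcoin$mcoa$SynVar)                # global scores (F) for the samples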

Generalized canonical correlation analysis

GCCA [71] is a generalization of CCA to K-table analysis [73–75]. It has also been applied to the analysis of omics data [36, 76]. Typically, MCIA and GCCA will produce similar results (for a more detailed comparison, see [60]). GCCA uses a different deflation strategy than MCIA: it calculates the residual matrices by removing the variance with respect to the 'block scores' (instead of the 'variable loadings' used by MCIA or the 'global scores' used by CPCA; see later). When applied to omics data where p >> n, a variable selection step is often integrated within the GCCA approach, which cannot be done in the case of MCIA. In addition, as block scores are better representations of a single data set (in contrast to the global score), GCCA is more likely to find common variables across data sets. Witten and Tibshirani [54] applied sparse multiple CCA to analyze gene expression and Copy Number Variation (CNV) data from diffuse large B-cell lymphoma patients and successfully identified 'cis interactions' that are up-regulated in both the CNV and mRNA data.
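Sparse multiple CCA for more than two tables is available in the PMA package as MultiCCA; a hedged sketch follows, with placeholder matrices (samples in rows) and a tuning step that assumes the MultiCCA.permute helper shipped with PMA.

  library(PMA)

  # hypothetical list of three omics matrices on the same 50 samples (rows)
  set.seed(1)
  xlist <- list(
    matrix(rnorm(50 * 800), nrow = 50),   # mRNA
    matrix(rnorm(50 * 200), nrow = 50),   # CNV
    matrix(rnorm(50 * 100), nrow = 50)    # miRNA
  )

  perm <- MultiCCA.permute(xlist)                      # tune the l1 penalties
  fit  <- MultiCCA(xlist, penalty = perm$bestpenalties,
                   ncomponents = 2)
  sapply(fit$ws, function(w) sum(w[, 1] != 0))         # variables kept per table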

Consensus PCA

CPCA is closely related to GCCA and MCIA but has had less exposure in the omics data community. CPCA optimizes the same criterion as GCCA and MCIA and is subject to the same constraints as MCIA [71]. The deflation step of CPCA relies on the 'global score'. As a result, it guarantees the orthogonality of the global scores only and tends to find common patterns in the data sets. This characteristic makes it more suitable for the discovery of joint patterns in multiple data sets, such as the joint clustering problem.

Regularized generalized canonical correlation analysis

Recently, Tenenhaus and Tenenhaus [69, 74] proposed regularized generalized CCA (RGCCA), which provides a unified framework for different K-table multivariate methods. The RGCCA model introduces some extra parameters, in particular a shrinkage parameter and a linkage parameter. The linkage parameter is defined so that the connections between matrices may be customized. The shrinkage parameter ranges from 0 to 1. Setting this parameter to 0 will force the correlation criterion (the criterion used by GCCA), whereas a shrinkage parameter of 1 will apply the covariance criterion (used by MCIA and CPCA). A value between 0 and 1 leads to a compromise between the two options. In practice, the correlation criterion is better at explaining the correlated structure across data sets while discarding the variance within each individual data set. The introduction of these extra parameters makes RGCCA highly versatile: GCCA, CIA and CPCA can be described as special cases of RGCCA (see [69] and the Supplementary Information). In addition, RGCCA also integrates a feature selection procedure, named sparse GCCA (SGCCA). Tenenhaus et al. [76] applied SGCCA to combine gene expression, comparative genomic hybridization and a qualitative phenotype measured on a set of 53 children with glioma. Sparse multiple CCA [54] and SGCCA [76] are available in the R packages PMA and RGCCA, respectively. Similarly, a higher-order implementation of sparse PLS is described in Zhao et al. [77].
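A hedged sketch of RGCCA using the RGCCA package [69, 76] follows; the blocks, the design (linkage) matrix and the shrinkage values are hypothetical placeholders, and the call assumes the classical rgcca() interface of earlier package versions (argument names differ in more recent releases).

  library(RGCCA)

  # hypothetical blocks with matched rows (samples)
  set.seed(1)
  A <- list(
    ge  = matrix(rnorm(53 * 1000), nrow = 53),   # gene expression
    cgh = matrix(rnorm(53 * 400),  nrow = 53),   # comparative genomic hybridization
    y   = matrix(rnorm(53 * 3),    nrow = 53)    # phenotype block
  )

  # linkage matrix: connect the two omics blocks to the phenotype block only
  C <- matrix(c(0, 0, 1,
                0, 0, 1,
                1, 1, 0), nrow = 3, byrow = TRUE)

  # shrinkage parameter tau: 1 = covariance criterion, 0 = correlation criterion
  fit <- rgcca(A, C, tau = c(1, 1, 0), ncomp = rep(2, 3), scheme = "centroid")
  lapply(fit$a, head)    # block loading vectors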

Joint NMF

NMF has also been extended to jointly factorize multiple matrices. In joint NMF, the values in the global score F and in the coefficient matrices (Q_1, ..., Q_K) are nonnegative, and there is no explicit definition of the block loadings. An optimization algorithm is applied to minimize an objective function, typically the sum of squared errors, i.e. \sum_{k=1}^{K} ||E_k||^2. This approach can be considered a nonnegative implementation of PARAFAC, although it has also been implemented using the Tucker model [78–80]. Zhang et al. [81] applied joint NMF to a three-way analysis of DNA methylation, gene expression and miRNA expression data to identify modules in each of these regulatory layers that are associated with each other.
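To make the joint factorization concrete, the sketch below implements a basic joint NMF with multiplicative updates in base R, minimizing the sum of squared reconstruction errors over K nonnegative tables that share a common nonnegative global score matrix (X_k ~ F Q_k^T); this is a simplified illustration of the idea rather than the specific algorithm of Zhang et al. [81], and the input list and rank are placeholders.

  # joint NMF: X_k ~ Fmat %*% t(Q_k), all entries nonnegative, Fmat shared across tables
  joint_nmf <- function(X_list, r, n_iter = 200, eps = 1e-9) {
    n <- nrow(X_list[[1]])
    Fmat <- matrix(runif(n * r), n, r)                       # shared global scores
    Q <- lapply(X_list, function(X) matrix(runif(ncol(X) * r), ncol(X), r))
    for (it in seq_len(n_iter)) {
      # update each coefficient matrix Q_k (standard NMF update for H = t(Q_k))
      Q <- mapply(function(X, Qk) {
        Qk * (t(X) %*% Fmat) / (Qk %*% (t(Fmat) %*% Fmat) + eps)
      }, X_list, Q, SIMPLIFY = FALSE)
      # update the shared global score matrix using all tables
      num  <- Reduce(`+`, mapply(function(X, Qk) X %*% Qk, X_list, Q, SIMPLIFY = FALSE))
      den  <- Fmat %*% Reduce(`+`, lapply(Q, function(Qk) t(Qk) %*% Qk)) + eps
      Fmat <- Fmat * num / den
    }
    list(F = Fmat, Q = Q)
  }

  # hypothetical nonnegative tables on the same 30 samples (rows)
  set.seed(1)
  X_list <- list(matrix(rexp(30 * 100), 30), matrix(rexp(30 * 40), 30))
  fit <- joint_nmf(X_list, r = 3)
  sapply(seq_along(X_list),
         function(k) sum((X_list[[k]] - fit$F %*% t(fit$Q[[k]]))^2))  # residual SSE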

Advantages of dimension reduction when integrating multi-assay data

Dimension reduction or latent variable approaches provide EDA, integrate multi-assay data, highlight global correlations across data sets and discover outliers or batch effects in individual data sets. Dimension reduction approaches also facilitate downstream analysis of both observations and variables (genes). Compared with cluster analysis of individual data sets, cluster analysis of the global score matrix (F matrix) is more robust, as it aggregates observations across data sets and is less likely to reflect a technical or batch effect of a single data set. Similarly, dimension reduction of multi-assay data facilitates downstream gene set, pathway and network analysis of the variables. MCIA transforms variables from each data set onto the same scale, and their loadings (Q matrix) rank the variables by their contribution to the global data structure (Figure 2). Meng et al. report that pathway or gene set enrichment analysis (GSEA) of the transformed variables is more sensitive than GSEA of each individual data set. This is both because of the re-weighting and transformation of the variables and because GSEA on the combined data has greater coverage of variables (genes), thus compensating for missing or unreliable information in any single data set. For example, in Figure 2B we integrate and transform mRNA, proteomics and miRNA data onto the same scale, allowing us to extract and study the union of all variables.

Example case study

To demonstrate the integration of multiple data sets using dimension reduction, we applied MCIA to analyze mRNA, miRNA and proteomics expression profiles of melanoma, leukemia and CNS cell lines from the NCI-60 panel. The graphical output from this analysis, namely plots of the sample space, variable space and data weighting space, is provided in Figure 2A, B and E. The eigenvalues can be interpreted similarly to PCA: a higher eigenvalue contributes more information to the global score. As with PCA, researchers may be subjective in their selection of the number of components [24]. The scree plot in Figure 2D shows the eigenvalues of each global score. In this case, the first two eigenvalues were substantially larger, so we visualized the cell lines and variables on PC1 and PC2.

Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) Plot of the first two components in sample space (sample 'type' is coded by the point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a 'star', where the three omics data for each cell line are connected by lines to a center point, which is the global score (F) for that cell line; the shorter the line, the higher the level of concordance between the data types and the global structure. (B) Variable space of MCIA. A variable that is highly expressed in a cell line will be projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are strongly associated with cancer tissue of origin. (C) Correlation coefficients of the proteome profile of SR with other cell lines. The proteome profile of the SR cell line is more correlated with melanoma cell lines; there may be a technical issue with the LE_SR proteomics data. (D) Scree plot of the eigenvalues and (E) plot of the data weighting space. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

In the observation space (Figure 2A), assay data (mRNA, miRNA and proteomics) are distinguished by shape. The coordinates of each cell line (F_k in Figure 2A) are connected by lines to the global scores (F). Short lines between points and cell line global scores reflect high concordance in the cell line data. Most cell lines have concordant information between data sets (mRNA, miRNA, protein), as indicated by relatively short lines. In addition, the RV coefficient [82, 83], which is a generalized Pearson correlation coefficient for matrices, may be used to estimate the correlation between two transformed data sets. The RV coefficient has values between 0 and 1, where a higher value indicates higher co-structure. In this example, we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].
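For reference, the RV coefficient between two column-centered matrices with matched rows can be computed directly, as in the small base-R sketch below (packages such as FactoMineR also provide implementations, and the modified RV is described in [84]); the example matrices are simulated placeholders.

  # RV coefficient between two matrices with matched rows (observations)
  rv_coef <- function(X, Y) {
    X <- scale(X, center = TRUE, scale = FALSE)
    Y <- scale(Y, center = TRUE, scale = FALSE)
    Sxy <- crossprod(X, Y)          # t(X) %*% Y
    Sxx <- crossprod(X)
    Syy <- crossprod(Y)
    sum(Sxy * Sxy) / sqrt(sum(Sxx * Sxx) * sum(Syy * Syy))
  }

  # hypothetical example: two noisy views of the same latent structure
  set.seed(1)
  Z <- matrix(rnorm(40 * 3), nrow = 40)
  X <- Z %*% matrix(rnorm(3 * 50), 3) + 0.3 * matrix(rnorm(40 * 50), 40)
  Y <- Z %*% matrix(rnorm(3 * 20), 3) + 0.3 * matrix(rnorm(40 * 20), 40)
  rv_coef(X, Y)    # close to 1 when the two tables share strong co-structure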

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two groups of cell lines (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to the melanoma rather than the leukemia cell lines. We examined within-tumor-type correlations to the SR cell line (Figure 2C) and observed that the SR proteomics data had a higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data for the SR cell line (Figure 2A and C).

MCIA projects all variables into the same space. The variable space (Q_1, ..., Q_K) is visualized in Figure 2B. Variables and samples projected in the same direction are associated. This allows one to select the variables most strongly associated with specific observations from each data set for subsequent analysis. In our previous study [85], we have shown that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched in melanogenesis functions, and genes/proteins highly weighted on the protein side are highly enriched in T-cell or immune-related functions.

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86–88], had high weights on the positive end of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development, and miR-223 is regulated by the Notch and NF-kB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA with the strongest association to the CNS cell lines was miR-409. This miRNA is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, CNS cell lines are characterized by a more pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (which correspond to the melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506, which are strongly associated with the melanoma cell lines and are reported to initiate melanocyte transformation and promote melanoma growth [85].

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. The data are biologically complex. In addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for complex multi-omics data are needed.

We describe emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in the variable loadings may help in the selection of variables for downstream analysis. Few comparisons of different methods exist, and the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful, but this remains an open area of research.

Another limitation of these methods is that, although variables may vary among data sets, the observations need to be matchable. Therefore, researchers need careful experimental design in the early stages of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines a Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].

The number of variables in genomics data is a challenge to traditional EDA visualization tools. Most visualization approaches were designed for data sets with fewer variables. Within R, new packages, including ggord, are being developed, and dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, interpretation of long lists of biological variables (genes, proteins, miRNAs) is challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotation, such as Gene Ontology or Reactome, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60].

Conclusion

Dimension reduction methods have a long history. Many similar methods have been developed in parallel by multiple fields. In this review, we provided an overview of dimension reduction techniques that are both well established and may be new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension reduction methods, which can be applied to exploratory data analysis of a single data set or to integrated analysis of a pair or of multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.


• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNAs, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of consensus and incongruence within and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-data set methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway and other downstream analyses.

R Supplement

R code to regenerate all figures in this article is available as a Supplementary File. Data and code are also available in the GitHub repository: https://github.com/aedin/NCI60Example.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org.

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

GGT was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. AC was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program through the Breast Cancer Research Program under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

References

1. Brazma A, Culhane AC. Algorithms for gene expression analysis. In: Jorde LB, Little PFR, Dunn MJ, Subramaniam S (eds). Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. London: John Wiley & Sons, 2005, 3148–59.
2. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733–9.
3. Legendre P, Legendre LFJ. Numerical Ecology, 3rd edn. Amsterdam: Elsevier Science, 2012.
4. Biton A, Bernard-Pierrot I, Lou Y, et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep 2014;9:1235–45.
5. Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929–44.
6. Verhaak RG, Tamayo P, Yang JY, et al. Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest 2013;123:517–25.
7. Senbabaoglu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep 2014;4:6207.
8. Gusenleitner D, Howe EA, Bentink S, et al. iBBiG: iterative binary bi-clustering of gene sets. Bioinformatics 2012;28:2484–92.
9. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Magazine Series 1901;6(2):559–72.
10. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933;24:417.
11. Bland M. An Introduction to Medical Statistics, 3rd edn. Oxford, New York: Oxford University Press, 2000, xvi, 405.
12. Warner RM. Applied Statistics: From Bivariate through Multivariate Techniques, 2nd edn. London: SAGE Publications, 2012, 1004.
13. Dray S, Chessel D, Thioulouse J. Co-inertia analysis and the linking of ecological data tables. Ecology 2003;84:3078–89.
14. Golub GH, Loan CF. Matrix Computations. Baltimore: JHU Press, 2012.
15. Anthony M, Harvey M. Linear Algebra: Concepts and Methods, 1st edn. Cambridge: Cambridge University Press, 2012, 149–60.
16. McShane LM, Cavenagh MM, Lively TG, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med 2013;11:220.
17. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
18. Tukey JW. Exploratory Data Analysis. Boston: Addison-Wesley Publishing Company, 1977.
19. Hogben L. Handbook of Linear Algebra, 1st edn. Boca Raton: Chapman & Hall/CRC Press, 2007, 20.
20. Ringner M. What is principal component analysis? Nat Biotechnol 2008;26:303–4.
21. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003, 91–109.
22. terBraak CJF, Prentice IC. A theory of gradient analysis. Adv Ecol Res 1988;18:271–313.
23. terBraak CJF, Smilauer P. CANOCO Reference Manual and User's Guide to Canoco for Windows: Software for Canonical Community Ordination (version 4). Ithaca, NY, USA: Microcomputer Power, 1998, 352. http://www.microcomputerpower.com
24. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2010;2(4):433–59.
25. Dray S. On the number of principal components: a test of dimensionality based on measurements of similarity between matrices. Comput Stat Data Anal 2008;52:2228–37.
26. Wouters L, Gohlmann HW, Bijnens L, et al. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 2003;59:1131–9.
27. Greenacre M. Correspondence Analysis in Practice, 2nd edn. Boca Raton: Chapman and Hall/CRC, 2007, 201–11.
28. Goodrich JK, Rienzi DSC, Poole AC, et al. Conducting a microbiome study. Cell 2014;158:250–62.
29. Fasham MJR. A comparison of nonmetric multidimensional scaling, principal components and reciprocal averaging for the ordination of simulated coenoclines and coenoplanes. Ecology 1977;58:551–61.


30. Beh EJ, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies, 1st edn. Hoboken: John Wiley & Sons, 2014, 120–76.
31. Greenacre MJ. Theory and Applications of Correspondence Analysis, 3rd edn. Waltham, MA: Academic Press, 1993, 83–125.
32. Fagan A, Culhane AC, Higgins DG. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics 2007;7:2162–71.
33. Fellenberg K, Hauser NC, Brors B, et al. Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 2001;98:10781–6.
34. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.
35. Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287–314.
36. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515–34.
37. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 2008;99:1015–34.
38. Lee S, Epstein MP, Duncan R, et al. Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet Epidemiol 2012;36:293–302.
39. Sill M, Saadati M, Benner A. Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics 2015;31:2683–90.
40. Zhao J. Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC. IEEE Trans Cybern 2014;44:1871–83.
41. Zhao Q, Shi X, Huang J, et al. Integrative analysis of 'omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat 2015;7:10.
42. Alter O, Brown PO, Botstein D. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 2003;100:3351–6.
43. Culhane AC, Perriere G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003;4:59.
44. Dray S. Analysing a pair of tables: coinertia analysis and duality diagrams. In: Blasius J, Greenacre M (eds). Visualization and Verbalization of Data. Boca Raton: CRC Press, 2014, 289–300.
45. Le Cao KA, Martin PG, Robert-Granie C, et al. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009;10:34.
46. Braak CJF. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 1986;67:1167–79.
47. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77.
48. McGarigal K, Landguth E, Stafford S. Multivariate Statistics for Wildlife and Ecology Research. New York: Springer, 2002.
49. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: a comparison of several methods. Ann Appl Stat 2011;5:2300–25.
50. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
51. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London, New York, Toronto, Sydney, San Francisco: Academic Press, 1979.
52. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics 2009;10:315.
53. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:article 1.
54. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:article 28.
55. Lin D, Zhang J, Li J, et al. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013;14:245.
56. Le Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011;12:253.
57. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007;8:32–44.
58. Doledec S, Chessel D. Co-inertia analysis: an alternative method for studying species–environment relationships. Freshwater Biol 1994;31:277–94.
59. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
60. Meng C, Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics 2014;15:162.
61. de Tayrac M, Le S, Aubry M, et al. Simultaneous analysis of distinct omics data sets with integration of biological knowledge: multiple factor analysis approach. BMC Genomics 2009;10:32.
62. Abdi H, Williams LJ, Valentin D. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics 2012;4:124–67.
63. De Jong S, Kiers HAL. Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems 1992;14:155–64.
64. Benasseni J, Dosse MB. Analyzing multiset data by the power STATIS-ACT method. Adv Data Anal Classification 2012;6:49–65.
65. Escoufier Y. The duality diagram: a means for better practical applications. Devel Num Ecol 1987;14:139–56.
66. Giordani P, Kiers HAL, Ferraro DMA. Three-way component analysis using the R package threeway. J Stat Software 2014;57(7).
67. Leibovici DG. Spatio-temporal multiway decompositions using principal tensor analysis on k-modes: the R package PTAk. J Stat Software 2010;34(10):1–34.
68. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455–500.
69. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res 2014;238:391–403.
70. Chessel D, Hanafi M. Analyses de la co-inertie de K nuages de points. Rev Stat Appl 1996;44:35–60.
71. Hanafi M, Kohler A, Qannari EM. Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometrics and Intelligent Laboratory Systems 2011;106(1):37–40.
72. Carroll JD. Generalization of canonical correlation analysis to three or more sets of variables. Proc 76th Convent Am Psych Assoc 1968;3:227–8.
73. Takane Y, Hwang H, Abdi H. Regularized multiple-set canonical correlation analysis. Psychometrika 2008;73:753–75.


74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257–84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569–83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660–73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Dublin, 2007, 1147–51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112–31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572–80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379–91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769–77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768–78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401–5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558–70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and over-expression of miR-223. Haematologica 2010;95:1114–21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432–8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672–6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-kB signaling pathways regulate miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324–35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636–46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846–53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082–90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.


  • bbv108-TF1
  • bbv108-TF2
  • bbv108-TF3
  • bbv108-TF4
  • bbv108-TF5
Page 6: Dimension reduction techniques for the integrative ... · Amin M. Gholami is a bioinformatics scientist in the Division of Vaccine Discovery at La Jolla Institute for Allergy and

continuous data It is frequently applied in the analysis ofmicrobiome data [28]

PCA is designed for the analysis of multi-normal distributeddata If data are strongly skewed or extreme outliers are pre-sent the first few axes may only separate those objects with ex-treme values instead of displaying the main axes of variation Ifdata are unimodal or display nonlinear trends one may see dis-tortions or artifacts in the resulting plots in which the secondaxis is an arched function of the first axis In PCA this is calledthe horseshoe effect and it is well described including illustra-tions in Legendre and Legendre [3] Both nonmetric Multi-Dimensional Scaling (MDS) and CA perform better than PCA inthese cases [26 29] Unlike PCA CA can be applied to sparsecount data with many zeros Although designed for contingencytables of nonnegative count data CA and NSCA decomposea chi-squared matrix [30 31] but have been successfullyapplied to continuous data including gene expression and pro-tein profiles [32 33] As described by Fellenberg et al [33] gene

and protein expression can be seen as an approximation of thenumber of corresponding molecules present in the cell during acertain measured condition Additionally Greenacre [27]emphasized that the descriptive nature of CA and NSCA allowstheir application on data tables in general not only on countdata These two arguments support the suitability of CA andNSCA as analysis methods for omics data While CA investi-gates symmetric associations between two variables NSCA cap-tures asymmetric relations between variables Spectral mapanalysis is related to CA and performs comparably with CAeach outperforming PCA in the identification of clusters of leu-kemia gene expression profiles [26] All dimension reductionmethods can be formulated in terms of the duality diagramDetails on this powerful framework are included in theSupplementary Information

Nonnegative matrix factorization (NMF) [34] forces a positiveor nonnegative constraint on the resulting data matrices andsimilar to Independent Component Analysis (ICA) [35] there is no

Figure 1 Results of a PCA analysis of mRNA gene expression data of melanoma (ME) leukemia (LE) and central nervous system (CNS) cell lines from the NCI-60 cell

line panel All variables were centered and scaled Results show (A) a biplot where observations (cell lines) are points and gene expression profiles are arrows (B) a

heatmap showing the gene expression of the same 20 genes in the cell lines red to blue scale represent high to low gene expression (light to dark gray represent high

to low gene expression on the black and white figure) (C) correlation circle (D) variance barplot of the first ten PCs To improve the readability of the biplot some labels

of the variables (genes) in (A) have been moved slightly A colour version of this figure is available online at BIB online httpbiboxfordjournalsorg

Dimension reduction techniques | 633

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

requirement for orthogonality or independence in the compo-nents The nonnegative constraint guarantees that only the addi-tive combinations of latent variables are allowed This may bemore intuitive in biology where many biological measurements(eg protein concentrations count data) are represented by posi-tive values NMF is described in more detail in the SupplementalInformation ICA was recently applied to molecular subtype dis-covery in bladder cancer [4] Biton et al [4] applied ICA to geneexpression data of 198 bladder cancers and examined 20 compo-nents ICA successfully decomposed and extracted multiplelayers of signal from the data The first two components wereassociated with batch effects but other components revealed newbiology about tumor cells and the tumor microenvironmentThey also applied ICA to non-bladder cancers and by comparingthe correlation between components were able to identify a setof bladder cancer-specific components and their associatedgenes

As omics data sets tend to have high dimensionality (p n)it is often useful to reduce the number of variables Several re-cent extensions of PCA include variable selection often via aregularization step or L-1 penalization (eg Least AbsoluteShrinkage and Selection Operator LASSO) [36] The NIPALS algo-rithm uses an iterative regression approach to calculate thecomponents and loadings which is easily extended to have asparse operator that can be included during regression on thecomponent [37] A cross-validation approach can be used to de-termine the level of sparsity Sparse penalized and regularizedextensions of PCA and related methods have been describedrecently [36 38ndash41]

Integrative analysis of two data sets

One-table dimension reduction methods have been extended tothe EDA of two matrices and can simultaneously decomposeand integrate a pair of matrices that measure different variableson the same observations (Table 3) Methods include general-ized SVD [42] Co-Inertia Analysis (CIA) [43 44] sparse or penal-ized extensions of Partial Least Squares (PLS) CanonicalCorrespondence analysis (CCA) and Canonical CorrelationAnalysis (CCA) [36 45ndash47] Note both canonical correspondenceanalysis and canonical correlation analysis are referred to bythe acronym CCA Canonical correspondence analysis is a con-strained form of CA that is widely used in ecological statistics[46] however it is yet to be adopted by the genomics commu-nity in analysis of pairs of omics data By contrast severalgroups have applied extensions of canonical correlation ana-lysis to omics data integration Therefore in this review we useCCA to describe canonical correlation analysis

Canonical correlation analysis

Two omics data sets X (dimension npx) and Y (dimension npy)can be expressed by the following latent component decompos-ition problem

X frac14 FxQTx thorn Ex

Yfrac14 FyQTy thorn Ey

(9)

where Fx and Fy are nr matrices with r columns of compo-nents that explain the co-structure between X and Y The col-umns of Qx and Qy are variable loading vectors for X and Yrespectively Ex and Ey are error terms

Proposed by Hotelling in 1936 [47] CCA searches for associ-ations or correlations among the variables of X and Y [47] bymaximizing the correlation between Xqx

i and Yqyi

for the ith component arg maxqixqi

ycorethXqi

x YqiyTHORN (10)

In CCA the components Xqxi and Yqy

i are called canonical vari-ates and their correlations are the canonical correlations

Sparse canonical correlation analysis

The main limitation of applying CCA to omics data is that itrequires an inversion of the correlation or covariance matrix[38 49 50] which cannot be calculated when the number ofvariables exceeds the number of observations [46] In high-dimensional omics data where p n application of these meth-ods requires a regularization step This may be accomplished byadding a ridge penalty that is adding a multiple of the identitymatrix to the covariance matrix [51] A sparse solution of theloading vectors (Qx and Qy) filters the number of variables andsimplifies the interpretation of results For this purpose penal-ized CCA [52] sparse CCA [53] CCA-l1 [54] CCA elastic net (CCA-EN) [45] and CCA-group sparse [55] have been introduced andapplied to the integrative analysis of two omics data setsWitten et al [36] provided an elegant comparison of various CCAextensions accompanied by a unified approach to compute bothpenalized CCA and sparse PCA In addition Witten andTibshirani [54] extended sparse CCA into a supervised frame-work which allows the integration of two data sets with aquantitative phenotype for example selecting variables fromboth genomics and transcriptomics data and linking them todrug sensitivity data

Partial least squares analysis

PLS is an efficient dimension reduction method in the analysisof high-dimensional omics data PLS maximizes the covariancerather than the correlation between components making itmore resistant to outliers than CCA Additionally PLS does notsuffer from the p n problem as CCA does Nonetheless asparse solution is desired in some cases For example Le Caoet al [56] proposed a sparse PLS method for the feature selectionby introducing a LASSO penalty for the loading vectors In a re-cent comparison sPLS performed similarly to sparse CCA [45]There are many implementations of PLS which optimize differ-ent objective functions with different constraints several ofwhich are described by Boulesteix et al [57]

Co-Inertia analysis

CIA is a descriptive nonconstrained approach for coupling pairsof data matrices It was originally proposed to link two ecolo-gical tables [13 58] but has been successfully applied in integra-tive analysis of omics data [32 43] CIA is implemented andformulated under the duality diagram framework(Supplementary Information) CIA is performed in two steps (i)application of a dimension reduction technique such as PCACA or NSCA to the initial data sets depending on the type ofdata (binary categorical discrete counts or continuous data)and (ii) constraining the projections of the orthogonal axes suchthat they are maximally covariant [43 58] CIA does not requirean inversion step of the correlation or covariance matrix thusit can be applied to high-dimensional genomics data withoutregularization or penalization

634 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

Though closely related to CCA [49] CIA maximizes thesquared covariance between the linear combination of the pre-processed matrix that is

for the ith dimension argmaxqixqi

ycov2ethXqi

xYqiyTHORN (11)

Equation (11) can be decomposed as

cov2 XqxiYqy

i

frac14 cor2 XqxiYqy

i

var Xqxi

var Xqy

i

(12)

CIA decomposition of covariance maximizes the varianceand the correlation between matrices and thus is less sensitiveto outliers The relationship between CIA Procrustes analysis[13] and CCA have been well described [49] A comparison be-tween sCCA (with elastic net penalty) sPLS and CIA is providedby Le Cao et al [45] In summary CIA and sPLS both maximizethe covariance between eigenvectors and efficiently identifyjoint and individual variance in paired data In contrast CCA-EN maximizes the correlation between eigenvectors and willdiscover effects present in both data sets but may fail to dis-cover strong individual effects [45] Both sCCA and sPLS aresparse methods that select similar subsets of variables whereasCIA does not include a feature selection step thus in terms offeature selection results of CIA are more likely to contain re-dundant information in comparison with sparse methods [45]

Similar to classical dimension reduction approaches thenumber of dimensions to be examined needs to be consideredand can be visualized using a scree plot (similar to Figure 1D)Components may be evaluated by their associated variance [25]the elbow test or Q2 statistics as described previously Forexample the Q2 statistic was applied to select the number ofdimensions in the predictive mode of PLS [56] In additionwhen a sparse factor is introduced in the loading vectors cross-validation approaches may be used to determine the number ofvariables selected from each pair of components Selection ofthe number of components and optimization of these param-eters is still an open research question and is an active area ofresearch

Integrative analysis of multi-assay data

There is a growing need to integrate more than two data sets in genomics. Generalizations of dimension reduction methods to three or more data sets are sometimes called K-table methods [59–61], and a number of them have been applied to multi-assay data (Table 4). Simultaneous decomposition and integration of multiple matrices is more complex than an analysis of a single data set or paired data, because each data set may have different numbers of variables, scales or internal structure, and thus different variance. This might produce global scores that are dominated by one or a few data sets. Therefore, data are preprocessed before decomposition. Preprocessing is often performed on two levels. On the variable level, variables are often centered and normalized so that their sum of squared values or variance equals 1. This procedure enables all the variables to have equal contribution to the total inertia (sum of squares of all elements) of a data set. However, the number of variables may vary between data sets, or filtering/preprocessing steps may generate data sets that have a higher variance contribution to the final result. Therefore, a data set level normalization is also required. In the simplest K-table analysis, all matrices have equal weights. More commonly, data sets are weighted by their expected contribution or expected data quality, for example by the square root of their total inertia or by the square root of the number of columns of each data set [62]. Alternatively, greater weights can be given to smaller or less redundant matrices, matrices that have more stable predictive information or those that share more information with other matrices. Such weighting approaches are implemented in multiple factor analysis (MFA), principal covariates regression [63] and STATIS.
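A minimal sketch of this two-level preprocessing, under the assumption that each block holds samples in rows and that blocks are weighted by the square root of their total inertia (one of the several options named above); the block names and sizes are made up.

    # Variable-level scaling followed by a data set-level weight
    preprocess_block <- function(X) {
      Xs <- scale(X, center = TRUE, scale = TRUE)  # each variable: mean 0, unit variance
      inertia <- sum(Xs^2)                         # total inertia (sum of squares) of the block
      Xs / sqrt(inertia)                           # each block now has total inertia 1
    }

    set.seed(1)
    blocks <- list(mrna = matrix(rnorm(10 * 50), nrow = 10),
                   prot = matrix(rnorm(10 * 20), nrow = 10))
    blocks_weighted <- lapply(blocks, preprocess_block)
    sapply(blocks_weighted, function(X) sum(X^2))  # every block contributes equally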

The simplest multi-omics data integration is when the data sets have the same variables and observations, that is, matched rows and matched columns. In genomics, these could be produced when variables from different multi-assay data sets are mapped to a common set of genomic coordinates or gene identifiers, thus generating data sets with matched variables and matched observations. Alternatively, a repeated analysis or longitudinal analysis of the same samples and the same variables could produce such data (one should note that these dimension reduction approaches do not model the time correlation between different data sets). There is a history of such analyses in ecology, where counts of species and environment variables are measured over different seasons [49, 64, 65], and in psychology, where different standardized tests are measured multiple times on study populations [66, 67]. Analysis of such variables x samples x time data is called three-mode decomposition, triadic cube or three-way table analysis, tensor decomposition, three-way PCA, three-mode PCA, three-mode factor analysis, Tucker-3 model, Tucker3, TUCKALS3 or multi-block analysis, among others (Table 4). The relationships between various tensor or higher-order decompositions for multi-block analysis are reviewed by Kolda and Bader [68].

More frequently, we need to find associations in multi-assay data that have matched observations but different and unmatched numbers of variables. For example, TCGA generated miRNA and mRNA transcriptome (RNA-seq, microarray), DNA copy number, DNA mutation, DNA methylation and proteomics molecular profiles on each tumor. The NCI-60 and the Cancer Cell Line Encyclopedia projects have measured pharmacological compound profiles in addition to exome sequencing and transcriptomic profiles. Methods that can be applied to EDA of a K-table of multi-assay data with different variables include MCIA, MFA, generalized CCA (GCCA) and consensus PCA (CPCA).

The K-table methods can be generally expressed by the following model:

$$X_1 = FQ_1^T + E_1,\;\; \ldots,\;\; X_k = FQ_k^T + E_k,\;\; \ldots,\;\; X_K = FQ_K^T + E_K \qquad (13)$$

where there are K matrices or omics data sets X1, ..., XK. For convenience, we assume that the rows of Xk share a common set of observations, but the columns of Xk may each have different variables. F is the 'global score' matrix. Its columns are the PCs and are interpreted similarly to PCs from a PCA of a single data set. The global score matrix, which is identical in all decompositions, integrates information from all data sets. Therefore, it is not specific to any single data set; rather, it represents the common pattern defined by all data sets. The matrices Qk, with k ranging from 1 to K, are the loadings or coefficient matrices. A high positive value indicates a strong positive contribution of the corresponding variable to the 'global score'.


While the above methods are formulated for multiple data sets with different variables but the same observations, most can be similarly formulated for multiple data sets with the same variables but different observations [69].

Multiple co-inertia analysis

MCIA is an extension of CIA that aims to analyze multiple matrices through optimizing a covariance criterion [60, 70]. MCIA simultaneously projects K data sets into the same dimensional space. Instead of maximizing the covariance between scores from two data sets, as in CIA, the optimization problem solved by MCIA is as follows:

$$\arg\max_{q_1^i,\ldots,q_k^i,\ldots,q_K^i} \; \sum_{k=1}^{K} \mathrm{cov}^2\!\left(X_k q_k^i,\; Xq^i\right) \qquad (14)$$

for dimension i, with the constraints that $\lVert q_k^i \rVert = \mathrm{var}(Xq^i) = 1$, where $X = (X_1 \,|\, \cdots \,|\, X_k \,|\, \cdots \,|\, X_K)$ and $q^i$ holds the corresponding loading values (the 'global' loading vector) [60, 69, 70].

MCIA derives a set of 'block scores' $X_k q_k^i$ using linear combinations of the original variables from each individual matrix. The global score $Xq^i$ is then further defined as the linear combination of the 'block scores'. In practice, the global scores represent a correlated structure defined by multiple data sets, whereas the concordance and discrepancy between these data sets may be revealed by the block scores (for detail, see the 'Example case study' section). MCIA may be calculated with an ad hoc extension of NIPALS PCA [71]. This algorithm starts with an initialization step in which the global scores and the block loadings for the first dimension are computed. The residual matrices are calculated in an iterative step by removing the variance induced by the variable loadings (the 'deflation' step). For higher order solutions, the same procedure is applied to the residual matrices and re-iterated until the desired number of dimensions is reached. Therefore, the computation time strongly depends on the number of desired dimensions. MCIA is implemented in the R package omicade4 and has been applied to the integrative analysis of transcriptomic and proteomic data sets from the NCI-60 cell lines [60].
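A short usage sketch with omicade4, based on the transcriptome tables shipped with the package; the choice of two global dimensions and the plotting options follow the package vignette but should be checked against the installed version.

    library(omicade4)

    # NCI-60 example data bundled with omicade4: a list of expression tables
    # with variables in rows and the same cell lines in columns
    data(NCI60_4arrays)
    sapply(NCI60_4arrays, dim)

    # Multiple co-inertia analysis, keeping two global dimensions
    mcoin <- mcia(NCI60_4arrays, cia.nf = 2)

    # Color cell lines by tumor type (encoded in the column names) and plot the
    # sample space, variable space, eigenvalues and data weighting space
    cancer_type <- sapply(strsplit(colnames(NCI60_4arrays$agilent), "\\."), "[", 1)
    plot(mcoin, axes = 1:2, phenovec = cancer_type, sample.lab = FALSE, df.color = 1:4)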

Generalized canonical correlation analysis

GCCA [71] is a generalization of CCA to K-table analysis [73–75]. It has also been applied to the analysis of omics data [36, 76]. Typically, MCIA and GCCA will produce similar results (for a more detailed comparison, see [60]). GCCA uses a different deflation strategy than MCIA: it calculates the residual matrices by removing the variance with respect to the 'block scores' (instead of the 'variable loadings' used by MCIA or the 'global scores' used by CPCA; see later). When applied to omics data where p >> n, a variable selection step is often integrated within the GCCA approach, which cannot be done in the case of MCIA. In addition, as block scores are better representations of a single data set (in contrast to the global score), GCCA is more likely to find common variables across data sets. Witten and Tibshirani [54] applied sparse multiple CCA to analyze gene expression and copy number variation (CNV) data from diffuse large B-cell lymphoma patients and successfully identified 'cis interactions' that are up-regulated in both CNV and mRNA data.
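As a toy illustration of the sparse multiple CCA approach of Witten and Tibshirani (available in the PMA package, discussed further below), with made-up matrices and with penalties chosen by the package's permutation scheme; the argument names follow the PMA documentation and should be verified against the installed version.

    library(PMA)

    # Three data sets measured on the same 20 samples (rows); columns are variables
    set.seed(2)
    xlist <- list(matrix(rnorm(20 * 100), nrow = 20),
                  matrix(rnorm(20 * 50),  nrow = 20),
                  matrix(rnorm(20 * 30),  nrow = 20))

    # Choose the L1 penalties by permutation, then fit sparse multiple CCA
    perm <- MultiCCA.permute(xlist, type = "standard")
    fit  <- MultiCCA(xlist, penalty = perm$bestpenalties, ncomponents = 2,
                     type = "standard")

    # fit$ws holds one sparse weight (loading) matrix per data set
    sapply(fit$ws, function(w) sum(w != 0))  # number of non-zero variable weights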

Consensus PCA

CPCA is closely related to GCCA and MCIA but has had less exposure in the omics data community. CPCA optimizes the same criterion as GCCA and MCIA and is subject to the same constraints as MCIA [71]. The deflation step of CPCA relies on the 'global score'. As a result, it guarantees the orthogonality of the global scores only and tends to find common patterns in the data sets. This characteristic makes it more suitable for the discovery of joint patterns in multiple data sets, such as in the joint clustering problem.

Regularized generalized canonical correlation analysis

Recently, Tenenhaus and Tenenhaus [69, 74] proposed regularized generalized CCA (RGCCA), which provides a unified framework for different K-table multivariate methods. The RGCCA model introduces some extra parameters, in particular a shrinkage parameter and a linkage parameter. The linkage parameter is defined so that the connection between matrices may be customized. The shrinkage parameter ranges from 0 to 1. Setting this parameter to 0 will force the correlation criterion (the criterion used by GCCA), whereas a shrinkage parameter of 1 will apply the covariance criterion (used by MCIA and CPCA). A value between 0 and 1 leads to a compromise between the two options. In practice, the correlation criterion is better at explaining the correlated structure across data sets while discarding the variance within each individual data set. The introduction of the extra parameters makes RGCCA highly versatile: GCCA, CIA and CPCA can be described as special cases of RGCCA (see [69] and Supplementary Information). In addition, RGCCA also integrates a feature selection procedure, named sparse GCCA (SGCCA). Tenenhaus et al. [76] applied SGCCA to combine gene expression, comparative genomic hybridization and a qualitative phenotype measured on a set of 53 children with glioma. Sparse multiple CCA [54] and SGCCA [76] are available in the R packages PMA and RGCCA, respectively. Similarly, a higher order implementation of sparse PLS is described in Zhao et al. [77].
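A hypothetical call to the RGCCA package is sketched below, using the argument names of the classic (circa 2014) interface; newer releases have renamed several arguments, so the exact call should be checked against the installed version. The blocks, the fully connected design matrix and the choice of the shrinkage parameter tau are illustrative only.

    library(RGCCA)

    # Three toy blocks on the same 20 samples (rows)
    set.seed(3)
    A <- list(matrix(rnorm(20 * 100), nrow = 20),
              matrix(rnorm(20 * 50),  nrow = 20),
              matrix(rnorm(20 * 30),  nrow = 20))

    # Linkage (design) matrix: here every block is connected to every other block
    C <- matrix(1, 3, 3) - diag(3)

    # Shrinkage parameter tau: 1 gives the covariance criterion (as in MCIA/CPCA),
    # 0 the correlation criterion (as in GCCA); intermediate values compromise
    fit <- rgcca(A, C, tau = rep(1, 3), ncomp = rep(2, 3), scheme = "factorial")
    str(fit$Y)  # block scores (components) for each data set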

Joint NMF

NMF has also been extended to jointly factorize multiple matrices. In joint NMF, the values in the global score F and in the coefficient matrices (Q1, ..., QK) are nonnegative, and there is no explicit definition of the block loadings. An optimization algorithm is applied to minimize an objective function, typically the sum of squared errors, $\sum_{k=1}^{K}\lVert E_k\rVert^2$. This approach can be considered to be a nonnegative implementation of PARAFAC, although it has also been implemented using the Tucker model [78–80]. Zhang et al. [81] apply joint NMF to a three-way analysis of DNA methylation, gene expression and miRNA expression data to identify modules in each of these regulatory layers that are associated with each other.
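One simple way to sketch the shared-score model (not the exact algorithm of Zhang et al.) is to note that, with matched observations in rows, fitting X_k ≈ FQ_k^T for all k with a common nonnegative F amounts to a single NMF of the column-wise concatenated matrix. The example below uses the NMF package with the Euclidean (squared error) updates; the matrices and the rank are made up.

    library(NMF)

    # Two nonnegative data sets on the same 20 samples (rows)
    set.seed(4)
    X1 <- matrix(rexp(20 * 40), nrow = 20)  # 20 samples x 40 features
    X2 <- matrix(rexp(20 * 25), nrow = 20)  # 20 samples x 25 features

    # NMF of [X1 | X2] with squared-error ("lee") updates: basis() is the shared,
    # nonnegative global score matrix F; coef() stacks the block coefficients
    res <- nmf(cbind(X1, X2), rank = 3, method = "lee")
    F_global <- basis(res)                      # 20 x 3 global scores
    Q_all    <- t(coef(res))                    # 65 x 3 stacked coefficients
    Q1 <- Q_all[seq_len(ncol(X1)), ]            # block-specific coefficients for X1
    Q2 <- Q_all[ncol(X1) + seq_len(ncol(X2)), ] # block-specific coefficients for X2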

Advantages of dimension reduction when integrating multi-assay data

Dimension reduction or latent variable approaches provide EDA, integrate multi-assay data, highlight global correlations across data sets and discover outliers or batch effects in individual data sets. Dimension reduction approaches also facilitate downstream analysis of both observations and variables (genes). Compared with cluster analysis of individual data sets, cluster analysis of the global score matrix (F matrix) is robust, as it aggregates observations across data sets and is less likely to


reflect a technical or batch effect of a single data set. Similarly, dimension reduction of multi-assay data facilitates downstream gene set, pathway and network analysis of variables. MCIA transforms variables from each data set onto the same scale, and their loadings (Q matrix) rank the variables by their contribution to the global data structure (Figure 2). Meng et al. report that pathway or gene set enrichment analysis (GSEA) of the transformed variables is more sensitive than GSEA of each individual data set. This is both because of the re-weighting and transformation of the variables and because GSEA on the combined data has greater coverage of variables (genes), thus compensating for missing or unreliable information in any single data set. For example, in Figure 2B we integrate and transform mRNA, proteomics and miRNA data onto the same scale, allowing us to extract and study the union of all variables.
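As a brief downstream sketch, cluster analysis can be run on the global score matrix F rather than on any single assay. The example below refits the illustrative omicade4 MCIA from above; the assumption that the global scores are stored in mcoin$mcoa$SynVar follows the omicade4/ade4 object structure and should be verified on the installed version.

    library(omicade4)
    data(NCI60_4arrays)
    mcoin <- mcia(NCI60_4arrays, cia.nf = 2)

    # Global scores F: one row per cell line, aggregating all data sets
    F_scores <- mcoin$mcoa$SynVar
    hc <- hclust(dist(F_scores), method = "ward.D2")
    plot(hc, main = "Clustering of NCI-60 cell lines on MCIA global scores")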

Example case study

To demonstrate the integration of multiple data sets using dimension reduction, we applied MCIA to analyze mRNA, miRNA and proteomics expression profiles of melanoma, leukemia and CNS cell lines from the NCI-60 panel. The graphical output from this analysis, plots of the sample space, the variable space and the data weighting space, is provided in Figure 2A, B and E. The eigenvalues can be interpreted similarly to PCA: a higher eigenvalue contributes more information to the global score. As with PCA, researchers may be subjective in their selection of the number of components [24]. The scree plot in Figure 2D shows the eigenvalues of each global score. In this case, the first two eigenvalues were significantly larger, so we visualized the cell lines and variables on PC1 and PC2.

In the observation space (Figure 2A), assay data (mRNA, miRNA and proteomics) are distinguished by shape. The coordinates of each cell line (Fk in Figure 2A) are connected by lines to the global scores (F). Short lines between points and cell line global scores reflect high concordance in cell line data. Most cell lines have concordant information between data sets (mRNA, miRNA, protein), as indicated by relatively short lines. In addition, the RV coefficient [82, 83], which is a generalized Pearson correlation coefficient for matrices, may be used to estimate the correlation between two transformed data sets. The RV coefficient has values between 0 and 1, where a higher value indicates higher co-structure.

Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) A plot of the first two components in sample space (sample 'type' is coded by the point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a 'star', where the three omics data for each cell line are connected by lines to a center point, which is the global score (F) for that cell line; the shorter the line, the higher the level of concordance between the data types and the global structure. (B) The variable space of MCIA. A variable that is highly expressed in a cell line will be projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are strongly associated with cancer tissue of origin. (C) The correlation coefficients of the proteome profile of SR with other cell lines. The proteome profile of the SR cell line is more correlated with the melanoma cell lines; there may be a technical issue with the LE_SR proteomics data. (D) A scree plot of the eigenvalues, and (E) a plot of the data weighting space. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org.


In this example, we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].
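A toy calculation of the RV coefficient between two tables with matched samples, using ade4; the table sizes and the number of permutations are illustrative.

    library(ade4)

    set.seed(5)
    tab1 <- as.data.frame(scale(matrix(rnorm(20 * 30), nrow = 20)))
    tab2 <- as.data.frame(scale(matrix(rnorm(20 * 15), nrow = 20)))

    rv <- RV.rtest(tab1, tab2, nrepet = 99)  # RV coefficient with a permutation test
    rv$obs     # observed RV, between 0 and 1
    rv$pvalue  # permutation P-value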

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two cell line types (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to the melanoma rather than the leukemia cell lines. We examined within-tumor-type correlations to the SR cell line (Figure 2C). We observed that the SR proteomics data had higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data on the SR cell line (Figure 2A and C).

MCIA projects all variables into the same space. The variable space (Q1, ..., QK) is visualized in Figure 2B. Variables and samples projected in the same direction are associated. This allows one to select the variables most strongly associated with specific observations from each data set for subsequent analysis. In our previous study [85], we have shown that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched with melanogenesis functions, and the genes/proteins highly weighted toward the leukemia cell lines are highly enriched in T-cell or immune-related functions.

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86–88], had high weights on the positive ends of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development. miR-223 is regulated by the Notch and NF-kB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA with the strongest association to the CNS cell lines was miR-409. This miRNA is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, CNS cell lines are characterized by a more pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (which correspond to the melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506, which are reported to initiate melanocyte transformation and promote melanoma growth [85], to be strongly associated with the melanoma cell lines.

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. The data are biologically complex. In addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for complex multi-omics data are needed.

We describe emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in the variable loadings may help in the selection of variables for downstream analysis. Few comparisons of different methods exist, and the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful, but this still remains an open area of research.

Another limitation of these methods is that, although variables may vary among data sets, the observations need to be matchable. Therefore, researchers need to plan the experimental design carefully in the early stage of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines a Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].

The number of variables in genomics data is a challenge to traditional EDA visualization tools. Most visualization approaches were designed for data sets with fewer variables. Within R, new packages, including ggord, are being developed. Dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, interpretation of long lists of biological variables (genes, proteins, miRNAs) is challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotation, such as Gene Ontology or Reactome, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60].

Conclusion

Dimension reduction methods have a long history. Many similar methods have been developed in parallel by multiple fields. In this review, we provided an overview of dimension reduction techniques that are well established, as well as some that may be new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension reduction methods which can be applied to exploratory data analysis of a single data set or integrated analysis of a pair or multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.


• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNA, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of consensus and incongruence in and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-dataset methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway and other downstream analyses.

R Supplement

R code to re-generate all figures in this article is available as a Supplementary File. Data and code are also available on the github repository https://github.com/aedin/NCI60Example

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

GGT was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. AC was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program, through the Breast Cancer Research Program, under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

References

1. Brazma A, Culhane AC. Algorithms for gene expression analysis. In: Jorde LB, Little PFR, Dunn MJ, Subramaniam S (eds). Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. London: John Wiley & Sons, 2005, 3148–59.
2. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733–9.
3. Legendre P, Legendre LFJ. Numerical Ecology. Amsterdam: Elsevier Science, 3rd edition, 2012.
4. Biton A, Bernard-Pierrot I, Lou Y, et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep 2014;9:1235–45.
5. Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929–44.
6. Verhaak RG, Tamayo P, Yang JY, et al. Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest 2013;123:517–25.
7. Senbabaoglu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep 2014;4:6207.
8. Gusenleitner D, Howe EA, Bentink S, et al. iBBiG: iterative binary bi-clustering of gene sets. Bioinformatics 2012;28:2484–92.
9. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Magazine Series 1901;6(2):559–72.
10. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933;24:417.
11. Bland M. An Introduction to Medical Statistics. 3rd edn. Oxford, New York: Oxford University Press, 2000, xvi, 405.
12. Warner RM. Applied Statistics: From Bivariate through Multivariate Techniques. London: SAGE Publications Inc, 2nd edition, 2012, 1004.
13. Dray S, Chessel D, Thioulouse J. Co-inertia analysis and the linking of ecological data tables. Ecology 2003;84:3078–89.
14. Golub GH, Loan CF. Matrix Computations. Baltimore: JHU Press, 2012.
15. Anthony M, Harvey M. Linear Algebra: Concepts and Methods. Cambridge: Cambridge University Press, 1st edition, 2012, 149–60.
16. McShane LM, Cavenagh MM, Lively TG, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med 2013;11:220.
17. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.
18. Tukey JW. Exploratory Data Analysis. Boston: Addison-Wesley Publishing Company, 1977.
19. Hogben L. Handbook of Linear Algebra. Boca Raton: Chapman & Hall/CRC Press, 1st edition, 2007, 20.
20. Ringner M. What is principal component analysis? Nat Biotechnol 2008;26:303–4.
21. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003, 91–109.
22. ter Braak CJF, Prentice IC. A theory of gradient analysis. Adv Ecol Res 1988;18:271–313.
23. ter Braak CJF, Smilauer P. CANOCO Reference Manual and User's Guide to Canoco for Windows: Software for Canonical Community Ordination (version 4). Ithaca, NY, USA: Microcomputer Power, 1998, 352. http://www.microcomputerpower.com
24. Abdi H, Williams LJ. Principal component analysis. Hoboken: John Wiley & Sons; Wiley Interdisciplinary Reviews: Computational Statistics 2010;2(4):433–59.
25. Dray S. On the number of principal components: a test of dimensionality based on measurements of similarity between matrices. Comput Stat Data Anal 2008;52:2228–37.
26. Wouters L, Gohlmann HW, Bijnens L, et al. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 2003;59:1131–9.
27. Greenacre M. Correspondence Analysis in Practice. Boca Raton: Chapman and Hall/CRC, 2nd edition, 2007, 201–11.
28. Goodrich JK, Rienzi DSC, Poole AC, et al. Conducting a microbiome study. Cell 2014;158:250–62.
29. Fasham MJR. A comparison of nonmetric multidimensional scaling, principal components and reciprocal averaging for the ordination of simulated coenoclines and coenoplanes. Ecology 1977;58:551–61.


30. Beh EJ, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies. Hoboken: John Wiley & Sons, 1st edition, 2014, 120–76.
31. Greenacre MJ. Theory and Applications of Correspondence Analysis. Waltham, MA: Academic Press, 3rd edition, 1993, 83–125.
32. Fagan A, Culhane AC, Higgins DG. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics 2007;7:2162–71.
33. Fellenberg K, Hauser NC, Brors B, et al. Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 2001;98:10781–6.
34. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.
35. Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287–314.
36. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515–34.
37. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 2008;99:1015–34.
38. Lee S, Epstein MP, Duncan R, et al. Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet Epidemiol 2012;36:293–302.
39. Sill M, Saadati M, Benner A. Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics 2015;31:2683–90.
40. Zhao J. Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC. IEEE Trans Cybern 2014;44:1871–83.
41. Zhao Q, Shi X, Huang J, et al. Integrative analysis of '-omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat 2015;7:10.
42. Alter O, Brown PO, Botstein D. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 2003;100:3351–6.
43. Culhane AC, Perriere G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003;4:59.
44. Dray S. Analysing a pair of tables: coinertia analysis and duality diagrams. In: Blasius J, Greenacre M (eds). Visualization and Verbalization of Data. Boca Raton: CRC Press, 2014, 289–300.
45. Le Cao KA, Martin PG, Robert-Granie C, et al. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009;10:34.
46. ter Braak CJF. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 1986;67:1167–79.
47. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77.
48. McGarigal K, Landguth E, Stafford S. Multivariate Statistics for Wildlife and Ecology Research. New York: Springer, 2002.
49. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: a comparison of several methods. Ann Appl Stat 2011;5:2300–25.
50. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
51. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London, New York, Toronto, Sydney, San Francisco: Academic Press, 1979.
52. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics 2009;10:315.
53. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:article 1.
54. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:article 28.
55. Lin D, Zhang J, Li J, et al. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013;14:245.
56. Le Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011;12:253.
57. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007;8:32–44.
58. Doledec S, Chessel D. Co-inertia analysis: an alternative method for studying species–environment relationships. Freshwater Biol 1994;31:277–94.
59. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
60. Meng C, Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics 2014;15:162.
61. de Tayrac M, Le S, Aubry M, et al. Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: multiple factor analysis approach. BMC Genomics 2009;10:32.
62. Abdi H, Williams LJ, Valentin D. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdiscip Rev Comput Stat 2012;4:124–67.
63. De Jong S, Kiers HAL. Principal covariates regression. Part I: Theory. Chemometrics and Intelligent Laboratory Systems 1992;14:155–64.
64. Benasseni J, Dosse MB. Analyzing multiset data by the power STATIS-ACT method. Adv Data Anal Classification 2012;6:49–65.
65. Escoufier Y. The duality diagram: a means for better practical applications. Devel Num Ecol 1987;14:139–56.
66. Giordani P, Kiers HAL, Ferraro DMA. Three-way component analysis using the R package ThreeWay. J Stat Software 2014;57(7).
67. Leibovici DG. Spatio-temporal multiway decompositions using principal tensor analysis on k-modes: the R package PTAk. J Stat Software 2010;34(10):1–34.
68. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455–500.
69. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res 2014;238:391–403.
70. Chessel D, Hanafi M. Analyses de la co-inertie de K nuages de points. Rev Stat Appl 1996;44:35–60.
71. Hanafi M, Kohler A, Qannari EM. Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometrics and Intelligent Laboratory Systems 2011;106(1):37–40.
72. Carroll JD. Generalization of canonical correlation analysis to three or more sets of variables. Proc 76th Convent Am Psych Assoc 1968;3:227–8.
73. Takane Y, Hwang H, Abdi H. Regularized multiple-set canonical correlation analysis. Psychometrika 2008;73:753–75.


74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257–84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569–83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660–73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 2007, 1147–51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112–31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572–80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379–91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769–77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768–78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401–5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558–70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and over-expression of miR-223. Haematologica 2010;95:1114–21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432–8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672–6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-kB signaling pathways regulate miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324–35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636–46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846–53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082–90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.


  • bbv108-TF1
  • bbv108-TF2
  • bbv108-TF3
  • bbv108-TF4
  • bbv108-TF5
Page 7: Dimension reduction techniques for the integrative ... · Amin M. Gholami is a bioinformatics scientist in the Division of Vaccine Discovery at La Jolla Institute for Allergy and

requirement for orthogonality or independence in the compo-nents The nonnegative constraint guarantees that only the addi-tive combinations of latent variables are allowed This may bemore intuitive in biology where many biological measurements(eg protein concentrations count data) are represented by posi-tive values NMF is described in more detail in the SupplementalInformation ICA was recently applied to molecular subtype dis-covery in bladder cancer [4] Biton et al [4] applied ICA to geneexpression data of 198 bladder cancers and examined 20 compo-nents ICA successfully decomposed and extracted multiplelayers of signal from the data The first two components wereassociated with batch effects but other components revealed newbiology about tumor cells and the tumor microenvironmentThey also applied ICA to non-bladder cancers and by comparingthe correlation between components were able to identify a setof bladder cancer-specific components and their associatedgenes

As omics data sets tend to have high dimensionality (p n)it is often useful to reduce the number of variables Several re-cent extensions of PCA include variable selection often via aregularization step or L-1 penalization (eg Least AbsoluteShrinkage and Selection Operator LASSO) [36] The NIPALS algo-rithm uses an iterative regression approach to calculate thecomponents and loadings which is easily extended to have asparse operator that can be included during regression on thecomponent [37] A cross-validation approach can be used to de-termine the level of sparsity Sparse penalized and regularizedextensions of PCA and related methods have been describedrecently [36 38ndash41]

Integrative analysis of two data sets

One-table dimension reduction methods have been extended tothe EDA of two matrices and can simultaneously decomposeand integrate a pair of matrices that measure different variableson the same observations (Table 3) Methods include general-ized SVD [42] Co-Inertia Analysis (CIA) [43 44] sparse or penal-ized extensions of Partial Least Squares (PLS) CanonicalCorrespondence analysis (CCA) and Canonical CorrelationAnalysis (CCA) [36 45ndash47] Note both canonical correspondenceanalysis and canonical correlation analysis are referred to bythe acronym CCA Canonical correspondence analysis is a con-strained form of CA that is widely used in ecological statistics[46] however it is yet to be adopted by the genomics commu-nity in analysis of pairs of omics data By contrast severalgroups have applied extensions of canonical correlation ana-lysis to omics data integration Therefore in this review we useCCA to describe canonical correlation analysis

Canonical correlation analysis

Two omics data sets X (dimension npx) and Y (dimension npy)can be expressed by the following latent component decompos-ition problem

X frac14 FxQTx thorn Ex

Yfrac14 FyQTy thorn Ey

(9)

where Fx and Fy are nr matrices with r columns of compo-nents that explain the co-structure between X and Y The col-umns of Qx and Qy are variable loading vectors for X and Yrespectively Ex and Ey are error terms

Proposed by Hotelling in 1936 [47] CCA searches for associ-ations or correlations among the variables of X and Y [47] bymaximizing the correlation between Xqx

i and Yqyi

for the ith component arg maxqixqi

ycorethXqi

x YqiyTHORN (10)

In CCA the components Xqxi and Yqy

i are called canonical vari-ates and their correlations are the canonical correlations

Sparse canonical correlation analysis

The main limitation of applying CCA to omics data is that itrequires an inversion of the correlation or covariance matrix[38 49 50] which cannot be calculated when the number ofvariables exceeds the number of observations [46] In high-dimensional omics data where p n application of these meth-ods requires a regularization step This may be accomplished byadding a ridge penalty that is adding a multiple of the identitymatrix to the covariance matrix [51] A sparse solution of theloading vectors (Qx and Qy) filters the number of variables andsimplifies the interpretation of results For this purpose penal-ized CCA [52] sparse CCA [53] CCA-l1 [54] CCA elastic net (CCA-EN) [45] and CCA-group sparse [55] have been introduced andapplied to the integrative analysis of two omics data setsWitten et al [36] provided an elegant comparison of various CCAextensions accompanied by a unified approach to compute bothpenalized CCA and sparse PCA In addition Witten andTibshirani [54] extended sparse CCA into a supervised frame-work which allows the integration of two data sets with aquantitative phenotype for example selecting variables fromboth genomics and transcriptomics data and linking them todrug sensitivity data

Partial least squares analysis

PLS is an efficient dimension reduction method in the analysisof high-dimensional omics data PLS maximizes the covariancerather than the correlation between components making itmore resistant to outliers than CCA Additionally PLS does notsuffer from the p n problem as CCA does Nonetheless asparse solution is desired in some cases For example Le Caoet al [56] proposed a sparse PLS method for the feature selectionby introducing a LASSO penalty for the loading vectors In a re-cent comparison sPLS performed similarly to sparse CCA [45]There are many implementations of PLS which optimize differ-ent objective functions with different constraints several ofwhich are described by Boulesteix et al [57]

Co-Inertia analysis

CIA is a descriptive nonconstrained approach for coupling pairsof data matrices It was originally proposed to link two ecolo-gical tables [13 58] but has been successfully applied in integra-tive analysis of omics data [32 43] CIA is implemented andformulated under the duality diagram framework(Supplementary Information) CIA is performed in two steps (i)application of a dimension reduction technique such as PCACA or NSCA to the initial data sets depending on the type ofdata (binary categorical discrete counts or continuous data)and (ii) constraining the projections of the orthogonal axes suchthat they are maximally covariant [43 58] CIA does not requirean inversion step of the correlation or covariance matrix thusit can be applied to high-dimensional genomics data withoutregularization or penalization

634 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

Though closely related to CCA [49] CIA maximizes thesquared covariance between the linear combination of the pre-processed matrix that is

for the ith dimension argmaxqixqi

ycov2ethXqi

xYqiyTHORN (11)

Equation (11) can be decomposed as

cov2 XqxiYqy

i

frac14 cor2 XqxiYqy

i

var Xqxi

var Xqy

i

(12)

CIA decomposition of covariance maximizes the varianceand the correlation between matrices and thus is less sensitiveto outliers The relationship between CIA Procrustes analysis[13] and CCA have been well described [49] A comparison be-tween sCCA (with elastic net penalty) sPLS and CIA is providedby Le Cao et al [45] In summary CIA and sPLS both maximizethe covariance between eigenvectors and efficiently identifyjoint and individual variance in paired data In contrast CCA-EN maximizes the correlation between eigenvectors and willdiscover effects present in both data sets but may fail to dis-cover strong individual effects [45] Both sCCA and sPLS aresparse methods that select similar subsets of variables whereasCIA does not include a feature selection step thus in terms offeature selection results of CIA are more likely to contain re-dundant information in comparison with sparse methods [45]

Similar to classical dimension reduction approaches thenumber of dimensions to be examined needs to be consideredand can be visualized using a scree plot (similar to Figure 1D)Components may be evaluated by their associated variance [25]the elbow test or Q2 statistics as described previously Forexample the Q2 statistic was applied to select the number ofdimensions in the predictive mode of PLS [56] In additionwhen a sparse factor is introduced in the loading vectors cross-validation approaches may be used to determine the number ofvariables selected from each pair of components Selection ofthe number of components and optimization of these param-eters is still an open research question and is an active area ofresearch

Integrative analysis of multi-assay data

There is a growing need to integrate more than two data sets ingenomics Generalizations of dimension reduction methods tothree or more data sets are sometimes called K-table methods[59ndash61] and a number of them have been applied to multi-assaydata (Table 4) Simultaneous decomposition and integration ofmultiple matrices is more complex than an analysis of a singledata set or paired data because each data set may have differentnumbers of variables scales or internal structure and thus havedifferent variance This might produce global scores that aredominated by one or a few data sets Therefore data are prepro-cessed before decomposition Preprocessing is often performedon two levels On the variable levels variables are often cen-tered and normalized so that their sum of squared values orvariance equals 1 This procedure enables all the variables tohave equal contribution to the total inertia (sum of squares ofall elements) of a data set However the number of variablesmay vary between data sets or filteringpreprocessing stepsmay generate data sets that have a higher variance contributionto the final result Therefore a data set level normalization isalso required In the simplest K-table analysis all matrices haveequal weights More commonly data sets are weighted by theirexpected contribution or expected data quality for example by

the square root of their total inertia or by the square root of thenumbers of columns of each data set [62] Alternatively greaterweights can be given to smaller or less redundant matricesmatrices that have more stable predictive information or thosethat share more information with other matrices Such weight-ing approaches are implemented in multiple-factor analysis(MFA) principal covariates regression [63] and STATIS

The simplest multi-omics data integration is when the datasets have the same variables and observations that is matchedrows and matched columns In genomics these could be pro-duced when variables from different multi-assay data sets aremapped to a common set of genomic coordinates or gene iden-tifiers thus generating data sets with matched variables andmatched observations Alternatively a repeated analysis orlongitudinal analysis of the same samples and the same vari-ables could produce such data (one should note that thesedimension reduction approaches do not model the time correla-tion between different datasets) There is a history of such ana-lyses in ecology where counts of species and environmentvariables are measured over different seasons [49 64 65] and inpsychology where different standardized tests are measuredmultiple times on study populations [66 67] Analysis of suchvariables x samples x time data are called a three-mode decom-position triadic cube or three-way table analysis tensor de-composition three-way PCA three-mode PCA three-modeFactor Analysis Tucker-3 model Tucker3 TUCKALS3 multi-block analysis among others (Table 4) The relationship be-tween various tensor or higher decompositions for multi-blockanalysis are reviewed by Kolda and Bader [68]

More frequently we need to find associations in multi-assaydata that have matched observations but have different andunmatched numbers of variables For example TCGA generatedmiRNA and mRNA transcriptome (RNAseq microarray) DNAcopy number DNA mutation DNA methylation and proteomicsmolecular profiles on each tumor The NCI-60 and the CancerCell Line Encyclopedia projects have measured pharmacologicalcompound profiles in addition to exome sequencing and tran-scriptomic profiles Methods that can be applied to EDA ofK-table of multi-assay data with different variables includeMCIA MFA Generalized CCA (GCCA) and Consensus PCA(CPCA)

The K-table methods can be generally expressed by the fol-lowing model

X1 frac14 FQT1 thorn E1

Xk frac14 FQTk thorn Ek

XK frac14 FQTK thorn EK

(13)

where there are K matrices or omics data sets X1 XK For con-venience we assume that the rows of Xk share a common set ofobservations but the columns of Xk may each have differentvariables F is the lsquoglobal scorersquo matrix Its columns are the PCsand are interpreted similarly to PCs from a PCA of a single dataset The global score matrix which is identical in all decompos-itions integrates information from all data sets Therefore it isnot specific to any single data set rather it represents the com-mon pattern defined by all data sets The matrices Qk with kranging from 1 to K are the loadings or coefficient matricesA high positive value indicates a strong positive contribution ofthe corresponding variable to the lsquoglobal scorersquo While the above

Dimension reduction techniques | 635

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

methods are formulated for multiple data sets with differentvariables but the same observations most can be similarly for-mulated for multiple data sets with the same variables but dif-ferent observations [69]

Multiple co-inertia analysis

MCIA is an extension of CIA which aims to analyze multiplematrices through optimizing a covariance criterion [60 70]MCIA simultaneously projects K data sets into the same dimen-sional space Instead of maximizing the covariance betweenscores from two data sets as in CIA the optimization problemused by MCIA is as following

argmaxqi1 q

ikqi

K

XK

kfrac141

cov2ethXikqi

kXiqiTHORN (14)

for dimension i with the constraints that jjqkijj frac14 var (Xiqi) frac14 1

where X frac14 (X1j jXkj jXK) and qi holds the correspondingloading values (lsquoglobalrsquo loading vector) [60 69 70]

MCIA derives a set of lsquoblock scoresrsquo Xkiqk

i using linear com-binations of the original variables from each individual matrixThe global score Xiqi is then further defined as the linear com-bination of lsquoblock scoresrsquo In practice the global scores representa correlated structure defined by multiple data sets whereasthe concordance and discrepancy between these data sets maybe revealed by the block scores (for detail see lsquoExample casestudyrsquo section) MCIA may be calculated with the ad hoc exten-sion of the NIPALS PCA [71] This algorithm starts with an ini-tialization step in which the global scores and the blockloadings for the first dimension are computed The residualmatrices are calculated in an iterative step by removing thevariance induced by the variable loadings (the lsquodeflationrsquo step)For higher order solutions the same procedure is applied to theresidual matrices and re-iterated until the desired number of di-mensions is reached Therefore the computation time stronglydepends on the number of desired dimensions MCIA is imple-mented in the R package omicade4 and has been applied to theintegrative analysis of transcriptomic and proteomic data setsfrom the NCI-60 cell lines [60]

Generalized canonical correlation analysis

GCCA [71] is a generalization of CCA to K-table analysis [73ndash75]It has also been applied to the analysis of omics data [36 76]Typically MCIA and GCCA will produce similar results (for amore detailed comparison see [60]) GCCA uses a different defla-tion strategy than MCIA it calculates the residual matrices byremoving the variance with respect to the lsquoblock scoresrsquo (insteadof lsquovariable loadingsrsquo used by MCIA or lsquoglobal scoresrsquo used byCPCA see later) When applied to omics data where p n avariable selection step is often integrated within the GCCA ap-proach which cannot be done in case of MCIA In addition asblock scores are better representations of a single data set (incontrast to the global score) GCCA is more likely to find com-mon variables across data sets Witten and Tibshirani [54]applied sparse multiple CCA to analyze gene expression andCopy Number Variation (CNV) data from diffuse large B-celllymphoma patients and successfully identified lsquocis interactionsrsquothat are both up-regulated in CNV and mRNA data

Consensus PCA

CPCA is closely related to GCCA and MCIA but has had less expos-ure to the omics data community CPCA optimizes the same criter-ion as GCCA and MCIA and is subject to the same constraints asMCIA [71] The deflation step of CPCA relies on the lsquoglobal scorersquo Asa result it guarantees the orthogonality of the global score onlyand tends to find common patterns in the data sets This charac-teristic makes it is more suitable for the discovery of joint patternsin multiple data sets such as the joint clustering problem

Regularized generalized canonical correlation analysis

Recently Tenenhaus and Tenenhaus [69 74] proposed regular-ized generalized CCA (RGCCA) which provides a unified frame-work for different K-table multivariate methods The RGCCAmodel introduces some extra parameters particularly a shrink-age parameter and a linkage parameter The linkage parameteris defined so that the connection between matrices may be cus-tomized The shrinkage parameter ranges from 0 to 1 Settingthis parameter to 0 will force the correlation criterion (criterionused by GCCA) whereas a shrinkage parameter of 1 will applythe covariance criterion (used by MCIA and CPCA) A value be-tween 0 and 1 leads to a compromise between the two optionsIn practice the correlation criterion is better in explaining thecorrelated structure across data sets while discarding the vari-ance within each individual data set The introduction of theextra parameters make RGCCA highly versatile GCCA CIA andCPCA can be described as special cases of RGCCA (see [69] andSupplementary Information) In addition RGCCA also integratesa feature selection procedure named sparse GCCA (SGCCA)Tenenhaus et al [76] applied SGCCA to combine gene expres-sion comparative genomic hybridization and a qualitativephenotype measured on a set of 53 children with glioma Sparsemultiple CCA [54] and SGCCA [76] are available in the R pack-ages PMA and RGCCA respectively Similarly a higher orderimplementation of spare PLS is described in Zhao et al [77]

Joint NMF

NMF has also been extended to jointly factorize multiple matri-ces In joint NMF the values in the global score F and in the co-efficient matrices (Q1 QK) are nonnegative and there is noexplicit definition of the block loadings An optimization algo-rithm is applied to minimize an objective function typically the

sum of squared errors ieXK

kfrac141Ek

2 This approach can be con-

sidered to be a nonnegative implementation of PARAFAC al-though it has also been implemented using the Tucker model[78ndash80] Zhang et al [81] apply joint NMF to a three-way analysisof DNA methylation gene expression and miRNA expressiondata to identify modules in each of these regulatory layers thatare associated with each other

Advantages of dimension reduction whenintegrating multi-assay data

Dimension reduction or latent variable approaches provideEDA integrate multi-assay data highlight global correlationsacross data sets and discover outliers or batch effects in indi-vidual data sets Dimension reduction approaches also facilitatedown-stream analysis of both observations and variables(genes) Compared with cluster analysis of individual data setscluster analysis of the global score matrix (F matrix) is robust asit aggregates observations across data sets and is less likely to

636 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

reflect a technical or batch effect of a single data set Similarlydimension reduction of multi-assay data facilities downstreamgene set pathway and network analysis of variables MCIAtransforms variables from each data set onto the same scaleand their loadings (Q matrix) rank the variables by their contri-bution to the global data structure (Figure 2) Meng et al reportthat pathway or gene set enrichment analysis (GSEA) of thetransformed variables is more sensitive than GSEA of each indi-vidual data set This is both because of the re-weighting andtransformation of variables but also because GSEA on the com-bined data has greater coverage of variables (genes) thus com-pensating for missing or unreliable information in any singledata set For example in Figure 2B we integrate and transformmRNA proteomics and miRNA data on the same scale allowingus to extract and study the union of all variables

Example case study

To demonstrate the integration of multiple data sets using dimension reduction, we applied MCIA to analyze mRNA, miRNA and proteomics expression profiles of melanoma, leukemia and CNS cell lines from the NCI-60 panel. The graphical output from this analysis, a plot of the sample space, variable space and data weighting space, is provided in Figure 2A, B and E. The eigenvalues can be interpreted similarly to PCA: a higher eigenvalue contributes more information to the global score. As with PCA, researchers may be subjective in their selection of the number of components [24]. The scree plot in Figure 2D shows the eigenvalues of each global score. In this case, the first two eigenvalues were markedly larger, so we visualized the cell lines and variables on PC1 and PC2.
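
Continuing from the mcoin object in the sketch above, the quantities summarized in Figure 2 can be inspected directly. Again, this is an illustrative sketch rather than the article's supplementary code; the $mcoa$pseudoeig and $mcoa$cov2 slot names and the behaviour of the default plot() method are assumptions that may differ between omicade4/ade4 versions.

    ## scree of pseudo-eigenvalues of the global components (cf. Figure 2D)
    barplot(mcoin$mcoa$pseudoeig, xlab = "Global component", ylab = "Pseudo-eigenvalue")

    ## weight of each data set on each component, the 'data weighting space' (cf. Figure 2E)
    mcoin$mcoa$cov2

    ## the default plot method draws sample space, variable space and weighting panels
    plot(mcoin)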

In the observation space (Figure 2A), assay data (mRNA, miRNA and proteomics) are distinguished by shape. The coordinates of each cell line (Fk in Figure 2A) are connected by lines to the global scores (F). Short lines between points and cell line global scores reflect high concordance in cell line data. Most cell lines have concordant information between data sets (mRNA, miRNA, protein), as indicated by relatively short lines. In addition, the RV coefficient [82, 83], which is a generalized Pearson correlation coefficient for matrices, may be used to estimate the correlation between two transformed data sets. The RV coefficient takes values between 0 and 1, where a higher value indicates higher co-structure.
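
The plain RV coefficient has a simple closed form and can be computed directly from two data matrices with matched observations. The sketch below implements that textbook definition on column-centred data; it is not the modified RV coefficient of Smilde et al. [84], and the function and object names are illustrative.

    ## RV coefficient between two matched data sets (rows = same observations)
    rv_coef <- function(X, Y) {
      X <- scale(X, center = TRUE, scale = FALSE)
      Y <- scale(Y, center = TRUE, scale = FALSE)
      Sx <- tcrossprod(X)                  # X %*% t(X), observation x observation
      Sy <- tcrossprod(Y)
      sum(Sx * Sy) / sqrt(sum(Sx * Sx) * sum(Sy * Sy))
    }

    ## toy usage: two random tables sharing 20 observations
    set.seed(1)
    rv_coef(matrix(rnorm(20 * 100), 20), matrix(rnorm(20 * 40), 20))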

Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) A plot of the first two components in sample space (sample 'type' is coded by the point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a 'star', where the three omics data for each cell line are connected by lines to a center point, which is the global score (F) for that cell line; the shorter the line, the higher the level of concordance between the data types and the global structure. (B) The variable space of MCIA. A variable that is highly expressed in a cell line will be projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are most strongly associated with cancer tissue of origin. (C) The correlation coefficients of the proteome profile of SR with those of the other cell lines. The proteome profile of the SR cell line is more correlated with melanoma cell lines; there may be a technical issue with the LE_SR proteomics data. (D) A scree plot of the eigenvalues and (E) a plot of the data weighting space. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org/.


In this example, we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two groups of cell lines (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to melanoma than to leukemia cell lines. We examined within-tumor-type correlations to the SR cell line (Figure 2C) and observed that the SR proteomics data had higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data on the SR cell line (Figure 2A and C).

MCIA projects all variables into the same space. The variable space (Q1, ..., QK) is visualized in Figure 2B. Variables and samples projected in the same direction are associated; this allows one to select the variables most strongly associated with specific observations from each data set for subsequent analysis. In our previous study [85], we showed that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched in melanogenesis functions, and genes/proteins highly weighted on the leukemia side are highly enriched in T-cell or immune-related functions.

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86-88], had high weights on the positive end of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development, and miR-223 is regulated by the Notch and NF-κB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA with the strongest association to CNS cell lines was miR-409. This miRNA is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, CNS cell lines are characterized by a pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (which corresponds to melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506 strongly associated with melanoma cell lines; these miRNAs are reported to initiate melanocyte transformation and promote melanoma growth [85].

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. The data are biologically complex; in addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for complex multi-omics data are needed.

We describe emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in variable loadings may help in the selection of variables for downstream analysis. Few comparisons of different methods exist, and the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful, but this remains an open area of research.

Another limitation of these methods is that, although variables may vary among data sets, the observations need to be matchable. Therefore, researchers need to have careful experimental design in the early stage of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines a Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].

The number of variables in genomics data is a challenge to traditional EDA visualization tools. Most visualization approaches were designed for data sets with fewer variables. Within R, new packages, including ggord, are being developed, and dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, interpretation of long lists of biological variables (genes, proteins, miRNAs) is challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotation, such as Gene Ontology or Reactome, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60].
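
As a schematic illustration of how annotation can be carried into the reduced space, the sketch below scores each gene set by averaging the loadings of its member genes on each component. This is a deliberate simplification of the supplementary-projection idea used in [32, 33, 60]; it assumes a loadings matrix with gene identifiers as row names and components as columns, and all object names are hypothetical.

    ## score gene sets in component space by averaging member-gene loadings
    geneset_scores <- function(loadings, gene_sets) {
      t(sapply(gene_sets, function(genes) {
        hits <- intersect(genes, rownames(loadings))
        colMeans(loadings[hits, , drop = FALSE])
      }))
    }

    ## toy usage: 100 genes, 2 components, two hypothetical gene sets
    set.seed(1)
    Q <- matrix(rnorm(200), 100, 2,
                dimnames = list(paste0("gene", 1:100), c("PC1", "PC2")))
    sets <- list(setA = paste0("gene", 1:10), setB = paste0("gene", 50:60))
    geneset_scores(Q, sets)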

Conclusion

Dimension reduction methods have a long history. Many similar methods have been developed in parallel by multiple fields. In this review, we provided an overview of dimension reduction techniques that are both well established and may be new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension-reduction methods, which can be applied to exploratory data analysis of a single data set or integrated analysis of a pair or multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.


• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNA, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of consensus and incongruence in and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-dataset methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway and downstream analyses.

R Supplement

R code to re-generate all figures in this article is available as a Supplementary File. Data and code are also available on the github repository https://github.com/aedin/NCI60Example.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

GGT was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. AC was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program through the Breast Cancer Research Program under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

References

1. Brazma A, Culhane AC. Algorithms for gene expression analysis. In: Jorde LB, Little PFR, Dunn MJ, Subramaniam S (eds). Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. London: John Wiley & Sons, 2005, 3148-59.
2. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733-9.
3. Legendre P, Legendre LFJ. Numerical Ecology, 3rd edn. Amsterdam: Elsevier Science, 2012.
4. Biton A, Bernard-Pierrot I, Lou Y, et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep 2014;9:1235-45.
5. Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929-44.
6. Verhaak RG, Tamayo P, Yang JY, et al. Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest 2013;123:517-25.
7. Senbabaoglu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep 2014;4:6207.
8. Gusenleitner D, Howe EA, Bentink S, et al. iBBiG: iterative binary bi-clustering of gene sets. Bioinformatics 2012;28:2484-92.
9. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Magazine Series 6 1901;2:559-72.
10. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933;24:417.
11. Bland M. An Introduction to Medical Statistics, 3rd edn. Oxford/New York: Oxford University Press, 2000, xvi, 405.
12. Warner RM. Applied Statistics: From Bivariate through Multivariate Techniques, 2nd edn. London: SAGE Publications, 2012, 1004.
13. Dray S, Chessel D, Thioulouse J. Co-inertia analysis and the linking of ecological data tables. Ecology 2003;84:3078-89.
14. Golub GH, Loan CF. Matrix Computations. Baltimore: JHU Press, 2012.
15. Anthony M, Harvey M. Linear Algebra: Concepts and Methods, 1st edn. Cambridge: Cambridge University Press, 2012, 149-60.
16. McShane LM, Cavenagh MM, Lively TG, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med 2013;11:220.
17. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157-82.
18. Tukey JW. Exploratory Data Analysis. Boston: Addison-Wesley Publishing Company, 1977.
19. Hogben L. Handbook of Linear Algebra, 1st edn. Boca Raton: Chapman & Hall/CRC Press, 2007, 20.
20. Ringner M. What is principal component analysis? Nat Biotechnol 2008;26:303-4.
21. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003, 91-109.
22. ter Braak CJF, Prentice IC. A theory of gradient analysis. Adv Ecol Res 1988;18:271-313.
23. ter Braak CJF, Smilauer P. CANOCO Reference Manual and User's Guide to Canoco for Windows: Software for Canonical Community Ordination (version 4). Ithaca, NY: Microcomputer Power, http://www.microcomputerpower.com, 1998, 352.
24. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2010;2(4):433-59.
25. Dray S. On the number of principal components: a test of dimensionality based on measurements of similarity between matrices. Comput Stat Data Anal 2008;52:2228-37.
26. Wouters L, Gohlmann HW, Bijnens L, et al. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 2003;59:1131-9.
27. Greenacre M. Correspondence Analysis in Practice, 2nd edn. Boca Raton: Chapman and Hall/CRC, 2007, 201-11.
28. Goodrich JK, Di Rienzi SC, Poole AC, et al. Conducting a microbiome study. Cell 2014;158:250-62.
29. Fasham MJR. A comparison of nonmetric multidimensional scaling, principal components and reciprocal averaging for the ordination of simulated coenoclines and coenoplanes. Ecology 1977;58:551-61.


30. Beh EJ, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies, 1st edn. Hoboken: John Wiley & Sons, 2014, 120-76.
31. Greenacre MJ. Theory and Applications of Correspondence Analysis, 3rd edn. Waltham, MA: Academic Press, 1993, 83-125.
32. Fagan A, Culhane AC, Higgins DG. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics 2007;7:2162-71.
33. Fellenberg K, Hauser NC, Brors B, et al. Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 2001;98:10781-6.
34. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788-91.
35. Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287-314.
36. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515-34.
37. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 2008;99:1015-34.
38. Lee S, Epstein MP, Duncan R, et al. Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet Epidemiol 2012;36:293-302.
39. Sill M, Saadati M, Benner A. Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics 2015;31:2683-90.
40. Zhao J. Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC. IEEE Trans Cybern 2014;44:1871-83.
41. Zhao Q, Shi X, Huang J, et al. Integrative analysis of '-omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat 2015;7:10.
42. Alter O, Brown PO, Botstein D. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 2003;100:3351-6.
43. Culhane AC, Perriere G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003;4:59.
44. Dray S. Analysing a pair of tables: coinertia analysis and duality diagrams. In: Blasius J, Greenacre M (eds). Visualization and Verbalization of Data. Boca Raton: CRC Press, 2014, 289-300.
45. Le Cao KA, Martin PG, Robert-Granie C, et al. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009;10:34.
46. ter Braak CJF. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 1986;67:1167-79.
47. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321-77.
48. McGarigal K, Landguth E, Stafford S. Multivariate Statistics for Wildlife and Ecology Research. New York: Springer, 2002.
49. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: a comparison of several methods. Ann Appl Stat 2011;5:2300-25.
50. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
51. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London: Academic Press, 1979.
52. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics 2009;10:315.
53. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:Article 1.
54. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:Article 28.
55. Lin D, Zhang J, Li J, et al. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013;14:245.
56. Le Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011;12:253.
57. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007;8:32-44.
58. Doledec S, Chessel D. Co-inertia analysis: an alternative method for studying species-environment relationships. Freshwater Biol 1994;31:277-94.
59. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
60. Meng C, Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics 2014;15:162.
61. de Tayrac M, Le S, Aubry M, et al. Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: multiple factor analysis approach. BMC Genomics 2009;10:32.
62. Abdi H, Williams LJ, Valentin D. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics 2012;4:124-67.
63. De Jong S, Kiers HAL. Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems 1992;14:155-64.
64. Benasseni J, Dosse MB. Analyzing multiset data by the power STATIS-ACT method. Adv Data Anal Classification 2012;6:49-65.
65. Escoufier Y. The duality diagram: a means for better practical applications. Devel Num Ecol 1987;14:139-56.
66. Giordani P, Kiers HAL, Ferraro DMA. Three-way component analysis using the R package threeway. J Stat Software 2014;57:7.
67. Leibovici DG. Spatio-temporal multiway decompositions using principal tensor analysis on k-modes: the R package PTAk. J Stat Software 2010;34(10):1-34.
68. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455-500.
69. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res 2014;238:391-403.
70. Chessel D, Hanafi M. Analyses de la co-inertie de K nuages de points. Rev Stat Appl 1996;44:35-60.
71. Hanafi M, Kohler A, Qannari EM. Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometrics and Intelligent Laboratory Systems 2011;106(1):37-40.
72. Carroll JD. Generalization of canonical correlation analysis to three or more sets of variables. Proc 76th Convent Am Psych Assoc 1968;3:227-8.
73. Takane Y, Hwang H, Abdi H. Regularized multiple-set canonical correlation analysis. Psychometrika 2008;73:753-75.


74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257-84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569-83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660-73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Dublin, 2007, 1147-51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112-31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572-80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379-91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769-77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768-78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401-5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558-70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and over-expression of miR-223. Haematologica 2010;95:1114-21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432-8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672-6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-κB signaling pathways regulate miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324-35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636-46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846-53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113-20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082-90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.


Recently Tenenhaus and Tenenhaus [69 74] proposed regular-ized generalized CCA (RGCCA) which provides a unified frame-work for different K-table multivariate methods The RGCCAmodel introduces some extra parameters particularly a shrink-age parameter and a linkage parameter The linkage parameteris defined so that the connection between matrices may be cus-tomized The shrinkage parameter ranges from 0 to 1 Settingthis parameter to 0 will force the correlation criterion (criterionused by GCCA) whereas a shrinkage parameter of 1 will applythe covariance criterion (used by MCIA and CPCA) A value be-tween 0 and 1 leads to a compromise between the two optionsIn practice the correlation criterion is better in explaining thecorrelated structure across data sets while discarding the vari-ance within each individual data set The introduction of theextra parameters make RGCCA highly versatile GCCA CIA andCPCA can be described as special cases of RGCCA (see [69] andSupplementary Information) In addition RGCCA also integratesa feature selection procedure named sparse GCCA (SGCCA)Tenenhaus et al [76] applied SGCCA to combine gene expres-sion comparative genomic hybridization and a qualitativephenotype measured on a set of 53 children with glioma Sparsemultiple CCA [54] and SGCCA [76] are available in the R pack-ages PMA and RGCCA respectively Similarly a higher orderimplementation of spare PLS is described in Zhao et al [77]

Joint NMF

NMF has also been extended to jointly factorize multiple matri-ces In joint NMF the values in the global score F and in the co-efficient matrices (Q1 QK) are nonnegative and there is noexplicit definition of the block loadings An optimization algo-rithm is applied to minimize an objective function typically the

sum of squared errors ieXK

kfrac141Ek

2 This approach can be con-

sidered to be a nonnegative implementation of PARAFAC al-though it has also been implemented using the Tucker model[78ndash80] Zhang et al [81] apply joint NMF to a three-way analysisof DNA methylation gene expression and miRNA expressiondata to identify modules in each of these regulatory layers thatare associated with each other

Advantages of dimension reduction whenintegrating multi-assay data

Dimension reduction or latent variable approaches provideEDA integrate multi-assay data highlight global correlationsacross data sets and discover outliers or batch effects in indi-vidual data sets Dimension reduction approaches also facilitatedown-stream analysis of both observations and variables(genes) Compared with cluster analysis of individual data setscluster analysis of the global score matrix (F matrix) is robust asit aggregates observations across data sets and is less likely to

636 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

reflect a technical or batch effect of a single data set Similarlydimension reduction of multi-assay data facilities downstreamgene set pathway and network analysis of variables MCIAtransforms variables from each data set onto the same scaleand their loadings (Q matrix) rank the variables by their contri-bution to the global data structure (Figure 2) Meng et al reportthat pathway or gene set enrichment analysis (GSEA) of thetransformed variables is more sensitive than GSEA of each indi-vidual data set This is both because of the re-weighting andtransformation of variables but also because GSEA on the com-bined data has greater coverage of variables (genes) thus com-pensating for missing or unreliable information in any singledata set For example in Figure 2B we integrate and transformmRNA proteomics and miRNA data on the same scale allowingus to extract and study the union of all variables

Example case study

To demonstrate the integration of multi-data sets using dimen-sions reduction we applied MCIA to analyze mRNA miRNA andproteomics expression profiles of melanoma leukemia and CNS

cells lines from the NCI-60 panel The graphical output fromthis analysis a plot of the sample space variable space anddata weighting space are provided in Figure 2A B and E Theeigenvalues can be interpreted similarly to PCA a higher eigen-value contributes more information to the global score As withPCA researchers may be subjective in their selection of thenumber of components [24] The scree plot in Figure 2D showsthe eigenvalues of each global score In this case the first twoeigenvalues were significantly larger so we visualized the celllines and variables on PC1 and PC2

In the observation space (Figure 2A) assay data (mRNAmiRNA and proteomics) are distinguished by shape The coord-inates of each cell line (Fk in Figure 2A) are connected by lines tothe global scores (F) Short lines between points and cell lineglobal scores reflect high concordance in cell line data Most celllines have concordant information between data sets (mRNAmiRNA protein) as indicated by relatively short lines In add-ition the RV coefficient [82 83] which is a generalized Pearsoncorrelation coefficient for matrices may be used to estimate thecorrelation between two transformed data sets The RV coeffi-cient has values between 0 and 1 where a higher value

Figure 2 MCIA of mRNA miRNA and proteomics profiles of melanoma (ME) leukemia (LE) and central nervous system (CNS) cell lines (A) shows a plot of the first two

components in sample space (sample lsquotypersquo is coded by the point shape circles for mRNAs triangles for proteins and squares for miRNAs) Each sample (cell line) is

represented by a ldquostarrdquo where the three omics data for each cell line are connected by lines to a center point which is the global score (F) for that cell line the shorter

the line the higher the level of concordance between the data types and the global structure (B) shows the variable space of MCIA A variable that is highly expressed

in a cell line will be projected with a high weight (far from the origin) in the direction of that cell line Some miRNAs with a large distance from the origin are labeled as

these miRNAs are the strongly associated with cancer tissue of origin (C) shows the correlation coefficients of the proteome profiling of SR with other cell lines The

proteome profiling of SR cell line is more correlated with melanoma cell line There may be a technical issue with the LESR proteomics data (D) A scree plot of the

eigenvalues and (E) a plot of data weighting space A colour version of this figure is available online at BIB online httpbiboxfordjournalsorg

Dimension reduction techniques | 637

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

indicates higher co-structure In this example we observed rela-tively high RV coefficients between the three data sets rangingfrom 078 to 084 It was recently reported that the RV coefficientis biased toward large data sets and a modified RV coefficienthas been proposed [84]

In this analysis (Figure 2A) cell lines originating from thesame anatomical source are projected close to each other andconverge in clusters The first PC separates the leukemia celllines (positive end of PC1) from the other two cell lines (negativeend of PC1) and PC2 separates the melanoma and CNS celllines The melanoma cell line LOX-IMVI which lacks the mela-nogenesis is projected close to the origin away from the mel-anoma cluster We were surprised to see that the proteomicsprofile of leukemia cell line SR was projected closer to melan-oma rather than leukemia cell lines We examined withintumor type correlations to the SR cell line (Figure 2C) Weobserved that the SR proteomics data had higher correlationwith melanoma compared with to leukemia cell lines Giventhat the mRNA and miRNA profiles of LE_SR are closer to theleukemia cell lines it suggests that there may have been a tech-nical error in generating the proteomics data on the SR cell line(Figure 2A and C)

MCIA projects all variables into the same space The variablespace (Q1 QK) is visualized in Figure 2B Variables and sam-ples projected in the same direction are associated This allowsone to select the variables most strongly associated with spe-cific observations from each data set for subsequent analysis Inour previous study [85] we have shown that the genes and pro-teins highly weighted on the melanoma side (positive end of se-cond dimension) are enriched with melanogenesis functionsand genesproteins highly weighted on the protein side arehighly enriched in T-cell or immune-related functions

We examined the miRNA data to extract the miRNAs withthe most extreme weights on the first two dimensions miR-142and miR-223 which are active and expressed in leukemia [8283 86ndash88] had high weights on the positive end of both thefirst and second axis (close to the leukemia cell lines samplespace Figure 2A) miR-142 plays an essential role in T-lympho-cyte development miR-223 is regulated by the Notch andNF-kB signaling pathways in T-cell acute lymphoblastic leuke-mia [89]

The miRNA with strongest association to CNS cell lines wasmiR-409 This miRNA is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90] In the NCI-60cell line data CNS cell lines are characterized more by a pro-nounced mesenchymal phenotype which is consistent withhigh expression of this miRNA On the positive end of the se-cond axis and negative end of the first axis (which correspondsto melanoma cell lines in the sample space Figure 2A) wefound miR-509 miR-513 and miR-506 strongly associated withmelanoma cell lines which are reported to initiate melanocytetransformation and promote melanoma growth [85]

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of singleomics data sets but there is an increasing need for methodsthat integrate multi-omics data particularly in cancer researchRecently 20 leading scientists were invited to a meeting organ-ized by Nature Medicine Nature Biotechnology and the VolkswagenFoundation The meeting identified the need to simultaneouslycharacterize DNA sequence epigenome transcriptome proteinmetabolites and infiltrating immune cells in both the tumorand the stroma [91] The TCGA pan-cancer project plans to

comprehensively interrogate multi-omics data across 33 humancancers [92] The data are biologically complex In addition totumor heterogeneity [91] there may be technical issues batcheffects and outliers EDA approaches for complex multi-omicsdata are needed

We describe emerging applications of multivariateapproaches to omics data analysis These are descriptiveapproaches that do not test a hypothesis or generate a P-valueThey are not optimized for variable or biomarker discoveryalthough the introduction of sparsity in variable loadings mayhelp in the selection of variables for downstream analysis Fewcomparisons of different methods exist and the numbers ofcomponents and the sparsity level have to be optimized Cross-validation approaches are potentially useful but this stillremains an open area of research

Another limitation of these methods is that although vari-ables may vary among data sets the observations need to bematchable Therefore researchers need to have careful experi-mental design in the early stage of a study There is an exten-sion of CIA for the analysis of unmatched samples [93] whichcombines a Hungarian algorithm with CIA to iteratively pairsamples that are similar but not matched Multi-block andmulti-group methods (data sets with matched variables) havebeen reviewed recently by Tenenhaus and Tenenhaus [69]

The number of variables in genomics data is a challenge totraditional EDA visualization tools Most visualizationapproaches were designed for data sets with fewer variablesWithin R new packages including ggord are being developedDynamic data visualization is possible using ggvis plotly explorand other packages However interpretation of long lists of bio-logical variables (genes proteins miRNAs) is challenging Oneway to gain more insight into lists of omics variables is to per-form a network gene set enrichment or pathway analysis [94]An attractive feature of decomposition methods is that variableannotation such as Gene Ontology or Reactome can be pro-jected into the same space to determine a score for each geneset (or pathway) in that space [32 33 60]

Conclusion

Dimension reduction methods have a long history Many simi-lar methods have been developed in parallel by multiple fieldsIn this review we provided an overview of dimension reductiontechniques that are both well-established and maybe new tothe multi-omics data community We reviewed methods forsingle-table two-table and multi-table analysis There are sig-nificant challenges in extracting biologically and clinically ac-tionable results from multi-omics data however the field mayleverage the varied and rich resource of dimension reductionapproaches that other disciplines have developed

Key Points

bull There are many dimension-reduction methods whichcan be applied to exploratory data analysis of a singledata set or integrated analysis of a pair or multipledata sets In addition to exploratory analysis thesecan be extended to clustering supervised and discrim-inant analysis

bull The goal of dimension reduction is to map data ontoa new set of variables so that most of the variance (orinformation) in the data is explained by a few new(latent) variables

638 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

bull Multi-data set methods such as multiple co-inertiaanalysis (MCIA) multiple factor analysis (MFA) or ca-nonical correlations analysis (CCA) identify correlatedstructure between data sets with matched observa-tions (samples) Each data set may have different vari-ables (genes proteins miRNA mutations drugresponse etc)

bull MCIA MFA CCA and related methods provide a visu-alization of consensus and incongruence in andbetween data sets enabling discovery of potential out-liers batch effects or technical errors

bull Multi-dataset methods transform diverse variablesfrom each data set onto the same space and scalefacilitating integrative variable selection gene setanalysis pathway and downstream analyses

R Supplement

R code to re-generate all figures in this article is available asa Supplementary File Data and code are also available on thegithub repository httpsgithubcomaedinNCI60Example

Supplementary Data

Supplementary data are available online at httpbiboxfordjournalsorg

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer forreading the manuscript and for their comments

Funding

GGT was supported by the Austrian Ministry of ScienceResearch and Economics [OMICS Center HSRSM initiative]AC was supported by Dana-Farber Cancer Institute BCBResearch Scientist Developmental Funds National CancerInstitute 2P50 CA101942-11 and the Assistant Secretary ofDefense Health Program through the Breast CancerResearch Program under Award No W81XWH-15-1-0013Opinions interpretations conclusions and recommenda-tions are those of the author and are not necessarilyendorsed by the Department of Defense

References1 Brazma A Culhane AC Algorithms for gene expression

analysis In Jorde LB Little PFR Dunn MJ Subramaniam S(eds) Encyclopedia of Genetics Genomics Proteomics andBioinformatics London John Wiley amp Sons 2005 3148ndash59

2 Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

3 Legendre P Legendre LFJ Numerical Ecology AmsterdamElsevier Science 3rd edition 2012

4 Biton A Bernard-Pierrot I Lou Y et al Independent compo-nent analysis uncovers the landscape of the bladder tumortranscriptome and reveals insights into luminal and basalsubtypes Cell Rep 201491235ndash45

5 Hoadley KA Yau C Wolf DM et al Multiplatform analysis of12 cancer types reveals molecular classification within andacross tissues of origin Cell 2014158929ndash44

6 Verhaak RG Tamayo P Yang JY et al Prognostically relevantgene signatures of high-grade serous ovarian carcinomaJ Clin Invest 2013123517ndash25

7 Senbabaoglu Y Michailidis G Li JZ Critical limitations of con-sensus clustering in class discovery Sci Rep 201446207

8 Gusenleitner D Howe EA Bentink S et al iBBiG iterativebinary bi-clustering of gene sets Bioinformatics 2012282484ndash92

9 Pearson K On lines and planes of closest fit to systems ofpoints in space Philos Magazine Series 19016(2)559ndash72

10Hotelling H Analysis of a complex of statistical variables intoprincipal components J Educ Psychol 193324417

11Bland M An Introduction to Medical Statistics 3rd edn OxfordNew York Oxford University Press 2000 xvi 405

12Warner RM Applied Statistics From Bivariate throughMultivariate Techniques London SAGE Publications Inc2nd Edition 2012 1004

13Dray S Chessel D Thioulouse J Co-inertia analysis and thelinking of ecological data tables Ecology 2003843078ndash89

14Golub GH Loan CF Matrix Computations Baltimore JHU Press2012

15Anthony M Harvey M Linear Algebra Concepts and MethodsCambridge Cambridge University Press 1st edition 2012149ndash60

16McShane LM Cavenagh MM Lively TG et al Criteria for theuse of omics-based predictors in clinical trials explanationand elaboration BMC Med 201311220

17Guyon I Elisseeff A An introduction to variable and featureselection J Mach Learn Res 200331157ndash82

18Tukey JW Exploratory Data Analysis Boston Addison-WesleyPublishing Company 1977

19Hogben L Handbook of Linear Algebra Boca Raton Chapman ampHallCRC Press 1st edition 2007 20

20Ringner M What is principal component analysis NatBiotechnol 200826303ndash4

21Wall ME Rechtsteiner A Rocha LM (2003) Singular value de-composition and principal component analysis In A PracticalApproach to Microarray Data Analysis Berrar DP DubitzkyW Granzow M (eds) Norwell MA Kluwer 2003 91ndash109

22 terBraak CJF Prentice IC A theory of gradient analysis AdvEcol Res 198818271ndash313

23 terBraak CJF Smilauer P CANOCO Reference Manual andUserrsquos Guide to Canoco for Windows Software for CanonicalCommunity Ordination (version 4) Microcomputer Power(Ithaca NY USA) httpwwwmicrocomputerpowercom1998352

24Abdi H Williams LJ Principal Component Analysis HobokenJohn Wiley amp SonsInterdisciplinary Reviews ComputationalStatistics 20102(4)433ndash59

25Dray S On the number of principal components a test ofdimensionality based on measurements of similarity be-tween matrices Comput Stat Data Anal 2008522228ndash37

26Wouters L Gohlmann HW Bijnens L et al Graphical explor-ation of gene expression data a comparative study of threemultivariate methods Biometrics 2003591131ndash9

27Greenacre M Correspondence Analysis in Practice Boca RatonChapman and HallCRC 2 edition 2007 201ndash11

28Goodrich JK Rienzi DSC Poole AC et al Conducting a micro-biome study Cell 2014158250ndash62

29Fasham MJR A comparison of nonmetric multidimensionalscaling principal components and reciprocal averaging forthe ordination of simulated coenoclines and coenoplanesEcology 197758551ndash61

Dimension reduction techniques | 639

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

30. Beh EJ, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies. Hoboken: John Wiley & Sons, 1st edition, 2014, 120–76.
31. Greenacre MJ. Theory and Applications of Correspondence Analysis. Waltham, MA: Academic Press, 3rd edition, 1993, 83–125.
32. Fagan A, Culhane AC, Higgins DG. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics 2007;7:2162–71.
33. Fellenberg K, Hauser NC, Brors B, et al. Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 2001;98:10781–6.
34. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.
35. Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287–314.
36. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515–34.
37. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 2008;99:1015–34.
38. Lee S, Epstein MP, Duncan R, et al. Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet Epidemiol 2012;36:293–302.
39. Sill M, Saadati M, Benner A. Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics 2015;31:2683–90.
40. Zhao J. Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC. IEEE Trans Cybern 2014;44:1871–83.
41. Zhao Q, Shi X, Huang J, et al. Integrative analysis of '-omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat 2015;7:10.
42. Alter O, Brown PO, Botstein D. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 2003;100:3351–6.
43. Culhane AC, Perriere G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003;4:59.
44. Dray S. Analysing a pair of tables: coinertia analysis and duality diagrams. In: Blasius J, Greenacre M (eds). Visualization and Verbalization of Data. Boca Raton: CRC Press, 2014, 289–300.
45. Le Cao KA, Martin PG, Robert-Granie C, et al. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009;10:34.
46. ter Braak CJF. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 1986;67:1167–79.
47. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77.
48. McGarigal K, Landguth E, Stafford S. Multivariate Statistics for Wildlife and Ecology Research. New York: Springer, 2002.
49. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: a comparison of several methods. Ann Appl Stat 2011;5:2300–25.
50. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
51. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London: Academic Press, 1979.
52. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics 2009;10:315.
53. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:Article 1.
54. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:Article 28.
55. Lin D, Zhang J, Li J, et al. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013;14:245.
56. Le Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011;12:253.
57. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007;8:32–44.
58. Doledec S, Chessel D. Co-inertia analysis: an alternative method for studying species–environment relationships. Freshwater Biol 1994;31:277–94.
59. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
60. Meng C, Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics 2014;15:162.
61. de Tayrac M, Le S, Aubry M, et al. Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: multiple factor analysis approach. BMC Genomics 2009;10:32.
62. Abdi H, Williams LJ, Valentin D. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdiscip Rev Comput Stat 2012;4:124–67.
63. De Jong S, Kiers HAL. Principal covariates regression. Part I: Theory. Chemometrics and Intelligent Laboratory Systems 1992;14:155–64.
64. Benasseni J, Dosse MB. Analyzing multiset data by the power STATIS-ACT method. Adv Data Anal Classification 2012;6:49–65.
65. Escoufier Y. The duality diagram: a means for better practical applications. Devel Num Ecol 1987;14:139–56.
66. Giordani P, Kiers HAL, Ferraro DMA. Three-way component analysis using the R package ThreeWay. J Stat Software 2014;57(7).
67. Leibovici DG. Spatio-temporal multiway decompositions using principal tensor analysis on k-modes: the R package PTAk. J Stat Software 2010;34(10):1–34.
68. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455–500.
69. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res 2014;238:391–403.
70. Chessel D, Hanafi M. Analyses de la co-inertie de K nuages de points. Rev Stat Appl 1996;44:35–60.
71. Hanafi M, Kohler A, Qannari EM. Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometrics and Intelligent Laboratory Systems 2011;106(1):37–40.
72. Carroll JD. Generalization of canonical correlation analysis to three or more sets of variables. Proc 76th Convent Am Psych Assoc 1968;3:227–8.
73. Takane Y, Hwang H, Abdi H. Regularized multiple-set canonical correlation analysis. Psychometrika 2008;73:753–75.


74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257–84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569–83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660–73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Dublin, 2007, 1147–51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112–31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572–80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379–91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769–77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768–78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401–5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558–70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and over-expression of miR-223. Haematologica 2010;95:1114–21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432–8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672–6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-kB signaling pathways regulate miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324–35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636–46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846–53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082–90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.


Although these methods are formulated for multiple data sets with different variables but the same observations, most can be similarly formulated for multiple data sets with the same variables but different observations [69].

Multiple co-inertia analysis

MCIA is an extension of CIA that aims to analyze multiple matrices by optimizing a covariance criterion [60, 70]. MCIA simultaneously projects K data sets into the same dimensional space. Instead of maximizing the covariance between scores from two data sets, as in CIA, the optimization problem used by MCIA is as follows:

$$\operatorname*{argmax}_{q_1^i,\ldots,q_k^i,\ldots,q_K^i} \; \sum_{k=1}^{K} \operatorname{cov}^2\!\left(X_k^i q_k^i,\; X^i q^i\right) \qquad (14)$$

for dimension $i$, with the constraints that $\lVert q_k^i \rVert = \operatorname{var}(X^i q^i) = 1$, where $X = (X_1 \,|\, \cdots \,|\, X_k \,|\, \cdots \,|\, X_K)$ and $q^i$ holds the corresponding loading values (the 'global' loading vector) [60, 69, 70].

MCIA derives a set of 'block scores' $X_k^i q_k^i$ using linear combinations of the original variables from each individual matrix. The global score $X^i q^i$ is then defined as a linear combination of the block scores. In practice, the global scores represent a correlated structure defined by multiple data sets, whereas the concordance and discrepancy between these data sets may be revealed by the block scores (for detail, see the 'Example case study' section). MCIA may be calculated with an ad hoc extension of NIPALS PCA [71]. This algorithm starts with an initialization step in which the global scores and the block loadings for the first dimension are computed. The residual matrices are then calculated in an iterative step by removing the variance induced by the variable loadings (the 'deflation' step). For higher-order solutions, the same procedure is applied to the residual matrices and re-iterated until the desired number of dimensions is reached; the computation time therefore strongly depends on the number of desired dimensions. MCIA is implemented in the R package omicade4 and has been applied to the integrative analysis of transcriptomic and proteomic data sets from the NCI-60 cell lines [60].
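As a concrete illustration of this workflow, the sketch below runs MCIA with omicade4 on the package's bundled NCI-60 example data (four transcriptome platforms rather than the mRNA/miRNA/proteomics tables analyzed in the case study below). Function and argument names reflect recent package versions and should be checked against the installed one.

```r
## Minimal MCIA sketch with omicade4 (Bioconductor); uses the package's
## bundled NCI-60 example data, so results differ from the case study below.
library(omicade4)
data(NCI60_4arrays)                      # list of data.frames, samples in columns

# the columns (cell lines) must be matched across all tables
sapply(NCI60_4arrays, ncol)

mcoin <- mcia(NCI60_4arrays, cia.nf = 2) # two global dimensions

# tissue of origin parsed from the cell line names (e.g. "LE.SR" -> "LE")
cancer_type <- sapply(strsplit(colnames(NCI60_4arrays$agilent), "\\."), `[`, 1)

# sample space, variable space, eigenvalues and data weighting in one display
plot(mcoin, axes = 1:2, phenovec = cancer_type, sample.lab = FALSE)

# variables projected to the extreme positive end of the first axis
selectVar(mcoin, a1.lim = c(2, Inf), a2.lim = c(-Inf, Inf))
```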

Generalized canonical correlation analysis

GCCA [71] is a generalization of CCA to K-table analysis [73–75]. It has also been applied to the analysis of omics data [36, 76]. Typically, MCIA and GCCA will produce similar results (for a more detailed comparison, see [60]). GCCA uses a different deflation strategy from MCIA: it calculates the residual matrices by removing the variance with respect to the 'block scores' (instead of the 'variable loadings' used by MCIA or the 'global scores' used by CPCA, see later). When applied to omics data, where the number of variables greatly exceeds the number of observations, a variable selection step is often integrated within the GCCA approach, which cannot be done in the case of MCIA. In addition, as block scores are better representations of a single data set (in contrast to the global score), GCCA is more likely to find common variables across data sets. Witten and Tibshirani [54] applied sparse multiple CCA to analyze gene expression and copy number variation (CNV) data from diffuse large B-cell lymphoma patients and successfully identified 'cis interactions' that are up-regulated in both the CNV and mRNA data.
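A hedged sketch of that sparse multiple CCA approach with the PMA package might look as follows; the data are simulated, and the permutation-based penalty selection mirrors the package documentation, so argument names should be verified against the installed version.

```r
## Sketch: sparse multiple CCA (Witten & Tibshirani) on K matched matrices
## using the PMA package; simulated data, penalties chosen by permutation.
library(PMA)
set.seed(1)
n  <- 30                                   # matched observations (samples)
X1 <- matrix(rnorm(n * 200), n, 200)       # e.g. mRNA
X2 <- matrix(rnorm(n * 150), n, 150)       # e.g. CNV
X3 <- matrix(rnorm(n *  80), n,  80)       # e.g. miRNA
xlist <- list(X1, X2, X3)

perm <- MultiCCA.permute(xlist, type = rep("standard", 3))
fit  <- MultiCCA(xlist, penalty = perm$bestpenalties,
                 ncomponents = 2, type = rep("standard", 3))

# one sparse weight matrix per data set; count variables retained per component
sapply(fit$ws, function(w) colSums(w != 0))
```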

Consensus PCA

CPCA is closely related to GCCA and MCIA but has had less exposure in the omics data community. CPCA optimizes the same criterion as GCCA and MCIA and is subject to the same constraints as MCIA [71]. The deflation step of CPCA relies on the 'global score'. As a result, it guarantees the orthogonality of the global scores only and tends to find common patterns in the data sets. This characteristic makes it more suitable for the discovery of joint patterns in multiple data sets, such as the joint clustering problem.
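To make the deflation difference concrete, the base R sketch below computes one schematic NIPALS-style consensus PCA component and then deflates each block by the global score. It is a simplified illustration under stated assumptions (normalization conventions vary between published CPCA algorithms), not a reference implementation.

```r
## Schematic consensus PCA (CPCA) for a single component, illustrating
## deflation by the global score; simplified, normalisation conventions vary.
cpca_one <- function(blocks, max_iter = 200, tol = 1e-9) {
  t_glob <- blocks[[1]][, 1]                                  # initial global score
  for (iter in seq_len(max_iter)) {
    p <- lapply(blocks, function(X) crossprod(X, t_glob) / sum(t_glob^2))
    t_blk <- mapply(function(X, pk) X %*% (pk / sqrt(sum(pk^2))),
                    blocks, p, SIMPLIFY = FALSE)               # block scores
    T_super <- do.call(cbind, t_blk)                           # super block
    w <- crossprod(T_super, t_glob); w <- w / sqrt(sum(w^2))
    t_new <- T_super %*% w                                     # updated global score
    if (sum((t_new - t_glob)^2) < tol) { t_glob <- t_new; break }
    t_glob <- t_new
  }
  deflated <- lapply(blocks, function(X)                       # CPCA-style deflation
    X - t_glob %*% crossprod(t_glob, X) / sum(t_glob^2))
  list(global_score = t_glob, block_scores = t_blk, deflated = deflated)
}

set.seed(1)
blocks <- lapply(c(100, 50, 40), function(p) scale(matrix(rnorm(30 * p), 30, p)))
comp1  <- cpca_one(blocks)
```

Deflating with the variable loadings (MCIA) or the block scores (GCCA) instead of the global score, at the marked step, yields the alternative strategies discussed above.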

Regularized generalized canonical correlation analysis

Recently, Tenenhaus and Tenenhaus [69, 74] proposed regularized generalized CCA (RGCCA), which provides a unified framework for different K-table multivariate methods. The RGCCA model introduces some extra parameters, in particular a shrinkage parameter and a linkage parameter. The linkage parameter is defined so that the connection between matrices may be customized. The shrinkage parameter ranges from 0 to 1: setting it to 0 forces the correlation criterion (the criterion used by GCCA), whereas a value of 1 applies the covariance criterion (used by MCIA and CPCA), and intermediate values lead to a compromise between the two. In practice, the correlation criterion is better at explaining the correlated structure across data sets while discarding the variance within each individual data set. The introduction of these extra parameters makes RGCCA highly versatile: GCCA, CIA and CPCA can be described as special cases of RGCCA (see [69] and Supplementary Information). In addition, RGCCA integrates a feature selection procedure named sparse GCCA (SGCCA). Tenenhaus et al. [76] applied SGCCA to combine gene expression, comparative genomic hybridization and a qualitative phenotype measured on a set of 53 children with glioma. Sparse multiple CCA [54] and SGCCA [76] are available in the R packages PMA and RGCCA, respectively. Similarly, a higher-order implementation of sparse PLS is described in Zhao et al. [77].
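The role of the shrinkage parameter can be sketched with the RGCCA package as below. This follows the interface of older package releases (function and argument names such as rgcca(A, C, tau, ...) have changed in more recent versions), and the simulated blocks are kept low-dimensional so that the correlation criterion (tau = 0) is well defined.

```r
## Sketch: RGCCA with the shrinkage parameter tau selecting the criterion
## (tau = 0 ~ correlation-based as in GCCA; tau = 1 ~ covariance-based as in
## MCIA/CPCA). Interface follows older RGCCA package releases; data simulated.
library(RGCCA)
set.seed(1)
n <- 50
A <- list(mrna  = matrix(rnorm(n * 20), n, 20),
          mirna = matrix(rnorm(n * 15), n, 15),
          prot  = matrix(rnorm(n * 10), n, 10))

C <- 1 - diag(3)                          # linkage: every block connected

fit_cov <- rgcca(A, C, tau = rep(1, 3), ncomp = rep(2, 3), scheme = "factorial")
fit_cor <- rgcca(A, C, tau = rep(0, 3), ncomp = rep(2, 3), scheme = "factorial")

# block scores (one matrix of component scores per data set)
str(fit_cov$Y)
```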

Joint NMF

NMF has also been extended to jointly factorize multiple matrices. In joint NMF, the values in the global score F and in the coefficient matrices (Q_1, ..., Q_K) are nonnegative, and there is no explicit definition of the block loadings. An optimization algorithm is applied to minimize an objective function, typically the sum of squared errors, i.e. $\sum_{k=1}^{K} \lVert E_k \rVert^2$. This approach can be considered a nonnegative implementation of PARAFAC, although it has also been implemented using the Tucker model [78–80]. Zhang et al. [81] applied joint NMF to a three-way analysis of DNA methylation, gene expression and miRNA expression data to identify modules in each of these regulatory layers that are associated with each other.
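This objective can be illustrated with a short base R sketch that factorizes K nonnegative matrices with a shared, nonnegative global score matrix F using simple multiplicative updates. It is a didactic toy (the helper joint_nmf is hypothetical, with random initialization and no convergence checks), not the algorithm of any particular package.

```r
## Toy joint NMF: minimise sum_k ||X_k - F %*% Q_k||^2 with F, Q_k >= 0,
## sharing the sample factor matrix F across data sets (multiplicative updates).
joint_nmf <- function(X_list, r = 3, n_iter = 500, eps = 1e-9) {
  n    <- nrow(X_list[[1]])
  Fmat <- matrix(runif(n * r), n, r)                          # shared global score
  Q    <- lapply(X_list, function(X) matrix(runif(r * ncol(X)), r, ncol(X)))
  for (it in seq_len(n_iter)) {
    # update each coefficient matrix Q_k given the shared Fmat
    Q <- mapply(function(X, Qk) Qk * crossprod(Fmat, X) /
                                  (crossprod(Fmat) %*% Qk + eps),
                X_list, Q, SIMPLIFY = FALSE)
    # update Fmat, pooling numerators and denominators over all data sets
    num  <- Reduce(`+`, mapply(function(X, Qk) X %*% t(Qk), X_list, Q,
                               SIMPLIFY = FALSE))
    den  <- Fmat %*% Reduce(`+`, lapply(Q, tcrossprod)) + eps
    Fmat <- Fmat * num / den
  }
  list(F = Fmat, Q = Q)
}

set.seed(1)
X_list <- lapply(c(100, 50, 30), function(p) matrix(rexp(20 * p), 20, p))
res <- joint_nmf(X_list, r = 3)          # res$F: 20 samples x 3 joint factors
```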

Advantages of dimension reduction when integrating multi-assay data

Dimension reduction or latent variable approaches provide EDA, integrate multi-assay data, highlight global correlations across data sets, and discover outliers or batch effects in individual data sets. Dimension reduction approaches also facilitate downstream analysis of both observations and variables (genes). Compared with cluster analysis of individual data sets, cluster analysis of the global score matrix (F matrix) is robust, as it aggregates observations across data sets and is less likely to reflect a technical or batch effect of a single data set. Similarly, dimension reduction of multi-assay data facilitates downstream gene set, pathway and network analysis of variables. MCIA transforms variables from each data set onto the same scale, and their loadings (Q matrix) rank the variables by their contribution to the global data structure (Figure 2). Meng et al. report that pathway or gene set enrichment analysis (GSEA) of the transformed variables is more sensitive than GSEA of each individual data set. This is both because of the re-weighting and transformation of variables and because GSEA on the combined data has greater coverage of variables (genes), thus compensating for missing or unreliable information in any single data set. For example, in Figure 2B we integrate and transform mRNA, proteomics and miRNA data onto the same scale, allowing us to extract and study the union of all variables.
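Once a global score matrix F (observations by components) has been extracted from any fitted multi-table model, standard downstream tools can be applied to it directly. The base R sketch below clusters observations on the consensus space; the object slot holding F differs between packages, so a generic placeholder matrix is used here.

```r
## Sketch: downstream clustering on the global score matrix F.
## `global_scores` stands in for the consensus scores extracted from a fitted
## multi-table model; the exact slot name depends on the package used.
set.seed(1)
global_scores <- matrix(rnorm(60 * 3), 60, 3,
                        dimnames = list(paste0("sample", 1:60), paste0("PC", 1:3)))

# cluster observations on the consensus space rather than on any single assay,
# which dampens assay-specific batch effects and outliers
hc       <- hclust(dist(global_scores), method = "ward.D2")
clusters <- cutree(hc, k = 3)
table(clusters)
```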

Example case study

To demonstrate the integration of multiple data sets using dimension reduction, we applied MCIA to analyze mRNA, miRNA and proteomics expression profiles of melanoma, leukemia and CNS cell lines from the NCI-60 panel. The graphical output from this analysis, a plot of the sample space, variable space and data weighting space, is provided in Figure 2A, B and E. The eigenvalues can be interpreted similarly to PCA: a higher eigenvalue contributes more information to the global score. As with PCA, researchers may be subjective in their selection of the number of components [24]. The scree plot in Figure 2D shows the eigenvalues of each global score. In this case, the first two eigenvalues were markedly larger, so we visualized the cell lines and variables on PC1 and PC2.

Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) Plot of the first two components in sample space (sample 'type' is coded by the point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a 'star', in which the three omics data for that cell line are connected by lines to a center point, the global score (F) for that cell line; the shorter the line, the higher the level of concordance between the data type and the global structure. (B) The variable space of MCIA. A variable that is highly expressed in a cell line is projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are strongly associated with cancer tissue of origin. (C) Correlation coefficients of the proteome profile of SR with the other cell lines; the SR proteome profile is more correlated with melanoma cell lines, suggesting a possible technical issue with the LE_SR proteomics data. (D) A scree plot of the eigenvalues and (E) a plot of the data weighting space. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

In the observation space (Figure 2A), assay data (mRNA, miRNA and proteomics) are distinguished by shape. The coordinates of each cell line (F_k in Figure 2A) are connected by lines to the global scores (F). Short lines between points and cell line global scores reflect high concordance in the cell line data. Most cell lines have concordant information between data sets (mRNA, miRNA, protein), as indicated by relatively short lines. In addition, the RV coefficient [82, 83], which is a generalized Pearson correlation coefficient for matrices, may be used to estimate the correlation between two transformed data sets. The RV coefficient takes values between 0 and 1, where a higher value indicates higher co-structure. In this example, we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].
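The RV coefficient itself is straightforward to compute from its trace-based definition. The base R sketch below (a hypothetical helper, rv_coefficient) returns a value in [0, 1] for two column-centred matrices with matched rows.

```r
## Sketch: RV coefficient between two matrices with matched rows (samples),
## RV = trace(XX'YY') / sqrt(trace((XX')^2) * trace((YY')^2)).
rv_coefficient <- function(X, Y) {
  X   <- scale(X, center = TRUE, scale = FALSE)
  Y   <- scale(Y, center = TRUE, scale = FALSE)
  XXt <- tcrossprod(X)                    # n x n sample configuration matrices
  YYt <- tcrossprod(Y)
  sum(XXt * YYt) / sqrt(sum(XXt * XXt) * sum(YYt * YYt))
}

set.seed(1)
X <- matrix(rnorm(30 * 100), 30, 100)
Y <- X[, 1:50] + matrix(rnorm(30 * 50, sd = 0.5), 30, 50)
rv_coefficient(X, Y)                      # high value: shared co-structure
```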

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two groups (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to the melanoma rather than the leukemia cell lines. We examined within-tumor-type correlations to the SR cell line (Figure 2C) and observed that the SR proteomics data had a higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data on the SR cell line (Figure 2A and C).

MCIA projects all variables into the same space. The variable space (Q_1, ..., Q_K) is visualized in Figure 2B. Variables and samples projected in the same direction are associated. This allows one to select the variables most strongly associated with specific observations from each data set for subsequent analysis. In our previous study [85], we showed that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched in melanogenesis functions, and genes/proteins highly weighted on the leukemia side are highly enriched in T-cell or immune-related functions.

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86–88], had high weights on the positive end of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development, and miR-223 is regulated by the Notch and NF-kB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA with the strongest association to CNS cell lines was miR-409, which is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, CNS cell lines are characterized by a pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (which corresponds to the melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506 strongly associated with the melanoma cell lines; these miRNAs are reported to initiate melanocyte transformation and promote melanoma growth [85].

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. These data are biologically complex. In addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for such complex multi-omics data are needed.

We have described emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in the variable loadings may help in the selection of variables for downstream analysis. Few comparisons of the different methods exist, and both the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful for this, but it remains an open area of research.

Another limitation of these methods is that, although the variables may vary among data sets, the observations need to be matchable. Therefore, researchers need careful experimental design in the early stages of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines a Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].

The number of variables in genomics data is a challenge to traditional EDA visualization tools, most of which were designed for data sets with fewer variables. Within R, new packages, including ggord, are being developed, and dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, interpretation of long lists of biological variables (genes, proteins, miRNAs) is challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotation, such as Gene Ontology or Reactome, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60].

Conclusion

Dimension reduction methods have a long history, and many similar methods have been developed in parallel by multiple fields. In this review, we provided an overview of dimension reduction techniques that are well established but may be new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension reduction methods, which can be applied to exploratory data analysis of a single data set or to integrated analysis of a pair or of multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.

• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNAs, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of consensus and incongruence in and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-data set methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway and other downstream analyses.

R Supplement

R code to regenerate all figures in this article is available as a Supplementary File. Data and code are also available on the GitHub repository: https://github.com/aedin/NCI60Example

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

GGT was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. AC was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program through the Breast Cancer Research Program under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.



48McGarigal K Landguth E Stafford S Multivariate Statistics forWildlife and Ecology Research New York Springer 2002

49Thioulouse J Simultaneous analysis of a sequence of pairedecological tables a comparison of several methods Ann ApplStat 201152300ndash25

50Hong S Chen X Jin L et al Canonical correlation analysisfor RNA-seq co-expression networks Nucleic Acids Res201341e95

51Mardia KV Kent JT Bibby JM Multivariate Analysis LondonNew York NY Toronto Sydney San Francisco CA AcademicPress 1979

52Waaijenborg S Zwinderman AH Sparse canonical correl-ation analysis for identifying connecting and completinggene-expression networks BMC Bioinformatics 200910315

53Parkhomenko E Tritchler D Beyene J Sparse canonical cor-relation analysis with application to genomic data integra-tion Stat Appl Genet Mol Biol 20098article 1

54Witten DM Tibshirani RJ Extensions of sparse canonical cor-relation analysis with applications to genomic data Stat ApplGenet Mol Biol 20098article 28

55Lin D Zhang J Li J et al Group sparse canonical correlationanalysis for genomic data integration BMC Bioinformatics201314245

56Le Cao KA Boitard S Besse P Sparse PLS discriminantanalysis biologically relevant feature selection and graph-ical displays for multiclass problems BMC Bioinform 201112253

57Boulesteix AL Strimmer K Partial least squares a versatiletool for the analysis of high-dimensional genomic data BriefBioinform 2007832ndash44

58Doledec S Chessel D Co-inertia analysis an alternativemethod for studying speciesndashenvironment relationshipsFreshwater Biol 199431277ndash94

59Ponnapalli SP Saunders MA Van Loan CF et al A higher-order generalized singular value decomposition for compari-son of global mRNA expression from multiple organismsPLoS One 20116e28072

60Meng C Kuster B Culhane AC et al A multivariate approachto the integration of multi-omics datasets BMC Bioinformatics201415162

61de Tayrac M Le S Aubry M et al Simultaneous analysis ofdistinct Omics data sets with integration of biological know-ledge multiple factor analysis approach BMC Genomics20091032

62Abdi H Williams LJ Valentin D Statis and Distatis OptimumMultitable Principal Component Analysis and Three Way MetricMultidimensional Scaling Wiley Interdisciplinary 20124124ndash67

63De Jong S Kiers HAL Principal covariates regression Part ITheory Chemometrics and Intelligent Laboratory Systems199214155ndash64

64Benasseni J Dosse MB Analyzing multiset data by the powerSTATIS-ACT method Adv Data Anal Classification 2012649ndash65

65Escoufier Y The duality diagram a means for better practicalapplications Devel Num Ecol 198714139ndash56

66Giordani P Kiers HAL Ferraro DMA Three-way componentanalysis using the R package threeway J Stat Software2014577

67Leibovici DG Spatio-temporal multiway decompositionsusing principal tensor analysis on k-modes the R packagePTAk J Stat Software 201034(10)1ndash34

68Kolda TG Bader BW Tensor decompositions and applica-tions SIAM Rev 200951455ndash500

69Tenenhaus A Tenenhaus M Regularized generalized canon-ical correlation analysis for multiblock or multigroup dataanalysis Eur J Oper Res 2014238391ndash403

70Chessel D Hanafi M Analyses de la co-inertie de K nuages depoints Rev Stat Appl 19964435ndash60

71Hanafi M Kohler A Qannari EM Connections betweenmultiple co-inertia analysis and consensus principal compo-nent analysis Chemometrics and Intelligent Laboratory2011106(1)37ndash40

72Carroll JD Generalization of canonical correlation analysis tothree or more sets of variables Proc 76th Convent Am PsychAssoc 19683227ndash8

73Takane Y Hwang H Abdi H Regularized multiple-set canon-ical correlation analysis Psychometrika 200873753ndash75

640 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

74Tenenhaus A Tenenhaus M Regularized generalized canon-ical correlation analysis Psychometrika 201176257ndash84

75van de Velden M On generalized canonical correlation ana-lysis In Proceedings of the 58th World Statistical CongressDublin 2011

76Tenenhaus A Philippe C Guillemot V et al Variable selectionfor generalized canonical correlation analysis Biostatistics201415569ndash83

77Zhao Q Caiafa CF Mandic DP et al Higher order partial leastsquares (HOPLS) a generalized multilinear regressionmethod IEEE Trans Pattern Anal Mach Intell 2013351660ndash73

78Kim H Park H Elden L Non-negative tensor factorizationbased on alternating large-scale non-negativity-constrainedleast squares In Proceedings of the 7th IEEE Symposium onBioinformatics and Bioengineering (BIBE) Dublin 2007 1147ndash51

79Morup M Hansen LK Arnfred SM Algorithms for sparse nonneg-ative Tucker decompositions Neural Comput 2008202112ndash31

80Wang HQ Zheng CH Zhao XM jNMFMA a joint non-negativematrix factorization meta-analysis of transcriptomics dataBioinformatics 201531572ndash80

81Zhang S Liu CC Li W et al Discovery of multi-dimensionalmodules by integrative analysis of cancer genomic dataNucleic Acids Res 2012409379ndash91

82Lv M Zhang X Jia H et al An oncogenic role of miR-142-3p inhuman T-cell acute lymphoblastic leukemia (T-ALL) by tar-geting glucocorticoid receptor-alpha and cAMPPKA path-ways Leukemia 201226769ndash77

83Pulikkan JA Dengler V Peramangalam PS et al Cell-cycleregulator E2F1 and microRNA-223 comprise an autoregula-tory negative feedback loop in acute myeloid leukemia Blood20101151768ndash78

84Smilde AK Kiers HA Bijlsma S et al Matrix correlations forhigh-dimensional data the modified RV-coefficientBioinformatics 200925401ndash5

85Streicher KL Zhu W Lehmann KP et al A novel oncogenicrole for the miRNA-506-514 cluster in initiating melanocytetransformation and promoting melanoma growth Oncogene2012311558ndash70

86Chiaretti S Messina M Tavolaro S et al Gene expressionprofiling identifies a subset of adult T-cell acute lympho-blastic leukemia with myeloid-like gene features and over-expression of miR-223 Haematologica 2010951114ndash21

87Dahlhaus M Roolf C Ruck S et al Expression and prognosticsignificance of hsa-miR-142-3p in acute leukemiasNeoplasma 201360432ndash8

88Eyholzer M Schmid S Schardt JA et al Complexity of miR-223 regulation by CEBPA in human AML Leuk Res201034672ndash6

89Kumar V Palermo R Talora C et al Notch and NF-kB signal-ing pathways regulate miR-223FBXW7 axis in T-cell acutelymphoblastic leukemia Leukemia 2014282324ndash35

90 Josson S Gururajan M Hu P et al miR-409-3p-5p promotestumorigenesis epithelial-to-mesenchymal transition andbone metastasis of human prostate cancer Clin Cancer Res2014204636ndash46

91Alizadeh AA Aranda V Bardelli A et al Toward understand-ing and exploiting tumor heterogeneity Nat Med 201521846ndash53

92Cancer Genome Atlas Research Network Weinstein JNCollisson EA Mills GB et al The Cancer Genome Atlas Pan-Cancer analysis project Nat Genet 2013451113ndash20

93Gholami AM Fellenberg K Cross-species common regulatorynetwork inference without requirement for prior gene affilia-tion Bioinformatics 201026(8)1082ndash90

94Langfelder P Horvath S WGCNA an R package forweighted correlation network analysis BMC Bioinformatics20089559

Dimension reduction techniques | 641

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv108-TF1
  • bbv108-TF2
  • bbv108-TF3
  • bbv108-TF4
  • bbv108-TF5
Page 11: Dimension reduction techniques for the integrative ... · Amin M. Gholami is a bioinformatics scientist in the Division of Vaccine Discovery at La Jolla Institute for Allergy and

indicates higher co-structure. In this example, we observed relatively high RV coefficients between the three data sets, ranging from 0.78 to 0.84. It was recently reported that the RV coefficient is biased toward large data sets, and a modified RV coefficient has been proposed [84].
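For illustration, the RV coefficient between two omics tables measured on the same samples can be computed in a few lines of base R. The function and object names below (rv_coef, X, Y) are ours rather than the article's supplementary code, and the modified RV coefficient of [84] is not implemented here.

rv_coef <- function(X, Y) {
  # Column-center both tables (rows = matched samples)
  X <- scale(as.matrix(X), center = TRUE, scale = FALSE)
  Y <- scale(as.matrix(Y), center = TRUE, scale = FALSE)
  Sx <- tcrossprod(X)   # sample-by-sample configuration of X
  Sy <- tcrossprod(Y)   # sample-by-sample configuration of Y
  # trace(Sx Sy) / sqrt(trace(Sx Sx) * trace(Sy Sy)); values near 1 indicate
  # strongly shared co-structure between the two data sets
  sum(Sx * Sy) / sqrt(sum(Sx * Sx) * sum(Sy * Sy))
}

# Toy example: 50 samples, two tables that share part of their structure
set.seed(1)
X <- matrix(rnorm(50 * 200), nrow = 50)
Y <- X[, 1:100] + matrix(rnorm(50 * 100, sd = 0.5), nrow = 50)
rv_coef(X, Y)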

In this analysis (Figure 2A), cell lines originating from the same anatomical source are projected close to each other and converge in clusters. The first PC separates the leukemia cell lines (positive end of PC1) from the other two groups of cell lines (negative end of PC1), and PC2 separates the melanoma and CNS cell lines. The melanoma cell line LOX-IMVI, which lacks melanogenesis, is projected close to the origin, away from the melanoma cluster. We were surprised to see that the proteomics profile of the leukemia cell line SR was projected closer to the melanoma than to the leukemia cell lines. We therefore examined within-tumor-type correlations to the SR cell line (Figure 2C) and observed that the SR proteomics data had a higher correlation with melanoma than with leukemia cell lines. Given that the mRNA and miRNA profiles of LE_SR are closer to the leukemia cell lines, this suggests that there may have been a technical error in generating the proteomics data on the SR cell line (Figure 2A and C).
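A sanity check of this kind can be sketched in a few lines of R; the object names (prot, tumor_type, "SR") and the assumption that samples are stored as columns are illustrative only, not taken from the article's code.

# Mean correlation of one suspect sample with each tumor type.
# 'mat' is a variables-x-samples matrix; 'groups' gives the tumor type of
# each column, in column order; 'sample_id' is the suspect column name.
check_sample <- function(mat, groups, sample_id) {
  others <- colnames(mat) != sample_id
  cors <- cor(mat[, sample_id], mat[, others], use = "pairwise.complete.obs")
  sort(tapply(as.vector(cors), groups[others], mean), decreasing = TRUE)
}
# check_sample(prot, tumor_type, "SR")
# A higher mean correlation with melanoma than with leukemia cell lines would
# flag the SR proteomics profile as a potential technical outlier.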

MCIA projects all variables into the same space. The variable space (Q1, ..., QK) is visualized in Figure 2B. Variables and samples projected in the same direction are associated, which allows one to select the variables most strongly associated with specific observations from each data set for subsequent analysis. In our previous study [85], we showed that the genes and proteins highly weighted on the melanoma side (positive end of the second dimension) are enriched in melanogenesis functions, and the genes/proteins highly weighted on the leukemia side are highly enriched in T-cell or immune-related functions.
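As a sketch of how such an analysis can be run in practice, the Bioconductor package omicade4 implements MCIA [60]. The calls below follow our reading of that package's documentation rather than the article's supplement; the selectVar thresholds are arbitrary, and argument names should be checked against the installed package version.

# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install("omicade4")
library(omicade4)

data(NCI60_4arrays)                       # four NCI-60 omics tables, variables x samples
mcoin <- mcia(NCI60_4arrays, cia.nf = 2)  # MCIA with two latent dimensions

# Sample space: matched cell lines from all tables projected onto common axes
cancer_type <- sapply(strsplit(colnames(NCI60_4arrays$agilent), "\\."), "[", 1)
plot(mcoin, axes = 1:2, phenovec = cancer_type, sample.lab = FALSE)

# Variable space: variables with extreme coordinates on the first two axes,
# i.e. those projected toward a particular cluster of samples
head(selectVar(mcoin, a1.lim = c(2, Inf), a2.lim = c(-Inf, Inf)))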

We examined the miRNA data to extract the miRNAs with the most extreme weights on the first two dimensions. miR-142 and miR-223, which are active and expressed in leukemia [82, 83, 86-88], had high weights on the positive end of both the first and second axes (close to the leukemia cell lines in the sample space, Figure 2A). miR-142 plays an essential role in T-lymphocyte development, and miR-223 is regulated by the Notch and NF-κB signaling pathways in T-cell acute lymphoblastic leukemia [89].

The miRNA with the strongest association to the CNS cell lines was miR-409, which is reported to promote the epithelial-to-mesenchymal transition in prostate cancer [90]. In the NCI-60 cell line data, the CNS cell lines are characterized by a more pronounced mesenchymal phenotype, which is consistent with high expression of this miRNA. On the positive end of the second axis and the negative end of the first axis (which correspond to the melanoma cell lines in the sample space, Figure 2A), we found miR-509, miR-513 and miR-506 strongly associated with the melanoma cell lines; these miRNAs are reported to initiate melanocyte transformation and promote melanoma growth [85].

Challenges in integrative data analysis

EDA is widely used and well accepted in the analysis of single omics data sets, but there is an increasing need for methods that integrate multi-omics data, particularly in cancer research. Recently, 20 leading scientists were invited to a meeting organized by Nature Medicine, Nature Biotechnology and the Volkswagen Foundation. The meeting identified the need to simultaneously characterize DNA sequence, epigenome, transcriptome, protein, metabolites and infiltrating immune cells in both the tumor and the stroma [91]. The TCGA pan-cancer project plans to comprehensively interrogate multi-omics data across 33 human cancers [92]. These data are biologically complex; in addition to tumor heterogeneity [91], there may be technical issues, batch effects and outliers. EDA approaches for complex multi-omics data are therefore needed.

We describe emerging applications of multivariate approaches to omics data analysis. These are descriptive approaches that do not test a hypothesis or generate a P-value. They are not optimized for variable or biomarker discovery, although the introduction of sparsity in variable loadings may help in the selection of variables for downstream analysis. Few comparisons of the different methods exist, and the number of components and the sparsity level have to be optimized. Cross-validation approaches are potentially useful, but this remains an open area of research.
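One simple, generic heuristic for choosing the number of components (not a procedure prescribed by the article) is to compare the observed PCA eigenvalues with eigenvalues obtained after permuting each variable independently; the base-R sketch below illustrates the idea, and a sparsity level could be tuned analogously by cross-validation.

# Count leading principal components whose variance exceeds what is expected
# when all associations between variables are destroyed by permutation.
n_informative_pcs <- function(X, n_perm = 100, alpha = 0.05) {
  X <- scale(as.matrix(X))                 # center and scale variables
  obs <- prcomp(X)$sdev^2                  # observed component variances
  perm <- replicate(n_perm, {
    Xp <- apply(X, 2, sample)              # permute each column independently
    prcomp(Xp)$sdev^2
  })
  pvals <- rowMeans(perm >= obs)           # permutation p-value per component
  sum(cumprod(pvals < alpha))              # leading components passing the test
}

set.seed(42)
X <- matrix(rnorm(60 * 300), nrow = 60)
X[, 1:30] <- X[, 1:30] + rnorm(60)         # inject one shared latent signal
n_informative_pcs(X, n_perm = 50)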

Another limitation of these methods is that, although the variables may differ among data sets, the observations need to be matchable. Therefore, researchers need to plan the experimental design carefully in the early stage of a study. There is an extension of CIA for the analysis of unmatched samples [93], which combines the Hungarian algorithm with CIA to iteratively pair samples that are similar but not matched. Multi-block and multi-group methods (data sets with matched variables) have been reviewed recently by Tenenhaus and Tenenhaus [69].
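In the matched-observation setting, the practical requirement is simply that every table contains the same samples in the same order. A minimal helper such as the following (names are ours, not the article's) can be applied before running any multi-table method.

# Restrict a named list of variables-x-samples tables to their shared samples,
# in a common column order, before integration.
align_samples <- function(tables) {
  common <- Reduce(intersect, lapply(tables, colnames))
  if (length(common) == 0) stop("No shared sample identifiers across tables")
  lapply(tables, function(x) x[, common, drop = FALSE])
}
# matched <- align_samples(list(mrna = mrna, mirna = mirna, prot = prot))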

The number of variables in genomics data is a challenge to traditional EDA visualization tools, most of which were designed for data sets with fewer variables. Within R, new packages, including ggord, are being developed, and dynamic data visualization is possible using ggvis, plotly, explor and other packages. However, the interpretation of long lists of biological variables (genes, proteins, miRNAs) is challenging. One way to gain more insight into lists of omics variables is to perform a network, gene set enrichment or pathway analysis [94]. An attractive feature of decomposition methods is that variable annotation, such as Gene Ontology or Reactome gene sets, can be projected into the same space to determine a score for each gene set (or pathway) in that space [32, 33, 60].
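A rough sketch of this idea, ours rather than the exact procedure of [32, 33, 60], is to summarize the component loadings of a gene set's member genes so that the set can be placed on the same axes as samples and variables; loadings and gene_sets below are hypothetical inputs.

# Score each gene set on each latent dimension by averaging the loadings of
# its member genes. 'loadings' is a genes-x-components matrix with gene
# identifiers as row names; 'gene_sets' is a named list of identifier vectors.
gene_set_scores <- function(loadings, gene_sets, min_size = 5) {
  scores <- t(sapply(gene_sets, function(gs) {
    idx <- intersect(gs, rownames(loadings))
    if (length(idx) < min_size) return(rep(NA_real_, ncol(loadings)))
    colMeans(loadings[idx, , drop = FALSE])
  }))
  colnames(scores) <- colnames(loadings)
  scores
}
# Example call (hypothetical objects):
# head(gene_set_scores(prcomp(t(expr))$rotation[, 1:2], reactome_sets))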

Conclusion

Dimension reduction methods have a long history, and many similar methods have been developed in parallel by multiple fields. In this review, we provided an overview of dimension reduction techniques that are well established but may be new to the multi-omics data community. We reviewed methods for single-table, two-table and multi-table analysis. There are significant challenges in extracting biologically and clinically actionable results from multi-omics data; however, the field may leverage the varied and rich resource of dimension reduction approaches that other disciplines have developed.

Key Points

• There are many dimension-reduction methods, which can be applied to exploratory data analysis of a single data set or to integrated analysis of a pair or multiple data sets. In addition to exploratory analysis, these can be extended to clustering, supervised and discriminant analysis.

• The goal of dimension reduction is to map data onto a new set of variables so that most of the variance (or information) in the data is explained by a few new (latent) variables.

• Multi-data set methods such as multiple co-inertia analysis (MCIA), multiple factor analysis (MFA) or canonical correlation analysis (CCA) identify correlated structure between data sets with matched observations (samples). Each data set may have different variables (genes, proteins, miRNA, mutations, drug response, etc.).

• MCIA, MFA, CCA and related methods provide a visualization of consensus and incongruence in and between data sets, enabling discovery of potential outliers, batch effects or technical errors.

• Multi-data set methods transform diverse variables from each data set onto the same space and scale, facilitating integrative variable selection, gene set analysis, pathway and other downstream analyses.

R Supplement

R code to re-generate all figures in this article is available as a Supplementary File. Data and code are also available from the GitHub repository https://github.com/aedin/NCI60Example.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer for reading the manuscript and for their comments.

Funding

GGT was supported by the Austrian Ministry of Science, Research and Economics [OMICS Center HSRSM initiative]. AC was supported by Dana-Farber Cancer Institute BCB Research Scientist Developmental Funds, National Cancer Institute 2P50 CA101942-11 and the Assistant Secretary of Defense Health Program through the Breast Cancer Research Program under Award No. W81XWH-15-1-0013. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

References
1. Brazma A, Culhane AC. Algorithms for gene expression analysis. In: Jorde LB, Little PFR, Dunn MJ, Subramaniam S (eds). Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. London: John Wiley & Sons, 2005, 3148-59.
2. Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733-9.
3. Legendre P, Legendre LFJ. Numerical Ecology, 3rd edn. Amsterdam: Elsevier Science, 2012.
4. Biton A, Bernard-Pierrot I, Lou Y, et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep 2014;9:1235-45.
5. Hoadley KA, Yau C, Wolf DM, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 2014;158:929-44.
6. Verhaak RG, Tamayo P, Yang JY, et al. Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest 2013;123:517-25.
7. Senbabaoglu Y, Michailidis G, Li JZ. Critical limitations of consensus clustering in class discovery. Sci Rep 2014;4:6207.
8. Gusenleitner D, Howe EA, Bentink S, et al. iBBiG: iterative binary bi-clustering of gene sets. Bioinformatics 2012;28:2484-92.
9. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Magazine Series 1901;6(2):559-72.
10. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol 1933;24:417.
11. Bland M. An Introduction to Medical Statistics, 3rd edn. Oxford; New York: Oxford University Press, 2000, xvi, 405.
12. Warner RM. Applied Statistics: From Bivariate through Multivariate Techniques, 2nd edn. London: SAGE Publications Inc, 2012, 1004.
13. Dray S, Chessel D, Thioulouse J. Co-inertia analysis and the linking of ecological data tables. Ecology 2003;84:3078-89.
14. Golub GH, Loan CF. Matrix Computations. Baltimore: JHU Press, 2012.
15. Anthony M, Harvey M. Linear Algebra: Concepts and Methods, 1st edn. Cambridge: Cambridge University Press, 2012, 149-60.
16. McShane LM, Cavenagh MM, Lively TG, et al. Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med 2013;11:220.
17. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157-82.
18. Tukey JW. Exploratory Data Analysis. Boston: Addison-Wesley Publishing Company, 1977.
19. Hogben L. Handbook of Linear Algebra, 1st edn. Boca Raton: Chapman & Hall/CRC Press, 2007, 20.
20. Ringner M. What is principal component analysis? Nat Biotechnol 2008;26:303-4.
21. Wall ME, Rechtsteiner A, Rocha LM. Singular value decomposition and principal component analysis. In: Berrar DP, Dubitzky W, Granzow M (eds). A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer, 2003, 91-109.
22. ter Braak CJF, Prentice IC. A theory of gradient analysis. Adv Ecol Res 1988;18:271-313.
23. ter Braak CJF, Smilauer P. CANOCO Reference Manual and User's Guide to Canoco for Windows: Software for Canonical Community Ordination (version 4). Ithaca, NY: Microcomputer Power, http://www.microcomputerpower.com, 1998, 352.
24. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2010;2(4):433-59.
25. Dray S. On the number of principal components: a test of dimensionality based on measurements of similarity between matrices. Comput Stat Data Anal 2008;52:2228-37.
26. Wouters L, Gohlmann HW, Bijnens L, et al. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 2003;59:1131-9.
27. Greenacre M. Correspondence Analysis in Practice, 2nd edn. Boca Raton: Chapman and Hall/CRC, 2007, 201-11.
28. Goodrich JK, Di Rienzi SC, Poole AC, et al. Conducting a microbiome study. Cell 2014;158:250-62.
29. Fasham MJR. A comparison of nonmetric multidimensional scaling, principal components and reciprocal averaging for the ordination of simulated coenoclines and coenoplanes. Ecology 1977;58:551-61.

30. Beh EJ, Lombardo R. Correspondence Analysis: Theory, Practice and New Strategies, 1st edn. Hoboken: John Wiley & Sons, 2014, 120-76.
31. Greenacre MJ. Theory and Applications of Correspondence Analysis, 3rd edn. Waltham, MA: Academic Press, 1993, 83-125.
32. Fagan A, Culhane AC, Higgins DG. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics 2007;7:2162-71.
33. Fellenberg K, Hauser NC, Brors B, et al. Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 2001;98:10781-6.
34. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788-91.
35. Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287-314.
36. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009;10:515-34.
37. Shen H, Huang JZ. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 2008;99:1015-34.
38. Lee S, Epstein MP, Duncan R, et al. Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet Epidemiol 2012;36:293-302.
39. Sill M, Saadati M, Benner A. Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data. Bioinformatics 2015;31:2683-90.
40. Zhao J. Efficient model selection for mixtures of probabilistic PCA via hierarchical BIC. IEEE Trans Cybern 2014;44:1871-83.
41. Zhao Q, Shi X, Huang J, et al. Integrative analysis of '-omics' data using penalty functions. Wiley Interdiscip Rev Comput Stat 2015;7:10.
42. Alter O, Brown PO, Botstein D. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 2003;100:3351-6.
43. Culhane AC, Perriere G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 2003;4:59.
44. Dray S. Analysing a pair of tables: coinertia analysis and duality diagrams. In: Blasius J, Greenacre M (eds). Visualization and Verbalization of Data. Boca Raton: CRC Press, 2014, 289-300.
45. Le Cao KA, Martin PG, Robert-Granie C, et al. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 2009;10:34.
46. ter Braak CJF. Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology 1986;67:1167-79.
47. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321-77.
48. McGarigal K, Landguth E, Stafford S. Multivariate Statistics for Wildlife and Ecology Research. New York: Springer, 2002.
49. Thioulouse J. Simultaneous analysis of a sequence of paired ecological tables: a comparison of several methods. Ann Appl Stat 2011;5:2300-25.
50. Hong S, Chen X, Jin L, et al. Canonical correlation analysis for RNA-seq co-expression networks. Nucleic Acids Res 2013;41:e95.
51. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. London: Academic Press, 1979.
52. Waaijenborg S, Zwinderman AH. Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinformatics 2009;10:315.
53. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009;8:Article 1.
54. Witten DM, Tibshirani RJ. Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009;8:Article 28.
55. Lin D, Zhang J, Li J, et al. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013;14:245.
56. Le Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011;12:253.
57. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform 2007;8:32-44.
58. Doledec S, Chessel D. Co-inertia analysis: an alternative method for studying species-environment relationships. Freshwater Biol 1994;31:277-94.
59. Ponnapalli SP, Saunders MA, Van Loan CF, et al. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. PLoS One 2011;6:e28072.
60. Meng C, Kuster B, Culhane AC, et al. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics 2014;15:162.
61. de Tayrac M, Le S, Aubry M, et al. Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: multiple factor analysis approach. BMC Genomics 2009;10:32.
62. Abdi H, Williams LJ, Valentin D. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics 2012;4:124-67.
63. De Jong S, Kiers HAL. Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems 1992;14:155-64.
64. Benasseni J, Dosse MB. Analyzing multiset data by the power STATIS-ACT method. Adv Data Anal Classification 2012;6:49-65.
65. Escoufier Y. The duality diagram: a means for better practical applications. Devel Num Ecol 1987;14:139-56.
66. Giordani P, Kiers HAL, Ferraro DMA. Three-way component analysis using the R package threeway. J Stat Software 2014;57:7.
67. Leibovici DG. Spatio-temporal multiway decompositions using principal tensor analysis on k-modes: the R package PTAk. J Stat Software 2010;34(10):1-34.
68. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM Rev 2009;51:455-500.
69. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res 2014;238:391-403.
70. Chessel D, Hanafi M. Analyses de la co-inertie de K nuages de points. Rev Stat Appl 1996;44:35-60.
71. Hanafi M, Kohler A, Qannari EM. Connections between multiple co-inertia analysis and consensus principal component analysis. Chemometrics and Intelligent Laboratory Systems 2011;106(1):37-40.
72. Carroll JD. Generalization of canonical correlation analysis to three or more sets of variables. Proc 76th Convent Am Psych Assoc 1968;3:227-8.
73. Takane Y, Hwang H, Abdi H. Regularized multiple-set canonical correlation analysis. Psychometrika 2008;73:753-75.

74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257-84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569-83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660-73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Dublin, 2007, 1147-51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112-31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572-80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379-91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769-77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768-78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401-5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558-70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and over-expression of miR-223. Haematologica 2010;95:1114-21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432-8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672-6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-κB signaling pathways regulate miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324-35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636-46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846-53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113-20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082-90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.

Dimension reduction techniques | 641

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv108-TF1
  • bbv108-TF2
  • bbv108-TF3
  • bbv108-TF4
  • bbv108-TF5
Page 12: Dimension reduction techniques for the integrative ... · Amin M. Gholami is a bioinformatics scientist in the Division of Vaccine Discovery at La Jolla Institute for Allergy and

bull Multi-data set methods such as multiple co-inertiaanalysis (MCIA) multiple factor analysis (MFA) or ca-nonical correlations analysis (CCA) identify correlatedstructure between data sets with matched observa-tions (samples) Each data set may have different vari-ables (genes proteins miRNA mutations drugresponse etc)

bull MCIA MFA CCA and related methods provide a visu-alization of consensus and incongruence in andbetween data sets enabling discovery of potential out-liers batch effects or technical errors

bull Multi-dataset methods transform diverse variablesfrom each data set onto the same space and scalefacilitating integrative variable selection gene setanalysis pathway and downstream analyses

R Supplement

R code to re-generate all figures in this article is available asa Supplementary File Data and code are also available on thegithub repository httpsgithubcomaedinNCI60Example

Supplementary Data

Supplementary data are available online at httpbiboxfordjournalsorg

Acknowledgements

We are grateful to Drs John Platig and Marieke Kuijjer forreading the manuscript and for their comments

Funding

GGT was supported by the Austrian Ministry of ScienceResearch and Economics [OMICS Center HSRSM initiative]AC was supported by Dana-Farber Cancer Institute BCBResearch Scientist Developmental Funds National CancerInstitute 2P50 CA101942-11 and the Assistant Secretary ofDefense Health Program through the Breast CancerResearch Program under Award No W81XWH-15-1-0013Opinions interpretations conclusions and recommenda-tions are those of the author and are not necessarilyendorsed by the Department of Defense

References1 Brazma A Culhane AC Algorithms for gene expression

analysis In Jorde LB Little PFR Dunn MJ Subramaniam S(eds) Encyclopedia of Genetics Genomics Proteomics andBioinformatics London John Wiley amp Sons 2005 3148ndash59

2 Leek JT Scharpf RB Bravo HC et al Tackling the widespreadand critical impact of batch effects in high-throughput dataNat Rev Genet 201011733ndash9

3 Legendre P Legendre LFJ Numerical Ecology AmsterdamElsevier Science 3rd edition 2012

4 Biton A Bernard-Pierrot I Lou Y et al Independent compo-nent analysis uncovers the landscape of the bladder tumortranscriptome and reveals insights into luminal and basalsubtypes Cell Rep 201491235ndash45

5 Hoadley KA Yau C Wolf DM et al Multiplatform analysis of12 cancer types reveals molecular classification within andacross tissues of origin Cell 2014158929ndash44

6 Verhaak RG Tamayo P Yang JY et al Prognostically relevantgene signatures of high-grade serous ovarian carcinomaJ Clin Invest 2013123517ndash25

7 Senbabaoglu Y Michailidis G Li JZ Critical limitations of con-sensus clustering in class discovery Sci Rep 201446207

8 Gusenleitner D Howe EA Bentink S et al iBBiG iterativebinary bi-clustering of gene sets Bioinformatics 2012282484ndash92

9 Pearson K On lines and planes of closest fit to systems ofpoints in space Philos Magazine Series 19016(2)559ndash72

10Hotelling H Analysis of a complex of statistical variables intoprincipal components J Educ Psychol 193324417

11Bland M An Introduction to Medical Statistics 3rd edn OxfordNew York Oxford University Press 2000 xvi 405

12Warner RM Applied Statistics From Bivariate throughMultivariate Techniques London SAGE Publications Inc2nd Edition 2012 1004

13Dray S Chessel D Thioulouse J Co-inertia analysis and thelinking of ecological data tables Ecology 2003843078ndash89

14Golub GH Loan CF Matrix Computations Baltimore JHU Press2012

15Anthony M Harvey M Linear Algebra Concepts and MethodsCambridge Cambridge University Press 1st edition 2012149ndash60

16McShane LM Cavenagh MM Lively TG et al Criteria for theuse of omics-based predictors in clinical trials explanationand elaboration BMC Med 201311220

17Guyon I Elisseeff A An introduction to variable and featureselection J Mach Learn Res 200331157ndash82

18Tukey JW Exploratory Data Analysis Boston Addison-WesleyPublishing Company 1977

19Hogben L Handbook of Linear Algebra Boca Raton Chapman ampHallCRC Press 1st edition 2007 20

20Ringner M What is principal component analysis NatBiotechnol 200826303ndash4

21Wall ME Rechtsteiner A Rocha LM (2003) Singular value de-composition and principal component analysis In A PracticalApproach to Microarray Data Analysis Berrar DP DubitzkyW Granzow M (eds) Norwell MA Kluwer 2003 91ndash109

22 terBraak CJF Prentice IC A theory of gradient analysis AdvEcol Res 198818271ndash313

23 terBraak CJF Smilauer P CANOCO Reference Manual andUserrsquos Guide to Canoco for Windows Software for CanonicalCommunity Ordination (version 4) Microcomputer Power(Ithaca NY USA) httpwwwmicrocomputerpowercom1998352

24Abdi H Williams LJ Principal Component Analysis HobokenJohn Wiley amp SonsInterdisciplinary Reviews ComputationalStatistics 20102(4)433ndash59

25Dray S On the number of principal components a test ofdimensionality based on measurements of similarity be-tween matrices Comput Stat Data Anal 2008522228ndash37

26Wouters L Gohlmann HW Bijnens L et al Graphical explor-ation of gene expression data a comparative study of threemultivariate methods Biometrics 2003591131ndash9

27Greenacre M Correspondence Analysis in Practice Boca RatonChapman and HallCRC 2 edition 2007 201ndash11

28Goodrich JK Rienzi DSC Poole AC et al Conducting a micro-biome study Cell 2014158250ndash62

29Fasham MJR A comparison of nonmetric multidimensionalscaling principal components and reciprocal averaging forthe ordination of simulated coenoclines and coenoplanesEcology 197758551ndash61

Dimension reduction techniques | 639

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

30Beh EJ Lombardo R Correspondence Analysis Theory Practiceand New Strategies Hoboken John Wiley amp Sons 1st edition2014 120ndash76

31Greenacre MJ Theory and Applications of Correspondence AnalysisWaltham MA Academic Press 3rd Edition 1993 83ndash125

32Fagan A Culhane AC Higgins DG A multivariate analysis ap-proach to the integration of proteomic and gene expressiondata Proteomics 200772162ndash71

33Fellenberg K Hauser NC Brors B et al Correspondence ana-lysis applied to microarray data Proc Natl Acad Sci USA20019810781ndash6

34Lee DD Seung HS Learning the parts of objects by non-negative matrix factorization Nature 1999401788ndash91

35Comon P Independent component analysis a new conceptSignal Processing 199436287ndash314

36Witten DM Tibshirani R Hastie T A penalized matrix decom-position with applications to sparse principal componentsand canonical correlation analysis Biostatistics 200910515ndash34

37Shen H Huang JZ Sparse principal component analysis viaregularized low rank matrix approximation J Multivar Anal2008991015ndash34

38Lee S Epstein MP Duncan R et al Sparse principal componentanalysis for identifying ancestry-informative markers in gen-ome-wide association studies Genet Epidemiol 201236293ndash302

39Sill M Saadati M Benner A Applying stability selection toconsistently estimate sparse principal components in high-dimensional molecular data Bioinformatics 2015312683ndash90

40Zhao J Efficient model selection for mixtures of probabilisticPCA via hierarchical BIC IEEE Trans Cybern 2014441871ndash83

41Zhao Q Shi X Huang J et al Integrative analysis of lsquo-omicsrsquodata using penalty functions Wiley Interdiscip Rev Comput Stat2015710

42Alter O Brown PO Botstein D Generalized singular valuedecomposition for comparative analysis of genome-scaleexpression data sets of two different organisms Proc NatlAcad Sci USA 20031003351ndash6

43Culhane AC Perriere G Higgins DG Cross-platform compari-son and visualisation of gene expression data using co-inertia analysis BMC Bioinformatics 2003459

44Dray S Analysing a pair of tables coinertia analysis and dual-ity diagrams In Blasius J Greenacre M (eds) Visualization andverbalization of data Boca Raton CRC Press 2014 289ndash300

45Le Cao KA Martin PG Robert-Granie C et al Sparse canonicalmethods for biological data integration application to across-platform study BMC Bioinformatics 20091034

46Braak CJF Canonical correspondence analysis a new eigen-vector technique for multivariate direct gradient analysisEcology 1986671167ndash79

47Hotelling H Relations between two sets of variates Biometrika193628321ndash77

48McGarigal K Landguth E Stafford S Multivariate Statistics forWildlife and Ecology Research New York Springer 2002

49Thioulouse J Simultaneous analysis of a sequence of pairedecological tables a comparison of several methods Ann ApplStat 201152300ndash25

50Hong S Chen X Jin L et al Canonical correlation analysisfor RNA-seq co-expression networks Nucleic Acids Res201341e95

51Mardia KV Kent JT Bibby JM Multivariate Analysis LondonNew York NY Toronto Sydney San Francisco CA AcademicPress 1979

52Waaijenborg S Zwinderman AH Sparse canonical correl-ation analysis for identifying connecting and completinggene-expression networks BMC Bioinformatics 200910315

53Parkhomenko E Tritchler D Beyene J Sparse canonical cor-relation analysis with application to genomic data integra-tion Stat Appl Genet Mol Biol 20098article 1

54Witten DM Tibshirani RJ Extensions of sparse canonical cor-relation analysis with applications to genomic data Stat ApplGenet Mol Biol 20098article 28

55Lin D Zhang J Li J et al Group sparse canonical correlationanalysis for genomic data integration BMC Bioinformatics201314245

56Le Cao KA Boitard S Besse P Sparse PLS discriminantanalysis biologically relevant feature selection and graph-ical displays for multiclass problems BMC Bioinform 201112253

57Boulesteix AL Strimmer K Partial least squares a versatiletool for the analysis of high-dimensional genomic data BriefBioinform 2007832ndash44

58Doledec S Chessel D Co-inertia analysis an alternativemethod for studying speciesndashenvironment relationshipsFreshwater Biol 199431277ndash94

59Ponnapalli SP Saunders MA Van Loan CF et al A higher-order generalized singular value decomposition for compari-son of global mRNA expression from multiple organismsPLoS One 20116e28072

60Meng C Kuster B Culhane AC et al A multivariate approachto the integration of multi-omics datasets BMC Bioinformatics201415162

61de Tayrac M Le S Aubry M et al Simultaneous analysis ofdistinct Omics data sets with integration of biological know-ledge multiple factor analysis approach BMC Genomics20091032

62Abdi H Williams LJ Valentin D Statis and Distatis OptimumMultitable Principal Component Analysis and Three Way MetricMultidimensional Scaling Wiley Interdisciplinary 20124124ndash67

63De Jong S Kiers HAL Principal covariates regression Part ITheory Chemometrics and Intelligent Laboratory Systems199214155ndash64

64Benasseni J Dosse MB Analyzing multiset data by the powerSTATIS-ACT method Adv Data Anal Classification 2012649ndash65

65Escoufier Y The duality diagram a means for better practicalapplications Devel Num Ecol 198714139ndash56

66Giordani P Kiers HAL Ferraro DMA Three-way componentanalysis using the R package threeway J Stat Software2014577

67Leibovici DG Spatio-temporal multiway decompositionsusing principal tensor analysis on k-modes the R packagePTAk J Stat Software 201034(10)1ndash34

68Kolda TG Bader BW Tensor decompositions and applica-tions SIAM Rev 200951455ndash500

69Tenenhaus A Tenenhaus M Regularized generalized canon-ical correlation analysis for multiblock or multigroup dataanalysis Eur J Oper Res 2014238391ndash403

70Chessel D Hanafi M Analyses de la co-inertie de K nuages depoints Rev Stat Appl 19964435ndash60

71Hanafi M Kohler A Qannari EM Connections betweenmultiple co-inertia analysis and consensus principal compo-nent analysis Chemometrics and Intelligent Laboratory2011106(1)37ndash40

72Carroll JD Generalization of canonical correlation analysis tothree or more sets of variables Proc 76th Convent Am PsychAssoc 19683227ndash8

73Takane Y Hwang H Abdi H Regularized multiple-set canon-ical correlation analysis Psychometrika 200873753ndash75

640 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

74Tenenhaus A Tenenhaus M Regularized generalized canon-ical correlation analysis Psychometrika 201176257ndash84

75van de Velden M On generalized canonical correlation ana-lysis In Proceedings of the 58th World Statistical CongressDublin 2011

76Tenenhaus A Philippe C Guillemot V et al Variable selectionfor generalized canonical correlation analysis Biostatistics201415569ndash83

77Zhao Q Caiafa CF Mandic DP et al Higher order partial leastsquares (HOPLS) a generalized multilinear regressionmethod IEEE Trans Pattern Anal Mach Intell 2013351660ndash73

78Kim H Park H Elden L Non-negative tensor factorizationbased on alternating large-scale non-negativity-constrainedleast squares In Proceedings of the 7th IEEE Symposium onBioinformatics and Bioengineering (BIBE) Dublin 2007 1147ndash51

79Morup M Hansen LK Arnfred SM Algorithms for sparse nonneg-ative Tucker decompositions Neural Comput 2008202112ndash31

80Wang HQ Zheng CH Zhao XM jNMFMA a joint non-negativematrix factorization meta-analysis of transcriptomics dataBioinformatics 201531572ndash80

81Zhang S Liu CC Li W et al Discovery of multi-dimensionalmodules by integrative analysis of cancer genomic dataNucleic Acids Res 2012409379ndash91

82Lv M Zhang X Jia H et al An oncogenic role of miR-142-3p inhuman T-cell acute lymphoblastic leukemia (T-ALL) by tar-geting glucocorticoid receptor-alpha and cAMPPKA path-ways Leukemia 201226769ndash77

83Pulikkan JA Dengler V Peramangalam PS et al Cell-cycleregulator E2F1 and microRNA-223 comprise an autoregula-tory negative feedback loop in acute myeloid leukemia Blood20101151768ndash78

84Smilde AK Kiers HA Bijlsma S et al Matrix correlations forhigh-dimensional data the modified RV-coefficientBioinformatics 200925401ndash5

85Streicher KL Zhu W Lehmann KP et al A novel oncogenicrole for the miRNA-506-514 cluster in initiating melanocytetransformation and promoting melanoma growth Oncogene2012311558ndash70

86Chiaretti S Messina M Tavolaro S et al Gene expressionprofiling identifies a subset of adult T-cell acute lympho-blastic leukemia with myeloid-like gene features and over-expression of miR-223 Haematologica 2010951114ndash21

87Dahlhaus M Roolf C Ruck S et al Expression and prognosticsignificance of hsa-miR-142-3p in acute leukemiasNeoplasma 201360432ndash8

88Eyholzer M Schmid S Schardt JA et al Complexity of miR-223 regulation by CEBPA in human AML Leuk Res201034672ndash6

89Kumar V Palermo R Talora C et al Notch and NF-kB signal-ing pathways regulate miR-223FBXW7 axis in T-cell acutelymphoblastic leukemia Leukemia 2014282324ndash35

90 Josson S Gururajan M Hu P et al miR-409-3p-5p promotestumorigenesis epithelial-to-mesenchymal transition andbone metastasis of human prostate cancer Clin Cancer Res2014204636ndash46

91Alizadeh AA Aranda V Bardelli A et al Toward understand-ing and exploiting tumor heterogeneity Nat Med 201521846ndash53

92Cancer Genome Atlas Research Network Weinstein JNCollisson EA Mills GB et al The Cancer Genome Atlas Pan-Cancer analysis project Nat Genet 2013451113ndash20

93Gholami AM Fellenberg K Cross-species common regulatorynetwork inference without requirement for prior gene affilia-tion Bioinformatics 201026(8)1082ndash90

94Langfelder P Horvath S WGCNA an R package forweighted correlation network analysis BMC Bioinformatics20089559

Dimension reduction techniques | 641

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

  • bbv108-TF1
  • bbv108-TF2
  • bbv108-TF3
  • bbv108-TF4
  • bbv108-TF5
Page 13: Dimension reduction techniques for the integrative ... · Amin M. Gholami is a bioinformatics scientist in the Division of Vaccine Discovery at La Jolla Institute for Allergy and

30Beh EJ Lombardo R Correspondence Analysis Theory Practiceand New Strategies Hoboken John Wiley amp Sons 1st edition2014 120ndash76

31Greenacre MJ Theory and Applications of Correspondence AnalysisWaltham MA Academic Press 3rd Edition 1993 83ndash125

32Fagan A Culhane AC Higgins DG A multivariate analysis ap-proach to the integration of proteomic and gene expressiondata Proteomics 200772162ndash71

33Fellenberg K Hauser NC Brors B et al Correspondence ana-lysis applied to microarray data Proc Natl Acad Sci USA20019810781ndash6

34Lee DD Seung HS Learning the parts of objects by non-negative matrix factorization Nature 1999401788ndash91

35Comon P Independent component analysis a new conceptSignal Processing 199436287ndash314

36Witten DM Tibshirani R Hastie T A penalized matrix decom-position with applications to sparse principal componentsand canonical correlation analysis Biostatistics 200910515ndash34

37Shen H Huang JZ Sparse principal component analysis viaregularized low rank matrix approximation J Multivar Anal2008991015ndash34

38Lee S Epstein MP Duncan R et al Sparse principal componentanalysis for identifying ancestry-informative markers in gen-ome-wide association studies Genet Epidemiol 201236293ndash302

39Sill M Saadati M Benner A Applying stability selection toconsistently estimate sparse principal components in high-dimensional molecular data Bioinformatics 2015312683ndash90

40Zhao J Efficient model selection for mixtures of probabilisticPCA via hierarchical BIC IEEE Trans Cybern 2014441871ndash83

41Zhao Q Shi X Huang J et al Integrative analysis of lsquo-omicsrsquodata using penalty functions Wiley Interdiscip Rev Comput Stat2015710

42Alter O Brown PO Botstein D Generalized singular valuedecomposition for comparative analysis of genome-scaleexpression data sets of two different organisms Proc NatlAcad Sci USA 20031003351ndash6

43Culhane AC Perriere G Higgins DG Cross-platform compari-son and visualisation of gene expression data using co-inertia analysis BMC Bioinformatics 2003459

44Dray S Analysing a pair of tables coinertia analysis and dual-ity diagrams In Blasius J Greenacre M (eds) Visualization andverbalization of data Boca Raton CRC Press 2014 289ndash300

45Le Cao KA Martin PG Robert-Granie C et al Sparse canonicalmethods for biological data integration application to across-platform study BMC Bioinformatics 20091034

46Braak CJF Canonical correspondence analysis a new eigen-vector technique for multivariate direct gradient analysisEcology 1986671167ndash79

47Hotelling H Relations between two sets of variates Biometrika193628321ndash77

48McGarigal K Landguth E Stafford S Multivariate Statistics forWildlife and Ecology Research New York Springer 2002

49Thioulouse J Simultaneous analysis of a sequence of pairedecological tables a comparison of several methods Ann ApplStat 201152300ndash25

50Hong S Chen X Jin L et al Canonical correlation analysisfor RNA-seq co-expression networks Nucleic Acids Res201341e95

51Mardia KV Kent JT Bibby JM Multivariate Analysis LondonNew York NY Toronto Sydney San Francisco CA AcademicPress 1979

52Waaijenborg S Zwinderman AH Sparse canonical correl-ation analysis for identifying connecting and completinggene-expression networks BMC Bioinformatics 200910315

53Parkhomenko E Tritchler D Beyene J Sparse canonical cor-relation analysis with application to genomic data integra-tion Stat Appl Genet Mol Biol 20098article 1

54Witten DM Tibshirani RJ Extensions of sparse canonical cor-relation analysis with applications to genomic data Stat ApplGenet Mol Biol 20098article 28

55Lin D Zhang J Li J et al Group sparse canonical correlationanalysis for genomic data integration BMC Bioinformatics201314245

56Le Cao KA Boitard S Besse P Sparse PLS discriminantanalysis biologically relevant feature selection and graph-ical displays for multiclass problems BMC Bioinform 201112253

57Boulesteix AL Strimmer K Partial least squares a versatiletool for the analysis of high-dimensional genomic data BriefBioinform 2007832ndash44

58Doledec S Chessel D Co-inertia analysis an alternativemethod for studying speciesndashenvironment relationshipsFreshwater Biol 199431277ndash94

59Ponnapalli SP Saunders MA Van Loan CF et al A higher-order generalized singular value decomposition for compari-son of global mRNA expression from multiple organismsPLoS One 20116e28072

60Meng C Kuster B Culhane AC et al A multivariate approachto the integration of multi-omics datasets BMC Bioinformatics201415162

61de Tayrac M Le S Aubry M et al Simultaneous analysis ofdistinct Omics data sets with integration of biological know-ledge multiple factor analysis approach BMC Genomics20091032

62Abdi H Williams LJ Valentin D Statis and Distatis OptimumMultitable Principal Component Analysis and Three Way MetricMultidimensional Scaling Wiley Interdisciplinary 20124124ndash67

63De Jong S Kiers HAL Principal covariates regression Part ITheory Chemometrics and Intelligent Laboratory Systems199214155ndash64

64Benasseni J Dosse MB Analyzing multiset data by the powerSTATIS-ACT method Adv Data Anal Classification 2012649ndash65

65Escoufier Y The duality diagram a means for better practicalapplications Devel Num Ecol 198714139ndash56

66Giordani P Kiers HAL Ferraro DMA Three-way componentanalysis using the R package threeway J Stat Software2014577

67Leibovici DG Spatio-temporal multiway decompositionsusing principal tensor analysis on k-modes the R packagePTAk J Stat Software 201034(10)1ndash34

68Kolda TG Bader BW Tensor decompositions and applica-tions SIAM Rev 200951455ndash500

69Tenenhaus A Tenenhaus M Regularized generalized canon-ical correlation analysis for multiblock or multigroup dataanalysis Eur J Oper Res 2014238391ndash403

70Chessel D Hanafi M Analyses de la co-inertie de K nuages depoints Rev Stat Appl 19964435ndash60

71Hanafi M Kohler A Qannari EM Connections betweenmultiple co-inertia analysis and consensus principal compo-nent analysis Chemometrics and Intelligent Laboratory2011106(1)37ndash40

72Carroll JD Generalization of canonical correlation analysis tothree or more sets of variables Proc 76th Convent Am PsychAssoc 19683227ndash8

73Takane Y Hwang H Abdi H Regularized multiple-set canon-ical correlation analysis Psychometrika 200873753ndash75

640 | Meng et al

at Baruj B

enacerraf Library on Septem

ber 29 2016httpbiboxfordjournalsorg

Dow

nloaded from

74. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika 2011;76:257–84.
75. van de Velden M. On generalized canonical correlation analysis. In: Proceedings of the 58th World Statistical Congress, Dublin, 2011.
76. Tenenhaus A, Philippe C, Guillemot V, et al. Variable selection for generalized canonical correlation analysis. Biostatistics 2014;15:569–83.
77. Zhao Q, Caiafa CF, Mandic DP, et al. Higher order partial least squares (HOPLS): a generalized multilinear regression method. IEEE Trans Pattern Anal Mach Intell 2013;35:1660–73.
78. Kim H, Park H, Elden L. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In: Proceedings of the 7th IEEE Symposium on Bioinformatics and Bioengineering (BIBE), Dublin, 2007, 1147–51.
79. Morup M, Hansen LK, Arnfred SM. Algorithms for sparse nonnegative Tucker decompositions. Neural Comput 2008;20:2112–31.
80. Wang HQ, Zheng CH, Zhao XM. jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data. Bioinformatics 2015;31:572–80.
81. Zhang S, Liu CC, Li W, et al. Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 2012;40:9379–91.
82. Lv M, Zhang X, Jia H, et al. An oncogenic role of miR-142-3p in human T-cell acute lymphoblastic leukemia (T-ALL) by targeting glucocorticoid receptor-alpha and cAMP/PKA pathways. Leukemia 2012;26:769–77.
83. Pulikkan JA, Dengler V, Peramangalam PS, et al. Cell-cycle regulator E2F1 and microRNA-223 comprise an autoregulatory negative feedback loop in acute myeloid leukemia. Blood 2010;115:1768–78.
84. Smilde AK, Kiers HA, Bijlsma S, et al. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009;25:401–5.
85. Streicher KL, Zhu W, Lehmann KP, et al. A novel oncogenic role for the miRNA-506-514 cluster in initiating melanocyte transformation and promoting melanoma growth. Oncogene 2012;31:1558–70.
86. Chiaretti S, Messina M, Tavolaro S, et al. Gene expression profiling identifies a subset of adult T-cell acute lymphoblastic leukemia with myeloid-like gene features and over-expression of miR-223. Haematologica 2010;95:1114–21.
87. Dahlhaus M, Roolf C, Ruck S, et al. Expression and prognostic significance of hsa-miR-142-3p in acute leukemias. Neoplasma 2013;60:432–8.
88. Eyholzer M, Schmid S, Schardt JA, et al. Complexity of miR-223 regulation by CEBPA in human AML. Leuk Res 2010;34:672–6.
89. Kumar V, Palermo R, Talora C, et al. Notch and NF-κB signaling pathways regulate miR-223/FBXW7 axis in T-cell acute lymphoblastic leukemia. Leukemia 2014;28:2324–35.
90. Josson S, Gururajan M, Hu P, et al. miR-409-3p/-5p promotes tumorigenesis, epithelial-to-mesenchymal transition and bone metastasis of human prostate cancer. Clin Cancer Res 2014;20:4636–46.
91. Alizadeh AA, Aranda V, Bardelli A, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med 2015;21:846–53.
92. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013;45:1113–20.
93. Gholami AM, Fellenberg K. Cross-species common regulatory network inference without requirement for prior gene affiliation. Bioinformatics 2010;26(8):1082–90.
94. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559.

