
Similarity Network Fusion for aggregating data types

on a genomic scale

Supplementary Information

Bo Wang 1,5, Aziz Mezlini 1,2, Feyyaz Demir 1,2, Marc Fiume 2, Zhuowen Tu 3, Michael Brudno 1,2, Benjamin Haibe-Kains 4,6, Anna Goldenberg 1,2

1 SickKids Research Institute, Toronto, Ontario, Canada
2 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
3 Department of Cognitive Science, University of California San Diego, San Diego, California, USA
4 Institut de Recherches Cliniques de Montréal, Université de Montréal, Montréal, Quebec, Canada
5 Department of Computer Science, Stanford University, Stanford, California, USA
6 Ontario Cancer Institute, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada

Nature Methods: doi:10.1038/nmeth.2810


Contents

1 Supplementary Figures

2 Supplementary Tables

3 Supplementary Results
  3.1 Simulations
  3.2 Clinical and biological assessment of the identified subtypes in GBM
  3.3 More Discussion for METABRIC Dataset

4 Supplementary Note 1
  4.1 Stopping Criteria
  4.2 Parameter Selection
  4.3 Model Selection
  4.4 Predicting the Subtype of any New Patient
  4.5 Network-based Feature Selection Approach

5 Supplementary Note 2
  5.1 Evaluation Metrics
  5.2 Gene Ranking by Normalization

6 Supplementary Note 3
  6.1 Alternative approaches
    6.1.1 Concatenation
    6.1.2 iCluster
    6.1.3 Multiple Kernel Learning
    6.1.4 Patient Specific Data Fusion


1 Supplementary Figures

List of Figures

1 Simulation 1: each data type contains different sets of mislabeled points near the cluster boundaries.
2 Simulation 2: each data type has a different noise distribution.
3 Comparative performance of different methods for Simulation 2.
4 Accuracy of predicting a 'new' patient's subtype.
5 Convergence analysis of SNF for Simulation 1.
6 Convergence analysis of SNF for Simulation 2.
7 Convergence analysis of SNF for GBM data.
8 SNF hyper-parameter sensitivity test on Simulation 1.
9 Sensitivity test for SNF on Simulation 2.
10 Sensitivity test for SNF on GBM data.
11 Clinical and biological analysis of the three identified GBM subtypes.
12 Survival analysis of the Temozolomide treatment in each subtype.
13 Importance of each data type in fusing patient data using SNF in GBM.
14 Heatmap of genes significantly differential among GBM subtypes.
15 Patient subtypes for each of the five cancers: GBM, BIC, KRCCC, LSCC and COAD.
16 METABRIC: Kaplan-Meier survival curves corresponding to 5 clusters defined by SNF with respect to disease-free survival (DFS).
17 METABRIC: Kaplan-Meier survival curves corresponding to 5 clusters defined by SNF with respect to overall survival (OS).
18 METABRIC: Kaplan-Meier survival curves corresponding to 10 clusters defined by SNF with respect to disease-free survival (DFS).
19 METABRIC: Kaplan-Meier survival curves corresponding to 10 clusters defined by SNF with respect to overall survival (OS).
20 Now and the future: subtype-based vs. network-based outcome prediction.


Figure 1: A simulation study that shows how our Similarity Network Fusion exploits the complementary information between multiple data types. a) The framework of our simulation study. First, we generate two clusters that are linearly separable. Then, to generate the first data type, we mislabel a certain percentage of data points that are close to the boundary of one cluster to have the label of the other cluster. The second data type is generated by mislabeling points from the second cluster to have the label of the first. This is a common source of mistakes at the cluster boundaries. We show two patient similarity matrices corresponding to each of the data types and our fused matrix obtained using SNF. b) Normalized Mutual Information (NMI) (see Supplementary Section 5) between cluster labels obtained by spectral clustering of the SNF-fused similarity matrix and the simulated ground truth, plotted as a function of the proportion of mislabeled data points. Higher NMI corresponds to higher concordance between the obtained clusters and the ground truth. We compare our method (SNF + spectral clustering) with iCluster, concatenation + hierarchical clustering, Multiple Kernel Learning (see Section 6.1.3) and Patient Specific Data Fusion (see Section 6.1.4). SNF performs significantly better than all the other methods.


Figure 2: A simulation study that tests the power of our method on noisy data. There are two simulated data types, created by perturbing the ground truth (shown in the middle of the first column) with different types of noise. Data type 1 has added N(0, σ) noise (top panel of column 1). The bottom panel of column 1 shows data type 2 with added noise following the single-parameter Gamma distribution with rate λ: f(x; λ) = λe^{−λx} I(x), where I(x) = 1 if x is non-negative and 0 otherwise. We show the similarity heatmaps for each data type for the σ = 2 and λ = 2 setting in column 2. The network fused by SNF is shown in column 3. The initial two matrices contain substantial between-class similarity due to noise; in the fused similarity matrix the within-class similarity is strengthened and the between-class noise is substantially reduced.


Figure 3: Comparison of the performance of SNF, iCluster, concatenation, MKL and PSDF as a function of Gaussian and Gamma noise in the two different data types for the Simulation 2 scenario captured in Figure 2. We vary σ and λ, and perform spectral clustering to obtain the final clustering. Normalized Mutual Information (NMI) is used to compare the concordance of the obtained partitions and the ground truth (Figure 2), where 1 means complete concordance and 0 means the clusters are completely different. a) NMI values when fixing λ = 2. b) NMI values when fixing σ = 2. Our SNF is more robust with respect to noise than the alternative approaches.


Figure 4: Accuracy of predicting a 'new' patient's subtype. We leave each patient out in turn and report the average accuracy of patient subtype classification over all patients for each cancer. The error bars represent 1 standard deviation.


Figure 5: Convergence performance of SNF in Simulation 1 when 20% of points are mislabeled. a) The relative error between consecutive iterations during network fusion, ‖W_{t+1} − W_t‖ / ‖W_t‖, where W_t is the output of SNF after t steps. b) The NMI (y-axis) vs. SNF iteration number (x-axis).


Figure 6: Convergence performance of SNF in Simulation 2 for the λ = 2 and σ = 2 setting. a) The relative error between consecutive iterations during network fusion, ‖W_{t+1} − W_t‖ / ‖W_t‖. b) The NMI vs. the number of SNF iterations.


Figure 7: Convergence performance of SNF on GBM data. a) The relative error between consecutive iterations during network fusion, ‖W_{t+1} − W_t‖ / ‖W_t‖. b) Concordance of the SNF-found clusters with the ground truth, as measured by p-values in the survival analysis, vs. the number of SNF iterations.


Figure 8: Sensitivity test of the hyper-parameters of SNF in Simulation 1 when 20% of points are mislabeled. a) Fixing η = 0.5 and varying K; NMI is recorded. b) Fixing K = 20 and varying η; NMI is recorded. SNF is robust with respect to the hyper-parameters.


Figure 9: Sensitivity test of the hyper-parameters of SNF in Simulation 2. a) NMI for fixed η = 0.5 and various settings of K. b) NMI for fixed K = 20 while varying η. SNF is robust with respect to the different hyper-parameter settings.


Figure 10: Sensitivity test of the SNF hyper-parameters in the GBM survival analysis. −log10(P) is recorded for different choices of the hyper-parameters K and η, where P is the p-value from the log-rank test. a) −log10(P) as a function of K for η = 0.5. b) −log10(P) as a function of η while keeping K = 20. SNF performance starts declining when the number of neighbors (K) is too high, indicating that sparser networks are more robust.


Figure 11: (a) Kaplan-Meier survival curves for the three GBM subtypes identified from the fused data (overall Cox log-rank p-value for 3 subtypes is 2 × 10−4). (b) Age distribution in the three GBM subtypes (same legend as Figure 3a) represented by a box plot (bars represent the median and the lower and upper quartiles; whiskers capture the max and min ages in each subtype); three-way two-sided Welch's t-test p-value = 3 × 10−5. (c) Survival analysis of the Temozolomide treatment in Subtype 1 (HR = 0.278, p = 0.001). (d) Genetic variant signatures in terms of mutations (IDH1, EGFR, PTEN), chromosomal aberrations and CNVs (EGFR, CDKN2A, PTEN, RB1, CDK6) that are significantly different between the three identified subtypes. (e) Representative genes whose mRNA expression (ge) and DNA methylation (me) profiles are significantly different between Subtype 1 (APBA2, GLI2) and Subtype 2 (CTSD and S100A4). We also indicate copy number variation (cn) for those genes that had significant CNV differences between subtypes (CNV data was not used to identify subtypes).


[Figure 12 panels: three Kaplan-Meier plots, one per subtype; x-axis: survival time (months); y-axis: survival probability; legend: Subtype 1, Subtype 2, Subtype 3.]

Figure 12: Survival analysis of the Temozolomide treatment. We performed a log-rank test for each subtype with regard to the use of the Temozolomide treatment. For Subtype 1, there is a significant difference with the use of Temozolomide (p = 0.001, HR = 0.278), while for Subtypes 2 and 3 the difference is not significant (Subtype 2: p = 0.210, HR = 0.700; Subtype 3: p = 0.996, HR = 0.997).


Figure 13: Importance of each data type in fusing patient data using SNF in GBM. (a)-(c) Patient similarity distributions in the similarity networks for each of the available data types: (a) gene expression, (b) DNA methylation and (c) miRNA expression, respectively. (d) Bar plot illustrating the relative number of important features from each of the data types across the whole GBM cohort. (e) Relative contribution of each data type, or combination of data types, to explaining the connectivity within each of the three subtypes. The colors correspond to the combined network in Figure 2d in the main text, where each edge's color corresponds to the type or types of data that support it. An edge is considered supported by a single data type if the weight (patient similarity) in that data type's network is more than 10% higher than the similarity of the same edge in the other data types' networks. If the difference between the two highest edge weights from the corresponding data types is less than 10%, the edge is considered to be supported by those two data types, and if the differences between the edge weights across all data types are less than 10%, it is considered to be supported by all data types. This figure illustrates that 1) SNF's fused network is driven by all data types to different extents in different parts of the network; 2) no single data type is driving the fusion; and 3) the majority of the patient similarities in the final network are supported by two or more data types.


Figure 14: Heatmap of features significantly differential among GBM subtypes. Top panel: mRNA expression; middle panel: DNA methylation; bottom panel: miRNA expression. These features are selected as follows: we construct a network using each feature independently and then perform spectral clustering on this network. Normalized Mutual Information (NMI) is calculated between the resulting cluster labels and the SNF-found subtypes. We then order all the NMI values and show the features ranked in the top 1% of all NMI values across all feature types. DNA methylation provides the clearest signal separating Subtype 3 from the rest of the data, confirming previous findings reported in the literature. Interestingly, Subtype 3 has a very different miRNA profile as well. The defining feature of Subtype 3 is the IDH1 mutation, which was not used to cluster patients but results in a very clear pattern in the measurements that were collected, allowing this subtype to be detected. In particular, the signatures in the DNA methylation and miRNA expression data for Subtype 3 are completely distinct from the other subtypes.


Figure 15: (a) Patients colored by their cluster label, plotted using the first three principal components, and (b) Kaplan-Meier survival curves (p-values are recorded in Table 2). The legends on the survival curves correspond to both (a) and (b).


[Figure 16 panels: two Kaplan-Meier plots; x-axis: time (months); y-axis: survival probability; P = 6.10 × 10−11 (left), P = 5.12 × 10−13 (right).]

Figure 16: METABRIC: Kaplan-Meier survival curves corresponding to 5 clusters defined by SNF with respect to disease-free survival (DFS) on both the discovery set (left panel) and the validation set (right panel).


[Figure 17 panels: two Kaplan-Meier plots; x-axis: time (months); y-axis: probability of overall survival; P = 5.28 × 10−10 (left), P = 8.42 × 10−10 (right).]

Figure 17: METABRIC: Kaplan-Meier survival curves corresponding to 5 clusters defined by SNF with respect to overall survival (OS) on both the discovery set (left panel) and the validation set (right panel).


[Figure 18 panels: two Kaplan-Meier plots; x-axis: time (months); y-axis: survival probability; P = 3.31 × 10−12 (left), P = 7.86 × 10−12 (right).]

Figure 18: METABRIC: Kaplan-Meier survival curves corresponding to 10 clusters defined by SNF with respect to disease-free survival (DFS) on both the discovery set (left panel) and the validation set (right panel).


[Figure 19 panels: two Kaplan-Meier plots; x-axis: time (months); y-axis: probability of overall survival; P = 3.64 × 10−09 (left), P = 5.62 × 10−08 (right).]

Figure 19: METABRIC: Kaplan-Meier survival curves corresponding to 10 clusters defined by SNF with respect to overall survival (OS) on both the discovery set (left panel) and the validation set (right panel).


Figure 20: (Left panel) The current approach to determining patient outcomes is based on classifying a patient into one of the pre-determined subtypes (clusters). A new patient's probability of a particular clinical outcome is then expected to be consistent with the patients already observed to have the given subtype (e.g., survival is predicted to follow the Kaplan-Meier survival curve associated with that cluster). (Right panel) Our proposed future representation of diseases: networks of patients. Here the whole network is used to determine new patients' outcomes. For example, if a patient is not a clear fit within any of the pre-defined subgroups, this approach can still yield an accurate answer by predicting an outcome similar to that of patients with matching biological profiles, regardless of their pre-determined subtype.


2 Supplementary Tables

List of Tables

1 A list of cancer dataset collections used in our study.


Table 1: The number of patients used in our experiments; the number of genes in the mRNA expression, DNA methylation and miRNA expression datasets, respectively; and the number of healthy individuals (controls) for the mRNA expression and DNA methylation profiles.

Cancer   Patients   mRNA         DNA           miRNA        mRNA         Methylation
Type     Number     Expression   Methylation   Expression   (Controls)   (Controls)
GBM      215        12,042        1,491          534         10            -
BIC      105        17,814       23,094         1,046        63           27
KRCCC    124        20,532       24,976         1,046        68          199
LSCC     105        12,042       27,578         1,046         -           27
COAD      92        17,814       27,578           705        19           37


3 Supplementary Results

3.1 Simulations

We generated two types of simulations to test the properties of our SNF method. Each simulation consists of two data types capturing different information about 2 clusters with 100 points in each cluster. The first simulation tests the ability to handle complementary information. We generated two data types, each capturing correct information about the majority of the points except for a few mislabelled ones (randomly, near the cluster boundaries) in only one of the two available data types. Each data type thus complements the other by providing correct information for the few points that are mislabeled. We illustrate this simulation scenario and compare SNF to other methods using normalized mutual information as a function of the number of mislabeled (swapped) points (Supplementary Figure 1). SNF clearly outperforms the other competitors. iCluster and PSDF perform similarly to each other, and concatenation performs better than MKL while the proportion of mislabeled points is small. SNF converges to its final solution in under 10 iterations (Supplementary Figure 5). Also, SNF performance is not sensitive to different settings of the hyper-parameters (Supplementary Figure 8). Each parameter setting reported shows an average over 50 trials.
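This first simulation setup can be sketched roughly as follows. This is a minimal 1-D illustration with hypothetical cluster centres, offsets and dimensions; the actual simulation used its own parameters, so treat everything here as an assumption made for illustration.

```python
import numpy as np

def simulate_complementary(n_per_cluster=100, frac_mislabeled=0.2, seed=0):
    """Two data types with complementary boundary mistakes (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Ground truth: two linearly separable clusters (1-D for clarity).
    x = np.concatenate([rng.normal(-2.0, 1.0, n_per_cluster),
                        rng.normal(+2.0, 1.0, n_per_cluster)])
    truth = np.repeat([0, 1], n_per_cluster)
    n_flip = int(frac_mislabeled * n_per_cluster)

    # Data type 1: the cluster-0 points closest to the boundary (x = 0)
    # are pushed across it, so they look like cluster-1 points.
    type1 = x.copy()
    near_boundary = np.argsort(-x[:n_per_cluster])[:n_flip]
    type1[near_boundary] += 4.0

    # Data type 2: the symmetric mistake, made only in cluster 1.
    type2 = x.copy()
    near_boundary = n_per_cluster + np.argsort(x[n_per_cluster:])[:n_flip]
    type2[near_boundary] -= 4.0
    return type1, type2, truth
```

Each data type is correct for the points the other corrupts, which is exactly the complementarity that the fusion step is meant to exploit.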

In the second simulation, we aim to test the robustness of the SNF method to different types of noise. We start with 'ground-truth' data in which there are two distinct groups of points. To generate the noisy data from the ground truth, we add Gaussian noise to get data type 1 and single-parameter Gamma noise to get data type 2. Thus each data type is a perturbation of the ground truth. SNF is able to substantially reduce inter-cluster noise while boosting intra-cluster similarity, performing significantly better than all the competing methods (Supplementary Figure 2). iCluster and PSDF perform similarly to each other regardless of the type of noise present, while MKL is more sensitive to Gaussian noise (Supplementary Figure 3). Concatenation's performance is the worst of the compared methods. Supplementary Figures 6 and 9 show quick convergence and robustness to a variety of different parameter settings. Each reported point represents an average of 50 trials. The standard deviation was too small to show on the graph.
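The NMI metric used throughout these comparisons can be computed directly from two label vectors. A minimal sketch follows; normalizing the mutual information by the mean of the two label entropies is one common convention, assumed here (the exact definition used is given in Supplementary Section 5.1).

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings (sketch)."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    n = len(a)
    # Joint distribution over (cluster in a, cluster in b) pairs.
    pairs, counts = np.unique(np.stack([a, b]), axis=1, return_counts=True)
    pxy = counts / n
    # Marginal distributions and their entropies.
    px = np.unique(a, return_counts=True)[1] / n
    py = np.unique(b, return_counts=True)[1] / n
    hx = -np.sum(px * np.log(px))
    hy = -np.sum(py * np.log(py))
    px_of = dict(zip(np.unique(a), px))
    py_of = dict(zip(np.unique(b), py))
    # Mutual information I(X;Y) = sum_{i,j} p(i,j) log(p(i,j) / (p(i) p(j))).
    mi = sum(p * np.log(p / (px_of[i] * py_of[j]))
             for (i, j), p in zip(pairs.T, pxy))
    return mi / ((hx + hy) / 2)
```

An NMI of 1 means the two partitions agree up to a relabeling of clusters; 0 means they are independent.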

3.2 Clinical and biological assessment of the identified subtypes in GBM

The overall survival analysis for the fused-network-based subtypes resulted in a Cox log-rank p-value of 2 × 10−4; our lowest pairwise p-value is 5 × 10−6 (HR = 2.34) (Supplementary Figure 12a). The smallest cluster, Subtype 3, identified by our methodology has the best prognosis and consists of younger patients, as was previously found for the IDH subtype (Supplementary Figures 12a, 12b; three-way two-sided Welch's t-test p-value = 3 × 10−5). While IDH1 mutation status was not available for all patients in our cohort, 100% of the patients with an IDH1 mutation for whom the information was available belonged to Subtype 3, constituting more than half of this subgroup and confirming it as the IDH subtype (n = 14, Fisher exact test p-value = 4.87 × 10−11, Supplementary Figure 12d). We have also confirmed significant loss of chromosome 5 (p-value = 0.0083, adjusted, Fisher exact test), significant gain of chromosome 19 (p-value = 0.006), and lack of chromosome 7 gain (p-value = 1.837 × 10−4) and chromosome 10 loss (p-value = 1.05 × 10−4) to be associated with Subtype 3 [14, 8]. EGFR and CDK6 are both located on chromosome 7, following its pattern of significant amplification in subtypes other than Subtype 3 (p-value = 3.81 × 10−17 and p-value = 3.23 × 10−9, respectively), the significance likely due to the chromosomal gain. CDKN2A


(p = 0.0037) and PTEN (p = 0.001) are also preferentially deleted in Subtypes 1 and 2 but not in Subtype 3 (Supplementary Figure S12d). These signatures have all been confirmed in recent studies [14, 8, 19].

To further differentiate Subtypes 1 and 2, we analyzed genes for which both mRNA expression and DNA methylation profiles were significantly different between these two groups (Supplementary Figure S12e). We found that Subtype 1 was strongly associated with the hypermethylation and overexpression of APBA2 (Mint2). APBA2 is a neuronal adapter involved in signal transduction processes and has been found to be strongly associated with neuropsychiatric disorders such as autism [16] and schizophrenia [17]. Additionally, Mint2 hypermethylation has been verified as a BRAF signature of the CpG Island Methylator Phenotype (CIMP) in colorectal cancer [20]. Subtype 1 is also associated with overexpression and hypermethylation of GLI2, an important member of the Sonic Hedgehog pathway that can act independently of the pathway in GBM [4], hinting at a potentially important unexplored mechanism. Subtype 2 was associated with overexpression and hypermethylation of the temozolomide-resistant CTSD [9], which previously had only a tangential role in GBM, and also with hypomethylation and overexpression of the epithelial-to-mesenchymal transition driver S100A4 [10]. We have also identified several significantly differential miRNAs (hsa-miR-23a and hsa-miR-21), which have previously been found to play a role in GBM [3]. Thus, through our integrative analysis we have been able to unify and refine several sets of biological markers associated with GBM.

3.3 More Discussion for METABRIC Dataset

The difference between the currently widely adopted subtype-based survival prediction and what we believe is the future direction of refined diagnosis and outcome prediction is captured in the schematic in Supplementary Figure 20. Note that in the future-direction panel of the figure, the whole patient network is used to predict the risk of death for the new patient, mimicking the way a doctor infers a new patient's diagnosis using the knowledge based on all the patients she has seen over the years.

For each of these patients, matched DNA and RNA were extracted from each primary tumor specimen and subjected to copy number and genotype analysis on the Affymetrix SNP 6.0 platform and transcriptional profiling on the Illumina HT-12 v3 platform (Illumina_Human_WG-v3). We used the normalized data available from the European Genome-Phenome Archive (EGA, http://www.ebi.ac.uk/ega/). High-quality follow-up clinical data, including information on disease-free survival (DFS), was also available for both cohorts. As a preprocessing step, we mapped CNVs to genes using the PennCNV package [18].

The original detailed METABRIC subtyping analysis was performed using iCluster (called IntClust in the paper), using a selection of two thousand features based on mRNA expression and CNV data (one thousand each) [2]. In our work, we used the published patient-subtype iCluster labels on both the discovery and validation cohorts to avoid any inconsistencies that could be incurred by the feature selection process and rerunning of iCluster. We applied the Cox regression model to predict disease-free survival using the published labels. To obtain subtype labels in our analysis, we applied SNF to construct and combine expression-based and CNV-based patient networks from the discovery cohort, followed by spectral clustering as described above. We then concatenated the validation and discovery cohorts to learn the new patient network, where only the patients present in the discovery cohort had subtype labels. We used label propagation on the integrated patient network to obtain subtype labels for the patients in the validation cohort (Supplementary Method section 7). The published p-value obtained using iCluster on the validation set is less

26

Nature Methods: doi:10.1038/nmeth.2810

Page 28: Similarity Network Fusion for ... - Nature Research › ... › nmeth › journal › v11 › n3 › extref › nme… · Figure 10: Sensitivity test of SNF hyper-parameters in GBM

significant than both the iCluster-based discovery cohort p-value and SNF obtained p-value alludingto potential overfitting by iCluster.

4 Supplementary Note 1

4.1 Stopping Criteria

SNF is proven to converge, and empirically it converges fast. We track the relative change between consecutive rounds, E_t = ||W_{t+1} - W_t|| / ||W_t||. A simple stopping criterion is to set a threshold ε = 10^{-6} and stop the iteration once the relative change falls below it. Some empirical observations about the convergence can be found in Supplementary Figures 5, 6 and 7. We note that 20 iterations are always sufficient for convergence, so we empirically set the number of iterations in the range [10, 20].
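As a minimal sketch of this stopping criterion (the `update_step` below is a hypothetical stand-in for one SNF fusion round, not the published update rule; here it is a toy contraction toward a fixed matrix):

```python
import numpy as np

def fuse_until_converged(update_step, W0, eps=1e-6, max_iter=20):
    """Iterate a fusion update until the relative change E_t falls below eps
    or max_iter rounds have run (the [10, 20] range used in the text)."""
    W = W0
    for t in range(max_iter):
        W_next = update_step(W)
        # Relative change E_t = ||W_{t+1} - W_t|| / ||W_t|| (Frobenius norm).
        E_t = np.linalg.norm(W_next - W) / np.linalg.norm(W)
        W = W_next
        if E_t < eps:
            break
    return W, t + 1

# Toy contraction: each step moves W halfway toward a fixed target matrix.
target = np.eye(3)
step = lambda W: 0.5 * (W + target)
W, n_iter = fuse_until_converged(step, np.ones((3, 3)))
```

With this toy update the iterate is within 10^-4 of the target well inside the 20-iteration budget.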

4.2 Parameter Selection

In our method, there are two free parameters, η and K. Our method is not sensitive to these two parameters; see Supplementary Figures 8, 9 and 10. A reasonable range for η is 0.3–1. The rule of thumb for choosing the parameter K is K = N/C, where N is the number of patients and C is the number of clusters believed to be in the data. However, if C is unknown, we usually set K ≈ N/10.

4.3 Model Selection

How to identify the number of subtypes is a key problem in disease classification. Spectral clustering provides two main approaches to decide the optimal number of clusters. One is to use the eigengap to decide the best number of clusters based on the connectivity of the network. The eigengap is defined as follows:

eigengap(i) = λ_{i+1} − λ_i (1)

where λ_i is the i-th eigenvalue of the matrix L and the eigenvalues are sorted in ascending order (λ_1 ≤ λ_2 ≤ ... ≤ λ_n). The best number of clusters C* is

C* = arg max_{i>1} eigengap(i) (2)
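As a sketch, the eigengap rule amounts to sorting the Laplacian eigenvalues and taking the index of the largest gap. The symmetric normalized Laplacian below is one common choice; the text does not pin down the exact variant, so treat it as an assumption:

```python
import numpy as np

def best_k_by_eigengap(W):
    """Pick the number of clusters via the largest eigengap of the
    normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    lam = np.sort(np.linalg.eigvalsh(L))   # ascending eigenvalues
    gaps = np.diff(lam)                    # eigengap(i) = lam[i+1] - lam[i]
    # arg max over i > 1 (skip the trivial first gap); +2 converts the
    # 0-based diff index back into a cluster count.
    return int(np.argmax(gaps[1:]) + 2)

# Two clearly separated similarity blocks -> the eigengap picks 2 clusters.
W = np.ones((6, 6)) * 0.01
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
```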

Another direction is to exploit the structure of the eigenvectors of the Laplacian L [22]. Assume U = [U_1, U_2, ..., U_C] holds the orthogonal eigenvectors of L corresponding to the eigenvalues λ_1, λ_2, ..., λ_C. We try to find a rotation matrix R such that, with Z = UR, every row of Z has at most one non-zero entry. In other words, if we denote M_i = max_j Z_{ij}, we try to minimize

J = Σ_{i=1}^{n} Σ_{j=1}^{C} Z_{ij}^2 / M_i^2 (3)

A gradient descent method can be used to solve this optimization problem [22]. The best number of clusters is then the one that achieves the minimal value of J in Eq. (3).
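The objective in Eq. (3) is cheap to evaluate for a candidate Z; a minimal numpy sketch (illustrative only — it uses |Z_ij| for M_i so the cost stays well defined for negative entries, and it omits the gradient-descent search over rotations R):

```python
import numpy as np

def rotation_cost(Z):
    """Cost J from Eq. (3): sum_i sum_j Z_ij^2 / M_i^2 with M_i = max_j |Z_ij|.
    J attains its minimum, n, when every row of Z has a single non-zero entry."""
    M = np.abs(Z).max(axis=1, keepdims=True)
    return float(np.sum((Z / M) ** 2))

# Indicator-like rows attain the minimum J = n = 3; fully mixed rows do not.
Z_aligned = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
Z_mixed = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
```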

In two out of the five tumours the two methods agreed on the number of intrinsic clusters; in the remaining three cases we used the second method, which is known to be more stable. We utilized eigengap values to analyze the intrinsic dimensionality of the network.


4.4 Predicting the Subtype of any New Patient

Given a new patient, we employ a simple classifier based on network structure, label propagation [23], to predict the subtype of the new patient. Assume we have n patients that have been analyzed by SNF to determine the subtypes y with a fused network W. First, SNF is used to calculate the new patient's similarity vector s; then we construct the enlarged (n+1)×(n+1) similarity matrix W̃ = [W s; sᵀ 1]. The corresponding transition kernel is P_{ij} = W̃_{ij} / Σ_j W̃_{ij}. To predict the label of the new patient, we initialize an (n+1)-dimensional label vector l = [y; 0] and iterate the propagation process:

Repeat:

l_{t+1} = P l_t

l_{t+1}(1 : n) = y

This process converges (usually 1000 iterations are enough). The subtype of the new patient is given by l(n+1). Details about label propagation can be found in [23].
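The propagation loop above can be sketched as follows. One-hot encoding of the labels is an implementation choice here (it lets multi-class propagation be a single matrix product per iteration), not part of the original description:

```python
import numpy as np

def propagate_label(W_big, y, n_iter=1000):
    """Label propagation on the enlarged (n+1)x(n+1) similarity matrix.

    W_big stacks the fused network W with the new patient's similarity
    vector s; y holds the n known subtype labels, one-hot encoded.
    """
    P = W_big / W_big.sum(axis=1, keepdims=True)  # row-stochastic kernel
    n, C = y.shape
    l = np.vstack([y, np.zeros((1, C))])          # l = [y; 0]
    for _ in range(n_iter):
        l = P @ l
        l[:n] = y                                  # clamp the known labels
    return int(np.argmax(l[n]))                    # subtype of patient n+1

# Four labeled patients in two subtypes; the new (fifth) patient is
# strongly similar to the first subtype.
W_big = np.array([
    [1.00, 0.90, 0.05, 0.05, 0.90],
    [0.90, 1.00, 0.05, 0.05, 0.90],
    [0.05, 0.05, 1.00, 0.90, 0.05],
    [0.05, 0.05, 0.90, 1.00, 0.05],
    [0.90, 0.90, 0.05, 0.05, 1.00],
])
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
```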

4.5 Network-based Feature Selection Approach

Though SNF does not require pre-selection of features, we propose a post-analysis procedure to select important features that correspond to the network structure. After obtaining the disease subtypes, denoted as g, for each feature f_i we construct a patient network W_i based on f_i alone and then use spectral clustering to obtain subtypes l_i. To measure the relative contribution, or consistency, between f_i and the whole network, we assign a consistency score to f_i as cs_i = NMI(l_i, g). If cs_i = 1, the network of patients based on feature f_i leads to the same subtyping as the fused network, and we therefore consider this feature very important. If cs_i = 0, there is no real correspondence between the feature and the fused network. Therefore, the higher cs_i is, the more important f_i is to the fused network structure, allowing us to rank all the features with respect to their importance for the fused network.

5 Supplementary Note 2

5.1 Evaluation Metrics

In this paper, we used several metrics for evaluation and comparison of our method to existingapproaches. Here we list a few technical details.

In simulations, we use Normalized Mutual Information (NMI) [7]. Given two clustering results U and V on a set of data points, NMI is defined as I(U,V) / sqrt(H(U)H(V)), where I(U,V) is the mutual information between U and V, and H(U) represents the entropy of the clustering U. Details can be found in [15]. NMI is a value between 0 and 1, measuring the concordance of two clustering results. In the simulations, we compare the obtained clustering with the ground truth; therefore, a higher NMI reflects higher concordance with the truth, i.e. a more accurate result.
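NMI as defined above can be computed directly from the contingency counts of the two label lists; a small self-contained sketch (it assumes both clusterings have more than one cluster, so the entropies are non-zero):

```python
import numpy as np
from collections import Counter

def nmi(u, v):
    """Normalized mutual information I(U,V) / sqrt(H(U) H(V)) between two
    clusterings given as label lists; 1 means identical partitions."""
    n = len(u)
    pu, pv, puv = Counter(u), Counter(v), Counter(zip(u, v))
    H = lambda c: -sum(k / n * np.log(k / n) for k in c.values())
    I = sum(k / n * np.log((k / n) / (pu[a] / n * pv[b] / n))
            for (a, b), k in puv.items())
    return I / np.sqrt(H(pu) * H(pv))
```

Note that NMI is invariant to relabeling: swapping the cluster names of one partition leaves the score at 1, while independent partitions score 0.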

To test the robustness of our method and the prediction power of the fused network, we use leave-one-out cross-validation. Treating each patient in turn as test data, we use the remaining data as training samples. A simple classifier, label propagation [23], is used to predict the label of the test data. The average prediction accuracy is recorded and is shown in Supplementary Figure 4. This measure evaluates the robustness of the labels obtained by the defined disease subtypes: higher prediction accuracy indicates higher stability of the defined subtypes.

For GBM, we performed differential analysis of chromosomal gain/loss to differentiate subtypes and check correspondence to the literature [14]. We combined all the CNV information per chromosome and declared a gain/loss if more than 75% of the genes on the given chromosome were amplified/deleted.
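The 75% rule can be written as a short helper. This is illustrative only: the `gene_cn` mapping and the {-1, 0, +1} per-gene call encoding are assumptions, not the paper's actual data format:

```python
import numpy as np

def chromosome_call(gene_cn, threshold=0.75):
    """Declare a per-chromosome gain/loss when more than `threshold` of its
    genes are amplified (+1) / deleted (-1); gene_cn maps chromosome name
    to an array of per-gene CNV calls in {-1, 0, +1}."""
    calls = {}
    for chrom, cn in gene_cn.items():
        cn = np.asarray(cn)
        if np.mean(cn > 0) > threshold:
            calls[chrom] = "gain"
        elif np.mean(cn < 0) > threshold:
            calls[chrom] = "loss"
        else:
            calls[chrom] = "neutral"
    return calls

# 4/5 genes amplified on chr7 (gain), all deleted on chr10 (loss),
# mixed calls on chr1 (neutral).
calls = chromosome_call({"chr7": [1, 1, 1, 1, 0],
                         "chr10": [-1, -1, -1, -1, -1],
                         "chr1": [1, 0, -1, 0]})
```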

For the task of survival prediction in the METABRIC breast cancer data [2], the Cox model [12] is the most widely used tool so far. It performs regression to estimate the survival risk functions. After obtaining a risk estimate for new patients, the concordance index [5] is used to measure the consistency between the estimated risk and the real survival time. The concordance index, to be specific, quantifies the quality of rankings as the fraction of all pairs of subjects whose predicted survival times are correctly ordered, among all pairs of subjects that can actually be ordered. Namely, it is the probability of concordance between the predicted risk and the observed survival.
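The concordance index can be computed directly from this definition. The sketch below handles right-censoring in the usual way (a pair is comparable only when the earlier time belongs to a patient whose event was observed) and counts risk ties as 1/2; these conventions are standard but not spelled out in the text:

```python
import itertools

def concordance_index(risk, time, event):
    """Fraction of comparable patient pairs whose predicted risks order
    their observed survival times correctly (higher risk -> earlier event)."""
    conc = comp = 0.0
    for i, j in itertools.combinations(range(len(time)), 2):
        if time[i] == time[j]:
            continue                                 # tied times: skip
        first = i if time[i] < time[j] else j        # earlier failure
        if not event[first]:
            continue                                 # censored first: not comparable
        comp += 1
        other = j if first == i else i
        if risk[first] > risk[other]:
            conc += 1
        elif risk[first] == risk[other]:
            conc += 0.5
    return conc / comp
```

A perfect risk ranking scores 1.0 and a fully reversed ranking scores 0.0; random risks score about 0.5.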

5.2 Gene Ranking by Normalization

In order to have a fair comparison with iCluster, which requires feature selection, we rank the genes according to mRNA expression and DNA methylation in a simple way. For every gene (its mRNA expression or DNA methylation profile), we compute the normalized differential expression:

NDE = (|E(Cases) − E(Controls)| / sqrt(|V(Cases) − V(Controls)|)) · (1/n + 1/m) / (n + m − 2),

where E(Cases) denotes the mean value for cases (patients) and E(Controls) the mean value for controls (healthy individuals, or corresponding healthy tissue from patients), V(Cases) stands for the variance of the cases (similarly for V(Controls)), and n and m represent the numbers of cases and controls, respectively. The term V(Cases) penalizes genes having high variance among cancer patients. This normalization is known as the SAM method [13]. A higher value of NDE means a higher relevance of the particular gene to cancer.
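A direct transcription of the score as reconstructed above. Caveat: the formula was garbled in extraction and this reading (in particular the variance-difference term and the placement of the sample-size factor) may differ from the published SAM variant; `nde` is an illustrative helper, not the authors' code:

```python
import numpy as np

def nde(cases, controls):
    """Normalized differential expression for one gene, as read from the
    equation above: |mean difference| / sqrt(|variance difference|),
    scaled by (1/n + 1/m) / (n + m - 2)."""
    n, m = len(cases), len(controls)
    num = abs(np.mean(cases) - np.mean(controls))
    den = np.sqrt(abs(np.var(cases, ddof=1) - np.var(controls, ddof=1)))
    return num / den * (1 / n + 1 / m) / (n + m - 2)
```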

In addition, for lung cancer, where we do not have control data, we use a high standard deviation across all cases as a measure of the importance of a gene.

6 Supplementary Note 3

6.1 Alternative approaches

In this paper, we consider four alternative approaches. All four approaches satisfied the following two criteria: 1) code availability or ease of implementation, making sure that the code runs within a reasonable amount of time; and 2) the approach does not require excessive manual tuning. iCluster [6] and concatenation are the two most commonly used methods to integrate biological data in application to disease subtyping. Patient-Specific Data Fusion (PSDF) [21] and Multiple Kernel Learning (MKL) [1] are promising machine learning methods that nevertheless have limitations. For example, PSDF can only integrate two types of data and is very computationally intensive, though it appears to perform similarly to iCluster in our simulations (Supplementary Figures 1 and 2). MKL, developed for the task of linear kernel combination, is not ideally suited for this task and performed significantly worse in application to the first simulation (Supplementary Figure 1).


6.1.1 Concatenation

Concatenation is a simple approach for combining data types. Briefly, we normalized the data per data type and then concatenated all the measurements for each patient into one long vector. The combined high-dimensional patient-by-feature matrix was then used as input to the hierarchical clustering algorithm available as part of the MATLAB distribution, which yielded a set of clusters. We chose correlation as the distance metric and 'average' as the linkage function. The number of clusters was chosen to be the same as in the clustering of the SNF fused matrix.
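The normalization-and-concatenation step can be sketched as follows. The clustering itself was done with MATLAB's hierarchical clustering (correlation distance, average linkage); this numpy fragment only illustrates the per-type z-scoring and the stacking into one long patient vector:

```python
import numpy as np

def concatenate_views(views):
    """Z-score each data type per feature, then concatenate into one
    patient-by-feature matrix (the input to the hierarchical clustering
    used by the concatenation baseline)."""
    normed = [(X - X.mean(axis=0)) / X.std(axis=0) for X in views]
    return np.hstack(normed)

# Two toy data types for 4 patients: 2 expression features, 1 CNV feature.
expr = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
cnv = np.array([[0.0], [1.0], [0.0], [1.0]])
X = concatenate_views([expr, cnv])   # shape (4, 3), each column zero-mean
```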

6.1.2 iCluster

iCluster is a Gaussian latent variable model with sparsity regularization in a Lasso-type optimization framework. Briefly, the main assumption behind this approach is that the m genomic data types {x_k}_{k=1}^{m} share a common set of latent variables z_i according to the following linear model:

x_{ik} = W_k z_i + ε_{ik}, i = 1, ..., n, k = 1, ..., m,

where W_k denotes the loading matrix associated with the k-th genomic data type and n is the number of patients.

The common variables z_i represent the underlying driving factors for patient i and can be interpreted as the disease subtype assignment. iCluster uses the Expectation-Maximization (EM) algorithm to estimate the parameters, under the assumption that the error in the model follows a Gaussian distribution. Sparsity in the estimated W_k is enforced by adding an ℓ1-norm regularization with λ = 0.2, as suggested in the method's manual.

6.1.3 Multiple Kernel Learning

Multiple Kernel Learning (MKL) learns a linear combination of kernels into a unified kernel that improves the performance of classifiers. For instance, given m different kernels K_v, v = 1, ..., m, the objective is to learn a vector α = {α_1, ..., α_m} that makes the combined kernel K_combined = Σ_{v=1}^{m} α_v K_v more suitable for either supervised or unsupervised tasks. Usually, MKL is used in a supervised setting because known labels can be used to guide the search for the optimal α. Recently, an unsupervised MKL algorithm [1] was proposed that can be used together with the spectral clustering framework. The eigen-structure is used to help the search for α; however, the discriminative power is very limited. We used this method and the code provided by the authors, setting the nearest-neighbor parameter to K = 20 and η = 0.5.
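The combination at the core of MKL is just a weighted sum of kernel matrices. A sketch with simplex-constrained weights (the search for α itself, e.g. via the eigen-structure approach of [1], is omitted here):

```python
import numpy as np

def combine_kernels(kernels, alpha):
    """Linear MKL combination K = sum_v alpha_v K_v, with weights on the
    simplex (alpha_v >= 0, sum alpha_v = 1), as in the objective above."""
    alpha = np.asarray(alpha, dtype=float)
    assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0)
    return sum(a * K for a, K in zip(alpha, kernels))

# Combine an identity kernel with a constant kernel, weighted 1:3.
K1 = np.eye(3)
K2 = np.ones((3, 3))
K = combine_kernels([K1, K2], [0.25, 0.75])
```

A single fixed weighting like this is exactly the linearity that SNF's iterative, non-linear updates are meant to go beyond.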

Though parallels can be drawn, there are substantial differences between MKL and SNF: 1) MKL is a linear combination of kernels, whereas SNF is a dynamic and non-linear fusion of multiple networks. SNF's iterative updating of each of the multi-view similarity matrices ("kernels") allows for non-linearities and thus substantially improved performance (one iteration of our approach could be considered in the MKL family of methods). 2) While MKL is very effective in supervised learning, it is not commonly used for unsupervised problems. 3) MKL is designed to combine different kernels, such as linear and Gaussian, for single-view data, whereas SNF is specifically designed to solve the multi-view integration problem. Though MKL variants have been proposed for unsupervised cases [1], their performance often does not compare favorably with methods designed specifically for unsupervised learning problems. In the case of multi-view data integration, our experiments emphasize the importance of capturing the internal structure of the data, which is often unknown (i.e. it is not clear which kernel should be used), making the problem difficult for MKL, where the structure must be encoded in the choice of the kernel. In general, we found that MKL actually performs worse than iCluster, for example, presumably due to the lack of adequate accounting for non-linearity.

6.1.4 Patient Specific Data Fusion

Patient-Specific Data Fusion (PSDF) [21] is a nonparametric Bayesian model for discovering cancer subtypes by combining gene expression and copy number variation. The model is based on the idea of Hierarchical Dirichlet Processes [11]. For each patient, a binary state variable is inferred that defines whether their data are concordant across the data sets. In addition, feature selection is incorporated in the model by assuming each feature is drawn from a multinomial distribution with unknown class probabilities. Multiple MCMC chains are employed to improve mixing and to infer the statistical uncertainties in PSDF. In our experiments we used the authors' code, setting 100 MCMC iterations in each step and the fusion weight to 0.5, as suggested by the authors.

While PSDF appears to be a powerful framework for unsupervised multi-view learning, there are essential disadvantages precluding its use for analyzing the real cancer data in this paper: 1) the large number of unknown parameters makes model inference very difficult and computationally expensive; 2) it is only suitable for combining two data types. PSDF could potentially be applied to the METABRIC cohort, which contains only two data types, but unfortunately the approach is not scalable to the full size of this data set.

References

[1] Y.-Y. Chuang. Affinity aggregation for spectral clustering. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 773–780, Washington, DC, USA, 2012. IEEE Computer Society.

[2] C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, S. Graf, G. Ha, G. Haffari, A. Bashashati, R. Russell, S. McKinney, METABRIC Group, A. Langerød, A. Green, E. Provenzano, G. Wishart, S. Pinder, P. Watson, F. Markowetz, L. Murphy, I. Ellis, A. Purushotham, A.-L. Børresen-Dale, J. D. Brenton, S. Tavare, C. Caldas, and S. Aparicio. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–52, Jun 2012.

[3] H. Dong, H. Siu, L. Luo, X. Fang, L. Jin, and M. Xiong. Investigation gene and microRNA expression in glioblastoma. BMC Genomics, 11(Suppl 3):S16, 2010.

[4] D. Javelaud, V. I. Alexaki, S. Dennler, K. S. Mohammad, T. A. Guise, and A. Mauviel. TGF-β/SMAD/GLI2 signaling axis in cancer progression and metastasis. Cancer Res, 71(17):5606–10, Sep 2011.

[5] M. J. Pencina and R. B. D’Agostino. Overall C as a measure of discrimination in survivalanalysis: model specific population value and confidence interval estimation. Statist. Med.,23(13):2109–2123, July 2004.

[6] R. Shen, A. B. Olshen, and M. Ladanyi. Integrative clustering of multiple genomic data typesusing a joint latent variable model with application to breast and lung cancer subtype analysis.Bioinformatics, 25(22):2906–12, Nov 2009.


[7] A. Strehl and J. Ghosh. Cluster ensembles — a knowledge reuse framework for combiningmultiple partitions. J. Mach. Learn. Res., 3:583–617, Mar. 2003.

[8] D. Sturm, H. Witt, V. Hovestadt, D.-A. Khuong-Quang, D. T. W. Jones, C. Konermann, E. Pfaff, M. Tonjes, M. Sill, S. Bender, M. Kool, M. Zapatka, N. Becker, M. Zucknick, T. Hielscher, X.-Y. Liu, A. M. Fontebasso, M. Ryzhova, S. Albrecht, K. Jacob, M. Wolter, M. Ebinger, M. U. Schuhmann, T. van Meter, M. C. Fruhwald, H. Hauch, A. Pekrun, B. Radlwimmer, T. Niehues, G. von Komorowski, M. Durken, A. E. Kulozik, J. Madden, A. Donson, N. K. Foreman, R. Drissi, M. Fouladi, W. Scheurlen, A. von Deimling, C. Monoranu, W. Roggendorf, C. Herold-Mende, A. Unterberg, C. M. Kramm, J. Felsberg, C. Hartmann, B. Wiestler, W. Wick, T. Milde, O. Witt, A. M. Lindroth, J. Schwartzentruber, D. Faury, A. Fleming, M. Zakrzewska, P. P. Liberski, K. Zakrzewski, P. Hauser, M. Garami, A. Klekner, L. Bognar, S. Morrissy, F. Cavalli, M. D. Taylor, P. van Sluis, J. Koster, R. Versteeg, R. Volckmann, T. Mikkelsen, K. Aldape, G. Reifenberger, V. P. Collins, J. Majewski, A. Korshunov, P. Lichter, C. Plass, N. Jabado, and S. M. Pfister. Hotspot mutations in H3F3A and IDH1 define distinct epigenetic and biological subgroups of glioblastoma. Cancer Cell, 22(4):425–37, Oct 2012.

[9] S. Sun, T. S. Wong, X. Q. Zhang, J. K. S. Pu, N. P. Lee, P. J. R. Day, G. K. B. Ng, W. M. Lui,and G. K. K. Leung. Protein alterations associated with temozolomide resistance in subclonesof human glioblastoma cell lines. J Neurooncol, 107(1):89–100, Mar 2012.

[10] K. Takenaga, J. Nygren, M. Zelenina, M. Ohira, T. Iuchi, E. Lukanidin, M. Sjoquist, and E. N. Kozlova. Modified expression of Mts1/S100A4 protein in C6 glioma cells or surrounding astrocytes affects migration of tumor cells in vitro and in vivo. Neurobiol Dis, 25(3):455–63, Mar 2007.

[11] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.

[12] T. M. Therneau and P. M. Grambsch. Modeling Survival Data: Extending the Cox Model.Springer-Verlag, New York, 2000.

[13] V. G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America, 98(9):5116–5121, Apr. 2001.

[14] R. G. W. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, M. D. Wilkerson, C. R. Miller, L. Ding, T. Golub, J. P. Mesirov, G. Alexe, M. Lawrence, M. O'Kelly, P. Tamayo, B. A. Weir, S. Gabriel, W. Winckler, S. Gupta, L. Jakkula, H. S. Feiler, J. G. Hodgson, C. D. James, J. N. Sarkaria, C. Brennan, A. Kahn, P. T. Spellman, R. K. Wilson, T. P. Speed, J. W. Gray, M. Meyerson, G. Getz, C. M. Perou, D. N. Hayes, and Cancer Genome Atlas Research Network. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell, 17(1):98–110, Jan 2010.

[15] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison:Variants, properties, normalization and correction for chance. J. Mach. Learn. Res., 11:2837–2854, Dec. 2010.

[16] I. Voineagu, X. Wang, P. Johnston, J. K. Lowe, Y. Tian, S. Horvath, J. Mill, R. M. Cantor, B. J. Blencowe, and D. H. Geschwind. Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature, 474(7351):380–4, Jun 2011.


[17] T. Walsh, J. M. McClellan, S. E. McCarthy, A. M. Addington, S. B. Pierce, G. M. Cooper, A. S. Nord, M. Kusenda, D. Malhotra, A. Bhandari, S. M. Stray, C. F. Rippey, P. Roccanova, V. Makarov, B. Lakshmi, R. L. Findling, L. Sikich, T. Stromberg, B. Merriman, N. Gogtay, P. Butler, K. Eckstrand, L. Noory, P. Gochman, R. Long, Z. Chen, S. Davis, C. Baker, E. E. Eichler, P. S. Meltzer, S. F. Nelson, A. B. Singleton, M. K. Lee, J. L. Rapoport, M.-C. King, and J. Sebat. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science, 320(5875):539–43, Apr 2008.

[18] K. Wang, M. Li, D. Hadley, R. Liu, J. Glessner, S. F. A. Grant, H. Hakonarson, and M. Bucan. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res, 17(11):1665–74, Nov 2007.

[19] G. L. Weber, M.-O. Parat, Z. A. Binder, G. L. Gallia, and G. J. Riggins. Abrogation of PIK3CA or PIK3R1 reduces proliferation, migration, and invasion in glioblastoma multiforme cells. Oncotarget, 2(11):833–49, Nov 2011.

[20] D. J. Weisenberger, K. D. Siegmund, M. Campan, J. Young, T. I. Long, M. A. Faasse, G. H. Kang, M. Widschwendter, D. Weener, D. Buchanan, H. Koh, L. Simms, M. Barker, B. Leggett, J. Levine, M. Kim, A. J. French, S. N. Thibodeau, J. Jass, R. Haile, and P. W. Laird. CpG island methylator phenotype underlies sporadic microsatellite instability and is tightly associated with BRAF mutation in colorectal cancer. Nat Genet, 38(7):787–93, Jul 2006.

[21] Y. Yuan, R. S. Savage, and F. Markowetz. Patient-specific data fusion defines prognosticcancer subtypes. PLoS Computational Biology, 7(10), 2011.

[22] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In L. K. Saul, Y. Weiss, andL. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1601–1608.MIT Press, Cambridge, MA, 2005.

[23] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
