Analysis of Molecular and Clinical Data at PolyomX Adrian Driga 1, Kathryn Graham 1, 2, Sambasivarao Damaraju 1, 2, Jennifer Listgarten 3, Russ Greiner

Analysis of Molecular and Clinical Data at PolyomXAdrian Driga1, Kathryn Graham1, 2, Sambasivarao Damaraju1, 2, Jennifer Listgarten3, Russ Greiner4, John Mackey1, 2 , Carol Cass1, 2

1PolyomX, Cross Cancer Institute, Alberta Cancer Board, Depts. of 2Oncology and 4Computing Science, University of Alberta, Edmonton, 3Dept. Of Computer Science, University of Toronto, Toronto

www.polyomx.org

PolyomX has conducted several population-based association studies on gene expression microarray and SNP type data to determine the contribution of genes to disease susceptibility and clinical characteristics. For each study, we have generated datasets for investigation from the DORA database and have analyzed the data using software tools that have been developed by PolyomX or are publicly available to researchers. The data analysis process can be automated so that researchers can easily generate result reports for various combinations of parameters and datasets.We present the analysis process for gene expression microarray data, focusing on prediction of clinical characteristics and survival analysis, and describe each step of the process from data generation to results. We use a real dataset in which the identity of the breast cancer patients has been anonymized, and we list the genes that are differentially expressed between the classes defined by Nuclear Grade and ER status. We include the results generated using a software package, BRB ArrayTools, available from the National Cancer Institute (US). We also give details on how one would interpret the results of the data analysis.

Microarray Data Analysis

1. Retrieve gene expression microarray data for specified set of patients. A web form is used to retrieve microarray data form the DORA database. In the resulting Excel file (Figure 1), each column holds gene expression data for a patient and each row holds gene expression data for a gene. Each column is headed by a patient ID. Each row is headed by a gene title and a gene ID (Unigene number). Microarray data stored in the DORA database is filtered, normalized, and aggregated using algorithms developed by PolyomX (1). Each gene expression value is the log base 2 value of the ratio between the intensities of the red and green channels from the hybridized microarray.

2. Retrieve clinical data for specified set of patients and clinical characteristics that will be analyzed. A pre-defined query is used to retrieve clinical data from the DORA database. In the resulting Excel file (Figure 2), each column holds values for a clinical characteristic and each row holds the values of all clinical characteristics for a patient. The first column and row are headers. The patient IDs must be listed in the same order as in the gene expression data file. Each clinical characteristic partitions the patients into two or more classes. The clinical data file is called the experiment descriptors file. For survival analysis, the clinical data must include the patient status and the number of follow-up days since the event of reference.

3. Collate horizontally aligned microarray data and experiment descriptors in a BRB ArrayTools project. The ‘Data import wizard’ (Figure 3) of BRB ArrayTools links microarray expression data and clinical data into a BRB ArrayTools project. Genes are further filtered based on their expression values and presence in the patient profile.

Figure 2: Experiment descriptors for BRB ArrayTools input

Figure 1: Horizontally aligned microarray data for BRB ArrayTools input

5. Interpret the results and rerun analysis with different parameter values if necessary. For class prediction analysis, the accuracy of a classifier is considered statistically significant only if the p-value of that classifier is very small, e.g., p < 0.001. The prediction accuracy of each classifier is estimated through leave-one-out cross-validation (LOOCV). The sensitivity and specificity are shown for each classifier and each class. A classifier is significant if, for one of the classes, both sensitivity and specificity are high (e.g., greater than 0.7), and sensitivity is very high (e.g., greater than 0.9). For survival analysis, the user specifies limits for the number and the proportion of false discoveries. The output shows where to cut the gene list to comply with each of the two constraints specified.

Figure 4: Complete results for Nuclear Grade class prediction at the significance level 10-4. The Support Vector Machine classifier has the highest sensitivity and specificity. Its overall accuracy is 74% and its global p-value is less than 5x10-4. The same 12 genes are used to build each classifier. A reading of 3 for Nuclear Grade is considered High, otherwise it is deemed Low.

Figure 5: The Nearest Centroid classifier is the most accurate classifier for Nuclear Grade class prediction when using the significance level 10-3 and random univariate tests. The classifier uses 29 differentially expressed genes (not shown). Acc. = 76%

Figure 6: The 3-Nearest Neighbors classifier is the most accurate classifier for ER class prediction when using the significance level 10-6 and random univariate tests. The classifier uses 20 differentially expressed genes. A reading of 3+ for ER is considered Positive, otherwise it is deemed Negative. Acc. = 89%

Figure 7: Survival analysis is based on the number of days from diagnostic to last follow-up. The first gene is the only gene associated with survival. It is possible that the other 9 genes in the list are false discoveries.

Acknowledgements:• PolyomX is supported by the Alberta Cancer Foundation and the Alberta Cancer Board • DORA is the molecular and clinical database developed by PolyomX• Analyses were performed using BRB ArrayTools 3.2 developed by Dr. Richard Simon and Amy Peng• (1) Clinically validated benchmarking of normalization techniques for two color oligonucleotide spotted microarray slides (2004). J. Listgarten, K. Graham, S. Damaraju, C. Cass, J. Mackey and B. Zanke. Applied Bioinformatics: Vol 2, 219-228

Figure 8: With the same conditions as above, the maximum allowed number of false-positive genes is set to 1 and the maximum allowed proportion of false-positives genes is set to 0.01. The glutathione peroxidase 1 is again the only gene associated with survival. The small number of genes found to be associated with survival can be explained by the short follow-up time of the patients in this study: avg. = 746 days, med. = 811 days. Only 8 patients out of the 117 breast cancer patients studied are deceased.

Figure 3: BRB ArrayTools: Collate microarray data and experiment descriptors

4. Run class prediction or survival analysis algorithms from BRB ArrayTools. Run the class prediction analysis to discover a set of genes that could accurately predict the class to which a patient belongs given the gene expression profile of the patient. The user chooses a clinical characteristic to analyze and a threshold (significance level) that determines whether a gene is differentially expressed between the classes defined by the clinical characteristic. The software builds six classifiers based on the expression values of the differentially expressed genes and tests these classifiers through cross validation to determine their sensitivity and specificity. The statistical significance of each classifier’s accuracy is determined through permutation testing (multivariate analysis). The global p-value generated for each classifier represents the false discovery rate of that classifier. Survival analysis identifies genes that are significantly associated with the survival of the patients. The software computes a statistical significance level for each gene based on univariate proportional hazards models (Cox proportional hazards model). Multivariate permutation test provides 90% confidence that the rate of false discoveries is less than a specified quantity.

Analysis Goal: Identify a list of genes that differentiate between patient classes or are associated with patient survival. The sensitivity and specificity of a classifier measure how well the classifier built using these genes can predict the class of a new patient using the expression data of only the genes from the list.

Analysis Directions:• Ensure that the global p-value of a classifier is small. The p-value gives the number of times when a classifier built on randomized data is more accurate than the classifier built on the original data. A small p-value shows that the sensitivity and specificity of the classifier are statistically significant.• Test as many classifiers as possible. An underlying association between gene expression data and a clinical characteristic may be predicted by one classifier more accurately than another. For different significance levels, different classifiers may provide the highest accuracy. BRB ArrayTools includes 6 classifiers that have been shown to model microarray data effectively. • For biologists, a short list of differentially expressed genes is more useful than a longer list if the most accurate classifier built from the short lists has sensitivity and specificity values comparable to those of the classifier built with the longer list. The short list includes the most significant genes from the longer list.

Results: We find BRB ArrayTools to be effective in discovering statistically relevant associations between the microarray data and the clinical characteristics of the 117 breast cancer patients included in our study. We found genes that have significant associations with Nuclear Grade and ER status, and one gene that is likely to be associated with survival. Figures 4-8 give details on the results of our analysis.

Documents

Analysis of Molecular and Clinical Data at PolyomX Adrian Driga 1, Kathryn Graham 1, 2, Sambasivarao Damaraju 1, 2, Jennifer Listgarten 3, Russ Greiner