6
On Predicting Medulloblastoma Metastasis by Gene Expression Profiling Michael J. Korenberg* Department of Electrical and Computer Engineering, Queen’s University, Kingston, Ontario, Canada K7L 3N6 Received August 29, 2003 Accurately predicting clinical outcome or metastatic status from gene expression profiles remains one of the biggest hurdles facing the adoption of predictive medicine. Recently, MacDonald et al. (Nat. Genet. 2001, 29, 143-152) used gene expression profiles, from samples taken at diagnosis, to distinguish between clinically designated metastatic and nonmetastatic primary medulloblastomas, helping to elucidate the genetic mechanisms underlying metastasis and suggesting novel therapeutic targets. The obtained accuracy of predicting metastatic status does not, however, reach statistical significance on Fisher’s exact test, although 22 training samples were used to make each prediction via leave-one- out testing. This paper introduces readily implemented nonlinear filters to transform sequences of gene expression levels into output signals that are significantly easier to classify and predict metastasis. It is shown that when only 3 exemplars each from the metastatic and nonmetastatic classes were assumed known, a predictor was constructed whose accuracy is statistically significant over the remaining profiles set aside as a test set. The predictor was as effective in recognizing metastatic as nonmetastatic medulloblastomas, and may be helpful in deciding which patients require more aggressive therapy. The same predictor was similarly effective on an independent set of 5 nonmetastatic tumors and 3 metastatic cell lines also used by MacDonald et al. Keywords: metastasis prediction clinical outcome gene expression DNA chips microarray 1. Introduction An unsolved problem in predictive medicine is to build effective classifiers of gene expression profiles based on only a few exemplars of the classes to be distinguished. In these circumstances, statistically based prediction methods may be at a disadvantage, because they rely on estimating means, standard deviations, etc. of gene expression levels for each class from very few representatives, to choose the “best” genes for distinguishing the classes. MacDonald et al. 1 constructed predictors of metastatic status by adapting a “weighted voting” (WV) technique introduced by Golub et al. 2 for distinguishing between various acute leukemia classes. Subsequently, this approach, along with several others, has been employed to predict clinical outcome of medulloblastoma patients. 3 The MacDonald et al. study is intriguing and provides valuable clues to the genetic regulation of metastasis, identifying some genes critical to this process and suggesting possible targets for novel therapies, including use of specific inhibitors of platelet-derived growth factor receptor-R as possible new treatments of medullo- blastoma. 1 Moreover, prediction of metastasis has rarely been attempted elsewhere. Using Affymetrix G110 Cancer Chips, MacDonald et al. 1 obtained gene expression profiles for 14 nonmetastatic (M0) and 9 metastatic (M+) medulloblastoma tumors. Over this dataset, a leave-one-out protocol was employed, where each time a predictor was trained on 22 profiles and then used to predict metastatic status of the remaining profile, provided that a “prediction strength” measure exceeded a pre-set threshold. The procedure was repeated until an attempt had been made to classify each profile. Their predictors could make decisions about metastatic status for 18 of the 23 samples. Thirteen of the eighteen predictions were correct (4 of 8 metastatic, 9 of 10 nonmetastatic, Fisher’s exact test P < 0.0883,1 - tail). Although the predictions do not reach statistical significance on this test, the use of the prediction strength threshold had benefited the resulting accuracy. Indeed, it is clear from their Figure 2 that if vote decisions where the prediction strength was low were not excluded, then only 4 of 9 M+, and 11 of 14 M0, profiles would have been correctly classified. The present paper introduces nonlinear filters that convert input signals corresponding to gene expression profiles into output signals that are much easier to classify into metastatic and nonmetastatic classes than the original profiles. In par- ticular, the effectiveness of this approach is investigated when very few training data are used. It is shown that when only the first three exemplars were assumed to be known from each class, a predictor was constructed that exhibits statistically significant performance over the test set. The nonlinear filters are found via a modeling technique known as parallel cascade identification (PCI), 4 which has been previously applied 5 to predict treatment response of a group of acute myeloid leukemia (AML) patients from their gene expression profiles. See also the review by Kirkpatrick. 6 Subsequently, the approach * To whom correspondence should be addressed. Tel: (613) 533-2931. Fax: (613) 353-1729. E-mail: [email protected]. 10.1021/pr034069s CCC: $27.50 2004 American Chemical Society Journal of Proteome Research 2004, 3, 91-96 91 Published on Web 01/13/2004

On Predicting Medulloblastoma Metastasis by Gene Expression Profiling

Embed Size (px)

Citation preview

On Predicting Medulloblastoma Metastasis by Gene Expression

Profiling

Michael J. Korenberg*

Department of Electrical and Computer Engineering, Queen’s University, Kingston, Ontario, Canada K7L 3N6

Received August 29, 2003

Accurately predicting clinical outcome or metastatic status from gene expression profiles remains oneof the biggest hurdles facing the adoption of predictive medicine. Recently, MacDonald et al. (Nat.Genet. 2001, 29, 143-152) used gene expression profiles, from samples taken at diagnosis, to distinguishbetween clinically designated metastatic and nonmetastatic primary medulloblastomas, helping toelucidate the genetic mechanisms underlying metastasis and suggesting novel therapeutic targets.The obtained accuracy of predicting metastatic status does not, however, reach statistical significanceon Fisher’s exact test, although 22 training samples were used to make each prediction via leave-one-out testing. This paper introduces readily implemented nonlinear filters to transform sequences of geneexpression levels into output signals that are significantly easier to classify and predict metastasis. Itis shown that when only 3 exemplars each from the metastatic and nonmetastatic classes were assumedknown, a predictor was constructed whose accuracy is statistically significant over the remaining profilesset aside as a test set. The predictor was as effective in recognizing metastatic as nonmetastaticmedulloblastomas, and may be helpful in deciding which patients require more aggressive therapy.The same predictor was similarly effective on an independent set of 5 nonmetastatic tumors and 3metastatic cell lines also used by MacDonald et al.

Keywords: metastasis • prediction • clinical outcome • gene expression • DNA chips • microarray

1. Introduction

An unsolved problem in predictive medicine is to buildeffective classifiers of gene expression profiles based on only afew exemplars of the classes to be distinguished. In thesecircumstances, statistically based prediction methods may beat a disadvantage, because they rely on estimating means,standard deviations, etc. of gene expression levels for each classfrom very few representatives, to choose the “best” genes fordistinguishing the classes. MacDonald et al.1 constructedpredictors of metastatic status by adapting a “weighted voting”(WV) technique introduced by Golub et al.2 for distinguishingbetween various acute leukemia classes. Subsequently, thisapproach, along with several others, has been employed topredict clinical outcome of medulloblastoma patients.3 TheMacDonald et al. study is intriguing and provides valuable cluesto the genetic regulation of metastasis, identifying some genescritical to this process and suggesting possible targets for noveltherapies, including use of specific inhibitors of platelet-derivedgrowth factor receptor-R as possible new treatments of medullo-blastoma.1 Moreover, prediction of metastasis has rarely beenattempted elsewhere.

Using Affymetrix G110 Cancer Chips, MacDonald et al.1

obtained gene expression profiles for 14 nonmetastatic (M0)and 9 metastatic (M+) medulloblastoma tumors. Over thisdataset, a leave-one-out protocol was employed, where each

time a predictor was trained on 22 profiles and then used topredict metastatic status of the remaining profile, provided thata “prediction strength” measure exceeded a pre-set threshold.The procedure was repeated until an attempt had been madeto classify each profile. Their predictors could make decisionsabout metastatic status for 18 of the 23 samples. Thirteen ofthe eighteen predictions were correct (4 of 8 metastatic, 9 of10 nonmetastatic, Fisher’s exact test P < 0.0883,1 - tail).Although the predictions do not reach statistical significanceon this test, the use of the prediction strength threshold hadbenefited the resulting accuracy. Indeed, it is clear from theirFigure 2 that if vote decisions where the prediction strengthwas low were not excluded, then only 4 of 9 M+, and 11 of 14M0, profiles would have been correctly classified.

The present paper introduces nonlinear filters that convertinput signals corresponding to gene expression profiles intooutput signals that are much easier to classify into metastaticand nonmetastatic classes than the original profiles. In par-ticular, the effectiveness of this approach is investigated whenvery few training data are used. It is shown that when only thefirst three exemplars were assumed to be known from eachclass, a predictor was constructed that exhibits statisticallysignificant performance over the test set. The nonlinear filtersare found via a modeling technique known as parallel cascadeidentification (PCI),4 which has been previously applied5 topredict treatment response of a group of acute myeloidleukemia (AML) patients from their gene expression profiles.See also the review by Kirkpatrick.6 Subsequently, the approach

* To whom correspondence should be addressed. Tel: (613) 533-2931.Fax: (613) 353-1729. E-mail: [email protected].

10.1021/pr034069s CCC: $27.50 2004 American Chemical Society Journal of Proteome Research 2004, 3, 91-96 91Published on Web 01/13/2004

was employed to predict failure/survivor outcome of medullo-blastoma patients,7 rather than metastatic status of primarytumors as in this paper.

Indeed, there are several significant ways that the presentpaper differs from the earlier applications5,7 of PCI to interpretgene expression profiles. First, as noted above, a critical issuein predictive medicine is to develop accurate classifiers whenthere are only a few known examples of the classes to bedistinguished. For example, after a Phase II trial, there may onlybe a few responders to an experimental cancer drug, but itwould be helpful to use their profiles to select likely respondersfor Phase III. Certainly, there have been few demonstrationsof gene-expression-based predictors developed on only threeexemplars from each class to be distinguished that have shownstatistically significant accuracy over an independent set. Thisis true for the PCI predictor created here, using only the firstexemplars of each class to build the model, and the next twoof each class to form references for classifying new profiles.Moreover, this predictor is at least as effective in recognizingmetastatic as nonmetastatic tumors. This is an importantdevelopment because the earlier predictor had much moredifficulty in recognizing M+ than M0 tumors, perhaps becauseof the heterogeneity of the metastatic tumors.1

In the medulloblastoma outcome study,7 only one exemplarper class was used to identify the PCI model per se. However,to classify the remaining 58 profiles, a leave-one-out protocolwas used in which 57 of the profiles provided reference outputsignals for classifying the held-out profile. Similarly, in the AMLstudy,5 again in a leave-one-out protocol, 12 profiles providedcertain PCI model architecture parameter values needed toclassify the held-out profile. Of course, this is not possible whenonly three examples from each class are assumed to be knownas in the main focus of the present paper.

Second, the present application does not search for the“best” parameter values, and number of genes to use, as iscommonly done in other papers concerned with developinggene-expression-based predictors. Rather, all of these valueswere taken from a different study,7 in order to rigorouslymaintain independence between the 6-profile training set andthe test set of remaining profiles. Third, even with only threeexemplars assumed known from each class, it is possible toconstruct a voting scheme of PCI predictors to further improveaccuracy, which was not shown in the two previous papers.5,7

Fourth, while the PCI model constructed with only the firsttraining exemplar of each class is the main focus here, it isalso shown how multiple expression profiles from each classcan be employed to construct the training input and identify

the model. Finally, a list of 22 genes is provided that are usefulin predicting metastatic status of medulloblastomas.

The approach taken here has special advantage when fewtraining exemplars are available, and represents a radicaldeparture from statistically based methods such as weightedvoting. Rather than attempting to select the “best” genes fordistinguishing between classes, PCI seeks to accentuate, for theselected genes, the model’s differential response to exemplarsfrom different classes. The model is identified from a traininginput derived from expression levels of a few class exemplarsat selected genes, and a training output defined to havedifferent values over portions of the input corresponding todifferent classes. The model’s memory, essentially the numberof consecutive input values needed to calculate an output value,is typically chosen to be much shorter than each segment ofthe training input corresponding to one class exemplar.Consequently, as the model “slides” over the training input andattempts to approximate the corresponding training output, itencounters many more training examples than the number ofexemplars used to create that input. Thus, the approach takesadvantage of much more information inherent in gene expres-sion profiles than other approaches are currently using.

The challenge here of constructing effective predictors fromvery few exemplars affords no opportunity to “tune” the modelover additional known examples of M+ and M0 classes. Hence,all required architectural parameter values for the model, aswell as the number of genes used, and the method of selectinggenes, were implemented exactly as previously published forthe outcome prediction study7 (that had involved 60 patients).Note that the Affymetrix HuGeneFL arrays used there, contain-ing 7129 expression levels per profile, differ from microarraysanalyzed in the present study with 2059 expression levels ineach profile. The latter are the same profiles used by Mac-Donald et al.1 (i.e., the raw values given after each data set hadbeen normalized by them), obtained from samples taken inthe initial diagnostic biopsy.

Each profile contains 2059 integer values, the gene expres-sion data from an M0 or M+ medulloblastoma or a medullo-blastoma cell line, indicating the degree of expression ofselected genes in the medulloblastoma that are represented onthe G110 Cancer Array. RNA from a medulloblastoma biopsyor cell line is converted into cDNA and thence into fluorophore-labeled cRNA that is fragmented to approximately 200 bp piecesand hybridized to the array.1 The raw expression values arederived for the genes by measuring the absolute fluorescenceintensities for their respective probe-sets on the chip, and werescaled to normalize the data for comparison between arrays.1

Figure 1. Parallel cascade model used to predict medulloblastoma metastasis. Each L is a dynamic linear element; each N is a polynomialstatic nonlinearity.

research articles Korenberg

92 Journal of Proteome Research • Vol. 3, No. 1, 2004

A larger value indicates greater expression by the gene. Furtherdetails are provided by MacDonald et al.1 concerning how thegene expression profiles were obtained.

2. Building a Nonlinear Filter

Use of PCI for class prediction from gene expression profileshas been described recently,5 and is now summarized. In brief,given one or more exemplars from each class to be distin-guished, the first step is to select genes that assist in distin-

guishing between the two classes. In the outcome predictionstudy,7 the most effective predictor was constructed by selectingthe 22 genes with greatest difference in raw expression levelsbetween the first failure and survivor profiles and using thesetwo profiles to create the training input. The same strategy wasfollowed here. Hence, the first M+ and M0 profiles, respectivelydesignated as M1 and N2 by MacDonald et al.,1 were comparedto find the 22 genes with greatest difference in raw expressionlevels between these profiles. The corresponding 22 raw valuesfrom profile M1 were appended, in the same order as in theprofile, to form an M+ segment, and an M0 segment wassimilarly prepared from profile N2. The two segments werespliced together to form a 44-point training input, with thetraining output defined as -1 over the M+ segment and 1 overthe M0 segment of the training input.7 Clearly, a nonlinearsystem having this input/output relation would function as apredictor of metastatic status, at least for the two trainingprofiles, and its output values would be expected to be negativefor M+ and positive for M0 profiles.

The training input and output were then used to identify aparallel cascade model of the form in Figure 1 as explainedearlier.4,5 In Figure 1, each linear element L is dynamic, i.e., itsoutput at instant i depends not only on its input at instant ibut also at instants i - 1,...,i - R, so the memory length is saidto be R+1; each static nonlinearity N is a polynomial. Earlier,to model discrete-time nonlinear systems, Palm8 had intro-duced a parallel array of LNL cascades, where the staticnonlinearities were logarithmic and exponential functions,rather than the polynomials used here.

In another case below, the first four exemplar profiles of eachclass (respectively designated1 as M1, M2, M3, M4 from M+,and N2, N3, N4, N5 from M0) were employed to create thetraining input. As described above, each profile consists of 2059integer values reflecting the expression levels of specific genesrepresented on the microarray, and the 22 genes (Table 1) usedin forming the training input were selected as follows. For eachgene, the mean of its raw expression values was computed overthe four M+ training profiles, and the mean was also computedover the four M0 training profiles. Then the absolute value ofthe difference between the two means was computed for thegene. The 22 genes having the largest of such absolute valueswere selected. Four segments each of M+ and M0 type wereprepared, similarly to above, maintaining the same order ofappending the raw values as they had in the original profiles.The eight segments were spliced together to form the 176-pointtraining input x(i) in Figure 2a. The corresponding trainingoutput y(i) (Figure 2b, solid line) was defined as -1 over M+and 1 over M0 segments of the input. The training input andoutput were used to identify a PCI model (Figure 1), and Figure2b (dotted line) shows the resulting model output z(i) whenstimulated by the training input. Thus, Figure 2, parts a and b,is for the case of training inputs constructed from multipleexemplars per family, whereas only single-exemplar figureswere presented in the two earlier papers.5,7

To identify the PCI model, certain parameter settings relatedprimarily to its architecture had to be pre-specified.7 These wereas follows: the memory length of each linear element L, thedegree of each polynomial N, the maximum number ofcascades allowed in the model, and a threshold regulating therequired reduction in mean-square error for allowing a can-didate cascade into the model. As noted above, these parametersettings were not established by selecting values that areoptimal for the present data set. In the main focus of this paper,

Figure 2. (a) Training input x(i) formed by splicing together theraw expression levels of genes from the first four metastatic (M+)profiles and first four nonmetastatic (M0) profiles. The genesused (Table 1) were the 22 having greatest difference in expres-sion levels between the M+ and M0 training profiles. (b) Trainingoutput y(i) (solid line) defined as -1 over the M+ portions of thetraining input and 1 over the M0 portions. The training input andoutput were used to identify a parallel cascade model of the formin Figure 1. The dotted line represents calculated output z(i) whenthe identified model is stimulated by training input x(i). Note thatz(i) is predominately negative over the M+ portions, and positiveover the M0 portions, of the training input. The identified model’sability to separate metastatic and nonmetastatic profiles isexploited by replacing the profiles with corresponding modeloutput signals that are easier to classify and predict metastasis.

Predicting Medulloblastoma Metastasis research articles

Journal of Proteome Research • Vol. 3, No. 1, 2004 93

only three known exemplars are assumed from each class. Thisnecessitated employing published parameter settings tailoredfor a different study, concerned with predicting clinical out-come.7 There it was found that a memory length of 4, apolynomial degree of 5, two cascades allowed in the model,and a threshold of 6, resulted in good classification accuracy.Specifically, these values were found by systematically tryingdifferent parameter settings, obtaining the corresponding PCImodel, and observing the resulting classification accuracy overa test set of 58 profiles.7 Although the microarrays differedbetween that study and the present one, and the class distinc-tions (failure/survivor outcome, metastatic/nonmetastatic sta-tus of primary tumors) are different, these values were directlyadopted here so as to minimize the number of exemplars thathad to be known from each class.

3. Classifying a Test Profile

The classification procedure set out previously7 was followedexactly. The raw expression values of the 22 previously selectedgenes were appended in the same order used above to forman input signal corresponding to the test profile. This inputsignal was fed to the PCI model to obtain the output signalthat was compared, using correlation, with reference outputsignals for both the M+ and M0 classes. In the principle caseconsidered below, only the first three exemplars from each classwere employed to construct the predictor, with all other profilesreserved for testing.

In particular, M+ profile M1 and M0 profile N2 were usedto create the training input for identifying the PCI model, andreference output signals from the model were prepared forknown M+ exemplars M2, M3, and M0 exemplars N3, N4.Then, if z(0)(i) represents the output signal for the test profile,and z(j)(i) one of the reference output signals, with i ) 1,...,22,the correlation coefficient

was calculated, where z(0) and z(j) denote the average of z(0)(i)and z(j)(i) respectively over i ) 4,...,22. Note that the first threepoints of each 22-point output signal were not used, to allowthe model to “settle”, because the memory length was 4. Thetest output signal was assigned the class of the reference outputsignal it was most positively correlated with; i.e., the correlationcoefficient was largest.

4. Predicting Metastatic Status

As noted above, the initial PCI predictor was constructedfrom only the first three M+ and M0 profiles (respectivelydesignated1 as M1, M2, M3 and N2, N3, N4). None of theremaining profiles were used here, rather they were reservedas an independent set for testing.

First, the nonlinear filter, in the form of the parallel cascademodel in Figure 1, was identified from a training input derivedonly from M1 and N2, as earlier set out. Recall that the samearchitectural parameter values and number (22) of genes wereused as in the outcome prediction study.7 Second, the modelwas employed to obtain reference output signals correspondingto the remaining 2 known profiles (M2, M3 and N3, N4) fromeach class. Using correlation with the reference outputs fromthe model to predict class, as explained above, yielded thesetest results: 5 of the 6 novel M+, and 8 of the 11 novel M0,profiles were correctly classified (Matthews9 correlation coef-ficient φ ) 0.54, Fisher’s exact test P < 0.043 1-tail, P < 0.052-tail). Corresponding results for WV,1 from leave-one-outtraining with 22 profiles and predicting class of the held-outprofile, were φ ) 0.44, Fisher’s exact test P < 0.0883 1-tail, P <0.118 2-tail.

Even with only the three exemplars assumed known fromeach class, it was possible to study how the choice of profilesfor training the PCI model influenced the resulting accuracy.Here, it is important that the test set, consisting of theremaining profiles, stays the same for each of the modelsconstructed. In addition to the above model, whose traininginput was derived from profiles M1 and N2, two other modelswere identified. Thus, a training input derived from M2 andN3 was used to identify a second model, with M1, M3 and N2,N4 employed to obtain reference output signals for classifying

Table 1. Twenty-Two Genes Used to Predict Medulloblastoma Metastasis

position in profile

(1-2059) description

90 M33764cds human ornithine decarboxylase gene, complete cds115 M11717mRNA human heat shock protein (hsp 70) gene, complete cds219 D13748 HUM4AI human mRNA for eukaryotic initiation factor 4AI467 D78577expanded D78576S2 human DNA for 14-3-3 protein eta chain; exon2 and complete cds744 M55409 human pancreatic tumor-related protein mRNA, 3′ end763 D11139exons#1-4 HUMTIMP human gene for tissue inhibitor of metalloproteinases; partial sequence1078 X58965 H. sapiens RNA for nm23-H2 gene1083 X73066cds Homo sapiens NM23-H1 mRNA1138 M55914 HUMCMYCQ human c-myc binding protein (MBP-1) mRNA; complete cds1168 L19182 HUMMAC25X human MAC25 mRNA; complete cds1194 D17517 HUMSKY human sky mRNA for Sky; complete cds1291 HG4322-HT4592 Tubulin, â1423 V00567cds HSMGLO human messenger RNA fragment for the â-2 microglobulin1570 M94250expanded human retinoic acid inducible factor (MK) gene exons 1-5, complete cds1664 J03040 human SPARC/osteonectin mRNA, complete cds1669 J04164 HUM927A human interferon-inducible protein 9-27 mRNA; complete cds1684 J02783mRNA HUMTHBP human thyroid hormone binding protein (p55) mRNA; complete cds1762 D00017 HUMLIC Homo sapiens mRNA for lipocortin II; complete cds1822 U21689cds human glutathione S-transferase-P1c gene; complete cds1863 M93311cds human metallothionein-III gene, complete cds1871 M29386mRNA HUMPRLA human prolactin mRNA; 3′ end1949 HG1980-HT2023 Tubulin, â 2

r )

∑i)4

22

(z(j)(i) - z(j))(z(0)(i) - z(0))

(∑i)4

22

(z(j)(i) - z(j))2)(1/2)(∑i)4

22

(z(0)(i) - z(0))2)(1/2)

research articles Korenberg

94 Journal of Proteome Research • Vol. 3, No. 1, 2004

novel profiles via correlation. Another model was identifiedfrom M3 and N4, whereas exemplars M1, M2 and N2, N3 wereused to obtain reference output signals.

Although there were similarities in performance between thethree models, there were some differences that benefited overallaccuracy. For the first predictor, only M+ profile M20, and M0profiles N15, N24, N25, were misclassified. It is of interest thatthe three predictors never unanimously misclassified a profile;rather, every unanimous decision (for profiles M4, N18, N22,and N2+box2) was correct. In addition, the second and thirdpredictors reversed the misclassification of profile M20 by thefirst predictor, but both of the former erred on profile M28.Overall, no other classification of M+ profiles was affected, sothat by majority vote there was still one M+ misclassification.Next, the second and third predictors corrected the misclas-sification of M0 profiles N24, N25 by the first predictor, butcontributed one new error, profile N16. Overall, no other M0classifications were affected, so by majority vote there were onlytwo M0 misclassifications, namely N15, N16. In summary,constructing three PCI predictors to vote on the final clas-sification decisions slightly raised accuracy: 5 of 6 remainingM+, 9 of 11 remaining M0 (φ ) 0.63, Fisher’s exact test P <0.018 1- or 2-tail). Thus, although the choice of exemplars usedfor training the PCI model does have some effect on theclassification results, the differences may be used constructivelyto increase the accuracy over that of an individual predictor.

The first PCI predictor was next tested over a second datasetof five M0 “validation” medulloblastomas and three M+ celllines (DAOY, D283, D341) also used previously.1 Recall that onlyprofiles M1 and N2 had been employed to obtain the PCImodel, with the next two profiles from each class used to obtainreference output signals. The predictor correctly classified 4of 5 validation profiles, and 2 of 3 cell lines (misclassifyingD283). This performance, starting with 6 known profiles, isroughly comparable to that of the WV predictor1 trained on all23 profiles from the first set, which could make classificationdecisions (all correct) on 4 of 5 validation medulloblastomas,and 1 of 3 cell lines. Overall, the PCI predictor’s accuracy was7 of 9 M+ and 12 of 16 M0 novel profiles (φ ) 0.51, Fisher’sexact test P < 0.017, 1- or 2-tail).

Finally, when 4 known profiles from each class producedthe training input (Figure 2a), and the remaining profiles fromthe first set produced reference outputs from the model, thePCI predictor correctly classified all 3 cell lines, and 4 of 5validation medulloblastomas. Moreover, the sign was examinedof the values of the 3 metastatic and 5 nonmetastatic outputsignals, using every 4th point of each output signal to avoid anyoverlap in gene expression levels used to obtain such outputvalues (see Statistical Considerations). The sign of these outputvalues correlated with actual metastatic status (Fisher’s exacttest P < 0.0193, 1-tail).

5. Statistical Considerations

In addition to calculating Fisher’s exact probability, the utilityof the PCI model was checked by classifying the test 22-pointinput signals through correlation with reference input signals,rather than using the corresponding output signals from themodel. Classification without first obtaining model outputsignals dropped the accuracy to 2 of 6 novel M+ and 6 of 11novel M0 profiles in the original data set, a negative Matthews’correlation (coefficient φ ) - 0.12) with actual metastatic status.This should be compared with the first predictor’s accuracy of

5 of 6 novel M+ and 8 of 11 novel M0, which was obtainedusing the output signals, showing the value of the PCI model.

All PCI models employed here had memory length of 4.Hence, when the sign was examined of the values of the outputsignals for the 3 cell lines and 5 validation medulloblastomas,every 4th point of each output signal was used to avoid anyoverlap in gene expression levels used to obtain such outputvalues. This gave rise to 25 such output values correspondingto the 5 M0 validation medulloblastomas, of which 7 valueswere negative and 18 were positive. For the three M+ cell lines,there were 15 such output values with 10 negative and 5positive. Indeed, the sign of these output values was correlatedwith actual metastatic status (Fisher’s exact test P < 0.0193,1-tail). A 1-tail test was used here since, because of the waythe model had been trained, output signals corresponding toM0 and M+ profiles are expected to have positive and negativevalues, respectively.

6. Conclusion

The above results show the capability of PCI to buildaccurate and reliable predictors from small amounts of trainingdata, and suggest a new approach to microarray analysis.Certainly, achieving the above accuracy required that the threeM+ and M0 exemplars actually be representative of theirrespective classes, otherwise no method could succeed. More-over, if the training exemplars do not well cover the variety ofprofile types in each class, then additional exemplars will clearlybe needed beyond the three per class used above. A key pointof the present paper is that a predictor was developed thatexhibited good accuracy in predicting medulloblastoma me-tastasis and in particular was effective in recognizing M+tumors. The predictor can now be investigated as one com-ponent in a scheme for predicting which patients can safelyforego radiation therapy, an important issue raised by Mac-Donald et al.1

It is notable that, given good class representatives fortraining, PCI is highly efficient at building models that separateM+ and M0 profiles, compared with weighted voting trainedon almost four times the data. As further exemplars of theclasses to be distinguished become available, the above parallelcascade filters can be used to obtain the corresponding outputsignals, and the known classes of these additional referenceoutput signals lead to increased accuracy in classifying novelprofiles. Also, although nearest neighbor was used here toclassify output signals successfully, it remains open to inves-tigate whether other predictors, e.g., based on support vectormachines,10 artificial neural networks,11 K-means-clustering,12

or PCI, are better suited to distinguishing these output signals.

References

(1) MacDonald, T. J.; Brown, K. M.; LaFleur, B.; Peterson, K.; Lawlor,C.; Chen, Y.; Packer, R. J.; Cogen, P.; Stephan, D. A. Expressionprofiling of medulloblastoma: PDGFRA and the RAS/MAPKpathway as therapeutic targets for metastatic disease. Nat. Genet.2001, 29, 143-152. Datasets: http://microarray.cnmcresearch.org/datafiles/MacDonaldetal.xls.

(2) Golub, T. R.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek,M.; Mesirov, J. P.; Coller, H.; Loh, M. L.; Downing, J. R.; Caligiuri,M. A.; Bloomfield, C. D.; Lander, E. S. Molecular classification ofcancer: class discovery and class prediction by gene expressionmonitoring. Science 1999, 286, 531-537.

(3) Pomeroy, S. L.; Tamayo, P.; Gaasenbeek, M.; Sturla, L. M.; Angelo;M.; McLaughlin, M. E.; Kim, J. Y. H.; Goumneroval, L. C.; Black,P. M.; Lau, C.; Allen, J. C.; Zagzag, D.; Olson, J. M.; Curran, T.;

Predicting Medulloblastoma Metastasis research articles

Journal of Proteome Research • Vol. 3, No. 1, 2004 95

Wetmore, C.; Biegel, J. A.; Poggio, T.; Mukherjee, S.; Rifkin, R.;Califano, A.; Stolovitzky, G.; Louis, D. N.; Mesirov, J. P.; Lander,E. S.; Golub, T. R. Prediction of central nervous system embryonaltumour outcome based on gene expression. Nature 2002, 415,436-442.

(4) Korenberg, M. J. Parallel cascade identification and kernelestimation for nonlinear systems. Ann. Biomed. Eng. 1991, 19,429-455.

(5) Korenberg, M. J. Prediction of treatment response using geneexpression profiles. J. Proteome Res. 2002, 1, 55-61.

(6) Kirkpatrick, P. Look into the future. Nat. Rev. Drug Discovery 2002,1 (5), 334.

(7) Korenberg, M. J. Gene expression monitoring accurately predictsmedulloblastoma positive and negative clinical outcomes. FEBSLett. 2003, 533, 110-114.

(8) Palm, G. On representation and approximation of nonlinearsystems. Part II: Discrete time. Biol. Cybern. 1979, 34, 49-52.

(9) Matthews, B. W. Comparison of the predicted and observedsecondary structure of T4 phage lysozyme. Biochim. Biophys. Acta1975, 405, 442-451.

(10) Yeang, C.-H.; Ramaswamy, S.; Tamayo, P.; Mukherjee, S.; Rifkin,R. M.; Angelo, M.; Reich, M.; Lander, E.; Mesirov, J.; Golub, T.Molecular classification of multiple tumor types. Bioinformatics2001, 17, Suppl. 1, S316-S322.

(11) Khan, J.; Wei, J. S.; Ringner, M.; Saal, L. H.; Ladanyi, M.;Westermann, F.; Berthold, F.; Schwab, M.; Antonescu, C. R.;Peterson, C.; Meltzer, P. S. Classification and diagnostic predictionof cancers using gene expression profiling and artificial neuralnetworks. Nat. Med. 2001, 7, 673-679.

(12) Tavazoie, S.; Hughes, J. D.; Campbell, M. J.; Cho, R. J.; Church,G. M. Systematic determination of genetic network architecture.Nat. Genet. 1999, 22, 281-285.

PR034069S

research articles Korenberg

96 Journal of Proteome Research • Vol. 3, No. 1, 2004