
Toward breast cancer survivability prediction models through improving training space



Expert Systems with Applications 36 (2009) 12200–12209


Toward breast cancer survivability prediction models through improving training space

Jaree Thongkam *, Guandong Xu, Yanchun Zhang, Fuchun Huang
School of Computer Science and Mathematics, Victoria University, P.O. Box 14428, Melbourne, Vic. 8001, Australia

Article info

Keywords: Data mining; Outliers; Over-sampling; Breast cancer survivability prediction models

0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2009.04.067

* Corresponding author. Tel.: +61 9919 4228; fax: +61 9919 4050. E-mail address: [email protected] (J. Thongkam).

Abstract

Due to the difficulties of outlier and skewed data, the prediction of breast cancer survivability has presented many challenges in the field of data mining and pattern recognition, especially in medical research. To solve these problems, we have proposed a hybrid approach to generating higher quality data sets in the creation of improved breast cancer survival prediction models. This approach comprises two main steps: (1) utilization of an outlier filtering approach based on C-Support Vector Classification (C-SVC) to identify and eliminate outlier instances; and (2) application of an over-sampling approach using over-sampling with replacement to increase the number of instances in the minority class. In order to assess the capability and effectiveness of the proposed approach, several measurement methods including basic performance (e.g., accuracy, sensitivity, and specificity), Area Under the receiver operating characteristic Curve (AUC) and F-measure were utilized. Moreover, a 10-fold cross-validation method was used to reduce the bias and variance of the results of breast cancer survivability prediction models. Results have indicated that the proposed approach improves the performance of breast cancer survivability prediction models by up to 28.34% through the improved training data space.


1. Introduction

Breast cancer is the second most frequent cause of cancer incidence among women in Thailand, with an estimated incidence rate of 17.2 per 10,000 in 1995–1997 (National Cancer Institute of Thailand, 2006). It is the most common cause of cancer death among women in Thailand, and has been increasing, with more than 5000 new cases reported every year (Thongsuksai, Chongsuvivatwong, & Sriplung, 2000). The main contributing factors are lifestyle changes, dietary patterns, and genetic issues (Srinivasan, Chandrasekhar, Seshadri, & Jonathan, 2005). Although several research studies have analyzed breast cancer data sets related to breast cancer diagnosis (Bridgett, Brandt, & Harris, 1995; Fang & Ng, 1993; Wang, Wu, Liang, & Guo, 2006; Wang, Xu, Wang, & Zhang, 2006) and addressed the prediction of breast cancer outcomes (Bellaachia & Guven, 2006; Delen, Walker, & Kadam, 2005; Ryu, Chandrasekaran, & Jacob, 2007; Thongkam, Xu, & Zhang, 2008; Thongkam, Xu, Zhang, & Huang, 2008a, 2008b), further research into this field will enable patients to have an idea of the prognosis of the likely course and outcome of their disease.

Medical prognoses apply various methods to historical data in order to predict the survivability of particular patients suffering from a disease over a particular time period, using traditional analytical techniques such as Kaplan–Meier and Cox proportional hazards (Borovkova, 2002). More recently, because automated computing tools allow large volumes of medical data to be stored, retrieved and made available to the medical research community, there has been increasing interest in developing prediction models using a new method of survival analysis entitled period analysis. This kind of analysis is used to monitor survival rates and provide up-to-date estimates of long-term patient survival (Brenner, Gefeller, & Hakulinen, 2002; Delen et al., 2005).

Data mining methods are gaining popularity as research tools for medical researchers who seek to identify and exploit patterns and prediction models. These methods have proven to be more powerful than traditional statistical methods (Ohno-Machado, 2001; Xiong, Kim, Baek, Rhee, & Kim, 2005) in discovering useful patterns or models in data sets of most sizes (Han & Kamber, 2006).

In comparative studies using data mining methods, Delen et al. (2005) utilized three techniques, namely logistic regression, artificial neural networks (ANN) and decision trees (C5), to build a 5-year breast cancer survivability prediction model from SEER databases. The attributes consisted of 11 category attributes and five numeric attributes. Results showed that the decision tree model outperformed both ANN and logistic regression on the three measurement methods of accuracy, sensitivity and specificity. Moreover, Yi and Fuyong (2006) applied Support Vector Machine (SVM) to discover breast cancer diagnosis patterns in data from the University of Wisconsin Hospitals.


Their results indicated that SVM was suited to breast cancer diagnosis. Furthermore, Ryu et al. (2007) employed an isotonic separation technique to predict breast cancer survival rates in both the Wisconsin breast cancer diagnosis and Ljubljana breast cancer recurrence data sets. They found that the isotonic separation technique outperformed the decision tree (C4.5), Robust LP, and SVM with a Gaussian kernel. In short, many research studies have concentrated on selecting suitable learning algorithms. However, improving data quality is also of concern in building accurate prediction models, especially for medical data sets (Podgorelec, Hericko, & Rozman, 2005; Thongkam et al., 2008, 2008a, 2008b), because medical data are commonly collected without any specific research purpose (Li, Fu, He, Chen, & Kelman, 2005; Tsumoto, 2000). Accordingly, these data have specific quality problems, including missing, outlier and skewed data. These problems frequently and directly affect the performance of prediction models (Podgorelec et al., 2005): most algorithms handle missing data very well, but rarely handle outlier and skewed data (Brodley & Friedl, 1999; Pelayo & Dick, 2007). Consequently, the basic idea of this paper is to combine outlier and skewed data handling methods to improve the quality of data sets.

Outliers refer to records that do not follow the common rules and thus affect a model's performance. For example, patients who have stage I breast cancer and are aged less than 30 years should normally be categorized as 'alive'; but if such patients died of other causes, they are recorded as 'dead' in the data set. In this case we treat the instance as an outlier. To address such problems, three common outlier handling approaches have been used: robust algorithms, outlier filtering, and correction of outlier instances (Brodley & Friedl, 1999). However, apart from being unstable in correcting and cleaning unwanted instances, outlier correction methods are usually more computationally expensive than robust algorithms and outlier filtering techniques (Brodley & Friedl, 1996). Several research studies have therefore utilized outlier filtering approaches to help improve the performance of classifiers. For example, Verbaeten and Assche (2003) employed Inductive Logic Programming (ILP) and a first-order decision tree algorithm to construct their ensembles. Their technique started with an outlier-free data set, added different levels of classification outliers to it, and evaluated the effect using a decision tree. Their results showed that the accuracy of the decision tree decreased rapidly as the level of noise increased. Furthermore, Blanco, Ricket, and Martín-Merino (2007) combined multiple dissimilarities and Support Vector Machine (SVM) to filter out spam messages when processing e-mail data. Their results illustrated that the combination of multiple dissimilarities and SVM performed better than using a single dissimilarity alone.

Skewed data refers to the problem of an imbalanced data set, in which the instances of one class outnumber those of the other classes (Wang, Wu, et al., 2006; Wang, Xu, et al., 2006; Xie & Qiu, 2007). Classification algorithms usually exhibit poor performance when dealing with skewed data sets, and results are biased towards the majority class (Padmaja, Dhulipalla, Bapi, & Krishna, 2007). In relation to skewed problems, much research has used re-sampling approaches, including under-sampling and over-sampling (Estabrooks, Jo, & Japkowicz, 2004; Wang, Wu, et al., 2006; Wang, Xu, et al., 2006; Xie & Qiu, 2007). Under-sampling decreases the size of the majority class to that of the minority class, whereas over-sampling increases the size of the minority class to that of the majority class. Much research has utilized such re-sampling approaches to handle skewed data. For example, Barandela, Sánchez, García, and Rangel (2003) evaluated the

under-sampling approach using Classical Wilson, the k-Nearest Centroid Neighborhood (k-NCN), and a modified selective method on four data sets (Phoneme, Satimage, Glass and Vehicle) from the UCI Databases Repository. Their results showed that the under-sampling approach was ineffective in improving the performance of classifiers. However, Alejo, Garcia, Sotoca, Mollineda, and Sánchez (2006) successfully utilized an under-sampling approach using the Nearest Neighbor rule to reduce the majority class. Their results demonstrated that this approach was suitable for enhancing the classification accuracy of neural networks. In contrast, Pelayo and Dick (2007) utilized the synthetic minority over-sampling technique (SMOTE) to increase the number of instances in the minority class. Their results showed that SMOTE improved the average geometric-mean classification accuracy by 23% on four benchmark data sets from NASA projects (available at the PROMISE Repository of Software Engineering databases). However, producing synthetic instances using the SMOTE approach did not seem to fit well, since the new instances could lead to misinterpretation of patterns.

Given the above-mentioned challenges of training data quality when utilizing learning algorithms in data mining, in this paper we propose a hybrid approach to improving the quality of breast cancer survivability data sets in order to enhance the performance of breast cancer prediction models and yield better classification results. This hybrid approach consists of two main steps: (1) using the outlier filtering approach to filter out outliers from the data sets; and (2) using the over-sampling approach to increase the size of the minority class to the same size as the majority class. We suggest that this proposed approach improves the performance of prediction models generated from well-known algorithms including AdaBoost, Bagging, C4.5 and Support Vector Machine (SVM). The effectiveness of the hybrid approach was evaluated using five classification measures: accuracy, sensitivity, specificity, Area Under the receiver operating characteristic Curve (AUC) and F-measure. In addition, 10-fold cross-validation was used to divide the breast cancer survivability data sets into training and test sets. The training set was used to build the prediction model, while the test set was used to evaluate the model.

The remainder of this paper is organized as follows. Section 2 introduces the research approaches, and Section 3 illustrates the methodologies and evaluation methods used in this study. Experimental results and discussion are presented in Sections 4 and 5, respectively. The conclusion and an outline of future work are given in Section 6.

2. Research approaches

In this section, the outlier filtering and over-sampling approaches are described. Following this, a combined approach is proposed and illustrated.

2.1. Outlier filtering approach

An outlier filtering approach commonly utilizes distance measures to detect outlier instances that lie at a substantial distance from the others. Instance-based learners, which employ k-Nearest Neighbors (k-NN), face the challenging task of eliminating outliers in order to improve the performance of classifiers (Brodley & Friedl, 1996; Han & Kamber, 2006). In this paper, the C-Support Vector Classification Filter (C-SVCF) algorithm (Thongkam et al., 2008, 2008a, 2008b) was used to identify and eliminate outliers in the breast cancer survivability data sets. The C-SVCF algorithm employs the C-Support Vector Classification (C-SVC) algorithm (Vapnik, 1998) with the radial basis kernel function as its outlier identification method.

Input:  D: a training data set; N: the number of instances in D
Output: F: a filtered data set; O: an outlier data set

1) Empty F and O;
2) Train T using C-SVC(D);
3) Assign i = 1;
4) If D(i) ∈ T then
5)    insert D(i) into F
6) else insert D(i) into O end if;
7) Increase i by 1, then go to step 4) and repeat until i = N, then go to step 8); and
8) Return F, O.

Fig. 1. C-SVCF algorithm.


C-Support Vector Classification (C-SVC) is a binary classification method in the Support Vector Machine (SVM) family (Xiao, Khoshgoftaar, & Seliya, 2005), a new-generation learning algorithm based on recent advances in statistics, machine learning and pattern recognition (Yin, Yin, Sun, & Wu, 2006). The eight steps of the C-SVCF algorithm are shown in Fig. 1.
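The following is a minimal Python sketch of the filtering idea in Fig. 1, assuming scikit-learn's SVC as the C-SVC implementation; the function name c_svcf_filter and its parameter defaults are illustrative, not from the paper.

from sklearn.svm import SVC

def c_svcf_filter(X, y, C=1.0, gamma="scale"):
    """Split (X, y) into a filtered set F and an outlier set O."""
    clf = SVC(C=C, kernel="rbf", gamma=gamma)  # C-SVC with radial basis kernel
    clf.fit(X, y)                              # step 2: train T using C-SVC(D)
    keep = clf.predict(X) == y                 # steps 4-7: is D(i) consistent with T?
    return (X[keep], y[keep]), (X[~keep], y[~keep])  # step 8: return F, O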

2.2. Over-sampling approach

An over-sampling approach is commonly used for the problem of imbalanced data, because it can significantly improve the performance of classifiers (Barandela et al., 2003; Estabrooks et al., 2004; Xie & Qiu, 2007). This approach is a non-heuristic method that balances the class distribution through random replication of the minority class (Xie & Qiu, 2007). It increases the training data size and enhances the performance of classifiers (Pelayo & Dick, 2007). In this paper, we utilized the over-sampling approach to rebalance the imbalanced breast cancer survivability data sets by calculating the ratio between the majority and minority classes. Following this, over-sampling with replacement was performed to increase the size of the minority class according to the calculated ratio; a minimal sketch is given below. In this way, the majority and minority classes reach a similar size, which lessens the effect of imbalance on the performance of classifiers when predicting unseen data.
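A minimal sketch of over-sampling with replacement: minority instances are drawn at random, with replacement, until the two classes are the same size. The helper name oversample_minority is an illustrative assumption.

import numpy as np

def oversample_minority(X, y, seed=None):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    mi_idx = np.flatnonzero(y == minority)
    n_extra = counts.max() - counts.min()                    # instances needed for balance
    extra = rng.choice(mi_idx, size=n_extra, replace=True)   # sample with replacement
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]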

2.3. Combined approach

In medical databases, raw data commonly contain outliers (Han & Kamber, 2006) and skewed data (called imbalances) (Jonsdottir, Hvannberg, Sigurdsson, & Sigurdsson, 2008), both of which affect the performance of classifiers (Weiss & Provost, 2003; Xie & Qiu, 2007). To improve data quality in data sets with outlier and imbalance problems, much research has combined outlier filtering and re-sampling approaches on fraud detection data sets. For instance, Padmaja et al. (2007) employed k-Nearest Neighbors (k-NN) to eliminate outliers in the minority class, applied over-sampling to increase the size of the minority class, and applied under-sampling to reduce the size of the majority class. In contrast, our framework starts from cleaned data sets (without duplicated and missing data). Following this, we apply an outlier filtering approach using C-Support Vector Classification to identify misclassified instances (i.e., outliers) in both classes. We then apply the over-sampling approach to the minority class. The combination of outlier filtering and over-sampling approaches (called OOS) is illustrated in Fig. 2.

Fig. 2 shows the OOS framework, which consists of the following four steps (a minimal code sketch follows the list):

Step 1: the C-Support Vector Classification Filter (C-SVCF) approach is used to identify and eliminate outliers from both the 'dead' and 'alive' classes in the original data set;
Step 2: the data set is divided into minority and majority classes;
Step 3: the over-sampling approach is employed to increase the size of the minority class to the same size as the majority class, using the ratio between the majority and minority classes (see Table 3); and
Step 4: the majority and minority classes are combined into a new, balanced data set.
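Under the same assumptions, the two sketches above combine into the OOS pipeline of Steps 1-4 (with binary classes, Steps 2-4 reduce to rebalancing the minority class):

def oos(X, y):
    (Xf, yf), _outliers = c_svcf_filter(X, y)  # Step 1: identify and drop outliers
    return oversample_minority(Xf, yf)         # Steps 2-4: rebalance and recombine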

The aim of this research framework is to improve the quality of the data sets in order to enhance the classifiers' performance. We utilized several measurement methods, including basic performance (e.g., accuracy, sensitivity and specificity), Area Under the receiver operating characteristic Curve (AUC) and F-measure, to evaluate the capability and effectiveness of the proposed approach and to compare it with the outlier filtering and over-sampling approaches.

3. Experimental design and evaluation methods

In this section, we first describe the breast cancer data preparation used in these experiments. Then we present the performance evaluation methods used: accuracy, sensitivity, specificity, AUC and F-measure.

3.1. Data sets

The breast cancer survivability data sets were obtained from Srinagarind Hospital in Thailand. The data include patient information and the treatment choices of patients who were diagnosed with breast cancer in 1985–2006. The breast cancer survivability data consist of 4312 instances and 26 attributes. Descriptive statistics showed that some attributes have more than 30 missing values while some attributes have only one value; the reason is that some patients were diagnosed at Srinagarind but received treatment at other hospitals. The final data contain 14 attributes, which are presented in Table 1.

In order to analyze the performance and effectiveness of the approach, we generated 10 breast cancer survivability data sets using different survival periods from 1 to 10 years. Each data set therefore has its own number of instances and attribute values; however, the class attribute is the same in all data sets. The class attribute takes two values: '0' refers to patients who died before the end of the period, and '1' refers to patients who were alive after the period. For example, for the 5-year survival period, patients who survived less than 60 months were coded as 0 ('dead'); otherwise they were coded as 1 ('alive'), as in the small sketch below. The number of instances per class is shown in Table 2.
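A small sketch of this coding rule; the column name survival_months is a hypothetical example, not the hospital data's actual field name.

import pandas as pd

def code_class(df: pd.DataFrame, years: int) -> pd.Series:
    # 0 = 'dead' before the end of the period, 1 = 'alive' after it
    return (df["survival_months"] >= years * 12).astype(int)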

Table 2 shows the class distribution in the data sets, which is problematic in our case. Lack of data quality has been a problem for prediction models for many years, which indicates a need to understand a data set before constructing prediction models from it.

3.2. Evaluation methods

In the experiments, we applied three evaluation methods: basic performance measures, the AUC score and the F-measure. These evaluation methods are based on a confusion matrix. The confusion matrix is a visualization tool commonly used to present the performance of prediction models or classifiers in classification tasks (Han & Kamber, 2006). It shows the relationships between the real class attributes and the predicted classes. The level of effectiveness of the predictor is calculated from the number of correct and incorrect classifications in each possible value of the variables being classified in the confusion matrix (Cabena, Hadjinian, Stadler, Verhees, & Zanasi, 1998) (see Fig. 3).

Fig. 2. Outlier filtering and over-sampling framework. (Flowchart: raw data → cleaning of duplicated and missing data → outlier filtering (C-SVC) → filtered data → if the data are imbalanced, they are split into the majority class (MA) and the minority class (MI), and MI is over-sampled by the MA/MI ratio → final data (MA + MI) → 10-fold cross-validation (9 training folds, 1 testing fold) → learning algorithm → prediction outcome model → estimation of accuracy, sensitivity, specificity, AUC and F-measure.)

The confusion matrix is used to compute true positives (TP),false positives (FP), true negatives (TN) and false negatives (FN),as represented in Fig. 3.

3.2.1. Basic performance measures

There are three commonly used performance measures: accuracy, sensitivity and specificity (Han & Kamber, 2006). The accuracy of a classifier is the percentage of correct outcomes on the test sets exploited in this study; it is defined in (1). The sensitivity refers to the true positive rate, and the specificity to the true negative rate.

Table 1
Input attributes.

No.  Attribute                          Type
1    Age                                Number
2    Marital status                     Category (3)
3    Basis of diagnosis                 Category (6)
4    Topography                         Category (9)
5    Morphology                         Category (14)
6    Extent                             Category (4)
7    Stage                              Category (4)
8    Received surgery                   Category (2)
9    Received radiation                 Category (2)
10   Received chemotherapy              Category (2)
11   Received hormone therapy           Category (2)
12   Received supportive                Category (2)
13   Received other therapy             Category (2)
14   Survivability (class attribute)    Category (2)

                     Predicted classes
                     'Dead'    'Alive'
Outcomes  'Dead'     TP        FN
          'Alive'    FP        TN

Fig. 3. The confusion matrix.

Fig. 4. The area under the ROC curve (AUC). (Figure: ROC curves of two prediction models, A and B, plotting the true positive rate against the false positive rate.)


Both the sensitivity and specificity, used for measuring the factors that affect performance, are presented in (2) and (3), respectively:

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}    (1)

Sensitivity = \frac{TP}{TP + FP}    (2)

Specificity = \frac{TN}{TN + FN}    (3)

In this study, the sensitivity is the probability of correct tests among 'dead' patients. In contrast, the specificity is the probability of correct tests among 'alive' patients.
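A short helper (an illustrative sketch, not from the paper) that computes the three measures exactly as printed in Eqs. (1)-(3), from the confusion matrix counts of Fig. 3:

def basic_measures(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # Eq. (1)
    sensitivity = tp / (tp + fp)                # Eq. (2), as printed in the paper
    specificity = tn / (tn + fn)                # Eq. (3), as printed in the paper
    return accuracy, sensitivity, specificity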

3.2.2. Area Under the receiver operating characteristic Curve (AUC)

The Area Under the receiver operating characteristic Curve (AUC) is traditionally used in medical diagnosis systems. It has been proposed as an alternative measure for evaluating the predictive ability of learning algorithms (Hand, Mannila, & Smyth, 2001). It also provides an approach for evaluating classifiers based on an average that graphically interprets the performance of the decision-making algorithm, with regard to the decision parameter, by plotting the true positive rate against the false positive rate (He & Frey, 2006; Woods & Bowyer, 1997) (see Fig. 4).

The AUC score lies between 0 and 1 and is easy to interpret when evaluating prediction model performance (Huang & Ling, 2005). For example, Fig. 4 shows the Areas Under the ROC Curves of prediction models A and B; since the AUC score of model A is larger than that of model B, model A performs better. Several research studies have utilized AUC to compare the performance of prediction models. For instance, Huang and Ling (2005) demonstrated that AUC is a more accurate measurement method than accuracy calculated directly from the confusion matrix.

Table 2
The number of instances in the original data sets.

Data set  Years      'Dead'  'Alive'  Total  'Dead' (% positive class)  'Alive' (% negative class)
1-Year    1985–2006  351     1128     1479   23.73                      76.27
2-Year    1985–2005  455     846      1301   34.97                      65.03
3-Year    1985–2004  485     654      1139   42.58                      57.42
4-Year    1985–2003  488     495      983    49.64                      50.36
5-Year    1985–2002  466     392      858    54.31                      45.69
6-Year    1985–2001  437     304      741    58.97                      41.03
7-Year    1985–2000  351     198      549    63.93                      36.07
8-Year    1985–1999  276     130      406    67.98                      32.02
9-Year    1985–1998  248     103      351    70.66                      29.34
10-Year   1985–1997  221     90       311    71.06                      28.94

Whereas Jiang (2007) employed the average AUC to analyze the optimal linearity of an artificial neural network (ANN) output, the current study utilized the average AUC as a performance selection criterion for the combined method in the classification task.
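In practice the AUC can be computed directly from class labels and classifier scores; a brief sketch with scikit-learn (illustrative data, not from the study):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # 0/1 class labels
y_score = [0.1, 0.4, 0.35, 0.8]  # decision values or positive-class probabilities
print(roc_auc_score(y_true, y_score))  # 0.75; AUC always lies in [0, 1]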

3.2.3. F-measure

The F-measure is a general evaluation method used in text recognition and information retrieval systems. It evaluates effectiveness expressed in terms of hits, misses, false alarms, and correct rejections (Nakache, Metais, & Timsit, 2005). Moreover, it is used to find an operating point that provides a certain trade-off between precision and recall. As a result, much research has utilized the F-measure to compare the performance of classifiers, giving equal weight to precision and recall. The F-measure is an effectiveness measure that characterizes classification performance in precision–recall space (Tan, Steinbach, & Kumar, 2006; Witten & Frank, 2005), and is defined as the weighted harmonic mean of the precision (P) and recall (R), as shown in Eq. (4):

F\text{-measure} = \frac{2PR}{P + R}    (4)

Therefore, a high F-measure value is better, ensuring that both precision and recall are reasonably high. Several research studies have utilized the F-measure to measure the effectiveness and performance of classification models in text mining. For instance, Li and Park (2006) employed the F-measure to measure the categorization effectiveness of artificial neural networks in text categorization.



Their results showed that the F-measure is well suited for evaluating the effectiveness of text classifiers. Furthermore, Musicant, Kumar, and Ozgur (2003) utilized the F-measure to measure the minimization of the number of misclassified points of the Support Vector Machine (SVM) predictor on the training set. Their results pointed out that an SVM classifier with suitable parameter settings can optimize the F-measure on data sets.
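A direct transcription of Eq. (4) as a small helper (illustrative, not from the paper):

def f_measure(p: float, r: float) -> float:
    # harmonic mean of precision p and recall r, each weighted equally
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f_measure(0.9, 0.6))  # 0.72: both measures must be high for a high score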

4. Experimental results

In this paper, WEKA version 3.5.6 (Witten & Frank, 2005) was selected to evaluate the capability and effectiveness of the hybrid approach for improving the quality of the breast cancer data sets. The WEKA experimenter has a well-defined framework that offers a variety of learning algorithms for data mining, pattern recognition, and machine learning. Four well-known algorithms, AdaBoost, Bagging, C4.5 and SVM, were employed with the default parameters of the WEKA application to develop prediction models for the breast cancer survivability data sets from Srinagarind Hospital in Thailand. We first report the numbers of instances produced by the proposed approach, compared with the outlier filtering and over-sampling approaches. The capability and effectiveness of the OOS approach was then evaluated using the basic performance measures (accuracy, sensitivity and specificity), AUC scores and F-measure of the four classifiers AdaBoost, Bagging, C4.5 and SVM (see Sections 4.2–4.4). Experiments were performed using 10-fold cross-validation to reduce the bias associated with the random sampling strategy (Kohavi, 1995; Thongkam et al., 2008, 2008a, 2008b) on the same breast cancer survivability data sets. The reported results are averages over 10 iterations of 10-fold cross-validation.
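The experiments themselves were run in WEKA; purely as an illustration of the protocol, a rough Python analogue (an assumption, not the authors' setup) of 10-fold cross-validation over comparable learners might look as follows, where X and y are the rebalanced data, e.g. from the oos() sketch in Section 2.3:

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y):
    learners = {
        "AdaBoost": AdaBoostClassifier(),  # boosts decision stumps by default
        "Bagging": BaggingClassifier(),    # bags decision trees (the paper bags a fast decision tree learner)
        "Tree (C4.5-like)": DecisionTreeClassifier(),  # CART stands in for C4.5/J48
        "SVM": SVC(),
    }
    for name, clf in learners.items():
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print(f"{name}: mean accuracy {scores.mean():.4f}")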

4.1. Instances in data sets

In these experiments, imbalanced data are characterized by the ratio between the majority class and the minority class. The numbers of instances produced by the OOS approach were compared with those produced by the outlier filtering and over-sampling approaches alone. These results are presented in Table 3.

Table 3 shows the numbers of instances and the imbalance problem in each data set. Results indicated that the 4- and 5-year breast cancer survivability data sets have less serious problems than the 1-, 2-, 3-, 6-, 7-, 8-, 9- and 10-year data sets. The imbalance problem increased significantly after applying the outlier filtering approach, especially in the 1-, 7-, 8-, 9- and 10-year data sets. In addition, we found that although applying the over-sampling approach can handle the imbalance problem, it cannot reduce the misclassified instances. The OOS approach, however, can both reduce the misclassified instances and handle the imbalance problem in the data sets well.

Table 3
The numbers of instances.

          Original                      Outlier approach              Over-sample approach     OOS approach
Data set  'Dead'    'Alive'    Ratio    'Dead'    'Alive'    Ratio    'Dead'  'Alive'  Ratio   'Dead'  'Alive'  Ratio
1-Year    351 (MI)  1128 (MA)  321      87 (MI)   1093 (MA)  1256     1126    1128     1.00    1092    1093     1.00
2-Year    455 (MI)  846 (MA)   186      159 (MI)  799 (MA)   503      846     846      1.00    798     799      1.00
3-Year    485 (MI)  654 (MA)   135      270 (MI)  544 (MA)   201      653     654      1.00    542     544      1.00
4-Year    488 (MI)  495 (MA)   101      349 (MI)  355 (MA)   102      492     495      0.99    355     355      1.00
5-Year    466 (MA)  392 (MI)   119      368 (MA)  250 (MI)   147      466     466      1.00    368     367      1.00
6-Year    437 (MA)  304 (MI)   144      378 (MA)  153 (MI)   247      437     437      1.00    378     377      1.00
7-Year    351 (MA)  198 (MI)   177      316 (MA)  77 (MI)    410      351     350      1.00    316     315      1.00
8-Year    276 (MA)  130 (MI)   212      265 (MA)  36 (MI)    736      276     275      1.00    265     264      1.00
9-Year    248 (MA)  103 (MI)   241      238 (MA)  24 (MI)    992      248     248      1.00    238     238      1.00
10-Year   221 (MA)  90 (MI)    246      212 (MA)  26 (MI)    815      221     220      1.00    212     211      1.00

Note: MA refers to the majority class and MI refers to the minority class. Ratio is MA/MI, given in % for the Original and Outlier columns.

4.2. Basic performance comparison

In these experiments, the capability and effectiveness of the OOS approach was evaluated in comparison with the outlier filtering and over-sampling approaches, respectively, using the average accuracy, sensitivity and specificity of four classifiers: AdaBoost, Bagging, C4.5 and Support Vector Machine (SVM). For the base learner strategy, we selected the decision stump as the base learner in AdaBoost, and the fast decision tree learner as the base learner in Bagging. The results are shown in Tables 4–7, respectively.

Tables 4–7 show the capability and effectiveness of the OOS approach in terms of the basic performance (accuracy, sensitivity and specificity) of AdaBoost, Bagging, C4.5 and SVM, respectively. The results show that, for improving classification results, improving the data quality using the OOS approach outperforms the outlier filtering and random over-sampling approaches in terms of the accuracy, sensitivity and specificity of Bagging, C4.5 and SVM. Although the average accuracy of AdaBoost using the outlier filtering approach was slightly better than with the OOS approach, the average sensitivity and specificity of AdaBoost using the OOS approach were much better than with the outlier filtering approach. This may be because the AdaBoost algorithm assigns weights to misclassified instances without regard to the majority and minority classes, so AdaBoost seems less affected by the imbalance problem. Nonetheless, the problem of low sensitivity in the 1-, 2- and 3-year breast cancer survivability prediction models remained in AdaBoost after applying the outlier filtering approach. Similarly, the specificity of the 6-, 7-, 8- and 10-year survivability prediction models was also unable to reach high performance after applying the outlier filtering approach. Moreover, the sensitivity of the 5-year survivability prediction model of all four classifiers using outlier filtering gave similar results to the OOS approach, while the specificity of the 5-year model of all four classifiers using outlier filtering was lower than with the OOS approach. These results may be due to the fact that the 5-year survivability data set was less imbalanced than the others. Furthermore, although using the over-sampling approach slightly improved the performance of the prediction models, overall sensitivity and specificity remained low; that is, the over-sampling approach alone can only slightly improve the average accuracy, sensitivity and specificity. Hence, a combined outlier filtering and over-sampling approach is suitable for improving the overall accuracy, sensitivity and specificity of the classifiers on breast cancer survivability data sets.


Table 4
Basic performance of AdaBoost.

          Accuracy (%)                      Sensitivity (%)                   Specificity (%)
Data set  Raw    Outlier Over-s. OOS       Raw    Outlier Over-s. OOS       Raw    Outlier Over-s. OOS
1-Year    78.48  96.07   68.74   93.00     25.67  72.90   61.68   93.01     94.91  97.92   75.78   93.00
2-Year    69.70  90.60   63.85   92.15     38.59  59.54   69.87   93.65     86.43  96.78   57.84   90.66
3-Year    67.83  86.99   67.38   88.24     48.57  76.11   84.18   86.68     82.09  92.39   50.61   89.80
4-Year    66.27  84.42   68.50   87.44     80.89  91.87   68.99   91.54     51.85  77.12   68.02   83.34
5-Year    68.88  87.77   66.75   87.25     82.92  95.84   65.33   95.87     52.19  75.88   68.19   78.61
6-Year    69.41  88.26   68.09   89.37     81.10  92.64   78.20   90.32     52.61  77.42   57.99   88.41
7-Year    68.32  93.69   63.72   92.55     82.23  98.19   67.69   92.91     43.72  75.38   59.74   92.19
8-Year    68.97  92.64   66.48   92.17     87.56  94.41   67.00   85.30     29.54  80.00   65.96   99.05
9-Year    71.54  97.64   67.13   95.67     92.46  99.45   64.88   92.61     21.04  80.33   69.38   98.74
10-Year   71.64  96.69   66.33   96.62     92.18  98.11   68.18   95.56     21.22  85.33   64.45   97.68
Average   70.10  91.48   66.70   91.45     71.22  87.91   69.60   91.75     53.56  83.86   63.80   91.15

Note: Raw = raw data; Over-s. = over-sampling (also in Tables 5–9).

Table 5
Basic performance of Bagging.

          Accuracy (%)                      Sensitivity (%)                   Specificity (%)
Data set  Raw    Outlier Over-s. OOS       Raw    Outlier Over-s. OOS       Raw    Outlier Over-s. OOS
1-Year    77.45  96.50   76.85   98.01     26.38  72.86   81.03   100.00    93.34  98.41   72.68   96.02
2-Year    67.86  94.24   71.30   96.99     37.33  81.53   74.87   98.93     84.28  96.77   67.73   95.04
3-Year    65.64  92.86   71.75   95.48     54.35  91.04   76.78   97.60     73.99  93.77   66.73   93.36
4-Year    65.67  90.74   71.09   91.46     69.28  90.20   74.87   92.31     62.09  91.27   67.33   90.62
5-Year    67.04  92.14   70.31   94.39     74.31  93.97   68.70   92.98     58.39  89.44   71.91   95.80
6-Year    68.83  92.08   69.58   94.17     78.63  94.99   69.23   92.33     54.75  84.89   69.92   96.03
7-Year    66.63  94.68   71.17   96.15     80.54  96.68   68.50   95.29     41.98  86.55   73.86   97.02
8-Year    69.86  93.81   71.20   97.33     87.38  95.70   67.33   94.69     32.69  79.67   75.07   100.00
9-Year    70.89  97.63   74.09   99.14     89.03  98.69   72.55   98.28     27.07  87.67   75.66   100.00
10-Year   69.87  95.46   70.72   97.85     88.10  96.18   66.09   96.23     25.11  90.33   75.36   99.48
Average   68.97  94.01   71.81   96.10     68.53  91.18   72.00   95.86     55.37  89.88   71.63   96.34

Table 6
Basic performance of C4.5.

          Accuracy (%)                      Sensitivity (%)                   Specificity (%)
Data set  Raw    Outlier Over-s. OOS       Raw    Outlier Over-s. OOS       Raw    Outlier Over-s. OOS
1-Year    78.38  95.96   75.95   98.53     25.61  62.00   77.62   100.00    94.80  98.68   74.29   97.05
2-Year    69.32  94.13   68.93   97.49     32.80  79.33   73.80   99.00     88.96  97.08   64.06   96.00
3-Year    67.26  92.08   72.04   95.17     61.47  88.19   74.72   96.59     71.55  94.01   69.36   93.75
4-Year    67.15  90.19   70.88   91.76     76.53  88.57   78.69   90.99     57.89  91.78   63.11   92.53
5-Year    68.64  92.64   70.36   94.12     78.32  93.48   74.46   93.29     57.12  91.40   66.26   94.96
6-Year    68.84  93.17   69.06   94.93     78.58  95.39   73.25   94.02     54.85  87.68   64.87   95.84
7-Year    67.58  95.39   67.96   97.67     78.52  97.49   66.79   96.90     48.17  86.91   69.14   98.44
8-Year    68.93  95.21   67.61   97.49     89.97  97.01   63.54   94.99     24.31  81.67   71.71   100.00
9-Year    71.68  97.56   73.61   99.37     92.82  98.61   71.35   98.74     20.75  87.83   75.86   100.00
10-Year   66.72  96.17   70.52   98.84     88.52  97.97   67.46   97.97     13.22  81.83   73.59   99.71
Average   69.45  94.25   70.69   96.54     70.31  89.80   72.17   96.25     53.16  89.89   69.23   96.83

Table 7
Basic performance of SVM.

          Accuracy (%)                      Sensitivity (%)                   Specificity (%)
Data set  Raw    Outlier Over-s. OOS       Raw    Outlier Over-s. OOS       Raw    Outlier Over-s. OOS
1-Year    77.16  98.18   73.89   99.76     16.67  79.46   73.25   100.00    95.98  99.68   74.53   99.51
2-Year    69.35  96.46   69.37   98.56     28.18  86.36   71.39   99.95     91.49  98.47   67.37   97.17
3-Year    66.06  94.69   68.85   96.93     48.77  89.67   71.45   98.19     78.86  97.19   66.24   95.68
4-Year    63.65  93.87   69.56   94.41     63.72  93.95   68.92   94.37     63.58  93.78   70.19   94.45
5-Year    64.84  93.90   68.80   97.46     73.77  96.31   68.21   96.22     54.21  90.36   69.38   98.69
6-Year    65.71  95.29   67.23   98.03     81.65  98.70   71.65   97.30     42.79  86.86   62.81   98.76
7-Year    64.43  95.86   66.32   98.64     86.10  99.40   62.83   97.72     26.04  81.30   69.83   99.56
8-Year    70.08  97.41   72.74   98.68     92.65  99.43   72.30   97.34     22.15  82.33   73.19   100.00
9-Year    72.23  98.48   73.26   99.56     94.68  100.00  67.63   99.12     18.09  84.17   78.90   100.00
10-Year   71.93  96.90   70.45   99.27     94.21  99.29   66.70   98.53     17.22  78.67   74.23   100.00
Average   68.54  96.10   70.05   98.13     68.04  94.26   69.43   97.87     51.04  89.28   70.67   98.38


Table 8
The percentage of AUC scores of AdaBoost, Bagging, C4.5 and SVM classifiers.

          AdaBoost                        Bagging                         C4.5                            SVM
Data set  Raw    Outlier Over-s. OOS     Raw    Outlier Over-s. OOS     Raw    Outlier Over-s. OOS     Raw    Outlier Over-s. OOS
1-Year    75.13  97.83   75.15   98.24   72.80  96.16   84.53   99.02   60.86  84.80   81.47   98.62   56.33  89.57   73.89   99.76
2-Year    72.50  96.03   69.61   96.89   69.92  96.82   78.09   98.60   69.52  93.38   73.36   98.07   59.84  92.42   69.38   98.56
3-Year    73.22  94.90   72.12   94.99   72.59  97.27   79.44   98.38   72.16  93.41   76.63   96.24   63.82  93.43   68.85   96.94
4-Year    70.56  92.96   74.58   94.17   71.72  95.86   78.88   96.31   71.10  91.17   76.14   93.18   63.65  93.87   69.55   94.41
5-Year    72.41  93.44   73.49   93.36   71.62  95.96   77.29   96.78   71.59  93.04   74.10   94.24   63.99  93.33   68.80   97.45
6-Year    73.17  93.36   71.87   93.44   71.27  95.21   76.17   97.61   70.68  93.32   71.63   96.14   62.22  92.78   67.23   98.03
7-Year    68.68  95.05   70.44   96.49   68.06  93.22   77.57   98.59   67.78  90.53   70.46   97.51   56.07  90.35   66.33   98.64
8-Year    67.41  96.33   73.69   96.46   68.04  98.16   78.57   99.21   67.17  93.64   71.94   98.94   57.40  90.88   72.74   98.67
9-Year    68.15  98.28   73.55   99.34   67.61  97.57   82.08   99.59   66.95  94.77   78.06   99.67   56.39  92.08   73.26   99.56
10-Year   67.62  98.15   71.40   98.98   67.16  97.36   78.61   99.37   55.32  90.51   75.34   98.71   55.72  88.98   70.46   99.27
Average   70.89  95.63   72.59   96.24   70.08  96.36   79.12   98.35   67.31  91.86   74.91   97.13   59.54  91.77   70.05   98.13

Table 9
The percentage of F-measure of AdaBoost, Bagging, C4.5 and SVM classifiers.

          AdaBoost                        Bagging                         C4.5                            SVM
Data set  Raw    Outlier Over-s. OOS     Raw    Outlier Over-s. OOS     Raw    Outlier Over-s. OOS     Raw    Outlier Over-s. OOS
1-Year    35.78  73.01   66.27   92.99   35.58  75.01   77.74   98.06   35.52  68.80   76.31   98.55   25.55  86.20   73.68   99.76
2-Year    46.84  67.35   64.72   92.25   44.66  82.26   72.25   97.05   42.39  81.57   70.28   97.54   38.83  88.92   69.94   98.59
3-Year    55.68  79.04   71.93   87.99   57.22  89.43   73.08   95.58   61.26  88.03   72.75   95.24   54.83  91.72   69.57   96.98
4-Year    70.17  85.48   68.39   87.96   66.62  90.60   72.02   91.52   69.77  89.90   72.83   91.68   63.44  93.82   69.23   94.39
5-Year    74.30  90.34   66.10   88.32   70.96  93.44   69.75   94.31   73.00  93.78   71.44   94.07   69.44  94.95   68.58   97.41
6-Year    75.74  91.80   70.87   89.45   74.78  94.44   69.34   94.04   74.81  95.20   70.21   94.88   73.69  96.77   68.54   98.01
7-Year    76.73  96.17   64.99   92.57   75.46  96.69   70.27   96.10   75.53  97.13   67.50   97.64   75.52  97.49   64.98   98.62
8-Year    79.18  95.72   66.56   91.46   79.67  96.43   70.01   97.21   79.56  97.26   66.05   97.38   80.78  98.55   72.63   98.63
9-Year    82.06  98.71   66.05   95.44   81.15  98.69   73.54   99.12   82.20  98.65   72.83   99.35   82.77  99.17   71.42   99.55
10-Year   82.15  98.13   66.79   96.52   80.49  97.38   69.13   97.79   78.76  97.85   69.46   98.82   82.63  98.28   69.23   99.24
Average   67.86  87.58   67.27   91.50   66.66  91.44   71.71   96.08   67.28  90.82   70.97   96.52   64.75  94.59   69.78   98.12



4.3. AUC comparison

The Area Under the receiver operating characteristic Curve (AUC) is commonly used to evaluate the performance and effectiveness of classifiers on imbalanced data sets (Alejo et al., 2006; Estabrooks et al., 2004; Xie & Qiu, 2007), and has been shown to be a superior method for evaluating imbalanced data sets (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). In the experiments of this study, the capability and effectiveness of the proposed approach was measured and compared with the outlier filtering and over-sampling approaches using the average AUC scores of four classifiers: AdaBoost, Bagging, C4.5 and Support Vector Machine (SVM). The AUC results of these four classifiers are shown in Table 8.

Table 8 shows the capability and effectiveness of the proposed approach (OOS) in terms of the AUC scores of the four classifiers AdaBoost, Bagging, C4.5 and SVM. Results indicated that the overall average AUC scores of AdaBoost, Bagging, C4.5 and SVM improved by 24.74%, 26.28%, 24.55% and 32.23%, respectively, using the outlier filtering approach. Likewise, the overall average AUC scores of AdaBoost, Bagging, C4.5 and SVM using the random over-sampling approach improved somewhat, by 1.7%, 9.04%, 7.6% and 10.51%, respectively. The overall average AUC scores of AdaBoost, Bagging, C4.5 and SVM improved by up to 25.35%, 28.27%, 29.22% and 38.59%, respectively, after applying the OOS approach. Accordingly, these results indicated that the Support Vector Machine (SVM) achieved higher AUC scores than AdaBoost, Bagging and C4.5 after applying the OOS approach to the breast cancer survivability data sets.

4.4. F-measure comparison

The F-measure provides a certain trade-off between precision and recall, as used in text recognition and information retrieval systems. In these experiments, the capability and effectiveness of the proposed approach were also evaluated using the F-measure and compared with the outlier filtering and over-sampling approaches, based on the F-measure of four classifiers: AdaBoost, Bagging, C4.5 and Support Vector Machine (SVM). The F-measure results are illustrated in Table 9.

Table 9 reports the capability and effectiveness of the proposed approach (OOS) based on the F-measure of the four classifiers AdaBoost, Bagging, C4.5 and SVM (averages in the last row). Results demonstrated that the OOS approach outperformed the outlier filtering approach by approximately 3.92%, 4.64%, 5.7% and 3.53% on the F-measure of AdaBoost, Bagging, C4.5 and SVM, respectively. Moreover, the OOS approach was better than the over-sampling approach by roughly 24.23%, 24.37%, 25.55% and 28.34% on the F-measure of AdaBoost, Bagging, C4.5 and SVM, respectively. As a result, the OOS approach was found to be effective in improving the quality of data sets that have both outlier and imbalance problems, thereby improving the performance of classifiers.

5. Discussion

Our aim was to construct breast cancer survivability prediction models, or classifiers, to enhance the provision of care in medical prognoses. In order to improve the performance of these models, we proposed a hybrid approach combining outlier filtering and over-sampling to improve the quality of the data sets, and evaluated it using various measurement methods including accuracy, sensitivity, specificity, AUC and F-measure via four learning algorithms: AdaBoost, Bagging, C4.5 and Support Vector Machine (SVM).


Also, 10-fold cross-validation was employed to divide the data sets into training and test sets in order to reduce the bias and variance of the classification results. Several points of discussion emanate from these results.

Firstly, we found that outliers and imbalanced data directly affect the classification performance and effectiveness of classifiers. A possible solution is to improve the performance of well-known classifiers by eliminating a number of outliers from both the minority and majority classes, and increasing the size of the minority class to the same size as the majority class. These findings are consistent with those of Padmaja et al. (2007), who found that the performance of classifiers improved only after first eliminating outliers in the minority class, then increasing the size of the minority class, and lastly decreasing the size of the majority class in fraud detection databases.

Secondly, the finding that AdaBoost is less affected by the problem of imbalanced data sets than Bagging, C4.5 and SVM was interesting. This may be due to the fact that the AdaBoost algorithm applies weights directly to instances in order to generate a separation line (Vezhnevets & Vezhnevets, 2005), while C4.5 and SVM use statistical calculations to separate the binary classes (Quinlan, 1993; Vapnik, 1998). However, the performance of C4.5 and SVM using the OOS approach is superior to that of the outlier filtering and over-sampling approaches alone.

Finally, SVM was found to be somewhat more accurate than AdaBoost, Bagging or C4.5. This may be because C-SVC (one type of SVM algorithm) was utilized to eliminate the outliers from the data sets. In addition, the OOS approach was well suited to improving the quality of data sets in order to enhance the prediction results of classifiers.

6. Conclusion

In this paper, the OOS approach has been proposed and applied to the task of building accurate breast cancer survivability prediction models. This approach is a combination of outlier filtering and over-sampling approaches. Results have indicated that the OOS approach can remove insignificant outlier instances and can also significantly increase the performance of classification results. After applying the OOS approach, the average accuracy, sensitivity, specificity, AUC score and F-measure of SVM improved by 29.83%, 29.83%, 47.34%, 38.59% and 33.38%, respectively. Although we found it difficult to choose an appropriate method for developing prediction models, it is shown that applying the right approach improves the performance and effectiveness of the classifiers. This suggests that future research should further investigate selecting suitable methods for developing prediction models and interpreting them using high-quality data, in order to further understand survival patterns in medical data sets.

Acknowledgements

Special thanks to the IT and Cancer Department staff at Srinagarind Hospital for kindly providing the data. Thanks to Assoc. Prof. Vatinee Sukmak for her helpful comments, suggestions and criticisms. Thanks also go to Dr. Petre Santry for her proofreading and extensive advice on English expression.

References

Alejo, R., Garcia, V., Sotoca, J. M., Mollineda, R. A., & Sánchez, J. S. (2006). Improving the classification accuracy of RBF and MLP neural networks trained with imbalanced samples. In Intelligent data engineering and automated learning (pp. 464–471). Berlin/Heidelberg: Springer.

Barandela, R., Sánchez, J. S., García, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Journal of Pattern Recognition, 36(3), 849–851.

Bellaachia, A., & Guven, E. (2006). Predicting breast cancer survivability using data mining techniques. <http://www.siam.org/meetings/sdm06/workproceed/Scientific%20Datasets/bellaachia.pdf>.

Blanco, Á., Ricket, A. M., & Martín-Merino, M. (2007). Combining SVM classifiers for e-mail anti-spam filtering. In Proceedings of the ninth international work-conference on artificial neural networks (Vol. 4507, pp. 903–910). Berlin/Heidelberg: Springer.

Borovkova, S. (2002). Analysis of survival data. <http://www.math.leidenuniv.nl/~naw/serie5/deel03/dec2002/pdf/borovkova.pdf>.

Brenner, H., Gefeller, O., & Hakulinen, T. (2002). A computer program for period analysis of cancer patient survival. European Journal of Cancer, 38, 690–695.

Bridgett, N. A., Brandt, J., & Harris, C. J. (1995). Artificial neural networks for use in the diagnosis and treatment of breast cancer. Journal of e Prints, 409, 448–453.

Brodley, C. E., & Friedl, M. A. (1996). Identifying and eliminating mislabeled training instances. Journal of Artificial Intelligence Research, 1.

Brodley, C. E., & Friedl, M. A. (1999). Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11, 131–167.

Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering data mining from concept to implementation. Upper Saddle River, NJ: Prentice Hall.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 321–357.

Delen, D., Walker, G., & Kadam, A. (2005). Predicting breast cancer survivability: A comparison of three data mining methods. Journal of Artificial Intelligence in Medicine, 34, 113–127.

Estabrooks, A., Jo, T., & Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Journal of Computational Intelligence, 20.

Fang, R., & Ng, V. (1993). Use of neural network analysis to diagnose breast cancer patients. In Proceedings of the IEEE TENCON region 10 conference on computer, communication, control and power engineering (pp. 841–844).

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: The MIT Press.

Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann, Elsevier Science.

He, X., & Frey, E. C. (2006). Three-class ROC analysis – The equal error utility assumption and the optimality of the three-class ROC surface using the ideal observer. IEEE Transactions on Medical Imaging, 979–986.

Huang, J., & Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3), 299–310.

Jiang, Y. (2007). Uncertainty in the output of artificial neural networks. International Joint Conference on Neural Networks, 2551–2556.

Jonsdottir, T., Hvannberg, E. T., Sigurdsson, H., & Sigurdsson, S. (2008). The feasibility of constructing a predictive outcome model for breast cancer using the tools of data mining. Journal of Expert Systems with Applications, 34(1), 108–118.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the international joint conference on artificial intelligence (pp. 1137–1143).

Li, J., Fu, A. W.-C., He, H., Chen, J., & Kelman, C. (2005). Mining risk patterns in medical data. In Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining (pp. 770–775).

Li, C. H., & Park, S. C. (2006). Text categorization based on artificial neural networks. In I. King (Ed.), Neural information processing (pp. 302–311). Berlin/Heidelberg: Springer-Verlag.

Musicant, D. R., Kumar, V., & Ozgur, A. (2003). Optimizing F-measure with support vector machines. Journal of Flairs, 16, 356–360.

Nakache, D., Metais, E., & Timsit, J. F. (2005). Evaluation and NLP in database and expert systems applications. Berlin/Heidelberg: Springer, 626–632.

National Cancer Institute of Thailand (2006). Cancer in Thailand 1995–1997. <http://www.nci.go.th/cancer_record/>.

Ohno-Machado, L. (2001). Modeling medical prognosis: Survival analysis techniques. Journal of Biomedical Informatics, 34, 428–439.

Padmaja, T. M., Dhulipalla, N., Bapi, R. S., & Krishna, P. R. (2007). Unbalanced data classification using extreme outlier elimination and sampling techniques for fraud detection. In International conference on machine learning and cybernetics on advanced computing and communications (pp. 511–516).

Pelayo, L., & Dick, S. (2007). Applying novel resampling strategies to software defect prediction. In Proceedings of the annual meeting of the North American fuzzy information processing society (pp. 69–72).

Podgorelec, V., Hericko, M., & Rozman, I. (2005). Improving mining of medical data by outliers prediction. In Proceedings of the IEEE eighteenth symposium on computer-based medical systems (pp. 91–96).

Quinlan, J. R. (1993). C4.5: Programs for machine learning.

Ryu, Y. U., Chandrasekaran, R., & Jacob, V. S. (2007). Breast cancer prediction using the isotonic separation technique. European Journal of Operational Research, 181, 842–854.

Srinivasan, T., Chandrasekhar, A., Seshadri, J., & Jonathan, J. B. S. (2005). Knowledge discovery in clinical databases with neural network evidence combination. In Proceedings of the international conference on intelligent sensing and information (pp. 512–517).

Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Pearson Addison Wesley.

Thongkam, J., Xu, G., & Zhang, Y. (2008). An analysis of data selection methods on classifiers accuracy measures. Journal of Khon Kaen University, 35, 1–10.

Thongkam, J., Xu, G., Zhang, Y., & Huang, F. (2008a). Breast cancer survivability via AdaBoost algorithms. In The Australasian workshop on health data and knowledge management (Vol. 80, pp. 1–10).

Thongkam, J., Xu, G., Zhang, Y., & Huang, F. (2008b). Support vector machines for outlier detection in cancer survivability prediction. In International workshop on health data management, APWeb'08 (pp. 99–109).

Thongsuksai, P., Chongsuvivatwong, V., & Sriplung, H. (2000). Delay in breast cancer care: A study in Thai women. Journal of Med Care, 38(1), 108–114.

Tsumoto, S. (2000). Problems with mining medical data. In The twenty-fourth annual international conference on computer software and applications (pp. 467–468).

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Verbaeten, S., & Assche, A. V. (2003). Ensemble methods for noise elimination in classification problems. In T. Windeatt & F. Roli (Eds.), Multiple classifier systems (pp. 317–325). Berlin/Heidelberg: Springer-Verlag.

Vezhnevets, A., & Vezhnevets, V. (2005). 'Modest AdaBoost' – Teaching AdaBoost to generalize better. Graphicon-2005. Novosibirsk Akademgorodok, Russia.

Wang, C.-Y., Wu, C.-G., Liang, Y.-C., & Guo, X.-C. (2006). Diagnosis of breast cancer tumor based on ICA and LS-SVM. In Proceedings of the IEEE international conference on machine learning and cybernetics (pp. 2565–2570).

Wang, J., Xu, M., Wang, H., & Zhang, J. (2006). Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. In The eighth international conference on signal processing (Vol. 3, pp. 16–20).

Weiss, G. M., & Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19, 315–354.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann.

Woods, K., & Bowyer, K. W. (1997). Generating ROC curves for artificial neural networks. IEEE Transactions on Medical Imaging, 16(3), 329–337.

Xiao, Y., Khoshgoftaar, T. M., & Seliya, N. (2005). The partitioning- and rule-based filter for noise detection. In Proceedings of the IEEE international conference on information reuse and integration (pp. 205–210).

Xie, J., & Qiu, Z. (2007). The effect of imbalanced data sets on LDA: A theoretical and empirical analysis. Journal of Pattern Recognition, 40(2), 557–562.

Xiong, X., Kim, Y., Baek, Y., Rhee, D. W., & Kim, S.-H. (2005). Analysis of breast cancer using data mining and statistical techniques. In The sixth international conference on software engineering, artificial intelligence, networking and parallel (pp. 82–87).

Yi, W., & Fuyong, W. (2006). Breast cancer diagnosis via support vector machines. In Proceedings of the twenty-fifth Chinese control conference (pp. 1853–1856).

Yin, Z., Yin, P., Sun, F., & Wu, H. (2006). A writer recognition approach based on SVM. In Multi conference on computational engineering in systems applications (Vol. 1, pp. 581–586).