
Research Article
Cervical Cancer Diagnosis Model Using Extreme Gradient Boosting and Bioinspired Firefly Optimization

Irfan Ullah Khan, Nida Aslam, Rawan Alshehri, Seham Alzahrani, Manal Alghamdi, Atheer Almalki, and Maryam Balabeed

Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia

Correspondence should be addressed to Nida Aslam: naslam@iau.edu.sa

Received 24 February 2021; Accepted 27 June 2021; Published 19 July 2021

Academic Editor: Antonio J. Peña

Copyright © 2021 Irfan Ullah Khan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cervical cancer is a frequently deadly disease that is common in females. However, early diagnosis of cervical cancer can reduce the mortality rate and other associated complications, and cervical cancer risk factors can aid this early diagnosis. For better diagnostic accuracy, we propose a model for early diagnosis of cervical cancer using a reduced risk feature set and three ensemble-based classification techniques, i.e., eXtreme Gradient Boosting (XGBoost), AdaBoost, and Random Forest (RF), along with the Firefly algorithm for optimization. The Synthetic Minority Oversampling Technique (SMOTE) was used to alleviate the data imbalance problem. The Cervical Cancer Risk Factors data set, containing 32 risk factors and four targets (Hinselmann, Schiller, Cytology, and Biopsy), is used in the study; the four targets are widely used diagnostic tests for cervical cancer. The effectiveness of the proposed study is evaluated in terms of accuracy, sensitivity, specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Moreover, the Firefly feature selection technique was used to achieve better results with a reduced number of features. Experimental results reveal the significance of the proposed model, which achieved the highest outcome for the Hinselmann test when compared with the other three diagnostic tests. Furthermore, the reduction in the number of features enhanced the outcomes. Additionally, the performance of the proposed models is noticeable in terms of accuracy when compared with other benchmark studies for cervical cancer diagnosis using the reduced risk factors data set.

1. Introduction

Cervical cancer is one of the most commonly occurring types of cancer in females and mostly develops during their midlives (35–44 years) [1]. This type of cancer can be fatal, as it does not show clear symptoms in its early stages. Symptoms usually appear in late stages, when the cancer may have spread to other organs such as the bones, liver, lymph nodes, and lungs. One of the early signs of cervical cancer is blockage of the tube that carries urine from the kidney. Other late symptoms that can appear are vaginal bleeding, pelvic pain, weight loss, and leg pain [2].

The risk factors that lead to the development of cervical cancer include hormone-containing medicines, birth control pills, smoking, and the number of pregnancies. However, it is believed that human papilloma virus (HPV) is the major factor in developing cervical cancer [2]. HPV is a common sexually transmitted infection; it is usually harmless, but sometimes it may lead to cancer [3]. Individuals with an HPV infection are at a higher risk of developing cervical cancer. Furthermore, the probability of getting cervical cancer increases if one possesses more than one risk factor. As the cancer does not show signs in its early stages, regular checkups are required, especially for those who have the risk factors. In developing countries, the lack of medical equipment and the cost of conducting checkups can also be a burden. With the advent and advancement of machine learning, it has become possible to find robust solutions for early diagnosis of cancer cases using data-driven approaches.

Various studies have contributed to the field of cervical cancer diagnosis using several classification techniques and different types of data, such as clinical data, images, and genetic data. In our study, we used clinical cervical risk factor data. Two similar studies were conducted by Wu and Zhou [4] and Abdoh et al. [5]; they performed a comparative analysis of two feature selection techniques, namely, recursive feature elimination (RFE) and Principal Component Analysis (PCA). The first study used a Support Vector Machine (SVM), and the other study used Random Forest (RF); both studies used the same number of features. As the data suffered from imbalance, oversampling was applied to the data in [4], and SMOTE was used in [5]. Both studies removed two risk factors, namely, the time since the first and last diagnosis of STDs (sexually transmitted diseases), due to a large number of missing entries. Furthermore, the study [4] found that lower computational cost was an advantage of both SVM-PCA and SVM-RFE, whereas high computational cost is a limitation of the plain SVM model. Moreover, STDs, intrauterine device (IUD), hormonal contraceptives, and first sexual intercourse were identified as the most relevant features [5]. Overall, the outcomes of both studies showed that using 30 features produced the highest results. Furthermore, it was found that the SMOTE-RF model performed well for all targets.

Similarly, Lu et al. [6] and Karim and Neehal [7] used ensemble models to estimate the risk of cervical cancer. Both studies performed a data cleaning step to replace missing values. The former study used an ensemble classifier with a voting strategy on a combination of a private and a public data set: the private data set contains 472 records taken from a Chinese hospital and was collected using a questionnaire, while the public data set was obtained from the UCI repository, and 14 features were used. The results revealed that the voting ensemble classifier produced better results when compared to Linear Regression, Decision Tree (DT), Multilayer Perceptron (MLP), SVM, and K-NN classifiers. On the other hand, the Karim and Neehal study used DT, MLP, SVM with Sequential Minimal Optimization (SMO), and K-nearest neighbor (KNN) techniques. Experiments showed that SMO has better performance in terms of accuracy, precision, recall, and F-measure. Similarly, Ul-Islam et al. [8] used DT, RF, Logistic Model Tree, and ANN for cervical cancer detection; the Apriori algorithm was used to identify features that strongly relate to cancer. The study found that age, number of sexual partners, hormonal contraceptives, number of pregnancies, and first sexual intercourse are significant risk factors. Results indicated that RF produced the best outcome when compared to the other models.

Al-Wesabi et al. [9] conducted a comparison between different machine learning classifiers, such as Gaussian Naïve Bayes (GNB), KNN, DT, LR, and SVM. The outcome of the classifiers was not satisfactory due to the data imbalance. To resolve this problem, undersampling, oversampling, and SMOTETomek were applied; oversampling had the best result among the three methods. Moreover, a Sequential Feature Selector was applied in both forward and backward versions. Both the Sequential Forward Feature Selector (SFS) and the Sequential Backward Feature Selector (SBS) enhanced the performance of the prediction, with an accuracy of 95%. After selecting the common features between DT and KNN, the accuracy exceeded 97% for the DT. The results revealed that age, first sexual intercourse, number of pregnancies, smoking, hormonal contraceptives, and STDs: genital herpes were the main predictive features.

Similarly, several studies have used deep learning and transfer learning for cervical cancer diagnosis. Fernandes et al. [10] and Adem et al. [11] used deep learning and showed significant outcomes in terms of diagnostic accuracy. The study [10] used a loss function that provides a supervised optimization of dimensionality reduction and classification models. The study indicated that the approach can be useful for examining patients' records when the Biopsy, and perhaps other testing results, are absent, and that it is capable of successfully classifying whether patients have cervical cancer or not. On the other hand, the researchers in [11] used a deep neural network model with a softmax function to classify the data sets. The performance of the softmax function with a stacked autoencoder was compared with other machine learning methods (DT, KNN, SVM, Feed Forward NN, and Rotation Forest models). It was found that the softmax function with a stacked autoencoder model produced a better outcome, with a classification rate of 97.8%.

Similarly, Fernandes et al. [12] applied transfer learning with partial observability for cancer screening. A limitation of the study was that several patients resisted answering some questions due to privacy concerns. Challenges were also faced in defining quality, as there are multiple readings and the process relied on human preference; therefore, as an alternative to an ordinal scale, a simple binary scheme was used. Nevertheless, the model performance was considerable.

Conclusively, the finding from the above-mentioned literature is that the data set found in the UCI repository had several missing values; therefore, previous studies have removed at least 2 features. Missing values were due to patients' concerns regarding their privacy. After removing 2 features due to the huge number of missing values, SVM-PCA seemed to provide satisfactory performance; however, SMO and SMOTE-RF were amongst the best performing models. Another approach to dealing with the imbalance in the UCI cervical risk factor data set was oversampling. Deep learning proved to be effective, especially where the Biopsy and possibly other screening results are absent. Age, first sexual intercourse, number of pregnancies, smoking, hormonal contraceptives, IUD, STDs, and STDs: genital warts or HPV infections were identified as the top key features. The significant outcomes achieved by machine learning classifiers motivate further investigation and enhancement of the outcomes for the prediction of cervical cancer.

In this study, three ensemble-based classifiers, eXtreme Gradient Boosting, AdaBoost, and RF, are used to classify cervical cancer. The Cervical Cancer Risk Factor data set from the UCI machine learning repository was collected at "Hospital Universitario de Caracas" in Caracas, Venezuela [13]. In addition to the importance of correctly classifying cancerous and noncancerous cases, it is also essential to identify key risk factors that contribute to developing cancer. The nature-inspired Firefly feature selection and optimization algorithm was applied. Furthermore, the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the classes of the data, as it suffers greatly from an imbalance problem.

The paper is organized as follows. Section 2 presents material and methods. Section 3 contains the experimental setup and results. The comparison of the proposed model with existing studies using the same data set is discussed in Section 4. Finally, Section 5 contains the conclusion.

2. Material and Method

2.1. Dataset Description. The cervical cancer risk factors data set used in the study was collected at "Hospital Universitario de Caracas" in Caracas, Venezuela, and is available in the UCI Machine Learning repository [13]. It consists of 858 records with some missing values, as several patients did not answer some of the questions due to privacy concerns. The data set contains 32 risk factors and 4 targets, i.e., the diagnosis tests used for cervical cancer. It contains different categories of features, such as habits, demographic information, history, and genomic medical records. Features such as age, Dx Cancer, Dx CIN, Dx HPV, and Dx contain no missing values. Dx CIN denotes a change in the walls of the cervix, commonly due to HPV infection; it may sometimes lead to cancer if not treated properly. The Dx Cancer variable indicates whether the patient has other types of cancer, as a patient may sometimes have more than one type of cancer. In the data set, some of the patients do not have cervical cancer but still have a true Dx Cancer value; therefore, it is not used as a target variable.
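As an illustration, the data set can be loaded as in the following minimal sketch; the file name, the "?" missing-value marker, and the column spellings (including "Citology" for the Cytology target) are assumptions based on the public UCI copy of the data set rather than details stated in this paper:

```python
import pandas as pd

# Load the UCI cervical cancer risk factors data; the file name and the "?"
# missing-value marker are assumptions based on the public UCI distribution.
df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values="?")

# The four diagnostic-test targets; "Citology" is the spelling in the UCI file.
targets = ["Hinselmann", "Schiller", "Citology", "Biopsy"]
X = df.drop(columns=targets)

print(X.shape)  # expected: (858, 32) risk factors
print(X.isna().sum().sort_values(ascending=False).head())  # most-missing features
```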

Table 1 presents a brief description of each feature along with its type. Cervical cancer diagnosis usually requires several tests; this data set contains the widely used diagnostic tests as the targets. Hinselmann, Schiller, Cytology, and Biopsy are four widely used diagnostic tests for cervical cancer. Hinselmann, or colposcopy, is a test that examines the inside of the vagina and cervix using a tool that magnifies the tissues to detect any anomalies [3]. Schiller is a test in which a chemical substance, iodine, is applied to the cervix, where it stains healthy cells brown and leaves the abnormal cells uncolored. Cytology is a test that examines body cells from the uterine cervix for any cancerous cells or other diseases. Biopsy refers to the test in which a small part of cervical tissue is examined under a microscope; most biopsy tests can make a significant diagnosis.

2.2. Dataset Preprocessing. The data set suffers from a huge number of missing values; 24 features out of the 32 contained missing values. Initially, the features with a huge percentage of missing values were removed: the STDs: Time since first diagnosis and STDs: Time since last diagnosis features were removed, since they each have 787 missing values (see Table 2), which is more than half of the data. Data imputation was performed for the features with fewer missing values; the most frequent value technique was used to impute the remaining missing values. Additionally, the data set suffers from huge class imbalance. The target labels were imbalanced, with 35 positive cases for Hinselmann, 74 for Schiller, 44 for Cytology, and 55 for Biopsy out of the 858 records, as shown in Figure 1. SMOTE was used to deal with the class imbalance. SMOTE works by oversampling the minority class, generating new synthetic data for minority instances based on nearest neighbors using the Euclidean distance between data points [14]. Figure 1 shows the number of records per class label in the data set.
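Continuing the loading sketch above, this preprocessing step can be sketched as follows; the use of the imbalanced-learn package for SMOTE and the random seed are our assumptions, not details stated in the paper:

```python
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE  # assumes the imbalanced-learn package

# Drop the two features with 787 missing values each (see Table 2); the exact
# column names are taken from the UCI copy of the data set.
X = X.drop(columns=["STDs: Time since first diagnosis",
                    "STDs: Time since last diagnosis"])

# Impute the remaining missing values with the most frequent value per feature.
X_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Balance one target (here Hinselmann) by synthesizing minority-class samples
# from Euclidean nearest neighbors; the seed is an arbitrary choice.
y = df["Hinselmann"]
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_imputed, y)
```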

2.3. Firefly Feature Selection. Dimensionality reduction is one of the effective ways to select the features that improve the performance of a supervised learning model. In this study, we adopted the nature-inspired Firefly algorithm for selecting the features that better formulate the problem. Firefly was proposed by Yang [15] and was initially intended for optimization. The metaheuristic Firefly algorithm is inspired by the flash lightening capability of fireflies. It is a population-based optimization algorithm that finds the optimal value or parameter for a target function. In this technique, each fly is attracted by the glow intensity of the nearby flies; if the intensity of the gleam is extremely low at some point, then the attraction declines. Firefly uses three rules: (a) all the flies are of the same gender; (b) attractiveness depends upon the intensity of the glow; (c) the target function generates the gleam of the firefly. The flies with less glow move towards the flies with a brighter glow, and the brightness can be adjusted using the objective function. The same idea is implemented in the algorithm to search for the optimal features that can better fit the training model. Firefly is more computationally economical and produces better outcomes in feature selection when compared with other metaheuristic techniques like genetic algorithms and particle swarm optimization [16]. The time complexity of Firefly is O(n²t) [17]. It uses the light intensity to select the features; highly relevant features are represented as the features with high-intensity light.

For feature selection, some fireflies are initially generated, and each fly randomly assigns weights to all features. In our study, we generated 50 flies (n = 50). The dimension of the data set is 30. Furthermore, the lower bound was set to −50, while the upper bound was set to 50. The maximum number of generations was 500. Additionally, α (alpha) was initially set to 0.5, and in every subsequent iteration, equations (1) and (2) were used to update the α value:

$$X = 1 - \left(\frac{10^{-4}}{0.9}\right)^{1/i}, \quad i = 1, \ldots, 500, \tag{1}$$

$$\alpha = (1 - X) \times \alpha. \tag{2}$$

The gamma (γ) parameter was set to 1. The number of features selected using Firefly was 15 for Hinselmann, 13 for Schiller, 11 for Cytology, and 11 for Biopsy, respectively.
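A literal sketch of this α schedule is given below; the per-generation exponent 1/i is our reading of the garbled original equation (1), which matches the standard alpha-reduction schedule of the Firefly algorithm:

```python
# Alpha decay of equations (1) and (2): X = 1 - (10**-4 / 0.9)**(1 / i),
# then alpha is scaled by (1 - X) at each generation i.
alpha = 0.5            # initial alpha, as stated above
max_generations = 500  # as stated above

for i in range(1, max_generations + 1):
    X = 1 - (1e-4 / 0.9) ** (1.0 / i)   # equation (1)
    alpha = (1 - X) * alpha             # equation (2)
    # ... move the fireflies with the updated alpha, evaluate light
    # intensities, and keep the feature weights of the brightest fly ...
```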

2.4. Ensemble-Based Classification Methods. Three ensemble-based classification techniques, Random Forest, eXtreme Gradient Boosting, and AdaBoost, were used to train the models. These techniques are described in the sections below.

2.5. Random Forest. Random Forest (RF) was first proposed by Breiman in 2001 [18]. Random Forest is an ensemble model that uses decision trees as individual models and bagging as the ensemble method. It improves the performance of a decision tree by adding many trees to reduce overfitting. RF can be used for both classification and regression. RF generates a forest of decision trees, gets a prediction from each of them, and then selects the solution with the maximum votes [19].
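As a sketch, the Random Forest below is configured with the optimized values that the paper later reports in Table 3; the placeholder data stands in for the preprocessed, SMOTE-balanced feature matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the preprocessed features (30 after reduction).
rng = np.random.default_rng(0)
X_demo = rng.random((200, 30))
y_demo = rng.integers(0, 2, 200)

# Random Forest with the optimized values reported in Table 3.
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            max_features="log2", max_depth=15,
                            min_samples_split=5, min_samples_leaf=1)
rf.fit(X_demo, y_demo)
pred = rf.predict(X_demo)  # majority vote over the 100 trees
```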

When training a tree, it is important to measure how much each feature decreases the impurity, as the decrease in impurity indicates the significance of the feature. The tree classification result depends on the impurity measure used. For classification, the measures of impurity are either Gini impurity or information gain, while for regression, the measure of impurity is variance. Training a decision tree consists of iteratively splitting the data. Gini impurity decides the best split of the data using the formula

$$G = 1 - \sum_i p_i^2, \tag{3}$$

where p_i is the probability of selecting a data point of class i. Information gain (IG) is another measure used to decide the best split of the data, depending on the gain of each feature. The formulas for calculating the information gain are given in the following equations:

$$\text{Entropy} = -\sum_i p_i \log_2 p_i,$$

$$\text{IG}(\text{parent}, \text{child}) = \text{Entropy}(\text{parent}) - [p(c_1) \times \text{Entropy}(c_1) + p(c_2) \times \text{Entropy}(c_2) + \cdots]. \tag{4}$$
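A minimal sketch of equations (3) and (4) on a toy label set is shown below; the helper names are ours:

```python
import numpy as np

def gini(labels):
    # Equation (3): G = 1 - sum_i p_i^2 over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy = -sum_i p_i * log2(p_i).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Equation (4): IG = Entropy(parent) - sum_k p(c_k) * Entropy(c_k),
    # where p(c_k) is the fraction of parent samples routed to child k.
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

labels = [0, 0, 1, 1, 1, 1]
print(gini(labels))                                       # ~0.444
print(information_gain(labels, [[0, 0], [1, 1, 1, 1]]))   # ~0.918 (a perfect split)
```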

Table 1: Statistical description of the data set (mean (μ) ± std (σ) for positive and negative cases of each target).

| Feature type | Feature name | Hinselmann positive | Hinselmann negative | Schiller positive | Schiller negative | Cytology positive | Cytology negative | Biopsy positive | Biopsy negative |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Demographic | Age | 26.7 ± 7.7 | 26.8 ± 8.5 | 29.6 ± 11.0 | 26.6 ± 8.2 | 26.2 ± 8.4 | 26.9 ± 8.5 | 28.6 ± 8.9 | 26.7 ± 8.5 |
| Demographic | Smokes | 0.2 ± 0.4 | 0.1 ± 0.4 | 0.2 ± 0.4 | 0.2 ± 0.3 | 0.1 ± 0.3 | 0.1 ± 0.4 | 0.2 ± 0.4 | 0.1 ± 0.3 |
| Demographic | Smokes (years) | 2.5 ± 7.2 | 1.1 ± 3.9 | 2.4 ± 6.2 | 1.1 ± 3.8 | 1.1 ± 3.4 | 1.2 ± 4.1 | 2.2 ± 6.2 | 1.1 ± 3.9 |
| Demographic | Smokes (packs/year) | 0.7 ± 2.6 | 0.4 ± 2.2 | 0.6 ± 1.9 | 0.4 ± 2.2 | 0.5 ± 2.3 | 0.4 ± 2.2 | 0.7 ± 2.3 | 0.4 ± 2.2 |
| Demographic | Number of sexual partners | 2.2 ± 0.9 | 2.5 ± 1.7 | 2.5 ± 1.2 | 2.5 ± 1.7 | 2.7 ± 1.3 | 2.5 ± 1.7 | 2.5 ± 1.3 | 2.5 ± 1.7 |
| Demographic | First sexual intercourse (age) | 16.8 ± 2.0 | 16.9 ± 2.8 | 17.0 ± 2.5 | 16.9 ± 2.8 | 16.9 ± 2.9 | 16.9 ± 2.8 | 17.1 ± 2.6 | 16.9 ± 2.8 |
| Demographic | Number of pregnancies | 2.4 ± 1.4 | 2.5 ± 1.7 | 2.6 ± 1.7 | 2.2 ± 1.4 | 2.1 ± 1.4 | 2.2 ± 1.4 | 2.3 ± 1.3 | 2.2 ± 1.4 |
| Demographic | Hormonal contraceptives | 0.7 ± 4.5 | 0.7 ± 0.5 | 0.6 ± 0.5 | 0.7 ± 0.5 | 0.7 ± 0.5 | 0.7 ± 0.5 | 0.7 ± 0.5 | 0.7 ± 0.5 |
| Habit | Hormonal contraceptives (years) | 2.9 ± 4.8 | 1.9 ± 3.5 | 3.2 ± 5.2 | 1.9 ± 3.4 | 3.3 ± 6.4 | 1.9 ± 3.4 | 3.3 ± 5.4 | 1.9 ± 3.4 |
| Habit | IUD | 0.2 ± 0.4 | 0.1 ± 0.3 | 0.2 ± 0.4 | 0.1 ± 0.3 | 0.1 ± 0.3 | 0.1 ± 0.3 | 0.2 ± 0.4 | 0.1 ± 0.3 |
| Habit | IUD (years) | 0.6 ± 1.5 | 0.4 ± 1.8 | 0.9 ± 2.9 | 0.4 ± 1.7 | 0.5 ± 1.7 | 0.4 ± 1.8 | 0.7 ± 2.0 | 0.4 ± 1.8 |
| Habit | STDs | 0.2 ± 0.4 | 0.1 ± 0.3 | 0.2 ± 0.4 | 0.1 ± 0.3 | 0.2 ± 0.4 | 0.2 ± 0.3 | 0.2 ± 0.4 | 0.1 ± 0.3 |
| Habit | STDs (number) | 0.3 ± 0.9 | 0.2 ± 0.5 | 0.4 ± 0.8 | 0.1 ± 0.5 | 0.3 ± 0.7 | 0.1 ± 0.5 | 0.3 ± 0.8 | 0.1 ± 0.5 |
| Habit | STDs condylomatosis | 0.1 ± 0.3 | 0.1 ± 0.2 | 0.1 ± 0.3 | 0.0 ± 0.2 | 0.1 ± 0.3 | 0.0 ± 0.2 | 0.1 ± 0.3 | 0.0 ± 0.2 |
| Habit | STDs cervical condylomatosis | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Habit | STDs vaginal condylomatosis | 0.0 ± 0.0 | 0.0 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.1 |
| History | STDs vulvo-perineal condylomatosis | 0.1 ± 0.3 | 0.0 ± 0.2 | 0.1 ± 0.3 | 0.0 ± 0.2 | 0.1 ± 0.3 | 0.0 ± 0.2 | 0.1 ± 0.4 | 0.0 ± 0.1 |
| History | STDs syphilis | 0.0 ± 0.2 | 0.0 ± 0.1 | 0.0 ± 0.2 | 0.0 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.1 |
| History | STDs pelvic inflammatory disease | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| History | STDs genital herpes | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.1 | 0.0 ± 0.0 |
| History | STDs molluscum contagiosum | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| History | STDs AIDS | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| History | STDs HIV | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 |
| History | STDs Hepatitis B | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| History | STDs HPV | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.1 |
| Genomics | Dx CIN | 0.0 ± 0.0 | 0.0 ± 0.1 | 0.0 ± 0.1 | 0.0 ± 0.1 | 0.0 ± 0.0 | 0.0 ± 0.1 | 0.1 ± 0.2 | 0.0 ± 0.1 |
| Genomics | Dx HPV | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 |
| Genomics | Dx | 0.1 ± 0.3 | 0.0 ± 0.2 | 0.1 ± 0.3 | 0.0 ± 0.2 | 0.1 ± 0.3 | 0.0 ± 0.2 | 0.1 ± 0.3 | 0.0 ± 0.1 |
| Genomics | Dx cancer | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 | 0.1 ± 0.3 | 0.0 ± 0.1 |


2.6. Extreme Gradient Boosting. eXtreme Gradient Boosting (XGBoost) is a tree-based ensemble technique [20]. XGBoost can be used for classification, regression, and ranking problems. XGBoost is a type of gradient boosting. Gradient Boosting (GB) is a boosting ensemble technique that builds predictors sequentially instead of individually. GB is a method that produces a strong classifier by combining weak classifiers [21]. The goal of GB is building an iterative model that optimizes a loss function; it pinpoints the failings of weak learners by using gradients of the loss function [21]. For instance, for a simple model of the form

$$y = ax + b + e, \tag{5}$$

where e denotes the error term, the loss function measures how well the model fits the underlying data. The loss function depends on the optimization goal: for regression, it is a measure of the error between the true and predicted values, whereas for classification, it measures how well the model classifies cases correctly [21]. This technique takes less time and fewer iterations, since predictors learn from the past mistakes of the other predictors.

Table 2: Missing records per attribute in the data set.

| Feature name | Missing values |
| --- | --- |
| First sexual intercourse (age) | 7 |
| Smoking | 13 |
| Smokes (years) | 13 |
| Smokes (packs/year) | 13 |
| Number of sexual partners | 26 |
| Number of pregnancies | 56 |
| Hormonal contraceptives | 108 |
| Hormonal contraceptives (years) | 108 |
| IUD | 117 |
| IUD (years) | 117 |
| STDs | 105 |
| STDs (number) | 105 |
| STDs condylomatosis | 105 |
| STDs cervical condylomatosis | 105 |
| STDs vaginal condylomatosis | 105 |
| STDs vulvo-perineal condylomatosis | 105 |
| STDs syphilis | 105 |
| STDs pelvic inflammatory disease | 105 |
| STDs genital herpes | 105 |
| STDs molluscum contagiosum | 105 |
| STDs AIDS | 105 |
| STDs HIV | 105 |
| STDs hepatitis B | 105 |
| STDs HPV | 105 |
| STDs time since first diagnosis | 787 |
| STDs time since last diagnosis | 787 |

Figure 1: Number of records per class label in the data set (negative/positive): Biopsy 803/55, Cytology 814/44, Hinselmann 823/35, Schiller 784/74.


The GB works by teaching a model C to predict values of the form

$$y' = C(x) \tag{6}$$

by minimizing a loss function, e.g., the mean squared error

$$\text{MSE} = \frac{1}{n} \sum_i (y_i' - y_i)^2, \tag{7}$$

where i iterates over a training set of size n of true values of the target variable y, y′ denotes the values estimated by C(x), y denotes the true values, and n is the number of instances in y.

Considering a GB model with M phases, with m as a single phase (1 ≤ m ≤ M), to improve some deficient model F_m, a new estimator h_m(x) is added. Therefore,

$$h_m(x) = y - F_m(x). \tag{8}$$

The estimator h is fitted to y − F_m(x), which is the difference between the true value and the predicted value, i.e., the residual. Thus, we attempt to correct the errors of the previous model F_m [22].
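A minimal sketch of this residual-fitting idea from equation (8), using shallow regression trees as the weak estimators on synthetic data of our own making:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for a real training set.
rng = np.random.default_rng(0)
x = rng.random((200, 1))
y = np.sin(4 * x[:, 0]) + 0.1 * rng.standard_normal(200)

prediction = np.zeros_like(y)  # F_0 = 0
learning_rate = 0.1
for m in range(100):
    residual = y - prediction                       # y - F_m(x), equation (8)
    h = DecisionTreeRegressor(max_depth=2).fit(x, residual)
    prediction += learning_rate * h.predict(x)      # F_{m+1} = F_m + lr * h_m

print(np.mean((y - prediction) ** 2))  # the MSE of equation (7) shrinks per iteration
```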

XGBoost is better than AdaBoost in terms of speed and performance. It is highly scalable and runs up to 10 times faster than traditional single-machine learning algorithms. XGBoost handles sparse data and implements several optimization and regularization techniques. Moreover, it also uses the concept of parallel and distributed computing.
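As a sketch, the classifier below is configured with the optimized values that the paper later reports in Table 4; the placeholder data is ours:

```python
import numpy as np
from xgboost import XGBClassifier  # the study used xgboost 1.2.0

# XGBoost with the optimized values reported in Table 4.
xgb = XGBClassifier(booster="gbtree", random_state=42, nthread=8,
                    learning_rate=0.05, gamma=0.1, max_depth=3)

# Placeholder data standing in for the 15 selected Hinselmann features.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 15))
y_demo = rng.integers(0, 2, 200)
xgb.fit(X_demo, y_demo)
```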

2.7. AdaBoost. Adaptive Boosting (AdaBoost) is a meta-learner originally proposed for binary classification by Freund and Schapire [23]. It is an ensemble technique that builds a meta-classifier by combining several weak classifiers using progressive learning.

AdaBoost uses the concept of boosting: an adaptive data sampling technique assigns high weights to the misclassified events, so the misclassified samples are more likely to be selected in the next iteration to better train the model, and the final prediction is made using weighted voting. AdaBoost has a reduced error rate and a better effect on prediction compared to bagging [24], and it uses decision tree stumps. Initially, all the samples in the data set have equal weights. Let x be the samples in the data set and let y be the target, a binary class represented by 0 and 1. The first decision tree stump uses some records from the data set, and predictions are made. After the initial prediction, the sample weights are updated: more weight is assigned to the data samples that were misclassified, and the samples with high weights are selected in the next iteration. The process continues until the error rate is completely reduced or a certain target level is achieved.

AdaBoost contains two main steps, combination and step forward, using a sequential iterative approach. All the instances in the training set have equal weights in the first iteration; in subsequent iterations, the weights are changed based on the error rates, and the instances with errors receive increased weights. A binary classification problem containing T training samples is represented in the following equation:

$$\{(x_i, y_i)\}_{i=1}^{T}, \quad \text{with } y_i \in \{0, 1\}. \tag{9}$$

Let C be the linear combination of weak classifiers. The combination of the classifiers is represented as

$$C(x) = \sum_{n=1}^{N} w_n c_n(x), \tag{10}$$

where N is the number of weak classifiers, w_n represents the weights, and c_n(x) represents the weak classifiers. In each subsequent iteration, the classifier is trained based on the performance of the classifier in the previous iteration:

$$C(x)_t = C(x)_{t-1} + w_n c_n(x), \tag{11}$$

where C(x)_t represents the classifier at iteration t and C(x)_{t−1} is the classifier at iteration t − 1.

The weights can be calculated using the following equation:

$$w_n = \frac{1}{2} \ln\left(\frac{1 - \epsilon_n}{\epsilon_n}\right), \tag{12}$$

where ϵ_n represents the error rate of the weak classifier.
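A small sketch of equation (12), together with an AdaBoost classifier configured with the values later reported in Table 5 (the base_estimator parameter name follows the scikit-learn versions contemporary with the study; newer releases rename it to estimator):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def classifier_weight(eps):
    # Equation (12): w_n = 0.5 * ln((1 - eps_n) / eps_n).
    return 0.5 * np.log((1 - eps) / eps)

print(classifier_weight(0.45))  # near-chance weak classifier -> small vote (~0.10)
print(classifier_weight(0.05))  # accurate classifier -> large vote (~1.47)

# AdaBoost with the optimized values from Table 5; base_estimator=None defaults
# to decision tree stumps, matching the description above.
ada = AdaBoostClassifier(base_estimator=None, n_estimators=600, learning_rate=1.0)
```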

2.8. Optimization Strategy. This section discusses the optimization strategy used to find the hyperparameter combination that produces the highest targeted outcomes. The Firefly optimization algorithm was used for parameter tuning; the details of Firefly are discussed in Section 2.3. Table 3 presents the hyperparameter values of Random Forest for all four targets; for RF, the "gini" index criterion was used. Table 4 presents the parameters used for XGBoost: the gbtree booster was used with a random state of 42 and a learning rate of 0.05. Similarly, Table 5 presents the optimal parameter values for AdaBoost. Furthermore, Figures 2–4 present the grid search optimization graphs for the Random Forest, eXtreme Gradient Boosting, and AdaBoost classifiers.
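As a sketch of the grid search behind Figure 2, the grid below is an assumption read off the figure legend (n_estimators in {10, 100, 200, 400, 800} against a max_depth range); analogous grids apply to the XGBoost and AdaBoost searches in Figures 3 and 4:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid inferred from the legend and axis of Figure 2.
grid = GridSearchCV(
    RandomForestClassifier(criterion="gini"),
    param_grid={"n_estimators": [10, 100, 200, 400, 800],
                "max_depth": [15, 20, 25, 30, 35, 40]},
    scoring="accuracy",
    cv=10,
)
# grid.fit(X_bal, y_bal); grid.best_params_ then yields values like those in Table 3.
```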

3. Experimental Setup and Results

The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The scikit-learn library was used for the classifiers along with other needed built-in tools, while a separate library (xgboost 1.2.0) was used for the XGBoost ensemble. K-fold cross-validation with K = 10 was used for partitioning the data into training and testing sets. Five evaluation measures were used: accuracy, sensitivity (recall), specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Sensitivity and specificity are emphasized in the study due to the application of the proposed model. Accuracy denotes the percentage of correctly classified cases, sensitivity measures the percentage of positive cases that were classified as positive, and specificity refers to the percentage of negative cases that were classified as negative. Moreover, the criteria for the selection of the performance evaluation measures depend upon the measures used in the benchmark studies. Two sets of experiments were conducted for each target: one using the features selected by the Firefly feature selection algorithm and one using all 30 features. The SMOTE technique was applied to generate synthetic data. The results of the models are presented in the sections below.
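A sketch of this evaluation protocol is shown below; shuffling and the seed are our assumptions, and the five measures are derived from the confusion matrix (PPA/NPA correspond to the positive/negative predictive values):

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict

def evaluate(model, X, y):
    # 10-fold cross-validated predictions, then the five study measures.
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    pred = cross_val_predict(model, X, y, cv=cv)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    return {"accuracy":    (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),   # positives classified as positive
            "specificity": tn / (tn + fp),   # negatives classified as negative
            "PPA":         tp / (tp + fp),
            "NPA":         tn / (tn + fn)}
```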

3.1. Hinselmann. Table 6 presents the accuracy, sensitivity, specificity, PPA, and NPA for the RF, AdaBoost, and XGBoost models, respectively, using SMOTE for the Hinselmann test target class. The number of selected features for Hinselmann was 15. XGBoost outperformed the other classifiers for both feature sets; however, the performance of XGBoost with the selected features is better when compared with the 30 features. The model produced an accuracy of 98.38%, sensitivity of 97.55%, specificity of 99.2%, PPA of 99.17%, and NPA of 97.63%, respectively.

3.2. Schiller. Table 7 presents the outcomes for the Schiller test. As for the Hinselmann target, XGBoost with selected features outperformed the other classifiers for Schiller. However, the outcomes achieved by the models for Schiller are lower when compared with the Hinselmann target class.

Table 3: Random Forest optimized parameter set for all four targets using Firefly.

| Parameter | Optimal value obtained |
| --- | --- |
| n_estimators | 100 |
| max_features | log2 |
| criterion | gini |
| max_depth | 15 |
| min_samples_split | 5 |
| min_samples_leaf | 1 |

Table 4: XGBoost optimized parameter set for all four targets using Firefly.

| Parameter | Optimal value obtained |
| --- | --- |
| booster | gbtree |
| random_state | 42 |
| nthread | 8 |
| learning_rate | 0.05 |
| gamma (i.e., min_split_loss) | 0.1 |
| max_depth | 3 |

Table 5: AdaBoost optimized parameter set for all four targets using Firefly.

| Parameter | Optimal value obtained |
| --- | --- |
| base_estimator | None |
| n_estimators | 600 |
| learning_rate | 1.0 |

Figure 2: Grid search optimization for Random Forest (mean test score vs. max_depth, for n_estimators ∈ {10, 100, 200, 400, 800}).

Figure 3: Grid search optimization for eXtreme Gradient Boosting (mean test score vs. min_child_weight, for max_depth ∈ {3, 4, 5}).

Figure 4: Grid search optimization for AdaBoost (mean test score vs. n_estimators, for learning_rate ∈ {0.001, 0.01, 0.1, 0.2, 0.5}).


The performance of RF and XGBoost with the selected features is similar for Schiller, with only a minor difference. The number of features selected by Firefly for Schiller was 13.

3.3. Cytology. Table 8 presents the outcomes of all the classifiers for the Cytology diagnostic test. As with the Hinselmann and Schiller diagnostic tests, XGBoost outperformed the other classifiers with selected features for Cytology as well. For specificity and accuracy, similar outcomes were achieved using the 30 features and the selected features. Similarly, the performance of RF is alike for both the 30 and the selected features. The number of features selected by the Firefly feature selector for Cytology was 11.

3.4. Biopsy. Similarly, the performance was not drastically different, yet using all the features resulted in a higher accuracy than using SMOTE with selected features for Biopsy, as shown in Table 9. XGBoost obtained the highest accuracy of 97.1% with all features; however, for the other measures, the performance of XGBoost is better with the selected features. Similar performance was achieved on all measures when classifying with RF for both feature sets, 30 and selected, respectively. The number of selected features used for the Biopsy target class was 11.

Overall, after comparing all four diagnostic tests, the Hinselmann test achieved the best outcome and can be used for the diagnosis of cervical cancer, as shown in Table 10. As per the outcomes achieved in the proposed study, the Hinselmann diagnostic test performs better when compared with the other cervical cancer diagnostic tests, i.e., Schiller, Biopsy, and Cytology, respectively. Similar findings were made in the Abdoh et al. [5] and Wu and Zhou [4] studies.

4. Comparison with Existing Studies

The study used three ensemble techniques: AdaBoost, eXtreme Gradient Boosting, and Random Forest. Furthermore, the proposed study is the pioneer in using a bioinspired algorithm for feature selection and optimization for cervical cancer diagnosis. To explore the significance of our proposed study, its outcome was compared with benchmark studies. The criterion for benchmark study selection was the use of the same data set for the diagnosis of cervical cancer. Table 11 contains the comparison of the proposed technique with the benchmark studies in the literature. The best outcomes in the benchmark studies were achieved using 30 features; however, some of the outcomes in the previous studies were achieved with reduced features. The number in brackets next to some of the outcomes represents the number of features.

Therefore, based on Table 11, the proposed study outperforms the two benchmark studies in terms of accuracy with reduced risk factors. However, the achieved sensitivity and NPA are lower than those of Wu and Zhou [4], though higher than those of Abdoh et al. [5]; note that the number of features in the Wu and Zhou study is 30, while the proposed study used reduced risk factors. The specificity and PPA of the proposed study are higher than those of the benchmark studies, except for the Schiller diagnostic test.

In a nutshell, the main contributions of the current study are applying a bioinspired algorithm for feature selection and for model optimization on the cervical cancer risk factors data set.

Table 6: Performance of ensemble classifiers using 30 and selected features for the Hinselmann target.

| Features | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
| --- | --- | --- | --- | --- | --- | --- |
| 30 | RF | 96.36 | 97.05 | 95.72 | 95.44 | 97.23 |
| 30 | AB | 90.08 | 90.34 | 89.84 | 89.21 | 90.91 |
| 30 | XGB | 97.37 | 96.72 | 98 | 97.93 | 96.84 |
| Selected | RF | 97.57 | 97.12 | 98.01 | 97.93 | 97.23 |
| Selected | AB | 94.94 | 95 | 94.88 | 94.61 | 95.26 |
| Selected | XGB | 98.38 | 97.55 | 99.2 | 99.17 | 97.63 |

Table 7: Performance of ensemble classifiers using 30 and selected features for the Schiller target.

| Features | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
| --- | --- | --- | --- | --- | --- | --- |
| 30 | RF | 93.84 | 94.54 | 93.12 | 93.36 | 94.35 |
| 30 | AB | 86.62 | 85.32 | 88.13 | 89.21 | 83.91 |
| 30 | XGB | 92.36 | 93.99 | 90.76 | 90.87 | 93.91 |
| Selected | RF | 95.97 | 95.49 | 96.48 | 96.68 | 95.89 |
| Selected | AB | 89.6 | 86.92 | 92.89 | 93.78 | 85.22 |
| Selected | XGB | 96.98 | 95.9 | 96.92 | 97.1 | 95.65 |

Table 8: Performance of ensemble classifiers using 30 and selected features for the Cytology target.

| Features | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
| --- | --- | --- | --- | --- | --- | --- |
| 30 | RF | 95.91 | 94.09 | 97.87 | 97.95 | 93.88 |
| 30 | AB | 93.66 | 91.12 | 96.52 | 96.72 | 90.61 |
| 30 | XGB | 96.32 | 94.49 | 98.3 | 98.36 | 94.29 |
| Selected | RF | 95.91 | 94.44 | 97.47 | 97.54 | 94.29 |
| Selected | AB | 92.64 | 90.94 | 94.47 | 94.67 | 90.61 |
| Selected | XGB | 96.93 | 95.26 | 99.74 | 98.73 | 95.1 |

Table 9: Performance of ensemble classifiers using 30 and selected features for the Biopsy target.

| Features | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
| --- | --- | --- | --- | --- | --- | --- |
| 30 | RF | 96.68 | 95.45 | 97.32 | 97.88 | 95.53 |
| 30 | AB | 93.57 | 93.99 | 93.17 | 92.8 | 94.31 |
| 30 | XGB | 97.1 | 95.49 | 97.52 | 97.46 | 95.53 |
| Selected | RF | 96.27 | 95.8 | 96.72 | 96.61 | 95.93 |
| Selected | AB | 89.21 | 88.02 | 90.42 | 90.25 | 88.21 |
| Selected | XGB | 96.68 | 95.83 | 98.74 | 98.73 | 95.93 |

Table 10: Results of the proposed model for the 4 diagnostic tests for cervical cancer.

| Diagnosis test | Accuracy | Sensitivity | Specificity | PPA | NPA |
| --- | --- | --- | --- | --- | --- |
| Hinselmann | 98.38 | 97.55 | 99.2 | 99.17 | 97.63 |
| Schiller | 96.98 | 95.9 | 96.92 | 97.1 | 95.65 |
| Biopsy | 96.68 | 95.83 | 98.74 | 98.73 | 95.93 |
| Cytology | 96.93 | 95.26 | 98.73 | 98.77 | 95.1 |


The proposed model enhanced the outcomes when compared with previous studies on the cervical cancer risk factors data set. Despite the above-mentioned advantages, the study suffers from some limitations: the data set suffers from huge imbalance, and augmented data was generated using SMOTE. Moreover, the current study was based on an open-source data set, and further testing is required using other real and open-source data sets. To alleviate the above-mentioned limitations, there is a need for validating the model on real data sets from hospitals.

5. Conclusion

This study presents an investigation of several ensemble techniques, namely, Random Forest, AdaBoost, and eXtreme Gradient Boosting, for diagnosing cervical cancer. The data set was obtained from the UCI machine learning repository and contains 858 records, 32 features, and 4 target variables; the target variables are the diagnostic tests used for cervical cancer. Experiments were conducted for each target class separately. Data preprocessing included imputing missing values and class balancing using SMOTE. Moreover, the bioinspired Firefly algorithm was used to optimize the models and to identify the key features. To compare the performance of the models, the experiments were conducted with 30 features and with the selected features using the SMOTE-balanced data. eXtreme Gradient Boosting outperformed the other two models for all four target variables. For future work, the model will be validated on multiple data sets; also, other models that handle outliers and unbalanced data differently should be investigated.

Data Availability

The study used the open-source data set available at https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] American Cancer Society, Key Statistics for Cervical Cancer, American Cancer Society, Atlanta, GA, USA, 2021, https://www.cancer.org/cancer/cervical-cancer/about/key-statistics.html.

[2] "Cervical cancer risk factors," 2020, https://www.cancer.net/cancer-types/cervical-cancer/risk-factors.

[3] "Cervical cancer," 2020, https://www.nccc-online.org/hpvcervical-cancer/cervical-cancer-overview.

[4] W. Wu and H. Zhou, "Data-driven diagnosis of cervical cancer with support vector machine-based approaches," IEEE Access, vol. 5, 2017.

[5] S. F. Abdoh, M. Abo Rizka, and F. A. Maghraby, "Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques," IEEE Access, vol. 6, 2018.

[6] J. Lu, E. Song, A. Ghoneim, and M. Alrashoud, "Machine learning for assisting cervical cancer diagnosis: an ensemble approach," Future Generation Computer Systems, vol. 106, pp. 199–205, 2020.

[7] E. Karim and N. Neehal, "An empirical study of cervical cancer diagnosis using ensemble methods," in Proceedings of the 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Dhaka, Bangladesh, December 2019.

[8] A. Ul-Islam, S. H. Ripon, and N. Qaisar Bhuiyan, "Cervical cancer risk factors: classification and mining associations," APTIKOM Journal on Computer Science and Information Technologies, vol. 4, no. 1, pp. 8–18, 2019.

[9] Y. M. S. Al-Wesabi, A. Choudhury, and D. Won, "Classification of cervical cancer dataset," in Proceedings of the 2018 IISE Annual Conference and Expo, pp. 1456–1461, Orlando, FL, USA, May 2018.

[10] K. Fernandes, D. Chicco, J. S. Cardoso, and J. Fernandes, "Supervised deep learning embeddings for the prediction of cervical cancer diagnosis," PeerJ Computer Science, vol. 4, no. 5, pp. e154–21, 2018.

[11] K. Adem, S. Kiliçarslan, and O. Comert, "Classification and diagnosis of cervical cancer with stacked autoencoder and softmax classification," Expert Systems with Applications, vol. 115, pp. 557–564, 2019.

[12] K. Fernandes, J. S. Cardoso, and J. Fernandes, "Transfer learning with partial observability applied to cervical cancer screening," in Pattern Recognition and Image Analysis, vol. 10255, pp. 243–250, Springer, Berlin, Germany, 2017.

Table 11: Comparison of the proposed study with benchmark studies.

| Target class | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
| --- | --- | --- | --- | --- | --- | --- |
| Hinselmann (15) | Abdoh et al. [5] | 93.97 | 100 | 89.96 | 84.97 | 100 |
| Hinselmann (15) | Wu and Zhou [4] | 97.6 | 96.65 | 98.54 | 98.48 | 96.78 |
| Hinselmann (15) | Proposed study | 98.38 | 97.55 | 99.2 | 99.17 | 97.63 |
| Schiller (13) | Abdoh et al. [5] | 90.18 | 98.73 | 84.63 | 80.75 | 99.03 |
| Schiller (13) | Wu and Zhou [4] | 95.01 | 93.24 | 97.58 (12) | 97.29 (12) | 93.81 |
| Schiller (13) | Proposed study | 96.39 | 95.9 | 96.92 | 97.1 | 95.65 |
| Cytology (11) | Abdoh et al. [5] | 92.75 | 100 | 87.92 | 83 | 100 |
| Cytology (11) | Wu and Zhou [4] | 96.94 | 95.58 (8) | 99.01 | 98.94 | 95.76 (8) |
| Cytology (11) | Proposed study | 96.98 | 95.83 | 99.74 | 98.73 | 95.93 |
| Biopsy (11) | Abdoh et al. [5] | 94.13 | 100 | 90.21 | 86.07 | 100 |
| Biopsy (11) | Wu and Zhou [4] | 96.06 | 94.94 (6) | 97.76 (11) | 97.58 (11) | 94.91 |
| Biopsy (11) | Proposed study | 96.93 | 95.26 | 98.73 | 98.77 | 95.1 |


[13] "Cervical cancer (risk factors) data set," 2020, https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.

[14] A. Fernandez, S. Garcia, F. Herrera, and N. V. Chawla, "SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.

[15] X. Yang, Nature-Inspired Metaheuristic Algorithms, Luniver Press, Cambridge, UK, 2nd edition, 2010.

[16] B. H. Nguyen, B. Xue, and M. Zhang, "A survey on swarm intelligence approaches to feature selection in data mining," Swarm and Evolutionary Computation, vol. 54, 2020.

[17] M. Anbu and G. S. Anandha Mala, "Feature selection using firefly algorithm in software defect prediction," Cluster Computing, vol. 22, no. s5, Article ID 10925, 2019.

[18] L. Breiman, "Random forests," in Hands-On Machine Learning with R, pp. 203–219, CRC Press, Boca Raton, FL, USA, 2019.

[19] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, "How many trees in a random forest?" in Lecture Notes in Computer Science, vol. 7376, pp. 154–168, Springer, New York, NY, USA, 2012.

[20] T. Chen and C. Guestrin, "XGBoost: a scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, San Francisco, CA, USA, August 2016.

[21] A. Natekin and A. Knoll, "Gradient boosting machines, a tutorial," Frontiers in Neurorobotics, vol. 7, 2013.

[22] J. H. Friedman and J. J. Meulman, "Multiple additive regression trees with application in epidemiology," Statistics in Medicine, vol. 22, no. 9, pp. 1365–1381, 2003.

[23] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[24] S. B. Kotsiantis, "Supervised machine learning: a review of classification techniques," Informatica, vol. 31, pp. 249–268, 2007.

10 Scientific Programming

Page 2: Cervical Cancer Diagnosis Model Using Extreme Gradient

using different types of data such as clinical-based imageand genetic-based data In our study we used clinical cer-vical risk factor data Two similar studies were conducted byWu and Zhou [4] and Abdoh et al [5] they performed thecomparative analysis of two feature selection techniquesnamely recursive feature elimination (RFE) and PrincipalComponent Analysis (PCA) +e first study used SupportVector Machine (SVM) and the other study used RandomForest (RF) Both studies used the same number of featuresAlthough the data suffered from imbalance an oversamplingwas applied to the data in [4] and SMOTE was used in [5]Both studies identified two risk factors to be removed such astime since the first and last diagnosis of STDs (sexuallytransmitted diseases) due to a lot of missing entries Fur-thermore the study [4] discovered that less computationalcost was an advantage given by both SVM-PCA and SVM-RFE whereas high computational cost is a limitation to theSVM model Moreover STDs intrauterine device (IUD)hormonal contraceptives and first sexual intercourse wereidentified as the highly relevant features [5] Overall theoutcome of both the studies showed that using 30 featuresproduced highest results Furthermore it was found that theSMOTE-RF model performed well for all targets

Similarly Lu et al [6] and Karim and Neehal [7] usedensemble models to estimate the risk of cervical cancer Bothstudies performed data cleaning mechanism to replacemissing values +e former study used an ensemble classifierwith voting strategy using a combination of a private andpublic data set +e private data set contains 472 recordstaken from Chinese hospital +e public data set was ob-tained from the UCI repository 14 features were used +eprivate data set was collected using questionnaire +e re-sults revealed that voting ensemble classifier produced betterresults when compared to Linear Regression Decision Tree(DT) Multilayer Perceptron (MLP) SVM and K-NNclassifiers On the other hand Karim and Neehal study usedDT MLP and SVM using sequential Minimal Optimization(SMO) and K nearest neighbor (KNN) techniques Exper-iments showed that SMO has a better performance in termsof accuracy precision recall and F-measure Similarly Ul-Islam et al [8] used DT RF Logistic Model Tree and ANNfor cervical cancer detection Apriori algorithm was used toidentify features that strongly relate to cancer +e studyfound that age number of sexual partners hormonalcontraceptives number of pregnancies and first sexualintercourse are significant risk factors Results indicated thatRF produced best outcome when compared to the othermodels

Al-Wesabi et al [9] conducted a comparison betweendifferent machine learning classifiers such as Gaussian NaıveBayes (GNB) KNN DT LR and SVM +e outcome of theclassifiers was not satisfactory due to the data imbalance Toresolve this problem undersampling oversampling andSMOTETomek were applied Oversampling had the bestresult among all three methods Moreover a SequentialFeature Selector was applied with both forward and back-ward versions Both the Sequential Forward Feature Selector(SFS) and Sequential Backward Feature Selector (SBS) en-hanced the performance of the prediction with an accuracy

of 95 After selecting the common features between DTand KNN the accuracy exceeded 97 for the DT+e resultsrevealed that age first sexual intercourse number ofpregnancies smoking hormonal contraceptives and STDsgenital herpes were the main predictive features

Similarly several studies have been made using deeplearning and transfer learning for cervical cancer diagnosisFernandes et al [10] and Adem et al [11] used deep learningand showed significant outcome in terms of diagnosis ac-curacy +e study [10] used a loss function that provides asupervised optimization of dimensionality reduction andclassification models +e study indicated that it can beuseful in examining records of patients if the Biopsy andperhaps other testing results are absent and are capable ofclassifying successfully whether they have cervical cancer ornot On the other hand the researchers in [11] used a deepneural network model with softmax function to classify thedata sets +e performance of the softmax function withstacked autoencoder was compared with the other machinelearning methods (DT KNN SVM Feed Forward NN andRotation Forest models) It was found that the softmaxfunction with a stacked autoencoder model produced betteroutcome classification rate of 978

Similarly Fernandes et al [12] applied transfer learningwith partial observability for cancer screenings +e limi-tation of the study was that several patients were resistinganswering some questions for privacy concerns Challengeswere also faced in defining quality as there are multiplereadings and it started relying on human preference+erefore as an alternative of an ordinal scale a simplebinary scheme was used Nevertheless the model perfor-mance was considerable

Conclusively the findingmade after the above-mentionedliterature is that the data set found at UCI repository hadseveral missing values therefore previous studies have re-moved at least 2 features Missing values were due to pa-tientrsquos concerns regarding their privacy After removing 2features due to huge missing value SVM-PCA seemed toprovide satisfactory performance However SMO andSMOTE-RF were amongst the best performing modelsAnother approach to deal with the imbalance in UCI cervicalrisk factor data set was using oversampling Deep learningproved to be effective especially where the Biopsy andpossibly other screening results are absent Age first sexualintercourse number of pregnancies smoking hormonalcontraceptives IUD STDs STDs genital warts or HPVinfections were identified as the top key features +e sig-nificant outcomes made by the machine learning classifiersmotivate the need for further investigation and enhance-ment of the outcomes for the prediction of cervical cancer

In this study three ensemble-based classifiers extremeGradient Boosting Ada Boost and RF are used to classifycervical cancer Cervical Cancer Risk factor data set fromUCI machine learning repository was collected at ldquoHospitalUniversitario de Caracasrdquo in Caracas Venezuela [13] Inaddition to the importance of correctly classifying cancerousand noncancerous cases it is also essential to identify keyrisk factors that contribute to developing cancer Nature-inspired Firefly feature selection and optimization algorithm

2 Scientific Programming

was applied Furthermore the Synthetic Minority Over-sampling Technique (SMOTE) is used to balance the classesof the data as it suffers greatly from imbalanced problem

+e paper is organized as follows Section 2 presentsmaterial andmethods Section 3 contains experimental setupand results +e comparison of the proposed model with theexisting studies using the same dataset is discussed in Section4 Finally Section 5 contains the conclusion

2 Material and Method

21DatasetDescription +e cervical cancer risk factors dataset used in the study was collected at ldquoHospital Universitariode Caracasrdquo in Caracas Venezuela and is available on theUCI Machine Learning repository [13] It consists of 858records with somemissing values as several patients did notanswer some of the questions due to privacy concerns +edata set contains 32 risk factors and 4 targets ie the di-agnosis tests used for cervical cancer It contains differentcategories of feature set such as habits demographic in-formation history and Genomic medical records Featuressuch as age Dx Cancer Dx CIN Dx HPV and Dx featurescontains no missing values Dx CIN is a change in the wallsof cervix and is commonly due to HPV infection sometimesit may lead to cancer if it is not treated properly HoweverDx cancer variable is represented if the patient has othertypes of cancer or not Sometimes a patient may have morethan one type of cancer In the data set some of the patientsdo not have cervical cancer but they had the Dx cancervalue true +erefore it is not used as a target variable

Table 1 presents a brief description of each feature withthe type Cervical cancer diagnosis usually requires severaltests this data contains the widely used diagnosis tests as thetarget Hinselmann Schiller Cytology and Biopsy are fourwidely used diagnosis tests for cervical cancer Hinselmannor Colposcopy is a test that examines the inside of the vaginaand cervix using a tool that magnifies the tissues to detectany anomalies [3] Schiller is a test in which a chemicalsubstance called iodine is applied to the cervix where itstains healthy cells into brown color and leaves the abnormalcells uncolored while cytology is a test that examines bodycells from uterine cervix for any cancerous cells or otherdiseases And Biopsy refers to the test where a small part ofcervical tissue is examined under a microscope Most Biopsytests can make significant diagnosis

2.2. Dataset Preprocessing. The data set suffers from a huge number of missing values: 24 features out of the 32 contained missing values. Initially, the features with a huge percentage of missing values were removed. The STDs: Time since first diagnosis and STDs: Time since last diagnosis features were removed, since they each have 787 missing values (see Table 2), which is more than half of the data. However, data imputation was performed for the features with fewer missing values; the most frequent value technique was used to impute the remaining missing values. Additionally, the data set suffers from huge class imbalance. The target labels were imbalanced, with only 35 positive records for Hinselmann, 74 for Schiller, 44 for Cytology, and 55 for Biopsy out of the 858 records, as shown in Figure 1. SMOTE was used to deal with the class imbalance. SMOTE works by oversampling the minority class, generating new synthetic data for minority instances based on nearest neighbors using the Euclidean distance between data points [14]. Figure 1 shows the number of records per class label in the data set.
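To make this step concrete, the following is a minimal sketch of SMOTE-based balancing using the imbalanced-learn library; the synthetic toy data stands in for one target (e.g., Hinselmann, 823 negative versus 35 positive) and is not the paper's actual code.

```python
# Minimal SMOTE sketch using imbalanced-learn; the toy data below is a
# stand-in for one target, not the study's risk factor data.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=858, n_features=30,
                           weights=[0.96], random_state=42)
print("before:", Counter(y))                      # heavily imbalanced

# SMOTE synthesizes new minority samples from nearest neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))                  # classes balanced
```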

2.3. Firefly Feature Selection. Dimensionality reduction is one of the effective ways to select the features that improve the performance of a supervised learning model. In this study, we adopted the nature-inspired Firefly algorithm for selecting the features that better formulate the problem. Firefly was proposed by Yang [15], initially for optimization. The metaheuristic Firefly algorithm is inspired by the flashing behavior of fireflies. It is a population-based optimization algorithm that searches for the optimal value or parameter of a target function. In this technique, each fly is attracted by the glow intensity of the nearby flies; if the intensity of the gleam is extremely low at some point, the attraction declines. Firefly uses three rules: (a) all the flies are of the same gender; (b) attractiveness depends upon the intensity of the glow; (c) the target function generates the gleam of the firefly. The flies with less glow move towards the flies with brighter glow, and the brightness can be adjusted using the objective function. The same idea is implemented in the algorithm to search for the optimal features that can better fit the training model. Firefly is more computationally economical and produced better outcomes in feature selection when compared with other metaheuristic techniques like genetic algorithms and particle swarm optimization [16]. The time complexity of Firefly is O(n²t) [17]. It uses the light intensity to select the features; highly relevant features are represented as the features with high-intensity light.

For feature selection, initially some fireflies are generated, and each fly randomly assigns weights to all features. In our study, we generated 50 flies (n = 50). The dimension of the data set is 30. Furthermore, the lower bound was set to −50, while the upper bound was set to 50. The maximum number of generations was 500. Additionally, α (alpha) was initially set to 0.5, and in every subsequent iteration we used equations (1) and (2) to update the α value:

$$X = 1 - \left(\frac{10^{-4}}{0.9}\right)^{1/i}, \quad i = 1 \longrightarrow 500, \qquad (1)$$

$$\alpha = (1 - X) \times \alpha. \qquad (2)$$

However, the gamma (γ) was set to 1. The number of features selected using Firefly was 15 for Hinselmann, 13 for Schiller, 11 for Cytology, and 11 for Biopsy, respectively.
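To make the procedure concrete, below is a simplified sketch of Firefly-based feature selection under the stated settings (n = 50 flies, dimension 30, γ = 1, and the α update of equations (1) and (2)). The fitness function, the thresholding rule (positive weight means "selected"), the shortened generation count, and the narrower demo bounds are illustrative assumptions, not the authors' implementation.

```python
# Illustrative Firefly feature selection, not the authors' code: fitness is
# cross-validated accuracy on the features whose weight is positive.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=42)

n_flies, dim, max_gen = 50, 30, 500
# Demo bounds are narrowed from the paper's [-50, 50] so that the
# attraction term beta0 * exp(-gamma * r^2) stays non-negligible.
lb, ub = -5.0, 5.0
alpha, beta0, gamma = 0.5, 1.0, 1.0

def fitness(weights):
    mask = weights > 0                    # assumed rule: positive => selected
    if not mask.any():
        return 0.0
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pos = rng.uniform(lb, ub, size=(n_flies, dim))
light = np.array([fitness(p) for p in pos])

for t in range(10):                       # shortened from 500 generations
    for i in range(n_flies):
        for j in range(n_flies):
            if light[j] > light[i]:       # dimmer fly moves toward brighter
                r2 = np.sum((pos[i] - pos[j]) ** 2)
                beta = beta0 * np.exp(-gamma * r2)
                pos[i] += (beta * (pos[j] - pos[i])
                           + alpha * rng.uniform(-1, 1, dim))
    pos = np.clip(pos, lb, ub)
    light = np.array([fitness(p) for p in pos])
    alpha *= (1e-4 / 0.9) ** (1 / max_gen)   # equations (1)-(2)

best = pos[np.argmax(light)]
print("selected feature indices:", np.flatnonzero(best > 0))
```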

2.4. Ensemble-Based Classification Methods. Three ensemble-based classification techniques, Random Forest, Extreme Gradient Boosting, and AdaBoost, were used to train the models. These techniques are described in the sections below.

2.5. Random Forest. Random Forest (RF) was first proposed by Breiman in 2001 [18]. Random Forest is an ensemble model that uses decision trees as individual models and bagging as the ensemble method. It improves the performance of a decision tree by adding many trees to reduce overfitting. RF can be used for both classification and regression. RF generates a forest of decision trees, gets a prediction from each one of them, and then selects the best solution by majority vote [19].

When training a tree, it is important to measure how much each feature decreases the impurity, as the decrease in impurity indicates the significance of the feature. The tree classification result depends on the impurity measure used. For classification, the impurity measures are either Gini impurity or information gain, while for regression the impurity measure is variance. Training a decision tree consists of iteratively splitting the data. Gini impurity decides the best split of the data using the formula

$$G = 1 - \sum_i p_i^2, \qquad (3)$$

where $p_i$ is the probability of selecting a data point with class $i$. Information gain (IG) is another measure to decide the best split of the data, depending on the gain of each feature. The formulas for entropy and information gain are given in the following equation:

$$\text{Entropy} = -\sum_i p_i \log_2 p_i,$$

$$\text{IG}(\text{parent}, \text{child}) = \text{Entropy}(\text{parent}) - \left[ p(c_1)\,\text{Entropy}(c_1) + p(c_2)\,\text{Entropy}(c_2) + \cdots \right]. \qquad (4)$$
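For concreteness, here is a small sketch of equations (3) and (4) on made-up labels (not the study's data):

```python
# Toy computation of Gini impurity (eq. 3) and information gain (eq. 4).
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)           # G = 1 - sum_i p_i^2

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))        # -sum_i p_i log2 p_i

def information_gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:4], parent[4:]      # one hypothetical split
print(gini(parent))                       # 0.5
print(information_gain(parent, [left, right]))  # ~0.459
```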

Table 1. Statistical description of the data set: mean (µ) ± std (σ) for the positive and negative cases of each target. The features fall into demographic, habit, history, and genomics categories.

| Feature | Hinselmann + | Hinselmann − | Schiller + | Schiller − | Cytology + | Cytology − | Biopsy + | Biopsy − |
|---|---|---|---|---|---|---|---|---|
| Age | 26.7 ± 7.7 | 26.8 ± 8.5 | 29.6 ± 11 | 26.6 ± 8.2 | 26.2 ± 8.4 | 26.9 ± 8.5 | 28.6 ± 8.9 | 26.7 ± 8.5 |
| Smokes | 0.2 ± 0.4 | 0.1 ± 0.4 | 0.2 ± 0.4 | 0.2 ± 0.3 | 0.1 ± 0.3 | 0.1 ± 0.4 | 0.2 ± 0.4 | 0.1 ± 0.3 |
| Smokes (years) | 2.5 ± 7.2 | 1.1 ± 3.9 | 2.4 ± 6.2 | 1.1 ± 3.8 | 1.1 ± 3.4 | 1.2 ± 4.1 | 2.2 ± 6.2 | 1.1 ± 3.9 |
| Smokes (packs/year) | 0.7 ± 2.6 | 0.4 ± 2.2 | 0.6 ± 1.9 | 0.4 ± 2.2 | 0.5 ± 2.3 | 0.4 ± 2.2 | 0.7 ± 2.3 | 0.4 ± 2.2 |
| Number of sexual partners | 2.2 ± 0.9 | 2.5 ± 1.7 | 2.5 ± 1.2 | 2.5 ± 1.7 | 2.7 ± 1.3 | 2.5 ± 1.7 | 2.5 ± 1.3 | 2.5 ± 1.7 |
| First sexual intercourse (age) | 16.8 ± 2.0 | 16.9 ± 2.8 | 17 ± 2.5 | 16.9 ± 2.8 | 16.9 ± 2.9 | 16.9 ± 2.8 | 17.1 ± 2.6 | 16.9 ± 2.8 |
| Number of pregnancies | 2.4 ± 1.4 | 2.5 ± 1.7 | 2.6 ± 1.7 | 2.2 ± 1.4 | 2.1 ± 1.4 | 2.2 ± 1.4 | 2.3 ± 1.3 | 2.2 ± 1.4 |
| Hormonal contraceptives | 0.7 ± 4.5 | 0.7 ± 0.5 | 0.6 ± 0.5 | 0.7 ± 0.5 | 0.7 ± 0.5 | 0.7 ± 0.5 | 0.7 ± 0.5 | 0.7 ± 0.5 |
| Hormonal contraceptives (years) | 2.9 ± 4.8 | 1.9 ± 3.5 | 3.2 ± 5.2 | 1.9 ± 3.4 | 3.3 ± 6.4 | 1.9 ± 3.4 | 3.3 ± 5.4 | 1.9 ± 3.4 |
| IUD | 0.2 ± 0.4 | 0.1 ± 0.3 | 0.2 ± 0.4 | 0.1 ± 0.3 | 0.1 ± 0.3 | 0.1 ± 0.3 | 0.2 ± 0.4 | 0.1 ± 0.3 |
| IUD (years) | 0.6 ± 1.5 | 0.4 ± 1.8 | 0.9 ± 2.9 | 0.4 ± 1.7 | 0.5 ± 1.7 | 0.4 ± 1.8 | 0.7 ± 2.0 | 0.4 ± 1.8 |
| STDs | 0.2 ± 0.4 | 0.1 ± 0.3 | 0.2 ± 0.4 | 0.1 ± 0.3 | 0.2 ± 0.4 | 0.2 ± 0.3 | 0.2 ± 0.4 | 0.1 ± 0.3 |
| STDs (number) | 0.3 ± 0.9 | 0.2 ± 0.5 | 0.4 ± 0.8 | 0.1 ± 0.5 | 0.3 ± 0.7 | 0.1 ± 0.5 | 0.3 ± 0.8 | 0.1 ± 0.5 |
| STDs condylomatosis | 0.1 ± 0.3 | 0.1 ± 0.2 | 0.1 ± 0.3 | 0 ± 0.2 | 0.1 ± 0.3 | 0 ± 0.2 | 0.1 ± 0.3 | 0 ± 0.2 |
| STDs cervical condylomatosis | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| STDs vaginal condylomatosis | 0 ± 0 | 0 ± 0.1 | 0 ± 0 | 0 ± 0.1 | 0 ± 0 | 0 ± 0.1 | 0 ± 0 | 0 ± 0.1 |
| STDs vulvo-perineal condylomatosis | 0.1 ± 0.3 | 0 ± 0.2 | 0.1 ± 0.3 | 0 ± 0.2 | 0.1 ± 0.3 | 0 ± 0.2 | 0.1 ± 0.4 | 0 ± 0.1 |
| STDs syphilis | 0 ± 0.2 | 0 ± 0.1 | 0 ± 0.2 | 0 ± 0.1 | 0 ± 0 | 0 ± 0.1 | 0 ± 0 | 0 ± 0.1 |
| STDs pelvic inflammatory disease | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| STDs genital herpes | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0.1 | 0 ± 0 |
| STDs molluscum contagiosum | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| STDs AIDS | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| STDs HIV | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 |
| STDs Hepatitis B | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| STDs HPV | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0.1 | 0 ± 0 | 0 ± 0.1 |
| Dx CIN | 0 ± 0 | 0 ± 0.1 | 0 ± 0.1 | 0 ± 0.1 | 0 ± 0 | 0 ± 0.1 | 0.1 ± 0.2 | 0 ± 0.1 |
| Dx HPV | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 |
| Dx | 0.1 ± 0.3 | 0 ± 0.2 | 0.1 ± 0.3 | 0 ± 0.2 | 0.1 ± 0.3 | 0 ± 0.2 | 0.1 ± 0.3 | 0 ± 0.1 |
| Dx cancer | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 | 0.1 ± 0.3 | 0 ± 0.1 |


2.6. Extreme Gradient Boosting. eXtreme Gradient Boosting (XGBoost) is a tree-based ensemble technique [20]. XGBoost can be used for classification, regression, and ranking problems. XGBoost is a type of gradient boosting. Gradient Boosting (GB) is a boosting ensemble technique that builds predictors sequentially instead of individually. GB is a method that produces a strong classifier by combining weak classifiers [21]. The goal of GB is to build an iterative model that optimizes a loss function; it pinpoints the failings of weak learners by using gradients in the loss function [21]. Consider a simple model

$$y = ax + b + e, \qquad (5)$$

where e denotes the error term. The loss function measures how well the model fits the underlying data. The loss function depends on the optimization goal: for regression, it is a measure of the error between the true and predicted values, whereas for classification, it measures how well the model classifies cases correctly [21]. This technique takes less time and fewer iterations, since predictors learn from the past mistakes of the other predictors.

Table 2. Missing records per attribute in the data set.

| Feature name | Missing values |
|---|---|
| First sexual intercourse (age) | 7 |
| Smoking | 13 |
| Smokes (years) | 13 |
| Smokes (packs/year) | 13 |
| Number of sexual partners | 26 |
| Number of pregnancies | 56 |
| Hormonal contraceptives | 108 |
| Hormonal contraceptives (years) | 108 |
| IUD | 117 |
| IUD (years) | 117 |
| STDs | 105 |
| STDs (number) | 105 |
| STDs condylomatosis | 105 |
| STDs cervical condylomatosis | 105 |
| STDs vaginal condylomatosis | 105 |
| STDs vulvo-perineal condylomatosis | 105 |
| STDs syphilis | 105 |
| STDs pelvic inflammatory disease | 105 |
| STDs genital herpes | 105 |
| STDs molluscum contagiosum | 105 |
| STDs AIDS | 105 |
| STDs HIV | 105 |
| STDs hepatitis B | 105 |
| STDs HPV | 105 |
| STDs time since first diagnosis | 787 |
| STDs time since last diagnosis | 787 |

Figure 1. Number of records per class label in the data set. Negative/positive counts: Biopsy 803/55, Cytology 814/44, Hinselmann 823/35, Schiller 784/74.


The GB works by teaching a model C to predict values of the form

$$Y' = C(x), \qquad (6)$$

by minimizing a loss function, e.g., the mean squared error,

$$\text{MSE} = \frac{1}{n} \sum_i \left(y_i' - y_i\right)^2, \qquad (7)$$

where i iterates over a training set of size n of true values of the target variable y, y′ denotes the values estimated by C(x), y denotes the true values, and n is the number of instances in y.

Consider a GB model with M phases, with m being a single phase (1 ≤ m ≤ M). To improve some deficient model F_m, a new estimator h_m(x) is added. Therefore,

$$h_m(x) = Y - F_m(x). \qquad (8)$$

The estimator h is fitted to Y − F_m(x), which is the difference between the true value and the predicted value, i.e., the residual. Thus, we attempt to correct the errors of the previous model (F_m) [22].
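The residual-fitting loop of equations (6)-(8) can be sketched as follows, on a toy regression problem with shallow trees as weak learners; the learning rate and the synthetic data are illustrative assumptions, not the study's configuration.

```python
# Illustrative only: each new estimator h_m is fitted to the residual
# Y - F_m(x) of the current ensemble, as in equation (8).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

F = np.full_like(y, y.mean())             # F_0: constant initial model
learning_rate, stages = 0.1, 50
estimators = []
for m in range(stages):
    residual = y - F                      # h_m targets the residual
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += learning_rate * h.predict(X)     # F_{m+1} = F_m + lr * h_m
    estimators.append(h)
print("MSE after boosting:", np.mean((y - F) ** 2))
```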

XGBoost is better than AdaBoost in terms of speed and performance. It is highly scalable and runs up to 10 times faster than traditional single machine learning algorithms. XGBoost handles sparse data and implements several optimization and regularization techniques. Moreover, it also uses the concept of parallel and distributed computing.

2.7. AdaBoost. Adaptive Boosting (AdaBoost) is a meta-learner originally proposed for binary classification by Freund and Schapire [23]. It is an ensemble technique that builds a meta classifier by combining several weak classifiers using progressive learning.

AdaBoost uses the concept of boosting as its data sampling technique: adaptive sampling is used to assign high weights to the misclassified events. The misclassified samples are selected in the next iteration to better train the model, and the final prediction is made using weighted voting. AdaBoost has a reduced error rate, has a better effect on the prediction as compared to bagging [24], and uses decision tree stumps. Initially, all the samples in the data set have equal weights. Let x be the samples in the data set and let y be the target, a binary class represented by 0 and 1. The first decision tree stump uses some records from the data set, and predictions are made. After the initial prediction, the sample weights are updated: more weight is assigned to the samples that were misclassified, and the samples with high weights are selected in the next iteration. The process continues until the error rate is completely reduced or a certain target level is achieved.

AdaBoost contains two main steps, combination and step forward, using a sequential iterative approach. All the instances in the training set have equal weights in the first iteration. However, in subsequent iterations, the weights are changed based on the error rates; the instances with errors have increased weights. A binary classification problem containing T training samples is represented in the following equation:

$$\{(x_i, y_i)\}_{i=1}^{T}, \quad \text{with } y_i \in \{0, 1\}. \qquad (9)$$

Let C be the linear combination of weak classifiers. The combination of the classifiers is represented as

$$C(x) = \sum_{n=1}^{N} w_n c_n(x), \qquad (10)$$

where N is the number of weak classifiers, w_n represents the weights, and c_n(x) represents the weak classifiers. In each subsequent iteration, the classifier is trained based on the performance of the classifier in the previous iteration:

$$C(x)_t = C(x)_{t-1} + w_n c_n(x), \qquad (11)$$

where C(x)_t represents the classifier at iteration t and C(x)_{t−1} is the classifier at iteration t − 1.

The weights can be calculated using the following equation:

$$w_n = \frac{1}{2} \ln\left(\frac{1 - \epsilon_n}{\epsilon_n}\right), \qquad (12)$$

where ε_n represents the error rate of the weak classifier.
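As a toy illustration of equation (12) and the reweighting step, the five labels below are made up; the ±1 sign convention in the exponential update is the standard AdaBoost formulation, not a detail stated in this paper.

```python
# Toy AdaBoost reweighting: eps is the weighted error of a weak classifier,
# w_n = 0.5 * ln((1 - eps) / eps) per equation (12).
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])        # stump misclassifies samples 2 and 4
w = np.full(len(y_true), 1 / len(y_true)) # equal weights initially
eps = w[y_pred != y_true].sum()           # weighted error rate (0.4 here)
alpha = 0.5 * np.log((1 - eps) / eps)     # classifier weight, ~0.203

# Misclassified samples gain weight; correct ones lose weight.
sign = np.where(y_pred == y_true, -1.0, 1.0)
w *= np.exp(alpha * sign)
w /= w.sum()                              # renormalize to a distribution
print(f"eps={eps:.2f}, alpha={alpha:.3f}, weights={np.round(w, 3)}")
```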

2.8. Optimization Strategy. This section discusses the optimization strategy used to find the hyperparameter combination that produces the best outcomes. The Firefly optimization algorithm was used for parameter tuning; the details of Firefly are discussed in Section 2.3. Table 3 presents the hyperparameter values of Random Forest for all four targets; for RF, the "gini" index criterion was used. Table 4 presents the parameters used for XGBoost; the gbtree booster was used with a random state of 42 and a learning rate of 0.05. Similarly, Table 5 presents the optimal parameter values for AdaBoost. Furthermore, Figures 2-4 present the grid search optimization graphs for the Random Forest, Extreme Gradient Boosting, and AdaBoost classifiers.

3. Experimental Setup and Results

The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The scikit-learn library was used for the classifiers along with other needed built-in tools, while a separate library (xgboost 1.2.0) was used for the XGBoost ensemble. K-fold cross validation with K = 10 was used for partitioning the data into training and testing sets. Five evaluation measures were used: accuracy, sensitivity (recall), specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Sensitivity and specificity are emphasized in this study due to the application of the proposed model. Accuracy denotes the percentage of correctly classified cases, sensitivity measures the percentage of positive cases that were classified as positive, and specificity refers to the percentage of negative cases that were classified as negative. Moreover, the criteria for the selection of the performance evaluation measures depend upon the measures used in the benchmark studies. Two sets of experiments were conducted for each of the four targets: one using the features selected by the Firefly feature selection algorithm and one using all 30 features. The SMOTE technique was applied to generate synthetic data. The results of the models are presented in the sections below.
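The evaluation protocol can be sketched as follows: stratified 10-fold cross validation with the five measures computed from each fold's confusion matrix. The stand-in data and the pared-down XGBoost settings are assumptions for illustration; the real pipeline runs on the SMOTE-balanced risk factor features.

```python
# Sketch of the 10-fold evaluation protocol on synthetic stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def fold_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"accuracy":    (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),    # positives correctly found
            "specificity": tn / (tn + fp),    # negatives correctly found
            "PPA":         tp / (tp + fp),    # positive predictive accuracy
            "NPA":         tn / (tn + fn)}    # negative predictive accuracy

X, y = make_classification(n_samples=400, n_features=15, random_state=42)
scores = []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                              random_state=42).split(X, y):
    model = XGBClassifier(learning_rate=0.05, max_depth=3, random_state=42)
    model.fit(X[tr], y[tr])
    scores.append(fold_metrics(y[te], model.predict(X[te])))
print({m: round(np.mean([s[m] for s in scores]), 4) for m in scores[0]})
```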

3.1. Hinselmann. Table 6 presents the accuracy, sensitivity, specificity, PPA, and NPA for the RF, AdaBoost, and XGBoost models, respectively, using SMOTE for the Hinselmann test target class. The number of selected features for Hinselmann was 15. XGBoost outperformed the other classifiers for both feature sets; however, the performance of XGBoost with selected features is better when compared with 30 features. The model produces an accuracy of 98.38, sensitivity of 97.55, specificity of 99.2, PPA of 99.17, and NPA of 97.63, respectively.

3.2. Schiller. Table 7 presents the outcomes for the Schiller test. As with the Hinselmann target, XGBoost with selected features outperformed the other classifiers for Schiller. However, the outcomes achieved by the models for Schiller are lower when compared with the Hinselmann target class.

Table 3. Random Forest optimized parameter set for all four targets using Firefly.

| Parameter | Optimal value obtained |
|---|---|
| n_estimators | 100 |
| max_features | log2 |
| criterion | gini |
| max_depth | 15 |
| min_samples_split | 5 |
| min_samples_leaf | 1 |

Table 4. XGBoost optimized parameter set for all four targets using Firefly.

| Parameter | Optimal value obtained |
|---|---|
| booster | gbtree |
| random_state | 42 |
| nthread | 8 |
| learning_rate | 0.05 |
| gamma (i.e., min_split_loss) | 0.1 |
| max_depth | 3 |

Table 5. AdaBoost optimized parameter set for all four targets using Firefly.

| Parameter | Optimal value obtained |
|---|---|
| base_estimator | None |
| n_estimators | 600 |
| learning_rate | 1.0 |
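For reference, the following sketch instantiates the three classifiers with the optimal values reported in Tables 3-5, using standard scikit-learn and xgboost constructor arguments; this is not the authors' original script, and the base_estimator argument name depends on the scikit-learn version (newer releases call it estimator).

```python
# Sketch: the three ensemble classifiers with the Firefly-tuned values
# from Tables 3-5 (library defaults apply to everything unlisted).
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=100, max_features="log2",
                            criterion="gini", max_depth=15,
                            min_samples_split=5, min_samples_leaf=1)

xgb = XGBClassifier(booster="gbtree", random_state=42, nthread=8,
                    learning_rate=0.05, gamma=0.1, max_depth=3)

ada = AdaBoostClassifier(base_estimator=None,   # default decision stumps
                         n_estimators=600, learning_rate=1.0)
```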

Figure 2. Grid search optimization for Random Forest: mean test score versus param_max_depth (15-40), one curve per param_n_estimators value in {10, 100, 200, 400, 800}; higher is better.

Figure 3. Grid search optimization for Extreme Gradient Boosting: mean test score versus param_min_child_weight (2-14), one curve per param_max_depth value in {3, 4, 5}; higher is better.

Figure 4. Grid search optimization for AdaBoost: mean test score versus param_n_estimators (20-200), one curve per param_learning_rate value in {0.001, 0.01, 0.1, 0.2, 0.5}; higher is better.


The performance of RF and XGBoost with selected features is similar for Schiller, with only a minor difference. The number of features selected by Firefly for Schiller was 13.

3.3. Cytology. Table 8 presents the outcomes of all the classifiers for the Cytology diagnosis test. As with the Hinselmann and Schiller diagnostic tests, XGBoost outperformed the other classifiers for the Cytology test with selected features. For specificity and accuracy, similar outcomes were achieved using 30 and selected features. Similarly, the performance of RF is similar for both the 30 and selected feature sets. The number of features selected by the Firefly feature selector for Cytology was 11.

3.4. Biopsy. Similarly, the performance was not drastically different, yet using all the features resulted in a higher accuracy than using SMOTE with selected features for Biopsy, as shown in Table 9. XGB obtained the highest accuracy of 97.1 with all features; however, for the other measures, the performance of XGBoost is better with the selected features. Similar performance was achieved across all measures when classifying with RF for both feature sets, 30 and selected, respectively. The number of selected features used for the Biopsy target class was 11.

Overall, after comparing all four diagnostic tests, the Hinselmann test achieved the best outcome and can be used for the diagnosis of cervical cancer, as shown in Table 10. As per the outcomes achieved in the proposed study, the Hinselmann diagnosis test performs better than the other cervical cancer diagnosis tests, i.e., Schiller, Biopsy, and Cytology. Similar findings were made in the studies of Abdoh et al. [5] and Wu and Zhou [4].

4. Comparison with Existing Studies

The study used three ensemble techniques: AdaBoost, extreme Gradient Boosting, and Random Forest. Furthermore, the proposed study is the first to use a bioinspired algorithm for feature selection and optimization for cervical cancer diagnosis. To explore the significance of our proposed study, its outcomes were compared with the benchmark studies. The criterion for selecting the benchmark studies was the use of the same data set for the diagnosis of cervical cancer. Table 11 contains the comparison of the proposed technique with the benchmark studies in the literature. The best outcomes in the benchmark studies were achieved using 30 features. However, some of the outcomes in the previous studies were achieved with reduced features; the number in brackets next to some of the outcomes represents the number of features.

Therefore, based on Table 11, the proposed study outperforms the two benchmark studies in terms of accuracy with reduced risk factors. However, the achieved sensitivity and NPA are lower than those of Wu and Zhou [4] but higher than those of Abdoh et al. [5]. The number of features in Wu and Zhou's study is 30, while the proposed study used reduced risk factors. The specificity and PPA of the proposed study are higher than those of the benchmark studies, except for the Schiller diagnosis test.

In a nutshell, the main contributions of the current study are applying a bioinspired algorithm for feature selection and for model optimization on the cervical cancer risk factors data set.

Table 6. Performance of ensemble classifiers using 30 and selected features for the Hinselmann target.

| Features | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
|---|---|---|---|---|---|---|
| 30 | RF | 96.36 | 97.05 | 95.72 | 95.44 | 97.23 |
| 30 | AB | 90.08 | 90.34 | 89.84 | 89.21 | 90.91 |
| 30 | XGB | 97.37 | 96.72 | 98 | 97.93 | 96.84 |
| Selected | RF | 97.57 | 97.12 | 98.01 | 97.93 | 97.23 |
| Selected | AB | 94.94 | 95 | 94.88 | 94.61 | 95.26 |
| Selected | XGB | 98.38 | 97.55 | 99.2 | 99.17 | 97.63 |

Table 7. Performance of ensemble classifiers using 30 and selected features for the Schiller target.

| Features | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
|---|---|---|---|---|---|---|
| 30 | RF | 93.84 | 94.54 | 93.12 | 93.36 | 94.35 |
| 30 | AB | 86.62 | 85.32 | 88.13 | 89.21 | 83.91 |
| 30 | XGB | 92.36 | 93.99 | 90.76 | 90.87 | 93.91 |
| Selected | RF | 95.97 | 95.49 | 96.48 | 96.68 | 95.89 |
| Selected | AB | 89.6 | 86.92 | 92.89 | 93.78 | 85.22 |
| Selected | XGB | 96.98 | 95.9 | 96.92 | 97.1 | 95.65 |

Table 8. Performance of ensemble classifiers using 30 and selected features for the Cytology target.

| Features | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
|---|---|---|---|---|---|---|
| 30 | RF | 95.91 | 94.09 | 97.87 | 97.95 | 93.88 |
| 30 | AB | 93.66 | 91.12 | 96.52 | 96.72 | 90.61 |
| 30 | XGB | 96.32 | 94.49 | 98.3 | 98.36 | 94.29 |
| Selected | RF | 95.91 | 94.44 | 97.47 | 97.54 | 94.29 |
| Selected | AB | 92.64 | 90.94 | 94.47 | 94.67 | 90.61 |
| Selected | XGB | 96.93 | 95.26 | 99.74 | 98.73 | 95.1 |

Table 9. Performance of ensemble classifiers using 30 and selected features for the Biopsy target.

| Features | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
|---|---|---|---|---|---|---|
| 30 | RF | 96.68 | 95.45 | 97.32 | 97.88 | 95.53 |
| 30 | AB | 93.57 | 93.99 | 93.17 | 92.8 | 94.31 |
| 30 | XGB | 97.1 | 95.49 | 97.52 | 97.46 | 95.53 |
| Selected | RF | 96.27 | 95.8 | 96.72 | 96.61 | 95.93 |
| Selected | AB | 89.21 | 88.02 | 90.42 | 90.25 | 88.21 |
| Selected | XGB | 96.68 | 95.83 | 98.74 | 98.73 | 95.93 |

Table 10. Results of the proposed model for the 4 diagnosis tests for cervical cancer.

| Diagnosis test | Accuracy | Sensitivity | Specificity | PPA | NPA |
|---|---|---|---|---|---|
| Hinselmann | 98.38 | 97.55 | 99.2 | 99.17 | 97.63 |
| Schiller | 96.98 | 95.9 | 96.92 | 97.1 | 95.65 |
| Biopsy | 96.68 | 95.83 | 98.74 | 98.73 | 95.93 |
| Cytology | 96.93 | 95.26 | 98.73 | 98.77 | 95.1 |


The proposed model enhanced the outcomes when compared with previous studies related to the cervical cancer risk factors data set. Despite the above-mentioned advantages, the study suffers from some limitations: the data set suffers from huge imbalance, and augmented data was generated using SMOTE. Moreover, the current study was based on an open-source data set, and further testing on other real and open-source data sets is required.

To alleviate the above-mentioned limitations, the model needs to be validated on a real data set from a hospital.

5. Conclusion

This study presents an investigation of several ensemble techniques, namely, Random Forest, AdaBoost, and Extreme Gradient Boosting, for diagnosing cervical cancer. The data set was obtained from the UCI machine learning repository and contains 858 records, 32 features, and 4 target variables. The target variables are the diagnosis tests used for cervical cancer. Experiments were conducted for each target class separately. Data preprocessing included imputing missing values and class balancing using SMOTE. Moreover, the bioinspired Firefly algorithm was used to optimize the models and to identify the key features. To compare the performance of the models, the experiments were conducted with 30 features and with the selected features using SMOTE-balanced data. Extreme Gradient Boosting outperformed the other two models for all four target variables. For future work, the model will be validated on multiple data sets. Also, other models that can handle outliers and unbalanced data differently should be investigated.

Data Availability

The study used the open-source data set available at https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] American Cancer Society, Key Statistics for Cervical Cancer, American Cancer Society, Atlanta, GA, USA, 2021, https://www.cancer.org/cancer/cervical-cancer/about/key-statistics.html.
[2] "Cervical cancer risk factors," 2020, https://www.cancer.net/cancer-types/cervical-cancer/risk-factors.
[3] "Cervical cancer," 2020, https://www.nccc-online.org/hpvcervical-cancer/cervical-cancer-overview/.
[4] W. Wu and H. Zhou, "Data-driven diagnosis of cervical cancer with support vector machine-based approaches," IEEE Access, vol. 5, 2017.
[5] S. F. Abdoh, M. Abo Rizka, and F. A. Maghraby, "Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques," IEEE Access, vol. 6, 2018.
[6] J. Lu, E. Song, A. Ghoneim, and M. Alrashoud, "Machine learning for assisting cervical cancer diagnosis: an ensemble approach," Future Generation Computer Systems, vol. 106, pp. 199–205, 2020.
[7] E. Karim and N. Neehal, "An empirical study of cervical cancer diagnosis using ensemble methods," in Proceedings of the 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Dhaka, Bangladesh, December 2019.
[8] A. Ul-Islam, S. H. Ripon, and N. Qaisar Bhuiyan, "Cervical cancer risk factors: classification and mining associations," APTIKOM Journal on Computer Science and Information Technologies, vol. 4, no. 1, pp. 8–18, 2019.
[9] Y. M. S. Al-Wesabi, A. Choudhury, and D. Won, "Classification of cervical cancer dataset," in Proceedings of the 2018 IISE Annual Conference and Expo, pp. 1456–1461, Orlando, FL, USA, May 2018.
[10] K. Fernandes, D. Chicco, J. S. Cardoso, and J. Fernandes, "Supervised deep learning embeddings for the prediction of cervical cancer diagnosis," PeerJ Computer Science, vol. 4, no. 5, p. e154, 2018.
[11] K. Adem, S. Kiliçarslan, and O. Comert, "Classification and diagnosis of cervical cancer with stacked autoencoder and softmax classification," Expert Systems with Applications, vol. 115, pp. 557–564, 2019.
[12] K. Fernandes, J. S. Cardoso, and J. Fernandes, "Transfer learning with partial observability applied to cervical cancer screening," in Pattern Recognition and Image Analysis, vol. 10255, pp. 243–250, Springer, Berlin, Germany, 2017.

Table 11. Comparison of the proposed study with benchmark studies.

| Target class | Model | Accuracy | Sensitivity | Specificity | PPA | NPA |
|---|---|---|---|---|---|---|
| Hinselmann (15) | Abdoh et al. [5] | 93.97 | 100 | 89.96 | 84.97 | 100 |
| Hinselmann (15) | Wu and Zhou [4] | 97.6 | 96.65 | 98.54 | 98.48 | 96.78 |
| Hinselmann (15) | Proposed study | 98.38 | 97.55 | 99.2 | 99.17 | 97.63 |
| Schiller (13) | Abdoh et al. [5] | 90.18 | 98.73 | 84.63 | 80.75 | 99.03 |
| Schiller (13) | Wu and Zhou [4] | 95.01 | 93.24 | 97.58 (12) | 97.29 (12) | 93.81 |
| Schiller (13) | Proposed study | 96.39 | 95.9 | 96.92 | 97.1 | 95.65 |
| Cytology (11) | Abdoh et al. [5] | 92.75 | 100 | 87.92 | 83 | 100 |
| Cytology (11) | Wu and Zhou [4] | 96.94 | 95.58 (8) | 99.01 | 98.94 | 95.76 (8) |
| Cytology (11) | Proposed study | 96.98 | 95.83 | 99.74 | 98.73 | 95.93 |
| Biopsy (11) | Abdoh et al. [5] | 94.13 | 100 | 90.21 | 86.07 | 100 |
| Biopsy (11) | Wu and Zhou [4] | 96.06 | 94.94 (6) | 97.76 (11) | 97.58 (11) | 94.91 |
| Biopsy (11) | Proposed study | 96.93 | 95.26 | 98.73 | 98.77 | 95.1 |


[13] "Cervical cancer (risk factors) data set," 2020, https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.
[14] A. Fernandez, S. Garcia, F. Herrera, and N. V. Chawla, "SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.
[15] X. Yang, Nature-Inspired Metaheuristic Algorithms, Luniver Press, Cambridge, UK, 2nd edition, 2010.
[16] B. H. Nguyen, B. Xue, and M. Zhang, "A survey on swarm intelligence approaches to feature selection in data mining," Swarm and Evolutionary Computation, vol. 54, 2020.
[17] M. Anbu and G. S. Anandha Mala, "Feature selection using firefly algorithm in software defect prediction," Cluster Computing, vol. 22, no. S5, Article ID 10925, 2019.
[18] L. Breiman, "Random forests," in Hands-On Machine Learning with R, pp. 203–219, CRC Press, Boca Raton, FL, USA, 2019.
[19] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, "How many trees in a random forest?" in Lecture Notes in Computer Science, vol. 7376, pp. 154–168, Springer, New York, NY, USA, 2012.
[20] T. Chen and C. Guestrin, "XGBoost: a scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, San Francisco, CA, USA, August 2016.
[21] A. Natekin and A. Knoll, "Gradient boosting machines, a tutorial," Frontiers in Neurorobotics, vol. 7, 2013.
[22] J. H. Friedman and J. J. Meulman, "Multiple additive regression trees with application in epidemiology," Statistics in Medicine, vol. 22, no. 9, pp. 1365–1381, 2003.
[23] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[24] S. B. Kotsiantis, "Supervised machine learning: a review of classification techniques," Informatica, vol. 31, pp. 249–268, 2007.

10 Scientific Programming

Page 3: Cervical Cancer Diagnosis Model Using Extreme Gradient

was applied Furthermore the Synthetic Minority Over-sampling Technique (SMOTE) is used to balance the classesof the data as it suffers greatly from imbalanced problem

+e paper is organized as follows Section 2 presentsmaterial andmethods Section 3 contains experimental setupand results +e comparison of the proposed model with theexisting studies using the same dataset is discussed in Section4 Finally Section 5 contains the conclusion

2 Material and Method

21DatasetDescription +e cervical cancer risk factors dataset used in the study was collected at ldquoHospital Universitariode Caracasrdquo in Caracas Venezuela and is available on theUCI Machine Learning repository [13] It consists of 858records with somemissing values as several patients did notanswer some of the questions due to privacy concerns +edata set contains 32 risk factors and 4 targets ie the di-agnosis tests used for cervical cancer It contains differentcategories of feature set such as habits demographic in-formation history and Genomic medical records Featuressuch as age Dx Cancer Dx CIN Dx HPV and Dx featurescontains no missing values Dx CIN is a change in the wallsof cervix and is commonly due to HPV infection sometimesit may lead to cancer if it is not treated properly HoweverDx cancer variable is represented if the patient has othertypes of cancer or not Sometimes a patient may have morethan one type of cancer In the data set some of the patientsdo not have cervical cancer but they had the Dx cancervalue true +erefore it is not used as a target variable

Table 1 presents a brief description of each feature withthe type Cervical cancer diagnosis usually requires severaltests this data contains the widely used diagnosis tests as thetarget Hinselmann Schiller Cytology and Biopsy are fourwidely used diagnosis tests for cervical cancer Hinselmannor Colposcopy is a test that examines the inside of the vaginaand cervix using a tool that magnifies the tissues to detectany anomalies [3] Schiller is a test in which a chemicalsubstance called iodine is applied to the cervix where itstains healthy cells into brown color and leaves the abnormalcells uncolored while cytology is a test that examines bodycells from uterine cervix for any cancerous cells or otherdiseases And Biopsy refers to the test where a small part ofcervical tissue is examined under a microscope Most Biopsytests can make significant diagnosis

22 Dataset Preprocessing +e data set suffers from a hugenumber of missing values 24 features out of the 32 con-tained missing values Initially the features with the hugepercentage of missing values were removed STDs Timesince first diagnosis and STDs Time since last diagnosisfeatures were removed since they have 787 missing values(see Table 2) which is more than half of the data Howeverthe data imputation was performed for the features withfewer numbers of missing values +e most frequent valuetechnique was used to impute the remaining missing valuesAdditionally the data set also suffers from huge class im-balance +e data set target labels were imbalanced with 35

for the Hinselmann 74 for Schiller 44 for Cytology and 55Biopsy out of the 858 records as shown in Figure 1 SMOTEwas used to deal with class imbalance SMOTE works byoversampling the minority class by generating new syntheticdata for minority instances based on nearest neighbors usingthe Euclidean Distance between data points [14] Figure 1shows the number of records per class labels in the data set

23 Firefly Feature Selection Dimensionality reduction isone of the effective ways to select the features that improvethe performance of the supervised learning model In thestudy we adopted nature-inspired algorithm Firefly forselecting the features that better formulate the problemFirefly was proposed by Yang [15] and was initially proposedfor the optimization Metaheuristic Firefly algorithm is in-spired by firefliesrsquo and flash lightening capability of a fly It isa population-based optimization algorithm to find the op-timal value or parameter for a target function In thistechnique each fly is pulled out by the glow intensity of thenearby flies If the intensity of the gleam is extremely low atsome point then the attraction will be declining Firefly usedthree rules that is (a) all the flies should be of the samegender (b) the criteria of attractiveness depend upon theintensity of the glow (c) target function will generate thegleam of the firefly +e flies with less glow will move to-wards the flies with brighter glow +e brightness can beadjusted using objective function +e same idea is imple-mented in the algorithm to search the optimal features thatcan better fit the training model Firefly is more computa-tionally economical and produced better outcome in featureselection when compared with other metaheuristic tech-niques like genetic algorithms and particle swarm optimi-zation [16] +e time complexity of firefly is O(n2t) [17] Ituses the light intensity to select the features Highly relevantfeatures are represented as the features with high intensitylight

For feature selection initially some fireflies will begenerated and each fly will randomly assign the weights toall features In our study we generated 50 number of flies(n 50) +e dimension of the data set is 30 Furthermorethe lower bound was set to minus 50 while the upper bound isequal to 50 +e maximum generations were 500 Addi-tionally α (alpha) was initially set to 05 and in everysubsequent iteration we used the (1) and (2) to update α(alpha) value

X limi1⟶500

1 minus 10minus (409)1i (1)

α (1 minus X) times α (2)

However the gamma (c) was set to 1 +e number offeatures selected using Firefly for Hinselmann was 15 forSchiller 13 features for Cytology 11 features and 11 featuresfor Biopsy respectively

24 Ensemble-Based Classification Methods +ree ensem-ble-based classification techniques such as Random ForestExtreme Gradient Boosting and Ada Boost were used to

Scientific Programming 3

train the model +e description of these techniques isdiscussed in the section below

25 RandomForest Random Forest (RF) was first proposedby Breiman in 2001 [18] Random forest is an ensemblemodel that uses decision tree as individual model andbagging as ensemble method It improves the performanceof decision tree by adding many trees to reduce the over-fitting in the decision tree RF can be used for both clas-sification and regression RF generates a random forest thatcontains decision trees and gets a prediction from each oneof them and then selects the best solution with the maximumvotes [19]

When training a tree it is important to measure howmuch each feature decreases the impurity as the decrease in

the impurity indicates the significance of the feature +etree classification result depends on the impurity measureused For classification the measures for impurity are eitherGini impurity or information gain and for regression andthe measure for impurity is variance Training decision treeconsists of iteratively splitting the data Gini impurity de-cides the best split of the data using the formula

G 1 minus 1113944i

p2i (3)

where p (i) is the probability of selecting a datapoint withclass ie Information gain (IG) is also another measure todecide the best split of the data depending on the gain of eachfeature +e formula that calculates the information gain isgiven in the following equation

Entropy minus 1113944i

pilog2pi

IG(parent child) Entropy(parent) minus [p1(c1)lowast entropy(c1) + p(c2)lowast entropy(c2) + middot middot middot]

(4)

Table 1 Statistical description of the data set

Feature type Feature nameHinselmann Schiller Cytology Biopsy

Mean (micro)plusmn Std (σ) Mean (micro)plusmn Std (σ) Mean (micro)plusmn Std (σ) Mean (micro)plusmn Std (σ)Positive Negative Positive Negative Positive Negative Positive Negative

Demographic

Age 267plusmn 77 268plusmn 85 296plusmn 11 266plusmn 82 262plusmn 84 269plusmn 85 286plusmn 89 267plusmn 85Smokes 02plusmn 04 01plusmn 04 02plusmn 04 02plusmn 03 01plusmn 03 01plusmn 04 02plusmn 04 01plusmn 03

Smokes (years) 25plusmn 72 11plusmn 39 24plusmn 62 11plusmn 38 11plusmn 34 12plusmn 41 22plusmn 62 11plusmn 39Smokes (packsyear) 07plusmn 26 04plusmn 22 06plusmn 19 04plusmn 22 05plusmn 23 04plusmn 22 07plusmn 23 04plusmn 22

Number of sexual partners 22plusmn 09 25plusmn 17 25plusmn 12 25plusmn 17 27plusmn 13 25plusmn 17 25plusmn 13 25plusmn 17First sexual intercourse (age) 168plusmn 20 169plusmn 28 17plusmn 25 169plusmn 28 169plusmn 29 169plusmn 28 171plusmn 26 169plusmn 28

Number of pregnancies 24plusmn 14 25plusmn 17 26plusmn 17 22plusmn 14 21plusmn 14 22plusmn 14 23plusmn 13 22plusmn 14Hormonal contraceptives 07plusmn 45 07plusmn 05 06plusmn 05 07plusmn 05 07plusmn 05 07plusmn 05 07plusmn 05 07plusmn 05

Habit

Hormonal contraceptives (years) 29plusmn 48 19plusmn 35 32plusmn 52 19plusmn 34 33plusmn 64 19plusmn 34 33plusmn 54 19plusmn 34IUD 02plusmn 04 01plusmn 03 02plusmn 04 01plusmn 03 01plusmn 03 01plusmn 03 02plusmn 04 01plusmn 03

IUD (years) 06plusmn 15 04plusmn 18 09plusmn 29 04plusmn 17 05plusmn 17 04plusmn 18 07plusmn 20 04plusmn 18STDs 02plusmn 04 01plusmn 03 02plusmn 04 01plusmn 03 02plusmn 04 02plusmn 03 02plusmn 04 01plusmn 03

STDs (number) 03plusmn 09 02plusmn 05 04plusmn 08 01plusmn 05 03plusmn 07 01plusmn 05 03plusmn 08 01plusmn 05STDs condylomatosis 01plusmn 03 01plusmn 02 01plusmn 03 00plusmn 02 01plusmn 03 00plusmn 02 01plusmn 03 0plusmn 02

STDs cervical condylomatosis 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0STDs vaginal condylomatosis 0plusmn 0 00plusmn 01 0plusmn 0 0plusmn 01 0plusmn 0 0plusmn 01 0plusmn 0 0plusmn 01

History

STDs vulvo-perinealcondylomatosis 01plusmn 03 00plusmn 02 01plusmn 03 0plusmn 02 01plusmn 03 0plusmn 02 01plusmn 04 0plusmn 01

STDs syphilis 00plusmn 02 00plusmn 01 00plusmn 02 0plusmn 01 0plusmn 0 0plusmn 01 0plusmn 0 0plusmn 01STDs pelvic inflammatory

disease 0plusmn 0 00plusmn 00 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0

STDs genital herpes 0plusmn 0 00plusmn 00 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 01 0plusmn 0STDs molluscum contagiosum 0plusmn 0 00plusmn 00 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0

STDs AIDS 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0STDs HIV 01plusmn 03 0plusmn 01 01plusmn 03 0plusmn 01 01plusmn 03 0plusmn 01 01plusmn 03 0plusmn 01

STDs Hepatitis B 0plusmn 0 00plusmn 00 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 0STDs HPV 0plusmn 0 00plusmn 00 0plusmn 0 0plusmn 0 0plusmn 0 0plusmn 01 0plusmn 0 0plusmn 01

Genomics

Dx CIN 0plusmn 0 00plusmn 01 00plusmn 01 0plusmn 01 0plusmn 0 0plusmn 01 01plusmn 02 0plusmn 01Dx HPV 01plusmn 03 00plusmn 01 01plusmn 03 0plusmn 01 01plusmn 03 0plusmn 01 01plusmn 03 0plusmn 01

Dx 01plusmn 03 00plusmn 02 01plusmn 03 0plusmn 02 01plusmn 03 0plusmn 02 01plusmn 03 0plusmn 01Dx cancer 01plusmn 03 00plusmn 01 01plusmn 03 0plusmn 01 01plusmn 03 0plusmn 01 01plusmn 03 0plusmn 01

4 Scientific Programming

26 Extreme Gradient Boosting eXtreme Gradient Boosting(XGBoost) is a tree-based ensemble technique [20] XGBoostcan be used for classification regression and rankingproblems XG boosting is a type of gradient boosting Gra-dient Boosting (GB) is a boosting ensemble technique thatmakes predicators sequentially instead of individually GB is amethod that produces a strong classifier by combining weakclassifiers [21] +e goal of the GB is building an iterativemodel that optimizes a loss function It pinpoints the failingsof weak learners by using gradients in the loss function [21]

y ax + b + e (5)

where e denotes the error term +e loss function measureshow good is the model at fitting the underlying data+e lossfunction depends on the optimization goal for regression isa measure of the error between the true and predicatedvalues whereas for classification it measures the how goodis a model at classifying cases correctly [21] +is techniquetakes less time and less iterations since predictors arelearning from the past mistakes of the other predictors +e

Table 2 Missing records per attribute in the data set

Feature name Missing valuesFirst sexual intercourse (age) 7Smoking 13Smokes (years) 13Smokes (packsyear) 13Number of sexual partners 26Number of pregnancies 56Hormonal contraceptives 108Hormonal contraceptives (years) 108IUD 117IUD (years) 117STDs 105STDs (number) 105STDs condylomatosis 105STDs cervical condylomatosis 105STDs vaginal condylomatosis 105STDs vulvo-perineal condylomatosis 105STDs syphilis 105STDs pelvic inflammatory disease 105STDs genital herpes 105STDs molluscum contagiosum 105STDs AIDS 105STDs HIV 105STDs hepatitis B 105STDs HPV 105STDs time since first diagnosis 787STDs time since last diagnosis 787

Biopsy Cytology Hinselmann Schiller0

200

400

600

800

1000

803 814 823784

55 44 3574

Class labels

NegativePositive

Num

ber o

f sam

ples

Figure 1 Number of records per class label in the data set

Scientific Programming 5

GB works by teaching a model C to predict values of theform

Yprime C(x) (6)

By minimizing a loss function eg MSE1n

1113944i

yiprime minus yi( 1113857

2 (7)

where i iterates over a training set of size n of true values ofthe target variable yyʹ estimated values of C (x) y truevalues amp n number of instances in y

Considering a GBmodel withM phases andm as a singlephase being (1lemleM) to improve some deficient modelFm a new estimator hm (x) is added +erefore

hm(x) Y minus Fm(x) (8)

Estimator h will be fitted to Y minus Fm(x) which is thedifference between the true value and the predicated valueie the residual +us we attempt to adjust the errors of theprevious model (Fm) [22]

XGBoost is better than Ada boost in terms of speed andperformance It is highly scalable and runs 10 times faster ascompared to the other traditional single machine learningalgorithms XGBoost handles the sparse data and imple-ments several optimization and regularization techniquesMoreover it also uses the concept of parallel and distributedcomputing

27 AdaBoost Adaptive Boosting (AdaBoost) is a meta-learner originally proposed for the binary classificationproposed by Freund and Schapire [23] It is an ensembletechnique to build a meta classifier by combining severalweak classifiers using progressive learning

AdaBoost uses the concept of boosting data samplingtechnique adaptive sampling was used to assign highweights to the misclassified events+emisclassified sampleswill be selected in the next iteration to better train the modeland the final prediction was made using weighted votingAdaBoost has reduced error rate has a better effect on theprediction as compared to bagging [24] and uses decisiontree stumps Initially all the samples in the data set haveequal weights Let x be the number of samples in the data setand let y be the target+e target is a binary class representedby 0 and 1 +e first decision tree stump will use somerecords from the data set and predictions will be performedAfter the initial prediction the weights to the sample will beupdated More weights will be assigned to the data samplesthat were misclassified +e samples with the high weightswill be selected in the next iteration +e process will becontinued unless the error rate is completely reduced or acertain target level is achieved

AdaBoost contains two main steps combination andstep forward using sequential iterative approach All theinstances in the training set have equal weights in the firstiteration However in subsequent iterations the weights arechanged based on the error rates +e instances with errorhave increased weights For the binary class classification

problem containing T training samples is represented in thefollowing equation

xi yi( 11138571113864 1113865T

i1with yi isin 0 1 (9)

Let C be the linear combination of weak classifiers +ecombination of the classifiers is represented as

C(x) 1113944

N

n1wncn(x) (10)

where N is the number of weak classifiers w represents theweights and C (x) represents weak classifiers In every nextiteration the classifier is trained based on the performance ofthe classifier in previous iteration

C(x)t C(x)tminus 1 + wncn(x) (11)

where C(x)t represents the classifier in t iteration C(x)tminus 1 isthe performance of the classifier at t minus 1 iteration

+e weights can be calculated using the followingequation

wn 12ln

1 minus ϵnϵn

1113888 1113889 (12)

ϵn represents the error rate of the weak classifier

28 Optimization Strategy +is section discusses optimi-zation strategy to find the best hyperparameters combina-tion that produces the highest targeted outcomes Fireflyoptimization algorithmwas used for parameter tunning+edetails of Firefly are discussed in Section 23 Table 3 presentsthe hyperparameter values of Random Forest for all the fourtargets For RF ldquoginirdquo index criterion was used Table 4represents the parameters used for XGBoost Gbtreebooster was used with the random state of 42 and thelearning rate of 005 Similarly Table 5 presents the optimalfeature vales for AdaBoost Furthermore Figures 2ndash4 rep-resent the Grid Search optimization graph for RandomForest Extreme Gradient Boosting and AdaBoost classifier

3 Experimental Setup and Results

+e model was implemented in Python language 380 re-lease using Jupyter Notebook environment Ski-learn librarywas used for the classifiers along with other needed built-intools while separate library (xgboost 120) was used forXGBoost ensemble +ere is K-fold cross validation withK 10 for partitioning the data into training and testingFive evaluation measures such as accuracy sensitivity (re-call) specificity (precision) positive predictive accuracy(PPA) and negative predictive accuracy (NPA) were usedSensitivity and specificity are focused more during the studydue to the application of the proposed model Accuracydenotes the percentage of correctly classified cases sensi-tivity measures the percentage of positives cases that wereclassified as positives and specificity refers to the percentageof negative cases that were classified as negatives Moreoverthe criteria for the selection of the performance evaluation

6 Scientific Programming

measures depend upon the measures used in the benchmarkstudies Two sets of experiments were conducted for eachtarget using selected features by using Firefly feature se-lection algorithm and 30 features for four targets +eSMOTE technique was applied to generate synthetic data+e results of model are presented in section below

31 Hinselmann Table 6 presents the accuracy sensitivityspecificity PPA and NPA for the RF AdaBoost andXGBoost models respectively using SMOTE for Hinsel-mann test target class +e number of selected features forHinselmann was 15 XGBoost outperformed the otherclassifiers for both feature sets However the performance ofXGBoost with selected feature is better when compared with30 features +e model produces an accuracy of 9883sensitivity of 975 specificity of 992 PPA of 9917 and NPAof 9763 respectively

32 Schiller Table 7 presents the outcomes for the Schillertest Like Hinselmann target XGBoost with selected featuresoutperformed that of Schiller respectively However theoutcomes achieved by the model for Schiller are lower whencompared with Hinselmann target class+e performance of

Table 3 Random Forest optimized parameters set for all fourtargets using Firefly

Parameters Optimal values obtainedn_estimators 100max_features Log2criterion ginimax_depth 15min_samples_split 5min_samples_leaf 1

Table 4 XGBoost optimized parameters set for all four targetsusing Firefly

Parameters Optimal values obtainedBooster gbtreeRandom_state 42nthread 8learning_rate 005gamma (ie min_split_loss) 01max_depth 3

Table 5 AdaBoost optimized parameters set for all four targetsusing Firefly

Parameters Optimal values obtainedbase_estimator Nonen_estimators 600learning_rate 10

15 20 25 30 35 40Param_max_depth

0950

0945

0940

0935

0930

0925

Log-

likel

ihoo

d co

mpa

red

to an

AR

(0)

Grid-search results (higher is better)

(None lsquoparam_n_estimatorsrsquo)(Mean_test_score 10)(Mean_test_score 100)(Mean_test_score 200)

(Mean_test_score 400)(Mean_test_score 800)

Figure 2 Grid search optimization for random forest

2 4 6 8 10 1412Param_min_child_weight

0955

0950

0945

0940

0935

0930

0920

0925

Log-

likeli

hood

com

pare

d to

an A

R (0

)

(None lsquoparam_max_depthrsquo)

Grid-search results (higher is better)

(Mean_test_score 3)(Mean_test_score 4)(Mean_test_score 5)

Figure 3 Grid search optimization for extreme gradient boosting

20 50 75 100 125 175150 200Param_n_estimators

(None lsquoparam_learning_ratersquo)(Mean_test_score 0001)(Mean_test_score 001)(Mean_test_score 01)

(Mean_test_score 02)(Mean_test_score 05)

096

095

094

093

091

090

092

089

Log-

likeli

hood

com

pare

d to

an A

R (0

)Grid-search results (higher is better)

Figure 4 Grid search optimization for AdaBoost

Scientific Programming 7

RF and XGBoost is similar with selected feature for Schillerwith a minor difference +e number of features selected byFirefly for Schiller was 13

33 Cytology Table 8 presents the outcome of all theclassifiers for the cytology diagnosis tests Like Hinsel-mann and Schiller diagnostic test XGBoost outperformedCytology test as well with selected features For specificityand accuracy similar outcomes were achieved using 30and selected features Similarly the performance of RF issimilar in both 30 and selected features +e number offeatures selected by Firefly feature selector for Cytologywas 11

34 Biopsy Similarly performance was not drastically dif-ferent yet using all the features resulted in a higher accuracythan when using SMOTE with selected features for Biopsy asshown in Table 9 XGB obtained the highest accuracy of 971with all features However for other measures the perfor-mance of the XGBoost is better with the selected featuresSimilar performance was achieved for all measures whenclassified using RF for both feature sets 30 and selectedrespectively+e number of selected features used for Biopsytarget class was 11

Overall after comparing all the four-diagnostic testsHinselmann test achieved the better outcome and can beused for the diagnosis of cervical cancer as shown in Ta-ble 10 As per the outcome achieved in the proposed studyHinselmann diagnosis test has better performance whencompared from other cervical cancer diagnosis tests likeSchiller Biopsy and Cytology respectively Similar findingshave been made in Abdoh et al [5] and Wu and Zhou [4]study

4 Comparison with Existing Studies

+e study used three ensemble techniques AdaBoost ex-treme Gradient Boosting and Random Forest Furthermorethe proposed study is the pioneer in using bioinspired al-gorithm for feature selection and optimization for cervicalcancer diagnosis To explore the significance of our proposedstudy the outcome of the study was compared with thebenchmark studies +e criteria for the benchmark studiesselection were based on data set used for the diagnosis ofcervical cancer Table 11 contains the comparison of theproposed technique with the benchmark studies in the lit-erature +e best outcomes in the benchmark studies were

achieved using 30 features However some of the outcomesin the previous studies were achieved with the reducedfeatures +e number in the brackets next to some of theoutcomes represents the number of features

+erefore based on Table 11 the proposed study out-performs the two studies in the benchmark interms of ac-curacy with reduced risk factors However the achievedsensitivity and NPA are less than those of Wu and Zhou [4]but higher than those of Abdoh et al [5] +e number offeatures in Wu et al study is 30 while the proposed studyused reduced risk factors +e specificity and PPA of theproposed study are higher than those of the benchmarkstudies except for the Schiller diagnosis test

In nutshell the main contributions of the current studyare applying bioinspired algorithm for feature selection andfor model optimization for cervical cancer risk factors +e

Table 6 Performance of ensemble classifiers using 30 and selectedfeature for Hinselmann target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9636 9705 9572 9544 9723AB 9008 9034 8984 8921 9091XGB 9737 9672 98 9793 9684

SelectedRF 9757 9712 9801 9793 9723AB 9494 95 9488 9461 9526XGB 9838 9755 992 9917 9763

Table 7 Performance of ensemble classifiers using 30 and selectedfeature for Schiller target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9384 9454 9312 9336 9435AB 8662 8532 8813 8921 8391XGB 9236 9399 9076 9087 9391

SelectedRF 9597 9549 9648 9668 9589AB 896 8692 9289 9378 8522XGB 9698 959 9692 971 9565

Table 8 Performance of ensemble classifiers using 30 and selectedfeature for Cytology target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9591 9409 9787 9795 9388AB 9366 9112 9652 9672 9061XGB 9632 9449 983 9836 9429

SelectedRF 9591 9444 9747 9754 9429AB 9264 9094 9447 9467 9061XGB 9693 9526 9974 9873 951

Table 9 Performance of ensemble classifiers using 30 and selectedfeature for Biopsy target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9668 9545 9732 9788 9553AB 9357 9399 9317 928 9431XGB 971 9549 9752 9746 9553

SelectedRF 9627 958 9672 9661 9593AB 8921 8802 9042 9025 8821XGB 9668 9583 9874 9873 9593

Table 10 Results of proposed model for 4 diagnosis tests forcervical cancer

Diagnosis test Accuracy Sensitivity Specificity PPA NPAHinselmann 9838 9755 992 9917 9763Schiller 9698 959 9692 971 9565Biopsy 9668 9583 9874 9873 9593Cytology 9693 9526 9873 9877 951

8 Scientific Programming

proposed model enhanced the outcomes when comparedwith the previous studies related with cervical cancer riskfactors data set Despite the above-mentioned advantagesthe study suffers from some limitations the data set suffersfrom huge imbalance and augmented data was generatedusing SMOTE Moreover the current study was based onopen-source data set and further testing is required to useother real and open-source data sets

To alleviate the above-mentioned limitations there is aneed for validating the model on real data set from thehospital

5 Conclusion

+is study presents an investigation of several ensembletechniques such as Random Forest AdaBoost and Ex-treme Gradient Boosting for diagnosing cervical cancer+e data set was obtained from the UCI machine learningrepository containing 858 records 32 features and 4target variables +e target variables are the diagnosis testused for cervical cancer Experiments were conducted foreach target class separately Data preprocessing includesimputing missing values and class balancing usingSMOTE Moreover bioinspired firefly algorithm was usedto optimize the models and to identify the key features Tocompare the performance of the models the experimentswere conducted with 30 features and the selected featuresusing SMOTED data Extreme Gradient Boosting out-performed the other two models for all four target vari-ables For future work the model will be validated onmultiple data sets Also other models that can handleoutliers and unbalanced data differently should beinvestigated

Data Availability

+e study used open-source data set available at httpsarchiveicsuciedumldatasetsCervical+cancer+28Risk+Factors29

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] American Cancer Society, Key Statistics for Cervical Cancer, American Cancer Society, Atlanta, GA, USA, 2021, https://www.cancer.org/cancer/cervical-cancer/about/key-statistics.html.
[2] "Cervical cancer: risk factors," 2020, https://www.cancer.net/cancer-types/cervical-cancer/risk-factors.
[3] "Cervical cancer," 2020, https://www.nccc-online.org/hpvcervical-cancer/cervical-cancer-overview/.

[4] W. Wu and H. Zhou, "Data-driven diagnosis of cervical cancer with support vector machine-based approaches," IEEE Access, vol. 5, 2017.
[5] S. F. Abdoh, M. Abo Rizka, and F. A. Maghraby, "Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques," IEEE Access, vol. 6, 2018.
[6] J. Lu, E. Song, A. Ghoneim, and M. Alrashoud, "Machine learning for assisting cervical cancer diagnosis: an ensemble approach," Future Generation Computer Systems, vol. 106, pp. 199–205, 2020.
[7] E. Karim and N. Neehal, "An empirical study of cervical cancer diagnosis using ensemble methods," in Proceedings of the 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Dhaka, Bangladesh, December 2019.
[8] A. Ul-Islam, S. H. Ripon, and N. Qaisar Bhuiyan, "Cervical cancer risk factors: classification and mining associations," APTIKOM Journal on Computer Science and Information Technologies, vol. 4, no. 1, pp. 8–18, 2019.
[9] Y. M. S. Al-Wesabi, A. Choudhury, and D. Won, "Classification of cervical cancer dataset," in Proceedings of the 2018 IISE Annual Conference and Expo, pp. 1456–1461, Orlando, FL, USA, May 2018.
[10] K. Fernandes, D. Chicco, J. S. Cardoso, and J. Fernandes, "Supervised deep learning embeddings for the prediction of cervical cancer diagnosis," PeerJ Computer Science, vol. 4, no. 5, pp. e154–21, 2018.
[11] K. Adem, S. Kiliçarslan, and O. Comert, "Classification and diagnosis of cervical cancer with stacked autoencoder and softmax classification," Expert Systems with Applications, vol. 115, pp. 557–564, 2019.
[12] K. Fernandes, J. S. Cardoso, and J. Fernandes, "Transfer learning with partial observability applied to cervical cancer screening," Pattern Recognition and Image Analysis, vol. 10255, pp. 243–250, Springer, Berlin, Germany, 2017.



[13] "Cervical cancer (risk factors) data set," 2020, https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.
[14] A. Fernandez, S. Garcia, F. Herrera, and N. V. Chawla, "SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.
[15] X. Yang, Nature-Inspired Metaheuristic Algorithms, Vol. 4, Luniver Press, Cambridge, UK, 2nd edition, 2010.
[16] B. H. Nguyen, B. Xue, and M. Zhang, "A survey on swarm intelligence approaches to feature selection in data mining," Swarm and Evolutionary Computation, vol. 54, 2020.
[17] M. Anbu and G. S. Anandha Mala, "Feature selection using firefly algorithm in software defect prediction," Cluster Computing, vol. 22, no. s5, Article ID 10925, 2019.
[18] L. Breiman, "Random forests," in Hands-On Machine Learning with R, pp. 203–219, CRC Press, Boca Raton, FL, USA, 2019.
[19] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, "How many trees in a random forest?" in Lecture Notes in Computer Science, vol. 7376, pp. 154–168, Springer, New York, NY, USA, 2012.
[20] T. Chen and C. Guestrin, "XGBoost: a scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, San Francisco, CA, USA, August 2016.
[21] A. Natekin and A. Knoll, "Gradient boosting machines, a tutorial," Frontiers in Neurorobotics, vol. 7, 2013.
[22] J. H. Friedman and J. J. Meulman, "Multiple additive regression trees with application in epidemiology," Statistics in Medicine, vol. 22, no. 9, pp. 1365–1381, 2003.
[23] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[24] S. B. Kotsiantis, "Supervised machine learning: a review of classification techniques," Informatica, vol. 31, pp. 249–268, 2007.



In nutshell the main contributions of the current studyare applying bioinspired algorithm for feature selection andfor model optimization for cervical cancer risk factors +e

Table 6 Performance of ensemble classifiers using 30 and selectedfeature for Hinselmann target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9636 9705 9572 9544 9723AB 9008 9034 8984 8921 9091XGB 9737 9672 98 9793 9684

SelectedRF 9757 9712 9801 9793 9723AB 9494 95 9488 9461 9526XGB 9838 9755 992 9917 9763

Table 7 Performance of ensemble classifiers using 30 and selectedfeature for Schiller target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9384 9454 9312 9336 9435AB 8662 8532 8813 8921 8391XGB 9236 9399 9076 9087 9391

SelectedRF 9597 9549 9648 9668 9589AB 896 8692 9289 9378 8522XGB 9698 959 9692 971 9565

Table 8 Performance of ensemble classifiers using 30 and selectedfeature for Cytology target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9591 9409 9787 9795 9388AB 9366 9112 9652 9672 9061XGB 9632 9449 983 9836 9429

SelectedRF 9591 9444 9747 9754 9429AB 9264 9094 9447 9467 9061XGB 9693 9526 9974 9873 951

Table 9 Performance of ensemble classifiers using 30 and selectedfeature for Biopsy target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9668 9545 9732 9788 9553AB 9357 9399 9317 928 9431XGB 971 9549 9752 9746 9553

SelectedRF 9627 958 9672 9661 9593AB 8921 8802 9042 9025 8821XGB 9668 9583 9874 9873 9593

Table 10 Results of proposed model for 4 diagnosis tests forcervical cancer

Diagnosis test Accuracy Sensitivity Specificity PPA NPAHinselmann 9838 9755 992 9917 9763Schiller 9698 959 9692 971 9565Biopsy 9668 9583 9874 9873 9593Cytology 9693 9526 9873 9877 951

8 Scientific Programming

proposed model enhanced the outcomes when comparedwith the previous studies related with cervical cancer riskfactors data set Despite the above-mentioned advantagesthe study suffers from some limitations the data set suffersfrom huge imbalance and augmented data was generatedusing SMOTE Moreover the current study was based onopen-source data set and further testing is required to useother real and open-source data sets

To alleviate the above-mentioned limitations there is aneed for validating the model on real data set from thehospital

5 Conclusion

+is study presents an investigation of several ensembletechniques such as Random Forest AdaBoost and Ex-treme Gradient Boosting for diagnosing cervical cancer+e data set was obtained from the UCI machine learningrepository containing 858 records 32 features and 4target variables +e target variables are the diagnosis testused for cervical cancer Experiments were conducted foreach target class separately Data preprocessing includesimputing missing values and class balancing usingSMOTE Moreover bioinspired firefly algorithm was usedto optimize the models and to identify the key features Tocompare the performance of the models the experimentswere conducted with 30 features and the selected featuresusing SMOTED data Extreme Gradient Boosting out-performed the other two models for all four target vari-ables For future work the model will be validated onmultiple data sets Also other models that can handleoutliers and unbalanced data differently should beinvestigated

Data Availability

+e study used open-source data set available at httpsarchiveicsuciedumldatasetsCervical+cancer+28Risk+Factors29

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] American Cancer Society Key Statistics for Cervical CancerAmerican Cancer Society Atlanta GA USA 2021 httpswwwcancerorgcancercervical-canceraboutkey-statisticshtml

[2] ldquoCervical cancer risk factorsrdquo 2020 httpswwwcancernetcancer-typescervical-cancerrisk-factors7Etext=+e20most20important20risk20factor10020different20types20of20HPV

[3] ldquoCervical cancerrdquo 2020 httpswwwnccc-onlineorghpvcervical-cancercervical-cancer-overview

[4] W Wu and H Zhou ldquoData-driven diagnosis of cervicalcancer with support vector machine-based approachesrdquo IEEEAccess vol 5 2017

[5] S F Abdoh M Abo Rizka and F A Maghraby ldquoCervicalcancer diagnosis using random forest classifier with SMOTEand feature reduction techniquesrdquo IEEE Access vol 6 2018

[6] J Lu E Song A Ghoneim and M Alrashoud ldquoMachinelearning for assisting cervical cancer diagnosis an ensembleapproachrdquo Future Generation Computer Systems vol 106pp 199ndash205 2020

[7] E Karim and N Neehal ldquoAn empirical study of cervicalcancer diagnosis using ensemble methodsrdquo in Proceedings ofthe 2019 1st International Conference on Advances in ScienceEngineering and Robotics Technology (ICASERT) DhakaBangladesh December 2019

[8] A Ul-Islam S H Ripon and N Qaisar Bhuiyan ldquoCervicalcancer risk factors classification and mining associationsrdquoAPTIKOM Journal on Computer Science and InformationTechnologies vol 4 no 1 pp 8ndash18 2019

[9] Y M S Al-Wesabi A Choudhury and D Won ldquoClassifi-cation of cervical cancer datasetrdquo in Proceedings of the 2018IISE Annual Conference and Expo pp 1456ndash1461 OrlandoFL USA May 2018

[10] K Fernandes D Chicco J S Cardoso and J FernandesldquoSupervised deep learning embeddings for the prediction ofcervical cancer diagnosisrdquo PeerJ Computer Science vol 4no 5 pp e154ndash21 2018

[11] K Adem S Kiliccedilarslan and O Comert ldquoClassification anddiagnosis of cervical cancer with stacked autoencoder andsoftmax classificationrdquo Expert Systems with Applicationsvol 115 pp 557ndash564 2019

[12] K Fernandes J S Cardoso and J Fernandes ldquoTransferlearning with partial observability applied to cervical cancerscreeningrdquo Pattern Recognition and Image Analysis Springervol 10255 pp 243ndash250 Berlin Germany 2017

Table 11 Comparison of the proposed study with benchmark studies

Target class Model Accuracy Sensitivity Specificity PPA NPA

Hinselmann (15)Abdoh et al [5] 9397 100 8996 8497 100Wu and Zhou [4] 976 9665 9854 9848 9678Proposed study 9838 9755 992 9917 9763

Schiller (13)Abdoh et al [5] 9018 9873 8463 8075 9903Wu and Zhou [4] 9501 9324 9758 (12) 9729 (12) 9381Proposed study 9639 959 9692 971 9565

Cytology (11)Abdoh et al [5] 9275 100 8792 83 100Wu and Zhou [4] 9694 9558 (8) 9901 9894 9576 (8)Proposed study 9698 9583 9974 9873 9593

Biopsy (11)Abdoh et al [5] 9413 100 9021 8607 100Wu and Zhou [4] 9606 9494 (6) 9776 (11) 9758 (11) 9491Proposed study 9693 9526 9873 9877 951

Scientific Programming 9

[13] ldquoCervical cancer (risk factors) data setrdquo 2020 httpsarchiveicsuciedumldatasetsCervical+cancer+28Risk+Factors29

[14] A Fernandez S Garcia F Herrera and N V ChawlaldquoSMOTE for learning from imbalanced data progress andchallenges marking the 15-year anniversaryrdquo Journal ofArtificial Intelligence Research vol 61 pp 863ndash905 2018

[15] X Yang Nature-Inspired Metaheuristic Algorithms Vol 4Luniver Press Cambridge UK 2nd edition 2010

[16] B H Nguyen B Xue and M Zhang ldquoA survey on swarmintelligence approaches to feature selection in data miningrdquoSwarm and Evolutionary Computation vol 54 2020

[17] M Anbu and G S Anandha Mala ldquoFeature selection usingfirefly algorithm in software defect predictionrdquo ClusterComputing vol 22 no s5 Article ID 10925 2019

[18] L Breiman ldquoRandom forestsrdquo in Hands-On MachineLearning with R pp 203ndash219 CRC Press Boca Raton FLUSA 2019

[19] T M Oshiro P S Perez and J A Baranauskas ldquoHow manytrees in a random forestrdquo in Lecture Notes in ComputerScience vol 7376 pp 154ndash168 Springer New York NY USA2012

[20] T Chen and C Guestrin ldquoXGBoost a scalable tree boostingsystemrdquo in Proceedings of the 22nd ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Miningvol 13ndash17 pp 785ndash794 San Francisco CA USA August2016

[21] A Natekin and A Knoll ldquoGradient boosting machines atutorialrdquo Frontiers in Neurorobotics vol 7 2013

[22] J H Friedman and J J Meulman ldquoMultiple additive re-gression trees with application in epidemiologyrdquo Statistics inMedicine vol 22 no 9 pp 1365ndash1381 2003

[23] Y Freund and R E Schapire ldquoA decision-theoretic gener-alization of on-line learning and an application to boostingrdquoJournal of Computer and System Sciences vol 55 no 1pp 119ndash139 1997

[24] S B Kotsiantis ldquoSupervised machine learning a review ofclassification techniquesrdquo Informatica vol 31 pp 249ndash2682007

10 Scientific Programming

Page 5: Cervical Cancer Diagnosis Model Using Extreme Gradient

26 Extreme Gradient Boosting eXtreme Gradient Boosting(XGBoost) is a tree-based ensemble technique [20] XGBoostcan be used for classification regression and rankingproblems XG boosting is a type of gradient boosting Gra-dient Boosting (GB) is a boosting ensemble technique thatmakes predicators sequentially instead of individually GB is amethod that produces a strong classifier by combining weakclassifiers [21] +e goal of the GB is building an iterativemodel that optimizes a loss function It pinpoints the failingsof weak learners by using gradients in the loss function [21]

y ax + b + e (5)

where e denotes the error term +e loss function measureshow good is the model at fitting the underlying data+e lossfunction depends on the optimization goal for regression isa measure of the error between the true and predicatedvalues whereas for classification it measures the how goodis a model at classifying cases correctly [21] +is techniquetakes less time and less iterations since predictors arelearning from the past mistakes of the other predictors +e

Table 2 Missing records per attribute in the data set

Feature name Missing valuesFirst sexual intercourse (age) 7Smoking 13Smokes (years) 13Smokes (packsyear) 13Number of sexual partners 26Number of pregnancies 56Hormonal contraceptives 108Hormonal contraceptives (years) 108IUD 117IUD (years) 117STDs 105STDs (number) 105STDs condylomatosis 105STDs cervical condylomatosis 105STDs vaginal condylomatosis 105STDs vulvo-perineal condylomatosis 105STDs syphilis 105STDs pelvic inflammatory disease 105STDs genital herpes 105STDs molluscum contagiosum 105STDs AIDS 105STDs HIV 105STDs hepatitis B 105STDs HPV 105STDs time since first diagnosis 787STDs time since last diagnosis 787

Biopsy Cytology Hinselmann Schiller0

200

400

600

800

1000

803 814 823784

55 44 3574

Class labels

NegativePositive

Num

ber o

f sam

ples

Figure 1 Number of records per class label in the data set

Scientific Programming 5

GB works by teaching a model C to predict values of theform

Yprime C(x) (6)

By minimizing a loss function eg MSE1n

1113944i

yiprime minus yi( 1113857

2 (7)

where i iterates over a training set of size n of true values ofthe target variable yyʹ estimated values of C (x) y truevalues amp n number of instances in y

Considering a GBmodel withM phases andm as a singlephase being (1lemleM) to improve some deficient modelFm a new estimator hm (x) is added +erefore

hm(x) Y minus Fm(x) (8)

Estimator h will be fitted to Y minus Fm(x) which is thedifference between the true value and the predicated valueie the residual +us we attempt to adjust the errors of theprevious model (Fm) [22]

XGBoost is better than Ada boost in terms of speed andperformance It is highly scalable and runs 10 times faster ascompared to the other traditional single machine learningalgorithms XGBoost handles the sparse data and imple-ments several optimization and regularization techniquesMoreover it also uses the concept of parallel and distributedcomputing

27 AdaBoost Adaptive Boosting (AdaBoost) is a meta-learner originally proposed for the binary classificationproposed by Freund and Schapire [23] It is an ensembletechnique to build a meta classifier by combining severalweak classifiers using progressive learning

AdaBoost uses the concept of boosting data samplingtechnique adaptive sampling was used to assign highweights to the misclassified events+emisclassified sampleswill be selected in the next iteration to better train the modeland the final prediction was made using weighted votingAdaBoost has reduced error rate has a better effect on theprediction as compared to bagging [24] and uses decisiontree stumps Initially all the samples in the data set haveequal weights Let x be the number of samples in the data setand let y be the target+e target is a binary class representedby 0 and 1 +e first decision tree stump will use somerecords from the data set and predictions will be performedAfter the initial prediction the weights to the sample will beupdated More weights will be assigned to the data samplesthat were misclassified +e samples with the high weightswill be selected in the next iteration +e process will becontinued unless the error rate is completely reduced or acertain target level is achieved

AdaBoost contains two main steps combination andstep forward using sequential iterative approach All theinstances in the training set have equal weights in the firstiteration However in subsequent iterations the weights arechanged based on the error rates +e instances with errorhave increased weights For the binary class classification

problem containing T training samples is represented in thefollowing equation

xi yi( 11138571113864 1113865T

i1with yi isin 0 1 (9)

Let C be the linear combination of weak classifiers +ecombination of the classifiers is represented as

C(x) 1113944

N

n1wncn(x) (10)

where N is the number of weak classifiers w represents theweights and C (x) represents weak classifiers In every nextiteration the classifier is trained based on the performance ofthe classifier in previous iteration

C(x)t C(x)tminus 1 + wncn(x) (11)

where C(x)t represents the classifier in t iteration C(x)tminus 1 isthe performance of the classifier at t minus 1 iteration

+e weights can be calculated using the followingequation

wn 12ln

1 minus ϵnϵn

1113888 1113889 (12)

ϵn represents the error rate of the weak classifier

28 Optimization Strategy +is section discusses optimi-zation strategy to find the best hyperparameters combina-tion that produces the highest targeted outcomes Fireflyoptimization algorithmwas used for parameter tunning+edetails of Firefly are discussed in Section 23 Table 3 presentsthe hyperparameter values of Random Forest for all the fourtargets For RF ldquoginirdquo index criterion was used Table 4represents the parameters used for XGBoost Gbtreebooster was used with the random state of 42 and thelearning rate of 005 Similarly Table 5 presents the optimalfeature vales for AdaBoost Furthermore Figures 2ndash4 rep-resent the Grid Search optimization graph for RandomForest Extreme Gradient Boosting and AdaBoost classifier

3 Experimental Setup and Results

+e model was implemented in Python language 380 re-lease using Jupyter Notebook environment Ski-learn librarywas used for the classifiers along with other needed built-intools while separate library (xgboost 120) was used forXGBoost ensemble +ere is K-fold cross validation withK 10 for partitioning the data into training and testingFive evaluation measures such as accuracy sensitivity (re-call) specificity (precision) positive predictive accuracy(PPA) and negative predictive accuracy (NPA) were usedSensitivity and specificity are focused more during the studydue to the application of the proposed model Accuracydenotes the percentage of correctly classified cases sensi-tivity measures the percentage of positives cases that wereclassified as positives and specificity refers to the percentageof negative cases that were classified as negatives Moreoverthe criteria for the selection of the performance evaluation

6 Scientific Programming

measures depend upon the measures used in the benchmarkstudies Two sets of experiments were conducted for eachtarget using selected features by using Firefly feature se-lection algorithm and 30 features for four targets +eSMOTE technique was applied to generate synthetic data+e results of model are presented in section below

31 Hinselmann Table 6 presents the accuracy sensitivityspecificity PPA and NPA for the RF AdaBoost andXGBoost models respectively using SMOTE for Hinsel-mann test target class +e number of selected features forHinselmann was 15 XGBoost outperformed the otherclassifiers for both feature sets However the performance ofXGBoost with selected feature is better when compared with30 features +e model produces an accuracy of 9883sensitivity of 975 specificity of 992 PPA of 9917 and NPAof 9763 respectively

32 Schiller Table 7 presents the outcomes for the Schillertest Like Hinselmann target XGBoost with selected featuresoutperformed that of Schiller respectively However theoutcomes achieved by the model for Schiller are lower whencompared with Hinselmann target class+e performance of

Table 3 Random Forest optimized parameters set for all fourtargets using Firefly

Parameters Optimal values obtainedn_estimators 100max_features Log2criterion ginimax_depth 15min_samples_split 5min_samples_leaf 1

Table 4 XGBoost optimized parameters set for all four targetsusing Firefly

Parameters Optimal values obtainedBooster gbtreeRandom_state 42nthread 8learning_rate 005gamma (ie min_split_loss) 01max_depth 3

Table 5 AdaBoost optimized parameters set for all four targetsusing Firefly

Parameters Optimal values obtainedbase_estimator Nonen_estimators 600learning_rate 10

15 20 25 30 35 40Param_max_depth

0950

0945

0940

0935

0930

0925

Log-

likel

ihoo

d co

mpa

red

to an

AR

(0)

Grid-search results (higher is better)

(None lsquoparam_n_estimatorsrsquo)(Mean_test_score 10)(Mean_test_score 100)(Mean_test_score 200)

(Mean_test_score 400)(Mean_test_score 800)

Figure 2 Grid search optimization for random forest

2 4 6 8 10 1412Param_min_child_weight

0955

0950

0945

0940

0935

0930

0920

0925

Log-

likeli

hood

com

pare

d to

an A

R (0

)

(None lsquoparam_max_depthrsquo)

Grid-search results (higher is better)

(Mean_test_score 3)(Mean_test_score 4)(Mean_test_score 5)

Figure 3 Grid search optimization for extreme gradient boosting

20 50 75 100 125 175150 200Param_n_estimators

(None lsquoparam_learning_ratersquo)(Mean_test_score 0001)(Mean_test_score 001)(Mean_test_score 01)

(Mean_test_score 02)(Mean_test_score 05)

096

095

094

093

091

090

092

089

Log-

likeli

hood

com

pare

d to

an A

R (0

)Grid-search results (higher is better)

Figure 4 Grid search optimization for AdaBoost

Scientific Programming 7

RF and XGBoost is similar with selected feature for Schillerwith a minor difference +e number of features selected byFirefly for Schiller was 13

33 Cytology Table 8 presents the outcome of all theclassifiers for the cytology diagnosis tests Like Hinsel-mann and Schiller diagnostic test XGBoost outperformedCytology test as well with selected features For specificityand accuracy similar outcomes were achieved using 30and selected features Similarly the performance of RF issimilar in both 30 and selected features +e number offeatures selected by Firefly feature selector for Cytologywas 11

34 Biopsy Similarly performance was not drastically dif-ferent yet using all the features resulted in a higher accuracythan when using SMOTE with selected features for Biopsy asshown in Table 9 XGB obtained the highest accuracy of 971with all features However for other measures the perfor-mance of the XGBoost is better with the selected featuresSimilar performance was achieved for all measures whenclassified using RF for both feature sets 30 and selectedrespectively+e number of selected features used for Biopsytarget class was 11

Overall after comparing all the four-diagnostic testsHinselmann test achieved the better outcome and can beused for the diagnosis of cervical cancer as shown in Ta-ble 10 As per the outcome achieved in the proposed studyHinselmann diagnosis test has better performance whencompared from other cervical cancer diagnosis tests likeSchiller Biopsy and Cytology respectively Similar findingshave been made in Abdoh et al [5] and Wu and Zhou [4]study

4 Comparison with Existing Studies

+e study used three ensemble techniques AdaBoost ex-treme Gradient Boosting and Random Forest Furthermorethe proposed study is the pioneer in using bioinspired al-gorithm for feature selection and optimization for cervicalcancer diagnosis To explore the significance of our proposedstudy the outcome of the study was compared with thebenchmark studies +e criteria for the benchmark studiesselection were based on data set used for the diagnosis ofcervical cancer Table 11 contains the comparison of theproposed technique with the benchmark studies in the lit-erature +e best outcomes in the benchmark studies were

achieved using 30 features However some of the outcomesin the previous studies were achieved with the reducedfeatures +e number in the brackets next to some of theoutcomes represents the number of features

+erefore based on Table 11 the proposed study out-performs the two studies in the benchmark interms of ac-curacy with reduced risk factors However the achievedsensitivity and NPA are less than those of Wu and Zhou [4]but higher than those of Abdoh et al [5] +e number offeatures in Wu et al study is 30 while the proposed studyused reduced risk factors +e specificity and PPA of theproposed study are higher than those of the benchmarkstudies except for the Schiller diagnosis test

In nutshell the main contributions of the current studyare applying bioinspired algorithm for feature selection andfor model optimization for cervical cancer risk factors +e

Table 6 Performance of ensemble classifiers using 30 and selectedfeature for Hinselmann target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9636 9705 9572 9544 9723AB 9008 9034 8984 8921 9091XGB 9737 9672 98 9793 9684

SelectedRF 9757 9712 9801 9793 9723AB 9494 95 9488 9461 9526XGB 9838 9755 992 9917 9763

Table 7 Performance of ensemble classifiers using 30 and selectedfeature for Schiller target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9384 9454 9312 9336 9435AB 8662 8532 8813 8921 8391XGB 9236 9399 9076 9087 9391

SelectedRF 9597 9549 9648 9668 9589AB 896 8692 9289 9378 8522XGB 9698 959 9692 971 9565

Table 8 Performance of ensemble classifiers using 30 and selectedfeature for Cytology target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9591 9409 9787 9795 9388AB 9366 9112 9652 9672 9061XGB 9632 9449 983 9836 9429

SelectedRF 9591 9444 9747 9754 9429AB 9264 9094 9447 9467 9061XGB 9693 9526 9974 9873 951

Table 9 Performance of ensemble classifiers using 30 and selectedfeature for Biopsy target

Features Model Accuracy Sensitivity Specificity PPA NPA

30RF 9668 9545 9732 9788 9553AB 9357 9399 9317 928 9431XGB 971 9549 9752 9746 9553

SelectedRF 9627 958 9672 9661 9593AB 8921 8802 9042 9025 8821XGB 9668 9583 9874 9873 9593

Table 10 Results of proposed model for 4 diagnosis tests forcervical cancer

Diagnosis test Accuracy Sensitivity Specificity PPA NPAHinselmann 9838 9755 992 9917 9763Schiller 9698 959 9692 971 9565Biopsy 9668 9583 9874 9873 9593Cytology 9693 9526 9873 9877 951

8 Scientific Programming

proposed model enhanced the outcomes when comparedwith the previous studies related with cervical cancer riskfactors data set Despite the above-mentioned advantagesthe study suffers from some limitations the data set suffersfrom huge imbalance and augmented data was generatedusing SMOTE Moreover the current study was based onopen-source data set and further testing is required to useother real and open-source data sets

To alleviate the above-mentioned limitations there is aneed for validating the model on real data set from thehospital

5 Conclusion

+is study presents an investigation of several ensembletechniques such as Random Forest AdaBoost and Ex-treme Gradient Boosting for diagnosing cervical cancer+e data set was obtained from the UCI machine learningrepository containing 858 records 32 features and 4target variables +e target variables are the diagnosis testused for cervical cancer Experiments were conducted foreach target class separately Data preprocessing includesimputing missing values and class balancing usingSMOTE Moreover bioinspired firefly algorithm was usedto optimize the models and to identify the key features Tocompare the performance of the models the experimentswere conducted with 30 features and the selected featuresusing SMOTED data Extreme Gradient Boosting out-performed the other two models for all four target vari-ables For future work the model will be validated onmultiple data sets Also other models that can handleoutliers and unbalanced data differently should beinvestigated

Data Availability

+e study used open-source data set available at httpsarchiveicsuciedumldatasetsCervical+cancer+28Risk+Factors29

Conflicts of Interest

+e authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] American Cancer Society Key Statistics for Cervical CancerAmerican Cancer Society Atlanta GA USA 2021 httpswwwcancerorgcancercervical-canceraboutkey-statisticshtml

[2] ldquoCervical cancer risk factorsrdquo 2020 httpswwwcancernetcancer-typescervical-cancerrisk-factors7Etext=+e20most20important20risk20factor10020different20types20of20HPV

[3] ldquoCervical cancerrdquo 2020 httpswwwnccc-onlineorghpvcervical-cancercervical-cancer-overview

[4] W Wu and H Zhou ldquoData-driven diagnosis of cervicalcancer with support vector machine-based approachesrdquo IEEEAccess vol 5 2017

[5] S F Abdoh M Abo Rizka and F A Maghraby ldquoCervicalcancer diagnosis using random forest classifier with SMOTEand feature reduction techniquesrdquo IEEE Access vol 6 2018

[6] J Lu E Song A Ghoneim and M Alrashoud ldquoMachinelearning for assisting cervical cancer diagnosis an ensembleapproachrdquo Future Generation Computer Systems vol 106pp 199ndash205 2020

[7] E Karim and N Neehal ldquoAn empirical study of cervicalcancer diagnosis using ensemble methodsrdquo in Proceedings ofthe 2019 1st International Conference on Advances in ScienceEngineering and Robotics Technology (ICASERT) DhakaBangladesh December 2019

[8] A Ul-Islam S H Ripon and N Qaisar Bhuiyan ldquoCervicalcancer risk factors classification and mining associationsrdquoAPTIKOM Journal on Computer Science and InformationTechnologies vol 4 no 1 pp 8ndash18 2019

[9] Y M S Al-Wesabi A Choudhury and D Won ldquoClassifi-cation of cervical cancer datasetrdquo in Proceedings of the 2018IISE Annual Conference and Expo pp 1456ndash1461 OrlandoFL USA May 2018

[10] K Fernandes D Chicco J S Cardoso and J FernandesldquoSupervised deep learning embeddings for the prediction ofcervical cancer diagnosisrdquo PeerJ Computer Science vol 4no 5 pp e154ndash21 2018

[11] K Adem S Kiliccedilarslan and O Comert ldquoClassification anddiagnosis of cervical cancer with stacked autoencoder andsoftmax classificationrdquo Expert Systems with Applicationsvol 115 pp 557ndash564 2019

[12] K Fernandes J S Cardoso and J Fernandes ldquoTransferlearning with partial observability applied to cervical cancerscreeningrdquo Pattern Recognition and Image Analysis Springervol 10255 pp 243ndash250 Berlin Germany 2017

Table 11 Comparison of the proposed study with benchmark studies

Target class Model Accuracy Sensitivity Specificity PPA NPA

Hinselmann (15)Abdoh et al [5] 9397 100 8996 8497 100Wu and Zhou [4] 976 9665 9854 9848 9678Proposed study 9838 9755 992 9917 9763

Schiller (13)Abdoh et al [5] 9018 9873 8463 8075 9903Wu and Zhou [4] 9501 9324 9758 (12) 9729 (12) 9381Proposed study 9639 959 9692 971 9565

Cytology (11)Abdoh et al [5] 9275 100 8792 83 100Wu and Zhou [4] 9694 9558 (8) 9901 9894 9576 (8)Proposed study 9698 9583 9974 9873 9593

Biopsy (11)Abdoh et al [5] 9413 100 9021 8607 100Wu and Zhou [4] 9606 9494 (6) 9776 (11) 9758 (11) 9491Proposed study 9693 9526 9873 9877 951

Scientific Programming 9

[13] ldquoCervical cancer (risk factors) data setrdquo 2020 httpsarchiveicsuciedumldatasetsCervical+cancer+28Risk+Factors29

[14] A Fernandez S Garcia F Herrera and N V ChawlaldquoSMOTE for learning from imbalanced data progress andchallenges marking the 15-year anniversaryrdquo Journal ofArtificial Intelligence Research vol 61 pp 863ndash905 2018

[15] X Yang Nature-Inspired Metaheuristic Algorithms Vol 4Luniver Press Cambridge UK 2nd edition 2010

[16] B H Nguyen B Xue and M Zhang ldquoA survey on swarmintelligence approaches to feature selection in data miningrdquoSwarm and Evolutionary Computation vol 54 2020

[17] M Anbu and G S Anandha Mala ldquoFeature selection usingfirefly algorithm in software defect predictionrdquo ClusterComputing vol 22 no s5 Article ID 10925 2019

[18] L Breiman ldquoRandom forestsrdquo in Hands-On MachineLearning with R pp 203ndash219 CRC Press Boca Raton FLUSA 2019

[19] T M Oshiro P S Perez and J A Baranauskas ldquoHow manytrees in a random forestrdquo in Lecture Notes in ComputerScience vol 7376 pp 154ndash168 Springer New York NY USA2012

[20] T Chen and C Guestrin ldquoXGBoost a scalable tree boostingsystemrdquo in Proceedings of the 22nd ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Miningvol 13ndash17 pp 785ndash794 San Francisco CA USA August2016

[21] A Natekin and A Knoll ldquoGradient boosting machines atutorialrdquo Frontiers in Neurorobotics vol 7 2013

[22] J H Friedman and J J Meulman ldquoMultiple additive re-gression trees with application in epidemiologyrdquo Statistics inMedicine vol 22 no 9 pp 1365ndash1381 2003

[23] Y Freund and R E Schapire ldquoA decision-theoretic gener-alization of on-line learning and an application to boostingrdquoJournal of Computer and System Sciences vol 55 no 1pp 119ndash139 1997

[24] S B Kotsiantis ldquoSupervised machine learning a review ofclassification techniquesrdquo Informatica vol 31 pp 249ndash2682007

10 Scientific Programming

Page 6: Cervical Cancer Diagnosis Model Using Extreme Gradient

GB works by teaching a model C to predict values of theform

Yprime C(x) (6)

By minimizing a loss function eg MSE1n

1113944i

yiprime minus yi( 1113857

2 (7)

where i iterates over a training set of size n of true values ofthe target variable yyʹ estimated values of C (x) y truevalues amp n number of instances in y

Considering a GBmodel withM phases andm as a singlephase being (1lemleM) to improve some deficient modelFm a new estimator hm (x) is added +erefore

hm(x) Y minus Fm(x) (8)

Estimator h will be fitted to Y minus Fm(x) which is thedifference between the true value and the predicated valueie the residual +us we attempt to adjust the errors of theprevious model (Fm) [22]

XGBoost is better than Ada boost in terms of speed andperformance It is highly scalable and runs 10 times faster ascompared to the other traditional single machine learningalgorithms XGBoost handles the sparse data and imple-ments several optimization and regularization techniquesMoreover it also uses the concept of parallel and distributedcomputing

27 AdaBoost Adaptive Boosting (AdaBoost) is a meta-learner originally proposed for the binary classificationproposed by Freund and Schapire [23] It is an ensembletechnique to build a meta classifier by combining severalweak classifiers using progressive learning

AdaBoost uses the concept of boosting data samplingtechnique adaptive sampling was used to assign highweights to the misclassified events+emisclassified sampleswill be selected in the next iteration to better train the modeland the final prediction was made using weighted votingAdaBoost has reduced error rate has a better effect on theprediction as compared to bagging [24] and uses decisiontree stumps Initially all the samples in the data set haveequal weights Let x be the number of samples in the data setand let y be the target+e target is a binary class representedby 0 and 1 +e first decision tree stump will use somerecords from the data set and predictions will be performedAfter the initial prediction the weights to the sample will beupdated More weights will be assigned to the data samplesthat were misclassified +e samples with the high weightswill be selected in the next iteration +e process will becontinued unless the error rate is completely reduced or acertain target level is achieved

AdaBoost contains two main steps combination andstep forward using sequential iterative approach All theinstances in the training set have equal weights in the firstiteration However in subsequent iterations the weights arechanged based on the error rates +e instances with errorhave increased weights For the binary class classification

problem containing T training samples is represented in thefollowing equation

xi yi( 11138571113864 1113865T

i1with yi isin 0 1 (9)

Let C be the linear combination of weak classifiers +ecombination of the classifiers is represented as

C(x) 1113944

N

n1wncn(x) (10)

where N is the number of weak classifiers w represents theweights and C (x) represents weak classifiers In every nextiteration the classifier is trained based on the performance ofthe classifier in previous iteration

C(x)t C(x)tminus 1 + wncn(x) (11)

where C(x)t represents the classifier in t iteration C(x)tminus 1 isthe performance of the classifier at t minus 1 iteration

+e weights can be calculated using the followingequation

wn 12ln

1 minus ϵnϵn

1113888 1113889 (12)

ϵn represents the error rate of the weak classifier

28 Optimization Strategy +is section discusses optimi-zation strategy to find the best hyperparameters combina-tion that produces the highest targeted outcomes Fireflyoptimization algorithmwas used for parameter tunning+edetails of Firefly are discussed in Section 23 Table 3 presentsthe hyperparameter values of Random Forest for all the fourtargets For RF ldquoginirdquo index criterion was used Table 4represents the parameters used for XGBoost Gbtreebooster was used with the random state of 42 and thelearning rate of 005 Similarly Table 5 presents the optimalfeature vales for AdaBoost Furthermore Figures 2ndash4 rep-resent the Grid Search optimization graph for RandomForest Extreme Gradient Boosting and AdaBoost classifier

3. Experimental Setup and Results

The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The scikit-learn library was used for the classifiers along with other required built-in tools, while a separate library (xgboost 1.2.0) was used for the XGBoost ensemble. K-fold cross-validation with K = 10 was used to partition the data into training and testing sets. Five evaluation measures were used: accuracy, sensitivity (recall), specificity, positive predictive accuracy (PPA, i.e., precision), and negative predictive accuracy (NPA). Sensitivity and specificity receive particular attention, given the clinical application of the proposed model. Accuracy denotes the percentage of correctly classified cases, sensitivity measures the percentage of positive cases that were classified as positive, and specificity refers to the percentage of negative cases that were classified as negative. Moreover, the criteria for selecting the performance evaluation measures depend on the measures used in the benchmark studies.
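As an illustration, these five measures can be derived from a pooled confusion matrix under 10-fold cross-validation; the sketch below assumes scikit-learn and uses illustrative names, not the paper's actual code:

# Sketch: computing accuracy, sensitivity, specificity, PPA and NPA
# from a pooled confusion matrix under 10-fold cross-validation.
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import confusion_matrix

def evaluate(model, X, y):
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    y_pred = cross_val_predict(model, X, y, cv=cv)
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall on positive cases
        "specificity": tn / (tn + fp),   # recall on negative cases
        "PPA":         tp / (tp + fp),   # positive predictive accuracy
        "NPA":         tn / (tn + fn),   # negative predictive accuracy
    }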


Two sets of experiments were conducted for each target: one with the features selected by the Firefly feature selection algorithm and one with all 30 features. The SMOTE technique was applied to generate synthetic data. The results of the models are presented in the sections below.
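A minimal sketch of the SMOTE step, assuming the imbalanced-learn package (the paper does not name its SMOTE implementation):

# Sketch: balancing the minority class with SMOTE before training.
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)   # adds synthetic minority samples

Note that, to avoid optimistic estimates, SMOTE is ideally applied inside each training fold rather than to the full data set before cross-validation.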

3.1. Hinselmann. Table 6 presents the accuracy, sensitivity, specificity, PPA, and NPA of the RF, AdaBoost, and XGBoost models using SMOTE for the Hinselmann test target class. The number of selected features for Hinselmann was 15. XGBoost outperformed the other classifiers for both feature sets; moreover, the performance of XGBoost with the selected features is better than with all 30 features. The model achieved an accuracy of 98.38%, sensitivity of 97.55%, specificity of 99.20%, PPA of 99.17%, and NPA of 97.63%.

Table 6: Performance of ensemble classifiers using 30 and selected features for the Hinselmann target.

Features   Model   Accuracy   Sensitivity   Specificity   PPA     NPA
30         RF      96.36      97.05         95.72         95.44   97.23
30         AB      90.08      90.34         89.84         89.21   90.91
30         XGB     97.37      96.72         98.00         97.93   96.84
Selected   RF      97.57      97.12         98.01         97.93   97.23
Selected   AB      94.94      95.00         94.88         94.61   95.26
Selected   XGB     98.38      97.55         99.20         99.17   97.63

3.2. Schiller. Table 7 presents the outcomes for the Schiller test. As with the Hinselmann target, XGBoost with selected features outperformed the other classifiers for Schiller. However, the outcomes achieved for Schiller are lower than those for the Hinselmann target class.




The performance of RF and XGBoost with selected features is similar for Schiller, with only a minor difference. The number of features selected by Firefly for Schiller was 13.

Table 7: Performance of ensemble classifiers using 30 and selected features for the Schiller target.

Features   Model   Accuracy   Sensitivity   Specificity   PPA     NPA
30         RF      93.84      94.54         93.12         93.36   94.35
30         AB      86.62      85.32         88.13         89.21   83.91
30         XGB     92.36      93.99         90.76         90.87   93.91
Selected   RF      95.97      95.49         96.48         96.68   95.89
Selected   AB      89.60      86.92         92.89         93.78   85.22
Selected   XGB     96.98      95.90         96.92         97.10   95.65

3.3. Cytology. Table 8 presents the outcomes of all classifiers for the Cytology diagnosis test. As with the Hinselmann and Schiller diagnostic tests, XGBoost with selected features outperformed the other classifiers for Cytology as well. For specificity and accuracy, similar outcomes were achieved using 30 and selected features. Likewise, the performance of RF is similar for both feature sets. The number of features selected by the Firefly feature selector for Cytology was 11.

Table 8: Performance of ensemble classifiers using 30 and selected features for the Cytology target.

Features   Model   Accuracy   Sensitivity   Specificity   PPA     NPA
30         RF      95.91      94.09         97.87         97.95   93.88
30         AB      93.66      91.12         96.52         96.72   90.61
30         XGB     96.32      94.49         98.30         98.36   94.29
Selected   RF      95.91      94.44         97.47         97.54   94.29
Selected   AB      92.64      90.94         94.47         94.67   90.61
Selected   XGB     96.93      95.26         99.74         98.73   95.10

3.4. Biopsy. Similarly, performance was not drastically different, yet using all features resulted in a higher accuracy than using SMOTE with selected features for Biopsy, as shown in Table 9. XGB obtained the highest accuracy of 97.10% with all features; however, for the other measures, XGBoost performs better with the selected features. Similar performance was achieved for all measures when classifying with RF for both feature sets, 30 and selected. The number of selected features used for the Biopsy target class was 11.

Table 9: Performance of ensemble classifiers using 30 and selected features for the Biopsy target.

Features   Model   Accuracy   Sensitivity   Specificity   PPA     NPA
30         RF      96.68      95.45         97.32         97.88   95.53
30         AB      93.57      93.99         93.17         92.80   94.31
30         XGB     97.10      95.49         97.52         97.46   95.53
Selected   RF      96.27      95.80         96.72         96.61   95.93
Selected   AB      89.21      88.02         90.42         90.25   88.21
Selected   XGB     96.68      95.83         98.74         98.73   95.93

Overall, after comparing all four diagnostic tests, the Hinselmann test achieved the best outcome and can be used for the diagnosis of cervical cancer, as shown in Table 10. As per the outcomes of the proposed study, the Hinselmann diagnosis test performs better than the other cervical cancer diagnosis tests, namely, Schiller, Biopsy, and Cytology. Similar findings were reported in the studies of Abdoh et al. [5] and Wu and Zhou [4].

Table 10: Results of the proposed model for the four diagnosis tests for cervical cancer.

Diagnosis test   Accuracy   Sensitivity   Specificity   PPA     NPA
Hinselmann       98.38      97.55         99.20         99.17   97.63
Schiller         96.98      95.90         96.92         97.10   95.65
Biopsy           96.68      95.83         98.74         98.73   95.93
Cytology         96.93      95.26         98.73         98.77   95.10

4. Comparison with Existing Studies

The study used three ensemble techniques: AdaBoost, Extreme Gradient Boosting, and Random Forest. Furthermore, the proposed study pioneers the use of a bioinspired algorithm for feature selection and optimization in cervical cancer diagnosis. To explore the significance of the proposed study, its outcomes were compared with benchmark studies, selected on the basis of the data set used for the diagnosis of cervical cancer. Table 11 compares the proposed technique with the benchmark studies in the literature. The best outcomes in the benchmark studies were achieved using 30 features; however, some of their results were achieved with reduced features. The number in brackets next to an outcome represents the number of features used.

Table 11: Comparison of the proposed study with benchmark studies.

Target class (features)   Model              Accuracy   Sensitivity   Specificity   PPA          NPA
Hinselmann (15)           Abdoh et al. [5]   93.97      100           89.96         84.97        100
Hinselmann (15)           Wu and Zhou [4]    97.60      96.65         98.54         98.48        96.78
Hinselmann (15)           Proposed study     98.38      97.55         99.20         99.17        97.63
Schiller (13)             Abdoh et al. [5]   90.18      98.73         84.63         80.75        99.03
Schiller (13)             Wu and Zhou [4]    95.01      93.24         97.58 (12)    97.29 (12)   93.81
Schiller (13)             Proposed study     96.39      95.90         96.92         97.10        95.65
Cytology (11)             Abdoh et al. [5]   92.75      100           87.92         83.00        100
Cytology (11)             Wu and Zhou [4]    96.94      95.58 (8)     99.01         98.94        95.76 (8)
Cytology (11)             Proposed study     96.98      95.83         99.74         98.73        95.93
Biopsy (11)               Abdoh et al. [5]   94.13      100           90.21         86.07        100
Biopsy (11)               Wu and Zhou [4]    96.06      94.94 (6)     97.76 (11)    97.58 (11)   94.91
Biopsy (11)               Proposed study     96.93      95.26         98.73         98.77        95.10

Therefore, based on Table 11, the proposed study outperforms the two benchmark studies in terms of accuracy with reduced risk factors. However, the achieved sensitivity and NPA are lower than those of Wu and Zhou [4], although higher than those of Abdoh et al. [5]. The number of features in the Wu and Zhou study is 30, whereas the proposed study used a reduced set of risk factors. The specificity and PPA of the proposed study are higher than those of the benchmark studies, except for the Schiller diagnosis test.

In a nutshell, the main contributions of the current study are the application of a bioinspired algorithm for feature selection and for model optimization on the cervical cancer risk factors data set.



The proposed model improved the outcomes compared with previous studies on the cervical cancer risk factors data set. Despite the above-mentioned advantages, the study has some limitations: the data set is heavily imbalanced, and augmented data was generated using SMOTE. Moreover, the current study was based on an open-source data set, and further testing on other real and open-source data sets is required.

To alleviate the above-mentioned limitations, the model needs to be validated on real hospital data.

5. Conclusion

This study presents an investigation of several ensemble techniques, namely, Random Forest, AdaBoost, and Extreme Gradient Boosting, for diagnosing cervical cancer. The data set was obtained from the UCI machine learning repository and contains 858 records, 32 features, and 4 target variables; the target variables are the diagnosis tests used for cervical cancer. Experiments were conducted for each target class separately. Data preprocessing included imputing missing values and class balancing using SMOTE. Moreover, the bioinspired Firefly algorithm was used to optimize the models and to identify the key features. To compare the performance of the models, the experiments were conducted with 30 features and with the selected features using SMOTE-balanced data. Extreme Gradient Boosting outperformed the other two models for all four target variables. For future work, the model will be validated on multiple data sets; other models that handle outliers and unbalanced data differently should also be investigated.

Data Availability

The study used the open-source data set available at https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] American Cancer Society, Key Statistics for Cervical Cancer, American Cancer Society, Atlanta, GA, USA, 2021, https://www.cancer.org/cancer/cervical-cancer/about/key-statistics.html.

[2] "Cervical cancer risk factors," 2020, https://www.cancer.net/cancer-types/cervical-cancer/risk-factors.

[3] "Cervical cancer," 2020, https://www.nccc-online.org/hpvcervical-cancer/cervical-cancer-overview/.

[4] W. Wu and H. Zhou, "Data-driven diagnosis of cervical cancer with support vector machine-based approaches," IEEE Access, vol. 5, 2017.

[5] S. F. Abdoh, M. Abo Rizka, and F. A. Maghraby, "Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques," IEEE Access, vol. 6, 2018.

[6] J. Lu, E. Song, A. Ghoneim, and M. Alrashoud, "Machine learning for assisting cervical cancer diagnosis: an ensemble approach," Future Generation Computer Systems, vol. 106, pp. 199–205, 2020.

[7] E. Karim and N. Neehal, "An empirical study of cervical cancer diagnosis using ensemble methods," in Proceedings of the 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Dhaka, Bangladesh, December 2019.

[8] A. Ul-Islam, S. H. Ripon, and N. Qaisar Bhuiyan, "Cervical cancer risk factors: classification and mining associations," APTIKOM Journal on Computer Science and Information Technologies, vol. 4, no. 1, pp. 8–18, 2019.

[9] Y. M. S. Al-Wesabi, A. Choudhury, and D. Won, "Classification of cervical cancer dataset," in Proceedings of the 2018 IISE Annual Conference and Expo, pp. 1456–1461, Orlando, FL, USA, May 2018.

[10] K. Fernandes, D. Chicco, J. S. Cardoso, and J. Fernandes, "Supervised deep learning embeddings for the prediction of cervical cancer diagnosis," PeerJ Computer Science, vol. 4, p. e154, 2018.

[11] K. Adem, S. Kiliçarslan, and O. Comert, "Classification and diagnosis of cervical cancer with stacked autoencoder and softmax classification," Expert Systems with Applications, vol. 115, pp. 557–564, 2019.

[12] K. Fernandes, J. S. Cardoso, and J. Fernandes, "Transfer learning with partial observability applied to cervical cancer screening," in Pattern Recognition and Image Analysis, vol. 10255, pp. 243–250, Springer, Berlin, Germany, 2017.



[13] "Cervical cancer (risk factors) data set," 2020, https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.

[14] A. Fernandez, S. Garcia, F. Herrera, and N. V. Chawla, "SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.

[15] X. Yang, Nature-Inspired Metaheuristic Algorithms, vol. 4, Luniver Press, Cambridge, UK, 2nd edition, 2010.

[16] B. H. Nguyen, B. Xue, and M. Zhang, "A survey on swarm intelligence approaches to feature selection in data mining," Swarm and Evolutionary Computation, vol. 54, 2020.

[17] M. Anbu and G. S. Anandha Mala, "Feature selection using firefly algorithm in software defect prediction," Cluster Computing, vol. 22, no. s5, Article ID 10925, 2019.

[18] L. Breiman, "Random forests," in Hands-On Machine Learning with R, pp. 203–219, CRC Press, Boca Raton, FL, USA, 2019.

[19] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas, "How many trees in a random forest?" in Lecture Notes in Computer Science, vol. 7376, pp. 154–168, Springer, New York, NY, USA, 2012.

[20] T. Chen and C. Guestrin, "XGBoost: a scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, San Francisco, CA, USA, August 2016.

[21] A. Natekin and A. Knoll, "Gradient boosting machines, a tutorial," Frontiers in Neurorobotics, vol. 7, 2013.

[22] J. H. Friedman and J. J. Meulman, "Multiple additive regression trees with application in epidemiology," Statistics in Medicine, vol. 22, no. 9, pp. 1365–1381, 2003.

[23] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[24] S. B. Kotsiantis, "Supervised machine learning: a review of classification techniques," Informatica, vol. 31, pp. 249–268, 2007.

