Evaluation of Breast Cancer Susceptibility Using Improved Genetic Algorithms to Generate Genotype SNP Barcodes

Cheng-Hong Yang, Yu-Da Lin, Li-Yeh Chuang, and Hsueh-Wei Chang

Abstract—Genetic association analysis is a challenging task in the identification and characterization of genes that increase susceptibility to common, complex multifactorial diseases. To fully carry out genetic studies of complex diseases, modern geneticists face the challenge of detecting interactions between loci. A genetic algorithm (GA) is developed to detect associations in the genotype frequencies of cancer cases and noncancer cases based on statistical analysis. An improved genetic algorithm (IGA) is proposed to improve the reliability of the GA method for high-dimensional SNP-SNP interactions. The strategy feeds the top five results into the random population initialization process, where they guide the GA toward a promising search direction. The IGA increases the likelihood of quickly detecting the maximum ratio difference between cancer cases and noncancer cases. The study systematically evaluates the joint effect of 23 SNP combinations from six steroid hormone metabolism and signaling-related genes involved in breast carcinogenesis pathways, with the IGA successfully detecting significant ratio differences between breast cancer cases and noncancer cases. The possible breast cancer risks were subsequently analyzed by odds-ratio (OR) and risk-ratio analysis. The estimated OR of the best SNP barcode is significantly higher than 1 (between 1.15 and 7.01) for specific combinations of two to 13 SNPs. The results support that the IGA provides higher ratio difference values than the GA between breast cancer cases and noncancer cases over 3-SNP to 13-SNP interactions. A more specific SNP-SNP interaction profile for the risk of breast cancer is also provided.

Index Terms—Single nucleotide polymorphism, SNP-SNP interactions, genetic algorithm, breast cancer


1 INTRODUCTION

Current hypotheses hold that the risk of disease and cancer is associated with the co-occurrence of single nucleotide polymorphisms (SNPs) in an individual and with their effect on genetic and phenotypic variability. Many studies have investigated variations in disease susceptibility by identifying DNA sequence variations, i.e., SNPs, which are important indicators of the main genetic components of complex diseases and can shed light on genetic risk [1], [2], [3], [4], [5], [6]. The associations for genotype frequencies in case and control data have a significant impact on susceptibility to diseases and cancers, and the complex interactions among genes and environmental factors are a key research subject in common human disease aetiology. Current studies focus on the combined effects of multiple SNPs on many cancer and disease risks, but association studies for multiple SNP candidates are hampered by complex computations.

Steroid hormone metabolism and signaling are implicated in the pathogenesis of breast cancer [7], [8], [9], [10], [11], [12]. Several SNP association studies have focused on the steroid hormone metabolism and signaling-related genes, such as the estrogen receptor 1 (ESR1), steroid sulfatase (microsomal), isozyme S (STS), cytochrome P450, family 19, subfamily A, polypeptide 1 (CYP19A1), the progesterone receptor (PGR), catechol-O-methyltransferase (COMT), and sex hormone-binding globulin (SHBG) [13], [14], [15].

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 2, MARCH/APRIL 2013, p. 361

C.-H. Yang and Y.-D. Lin are with the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80778, Taiwan. E-mail: [email protected], [email protected].

L.-Y. Chuang is with the Department of Chemical Engineering and the Institute of Biotechnology and Chemical Engineering, I-Shou University, No. 1, Section 1, Xuecheng Rd, Dashu District, Kaohsiung 84001, Taiwan. E-mail: [email protected].

H.-W. Chang is with the Department of Biomedical Science and Environmental Biology, Kaohsiung Medical University Cancer Center, Kaohsiung Medical University Hospital, and the Kaohsiung Medical University, 100, Shih-Chuan 1st Road, Kaohsiung 80708, Taiwan. E-mail: [email protected].

Manuscript received 24 Aug. 2012; revised 10 Mar. 2013; accepted 11 Mar. 2013; published online 27 Mar. 2013. Recommended for acceptance by S. Dudoit. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2012-08-0222. Digital Object Identifier no. 10.1109/TCBB.2013.27.

1545-5963/13/$31.00 © 2013 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM

Evidence of SNP-SNP interactions has accumulated in breast cancer association studies, such as the SNP-SNP interactions between genes associated with DNA repair [16], [17], chemokine ligand-receptor interaction [18], and estrogen response [6]. However, analysis of SNP-SNP interactions is still tedious because of the large amount of data related to the many SNPs and the many possible combinations of alleles created when multiple SNPs are examined simultaneously. Determining the significant ratio difference value between cancer cases and noncancer cases entails evaluating $C(N, M) \times 3^M = N!/[M!(N - M)!] \times 3^M$ possible combinations of SNP barcodes, where $N$ is the number of SNPs or factors, and $M$ is the number of SNPs selected for prediction. Statistical analysis alone is therefore insufficient to fully detect the association for genotype frequencies of high-dimensional cancer and noncancer data sets. Bioinformatics researchers have proposed many artificial intelligence methods to approach this challenge; for example, multifactor dimensionality reduction (MDR) [19], [20] and polymorphism interaction analysis [21] use exhaustive searches (ESs), as opposed to support vector machines [22], particle swarm optimization (PSO) [23], [24], [25], and genetic algorithms (GA) [26], [27]. MDR detects and characterizes high-order gene-gene interactions in case-control studies for binary outcomes by classifying them as high risk or low risk [19], but the computational load can be excessive when dealing with more than 10 polymorphisms [28]. A GA has the ability to generate significant SNP combinations in high-dimensional data. The GA differs from the other methods in some very fundamental ways, making it more robust. For example, it works with a coding of the parameter set rather than the parameters themselves, it searches from a population of points rather than a single point, it uses payoff information rather than derivatives or other auxiliary knowledge, and it uses probabilistic transition rules rather than deterministic rules. This study aims to develop an improved GA method (“IGA”) to provide increased reliability. The IGA method is implemented in two stages. Stage 1 improves the population initialization, based on the concept that good results are retained. This conservation of superior results yields better solutions for high-order SNP-SNP interactions. Stage 2 uses the main GA components, namely encoding schemes, a fitness evaluation, population initialization, selection, and the crossover operator. We systematically evaluated the joint effects of 23 SNP combinations from six published steroid hormone metabolism- and signaling-related genes involved in breast cancer-related pathways. The results demonstrate that the proposed approach provides SNP barcodes that show significantly different values between breast cancer cases and noncancer cases and improves the reliability of the results in the 20 test runs we conducted.
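To give a sense of how quickly the search space $C(N, M) \times 3^M$ grows, the count can be evaluated directly. The following is a minimal sketch; the function name `barcode_search_space` is hypothetical and chosen only for illustration, and the value N = 23 is taken from the number of SNPs evaluated in this study.

```python
from math import comb


def barcode_search_space(n_snps: int, m_selected: int) -> int:
    """Number of candidate SNP barcodes: C(N, M) * 3**M.

    C(N, M) counts the ways to choose M of N SNPs, and 3**M counts the
    genotype assignments (three possible genotypes per selected SNP).
    """
    return comb(n_snps, m_selected) * 3 ** m_selected


if __name__ == "__main__":
    # N = 23 SNPs as in this study; M ranges over a few barcode sizes.
    for m in (2, 3, 5, 10, 13):
        print(f"M = {m:2d}: {barcode_search_space(23, m):,} candidate barcodes")
```

Even for 23 SNPs, the count for the larger barcode sizes is far beyond what an exhaustive search can comfortably cover, which is the motivation for a heuristic search such as the GA.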

2 METHODS

2.1 GA

A GA is an optimization and machine learning algorithm loosely based on the processes of biological evolution; it was first developed by Holland [29]. The GA solves an optimization problem by manipulating a population of chromosomes. Each chromosome is assigned a fitness value that reflects its success in solving the problem. Before the GA operation starts, an initial population of chromosomes is randomly generated to serve as parents. Once the operation is under way, the members of the current population are replaced by new chromosomes that are (possibly modified) copies of the parents. This process of reproduction and population replacement continues until a stopping criterion is met. The GA process applies the following genetic operators: chromosome encoding and initialization, selection, crossover, mutation, and replacement. Two parents are selected from the chromosomes based on their fitness, and the crossover operators then create offspring that combine chromosomal material from the two parents. Mutation operators cause the offspring to differ from their parents through the introduction of localized change. Finally, replacement operators eliminate the inferior chromosomes from the population. Thus, the fittest chromosomes tend to have more offspring than the less fit ones. Better offspring can be generated from good parents, thus producing an entire population of superior individuals, and the resulting offspring converge toward a point close to a global optimum. The GA has been successfully applied to a variety of biology problems, for example, biomarker discovery and multiclass cancer classification [30], primer design [31], and gene selection based on gene expression data [32], [33]. In this study, we apply the GA method to an evaluation of breast cancer susceptibility through simulations that seek to maximize the fitness function value of the SNP barcode.

2.2 IGA

We propose a new strategy to improve the stability of the GA in searches for high-order SNP-SNP interactions. The improvement stems from retaining the top five results, which yield better solutions for high-order SNP-SNP interactions; the GA is thereby more likely to obtain optimal results within a limited time. The IGA differs from the GA in that the proposed strategy is applied in the population initialization step of the GA process. This very simple idea can be accomplished without increasing the computational complexity of the process. The initial population is generated by our “top five results” strategy, and the fitness values of all individuals in the population are then calculated by a fitness function. The population is improved through the GA operations, i.e., selection, crossover, mutation, and replacement. The procedure is repeated in successive iterations until the termination conditions are reached.

The pseudocode in Fig. 1 shows how the IGA processes the data with the above-mentioned adaptation procedure to obtain the best SNP barcode for differentiating breast cancer cases from noncancer cases. This solves the problem of selecting a number of SNPs that matches the originally set number.

Fig. 1. IGA pseudocode.

2.2.1 Encoding Schemes

In the IGA, a chromosome in the population represents a solution group and can be divided into two parts: the selected SNPs (identified by number) and the genotypes associated with those SNPs. The chromosome encoding can thus be represented by

$$C_i = (\mathrm{SNP}_{i,j}, \mathrm{Genotype}_{i,j}), \quad i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n,$$

where $\mathrm{SNP}_{i,j}$ represents a selected SNP that cannot be selected more than once, $\mathrm{Genotype}_{i,j}$ represents one of the three possible genotypes of $\mathrm{SNP}_{i,j}$, $m$ represents the size of the population, and $n$ represents the number of SNPs selected. For example, let $C = (\mathrm{SNP}_{1,3,5}, \mathrm{Genotype}_{2,1,3})$; thus, $C$ represents the chosen SNPs (1, 3, 5) and genotypes (2, 1, 3) and can be described by the SNPs paired with their genotypes as follows: (1, 2), (3, 1), and (5, 3).
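The encoding above can be illustrated with a small data structure. This is a minimal sketch, not the authors' implementation: the class name `Chromosome` and the helpers `is_valid` and `pairs` are hypothetical names introduced here, and genotypes are assumed to be coded as the integers 1, 2, and 3.

```python
from typing import List, NamedTuple, Tuple


class Chromosome(NamedTuple):
    """One candidate SNP barcode: selected SNP numbers and their genotypes."""
    snps: Tuple[int, ...]       # e.g. (1, 3, 5)
    genotypes: Tuple[int, ...]  # e.g. (2, 1, 3), one genotype per SNP


def is_valid(c: Chromosome, n_total_snps: int) -> bool:
    """Check the constraints stated in the text: no SNP is selected twice,
    every SNP number is in range, and each genotype is one of three codes."""
    return (
        len(c.snps) == len(c.genotypes)
        and len(set(c.snps)) == len(c.snps)
        and all(1 <= s <= n_total_snps for s in c.snps)
        and all(g in (1, 2, 3) for g in c.genotypes)
    )


def pairs(c: Chromosome) -> List[Tuple[int, int]]:
    """Return the barcode as (SNP, genotype) pairs."""
    return list(zip(c.snps, c.genotypes))


if __name__ == "__main__":
    c = Chromosome(snps=(1, 3, 5), genotypes=(2, 1, 3))
    print(pairs(c))         # [(1, 2), (3, 1), (5, 3)]
    print(is_valid(c, 23))  # True
```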

2.2.2 Population Initialization Using the Top Five Results Conservation Strategy

The top five results conservation strategy is proposed to detect better SNP barcodes for high-order SNP-SNP interactions. We define the top five results as $T_i = (\mathrm{SNP}_{i,j}, \mathrm{Genotype}_{i,j})$, $i = 1, 2, \ldots, 5$; $j = 1, 2, \ldots, n$, where $\mathrm{SNP}_{i,j}$ represents a selected SNP that cannot be selected more than once, $\mathrm{Genotype}_{i,j}$ represents one of the three possible genotypes of $\mathrm{SNP}_{i,j}$, $i$ indexes the top five results, and $n$ represents the number of SNPs selected. The strategy is applied in the population initialization process for $n$-SNP ($n \geq 3$) barcode identification. For the 2-SNP barcode, we apply only the ES algorithm to compute and check all possible 2-SNP combinations and thus obtain the top five results for all 2-SNP barcodes. The steps for population initialization are illustrated in Fig. 2. To initialize the population, the top five results for the previous 2-SNP combinations, which were identified via the ES algorithm, are used for the population initialization for higher numbers of SNP combinations. For example, suppose $(\mathrm{SNP}_{3,4}, \mathrm{Genotype}_{2,1})$ is one of the top five 2-SNP barcodes (Step 1). Each of these 2-SNP barcodes is then extended in a search for the best 3-SNP barcode with a maximum ratio difference value between breast cancer cases and noncancer cases (Step 2); for example, the search of $(\mathrm{SNP}_{3,4}, \mathrm{Genotype}_{2,1})$ for its 3-SNP barcode yields $(\mathrm{SNP}_{3,4,i}, \mathrm{Genotype}_{2,1,j})$, where $i \in \{1, 2, 5, 6, \ldots, N\}$ (with $N$ the total number of SNPs) and $j \in \{1, 2, 3\}$. The ES algorithm is then applied to compute and check all possible 3-SNP combinations of this form in order to identify candidates for the top five 3-SNP barcodes. If the ES algorithm finds $i = 5$ and $j = 1$, the 2-SNP barcode $(\mathrm{SNP}_{3,4}, \mathrm{Genotype}_{2,1})$ generates the best 3-SNP barcode $(\mathrm{SNP}_{3,4,5}, \mathrm{Genotype}_{2,1,1})$. The remaining four 3-SNP barcodes are generated in the same way from the other top five 2-SNP barcodes, i.e., the 2-SNP barcode $(\mathrm{SNP}_{1,2}, \mathrm{Genotype}_{1,2})$ generates the 3-SNP barcode $(\mathrm{SNP}_{1,2,4}, \mathrm{Genotype}_{1,2,2})$, the 2-SNP barcode $(\mathrm{SNP}_{1,3}, \mathrm{Genotype}_{1,1})$ generates the 3-SNP barcode $(\mathrm{SNP}_{1,3,2}, \mathrm{Genotype}_{1,1,3})$, and so on.

Meanwhile, a population of 3-SNP barcodes is randomly generated and then sorted by fitness value from high to low (Step 3), i.e., the GA population initialization step. The results from Step 2 are used to replace the worst five SNP barcodes (Step 4). Finally, the updated 3-SNP barcode population is ready for the GA to identify the top five 3-SNP barcodes (Step 5), after which these top five SNP barcodes can be used to initiate the next SNP barcode generation.
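Steps 2 to 4 of this initialization can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names, the representation of a barcode as a pair of tuples `(snps, genotypes)`, and the caller-supplied `fitness` function are all assumptions made for the example.

```python
import random
from itertools import product


def extend_barcode(barcode, n_total_snps, fitness):
    """Step 2: exhaustively try every unused SNP with each of the three
    genotypes and keep the extension with the highest fitness."""
    snps, genos = barcode
    best, best_score = None, None
    for snp, geno in product(range(1, n_total_snps + 1), (1, 2, 3)):
        if snp in snps:  # an SNP cannot be selected twice
            continue
        candidate = (snps + (snp,), genos + (geno,))
        score = fitness(candidate)
        if best_score is None or score > best_score:
            best, best_score = candidate, score
    return best


def random_barcode(n_selected, n_total_snps, rng):
    """Draw n_selected distinct SNPs and a random genotype for each."""
    snps = tuple(rng.sample(range(1, n_total_snps + 1), n_selected))
    genos = tuple(rng.choice((1, 2, 3)) for _ in snps)
    return (snps, genos)


def init_population(top_five, n_total_snps, pop_size, fitness, rng):
    """Build the initial population for (n+1)-SNP barcodes from the top
    five n-SNP barcodes."""
    n_selected = len(top_five[0][0]) + 1
    # Step 2: extend each retained barcode by one SNP.
    seeds = [extend_barcode(b, n_total_snps, fitness) for b in top_five]
    # Step 3: random population, sorted by fitness from high to low.
    population = [random_barcode(n_selected, n_total_snps, rng)
                  for _ in range(pop_size)]
    population.sort(key=fitness, reverse=True)
    # Step 4: the extended barcodes replace the worst individuals.
    population[-len(seeds):] = seeds
    return population
```

In use, `fitness` would be the ratio difference of Section 2.2.3 evaluated on the case and control data, and `rng` a `random.Random` instance so that runs can be reproduced.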

2.2.3 Fitness Function

In the GA process, the fitness function value measures the quality of chromosomes. The SNP-SNP interaction study focuses on particular SNP combinations to detect the highest fitness value, i.e., the maximum ratio difference value between breast cancer cases and noncancer cases. This criterion divides the fitness function into three separate steps, and the relevant equation can be written as

$$F(C_i) = \frac{n(\mathrm{case} \cap C_i)}{\mathrm{case}} - \frac{n(\mathrm{control} \cap C_i)}{\mathrm{control}}, \qquad (1)$$

where $n(\cdot)$ is the total number of elements in a set, case is the total number of samples in the breast cancer case group, control is the total number of samples in the control group, and $C_i$ is the $i$th chromosome. First, the number of intersections between the $i$th chromosome and breast cancer cases is calculated as $n(\mathrm{case} \cap C_i)$, and the ratio is computed as $n(\mathrm{case} \cap C_i)$ divided by case. Then, the number of intersections between the $i$th chromosome and noncancer cases is calculated as $n(\mathrm{control} \cap C_i)$, and the ratio is computed as $n(\mathrm{control} \cap C_i)$ divided by control. Finally, the difference between the two ratios is calculated; this difference represents the fitness value.

Fig. 2. Population initialization using the conservation of the best results strategy.

The fitness value is based on the set-theoretic intersection, which is used to evaluate the ratio difference value between breast cancer cases and noncancer cases. The intersection of two sets is the set that contains all elements found in both sets, but no other elements. For example, $C = (\mathrm{SNP}_{1,2}, \mathrm{Genotype}_{2,1})$ is used to evaluate the number of matching conditions in the breast cancer cases and noncancer cases. Suppose that 153 of 200 breast cancer cases match both SNP1 with genotype 2 and SNP2 with genotype 1, and that 73 of 200 noncancer cases match both conditions. According to (1), the fitness value is determined by subtracting 0.365 (73/200) from 0.765 (153/200), leaving 0.40, which represents a high risk.
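Equation (1) and the worked example can be expressed compactly in code. This is a minimal sketch under stated assumptions: each sample is represented as a dictionary mapping an SNP number to a genotype code, a barcode is a pair of tuples `(snps, genotypes)`, and the names `matches` and `fitness` are hypothetical.

```python
def matches(sample, barcode):
    """True if the sample carries every (SNP, genotype) pair in the barcode."""
    snps, genos = barcode
    return all(sample[s] == g for s, g in zip(snps, genos))


def fitness(barcode, cases, controls):
    """Ratio difference of Eq. (1): the fraction of cases matching the
    barcode minus the fraction of controls matching it."""
    case_ratio = sum(matches(s, barcode) for s in cases) / len(cases)
    control_ratio = sum(matches(s, barcode) for s in controls) / len(controls)
    return case_ratio - control_ratio


if __name__ == "__main__":
    # Toy data reproducing the example: 153 of 200 cases and 73 of 200
    # controls carry SNP1 = genotype 2 together with SNP2 = genotype 1.
    barcode = ((1, 2), (2, 1))
    hit, miss = {1: 2, 2: 1}, {1: 1, 2: 1}
    cases = [hit] * 153 + [miss] * 47
    controls = [hit] * 73 + [miss] * 127
    print(f"{fitness(barcode, cases, controls):.2f}")  # 0.40
```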

2.2.4 Selection

The proposed design adopts a rank-based tournament selection scheme to select the parents. In tournament selection, the chromosomes in the population are ranked until a certain set of chromosomes has been selected. The best two chromosomes ($C_1$ and $C_2$) from the population are then used for the crossover and mutation operations.
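One possible reading of this selection step is sketched below. The text does not specify the tournament size or how the candidate set is drawn, so both the parameter `tournament_size` and the random sampling are assumptions made for illustration; `select_parents` is a hypothetical name.

```python
import random


def select_parents(population, fitness, tournament_size=5, rng=random):
    """Draw a random tournament from the population, rank it by fitness,
    and return the two best chromosomes as parents (C1, C2)."""
    contenders = rng.sample(population, tournament_size)
    ranked = sorted(contenders, key=fitness, reverse=True)
    return ranked[0], ranked[1]
```

Setting `tournament_size` equal to the population size makes the choice deterministic: the two fittest chromosomes in the whole population are always returned.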

2.2.5 Crossover and Mutation

The two parents $C_1$ and $C_2$ chosen by the tournament selection are then processed by uniform crossover, which randomly generates a mask of N binary bits that indicates which variables are to be exchanged. All exchanged variables are checked to prevent repetition of the selected SNPs. Fig. 3 shows an example of the crossover operation. The two selected chromosomes are $C_1 = (1, 2, 4, 1, 2, 1)$ and $C_2 = (3, 5, 6, 3, 1, 2)$. A mask of six binary bits is then randomly generated (e.g., 011010), and the values of SNP2, SNP3, and Genotype2 are exchanged in the crossover operation, i.e., the new offspring $C'_1 = (1, 5, 6, 1, 1, 1)$ and $C'_2 = (3, 2, 4, 3, 2, 2)$ are created. Following crossover, a one-point mutation operation is applied in the proposed IGA, which randomly selects a mutation point in an offspring chromosome and generates a new value for that variable. The new value is checked for repetition of the selected SNPs. If it violates this constraint, a new random point is selected and the variable is regenerated; otherwise, the original value is replaced by the new value and the process continues with the replacement operation.
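The crossover example can be checked with a short sketch. Chromosomes are written here as flat tuples, as in the example (SNP numbers in the first half, genotypes in the second half). How a swap that would duplicate an SNP is handled is not spelled out in the text, so skipping such a swap is an assumption of this sketch; the function names are hypothetical.

```python
import random


def uniform_crossover(c1, c2, mask):
    """Exchange the variables at positions where the mask bit is '1'.
    An SNP swap that would repeat an SNP in either offspring is skipped."""
    n = len(c1) // 2  # number of SNP positions
    o1, o2 = list(c1), list(c2)
    for pos, bit in enumerate(mask):
        if bit != "1":
            continue
        if pos < n and (c2[pos] in o1[:n] or c1[pos] in o2[:n]):
            continue  # swap would create a repeated SNP
        o1[pos], o2[pos] = o2[pos], o1[pos]
    return tuple(o1), tuple(o2)


def random_mask(length, rng):
    """Random binary mask with one bit per chromosome variable."""
    return "".join(rng.choice("01") for _ in range(length))


def one_point_mutation(chrom, n_total_snps, rng):
    """Pick a random position and regenerate its value; if the new value
    would repeat a selected SNP, pick a new random point and try again."""
    n = len(chrom) // 2
    out = list(chrom)
    while True:
        pos = rng.randrange(len(chrom))
        new = rng.randint(1, n_total_snps) if pos < n else rng.randint(1, 3)
        if pos < n and new in out[:n]:
            continue
        out[pos] = new
        return tuple(out)


if __name__ == "__main__":
    c1, c2 = (1, 2, 4, 1, 2, 1), (3, 5, 6, 3, 1, 2)
    print(uniform_crossover(c1, c2, "011010"))
    # ((1, 5, 6, 1, 1, 1), (3, 2, 4, 3, 2, 2))
```

The mutation function always mutates; a caller would invoke it with the mutation probability given in the parameter settings.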

2.2.6 Replacement

Offspring are generated by the crossover and mutation operations, as described above. If an offspring is superior to both parents, it replaces the most similar parent. Alternatively, if the offspring's fitness lies between those of the two parents, it replaces the inferior parent; otherwise, the most inferior chromosome in the population is replaced.
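The three replacement cases can be sketched as follows. The text does not define how similarity between chromosomes is measured, so the Hamming distance used here is an assumption, as are the treatment of ties and the names `hamming` and `replace_offspring`.

```python
def hamming(a, b):
    """Number of positions at which two chromosomes differ (an assumed
    similarity measure; smaller means more similar)."""
    return sum(x != y for x, y in zip(a, b))


def replace_offspring(population, parent_idx, child, fitness):
    """Insert the child into the population in place, following the three
    cases in the text, and return the index that was overwritten."""
    i1, i2 = parent_idx
    f_child = fitness(child)
    f1, f2 = fitness(population[i1]), fitness(population[i2])
    if f_child > max(f1, f2):
        # Superior to both parents: replace the most similar parent.
        target = min((i1, i2), key=lambda i: hamming(population[i], child))
    elif f_child >= min(f1, f2):
        # Between the parents: replace the inferior parent.
        target = i1 if f1 < f2 else i2
    else:
        # Otherwise: replace the worst chromosome in the population.
        target = min(range(len(population)),
                     key=lambda i: fitness(population[i]))
    population[target] = child
    return target
```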

2.2.7 Parameter Settings

The experimental results are averaged over 20 runs. In each run, the GA and the IGA use tournament selection, a uniform crossover with an exchange probability of 1.0, and a one-point mutation with a probability of 0.1. We set the population size to 50 and the number of generations to 100. All runs end when the maximum number of allowed generations is reached.

3 RESULTS AND DISCUSSION

3.1 Data Set Preparation

In this study, the chosen data set [14] provided only the genotype frequencies rather than the original raw genotype data. The data set was obtained from a breast cancer association study of steroid hormones and their signaling and metabolic pathways (96 SNPs for eight genes). The original data for each SNP involve different numbers of individuals; therefore, the simulated data for each SNP need to be normalized to the same data size to allow for further analysis. Table 1 shows the individuals with respect to 23 SNPs from six genes (COMT, CYP19A1, ESR1, PGR, SHBG, and STS) for the occurrence of breast cancer. The simulated data are generated at random, but in such a way that the output still obeys the final frequency for each SNP for the whole data set. All genotype frequencies and simulated data are identical to those described in the literature [34]. The simulated data (adjusted to 5,000) are meaningful because they are in accord with the original data. As an example, let SNPa be rs11571171 in the original data and SNPb be the simulated data after adjustment. Suppose the total count of the three genotypes in SNPa is 4,534, of which AA, Aa, and aa account for 2,119, 1,962, and 453, respectively. First, we calculate the percentage of each genotype in SNPa, i.e., 2,119/4,534 (46.74 percent) for AA, 1,962/4,534 (43.27 percent) for Aa, and 453/4,534 (9.99 percent) for aa. Then, based on these percentages, the adjusted data for SNPb are obtained by multiplying each percentage by the size of the complete data set, i.e., $46.74\% \times 5{,}000 \approx 2{,}338$ for AA, $43.27\% \times 5{,}000 \approx 2{,}163$ for Aa, and $9.99\% \times 5{,}000 \approx 499$ for aa. The simulated data for SNPb are thus controlled at 5,000 ($2{,}338 + 2{,}163 + 499 = 5{,}000$). The whole data set is available at http://bioinfo.kmu.edu.tw/brca-steroid-96SNP.xlsx.
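The rescaling step can be sketched as follows. The function name `rescale_counts` and the rounding rule (round each scaled count down, then give the leftover individuals to the most frequent genotype) are assumptions, not taken from the source; this rule is used here only because it reproduces the 2,338 / 2,163 / 499 split quoted in the example.

```python
def rescale_counts(counts, target=5000):
    """Rescale genotype counts so that they sum to `target` while keeping
    the genotype proportions as close as possible to the original data."""
    total = sum(counts.values())
    # Integer floor of count / total * target for each genotype.
    scaled = {g: c * target // total for g, c in counts.items()}
    # Assumed rule: the rounding leftover goes to the largest genotype.
    leftover = target - sum(scaled.values())
    largest = max(scaled, key=scaled.get)
    scaled[largest] += leftover
    return scaled


if __name__ == "__main__":
    # Genotype counts for SNPa (rs11571171) from the example in the text.
    print(rescale_counts({"AA": 2119, "Aa": 1962, "aa": 453}))
    # {'AA': 2338, 'Aa': 2163, 'aa': 499}
```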

3.2 Performance Measurements Using Statistical Analysis

We used two common criteria to determine the prediction score [21]. The odds-ratio (OR) and the risk-ratio (RR) criteria are defined as follows:


Fig. 3. Example of a crossover operation.

$$\text{Odds Ratio} = \frac{TP \times TN}{FP \times FN} \qquad (2)$$

$$\text{Risk Ratio} = \frac{TP \times (FP + TN)}{FP \times (TP + FN)}, \qquad (3)$$

where TP is the number of true positives, TN is the number of true negatives, FN is the number of false negatives, and FP is the number of false positives. OR and RR are used to determine the best combination of genotypes and to quantitatively measure the disease risk. RR is a key parameter in epidemiology and is defined as the ratio of the risk among the exposed to that among the nonexposed. Exposure and nonexposure here distinguish between a pair of alternative characteristics or experiences. OR is widely used in medical reports and offers a very convenient interpretation in case-control studies (to be addressed in a future note) when RR cannot be obtained directly (i.e., case-control association). A larger OR ($> 1$) indicates a stronger positive association between the genotype combination and the disease.
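Equations (2) and (3) translate directly into code. In the sketch below, the mapping from barcode counts to the four cells is an assumption made for illustration: cases carrying the barcode are counted as TP, controls carrying it as FP, cases without it as FN, and controls without it as TN. The function names are hypothetical, and the counts in the usage example (602 of 5,000 cases and 509 of 5,000 controls) are those quoted for the 3-SNP barcode in Section 3.3.

```python
def confusion_from_counts(case_hits, control_hits, n_cases, n_controls):
    """Build the 2x2 table from barcode match counts (assumed mapping)."""
    return {
        "tp": case_hits,                  # cases carrying the barcode
        "fp": control_hits,               # controls carrying the barcode
        "fn": n_cases - case_hits,        # cases without the barcode
        "tn": n_controls - control_hits,  # controls without the barcode
    }


def odds_ratio(tp, fp, fn, tn):
    """Eq. (2): (TP * TN) / (FP * FN)."""
    return (tp * tn) / (fp * fn)


def risk_ratio(tp, fp, fn, tn):
    """Eq. (3): TP * (FP + TN) / (FP * (TP + FN))."""
    return (tp * (fp + tn)) / (fp * (tp + fn))


if __name__ == "__main__":
    cells = confusion_from_counts(602, 509, 5000, 5000)
    print(f"OR = {odds_ratio(**cells):.3f}, RR = {risk_ratio(**cells):.3f}")
```

Both functions divide by FP (and OR also by FN), so a barcode that never occurs in the controls would need separate handling, for example a continuity correction, which is outside the scope of this sketch.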

3.3 Identification of Best SNP-SNP Interaction Combination with Maximal Ratio Difference Value between Breast Cancer Cases and Noncancer Cases

Table 1 summarizes information pertaining to the 23 SNPs in the simulated data set. To investigate SNP-SNP interactions, we used both the IGA and GA methods to generate the best SNP barcodes associated with breast cancer. Tables 2, 3, and 4 summarize the results for the 2- to 13-SNP barcodes detected by the IGA, GA, and PSO. The best SNP-SNP interaction combination consists of the N specific combined SNPs with their corresponding genotypes. For example, the three specific SNPs (4, 17, 18) with genotype 2-1-1 [rs3020314-CT]-[rs660149-CC]-[rs11571171-TT] showed the maximal ratio difference between breast cancer cases and noncancer cases. For this SNP barcode, the matching control and case numbers are 509 and 602, respectively, meaning that the incidence in breast cancer cases is higher than in noncancer cases by a ratio difference value of 1.86 percent (i.e., $602/5{,}000 - 509/5{,}000 = 0.0186$), suggesting that this SNP barcode is a risk factor for breast cancer. Under the same criteria, the 3- to 13-SNP barcodes with the best performance were mined by the IGA (see left column of Table 2), GA (see left column of Table 3), and PSO (see left column of Table 4). The results indicate that the IGA and the GA provide the highest level of ratio difference of SNP barcodes between the control and breast cancer groups with a fixed number of SNPs.

TABLE 1. Individual SNPs of 23 Steroid Hormone Metabolism and Signaling-Related Genes on the Occurrence of Breast Cancer in Patients

a: Data collected from the literature [34]. b: All the [Ch/position], i.e., [Chromosome no./Chromosome position], information is based on “Assembly GRCh37.” c: The contig information is shown in SNP no. (contig accession no.) as follows: SNPs 1-2 (NT_011519.10); SNP 3 (NT_010194.17); SNPs 4-15 (NT_025741.15); SNPs 16-19 (NT_033899.8); SNPs 20-22 (NT_010718.16); SNP 23 (NT_167197.1).

3.4 Analysis of SNP Barcodes Significant for Occurrence in Breast Cancer

Tables 2 and 3 show the proportion of subjects with specific SNP combinations and other combinations in breast cancer. The proportion of subjects with breast cancer in the group with specific SNP combinations is significantly higher than that in the group with other combinations. For example, the IGA detected that the proportion of breast cancer subjects with the combined genotypes in SNPs (4, 17) with genotypes 2-1, i.e., rs3020314 CT-rs660149 CC, was 52.75 percent, as opposed to 49.18 percent among those with other combinations. For the IGA and the GA, the proportions of subjects with specific SNP combinations (2 to 13 SNPs) suffering from breast cancer lie in the ranges of 52.75-87.50 percent and 52.75-85.71 percent, respectively, which are higher than those of other combinations (49.18-49.98 percent and 49.18-49.99 percent).

3.5 Rank Analysis of the OR and RR for Breast Cancer

Tables 2 and 3 present the OR and p-value used to estimate the effect of specific SNP combinations, i.e., SNP barcodes, on the occurrence of breast cancer. For example, we observe that the IGA provides higher OR (1.15-7.01) and RR (1.07-1.75) values with more SNP combinations in the high-risk group. If the number of breast cancer cases is higher than the number of control cases, then the risk value is relatively high. Thus, women with the specific SNP combination (2 to 10 SNPs) had an RR that represented a 1.07- to 1.75-fold increased risk and significantly increasing OR values of 1.15-7.01 for breast cancer, suggesting that these SNP barcodes represent an increased risk for breast cancer. Accordingly, the best IGA- and GA-generated SNP barcodes listed represent an increased risk when more SNPs are selected.


TABLE 2. Estimated Best SNP Barcode on the Occurrence of Breast Cancer as Determined by the IGA in the High-Risk Group

*The SNP barcode is significantly associated with the occurrence of breast cancer (p-value < 0.05); FDR: False discovery rate.

3.6 Comparison of the IGA and the GA for SNP-SNP Interaction in Breast Cancer

This section compares the IGA and the GA in terms of reliability and ability to identify SNP barcodes by computing the average value and standard deviation of the maximum ratio difference values between breast cancer cases and noncancer cases over 20 runs, for 3 to 13 specific SNPs. Fig. 4 shows the best results of the IGA and the GA in terms of the maximum ratio difference value for 2- to 13-SNP barcodes over 20 runs. The ratio difference values of the IGA are higher than those of the GA for the 3- to 13-SNP barcodes, i.e., 1.86 percent versus 1.42 percent for 3 SNPs, 1.20 percent versus 0.84 percent for 4 SNPs, 0.82 percent versus 0.42 percent for 5 SNPs, 0.46 percent versus 0.24 percent for 6 SNPs, 0.34 percent versus 0.14 percent for 7 SNPs, 0.24 percent versus 0.10 percent for 8 SNPs, 0.18 percent versus 0.08 percent for 9 SNPs, 0.12 percent versus 0.04 percent for 10 SNPs, 0.12 percent versus 0.02 percent for 11 SNPs, 0.10 percent versus 0.02 percent for 12 SNPs, and 0.08 percent versus N/A for 13 SNPs (N/A represents failure to identify an SNP barcode). Note that in the 13-SNP identification, the GA failed to identify an SNP barcode in any of the 20 runs conducted. Therefore, the GA may miss a significant SNP barcode because it has difficulty identifying barcodes with a large ratio difference value when the dimensionality is very high. We use the false-discovery rate (FDR) to adjust the p-value to demonstrate the significance of the results in this study; results are shown in Tables 2 and 3. The FDR-adjusted p-values in Tables 2 and 3 support the proposition that the IGA can identify statistically significant SNP barcodes for the 2-, 3-, 4-, 5-, and 9-SNP combinations, whereas the GA identified only the 2-SNP barcode in a statistically significant manner. The OR results also indicate that the risk of breast cancer occurrence for the SNP barcodes identified by the IGA is higher than for those identified by the GA. In Fig. 5, the box plots illustrate the extremes, the upper and lower hinges (quartiles), and the median of the maximum ratio difference value between the breast cancer cases and the noncancer cases for the IGA and the GA on 3 to 13 combined SNPs over 20 runs. The black and white bold lines and the white boxes in Fig. 5 indicate the reliability of the method over multiple runs; ideally, the distances between the two bold lines and the boxes are as small as possible. For example, the greatest (white) and least (black) ratio difference values for the 3-SNP barcode identified by the GA are, respectively, 1.42 and 1.27 percent, whereas the IGA identifies a ratio difference value of 1.86 percent in all 20 runs. The lower and upper quartiles mark off the lowest and highest 25 percent of the results, meaning that the middle 50 percent of the result values range from 1.36 to 1.42 percent for GA identification and have a single value of 1.86 percent for IGA identification. The other combined SNP numbers all attest to the reliability of the IGA as well.

The computational complexity of the IGA is estimated in terms of objective function computations. If there are $i$ iterations and $s$ solutions (chromosomes) in the population, then the objective function computation has a computational complexity of $O(is)$. The key feature of the top five strategy is that only the top five solutions in each iteration are stored. If there are $k$ solutions in the archive, storing the solutions in the archive yields a computational complexity of $O(i + k)$. If the same number of archives and iterations is used, the overall complexity of the IGA is $O(is + k)$.

TABLE 3. Estimated Best SNP Barcode on the Occurrence of Breast Cancer as Determined by the GA in the High-Risk Group

*The SNP barcode is significantly associated with the occurrence of breast cancer (p-value < 0.05); FDR: False discovery rate; N.E.: Not evaluated.

The 3- to 12-SNP barcodes in Tables 2 and 3, i.e., the columns of the combined SNP number and SNP genotypes, demonstrate the advantage of the retained results strategy. The SNP barcodes identified by the IGA share many of the same SNP combinations (e.g., 4-17-18-22 with genotype 2-1-1-2) because these SNP barcodes have a certain ratio difference value in all possible SNP combinations. However, the GA process needs to identify every combined SNP number using a new population; as a result, the SNP barcodes identified by the GA vary erratically across the 3- to 12-SNP barcodes. The IGA has the ability to predict the relative strength of the impact of an SNP on breast cancer protection by calculating the maximum ratio difference information. For example, the ratio difference between controls and cases for SNP barcode [SNPs (4-17-18)-genotype (2-1-1)] is higher than that of [SNPs (4-17-18-22)-genotype (2-1-1-2)], suggesting that SNP 4, SNP 17, and SNP 18 are more associated with breast cancer protection than SNP 22. Accordingly, an order of impact on breast cancer for the SNPs listed in Table 2 can be arranged as follows: SNPs 4/17 > SNP 18 > SNP 22 > SNP 11 > SNP 21 > SNP 19/20 > SNP 12 > SNP 23 > SNP 6. In the simulated breast cancer association study, the IGA-generated SNP barcodes involving 2 to 13 SNPs and 2 to 10 SNPs show significantly increasing OR values ranging from 1.15 to 7.01 (Table 2). This situation can easily result in a blind search in the GA process. Moreover, the IGA provides superior results that yield better solutions for high-order SNP-SNP interactions, thus leading to improved GA search results in a limited time. The 7-SNP barcode in Table 2 illustrates that the concept of retaining the top five results allows for unexpected SNP combinations. The best 6-SNP barcode contains SNP 21, and extending this best 6-SNP barcode with the ES algorithm yields a ratio difference value of only 0.20 percent, i.e., the SNP barcode is SNPs (4, 11, 17, 18, 19, 21, 22) with genotype (2-2-1-1-2-2-2). The IGA identified a ratio difference value of 0.34 percent because it used SNPs 19 and 20 rather than SNP 21, and SNP 20 is an element of the third of the top five results in the 6-SNP identification. Consequently, when the top five results are put into the population, the GA obtains a good SNP combination. The proposed conservation of the top five results successfully improves the identification capability and stability of the GA.

TABLE 4. Estimated Best SNP Barcode on the Occurrence of Breast Cancer as Determined by PSO in the High-Risk Group

*The SNP barcode is significantly associated with the occurrence of breast cancer (p-value < 0.05); FDR: False discovery rate; N.E.: Not evaluated.

Fig. 4. Maximum ratio difference between breast cancer cases and noncancer cases for the GA and the IGA on 2 to 13 best combined SNPs.

4 CONCLUSIONS

In this study, we proposed an IGA for association studies that use multiple SNPs to investigate polygenic diseases and cancers. In nature, genes can be damaged or modified in various ways, and this damage can lead to an increased risk of disease or cancer. It is thus important to obtain informative SNP patterns from the SNPs located in the relevant genes and pathways. Due to the huge number of SNPs involved, analysis of association studies is difficult to perform, especially when multiple SNPs are investigated simultaneously. Our proposed IGA was shown to successfully identify cross interactions among 23 SNPs and to provide representative gene-gene interactions for breast cancer. The risk of developing breast cancer due to SNP-SNP interactions is evaluated by OR and RR. Our strategy successfully improves the effectiveness of the GA search path given limited time, thus enhancing the maximum ratio difference between case and control data in high-order SNP-SNP interactions. The results demonstrate that the IGA can identify the best fitness of cases and controls and can potentially be applied to determine complex gene-gene SNP interactions among the huge number of SNPs involved in genome-wide association studies.
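As a brief illustration of the OR and RR evaluation mentioned above, the following Python sketch computes both quantities from a 2 x 2 table for a single SNP barcode. The cell layout (a: cases with the barcode, b: controls with the barcode, c: cases without it, d: controls without it), the function names, and the counts are hypothetical, and the confidence interval uses the standard log-based (Woolf) approximation, which is not necessarily the procedure used in this study.

```python
import math


def odds_ratio(a, b, c, d):
    """OR = (a * d) / (b * c) for the assumed 2 x 2 layout."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """RR = risk among barcode carriers / risk among non-carriers."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate 95% CI: exp(ln(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d))."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts, for illustration only (not taken from Table 2).
a, b, c, d = 30, 10, 70, 90
or_value = odds_ratio(a, b, c, d)
rr_value = risk_ratio(a, b, c, d)
ci_low, ci_high = odds_ratio_ci(a, b, c, d)
```

An interval that lies entirely above 1 suggests that the barcode is associated with increased risk, whereas an interval entirely below 1 suggests a protective association.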

ACKNOWLEDGMENTS

This work was partly supported by the National Science Council of Taiwan under grants 101-2622-E-151-027-CC3, 100-2221-E-151-049-MY3, 100-2221-E-151-051-MY2, DOH101-TD-C-111-002, and the National Sun Yat-Sen University-Kaohsiung Medical University Joint Research Project (#NSYSUKMU 101-006).


Fig. 5. Box plots display the extremes, the upper and lower hinges (quartiles), and the median of the maximum ratio difference between breast cancer cases and noncancer cases for (A) GA and (B) IGA on 3 to 13 combined SNPs over 20 runs.

REFERENCES

[1] J. Li, K. Humphreys, H. Darabi, G. Rosin, U. Hannelius, T.Heikkinen, K. Aittomaki, C. Blomqvist, P.D. Pharoah, A.M.Dunning, S. Ahmed, M.J. Hooning, A. Hollestelle, R.A. Old-enburg, L. Alfredsson, A. Palotie, L. Peltonen-Palotie, A. Irwanto,H.Q. Low, G.H. Teoh, A. Thalamuthu, J. Kere, M. D’Amato, D.F.Easton, H. Nevanlinna, J. Liu, K. Czene, and P. Hall, “A Genome-Wide Association Scan on Estrogen Receptor-Negative BreastCancer,” Breast Cancer Research, vol. 12, no. 6, article 93, 2010.

[2] P. Kraft and C.A. Haiman, “GWAS Identifies a Common BreastCancer Risk Allele among BRCA1 Carriers,” Nature Genetics,vol. 42, no. 10, pp. 819-820, 2010.

[3] G. Thomas, K.B. Jacobs, P. Kraft, M. Yeager, S. Wacholder, D.G.Cox, S.E. Hankinson, A. Hutchinson, Z. Wang, K. Yu, N.Chatterjee, M. Garcia-Closas, J. Gonzalez-Bosquet, L. Prokunina-Olsson, N. Orr, W.C. Willett, G.A. Colditz, R.G. Ziegler, C.D.Berg, S.S. Buys, C.A. McCarty, H.S. Feigelson, E.E. Calle, M.J.Thun, R. Diver, R. Prentice, R. Jackson, C. Kooperberg, R.Chlebowski, J. Lissowska, B. Peplonska, L.A. Brinton, A.Sigurdson, M. Doody, P. Bhatti, B.H. Alexander, J. Buring,I.M. Lee, L.J. Vatten, K. Hveem, M. Kumle, R.B. Hayes, M.Tucker, D.S. Gerhard, J.F. Fraumeni Jr., R.N. Hoover, S.J.Chanock, and D.J. Hunter, “A Multistage Genome-WideAssociation Study in Breast Cancer Identifies Two New RiskAlleles at 1p11.2 and 14q24.1 (RAD51L1),” Nature Genetics,vol. 41, no. 5, pp. 579-584, 2009.

[4] A. Meindl, “Identification of Novel Susceptibility Genes for BreastCancer—Genome-Wide Association Studies or Evaluation ofCandidate Genes?” Breast Care (Basel), vol. 4, no. 2, pp. 93-99, 2009.

[5] D. Fanale, V. Amodeo, L.R. Corsini, S. Rizzo, V. Bazan, andA. Russo, “Breast Cancer Genome-Wide Association Studies:There Is Strength in Numbers,” Oncogene, vol. 31, pp. 2121-2128, 2011.

[6] J.C. Yu, C.N. Hsiung, H.M. Hsu, B.Y. Bao, S.T. Chen, G.C.Hsu, W.C. Chou, L.Y. Hu, S.L. Ding, C.W. Cheng, P.E. Wu,and C.Y. Shen, “Genetic Variation in the Genome-WidePredicted Estrogen Response Element-Related Sequences IsAssociated with Breast Cancer Development,” Breast CancerResearch, vol. 13, no. 1, article 13, 2011.

[7] A.M. Soto and C. Sonnenschein, “The Two Faces of Janus: SexSteroids as Mediators of Both Cell Proliferation and Cell Death,”J. Nat’l Cancer Inst., vol. 93, no. 22, pp. 1673-1675, 2001.

[8] F. Auricchio, A. Migliaccio, and G. Castoria, “Sex-SteroidHormones and EGF Signalling in Breast and Prostate CancerCells: Targeting the Association of SRC with Steroid Receptors,”Steroids, vol. 73, nos. 9/10, pp. 880-884, 2008.

[9] S. Ando, F. De Amicis, V. Rago, A. Carpino, M. Maggiolini,M.L. Panno, and M. Lanzino, “Breast Cancer: From Estrogento Androgen Receptor,” Molecular Cell Endocrinology, vol. 193,nos. 1/2, pp. 121-128, 2002.

[10] P. Giovannelli, M. Di Donato, T. Giraldi, A. Migliaccio, G. Castoria,and F. Auricchio, “Targeting Rapid Action of Sex SteroidReceptors in Breast and Prostate Cancers,” Frontier Bioscience,vol. 17, pp. 2224-2232, 2011.

[11] E.W. LaPensee and N. Ben-Jonathan, “Novel Roles of Prolactinand Estrogens in Breast Cancer: Resistance to Chemotherapy,”Endocrine-Related Cancer, vol. 17, no. 2, pp. 91-107, 2010.

[12] N. Fortunati, M.G. Catalano, G. Boccuzzi, and R. Frairia, “SexHormone-Binding Globulin (SHBG), Estradiol and Breast Can-cer,” Molecular Cell Endocrinology, vol. 316, no. 1, pp. 86-92, 2010.

[13] M.S. Udler, E.M. Azzato, C.S. Healey, S. Ahmed, K.A. Pooley, D.Greenberg, M. Shah, A.E. Teschendorff, C. Caldas, A.M. Dunning,E.A. Ostrander, N.E. Caporaso, D. Easton, and P.D. Pharoah,“Common Germline Polymorphisms in COMT, CYP19A1, ESR1,PGR, SULT1E1 and STS and Survival after a Diagnosis of BreastCancer,” Int’l J. Cancer, vol. 125, no. 11, pp. 2687-2696, 2009.

[14] P.D. Pharoah, J. Tyrer, A.M. Dunning, D.F. Easton, and B.A.Ponder, “Association between Common Variation in 120 Candi-date Genes and Breast Cancer Risk,” PLoS Genetics, vol. 3, no. 3,article e42, 2007.

[15] Y.L. Low, J.I. Taylor, P.B. Grace, A.A. Mulligan, A.A. Welch,S. Scollen, A.M. Dunning, R.N. Luben, K.T. Khaw, N.E. Day,N.J. Wareham, and S.A. Bingham, “Phytoestrogen Exposure,Polymorphisms in COMT, CYP19, ESR1, and SHBG Genes,and Their Associations with Prostate Cancer Risk,” Nutritionand Cancer, vol. 56, no. 1, pp. 31-39, 2006.

[16] W. Han, K.Y. Kim, S.J. Yang, D.Y. Noh, D. Kang, and K. Kwack,“SNP-SNP Interactions between DNA Repair Genes WereAssociated with Breast Cancer Risk in a Korean Population,”Cancer, vol. 118, no. 3, pp. 594-602, 2011.

[17] J. Conde, S.N. Silva, A.P. Azevedo, V. Teixeira, J.E. Pina, J. Rueff,and J.F. Gaspar, “Association of Common Variants in MismatchRepair Genes and Breast Cancer Susceptibility: A MultigeneStudy,” BMC Cancer, vol. 9, article 344, 2009.

[18] G.T. Lin, H.F. Tseng, C.H. Yang, M.F. Hou, L.Y. Chuang, H.T. Tai,M.H. Tai, Y.H. Cheng, C.H. Wen, C.S. Liu, C.J. Huang, C.L. Wang,and H.W. Chang, “Combinational Polymorphisms of SevenCXCL12-Related Genes Are Protective against Breast Cancer inTaiwan,” OMICS: A J. Integrative Biology, vol. 13, no. 2, pp. 165-172,2009.

[19] M.D. Ritchie, L.W. Hahn, N. Roodi, L.R. Bailey, W.D. Dupont, F.F.Parl, and J.H. Moore, “Multifactor-Dimensionality ReductionReveals High-Order Interactions among Estrogen-MetabolismGenes in Sporadic Breast Cancer,” Am. J. Human Genetics,vol. 69, no. 1, pp. 138-147, 2001.

[20] Y. Chung, S.Y. Lee, R.C. Elston, and T. Park, “Odds Ratio BasedMultifactor-Dimensionality Reduction Method for DetectingGene-Gene Interactions,” Bioinformatics, vol. 23, no. 1, pp. 71-76,2007.

[21] L.E. Mechanic, B.T. Luke, J.E. Goodman, S.J. Chanock, andC.C. Harris, “Polymorphism Interaction Analysis (PIA): AMethod for Investigating Complex Gene-Gene Interactions,”BMC Bioinformatics, vol. 9, article 146, 2008.

[22] S.H. Chen, J. Sun, L. Dimitrov, A.R. Turner, T.S. Adams, D.A.Meyers, B.L. Chang, S.L. Zheng, H. Gronberg, J. Xu, and F.C. Hsu,“A Support Vector Machine Approach for Detecting Gene-GeneInteraction,” Genetic Epidemiology, vol. 32, no. 2, pp. 152-67, 2008.

[23] H.W. Chang, C.H. Yang, C.H. Ho, C.H. Wen, and L.Y. Chuang,“Generating SNP Barcode to Evaluate SNP-SNP Interaction ofDisease by Particle Swarm Optimization,” Computational Biologyand Chemistry, vol. 33, no. 1, pp. 114-119, 2009.

[24] C.H. Yang, H.W. Chang, Y.H. Cheng, and L.Y. Chuang, “NovelGenerating Protective Single Nucleotide Polymorphism Barcodefor Breast Cancer Using Particle Swarm Optimization,” CancerEpidemiology, vol. 33, no. 2, pp. 147-154, 2009.

[25] L.Y. Chuang, H.W. Chang, M.C. Lin, and C.H. Yang, “ChaoticParticle Swarm Optimization for Detecting SNP-SNP Interactionsfor CXCL12-Related Genes in Breast Cancer Prevention,” EuropeanJ. Cancer Prevention, vol. 21, no. 4, pp. 336-342, 2012.

[26] H.W. Chang, L.Y. Chuang, C.H. Ho, P.L. Chang, and C.H. Yang,“Odds Ratio-Based Genetic Algorithms for Generating SNPBarcodes of Genotypes to Predict Disease Susceptibility,” OMICS:A J. Integrative Biology, vol. 12, no. 1, pp. 71-81, 2008.

[27] C.H. Yang, L.Y. Chuang, Y.J. Chen, H.F. Tseng, and H.W. Chang,“Computational Analysis of Simulated SNP Interactions between26 Growth Factor-Related Genes in a Breast Cancer AssociationStudy,” OMICS: A J. Integrative Biology, vol. 15, no. 6, pp. 399-407,2011.

[28] S.K. Musani, D. Shriner, N.J. Liu, R. Feng, C.S. Coffey, N.J. Yi, H.K.Tiwari, and D.B. Allison, “Detection of Gene x Gene Interactionsin Genome-Wide Association Studies of Human PopulationData,” Human Heredity, vol. 63, no. 2, pp. 67-84, 2007.

[29] J.H. Holland, Adaptation in Nature and Artificial Systems, MIT Press,1992.

[30] J.J. Liu, G. Cutler, W.X. Li, Z. Pan, S.H. Peng, T. Hoey, L.B. Chen,and X.F.B. Ling, “Multiclass Cancer Classification and BiomarkerDiscovery Using GA-Based Algorithms,” Bioinformatics, vol. 21,no. 11, pp. 2691-2697, 2005.

[31] C.H. Yang, Y.H. Cheng, L.Y. Chuang, and H.W. Chang, “Con-fronting Two-Pair Primer Design for Enzyme-Free SNP Genotyp-ing Based on a Genetic Algorithm,” BMC Bioinformatics, vol. 11,article 509, 2010.

[32] L.P. Li, C.R. Weinberg, T.A. Darden, and L.G. Pedersen, “GeneSelection for Sample Classification Based on Gene ExpressionData: Study of Sensitivity to Choice of Parameters of the GA/KNN Method,” Bioinformatics, vol. 17, no. 12, pp. 1131-1142, 2001.

[33] L.Y. Chuang, C.S. Yang, J.C. Li, and C.H. Yang, “Chaotic GeneticAlgorithm for Gene Selection and Classification Problems,”OMICS: A J. Integrative Biology, vol. 13, no. 5, pp. 407-420, 2009.


Cheng-Hong Yang received the MS and PhD degrees in computer engineering from North Dakota State University in 1988 and 1992, respectively. He is a professor in the Department of Electronic Engineering at the National Kaohsiung University of Applied Sciences, Taiwan. His main areas of research are evolutionary computation, bioinformatics, and assistive tool implementation.

Yu-Da Lin received the MS degree from the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Taiwan, in 2011. He is currently working toward the PhD degree in the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Taiwan. He has rich experience in computer programming, database design and management, and systems programming and design. His main areas of research are bioinformatics and computational biology.

Hsueh-Wei Chang received the MS degree from the Institute of Radiation Biology and the PhD degree from the Department of Life Sciences at Tsing Hua University, Hsinchu, Taiwan, in 1991 and 2000, respectively. He is an associate professor and deputy director in the Department of Biomedical Science and Environmental Biology at Kaohsiung Medical University. His main areas of research are bioinformatics, genomics, cancer biomarkers, and anticancer drug screening from natural products.

Li-Yeh Chuang received the MS degree from the Department of Chemistry, University of North Carolina in 1989 and the PhD degree from the Department of Biochemistry, North Dakota State University in 1994. She is a professor and director in the Department of Chemical Engineering and Institute of Biotechnology and Chemical Engineering at I-Shou University, Kaohsiung, Taiwan. Her main areas of research are bioinformatics, biochemistry, and genetic engineering.
