156 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.6, NO.2 November 2012

Mining Rare Association Rules on Banpheo Hospital (Public Organization) via Apriori MSG-P Algorithm

Taweechai Ouypornkochagorn 1, Non-member

ABSTRACT

Mining association rules is one of the important data mining techniques for exploiting hidden knowledge in large databases. Businesses in many areas need this technique to examine their enormous volumes of information, and public health is one area that needs it badly. Much hidden information is concealed in daily operational data, such as the relation between visit time and symptom or the relation between disease and patient age. However, association rule discovery via the traditional Apriori algorithm, the fundamental way to retrieve hidden rules, consumes tremendous resources and time. This research implements a modified association rule mining technique, Apriori MSG-P, on the operational database of Banpheo hospital (Public Organization), Sumutsakon province, Thailand. The objectives target epidemic information and patient behavior regarding the hospital's services. The outcomes show that our implementation can extract a lot of valuable information that can be used by both operational staff and executive staff. Moreover, the outcomes demonstrate that Apriori MSG-P is a suitable technique for real-world databases.

Keywords: Banpheo Hospital, Apriori MSG-P, Multiple Supports, Association Rule, Data Mining Implementation

1. INTRODUCTION

Association rule mining is one of the most popular data mining techniques for discovering interesting patterns in databases, and this valuable information can be applied in many real-world applications. The Apriori algorithm is the fundamental algorithm for association rule discovery; it uses a user-specified minimum support threshold to select interesting patterns. Unfortunately, Apriori performs an exhaustive search over all possible itemsets, which takes unacceptable time in real-world applications. Several studies have proposed ways to improve the traditional Apriori algorithm.

Manuscript received on September 1, 2011; revised on January 31, 2012.
1 The author is with Faculty of Engineering at Si Racha, Kasetsart University Si Racha Campus, Chonbury, Thailand, E-mail: [email protected]

For example, to reduce the number of database scans, Frequent-Pattern growth (FP Tree) [10] uses a tree structure with a divide-and-conquer strategy that avoids candidate generation, and PRICE [4] scans the database only once and uses logical operations in the process. Furthermore, [1], [17], [21], [22] proposed ways to reduce the database size with strategic planning and sampling. This reduces CPU computation time and disk access, but the qualification and the quality of the sampling remain a concern. Alternatively, transaction reduction [14], [15] mines association rules effectively by ignoring, in subsequent scans, any transaction that does not contain any frequent k-itemsets. Parallelization by partitioning is a challenge for parallel data mining. FDM [5] and FSM [6] are parallelizations of Apriori that mine association rules on separate partitions, each handled by a separate machine. Grid services were applied in [3] to implement high-performance distributed knowledge discovery systems that can use parallel and distributed mining. Hashing is another effective alternative: Direct Hashing and Pruning (DHP) [11] uses a hashing technique to filter out ineffective candidate 2-itemsets and also avoids some database scans. Hashing techniques have also been applied to mining association rules in text databases [9] and over data streams [8].

In many applications, some interesting items appear very frequently in the data while others appear rarely. When association rules are mined with traditional Apriori and a single minimum support, setting the minimum support too high means that rules involving rare items will not be found. Setting it very low in order to retrieve rare itemsets, however, may cause a combinatorial explosion, because the frequent items will be associated with one another in all possible ways. Users then receive the expected rare itemsets together with an enormous number of unwanted itemsets. This dilemma is called the rare item problem. Interesting itemsets that contain some interesting item but do not have enough support to be evaluated are called "rare itemsets". Threshold improvements have been proposed. Multiple minimum support approaches for evaluating rare itemsets were proposed in [2] (called MSApriori), [7], [13], [16], but these are still hard to use in practice.


In this research, we apply Apriori MSG-P [18], which improves on multiple minimum support approaches by requiring only one specific user threshold for discovering rare itemsets, to a real-world application. We choose the hospital information system at Banpheo hospital (Public Organization), Sumutsakon province, Thailand as our database. The threshold proposed by Apriori MSG-P is based on the statistical distribution of support values: it is interpretable, makes the desired outcome easy to estimate, and is able to find rare itemsets by examining the data characteristics of the support values.

2. CHARACTERISTIC OF TRADITIONAL APRIORI'S RESULT

To show the characteristics of the traditional Apriori algorithm, let us take a look at a sample database, the Northwind database from http://www.microsoft.com. We apply the following attribute-oriented induction process:

• Categorize the product's unit price into 3 ranges: high (> 50), medium (20-50) and low (< 20)

• Categorize the ordering quantity into 3 ranges: high (> 40), medium (15-40) and low (< 15)

Then we choose 8 attributes of the sales order subject for the trial: product, product's category, product's unit price, employee, ordering month, ordering quantity, customer's city and customer's country. In total, 202 items were found in 2,155 transactions. The results of running Apriori with a minimum support of 1 transaction count are shown in table 1. Clearly, each itemset size has different data characteristics, so applying a single minimum support to all itemsets may not be suitable.

If a 1% minimum support (a support count of 21.55) is chosen, this support is located at percentile 33.96% of itemset size 1, 73.79% of itemset size 2 and 100% of the larger sizes. This means that filtering with a 1% minimum support passes a lot of itemsets at itemset size 1 (66.04%!), but screens the larger sizes strictly (only 26.21% pass at itemset size 2 and very few at the larger sizes).
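The uneven effect of one fixed support can be illustrated with a few lines of code. The sketch below is a minimal illustration and not the paper's implementation: `pass_rate` is a hypothetical helper, and the support counts are made-up numbers (not Northwind data), chosen only to show that the same support count passes very different shares of itemsets at different itemset sizes.

```python
def pass_rate(support_counts, min_count):
    """Share of itemsets whose support count reaches min_count."""
    if not support_counts:
        return 0.0
    passed = sum(1 for c in support_counts if c >= min_count)
    return passed / len(support_counts)


# Hypothetical support counts grouped by itemset size (illustration only).
supports_by_size = {
    1: [300, 120, 80, 40, 25, 10, 5, 2],
    2: [90, 30, 22, 12, 6, 3, 2, 1, 1, 1],
    3: [15, 4, 2, 1, 1, 1, 1, 1],
}

min_count = 21.55  # 1% of 2,155 transactions, as in the text

for size, counts in supports_by_size.items():
    print(f"size {size}: {pass_rate(counts, min_count):.0%} of itemsets pass")
```

With these illustrative numbers the share that passes shrinks quickly as the itemset size grows, which is the behavior the paragraph above describes for the real data.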

Table 1: Itemsets Characteristic with 1% min supp.

On the other hand, when a 40% statistical percentile is applied to each size instead of a minimum support, it corresponds to a 2.15% minimum support at itemset size 1, 0.15% at size 2 and 0.07% at size 3 (other sizes omitted). That is, a 40% statistical percentile generates multiple supports automatically, depending on the data characteristics. Additionally, the statistical percentile can be used to anticipate the number of resulting itemsets: the outcome at the 40% percentile covers about 100-40=60% of the itemsets of each size when the support distribution is a symmetric curve. So statistical percentile measurements should be used as the interestingness threshold instead of a single minimum support.

Turning to the meaning of minimum support: its meaning is unclear to a user who must find the best outcome by trial when the environment changes. For example, you may first require 3 out of 10 transactions (30%) as the minimum support, but when the number of transactions increases to 100 and the number of items doubles, how should the minimum support change? Should it remain 3, change to 30, or take another value? Further confusion arises when moving to another database. If the best result on database A is obtained at a 30% minimum support, 30% is not necessarily right for database B, even if both have the same number of transactions. So minimum support does not directly reflect interestingness, especially when the database differs or changes, because it does not analyze the database characteristics first. Indeed, interestingness should be measured by ranking, or by some other comparison, among the itemsets themselves.

3. HOSPITAL INFORMATION SYSTEM (HIS)

A Hospital Information System (HIS) is a system designed to manage daily operations in a hospital, in both front offices and back offices. Examples of important offices are as follows:

• Medical record office
• Outpatient department (OPD)
• Inpatient department (IPD)
• Emergency Room (ER)
• Operating Room (OR)
• Labor Room (LR)
• Medicine (MED)
• Surgical (SUR)
• Orthopedic (ORTHO)
• Obstetric Gynecology (OB-GYN)
• Etc.

All information moves through the system for multiple purposes, such as treatment or payment. Therefore, an enormous amount of information is stored in the hospital database. In addition, HIS is currently being implemented according to standards. Well-known standards, such as HL7 by the American National Standards Institute (ANSI) or statXML by Thailand's National Information Center (NIC), are applied and used for system integration among hospitals or between hospitals and government agencies.

Unfortunately, only a small part of this information is used today, since exploiting it takes a lot of time and human resources. A lot of knowledge remains unused, and some of it is unclear knowledge that long-experienced hospital staff are accustomed to and accept, which is a great obstacle for new staff trying to understand it. This hidden and unclear information is highly valuable knowledge for the medical profession and may be found from symptom diagnosis results and the circumstances of patients. So a lot of knowledge is still lying there, waiting for someone to discover it.

Banpheo hospital (public organization) is the biggest local hospital in Sumutsakon province, Thailand, with 304,470 patients and 6,864,118 clinic visits in 2010. Even though much executive information is prepared by the HIS as executive or operational reports, the amount of discovered information seems small compared with the amount of data in the database. In this research, we apply one of our recently proposed techniques, Apriori MSG-P, to extract hidden knowledge from Banpheo hospital's operational database by developing an application for self-discovery and for interfacing with hospital staff. The results can be found in section 5.

Fig.1: Banpheo hospital (public organization), Sumutsakon province, Thailand

4. MINING ASSOCIATION RULES WITH APRIORI MSG-P

Apriori with Multiple Support Generating by statistic Percentile threshold (Apriori MSG-P) was proposed in [18] and is shown in Fig. 2. This algorithm combines a statistical concept, the Normal Distribution Curve, with traditional Apriori by analyzing all outcomes and defining "user's interestingness" with a statistical percentile parameter. Apriori MSG-P is thus able to provide results that meet users' needs while still preserving rare itemsets.

Referring to Fig. 2, lines 1 to 8 and 13 to 15 of function Apriori MSG-P are the same as in traditional Apriori: they evaluate candidates and frequent itemsets. Apriori MSG-P adds steps in lines 9 to 11 that calculate the statistics of the support values, itemset size by itemset size. The user's statistical percentile value α is applied in line 11 by looking up the Normal Distribution Curve to obtain the minimum support at that size (MSS). At line 12, the true minimum support at that size is taken as the greater of MSS and LS (see definition 1). The least support, LS, is another user-specified threshold. It is the lowest minimum support that an itemset must satisfy to become a frequent itemset, and it prevents uninteresting itemsets from being created when the standard deviation at a size is near zero. Note that apriori_gen and has_infrequent_subset are the same as in traditional Apriori and are not described here.
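The steps described above can be sketched in Python. This is a minimal sketch written from the prose description only, not a transcription of Fig. 2: the function name `apriori_msgp`, its parameters, the use of the sample standard deviation, and the toy transactions are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import NormalDist, mean, stdev


def apriori_msgp(transactions, percentile, least_support):
    """Sketch of Apriori MSG-P; returns {itemset: support_count}.

    percentile    -- user percentile alpha as a fraction, strictly between 0 and 1
    least_support -- user least support LS as a fraction of all transactions
    """
    transactions = [frozenset(t) for t in transactions]
    ls_count = least_support * len(transactions)
    z = NormalDist().inv_cdf(percentile)  # z-score for the user percentile

    # Size-1 candidates: every distinct item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of every candidate of the current size.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        values = list(counts.values())
        # Statistics of the support counts at this size (the sample standard
        # deviation is an assumption; it needs at least two values).
        sigma = stdev(values) if len(values) > 1 else 0.0
        mss = z * sigma + mean(values)   # generated minimum support at this size
        ms = max(mss, ls_count)          # greater of MSS and LS
        kept = {c: n for c, n in counts.items() if n >= ms}
        frequent.update(kept)

        # Candidate generation as in traditional Apriori: join kept itemsets
        # and drop any candidate that has an infrequent subset.
        k += 1
        candidates = set()
        for a, b in combinations(kept, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in kept for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


# Hypothetical toy data, not taken from the paper.
demo = [["a", "b"], ["a", "b"], ["a", "b", "c"], ["a"], ["d"]]
result = apriori_msgp(demo, 0.50, 0.10)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The threshold is recomputed inside the loop, once per itemset size, so a single percentile yields a different minimum support at each size. The brute-force support counting keeps the sketch short; a real implementation on a hospital-sized database would need a more efficient counting scheme.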

Definition 1: the minimum support of k-itemset candidates: $MS_k$ is defined as below, where $MSS_k$ is the minimum support calculated from the user percentile value (α) and LS is the user-specified least support.

$$MS_k = \begin{cases} MSS_k & \text{if } MSS_k > LS \\ LS & \text{otherwise} \end{cases} \tag{1}$$

Definition 2: the minimum support of k-itemsets: $MSS_k$ is the generated support threshold whose value depends on the statistics of the support values of all candidates of the same itemset size, based on the Normal Distribution Curve. $MSS_k$ is the support value located at the percentile calculated from (2), and that percentile has to equal the user-specified percentile α (in practice, a standard normal table is easier to use). The z-score obtained from (2) is used in (3) to find $MSS_k$. Note that µ is the mean and σ is the standard deviation.

$$\mathrm{Percentile}(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}t^{2}}\, dt \tag{2}$$

$$MSS_k = z \times \sigma + \mu \tag{3}$$
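Equations (1) to (3) can be evaluated without a statistical table by inverting the standard normal curve numerically. The sketch below is one possible way to do so, under stated assumptions: the function names are hypothetical, the sample standard deviation is assumed (the text does not say which estimator is used), and the support counts are made-up values.

```python
from statistics import NormalDist, mean, stdev


def generated_min_support(support_counts, percentile):
    """MSS for one itemset size: z * sigma + mu, as in equation (3)."""
    z = NormalDist().inv_cdf(percentile)  # the z whose percentile in (2) is alpha
    mu = mean(support_counts)
    # Assumption: sample standard deviation; a single value has no spread.
    sigma = stdev(support_counts) if len(support_counts) > 1 else 0.0
    return z * sigma + mu


def min_support(support_counts, percentile, least_support_count):
    """MS for one itemset size: the greater of MSS and LS, as in equation (1)."""
    mss = generated_min_support(support_counts, percentile)
    return max(mss, least_support_count)


# Hypothetical support counts of all candidates of one size.
counts = [2, 4, 4, 4, 5, 5, 7, 9]
print(generated_min_support(counts, 0.40))  # below the mean, since alpha < 50%
print(min_support(counts, 0.40, 1))
```

A percentile of 50% returns the mean itself, lower percentiles give thresholds below the mean, and the least support only takes effect when the generated value falls below it.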

Note that the single user-specified percentile value is used to generate multiple minimum supports, itemset size by itemset size. Infrequent itemsets, whose supports are less than $MSS_k$, are not used to generate the next size of candidates. This reduces the number of uninteresting itemsets and improves the creating time, especially on large databases, while also preserving rare itemsets at large itemset sizes.

The extraction of rare itemsets using Apriori MSG-P is illustrated in example 1.

Example 1: Table 2 shows a dataset of 10 transactions with 12 items. This example applies a percentile threshold of 40% and a least support of 10% (or 1 support count); the results are shown in tables 2-5.

Fig.2: Pseudo code of Apriori MSG-P

Table 2: Transformed Database

Table 3: Itemsets at Size 1

From the database shown in table 2, the size 1 itemsets were discovered and are shown in table 3. The mean support count of all size 1 itemsets is 3.083333 and the standard deviation is 1.831955. With a user preference of the 40 percentile and these statistical parameters, $MSS_1$ is 2.619244 (from (3)). Since LS is 0.1, meaning a least support count of 1 transaction, $MS_1$ is the greater of $MSS_1$ and LS, which is 2.619244. Itemsets with a support count lower than 2.619244 are discarded. We find that 7 of the 12 candidate itemsets are frequent, a coverage of 58.33%. All frequent itemsets of size 1 are then used to generate the candidate itemsets of size 2, and so on.

Table 4: Itemsets at Size 2

Table 5: Itemsets at Size 3

We can see from the above example that the coverage approximately equals 100 minus the percentile, or 100-40=60%. This helps us to focus on the most interesting itemsets of each size.
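The size 1 arithmetic of Example 1 can be rechecked from the reported mean and standard deviation alone. The snippet below is a small sketch with hypothetical variable names. It computes the z-score numerically rather than from a printed table, so the last digits may differ slightly from the value reported in the text.

```python
from statistics import NormalDist

mean_count = 3.083333   # reported mean support count at size 1
std_count = 1.831955    # reported standard deviation at size 1
alpha = 0.40            # user percentile
ls_count = 0.1 * 10     # least support of 10% on 10 transactions

z = NormalDist().inv_cdf(alpha)      # z-score of the 40th percentile
mss_1 = z * std_count + mean_count   # equation (3)
ms_1 = max(mss_1, ls_count)          # definition 1
coverage = 7 / 12                    # 7 of 12 candidates kept, as reported

print(f"z = {z:.4f}, MSS_1 = {mss_1:.3f}, MS_1 = {ms_1:.3f}, coverage = {coverage:.2%}")
```

The generated support is about 2.619, so every item with a support count of 3 or more is kept, and the resulting coverage of 58.33% is close to the 60% expected from a 40% percentile.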

5. EXPERIMENTAL RESULTS

5.1 Experimental results on Northwind database

The experiments were run on the Northwind database after the attribute-oriented induction process described above. We compare the results of Apriori MSG-P and traditional Apriori at settings where, by adjusting the thresholds, the two techniques produce approximately the same number of itemsets. The results are shown in tables 6-7 and Fig. 3-5. Clearly, rare itemsets are found better, especially at large itemset sizes. At itemset sizes 5-6 in Fig. 5, Apriori MSG-P discovers itemsets that Apriori scarcely finds. At sizes 3-4 in Fig. 4, Apriori MSG-P also discovers more, 560.60% more on average. This means traditional Apriori cannot effectively evaluate rare itemsets of size 3 and above. In contrast, at sizes 1-2 in Fig. 3, traditional Apriori produces more itemsets than Apriori MSG-P, but many of them seem uninteresting when their support values are compared with those of the other itemsets of the same size. Therefore, Apriori MSG-P is able to examine and select itemsets at all itemset sizes, and the selected itemsets are significant among their siblings of the same size.

Moreover, with traditional Apriori the best minimum support is very difficult to guess at the first attempt. In contrast, Apriori MSG-P is easy to use because of its statistical meaning. The percentile is a statistical parameter that can estimate the number of outcomes (more accurately when the distribution is close to symmetric). For example, a percentile value of 40% means that about 60% of the itemsets, those with high support, will be selected (left tail concept).

Table 6: The results from traditional Apriori

Table 7: The results from Apriori MSG-P

Fig.3: Percentage of number of itemsets in itemset size 1-2 using Apriori and Apriori MSG-P

The experiments on creating time are shown in Fig. 6. The results clearly show that Apriori MSG-P takes less creating time, especially when the number of result itemsets is high. Note that the experiments were run on a Pentium Dual Core 3.00GHz with 2GByte RAM.

Fig.4: Percentage of number of itemsets in itemset size 3-4 using Apriori and Apriori MSG-P

Fig.5: Percentage of number of itemsets in itemset size 5-6 using Apriori and Apriori MSG-P

Fig.6: Creating time using Apriori and Apriori MSG-P


5.2 Experimental results on Banpheo hospital HIS

The experiments on the Banpheo hospital HIS fall into 7 categories. Each category targets the discovery of interesting behaviors or patterns, hidden in the Banpheo hospital HIS, among selected attributes. All experiments were run on the real operational database between February 2010 and February 2011, on the following 14 selected attributes:

• Month of year such as January, February.
• Day of week such as Monday, Tuesday.
• Clinic such as Orthopedic, Obstetric Gynecology.
• Disease (reference on standard ICD10 code) such as Tuberculosis of eye, Chancroid.
• Disease group such as Erysipelas, Measles, Dermatophytosis.
• Diagnostic 21 groups such as Neoplasms, Diseases of the nervous system.
• Pestilence that has to be reported to the Bureau of Epidemiology, Thailand (Report 506) such as Cholera, Chickenpox.
• Patient age range consists of 0-5, 5-15, 15-25, 25-35, 35-45, 45-55, 55-65, 65-75, 75+, and No data.
• Patient pay right such as social security right, government officer right.
• Marital status such as single, married, and divorced.
• Sex consists of male, female, and not specific.
• Education background such as Grade 4, 6, 9, 12, and bachelor.
• Occupation such as Engineer, Farmer, and Reporter.
• Region of present address consists of local area (Banpheo District), close neighbor area (other districts in Sumutsakon province), further neighbor area (provinces around Sumutsakon except Bangkok), Bangkok (Capital of Thailand), other provinces, and not specific.

Note that some experiments omit one or more attributes, since their research topic requires those attributes to be fixed.

The experiments were performed by running Apriori MSG-P on each research topic category, on a Pentium Dual Core 3.00GHz with 2GByte RAM. As a result, Apriori MSG-P produced interesting association rules at each itemset size, which were stored in a database. This database is recommended to be used together with the rule discovery application "Data Mining Client v1.0", which was developed for users at Banpheo hospital. The application lets users select the outcomes of a research topic, specify interesting or uninteresting attributes, or interesting or uninteresting items of each attribute, and specify itemsets either as itemsets that contain only the specified attributes or as itemsets that contain at least the specified attributes. Examples of the "Data Mining Client v1.0" application are shown in Fig. 7-8.

Fig.7: “Data Mining Client v1.0” application

Fig.8: Discovering association rules which are exploited by Apriori MSG-P algorithm

5.2.1 Patient behavior of Eye clinic

This research topic studies patient behavior at the Eye clinic from 36,746 clinic visits between February 25, 2010 and February 22, 2011. This clinic has different customer behavior from most clinics in Banpheo hospital, and the hospital's executive staff need to uncover facts that may be hidden in the HIS. The experiment sets the user percentile of Apriori MSG-P to 50%. The results are shown in table 8. The processing time was 39 minutes.

Table 8: Examples of interesting behavior on Eye clinic


5.2.2 Patient behavior of Orthopedic clinic

This research topic studies patient behavior at the Orthopedic clinic from 33,675 clinic visits between February 25, 2010 and February 22, 2011. Like the Eye clinic, this clinic has different customer behavior from most clinics in Banpheo hospital, and the hospital's executive staff need to uncover facts that may be hidden in the HIS. The experiment sets the user percentile of Apriori MSG-P to 50%. The results are shown in table 9. The processing time was 32 minutes.

Table 9: Examples of interesting behavior on Orthopedic clinic

5.2.3 Patient paying behavior on cash, credit card or equivalent

This research topic studies the behavior of patients who pay by cash, credit card or equivalent, from 88,542 visits between February 25, 2010 and February 22, 2011. Since Banpheo hospital is a bureaucratic organization that accepts all social security rights, hospital customers who choose to pay by cash, credit card, or equivalent are of high interest to the executive staff. The experiment sets the user percentile of Apriori MSG-P to 50%. The results are shown in table 10. The processing time was 4 hours 57 minutes.

Table 10: Examples of interesting patient paying behavior on cash, credit card, or equivalent

5.2.4 Patient behavior of close neighbor area customers (other districts in Sumutsakon province)

This research topic studies the behavior of patients from the close neighbor area (other districts in Sumutsakon province: Kratumban and Mueng district) from 31,610 visits between February 25, 2010 and February 22, 2011. The topic asks why customers from areas outside Banpheo district prefer to use the services of Banpheo hospital and which clinics they prefer. The experiment sets the user percentile of Apriori MSG-P to 50%. The results are shown in table 11. The processing time was 55 minutes.

Table 11: Examples of interesting patient behavior of close neighbor area customers (other districts in Sumutsakon province)

5.2.5 Patient behavior of further neighbor area customers (provinces around Sumutsakon except Bangkok)

This research topic studies the behavior of patients from the further neighbor area (provinces around Sumutsakon except Bangkok: Nakornpathom, Rachabury and Sumutsongkram province) from 59,726 visits between February 25, 2010 and February 22, 2011. The topic asks why customers from the provinces around Sumutsakon province, except Bangkok, prefer to use the services of Banpheo hospital and which clinics they prefer. The experiment sets the user percentile of Apriori MSG-P to 50%. The results are shown in table 12. The processing time was 1 hour 50 minutes.

5.2.6 Patient behavior of Bangkok

This research topic studies the behavior of patients from Bangkok province (Capital of Thailand) from 12,927 visits between February 25, 2010 and February 22, 2011. The topic asks why customers from Bangkok (the place with the most medical resources in Thailand) prefer to use the services of Banpheo hospital and which clinics they prefer. The experiment sets the user percentile of Apriori MSG-P to 50%. The results are shown in table 13. The processing time was 23 minutes.

Table 12: Examples of interesting patient behavior of further neighbor area customers (provinces around Sumutsakon except Bangkok)

Table 13: Examples of interesting patient behavior of Bangkok, the capital of Thailand

5.2.7 Patient behavior of other provinces of Thailand

This research topic studies the behavior of patients from other provinces from 35,687 visits between February 25, 2010 and February 22, 2011. The topic asks why customers from other provinces of Thailand prefer to use the services of Banpheo hospital and which clinics they prefer. The experiment sets the user percentile of Apriori MSG-P to 50%. The results are shown in table 14. The processing time was 1 hour 16 minutes.

Table 14: Examples of interesting patient behavior of other provinces of Thailand

5.3 Apriori MSG-P's result analysis comparing with traditional Apriori's results

The experiments were repeated on the patient behavior of the Eye clinic of Banpheo hospital, on 36,746 clinic visits between February 25, 2010 and February 22, 2011, to evaluate the itemsets that are very interesting from the users' view. Running Apriori MSG-P with the 85 percentile and a 0.1 least support, a setting that emphasizes highly interesting itemsets, produces 187 itemsets with a lowest support value of 8.09%. This number of results is only about 0.70% of the number of results of traditional Apriori with a 1% minimum support. We study how Apriori MSG-P is better than traditional Apriori. Our analysis is divided into 3 subjects.

5.3.1 Interesting itemsets that are found by Apriori MSG-P but not by traditional Apriori

In this experiment, we adjusted the minimum support threshold of traditional Apriori until it gave the same number of itemsets as Apriori MSG-P at the 85 percentile (187 itemsets). This happened at a 15.64% minimum support: 147 itemsets are the same as those of Apriori MSG-P and 40 itemsets are different. Among the differing itemsets in the Apriori MSG-P outcome, 23 itemsets are of size 1 and 17 itemsets are of size 2. To show that the itemsets found by Apriori MSG-P are interesting, we pick, for each itemset size, the differing itemset with the lowest support value and analyze it.

At size 1, we found the item {February}. This itemset has a support count of 2,974 and a support value of 8.09%. We investigate its interestingness by comparing it with its siblings: the size 1 itemsets have an average support count of 461 and a support count standard deviation of 2,377.24. That means {February} is located at the 85.47 percentile among its siblings.

In the same way, {30 Baht health security right, Age range: 65-75} is the lowest-support itemset at item size 2, with a support count of 5,102 and a support value of 13.88%. The size-2 itemsets have an average support count of 2,075 and a support count standard deviation of 2,836.52. This places {30 Baht health security right, Age range: 65-75} at the 85.70 percentile among its siblings.

Both selected itemsets are therefore interesting because of their percentile. We conclude that Apriori MSG-P retrieves interesting itemsets more effectively than traditional Apriori, since the itemsets that only it finds lie at a high percentile among their siblings.
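This section does not state how a percentile is obtained from the sibling average and standard deviation. One reading that agrees closely with the reported figures is a normal approximation, in which the percentile is the normal cumulative distribution evaluated at the itemset's z-score. The Python sketch below works under that assumption; the function name `normal_percentile` is ours.

```python
import math


def normal_percentile(support_count, mean, std_dev):
    """Percentile of a support count, assuming normally distributed siblings."""
    z = (support_count - mean) / std_dev
    # Standard normal CDF written with the error function, scaled to 0..100.
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Figures quoted in the text for the two lowest-support itemsets.
february = normal_percentile(2974, 461, 2377.24)
baht_age_65_75 = normal_percentile(5102, 2075, 2836.52)

print(f"{{February}}: {february:.2f}")
print(f"{{30 Baht health security right, Age range: 65-75}}: {baht_age_65_75:.2f}")
```

Both values come out within a few hundredths of the reported 85.47 and 85.70, which suggests, but does not prove, that the percentile was computed this way.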

5.3.2 Itemsets found by traditional Apriori but judged uninteresting by Apriori MSG-P

This experiment continues from the previous one by looking at the 40 different itemsets that traditional Apriori retrieved. Of these, 20 are of size 3, 16 are of size 4 and 4 are of size 5. For each item size, we pick the itemset with the highest support value to determine why Apriori MSG-P rejected it.

At item size 3, {Year: 2010, Eye clinic, General pay right} has a support count of 6,597 and a support value of 17.95%. Among its siblings, the average support count is 8,902 and the support count standard deviation is 3,174.14. This itemset lies at the 23.39 percentile.

At item size 4, {Eye clinic, ICD10 group: G00-G99 Diseases of the nervous system, Married, Female} has a support count of 7,604 and a support value of 20.69%. Among its siblings, the average support count is 7,820 and the support count standard deviation is 1,731.30. This itemset lies at the 45.03 percentile.

At item size 5, {Year: 2010, Eye clinic, 30 Baht health security right, Married, Present Address: Bangkok-Thailand} has a support count of 6,532 and a support value of 17.78%. Among its siblings, the average support count is 6,615 and the support count standard deviation is 797.68. This itemset lies at the 45.86 percentile.

Therefore, given the low percentile values of the selected itemsets, we conclude that although these itemsets have high support values, they appear uninteresting compared with their siblings. Traditional Apriori with a single minimum support may therefore not be a good way to evaluate “interesting itemsets”.
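Suppose the percentile is read from a normal distribution fitted to the sibling support counts. This is an assumption, not something stated in the text, although it is consistent with the percentiles reported above. Under it, the 85 percentile setting translates into a separate minimum support count for each item size. The Python sketch below derives those thresholds from the quoted averages and standard deviations; the name `min_support_count` is hypothetical.

```python
from statistics import NormalDist

# item size -> (sibling mean, sibling std dev, support count of picked itemset),
# all taken from the figures quoted in the text.
SIBLING_STATS = {
    3: (8902, 3174.14, 6597),
    4: (7820, 1731.30, 7604),
    5: (6615, 797.68, 6532),
}


def min_support_count(mean, std_dev, percentile):
    """Support count at a given percentile of an assumed normal distribution."""
    return NormalDist(mean, std_dev).inv_cdf(percentile / 100.0)


thresholds = {}
rejected = {}
for size, (mean, std_dev, support) in SIBLING_STATS.items():
    thresholds[size] = min_support_count(mean, std_dev, 85)
    rejected[size] = support < thresholds[size]
    print(f"size {size}: threshold {thresholds[size]:.0f}, "
          f"support {support}, rejected={rejected[size]}")
```

Under this reading, each of the three itemsets falls short of its size-specific threshold even though its support value exceeds the 15.64% minimum support used for traditional Apriori.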

5.3.3 Performance in evaluating interesting itemsets

From the last 2 subjects, we can conclude that traditional Apriori with minimum support is a threshold technique that does not try to interpret “interestingness” among sibling itemsets. Traditional Apriori may therefore return uninteresting itemsets, and this problem seems very difficult to eliminate. Apriori MSG-P produces 187 itemsets with a lowest support of 8.09%. When we run traditional Apriori with an 8.09% minimum support, it produces 729 itemsets, or 3.90 times the number produced by Apriori MSG-P. From this analysis we conclude that traditional Apriori gives lower evaluating performance: when users want to find some interesting itemsets, traditional Apriori also produces many itemsets that are uninteresting compared with their siblings.

6. CONCLUSION

In this paper, we apply Apriori MSG-P, an improvement of traditional Apriori, to a real-world application, the hospital information system (HIS) of Banpheo hospital. Apriori MSG-P is based on multiple minimum supports derived from a single statistic percentile value. It mines rare itemsets more effectively than traditional Apriori because, during evaluation, it considers the different characteristics of the support values at each itemset size and then automatically generates a minimum support value for each size from those characteristics. In addition, the statistic percentile value is easier to understand and to choose, especially when choosing or adjusting it for the first time. Experimental results show that Apriori MSG-P evaluates rare itemsets more effectively, both in evaluating performance and in the time taken to generate them.

Turning to the experimental results on Banpheo’s database, Apriori MSG-P can be used to discover hidden information in the form of association rules with acceptable running time. The resulting rules, especially those with large itemset sizes, reveal many valuable patient behaviors and medical observations that both operational and executive staff can use to enrich their services or arrange them more appropriately.

References

[1] B. A. Mahafzah, A. F. Al-Badarneh and M. Z. Zakaria, “A New Sampling Technique for Association Rule Mining,” Journal of Information Science, vol. 35, 2009, pp. 358–376.

[2] B. Liu, W. Hsu, and Y. Ma, “Mining Association Rules with Multiple Minimum Supports,” SIGKDD Explorations, 1999.

[3] C. Mitica and A. Cristian, “Association Rules Discovery using Grid Services,” In 3rd International Conference: Sciences of Electronic, Technologies of Information and Telecommunications, 2005.

[4] C. Wang and C. Tjortjis, “Prices: An Efficient Algorithm for Mining Association Rules,” Lecture Notes in Computer Science, vol. 2447, 2002, pp. 77–83.

[5] D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu, “A Fast Distributed Algorithm for Mining Association Rules,” In Proc. of 1996 Int’l Conference on Parallel and Distributed Information Systems, 1996, pp. 31–44.

[6] D. Cheung and Y. Xiao, “Effect of Data Skewness in Parallel Mining of Association Rules,” Lecture Notes in Computer Science, vol. 1394, 1998, pp. 48–60.

[7] D. Xiangjun, Z. Zhiyun, N. Zhendong and J. Qiuting, “Mining Infrequent Itemsets Based on Multiple Level Minimum Supports,” Innovative Computing, Information and Control (ICICIC’07), 2007, pp. 528.

[8] En Tzu Wang and L. P. Arbee Chen, “A Novel Hash-Based Approach for Mining Frequent Itemsets over Data Streams Requiring Less Memory Space,” Data Mining and Knowledge Discovery, vol. 19, no. 1, 2009, pp. 132–172.

[9] J. D. Holt and S. M. Chung, “Mining of Association Rules in Text Databases Using Inverted Hashing and Pruning,” Lecture Notes in Computer Science, vol. 1874, 2000, pp. 290–300.

[10] J. Han and J. Pei, “Mining Frequent Patterns by Pattern-Growth: Methodology and Implications,” ACM SIGKDD Explorations Newsletter 2, 2000, pp. 14–20.

[11] J. Soo, M. S. Chen, and P. S. Yu, “Using a Hash-Based Method with Transaction Trimming and Database Scan Reduction for Mining Association Rules,” IEEE Transactions on Knowledge and Data Engineering, vol., no. 5, 1997, pp. 813–825.

[12] K. Waiyamai and L. Lakhal, “Knowledge Discovery from Very Large Databases Using Frequent Concept Lattices,” In Proc. ECML 2000, 2000, pp. 437–445.

[13] O. Weimin and H. Qinhua, “Mining Direct and Indirect Association Patterns with Multiple Minimum Supports,” Computational Intelligence and Software Engineering (CiSE), 2010, pp. 1–4.

[14] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” In Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487–499.

[15] R. E. Thevar and R. Krishnamoorthy, “A New Approach of Modified Transaction Reduction Algorithm for Mining Frequent Itemset,” In Proc. 11th Conference on Computer and Information Technology ICCIT 2008, 2008.

[16] R. Uday Kiran and P. Krishna Re, “An Improved Multiple Minimum Support Based Approach to Mine Rare Association Rules,” Computational Intelligence and Data Mining (CIDM ’09), IEEE Symposium, 2009, pp. 340–347.

[17] S. Parthasarathy, “Efficient Progressive Sampling for Association Rules,” IEEE International Conference on Data Mining, 2002, pp. 354–361.

[18] T. Ouypornkochagorn and K. Waiyamai, “Apriori MSG-P: A Statistic-Based Multiple Minimum Support Approach to Mine Rare Association Rules,” The 3rd International Conference on Knowledge and Smart Technologies (KST-2011), Chonburi, Thailand, 2011.

[19] T. Ouypornkochagorn and K. Waiyamai, “Domain Ontology Development and Maintenance using Pertinent Concept Lattice,” GESTS International Transactions on Computer Science, vol. 4, no. 1, 2005.

[20] T. Ouypornkochagorn and K. Waiyamai, “Formal Concept Mining: A Statistic-based Approach for Pertinent Concept Lattice Construction,” In Proc. 9th Asian Computing Science Conference ASIAN2004, published in Lecture Notes in Computer Science, 2004, pp. 195–204.

[21] V. Umarani and M. Punithavalli, “Developing a Novel and Effective Approach for Association Rule Mining Using Progressive Sampling,” In Proc. of 2nd Int’l Conference on Computer and Electrical Engineering (ICCEE 2009), vol. 1, 2009, pp. 610–614.

[22] V. Umarani and M. Punithavalli, “On Developing an Effectual Progressive Sampling Based Approach for Association Rule Discovery,” In Proc. of 2nd IEEE Int’l Conference on Information and Data Engineering (2nd IEEE ICIME 2010), 2010.

Taweechai Ouypornkochagorn is pursuing a Ph.D. in Biomedical Engineering at the University of Manchester, UK, financially supported by the Office of the Civil Service Commission, Thailand. He received a Master’s degree (M.Eng) in computer engineering from Kasetsart University in 2003. He was a lecturer at Kasetsart University, Si Racha campus. His research involves medical imaging technologies such as EIT and MRI, data mining on real medical databases, knowledge base systems, and embedded system applications. He received the awards “Best 50 Ideas for Thailand” from The Prime Minister’s Office, Thailand in 2010 and “Best 60 Thailand Embedded Product Award” from the Thai Embedded Systems Association in 2010 and 2009.