
Association Rules Network: Definition and Applications


Gaurav Pandey1, Sanjay Chawla2*, Simon Poon2, Bavani Arunasalam2 and Joseph G. Davis2

1 Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA

2 School of Information Technologies, University of Sydney, New South Wales, Australia

Received 17 January 2007; revised 22 January 2009; accepted 22 January 2009. DOI: 10.1002/sam.10027

Published online in Wiley InterScience (www.interscience.wiley.com).

Abstract: The role of data mining is to search "the space of candidate hypotheses" to offer solutions, whereas the role of statistics is to validate the hypotheses offered by the data-mining process. In this paper we propose Association Rules Networks (ARNs) as a structure for synthesizing, pruning, and analyzing a collection of association rules to construct candidate hypotheses. From a knowledge discovery perspective, ARNs allow for a goal-centric, context-driven analysis of the output of association rules algorithms. From a mathematical perspective, ARNs are instances of backward-directed hypergraphs. Using two extensive case studies, we show how ARNs and statistical theory can be combined to generate and test hypotheses. © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 1: 260–279, 2009

Keywords: data mining; association rules; association rules network

1. INTRODUCTION AND MOTIVATION

Data mining is often described as the process of discovering "interesting and actionable" patterns in large databases [1]. Given the large quantity of digital data that is constantly being generated and stored, data mining offers a solution to the problem of rapidly summarizing and searching for nonobvious relationships in large databases. The role of data mining can be perceived as a methodology for discovering candidate hypotheses or theories, whereas statistical methods can be used to validate the proposed theories against data. This augmentation allows us to exploit recent advances in computationally efficient search techniques designed for large databases in conjunction with traditional statistical methods, which remain the bedrock of theory verification and validation.

Figure 1(a) depicts a traditional theory-building process. The starting point of such a process is a set of observations (events) that triggers the researcher to make a conceptual leap and arrive at a framework in which the structure of the underlying process (that is generating the events) can be elicited. In order to make the theory as unambiguous as possible, a mathematical model representing the theory is often constructed. The theory is validated by testing the mathematical model, and the hypotheses suggested by it, using appropriate data.

Correspondence to: Sanjay Chawla ([email protected])

Figure 1(b) shows how the traditional approach can be augmented with data-mining techniques. For example, consider the classical data-mining technique known as association rule mining (ARM) [1,3–5]. The objective of ARM is to find "interesting patterns" in the form of rules X → Y, where X and Y can be attributes, items, or more generally "data objects". Now, if we knew beforehand that X and Y are correlated, in a statistical sense, then the discovery of the rule X → Y does not necessarily add any useful information. On the other hand, if X and Y were previously never tested for correlation, then the discovery of the rule X → Y suggests that X and Y are candidate pairs to be validated statistically (for correlation). As datasets continue to grow in both size and dimensionality, enumerating all possible combinations of X and Y and then checking them for correlation is not computationally feasible. Thus data mining can be seen as a mechanism that offers candidate theories, which are then required to be validated by statistical approaches. In the economics literature this approach is sometimes, perhaps derisively, called "measurement without theory" [2]. The argument is that without an underlying theory (called the maintained hypothesis), measurements may lead to results that are patently spurious. A well-known example is the observed positive correlation between the number of stork nests and human babies born during spring time in Scandinavian countries [6].



[Figure 1: two flowcharts, each running Observations → Hypotheses Formulation → Mathematical Modelling → Test against data → Accept/Reject; panel (a) shows the traditional theory-driven approach and panel (b) shows the same cycle augmented with data mining.]

Fig. 1 Comparison of the theory-driven and augmented approaches. Panel (a) is the classical method of scientific inquiry [2].

However, to conclude and explain that the relationship is spurious (and not causal) requires knowledge that cannot be inferred from the data alone. This is sometimes also called the "data-mining trap".1

1 Patterns discovered in the data that have no predictive value.

However, it is time to revisit this debate given the tremendous computational advances that have been made in recent years. Research in data-mining algorithms [1,5] has now made it possible to efficiently sift through large quantities of data and generate candidate patterns which are "interesting", often surprising, and in many instances more credible than the storks/babies example would appear to suggest. In this paper, we present two examples that illustrate the steps shown in Fig. 1.

One of the challenges of data-mining techniques is that they often generate a large number of "patterns" or hypotheses, making it extremely hard for a researcher to decide which hypotheses are credible and worth evaluating using statistical techniques. What we are proposing is a method [association rules network (ARN)] to overcome this challenge for association rules, a well-known data-mining technique [1,3–5].

The core idea in ARN is that rules discovered by the association rule-mining algorithm can be synthesized, pruned, and integrated in the context of specific objectives. In particular, if there is a variable of interest (the "goal"), then we can form a network consisting of the relevant and related variables and use the network to inform a "model" that can be tested using statistical methods. For example, suppose an association rule task outputs the rules A → B, B → C, and D → C, and that the variable of interest is C. The output of the association rule task immediately suggests the regression model C = α1D + α2B + ε. Furthermore, we can also test for the indirect dependence of C on A, and even the joint dependence of C on B and D. In this way we can couple a data-mining task with statistical analysis (a minimal regression sketch illustrating this coupling appears after the feature list below). In practice, ARM involves hundreds, if not thousands, of variables. The pruning strategy that accompanies Association Rules Networks (ARNs) can be used to remove local inconsistencies between variables (like cycles), in order to suggest consistent statistical models. To summarize, ARNs offer the following features:

(1) Pruning in context: We use the ARN to prune rules in the context of a specific objective. Changing the objective will result in the pruning of different rules. This dynamic pruning strategy for association rules has, to the best of our knowledge, never been proposed before.

(2) Network structure: The ARN provides a mechanism for determining the network relationship between the relevant variables and the goal. This can help us elicit direct, indirect, and joint effects from the network.

(3) Hypotheses generation for evaluation: ARNs can serve as a bridge between the outputs generated by ARM and their statistical evaluation.
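The sketch below (hypothetical column names and synthetic data, not the authors' code) illustrates the coupling mentioned above for the toy rule set A → B, B → C, D → C: the ARN-suggested regressors B and D are fitted against the goal variable C, and the resulting coefficients and p-values provide the statistical validation step.

```python
# A minimal sketch, assuming a relational dataset with columns A, B, C, D
# (synthetic here) and the ARN-suggested model C = a1*D + a2*B + error.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"A": rng.normal(size=n), "D": rng.normal(size=n)})
df["B"] = 0.8 * df["A"] + rng.normal(scale=0.5, size=n)                   # A -> B
df["C"] = 0.6 * df["B"] + 0.4 * df["D"] + rng.normal(scale=0.5, size=n)   # B -> C, D -> C

X = sm.add_constant(df[["B", "D"]])    # regressors suggested by the ARN
fit = sm.OLS(df["C"], X).fit()
print(fit.params)                       # estimated alpha_1, alpha_2
print(fit.pvalues)                      # statistical test of each suggested dependence
```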


The rest of the paper is organized as follows. In Section 2, we provide an overview of ARM and related research. Details of ARN, such as its definition, construction algorithm, and properties, are presented in Section 3. In Section 4, we describe two comprehensive case studies that demonstrate the process of constructing and using ARNs. We also provide a validation of our results using classical statistical methods. We conclude with Section 5, comprising a summary and directions for future work. Preliminary versions of the paper highlighting different aspects of ARN have appeared in [7,8].

2. ASSOCIATION RULE MINING AND RELATED RESEARCH

Association rule mining (ARM) is considered a cornerstone of data-mining research [3,5]. The objective of this technique is to find dependencies between different variables (called items) in a large database. Unlike a standard statistical approach that tests for correlation between variables, ARM casts the problem as a search for dependencies (called association rules). In order to scale the method to very large databases with many variables, two measures (called support and confidence) are used to guide the search process.

2.1. Association Rules

Association rules are generally described in the framework of market basket analysis. Given a set of items I and a set of transactions T consisting of subsets of I, an association rule is a relationship of the form A → B (with support s and confidence c), where A and B are subsets of I. A is called the antecedent and B the consequent of the rule. The support σ(A) of a subset A of I is defined as the percentage of transactions which contain A, and the confidence of a rule A → B is σ(A ∪ B)/σ(A). Most algorithms for association rule discovery take advantage of the antimonotonicity property exhibited by the support level: if A ⊂ B then σ(A) ≥ σ(B).

In the language of ARM, the data is a set of transactions and each transaction is a subset of the item space. Our focus is on applying ARNs to relational data, where each instance is a tuple consisting of a set of attribute–value pairs. More specifically, suppose we are given a relation R(A1, A2, . . . , An) where the domain of Ai, dom(Ai) = {a1, . . . , an}, is discrete-valued. Then an item is an attribute–value pair {Ai = a}. The ARN will be constructed using rules of the form {Am1 = am1, . . . , Amk = amk} → {Aj = aj}, where j ∉ {m1, . . . , mk}.
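As a concrete illustration, the snippet below (a minimal sketch over a toy transaction database, not tied to the datasets used later) computes support and confidence exactly as defined above, with items encoded as attribute=value strings.

```python
# Toy transaction database; each transaction is a set of attribute=value items.
transactions = [
    {"sex=female", "age<75", "immigrant=no"},
    {"sex=female", "immigrant=no", "income=below50k"},
    {"sex=male", "immigrant=no", "income=below50k"},
    {"sex=female", "age<75", "urban=no", "immigrant=no", "income=below50k"},
]

def support(itemset, transactions):
    """sigma(A): fraction of transactions that contain every item of A."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A -> B) = sigma(A u B) / sigma(A)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

print(support({"immigrant=no"}, transactions))                          # 1.0
print(confidence({"immigrant=no"}, {"income=below50k"}, transactions))  # 0.75
```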

2.2. Association-Rules-Related Research

Early research in ARM focused on algorithms for rule discovery [3,4] and subsequently on pruning [9,10]. This work is motivated by the fact that the standard measures of support and confidence (described above) generate many redundant rules, some of which may be spurious. Our study is also motivated by the problem of organizing this large number of rules so as to study a specific goal. Thus, to provide a context for the utility of ARNs, we review the literature on pruning a large set of association rules below.

In general, there are three approaches for pruning redundant rules. The first approach involves the extraction of more specific types of patterns from transaction data. Such patterns include closed itemsets [11–13], which are itemsets having no proper superset with the same support, hypercliques [14], which eliminate cross-support patterns using the h-confidence measure, and correlated patterns [15], which are sets of correlated items derived on the basis of their mutual information. Generally, these patterns are a subset of the entire set of frequent itemsets derived from a dataset.

The second approach is based on the satisfaction of an additional interestingness measure beyond support and confidence. For example, Piatetsky-Shapiro [16] proposed that a rule A → B is interesting if P(A, B)/(P(A)P(B)) > 1. Several other statistical measures of interestingness have been proposed, and Tan et al. [17] and Geng et al. [18] provide excellent surveys of the properties of the different measures. Recently, some approaches have also been proposed to determine the interestingness of a set of frequent itemsets using background knowledge in the form of prior beliefs of the user [19] or Bayesian networks representing the causal structure in the input dataset [20]. Usually, these approaches are quite successful in pruning the set of frequent itemsets and the rules derived from them. Interestingly, ARNs can still be used in combination with these patterns and interestingness measures, since the ARN construction algorithm presumes only a set of rules as input and is independent of how these rules are derived. Furthermore, these patterns and measures do not perform goal-driven pruning, which is the main motivation behind the construction of an ARN.

The third approach, which is most relevant to our work, involves the summarization of the set of itemsets and/or association rules derived from a dataset. Some techniques have been proposed for summarizing the set of frequent itemsets using probabilistic methods [21,22]. Some techniques have also been proposed for clustering frequent itemsets [23,24] and association rules [25] using suitable distance measures. In [23], Han et al. proposed a method of summarizing association rules and clustering the rule items using an undirected hypergraph.


Each frequent itemset was represented as a hyperedge, and the set of all frequent itemsets as a hypergraph. The resulting hypergraph was partitioned into clusters using a min-cut algorithm specifically designed for hypergraphs. Items which belonged to the same cluster were considered to be "more related" than items belonging to different clusters.

Finally, Berrado and Runger [26] have recently proposed a mechanism for generating meta-rules from a collection of discovered association rules. Here, meta-rules are generated by reapplying the ARM algorithm on the discovered rules; more precisely, on rules with the same consequent. These techniques are generally able to extract useful summaries from the large set of rules or itemsets derived from a dataset, but they usually involve some loss of information, since they are not able to retain the complete information in the original rule set. Furthermore, in contrast to ARNs, the summaries, clusters, or meta-rules discovered are global and do not take into consideration the mining task, i.e. the goal being investigated.

In summary, although a substantial amount of work has been done on pruning the usually large set of frequent patterns or association rules derived from a dataset, ARNs represent a new dimension in this direction, since the pruning is done in the context of a goal node. This is relevant for several applications where a specific goal is being studied, such as in the case studies presented below.

3. ASSOCIATION RULES NETWORK (ARN)

Although an algorithm like Apriori will discover all association rules which satisfy the minimum support and minimum confidence thresholds, there is no guarantee that the rules discovered are meaningful or not spurious. In fact, one of the major weaknesses of this approach is precisely that several rules, many of them redundant and obvious, are produced by the algorithm. This can often hamper the knowledge discovery process.

However, in practice, the data-mining process is rarely carried out in isolation, without any reference to specific user goals or focus on target items of interest. This calls for an interactive strategy by which pruning of association rules is carried out in the context of precise goals, as the following example will make clear. We begin with the following rule discovered from the Census Data of Elderly People [10], which captures information about the demographic and income distribution of elderly people in the United States of America. Our goal is to understand the itemsets that are frequently associated with the target item, "income = below 50K".2

2 All rules have predefined minimum support and confidence, but the exact values are not relevant to the discussion, as what we are trying to highlight can occur at any support level.

r1 : immigrant = no → income = below50k.

By itself, this rule appears to contradict our general perception regarding the comparative incomes of immigrants and nonimmigrants, at least in the USA. However, combining this pattern with another rule,

r2 : sex = female ∧ age < 75 → immigrant = no,

provides a context which helps in interpreting the first rule. This forms a network which flows into the goal (income = below50k) and provides a better explanation of the goal as opposed to when the rule r1 is viewed in isolation. Now consider a third rule

r3 : immigrant = no → sex = female.

This rule appears to be flowing "in the opposite direction", i.e. it is redundant and does not help explain the goal. Graphically, this rule participates in a hypercycle with rule r2. Finally, consider two more rules

r4 : urban = no → income = below50k,

r5 : urban = no → sex = female.

The rule r5, in conjunction with rules r2 and r1, also appears to lead indirectly (via a redundant path) to the goal. By ignoring rule r5, the goal item remains reachable from the item urban = no. Figure 2 captures the preceding discussion. The directed hypergraph after the removal of hyperedges r3 and r5, which are redundant in the presence of the other rules, as shown in Fig. 2, is called an ARN. We introduced ARNs in the context of a specific application in [7], and provided a brief introduction in [8]. Here we develop this concept further in its full generality.

Our approach is based on formalizing the intuition behind these observations. It consists of mapping a set of association rules into a directed hypergraph,3 and systematically removing circular paths and redundant and backward hyperedges that may obscure the relationship between the target and other frequent items. This offers the following advantages:

(1) The pruning process is adaptive with respect to the choice of the goal node made by the user. Thus pruning is carried out with respect to the target objective. Different objectives will result in different rules being pruned.

3 Background material on directed hypergraphs [27] can befound in Section 3.2.


[Figure 2: a B-graph over the items age < 75, sex = female, immigrant = no, urban = no, and income = below 50K, with hyperedges labeled r1–r5.]

Fig. 2 A backward-directed hypergraph (B-graph) representing the rules r1–r5. After the removal of rules r3 and r5, this graph is called an ARN.

(2) Pruning is reduced to hypercycle and reverse hyperedge elimination in the hypergraph. This step can be readily automated (see the ARN construction algorithm).

(3) The resulting hypergraph can be transformed into a reasoning network where all edges move "forward" toward the goal node. A path from a node in the network to the goal node is a series of rules, and the confidence of each rule in the path is the likelihood of the consequent given the antecedent.

These advantages and the associated concepts will be explained and illustrated in the following subsections.

3.1. Proposed Approach for ARN Construction

We briefly describe our method for structuring association rules as a backward-directed hypergraph (henceforth referred to as a B-graph), pruning it to generate the ARN, and transforming it for reasoning. The method consists of four steps that are expanded in subsequent subsections.

Step A. Given a database D and the minimum support and confidence, we first extract all association rules using a standard algorithm like Apriori [3].

Step B. Choose a frequent item g which appears as a singleton consequent in the rule set and represents the goal node, and build a leveled B-graph which recursively flows into the goal g.

Step C. Prune hypercycles and reverse hyperedges from the B-graph generated in Step B. The resultant B-graph is called an ARN. We will give a formal justification for this step in the forthcoming subsections.

Step D. Find the shortest paths between the goal node and the nodes at the maximal level (under a variant of edge distance) in the ARN. The set of these paths represents the explanatory network for the goal node.

We next provide some basic background on directed hypergraphs that will aid us in defining an ARN and illustrating its various properties.

3.2. Directed Hypergraphs

A hypergraph is a pair H = (N, E), where N is a set of nodes {n1, n2, . . . , nk} and E ⊆ 2^N is the set of hyperedges. Thus each hyperedge e can potentially span more than two nodes. Contrast this with a directed graph, where the edge set satisfies E ⊆ N × N.

In a directed hypergraph the nodes spanned by a hyperedge e are partitioned into two parts, the head and the tail, denoted by H(e) and T(e), respectively. A hyperedge e is called backward if |H(e)| = 1. Similarly, an edge is called forward if |T(e)| = 1. A directed hypergraph is called a B-directed hypergraph if all its hyperedges are backward. In the rest of the paper, we will refer to backward-directed hypergraphs as B-graphs. Thus, a set of association rules whose consequents are singletons maps neatly into a B-graph: each rule r is represented by a hyperedge e, the antecedent of r by T(e), and the consequent by H(e).

We will also consider the antecedent of a rule as a single entity. For that we define the notion of a hypernode. Given a B-graph B with hyperedges {e1, . . . , em}, the hypernodes induced by the hyperedge ei are the tail T(ei) and the head H(ei), each considered as a single entity. The set of all hypernodes is denoted by V.
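A minimal sketch of this representation is given below (hypothetical code, not the authors' implementation): each backward hyperedge stores its tail, its single-item head, and the confidence of the corresponding rule. The example rules are r1, r2, and r4 from the census example above, with illustrative confidences.

```python
# A minimal sketch of a B-graph built from rules with singleton consequents.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Hyperedge:
    tail: frozenset        # T(e): the antecedent items
    head: str              # H(e): the singleton consequent
    confidence: float      # weight of the hyperedge

@dataclass
class BGraph:
    edges: list = field(default_factory=list)

    def add_rule(self, antecedent, consequent, confidence):
        self.edges.append(Hyperedge(frozenset(antecedent), consequent, confidence))

    def edges_into(self, node):
        """All backward hyperedges whose head is `node`."""
        return [e for e in self.edges if e.head == node]

# The census rules of Section 3 (confidences are hypothetical).
g = BGraph()
g.add_rule({"immigrant=no"}, "income=below50k", 0.60)           # r1
g.add_rule({"sex=female", "age<75"}, "immigrant=no", 0.70)      # r2
g.add_rule({"urban=no"}, "income=below50k", 0.55)               # r4
print(len(g.edges_into("income=below50k")))                     # 2
```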

[Figure 3: an example ARN over the nodes A–G, with hyperedges e1 (0.6), e2 (0.7), e3 (0.9), e4 (0.8), e5 (0.65), and e6 (0.75); the numbers in parentheses are confidences.]

Fig. 3 An example of an ARN. e5 is not a part of the ARN because it is a reverse hyperedge and also participates in a hypercycle.

EXAMPLE 1: As can be seen in Fig. 3, N = {A, B, C, D, E, F, G}, E = {e1, . . . , e6}, and V = {{A}, {B}, {C}, {D}, {C, D}, {E}, {E, F}, {G}}. Thus F is a node but not a hypernode.

We now define a hyperpath and a hypercycle for a B-graph. A hyperpath is defined as a sequence P = {v1, e1, v2, e2, . . . , en−1, vn}, where each vi is a hypernode and each ei is a hyperedge. Furthermore, for 1 ≤ i ≤ n − 2, vi = T(ei) and H(ei) ∈ vi+1; finally, vn−1 = T(en−1) and vn = H(en−1).

EXAMPLE 2: Again, as can be seen in Fig. 3, P = {{A}, e3, {C, D}, e4, {E, F}, e6, {G}} is a hyperpath.

Finally, we define the size of a hyperpath, |P|, as the total number of hypernodes appearing on P. Continuing with the previous example, |P| = 4.

In the context of association rules, each node corresponds to a frequent item and frequent itemsets are mapped to undirected hyperedges [23,28,29]. There has been some theoretical work relating hypergraphs with association rules [28,29]; the relationship between frequent itemset discovery and the undirected hypergraph transversal problem has been noted in order to establish the computational complexity of frequent itemset mining. Directed hypergraphs [27,30] extend directed graphs and have been used to model many-to-one, one-to-many, and many-to-many relationships in theoretical computer science and operations research. Directed hypergraphs have also appeared under different names, including "labeled graphs" and "And-Or" graphs. A set of association rules can be represented as a directed hypergraph. In [31], a set of rules has been modeled as directed hypergraphs in order to detect structural errors like circularity, unreachable goals, dead ends, redundancy, and contradiction.

3.3. Definition of ARN and Related Concepts

In this section, we will formally define an ARN, present an algorithm to generate it from a set of association rules, and prove some important properties of the ARN.

Here, we define an ARN in terms of B-graphs.

DEFINITION 1: Given a set of association rules R and a frequent goal item z that appears as a singleton consequent of some rule r ∈ R, an ARN(R, z) is a weighted B-graph such that

(1) There is a hyperedge which corresponds to a rule r0 whose consequent is the singleton item z.

(2) Each hyperedge in ARN(R, z) corresponds to a rule in R whose consequent is a singleton. The weight on the hyperedge is the confidence of the rule.

(3) The node representing z is reachable from any other node in ARN(R, z).

(4) Any node p ≠ z in the ARN is not reachable from z.

This defines the basic structure of an ARN. Note that, given a database and a set of rules R, the ARN for a given goal node z is unique, as shown in Theorem 2 in Section 3.4.1.

However, it is important to note that it is possible to have association rules of the form {A, C} → B and B → A in the rule set from which an ARN is constructed according to the above definition. It is easy to see that the use of these two rules in the construction introduces a cyclic structure in the ARN. This cyclic structure is known as a hypercycle, and is defined more formally in Definition 2. Note that, in general, a hypercycle may consist of an arbitrary number of hyperedges, and can be substantially more complex than the two-hyperedge (rule) example illustrated above.

DEFINITION 2: A hyperpath HC = {v1, e1, v2, e2, . . . , en−1, vn} is called a hypercycle if vn ⊆ v1.

EXAMPLE 3: The hyperpath {B, e1, {C, D}, e4, {E, F}, e5, B} is a hypercycle in the ARN shown in Fig. 3.

The occurrence of such hypercycles in a reasoning network such as an ARN is undesirable, since it prevents the reasoning algorithm from reaching the goal node from any of the hypernodes involved in the hypercycle. Hence, we remove hypercycles from an ARN using the concept of level defined below.

DEFINITION 3:

(1) The level of the goal node is zero.

(2) The level of a nongoal node v is defined as l(v) = min{l(u) + 1 | ∃ e such that v ∈ T(e) and u = H(e)}, where H(e) and T(e) denote the head and tail of the hyperedge e, respectively.

(3) The level of a hypernode c is defined as l(c) = min{l(s) | s ∈ c}.

EXAMPLE 4: For the ARN in Fig. 3, l(G) = 0, l(F) = l(E) = 1, l(C) = l(D) = 2, and l(B) = l(A) = 3. For hypernodes, l({C, D}) = 2 and l({E, F}) = 1.
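The level function of Definition 3 can be computed by a breadth-first expansion from the goal node, as in the sketch below (hypothetical code; the rule set is an illustrative one chosen to reproduce the levels of Example 4).

```python
# Rules are (antecedent set, consequent) pairs with singleton consequents.
from collections import deque

def node_levels(rules, goal):
    """Definition 3: l(goal) = 0; l(v) = min over edges e with v in T(e) of l(H(e)) + 1."""
    level = {goal: 0}
    queue = deque([goal])
    while queue:
        u = queue.popleft()
        for tail, head in rules:
            if head == u:                 # a hyperedge flowing into u
                for v in tail:
                    if v not in level:    # first assignment is the minimum (BFS order)
                        level[v] = level[u] + 1
                        queue.append(v)
    return level

def hypernode_level(hypernode, level):
    """l(c) = min over the items s of the hypernode c."""
    return min(level[s] for s in hypernode)

rules = [({"B"}, "C"), ({"B"}, "D"), ({"A"}, "C"), ({"A"}, "D"),
         ({"C", "D"}, "E"), ({"C", "D"}, "F"), ({"E", "F"}, "G")]
levels = node_levels(rules, "G")
print(levels)                                   # G:0, E:1, F:1, C:2, D:2, A:3, B:3
print(hypernode_level({"C", "D"}, levels))      # 2
```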

The justification for the removal of hypercycles is discussed further in detail in Section 3.5. This step is also supported by similar requirements of other graphical modeling frameworks, such as Bayesian networks [32], which do not allow cycles in the underlying graph.

Definition 1 of ARNs also does not prevent another reasoning anomaly, introduced by the bidirectional nature of association rules. Consider the hyperedge e2 in the ARN shown in Fig. 4 (Section 3.5), for which the goal node is A.


[Figure 4: a B-graph over the nodes A–F, with hyperedges e1–e7 and goal node A.]

Fig. 4 An example which distinguishes reverse hyperedges (e2) and hypercycles (e7).

Although this hyperedge is not part of a hypercycle, its direction is against the flow of the network. Moreover, its presence is redundant given the hyperedge e1, since the latter constitutes a shorter hyperpath from hypernode B to the goal node than the hyperpath {B, e2, C, e3, D, e4, A}, of which e2 is a part. We term such hyperedges reverse hyperedges, and they are defined below using the level of a hypernode.

DEFINITION 4: A hyperedge e in an ARN is called a reverse hyperedge if l(T(e)) < l(H(e)).

EXAMPLE 5: e5 and e2 are reverse hyperedges in the ARNs shown in Figs. 3 and 4, respectively.

Interestingly, the concept of level can also be used to remove both types of redundancy (hypercycles and reverse hyperedges) in an ARN. Specifically, we introduce the following two conditions into the ARN construction algorithm, which prevent hypercycles and reverse hyperedges from appearing between hypernodes at different levels while preserving reachability to the goal node in an ARN.

CONDITION 1 A node which has served as a consequent during ARN generation cannot be an antecedent, across levels, for a rule whose consequent is at a higher level.

CONDITION 2 The goal attribute can appear with only one value in the ARN, namely the one in the goal node.

We prove in Theorem 4 in Section 3.4.1 that the incorporation of these conditions into the construction algorithm ensures the removal of hypercycles from an ARN, and Section 3.5 provides a detailed mathematical justification for the removal of both hypercycles and reverse hyperedges. We next discuss the algorithm used to construct an ARN from a given set of association rules and a goal node.

3.4. ARN Construction Algorithm

We now describe a breadth-first-like algorithm to construct an ARN from a rule set R and a goal node c.

Algorithm 1 (shown below) takes as input the rule set R and a goal node c which appears as a singleton consequent in R. Rules.getRules(R, u, s) is a function that returns all rules in R whose consequent is u and whose antecedent does not contain s. For each of these rules, Rules.getAntecedents(u) returns the set of all antecedents. The level of each of these antecedent elements is determined on the basis of Condition 1 and Definition 3. For all the rules satisfying Condition 1, a hyperedge is added to the ARN. Finally, hypercycles between hypernodes which are on the same level are removed using Rules.removeLevelCycles(). The algorithm returns the generated ARN.

Algorithm 1: Constructing an ARN from a rule set R flowing into consequent c using a breadth-first strategy. [Pseudocode not reproduced here.]
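The sketch below gives a simplified, hypothetical rendering of this construction (it is not the published Algorithm 1): rules are (antecedent, consequent, confidence) triples with singleton consequents, Condition 2 is enforced by excluding antecedents that mention the goal attribute or the goal itself, Condition 1 by rejecting hyperedges whose tail contains an item already placed at a strictly lower level, and the same-level hypercycle removal performed by Rules.removeLevelCycles() is omitted for brevity.

```python
from collections import deque

def build_arn(rules, goal):
    """Breadth-first construction of an ARN; items are 'attribute=value' strings."""
    goal_attr = goal.split("=")[0]
    level = {goal: 0}
    arn_edges = []
    queue = deque([goal])
    while queue:
        u = queue.popleft()
        for tail, head, conf in rules:
            if head != u:
                continue
            # Condition 2: the goal item, or another value of the goal attribute,
            # may not appear in an antecedent.
            if goal in tail or any(item.split("=")[0] == goal_attr for item in tail):
                continue
            # Condition 1: reject tails containing an item already placed strictly
            # closer to the goal (reverse hyperedges / hypercycles across levels).
            if any(level.get(item, level[u] + 1) < level[u] for item in tail):
                continue
            arn_edges.append((tail, head, conf))
            for item in tail:
                if item not in level:
                    level[item] = level[u] + 1
                    queue.append(item)
    return arn_edges, level

# The census rules r1-r5 of Section 3 (hypothetical confidences):
rules = [
    (frozenset({"immigrant=no"}), "income=below50k", 0.60),        # r1
    (frozenset({"sex=female", "age<75"}), "immigrant=no", 0.70),   # r2
    (frozenset({"immigrant=no"}), "sex=female", 0.65),             # r3 (pruned)
    (frozenset({"urban=no"}), "income=below50k", 0.55),            # r4
    (frozenset({"urban=no"}), "sex=female", 0.50),                 # r5 (pruned)
]
edges, levels = build_arn(rules, "income=below50k")
print([head for _, head, _ in edges])   # hyperedges of r1, r4, r2 survive, as in Fig. 2
```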

3.4.1. Results about ARN construction algorithm

In this section we prove several properties of ARN generation. In Theorem 1, we establish the time complexity of constructing an ARN. Theorem 2 shows that, for a given set of association rules and goal node, the ARN is uniquely determined. Theorem 3 proves that the goal node is reachable from all other nodes in the ARN. Finally, Theorem 4 shows that the ARN does not contain any hypercycles.

THEOREM 1: The time complexity of ARN generation is O(nk), where n is the number of frequent items and k is the number of rules whose consequents are singletons.

PROOF: For each node in the ARN, the hyperedges flowing into it are found by searching the set of all rules whose consequents are singletons. Furthermore, the number of nodes appearing in the ARN is bounded by the number of frequent items. Hence the complexity is O(nk). Also, k is bounded by Σ_{i=2}^{l} i · ni, where l is the length of the longest frequent itemset and ni is the number of frequent itemsets of size i. □

THEOREM 2: The ARN generated for a given goal node and a given rule set is unique, i.e. it does not depend upon the sequence in which the nodes are explored.

PROOF: Let u and v be two nodes at the same level. We want to show that, by exploring u and v in different orders, we do not lose a hyperedge whose consequent is one of them and whose antecedent contains the other; this is the only choice made in the algorithm that could result in different ARNs. WLOG,4 assume u is explored before v. Let {e1, . . . , euv} be the set of hyperedges for which u is the consequent and v is one of the antecedents. Then, by the definition of level, level[T(ei)] = level(v) for all 1 ≤ i ≤ uv. Note that, by Condition 1, there cannot be any hyperedge whose consequent is u and any of whose antecedents has a level less than that of u (which is the same as the level of v).

Now, once we explore v, all the hyperedges whose consequent is v and whose antecedents contain u are free to flow into v because of Condition 1. Thus we have not lost any hyperedges between u and v, since they are at the same level. A similar argument holds when v is explored before u. Thus the set of hyperedges and hypernodes remains the same in both cases. Hence, the same ARN will be generated for a given goal node, irrespective of the order in which the nodes at the same level are explored. □

4 WLOG = without loss of generality.

THEOREM 3: The goal node is reachable from any node in the ARN.

PROOF: By induction on the level. By definition, the level of the goal node is zero and it is trivially reachable from itself. Assume that the goal node is reachable from any node at level i. For each node u at level i + 1, there exists at least one hyperedge e such that l(H(e)) = i and u ∈ T(e). Thus the result holds for level i + 1 and hence for all levels l ≤ max{l(u) | u ∈ N}. □

THEOREM 4: The ARN generated by the algorithm is free of hypercycles across levels.

PROOF: By contradiction. Let C be a hypercycle across levels, i.e. C = {v0, e1, . . . , en, vn} where vn ⊂ v0. Let vi be the first hypernode in C with l(vi) < l(v0). Then, by the construction of an ARN, l(vn−1) < l(vi); therefore, l(vn−1) < l(v0). Now l(vn) ≥ l(v0) because vn is a singleton and vn ⊂ v0. Therefore, en is a hyperedge from an antecedent at a lower level to a consequent at a higher level. This contradicts Condition 1. □

3.5. Removal of Hypercycles and Reverse Hyperedges, and their Relation to Rule Pruning

In Section 3.3, we illustrated how hypercycles and reverse hyperedges introduce redundancy and cyclicity into an ARN constructed using its basic definition. In this subsection, we provide a mathematical justification for the removal of hypercycles and reverse hyperedges. We will show that in either case the information provided by the pruned hyperedge, or the interestingness of the corresponding rule, is small, given the rest of the ARN.

Consider the example in Fig. 4. Given the goal node A, during ARN construction the two hyperedges e2 and e7 are removed. The removal of hyperedge e7 is more "crucial" than that of e2 because it is part of a hypercycle; otherwise, A will never be reachable from F, violating the definition of the ARN. On the other hand, the removal of the reverse hyperedge e2 is justified because it is part of a redundant path from B to A. Hence, the removal of e2 will not destroy the reachability condition of the ARN. This illustrates the difference between the two kinds of pruning operations. We provide separate formal arguments for each of the two operations below.

3.5.1. Removal of hypercycles

We first note that a hypercycle H in a B-graph cannot be of size less than three. We will provide the desired justification for two cases, one in which the size of H is three and the other in which the size of H is four. From transitivity, it follows that the removal of any hypercycle of size bigger than four can be justified using the arguments presented for the second case. We will now consider the two cases.

CASE 1: Consider two rules of the form A → B and B → A. If A and B are at the same level, then WLOG remove the hyperedge with lower confidence to break the hypercycle. Else, WLOG, assume level(A) > level(B). Since we are exclusively dealing with rules whose antecedents are a conjunction of items and whose consequent is a singleton item, for a vast majority of these rules P(A) < P(B). In such a scenario, conf(A → B) > conf(B → A). This justifies the removal of the hyperedge representing B → A.

CASE 2: Consider three rules of the form A → B, B → C, and C → A. Our argument is based on the concept of information gain. Assume we have an information channel I which is a sequence of transactions. Then, given that the pairs of itemsets (A, B) and (B, C) are frequent, i.e. they appear close to each other with high probability, the information that (A, C) is frequent is not surprising. We formalize this argument as follows.

DEFINITION 5: Let F be the set of all itemsets. Let d : F × F → [0, 1] be defined as

d(A, B) = 1 − P(A, B) / (P(A) + P(B) − P(A, B)),

where A, B ∈ F and P(A) is the support of the itemset A.

THEOREM 5: The function d is a metric on the space F.

PROOF: The function d is identical to the distance measure ρ defined in [25]. The proof follows directly from the proof of Lemma 3.2 in [25]. □
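In code, d is simply one minus the Jaccard coefficient of the itemsets' supports. The sketch below is hypothetical and assumes the denominator is positive, which holds whenever A or B is frequent.

```python
def support(itemset, transactions):
    """sigma(A): fraction of transactions containing every item of A."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def d(A, B, transactions):
    """Definition 5: d(A, B) = 1 - P(A, B) / (P(A) + P(B) - P(A, B))."""
    p_ab = support(set(A) | set(B), transactions)
    return 1.0 - p_ab / (support(A, transactions) + support(B, transactions) - p_ab)
```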

Statistical Analysis and Data Mining DOI:10.1002/sam

Page 10: Association Rules Network: Definition and Applications

Pandey et al.: Association rules network 269

COROLLARY 1: For 0 < δ ≪ 1, the information gain from the observation that the pair (A, C) is close is small, given that d(A, B) < δ and d(B, C) < δ.

PROOF: This follows directly from the triangle inequality of d. □

Now, WLOG, assume that level(A) > level(C). The fact that d(A, C) is small, given that d(A, B) and d(B, C) are small, indicates that the rules A → C and C → A can be derived from the rules A → B and B → C, which are already in the ARN. Thus, C → A can be safely pruned without violating the reachability constraint of the ARN.

3.5.2. Removal of reverse hyperedges

The removal of reverse hyperedges in an ARN can be justified on the basis of the fact that they generate redundant paths from a node to the goal. This can be formally proved as follows.

THEOREM 6: Let a be a node in the ARN from which a reverse hyperedge e originates. Then there exists a path P from a to the goal node such that e ∉ P and the size of P is smaller than the size of the path from a to the goal node in which e participates.

PROOF: Let l(a) = l1 and l(H(e)) = l2. Since e is a reverse hyperedge, l1 < l2. We observe that the path of smallest size from any node at level l in the ARN to the goal node is of size l + 1 (this follows from the definition of level). Let this path of smallest size from a be P1 and the one from H(e) be P2. Clearly, the size of P1 is l1 + 1 and that of the new path {a, e} ∪ P2 is l2 + 2. Thus P1 is the required path.

This theorem justifies the redundancy of the rule represented by a reverse hyperedge and hence its removal. □

3.6. Benefits of ARN

In the introduction we raised the question of the utility of association rules beyond simple exploratory analysis. ARNs are also a tool for exploratory analysis, one that provides a context for understanding and relating the discovered rules with each other. More specifically, ARNs offer the following benefits.

Organization of rules in a context: The primary utility of an ARN is in the organization of a potentially large set of association rules, such that a reasoning goal may be explained using the most relevant rules in the set. This explanation can extend beyond the immediate rules that have the goal as a consequent, and thus provides a framework for organizing these rules in a context. The case studies presented later provide evidence of the utility of ARNs for generating statistically viable hypotheses from a set of association rules derived from raw data.

Local pruning: ARNs provide a graphical method to prune rules by associating redundant rules with hypercycles and reverse hyperedges. Furthermore, the pruning takes place in the context of a goal node. Thus, a rule that is redundant for a particular goal node may become relevant for another goal. This is more flexible than pruning based on statistical measures of interestingness.

Consider the ARNs shown in Figs. 5 and 6. When the goal node is G (Fig. 5), the hyperedge E → B is a redundant rule, which may be eliminated. On the other hand, when the goal node is D (Fig. 6), the same hyperedge E → B is relevant. Thus, the pruning of a rule as per our notion becomes dependent upon the context of the goal node. We refer to this kind of pruning as local pruning.

Reasoning using path traversal: An ARN is a weighted, hypercycle-free B-graph. Hyperpaths that lead to the goal node can be interpreted as providing an explanation for the goal node.

Formally, let Vmax be the set of all maximum-level nodes in the ARN. For each v ∈ Vmax, let Pv be the set of all hyperpaths from v to the goal node g. We can then define two cost measures on each hyperpath p ∈ Pv:

Weight(p) = Σ_{e_i ∈ p} log[conf(e_i)],   (1)

Info(p) = − Σ_{e_i ∈ p} conf(e_i) log[conf(e_i)].   (2)

[Figure 5: an ARN over the nodes A–G with goal node G and hyperedges e1 (0.6), e2 (0.7), e3 (0.9), e4 (0.8), and e6 (0.75); e5 (0.65) is excluded.]

Fig. 5 ARN with goal node G. The edge e5 is not part of the ARN because it participates in a hypercycle.

[Figure 6: an ARN over the nodes A, B, D, and E with goal node D and hyperedges e2 (0.7), e3 (0.9), and e5 (0.65).]

Fig. 6 ARN with goal node D. The edge e5 is now part of the ARN. This illustrates the adaptive nature of local pruning.


In Eqs. (1) and (2), ei is a hyperedge and conf(ei) is the confidence of the rule represented by ei.

The reason for introducing the two cost functions is that they provide different kinds of information depending upon the context. For example, Weight(p) can be interpreted as the strength of the correlation between the source and the goal node. Similarly, Info(p) can be interpreted as the total information gain along the path from the source to the goal node.
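Both measures reduce to simple sums over the edge confidences along a hyperpath, as in the short sketch below (hypothetical code; the example path uses the confidences of e3, e4, and e6 from Fig. 3).

```python
import math

def path_weight(confidences):
    """Eq. (1): Weight(p) = sum of log-confidences along the path."""
    return sum(math.log(c) for c in confidences)

def path_info(confidences):
    """Eq. (2): Info(p) = -sum of c * log(c) over the edges of the path."""
    return -sum(c * math.log(c) for c in confidences)

path = [0.9, 0.8, 0.75]                     # A -> {C, D} -> {E, F} -> G in Fig. 3
print(path_weight(path), path_info(path))
```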

Now, the optimal path in Pv under Weight(p) or Info(p) is likely to be the best explanation for the dependence of the goal node on v. Computing these optimal hyperpaths for all v ∈ Vmax provides a reasoning network for the goal node g. The problem of optimal hyperpaths in B-graphs has been studied in [33], where an algorithm of time complexity O(|H| + n log n) is reported; here |H|, the size of the hypergraph, is Σ_{e_i ∈ E} (|T(e_i)| + 1).

4. CASE STUDIES

To recall, our overall objective in this paper is to provide a systematic methodology for combining data-mining techniques with traditional statistical methods to speed up the process of "theory building" and validation. The role of data mining in this methodology is to search the "space of theories" and offer credible candidates, and the role of statistical methods is to verify the theories offered by data mining. To demonstrate the utility of this methodology, we provide two extensive case studies using the techniques described in the previous section. The first case study is on Open Source Software (OSS) development data and the other is on the Business Longitudinal Survey (BLS) data provided by the Australian Bureau of Statistics (ABS).

4.1. Case Study 1—OSS development

OSS represents communities of software developers who use the Internet infrastructure to coordinate the work of building robust and scalable software. The objective of this case study is to determine the factors which contribute to the success (measured in number of downloads) of OSS projects. Since research in OSS is relatively new, it is not clear what factors contribute to the success of such projects. This makes it an ideal case study in our situation, as data mining may offer new insights (in terms of attribute–value pairs) which may explain the success or failure of OSS projects. As we will show, we can use ARNs to organize the output of data mining and then hypothesize a model for OSS which can be tested by standard statistical methods. Extensive data on thousands of such OSS projects is available from the SourceForge website. In this subsection we describe some of the main results.5

Table 1. Data description.

Total number of records: 24 002
Total number of variables: 46
Number of categorical variables: 15
Number of numeric variables: 31

However, we briefly digress to provide a summary of the data extraction, data cleaning, and data engineering phases that were carried out [34].

Data extraction: We downloaded the OSS data from the SourceForge website (www.sourceforge.net) in 2003. At the time of the study, the SourceForge website had 40 002 projects hosted (including SourceForge) and 424 862 registered users. Only projects which had a relatively homogeneous profile were investigated. The dataset was obtained by a program that crawls through the project summary page for each of these projects and stores the necessary information in a text file. The dataset holds features of each project and their activity information. Table 1 describes the facts about the dataset.

Data cleaning: After the data were extracted, we noticed that many entries related to several attributes were mislabeled. For example, the attribute audience, which captured the end-user for the application, was populated by entries for the environment variable. Similarly, the time attribute was either missing or lacked a consistent format. We performed appropriate cleaning operations to reduce such errors.

Data engineering: An important transformation we carried out was to convert the numeric attributes to ordinal ones using equal-sized binning. The number of bins used was up to four, representing missing, low, medium, and high values. This was necessary since most off-the-shelf association rule programs accept only categorical data. However, as we will show, this did not result in a substantial loss of information, as similar results were obtained using factor analysis, which operated on the original numerical values.
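A hypothetical sketch of this discretization step is shown below; it uses equal-frequency bins via pandas (one possible reading of "equal-sized binning"; equal-width bins via pd.cut would be the other), with missing values kept as a separate category.

```python
import numpy as np
import pandas as pd

def discretize(series):
    """Bin a numeric attribute into low/medium/high, keeping missing values separate."""
    binned = pd.qcut(series, q=3, labels=["low", "medium", "high"])
    return binned.cat.add_categories(["missing"]).fillna("missing")

# Illustrative values for a numeric project attribute such as downloads.
downloads = pd.Series([120, np.nan, 45, 3000, 7, 980, 15, np.nan, 210])
print(discretize(downloads))
```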

After carrying out a detailed preliminary analysis of the data, we decided to focus on 12 attributes which together indicate project activity along different dimensions. The attributes were discretized to discover association rules, but the original numerical values were retained for factor analysis. These attributes are listed in Table 2.

4.1.1. Constructing ARN

We constructed the ARN shown in Fig. 7 using Algorithm 1 described earlier: given a set of rules R and an item (attribute–value pair) z, we find all rules in R in which z appears as a consequent. We then recursively extract more rules, where the antecedents from the previous step now play the role of consequents.

5 Detailed results of this case study can be found in [7].


Table 2. The attributes used to construct the ARN.

Administrators: number of administrators
Developers: number of developers
CVS commits: number of CVS commits
Bugs found: number of bugs found
Bugs fixed: percentage of bugs fixed
Support requests: number of support requests
Support requests completed: percentage of support requests completed
Patches started: number of patches started
Patches completed: number of patches completed
Public forums: number of public forums
Mailing lists: number of mailing lists
Forum messages: number of forum messages

A support level of 6% and a confidence level of 50% were used to construct the ARN.

Analysis: Figure 7 shows the ARN extracted from the OSS data for the goal of high downloads. The ARN clearly shows that association rules naturally discover attributes that work in concert. For example, the three variables capturing high activity in public forums, forum messages, and mailing lists encapsulate communication activity related to the project and have a first-order effect on the success of the software project. This validates the importance of effective communication, which has often been cited as a predictor of success in software engineering projects [35].

4.1.2. Clustering and factor analysis

An ARN provides information about the hierarchical relationships between the variables (attributes) of the system under investigation. A natural follow-up is to cluster the items that are elements of the ARN. Hypergraph clustering based on min-cut hypergraph partitioning is a well-researched topic and has been used, among other things, for VLSI design [36]. Han et al. [23] have used it for itemset clustering.

One issue that arises when the clustering problem is being set up is deciding which similarity measure to use. For example, in [23] the weighted confidence rule is used: if a hyperedge spans the items {A, B, C}, then the weight of the hyperedge is the average of the confidences of the rules {A, B} → {C}, {A, C} → {B}, and {B, C} → {A}. Another similarity measure that can be used is related to the lift of an itemset [35]. Again consider three items {A, B, C}. Then the lift of {A, B, C} is σ(A, B, C)/(σ(A)σ(B)σ(C)). We have used weighted confidence in our work because it is more directly related to rules rather than itemsets.

[Figure 7: the ARN for the OSS data. The goal node is Downloads = High; the remaining nodes are project-activity attributes at level High (patches completed, bug activity, developers, administrators, CVS commits, patches started, public forums, bugs found, forum messages, mailing lists, support requests, support requests completed), connected by hyperedges with confidences between 55% and 85%.]

Fig. 7 ARN constructed using Algorithm 1. The ARN is a directed hypergraph. The distinguished goal node is Downloads.


[Figure 8: the ARN of Fig. 7 partitioned into four clusters, labeled Communication, Support, Development Activity, and Organization & Commitment.]

Fig. 8 The four clusters derived using min-cut hypergraph partitioning on the ARN. The four clusters have a natural meaning as indicated in the figure.

Another purpose of the case study is to generalize the results generated from the ARN. We partitioned the ARN into four components using min-cut hypergraph partitioning. The results of the clustering are shown in Fig. 8. The items naturally fall into clusters, which we have labeled communication, support, development activity, and organization/commitment. In order to check the validity of our method we used a completely different approach, namely factor analysis, on the original nondiscretized dataset and compared the two approaches.

Analysis: Factor analysis is a statistical technique that attempts to find latent cluster(s) of variables which best describe the data [37]. It is based on principal component analysis (PCA) and operates on the correlation matrix of the variables as opposed to the covariance matrix. The 12 variables in Table 2 are represented by four factors, 1, 2, 3, and 4. Each of these factors captures the essence of the 12 original variables. Table 3 shows how the results of the ARN clustering and the factor analysis correspond. For example, factor 1 captures the effect of the variables Bugs Found, Patches Started, and Patches Completed, which relates to the cluster labeled Development Activity. Notice that only two variables, namely Concurrent Versions System (CVS) Commits and Forum Messages, fall in nonmatching factors and clusters.

There is a remarkable degree of congruence between the clusters arrived at via ARNs and factor analysis, which lends weight to the whole concept of inducing ARNs from data. Also notice that the variables that become part of clusters derived from the ARN are semantically related. For example, the cluster Organization and Commitment captures the variables # of Administrators and # of Developers, and their respective factor loadings within factor 4 are very high (0.756 and 0.800, respectively).

4.2. Case Study 2—Organizational Practices and Firm-level Productivity

This case study focuses on the productivity effects of a large number of organizational practices. We use a large firm-level dataset to analyze the complex relationships between a variety of organizational practices and their complementary impact on firm-level, long-term multifactor productivity (MFP) growth. Similar to the previous study, these complex relationships motivate the use of data mining in general, and ARNs in particular, for addressing this problem.


Table 3. Relationship between the ARN clusters and the factors derived from factor analysis.

Factor 1 (Development activity): #Bugs found (0.791), #Patches started (0.967), #Patches completed (0.968)
Factor 2 (Support): #Support requests (0.686), %Support requests completed (0.521); Communication intensity: #Forum messages (0.711)
Factor 3 (Communication intensity): #Public forums (0.903), #Mailing lists (0.561)
Factor 4 (Organization and commitment): #Administrators (0.756), #Developers (0.800), #Bugs fixed (0.449); Development activity: #CVS commits (0.557)

The values in parentheses are factor loadings; the cluster named before each group of variables is the ARN cluster in which those variables fall.

Table 4. Descriptions of capital investments, labor, and output.

Capital (K): Sum of plant, machinery and equipment, land, dwellings, other buildings and structures, and intangible assets, in constant 1998 dollars. Availability: 3 years. Annual averages: $819K (1995–1996), $587K (1996–1997), $677K (1997–1998).

Labor (L): Total number of full- and part-time employees. Availability: 4 years. Annual averages: 27.36 (1994–1995), 27.69 (1995–1996), 27.67 (1996–1997), 27.95 (1997–1998).

Value-added (Q): Total income minus purchases, in constant 1998 dollars. Availability: 4 years. Annual averages: $3184K (1994–1995), $3475K (1995–1996), $3590K (1996–1997), $3751K (1997–1998).

In this subsection we describe some of the main results of using ARNs for this study.6 The main dataset used in this study is the Australian Bureau of Statistics (ABS) Business Longitudinal Survey (BLS), covering four consecutive years: 1994–1995, 1995–1996, 1996–1997, and 1997–1998. The longitudinal survey was designed to provide information on the growth and performance of Australian businesses, and to identify various economic and structural characteristics of these businesses [39]. In each year, the survey was designed to target variables based on a particular organizational theme: in 1994/1995 the focus was on benchmarking, in 1995/1996 on various organizational processes, in 1996/1997 on production technology and computer use, and in 1997/1998 on training. The BLS contains about 9550 confidentialized respondent records, of which 3864 records participated in all 4 years. The survey contains a total of 787 variables with a wide coverage of organizational characteristics and measures, including various business intentions, capital expenditures, organizational practices, and performance variables. Descriptions of the capital, labor, and output variables are provided in Table 4.

6 Detailed results of this case study can be found in [38].

Data cleaning: A total of 32 organizational variables in the dataset were found useful for our study and were extracted from the dataset for ARN construction. Data cleaning was not a major issue for this dataset because the ABS followed a well-established protocol to ensure that data items collected from survey instruments have high validity and reliability. To ensure that each record has a value for each organizational variable, we selected only firms that participated in all four years. Out of the 3864 panel data records, 565 records were discarded due to incomplete data values. The remaining 3299 records that have values for all 32 organizational variables are used for this analysis.
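A minimal sketch of this complete-case selection, assuming a hypothetical long-format DataFrame bls with a firm identifier, a survey year, and the 32 organizational variables, might look as follows.

    # Minimal sketch: keep only firms present in every survey year and discard
    # records with missing values among the organizational variables. The column
    # names ('firm_id', 'year') and `org_vars` are hypothetical.
    import pandas as pd

    def select_complete_panel(bls: pd.DataFrame, org_vars: list) -> pd.DataFrame:
        n_years = bls["year"].nunique()
        # Firms that responded in all survey years form the panel.
        counts = bls.groupby("firm_id")["year"].nunique()
        panel_firms = counts[counts == n_years].index
        panel = bls[bls["firm_id"].isin(panel_firms)]
        # Drop records with any missing value among the 32 organizational variables.
        return panel.dropna(subset=org_vars)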

Data engineering: We calculate the MFP growth for each firm using an established approach, the Growth Accounting Framework [40]. It was found that the average MFP growth of all 3299 business units is −6.9%.


Of these, 1579 (48%) business units obtained positive MFP growth and 1720 (52%) business units obtained zero or negative MFP growth. The details of the MFP results are provided in Table 5.

Table 5. Firm-level MFP growth results.

Positive MFP growth: 1579 business units (48%), average MFP growth 38.85%
Negative MFP growth: 1720 business units (52%), average MFP growth −48.95%
All business units: 3299 (100%), average MFP growth −6.9%
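The MFP growth figures above come from the Growth Accounting Framework [40]. As a hedged sketch, and under the common textbook form in which MFP growth is value-added growth minus share-weighted growth of capital and labor, the calculation might look as follows; the column names and the fixed capital share alpha are illustrative, and the exact specification used in the study is the one in [40].

    # Hedged sketch of a growth-accounting MFP calculation. `df` is a hypothetical
    # DataFrame with one row per firm holding value-added (Q), capital (K), and
    # labor (L) at the start and end of the period; `alpha` is an assumed capital share.
    import numpy as np
    import pandas as pd

    def mfp_growth(df: pd.DataFrame, alpha: float = 0.3) -> pd.Series:
        dlnQ = np.log(df["Q_end"]) - np.log(df["Q_start"])
        dlnK = np.log(df["K_end"]) - np.log(df["K_start"])
        dlnL = np.log(df["L_end"]) - np.log(df["L_start"])
        # MFP growth = output growth not accounted for by input growth.
        return dlnQ - alpha * dlnK - (1.0 - alpha) * dlnL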

Furthermore, preliminary observation of the organizational data reveals that the variables capturing the organizational practices in the dataset were mostly discrete, except StockAdjust, which was a continuous variable. Two organizational variables were modified to provide a workable solution for the analysis: StockAdjust and RD (refer to Table 6). First, the variable StockAdjust was created to capture the percentage change in each firm's inventory level over a 4-year difference (between 1994/1995 and 1997/1998) instead of a 1-year difference. These values were subsequently discretized into three categories (i.e. increase, decrease, and not applicable (NA)). Second, because firms reported whether they performed research and development (R&D) in each year's BLS survey, the variable RD was created to generalize all four years' values into one variable; it indicates whether a firm performed R&D in any one of the 4 years. All 32 variables listed in Table 6 are either binary coded [Yes/No] or categorical with three values [increase/no increase/NA] or [>25%/≤25% or NA]. The list of organizational practices and their corresponding variables is provided in Table 6.
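A minimal sketch of these two modifications, with hypothetical column names for the inventory levels and the yearly R&D indicators, is given below.

    # Minimal sketch: derive StockAdjust and RD as described above. The input
    # column names ('stock_1994_95', 'rd_1994_95', ...) are hypothetical.
    import numpy as np
    import pandas as pd

    def engineer_variables(panel: pd.DataFrame) -> pd.DataFrame:
        out = panel.copy()
        # StockAdjust: percentage change in inventory over the 4-year window,
        # discretized into increase / decrease / not applicable (NA).
        pct = (out["stock_1997_98"] - out["stock_1994_95"]) / out["stock_1994_95"] * 100
        out["StockAdjust"] = np.select([pct > 0, pct < 0],
                                       ["increase", "decrease"], default="NA")
        # RD: 'Yes' if the firm reported R&D in any of the four survey years.
        rd_cols = ["rd_1994_95", "rd_1995_96", "rd_1996_97", "rd_1997_98"]
        out["RD"] = np.where(out[rd_cols].eq("Yes").any(axis=1), "Yes", "No")
        return out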

4.2.1. Constructing ARN

At this stage, the ARN technique was applied to the 32 organizational variables and the implied MFP derived from the growth accounting framework. In this case, first the goal was fixed (for instance, either positive or negative MFP growth), and all the significant rules that satisfied the minimum thresholds were generated. Then, the antecedents of these rules (namely the first-order variables leading to the goal) were considered as consequents in the following round of association rule generation, and all the rules with these antecedents as consequents were generated. There were two modifications in this analysis as compared with the previous case study. Instead of having common minimum support and minimum confidence thresholds for the ARNs, the two ARNs were allowed to have different minimum support and minimum confidence thresholds. Also, the minimum support and minimum confidence thresholds were allowed to be dynamically adjusted for first- and second-order association rule generation.

Table 6. Organizational practices employed for association analysis (variable name: organizational practice).

CompPrice: Comparing prices
CompCosts: Comparing costs
CompQual: Comparing quality of products or services
CompMkt: Comparing marketing or advertising strategies
CompCServ: Comparing client services
CompProdR: Comparing range of products and services
BPI: Business process improvement
Adjust PL: Significantly increasing production levels
IntroGood: Introduction of new products or services
TQM: Total quality management
QA: Quality assurance
JIT: Just-in-time management
StraPlan: Implementation of strategic plan
BusPlan: Implementation of business plan
BudgetPlan: Implementation of budget plan
IE Report: Income and expenditure report
BusNetwork: Implementation of business network
BusComp: Business comparison
ExpMarketing: Export marketing
BusLink: Implementation of business link
ProdTech: Using production technology
UC: Use of computer
StructuredTraining: Providing structured training courses
JobTraining: Providing on-the-job training
Workshops: Providing seminars, workshops, conferences, etc.
JobRotation: Providing job rotation and exchanges
MgtTraining: Providing management training
ProTraining: Providing professional training
ITTraining: Providing training for computer specialists
SupplierTraining: Providing trade and apprenticeship training and traineeships
StockAdjust: Changes in stock level between 1997–1998 and 1994–1995
RD: Performed research and development in any of the four years


The main advantage of these modifications is that the focus can be placed on the most important set of rules at each level (i.e. the top K rules with the highest possible minimum support and minimum confidence thresholds), rather than on arbitrary minimum support and minimum confidence thresholds common to the generation of both ARNs. Problems can arise with common fixed thresholds because (i) one of the two ARNs might require lower thresholds for any association rule to appear, so setting the thresholds too high would cause the weaker ARN to disappear, and (ii) if an ARN requires lower thresholds to generate higher-order association rules than to generate lower-order association rules, fixed thresholds would limit the chance of discovering the higher-order rules. The challenge is that the minimum support and minimum confidence thresholds need to be set low enough to generate at least one first-order association rule, while setting the global thresholds too low would cause many uninteresting higher-order association rules to appear. Therefore, the minimum support and minimum confidence set for second-order rule generation were allowed to differ (if necessary) from those set for first-order rule generation.

Analysis: For each ARN, we began with very high minimum support and confidence thresholds and then reduced them slowly until the first association rule was found. The first association rule provides the set of most frequent variables leading to MFP growth; these are the first-order variables. To discover the second-order variables leading to the first-order variables, we maintained the same support level but allowed the confidence level to vary: if too many rules were found, we increased the confidence threshold, and otherwise we reduced it. In this way we obtained the most frequent variables leading to the first-order variables. For the ARN with positive MFP growth as the goal, the minimum support was set to 10% and the minimum confidence to 55% for the first-order rules, and the minimum confidence was increased to 99% for the second-order rules with the minimum support maintained at 10%. For the ARN with negative MFP growth as the goal, the minimum support was set to 15% and the minimum confidence to 65% for the first-order rules, and the minimum confidence was increased to 90% for the second-order rules with the minimum support set at 10%. Figure 9 shows the ARN with positive MFP growth.
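The goal-centric, level-by-level rule generation just described can be sketched as follows. The sketch is restricted to single-item antecedents for brevity (the rules in the study are not so restricted) and assumes records is a hypothetical list of item sets such as {"UC=Yes", "BudgetPlan=Yes", "MFP_Growth=Positive", ...}.

    # Hedged sketch of goal-centric, two-level rule generation with per-level
    # thresholds. Rules are restricted to single-item antecedents for brevity.
    from collections import Counter

    def rules_into(records, consequent, min_sup, min_conf):
        """Return (antecedent, support, confidence) for rules antecedent -> consequent."""
        n = len(records)
        counts_all = Counter(item for r in records for item in r)
        counts_with_goal = Counter(item for r in records if consequent in r
                                   for item in r if item != consequent)
        out = []
        for item, both in counts_with_goal.items():
            sup, conf = both / n, both / counts_all[item]
            if sup >= min_sup and conf >= min_conf:
                out.append((item, sup, conf))
        return out

    def build_arn(records, goal, sup1, conf1, sup2, conf2):
        # First-order rules: antecedents leading directly to the goal.
        first = rules_into(records, goal, sup1, conf1)
        arn = {goal: first}
        # Second-order rules: each first-order antecedent becomes a consequent,
        # with (possibly different) dynamically adjusted thresholds.
        for item, _, _ in first:
            arn[item] = [r for r in rules_into(records, item, sup2, conf2)
                         if r[0] != goal]  # avoid trivial edges back from the goal
        return arn

    # e.g. build_arn(records, "MFP_Growth=Positive",
    #                sup1=0.10, conf1=0.55, sup2=0.10, conf2=0.99)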

Applying the same strategy with the goal set to negative MFP growth, the ARN construction was terminated after the second-order association rules were generated, since no new third-order variables leading to the second-order variables were found at the given support. Figure 10 shows the ARN with negative MFP growth.

4.2.2. Statistical analysis and model development

Several interesting patterns emerge when both ARNs are analyzed with reference to each other. First, the two ARNs generated are quite different.

Fig. 9 ARN with positive MFP growth. (Nodes include the goal MFP_Growth=Positive, the first-order variable instances StockAdjust=Decrease, UC=Yes, and BudgetPlan=Yes, and second-order variable instances such as ITTraining=Yes, ProTraining=Yes, MgtTraining=Yes, JobTraining=Yes, IE_Report=Yes, BusLink=Yes, QA=Yes, RD=Yes, BusComp=Yes, and BusPlan=Yes, linked by rules 1.1 and 2.1–2.5 in groups 1–3.)


This result suggests that factors that contribute to positive MFP growth are not necessarily the same as those contributing to negative MFP growth. For example, the ARN in Fig. 9 shows that three variable instances (reduced stock level, use of computers, and use of a budget plan) have a first-order effect on positive MFP growth, whereas the ARN in Fig. 10 shows that a different set of variable instances (increased stock level, no implementation of Total Quality Management (TQM), and no introduction of goods) has a first-order effect on negative MFP growth. Second, the positions of some factors in the two ARNs are different. This suggests that some factors affect productivity growth in the positive direction, but the absence of these factors may not have a comparable reverse effect. Third, the relationship structure among organizational practices that leads to productivity growth is not necessarily the same as the relationship structure among the organizational practices that hinder productivity growth. After examining all the variables in both ARNs, we classify the variables in the two ARNs using the following criteria:

• Variables are drawn with a full ellipse if they exist in both ARNs.
• Variables are drawn with a dotted ellipse if they exist in only one of the ARNs.
• First-order variables are labeled with a double line.
• Second-order variables are labeled with a single line.

The objective of this step is to convert variable instances from data-mining results to candidate-type variables (i.e. to develop a testable model based on the two ARNs). The productivity impacts of these candidate-type variables can

be hypothesized and tested statistically. On the basis of the classification of the variables, there are only nine variable instances common to both ARNs. These nine variables are shown in Table 7.

Table 7. Variable instances with bi-directional impact on MFP growth (order of appearance in the Fig. 9 ARN for positive MFP growth and the Fig. 10 ARN for negative MFP growth).

Change in stock level: first order (positive); first order (negative)
PC use: first order (positive); second order (negative)
Budget plan: first order (positive); second order (negative)
Business plan: second order (positive); second order (negative)
Quality assurance: second order (positive); second order (negative)
Income/expenditure report: second order (positive); second order (negative)
Adjust production level: second order (positive); second order (negative)
R&D: second order (positive); second order (negative)
Business comparison: second order (positive); second order (negative)

In order to synthesize the data-mining results, the number of candidate variables to be considered was reduced by selecting only the first-order variable instances from either ARN. The five first-order candidate variables were classified as strong or weak candidate variables. To be classified as a strong candidate variable, two necessary conditions must hold. First, the variable instance has to exist in both ARNs with opposite directions. Second, the variable instance must have a first-order impact on either positive or negative MFP growth.

Fig. 10 ARN with negative MFP growth. (Nodes include the goal MFP_Growth=Negative, the first-order variable instances StockAdjust=Increase, TQM=No, and IntroGood=No, and second-order variable instances such as UC=No, I Partner=No, QA=No, BPI=No, BudgetPlan=No, BusPlan=No, StraPlan=No, ProdTech=NA, JIT=No, StructuredTraining=No, CompCServ=No, IE_Report=No, Adjust PL=No, RD=No, BusComp=No, and Workshops=No, linked by rules 1.1 and 2.1–2.7 in groups 1–3.)


Table 8. Candidate variables for statistical testing.

StockAdjust: decrease in stock level (first order, positive MFP growth ARN); increase in stock level (first order, negative MFP growth ARN); strong negative candidate
UC: use of computer (first order, positive MFP growth ARN); no use of computer (second order, negative MFP growth ARN); strong positive candidate
BudgetPlan: use of budget plan (first order, positive MFP growth ARN); no use of budget plan (second order, negative MFP growth ARN); strong positive candidate
IntroGood: no introduction of goods (first order, negative MFP growth ARN); weak positive candidate
TQM: no implementation of TQM (first order, negative MFP growth ARN); weak positive candidate

In contrast, variables that have a first-order impact on only positive or only negative MFP growth (i.e. they exist in only one ARN) are classified as weak candidate variables. Table 8 shows the first-order variable instances discovered in the ARNs.

There are a total of five different first-order candidate variables. Of these, three variables (namely adjustment in stock level, use of computer, and budget plan) appeared in both ARNs with opposite directions. The other two variables (namely introduction of goods and implementation of TQM) appeared in only one ARN. Notably, use of computers emerges as a strong candidate variable with a strong influence on MFP growth.

Next, regression analysis is used to test the statistical significance of each of the candidate organizational variables in explaining MFP growth. On the basis of the five first-order variables discovered in the ARNs, the general model can now be specified as:

ȧ3 = λ + β1 UC + β2 StockAdjust + β3 BudgetPlan + β4 TQM + β5 IntroGood + industry + ε.
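A minimal sketch of estimating this model by ordinary least squares with industry dummies, assuming a hypothetical DataFrame firms holding the continuous MFP growth rate, the five candidate variables, and an industry code, is given below.

    # Minimal sketch: estimate the main-effect model with OLS. `firms` is a
    # hypothetical DataFrame; 'C(industry)' expands the industry code into dummies.
    import statsmodels.formula.api as smf

    def fit_main_effect_model(firms):
        model = smf.ols(
            "mfp_growth ~ UC + StockAdjust + BudgetPlan + TQM + IntroGood + C(industry)",
            data=firms,
        ).fit()
        # model.summary() reports estimates, standard errors, t-values, R2 and F.
        return model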

By estimating the significance of each coefficient (βi), one can statistically verify the relationship between the dependent variable (MFP growth) and the explanatory variables.7 The regression results are provided in Table 9.

The statistical findings reveal interesting insights about how these variables behave in organizations: variables that appeared in both ARNs are found to have higher statistical significance than variables that appear in only one of the ARNs.

7 Note that the original values of the dependent variable ȧ3 and the independent variable StockAdjust are used (i.e. these variables are continuous in the regression). Furthermore, an industry variable is also included in the regression to adjust for possible industry effects.

Table 9. Coefficient estimates of the main-effect model.

UC: estimate 0.229***, standard error 0.022, t-value 10.284
StockAdjust: estimate −0.082***, standard error 0.010, t-value −8.105
BudgetPlan: estimate 0.063***, standard error 0.012, t-value 5.099
TQM: estimate 0.047***, standard error 0.011, t-value 4.208
IntroGood: estimate 0.016, standard error 0.011, t-value 1.521
Control: industry; R2 = 0.308; N = 660; F-value = 48.829

***p < 0.01; **p < 0.05; *p < 0.10.

This phenomenon points to two interesting research issues: (i) it confirms that certain factors that affect organizational performance positively do not necessarily have a reverse effect when they are reduced or removed, and (ii) statistical inference alone is perhaps inadequate to estimate the impact of variables that affect the dependent variable in one direction only. In the next step, relationships among the organizational variables were modeled and statistically analyzed.

On the basis of the regression results, we construct our preliminary model using the relationship structure discovered in the two ARNs, i.e. by careful interpretation of the relationship structure in Figs. 9 and 10. First, the first-order variables found to be statistically significant at the 0.01 level were stock level adjustment (StockAdjust), computer use (UC), implemented budget plan (BudgetPlan), and implemented TQM (TQM). The variable representing the introduction of goods and services (IntroGood) was removed owing to its statistical insignificance in the regression results.

Second, on the basis of the position of the UC variable and the relative positions of the other organizational variables in the ARNs, two types of relationships can be proposed. These are moderating and mediating relationships.


Fig. 11 Preliminary path model derived from the two ARNs. (Use of Computer, Stock Adjust, Budget Plan, and TQM are linked to MFP Growth through direct, moderating, and mediating paths.)

Table 10. Mediating and moderating impact of computer and organizational factors.

Moderating: computer use and stock level adjustment
Moderating: computer use and budget plan
Mediating: computer use is mediated by TQM
Mediating: budget plan is mediated by TQM

On the basis of the relationship structure revealed in Fig. 9, the moderating relationships are between computer use and stock level adjustment, and between computer use and budget plan. On the basis of the relationship structure revealed in Fig. 10, the two mediating relationships are between computer use and MFP growth, and between budget plan and MFP growth, in both cases mediated by TQM. The candidate relationships are provided in Table 10.
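As a rough sketch of how the candidate relations in Table 10 could be checked, moderation can be tested with an interaction term and mediation with the Baron and Kenny [41] sequence of regressions. The variable coding below (0/1 indicators and the hypothetical firms DataFrame from the previous sketch) is assumed for illustration only; the tests actually reported for this study are in Poon et al. [38].

    # Hedged sketch of moderation and mediation tests for the relations in Table 10.
    import statsmodels.formula.api as smf

    def test_moderation(firms):
        # Does computer use moderate the effect of stock level adjustment?
        # 'UC * StockAdjust' expands to both main effects plus their interaction.
        m = smf.ols("mfp_growth ~ UC * StockAdjust + C(industry)", data=firms).fit()
        return m.params["UC:StockAdjust"], m.pvalues["UC:StockAdjust"]

    def test_mediation_by_tqm(firms, predictor="UC"):
        # Baron-Kenny steps for: predictor -> TQM -> MFP growth (TQM coded 0/1;
        # a linear probability model is used here purely for brevity).
        total = smf.ols(f"mfp_growth ~ {predictor} + C(industry)", data=firms).fit()
        to_mediator = smf.ols(f"TQM ~ {predictor} + C(industry)", data=firms).fit()
        both = smf.ols(f"mfp_growth ~ {predictor} + TQM + C(industry)", data=firms).fit()
        # Mediation is suggested if the predictor explains TQM and its coefficient
        # shrinks once TQM is included, while TQM itself remains significant.
        return (total.params[predictor], to_mediator.params[predictor],
                both.params[predictor], both.params["TQM"])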

We consider these candidate analytic properties from Table 10 to construct the preliminary path model for further statistical testing.8 Figure 11 illustrates the proposed path model.

In summary, these two case studies illustrate how ARNs can be used effectively for identifying complex relationships between factors that lead to the achievement of a certain goal. These effects can be both immediate and distant, and the structure of an ARN provides indications about these dependencies. These facts justify the use of ARNs for addressing several complex real-life problems.

5. SUMMARY AND FUTURE WORK

We have presented a methodology to systematically integrate the data-mining search process with traditional statistical methods.

8 There are various ways to validate the preliminary model described in Fig. 11. One way is to perform path analysis. The statistical test for this preliminary model, using the framework established by Baron and Kenny [41], can be found in Poon et al. [38].

We have applied data mining with the goal of refining the existing theoretical approach, whereas statistical methods were used to validate the conjectures against data. This hybrid approach allows us to exploit the recent advances in computationally efficient search techniques designed for large databases in conjunction with traditional statistical methods.

We illustrate our methodology with the help of two case studies and show how Association Rules Networks can be used to systematically search for candidate theories. An ARN provides a mechanism for synthesizing association rules in a structured manner. The important features of an ARN are (1) its ability to prune rules in the context of a goal, (2) the transformation of pruning mechanisms into simple graph operations, and (3) its use as a basis for reasoning with the discovered association rules.

We are working on designing a graph layout algorithm specifically for ARNs. This will automate the display of ARNs and allow them to become part of a decision-making toolkit. We will also investigate how spurious rules affect the accuracy of an ARN for reasoning about a given goal node. It will also be interesting to investigate how the ARN for a given goal node varies with variations in the support and confidence thresholds used to derive the set of rules used for constructing the ARN, and whether any patterns can be observed therein. Finally, the possibility of simultaneously analyzing multiple goal nodes is another topic that deserves further investigation.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

[2] A. Koutsoyiannis, Theory of Econometrics (2nd ed.), Palgrave, 2001.

[3] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Morgan Kaufmann, 1994, 487–499.


[4] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, SIGMOD Rec 22(2) (1993), 207–216.

[5] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006.

[6] R. Sapsford and V. Jupp, Data Collection and Analysis, Sage Publications, 2006.

[7] S. Chawla, B. Arunasalam, and J. Davis, Mining open source software (OSS) data using association rules network, in Advances in Knowledge Discovery and Data Mining, 7th Pacific-Asia Conference, PAKDD'03, 2003, 461–466.

[8] S. Chawla, J. Davis, and G. Pandey, On local pruning of association rules using directed hypergraphs, in Proceedings of the 20th IEEE International Conference on Data Engineering, Boston, 2004, 832.

[9] M. J. Zaki, Generating non-redundant association rules, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, 2000, 34–43.

[10] S. Jaroszewicz and D. A. Simovici, Pruning redundant association rules using maximum entropy principle, in Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD'02, Taipei, Taiwan, 2002, 135–147.

[11] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, Discovering frequent closed itemsets for association rules, in Proceedings of the 7th International Conference on Database Theory, Springer-Verlag, 1999, 398–416.

[12] M. J. Zaki and C.-J. Hsiao, CHARM: An efficient algorithm for closed association rule mining, in Proceedings of the SIAM International Conference on Data Mining (SDM), 2002.

[13] J. Pei, J. Han, and R. Mao, CLOSET: An efficient algorithm for mining frequent closed itemsets, in ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000, 21–30.

[14] H. Xiong, P.-N. Tan, and V. Kumar, Hyperclique pattern discovery, Data Mining Knowl Discov J 13(2) (2006), 219–242.

[15] Y. Ke, J. Cheng, and W. Ng, Mining quantitative correlated patterns using an information-theoretic approach, in Proceedings of the 12th ACM SIGKDD Conference, 2006, 227–236.

[16] G. Piatetsky-Shapiro and C. J. Matheus, The interestingness of deviations, in Proceedings of the AAAI-94 Workshop on Knowledge Discovery in Databases, 1994, 25–36.

[17] P. Tan, V. Kumar, and J. Srivastava, Selecting the right objective measure for association analysis, Inf Syst 29(4) (2004), 293–313.

[18] L. Geng and H. J. Hamilton, Interestingness measures for data mining: A survey, ACM Comput Surv 38(3) (2006), Article 9.

[19] A. Silberschatz and A. Tuzhilin, On subjective measures of interestingness in knowledge discovery, in Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 1995, 275–281.

[20] S. Jaroszewicz and D. A. Simovici, Interestingness of frequent itemsets using Bayesian networks as background knowledge, in Proceedings of the Tenth ACM SIGKDD Conference, 2004, 178–186.

[21] C. Wang and S. Parthasarathy, Summarizing itemset patterns using probabilistic models, in Proceedings of the 12th ACM SIGKDD Conference, 2006, 730–735.

[22] X. Yan, H. Cheng, J. Han, and D. Xin, Summarizing itemset patterns: A profile-based approach, in Proceedings of the Eleventh ACM SIGKDD Conference, 2005, 314–323.

[23] E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher, Clustering based on association rule hypergraphs, in Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD '97), 1997.

[24] D. Xin, J. Han, X. Yan, and H. Cheng, Mining compressed frequent-pattern sets, in Proceedings of the 31st International Conference on Very Large Databases (VLDB), 2005, 709–720.

[25] G. Gupta, A. Strehl, and J. Ghosh, Distance based clustering of association rules, in Intelligent Engineering Systems Through Artificial Neural Networks, Proceedings of ANNIE 1999, Vol. 9, ASME Press, 1999, 759–764.

[26] A. Berrado and G. Runger, Using metarules to group discovered association rules, Data Mining Knowl Discov 14(3) (2007), 409–431.

[27] G. Gallo, G. Longo, and S. Pallottino, Directed hypergraphs and applications, Discrete Appl Math 42(2) (1993), 177–201.

[28] M. Zaki and M. Ogihara, Theoretical foundations of association rules, in Proceedings of the 3rd SIGMOD'98 Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'98), Seattle, Washington, June 1998.

[29] D. Gunopulos, H. Mannila, R. Khardon, and H. Toivonen, Data mining, hypergraph transversals, and machine learning, in Proceedings of ACM PODS, 1997, 209–216.

[30] G. Ausiello, G. F. Italiano, and U. Nanni, Dynamic maintenance of directed hypergraphs, Theor Comput Sci 72(2–3) (1990), 97–117.

[31] M. Ramaswamy, S. Sarkar, and Y. Chen, Using directed hypergraphs to verify rule-based expert systems, IEEE TKDE 9(2) (1997), 221–237.

[32] D. Heckerman, Bayesian networks for data mining, Data Mining Knowl Discov 1 (1997), 79–119.

[33] G. Ausiello, G. F. Italiano, and U. Nanni, Hypergraph traversal revisited: Cost measures and dynamic algorithms, Lect Notes Comput Sci 1450 (1998), 1–16.

[34] R. Cooley, B. Mobasher, and J. Srivastava, Data preparation for mining World Wide Web browsing patterns, Knowledge and Information Systems 1 (1999), 5–32.

[35] A. Dutoit and B. Bruegge, Communication metrics for software development, IEEE Trans Softw Eng 24(8) (1998), 615–628.

[36] G. Karypis and V. Kumar, Multilevel k-way hypergraph partitioning, in ACM/IEEE Design Automation Conference (DAC), 1999, 343–348.

[37] R. L. Gorsuch, Factor Analysis, Hillsdale, NJ, Erlbaum, 1983.

[38] S. K. Poon, J. G. Davis, and B. Choi, Understanding of IT complementarities using augmented endogenous growth framework, in Proceedings of the 26th International Conference on Information Systems (ICIS2005), Las Vegas, 11–14 December 2005, 659–671.

[39] ABS (Australian Bureau of Statistics), Business Longitudinal Survey, Australian Bureau of Statistics, Cat. No. 8141.0, 2000.

[40] OECD, Measuring Productivity (OECD Manual): Measurement of Aggregate and Industry-Level Productivity Growth, Paris, OECD, 2001.

[41] R. Baron and D. Kenny, The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations, J Pers Soc Psychol 51 (1986), 1173–1182.
