A Theoretical Framework for Association Mining based on the Boolean Retrieval Model

Peter Bollmann-Sdorra

Contents

Introduction
Background
Boolean Association Mining
Expressing item-sets as queries
Conclusions
Future Work

Introduction

Researchers focus on discovering rules in the form of implications between itemsets that have adequate support.

Using frequent itemsets as both the antecedent and the consequent of a rule represents only the simplest form of predicate.

This simplicity is due in part to the lack of a theoretical framework that includes more expressive predicates.

Motivation

In information retrieval systems, a strong theoretical background gives the user the power to ask more sophisticated and pertinent questions. Information retrieval and association mining are two complementary processes on the same data records or transactions. In information retrieval, given a query, we need to find the subset of records that matches the query. In contrast, in data mining, we need to find the queries (rules) that have an adequate number of records supporting them.

Proposed Solution

We introduce a theory of association mining based on a model of retrieval known as the Boolean Retrieval Model, where

a Boolean query that uses only the AND operator is analogous to an itemset,

a general Boolean query (using AND, OR, or NOT) can be interpreted as a generalized itemset,

notions of support of itemsets and confidence of rules can be dealt with uniformly, and

an event algebra can be defined, involving all possible transaction subsets, to formally obtain a probability space.

Background

Deriving association rules from data: Given a set of items $I = \{i_1, i_2, \ldots, i_n\}$ and a set of transactions $T = \{t_1, t_2, \ldots, t_m\}$, where each transaction $t_j \in T$ satisfies $t_j \subseteq I$, an association rule is defined as $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. It describes the existence of a relationship between the two itemsets X and Y.

Support: $\mathrm{Support}(X \Rightarrow Y) = P(X, Y)$

The percentage of transactions in the database that contain both X and Y. Measure for significance.

Confidence: $\mathrm{Confidence}(X \Rightarrow Y) = P(X, Y) / P(X)$

The percentage of transactions that contain Y among those transactions containing X. Measure for importance.

Interest: $\mathrm{Interest}(X \Rightarrow Y) = P(X, Y) / (P(X)\,P(Y))$

Represents a test of statistical independence. Measure for importance.
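The three measures above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy database `transactions` and the helper names `prob`, `support`, `confidence`, and `interest` are assumptions made for this example, and P is estimated as the plain fraction of transactions containing an itemset.

```python
from fractions import Fraction

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"beer", "bread"},
    {"milk", "bread"},
    {"beer", "milk", "bread"},
    {"beer"},
    {"milk"},
]

def prob(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    hits = sum(1 for t in db if itemset <= t)
    return Fraction(hits, len(db))

def support(x, y, db):
    """Support(X => Y) = P(X, Y)."""
    return prob(x | y, db)

def confidence(x, y, db):
    """Confidence(X => Y) = P(X, Y) / P(X)."""
    return prob(x | y, db) / prob(x, db)

def interest(x, y, db):
    """Interest(X => Y) = P(X, Y) / (P(X) * P(Y))."""
    return prob(x | y, db) / (prob(x, db) * prob(y, db))

X, Y = {"beer"}, {"bread"}
print(support(X, Y, transactions))     # 2/5
print(confidence(X, Y, transactions))  # 2/3
print(interest(X, Y, transactions))    # 10/9
```

An interest value of 1 would indicate that X and Y occur independently; here it is slightly above 1, so beer and bread co-occur a little more often than independence would predict in this toy data.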

Boolean Association Mining

Given a set of items $I = \{i_1, i_2, \ldots, i_n\}$, a transaction t is defined as a subset of items such that $t \in 2^I$, where $2^I = \{\emptyset, \{i_1\}, \{i_2\}, \ldots, \{i_n\}, \{i_1, i_2\}, \ldots, \{i_1, i_2, \ldots, i_n\}\}$.

Let $T \subseteq 2^I$ be a given set of transactions $\{t_1, t_2, \ldots, t_m\}$. Every transaction $t \in T$ has an assigned weight $w'(t)$.

Possible Weights

(i) $w'(t) = 1$ for all transactions $t \in T$.

(ii) $w'(t) = f(t)$ for all transactions $t \in T$, where $f(t)$ is the frequency of transaction t, i.e., how many times the transaction t is repeated in the database.

(iii) $w'(t) = |t| * g(t)$ for all transactions $t \in T$, where $|t|$ is the cardinality of t, and $g(t)$ is either of the weight functions $w'(t)$ defined in (i) and (ii). In this case, longer transactions get higher weight.

(iv) $w'(t) = v(t) * f(t)$ for all transactions $t \in T$, where $v(t)$ could be the sum of the prices or profits of the items in t.

The weights $w'$ are normalized to

$w(t) = \dfrac{w'(t)}{\sum_{t' \in T} w'(t')}$, so that $\sum_{t \in T} w(t) = 1$.

Example

Let I = {beer, milk, bread} be the set of all items, where price(beer) = 5, price(milk) = 3, and price(bread) = 2. The set of transactions T is given below, where f(t) is the frequency of transaction t.

| # | t | f(t) |
|---|---|------|
| 1 | {beer} | 22 |
| 2 | {milk} | 8 |
| 3 | {bread} | 10 |
| 4 | {beer, bread} | 20 |
| 5 | {milk, bread} | 25 |
| 6 | {beer, milk, bread} | 15 |

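Case 1: $w'(t) = 1$

| t | w'(t) | w(t) |
|---|-------|------|
| {beer} | 1 | 1/6 |
| {milk} | 1 | 1/6 |
| {bread} | 1 | 1/6 |
| {beer, bread} | 1 | 1/6 |
| {milk, bread} | 1 | 1/6 |
| {beer, milk, bread} | 1 | 1/6 |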

Case 2: $w'(t) = f(t)$

| t | w'(t) | w(t) |
|---|-------|------|
| {beer} | 22 | 0.22 |
| {milk} | 8 | 0.08 |
| {bread} | 10 | 0.10 |
| {beer, bread} | 20 | 0.20 |
| {milk, bread} | 25 | 0.25 |
| {beer, milk, bread} | 15 | 0.15 |

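Case 3: $w'(t) = |t| * g(t)$, with $g(t) = f(t)$

| t | w'(t) | w(t) |
|---|-------|------|
| {beer} | 22 | 0.13 |
| {milk} | 8 | 0.05 |
| {bread} | 10 | 0.06 |
| {beer, bread} | 40 | 0.23 |
| {milk, bread} | 50 | 0.29 |
| {beer, milk, bread} | 45 | 0.26 |

The raw weights sum to 175; the w(t) values are rounded to two decimals, so they add up to slightly more than 1.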

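Case 4: $w'(t) = v(t) * g(t)$, with $g(t) = f(t)$ and $v(t) = \mathrm{Price}(t)$

| t | Price(t) | w'(t) | w(t) |
|---|----------|-------|------|
| {beer} | 5 | 110 | 0.19 |
| {milk} | 3 | 24 | 0.04 |
| {bread} | 2 | 20 | 0.04 |
| {beer, bread} | 7 | 140 | 0.25 |
| {milk, bread} | 5 | 125 | 0.22 |
| {beer, milk, bread} | 10 | 150 | 0.26 |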
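The four weighting schemes can be recomputed from the example data with a short script. This is a sketch under the assumptions stated in the cases (g(t) = f(t) and v(t) equal to the sum of item prices); the names `freq`, `price`, `normalize`, and `case1` to `case4` are chosen for this illustration only.

```python
# Item prices and transaction frequencies f(t) from the example.
price = {"beer": 5, "milk": 3, "bread": 2}
freq = {
    frozenset({"beer"}): 22,
    frozenset({"milk"}): 8,
    frozenset({"bread"}): 10,
    frozenset({"beer", "bread"}): 20,
    frozenset({"milk", "bread"}): 25,
    frozenset({"beer", "milk", "bread"}): 15,
}

def normalize(raw):
    """Turn raw weights w'(t) into weights w(t) that sum to 1."""
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

# Case 1: every transaction counts once.
case1 = normalize({t: 1 for t in freq})
# Case 2: weight by frequency.
case2 = normalize(dict(freq))
# Case 3: weight by cardinality times frequency (g = f).
case3 = normalize({t: len(t) * f for t, f in freq.items()})
# Case 4: weight by total price times frequency (v = price sum, g = f).
case4 = normalize({t: sum(price[i] for i in t) * f for t, f in freq.items()})

for name, w in [("case1", case1), ("case2", case2), ("case3", case3), ("case4", case4)]:
    print(name, [round(w[t], 2) for t in freq])
```

Keeping normalization in a single helper makes it easy to try other raw weight functions without touching the rest of the computation.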

Expressing item-sets as queries (logical expressions)

Definition 1: For a given set of items I, the set Q of all possible queries associated with item-sets created from I is defined as follows.

$i \in I \Rightarrow i \in Q$

$q, q' \in Q \Rightarrow q \wedge q' \in Q$

These are all the elements of Q.

Definition 2: For any query $q \in Q$, the response set of q, RS(q), is defined as follows:

For all atomic $i \in Q$, $RS(i) = \{t \in T \mid i \in t\}$

$RS(q \wedge q') = RS(q) \cap RS(q')$

Definition 3: Let $q = (i_1 \wedge i_2 \wedge \ldots \wedge i_k)$ and let $A_q$ denote the item-set associated with q; that is, $A_q = \{i_1, i_2, \ldots, i_k\}$. The support of $A_q$ is defined as

$S(A_q) = \sum_{t \in RS(q)} w(t)$
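Definitions 2 and 3 can be illustrated on the beer, milk, bread example. The sketch below assumes the Case 2 weights (w'(t) = f(t)) and represents a conjunctive query simply by its set of items; `response_set` and `support` are names introduced for this example.

```python
from fractions import Fraction

# Transactions and frequencies from the example; weights follow Case 2.
freq = {
    frozenset({"beer"}): 22,
    frozenset({"milk"}): 8,
    frozenset({"bread"}): 10,
    frozenset({"beer", "bread"}): 20,
    frozenset({"milk", "bread"}): 25,
    frozenset({"beer", "milk", "bread"}): 15,
}
total = sum(freq.values())
w = {t: Fraction(f, total) for t, f in freq.items()}

def response_set(items):
    """RS(q) for q = i1 AND ... AND ik: the transactions containing every item."""
    return {t for t in w if items <= t}

def support(items):
    """S(A_q): total normalized weight of the transactions in RS(q)."""
    return sum((w[t] for t in response_set(items)), Fraction(0))

print(support(frozenset({"beer", "bread"})))  # 7/20
```

Because the query is stored as a set of items, repeating or reordering the conjuncts cannot change the response set, which matches the idempotence, associativity, and commutativity of AND.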

Lemma 1:

The support set of $A_q$, $SS(A_q)$, equals $RS(q)$.

Lemma 2:

For queries $q$, $q_1$, $q_2$ and $q_3$, the following identities hold:

$RS(q \wedge q) = RS(q)$

$RS((q_1 \wedge q_2) \wedge q_3) = RS(q_1 \wedge (q_2 \wedge q_3))$

$RS(q_1 \wedge q_2) = RS(q_2 \wedge q_1)$

Example:

$RS((x_1 \wedge x_2) \wedge (x_3 \wedge x_2)) = RS(x_1 \wedge x_2 \wedge x_3)$

Definition 4:

For a given set of items I, the set Q* of all possible queries is defined as follows.

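$i \in I \Rightarrow i \in Q^*$

$q, q' \in Q^* \Rightarrow q \wedge q' \in Q^*$

$q, q' \in Q^* \Rightarrow q \vee q' \in Q^*$

$q \in Q^* \Rightarrow \neg q \in Q^*$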

Definition 5:

For any query $q \in Q^*$, the response set of transactions, RS(q), is defined as follows:

For all atomic $i \in Q^*$, $RS(i) = \{t \in T \mid i \in t\}$

$RS(q \wedge q') = RS(q) \cap RS(q')$

$RS(q \vee q') = RS(q) \cup RS(q')$

$RS(\neg q) = T - RS(q)$
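The recursive definition of RS over Q* maps directly onto a small evaluator. The following is a minimal sketch on the example transactions; the tuple encoding of queries (`("and", q1, q2)`, `("or", q1, q2)`, `("not", q)`) and the function name `rs` are assumptions made for this illustration.

```python
# Transactions from the example.
T = [frozenset(s) for s in (
    {"beer"}, {"milk"}, {"bread"},
    {"beer", "bread"}, {"milk", "bread"}, {"beer", "milk", "bread"},
)]

# Hypothetical query encoding: an item name (str), or a tuple
# ("and", q1, q2), ("or", q1, q2), ("not", q).
def rs(q):
    """Response set RS(q), computed by structural recursion on q."""
    if isinstance(q, str):
        return {t for t in T if q in t}      # atomic query
    op = q[0]
    if op == "and":
        return rs(q[1]) & rs(q[2])           # intersection
    if op == "or":
        return rs(q[1]) | rs(q[2])           # union
    if op == "not":
        return set(T) - rs(q[1])             # complement in T
    raise ValueError(f"unknown operator: {op!r}")

# Transactions that contain beer but not milk.
q = ("and", "beer", ("not", "milk"))
print(sorted(sorted(t) for t in rs(q)))  # [['beer'], ['beer', 'bread']]
```

Since each operator is interpreted as a set operation on subsets of T, queries that are equivalent under Boolean algebra (for example, by De Morgan's laws) yield the same response set.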

Theorem:

If q is a transformation of q' that is obtained by applying the rules of Boolean algebra, then $RS(q) = RS(q')$.

Each $q \in Q^*$ can be considered a generalized itemset. The itemsets investigated in earlier works only consider $q \in Q$.

Lemma 3:

$\{RS(q) \mid q \in Q^*\} = 2^T$

Theorem:

$(T, 2^T, P)$ is a probability space.
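One way to see why every subset of T is a response set is to build, for each transaction t, the query that ANDs all items in t with the negation of all items not in t; its response set is exactly {t}, and OR-ing such queries reaches any subset (the empty set corresponds to a contradiction such as q AND NOT q). The sketch below checks this on the example data. The measure `P` used here, the total normalized weight of an event with Case 2 weights, is an assumption for this illustration, as are all helper names.

```python
from fractions import Fraction
from itertools import combinations

I = ["beer", "milk", "bread"]
freq = {
    frozenset({"beer"}): 22,
    frozenset({"milk"}): 8,
    frozenset({"bread"}): 10,
    frozenset({"beer", "bread"}): 20,
    frozenset({"milk", "bread"}): 25,
    frozenset({"beer", "milk", "bread"}): 15,
}
T = list(freq)
total = sum(freq.values())

def minterm_rs(t):
    """RS of the query: AND of every item in t, AND NOT of every item outside t."""
    result = set(T)
    for item in I:
        containing = {u for u in T if item in u}
        result &= containing if item in t else set(T) - containing
    return result

def rs_of_or(subset):
    """RS of the OR of the minterm queries of the transactions in subset."""
    out = set()
    for t in subset:
        out |= minterm_rs(t)
    return frozenset(out)

def P(event):
    """Assumed measure: total normalized weight of the transactions in event."""
    return sum((Fraction(freq[t], total) for t in event), Fraction(0))

singletons_ok = all(minterm_rs(t) == {t} for t in T)
all_subsets = [frozenset(c) for r in range(len(T) + 1) for c in combinations(T, r)]
covered = {rs_of_or(s) for s in all_subsets}

print(singletons_ok, len(covered), P(T))  # True 64 1
```

With six transactions there are 2^6 = 64 subsets, all of which are reached, and the assumed measure assigns total mass 1 to T.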

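Rules and Their Response Strengths

Here $R(q)$ denotes the response strength of a query q, taken to be the total weight of its response set, $R(q) = \sum_{t \in RS(q)} w(t)$.

Definition 6: The confidence of a rule $A_q \Rightarrow A_{q'}$ is defined as

$\dfrac{R(q \wedge q')}{R(q)}$

Definition 7: The interest of a rule $A_q \Rightarrow A_{q'}$ is defined as

$\dfrac{R(q \wedge q')}{R(q) * R(q')}$

Definition 8: The support of a rule $A_q \Rightarrow A_{q'}$ is defined as

$S(A_q \Rightarrow A_{q'}) = R(q \rightarrow q')$, where $q \rightarrow q'$ stands for $\neg q \vee q'$.

Lemma 4: For a rule $A_q \Rightarrow A_{q'}$,

$S(A_q \Rightarrow A_{q'}) = 1 - R(q) + R(q \wedge q')$

Lemma 5: For a rule $A_q \Leftrightarrow A_{q'}$,

$S(A_q \Leftrightarrow A_{q'}) = 1 - R(q) - R(q') + 2 * R(q \wedge q')$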
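The rule measures and the two lemmas can be checked numerically on the example. This sketch rests on several assumptions: R is read as the total normalized weight of a response set, Case 2 weights (w'(t) = f(t)) are used, the rule support is evaluated on NOT q OR q', and the two-way rule on (q AND q') OR (NOT q AND NOT q'). All names are introduced for this illustration.

```python
from fractions import Fraction

freq = {
    frozenset({"beer"}): 22,
    frozenset({"milk"}): 8,
    frozenset({"bread"}): 10,
    frozenset({"beer", "bread"}): 20,
    frozenset({"milk", "bread"}): 25,
    frozenset({"beer", "milk", "bread"}): 15,
}
total = sum(freq.values())
T = set(freq)

def rs(item):
    """Response set of an atomic query."""
    return {t for t in T if item in t}

def R(resp):
    """Assumed response strength: total normalized weight of a response set."""
    return sum((Fraction(freq[t], total) for t in resp), Fraction(0))

q, q2 = rs("beer"), rs("bread")           # q = beer, q' = bread
r_q, r_q2, r_and = R(q), R(q2), R(q & q2)

confidence = r_and / r_q                   # Definition 6
interest = r_and / (r_q * r_q2)            # Definition 7
support_rule = R((T - q) | q2)             # R(NOT q OR q')
support_equiv = R((q & q2) | ((T - q) & (T - q2)))

# Lemma 4 and Lemma 5 hold on this data.
assert support_rule == 1 - r_q + r_and
assert support_equiv == 1 - r_q - r_q2 + 2 * r_and

print(confidence, interest, support_rule, support_equiv)  # 35/57 50/57 39/50 43/100
```

Exact fractions are used so that the equalities in the lemmas can be tested without floating-point tolerance.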

Conclusions

A theory of association mining based on a model of retrieval known as the Boolean Retrieval Model has been introduced.

The framework we develop derives from the observation that information retrieval and association mining are two complementary processes on the same data records or transactions.

Based on the theory of Boolean retrieval, we generalize the itemset structure by using all Boolean operators.

Conclusions (cont.)

By introducing the notion of support of generalized itemsets, a uniform measure for both itemsets and rules (generalized itemsets) has been developed.

Support of a generalized itemset is extended to allow transactions to be weighted so that they can contribute to support unequally.

Future Work

In order to generate only understandable queries, new restrictions or measures, such as compactness and simplicity, should be introduced. (These restrictions or measures could eliminate a large number of frequent generalized itemsets, many of which could have complex structures.)
