
Page 1: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

CSE 634 Data Mining Concepts and Techniques

Association Rule Mining

Barbara Mucha, Tania Irani, Irem Incekoy, Mikhail Bautin

Course Instructor: Prof. Anita Wasilewska
State University of New York, Stony Brook

Group 6

Page 2: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

References

Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber

Presentation slides of Prateek Duble

Presentation slides of the course book

Mining Topic-Specific Concepts and Definitions on the Web

Effective Personalization Based on Association Rule Discovery from Web Usage Data

Page 3: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Overview

Basic Concepts of Association Rule Mining

Association & Apriori Algorithm

Paper: Mining Topic-Specific Concepts and Definitions on the Web

Paper: Effective Personalization Based on Association Rule Discovery from Web Usage Data

Barbara Mucha

Page 4: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Outline

What is association rule mining?

Methods for association rule mining

Examples

Extensions of association rules

Barbara Mucha

Page 5: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

What Is Association Rule Mining?

Frequent patterns: patterns (sets of items, sequences, etc.) that occur frequently in a database

Frequent pattern mining: finding regularities in data

What products were often purchased together? Beer and diapers?!

What are the subsequent purchases after buying a car?

Can we automatically profile customers?

Barbara Mucha

Page 6: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Basic Concepts of Association Rule Mining

Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)

Find: all rules that correlate the presence of one set of items with that of another set of items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done.

Applications:

* => Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)

Home Electronics => * (what other products should the store stock up on?)

Attached mailing in direct marketing

Barbara Mucha

Page 7: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Association Rule Definitions

Set of items: I = {I1, I2, …, Im}

Transactions: D = {t1, t2, …, tn} is a set of transactions, where each transaction t is a set of items (t ⊆ I)

Itemset: {Ii1, Ii2, …, Iik} ⊆ I

Support of an itemset: the percentage of transactions which contain that itemset

Large (frequent) itemset: an itemset whose number of occurrences is above a threshold

Barbara Mucha

Page 8: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Rule Measures: Support & Confidence

An association rule is of the form X => Y, where X and Y are subsets of I and X ∩ Y = ∅.

Each rule has two measures of value: support and confidence.

Support indicates the frequency of the occurring patterns; confidence denotes the strength of implication in the rule.

The support of the rule X => Y is support(X ∪ Y). c is the confidence of the rule X => Y if c% of the transactions that contain X also contain Y, which can be written as the ratio:

confidence(X => Y) = support(X ∪ Y) / support(X)

Barbara Mucha
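To make these definitions concrete, here is a small Python sketch (ours, not from the slides) that computes support and confidence over a list of transactions; the data mirrors the example on the next slide.

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(lhs, rhs, transactions):
    # support(X ∪ Y) / support(X) for the rule X => Y.
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]
print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.666... (A => C)
print(confidence({"C"}, {"A"}, transactions))  # 1.0      (C => A)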

Page 9: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Support & Confidence : An Example

Let minimum support be 50% and minimum confidence be 50%. Then we have:

A => C (support 50%, confidence 66.6%)
C => A (support 50%, confidence 100%)

TID    Items Bought
2000   A, B, C
1000   A, C
4000   A, D
5000   B, E, F

Barbara Mucha

Page 10: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Types of Association Rule Mining

Boolean vs. quantitative associations (based on the types of values handled):

buys(x, "computer") => buys(x, "financial software") [0.2%, 60%]
age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]

Single-dimensional vs. multidimensional associations (the same two rules illustrate each case):

buys(x, "computer") => buys(x, "financial software") [0.2%, 60%]
age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]

Barbara Mucha

Page 11: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Types of Association Rule Mining

Single-level vs. multiple-level analysis: what brands of beers are associated with what brands of diapers?

Various extensions: correlation and causality analysis. Association does not necessarily imply correlation or causality.

Constraints enforced: e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?

Barbara Mucha

Page 12: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Association Discovery

Given a user-specified minimum support (MINSUP) and minimum confidence (MINCONF), an important PROBLEM is to find all high-confidence, large itemsets (frequent sets, i.e., sets whose support and confidence exceed minsup and minconf).

This problem can be decomposed into two subproblems:

1. Find all large itemsets with support > minsup (frequent sets).

2. For a large itemset X and B ⊂ X (or Y ⊂ X), find those rules X \ {B} => B (or X − Y => Y) for which confidence > minconf.

Barbara Mucha

Page 13: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Basics

Itemset: a set of items, e.g., acm = {a, c, m}

Support of itemsets: sup(acm) = 3

Given min_sup = 3, acm is a frequent pattern

Frequent pattern mining: find all frequent patterns in a database

Transaction database TDB:

TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

Barbara Mucha

Page 14: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Mining Association Rules—An Example

For the rule A => C:

support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

TID    Items Bought
2000   A, B, C
1000   A, C
4000   A, D
5000   B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

Min. support: 50%. Min. confidence: 50%.

Page 15: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Rules from frequent sets

X = {mustard, sausage, beer}; frequency = 0.4

Y = {mustard, sausage, beer, chips}; frequency = 0.2

If a customer buys mustard, sausage, and beer, then the probability that he/she also buys chips is 0.2 / 0.4 = 0.5.

Barbara Mucha

Page 16: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Applications

Mine:

Sequential patterns: find inter-transaction patterns such that the presence of a set of items is followed by another item in the time-stamp-ordered transaction set.

Periodic patterns: can be envisioned as a tool for forecasting and predicting the future behavior of time-series data.

Structural patterns: describe how classes and objects can be combined to form larger structures.

Barbara Mucha

Page 17: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Application Difficulties

Wal-Mart knows that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars.

What does Wal-Mart do with information like that? "I don't have a clue," says Wal-Mart's chief of merchandising, Lee Scott. (www.kdnuggets.com/news/98/n01.html)

Diapers and beer urban legend: http://web.onetel.net.uk/~hibou/Beer%20and%20Nappies.html

Barbara Mucha

Page 18: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Thank You!

Barbara Mucha

Page 19: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

CSE 634 Data Mining Concepts and Techniques

Association & Apriori Algorithm

Tania Irani (105573836)

Course Instructor: Prof. Anita Wasilewska
State University of New York, Stony Brook

Page 20: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

References

Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber

Presentation Slides of Prof. Anita Wasilewska

Page 21: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Agenda

The Apriori Algorithm (Mining single-dimensional boolean association rules)

Frequent-Pattern Growth (FP-Growth) Method

Summary

Page 22: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

The Apriori Algorithm: Key Concepts

k-itemset: an itemset having k items in it.

Support or frequency: the number of transactions that contain a particular itemset.

Frequent itemset: an itemset that satisfies minimum support (Lk denotes the set of frequent k-itemsets).

Apriori property: all non-empty subsets of a frequent itemset must be frequent.

Join operation: Ck, the set of candidate k-itemsets, is generated by joining Lk-1 with itself (L1: frequent 1-itemsets, Lk: frequent k-itemsets).

Prune operation: Lk, the set of frequent k-itemsets, is extracted from Ck by pruning it, i.e., getting rid of all the non-frequent k-itemsets in Ck.

Iterative level-wise approach: k-itemsets are used to explore (k+1)-itemsets.

The Apriori Algorithm finds frequent k-itemsets.
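To make the join, prune, and scan steps concrete, here is a compact Python sketch of the level-wise loop (an illustrative simplification, not the book's exact pseudocode: the set-union join below stands in for the usual sorted-prefix join).

from itertools import combinations

def apriori(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    # C1 / L1: count single items and keep those meeting minimum support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_count}
    frequent = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        # Join: unions of two frequent (k-1)-itemsets that form a k-itemset.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune (Apriori property): drop candidates with an infrequent subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan D: count each surviving candidate.
        cand_counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, n in cand_counts.items() if n >= min_count}
        frequent.update({c: cand_counts[c] for c in Lk})
        k += 1
    return frequent

Run on the 9-transaction example of the following slides with min_count = 2, this yields exactly the L1, L2, and L3 computed in Steps 1-3 and stops once C4 is empty.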

Page 23: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

How is the Apriori Property used in the Algorithm?

Mining single-dimensional Boolean association rules is a two-step process:

1. Using the Apriori property, find the frequent itemsets: each iteration generates Ck (the candidate k-itemsets, built from Lk-1) and Lk (the frequent k-itemsets).

2. Use the frequent k-itemsets to generate association rules.

Page 24: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Finding frequent itemsets using the Apriori Algorithm: Example

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Consider a database D, consisting of 9 transactions.

Each transaction is represented by an itemset.

Suppose the min. support count required is 2 (2 out of 9 = 22%).

Say the min. confidence required is 70%.

We first find the frequent itemsets using the Apriori algorithm.

Then, association rules are generated using min. support and min. confidence.

Page 25: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Step 1: Generating candidate and frequent 1-itemsets with min. support = 2

C1 (scan D for the count of each candidate):

Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

L1 (compare each candidate's support count with the minimum support count):

Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

In the first iteration of the algorithm, each item is a member of the set of candidates C1, along with its support count.

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.

Page 26: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Step 2: Generating candidate and frequent 2-itemsets with min. support = 2

Generate C2 candidates from L1 ⨯ L1:

Itemset
{I1, I2}
{I1, I3}
{I1, I4}
{I1, I5}
{I2, I3}
{I2, I4}
{I2, I5}
{I3, I4}
{I3, I5}
{I4, I5}

C2 (scan D for the count of each candidate):

Itemset    Sup. Count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

L2 (compare each candidate's support count with the minimum support count):

Itemset    Sup. Count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

Note: we haven't used the Apriori property yet!

Page 27: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Step 3: Generating candidate and frequent 3-itemsets with min. support = 2

Generate C3 candidates from L2 (join step):

Itemset
{I1, I2, I3}
{I1, I2, I5}
{I1, I3, I5}   <- contains the non-frequent subset {I3, I5}
{I2, I3, I4}   <- contains the non-frequent subset {I3, I4}
{I2, I3, I5}   <- contains the non-frequent subset {I3, I5}
{I2, I4, I5}   <- contains the non-frequent subset {I4, I5}

C3 after pruning (scan D for the count of each candidate):

Itemset        Sup. Count
{I1, I2, I3}   2
{I1, I2, I5}   2

L3 (compare each candidate's support count with the minimum support count):

Itemset        Sup. Count
{I1, I2, I3}   2
{I1, I2, I5}   2

The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property: when the join step is complete, the prune step removes the last four candidates above, since each contains a non-frequent 2-itemset subset. The prune step helps avoid the heavy computation caused by a large Ck.

Page 28: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Step 4: Generating frequent 4-itemset

L3 join L3 gives C4 = {{I1, I2, I3, I5}}.

This itemset is pruned, since its subset {I2, I3, I5} is not frequent.

Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.

This completes the Apriori algorithm. What's next?

These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support & minimum confidence).

Page 29: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Step 5: Generating Association Rules from frequent k-itemsets

Procedure:

For each frequent itemset l, generate all nonempty proper subsets of l.

For every nonempty proper subset s of l, output the rule "s => (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold (70% in our case).

Back to the example: let l = {I1, I2, I5}.

The nonempty proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
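A small Python sketch of this procedure (illustrative, not the book's pseudocode); `sc` is a dictionary of support counts taken from the worked example above.

from itertools import combinations

def rules_from_itemset(l, sc, min_conf=0.7):
    # Emit s => (l - s) for every nonempty proper subset s of l whose
    # confidence sc[l] / sc[s] meets the minimum confidence threshold.
    l = frozenset(l)
    out = []
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = sc[l] / sc[s]
            if conf >= min_conf:
                out.append((set(s), set(l - s), conf))
    return out

sc = {frozenset(x): c for x, c in [
    (("I1",), 6), (("I2",), 7), (("I5",), 2),
    (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
    (("I1", "I2", "I5"), 2),
]}
for lhs, rhs, conf in rules_from_itemset(("I1", "I2", "I5"), sc):
    print(lhs, "=>", rhs, f"{conf:.0%}")
# Prints the three selected rules derived on the next slides:
# {I1,I5} => {I2}, {I2,I5} => {I1}, {I5} => {I1,I2}, each at 100%.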

Page 30: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Step 5: Generating Association Rules from frequent k-itemsets [Cont.]

The resulting association rules are:

R1: I1 ∧ I2 => I5
Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.

R2: I1 ∧ I5 => I2
Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.

R3: I2 ∧ I5 => I1
Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.

Page 31: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

R4: I1 => I2 ∧ I5
Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.

R5: I2 => I1 ∧ I5
Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.

R6: I5 => I1 ∧ I2
Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.

We have found three strong association rules.

Page 32: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Agenda

The Apriori Algorithm (Mining single dimensional boolean association rules)

Frequent-Pattern Growth (FP-Growth) Method

Summary

Page 33: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Mining Frequent Patterns Without Candidate Generation

Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining, and avoiding costly database scans.

Develop an efficient, FP-tree-based frequent pattern mining method, using a divide-and-conquer methodology:

Compress the DB into an FP-tree, retaining the itemset associations.
Divide the compressed DB into a set of conditional DBs, each associated with one frequent item.
Mine each such database separately.

This avoids candidate generation.

Page 34: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

FP-Growth Method : An Example

Consider the previous example of a database D consisting of 9 transactions.

Suppose the min. support count required is 2 (i.e., min_sup = 2/9 = 22%).

The first scan of the database is the same as in Apriori; it derives the set of 1-itemsets and their support counts.

The set of frequent items is sorted in order of descending support count.

The resulting set is denoted L = {I2:7, I1:6, I3:6, I4:2, I5:2}.

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Page 35: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

FP-Growth Method: Construction of FP-Tree

First, create the root of the tree, labeled "null".

Scan the database D a second time (the first scan created the 1-itemsets and L); this generates the complete tree.

The items in each transaction are processed in L order (i.e., sorted by descending support count).

A branch is created for each transaction, with each item and its support count separated by a colon.

Whenever the same node is encountered in another transaction, we simply increment the support count of the common node or prefix.

To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.

The problem of mining frequent patterns in the database is thus transformed into mining the FP-tree.
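The two database scans described above can be sketched in Python as follows (an illustrative simplification: node-links are kept as plain per-item lists rather than chained pointers).

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                       # item -> FPNode

def build_fp_tree(transactions, min_count):
    counts = {}
    for t in transactions:                       # scan 1: 1-itemset counts
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None, None)
    header = {}                                  # item -> list of nodes (node-links)
    for t in transactions:                       # scan 2: build the tree
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))  # L order, ties alphabetical
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1                      # shared prefix: just increment
    return root, header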

Page 36: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

FP-Growth Method: Construction of FP-Tree

An FP-tree that registers compressed, frequent pattern information.

Item header table:

Item Id   Sup. Count
I2        7
I1        6
I3        6
I4        2
I5        2

(each entry also holds a node-link into the tree)

The tree:

null {}
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2

Page 37: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Mining the FP-Tree by Creating Conditional (sub) pattern bases

1. Start from each frequent length-1 pattern (as an initial suffix pattern).

2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.

3. Then construct its conditional FP-tree and perform mining on this tree.

4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.

5. The union of all frequent patterns (generated by step 4) gives the required frequent itemsets.
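Continuing the Python sketch from the FP-tree construction slide, the conditional pattern base of step 2 can be collected by walking each node-link upward to the root (again an illustrative sketch, reusing the FPNode and header structures defined earlier).

def conditional_pattern_base(item, header):
    # Prefix paths co-occurring with `item`, as (path, count) pairs.
    base = []
    for node in header[item]:            # follow the item's node-links
        path, cur = [], node.parent
        while cur.item is not None:      # walk up to, but excluding, the root
            path.append(cur.item)
            cur = cur.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# On the example tree: conditional_pattern_base("I5", header)
# returns [(["I2", "I1"], 1), (["I2", "I1", "I3"], 1)], matching the table below.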

Page 38: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

FP-Tree Example Continued

Now, following the above steps, let's start from I5. I5 is involved in two branches, namely {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}. Therefore, considering I5 as the suffix, its two corresponding prefix paths are {I2 I1: 1} and {I2 I1 I3: 1}, which form its conditional pattern base.

Item   Conditional pattern base          Conditional FP-tree     Frequent patterns generated
I5     {(I2 I1: 1), (I2 I1 I3: 1)}       <I2:2, I1:2>            I2 I5:2, I1 I5:2, I2 I1 I5:2
I4     {(I2 I1: 1), (I2: 1)}             <I2:2>                  I2 I4:2
I3     {(I2 I1: 2), (I2: 2), (I1: 2)}    <I2:4, I1:2>, <I1:2>    I2 I3:4, I1 I3:2, I2 I1 I3:2
I1     {(I2: 4)}                         <I2:4>                  I2 I1:4

Mining the FP-Tree by creating conditional (sub) pattern bases

Page 39: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

FP-Tree Example Continued

Out of these, only I1 and I2 are kept in the conditional FP-tree, because I3 does not satisfy the minimum support count:

For I1, the support count in the conditional pattern base = 1 + 1 = 2.
For I2, the support count in the conditional pattern base = 1 + 1 = 2.
For I3, the support count in the conditional pattern base = 1, which is below the required min_sup of 2.

Now we have a conditional FP-tree. All frequent patterns corresponding to suffix I5 are generated by considering all possible combinations of I5 with the conditional FP-tree.

The same procedure is applied to suffixes I4, I3, and I1.

Note: I2 is not considered as a suffix because it doesn't have any prefix at all.

Page 40: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Why Frequent Pattern Growth Fast ?

Performance studies show that FP-growth is an order of magnitude faster than Apriori.

Reasoning:

No candidate generation, no candidate tests.
Uses a compact data structure.
Eliminates repeated database scans.
The basic operations are counting and FP-tree building.

Page 41: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Agenda

The Apriori Algorithm (Mining single dimensional boolean association rules)

Frequent-Pattern Growth (FP-Growth) Method

Summary

Page 42: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Summary

Association rules are generated from frequent itemsets.

Frequent itemsets are mined using the Apriori algorithm or the Frequent-Pattern Growth method.

The Apriori property states that all subsets of a frequent itemset must also be frequent.

The Apriori algorithm uses frequent itemsets, the join and prune methods, and the Apriori property to derive strong association rules.

The Frequent-Pattern Growth method avoids the repeated database scans of the Apriori algorithm.

The FP-Growth method is faster than the Apriori algorithm.

Page 43: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Thank You!

Page 44: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Mining Topic-Specific Concepts and Definitions on the Web

Irem Incekoy

May 2003, Proceedings of the 12th International Conference on World Wide Web, ACM Press

Bing Liu, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053

Chee Wee Chin, Hwee Tou Ng, National University of Singapore, 3 Science Drive 2, Singapore

Page 45: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

References

Agrawal, R. and Srikant, R. "Fast Algorithms for Mining Association Rules", VLDB-94, 1994.

Anderson, C. and Horvitz, E. “Web Montage: A Dynamic Personalized Start Page”, WWW-02, 2002.

Brin, S. and Page, L. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, WWW7, 1998.

Page 46: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Introduction

When one wants to learn about a topic, one reads a book or a survey paper, or one can read the research papers about the topic.

None of these is very practical. Learning from the Web is convenient, intuitive, and diverse.

Page 47: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Purpose of the Paper

This paper’s task is “mining topic-specific knowledge on the Web”.

The goal is to help people learn in-depth knowledge of a topic systematically on the Web.

Page 48: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Learning about a New Topic

One needs to find definitions and descriptions of the topic.

One also needs to know the sub-topics and salient concepts of the topic.

Thus, one wants the knowledge as presented in a traditional book.

The task of this paper can be summarized as “compiling a book on the Web”.

Page 49: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Proposed Technique

First, identify sub-topics or salient concepts of that specific topic.

Then, find and organize the informative pages containing definitions and descriptions of the topic and sub-topics.

Page 50: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Why are the current search techniques not sufficient?

For definitions and descriptions of the topic: existing search engines rank web pages based on keyword matching and hyperlink structures, which is not very useful for measuring the informative value of a page.

For sub-topics and salient concepts of the topic: a single web page is unlikely to contain information about all the key concepts or sub-topics of the topic. Thus, sub-topics need to be discovered from multiple web pages. Current search engine systems do not perform this task.

Page 51: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Related Work

Web information extraction wrappers

Web query languages

User preference approach

Question answering in information retrieval

• Question answering is the work most closely related to this paper. The objective of a question-answering system is to provide direct answers to questions submitted by the user. In this paper's task, many of the questions are about definitions of terms.

Page 52: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

The Algorithm

WebLearn(T):

1) Submit T to a search engine, which returns a set of relevant pages.

2) The system mines the sub-topics or salient concepts of T using a set S of top-ranking pages from the search engine.

3) The system then discovers the informative pages containing definitions of the topic and sub-topics (salient concepts) from S.

4) The user views the concepts and informative pages. If s/he still wants to know more about the sub-topics, then for each user-interested sub-topic Ti of T, run WebLearn(Ti), as in the sketch below.
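A hypothetical Python skeleton of this loop; the helper functions are stubs standing in for the system's components (search engine, concept miner, definition finder), not APIs from the paper.

def search(topic):
    # Stub for step 1: a real system would query a search engine here.
    print("searching for:", topic)
    return []

def mine_salient_concepts(pages):
    # Stub for step 2: the phrase-mining technique described on later slides.
    return []

def find_definitions(pages, topics):
    # Stub for step 3: the definition-finding technique described later.
    return {}

def user_selected(concepts):
    # Stub for step 4: the sub-topics the user chooses to explore further.
    return []

def web_learn(topic, depth=0, max_depth=2):
    pages = search(topic)                                      # step 1
    concepts = mine_salient_concepts(pages)                    # step 2
    definitions = find_definitions(pages, [topic] + concepts)  # step 3
    for sub in user_selected(concepts):                        # step 4: recurse
        if depth < max_depth:
            web_learn(sub, depth + 1, max_depth)

web_learn("data mining")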

Page 53: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Sub-Topic or Salient Concept Discovery

Observation: sub-topics or salient concepts of a topic are important word phrases, usually emphasized using some HTML tags (e.g., <h1>, …, <h4>, <b>).

However, this alone is not sufficient. Data mining techniques can help find the frequently occurring word phrases.

Page 54: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Sub-Topic Discovery

After obtaining a set of relevant top-ranking pages (using Google), sub-topic discovery consists of the following five steps.

1) Filter out the "noisy" documents that rarely contain sub-topics or salient concepts. The resulting set of documents is the source for sub-topic discovery.

Page 55: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Sub-Topic Discovery

2) Identify important phrases in each page (discover phrases emphasized by HTML markup tags).

Rules to determine if a markup tag can safely be ignored:

It contains a salutation title (Mr., Dr., Professor).
It contains a URL or an email address.
It contains terms related to a publication (conference, proceedings, journal).
It contains an image between the markup tags.
It is too lengthy (the paper uses 15 words as the upper limit).

Page 56: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Sub-Topic Discovery

Also in this step, preprocessing techniques such as stopword removal and word stemming are applied in order to extract quality text segments.

Stopword removal: eliminating words that occur too frequently and have little informational meaning.

Word stemming: finding the root form of a word by removing its suffix.
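For illustration, both preprocessing steps can be done with NLTK (an assumption: the paper does not say which tools were used; this sketch assumes NLTK is installed and its stopword list has been downloaded via nltk.download("stopwords")).

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stem = PorterStemmer().stem

def preprocess(phrase):
    # Drop stopwords, then reduce each remaining word to its root form.
    return [stem(w) for w in phrase.lower().split() if w not in stop]

print(preprocess("the mining of association rules"))
# e.g. ['mine', 'associ', 'rule']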

Page 57: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Sub-Topic Discovery

3) Mine frequently occurring phrases:

Each piece of text extracted in step 2 is stored in a dataset called a transaction set.

Then an association rule miner based on the Apriori algorithm is executed to find the frequent itemsets. In this context, an itemset is a set of words that occur together, and an itemset is frequent if it appears in more than two documents.

Only the first step of the Apriori algorithm is needed, and only frequent itemsets with three words or fewer are required (this restriction can be relaxed).
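A minimal Python sketch of this step, assuming each page is represented as a list of emphasized phrases and each phrase as a list of (stemmed) words; the document-frequency threshold min_docs is a parameter (the slide's criterion is appearing in more than two documents).

from itertools import combinations

def frequent_word_sets(pages, min_docs, max_size=3):
    doc_freq = {}
    for page in pages:
        seen = set()                  # count each word set once per page
        for phrase in page:
            words = sorted(set(phrase))
            for k in range(1, max_size + 1):
                seen.update(frozenset(c) for c in combinations(words, k))
        for ws in seen:
            doc_freq[ws] = doc_freq.get(ws, 0) + 1
    return {ws: n for ws, n in doc_freq.items() if n >= min_docs}

pages = [
    [["associ", "rule"], ["classif"]],
    [["associ", "rule", "mine"]],
    [["classif"], ["associ", "rule"]],
]
print(frequent_word_sets(pages, min_docs=3))
# keeps {'associ'}, {'rule'} and {'associ', 'rule'}, each found in 3 pages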

Page 58: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Sub-Topic Discovery

4) Eliminate itemsets that are unlikely to be sub-topics, and determine the sequence of words in each sub-topic (postprocessing).

Heuristic: if an itemset does not appear alone as an important phrase in any page, it is unlikely to be a main sub-topic, and it is removed.

Page 59: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Sub-Topic Discovery

5) Rank the remaining itemsets. They are regarded as the sub-topics or salient concepts of the search topic and are ranked by the number of pages in which they occur.

Page 60: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Definition Finding

This step tries to identify the pages that include definitions of the search topic and of the sub-topics discovered in the previous step.

Preprocessing steps:

Text that will not be displayed by browsers (e.g., <script>…</script>, <!-- comments -->) is ignored.
Word stemming is applied.
Stopwords and punctuation are kept, as they serve as clues to identify definitions.
HTML tags within a paragraph are removed.

Page 61: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Definition Finding

After that, a set of lexical patterns from the paper is applied to identify definitions:

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web

Page 62: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Definition Finding

Besides using the above patterns, the paper also relies on HTML structuring and hyperlink structures:

1) If a page contains only one header, or one big emphasized text segment at the beginning of the entire document, then the document contains a definition of the concept in the header.

2) Definitions at the second level of the hyperlink structure are also discovered. All the patterns and methods described above are applied to these second-level documents.

Page 63: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Definition Finding

Observation: sometimes no informative page is found for a particular sub-topic, when the pages for the main topic are very general and do not contain detailed information about the sub-topics.

In such cases, the sub-topic can be submitted to the search engine, and sub-sub-topics may be found recursively.

Page 64: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Dealing with Ambiguity

One of the difficult problems in concept mining is the ambiguity of the search terms (e.g., "classification").

A search engine may not return any page in the right context among its top-ranking pages.

Partial solution: add terms that represent the context (e.g., "classification data mining").

Disadvantage: the returned web pages focus more on the context words, since they represent a larger concept.

Page 65: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Dealing with Ambiguity

To handle this problem, first reduce the ambiguity of a search topic by using context words. Then:

1) Find salient concepts only in the segment describing the topic or sub-topic (using HTML structuring tags as cues).

2) Identify those pages that hierarchically organize knowledge of the parent topic. To identify such pages, we can parse the HTML nested list item (e.g., <li>) structure by building a tree.

Page 66: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Dealing with Ambiguity

An example of a well-organized topic hierarchy is given in the paper.

• We confirm whether it is a correct page by checking whether the hierarchy contains at least one other sub-topic of the parent topic.

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web

Page 67: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Dealing with Ambiguity

Find salient concepts enclosed within braces illustrating examples.

Example: "There are many clustering approaches (e.g., hierarchical, partitioning, k-means, k-medoids), and we add that efficiency is important if the clusters contain many points."

The execution of the algorithm can stop when most of the salient concepts found are parallel concepts of the search topic.

Page 68: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Mutual Reinforcement

This method applies to situations where we have already found the sub-topics of a topic and want to go down further, finding the salient concepts of those sub-topics.

Often, when one searches for a sub-topic S1, one also finds important information about another sub-topic S2, due to the ranking algorithm used by the search engine.

This method works in two steps:

1) Submit each sub-topic individually to the search engine.
2) Combine the top-ranking pages from each search into one set, and apply the proposed techniques to the whole set to look for all sub-topics.

Page 69: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

System Architecture

The overall system is composed of five main components:

1) A search engine: a standard web search engine (Google is used in this system).

2) A crawler: crawls the World Wide Web to download the top-ranking pages returned by the search engine, and stores them in the Web Page Depository.

3) A salient concept miner: uses the sub-topic discovery techniques explained before to search the pages stored in the Web Page Depository, in order to identify and extract sub-topics and salient concepts.

Page 70: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

System Architecture

4) A definition finder: uses the technique presented in the definition-finding section to search through the pages stored in the Web Page Depository and find the informative pages containing definitions of the topics and sub-topics.

5) A user interface: enables the user to interact with the system.

Page 71: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

System Architecture

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web

Page 72: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Experimental Study

The size of the set of documents is limited to the first hundred results returned by Google.

Table 1 shows the sub-topics and salient concepts discovered for 28 search topics.

In each box, the first line gives the search topic. For each topic, only the ten top-ranking concepts are listed.

For overly specific topics, only definition finding is meaningful.

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web

Page 73: CSE 634 Data Mining Concepts and Techniques Association Rule Mining
Page 74: CSE 634 Data Mining Concepts and Techniques Association Rule Mining
Page 75: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Experimental Study

In Table 2, the precision of the definition-finding task is compared with the Google search engine and with AskJeeves, the web's premier question-answering system.

The first 10 pages of the system's results are compared with the first 10 pages returned by Google and AskJeeves. To make the comparison fair, the authors also look for definitions in the second level of the search results returned by Google and AskJeeves.

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web

Page 76: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Table 2

Page 77: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Experimental Study

Table 3 presents the results for ambiguity handling, applying the respective methods explained before.

Column 1 lists two ambiguous topics, "data mining" and "time series". Column 2 lists the sub-topics identified using the original technique.

Column 3 lists the sub-topics discovered using the respective parent topics as context terms.

Column 4 uses the ambiguity-handling techniques. Column 5 applies mutual reinforcement in addition to the others.

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web

Page 78: CSE 634 Data Mining Concepts and Techniques Association Rule Mining
Page 79: CSE 634 Data Mining Concepts and Techniques Association Rule Mining
Page 80: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Conclusions

The proposed techniques aim at helping Web users learn an unfamiliar topic in depth and systematically.

This is an efficient system for discovering and organizing knowledge on the Web, in a way similar to a traditional book, to assist learning.

Page 81: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Effective Personalization Based on Association Rule Discovery from Web Usage Data

Mikhail Bautin

Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa

DePaul University, 243 S. Wabash Ave., Chicago, Illinois 60604, USA (2001)

Page 82: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

References

B. Mobasher, H. Dai, T. Luo and M. Nakagawa: "Effective Personalization Based on Association Rule Discovery from Web Usage Data", in Proc. the 3rd ACM Workshop on Web Information and Data Management (WIDM01) (2001).

R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Proceedings of the High Performance Data Mining Workshop, Puerto Rico, 1999.

R. Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. 20th Int. Conference on Very Large Data Bases, VLDB94, 1994.

Page 83: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Goal

Personalize a web site:

Predict the actions of the user (pre-fetching, etc.).

Recommend new items to a customer based on viewed items and knowledge of what other customers are interested in: "Customers who buy this also buy that…"

Page 84: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Approaches

Collaborative filtering:
Find the top k users who have similar tastes or interests (k-nearest-neighbor).
Predict actions based on what those users did.
Too much online computation needed.

Association rules:
Scalable: constant-time query processing.
Better precision and coverage than CF.

Page 85: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Data Preparation

Input: web server logs.

Steps:
User identification (trivial if using cookies).
Session and transaction identification.
Page view identification (for multi-frame sites).

As a result of preparation:
Records correspond to transactions.
Items correspond to page views.
The order of page views does not matter.

Page 86: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Pattern Discovery

Run the Apriori algorithm, with records as transactions and items as page views, under minimum support and confidence restrictions.

Problem with a global minimum support value: important but rare items can be discarded.

Solution: multiple minimum support values. For an itemset {p1, …, pn}, the support requirement is derived from the individual items' minimum support values.
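The slide's formula did not survive transcription; a common formulation, assumed here, requires the itemset's support to reach the smallest of its items' individual minimum supports, as in this Python sketch.

def meets_multiple_minsup(itemset, sup, minsup):
    # Assumed criterion: support({p1,...,pn}) >= min(minsup[p1],...,minsup[pn]),
    # so itemsets containing a rare-but-important page view are not discarded.
    return sup(itemset) >= min(minsup[p] for p in itemset)

minsup = {"p1": 0.05, "p2": 0.01}   # the rare page p2 gets a lower threshold
observed = lambda s: 0.02           # stand-in for the measured support
print(meets_multiple_minsup({"p1", "p2"}, observed, minsup))   # True: 0.02 >= 0.01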

Page 87: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Recommendation Engine

Fixed-size sliding window w: the subset of the |w| most recent page views.

We need to find rules with w on the left-hand side. This is done with a depth-first search over the frequent itemset graph: sort the elements of w lexicographically; then only O(|w|) work is needed to find the itemset, and O(number of page views) to produce the recommendations. A sketch follows below.
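A simplified Python sketch of the lookup (illustrative: it scans a dictionary of frequent itemsets instead of descending the lexicographic itemset graph, but computes the same confidences; the numbers reuse the example two slides down).

def recommend(window, counts, min_conf):
    # `counts` maps frozenset itemsets to support counts.
    w = frozenset(window)
    if w not in counts:
        return []
    recs = []
    for itemset, c in counts.items():
        if len(itemset) == len(w) + 1 and w < itemset:
            (page,) = itemset - w            # the single extension item
            conf = c / counts[w]
            if conf >= min_conf:
                recs.append((page, conf))
    return sorted(recs, key=lambda x: -x[1])

counts = {frozenset("BE"): 5, frozenset("ABE"): 5, frozenset("BCE"): 4}
print(recommend("BE", counts, 0.5))   # [('A', 1.0), ('C', 0.8)]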

Page 88: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Frequent Itemset Graph

Figure 1 from the paper (Mobasher et al.)

Page 89: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Example

Active session window w = {B, E}. Solid lines denote "lexicographic" extension; stippled lines denote any extension. The search leads to node BE (count 5) at level 3, with possible extensions A and C. Confidence is calculated as the support count of the extended itemset divided by the support count of w: for A it is 5/5 = 1, for C it is 4/5.

Page 90: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Window size vs minsup

For a large window size, it might be difficult to find frequent enough itemsets, but a larger window gives better accuracy.

Solution: the "all-kth-order" method:
Start with the largest possible window size.
Reduce the window size until able to generate a recommendation.
No additional computation is incurred.

Page 91: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Evaluation Methodology

For each transaction t, the first n page views are used for generating recommendations, and the last |t| − n are used for testing:

as_t: the active session, the subset of the first n elements of t
τ: the minimum required confidence
R(as_t, τ): the set of recommendations
eval_t: the last |t| − n page views of t

Page 92: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Measures of Evaluation

The confidence threshold τ ranges from 0.1 to 1.
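The slide's precision and coverage formulas were a figure; the Python sketch below assumes the standard definitions precision = |R ∩ eval_t| / |R| and coverage = |R ∩ eval_t| / |eval_t|.

def evaluate(transaction, n, recommend, tau):
    as_t = set(transaction[:n])      # first n page views: the active session
    eval_t = set(transaction[n:])    # remaining page views: the test set
    R = {p for p, conf in recommend(as_t) if conf >= tau}
    hits = R & eval_t
    precision = len(hits) / len(R) if R else 0.0
    coverage = len(hits) / len(eval_t) if eval_t else 0.0
    return precision, coverage

fake_engine = lambda s: [("A", 1.0), ("C", 0.8)]   # hypothetical engine output
print(evaluate(["B", "E", "A", "D"], 2, fake_engine, 0.5))   # (0.5, 0.5)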

Page 93: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Impact of Window Size

Figure 2 from the paper (Mobasher et al.)

Page 94: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Single vs Multiple Min. Support

Figure 3 from the paper (Mobasher et al.)

Page 95: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

The all-kth-order Model

Figure 4 from the paper (Mobasher et al.)

Page 96: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Association Rules vs kNN

Figure 5 from the paper (Mobasher et al.)

Page 97: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Conclusions

Personalization based on association rules is better than the k-nearest-neighbor approach:
Faster: very little online computation, and therefore better scalability.
Better precision.
Better coverage.

An effective alternative to standard collaborative filtering mechanisms for personalization.

Page 98: CSE 634 Data Mining Concepts and Techniques Association Rule Mining

Thank you!