22
1 University of Florida CISE department Gator Engineering Introduction to Data Mining Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville [email protected] Data Mining Sanjay Ranka Fall 2003 2 University of Florida CISE department Gator Engineering Course Overview • Introduction to Data Mining • Important data mining primitives: – Classification – Clustering – Association Rules – Sequential Rules – Anomaly Detection • Commercial and Scientific Applications

dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

  • Upload
    vuminh

  • View
    216

  • Download
    4

Embed Size (px)

Citation preview

Page 1: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

1

University of Florida CISE department Gator Engineering

Introduction to Data Mining

Dr. Sanjay RankaProfessor

Computer and Information Science and EngineeringUniversity of Florida, Gainesville

[email protected]

Data Mining Sanjay Ranka Fall 2003 2

University of Florida CISE department Gator Engineering

Course Overview

• Introduction to Data Mining

• Important data mining primitives:

– Classification

– Clustering

– Association Rules

– Sequential Rules

– Anomaly Detection

• Commercial and Scientific Applications

Page 2: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

2

Data Mining Sanjay Ranka Fall 2003 3

University of Florida CISE department Gator Engineering

• Background required:– General background in algorithms and programming

• Grading scheme:– 4 to 6 home works (10%)

– 3 in-class exams ( 30% each )

– Last exam may be replaced by a project

• Textbook:– Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, 2003

– Data Mining: Concepts and Techniques by JiaweiHan and Micheline Kamber, 2000

Data Mining Sanjay Ranka Fall 2003 4

University of Florida CISE department Gator Engineering

Data Mining

1b55

2b44

4b33

3a22

2a10

QCISelection CleaningTransformation

1b55

2b44

4b33

3a22

2a10

QCIMining

Interpretation/Optimizing Processes

Non trivial extraction of nuggets

from large amounts of data

Page 3: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

3

Data Mining Sanjay Ranka Fall 2003 5

University of Florida CISE department Gator Engineering

Data Mining is not …

• Generating multidimensional cubes of a relational table

Source: Multidimensional OLAP vs. Relational OLAP by Colin White

Data Mining Sanjay Ranka Fall 2003 6

University of Florida CISE department Gator Engineering

Data Mining is not …

• Searching for a phone number in a phone book

• Searching for keywords on Google

Page 4: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

4

Data Mining Sanjay Ranka Fall 2003 7

University of Florida CISE department Gator Engineering

Data Mining is not …

• Generating a histogram of salaries for different age groups

• Issuing SQL query to a database, and reading the reply

Data Mining Sanjay Ranka Fall 2003 8

University of Florida CISE department Gator Engineering

Data Mining is …

• Finding groups of people with similar hobbies

• Are chances of getting cancer higher if you live near a power line?

Page 5: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

5

Data Mining Sanjay Ranka Fall 2003 9

University of Florida CISE department Gator Engineering

Data Mining Tasks

• Prediction methods

– Use some variables to predict unknown or future values of the same or other variables

• Description methods

– Find human interpretable patterns that describe data

From Fayyad, et al., Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Sanjay Ranka Fall 2003 10

University of Florida CISE department Gator Engineering

Important Data Mining Primitives

Tid Refund Marital

Status

Taxable

Income Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes

11 No Married 60K No

12 Yes Divorced 220K No

13 No Single 85K Yes

14 No Married 75K No

15 No Single 90K Yes 10

Predictive M

odeling

Clustering

Assoc

iation

Rules

Anomaly/Deviation

Detection

Milk

Data

Page 6: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

6

Data Mining Sanjay Ranka Fall 2003 11

University of Florida CISE department Gator Engineering

Data Mining Tasks …

• Classification (predictive)

• Clustering (descriptive)

• Association Rule Discovery (descriptive)

• Sequential Pattern Discovery (descriptive)

• Regression (predictive)

• Deviation Detection (predictive)

Data Mining Sanjay Ranka Fall 2003 12

University of Florida CISE department Gator Engineering

Why is Data Mining prevalent?Lots of data is collected and stored in

data warehouses• Business

– Wal-Mart logs nearly 20 million transactions per day

• Astronomy

– Telescope collecting large amounts of data (SDSS)

• Space

– NASA is collecting peta bytes of data from satellites

• Physics

– High energy physics experiments are expected to generate 100 to 1000 tera bytes in the next decade

Page 7: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

7

Data Mining Sanjay Ranka Fall 2003 13

University of Florida CISE department Gator Engineering

Why is Data Mining prevalent?Quality and richness of data collected

in improving

• Retailers– Scanner data is much more accurate than other means

• E-Commerce– Rich data on consumer browsing

• Science– Accuracy of sensors is improving

Data Mining Sanjay Ranka Fall 2003 14

University of Florida CISE department Gator Engineering

Why is Data Mining prevalent?The gap between data and analysts is increasing

• Hidden information is not always evident• High cost of human labor• Much of data is never analyzed at all

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

4,000,000

1995 1996 1997 1998 1999

The Data Gap

Total new disk (TB) since 1995

Number of

analysts

Ref: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications

Page 8: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

8

Data Mining Sanjay Ranka Fall 2003 15

University of Florida CISE department Gator Engineering

Origins of Data Mining

• Drawn ideas from Machine Learning, Pattern Recognition, Statistics, and Database systems for applications that have

– Enormity of data

– High dimensionality of data

– Heterogeneous data

– Unstructured data

Database Systems

PatternRecognition

Statistics

Data Mining Sanjay Ranka Fall 2003 16

University of Florida CISE department Gator Engineering

Regression

• Predict the value of a given continuous valued variable based on the values of other variables, assuming a linear or non-linear model of dependency

• Extensively studied in the fields of Statistics and Neural Networks

• Examples– Predicting sales numbers of a new product based on advertising expenditure

– Predicting wind velocities based on temperature, humidity, air pressure, etc

– Time series prediction of stock market indices

Page 9: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

9

Data Mining Sanjay Ranka Fall 2003 17

University of Florida CISE department Gator Engineering

Regression

• Linear regression

– Data is modeled using a straight line

– Y = a + bX

• Non-linear regression

– Data is more accurately correctly modeled using a nonlinear function

– Y = a + b� f(X)

Data Mining Sanjay Ranka Fall 2003 18

University of Florida CISE department Gator Engineering

Association Rule Discovery• Given a set of

transactions, each of which is a set of items, find all rules (X � Y ) that satisfy user specified minimum support and confidence constraints

• Support = (#T containing X and Y) / (#T)

• Confidence = (#T containing X and Y ) / (#T containing X)

• Applications– Cross selling and up selling– Supermarket shelf

management

• Some rules discovered– Bread � Peanut Butter

• support=60%, confidence=75%

– Peanut Butter � Bread• support=60%, confidence=100%

– Jelly � Peanut Butter• support=20%, confidence=100%

– Jelly �Milk• support=0%

Bread, Jelly, Peanut ButterT1

Beer, MilkT5

Beer, BreadT4

Bread, Milk, Peanut ButterT3

Bread, Peanut ButterT2

ItemsTransaction

Source: Data Mining – Introductory and Advanced topics by Margaret Dunham

Page 10: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

10

Data Mining Sanjay Ranka Fall 2003 19

University of Florida CISE department Gator Engineering

Association Rule Discovery: Definition

• Given a set of records, each of which contain some number of items from a given collection:

– Produce dependency rules which will predict occurrence of an item based on occurrences of other items

• Example:

– {Bread} � {Peanut Butter}

– {Jelly} � {Peanut Butter}

Data Mining Sanjay Ranka Fall 2003 20

University of Florida CISE department Gator Engineering

Association Rule Discovery:Marketing and sales promotion

• Say the rule discovered is{Bread, …. } � {Peanut Butter}

• Peanut Butter as a consequent: can be used to determine what products will boost its sales

• Bread as an antecedent: can be used to see which products will be impacted if the store stops selling bread (e.g. cheap soda is a “loss leader”for many grocery stores.)

• Bread as an antecedent and Peanut Butter as a consequent: can be used to see what products should be stocked along with Bread to promote the sale of Peanut Butter

Page 11: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

11

Data Mining Sanjay Ranka Fall 2003 21

University of Florida CISE department Gator Engineering

Association Rule Discovery:Super market shelf management

• Goal: To identify items that are bought concomitantly by a reasonable fraction of customers so that they can be shelved appropriately based on business goals.

• Data Used: Point-of-sale data collected with barcode scanners to find dependencies among products

• Example– If a customer buys Jelly, then he is very likely to buy Peanut Butter.

– So don’t be surprised if you find Peanut Butter next to Jelly on an aisle in the super market. Also, salsa next to tortilla chips.

Data Mining Sanjay Ranka Fall 2003 22

University of Florida CISE department Gator Engineering

Association Rule Discovery:Inventory Management

• Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products, and wants to keep the service vehicles equipped with frequently used parts to reduce the number of visits to consumer household.

• Approach: Process the data on tools and parts required in repairs at different consumer locations and discover the co-occurrence patterns

Page 12: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

12

Data Mining Sanjay Ranka Fall 2003 23

University of Florida CISE department Gator Engineering

Association Rules: Apparel

Jeans, Shoes, Shorts, T-ShirtT20Jeans, Shoes, T-ShirtT10

JeansT19JeansT9

Jeans, Shoes, Shorts, T-ShirtT18Jeans, Shoes, Shorts, T-ShirtT8

Blouse, Jeans, SkirtT17Jeans, SkirtT7

Skirt, T-ShirtT16Shoes, T-ShirtT6

Jeans, T-ShirtT15Jeans, ShortsT5

Shoes, Skirt, T-ShirtT14Jeans, Shoes, T-ShirtT4

Jeans, Shoes, Shorts, T-ShirtT13Jeans, T-ShirtT3

Blouse, Jeans, Shoes, Skirt, T-ShirtT12Shoes, Skirt, T-ShirtT2

T-ShirtT11BlouseT1

ItemsTransactionItemsTransaction

{Jeans, T-Shirt, Shoes} � { Shorts}Support: 20% Confidence: 100%

Source: Data Mining – Introductory and Advanced topics by Margaret Dunham

Data Mining Sanjay Ranka Fall 2003 24

University of Florida CISE department Gator Engineering

Classification: Definition

• Given a set of records (called the training set)

– Each record contains a set of attributes. One of the attributes is the class

• Find a model for the class attribute as a function of the values of other attributes

• Goal: Previously unseen records should be assigned to a class as accurately as possible

– Usually, the given data set is divided into training and test set, with training set used to build the model and test set used to validate it. The accuracy of the model is determined on the test set.

Page 13: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

13

Data Mining Sanjay Ranka Fall 2003 25

University of Florida CISE department Gator Engineering

Classification Example

Tid Refund Marital

Status

Taxable

Income Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes10

categorical

categorical

continuous

class

Refund Marital

Status

Taxable

Income Cheat

No Single 75K ?

Yes Married 50K ?

No Married 150K ?

Yes Divorced 90K ?

No Single 40K ?

No Married 80K ?10

TestSet

Training Set

ModelLearn

Classifier

Data Mining Sanjay Ranka Fall 2003 26

University of Florida CISE department Gator Engineering

Decision Tree

Classification

• Modeling a class attribute, using other attributes

• Applications– Targeted marketing

– Customer attrition

Gender

Height Height

Short Medium Tall Medium TallShort

= F = M

< 1.3m > 1.8m < 1.5m > 2 m

Medium1.75 mFLynette

Medium1.8 mFAmy

Tall1.9 mFKim

Medium1.95 mMTodd

Medium1.8 mFDebbie

Tall2.1 mMSteven

Tall2.2 mMWorth

Medium1.7 mMDave

Medium1.6 mFKathy

Medium1.85 mMBob

Medium1.7 mFStephanie

Tall1.88 mFMartha

Tall1.9 mFMaggie

Medium2 mMJim

Medium1.6 mFKristina

OutputHeightGenderName

Source: Data Mining – Introductory and Advanced topics by Margaret Dunham

Page 14: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

14

Data Mining Sanjay Ranka Fall 2003 27

University of Florida CISE department Gator Engineering

Classification: Direct Marketing

• Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell phone product

• Approach:– Use the data collected for a similar product introduced in the

recent past.

– Use the profiles of customers along with their {buy, didn’t buy} decision. The latter becomes the class attribute.

– The profile of the information may consist of demographic, lifestyle and company interaction.• Demographic – Age, Gender, Geography, Salary

• Psychographic – Hobbies

• Company Interaction –Recentness, Frequency, Monetary

– Use these information as input attributes to learn a classifier model

Source: Data Mining Techniques, Berry and Linoff, 1997

Data Mining Sanjay Ranka Fall 2003 28

University of Florida CISE department Gator Engineering

Classification: Fraud Detection

• Goal: Predict fraudulent cases in credit card transactions

• Approach:

– Use credit card transactions and the information on its account holders as attributes (important information: when and where the card was used)

– Label past transactions as {fraud, fair} transactions. This forms the class attribute

– Learn a model for the class of transactions

– Use this model to detect fraud by observing credit card transactions on an account

Page 15: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

15

Data Mining Sanjay Ranka Fall 2003 29

University of Florida CISE department Gator Engineering

Classification: Customer Churn

• Goal: To predict whether a customer is likely to be lost to a competitor

• Approach:

– Use detailed record of transaction with each of the past and current customers, to find attributes• How often does the customer call, Where does he call, What time of the day does he call most, His financial status, His marital status, etc. (Important Information: Expiration of the current contract).

– Label the customers as {churn, not churn}

– Find a model for Churn

Source: Data Mining Techniques, Berry and Linoff, 1997

Data Mining Sanjay Ranka Fall 2003 30

University of Florida CISE department Gator Engineering

Classification:Sky survey cataloging

• Goal: To predict class {star, galaxy} of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory)– 3000 images with 23,040 x 23,040 pixels per image

• Approach:– Segment the image

– Measure image attributes (40 of them) per object

– Model the class based on these features

• Success story: Could find 16 new high red-shift quasars (some of the farthest objects that are difficult to find) !!! Source: Advances in Knowledge Discovery and Data Mining, Fayyad et al., 1996

Page 16: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

16

Data Mining Sanjay Ranka Fall 2003 31

University of Florida CISE department Gator Engineering

Classification:Classifying Galaxies

Early

Intermediate

Late

Data Size: • 72 million stars, 20 million galaxies• Object Catalog: 9 GB

• Image Database: 150 GB

Class: • Stages of Formation

Attributes:• Image features, • Characteristics of light waves received, etc.

Source: Minnesota Automated Plate Scanner Catalog, http://aps.umn.edu

Data Mining Sanjay Ranka Fall 2003 32

University of Florida CISE department Gator Engineering

• Determine object groupings such that objects within the same cluster are similar to each other, while objects in different groups are not

• Typically objects are represented by data points in a multidimensional space with each dimension corresponding to one or more attributes. Clustering problem in this case reduces to the following:– Given a set of data points, each having a set of attributes, and a

similarity measure, find clusters such that• Data points in one cluster are more similar to one another

• Data points in separate clusters are less similar to one another

• Similarity measures:– Euclidean distance if attributes are continuous

– Other problem-specific measures

Clustering

Page 17: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

17

Data Mining Sanjay Ranka Fall 2003 33

University of Florida CISE department Gator Engineering

Clustering Example

• Euclidean distance based clustering in 3D space

– Intra cluster distances are minimized

– Inter cluster distances are maximized

Data Mining Sanjay Ranka Fall 2003 34

University of Florida CISE department Gator Engineering

Clustering: Market Segmentation

• Goal: To subdivide a market into distinct subset of customers where each subset can be targeted with a distinct marketing mix

• Approach:

– Collect different attributes of customers based on their geographical and lifestyle related information

– Find clusters of similar customers

– Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters

Page 18: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

18

Data Mining Sanjay Ranka Fall 2003 35

University of Florida CISE department Gator Engineering

Clustering: Document Clustering

• Goal: To find groups of documents that are similar to each other based on important terms appearing in them

• Approach: To identify frequently occurring terms in each document. Form a similarity measure based on frequencies of different terms. Use it to generate clusters

• Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents

Data Mining Sanjay Ranka Fall 2003 36

University of Florida CISE department Gator Engineering

Clustering: Document Clustering Example

• Clustering points: 3024 articles of Los Angeles Times

• Similarity measure: Number of common words in documents (after some word filtering)

278354Entertainment

573738Sports

746943Metro

36273National

260341Foreign

364555Financial

Correctly placed articlesTotal articlesCategory

Page 19: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

19

Data Mining Sanjay Ranka Fall 2003 37

University of Florida CISE department Gator Engineering

Clustering: S&P 500 stock data

• Observe stock movements everyday

• Clustering points: Stock – {UP / DOWN}

• Similarity measure: Two points are more similar if the events described by them frequently happen together on the same day

D i s c o v e r e d C l u s te r s I n d u s tr y G r o u p

1A p p lie d -M a t l- D O W N , B a y -N e t w o rk - D o w n , 3 - C O M -D O W N ,

C a b le t ro n -S y s -D O W N ,C IS C O - D O W N ,H P -D O W N ,

D S C - C o m m - D O W N ,I N T E L - D O W N , L S I - L o g ic -D O W N ,

M ic ro n -T e c h -D O W N ,T e xa s - In s t -D o w n ,T e l la b s - In c -D o w n ,

N a t l- S e m ic o n d u c t -D O W N , O ra c l- D O W N , S G I- D O W N ,

S u n -D O W N

T e c h n o lo g y1 - D O W N

2A p p le -C o m p - D O W N ,A u t o d es k -D O W N , D E C - D O W N ,

A D V -M ic ro - D e v ic e -D O W N ,A n d re w - C o rp - D O W N ,

C o m p u t e r -A s s o c -D O W N , C i rc u it - C it y -D O W N ,

C o m p a q -D O W N , EM C - C o rp - D O W N , G e n - In s t -D O W N ,

M o t o ro la -D O W N ,M ic ro s o f t -D O W N ,S c ie n t if ic -A t l -D O W N

T e c h n o lo g y2 - D O W N

3F a n n ie -M a e - D O W N ,F e d - H o m e - L o a n - D O W N ,

M B N A - C o rp -D O W N ,M o rg a n - S t a n le y -D O W N F in a n c ia l- D O W N

4B a k e r - H u g h e s -U P , D re s s e r - In d s -U P , H a l l ib u r t o n -H LD -U P ,

L o u is ia n a - L a n d - U P , P h i ll ip s -P e t ro - U P , U n o c a l- U P ,

S c h lu m b e rg e r - U PO il- U P

Data Mining Sanjay Ranka Fall 2003 38

University of Florida CISE department Gator Engineering

Deviation / Anomaly Detection

• Some data objects do not comply with the general behavior or model of the data. Data objects that are different from or inconsistent with the remaining set are called outliers

• Outliers can be caused by measurement or execution error. Or they represent some kind of fraudulent activity.

• Goal of Deviation / Anomaly Detection is to detect significant deviations from normal behavior

Page 20: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

20

Data Mining Sanjay Ranka Fall 2003 39

University of Florida CISE department Gator Engineering

Deviation / Anomaly Detection: Definition

• Given a set of n data points or objects, and k, the expected number of outliers, find the top k objects that considerably dissimilar, exceptional or inconsistent with the remaining data

• This can be viewed as two sub problems– Define what data can be considered as inconsistent in a given data set

– Find an efficient method to mine the outliers so defined

Data Mining Sanjay Ranka Fall 2003 40

University of Florida CISE department Gator Engineering

Deviation:Credit Card Fraud Detection

• Goal: To detect fraudulent credit card transactions

• Approach:

– Based on past usage patterns, develop model for authorized credit card transactions

– Check for deviation form model, before authenticating new credit card transactions

– Hold payment and verify authenticity of “doubtful” transactions by other means (phone call, etc.)

Page 21: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

21

Data Mining Sanjay Ranka Fall 2003 41

University of Florida CISE department Gator Engineering

Anomaly Detection:Network Intrusion Detection

• Goal: To detect intrusion of a computer network

• Approach:

– Define and develop a model for normal user behavior on the computer network

– Continuously monitor behavior of users to check if it deviates from the defined normal behavior

– Raise an alarm, if such deviation is found

Data Mining Sanjay Ranka Fall 2003 42

University of Florida CISE department Gator Engineering

Sequential Pattern Discovery: Definition

• Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events

(A B) (C) (D E)

Page 22: dm1 - Computer & Information Science & Engineering …€¦ ·  · 2008-12-16Data Mining Sanjay Ranka Fall 2003 7 University of Florida CISE department Gator Engineering Data Mining

22

Data Mining Sanjay Ranka Fall 2003 43

University of Florida CISE department Gator Engineering

Sequential Pattern Discovery: Telecommunication Alarm Logs

• Telecommunication alarm logs

– (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) � (Fire_Alarm)

Data Mining Sanjay Ranka Fall 2003 44

University of Florida CISE department Gator Engineering

Sequential Pattern Discovery:Point of Sale Up Sell / Cross Sell

Point of sale transaction sequences

– Computer bookstore

• (Intro_to_Visual_C) (C++ Primer) �(Perl_For_Dummies, Tcl_Tk)

– Athletic apparel store

• (Shoes) (Racket, Racket ball) � (Sports_Jacket)