CS157B Fall 04 Introduction to Data Mining Chapter 22.3 Professor Lee Yu, Jianji (Joseph)





CS157B Fall 04

Introduction to Data Mining

Chapter 22.3

Professor Lee

Yu, Jianji (Joseph)


Today's Presentation covers:

1. What is Data Mining?
2. Data Mining Objectives
3. Data Mining Operations
4. Knowledge Discovery
5. Application of Data Mining
6. Summary
7. References


Overview of Data Mining

[Diagram: Data Mining shown together with four related fields: Statistics, Databases, Artificial Intelligence, and Visualization]


1. What is Data Mining?

➔ We usually use Data Mining to:
– Discover useful, previously unknown knowledge by analyzing large and complex databases
– Perform knowledge discovery, exploratory data analysis, applied statistics, and machine learning
– Search for valuable information in large databases


2. Data Mining Objectives

➔ Find rules and patterns in large-volume databases
➔ Discovery
– Finding human-understandable patterns describing the data
➔ Prediction
– Using some variables or fields in the database to predict unknown or future values of other variables of interest


Data Mining Objectives

➔ Knowledge Discovery
– A stage somewhat prior to prediction, where information is insufficient
– It is close to decision support


3. Data Mining Operations

➔ Associations
➔ Sequential Patterns
➔ Time-Series Clustering
➔ Classification
➔ Segmentation
➔ And many more!


Association

● Used to find all rules in basket data

● Basket data is also called transaction data

● Analyzes how the items purchased by customers in a shop are related


Association...
● A formal definition:
● Let $I = \{i_1, i_2, \ldots, i_m\}$ be the total set of items, $D$ a set of transactions, and $d$ one transaction consisting of a set of items, $d \subseteq I$
● Association rule:
● $X \Rightarrow Y$ where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$
● Support = (# of transactions containing $X \cup Y$) / $|D|$
● Support: number of instances predicted correctly
● Confidence: number of correct predictions, as a proportion of all instances the rule applies to
● Confidence = (# of transactions containing $X \cup Y$) / (# of transactions containing $X$)


Association...
● Example of transaction data:
– Transaction 1: CD player, music CD, music book
– Transaction 2: CD player, music CD
– Transaction 3: music CD, music book
– Transaction 4: CD player
● I = {CD player, music CD, music book}
● |D| = 4
● # of transactions containing both CD player and music CD = 2
● # of transactions containing CD player = 3
● For the rule CD player ==> music CD: Support = 2/4, Confidence = 2/3
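
The support and confidence figures in this example can be checked with a few lines of Python. This is only a minimal sketch: the function names `support` and `confidence` are illustrative choices, not part of the slides, and each transaction is modeled as a plain set of item names.

```python
# The four transactions from the example, each modeled as a set of items.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

def support(x, y, transactions):
    """Fraction of all transactions that contain every item of X and of Y."""
    both = x | y
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(x, y, transactions):
    """Among the transactions containing X, the fraction that also contain Y."""
    both = x | y
    with_x = sum(1 for t in transactions if x <= t)
    with_both = sum(1 for t in transactions if both <= t)
    return with_both / with_x if with_x else 0.0

# Rule: CD player ==> music CD
x, y = {"CD player"}, {"music CD"}
print(support(x, y, transactions))     # 2 of 4 transactions
print(confidence(x, y, transactions))  # 2 of the 3 transactions with a CD player
```

Sets make the "transaction contains X" test a simple subset check (`x <= t`), which mirrors the formal definition directly.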


Applying Association Rule...

● Example: Books that tend to be bought together. If a customer buys a book, an online bookstore (e.g., Amazon.com) may suggest other associated books.

● Example: If a person buys a laptop, the salesperson may suggest accessories that tend to be bought along with laptops.


Time Series Clustering
● Given:
– A database of time series
● Find:
– Groups of similar time series
● Sample Applications:
– Determine products with similar selling patterns
– Identify companies with similar patterns of growth
– Find stocks with similar price movements
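
The slides do not say how similarity between time series is measured, so the following is only a rough sketch under stated assumptions: similarity is taken to be Pearson correlation, and series are grouped greedily against the first member of each group. The names `pearson`, `group_similar`, the `threshold` value, and the sales figures are all hypothetical, and this greedy grouping is a simplification rather than a standard clustering algorithm.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / sqrt(var_a * var_b)

def group_similar(series, threshold=0.9):
    """Greedy grouping: join the first group whose first member correlates
    with this series at or above the threshold, else start a new group."""
    groups = []
    for name, values in series.items():
        for group in groups:
            if pearson(series[group[0]], values) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Made-up monthly sales figures for three hypothetical products.
sales = {
    "product_a": [10, 12, 15, 19, 24],
    "product_b": [20, 25, 29, 40, 47],
    "product_c": [30, 26, 21, 15, 11],
}
print(group_similar(sales))
```

Here the two rising series end up in one group and the falling series in another, which corresponds to the "products with similar selling patterns" application.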


Classification
● Classification

– Problem: Given that items belong to one of several classes, and given past instances (aka training instances) of items along with the classes to which they belong, the problem is to PREDICT the class to which a new item belongs

– The class of the new instance is not known, so other attributes of the instance must be used to predict the class.

– It can be done by finding rules that partition the given data into disjoint groups


Classification...

● The dataset is usually in the form of a relational table.
● The data has a set of distinct attributes.
● Each data record is also labeled with a class.
● Goal: To build a model or learn rules that can be used to predict the classes of new cases.
● Training data are used to build this model.


Classification...
● For example
– Suppose that a credit card company wants to decide whether or not to give a credit card to an applicant
● The company has a variety of information about the person, such as their age, educational background, income, etc.
● Then they will rank the applicants (categorize them into classes)
● Forall person P, P.degree = masters AND P.income > 75,000 ==> P.credit = excellent
● Forall person P, P.degree = bachelors OR (P.income >= 25,000 AND P.income <= 75,000) ==> P.credit = good
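
The two credit rules above translate almost directly into code. The sketch below is illustrative only: the function name `credit_rating` is hypothetical, the rules are checked in the order given, and returning `None` for applicants matched by neither rule is an assumption, since the slides do not say what happens in that case.

```python
def credit_rating(degree, income):
    """Apply the two classification rules from the slide, in order."""
    # Rule 1: masters degree AND income above 75,000 ==> excellent
    if degree == "masters" and income > 75_000:
        return "excellent"
    # Rule 2: bachelors degree OR income between 25,000 and 75,000 ==> good
    if degree == "bachelors" or 25_000 <= income <= 75_000:
        return "good"
    # Assumption: the slides give no rule for everyone else.
    return None

print(credit_rating("masters", 90_000))
print(credit_rating("bachelors", 10_000))
print(credit_rating("masters", 50_000))
print(credit_rating("none", 5_000))
```

Note that an applicant with a masters degree and an income of 50,000 falls through to the second rule because of the income range, so the order of the checks matters.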


Classification...
● Table:

Age   Smoke   Risk
--------------------
20    No      Low
25    Yes     High
44    Yes     High
18    No      Low
55    No      High
35    No      Low

● To identify the risk (we have two groups):
– Risk = Low and Risk = High


Classification...

● The following techniques can be used for classification:
– Decision Trees
– Predictive Modeling
– Using association rules
– Neural networks
– etc.


Decision Trees

● “Divide-and-conquer” approach produces a tree
● Nodes involve testing a particular attribute
● Usually, the attribute value is compared to a constant
● Other possibilities:
– Comparing values of two attributes
– Using a function of one or more attributes
● Leaves assign a classification, a set of classifications, or a probability distribution to instances

● Unknown instance is routed down the tree


Decision Tree

● In short, a decision tree is just a series of nested if/then rules.

[Figure: decision tree for our previous example]

Smoke?
– Yes: Risk = High
– No: test Age
    – 0-35: Risk = Low
    – 36-100: Risk = High
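
The "nested if/then rules" view of this tree can be written out directly. The sketch below assumes the tree tests Smoke first and then splits Age at 35, which is how the figure labels read and is consistent with the Age/Smoke/Risk table from the earlier classification slide; the function name `predict_risk` is hypothetical.

```python
def predict_risk(age, smokes):
    """Route one instance down the tree: Smoke first, then Age."""
    if smokes:
        return "High"
    else:
        if age <= 35:      # Age branch 0-35
            return "Low"
        else:              # Age branch 36-100
            return "High"

# The six rows of the Age / Smoke / Risk table, used here as a sanity check.
training_rows = [
    (20, False, "Low"),
    (25, True, "High"),
    (44, True, "High"),
    (18, False, "Low"),
    (55, False, "High"),
    (35, False, "Low"),
]

for age, smokes, risk in training_rows:
    print(age, smokes, predict_risk(age, smokes) == risk)
```

Every row of the table is classified correctly by these rules, which is what we would expect from a tree built on that training data.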


Predictive Modeling

● Predict values based on similar groups of data
● Pattern Recognition
– Association of an observation to past experience or knowledge
– Interchangeable with classification
● Estimation
– Assign to an observation one of an infinite number of possible numeric labels


4. Knowledge Discovery

● Find patterns in a database
– For example, if someone buys one thing, what else will they buy next?
● Interesting + Certain = Knowledge
– The output is usually called “Discovered Knowledge”
● KDD – Knowledge Discovery in Databases
● A non-trivial process of identifying valid, potentially useful, and understandable patterns in data


KDD – Knowledge Discovery in Databases...

● Advances in traditional tasks in data analysis
– Classification, Clustering
– New Data Mining operations
    ● Association rules
    ● Sequential patterns
    ● Deviation / Exceptions
● New application areas
– Spatial, Text, Web, Image, ....


KDD – Knowledge Discovery in Databases

● Applications
– Most large companies have data warehouses: platforms for Data Mining projects
– Trend towards integrated vertical solutions, such as in the financial and telecom areas
    ● Back-end: integration with databases
    ● Front-end: Campaign Management or CRM (Customer Relationship Management)


KDD – Knowledge Discovery in Databases

● Next Generation Knowledge Discovery Systems:
– Have integrated front-end access to knowledge delivery tools
– Have integrated back-end access to enterprise and external databases
– Have the knowledge discovery engine as an embedded part of the overall solution
– Be oriented to solving a business problem, not a data analysis problem


5. Application of Data Mining

● Medical
● Control Theory
● Engineering
● Marketing and Finance
● Data Mining on the Web
● Scientific Databases
● Fraud Detection
● And many more!


6. Summary
● Data Mining IS....
– Decision Trees, Nearest Neighbor Classification, Neural Networks, Rule Induction, K-means Clustering
– A decision support process in which we search for patterns of information in data
● Data Mining is NOT...
– Retrieving data (e.g., Google)
    ● “Information retrieval” or “Database querying”
    ● Data Mining infers “the right query” from data
– Merging many small databases into a large one


Summary

● Data Mining is not...
– Data warehousing
– SQL / Ad Hoc Queries / Reporting
– Software Agents
– Online Analytical Processing (OLAP)
– Data Visualization


References

● Dr. Lee's Presentation
– http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
– Data Mining Section
● Dr. Kurt Thearling's website
– http://www.thearling.com/dmintro/dmintro_frame.htm
– An Introduction to Data Mining