DATA MINING By Cecilia Parng CS 157B

Page 1: DATA MINING

DATA MINING

By Cecilia Parng

CS 157B

Page 2: DATA MINING

Contents

Definition of Data Mining
  – Knowledge Discovery in Databases

Classification
  – Decision-Tree

Association Rules
  – Support
  – Confidence

Clustering

Page 3: DATA MINING

Definition of Data Mining

Data mining: a class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior. For example, data mining software can help retail companies find customers with common interests.

Data mining is also popular in scientific and mathematical fields.

Page 4: DATA MINING

Definition of Data Mining (cont.)

Data mining is also known as Knowledge Discovery in Databases (KDD), although strictly speaking data mining is one phase of the KDD process.

– The knowledge discovery process includes six phases:
  · data selection
  · data cleansing
  · enrichment
  · data transformation or encoding
  · data mining
  · reporting and displaying of the discovered information
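
The six phases can be read as a pipeline in which each step consumes the output of the previous one. The following is only an illustrative sketch: the records, field names, lookup table, and helper functions are all hypothetical and do not come from the slides.

```python
# Hypothetical raw records; the field names are illustrative only.
raw = [
    {"customer": 1, "zip": "95112", "item": "soda ", "year": 2003},
    {"customer": 2, "zip": None, "item": "Milk", "year": 2003},
    {"customer": 3, "zip": "95014", "item": "soda", "year": 1999},
    {"customer": 4, "zip": "95112", "item": "SODA", "year": 2003},
]

# Assumed lookup table used for the enrichment phase.
REGION_BY_ZIP = {"95112": "San Jose", "95014": "Cupertino"}


def select(records):
    """Data selection: keep only the records relevant to the question."""
    return [r for r in records if r["year"] == 2003]


def cleanse(records):
    """Data cleansing: drop incomplete records and normalize spelling."""
    return [{**r, "item": r["item"].strip().lower()}
            for r in records if r["zip"] is not None]


def enrich(records):
    """Enrichment: add information from another source."""
    return [{**r, "region": REGION_BY_ZIP.get(r["zip"], "unknown")}
            for r in records]


def transform(records):
    """Transformation or encoding: reduce each record to (region, item)."""
    return [(r["region"], r["item"]) for r in records]


def mine(pairs):
    """Data mining: here, simply count how often each pair occurs."""
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def report(counts):
    """Reporting: format the discovered counts, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{region}: {item} x{n}" for (region, item), n in ordered]


lines = report(mine(transform(enrich(cleanse(select(raw))))))
```

The "mining" step here is deliberately trivial; the point is only the order of the phases, not the technique used in any one of them.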

Page 5: DATA MINING

Classification (Decision-Tree)

Classification is the process of learning a model that describes different classes of data. The classes are predetermined.

– The decision-tree classifier is a widely used technique for classification.

Page 6: DATA MINING

Decision-Tree Classifier

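A decision tree takes as input an object or situation described by a set of properties and, in its simplest form, outputs a yes/no decision. Such decision trees therefore represent Boolean functions. A tree can also output one of several classes, as in the credit card example in this deck, where the classes are poor, fair, and good risk.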

Page 7: DATA MINING

How to build a Decision-Tree

– A decision tree is constructed by looking for regularities in data.

Data → Decision Tree → Decision Rule

The resulting tree allows us to make predictions on unseen data.

– For example: someone who applies for a credit card may be classified as a “poor risk” or a “good risk.”
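
One simple way to look for regularities in data is to ask which single attribute best separates the classes. The sketch below picks that attribute by counting how many training records a one-level tree would misclassify. The applicant records, the attribute `owns_home`, and the function names are hypothetical, and real decision-tree learners usually rely on measures such as information gain rather than a raw error count.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (attributes, class label).
applicants = [
    ({"married": "yes", "owns_home": "yes"}, "good risk"),
    ({"married": "yes", "owns_home": "no"}, "good risk"),
    ({"married": "no", "owns_home": "yes"}, "poor risk"),
    ({"married": "no", "owns_home": "no"}, "poor risk"),
    ({"married": "yes", "owns_home": "no"}, "good risk"),
]


def split_errors(records, attribute):
    """Count misclassified records if each attribute value predicts its majority class."""
    labels_by_value = defaultdict(Counter)
    for attrs, label in records:
        labels_by_value[attrs[attribute]][label] += 1
    errors = 0
    for label_counts in labels_by_value.values():
        majority = max(label_counts.values())
        errors += sum(label_counts.values()) - majority
    return errors


def best_split(records, attributes):
    """Return the attribute whose one-level split makes the fewest errors."""
    return min(attributes, key=lambda a: split_errors(records, a))


root = best_split(applicants, ["married", "owns_home"])
```

In this toy data `married` separates the two classes perfectly, so it would be chosen as the root; a full learner would repeat the same step on each branch.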

Page 8: DATA MINING

Example Decision Tree for Credit Card Application

[Figure: decision tree for credit card applications. Root node: married (yes / no). Internal nodes: salary (< 20k; >= 20k and < 50k; >= 50k), acct balance (< 5k; >= 5k), age (< 25; >= 25). Leaves: poor risk, fair risk, good risk.]
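
The credit card tree can be written as nested conditions. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, "k" is read as thousands, married applicants are assumed to be split on salary and unmarried ones on account balance and then age, and the assignment of leaves to branches (in particular fair risk for age < 25 and good risk for age >= 25) is one reading of the flattened figure, not something the slide text confirms.

```python
def classify_applicant(married, salary, acct_balance, age):
    """Walk the credit card decision tree from the root to a leaf.

    The branch-to-leaf mapping is an assumed reading of the figure.
    """
    if married:
        # Assumed: married applicants are split on salary.
        if salary < 20_000:
            return "poor risk"
        if salary < 50_000:
            return "fair risk"
        return "good risk"
    # Assumed: unmarried applicants are split on account balance, then age.
    if acct_balance < 5_000:
        return "poor risk"
    if age < 25:
        return "fair risk"
    return "good risk"


example = classify_applicant(married=False, salary=0, acct_balance=8_000, age=30)
```

Each call follows exactly one path from the root to a leaf, which is what makes the tree usable as a decision rule on unseen applicants.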

Page 9: DATA MINING

Association Rules

An association rule must have an associated population:

– The population consists of a set of instances.

Association rules are used to discover elements that commonly occur together within a given data set.

Rules have an associated support, as well as an associated confidence.

Page 10: DATA MINING

Association Rules & Frequent Items

Association rule algorithms typically identify only patterns that occur in their original form throughout the database. In databases that contain many small variations in the data, potentially important discoveries may be ignored as a result.

Example transactions for an association rule mining algorithm:

Customer  Items
1         orange juice, soda
2         milk, orange juice, window cleaner
3         orange juice, detergent
4         orange juice, detergent, soda
5         window cleaner, soda

Page 11: DATA MINING

How does association rule analysis work?

The co-occurrence table contains some simple patterns:

· OJ and soda are purchased together more often than most other pairs of items; counted from the five transactions, only OJ and detergent co-occur as often.
· Detergent is never purchased with window cleaner or milk.
· Milk is never purchased with soda or detergent.

Items           OJ  Cleaner  Milk  Soda  Detergent
OJ               4        1     1     2          2
Window Cleaner   1        2     1     1          0
Milk             1        1     1     0          0
Soda             2        1     0     3          1
Detergent        2        0     0     1          2
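
A co-occurrence table like this one can be derived directly from the transaction list. The sketch below is one way to do it, assuming the five customer transactions listed earlier; the names `transactions` and `co_occurrence` are hypothetical. Note that, computed from the transactions as listed, orange juice and detergent co-occur twice (customers 3 and 4).

```python
from collections import Counter
from itertools import combinations

# The five customer transactions from the slides.
transactions = {
    1: {"orange juice", "soda"},
    2: {"milk", "orange juice", "window cleaner"},
    3: {"orange juice", "detergent"},
    4: {"orange juice", "detergent", "soda"},
    5: {"window cleaner", "soda"},
}


def co_occurrence(baskets):
    """Count, for every pair of items, the transactions containing both.

    The diagonal entry (item, item) is the number of transactions
    that contain the item at all.
    """
    counts = Counter()
    for items in baskets:
        for item in items:
            counts[(item, item)] += 1
        for a, b in combinations(sorted(items), 2):
            # Store both orders so the table is symmetric.
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


table = co_occurrence(transactions.values())
```

Pairs that never occur together, such as milk and detergent, are simply absent from the counter and read back as 0.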

Page 12: DATA MINING

Association Support

The Support:

– These simple observations are examples of associations and may suggest a formal rule like: “If a customer purchases soda, then the customer also purchases orange juice.” For now, we defer the question of how to find this rule automatically. In the data, two of the five transactions include both soda and orange juice. These two transactions support the rule. Expressed as a percentage, the support for the rule is two out of five, or 40 percent.
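
The support calculation can be sketched in a few lines, assuming the five customer transactions listed earlier. The function name `support` and its parameters are hypothetical.

```python
# The five customer transactions from the slides.
transactions = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(baskets, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    matching = sum(1 for basket in baskets if items <= basket)
    return matching / len(baskets)


soda_and_oj = support(transactions, {"soda", "orange juice"})
```

Two of the five transactions contain both items, so `soda_and_oj` is 2/5, the 40 percent quoted above. Support does not depend on the direction of the rule: it is the same for "if soda, then orange juice" and "if orange juice, then soda".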

 

Page 13: DATA MINING

Association Confidence

The Confidence:

– Of the three transactions that contain soda, two also contain orange juice, so the rule “if soda, then orange juice” has a confidence of two out of three, or about 67 percent. We are less confident about the inverse rule, “if orange juice, then soda,” because of the four transactions with orange juice, only two also have soda. Its confidence, then, is just 50 percent. More formally, confidence is the ratio of the number of transactions supporting the rule to the number of transactions where the conditional part of the rule holds. In other words, confidence is the ratio of the number of transactions with all the items to the number of transactions with just the “if” items.
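
The formal definition of confidence translates directly into code. This is a sketch assuming the five customer transactions listed earlier; the function name `confidence` and its parameters are hypothetical. With the transactions as listed, soda appears in three of them (customers 1, 4, and 5), so the sketch yields 2/3 for "if soda, then orange juice" and 2/4 for the inverse rule.

```python
# The five customer transactions from the slides.
transactions = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def confidence(baskets, if_items, then_items):
    """Transactions with all the items divided by transactions with the "if" items."""
    if_items, then_items = set(if_items), set(then_items)
    with_if = [b for b in baskets if if_items <= b]
    if not with_if:
        raise ValueError("no transaction contains the 'if' items")
    with_all = [b for b in with_if if then_items <= b]
    return len(with_all) / len(with_if)


soda_to_oj = confidence(transactions, {"soda"}, {"orange juice"})
oj_to_soda = confidence(transactions, {"orange juice"}, {"soda"})
```

Unlike support, confidence depends on the direction of the rule, which is why the two values differ.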

Page 14: DATA MINING

Clustering

The goal of clustering is to place records into groups, such that records in a group are similar to each other and dissimilar to records in other groups. The groups are usually disjoint.
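
As a rough illustration of that goal, the sketch below groups one-dimensional values around the nearest of a set of centers and then moves each center to the mean of its group. This is a simple k-means style procedure, which the slides do not name; it is swapped in here only as one common clustering technique. The ages, the starting centers, and the function name are hypothetical.

```python
def k_means_1d(values, centers, rounds=10):
    """Assign each value to its nearest center, then re-center; repeat."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to its group's mean; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Hypothetical customer ages.
ages = [21, 23, 24, 45, 47, 50]
groups = k_means_1d(ages, centers=[20, 60])
```

Every value lands in exactly one group, so the groups are disjoint, and values within a group are closer to each other than to values in the other group.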

Page 15: DATA MINING

The End