1
Chapter 1
INTRODUCTION
2
What is Pattern Recognition?
Pattern recognition by humans: perceptual and specialized; supports decision making
Pattern recognition by computers: the benefit of automated pattern recognition; an advantage in complex calculations
Pattern Recognition from Data (Data Mining)
3
Pattern Recognition from Data
Pattern recognition from data is the process of learning from historical data by finding dependencies in the data and extracting knowledge from it.
4
What is Data?
No.  Studies   Education  Works  Income (D)
1    Poor      SPM        Poor   None
2    Poor      SPM        Good   Low
3    Moderate  SPM        Poor   Low
4    Moderate  Diploma    Poor   Low
5    Poor      SPM        Poor   None
6    Moderate  Diploma    Poor   Low
7    Good      MSc        Good   Medium
:
99   Poor      SPM        Good   Low
100  Moderate  Diploma    Poor   Low
5
What is Knowledge?
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Moderate) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
6
Why is Data Mining prevalent?
1. Lots of data is collected and stored in data warehouses
- Business: Wal-Mart logs nearly 20 million transactions per day
- Astronomy: telescopes collect large amounts of data
- Space: NASA is collecting petabytes of data from satellites
- Physics: high-energy physics experiments are expected to generate 100 to 1000 terabytes in the next decade
7
Why is Data Mining prevalent?
2. The quality and richness of the collected data is improving
- Retailers: scanner data is much more accurate than other means
- E-commerce: rich data on customer browsing
- Science: the accuracy of sensors is improving
8
Why is Data Mining prevalent?
3. The gap between data and analysts is increasing
- Existence of hidden information
- High cost of human labor
- Much of the data is never analyzed at all
9
Origins of Data Mining
Draws ideas from machine learning, pattern recognition, statistics, and database systems for applications that have:
- Enormous amounts of data
- High-dimensional data
- Heterogeneous data
- Unstructured data
10
Data Mining: a confluence of multiple disciplines
Disciplines that feed into data mining:
- Database technology
- Statistics
- Machine learning
- Information science
- Neural networks
- Pattern recognition
- Visualization
- Information retrieval
- High-performance computing
- Spatial data analysis
11
Data Mining – What it isn’t
Small scale
- Data mining methods are designed for large data sets
Foolproof
- Data mining techniques will discover patterns in any data
- The patterns discovered may be meaningless
- It is up to the user to determine how to interpret the results
- “Make it foolproof and they’ll just invent a better fool”
Magic
- Data mining techniques cannot generate information that is not present in the data
- They can only find the patterns that are already there
12
Example: Data Mining is not ….
Generating multidimensional cubes of a relational table
Searching for a phone number in a phone book
Searching for keywords on Google (IR)
Generating a histogram of salaries for different age groups
Issuing an SQL query to a database and reading the reply
13
Data Mining – What it is
Extracting knowledge from large amounts of data
Uses techniques from pattern recognition, machine learning, and statistics
Plus techniques unique to data mining (Association rules)
Data mining methods must be efficient and scalable
14
Example: Data mining is …
What goods should be promoted to this customer?
What is the probability that a certain customer will respond to a planned promotion?
Can one predict the most profitable securities to buy/sell during the next trading session?
Will this customer default on a loan or pay back on schedule?
What medical diagnosis should be assigned to this patient?
What kinds of cars should we sell this year?
Finding groups of people with similar hobbies
Are chances of getting cancer higher if you live near a power line?
15
Data Mining is simply...
Finding relationships
Making predictions
16
Data Mining: Definition
The nontrivial extraction of implicit, previously unknown, and potentially useful information from data
(William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus)
17
Data Mining: one step of KDD
KDD = Knowledge Discovery in Databases
[Figure: the KDD process]
Databases, flat files → Cleaning and Integration → Data Warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation & Presentation → Knowledge
18
Cont’d
- Data cleaning: to remove noise and inconsistent data
- Data integration: multiple data sources may be combined
- Data selection: data relevant to the analysis task are retrieved from the database
- Data transformation: data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations
19
Cont’d
- Data mining: an essential process where intelligent methods are applied in order to extract data patterns
- Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some interestingness measures
- Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the users
20
Early Steps of Data Mining
- Data preprocessing: handles incomplete data, noisy data, and uncertain data
- Data discretization/representation: transforms data into suitable values for the mining algorithm to find patterns
- Data selection: selects the suitable data for mining purposes
21
Database Systems
Kinds of DB: relational, data warehouse, transactional DB, advanced DB systems, flat files, WWW
Kinds of knowledge: classification, association, clustering, prediction, …
22
Data Mining – Types of Data
Mining can be performed on data in a variety of forms
Relational Database
- The traditional DBMS everyone is familiar with
- Data is stored in a series of tables (a collection of tables)
- Data is extracted via queries, typically with SQL
- SQL:
  - “Show me a list of items that were sold in the last quarter”
  - “Show me the total sales of the last month, grouped by branch”
  - “How many transactions occurred in the month of December?”
  - “Which salesperson had the highest amount of sales?”
- Relational languages provide aggregate functions such as sum, avg, count, max, min
23
Data Mining – Types of Data
Apply data mining: go further
- Search for trends or data patterns
- Analyze customer data to predict the credit risk of new customers based on their income
- Detect deviations: items whose sales are far from those expected in comparison with the previous year (to be investigated further: change in packaging, increase in price?)
Transaction Database
- Similar to a relational database (transactions stored in a table)
- Each row (record) is a transaction with an id and a list of the items in the transaction
- Nested relation
- Can be unfolded into a relational database or stored in flat files, since nested relational structures are not supported by relational DB systems
- Which items sold well together?
24
Data Mining – Types of Data
Data Warehouse
- Stores historical data, potentially from multiple sources
- Organized around major subjects
- Contains summary statistics
Object / Object-Relational Databases
- Database consisting of objects
- Object = set of variables + associated methods
- E.g. Intel uses regularity extraction in automatic circuit layout
Images
- Can mine features extracted from images, OR
- Can use mining techniques to extract features
- Content-based image retrieval
25
Data Mining – Types of Data
Vector Geometries (spatial DB)
- Include GIS and CAD data
- Raster data: n-dimensional bitmaps / pixel maps
- Vector format: point, line, polygon
- Can find spatial patterns between features
  - Describe the characteristics of houses located near a specified kind of location
  - Describe the climate of mountainous areas located at various altitudes
Text
- Can be unstructured, semi-structured, or structured
- Documentation, newspaper articles, web sites, etc.
- Can facilitate search by linking related documents / concepts
26
Data Mining – Types of Data
Video / Audio
- Speech recognition: recognizing spoken commands
- Security applications
- Integrated with standard data mining methods (storage and searching)
Temporal Databases / Time Series
- Global change databases (temperature records)
- Space shuttle telemetry
- Stock market data (stock exchange)
- Usually store relational data that include time-related attributes
- Find the trend of changes for objects: decision making / strategy planning
27
Data Mining – Types of Data
Stock exchange data can be mined to uncover trends that could help in planning investment strategies (when is the best time to purchase TNB stock?)
Legacy Databases
- Group of heterogeneous databases (relational, OO DB, network DB, multimedia DB, etc.)
- Connected by intra- or inter-computer networks
- Information exchange is very difficult, e.g. student academic performance among different schools/universities
- Data mining: transforming the given data into higher, more generalized, conceptual levels
28
The evolution of database technology
Data mining can be viewed as a result of the natural evolution of database technology (Fig. 1.1).
The figure shows 5 stages of functionality:
- data collection and database creation
- database management systems
- advanced database systems
- web-based database systems
- data warehousing and data mining
29
30
The evolution of database technology ..cont
Database systems provide data storage and retrieval, and transaction processing.
Data warehousing and data mining provide data analysis and understanding.
A data warehouse is a database architecture that stores data from many different types of databases: a repository of multiple heterogeneous data sources.
These sources are organized under a unified schema at a single site in order to facilitate management decision making.
31
The evolution of database technology ..cont
Data warehouse technology includes:
- data cleansing
- data integration, and
- On-Line Analytical Processing (OLAP)
OLAP is an analysis technique for performing summarization, consolidation, and aggregation, as well as for viewing information from different angles.
Although OLAP tools support data analysis, they do not support in-depth analysis such as data classification, clustering, and the characterization of data changes over time.
32
DBMS, OLAP & Data Mining
Area | DBMS | OLAP | Data Mining
Task | Extraction of detailed and summary data | Summaries, trends and forecasts | Knowledge discovery of hidden patterns and insights
Type of result | Information | Analysis | Insight and prediction
Method | Deduction (ask the question, verify with data) | Multidimensional data modeling, aggregation, statistics | Induction (build the model, apply it to new data, get the result)
Example question | Who purchased mutual funds in the last 3 years? | What is the average income of mutual fund buyers by region by year? | Who will buy a mutual fund in the next 6 months and why?
33
Example: Weather data
A record of the weather conditions during a two-week period, along with the decisions of a tennis player whether or not to play tennis on each particular day
Generates tuples (or examples, instances) consisting of values of 4 independent variables:
- Outlook
- Temperature
- Humidity
- Windy
One dependent variable: play
34
Cont’d
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no
35
DBMS
We may answer questions by querying a DBMS containing the above table:
- What was the temperature on the sunny days?
- On which days was the humidity less than 75?
- On which days was the temperature greater than 70?
- On which days was the temperature greater than 70 and the humidity less than 75?
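As a minimal sketch of such DBMS queries, the weather table can be loaded into an in-memory SQLite database through Python's standard `sqlite3` module. The table name `weather`, the column names, and the helper `days` are assumptions made for illustration; the slides do not name a particular DBMS.

```python
import sqlite3

# (day, outlook, temperature, humidity, windy, play) from the weather table
WEATHER = [
    (1, "sunny", 85, 85, 0, "no"), (2, "sunny", 80, 90, 1, "no"),
    (3, "overcast", 83, 86, 0, "yes"), (4, "rainy", 70, 96, 0, "yes"),
    (5, "rainy", 68, 80, 0, "yes"), (6, "rainy", 65, 70, 1, "no"),
    (7, "overcast", 64, 65, 1, "yes"), (8, "sunny", 72, 95, 0, "no"),
    (9, "sunny", 69, 70, 0, "yes"), (10, "rainy", 75, 80, 0, "yes"),
    (11, "sunny", 75, 70, 1, "yes"), (12, "overcast", 72, 90, 1, "yes"),
    (13, "overcast", 81, 75, 0, "yes"), (14, "rainy", 71, 91, 1, "no"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (day INTEGER, outlook TEXT, temperature INTEGER,"
    " humidity INTEGER, windy INTEGER, play TEXT)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)", WEATHER)


def days(where):
    """Return the day numbers of the rows that satisfy a WHERE clause."""
    query = f"SELECT day FROM weather WHERE {where} ORDER BY day"
    return [row[0] for row in conn.execute(query)]


# What was the temperature on the sunny days?
sunny_temps = [
    row[0]
    for row in conn.execute(
        "SELECT temperature FROM weather WHERE outlook = 'sunny' ORDER BY day"
    )
]
low_humidity = days("humidity < 75")
warm = days("temperature > 70")
warm_and_dry = days("temperature > 70 AND humidity < 75")

print(sunny_temps, low_humidity, warm, warm_and_dry)
```

Each query only retrieves facts that are already stored; no new pattern is learned, which is the contrast the following slides draw with OLAP and data mining.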
36
OLAP (On-line analytical processing)
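Using OLAP: create a multidimensional model (data cube)
E.g. dimensions: time, outlook, play. We can create the model below, where each cell gives the count of play = yes / play = no and the top-left cell is the overall total
9/5     sunny  rainy  overcast
Week 1  0/2    2/1    2/0
Week 2  2/1    1/1    2/0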
37
Cont’d
Observing the data cube, we can easily identify some important properties of the data
- Find regularities or patterns
- E.g. the 3rd column: if the outlook is overcast, the play attribute is always yes
- If outlook = overcast then play = yes
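The data cube can be reproduced with a few lines of counting. This is a sketch only: it assumes that days 1 to 7 form week 1 and days 8 to 14 form week 2, and the function names `build_cube` and `week` are hypothetical. Passing a different time bucketing function gives the drill-down (per day) or a full roll-up of the time dimension.

```python
from collections import defaultdict

# (day, outlook, play) taken from the weather table
WEATHER = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]


def build_cube(rows, time_level):
    """Count [yes, no] for every (time bucket, outlook) cell."""
    cube = defaultdict(lambda: [0, 0])
    for day, outlook, play in rows:
        cell = cube[(time_level(day), outlook)]
        cell[0 if play == "yes" else 1] += 1
    return dict(cube)


def week(day):
    # Assumption: days 1-7 are week 1 and days 8-14 are week 2.
    return 1 if day <= 7 else 2


by_week = build_cube(WEATHER, week)               # the weekly cube
by_day = build_cube(WEATHER, lambda day: day)     # drill-down to days
overall = build_cube(WEATHER, lambda day: "all")  # time rolled up entirely

for key in sorted(by_week):
    yes, no = by_week[key]
    print(key, f"{yes}/{no}")
```

With the time dimension collapsed, the overcast cell contains only "yes" counts, which is the regularity noted on the slide.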
38
Drill-down: time dimension
Concept hierarchy
9/5 sunny rainy overcast
1 0/1 0/0 0/0
2 0/1 0/0 0/0
3 0/0 0/0 1/0
4 0/0 1/0 0/0
5 0/0 1/0 0/0
6 0/0 0/1 0/0
7 0/0 0/0 1/0
8 0/1 0/0 0/0
9 1/0 0/0 0/0
10 0/0 1/0 0/0
11 1/0 0/0 0/0
12 0/0 0/0 1/0
13 0/0 0/0 1/0
14 0/0 0/1 0/0
39
Roll-up (reverse of drill-down)
9/5     sunny  rainy  overcast
Week 1  0/2    2/1    2/0
Week 2  2/1    1/1    2/0
40
Data Mining Tasks
Prediction methods
- Use some variables to predict unknown or future values of the same or other variables
- Perform inference on the current data in order to make predictions
Description methods
- Find human-interpretable patterns that describe the data
- Characterize the general properties of the data in the DB
Descriptive mining is complementary to predictive mining, but it is closer to decision support than to decision making
41
Cont’d
Association Rule Mining (descriptive)
Classification and Prediction (predictive)
Clustering (descriptive)
Sequential Pattern Discovery (descriptive)
Regression (predictive)
Deviation Detection (predictive)
42
Association Rule Mining
Initially developed for market basket analysis
Goal is to discover relationships between attributes
Data is typically stored in very large databases, sometimes in flat files or images
Uses include decision support, classification and clustering
Application areas include business, medicine and engineering
43
Association Rule Mining
Given a set of transactions, each of which is a set of items, find all rules (X → Y) that satisfy user-specified minimum support and confidence constraints
Support = (#T containing X and Y) / (#T)
Confidence = (#T containing X and Y) / (#T containing X)
Applications
- Cross-selling and up-selling
- Supermarket shelf management
Transaction  Items
T1           Bread, Jelly, Jem
T2           Bread, Jem
T3           Bread, Milk, Jem
T4           Coffee, Bread
T5           Coffee, Milk
Some rules discovered
- Bread → Jem: sup = 60%, conf = 75%
- Jem → Bread: sup = 60%, conf = 100%
- Jelly → Jem: sup = 20%, conf = 100%
- Jelly → Milk: sup = 0%
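The support and confidence definitions can be checked directly against the five transactions T1 to T5. The sketch below is a plain counting implementation, not a full rule miner; the function names `support` and `confidence` are chosen for illustration.

```python
# The five transactions T1..T5 from the slide
TRANSACTIONS = [
    {"Bread", "Jelly", "Jem"},
    {"Bread", "Jem"},
    {"Bread", "Milk", "Jem"},
    {"Coffee", "Bread"},
    {"Coffee", "Milk"},
]


def support(x, y, transactions=TRANSACTIONS):
    """Fraction of all transactions that contain every item of X and Y."""
    both = x | y
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions=TRANSACTIONS):
    """Among the transactions containing X, the fraction that also contain Y."""
    has_x = sum(x <= t for t in transactions)
    has_both = sum((x | y) <= t for t in transactions)
    return has_both / has_x if has_x else 0.0


for x, y in [({"Bread"}, {"Jem"}), ({"Jem"}, {"Bread"}),
             ({"Jelly"}, {"Jem"}), ({"Jelly"}, {"Milk"})]:
    print(sorted(x), "->", sorted(y),
          f"sup={support(x, y):.0%}", f"conf={confidence(x, y):.0%}")
```

Note that support is symmetric in X and Y while confidence is not: Bread → Jem and Jem → Bread share the same support but differ in confidence.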
44
Association Rule Mining: Definition
Given a set of records, each of which contains some number of items from a given collection:
- Produce dependency rules which will predict the occurrence of an item based on occurrences of other items
Example: {Bread} → {Jem}, {Jelly} → {Jem}
45
Association Rule Mining: Marketing and sales promotion
Say the rule discovered is
{Bread, …} → {Jem}
Jem as a consequent: can be used to determine what products will boost its sales.
Bread as antecedent: can be used to see which products will be impacted if the store stops selling bread
Bread as an antecedent and Jem as a consequent: can be used to see what products should be stocked along with Bread to promote the sale of Jem.
46
Association Rule Mining: Supermarket shelf management
Goal: To identify items that are bought together by a reasonable fraction of customers so that they can be shelved together.
Data used: Point-of-sale data collected with barcode scanners, used to find dependencies among products.
Example
- If a customer buys Jelly, then he is very likely to buy Jem.
- So don’t be surprised if you find Jem next to Jelly on an aisle in the supermarket. Also salsa next to tortilla chips.
47
Association Rule Mining
Association rule mining will produce LOTS of rules
How can you tell which ones are important?
- High support
- High confidence
- Rules involving certain attributes of interest
- Rules with a specific structure
- Rules with support / confidence higher than expected
Completeness – Generating all interesting rules
Efficiency – Generating only rules that are interesting
48
Clustering
Determine object groupings such that objects within the same cluster are similar to each other, while objects in different clusters are not
Typically objects are represented by data points in a multidimensional space, with each dimension corresponding to one or more attributes. The clustering problem in this case reduces to the following:
Given a set of data points, each having a set of attributes, and a similarity measure, find clusters such that
- Data points in one cluster are more similar to one another
- Data points in separate clusters are less similar to one another
49
Cont’d
Similarity measures:
- Euclidean distance (continuous attributes)
- Other problem-specific measures
Types of clustering
- Group-based clustering
- Hierarchical clustering
50
Clustering Example
Euclidean-distance-based clustering in 3D space
- Intra-cluster distances are minimised
- Inter-cluster distances are maximised
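To make the two criteria concrete, the following sketch computes the mean intra-cluster and inter-cluster Euclidean distances for two small groups of 3D points. The points are hypothetical and the grouping is given by hand; no clustering algorithm is run here.

```python
import math
from itertools import combinations, product


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical 3D points, already assigned to two clusters
cluster_a = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
cluster_b = [(10, 10, 10), (11, 10, 10), (10, 11, 10)]


def mean_intra(cluster):
    """Average distance between pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def mean_inter(c1, c2):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(c1, c2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


print("intra A:", round(mean_intra(cluster_a), 3))
print("intra B:", round(mean_intra(cluster_b), 3))
print("inter A-B:", round(mean_inter(cluster_a, cluster_b), 3))
```

A good grouping shows small intra-cluster values and a large inter-cluster value, as in this toy case.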
51
Clustering: Market Segmentation
Goal: To subdivide a market into distinct subsets of customers, where each subset can be targeted with a distinct marketing mix
Approach:
- Collect different attributes of customers based on their geographical and lifestyle-related information
- Find clusters of similar customers
- Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters
52
Clustering: Document Clustering
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them
Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to generate clusters.
Gain: Information retrieval can utilize the clusters to relate a new document or search to clustered documents
53
Clustering: Document Clustering Example
Clustering points: 3204 articles of LA Times
Similarity measure: Number of common words in documents (after some word filtering)
Category       Total articles  Correctly placed articles
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
54
Classification: Definition
Given a set of records (called the training set)
- Each record contains a set of attributes. One of the attributes is the class
Find a model for the class attribute as a function of the values of the other attributes
Goal: Previously unseen records should be assigned to a class as accurately as possible
- Usually, the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it. The accuracy of the model is determined on the test set.
55
Classification: cont’d
- Classifiers are created using labeled training samples
- Classifiers are evaluated using independent labeled samples (test set)
- Training samples are created by ground truth / experts
- The classifier is later used to classify unknown samples
- Measurements must be able to predict the phenomenon!
Examples
- Direct marketing
- Fraud detection
- Customer churn
- Sky survey cataloging
- Classifying galaxies
56
Classification Example
Training set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
[Figure: Training Set → Learn Classifier → Model ← Test Set]
Test set: same columns (Refund, Marital Status, Taxable Income, Cheat); the rows shown on the slide repeat the ten records above
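The train/test procedure can be sketched on the ten records of the classification example. Two things here are assumptions for illustration: the split (first seven records for training, last three for testing) and the "classifier", which is only a majority-class baseline standing in for a real learning algorithm.

```python
from collections import Counter

# (Refund, Marital Status, Taxable Income in K, Cheat) for Tid 1..10
RECORDS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]


def split(records, n_train):
    """Holdout split: the first n_train records train, the rest test."""
    return records[:n_train], records[n_train:]


def learn_majority(train):
    """Baseline 'model': always predict the most common training class."""
    label = Counter(record[-1] for record in train).most_common(1)[0][0]
    return lambda attributes: label


def accuracy(model, test):
    """Fraction of test records whose class the model predicts correctly."""
    hits = sum(model(record[:-1]) == record[-1] for record in test)
    return hits / len(test)


train, test = split(RECORDS, 7)
model = learn_majority(train)
test_accuracy = accuracy(model, test)
print(f"accuracy on the test set: {test_accuracy:.2f}")
```

The baseline scores poorly on the held-out records because it ignores every attribute; any real classifier would be evaluated with the same `accuracy` call on the same test set.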
57
Classification: Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell phone product
Approach:
- Use the data collected for a similar product introduced in the recent past
- Use the profiles of consumers along with their {buy, didn’t buy} decision. The latter becomes the class attribute
- The profile information may consist of demographic, lifestyle, and company-interaction data
  - Demographic: age, gender, geography, salary
  - Psychographic: hobbies
  - Company interaction: recency, frequency, monetary
- Use this information as input attributes to learn a classifier model
58
Classification: Fraud Detection
Goal: Predict fraudulent cases in credit card transactions
Approach:
- Use credit card transactions and the information on their account holders as attributes (important: when and where the card was used)
- Label past transactions as {fraud, fair} transactions. This forms the class attribute
- Learn a model for the class of transactions
- Use this model to detect fraud by observing credit card transactions on an account
59
Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or non-linear model of dependency
Extensively studied in the fields of statistics and neural networks
- Predicting the sales numbers of a new product based on advertising expenditure
- Predicting wind velocities based on temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices
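For the linear case, a least-squares line can be fitted in a few lines. The sketch below uses simple one-variable least squares; the advertising and sales figures are hypothetical and were chosen to lie exactly on a line so the result is easy to verify.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. units sold
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)


def predict(x):
    """Predicted sales for a new advertising expenditure x."""
    return slope * x + intercept


print(slope, intercept, predict(5.0))
```

Real data will not fall exactly on a line; the same function then returns the line that minimises the squared prediction error.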
60
Deviation/Anomaly Detection
Some data objects do not comply with the general behavior or model of the data. Data objects that are different from or inconsistent with the remaining set are called outliers
Outliers can be caused by measurement or execution errors, or they may represent some kind of fraudulent activity
Goal of deviation/anomaly detection is to detect significant deviations from normal behavior
61
Deviation/Anomaly Detection: Definition
Given a set of n points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with the remaining data
This can be viewed as two subproblems:
- Define what data can be considered inconsistent in a given data set
- Find an efficient method to mine the outliers
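One simple way to instantiate the two subproblems, offered only as an illustration: define "inconsistent" as "far from the median" and return the k values with the largest absolute deviation. Both the dissimilarity definition and the sample readings are assumptions; the slides do not prescribe a particular measure.

```python
from statistics import median


def top_k_outliers(values, k):
    """Return the k values that lie farthest from the median."""
    center = median(values)
    ranked = sorted(values, key=lambda v: abs(v - center), reverse=True)
    return ranked[:k]


# Hypothetical sensor readings with two obviously deviating values
readings = [10, 11, 9, 10, 12, 50, 10, -30]
outliers = top_k_outliers(readings, 2)
print(outliers)
```

Sorting all n values costs O(n log n); for large databases the second subproblem (efficiency) would call for something cheaper, such as keeping only the k best candidates in a heap.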
62
Deviation: Credit Card Fraud Detection
Goal: to detect fraudulent credit card transactions
Approach:
- Based on past usage patterns, develop a model for authorized credit card transactions
- Check for deviation from the model before authenticating new credit card transactions
- Hold payment and verify the authenticity of a “doubtful” transaction by other means (phone call, etc.)
63
Anomaly detection: Network Intrusion Detection
Goal: to detect intrusion into a computer network
Approach:
- Define and develop a model for normal user behavior on the computer network
- Continuously monitor the behavior of users to check whether it deviates from the defined normal behavior
- Raise an alarm if such a deviation is found
64
Sequential pattern discovery: Definition
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events
Sequence discovery aims at extracting sets of events that commonly occur over a period of time
(A B) → (C) → (D E)
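A sequential pattern such as (A B), (C), (D E) is supported by an object when its timeline contains those event sets in that order, possibly with other events in between. The sketch below checks this containment and counts support over a small database of timelines; the timelines and the names `contains` and `support` are hypothetical.

```python
def contains(sequence, pattern):
    """True if the event sets of `pattern` occur, in order, in `sequence`.

    Both arguments are lists of sets; each set holds the events that
    happened at one point in time.
    """
    pos = 0
    for wanted in pattern:
        # Advance to the next time step whose events include `wanted`.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # later pattern elements must occur strictly afterwards
    return True


def support(database, pattern):
    """Fraction of timelines in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


PATTERN = [{"A", "B"}, {"C"}, {"D", "E"}]

# Hypothetical timelines for three objects
DATABASE = [
    [{"A", "B"}, {"X"}, {"C"}, {"D", "E", "F"}],  # contains the pattern
    [{"C"}, {"A", "B"}, {"D", "E"}],              # wrong order
    [{"A", "B", "G"}, {"C"}, {"D"}, {"D", "E"}],  # contains the pattern
]

print(support(DATABASE, PATTERN))
```

Matching each pattern element at its earliest possible position is sufficient for a yes/no containment test, because an earlier match never rules out later ones.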
65
Sequential pattern discovery: Telecommunication Alarm Logs
Telecommunication alarm logs
(Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) → (Fire_Alarm)
66
Sequential pattern discovery: Point of Sale Up-Sell / Cross-Sell
Point-of-sale transaction sequences
Computer bookstore
(Intro_to_Visual_C) (C++ Primer) → (Perl_For_Dummies, Tcl_Tk)
60% of customers who buy Intro to Visual C and C++ Primer also buy Perl for Dummies and Tcl Tk within a month
Athletic apparel store
(Shoes) (Racket, Racket ball) → (Sport_Jacket)
67
Example: Data Mining (Weather data)
By applying various data mining techniques, we can find associations and regularities in our data
- Extract knowledge in the form of rules, decision trees, etc.
- Predict the value of the dependent variable in new situations
Some examples
- Mining association rules
- Classification by decision trees and rules
- Prediction methods
68
Mining association rules
- First, discretize the numeric attributes (a part of the data preprocessing stage)
- Group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal)
- Substitute the values in the data with the corresponding names
- Apply the Apriori algorithm and get the following rules
69
Discretized weather data
Day  outlook   temperature  humidity  windy  play
1    sunny     hot          high      false  no
2    sunny     hot          high      true   no
3    overcast  hot          high      false  yes
4    rainy     mild         high      false  yes
5    rainy     cool         normal    false  yes
6    rainy     cool         normal    true   no
7    overcast  cool         normal    true   yes
8    sunny     mild         high      false  no
9    sunny     cool         normal    false  yes
10   rainy     mild         normal    false  yes
11   sunny     mild         normal    true   yes
12   overcast  mild         high      true   yes
13   overcast  hot          normal    false  yes
14   rainy     mild         high      true   no
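The discretization step can be sketched as two threshold functions. The slides do not state the cut points; the ones used here (cool below 70, hot from 80 upward, humidity high above 80) are assumptions inferred by comparing the numeric table with the discretized one, and they reproduce all 14 rows.

```python
# (temperature, humidity) for days 1..14 from the numeric weather table
NUMERIC = [
    (85, 85), (80, 90), (83, 86), (70, 96), (68, 80), (65, 70), (64, 65),
    (72, 95), (69, 70), (75, 80), (75, 70), (72, 90), (81, 75), (71, 91),
]


def discretize_temperature(value):
    # Assumed cut points: cool < 70 <= mild < 80 <= hot
    if value < 70:
        return "cool"
    if value < 80:
        return "mild"
    return "hot"


def discretize_humidity(value):
    # Assumed cut point: normal <= 80 < high
    return "high" if value > 80 else "normal"


discretized = [
    (discretize_temperature(t), discretize_humidity(h)) for t, h in NUMERIC
]

for day, row in enumerate(discretized, start=1):
    print(day, *row)
```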
70
Cont’d
1. humidity=normal windy=false → play=yes (4, 1)
2. temperature=cool → humidity=normal (4, 1)
3. outlook=overcast → play=yes (4, 1)
4. temperature=cool play=yes → humidity=normal (3, 1)
5. outlook=rainy windy=false → play=yes (3, 1)
6. outlook=rainy play=yes → windy=false (3, 1)
7. outlook=sunny humidity=high → play=no (3, 1)
8. outlook=sunny play=no → humidity=high (3, 1)
9. temperature=cool windy=false → humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false → play=yes (2, 1)
71
Cont’d
These rules show some sets of attribute values (itemsets) that appear frequently in the data
The first number in parentheses is the support (the number of occurrences of the itemset in the data)
The second number is the confidence (accuracy) of the rule
Rule 3 is the same as the one produced by observing the data cube
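The (support, confidence) pairs can be verified by counting over the discretized weather table. This sketch only checks given rules; it is not an implementation of Apriori, and the helper name `rule_stats` is hypothetical.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def matches(record, items):
    """True if the record has every attribute=value pair in `items`."""
    return all(record[attr] == value for attr, value in items.items())


def rule_stats(antecedent, consequent):
    """Return (support count, confidence) of antecedent -> consequent."""
    n_antecedent = sum(matches(r, antecedent) for r in DATA)
    n_both = sum(matches(r, {**antecedent, **consequent}) for r in DATA)
    return n_both, n_both / n_antecedent


rule_1 = rule_stats({"humidity": "normal", "windy": "false"}, {"play": "yes"})
rule_3 = rule_stats({"outlook": "overcast"}, {"play": "yes"})
rule_7 = rule_stats({"outlook": "sunny", "humidity": "high"}, {"play": "no"})
print(rule_1, rule_3, rule_7)
```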
72
Classification by Decision Trees and Rules
Using the ID3 algorithm, the following decision tree is produced:
Outlook=sunny
  Humidity=high: no
  Humidity=normal: yes
Outlook=overcast: yes
Outlook=rainy
  Windy=true: no
  Windy=false: yes
73
Cont’d
A decision tree consists of:
- Decision nodes that test the values of their corresponding attribute
- Each value of this attribute leads to a subtree, and so on, until the leaves of the tree are reached
- The leaves determine the value of the dependent variable
Using a decision tree we can classify new tuples
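The tree for the weather data can be written directly as nested conditions, which makes the classification of a new tuple explicit. The sketch hard-codes the tree given on the previous slide rather than running ID3; the function name `classify` is an assumption.

```python
def classify(outlook, humidity, windy):
    """Walk the decision tree from the root (outlook) down to a leaf."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError(f"unknown outlook: {outlook!r}")


# (outlook, humidity, windy, play) from the discretized weather table
TRAINING = [
    ("sunny", "high", False, "no"), ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"), ("rainy", "high", False, "yes"),
    ("rainy", "normal", False, "yes"), ("rainy", "normal", True, "no"),
    ("overcast", "normal", True, "yes"), ("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"), ("rainy", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"), ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"), ("rainy", "high", True, "no"),
]

correct = sum(classify(o, h, w) == play for o, h, w, play in TRAINING)
new_tuple_prediction = classify("sunny", "normal", False)
print(correct, "of", len(TRAINING), "training tuples classified correctly")
print("new tuple (sunny, normal humidity, not windy):", new_tuple_prediction)
```

Temperature never appears in the tree, so it is not needed as an argument.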
74
Cont’d
A decision tree can be presented as a set of rules
- Each rule represents a path through the tree from the root to a leaf
Other data mining techniques can produce rules directly, e.g. the Prism algorithm:
- if outlook=overcast then yes
- if humidity=normal and windy=false then yes
- if temperature=mild and humidity=normal then yes
- if outlook=rainy and windy=false then yes
- if outlook=sunny and humidity=high then no
- if outlook=rainy and windy=true then no
75
Prediction methods
DM offers techniques to predict the value of the dependent variable directly without first generating a model
The most popular approach is based on statistical methods
Uses the Bayes rule to predict the probability of each value of the dependent variable given the values of the independent variables
76
Cont’d
E.g. applying Bayes to the new tuple (sunny, mild, normal, false, ?):
P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2
The predicted value must be “yes”
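The two probabilities can be reproduced with a naive Bayes calculation over the discretized weather table. The naive independence assumption (attributes are treated as independent given the class) and the absence of smoothing are assumptions of this sketch; the slides only say that the Bayes rule is used.

```python
from collections import Counter

# (outlook, temperature, humidity, windy, play) from the discretized table
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def naive_bayes(rows, query):
    """Posterior P(class | query), assuming independent attributes."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(class)
        for i, value in enumerate(query):
            n_match = sum(1 for r in rows if r[-1] == label and r[i] == value)
            score *= n_match / n_label  # likelihood P(value | class)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posterior = naive_bayes(ROWS, ("sunny", "mild", "normal", "false"))
print({label: round(p, 3) for label, p in posterior.items()})
```

Rounded to one decimal place, the posteriors agree with the 0.8 and 0.2 given on the slide.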
77
Data Mining : Problems and Challenges
Noisy data
Difficult Training Set
Incomplete Data
Dynamic Databases
Large Databases
78
Noisy data
Many of the attribute values will be inexact or incorrect
- Erroneous instruments measuring some property
- Human errors occurring at data entry
Two forms of noise in the data
- Corrupted values: some of the values in the training set are altered from their original form
- Missing values: one or more of the attribute values may be missing, both for examples in the training set and for objects which are to be classified
79
Difficult Training Set
Non-representative data
- Learning is based on only a few examples
- Using a large DB, the rules are probably representative
Absence of boundary cases
- Boundary cases are needed to find the real differences between two classes
Limited information
- Two objects to be classified have the same conditional attributes but are classified in different classes
- There is not enough information to distinguish the two types of objects
80
Dynamic databases
Databases change continually
Rules that reflect the content of the DB at all times are preferred
If some changes are made, the whole learning process may have to be conducted again
81
Large databases
The size of databases is ever increasing
Machine learning algorithms typically handle a small training set (a few hundred examples)
Much care is needed when using similar techniques on larger DBs
A large DB provides more knowledge (e.g. the number of rules may be enormous)
82
Data Mining – Issues in Data Mining
User Interaction / Visualization
Incorporation of Background Knowledge
Noisy or Incomplete Data
Determining Interestingness of Patterns
Efficiency and Scalability
Parallel and Distributed Mining
Incremental Learning / Mining Time-Changing Phenomena
Mining from Image / Video / Audio Data
Mining Unstructured Data