Data Mining
Theory and Practice
Dr. Azuraliza Abu Bakar
http://www.ftsm.ukm.my/jabatan/ts/aab/index.htm
What is Pattern Recognition
Pattern Recognition by Humans
– perceptual
– specialized
– decision making
Pattern Recognition by Computers
– benefit of automated pattern recognition
– advantage in complex calculations
Pattern Recognition from Data (Data Mining)
Pattern Recognition from Data
Pattern recognition from data is the process of learning from past data: studying the dependencies among attributes and extracting knowledge from them.
What is Data?
No. Studies Education Works Income (D)
1 Poor SPM Poor None
2 Poor SPM Good Low
3 Moderate SPM Poor Low
4 Moderate Diploma Poor Low
5 Poor SPM Poor None
6 Moderate Diploma Poor Low
7 Good MSC Good Medium
:
99 Poor SPM Good Low
100 Moderate Diploma Poor Low
What is Knowledge?
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
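As a rough illustration, the rules above can be written as condition/decision pairs and checked against the rows visible in the table. This is only a sketch: the names `ROWS`, `RULES`, `matches`, and `rule_holds` are hypothetical, only the nine rows shown on the slide are used (rows 8 to 98 are elided in the source), and the spellings "MSC" and "Mod" are normalised to "MSc" and "Moderate" as an assumption.

```python
# Rows visible on the slide: (studies, education, work, income).
# Rows 8..98 are elided in the source and are NOT reconstructed here.
ROWS = [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSc", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]
FIELDS = ("studies", "education", "work", "income")
RECORDS = [dict(zip(FIELDS, row)) for row in ROWS]

# Each rule is (conditions, set of allowed decisions).
RULES = [
    ({"studies": "Poor", "work": "Poor"}, {"None"}),
    ({"studies": "Poor", "work": "Good"}, {"Low"}),
    ({"education": "Diploma"}, {"Low"}),
    ({"education": "MSc"}, {"Medium", "High"}),
    ({"studies": "Moderate"}, {"Low"}),
    ({"studies": "Good"}, {"Medium", "High"}),
    ({"education": "SPM", "work": "Good"}, {"Low"}),
]


def matches(record, conditions):
    """True if the record satisfies every attribute=value condition."""
    return all(record[attr] == value for attr, value in conditions.items())


def rule_holds(conditions, decisions, records):
    """Return (number of matching records, whether all of them agree with the rule)."""
    matched = [r for r in records if matches(r, conditions)]
    return len(matched), all(r["income"] in decisions for r in matched)


# One (match count, consistent?) pair per rule, in the order listed above.
SUMMARY = [rule_holds(cond, dec, RECORDS) for cond, dec in RULES]
```

On the visible rows, every rule is consistent with the data it matches, which is what one would expect of knowledge extracted from that data.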
What is Data Mining?
Extraction of knowledge from data
Exploration and analysis of large quantities of data to discover meaningful patterns.
Discover Knowledge
How does data mining look into data?
Data Mining : Motivation
Huge amounts of data
Important need for turning data into useful information
Fast-growing amounts of data, collected and stored in large and numerous databases, exceed the human ability to comprehend them without powerful tools
Questions?
What goods should be promoted to this customer?
What is the probability that a certain customer will respond to a planned promotion?
Can one predict the most profitable securities to buy/sell during the next trading session?
Will this customer default on a loan or pay back on schedule?
What medical diagnosis should be assigned to this patient?
What kind of cars should be sold this year?
Data Mining is simply...
Finding relationships
Making predictions
Data Mining : One Step of KDD
Diagram labels: KDD, Data mining, Task, Techniques
Data Mining as a Step of KDD
Databases, Flat files
Cleaning and Integration
Data Warehouse
Selection and Transformation
Data Mining
Patterns
Evaluation & Presentation
Knowledge
Early Steps of Data Mining
Data preprocessing
– handling incomplete, noisy, and uncertain data
Data discretization/representation
– transforms data into suitable values for the mining algorithm to find patterns
Data selection
– selects the suitable data for mining purposes
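The discretization step can be sketched as follows. The deck does not say which discretization method is used, so equal-width binning is swapped in here purely as the simplest illustration; the function names, the three labels, and the sample cholesterol-style readings are all assumptions.

```python
def equal_width_edges(values, n_bins):
    """Return n_bins + 1 bin edges that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def discretize(value, edges, labels):
    """Map a numeric value to the label of the interval that contains it.

    Intervals are half-open [edge_i, edge_i+1), except the last one,
    which also includes the maximum.
    """
    for i, label in enumerate(labels):
        if value < edges[i + 1]:
            return label
    return labels[-1]


# Illustrative numeric readings (assumed for this sketch).
READINGS = [168, 199, 233, 286, 354]
LABELS = ["Low", "Medium", "High"]  # hypothetical interval names

EDGES = equal_width_edges(READINGS, len(LABELS))
BINNED = [discretize(v, EDGES, LABELS) for v in READINGS]
```

After this step a rule can refer to an interval such as "Medium" rather than to a single exact number, which lets one rule cover many records.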
Data Mining Techniques
Decision Trees
Neural Network
Genetic Algorithms
Fuzzy Set Theory
Rough Set Theory
Statistical Method (Regression Analysis)
Kinds of DB
Relational
Data warehouse
Transactional DB
Advanced DB system
Flat files
WWW
Kinds of Knowledge
Classification
Association
Clustering
Prediction
……
Classification of Data Mining Systems
Techniques used
DB oriented techniques
Statistics
Machine learning
Pattern recognition
Neural Network
Rough Set, etc.
Application adapted
Finance
Marketing
Medical
Stock
Telecommunication, etc.
Data Mining: confluence of multiple disciplines
DATA MINING
Database technology
Statistics
Machine learning
Information science
Neural networks
Pattern recognition
Visualization
Information retrieval
High-performance computing
Spatial data analysis
Data Mining
What are we looking at?
What are we looking for?
Data Mining Tasks
– Prediction
– Classification
– Clustering
– Association Rules
– Sequential Analysis
– Deviation analysis
– Similarity analysis
– Trend analysis
Classification
Classification algorithm
Training data
No. Studies Education Works Income (D)
1 Poor SPM Poor None
2 Poor SPM Good Low
3 Moderate SPM Poor Low
4 Moderate Diploma Poor Low
5 Poor SPM Poor None
6 Moderate Diploma Poor Low
7 Good MSC Good Medium
:
99 Poor SPM Good Low
100 Moderate Diploma Poor Low
Classification Rules
If studies=“poor” and work=“poor” then income=“none”
Classification
Test data
Studies Education Works Income (D)
Moderate Diploma Poor ?
Poor SPM Poor ?
Moderate Diploma Poor ?
Good MSC Good ?
:
New data
studies=“poor” and work=“poor”
Classification rules
none
classify
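The flow on these two slides (training data, classification algorithm, rules, then classifying new data) can be sketched with a deliberately simple learner. The deck does not name the algorithm behind its rules, so a majority-vote lookup table over the `studies` and `work` attributes is used here as a stand-in; `learn_rules`, `classify`, and the `"unknown"` default are hypothetical, and only the nine training rows visible on the slide are used.

```python
from collections import Counter, defaultdict

FIELDS = ("studies", "education", "work", "income")

# Training rows visible on the slide (rows 8..98 are elided in the source).
TRAINING = [dict(zip(FIELDS, row)) for row in [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSc", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]]


def learn_rules(records, attributes, target="income"):
    """Map each observed combination of attribute values to its most common target value."""
    groups = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[a] for a in attributes)
        groups[key][rec[target]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in groups.items()}


def classify(record, rules, attributes, default="unknown"):
    """Look up the decision for a new record; fall back to `default` if no rule applies."""
    return rules.get(tuple(record[a] for a in attributes), default)


ATTRS = ("studies", "work")  # assumed choice of condition attributes
RULES = learn_rules(TRAINING, ATTRS)

# Unlabelled test rows from the slide.
TEST = [
    {"studies": "Moderate", "education": "Diploma", "work": "Poor"},
    {"studies": "Poor", "education": "SPM", "work": "Poor"},
    {"studies": "Moderate", "education": "Diploma", "work": "Poor"},
    {"studies": "Good", "education": "MSc", "work": "Good"},
]
PREDICTIONS = [classify(r, RULES, ATTRS) for r in TEST]
```

A record whose attribute combination never appeared in training gets the `"unknown"` default, which is one reason real classifiers generalise beyond exact lookups.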
Type of Classifiers
Statistical Classifier
– Bayesian approach
– Multiple Regression
– K-nearest neighbour
– Naïve Bayes
– Causal Network
– Discriminant Analysis
Neural Classifier
– Hopfield Network
– Multilayer Perceptron
– Radial Basis Function
– Kohonen Networks
Rough Classifier
DATASET
No. Studies Education Works Income (D)
1 Poor SPM Poor None
2 Poor SPM Good Low
3 Moderate SPM Poor Low
4 Moderate Diploma Poor Low
5 Poor SPM Poor None
6 Moderate Diploma Poor Low
7 Good MSC Good Medium
:
99 Poor SPM Good Low
100 Moderate Diploma Poor Low
RULES
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
Comparing Classifiers
Predictive Accuracy
Speed
Robustness
Scalability
Interpretability
Data Mining : Problems and Challenges
Noisy data
Difficult Training Set
Dynamic Databases
Large Databases
Incomplete Data
Performance Issues
Cost of the Learning Set
Time and Memory Constraint
Predictive Ability
Performance Issues
Cost of the Learning Set
– number of examples necessary for training
– cost of assuring good accuracy
Performance Issues
Time and Memory Constraint
– time complexity of the learning phase
– time taken for evaluation
– time it takes to reach a certain level of accuracy
Performance Issues
Predictive Ability
– the ability to predict the correct decision for test or unseen data
– involves the generation of rules
– measuring the quality or accuracy of rules
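One way to put numbers on predictive ability is to apply the rules to labelled records that were held out from training and count how often they are right. The sketch below assumes three single-decision rules taken from the earlier slides, a first-match strategy, and five made-up test records; `predict`, `evaluate`, and the metric names are hypothetical.

```python
# Three single-decision rules from the earlier slides: (conditions, decision).
RULES = [
    ({"studies": "Poor", "work": "Poor"}, "None"),
    ({"studies": "Poor", "work": "Good"}, "Low"),
    ({"studies": "Moderate"}, "Low"),
]

# Hypothetical labelled test records, invented for this sketch only.
TEST = [
    {"studies": "Poor", "work": "Poor", "income": "None"},
    {"studies": "Poor", "work": "Good", "income": "Low"},
    {"studies": "Moderate", "work": "Poor", "income": "Low"},
    {"studies": "Moderate", "work": "Good", "income": "Medium"},
    {"studies": "Good", "work": "Good", "income": "Medium"},
]


def predict(record, rules):
    """Return the decision of the first rule that fires.

    Returns Python None when no rule fires; this is distinct from the
    income label "None", which is an ordinary string.
    """
    for conditions, decision in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return decision
    return None


def evaluate(records, rules, target="income"):
    """Compute coverage, accuracy on covered records, and overall accuracy."""
    covered = correct = 0
    for rec in records:
        predicted = predict(rec, rules)
        if predicted is None:
            continue  # no rule applies to this record
        covered += 1
        correct += predicted == rec[target]
    total = len(records)
    return {
        "coverage": covered / total,
        "rule_accuracy": correct / covered if covered else 0.0,
        "overall_accuracy": correct / total,
    }


METRICS = evaluate(TEST, RULES)
```

Here four of the five records are covered by some rule, three of those four are predicted correctly, and the uncovered record counts as a miss in the overall figure.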
DATA
No  AGE  SEX     CP              TRESTBPS  CHOL  FBS  RESTECG   THALACH  EXANG  OLDPEAK  SLOPE      CA  THAL          DISEASE
1   63   Male    Typical angina  145       233   T    LV hyper  150      No     2.3      Downslope  0   Fixed         No
2   67   Male    Asymp           160       286   F    LV hyper  108      Yes    1.5      Flat       3   Normal        Yes
3   67   Male    Asymp           120       229   F    LV hyper  129      Yes    2.6      Flat       2   Reversable    Yes
4   37   Male    Non-anginal     130       250   F    Normal    187      No     3.5      Downslope  0   Normal        No
5   41   Female  Atypical        130       204   F    LV hyper  172      No     1.4      Upsloping  0   Normal        No
6   56   Male    Atypical        120       236   F    Normal    178      No     0.8      Upsloping  0   Normal        No
7   62   Female  Asymp           140       268   F    LV hyper  160      No     3.6      Downslope  2   Normal        Yes
8   57   Female  Asymp           120       354   F    Normal    163      Yes    0.6      Upsloping  0   Normal        No
9   63   Male    Asymp           130       254   F    LV hyper  147      No     1.4      Flat       1   Reversable    Yes
10  53   Male    Asymp           140       203   T    LV hyper  155      Yes    3.1      Downslope  0   Reversable    Yes
11  57   Male    Asymp           140       192   F    Normal    148      No     0.4      Flat       0   Fixed defect  No
12  56   Female  Atypical        140       294   F    LV hyper  153      No     1.3      Flat       0   Normal        No
13  56   Male    Non-anginal     130       256   T    LV hyper  142      Yes    0.6      Flat       1   Fixed defect  Yes
14  44   Male    Atypical        120       263   F    Normal    173      No     0        Upsloping  0   Reversable    No
15  52   Male    Non-anginal     172       199   T    Normal    162      No     0.5      Upsloping  0   Reversable    No
16  57   Male    Non-anginal     150       168   F    Normal    174      No     1.6      Upsloping  0   Normal        No
17  48   Male    Atypical        110       229   F    Normal    168      No     1        Downslope  0   Reversable    Yes
18  54   Male    Asymp           140       239   F    Normal    160      No     1.2      Upsloping  0   Normal        No
19  48   Female  Non-anginal     130       275   F    Normal    139      No     0.2      Upsloping  0   Normal        No
20  49   Male    Atypical        130       266   F    Normal    171      No     0.6      Upsloping  0   Normal        No
Samples of the CLEV Dataset (before scaling)
oldpeak(0.7) => disease(No)
oldpeak(4.4) => disease(Yes)
chol(233) AND restecg(LV hypertrophy) => disease(No)
chol(204) AND restecg(LV hypertrophy) => disease(No)
chol(236) AND restecg(Normal) => disease(No)
chol(203) AND restecg(LV hypertrophy) => disease(Yes)
chol(294) AND restecg(LV hypertrophy) => disease(No)
chol(275) AND restecg(Normal) => disease(No)
chol(266) AND restecg(Normal) => disease(No)
chol(247) AND restecg(Normal) => disease(No)
chol(219) AND restecg(LV hypertrophy) => disease(No)
chol(266) AND restecg(LV hypertrophy) => disease(Yes)
chol(304) AND restecg(Normal) => disease(No)
chol(254) AND restecg(Normal) => disease(Yes)
chol(267) AND restecg(Normal) => disease(Yes)
chol(264) AND restecg(LV hypertrophy) => disease(No)
chol(234) AND restecg(LV hypertrophy) => disease(No)
Rules generated from the data mining process