Data Mining Theory and Practice Dr. Azuraliza Abu Bakar http://www.ftsm.ukm.my/jabatan/ts/aab/index.htm


Page 1: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining

Theory and Practice

Dr. Azuraliza Abu Bakar

http://www.ftsm.ukm.my/jabatan/ts/aab/index.htm

Page 2: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

What is Pattern Recognition?

Pattern Recognition by Humans
– perceptual
– specialized
– decision making

Pattern Recognition by Computers
– benefit of automated pattern recognition
– advantage in complex calculations

Pattern Recognition from Data (Data Mining)

Page 3: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Pattern Recognition from Data

Pattern recognition from data is the process of learning from past data by studying the dependencies in it and extracting knowledge from it.

Page 4: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

What is Data?

 

No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSC         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low

Page 5: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

What is Knowledge??

studies(Poor) AND work(Poor) => income(None)

studies(Poor) AND work(Good) => income(Low)

education(Diploma) => income(Low)

education(MSc) => income(Medium) OR income(High)

studies(Moderate) => income(Low)

studies(Good) => income(Medium) OR income(High)

education(SPM) AND work(Good) => income(Low)
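
The rules above can be read as condition/conclusion pairs. The sketch below is one possible encoding, not the author's implementation: the dictionary layout, the function names, and the choice to intersect the conclusions of all rules that fire are assumptions made for illustration. "Mod" is written out as "Moderate", and "MSc" follows the spelling used in the rules.

```python
# Rules copied from the slide as (conditions, allowed income values).
# The data layout is an illustrative assumption, not part of the slides.
RULES = [
    ({"studies": "Poor", "work": "Poor"}, {"None"}),
    ({"studies": "Poor", "work": "Good"}, {"Low"}),
    ({"education": "Diploma"}, {"Low"}),
    ({"education": "MSc"}, {"Medium", "High"}),
    ({"studies": "Moderate"}, {"Low"}),
    ({"studies": "Good"}, {"Medium", "High"}),
    ({"education": "SPM", "work": "Good"}, {"Low"}),
]


def fired_rules(record):
    """Return the conclusion of every rule whose conditions all hold."""
    return [
        incomes
        for conditions, incomes in RULES
        if all(record.get(attr) == value for attr, value in conditions.items())
    ]


def predict(record):
    """Intersect the conclusions of all fired rules (empty set if none fire).

    Intersection is an assumed way of combining rules; the slides do not
    say how overlapping rules are resolved.
    """
    fired = fired_rules(record)
    return set.intersection(*fired) if fired else set()


row_1 = {"studies": "Poor", "education": "SPM", "work": "Poor"}
row_7 = {"studies": "Good", "education": "MSc", "work": "Good"}
print(predict(row_1))  # only 'None' remains
print(predict(row_7))  # 'Medium' and 'High' both remain
```

For row 1 only the first rule fires, so the result is income None. For row 7 two rules fire and both allow Medium or High, which matches the OR in the rule conclusions.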

Page 6: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

What is Data Mining??

Extraction of knowledge from data

Exploration and analysis of large quantities of data to discover meaningful patterns.

Discover Knowledge

Page 7: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

How does data mining look into data??

[Figure labels: Data, Data, Data]

Page 8: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining : Motivation

Huge amounts of data

Important need for turning data into useful information

The fast-growing amount of data, collected and stored in large and numerous databases, has exceeded the human ability to comprehend it without powerful tools

Page 9: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Questions??

What goods should be promoted to this customer?

What is the probability that a certain customer will respond to a planned promotion?

Can one predict the most profitable securities to buy/sell during the next trading session?

Will this customer default on a loan or pay back on schedule?

What medical diagnosis should be assigned to this patient?

What kind of cars should be sold this year?

Page 10: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining is simply...

Finding relationships

Making predictions

Page 11: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining : One Step of KDD

Task

KDD

Data mining

Techniques

Page 12: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining as a Step of KDD

[Figure: the KDD process, listed in order from source data to knowledge]

Databases, Flat files

Cleaning and Integration

Data Warehouse

Selection and Transformation

Data Mining

Patterns

Evaluation & Presentation

Knowledge

Page 13: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Early Steps of Data Mining

Data preprocessing
– handling incomplete data, noisy data, uncertain data

Data discretization/representation
– transforms data into suitable values for the mining algorithm to find patterns

Data selection
– selects the suitable data for mining purposes
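
The three early steps can be sketched in a few lines of plain Python. The slides do not prescribe a method for any of them, so everything here is an assumption chosen for illustration: missing values are filled with the most frequent observed value, numeric values are discretized into equal-width bins, and selection simply keeps the named columns. The function names and sample values are hypothetical.

```python
from collections import Counter


def fill_missing(values, missing=None):
    """Preprocessing: replace missing entries with the most frequent value."""
    observed = [v for v in values if v is not missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


def equal_width_bins(values, n_bins):
    """Discretization: map each number to a bin index 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def select_columns(records, columns):
    """Selection: keep only the attributes chosen for mining."""
    return [{c: r[c] for c in columns} for r in records]


# Hypothetical values, not taken from the slides.
print(fill_missing(["Poor", None, "Poor", "Good"]))
print(equal_width_bins([120, 130, 140, 160, 172], n_bins=3))
print(select_columns([{"id": 1, "studies": "Poor", "work": "Good"}], ["studies", "work"]))
```

Equal-width binning is only one discretization choice; equal-frequency bins or domain-specific cut points are common alternatives.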

Page 14: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining Techniques

Decision Trees

Neural Network

Genetic Algorithms

Fuzzy Set Theory

Rough Set Theory

Statistical Method (Regression Analysis)

Page 15: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Classification of Data Mining Systems

Kinds of DB
– Relational
– Data warehouse
– Transactional DB
– Advanced DB system
– Flat files
– WWW

Kinds of Knowledge
– Classification
– Association
– Clustering
– Prediction
– ……

Page 16: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Classification of Data Mining Systems

Techniques used
– DB-oriented techniques
– Statistics
– Machine learning
– Pattern recognition
– Neural Network
– Rough Set, etc.

Application adapted
– Finance
– Marketing
– Medical
– Stock
– Telecommunication, etc.

Page 17: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining: Confluence of Multiple Disciplines

DATA MINING
– Database technology
– Statistics
– Machine learning
– Information science
– Neural network
– Pattern recognition
– Visualization
– Information retrieval
– High-performance computing
– Spatial data analysis

Page 18: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining

What are we looking at??

What are we looking for??

Page 19: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining Tasks

– Prediction
– Classification
– Clustering
– Association Rules
– Sequential Analysis
– Deviation analysis
– Similarity analysis
– Trend analysis

Page 38: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Classification

Classification algorithm

Training data

No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSC         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low

Classification Rules

If studies=“poor” and work=“poor” then Income=“poor”
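
The step from training data to classification rules can be sketched with a very simple learner. This is not one of the classifiers named in the slides: it is a majority-vote lookup table over the (Studies, Works) combination, chosen only because it is short enough to follow by eye. It uses the nine rows listed in the training table (the rows elided by ":" are not available), and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# The nine listed training rows: (studies, education, works, income).
TRAINING = [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSC", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]


def learn_rules(rows):
    """Map each (studies, works) pair to its most frequent income class."""
    counts = defaultdict(Counter)
    for studies, _education, works, income in rows:
        counts[(studies, works)][income] += 1
    return {cond: c.most_common(1)[0][0] for cond, c in counts.items()}


def classify(rules, studies, works):
    """Return the learned income class, or None if no rule covers the pair."""
    return rules.get((studies, works))


rules = learn_rules(TRAINING)
for (studies, works), income in rules.items():
    print(f"If studies={studies} and work={works} then income={income}")
```

On the listed rows this produces four rules, one per distinct (Studies, Works) pair. The income class is kept as the string "None", so it is not confused with Python's `None`, which `classify` returns when no rule applies.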

Page 39: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Classification

Test data

Studies    Education   Works   Income (D)
Moderate   Diploma     Poor    ?
Poor       SPM         Poor    ?
Moderate   Diploma     Poor    ?
Good       MSC         Good    ?
:

New data

studies=“poor” and work=“poor”

Classification rules

poor

classify

Page 40: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Type of Classifiers

Statistical Classifier
– Bayesian approach
– Multiple Regression
– K-nearest neighbour
– Naïve Bayes
– Causal Network
– Discriminant Analysis

Neural Classifier
– Hopfield Network
– Multilayer Perceptron
– Radial Basis Function
– Kohonen Networks

Rough Classifier

Page 41: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

DATASET

 

No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSC         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low

Page 42: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

RULES

studies(Poor) AND work(Poor) => income(None)

studies(Poor) AND work(Good) => income(Low)

education(Diploma) => income(Low)

education(MSc) => income(Medium) OR income(High)

studies(Moderate) => income(Low)

studies(Good) => income(Medium) OR income(High)

education(SPM) AND work(Good) => income(Low)

Page 43: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Comparing Classifiers

– Predictive Accuracy
– Speed
– Robustness
– Scalability
– Interpretability

Page 44: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Data Mining : Problems and Challenges

Noisy data

Difficult Training Set

Dynamic Databases

Large Databases

Incomplete Data

Page 45: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Performance Issues

Cost of the Learning Set

Time and Memory Constraint

Predictive Ability

Page 46: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Performance Issues

Cost of the Learning Set

-number of examples necessary for training

-cost of ensuring good accuracy

Page 47: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Performance Issues

Time and Memory Constraint

-time complexity of the learning phase

-time taken for evaluation

-time it takes to reach a certain level of accuracy

Page 48: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

Performance Issues

Predictive Ability

-the ability to predict the correct decision for test or unseen data

-involves the generation of rules

-measuring the quality or accuracy of rules
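
One way to measure the quality of a rule is to count how many records it covers and how often its conclusion is right on those records. The sketch below applies this to the nine listed rows of the Studies/Education/Works/Income dataset. The names `coverage` and `accuracy` and the function itself are illustrative assumptions, not the deck's definitions, and the second rule, education(SPM) => income(Low), is a hypothetical rule added only for contrast.

```python
# The nine listed rows of the income dataset.
RECORDS = [
    {"studies": "Poor", "education": "SPM", "work": "Poor", "income": "None"},
    {"studies": "Poor", "education": "SPM", "work": "Good", "income": "Low"},
    {"studies": "Moderate", "education": "SPM", "work": "Poor", "income": "Low"},
    {"studies": "Moderate", "education": "Diploma", "work": "Poor", "income": "Low"},
    {"studies": "Poor", "education": "SPM", "work": "Poor", "income": "None"},
    {"studies": "Moderate", "education": "Diploma", "work": "Poor", "income": "Low"},
    {"studies": "Good", "education": "MSC", "work": "Good", "income": "Medium"},
    {"studies": "Poor", "education": "SPM", "work": "Good", "income": "Low"},
    {"studies": "Moderate", "education": "Diploma", "work": "Poor", "income": "Low"},
]


def rule_quality(conditions, conclusion, records):
    """Return (coverage, accuracy) of the rule `conditions => conclusion`.

    coverage: fraction of all records matching the conditions.
    accuracy: fraction of matching records whose income equals the conclusion.
    """
    covered = [r for r in records if all(r[k] == v for k, v in conditions.items())]
    correct = [r for r in covered if r["income"] == conclusion]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# Rule from the slides: education(Diploma) => income(Low)
print(rule_quality({"education": "Diploma"}, "Low", RECORDS))
# Hypothetical rule for contrast: education(SPM) => income(Low)
print(rule_quality({"education": "SPM"}, "Low", RECORDS))
```

On these rows the Diploma rule covers 3 of 9 records and is right on all of them, while the hypothetical SPM rule covers 5 of 9 records and is right on 3 of them.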

Page 49: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

DATA

No  AGE  SEX     CP              TRESTBPS  CHOL  FBS  RESTECG   THALACH  EXANG  OLDPEAK  SLOPE      CA  THAL          DISEASE
1   63   Male    Typical angina  145       233   T    LV hyper  150      No     2.3      Downslope  0   Fixed         No
2   67   Male    Asymp           160       286   F    LV hyper  108      Yes    1.5      Flat       3   Normal        Yes
3   67   Male    Asymp           120       229   F    LV hyper  129      Yes    2.6      Flat       2   Reversable    Yes
4   37   Male    Non-anginal     130       250   F    Normal    187      No     3.5      Downslope  0   Normal        No
5   41   Female  Atypical        130       204   F    LV hyper  172      No     1.4      Upsloping  0   Normal        No
6   56   Male    Atypical        120       236   F    Normal    178      No     0.8      Upsloping  0   Normal        No
7   62   Female  Asymp           140       268   F    LV hyper  160      No     3.6      Downslope  2   Normal        Yes
8   57   Female  Asymp           120       354   F    Normal    163      Yes    0.6      Upsloping  0   Normal        No
9   63   Male    Asymp           130       254   F    LV hyper  147      No     1.4      Flat       1   Reversable    Yes
10  53   Male    Asymp           140       203   T    LV hyper  155      Yes    3.1      Downslope  0   Reversable    Yes
11  57   Male    Asymp           140       192   F    Normal    148      No     0.4      Flat       0   Fixed defect  No
12  56   Female  Atypical        140       294   F    LV hyper  153      No     1.3      Flat       0   Normal        No
13  56   Male    Non-anginal     130       256   T    LV hyper  142      Yes    0.6      Flat       1   Fixed defect  Yes
14  44   Male    Atypical        120       263   F    Normal    173      No     0        Upsloping  0   Reversable    No
15  52   Male    Non-anginal     172       199   T    Normal    162      No     0.5      Upsloping  0   Reversable    No
16  57   Male    Non-anginal     150       168   F    Normal    174      No     1.6      Upsloping  0   Normal        No
17  48   Male    Atypical        110       229   F    Normal    168      No     1        Downslope  0   Reversable    Yes
18  54   Male    Asymp           140       239   F    Normal    160      No     1.2      Upsloping  0   Normal        No
19  48   Female  Non-anginal     130       275   F    Normal    139      No     0.2      Upsloping  0   Normal        No
20  49   Male    Atypical        130       266   F    Normal    171      No     0.6      Upsloping  0   Normal        No

Samples of the CLEV Dataset (before scaling)

Page 50: Data Mining Theory and Practice Dr. Azuraliza Abu Bakar

oldpeak(0.7) => disease(No)

oldpeak(4.4) => disease(Yes)

chol(233) AND restecg(LV hypertrophy) => disease(No)

chol(204) AND restecg(LV hypertrophy) => disease(No)

chol(236) AND restecg(Normal) => disease(No)

chol(203) AND restecg(LV hypertrophy) => disease(Yes)

chol(294) AND restecg(LV hypertrophy) => disease(No)

chol(275) AND restecg(Normal) => disease(No)

chol(266) AND restecg(Normal) => disease(No)

chol(247) AND restecg(Normal) => disease(No)

chol(219) AND restecg(LV hypertrophy) => disease(No)

chol(266) AND restecg(LV hypertrophy) => disease(Yes)

chol(304) AND restecg(Normal) => disease(No)

chol(254) AND restecg(Normal) => disease(Yes)

chol(267) AND restecg(Normal) => disease(Yes)

chol(264) AND restecg(LV hypertrophy) => disease(No)

chol(234) AND restecg(LV hypertrophy) => disease(No)

Rules generated from the data mining process
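
The two-condition rules above test exact cholesterol values. The sketch below checks how many of the 20 sample rows of the CLEV table each rule covers. Only the CHOL, RESTECG and DISEASE columns are used. It assumes that "LV hyper" in the table and "LV hypertrophy" in the rules denote the same value; the function and variable names are hypothetical.

```python
LV = "LV hypertrophy"  # assumed to be the value the table abbreviates as "LV hyper"

# (chol, restecg, disease) for sample rows 1..20 of the CLEV table.
SAMPLES = [
    (233, LV, "No"), (286, LV, "Yes"), (229, LV, "Yes"), (250, "Normal", "No"),
    (204, LV, "No"), (236, "Normal", "No"), (268, LV, "Yes"), (354, "Normal", "No"),
    (254, LV, "Yes"), (203, LV, "Yes"), (192, "Normal", "No"), (294, LV, "No"),
    (256, LV, "Yes"), (263, "Normal", "No"), (199, "Normal", "No"), (168, "Normal", "No"),
    (229, "Normal", "Yes"), (239, "Normal", "No"), (275, "Normal", "No"), (266, "Normal", "No"),
]

# The chol/restecg rules from the slide as (chol, restecg, predicted disease).
RULES = [
    (233, LV, "No"), (204, LV, "No"), (236, "Normal", "No"), (203, LV, "Yes"),
    (294, LV, "No"), (275, "Normal", "No"), (266, "Normal", "No"), (247, "Normal", "No"),
    (219, LV, "No"), (266, LV, "Yes"), (304, "Normal", "No"), (254, "Normal", "Yes"),
    (267, "Normal", "Yes"), (264, LV, "No"), (234, LV, "No"),
]


def covered_rows(rule, samples):
    """Return the 1-based row numbers whose chol and restecg match the rule."""
    chol, restecg, _ = rule
    return [i + 1 for i, (c, r, _) in enumerate(samples) if c == chol and r == restecg]


coverage = {rule: covered_rows(rule, SAMPLES) for rule in RULES}
for (chol, restecg, disease), rows in coverage.items():
    print(f"chol({chol}) AND restecg({restecg}) => disease({disease}): rows {rows}")
```

On the 20 sample rows, seven of the fifteen rules match exactly one row each and the rest match none, so each exact-value rule describes at most a single sample. Grouping cholesterol into ranges before mining would let one rule cover several rows.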