1 Your reference
Title
Introduction to Data Mining
Dr Arulsivanathan Naidoo
Statistics South Africa
OECD Conference Cape Town
8-10 December 2010
Outline
• Introduction
• Statistics vs Knowledge Discovery
• Predictive Modeling
• Data Mining Examples
• Census 2011
• ROC
• Conclusions
Introduction
• What is Data Mining?
• Data Mining is a general term.
• Data mining is defined as the application of intelligent techniques such as decision trees, neural networks, fuzzy logic, genetic algorithms, the nearest-neighbour method, rule induction and data visualization to large quantities of data to discover hidden trends, patterns and relationships (Lam and Kamber 2006).
Origins of Data Mining
Artificial Intelligence
Databases
Statistics
Data Mining
KDD
Pattern Recognition
Machine Learning
Neurocomputing
Hypothesis Testing
Approach: Top-down
Statement
Hypothesis
Analysis
Decision: Accept H0
Knowledge Discovery
Approach: Bottom-up
Data
Question: What item is purchased with disposable baby napkins (nappies)?
Answer: Beer
Statement
Unsupervised learning
Data
Association: items bought together
Disassociation: items not bought together
Sequential: items bought in order
Cluster / SOM: grouping into segments
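As a rough illustration of "items bought together", the sketch below computes support and confidence for one pair of items. The baskets and the helper name `pair_stats` are invented for illustration and are not data from the talk.

```python
# Hypothetical shopping baskets (illustrative only, not from the talk).
baskets = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"beer", "crisps"},
    {"milk", "bread"},
]


def pair_stats(baskets, a, b):
    """Return (support, confidence) for the association rule a -> b."""
    n = len(baskets)
    both = sum(1 for basket in baskets if a in basket and b in basket)
    has_a = sum(1 for basket in baskets if a in basket)
    support = both / n                           # share of all baskets with both items
    confidence = both / has_a if has_a else 0.0  # share of a-baskets that also hold b
    return support, confidence


support, confidence = pair_stats(baskets, "nappies", "beer")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule is usually kept only when both numbers clear thresholds chosen by the analyst.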
Supervised Learning
Data
Target Variable
Decision Tree
Regression
Neural Network
Two Stage
What is a Model?
One Word
Equation
Straight Line
Y = mX + c
Example: Countryside
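To make the straight-line model concrete, here is a minimal sketch that estimates m and c by ordinary least squares. The function name `fit_line` and the four data points are assumptions chosen for illustration; they do not come from the presentation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of slope m and intercept c in Y = mX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Illustrative points that lie exactly on Y = 2X + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, c = fit_line(xs, ys)
print(m, c)
```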
Decision Tree
A decision tree model is constructed by segmenting a dataset using a series of simple rules, resulting in a hierarchy of segments within segments. Algorithms such as CHAID (chi-squared automatic interaction detection) can be used to decide how to split the segments. The hierarchy is called a tree and each segment a node.
Decision Tree
100 M, 100 W
Short Hair / Long Hair
Earrings / No Earrings
Predict everyone with short hair and earrings is female
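The sketch below shows one way a chi-squared criterion could choose the first split for a 100 men / 100 women example. The cell counts are invented, and the code is a simplification: it only compares Pearson chi-squared statistics, whereas full CHAID also merges categories and adjusts p-values.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: each row is one attribute value, columns are (men, women).
candidate_splits = {
    "hair": [[90, 20], [10, 80]],      # short hair, long hair
    "earrings": [[10, 60], [90, 40]],  # earrings, no earrings
}

scores = {name: chi_squared(table) for name, table in candidate_splits.items()}
best = max(scores, key=scores.get)  # split with the strongest association to sex
print(best, scores)
```

With these made-up counts the hair split scores higher, so it would be tried first and the earrings split would then be evaluated inside each hair segment.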
Regression
Inputs: x1, x2, x3
Output: Y
Neural Networks
Inputs: x1, x2, x3
Black Box: H1, H2
Outputs: Y
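What happens inside the "black box" can be sketched as a single forward pass through a small network with three inputs, two hidden units (H1, H2) and one output Y. The sigmoid activation and all weights below are assumptions picked for illustration; a real network would learn its weights from training data.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> hidden units -> single output."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    y = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
    return hidden, y


# Arbitrary illustrative weights (not learned, not from the talk).
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_biases = [0.0, 0.1]
output_weights = [1.0, -1.0]
output_bias = 0.0

hidden, y = forward([1.0, 2.0, 3.0], hidden_weights, hidden_biases,
                    output_weights, output_bias)
print(hidden, y)
```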
Two Stage
Buy from every Catalogue R100
Buy from Catalogue once/year R5000
Eurostat Funding
KESO (Knowledge Extraction for Statistical Offices): a Eurostat project whose goal is to construct a versatile, efficient, industrial-strength data mining system that satisfies the needs of providers of large-scale databases.
SPIN (Spatial Mining for Data of Public Interest): developed to support statistical offices in the timely and cost-effective dissemination of statistical data by integrating state-of-the-art GIS and data mining functionality in an open, highly extensible, internet-enabled plug-in architecture.
IDSA (Intelligent Data Control System), Hassain et al. 2010: an application of data mining to official statistics.
NASS Decision Trees
Census Non Response Weighting
Census Mail List Trimming
Analysis of reporting Errors
Allocation of Survey Incentives
Prediction of Survey Non Respondents
NASS
Association Analysis
• Survey Data Edit Design
Cluster Analysis
• 2007 Census Donor Pool Screening
• Questionnaire Design and Construction
• Identifying Subtypes of Records Missing from the Census Mail List
Examples
• Absa Branch Robberies
• Old Mutual Policies
• MTN prepaid
• HSBC Bank Credit Cards
• Royal Saudi Air Force
• Census 2011
Census 2011
[Diagram labels: Census 2001 data; Sample; Model A; Model B; Model C; Assess; Score; Results (Ranking): Will Respond, High Wall Areas, Informal Areas]
Prediction Types
Training Data
Case 1: inputs, target
Case 2: inputs, target
Case 3: inputs, target
Case 4: inputs, target
Case 5: inputs, target
Predictions: Decisions, Rankings, Estimates
Prediction Types
Training Data | Decisions
Case 1: inputs, target | Success
Case 2: inputs, target | Failure
Case 3: inputs, target | Failure
Case 4: inputs, target | Success
Case 5: inputs, target | Success
Prediction Types
Training Data | Rankings
Case 1: inputs, target | 680
Case 2: inputs, target | 720
Case 3: inputs, target | 640
Case 4: inputs, target | 582
Case 5: inputs, target | 635
Prediction Types
Training Data | Estimates
Case 1: inputs, target | 0.45
Case 2: inputs, target | 0.53
Case 3: inputs, target | 0.62
Case 4: inputs, target | 0.55
Case 5: inputs, target | 0.47
Prediction Type | Validation Fit Statistic | Direction
Decisions | Misclassification | Smallest
Decisions | Average Profit/Loss | Largest/Smallest
Decisions | Kolmogorov-Smirnov Statistic | Largest
Rankings | ROC Index (Concordance) | Largest
Rankings | Gini Coefficient | Largest
Estimates | Average Square Error | Smallest
Estimates | Schwarz’s Bayesian Criterion | Smallest
Estimates | Log-likelihood | Largest
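A few of these fit statistics can be sketched in plain Python. The estimates reuse the five values shown on the Estimates slide; the targets, the 0.5 decision cutoff and the function names are assumptions made for illustration. The Gini coefficient is derived from the ROC index as 2 × ROC index − 1.

```python
def misclassification_rate(targets, decisions):
    """Share of cases whose decision differs from the target (smaller is better)."""
    wrong = sum(1 for t, d in zip(targets, decisions) if t != d)
    return wrong / len(targets)


def average_squared_error(targets, estimates):
    """Mean squared gap between target and estimate (smaller is better)."""
    return sum((t - p) ** 2 for t, p in zip(targets, estimates)) / len(targets)


def roc_index(targets, scores):
    """Concordance: share of (event, non-event) pairs in which the event
    receives the higher score; ties count as half (larger is better)."""
    events = [s for t, s in zip(targets, scores) if t == 1]
    non_events = [s for t, s in zip(targets, scores) if t == 0]
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


estimates = [0.45, 0.53, 0.62, 0.55, 0.47]  # values from the Estimates slide
targets = [0, 1, 1, 0, 1]                   # hypothetical outcomes
decisions = [1 if p >= 0.5 else 0 for p in estimates]  # assumed cutoff of 0.5

auc = roc_index(targets, estimates)
gini = 2 * auc - 1
print(misclassification_rate(targets, decisions),
      average_squared_error(targets, estimates), auc, gini)
```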
Confusion Matrix
Classes: male, female
 | Predicted positive | Predicted negative
Actual positive | a = True Positive | b = False Negative
Actual negative | c = False Positive | d = True Negative
ROC
• Sensitivity = a / (a + b)
• Specificity = d / (c + d)
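A minimal sketch of these two formulas, using the cell names a, b, c and d from the confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sensitivity(a, b):
    """True positive rate: a true positives out of a + b actual positives."""
    return a / (a + b)


def specificity(c, d):
    """True negative rate: d true negatives out of c + d actual negatives."""
    return d / (c + d)


# Hypothetical counts: a = TP, b = FN, c = FP, d = TN.
a, b, c, d = 40, 10, 5, 45

sens = sensitivity(a, b)
spec = specificity(c, d)
# One point on a ROC curve is (1 - specificity, sensitivity) at a given cutoff.
roc_point = (1 - spec, sens)
print(sens, spec, roc_point)
```

Repeating the calculation at different cutoffs gives the set of points that trace out a ROC curve.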
ROC
The ROC (Receiver Operating Characteristic) curve was first used during World War II, following the attack on Pearl Harbor in 1941, when the US Army researched how to correctly detect Japanese aircraft from their radar signals.
ROC Curve
Conclusion
Data mining is a growing discipline which originated outside statistics, in the database community, mainly for commercial purposes. Today data mining can be considered a branch of exploratory statistics in which useful models and patterns are uncovered through the extensive use of algorithms.
Finally, who should analyse huge data sets: the national statistics offices (NSOs) or other research institutions?
Data mining techniques use individual records, not aggregate data, and the law imposes a confidentiality clause. The NSOs are best placed to do this work, and this will imply new directions of research.
Conclusion
Official statistics should be a field for data mining, giving new life and value to its huge databases, but this may imply a redefinition of the visions and missions of official statistics offices. South Africa changed its vision and mission this year.
At Statistics South Africa we have acquired data mining software and have started a data mining user group of over 100 researchers. We hope to start a working paper series in which some of this research will be published on our website for comment.
Thank you