7/28/2019 CS5550 Session 07 At
CS5550 Data Management and
Business Intelligence
Session 7:
Data Mining (II) Advanced DM
CS5550 Session 05 Slide 2
Session Learning Outcomes
The learning outcomes for this session
are that you can:
Understand how to use state-of-the-art DM techniques
Discuss the strengths and weaknesses of such methods
Discuss the use of key DM tools for business
intelligence
Recap on Data Mining
What it is
Definition: IDA, AI, KDD, etc.
Data to knowledge
Some typical tools:
Correlation
Regression
Clustering
Visualisation
More Advanced Techniques
Classifiers (e.g. Decision Trees)
Association Rules
Time-Series Models
Bayesian Networks
Principal Components Analysis to plot multidimensional data
Graph-Based Methods to explore multiple relationships
Optimisation
Classification
What sort of data is this?
Similar to Clustering, but Supervised Learning: we have sample classes to learn from:
Fraudulent Financial Reporting
Y = {fraudulent, truthful}
Predicting Delayed Flights
Y = {delayed, on time}
Classification
Supervised method unlike clustering
[Scatter plot: a cloud of two-dimensional data points, with two points marked 'x' to be classified]
Decision Trees
Established method for classifying data
Originated from biology
Easy to understand
Commonly used for:
Fraud detection
Credit Rating
[Scatter plot: Salary vs Age, showing Class 0 and Class 1 points and an unclassified point marked '?']
Decision Trees
Credit Rating example
Decision Trees
Advantages
Transparent & Interpretable
Can also perform Feature Selection
Can model complex relationships (non-linear)
Disadvantages
Risk of over-fitting the data (but trees can be pruned)
Require lots of data
Cannot model diagonal relationships (splits are always on one predictor, not combinations)
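A minimal sketch of how a decision tree chooses its first split, scanning every predictor and threshold for the lowest weighted Gini impurity. The age/salary records below are made up for illustration:

```python
# Minimal sketch: how a decision tree picks its first split.
# Each record is (age, salary, class); the data is illustrative.
data = [
    (40, 20, 0), (25, 25, 0), (50, 30, 0),
    (30, 55, 1), (45, 60, 1), (20, 70, 1),
]

def gini(rows):
    """Gini impurity of a set of (..., class) rows."""
    if not rows:
        return 0.0
    p = sum(r[-1] for r in rows) / len(rows)  # fraction of class 1
    return 2 * p * (1 - p)

def best_split(rows, n_features=2):
    """Scan every feature and midpoint threshold; return the split
    with the lowest weighted Gini impurity."""
    best = (None, None, float("inf"))
    for f in range(n_features):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for r in rows if r[f] <= t]
            right = [r for r in rows if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best[2]:
                best = (f, t, score)
    return best

feature, threshold, impurity = best_split(data)
print(feature, threshold, impurity)  # splits on salary (feature 1) at 42.5
```

Note how the split is always a threshold on one predictor, which is exactly why a tree cannot model a diagonal boundary directly.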
K-Nearest Neighbour
Find the K observations in the data that are most similar to the new observation we wish to classify
Requires:
Distance Metric
Voting Mechanism
Weighting Function
K-Nearest Neighbour
Distance metric, e.g. Euclidean
Weighting function
Neighbour voting mechanism, e.g. majority vote
[Diagram: a query point 'o' surrounded by '+' and '-' training points, shown for k=1, k=3 and k=5]
K-Nearest Neighbour
Advantages
Simplicity
Few assumptions about data
Disadvantages
Slow when there is a large number of datapoints
Need lots of data when there are many predictors
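The method's simplicity shows in code: a sketch with a Euclidean distance metric and a majority-vote mechanism, on made-up training points:

```python
# Minimal k-nearest-neighbour classifier: Euclidean distance,
# majority vote among the k closest training points.
# The training data below is illustrative.
import math
from collections import Counter

train = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
         ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]

def euclidean(a, b):
    return math.dist(a, b)  # sqrt of the sum of squared differences

def knn_classify(query, train, k=3):
    # Sort training points by distance to the query ...
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    # ... then let the k nearest neighbours vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((2, 2), train))  # the three nearest points are all "+"
```

The slowness noted above is visible here too: every query recomputes the distance to every training point.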
Other Classifiers
Linear Classifiers
Artificial Neural Network Classifiers
Support Vector Machines
Bayesian Classifiers
Association Rules
What data goes with what
Large amount of basket data
e.g. Supermarket purchases
Looks for associations between items
Builds If-Then rules, e.g. Nappies => Beer
Uses the notions of support and confidence
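Support and confidence are simple frequency counts over the baskets. A sketch, using a handful of made-up baskets and the Nappies => Beer rule:

```python
# Support and confidence of an association rule, computed over
# made-up baskets (items and numbers are illustrative).
baskets = [
    {"nappies", "beer"},
    {"nappies", "beer", "milk"},
    {"nappies"},
    {"beer"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"nappies", "beer"}))       # 2 of 5 baskets = 0.4
print(confidence({"nappies"}, {"beer"}))  # 2 of 3 nappies baskets
```

Rule miners such as Apriori keep only rules whose support and confidence exceed chosen thresholds, which is also why rare combinations are ignored.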
Association Rules
Example:
Rule 1. If the quality of the management is medium, then the company may have a profit or a loss (C3, C4).
Rule 2. If the quality of the management is (at least) high and the number of employees is similar to 700, then the company makes a profit (C1).
Rule 3. If the quality of the management is (at most) low, then the company has a loss (C5, C6).
Rule 4. If the number of employees is similar to 420 and the localization is B, then the company has a loss (C2).
Association Rules
Advantages
Disadvantages
Profusion of rules generated
Ignores rare (but potentially interesting) combinations
Neural Networks
Map inputs to outputs using weights
Back-propagation algorithm to learn weights
[Diagram: feed-forward network with input layer, hidden layer and output layer]
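The mapping from inputs to outputs via weights can be sketched as a forward pass through a tiny 2-2-1 network. The weights below are arbitrary illustrative values; in practice they would be learned by back-propagation:

```python
# Forward pass of a tiny fixed-weight network (2 inputs, 2 hidden
# units, 1 output): inputs are mapped to outputs via the weights.
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# w_hidden[i][j]: weight from input i to hidden unit j (illustrative)
w_hidden = [[0.5, -0.4], [0.3, 0.8]]
# w_output[j]: weight from hidden unit j to the output (illustrative)
w_output = [1.2, -0.7]

def forward(x1, x2):
    # hidden layer: weighted sum of inputs, squashed by the sigmoid
    h = [sigmoid(x1 * w_hidden[0][j] + x2 * w_hidden[1][j]) for j in range(2)]
    # output layer: weighted sum of hidden activations
    return sigmoid(sum(hj * wj for hj, wj in zip(h, w_output)))

print(forward(1.0, 0.0))  # a value between 0 and 1
```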
Neural Networks
Forecasting Markets
Predicting Stock collapse
Classifying exceptional behaviour in customers
Fraudulent credit card usage
Online monitoring
Neural Networks
Advantages
Can model complex relationships
Versatile
Disadvantages
Suffer very badly from over-fitting
Need lots of data
Do not select features automatically
Black-box model
Time-Series Models
Predicting the Stock-Market
Long been the goal of:
mathematicians
statisticians
computer scientists
philosophers
(Pi, Harvest Filmworks, 1999)
Time-Series Models
Statistical Models
AI Models such as Neural Networks
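Among the simplest statistical models is a moving average: forecast the next value as the mean of the last k observations. A sketch on a made-up price series:

```python
# Simple statistical time-series model: forecast the next value
# as the mean of the last k observations (the series is made up).
def moving_average_forecast(series, k=3):
    window = series[-k:]  # the most recent k observations
    return sum(window) / len(window)

prices = [10, 12, 11, 13, 12]
print(moving_average_forecast(prices))  # (11 + 13 + 12) / 3 = 12.0
```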
Bayesian Networks
Overcome black box nature of NNs
Model data using probabilities and graphs
No hidden layers or weights
Models a joint distribution: the probability of any event is calculable
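A sketch of the smallest possible network, two nodes (Fraud influencing Alert) with illustrative probabilities: the joint distribution is the product of each node's conditional table, so any event, including the diagnostic query P(Fraud | Alert), is calculable:

```python
# Two-node Bayesian network sketch: Fraud -> Alert.
# All probabilities are illustrative.
p_fraud = 0.01                  # P(Fraud)
p_alert_given = {True: 0.9,     # P(Alert | Fraud)
                 False: 0.05}   # P(Alert | no Fraud)

def joint(fraud, alert):
    """P(Fraud = fraud, Alert = alert), as a product of the tables."""
    pf = p_fraud if fraud else 1 - p_fraud
    pa = p_alert_given[fraud]
    return pf * (pa if alert else 1 - pa)

# Inference by Bayes' rule: P(Fraud | Alert)
p_fraud_given_alert = joint(True, True) / (joint(True, True) + joint(False, True))
print(round(p_fraud_given_alert, 3))  # about 0.154
```

Unlike a neural network, every number here has a direct probabilistic reading, which is the sense in which the model is not a black box.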
Bayesian Networks
Advantages?
Disadvantages?
Optimisation
For searching through huge numbers of
possible solutions:
Scheduling processes
Manufacturing
Deliveries
Bin Packing of objects
Efficient loading of crates prior to shipping
Routing for efficient delivery
Optimisation
Well known Techniques:
Greedy Searches
Hill Climb
Simulated Annealing
Genetic Algorithms
Gradient Descent
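Hill climbing, the simplest of these, can be sketched in a few lines: repeatedly move to the better neighbour until none improves. The objective function is illustrative:

```python
# Hill climbing sketch: step towards the better neighbour until
# no neighbour improves. The objective function is illustrative.
def f(x):
    return -(x - 3) ** 2  # a single peak at x = 3

def hill_climb(x=0.0, step=0.5):
    while True:
        best = max([x - step, x + step], key=f)
        if f(best) <= f(x):   # no improvement: local optimum reached
            return x
        x = best

print(hill_climb())  # climbs from 0.0 to the peak at 3.0
```

Hill climbing gets stuck on local optima; simulated annealing and genetic algorithms extend this idea by sometimes accepting worse moves to escape them.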
Optimisation
For example: Travelling Salesman Problem
Famous NP-hard problem
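Because exhaustive search is infeasible, TSP is usually attacked with heuristics. A sketch of the greedy nearest-neighbour heuristic, on four illustrative cities: from the current city, always visit the closest unvisited one. Fast, but not guaranteed optimal:

```python
# Greedy nearest-neighbour heuristic for the Travelling Salesman
# Problem. City coordinates are illustrative.
import math

cities = [(0, 0), (0, 1), (1, 1), (1, 0)]

def nearest_neighbour_tour(start=0):
    tour = [start]
    unvisited = set(range(len(cities))) - {start}
    while unvisited:
        here = cities[tour[-1]]
        # greedy step: closest unvisited city
        nxt = min(unvisited, key=lambda c: math.dist(here, cities[c]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(tour):
    # total distance, including the leg back to the start
    return sum(math.dist(cities[a], cities[b])
               for a, b in zip(tour, tour[1:] + tour[:1]))

tour = nearest_neighbour_tour()
print(tour, tour_length(tour))  # visits all four corners; length 4.0
```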
Optimisation: Bin Packing
Trucks with capacity of 10
How few required to store objects of size:
{3, 6, 2, 1, 5, 7, 2, 4, 1, 9}?
Optimisation: Bin Packing
Search techniques for finding the best
allocation for objects within fixed size
containers
Potential heuristic approaches:
First Fit
Next Fit
Best Fit
Worst Fit
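First Fit can be sketched on the truck example from the earlier slide: place each object in the first bin with room, opening a new bin only when none fits:

```python
# First-fit bin packing sketch: put each object in the first bin
# with room; open a new bin only when none fits.
def first_fit(objects, capacity=10):
    bins = []
    for size in objects:
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing bin had room
    return bins

objects = [3, 6, 2, 1, 5, 7, 2, 4, 1, 9]
print(len(first_fit(objects)))                        # 5 trucks
# Sorting largest-first (first-fit decreasing) often does better:
print(len(first_fit(sorted(objects, reverse=True))))  # 4 trucks
```

The total size is 40, so 4 trucks of capacity 10 is the best possible; plain first fit misses it, while the decreasing variant achieves it here.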
Optimisation: Bin Packing
Also 2D and 3D approaches
Business Intelligence
Data Integration + Data Mining + Human
Expertise => Business Intelligence:
Improved Decision Making
Quicker Response Times
Better Broadcasting / Marketing
Weaknesses of Data Mining
Data Quality
Spurious Correlations
Over-fitting
Black Box Modelling
Over-reliance: becoming a slave to the data (can't see the wood for the trees)
Session Summary
This session has examined:
Advanced Data Mining Techniques
with examples
Advantages and Disadvantages
Next Session: Guest Lecture
Case Study of the application of BI