CS5550 Session 07 At

7/28/2019 CS5550 Session 07 At


CS5550 Data Management and Business Intelligence

Session 7: Data Mining (II) Advanced DM


Session Learning Outcomes

The learning outcomes for this session are that you can:

Understand how to use state-of-the-art DM techniques

Discuss the strengths and weaknesses of such methods

Discuss the use of key DM tools for business intelligence



Recap on Data Mining

What it is

Definition

IDA, AI, KDD, etc.

Data to knowledge

Some typical tools

Correlation

Regression

Clustering

Visualisation


More Advanced Techniques

Classifiers (e.g. Decision Trees)

Association Rules

Time-Series Models

Bayesian Networks

Principal Components Analysis to plot multidimensional data

Graph Based Methods to explore multiple relationships

Optimisation


Classification

What sort of data is this?

Similar to Clustering but Supervised Learning: we have sample classes to learn from:

Fraudulent Financial Reporting

Y = {fraudulent, truthful}

Predicting Delayed Flights

Y = {delayed, on time}


Classification

Supervised method unlike clustering

[Figure: scatter plot of sample points; image not preserved]



Decision Trees

Established method for classifying data

Originated from biology

Easy to understand

Commonly used for:

Fraud detection

Credit Rating

[Figure: scatter plot of Salary against Age showing Class 0 and Class 1 points, plus an unlabelled point (?) to be classified]



Decision Trees

Credit Rating example


Decision Trees

Advantages

Transparent & Interpretable

Can also perform Feature Selection

Can model complex relationships (non-linear)

Disadvantages

Risk over-fitting data (but can prune trees)

Require lots of data

Cannot model diagonal relationships (splits are always on one predictor, not combinations)
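
The point about splits using a single predictor can be illustrated with a one-level tree (a "decision stump"). This is only a sketch under stated assumptions: the age/salary records and class labels are invented, the function names `gini` and `best_split` are hypothetical, and Gini impurity is assumed as the split criterion because the slides do not name one.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every 'feature <= threshold' test on every single predictor and
    return the best one as (weighted impurity, feature index, threshold)."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must send data to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical (age, salary in thousands) records with made-up classes.
rows = [(25, 20), (30, 45), (35, 25), (45, 50), (50, 22), (60, 55)]
labels = [0, 1, 0, 1, 0, 1]

impurity, feature, threshold = best_split(rows, labels)
print(feature, threshold, impurity)
```

In this toy data the classes are separated by salary alone, so a single axis-aligned split is enough. A boundary that depends on a combination of age and salary (a diagonal line) would need many such splits to approximate.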



K-Nearest Neighbour

Find K observations in the data that are similar to the new observation we wish to classify

Requires:

Distance Metric

Voting Mechanism

Weighting Function
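
The three ingredients listed above can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the training points are invented, `knn_classify` is a hypothetical name, Euclidean distance is assumed as the metric, majority vote as the voting mechanism, and equal weights as the weighting function.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs. Distance is Euclidean and
    every neighbour's vote carries equal weight.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled points at increasing distance from the query.
train = [
    ((1.0, 0.0), "+"),
    ((0.0, 2.0), "-"),
    ((-2.5, 0.0), "-"),
    ((0.0, -3.0), "+"),
    ((3.5, 0.0), "+"),
    ((0.0, 5.0), "-"),
]
query = (0.0, 0.0)

for k in (1, 3, 5):
    print(k, knn_classify(train, query, k))
```

With these made-up points the predicted class changes as K grows, which is why the choice of K matters.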


K-Nearest Neighbour

Distance Metric e.g. Euclidean

Weighting function

Neighbour voting mechanism e.g. Maximum

[Figure: three panels (k=1, k=3, k=5), each showing the query point o surrounded by neighbours labelled + and -]


K-Nearest Neighbour

Advantages

Simplicity

Few assumptions about data

Disadvantages

Slow when large number of datapoints

Need lots of data when lots of predictors



Other Classifiers

Linear Classifiers

Artificial Neural Network Classifiers

Support Vector Machines

Bayesian Classifiers



Association Rules

What data goes with what

Large amount of basket data

e.g. Supermarket purchases

Looks for associations between items

Builds If-Then rules e.g. Nappies => Beer

Uses the notions of support and confidence
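
Support and confidence can be computed directly from basket data. The sketch below uses the standard definitions (support is the fraction of baskets containing an item set, and confidence is the support of the whole rule divided by the support of its "if" part). The baskets are invented and the function names are hypothetical.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical supermarket baskets.
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"nappies", "beer"})
rule_confidence = confidence(baskets, {"nappies"}, {"beer"})
print(rule_support, rule_confidence)
```

Here Nappies => Beer appears in 2 of 5 baskets (support 0.4), and 2 of the 3 baskets containing nappies also contain beer (confidence about 0.67). Rule miners typically keep only rules above chosen minimum values of both measures.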



Association Rules

Example:

Rule 1. If the quality of the management is medium, then the company may have a profit or a loss (C3, C4).

Rule 2. If the quality of the management is (at least) high and the number of employees is similar to 700, then the company makes a profit (C1).

Rule 3. If the quality of the management is (at most) low, then the company has a loss (C5, C6).

Rule 4. If the number of employees is similar to 420 and the localization is B, then the company has a loss (C2).



Association Rules

Advantages

Disadvantages

Profusion of rules generated

Ignores rare (but potentially interesting) combinations



Neural Networks

Map inputs to outputs using weights

Back-propagation algorithm to learn weights
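
A minimal sketch of both ideas follows: a forward pass that maps inputs to an output through weights, and a back-propagation step that adjusts those weights. Everything specific here is an assumption made for illustration: the 2-2-1 layout, the sigmoid activation, the squared-error loss, the starting weights, the learning rate, and the omission of bias terms.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return (hidden activations, output) for one input vector.
    Bias terms are omitted to keep the sketch short."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient-descent update for the loss 0.5 * (output - target)**2."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output unit (loss gradient w.r.t. its net input).
    delta_out = (output - target) * output * (1.0 - output)
    # Propagate the error back to each hidden unit through its outgoing weight.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    new_w_out = [w - lr * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [
        [w - lr * d * xi for w, xi in zip(ws, x)]
        for ws, d in zip(w_hidden, delta_hidden)
    ]
    return new_w_hidden, new_w_out

# Hypothetical training example and starting weights.
x, target = [1.0, 0.0], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.2, -0.1]

_, before = forward(x, w_hidden, w_out)
for _ in range(100):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
print(before, after)
```

After repeated updates the output moves towards the target. Fitting a single example this way is also a small picture of over-fitting: the network learns that one point and nothing else.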

[Figure: network diagram with Input Layer, Hidden Layer and Output Layer]



Neural Networks

Forecasting Markets

Predicting Stock collapse

Classifying exceptional behaviour in customers

Fraudulent credit card usage

Online monitoring


Neural Networks

Advantages

Can model complex relationships

Versatile

Disadvantages

Suffers very badly from Over-fitting

Need lots of data

Do not select features automatically

Black Box model



Time-Series Models

Predicting the Stock-Market

Long been the goal of:

mathematicians

statisticians

computer scientists

philosophers

Pi Harvest Filmworks 1999



Time-Series Models

Statistical Models

AI Models such as Neural Networks



Bayesian Networks

Overcome black box nature of NNs

Model data using probabilities and graphs

No hidden layers or weights

Models a joint distribution: the probability of any event is calculable
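
To make "the probability of any event is calculable" concrete, the sketch below builds a three-node network and answers queries by summing the joint distribution. The network (Rain -> WetGrass <- Sprinkler), all probability values, and the function names are invented for illustration and are not taken from the slides.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler (all numbers invented).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}
P_WET_GIVEN = {  # P(WetGrass=True | rain, sprinkler)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as the product of each node's local table."""
    p_wet = P_WET_GIVEN[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)

def probability(event):
    """Probability of any event, given as a predicate over (rain, sprinkler, wet)."""
    return sum(
        joint(r, s, w)
        for r, s, w in product([True, False], repeat=3)
        if event(r, s, w)
    )

p_wet = probability(lambda r, s, w: w)
p_rain_given_wet = probability(lambda r, s, w: r and w) / p_wet
print(p_wet, p_rain_given_wet)
```

Every number in the model is a readable probability attached to a named variable, in contrast to the weights of a neural network. Summing over all combinations is only practical for small networks; larger ones need dedicated inference algorithms.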







Bayesian Networks

Advantages?

Disadvantages?



Optimisation

For searching through huge numbers of possible solutions:

Scheduling processes

Manufacturing

Deliveries

Bin Packing of objects

Efficient loading of crates prior to shipping

Routing for efficient delivery



Optimisation

Well known Techniques:

Greedy Searches

Hill Climb

Simulated Annealing

Genetic Algorithms

Gradient Descent
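
Of the techniques listed, hill climbing is the simplest to sketch. The example below is a minimal illustration under stated assumptions: the objective function, the neighbourhood (one step left or right on the integers), and the function names are all hypothetical.

```python
def hill_climb(score, neighbours, start):
    """Move to the best-scoring neighbour until no neighbour improves the score.
    Returns a local optimum, which need not be the global one."""
    current = start
    while True:
        best = max(neighbours(current), key=score)
        if score(best) <= score(current):
            return current
        current = best

# Hypothetical objective with a local peak at x=2 and the global peak at x=8.
def score(x):
    return -(x - 2) ** 2 + 3 if x < 5 else -(x - 8) ** 2 + 10

def neighbours(x):
    return [x - 1, x + 1]

from_left = hill_climb(score, neighbours, start=0)
from_right = hill_climb(score, neighbours, start=6)
print(from_left, from_right)
```

The result depends on the starting point: the search beginning at 0 stops on the lower peak. Simulated annealing and genetic algorithms are among the techniques designed to escape such local optima.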



Optimisation

For example: Travelling Salesman Problem

Famous NP-hard problem
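
As a small illustration, the sketch below compares a greedy nearest-neighbour tour with an exhaustive search over every ordering. The city coordinates and function names are invented. The exhaustive search is only feasible for a handful of cities, since the number of orderings grows factorially.

```python
import math
from itertools import permutations

def tour_length(cities, tour):
    """Total length of the closed tour visiting `cities` in the order `tour`."""
    return sum(
        math.dist(cities[a], cities[b])
        for a, b in zip(tour, tour[1:] + tour[:1])
    )

def nearest_neighbour_tour(cities, start=0):
    """Greedy heuristic: always travel to the closest unvisited city."""
    tour = [start]
    unvisited = set(range(len(cities))) - {start}
    while unvisited:
        here = cities[tour[-1]]
        nxt = min(unvisited, key=lambda c: math.dist(here, cities[c]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def optimal_tour(cities):
    """Exhaustive search over all orderings that start at city 0."""
    rest = range(1, len(cities))
    return min(
        ([0] + list(p) for p in permutations(rest)),
        key=lambda t: tour_length(cities, t),
    )

# Hypothetical city coordinates.
cities = [(0, 0), (1, 5), (5, 4), (6, 1), (3, 2)]
greedy = nearest_neighbour_tour(cities)
best = optimal_tour(cities)
print(greedy, round(tour_length(cities, greedy), 2))
print(best, round(tour_length(cities, best), 2))
```

The greedy tour is quick to build but carries no guarantee of being the shortest. The exhaustive search is guaranteed but becomes impractical beyond roughly ten cities.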


Travelling Salesman Problem



Optimisation: Bin Packing

Trucks with capacity of 10

How few required to store objects of size:

{3, 6, 2, 1, 5, 7, 2, 4, 1, 9}?




Search techniques for finding the best allocation for objects within fixed-size containers

Potential heuristic approaches:

First Fit

Next Fit

Best Fit

Worst Fit
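
The four heuristics can be sketched as follows and run on the truck example (capacity 10, object sizes {3, 6, 2, 1, 5, 7, 2, 4, 1, 9}). The function names and the tie-breaking rule (earliest-opened bin wins) are assumptions, and objects are processed in the order given, with no sorting.

```python
import math

def _place(sizes, capacity, choose):
    """Shared loop: put each object in a bin picked by `choose` from the bins
    that still have room, or open a new bin if none has room."""
    bins = []
    for size in sizes:
        feasible = [b for b in bins if sum(b) + size <= capacity]
        if feasible:
            choose(feasible).append(size)
        else:
            bins.append([size])
    return bins

def first_fit(sizes, capacity):
    """Use the earliest-opened bin that has room."""
    return _place(sizes, capacity, lambda feasible: feasible[0])

def best_fit(sizes, capacity):
    """Use the fullest bin that has room (least space left over)."""
    return _place(sizes, capacity, lambda feasible: max(feasible, key=sum))

def worst_fit(sizes, capacity):
    """Use the emptiest bin that has room (most space left over)."""
    return _place(sizes, capacity, lambda feasible: min(feasible, key=sum))

def next_fit(sizes, capacity):
    """Only consider the most recently opened bin."""
    bins = []
    for size in sizes:
        if bins and sum(bins[-1]) + size <= capacity:
            bins[-1].append(size)
        else:
            bins.append([size])
    return bins

sizes = [3, 6, 2, 1, 5, 7, 2, 4, 1, 9]
capacity = 10
lower_bound = math.ceil(sum(sizes) / capacity)  # total volume / truck capacity

for heuristic in (first_fit, next_fit, best_fit, worst_fit):
    print(heuristic.__name__, heuristic(sizes, capacity))
print("lower bound:", lower_bound)
```

On this ordering each heuristic uses 5 trucks, while the sizes sum to 40, so at least 4 are needed. Four is achievable, for example {9, 1}, {7, 3}, {6, 4}, {5, 2, 2, 1}, which shows that these heuristics are fast but not guaranteed to be optimal.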




Also 2D and 3D approaches



Business Intelligence

Data Integration + Data Mining + Human Expertise => Business Intelligence:

Improved Decision Making

Quicker Response Times

Better Broadcasting / Marketing


Weaknesses of Data Mining

Data Quality

Spurious Correlations

Over-fitting

Black Box Modelling

Over-reliance: slave to the data

Can't see the wood for the trees



Session Summary

This session has examined:

Advanced Data Mining Techniques with examples

Advantages and Disadvantages


Next Session: Guest Lecture

Case Study of the application of BI
