CS5550 Session 07 At

  • 7/28/2019 CS5550 Session 07 At

    1/36

    CS5550 Data Management and Business Intelligence

    Session 7: Data Mining (II) Advanced DM

    Session Learning Outcomes

    The learning outcomes for this session are that you can:

    Understand how to use state-of-the-art DM techniques

    Discuss the strengths and weaknesses of such methods

    Discuss the use of key DM tools for business intelligence

    Recap on Data Mining

    What it is

    Definition IDA, AI, KDD, etc

    Data to knowledge

    Some typical tools:

    Correlation

    Regression

    Clustering

    Visualisation

    More Advanced Techniques

    Classifiers (e.g. Decision Trees)

    Association Rules

    Time-Series Models

    Bayesian Networks

    Principal Components Analysis to plot multidimensional data

    Graph Based Methods to explore multiple relationships

    Optimisation

    Classification

    What sort of data is this?

    Similar to Clustering but Supervised Learning, i.e. we have sample classes to learn from:

    Fraudulent Financial Reporting

    Y = {fraudulent, truthful}

    Predicting Delayed Flights

    Y = {delayed, on time}

    Classification

    Supervised method unlike clustering

    (Figure: scatter plot of the sample data points)

    Decision Trees

    Established method for classifying data

    Originated from biology

    Easy to understand

    Commonly used for:

    Fraud detection

    Credit Rating

    (Figure: scatter plot of Salary against Age, with points labelled Class 0 or Class 1 and a new, unlabelled point marked "?")

    Decision Trees

    Credit Rating example

    Decision Trees

    Advantages

    Transparent & Interpretable

    Can also perform Feature Selection

    Can model complex relationships (non-linear)

    Disadvantages

    Risk over-fitting data (but can prune trees)

    Require lots of data

    Cannot model diagonal relationships (splits always on one predictor, not combinations)
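The point that splits are always made on one predictor can be illustrated with a single-split tree (a "decision stump"). This is only a minimal sketch under stated assumptions, not the method used in the lecture: the Gini impurity criterion, the function names `gini` and `best_split`, and the age/salary records are all made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every (predictor, threshold) pair and return the split with the
    lowest weighted Gini impurity as (impurity, predictor_index, threshold).
    Each candidate split tests ONE predictor only, never a combination."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must put data on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical (age, salary in thousands) records with a made-up class label.
rows = [(25, 20), (30, 45), (35, 25), (45, 50), (50, 30), (60, 55)]
labels = [0, 1, 0, 1, 0, 1]

impurity, predictor, threshold = best_split(rows, labels)
print(f"split on predictor {predictor} at <= {threshold}, impurity {impurity:.2f}")
```

A full tree would apply `best_split` recursively to each side of the split; pruning then removes branches that only fit noise.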

    K Nearest Neighbour

    Find K observations in the data that are similar to the new observation we wish to classify

    Requires:

    Distance Metric

    Voting Mechanism

    Weighting Function

    K-Nearest Neighbour

    Distance Metric e.g. Euclidean

    Weighting function

    Neighbour voting mechanism e.g. Maximum

    (Figure: three panels, k=1, k=3 and k=5, each showing a new point "o" among neighbours labelled "+" or "-")
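The three ingredients (distance metric, weighting function, voting mechanism) can be sketched as follows. This is an illustrative sketch only: the function name `knn_classify` and the training points are hypothetical, the distance is Euclidean, every neighbour has equal weight, and the vote is a simple majority. The points were chosen so that the predicted class changes with k, as in the k=1, k=3, k=5 illustration.

```python
import math
from collections import Counter

def knn_classify(train, new_point, k):
    """Classify new_point by majority vote among its k nearest neighbours.

    train is a list of ((x, y), label) pairs. Distance is Euclidean and each
    neighbour's vote counts equally (a uniform weighting function).
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled observations around the new point "o" at the origin.
train = [
    ((0.5, 0.0), "+"),
    ((1.0, 0.0), "-"),
    ((0.0, 1.2), "-"),
    ((2.0, 0.0), "+"),
    ((0.0, 2.5), "+"),
    ((5.0, 5.0), "-"),
    ((6.0, 5.0), "-"),
]

for k in (1, 3, 5):
    print(k, knn_classify(train, (0.0, 0.0), k))
```

Sorting every training point for each query is what makes the method slow on large data sets; practical implementations typically use spatial indexes instead.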

    K-Nearest Neighbour

    Advantages

    Simplicity

    Few assumptions about data

    Disadvantages

    Slow when large number of datapoints

    Need lots of data when lots of predictors

    Other Classifiers

    Linear Classifiers

    Artificial Neural Network Classifiers

    Support Vector Machines

    Bayesian Classifiers

    Association Rules

    What data goes with what

    Large amount of basket data

    e.g. Supermarket purchases

    Looks for associations between items

    Builds If-Then rules, e.g. Nappies => Beer

    Uses the notions of support and confidence
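Support and confidence for a rule such as Nappies => Beer can be computed directly from basket data. The sketch below uses the standard definitions (support: the fraction of baskets containing all the items; confidence: of the baskets containing the "if" part, the fraction that also contain the "then" part). The baskets and the function names are hypothetical, invented for illustration.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical supermarket baskets (made-up data).
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

s = support(baskets, {"nappies", "beer"})
c = confidence(baskets, {"nappies"}, {"beer"})
print(f"Nappies => Beer: support = {s:.2f}, confidence = {c:.2f}")
```

Rule-mining algorithms keep only rules whose support and confidence exceed chosen thresholds, which is why rare combinations are ignored.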

    Association Rules

    Example:

    Rule 1. If the quality of the management is medium, then the company may have a profit or a loss (C3, C4).

    Rule 2. If the quality of the management is (at least) high and the number of employees is similar to 700, then the company makes a profit (C1).

    Rule 3. If the quality of the management is (at most) low, then the company has a loss (C5, C6).

    Rule 4. If the number of employees is similar to 420 and the localization is B, then the company has a loss (C2).

    Association Rules

    Advantages

    Disadvantages

    Profusion of rules generated

    Ignores rare (but potentially interesting) combinations

    Neural Networks

    Map inputs to outputs using weights

    Back-propagation algorithm to learn weights

    (Figure: network diagram with an Input Layer, a Hidden Layer and an Output Layer)
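How weights map inputs to outputs through a hidden layer can be sketched as a forward pass. This is a minimal illustration under assumptions not stated on the slide: a sigmoid activation, one hidden layer, and hand-picked (hypothetical) weights. Back-propagation, which would adjust these weights to reduce the error on training data, is not shown.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: for each unit, a weighted sum of the
    inputs plus a bias, passed through the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.0, -0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

print(forward([1.0, 0.5], hidden_w, hidden_b, output_w, output_b))
```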

    Neural Networks

    Forecasting Markets

    Predicting Stock collapse

    Classifying exceptional behaviour in customers

    Fraudulent credit card usage

    Online monitoring

    Neural Networks

    Advantages

    Can model complex relationships

    Versatile

    Disadvantages

    Suffers very badly from Over-fitting

    Need lots of data

    Do not select features automatically

    Black Box model

    Time-Series Models

    Predicting the Stock-Market

    Long been the goal of:

    mathematicians

    statisticians

    computer scientists

    philosophers

    Pi Harvest Filmworks 1999

    Time-Series Models

    Statistical Models

    AI Models such as Neural Networks

    Bayesian Networks

    Overcome black box nature of NNs

    Model data using probabilities and graphs

    No hidden layers or weights

    Models a joint distribution: the probability of any event is calculable
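The claim that the probability of any event is calculable follows from the network factorising the joint distribution into small conditional tables. The sketch below assumes a hypothetical three-node network (Rain and Sprinkler both influence Wet Grass) with made-up probabilities; it is an illustration, not a network from the lecture.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler (made-up numbers).
P_RAIN = 0.2
P_SPRINKLER = 0.1
# P(WetGrass = True | Rain, Sprinkler)
P_WET = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) = P(rain) * P(sprinkler) * P(wet | rain, sprinkler)."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    p *= p_wet if wet else 1 - p_wet
    return p

def probability(event):
    """Sum the joint over every assignment for which event(...) is true."""
    return sum(
        joint(r, s, w)
        for r, s, w in product([True, False], repeat=3)
        if event(r, s, w)
    )

p_wet = probability(lambda r, s, w: w)
p_rain_given_wet = probability(lambda r, s, w: r and w) / p_wet
print(f"P(wet) = {p_wet:.4f}, P(rain | wet) = {p_rain_given_wet:.4f}")
```

Every number in the model is a readable probability attached to a named variable, which is the contrast with the hidden weights of a neural network.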

    Bayesian Networks

    Advantages?

    Disadvantages?

    Optimisation

    For searching through huge numbers of possible solutions:

    Scheduling processes

    Manufacturing

    Deliveries

    Bin Packing of objects

    Efficient loading of crates prior to shipping

    Routing for efficient delivery

    Optimisation

    Well-known Techniques:

    Greedy Searches

    Hill Climb

    Simulated Annealing

    Genetic Algorithms

    Gradient Descent

    Optimisation

    For example: Travelling Salesman Problem

    Famous NP-hard Problem
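A greedy search for the Travelling Salesman Problem can be sketched with the nearest-neighbour heuristic and compared against exhaustive search on a tiny instance. This is an illustrative sketch: the city coordinates and function names are hypothetical, and the brute-force check is only feasible for a handful of cities because the number of tours grows factorially.

```python
import math
from itertools import permutations

def tour_length(cities, order):
    """Total length of the closed tour that visits cities in the given order."""
    return sum(
        math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )

def nearest_neighbour_tour(cities, start=0):
    """Greedy search: from the current city, always go to the closest unvisited one."""
    unvisited = set(range(len(cities))) - {start}
    order = [start]
    while unvisited:
        here = cities[order[-1]]
        nxt = min(unvisited, key=lambda c: math.dist(here, cities[c]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def brute_force_tour(cities):
    """Exhaustive search over all (n-1)! tours that start at city 0."""
    rest = range(1, len(cities))
    return min(
        ([0] + list(p) for p in permutations(rest)),
        key=lambda order: tour_length(cities, order),
    )

# Hypothetical city coordinates.
cities = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 3), (3, 1)]
greedy = nearest_neighbour_tour(cities)
best = brute_force_tour(cities)
print("greedy :", greedy, round(tour_length(cities, greedy), 3))
print("optimal:", best, round(tour_length(cities, best), 3))
```

The greedy tour is found quickly but is not guaranteed to be the shortest; techniques such as hill climbing, simulated annealing and genetic algorithms try to improve on such a starting tour.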

    Optimisation: Bin Packing

    Trucks with capacity of 10

    How few required to store objects of size:

    {3, 6, 2, 1, 5, 7, 2, 4, 1, 9}?

    Optimisation: Bin Packing

    Search techniques for finding the best allocation for objects within fixed-size containers

    Potential Heuristic Approaches:

    First Fit

    Next Fit

    Best Fit

    Worst Fit
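First Fit can be sketched on the truck example (capacity 10, object sizes {3, 6, 2, 1, 5, 7, 2, 4, 1, 9}). The function name `first_fit` is hypothetical, and sorting the objects largest-first before packing ("First Fit Decreasing") is a common variant added here for comparison rather than something listed on the slide.

```python
def first_fit(sizes, capacity):
    """Place each object in the first bin that still has room for it,
    opening a new bin only when none does."""
    bins = []
    for size in sizes:
        for contents in bins:
            if sum(contents) + size <= capacity:
                contents.append(size)
                break
        else:  # no existing bin had room
            bins.append([size])
    return bins

sizes = [3, 6, 2, 1, 5, 7, 2, 4, 1, 9]
capacity = 10

in_order = first_fit(sizes, capacity)
largest_first = first_fit(sorted(sizes, reverse=True), capacity)

print(len(in_order), in_order)            # 5 trucks
print(len(largest_first), largest_first)  # 4 trucks
```

The sizes sum to 40, so no packing can use fewer than 4 trucks of capacity 10; on this example the largest-first ordering reaches that bound, while packing in the given order needs 5.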

    Optimisation: Bin Packing

    Also 2D and 3D approaches

    Business Intelligence

    Data Integration + Data Mining + Human Expertise => Business Intelligence:

    Improved Decision Making

    Quicker Response Times

    Better Broadcasting / Marketing

    Weaknesses of Data Mining

    Data Quality

    Spurious Correlations

    Over-fitting

    Black Box Modelling

    Over-reliance: slave to the data

    Can't see the wood for the trees

    Session Summary

    This session has examined:

    Advanced Data Mining Techniques with examples

    Advantages and Disadvantages

    Next Session: Guest Lecture

    Case Study of the application of BI