10 Data Mining


  • 7/30/2019 10 Data Mining

    SQL Server 2008 for Business Intelligence

    UTS Short Course

    Loves C# and .NET

    Specializes in application architecture and design

    SQL Performance Tuning and Optimization

    Agile, Scrum

    Certified Scrum Trainer

    Technology aficionado: Silverlight, ASP.NET, Windows Forms

    Eric Phan, SA @ SSW | w: ericphan.info | e: [email protected] | t: @ericphan

    Attendance: you initial the sheet

    Hands On Lab: you get me to initial the sheet

    Homework

    Certificate: at the end of 5 sessions, if I say you have completed successfully

    Admin Stuff

    Course Timetable & Materials

    http://bit.ly/UTSSQL

    Resources

    http://sharepoint.ssw.com.au/Training/UTSSQL/

    Course Website

    Course Overview

    Session  Date                Time           Topic
    1        Tuesday 01-05-2012  18:00 - 21:00  SSIS and Creating a Data Warehouse
    2        Tuesday 08-05-2012  18:00 - 21:00  OLAP Creating Cubes and Cube Issues
    3        Tuesday 15-05-2012  18:00 - 21:00  Reporting Services
    4        Tuesday 22-05-2012  18:00 - 21:00  Alternative Cube Browsers
    5        Tuesday 29-05-2012  18:00 - 21:00  Data Mining

    1. Other cube browsers

    Microsoft Data Analyzer

    ProClarity

    Excel 2003/2007/2010

    Excel services

    Thinslicer

    Performance Point

    Power Pivot

    Last week(s)

    The plan

    1. Create Data Warehouse

    2. Copy data to data warehouse

    3. Create OLAP Cubes

    4. Create Reports

    5. Browse the cube

    6. Do some Data Mining

    Discovering relationships

    Predict future events

    Step by step to BI

    1. What is Data Mining?

    2. Why?

    3. Uses

    4. Algorithms

    5. Demo

    6. Hands on Lab

    Agenda

    What is Data Mining?

    Data mining is the use of powerful software tools to discover significant traits or relationships in databases or data warehouses; it is often used to predict future events

    What is Data Mining?

    It exploits statistical algorithms

    Once the knowledge is extracted it:

    Can be used to discover

    Can be used to predict values of other cases

    Marketing

    Who picks the movie? The kids, the wife, me

    Who are our customers and what sort of films do they hire?

    Is a 30 year old woman with 2 children going to hire Arnie's latest film?

    Validation

    Is this data sensible? Terminator 2 and Toy Story

    Prediction

    Sales Next Year

    Why Data Mining?

    1. Get new information from data: future trends, past trends, outliers, maximums, minimums

    2. Analyse data from different perspectives and summarize it into useful information

    3. New information to

    increase revenue

    cut costs

    or both :-)

    Why? It's all about money

    Who are our biggest customers?

    What are customers buying with cigars?

    What are the customer retention levels of our branches?

    Which customers have bought olives, feta cheese but no ciabatta bread?

    Which regions have the highest male/female ratio of single 20 somethings?

    Which region has lowest customer retention levels and list out lost

    customers?

    Which Questions are Data Mining?

    Ad hoc query

    Drill through to details

    Business Intelligence tool

    What's not data mining

    Huge amount of data

    Good raw material makes for good data mining

    Samples should be representative

    Samples "similar" to domain

    Not all-seeing crystal ball

    Verify and Validate!

    Data - Uncover patterns in samples

    OLAP

    Is about fast ad hoc querying

    Analysis by dimensions and measures

    Gives precise answers

    Data Mining

    May use RDBMS or OLAP source

    Is about discovering and predicting

    Gives imprecise answers

    OLAP is not a prerequisite for data mining, but it almost always comes first (learning to ride a bike before a car)

    OLAP versus Data Mining

    Classification algorithms

    predict one or more discrete variables, based on the other attributes in the dataset

    Regression algorithms

    predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset

    Segmentation algorithms

    divide data into groups, or clusters, of items that have similar properties

    Association algorithms

    find correlations between different attributes in a dataset

    Sequence analysis algorithms

    summarize frequent sequences or episodes in data, such as a Web path flow

    Types of Data Mining Algorithms

    Complete Set Of Algorithms

    Ways to analyze your data

    Decision Trees, Clustering, Time Series

    Neural Network, Association, Naïve Bayes

    Linear Regression, Logistic Regression, Sequence Clustering

    Split data

    Each branch is like an attribute

    Brightness = amount of data

    Decision trees

    Decision Trees assign (classify) each case to one of a few (discrete) broad categories of the selected attribute (variable) and explain the classification with a few selected input variables

    The process of building the tree is recursive partitioning: splitting the data into partitions and then splitting each partition further

    Initially all cases are in one big box

    Decision Trees (1)

    The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions the data into the purest classes of the searched variable

    Several measures of purity

    Then it repeats splitting for each new class

    Again testing all possible breaks

    Unhelpful branches of the tree can be pre-pruned or post-pruned

    Decision Trees (2)
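The split search described on the two Decision Trees slides can be sketched in a few lines of Python. This is a minimal illustration, not the Microsoft Decision Trees implementation: it assumes Gini impurity as the purity measure (one of several possible), tests only "attribute equals value" breaks, and finds a single split rather than recursing. The function names and the video-store cases are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, target):
    """Try every (attribute, value) equality test and return the purest split.

    rows is a list of dicts; target is the key of the predicted attribute.
    Returns (attribute, value, weighted_impurity), or None when no test
    separates the rows into two non-empty partitions.
    """
    best = None
    attributes = [key for key in rows[0] if key != target]
    for attribute in attributes:
        for value in sorted({row[attribute] for row in rows}):
            left = [row[target] for row in rows if row[attribute] == value]
            right = [row[target] for row in rows if row[attribute] != value]
            if not left or not right:
                continue
            # Impurity of the two partitions, weighted by their size.
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or weighted < best[2]:
                best = (attribute, value, weighted)
    return best


# Hypothetical video-store cases, invented for this example.
cases = [
    {"age_group": "young", "has_kids": "no", "hires_action": "yes"},
    {"age_group": "young", "has_kids": "no", "hires_action": "yes"},
    {"age_group": "young", "has_kids": "yes", "hires_action": "no"},
    {"age_group": "older", "has_kids": "yes", "hires_action": "no"},
    {"age_group": "older", "has_kids": "no", "hires_action": "yes"},
    {"age_group": "older", "has_kids": "yes", "hires_action": "no"},
]

print(best_split(cases, "hires_action"))
```

A full tree builder would call `best_split` again on each resulting partition (the recursive partitioning step) and stop, or prune, when a split no longer improves purity.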

    Decision trees are used for classification and prediction

    Typical questions:

    Predict which customers will leave

    Help in mailing and promotion campaigns

    Explain reasons for a decision

    What are the movies young female customers like to buy?

    Decision Trees (3)

    Decision Trees Who Decides

    Bayes Formula

    Uses statistics to say, with a probability, whether a case falls into a certain category or not

    Spam filtering: score of spam (Bayes)

    Testing only a particular attribute

    Naïve Bayes

    Quickly builds mining models that can be used for

    classification and prediction

    It calculates probabilities for each possible state of the

    input attribute, given each state of the predictable

    attribute

    This can later be used to predict an outcome of the predicted attribute based on the known input attributes

    This makes the model a good option

    for exploring the data

    Naïve Bayes
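The counting described above (a probability for each state of an input attribute, given each state of the predictable attribute) can be sketched as follows. This is a rough illustration under stated assumptions, not the Microsoft Naive Bayes algorithm: the attribute names and messages are hypothetical, and add-one smoothing is an added choice so that an unseen state does not force a probability of zero.

```python
from collections import Counter, defaultdict


def train(rows, target):
    """Count how often each input state occurs with each state of the target."""
    class_counts = Counter(row[target] for row in rows)
    # state_counts[(attribute, target_state)][input_state] -> number of cases
    state_counts = defaultdict(Counter)
    for row in rows:
        for attribute, state in row.items():
            if attribute != target:
                state_counts[(attribute, row[target])][state] += 1
    return class_counts, state_counts


def predict(model, case):
    """Return normalised probabilities of each target state for one case."""
    class_counts, state_counts = model
    total = sum(class_counts.values())
    scores = {}
    for target_state, count in class_counts.items():
        score = count / total  # prior probability of this target state
        for attribute, state in case.items():
            seen = state_counts[(attribute, target_state)]
            # Add-one smoothing; the "+ 2" assumes two possible states per
            # attribute, which holds for the yes/no attributes used here.
            score *= (seen[state] + 1) / (count + 2)
        scores[target_state] = score
    norm = sum(scores.values())
    return {state: score / norm for state, score in scores.items()}


# Hypothetical labelled messages, invented for this example.
messages = [
    {"mentions_offer": "yes", "known_sender": "no", "label": "spam"},
    {"mentions_offer": "yes", "known_sender": "no", "label": "spam"},
    {"mentions_offer": "no", "known_sender": "no", "label": "spam"},
    {"mentions_offer": "yes", "known_sender": "yes", "label": "ham"},
    {"mentions_offer": "no", "known_sender": "yes", "label": "ham"},
    {"mentions_offer": "no", "known_sender": "yes", "label": "ham"},
]

model = train(messages, "label")
result = predict(model, {"mentions_offer": "yes", "known_sender": "no"})
print(result)
```

The "naïve" part is the multiplication inside `predict`: each input attribute is treated as independent of the others once the target state is fixed.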

    Grouping data into clusters

    Objects within a cluster have high similarity based on the

    attribute values

    The class label of each object is not known

    Several techniques

    Partitioning methods

    Hierarchical methods

    Density based methods

    Model based methods

    And more

    Cluster Analysis (1)

    Segments a heterogeneous population into a number of more

    homogeneous subgroups or clusters

    Some typical questions:

    Discover distinct groups of customers

    Identification of groups of houses in a city

    In biology, derive animal and plant taxonomies

    Find outliers

    Cluster Analysis (2)

    Clustering

    [Figure: clusters plotted by Age and Annual Income]
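Grouping customers by age and annual income can be tried with k-means, one of the partitioning methods mentioned earlier. The sketch below is an assumption-laden illustration rather than the Microsoft Clustering algorithm: the customer points are invented, the starting centers are picked by hand, and the two attributes are left unscaled (in practice they would normally be normalised so that income does not dominate the distance).

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            distances = [
                (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centers
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each center to the mean of its cluster.
        centers = [
            (
                sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster),
            )
            if cluster
            else center  # keep the old center if a cluster ends up empty
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical customers as (age, annual income in thousands).
customers = [(22, 25), (25, 30), (27, 28), (48, 90), (52, 95), (55, 88)]

centers, clusters = kmeans(customers, centers=[(22, 25), (55, 88)])
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

Note that no class label is supplied anywhere: the groups come only from the similarity of the attribute values.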

    Time series

    Time-based data prediction

    Sequence clustering

    Numbers orders stronger associations

    Direction of association (not necessarily the same in the other direction)

    If you own certain stocks, you may own other ones as well

    Probability = thickness of line

    Association
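The stock example can be made concrete with the two standard association measures, support and confidence. The sketch below is illustrative only: the portfolios and ticker names are invented, and it computes the measures for one hand-picked pair instead of searching all item sets as a real association algorithm would. It also shows that an association has a direction, since the confidence of "BBB implies AAA" need not equal that of "AAA implies BBB".

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated probability of the consequent, given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical portfolios; the ticker names are made up.
portfolios = [
    {"AAA", "BBB"},
    {"AAA", "BBB"},
    {"AAA"},
    {"AAA", "CCC"},
    {"CCC"},
]

print("BBB -> AAA:", confidence(portfolios, {"BBB"}, {"AAA"}))
print("AAA -> BBB:", confidence(portfolios, {"AAA"}, {"BBB"}))
```

Here everyone who owns BBB also owns AAA, but only half of the AAA owners hold BBB.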

    Let system learn how to classify data

    Neural Network adapts to the new data

    Formulate statement/hypothesis

    Outcome is known

    (Data / Surveys)

    1. 70% data to train network (outcome is known)

    2. 30% of data to test network (outcome is known)

    3. New data (no survey needed, predict from network)

    Other example: OCR

    Neural Nets
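The 70% / 30% workflow above can be sketched end to end. To keep the example short it trains a single perceptron, which is a deliberate simplification and not a multi-layer neural network, and the survey data, attribute meanings and function names are all hypothetical.

```python
import random


def split_cases(cases, train_fraction=0.7, seed=0):
    """Shuffle labelled cases and split them into training and test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def predict(model, inputs):
    """Return 1 when the weighted sum of the inputs is positive, else 0."""
    weights, bias = model
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def train(cases, epochs=50, rate=0.1):
    """Perceptron learning rule: nudge the weights after every wrong answer."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, outcome in cases:
            error = outcome - predict((weights, bias), inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


def accuracy(model, cases):
    """Fraction of cases whose known outcome the model reproduces."""
    hits = sum(predict(model, inputs) == outcome for inputs, outcome in cases)
    return hits / len(cases)


# Hypothetical survey: (hours of TV per week, films hired per month) with a
# known outcome (1 = joined the loyalty club, 0 = did not).
survey = [
    ((1, 0), 0), ((2, 1), 0), ((3, 1), 0), ((2, 2), 0), ((4, 1), 0),
    ((8, 5), 1), ((9, 6), 1), ((7, 7), 1), ((10, 4), 1), ((9, 8), 1),
]

training, testing = split_cases(survey)          # steps 1 and 2
model = train(training)
print("test accuracy:", accuracy(model, testing))
print("new case:", predict(model, (9, 7)))       # step 3: no survey needed
```

Holding back the test cases matters because accuracy measured on the training cases alone says little about how the model behaves on new data.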

    Conclusion: When To Use What

    Task / Microsoft algorithms to use

    Predicting a discrete attribute. For example, predict whether the recipient of a targeted mailing campaign will buy a product.

        Microsoft Decision Trees Algorithm
        Microsoft Naive Bayes Algorithm
        Microsoft Clustering Algorithm
        Microsoft Neural Network Algorithm

    Predicting a continuous attribute. For example, forecast next year's sales.

        Microsoft Decision Trees Algorithm
        Microsoft Time Series Algorithm

    Predicting a sequence. For example, perform a clickstream analysis of a company's Web site.

        Microsoft Sequence Clustering Algorithm

    Finding groups of common items in transactions. For example, use market basket analysis to suggest additional products to a customer for purchase.

        Microsoft Association Algorithm
        Microsoft Decision Trees Algorithm

    Finding groups of similar items. For example, segment demographic data into groups to better understand the relationships between attributes.

        Microsoft Clustering Algorithm
        Microsoft Sequence Clustering Algorithm

    Visual Numerics

    3rd party algorithms

    http://www.vni.com/company/whitepapers/MicrosoftBIwithNumericalLibraries.pdf

    There is more...

    Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007

    http://www.microsoft.com/downloads/en/details.aspx?familyid=896A493A-2502-4795-94AE-E00632BA6DE7&displaylang=en

    Excel Data Mining

    Train station / airport

    Who is the bad guy?

    Farmers

    Find the best crops

    Supermarket

    Figure out how to get you to buy more, for example where to place the expensive items

    Other usages of data mining

    Find patterns - Profiling

    SSIS 2008 - Data profiling task

    Get a profile of the data in a table

    potential candidate keys

    length of data values in columns

    Null percentage of rows

    distribution of values

    ....

    Tip
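To show what such a profile contains, here is a small plain-Python sketch that computes the same kinds of figures for an in-memory table. It does not use or imitate the SSIS Data Profiling task's own interface; the `profile` function, the column names and the sample rows are all hypothetical.

```python
from collections import Counter


def profile(rows):
    """Profile a table given as a list of dicts that share the same keys."""
    report = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [value for value in values if value is not None]
        lengths = [len(str(value)) for value in present]
        report[column] = {
            # Percentage of rows where this column is null.
            "null_percent": 100.0 * (len(values) - len(present)) / len(values),
            # Shortest and longest value when written out as text.
            "min_length": min(lengths) if lengths else 0,
            "max_length": max(lengths) if lengths else 0,
            # How often each non-null value occurs.
            "distribution": Counter(present),
            # No nulls and no repeats: the column could serve as a key.
            "candidate_key": len(present) == len(values)
            and len(set(values)) == len(values),
        }
    return report


# Hypothetical customer rows, invented for this example.
customers = [
    {"customer_id": 1, "state": "NSW"},
    {"customer_id": 2, "state": "NSW"},
    {"customer_id": 3, "state": "VIC"},
    {"customer_id": 4, "state": None},
]

report = profile(customers)
for column, stats in report.items():
    print(column, stats)
```

Running a profile like this before mining helps with the "Verify and Validate!" advice: unexpected nulls or odd value distributions are easier to spot up front.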

    Video: Simple data mining model

    http://www.sqlservercentral.com/articles/Video/65055/

    Video: Data mining and Reporting Services

    http://www.sqlservercentral.com/articles/Video/64190/

    Data Mining Algorithms

    http://msdn.microsoft.com/en-us/library/ms175595.aspx

    Resources 1

    Jamie MacLennan

    http://blogs.msdn.com/b/jamiemac/

    Richard Lees on BI

    http://richardlees.blogspot.com/

    Book: Data Mining with Microsoft SQL Server 2008

    http://www.amazon.com/gp/product/0470277742?ie=UTF8&tag=sqlserverda09-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0470277742

    Resources 2

    Why Data Mining?

    Uses

    Algorithms

    Demo

    Hands on Lab

    Summary

    3 things

    [email protected]

    http://ericphan.info/

    twitter.com/ericphan

    Thank You!

    Gateway Court Suite 10

    81 - 91 Military Road

    Neutral Bay, Sydney NSW 2089

    AUSTRALIA

    ABN: 21 069 371 900

    Phone: + 61 2 9953 3000

    Fax: + 61 2 9953 3105

    [email protected]

    www.ssw.com.au