44
SQL Server 2008 for Business Intelligence UTS Short Course

Data Mining with SQL Server 2008

Embed Size (px)

DESCRIPTION

What is Data Mining? Why? Discovering relationships Predict future events Usage scenarios Algorithms

Citation preview

Page 1: Data Mining with SQL Server 2008

SQL Server 2008 for Business IntelligenceUTS Short Course

Page 2: Data Mining with SQL Server 2008

Specializes in

C# and .NET (Java not anymore)

TestingAutomated tests

Agile, ScrumCertified Scrum Trainer

Technology aficionado • Silverlight• ASP.NET• Windows Forms

Peter Gfader

Page 3: Data Mining with SQL Server 2008

Attendance You initial sheet

Hands On Lab You get me to initial sheet

Homework

Certificate At end of 5 sessions If I say if you have completed successfully

Admin Stuff

Page 4: Data Mining with SQL Server 2008

Course Timetable & Materials

http://www.ssw.com.au/ssw/Events/2010UTSSQL/

Resources

http://sharepoint.ssw.com.au/Training/UTSSQL/

Course Website

Page 5: Data Mining with SQL Server 2008

Course OverviewSession

Date Time Topic

1Tuesday14-09-2010

18:00 - 21:00

SSIS and Creating a Data Warehouse

2Tuesday21-09-2010

18:00 - 21:00

OLAP – Creating Cubes and Cube Issues

3Tuesday28-09-2010

18:00 - 21:00 Reporting Services

4Tuesday05-10-2010

18:00 - 21:00

Alternative Cube Browsers

5 Tuesday12-10-2010

18:00 - 21:00 Data Mining

Page 6: Data Mining with SQL Server 2008

1. Other cube browsers Microsoft Data Analyzer Proclarity Excel 2003/2007/2010 Excel services Thinslicer Performance Point Power Pivot

Last week(s)

Page 7: Data Mining with SQL Server 2008

1. Create report on top of Northwind

Top 10 customers (Table)Top 10 products (Table)Top 10 employees (Table)1 chart that shows the top 10 customers1 usage of the gauge control (surprise me)

Homework

Page 8: Data Mining with SQL Server 2008

The plan

Page 9: Data Mining with SQL Server 2008

1. Create Data Warehouse

2. Copy data to data warehouse

3. Create OLAP Cubes

4. Create Reports

5. Browse the cube

6. Do some Data Mining Discovering relationships Predict future events

Step by step to BI

Page 10: Data Mining with SQL Server 2008

1. What is Data Mining?

2. Why?

3. Uses

4. Algorithms

5. Demo

6. Hands on Lab

Agenda

Page 11: Data Mining with SQL Server 2008

What is Data Mining?

“Data mining is the use of powerful software tools

to discover significant traits or relationships, from databases or data warehouses and

often used to predict future events”

Page 12: Data Mining with SQL Server 2008

What is Data Mining?

It exploits statistical algorithms

Once the “knowledge” is extracted it:

Can be used to discover Can be used to predict values of other cases

Page 13: Data Mining with SQL Server 2008

Marketing Who picks the movie? The kids, the wife, me Who are our Customers and what sort of films do

they hire? Is a 30 year old woman with 2 children going to hire

Arnie’s latest film Validation

Is this data sensible? Terminator 2 and Toy Story Prediction

Sales Next Year

Why Data Mining?

Page 14: Data Mining with SQL Server 2008

1. Get new information from data, future trends, past trends, outlier, maximums, minimums

2. Analyse data from different perspectives and summarizing it into useful information

3. New information to

increase revenuecuts costsor both :-)

Why? Its all about money

Page 15: Data Mining with SQL Server 2008

Who are our biggest customers?

What are customers buying with cigars?

What are the customer retention levels of our branches?

Which customers have bought olives, feta cheese but no ciabatta bread?

Which regions have the highest male/female ratio of single 20 somethings?

Which region has lowest customer retention levels and list out lost customers?

Which Questions are Data Mining?

Page 16: Data Mining with SQL Server 2008

Ad hoc query

Drill through to details

Business Intelligence tool

What’s not data mining

Page 17: Data Mining with SQL Server 2008

Huge amount of data

Good raw material good data mining

Samples should be representative

Samples "similar" to domain

Not all-seeing crystal ball

Verify and Validate!

Data - Uncover patterns in samples

Page 18: Data Mining with SQL Server 2008

OLAP Is about fast ad hoc querying Analysis by dimensions and measures Gives precise answers

Data Mining May use RDBMS or OLAP source Is about discovering and predicting Gives imprecise answers

OLAP is not a prerequisite for data mining, but it almost always comes first

OLAP versus Data Mining

(learning to ride a bike before a car)

Page 19: Data Mining with SQL Server 2008

Classification algorithms

predict one or more discrete variables, based on the other attributes in the dataset

Regression algorithms

predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset

Segmentation algorithms

divide data into groups, or clusters, of items that have similar properties

Association algorithms

find correlations between different attributes in a dataset

Sequence analysis algorithms

summarize frequent sequences or episodes in data, such as a Web path flow

Types of Data Mining Algorithms

Page 20: Data Mining with SQL Server 2008

Complete Set Of AlgorithmsWays to analyze your data

Decision Trees Clustering Time Series

Neural Network Association Naïve Bayes

Linear Regression Logistic Regression

Sequence Clustering

Page 21: Data Mining with SQL Server 2008

Split data

Each of branch is like an attribute

Brightness = amount of data

Decision trees

Page 22: Data Mining with SQL Server 2008

Decision Trees assign (classify) each case to one of a few (discrete) broad categories of selected attribute (variable) and explains the classification with few selected input variables

The process of building is recursive partitioning – splitting data into partitions and then splitting it up more

Initially all cases are in one big box

Decision Trees (1)

Page 23: Data Mining with SQL Server 2008

The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions data to the purest classes of the searched variable Several measures of purity

Then it repeats splitting for each new class Again testing all possible breaks

Unuseful branches of the tree can be pre-pruned or post-pruned

Decision Trees (2)

Page 24: Data Mining with SQL Server 2008

Decision trees are used for classification and prediction

Typical questions:

Predict which customers will leave Help in mailing and promotion campaigns Explain reasons for a decision What are the movies young female customers like

to buy?

Decision Trees (3)

Page 25: Data Mining with SQL Server 2008

Decision Trees – Who Decides

Page 26: Data Mining with SQL Server 2008

Bayes Formula

Uses statistics to say falls into certain category or not with probability

Spam filtering: score of spam (Bayes)

Testing only a particular attribute

Naïve Bayes

Page 27: Data Mining with SQL Server 2008

Quickly builds mining models that can be used for classification and prediction

It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute This can later be used to predict an outcome

of the predicted attribute based on the known input attributes

This makes the model a good option for exploring the data

Naïve Bayes

Page 28: Data Mining with SQL Server 2008

Grouping data into clusters Objects within a cluster have high similarity based on the

attribute values

The class label of each object is not known

Several techniques Partitioning methods Hierarchical methods Density based methods Model based methods And more…

Cluster Analysis (1)

Page 29: Data Mining with SQL Server 2008

Segments a heterogeneous population into a number of more homogenous subgroups or clusters

Some typical questions:

Discover distinct groups of customers Identification of groups of houses in a city In biology, derive animal and plant taxonomies Find outliers

Cluster Analysis (2)

Page 30: Data Mining with SQL Server 2008

Clustering

Age

Annual

Income

Page 31: Data Mining with SQL Server 2008

Time series

Timebased data prediction

Page 32: Data Mining with SQL Server 2008

Sequence clustering

Numbers orders stronger associations

Direction of association (not necessary the other direction)

Page 33: Data Mining with SQL Server 2008

If you own certain stocks ' you own maybe other ones as well

Probability = thickness of line

Association

Page 34: Data Mining with SQL Server 2008

Let system learn how to classify data

Neural Network adapts to the new data

Formulate statement/hypothesis

Outcome is know

(Data / Surveys)

1. 70% data to train network (outcome is known) 2. 30% of data to test network (outcome is known) 3. New data (no survey needed, predict from network)

Other example: OCR

Neural Nets

Page 35: Data Mining with SQL Server 2008

Conclusion: When To Use What

Task Microsoft algorithms to use

Predicting a discrete attribute. For example, predict whether the recipient of a targeted mailing campaign will buy a product.

Microsoft Decision Trees Algorithm Microsoft Naive Bayes Algorithm Microsoft Clustering Algorithm Microsoft Neural Network Algorithm

Predicting a continuous attribute. For example, forecast next year's sales.

Microsoft Decision Trees Algorithm Microsoft Time Series Algorithm

Predicting a sequence. For example, perform a clickstream analysis of a company's Web site.

Microsoft Sequence Clustering Algorithm

Finding groups of common items in transactions. For example, use market basket analysis to suggest additional products to a customer for purchase.

Microsoft Association Algorithm Microsoft Decision Trees Algorithm

Finding groups of similar items. For example, segment demographic data into groups to better understand the relationships between attributes.

Microsoft Clustering Algorithm Microsoft Sequence Clustering Algorithm

Page 38: Data Mining with SQL Server 2008

Train station / airport

Who is the bad guy

Farmers

Find the best crops

Supermarket

Find to figure out how to get you to buy more, where the expensive items

Other usages of data miningFind patterns - Profiling

Page 39: Data Mining with SQL Server 2008

SSIS 2008 - Data profiling task

Get a profile of the data in a table

potential candidate keys length of data values in columns Null percentage of rows distribution of values ....

Tip

Page 40: Data Mining with SQL Server 2008

Video: Simple data mining model

http://www.sqlservercentral.com/articles/Video/65055/

Video: Data mining and Reporting Services

http://www.sqlservercentral.com/articles/Video/64190/

Data Mining Algorithms

http://msdn.microsoft.com/en-us/library/ms175595.aspx

Resources 1

Page 41: Data Mining with SQL Server 2008

Jamie MacLennan

http://blogs.msdn.com/b/jamiemac/

Richard Lees on BI

http://richardlees.blogspot.com/

Book Data Mining with Microsoft SQL Server 2008

http://www.amazon.com/gp/product/0470277742?ie=UTF8&tag=sqlserverda09-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0470277742

Resources 2

Page 42: Data Mining with SQL Server 2008

Why Data Mining? Uses Algorithms Demo Hands on Lab

Summary

Page 43: Data Mining with SQL Server 2008

3 things…

[email protected]

u

http://blog.gfader.com/

twitter.com/peitor

Page 44: Data Mining with SQL Server 2008

Thank You!

Gateway Court Suite 10 81 - 91 Military Road Neutral Bay, Sydney NSW 2089 AUSTRALIA

ABN: 21 069 371 900

Phone: + 61 2 9953 3000 Fax: + 61 2 9953 3105

[email protected]