
Page 1: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Regression for Data Mining

Mgt. 2206 – Introduction to Analytics

Matthew Liberatore

Thomas Coghlan

Page 2: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Learning Objectives

To understand the application of regression analysis in data mining: linear/nonlinear and logistic (logit)

To understand the key statistical measures of fit

To learn how to run and interpret regression analyses using SAS Enterprise Miner software

Page 3: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Analysis of Association

In business problems interests often go beyond the statistical testing of differences (e.g., female versus male preferences)

We are often interested in the degree of association between variables.

Regression is one of the techniques that helps uncover those relationships.

Page 4: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Linear Regression Analysis

Analysis of the strength of the linear relationship between predictor (independent) variables and outcome (dependent/criterion) variables.

In two dimensions (one predictor, one outcome variable) data can be plotted on a scatter diagram.

E(y) = β0 + β1x

where E(y) is the expected value of y (the outcome), β0 is the intercept term, and β1 is the coefficient of the predictor variable x.

Page 5: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Estimation Process

Regression model: y = β0 + β1x + ε

Regression equation: E(y) = β0 + β1x

Unknown parameters: β0, β1

Sample data: (x1, y1), (x2, y2), …, (xn, yn)

Sample statistics b0 and b1 provide estimates of β0 and β1.

Estimated regression equation: ŷ = b0 + b1x

Page 6: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Simple Linear Regression Equation: Positive Linear Relationship

[Figure: regression line sloping upward; slope β1 is positive, intercept β0; E(y) (outcome) on the vertical axis, x (predictor) on the horizontal axis]

Page 7: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Simple Linear Regression Equation: Negative Linear Relationship

[Figure: regression line sloping downward; slope β1 is negative, intercept β0; E(y) (outcome) on the vertical axis, x (predictor) on the horizontal axis]

Page 8: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Simple Linear Regression Equation: No Relationship

[Figure: scatter of points with no pattern; E(y) (outcome) on the vertical axis, x (predictor) on the horizontal axis]

Page 9: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Simple Linear Regression Equation: No Relationship

[Figure: flat regression line; slope β1 is 0, intercept β0; E(y) on the vertical axis, x on the horizontal axis]

Page 10: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Simple Linear Regression Equation: Parabolic Relationship

[Figure: scatter of points following a parabolic (curved) pattern; E(y) (outcome) on the vertical axis, x (predictor) on the horizontal axis; intercept β0]

Page 11: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Example

List the variables we have. Determine a dependent variable (DV) of interest. Is there a way to predict the DV?

Page 12: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Least Squares Method

Least Squares Criterion: minimize error (distance between actual data & estimated line)

min Σ(yi − ŷi)²

where:

yi = observed value of the dependent variable for the ith observation

ŷi = estimated value of the dependent variable for the ith observation

Page 13: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Slope for the Estimated Regression Equation

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

Least Squares Method

Page 14: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

y-Intercept for the Estimated Regression Equation

Least Squares Method

b0 = ȳ − b1x̄

where:

xi = value of the independent variable for the ith observation

yi = value of the dependent variable for the ith observation

x̄ = mean value of the independent variable

ȳ = mean value of the dependent variable

n = total number of observations
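As a quick illustration of these two formulas, here is a minimal Python sketch (not part of the original slides, which use Excel and SAS Enterprise Miner) that computes b1 and b0 for a small hypothetical data set:

    import numpy as np

    # Hypothetical (x, y) sample, used only to illustrate the least squares formulas above
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
    b0 = y_bar - b1 * x_bar                                            # y-intercept
    y_hat = b0 + b1 * x                                                # estimated regression equation
    print(b0, b1)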

Page 15: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Least Squares Estimation Procedure

Least Squares Criterion: the sum of the squared vertical deviations (on the y axis) of the points from the line is minimized.

[Figure: actual data points scattered around the predicted line]

Page 16: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Example: Kwatts vs. Temp

Temp    Kwatts
59.2     9,730
61.9     9,750
55.1    10,180
66.2    10,230
52.1    10,800
69.9    11,160
46.8    12,530
76.8    13,910
79.7    15,110
79.3    15,690
80.2    17,020
83.3    17,880

Page 17: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Is the Relationship Linear?

[Figure: scatter plot of KWatts (vertical axis, 0–20,000) against Temp (horizontal axis, 40–90)]

Page 18: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Example Results

Let X = Temp, Y = KWatts

Y = 319.04 + 185.27 X

[Figure: KWatts vs. Temp scatter plot showing the actual KWatts data, the fitted regression line (Forecast), and the overall average line]
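To see where these numbers come from, here is a minimal Python sketch (an illustration, not the Excel/SAS Enterprise Miner workflow used in the slides) that fits the least squares line to the Temp/KWatts data above; the estimates should agree with Y = 319.04 + 185.27 X to rounding:

    import numpy as np

    # Temp/KWatts data from the "Kwatts vs. Temp" slide
    temp = np.array([59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3])
    kwatts = np.array([9730, 9750, 10180, 10230, 10800, 11160,
                       12530, 13910, 15110, 15690, 17020, 17880], dtype=float)

    b1, b0 = np.polyfit(temp, kwatts, 1)   # least squares slope and intercept
    print(b0, b1)                          # intercept ≈ 319, slope ≈ 185.3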

Page 19: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Coefficient of Determination

How “strong” is relationship between predictor & outcome? (Fraction of observed variance of outcome variable explained by the predictor variables).

Relationship Among SST, SSR, SSE

where:

SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

SST = SSR + SSE

SST = Σ(yi − ȳ)²    SSR = Σ(ŷi − ȳ)²    SSE = Σ(yi − ŷi)²

Page 20: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Coefficient of Determination (r²)

r² = SSR/SST

where:

SSR = sum of squares due to regression
SST = total sum of squares

Page 21: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Kwatts vs. Temp Example

            df    SS
Regression   1    58784708.31
Residual    10    38696916.69
Total       11    97481625

r² = 0.603033734

Does the linear regression provide a good fit?
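A quick check of this number from the ANOVA table above (a small Python sketch, not part of the original slides):

    # Sums of squares from the ANOVA table above
    SSR = 58784708.31
    SSE = 38696916.69
    SST = SSR + SSE        # 97481625
    r2 = SSR / SST
    print(r2)              # ≈ 0.603: about 60% of the variation in KWatts is explained by Temp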

Page 22: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Assumptions About the Error Term

1. The error ε is a random variable with a mean of zero.

2. The variance of ε, denoted by σ², is the same for all values of the independent variable.

3. The values of ε are independent.

4. The error ε is a normally distributed random variable.

Page 23: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Significance Test for Regression

Is the value of β1 zero?

Two tests are commonly used: the t test and the F test.

Both the t test and the F test require an estimate of the variance (σ²) of the error (ε).

As in most of our statistical work, we are working with a sample, not the population, so we use the mean square error (s²).

Page 24: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Testing for Significance: An Estimate of σ²

SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1xi)²

where:

s² = MSE = SSE/(n − 2)

Page 25: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Testing for Significance

An Estimate of σ

s = √MSE = √(SSE/(n − 2))

To estimate σ we take the square root of s².

The resulting s is called the standard error of the estimate.

Page 26: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Testing for Significance: t Test

Hypotheses: coefficient (β1) is 0 (no relationship between predictor & outcome)

Calculating the t statistic:

t = b1 / s_b1

where s_b1 is the estimated standard error of b1.
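A minimal Python sketch of this t test (an illustration, not the SAS Enterprise Miner output; the formula used for the standard error of b1, s_b1 = s/√Σ(xi − x̄)², is the standard one and is assumed here rather than taken from the slides):

    import numpy as np
    from scipy import stats

    def slope_t_test(x, y):
        """t test of H0: beta1 = 0 for a simple linear regression."""
        n = len(x)
        b1, b0 = np.polyfit(x, y, 1)
        resid = y - (b0 + b1 * x)
        mse = np.sum(resid ** 2) / (n - 2)                  # s^2 = SSE/(n - 2)
        s_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))   # standard error of b1
        t = b1 / s_b1
        p = 2 * stats.t.sf(abs(t), df=n - 2)                # two-sided p-value
        return t, p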

Page 27: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Testing for Significance: t Test

1. Determine the hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0

2. Specify the level of significance: α = .05

3. Select the test statistic: t = b1 / s_b1

4. State the rejection rule: reject H0 if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom)

Page 28: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Alternative Test: F Test

Same hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0

Different test statistic: F = MSR/MSE

Page 29: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Testing for Significance: F Test

F = MSR/MSE

Reject H0 if: p-value < α or F > Fα

where Fα is based on an F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator.

Page 30: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Testing for Significance: F Test

1. Determine the hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0

2. Specify the level of significance: α = .05

3. Select the test statistic: F = MSR/MSE

4. State the rejection rule: reject H0 if p-value < .05 or F > 10.13 (with 1 d.f. in the numerator and 3 d.f. in the denominator)

Page 31: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Standard Error of the Estimate

Standard Error of Estimate has properties analogous to those of standard deviation.

How “good” is our “fit”?

Interpretation is similar: roughly 68% of outcomes/predictions fall within one standard error of the estimate (s_est), and roughly 95% fall within two.

Page 32: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan
Page 33: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Kwatts vs. Temp Example: ANOVA

            df    SS             MS             F       Significance F
Regression   1    58784708.31    58784708.31    15.19   0.002972726
Residual    10    38696916.69    3869691.669
Total       11    97481625

            Coefficients    Standard Error    t Stat         P-value
Intercept   319.0414124     3260.412811       0.097853073    0.923982528
Temp        185.2702073     47.53479059       3.897570706    0.002972726

Is the regression model statistically significant? Is the coefficient of Temp significant?
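The F and t statistics in this output can be reproduced directly from the table values; the short Python sketch below (an illustration added here, not part of the slides) also shows that for simple regression F equals t squared:

    from scipy import stats

    # Values from the ANOVA / coefficient output above
    SSR, SSE = 58784708.31, 38696916.69
    df_reg, df_resid = 1, 10

    MSR = SSR / df_reg
    MSE = SSE / df_resid                    # ≈ 3,869,692; s ≈ 1,967 (standard error of the estimate)
    F = MSR / MSE                           # ≈ 15.19
    p_F = stats.f.sf(F, df_reg, df_resid)   # ≈ 0.003, matching "Significance F"

    b1, se_b1 = 185.2702073, 47.53479059
    t = b1 / se_b1                          # ≈ 3.90; note F ≈ t**2
    p_t = 2 * stats.t.sf(abs(t), df_resid)  # ≈ 0.003, matching the Temp P-value
    print(F, p_F, t, p_t)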

Page 34: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Cautions about Interpreting Significance Tests

Statistical significance does not mean there is a linear relationship between x and y.

A relationship between x and y does not mean a cause-and-effect relationship is present between x and y.

Page 35: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

SAS Enterprise Miner

These results can be obtained using Excel or a data mining package such as SAS Enterprise Miner 5.3.

Using SAS Enterprise Miner requires the following steps:
Convert your data (usually in an Excel file) into a SAS data file using SAS 9.1.
Create a project in Enterprise Miner.
Within the project:
Create a data source using your SAS data file.
Create a diagram that includes a data node, a regression node, and a multiplot node for graphs.
Run the model in the diagram and review the results.

Page 36: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Creating a SAS data file from an Excel file: open SAS 9.1. Select File then Import Data

Page 37: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

This opens the import wizard. Since the source file is from Excel, click Next. Then click Browse to find the TempKWatts.xls file

Page 38: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Since the data are on sheet1$, click Next. Then enter SASUSER as the Library and TEMPKILOWATTL as the Member. Then click Next

Page 39: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Now click Finish to create your file

Page 40: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Open SAS Enterprise Miner 5.3. Enter the user name and password provided

Page 41: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

The Enterprise Window below opens. Select New Project

Page 42: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

The Create New Project dialog box appears. Select the General tab, then type the short name of the project, e.g., KWattTemp0. Keep the default path.

Page 43: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

In the Startup code tab, enter:libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS Files\9.1\EM_Projects"; This code will be run each time you open the project

Page 44: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

The Enterprise Miner application window opens

Page 45: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Right-click on Data Source, opening the wizard. Source is SAS table, so click Next

Page 46: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Browse the SAS libraries to find the SAS table Tempkilowattl found in the SASuser Library (previously created)

Page 47: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Click Next twice. Note that the Table properties shows that we have two variables with 12 observations

Page 48: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

The next step controls how Enterprise Miner organizes metadata for the variables in your data. Select Advanced, then click Next (you can view/change the settings if you click Customize before clicking Next).

Page 49: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Change the Role of KWatts to target (outcome variable); change the Level of both KWatts and Temp to interval (continuous values); then click Next (other levels are possible, such as binary). You can click Explore if you wish to look at some basic stats – we will do this later.

Page 50: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Here Role relates to the role of the data set (raw, train, validate, score); raw is fine for our analysis of data, so click Finish

Page 51: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Tempkilowattl now appears under Data Sources in the top left panel called the Project Panel

Page 52: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

We need to create a Diagram for our model. Right-click on Diagrams, then enter TempKwatts0 in the dialog box. Now the left panel shows TempKwatts0 as a Diagram, and the right-hand panel is called the Diagram Workspace. Icons can be dragged and dropped onto the Diagram Workspace.

Page 53: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Now add an Input Data Node to the Diagram. From the Data Sources list in the Project Panel drag and drop the Data Source TempKwatts0 onto the Diagram Workspace. Note that when the input data node is highlighted, various properties are displayed on the left-hand panel.

Page 54: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

If you wish to see the properties of any or all of the variables, highlight the input data node; then on the left hand Properties Panel under Train, click on the box to the right of Variables; in the screen that opens control-click on KWatts and Temp; then click on Explore in the lower right

Page 55: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Frequency distributions for the variables and the raw data are provided. Right-clicking on observations in the lower-left panel will show where they appear in the bar charts. Cancel when finished.

Page 56: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Click on the Explore tab found over the Diagram Workspace, and then drag and drop the Multiplot icon onto the field. Using your cursor, draw a directed arrow from the TempKwattsl icon to the Multiplot icon. With the Multiplot icon highlighted, its properties are found in the left-hand Properties Panel.

Page 57: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Right-click on the Multiplot icon and select Run. After the run is completed select Results from the Run Status window.

Page 58: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Various charts are available as shown below. Descriptive statistics for each variable are given in the lower pane.

Page 59: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Click on the Model tab and drag the Regression icon onto the Model field. Connect the Tempkwattsl icon to the Regression icon. Highlight the Regression icon and on the Property Panel change Regression Type to linear regression.

Page 60: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Run the Regression and select Results. Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, model output (estimates, F and t stats, R-square), and the two effects (intercept and slope – bars represent size and color represents direction)

Page 61: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

For a given percentile, the Target Mean is the actual (or estimated value based on actuals), or what you are trying to predict; the Mean for Predicted is the forecasted values, or the predictions (or estimated values based on forecasts). The results are shown from highest to lowest forecasted values. The distance between the curves shows how well the model predicts the actual data.

Page 62: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

A variety of fit statistics are provided. These include SSE, MSE = SSE/(n − 2), ASE = SSE/n, RMSE = SQRT(MSE), RASE = SQRT(ASE), FPE = MSE (n + p + 1)/n, and MAX = largest error in absolute value, where n = number of observations and p = number of variables in the model (one in our case). Schwarz's Bayesian Criterion and Akaike's Information Criterion are used for model selection (comparing one model to another). Schwarz's adjusts the residual squared error for the number of parameters estimated, while Akaike's is a relative measure of the information lost by fitting the model.

Page 63: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Kwatts vs. Temp Example 2

Another approach to modeling the relationship between Kwatts and Temp is to use a nonlinear regression

This is easily accomplished in Enterprise Miner: highlight the regression node, then in the left-hand panel select Yes for polynomial terms. We use the default of two terms.

Is the fit any better???

Page 64: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan
Page 65: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Multiple Regression

Consider the following data relating family size and income to food expenditures:

family   food $   income $   family size
1        5.2      28         3
2        5.1      26         3
3        5.6      32         2
4        4.6      24         1
5        11.3     54         4
6        8.1      59         2
7        7.8      44         3
8        5.8      30         2
9        5.1      40         1
10       18       82         6
11       4.9      42         3
12       11.8     58         4
13       5.2      28         1
14       4.8      20         5
15       7.9      42         3
16       6.4      47         1
17       20       112        6
18       13.7     85         5
19       5.1      31         2
20       2.9      26         2

Page 66: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Multiple Regression

We can run this problem in Enterprise Miner using the same approach followed with the previous example.

On our model field we have placed the data source called foodexpenditures, and also both Multiplot and StatExplore, found under the Explore tab above the model field.

Highlight foodexpenditures, then in the left-hand panel under Training, find Variables and click on the box to the right to open up the variables.

Change the role of family to rejected (it is just the number of the observation), change the role of food_ to target, and change the level of food_, income_, and fam_size to interval, then click OK.

Page 67: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Foodexpenditures Model

Page 68: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Highlight the StatExplore node, right-click to Run, then select Results. Correlations between the input variables and the target are provided, along with basic statistics. The input variables are ordered by the size of the correlations. Now close out the results window and run the regression node and obtain results

Page 69: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, the model output (estimates, F and t stats, R-square), and the three effects (intercept and slopes for the two input variables, with bars representing size and color representing direction). The model is significant and fits the data well.
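For readers who want to reproduce a comparable fit outside Enterprise Miner, here is a minimal Python sketch (an illustration added here, not the SAS output) of a multiple regression on the food expenditure data above:

    import numpy as np
    import statsmodels.api as sm

    # Food expenditure data from the earlier slide: food ($), income ($), family size
    food   = np.array([5.2, 5.1, 5.6, 4.6, 11.3, 8.1, 7.8, 5.8, 5.1, 18.0,
                       4.9, 11.8, 5.2, 4.8, 7.9, 6.4, 20.0, 13.7, 5.1, 2.9])
    income = np.array([28, 26, 32, 24, 54, 59, 44, 30, 40, 82,
                       42, 58, 28, 20, 42, 47, 112, 85, 31, 26], dtype=float)
    size   = np.array([3, 3, 2, 1, 4, 2, 3, 2, 1, 6,
                       3, 4, 1, 5, 3, 1, 6, 5, 2, 2], dtype=float)

    X = sm.add_constant(np.column_stack([income, size]))  # intercept, income, family size
    fit = sm.OLS(food, X).fit()
    print(fit.params)                # intercept and slopes
    print(fit.rsquared)              # coefficient of determination
    print(fit.fvalue, fit.f_pvalue)  # overall F test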

Page 70: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

What happens in regression analysis when the target variable is binary?

There are many situations when the target variable is binary. Some examples:
whether a customer will or will not receive credit
whether a customer will or will not respond to a promotion
whether a firm will go bankrupt in a year
whether a student will pass an exam!

Page 71: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Passing an Exam Data

Student id   Outcome   Study Hours
1            0         3
2            1         34
3            0         17
4            0         6
5            0         12
6            1         15
7            1         26
8            1         29
9            0         14
10           1         58
11           0         2
12           1         31
13           1         26
14           0         11

Page 72: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Running a linear regression to predict pass/don't pass as a function of hours of study provides a model that doesn't model the data well. The data are given in exampassing.xls.

Passing an Exam

[Figure: actual pass/don't pass outcomes (0 or 1) and the linear regression predictions plotted against hours of study (0–70); vertical axis 0–1.6, series labeled Actual and Predicted]

Page 73: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

The Enterprise Miner results show a poor fit on a percentile basis between predicted and target – another modeling approach is needed.

Page 74: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Logistic Regression

Similar to linear regression, with two main differences:
Y (outcome or response) is categorical: Yes/No, Approve/Reject, Responded/Did not respond.
The result is expressed as a probability of being in either group.

Page 75: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Comparing the Logistic & Linear Regression Models

Page 76: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Logistic regression

p = Prob(y=1|x) = exp(a+bx)/[1+exp(a+bx)]

1-p =1/[1+exp(a+bx)]

ln [p/(1-p)] = a + bx

where:

exp or e is the exponential function (e=2.71828…)

ln is the natural logarithm (ln(e) = 1)

p is probability that the event y occurs given x, and can range between 0 and 1

p/(1-p) is the "odds ratio"

ln[p/(1-p)] is the log odds ratio, or "logit"

all other components of the regression model are the same
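As a concrete illustration (added here; the slides themselves use SAS Enterprise Miner), a logistic model of this form can be fitted to the exam passing data from the earlier slide with a few lines of Python, and should give coefficients close to those reported later in the deck:

    import numpy as np
    import statsmodels.api as sm

    # Exam passing data from the "Passing an Exam Data" slide (1 = pass, 0 = don't pass)
    hours  = np.array([3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11], dtype=float)
    passed = np.array([0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0], dtype=float)

    X = sm.add_constant(hours)            # columns: intercept a, slope b for hours of study
    logit_fit = sm.Logit(passed, X).fit(disp=0)
    print(logit_fit.params)               # a and b in ln[p/(1-p)] = a + b*x
    print(logit_fit.predict(X))           # fitted probabilities p = exp(a+bx)/[1+exp(a+bx)]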

Page 77: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Odds Ratio

Frequently used. Related to the probability of an event as follows:

Odds Ratio = p/(1-p)

Example: probability of a firm going bankrupt = .25, so the odds the firm will go bankrupt = .25/(1-.25) = 1/3, i.e., 3 to 1 against. This is how sports books quote odds (e.g., if the odds of VU winning a championship are 2:1 against, the probability is 1/3).

ln [p/(1-p)] = a + bx means that as x increases by 1, the natural log of the odds ratio increases by b, or the odds ratio increases by a factor of exp(b).

Page 78: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Probability, Odds Ratio, LN of Odds Ratio

[Figure: probability, odds, and ln(odds) plotted together; vertical axis from −5 to 25]

Page 79: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Running the exam data: change the regression type from linear regression to logistic regression. Highlight the data node; on the left-hand panel under Train, open Variables and change the level of outcome to binary.

Page 80: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Results show a much better fit (upper left) and only one misclassification (lower right – a false negative).

Page 81: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

The results show that the odds ratio = p/(1-p) = exp(-8.4962 + 0.4949x). For every additional hour of study the odds ratio increases by a factor of exp(0.4949) = 1.640.
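A quick worked check of this model (added for illustration; the 20 hours of study below is a hypothetical value, not from the slides):

    import math

    a, b = -8.4962, 0.4949
    print(math.exp(b))        # ≈ 1.64: the odds multiply by about 1.64 per extra hour of study

    x = 20                    # hypothetical: a student who studies 20 hours
    p = math.exp(a + b * x) / (1 + math.exp(a + b * x))
    print(round(p, 3))        # ≈ 0.80 predicted probability of passing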

Page 82: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Understanding Response Rate and Lift

To better understand the top left chart, change cumulative lift to cumulative % response. The observations are ranked by the predicted probability of response (highest to lowest) for each observation (from the fitted model).

Page 83: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Understanding Response Rate and Lift

Since the first 6 passes were correctly classified, the cumulative % response is 100% through the 40th percentile.

At the 50th percentile the next observation with the highest predicted probability is a non-response, so the cumulative % response drops to 6/7 or 85.7%.

The 8th ranked observation, between the 55th and 60th percentile, is a positive response, so the cumulative % response is about 7/8 or 87.5%. Since there are no more positive responses after the 60th percentile, the cumulative response rate will drop to 50%.

The chart compares how well the cumulative ranked predictions lead to a match between actual and predicted responses.

Page 84: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Understanding Response Rate and Lift

Lift is the ratio of the actual response rate (passing) of the top n% of the ranked observations to the overall response rate. Cumulative lift is defined likewise.

At the 50th percentile, the cumulative % response is 85.7%, the cumulative base response is 50%, for a lift of 1.7142.
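The same quantities can be computed directly from the model's predictions; here is a minimal Python sketch (an illustration added here, not the Enterprise Miner implementation):

    import numpy as np

    def cumulative_response_and_lift(actual, predicted_prob):
        """Cumulative % response and cumulative lift, ranking observations by predicted probability."""
        order = np.argsort(-np.asarray(predicted_prob))      # highest predicted probability first
        ranked = np.asarray(actual, dtype=float)[order]
        cum_response = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
        overall = ranked.mean()                              # base (overall) response rate
        lift = cum_response / overall
        return cum_response, lift

For the exam data, the 7th ranked observation (the 50th percentile of 14 observations) gives a cumulative % response of 6/7 ≈ 85.7% and a lift of about 1.71, as described above.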

Page 85: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

On the Properties Panel, click on Exported Data to see the predicted probabilities and response for each observation and compare to the actual response.

Page 86: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Logistic regression uses maximum likelihood (and not sum of squared errors) to estimate the model parameters. The results below show that the model is highly significant based on a chi-square test. The Wald chi-square statistic tests whether an effect is significant or not.

Page 87: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Bankruptcy Prediction

To predict bankruptcy a year in advance, you might collect:
working capital/total assets (WC/TA)
retained earnings/total assets (RE/TA)
earnings before interest and taxes/total assets (EBIT/TA)
market value of equity/total debt (MVE/TD)
sales/total assets (S/TA)

Page 88: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Bankruptcy Training Data

Firm   WC/TA    RE/TA     EBIT/TA   MVE/TD   S/TA     BR/NB
1      0.0165   0.1192    0.2035    0.813    1.6702   1
2      0.1415   0.3868    0.0681    0.5755   1.0579   1
3      0.5804   0.3331    0.081     0.5755   1.0579   1
4      0.2304   0.296     0.1225    0.4102   3.0809   1
5      0.3684   0.3913    0.0524    0.1658   1.1533   1
6      0.1527   0.3344    0.0783    0.7736   1.5046   1
7      0.1126   0.3071    0.0839    1.3429   1.5736   1
8      0.0141   0.2366    0.0905    0.5863   1.4651   1
9      0.222    0.1797    0.1526    0.3459   1.7237   1
10     0.2776   0.2567    0.1642    0.2968   1.8904   1
11     0.2689   0.1729    0.0287    0.1224   0.9277   0
12     0.2039   -0.0476   0.1263    0.8965   1.0457   0
13     0.5056   -0.1951   0.2026    0.538    1.9514   0
14     0.1759   0.1343    0.0946    0.1955   1.9218   0
15     0.3579   0.1515    0.0812    0.1991   1.4582   0
16     0.2845   0.2038    0.0171    0.3357   1.3258   0
17     0.1209   0.2823    -0.0113   0.3157   2.3219   0
18     0.1254   0.1956    0.0079    0.2073   1.489    0
19     0.1777   0.0891    0.0695    0.1924   1.6871   0
20     0.2409   0.166     0.0746    0.2516   1.8524   0

Page 89: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Bankruptcy Example

Using the BankruptTrain.xls data, create a SAS data file called bankrupt:
BR_NB: role is target and level is binary
Firm: role is rejected and level is nominal (it is simply the firm number)
Remaining five financial ratio variables: role is input and level is interval

Page 90: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Create a diagram named bankrupt1. Drag and drop the data node onto the model. Highlight the data node and on the left hand panel under variables click on the box to its right to see the variables data

Page 91: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

From the Explore tab drag and drop the StatExplore node onto the diagram and link it to the bankrupt node. Highlight the StatExplore node, right-click to run it, and obtain the results. On top, correlations between the five input variables and the target are shown via bars ordered from largest to smallest. Below that, the mean variable score for bankrupt vs. non-bankrupt observations is shown.

Page 92: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

From the Model tab drag and drop the regression node onto the diagram and connect it to the bankrupt node. Highlight the regression node and run, and obtain the results

Page 93: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

The results show that the model fits the data very well, with a highly significant overall chi-square statistic, low error values, and 0 misclassifications. Cumulative lift shows that observations in the top 50% ranked by predicted probability are twice as likely to be bankrupt as the sample overall (a lift of 2).

Page 94: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Scoring

Once you have specified a model, you might wish to apply it to new data whose outcome is unknown, i.e., to make predictions.

This can be easily accomplished in Enterprise Miner using scoring

Convert the data set BankruptScore.xls to a SAS file called bankruptscore. The role of this data is score.
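Outside Enterprise Miner, scoring is just applying the fitted equation to the new rows. The Python sketch below illustrates the mechanics; the intercept and coefficients are hypothetical placeholders (the real values would come from the trained regression node), and only the first two scoring firms are typed in:

    import numpy as np

    # Hypothetical fitted logistic coefficients, in the order WC/TA, RE/TA, EBIT/TA, MVE/TD, S/TA.
    # Placeholders for illustration only; use the estimates from your trained model.
    intercept = -1.0
    coefs = np.array([0.5, 8.0, 2.0, 0.3, -0.2])

    new_firms = np.array([
        [0.1759, 0.1343, 0.0956, 0.1955, 1.9218],   # firm A from the scoring data
        [0.3732, 0.3483, -0.0013, 0.3483, 1.8223],  # firm B
    ])
    logit = intercept + new_firms @ coefs
    prob = 1.0 / (1.0 + np.exp(-logit))             # P(y = 1 | x) from the logistic model
    predicted = (prob >= 0.5).astype(int)           # 0.5 cutoff for the predicted outcome
    print(prob, predicted)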

Page 95: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Bankruptcy Scoring Data

Firm   WC/TA    RE/TA    EBIT/TA   MVE/TD   S/TA
A      0.1759   0.1343   0.0956    0.1955   1.9218
B      0.3732   0.3483   -0.0013   0.3483   1.8223
C      0.1725   0.3238   0.104     0.8847   0.5576
D      0.163    0.3555   0.011     0.373    2.8307
E      0.1904   0.2011   0.1329    0.558    1.6623
F      0.1123   0.2288   0.01      0.1884   2.7186
G      0.0732   0.3526   0.0587    0.2349   1.7432
H      0.2653   0.2683   0.0235    0.5118   1.835
I      0.107    0.0787   0.0433    0.1083   1.2051
J      0.2921   0.239    0.9673    0.3402   0.9277

Page 96: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Drag and drop the bankruptscore data node to the bankrupt1 diagram. From the Assess tab, drag and drop the Score node into the diagram. Link the regression and bankruptscore nodes together and connect them to the Score node.

Page 97: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Run the Score node and obtain the Results. Of the 10 firms, 6 are predicted to become bankrupt.

Page 98: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

For details about the individual predictions, highlight the Score node and on the left-hand panel click on the square to the right of Exported Data. Then in the box that appears click on the row whose Port entry is Score. Then click on Explore.

Page 99: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

The lower portion of the output is shown below. The predictions are given, along with the probabilities of the firm becoming bankrupt or not.

Page 100: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Regression Using Selection Models

When there are a number of possible input variables, procedures are available to sort through them and include those that have a certain level of statistical significance

SAS Enterprise Miner 5.3 offers three selection methods:
Backward
Forward
Stepwise

Page 101: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Regression Using Selection Models

Backward: training begins with all candidate effects in the model and removes effects until the stay significance level or the stop criterion is met.

Forward: training begins with no candidate effects in the model and adds effects until the entry significance level or the stop criterion is met.

Stepwise: training begins as in the forward model but may remove effects already in the model. This continues until the stay significance level or the stop criterion is met.

Note that the default significance levels (p-values) are 0.05 and no stop criteria (such as a maximum number of steps in the regression) are set.
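For intuition, here is a minimal Python sketch of a forward-style selection loop based on entry p-values (an illustration added here; it is a simplified stand-in for, not a replica of, the Enterprise Miner selection procedures, and assumes X is a pandas DataFrame of candidate inputs and y a binary target):

    import statsmodels.api as sm

    def forward_select(X, y, alpha_enter=0.05):
        """Greedy forward selection for a logistic model using Wald p-values."""
        selected = []
        remaining = list(X.columns)
        while remaining:
            best_p, best_var = None, None
            for var in remaining:                     # try adding each remaining effect
                cols = selected + [var]
                fit = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
                p = fit.pvalues[var]
                if best_p is None or p < best_p:
                    best_p, best_var = p, var
            if best_p is not None and best_p < alpha_enter:
                selected.append(best_var)             # effect meets the entry significance level
                remaining.remove(best_var)
            else:
                break                                 # no remaining effect qualifies; stop
        return selected

A stepwise procedure would additionally re-test the effects already in the model at each step and drop any that no longer meet the stay significance level.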

Page 102: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Regression Using Selection Models – Bankruptcy Model

To select stepwise regression for the bankruptcy model, highlight the regression node and, in the Properties Panel under Selection Model, choose Stepwise. The default significance level of 0.05 is used.

Page 103: Regression for Data Mining Mgt. 2206 – Introduction to Analytics Matthew Liberatore Thomas Coghlan

Regression Using Selection Models – Bankruptcy Model

Interestingly, the training model only uses RE/TA as a predictor.

There are 3 misclassifications (a .15 rate) in this set vs. 0 in the original model.

The results are very different: the original model with all 5 input variables predicted bankruptcy for G, E, C, and J, while the stepwise model predicted that B, C, D, F, G, H, and J would become bankrupt.

Changing the significance levels to 0.1 (to make it easier for input variables to enter/leave the stepwise model) produces the same results.