
Predictive Modeling with IBM SPSS Modeler

Student Guide

Course Code: 0A032

ERC 1.0


Predictive Modeling with IBM SPSS Modeler
Licensed Materials - Property of IBM

© Copyright IBM Corp. 2010
0A032

Published October 2010
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide.

SPSS and PASW are trademarks of SPSS Inc., an IBM Company, registered in many jurisdictions worldwide.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Other product and service names might be trademarks of IBM or other companies.

This guide contains proprietary information which is protected by copyright. No part of this document may be photocopied, reproduced, or translated into another language without a legal license agreement from IBM Corporation.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.


Table of Contents

LESSON 1: PREPARING DATA FOR MODELING ... 1-1
1.1  Introduction ... 1-2
1.2  Cleaning Data ... 1-3
1.3  Balancing Data ... 1-4
1.4  Numeric Data Transformations ... 1-6
1.5  Binning Data Values ... 1-9
1.6  Data Partitioning ... 1-12
1.7  Anomaly Detection ... 1-14
1.8  Feature Selection for Models ... 1-19
Summary Exercises ... 1-24

LESSON 2: DATA REDUCTION: PRINCIPAL COMPONENTS ... 2-1
2.1  Introduction ... 2-1
2.2  Use of Principal Components for Prediction Modeling and Cluster Analyses ... 2-1
2.3  What to Look for When Running Principal Components or Factor Analysis ... 2-3
2.4  Principles ... 2-3
2.5  Factor Analysis versus Principal Components Analysis ... 2-4
2.6  Number of Components ... 2-4
2.7  Rotations ... 2-5
2.8  Component Scores ... 2-6
2.9  Sample Size ... 2-6
2.10 Methods ... 2-6
2.11 Overall Recommendations ... 2-7
2.12 Example: Regression with Principal Components ... 2-7
Summary Exercises ... 2-21

LESSON 3: DECISION TREES/RULE INDUCTION ... 3-1
3.1  Introduction ... 3-1
3.2  Comparison of Decision Tree Models ... 3-1
3.3  Using the C5.0 Node ... 3-3
3.4  Viewing the Model ... 3-7
3.5  Generating and Browsing a Rule Set ... 3-14
3.6  Understanding the Rule and Determining Accuracy ... 3-17
3.7  Understanding the Most Important Factors in Prediction ... 3-26
3.8  Further Topics on C5.0 Modeling ... 3-27
3.9  Modeling Categorical Outputs with Other Decision Tree Algorithms ... 3-32
3.10 Modeling Categorical Outputs with CHAID ... 3-32
3.11 Modeling Categorical Outputs with C&R Tree ... 3-39
3.12 Modeling Categorical Outputs with QUEST ... 3-45
3.13 Predicting Continuous Fields ... 3-48
Summary Exercises ... 3-56

LESSON 4: NEURAL NETWORKS ... 4-1
4.1  Introduction to Neural Networks ... 4-1
4.2  Training Methods ... 4-2
4.3  The Multi-Layer Perceptron ... 4-3
4.4  The Radial Basis Function ... 4-4
4.5  Which Method? ... 4-5
4.6  The Neural Network Node ... 4-6
4.7  Models Palette ... 4-15
4.8  The Neural Net Model ... 4-16
4.9  Validating the List of Predictors ... 4-23
4.10 Understanding the Neural Network ... 4-25
4.11 Understanding the Reasoning Behind the Predictions ... 4-28
4.12 Model Summary ... 4-31
4.13 Boosting and Bagging Models ... 4-31
4.14 Model Boosting with Neural Net ... 4-32
4.15 Model Bagging with Neural Net ... 4-40
Summary Exercises ... 4-47

LESSON 5: SUPPORT VECTOR MACHINES ... 5-1
5.1  Introduction ... 5-1
5.2  The Structure of SVM Models ... 5-1
5.3  SVM Model to Predict Churn ... 5-5
5.4  Exploring the Model ... 5-14
5.5  A Model with a Different Kernel Function ... 5-17
5.6  Tuning the RBF Model ... 5-20
Summary Exercises ... 5-23

LESSON 6: LINEAR REGRESSION ... 6-1
6.1  Introduction ... 6-1
6.2  Basic Concepts of Regression ... 6-2
6.3  An Example: Error or Fraud Detection in Claims ... 6-4
6.4  Using Linear Models Node to Perform Regression ... 6-15
Summary Exercises ... 6-26

LESSON 7: COX REGRESSION FOR SURVIVAL DATA ... 7-1
7.1  Introduction ... 7-1
7.2  What Is Survival Analysis? ... 7-2
7.3  Cox Regression ... 7-5
7.4  Cox Regression to Predict Churn ... 7-6
7.5  Checking the Proportional Hazards Assumption ... 7-19
7.6  Predictions from a Cox Model ... 7-22
Summary Exercises ... 7-36

LESSON 8: TIME SERIES ANALYSIS ... 8-1
8.1  Introduction ... 8-1
8.2  What Is a Time Series? ... 8-3
8.3  A Time Series Data File ... 8-4
8.4  Trend, Seasonal and Cyclic Components ... 8-7
8.5  What Is a Time Series Model? ... 8-9
8.6  Interventions ... 8-11
8.7  Exponential Smoothing ... 8-12
8.8  ARIMA ... 8-13
8.9  Data Requirements ... 8-15
8.10 Automatic Forecasting in a Production Setting ... 8-16
8.11 Forecasting Broadband Usage in Several Markets ... 8-16
8.12 Applying Models to Several Series ... 8-40
Summary Exercises ... 8-46

LESSON 9: LOGISTIC REGRESSION ... 9-1
9.1  Introduction to Logistic Regression ... 9-1
9.2  A Multinomial Logistic Analysis: Predicting Credit Risk ... 9-4
9.3  Interpreting Coefficients ... 9-13
Summary Exercises ... 9-19

LESSON 10: DISCRIMINANT ANALYSIS ... 10-1
10.1 Introduction ... 10-1
10.2 How Does Discriminant Analysis Work? ... 10-2
10.3 The Discriminant Model ... 10-3
10.4 How Cases Are Classified ... 10-3
10.5 Assumptions of Discriminant Analysis ... 10-4
10.6 Analysis Tips ... 10-5
10.7 Comparison of Discriminant and Logistic Regression ... 10-5
10.8 An Example: Discriminant ... 10-6
Summary Exercises ... 10-18

LESSON 11: BAYESIAN NETWORKS ... 11-1
11.1 Introduction ... 11-1
11.2 The Basics of Bayesian Networks ... 11-1
11.3 Type of Bayesian Networks in PASW Modeler ... 11-4
11.4 Creating a Bayes Network Model ... 11-5
11.5 Modifying Bayes Network Model Settings ... 11-21
Summary Exercises ... 11-28

LESSON 12: FINDING THE BEST MODEL FOR CATEGORICAL TARGETS ... 12-1
12.1 Introduction ... 12-1
Summary Exercises ... 12-22

LESSON 13: FINDING THE BEST MODEL FOR CONTINUOUS TARGETS ... 13-1
13.1 Introduction ... 13-1
Summary Exercises ... 13-19

LESSON 14: GETTING THE MOST FROM MODELS ... 14-1
14.1 Introduction ... 14-1
14.2 Combining Models with the Ensemble Node ... 14-2
14.3 Using Propensity Scores ... 14-11
14.4 Meta-Level Modeling ... 14-17
14.5 Error Modeling ... 14-22
Summary Exercises ... 14-30

APPENDIX A: DECISION LIST ... A-1
Introduction ... A-1
A Decision List Model ... A-1
Comparison of Rule Induction Models ... A-4
Rule Induction Using Decision List ... A-5
Understanding the Rules and Determining Accuracy ... A-8
Understanding the Most Important Factors in Prediction ... A-14
Expert Options for Decision List ... A-15
Interactive Decision List ... A-18
Summary Exercises ... A-34


Lesson 1: Preparing Data for Modeling

Overview

•  Preparing and cleaning data for modeling

•  Balancing data using the Distribution and the Balance node

•  Transforming the data with the Derive node

•  Grouping data with the Binning node

•  Partitioning the data into training and testing samples with the Partition node

•  Detecting unusual cases with the Anomaly Node

•  Selecting predictors with the Feature Selection Node

Data

In this lesson we use data from a telecommunications company, churn.txt, for several examples. The file contains records for 1477 of the company's customers who have at one time purchased a mobile phone. It includes such information as length of time spent on local, long distance and international calls, the type of billing scheme, and a variety of basic demographics, such as age and gender. The customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers.

We want to use data mining to understand what factors influence whether an individual remains a customer or leaves for an alternative company. The data are typical of what is often referred to as a churn example (hence the file name). We also use a similar data file named rawdata.txt to illustrate several steps in data preparation, and a third file, customer_dbase.sav, also from a telecommunications firm, to demonstrate how to detect anomalous records and select fields for modeling.

Note about Type Nodes in this Course

Streams presented in this course contain Type nodes, although in most instances the Types tab in the Source node would serve the same purpose.

PASW® Modeler and PASW® Modeler Server

By default, PASW Modeler will run in local mode on your desktop machine. If PASW Modeler Server has been installed, then PASW Modeler can be run in local mode or in distributed (client-server) mode. In the latter mode, PASW Modeler streams are built on the client machine, but run by PASW Modeler Server. Since the data files used in this training course are relatively small, we recommend you run in local mode. However, if you choose to run in distributed mode, make sure the training data are either placed on the machine running PASW Modeler Server or that the drive containing the data can be mapped from the server. To determine in which mode PASW Modeler is running on your machine, click Tools…Server Login (from within PASW Modeler) and see whether the Connection option is set to Local or Network. This dialog is shown below.


Figure 1.1 Server Login Dialog in PASW Modeler

Note Concerning Data for this Course

Data for this course are assumed to be stored in the folder c:\Train\ModelerPredModel. At SPSS® training centers, the data will be located in a folder of that name. Note that if you are running PASW Modeler in distributed (Server) mode (see note above), then the data should be copied to the server machine or the directory containing the data should be mapped from the server machine.

1.1 Introduction

Preparing data for modeling can be a lengthy but essential and extremely worthwhile task. If data are not cleaned and modified/transformed as necessary, it is doubtful that the models you build will be successful. In this lesson we will introduce a number of techniques that enable such data preparation.

We will begin with a brief discussion concerning the handling of blanks and cleaning of data, although this is covered in greater detail in the Introduction to PASW Modeler and Data Mining course.

Following this, we will introduce the concept of data balancing and how it is achieved within PASW Modeler. A number of data transformations will also be introduced as possible solutions to skewed data.

We will discuss how to create training and validation samples of the data automatically with the use of data partitioning.


1.2 Cleaning Data

In most cases, datasets contain problems or errors such as missing information, outliers, and/or spurious values. Before modeling begins, these problems should be corrected or at least minimized. The higher the quality of data used in data mining, the more likely it is that predictions or results are accurate.

PASW Modeler provides a number of ways to handle blank or missing information and several techniques to detect data irregularities. In this section we will briefly discuss an approach to data cleaning.

Note: If there is interest, the trainer may refer to the stream Dataprep.str located in the c:\Train\ModelerPredModel directory. This stream contains examples of the techniques detailed in the following paragraphs.

After the data have been read into PASW Modeler, and if necessary all relevant data sources have been combined, the first step in data cleaning is to assess the overall quality of the data. This often involves:

•  Using the Types tab of a source node or the Type node to fully instantiate data, usually achieved by clicking the Read Values button within the source or Type node, or by passing the data from a Type node into a Table node and allowing PASW Modeler to auto-type.

•  Flagging missing values (white space, null and value blanks) as blank definitions within a source node or the Type node.

•  Using the Data Audit node to examine the distribution and summary statistics (minimum, maximum, mean, standard deviation, number of valid records) for data fields.

Once the condition of the data has been assessed, the next step is to attempt to improve the overall quality. This can be achieved in a variety of ways:

•  Using the Generate menu from the Data Audit node's report, a Select node that removes records with blank fields can be automatically created (particularly relevant for a model's output field).

•  Fields with a high proportion of blank records can be filtered out using the Generate menu from the Data Audit node's report to create a Filter node.

•  Blanks can be replaced with appropriate values using the Filler node. Possible appropriate values within a continuous field can range from the average, mode, or median, to a value predicted using one of the available modeling techniques. In addition, missing values can be imputed by using the Data Audit node.

•  The Type node and Types tab in source nodes provide an automatic checking process that examines values within a field to determine whether they comply with the current measurement level and bounds settings. If they do not, fields with out-of-bound values can either be modified, or those records removed from passing downstream.

After these actions are completed, the data will have been cleaned of blanks and out-of-bounds values. It may also be necessary to use the Distinct node to remove any duplicate records.

Once the data file has been cleaned, you can then begin to modify it further so that it is suitable for the modeling technique(s) you plan to use.


1.3 Balancing Data

Once the data have been cleaned you should examine the distribution of the key fields you will be using in modeling, including the output field (if you are creating a predictive model). This is achieved most easily using the Data Audit node, but either the Distribution node (for categorical data), the Histogram node (for continuous data), or the Graphboard node (for either type) will produce charts for single fields.

If the distribution of a categorical target field is heavily skewed in favor of one of the categories, you may encounter problems when generating predictive models. For example, if only 3% of a mailing database have responded to a campaign, a neural network trained on this data might try to classify every individual as a non-responder to achieve 97% accuracy: great, but not very useful!

One solution to overcome this problem is to balance the data, which will overweight the less frequent categories. This can be accomplished with the Balance node, which works by either reducing the number of records in the more frequent categories, or boosting the records in the less frequent categories. It can be automatically generated from the distribution and histogram displays.

When balancing data we recommend using the reduce option in preference to the boosting option. Boosting duplicates records, and thus magnifies problems and irregularities, because a relatively few cases end up heavily weighted. However, when working with small datasets, reducing the data is often not feasible and boosting is the only sensible solution to imbalances within the data.
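The reduce strategy can be illustrated outside Modeler with a short pandas sketch (an assumption; the Balance node does this internally). Here every category of a target field is randomly downsampled to the size of the least frequent one; the field name and counts are invented, not the real churn.txt figures:

```python
import pandas as pd

# Invented stand-in for a skewed categorical target field.
df = pd.DataFrame({"CHURNED": ["Current"] * 10 + ["Vol"] * 4 + ["InVol"] * 2})

# Reduce-style balancing: randomly keep only as many records in each
# category as the least frequent category contains.
n_min = df["CHURNED"].value_counts().min()   # 2 in this toy data
balanced = df.groupby("CHURNED").sample(n=n_min, random_state=42)
```

The resulting sample has equal category counts, at the cost of discarding most of the majority-category records, which is exactly the trade-off the text describes.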

Note

A better solution than balancing data at this stage is to sample from the original dataset(s) to create a training file with a roughly equal number of cases in each category of the output field. The test datasets should, however, match the unbalanced population proportions for this field to provide a realistic test of the generated models.

The Partition node makes it easy to create training and validation data partitions from a single data file, but that node doesn't solve the problem of a skewed distribution for a field, as it can overweight one or more categories.

We will illustrate data balancing by examining the distribution of the field CHURNED within the file churn.txt. This field records whether the customer is current, a voluntary leaver, or an involuntary leaver (we attempt to predict this field in the lessons that follow).

Open the stream Cpm1.str (located in c:\Train\ModelerPredModel)
Run the Table node and familiarize yourself with the data
Close the Table window
Connect a Distribution node to the Type node
Edit the Distribution node and set the Field: to CHURNED
Click the Run button


Figure 1.2 Distribution of the CHURNED Field

The proportions of the three groups are rather unequal and data balancing may be useful when trying to predict this field.

This output can be used directly to create a Balance node, but first we must decide whether we wish to reduce or boost the current data. Reducing the data will drop over 73% of the records, but boosting the data will involve duplicating the involuntary leavers from 132 records to over 830. Neither of these methods is ideal, but in this case we choose to reduce the data to eliminate the magnification of errors.

Click Generate…Balance Node (reduce)
Close the Distribution plot window

A generated Balance node will appear in the Stream Canvas.

Drag the Balance node to the right of the Type node and connect it between the Type and Distribution nodes

Run the stream from the Distribution node

Figure 1.3 Distribution of the CHURNED Field after Balancing the Data

When balancing data it is advisable to enable a data cache on the Balance node to freeze the selected sample, because the Balance node randomly reduces or boosts the data, so a different sample will be selected each time the data are passed through the node.
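The reason caching works can be shown with any random sampler: fixing the randomness (here via a seed, a stand-in for caching; the data are invented) makes every pass select exactly the same records, whereas an unseeded pass would generally differ each time:

```python
import numpy as np

record_ids = np.arange(1000)

# Two passes with fresh generators sharing one seed select identical
# records -- the stability a data cache provides for the Balance node.
pass1 = np.random.default_rng(123).choice(record_ids, size=50, replace=False)
pass2 = np.random.default_rng(123).choice(record_ids, size=50, replace=False)
```

Without the shared seed (or a cache), each run would train and report on a slightly different sample, making results hard to reproduce.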


At this point the data are balanced and can be passed into a modeling node, such as the Neural Net node. Once the model has been built, it is important that the testing and assessment of the model be done on the unbalanced data.

Close the Distribution plot window

1.4 Numeric Data Transformations

When working with numeric data, the act of data balancing, as detailed above, is a rather drastic solution to the problem of skewed data and usually isn't appropriate. There are a variety of numerical transformations that provide a more sensible approach to this problem and that result in a flat or flatter distribution.

The Derive node can be used to produce such transformed fields within PASW Modeler. To determine which transformation is appropriate, we need to view the data using a histogram. We'll use the field LOCAL in this example, which measures the number of minutes of local calls per month.

Add a Histogram node to the stream
Connect the Histogram node to the Type node
Edit the Histogram node and select LOCAL in the Field list (not shown)
Run the node

Figure 1.4 Histogram of the LOCAL Field

This distribution has a strong positive skewness. This condition may lead to poor performance of a neural network predicting LOCAL since there is less information (fewer records) on those individuals with higher local usage. What we need is a transformation that inverts the original skewness, that is, skews it to the left. If we get the transformation correct, the data will become relatively balanced. When you transform data you normally try to create a normal distribution or a uniform (flat) distribution.


For our problem, the distribution of LOCAL closely follows that of a negative exponential, e^(-x), so the inverse is a logarithmic function. We will therefore try a transformation of the form ln(x + a), where a is a constant and x is the field to be transformed. We need to add a small constant because some of the records have values of 0 for LOCAL, and the log of 0 is undefined. Typically the value of a would be the smallest actual positive value in the data.
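As a rough sketch of this idea (not Modeler's CLEM; the helper names are our own), the transform and its inverse can be written as:

```python
import math

def log_transform(values):
    """log10(x + a), where a is the smallest actual positive value in the
    data -- the constant guards against taking the log of 0."""
    a = min(v for v in values if v > 0)
    return a, [math.log10(v + a) for v in values]

def back_transform(predicted, a):
    """Return to the original scale: raise 10 to the predicted value,
    then remove the constant."""
    return 10 ** predicted - a

local = [0.0, 2.5, 3.1, 8.0, 40.0, 120.0]   # hypothetical LOCAL minutes
a, logged = log_transform(local)            # a == 2.5 for these values
```

A round trip through the two functions recovers the original value, which is the check worth running before trusting predictions made on the transformed scale.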

Close the Histogram window
Add a Derive node from the Field Ops palette and connect the Type node to it
Edit the Derive node and set the Derive Field name to LOGLOCAL
Select Formula in the Derive As list
Enter log(LOCAL + 3) in the Formula text box (or use the Expression Builder)
Click on OK

Figure 1.5 Derive Node to Create LOGLOCAL

Connect the Derive node to the existing Histogram node
Edit the Histogram node and set the Field to LOGLOCAL
Click the Run button


Figure 1.6 Histogram of the Transformed LOCAL Field Using a Logarithmic Function

Although this distribution is not perfectly normal, it is a great improvement on the distribution of the original field.

Close the Histogram window

The above is a simple example of a transformation that can be used. Table 1.1 gives a number of other possible transformations you may wish to try when transforming data, together with their CLEM expressions.

Table 1.1 Possible Numerical Transformations

Transformation            CLEM Expression
e^x                       exp(x), where x is the name of the field to be transformed
ln(x + a)                 log(x + a), where a is a numerical constant
ln((x - a) / (b - x))     log((x - a) / (b - x)), where a and b are numerical constants
log(x + a)                log10(x + a)
sqrt(x)                   sqrt(x)
1 / e^(mean(x) - x)       1 / exp(@GLOBAL_AVE(x) - x), where @GLOBAL_AVE is the average of the field x, set using the Set Globals node in the Output palette
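The ln((x - a) / (b - x)) entry is useful for fields bounded between two known limits a and b. A minimal sketch (our own helper; values must lie strictly inside the interval):

```python
import math

def bounded_log(x, a, b):
    """ln((x - a) / (b - x)): stretches a field bounded between a and b,
    pushing values near either limit toward -inf / +inf."""
    return math.log((x - a) / (b - x))

# e.g., a percentage bounded between 0 and 100: the midpoint maps to 0
print(bounded_log(50, 0, 100))   # → 0.0
```

Values above the midpoint map to positive numbers, values below it to negative ones, which flattens pile-ups near either boundary.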

Note

Because the original field LOCAL has been transformed, predictions from a model will be made in the log of that field. To transform back to the original scale, you need to raise the base to the predicted value: base 10 for the standard log or base e for the natural log (e.g., 10^(predicted value)). So, for example, if the model predicts a value of 1.4 for LOGLOCAL, that is actually 10^1.4, or 25.12, for LOCAL (or LOCAL + constant).

1.5 Binning Data Values

Another method of transforming a continuous field involves modifying it to create a new categorical field (flag, nominal, ordinal) based on the original field's values. For example, you might wish to group age into a new field based on fixed width categories of 5 or 10 years. Or, you might wish to transform income into a new field based on the percentiles (based on either the count or sum) of income (e.g., quartiles, deciles).

This operation is labeled binning in PASW Modeler, since it takes a range of data values and collapses them into one bin where they are all given the same data value. It is certainly true that binning data loses some information compared to the original distribution. On the other hand, you often gain in clarity, and binning can overcome some data distribution problems, including skewness. Moreover, oftentimes there is interest in looking at the effect of a predictor at natural cutpoints (e.g., one standard deviation above the mean). In addition, when performing data understanding, it might be easier to view the relationship between two or more continuous fields if at least one is binned.

Binning can be performed with bins based on fixed widths, percentiles, the mean and standard deviation, or ranks.

We can use the original field LOCAL to show an example of binning. We know this field is highly positively skewed, and it has many distinct values. Let's group the values into five bins by requesting binning by quintiles, and then examine the relationship of the binned field to CHURNED. The Binning node is located in the Field Ops palette.

Add a Binning node to the stream near the Type node
Connect the Type node to the Binning node
Edit the Binning node and set the Bin fields to LOCAL
Click the Binning method dropdown and select the Tiles (equal count) method

Click the Quintile (5) check box

By default, a new field will be created from the original field name with the suffix _TILEN, where N stands for the number of bins to be created (here five). Percentiles can be based on the record count (in ascending order of the value of the bin field, which is the standard definition of percentiles), or on the sum of the field.
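The Tiles (equal count) method can be illustrated with a small sketch. This is not Modeler's algorithm (its tie-handling options are more involved); it simply ranks the values and splits the sorted order into equal-count bins:

```python
def tile_bins(values, n_tiles=5):
    """Assign each value a bin 1..n_tiles with (approximately) equal record
    counts per bin, in the spirit of the Binning node's Tiles method."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * n_tiles // n + 1   # rank position decides the tile
    return bins

local = [1, 4, 2, 9, 7, 5, 12, 30, 22, 16]
print(tile_bins(local))   # → [1, 2, 1, 3, 3, 2, 4, 5, 5, 4]
```

With ten records and five tiles, each bin receives exactly two records, regardless of how skewed the raw values are; that is what makes equal-count binning attractive for a field like LOCAL.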


Figure 1.7 Completed Binning Node to Group LOCAL by Quintiles

The Bin Values tab allows you to view the bins that have been created and their upper and lower limits. However, understandably, information on generated bins is not available until the node has

 been run in order to allow the thresholds to be determined.

Click OK

To study the relationship between binned LOCAL (LOCAL_TILE5) and CHURNED, we could use a Matrix node, since both fields are categorical, but we can also use a Distribution node, which will be our choice here.

Add a Distribution node to the stream and attach it to the Binning node
Edit the Distribution node and select LOCAL_TILE5 as the Field
Select CHURNED as the Overlay field
Click the Normalize by color checkbox (not shown)
Click Run


Figure 1.8 Distribution of CHURNED by Binned LOCAL

There is an interesting pattern apparent. Essentially all the involuntary churners are in the first quintile of LOCAL_TILE5 (notice how the number of cases in each category is almost exactly the same). Perhaps we got lucky when specifying quintiles as the binning technique, but we have found a clear pattern that might not have been evident if LOCAL had not been binned.

We would next wish to know what the bounds are on the first quintile, and to see that we need to edit the Binning node.

Close the Distribution plot window
Edit the Binning node for LOCAL
Click the Bin Values tab
Select 5 from the Tile: menu


Figure 1.9 Bin Thresholds for LOCAL

We observe that the upper bound for Bin 1 is 10.38 minutes. That means that the involuntary churners essentially all made less than 10.38 minutes of local calls, since they all fall into this bin (quintile).

Given this finding, we might decide to use the binned version of LOCAL in modeling, or try two models, one with the original field and then one with the binned version.

1.6 Data Partitioning

Models that you build (train) must be assessed with separate testing data that was not used to create the model. The training and testing data should be created randomly from the original data file. They can be created with either a Derive or Sample node, but the Partition node allows greater flexibility.

With the Partition node, PASW Modeler has the capability to directly create a field that can split records between training, testing (and validation) data files. Partition nodes generate a partition field that splits the data into separate subsets or samples for the training and testing stages of model building. When using all three subsets, the model is built with the training data, refined with the testing data, and then tested with the validation data.

The Partition node creates a categorical field with the role automatically set to Partition. The set field 

will either have two values (corresponding to the training and testing files), or three values (training,

testing, and validation).

PASW Modeler model nodes have an option to enable partitioning, and they will recognize a field with role “partition” automatically (as will the Evaluation node). When a generated model is created,

predictions will be made for records in the testing (and validation) samples, in addition to the training records. Because of this capability, the use of the Partition node makes model assessment more efficient.


To illustrate the use of data partitioning, we will create a partition field for the churn data with two values, for training and testing. Although the Partition node assists in selecting records for training and testing, its output is a new field, and so it can be found in the Field Ops palette.

Add a Partition node to the stream and connect the Type node to it
Edit the Partition node

The name of the partition field is specified in the Partition field text box. The Partitions choice allows you to create a new field with either 2 or 3 values, depending on whether you wish to create 2 or 3 data samples.

The size of the files is specified in the partition size text boxes. Size is relative and given in percentages (which do not have to add to 100%). If the sum of the partition sizes is less than 100%, the records not (randomly) included in a partition will be discarded.
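The Partition node's behavior can be approximated as follows. This is a sketch with invented names, not Modeler's code, and the value labels 1_Training / 2_Testing are illustrative:

```python
import random

def add_partition_field(records, train_pct=70, test_pct=30, seed=999):
    """Append a Partition value to each record. Sizes are relative
    percentages; records falling outside the partitions are discarded
    (as happens when the percentages sum to less than 100)."""
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    total = train_pct + test_pct
    kept = []
    for rec in records:
        draw = rng.uniform(0, 100)
        if draw < train_pct:
            kept.append({**rec, "Partition": "1_Training"})
        elif draw < total:
            kept.append({**rec, "Partition": "2_Testing"})
        # draws >= total are discarded
    return kept

data = [{"id": i} for i in range(1000)]
split = add_partition_field(data)
n_train = sum(r["Partition"] == "1_Training" for r in split)
# n_train lands close to 700: roughly the 70/30 split requested
```

Because assignment is random per record, the realized split only approximates 70/30, which is why the Distribution check below shows "close to" rather than exactly 70/30.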

The Generate menu allows you to create Select nodes that will select records in the training, testing,and validation samples.

We'll change the size of the training and testing partitions, and set a random seed so our results are comparable.

Figure 1.10 Partition Node Settings


Change the Training partition size: to 70
Change the Testing partition size: to 30
Change the Seed value to 999 (not shown)
Click OK
Attach a Distribution node to the Partition node
Edit the Distribution node and select Partition in the Field list
Run the Distribution node

Figure 1.11 Distribution of the Partition Field

The new field Partition has close to a 70/30 distribution. It can now be directly used in modeling as described above, or separate files can be created with use of the Select node. We will use the partition field in a later lesson, so we'll save the stream.

Close the Distribution window
Click on File…Save Stream As
Save the stream with the name Lesson1_Partition

1.7 Anomaly Detection

Data mining usually involves very large data files, sometimes with millions of records. In such situations, we may not be concerned about whether some records are odd or unusual based on how they compare to the bulk of records in the file. Odd cases, unless they are relatively frequent (and then they can hardly be labeled "unusual"), will not cause problems to most algorithms when we try to predict some outcome.

For analysts with smaller data files, though, anomalous records can be a concern, as they can distort the outcomes of a modeling process. The most salient example of this comes from classical statistics, where regression, and other methods that fall under the rubric of the General Linear Model, can be strongly affected by outliers and deviant points.

PASW Modeler includes an Anomaly node that searches for unusual cases in an automatic manner. Anomaly detection is an exploratory method designed for the quick detection of unusual cases or records that should be candidates for further analysis. These should be regarded as suspected anomalies, which, on closer examination, may or may not turn out to be real concerns. You may find that a record is perfectly valid but choose to screen it from the data for purposes of model building. Alternatively, if the algorithm repeatedly turns up false anomalies, this may point to an error in the data collection process.


The procedure is based on clustering the data using a set of user-specified fields. A case that is deviant compared to the norms (distributions) of all the cases in that cluster is deemed anomalous.

The procedure helps you quickly detect unusual cases during data exploration before you begin modeling. It is important to note that the definition of an anomalous case is statistical and not particular to any specific industry or application, such as fraud in the finance or insurance industry (although it is possible that the technique might find such cases).

Clustering is done using the TwoStep cluster routine (also available in the TwoStep node). In addition to clustering, the Anomaly node scores each case to identify its cluster group, creates an anomaly index to measure how unusual it is, and identifies which fields contribute most to the anomalous nature of the case.
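To make the idea concrete, here is a much-simplified stand-in for the anomaly index. The real node clusters with TwoStep and uses its own index formula; the cluster assignments, field names, and scoring below are invented for illustration:

```python
from statistics import mean, pstdev

def anomaly_scores(records, clusters, fields):
    """For each record, measure how far it sits from its cluster's field
    norms; report a simple index plus the field contributing most."""
    # per-cluster mean and standard deviation for every field
    norms = {}
    for c in set(clusters):
        members = [r for r, k in zip(records, clusters) if k == c]
        norms[c] = {f: (mean(r[f] for r in members),
                        pstdev(r[f] for r in members) or 1.0)
                    for f in fields}
    scores = []
    for rec, c in zip(records, clusters):
        devs = {f: abs(rec[f] - norms[c][f][0]) / norms[c][f][1]
                for f in fields}
        index = mean(devs.values())          # crude anomaly index
        top_field = max(devs, key=devs.get)  # biggest contributor
        scores.append((index, top_field))
    return scores

recs = [{"longmon": 10, "local": 5}, {"longmon": 12, "local": 6},
        {"longmon": 11, "local": 5}, {"longmon": 90, "local": 6}]
scores = anomaly_scores(recs, [0, 0, 0, 0], ["longmon", "local"])
worst = max(range(len(scores)), key=lambda i: scores[i][0])
# the fourth record stands out, driven by longmon
```

The key output mirrors what the node reports per record: an index of how unusual the case is, and the fields responsible, which is exactly what makes the flagged cases reviewable.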

We'll use a new data file to demonstrate the Anomaly node's operation. The file, customer_dbase.sav, is a richer data file that is also from a telecommunications company. It has an outcome field churn which measures whether a customer switched providers in the last month. There is no target field for anomaly detection, but in most instances you will want to use the same set of fields in the Anomaly node that you plan to use for modeling. There is an existing stream file we can use for this example. The Anomaly node is found in the Modeling palette since it uses the TwoStep clustering routine.

Click File…Open Stream
Double-click on Anomaly_FeatureSelect.str in the c:\Train\ModelerPredModel directory
Run the Table node and view the data
Close the Table window
Place an Anomaly node in the stream and connect it to the Type node
Edit the Anomaly node, and then click the Fields tab

Figure 1.12 Anomaly Node Fields Tab

You will typically specify exactly which fields should be used to search for anomalous cases. In these

data, there are several fields that measure various aspects of the customer’s account, and we want to


use all these here (there are also demographic fields, but in the interests of keeping this example relatively simple, we will restrict somewhat the number and type of fields used).

Click the Use custom settings button
Click the Field chooser button, and select all the fields from longmon to ebill (they are contiguous)
Click OK
Click the Model tab

Figure 1.13 Anomaly Node Model Settings

By default, the procedure will use a cutoff value that flags 1% of the records in the data. The cutoff is

included as a parameter in the model being built, so this option determines how the cutoff value is set for modeling but not the actual percentage of records to be flagged during scoring. Actual scoring

results may vary depending on the data.

The Number of anomaly fields to report specifies the number of fields to report as an indication of why a particular record is flagged as an anomaly. The most anomalous fields are defined as those that show the greatest deviation from the field norm for the cluster to which the record is assigned.

We’ll use the defaults for this example.

Click Run 

Right-click on the Anomaly model in the Models Manager, and select Browse
Click the Expand All button


Figure 1.14 Browsing Anomaly Generated Model Results

We see that three clusters (labeled "Peer Groups") were created automatically (although we didn't view the Expert options, the default number of clusters to be created is set between 1 and 15). In the first cluster there are 1267 records, and 18 have been flagged as anomalies (about 1.4%, close to the 1% cutoff value). The Model browser window doesn't tell us which cases are anomalous in this cluster, but it does provide a list of fields that contributed to defining one or more cases as anomalous. Of the 18 records identified by the procedure, 16 are anomalous on the field lnwireten (the log of wireless usage over tenure in months [time as a customer]). This was a derived field created earlier in the data exploration process. The average contribution to the anomaly index from lnwireten is .275. This value should be used in a relative sense in comparison to the other fields.

To see information for specific records we use the generated Anomaly model on the stream canvas.

We will sort the records by the $O-AnomalyIndex field, which contains the index values.

Add a Sort node from the Record Ops palette to the stream and connect the Anomaly generated model node to the Sort node
Edit the Sort node and select the field $O-AnomalyIndex as the sort field
Change the Sort Order to Descending


Figure 1.15 Sorting Records by Anomaly Index

Click OK
Connect a Table node to the Sort node
Run the Table node

Figure 1.16 Records Sorted by Anomaly Index with Fields Generated by Anomaly Model

For each record, the model creates 9 new fields. The field $O-PeerGroup contains the cluster membership. The next six fields contain the top three fields that contributed to this record being an anomaly and the contribution of that field to the anomaly index (we can request fewer or more fields on which to report in the Anomaly node Model tab). Thus we see that the three most anomalous cases, with an anomaly index of 5.0, all are in cluster 2. The first two of these are most deviant on longmon and longten.

Knowing which fields made the greatest contribution to the anomaly index allows you to more easilyreview the data values for these cases. You don’t need to look at all the fields, but instead can


concentrate on specific fields detected by the model for that case. In the interests of time, we won't take this next step here, but you might want to try this in the exercises.

What we can briefly show are the options available when an Anomaly generated model is added to

the stream.

Close the Table window
Edit the Anomaly generated model node in the stream
Click on the Settings tab

Figure 1.17 Settings Tab Options for Anomaly Generated Models

Note in particular that in large files, there is an option available to discard non-anomalous records, which will make investigating the anomalous records much easier. Also, you can change the number of fields on which to report here.

Close the Anomaly model Browser window

1.8 Feature Selection for Models

Just as data files can have many records in data-mining problems, there are often hundreds, or thousands, of potential fields that can be used as predictors. Although some models can naturally use many fields—decision trees, for example—others cannot or are inefficient, at best, with too many fields. As a result, you may have to spend an inordinate amount of time to examine the fields to decide which ones should be included in a modeling effort.

To shortcut this process and narrow the list of candidate predictors, the Feature Selection node can identify the fields that are most important—most highly related—to a particular target/outcome


field. Reducing the number of fields required for modeling will allow you to develop models more quickly, but also permit you to explore the data more efficiently.

Feature selection has three steps:

1)  Screening: In this first step, fields are removed that have too much missing data, too little variation, or too many categories, among other criteria. Also, records are removed with excessive missing data.

2)  Ranking: In the second step, each predictor is paired with the target and an appropriate test of 

the bivariate relationship between the two is performed. This can be a matrix for categorical fields or a Pearson correlation coefficient if both fields are continuous. The probability values from these bivariate analyses are turned into an importance measure by subtracting the p value of the test from 1 (thus a low p value leads to an importance near 1). The predictors are then ranked on importance.

3)  Selecting: In the final step, a subset of predictors is identified to use in modeling. The number of predictors can be identified automatically by the model, or you can request a specific

number.
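The ranking and selecting steps can be sketched in a few lines, assuming the per-field p values have already been computed by the appropriate bivariate test (the field names and p values below are invented):

```python
def rank_by_importance(p_values, top_n=None):
    """Ranking step: importance = 1 - p, so a low p value yields an
    importance near 1. Sort descending; optionally keep only the top N
    fields (the selecting step)."""
    ranked = sorted(p_values, key=lambda f: 1 - p_values[f], reverse=True)
    return ranked[:top_n] if top_n else ranked

p_vals = {"tenure": 0.0002, "region": 0.40, "ebill": 0.03}
print(rank_by_importance(p_vals))           # → ['tenure', 'ebill', 'region']
print(rank_by_importance(p_vals, top_n=2))  # → ['tenure', 'ebill']
```

Since importance is a monotone function of the p value, ranking on importance is the same as ranking on p; the transformation just puts "important" fields near 1 for easier reading.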

Feature selection is also located in the Modeling palette and creates a generated model node. Thisnode, though, does not add predictions or other derived fields to the stream. Instead, it acts as a filter node, removing unnecessary fields downstream (with parameters under user control).

We'll try feature selection on the customer database file. Note that although we are using feature selection after demonstrating anomaly detection, you may want to use these two in combination. For example, you can first use feature selection to identify important fields. Then you can use anomaly detection to find unusual cases on only those fields.

Add a Feature Selection node to the stream and connect it to the Type node
Edit the Feature Selection node and click the Fields tab
Click the Use custom settings button
Select churn as the Target field (not shown)
Select all the fields from region to news (near the bottom) as Inputs (be careful not to select churn again)

Click the Model tab


Figure 1.18 Model Tab for Feature Selection to Predict Churn

By default fields will initially be screened based on the various criteria listed in the Model tab. A field can have no more than 70% missing data (which is rather generous, and you may wish to modify this value). There can be no more than 90% of the records with the same value, and the minimum coefficient of variation (standard deviation/mean) is 0.1. All of these are fairly liberal standards.
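As a sketch of these screening rules (our own helper, with thresholds copied from the defaults above; not Modeler's implementation):

```python
from statistics import mean, pstdev

def screen_fields(columns, max_missing=0.70, max_single=0.90, min_cv=0.1):
    """Screening step of feature selection: drop a field when more than
    70% of values are missing, more than 90% share one value, or a numeric
    field's coefficient of variation (std/mean) falls below 0.1."""
    kept = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) < (1 - max_missing) * len(values):
            continue                      # too much missing data
        top_share = max(present.count(v) for v in set(present)) / len(present)
        if top_share > max_single:
            continue                      # too little variation
        if all(isinstance(v, (int, float)) for v in present):
            m = mean(present)
            if m and pstdev(present) / abs(m) < min_cv:
                continue                  # coefficient of variation too small
        kept.append(name)
    return kept

cols = {"tenure": [1, 5, 9, 30, 2, 40],
        "mostly_missing": [None, None, None, None, None, 3],
        "constant": [7, 7, 7, 7, 7, 7]}
print(screen_fields(cols))   # → ['tenure']
```

The two junk columns are rejected before any relationship with the target is ever tested, which is exactly why screening makes the later ranking step cheaper.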

Click the Options tab

Figure 1.19 Options for Feature Selection

After being ranked, fields will be selected based on importance, and only those deemed Important will be selected in the model. This can be changed to select the top N fields, by ranking of importance, or


 by selecting all fields that meet a minimum level of importance. Four options are available for determining the importance of categorical predictors, with the default being the Pearson chi-square

value.

We will use all default settings for these data.

Click Run
Right-click on the churn Feature Selection generated model and select Browse

Figure 1.20 Feature Selection Browser Window

We selected 127 potential predictors. Seven were rejected in the screening stage because of too much missing data or too little variation. Of the remaining 120 fields, the model selected 63 as being important, so it has reduced our tasks of data review and model building considerably. The model


ranked the fields by importance (importance is rounded off to a maximum value of 1.000). If youscroll down the list of fields in the upper pane, you will eventually see fields with low values of 

importance that are unrelated to churn. All fields with their box checked will be passed downstream if this node is added to a data stream.

The set of important fields includes a mix, with some demographic (age, employ), account-related 

(tenure, ebill), and financial status (cardtenure) types.

From here, the generated Feature Selection model in the stream will filter out the unimportant fields.

Note

When using the Feature Selection node, it is important to understand its limitations. First, importance of a relationship is not the same thing as the strength of a relationship. In data mining, the large data files used allow very weak relationships to be statistically significant. So just because a field has an importance value near 1 does not guarantee that it will be a good predictor of some target field. Second, nonlinear relationships will not necessarily be detected by the tests used in the Feature Selection node, so a field could be rejected yet have the potential of being a good predictor (this is especially true for continuous predictors).


Summary Exercises

A Note Concerning Data Files

In this training guide files are assumed to be located in the c:\Train\ModelerPredModel directory.

The exercises in this lesson use the data file churn.txt. The following table provides details about the file.

churn.txt contains information from a telecommunications company. The data comprise customers who at some point have purchased a mobile phone. The primary interest of the company is to understand which customers will remain with the organization or leave for another company.

The file contains the following fields:

ID                    Customer reference number
LONGDIST              Time spent on long distance calls per month
International         Time spent on international calls per month
LOCAL                 Time spent on local calls per month
DROPPED               Number of dropped calls
PAY_MTHD              Payment method of the monthly telephone bill
LocalBillType         Tariff for locally based calls
LongDistanceBillType  Tariff for long distance calls
AGE                   Age
SEX                   Gender
STATUS                Marital status
CHILDREN              Number of children
Est_Income            Estimated income
Car_Owner             Car owner
CHURNED               (3 categories):
                      Current – Still with company
                      Vol – Leavers who the company wants to keep
                      Invol – Leavers who the company doesn't want

In these exercises we will perform some exploratory analysis on the Churn.txt data file and prepare these data so that they are ready for modeling.

1.  Read the file c:\Train\ModelerPredModel\Churn.txt—this file is comma delimited and includes field names—using a Var. File node. Browse the data and familiarize yourself with

the data structure within each field.

2.  Check to see if there are blanks (missing values) within the data; if you find any problems,decide how you wish to deal with these and take appropriate steps.

3.  Look at the distribution of the CHURNED field. This field probably requires balancing. Try“boosting” the data to balance the field, since we used reducing in the lesson.

4.  If you think that both of these methods are too harsh (either in terms of duplicating data toomuch or reducing data so there are too few cases), edit the balance node and see if you can

find a way of reducing the impact of balancing.


5.  If you are going to use these data for modeling, do you wish to cache this node?

6.  Use the Data Audit node to look at the distribution of some of the fields that will be used as

inputs. Does the distribution of these fields appear appropriate? If not, try to find a transformation that may help the modeling process. (Note: The instructor may have already spoken about the field LOCAL—you may want to transform this field, as discussed in the lesson).

7.  Look at the field International. Do you think this field will need transforming or binning? Can you find a transformation that helps with this field? If not, why do you think this is?

8.  Think about whether there are potentially any other fields that could be derived from existing data that may help out with the modeling process. If so, create those fields.

9.  Try using the Anomaly node on these data to detect unusual records. Don’t use the field 

CHURNED. Do you find any commonalities among most of the anomalous records?

10.  If you have made any data transformations, balanced the data, or derived any fields, you maywant to create a Supernode that reduces the size of your current stream.

11. Save your stream as Exer1.str .

12. For those with extra time. Use the Anomaly node to detect anomalous cases in the

customer_dbase.sav file, as we did in the lesson. Then add the generated Anomaly node tothe stream and investigate these unusual cases in more detail. Would you retain them for modeling, or not? Why?

13. For those with more extra time. Use the Data Audit node or other methods to search for 

outlier data values on continuous fields. If you find some, what might be done to reduce their 

impact on modeling?
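The "boosting" asked for in step 3 amounts to duplicating records in the rarer categories of the balance field. A minimal pure-Python sketch of the idea (the toy data, counts, and category labels below are invented for illustration, not the actual Churn.txt contents):

```python
from collections import Counter

def boost_balance(records, field):
    """Duplicate records in the smaller categories so each category
    roughly matches the largest one (the 'boosting' style of balancing)."""
    counts = Counter(r[field] for r in records)
    target = max(counts.values())
    balanced = []
    for r in records:
        # integer boost factor for this record's category
        balanced.extend([r] * (target // counts[r[field]]))
    return balanced

# Toy data: 8 records in the majority category, 2 in the minority
data = [{"CHURNED": "Current"}] * 8 + [{"CHURNED": "Vol"}] * 2
out = boost_balance(data, "CHURNED")
```

Reducing is the mirror image: the larger categories are sampled down rather than the smaller ones duplicated, and step 4's gentler balancing corresponds to using factors partway between these extremes and 1.0.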


Lesson 2: Data Reduction: Principal Components

Objectives

•  Review principal components analysis, a technique used to perform data reduction prior to modeling

•  Run a principal components analysis on a dataset of waste production

Data

We use a file containing information about the amount of solid waste in thousands of tons (WASTE) in various locations along with information about land use, including number of acres used for industrial work (INDUST), fabricated metals (METALS), trucking and wholesale trade (TRUCKS), retail trade (RETAIL), and restaurants and hotels (RESTRNTS). The data set appears in Chatterjee and Hadi (1988, Sensitivity Analysis in Linear Regression. New York: Wiley).

2.1  Introduction

Although it is used as an analysis technique in its own right, in this lesson we discuss principal components primarily as a data reduction technique in support of statistical predictive modeling (for example, regression or logistic regression) and clustering.

We first review the role of principal components and factor analysis in segmentation and prediction studies, and then discuss what to look for when running these techniques. Some background principles will be covered along with comments about popular factor methods. We provide some overall recommendations. We will perform a principal components analysis on a set of fields recording different types of land usage, all of which are to be used to predict the amount of waste produced from that land.

2.2  Use of Principal Components for Prediction Modeling and Cluster Analyses

In the areas of segmentation and prediction, principal components and factor analysis typically serve in the ancillary role of reducing the many fields available to a core set of composite fields (components or factors) that are used by cluster, regression or logistic regression. These techniques, though, can also be used for classical data mining methods, including neural networks and Bayesian networks.

Statistical prediction models such as regression, logistic regression, and discriminant analysis, when run with highly correlated input fields can produce unstable coefficient estimates (the problem of near multicollinearity). In these models, if any input field can be almost or perfectly predicted from a linear combination of the other inputs (near or pure multicollinearity), the estimation will either fail or be badly in error. Prior data reduction using factor or principal components analysis is one approach to reducing this risk.
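A small numpy sketch (on simulated data, outside Modeler) of the numeric symptom: when one input is nearly a copy of another, the matrix that least squares must invert is close to singular, while standardized principal component scores of the same inputs are orthogonal and well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # x2 is almost a copy of x1
X = np.column_stack([x1, x2])

# Least squares inverts X'X; a huge condition number means small data
# changes swing the coefficient estimates wildly.
cond_raw = np.linalg.cond(X.T @ X)

# Standardized principal component scores are uncorrelated, so the
# corresponding matrix is (numerically) a multiple of the identity.
Z = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ vt.T
scores /= scores.std(axis=0)
cond_pca = np.linalg.cond(scores.T @ scores)
```

The data reduction step in practice goes further: the small components are dropped entirely, not just orthogonalized.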

Although we have described this problem in the context of statistical prediction models, neural network coefficients can become unstable under these circumstances. However, since the interpretation of neural network coefficients is relatively rarely done, this issue is less prominent.


However, neural network training time increases with more inputs, so reducing the number of inputs while retaining important variable information is normally a good practice.

Bayesian networks that model the conditional probabilities among a set of fields function best, and are much easier to interpret, with a relatively small number of inputs, so data reduction can be a useful step before modeling here as well.

Rule induction methods will run when predictors are highly related. However, if two continuous predictors are highly correlated and have about the same relationship to the target, then the predictor with the slightly stronger relationship to the target will enter into the model. The other predictor is unlikely to enter into the model, since it contributes little in addition to the first predictor. While this may be adequate from the perspective of accurate prediction, the fact that the first field entered the model, while the second didn't, could be taken to mean that the first was important and the second was not. However, if the first were removed, the second predictor would have performed nearly as well. Such relationships among inputs should be revealed as part of the data understanding and data preparation step of a data mining project. If this were not done, or if it were done inadequately, then the data reduction performed by principal components or factor analysis might be necessary (for statistical methods) and helpful (for both statistical and machine learning methods).

In some surveys done for segmentation purposes, dozens of customer attitude measures or product attribute ratings may be collected. Although cluster analysis can be run using a large number of cluster fields, two complications can develop. First, if several fields measure the same or very similar characteristics and are included in a cluster analysis, then what they measure is weighted more heavily in the analysis. For example, suppose a set of rating questions about technical support for a product is used in a cluster analysis with other unrelated questions. Since distance calculations used in the PASW Modeler clustering algorithms are based on the differences between observations on each field, then other things being equal, the set of related items would carry more weight in the analysis. To exaggerate to make a point, if two fields were identical copies of each other and both were used in a cluster analysis, the effect would be to double the influence of what they measure. In practice you rarely ask the same number of rating questions about each attribute (or psychographic) area. So principal components and factor analysis are used to either explicitly combine the original input fields into independent composite fields, to guide the analyst in constructing subscales, or to aid in selection of representative sets of fields (some analysts select three fields strongly related to each factor or component to be used in cluster analysis). Clustering is then performed on these fields.

A second reason factor or principal components might be run prior to clustering is for conceptual clarity and simplification. If a cluster analysis were based on forty fields it would be difficult to look at so large a table of means or a line chart and make much sense of them. As an alternative, you can perform rule induction to identify the more influential fields and summarize those. If factor or principal components analysis is run first, then the clustering is based on the themes or concepts measured by the factors or components. Or, as mentioned above, clustering can be done on equal-sized sets of fields, where each set is based on a factor. If the factors (components) have a ready interpretation, it can be much easier to understand a solution based on five or six factors, compared to one based on forty fields. As you might expect, factor and principal components analyses are more often performed on "soft" measures (attitudes, beliefs, and attribute ratings) and less often on behavioral measures like usage and purchasing patterns.

Keep in mind that factor and principal components analysis are considered exploratory data techniques (although there are confirmatory factor methods; for example, Amos™ can be used to test specific factor models). So as with cluster analysis, do not expect a definitive, unassailable answer.


When deciding on the number and interpretation of factors or components, domain knowledge of the data, common sense, and a dose of hard thinking are very valuable.

2.3  What to Look for When Running Principal Components or Factor Analysis

There are two main questions that arise when running principal components and factor analysis: how many (if any) components are there, and what do they represent? Most of our effort will be directed toward answering them. These questions are related because, in practice, you rarely retain factors or components that you cannot identify and name. Although the naming of components has rarely stumped a creative researcher for long, which has led to some very odd-sounding "components," it is accurate enough to say that interpretability is one of the criteria when deciding to keep or drop a component. When choosing the number of components, there are some technical aids (eigenvalues, percentage of variance accounted for) we will discuss, but they are guides and not absolute criteria.

To interpret the components, a set of coefficients, called loadings or lambda coefficients, relating the components (or factors) to the original fields, are very important. They provide information as to which components are highly related to which fields and thus give insight into what the components represent.

2.4  Principles

Factor analysis operates (and principal components usually operates) on the correlation matrix relating the continuous fields to be analyzed. The basic argument is that the fields are correlated because they share one or more common components, and if they didn't correlate there would be no need to perform factor or component analysis. Mathematically, a one-factor (or component) model for three fields can be represented as follows (Vs are fields (or variables), F is a factor (or component), and Es represent error variation that is unique to each field, uncorrelated with the F component and the E components of the other variables):

V1 = L1*F1 + E1
V2 = L2*F1 + E2
V3 = L3*F1 + E3

Each field is composed of the common factor (F1) multiplied by a loading coefficient (L1, L2, L3, the lambdas) plus a unique or random component. If the factor were measurable directly (which it isn't) this would be a simple regression equation. Since these equations can't be solved as given (the Ls, Fs and Es are unknown), factor and principal components analysis take an indirect approach. If the equations above hold, then consider why fields V1 and V2 correlate. Each contains a random or unique component that cannot contribute to their correlation (Es are assumed to have 0 correlation). However, they share the factor F1, and so if they correlate the correlation should be related to L1 and L2 (the factor loadings). When this logic is applied to all the pairwise correlations, the loading coefficients can be estimated from the correlation data. One factor may account for the correlations between the fields, and if not, the equations can be easily generalized to accommodate additional factors. There are a number of approaches to fitting factors to a correlation matrix (least squares, generalized least squares, maximum likelihood), which has given rise to a number of factor methods.
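The logic can be checked by simulation: under a one-factor model with unit-variance fields, the correlation between any two fields equals the product of their loadings. A numpy sketch with invented loadings:

```python
import numpy as np

# One-factor model: V_i = L_i*F + E_i, with var(F) = 1 and each E_i
# scaled so every V_i has variance 1. Loadings are invented.
rng = np.random.default_rng(42)
n = 100_000
L = np.array([0.8, 0.6, 0.5])
F = rng.normal(size=n)                           # common factor
E = rng.normal(size=(n, 3)) * np.sqrt(1 - L**2)  # unique components
V = F[:, None] * L + E

# The model implies corr(V1, V2) = L1*L2 = 0.48 and corr(V1, V3) = 0.40
r12 = np.corrcoef(V[:, 0], V[:, 1])[0, 1]
r13 = np.corrcoef(V[:, 0], V[:, 2])[0, 1]
```

Factor estimation runs this reasoning in reverse: it starts from the observed correlations and solves for loadings that reproduce them.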

What is a factor? In market research, factors are usually taken to be underlying traits, attitudes or beliefs that are reflected in specific rating questions. You need not believe that factors or components actually exist in order to perform a factor analysis, but in practice the factors are usually interpreted, given names, and generally spoken of as real things.

2.5  Factor Analysis versus Principal Components Analysis

Within the general area of data reduction there are two highly related techniques: factor analysis and principal components analysis. They can both be applied to correlation matrices with data reduction as a goal. They differ in a technical way having to do with how they attempt to fit the correlation matrix. We will pursue the distinction since it is relevant to which method you choose. The diagram below is a correlation matrix composed of five continuous fields.

Figure 2.1 Correlation Matrix of Five Continuous Fields

Principal components analysis attempts to account for the maximum amount of variation in the set of fields. Since the diagonal of a correlation matrix (the ones) represents standardized variances, each principal component can be thought of as accounting for as much as possible of the variation remaining in the diagonal. Factor analysis, on the other hand, attempts to account for correlations between the fields, and therefore its focus is more on the off-diagonal elements (the correlations). So while both methods attempt to fit a correlation matrix with fewer components or factors than fields, they differ in what they focus on when fitting. Of course, if a principal component accounts for most of the variance in fields V1 and V2, it must also account for much of the correlation between them. And if a factor accounts for the correlation between V1 and V2, it must account for at least some of their (common) variance. Thus, there is definitely overlap in the methods and they usually yield similar results. Often factor is used when there is interest in studying relations among the fields, while principal components is used when there is a greater emphasis on data reduction and less on interpretation. However, principal components is very popular because it can run even when the data are multicollinear (one field can be perfectly predicted from the others), while most factor methods cannot. In data mining, since data files often contain many fields likely to be multicollinear or near multicollinear, principal components is used more often. This is especially the case if statistical modeling methods, which will not run with multicollinear predictors, are used. Both methods are available in the PCA/Factor node; by default, the principal components method is used.

2.6  Number of Components

When factor or principal components analysis is run there are several technical measures that can guide you in choosing a tentative number of factors or components. The first indicator would be the eigenvalues. Eigenvalues are fairly technical measures, but in principal components analysis, and some factor methods (under orthogonal rotations), their values represent the amount of variance in the input fields that is accounted for by the components (or factors). If we turn back to the correlation matrix in Figure 2.1, there are five fields and therefore 5 units of standardized variance to be accounted for. Each eigenvalue measures the amount of this variance accounted for by a factor. This leads to a rule of thumb and a useful measure to evaluate a given number of factors. The rule of thumb is to select as many factors as there are eigenvalues greater than 1. Why? If the eigenvalue represents the amount of standardized variance in the fields accounted for by the factor, then if it is above 1, it must represent variance contained in more than one field. This is because the maximum amount of standardized variance contained in a single field is 1. Thus, if in our five-field analysis the first eigenvalue were 3, it must account for variation in several fields. Now an eigenvalue can be less than 1 and still account for variation shared among several fields (for example 30% of the variation of each of three fields for an eigenvalue of .9), so the eigenvalue of 1 rule is only applied as a rule of thumb. Another aspect of eigenvalues (for principal components and some factor methods) is that their sum is the same as the number of fields, which is equal to the total standardized variance in the fields. Thus you can convert the eigenvalue into a measure of percentage of explained variance, which is helpful when evaluating a solution. Finally, it is important to mention that in applications in which you need to be able to interpret the results, the components must make sense. For this reason, factors with eigenvalues over 1 that cannot be interpreted may be dropped and those with eigenvalues less than 1 may be retained.
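The eigenvalue rule and the conversion to percentage of explained variance can be sketched with numpy (the five-field correlation matrix below is invented for illustration, not the one in Figure 2.1):

```python
import numpy as np

# Invented correlation matrix: fields 1-3 form one correlated block,
# fields 4-5 another.
R = np.array([
    [1.00, 0.75, 0.70, 0.10, 0.05],
    [0.75, 1.00, 0.72, 0.12, 0.08],
    [0.70, 0.72, 1.00, 0.15, 0.10],
    [0.10, 0.12, 0.15, 1.00, 0.65],
    [0.05, 0.08, 0.10, 0.65, 1.00],
])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # largest first
pct_var = eigvals / eigvals.sum() * 100          # eigenvalues sum to 5
n_keep = int((eigvals > 1).sum())                # eigenvalue > 1 rule
```

For this matrix the rule retains two components, matching the two correlated blocks; whether they are interpretable is still a judgment call.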

2.7  Rotations

When factor analysis succeeds you obtain a relatively small number of interpretable factors that account for much of the variation in the original set of fields. Suppose you have eight fields and factor analysis returns a two-factor solution. Formally, the factor solution represents a two-dimensional space. Such a space can be represented with a pair of axes as shown below.

While each pair of axes defines the same two-dimensional space, the coordinates of a point would vary depending on which pair of axes was applied. This creates a problem for factor methods since the values for the loadings or lambda coefficients vary with the orientation of axes and there is no unique orientation defined by the factor analysis itself. Principal components does not suffer from this problem since its method produces a unique orientation. This difficulty for factor analysis is a fundamental mathematical problem. The solutions to it are designed to simplify the task of interpretation for the analyst. Most involve, in some fashion, finding a rotation of the axes that maximizes the variance of the loading coefficients, so some are large and some small. This makes it easier for the analyst to interpret the factors. This is the best that can currently be done, but the fact that factor loadings are not uniquely determined by the method is a valid criticism leveled against it by some statisticians. We will discuss the various rotational schemes in the Methods section below.


Figure 2.2 Two Dimensional Space

2.8  Component Scores

If you are satisfied with a factor analysis or principal components solution, you can request that a new set of fields be created that represent the scores of each data record on the factors. These are calculated by summing the product of each original field and a weight coefficient (derived from the lambda coefficients). These factor score fields can then be used as the inputs for prediction and segmentation analyses. They are usually normalized to have a mean of zero and standard deviation of one. An alternative some analysts prefer is to use the lambda coefficients to judge which fields are highly related to a factor, and then compute a new field which is the sum or mean of that set of fields. This method, while not optimal in a technical sense, keeps (if means are used) the new scores on the same scale as the original fields (this of course assumes the fields themselves share a common scale), which can make the interpretation and the presentation straightforward. Essentially, subscale scores are created based on the factor results, and these scores are used in further analyses.
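A numpy sketch of how standardized component scores can be computed from raw data (simulated data; this is an illustration of the calculation, not Modeler's internal code):

```python
import numpy as np

# Simulated data: two correlated fields plus one independent field.
rng = np.random.default_rng(1)
n = 500
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.3 * rng.normal(size=n),
    base + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

Z = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score the inputs
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]             # largest component first
scores = Z @ eigvecs[:, order]                # project onto components
scores /= scores.std(axis=0)                  # mean 0, sd 1, uncorrelated
```

The resulting score fields have mean zero, standard deviation one, and zero correlation with each other, which is exactly what makes them convenient inputs for later modeling.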

2.9  Sample Size

Since principal components analysis is a multivariate statistical method, the rule of thumb for sample size (commonly violated) is that there should be from 10 to 25 times as many records as there are continuous fields used in the factor or principal components analysis. This is because principal components and factor analysis are based on correlations, and for p fields there are p*(p-1)/2 correlations. Think of this as a desirable goal and not a formal requirement (technically, if there are p fields there must be p+1 observations for factor analysis to run, but don't expect reasonable results). If your sample size is very small relative to the number of input fields, you should turn to principal components.

2.10  Methods

There are several popular methods within the domain of factor and principal components analyses. The common factor methods differ in how they go about fitting the correlation matrix. A traditional method that has been around for many years (for some, it is synonymous with factor analysis) is the principal axis factor method (often abbreviated as PAF). A more modern method that carries some technical advantages is maximum likelihood factor analysis. If the data are ill behaved (say near multicollinear), maximum likelihood, the more refined method, is more prone to give wild solutions. In most cases results using the two methods will be very close, so either is fine under general circumstances. If you suspect there are problems with your data, then principal axis may be a safer bet. The other factor methods are considerably less popular. One factor method, called Q factor analysis, involves transposing the data matrix and then performing a factor analysis on the records instead of the fields. Essentially, correlations are calculated for each pair of records based on the values of the input fields. This technique is related to cluster analysis, but is used infrequently today. Besides the factor methods, principal components can be run and, as mentioned earlier, must be run when the inputs are multicollinear.

Similarly, there are several choices in rotations. The most popular by far is the varimax rotation, which attempts to simplify the interpretation of the factors by maximizing the variances of the input fields' loadings on each factor. In other words, it attempts to find a rotation in which some fields have high and some low loadings on each factor, which makes it easier to understand and name the factors. The quartimax rotation attempts to simplify the interpretation of each field in terms of the factors by finding a rotation yielding high and low loadings across factors for each field. The equimax rotation is a compromise between the varimax and quartimax rotation methods. These three rotations are orthogonal, which means the axes are perpendicular to each other and the factors will be uncorrelated. This is considered a desirable feature since statements can be made about independent factors or aspects of the data. There are nonorthogonal rotations available (axes are not perpendicular); popular ones are oblimin and promax (which runs faster than oblimin). Such rotations are rarely used in data mining, since the point of data reduction is to obtain relatively independent composite measures, and it is easier to speak of independent effects when the factors are uncorrelated. Finally, principal components does not require a rotation, since there is a unique solution associated with it. However, in practice, a varimax rotation is sometimes done to facilitate the interpretation of the components.
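The varimax idea can be sketched with the widely circulated SVD-based iteration (a sketch only, with invented loadings; this is not Modeler's implementation):

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-10):
    """Orthogonally rotate a loading matrix L (fields x factors) to
    increase the variance of the squared loadings (varimax criterion).
    Standard SVD-based iteration."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - Lr @ np.diag((Lr**2).sum(axis=0)) / p))
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return L @ R

# Unrotated two-factor loadings (hypothetical)
L = np.array([[0.7, 0.5], [0.6, 0.5], [0.6, -0.5], [0.5, -0.6]])
Lrot = varimax(L)
```

Because the rotation is orthogonal, each field's communality (its row sum of squared loadings) is unchanged; only the distribution of loadings across factors is simplified.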

2.11  Overall Recommendations

For data mining applications, principal components is more commonly performed than factor analysis because of the expected high correlations among the many continuous inputs that are often analyzed, and because there isn't always strong interest in interpreting the results. Varimax rotation is usually done (although it is not necessary for principal components) to simplify the interpretation. If there are not many highly correlated fields (or other sources for ill-behaved data, for example, much missing data), then either principal axis or maximum likelihood factor can be performed. Maximum likelihood has technical advantages, but can produce an ugly solution if the data are not well conditioned (a statistical criterion).

2.12  Example: Regression with Principal Components

To demonstrate principal components, we will run a linear regression analysis predicting a target (amount of waste produced) as a function of several related inputs (amount of acreage put to different uses). After examining the regression results, we will run principal components analysis and use the first few component score fields as inputs to the regression.
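The same workflow, outside Modeler and on simulated data (not the waste dataset), can be sketched in numpy:

```python
import numpy as np

# Simulated data: five correlated inputs driven by one common factor,
# and a target that depends on all of them.
rng = np.random.default_rng(7)
n = 300
f = rng.normal(size=n)
X = f[:, None] + 0.5 * rng.normal(size=(n, 5))
y = X.sum(axis=1) + rng.normal(size=n)

# 1. Standardize the inputs and extract component scores.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
keep = eigvals[order] > 1                 # eigenvalue rule of thumb
S = Z @ eigvecs[:, order][:, keep]

# 2. Regress the target on the retained component scores.
A = np.column_stack([np.ones(n), S])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coefs
r2 = 1 - (resid**2).sum() / ((y - y.mean())**2).sum()
```

With one dominant common factor, a single component survives the eigenvalue rule, and the regression on that one score is stable because there is nothing left to be collinear with.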

Note

A complete example of linear regression is provided in Lesson 6. Our intent here is not to teach linear regression, but instead to use this technique to illustrate how principal components can be used in conjunction with that modeling technique.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory
Double-click on PrincipalComponents.str


When the stream first opens, the following warning dialog is displayed. In version 14.0 of Modeler, the Linear Models node was added as an enhanced version of the Regression node. The Regression node will be replaced in a future release.

Figure 2.3 Regression Node Expiration Warning

Click OK
Right-click on the Table node connected to the Type node, then click Run
Examine the data, and then close the Table window
Double-click on the Type node

Figure 2.4 Type Node for Linear Regression Analysis

The INDUST, METALS, TRUCKS, RETAIL, and RESTRNTS fields (which measure the number of acres of a specific type of land usage) will be used as inputs to predict the amount of solid waste (WASTE).

Close the Type node window
Double-click on the Regression node named WASTE at the top of the Stream canvas
Click the Expert tab, and then click the Expert option button
Click the Output button, and then make sure that the Descriptives check box is checked


Figure 2.5 Requesting Descriptive Statistics in a Linear Regression Node

To check for correlation among the inputs, we request descriptive statistics (Descriptives). This will display correlations for all the fields in the analysis, among other statistics. (Note that we could have obtained these correlations from the Statistics node.) We can obtain more technical information about correlated predictors by checking the Collinearity Diagnostics check box.

Click OK, and then click the Run button
Right-click the Regression generated model node named Waste in the Models Manager window, then click Browse
Click the Summary tab
Expand the Analysis topic in the Summary tab (if necessary)

Figure 2.6 Linear Regression Browser Window (Summary Tab)


The estimated regression equation appears in the Summary tab under Analysis; notice that two of the inputs have negative coefficients.

Click the Advanced tab
Scroll to the Pearson Correlation section of the Correlations table in the Advanced tab of the browser window

Figure 2.7 Correlations for Input Fields and Target Field

All correlations are positive and there are high correlations between the METALS and TRUCKS fields (.893) and between the RESTRNTS and RETAIL fields (.920). Since some of the inputs are highly correlated, this might create stability problems (large standard errors) for the estimated regression coefficients due to near multicollinearity.

Scroll to the Model Summary table

Figure 2.8 Regression Model Summary

The regression model with five predictors accounted for about 83% (adjusted R Square) of the variation in the target field (waste).

Scroll to the Coefficients table


Figure 2.9 Linear Regression Coefficients

Two of the significant coefficients (INDUST and RETAIL) have negative regression coefficients, although they correlate positively (see Figure 2.7) with the target field. Although there might be a valid reason for this to occur, this coupled with the fact that RETAIL is highly correlated with another predictor is suspicious. Also, those familiar with regression should note that the estimated beta coefficient for RESTRNTS is above 1, which is another sign of near multicollinearity. It is possible that this situation could have been avoided if a stepwise method had been used (this is left as an exercise). However, we will take the position that the current set of inputs is exhibiting signs of near multicollinearity, and we will run principal components as an attempt to improve the situation.

Close the Regression browser window
Double-click the PCA/Factor modeling node (named Factor) in the stream canvas

Figure 2.10 PCA/Factor Dialog


In Simple mode (see Expert tab), the only options involve selection of the factor extraction method (some of these were discussed in the Methods section). Notice that Principal Components is the default method.

Click the Run button
Right-click the PCA/Factor generated model node named Factor in the Models Manager window, then click Browse

Figure 2.11 PCA/Factor Browser Window (Five-Component Solution)

Five principal components were found. Since there were originally five input fields, reducing them to five principal components does not constitute data reduction (but it does solve the problem of multicollinearity). If the solution were successful, we would expect that the variation within the five input fields would be concentrated in the first few components, and we could check this by examining the Advanced tab of the browser window. However, instead we will use the Expert options to have the PCA/Factor node select an optimal number of principal components.

Close the PCA/Factor browser window
Double-click on the PCA/Factor modeling node named Factor
Click the Expert tab, and then click the Expert Mode option button

Figure 2.12 Expert Options

The Extract factors option indicates that while in Expert mode, PCA/Factor will select as many factors as there are eigenvalues over 1 (we discussed this rule of thumb earlier in the lesson). You can change this rule or specify a number of factors; this might be done if you prefer more or fewer factors than the eigenvalue rule provides. By default, the analysis will be performed on the correlation matrix; principal components can also be applied to covariance matrices, in which case fields with greater variation will have more weight in the analysis. This is really all we need to proceed, but let's examine the other Expert options.
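The eigenvalue-greater-than-1 rule applied to a correlation matrix can be sketched in a few lines. The two-factor data below are generated for illustration only and do not come from the lesson's file:

```python
import numpy as np

# Hypothetical data: two underlying factors drive five observed fields.
rng = np.random.default_rng(1)
size_factor = rng.normal(size=300)          # drives fields 1-3
type_factor = rng.normal(size=300)          # drives fields 4-5
X = np.column_stack([
    size_factor + rng.normal(scale=0.5, size=300),
    size_factor + rng.normal(scale=0.5, size=300),
    size_factor + rng.normal(scale=0.5, size=300),
    type_factor + rng.normal(scale=0.5, size=300),
    type_factor + rng.normal(scale=0.5, size=300),
])

# Eigenvalues of the correlation matrix; retain those above 1.
R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]       # descending order
n_components = int(np.sum(eigvals > 1.0))            # the eigenvalue > 1 rule
pct_variance = eigvals[:n_components].sum() / eigvals.sum()
print(n_components, round(100 * float(pct_variance), 1))
```

Because the correlation matrix of five standardized fields has trace 5, each eigenvalue divided by 5 is the proportion of total variance that component explains; here the rule retains the two components matching the two built-in factors.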

Notice that the Only use complete records check box becomes active when Expert Mode is selected. By default, PCA/Factor will only use records with complete information on the input fields. If this option is not checked, then a pairwise technique is used: for a record with missing values on one or more fields used in the analysis, the fields with valid values will still contribute, but the created factor score fields will be set to $null$ for that record. Also, substantial amounts of missing data when Only use complete records is not selected can lead to numeric instabilities in the algorithm.

The Sort values check box in the Component/Factor format section will have PCA/Factor list the fields in descending order by their loading coefficients on the factor/component for which they load highest. This makes it very easy to see which fields relate to which factors and is especially useful when many input fields are involved. To further aid this effort, by suppressing loading coefficients less than .3 in absolute value (the Hide values below option) you will only see the larger loadings (small values are replaced with blanks) and not be distracted by small loadings. Although not required, these options make the interpretive task much easier when many fields are involved.

Make sure the Sort values check box is checked
Make sure the Hide values below check box is checked
Set the Hide values below value to 0.3

Click the Rotation button

Figure 2.13 Expert Options (Factor/Component Rotation)

By default, no rotation is performed, which is often the case when principal components is run. The Delta and Kappa text boxes control aspects of the Oblimin and Promax rotation methods, respectively.

Click Cancel
Click the Run button
Right-click the PCA/Factor generated model node named Factor in the Models Manager window, then click Browse
Click the Model tab

Figure 2.14 PCA/Factor Browser Window (Two-Component Solution)

The PCA/Factor browser window contains the equations used to create component (in this case) or factor score fields from the inputs. Two components were selected based on the eigenvalue-greater-than-1 rule (recall that five were selected in the original analysis under Simple mode). The coefficients are so small because the components are normalized to have means of 0 and standard deviations of 1, while most inputs have values that extend into the thousands. To interpret the components, we turn to the advanced output.

Click the Advanced tab
Scroll to the Communalities table in the Expert Output browser window

Figure 2.15 Communalities Summary

The communalities represent the proportion of variance in an input field explained by the factors (here, principal components). Since initially as many components are fit as there are inputs, the communalities in the first column (Initial) are trivially 1. They are of interest when a solution is reached (Extraction column). Here the communalities are below 1 and measure the proportion of variance in each input field that is accounted for by the selected number of components (two). Any fields having very small communalities (say .2 or below) have little in common with the other inputs, and are neither explained by the components (or factors) nor contribute to their definition. Of the five inputs, all but INDUST have a large proportion of their variance accounted for by the two components, and INDUST itself has a communality of .44 (44%).
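The arithmetic behind a communality — the sum of a field's squared loadings across the retained components — can be sketched as follows. The data, field count, and noise levels here are hypothetical, chosen only so that one field (like INDUST in the lesson) has a visibly lower communality than the rest:

```python
import numpy as np

# Hypothetical data: four fields share one common source; the last is noisier.
rng = np.random.default_rng(2)
common = rng.normal(size=400)
X = np.column_stack([common + rng.normal(scale=s, size=400)
                     for s in (0.4, 0.5, 0.6, 1.2)])

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = int(np.sum(eigvals > 1.0))                  # retained components
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])
communalities = (loadings ** 2).sum(axis=1)     # per-field proportion explained
print(np.round(communalities, 2))
```

With all components retained, every communality would be exactly 1 (the Initial column of the table); after extraction, the noisier fourth field keeps the smallest share of explained variance.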

Scroll to the Total Variance Explained table in the Advanced tab of the browser window

Figure 2.16 Total Variance Explained (by Components) Table

The Initial eigenvalues area contains all five eigenvalues, along with the percentage of variance (of the fields) explained by each and a cumulative percentage of variance. We see in the Extraction Sums of Squared Loadings section that there are two eigenvalues over 1, the first being about twice the size of the second. Two components were selected, and they collectively account for about 82 percent of the variance of the five inputs. The third eigenvalue is .73, which might be explored as a third component if more input fields were involved (reducing from five fields to three components is not much of a reduction). The remaining two eigenvalues (fourth and fifth) are quite small. While not pursued here, in practice we might try out a solution with a different number of components.

Scroll to the Component Matrix table in the Advanced tab of the browser window

Figure 2.17 Component Matrix (Component or Factor Loadings)

PCA/Factor next presents the Component (or Factor) Matrix, which contains the unrotated loadings. If a rotation were requested, this table would appear in addition to a table containing the rotated loadings. The input fields form the rows and the components (or factors, if a factor method were run) form the columns. The values in the table are the loadings. If any loading were below .30 in absolute value, blanks would appear in its position due to our option choice. While it makes no difference here, the option helps focus attention on the larger loadings (those closer to 1 in absolute value).

The first component seems to be a general component, having positive loadings on all the input fields (recall that they all correlated positively; see Figure 2.9). In some sense, it could represent the total (weighted) amount of land used in these activities. The second component has both positive and negative coefficients, and seems to represent the difference between land usage for trucking and wholesale trade, fabricated metals, and industrial work, versus retail trade, restaurants, and hotels. This might be considered a contrast between manufacturing/industrial and service-oriented use of land. This pattern — all fields with positive loadings on the first component (factor) and contrasting signs on coefficients of the second and later components (factors) — is fairly common in unrotated solutions. If we requested a rotation, the fields would group into the two rotated components according to their signs on the second component.

We should note that when interpreting components or factors, the loading magnitude is important; that is, fields with greater loadings (in absolute value) are more closely associated with the components and are more influential when interpreting them.

We know that the two components account for 82 percent of the variation of the original input fields (a substantial amount), and that we can interpret the components. Now we will rerun the linear regression with the components as inputs.

Close the PCA/Factor browser window
Double-click on the Type node located to the right of the PCA/Factor generated model node named Factor

Figure 2.18 Type Node Set Up for Principal Components Regression

The two component score fields ($F-Factor-1 and $F-Factor-2) are the only fields that will be used as inputs; the original land usage fields have their role set to None. If both the land usage fields and the component score fields were inputs to the linear regression, we would have only exacerbated the near multicollinearity problem (as an exercise, explain why).
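The full principal components regression workflow — standardize the inputs, extract component scores, then regress the target on the scores rather than the raw fields — can be sketched outside Modeler. Data, factor structure, and coefficients below are hypothetical, and the scores are not rescaled to unit variance as Modeler's score fields are:

```python
import numpy as np

# Hypothetical data: five correlated inputs built from two latent factors.
rng = np.random.default_rng(3)
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1 + rng.normal(scale=0.4, size=n) for _ in range(3)] +
                    [f2 + rng.normal(scale=0.4, size=n) for _ in range(2)])
y = 2.0 * f1 - 1.0 * f2 + rng.normal(scale=0.5, size=n)

# 1. Standardize the inputs and extract component score fields.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
k = int(np.sum(eigvals > 1.0))                  # eigenvalue > 1 rule
scores = Z @ eigvecs[:, order[:k]]              # component score fields

# 2. Regress the target on the scores instead of the raw inputs.
A = np.column_stack([np.ones(n), scores])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
r2 = 1 - ((y - A @ coefs) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(k, round(float(r2), 2))
```

As in the lesson, the original inputs are still needed at prediction time, because the score fields are computed from them.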

Close the Type node window
Run the Regression modeling node named Waste, located in the lower right section of the Stream canvas
Right-click the Regression generated model node named Waste in the Models Manager, then click Browse
Click the Summary tab
Expand the Analysis topic

Figure 2.19 Linear Regression (Using Components as Inputs) Browser Window

The prediction equation for waste is now in terms of the two principal component fields. Notice that the coefficient for the second component has a negative sign, which we will consider when examining the expert output.

Click the Advanced tab
Scroll to the Model Summary table

Figure 2.20 Model Summary (Principal Components Regression)

The regression model with two principal component fields as inputs accounts for about 73% of the variance (adjusted R square) in the Waste field. This compares with 83% in the original analysis (Figure 2.8). Essentially, we are giving up 10% of explained variance to gain more stable coefficients and possibly a simpler interpretation. The requirements of the analysis would determine whether this tradeoff is acceptable.

Scroll to the Coefficients table

Figure 2.21 Coefficients Table (Principal Components Regression)

Both components are statistically significant. The positive coefficient for $F-Factor-1 indicates, not surprisingly, that as overall land usage increases, so does the amount of waste. The negative coefficient for the second component (which represented a contrast of land use for manufacturing/industrial versus service-oriented) indicates that, controlling for total land usage, as the amount of manufacturing/industrial land use increases relative to service-oriented usage, waste production goes down. Or, to put it another way, as service-oriented land use increases relative to manufacturing/industrial, waste production increases.

As mentioned before, the interpretation of the components, and thus of the regression, might be made easier by rotating the components (say, using a varimax rotation); you might ask your instructor to demonstrate this approach. Notice that the components, unlike the original fields (see Figure 2.8), have no beta coefficients above 1, indicating that the potential problem with near multicollinearity has been resolved.

It is important to note that while we have shifted from a regression with five inputs to a regression with two components, the five inputs are still required to produce predictions because they are needed to create the component score fields.

Additional Readings
Those interested in learning more about factor and principal components analysis might consider the book by Kline (1994), Jae-On Kim's introductory text (1978) and his book with Charles W. Mueller (1979), and Harry Harman's revised text (1979).

Summary Exercises
The exercises in this lesson use the file waste.dat. The table provides details about the file.

Waste.dat contains information from a waste management study in which the amount of solid waste produced within an area was related to type of land usage. Interest is in relating land usage to amount of waste produced for planning purposes. The inputs were found to be highly correlated, and the dataset is used to demonstrate principal components regression. The file contains 40 records and the following fields:

INDUST    Acreage (US) used for industrial work
METALS    Acreage used for fabricated metal
TRUCKS    Acreage used for trucking and wholesale trade
RETAIL    Acreage used for retail trade
RESTRNTS  Acreage used for restaurants and hotels
WASTE     Amount of solid waste produced

1.  Working with the current stream from the lesson, request a varimax rotation of the principal components analysis. Interpret the component coefficients. Use the component score fields from this generated model node as inputs to the Regression node predicting waste. Does the R square change? Explain this. Do the regression coefficients change? How would you interpret them?

2.  With the same data, use the Extraction Method drop-down list in the PCA/Factor node to run a factor analysis instead (using principal axis factoring or maximum likelihood) with no rotation. Compare the results to those obtained with principal components in the lesson. Are they similar? In what way do they differ? Now rerun the factor analysis, requesting a varimax rotation. How do these results compare to those obtained in the first exercise? Do you find anything that leads you to prefer one to the other?

Lesson 3: Decision Trees/Rule Induction

Overview

•  Introduce the features of the C5.0, CHAID, C&R Tree and QUEST nodes
•  Create models for categorical targets
•  Understand how CHAID and C&R Tree model a continuous output

Data

We will use the dataset churn.txt that we used in Lesson 1. This data file contains information on 1477 of a telecommunication company's customers who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers, and voluntary leavers. In this lesson, we use decision tree models to understand which factors influence group membership.

Following recommended practice, we will use a Partition node to divide the cases into two partitions (subsamples): one to build, or train, the model and the other to test it (often called a holdout sample). With a holdout sample, you are able to check the resulting model's performance on data not used to fit the model. The holdout sample also has known values for the target field and therefore can be used to check model performance.

A second dataset, Insclaim.dat, used with the C&R Tree node, contains 293 records based on patient admissions to a hospital. All patients belong to a single diagnosis related group (DRG). Four fields (grouped severity of illness, age, length of stay, and insurance claim amount) are included. The goal is to build a predictive model for the insurance claim amount and use this model to identify outliers (patients with claim values far from what the model predicts), which might be instances of error or fraud in the claims. Such analyses can be performed for error or fraud detection in instances where audited data (for which the outcome error/no error or fraud/no fraud is known) are not available.

3.1 Introduction
PASW Modeler contains four different algorithms for constructing a decision tree (more generally referred to as rule induction): C5.0, CHAID, QUEST, and C&R Tree (classification and regression trees). They are similar in that they can all construct a decision tree by recursively splitting data into subgroups defined by the predictor fields as they relate to the target. However, they differ in several important ways. (PASW Modeler also includes the Decision List node, which develops models to identify subgroups or segments that show a higher or lower likelihood of a binary (yes or no) target relative to the overall sample. These models are tree-like, but they are different enough that Decision List is reviewed separately in Appendix A.)

We begin by reviewing a table that highlights some distinguishing features of the algorithms. Next, we will examine the various options for the algorithms in the context of predicting a categorical output. Within each section we discuss when it is advisable to use the expert options within these nodes.

3.2 Comparison of Decision Tree Models
The table below lists some of the important differences between the decision tree/rule induction algorithms available within PASW Modeler.

Table 3.1 Some Key Differences between the Four Decision Tree Models

Model Criterion                           C5.0                   CHAID                  QUEST                  C&R Tree
----------------------------------------  ---------------------  ---------------------  ---------------------  ---------------------
Type of Split for Categorical Predictors  Multiple               Multiple               Binary                 Binary
Continuous Target                         No                     Yes(1)                 No                     Yes
Continuous Predictors                     Yes                    No                     Yes(2)                 Yes
Criterion for Predictor Selection         Information measure    Chi-square; F test     Statistical            Impurity (dispersion)
                                                                 for continuous                                measure
Can Cases Missing Predictor               Yes, uses              Yes, missing becomes   Yes, uses surrogates   Yes, uses surrogates
  Values be Used?                         fractionalization      a category
Priors                                    No                     No                     Yes                    Yes
Pruning Criterion                         Upper limit on         Stops rather than      Cost-complexity        Cost-complexity
                                          predicted error        overfit                pruning                pruning
Build Trees Interactively                 No                     Yes                    Yes                    Yes
Supports Bagging/Boosting                 Yes                    Yes                    Yes                    Yes

(1) Modeler has extended the logic of the CHAID approach to accommodate ordinal and continuous target fields.
(2) Continuous predictors are binned into ordinal fields containing, by default, approximately equal-sized categories.

Note: C&R Tree and QUEST produce binary splits (two-branch splits) when growing the tree, while C5.0 and CHAID can produce more than two subgroups when splitting occurs. However, if we had a predictor of nominal or ordinal measurement level with four categories, each of which was distinct in relation to the target field, C&R Tree and QUEST could perform successive binary splits on this field. This would produce a result equivalent to a multiple split at a single node, but requires additional tree levels.

All methods can handle predictors and targets that are categorical (flag, nominal, and ordinal). CHAID and C&R Tree can use a continuous target field, while all but CHAID can use a continuous predictor or input (although see footnote 2).

The trees that each method grows will not necessarily be identical because the methods use very different criteria for selecting a predictor. CHAID and QUEST use more standard statistical methods, while C5.0 and C&R Tree use non-statistical measures, as explained below.

Missing (blank) values are handled in three different ways. C&R Tree and QUEST use the substitute (surrogate) predictor field whose split is most strongly associated with that of the original predictor to direct a case with a missing value to one of the split groups during tree building. C5.0 splits a case in proportion to the distribution of the predictor field and passes a weighted portion of the case down each tree branch. CHAID uses the missing values as an additional category in model building.

Three of the four methods prune trees after growing them quite large, while CHAID instead stops before a tree gets too large.

For all these reasons, you should not expect the four algorithms to produce identical trees for the same data. You should, however, expect that important predictors will be included in trees built by any algorithm.

Those interested in more detail concerning the algorithms can see the PASW Modeler 14.0 Algorithms Guide. Also, you might consider C4.5: Programs for Machine Learning (Morgan Kaufmann, 1993) by Ross Quinlan, which details the predecessor to C5.0; Classification and Regression Trees (Wadsworth, 1984) by Breiman, Friedman, Olshen and Stone, who developed CART (Classification and Regression Tree) analysis; the article by Loh and Shih (1997, "Split Selection Methods for Classification Trees," Statistica Sinica, 7: 815-840) that details the QUEST method; and, for a description of CHAID, "The CHAID Approach to Segmentation Modeling: Chi-squared Automatic Interaction Detection," Lesson 4 in Richard Bagozzi, Advanced Methods of Marketing Research (Blackwell, 1994).

3.3 Using the C5.0 Node
We will use the C5.0 node to create a rule induction model. It contains the rule induction model in either decision tree or rule set format. By default, the C5.0 node is labeled with the name of the output field. The C5.0 model can be browsed, and predictions can be made by passing new data through it in the Stream Canvas.

Before a data stream can be used by the C5.0 node—or essentially any node in the Modeling palette—the measurement levels of all fields used in the model must be instantiated (either in the source node or a Type node). That is because all modeling nodes use this information to set up the models. As a reminder, the table below shows the key roles for a field.

Table 3.2 Role Settings

Input      The field acts as an input or predictor within the modeling.
Target     The field is the output or target for the modeling.
Both       Allows the field to act as both an input and a target in modeling. This role is suitable for the association rule and sequence detection algorithms only; all other modeling techniques will ignore the field.
None       The field will not be used in machine learning or statistical modeling. Default if the field is defined as Typeless.
Partition  Indicates a field used to partition the data into separate samples for training, testing, and (optional) validation purposes.
Split      Indicates that a model should be built for each value of the field. For flag, nominal, and ordinal fields only.
Frequency  Used as a frequency weighting factor. Supported by CHAID, QUEST, C&R Tree, and the Linear node.

Role can be set by clicking in that column for a field within the Type node or the Type tab of a source node and selecting the role from the drop-down menu. Alternatively, this can be done from the Fields tab of a modeling node.

If the Stream Canvas is not empty, click File…New Stream
Place a Var. File node from the Sources palette
Double-click the Var. File node
Move to the c:\Train\ModelerPredModel directory and double-click on the Churn.txt file

As delimiter, check the Comma option if necessary
Set the Strip lead and trail spaces option to Both
Click OK to return to the Stream Canvas
Place a Partition node from the Field Ops palette to the right of the Var. File node named Churn.txt
Connect the Var. File node named Churn.txt to the Partition node
Place a Type node from the Field Ops palette to the right of the Partition node
Connect the Partition node to the Type node

Next we will add a Table node to the stream. This will not only force PASW Modeler to instantiate the data but also act as a check to ensure that the data file is being read correctly.

Place a Table node from the Output palette above the Type node in the Stream Canvas
Connect the Type node to the Table node
Right-click the Table node
Run the Table node

The values in the data table should look reasonable (not shown).

Click File…Close to close the Table window
Double-click the Type node
Click in the cell located in the Measurement column for ID (current value is Continuous), and select Typeless from the list
Click in the cell located in the Role column for CHURNED (current value is Input) and select Target from the list

Figure 3.1 Type Node Ready for Modeling

Notice that ID will be excluded from any modeling, as the role is automatically set to None for a Typeless field. The CHURNED field will be the target field for any predictive model, and all fields but ID and Partition will be used as predictors.

Click OK
Place a C5.0 node from the Modeling palette to the right of the Type node

Connect the Type node to the C5.0 node

The name of the C5.0 node should immediately change to CHURNED.

Figure 3.2 C5.0 Modeling Node Added to Stream

Double-click the C5.0 node

Figure 3.3 C5.0 Node Model Tab

The Model name option allows you to set the name for both the C5.0 and resulting C5.0 rule nodes. The form of the resulting model (decision tree or rule set; both will be discussed) is selected using the Output type option.

The Use partitioned data option is checked so that the C5.0 node will make use of the Partition field created by the Partition node earlier in the stream. Whenever this option is checked, only the cases the Partition node assigned to the Training sample will be used to build the model; the rest of the cases will be held out for Testing and/or Validation purposes. If unchecked, the field will be ignored and the model will be trained on all the data. Here, we use the default setting for the Partition node, so 50% of cases will be used for training and 50% for testing.

The Build model for each split option enables you to use a single stream to build separate models for each possible value of a flag, categorical, or continuous input field, which is specified as a split field in the Fields tab or an upstream Type node. With split modeling, you can easily build the best-fitting model for each possible field value in a single execution of the stream.

The Cross-validate option provides a way of validating the accuracy of C5.0 models when there are too few records in the data to permit a separate holdout sample. It does this by partitioning the data into N equal-sized subgroups and fitting N models. Each model uses N-1 of the subgroups for training, then applies the resulting model to the remaining subgroup and records the accuracy. Accuracy figures are pooled over the N holdout subgroups, and this summary statistic estimates model accuracy applied to new data. Since N models are fit, N-fold validation is more resource intensive; it reports the accuracy statistic but does not present the N decision trees or rule sets. By default N, the number of folds, is set to 10.

For a predictor field that has been defined as categorical, C5.0 will normally form one branch per value in the set. However, by checking the Group symbolics check box, the algorithm can be set so that it finds sensible groupings of the values within the field, thus reducing the number of rules. This is often desirable. For example, instead of having one rule per region of the country, grouping symbolic values may produce a rule such as:

Region [South, Midwest] …
Region [Northeast, West] …
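The N-fold cross-validation procedure described above can be sketched in a few lines. Here a simple one-split "stump" classifier stands in for C5.0, and the data are generated for illustration only:

```python
import numpy as np

# Illustrative data: a noisy binary target driven by a single input.
rng = np.random.default_rng(4)
x = rng.normal(size=300)
y = (x + rng.normal(scale=0.8, size=300)) > 0

def stump_fit(x, y):
    # Choose the threshold on x whose split best matches the two classes.
    candidates = np.quantile(x, np.linspace(0.1, 0.9, 17))
    accs = [max(np.mean((x > t) == y), np.mean((x <= t) == y)) for t in candidates]
    best = candidates[int(np.argmax(accs))]
    flip = np.mean((x > best) == y) < np.mean((x <= best) == y)
    return best, flip

def stump_predict(model, x):
    t, flip = model
    return (x <= t) if flip else (x > t)

N = 10                                        # number of folds
folds = np.array_split(rng.permutation(300), N)
accuracies = []
for i in range(N):
    train_idx = np.concatenate([folds[j] for j in range(N) if j != i])
    model = stump_fit(x[train_idx], y[train_idx])                 # fit on N-1 folds
    accuracies.append(np.mean(stump_predict(model, x[folds[i]]) == y[folds[i]]))
print(round(float(np.mean(accuracies)), 2))   # pooled accuracy estimate
```

Each of the N models is trained on N-1 folds and scored on the fold it never saw; the mean of the N holdout accuracies is the summary statistic the option reports.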

Once trained, C5.0 builds one decision tree or rule set that can be used for predictions. However, it can also be instructed to build a number of alternative models for the same data by selecting the Boosting option. Under this option, when C5.0 makes a prediction it consults each of the alternative models before making a decision. This can often provide more accurate prediction, but takes longer to train. Also, the resulting model is a set of decision tree predictions whose outcome is determined by voting, which is not simple to interpret.
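The train-several-models-then-vote idea can be sketched with a boosting-style loop: simple models are trained on successively reweighted data, and their predictions are combined by majority vote. This is an illustrative stand-in, not C5.0's actual boosting algorithm, and the data are synthetic:

```python
import numpy as np

# Illustrative data: binary target depending on two inputs.
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def weighted_stump(X, y, w):
    # Pick the single feature/threshold split with the lowest weighted error.
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            pred = (X[:, j] > t).astype(int)
            err = float(np.sum(w * (pred != y)))
            if best is None or err < best[0]:
                best = (err, j, t)
    return best[1], best[2]

weights = np.full(len(y), 1 / len(y))
models = []
for _ in range(5):                        # a small committee of five stumps
    j, t = weighted_stump(X, y, weights)
    pred = (X[:, j] > t).astype(int)
    weights = np.where(pred != y, weights * 2, weights)   # upweight mistakes
    weights /= weights.sum()
    models.append((j, t))

votes = sum((X[:, j] > t).astype(int) for j, t in models)
committee_pred = (votes >= 3).astype(int)                 # majority vote
acc = float(np.mean(committee_pred == y))
print(round(acc, 2))
```

The committee's prediction is the vote of five separately trained models, which mirrors why a boosted model is harder to interpret than a single tree.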

The algorithm can be set to favor either Accuracy on the training data (the default) or Generality to other data. In our example, we favor a model that is expected to generalize better to other data, and so we select Generality.

Click Generality option button

C5.0 will automatically handle errors (noise) within the data and, if known, you can inform PASW Modeler of the expected proportion of noisy or erroneous data. This option is rarely used.

As with all of the modeling nodes, after selecting the Expert option or tab, more advanced settings are available. In this course, we will discuss the Expert options briefly. The reader is referred to the Modeler 14 Modeling Nodes documentation for more information on these settings.

Click the Expert option button

Figure 3.4 C5.0 Node Model Tab Expert Options

By default, C5.0 will produce splits if at least two of the resulting branches have at least two data records each. For large datasets you may want to increase this value to reduce the likelihood of rules that apply to very few records. To do so, increase the value in the Minimum records per child branch box.

Click the Simple Mode option button, and then click Run 

A C5.0 Rule model, labeled with the predicted field (CHURNED), will appear in the Models palette of the Manager.

The C5.0 Rule model is also added automatically to the stream, connected to the Type node. A dotted line connects it to the C5.0 modeling node, indicating the source of the model (not shown). Each time the model is rerun, the model in the stream will be replaced.

3.4 Viewing the Model
Once the C5.0 Rule node is in the stream, it can be edited.

Right-click the C5.0 generated model node named CHURNED in the stream palette, then click Edit

The Model Viewer window has two panes. The left one shows the root node of the tree and the first split; the right pane displays a graph of predictor importance measures.

According to what we see of the tree so far, LOCAL is the first split in the tree. Further, we see that if LOCAL <= 4.976 the Mode value for CHURNED is InVol. The Mode is the modal (most frequent) output value for the branch, and it will be the predicted value unless there are other fields that need to be taken into account within that branch to make a prediction. When LOCAL <= 4.976 the branch terminates, visually apparent because of the arrow. This means the prediction for all customers within this range of values on LOCAL is to be an involuntary churner.

In the second half of the first split, where LOCAL > 4.976, the Mode value is Current. In this instance, no predictions of CHURNED are visible, and to view the predictions we need to further unfold the tree.

Predictor importance is enabled by default on the Analyze tab in the C5.0 modeling node (or any modeling node for which it can be calculated). Predictor importance takes into account the whole tree and is calculated on the test partition, if one is available (as is true here). Predictor importance values sum to 1.0, so the relative importance of each predictor can be directly compared. Importantly, predictor importance does not relate to model accuracy; instead, it is a measure of how much influence a field has on model prediction, i.e., changes in a field lead to changes in model predictions.
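Because importance values sum to 1.0, the comparison between predictors reduces to a simple normalization of raw influence scores. A minimal sketch of that idea (the raw scores below are made up for illustration; Modeler computes the underlying influence measures internally):

```python
def normalize_importance(raw_scores):
    """Rescale raw per-predictor influence scores so they sum to 1.0,
    making the relative importance of each predictor directly comparable."""
    total = sum(raw_scores.values())
    return {field: score / total for field, score in raw_scores.items()}

# Hypothetical raw influence scores for the churn predictors
importance = normalize_importance(
    {"LOCAL": 6.2, "SEX": 2.1, "AGE": 1.0, "INTERNATIONAL": 0.7})
```

After normalization, the values can be read directly as shares of total influence, which is why the bar chart lets you compare predictors at a glance.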

Figure 3.5 Browsing the C5.0 Rule Node


The bar chart shows that the field LOCAL, used on the first split, is by far the most important in predicting CHURNED. However, we haven’t seen the whole tree, and critically, we aren’t yet ready to use the test partition data, so we won’t examine predictor importance any further at the moment.

To unfold the branch LOCAL > 4.976, just click the expand button.

Click to unfold the branch LOCAL > 4.976

Figure 3.6 Unfolding a Branch

SEX is the next split field. Now we see that SEX is the best predictor for persons who spend more than 4.976 minutes on local calls. The Mode value for Males is Current and for Females is Vol. However, at this point we still cannot make any predictions, because there is a symbol to the left of each value of SEX, which means that other fields need to be taken into account before we can make a prediction. Once again we can unfold each separate branch to see the rest of the tree, but we will take a shortcut:

Click the All button in the Toolbar 


Figure 3.7 Fully Unfolded Tree

We can see several nodes, usually referred to as terminal nodes, that cannot be refined any further. In these instances, the mode is the prediction. For example, if we are interested in the Current Customer group, one group we would predict to remain customers are persons where LOCAL > 4.976, SEX = M, International <= 0.905, and AGE > 29. To get an idea about the number and percentage of records within such branches, we ask for more details.

Click Show or hide instance and confidence figures in the toolbar 


Figure 3.8 Instance and Confidence Figures Displayed

The incidence tells us that there are 218 persons who met those criteria. The confidence figure for this set of individuals is 1.0, which represents the proportion of records within this set correctly classified (predicted to be Current and actually being Current). That means it is 100% accurate on this group! If we were to score another dataset with this model, how would persons with the same characteristics be classified? Because PASW Modeler assigns the group the modal category of the branch, everyone in the new dataset who met the criteria defined by this rule would be predicted to remain Current Customers.
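The mode and confidence figures for a branch follow a simple rule: the prediction is the branch’s most frequent target value, and the confidence is the proportion of its records with that value. A small illustrative sketch (not Modeler’s internal code):

```python
from collections import Counter

def branch_prediction(target_values):
    """Return (mode, confidence) for the records falling in one branch:
    the modal target category and the proportion of records matching it."""
    counts = Counter(target_values)
    mode, freq = counts.most_common(1)[0]
    return mode, freq / len(target_values)
```

For the 218-record branch above, where every record is Current, this yields ('Current', 1.0).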

If you would like to present the results to others, an alternative format is available that helps visualize the decision tree. The Viewer tab provides this alternative format.

Click the Viewer tab

Click the Decrease Zoom tool (to view more of the tree). (You may also need to expand the size of the window.)



Figure 3.9 Decision Tree in the Viewer Tab

The root of the tree shows the overall percentages and counts for the three categories of CHURNED. The modal category is shaded in each node. We see that there are 719 customers in the training partition.

The first split is on LOCAL, as we have seen already in the text display of the tree. Similar to the text display, we can decide to expand or collapse branches. In the right corner of some nodes a – or + is displayed, referring to an expanded or collapsed branch, respectively. For example, to collapse the tree at node 2:

Click in the lower right corner of node 2 (shown in Figure 3.10) 


Figure 3.10 Collapsing a Branch

In the Viewer tab, toolbar buttons are available for zooming in or out; showing frequency information as graphs and/or as tables; changing the orientation of the tree; and displaying an overall map of the tree in a smaller window (tree map window) that aids navigation in the Viewer tab. When it is not possible to view the whole tree at once, such as now, one of the more useful buttons in the toolbar is the Tree map button, because it shows you the size of the tree. A red rectangle indicates the portion of the tree that is being displayed. You can then navigate to any portion of the tree you want by clicking on any node you desire in the Tree map window.

Click in the lower right corner of node 2 

Click on the Treemap button in the tool bar
Enlarge the Treemap until you see the node numbers (shown in Figure 3.11)


Figure 3.11 Decision Tree in the Viewer Tab with a Tree Map

3.5 Generating and Browsing a Rule Set

When building a C5.0 model, the C5.0 node can be instructed to generate the model as a rule set, as opposed to a decision tree. A rule set is a number of IF … THEN rules which are collected together by prediction.

A rule set can also be produced from the Generate menu when browsing a C5.0 decision tree model.

In the C5.0 Rule Model Viewer window, click Generate…Rule Set 

Figure 3.12 Generate Ruleset Dialog


Note that the default Rule set name appends the letters “RS” to the output field name. You may specify whether you want the C5.0 Ruleset node to appear in the Stream Canvas (Canvas), the generated Models palette (GM palette), or both. You may also change the name of the rule set and lower limits on support (percentage of records having the particular values on the input fields) and confidence (accuracy) of the produced rules (percentage of records having the particular value for the output field, given values for the input fields).

Set Create node on: to GM Palette
Click OK

Figure 3.13 Generated C5.0 Rule Set Node

Click File…Close to close the C5.0 Rule browser window
Right-click the C5.0 Rule Set node named CHURNEDRS in the generated Models palette in the Manager, then click Browse



Figure 3.14 Browsing the C5.0 Generated Rule Set

Click the All button to unfold

Click Show or hide instance and confidence figures button in the toolbar 

The numbered rules now expand as shown below.


Figure 3.15 Fully Expanded C5.0 Generated Rule Set

For example, Rule #1 (Current) has this logic: If a person makes more than 4.976 minutes of local calls a month, is Male, makes less than or equal to 0.905 minutes of International calls, is less than or equal to 29 years old, and has an estimated income greater than 38,950.50, then we would predict Current. This form of the rules allows you to focus on a particular conclusion rather than having to view the entire tree.
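Rule #1 translates directly into executable IF…THEN logic. A hypothetical sketch using the thresholds quoted above (the parameter names are illustrative, not Modeler field names):

```python
def rule_1(local, sex, international, age, est_income):
    """Rule #1 (Current): fires only when every antecedent condition holds;
    returns None when the rule does not apply."""
    if (local > 4.976 and sex == "M" and international <= 0.905
            and age <= 29 and est_income > 38950.50):
        return "Current"
    return None  # another rule, or the default prediction, would apply
```

A rule set is simply a collection of such tests grouped by the conclusion they predict.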

If the Rule Set is added to the stream, a Settings tab will become available that allows you to export the rule set in SQL format, which permits the rules to be directly applied to a database.

Click File…Close to close the Rule set browser window

3.6 Understanding the Rule and Determining Accuracy

The predictive accuracy of the rule induction model is not given directly within the C5.0 model node. To get that information, you can use an Analysis node. However, at this stage we will use Matrix nodes and Evaluation Charts to determine how good the model is.


Creating a Data Table Containing Predicted Values

We use the Table node to examine the predictions from the C5.0 model.

Place a Table node from the Output palette below the generated C5.0 Rule model named CHURNED
Connect the generated C5.0 Rule model named CHURNED to the Table node

Right-click the Table node, then click Run and scroll to the right in the table

Figure 3.16 Two New Fields Generated by the C5.0 Rule Node

Two new columns appear in the data table, $C-CHURNED and $CC-CHURNED. The first represents the predicted value for each record and the second the confidence value for the prediction.

Click File…Close to close the Table output window

Comparing Predicted to Actual Values

We will view a matrix (crosstab table) to see where the predictions were more, and less, correct, and then we evaluate the model graphically with a gains chart.

Place two Select nodes from the Records palette, one to the lower right of the generated C5.0 node named CHURNED, and one to the lower left
Connect the generated C5.0 node named CHURNED to each Select node

First we will edit the Select node on the left that we will use to select the Training sample cases:


Double-click on the Select node on the left to edit it

Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button
Click the Select from existing field values button and insert the value 1_Training
Click OK, and then click OK again to close the dialog

Figure 3.17 Completed Selection for the Training Partition

Now we will edit the Select node on the right to select the Testing sample cases:

Double-click on the Select node on the right to edit it

Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the = (equal sign) button
Click the Select from existing field values button and insert the value 2_Testing
Click OK, and then click OK again to close the dialog

Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:

Place a Matrix node from the Output palette below the Select node
Connect the Matrix node to the Select node
Double-click the Matrix node to edit it
Put CHURNED in the Rows:
Put $C-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option
Click on the Output tab and custom name the Matrix node for the Training sample as Training and the Testing sample as Testing (this will make it easier to keep track of which output we are looking at)
Click OK


For each actual churned category, the Percentage of row choice will display the percentage of records predicted into each of the target categories.

Run each Matrix node

Figure 3.18 Matrix Output for the Training and Testing Samples

Looking at the Training sample results, the model predicts about 78.7% of the Current category correctly, 100% of the Involuntary Leavers, and 97.0% of the Voluntary Leavers correctly. The results with the testing sample compare favorably, which suggests that the model will perform well with new data.

Click File…Close to close the Matrix windows
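The Matrix node’s output amounts to a crosstab of actual by predicted categories, with each row converted to percentages. A toy re-creation of that calculation (illustrative only; Modeler does this internally):

```python
from collections import defaultdict

def row_percentage_matrix(actual, predicted):
    """Cross-tabulate actual vs. predicted values, then express each row
    as percentages, mirroring the Matrix node's 'Percentage of row' option."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return {a: {p: 100.0 * n / sum(row.values()) for p, n in row.items()}
            for a, row in counts.items()}
```

Each row then reads as "of the records actually in this category, what percentage were predicted into each category," which is exactly how the 78.7%/100%/97.0% figures above are obtained.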

Evaluation Chart Node

The Evaluation Chart node offers an easy way to evaluate and compare predictive models in order to choose the best model for your application. Evaluation charts show how models perform in predicting particular outcomes. They work by sorting records based on the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and then plotting the value of a criterion for each quantile, from highest to lowest.
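Those mechanics can be sketched directly: rank records by the model’s confidence that they are hits, cut them into equal-sized quantiles, and accumulate the share of all hits captured. An illustrative sketch, not the Evaluation node’s exact procedure:

```python
def cumulative_gains(confidences, is_hit, n_quantiles=10):
    """Return the cumulative percentage of total hits captured in each
    quantile when records are ranked by descending confidence."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    total_hits = sum(is_hit)
    size = len(order) // n_quantiles
    gains, captured = [], 0
    for q in range(n_quantiles):
        # the last quantile absorbs any remainder records
        chunk = order[q * size:(q + 1) * size] if q < n_quantiles - 1 else order[q * size:]
        captured += sum(is_hit[i] for i in chunk)
        gains.append(100.0 * captured / total_hits)
    return gains
```

A well-ordered model concentrates hits in the early quantiles, so the resulting curve rises steeply above the diagonal baseline.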

To produce a gains chart for the Current group:

Place an Evaluation chart node from the Graphs palette to the right of the generated C5.0 Rule node named CHURNED
Connect the generated C5.0 Rule node named CHURNED to the Evaluation chart node

Outcomes are handled by defining a specific value or range of values as a hit. Hits usually indicate success of some sort (such as a sale to a customer) or an event of interest (such as someone given credit being a good credit risk). Flag output fields are straightforward; by default, hits correspond to true values. For Set output fields, by default the first value in the set defines a hit. For the churn data, the first value for the CHURNED field is Current. To specify a different value as the hit value, use the Options tab of the Evaluation node to specify the target value in the User defined hit group. There are five types of evaluation charts, each of which emphasizes a different evaluation criterion. Here we discuss Gains and Lift charts. For information about the others, which include Profit and ROI charts, see the Modeler 14 Modeling Nodes documentation.


Figure 3.19 Evaluation Chart Dialog

Gains are defined as the proportion of total hits that occurs in each quantile. We will examine the gains when the data are ordered from those most likely to those least likely to be in the Current category (based on the confidence of the model prediction).

The Chart type option supports five chart types, with Gains chart being the default. If the Profit or ROI chart type is selected, then the appropriate options (cost, revenue and record weight values) become active so information can be entered. The charts are cumulative by default (see the Cumulative plot check box), which is helpful in evaluating such business questions as “how will we do if we make the offer to the top X% of the prospects?” The granularity of the chart (number of points plotted) is controlled by the Plot drop-down list, and the Percentiles choice will calculate 100 values (one for each percentile from 1 to 100). For small data files or business situations in which you can only contact customers in large blocks (say some number of groups, each representing 5% of customers, will be contacted through direct mail), the plot granularity might be decreased (to deciles (10 equal-sized groups) or vingtiles (20 equal-sized groups)).

A baseline is quite useful, since it indicates what the business outcome value (here gains) would be if the model predicted at the chance level. The Include best line option will add a line corresponding to


a perfect prediction model, representing the theoretically best possible result applied to the data, where hits = 100% of the cases.

The Separate by partition option in the node provides an easy and convenient way to validate the model by displaying not only the results of the model using the training data but also, in a separate chart, how well it performed with the testing or holdout data. Of course, this assumes that you made use of the Partition node to develop the model.

Click the Include best line checkbox (not shown)
Click Run

Figure 3.20 Gains Chart of the Current Customer Group

The vertical axis of the gains chart is the cumulative percentage of the hits, while the horizontal axis represents the ordered (by model prediction and confidence) percentile groups. The diagonal line presents the base rate, that is, what we expect if the model is predicting the outcome at the chance level. The upper line (labeled $BEST-CHURNED) represents results if a perfect model were applied to the data, and the middle line (labeled $C-CHURNED) displays the model results. The three lines connect at the extreme [(0, 0) and (100, 100)] points. This is because if either no records or all records are considered, the percentage of hits for the base rate, best model, and actual model are identical. The advantage of the model is reflected in the degree to which the model-based line exceeds the base-rate line for intermediate values in the plot, and the area for model improvement is the discrepancy between the model line and the Best (perfect model) line. If the model line is steep for early percentiles, relative to the base rate, then the hits tend to concentrate in those percentile groups of data. At the practical level, this would mean for our data that many of the current customers could be found within a small portion of the ordered sample.

The gains line ($C-CHURNED) in the Training data chart rises steeply relative to the baseline, indicating the hits for the Current outcome are concentrated in the percentiles predicted most likely to contain current customers according to the model. Just over 75% of the hits were contained within the first 40 percentiles. The gains line in the chart using Testing data is very similar, which suggests that this model can be reliably used to predict current customers with new data.


You can hover over a line and a popup will display the value at that point, as shown in Figure 3.21. 

Figure 3.21 Gains Chart for the Current Customer Group

Click File…Close to close the Evaluation chart window

Changing Target Category for Evaluation Charts

By default, an Evaluation chart will use the first target outcome category to define a hit. To change the target category on which the chart is based, we must specify the condition for a User defined hit in the Options tab of the Evaluation node. To create a gains chart in which a hit is based on the Voluntary Leaver category:

Double-click the Evaluation node
Click the Options tab
Click the User defined hit checkbox
Click the Expression Builder button in the User defined hit group
Click @Functions on the functions category drop-down list
Select @TARGET on the functions list, and click the Insert button
Click the = button
Right-click CHURNED in the Fields list box, then select Field Values
Select Vol, and then click the Insert button


Figure 3.22 Specifying the Hit Condition within the Expression Builder

The condition (Vol as the target value) defining a hit was created using the Expression Builder.

Click OK 


Figure 3.23 Defining the Hit Condition for CHURNED

In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.

Click Run


Figure 3.24 Gains Chart for the Voluntary Leaver Category (Interaction Enabled)

The gains chart for the Voluntary Leavers category is better (steeper in the early percentiles) than that for the Current category. For example, the top 40 model-ordered percentiles in the Training data chart contain over 85% of the Voluntary Leavers, as opposed to the same chart when we looked at Current Customers (that value was 75.3%).

Click File…Close to close the Evaluation chart window

To save this stream for later work:

Click File…Save Stream As
Move to the c:\Train\ModelerPredModel directory
Type C5 in the File name: text box
Click Save

3.7 Understanding the Most Important Factors in Prediction

An advantage of rule induction models over neural networks is that the decision tree form makes it clear which fields are having an impact on the predicted field. There is no great need to use alternative methods such as web plots and histograms to understand how the rule is working. Of course, you may still use the techniques we will demonstrate in Lesson 4 for neural networks to help understand the model, but they often are not needed.

As we noted, predictor importance is calculated on the testing data partition. In addition to using that information, viewing the tree also provides information about importance, as the most important fields in the predictions can be thought of as those that divide the tree in its earliest stages. Thus in


this example the most important field in predicting churn is LOCAL. Once the model divided the data into two groups, those who do more local calling and those who do less, it will focus separately on each group to determine which predictors determine whether or not a customer will remain loyal to the company, voluntarily leave, or even be dropped as a customer. The process continues until the nodes either cannot be refined any further or stopping rules are reached, which causes tree growth to stop.

In Figure 3.25 we show the C5.0 model node with an expanded tree along with the predictor importance chart. The order in which splits occur in the tree parallels the relative importance of the fields.

Figure 3.25 Expanded Tree and Predictor Importance Chart

3.8 Further Topics on C5.0 Modeling

Now that we have introduced you to the basics of C5.0 modeling, we will discuss the Expert options, which allow you to refine your model even further. This time, we will use an existing stream rather than building one from scratch.

Click File…Open Stream
Double-click on DecisionTrees.str

The simple options within the C5.0 node allow you to use Boosting, specify the Expected noise (%), and choose whether the resulting tree favors Accuracy or Generality. Noisy (inconsistent) data contain records in which the same, or very similar, predictor values lead to different target values. While C5.0 will handle noise automatically, if you have an estimate of it, the method can take this into account (see the section on Minimum Records and Pruning for more information on the effect of specifying a noise value).


The expert mode allows you to fine-tune the rule induction process.

Double-click on the C5.0 node named CHURNED
Click the Model tab
Click the Expert Mode option button

Figure 3.26 Expert Options Available within the C5.0 Dialog (Model Tab)

When constructing a decision tree, the aim is to split the data into subsets that are, or seem to be heading toward, single-class collections of records on the target field. That is, ideally the terminal nodes contain only one category of the output field. At each point of the tree, the algorithm could potentially partition the data based on any one of the input fields. To decide which is the “best” way to partition the data—to find a compact decision tree that is consistent with the data—the algorithms construct some form of test that usually works on the basis of maximizing a local measure of progress.

Gain Ratio Selection Criterion

Within C5.0, the Gain Ratio criterion, based on information theory, is used when deciding how to partition the data.

In the following sections, we will describe, in general terms, how this criterion measures progress. However, the reader is referred to C4.5: Programs for Machine Learning by J. Ross Quinlan (Morgan Kaufmann, San Mateo, CA, 1993) for a more detailed explanation of the original algorithm.


The criterion used in the predecessors to C5.0 selected the partition that maximizes the information gain. Information gained by partitioning the data based on the categories of field X (an input or predictor field) is measured by:

GAIN(X) = INFO(DATA) – INFO_X(DATA)

where INFO(DATA) represents the average information needed to identify the class (outcome category) of a record within the total data, and INFO_X(DATA) represents the expected information requirement once the data has been partitioned into each outcome of the current field being tested.

The information theory that underpins the criterion of gain can be given by the statement: “The information conveyed by a message depends on its probability and can be measured in bits as minus the logarithm to the base 2 of that probability. So, if for example there are 8 equally probable messages, the information conveyed by any one of them is –log2(1/8), or 3 bits.” For details on how to calculate these values the reader is referred to Lesson 2 in C4.5: Programs for Machine Learning.

Although the gain criterion gives good results, it has a flaw in that it favors partitions that have a large number of outcomes. Thus a categorical predictor with many values has an advantage over one with few categories. The gain ratio criterion, used in C5.0, rectifies this problem.

The bias in the gain criterion can be rectified by a kind of normalization in which the gain attributable to tests with many outcomes is adjusted. The gain ratio represents the proportion of information generated by dividing the data in the parent node into each of the categories of field X that is useful, i.e., that appears helpful for classification.

GAIN RATIO(X) = GAIN(X) / SPLIT INFO_X(DATA)

where SPLIT INFO_X(DATA) represents the potential information generated by partitioning the data into n outcomes, whereas the information gain measures the information relevant to classification.

The C5.0 algorithm will choose to partition the data based on the outcomes of the field that maximizes the information gain ratio. This maximization is subject to the constraint that the information gain must be large, or at least as great as the average gain over all tests examined. This constraint avoids the instability of the gain criterion when the split is near trivial and the split information is thus small.
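The gain and gain ratio formulas above can be re-implemented in a few lines. An illustrative sketch for a categorical split field (not C5.0’s actual code, which also handles continuous fields, missing values, and the average-gain constraint):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """INFO(DATA): average bits needed to identify a record's class."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(split_values, labels):
    """GAIN RATIO(X) = GAIN(X) / SPLIT INFO_X(DATA) for a categorical field X."""
    n = len(labels)
    groups = Counter(split_values)
    # INFO_X(DATA): expected information after partitioning on X
    info_x = sum((cnt / n) * entropy([l for v, l in zip(split_values, labels)
                                      if v == key])
                 for key, cnt in groups.items())
    gain = entropy(labels) - info_x
    split_info = -sum((cnt / n) * log2(cnt / n) for cnt in groups.values())
    return gain / split_info
```

A split that perfectly separates the classes yields a gain ratio of 1.0, while an uninformative split yields a ratio near 0.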

Two other parameters the expert options allow you to control are the severity of pruning and the minimum number of records per child branch. In the following sections we will introduce each of these in turn and give advice on their settings.

Pruning and Attribute Winnowing Within C5.0

Within C5.0, once the tree has been built, it can be pruned back to create a more general (and less bushy) tree. Within the expert mode, the Pruning severity option allows you to control the extent of the pruning. The higher this number, the more severe the pruning and the more general the resulting tree.

The algorithm used to decide whether a branch should be pruned back toward the parent node is based on comparing the predicted errors for the “sub-tree” (i.e., unpruned branches) with those for the “leaf” (or pruned node). Error estimates for leaves and sub-trees are calculated based on a set of


unseen cases the same size as the training set. The formula used to calculate the predicted error rates for a leaf involves the number of cases within the leaf, the number of these cases that have been incorrectly classified within this leaf, and confidence limits based on the binomial distribution. The reader is referred to Lesson 4 in C4.5: Programs for Machine Learning for a more detailed description of error estimations and pruning in general.
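In the spirit of that estimate, a pessimistic error rate can be approximated as the upper confidence limit of the observed error proportion. The sketch below uses the normal-approximation (Wilson) upper bound with z ≈ 0.69, which roughly corresponds to C4.5’s default 25% confidence factor; it illustrates the idea, not Quinlan’s exact formula:

```python
from math import sqrt

def pessimistic_error(n, errors, z=0.69):
    """Approximate upper confidence limit on a leaf's true error rate, given
    n cases and the number misclassified; larger leaves get tighter bounds."""
    f = errors / n  # observed error proportion in the leaf
    return ((f + z * z / (2 * n)
             + z * sqrt(f / n - f * f / n + z * z / (4 * n * n)))
            / (1 + z * z / n))
```

Pruning then compares this pessimistic estimate for a sub-tree with the estimate obtained by collapsing it to a single leaf, and keeps whichever is lower.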

A second phase of pruning (global pruning) is then applied by default. It prunes further based on the performance of the tree as a whole, rather than at the sub-tree level considered in the first stage of pruning. This option (Use global pruning) can be turned off, which generally results in a larger tree.

After initially analyzing the data, the Winnow attributes option will discard some of the inputs to the model before building the decision tree. This can produce a model that uses fewer input fields yet maintains nearly the same accuracy, which can be an advantage in model deployment. This option can be especially effective when there are many inputs and when inputs are statistically related.

Minimum Records per Child Branch

One other consideration when building a general decision tree is that the terminal nodes within the tree should not be too small in size. Within the C5.0 dialog, you control the Minimum records per child branch, which specifies that at any split point in the tree, at least two sub-trees must cover at least this number of cases. The default is two cases, but increasing this number can be useful for noisy datasets and tends to produce less bushy trees.

How to Use Pruning and Minimum Records per Branch

As previously mentioned, within the C5.0 dialog the Simple mode allows you to specify both the Expected noise (%) and whether the resulting tree favors Accuracy or Generality.

• If the algorithm is set to favor Accuracy, the Pruning severity is set to 75 and the Minimum records per child branch is 2; hence, although the tree is accurate, there is a degree of generality because nodes are not allowed to contain only one record.
• If the algorithm is set to favor Generality, the Pruning severity is set to 85 and the Minimum records per child branch is 5.
• If the Expected noise (%) is used, the Minimum records per child branch is set to half of this value.

Once a tree has been built using the simple options, the expert options may be used to refine the tree in these two common ways.

• If the resulting tree is large and has too many branches, increase the Pruning severity.
• If there is an estimate for the expected proportion of noise (relatively rare in practice), set the Minimum records per child branch to half of this value.

Boosting

C5.0 has a special method for improving its accuracy rate, called boosting. It works by building multiple models in a sequence. The first model is built in the usual way. Then, a second model is built in such a way that it focuses especially on the records that were misclassified by the first model. Then a third model is built to focus on the second model's errors, and so on. Finally, cases are classified by applying the whole set of models to them, using a weighted voting procedure to combine the separate predictions into one overall prediction. Boosting can significantly improve the accuracy of a C5.0 model, but it also requires longer training. The Number of trials option allows you to control how many models are used for the boosted model.
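The weighted voting step can be sketched as follows. The weights shown are illustrative (for example, derived from each component model's accuracy) and are not C5.0's exact weighting scheme; the class labels are made up.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    # predictions: one predicted class per boosted model for a record
    # weights: one voting weight per model (illustrative values)
    totals = defaultdict(float)
    for cls, w in zip(predictions, weights):
        totals[cls] += w
    return max(totals, key=totals.get)

# Two low-weight models say "churn"; one higher-weight model says "stay",
# and its single vote outweighs the other two combined.
print(weighted_vote(["churn", "stay", "churn"], [0.4, 0.9, 0.4]))  # → stay
```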


While boosting might appear to offer something for nothing, there is a price. When model building is complete, more than one tree is used to make predictions. Therefore, there is no simple description of the resulting model, nor of how a single predictor affects the target field. This can be a serious deficiency, so boosting is normally used when the chief goal of an analysis is predictive accuracy, not understanding.

Misclassification Costs

The Costs tab allows you to set misclassification costs. When using a tree to predict a categorical output, you may wish to assign costs to misclassifications (where the tree predicts incorrectly) to bias the model away from "expensive" mistakes. The Misclassifying controls allow you to specify the cost attached to each possible misclassification. The default costs are set at 1.0 to represent that each misclassification is equally costly. When unequal misclassification costs are specified, the resulting trees tend to make fewer expensive misclassifications, usually at the cost of an increased number of the relatively inexpensive misclassifications.
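How unequal costs bias predictions can be seen in a small sketch: with class probabilities fixed, raising the cost of one kind of mistake changes which prediction minimizes expected cost. The class names, probabilities, and cost values below are hypothetical, and this is not the internal C5.0 mechanism.

```python
def min_cost_class(probs, costs):
    # probs: {actual_class: estimated probability} for one record
    # costs: costs[actual][predicted] -- cost of predicting `predicted`
    # when the truth is `actual` (0 on the diagonal)
    def expected_cost(pred):
        return sum(p * costs[actual][pred] for actual, p in probs.items())
    return min(probs, key=expected_cost)

probs = {"churn": 0.3, "stay": 0.7}
equal = {"churn": {"churn": 0, "stay": 1}, "stay": {"churn": 1, "stay": 0}}
costly = {"churn": {"churn": 0, "stay": 5}, "stay": {"churn": 1, "stay": 0}}

# With equal costs the likely class wins; making a missed churner
# 5x as expensive flips the prediction.
print(min_cost_class(probs, equal), min_cost_class(probs, costly))  # → stay churn
```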

Propensity Scores

Importance measures and propensity scores are available from the Analyze tab.

Click Analyze tab

By default, importance scores will be calculated for model evaluation, as we have seen.

There are two check boxes to request raw and adjusted propensity scores. Propensity scores are used for flag fields only, and they indicate the likelihood of the True value defined for the field.

Raw propensity scores are derived from the model based on the training data only. If the model predicts the true value (will respond), then the propensity is the same as P, where P is the probability of the prediction (often the confidence). If the model predicts the false value, then the propensity is calculated as (1 – P).

Propensity scores differ from confidence scores, which apply to the model prediction, whether true or false. In cases where the prediction is false, for example, a high confidence actually means a high likelihood not to respond. Propensity scores overcome this limitation to allow easier comparison across all records. For example, a no prediction with a confidence of 0.65 translates to a raw propensity of 0.35 (or 1 – 0.65).
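The raw propensity calculation is simple enough to sketch; the value codes ("T"/"F") below are illustrative stand-ins for a flag target's defined values.

```python
def raw_propensity(prediction, confidence, true_value="T"):
    # Raw propensity for a flag target: the likelihood of the True
    # value, whichever value the model actually predicted.
    return confidence if prediction == true_value else 1.0 - confidence

# A "no" prediction with confidence 0.65 becomes a propensity of 0.35,
# directly comparable with records predicted "yes".
print(round(raw_propensity("F", 0.65), 2))  # → 0.35
```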

Raw propensities are based purely on estimates given by the model, which may be overfitted, leading to over-optimistic estimates of propensity. Adjusted propensities attempt to compensate by looking at how the model performs on the test or validation partitions and adjusting the propensities to give a better estimate accordingly. A partition is required to calculate adjusted propensities.


Figure 3.27 Analyze Tab Options

Close the C5.0 modeling node

3.9 Modeling Categorical Outputs with Other Decision Tree Algorithms

As we saw in Table 3.1, C5.0 can only be used to model categorical targets. QUEST has the same limitation. The other two algorithms, CHAID and C&R Tree, can be used to model both categorical and continuous targets. Before we discuss how to create models with continuous targets, let's take a look at the various options for modeling categorical targets in CHAID, C&R Tree, and QUEST. You can certainly try one of these techniques on the churn.txt data to compare to C5.0, but for the most part, the output format is very similar.

3.10 Modeling Categorical Outputs with CHAID

First, we’ll look at the CHAID node and the options available there.

Double-click the CHAID node named CHURNED
Click the Fields tab (if necessary)


Figure 3.28 CHAID Node Dialog Fields Tab

There are four tabs to control various aspects of the modeling process. The Fields tab, as with many modeling nodes, allows you to select the target field and the predictors, or inputs. Here, the fields are already set because of roles assigned in the Type node.

Click the Build Options tab


Figure 3.29 Objective Settings in Build Options Tab

The Build Options tab enables you to control six different areas, including overall objectives, the specific CHAID algorithm, tree depth, stopping rules, costs of making an error, how to combine ensembles of models, and advanced statistical specifications, such as the significance level used for splitting a node.

The default objective is to build a single tree. You can set one of two modes: Generate model builds the model; Launch interactive session launches the Interactive Tree feature, which we will discuss in a later section.

Multiple trees can be built by using bagging or boosting (see Lesson 4 for a discussion of these options). If you have a server connection, CHAID can create models for very large datasets by dividing the data into smaller data blocks and building a model on each block. The most accurate models are then automatically selected and combined into a single final model.

Click the Basics settings


Figure 3.30 Basics Settings in Build Options Tab

The options on the Basics panel enable you to choose the algorithm. For a single CHAID tree model, there are two methods, standard or exhaustive CHAID. The latter is a modification of CHAID designed to address some of its weaknesses. Exhaustive CHAID examines more possible splits for a predictor, thus improving the chances of finding the best predictor (at the cost of additional processing time).

The Maximum tree depth specifies the maximum number of levels below the root node. The default depth is 5. Since CHAID doesn't prune a bushy tree, the user can specify the depth with the Custom setting. This setting should depend on the size of the data file, the number of predictors, and the complexity of the desired tree.

Click the Stopping Rules settings


Figure 3.31 Stopping Rules Settings in Build Options Tab

The options on the Stopping Rules panel enable you to specify the rules to be applied to cease splitting nodes in the tree.

You set the minimum branch sizes to prevent splits that would create very small subgroups. These can be specified either as an absolute number of records or as a percentage of the total number of records. By default, a parent branch to be split must contain at least 2% of the records; a child branch must contain at least 1%. It is often more convenient to work with the absolute number of records rather than a percent, but in either case, you will very likely modify these values to get a smaller, or larger, tree.

Click the Ensembles settings


Figure 3.32 Ensembles Settings in Build Options Tab

These settings determine the behavior of ensembling that occurs when boosting, bagging, or very large datasets are requested in Objectives. Options that do not apply to the selected objective are ignored.

For bagging and very large datasets, rules must be used to combine the predictions from two or more models. The rules differ depending upon whether the target is categorical or continuous, with voting used for the former and the mean of the predictions for the latter. Other options are available. Boosting always uses a weighted majority vote to score categorical targets and a weighted median to score continuous targets.

When using boosting or bagging, by default 10 models or bootstrap samples, respectively, are created. This number is usually sufficient but can be changed.
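The two default combining rules for bagging can be sketched as follows; the component predictions are made up, and this is a simplification of the options the Ensembles panel exposes.

```python
from collections import Counter
from statistics import mean

def combine(predictions, target_is_categorical):
    # Majority vote for a categorical target, mean of the component
    # predictions for a continuous one.
    if target_is_categorical:
        return Counter(predictions).most_common(1)[0][0]
    return mean(predictions)

print(combine(["stay", "churn", "stay"], True))   # → stay
print(combine([4200.0, 4500.0, 4800.0], False))   # → 4500.0
```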

Click the Advanced settings


Figure 3.33 Advanced Settings in Build Options Tab

To select the predictor for a split, CHAID uses a chi-square test in the table defined at each node by a predictor and the target field. CHAID chooses the predictor that is the most significant (smallest p value). If that predictor has more than 2 categories, CHAID compares them and collapses together those categories that show no differences in the target. This category merging process stops when all remaining categories differ at the specified testing level (Significance level for splitting). It is possible for CHAID to split merged categories, controlled by the Allow resplitting of merged categories check box. (Note that a categorical predictor with more than 127 discrete categories will be ignored by CHAID.) There is a comparable significance level for merging.
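The per-predictor test can be sketched with a Pearson chi-square statistic computed from the predictor-by-target count table. A real implementation would convert the statistic to a p value using the appropriate degrees of freedom, which is omitted here; the counts are made up.

```python
def chi_square(table):
    # Pearson chi-square statistic for a count table (list of rows):
    # sum over cells of (observed - expected)^2 / expected.
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(len(table)) for j in range(len(table[0])))

# Counts of churned vs. current customers across two predictor
# categories: a strong association yields a large statistic.
print(round(chi_square([[30, 10], [10, 30]]), 2))  # → 20.0
```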

For continuous predictors, the values are binned into a maximum of 10 groups, and then the same tabular procedure is followed as for flag and categorical types.

Because many chi-square tests are performed, CHAID automatically adjusts its significance values when testing the predictors. These are called Bonferroni adjustments and are based on the number of tests. You should normally leave this option turned on; in small samples or with only a few predictors, you could turn it off to increase the power of your analysis.
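The basic Bonferroni idea can be sketched in one line; CHAID's actual multiplier is derived from the number of possible category reductions, so treat this only as the underlying concept.

```python
def bonferroni_adjust(p_value, n_tests):
    # Inflate a raw p value by the number of tests performed, capped at 1,
    # so that a result must be stronger to remain significant.
    return min(1.0, p_value * n_tests)

# A raw p of 0.02 across 4 tests is no longer significant at the
# conventional 0.05 level after adjustment.
print(round(bonferroni_adjust(0.02, 4), 2))  # → 0.08
```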

The Overfit prevention set (%) setting controls the percent of records that are internally separated into an overfit prevention set. This is an independent set of data used to track errors during training in order to prevent the tree from modeling chance variation in the data. The default is 30%. This setting is unrelated to the separation of data before modeling into training and testing partitions. The modeling is done only with the training data, and thus the separation into an overfitting set is done only within the training data.

Note

The overfit prevention data split is not used by CHAID, but instead by C&R Tree and QUEST. For those decision tree methods, the overfit set is used during tree pruning.

Unlike other models, CHAID uses missing, or blank, values when growing a tree. All blank values are placed in a missing category that is treated like any other category for nominal predictors. For ordinal and continuous predictors, the process of handling blanks is a bit different, but the effect is the same (see the PASW Modeler 14.0 Algorithms Guide for detailed information). If you don't want to include blank data in a model, it should be removed beforehand.

3.11 Modeling Categorical Outputs with C&R Tree

We move next to the C&R Tree node to predict a categorical output field.

Click Cancel to close the CHAID dialog

Double-click on the C&R Tree node named CHURNED
Click the Build Options tab

The same settings options are available for C&R Tree as for CHAID in the Objective settings. It is also possible, as with CHAID, to grow a tree interactively.


Figure 3.34 Classification and Regression Trees (C&R Tree) Build Options Tab

Click the Basics settings


Figure 3.35 Basics Settings for Build Options Tab

The default tree depth, as with CHAID, is five levels below the root node.

Pruning within C&RT

The Prune tree to avoid overfitting check box will invoke pruning. The Maximum difference in risk (in Standard Errors) option allows C&R Tree to select the simplest tree whose risk estimate (which is the proportion of errors the tree model makes when equal misclassification costs and empirical priors are used) is close to that of the subtree with the smallest risk. The value in the text box indicates how many standard errors of difference are allowed in the risk estimate between the final tree and the tree with the smallest risk. As this value is increased, the pruning becomes more severe.
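The standard-error rule this option controls can be sketched as follows; the candidate subtrees and their risk estimates below are invented for illustration.

```python
def select_pruned_tree(candidates, max_se_diff=1.0):
    # candidates: (num_terminal_nodes, risk_estimate, std_error) for each
    # pruned subtree. Pick the smallest tree whose risk is within
    # `max_se_diff` standard errors of the minimum risk.
    best_risk, best_se = min((r, se) for _, r, se in candidates)
    eligible = [c for c in candidates
                if c[1] <= best_risk + max_se_diff * best_se]
    return min(eligible)[0]          # fewest terminal nodes wins

# The 9-node tree's risk (0.22) is within one SE of the minimum (0.21),
# so the simpler tree is preferred over the 17-node one.
trees = [(17, 0.21, 0.02), (9, 0.22, 0.02), (4, 0.29, 0.03)]
print(select_pruned_tree(trees))     # → 9
```

Raising `max_se_diff` widens the eligible set, which is why a larger value produces more severe pruning.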

Surrogates

Surrogates are used to deal with missing values on the predictors. For each split in the tree, C&R Tree identifies the input fields (the surrogates) that are most similar statistically to the selected split field. When a record to be classified has a missing value for a split field, its value on a surrogate field can be used to make the split.

The Maximum surrogates option controls how many surrogate predictor fields will be stored at each node. Retaining more surrogates slows processing, and the default (5) is usually adequate.


The Stopping Rules settings are identical to those for CHAID, so we won't review them here. We do note that, unlike CHAID, while the default values may seem small, it is important to keep in mind that pruning is an important component of C&R Tree, and it can trim back some of the small branches.

Click on Costs & Priors settings

Figure 3.36 Costs & Priors Settings for Build Options Tab

The misclassification costs are identical to those for the other decision tree models we have discussed.

Priors in C&RT

Historically, priors have been used to incorporate knowledge about the base population rates (here of the output field categories) into the analysis. Breiman et al. (1984) point out that if one target category has twice the prior probability of occurring than another, it effectively doubles the cost of misclassifying a case from the first category, since it is counted twice. Thus by specifying a larger prior probability for a response category, you can effectively increase the cost of its misclassification. Since priors are only given at the level of the base rate for the output field categories (with J categories there are J prior probabilities), use of them implies that the misclassification of a record actually in output category j has the same cost regardless of the category into which it is misclassified (that is, C(j) = C(k|j), for all k not equal to j).

By default, the prior probabilities are set to match the probabilities found in the training data. The Equal for all classes option allows you to set all priors equal (this might be used if you know your sample does not represent the population and you don't know the population distribution on the target), and you can enter prior probabilities (Custom option). The prior probabilities should sum to 1, and if you enter custom priors that reflect the desired proportions but do not sum to 1, the Normalize button will adjust them. Finally, priors can be adjusted based on misclassification costs (see Breiman's comment above) entered in the Costs tab.
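The normalization is a straightforward rescaling, sketched below; the category names and entered values are illustrative.

```python
def normalize_priors(priors):
    # Rescale custom priors so they sum to 1, as the Normalize
    # button does.
    total = sum(priors.values())
    return {cls: p / total for cls, p in priors.items()}

# Entering 2 and 3 as desired proportions yields priors of 0.4 and 0.6.
print(normalize_priors({"current": 2.0, "churned": 3.0}))  # → {'current': 0.4, 'churned': 0.6}
```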

The Ensembles settings are identical to those for CHAID, so we won't review them.

Click the Advanced settings

Figure 3.37 Advanced Settings in Build Options Tab


Impurity Criterion

The criterion that guides tree growth in C&R Tree with a categorical output field is called impurity. It captures the degree to which responses within a node are concentrated into a single output category. A pure node is one in which all cases fall into a single output category, while a node with the maximum impurity value would have the same number of cases in each output category. Impurity can be defined in a number of ways, and two alternatives are available within the C&R Tree procedure. The default, and more popular, measure is the Gini measure of dispersion. If P(t)_i is the proportion of cases in node t that are in output category i, then the Gini measure is:

Gini(t) = 1 − Σ_i [P(t)_i]^2

Alternatively:

Gini(t) = Σ_{i≠j} P(t)_i P(t)_j

If two nodes have different distributions across three response categories (for example (1,0,0) and (1/3, 1/3, 1/3)), the one with the greater concentration of responses in a single category (the first one) will have the lower impurity value: for (1,0,0) the impurity is 1 − (1^2 + 0^2 + 0^2), or 0; for (1/3, 1/3, 1/3) the impurity is 1 − ((1/3)^2 + (1/3)^2 + (1/3)^2), or .667. The Gini measure ranges between 0 and 1, although the maximum value is a function of the number of output categories.

Thus far we have defined impurity for a single node. It can be defined for a tree as the weighted average of the impurity values from the terminal nodes. When a node is split into two child nodes, the impurity for that branch is simply the weighted average of their impurities. Thus if two child nodes resulting from a split have the same number of cases and their individual impurities are .4 and .6, their combined impurity is .5*.4 + .5*.6, or .5. When growing the tree, C&R Tree splits a node on the predictor that produces the greatest reduction in impurity (comparing the impurity of the parent node to the impurity of the child nodes). This change in impurity from a parent node to its child nodes is called the improvement, and under Expert options you can specify the minimum change in impurity for tree growth to continue. The default value is .0001, and if you are considering modifying this value, you might calculate the impurity at the root node (the overall output proportions) to establish a point of reference.
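The Gini and improvement computations can be sketched directly from per-category case counts; the counts below are illustrative.

```python
def gini(counts):
    # Gini impurity of a node from its per-category case counts.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def improvement(parent, children):
    # Reduction in impurity from splitting `parent` into `children`
    # (each a list of per-category counts), weighting each child by
    # its share of the parent's cases.
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

# The worked values from the text: a pure node has impurity 0, an
# evenly split three-category node has impurity 2/3.
print(round(gini([10, 0, 0]), 3), round(gini([5, 5, 5]), 3))  # → 0.0 0.667
```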

The problems with using impurity as a criterion for tree growth are that you can almost always reduce impurity by enlarging the tree, and any tree will have 0 impurity if it is grown large enough (if every node has a single case, impurity is 0). To address these difficulties, the developers of the classification and regression tree methodology (see Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Wadsworth, 1984) developed a pruning method based on a cross-validated cost-complexity measure (as discussed above).

By default, the Gini measure of dispersion is used. Breiman and colleagues proposed Twoing as an alternative impurity measure. If the target has more than two output categories, Twoing will create binary splits of the response categories in order to calculate impurity. Each possible combination of output categories split into two groups will be separately evaluated for impurity with each predictor, and the best split across predictors and target category combinations is chosen. Ordered Twoing (inactive because the target field is nominal, not ordinal) applies Twoing as described above, except that the output category combinations are limited to those consistent with the rank order of the categories. For example, if there are five output categories numbered 1, 2, 3, 4, and 5, Ordered Twoing would examine the (1,2) (3,4,5) split, but the (1,4) (2,3,5) split would not be considered, since only contiguous categories can be grouped together.
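The contiguity constraint is easy to express; a small sketch that enumerates the groupings Ordered Twoing would consider for a set of ordered categories:

```python
def ordered_twoing_splits(categories):
    # Binary target-category groupings Ordered Twoing would consider:
    # only contiguous runs of ordered categories may be grouped.
    return [(categories[:i], categories[i:])
            for i in range(1, len(categories))]

# Includes ([1, 2], [3, 4, 5]) but never a split like (1,4) vs (2,3,5).
print(ordered_twoing_splits([1, 2, 3, 4, 5]))
```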

Of the methods, the Gini measure is most commonly used.

The option to prevent model overfitting is available here, as it was with CHAID.

3.12 Modeling Categorical Outputs with QUEST

We turn next to the QUEST node for predicting a categorical field.

Click Cancel to close the C&R Tree dialog
Double-click on the QUEST node named CHURNED
Click the Build Options tab

QUEST (Quick, Unbiased, Efficient Statistical Tree) is a binary classification method that was developed, in part, to reduce the processing time required for large C&R Tree analyses with many fields and/or records. It also tries to reduce the tendency in decision tree methods to favor predictors that allow more splits (see Loh and Shih, 1997).


Figure 3.38 QUEST Build Options Tab

There are the same settings areas as for C&R Tree. In the Objectives settings, the same selections are available as in the other methods.

Click the Advanced settings


Figure 3.39 QUEST Advanced Settings in Build Options Tab

QUEST separates the tasks of predictor selection and splitting at a node. Like CHAID, it uses statistical tests to pick a predictor at a node. For each continuous or ordinal predictor, QUEST performs an analysis of variance, and then uses the significance of the F test as a criterion. For nominal predictors (measurement level flag and nominal), chi-square tests are performed. The predictor with the smallest significance value from either the F or chi-square test is selected. Although not evident from the dialog box options, Bonferroni adjustments are made, as with CHAID (not under user control).
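The F criterion for a continuous predictor can be sketched as a one-way ANOVA statistic computed across the target's groups. Converting the statistic to the significance value QUEST actually compares is omitted here, and the ages are made up.

```python
from statistics import mean

def anova_f(groups):
    # One-way ANOVA F statistic: between-group variance relative to
    # within-group variance of a continuous predictor, with the groups
    # formed by the target categories.
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Ages of churners vs. stayers: well-separated groups give a large F,
# and hence a small p value for the predictor.
print(round(anova_f([[22, 25, 24], [41, 44, 45]]), 2))  # → 174.05
```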

QUEST is more efficient than C&R Tree because not all splits are examined, and category combinations are not tested when evaluating a predictor for selection.

After selecting a predictor, QUEST determines how the field should be split (into two groups) by doing a quadratic discriminant analysis, using the selected predictor on groups formed by the target categories. The details are rather complex and can be found in Loh and Shih (1997). The measurement level of the predictor will determine how it is treated in this method. While quadratic discriminant analysis allows for unequal variances in the groups and makes one fewer assumption than does linear discriminant analysis, it does assume that the distribution of the data is multivariate normal, which is unlikely for predictors that are flags and sets.


QUEST uses an alpha (significance) value of .05 for splitting in the discriminant analysis, and you can modify this setting. For large files you may wish to reduce alpha to .01, for example.

Pruning, Stopping, Surrogates

QUEST follows the same pruning rule as does C&R Tree, using a cost-complexity measure that takes into account the increase in error if a branch is pruned, using a standard error rule. The Stopping choices are the same as for CHAID and C&R Tree. QUEST also uses surrogates to allow predictions for missing values, employing the same methodology as C&R Tree.

3.13 Predicting Continuous Fields

Two of the decision tree models, CHAID and C&R Tree, can predict a continuous field. We will briefly review the options available for this type of target and then run an example.

Continuous Outputs with C&R Tree

When a continuous field is used as the target field in C&R Tree, the algorithm runs in the way described earlier in this lesson. For a continuous output field (the regression trees portion of the algorithm), the impurity criterion is still used but is based on a measure appropriate for a continuous field: within-node variance. It captures the degree to which records within a node are concentrated around a single value. A pure node is one in which all cases have the same output value, while a node with a large impurity value (in principle, the theoretical maximum would be infinity) would contain cases with very diverse values on the output field. For a single node, the variance (or standard deviation squared) of the output field is calculated from the records within the node. When generating a prediction, the algorithm uses the average value of the target field within the terminal node.
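Within-node variance and the node-mean prediction can be sketched from the target values of the records reaching a node; the claim amounts below are invented.

```python
from statistics import mean, pvariance

def node_summary(values):
    # Impurity (within-node variance) and prediction (node mean) for a
    # regression-tree node; `values` are the target values of the
    # records that reach the node.
    return pvariance(values), mean(values)

claims = [4100.0, 4300.0, 4200.0, 4400.0]
impurity, prediction = node_summary(claims)
print(impurity, prediction)  # → 12500.0 4250.0
```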

Click File…Open Stream
Double-click on CRTree.str

Figure 3.40 PASW Modeler Stream with C&R Tree Model Node (Continuous Output Field)

The data file consists of patient admissions to a hospital. The goal is to build a model predicting insurance claim amount based on hospital length of stay, severity of illness group, and patient age.

Right-click on the Table node and select Run
Review the data and then close the Table window
Double-click on the C&R Tree node (labeled CLAIM)
Click on the Build Options tab


The Build Options tab for the C&R Tree dialog with a continuous output is the same as for a categorical output. However, if we explore the various settings, we find that some settings that are not relevant for a continuous output field (priors and misclassification costs) are inactive.

Otherwise, setting up the model-building parameters, and executing the tree, is identical to the process for a categorical target. The generated model will display the predicted mean for the insurance claim amount in each terminal node.

Figure 3.41 C&R Tree Build Options Tab for a Continuous Output Field

We will illustrate the use of interactive tree-building as well in this example.

Click the Launch interactive session button
Click Run


Figure 3.42 Interactive Tree Builder to Predict Claim with C&R Tree

Because the target field is continuous, different statistics appear in the nodes. The nodes display the mean, number of cases, and percentage of the sample. Thus, the mean insurance claim for persons in this data file is slightly over $4680, which we would predict for each person if we didn't know how long they stayed in the hospital, their age, or how severely ill they were. Once the tree is grown, we should get some insight into the characteristics of patients that separate high insurance claims from low ones.

As before, we are using about 70% of the data to fit the model, and reserve the remainder to test overfitting. For C&R Tree, the overfit data is used to prune the tree.

We could grow the tree one level at a time. However, if we do so, the tree then can't be easily pruned, and pruning is an important part of creating a tree that will perform well on new data.

Right-click Node 0 and select Grow Tree and Prune


Figure 3.43 Interactive Tree Builder Fully Grown and Pruned Tree

The results indicate that LOS (Length of Stay) is the best predictor. The average insurance claim for persons who stay more than 2 days in the hospital is $5,637.07, while the mean claim for persons who spent 1 or 2 days in the hospital is $4,369.51. This certainly makes sense.

The predictions are further refined by the split on ASG (severity of illness) for those with a length of stay of 1 or 2 days. This field is coded 0, 1, and 2. Those with the lowest severity (0) in Node 3 have a mean insurance claim of $4,194.15; those who have more severe illnesses have a mean claim of $4,659.76.
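The way C&R Tree chooses such splits for a continuous target can be sketched in a few lines: the candidate split that most reduces within-node variance wins (this reduction is the Improvement value), and each resulting node predicts its own mean. The claim and LOS values below are invented toy data, not the course file.

```python
# Toy sketch of a C&R Tree split for a continuous target:
# the best split maximizes variance reduction; nodes predict means.

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def improvement(target, left_idx, right_idx):
    """Variance reduction achieved by splitting the node in two."""
    n = len(target)
    left = [target[i] for i in left_idx]
    right = [target[i] for i in right_idx]
    return variance(target) - (len(left) / n * variance(left)
                               + len(right) / n * variance(right))

# Invented claims: short stays (LOS <= 2) vs. longer stays
claims = [4200, 4400, 4500, 5600, 5700]
los    = [1, 2, 2, 3, 4]
left  = [i for i, v in enumerate(los) if v <= 2]
right = [i for i, v in enumerate(los) if v > 2]

gain = improvement(claims, left, right)                  # the Improvement
pred_left  = sum(claims[i] for i in left) / len(left)    # node mean
pred_right = sum(claims[i] for i in right) / len(right)  # node mean
```

A split on a weaker predictor would produce a smaller variance reduction, which is why LOS wins in the example above.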

You can make different splits in the tree based on business information or just to try alternative trees. To see this option:

Right-click Node 0 and select Grow Branch with Custom Split


Figure 3.44 Choosing a Custom Split with C&R Tree

The current best split, on LOS, is listed along with the values that will be used to split the tree. You can change the split values by selecting the Custom option button.

The Predictors dialog shows the top predictors and the Improvement value for the optimal split on each.

Click the Predictors button


Figure 3.45 Predictors and Improvement Values for that Split

You can choose another predictor here to split the node. In this instance, the Improvement value for LOS is clearly the largest, so it would be better to retain it as the first split.

Click Cancel, and then Cancel again

For a continuous output, we can investigate model performance in several ways. No model is automatically created when using interactive mode, so we need to generate a model first.

Click Generate…Generate Model
Click OK in the Generate New Model dialog

Figure 3.46 Generate New Model Dialog

Close the Interactive Tree window
Move the generated model CLAIM1 near the Type node
Connect the Type node to the generated model CLAIM1
Add an Analysis node to the stream
Connect the Analysis node to the generated model CLAIM1
Run the Analysis node


Figure 3.47 Analysis Output for C&R Tree Model to Predict CLAIM

The Mean Absolute Error is perhaps the most useful statistic, and has a value of 824.11. This is the amount of model prediction error, on average. The mean value of CLAIM is about 4,631, so the average error is somewhat under 20% of the mean. The analyst has to decide whether this is sufficiently accurate, given the goals of a data-mining project.
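The Mean Absolute Error reported by the Analysis node is straightforward to reproduce by hand. This sketch uses invented actual and predicted claim values, not the course data:

```python
# Sketch of the Analysis node's Mean Absolute Error and its size
# relative to the target mean (toy values, not Modeler output).
actual    = [4000.0, 5200.0, 4600.0, 4800.0]
predicted = [4366.7, 5650.0, 4366.7, 5650.0]

# average absolute difference between actual and predicted values
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
mean_actual = sum(actual) / len(actual)
relative_error = mae / mean_actual   # e.g. "under 20% of the mean"
```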

The correlation between the predicted value and actual value of CLAIM is .507, which is good, but not outstanding. On the other hand, there are only three terminal nodes in the tree, and so only three predicted values of CLAIM. Given that, this correlation isn't bad at all.

You can explore the model further by using other nodes to see the relationship between the predictors and the value of $R-CLAIM. However, the decision tree is very clear as to how the predictors are being used to make predictions. You may, though, want to investigate how AGE relates to $R-CLAIM, since AGE isn't used in the tree (you would find there is essentially no relationship, which is why it wasn't used).

Close the Analysis window

Continuous Outputs with CHAID

When CHAID is used with a continuous target, the overall approach is identical to what we have discussed above, but the specific tests used to select predictors and merge categories differ.


An analysis of variance test is used for predictor selection and merging of categories, with the target as the dependent variable. Nominal and ordinal predictors are used in their untransformed form. Continuous predictors are binned as described above for CHAID into at most 10 categories; then an analysis of variance test is used on the transformed field.

The field with the lowest p value for the ANOVA F test is selected as the best predictor at a node, and the splitting and merging of categories proceeds based on additional F tests.
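The selection rule can be sketched as follows. For simplicity this toy example compares F statistics directly rather than p values; because both candidate predictors here have the same group structure (and so the same degrees of freedom), the ordering is the same. All data values are invented.

```python
# Sketch of CHAID predictor selection for a continuous target:
# the predictor whose categories best separate the target means wins.

def anova_f(groups):
    """One-way ANOVA F statistic across category groups of the target."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2
                     for g in groups)
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented claim amounts grouped by each predictor's categories
by_los = [[4200, 4300, 4250], [5600, 5700, 5650]]   # strong separation
by_age = [[4200, 5600, 4250], [4300, 5700, 5650]]   # weak separation

best = max([("LOS", anova_f(by_los)), ("AGE", anova_f(by_age))],
           key=lambda t: t[1])[0]
```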

If you are interested, you can try CHAID with this same data and target.


Summary Exercises

The exercises in this lesson use the data file churn.txt that was also used for examples in this lesson. The following table provides details about the file.

churn.txt contains information from a telecommunications company. The data consist of customers who at some point have purchased a mobile phone. The primary interest of the company is to understand which customers will remain with the organization or leave for another company.

The file contains the following fields:

ID – Customer reference number
LONGDIST – Time spent on long distance calls per month
International – Time spent on international calls per month
LOCAL – Time spent on local calls per month
DROPPED – Number of dropped calls
PAY_MTHD – Payment method of the monthly telephone bill
LocalBillType – Tariff for locally based calls
LongDistanceBillType – Tariff for long distance calls
AGE – Age
SEX – Gender
STATUS – Marital status
CHILDREN – Number of children
Est_Income – Estimated income
Car_Owner – Car owner
CHURNED (3 categories):
  Current – Still with company
  Vol – Leavers who the company wants to keep
  Invol – Leavers who the company doesn't want

In these exercises we will explore the various training methods and options for the rule induction techniques within PASW Modeler. In all the exercises that follow, you can use a Partition node to split the data into Training and Testing partitions to make these modeling exercises more realistic. You may or may not want to split the data 50/50 into two partitions, depending on the number of records in the file.

1.  Begin a new stream with a Var.file node connected to the file churn.txt.

2.  Use C5.0 and at least one other decision tree method to predict CHURNED and compare the accuracy of both. What do you learn from this? Which rule method performs "best"?

3.  Now browse the rules that have been generated by the methods. Which model appears to be the most manageable and/or practical? Do you think there is a trade-off between accuracy and practicality?

4.  Try switching from Accuracy to Generality in C5.0. Does this have much effect on the size and accuracy of the tree?

5.  Experiment with the model options within the methods you selected to see how they affect tree growth. Can you increase the accuracy without making the model overly complicated?


Lesson 4: Neural Networks

Overview

•  Describe the structure and types of neural networks

•  Build a neural network 

•  Browse and interpret the results

•  Evaluate the model

•  Illustrate the use of bagging and boosting

Data

In this lesson we will use the dataset churn.txt. We continue to use a Partition node to divide the cases into two partitions (subsamples), one to build or train the model and the other to test the model (often called a holdout sample).

4.1 Introduction to Neural Networks

Historically, neural networks attempted to solve problems using methods modeled on how the brain was envisioned to operate. Today they are simply viewed as powerful modeling techniques.

A typical neural network consists of several neurons arranged in layers to create a network. Each neuron can be thought of as a processing element that is given a simple part of a task. The connections between the neurons provide the network with the ability to learn patterns and interrelationships in data. The figure below gives a simple representation of a common neural network (a Multi-Layer Perceptron).

Figure 4.1 Simple Representation of a Common Neural Network 

When using neural networks to perform predictive modeling, the input layer contains all of the fields used to predict the target. The output layer contains an output field: the target of the prediction. The input and output fields can be continuous or categorical (in PASW Modeler, categorical fields are transformed into a numeric form, dummy or binary set encoding, before processing by the network). The hidden layer contains a number of neurons at which outputs from the previous layer combine. A network can have any number of hidden layers, although these are usually kept to a minimum. All neurons in one layer of the network are connected to all neurons within the next layer.
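The structure just described can be sketched as a forward pass through one hidden layer. The weights below are arbitrary illustrative numbers, and Modeler's internal implementation will of course differ:

```python
import math

# Minimal sketch of a forward pass: inputs -> hidden layer -> output.

def sigmoid(x):
    # a common activation function, squashing any value into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """One hidden layer; each weight vector ends with a bias term."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + ws[-1])
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden))
                   + output_weights[-1])

# two inputs, two hidden neurons, one output neuron
y = forward([0.5, 1.0],
            hidden_weights=[[0.4, -0.6, 0.1], [0.3, 0.8, -0.2]],
            output_weights=[1.2, -0.7, 0.05])
```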

While the neural network is learning the relationships between the data and results, it is said to be training. Once fully trained, the network can be given new, unseen data and can make a decision or prediction based upon its experience.

When trying to understand how a neural network learns, think of how a parent teaches a child to read. Patterns of letters are presented to the child, and the child makes an attempt at the word. If the child is correct, she is rewarded, and the next time she sees the same combination of letters she is likely to remember the correct response. However, if she is incorrect, then she is told the correct response and tries to adjust her response based on this feedback. Neural networks work in the same way.

4.2 Training Methods

One of the advantages of PASW Modeler is the ease with which you are able to build a neural network without in fact knowing too much about how the algorithms work. Nevertheless, it helps to understand a bit about these methods, so we will begin by briefly describing the two different types of training methods: the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN) model.

As was noted above, a neural network consists of a number of processing elements, often referred to as "neurons," that are arranged in layers. Each neuron is linked to every neuron in the previous layer by connections that have strengths or weights attached to them. The learning algorithm controls the adaptation of these weights to the data; this gives the system the capability to learn by example and generalize for new situations.

The main consideration when building a network is to locate the best, or global, solution within a domain; however, the domain may contain a number of sub-optimal solutions. The global solution can be thought of as the model that produces the least possible error when records are passed through it.

To understand the concept of global error, imagine a graph created by plotting the hidden weights within the neural network against the error produced. Figure 4.2 gives a simple representation of such a graph. With any complex problem there may be a large number of feasible solutions, thus the graph contains a number of sub-optimal solutions or local minima (the "valleys" in the plot). The trick to training a successful network is to locate the overall minimum or global solution (the lowest point), and not to get "stuck" in one of the local minima or sub-optimal solutions.


Figure 4.2 Representation of the Error Domain Showing Local and Global Minima
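A toy illustration of this idea: simple gradient descent on a one-dimensional function with two valleys ends up in a different minimum depending on where it starts, just as a network's final weights depend on their random initialization. The function and learning rate here are arbitrary choices for illustration only.

```python
# f(x) = x**4 - 3*x**2 + x has two valleys: a deeper (global) minimum
# near x = -1.3 and a shallower (local) minimum near x = +1.1.

def grad(x):
    # derivative of x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

def descend(x, rate=0.01, steps=2000):
    # repeatedly step downhill from the starting point x
    for _ in range(steps):
        x -= rate * grad(x)
    return x

left_valley  = descend(-2.0)   # ends near the global minimum
right_valley = descend(+2.0)   # gets "stuck" in the local minimum
```

Starting on the right, descent settles in the shallower valley even though a better solution exists on the left, which is exactly the risk described above.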

There are many different types of supervised neural networks (that is, neural networks that require both inputs and an output field). However, within the world of data mining, two are most frequently used: the Multi-Layer Perceptron (MLP) and the Radial Basis Function Network (RBFN). In the following paragraphs we will describe the main differences between these types of networks and describe their advantages and disadvantages.

4.3 The Multi-Layer Perceptron

The MLP network consists of layers of neurons, with each neuron linked to all neurons in the previous layer by connections of varying weights. All MLP networks consist of an input layer, an output layer, and at least one hidden layer. The hidden layer is required to perform non-linear mappings. The number of neurons within the system is directly related to the complexity of the problem, and although a multi-layered topology is feasible, in practice there is rarely a need for more than one hidden layer.

Within a Multi-Layer Perceptron, each hidden layer neuron receives an input based on a weighted combination of the outputs of the neurons in the previous layer. The neurons within the final hidden layer are, in turn, combined to produce an output. This predicted value is then compared to the correct output, and the difference between the two values (the error) is fed back into the network, which in turn is updated. This feeding of the error back through the network is referred to as back-propagation.
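The weight-update idea can be sketched with the simplest possible version of the rule: a single neuron with no hidden layer, where each weight moves in proportion to the error and its input. The learning rate and data values are made up, and real back-propagation also multiplies in the derivative of the activation function:

```python
# Sketch of the error-feedback idea: nudge each weight by
# learning_rate * error * input.

def update_weights(weights, inputs, target, prediction, rate=0.1):
    error = target - prediction          # how wrong the network was
    return [w + rate * error * x for w, x in zip(weights, inputs)]

weights = [0.2, -0.1]
inputs = [1.0, 0.5]

# suppose the network predicted 0.3 but the correct answer was 1.0
new_weights = update_weights(weights, inputs, target=1.0, prediction=0.3)
```

Repeating this over many records is what gradually reduces the error, in the same way the child in the example below adjusts the importance she places on each clue.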

To illustrate this process we will take the simple example of a child learning the difference between an apple and a pear. The child may decide that the most useful factors in making a decision are the shape, the color, and the size of the fruit; these are the inputs. When shown the first example of a fruit, she may look at it and decide that it is round, red in color, and of a particular size. Not knowing what an apple or a pear actually looks like, the child may decide to place equal importance on each of these factors; the importance is what a network refers to as weights. At this stage the child is most likely to randomly choose either an apple or a pear for her prediction.

On being told the correct response, the child will increase or decrease the relative importance of each of the factors to improve her decision (reduce the error). In a similar fashion, an MLP network begins with random weights placed on each of the inputs. On being told the correct response, the network adjusts these internal weights. In time, the child and the network will hopefully make correct predictions.

To visualize how an MLP works, imagine a problem where you wish to predict a target field consisting of two groups, using only two input fields. Figure 4.3 shows a graph of the two input fields plotted against one another, overlaid with the target. Using a non-linear combination of the inputs, the MLP fits an open curve between the two classes.

Figure 4.3 Decision Surface Created Using the Multi-Layer Perceptron

The advantages of using an MLP are:

•  It is effective on a wide range of problems
•  It is capable of generalizing well
•  If the data are not clustered in terms of their input fields, it will classify examples in the extreme regions
•  It is currently the most commonly used type of network and there is much literature discussing its applications

The disadvantages of using an MLP are:

•  It can take a great deal of time to train

•  It does not guarantee finding the best global solution

4.4 The Radial Basis Function

The Radial Basis Function (RBF) is a more recent type of network and is responsive to local regions within the space defined by the input fields.

Figure 4.4 shows a graphical representation of how an RBF fits a number of basis functions to the problem described in the previous section. The RBF can be thought of as performing a type of clustering within the input space, encircling individual clusters of data with a number of basis functions. If a data point falls within the region of activation of a particular basis function, then the neuron corresponding to that basis function responds most strongly. The concept of the RBF is extremely simple; however, the selection of the centers of each basis function is where difficulties arise.


Figure 4.4 Operation of a Radial Basis Function

The advantages of using an RBF network are:

•  It is quicker to train than a MLP

•  It can model data that are clustered within the input space.

The disadvantages of using an RBF network are:

•  It is difficult to determine the optimal position of the function centers

•  The resulting network often has a poor ability to represent the global properties of the data.

Within PASW Modeler, the RBF algorithm uses the K-means clustering algorithm to determine the number and location of the centers in the input space.
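The RBF behavior described above can be sketched in a few lines: fixed centers (hard-coded here, as if already found by k-means) plus a Gaussian basis function, with the neuron for the nearest center responding most strongly. All values are invented for illustration.

```python
import math

# Sketch of an RBF unit: activation decays with distance from a center.

def rbf_activation(point, center, width=1.0):
    dist_sq = sum((p - c) ** 2 for p, c in zip(point, center))
    return math.exp(-dist_sq / (2 * width ** 2))   # Gaussian basis

centers = [(0.0, 0.0), (5.0, 5.0)]    # e.g. produced by k-means
point = (0.5, 0.2)                    # a record near the first cluster

activations = [rbf_activation(point, c) for c in centers]
nearest = activations.index(max(activations))   # first neuron fires
```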

4.5 Which Method?

Due to the random nature of neural networks, the models built using each of the algorithms will tend to perform with varying degrees of accuracy, depending on the initial weights and starting positions. When building neural networks it is sensible to try both algorithms and either choose the one with the best overall performance or use both models to gain a majority prediction.


4.6 The Neural Network Node

The Neural Net node is used to create a neural network and can be found in the Modeling palette. Once trained, a generated Neural Net node labeled with the name of the predicted field will appear in the Generated Models palette and on the stream. This node represents the trained neural network. Its properties can be browsed, and new data can be passed through this node to generate predictions.

We can use the stream saved in Lesson 3.

Click File…Open Stream
Move to the c:\Train\ModelerPredModel directory
Double-click on C5.str
Delete the C5.0 modeling node and the C5.0 generated model node, leaving the other nodes
Run the Table node
Click File…Close to close the Table window
Double-click the Type node

Figure 4.5 Type Node Ready for Modeling

Notice that ID will be excluded from any modeling, as the role is automatically set to None for a Typeless field. The CHURNED field will be the target field for any predictive model, and all fields but ID and Partition will be used as predictors.


Click OK
Place a Neural Net node from the Modeling palette to the right of the Type node
Connect the Type node to the Neural Net node

Figure 4.6 Neural Net Node (CHURNED) Added to Data Stream

Notice that once the Neural Net node is added to the data stream, its name becomes CHURNED, the field we wish to predict.

Double-click the Neural Net node
Click the Build Options tab (if necessary)


Figure 4.7 Neural Net Dialog: Build Options Tab

The Build Options tab enables you to control five different areas, including overall objectives, the specific neural net algorithm, stopping rules, how to combine ensembles of models, and advanced statistical specifications, including how to handle missing data and the size of an overfit prevention set.

The default objective is to build a new single neural network. You can instead use boosting or bagging (explained in sections below) to enhance model accuracy or stability. If you have a server connection, Neural Net can create models for very large datasets by dividing the data into smaller data blocks and building a model on each block. The most accurate models are then automatically selected and combined into a single final model.

Click the Basics settings


Figure 4.8 Basics Settings in Build Options Tab

The options on the Basics panel enable you to choose one of two algorithms. The multilayer perceptron allows for more complex relationships at the possible cost of increasing the training and scoring time. The radial basis function may have lower training and scoring times, at the possible cost of reduced predictive power compared to the MLP.

The hidden layer(s) of a neural network contain unobservable neurons (units). The value of each hidden unit is some function of the predictors; the exact form of the function depends in part upon the network type. A multilayer perceptron can have one or two hidden layers; a radial basis function network can only have one hidden layer. By default the model will choose the best number of hidden units in each hidden layer, although you can specify this yourself. Normally, it is best to allow the algorithm to make this choice.

Click the Stopping Rules settings


Figure 4.9 Stopping Rules Settings in Build Options Tab

This area allows you to control the rules that determine when to stop training multilayer perceptron networks; these settings are ignored when the radial basis function algorithm is used. Training proceeds through at least one cycle (data pass), and can then be stopped based on three criteria.

By default, PASW Modeler stops when it appears to have reached its optimally trained state; that is, when accuracy on the (internal) test dataset seems to no longer improve. Alternatively, you can set a required accuracy value, a limit to the number of cycles through the data, or a time limit in minutes.

We use the defaults in these examples.
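The three stopping criteria can be sketched as a training loop. The function name and the accuracy-per-cycle sequence below are hypothetical, standing in for whatever the network achieves on its internal test set after each data pass:

```python
import time

# Sketch of the three MLP stopping criteria: required accuracy,
# maximum cycles (data passes), and a time limit in minutes.

def train(accuracy_per_cycle, min_accuracy=None, max_cycles=None,
          max_minutes=None):
    start = time.time()
    for cycle, acc in enumerate(accuracy_per_cycle, start=1):
        if min_accuracy is not None and acc >= min_accuracy:
            return cycle, "accuracy reached"
        if max_cycles is not None and cycle >= max_cycles:
            return cycle, "cycle limit"
        if max_minutes is not None and (time.time() - start) / 60 >= max_minutes:
            return cycle, "time limit"
    return len(accuracy_per_cycle), "converged"

# training stops on the third cycle, when 80% accuracy is reached
cycles, reason = train([0.70, 0.78, 0.82, 0.85], min_accuracy=0.80)
```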

Click the Ensembles settings


Figure 4.10 Ensembles Settings in Build Options Tab

The settings in this section are used when boosting, bagging, or very large datasets are modeled, as requested in the Objectives section. In this case, two or more models need to be combined to make a prediction. Ensemble predicted values for categorical targets can be combined using voting, highest probability, or highest mean probability. Voting selects the category that has the highest probability most often across the base models. Highest probability selects the category that achieves the single highest probability across all base models. Highest mean probability selects the category with the highest value when the category probabilities are averaged across the individual models. Ensemble predicted values for continuous targets can be combined using the mean or median of the predicted values from the individual models.

You can also specify the number of base models to build for boosting and bagging; for bagging, this is the number of bootstrap samples.
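The combining rules can be sketched with toy predicted probabilities from three hypothetical base models; note that the three categorical rules can disagree with each other:

```python
from collections import Counter
from statistics import mean, median

# Invented per-model probabilities for the three CHURNED categories.
probs = [
    {"Current": 0.60, "Vol": 0.30, "Invol": 0.10},
    {"Current": 0.05, "Vol": 0.90, "Invol": 0.05},
    {"Current": 0.70, "Vol": 0.20, "Invol": 0.10},
]

# Voting: category that wins in the most base models
voting = Counter(max(p, key=p.get) for p in probs).most_common(1)[0][0]
# Highest probability: single largest probability anywhere
highest_prob = max((v, k) for p in probs for k, v in p.items())[1]
# Highest mean probability: largest average across models
highest_mean = max(probs[0], key=lambda k: mean(p[k] for p in probs))

# For a continuous target, combine with the mean or median
continuous = [4400.0, 4700.0, 5200.0]
combined_mean, combined_median = mean(continuous), median(continuous)
```

Here voting picks Current (two of three models prefer it), while the other two rules pick Vol, driven by the second model's very confident prediction.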

Click the Advanced settings


Figure 4.11 Advanced Settings in Build Options Tab

Over-training is one of the problems that can occur within neural networks. As the data pass repeatedly through the network, it is possible for the network to learn patterns that exist in the sample only and thus over-train. That is, it will become too specific to the training sample data and lose its ability to generalize. By selecting the Overfit prevention set option (checked by default), only a randomly selected proportion of the training data is used to train the network (this is separate from a holdout sample created in the Partition node). By default, 70% of the data is selected for training the model, and 30% for testing it. Once the training proportion of data has made a complete pass through the network, the rest is reserved as a test set to evaluate the performance of the current network. By default, this information determines when to stop training and provides feedback information. We advise you to leave this option turned on. Note that with a Partition node in use, and with Overfit prevention set turned on, the Neural Net model will be trained on 70 percent of the training sample selected by the Partition node, and not on 70% of the entire dataset.
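The 70/30 overfit prevention split can be sketched as a random partition of the records already assigned to the Training partition (the record ids and seed below are arbitrary):

```python
import random

# Sketch: the overfit prevention set is a random 30% of the records
# in the Training partition, not 30% of the whole dataset.
random.seed(1)
training_partition = list(range(1000))   # hypothetical record ids

shuffled = random.sample(training_partition, len(training_partition))
cut = int(0.7 * len(shuffled))
train_set, overfit_prevention = shuffled[:cut], shuffled[cut:]
```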

Since the neural network initializes itself with random weights, the behavior of the network can be reproduced by setting the same random seed: check the Replicate Results checkbox (selected by default) and specify a seed. It is advisable to run several trials on a neural network to ensure that you obtain similar results using different random seed starting points. The Generate button will create a pseudo-random integer between 1 and 2147483647, inclusive.
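The effect of Replicate Results can be sketched with a seeded random number generator: the same seed reproduces the same starting weights, while a different seed gives a different trial (the weight range and seed values below are made up):

```python
import random

# Sketch: a fixed seed makes the random starting weights reproducible.

def initial_weights(seed, n=5):
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]

run1 = initial_weights(seed=12345)
run2 = initial_weights(seed=12345)   # identical starting point
run3 = initial_weights(seed=99999)   # a different trial
```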

The Neural Net node requires valid values for all input fields. There are two options to handle missing values for the predictors (records with missing values on the target are always ignored). By default, listwise deletion is used, which deletes any record with a missing (blank) value on one or more of the input fields. As an alternative, Modeler can impute the missing data. For categorical fields, the most frequently occurring category (the mode) is substituted; for continuous fields, the average of the minimum and maximum observed values is substituted.
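The two imputation rules just described can be sketched directly (toy field values; None stands for a missing value):

```python
from collections import Counter

# Sketch: mode for a categorical field, midpoint of observed
# min and max for a continuous field.

def impute_categorical(values):
    observed = [v for v in values if v is not None]
    return Counter(observed).most_common(1)[0][0]

def impute_continuous(values):
    observed = [v for v in values if v is not None]
    return (min(observed) + max(observed)) / 2.0

pay_method = ["CC", "CD", "CC", None, "CC"]
age = [23.0, 45.0, None, 61.0]

filled_cat = impute_categorical(pay_method)   # the mode, "CC"
filled_num = impute_continuous(age)           # (23 + 61) / 2 = 42.0
```

Note that the continuous rule uses the midpoint of the observed range, not the mean of the observed values, which is one reason to consider handling missing data yourself beforehand.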

It is important to check the data in advance of running a Neural Net, using the Data Audit node, to decide which records and fields should be passed to the neural network for modeling. Otherwise you run the risk of a model being built using data values supplied by these imputation rules even if that is not your preference. Also, you can take control of missing value imputation by using, for example, the Data Audit node to change missing values to valid values with several optional methods before using the Neural Net node.

Click the Model Options tab


Figure 4.12 Model Options Tab

The Make Available for Scoring area contains options for controlling how the model is scored. The predicted value (for all targets) and confidence (for categorical targets) are always computed when the model is scored. The computed confidence can be based on the probability of the predicted value (the highest predicted probability) or the difference between the highest predicted probability and the second highest predicted probability.

Propensity scores (the likelihood of the true outcome) can be created for flag targets, as we also saw for decision tree models. The model produces raw propensity scores; adjusted propensity scores are not available.
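The two confidence definitions can be sketched in one small function (toy probabilities for the three CHURNED categories):

```python
# Sketch: confidence as the highest predicted probability, or as the
# margin between the two highest probabilities.

def confidences(category_probs):
    ranked = sorted(category_probs.values(), reverse=True)
    return ranked[0], ranked[0] - ranked[1]

highest, margin = confidences({"Current": 0.55, "Vol": 0.35, "Invol": 0.10})
```

The margin version penalizes close calls: a record scored 0.55 vs. 0.35 gets a much lower confidence (0.20) than its top probability (0.55) suggests.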

We are ready to run the Neural Net node. Although we have discussed the various options, we will use all defaults for this first model.

Click Run 


4.7 Models Palette

The Neural Net model is placed both in the stream, attached to the Type node, and in the Models palette. The Models tab in the Manager holds and manages the results of the machine learning and statistical modeling operations. There are two context menus available within the Models palette. The first menu applies to the entire palette.

Right-click in the background (empty area) in the Models palette

Figure 4.13 Context Menu in the Models Palette

This menu allows you to open a model in the palette, save the models palette and its contents, open a previously saved models palette, clear the contents of the palette, or add the generated models to the Modeling section of the CRISP-DM project window. If you use PASW® Collaboration and Deployment Services to manage and run your data mining projects, you can store the palette, or retrieve a palette or model from the repository.

The second menu is specific to the generated model nodes.

Right-click the generated Neural Net node named CHURNED in the Models palette


Figure 4.14 Context Menu for Nodes in the Models Palette

This menu allows you to rename, annotate, and browse the generated model node. A generated model node can be deleted, exported as PMML (Predictive Model Markup Language) and stored in the PASW Collaboration and Deployment Services Repository, or saved in a file for future use.

4.8 The Neural Net Model

We can now explore the neural net model created to predict CHURNED.

Double-click the Neural Net model in the stream

The Model tab contains five views or summaries of the model. The first is the Model Summary, which is a high-level view of model performance (see next figure).

The table identifies the target, the type of neural network trained, the stopping rule that ended training (shown if a multilayer perceptron network was trained), and the number of neurons in each hidden layer of the network. Here, eight hidden neurons were used in one hidden layer.

The bar chart displays the accuracy of the final model, which is 81.1% (compare this to the accuracy of the C5.0 model in the previous lesson). For a categorical target, this is simply the percentage of records for which the predicted value matches the observed value. Since we are using a Partition node, this is the percentage correct on the Training partition. The full Training partition is used for this calculation, including the overfit prevention set that was used internally during model building.


Figure 4.15 Model Summary Information for Neural Net Model

Note: Accuracy for a Continuous Target

For a categorical target the accuracy is simply the percentage correct. It is worth noting that if the target is continuous, then accuracy within PASW Modeler is defined as the average across all records of the following expression:

    Accuracy = (1/n) × Σ [ 1 − |Target Value − Predicted Target Value| / Range of Target Values ]
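The formula above can be checked with a short sketch. The data are made up for the illustration; this is not Modeler's code.

```python
# The continuous-target accuracy formula above: the average over n records of
# 1 - |target - predicted| / range(target). Illustrative data; not Modeler's code.
def continuous_accuracy(actual, predicted):
    value_range = max(actual) - min(actual)
    n = len(actual)
    return sum(1 - abs(a - p) / value_range
               for a, p in zip(actual, predicted)) / n

actual    = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 40.0]
acc = continuous_accuracy(actual, predicted)   # about 0.94
```

A perfect model yields 1.0; larger average errors relative to the target's range pull the value toward 0.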

Predictor importance

 Next we can look at predictor importance.

Click the Predictor Importance chart panel

For models such as neural nets, predictor importance takes on, well, added importance because there is no single equation or other representation of the model available in the generated model nugget (but we can view the coefficients, as demonstrated below). The same is true, for example, with SVM models. Predictor importance is based on sensitivity analysis, a method for determining how variation in the model inputs leads to variation in the predicted values. The more important a predictor, the more changes in its values change the outcome values. In PASW Modeler, importance is calculated by sampling repeatedly from combinations of values in the distribution of the predictors and then assessing the effect on the target. Everything is then normalized to 1.0 so that the importances can be compared.
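The idea of sensitivity-based importance can be sketched roughly as follows. The model, data, and perturbation scheme below are hypothetical stand-ins for illustration; Modeler's actual algorithm differs in detail.

```python
import random

# A rough sketch of sensitivity-based importance as described above: resample
# one input at a time from its observed distribution and count how often the
# model's prediction changes. Normalizing the results to sum to 1.0 makes the
# importances comparable. The model and data are hypothetical, not Modeler's.
def predictor_importance(model, records, fields, trials=2000, seed=1):
    rnd = random.Random(seed)
    raw = {}
    for field in fields:
        observed = [r[field] for r in records]
        changes = 0
        for _ in range(trials):
            rec = dict(rnd.choice(records))
            base = model(rec)
            rec[field] = rnd.choice(observed)  # perturb just this one input
            if model(rec) != base:
                changes += 1
        raw[field] = changes / trials
    total = sum(raw.values()) or 1.0
    return {f: v / total for f, v in raw.items()}

# Toy model: the prediction depends mostly on LONGDIST, less on SEX.
model = lambda r: "Vol" if r["LONGDIST"] > 25 else ("Current" if r["SEX"] == "M" else "Vol")
records = [{"LONGDIST": ld, "SEX": s} for ld in (0, 30, 60, 90) for s in ("M", "F")]
imp = predictor_importance(model, records, ["LONGDIST", "SEX"])
# imp["LONGDIST"] should exceed imp["SEX"], and the values sum to 1.0
```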

The most important predictor of CHURNED is LONGDIST, followed by LOCAL and then International. These fields all measure usage of the phone service. The first customer demographic variable in importance is SEX. You may wish to compare these fields to the ones selected by the C5.0 model in Lesson 3.

Figure 4.16 Predictor Importance in Model

Predictor importance is not a substitute for exploring the model and seeing how it actually functions, as we do later, but it is a first step in reviewing a model.

 Next we can view how well the model performs at predicting each category of CHURNED.

Click the Classification panel

For categorical targets, this section displays the cross-classification of observed versus predicted values in a heat map, plus the percentage in each cell. We look, as usual, at the diagonal to see the correct predictions. The neural net does best at predicting those in the InVol category, with lowest accuracy for Current customers. As a reminder, this table uses the Training partition. The depth of shading of each cell is based on the percentage, with darker shading corresponding to higher percentages. There are three other table styles available, selected from the Styles dropdown.
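The table behind the heat map is a simple cross-classification. A minimal sketch (the category names mirror the example, but the data are made up):

```python
from collections import Counter

# Counts of observed versus predicted values, expressed as a percentage of all
# records, as in the Classification panel's heat map.
def classification_table(observed, predicted):
    counts = Counter(zip(observed, predicted))
    n = len(observed)
    return {cell: 100.0 * c / n for cell, c in counts.items()}

observed  = ["Current", "Current", "Vol", "InVol", "Vol"]
predicted = ["Current", "Vol",     "Vol", "InVol", "Vol"]
table = classification_table(observed, predicted)
# Diagonal cells such as ("Vol", "Vol") hold the correct predictions.
```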


Figure 4.17 Classification Table for Neural Net Model

The Neural Network Structure

You can see the neural network itself in the next section of output.

Click on the Neural Network panel

The network can be displayed in several views. These icons display the network in, respectively:

•  Standard style, with inputs on the left and outputs on the right

•  Inputs on the top and outputs on the bottom

•  Inputs on the bottom and outputs on the top

Also available is a slider to limit the display of inputs based on predictor importance.


Figure 4.18 Neural Network Structure

It is difficult to see the full network in the standard view, so we’ll switch to the view with inputs on the top.

Click on the middle network icon
Maximize the window’s width

There are two different display styles, which are selected from the Style dropdown list.

•  Effects. This displays each predictor and target as one node in the diagram, irrespective of whether the measurement scale is continuous or categorical. That is the current view.

•  Coefficients. This displays multiple indicator nodes for categorical predictors and targets. The connecting lines in the coefficients-style diagram are colored based on the estimated value of the synaptic weight.

Move both sliders to their endpoints so all fields are displayed


Figure 4.19 Neural Network With Inputs at Top

In this network there is one hidden layer, containing eight neurons, and the output layer still contains only one neuron, corresponding to the target field CHURNED.

Click the Style dropdown and select Coefficients

In the Coefficients view all the neurons are visible. The input layer is made up of one neuron per continuous field. Categorical fields have one neuron per value. In this example, there are seven continuous fields, five flag fields (two neurons each), and one nominal field with three values, totaling twenty input neurons. There are also Bias neurons to set the scale of the input. The field CHURNED is represented by three neurons, one for each of its three categories.
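The input-layer arithmetic can be checked quickly. The field counts below simply restate the example:

```python
# One neuron per continuous field, one per category of each categorical field.
continuous_fields = 7
flag_fields = 5                 # two categories each
nominal_category_counts = [3]   # one nominal field with three values
input_neurons = (continuous_fields
                 + 2 * flag_fields
                 + sum(nominal_category_counts))
# input_neurons equals 20, the twenty input neurons cited in the text
```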


Figure 4.20 Neural Network In Coefficients View

The connecting lines in the diagram are colored based on the estimated value of the synaptic weight, with darker blues corresponding to greater weights. If you hover the cursor over a link between neurons, the weight will be displayed in a popup (weights vary from –1.0 to +1.0).

It is visually evident that neural network models are difficult to summarize. This is because of the very large number of connections (each input neuron is connected to each hidden neuron, and each hidden neuron is connected to each output neuron). In effect, there are many, many equations in the network, and so the influence or effect of any one input field would have to be summed across these many equations.

We conclude our review of the Model Viewer output by looking at the Settings tab.

Click the Settings tab


Figure 4.21 Settings Tab for Neural Network 

The options here are equivalent to those on the Model Options tab in the Neural Network modeling node. The type of confidence can be requested, along with the number of probability fields for categorical targets. One new option is to generate SQL for the model, allowing pushback of the model scoring stage to the database. This is only available for the multilayer perceptron.

Click OK to close the Neural Network Model Viewer 

4.9 Validating the List of Predictors

Because the neural network results depend on the initial random starting point, it is important to rerun the model with a different seed to be sure that the results are consistent. It is entirely possible that, because of the seed we chose, one or more of the fields the Neural Network found to be important in influencing CHURNED might not be selected again with a different seed. Therefore, it is crucial to run the Neural Network model enough times until you are convinced about which predictors are the most important in influencing your target. We will rerun the model just once and compare it with the one we just ran. Normally, you would need to rerun it several times.


Double-click the Neural Network modeling node
Click the Build Options tab
Click Advanced
Change the random seed in the Random seed: text box to 444
Click Run
Edit the generated Neural Net model in the stream

We see in the Model Summary pane that the overall accuracy has decreased by about 3.6%. Also, intriguingly, the number of neurons in the hidden layer is now 3, not 8 (the number of hidden neurons is also determined based in part on the random seed).

Figure 4.22 Model Summary Information for Neural Net Model After Changing Seed

Click the Predictor Importance panel

The Predictor Importance is not identical to that in our first model, but it is very similar. The three usage fields are the most important. In fact, the top six fields are the same, in the same order, as in the first model. After that the order changes, but importance is not something that should be viewed as a fixed and definite value (such as a regression coefficient). Instead, importance is a rough measure of a predictor’s influence on the overall network output.


Figure 4.23 Predictor Importance after Changing the Seed

So although the accuracy dropped a bit, generally these results are encouraging about the stability of the model. If we look at the Classification table, we will learn that accuracy dropped more for the Current and Vol categories than for InVol.

Normally we would rerun the model a few more times with different seeds to further convince ourselves that the top predictors of CHURNED remain the same and that accuracy remains fairly constant, but we will stop here and attempt to further understand the model.

Click OK to close the Neural Network Model Viewer 

4.10 Understanding the Neural Network

A common criticism of neural networks is that they are opaque; that is, once built, the reasoning behind their predictions is not clear. For instance, does making a lot of international calls mean that you are likely to remain a customer, or instead leave voluntarily? In the following sections we will use some techniques available in PASW Modeler to help you evaluate the network and discover its structure.

Creating a Data Table Containing Predicted Values

The first step is passing the data to a Table node to look at the output fields.

Connect the generated Neural Net model named CHURNED to the nearby Table node
Run the Table node


Figure 4.24 Table Showing the Two Fields Created by the Generated Net Node

The generated Neural Net node calculates two new fields, $N-CHURNED and $NC-CHURNED, for every record in the data file with valid data for the model. The first represents the predicted CHURNED value and the second a confidence value for the prediction. The latter is only appropriate for categorical targets and will be in the range of 0.0 to 1.0, with more confident predictions having values closer to 1.0. We can observe that the first record, which is contained in the Training partition, was correctly predicted to be a voluntary churner.

Close the Table window

Comparing Predicted to Actual Values

In data-mining projects it is advisable to see not only how well the model performed with the data we used to train the model, but also with the data we held out for testing purposes. The Neural Net model only displays results for the Training partition, so we need to use a Matrix node to create the equivalent table for the Testing partition.

Because the Matrix node does not have an option to automatically split the results by partition, we must manually divide the Training and Testing samples with Select nodes. This will allow us to create a separate matrix table for each sample. We already have the Select and Matrix nodes in the stream from the C5.0 stream.

Connect the generated Neural Net model named CHURNED to each Select node

We then need to specify the correct field in the Matrix nodes.


Double-click on each Matrix node to edit it
Put $N-CHURNED in the Columns:
Run each Matrix node

Figure 4.25 Matrix of Actual and Predicted Churned for Training and Testing Samples

For the training data, the model correctly predicts 75.8% of the current customers, 95.8% of the involuntary leavers, and 74.9% of the voluntary leavers. For the testing data, the model does slightly better on the current customers (76.3%), but not quite as well on the other two categories (91.8% and 70.7%, respectively).

When you decide whether to accept a model, and you report on its accuracy, you should use the results from the Testing (or Validation) sample. The model’s performance on the Training data may be too optimized for that particular sample, so its performance on the Testing sample will be the best indication of its performance in the future.

Close the Matrix windows

Overall Accuracy with an Analysis Node

An Analysis node will allow us to assess the overall accuracy of the model on each data partition. It is often true when predicting a categorical target with more than two categories that overall accuracy is less important than accuracy at predicting specific outcomes, but usually analysts prefer to know overall accuracy, too. And decision-makers regularly ask about it.

The Analysis node in the stream is ready for our use.

Connect the generated Neural Net model node to the Analysis node
Click Run

Overall percent correct for the Training partition is 77.47%; overall percent correct for the Testing partition is 75.73%. This small reduction in accuracy from the Training to the Testing data is typical, and it falls well within acceptable limits. You can see that the Testing data partition is slightly larger than that for the Training data, as they were created randomly.


Figure 4.26 Analysis Node Output

Close the Analysis Output browser window

Evaluation Charts

The Evaluation node is included in the stream, and if you wish, you can run Evaluation charts for the Neural Network model to further study and compare the performance on the Training and Testing data partitions.

4.11 Understanding the Reasoning behind the Predictions

One method of trying to understand how a neural network is making its predictions is to apply an alternative machine learning technique, such as rule induction, to model the neural network predictions. Here, though, we will use more straightforward methods to understand the relationships between the predicted values and the fields used as inputs.

Categorical Input with Categorical Target

Based on the predictor importance chart, a categorical input of moderate importance is SEX. Since it and the target field are categorical, we can use a distribution plot with a symbolic overlay to understand how gender relates to the CHURNED predictions.


Place a Distribution node from the Graphs palette near the Select node for the Training partition
Connect the Select node to the Distribution node
Double-click the Distribution node
Select SEX from the Fields: list
Select $N-CHURNED as the Color Overlay field
Click the Normalize by color check box (not shown)
Click Run

The Normalize by color option creates a bar chart with each bar the same length. This helps to compare the proportions in each overlay category.
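Normalizing by color amounts to showing within-category proportions rather than raw counts. A sketch of that calculation with made-up (SEX, predicted CHURNED) pairs:

```python
from collections import Counter, defaultdict

# Each bar is drawn at the same length, so what is displayed is the proportion
# of each overlay category within each SEX category.
def normalized_proportions(rows):
    counts = Counter(rows)
    totals = defaultdict(int)
    for (sex, _), c in counts.items():
        totals[sex] += c
    return {(sex, churned): c / totals[sex]
            for (sex, churned), c in counts.items()}

rows = ([("F", "Vol")] * 6 + [("F", "Current")] * 4
        + [("M", "Current")] * 7 + [("M", "Vol")] * 3)
props = normalized_proportions(rows)
# e.g. 60% of females are predicted Vol; 70% of males are predicted Current
```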

Figure 4.27 Distribution Plot Relating Sex and Predicted Churned ($N-CHURNED)

The chart illustrates that the model is predicting that the majority of females are voluntary leavers, while the bulk of males are predicted to remain current customers. The large difference in the proportion of each category of CHURNED for males compared to females is an illustration of why SEX is an important predictor. And this plot would help you describe the model in any summary reports you write.

Close the Distribution plot window

We next look at a histogram plot with an overlay.

Continuous Input with Categorical Target

The most important continuous input for this model is LONGDIST. Since the target field is categorical, we will use a histogram of LONGDIST with the predicted value as an overlay to try to understand how the network is associating long distance minutes used with CHURNED.


Place a Histogram node from the Graphs palette near the Select node for the Training sample
Connect the Select node to the Histogram node
Double-click the Histogram node
Click LONGDIST in the Field: list
Select $N-CHURNED in the Overlay Color field list
Click on the Options tab
Click on Normalize by Color (not shown)
Click Run

Figure 4.28 Histogram with Overlay of Predicted Churned by Long Distance Minutes

Here the only clear pattern we see is that Involuntary Leavers tend to be people who do little or no long distance calling. In contrast, it appears that the amount of long distance calling was not as much of an issue when it came to predicting whether a person would remain a current customer or voluntarily choose to leave.

You may wish to try these same graphs with the Testing Partition.

Note: Use of Data Audit Node

We explored the relationship between just two input fields (LONGDIST and SEX) and the prediction from the neural net ($N-CHURNED), and used Distribution and Histogram nodes to create the plots. If more inputs were to be viewed in this way, a more efficient approach would be to use the Data Audit node, because overlay plots can easily be produced for multiple input fields (the overlay plots can’t be normalized, though).

To save this stream for later work:

Click File…Save Stream As


Move to the c:\Train\ModelerPredModel directory (if necessary)
Type NeuralNet in the File Name: text box
Click Save

4.12 Model Summary

In summary, we appear to have built a neural network that is reasonably good at predicting the three different CHURNED groups. The overall accuracy was about 77% with the Training data, and 76% with the Testing data. Focusing on the Testing, or unseen, data, the model is most accurate at predicting the Involuntary Leaver group and somewhat less successful at predicting the Current Customer and Voluntary Leaver groups. Considering that the model was correct almost three-quarters of the time even in the case of these latter two groups, it is certainly within the realm of possibility that the model may be considered a success. Of course, this would depend on whether these accuracy rates met or exceeded the minimum requirements defined at the beginning of the data-mining project. In terms of how predictors relate to the model, the most important factors in making its predictions are LONGDIST, International, LOCAL, and SEX. The network appears to associate females with the Voluntary Leaver group and predicts that males will remain Current Customers. The model also tends to predict that the people who are most likely to be dropped by the company (Involuntary Leavers) are those who do little or no long distance calling. (There are many other relationships in the data between the predictors and CHURNED that we didn’t investigate, of course.)

4.13 Boosting and Bagging Models

There are two additional techniques available in the Neural Net node, and in other modeling nodes in PASW Modeler, to create reliable and accurate predictive models. These techniques build a collection, or ensemble, of models, and then combine the results of the models to make a prediction. However, the techniques don’t simply create a number of models on exactly the same data, which wouldn’t provide any particular advantage. Instead, they resample or reweight the data for each additional model, which leads to a different model each time. This turns out to be a winning strategy for creating effective ensembles of models.

These two techniques are called boosting and bagging. We will provide a brief description of how each operates, then simple examples of both.

Boosting. The key concept behind model boosting is that successive models are built to predict the cases misclassified by earlier models. Thus, as the number of models increases, the number of misclassified cases should decrease. The method works by applying a model to the data in the normal fashion, with each record assigned an equal weight. After the first model is constructed, predictions are made, and weights are created that are inversely proportional to the accuracy of classification. Then a second model is created using the weighted data. Predictions are then made from the second model, and again weights are created that are inversely proportional to the accuracy of classification. This process continues for a (usually) small number of iterations. When done, the model predictions can be combined by voting, by highest probability, etc.
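The boosting loop can be sketched schematically. The learner, weighting factor, and data below are invented for illustration; this is not Modeler's exact weighting scheme.

```python
from collections import Counter, defaultdict

# After each component model, misclassified records are upweighted so the next
# model concentrates on them; component predictions are combined by voting.
def boost(train_model, records, labels, n_models=10):
    weights = [1.0] * len(records)
    models = []
    for _ in range(n_models):
        model = train_model(records, labels, weights)
        for i, (rec, label) in enumerate(zip(records, labels)):
            if model(rec) != label:
                weights[i] *= 2.0          # illustrative upweighting factor
        total = sum(weights)
        weights = [w * len(records) / total for w in weights]
        models.append(model)
    return models

def vote(models, record):
    return Counter(m(record) for m in models).most_common(1)[0][0]

# Stub learner: predicts the weighted majority class, ignoring the inputs.
def train_majority(records, labels, weights):
    score = defaultdict(float)
    for label, w in zip(labels, weights):
        score[label] += w
    majority = max(score, key=score.get)
    return lambda rec: majority
```

With the stub learner, records in the minority class gain weight until some component models switch to predicting them, which is the behavior boosting relies on.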

Bagging. The term “bagging” is derived from the phrase “bootstrap aggregating.” In this method, new training datasets are generated that are of the same size as the original training dataset. This is done by using simple random sampling with replacement from the original data. In doing so, some records will be repeated in each new dataset. This type of sample is called a bootstrap sample. A model is then constructed for each bootstrap sample, and the results are combined with the usual methods (voting, or averaging for continuous targets). Cases are weighted normally with this method in each bootstrap sample.
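The bootstrap sampling that bagging is built on can be sketched directly. Model fitting is omitted here; the point is the resampling, which repeats some records and omits others.

```python
import random

# Simple random sampling with replacement, producing a new training set of the
# same size as the original (a bootstrap sample).
def bootstrap_sample(records, rnd):
    n = len(records)
    return [records[rnd.randrange(n)] for _ in range(n)]

rnd = random.Random(42)
original = list(range(1000))
sample = bootstrap_sample(original, rnd)
# Same size as the original; on average about 63% of the distinct original
# records appear in it, the remainder of the slots being duplicates.
```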


Boosting can be used on datasets of almost any size and characteristics. It is designed to increase model accuracy, first and foremost. Bagging should be avoided on very small datasets, especially those with lots of outliers, where the outliers can affect the samples that are constructed. Bagging can increase accuracy, but also reduce model variability.

When these methods are used, the Model Viewer will provide different views from those shown when a single model is constructed. Included will be the results from each model and details on each, plus some indication of the variability of the model results (for bagging).

It is absolutely necessary to have a test or validation dataset on which the boosted or bagged models can be assessed. These models, even more so than a regular model, are highly tuned to the training data, and so their performance must be evaluated on data not used for model-building. Bagging and boosting are not guaranteed to improve model performance on new data, but the idea is that creating several models is worth the tradeoff of reusing the training data several times.

Typically, only a small number of boosted or bagged models needs to be constructed; the default number is 10 in PASW Modeler.

The outcome of boosting or bagging is still only one model nugget, and it can be used the same as any other standard model nugget. The downside to boosting or bagging is that no one equation, decision tree, or the equivalent can represent the model, so it can be hard to describe and characterize how the model makes predictions. Thus, you should investigate the relationship between the predictors and the target field, as we did above, to gain model understanding.

4.14 Model Boosting with Neural Net

We will use boosting to predict CHURNED, using the default settings but changing the random seed once more.

Edit the Neural Net modeling node named CHURNED
Click the Build Options tab
Click Objectives
Click Enhance model accuracy (boosting)


Figure 4.29 Requesting Model Boosting

Click Ensembles settings

We viewed the Ensembles settings earlier in the lesson. The number of component models to create for boosting or bagging is 10, and that can be changed here. The default choice for combining models for categorical targets is voting, and two other choices using probability are available.

We’ll use the default choices.
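The probability-based combining choices can be sketched as averaging each component model's predicted probabilities and picking the category with the highest mean. The function name and the numbers below are invented for the example.

```python
# Combine component models by highest mean predicted probability (illustrative).
def combine_by_mean_probability(component_probs):
    categories = component_probs[0].keys()
    mean = {c: sum(p[c] for p in component_probs) / len(component_probs)
            for c in categories}
    winner = max(mean, key=mean.get)
    return winner, mean

component_probs = [
    {"Current": 0.5, "Vol": 0.4, "InVol": 0.1},
    {"Current": 0.3, "Vol": 0.6, "InVol": 0.1},
    {"Current": 0.4, "Vol": 0.5, "InVol": 0.1},
]
winner, mean = combine_by_mean_probability(component_probs)
# Simple voting would agree here: two of the three components rank Vol highest.
```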


Figure 4.30 Ensemble Model Scoring

Click Advanced settings
Change the Random seed value to 5555 (not shown)
Click Run

You will notice that execution takes much longer than when running a single neural net model. Once the model has finished:

Edit the Neural Net Model CHURNED 


Figure 4.31 Boosted Model Accuracy

The Model Summary view has three measures of accuracy. The bar chart displays the accuracy of the final model, compared to a reference model and a naive model. The reference model is the first model built on the original unweighted data. The naive model represents the accuracy if no model were built, and assigns all records to the modal category (Current). The naive model is not computed for continuous targets.

The ensemble of 10 models is perfectly accurate—100%! That is encouraging, but we’ll have to see how it performs on the Testing data partition.

Click Predictor Importance panel


Figure 4.33 Predictor Frequency for Boosted Model

In some modeling methods, such as decision trees, the predictor set can vary across component models. The Predictor Frequency plot is a dot plot that shows the distribution of predictors across component models in the ensemble. Each dot represents one or more component models containing the predictor. Predictors are plotted on the y-axis and sorted in descending order of frequency; thus the topmost predictor is the one used in the greatest number of component models, and the bottommost is the one used in the fewest. The top 10 predictors are shown. However, all predictors are used in each Neural Net component model in the ensemble, so this plot is not useful here.

Click the Ensemble Accuracy panel


Figure 4.34 Ensemble Accuracy for Boosted Model

The Ensemble Accuracy line plot shows the overall accuracy of the model ensemble as each model is added. Generally, accuracy will increase as models are added, and we see that ensemble model accuracy reached 100% after only five models (you can hover the cursor over the line to view a popup of the accuracy at that point). The line plot can be used to judge how fast the ensemble accuracy is increasing, and whether it is worthwhile to increase (or decrease) the number of models in another modeling run.

Click on Component Model Details panel

Figure 4.35 Component Model Details for Boosted Model

Information about each of the models, in order of their creation, is supplied in the Component Model Details panel. Included are model accuracy, the number of predictors, and the number of synapses (weights), which is directly related to the number of neurons in the network. Not surprisingly, as the models attempted to model cases that were still mispredicted by earlier models, model accuracy decreased, although not dramatically. You can sort the rows in ascending or descending order by the values of any column by clicking on the column header.

This information can be used to decide whether the model should be rerun, with additional component models, or perhaps different modeling settings.

Boosted Model Performance

Because this model has been added to the stream, we can immediately check its performance on the Testing data partition.

Click OK
Run the Analysis node

Figure 4.36 Analysis Node Output for Boosted Model

As we saw when browsing the model, the boosted model is 100.0% accurate on the Training data. On the Testing data, the accuracy is 76.12%. This large drop-off is typical for boosted models and illustrates why Testing data performance is the true guide to model performance on new data.

The level of accuracy is decent, but the accuracy with one model was 75.73%, almost the same. Of course, every small increase in accuracy may be important. Compared to the single neural network, the ensemble performs better with current customers, but not as well with those who left involuntarily. All of these factors must be taken into account when deciding which model is preferred.

We will not investigate the boosted neural net model further, but you can do so using the existing nodes in the stream, or others you add. If you do, you will discover that the boosted model has similar relations between the key predictors and the target field CHURNED.

4.15 Model Bagging with Neural Net

We next try bagging with a neural net model. We will use the same settings in the Neural Net modeling node, just changing to bagging.
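Before changing the node settings, the resampling idea behind bagging (bootstrap aggregating) can be sketched in a few lines of plain Python:

```python
import random

# Sketch of bagging's resampling step: each component model is trained on
# a sample of n records drawn with replacement from the n training
# records, so every component model sees a slightly different version of
# the training data. The record IDs below are invented.

def bootstrap_sample(records, rng):
    return [rng.choice(records) for _ in records]

rng = random.Random(42)  # fixed seed so the example is repeatable
train_ids = list(range(100))
sample = bootstrap_sample(train_ids, rng)

print(len(sample))        # always the same size as the original
print(len(set(sample)))   # fewer unique values: some records repeat
```

Because each sample omits some records and repeats others, the component models differ, which is what gives the ensemble its diversity.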

Close the Analysis node
Edit the Neural Net modeling node named CHURNED
Click Objectives
Click Enhance model stability (bagging)

Figure 4.37 Requesting Model Bagging

Click Run

You will notice that execution takes much longer than when running a single neural net model. Once the model has finished:

Edit the Neural Net Model CHURNED 

Figure 4.38 Bagged Model Accuracy

The Model Summary view has the same three measures of accuracy, and a measure of model diversity (variability). The reference and naïve model accuracies are the same as for the boosted model, as the reference model is a model built on all the training data, as with the boosted model.

The ensemble of 10 models is very accurate at 96.94%, although not completely accurate as was the boosted model.

For bagged models there is a dropdown to display accuracy for the different model combining rules. All of these can be shown on one chart by selecting the Show All Combining Rules check box.

Click Show All Combining Rules check box

Figure 4.39 Bagged Model Accuracy for all Voting Methods

For this bagged model, the rule with the highest accuracy is to use the highest mean probability. You can try all three types of combining rules on the Testing data to pick the best-performing model.
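Two of the combining rules can be sketched in simplified form (this is an illustration with invented component-model outputs, not Modeler's exact implementation):

```python
# Sketch of two combining rules: majority vote over category predictions
# versus picking the category with the highest mean predicted
# probability. The three component-model outputs below are invented.

def majority_vote(category_preds):
    # sorted() makes tie-breaking deterministic
    return max(sorted(set(category_preds)), key=category_preds.count)

def highest_mean_probability(prob_preds):
    """prob_preds: one {category: probability} dict per component model."""
    categories = prob_preds[0]
    means = {c: sum(p[c] for p in prob_preds) / len(prob_preds)
             for c in categories}
    return max(means, key=means.get)

votes = ["churn", "stay", "churn"]
probs = [{"churn": 0.55, "stay": 0.45},
         {"churn": 0.40, "stay": 0.60},
         {"churn": 0.51, "stay": 0.49}]

print(majority_vote(votes))             # churn: 2 of 3 models vote for it
print(highest_mean_probability(probs))  # stay: mean prob 0.513 vs 0.487
```

Note that the two rules can disagree on the same set of component models, which is why comparing combining rules on the Testing data is worthwhile.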

Below the Quality bar chart is a bar chart labeled Diversity. This chart displays the "diversity of opinion" among the component models used to build the bagged ensemble, presented in a greater-is-more-diverse format, normalized to vary from 0 to 1. It is a measure of how much predictions vary across the base models. Although the label indicates that larger is better, this isn't necessarily so. The true test is how well the bagged models perform on the Testing data.

Figure 4.40 Bagged Model Diversity

We skip the Predictor Importance information, which is very similar to that for the boosted ensemble model. For the same reason, we skip the Predictor Frequency panel, which is identical to that for the boosted model, since all predictors are used in each component model.

Click the Ensemble Accuracy panel

Figure 4.41 Component Model Accuracy for Bagged Model

The Component Accuracy chart is a dot plot of predictive accuracy for the component models. Each dot represents one or more component models, with the level of accuracy plotted on the y-axis. Hover over any dot to obtain the ID of the corresponding individual component model. The chart also displays color-coded lines for the accuracy of the ensemble as well as the reference and naïve models. A checkmark appears next to the line corresponding to the model that will be used for scoring. What we can see from the dot plot is that the level of accuracy of the 10 bagged models is very comparable.

Unlike with boosted models, we cannot see the overall accuracy as each model is added. This is because the models have no logical sequence.

Click on Component Model Details panel

Figure 4.42 Component Model Details for Bagged Model

Information about each of the models created, in order of their creation, is supplied in the Component Model Details panel. This is the same type of information as supplied for boosted models.

Here, there is no trend in accuracy as new bootstrap samples are taken, and there really shouldn't be. Also, overall accuracy is very similar for each of the models. And since the diversity measure was low (.08), we know that the model predictions were also very similar. The fact that the models were so comparable in performance may be a good thing, but we won't know until we view the bagged model with the Testing data.

Bagged Model Performance

Because this model has been added to the stream, we can immediately check its performance on the Testing data partition.

Click OK 

Run the Analysis node

Summary Exercises

The exercises in this lesson use the file charity.sav. The following table provides details about the file.

charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category

In this set of exercises you will use a neural network to predict the field  Response to campaign.

1.  Begin with a blank Stream canvas. Place a Statistics source node on the Stream canvas and connect it to the file charity.sav. Tell PASW Modeler to use variable and value labels.

2.  Attach a Type and Table node in a stream to the source node. Run the stream and allow PASW Modeler to automatically define the types of the fields.

3.  Edit the Type node. Set all of the fields to role NONE.

4.  We will attempt to predict response to campaign (Response to campaign) using the fields listed below. Set the role of all five of these fields to input and the Response to campaign field to target.

Pre-campaign expenditure
Pre-campaign visits
Gender
Age
Mosaic Bands (which should be changed to nominal measurement level)

5.  Attach a Neural Net node to the Type node. Run the Neural Net node with the default settings.

6.  Once the model has finished training, browse the generated Net node in the stream. What is the predicted accuracy of the neural network? What were the most important fields within the network?

7.  Connect the generated Net node to a Matrix node and create a data matrix of actual response against predicted response. Which group is the model predicting well?

8.  Use some of the methods introduced in the lesson, as well as others, such as web plots and histograms (or use the Data Audit node with an overlay field), to try to understand the reasoning behind the network's predictions.

9.  Change the random seed, rerun the neural net, and recheck its performance.

10. Try a radial basis function neural network to see if you can improve on the model performance.

11. Try a boosted or bagged model to see if you can improve on model performance compared to the models you created above.

12. For those with extra time: Use C5.0 or other decision tree methods to predict Response to campaign from the charity.sav data. How do the rule induction models compare with the neural network models built here? Which are the most accurate? Which are the easiest to understand?

13. Save a copy of the stream as Exer4.str .

Lesson 5: Support Vector Machines

Objectives

•  Review the foundations of the Support Vector Machines model

•  Use an SVM model to predict customer churn

•  Try several different kernel functions and model parameters to improve the model

•  Discuss how missing data is handled in SVM models

Data

In this lesson we use the data file customer_dbase.sav, which like churn.txt is also from a telecommunications firm. We will use an SVM model to predict customer churn. The file contains fields measuring both customer demographics and customer use of telecommunications services to use as predictors. SVM models can use many input fields, so this file will allow us to demonstrate that feature. We continue to use a Partition node to split the data file.

5.1 Introduction

Support vector machine (SVM) is a robust classification technique that can be used to predict either a categorical or a continuous outcome field. SVM is particularly well suited to analyzing data with a large number of predictor fields. Broadly, an SVM works by mapping data into a space where the data points can be categorized or predicted accurately, even if there is no easy way to separate the points in the original space. This involves using a kernel function to map the data from the original space into the new space. An SVM, like a neural net model, does not provide output in the form of an equation with coefficients on the predictor fields, although predictor importance is available with the model. Thus, like a neural net, to understand the model, and not use it simply as a black box that makes predictions, you will need to do additional analysis.

We will use an SVM in this lesson to predict customer churn in three categories. First we provide some background and theory about how an SVM model is calculated and what features of the model are under user control. Developing an acceptable SVM model usually requires trying various model settings rather than accepting the default node settings.

5.2 The Structure of SVM Models

SVM models were developed to handle difficult classification/prediction problems where the "simple" linear models were unable to accurately separate the categories of an outcome field. A typical complicated problem, in two dimensions, is shown in Figure 5.1. Assume that the X and Y axes represent two predictors, while the circles and squares represent the two categories of a target field we wish to predict.

Figure 5.1 Predicting a Binary Outcome Field

There is no simple straight line that can separate the categories, but the curve drawn around the squares shows that there is a complex curve that will completely separate the two categories.

The central task of SVM is to transform the data from this space into another space where the curve that separates the data points will be much simpler. Typically this means transforming the data so that a hyperplane (in a higher-dimensional space) can be used to separate the points.

The mathematical function used for the transformation is known as a kernel function. After transformation, the data points might be represented as shown in Figure 5.2. 

Figure 5.2 Kernel Transformation of Original Data

The squares and circles can now be separated by a straight line in this two-dimensional space. The filled-in circles and squares are the cases (called vectors in the SVM literature) that are on the boundary between the two classes. They are the same points in both Figure 5.1 and Figure 5.2. The filled-in circles and squares are all the data needed to separate the two categories, and these key points are called support vectors because they support the solution and boundary definition. Because SVM models were developed in the machine learning tradition, the technique was called a support vector machine, hence the model name.
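The mapping idea can be made concrete with a tiny invented example (not from the course data): points inside a circle cannot be split from points outside it by any straight line in two dimensions, but lifting each point into a third dimension makes a flat plane separate them.

```python
# Toy illustration of the kernel mapping idea: add a third coordinate
# z = x^2 + y^2 so that the flat plane z = 1 perfectly separates points
# inside the unit circle from points outside it. All points are invented.

def lift(point):
    x, y = point
    return (x, y, x * x + y * y)

inside = [(0.2, 0.1), (-0.3, 0.4), (0.0, -0.5)]    # target category A
outside = [(1.5, 0.2), (-1.2, 1.1), (0.9, -1.3)]   # target category B

# After lifting, every category-A point sits below the plane z = 1 and
# every category-B point sits above it.
print(all(lift(p)[2] < 1 for p in inside))    # True
print(all(lift(p)[2] > 1 for p in outside))   # True
```

A real kernel function performs an analogous (usually implicit) transformation in far higher dimensions.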

Even though it appears that we have a solution, there is more than one straight line (hyperplane) that could be used to separate the two categories, as illustrated in Figure 5.3.

Figure 5.3 Multiple Possible Separating Lines

SVM models try to find the best hyperplane, one that maximizes the margin (separation) between the categories while balancing the tradeoff of potentially overfitting the data. The narrower the margin between the support vectors, the more accurate the model will be on the current data. Thus, a separating line as shown in Figure 5.4 maximizes the margin between the support vectors.

Figure 5.4 Creating Maximum Separation Between Support Vectors

However, although this separator is 100% accurate, it may be too narrow to perform well on new data, as illustrated in Figure 5.5. Here, in a new dataset, there are circles and squares that fall on the wrong side of the support vectors and so will be classified in error.

Figure 5.5 Misclassified Cases in Training Data

To allow for this, SVM models include a regularization or weight factor C that is added as a penalty term in the function used by the SVM model. The algorithm attempts to maximize the margin between the support vectors while minimizing error. As described below, you can try various values of C to find an optimal model.

Although the description of an SVM has been in the context of a categorical target, an SVM can be applied to predict a continuous field. See the Modeler 14.0 Algorithms Guide for more information.

Kernel Functions

Four different types of kernel function are available in the SVM node.

•  Linear: A simple function that works well when nonlinear relationships in the data are minimal
•  Polynomial: A more complex function that allows for higher-order terms
•  RBF (Radial Basis Function): Equivalent to the neural network of this type. Can fit highly nonlinear data.
•  Sigmoid: Equivalent to a two-layer neural network. Can also fit highly nonlinear data.

Some of these functions have other parameters that you can modify to find an optimal model, such as the degree of the polynomial, or the gamma factor that controls the influence of the function. As with the factor C, there is a tradeoff with gamma values between accuracy and overfitting.
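For reference, the textbook forms of these four kernel families can be written out as follows. This is a sketch; Modeler's exact parameterization may differ.

```python
import math

# Textbook forms of the four kernel families named above. u and v are two
# records' predictor vectors; the kernel value measures their similarity
# in the transformed space.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.1):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

print(linear_kernel([1, 2], [3, 4]))   # 1*3 + 2*4 = 11
print(rbf_kernel([1, 2], [1, 2]))      # identical points -> 1.0
```

The gamma, coef0 (Bias), and degree arguments correspond to the node settings discussed below.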

You can anticipate that you will not find the best model using the default settings in the SVM node. Just as with decision trees, where you usually need to change the depth of a tree, a pruning parameter, or the minimum number of cases in terminal nodes, SVM models must be tuned to perform better. One method to fit many models efficiently is to use the Auto Classifier node (if the target field is a flag) or the Auto Numeric node (if the target is continuous), which allow you to run several versions of a model at one time (see Lessons 12 and 13, respectively). For example, you could run 10 different SVMs with 10 different values of C.

SVM models make predictions by use of the separating hyperplane, and the equation of that hyperplane, and the support vectors themselves, are possible outputs from the model. However, neither of these will provide much insight into the model unless the dimensionality of the space is very low, which is rarely the case. As a consequence, the SVM node doesn't provide this output, although you can request predictor importance (which is not directly related to either the support vectors or the hyperplane definition). So to understand a model, you will need to explore how the predictors are related to the predicted values.
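The try-several-settings workflow amounts to a grid search over candidate parameter values, which can be sketched as follows. The train_and_score callback is a stand-in for whatever builds and evaluates a model; fake_score below is a made-up scoring function so the example runs on its own.

```python
# Sketch of a tuning loop: fit a model for each (C, gamma) combination,
# score it on held-out data, and keep the best.

def grid_search(train_and_score, c_values, gamma_values):
    best = None
    for c in c_values:
        for gamma in gamma_values:
            accuracy = train_and_score(c, gamma)
            if best is None or accuracy > best[0]:
                best = (accuracy, c, gamma)
    return best  # (best accuracy, best C, best gamma)

def fake_score(c, gamma):
    # Hypothetical accuracy surface peaking at gamma = 0.25, rising with C.
    return 0.7 + 0.01 * c - 0.2 * abs(gamma - 0.25)

best = grid_search(fake_score, [1, 3, 5, 10], [0.16, 0.25, 0.32])
print(best)   # highest accuracy at C = 10, gamma = 0.25
```

In practice the scoring step must use a held-out partition, since scoring on the training data would reward overfitting.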

With this background in the basics of an SVM model, we can now apply an SVM to the churn data.

5.3 SVM Model to Predict Churn

The customer database we use in this lesson contains a field (churn) that measures whether or not a customer of the telecommunications firm has renewed their service. We will attempt to predict this flag field with several inputs.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory
Double-click on SVM.str
Run the Table node
Close the Table window
Edit the Type node

Figure 5.6 Type Node for SVM Model to Predict churn

There are 19 fields with Role Input that will be used as predictors. SVM models can easily handle hundreds of predictors, but we limit the number of predictors here for a practical reason. Requesting predictor importance for an SVM model can greatly increase execution time (by over a factor of 10). Therefore, so that we don't wait excessively for models to run, we limit the inputs.

Close the Type window
Add an SVM node to the stream
Connect the Type node to the SVM node
Edit the SVM node

Figure 5.7 SVM Node Model Tab

All the model settings are available in the Expert tab.

Click the Expert tab
Click the Expert option button

Figure 5.8 Expert Options for SVM Models

As mentioned in the previous section, there are four types of kernels that can be selected to effectively create different types of models. The default is RBF (Radial Basis Function), and we use that initially.

The Regularization parameter (C) controls the trade-off between maximizing the margin of the support vectors and minimizing the training error. Its value should normally be between 1 and 10 inclusive, with the default being 10. Increasing the value improves the classification accuracy (or reduces the regression error when predicting a continuous outcome) for the training data, but this can also lead to overfitting. In general, it is usually better to reduce C.

The Regression precision (epsilon) is used when the measurement level of the target field is continuous. Errors in the model predictions are accepted if they are under this value. Increasing epsilon may result in faster modeling, but at the expense of accuracy.

The RBF gamma value is enabled only if the kernel type is set to RBF. Gamma should normally be between 3/k and 6/k, where k is the number of input fields. For example, if there are 12 input fields, values between 0.25 and 0.5 would be worth trying. Increasing the value improves the classification accuracy (or reduces the regression error) for the training data, but this can also lead to overfitting, in a similar manner to the Regularization parameter. For our problem, with 19 predictors, gamma should be between .16 and .32, so we will need to change the default value of .10.

The Gamma value is enabled only if the kernel type is set to Polynomial or Sigmoid. As with RBF gamma, increasing the value improves the classification accuracy for the training data, but this can also lead to overfitting.

The Bias value is enabled only if the kernel type is set to Polynomial or Sigmoid. Bias sets the coef0 value in the kernel function. The default value 0 is suitable in most cases.

You can use the Degree value when the Kernel type is Polynomial to control the complexity (dimension) of the mapping space. The default is 3 (equivalent to a term such as X^3).

Change the Regularization parameter to 3
Change the RBF Gamma value to 0.2

Figure 5.9 Expert Settings for SVM Model

Click Analyze tab

In the Analyze tab, unlike most models, the calculation of predictor importance is not checked on by default. This calculation can be lengthy, and we will not request it in our initial model run.

Figure 5.10 Options in Analyze Tab

Click Run

After the model has finished execution:

Right-click the generated SVM model and select Browse

Figure 5.11 SVM Generated Model Summary Tab

Close the SVM model browser window
Add an Analysis node to the stream
Attach the SVM generated model to the Analysis node
Edit the Analysis node
Click the Coincidence matrices (for symbolic targets) check box

Figure 5.12 Analysis Node Settings

Click Run

Figure 5.14 Requesting Predictor Importance

Click Run

The model will take much longer to run. When it is done:

Right-click the generated SVM model and select Browse

Figure 5.15 Predictor Importance in Predicting churn

The most important field, by far, is equip (the predictor importance chart displays the variable label, if one is available, instead of the variable name), which records whether or not a customer rents equipment from the telecommunications firm. The second most important field, ebill, records whether or not a customer pays bills electronically. The most important demographic field is gender. One strategy based on this chart is to drop some of the fields of low importance. We will instead change model parameters.

Close the SVM model browser 

5.4 Exploring the Model

Before trying a different model, we briefly illustrate examining how the SVM model makes predictions. Since we are working with the training data, we need to use a Select node to work only with the data in that partition.

Add a Select node from the Record Ops palette to the stream near the Type node
Attach the Type node to the Select node
Edit the Select node
Use the Expression Builder to create the condition Partition = "1_Training"

Figure 5.16 Select Node to Select Training Data

Click OK
Connect the Select node to the SVM model node in the stream
Click Replace to replace the connection
Add a Distribution node to the stream near the SVM model
Connect the SVM model node to the Distribution node
Edit the Distribution node
Specify equip as the Field
Specify churn as the Overlay field
Click the Normalize by color check box

These selections will show us the relationship between the most important predictor and the original values of churn.

Figure 5.17 Requesting a Distribution Graph with equip and churn

Click Run

We see in Figure 5.18 that most customers who don't rent equipment (value of 0, or you can click the Label tool) did not churn, but a sizeable fraction of customers who did rent equipment churned.

Figure 5.18 Distribution Graph of equip and churn

 Now we look at model predictions.

Close the Distribution graph window
Edit the Distribution node
Change the Overlay field to $S-churn (not shown)

Click Run

Figure 5.19 Distribution Graph of equip and Predicted churn

The model predictions are similar to the previous graph, only more extreme. The model predicts almost no customers who don't rent equipment will churn. It predicts about the same percentage of customers who do rent equipment will churn.

Close the Distribution graph window

We could continue this process with other predictors, using appropriate nodes to see the relationship between them and the original and predicted values of churn.

We next try to improve our first model.

5.5 A Model with a Different Kernel Function

We have three other kernel functions to try, along with changing model parameters. The linear model may be too simple, so let's start with the Polynomial.

Edit the SVM modeling node
Click the Analyze tab
Click Calculate predictor importance to deselect it
Click the Expert tab
Click the Kernel type dropdown and select Polynomial
Change the Degree value to 2

This uses a polynomial one degree simpler than the default.

Figure 5.20 Requesting a Model with Polynomial Kernel

Click Run

When the model is done executing:

Attach the Type node to the new generated SVM model
Edit the SVM model
Click the Annotations tab
Click Custom and name the model Polynomial - 2

Click OK
Attach the generated SVM model to the Analysis node by replacing the connection
Run the Analysis node

Figure 5.22 Analysis Node for Linear Kernel Model

Close the Analysis Output browser 

We won’t show the results of a model using a Sigmoid kernel, as it performs much more poorly (you can try it if you wish).

5.6 Tuning the RBF Model

To illustrate tuning a model even further, we can rerun the SVM model node with an RBF kernel.

We’ll change the value of C, and even though, in theory, increasing it may make the model generalize less well, we’ll give it a try to see the effect.
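To see the C trade-off concretely, here is a hedged sketch in scikit-learn, on toy data rather than the lesson’s churn file; gamma=0.2 and the two C values echo the settings used in this section. A larger C penalizes training errors more heavily, so training accuracy tends to rise while test accuracy may or may not.

```python
# Sketch of the regularization trade-off: larger C fits the training data
# more tightly and may generalize less well. Toy data for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, flip_y=0.1,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1)

for C in (3.0, 5.0):
    m = SVC(kernel="rbf", gamma=0.2, C=C).fit(X_tr, y_tr)
    # print C, training accuracy, test accuracy
    print(C, round(m.score(X_tr, y_tr), 3), round(m.score(X_te, y_te), 3))
```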

Edit the SVM modeling node
Click on the Expert tab
Change the Kernel type to RBF (not shown)
Change the Regularization parameter to 5
Click Run

After the model has run:

Connect the SVM model to the Analysis node
Run the Analysis node


Figure 5.23 Analysis Node with RBF Kernel Model

The model did a bit better overall than the original RBF-based model (see Figure 5.13). And it did better at predicting the customers who churned (47%). So this model appears to be better even with a higher value of C.

SVM Models and Missing Data

In Figure 5.23 for the Analysis node output, note that there are three records for which a prediction could not be made (the column with the label $null$). In SVM models, records with missing values for any of the input or output fields are excluded from model building. In these data, there are 3 records with missing values on longten, and two of these customers also have a missing value for cardten. (You can use a Select node and Table node to view the data and see these records.) By chance, all three customers were in the Testing Partition.

If the amount of missing data in a file is a small fraction of the total records, you may be willing to tolerate the loss of some records from model building and scoring. But if the amount of missing data is significant, you will need to take some action before using the SVM node. You can use the Type node Check option to change missing values to a valid value. You could also do this yourself with a Filler node. Or you could use a Data Audit node to impute missing values in a sophisticated fashion.

Of course, if you do this when creating a model, you will also need to use the same methods to handle missing data before scoring new data with the model.
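As a hedged illustration of the Filler-node idea, the pandas sketch below mean-fills missing values on two fields named after the lesson’s data (longten, cardten); the toy values and the mean-fill rule are our assumptions for the example, not Modeler’s own behavior.

```python
# Sketch: simple mean imputation so records with missing inputs are not
# dropped by the SVM node. Field names follow the lesson's data; the
# mean-fill rule is one simple choice among many.
import pandas as pd

df = pd.DataFrame({"longten": [10.5, None, 3.2, None, 7.7],
                   "cardten": [120.0, 80.0, None, 60.0, 95.0]})

filled = df.fillna(df.mean())          # replace NaN with the column mean
print(int(filled.isna().sum().sum()))  # count of remaining missing values
```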

Model Comparison

We have several results from several models, and picking a final model is based, in part, on theaccuracy on the Testing partition. For ease of reference, we have created the table below whichsummarizes the results from Figure 5.13, Figure 5.21, Figure 5.22, and Figure 5.23. 


Table 5.1 Model Performance on Testing Partition

Model Type                    Overall     Accuracy at Predicting   Accuracy at Predicting
                              Accuracy    Non-Churners             Churners
RBF (C=3, RBF gamma=0.2)      77.81%      92.1%                    36.3%
Polynomial (Degree=2)         79.14%      92.9%                    39.2%
Linear                        79.73%      93.7%                    39.4%
RBF (C=5, RBF gamma=0.2)      77.65%      91.2%                    38.3%

The best model on all three measures used a Linear kernel, while the next best model used a Polynomial kernel. One never knows which model will do best until trying several types of models with various settings. None of these models does very well at predicting those customers who churned, so we might continue model building by adding or deleting fields, modifying fields, or resuming changes to model parameters.

It should be clear by now that to be well-organized, you will want to develop a strategy of changing model parameters in a systematic way, and keeping close track of the model parameters and results.

As noted above, the Auto Classifier or Auto Numeric nodes can be very helpful.
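One way to stay organized is a small loop that records each setting alongside its accuracy. The sketch below (toy data in scikit-learn; the settings echo Table 5.1 but the results will not match it) shows the bookkeeping pattern rather than a real Modeler run.

```python
# Sketch: try several kernel/parameter settings systematically and keep
# the results together for comparison. Toy data for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

settings = [dict(kernel="rbf", gamma=0.2, C=3.0),
            dict(kernel="rbf", gamma=0.2, C=5.0),
            dict(kernel="poly", degree=2),
            dict(kernel="linear")]

results = []
for params in settings:
    acc = SVC(**params).fit(X_tr, y_tr).score(X_te, y_te)
    results.append((params, acc))

# best settings first
for params, acc in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{acc:.3f}  {params}")
```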

We don’t need to save the stream in this lesson.


Summary Exercises

The exercises in this lesson use the file charity.sav. The following table provides details about the file.

charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender, and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category

In this set of exercises you will attempt to predict the field Response to campaign using an SVM model.

1.  Begin with a clear Stream canvas. Place a Statistics source node on the Stream canvas and connect it to the file charity.sav. Tell PASW Modeler to Read Labels as Names.

2.  Attach a Type and Table node in a stream to the source node. Run the stream and allow PASW Modeler to automatically define the types of the fields.

3.  Edit the Type node. Set all of the fields to role NONE.

4.  We will attempt to predict response to campaign (Response to campaign) using the fields listed below. Set the role of all five of these fields to Input and the Response to campaign field to Target.

Pre-campaign expenditure
Pre-campaign visits
Gender
Age
Mosaic Bands (which should be changed to nominal measurement level)


Lesson 6: Linear Regression

Objectives

•  Review the concepts of linear regression

•  Use the Regression node to model medical insurance claims data

•  Demonstrate the Linear node to perform regression modeling

Data

We use the data file InsClaim.dat, which contains 293 records based on patient admissions to a hospital. All patients belong to a single diagnosis related group (DRG). Four fields (grouped severity of illness, age, length of stay, and insurance claim amount) are included. The goal is to build a predictive model for the insurance claim amount and use this model to identify outliers (patients with claim values far from what the model predicts), which might be instances of errors or fraud made in the claims.

6.1 Introduction

Linear regression is a method familiar to just about everyone these days. It is the classic general linear model (GLM) technique, and it is used to predict a target that is interval or ratio in scale (measurement level continuous) with predictors that are also interval or ratio. In addition, categorical input fields can be included by creating dummy variables (fields). The Regression model node performs linear regression in PASW Modeler.

Linear regression assumes that the data can be modeled with a linear relationship. To illustrate, the figure below contains a scatterplot depicting the relationship between the length of stay for hospital patients and the dollar amount claimed for insurance. Superimposed on the plot is the best-fit regression line.

The plot may look a bit unusual in that there are only a few values for length of stay, which is recorded in whole days, and few patients stayed more than three days.


Figure 6.1 Scatterplot of Hospital Length of Stay and Insurance Claim Amount


Although there is a lot of spread around the regression line and a few outliers, it is clear that there is a positive trend in the data such that longer stays are associated with greater insurance claims. Of course, linear regression is normally used with several predictors; this makes it impossible to display the complete solution with all predictors in convenient graphical form, but it is useful to look at bivariate scatterplots.

6.2 Basic Concepts of Regression

In the plot above, to the eye (as well as to one's economic sense) there seems to be a positive relation between length of stay and the amount of a health insurance claim. However, it would be more useful in practice to have some form of prediction equation. Specifically, if some simple function can approximate the pattern shown in the plot, then the equation for the function would concisely describe the relation and could be used to predict values of one field given knowledge of the other. A straight line is a very simple function and is usually what researchers start with, unless there are reasons (theory, previous findings, or a poor linear fit) to suggest otherwise. Also, since the goal of much research involves prediction, a prediction equation is valuable. However, the value of the equation would be linked to how well it actually describes or fits the data, and so part of the regression output includes fit measures.

The Regression Equation and Fit Measure

In the plot above, insurance claim amount is placed on the Y (vertical) axis and the length of stay appears along the X (horizontal) axis. If we are interested in insurance claim as a function of the length of stay, we consider insurance claim to be the output field and length of stay as the input or predictor field. A straight line is superimposed on the scatterplot along with the general form of the equation:

Yi = A + B * Xi + ei

where A is the intercept, B is the slope, and ei is the error (residual) for record i.
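As a worked illustration of this equation (our own synthetic data, not InsClaim.dat), a least-squares fit recovers A and B from data generated with known values:

```python
# Sketch: estimate A (intercept) and B (slope) in Y = A + B*X + e by
# least squares, on synthetic data generated with A=2 and B=3.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 5, size=200)                 # e.g., length of stay
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 200)     # claim with random error

B, A = np.polyfit(x, y, 1)   # degree-1 fit returns (slope, intercept)
print(round(A, 2), round(B, 2))  # estimates should land near 2 and 3
```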


Figure 6.5 Regression Fields Tab

The weighted least squares option (Use weight field check box) supports a form of regression in which the variability of the output field is different for different values of the input fields; an adjustment can be made for this if an input field is related to this degree of variation. In practice, this option has been rarely used in data mining.

We see here the option to specify a partition field when there is such a field but it doesn’t have the default name of Partition. You can also specify more than one input field as split fields; a model is built for each possible combination of the values of the selected split fields.

Click the Expert tab
Click the Expert Mode option button


You control the criteria used for input field entry and removal from the model. By default, an input field must be statistically significant at the .05 level for entry and will be dropped from the model if its significance value increases above .1.

Click Cancel
Click the Output button

Figure 6.8 Advanced Output Options

These options control how much supplementary information concerning the regression analysis displays. The results will appear in the Advanced tab of the generated model node in HTML format. Confidence bands (95%) for the estimated regression coefficients can be requested (Confidence interval). Summaries concerning relationships among the inputs can be obtained by requesting their Covariance matrix or Collinearity Diagnostics. The latter are especially useful when you need to identify the source and assess the level of redundancy in the predictors, and the more predictors you have, the more likely that some may be highly correlated (you can ask your instructor for more information on these diagnostics). Part and partial correlations measure the relationship between an input and the output field, controlling for the other inputs. Descriptive statistics (Descriptives) include means, standard deviations, and correlations; these summaries can also be obtained from the Statistics or Data Audit node. The Durbin-Watson statistic can be used when running regression on time series data and evaluates the degree to which adjacent residuals are correlated (regression assumes residuals are uncorrelated).
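The Durbin-Watson statistic itself is easy to compute from residuals; the sketch below is our own illustration of the formula, not Modeler output. Values near 2 indicate uncorrelated adjacent residuals; values toward 0 or 4 indicate positive or negative serial correlation.

```python
# Sketch: Durbin-Watson = sum of squared successive residual differences
# divided by the sum of squared residuals.
import numpy as np

def durbin_watson(resid):
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
independent = rng.normal(size=500)           # uncorrelated "residuals"
print(round(durbin_watson(independent), 2))  # should fall close to 2
```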

Click Cancel
Click the Simple option button
Click the Model tab, and then click Enter on the Method drop-down list (not shown)
Click the Run button

After the model has run:

Edit the Regression generated model node in the stream
Click the Summary tab, and then expand the Analysis summary


Figure 6.13 Errors Sorted in Descending Order

There are two records for which the claim values are much higher than the regression prediction. Both are about $6,000 more than expected from the model. These would be the first claims to examine more carefully. We could also examine the last few records for large over-predictions, which might be errors as well.
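The sorting step can be mimicked with a few lines of pandas; the numbers below are invented for illustration, with PRED standing in for Modeler’s $E-CLAIM prediction field.

```python
# Sketch: compute residuals (actual minus predicted claim) and sort
# descending so the largest under-predictions surface first.
# Values are made up; PRED stands in for the $E-CLAIM field.
import pandas as pd

df = pd.DataFrame({"CLAIM": [9500.0, 4200.0, 8700.0, 3100.0],
                   "PRED":  [3400.0, 4000.0, 2900.0, 3300.0]})
df["DIFF"] = df["CLAIM"] - df["PRED"]

# the claims most worth examining for error or fraud
print(df.sort_values("DIFF", ascending=False).head(2))
```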

6.4 Using the Linear Models Node to Perform Regression

The Linear Models node was added to Modeler 14.0 to create linear models to predict a continuous target with one or more predictors. Models are created with an equation that assumes a simple linear relationship between the target and predictors.

The Linear Models node has more features than the Regression node, including the ability to create the best subset model, several criteria for model selection, the option to limit the number of predictors, and the use of bagging and boosting, as discussed earlier in this course. In addition, there is a feature to automatically prepare the data for modeling, by transforming the target and predictors in order to maximize the predictive power of the model. This includes outlier handling, adjusting the measurement level of predictors, and merging similar categories. The Linear Models node automatically creates dummy variables from categorical fields (that have nominal or ordinal measurement level), which is another definite advantage.

The Linear Models node also uses the new Model Viewer to display a wealth of information about the model that helps to evaluate it and understand the effect of the predictors.

For this example, we will reproduce the model from the Regression node and concentrate on the new features and output, rather than attempt to find a more accurate model.


Close the Table node
Add a Linear node to the stream near the Type node
Connect the Linear node to the Type node
Edit the Linear node
Click the Build Options tab

Figure 6.14 Build Options Tab for Linear Models Node

There are five areas within the Build Options tab that correspond to the standard ones included with most Classification nodes. We will create a standard model and not use bagging and boosting.

Click Basics settings


The Ensembles and Advanced settings are similar to those for other modeling nodes, so we won’t review them here.

Click Model Options tab

Figure 6.17 Model Options Tab for Linear Models Node

Neither probability nor propensity is available for the Linear Models node because it is predicting a continuous target.

Click Run

After the model has executed:

Edit the CLAIM Linear Models generated model


Figure 6.19 Predictor Importance for Linear Model

Predictor importance is about equal for ASG and AGE, with LOS being most important. Importance for a Linear model is calculated differently than for a Regression model. For Linear Models, the method of leave one predictor out is used, with the model statistics compared with and without that predictor (see the PASW® Modeler 14 Algorithms Guide for details).
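The leave-one-predictor-out idea can be sketched with ordinary least squares: drop each predictor in turn and measure the fall in R-squared. The data below are synthetic, with field names borrowed from the lesson; this mirrors the idea, not Modeler’s exact algorithm.

```python
# Sketch: leave-one-predictor-out importance -- refit without each
# predictor and record the drop in R^2. Synthetic data; LOS is built
# to matter most, echoing the result described above.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))                       # columns: ASG, AGE, LOS
y = 0.5 * X[:, 0] + 0.5 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(size=n)

full_r2 = LinearRegression().fit(X, y).score(X, y)

drops = {}
for j, name in enumerate(["ASG", "AGE", "LOS"]):
    X_drop = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(X_drop, y).score(X_drop, y)
    drops[name] = full_r2 - r2
    print(name, round(drops[name], 3))   # bigger drop = more important
```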

Click on the Effects panel


Summary Exercises

The exercises use the data file InsClaim.dat that was used in this lesson. The following table provides details about the file.

InsClaim.dat contains insurance claim information from patients in a hospital. All patients were in the same diagnosis related group (DRG). Interest is in building a prediction model of total charges based on patient information and then identifying exceptions to the model (error or fraud detection). The file contains about 300 records and the following fields:

ASG      Severity of illness code (higher values mean more seriously ill)
AGE      Age
LOS      Length of hospital stay (in days)
CLAIM    Total charges in US dollars (total amount claimed on form)

1.  Using the insurance claims data, use the Stepwise method and compare the equation to the one obtained using the Enter method. Are you surprised by the result? Why or why not? Try the Forward and Backward methods. Do you find any differences?

2.  Instead of examining errors in the original scale, analysts may prefer to express the residual as a percent deviation from the prediction. Such a measure may be easier to communicate to a wider audience. Add a Derive node that calculates a percent error. Name this field PERERROR and use the following formula: 100 * (CLAIM – '$E-CLAIM') / '$E-CLAIM'. Compare this measure of error to the original DIFF. Do the same records stand out? What conditions is percent error most sensitive to? Use the Histogram node to produce histograms for either of the error fields, generate a Select node to select records with large errors, and then display them in a table.

3.  Use the Neural Net modeling node to predict CLAIM using a neural network. How does its performance compare to linear regression? What does this suggest about the model?

4.  Fit a C&R Tree model and make the same comparison. Examine the errors from the better of the neural net and C&R Tree models (as you judge them). Do the same records consistently display large errors?


Lesson 7: Cox Regression for Survival Data

Overview

•  What is Survival Analysis?

•  What to Look for in Survival Analysis

•  Cox Regression

•  Checking the Proportional Hazards Assumption

•  Predictions from a Cox Model

Data

In this lesson we use the data file customer_dbase.sav, which contains 5,000 records from customers of a telecommunications firm. The firm has collected a wide variety of consumer information on its customers, and we are interested in studying the length of time customers retain their primary credit card. In other words, we wish to model the time for these customers to churn (not renew) their primary credit card. We will use several predictors to model churn to learn their effect on time to churn.

7.1 Introduction

Survival analysis studies the length of time to an event of interest. The analysis can involve no predictors, or it can investigate survival time as a function of one or more predictor fields. The technique was originally used in medical research to study the amount of time patients survive following onset of a disease (hence the name survival analysis). In data mining, it has been applied to model such diverse outcomes as the length of time a person subscribes to a newspaper or magazine, the time employees spend with a company, the time to failure for electrical or mechanical components, the time to make a second purchase from an online retailer, or the length of tenure for renters of commercial properties.

The Cox node in PASW Modeler can perform both univariate (no predictors) and multivariate (Cox regression) survival analysis. The former type of analysis is called Kaplan-Meier, often used to compare survival time for treatment and control groups in medical studies. In data mining, there are many possible predictors, so Cox regression, a semi-parametric technique, is used. In this lesson we will review the concepts and theory behind survival analysis and Cox regression, and then perform a Cox regression predicting churn.
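To make the univariate case concrete, here is a minimal Kaplan-Meier estimator in plain Python; the times are made up for illustration, and real analyses would use the Cox node or a survival library. An event flag of 0 marks a right-censored record.

```python
# Sketch: a minimal Kaplan-Meier estimator for right-censored data.
# At each observed event time the survival estimate is multiplied by
# (at_risk - 1) / at_risk; censored records simply leave the risk set.
def kaplan_meier(times, events):
    """Return (time, survival) pairs; events[i] = 1 means churn observed."""
    s = 1.0
    curve = []
    at_risk = len(times)
    # at tied times, process events before censored records (convention)
    for t, e in sorted(zip(times, events), key=lambda p: (p[0], -p[1])):
        if e == 1:
            s *= (at_risk - 1) / at_risk
            curve.append((t, s))
        at_risk -= 1
    return curve

times  = [2, 3, 3, 5, 8, 8, 12, 15]   # e.g., years holding the card
events = [1, 1, 0, 1, 1, 0, 1, 0]     # 0 = right-censored
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```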


Figure 7.2 Types of Censoring in Survival Data

Both left censoring and left truncation can cause problems for models because they lead to a biased sample. In data-mining applications, it is common to have customer history data taken from a cross-sectional extraction from existing databases, using all customers active as of some fixed date when the study begins. This approach will systematically undersample customers with short survival times (those who are left-censored) and thus overestimates survival. Left truncation is less of an issue with business databases because the time a customer begins is normally well known (although when companies merge and combine customers, incompatible data systems can lead to uncertainties in customer history).

To solve the left-censored problem, it is better to sample on a history of customers over time, sampling not cross-sectionally but over some defined time period. This means that you don’t need to choose all those customers who began, say, on a fixed date. Instead, customers can enter, and leave the study (because they churned), over some long time interval. This doesn’t imply that a survival study must actually go on for many months or years in real-time; instead, it means that the data sampling must be done over an extended time interval.

Too much right-censored data can also be a problem simply because the event of interest, here churning, will have occurred too infrequently.

Why Not Regression or Logistic Regression?

Those who first encounter survival data sometimes wonder why linear regression can’t be used to predict survival time, or why logistic regression or other methods to predict dichotomous outcomes can’t be used to predict whether an event has occurred or not. Besides the censoring issue, there are several reasons, but the key one related to linear regression is that the residuals from the regression model are unlikely to be normally distributed, even for large sample sizes. This is because the time to event distribution is likely to be non-normal, even bimodal in many real-world applications (there are certain intervals when more customers are likely to churn, such as at the end of their contract period).

Logistic regression doesn’t assume a particular distribution of the residuals, but it also doesn’t handle censored data appropriately. It is possible to follow a sample of, say, 1000 customers, from the time they obtain a credit card until the last one has dropped that card (many years later). This type of data set has no right-censored data because the status of every customer is known along with when the event of interest occurred. However, collecting this type of data is often impractical; moreover, since conditions change rapidly in many businesses, the effects of predictors on credit card churn may


As a final step in data exploration, let’s look at the relationship between these two key fields.

Close the Histogram window
Edit the Histogram node
Specify churn as the Color overlay field
Click Options
Click Normalize by color (not shown)
Click Run

At first, the smoothly declining percentage over time of values of churn=1 over cardtenure may come as a surprise. But actually, as with many products, most consumers who switch do so early. So churn rates are initially around 50% for the first few years. But then, over time, switching the primary credit card still occurs, but less frequently as a percentage of those customers who have survived that long. This trend continues right up to the end of data collection at 40 years. This will help us understand predictions from a Cox regression model in a later section.

Figure 7.6 Overlay Histogram of cardtenure by churn

We are now ready to add the Cox regression node to the stream and review its settings.

Add a Cox modeling node to the stream
Connect the Type node to the Cox node
Edit the Cox node

We selected the input and target fields in the Type node, and the Cox node correctly lists churn as the target field. However, we also need to specify which field contains survival times.

Figure 7.7 Fields Tab in Cox Node

Click the Survival time field chooser and select cardtenure
Click the Model tab

As with other regression-based models, Cox regression can estimate a model either by using all the predictors or by performing forward or backward stepwise model-building. In this example, we use the default choice of Enter and use all the predictors. If you have many predictors and want to use one of the stepwise methods, you definitely should use a testing or validation sample or partition.

Complex models can be built by specifying a Model type of Custom and then selecting specific terms.This allows you to incorporate interaction terms into a model, for example.

If you would prefer to see separate analyses for discrete groups of customers, a Groups field can be selected. This field should be categorical, and a separate model will be developed for each category in this field. Alternatively, such a field can be included in the model as a predictor, but if you believe that coefficients and survival times are quite different for the various groups, or you want to assess whether this is the case, estimating separate models can be a useful approach.

Figure 7.8 Model Tab in Cox Node

Click the Expert tab
Click the Expert options button
Click the Output button

By default, only the most basic output is supplied by the Cox node, and neither the survival nor the hazard plots are included.

The check box for  Display baseline function will display the baseline hazard function and cumulativesurvival at the mean of the covariates.

Figure 7.9 Expert Output Options

Click the Display baseline function check box
Click the Survival and Hazard check boxes under the Plots area

When the plots selections are made, the bottom of the dialog box becomes active, and the fields included in the model are listed, with their value set to the mean. This is necessary because the survival and hazard functions depend on the values of the predictors, and you must use constant values for the predictors to plot the functions versus time. The default is to use the mean, but you can enter your own values for the plot using the grid. This would allow you to get plots of survival for a particular type of customer. For categorical inputs, indicator coding is used, so there is a regression coefficient for each category (except the last). For a categorical input, the mean value for each dummy field is equal to the proportion of cases in the corresponding category.
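The last point above can be verified with a tiny sketch (the category values below are made up for illustration): the mean of an indicator (dummy) field is exactly the proportion of cases in the category it codes.

```python
# Hypothetical marital-status values for five customers
values = ["unmarried", "married", "unmarried", "unmarried", "married"]

# Indicator coding: 1 if the case falls in the category, else 0
dummy_unmarried = [1 if v == "unmarried" else 0 for v in values]

# The mean of the dummy field equals the proportion of "unmarried" cases
mean_dummy = sum(dummy_unmarried) / len(dummy_unmarried)  # 0.6
```

This is why setting each dummy field to its mean, as the dialog does by default, evaluates the plotted functions at the sample category proportions.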

You can also request a separate line for each value of a categorical field on any plot. We'll do so for marital (the field used here does not have to be an input to the model). This is not equivalent to adding a Groups field to the model.

Click the Plot a separate line for each value dropdown and select marital

Figure 7.10 Completed Advanced Output Dialog Selections

We will now briefly look at the Settings tab.

Click OK
Click the Settings tab

The Settings tab has several options to specify how a Cox model should be applied to make predictions. The model can be scored at regular time intervals, over one or more time periods, with the unit of time defined by the field used in the model. Alternatively, another time field can be specified.

In many cases, customers (or the equivalent) will already have a survival time which must be taken into account (not everyone is beginning as a new customer who has just acquired a product or subscription), so the Past survival time setting allows you to select a field which contains this information (this is often the same field as used for survival time itself, such as cardtenure in the current example).

These options become relevant when a model has already been developed, so we won't say more about them at this point in the lesson.

Figure 7.11 Settings Tab in a Cox Node

Click Run 

After the model runs:

Right-click the Cox model in the stream named churn and select Edit

The Categorical Variable Coding table shows the dummy variable coding for the categorical variables in the model. Unlike other PASW Modeler nodes, Cox regression uses indicator coding with the last category as the reference category. This means that for flag variables such as gender, which is coded 0 for males and 1 for females, the coding within Cox regression reverses this ordering. This is very important to note when interpreting the model.

If you prefer the original ordering, you can change the values of a field with a Reclassify node.

Figure 7.12 Categorical Variable Coding

The next set of tables includes tests of the model as a whole. Since all predictors are entered at once, the values reported in the Change From Previous Step and Change From Previous Block sections are identical. Here we are testing whether the effect of one or more of the predictor fields is significantly different from zero in the population. This is analogous to the overall F test used in regression analysis. The results indicate that at least one predictor is significantly related to the hazard because the significance values are well below .05 or .01. (An omnibus test is also done using the score statistic, which is used in stepwise predictor selection.)

Figure 7.13 Omnibus Tests of Model Coefficients

Figure 7.14 Variables in the Equation Table

The next table, Variables in the Equation, contains information on the individual effects of each predictor. To interpret these coefficients, recall that the model predicts the hazard directly, not survival time, and that on the scale of the predictors, the natural log of the hazard is being predicted. Therefore, the B coefficient relates the change in the natural log of the hazard to a one-unit change in a predictor, controlling for other predictors. As such, the B coefficients are difficult to interpret directly (although positive values are associated with increasing hazard and lower survival time, while negative values are associated with decreasing hazard and increasing survival time). For this reason, the Exp(B) column is usually used when interpreting the results.
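The relationship between the log-hazard scale of the B coefficients and the hazard itself can be sketched numerically. The baseline hazard and coefficient values below are hypothetical, not the fitted values from this lesson:

```python
import math

def hazard(baseline_hazard, coefs, x):
    """Cox model: h(t|x) = h0(t) * exp(b1*x1 + ... + bk*xk).
    Equivalently, ln h(t|x) - ln h0(t) is linear in the predictors."""
    return baseline_hazard * math.exp(sum(b * xi for b, xi in zip(coefs, x)))

# Hypothetical values: baseline hazard 0.05, one positive and one negative coefficient.
# A positive B raises the hazard (shorter survival);
# a negative B lowers it (longer survival).
h = hazard(0.05, [0.4, -0.08], [1.0, 35.0])
```

Note that a coefficient of 0 leaves the baseline hazard unchanged, since exp(0) = 1.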

The significance of each predictor is tested using the Wald statistic, and the associated probability values are reported in the Sig. column. Here, four of the predictors are significant, but gender and cardfee are not.

The positive B values for ed and marital indicate, respectively, that increasing education and being unmarried (check the coding) are associated with increasing hazard for churn. The negative B values for age and income indicate that increasing age and income lead to reduced hazard for churn.

The Exp(B) column presents the estimated change in risk (hazard) associated with a one-unit change in a predictor, controlling for the other predictors. When the predictor is categorical and indicator coding is used, Exp(B) represents the change in hazard when comparing the reference category to another category and is referred to as relative risk. Exp(B) is also called the hazard ratio, since it represents the ratio of the hazards for two individuals who differ by one unit on the predictor of interest. The Exp(B) value for marital is 1.469; this means that, other things being equal, the hazard for customers who are unmarried is 1.469 times greater than the hazard for married customers.

This does not mean that an unmarried customer will churn 1.469 times faster, or that 1.469 times as many unmarried customers as married customers will churn in a given time interval. It simply means that, at any given moment, the hazard of churning for an unmarried customer is 1.469 times that for a comparable married customer.

For a continuous predictor such as age, the hazard ratio refers to the effect on the hazard of each unit change. So a one-year increase in age multiplies the hazard by .923. The effect of a ten-year difference in age (comparing, say, a 40-year-old to a 30-year-old customer) must be calculated by multiplying the hazards, and so corresponds to .923^10 = .449, a substantial reduction in the likelihood of churning for the older customer (who tends to hang on to his or her credit card).

The next, rather lengthy, table is the Survival table, which contains the baseline cumulative hazard, the survival estimates, and the cumulative hazard. The first portion of the table is displayed in Figure 7.15.
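The two hazard-ratio calculations just described, for marital and for a ten-year difference in age, can be reproduced directly from the Exp(B) values:

```python
import math

exp_b_marital = 1.469                 # hazard ratio for unmarried vs. married
b_marital = math.log(exp_b_marital)   # the underlying B coefficient on the log-hazard scale

exp_b_age = 0.923                     # hazard ratio for a one-year increase in age
exp_b_age_10 = exp_b_age ** 10        # a ten-year difference multiplies the ratios: ~0.449
```

Multi-unit changes multiply (not add) on the hazard scale because the model is linear on the log-hazard scale.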

Figure 7.16 Cumulative Survival Function

The cumulative survival values are plotted in the Survival Function chart. Time held primary credit card is on the horizontal axis, and cumulative survival is on the vertical axis. Until the last few years, cumulative survival for a typical customer declines steadily at about a constant pace, then more steeply over the last two or three years. The curve is not smooth but jagged, because survival for these data is measured by the year (rather than in months or even weeks).

Figure 7.17 Cumulative Hazard Function

The cumulative hazard function plot is somewhat the mirror image of the survival plot; again, note that the cumulative hazard can take on values above 1. As cumulative survival decreases, cumulative hazard (the accumulated risk of not retaining the primary credit card) increases.
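The mirror-image relationship, and the reason the cumulative hazard can exceed 1, follows from the identity H(t) = -ln S(t), as this small sketch shows:

```python
import math

def cumulative_hazard(survival):
    """Cumulative hazard from cumulative survival: H(t) = -ln S(t).
    H(t) is unbounded above, so it can exceed 1 even though S(t) stays in (0, 1]."""
    return -math.log(survival)

# As cumulative survival falls, cumulative hazard rises past 1
for s in (0.9, 0.5, 0.2):
    print(round(cumulative_hazard(s), 3))  # 0.105, 0.693, 1.609
```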

There are survival and hazard graphs with separate lines for each category of marital. Looking at the cumulative survival function for what is labeled "patterns 1-2" shows that the survival curve for unmarried customers is always below the line for married customers. This means that the cumulative survival for unmarried customers is always lower than for married customers, which is consistent with the regression coefficient for marital, which had an Exp(B) value of 1.469 and indicated an increased hazard for unmarried customers. Differences in survival gradually increase over time between these two groups until about 35 years, where estimates grow less precise. This type of plot is a useful adjunct to interpretation of model coefficients and helps to gain additional model understanding.

Figure 7.18 Cumulative Survival Function for Married and Unmarried Customers

Close the Cox model Browser window

That completes a review of the most important types of output for a Cox model.

7.5 Checking the Proportional Hazards Assumption

Cox regression is based on a proportional hazards model, but we don't know whether that assumption is valid for these data (that is, whether the hazard functions of any two individuals or groups remain in constant proportion over time). There are several approaches to testing this for predictors. For our purposes, the two chief methods are to:

•  Examine the survival or hazard plots with the categorical predictor as the factor 

•  Examine the survival or log-minus-log plot in Cox Regression with the categorical predictor specified as a Groups variable

We will illustrate by specifying marital as the Groups field within Cox Regression and examining the survival and log-minus-log plots.

Edit the Cox modeling node
Click the Expert tab
Click the Output button
Click the Log minus log check box
Remove marital from the Plot a separate line for each value box (not shown)
Click OK
Click the Model tab
Select marital as the Groups field

Figure 7.19 Marital Selected As Groups Field

A separate baseline hazard function will be fit to each category of the Groups field. If marital does not conform to the proportional hazards assumption, this should be revealed in the survival and log-minus-log plots, which will present a line for each category, i.e., married and unmarried customers.

Click Run
Right-click on the generated model and select Browse

Since the focus of this analysis is on the proportional hazards assumption, we will not examine the model results but move directly to the diagnostic plots.

Figure 7.20 Survival Plot for Marital Categories

The survival plots for the married and unmarried customers remain roughly parallel over time, suggesting that the hazard ratio for the two groups is reasonably constant over time. We can ignore the last few years, where there are few customers and so estimates are less precise.

Because marital is not used as a predictor in the model (it is a Groups field), the expected survival values allow us to assess whether marital would meet the proportional hazards assumption in this context.

Figure 7.21 Log Minus Log Plot for Marital Categories

Another way of examining the proportional hazards assumption is through the ln(-ln) plot. Again, we simply have to judge whether the lines for the different categories are parallel. Here the two survival lines initially diverge for a few years, and then slowly converge over time until about year 30. However, compared to the full range of the data, the divergence between the two curves is small. Although this requires some judgment, in this instance we will conclude that the proportional hazards assumption is met for marital.

If the assumption is not met, a time-varying covariate model can be fit to the data. Or you can drop this predictor from the model, which is a viable option when you have dozens of predictors.
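Why parallelism of the ln(-ln) curves is the right check: under proportional hazards with hazard ratio r, the survival curves satisfy S_unmarried(t) = S_married(t)^r, so the ln(-ln S) curves differ by the constant ln r at every time point. A sketch with made-up survival values:

```python
import math

def log_minus_log(s):
    """The ln(-ln S) transform plotted by the Cox node's diagnostic."""
    return math.log(-math.log(s))

# Hypothetical married-group survival at a few time points
s_married = [0.95, 0.85, 0.70, 0.55]
r = 1.469  # hypothetical hazard ratio, unmarried vs. married

# Exact proportional-hazards relationship between the groups
s_unmarried = [s ** r for s in s_married]

# Under PH the vertical gap is constant (= ln r) at every time point
gaps = [log_minus_log(u) - log_minus_log(m)
        for u, m in zip(s_unmarried, s_married)]
```

In real data the gaps will wobble; the judgment call described above is whether they wobble around a constant or trend systematically.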

Checking the proportional hazards assumption should, in theory, be done for all significant categorical predictors in the model.

We can now look at the predictions made by the model.

7.6 Predictions from a Cox Model

The Cox regression model node can be added to a stream to make predictions. Given that the data are collected over time, the idea of a "prediction" from the model is complicated, and in fact the model can make several predictions for the same customer over different time periods, i.e., survival over the next time period, over the next five time periods, and so forth.

We want to rerun the basic model without marital specified as a Groups field.

Close the Cox generated model browser
Edit the Cox modeling node
Click the Model tab
Remove marital as the Groups field
Click Run

After the model has run:

Add a Table node to the stream and connect it to the generated Cox model node
Run the Table node
Scroll to the right to see the new fields

Figure 7.22 Output Fields from a Cox Regression Model with Default Settings

Four fields are created by default from a Cox model. $C-churn-1 is the prediction of the Cox model for whether or not this customer will churn in the time interval that the user has requested (we will review the default interval next). The field $CP-churn-1 contains the probability associated with this prediction (whether it is associated with the churn or no churn condition). As can be seen from the table, the probabilities are very high for almost all the predictions. The last two columns contain the probabilities for the churn=0 and churn=1 conditions (the field $CP-0-1 stands for the predicted probability associated with churn=0 for the first predicted interval). All the predicted probabilities seem to be taken from the $CP-0-1 field.

It is very important to understand that the predictions we see in the Table do not take current survival time for a customer into account. This may seem odd, given that the model was developed with survival time data, and model nuggets usually make predictions based on values of the predictors. But because predicting into the future is so central to the use of Cox models, no survival time value is used by default in the model. To see this we need to view the Settings tab on the Cox generated model.

Close the Table window

Edit the Cox model node
Click the Settings tab

Figure 7.23 Settings Tab Options on Cox Generated Model

The options available within this tab are the same as those available in the modeling node, allowing a model to be developed that simultaneously makes predictions. Note that there is no Time field listed, nor any Past survival time field. By default, survival will be predicted for a time interval of 1.0, which is defined in units of the time field used to create the model (here cardtenure). So the prediction will be for one year, for one time period (Number of time periods to score).

But, and here is the key to understanding how predictions are made with Cox regression, since no past survival time field is specified, the predictions are equivalent to predicting whether each customer will keep their primary credit card for 1 year after receiving it. These predictions are not whether a customer who has survived this long (as measured by cardtenure) will keep his or her card for another year. Since the odds of a customer churning in the first year are not terribly large, very few customers will be predicted to churn.

To see what the predictions of the model are for each customer at their actual survival time (which varies by customer), we need to set the time field to cardtenure. This requests that the Cox model predict into the "future" the number of time intervals represented by cardtenure for each customer. Recall from the previous example that, when no past survival time is listed, the model effectively begins at time 0.

Click the Time field option button
Specify the field as cardtenure
Click OK

Run the Table node

Figure 7.24 Predictions of Cox Model at Current Survival Time

In the first 20 records visible in Figure 7.24 there are still no predictions that a customer will churn. However, the probabilities of churning ($CP-1-1) are much higher than previously (see Figure 7.22). These are the predictions of the model for the actual survival times of each customer.

We can now use an Analysis node to review the model predictions for churn.

Close the Table window
Add an Analysis node to the stream
Connect the Cox generated model to the Analysis node
Edit the Analysis node
Click the Coincidence matrices (for symbolic targets) check box (not shown)
Click Run

Figure 7.25 Predictions for Cox Model

Overall the model is correct for 72.7% of the customers. However, looking at the Coincidence Matrix, we see that the model performs very well for customers who didn't churn, but not nearly as well for those who did. But we are using only a few predictors to keep the example simple; in a real-life data-mining project using Cox regression, with a richer set of predictors, performance would likely be much better than this.

Given what we have learned in the two instances of prediction we have reviewed, let's return to the Cox generated model to discuss prediction in more depth.

Close the Analysis node
Edit the Cox generated model

Figure 7.26 Cox Model Settings Tab

The section of the Settings tab labeled Predict survival at future times specified as: allows the user to specify future time based on either time intervals or an actual time field. This section is separate from specifying past (really current) survival time. As we have seen, if no past survival time is listed, then the model makes predictions as if from time 0. Let's set the past survival time to cardtenure and predict one time interval (one year) into the future.

Click the Regular intervals option button
Specify cardtenure as the Past survival time field

Figure 7.27 Cox Model to Predict One Year Beyond Current Survival Time

Click OK
Run the Table node

Figure 7.28 Cox Predictions One Year Beyond Current Survival

There are two points to notice about these data. First, for case 16, there are values of $null$ for all four output fields. If we scroll to the left and check the value of cardtenure for this customer, we will learn that it is 40. Predictions cannot be made outside the range of survival values in the data, so for any customer already at the upper end of the survival range, no predictions can be made.

Second, if you compare Figure 7.28 to Figure 7.24, you may see differences that are at first puzzling. The customer in the first row has a cardtenure value of 2. Thus, the model we just ran is predicting one year ahead for this customer, or to year 3, and the probability of churning in year 3 is 0.135. But if we look back at Figure 7.24 we see a probability of churning of 0.251, which is greater. You might wonder how the probability of churning can go down; the probability of the terminal event occurring should either stay constant or increase over time (unless we have time-varying covariates). The answer is that predictions that take into account past survival are actually conditional predictions, and the probability listed is the conditional probability. This means that, in Figure 7.28, the probability of 0.135 is the probability of this customer churning by the end of year 3, given that he or she survived through year 2.
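The conditional calculation just described can be sketched with a hypothetical survival curve (the values below are made up, not the lesson's fitted curve): the probability of churning by t+k given survival to t is 1 - S(t+k)/S(t), which is always smaller than the unconditional probability 1 - S(t+k).

```python
def conditional_churn_prob(surv, t, k):
    """P(churn by t+k | survived to t) = 1 - S(t+k)/S(t).
    surv maps a time point to cumulative survival S(time)."""
    return 1.0 - surv[t + k] / surv[t]

# Hypothetical cumulative survival by year
surv = {0: 1.0, 1: 0.90, 2: 0.80, 3: 0.69}

p_uncond = 1.0 - surv[3]                     # churn by year 3, predicted from time 0
p_cond = conditional_churn_prob(surv, 2, 1)  # churn in year 3, given survival to year 2
# Conditioning on survival so far lowers the probability: p_cond < p_uncond
```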

Thus, if we used the Analysis node to examine predictions from this model, we would see very few customers predicted to churn. This is because the probability of churning in the next year is always rather small for these data, so there are few predictions that a customer will churn.

Once you are satisfied with a model, you will want to score customers to find those most likely to churn. A common situation is to predict some time into the future, given past survival time. We'll predict five years into the future.

There is no need to predict for those customers who have already churned; we would only be interested in predicting churn for customers who are still current and may churn at some point in the future. So we need to select those customers with churn=0. Then we can sort the data stream by the likelihood of churning in a given future time interval.

Close the Table window

Place a Select node from the Record Ops palette in the stream to the right of the Cox model
Connect the Cox generated model to the Select node
Edit the Select node
Enter the text churn = 0 in the Condition: box

Figure 7.29 Selecting Customers who have not Churned

Click OK
Add a Sort node to the stream to the right of the Select node
Connect the Select node to the Sort node
Edit the Sort node
Select $C-churn-1 as the first sort field in descending order
Select $CP-1-1 as the second sort field, also in descending order

Figure 7.30 Sorting by Predicted Value and Probability to Churn

Click OK
Connect the Sort node to the Table node, replacing the connection
Edit the Cox generated model

To see survival five years into the future, taking into account current survival for each customer, we need to make only one change.

Change the Time interval value to 5.0 

Figure 7.31 Settings to Predict Five Years into the Future taking into Account Current Survival

Since the number of time periods to score is still 1, this will score each record at five years into the future, conditional on past survival time up to the last data point recorded for that customer.

Click OK
Run the Table node attached to the Sort node

Figure 7.32 Cox Predictions Five Years Beyond Current Survival

There are now predictions that customers will churn ($C-churn-1=1). All these customers have a value of $CP-1-1 above .50. If we scroll down to row 71, we see that there are 71 customers predicted to churn. However, this number depends on the default cutoff value of .500, which you are free to adjust up or down as necessary. Decreasing the cutoff will cast a wider net to find more customers who might churn in the specified five-year time period. Customers are ordered by model-predicted probability of churning, so we can spend marketing dollars or other resources on contacting them in order of likelihood to churn.
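The select-sort-cutoff logic built in the stream can be sketched in a few lines (the customer IDs and probabilities below are hypothetical scored records, not the lesson's data):

```python
# Hypothetical scored records: (customer_id, P(churn in next 5 years))
scores = [("A", 0.62), ("B", 0.18), ("C", 0.55), ("D", 0.49)]

def flag_churners(scores, cutoff=0.5):
    """Rank customers by churn probability (descending, like the Sort node)
    and flag those at or above the cutoff (like the default .500 threshold)."""
    ranked = sorted(scores, key=lambda rec: rec[1], reverse=True)
    return [(cid, p, p >= cutoff) for cid, p in ranked]

default_flags = flag_churners(scores)        # 2 customers flagged at cutoff .5
wider_net = flag_churners(scores, 0.4)       # lowering the cutoff flags 3
```

Lowering the cutoff trades more false alarms for fewer missed churners, which is often the right trade when contacting a customer is cheap relative to losing one.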

If you scroll down to record 3266 you will begin to see values of $null$ again for all the output fields. This occurs when a prediction is made outside the survival range of the model. As we know from basic regression analysis, predicting outside the bounds of the predictor fields is to be avoided. In Cox regression, if we attempt to predict outside the bounds of the time field, the model will not pass a predicted value downstream. In every instance, these null values come about when current survival time is 36 years or greater. Given the few customers at large survival times, these customers can either be ignored or handled separately (see the PASW Modeler Application Examples on Cox Regression for hints).

To understand more about the model predictions into the future, we’ll examine the relationship between time and the model prediction with an overlay histogram.

Close the Table window
Add a Histogram node to the stream near the Sort node
Connect the Sort node to the Histogram node
Edit the Histogram node
Specify cardtenure as the Field
Specify $C-churn-1 as the Overlay Color field (not shown)
Click Options
Click Normalize by color
Click Run

Figure 7.33 Histogram of Survival by Model Prediction

When reviewing this chart, keep in mind that, based on current survival, we are predicting survival five years into the future. This explains why there is a sharp cutoff after 35 years, since 36 + 5 extends beyond the cardtenure data range of 40 years. Most of the predictions that people will switch their primary credit card occur for customers with relatively low current survival times (under 10 years), which is consistent with past experience and common sense.

Close the Histogram window

If you wanted to see survival in a graph at cardtenure + 5 years, which is what the model is now predicting, you could use a Derive node to create a field with that equation and then use it in the Histogram in place of cardtenure.

To recap how to make predictions with a Cox generated model, follow this advice:

1)  If you use the default setting, you are predicting from time 0 one time period into the future. You usually don’t want this for existing customers, but this could certainly be appropriate for a data file with new customers.


2)  If you want to predict survival at the current survival time for existing customers, specify the survival time field as the Time field on the Settings tab.

3)  If you want to predict future survival, given past survival, specify the survival time field as the Past survival time field, and specify either time intervals and periods to score (if you score more than one period, you will get more than one prediction), or a Time field.


Summary Exercises
The exercises in this lesson use the data file customer_dbase.sav that we have been using throughout the lesson examples.

1.  Begin with a current stream, or alternatively just begin a new stream to access the customer_dbase.sav data. Choose a completely different set of predictors than the demographic fields used in the lesson to predict churn, although you will continue to use cardtenure as survival time.

2.  Estimate a Cox regression model with this new set of predictors. What are the significant predictors?

3.  Choose a categorical predictor or two that are significant and plot survival curves for each value of the predictors. What did you learn?

4.  Test the assumption of proportional hazards for your model with one or more categorical predictors. Is the assumption met or not?

5.  Using your model, predict survival 3 and 6 years into the future. Select only those customers who have not churned. How many customers are predicted to churn at 3 years into the future? At 6 years?


Lesson 8: Time Series Analysis

Objectives

•  Explain what is meant by time series analysis

•  Outline how time series models work 

•  Demonstrate the main principles behind a time-series forecasting model

•  Forecast several series at one time

•  Produce forecasts with a time series model on new observations

8.1 Introduction
It is often essential for organizations to plan ahead. To do this, they must forecast events in order to ensure a smooth transition into the future. To minimize errors when planning for the future, it is necessary to collect information on any factors which may influence plans on a regular basis over time. Once a catalogue of past and current information has been collected, patterns can be identified, and these patterns help make forecasts into the future. Even though many organizations may collect historic information relevant to the planning process, forecasts are often made on an ad-hoc basis. This often leads to large forecasting errors and costly mistakes in the planning process. Statistical techniques provide a more scientific basis upon which to base forecasts. By using these techniques, a more structured approach can be used to ensure careful planning, which will reduce the chance of making costly errors. Statisticians have developed a whole area of statistical techniques, known as time series analysis, which is devoted to forecasting.

Examples of Time Series Analysis

In order to understand how time series analysis works it is useful to give an example. Suppose that a company wishes to forecast the growth of its sales into the future. The benefit of making the forecast is that if the company has an idea of future sales it can plan the production process for its product. In doing so, it can minimize the chances of underproducing and having product shortages or, alternatively, overproducing and having excess stock which will need to be stored at additional cost.

Prior to being able to make the forecast, the company will need to collect information on its sales over time in order to gain a full picture of how sales have changed in the past. Once this information has been collected it is possible to plot how sales change over time. An example of this is shown in Figure 8.1. Here information on the sales of a product has been collected each month from January 1982 until December 1995.


Figure 8.1 Plot of Sales Over Time

This is a simple example that demonstrates the idea of time series. Time series analysis looks at changes over time. Any information collected over time is known as a time series. A time series is usually numerical information collected over time on a regular basis.

One of the most common uses of time series analysis is to forecast future values of a series. There are a number of statistical time series techniques which can be used to make forecasts into the future. In the above example the forecast would be the future values of sales.

Some time series methods can also be used to identify which factors have been important in affecting the series you wish to forecast. For example, they can determine whether an advertising campaign has had a significant and beneficial effect on sales. It is also possible to use time series analysis to quantify the likely impact of a change in advertising expenditure on future sales.

Other examples of time series analysis and forecasting include:

•  Governments using time series analysis to predict the effects of government policies on inflation, unemployment and economic growth.

•  Traffic authorities analyzing the effect on traffic flows following the introduction of parking restrictions in city centers.

•  The analysis of how stock market prices change over time. By being able to predict when stock market prices rise or fall, decisions can be made about the right times to buy and sell shares.


•  Companies predicting the effects of pricing policies or increased advertising expenditure on the sales of their product.

•  A company wishing to predict the number of telephone calls at different times during the day, so it can arrange the appropriate level of staffing.

Time series analysis is used in many areas of business, commerce, government and academia, and its value cannot be overstated.

A number of time series techniques can be found within the Time Series node in PASW Modeler. This node provides analysts with a flexible and powerful way to analyze time series data.

8.2 What is a Time Series?
A time series is a field whose values represent equally spaced observations of a phenomenon over time. Examples of time series include quarterly interest rates, monthly unemployment rates, weekly beer sales, annual sales of cigarettes, and so on. In terms of a data file, time periods constitute the rows (cases) in your file.

Time series analysis is usually based on aggregated data. If we take the monthly sales shown in Figure 8.1, each sale is recorded on a transactional basis with an attached date and/or time stamp. There is usually no business need requiring sales forecasts on a minute-by-minute basis, while there is often great interest in predicting sales on a weekly, monthly, or quarterly basis. For this reason, individual transactions and events are typically aggregated at equally spaced time points (days, weeks, months, etc.), and forecasting is based on these summaries. Also, most software programs that perform time series analysis, including PASW Modeler, expect each row (case) of data to represent a time period, while the columns contain the series to be forecast.
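This aggregation step can be sketched outside Modeler with pandas. The transaction data below is invented for illustration; note that a month with no transactions still appears as a row, with a missing value, as required for equally spaced time series data:

```python
import pandas as pd

# Invented transaction-level sales with date stamps.
tx = pd.DataFrame({
    "date": pd.to_datetime(["1982-01-05", "1982-01-20", "1982-03-11"]),
    "amount": [120.0, 80.0, 150.0],
})

# Aggregate to equally spaced monthly periods. min_count=1 makes an
# empty period (here February) show up as a missing value, not zero.
monthly = tx.set_index("date")["amount"].resample("MS").sum(min_count=1)

print(monthly)
```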

Classic time series involves forecasting future values of a time series based on patterns and trends found in the history of that series (exponential smoothing and simple ARIMA) or on predictor fields measured over time (multivariate ARIMA, or transfer functions).

Time Series Models versus Econometric Models

Time series models are constructed without drawing on any theories concerning possible relationships between the fields. In univariate models, the movements of a field are explained solely in terms of its own past and its position with respect to time. ARIMA models are the premier time series models for single series.

By way of contrast, econometric models are constructed by drawing on theory to suggest possible relationships between fields. Given that you can specify the form of the relationship, econometrics provides methods for estimating the parameters, testing hypotheses, and producing predictions. Your model might consist of a single equation, which can be estimated by some variant of regression, or a system of simultaneous equations, which can be estimated by two-stage least squares or some other technique.

The Classical Regression Model

The classical linear regression model is the conventional starting point for time series and econometric methods. Peter Kennedy, in A Guide to Econometrics (5th edition, 2003, MIT Press), provides a convenient statement of the model in terms of five assumptions:


•  The dependent variable can be expressed as a linear function of a specific set of independent variables plus a disturbance term (error).

•  The expected value of the disturbance term is zero.

•  The disturbances have a constant variance and are uncorrelated.

•  The observations on the independent variable(s) can be considered fixed in repeated samples.

•  The number of observations exceeds the number of independent variables and there are no exact linear relationships between the independent variables.

While regression can serve as a point of departure for both time series and econometric models, it is incumbent on you (the researcher) to generate the plots and statistics which will give some indication of whether the assumptions are being met in a particular context.

Assumption 1 is concerned with the form of the specification of the model. Violations of this assumption include omission of important regressors (predictors), inclusion of irrelevant regressors, models nonlinear in the parameters, and varying coefficient models.

When assumption 2 is violated, there is a biased intercept.

Assumption 3 assumes constant variance (homoscedasticity) and no autocorrelation. (Autocorrelation is the correlation of a variable with itself at a fixed time lag.) Violations of the assumption are the reverse: non-constant variance (heteroscedasticity) and autocorrelation.
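Lag autocorrelation, as defined here, can be computed in a few lines of NumPy. This function and the toy series are ours, for illustration only:

```python
import numpy as np

def autocorr(x, lag):
    """Correlation of a series with itself `lag` steps earlier."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Covariance at the given lag, divided by the overall variance.
    return np.sum(x[lag:] * x[:-lag]) / np.sum(x * x)

# A steadily trending series shows strong positive lag-1 autocorrelation.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(round(autocorr(series, 1), 3))  # → 0.625
```

Residuals from a well-specified regression on time series data should show autocorrelations near zero at all lags.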

Assumption 4 is often called the assumption of fixed or nonstochastic independent variables. Violations of this assumption include errors in measurement in the variables, use of lagged values of the dependent variable as regressors (common in time series analysis), and simultaneous equation models.

Assumption 5 has two parts. If the number of observations does not exceed the number of independent variables, then your problem has a necessary singularity and your coefficients are not estimable. If there are exact linear relationships between independent variables, software might protect you from the consequences. If there are near-exact linear relationships between your independent variables, you face the problem of multicollinearity.

In regression, parameters can be estimated by least squares. Least squares methods do not make any assumptions about the distribution of the disturbances. When you make the assumptions of the classical linear regression model and add to them the assumption that the disturbances are normally distributed, the regression estimators are maximum likelihood estimators (ML). It also can be shown that the least-squares methods produce Best Linear Unbiased estimates (BLU). The BLU and ML properties allow estimation of the standard errors of the regression coefficients and the standard error of the estimate, and therefore enable the researcher to do hypothesis testing and calculate confidence intervals.
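A minimal least-squares fit of the classical model's linear form can be sketched with NumPy. The data below are simulated for illustration, generated from a known line plus a disturbance term:

```python
import numpy as np

# Simulated data from y = 2 + 3*x plus a normally distributed disturbance.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=20)

# Design matrix with an intercept column; solve by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coef)  # estimates close to the true values [2, 3]
```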

8.3 A Time Series Data File
To show you what a time series data file looks like, we open a PASW Modeler stream.

Click File…Open Stream and move to the c:\Train\ModelerPredModel folder
Double-click on Time Series Intro.str
Run the Table node


Figure 8.2 A Time Series Data File

Each column in the data editor corresponds to a given field. The important point to note concerning the organization of time series data is that each row in the Table window corresponds to a particular period of time. Each row must therefore represent a sequential time period. The above example shows a data file containing monthly data for sales starting in January 1982. In order to use standard time series methods it is important to collect, or at least be able to summarize, the information over equal time periods. Within a time series data file it is essential that the rows represent equally spaced time periods. Even time periods for which no data was collected must be included as rows in the data file (with missing values for the fields).

The Time Plot Chart

The data file contains the recorded sales of a product over a fourteen-year period. The simplest way of identifying patterns in your data is to plot your information over the relevant time period, and this is essential for time series analysis. In PASW Modeler a sequence chart called a Time Plot chart is used to show how time series change over time. The Time Plot chart plots the value of the field of interest on the vertical axis, with time represented on the horizontal axis. A Time Plot chart can show several fields (series) on the same chart. Points are joined up to display a line graph which shows any patterns in your data.

Close the Table window
Double-click on the Time Plot node to the right of the Time Intervals node to open it
Use the Field selector tool to select Sales


Figure 8.3 Time Plot Dialog

Click Run 

There is an option to Display series in separate panels which can be used to generate a separate chart for each series if you want to plot several of them at once. If you do not check this option, all fields are plotted on one chart. Figure 8.4 shows how sales have changed over the fourteen years.


Figure 8.4 Sequence Plot of Sales

The sequence chart is the most powerful exploratory tool in time series analysis and it can be used to identify trend, seasonal and cyclical patterns in a time series. There is a clear regularity (repeating pattern) to the time series, and the volume of sales generally increases over time. These are the key features we will need to model.

8.4 Trend, Seasonal and Cyclic Components

After identifying important patterns that have occurred in the past, time series analysis uses this information to forecast into the future. In Figure 8.4 there are clear patterns in past sales. These patterns can be divided into three main categories: trend, seasonal components and cycles.

Trend Patterns

Trend refers to the smooth upward or downward movement characterizing a time series over a long period of time. This type of movement is particularly reflective of the underlying continuity of fundamental demographic and economic phenomena. Trend is sometimes referred to as secular trend, where the word secular is derived from the Latin word saeculum, meaning a generation or age. Hence, trend movements are thought of as long-term movements, usually requiring 15 or 20 years to describe (or the equivalent for series with more frequent time intervals). Trend movements might be attributable to factors such as population change, technological progress, and large-scale shifts in consumer tastes.

For example, if we could examine a time series on the number of pairs of shoes produced in the United States extending annually, say, from the 1700s until the present, we would find an underlying trend of growth throughout the entire period, despite fluctuations around this general upward movement. If we compared the figures of the recent time against those near the beginning of the series, we would find the recent numbers are much larger. This is because of the increase in population, because of the technical advances in shoe-producing equipment enabling vastly increased


levels of production, and because of shifts in consumer tastes and levels of affluence which have meant a larger per capita requirement of shoes than in the earlier time.

In Figure 8.4 there is a clear upward trend in the data as sales have continued to increase from 1982 until 1995, albeit less pronounced from the beginning of 1991.

Cyclical Patterns
Cyclical patterns (or fluctuations), or business cycle movements, are recurrent up and down movements around the trend levels which have a duration of anywhere from about 2 to 15 years. The duration of these cycles can be measured in terms of their turning points, or in other words, from trough to trough or peak to peak. These cycles are recurrent rather than strictly periodic. The height and length (amplitude and duration) of cyclical fluctuations in industrial series differ from those of agricultural series, and there are differences within these categories and within individual series. Hence, cycles in durable goods activity generally display greater relative fluctuations than consumer goods activity, and a particular time series of, say, consumer goods activity may possess business cycles which have considerable variations in both duration and amplitude.

Economists have produced a large number of explanations of business cycle fluctuations, including external theories which seek the causes outside the economic system, and internal theories in terms of factors within the economic system that lead to self-generating cycles.

Since it is clear from the foregoing discussion that there is no single simple explanation of business cycle activity and that there are different types of cycles of varying length and size, it is not surprising that no highly accurate method of forecasting this type of activity has been devised. Indeed, no generally satisfactory mathematical model has been constructed for either describing or forecasting these cycles, and perhaps never will be. Therefore, it is not surprising to find that classical time series analysis adopts a relatively rough approach to the statistical measurement of the business cycle. The approach is a residual one; that is, after trend and seasonal variations have been eliminated from a time series, by definition, the remainder or residual is treated as being attributable to cyclical and irregular factors. Since the irregular movements are by their very nature erratic and not particularly tractable to statistical analysis, no explicit attempt is usually made to separate them from cyclical movements, or vice versa. However, the cyclical fluctuations are generally large relative to these irregular movements and ordinarily no particular difficulty in description or analysis arises from this source. Therefore, unless you have data available over a long period of time, cyclic patterns are not usually fit by forecasting models.

Seasonal Patterns

Seasonal variations are periodic patterns of movement in a time series. Such variations are considered to be a type of cycle that completes itself within the period of a calendar year, and then continues in a repetition of this basic pattern. The major factors in this seasonal pattern are weather and customs, where the latter term is broadly interpreted to include patterns in social behavior as well as observance of various holidays such as Christmas and Easter. Series of monthly or quarterly data are ordinarily used to examine these seasonal patterns. Hence, regardless of trend or cyclical levels, one can observe in the United States that each year more ice cream is sold during the summer months than during the winter, whereas more fuel oil for home heating purposes is consumed in the winter than during the summer months. Both of these cases illustrate the effect of climatic factors in determining seasonal patterns. Also, department store sales generally reveal a minor peak during the months in which Easter occurs and a larger peak in December, when Christmas occurs, reflecting the shopping customs of consumers associated with these dates.


Seasonal patterns need not be linked to a calendar year. For example, if we studied the daily volume of packages delivered by a private delivery service, the periodic pattern might well repeat weekly (heavier deliveries mid-week, lighter deliveries on the weekend). Here the period for the seasonal pattern could be seven days. Of course, if daily data were collected over several years, then there may well be a yearly pattern as well, and just which time period constitutes a season is no longer clear.

The number of time periods that occur during the completion of a seasonal pattern is referred to as the series periodicity. How often the time series data are collected usually depends on the type of seasonality that the analyst expects to find.

•  For hourly data, where data are collected once an hour, there is usually one seasonal pattern every twenty-four hours. The periodicity is most likely to be 24.

•  For monthly data, where each month a new time period of data is collected, there is usually one seasonal pattern every twelve months. The periodicity is thus likely to be 12.

•  For daily data, where data are collected once every day, there is usually one seasonal pattern per week. The periodicity is therefore 7 if the data refer to a seven-day week or 5 if no data are collected on Saturdays and Sundays.

•  For quarterly data, where data are collected once every three months, there is usually one seasonal pattern per year. The periodicity is therefore 4.

•  For annual data, where data are collected once a year, there is no seasonal pattern. The periodicity is therefore none (undefined).
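The conventions above can be captured in a small lookup table. This sketch (the function and labels are ours, not part of Modeler) simply restates the bullet list in code:

```python
# Typical seasonal periodicity for common collection frequencies,
# mirroring the conventions listed above. None means no seasonal pattern.
PERIODICITY = {
    "hourly": 24,
    "monthly": 12,
    "daily (7-day week)": 7,
    "daily (weekdays only)": 5,
    "quarterly": 4,
    "annual": None,
}

def periodicity(freq):
    """Return the usual seasonal periodicity for a collection frequency."""
    return PERIODICITY[freq]

print(periodicity("monthly"))  # → 12
```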

Of course, changes can occur in seasonal patterns because of changing institutional and other factors. Hence, a change in the date of the annual automobile show can change the seasonal pattern of automobile sales. Similarly, the advent of refrigeration techniques with the corresponding widespread use of home refrigerators has brought about a change of seasonal pattern of ice cream sales. The techniques of measurement of seasonal variation which we will discuss are particularly well suited to the measurement of relatively stable patterns of seasonal variation, but can be adapted to cases of changing seasonal movements as well.

In Figure 8.4, there appears to be a rise in sales during the early part of the year while sales tend to fall to a low around November. Finally, there is some recovery in sales leading up to the Christmas period of each year.

Irregular Movements

Irregular movements are fluctuations in time series that are erratic in nature, and follow no regularly recurrent or other discernible pattern. These movements are sometimes referred to as residual variations, since, by definition, they represent what is left over in an economic time series after trend, cyclical, and seasonal elements have been accounted for. These irregular fluctuations result from sporadic, unsystematic occurrences such as wars, earthquakes, accidents, strikes, and the like. In the classical time series model, the elements of trend, cyclical, and seasonal variations are viewed as resulting from systematic influences leading to gradual growth, decline, or recurrent movements. Irregular movements, however, are considered to be so erratic that it would be fruitless to attempt to describe them in terms of a formal model. Irregular movements can result from a large number of causes of widely differing impact.

8.5 What is a Time Series Model?
A time series model is a tool used to predict future values of a series by analyzing the relationship between the values observed in the series and the time of their occurrence. Time series models can be developed using a variety of time series statistical techniques. If there has been any trend and/or


seasonal variation present in the data in the past, then time series models can detect this variation, use this information to fit the historical data as closely as possible, and in doing so improve the precision of future forecasts.

Time Series techniques in PASW Modeler can be categorized in the following ways:

Pure time series models
  Exponential Smoothing

Causal time series models
  Linear Time Series Regression
  Intervention Analysis

Both Pure and Causal
  ARIMA

Pure Versus Causal Time Series Models

A distinction can be made between pure and causal time series models.

Pure Time Series Models

Pure time series models utilize information solely from the series itself. In other words, pure time series forecasting makes no attempt to discover the factors affecting the behavior of a series. For example, if the aim were to forecast future sales for a product, then a pure time series model would use just the data collected on sales. Information on other explanatory forces, such as advertising expenditure and economic conditions, would not be used when developing a pure time series model. In such models it is assumed that some pattern or combination of patterns in the series to be forecast is recurring over time. Identifying and extrapolating that pattern makes it possible to develop forecasts for subsequent time periods. The main advantage of pure time series modeling is that it is a quick and simple way of developing a forecast model. Also, such models rely upon little statistical theory. One obvious disadvantage of pure time series models, such as exponential smoothing, is that they cannot identify important factors influencing the series. Another drawback is that it is not possible to accurately predict the impact of any decisions taken by an organization on the future values of the series.

Causal Time Series Models

Causal time series models such as regression and ARIMA incorporate data on influential factors to help predict future values of a series. In such models, a relationship is modeled between a target field (the time series being predicted), time, and a set of predictor fields (other associated factors also measured over time). The first task of forecasting is to find the cause-and-effect relationship. In our sales example, a causal time series technique such as regression would indicate whether advertising expenditure or the price of the product has been an important influence on sales and, if it has, whether each factor has had a positive or negative influence on sales. The real advantage of an explanatory model is that a range of forecasts corresponding to a range of values for the different fields can be developed. For example, causal time series models can assess what effect a $100,000 increase in advertising expenditure will have on future sales, or alternatively a $150,000 increase in advertising expenditure.

The main drawbacks of causal time series models are that they require information on several fields in addition to the target that is being forecast and usually take longer to develop. Furthermore, the


model may require estimation of the future values of the independent factors before the target can be forecast.

8.6 Interventions
Time series may experience sudden shifts in level, upward or downward, as a result of external

events. For example, sales volume may briefly increase as the result of a direct marketing campaign or a discount offering. If sales were limited by a company’s capacity to manufacture a product, then bringing a new plant online would shift the sales level upward from that date onward. Similarly, changes in tax laws or pricing may shift the level of a series. The idea here is that some outside intervention resulted in a shift in the level of the series.

In this context, a distinction is made between a pulse—that is, a sudden, temporary shift in the series level—and a step, a sudden, permanent shift in the series level. A bad storm, or a one-time, 30-day rebate offer, might result in a pulse, while a change in legislation or a large competitor’s entry into a market could result in a step change to the series. Time series models are designed to account for gradual, not sudden, change. As a result, they do not natively fit pulse and step effects very well. However, if you can identify events (by date) that you believe are associated with pulse or step effects, they can be incorporated into time series models (they are called intervention effects) and forecasts.

Below we see an example of a pulse intervention. In April 1975 a one-time tax rebate occurred in an attempt to stimulate the US economy, then in recession. Note that the savings rate reached its maximum (9.7%) during this quarter. The intervention can be modeled and used in scenarios to assess the effect of a tax rebate on savings rates in the future.


Figure 8.5 U. S. Savings Rate (Seasonally Adjusted)—Tax Rebate in April 1975

8.7 Exponential Smoothing

The Expert Modeler in PASW® Forecasting considers two classes of time series models when searching for the best forecasting model for your data: exponential smoothing and ARIMA. In this section we provide a brief introduction to simple exponential smoothing.

Exponential smoothing is a time series technique that can be a relatively quick way of developing forecasts. This technique is a pure time series method; this means that the technique is suitable when data has only been collected for the series that you wish to forecast. In comparison, ARIMA models can accommodate predictor fields and intervention effects.

Exponential smoothing takes the approach that recent observations should have relatively more weight in forecasting than distant observations. “Smoothing” implies predicting an observation by a weighted combination of the previous values. “Exponential” smoothing implies that the weights decrease exponentially as the observations get older. “Simple” (as in simple exponential smoothing) implies that a slowly changing level is all that is being modeled. Exponential smoothing can be extended to model different combinations of trend and seasonality, and in this fashion it encompasses a family of models.

An analyst using custom exponential smoothing typically examines the series to make some broad characterizations (is there trend, and if so what type? Is there seasonality [a repeating pattern], and if so what type?) and fits one or more models. The best-fitting model is then extrapolated into the future to make forecasts. One of the main advantages of exponential smoothing is that models can be easily constructed. The type of exponential smoothing model developed will depend upon the seasonal and trend patterns inherent in the series you wish to forecast. An analyst building a model might simply observe the patterns in a sequence chart to decide which type of exponential smoothing model is the most promising one to generate forecasts. In PASW Forecasting, when the Expert Modeler examines the series, it considers all appropriate exponential smoothing models when searching for the most promising time series model.

Simple exponential smoothing (no trend, no seasonality) can be described in two algebraically equivalent ways. One common formula, known as the recurrence form, is as follows:

S(t) = α*y(t) + (1 - α)*S(t-1)

Also, the forecast:

y(m) = S(t)

where y(t) is the observed value of the time series in period t, S(t-1) is the smoothed level of the series at time t-1, α (alpha) is the smoothing parameter for the level of the series, S(t) is the smoothed level of the series at time t, computed after y(t) is observed, and y(m) is the model-estimated m-step-ahead forecast at time t. Intuitively, the formula states that the current smoothed value is obtained by combining information from two sources: the current point and the history embodied in the series. Alpha (α) is a weight ranging between 0 and 1. The closer alpha is to 1, the more exponential smoothing weights the most recent observation and the less it weights the historical pattern of the series. The smoothed value for the current case becomes the forecast value.

This is the simplest form of an exponential smoothing model. As mentioned above, extensions of the exponential smoothing model can accommodate several types of trend and seasonality, yielding a general model capable of fitting single-series data.
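As a concrete (if simplified) illustration, the recurrence form above can be computed in a few lines of Python. The series and alpha value below are invented, and this sketch is not Modeler's implementation:

```python
def simple_exp_smooth(series, alpha):
    """Simple exponential smoothing: S(t) = alpha*y(t) + (1 - alpha)*S(t-1).

    Returns the final smoothed level S(t), which is also the flat
    forecast for every future period under this model.
    """
    s = series[0]  # initialize the level at the first observation
    for y in series[1:]:
        s = alpha * y + (1 - alpha) * s
    return s

# Invented monthly series; alpha near 1 weights recent values heavily.
sales = [100, 102, 101, 105, 107, 110]
print(round(simple_exp_smooth(sales, alpha=0.8), 2))  # → 109.29
```

Note how the result sits close to the most recent observation (110) because alpha is high; a small alpha would pull the forecast toward the older history of the series.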

8.8 ARIMA

Many of the ideas that have been incorporated into ARIMA models were developed in the 1970s by George Box and Gwilym Jenkins, and for this reason ARIMA modeling is sometimes called Box-Jenkins modeling. ARIMA stands for AutoRegressive Integrated Moving Average, and the assumption of these models is that the variation accounted for in the series field can be divided into three components:

•  Autoregressive (AR)

•  Integrated (I) or Difference

•  Moving Average (MA)

An ARIMA model can have any component, or combination of components, at both the nonseasonal and seasonal levels. There are many different types of ARIMA models, and the general form of an ARIMA model is ARIMA(p,d,q)(P,D,Q), where:

•  p refers to the order of the nonseasonal autoregressive process incorporated into the ARIMA model (and P the order of the seasonal autoregressive process)

•  d refers to the order of nonseasonal integration or differencing (and D the order of the seasonal integration or differencing)

•  q refers to the order of the nonseasonal moving average process incorporated in the model (and Q the order of the seasonal moving average process).


So, for example, an ARIMA(2,1,1) would be a nonseasonal ARIMA model where the order of the autoregressive component is 2, the order of integration or differencing is 1, and the order of the moving average component is also 1. ARIMA models need not have all three components. For example, an ARIMA(1,0,0) has an autoregressive component of order 1 but no difference or moving average component. Similarly, an ARIMA(0,0,2) has only a moving average component of order 2.

Autoregressive

In a similar way to regression, ARIMA models use input fields to predict a target field (the series field). The name autoregressive implies that series values from the past are used to predict the current series value. In other words, the autoregressive component of an ARIMA model uses the lagged values of the series target, that is, values from previous time points, as predictors of the current value of the series. For example, it might be the case that a good predictor of current monthly sales is the sales value from the previous month.

The order of autoregression refers to the time difference between the series target and the lagged series target used as a predictor. If the series target is influenced by the series target two time periods back, then this is an autoregressive model of order two and is sometimes called an AR(2) process. An AR(1) component of the ARIMA model says that the value of the series target in the previous period (t-1) is a good indicator and predictor of what the series will be now (at time period t). This pattern continues for higher-order processes.

The equation representation of a simple autoregressive model (AR(1)) is:

y(t) = Φ1*y(t-1) + a + e(t)

Thus the series value at the current time point (y(t)) is equal to the sum of: (1) the previous series value (y(t-1)) multiplied by a weight coefficient (Φ1); (2) a constant a (representing the series mean); and (3) an error component at the current time point (e(t)).

Moving Average

The autoregressive component of an ARIMA model uses lagged values of the series as predictors. In contrast, the moving average component of the model uses lagged values of the model error as predictors.

Some analysts interpret moving average components as outside events or shocks to the system. That is, an unpredicted change in the environment occurs, which influences the current value in the series as well as future values. Thus the error component for the current time period relates to the series’ values in the future.

The order of the moving average component refers to the lag length between the error and the series target. For example, if the series target is influenced by the model’s error lagged one period, then this is a moving average process of order one and is sometimes called an MA(1) process. An MA(1) model would be expressed as:

y(t) = Φ1*e(t-1) + a + e(t)

Thus the series value at the current time point (y(t)) is equal to the sum of several components: (1) the previous time point’s model error (e(t-1)) multiplied by a weight coefficient (here Φ1); (2) a constant (representing the series mean); and (3) an error component at the current time point (e(t)).
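To make the AR(1) and MA(1) recurrences concrete, the following Python sketch simulates one realization of each process. The coefficients and series length are invented, this is not how Modeler estimates anything, and the MA weight (which this guide writes as Φ1) is named theta here:

```python
import random

random.seed(7)  # reproducible toy data

def simulate_ar1(n, phi, const, sigma=1.0):
    """AR(1): y(t) = phi*y(t-1) + const + e(t)."""
    y = [const / (1 - phi)]  # start at the process mean
    for _ in range(n - 1):
        e = random.gauss(0, sigma)  # random shock at time t
        y.append(phi * y[-1] + const + e)
    return y

def simulate_ma1(n, theta, const, sigma=1.0):
    """MA(1): y(t) = theta*e(t-1) + const + e(t)."""
    y, e_prev = [], 0.0
    for _ in range(n):
        e = random.gauss(0, sigma)
        y.append(theta * e_prev + const + e)
        e_prev = e  # remember last period's shock
    return y

ar = simulate_ar1(200, phi=0.6, const=4.0)    # long-run mean 4/(1-0.6) = 10
ma = simulate_ma1(200, theta=0.5, const=10.0)  # mean 10
```

The key structural difference is visible in the two loops: the AR process feeds back its own previous value, while the MA process feeds back only the previous period's random shock.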


Integration

The Integration (or Differencing) component of an ARIMA model provides a means of accounting for trend within a time series model. Creating a differenced series involves subtracting adjacent series values in order to evaluate the remaining component of the model. The trend removed by differencing is later built back into the forecasts by Integration (reversing the differencing operation). Differencing can be applied at the nonseasonal or seasonal level, and successive differencing, although relatively rare, can be applied. The form of a differenced series (nonseasonal) would be:

x(t) = y(t) - y(t-1)
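Numerically, the differencing operation above is just element-wise subtraction of adjacent values. A short Python sketch with invented data:

```python
def difference(y, lag=1):
    """Nonseasonal differencing for lag=1: x(t) = y(t) - y(t-lag).

    The first `lag` values are lost, so the differenced series is
    shorter; for seasonal differencing of monthly data use lag=12.
    """
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

trend = [10, 12, 14, 16, 18]  # invented series with a steady trend
print(difference(trend))      # → [2, 2, 2, 2]; the trend is removed
```

A series rising by a constant amount each period differences to a flat series, which is exactly why differencing removes trend.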

Thus the differenced series value (x(t)) is equal to the current series value (y(t)) minus the previous series value (y(t-1)).

Multivariate ARIMA

ARIMA also permits a series to be predicted from values in other data series. The relations may be at the same time point (for example, a company’s spending on advertising this month influences the company’s sales this month) or in a leading or lagging fashion (for example, the company’s spending on advertising two months ago influences the company’s sales this month). Multiple predictor series can be included at different time lags. A very simple example of a multivariate ARIMA model appears below:

y(t) = b1*x(t-1) + a + e(t)

Here the series value at the current time point (y(t)) is equal to the sum of several components: (1) the value of the predictor series at the previous time point (x(t-1)) multiplied by a weight coefficient (b1); (2) a constant; and (3) an error component at the current time point (e(t)). In a practical context, this model could represent monthly sales of a new product as a function of direct marketing spending the previous month.
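The alignment of a lagged predictor with the target can be illustrated in a few lines of Python. The numbers are invented, and this only shows the data arrangement, not ARIMA estimation:

```python
def lagged_pairs(x, y, k=1):
    """Align predictor series x at lag k with target series y.

    Each pair is (x(t-k), y(t)); the first k targets have no lagged
    predictor available and are dropped.
    """
    return list(zip(x[:-k], y[k:]))

# Invented monthly figures: direct-marketing spend and product sales.
spend = [5, 8, 6, 9, 7, 10]
sales = [50, 55, 71, 62, 78, 66]
print(lagged_pairs(spend, sales))
# → [(5, 55), (8, 71), (6, 62), (9, 78), (7, 66)]
```

In each pair, this month's sales is matched with last month's spend, which is exactly the arrangement the x(t-1) term in the equation describes.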

Complex ARIMA models that include other predictor series, autoregressive, moving average, and integration components can be built in the Time Series node.

8.9 Data Requirements

In time series analysis, each time period at which the data are collected yields one sample point of the time series. The idea is that the more sample points you have, the clearer the picture of how the series behaves. It is not reasonable to collect just two months’ worth of data on the sales of a product and, on the basis of this, expect to be able to forecast two years into the future. This is because your sample size is only two (one sixth of the seasonal span) and you wish to forecast 24 data points, or months, ahead (two full seasonal spans). Therefore the way to view the collection of time series information is that the more data points you have, the greater your understanding of the past will be, and the more information you have to use to predict future values in the series.

The first important question to be answered is how many data points are required before it is possible to develop time series forecasts. Unfortunately, there is no clear-cut answer to this, but the following factors influence the minimum amount of data required:

•  Periodicity

•  How often the data are collected 


•  Complexity of the time series model

It is important to note that some time series techniques incorporating seasonal effects require several seasonal spans of time series data before it is possible to use them. Usually four or more seasons of data observations is a good rule of thumb to use when attempting to explore seasonal modeling. For example, four years (seasonal spans) worth of quarterly or monthly data would be sufficient, as there are four replications of the time period. At the same time, four years’ worth of annual data is not enough historic data, as the sample is only four. The four-year rule is not, however, a rigid rule, as time series models can be developed and used for forecasting with less historic data.

Two final thoughts: first, the more complex the time series model, the larger the time series sample size should be. Second, time series models assume that the same patterns appear throughout the series. If you are fitting a long series in which a dramatic change occurred that might influence the fundamental relations that exist over time (for example, deregulation in the airline and telecom industries), you may obtain more accurate predictions using only the recent (after the change) data to develop the forecasts.

8.10 Automatic Forecasting in a Production Setting

In common data mining applications, analysts need to create forecasts for dozens of series on a regular basis. Typical examples are inventory control for many different products/parts, or demand forecasting within segments of customers (geographical, customer type, etc.). In principle, this task is no more complex than what we have already reviewed in the previous lessons. But in practice, it can be demanding simply because of the large number of series, each of which could require data exploration, checking of residuals, etc.

Fortunately, the Expert Modeler in the Time Series node will automatically find a best-fitting model for any number of series that are added to the target list, with little work on your part (you can also use one or more predictor fields that would apply to all the target series). Although you could, if you had the time, do some preliminary work to determine the characteristics of the series, if you need to make regular forecasts on a weekly or monthly basis, it is likely that you won’t have the time to devote to this effort.

After models are fit to several series (each series will have its own unique model), you can then easily apply those models in the future, without having to re-estimate or rebuild the models. This will be very time efficient. Of course, when enough time passes, you will most likely want to re-estimate the models, just in case any fundamental processes have changed in the drivers of specific series.

8.11 Forecasting Broadband Usage in Several Markets

Our example of production forecasting involves a national broadband provider who wants to produce forecasts of user subscriptions in order to predict bandwidth usage. To keep the example relatively manageable, we will use only five time series in the example, although there are 85 series altogether (but we also forecast the total for all series). The file broadband_1.sav contains the monthly number of subscribers for each series from January 1999 to December 2003. After fitting models to the series, we want to produce forecasts for the next three months, which will be adequate to prepare for changes in demand/usage.

We’ll open the data file and do some data exploration.

Close the Time Plot graph window
Click on File…Open Stream


Double-click on the file broadband1
Run the Table node

Figure 8.6 Broadband Time Series Data

The file contains information on 85 markets. Rather than looking at all of them at once, we will focus only on Markets 1 through 5. The Filter node to the right of the source node will filter out the markets we don’t need.

Close the Table window
Double-click the Filter node


Figure 8.7 Filter Node Dialog

The next step is to examine sequence charts of each series, but before doing so we will need to define the periodicity of each series. The Time Series modeling node and the Time Plot node require that the periodicity be defined. This is accomplished in the Time Intervals node, which is found in the Field Ops palette.

Place a Type node to the right of the Filter node
Connect the Filter node to the Type node
Place a Time Intervals node to the right of the Type node
Connect the Type node to the Time Intervals node
Double-click on the Time Intervals node


Figure 8.8 Time Intervals Dialog

The Time Intervals node allows you to specify intervals and generate labels for time series data to be used in a Time Series model or a Time Plot node. A full range of time intervals is supported, from seconds to years. You can also specify the periodicity, for example, five days per week or eight hours per day.

In this node you can specify the range of records to be used for estimating a model, and you can choose whether to exclude the earliest records in the series and whether to specify holdouts. Doing so enables you to test the model by holding out the most recent records in the time series data in order to compare their known values with the estimated values for those periods. You can also specify how many time periods into the future you want to forecast, and you can specify future values for use in forecasting by downstream Time Series modeling nodes.

The Time Interval dropdown is used to define the periodicity of the series. By default it is set to None. While it is not absolutely required that you specify a periodicity, unless you do so the Expert Modeler will not consider models that adjust for seasonal patterns. In this case, because we have collected data on a monthly basis, we will define our time interval as months.

Click on the Time Interval dropdown and select Months 


Figure 8.9 Time Intervals Dialog with Periodicity Defined

The next step is to label the intervals. You can either start labeling from the first record, which in the case of this data file is January 1999, or build the labels from a field that identifies the time or date of each measurement. In order to use the Start labeling from first record method, you must specify the starting date and/or time to label the records. This method assumes that the records are already equally spaced, with a uniform interval between each measurement. Any missing measurements would be indicated by empty rows in the data. You can use the Build from data method for series that are not equally spaced. This requires that you have a field that contains the time or date of each measurement. PASW Modeler will automatically impute values for the missing time points so that the series will have equally spaced intervals. In addition, this method requires a date, time, or timestamp field in the appropriate format to use as input. For example, if you have a string field with values like Jan 2000, Feb 2000, etc., you can convert it to a date field using a Filler node. This is the method that we are going to use.
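For readers curious about what such a string-to-date conversion does, the following standard-library Python snippet parses the same kind of month labels. It is purely illustrative, is not part of the Modeler stream, and the function name is invented:

```python
from datetime import datetime

def parse_month_label(label):
    """Parse strings such as 'Jan 2000' into date objects ('%b %Y')."""
    return datetime.strptime(label, "%b %Y").date()

labels = ["Jan 1999", "Feb 1999", "Mar 1999"]
dates = [parse_month_label(s) for s in labels]
print(dates[0].isoformat())  # → 1999-01-01
```

Once the labels are true dates rather than strings, they can be sorted, differenced, and checked for gaps, which is what makes the Build from data method possible.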

Click OK
Insert a Filler node between the Filter node and the Type node


Figure 8.10 Stream After Adding the Filler Node

Double-click on the Filler node
Select DATE_ in the Fill in fields box
Select Always from the Replace: dropdown
Type or use the expression builder to insert to_date(DATE_) in the Replace with: box

Figure 8.11 Completed Filler Node

Click OK

Next, let’s set up the Type node so that the role for all the target series we want to forecast is set to Target, and the role for the newly converted DATE_ field is set to None. We will also need to instantiate the data.

Double-click on the Type node


Set the role on all the fields from Market_1 to Total to Target
Set the role on the DATE_ field to None
Click on the Read Values button to instantiate the data

Figure 8.12 Completed Type Node

Click OK 

 Now we can complete the Time Intervals settings.

Double-click on the Time Intervals node

Click on Build from data
Use the menu on the Field: option to select DATE_


Figure 8.13 Time Intervals Dialog with Date Field added

The New field name extension is used to apply either a prefix or suffix to the new fields generated by the node. By default it is the prefix $TI_.

Click on the Build tab


Figure 8.14 Build Tab Dialog

The Build tab allows you to specify options for aggregating or padding fields to match the specified interval. These settings apply only when the Build from data option is selected on the Intervals tab. For example, if you have a mix of weekly and monthly data, you could aggregate or “roll up” the weekly values to achieve a uniform monthly interval. Alternatively, you could set the interval to weekly and pad the series by inserting blank values for any weeks that are missing, or by extrapolating missing values using a specified padding function. When you pad or aggregate data, any existing date or timestamp fields are effectively superseded by the generated TimeLabel and TimeIndex fields and are dropped from the output. Typeless fields are also dropped. Fields that measure time as a duration are preserved (such as a field that measures the length of a service call rather than the time the call started), as long as they are stored internally as time fields rather than timestamp.

Click the Estimation tab


Figure 8.15 Estimation Tab Dialog

The Estimation tab of the Time Intervals node allows you to specify the range of records used in model estimation, as well as any holdouts. These settings may be overridden in downstream modeling nodes as needed, but specifying them here may be more convenient than specifying them for each node individually. The Begin Estimation option is used to specify when you want the estimation period to begin. You can either begin the estimation period at the beginning of the data or exclude older values that may be of limited use in forecasting. Depending on the data, you may find that shortening the estimation period can speed up performance (and reduce the amount of time spent on data preparation), with no significant loss in forecasting accuracy. The End Estimation option allows you to either estimate the model using all records up to the end of the data or “hold out” the most recent records in order to evaluate the model. For example, if you hold out the last three records and then specify 3 for the number of records to forecast, you are effectively “forecasting” values that are already known, allowing you to compare observed and predicted values to gauge the model’s effectiveness in forecasting into the future. We will use the default settings.

Click the Forecast tab


Figure 8.16 Forecast Tab Dialog

The Forecast tab of the Time Intervals node allows you to specify the number of records you want to forecast and to specify future values for use in forecasting by downstream Time Series modeling nodes. These settings may be overridden in downstream modeling nodes as needed, but again specifying them here may be more convenient than specifying them for each node individually.

The Extend records into the future option lets you specify the number of time points you wish to forecast beyond the estimation period. Note that these time points may or may not be in the future, depending on whether or not you held out some historic data for validation purposes. For example, if you hold out 6 records and extend 7 records into the future, you are forecasting 6 holdout values and only 1 future value. The Future indicator field is used to label the generated field that indicates whether a record contains forecast data. The default value for the label is $TI_Future. The Future Values to Use in Forecasting section allows you to specify future values for any predictor fields you use. Future values for any predictor fields are required for each record that you want to forecast, excluding holdouts. For example, if you are forecasting next month's revenues for a hotel based on the number of reservations, you need to specify the number of reservations you actually expect. Note that fields selected here may or may not be used in modeling; to actually use a field as a predictor, it must be selected in a downstream modeling node. This dialog box simply gives you a convenient place to specify future values so they can be shared by multiple downstream modeling nodes without specifying them separately in each node. Also note that the list of available fields may be constrained by selections on the Build tab. For example, if Specify fields and functions is selected on the Build tab, any fields not aggregated or padded are dropped from the stream and cannot be used in modeling. The Future value functions option lets you choose from a list of functions, or specify a value of your own. For example, you could set the value to the most recent value. The available functions depend on the measurement level of the field.
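The holdout arithmetic described above amounts to splitting a series into an estimation span and a validation span. A hedged Python sketch with toy data (this illustrates the bookkeeping only, not Modeler's internal behavior):

```python
def split_estimation_holdout(series, n_holdout):
    """Hold out the most recent n_holdout records for validation.

    Returns (estimation_span, holdout_span); the model is fit on the
    first part, and its forecasts are compared against the second.
    """
    return series[:-n_holdout], series[-n_holdout:]

y = list(range(60))  # 60 monthly observations (invented data)
est, hold = split_estimation_holdout(y, 3)
print(len(est), hold)  # → 57 [57, 58, 59]
```

With 3 records held out, "extending 4 records into the future" would cover the 3 known holdout values plus 1 genuinely unknown future value.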

Click on the Extend records into the future check box
Specify that you would like to forecast 3 records beyond the estimation period


Figure 8.17 Completed Forecast Tab Dialog

Click OK 

The next step is to examine each series with a Sequence chart. We will display all the fields on the same chart.

Place a Time Plot node from the Graphs palette below the Time Intervals node
Attach the Time Intervals node to the Time Plot node
Double-click on the Time Plot node
Select all the series from Market_1 to Total
Uncheck the Display series in separate panels box


Figure 8.18 Completed Time Plot Dialog

Click Run 


Figure 8.19 Sequence Chart Output for Each Series

From this graph, it is clear that broadband usage has been increasing rapidly in the US over this period, so we see a steady, very smooth increase for all fields. The numbers for Market_3 do begin to dip in the last couple of months, but perhaps this is temporary. There is clearly no seasonality in these data, which makes sense. The number of broadband subscriptions does not rise and fall throughout the year.

If we use this fact, we can reduce the time for the Expert Modeler to fit models to these series, since requesting that seasonality be considered will increase processing time.

Additionally, because the series we’ve viewed here are so smooth, with no obvious outliers, we’ll not request outlier detection. This will also save on processing time. Note, though, that if you are in doubt about this, it is safer to use outlier detection during modeling.

Close the Time Plot graph
Place a Time Series node from the Modeling palette near the Time Intervals node
Connect the Time Intervals node to the Time Series node

Here is the stream so far:


Figure 8.20 Stream with Time Series Node Attached

Double-click on the Time Series node

Figure 8.21 Time Series Node

The default Method of modeling is Expert Modeler, which automatically selects the best exponential smoothing or ARIMA model for one or more series (there can be a different model for each series). As an alternative, you can use the menu to specify a custom Exponential Smoothing or ARIMA model. In addition, there is a Continue estimation using existing model(s) option, which allows you to apply an existing model to new data without re-estimating the model from the beginning. In this way you can save time by re-estimating and producing a new forecast based on the same model settings as before but using more recent data. Thus, if the original model for a particular time series was Holt’s linear trend, the same type of model is used for re-estimating and forecasting for that data; the modeler does not reattempt to find the best model type for the new data. We will use the Expert Modeler in this example.

In addition, you can specify the confidence intervals you want for the model predictions and residual autocorrelations. By default, a 95% confidence interval is used. You can set the maximum number of lags shown in tables and in plots of autocorrelations and partial autocorrelations.

You must include a Time Intervals node upstream from the Time Series node. Otherwise, the dialog will indicate that no time interval has been defined and the stream will not run. In this example, the settings indicate that the model will be estimated from all the records and that forecasts will be made 3 time periods beyond the estimation period.

Click the Criteria button

Figure 8.22 Criteria Dialog

The All models option should be checked if you want the Expert Modeler to consider both ARIMA and exponential smoothing models. The other two modeling options can be used if you want the Expert Modeler to consider only exponential smoothing or only ARIMA models. The Expert Modeler will only consider seasonal models if periodicity has been defined for the active dataset. When this option is selected, the Expert Modeler considers both seasonal and nonseasonal models; if it is not selected, only nonseasonal models are considered. We will uncheck this option because the sequence charts clearly show that there were no seasonal patterns in broadband subscriptions.

The Events and Interventions option enables you to designate certain fields as event or intervention fields. Doing so identifies a field as marking periods when the time series data might have been affected by events (predictable recurring situations, e.g., sales promotions) or interventions (one-time incidents, e.g., a power outage or employee strike). To be included in this list, these fields must be of flag, nominal, or ordinal measurement and must be numeric (e.g., 1/0, not T/F, for a Flag field).

Click the Outliers tab

Figure 8.23 Outliers Dialog

The Detect outliers automatically option is used to locate and adjust for outliers. Outliers can lead to forecasting bias either up or down, erroneous predictions if the outlier is near the end of the series, and increased standard errors. Because there were no obvious outliers in the sequence chart, we will leave this option unchecked.

Click Cancel
Click Run

After the model runs:

Right-click on the generated model named 6 fields in the Models palette
Click Browse


Figure 8.24 Time Series Model Output (Simple View)

The Time Series model displays details of the models the Expert Modeler selected for each series. In this case, it chose the Holt's linear trend exponential smoothing model for the first four series and the last one (Total), and the Winters' additive exponential smoothing model for the fifth series. Given the likely similar patterns in the series, it is not surprising that the same model was chosen for most of the series. The default output shows for each series the model type, the number of predictors specified, and the goodness-of-fit measure (stationary R-squared is the default). This measure is usually preferable to an ordinary R-squared when there is a trend or seasonal pattern. If you have specified outlier methods, there is a column showing the number of outliers detected. The default output also includes a Ljung-Box Q statistic, which tests for autocorrelation of the errors. Here we see that the result was significant for the Model_2, Model_4, and Total series. Below we will examine some residual plots to see why the results were significant.

The default view (Simple) displays the basic set of output columns. For additional goodness-of-fit measures, you can use the View menu to select the Advanced option. The check boxes to the left of each model can be used to choose which models you want to use in scoring. All the boxes are checked by default. The Check all and Un-check all buttons in the upper left act on all the boxes in a single operation. The Sort by option can be used to sort the rows in ascending or descending order of a specified column. As an alternative, you can also click on a column heading itself to change the order.

Click on the View: menu and select Advanced 

Figure 8.25 Time Series Model Output (View = Advanced)

The Root Mean Square Error (RMSE) is the square root of the mean squared error. The Mean Absolute Percentage Error (MAPE) is obtained by taking the absolute error for each time period, dividing it by the actual series value, averaging these ratios across all time points, and multiplying by 100. The Mean Absolute Error (MAE) takes the average of the absolute values of the errors. The Maximum Absolute Percentage Error (MaxAPE) is the largest absolute forecast error expressed as a percentage. The Maximum Absolute Error (MaxAE) is the largest forecast error, whether positive or negative, in absolute value. And finally, the Normalized Bayesian Information Criterion (Norm BIC) is a general measure of the overall fit of a model that attempts to account for model complexity.
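These fit measures can be computed directly from the actual and fitted values. The sketch below is plain Python, not Modeler output; the example numbers are made up for illustration:

```python
import math

def fit_measures(actual, predicted):
    """Goodness-of-fit measures shown in the Advanced view.

    Assumes no actual value is zero (MAPE divides by the actual value)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)      # Root Mean Square Error
    mae = sum(abs(e) for e in errors) / n                 # Mean Absolute Error
    ape = [abs(e) / abs(a) * 100 for e, a in zip(errors, actual)]
    mape = sum(ape) / n                                   # Mean Absolute Percentage Error
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape,
            "MaxAPE": max(ape),                           # largest percentage error
            "MaxAE": max(abs(e) for e in errors)}         # largest absolute error

# Two periods of actuals vs. fits: errors are -10 and +10
print(fit_measures([100, 200], [110, 190]))
```

With these inputs, RMSE and MAE are both 10, MAPE is 7.5 (the average of 10% and 5%), MaxAPE is 10.0, and MaxAE is 10.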

From this table, you can easily scan the statistics to look for better, or poorer, fitting models. We can see here that Model_5 has the highest Stationary R-squared value (0.544) and Total has a very low one (0.049). However, the Total series has a lower MAPE than any of the other series. The summary statistics at the bottom of the output provide the mean, minimum, maximum, and percentile values for the standard fit measures. Here we see that the value for Stationary R-squared at the highest percentile (Percentile 95) is 0.544. This means that Model_5 should be ranked in the highest percentile based on this statistic, and the Total series should be ranked in the lowest.

Model Parameters

Time series models are represented by specific equations, and each model therefore has coefficients or parameters associated with its various terms. These parameters can provide additional insight into a model and the series that it predicts.

Click Parameters tab

A Holt's linear trend model has two parameters: level and trend. The level of the series is modeled with alpha, which varies from 0 (older observations count just as much as current observations) to 1 (the current observation is used exclusively). The alpha estimate is 1.0, and it is significant at the .05 level, so this smoothly varying series can be modeled for level by ignoring previous observations when predicting an observation.

Gamma is a parameter that models the trend in the series, and it varies from 0 (trends are based on all observations) to 1 (the trend is based on only the most recent observations). The gamma value of 0.3 (also significant at the .05 level) indicates that the series trend (which is clearly increasing over time) requires using some of the past observations as well as more recent ones. Note that the value of gamma does not describe the trend itself.
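The roles of alpha and gamma can be made concrete with Holt's updating recursions. This is a minimal sketch, not the Expert Modeler's estimation code, and the initialization shown (level = first value, trend = first difference) is just one common convention:

```python
def holt_forecast(y, alpha, gamma, h):
    """Holt's linear trend: smooth a level and a trend, then project h steps ahead.

    alpha weights the current observation in the level update;
    gamma weights the latest level change in the trend update.
    """
    level, trend = y[0], y[1] - y[0]          # simple initialization convention
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
    return [level + (i + 1) * trend for i in range(h)]

# A perfectly linear series is tracked exactly, whatever the smoothing weights
print(holt_forecast([2, 4, 6, 8], alpha=0.5, gamma=0.5, h=1))  # → [10.0]
```

With alpha = 1, the level update keeps only the current observation, which is exactly how the Market_1 model behaves.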

Figure 8.26 Parameters of Holt's Linear Trend Model for Market_1

An experienced analyst can use the parameters for a model to compare one model to another, to compare changes over time in a model (and thus a series), and to assess whether a model makes sense. You may want to review the parameters for the other series' models. Now let's look at the residual plots.

Click on the Residuals tab


Figure 8.27 Residuals Output for the Market_1 Series

The Residuals tab shows the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the residuals (the differences between expected and actual values) for each target field. The ACF values are the correlations between the current value and the values at previous time points. By default, 24 autocorrelations are displayed. The PACF values look at the correlations after controlling for the series values at the intervening time points. If all of the bars fall within the 95% confidence limits (the blue highlighted area), then there are no significant autocorrelations in the series. That seems to be the case with the Market_1 series. However, as we saw in Figure 8.24, the Market_2 series seemed to have significant autocorrelation based on the Ljung-Box Q statistic. Let's take a look at the residuals plot for the Market_2 series to see why that statistic was significant for that series.
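The bars and bands in these plots, and the Ljung-Box Q statistic itself, come from simple formulas. Below is a plain-Python sketch (not Modeler's implementation) of the sample ACF, the approximate 95% band at ±1.96/√n, and the Q statistic built from those autocorrelations:

```python
import math

def acf(series, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    var = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / var
            for k in range(1, max_lag + 1)]

def ljung_box_q(series, max_lag):
    """Ljung-Box Q: large values signal non-random (autocorrelated) errors."""
    n = len(series)
    return n * (n + 2) * sum(r * r / (n - k)
                             for k, r in enumerate(acf(series, max_lag), start=1))

residuals = [1, -1] * 5                         # alternating: strong lag-1 autocorrelation
band = 1.96 / math.sqrt(len(residuals))         # approximate 95% confidence limit
print(acf(residuals, 1), band, ljung_box_q(residuals, 1))
```

For this toy series the lag-1 autocorrelation (−0.9) falls far outside the ±0.62 band, and Q is correspondingly large — the same pattern that makes the Market_2 statistic significant.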

Use the Display plot for model: option to select the Market_2 series


Figure 8.28 Residuals Output for the Market_2 Series

Here we see that there is significant autocorrelation at lag 6 in both the ACF and PACF plots. Thus, the results of the Ljung-Box Q statistic and these two plots are consistent: there is a non-random pattern in the errors. This does not imply that the current model can't be used for forecasting, as it may perform adequately for the broadband provider. But it does suggest that the model can be improved. The Expert Modeler is an automatic modeling technique, and it normally finds a fairly acceptable model, but that doesn't mean that some tweaking on your part isn't appropriate.

Click OK
Place a Table node near the generated Time Series model
Connect the generated model to the Table node
Run the Table node


Figure 8.29 Table Output Showing Fields Created by Time Series Model

The table now contains a forecast value for each time point for each series, along with an upper and lower confidence limit. In addition, there is a field called $TI_Future that indicates which records contain forecast data. For records that extend into the future, the value of this field is "1".

Scroll to the bottom of the table and then slightly to the right

Figure 8.30 Table Output with Future Values Displayed


Notice that the original series all have null values on these last three records because they are projected into the future. On the right-hand side in Figure 8.30 we can see the forecast values for future months (January 2004 to March 2004) for the Market_1 series.

Finally, let’s create a chart showing the forecast for one of the series.

Close the Table window
Place a Time Plot node near the generated model on the stream canvas
Connect the Time Plot node to the generated model
Select the following fields to be plotted: Market_5, $TS-Market_5, $TSLCI-Market_5, $TSUCI-Market_5
Uncheck the Display Series in separate panels option
Click Run

Figure 8.31 Sequence Chart for Market_5 along with Forecast and Upper & Lower Confidence Limits

From this chart, it appears that the model fits this series very well.

Close the Time plot graph window


Click on File…Save Stream As and name the file Broadband.str  

8.12 Applying Models to Several Series

We just produced models for 6 series, along with forecasts for the next three months. Suppose that 3 months have passed and we now have actual data for January to March 2004 (for which we made forecasts initially). Now it is April 2004 and we want to make forecasts for the next three months (April to June 2004) using the same models we developed before, without having to re-estimate them now that we have updated the file with three months of new records. We do this with the Reuse stored settings method in the Time Series node to apply the model we just created to the updated data file. (We leave aside whether the correct forecast period is three months, more, or less.)

Click File…Open Stream…Broadband2.str (in the C:\Train\ModelerPredModel folder)
Copy the generated Time Series node from Broadband.str (or add it from the Models manager to the stream)
Paste the generated model into Broadband2.str

Figure 8.32 Broadband2.str with the Generated Model from Broadband1.str.

This node contains the settings from the time series models we just created. Normally, at this point, with any other type of PASW Modeler generated model, we would make predictions on new data by attaching this node to the Type node and running the generated model. This would automatically make predictions for new cases.

Time series data, though, are different. Unlike other types of data files, where there is usually no special order to the cases (in terms of modeling), order makes a difference in a time series. To reuse our settings, but also use the new data (from January to March) to make estimates, we must create a new Time Series node directly from the generated Time Series model.

Right-click on the generated model and select Edit
Click on Generate…Generate Modeling Node

This places a time series modeling node onto the palette.

Close the time series modeling output and delete the copied generated model from the stream canvas

Connect the Time Intervals node to the new Time Series node


Figure 8.33 Broadband2.str with the Time Series Node Generated from the Previous Model

We don’t have to specify any target fields because the models, with all specifications, are already stored in the generated time series modeling node. We simply insert the model node and decide whether the model should be re-estimated or not. Assuming that you have recently estimated models, you might be willing to act as if the model type for each series still holds. You can avoid redoing models and apply the existing models to the new data by using the Continue estimation using existing model(s) option. This choice means that PASW Modeler will use the model settings for model form (type of exponential smoothing or ARIMA model). Thus, for example, if the original model for a particular time series was Holt's linear trend, the same type of model is used for re-estimating and forecasting for that data; the system does not attempt to find the best model type for the new data. The Expert Modeler will estimate new model parameters based on the additional data.

If instead you wish to re-estimate the model type, then you can uncheck the Continue estimation check box. Although it will clearly take more computing time to redo the models, you may prefer this choice unless you have many time series which are very long. However, if you are making forecasts every month (week, etc.) based on just one additional month (week, etc.) of data, it may not be worth the effort to redo the complete models every month. In that case, you may wish to re-estimate the parameters every month, but re-estimate the models themselves every few months.

Double-Click on the Time Series node

By default, the Time Series node will use the existing models for estimation.


Figure 8.34 Time Series Model Node with Continue Estimating Using Existing Models

Click Run to place a new model in the Models Manager
Browse the new model


Figure 8.35 Time Series Model Output

As we can see, the models used for each series are the same as before (see Figure 8.24), although the statistics are not (examine stationary R-squared, for example). Let's review the parameter estimates.

Click Parameters tab


Figure 8.36 Parameters for Market_1 Holt's Linear Trend Model Estimated with New Data

The alpha parameter is still 1.0, but the gamma parameter is now almost zero (0.001) and non-significant, so it is effectively equal to zero. This means that the trend in the series is modeled using all the data, rather than the more recent observations, compared to the original model for the Market_1 series.

 Now let’s take a look at the new forecasts for April, May and June.

Close the Time Series model browser 

Attach a Table node to the new Time Series model
Run the Table node


Figure 8.37 Table Node Output with New Forecasts

There are now null values for the original series for April, May, and June 2004, but there are predictions for these months. In addition, you can compare the predictions for the first three months of 2004 made with these data against the original predictions we made above. They will not be the same because we have three months of additional data to improve the model.

In summary, in this lesson we demonstrated a typical application of time series analysis in data mining by showing how to make forecasts for several series at once. We then used these models but re-estimated them on new data to make new forecasts at a future date for those same series. The process of applying the models to new data can be repeated as often as necessary.


Summary Exercises

The exercises in this lesson are written for the data file broadband_1.sav.

1. Using the same dataset that was used in the lesson (broadband_1.sav), rerun the Time Series node, using different series from the ones used in the lesson to fit a model and then produce forecasts.

2. Try rerunning the models requesting outlier detection. Does this make any difference in the generated models?

3. For those with extra time: Try specifying your own exponential smoothing model(s) or an ARIMA model, if you are knowledgeable about these methods, to see whether you can obtain a better model than that found by the Expert Modeler for one or more of the series.


Lesson 9: Logistic Regression

Objectives

•  Review the concepts of logistic regression

•  Use the technique to model credit risk 

Data

A risk assessment study in which customers with credit cards were assigned to one of three categories: good risk, bad risk–profitable (some payments missed or other problems, but the customers were profitable for the issuing company), and bad risk–loss. In addition to the risk classification field, a number of demographics, including age, income, number of children, number of credit cards, number of store credit cards, having a mortgage, and marital status, were available for about 2,500 records.

9.1 Introduction to Logistic Regression

Logistic regression, unlike linear regression, develops a prediction equation for a categorical target field that contains two or more unordered categories (the categories could be ordered, but logistic regression does not take the ordering into account). Thus it can be applied to such situations as:

•  Predicting which brand (of the major brands) of personal computer an individual will purchase

•  Predicting whether or not a customer will close her account, accept an offering, or switch providers

Logistic regression technically predicts the probability of an event (of a record being classified into a specific category of the target field). The logistic function is shown in Figure 9.1. Suppose that we wish to predict whether someone buys a product. The function displays the predicted probability of purchase based on an incentive.

Figure 9.1 Logistic Model for Probability of Purchase

We see the probability of making the purchase increases as the incentive increases. Note that the function is not linear but rather S-shaped. The implication of this is that a slight change in the incentive could be effective or not depending on the location of the starting point. A linear model would imply that a fixed change in incentive would always have the same effect on the probability of purchase, and that the transition from low to high probability of purchase is quite gradual. With a logistic model, however, the transition can occur much more rapidly (steeper slope) near the .5 probability value.
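The S-shape comes directly from the logistic function. A small sketch follows; the intercept and slope are made-up values for illustration, not coefficients from any fitted model:

```python
import math

def purchase_prob(incentive, alpha=-2.0, beta=0.5):
    """Logistic curve: probability rises in an S-shape as the incentive grows.

    alpha and beta are illustrative coefficients, not estimates."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * incentive)))

# The curve is steepest near p = .5, which here occurs at incentive = -alpha/beta = 4
for x in (0, 4, 8):
    print(x, round(purchase_prob(x), 3))
```

Note the symmetry around the .5 point: equal-sized steps in the incentive produce large probability changes near the middle of the curve and small ones in the tails, which is exactly the non-linearity described above.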

To understand how the model functions, we need to review some equations. The logistic model makes predictions based on the probability of an outcome. Binary (two target category) logistic regression can be formulated as:

$$\text{Prob(event)} = \frac{e^{\alpha + B_1X_1 + B_2X_2 + \cdots + B_kX_k}}{1 + e^{\alpha + B_1X_1 + B_2X_2 + \cdots + B_kX_k}}$$

where $X_1, X_2, \ldots, X_k$ are the input fields.

This can also be expressed in terms of the odds of the event occurring.

$$\text{Odds(event)} = \frac{\text{Prob(event)}}{1 - \text{Prob(event)}} = \frac{\text{Prob(event)}}{\text{Prob(no event)}} = e^{\alpha + B_1X_1 + B_2X_2 + \cdots + B_kX_k}$$

where the outcome is one of two categories (event, no event). If we take the natural log of the odds, we have a linear model, akin to a standard regression equation:

$$\ln(\text{Odds(event)}) = \alpha + B_1X_1 + B_2X_2 + \cdots + B_kX_k$$

With two categories, a single odds ratio summarizes the outcome. However, when there are more than two target categories, ratios of the category probabilities can still describe the outcome, but additional ratios are required. For example, in the credit risk data used in this lesson there are three target categories: good risk, bad risk–profit, and bad risk–loss. Suppose we take the Good Risk category as the reference or baseline category and assign integer codes to the target categories for identification: (1) Bad Risk–Profit, (2) Bad Risk–Loss, (3) Good Risk. For the three categories we can create two probability ratios:

$$g(1) = \frac{\pi(1)}{\pi(3)} = \frac{\text{Prob(Bad Risk–Profit)}}{\text{Prob(Good Risk)}}$$

and

$$g(2) = \frac{\pi(2)}{\pi(3)} = \frac{\text{Prob(Bad Risk–Loss)}}{\text{Prob(Good Risk)}}$$

where $\pi(j)$ is the probability of being in target category $j$.

Each ratio is based on the probability of a target category divided by the probability of the reference or baseline target category. The remaining probability ratio (Bad Risk–Profit / Bad Risk–Loss) can be obtained by taking the ratio of the two ratios shown above. Thus the information in J target categories can be summarized in (J−1) probability ratios.


In addition, these target-category probability ratios can be related to input fields in a fashion similar to what we saw in the binary logistic model. Again using the Good Risk category as the reference or baseline, we have the following model:

$$\ln\left(\frac{\pi(1)}{\pi(3)}\right) = \ln\left(\frac{\text{Prob(Bad Risk–Profit)}}{\text{Prob(Good Risk)}}\right) = \alpha_1 + B_{11}X_1 + B_{12}X_2 + \cdots + B_{1k}X_k$$

and

$$\ln\left(\frac{\pi(2)}{\pi(3)}\right) = \ln\left(\frac{\text{Prob(Bad Risk–Loss)}}{\text{Prob(Good Risk)}}\right) = \alpha_2 + B_{21}X_1 + B_{22}X_2 + \cdots + B_{2k}X_k$$

Notice that there are two sets of coefficients for the three-category case, each describing the ratio of a target category to the reference or baseline category. If we complete this logic and create a ratio containing the baseline category in the numerator, we would have:

$$\ln\left(\frac{\pi(3)}{\pi(3)}\right) = \ln\left(\frac{\text{Prob(Good Risk)}}{\text{Prob(Good Risk)}}\right) = \ln(1) = 0 = \alpha_3 + B_{31}X_1 + B_{32}X_2 + \cdots + B_{3k}X_k$$

This implies that the coefficients associated with $\ln\left(\frac{\pi(3)}{\pi(3)}\right)$ are all 0 and so are not of interest. Also, the ratio relating any two target categories, excluding the baseline, can be easily obtained by subtracting their respective natural log expressions. Thus:

$$\ln\left(\frac{\pi(1)}{\pi(2)}\right) = \ln\left(\frac{\pi(1)}{\pi(3)}\right) - \ln\left(\frac{\pi(2)}{\pi(3)}\right),$$

or

$$\ln\left(\frac{\text{Prob(Bad Risk–Profit)}}{\text{Prob(Bad Risk–Loss)}}\right) = \ln\left(\frac{\text{Prob(Bad Risk–Profit)}}{\text{Prob(Good Risk)}}\right) - \ln\left(\frac{\text{Prob(Bad Risk–Loss)}}{\text{Prob(Good Risk)}}\right)$$

We are interested in predicting the probability of each target category for specific values of the predictor fields. This can be derived from the expressions above. The probability of being in target category j is:

$$\pi(j) = \frac{g(j)}{\sum_{i=1}^{J} g(i)},$$

where $J$ is the number of target categories.

In our example with the three risk categories, for category (1):


$$\pi(1) = \frac{g(1)}{g(1) + g(2) + g(3)} = \frac{\dfrac{\pi(1)}{\pi(3)}}{\dfrac{\pi(1)}{\pi(3)} + \dfrac{\pi(2)}{\pi(3)} + \dfrac{\pi(3)}{\pi(3)}} = \frac{\pi(1)}{\pi(1) + \pi(2) + \pi(3)} = \frac{\pi(1)}{1} = \pi(1)$$

And substituting for the g(j)'s, we have an equation relating the predictor fields to the target category probabilities:

$$\pi(1) = \frac{e^{\alpha_1 + B_{11}X_1 + \cdots + B_{1k}X_k}}{e^{\alpha_1 + B_{11}X_1 + \cdots + B_{1k}X_k} + e^{\alpha_2 + B_{21}X_1 + \cdots + B_{2k}X_k} + e^{\alpha_3 + B_{31}X_1 + \cdots + B_{3k}X_k}}$$

and, since $\alpha_3$ and the $B_{3i}$ coefficients are all 0,

$$\pi(1) = \frac{e^{\alpha_1 + B_{11}X_1 + \cdots + B_{1k}X_k}}{e^{\alpha_1 + B_{11}X_1 + \cdots + B_{1k}X_k} + e^{\alpha_2 + B_{21}X_1 + \cdots + B_{2k}X_k} + 1}$$

In this way, the logic of binary logistic regression can be naturally extended to permit analysis of categorical target fields with more than two categories.
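The derivation above can be checked numerically. This sketch uses made-up coefficients (they are not estimates from the credit-risk model); category 3, the baseline, implicitly gets an all-zero coefficient row, so its exponent is e^0 = 1:

```python
import math

def multinomial_probs(x, coefs):
    """Turn per-category log-ratio coefficients into category probabilities.

    coefs: one (alpha, B1, ..., Bk) row per non-baseline category.
    The baseline row is implicitly all zeros, contributing e^0 = 1."""
    scores = [math.exp(row[0] + sum(b * xi for b, xi in zip(row[1:], x)))
              for row in coefs]
    scores.append(1.0)                       # baseline category
    total = sum(scores)
    return [s / total for s in scores]

# Two input fields; rows are (alpha, B1, B2) for categories 1 and 2 (illustrative)
coefs = [(0.2, 0.5, -0.3),    # ln(Prob(1)/Prob(3)) coefficients
         (-0.1, 0.1, 0.4)]    # ln(Prob(2)/Prob(3)) coefficients
p = multinomial_probs([1.0, 2.0], coefs)
print([round(v, 3) for v in p], sum(p))      # the three probabilities sum to 1
```

With all coefficients zero, every category gets probability 1/J, which is a quick sanity check that the normalization is right.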

9.2 A Multinomial Logistic Analysis: Predicting Credit Risk

We will perform a multinomial logistic analysis that attempts to predict credit risk (three categories) for individuals based on several financial and demographic input fields. The data file has been split into two, and we use risktrain.txt to train the model. We are interested in fitting a model, interpreting and assessing it, and obtaining a prediction equation. Possible input fields are shown below.

Field name   Field description
AGE          age in years
INCOME       income (in thousands of British pounds)
GENDER       f=female, m=male
MARITAL      marital status: single, married, divsepwid (divorced, separated or widowed)
NUMKIDS      number of dependent children
NUMCARDS     number of credit cards
HOWPAID      how often paid: weekly, monthly
MORTGAGE     have a mortgage: y=yes, n=no
STORECAR     number of store credit cards
LOANS        number of other loans
INCOME1K     income (in thousands of British pounds) divided by 1,000

The target field is:

Field name   Field description
RISK         credit risk: 1=bad risk-loss, 2=bad risk-profit, 3=good risk


To access the data:

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory
Double-click on Logistic.str
Run the Table node, examine the data, and then close the Table window
Double-click on the Type node

The target field is credit risk (RISK). Notice that only four input fields are used. This is done to simplify the results for this presentation. As an exercise, the other fields will be used as predictors.

Figure 9.2 Type Node for Logistic Analysis

Close the Type node dialog
Double-click on the Logistic Regression model node named RISK
Click on the Model tab


Figure 9.3 Logistic Regression Dialog

In the Model tab, you can choose whether a constant (intercept) is included in the equation. The Procedure option is used to specify whether a binomial or multinomial model is created. The options that will be available to you in the dialog box will differ according to which modeling procedure you select.

Binomial is used when the target field has two discrete values, such as good risk/bad risk, or churn/not churn. Whenever you use this option, you will in addition be asked to declare which of your flag or categorical fields should be treated as categorical, the type of contrast you want performed, and the reference category for each predictor. The default contrast is Indicator, which indicates the presence or absence of category membership. However, for fields with some implicit order, you may want to use another contrast such as Repeated, which compares each category with the one that precedes it. The default reference or base category is the First category. If you prefer, you can change this to the Last category.

Multinomial should be used when the target field is a categorical field with more than two values. This is the correct choice in our example because the RISK field has three values: bad loss, bad profit, and good risk. Whenever you use this option, the Model type option will become available for you to specify whether you want a main effects model, a full factorial model, or a custom model. By default, a model including the main effects (no interactions) of factors (categorical inputs) and covariates (continuous inputs) will be run. This is similar to what the Regression model node will do (unless interaction terms are formally added). The Full factorial option would fit a model including all factor interactions (in our example, with two categorical predictors, the two-way interaction of MARITAL and MORTGAGE would be added).

Notice that there are Method options (as there were for linear regression), so stepwise methods can be used when the Main Effects model type is selected. When a number of input fields are available, the stepwise methods provide a method of input field selection based on statistical criteria.

The Base Category for target option is used to specify the reference category. The default is the First category in the list, which in this case is bad loss. Note: This field is unavailable if the contrast setting is Difference, Helmert, Repeated, or Polynomial.

Select the Multinomial Procedure option (if necessary)
Click on the Specify button to the right of Base category for target. This will open the Insert Value dialog box
Click on good risk

Figure 9.4 Insert Value Dialog

Click the Insert button

This will change the base target category. The result is shown in Figure 9.5. 


Figure 9.5 Logistic Regression Dialog with Good Risk as the Base Target Category

Click on the Expert tab
Click the Expert Mode option button


Figure 9.6 Logistic Expert Mode Options

The Scale option allows adjustment to the estimated parameter variance-covariance matrix based on over-dispersion (variation in the outcome greater than expected by theory, which might be due to clustering in the data). The details of such adjustment are beyond the scope of this course, but you can find some discussion in McCullagh and Nelder (1989).

If the Append all probabilities checkbox is selected, predicted probabilities for every category of the target field would be added to each record passed through the generated model node. If not selected, a predicted probability field is added only for the predicted category.

Click the Output button
Make sure the Likelihood ratio tests check box is selected
Make sure the Classification table check box is selected

By default, summary statistics and (partial) likelihood ratio tests for each effect in the model appear in the output. Also, 95% confidence bands will be calculated for the parameter estimates. We have requested a classification table so we can assess how well the model predicts the three risk categories.


Figure 9.7 Logistic Regression Advanced Output Options

In addition, a table of observed and expected cell probabilities can be requested (Goodness of fit chi-square statistics). Note that, by default, cells are defined by each unique combination of a covariate (continuous input) and factor (categorical input) pattern, and a response category. Since a continuous predictor (INCOME1K) is used in our analysis, the number of cell patterns is very large and each might have but a single observation. These small counts could possibly yield unstable results, and so we will forego goodness of fit statistics. The asymptotic correlation of parameter estimates can provide a warning for multicollinearity problems (when high correlations are found among parameter estimates). Iteration history information is requested to help debug problems if the algorithm fails to converge, and the number of iteration steps to display can be specified. Monotonicity measures can be used to find the number of concordant pairs, discordant pairs, and tied pairs in the data, as well as the percentage of the total number of pairs that each represents. The Somers' D, Goodman and Kruskal's Gamma, Kendall's tau-a, and Concordance Index C are also displayed in this table. Information criteria provides Akaike's information criterion (AIC) and Schwarz's Bayesian information criterion (BIC).

Click OK
Click Convergence button

Figure 9.8 Logistic Regression Convergence Criteria

The Logistic Regression Convergence Criteria options control technical convergence criteria. Analysts familiar with logistic regression algorithms might use these if the initial analysis fails to converge to a solution.


Click Cancel
Click Run
Browse the Logistic Regression generated model node named RISK in the Models Manager window
Click the Advanced tab, and then expand the browsing window

The advanced output is displayed in HTML format.

Figure 9.9 Record Processing Summary

The marginal frequencies of the categorical inputs and the target are reported, along with a summary of the number of valid and missing records. A record must have valid values on all inputs and the target in order to be included in the analysis. We have nearly 2,500 records for the analysis.


Figure 9.10 Model Fit and Pseudo R-Square Summaries

The Final model chi-square statistic tests the null hypothesis that all model coefficients are zero in the population, equivalent to the overall F test in regression. It has ten degrees of freedom that correspond to the parameters in the model (seen below), is based on the change in –2LL (–2 log likelihood) from the initial model (with just the intercept) to the final model, and is highly significant. Thus at least some effect in the model is significant. The AIC and BIC fit measures are also displayed; smaller values of these measures indicate better fit. Because each of them decreased, we can conclude that the model fit improved with the addition of the predictors.
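The two information criteria can be sketched directly from their standard definitions: AIC = –2LL + 2k and BIC = –2LL + k·ln(n), where k is the number of estimated parameters and n the number of records. The –2LL values below are made up for illustration and are not the actual course output:

```python
import math

def aic(neg2ll, k):
    # Akaike's information criterion: -2LL plus a penalty of 2 per parameter
    return neg2ll + 2 * k

def bic(neg2ll, k, n):
    # Schwarz's Bayesian criterion: the per-parameter penalty grows with n
    return neg2ll + k * math.log(n)

# Hypothetical -2LL for an intercept-only model (2 parameters) and a fitted
# model (10 parameters) on roughly 2,455 records
null_neg2ll, full_neg2ll = 4500.0, 4000.0

# Both criteria decreased, so by this yardstick the predictors improved the fit
print(aic(full_neg2ll, 10) < aic(null_neg2ll, 2))              # True
print(bic(full_neg2ll, 10, 2455) < bic(null_neg2ll, 2, 2455))  # True
```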

Pseudo r-square measures try to measure the amount of variation (as functions of the chi-square lack of fit) accounted for by the model. The model explains only a modest amount of the variation (the maximum is 1, and some measures cannot reach this value).

Figure 9.11 Likelihood Ratio Tests

The Model Fitting Criteria table provided an omnibus test of effects in the model. Here we have a test of significance for each effect (in this case the main effect of an input field) after adjusting for the other effects in the model. The caption explains how it is calculated. All effects are highly significant. Notice that the intercepts are not tested in this way, but tests of the individual intercepts can be found in the Parameter Estimates table. In addition, we can use this table to rank order the importance of the predictors. For instance, if we focus on the –2LL value, if INCOME1K was removed as a predictor, the –2LL value would increase by 302.422. Clearly, the removal of this predictor would have far more impact on the overall fit than if we were to eliminate any of the other predictors. The further –2LL gets from zero, the worse the fit. Thus, we can conclude that INCOME1K is the most important predictor, followed by MARITAL, NUMKIDS, and MORTGAGE.

For those familiar with binary (two category) logistic regression, note that the values in the df (degrees of freedom) column are double what you would expect for a binary logistic regression model. For example, the covariate income (INCOME1K), which is continuous, has two degrees of freedom. This is because with three target categories, there are two probability ratios to be fit, doubling the number of parameters. Income has by far the largest chi-square value compared to the other predictors with two (or even four) degrees of freedom.

9.3 Interpreting Coefficients

The most striking feature of the Parameter Estimates table is that there are two sets of parameters. One set is for the probability ratio of “bad risk–loss” to “good risk,” which is labeled “bad loss.” The other set is for the probability ratio of “bad risk–profit” to “good risk,” labeled “bad profit.” You can view the estimates in equation form in the Model tab, but the Advanced tab contains more supplementary information.

Figure 9.12 Parameter Estimates

For each of the two target probability ratios, each predictor is listed, plus an intercept, with the B coefficients and their standard errors, a test of significance based on the Wald statistic, and the Exp(B) column, which is the exponentiated value of the B coefficient, along with its 95% confidence interval. As with ordinary linear regression, these coefficients are interpreted as estimates for the effect of a particular input, controlling for the other inputs in the equation.


Recall that the original (linear) model is in terms of the natural log of a probability ratio. The intercept represents the log of the expected probability ratio of two target categories when all continuous inputs are zero and all categorical fields are set to their reference category (last group) values. For covariates, the B coefficient is the effect of a one-unit change in the input on the log of the probability ratio. Examining income (INCOME1K) in the “bad loss” section, an increase of 1 unit (equivalent to 1,000 British pounds) changes the log of the probability ratio between “bad loss” and “good risk” by –.056. But what does this mean in terms of probabilities? Moving to the Exp(B) column, we see the value is .945 for INCOME1K (in the “bad loss” section of the table). Thus increasing income by 1 unit (or 1,000 British pounds) multiplies the expected ratio of the probability of being a bad loss to the probability of being a good risk by a factor of .945. In other words, increasing income reduces the expected probability of being a “bad loss” relative to being a “good risk,” by a factor of .945 per 1,000 British pounds. This finding makes common sense. If we examine the income coefficient in the “bad profit” section of the table, we see that in a similar way (Exp(B) = .878) the expected probability of being a “bad profit” relative to being a good risk decreases as income increases. Thus increasing income, after controlling for the other fields in the equation, is associated with decreasing the probability of having a “bad loss” or “bad profit” outcome relative to being a “good risk.” This relationship is quantified by the values in the Exp(B) column, and the Sig column indicates that both coefficients are statistically significant.
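The link between a B coefficient and its Exp(B) column is simply exponentiation, so the income effect above can be checked directly. A minimal sketch, with the B coefficients transcribed from the table (the small rounding gap comes from B itself being rounded):

```python
import math

# B coefficients for INCOME1K from the Parameter Estimates table
b_bad_loss, b_bad_profit = -0.056, -0.130

# Exp(B): the multiplicative change in each probability ratio per one-unit
# (1,000 British pound) increase in income, other inputs held constant
print(math.exp(b_bad_loss))    # ~0.9455, the table's .945
print(math.exp(b_bad_profit))  # ~0.8781, the table's .878

# Effects compound multiplicatively: a 10-unit (10,000 pound) increase
# scales the "bad loss" vs. "good risk" ratio by exp(B) raised to the 10th
print(math.exp(b_bad_loss) ** 10)  # ~0.571
```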

Turning to the number of children (NUMKIDS), we see that its coefficient is significant for the “bad loss” ratio, but not the “bad profit” ratio. Examining the Exp(B) column for NUMKIDS in the “bad loss” section, the coefficient estimate is 2.267. For each additional child (one unit increase in NUMKIDS), the expected ratio of the probability of being a “bad loss” to being a “good risk” more than doubles. Thus, controlling for other predictors, adding a child (one unit increase) doubles the expected probability of being a “bad loss” relative to a “good risk.” However, controlling for the other predictors, the number of children has no significant effect on the probability ratio of being a “bad profit” relative to a “good risk.”

The Logistic node uses a General Linear Model coding scheme. Thus for each categorical input (here MARITAL and MORTGAGE), the last category value is made the reference category and the other coefficients for that input are interpreted as offsets from the reference category. In examining the table we see that the last categories for MARITAL (single) and MORTGAGE (y) have B coefficients fixed at 0. Because of this the coefficient of any other category can be interpreted as the change associated with shifting from the reference category to the category of interest, controlling for the other input fields. Since the reference category coefficients are fixed at 0, they have no associated statistical tests or confidence bands.

Looking at the MARITAL input field, its two coefficients (for divsepwid and married categories) are significant for both the “bad loss” and “bad profit” summaries. In the “bad loss” section, we see the estimated Exp(B) coefficient for the “MARITAL=divsepwid” category is .284, while that for “MARITAL=married” is 2.891. Thus we could say that, after controlling for other inputs, compared to those who are single, those who are divorced, separated or widowed have a large reduction (.284) in the expected ratio of the probability of being a “bad loss” relative to a “good risk.” Put another way, the divorced, separated or widowed group is expected to have fewer “bad losses” relative to “good risks” than is the single group. On the other hand, the married group is expected to have a much higher (by a factor of almost 3) proportion of “bad losses” relative to “good risks” than the single group. The explanation of why being married versus single should be associated with an increase of “bad losses” relative to “good risks” should be worked out by the analyst, perhaps in conjunction with someone familiar with the credit industry (domain expert). If we examine the MARITAL Exp(B) coefficients for the “bad profit” ratios, we find a very similar result.


Finally, MORTGAGE is significant for both the “bad loss” and “bad profit” ratios. Since having a mortgage (coded y) is the reference category, examining the Exp(B) coefficients shows that compared to the group with a mortgage, those without a mortgage have a greater expected probability of being “bad losses” (1.828) or “bad profits” (2.526) relative to “good risks.” In short, those without mortgages are less likely to be good risks, controlling for the other predictors.

In this way, the statistical significance of inputs can be determined and the coefficients interpreted. Note that if a predictor were not significant in the Likelihood Ratio Tests table, then the model should be rerun after dropping the field. Although NUMKIDS is not significant for both sets of category ratios, the joint test (Likelihood Ratio Test) indicates it is significant and so we would retain it.

Classification Table

The classification table, sometimes called a misclassification table or confusion matrix, provides a measure of how well the model performs. With three target categories we are interested in the overall accuracy of model classification, the accuracy for each of the individual target categories, and patterns in the errors.

Figure 9.13 Classification Table

The rows of the table represent the actual target categories while the columns are the predicted target categories. We see that overall, the predictive accuracy of the model is 62.4%. Although marginal counts do not appear in the table, by adding the counts within each row we find that the most common target category is bad profit (1,475). This constitutes 60.1% of all cases (2,455). Thus the overall predictive accuracy of our model is not much of an improvement over the simple rule of always predicting “bad risk–profit.” However, we should recall that this simple rule would never make a prediction of “bad risk–loss” or “good risk.”

In examining the individual categories, the “bad risk–profit” group is predicted most accurately (87.3%), while the other categories, “bad risk–loss” (15.9%) and “good risk” (36.8%), are predicted with much less accuracy. Not surprisingly, most errors in prediction for these latter two target categories are predicted to be “bad risk–profit.”

The classification table allows us to evaluate a model from the perspective of predictive accuracy. Whether this model would be adequate depends in part on the value of correct predictions and the cost of errors. Given the modest improvement of this model over simply classifying all cases as “bad risk–profit,” in practice an analyst would see if the model could be improved by adding additional predictors and perhaps some interaction terms.

Finally, it is important to note that the predictions were evaluated on the same data used to fit the model and for this reason may be optimistic. A better procedure is to keep a separate validation sample on which to evaluate the predictive accuracy of the model.
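The comparison against the majority-class rule can be reproduced from the counts reported above (2,455 records, 1,475 of them “bad risk–profit,” 62.4% overall model accuracy). A minimal sketch:

```python
# Counts taken from the text; the full confusion matrix is not reproduced here
total_cases = 2455
majority_count = 1475          # "bad risk-profit", the most common category
model_accuracy = 0.624         # overall accuracy from the classification table

# Baseline: always predict the majority class
baseline = majority_count / total_cases
print(round(baseline, 3))                   # 0.601
print(round(model_accuracy - baseline, 3))  # 0.023 -- a modest lift
```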


Making Predictions

We now have the estimated model coefficients. How does the Logistic generated model node make predictions from the model? First, let’s see the actual predictions by adding the generated model to the stream.

Close the Model browsing window
Add a Table node to the stream and connect the Logistic generated model to the Table node
Run the Table node

Figure 9.14 Predicted Value and Probability from Logistic Model

The field $L-RISK contains the most likely prediction from the model (here “good risk”). The probabilities for all three target categories must sum to 1; the model prediction is the category with the highest probability. That probability is contained in the field $LP-RISK. So for the first case, the prediction is “good risk” and the predicted probability of this occurring is .692 for this combination of input values. You prefer that the probability be as close to 1 as possible (the lowest possible value for the predicted category is .333; Why?).
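The prediction rule itself is just a pick-the-maximum over the category probabilities. In this sketch only the .692 figure comes from the table shown; the other two probabilities are invented to complete the example:

```python
def predict(probs):
    # Mirror the $L-RISK / $LP-RISK pair: the predicted category is the one
    # with the highest probability, reported together with that probability
    category = max(probs, key=probs.get)
    return category, probs[category]

# First record: .692 is from the table; the remaining split is hypothetical
probs = {"bad loss": 0.100, "bad profit": 0.208, "good risk": 0.692}
print(predict(probs))  # ('good risk', 0.692)

# With three probabilities summing to 1, the winner must exceed 1/3,
# which is why .333 is the lowest possible value for the predicted category
```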

To illustrate how the actual calculation is done, let’s take an individual who is single, has a mortgage, no children, and has an income of 35,000 British pounds (INCOME1K = 35.00). What is the predicted probability of her (although gender was not included in the model) being in each of the three risk categories? Into which risk category would the model place her?


Earlier in this lesson we showed the following (where π(j) is the probability of being in target category j):

π(j) = g(j) / ( g(1) + g(2) + ... + g(J) ), where J is the number of target categories

If we substitute the parameter estimates in order to obtain the estimated probability ratios, we have:

ĝ(1) = e^( .438 - .056*Income1k + .818*Numkids - 1.260*Marital1 + 1.062*Marital2 + .603*Mortgage1 )

ĝ(2) = e^( 4.285 - .130*Income1k + .153*Numkids - 1.220*Marital1 + 1.021*Marital2 + .927*Mortgage1 )

and

ĝ(3) = 1

where, because of the coding scheme for the categorical inputs (Factors):

Marital1 = 1 if Marital=divsepwid; 0 otherwise
Marital2 = 1 if Marital=married; 0 otherwise
Mortgage1 = 1 if Mortgage=n; 0 otherwise

Thus for our hypothetical individual, the estimated probability ratios are:

ĝ(1) = e^( .438 - .056*35.0 + .818*0 - 1.260*0 + 1.062*0 + .603*0 ) = e^(-1.522) = .218

ĝ(2) = e^( 4.285 - .130*35.0 + .153*0 - 1.220*0 + 1.021*0 + .927*0 ) = e^(-.265) = .767

ĝ(3) = 1

And the estimated probabilities are:

π̂(1) = .218 / ( .218 + .767 + 1 ) = .110

π̂(2) = .767 / ( .218 + .767 + 1 ) = .386

π̂(3) = 1 / ( .218 + .767 + 1 ) = .504

Since the third group (good risk) has the greatest expected probability (.504), the model predicts that the individual belongs to that group. The next most likely group to which the individual would be assigned would be group 2 (bad risk–profit) because its expected probability is the next largest (.386).
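The whole calculation can be replayed in a few lines. The coefficients below are transcribed from the equations above; this is an illustration of the arithmetic, not the Modeler scoring code itself:

```python
import math

# Probability-ratio equations relative to the "good risk" base category,
# with coefficients transcribed from the text above
def g_bad_loss(income1k, numkids, marital1, marital2, mortgage1):
    return math.exp(0.438 - 0.056 * income1k + 0.818 * numkids
                    - 1.260 * marital1 + 1.062 * marital2 + 0.603 * mortgage1)

def g_bad_profit(income1k, numkids, marital1, marital2, mortgage1):
    return math.exp(4.285 - 0.130 * income1k + 0.153 * numkids
                    - 1.220 * marital1 + 1.021 * marital2 + 0.927 * mortgage1)

# Single with a mortgage (all dummy codes 0), no children, INCOME1K = 35.0
g1 = g_bad_loss(35.0, 0, 0, 0, 0)    # ~.218
g2 = g_bad_profit(35.0, 0, 0, 0, 0)  # ~.767
g3 = 1.0                             # base category

total = g1 + g2 + g3
probs = {"bad loss": g1 / total, "bad profit": g2 / total, "good risk": g3 / total}
for category, p in probs.items():
    print(category, round(p, 3))     # .110, .386, .504 -> "good risk" wins
```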


Additional Readings

Those interested in learning more about logistic regression might consider David W. Hosmer and Stanley Lemeshow’s Applied Logistic Regression, 2nd Edition, New York: Wiley, 2000.


Summary Exercises

The exercises in this lesson use the data file risktrain.txt detailed in the following text box.

RiskTrain.txt contains information from a risk assessment study in which customers with credit cards were assigned to one of three categories: good risk, bad risk-profitable (some payments missed or other problems, but profitable for the issuing company), and bad risk-loss. In addition to the risk classification field, a number of demographics are available for about 2,500 cases. We want to predict credit risk from the demographic fields. The file contains the following fields:

ID Customer ID number
AGE Age
INCOME Income in British pounds
GENDER Gender
MARITAL Marital status
NUMKIDS Number of dependent children
NUMCARDS Number of credit cards
HOWPAID How often is customer paid by employer (weekly, monthly)
MORTGAGE Does customer have a mortgage?
STORECAR Number of store credit cards
LOANS Number of outstanding loans
RISK Credit risk category
INCOME1K Income in thousands of British pounds (field derived within PASW Modeler)

1. Continuing with the stream from the lesson, add the other available inputs, excluding INCOME (which is linearly related to INCOME1K) and ID, to a logistic regression model and evaluate the results. Do the additional fields substantially improve the predictive accuracy of the model? Examine the estimated coefficients for the significant inputs. Do these relationships make sense?

2. Rerun the Logistic node, dropping those inputs that were not significant in the last analysis. Does the accuracy of the model change much? Does the interpretation of any of the coefficients change substantially?

3. Rerun the Logistic node, this time using the Stepwise method. Do the input fields selected match those retained in Exercise 2?

4. Run a rule induction model (using C5.0 or CHAID) on this data, using all fields but ID and INCOME as inputs. How does the accuracy of this model compare to that found by logistic regression? What does this suggest about the relations in the data? Do the inputs used by the model correspond to the inputs that were found to be significant in the logistic regression analysis?

5. Run a neural net model on this data, again excluding ID and INCOME as inputs. Make sure you request predictor importance. Does the neural network outperform the other models? Are the important predictors in the neural network model the same as the significant input fields in the logistic regression?


Lesson 10: Discriminant Analysis

Objectives

•  How Does Discriminant Analysis Work?

•  The Elements of a Discriminant Analysis

•  The Discriminant Model

•  How Cases are Classified 

•  Assumptions of Discriminant Analysis

•  Analysis Tips

•  A Two–Group Discriminant Example

Data

To demonstrate discriminant analysis we use data from a study in which respondents answered, hypothetically, whether they would accept an interactive news subscription service (via cable). We wish to identify those groups most likely to adopt the service. Several demographic fields are available, including education, gender, age, income (in categories), number of children, number of organizations the respondent belonged to, and the number of hours of TV watched per day. The target measure was whether they would accept the offering or not.

10.1 Introduction

Discriminant analysis is a technique designed to characterize the relationship between a set of fields, often called the response or predictors, and a grouping field with a relatively small number of categories. By modeling the relationship, discriminant can make predictions for categories of the grouping field (target). To do so, discriminant creates a linear combination of the predictors that best characterizes the differences among the groups. The technique is related to both regression and multivariate analysis of variance, and as such it is another general linear model technique. Another way to think of discriminant analysis is as a method to study differences between two or more groups on several fields simultaneously.

Common uses of discriminant include:

1. Deciding whether a bank should offer a loan to a new customer.
2. Determining which customers are likely to buy a company’s products.
3. Classifying prospective students into groups based on their likelihood of success at a school.
4. Identifying patients who may be at high risk for problems after surgery.


10.2 How Does Discriminant Analysis Work?

Discriminant analysis assumes that the population of interest is composed of separate and distinct populations, as represented by the grouping field. The discriminant analysis grouping field can have two or more categories. Furthermore, we assume that each population is measured on a set of fields—the predictors—that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of the predictors that best separate the populations. If we assume two input fields, X and Y, and two groups for simplicity, this situation can be represented as in Figure 10.1.

Figure 10.1 Two Normal Populations and Two Predictor Fields, with Discriminant Axis

The two populations or groups clearly differ in their mean values on both the X and Y axes. However, the linear function—in this instance, a straight line—that best separates the two groups is a combination of the X and Y values, as represented by the line running from lower left to upper right in the scatterplot. This line is a graphic depiction of the discriminant function, or linear combination of X and Y, that is the best predictor of group membership. In this case with two groups and one function, discriminant will find the midpoint between the two groups that is the optimum cutoff for separating the two groups (represented here by the short line segment). The discriminant function and cutoff can then be used to classify new observations.
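The midpoint-cutoff idea in Figure 10.1 can be sketched in a few lines: project each case onto an assumed discriminant direction and split at the midpoint between the projected group means. The weights and points below are invented purely for illustration:

```python
# Assumed discriminant weights for the two inputs X and Y (invented values)
w = (0.7, 0.7)

def score(point):
    # The discriminant function: a linear combination of X and Y
    return w[0] * point[0] + w[1] * point[1]

group_a = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.0)]  # toy data for group A
group_b = [(3.0, 3.2), (2.8, 3.5), (3.3, 2.9)]  # toy data for group B

mean_a = sum(score(p) for p in group_a) / len(group_a)
mean_b = sum(score(p) for p in group_b) / len(group_b)
cutoff = (mean_a + mean_b) / 2  # the short line segment in Figure 10.1

def classify(point):
    return "A" if score(point) < cutoff else "B"

print(classify((1.2, 1.1)))  # A
print(classify((3.1, 3.0)))  # B
```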

If there are more than two predictors, then the groups will (hopefully) be well separated in a multidimensional space, but the principle is exactly the same. If there are more than two groups, more than one classification function can be calculated, although not all the functions may be needed to classify the cases. Since the number of predictors is almost always more than two, scatterplots such as Figure 10.1 are not always that helpful. Instead, plots are often created using the new discriminant functions, since it is on these that the groups should be well separated.

The effect of each predictor on each discriminant function can be determined, and the predictors can be identified that are more important or more central to each function. Nevertheless, unlike in regression, the exact effects of the predictors are not typically seen as of ultimate importance in discriminant analysis. Given the primary goal of correct prediction, the specifics of how this is accomplished are not as critical as the prediction itself (such as offering loans to customers who will pay them back). Second, as will be demonstrated below, the predictors do not directly predict the grouping field, but instead a value on the discriminant function, which, in turn, is used to generate a group classification.

10.3 The Discriminant Model

The discriminant model has the following mathematical form for each function:

FK = D0 + D1X1 + D2X2 + ... + DpXp

where FK is the score on function K, the Di’s are the discriminant coefficients, and the Xi’s are the predictor or response fields (there are p predictors). The maximum number of functions K that can be derived is equal to the minimum of the number of predictors (p) or the quantity (number of groups – 1). In most applications, there will be more predictors than categories of the grouping field, so the latter will limit the number of functions. For example, if we are trying to predict which customers will choose one of three offers, (3 – 1), or two, classification functions can be derived.
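The count of derivable functions is simply the smaller of p and (number of groups - 1); a minimal sketch:

```python
def max_discriminant_functions(n_predictors, n_groups):
    # Maximum number of discriminant functions: min(p, number of groups - 1)
    return min(n_predictors, n_groups - 1)

print(max_discriminant_functions(7, 3))  # 2: three offers -> two functions
print(max_discriminant_functions(2, 5))  # 2: here the predictors are the limit
```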

When more than one function is derived, each subsequent function is chosen to be uncorrelated, or orthogonal, to the previous functions (just as in principal components analysis, where each component is uncorrelated with all others; see Lesson 7). This allows for straightforward partitioning of variance.

Discriminant creates a linear combination of the predictor fields to calculate a discriminant score for each function. This score is used, in turn, to classify cases into one of the categories of the grouping field.

10.4 How Cases Are Classif ied There are three general types of methods to classify cases into groups.

1.  Maximum likelihood or probability methods: These techniques assign a case to group k if its probability of membership is greater for group k than for any other group. These probabilities

are posterior probabilities, as defined below. This method relies upon assumptions of multivariate normality to calculate probability values.

2.  Linear classification functions: These techniques assign a case to group k if its score on the function for that group is greater than its score on the function for any other group. This method was first suggested by Fisher, so these functions are often called Fisher linear discriminant functions (which is how PASW Modeler refers to them).

3.  Distance functions: These techniques assign a case to group k if its distance to that group's centroid is smaller than its distance to any other group's centroid. Typically, the Mahalanobis distance is the measure of distance used in classification.

When the assumption of equal covariance matrices is met, all three methods give equivalent results.
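The third method can be sketched in a few lines of Python. The centroids, the pooled inverse covariance matrix, and the case values below are all invented for illustration; they are not taken from the lesson's data.

```python
# Sketch of distance-based classification: assign a case to the group
# whose centroid is nearest under the Mahalanobis metric.
def mahalanobis_sq(x, mean, inv_cov):
    """Squared Mahalanobis distance d' * S^-1 * d for one case."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))

# Hypothetical group centroids (age, education) and pooled inverse covariance:
centroids = {"accept": [40.0, 14.0], "decline": [33.0, 12.0]}
inv_cov = [[0.04, 0.0], [0.0, 0.25]]

case = [38.0, 12.5]
predicted = min(centroids, key=lambda g: mahalanobis_sq(case, centroids[g], inv_cov))
print(predicted)  # accept
```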

PASW Modeler uses the first technique, a probability method based on Bayesian statistics, to derive a rule to classify cases. The rule uses two probability estimates. The prior probability is an estimate of the probability that a case belongs to a particular group when no information from the predictors is available. Prior probabilities are typically either determined by the number of cases in each category of the grouping field, or by assuming that the prior probabilities are all equal (so that if there are three


groups, the prior probability of each group would be 1/3). We have more to say about prior probabilities below.

Second, the conditional probability is the probability of obtaining a specific discriminant score (or one further from the group mean) given that a case belongs to a specific group. By assuming that the discriminant scores are normally distributed, it is possible to calculate this probability.

With this information and by applying Bayes' rule, the posterior probability is calculated, which is defined as the likelihood or probability of group membership, given a specific discriminant score. It is this probability value that is used to classify a case into a group. That is, a case is assigned to the group with the highest posterior probability.
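As a rough illustration of the calculation just described (not PASW Modeler's actual implementation), the posterior combines the prior with a conditional density of the discriminant score, assumed normal. All numbers below (priors, group centroids, the unit standard deviation) are invented for the example.

```python
# Sketch of Bayes' rule for classification: posterior is proportional to
# prior * conditional density of the observed discriminant score.
import math

def normal_pdf(x, mean, sd):
    """Normal density, used for the conditional probability of a score."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

priors = {"yes": 0.5, "no": 0.5}              # equal priors (the default)
group_means = {"yes": 0.36, "no": -0.34}      # hypothetical centroids
score = 0.1                                   # a case's discriminant score

unnorm = {g: priors[g] * normal_pdf(score, group_means[g], 1.0) for g in priors}
total = sum(unnorm.values())
posterior = {g: p / total for g, p in unnorm.items()}
predicted = max(posterior, key=posterior.get)  # highest posterior wins
print(predicted)  # yes
```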

Although PASW Modeler uses a probability method of classification, you will most probably use a method based on a linear function to classify new data. This is mainly for ease of calculation, because calculating probabilities for new data is computationally intensive compared to using a classification function. This will be illustrated below.

10.5 Assumptions of Discriminant Analysis

As with other general linear model techniques, discriminant makes some fairly rigorous assumptions about the population. And, as with those other techniques, it tends to be fairly robust to violations of these assumptions.

Discriminant assumes that the predictor fields are measured on an interval or ratio scale (continuous). However, as with regression, discriminant is often used successfully with fields that are ordinal, such as questionnaire responses on a five- or seven-point Likert scale. Nominal fields can be used as predictors if they are given dummy coding. The grouping field can be measured on any scale and can have any number of categories, though in practice most analyses are run with five or fewer categories.

Discriminant assumes that each group is drawn from a multivariate normal population. This assumption can be, and often is, violated; moderate departures from normality are usually not a problem, especially as sample size increases. If this assumption is violated, the tests of significance and the probabilities of group membership will be inexact. If the groups are widely separated in the space of the predictors, this will not be as critical as when there is a fair amount of overlap between the groups.

When the number of categorical predictor fields is large (as opposed to interval-ratio predictors), multivariate normality cannot hold by definition. In that case, greater caution must be used, and many analysts would choose to use logistic regression instead. Most evidence indicates that discriminant often performs reasonably well with such predictors, though.

Another important assumption is that the covariance matrices of the various groups are equal. This is equivalent to the standard assumption in analysis of variance about equal variances across factor levels. When this is violated, distortions can occur in the discriminant functions and the classification equations. For example, the discriminant functions may not provide maximum separation between groups when the covariances are unequal. If the covariance matrices are unequal but the input fields' distribution is multivariate normal, the optimum classification rule is the quadratic discriminant function. But if the matrices are not too dissimilar, the linear discriminant function performs quite well, especially when the sample sizes are small. This assumption can be tested with the Explore procedure or with the Box's M statistic, displayed by Discriminant.


For a more detailed discussion of problems with assumption violation, see P. A. Lachenbruch (Discriminant Analysis. New York: Hafner, 1975) or Carl Huberty (Applied Discriminant Analysis. New York: Wiley, 1994).

10.6 Analysis Tips 

In addition to the assumptions of discriminant, some additional guidelines are helpful. Many analysts would recommend having at least 10 to 20 times as many cases as predictor fields to ensure that a model doesn't capitalize on chance variation in a particular sample. For accurate classification, another common rule is that the number of cases in the smallest group should be at least five times the number of predictors. In the interests of parsimony, Huberty recommends a goal of only 8 to 10 predictor fields in the final model. Although in applied work this may be too stringent, keep in mind that more is not always better.

Outlying cases can affect the results by biasing the values of the discriminant function coefficients. Looking at the Mahalanobis distance for a case or examining the probabilities is normally an effective check for outliers. If a case has a relatively high probability of being in more than one group, it is difficult to classify. Analyses can be run with and without outliers to see how results are affected.

Multicollinearity is less of a problem in Discriminant Analysis because the exact effect of a predictor field is typically not the focus of an analysis. When two fields are highly correlated, it is difficult to partition the variance between them, and the coefficient estimates are often unstable. Still, the accuracy of prediction may be little affected. Multicollinearity can be more of a problem when stepwise methods of field selection are used, since fields can be removed from a model for reasons unrelated to that field's ability to separate the groups.

10.7 Comparison of Discriminant and Logistic Regression

Discriminant and logistic regression have the same broad purpose: to build a model predicting which category (or group) individuals belong to, based on a set of interval-scale predictors. Discriminant formally makes stronger assumptions about the predictors, specifically that for each group they follow a multivariate normal distribution with identical population covariance matrices. On this basis you would expect discriminant to be rarely used, since this assumption is seldom met in practice. However, Monte Carlo simulation studies indicate that multivariate normality is not critical for discriminant to be effective.

Discriminant follows from a view that the domain of interest is composed of separate populations, each of which is measured on variables that follow a multivariate normal distribution. Discriminant attempts to find the linear combinations of these measures that best separate the populations. This is represented in Figure 10.1. The two populations are best separated along an axis (discriminant function) that is a linear combination of x and y. The midpoint between the two populations is the cut-point. This function and cut-point would be used to classify future cases.

Logistic regression, as we have seen in Lesson 9, is derived from a view of the world in which individuals fall more along a continuum.

This difference in formulation led discriminant to be employed in credit analysis (there are those who repay loans and those who don't), while logistic regression was used to make risk adjustments in medicine (depending on demographics, health characteristics and treatment, you are more or less likely to survive a disease). Despite these different origins, discriminant and logistic give very similar results in practice. Monte Carlo simulation work has not found one to be superior to the other over


very general circumstances. There is, of course, the obvious point that if the data are based on samples from multivariate normal populations, then discriminant outperforms logistic regression.

One consideration when choosing between these two methods involves how many dichotomous predictor fields (or dummy-coded nominal or ordinal fields) are used in the analysis. Because of the stronger assumptions made about the predictors by discriminant, the more categorical fields you have, the more you would lean toward logistic regression. Within the domain of response-based segmentation, discriminant analysis is seen more on the business side, while if the problem is formulated from a marketing perspective as a choice model, logistic models are more common.

Note that neither discriminant nor logistic will produce a list of groups more or less associated with various target categories. Rather, they will indicate which predictor fields (some may represent demographic characteristics) are relevant to the category. From the prediction equation or other summary measures you can determine the combinations of characteristics that most likely lead to the desired target category.

Recommendations

Logistic regression and discriminant analysis give very similar results in practice. Since discriminant does make stronger assumptions about the nature of your predictors (formally, multivariate normality and equal covariance matrices are assumed), as more of your predictor fields are categorical (and thus need to be dummy coded) or dichotomous, you would move in the direction of logistic regression. Certain research areas have a tradition of using only one of the methods, which may also influence your choice.

10.8 An Example: Discriminant

To demonstrate discriminant analysis we use data from a study in which respondents indicated whether they would accept an interactive news subscription service (via cable). Most of the predictor fields are continuous in scale, the exceptions being GENDER (a dichotomy) and INC (an ordered categorical field). We would expect few if any of these to follow a normal distribution, but will proceed with discriminant.

Note that the predictor fields for discriminant must be numeric, although they can represent categories. Most importantly, if you have predictors that are truly categorical, such as region of the U.S. (e.g., northwest, southwest, etc.), even with numeric coding, Discriminant will not create dummy variables/fields for these categories. You will need to create dummy variables yourself (use the SetToFlag node), and then enter the dummy variables in the model, leaving one out so as not to create redundancy. In the current example we don't face this issue.
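The dummy coding that SetToFlag produces can be mimicked by hand. This sketch (in Python, outside of Modeler, purely for illustration) dummy-codes a hypothetical region field, dropping one category as the reference to avoid redundancy.

```python
# Dummy-code a nominal field by hand, leaving one category out
# as the reference level (the invented field and values are illustrative).
regions = ["northwest", "southwest", "northeast", "southwest"]
categories = sorted(set(regions))   # ['northeast', 'northwest', 'southwest']
reference = categories[0]           # drop 'northeast' as the reference level

dummies = [
    {c: int(r == c) for c in categories if c != reference}
    for r in regions
]
print(dummies[0])  # {'northwest': 1, 'southwest': 0}
```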

As in our other examples, we will move directly to the analysis, although ordinarily you would run data checks and exploratory data analysis first.

Click File…Open Stream and then move to the c:\Train\ModelerPredModel folder
Double-click on Discriminant.str
Right-click on the Table node and select Run to view the data


Figure 10.2 The Interactive News Study Data

Place a Discriminant node from the Modeling palette to the right of the Type node
Connect the Type node to the Discriminant node

The name of the Discriminant node will immediately change to NEWSCHAN, the target field.

Figure 10.3 Discriminant Node Added to the Stream

Double-click on the Discriminant node
Click on the Model tab


Figure 10.4 Discriminant Dialog

The Use partitioned data option can be used to split the data into separate samples for training and testing. This may provide an indication of how well the model will work with new data. We will not use this option in this example, but will instead take advantage of a different option for validating the model (Leave-one-out classification) that is built into the Discriminant procedure.

The Build model for each split option enables you to use a single stream to build separate models for each possible value of a field whose role is set to Split in the Type node. All the resulting models will be accessible from a single model nugget.

The Method option allows you to specify how you want the predictors entered into the model. By default, all of the terms are entered into the equation. If you do not have a particular model in mind, you can invoke the Stepwise option, which will enter predictors into the equation based on a statistical criterion. At each step, terms that have not yet been added to the model are evaluated, and if the best of those terms adds significantly to the predictive power of the model, it is added. Some analysts prefer to enter all the predictor fields into the equation and then evaluate which are important. However, if there are many correlated predictor fields, you run the risk of multicollinearity, in which case a Stepwise method may be preferred. A drawback is that the Stepwise method has a strong tendency to overfit the training data. When using this method, it is especially important to verify the validity of the resulting model with a hold-out test sample or new data (which is common practice in data mining).

Click on the Method button and select Stepwise


Figure 10.5 Discriminant Analysis with Method Stepwise

Click on the Expert tab
Click on Expert mode

Figure 10.6 Discriminant Expert Options

You can use the Prior Probabilities area to provide Discriminant with information about the distribution of the target in the population. By default, before examining the data, Discriminant assumes an observation is equally likely to belong to each group. If you know that the sample proportions reflect the distribution of the target in the population, then you can use the Compute from group sizes option to instruct Discriminant to make use of this information. For example, if a target


category is very rare, Discriminant can make use of this fact in its prediction equation. We don't know what the proportions would be, so we retain the default.

The Use covariance matrix option is often useful whenever the homogeneity of variance assumption is not met. In general, if the groups are well separated in the discriminant space, heterogeneity of variance will not be terribly important. However, in situations where you do violate the equal variance assumption, it may be useful to use the Separate-groups covariance matrices option to see whether your predictions change by very much. If they do, that would suggest that the violation of the equal variance assumption was serious. It should be noted that using separate-groups covariance matrices does not affect the results prior to classification, because PASW Modeler does not use the original scores to do the classification. Thus, the use of the Fisher classification functions is not equivalent to classification by PASW Modeler with separate covariance matrices.

Click the Output button

Figure 10.7 Discriminant Advanced Output Dialog

Checking Univariate ANOVAs will have PASW Modeler display significance tests of between-group (target categories) differences on each of the predictors. The point of this is to provide some hint as to which fields will prove useful in the discriminant function, although this is precisely what discriminant will resolve. The Box's M statistic is a direct test of the equality of covariance matrices. The covariance matrices are ancillary output and very rarely viewed in practice. However, you might view the within-groups correlations among the predictors to identify highly correlated predictors.

Either Fisher's coefficients or the unstandardized discriminant coefficients can be used to make predictions for future observations (customers). Both sets of coefficients produce the same predictions


when equal covariance matrices are assumed. If there are only two target categories (as is our situation), either set of coefficients is easy to use. If you want to try "what if" scenarios using a spreadsheet, the unstandardized coefficients, which involve a single equation in the two-category case, are more convenient. If you run discriminant with more than two target categories, then Fisher's coefficients are easier to apply as prediction rules.

Casewise results can be used to display the codes for the actual group, predicted group, posterior probabilities, and discriminant scores for each case. The Summary table, also known by several other names, including Classification table, Misclassification table, and Confusion table, displays the number of cases correctly and incorrectly assigned to each of the groups based on the discriminant analysis. The Leave-one-out classification classifies each case based on discriminant coefficients calculated while that case is excluded from the analysis. This is a jackknife method and provides a classification table that should generalize at least slightly better to other samples. You can also produce a Territorial map, which is a plot of the boundaries used to classify cases into groups, but the map will not be displayed if there is only one discriminant function (the maximum number of functions is equal to the number of categories in the target field minus 1).

The Stepwise options allow you to display a summary of statistics for all fields after each step.

Click the Means, Univariate ANOVAs, and Box's M check boxes in the Descriptives area
Click the Fisher's and Unstandardized check boxes in the Function Coefficients area
Click the Summary table and Leave-one-out classification check boxes in the Classification area

Figure 10.8 Discriminant Advanced Output dialog after Option Selection


Click OK
Click the Stepping button

Figure 10.9 Stepping Dialog

Wilks' lambda is the default and probably the most common method. The differences between the methods are somewhat technical and beyond the scope of this course. You can change the statistical criterion for field entry. For example, you might want to make the criterion more stringent when working with a large sample.

Click Cancel
Click the Run button
Browse the Discriminant generated model in the Models Manager window
Click the Advanced tab, and then expand the browsing window
Scroll to the Classification Results

Figure 10.10 Classification Results Table

Although this table appears at the end of the discriminant output, we turn to it first. It is an important summary since it tells us how well we can expect to predict the target. The actual (known) groups constitute the rows and the predicted groups make up the columns. Of the 227 people surveyed who said they would not accept the offering, the discriminant model correctly predicted 157 of them; thus


accuracy for this group is 69.2%. For the 214 respondents who said they would accept the offering, 66.4% were correctly predicted. Overall, the discriminant model was accurate in 67.8% of the cases.

Is this good? Will this model work well with new data? The answer to the first question will largely depend on what level of predictive accuracy you required before you began the project. One way we can assess the success of the model is to compare these results with the predictions we would have made if we simply guessed the larger group. If we did that, we would be correct in 227 of 441 (227 + 214) instances, or about 51.5% of the time. The 67.8% correct figure, while certainly far from perfect accuracy, does far better than guessing. The Cross-validated portion of the table gives us an idea about how accurate this model will be with new data. The percent of correctly classified cases has decreased only slightly, from 67.8% to 67.3%, for the cross-validation. Because these results are virtually identical, it appears the model is valid.
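The percentages quoted above can be checked directly from the table's counts. One assumption: the count of 142 correct "accept" predictions is inferred here from the reported 66.4% of 214, since the table itself is not reproduced in the text.

```python
# Verify the classification-table arithmetic reported in the text.
correct = {"no": 157, "yes": 142}   # 142 inferred from 66.4% of 214
totals = {"no": 227, "yes": 214}

per_group = {g: correct[g] / totals[g] for g in totals}
overall = sum(correct.values()) / sum(totals.values())
baseline = max(totals.values()) / sum(totals.values())  # always guess larger group

print(round(100 * per_group["no"], 1))  # 69.2
print(round(100 * overall, 1))          # 67.8
print(round(100 * baseline, 1))         # 51.5
```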

Since we are interested in discovering which characteristics are associated with someone who accepts the news channel offer, we proceed.

Scroll back to the Group Statistics pivot table

Figure 10.11 Group Statistics

Viewing the means by themselves is of limited use, but notice that the group that would accept the service is about 7 years older than the group that would not accept, whereas the daily hours of TV viewing are almost identical for the two groups. The standard deviations are very similar across groups, which is promising for the equal covariance matrices assumption.

Scroll to the Tests of Equality of Group Means pivot table


Figure 10.12 Univariate F Tests

The significance tests of between-group differences on each of the predictor fields provide hints as to which will be useful in the discriminant function (recall we are using Wilks' criterion as a stepwise method). Notice that Age in Years has the largest F (is most significant) and will be selected first in the stepwise solution. This table looks at each field ignoring the others, while discriminant adjusts for the presence of the other fields in the equation (as would regression).

Scroll to the Box’s M test results

Figure 10.13 Box’s M Test Results

Because the significance value is well above 0.05, we fail to reject the null hypothesis that the covariance matrices are equal. However, the Box's M test is quite powerful and leads to rejection of equal covariances when the ratio N/p is large, where N is the number of cases and p is the number of fields. The test is also sensitive to lack of multivariate normality, which applies to these data. If the variances were unequal, the effect on the analysis would be to create errors in the assignment of cases to groups.

Scroll to the Eigenvalues and Wilks’ Lambda portion of the output


Figure 10.14 Summaries of Discriminant Function (Eigenvalues and Wilks’ Lambda)

These two tables are overall summaries of the discriminant function. The canonical correlation measures the correlation between a field (or fields, when there are more than two groups) contrasting the groups and an optimal (in terms of maximizing the correlation) linear combination of the predictors. In short, it measures the strength of the relationship between the predictor fields and the groups. Here, there is a modest (.363) canonical correlation.

Wilks' lambda provides a multivariate test of group differences on the predictors. If this test were not significant (it is highly significant), we would have no basis on which to proceed with discriminant analysis. Now we view the individual coefficients.

Scroll down until you see the Standardized Coefficients and Structure Matrix 

Figure 10.15 Standardized Coefficients and Structure Matrix

Standardized discriminant coefficients can be used as you would use standardized regression coefficients, in that they attempt to quantify the relative importance of each predictor in the discriminant function. The only three predictors that were selected by the stepwise analysis were


Education, Gender and Age. Not surprisingly, age is the dominant factor. The signs of the coefficients can be interpreted with respect to the group means on the discriminant function. Notice that the coefficient for gender is negative. Other things being equal, shifting from a man (code 0) to a woman (code 1) results in a one-unit change, which, when multiplied by the negative coefficient, will lower the discriminant score and move the individual toward the group with a negative mean (those who don't accept the offering). Thus women are less likely to accept the offering, adjusting for the other predictors.

The Structure Matrix displays the correlations between each field considered in the analysis and the discriminant function(s). Note that income category correlates more highly with the function than gender or education, but it was not selected in the stepwise analysis; this is probably because income correlated with predictors that entered earlier. The standardized coefficients and the structure matrix provide ways of evaluating the discriminant fields and the function(s) separating the groups.

Scroll down until the Canonical Discriminant Function Coefficients and Functions at Group Centroids are visible

Figure 10.16 Unstandardized Coefficients and Group Means (Centroids)

In Figure 10.1 we saw a scatterplot of two separate groups and the axis along which they could be best separated. Unstandardized discriminant coefficients, when multiplied by the values of an observation, project an individual onto this discriminant axis (or function) that separates the groups. If you wish to use the unstandardized coefficients for prediction purposes, you would simply multiply a prospective customer's education, gender and age values by the corresponding unstandardized coefficients and add the constant. Then you compare this value to the cut-point (by default the midpoint) between the two group means (centroids) along the discriminant function (the cut-point appears in Figure 10.1). If the prospective customer's value is greater than the cut-point you predict


the customer will accept; if the score is below the cut-point you predict the customer will not accept. This prediction rule is also easy to implement with two groups, but involves much more complex calculations when more than two groups are involved. It is in a convenient form for "what if" scenarios: for example, if we have a male with 16 years of education, at what age would such an individual be a good prospect? To answer this we determine the age value that moves the discriminant score above the cut-point.

Scroll down until you see the Classification Function Coefficients 

Figure 10.17 Fisher Classification Coefficients

Fisher function coefficients can be used to classify new observations (customers). If we know a prospective customer's education (say 16 years), gender (Female = 1) and age (30), we multiply these values by the set of Fisher coefficients for the No (no acceptance) group (2.07*16 + 1.98*1 + .32*30 - 20.85), which yields a numeric score. We repeat the process using the coefficients for the Yes group and obtain another score. The customer is then placed in the target group for which she has the higher score. Thus the Fisher coefficients are easy to incorporate later into other software (spreadsheets, databases) for predictive purposes.
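The scoring rule just described can be sketched outside Modeler, for example in Python. The No-group coefficients below are the ones quoted in the text; the Yes-group coefficients are placeholders, since the values in Figure 10.17 are not reproduced here.

```python
# Fisher classification: score the case under each group's function,
# then assign the case to the group with the higher score.
fisher = {
    "no":  {"educ": 2.07, "gender": 1.98, "age": 0.32, "const": -20.85},  # from text
    "yes": {"educ": 2.10, "gender": 1.50, "age": 0.40, "const": -24.00},  # hypothetical
}

case = {"educ": 16, "gender": 1, "age": 30}  # the example customer from the text

def fisher_score(coef, case):
    return coef["const"] + sum(coef[f] * case[f] for f in case)

scores = {g: fisher_score(c, case) for g, c in fisher.items()}
predicted = max(scores, key=scores.get)
print(round(scores["no"], 2))  # 2.07*16 + 1.98*1 + 0.32*30 - 20.85 = 23.85
```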

We did not test the normality assumptions of discriminant analysis in this example. In general, normality does not make a great deal of difference, but heterogeneity of the covariance matrices can, especially if the group sample sizes are very different. Here the sample sizes were about the same.

As mentioned earlier, whether you consider the hit rate here to be adequate really depends on the costs of errors, the benefits of a correct prediction, and what your alternatives are. Here, although the prediction was far from perfect, we were able to identify the relations between the demographic fields and the choice.


Summary Exercises

The exercises in this lesson use the data file credit.sav. The following table provides details about fields in the file.

Credit.sav has the same fields as risktrain.txt, except that they are all numeric so that we can use them all in a Discriminant Analysis. The file contains the following fields:

ID        ID number
AGE       Age
INCOME    Income
GENDER    Gender
MARITAL   Marital status
NUMKIDS   # of dependent children
NUMCARDS  # of credit cards
HOWPAID   Paid M/Wkly
MORTGAGE  Mortgage
STORECAR  # of store cards held
LOANS     # of other loans
RISK      Credit risk category
INCOME1K  Income in thousands of British pounds (field derived within PASW Modeler)

1.  Begin with a clear Stream canvas. Place a Statistics File source node on the canvas and connect it to Credit.sav. 

2.  Attach a Type node to the Source node, and a Table node to the Type node. Run the Table node and allow PASW Modeler to automatically type the fields.

3.  Attach a SetToFlag node to the Type node and create separate dummy fields for each category of the marital field. Make sure that you code the True value as 1 and the False value as 0. This is important because Discriminant expects numeric data for the inputs.

4.  Attach a Type node to the SetToFlag node.

5.  Edit the second Type node and change the role for risk to Target, and to None for id, marital, income1k, and marital_3 (or a reference field of your choice). Leave the role as Input for all the rest of the fields.

6.  Use a Distribution node to examine the distribution of risk .

7.  Attach a Discriminant node to the second Type node and run the analysis. How many classification functions are significant? What fields are important predictors?

8.  How accurate is the model as a whole? On which category is it more accurate?


  BAYESIAN NETWORKS

11-1

Lesson 11: Bayesian Networks

Objectives

•  The Basics of Bayesian Networks
•  Types of Bayesian Networks in PASW Modeler
•  Creating models with the Bayes Net node
•  Modifying Bayes Network Model Settings

Data

We will use the dataset churn.txt that we have employed in several previous lessons. This data file contains information on 1477 customers of a telecommunication company who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers, and voluntary leavers. In this lesson, we use a Bayes Net to predict group membership. A Partition node will be used to split the data.

11.1 Introduction
Bayesian analysis has been introduced to data mining with Bayesian networks, which are graphic representations of the probabilistic relationships among a set of fields. These networks are very general and can be used to represent causal relationships, can have multiple target fields, and often allow an analyst to specify the existence (or non-existence) of certain relationships using domain knowledge and experience.

The Bayes Net node provides the ability to use two different types of Bayesian networks to predict a categorical target. Bayes Net can use predictors on any scale, but continuous (Range) fields will be automatically binned into five groups. In theory a Bayes Net can use many predictors, but since every field will be categorical, cells with low or zero counts are more likely, especially if some categorical predictors have many categories. This is less of an issue with very large data files.

Bayes Nets are an alternative to other methods of prediction for categorical targets, including decision trees, neural nets, logistic/multinomial regression, or SVM models. Unlike many other PASW Modeler models, a graphical depiction of the model in the form of a Bayesian network is available in the generated model to further model understanding, although there is no predictive equation with coefficients for individual predictors as in some other models.

The Bayes Net node is included in the Classification module.

11.2 The Basics of Bayesian Networks
Bayesian analysis is an area of statistics that is based on a different approach to probability than frequentist statistics, which is, for example, the standard approach used to calculate the probability values for a t-test. The frequentist approach defines probability as the limit of an outcome’s relative frequency in a large number of trials, and it assumes that a priori knowledge plays no role in determining probability. In contrast, Bayesian statistics incorporate prior knowledge or belief about an event or outcome, so that one has both prior and posterior probabilities.

Bayesian analysis, and Bayes’ theorem, on which it is based, is named after the Reverend Thomas Bayes, who studied how to compute a distribution for the parameter of a binomial distribution. There are several ways to state Bayes’ theorem. If we wish to test a hypothesis H that is conditional on evidence from some Data, then one general statement of Bayes’ theorem is:

P(H|Data) = P(Data|H) * P(H) / P(Data),

where P(H|Data) means the probability of H given the Data.

The issue of prior probabilities enters because P(H) is the prior probability of H given no other information, i.e., without the data collected for our study. This probability can be subjective, or it can be based on more objective prior knowledge, such as the proportion of persons who buy a new refrigerator in a year (for a model where we are trying to predict who will buy a new refrigerator).
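As a minimal numeric sketch of the theorem using the refrigerator scenario, with every probability invented for illustration:

```python
# Bayes' theorem: P(H | Data) = P(Data | H) * P(H) / P(Data).
# H = "customer buys a refrigerator this year"; Data = "customer browsed
# the appliance pages". All three input probabilities are hypothetical.
p_h = 0.10               # prior: P(buy)
p_data_given_h = 0.60    # likelihood: P(browsed | buy)
p_data = 0.20            # evidence: P(browsed)

p_h_given_data = p_data_given_h * p_h / p_data
print(round(p_h_given_data, 4))   # posterior P(buy | browsed) = 0.3
```

Observing the evidence triples the probability of a purchase relative to the prior in this made-up example.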

We won’t work through the theorem in detail here; you can find many good worked examples on websites and in elementary texts on Bayesian statistics. You also don’t really need to understand Bayes theorem to use the Bayes Net node or its output. A portion of the output is a joint probability table, but that is really nothing other than a bivariate or multi-way crosstabulation of the fields that are found to be dependent (because values of one field depend on or are related to another, although this dependence does not imply causality: correlation does not equal causality, as we have all been taught).

Otherwise, the output fields from a Bayes Net model are similar to those from other models and include a prediction and the probability of that prediction.

A Bayesian network is a graphical model based on a directed acyclic graph (DAG). First, a directed graph is shown in Figure 11.1 for comparison. Directed graphs are composed of vertices or nodes (the circles) that represent fields in a model, and arrows between the nodes that are called variously arcs, arrows, or directed edges.

Figure 11.1 Simple Example of a Directed Graph

In comparison, a directed acyclic graph is shown in Figure 11.2. Here, for any node n, there is no path, following the arrows, that begins and ends on n. You can try that for any of the nodes in the graph.
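The acyclicity property is easy to check programmatically. The sketch below, with made-up node names, follows the arrows depth-first and reports whether any path returns to its starting node:

```python
# Minimal cycle check for a directed graph given as (parent, child) arc pairs.
# A graph is acyclic exactly when no depth-first path revisits a node that is
# still on the current path (a "back edge").
def is_acyclic(edges):
    children = {}
    nodes = set()
    for a, b in edges:
        children.setdefault(a, []).append(b)
        nodes.update((a, b))

    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / finished
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GRAY
        for m in children.get(n, []):
            if color[m] == GRAY:          # back edge: a path begins and ends on m
                return False
            if color[m] == WHITE and not dfs(m):
                return False
        color[n] = BLACK
        return True

    return all(dfs(n) for n in nodes if color[n] == WHITE)

print(is_acyclic([("A", "B"), ("B", "C"), ("A", "C")]))   # True: a DAG
print(is_acyclic([("A", "B"), ("B", "A")]))               # False: A -> B -> A
```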


Figure 11.2 Directed Acyclic Graph to Predict RESPONSE

A Bayesian network is a model that represents a set of data with a directed acyclic graph and that uses that information to make predictions. Nodes that are connected have probabilistic dependencies. Nodes that are not connected (broadly speaking) are conditionally independent, which means that these other nodes add no more information to the relationship, given the nodes that are interconnected (more about this below). So in the graph in Figure 11.2, the field VISITB is conditionally independent of ORIVISIT.

A Bayesian network can display causal relationships between nodes with the arcs and arrows. However, the networks constructed by the Bayes Net node are not designed to represent causal relationships, for several important reasons. In data mining, more emphasis is placed on the ability of a model to make accurate predictions than on representing causal influences, i.e., the effect of field A on outcome C is direct and also indirect through field B; the networks constructed by the Bayes Net node are optimized for prediction. Second, software by itself, despite any claims otherwise, cannot successfully find causal relationships without user input. That is why in structural equation modeling the user must set up the structure of the model and then test whether the data support that structure or model. Finally, data mining problems often incorporate many potential predictors, making specification of causal links more and more complex.

The end result of these points is that it is possible to glean information from a network in PASW Modeler, but you need to be cautious when doing so and not over-interpret the model.

Bayesian networks in general are often resistant to problems caused by missing data, and they can make predictions for cases with missing data. However, the Bayes Net node by default uses listwise deletion, where any missing data causes it to delete a case from analysis. Why this is so and how it affects model-building is explained with an example below.

Bayesian networks as implemented in PASW Modeler are designed to use only categorical data, for which probability statements can be readily constructed. This means that only categorical targets can be used. If a continuous predictor is used, it will be binned into five roughly equally-spaced bins. This may not always be appropriate for skewed or other non-symmetrical distributions. If you have predictors like that, you may wish to manually bin these fields using a Binning node before the Bayesian Network node. For example, you could use Optimal Binning where the Supervisor field is the same as the Bayesian Network node Target field.


11.3 Types of Bayesian Networks in PASW Modeler
The Bayes Net node provides two types of Bayesian networks. To understand them, it helps to first discuss a Naïve Bayes network.

In Figure 11.3 this type of network is displayed. There is a target field A and a set of predictors B, C, and D. A is a parent node of the other nodes, and nodes B, C, and D are therefore child nodes of A. This is reminiscent of the graphical view of a decision tree, but you should not try to equate the two. Although we are attempting to predict A, the arcs point toward the predictors. This is a consequence of Bayes theorem, where the prior probability of the data, given the outcome, is included in the numerator of the equation. This probability is represented by the arrows flowing away from A, the target.

Figure 11.3 Naïve Bayes Network 

Of course, we include fields that are meaningful predictors of the target, so these arrows shouldn’t be confusing. For example, if we want to predict customers who will make a second purchase from an online retailer, we can include such things as income, gender, and prior purchase behavior. All of those will influence a second purchase, but not the reverse.

The other key characteristic of a Naïve Bayes network is that there are no links or dependencies between the predictors. This is the simplest possible network.

With this as background, we can now consider the two networks available in the Bayes Net node.

Tree-Augmented Naïve Bayes (TAN). This type of network extends Naïve Bayes by allowing each predictor to depend on one other predictor in addition to the target field. Again, this dependence is not necessarily causal dependence but simply probabilistic dependence given the data at hand. Figure 11.2 shows a TAN network, where you can see that no predictor has more than two arrows pointing toward it, where all the arrows point away from the target RESPONSE, and where one predictor (ORISPEND) has no dependency on other predictors.

The conditional probability tables produced by the Bayes Net node will reflect this structure, so a table for VISITB will include RESPONSE and SPENDB.

Markov Blanket. This type of network selects the set of nodes in the dataset that contains the target’s parents, its children, and its children’s parents. This is illustrated in Figure 11.4, where once again the target field is RESPONSE. There were many more potential predictors available than are displayed in the network, but once a Markov Blanket has been defined, the target node is conditionally independent of all other nodes (predictors), and so those predictors are not used in the network (model). Essentially, a Markov blanket identifies all the fields that are needed to predict the target.

 Notice that arrows can go both to and from the target field in a Markov Blanket.
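The definition of a Markov blanket (parents, children, and the children’s other parents) can be computed directly from a list of arcs. The small graph below is hypothetical, not the churn network:

```python
# Markov blanket of a target node in a DAG given as (parent, child) arc pairs:
# the target's parents, its children, and its children's other parents.
def markov_blanket(target, edges):
    parents = {a for a, b in edges if b == target}
    children = {b for a, b in edges if a == target}
    spouses = {a for a, b in edges if b in children and a != target}
    return (parents | children | spouses) - {target}

# Hypothetical graph: A -> T, T -> B, C -> B, and an unrelated arc D -> E.
edges = [("A", "T"), ("T", "B"), ("C", "B"), ("D", "E")]
print(sorted(markov_blanket("T", edges)))   # ['A', 'B', 'C']; D and E fall outside
```

Given the blanket {A, B, C}, the target T is conditionally independent of everything else in the graph, which is why the remaining fields can be dropped from the model.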

This type of network should, all things being equal, be more accurate than a TAN, especially with a large number of fields. However, with large datasets the processing time will be significantly greater. To reduce the amount of processing, you can use the Feature Selection options on the Expert tab to have PASW Modeler use only the fields that have a significant bivariate relationship to the target. As before, arrows from the target to another field don’t indicate causal influence in that direction.

Figure 11.4 Example of a Markov Blanket Network 

You now understand the basics of a Bayesian network and the types of networks produced with the Bayes Net node. We can begin using Bayesian networks to predict customer churn.

11.4 Creating a Bayes Network Model
We will use the churn data file that we have used in several other lessons. This will allow comparison to these other techniques.

Click File…Open Stream and move to the c:\Train\ModelerPredModel folder
Double-click on Bayes Net.str
Run the Table node
Close the Table window
Edit the Type node


Figure 11.5 Type Settings for Churn Data

All available input fields will be used (with the exception of ID). The field CHURNED has three categories.

Close the Type window

Edit the Bayes Net node named CHURNED 

There are two types, or structures, of networks available, as explained above.

If you have many fields, you may wish to include a first step of feature selection that will reduce the number of inputs. This option can be turned on with the Include feature selection preprocessing step check box.


Figure 11.6 Model Tab in Bayes Net Node

Recall that the probability being modeled in a Bayesian network comprises a series of tables, and so there can be a significant fraction of cells with small or even zero cell counts. This can pose a computational difficulty; in addition, there is a danger of overfitting the model. The Bayes adjustment for small cell counts check box reduces these problems by applying smoothing to reduce the effect of any zero counts.

If a model has previously been trained, the results shown on the model nugget Model tab are regenerated and updated each time the model is run if you select the Continue training existing model check box. You would do this when you have added new or updated data to an existing stream with a model.

Click Expert tab
Click Expert options button

Missing Values in Bayes Net Models

The default option for a Bayes Net is to use only complete records (the Use only complete records check box). This is equivalent to standard listwise deletion, so if a record has a missing value for any field, that record won’t be used in creating a model (or in scoring from an existing model). If this option is unchecked, the Bayes Net will do the equivalent of pairwise deletion, using as much information as possible.


However, as with any algorithm that uses pairwise deletion, at least two issues become salient. First, the number of cases used for the analysis becomes ill-defined. This may not be critical for most data-mining projects, but you should be aware of the issue. Perhaps more important, the estimates of the model can be unstable and affected by small changes in the data. This could make model validation more difficult.

If there is a significant amount of missing data, you may wish to estimate/impute some of the missing data values, although this raises its own complications.

Computationally, the best solution is to use listwise deletion, but that is only ideal when missing data is a small percentage of the file.
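For reference, listwise deletion itself is simple. The plain-Python sketch below, with made-up records, drops any record containing a missing value on any field:

```python
# Listwise deletion: keep only records with no missing (None) values anywhere.
# Records and field names are hypothetical.
records = [
    {"LONGDIST": 10.2, "SEX": "M",  "CHURNED": "Current"},
    {"LONGDIST": None, "SEX": "F",  "CHURNED": "Vol"},
    {"LONGDIST": 3.5,  "SEX": None, "CHURNED": "Current"},
    {"LONGDIST": 7.1,  "SEX": "F",  "CHURNED": "Invol"},
]

complete = [r for r in records if all(v is not None for v in r.values())]
print(len(records), len(complete))   # 4 2: half the records survive
```

Note how quickly the usable sample shrinks: a single missing field discards the whole record, which is why listwise deletion is only attractive when missingness is rare.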

Other Bayes Net Expert Options

The algorithm for creating a Markov Blanket structure uses conditioning sets of increasing size to carry out independence testing and remove unnecessary links from the network (to find parents and children of the target field). This can be especially useful when processing data with strong dependencies among many fields. The default setting for Maximal conditioning set size is 5.

Because tests involving a high number of conditioning fields require more time and memory for processing, you can limit the number of fields to be included. If you reduce the maximum conditioning set size, though, the resulting network may contain some superfluous links. You can also use this setting with a TAN network by requesting feature selection in the Model tab.

The Feature Selection area is available for Markov Blanket models, or for TAN models when feature selection is turned on. You can use this option to restrict the maximum number of inputs used when processing the model in order to speed up model building. If feature selection is turned on, the default maximum number of fields to be used in the network is 10. If there are important fields that should be used in a network, you can specify them in the Inputs always selected box.

The Bayes Net node conducts tests of independence on two-way and larger tables to construct the network. A likelihood ratio test is used by default, but you can request that a standard Pearson chi-square be used instead. The significance level of the test can be set, but only if feature selection or a Markov Blanket network is requested.


Figure 11.7 Expert Options for Bayes Net

At this point we won’t change any defaults.

Click Run

After the model runs:

Right-click and Browse the generated Bayes Net model 


Figure 11.8 Bayes Net Model Browser for TAN Model

As with most predictive models, predictor importance is included in the right half of the model browser window. The most important predictors are clearly SEX, LONGDIST, and International (you might want to compare this to other models that we developed to predict CHURNED with these data, such as the decision trees in Lesson 3).

The actual Bayesian TAN model is displayed in the left half of the model browser. The network graph of nodes displays the relationship between the target and its predictors, as well as the relationships between the predictors. The importance of each predictor is shown by the density of its color; a darker blue color shows a more important predictor. The target CHURNED has a red node.

You can use the mouse to drag nodes around the graph to more easily view relationships.

Click on the node for CHURNED and drag it more into the center of the network (see Figure11.9) 


Figure 11.9 Bayesian Network Graph

There are several things to notice about the network diagram.

There is a path from CHURNED to every input field. The arrows all point away from CHURNED even though it is the target field. This makes CHURNED a parent of all the input nodes. These facts are simply a consequence of how a TAN is defined and don’t mean that somehow churn status is affecting the input fields. The arrows do indicate which fields will be included in conditional probability tables, as we will see shortly.

Second, a TAN network allows paths between a predictor and one other predictor (plus the connection with the target). You can see this if you examine the network closely; no predictor has more than two arrows going toward it.

Third, the links and arrows do have some meaning, but not causal influence. For example, there is an arrow from LOCAL to Est_Income. Since the average number of minutes of local phone service isn’t going to affect one’s income, the direction of the arrow doesn’t indicate causality, but instead a conditional dependency or interrelatedness. Or consider the paths going from LONGDIST to International, Car_Owner, and LOCAL. The arrows between LONGDIST and the other two measures of phone service usage probably do indirectly indicate something meaningful, but not cause and effect. Instead, the arrows are a sign that there are probabilistic dependencies among these fields. From our understanding of the data, we might conclude that these dependencies exist because of different groups of customers who have similar phone use patterns. For example, one group could be customers who make local calls but not many long distance calls; another group could be those who make lots of long distance and international calls.

As mentioned earlier, the Bayes Net node bins continuous predictors into five categories, splitting the range into five equally-spaced groups. You can view the bin values by hovering with the mouse over a node for a continuous predictor.
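The equal-width binning described here can be sketched in a few lines. The cut points below come from made-up values, not from the churn file:

```python
# Equal-width binning: split the observed range [min, max] into five equally
# spaced groups. Values here are hypothetical incomes.
def five_bins(values):
    lo, hi = min(values), max(values)
    width = (hi - lo) / 5.0
    def bin_of(x):
        # the top edge of the range falls into the last bin (index 4)
        return min(int((x - lo) / width), 4)
    return bin_of

incomes = [0, 15000, 42000, 78000, 100000]
bin_of = five_bins(incomes)
print([bin_of(x) for x in incomes])   # [0, 0, 2, 3, 4]
```

Because the cut points depend only on the minimum and maximum, a few extreme values can crowd most records into one or two bins, which is why manual or optimal binning is suggested for skewed fields.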

Hover the mouse over the node for Est_Income


The first bin runs from 0 to 20,054.807. The last bin contains all customers with incomes above 79,888.377.

Figure 11.10 Distribution of Est_Income

We are now in the Basic view of the network (see the View dropdown in the lower left corner). We can switch to the Distribution view.

Click View dropdown and select Distribution

Figure 11.11 Distribution View of TAN Network 

The Distribution view displays the conditional probabilities for each node in the network as a mini-graph. Bayesian networks work only with categorical data, so the graphs are all bar charts. The simplest one is for the target field, which shows its distribution unrelated to any other field (because the arrows point away from it).

You can hover the mouse pointer over a graph to display its values in a popup ToolTip.

Hover the mouse over the bottom bar in the graph for CHURNED

In Figure 11.12 we have isolated just this portion of the network.

Figure 11.12 Percentage of Customers who are Current from ToolTip

The probabilities for the input nodes are more complicated because most are conditional on the target and another field.

Hover the mouse over the graph for Car_Owner and move it from top to bottom

As you move the mouse over the mini-graph for Car_Owner, you see probabilities listed along with values of CHURNED and LONGDIST. This is because there are arrows from those two fields pointed toward Car_Owner.

We can learn more from viewing the conditional probability table for an input node. When you select a node in either Basic or Distribution view, the associated conditional probability table is displayed in the right half of the model browser. This table contains the conditional probability value for each node category and each combination of values in its parent nodes.

Click on the mini-graph for Car_Owner to select it


Figure 11.13 Conditional Probability Table for Car_Owner

These conditional probabilities are based on the actual data. Thus, if we created a table with CHURNED, Car_Owner, and (binned) LONGDIST, we would find, for example, that of those customers who have the lowest value of LONGDIST (<5.996) and who are current customers (first row in the table), 20% (.20) own a car and 80% (.80) do not (for reference, about 30% of all customers in the Training partition own a car).

However, since we are interested in predicting CHURNED, we can look at this table a bit differently. If we hold LONGDIST constant (looking at the first 3 rows in the table, where LONGDIST <5.996), we can see how car ownership varies by churn status. Customers who are voluntary churners (Vol) are more likely to own a car. Customers who are current are the least likely. It is the use of these probabilities that allows the TAN model to make predictions.

Of course, this conditional probability table only includes two inputs. Since a customer will have a value on all inputs (the default of listwise deletion), there will be many conditional probability distributions that must be taken into account when making a prediction. And that is what the model does, with the help of Bayes theorem to combine probabilities (see the PASW Modeler 14 Algorithms Guide for details).
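As a simplified sketch of that combination step, the naive-Bayes-style calculation below multiplies a prior by per-field conditional probabilities and normalizes. All probabilities are invented, and a real TAN additionally conditions each predictor on one other predictor:

```python
# Simplified combination of conditional probabilities: for a naive structure,
# P(class | inputs) is proportional to P(class) * product of P(input | class).
# Every number below is hypothetical, not taken from the churn model.
priors = {"Current": 0.5, "Vol": 0.3, "Invol": 0.2}
cond = {
    "Car_Owner=Yes": {"Current": 0.20, "Vol": 0.45, "Invol": 0.30},
    "SEX=F":         {"Current": 0.40, "Vol": 0.60, "Invol": 0.50},
}

def predict(evidence):
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for item in evidence:
            score *= cond[item][cls]      # multiply in each conditional table
        scores[cls] = score
    total = sum(scores.values())          # normalize so the posteriors sum to 1
    return {cls: s / total for cls, s in scores.items()}

posterior = predict(["Car_Owner=Yes", "SEX=F"])
print(max(posterior, key=posterior.get))  # Vol, with these made-up numbers
```

Even though "Vol" has a smaller prior than "Current" in this made-up example, the evidence (car ownership and gender) shifts the posterior in its favor, which is exactly how the conditional tables drive predictions.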

All these cells have at least one customer because there are probabilities listed in each one. Other conditional probability tables have zeros because no customer fit that particular pattern of values. This condition will be important in our discussion of the model predictions next.

Close the Bayes Net model browser 


Add an Analysis node to the stream and connect it to the Bayes Net model
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box
Click Run

Figure 11.14 Analysis Node Output for TAN Model

On the Training partition the model is accurate on 78.16% of the cases. It is most accurate on current customers, least accurate on involuntary churners. The model doesn’t do very well at all on the Testing partition and is only accurate on 69.79% of the cases overall.

There are no missing data in this data file, yet something odd appears in the Coincidence Matrix for the Testing data. There is a fourth predicted value of $null$, i.e., missing. Why would the Bayes network predict a missing value or, more accurately, be unable to make a prediction for 13 cases? (The accuracy statistics don’t drop the missing cases; they count them as incorrect predictions.)

The fact that missing predictions appear in the Testing partition but not the Training partition is the tip-off to the cause. As mentioned above, the conditional probabilities are used to make predictions, e.g., the probability of .20 for a customer with low long distance service who owns a car and is a current customer. But if a combination of values (a cell in a table) exists in the Testing data but not in the Training data, the network cannot make predictions for a customer with those characteristics.

Fortunately there are only 13 customers who have a missing predicted value, which is less than 2% of the file. This is probably acceptable. It does illustrate the importance of having a large and varied enough training dataset so that all possible combinations have one or more records.

We’ll next try the other type of network structure, a Markov Blanket, using the default settings otherwise.


Close the Analysis output browser
Edit the Bayes Net modeling node
Click Model tab
Click Markov Blanket
Click Run

After the model has run:

Right-click and Browse the generated Bayes Net model

Figure 11.15 Bayes Net Model Browser for Markov Blanket Model

This model looks very different from the TAN network. First, not all the predictors are used. Second, the arrows go from the inputs to the target field, which is the direction we expect for a causal predictive model; but, as with the TAN model, the arrows should not be used to indicate causal influence. Arrows in a Markov Blanket can point away from the target field.

Third, there are no connections between the inputs. This isn’t always the case in a Markov Blanket, but is more likely than with a TAN network. In fact, this network is equivalent to a Naïve Bayes classifier.

The top three fields on the predictor importance chart are identical to those for the TAN network, although the order is different.

Let’s view the conditional probability table for CHURNED (the tables for the inputs are uninteresting).

Click on the node for CHURNED
Expand the right half of the model browser to view the probability table


Figure 11.16 Conditional Probability Table for CHURNED

The table is very large, so we can’t display it all in the figure above. Because all four inputs have arrows pointing toward CHURNED, its conditional probability table contains all four of these fields.

The first thing we can see is that there are many cells with a probability value of 0, which indicates that there were no customers with that combination of values.

Second, this type of table is easier to think about and use in the context of predicting CHURNED, because we can choose various combinations of values of the inputs and see what the distribution of CHURNED is. So, for example, if we select males who make very few calls in any category (the first row in the table), we see that they are very unlikely to be voluntary churners (.051 probability).

Let’s see how well the Markov Blanket model does at predicting CHURNED.

Close the Bayes Net model browser
Run the Analysis node


Figure 11.18 TAN Network Model with Bayes Adjustment

The model is essentially identical to what we saw before. To see how the Bayes adjustment has affected the network, we need to view a conditional probability table.

Click on the node for LOCAL
Expand the right half of the model browser to view the table


Figure 11.19 Conditional Probability Table for LOCAL

What we can see are many cells that have a gray background. All of these cells have a zero count and so have been given a Bayes adjustment. This means these patterns (cells) in the data can now be estimated.
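The Bayes adjustment behaves like additive (Laplace-style) smoothing: a small pseudo-count is added to every cell, so a zero-count pattern yields a small nonzero probability instead of a $null$ prediction. The sketch below illustrates the idea; the alpha value and counts are hypothetical, not Modeler’s internal values.

```python
# Additive smoothing sketch of the "Bayes adjustment": every category gets a
# small pseudo-count (alpha), so categories never observed in a cell still
# receive a nonzero probability. The counts and alpha here are illustrative.

def smoothed_probs(counts, categories, alpha=0.5):
    total = sum(counts.get(c, 0) for c in categories) + alpha * len(categories)
    return {c: (counts.get(c, 0) + alpha) / total for c in categories}

raw = {"Current": 7, "Vol": 3}          # "InVol" never observed in this cell
probs = smoothed_probs(raw, ["Current", "Vol", "InVol"])
print(probs["InVol"])                    # small but nonzero
```

The gray cells in Figure 11.19 are exactly the ones where the pseudo-count, rather than an observed count, determines the probability.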

We can see this by viewing model predictions with the Analysis node.

Click Annotations tab
Click Custom
Type the text Bayes adjustment – TAN
Click OK
Add this model to the stream by the Type node
Connect the Type node to the Bayes adjustment model
Add an Analysis node to the stream and connect it to the Bayes adjustment model
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) check box
Click Run


Figure 11.20 Analysis Node Output for TAN Model with Bayes Adjustment

The accuracy on the Training partition remains at 78.3% because it is not affected by the Bayes adjustment. But there are now no predictions of $null$ for the Testing partition, so predictions can be made for all these cases. This increases the accuracy on the Testing data to 70.71%.

Using a Bayes adjustment doesn’t guarantee that there will be no missing predictions. In fact, if you run a TAN model with a Bayes adjustment, there will still be 39 cases with a missing prediction. This is because an adjustment can be made for an existing pattern by adding a small amount to a cell with a zero count, but if a pattern is completely missing from the Training data, it still won’t be possible to make a prediction in the Testing data.

As mentioned in an earlier section, using the Bayes adjustment is fine when the amount of missingdata is a small portion of the data file, but when there is a large amount of missing data, another solution should be employed.

At this point, we can continue to use the TAN network but can change the Maximal conditioning set size parameter.

11.5 Modifying Bayes Network Model Settings

As with SVM models, and many other types of models (neural networks, decision trees), finding the best model requires some experimentation with settings. Two key options for a Bayes Net are feature selection and the Maximal conditioning set size.

Close the Analysis output browser
Edit the Bayes Net modeling node


Click Include feature selection preprocessing step check box

Figure 11.21 Requesting Feature Selection for TAN Model

Although we only have about a dozen input fields, to change the Maximal conditioning set size for a TAN model, we need to request feature selection (this is because of how the model is calculated, without taking parent and child nodes into account).

Click Expert tab, then click Expert option button


Figure 11.22 Feature Selection and Maximal Conditioning Set Size Options

We don’t need to specify any inputs to always be in the network, and we’ll leave the Maximum number of inputs at 10, which means that the TAN network can only include 10 of the 12 possible inputs.

The algorithm for creating a Markov Blanket structure uses conditioning sets of increasing size to carry out independence testing and remove unnecessary links from the network. The TAN network can also use a conditioning set to do feature selection. The higher the value for Maximal conditioning set size, the more time and memory is required for processing, but a higher value can be especially useful when the data have strong dependencies among many fields.
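The time and memory cost comes from the number of candidate conditioning sets the search must test: every subset of the other fields up to the maximal size is a candidate, so the count grows combinatorially. A short sketch of that growth (the field names are hypothetical):

```python
# Counting the conditioning sets an independence-testing search must
# consider: all subsets of the other fields of size 0 through max_size.
# Field names are illustrative, not the actual Modeler inputs.
from itertools import combinations

def conditioning_sets(fields, max_size):
    sets = []
    for size in range(max_size + 1):
        sets.extend(combinations(fields, size))
    return sets

others = ["LOCAL", "AGE", "SEX", "EST_INCOME", "CARDS"]  # hypothetical
for k in (1, 3, 5):
    print(k, len(conditioning_sets(others, k)))
```

With only five other fields the counts stay small, but with dozens of fields each increase in the maximal size multiplies the number of independence tests, which is why the setting trades accuracy against run time.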

We don’t expect strong relationships among the predictors, so we’ll reduce the value to 3 and then run the model.

Change the Maximal conditioning set size to 3 

Click Run

When the model has been generated:

Right-click the generated model and select Browse 


Figure 11.23 TAN Network with Maximal Conditioning Set Size=3

The resulting TAN network is much simpler than the original and includes only four fields. These are the same four fields that were included in the Markov Blanket network, and they were four of the top five in predictor importance in the original TAN network. As before, arrows point away from CHURNED to the inputs. No arrows point to LONGDIST from any other input.

Click on the node for LONGDIST

Figure 11.24 Conditional Probability Table for LONGDIST

The conditional probability table for LONGDIST only includes CHURNED because that is its only parent. We see that involuntary churners are very likely to have little long distance call usage (probability .992 in the lowest category), while the probabilities are spread more evenly for the other two types of customers.

This network is very simple, but how will it do in predicting CHURNED? We’ll use an Analysis node to get the answer.


Close the Bayes Net model browser
Add the generated Bayes Net model to the stream
Connect the new model to one of the Analysis nodes, replacing the connection
Run the Analysis node

Figure 11.25 Analysis Node Output for Modified TAN Network 

The overall accuracy on the Training data has declined to 72.6%, a substantial drop. However, notice that the accuracy on the Testing data is 71.24%, an increase of almost 1%. And, in the final analysis, how the network does on the Testing data is the key criterion.

It would appear, although this is only an educated guess, that the more complicated TAN models somewhat overfit the data, and that the Markov Blanket wasn’t quite complicated enough.

As is standard in data-mining modeling, we would continue developing variants of a Bayes Net model to try to find a handful of candidates to undergo further testing. However, there are fewer parameters to modify than in, say, an SVM model, so that process shouldn’t be too burdensome.

Close the Analysis output browser 

We’ll conclude the discussion of Bayes Net models by seeing how the predicted values are related to the inputs.

In this latest model, the field LONGDIST has only CHURNED in its conditional probability table. Now, LONGDIST has been binned into five categories (visible in Figure 11.24), but we won’t bother to take the time to use a Reclassify node to do this. We’ll just use a Histogram with an overlay to look at the general relationship.

There are Select nodes at the bottom of the stream that will select the Training or Testing partitions. We’ll use the one for the Training data.

 Add a Select node to the stream


Connect the TAN model named Bayes adjustment – TAN to the Select node
Add a Histogram node to the stream below the Select node, and connect these two nodes
Edit the Histogram node
Select LONGDIST as the Field and CHURNED as the Color Overlay field
Click Options tab
Click Normalize by color
Click Run

Figure 11.26 Histogram of LONGDIST with CHURNED as Overlay on Training Data

As the conditional probability table suggests, essentially all the involuntary churners have low values on LONGDIST. The proportion of customers who are current or voluntary churners is about equal across values of LONGDIST, and the pattern in the histogram echoes this. Now we’ll look at the predicted values of CHURNED.
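The Normalize by color option rescales each histogram bar so the overlay shows within-bin proportions rather than raw counts. A small sketch of that calculation, using invented (LONGDIST, CHURNED) pairs:

```python
# "Normalize by color" in miniature: within each histogram bin, report the
# proportion of each overlay category. The data pairs below are made up.
from collections import Counter, defaultdict

pairs = [(2.0, "InVol"), (3.5, "InVol"), (12.0, "Current"),
         (14.0, "Vol"), (15.5, "Current"), (28.0, "Vol")]

def normalized_bins(pairs, bin_width=10.0):
    bins = defaultdict(Counter)
    for value, category in pairs:
        bins[int(value // bin_width)][category] += 1
    return {b: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for b, cnt in bins.items()}

result = normalized_bins(pairs)
print(result)
```

Reading each bin as a distribution over the overlay field is what makes the comparison with the conditional probability table possible.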

Close the Histogram window
Edit the Histogram node
Change the Color Overlay field to $B-CHURNED
Click Run


Figure 11.27 Histogram of LONGDIST with Predicted CHURNED as Overlay on Training Data

The patterns are extremely similar, although there is more range in the values of LONGDIST for predicted involuntary churners (but all values are in the first bin for that field). Although there are other inputs in the network, these two histograms are similar because the only direct parent of LONGDIST is CHURNED itself.

Although you can use the conditional probability tables if you are adept at reading that type of information, it is likely that you will want to conduct this type of analysis between the inputs and the original and predicted values to understand how a Bayes Net model makes its predictions.

You may wish to continue this analysis with the TAN model. You can try these same histograms on the Testing partition. Or you can use another input, such as SEX (hint: use a Distribution node).


Summary Exercises

The exercises in this lesson use the file charity.sav. The following table provides details about the file.

charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender, and mosaic (demographic) group. The file contains the following fields:

response: Response to campaign
orispend: Pre-campaign expenditure
orivisit: Pre-campaign visits
spendb: Pre-campaign spend category
visitb: Pre-campaign visits category
promspd: Post-campaign expenditure
promvis: Post-campaign visits
promspdb: Post-campaign spend category
promvisb: Post-campaign visit category
totvisit: Total number of visits
totspend: Total spend
forpcode: Post Code
mos: 52 Mosaic Groups
mosgroup: Mosaic Bands
title: Title
sex: Gender
yob: Year of Birth
age: Age
ageband: Age Category

In this set of exercises you will attempt to predict the field Response to campaign using a Bayes Net model.

1.  If you have previously saved a stream that accesses the file charity.sav, you can use that stream. Otherwise, use a Statistics source node to read this file. Tell PASW Modeler to Read Labels as Names.

2.  Attach a Type and Table node in a stream to the source node. Run the stream and allow PASW Modeler to automatically define the types of the fields.

3.  Edit the Type node. Set all of the fields to role NONE.

4.  We will attempt to predict response to campaign (Response to campaign) using the fields listed below. Set the role of all five of these fields to Input and the Response to campaign field to Target.

Pre-campaign spend category
Pre-campaign visits category
Gender
Age


 Mosaic Bands (which should be changed to measurement level nominal)

5.  Attach a Bayes Net node to the Type node. First create a TAN network with the default settings.

6.  Once the model has finished training, browse the generated Bayes Net model. What are the most important fields? Are all fields used? Can you look at the conditional probability tables and learn anything about the network? How does predictor importance compare to the Neural Net results in Lesson 4 or the SVM results in Lesson 5?

7.  Place the generated Bayes Net node on the Stream canvas and connect the Type node to it. Connect the generated Net node to an Analysis node and create a matrix of actual response against predicted response. How well does this model do in predicting response to the campaign? How does its performance compare to other models?

8.  Now create a Markov Blanket network and answer the same questions as in #6 and #7. Additionally, compare and contrast the two models. What are the differences? Which model does better at predicting response to campaign?

9.  Use various methods to explore how the two most important predictors are related to predictions of the model.

10.  For those with extra time: Try using a dataset with more fields, such as customer_dbase.sav, to predict an outcome with a more complex network. If you do so, you can use some of the Expert settings in the Bayes Net node.


Lesson 12: Finding the Best Model for Categorical Targets

Objectives
•  Introduce the Auto Classifier node
•  Use the Auto Classifier node to predict customers who will churn

Data

In this lesson we will use the dataset churn.txt that we have used in several previous lessons. We will build models to predict whether a customer is loyal or not, and continue to use a Partition node to divide the cases into two segments (subsamples), one to build or train the models and the other to test the models.

12.1 Introduction

When you are creating a model, it isn’t possible to know in advance which modeling technique will produce the most accurate result. Often several different models may be appropriate for a given data file and target, and normally it is best to try more than one. For example, suppose you are trying to predict a binary target (buy/not buy). Potentially, you could model the data with a Neural Net, any of the Decision Tree algorithms, an SVM model, a Bayes Net, Logistic Regression, Nearest Neighbor, Decision List, or Discriminant Analysis. Unfortunately, this process can be quite time consuming.

The Auto Classifier node allows you to create models for categorical targets using a number of methods all at the same time, and then compare the results. You can select the modeling algorithms that you want to use and the specific options for each. You can also specify multiple variants of each model. For instance, rather than choose between the Multilayer Perceptron or Radial Basis Function methods for a neural net model, you can try them both. The Auto Classifier node generates a set of models based on the specified options and ranks the candidates based on the criteria you specify. The supported algorithms include Neural Net, all decision trees (C5.0, C&R Tree, QUEST, and CHAID), Logistic Regression, Decision List, Bayes Net, Discriminant, Nearest Neighbor, and SVM.

To use this node, a single target field with a categorical measurement level (flag, nominal, or ordinal) and at least one predictor field are required. Predictor fields can be continuous or categorical, with the limitation that some predictors may not be appropriate for some model types. For example, ordinal fields used as predictors in C&R Tree, CHAID, and QUEST models must have numeric storage (not string), and will be ignored by these models if specified otherwise. Similarly, continuous predictor fields can be binned in some cases (as with CHAID). The requirements are the same as when using the individual modeling nodes.

When an automated modeling node is executed, the node estimates candidate models for every possible combination of options, ranks each candidate model based on the measure you specify, and saves the best models in a composite automated model nugget.
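The loop the node performs can be sketched in a few lines: fit every candidate, score each by the chosen measure, and keep the top performers. The toy “models” below are simple threshold rules over invented (score, label) pairs, purely to show the flow; the real node estimates full Modeler models.

```python
# Auto-classifier flow in miniature: evaluate every candidate model on the
# same data, rank by the chosen measure (accuracy here), keep the best.
# The data and the candidate rules are made up for illustration.

data = [(0.2, "Stay"), (0.8, "Leave"), (0.72, "Leave"), (0.1, "Stay"),
        (0.9, "Leave"), (0.4, "Stay"), (0.6, "Stay"), (0.75, "Leave")]

candidates = {
    "threshold_0.5": lambda x: "Leave" if x > 0.5 else "Stay",
    "threshold_0.7": lambda x: "Leave" if x > 0.7 else "Stay",
    "always_stay":   lambda x: "Stay",
}

def accuracy(model):
    return sum(model(x) == label for x, label in data) / len(data)

ranked = sorted(candidates, key=lambda name: accuracy(candidates[name]),
                reverse=True)
for name in ranked:
    print(name, accuracy(candidates[name]))
```

Swapping in a different ranking measure (lift, profit, area under the curve) changes only the scoring function, which is why the node can offer several criteria over the same set of candidates.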

We continue to use the Churn.txt file, which we used in many earlier lessons. However, we will combine the Voluntary and Involuntary Leavers into a single category in order to use the Auto Classifier.


Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder
Double-click on FindBestModel.str
Place an Auto Classifier node from the Modeling palette to the right of the Type node
Connect the Type node to the Auto Classifier node
Edit the Derive node named LOYAL

Figure 12.1 Creation of Flag Field Identifying Loyal Customers

In the Derive node we use the field CHURNED to create a new target with the name LOYAL. This target will be a flag, with a value of Leave when CHURNED is not equal to Current; this means that customers who are voluntary or involuntary leavers will have values of Leave. Current customers will have a value of Stay.

Close the Derive node
Edit the Auto Classifier node


Figure 12.2 Auto Classifier Node

The Auto Classifier will use partitioned data if available. It will also create separate models for each value of a split field. The number of models to use and display in the Auto Classifier is 3 by default; the top-ranking 3 models are listed according to the specified ranking criterion, but you can increase or decrease this value. The Rank models by option allows you to specify the criteria used to rank the models. Note that the True value defined for the target field is assumed to represent a hit when calculating profits, lift, and other statistics (discussed below). We have defined Leave as the True category in the Derive node because we are more interested in locating persons who will leave as customers than those who will stay.

Models can be ranked on either the Training or Testing data, if a Partition node is used. It is usually better to initially rank the models by the Training partition, since the Testing data should only be used after you have some acceptable models.

Predictor importance can also be calculated; this option is turned on by default, but it can significantly increase execution time.

Click on the Rank models by dropdown to see the different ranking options


Click the Expert tab
Click the Training partition button under Rank models using:

Figure 12.4 Auto Classifier Expert Tab

The Expert tab allows you to select from the available model types and to specify stopping rules and misclassification costs. By default, all models are selected except KNN and SVM. However, it is important to note that the more models you select, the longer the processing time will be. You can uncheck a box if you don’t want to consider a particular algorithm. The Model parameters option can be used to change the default settings for each algorithm, or to request different versions of the same model.

In this example, we will request both Neural Net model algorithms and accept the default values for all the other models.

Click on the Model Parameters cell for Neural Net and select Specify 


Figure 12.5 Algorithms Simple Tab for Neural Net Models

Click in the Neural network model row in the Options cell and select Both from the dropdown list


Figure 12.6 Selecting Neural Network Model Setting

The Auto Classifier node will now try both types of neural network models.

The Expert tab within the Algorithm settings dialog allows detailed changes to specific models.

Click Expert tab


Figure 12.7 Algorithm Settings Expert Tab for Neural Net

The settings in this dialog are those that would be available in the Neural Net node.

Note that the Set random seed parameter is set to a numeric value. This means that each time the Auto Classifier node is run, the same neural net model will be found for these data and target (if we use the same other settings). If you find a neural net model that performs well (or any other model dependent on a random seed), then you may wish to change the random seed and rerun the Auto Classifier to check for model stability.

Click the Simple tab, and then click OK
Click the Stopping rules… button

Figure 12.8 Stopping Rules Dialog


Stopping rules can be set to restrict the overall execution time to a specific number of hours. All models generated to that point will be included in the results, but no additional models will be produced. In addition, you can request that execution be stopped once a model has been built that meets all the criteria specified in the Discard tab.

Click the Discard Tab

Click Cancel

Figure 12.9 Auto Classifier Discard Tab

The Discard tab allows you to automatically discard models that do not meet certain criteria. These models will not be listed in the summary report. You can specify a minimum threshold for overall accuracy, lift, profit, and area under the curve, and a maximum threshold for the number of fields used in the model. Optionally, you can use this dialog in conjunction with Stopping rules to stop execution the first time a model is generated that meets all the specified criteria.

Click the Settings tab


Figure 12.10 Auto Classifier Settings Tab

The Settings tab of the Auto Classifier node allows you to pre-configure the score-time options that are available on the nugget. For flag targets you can select from the following Ensemble methods: Voting, Confidence-weighted voting, Raw propensity-weighted voting (flag targets only), Highest confidence wins, and Average raw propensity (flag targets only). For voting methods, you can specify how ties are resolved: you can choose one of the tied values randomly, or choose the tied value that was predicted with the highest confidence or with the largest absolute raw propensity.
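Confidence-weighted voting, the default combining method used later in this lesson, can be sketched as follows; the individual predictions and confidences are invented:

```python
# Confidence-weighted voting sketch: each model's vote counts in proportion
# to its confidence, and the category with the largest total wins. The
# (prediction, confidence) pairs below are illustrative, not model output.

def confidence_weighted_vote(predictions):
    totals = {}
    for category, confidence in predictions:
        totals[category] = totals.get(category, 0.0) + confidence
    return max(totals, key=totals.get)

votes = [("Leave", 0.9), ("Stay", 0.6), ("Stay", 0.55)]
print(confidence_weighted_vote(votes))  # prints "Stay": 1.15 beats 0.90
```

Plain Voting would instead count each model once (here two votes to one for Stay), which shows why the two methods can disagree when one model is much more confident than the rest.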

Misclassification Costs

In some contexts, certain kinds of errors are more costly than others. For example, it may be more costly to classify a high-risk credit applicant as low risk (one kind of error) than it is to classify a low-risk applicant as high risk (a different kind of error). Misclassification costs allow you to specify the relative importance of different kinds of prediction errors.

Misclassification costs are basically weights applied to specific outcomes. These weights are factored into the model and may actually change the prediction (as a way of avoiding more costly mistakes).

Misclassification costs are not taken into account when ranking or comparing models using the Auto Classifier node. A model that includes costs may produce more errors than one that doesn’t, and may not rank any higher in terms of overall accuracy, but it is likely to perform better in practical terms because it has a built-in bias in favor of less expensive errors.
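The effect of a cost matrix can be sketched as choosing the prediction that minimizes expected cost rather than the one with the highest probability. The probabilities and costs below are hypothetical, echoing the credit-risk example:

```python
# Cost-sensitive decision sketch: for each possible prediction, compute the
# expected cost over the actual-class probabilities and choose the minimum.
# costs[predicted][actual] is the cost of that outcome; values are made up.

def min_cost_prediction(probs, costs):
    expected = {pred: sum(p * costs[pred][actual]
                          for actual, p in probs.items())
                for pred in costs}
    return min(expected, key=expected.get)

probs = {"high_risk": 0.3, "low_risk": 0.7}
costs = {"high_risk": {"high_risk": 0.0, "low_risk": 1.0},
         "low_risk":  {"high_risk": 5.0, "low_risk": 0.0}}
print(min_cost_prediction(probs, costs))  # prints "high_risk"
```

Even though low_risk is the more probable class (0.7), the 5.0 penalty for calling a risky applicant low risk pushes the decision to high_risk, which is exactly the built-in bias toward less expensive errors described above.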

Click the Expert tab
Click the Misclassification costs button

The cost matrix shows the cost for each possible combination of predicted category and actual category. By default, misclassification costs are set to 0.0 for the cells with correct predictions, and 1.0 for cells that represent errors of prediction (misclassification). To enter custom cost values, select the Use misclassification costs check box and enter custom values into the cost matrix.

Figure 12.11 Misclassification Costs Dialog

We don’t need to specify costs for this example.

Click OK
Click Run

Since the Auto Classifier can take some time to compute all the models, especially if variants of models are requested, a feedback dialog is presented while the node is running.


Figure 12.12 Execution Feedback Dialog

Once the Auto Classifier is finished:

Edit the LOYAL model nugget in the stream

Figure 12.13 Auto Classifier Results for Testing Set

Although we requested that the models be ranked on the Training partition, the default View is of the Testing set (Partition). So let’s switch to the Training data.

Click the View: dropdown and select Training set 


Figure 12.14 Auto Classifier Results for Training Set

Here we see the top-ranking three models contained in the nugget, including C5.0, CHAID, and Decision List (see the Appendix for information on that type of model). The number to the right of the model type indicates whether this is the first, second, etc., model of that type created by the Auto Classifier. The best model is the C5.0 model, which is over 90% accurate overall at predicting LOYAL. The order of the models on accuracy is the same on the Training data as the Testing data.

Ranking models by Lift would place Decision List first, followed by C5.0. As with any modeling exercise, you need to choose the model criterion that is most appropriate for your specific data-mining problem. But unlike when using one algorithm at a time, the Auto Classifier makes it easy to compare models on several factors, as all the various ranking measures are displayed in separate columns.

You can use the Sort by: option or click on a column header to change the column used to sort the table. In addition, you can use the Show/hide columns menu tool to show or hide specific columns.

Delete Unused Models will permanently delete all models that are unchecked in the Use? column.

The bar chart thumbnails show how the model predictions are related to the actual value of the target field LOYAL. The full-sized plot includes up to 1000 records and will be based on a sample if the dataset contains more records. Let’s see how the C5.0 model does.

Double-click on the Graph thumbnail for the C5.0 model


Figure 12.15 Distribution Graph of LOYAL by Predicted LOYAL for C5.0 Model

The predicted value ($C-LOYAL) is overlaid on the actual value of the target field. Ideally, if the model were 100% accurate, each bar would be of only one color because the overlap would be perfect.

We can see from the graph that the model does fairly well at predicting customers who will stay and extremely well for those who will leave, as most of the bar for Leave is blue (expand the window vertically to see this) and most of the bar for Stay is red, meaning there is great overlap between the actual and predicted values.

Close the Graphboard window

It is also possible to see how well the ensemble of three models does in predicting LOYAL.

Click the Graph tab


Figure 12.16 Accuracy and Predictor Importance of Ensemble of Models

As with the C5.0 model, the predicted value is overlaid on the actual value of LOYAL. The graph includes both the training and testing partitions. Note that the accuracy for Leave is somewhat less than for the C5.0 model alone; the models are combined by confidence-weighted voting, the default.

Also displayed is the predictor importance for the combined models. No one predictor is especially important.

In order to further compare the performance of these three models, we can generate an evaluation chart directly from the Auto Classifier nugget.

Click Generate…Evaluation Chart(s)


Figure 12.17 Evaluation Chart Selection Dialog

Since we have used accuracy to rank and select the models, we'll use Lift to further evaluate the models on another criterion.

Click Lift
Click OK

Figure 12.18 Lift Charts for Models

The best possible model is represented by the dark green line labeled $BEST. Initially the three models have about the same lift value, but eventually the C5.0 model surpasses the other two models. This is more evidence that the C5.0 model is the best performer.
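The lift plotted in these charts is the hit rate among the records the model is most confident about, divided by the overall hit rate. A plain-Python sketch of that arithmetic (the function name and the data are invented for illustration, and this is not Modeler syntax):

```python
# Sketch of the lift calculation behind an evaluation chart: rank records
# by model confidence, then compare the hit rate in the top fraction to
# the overall hit rate. Plain-Python illustration only.

def lift(scores, actuals, top_fraction):
    """scores: model confidences that the record is a hit;
    actuals: 1 for a hit (e.g. a Leaver), 0 otherwise."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    top_rate = sum(actual for _, actual in ranked[:n_top]) / n_top
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate

scores  = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals = [1,   1,   0,   1,   0,   0,   1,   0]
# Top 25% (2 records) are both hits (rate 1.0) vs. an overall rate of 0.5.
print(lift(scores, actuals, 0.25))  # 2.0
```

A lift of 2.0 means the model finds hits twice as often in its top-ranked records as random selection would.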

Close the Evaluation chart window


You can also view each model in its standard Model Viewer. As an illustration:

Click on the Model tab
Double-click on the C5.0 Model cell in the Model column
Click the All button on the toolbar

Figure 12.19 C5.0 Model

The Model Viewer has the same detail and options as for a C5.0 model created from that modeling node. In this way you can explore specific characteristics of a model.

Click OK
Close the Auto Classifier Model

We would normally continue to explore models here to see which ones are most satisfactory. One of the tricky things about using the Auto Classifier (and the Auto Numeric node in the next lesson) is that you may have many models from which to choose. When you are comparing only two or three models, it is easy to simply look at the results on the Training partition, and then on the Testing partition, to decide which model to select. But with many models, what is the appropriate methodology to follow for model selection?

Ideally, you select the candidate models before looking at the Testing data, although some analysts would argue that the only thing that matters is performance on the Testing data. However, our advice is to pick a set of possible models to assess: not all models generated, but more than just one or two.

This will require looking at the evaluation chart, looking at other ways of ranking the models, looking at how the models make their predictions (what are the important fields; what are the decision tree rules, etc.), seeing which categories the models predict more accurately, and perhaps setting minimum levels of lift, accuracy, or other measures.

For this example, let’s next use an Analysis node to evaluate the performance of the ensemble of the three best models.

Place an Analysis node to the right of the Auto Classifier nugget named LOYAL
Connect the LOYAL model node to the Analysis node

Edit the Analysis node
Click Coincidence matrices (for symbolic targets) (not shown)
Click Run


Figure 12.20 Analysis Node Output for Ensemble of Models

We see that the ensembled model is reasonably accurate on both data partitions, with accuracy in the training data of 87.40% and in the testing data of 82.19%. From the Coincidence Matrix, the model correctly identifies about 93% (316 of 338) of Leavers in the Training partition and 91% (280 of 307) of Leavers in the Testing partition. For the current customers who will stay, the ensembled model didn't yield the same degree of accuracy.
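The percentages above come straight from the coincidence (confusion) matrix. A minimal sketch of that arithmetic, using the Training-partition Leaver counts quoted above; the Stay row counts are invented to complete the example:

```python
# Sketch of how overall accuracy and a per-category hit rate are computed
# from a coincidence (confusion) matrix. The Leave row uses the counts
# quoted in the text (316 of 338 correct); the Stay row is invented.

# matrix[actual][predicted] = count
matrix = {
    "Leave": {"Leave": 316, "Stay": 22},   # 338 actual Leavers
    "Stay":  {"Leave": 60,  "Stay": 400},  # assumed counts for illustration
}

correct = sum(matrix[cat][cat] for cat in matrix)            # diagonal cells
total = sum(sum(row.values()) for row in matrix.values())    # all cells
accuracy = correct / total

leave_hit_rate = matrix["Leave"]["Leave"] / sum(matrix["Leave"].values())
print(round(leave_hit_rate, 3))  # 0.935 — about 93%, as in the text
```

The same two calculations, applied per partition, give every figure the Analysis node reports for a categorical target.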

Close the Analysis window
Double-click the Auto Classifier nugget named LOYAL

Since the most accurate model on the Testing data was C5.0, let's examine its accuracy further with an Analysis node. We can select the C5.0 model in the Model column and then create a generated model.

Double-click the C5.0 model nugget in the Model column
In the Model Viewer, click Generate…Model to Palette
Click OK, and then OK again


Figure 12.21 C5.0 Model Added to the Models Palette

Move the generated C5.0 to the Stream Canvas

Connect the C5.0 model to the Type node
Place an Analysis node to the right of the C5.0 model
Connect the C5.0 model to the Analysis node

Figure 12.22 Revised Stream with the Addition of the C5.0 Model and an Analysis Node

Edit the Analysis node
Click Coincidence matrices (for symbolic targets) (not shown)
Click Run


Figure 12.23 Analysis Node Output for C5.0 Model

We observe that the C5.0 model is very accurate on both data partitions, though accuracy fell a bit, as expected, on the Testing partition. It is more accurate than the ensemble of three models. The C5.0 model correctly identifies almost all of the Leavers in the Testing partition (302 out of 307, or about 98.4%!). And while it didn't predict current customers who will stay with the same degree of accuracy, it still did very well with this group.

You could investigate the other candidate models in a similar way to see which ones do better on which category of customers. When all this work is done, you will have a winning model, either one of the models or a combination of the models.


Summary Exercises

The exercises in this lesson use the data file charity.sav. The following table provides information on this file.

charity.sav comes from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender, and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category

1.  Begin with a blank Stream canvas. Place a Statistics File source node on the canvas and connect it to charity.sav.

2.  Try to predict Response to campaign using all the available model choices in the Auto Classifier. Use the defaults first. Which model is best, and which is worst? You can choose the criterion for ranking models, or use more than one. Which models use fewer inputs?

3.  Now change some of the model settings on one or more models and rerun the Auto Classifier. Request more than 3 models. Does the order of models change?

4.  Pick two or more models and generate a model for each. Add them to the stream and use an Analysis node or other nodes to further compare their predictions. Which model would you use, and why?

5.  Then use an Analysis node with the Auto Classifier model to compare the predictions of the ensemble of models. Does the ensemble do better than any individual model?


Lesson 13: Finding the Best Model for Continuous Targets

Objectives

•  Introduce the Auto Numeric Node

•  Use the Auto Numeric Node to predict birth weight of babies

Data

In this lesson we use the dataset birthweight.sav. This file contains information on the births of about 380 babies and characteristics of their mothers, such as age and various health measures (smoking, history of hypertension, etc.). Researchers are interested in accurately predicting birth weight months in advance so that interventions can be done for potential low birth weight babies to increase their chances of survival. This dataset is relatively small, which is typical of many medical studies, but as good practice we will still use a Partition node with the data.

13.1 Introduction

In the previous lesson we learned how to automate the production of models to predict categorical targets with the Auto Classifier node. In this lesson we discuss the Auto Numeric node, which in an analogous manner can automate the production of models for targets that are numeric with a continuous level of measurement.

The Auto Numeric node allows you to create models for continuous targets using a number of methods all at the same time, and then compare the results. You can select the modeling algorithms that you want to use and the specific options for each. You can also specify multiple variants for each model. The supported algorithms include Neural Net, C&R Tree, CHAID, Regression, Linear, Generalized Linear Models, KNN, and SVM.

To use this node, a single target field of measurement level continuous and at least one predictor field are required. Predictor fields can be categorical or continuous, with the limitation that some predictors may not be appropriate for some model types. For example, C&R Tree models can use categorical string fields as predictors, while linear regression models cannot use these fields and will ignore them if specified. The requirements are the same as when using the individual modeling nodes.

The format of this lesson will match that of the previous lesson on the Auto Classifier. We begin by opening an existing stream file and reviewing the data.

Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder
Double-click on NumericPredictor.str

Run the Table node


Figure 13.1 Birthweight Data File

In the Statistics source node, we have checked the option to Read labels as names so the variable labels become the field names in PASW Modeler. The field we want to predict is in the last column (Birth Weight in Grams), which measures actual birth weight. There is also a separate field (Low Birth Weight) which indicates whether the birth weight was below a certain threshold. We won't use that field in this example. All other fields can be used as predictors, except for id, of course.

The Partition node splits the data into equal parts for training and testing.

We need to set the role for the fields in the model.

Close the Table window
Edit the Birthweight.sav source node
Click on the Types tab

We need to fully instantiate the data so that PASW Modeler has values for all fields. We also need to change the role of id and Low Birth Weight to None, and then the role of Birth Weight in Grams to Target.

Click the Read Values button
Click OK


Figure 13.2 Types Tab for Birthweight Data

Change the role of id and Low Birth Weight to None
Change the role of Birth Weight in Grams to Target (not shown)
Click OK

Before attempting to model birth weight, let’s look at its distribution with a Histogram.

Add a Histogram node to the stream

Attach the Source node to the Histogram node
Edit the Histogram node
Specify the Field as Birth Weight in Grams (not shown)
Click Run


Figure 13.3 Histogram of Birth Weight

The distribution of birth weight is approximately normal, peaking around 3,000 grams, or about 6.6 pounds. Many physical and biological quantities have a normal distribution, which makes modeling less challenging. When a continuous field is distributed normally, just about any technique can be used to predict it. Also, there aren't too many outliers on either the low or high end, because babies born alive can only be so small or so large. This also makes creating models less problematic.

We can now add an Auto Numeric node to the stream.

Add an Auto Numeric node to the right of the Partition node
Connect the Partition node to the Auto Numeric node
Edit the Auto Numeric node
Click the Model tab if necessary


Figure 13.4 Auto Numeric Node

The Auto Numeric node will use partitioned data and build a model for each split, if available. Models can be ranked on either the Training or Testing data, if a Partition node is used. It is usually better to initially rank the models by the Training partition, since the Testing data should only be used after you have some acceptable models.

As with the Auto Classifier, the number of models to use is 3 by default. Predictor importance is turned on by default, but it may lengthen the execution time.

Specify the Number of models to use: as 8
Click on the Rank models by menu to see the different ranking options


Figure 13.5 Ranking Options for Models

The Rank models by option allows you to specify the criteria used to rank the models. Because we are predicting a continuous target, the choices to rank models are suited for this type of target. They include:

•  Absolute value of correlation between observed and predicted values

•   Number of predictors used 

•  Relative error, which is defined as the ratio of the error variance for the model to the variance of the target field

If the relationship between predicted and observed values is not linear, the correlation is not a good measure of fit or ranking. We’ll view scatterplots to make that determination.
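The first ranking criterion is the ordinary Pearson correlation between observed and predicted values. A plain-Python sketch of the calculation (the observed and predicted numbers are invented for illustration, and this is not Modeler code):

```python
# Sketch of the Pearson correlation between observed and predicted
# values, the first ranking criterion listed above. Illustrative data.

def correlation(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = (sum((a - mean_x) ** 2 for a in x) / n) ** 0.5
    sd_y = (sum((b - mean_y) ** 2 for b in y) / n) ** 0.5
    return cov / (sd_x * sd_y)

observed  = [2500.0, 3000.0, 3500.0, 4000.0]  # invented birth weights, grams
predicted = [2600.0, 2900.0, 3600.0, 3900.0]
print(round(correlation(observed, predicted), 3))  # 0.985
```

As the text notes, a high correlation only indicates a good fit when the relationship between observed and predicted values is roughly linear, which is why the scatterplots are still worth inspecting.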

The options to discard models are listed here in the Model tab dialog. You can specify criteria that correspond to the ranking options to discard candidate models. The more models you generate, the more likely you are to use these options, but we don't need to do so for this example.

When using a continuous target, profit can't be defined based on a predicted category, nor can misclassification costs be defined (a model can certainly directly predict revenue or profit, but that isn't the same as defining profit based on a categorical target).

Click the Expert tab
Click the Training partition button under Rank models using:


Figure 13.6 Auto Numeric Expert Tab

The Expert tab allows you to select from the available model types and to specify stopping rules. By default, six model types (all except KNN and SVM) are checked and will be used. The Model parameters option can be used to change the default settings for each algorithm or to request different versions of the same model.

In this example, we will request both Neural Net models and also a stepwise Regression model.

Click on the Model Parameters cell for Neural Net and select Specify
Click in the Neural network model row in the Options cell and select Both from the dropdown list


Figure 13.7 Changing Neural Network Model Setting

The Expert tab within the Algorithm settings dialog allows detailed changes to specific models.

As with the Auto Classifier example, the random seed is set to a fixed value for the Neural Net models, so each time it is run the same model will be found for these data and target (if we use identical settings otherwise).

 Now we’ll request a stepwise regression model as well.

Click OK
Click on the Model Parameters cell for Regression and select Specify
Click in the Method row in the Options cell and select Specify


Figure 13.8 Regression Parameter Editor

The Enter method is used by default, but we’ll add Stepwise.

Click the Stepwise check box
Click OK

Click OK

Figure 13.9 Auto Numeric Dialog Completed

We have requested that a total of 8 models be constructed.


We won't examine the Stopping rules dialog, which is identical to that for the Auto Classifier in terms of options and operation.

Click the Settings tab

Figure 13.10 Auto Numeric Settings Tab

The Settings tab of the Auto Numeric node allows you to pre-configure the score-time options that are available on the nugget. For a continuous target, the ensemble scores will be generated by averaging the predicted value of each model used.
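That averaging is simple arithmetic. A one-function sketch in plain Python (not Modeler code; the predicted birth weights are invented):

```python
# Sketch of how an ensemble score is produced for a continuous target:
# the mean of the individual models' predictions for a record.

def ensemble_score(model_predictions):
    """model_predictions: one predicted value per model for a record."""
    return sum(model_predictions) / len(model_predictions)

# e.g., three models predict birth weights of 2900, 3100, and 3050 grams
print(ensemble_score([2900.0, 3100.0, 3050.0]))  # about 3016.7 grams
```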

Click Run 

Since the Auto Numeric node can take some time to compute all the models, especially if variants of models are requested, a feedback dialog is presented while the node is running.


Figure 13.11 Auto Numeric Settings Tab

Once the Auto Numeric model is finished:

Edit the Birth Weight in Grams model nugget in the stream

Although we requested that the models be ranked on the Training partition, the default View is of the Testing set (Partition). So let's switch to the Training data.

Click the View: dropdown and select Training set


Figure 13.12 Auto Numeric Results for Training Set

There are wide differences in model performance. The best model, the first Neural Net with a Multilayer Perceptron, has a correlation between the predicted and actual values of 0.543; the worst model, the second Regression (Stepwise method), has a correlation of only 0.255. The Regression 2 model, which used stepwise selection, includes only one predictor (Presence of Uterine Irritability); you can find this information by double-clicking on the model icon.

The relative error is the ratio of the variance of the observed values from those predicted by the model to the variance of the observed values from the mean. In practical terms, it compares how well the model performs relative to a null or intercept model that simply returns the mean value of the target field as the prediction. For a good model, this value should be less than 1, indicating that the model is more accurate than the null model. The same differences are evident in the relative error, where smaller numbers closer to zero are better. The Neural Net is again the best model, followed by a Generalized Linear model. This is an instance of automatic modeling where the best model clearly stands out from the others.
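The relative error just described can be expressed as arithmetic in plain Python (illustrative numbers only; not Modeler code):

```python
# Sketch of relative error: the variance of the residuals divided by the
# variance of the target around its mean, i.e. the model's error relative
# to a null model that always predicts the mean.

def relative_error(actual, predicted):
    n = len(actual)
    mean = sum(actual) / n
    model_var = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    null_var = sum((a - mean) ** 2 for a in actual) / n
    return model_var / null_var

actual    = [2500.0, 3000.0, 3500.0, 4000.0]  # invented birth weights, grams
predicted = [2600.0, 2900.0, 3600.0, 3900.0]
print(relative_error(actual, predicted))  # 0.032 — far better than the null model
```

A value of 1 would mean the model is no better than always predicting the mean; values above 1 mean it is worse.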

Scatterplot thumbnails are provided for each model to show how the model predictions are related to the actual value of the target field. As noted above, if the relationship between the actual and predicted fields isn't linear, then the correlation is not an appropriate measure to use for ranking models.

Let's see how the Neural Net model does.

Double-click on the graph thumbnail for the Neural Net 1 model


Figure 13.13 Distribution Graph of Birth Weight by Predicted Birth Weight for Neural Net

If the model predictions were perfect, all the points would fall on a straight line running from the lower left to the upper right. Although the neural net model is far from perfect, that is the general tendency of the points.

To see what a poor model’s predictions look like in a scatterplot, let’s open the graph for the C&R Tree model.

Close the Graphboard window
Double-click on the graph thumbnail for the C&R Tree model


Figure 13.14 Distribution Graph of Birth Weight by Predicted Birth Weight for C&R Tree

The difference between the plots is immediately evident. The decision tree model is clearly a poor performer. In fact, it seems to predict only two values for birth weight, because the number of distinct predictions depends on the number of terminal nodes in the tree.

Close the Graphboard window

Because we are predicting a continuous field, evaluation charts are not available to further assess the models.

We can look at how well the ensemble of models does in predicting birth weight.

Click the Graph tab
Move the slider in the Predictor Importance pane to the fourth position from the left


Figure 13.16 Auto Numeric Results for Testing Set

The results are somewhat different, perhaps because of the small sample sizes used in the two data partitions. The best model is now the generalized linear model, with the neural net second. CHAID has fallen to number 5, while the first regression model is now third. There is also less difference between the first two models, and in fact, the neural net model has a lower relative error than the generalized linear model, so it may still be the best performer.

Despite the changes in model performance, we know that we should focus on the results on the Testing partition when selecting final models. You may want to engage in a class discussion about how to use the Training and Testing data to select models.

For this example, we will select the top three models: Generalized Linear 1, Neural Net 1, and Regression 1. We will check whether the ensembled model makes more accurate predictions than the Neural Net 1 model alone.

Double-click the Neural Net 1 icon in the Model column
Click Generate…Model to Palette
Click OK
Deselect the Neural Net 2, CHAID 1, Linear 1, C&R Tree 1 and Regression 2 models in the Use? column
Click OK

We can use an Analysis node to evaluate the models further, as we did with the Auto Classifier.

Drag the Neural Net 1 model nugget to the right of the Birth Weight in Grams nugget
Connect the two nuggets


Figure 13.17 Revised Stream with the Addition of the Neural Net Model

Place an Analysis node to the right of the Neural Net 1 model
Connect the Neural Net 1 model to the Analysis node
Right-click the Analysis node and select Run


Figure 13.18 Analysis Node Output

The Analysis node provides various summary measures for the ensembled model (Generalized Linear, Neural Net, and Regression) and the Neural Net model on its own. These include the minimum and maximum error, the mean error, the mean absolute error (the better measure of the two), the standard deviation, and the correlation. By mean absolute error, the ensembled model performed better.
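Why is mean absolute error the better of the two? Positive and negative errors cancel in the mean error, so a scattered model can still show a mean error near zero. A plain-Python illustration with invented residuals:

```python
# Sketch contrasting mean error with mean absolute error. The residuals
# (actual minus predicted, in grams) are invented for illustration.

errors = [-200.0, 210.0, -190.0, 180.0]

mean_error = sum(errors) / len(errors)
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(mean_error)      # 0.0  — looks perfect, but isn't
print(mean_abs_error)  # 195.0 — the typical miss is nearly 200 grams
```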

You could investigate the other ensembled models in a similar way to see which combinations do better at predicting birth weight.

You can also explore, using standard methods, how the ensembled model makes predictions, i.e., what the relationship is of the input fields to the predicted value of birth weight.


Summary Exercises

The exercises in this lesson are written for the data file charity.sav.

charity.sav is from a charity and contains information on individuals who were mailed a promotion. The file contains details including whether the individuals responded to the campaign, their spending behavior with the charity, and basic demographics such as age, gender, and mosaic (demographic) group. The file contains the following fields:

response    Response to campaign
orispend    Pre-campaign expenditure
orivisit    Pre-campaign visits
spendb      Pre-campaign spend category
visitb      Pre-campaign visits category
promspd     Post-campaign expenditure
promvis     Post-campaign visits
promspdb    Post-campaign spend category
promvisb    Post-campaign visit category
totvisit    Total number of visits
totspend    Total spend
forpcode    Post Code
mos         52 Mosaic Groups
mosgroup    Mosaic Bands
title       Title
sex         Gender
yob         Year of Birth
age         Age
ageband     Age Category

1.  Begin with a blank Stream canvas. Place a Statistics File source node on the canvas and connect it to charity.sav.

2.  Try to predict Post-campaign expenditure using all the available model choices in the Auto Numeric node. Use the defaults first. Which model is best, and which is worst? You can choose the criterion for ranking models, or use more than one. Which models use fewer inputs?

3.  Now change some of the model settings on one or more models and rerun the Auto Numeric node. Does the order of models change?

4.  Pick two or more models and generate a model for each. Add them to the stream and use an Analysis node and other nodes to further compare their predictions. Which model would you use, and why? How do they compare to the ensemble of models?



Lesson 14: Getting the Most from Models

Objectives

•  Discuss common approaches to improving the performance of a model in data mining projects

•  Use an Ensemble node to combine model predictions

•  Use propensity scores to score records

•  Use meta-modeling to improve model performance

•  Model errors in prediction

Data

In this lesson we will use the dataset churntrain.txt, a variant of the churn.txt file that we have used in several previous lessons. The data have been split into a separate training file, and we will build or use models constructed with it (there is also a churnvalidate.txt file that can be used for model testing).

14.1 Introduction

Throughout this course we have looked at several different modeling techniques, including neural networks, decision trees and rule induction, regression and logistic regression, Bayes Nets, SVM models, and discriminant analysis. After building a model we have usually performed some type of diagnostic analysis that helps with the interpretation of the model, and we have also done additional analyses to help determine where a model is more and less accurate.

In this lesson we develop and extend the model-building skills learned so far. The key concept in these examples is that models built with an algorithm in PASW Modeler should usually (unless accuracy is very high and satisfactory) be viewed not as the endpoint of an analysis, but as a way station on the path to a robust solution. There are various methods to improve models, only some of which we discuss here, and you are likely to come up with your own as you become experienced using PASW Modeler and read references on data mining.

We provide methods in this lesson for how to improve a model, but there is no one simple answer as to how this should be done. That is because the appropriate method is highly dependent upon characteristics of the existing model that has been built. Potential things to consider when improving the performance of a model are:

•  The modeling technique used 

•  The measurement level of the target field (categorical or continuous)

•  Which parts of the model are under-performing, i.e., are less accurate

•  The distribution of confidence values for the existing model.

We begin the lesson with the Ensemble node, which is an automated method of combining the predictions from two or more models. We then discuss propensity scores and show how they can be used to score a model. Following from this, we consider other methods of combining models, including modeling the error from a model.

14.2 Combining Models with the Ensemble Node
Many authors of data-mining books and articles recommend developing more than one predictive model for a given project. This is usually good advice because there are so many model types available, and, a priori, it isn’t normally possible to forecast which model will do better. Moreover, PASW Modeler makes it easy to try several models on the same data, including two nodes that automate the building of many models simultaneously.

If you do develop several models, though, you then have the question of how to use them to make predictions. The simplest approach is to use the best model, but what is the “best” model? Is it the most accurate overall, or the one that is most accurate at predicting the most critical category? And if the models are predicting a continuous target, there are several possible definitions of best model.

Since we have two or more models, another approach is to combine their predictions in some suitable manner, on the theory that two heads (models) are better than one. And in prediction, that is often true. There are a variety of methods to combine models. You could:

•  Let the models vote, with the category predicted most frequently the “winner”

•  Pick the model prediction with the highest confidence
•  Let the models vote, but weight the voting by model confidence

•  Average the model predictions if predicting a continuous field 

And there are several other possibilities, including using the propensity scores now available for most models in PASW Modeler 14.0 (but only for flag fields). All the methods in the bullet list are available in the Ensemble node, which is designed to make combining models a simple process.

Each Ensemble node generates a field containing the combined prediction, with the name based on the target field and prefixed with $XF_, $XS_, or $XR_, depending on the output field type (flag, set, or range, respectively). We’ll use a preexisting stream file with four generated models to demonstrate the Ensemble node with the churn data. The Ensemble node is located in the Field Ops palette because it creates new fields.

Click File…Open Stream, and then move to the c:\Train\ModelerPredModel folder
Double-click on Ensemble.str
Run the Table node

Figure 14.1 Churntrain Data File

The training data has just over 1,100 records. We will predict the field CHURNED, which is nominal in measurement level with three categories (you can run the Distribution node to review its distribution). The Type node already has all the appropriate settings.

Close the Table window

Looking at the stream (in Figure 14.2), we used four modeling nodes—CHAID, Neural Net, Bayes Net, and SVM—to create four models that have already been placed in the stream to save time. The models were created with all available predictors.

Figure 14.2 Ensemble Stream

Before using an Ensemble node, let’s see how well these four models predict CHURNED.

Add an Analysis node to the stream near the last generated model and attach this model to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) (not shown)
Click Run

There is a lot of output, so we show this in two figures. Figure 14.3 shows the results for all four models.

Figure 14.3 Analysis Node Results for Four Models

The most accurate model overall is the SVM, at 82.85%. The least accurate model is the Bayes Net, at 79.15%. Interestingly, although there are only 100 customers in the InVol category, which should make this group more difficult to predict, 3 of the 4 models do very well with this group, and the CHAID model is 100% accurate, although it was not the best model overall. This illustrates the potential advantage of combining models.

Figure 14.4 Analysis Node Results When the Models Agree

The last two tables in the Analysis browser window show the accuracy when the model predictions are combined in a simplistic fashion. All four models make the same prediction for 69.13% of the cases. For this segment of the file, those predictions have an accuracy of 91.91%. Clearly, combining models can improve performance.
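The Analysis node computes these figures for us, but the agreement check itself is easy to express directly. The sketch below uses invented predictions, not the course data, purely to show how an agreement rate and the accuracy-when-models-agree could be calculated.

```python
# Sketch: how often do several models agree, and how accurate is the
# consensus when they do? (Hypothetical predictions, not the course data.)

def consensus_stats(predictions, actual):
    """predictions: list of per-model prediction lists; actual: true values."""
    agree_idx = [i for i in range(len(actual))
                 if len({p[i] for p in predictions}) == 1]
    agree_rate = len(agree_idx) / len(actual)
    correct = sum(1 for i in agree_idx if predictions[0][i] == actual[i])
    accuracy_when_agree = correct / len(agree_idx) if agree_idx else 0.0
    return agree_rate, accuracy_when_agree

m1 = ["Current", "Vol", "Current", "InVol"]
m2 = ["Current", "Vol", "Vol",     "InVol"]
m3 = ["Current", "Vol", "Current", "InVol"]
actual = ["Current", "Vol", "Vol", "Vol"]

rate, acc = consensus_stats([m1, m2, m3], actual)   # rate = 0.75, acc = 2/3
```

On a real stream, the per-model prediction fields ($-prefixed) would play the role of m1, m2, and m3.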

The Ensemble Node
However, the models don’t make the same prediction for 30.87% of the customers, a sizeable fraction of the file. What should be done with these records? What prediction should be made for them? This is where the Ensemble node can provide much assistance.

Close the Analysis output browser window
Place an Ensemble node from the Field Ops palette near the last generated model
Attach the last generated model to the Ensemble node
Edit the Ensemble node

Figure 14.5 Ensemble Node Settings Tab

The Ensemble node will use the settings in the last Type node or Types tab from a Source node, so it recognizes that CHURNED is the target field. Ensemble nodes can use flag, nominal, or continuous target fields from the upstream model nodes.

Because the results from many models are being placed in one stream, and each one of those models generates at least two fields, by default the Ensemble node filters out all those generated fields (Filter out fields generated by ensembled models check box). If you want to continue to compare individual model predictions downstream, or use their predictions (see note below), then you will want to deselect this option.

The key setting is the Ensemble method, which determines how the model predictions will be combined.

Click Ensemble method dropdown

Figure 14.6 Ensemble Method Options

The choices available for Ensemble method will vary based on the measurement level of the target field. For a categorical target, there are three choices:

1)  Voting: The node counts the number of times each value is predicted, and selects the value with the highest total.

2)  Confidence-weighted voting: The node does not count the simple fact that a prediction was made, but instead uses the confidence of that prediction. So if a model predicts value A with a confidence of .80, then the node counts .80 as the “vote.” These weights are summed, and the value with the highest total is selected.
3)  Highest confidence wins: In this method, the best model, as measured by its confidence, is used for each prediction.
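To make the three methods concrete, here is an illustrative Python function that combines a list of hypothetical (prediction, confidence) pairs under each rule. The Ensemble node performs this logic internally; the data below are made up.

```python
# Sketch of the three categorical ensemble methods
# (the per-model (prediction, confidence) pairs are invented).
from collections import Counter, defaultdict

def ensemble(preds, method="confidence_weighted"):
    """preds: list of (predicted_value, confidence) pairs, one per model."""
    if method == "voting":
        return Counter(value for value, _ in preds).most_common(1)[0][0]
    if method == "confidence_weighted":
        totals = defaultdict(float)
        for value, conf in preds:
            totals[value] += conf          # the confidence acts as the "vote"
        return max(totals, key=totals.get)
    if method == "highest_confidence":
        return max(preds, key=lambda pc: pc[1])[0]
    raise ValueError(method)

preds = [("Vol", 0.55), ("Current", 0.90), ("Vol", 0.60)]
```

Note that the methods can disagree: with these pairs, voting and confidence-weighted voting pick "Vol" (two votes, weight 1.15), while highest-confidence picks "Current" (0.90).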

If the target field is a flag, there are four other options available, all based on the propensity score (discussed in more detail in the next section). The Ensemble prediction can be based on propensity-weighted voting, or on average propensity. This can be done for either raw propensity or adjusted propensity (which is based on a validation or testing data partition, and so is only available in those situations).

If the target field is numeric (continuous), the only available method is to average the model predictions.

We’ll use the default method of confidence-weighted voting.

If you use one of the voting methods and there is a tie, the Ensemble node can break the tie in two ways: a random selection can be made, or the model with the highest confidence can be selected. This latter choice seems like a better one, so we’ll use that.

Click Highest confidence option button
Click OK

There is no Run option for an Ensemble node because the node creates new fields but is not a Model or Output node.

We can first view the results of combining the four models with a Table node.

Add a Table node to the stream near the Ensemble node
Attach the Ensemble node to the Table node
Run the Table node

Figure 14.7 Table with Output Fields from Ensemble Node

The Ensemble node created two new fields, the prediction ($XS-CHURNED) and its confidence ($XSC-CHURNED). In this instance, the confidence is the sum of the confidences for the winning models divided by the total number of models.

We can use another Analysis node to check the performance of the combined prediction.

Close the Table window
Add an Analysis node to the stream near the Ensemble node
Connect the Ensemble node to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets)
Click Run

Figure 14.8 Analysis Node Output for Ensemble Node Prediction

Overall “model” accuracy is 84.12%, which is better than any individual model by about 1.3%. When accuracy is important, this improvement would very likely be crucial. The model continues to predict the InVol category almost perfectly, and does very well for the other two categories (we could use a Matrix node to get exact percentages correct in each category).

This stream, with the Ensemble node, can now be used to score new data.

Looking back at Figure 14.4, we recall that when all four models agreed on the prediction, the prediction was accurate for 91.91% of the customers. That is much better than 84.12%. So another approach when using the Ensemble node is to follow this methodology:

1)  If all models agree on a prediction, use that prediction
2)  When they don’t agree, use the prediction from the Ensemble node

This method can’t be used for continuous fields.

Is there any downside to combining models? The usual objection is that you cannot now explain why a specific prediction was made. If someone asks “What characteristics of this particular customer caused him/her to be predicted to be a voluntary churner?” that information will not be available, since the models are combined. Still, if you are using a single Neural Net or SVM model, the same holds true, so this is not a fatal objection. You can, as with any model, examine the predictions from the Ensemble node and see how they relate to the input fields, and that will be helpful. But full model understanding is sacrificed here for accuracy.

Close the Analysis output browser.

In our next example, we will learn how to use a model to score records in order to rank them by the propensity of a model prediction.

14.3 Using Propensity Scores
Confidence values obtained from a model in PASW Modeler reflect the level of confidence that the model has in the prediction of a given output, and they are only available for categorical targets. Confidence values make no distinction between categories of a target field; thus, for a flag with values of “yes” and “no,” confidence values can vary from 0 to 1 for predictions in each category. Consequently, a high degree of confidence does not help us determine whether that customer will continue or cancel their service (it instead indicates the confidence that the model has in its prediction, whatever that is).

Sometimes it would be helpful to have a score so that, for a specific category of interest—such as customers who churned—a high score means a prediction of churn, and a low score indicates the customer is current. This type of score can be used in choosing cases for future actions—intervention, marketing efforts, and so forth.

To create a score as just described, most PASW Modeler models calculate a propensity score for a flag field (propensity scores are not available for nominal, ordinal, or continuous fields). A propensity score is actually based on the probability of a prediction. The raw propensity score is based only on the training data (if using a Partition node), or otherwise on the whole file. When the model predicts the true value defined for the target field, the propensity is the same as P, where P is the probability of the prediction. If the model predicts the false value, then the propensity is calculated as (1 – P).
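The raw propensity calculation just described can be written in a few lines. This minimal sketch assumes a flag target whose True value is Leave; the field values are illustrative, not taken from the stream.

```python
# Sketch of the raw propensity rule: equal to the prediction probability P
# when the model predicts the True value, and 1 - P when it predicts the
# False value. (The "Leave" True value is an assumption for illustration.)

def raw_propensity(predicted_value, probability, true_value="Leave"):
    if predicted_value == true_value:
        return probability
    return 1.0 - probability

# A model 80% sure the customer leaves scores high; a model 80% sure the
# customer stays scores correspondingly low on the "leave" scale.
high = raw_propensity("Leave", 0.80)     # 0.80
low = raw_propensity("Current", 0.80)    # 0.20
```

This is what makes the propensity directional: unlike confidence, a high value always points toward the True category.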

The adjusted propensity score is only available if a Partition field is being used, and it is calculated based on model performance on the Testing data. Adjusted propensities attempt to compensate for model overfitting on the Training data. We aren’t using a Partition node in this example, so we can only use raw propensity scores.

Now for many models in PASW Modeler, such as all decision trees, the confidence is equal to the probability of the model prediction (why would that be?). For other models, such as a neural net, the confidence and the probability are not equivalent, although they are usually close in value.

This simple transformation of the probability allows you to easily score a data file with the propensity (probability) of an outcome occurring.

For this example, we continue using the dataset churntrain.txt. A Derive node has been added to the beginning of the stream to create a modified version of the CHURNED field. It converts CHURNED into the field LOYAL, which measures whether or not a customer continued with the company. LOYAL groups together both voluntary and involuntary leavers into one group, so comparisons can be made with customers who remain loyal.

We begin by opening the corresponding stream.

Click File…Open Stream and move to the c:\Train\ModelerPredModel directory (if necessary)
Double-click on Propensity.str

A Derive node calculates the new field LOYAL. Then both a neural net and a CHAID model were trained to predict the field LOYAL, using the ChurnTrain.txt data. Their generated models were then added to the stream, connected to the Type node.

Figure 14.9 Stream with Two Models and new Field LOYAL

Let’s look at the Derive node, which calculates the LOYAL field.

Edit the Derive node LOYAL 

The Derive node creates a flag field. The True value will be Leave, and this value will be assigned to a record when CHURNED is not equal to Current. Then LOYAL will be False for customers who are still current. The new field is defined in this manner because we are interested in finding customers who might churn.
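A rough pandas equivalent of this Derive logic is sketched below. The choice of "Current" as the False value label is an assumption for this sketch; the Derive dialog itself defines the actual values used in the stream.

```python
# Illustrative pandas version of the Derive node's flag logic
# (the "Current" False-value label is assumed for this sketch).
import pandas as pd

churn = pd.DataFrame({"CHURNED": ["Current", "Vol", "InVol", "Current"]})

# True value "Leave" whenever CHURNED is not equal to Current
churn["LOYAL"] = ["Current" if c == "Current" else "Leave"
                  for c in churn["CHURNED"]]
```

Both voluntary (Vol) and involuntary (InVol) leavers collapse into the single Leave category, mirroring the grouping described above.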

Figure 14.10 Derive Node Combining Churning Customers in One Category

Close the Derive node dialog

Propensity scores are not calculated by default for a model but must be requested (unlike confidence values). For CHAID models, raw propensities can be calculated either at the time of model creation, or later from the model nugget. For Neural Nets, propensities must be calculated when the model is created, not afterwards. We did this in the Model Options tab of the Neural Net node before generating the model in the stream. To request propensity scores for CHAID models after the model is created:

Edit the CHAID generated model
Click the Settings tab

There is a check box to request raw propensity scores. Because we aren’t using a Partition field, the option for adjusted propensity scores is grayed out.

Figure 14.11 Settings Tab in CHAID Model

Click Calculate raw propensity scores
Click OK

To illustrate the distribution of propensity scores and how they differ from confidence values, we’ll look at both fields for the CHAID model.

Add a Table node to the stream near the CHAID model
Connect the CHAID model to the Table node
Run the Table node

Figure 14.12 Table with CHAID Model Confidence and Propensity

Recall that when the model predicts that a customer will Leave, the propensity is equal to the probability. And, for the first record, with a prediction of Leave, the confidence ($RC-LOYAL) is equal to the propensity ($RRP-LOYAL), where the “RP” stands for raw propensity. This means that the probability for a CHAID model is equal to the confidence.

For record 4, where the prediction is Current, the propensity is 1 – confidence.

To make all this very clear, we will view histograms of confidence and propensity with an overlay of the model prediction.

Close the Table window
Add a Histogram node to the Stream canvas, and connect the CHAID model to it
Edit the Histogram node
Select $RC-LOYAL as the Field to display and $R-LOYAL as the Color Overlay Field (not shown)
Run the Histogram node

Figure 14.13 Histogram of Confidence Value by Predicted Loyalty

We can see that the confidence values range from .50 to 1.0, but that a high confidence doesn’t necessarily indicate that we expect a customer to leave or stay, since there are customers in both categories at high confidence values (we would find the same pattern if we used the values of LOYAL, the actual status of customers). In fact, the highest confidence is associated with customers who are current.

 Now we can create the histogram with the propensity scores.

Close the Histogram window
Edit the Histogram node
Select $RRP-LOYAL as the Field to display
Run the Histogram node

The distribution of the propensity is bimodal. Those customers predicted to leave have scores ranging from .50 to 1, and those predicted to remain have scores below .50.

Propensity scores have a similar distribution for the neural net model.

Figure 14.14 Histogram of Propensity Scores Overlaid by Predicted Loyalty

The propensity score can now be used to score a database, as is commonly done in many data-mining applications, so that customers can, for example, be selected for a marketing campaign based on their propensity to leave. Sort the file by propensity, and begin choosing customers with the highest propensities first.
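Selecting customers this way amounts to a sort followed by taking the head of the list. A tiny sketch, with invented customer ids and scores:

```python
# Rank customers by propensity to leave and pick the top of the list
# for a campaign (ids and scores are invented for illustration).
customers = [("c1", 0.15), ("c2", 0.92), ("c3", 0.55), ("c4", 0.81)]

ranked = sorted(customers, key=lambda rec: rec[1], reverse=True)
top_two = [cid for cid, score in ranked[:2]]   # highest propensities first
```

In practice the cutoff (top two here) would be set by campaign budget or capacity rather than a fixed count.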

In addition to scoring new records, there is another use of propensity scores. A score field can be used in a new model to improve the prediction of LOYAL. The score fields do not perfectly predict the value of LOYAL (remember, we have been using the predicted value of LOYAL, not the actual values, in our histograms; try running the histograms with LOYAL itself to see the difference), but they apparently have a high degree of potential predictive power. Clearly, this is based purely upon the way that CHAID or the neural network has differentiated between customers who will leave or stay, but if the model has a high degree of accuracy (which it does in this case), then the propensity score may act as a very good predictor for another modeling technique. If a more complex model that takes output from one model as inputs for another (often called a meta-model) were to be built, information on the score values from the CHAID model could be used as an input to a neural network. We shall look at this form of meta-modeling in the next section.

14.4 Meta-Level Modeling
The idea of meta-level modeling is to build a model based upon the predictions, or results, of another model. In the previous section, we used a stream that contained both a trained neural network and a CHAID rule induction model. We then created propensity scores for each to use separately.

We can use the propensity score, though, from the CHAID model as one of the inputs to a modified neural network model. We know that the CHAID algorithm can predict loyalty with higher accuracy; thus it is hoped that by inputting the propensity scores into a neural network analysis, the neural network may be able to correctly predict some of the remaining cases that the CHAID model incorrectly classified.

Click File…Close Stream and click No if asked to save changes
Click File…Open Stream
Double-click on Metamodel.str

The figure below shows the completed stream loaded into PASW Modeler. It is fairly complex but not difficult.

Figure 14.15 Meta-Model Stream

A Type node has been inserted after the two generated model nodes. If we are to build a model based upon results obtained from previous models, each of the newly created fields will need to be instantiated and have its role set. We will be using both the new propensity score field and the predicted value from the CHAID model.

If we run the two Analysis nodes attached to the two generated models, we would find that the neural net and CHAID models had accuracies of 80.42% and 83.75%, respectively, when trying to predict the field LOYAL.

When doing this type of meta-modeling, you must make a decision about which fields should be inputs to the new model. You can use all the original fields, or reduce their number, since the CHAID propensity score and predicted category will effectively contain much of the predictive power of the original fields. If the number of inputs isn’t large, then including them along with the two new fields in the new neural network will not appreciably slow training time, and that is the approach we take here. But you may wish to drop at least some of the fields that had little influence on the model, since including all fields can lead to over-fitting.
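Outside Modeler, the same two-stage idea can be sketched with scikit-learn standing in for the CHAID and neural network nodes. Everything below (the synthetic data, the model choices, the settings) is illustrative, not the course stream.

```python
# Minimal meta-modeling sketch: a first model's predicted probability is
# appended as an extra input to a second model (all data synthetic).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in for LOYAL

# Stage 1: a tree model, whose class probability plays the propensity role
stage1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
propensity = stage1.predict_proba(X)[:, 1]      # analogous to $RRP-LOYAL

# Stage 2: original inputs plus the stage-1 score as an extra predictor
X_meta = np.column_stack([X, propensity])
stage2 = LogisticRegression().fit(X_meta, y)
meta_accuracy = stage2.score(X_meta, y)
```

As the lesson stresses, accuracy measured on the data used for training (as here) is optimistic; a real meta-model must be validated on a holdout sample.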

We’ll begin by examining the downstream Type node.

Run the Table node attached to the Type node downstream of the generated models
Close the Table window
Edit the Type node attached to the CHAID generated model

Figure 14.16 Type Node Settings

In this example, we will use all the original input fields as predictors, plus the predicted value of LOYAL from the CHAID model ($R-LOYAL) and the propensity score ($RRP-LOYAL). The target field remains LOYAL.

A Neural Net node has been attached to the Type node (and renamed MetaModel_LOYAL). We’ve set the random seed to 1000 so that everyone will obtain the same solution, and we use the Quick training method.

Let’s run the model.

Close the Type node
Run the neural network MetaModel_LOYAL
Edit the generated model
Click on the Predictor Importance graph

Figure 14.17 Predictor Importance from Meta-Model for LOYAL

We can see that, not surprisingly, the field $RRP-LOYAL is by far the dominant input within the model. The actual predicted value from CHAID, $R-LOYAL, is only the fifth most important input.

We can check the accuracy of the meta-model with an Analysis node.

Close the Neural Net model browser
Add an Analysis node to the stream and connect the generated meta-model to the Analysis node
Edit the Analysis node
Click Coincidence matrices (for symbolic targets) (not shown)
Click Run

Figure 14.18 Analysis Node Output for Meta-Model for LOYAL

In the portion of the output comparing $N1-LOYAL with LOYAL, we observe that the overall accuracy of the meta-model is 84.93%. This is much better than the original neural net model (and it is even about 1% better than the CHAID model).

This is a realistic example of how using the results of one model to improve another model can work in practice. Sometimes it is said that doing so is misguided, and it is true that in classical statistics one doesn’t combine models in this manner. But in data mining this is an acceptable methodology, although you must always validate the final meta-model on a testing or validation sample. As an exercise, you can use the file ChurnValidate.txt to validate this meta-model.

14.5 Error Modeling
Error modeling is another form of meta-modeling that can be used to build a better model than the original, and it is often recommended in texts on data mining. In essence, this method is straightforward. Cases with errors of prediction are isolated and modeled separately. Almost invariably, some of these cases can now be accurately predicted with a different modeling technique.

However, there is a catch to this technique. In both the training and test data files we have a target field to check the accuracy of a model. Thus, in the churn data, we know whether a customer remained or left. But in real life, that is exactly what we are trying to predict. So how can we create a model that uses the fact that an error of prediction has occurred since, when applying the model, we won’t know whether the model is in error until it is too late, i.e., the event we are trying to predict has occurred?

The answer to this dilemma is that, of course, we can’t, so we have to find a viable substitute strategy. The most common approach is to find groups of cases with similar characteristics for which we make a greater proportion of errors. We then create separate models for these cases, assuming that the same pattern will hold in the future. It is, as always, crucial to validate the models with a holdout sample when using this technique.

In this section we build an error model on the churn data in order to investigate where the initial neural network is under-performing, and then improve it by modeling the cases more prone to prediction errors with a C5.0 model.

In this stream, the True value for LOYAL is reversed from our previous examples and is defined as a customer who will stay with their service. The False value is then a customer who will leave. Since we would like to model errors for predicting both categories, it doesn’t make as much difference here which category is associated with True.

Close the Analysis output browser
Close the current stream (you don’t need to save it), and clear the Models Manager
Click File…Open Stream
Double-click on Errors.str
Switch to small icons (right-click Stream canvas, click Icon Size…Small)

The figure below displays the error-model stream in the PASW Modeler Stream canvas. The upper stream in the canvas includes the generated model from the neural network and attaches a Derive node to it. The Derive node compares the original target field (LOYAL) with the network prediction of the target ($N-LOYAL), calculating a flag field (CORRECT) with a value of “True” if the prediction of the neural network is correct, and “False” if it was not. You can open it and review it if you wish.

The first goal of the error model is to use a rule induction technique, which can isolate where the neural network model is under-performing. This will be done by using the C5.0 algorithm to predict the field CORRECT.
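As a rough stand-in for this workflow, the scikit-learn sketch below trains a first model, derives a CORRECT flag, and then fits a decision tree (in place of C5.0, which scikit-learn does not provide) to predict where the first model errs. The data and settings are invented for illustration.

```python
# Sketch of error modeling: flag the records a first model gets wrong
# (the CORRECT field), then train a tree on that flag using the original
# inputs. Synthetic data; a decision tree stands in for C5.0.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# First model: a small neural net, like the stream's Neural Net node
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=300,
                    random_state=1).fit(X, y)
correct = (net.predict(X) == y)            # the CORRECT flag

# Second model: a tree predicting CORRECT; its rules describe where the
# neural net tends to fail (min_samples_leaf echoes the min-15 setting)
error_model = DecisionTreeClassifier(min_samples_leaf=15,
                                     random_state=1).fit(X, correct)
```

The tree's branches leading to a False leaf play the role of the C5.0 rules for False described next.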

We chose a C5.0 model because its transparent output will provide the best understanding of where the neural network is under-performing. In order to ensure that C5.0 returns a relatively simple model, the expert options have been set so that the minimum records per branch is 15. Setting this value is a judgment call based on the number of records in the training data and the number of rules with which you wish to work (another approach would be to winnow attributes, an expert option).


A Type node is used to set the new field CORRECT to the role Target, and the original inputs to the neural network to Input. It would need to be fully instantiated before training the C5.0 model.

Figure 14.19 Error Modeling from a Neural Network Model

In this example, the model has already been trained and added to the stream, labeled C5 Error Model. Let’s browse this model.

Edit the C5.0 generated model node labeled C5 Error Model 

We generated a ruleset from the C5.0 model because it makes it easier to view those rules for the False values of CORRECT. Again, we are trying to predict the values of CORRECT, which means we are trying to predict whether the neural network was accurate or not. There are two rules for a False value.

Click the Show all levels button

Click the Show or Hide Instances and Confidence button (so instances and confidence values are visible)

The two rules both have reasonable values of confidence, ranging from .59 to .68 (although you might prefer them to be a bit higher). Rule 1 tells us that for male customers who make less than about a tenth of a minute of long distance calls per month, we predict the value of CORRECT to be False, i.e., the wrong prediction. Rule 2 is more complicated.

It would be better to have more than two rules predicting False, but this will suffice for purposes of illustration.


Figure 14.20 Decision Tree Ruleset Flagging Where Errors Occur Within the Neural Network 

The next step is to split the training data into two groups based on the ruleset, one for predictions of True and the other for False. We can do this by generating a Rule Tracing Supernode from the Rule browser window and applying a Reclassify or Derive node to truncate the values of the new field to just True and False. We will use the Reclassify node to modify the Rule field so that it only has two categories, which we will rename as Correct and Incorrect. Let’s check the distribution of this field.

Close the C5.0 Model browser window

Run the Distribution node named Split 


Figure 14.21 Distribution of Split Field

The neural network accuracy was about 80.5%. The distribution of Split doesn’t match this because we limited the records per branch to no fewer than 15, and because the C5.0 model can’t perfectly predict when the neural network was accurate or not.

There are clearly enough cases with a value of Correct (1032) to predict with a new model, but there are only 78 cases with a value of Incorrect, which is a bit low for accurate modeling. The best solution is to create a larger initial sample so that the cases predicted to be incorrect by the C5.0 model would be represented by a larger number of cases. If that isn’t possible, you can use a Balance node and boost the number of cases in the Incorrect category (although this is not an ideal solution, either). Since this is an example of the general method, we won’t bother doing either, to see how much we can improve our model with no special sampling.

Looking back at the stream, we next added a Type node to set the role of FALSE_TRUE and Split to None so that they are not used in the modeling process. We wish to use only the original predictors.

The stream then branches into two after the Type node. The upper branch uses a Select node to select only those records with predictions expected to be correct, while the lower branch selects those records with predictions expected to be incorrect. We reemphasize that the split of the training data is not based on the target field. Instead, only demographic and customer status fields were used to create the field Split used for record selection. It is for this reason that this model can, if successful, be used in a production environment to make predictions on new data where the outcome is unknown.

After the data are split, the customers for whom we generally made correct predictions are modeled again with a neural network. We do so because these cases were modeled well before with a neural network, so the same should be true now. And, with the problematic cases removed, we expect the network to perform better.


For the customer group for which predictions were generally wrong, we use a C5.0 model to try a new technique, since the neural network tended to mispredict for this group. We could certainly try another neural network, however, or any other modeling technique.

After the models are created, they are added to the stream, and Analysis nodes are then attached toassess how well each performed. Let’s see how well we did.

Close the Distribution plot window
Run both Analysis nodes in the lower stream

The neural network model for the group of customers with generally correct original predictions is correct 85.56% of the time, a substantial improvement over the base figure of 80.51%. The C5.0 model is even more accurate, correctly predicting who will leave or stay for 86.84% of the cases that were originally difficult to predict accurately. Clearly, using the errors in the original neural network to create new models has led to a substantial improvement with little additional effort. If you take this approach, you would, as usual, explore each model to see which fields are the better predictors and how this differs in each model.

Figure 14.22 Model Accuracy for Two Groups

So far so good, but we’d still like to automate the solution so that the data all flow in one stream rather than in two, and we can therefore make a combined prediction for LOYAL on new data. This is easy to do. To demonstrate, we open a stream with a modified version of the current one.

Close the current stream and don’t save it if asked
Click File…Open Stream
Double-click on Combined_predictions.str
Switch to small icons (right-click Stream canvas, click Icon Size…Small)

We have combined the two generated models in sequence in this modified stream. You might think that we could simply combine the output from each model, since each was trained on a different group of cases and thus will make predictions only for those cases, but this isn’t the case. Although


each model was trained on only a portion of the data, each will make predictions for all the cases. (Why? Run the Table node to verify this.)

Figure 14.23 Combined Predictions Stream

But the solution is simple. We know that the value of the field Split tells us which model’s output to use, and we do so in the Derive node named Prediction.

Edit the Derive node named Prediction 

This node creates a new field called Prediction. When Split is equal to Correct, the value of Prediction is set to the output of the neural network. Otherwise, the value of Prediction is set to the output of the C5.0 model. Thus, we have a new field that contains the combined prediction from the best model for that group of customers.
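The Derive node’s selection logic can be sketched as follows (the column names $N-LOYAL and $C-LOYAL for the two models’ predictions, and the data themselves, are illustrative):

```python
# Sketch of the combined-prediction logic: pick each model's output based
# on the Split field. Column names and values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "Split":    ["Correct", "Incorrect", "Correct"],
    "$N-LOYAL": ["True", "False", "False"],  # neural network prediction
    "$C-LOYAL": ["True", "True", "True"],    # C5.0 prediction
})

# Use the network's output where the error model expects it to be right,
# and the C5.0 churn model's output otherwise.
df["Prediction"] = df["$N-LOYAL"].where(df["Split"] == "Correct", df["$C-LOYAL"])
print(df["Prediction"].tolist())  # ['True', 'True', 'False']
```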


Figure 14.24 Derive Node to Create Prediction Field

We know that the baseline neural network had an accuracy of 80.51% and made 216 errors. We will do much better with these two models. To see how much, we can run the Matrix node that crosstabulates Prediction and LOYAL.

Close the Derive node
Run the Matrix node named Prediction x LOYAL

The combined models have made only 159 errors, quite an improvement. This translates to an accuracy of 85.64%, an increase of about 5.1 percentage points over the original neural network model.


Figure 14.25 Comparison of Prediction and LOYAL

The process of modeling errors need not stop here. Although there will clearly be diminishing returns as the number of errors decreases, it is certainly possible to attempt to separately model the remaining errors from the combined model. At the very least, you would still want to investigate those customers whose behavior remains difficult to model.

Eventually you would validate the models with the ChurnValidate.txt dataset. We won’t do that here because the stream with the C5.0 model predicting errors in the original neural network has only 33 records, not enough for a reasonable validation. Obviously, the validation dataset should be of sufficient size, just as with the training file.

We should also note that this same technique could be used for target fields that are continuous, either integers or real numbers. In that case, the errors are relative, not absolute, but some numeric bounds can be specified to differentiate cases deemed to be in error from those with sufficiently accurate predictions. Then the former group of cases can be handled in a similar manner as was done above.
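A minimal sketch of such an error definition for a continuous target, assuming a simple absolute-residual tolerance (the values and the tolerance are illustrative judgment calls):

```python
# Hypothetical sketch: for a continuous target, flag a prediction as in
# error when the absolute residual exceeds a chosen tolerance.
actual    = [100.0, 250.0, 80.0, 40.0]
predicted = [110.0, 180.0, 82.0, 90.0]
tolerance = 25.0  # domain-specific judgment call

correct = ["True" if abs(a - p) <= tolerance else "False"
           for a, p in zip(actual, predicted)]
print(correct)  # ['True', 'False', 'True', 'False']
```

The "False" cases would then be routed to their own model, just as the Incorrect group was above.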


Summary Exercises
In these exercises we will use the streams created in this lesson.

1. Use the stream Metamodel.str. Rerun the MetaModel_LOYAL neural network model, removing all the original inputs from the model and thus using only the modified confidence score and the predicted value from the CHAID model. How does this affect model performance? Add this generated model to the stream and validate it with the ChurnValidate.txt data file. Was the model validated, in your judgment?

2. Use the stream Errors.str. Instead of using a C5.0 model to predict cases with proportionally more errors, try another type of model (your choice). How well does this perform compared to the C5.0 model? How does it compare to the accuracy of the original neural network? What do you recommend we use for these cases that were predicted in error?


Appendix A: Decision List

Overview

•  Introduce the Decision List model

•  Compare rule induction by Decision List with the decision tree nodes

•  Outline the main differences between a decision tree and a decision rule

•  Understand how Decision List models a categorical target

•  Review the Interactive Decision List modeling feature

•  Use partitioned data to test a model (optional, already covered in a former lesson)

Data

In this appendix we use the data file churn.txt, which contains information on 1477 customers of a telecommunications firm who have at some time purchased a mobile phone. The customers fall into one of three groups: current customers, involuntary leavers and voluntary leavers. Unlike the models developed in Lesson 3, here we want to understand which factors influence the voluntary leaving of a customer, rather than trying to predict all three categories.

Introduction
PASW Modeler contains five different algorithms for performing rule induction: C5.0, CHAID, QUEST, C&R Tree (classification and regression trees) and Decision List. The first four are similar in that they all construct a decision tree by recursively splitting data into subgroups defined by the predictor fields as they relate to the target. However, they differ in several ways that are important to the user (see Lesson 3).

Decision List predicts a categorical target, but it does not construct a decision tree; instead, it repeatedly applies a decision rules approach. To give you some sense of a Decision List model we begin by browsing such a model and viewing its characteristics. After that we continue by reviewing a table that highlights some distinguishing features of the rule induction algorithms. Finally, we will outline the difference between decision trees and decision rules and the various options for the Decision List algorithm in the context of predicting categorical fields.

A Decision List Model
Before diving into the details of the Decision List node, we review a decision list model.

Click File…Open Stream, and then move to the c:\Train\ModelerPredModel directory
Double-click DecisionList.str


Figure A.1 Decision List Stream

Right-click the Decision List node CHURNED[Vol]
Select Run

Once the Decision List generated model is in the Models palette, the model can be browsed.

Right-click the Decision List node named CHURNED[Vol] in the Models palette
Click Browse

The results are presented as a list of decision rules, hence Decision List. If you are familiar with the C5.0 model output you will see a distinct likeness to the Rule Set presentation of a C5.0 model.

Figure A.2 Browsing the Decision List Model


The first row gives information about the training sample. The sample has 719 records (Cover (n)), of which 267 meet the target value Vol (Frequency). In consequence, the percentage of records meeting the target value is 37.13% (Probability).
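The Probability measure follows directly from the other two, as a quick check confirms:

```python
# Probability is simply Frequency divided by Cover (n), expressed as a percentage.
cover = 719      # records in the training sample
frequency = 267  # records with CHURNED = Vol

probability = 100 * frequency / cover
print(round(probability, 2))  # 37.13
```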

A numbered row represents a model rule and consists of an id, a Segment, a target value or Score (Vol) and a number of measures (here: Cover (n), Frequency and Probability). As you can see, a segment is described by one or more conditions, and each condition in a segment is based on a predictive field, e.g. SEX = F, INTERNATIONAL > 0 in the second segment.

All predictions are for the Vol category, as this is what is defined in the Decision List modeling node. The accuracy of predicting this category is listed for each segment in the Probability column, and accuracy is reasonably high for most segments.

As a whole our model has 5 segments and a Remainder. The maximum number of predictive fields in a segment is 2. Each segment is not too small (see the measure Cover (n)); the smallest has 52 records. This is not by chance. The maximum number of segments in the model, the maximum number of predictive fields in a segment and the minimum number of records in a segment are all set in the Decision List node, as we will see later.

We now review in some detail the Decision List model.

The Target

A characteristic of Decision List is that it models a particular value of a categorical target. In the Decision List model at hand we have modeled the voluntary leaving of a customer as represented by target value CHURNED = Vol.

The Remainder Segment

The Remainder segment is yet another defining characteristic of the Decision List model. Unlike with decision trees, there will be a group of customers for which no prediction is made (the Remainder). The Decision List algorithm is particularly suitable for business situations where you are interested in a relatively small but extremely good (in terms of response) subset of the customer base. Think of customer selection for a marketing campaign, where there is a limited campaign budget available. The marketer will only be interested in the top N customers she can afford to approach given her budget, and the rest (the Remainder) will be excluded from the campaign.
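As an illustrative sketch of that selection (the customer records and the budget are hypothetical), scoring, excluding the Remainder, and keeping the top N might look like:

```python
# Hypothetical sketch of campaign selection from Decision List output:
# score customers, drop the Remainder (no prediction), keep the top N.
customers = [
    {"id": 1, "segment": 2, "probability": 0.62},
    {"id": 2, "segment": None, "probability": None},  # Remainder: excluded
    {"id": 3, "segment": 1, "probability": 0.71},
    {"id": 4, "segment": 5, "probability": 0.48},
]
budget_n = 2  # how many customers the budget allows

scored = [c for c in customers if c["segment"] is not None]
scored.sort(key=lambda c: c["probability"], reverse=True)
selected = [c["id"] for c in scored[:budget_n]]
print(selected)  # [3, 1]
```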

Overlapping Segments

In our model the 5 segments and the Remainder form a non-overlapping segmentation of the training sample, meaning that a customer (or a record) belongs to exactly one segment or to the Remainder. So the total of the Cover (n) for all segments, including the Remainder, should match the Cover (n) of the training sample. This basic requirement affects the way a particular segment should be interpreted when reading the model.

Membership in the Nth segment should be interpreted as: the record is in segment N and

not(segment N-1) and not(segment N-2) and … and not(segment 1)

Example

Given our model, a female customer with INTERNATIONAL > 0 and AGE from 43 to 58 satisfies both segment 1 and segment 2. However, she will be regarded as a member of segment 1. The rules are applied in the order you see them listed for the segments, so this customer is assigned to segment 1.
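This first-match behavior can be sketched in Python (the segment conditions paraphrase the two example segments from the text; the record is illustrative):

```python
# Sketch of first-match assignment: segments are checked in list order, so
# a record satisfying several segment conditions lands in the first one.
segments = [
    ("segment 1", lambda r: r["SEX"] == "F" and 42 < r["AGE"] <= 58),
    ("segment 2", lambda r: r["SEX"] == "F" and r["INTERNATIONAL"] > 0),
]

def assign(record):
    for name, condition in segments:
        if condition(record):
            return name
    return "Remainder"

# A 45-year-old female with international calls matches both conditions,
# but is assigned to segment 1 because it comes first.
print(assign({"SEX": "F", "AGE": 45, "INTERNATIONAL": 3}))  # segment 1
```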


A customer belongs to segment 2 if:

not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and SEX = F and International > 0

And a customer belongs to segment 3 if:

not (SEX = F and 42 < AGE <= 58) [the segment 1 conditions]
and not (SEX = F and International > 0) [the segment 2 conditions]
and SEX = F and 73 < AGE <= 89

This mechanism prevents multiple counting of customers in overlapping segments. Be aware that the order of the segments in the model affects the segment a customer belongs to, and so also the measures Cover (n), Frequency and Probability for each model segment.

This is a consequence of the iterative method by which Decision List generates rules. In a later section we will cover in detail how this rule induction mechanism works. For now it is sufficient to realize that the Decision List algorithm constructs lists of decision rules using a very different mechanism than the splitting used in the decision tree algorithms. This is the reason why Decision List is not a tree but a rule algorithm.

Comparison of Rule Induction Models
The table below lists some of the important differences between the rule induction algorithms available within PASW Modeler. The first four columns are repeated from Lesson 3 for ease of comparison.

Table A.1 Some Key Differences Between the Five Rule Induction Models

Model Criterion                             | C5.0                           | CHAID                              | QUEST                   | C&R Tree                      | Decision List
Split Type for Categorical Predictors       | Multiple                       | Multiple                           | Binary                  | Binary                        | Multiple
Continuous Target                           | No                             | Yes(1)                             | No                      | Yes                           | No
Continuous Predictors                       | Yes                            | No(2)                              | Yes                     | Yes                           | No
Criterion for Predictor Selection           | Information measure            | Chi-square (F test for continuous) | Statistical             | Impurity (dispersion) measure | Statistical
Can Cases Missing Predictor Values be Used? | Yes, uses fractionalization    | Yes, missing becomes a category    | Yes, uses surrogates    | Yes, uses surrogates          | Yes, missing becomes a category
Priors                                      | No                             | No                                 | Yes                     | Yes                           | No
Pruning Criterion                           | Upper limit on predicted error | Stops rather than overfits         | Cost-complexity pruning | Cost-complexity pruning       | Stops rather than overfits
Build Models Interactively                  | No                             | Yes                                | Yes                     | Yes                           | Yes
Supports Boosting                           | Yes                            | Yes                                | Yes                     | Yes                           | Yes


(1) SPSS has extended the logic of the CHAID approach to accommodate ordinal and continuous target fields.
(2) Continuous predictors are binned into ordinal fields containing by default approximately equal sized categories.

Unlike the decision tree algorithms, Decision List does not create subgroups by splitting but by either adding a new predictor or by narrowing the domain of the existing predictor(s) in the group (the decision rule approach), and in consequence tree-splitting issues are not applicable here.

Decision List can handle targets that are of measurement level flag, nominal, and ordinal. Decision List is designed to model a specific category of a categorical target, so effectively it predicts a binary outcome (target or not target). The algorithm treats continuous predictors by binning them into ordinal fields with an approximately equal number of records in each category.
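A sketch of equal-frequency binning of a continuous predictor (this assumes quantile-style binning and does not claim to reproduce Modeler's exact cut points):

```python
# Sketch of equal-frequency binning: a continuous predictor is cut into
# ordinal bins holding roughly the same number of records each.
import pandas as pd

age = pd.Series([21, 25, 30, 34, 41, 47, 52, 58, 63, 70])
bins = pd.qcut(age, q=5, labels=["1", "2", "3", "4", "5"])
print(bins.value_counts().tolist())  # each of the 5 bins holds 2 records
```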

In generating rules, just like CHAID and QUEST, Decision List uses more standard statistical methods, as explained below.

The way missing values are handled is set with Expert options. Either the missing values in a predictor are ignored when it comes to using that predictor in forming a subgroup, or, like CHAID, all missing values are used as an additional category in model building.

The process of rule generation halts based on settings such as the maximum number of predictors in a rule, explicit group-size related settings, and the statistical confidence required.

Rule Induction Using Decision List
The Decision List modeling node must appear in a stream containing fully instantiated fields (either in a Type node or the Types tab in a source node). Within the Type node or Types tab, the field to be predicted (or explained) must have role target, or it must be specified in the Fields tab of the modeling node. All fields to be used as predictors must have their role set to input (in the Types tab or Type node) or be specified in the Fields tab. Any field not to be used in modeling must have its role set to none. Any field with role both will be ignored by Decision List.

The Decision List node is labeled with the name of the target field and target category. Like most other models, a Decision List model may be browsed and predictions can be made by passing new data through it in the Stream Canvas.

The target field must have categorical values, and Decision List will model a particular value of the target field. That target value is set in the Decision List node. The other values of the target field will then be regarded as a second category value, appearing as the value $null$ in predictions.

In this example we will attempt to predict which customers voluntarily cancel their mobile phone contract. Rather than rebuild the source and Type nodes, we use the existing stream opened previously. We’ll delete the Decision List node so we can review the default settings.

Close the Decision List Browser window
Delete the CHURNED[Vol] node and the generated CHURNED[Vol] model node
Place a Decision List node from the Modeling palette to the upper right of the Type node in the Stream Canvas
Connect the Type node to the Decision List node (see Figure A.3)

The name of the Decision List node should immediately change to No Target Value.


Figure A.3 Decision List Modeling Node Added to Stream

The reason for the name “No Target Value” is that target field CHURNED has three values, but Decision List predicts only one specific target value.

Double-click the Decision List node to edit it

 Note the message stating that a target value must be specified.


Figure A.4 Decision List Dialog - Initial

The Model name option allows you to set the name for both the Decision List node and the resulting generated model node. The Use partitioned data option is checked so that the Decision List node will make use of the Partition field created by the Partition node earlier in the stream.

By default the model is built automatically, as the Mode is set to Generate model. By selecting Launch interactive session it is possible to create the model interactively.

The Target value has to be set explicitly to Vol .

Click the button to the right of the Target value
Click Vol, then click Insert

With Decision List you are able to generate rules better than the average or worse than the average, depending on your goal (where the average is the overall probability of the target value). This is set


by the Search Direction value of Up or Down. An upward search looks for segments with a high frequency. A downward search will create segments with low frequency.

A decision rule model contains a number of segments. The maximum is set in Maximum number of segments.

Each segment is described by one or more predictors, also known as attributes in the Decision List node. The maximum number of predictive fields to be used in a segment is set in Maximum number of attributes. You may compare this setting with the Levels below root setting in CHAID and QUEST, prescribing the maximum tree depth.

The Maximum number of attributes setting implies a stopping criterion for the algorithm. Just like the stopping criteria of CHAID, Decision List also has settings related to segment size: As percentage of previous segment (%) and As absolute value (N). The percentage setting states that a segment can only be created if it contains at least a certain percentage of the records of its parent. Compare this with a branch point in a tree algorithm. The absolute value setting is straightforward: a segment only qualifies for the model if it is not too small, thus serving the generality requirement of a predictive model. The larger of these two settings takes precedence.

Note that whereas in CHAID’s stopping criteria you must choose either a percentage or an absolute value approach, Decision List combines the two by using the percentage requirement for the parent and the absolute value requirement for the child.
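A sketch of the combined size requirement as described (the default values for the percentage and the absolute minimum are hypothetical, not Modeler's defaults):

```python
# Sketch of the combined size requirement: a candidate child segment must
# hold at least a given percentage of its parent's records AND at least an
# absolute minimum number of records (the larger threshold wins).
def segment_qualifies(child_n, parent_n, min_pct=5.0, min_abs=50):
    return child_n >= parent_n * min_pct / 100 and child_n >= min_abs

print(segment_qualifies(60, 700))  # True: 60 >= 35 and 60 >= 50
print(segment_qualifies(40, 700))  # False: fails the absolute minimum
```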

The model’s accuracy is controlled by Confidence interval for new conditions (%). This is a statistical setting and the most commonly used value is 95, the default. Of course, depending on the business case and how costly an erroneous prediction is, you may increase or decrease this confidence value.

Understanding the Rules and Determining Accuracy

The predictive accuracy of the rule induction model is not given directly within the Decision List node. To obtain that information using an Analysis node may be confusing, even misleading, as Decision List will only explicitly report on the particular target value that was modeled, and the other value(s) will be regarded as $null$. To avoid that, we will use Matrix nodes and Evaluation charts to determine how good the model is.

We use the Table node to examine the predictions from the Decision List model.

Click Run to run the model
Place a Table node from the Output palette below the generated Decision List node
Connect the generated Decision List node to the Table node
Right-click the Table node, then click Run and scroll to the right in the table


Figure A.5 Three New Fields Generated by the Decision List Node

Three new columns appear in the data table: $D-CHURNED, $DP-CHURNED and $DI-CHURNED. The first represents the predicted target value for each record, the second the probability, and the third shows the ID of the model segment a record belongs to. The sixth segment is the Remainder.

 Note that the predicted value is either Vol or $null$, demonstrating that the Decision List algorithm predicts a particular value of the target field to the exclusion of the others.

Click File…Close to close the Table output window

Comparing Predicted to Actual Values

We will use a matrix to see where the predictions were correct, and then we evaluate the model graphically with a gains chart.

Place two Select nodes from the Records palette, one to the upper right of the generated Decision List node and one to the lower right
Connect the generated Decision List node to each Select node

First we will edit the Select node on the upper right that we will use to select the Training sample cases:

Double-click on the Select node on the upper right to edit it
Click the Expression Builder button
Move Partition from the Fields list box to the Expression Builder text box
Click the equal sign button
Click the Select from existing field values button and insert the value 1_Training (not shown)
Click OK


Click the Annotations tab
Select Custom and enter the value Training
Click OK

Figure A.6 Completed Selection for the Training Partition

Now do the same for the Select node on the lower right to select the Testing sample cases. Insert Partition value “2_Testing” and annotate the node as “Testing.”

 Now attach a separate Matrix node to each of the Select nodes. For each of the Select nodes:

Place a Matrix node from the Output palette near the Select node
Connect the Matrix node to the Select node
Double-click the Matrix node to edit it
Put CHURNED in the Rows:
Put $D-CHURNED in the Columns:
Click the Appearance tab
Click the Percentage of row option
Click the Output tab and custom name the Matrix node for the Training sample as Training and for the Testing sample as Testing (this will make it easier to keep track of which output we are looking at)
Click OK

For each actual CHURNED category, the Percentage of row choice will display the percentage of records predicted in each of the target categories.

Run each Matrix node


Figure A.7 Matrix Output for the Training and Testing Samples

Looking at the Training sample results, the model predicts about 82.0% of the Vol (Voluntary Leavers) category correctly. The results with the testing sample compare favorably (80.5% accurate), which suggests that the model will perform well with new data.

 Note that technically no prediction for the other two categories is correct, since the model doesn’t

predict Current or InVol but just $null$. But we can combine these results by hand to obtain the accuracy. The percentage of correct not-Vol predictions is:

(313 + 48) / ((313 + 68) + (48 + 23)) * 100 = 79.9%.

We could have made this calculation easier by creating a two-valued target field based on CHURNED, thus creating a 2 by 2 matrix. Decision List would create the same rules for such a field.
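The hand calculation above can be reproduced in a few lines. This is an illustrative sketch using the counts from the Training matrix; it is not part of the Modeler stream.

```python
# Counts from the Training matrix: for actual Current and InVol records the
# model predicts either $null$ or Vol, so a "correct" non-Vol result simply
# means the record was not predicted as Vol.
current_null, current_vol = 313, 68   # actual Current: predicted $null$ / Vol
invol_null, invol_vol = 48, 23        # actual InVol:   predicted $null$ / Vol

correct_not_vol = current_null + invol_null
total_not_vol = (current_null + current_vol) + (invol_null + invol_vol)

pct = 100.0 * correct_not_vol / total_not_vol
print(round(pct, 1))  # 79.9
```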

To produce a gains chart for the Voluntary group:

Close both Matrix windows
Place the Evaluation chart node from the Graphs palette to the right of the generated Decision List node named CHURNED[Vol]
Connect the generated Decision List node to the Evaluation chart node
Double-click the Evaluation chart node, and click the Include best line checkbox

By default, an Evaluation chart will use the first target category to define a hit. To change the target

category on which the chart is based, we must specify the condition for a User defined hit in the Options tab of the Evaluation node.

Click the Options tab
Click the User defined hit checkbox
Click the Expression Builder button in the User defined hit group
Click @Functions on the functions category drop-down list
Select @TARGET on the functions list, and click the Insert button
Click the = button
Right-click CHURNED in the Fields list box, then select Field Values
Select Vol, and then click the Insert button


Figure A.8 Specifying the Hit Condition within the Expression Builder

Click OK 

Figure A.9 Defining the Hit Condition for CHURNED

In the evaluation chart, a hit will now be based on the Voluntary Leaver target category.

Click Run


Figure A.10 Gains Chart for the Voluntary Leaving Group

The gains line ($D-CHURNED) in the Training data chart rises steeply relative to the baseline, indicating that hits for the Voluntary Leaving category are concentrated in the percentiles predicted most likely to contain this type of customer, according to the model.

Hold the cursor over the model line in the Training partition at the 40th percentile

Approximately 77% of the hits were contained within the first 40 percentiles.


Figure A.11 Gains Chart for the Voluntary Leaving Group (Interaction Enabled)

The gains line in the chart using Testing data is very similar, which suggests that this model can be

reliably used to predict voluntary leavers with new data.

Close the Evaluation chart window

To save this stream for later work:

Click File…Save Stream As
Move to the c:\Train\ModelerPredModel directory (if necessary)
Type DecisionList Model in the File name: text box
Click Save

Understanding the Most Important Factors in Prediction

An advantage of rule induction models, as with decision trees, is that the rule form makes it clear which fields are having an impact on the predicted field. There is no great need to use alternative methods such as web plots and histograms to understand how the rule is working. Of course, you may still use the techniques described in the previous lessons to help understand the model, but they often are not needed.

In the Decision List algorithm the most important fields in the predictions can be thought of as those that define the best subgroups in the sample used for training the model at a certain stage in the process. Thus in this example the most important fields when using the whole training sample are SEX and AGE. Because the sample used for training the model gradually decreases during the stepwise rule discovery process, other predictive fields will come to the surface as most important. This intuitively makes sense. So in step 2, when finding the best second segment,


using the whole training sample except the first segment, the most important fields turn out to be SEX and International. Similarly, when finding segment 3 and using the whole training sample except for

the first two segments, SEX and AGE are again the most important predictors.

The process continues until the algorithm is not able to construct segments satisfying the requirements, or stopping criteria are reached.

Expert Options for Decision List

Now that we have introduced you to the basics of Decision List modeling, we will discuss the Expert options, which will allow you to refine your model even further.

Double-click on the Decision List node named CHURNED[Vol] to edit it

Expert mode options allow you to fine-tune the rule induction process.

Click the Expert tab
Click the Expert Mode option button

Figure A.12 Decision List Expert Options

Binning

Binning is a method of transforming a numeric field (with measurement level continuous) into a number of categories/intervals. The Number of bins input will set the maximum number of bins to be


constructed. Whether this maximum will actually be the number of bins depends on other settings as well.

There are two main types of Binning methods, Equal Count and Equal Width. Equal Width will transform numeric fields into a number of fixed-width intervals. Equal Count is a more balanced binning method, and it will create intervals based on an equal number of records per interval.

The three settings below this control details of the model process, described below.

If Allow missing values in conditions is checked, the Decision List algorithm will regard being empty or undefined as a particular category that can be used as a condition in a segment. That may result in a segment such as “SEX = F and AGE IS MISSING”.
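The two binning methods described above can be sketched in plain Python. This is an illustration of the idea only, not Modeler's internal implementation; the ages list and the bin count of 5 are made-up example values.

```python
def equal_width_edges(values, bins):
    # Equal Width: fixed-width intervals spanning the field's range
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]

def equal_count_edges(sorted_values, bins):
    # Equal Count: cut points chosen so each interval holds
    # roughly the same number of records
    n = len(sorted_values)
    edges = [sorted_values[0]]
    for i in range(1, bins):
        edges.append(sorted_values[(i * n) // bins])
    edges.append(sorted_values[-1])
    return edges

ages = sorted([18, 21, 25, 30, 34, 40, 47, 55, 62, 70])
width_edges = equal_width_edges(ages, 5)   # bins 10.4 years wide
count_edges = equal_count_edges(ages, 5)   # 2 records per bin
```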

The Decision List Algorithm

The Decision List algorithm constructs lists of rules based on the predictions of a tree. However, the tree is generated quite differently from the way it is done in the decision tree algorithms, so the word “tree” has to be regarded as a way to visualize the solution area and the rule generation process of the Decision List algorithm.

Process Hierarchy

In order to understand the Decision List rule generation process, we must first realize that a decision list contains segments, with each segment containing one or more conditions, and each condition being based on one predictive field. This hierarchy is directly reflected in the rule generation process: a main cycle of generating the list’s segments and a sub cycle for each segment of constructing the segment’s conditions based on the predictive fields.

The main cycle is also called the List cycle and the sub cycle is called the Rule cycle. In constructing

the conditions on the lowest process level, the algorithm also has a Split cycle, where the binning is performed in the case of continuous predictive fields.

Qualification

A key question is: what makes one list better than the other, and what makes a segment better than the other?

For a list the accuracy is defined by:

List% = 100 * SUM(Frequency) / SUM(Cover(n)), the Remainder excluded

On a segment level, a segment at hand is better than another segment if:
(1) the probability of the target value on the segment is better
(2) there is no overlap between the confidence interval of the segment at hand and the confidence interval of the other segment. This interval is directly related to the setting for the Confidence interval for new conditions (%), as set in the Simple mode of the Decision List dialog, and defined as (Probability ± Error), where Error is the statistical error in the prediction of the Probability.
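These qualification rules can be expressed as a short sketch. The List% function follows the formula above; the error term below uses the binomial standard error and a 95% z-value, which are assumptions, since the guide does not give Modeler's exact formula.

```python
import math

def list_accuracy(segments):
    # List% = 100 * SUM(Frequency) / SUM(Cover(n)), Remainder excluded
    freq = sum(s["frequency"] for s in segments)
    cover = sum(s["cover"] for s in segments)
    return 100.0 * freq / cover

def conf_interval(segment, z=1.96):
    # z = 1.96 corresponds to a 95% confidence interval; the binomial
    # standard error here is an assumption, not Modeler's documented formula
    p = segment["frequency"] / segment["cover"]
    err = z * math.sqrt(p * (1 - p) / segment["cover"])
    return p - err, p + err

def better(seg_a, seg_b):
    # seg_a beats seg_b if its probability is higher AND the two
    # confidence intervals do not overlap
    p_a = seg_a["frequency"] / seg_a["cover"]
    p_b = seg_b["frequency"] / seg_b["cover"]
    lo_a, _ = conf_interval(seg_a)
    _, hi_b = conf_interval(seg_b)
    return p_a > p_b and lo_a > hi_b
```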

List Generation – Simple

To simplify the argument we will describe the process given the setting Model search width = 1, meaning we will not create multiple lists simultaneously to choose from in the end. So we will

assume one List cycle here.


Rule Generation

Given the above, the rule generation process starts with the full Training sample to search for segments. The solution area is generated as follows: on the first rule level, segments are constructed based on 1 predictive field. The best 5 (Rule search width) will be selected as starting points for a second rule level, resulting in a set of segments each described by 2 predictive fields. Again the best 5 are selected for the third rule level. This goes on until the last rule level, which is 5 (Maximum number of attributes), and indeed in principle the fifth rule level segments are described by five predictive fields.

It is not always possible to refine a certain segment in a next step by adding a new predictive field. One of the reasons is the group size as set in As absolute value (N). The algorithm may come up with segments that are described by fewer than five predictive fields. On the other hand, refining a given segment in a next step can also be done not by adding a new predictive field, but by reconsidering an existing predictive field. This is set by Allow attribute re-use (e.g., “Age between (20, 60)” at level 1 could be refined to “Age between (25, 55)” at level 2). So this is why at rule level N there may be segments having fewer than N predictive fields.

A segment that is not refined anymore is called a final result, which is comparable to terminal nodes in a decision tree.

If Model search width = 1, out of all these final results the algorithm will return the best 5 (Maximum number of segments) based on the target value’s probability. Our previous model did create all five.

The decision rule process may not be able to use all the “freedom” as set in the Rule search width (5) and in Maximum number of attributes (5). The main reasons are typically group size requirements and/or the statistical confidence requested.
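The level-by-level search described above is essentially a beam search, and can be sketched as follows. The `refine` and `score` callables stand in for Modeler's condition generator and probability measure; they, and the function name, are illustrative assumptions, not Modeler's API.

```python
# Beam-search sketch of the Rule cycle: at each rule level every surviving
# segment is refined with one more condition, and only the best
# rule_search_width candidates are kept as starting points for the next level.

def rule_cycle(seed_segments, refine, score,
               rule_search_width=5, max_attributes=5):
    beam = sorted(seed_segments, key=score, reverse=True)[:rule_search_width]
    finals = list(beam)                     # candidates for "final results"
    for _ in range(max_attributes - 1):     # rule levels 2 .. max_attributes
        candidates = [r for seg in beam for r in refine(seg)]
        if not candidates:                  # no segment could be refined
            break
        beam = sorted(candidates, key=score, reverse=True)[:rule_search_width]
        finals.extend(beam)
    return finals
```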

List Generation – Boosting

Just like C5.0, Decision List has a “boosting” mechanism. This is reflected in the setting Model search width. In describing the Decision List algorithm we assumed Model search width to be 1.

By setting a higher value (say 2) you direct the Decision List algorithm to consider 2 alternatives for each segment. Thus the Decision List algorithm will deliver the best 2 segments after each Rule cycle. In our model this means that we instructed the algorithm to build a list of 5 segments. The List cycle will now have 5 iterations of the Rule cycle (= Maximum number of segments) and each Rule cycle will have 5 iterations (= Maximum number of attributes).

For the first segment on the list the Rule cycle will return the top 2 segments (= Model search width).

Thus, now 2 lists are created, each with 1 segment and a Remainder. On each of the 2 lists the Rule cycle will be performed on the Remainder. This will result in 4 lists, each with 2 segments and a Remainder. Out of these 4 lists the top 2 based on List% are selected to find a third, and so forth.

When working in Interactive mode, the Maximum Number of alternatives setting is active. When the

model is automatically generated, its value is set to 1.

Be aware that the Model search width and the Rule search width have a direct impact on the data-mining processing time.


Interactive Decision List

Decision lists can be generated automatically, allowing the algorithm to find the best model, as we did with the first example. As an alternative, you can use the Interactive List Builder to take control of the model building process. You can grow the model segment by segment, you can select specific predictors at a certain intermediate point, and you can ensure that the list of segments is not too complex, so that it is practical enough to be used for a business problem.

To use the List Builder, we simply specify a decision list model as usual, with the one addition of selecting Launch interactive session in the Decision List node’s Model tab. We’ll use the Decision

List Interactive session to predict the voluntary leavers.

Close the Decision List modeling node
Click File…Open Stream
Double-click on DecisionList Interactive.str
On the Stream canvas double-click the Decision List node named CHURNED[Vol]
Click the Model tab
Click the Launch interactive session option button


Figure A.13 Decision List Model Tab with Interactive session enabled

Note that we have modified some of the default settings, such as the maximum number of attributes, the maximum number of segments, and the absolute value of the minimum segment size. Click on the Expert tab to review those settings as well.

When the model runs, a generated Decision List model is not added to the Model Manager area.

Instead, the Decision List Viewer opens, as shown in Figure A.14.

Click Run to open the Decision List Viewer 


Figure A.14 The Decision List Viewer

The easy-to-use, task-based Decision List Viewer graphical interface takes the complexity out of the model building process, freeing you from the low-level details of data mining techniques. It allows you to devote your full attention to those parts of the analysis requiring user intervention, such as setting objectives, choosing target groups, analyzing the results, selecting the optimal model, and evaluating models.

The whole workspace consists of one pane and two pop-up tabs: the Working Model pane, the Alternatives tab, and the Snapshots tab. The Working model pane (Figure A.14) displays the current model, including mining tasks and other actions that apply to the working model. The Alternatives tab and Snapshots tab are generated when you click Find Segments. The Alternatives tab lists all alternative mining results for the selected model or segment on the working model pane. The Snapshots tab displays current model snapshots (a snapshot is a model representation at a specific point in time).

 Note: The generated, read-only model displays only the working model pane and cannot be modified.

In the working model pane you can see two rules. The first gives information about the training sample. Here the sample has 719 records (Cover (n)), of which 267 meet the target value (Frequency). In consequence the percentage of records meeting the target value is 37.13% (Probability).

The second, called Remainder, is now the first segment in our model and contains the whole training sample. This will be the starting point for building our Decision List model.

Right-click the Remainder segment
From the dropdown list select Find Segments


Figure A.15 Model Albums Dialog (Alternatives tab)

The pop-up window states that the mining task was performed on the Remainder segment and is completed. There are two alternatives that were generated by this data mining task. Recall that for this task the Model search width is set to 2. The first alternative (Alternative 1) contains 7 segments, and the model represented by this list has an average probability of 59.36%.

The second alternative has 8 segments and the corresponding model has an average probability of 56.13%.

Let’s view each of the two alternatives.

Click on Alternative 1 

The result will be displayed in the Alternative Preview pane.


Figure A.16 Preview of an Alternative

Click on Alternative 2 (not shown)

You will see that these two alternatives differ in their 7th segment. The second has a 7th segment based on AGE, but the first lacks this rule. Another interesting segment is the Remainder. The first alternative has a Remainder of 281 and misses 7 voluntarily leaving customers, whereas the second alternative list has a Remainder of 254 and misses 6 of these customers.

Assume that we prefer the first alternative but we want to capture some more of the voluntary leavers in the model. First we must promote the first alternative list to our working model; from there we will continue the model building process.

Right-click on Alternative 1
Select Load (or click the Load button on the bottom)
Click OK

The result will be displayed in the Working model panel.


Figure A.17 Loading an Alternative to the Working Model

We can now create a Gains chart for the working model.

Click Gains tab

Figure A.18 Gains Chart of Working Model


The results look encouraging on both the training data and the testing data. The segments included in the model are represented by the solid line; the excluded portion (Remainder) is represented by the dashed line.

Let’s put both the Working model (Alternative 1) and Alternative 2 on display in the Gains chart, so that we can choose the better model.

Click the Viewer tab
Click the Take Snapshot button
Click the toolbar button to view the alternatives
Right-click on Alternative 2
Select Load (or click the Load button on the bottom)
Click OK (for now, Alternative 2 is the working model)
Click the Gains tab
Click Chart Options

Figure A.19 Chart Options

Select the checkbox for Snapshot 1 (Snapshot 1 is actually Alternative 1; we can click the toolbar button to view the snapshot)

Click OK 


Figure A.20 Gains Chart of Alternatives

Although model performance is similar, Alternative 2 (the Working Model) performs a bit worse than Alternative 1 (Snapshot 1).

Click the Viewer tab
In the Working model pane, right-click on segment SEX = F


Figure A.21 Options to Modify a Segment in the Model

Choices in the context menu allow you to modify the segments created by the data mining task. For example, you may decide to delete a segment or to exclude it from scoring. You can even edit the segment. For example, you could add an extra condition to the segment ‘SEX = F’, or you could modify the lower and upper boundary values of EST_INCOME in segment 6 (Edit Segment Rule).

Model Assessment

We have used the Gains chart above to get an overall view of the model. You can assess the model on

a segment level by using the model measures. There are five types of measures available.

From the menu, click Tools…Organize Model Measures 


Figure A.22 Organize Model Measures Dialog

When building a Decision List model, you have five types of measures at your disposal (Display): a Pie Chart and four numerical measures.

Each measure has a Type, the Data Selection it will operate on (here Training Data), and whether it will be displayed in the model (Show).

The Pie Chart displays the part of the Training sample that is described by a segment. The other Coverage measure is Cover (n), which will show the number of records in the Training sample in that segment. The Frequency measure displays the number of records in the segment with the target value; Probability calculates the ratio of Frequency over Cover(n), and the Error returns the statistical error.
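The numerical measures are simple functions of a segment's Frequency and Cover(n). A sketch follows, using the Remainder row shown earlier (267 of 719 training records); the binomial standard error stands in for the Error measure, as the guide does not give Modeler's exact formula.

```python
import math

def segment_measures(frequency, cover):
    # Probability = Frequency / Cover(n); the error term below is the binomial
    # standard error, an assumption in place of Modeler's documented formula
    p = frequency / cover
    error = math.sqrt(p * (1 - p) / cover)
    return p, error

# The Remainder row shown earlier: 267 of 719 training records meet the target
p, err = segment_measures(267, 719)
print(round(100 * p, 2))  # 37.13
```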

It is possible to add new measures to your model by clicking the Add new model measure button. We’ll create a measure (call it %Test) showing the probability of each segment on the Testing partition. Furthermore, we will rename Probability to %Train.

Click the Add new model measure button

This will create a new row named Measure 6.

Double-click in the Name cell for Measure 6 and change the name to %Test
Click the dropdown list for Type and change to Probability


Figure A.23 Creating a New Measure

Click the dropdown list for Data Selection and change to Testing Data
Click the Show checkbox for %Test
Double-click in the Name cell for Probability and change its name to %Train, then hit Enter

Figure A.24 Completed %Test Measure and Renamed Probability Measure

Click OK


Figure A.25 New Measures Added to the Working Model

Decision List Viewer can be integrated with Microsoft Excel, allowing you to use your own value calculations and profit formulas directly within the model building process to simulate cost/benefit scenarios. The link with Excel allows you to export data to Excel, where it can be used to create presentation charts, calculate custom measures, such as complex profit and ROI measures, and view them in Decision List Viewer while building the model.

The following steps are valid only when MS Excel is installed. If Excel is not installed, the optionsfor synchronizing models with Excel are not displayed.

Suppose that we have created a template in Excel where, based on the Probability and on the Coverage of a segment, we calculate the amount of loss we will suffer should the customers in a segment actually leave voluntarily.

Click Tools and select Organize Model Measures
Click Yes for Calculate custom measures in Excel (TM)
Click the Connect to Excel (TM)… button
Browse to C:\Train\ModelerPredModel\ and select Template_churn_loss.xlt
Click Open


Figure A.26 The Excel Workbook for the Churn Case

Switch to PASW Modeler using the ALT-Tab keys on your keyboard (if necessary)

Figure A.27 Excel Input Fields

The Choose Inputs for Custom Measures window reveals that Excel expects two fields for input: Probability and Cover.

On the other hand four fields are available to add to your model:


Loss = Probability * Cover * Loss – Cover * Variable Cost
%Loss = 100 * Loss / Sum(Loss), the fraction of the total loss a segment can be accounted for
Cumulative = Cumulative Loss
%Cumulative = % Cumulative Loss

By default all are selected.
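The four measures can be sketched as below. Only the formulas come from the list above; the per-customer loss and variable-cost constants are made-up illustration values (the real figures live in the Excel template).

```python
LOSS_PER_CUSTOMER = 500.0   # assumed loss if a customer in the segment churns
VARIABLE_COST = 20.0        # assumed campaign cost per covered customer

def loss_measures(segments):
    # segments: list of (probability, cover) pairs, in list order
    losses = [p * cover * LOSS_PER_CUSTOMER - cover * VARIABLE_COST
              for p, cover in segments]
    total = sum(losses)
    pct = [100.0 * l / total for l in losses]          # %Loss per segment
    cumulative, running = [], 0.0
    for l in losses:                                   # Cumulative Loss
        running += l
        cumulative.append(running)
    pct_cumulative = [100.0 * c / total for c in cumulative]  # %Cumulative
    return losses, pct, cumulative, pct_cumulative
```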

Clicking on an empty Model Measure cell in the dialog will open a drop-down list with all the measures available in your model.

Click in the Model Measure cell for Probability and select %Train
Click in the Model Measure cell for Cover and select Cover (n)

Figure A.28 Mapping Excel Input File to the Decision List Model Measure

Click OK 

In the Organize Model Measures window you will see which measures are available for input to your 

model. By default all are selected.

Deselect measure %Test (not shown)
Click OK


Figure A.29 The Decision List Model with External Measures

As you can see, segment 4 is responsible for more than 20% of the total loss expected (reflected by its measure %Loss), and the first four segments for more than 50% (reflected by the %Cumulative measure for the fourth segment).

So if the business objective was to select a set of customers in a retention campaign to reduce the expected loss by at least 50%, the list manager would probably choose the first 4 segments to be scored.

If you wish to exclude a segment from a model, it can be done from a context menu.

Right-click on Segment 5 


Figure A.30 Manually Excluding Segments from Scoring Based on External Measures

Interactive Decision Lists are not a model, but instead are a form of output, like a table or graph. When you are satisfied with the list you have built, you can generate a model to be used in the stream

to make predictions.

Click Generate…Generate Model
Click OK in the resulting dialog box (not shown)
Close the Interactive Decision List Viewer window

A generated Decision List model appears in the upper left corner of the Stream Canvas. It can be

edited, attached to other nodes, and used like any other generated model. The only difference is in how it was created.


Summary Exercises

The exercises in this appendix are written for the data file Newschan.sav.

1. Begin with a clear Stream canvas. Place a Statistics File source node on the canvas and connect it to Newschan.sav.

2. Try to predict with Decision List whether or not someone responds to a cable news service offer (NEWSCHAN). Start by using the default settings. Use all available predictors. How many segments were created? What fields were used? Does this model seem adequate?

3. Try different models by changing various settings, including the minimum segment size, allowing attribute reuse, confidence interval (change to 90%), or some of the expert settings. Can you find a better model?
