Advanced Data Preparation with IBM SPSS Modeler Student Guide Course Code: 0A052 ERC 1.0



Published October 2010

Licensed Materials - Property of IBM

© Copyright IBM Corp. 2010

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide.

SPSS, and PASW are trademarks of SPSS Inc., an IBM Company, registered in many jurisdictions worldwide.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other product and service names might be trademarks of IBM or other companies.

This guide contains proprietary information which is protected by copyright. No part of this document may be photocopied, reproduced, or translated into another language without a legal license agreement from IBM Corporation.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.


TABLE OF CONTENTS

LESSON 1: INTRODUCTION TO DATA PREPARATION.................... 1-1

LESSON 2: SAMPLING DATA................................................................. 2-1

LESSON 3: WORKING WITH DATES .................................................... 3-1

LESSON 4: WORKING WITH STRING DATA....................................... 4-1

LESSON 5: DATA TRANSFORMATIONS.............................................. 5-1

LESSON 6: WORKING WITH SEQUENCE DATA ................................ 6-1

LESSON 7: EXPORTING DATA FILES................................................... 7-1

LESSON 8: EFFICIENCY WITHIN PASW MODELER.......................... 8-1

APPENDIX A: DATABASE JOINS WITH PASW MODELER ............. A-1

APPENDIX B: STATISTICS TRANSFORM NODE............................... B-1

1.1 INTRODUCTION ................................................................................................................... 1-1

1.2 THE PROCESS OF DATA MINING......................................................................................... 1-1

1.3 FILE FORMAT FOR ANALYSIS ............................................................................................. 1-3

1.4 UNIT OF ANALYSIS FOR MODELING ................................................................................... 1-4

1.5 MATCHING THE DATA TO THE MODELING TOOL............................................................... 1-5

2.1 INTRODUCTION ................................................................................................................... 2-1

2.2 SAMPLE NODE .................................................................................................................... 2-1

2.3 TYPES OF SAMPLES ............................................................................................................ 2-2

2.4 SIMPLE SAMPLING.............................................................................................................. 2-3

2.5 COMPLEX SAMPLING.......................................................................................................... 2-5

SUMMARY EXERCISES............................................................................................................ 2-13

3.1 INTRODUCTION ................................................................................................................... 3-1

3.2 READING DATA WHICH INCLUDES DATES......................................................................... 3-1

3.3 CALCULATIONS INVOLVING DATES ................................................................................... 3-5

3.4 APPLYING THE SAME EXPRESSION TO MULTIPLE FIELDS.................................................. 3-8

SUMMARY EXERCISES............................................................................................................ 3-13

4.1 INTRODUCTION ................................................................................................................... 4-1

4.2 MANIPULATING STRING DATA........................................................................................... 4-1

4.3 EXAMPLE OF STRING MANIPULATION ............................................................................... 4-3

SUMMARY EXERCISES............................................................................................................ 4-11

5.1 INTRODUCTION ................................................................................................................... 5-1

5.2 USING SUMMARY STATISTICS WITH SET GLOBALS NODE................................................. 5-1

5.3 TRANSFORMING CONTINUOUS FIELDS............................................................................... 5-8

5.4 BINNING FIELDS ............................................................................................................... 5-13

5.5 APPENDIX: ERRORS IN PASW MODELER EXPRESSIONS.................................................. 5-24

SUMMARY EXERCISES............................................................................................................ 5-26

6.1 INTRODUCTION ................................................................................................................... 6-1

6.2 PASW MODELER SEQUENCE FUNCTIONS.......................................................................... 6-1

6.3 COUNT AND STATE FORMS OF THE DERIVE NODE............................................................. 6-7

6.4 RESTRUCTURING SEQUENCE DATA USING THE HISTORY NODE ..................................... 6-11

SUMMARY EXERCISES............................................................................................................ 6-17



7.1 INTRODUCTION ...................................................................................................................7-1

7.2 USING A DATA FILE OR STREAMS IN MODELING ...............................................................7-1

7.3 TYPES OF EXPORTED FILES.................................................................................................7-2

7.4 EXPORTING FLAT FILES ......................................................................................................7-3

7.5 EXPORTING TO DATABASES..............................................................................................7-10

SUMMARY EXERCISES ............................................................................................................7-13

8.1 INTRODUCTION ...................................................................................................................8-1

8.2 SQL PUSHBACK ..................................................................................................................8-1

8.3 SQL OPTIMIZATION ............................................................................................................8-3

8.4 NODE ORDER ......................................................................................................................8-5

8.5 USING SAMPLES OF DATA ..................................................................................................8-5

8.6 MAXIMUM SET SIZE............................................................................................................8-5

8.7 PERFORMANCE IN SPECIFIC NODES ....................................................................................8-7

A.1 INTRODUCTION .................................................................................................................A-1

A.2 BASIC MERGE SETUP IN PASW MODELER.......................................................................A-1

A.3 AN INNER JOIN ..................................................................................................................A-2

A.4 JOINS WITH NON-MATCHING RECORDS IN ONE TABLE ...................................................A-4

A.5 JOINS WITH NON-MATCHING RECORDS IN BOTH TABLES................................................A-6

A.6 COMPLEX JOINS IN PASW MODELER .............................................................................A-11

A.7 THE SORT AND DISTINCT NODES....................................................................................A-13

A.8 JOINING MORE THAN TWO TABLES................................................................................A-18

B.1 INTRODUCTION..................................................................................................................B-1

B.2 COUNTING OCCURRENCES OF VALUES.............................................................................B-1


Lesson 1: Introduction to Data Preparation

Topics

Provide an overview of data preparation in data mining

Describe the Data Preparation phase in the CRISP-DM methodology

Discuss appropriate file format and unit of analysis for a data-mining project

Consider how to match data to a modeling tool

Data

None

1.1 Introduction

Data preparation is always required any time you do data analysis. This is true in data mining just as it has been in classical statistical analysis. Data preparation is important enough to receive its own separate phase in the CRISP-DM methodology, and in most projects we expect the bulk of the time to be spent on these tasks. Thus this course guide will consider in depth the data preparation needed for a successful data-mining project. Because of the larger data files typical in data mining, often coming from diverse sources, data preparation must be well structured, with a plan that is followed closely to ensure that all data characteristics have been investigated.

Although there are practical limits on time spent because of project deadlines and available resources, spending more time on data preparation is almost always justified. This is especially critical in data mining where we often work on files that contain hundreds of fields and millions of records.

The course guide provides advice throughout on how to handle typical issues, including creating new fields with data transformations, manipulating string data, representing dates in PASW® Modeler and calculating values based on dates, and working with sequence data. We also discuss sampling of data.

1.2 The Process of Data Mining

Throughout this and other PASW Modeler courses, we will refer to a standard framework that has been developed to help you and others carry out data-mining projects. The standard was developed by a consortium of companies, chiefly in Europe, and is called the Cross-Industry Standard Process for Data Mining, or CRISP-DM. Although developed for large projects, it is sufficiently broad and flexible to apply to any size of data-mining effort. Figure 1.1 presents an outline of the six phases of any project, with generic tasks and subtasks listed for each. (For more information describing CRISP-DM, see http://www.crisp-dm.org.) Version 2.0 of the CRISP-DM standard is now under development. In the figure, the phrases in bold list generic tasks, and those in italics identify outputs/reports that should be created.


Figure 1.1 CRISP-DM Model

The CRISP-DM model addresses the needs of all levels of users in deploying data mining technology to solve business and other problems. The CRISP-DM process is generally applicable across all sectors. If followed closely, it should make large data-mining projects faster, more efficient, more reliable, more manageable, and less costly. CRISP-DM has been kept sufficiently lightweight, however, to benefit even small-scale data mining investigations.

We are concerned in this course with the Data Preparation phase of the CRISP-DM model, which covers all activities to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

Although we concentrate in this course on data preparation, it is inextricably linked with the prior Data Understanding phase. You cannot successfully prepare the data unless you understand it and its characteristics and have done some previous data exploration.

In some lessons we will do some data exploration, while in other lessons we will presume that the exploration is done and we will move directly to the example of data preparation. The Data Understanding and Data Preparation phases together comprise only two of the model's six phases, but experienced data miners will readily understand that preparing data is the most time-consuming phase of any project. Estimates of the amount of total time and resources spent on data preparation vary from 60% to 80%, depending on the project and data complexity. This might seem excessive, but these figures should be a guideline for you as well. Having good, clean data that is well understood is at the heart of any research project, in any field. Thus, if you find that you aren't spending at least 50% of actual project time on preparing data, you should question how thoroughly the job is being done.


Figure 1.2 provides some more detail on steps in the Data Preparation phase.

Figure 1.2 Data Preparation Tasks

In particular, data preparation can involve:

Extracting data from a data warehouse or data mart

Linking tables together within a database

Combining data files from different systems

Reconciling inconsistent field values/data cleaning

Identifying missing, incorrect, or extreme data values

Sampling and selecting of records

Restructuring data in a form the analysis requires

Transforming relevant fields (taking differences, ratios, etc.)

1.3 File Format for Analysis

Data for data mining often come from data warehouses or data marts and are stored in a database system. This type of data structure is quite efficient at data storage, but it is not suitable for modeling with external software tools. This is especially true because the data for a project may come from multiple tables in a database and need to be combined via various join and append operations to create a unified data file that contains all relevant information.

Most data mining programs, including PASW Modeler, require a rectangular data file when modeling is performed. This is not a logical requirement, but data mining tools typically apply several different techniques and a simple, common data structure facilitates this. This implies that if the relevant data are spread across several databases or sources, they need to be combined before analysis is performed. If many data sources are involved, the planning and execution of this step requires a substantial effort.

A rectangular or regular data file has the following characteristics.

All records (cases) contain the same data fields (variables, attributes)

All records (cases) contain the same data fields (variables, attributes) in the same order

All records (cases) contain some value for all data fields (variables), even if some of the values are missing data

In data-mining, a data file with this structure is called a denormalized file. Although the data need not begin with this structure, it generally must be converted to it before analysis.
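As a minimal sketch (with hypothetical field names and values), a rectangular file might look like this:

custid  age  gender  region  revenue
1001    34   m       North   1200.50
1002    51   f       South   $null$
1003    29   f       North   450.00

Every record carries the same five fields in the same order, and a missing value (shown here as $null$, PASW Modeler's undefined value) still occupies its position in the record.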

One decision you will need to make early in a project is whether you, or a database administrator, should access a database and combine various tables to create a rectangular file, or whether you would prefer to do this in PASW Modeler. PASW Modeler can perform a variety of standard table join operations, and Appendix A provides examples of many.

1.4 Unit of Analysis for Modeling

The definition of a record or case is called the unit of analysis. For example, in a file of customers of a telecommunications company, if each record or row of data represents a unique customer, then customer is the unit of analysis. A rectangular data file has the same unit of analysis throughout the file.

Data can be organized and summarized at different levels of aggregation. It is important in data mining to make sure that the basic unit of analysis is appropriate for the question being studied. For example, a single purchase transaction record in a sales database might contain the customer ID, product code, sales location, time/date and price. If we were interested in relating location characteristics to sales volume, data at the transaction level are at too fine a granularity. Probably, the analyst would collapse or aggregate many individual transactional records to a summary record, which might represent total sales at a location for some time period (a month, year, or over the entire time span). Thus the data are organized so the basic unit of analysis (one store, or one store for a specific month, etc.) is directly relevant to the business question asked (how store characteristics relate to overall sales).
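As a sketch of this kind of aggregation (hypothetical fields and values), transaction-level records such as

custid  product  location  date        price
C-101   A12      Store 5   2010-01-12  19.99
C-204   B07      Store 5   2010-01-12  4.50
C-101   A12      Store 9   2010-01-13  19.99

might be collapsed to one record per store per month:

location  month    total_sales  transactions
Store 5   2010-01  24.49        2
Store 9   2010-01  19.99        1

The unit of analysis changes from the transaction to the store-month, matching the business question.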

Nevertheless, the appropriate unit of analysis for a data-mining project is related to, but is not uniquely defined by, the business goals of the project. Suppose that we are interested in predicting whether a customer renews her contract for cell/mobile phone service with a telecommunications firm. We might presume that the appropriate unit of analysis is the customer, so that our final file for modeling should contain one record per customer. While that is generally correct, our initial data tables may not be in this format and may each have a different unit of analysis. One data table could contain billing information for each customer for each month, so that a customer appears on multiple records. A second table could contain any contacts the sales reps or tech support have had with the customer, and then a third table could contain promotions/offers that were sent to all customers of a certain class.

In this situation, what is the appropriate unit of analysis? Well, in the long run, almost all data mining tools require rectangular files, as noted in the previous section. But we don't want to lose information that could be useful for modeling, such as the sequence of events in a customer's record that are spread over multiple records in a table. First and foremost, we should try to retain as much information as possible about the customer. Second, we will need to condense information from several records into one record, but still retain individual information. Thus, if payment record is important, we might create a field that records the number of late payments. Or if the timing of customer complaints and inquiries is thought to be interesting, we could create several fields to store this information on the single, condensed customer record.

Because some information can be lost when moving from many records to one record, you should very carefully think about information that can be retained, or creatively constructed, from the multiple records per customer file before it is merged and aggregated (condensed).

1.5 Matching the Data to the Modeling Tool

Data-mining algorithms can be very different in how they handle categorical fields with many values, continuous fields with outlying or extreme values, and missing data in all types of fields. Some decision tree methods, such as CHAID, bin continuous fields into a small number of categories, so outliers can't much affect the model. Neural networks standardize all continuous fields to a range of 0 to 1, with the endpoints equivalent to the minimum and maximum values. Here, an outlier can greatly affect a model because it can set the minimum or maximum value far away from the bulk of values. Some models—neural nets and Kohonen nets—will not use categorical fields with many values that you try to include in a model.

Although modeling tools handle these situations with no user intervention, it is much better if you understand how a tool/algorithm handles missing data, outliers, and fields with many categories, and then prepare the data accordingly. Here are some specific examples:

1) If you want to control how missing data is handled by a model, then either impute missing data beforehand, or drop records with lots of missing data from the data stream, or take some other action so that the modeling tool will encounter no missing data.

2) If you want to control how outliers are handled, you can remove them from the data stream, or you can reduce their influence with various methods, including data transformations or coercion to less extreme values (a sketch follows this list).

3) If you have categorical fields with many values (such as type of product purchased), you can either force its inclusion, or you can reclassify/recode the field into a smaller number of categories so that it works better with a particular modeling algorithm.

4) If you would like to use models such as linear regression or logistic regression that don’t naturally search for interaction effects, you can create composites of fields that allow you to include interactions between fields in a model.
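As an example of the coercion mentioned in point 2, a Derive or Filler node could cap extreme values with a CLEM conditional. This is a minimal sketch; the field name Income and the cutoff of 100000 are hypothetical:

if Income > 100000 then 100000 else Income endif

Values below the cutoff pass through unchanged, while more extreme values are pulled back to it, limiting their influence on methods such as neural networks that scale each field by its minimum and maximum.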

These four are just some of the most common ways in which you can anticipate how a model handles data and then take proactive steps so that data are dealt with exactly as you prefer. The two courses Predictive Modeling with PASW Modeler and Clustering and Association Models with PASW Modeler discuss how models process data and how to handle missing values before modeling is done.


Lesson 2: Sampling Data

Topics

Describe different sampling methods

Introduce complex sampling techniques

Build clustered and stratified samples

Data

We use the data file property_assess_cs.sav, which contains property values across a range of locations. The data file contains one record per property, together with location information: neighborhood, town, and county. The years since the last appraisal are also recorded.

2.1 Introduction

When mining your data, it is usually not necessary to use all of the records stored in a data file. It is common in data mining to have tens of thousands, if not millions, of records available for processing. Building a successful predictive model, or discovering the majority of associations between fields, can be accomplished quite well with a moderate number of records. In these situations, using all the records can be quite inefficient in terms of processing time and memory.

In this lesson, we will demonstrate how to sample data within PASW Modeler by use of the Sample node, which can draw a range of simple and complex samples.

2.2 Sample Node

In this section we introduce the Sample node as a way of selecting samples of records from full datasets. The Sample node is found in the Record Ops palette.

Data selection and sampling can occur at multiple points in the data mining process. Data selection often occurs at the data collection stage very early in the process, even before any data exploration, so that an overly large dataset is not created from the original data sources.

You can use Sample nodes to select a subset of records for analysis, or to specify a proportion of records to discard. A variety of sample types are supported, including stratified, clustered, and non-random (structured) samples. Sampling can be used for several reasons:

To improve performance by estimating models on a subset of the data. Models estimated from a sample are often as accurate as those derived from the full dataset, and may be more so if the improved performance allows you to experiment with different methods you might not otherwise have attempted.

To select groups of related records or transactions for analysis, such as selecting all the items in an online shopping cart (or market basket), or all the properties in a specific neighborhood.

To identify units or cases for random inspection in the interest of quality assurance, fraud prevention, or security.

Note: If you simply want to partition your data into training and test samples for purposes of validation, a Partition node can be used instead.


2.3 Types of SamplesThere are a variety of sampling methods, from the simple to the complex. The sampling method must be fit to the problem at hand. For example, if you have a database of similar customers (they all buy a specific product), and you simply want to reduce file size for modeling, you can use one of the simple methods, such as random or systematic sampling. But if you need to sample directly from a database with customers of different types, you may want to draw a more complex sample. Complex sample types include:

Clustered samples. This type of sample is used to sample groups or clusters rather than individual units. For example, suppose you have a data file with one record per student. If you cluster by school and the sample size is 50%, then 50% of schools will be chosen and all students from each selected school will be picked. Students in unselected schools will be rejected. On average, you would expect about 50% of students to be picked, but because schools vary in size, the percentage may not be exact. Similarly, you could cluster shopping cart items by transaction ID to make sure that all items from selected transactions are maintained. Our example clusters properties by town; see the ModelerDataPrep2.str sample stream.

Stratified samples. This type of sample is used to select samples independently within non-overlapping subgroups of the population, or strata. For example, you can ensure that men and women are sampled in equal proportions, or that every region or socioeconomic group within an urban population is represented. You can also specify a different sample size for each stratum (for example, if you think that one group has been under-represented in the original data). Our example stratifies properties by county; see the ModelerDataPrep2.str sample stream.

Sampling weights are automatically computed while drawing a complex sample and roughly correspond to the "frequency" that each sampled unit represents in the original data. Therefore, the sum of the weights over the sample should estimate the size of the original data.
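For example (hypothetical numbers): if a stratum containing 4,000 records is sampled at a proportion of 0.25, each of the roughly 1,000 sampled records receives a weight of 4, and the weights sum to about 4,000, the original size of the stratum.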

Sampling Frame

A sampling frame defines the potential source of cases to be included in a sample or study. In some cases, it may be feasible to identify every single member of a population and include every one of them in a sample frame—for example, when sampling items that come off a production line. More often, you will not be able to access every possible case. For example, you cannot be sure who will vote in an election until after the election has happened. In this case, you might use the electoral register as your sampling frame, even though some registered people won't vote, and some people may vote despite not having been listed at the time you checked the register (that is one of the difficulties of predicting elections). Anybody not in the sampling frame has no prospect of being sampled. Whether your sampling frame is close enough to the population you are trying to evaluate is a question that must be addressed for each real-life case.

Our example in this lesson involves property values. In order to more closely track and assess real estate taxes, a government agency samples properties stratified by county, and clustered by town within each county. Since double-checking assessed value is expensive, a sample of properties is constructed for further study.

In our example the sampling frame will be the complete data file property_assess_cs.sav which we will assume is a complete inventory of property in various counties.

We begin by reading the data into PASW Modeler.

If the Stream Canvas is not empty, click File…New Stream


Select the Statistics File node and place it on the Stream Canvas

Edit the Statistics File node and set the file to property_assess_cs.sav in the c:\Train\ModelerDataPrep directory

Under Values, select Read labels as data (not shown)

Click OK to return to the Stream Canvas

Connect a Table node to the Statistics File node

Run the Table node

Figure 2.1 Complete Data File

Here we see properties in the first town. We want to sample this in a number of ways using the Sample node. You can choose the Simple or Complex method as appropriate for your requirements.

2.4 Simple Sampling

The Simple method allows you to select a random percentage of records, select contiguous records, or select every nth record.

Close the Table window

Select a Sample node from the Record Ops palette, place it to the right of the Statistics File node in the Stream Canvas, and connect the nodes

Edit the Sample node


Figure 2.2 Sample Node Simple Method

In the Simple mode the Sample node allows you to select whether to pass (Include sample) or discard (Discard sample) records.

Include sample. Includes selected records in the data stream and discards all others. For example, if you set the mode to Include sample and set the 1-in-n option to 5, then every fifth record will be included, yielding a dataset that is roughly one-fifth the original size. This is the default mode when sampling data, and the only mode when using the complex method.

Discard sample. Excludes selected records and includes all others. For example, if you set the mode to Discard sample and set the 1-in-n option to 5, then every fifth record will be discarded. This mode is only available with the simple method.

The Maximum sample size checkbox directly limits the largest sample that will be passed through the node, whatever the sampling method. This option is redundant and therefore disabled when First and Include are selected. Also note that when used in combination with the Random % option, this setting may prevent certain records from being selected. For example, if you have 5 million records in your dataset, and you select 50% of records with a maximum sample size of one million records, then 50% of the first two million records will be selected, and the remaining three million records have no chance of being selected. To avoid this limitation, select the Complex sampling method and request a random sample of one million records without specifying a cluster or stratify field.

There are three sampling methods:

First: The first n records will be selected (where n is the value in the First text box). The Maximum sample size option is disabled when First and Include sample are selected.


1-in-n: every nth record, where n is to be specified in the text box

Random %: a random sample of size r%; the percentage r is to be specified in the text box

In this example we will select a random sample of approximately 60% of the original data file.

Set Sample: to Random %

Set the Random % value to 60

Click the Set random seed check box

Type 6801347 in the Set random seed text box

Click OK to return to the Stream Canvas

Add a Table node to the stream and connect the Sample node to a Table node

Run the Table node

Figure 2.3 60% Sample of Data File

We now see the number of cases (reported in the window title bar) has been reduced by about 40% (actually 38.97%, as the sampling is also random with respect to the exact percentage sampled), and the propid field shows the random selection.

2.5 Complex Sampling

Complex sample options allow for more complex samples, including clustered, stratified, and weighted samples, and combinations of these, along with other options.

To illustrate the use of Complex samples, we will construct a sample of property values stratified by county, and then clustered by town within each county (stratum). This will ensure that an independent sample of towns is drawn from within each county. Some towns will be included and others will not, but for each town that is included, all properties within the town are included (because they are all in the same county, or stratum).


Close the Table window

Select another Sample node from the Record Ops palette, place it to the right of the Statistics File node in the Stream Canvas, and connect the nodes

Edit the Sample node

Click Complex for Sample method

Type 6801347 in the Set random seed text box

Figure 2.4 Sample Node with Complex Method

A new range of options is now available.

The Cluster and Stratify button opens a dialog that allows you to specify cluster, stratify, and input weight fields if needed. We will use this in the next section.

Sample type has two options:

Random. Selects clusters or records randomly within each stratum.

Systematic. Selects records at a fixed interval. This option works like the 1 in n method, except the position of the first record changes depending on a random seed. The value of n is determined automatically based on the sample size or proportion.

You can select proportions or counts as the basic sample units.

You can specify the sample size in several ways:


Fixed. Allows you to specify the overall size of the sample as a count or proportion.

Custom. Allows you to specify the sample size for each subgroup or stratum. This option is only available if a stratification field has been specified in the Cluster and Stratify subdialog box.

Variable. Allows the user to pick a field that defines the sample size for each subgroup or stratum. This field should have the same value for each record within a particular stratum; for example, if the sample is stratified by county, then all records with county = Surrey must have the same value. The field must be numeric and its values must match the selected sample units. For proportions, values should be greater than 0 and less than 1; for counts, the minimum value is 1 (see the sketch below).
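As a sketch of such a field (hypothetical values), every record within a stratum carries the same sample-size value; here the sizes are proportions:

county  propid  sampsize
Surrey  10001   0.3
Surrey  10002   0.3
Kent    20001   0.5
Kent    20002   0.5

With this setup, about 30% of the Surrey records and 50% of the Kent records would be sampled.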

In this example, we will retain the default value of 0.5 for the sample size based on proportions. This will mean that about 50% of the clusters (towns) will be selected.

Minimum sample per stratum. Specifies a minimum number of records (or minimum number of clusters if a cluster field is specified).

Maximum sample per stratum. Specifies a maximum number of records or clusters. If you select this option without specifying a cluster or stratify field, a random or systematic sample of the specified size will be selected.

Cluster and Stratify Settings

The Cluster and Stratify dialog box allows you to select cluster, stratification, and weight fields when drawing a complex sample.

Click the Cluster and Stratify button

Figure 2.5 Cluster and Stratify Dialog

Returning to our example, in order to track and assess real estate taxes, the municipality samples properties stratified by county, and clustered by town within each county.

The Clusters field specifies a categorical field used to cluster records. Records are sampled based on cluster membership, with some clusters included and others not. But if any record from a given cluster is included, all are included. For example, when analyzing product associations in shopping carts, you could cluster items by transaction ID to make sure that all items from selected transactions are maintained.

The Stratify by field specifies a categorical field used to stratify records so that samples are selected independently within non-overlapping subgroups of the population, or strata. If you select a 50% sample stratified by gender, for example, then two 50% samples will be taken, one for men and one for women. Strata may be socioeconomic groups, customer types, product types, etc. allowing you to ensure adequate sample sizes for subgroups of interest. For example, if there are three times more women than men in the original dataset, this ratio will be preserved by sampling separately from each group. Multiple stratification fields can also be specified (for example, sampling product lines within regions or vice-versa).

Note: If you stratify by a field that has missing values (null or system missing values, empty strings, white space, and blank or user-defined missing values), then you cannot specify custom sample sizes for strata. If you want to use custom sample sizes when stratifying by a field with missing or blank values, then you need to fill them upstream.

The input weight field specifies a field used to weight records prior to sampling. For example, if the weight field has values ranging from 1 to 5, records weighted 5 are five times as likely to be selected. The values of this field will be overwritten by the final output weights generated by the node.

The new output weight specifies the name of the field where final weights are written if no input weight field is specified. (If an input weight field is specified, its values are replaced by the final weights as noted above, and no separate output weight field is created.) The output weight values indicate the number of records represented by each sampled record in the original data. The sum of the weight values gives an estimate of the sample size. For example, if a random 10% sample is taken, the output weight will be 10 for all records, indicating that each sampled record represents roughly ten records in the original data. In a stratified or weighted sample, the output weight values may vary based on the sample proportion for each stratum.
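To extend the example with hypothetical strata: if men were sampled at a proportion of 0.5 and women at 0.25, sampled men would receive an output weight of 2 and sampled women a weight of 4, so the weighted totals would still estimate the original numbers of men and women.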

We are going to cluster by town and stratify by county.

Select town for the Clusters: field

Select county for the Stratify by: field (not shown)

Click OK and then OK again in the Sample node

To check the effect of sampling (you should always verify that the sample was taken as you planned), we can use a Distribution node to look at an overlay bar chart with both town and county.

Select a Distribution node from the graphs palette and connect the Sample node to the Distribution node

Edit the Distribution node and select town as the Field and county as the Overlay

Note: You may need to instantiate values in the Statistics File source node by selecting Read values in the Type tab.


Figure 2.6 Distribution Node Settings

Click Run

Copy the Distribution node and connect it directly to the Statistics File node

Run this second Distribution node

Figure 2.7 Comparison of Full Data File and Clustered and Stratified Sample

Above, in Figure 2.7, we have the full dataset graph on the left, while on the right we have the graph for the sample clustered by town and stratified by county. Because we clustered by town, not all the towns are selected, but for those that are, all the cases are retained within each county (the strata variable). About half the towns are sampled because of the 0.5 Fixed proportion setting (see Figure 2.4).

To select a random sample of units, rather than all units, from within each cluster, you can string two Sample nodes together. For example, you could first sample townships stratified by county as described above. Then attach a second Sample node and select town as a stratify field, allowing you to sample a proportion of records from within each township.

In cases where a combination of fields is required to uniquely identify clusters, a new field can be generated using a Derive node. For example, if multiple shops use the same numbering system for transactions, you could derive a new field that concatenates the shop and transaction IDs.
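A minimal sketch of such a Derive formula, assuming hypothetical fields shop_id (a string) and trans_id (an integer), uses the CLEM string concatenation operator >< together with to_string:

shop_id >< "_" >< to_string(trans_id)

The derived field then uniquely identifies each transaction across shops and can serve as the cluster field in a Sample node.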

Sample Sizes for Strata

When drawing a stratified sample, the default option is to sample the same proportion of records or clusters from each stratum. If one group outnumbers another by a factor of 3, for example, you typically want to preserve the same ratio in the sample. If this is not the case, however, you can specify the sample size separately for each stratum.

Select another Sample node and connect it to the Statistics File node

Edit the node and select Complex mode

Click the Cluster and Stratify button

Select county and nbrhood as the Stratify by: fields, in that order (not shown)

Click OK

Click the Custom option button for Sample size, then click the Specify Sizes button

Figure 2.8 Sample Size for Strata: Proportions Dialog

The Sample Sizes for Strata dialog box lists each value of the stratification field, allowing you to override the default for that stratum. If multiple stratification fields are selected, every possible combination of values is listed, allowing you to specify the size for each ethnic group within each city, for example, or each town within each county. Sizes are specified as proportions or counts, as determined by the current setting in the Sample node.

In the Sample Sizes for Strata dialog box:


Click the Read Values button at lower left to populate the display

To set a different sample size from the default, click in the Sample size cell and select Specify.

Figure 2.9 Setting Sample Size Proportion

Set the sample size proportion for the rows with a value of Central for county and 141, 142, and 143 for nbrhood to 0.4, 0.45, and 0.55, respectively

The dialog box should now look like Figure 2.10.

Figure 2.10 Sample Sizes Modified for First Four Strata

Click OK twice to exit the Sample node

Attach a Table node to the Sample node and then run the Table node


Figure 2.11 Sample with Two Strata Variables

Because we did not set a random seed, you will likely obtain different results than shown in the table above. The number of records selected will be identical—5,609—but the exact records selected, as identified by the propid value, will be different. The last field, SampleWeight, is the sampling weight field. It has a value of 2.0 for the first records, indicating that each one represents two records in the original file (this is a consequence of using a default proportion of 0.50).

Custom sample sizes may be useful if different strata have different variances in order to make sample sizes proportional to the standard deviation. (If the cases within the stratum are more varied, you need to sample more of them to get a representative sample.) Or if a stratum is small, you may wish to use a higher sample proportion to ensure that a minimum number of observations are included.

Summary

In this lesson we have introduced a number of ways to sample and select data records. You should now be able to:

Use the Sample node to select or discard records from the data

Use the Sample node to select clustered and stratified sample sets

Adjust the sample size of strata to match a required proportion


Summary Exercises

In this exercise you will use the file Charity.sav, which contains records on donations to a private charity. Two key variables are used. The first is Mosaic Bands, a field measuring geodemographic segmentation. Some categories have many donors, some have fewer, and we want to ensure a reasonable number in all categories in the final sample. The second variable is Response to campaign, which records whether or not a person gave money to a recent fund-raising campaign (coded 0 or 1). The charity will be developing a model to predict this field, and so wants to have about a 50/50 split in the two categories.

1. Place a Statistics File node on the stream and read in the file Charity.sav. Check the options Read labels as names and Read labels as data. In the Type tab, have PASW Modeler read the data values.

2. Use a Distribution node to examine the distribution of Response to campaign. Which category is more frequent?

3. Add a Sample node to the stream connected to the Statistics File node. In this node we want to take a stratified sample by Response to campaign, using unequal proportions to obtain about a 50/50 split of responders and non-responders. Use the Complex sample mode, specify Response to campaign as the Stratify by: variable, and use the Custom button to specify custom sizes. Hint: The proportion for responders should be greater than 0.5, the proportion for non-responders less than that value. Also, change the name of the Sample Weight field to SampleWeight1. And, if you want to obtain the same sample every time, set the random seed.

4. Add a Distribution node to the stream to check the distribution of Response to campaign.

5. In a second step, we will take a stratified sample of Mosaic Bands, with equal proportions of 0.7. However, fields that are used by the Sample node must be instantiated, so first add a Type node to the stream and connect it to the Sample node. Click Read values to instantiate the new field SampleWeight1. Set the random seed if you wish.

6. Now add another Sample node to the stream and connect it to the Type node. Specify Mosaic Bands as the Stratify by: field, and specify the sample size proportion to be 0.7. Change the name of the Sample Weight field to SampleWeight2.

7. Add a second Distribution node to the stream and review the distribution of Response to campaign. There should be fewer cases, but the proportion in the two categories should still be about equal. If you wish, add a Table node and review the values of the two sample weight fields. Can you explain why they have the values they do?

8. For those with extra time: Can you create the same stratified sample using only one Sample node?


Lesson 3: Working with Dates

Topics

Introduce some of the PASW Modeler date functions available to perform calculations involving date fields

Data

In this lesson we use the data file fulldata.txt. The file contains one record per account held by customers of a financial organization, and also demographic details on the customer and individual account information, including the date at which the individual first became a customer and the account opening date. Dates are stored in Day/Month/Year format.

A small data file (MultDate.txt) containing six different date fields (date of initial and subsequent purchases) for retail customers is used to demonstrate the Multiple Field mode of the Derive node.

3.1 Introduction

Data mining will often include working with fields that represent dates. This can range from simply sorting data into a chronological order, to calculating elapsed time, to performing some form of complex time series analysis.

We introduce in this lesson some PASW Modeler functions that can be used to perform date calculations. Since the same data transformations may need to be applied to several date fields, we also introduce the Multiple Mode form of the Derive node.

3.2 Reading Data Which Includes Dates

We'll begin by reading data from the file fulldata.txt.

Select a Var. File node from the Sources palette and place it on the left-hand side of the canvas

Edit the Var. File node

Browse and select the data file fulldata.txt

Notice that the option Automatically recognize dates and times is checked by default in the Var. File node. PASW Modeler tries to recognize date entries as dates or times. Now let's look at how PASW Modeler has set the Storage format for each of the fields within the file.

Click the Types tab within the Var. File source node

Click the option within the Types tab

Click the Data tab within the Var. File source node


Figure 3.1 Storage Format for Fields in fulldata.txt

Notice that the two date fields, STARTDT and OPENDATE, have been formatted as Date storage, as indicated by the date storage icon.

Click the Types tab within the Var. File source node


Figure 3.2 Type Settings for Fields in fulldata.txt

Notice that the two date fields, STARTDT and OPENDATE, have been given measurement level Continuous with their individual date values displayed in a date format. This indicates that they were fully instantiated.

Click the Preview button


Figure 3.3 Table Showing Format of the Date Fields

The two date fields display dates in a date format. Why is the date format YYYY-MM-DD? That is not the format of the dates stored in the file fulldata.txt. The answer is that dates that are read in are displayed in the date format set in the Stream Properties dialog box. As a reminder, let's review this.

Click OK to return to the Stream Canvas

Click Tools…Stream Properties…Options


Figure 3.4 Date Format in Stream Properties Dialog

We can see here that the date format for all date fields in the stream is set to display YYYY-MM-DD.

3.3 Calculations Involving Dates

In this section we list some of the most frequently used functions for date calculations and then give a simple example involving calculating the time difference between two fields containing dates. Table 3.1 contains three of the most commonly used date functions.

Table 3.1 Date Functions

date_in_years(D)
Use: Returns the time in years from the baseline date (as set in Tools…Stream Properties…Options) to date D, as a real number. The argument may be a string. The number of years is based on a year of approximately 365.0 days.
Example: date_in_years('11/02/1997') returns the value 97.1.

date_years_difference(D1, D2)
Use: Returns the time in years from date D1 to date D2, as a real number. The arguments may be strings.
Example: date_years_difference('18/02/1994', '23/5/1997') returns the value 3.26.

@TODAY
Use: Returns the current date as a string in the current date format.
Example: date_years_difference('18/02/1994', @TODAY) returns the difference between today's date and February 18, 1994, in years.

In the above functions, years can be replaced with months, weeks or days to return the time in months (based on a month of 30.0 days), weeks (based on a week of 7 days) or days, respectively. Refer to the PASW Modeler User’s Guide for additional date and time manipulation functions available within PASW Modeler.
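As a sketch of these variants, using the same pair of dates as Table 3.1 (written here in DD/MM/YYYY format):

date_months_difference('18/02/1994', '23/5/1997')
date_weeks_difference('18/02/1994', '23/5/1997')
date_days_difference('18/02/1994', '23/5/1997')

These return the same interval expressed in months (based on a 30.0-day month), weeks, and days, respectively.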

We will give a simple example of the use of the above functions to calculate the length of time between the customer starting date (STARTDT) and the account opening date (OPENDATE).

Close the Preview window

Add a Derive node from the Field Ops palette to the stream

Connect the Source node to the Derive node

Edit the Derive node

Set the field name to LENGTH_WAIT

Click the Expression Builder button

Click the drop-down list for Function category and select Date and Time

Select date_months_difference in the Function list

Many date functions are available within PASW Modeler. When you click on a function in the Function list, help about the function appears in the space below the list.

Select STARTDT in the Field list

Click the Function Insert button

Select OPENDATE in the Field list

Click the Field Insert button
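When these steps are complete, the Formula box should contain an expression along the following lines (see Figures 3.5 and 3.6):

date_months_difference(STARTDT, OPENDATE)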


Figure 3.5 Date Function in Expression Builder

Click OK

Figure 3.6 Calculating a Time Difference in Months

Click OK to return to the Stream Canvas

Connect the Derive node to a Filter node from the Field Ops palette


Edit the Filter node

Filter all fields except STARTDT, OPENDATE, and LENGTH_WAIT (not shown)

Click the Preview button

Figure 3.7 Table Showing the Difference in Date Fields (in Months)

We now have a field with the elapsed time (in months) between the two date fields.

3.4 Applying the Same Expression to Multiple Fields

In the previous example, we created a single field representing the elapsed time between two dates. When working with dates or other fields, there are times when we need to calculate a number of new fields, each one based on the same formula applied to a different field. In that instance, instead of adding multiple Derive nodes, we can use a single Derive node in Multiple mode to create multiple fields.

To illustrate this we use a small data file that records the dates of initial (AcctEst) and subsequent (Pur1, Pur2, …, Pur5) product purchases. We want to calculate the number of days that have elapsed between the initial purchase and each of the subsequent purchases.

The dates in the data are in the format MM/DD/YYYY. Because the option Automatically recognize dates and times is checked in the Var. File node, the fields will be stored as date fields, with measurement level Continuous. The stream properties were changed to reflect the MM/DD/YYYY format, so the input (data) values and output (display) values match.

Close the Preview window

Click OK to return to the Stream Canvas

Click File…Close Stream (click No when asked to save the stream)

Click File…Open Stream

Double-click on MultipleDates.str

Run the Table node


Figure 3.8 Sample File Containing Initial and Subsequent Purchase Dates

The sample dataset is quite small and the number of purchases is limited to provide a clear illustration of the process.

In order to calculate the number of days that elapsed between the initial purchase (when the account was established—AcctEst field) and later purchases (Pur1 to Pur5), the same differencing operation needs to be performed for each purchase date. The Multiple mode of the Derive node allows us to do this.

The basic logic of the calculation requires obtaining a difference in days between two dates (available via the date_days_difference function), where one date field is always the date the account was established (AcctEst) and the second date is one of the purchase date fields.

Connect a Derive node to the Type node

Edit the new Derive node

By default, a Derive node is set to Single mode, which means that a single field will be created by the dialog. Switching to Multiple mode allows multiple fields to be created from a single base formula, within which selected variables are substituted.

Click the Multiple Mode option button


Figure 3.9 Derive Node in Multiple Mode

The Derive dialog changes when Multiple mode is selected. The Derive field text box is replaced with a Derive from box in which we select the fields on which the calculations are based. This is because the new field names will be automatically generated. New field names are created by adding a prefix or suffix from the Field name extension text box to the original field names selected in the Derive from list box.

The Formula text box still supplies the equation and the Expression Builder can be invoked. The difference is that @FIELD will be entered in the Formula box as a placeholder for the field names selected in the Derive from list box, which will be used in the formulas. The dialog contains a tip to remind you of this.

Select Pur1, Pur2, Pur3, Pur4 and Pur5 in the Derive from box

Enter _Time in the Field name extension text box

Click the Add as: Suffix option button (if necessary)

Type (or use the Expression Builder) into the Formula text box:

date_days_difference(AcctEst, @FIELD)


Figure 3.10 Calculating Several Elapsed Time Fields

The number of fields created by the Derive node is determined by the number of fields selected in the Derive from list box. For each of these fields (here Pur1 to Pur5), the Derive node calculates the difference in days between that date and the date the account was established. The results will be stored in new fields whose names are generated by concatenating each field name in the Derive from list box with the suffix _Time. In this example, five new fields will be created (Pur1_Time to Pur5_Time).

Click OK
Connect the Derive node to a Table node
Run the Table node connected to the Derive node

Figure 3.11 Fields Created from Multiple Mode Derive Node

Five new fields, recording the number of days elapsed since the AcctEst date, have been created by the Derive node. A field has a missing value ($null$) when no corresponding purchase was made.

Close the Table window


In this way, the multiple mode form of the Derive node can be used to apply the same formula to a series of fields. This feature is handy when multiple date fields are contained in a data file and on any occasion when the same equation must be applied to different fields.

Summary

In this lesson we introduced how to handle dates within PASW Modeler. You should now be able to:

Use the Derive node and PASW Modeler date expressions to perform calculations involving date fields

Use the Multiple mode within the Derive node


Summary Exercises

In this exercise we will use the file custandhol.dat, which contains information on trips taken by customers of a travel agency. Your task is to calculate two new fields representing the customer's age and the month in which she traveled. The field DOB records date of birth, and the field TRAVDATE records the date a trip was taken.

1. Set the Stream Properties date format to DD/MM/YY and change the setting so that 2-digit years begin with 1905 (click Tools…Stream Properties…Options). Any date with a year less than 05 will then be recognized as 200x and not 190x. This assumes that no customer is aged 94 or over.

2. Begin with an empty stream canvas and use a Var. File node to read the custandhol.dat file. Make sure the option Automatically recognize dates and times is checked.

3. Check the type of the two date fields, DOB and TRAVDATE, using the Types tab. What is the storage of these two fields? And what is the measurement level?

4. Create a new field, named age, that represents the customer's age (in years) on the date they traveled. Connect and edit a Derive node. (Hint: Use a PASW Modeler date expression.) Check the values of age using a Table node. What is the type of this new variable?

5. Modify the Age Derive node to create the age field as a rounded integer. (Hint: Use the round function.) Check your result by using a Table node.

6. Using a second Derive node connected to the first, create a new field, called hol_month that represents the month in which the customer traveled. Use the appropriate date function to derive this new field (hint: datetime_xxx functions extract a part of a date). Use a Table node to check your results.

7. Save the stream as Less3Exercise.str.

For those with extra time:

1. Open the MultipleDates.str stream.

2. Using the multiple form of the Derive node, calculate five new fields representing the number of months from each purchase to the current date. Connect a new Derive node; use the suffix _today for the new field names. (Hint: You can use the global @TODAY to represent the current date.) Check the results using a Table node.


Lesson 4: Working with String Data

Topics

Introduce some of the common issues involved in dealing with string data

Introduce some of the PASW Modeler string functions available to help manipulate the format of string data

Demonstrate how to clean the data in a string field so that all records have the same format

Data

In this lesson we will use the data file TelephoneData.txt, which is also available as a Statistics data file. The file contains the name and telephone number of a group of customers for a company based in the UK. Ideally, the company would like to use the phone numbers in some form of telemarketing campaign, either by physically phoning the customer or possibly sending a text message directly to the phone.

4.1 Introduction

It is not unusual for fields containing string data to require a large amount of data preparation. When text data are entered into a database by customer sales or support representatives, it is very likely that the data will not have a consistent format. Typically, some data will be entered in upper case while some is in lower case. There is a potentially limitless number of abbreviations that people use when entering text data, all of different formats and lengths; additionally, people use spaces and hyphens in all manner of places in the text. Since text or string data is case sensitive, all these differences will cause PASW Modeler to treat values such as “YES” and “yes” as not equivalent.
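One simple defense against case differences is to standardize the case of a string field before comparing values. As a minimal sketch (the field RESPONSE is hypothetical), a Derive node with the formula

lowertoupper(RESPONSE)

returns 'YES' for both 'yes' and 'YES', so the two values then compare as equal. (The companion function uppertolower converts in the opposite direction.)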

One common application of text data is the analysis of data from web-logs, which store a history of user visits and actions at websites. As you can imagine, web-logs can be huge files, containing lots of string data in the form of the URL of the page visited. If an analyst wishes to analyze the log file in order to find patterns of usage, then lots of data manipulation has to be performed on the file in order to extract sequences of web pages. The Sequence algorithm within PASW Modeler simplifies the analysis of this type of data, but the hard (and time consuming) work is manipulating the string data into a format that can be understood by the analytical algorithm.

4.2 Manipulating String Data

When working with string data, it is common to find that the data are not in a uniform format. Therefore, one of the first tasks when working with string data is to standardize and clean the data thoroughly. The PASW Modeler language contains a number of functions that can be used on string fields in order to manipulate them into the desired format.

We will first introduce some of the expressions and then illustrate their use in manipulating the telephone numbers in TelephoneData.txt to a common format. Note that in all of the following function descriptions we will use the following conventions to refer to different types of arguments or results:


Table 4.1 Conventions

BOOL: A Boolean or flag (True or False)

NUM: Any number (REAL is used for real values and INT for integers) or a field containing such values

CHAR: A character code (often given in single back quotes)

STRING: A string value in single quotes, or a field name containing string values

Some Useful String Functions

Table 4.2 String Functions

length(STRING): Returns an integer representing the number of characters in STRING. Example: length('London') returns the value 6.

STRING1 >< STRING2: Concatenates or joins the two STRING expressions. Example: 'New ' >< 'England' returns the string 'New England'.

substring(N, LEN, STRING): Returns a string consisting of LEN characters of STRING, starting from the character at position N. Example: substring(1, 2, 'W12 4PQ') returns the string 'W1'.

locchar(CHAR, N, STRING): Searches STRING for the character CHAR (given in single back quotes), starting the search at position N. Returns an integer representing the position at which CHAR was found; if it is not found, it returns the Boolean false. (locchar_back can be used to search backwards, starting from the Nth character.) Example: locchar(`.`, 1, 'initial.surname') returns the value 8.

substring_between(N1, N2, STRING): Returns the substring of STRING that begins at position N1 and ends at position N2. Example: substring_between(2, 4, 'tennis') returns the string 'enn'.

strmember(CHAR, STRING): Equivalent to locchar with N = 1; that is, it starts searching at the beginning of STRING. Example: strmember(`.`, 'initial.surname') returns the value 8.

allbutfirst(N, STRING): Returns a string consisting of all characters within STRING except for the first N characters; that is, it removes the first N characters of STRING. Example: allbutfirst(3, 'Mr Smith') returns the string 'Smith'.

allbutlast(N, STRING): Returns a string with the last N characters removed. Example: allbutlast(6, 'John Smith') returns 'John'.

Note

In all of the functions shown in Table 4.2, the argument STRING can be a quoted string or a field name (without quotes). Note also that when searching for a particular character within a string field (using the locchar function, for example), the character must be enclosed within a single back quote. The back quote should not be mistaken for an apostrophe; the difference between the two is displayed below, enclosed in parentheses.

Single apostrophe ( ' )
Single back quote ( ` )

Refer to the PASW Modeler User’s Guide for further string manipulation functions available within PASW Modeler.
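These functions are often combined in a single formula. For instance, to extract everything before the first space in a string (a small sketch using a literal value):

substring(1, locchar(` `, 1, 'W12 4PQ') - 1, 'W12 4PQ') returns the string 'W12'

Here locchar finds the space at position 4, and substring then returns the three characters that precede it.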

In the next section we will illustrate a number of the above functions in order to format one of the fields in TelephoneData.txt.

4.3 Example of String Manipulation

We begin by reading in the data and viewing it.

Begin with a clear Stream Canvas
Select the Var. File node and place it on the Stream Canvas
Edit the Var. File node and set the file to TelephoneData.txt held in the C:\Train\ModelerDataPrep directory
Make sure the Read field names from file option is checked
Click the Preview button

Figure 4.1 Table Showing Formats of PHONENUM Field

As you can see, there is a unique identifier for each customer, along with their name, title, and contact telephone number. The telephone numbers are considered to be strings because they contain non-numeric characters, spaces, and leading zeros. Notice that within the field PHONENUM the data are not in a consistent format. All phone numbers in the UK comprise 11 digits, typically broken down into a 5-digit area code followed by a 6-digit personal telephone number. (This differs in London, where there is a 4-digit area code and a 7-digit number.)

Therefore, a UK-based telephone number should take one of the following formats:

01483 719200 Non-London Telephone Number

0208 6734000 London Telephone Number


Notice that in this particular example, while some of the telephone numbers do conform to the desired telephone protocol, there are many numbers that don’t. Common issues with telephone numbers are that some people enclose the area code in parentheses, while others have used the country code at the start of the number, in which case the leading zero is dropped from the area code. For example, we have data of the following types:

(01483) 719200 Area Code in parentheses

+44 1483 719200 Country Code precedes the area code

This is by no means an exhaustive list of issues that can occur in telephone numbers, or in any string data, but these will give us a chance to clearly demonstrate the types of data manipulation procedures that can be used to clean the data, and to manipulate strings, in general.

Our goal for the string manipulation is to format every telephone number without a country code or parentheses around the area code, and with the area code and local number separated by a single space.

Note

When creating the following stream a separate Derive node will be used for each step of the process, as this will help clarify the logic of the data manipulation for new users. As you become more familiar with string data manipulation, it is highly likely that you will be able to combine many of these steps together and obtain the same result with far fewer Derive nodes.

First, we will sort the data by PHONENUM so that formats of a similar type will be together. As the country code is always at the beginning of the phone number, we will then use the isstartstring function to determine which records start with the country code “+44”.

Close the Preview window
Click OK to return to the Stream Canvas
Add a Sort node from the Records Ops palette and connect it to the Source node
Edit the Sort node
Select PHONENUM to be the sort field and sort in Ascending order (not shown)
Click OK
Add a Derive node and connect it to the Sort node
Edit the Derive node
Enter the name FIND_CC in the Derive field edit box
Derive the field as Formula, with the default field type
Either using the Expression Builder or by typing directly into the edit box, enter

isstartstring('+44', PHONENUM) in the Formula: text box

This operation identifies records whose number begins with the country code.
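For example, with values from this file, isstartstring('+44', '+44 1483 719200') returns true, while isstartstring('+44', '(01483) 719200') returns false. Because the result is a true/false value, it can be tested downstream with a condition such as FIND_CC = 1.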


Figure 4.2 Derive Node Determining if a Country Code is Present

Click OK

For records that start with a country code, we will now create a Derive node that removes the country code from the start of the phone number. Remember, when a country code is used, the leading zero is dropped from the start of the area code. Therefore, as well as removing the country code, we will need to replace it with a zero.

Connect a second Derive node to the first
Edit the Derive node
Name the new field REMOVE_CC
Set Derive as: to Conditional
In the If: textbox enter FIND_CC = 1
In the Then: textbox enter '0' >< allbutfirst(4, PHONENUM)
In the Else: textbox enter PHONENUM

The result of these specifications is that if a country code is present, the first four characters will be removed, replaced with “0”, and the result written into the new field REMOVE_CC. If a country code is not present, then the original data stored in PHONENUM will be written to the new field REMOVE_CC. The completed dialog is shown in Figure 4.3.
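To trace the logic with a value from earlier (a sketch, assuming the number '+44 1483 719200'): allbutfirst(4, PHONENUM) drops the leading '+44 ' (three characters plus the space) to leave '1483 719200', and '0' >< '1483 719200' then restores the leading zero, producing '01483 719200'.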


Figure 4.3 Derive Node which Removes the Country Code at the Start of the Phone Number

Click OK

Now that the country code has been removed, the second issue that requires attention is the removal of the parentheses from around the area code, if they are part of the number. In order to do this we must first determine which records contain parentheses.

Connect a third Derive node to the Derive node named REMOVE_CC
Edit the Derive node
Name the new field OPENP
Set Derive as: to Formula
Enter locchar(`(`, 1, REMOVE_CC) in the Formula: text box

Remember that formulas are case sensitive with regard to functions as well as field names, and that the character to be searched for (an open parenthesis in our case) must be enclosed in a single back quote.


Figure 4.4 Derive Node which Finds Numbers Containing an Open Parenthesis

Click OK

Now that we know the character location of the open parenthesis, we must also find the character location of the close parenthesis (remember, this is different for area codes within London and outside of London).

Connect a fourth Derive node to the Derive node named OPENP
Edit the Derive node
Name the new field CLOSEP
Enter locchar(`)`, 1, REMOVE_CC) in the Formula: text box (not shown)
Click OK

Alternatively, you can copy the Derive node named OPENP, and then paste it into the stream. Then you can use this as a base to create the Derive node for CLOSEP.

Now that we know the character locations of both parentheses, we can extract the area code from within them.

Connect a fifth Derive node to the Derive node named CLOSEP
Edit the Derive node
Name the new field PHONENUMBER
Set Derive as: to Conditional
In the If: textbox enter OPENP = 1
In the Then: textbox enter substring_between(OPENP+1, CLOSEP-1, REMOVE_CC) >< allbutfirst(CLOSEP, REMOVE_CC)
In the Else: textbox enter REMOVE_CC


The completed dialog box is shown in Figure 4.5. The conditional formula states that when a telephone number has an open parenthesis, we concatenate two strings. The first string is the substring of the text between the parentheses (the area code). The second string is the remainder of the number after the close parenthesis (the remainder of the string, or the local phone number).
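Again tracing the logic with a value from earlier (a sketch, assuming '(01483) 719200', for which OPENP is 1 and CLOSEP is 7):

substring_between(2, 6, REMOVE_CC) returns the area code '01483'
allbutfirst(7, REMOVE_CC) returns ' 719200'

and concatenating the two with >< yields '01483 719200'.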

Figure 4.5 Derive Node which Removes the Parentheses from the Area Code

Click OK

The data manipulation is now complete, and all troublesome records should be cleaned. In order to check that this is the case we will look at the results of the Derive nodes in a Table.

Connect a Table node to the final Derive node

The final stream is shown in Figure 4.6.


Figure 4.6 Completed Stream which Cleans and Standardizes the Telephone Numbers

Run the Table node

Figure 4.7 Telephone Numbers with Country Code and Parentheses Removed

At this point you may want to filter out the intermediate fields created during the process in order to reduce the number of fields in the data file. Similarly, in order to reduce the stream length you may wish to encapsulate the Sort and Derive nodes in a SuperNode.

All the telephone numbers now have the same format.

Close the Table window

Extensions

To see these operations condensed into two Derive nodes, open the stream file named CondensedTelephone.str.


Hint

We built a series of Derive nodes and then ran the stream. Until you are comfortable with the PASW Modeler functions, we recommend you test each new Derive node (attach it to a Table node and run it) as you build it. The alternative is to use the Expression Builder with its validation feature, but it is still helpful to check that the calculation is working as you intend.

Summary

In this lesson we introduced how to manipulate string fields within PASW Modeler. You should now be able to:

Locate specific characters within a string field

Break a string field down into a series of components (or substrings)

Build a series of strings into one larger string, through concatenation


Summary Exercises

In this exercise you will manipulate a field containing dates so that it conforms to one of the supported PASW Modeler date formats.

1. Load the stream Problem_Date.str which reads the data file Account_DateProb.dat.

2. Using the Options tab within the Stream Properties dialog (click Tools…Stream Properties…Options), make certain the date format is set to MM/DD/YYYY.

3. Attach a Table node to the Type node and run it. You will notice that the field Open_Date needs substantial restructuring to conform to the desired date format.

4. Attach a Derive node to the Type node. Name this Derive node DatePosition. Some of the dates contain the string “Open_” which must be removed. As a preliminary step, locate the column location of the underscore within the dates containing this string. Hint: use the locchar function.

5. Attach a Derive node to DatePosition and name it JustDate. Use the allbutfirst function to extract only the date portion of Open_Date without the “Open_”. Attach a Table node and examine the results.

6. In the case of the Date values that do not contain the day portion, we will insert a proxy day value of 15. We can identify the date values with this problem by counting the number of slash marks. Dates with only one slash mark will need to be modified. Attach a Derive node to JustDate and name it Count_Slash. Use the count_substring function to count the number of slashes in each data value.

7. Attach a Derive node to Count_Slash and call it NewDate. Select Conditional from the Derive as: pulldown menu. In the case of date values with one slash, insert the number 15 and a second slash mark. For example, 2/2002 becomes 2/15/2002.

8. Attach a Table node to NewDate and examine the results.


Lesson 5: Data Transformations

Topics

Introduce the Set Globals node to use summary values for fields in expressions

Modify continuous fields with the Transform node

Bin or group continuous fields with the Binning node

Data

We use the Statistics data file customer_offers.sav, which contains customer data from a telecommunications company. It includes demographic data, information on use of telecommunications services, lifestyle information, whether a customer recently switched providers, and response to three offers from the firm.

5.1 Introduction

As we have learned in the previous lessons, after you have explored the data file and fields, you often need to modify them. When working with continuous data there are several nodes designed expressly for continuous fields that can assist in transforming data for modeling or other analysis. The Set Globals node calculates summary values for a field, such as the mean, and makes these available for use in PASW Modeler expressions. The Transform node provides an interactive mechanism to apply a number of standard statistical transformations to a field to reduce the impact of outliers and/or make a distribution more normal. The Binning node can group or categorize continuous fields with five different methods. We provide examples of all three of these nodes in this lesson.

In Appendix B we illustrate what occurs when you make an error in a PASW Modeler expression when transforming a field.

5.2 Using Summary Statistics with the Set Globals Node

In PASW Modeler you can replace a blank value with the mean value for that field without having to first explicitly calculate the mean. You can do this from the Data Audit node output. The Data Audit node also identifies outliers and extreme cases for a field by using the mean and standard deviation.

If you would like to gain complete control of these processes, or for other calculations using summary values for all the records, you can instead use a Set Globals node, located on the Output palette. This node calculates and stores in memory various summary values, such as the mean or standard deviation. You can then use special global functions in PASW Modeler expressions to use these summary values to modify existing fields or create new ones.

The Set Globals node calculates the mean, standard deviation, minimum, maximum, and sum for continuous fields and stores these values for use in other nodes. The Set Globals node is found in the Output palette and is a terminal node. It must be run before the stored values can be used in other parts of a stream.

We will use the customer_offers.sav Statistics data file in this lesson.

If the Stream Canvas is not empty, click File…New Stream
Place a Statistics File node on the Stream Canvas
Edit the node and set the file to customer_offers.sav in the c:\Train\ModelerDataPrep directory


Click OK to return to the Stream Canvas
Add a Type node to the stream
Connect the Statistics File node to the Type node
Edit the Type node
Click in the Values cell for age and select Specify
Click the Define blanks check box
Enter 99 as a missing value

Figure 5.1 Setting 99 as a Blank Value for Age

Click OK
Click the Read Values button
Click OK to close the Type node
Add a Set Globals node from the Output palette to the right of the Type node
Connect the Type node to the Set Globals node
Edit the Set Globals node


Figure 5.2 Set Globals Dialog

By default, not only the global mean but also the sum, minimum, maximum, and standard deviation will be calculated. The Clear all globals before executing option is used to remove all previously calculated global values before calculating new ones. The Display preview of globals created after execution option is used to display the value of the global value(s) after the node is run.

We will initially use the Set Globals node with age to make a point about how it operates with blank (user-defined) missing data. You will recall that there are many customers with unknown age, and these people have a code of 99 for age, which is defined as missing (a blank) in the Type node. Let’s see whether these people are included in the calculation of the mean.

Select age as the field
Uncheck the SUM, MIN, MAX, and SDEV options to the right of the age field so that only the Global Mean will be calculated
Check the Display preview of globals created after execution option (not shown)
Click Run

Figure 5.3 Global Mean Value for Age


When the Set Globals node is run, an informational dialog box appears with the summary statistics we requested. We see that the Global Mean for age is 46.997. The question, though, is whether the mean has been calculated by including the missing data, or not.

To check this, we can rerun the Set Globals node after deleting records with age values of 99.

Close the Set Globals output window
Add a Select node to the stream between the Type node and the Set Globals node
Bend the arrow connecting the Type node to the Set Globals node over the Select node to connect it between them
Edit the Select node
Type the expression age < 99 in the Condition: box

Figure 5.4 Selection of Records with Age Less Than 99

Click OK
Run the Set Globals node

Figure 5.5 Global Mean Value for Age with Missing Records Removed


The mean for age is again 46.997, so the records coded 99 were excluded from the first calculation as well. Clearly, the Set Globals node respects user-defined missing values when they have been defined as blanks.

We will now use the Set Globals node for the field income, calculating the mean and standard deviation. Then we will employ them in an expression to calculate a z-score, which is a measure of how deviant a case is from the mean.

Close the Set Globals output window
Delete the Select node
Connect the Type node to the Set Globals node
Edit the Set Globals node
Remove age as the field
Add income as the field
Uncheck the SUM, MIN, and MAX options so that only the Global Mean and Standard Deviation will be calculated (not shown)
Click Run

Figure 5.6 Global Mean and Standard Deviation for Income

Although the number of records used for this calculation is not listed, we can be certain that only valid data were used because the missing data has a value of $null$, which cannot be used in mathematical calculations.

The global mean and standard deviation have now been stored in memory and are ready to be used in PASW Modeler expressions.

As an important aside, the mean income is above the midpoint (median) of the distribution (look at a histogram for income to see this). This is because income is positively skewed. We haven't yet addressed this characteristic of the distribution of income, but there are certainly anomalous or outlying records for this field.

Our task is to create a new field that is the z-score, or standardized score, for each record on income. A z-score is a measure of how deviant a value is in a distribution. It gives the number of standard deviations that the score is away from the mean value. The z-score for a specific record, say i, on field X is defined by:

Z score for record i on field X = (value record i on field X - mean field X) / standard deviation field X


We can use a Derive node with a formula to create the z-score.

Close the Set Globals output window
Add a Derive node to the stream and connect the Type node to it
Edit the Derive node
Change the Derive field: name to zscore_income
Click the Expression Builder button
Click the dropdown list of functions and select @Functions
Scroll down until you see the list of functions that begin with @GLOBAL

Figure 5.7 List of Global Functions

The Global functions are used by substituting a field name in parentheses that was specified in a previously run Set Globals node. The five functions refer to the five summary statistics that can be created in a Set Globals node.

Insert income into the Expression box
Insert a minus sign
Insert @GLOBAL_MEAN(FIELD) into the Expression box
Substitute income for the question mark
Add parentheses around the whole expression
Add a division symbol /
Insert @GLOBAL_SDEV(FIELD)
Substitute income for the question mark
Click the Check button

Your screen should now look like Figure 5.8. (Alternatively, the globals can be picked in the Expression Builder by selecting them instead of Fields.)
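If you prefer to type rather than build the expression, the completed formula reads:

(income - @GLOBAL_MEAN(income)) / @GLOBAL_SDEV(income)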


Figure 5.8 Expression to Calculate Z Score for Income

Click OK, and then click OK again
Add a Table node to the stream and connect it to the Derive node
Run the Table node

Figure 5.9 Z Score Values for Income

The new zscore_income field can be used to identify records that are outliers on income. In a large file, you might use a value of 3 or greater, the same criterion used in the Data Audit node. Unlike the Data Audit node output, though, this method allows you to work directly with the z-score to select exactly which records you consider to be outliers. You can then reduce the impact of outliers with a variety of methods (see the Introduction to PASW Modeler and Data Mining course).


The global values can also be used to impute missing data with your own formulas, rather than relying on the choices available in the Data Audit node.

5.3 Transforming Continuous Fields

When the distribution of a numeric (continuous) field is not normal but is instead skewed, it can cause problems with developing successful models. Such a field, e.g., tollmon (the number of minutes of toll free service), will typically also have outliers, as the skewness is, in part, caused by the outlying values. Reducing the effect of the outliers, though, will not, as a rule, remove the non-normality.

Several of the modeling techniques in PASW Modeler that are based on traditional statistical theory function best with normal data, including regression, logistic regression, and discriminant analysis. These techniques rely on assumptions about normal distributions of data that may not often be true for real world data. One approach to handle this is to apply transformations that modify a field’s values so that the overall distribution is more normal.

The Transform node, located in the Output palette, provides the capability to perform a visual assessment of the best transformation to use for a field. You can see whether variables are normally distributed and, if necessary, choose the transformation you want and apply it. You can pick multiple fields and perform one transformation per field. After selecting the preferred transformations for the fields, you can generate Derive or Filler nodes that perform the transformations. Then you can attach these nodes to the stream and use them for further analysis and modeling.

We’ll apply the Transform node to both tollmon and income.

Close the Table window
Add a Transform node from the Output palette to the stream
Connect the Type node to the Transform node
Edit the Transform node
Specify income and tollmon as the Fields
Click the Options tab


Figure 5.10 Options Tab in Transform Node

There are five different methods, or transformations, available to modify a field. These methods have long been used in standard statistics. A common method to transform a skewed field toward normality is to take the natural or base 10 log. The default is to use all the formulas in the output.

When there are more than a handful of values of 0 for a field, you should enter an offset value for the Inverse and the two Logs. This is because the inverse and log are undefined for values of 0, so those records would not be transformed and would receive values of $null$ on the transformed field. The offset value can be quite small, although when you are using more than one field at the same time, the offset will apply to all.

In this instance, we’ll enter 1 as the offset, which means 1K of income or 1 minute of toll free service, respectively.

Click Select formulas
Click all five check boxes
Enter 1 as the Offset for the Inverse, log n, and log 10 (not shown)
Click Run

Figure 5.11 Transform Output for Income and Tollmon


The Transform Output window displays thumbnails of histograms for the original distribution (labeled “Current Distribution” when the window first opens), and the distributions for the selected transformations. The output from the Transform node is interactive, so you can double-click on a thumbnail and it will open in a new window.

We can immediately see that the original distributions for the two fields are skewed, especially for income. There is also the issue of all the customers with 0 minutes of toll free service. What happens to a distribution with lots of zeros when it is transformed? We'll answer this question below.

What may seem truly remarkable is how well some of the transformations change the shape of a distribution towards normality. Both the LogN and Log10 distributions for income are quite normal in appearance. We can examine one of them in detail.

Double-click on the thumbnail graph for LogN for income

PASW Modeler adds a normal curve to the histogram, based on the mean and standard deviation of the transformed values. The distribution is now very close to normal for all intents and purposes.

Figure 5.12 Histogram of the Natural Log of Income

Transforming a field is an alternative to reducing the influence of outliers. However, there is no free lunch. When we change the value of outliers, we leave the bulk of values unchanged. When we transform a field with the natural log, as with income, we are modifying all the valid values. And if this field is used in a model, predictions will be made in the scale of the transformed field. This can be especially tricky when you are interested in the effect of a specific field on a target, as all interpretations must be made based on the transformed values. If, on the other hand, you are interested purely in prediction, or in the relative weight of a predictor compared to other fields, then the use of transformations may be a good strategy.
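Keep in mind that the transformation can be reversed when you need results in the original units. Given the natural log and the offset of 1 used here, a hedged sketch of the back-transformation in a Derive node is

exp(income_LogN) - 1

which recovers the original income value (in thousands) from the transformed field.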


Note also that you should generally not transform a continuous target field, such as revenue from a customer. Doing so only further complicates the interpretation of a model, and if you transform both inputs and a target, understanding the effects of predictors becomes exceedingly difficult.

Close the Histogram window and return to the Transform output window

If we want to use the log transformed values for income, we can generate a Derive node, which will be incorporated in a SuperNode.

Click the LogN thumbnail for income to select it, if necessary
Click Generate…Derive Node

Figure 5.13 Generate Derive Node Dialog

The transformed field can be created based on the selected transformation, or it can be created and then standardized by creating z-scores from the transformed data (we were introduced to z-scores in the previous section).

Click OK
Switch to the stream
Right-click the generated SuperNode and select Zoom In
Edit the Derive node labeled income_LogN


Figure 5.14 Derive Node Calculating the Natural Log of Income

The Derive node contains the formula to calculate the natural log. Note that a 1 is added to income because of the offset we specified. You could always change that to a smaller value, since it represents 1K of income, and all values are affected by the offset. You could also modify the Derive node to a conditional formula and only apply the offset when income is equal to zero. Actually, in these data, there are no customers with an income of 0 (otherwise they probably wouldn't be customers), so the offset isn't even required.
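A minimal sketch of such a conditional formula, using the if…then…else…endif form of a CLEM expression (the generated node itself uses the simpler log(income + 1)):

if income = 0 then log(income + 1) else log(income) endif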

Click OK
Right-click on the stream and select Zoom Out
Return to the Transform output window

The thumbnail graphs for tollmon show that the records with values of 0 (changed to 1 with the offset), are still distinct from the remainder of the distribution.

Double-click on the Log10 thumbnail graph for tollmon


Figure 5.15 Histogram for the Log Base 10 of Tollmon

Those customers with non-zero values for tollmon now have a normal appearing distribution, but the original values of 0 are now zero again (because the log of 1 is 0), and they remain separated from the other cases. This example makes it clear that transformations can solve problems of non-normality, and of some, but not all, types of outliers.

Close the Histogram window, and then close the Transform output window

5.4 Binning Fields

Another transformation that can be applied to a continuous field is to bin it into a smaller number of categories, effectively turning it into an ordered set. There are several reasons why you might bin a field:

Some algorithms may perform better if an input has fewer categories, including multinomial logistic or a decision tree.

Even when an algorithm handles a continuous field by grouping it, such as a CHAID model, you may wish to control the grouping beforehand to create meaningful categories rather than rely on the grouping method of the algorithm.

The effect of outliers can be reduced by binning, so that all the outliers and extreme cases can be placed in one bin.

Binning solves problems of the shape of a distribution since the continuous field is turned into an ordered set.

Binning can allow for data privacy by reporting such things as salaries or bonuses in ranges rather than the actual values.

The Binning node allows you to automatically create new set fields based on the values of one or more existing continuous fields. The Binning node is located in the Fields Ops palette. There are five methods of binning available:


Fixed-width binning: This choice transforms a field into groups of equal width, or ranges, on the original field, such as age in groups of 20-29, 30-39, etc.

N-Tiles/percentiles: This choice transforms a field into groups based on percentiles. Thus, the choice of quartiles will create four groups of equal numbers of records.

Ranks: This choice transforms a field into ranks, from 1 to N, where N is the number of distinct values in the original field.

Mean and standard deviation: This choice transforms a field based on a z-score, with the groups defined as a number of standard deviations below and above the mean.

Optimal binning: This choice transforms a field based on a second supervising field. The transformation is done so that there is maximum separation between groups in the binned field’s relationship with the supervising field.

We’ll demonstrate both the percentile and optimal binning methods.

Add a Binning node from the Fields Ops palette to the stream
Connect the Type node to the Binning node
Edit the Binning node

Figure 5.16 Binning Node Settings Tab


The selection of binning method is made from the Binning method dropdown list. The new fields will, by default, receive a suffix added to the field name that also identifies the binning method (the Fixed-width method uses a suffix of _BIN). The bottom of the dialog box will change based on the binning method. For fixed-width binning, you can either specify the bin width, which will begin at the minimum value, or you can specify the number of bins (if you wish more control than this for fixed width binning, you can use a Reclassify node).

Percentile Binning

We'll use the percentile method to bin cardspent, which records the amount of money spent last month on a customer's primary credit card. Like most financial fields, it is positively skewed with some large outliers (you may wish to look at a histogram of cardspent before completing this example).

Specify the Bin fields: as cardspent
Click the Binning method dropdown and select Tiles (equal count)
Click the Vingtile (20) check box
Click the Decile (10) check box to deselect it

Figure 5.17 Vingtile Binning Selection for Cardspent

The Vingtile method will create 20 bins. The outliers will be in the upper bins, but they will not be widely separated in distance from the other records, as the new field will be coded from 1 to 20.


After the node is run, the Bin Values tab will show the bins and the original values contained within each.

To see the effect of binning, we need to add a Distribution node to the stream.

Click OK
Add a Distribution node to the stream near the Binning node
Connect the Binning node to the Distribution node
Edit the Distribution node
Select cardspent_TILE20 as the Field (not shown)
Click Run

Figure 5.18 Distribution Table of Binned Cardspent

Each bin contains the same number of records (or almost). There are 250 records in each bin, or 5% of the data file (with the exception of bins 8 and 9). The actual data values for cardspent_TILE20 run from 1 to 20.

PASW Modeler considers the new binned field to be a nominal field, not an ordinal field. To change its measurement level, you would add a Type node downstream from this and manually change the type, then have PASW Modeler reread the data. This is important for modeling because ordinal fields are handled differently than nominal fields by some models. Also, the field needs to be instantiated, in any case, before being used in a model (as is true for any new field).

We can see what values are associated with each bin by returning to the Binning node.

Close the Distribution output window
Edit the Binning node
Click the Bin Values tab


Figure 5.19 Lower and Upper Ranges for Each Bin for Cardspent

To create bins with equal numbers of records, the range of values in each bin is not equal. Most important, notice that for bin 20, the range is very large because it must encompass all the outliers, which are effectively grouped together in this last bin. The outliers are no longer extreme in the new binned field.

Click OK

Some additional thoughts about binning fields:

1. Binning, whatever its advantages, has the disadvantage of reducing the amount of information available about a field. The range of values on cardspent has been reduced from over 5,000 to 20. This may be a necessary tradeoff, but binning should only be done with a clear objective in mind, i.e., to reduce the effect of outliers or odd distributions or because the model will perform better.

2. If you can, try both the original field and the binned field in a model (some analysts even add both fields to a model and let the algorithm choose between them).

3. There is no obviously superior method of binning, so you might try several, if you have the time, at least for some of the more important input fields.


Note

Like other data transformations in PASW Modeler, the Binning node treats user-defined missing values (blanks) as valid. They are included in the new bins. To ignore blanks during binning, you should use a Filler node to replace the blank values with the system null value, or take some other action, such as estimating these missing values.

Optimal Binning

For our second example, we will use optimal binning on longmon, the number of minutes of long distance service last month for a customer. The Optimal binning method bins a continuous field with the help of a separate categorical (flag, nominal, ordinal) field which is used to guide or “supervise” the binning process. The supervising field should be at least moderately related to the field to be binned.

The basic steps of the Optimal Binning algorithm can be characterized as follows:

1. Preprocessing (optional). The binning input field is divided into n bins (where n is specified by you), and each bin contains the same number of records, or as near the same number as possible. By default, the maximum number of bins is 1,000 in preprocessing.

2. Identifying Potential Cut Points. Each distinct value of the binning input that does not belong to the same category of the guide field as the next larger distinct value of the binning input field is a potential cut point.

3. Selecting Cut Points. The potential cut point that produces the greatest information gain is evaluated by the Minimum Description Length Principle (MDLP) criterion. This is repeated until no potential cut points are accepted. The accepted cut points define the end points of the bins.

The procedure aims to minimize the entropy of the bins, which is simply a measure of the diversity in the bins of the categorical supervising field. If there is only one value of the categorical field in a bin of the continuous field, then the entropy is a minimum (equal to 0). That is the ideal, but the entropy will always be greater than 0 in practice.
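For reference, the entropy of a single bin is the standard information-theoretic measure (a brief sketch of the definition, not Modeler output): entropy = -Σ p_i × log2(p_i), where p_i is the proportion of the bin's records falling in category i of the supervising field. A bin with a 50/50 split on a flag field has entropy -(0.5 × log2(0.5) + 0.5 × log2(0.5)) = 1, while a bin containing only one category has entropy 0.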

Before binning, you can examine the relationship between the continuous and categorical fields to see whether the categorical field might successfully supervise binning, and we’ll follow this logic in our example.

Counterintuitively, in modeling, we often use a target field to supervise the binning of an input field. This might seem a bit unreasonable, but the practice is used in data mining because models are assessed with a test or validation dataset. That will be a check on any models that are too greedy, taking advantage of relationships that are only in the training data.

By optimally binning an input field with the target field, we create the ideal cutpoints to best separate—predict—the target’s values. Since the supervising field must be categorical, this procedure can only be used with categorical target fields. However, you can always use another categorical field in your data to bin a continuous input field.

One of the target fields in the data file is churn, which records whether a customer switched telecommunication providers last month (coded no or yes). The first step is to review the relationship between this field and longmon. For this we can use a histogram of longmon with churn as the overlay.


Add a Histogram node to the stream
Connect the Type node to the Histogram node
Edit the Histogram node
Select longmon as the Field
Select churn as the Color Overlay (not shown)
Click the Options tab
Click the Normalize by color check box
Click Run

Figure 5.20 Normalized Histogram of Longmon Overlaid with Churn

It is easy to see that there is a relationship between the two fields. The greater the number of long distance minutes, the less likely a customer was to churn (churn=Yes) last month. (Perhaps customers who were going to churn used their current telecommunications service less because of dissatisfaction, or because they had already signed up with another service.) There is a decreasing percentage of customers who churned as longmon increases from 0 to about 70 (after that there are no churners). All this implies that churn could be a good supervising field for longmon (and that longmon could be a good predictor of churn).

We’ll now set up the optimal binning.

Close the Histogram window
Edit the Binning node
Click the Settings tab
Delete cardspent and add longmon in the Bin fields: list
Click the Binning method dropdown and select Optimal
Select churn from the Supervisor field drop-down list


Figure 5.21 Optimal Binning Selection for Longmon

The new field will be given the suffix _OPTIMAL.

Click OK
Add a new Distribution node
Change the Field: to longmon_OPTIMAL
Select churn from the Overlay Color: drop-down list
Click the Normalize by color check box (not shown)
Click Run


Figure 5.22 Distribution Table of Optimized Longmon Normalized by Churn

The field longmon has been binned into four groups. There are a large number of records in each bin, which is also an important criterion for a useful set of bins. Each successive bin has fewer customers who churned, which makes sense, since churn was used to create the bins, and we saw the relationship between the two fields in the histogram. There is a strong relationship between the binned longmon and churn.

Since both fields are now categorical, we can use a Matrix node to further examine the relationship.

Close the Distribution output window
Add a Matrix node from the Output palette to the stream
Connect the Binning node to the Matrix node
Edit the Matrix node
Select longmon_OPTIMAL as the Rows: field
Select churn as the Columns: field
Click the Appearance tab
Click the Percentage of row check box


Figure 5.23 Selecting Row Percents

Click Run

In the Matrix Output window, click on the Labels button

Figure 5.24 Matrix of Binned Longmon by Churn

In the first binned category of longmon, there is a fifty-fifty split for churn. As we move to bins 2, 3, and 4, the percentage of churning customers drops from about 50 to 6.

In addition to optimizing the relationship between the two fields, the optimal binning has also removed any problems with outliers on longmon. We can see this by reviewing the binning values.


Click OK
Edit the Binning node
Click the Bin Values tab

Figure 5.25 Bin Lower and Upper Bounds for Binned Longmon

There is a very large range of values for the fourth bin, as the maximum value of longmon is about 180. All the outliers, and many other records, are in this bin. We’ll save the stream for future reference.

Click OK
Click File…Save Stream As
Enter the name ModelerDataPrep5.str in the File Name: box
Click Save

In this lesson we have reviewed several techniques to transform continuous fields. As with all of data preparation, the more work you do in creating versions of fields that are appropriate for modeling, the better will be your results. Don’t shortchange this phase of the CRISP-DM process.

Summary

In this lesson we have introduced a number of techniques for transforming data. You should now be able to:

Use the Set Globals node to use summary values for fields

Use the Transform node to apply mathematical transformations to fields

Use the Binning node to categorize continuous fields


5.5 Appendix: Errors in PASW Modeler Expressions

As with any programming language, you will eventually make a mistake when constructing PASW Modeler expressions. It is helpful to review the type of feedback you receive from PASW Modeler when you make such an error.

To illustrate, we’ll use the existing stream and introduce an error in the equation in the Derive node that creates zscore_income.

Edit the Derive node labeled zscore_income
Delete the minus sign after the first occurrence of income in the Formula: text box

Figure 5.26 Expression Without a Minus Sign

What will happen when we try to run this portion of the stream?

Click OK
Run the Table node connected to the Derive node


Figure 5.27 Error Message from PASW Modeler

Note: If you make more than one error, PASW Modeler normally finds the first error and then stops there, not searching for others. Thus, fixing one error doesn't mean that the node will run properly, but it does usually imply that any other errors are after the first error in the PASW Modeler expression.

When you can, use the Expression Builder to build and check an expression before execution. You will typically receive the same error message as when the node is run.


Summary Exercises

We will use the custandhol.dat data file that was used in a previous exercise.

1. Use an existing stream (Less3Exercise.str) that accesses the custandhol.dat data file from the exercise in Lesson 3 to read the data and run a Table node to instantiate the fields. Make sure to use a Type node.

2. Your task is to use the Set Globals node to help calculate the amount that HOLCOST is above the minimum for the whole file. In other words, for a specific vacationer, how much more did she spend on her trip than the least expensive trip in the file?

3. In the first step, use a Set Globals node, requesting that the minimum value for HOLCOST be stored in memory.

4. Now use a Derive node to calculate the difference between this minimum amount and the value of HOLCOST for each record. Save this into a new field called HOLCOST_DIFF. Use a Statistics node to calculate the mean value for this derived field. What is the mean? Use a Histogram node to examine its distribution. How does its shape compare to that for HOLCOST? Why?

5. For the next task, you will transform the field HOLCOST to make it more normal. First, use a Histogram node to review the distribution of HOLCOST. Now add a Transform node to the stream attached to the data source or Type node. Specify HOLCOST as the field. Run the Transform node. Which transformations will make the distribution of HOLCOST more normal? How do these still differ from a perfectly normal distribution?

6. Generate a Derive node to create the transformed version of HOLCOST, using the transformation you selected. Rename the SuperNode to give it an appropriate name. Attach it to the stream, and then use a Histogram node to examine its distribution.

7. Now we will use the Binning node to bin HOLCOST, first with fixed width binning. Add a Binning node to the stream. Specify HOLCOST as the field. Use Fixed Width binning, and choose a Bin Width value. Then add a Distribution node and attach it to the Binning node. Specify the binned HOLCOST as the field, and examine its modified distribution. Try this again with a different Bin Width, and then try it by specifying the No. of bins value. You will probably find that some of the new bins contain no records. Can you explain why? Do you think it is reasonable to bin HOLCOST with the fixed width method? Why or why not?

8. Next we want to use optimal binning on the field DIST_TO_BEECH. We suspect this may vary by the country of destination. First, add a Histogram node to the stream and use it to review the distribution of DIST_TO_BEECH with an overlay of COUNTRY. Does this look promising? Next, add another Binning node to the stream and use optimal binning for DIST_TO_BEECH with COUNTRY as the Supervisor Field.

9. Add a Distribution node to the stream attached to the Binning node and specify the optimized DIST_TO_BEECH field. Run the Distribution node. How many categories were created in the binned field? What values does each include? (Hint: To answer this question, you must edit the Binning node.) If you look at the original distribution of DIST_TO_BEECH, would you have binned it by hand in the same groupings as the optimized field?


For those with extra time:

1. If you are not familiar with how the log transforms data, add a Plot node to the stream, attached to the Supernode that transformed HOLCOST, and request a plot of HOLCOST by the transformed HOLCOST.

2. Try using the Transform node for the field DIST_TO_BEECH. Do any of the transformations improve its distribution over the original?


Lesson 6: Working with Sequence Data

Topics

Introduce a number of PASW Modeler sequence functions

Demonstrate the Count and State forms of the Derive node

Demonstrate the History node

Data

In this lesson we will use the data file year_balances.txt (a PASW Statistics version of the data file, year_balances.sav, is also available). The file contains end-of-month account balances for 12 accounts. One year of data is held on each individual customer.

6.1 Introduction

For many situations, each record in a data file can be considered as an individual case, independent of all others. In such instances the order of records is unimportant for analysis. However, in some files the order or sequence of records is extremely important. This often occurs with time-structured data in which the order of the records represents a sequence of events.

In situations such as these, each record can be thought of as representing a snapshot at a particular instant in time. The instantaneous values may be of interest, or the way in which they vary over time.

PASW Modeler includes a number of functions that are primarily used to work with sequence data and we begin this lesson by introducing some of them.

We will then review two variants of the Derive node, Count and State, specifically designed for working with sequence data. Finally, we will demonstrate the History node, which is used to restructure sequence data.

6.2 PASW Modeler Sequence Functions

PASW Modeler sequence functions are immediately recognizable because:

All names are prefixed with @

All names are in upper case

These functions include @INDEX, @MAX and @MEAN, among many others.

Sequence functions can refer to current records, records that have already passed through the node (previous records), and records that have yet to pass through the node (future records). They can also be used in conjunction with other components of PASW Modeler expressions.

In the following sections we introduce a number of the PASW Modeler sequence functions and illustrate their use with some examples. In these examples we use the comma-delimited data set year_balances.txt. We begin by reading the data into PASW Modeler and fully instantiating the data types.

Begin with an empty Stream Canvas

Place a Var. File node on the Stream Canvas


Set the File to year_balances.txt located in the c:\Train\ModelerDataPrep directory

Make sure the Read field names from file option is checked

Click the Preview button

Figure 6.1 Data Containing End-of-Month Balances for One Year

One final point before we begin to introduce the PASW Modeler expressions. If a data file contains sequence data, it is vital to sort the data in sequence order before using any of the functions detailed in this lesson.

Click OK, and then click OK again to return to the Stream Canvas

Connect the Var. File node to a Type node

Connect a Sort node from the Record Ops palette to the Type node

Edit the Sort node

Click the Sort by: field list button and click ACCTNO

Control-click MONTH

Click OK

Figure 6.2 Sort Dialog (Sorting by Month within Account)


Click the Preview button

The data should now be arranged in monthly order within account (not shown), and are prepared for the sequence functions.

Close the Preview window

Click OK to return to the Stream Canvas

Types of Sequence Functions

Let’s look at the list of available sequence functions in the Expression Builder in a Derive node.

Connect a Derive node from the Field Ops palette to the Sort node

Edit the Derive node

Click the Expression Builder button

Click the Functions dropdown list and select @Functions

Figure 6.3 List of Sequence Functions in Expression Builder

Averaging, Summing and Comparing Values

A number of functions are available if you wish to total or create summary statistics for a specified field. Because these are sequence functions, they normally work on all records received so far in the node, rather than just the current record.

If you use the functions with a single argument they will refer to the whole sequence of records received so far in the node, including the current record. Thus, the function @SUM(BALANCE) returns the sum of the values in the field BALANCE for all of the records so far, which is a cumulative total. A second argument, N, can be given to specify that the function should only consider the last N records. For example, @MEAN(BALANCE,5) returns the average of the last five records seen—the current record and the four previous (comparable to a moving average used in time series analysis).

There are functions for the minimum, maximum, mean, sum, and standard deviation.
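As a rough illustration outside Modeler, the cumulative and windowed versions of these statistics can be sketched in Python with pandas (hypothetical data; a sketch, not an official equivalent of the @ functions):

import pandas as pd

df = pd.DataFrame({"BALANCE": [100.0, 150.0, 200.0, 250.0, 300.0]})

# Like @SUM(BALANCE): cumulative total of all records so far
df["CUM_SUM"] = df["BALANCE"].expanding().sum()

# Like @MEAN(BALANCE,5): mean of the current record and the four before it
# (rolling(5) leaves the first four rows null; Modeler's handling of the
# start-up records may differ)
df["MA5"] = df["BALANCE"].rolling(window=5).mean()
print(df)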


Retrieving Values for a Field Using @OFFSET

The function @OFFSET is used to retrieve values, for a given field, in previous or following records. For example:

@OFFSET(ACCTNO,1) Returns the value of ACCTNO for the previous record.

@OFFSET(ACCTNO,-3) Returns the value of ACCTNO for the record that is three records ahead in the sequence.

Retrieving Number of Records Since a Condition

The @OFFSET function returns a value for a field for previous or later records. If you need to know how many records before the current record a field had a certain value, the @SINCE function can be used. For example:

@SINCE(BALANCE >10000) Returns the number of records since BALANCE was greater than 10,000.

Retrieving Last Non-Missing Value

A related function returns the last value of a field that was not blank (missing). For example:

@LAST_NON_BLANK(BALANCE) Returns the last value for BALANCE that was not blank, according to the blank definitions for BALANCE.

Record Indexing Using @INDEX

The simplest sequence function is @INDEX, which returns the consecutive number of the record in the data file, starting with 1.
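Before returning to Modeler, the four retrieval functions just described can be sketched together in Python with pandas (hypothetical data; pandas treats NaN as the blank value, whereas Modeler uses user-defined blank definitions, so this is only an approximation):

import pandas as pd

df = pd.DataFrame({"ACCTNO": [1, 1, 1, 2, 2],
                   "BALANCE": [12000.0, 500.0, None, 9000.0, 11000.0]})

# @INDEX: consecutive record number starting at 1
df["ID"] = range(1, len(df) + 1)

# @OFFSET(ACCTNO,1): value of ACCTNO from the previous record
df["PREV_ACCT"] = df["ACCTNO"].shift(1)

# @SINCE(BALANCE > 10000): records since the condition was last true
# (0 here means the current record satisfies it; the exact start-up
# semantics of @SINCE may differ)
idx = pd.Series(range(len(df)))
df["SINCE_BIG"] = idx - idx.where(df["BALANCE"] > 10000).ffill()

# @LAST_NON_BLANK(BALANCE): most recent non-missing value of BALANCE
df["LAST_BAL"] = df["BALANCE"].ffill()
print(df)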

For example, to create a unique identifier for each record in the data:

Click Cancel to close the Expression Builder

Set the new Derive field name to ID

Set Derive As: to Formula

Type @INDEX in the Formula box (or use the Expression Builder)


Figure 6.4 Creating a Sequence Number with the @INDEX Function

Click OK to return to the Stream Canvas

The newly created field will be shown in Figure 6.6.

Next we can create a three-record moving average that reinitializes when a new account is encountered. For this we will use both @OFFSET and @MEAN.

Connect another Derive node to the right of the ID Derive node

Edit the Derive node

Set the new field name to MA3

Select Conditional from the Derive as drop-down list

Set the If: statement to @OFFSET(ACCTNO,2) = ACCTNO

Set the Then: statement to @MEAN(BALANCE,3)

Set the Else: statement to '$null$' (note the quotes; alternatively, use undef to refer to the system null value)


Figure 6.5 Derive Node Using @OFFSET and @MEAN Functions

This will have the effect of calculating a three-month average of the field BALANCE within each account. The mean will not be calculated unless the second record before the current one has the same account number. This implies that the mean will not be calculated for the first two months for each account, since the data are sorted by ACCTNO and MONTH.
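For readers who want to check the logic outside Modeler, a minimal pandas sketch of the same per-account moving average (hypothetical data, not the year_balances file):

import pandas as pd

df = pd.DataFrame({"ACCTNO": [1, 1, 1, 1, 2, 2, 2],
                   "BALANCE": [100.0, 150.0, 200.0, 250.0, 50.0, 75.0, 100.0]})

# Three-record moving average computed within each account; the first two
# records of each account have no full window and stay null, matching the
# @OFFSET(ACCTNO,2) = ACCTNO guard in the Derive node
df["MA3"] = (df.groupby("ACCTNO")["BALANCE"]
               .rolling(window=3).mean()
               .reset_index(level=0, drop=True))
print(df)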

Click OK to return to the Stream Canvas

Connect the MA3 Derive node to a Table node

Run the Table node

As shown in Figure 6.6, the moving average has been created (along with ID from the previous example).


Figure 6.6 Table Showing New Fields: ID and MA3 (Based on Sequence Functions)

6.3 Count and State Forms of the Derive Node

The Derive node contains two forms, Count and State, specifically designed for working with time series and sequence data.

Count Calculates a new field based on the number of times a condition has been true.

State Calculates a new field that is one of two states. Switching between these two states is triggered by specified conditions.

Using Count to Count Events or Accumulate Values

The Derive node can be set to Derive as Count and used to count events or accumulate values. On execution the new field is set to an initial value that must be a numeric constant. As the records pass through the node, when one is encountered that complies with a specified condition (set by Increment when), the new field is incremented by a specified value (set by Increment by). The new field will be reset to the initial value when a record is encountered that conforms to a second specified condition (Reset when).

We illustrate the use of Count by calculating a new field that gives the total number of times to date that an account has been overdrawn (has a value below 0). The count will be reset when the first entry of a new account is read.
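The same Increment when / Reset when logic can be sketched as a plain Python loop (hypothetical data; in the stream the work is done by the Derive node itself):

rows = [(1, 120), (1, -30), (1, -10), (2, 500), (2, -5)]  # (ACCTNO, BALANCE)

count, prev_acct, result = 0, None, []
for acctno, balance in rows:
    if prev_acct is not None and acctno != prev_acct:
        count = 0           # Reset when: @OFFSET(ACCTNO,1) /= ACCTNO
    if balance < 0:         # Increment when: BALANCE < 0
        count += 1          # Increment by: 1
    result.append(count)
    prev_acct = acctno
print(result)  # [0, 1, 2, 0, 1]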

Close the Table window

Connect a Derive node between the MA3 Derive node and the Table node

Edit the Derive node

Set the new Derive field name to NUM_O_D

Set Derive As: to Count


Set Initial Value: to 0

Set Increment when: to BALANCE < 0

Set Increment by: to 1

Set Reset when: to @OFFSET(ACCTNO,1) /= ACCTNO

Figure 6.7 Derive Node Using Count Variant

Click OK to return to the Stream Canvas

Run the Table node


Figure 6.8 Table Containing Count of Account Balances Below 0

For each account number, the NUM_O_D field increments by one every time the end-of-month balance falls below zero.

Using State to Flag a Condition

The Derive node can be set to State and used to create a new field that has two values and therefore is similar to a flag field; however, the state can be toggled on and off by two independent conditions. This means that the value will change (turn On or Off) as each condition is met. For example, State can be used to flag an account when it becomes overdrawn but only turn off this flag when the account balance subsequently rises above 200.

On first execution the Initial state (either On or Off) is assigned. If the current state is on, it will switch to off when a record is encountered that conforms to the Switch “Off” when condition. Conversely, if the current state is off, it will switch to on when a record is encountered that conforms to the Switch “On” when condition. Every new record is given the current value of the State, not the initial value. The on and off values can be text you specify.

We illustrate the use of State by calculating a new field that flags if the end-of-month balance falls below zero. This flag will not be removed until the end-of-month balance is above 200 or if a new account is encountered.
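A rough Python sketch of the toggle, using the same conditions as the node settings below (hypothetical data):

rows = [(1, 120), (1, -30), (1, 150), (1, 250), (2, 500)]  # (ACCTNO, BALANCE)

state, prev_acct, result = "Clear", None, []  # Initial state: Off
for acctno, balance in rows:
    new_account = prev_acct is not None and acctno != prev_acct
    if state == "Warning" and (balance > 200 or (balance > 0 and new_account)):
        state = "Clear"     # Switch "Off" when
    elif state == "Clear" and balance < 0:
        state = "Warning"   # Switch "On" when
    result.append(state)
    prev_acct = acctno
print(result)  # ['Clear', 'Warning', 'Warning', 'Clear', 'Clear']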


Close the Table window

Connect a new Derive node between the NUM_O_D Derive node and the Table node

Edit the Derive node

Set the new Derive field name to WARNING

Set Derive as: to State

Set "On" value to Warning

Set Switch "On" when to BALANCE < 0

Set "Off" value to Clear

Set Switch "Off" when to (BALANCE > 200) or ((BALANCE > 0) and (@OFFSET(ACCTNO,1) /= ACCTNO))

Set Initial state: to Off

Figure 6.9 Derive Node Using State to Calculate a Warning Field

Click OK to return to the Stream Canvas

Run the Table node

The resulting field is shown in Figure 6.10. The new field will be in state off (Clear) until it passes a record that has a balance less than 0, when it will switch to on (Warning). It will remain in state on until either the balance is above 200, or a new account is encountered that has a balance over 0 (Clear).


Figure 6.10 Table Containing New Field Created using Derive Node (State Type)

Close the Table window

6.4 Restructuring Sequence Data Using the History Node

It may be necessary to restructure sequence data so that the sequence of events is represented across a number of fields, as opposed to a number of records. This would allow you to use Association models on the data, or to do predictive modeling, among other tasks. The History node takes previous values of one or more fields and places them into the current record as values of new fields (the Restructure node can also accomplish similar actions). The History node, therefore, adds one or more new fields to each record that passes through it.

You specify the number of previous records from which to extract values (Span), and from which record, prior to the current, the first value should be extracted (Offset), in this span.

To make this clear, Offset specifies the record prior to the current record from which you want to extract historical field values. Span specifies the number of records from which you want to extract values, beginning at the offset record and working backwards. For example, if Offset is set to 4 and Span is set to 5, each record that passes through the node will have five fields added to it for each field specified in the Selected fields list, beginning four records previously. Thus, when the node is processing record 10, fields will be added from records 6 through record 2 (which is a span of 5 records, inclusive).

As further illustration, consider the data in Table 6.1.


Table 6.1 Data Before Passing Through History Node

Month  Balance
Jan    100
Feb    150
Mar    200
Apr    250
May    300
June   350

If we applied a History node to this data using a span of 2, an offset of 1, and keeping all records (including those with no history), it would result in the data shown below.

Table 6.2 Data After Passing Through History Node, Span 2 and Offset 1

Month  Balance  Balance_1  Balance_2
Jan    100      $null$     $null$
Feb    150      100        $null$
Mar    200      150        100
Apr    250      200        150
May    300      250        200
June   350      300        250

If instead the Offset is set to 2 (with the same Span of 2), each history field reaches back one record further, as shown below.

Table 6.3 Data After Passing Through History Node, Span 2 and Offset 2

Month  Balance  Balance_1  Balance_2
Jan    100      $null$     $null$
Feb    150      $null$     $null$
Mar    200      100        $null$
Apr    250      150        100
May    300      200        150
June   350      250        200
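A pandas sketch of this restructuring (a hypothetical helper, not the node's implementation): the history field at position p, with offset o, is simply the original column shifted down by o + p - 1 rows, with NaN playing the role of $null$.

import pandas as pd

df = pd.DataFrame({"Month": ["Jan", "Feb", "Mar", "Apr", "May", "June"],
                   "Balance": [100, 150, 200, 250, 300, 350]})

def add_history(frame, field, offset, span):
    # Field_1 starts 'offset' records back; each later field reaches one further
    for i in range(span):
        frame[f"{field}_{i + 1}"] = frame[field].shift(offset + i)
    return frame

print(add_history(df, "Balance", offset=1, span=2))  # reproduces Table 6.2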

We demonstrate the History node by transforming the data used in the previous sections into one record per account, containing the past 12 end-of-month balances. This will involve using the History node with a span of 11 and offset of 1. Once the data stream has passed through the History node, we will use the Sample node to select every 12th record. The result will be a data table containing one record per account.

Connect the Sort node to a History node from the Field Ops palette

Edit the History node

Select the field BALANCE in the Selected Fields list

Set the Offset to 1

Set the Span to 11

Select the Leave history undefined option button


Figure 6.11 History Node Dialog

The fields whose previous values are to be added to the record are those chosen in the Selected fields list. Initial records in the file will not have previous values available. By default, records for which the history is unavailable are discarded (Discard records option in the Where history is unavailable group), but can be retained by selecting the Leave history undefined or Fill values with options. These choices keep all records but differ in the value that will be assigned to history fields when a history value is unavailable. The Leave history undefined choice will assign $null$ when a history value is not available, or you can specify a value through the Fill values with option.

Click OK to return to the Stream Canvas

Connect the History node to a Table node

Run the Table node


Figure 6.12 Data after Passing Through the History Node with Offset 1 and Span 11

Although the History node has created the new fields successfully, the final result is not yet what we need. The file contains monthly data for separate accounts, so where the History node is placing 11 previous balances into a record, it is, in fact, placing balances belonging to more than one account. To solve this problem, we will use the Sample node to select every 12th record—the final record for every account—that contains all eleven previous month balances for the same account.
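The 1-in-n selection can also be sketched in pandas (hypothetical stand-in data; taking the last record per account is equivalent here only because every account has exactly twelve months):

import pandas as pd

# Two accounts with twelve months each, already sorted
df = pd.DataFrame({"ACCTNO": [1] * 12 + [2] * 12,
                   "MONTH": list(range(1, 13)) * 2})

# Sample 1-in-12: keep rows 12, 24, ... (1-based), the last month per account
print(df.iloc[11::12])

# A safer equivalent if an account had fewer than twelve months:
print(df.groupby("ACCTNO").tail(1))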

Close the Table window

Connect a Sample node from the Record Ops palette between the History and Table nodes

Edit the Sample node

Click the Include sample Mode button

Set Sample: to 1-in-n, and enter 12 in the text box


Figure 6.13 Completed Sample Node Used to Select Every 12th Record

Click OK to return to the Stream Canvas

Run the Table node connected to the Sample node

Figure 6.14 Table with One Record Per Account and Fields Containing Monthly Balances

The history of each account is now stored on a single record. This format is convenient for modeling. Note that we assume every account has twelve months of data. If not, then additional data manipulation would be needed.


Summary

In this lesson we have introduced a number of techniques for working with sequence data.

You should now be able to:

Use a variety of PASW Modeler sequence functions to derive new fields

Use the State and Count variants of the Derive node to count the number of events or flag a condition

Use the History node to extract previous values into new fields.


Summary Exercises

In this exercise you will create a number of new fields using a variety of the sequence functions available in PASW Modeler.

1. If it is not already loaded, load the stream created in Lesson 3, Less3Exercise.str.

2. The first task is to create a field that gives a cumulative total of holiday cost (i.e., a cumulative total of money due to the company), HOL_CUST.

Before we can create this field we must sort the data in departure date order. Connect a Sort node to the Derive node named hol_month. Instruct the Sort node to sort the data in ascending order of TRAVDATE.

Connect a Derive node to the Sort node and compute a field that equals the cumulative total of the field HOL_COST. (Hint: You may need to use the PASW Modeler expression @SUM(field)). Connect a Table node to the Derive node to see whether the derived field is correct.

What is the cumulative total cost for all of the holidays? What is the total cost for holidays in July?

3. Next, create a field that increments each time a new holiday location is encountered. Since the file is sorted by HOLCODE, this will tell us the number of unique holiday codes in the file.

4. After sorting the data in ascending order of HOLCODE and TRAVDATE, create a field that, starting at 1, increments each time a new holiday location is encountered. (Hint: You will need to use @OFFSET and the Count form of the Derive node.) What is the number of unique holiday codes?

5. Save your changes to a stream named Less6Exercise.str.


Lesson 7: Exporting Data Files

Topics

Discuss whether to use a stream file or only a data file for modeling

Export data into a flat file

Review the typing of data when reread into PASW Modeler

Discuss the export of data into a database

Data

We use the PASW® Statistics data file customer_offers.sav, which contains customer data from a telecommunications company.

7.1 Introduction

When you have completed data preparation, you may need to save the data file in one or more formats. Alternatively, you may only save the data after developing a model. Saving the data, called exporting in PASW Modeler, allows you to use the data in other software. It can also permit you to simplify and streamline the use of PASW Modeler so that you don’t have to rerun the stream(s), but just open the data file and proceed to modeling.

We will discuss the pros and cons of using the full stream versus only a data file. Then we export the data as a flat, or text, file and use the results to discuss what the measurement level of the fields is after they have been exported and then reread into PASW Modeler. Measurement level for fields is essential for modeling, so we consider this issue in depth.

We also discuss exporting to a database and some of the available options.

7.2 Using a Data File or Streams in Modeling

The last step of data preparation, after fields have been modified and records selected, is to save a complete stream file for use in modeling. Alternatively, you can simply save a data file that contains all the data preparation you completed. Data can be saved in a variety of formats, although there is no “Modeler-native” format in which to save data files. Instead, you can choose common formats in which to save the data.

You don’t, though, need to save any data files in PASW Modeler. Instead, you can just use the existing stream(s) to recreate the file each time you work in the modeling phase of the CRISP-DM methodology. Here are the advantages and disadvantages of each approach.

Using a Stream File

If you have very large data files, which is common in data mining, then you may not wish to save the data file after it has been prepared, or may not have the storage capacity to do so. In that case, the stream will recreate it each time and then you can do the modeling.

After you have created a model and validated it, you will eventually apply the model to new data. When you do so, those data will have to be prepared in exactly the same way as the data were for training the model, which means you will need to run the data through the existing stream. So it would seem simpler just to continue to work with the stream file.

Using the original stream file might make it easier to do additional modifications to the data during modeling, which is not uncommon. As you try various models, you may decide to try other versions of modified fields (such as how outliers are removed). This will be more efficient if you can readily access the original nodes used in data preparation.

Using a Data File

If you have many nodes doing data preparation, then each time you run the stream, the data must pass through those nodes. In large files, this can be a lengthy process, and it is much more efficient to use only the prepared data file.

You may wish to do some alternative analysis or create reports in other software. If so, you will need to write out the data in a format that can be read by that software.

The data will need to be reinstantiated for modeling, either by using an existing Type node or adding a new one. If you add a new one and run the data through it, the level of measurement of each field may well not match those from the original stream. This is an important issue, so you may find it best to reuse the Type node from the existing stream. But even that will not be an automatic solution (see example below).

After the data are prepared, you may want to integrate the modified data with the original data source, such as a database of some type (Oracle, SQL Server, etc.). If so, you will need to write the data back into the database (assuming you have permission for this operation). Then you can read the data from the database and proceed directly to modeling.

You can also use a blend of these approaches. You can save the data file but still have the streams open that were used in data preparation. Then you can copy any nodes you may need into a new stream to make additional data transformations. Or, you can cache the data at some point in a stream file. Thus, after the stream is first run, you will have a saved data file at that cached node that will speed execution but still allow you to work with the full stream, if need be.

7.3 Types of Exported Files

Eventually, almost every user saves or exports a data file of some type from PASW Modeler, so this lesson will provide some examples of this operation. PASW Modeler has six types of file exports.

The Database Export node writes data to an ODBC-compliant relational data source. In order to write to an ODBC data source, the data source must exist and you must have write permission for it.

The Flat File export node writes data to a delimited text file (with data values separated by commas, tabs, or other characters). A wide variety of other software can read this type of file.

The Statistics Export node writes data in PASW Statistics file format (.sav). These files can be read by PASW Statistics and other PASW products. This is also the format used for PASW Modeler cache files.

The SAS Export node writes data in SAS® software format to be read into SAS software or applications compatible with SAS software. Three formats are available, including SAS for Windows®/OS2, SAS for UNIX®, or SAS software Version 7/8/9.

The Excel Export node writes data in Microsoft® Excel format (.xls). You can also choose to launch Excel automatically and open the exported file when the node is run.

The Data Collection Export node saves data in the format used by PASW® Data Collection market research software, based on the PASW Data Collection Data Model. This format distinguishes case data—the actual responses to questions gathered during a survey—from the metadata that describes how the case data is collected and organized.


We’ll try the Flat File and Database exports in this lesson.

7.4 Exporting Flat Files

To make this example more realistic, we will use a version of the stream file saved in a previous lesson.

Click File…Open Stream, and move to the c:\Train\ModelerDataPrep directory

Double-click the file ModelerDataPrep7_forExport.str

Figure 7.1 Stream Preparing Data for Export

This stream now simulates the situation at the end of data preparation. A data source is read and typed, and then the data are modified. Your streams may have dozens of nodes after the Type node instead of a Derive and two Binning nodes, but the principle is the same.

To export data, we add a node from the Export palette.

Add a Flat File export node to the stream

Connect the second Binning node to the Flat File node

Edit the Flat File node


Figure 7.2 Flat File Export Dialog

Data written to a flat file can be read by a variety of other software. You can even use word-processing software to read or edit these files as they are simple text. You have choices to Overwrite an existing file of the same name (output by default), or to append it to the existing file. The file name will be given no extension, but you may want to add the extension .dat or .txt to remind you of the file type. Field names will be written to the first row of the data file.

The Field separator is a comma, but you can change that to any convenient character. Symbolic fields have double quotes placed around their values, and this can be altered as you prefer.

Conveniently, since you may well plan to read the file into PASW Modeler later for modeling, PASW Modeler will generate an import node you can use for this purpose (Generate an import node for this data check box).

Change the Export file name to customer_offers.txt

Change the directory location to c:\Train\ModelerDataPrep, then click Save

Click Generate an import node for this data


Figure 7.3 Completed Flat File Export Node

Click Run

After the stream has run, a Var. File source node named customer_offers.txt is placed in the upper left hand corner of the stream canvas. We’ll review its settings.

Edit the customer_offers.txt source node

A portion of the exported data is displayed in the preview window. All the settings are appropriate to read the flat file into PASW Modeler.


Figure 7.4 Generated Source Node to Read customer_offers File

This file originally came from a PASW Statistics data file, which contains labels for the data values of fields. Those labels have been lost when the file was exported to the flat file format. The labels will be lost in all formats except for PASW Statistics and exports to SAS software.

The level of measurement of the fields may be different now in the flat file compared to the Statistics file. To see this, we’ll add a Type node to the stream and instantiate the text data, and then compare it to the levels of measurement in the existing stream.

Add a Type node to the stream and connect the customer_offers.txt source node to the Type node

Edit the Type node attached to customer_offers.txt

Click the Preview button

Close the Preview window


Figure 7.5 Types of Fields from customer_offers File

Most of the fields are numeric, and they have measurement level Continuous. But many of these fields were originally Nominal or Ordinal. How a model handles a field depends on its measurement level, so this is a critical characteristic to match.
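The underlying problem is easy to reproduce outside Modeler: a delimited text file stores only characters, so any software reading it back must guess types from the values alone. A small pandas sketch with hypothetical field names:

import pandas as pd
from io import StringIO

# agecat is really an ordinal code (1-4), but a flat file cannot say so
df = pd.DataFrame({"agecat": [1, 2, 3, 4],
                   "income": [20.0, 35.5, 50.0, 72.3]})

buffer = StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# On reread, both columns come back as plain numbers ("Continuous");
# the ordinal coding of agecat is lost
reread = pd.read_csv(buffer)
print(reread.dtypes)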

Let’s open the original Type node for comparison.

Edit the Type node attached to the PASW Statistics source node

Figure 7.6 Types of Fields in Original Stream

There are several fields whose measurement level was originally Ordinal that now have measurement level Continuous, such as townsize or agecat. You will need to manually change the fields to the appropriate measurement level in the new Type node, and then have PASW Modeler reread the data. You can try reusing the original Type node, but this will not necessarily solve the situation. Not all fields will have the same measurement level.

Even if you export the file as a PASW Statistics data file, and then read it back into PASW Modeler, the measurement level will not necessarily be identical. It will be more similar if the PASW Statistics file was originally read with the setting to Read labels as data in the Values area. This will mean that the data you write out for flag and nominal fields may well be text data rather than numbers, but this will ensure that they will be treated as a flag or nominal field. But you may still have to change some nominal to ordinal fields. On the other hand, if you reuse an existing Type node, blank definitions and the Roles of the fields will be honored.

To see this, we’ll make a copy of the original Type node, attach it to the flat file, and check the settings.

Close both Type nodes

Right-click on the original Type node and select Copy Node

Right-click on the stream canvas and select Paste

Drag the copied Type node below the customer_offers flat file node

Connect the two nodes

Edit the Type node

Figure 7.7 Type of Fields Using Original Type Node

There are asterisks in the Missing column for age and income, which means that the missing or other specifications for these fields were retained.

Click in the Missing cell for income and select specify


Figure 7.8 Value Specifications for Income

In addition to the missing definition and range, there is a Description of the field, which was not written out with the data file. But reusing the Type node allows you to access this information for a field.

Click OK

Add a Table node to the stream

Connect the copied Type node to the Table node

Run the Table node


Figure 7.9 Data Read Using Original Type Node

The data look essentially identical to what was created in the upper stream before they were exported. Still, not all the measurement levels are the same. Gender was Nominal originally but now has measurement level Continuous. But reusing the Type node saved us some of the work needed to prepare for modeling.

As you can see from this discussion, there are some subtleties when exporting data to be used in modeling that require careful attention. If you want to export data and use that file, rather than the full stream, you will need to carefully check the Type and other characteristics of the fields.

Typing may be more similar to the original file when the original data don’t come from PASW Statistics and don’t have value labels. And typing may be more similar if you export to a format other than PASW Statistics, although there are no hard and fast rules about this.

Close the Table window

Note

Because the data were exported with the values of income coerced, the text data file customer_offers has values of income between only 10.0 and 100.0. Nevertheless, the principle of using the old Type node still applies to this example, and in general.

7.5 Exporting to Databases

Another export option is to use the Database node to write the data to ODBC-compliant relational data sources. To do so, you must have an ODBC data source installed and configured for the relevant database, with read or write permissions as needed. The SPSS® Inc., an IBM® company, Data Access Pack includes a set of ODBC drivers that can be used for this purpose, and these drivers are available from the web site at http://www.spss.com/drivers/clientCLEM.htm.


We will simply describe some of the node options but not run an example, as we would have to configure a data source. Your instructor can show you how to do this if you are interested.

Add a Database node from the Export palette to the stream

Connect the second Binning node to the Database node

Edit the Database node

Figure 7.10 Database Export Node

You select a data source, and then select a table within that source in which to write the data. The table can either be new (Create table), or you can write the data directly into an existing table (Insert into table), or you can update selected database columns with values from corresponding source data fields (Merge table). Selecting this option enables the Merge button, which displays a dialog from where you can map source data fields to database columns.

You can write over an existing table of the same name by checking Drop existing table. This option will be greyed out if Merge table is selected.

As with the Flat File node, an import node can be generated to read the data.

There are Schema, Indexes, and Advanced buttons to set additional options. For example, the Schema dialog box allows you to set SQL data types for your fields, specify which fields are primary keys, and customize the CREATE TABLE statement generated upon export. The Advanced option controls various settings related to performance, such as bulk loading. You will need to understand database operations to use these successfully, but the node will function adequately with the default settings.

As we saw in the previous section with a flat file, the Type node settings will probably not be completely correct when you read the data back from the database. However, that may not be as critical with this export option. Often, you only write data back to the database after modeling, so action can be taken with the scored records. In that event, you won’t need to read the same data again.


Summary

In this lesson we have discussed how to export data from PASW Modeler. You should now be able to:

Understand the factors that affect whether existing streams should be used or data should be exported, and then reread.

Export data with any of the available nodes, especially the Flat File node and Database node.


Summary Exercises

In this exercise you will export a file to two formats.

1. In the exercises for Lesson 6 you saved a file named Less6Exercise.str. Open that stream file to use it for exporting the modified custandhol.dat data (or use the file Backup_Chap6Exercise.str).

2. Add a Type node to the stream between the Derive node labeled hol_month and the Table node. Then Run the Table node so that new fields are typed.

3. Add a Statistics Export node to export the data to PASW Statistics format. Generate an import node for the exported file, and then read it back into PASW Modeler and run a Table node.

4. Try both settings in the Statistics Export node for names and labels. Do these settings make any difference in the exported data? Why or why not?

5. Now add a Type node between the generated Statistics node and the Table. Rerun the Table node, and then edit the Type node. Compare the measurement level of these fields to their measurement level in the Type node you added in step 2. Are they the same or different? What field should be a flag but is now nominal?

6. Next try exporting the data to an Excel file. Generate an import node for the exported file, and read it back into PASW Modeler. Also, open the data in Excel to view its format there. As in step 5, add a Type node between the generated Excel node and the Table. After the data are typed, is the flag field now correctly typed?


Lesson 8: Efficiency within PASW Modeler

Topics

Introduce SQL Pushback that makes use of database efficiency and scalability

Discuss settings about set size that can affect performance

Review performance improvements for node order and specific nodes

Data

We work in this lesson with the custandhol database. If it has not already been defined as an ODBC data source, you will need to do so before running the example (defining an ODBC data source is covered in the Introduction to PASW Modeler and Data Mining training course).

8.1 Introduction

In this lesson we introduce several features that can increase efficiency when running analyses in PASW Modeler. We first discuss SQL pushback, in which data manipulation nodes can push back the appropriate SQL instructions to the database source where the data manipulation operations will be performed more efficiently. We also discuss stream options to limit the number of distinct values a categorical field can have before being converted to measurement level typeless, and options to limit the set size for categorical fields when running Neural Net, Kohonen, or K-Means models. These options can improve PASW Modeler’s efficiency when manipulating data and modeling.

Further, we review the general order of nodes in a stream and how to modify it for efficiency, and discuss efficiency for specific nodes.

8.2 SQL Pushback

Many data miners will access data directly from a database rather than work on extracts that have been written out to separate files. For those that do, one of the most powerful capabilities of PASW Modeler is the ability to perform many data preparation and data-mining operations directly in the database. By generating SQL code that can be pushed back to the database for execution, many operations, such as sampling, sorting, deriving new fields, and certain types of graphing, can be performed in the database rather than on the PASW Modeler Client or Server computer. When you are working with large datasets, these pushbacks can dramatically enhance performance in several ways.

First, they can reduce the size of the dataset to be transferred from the DBMS to PASW Modeler. When large datasets are read through an ODBC driver, network I/O or driver inefficiencies may result. For this reason, the operations that benefit most from SQL optimization are row and column selection and aggregation (Select, Sample, Aggregate nodes), which typically reduce the size of the dataset to be transferred. Data can also be cached to a temporary table in the database at critical points in the stream (after a Merge or Select node, for example) to further improve performance.

Second, efficiency is increased because a DBMS can often take advantage of parallel processing, more powerful hardware, more sophisticated management of disk storage, and the presence of indexes.
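To see why pushing work into the database helps, compare fetching everything and filtering locally against letting the database filter. A minimal sketch using Python's built-in sqlite3 module as a stand-in database (table and column names are hypothetical):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (custid INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, float(i % 100)) for i in range(10000)])

# Without pushback: transfer all 10,000 rows, then select locally
all_rows = con.execute("SELECT custid, amount FROM orders").fetchall()
big_local = [row for row in all_rows if row[1] > 90]

# With pushback: the WHERE clause runs in the database, so only the
# qualifying rows cross the connection
big_pushed = con.execute(
    "SELECT custid, amount FROM orders WHERE amount > 90").fetchall()

print(len(big_local), len(big_pushed))  # same result, far less data moved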

Many nodes in PASW Modeler support SQL optimization. In addition to straightforward row and column operations, nodes using PASW Modeler expressions may support optimization. The main consideration is whether the PASW Modeler expression is supported within the database through SQL, which is generally true for standard arithmetic or string operations.

When a stream is run, it is easy to observe SQL optimization since the nodes whose functionality is being pushed back to the database turn purple.

PASW Modeler Server Connectivity

Database modeling, SQL optimization, and batch-mode automation capabilities require that PASW® Modeler Server connectivity be enabled on the PASW Modeler Client computer. With this setting enabled, you can access database algorithms, push back SQL directly from the PASW® Modeler Client, access PASW Modeler Server, and automate many tasks using batch mode.

Open the SQL_Pushback.str stream file in the c:\Train\ModelerDataPrep directory

This stream contains three data manipulation (Record Ops) nodes connected to database sources.

Figure 8.1 Data Manipulation Stream Accessing a Database

When you want to see which nodes can be pushed back to the database before executing a stream, you can use the Preview SQL Generation button on the toolbar.

Click on the Table node to select it

Click on the Preview SQL Generation for the Selection button

When you do so, all the nodes that can be pushed back to the database (in this case an MS Access Database file) turn purple. This includes the Database source nodes themselves along with the Append, Select, and Sort nodes. This is because SQL to perform these operations can be pushed back to the database and performed there. The Table node, since it is an Output node, is not pushed back.


Figure 8.2 SQL Pushback Preview of Stream

Run the Table node

When we run the Table node, the same nodes turn purple; the only difference from the preview is that this time we see the table of output records (not shown).

When working with large datasets, taking advantage of SQL pushback can substantially reduce processing time. Since not all databases support the same SQL functionality, whether a particular node can be pushed back in this way depends on the database and database drivers used and the PASW Modeler operations requested. Because SQL is used there can be no SQL pushback to non-database source nodes (text, Statistics, or SAS software files).

Note

Because of minor differences in SQL implementation, streams run in a database may return slightly different results from those run entirely in PASW Modeler. These differences may also vary depending on the database vendor, for similar reasons. For example, depending on the database configuration for case sensitivity in string comparison and string collation, PASW Modeler streams using SQL pushback may produce different results from those without SQL pushback. To maximize compatibility with PASW Modeler, database string comparisons should be case sensitive.

8.3 SQL Optimization

PASW Modeler also provides options for optimizing the SQL generation when using SQL pushback. These options can be viewed and modified by clicking on Tools…Options…User Options, then selecting the Optimization tab. Stream optimizing involves reordering nodes so that more operations can be pushed back using SQL generation for execution in the database. This can also reduce the size of the data file returned to PASW Modeler, making for additional efficiency.

Let’s look at these options.

Click Tools…Options…User Options

Click the Optimization tab


Figure 8.3 SQL Optimization Options

Enable stream rewriting allows PASW Modeler to reorder the nodes in the stream for more efficient execution. The first choice, Optimize SQL generation, allows the reordering of nodes so that more operations can be pushed back for execution in the database. When a node is found that can’t be run using SQL, PASW Modeler looks downstream to see if there are any nodes that can be pushed back, and safely moved upstream in front of the current node, without compromising total stream execution.

The Optimize syntax execution option increases the efficiency of operations that incorporate more than one node containing PASW Statistics syntax. Optimization is achieved by combining the syntax commands into a single operation, instead of running each as a separate operation.

The Optimize other execution option optimizes efficiency in those nodes that can’t be pushed back to a database. It works by reducing the amount of data in the stream as early as possible by pushing data operations as close to source nodes as possible. The actual pushback to a database is enabled with the Generate SQL check box.

For streams run in the database, data can be cached midstream to a temporary table rather than the file system. This can improve performance. The Database Caching option must also be enabled for this to function. To take advantage, you right-click on any non-terminal node to cache data at that point in the stream; the cache will be created directly in the database the next time the stream is run, as with any cache.

If you would like to see the SQL that PASW Modeler generates for its pushback, you can select Display SQL in the messages log during stream execution.

Note

These settings will be overridden by any settings on PASW Modeler Server.


8.4 Node Order

The order of the nodes in a stream can affect performance. As with any software program, the goal is to minimize processing of either records or fields as soon as possible. Thus, if you have a record selection node, place it as early in the stream as possible.

Further, when using SQL optimization, although nodes can be reordered, it is best to group nodes that can be optimized for SQL together at the beginning of the stream. The Help in PASW Modeler provides a list of operations (nodes) that can be completed in most databases.

8.5 Using Samples of Data

This advice is not specific to PASW Modeler, or SQL pushback, but we should emphasize that whenever possible, you should work with samples of data rather than the full data file you may be using for scoring. The sample has to be large enough to give a full range of outliers and odd data, per our earlier discussions. But if you have files of millions of records, it is very likely that a smaller file of, say, 50,000 to 100,000 records will be perfectly adequate to construct the necessary nodes for data preparation.

8.6 Maximum Set Size

An instantiated Type node stores a list of values for each categorical (nominal, ordinal) field. These are saved when the stream is saved and loaded when the stream is opened. If a stream contains categorical fields that have many values (hundreds, thousands, tens of thousands) this can place a substantial resource burden on PASW Modeler. Usually such fields (for example, name, address, phone number, 9-digit zip code, alphanumeric ID code) are not used in analyses, but might be included in the stream.

One method of handling such fields is to declare them as Typeless (in a Type node or Types tab), in which case their member values will not be stored in the stream and modeling nodes cannot use the fields. One of the stream property settings allows you to control this automatically.

Click OK to close the User Options dialog

Click Tools…Stream Properties…Options


Figure 8.4 Stream Properties Dialog: Options Tab

The Maximum set size option, when checked (which is the default), will automatically set a field to type Typeless when the number of its members exceeds the value in the Maximum set size box. For example, if it were set to 250, a field that would ordinarily be nominal would be converted to Typeless if it contains more than 250 distinct values. Checking this option will prevent streams from containing long lists of set values to be saved in memory, which speeds loading, saving and processing.

When a field is converted to Typeless through this setting, a message is added to the Messages tab of the Stream Properties dialog.
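The effect of this setting is easy to emulate: count each categorical field's distinct values and flag any field over the threshold. A pandas sketch with hypothetical fields:

import pandas as pd

df = pd.DataFrame({"zip9": [f"{i:09d}" for i in range(1000)],  # 1,000 distinct values
                   "region": ["N", "S", "E", "W"] * 250})      # 4 distinct values

MAX_SET_SIZE = 250
for col in df.select_dtypes(include="object"):
    if df[col].nunique() > MAX_SET_SIZE:
        print(f"{col}: {df[col].nunique()} values - would become Typeless")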

A related option that applies to certain modeling nodes is the Limit set size for Neural, Kohonen and K-Means modeling option. If this is checked, then a field of type nominal with more members than the number specified will be ignored by the Neural Net, Kohonen, and K-Means modeling nodes. These nodes are singled out because they internally create, for each nominal field, as many flag (0,1) fields as there are members in a nominal field. Thus if a 5-digit zip code field of type nominal were included as an input to the Neural Net node, hundreds to thousands of flag fields would be internally created by the node for use in the analysis, depending on the file size, which would substantially increase processing demands (processing time and memory).

When a nominal field is excluded from a Neural Net, Kohonen, or K-Means analysis for this reason, a message is added to the Messages tab of the Stream Properties dialog.


Figure 8.5 “Ignoring Large Set input field” Message in Stream Properties Dialog
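The flag-field explosion these nodes guard against can be seen directly with one-hot encoding (a rough pandas illustration using a hypothetical zip code field):

import pandas as pd

df = pd.DataFrame({"zip5": [f"{i:05d}" for i in range(2000)]})

# One flag (0/1) column per distinct value - the same kind of internal
# expansion that makes high-cardinality nominal inputs so expensive
flags = pd.get_dummies(df["zip5"])
print(flags.shape)  # (2000, 2000): two thousand flag fields from one input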

You can exclude individual fields from these models by setting their role to None in the Type node or Types tab of a source node.

Conversely, this setting may needlessly restrict the use of specific fields that you want to include in a model. If so, you can set the value higher than its default of 20, or you can turn this option off and then carefully set the role for categorical fields in a Type node.

8.7 Performance in Specific Nodes

There are two nodes where some general advice can be given about improving efficiency.

The Binning node must read the entire data file in order to compute bin boundaries. Only then can it allocate records to the bins. After you have run the binning node once and are satisfied with the results, you can generate a Derive node from the Binning output. This will improve performance if used in place of the Binning node when the stream is run in the future.

The Distinct node must store all of the unique values based on the key fields. As a dataset grows larger, and most of the records are distinct (not duplicates), performance suffers. If you have a large data file and the order of the output data is not important (at least temporarily), you can sort the data on the key fields and then use the PASW Modeler expression @OFFSET with a Select node to select (or discard) the first distinct record from each distinct group based on the key values.
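The suggested workaround (sort on the key fields, then keep only the first record of each key group) can be sketched in pandas (hypothetical key field; in Modeler the comparison would use @OFFSET in a Select node):

import pandas as pd

df = pd.DataFrame({"custid": [3, 1, 2, 1, 3],
                   "amount": [5, 10, 7, 12, 9]})

sorted_df = df.sort_values("custid")
# A record is the first of its group when the previous key value differs,
# the same test as @OFFSET(custid,1) /= custid after a Sort node
first_of_group = sorted_df[sorted_df["custid"] != sorted_df["custid"].shift(1)]
print(first_of_group)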

See the Help in PASW Modeler for information on other nodes and their processing requirements.


Summary

In this lesson we discussed several methods of improving efficiency within PASW Modeler. You should be able to:

Understand and make use of SQL Pushback

Reorder nodes in a stream for efficiency

Modify Stream Property options that are based on the set size of fields


Appendix A: Database Joins with PASW Modeler

A.1 Introduction

This appendix is designed to help PASW Modeler users work with database tables that have special joins, or relationships. It does not contain instructions as detailed as those found in the earlier lessons, trading depth for broad coverage of the different types of table joins. We describe the characteristics of the tables to be joined, the type of join desired, the PASW Modeler nodes and settings required to accomplish the merges, and the results.

We will begin by working with the Northwind database. We assume that you have already defined the Northwind database as an ODBC data source—this topic is covered in the Introduction to PASW Modeler and Data Mining training course.

A.2 Basic Merge Setup in PASW Modeler

Below is the Relationships table that shows how the tables in the database are defined. On the links between the tables, we see many with a label of either 1 or ∞, denoting the one-to-many linkages in the table.

Figure A.1 Relationships Table for Sample Northwind Database

For example, this company only has 29 suppliers for 77 products. Some suppliers provide more than one product, creating a one-to-many join.

Open the Join.str stream file located in the c:\Train\ModelerDataPrep directory


Figure A.2 The Merge Stream for the Northwind Database

The above diagram shows the stream (Join.str) that joins two tables—the Products and the Suppliers tables from the Northwind database—with a Merge node. Results from this merge are viewed in a table. (Order of fields in the table after the merge is based on the order of the sources in the Inputs tab of the Merge node, which is initially determined by the order in which the sources are connected to the Merge node.) If you want supplier information first, its source node should be connected to the Merge node first or ordered first on the Inputs tab of the Merge node.

This basic stream will be used throughout this appendix with the exception of the last section. Only changes to the Merge node will be made. For some examples, where noted, additional records have been added to some tables in a copy of the Northwind database.

Addition of the Sort node in the stream allows you to sort on a selected field (or fields) so that the results in a table may appear in either ascending (or descending) order.

A.3 An Inner Join

Table Relationships and Information

There are 77 products listed in the Products table (one per record).

There are 29 suppliers listed in the Suppliers table (one per record).

All suppliers (from the Suppliers table) are referenced in the Products table.

The Merge node dialog should appear as shown below.


Figure A.3 Merge Node Dialog

SupplierID is the Key (or Index) to link these tables together. The resulting table will contain 77 records, one for each product.

Figure A.4 Result of Inner Join
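In pandas terms this is merge with how='inner' (a sketch with a few hypothetical stand-in rows, not the actual Northwind data):

import pandas as pd

products = pd.DataFrame({"ProductID": [1, 2, 3],
                         "SupplierID": [10, 10, 20]})
suppliers = pd.DataFrame({"SupplierID": [10, 20],
                          "CompanyName": ["Acme", "Basso"]})

# Inner join on the key: one output record per product, supplier info attached
result = products.merge(suppliers, on="SupplierID", how="inner")
print(result)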


A.4 Joins with Non-Matching Records in One Table

Table Relationships and Information

All suppliers (from the Suppliers table) provide at least one product in the Products table.

All suppliers (from the Suppliers table) are mentioned in the Products table.

But not all products reference a supplier in the Suppliers table. A new record was added to the Products table (now 78 records) and the supplier is unknown. The new product is “Potatoes all rotten.”

Keep all Records (Full Outer Join)

This time, the Merge tab in the Merge dialog should appear as below.

Figure A.5 Merge Dialog Keeping Non-Matching Records

Upon execution of the stream (after sorting by product number in descending order, so that the new product appears first), we will find that the new product appears, but the supplier information is blank. We should now have 78 records.


Figure A.6 Merge Result When Product Has No Supplier
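A pandas sketch of the same full outer join (hypothetical rows; the product with no supplier survives with missing supplier fields):

import pandas as pd

products = pd.DataFrame({"ProductID": [1, 2, 78],
                         "SupplierID": [10, 20, None]})
suppliers = pd.DataFrame({"SupplierID": [10, 20],
                          "CompanyName": ["Acme", "Basso"]})

# how="outer" keeps non-matching records from both tables
result = products.merge(suppliers, on="SupplierID", how="outer")
print(result)  # ProductID 78 appears with CompanyName NaN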

Keep Only Matching Records (Inner Join)

In this case, we do not want the record if it cannot be joined with supplier information. To obtain this result, the dialog box in the Merge node appears as:

Figure A.7 Merge Dialog Keeping Only Matching Records

Upon execution of the stream (after sorting by product number – ProductID), we will not see the additional product. We have 77 records.


Figure A.8 Merge Result Keeping Only Matching Records

A.5 Joins with Non-Matching Records in Both Tables

Table Relationships and Information

Not all suppliers are mentioned in the Products table. A new supplier has been added, called “New Potatoes.”

Not all products reference a supplier in the Suppliers table. A new record was added to the Products table and the supplier is unknown. The product is “Potatoes all rotten.”

Keep all Records (Full Outer Join)

This time, the dialog box in the Merge node should appear as:


Figure A.9 Merge Dialog Keeping All Records (Matching and Non-Matching)

Upon execution of the stream (after sorting by product number), we will see the added product with no associated supplier.

Figure A.10 Result of Merge Keeping Non-Matching Records from Both Tables (Beginning)

Scrolling to the bottom of the table, we should see the added Supplier information, with no associated product. We now have 79 records.


Figure A.11 Result of Merge Keeping Non-Matching Records from Both Tables (End)

Keep Only Matching Records (Inner Join)

This time, the dialog box in the Merge node should appear as:


Figure A.12 Merge Dialog Keeping Only Matching Records

Upon execution of the stream, we note that the added product and the added supplier are not included. We now have 77 records.

Figure A.13 Merge Result Keeping Only Matching Records

Keep Non-Matching Records from One Table (Left- or Right-Outer Join)

In this case, the only non-matching records we want to include are for the Products table. If a product does not have supplier information, we still want to include that product. If, on the other hand, a supplier doesn’t have an associated product, we do not want to see that record. The dialog box in the Merge node should appear as:

Figure A.14 Merge Dialog to Begin Left- or Right-Outer Join

Any tables that contribute non-matching records are specified in the Select Dataset dialog.

Figure A.15 Selecting the Dataset that Supplies Incomplete (Non-Matching) Records

After executing this stream, we see that our table has 78 records. The dataset retains the records in which no supplier information is present (the new product) but excludes the record in which no product information is present (the new supplier).

Figure A.16 Result of Partial (Left- or Right-) Outer Join
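
In SQL terms, keeping non-matching records from the Products table only corresponds to a left outer join with Products as the left-hand table; a comparison sketch:

SELECT p.*, s.*
FROM Products AS p
LEFT OUTER JOIN Suppliers AS s
  ON p.SupplierID = s.SupplierID;

Reversing the roles of the two tables (a right outer join) would instead keep suppliers that have no associated products.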

A.6 Complex Joins in PASW Modeler

Joins that are more complex may require one or more of the following nodes.

The Aggregate Node

The Aggregate node allows you to aggregate cases before joining, changing the case basis to a higher summary level.

For example, using the Northwind database, we may want to build a table that lists products and product information along with the number of units of each product that have been sold.

Products are in the Products table of the Northwind database (this is also the data source). Orders are in the Order Details table of the same data source. We first need to aggregate the Order Details table by ProductID and sum the number of units of the product sold.

Figure A.17 Aggregating to Product Level

The resulting file contains 77 records, with quantities of each product sold.

Figure A.18 Data after Aggregation to Product Level

Now a Merge node can be added that merges the aggregated results from the Order Details table to the information contained in the Products table.

Figure A.19 Result of Merging an Aggregated File

The resulting file has all of the product information and also the quantity of product sold.
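
For readers who know SQL, the aggregate-then-merge sequence corresponds roughly to the following sketch (the Order Details table name contains a space, so it is bracketed here in the SQL Server/Access style; Quantity is the Northwind field being summed):

SELECT p.*, t.UnitsSold
FROM Products AS p
INNER JOIN (SELECT ProductID, SUM(Quantity) AS UnitsSold
            FROM [Order Details]
            GROUP BY ProductID) AS t
  ON p.ProductID = t.ProductID;

The inner join assumes, as in this example, that every product has at least one order line; a left outer join would instead retain products with no sales.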

A.7 The Sort and Distinct Nodes

Using the Sort and Distinct nodes together allows you to sort the data to find the most recent, oldest, first, or last record and then remove all duplicate records (based on user-specified criteria).

For example, using the Northwind database, we may be interested in merging the Orders and Employees tables. As the result, we want a table with all employees and an attached field that notes whether or not each employee made a sale. We do not want one record per order with employee information.

Figure A.20 Sorting the Orders Table by Employee ID

We first attach a Sort node to the Database node for the Orders table (not Order Details). We choose to sort by EmployeeID. Most of the time you will want to sort by more than one field so that you can select the first or last occurrence of a given event.

Figure A.21 Retaining One Record for Each Employee ID Value

Next, as shown above, we attach a Distinct node with EmployeeID as the key field and Include as the Mode option. This will create a new file that includes one order for each value of EmployeeID.

Figure A.22 Stream After Passing Through Distinct Node (One Record Per Employee)
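
For comparison, the Sort and Distinct combination plays the role that SELECT DISTINCT plays in SQL, reducing the Orders table to one row per employee:

SELECT DISTINCT EmployeeID
FROM Orders;

Note that the Distinct node keeps the entire first record for each employee, not just the key field, which is why sorting first matters whenever you care about which record is retained.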

Next, as shown in Figure A.23, we can attach a Merge node, connecting the Distinct node and the Database node for the Employees table. Note that an additional employee (Joe Blank), who sold no products, was added to the Employees table.

Figure A.23 Merging after Passing One Record Per Employee (From the Orders Table)

We should specify to merge on Keys. If we select the box Include matching and non-matching records (full outer join), we will get a table of all employees, whether or not a sale was made. If we choose Include only matching records (inner join), we will get a table of only those employees who have made one or more sales. For this exercise, we should select Include matching and non-matching records (full outer join), since we want to include all employees whether or not they have made a sale. We can use the Filter tab (shown in Figure A.24) to drop fields we do not want to include; the Photo field was also dropped.

Figure A.24 Dropping Fields in Merge Dialog (Filter Tab)

Since the Orders table is listed first in the Inputs tab (not shown), the order of fields in the Filter tab shows the Orders information first. We filter out all the Orders fields except EmployeeID.

Once the merge is complete and a Table node is attached, our results appear as:

Figure A.25 Merge Result

Now that the tables are joined, we can fill missing values or derive a new variable that notes that some employees did not have sales. An employee with no sales would have a null value for the Orders table EmployeeID field after the Merge (since the employee had no orders) if this field were retained. Such an employee, Joe Blank, was added to the Employees table. In order to retain and use the values from the Orders table EmployeeID field after the merge, you need to request that PASW Modeler not Combine duplicate key fields (the default is to combine them) and rename the field on the Filter tab of the Merge node, as shown in Figure A.26.

Figure A.26 Renaming a Field on the Filter Tab of the Merge Node

Now, we can derive a new field by attaching a Derive node after the Merge node.

Figure A.27 Creating a Flag Field - Did a Sales Employee Sell (T, F)

The new Derive node for the field nosales looks at the field salesemployee (new name for the EmployeeID field from the Orders table):

If salesemployee is null, the new value will be T.

If salesemployee is not null, the new value will be F.

We now attach another Filter node to the stream to remove salesemployee and some additional employee fields. Our new table (shown below) contains all salespeople and a flag noting whether they have made sales.

Figure A.28 Employee Data with Flag Indicating Whether a Sale Was Made
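
For readers who think in SQL, the whole sequence in this section (one record per employee, an outer join, and then the flag) can be sketched in a single statement; the CASE expression mirrors the Derive rule described above:

SELECT e.*,
       CASE WHEN o.EmployeeID IS NULL THEN 'T' ELSE 'F' END AS nosales
FROM Employees AS e
LEFT OUTER JOIN (SELECT DISTINCT EmployeeID FROM Orders) AS o
  ON e.EmployeeID = o.EmployeeID;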

A.8 Joining More Than Two Tables

If the same field joins all tables, PASW Modeler will let you join them with a single Merge node. In Figure A.29, if EmployeeID were the common key of all three tables, they could be joined using a single Merge node.

Figure A.29 Merging Multiple Streams Using a Common Key

When joining more than two tables where the key fields joining the tables differ, use multiple Merge nodes, one for each key. See the example below.

Figure A.30 Setup for Merging Multiple Streams Using Different Keys
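
In SQL, joins on different keys simply chain from one table to the next; a sketch with hypothetical tables A, B, and C joined on two different keys:

SELECT *
FROM A
INNER JOIN B ON A.Key1 = B.Key1
INNER JOIN C ON B.Key2 = C.Key2;

The multiple Merge nodes in Figure A.30 accomplish the same thing one key at a time.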

Final Comments

While joining and manipulating tables within PASW Modeler is different from writing SQL to do table joins and database manipulation, with a little planning and patience you will be able to complete even the most difficult table joins.

Appendix B: Statistics Transform Node

B.1 Introduction

We have performed various types of transformations on fields in several lessons in this course, including mathematical transformations and two types of binning. PASW Modeler can complete many types of data transformations, but for situations where you need to accomplish something that would be awkward or impossible within PASW Modeler, and you have access to PASW Statistics on your computer or a server, you have another option. The Statistics Transform node provides access to most of the data transformations available in PASW Statistics, using command syntax. This makes it possible to complete a number of transformations not supported by PASW Modeler and allows automation of complex, multi-step transformations, including the creation of several fields from a single node. The Statistics Transform node is similar to the Statistics Output node, except that the data are returned to PASW Modeler for further analysis downstream; the Statistics Output node is a terminal node, so it returns only the requested output objects, such as graphs or tables, not any data.

In this appendix we will use the Count transformation in Statistics to illustrate how the Statistics Transform node can accomplish one common task in data mining. We use the data file shopping.txt, which contains information on products bought by shoppers at a large grocery chain. The file also contains demographic information on the shoppers, with each record corresponding to one shopping trip.

B.2 Counting Occurrences of Values

Consumer or retail data often contain several fields that store information on products purchased by customers. There are many types of analysis that can be done with such data, including developing association models to determine which combinations of products are purchased together. We also may be interested in how many different types of products were bought by each customer, since, on average, those who buy more types of products are likely to generate more revenue.

The data file shopping.txt stores information on unique shopping trips by customers of a large grocery store chain. Let’s examine the data to see the format of the product fields. An existing stream file, ModelerDataPrepAppB.str, already contains the necessary source node.

Click File…Open Stream
Move to the folder c:\Train\ModelerDataPrep
Double-click on ModelerDataPrepAppB.str
Run the Table node

Figure B.1 Table of Grocery Shopping Data

There are ten product categories covering types of food and other products. Each of these is a flag field, coded 1 or 0 depending on whether the customer bought one or more products in that category on a trip to the store (1=yes, 0=no). We would like to count the number of “Yes” responses in each record to create a field that measures the number of different product types each customer purchased. Could you accomplish this task in PASW Modeler? How?

The Statistics Count Transformation

The Count command in Statistics counts the occurrences of one or more specified values across one or more fields (variables) and stores the result in a new field.

From the PASW Statistics palette, add a Statistics Transform node to the stream canvas
Attach the Source node to the Statistics Transform node
Edit the Statistics Transform node
Click Syntax editor

Figure B.2 Statistics Transform Node

To create PASW Statistics transformations you must know PASW Statistics command syntax. Syntax help is available by clicking the Syntax Help button, but this will only aid you if you already know syntax in general. Commands are entered in the text box just as you would format them in Statistics, and most, though not all, Statistics transformations are available. Syntax can be checked after creating it, and by default the syntax is checked when you click OK to save the command(s).

The syntax we need to count the number of “Yes” responses has this generic format:

COUNT newvarname = varlist (value list).

The command name is COUNT. A new variable (field) name is specified; after the equal sign comes the list of fields in which to count, and the value(s) to count are listed in parentheses, separated by commas.
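
Because the product fields in this example are 0/1 flags, readers who know SQL can think of this particular use of COUNT as summing indicator columns. A rough sketch, treating the data as a table named shopping and using three hypothetical flag fields (COUNT itself is more general, since it can count any list of values in any fields):

SELECT s.*,
       (cheese + meat + bakery) AS NumProductTypes
FROM shopping AS s;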

Since PASW Statistics will be processing this command, field names must match valid variable names in Statistics. If they do not, you can either change them yourself before the Statistics Transform node, or you can use the Filter tab in this dialog box. We need to make some modifications because some of the product type names have a space, which isn't allowed in a PASW Statistics variable name.

Click Filter tab

The Filter Options menu has a choice to automatically change the field names as necessary.

Click the Filter Options menu button
From the dropdown list, select Rename for PASW Statistics
Click OK in the next dialog box

Figure B.3 Field Names Modified for use in PASW Statistics

PASW Modeler has added a “#” character where there was a space. This works but creates awkward field names, so you may prefer to edit them yourself; you can do so as you would in any Filter tab. For this example we will use the automatically renamed field names.

Click Syntax tab
Enter the text COUNT NumProductTypes = in the text box
Use the field chooser button to add all ten product types after the equal sign
Hit the Enter key
On the second line, enter the text (1).

Figure B.4 Count Command Syntax

Note that PASW Statistics commands end with a period.

We can check the syntax.

Click the Check button

Figure B.5 Count Syntax Checked

Any messages from PASW Modeler appear in the information box below the text box. There are no error messages, only notes from PASW Modeler about changing variable names and storage.

Click the Preview button
Scroll to the right in the preview window

Figure B.6 New Product Sum Field Added to Data

The field NumProductTypes has been added to the data. It has decimal digits, indicating that it is stored as a real number; PASW Modeler therefore treats it as a field of type Continuous. To examine its distribution, we should use a Histogram.

Close the Preview window
Click OK to return to the Stream Canvas
Add a Histogram node from the Graphs palette to the stream
Attach the Statistics Transform node to the Histogram node
Edit the Histogram node
Select NumProductTypes as the Field
Click Run

Figure B.7 Histogram of NumProductTypes

There is an intriguing pattern to the data. The distribution is skewed to the right, with most customers purchasing only a few types of products. No customer bought all ten product types on one trip.

The CRISP-DM methodology mandates that we continually look for interesting patterns in the data, but also that we search for anomalies that might indicate potential problems. Most people who shop at a grocery store buy more than one type of product, so the pattern you see here might lead us to question the quality of the sample. We may suspect that these customers include only a subset of the larger file. This would send us back to earlier stages of the data-mining project or to the person knowledgeable about the business questions to investigate this further.

We can do one more test here. All things being equal, those with children might be expected to buy a wider range of products because those customers are buying for more people. To investigate this, we'll overlay the histogram with the field CHILDREN, which indicates whether or not the customer has children.

Close the Histogram window
Edit the Histogram node
Add CHILDREN as a Color Overlay field
Click Options tab and select Normalize by color (not shown)
Click Run

Figure B.8 Histogram of NumProductTypes Overlaid with CHILDREN

The pattern continues to look odd. Those with children are more likely to buy fewer types of products on a shopping trip. It is possible to come up with a supposition to explain this pattern (those with children are busier and often need to buy only one or two items of immediate need for their children), but that would need further support beyond these data. This does provide a realistic example of how data transformations can lead to additional questions and opportunities.

In this appendix we have introduced the Statistics Transform node, which can perform transformations that can't be done, or can't be done easily, in PASW Modeler. Remember that PASW Statistics must be installed and properly licensed to use this feature.