IBM Intelligent Miner for Data Applications Guide Peter Cabena, Hyun Hee Choi, Il Soo Kim, Shuichi Otsuka, Joerg Reinschmidt, Gary Saarenvirta International Technical Support Organization http://www.redbooks.ibm.com SG24-5252-00





Intelligent Miner for Data Applications Guide

Peter Cabena, Hyun Hee Choi, Il Soo Kim, Shuichi Otsuka, Joerg Reinschmidt, Gary Saarenvirta

International Technical Support Organization

http://www.redbooks.ibm.com

SG24-5252-00


International Technical Support Organization

Intelligent Miner for Data Applications Guide

March 1999

SG24-5252-00



Take Note!

Before using this information and the product it supports, be sure to read the general information in Appendix A, “Special Notices” on page 137.

First Edition (March 1999)

This edition applies to Version 2, Release 1 of the Intelligent Miner for Data, Program Number 5801-AAR, for use with the AIX Operating System.

Comments may be addressed to:
IBM Corporation, International Technical Support Organization
Dept. QXXE Building 80-E2
650 Harry Road
San Jose, California 95120-6099

When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

Copyright International Business Machines Corporation 1999. All rights reserved.
Note to U.S. Government Users — Documentation related to restricted rights — Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.


Contents

Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Preface . . . ix
The Team That Wrote This Redbook . . . ix
Comments Welcome . . . x

Chapter 1. Introduction . . . 1
1.1 Why Now? . . . 1

1.1.1 Changed Business Environment . . . 1
1.1.2 Drivers . . . 2
1.1.3 Enablers . . . 4

1.2 What Is Data Mining? . . . 5
1.3 Data Mining and Business Intelligence . . . 6

1.3.1 Where to from Here? . . . 7
1.4 Data Mining Applications . . . 8
1.5 Data Mining Techniques . . . 9

1.5.1 Predictive Modeling . . . 9
1.5.2 Database Segmentation . . . 11
1.5.3 Link Analysis . . . 13

1.6 General Approach to Data Mining . . . 14
1.6.1 Business Requirements Analysis . . . 15
1.6.2 Project Management . . . 15
1.6.3 Business Solution Design . . . 16
1.6.4 Data Mining Run . . . 16
1.6.5 Business Implementation Design . . . 16
1.6.6 Business Implementation . . . 16
1.6.7 Results Tracking . . . 17
1.6.8 Final Business Result Determination . . . 17
1.6.9 Business Result Analysis . . . 17

Chapter 2. Introduction to the Intelligent Miner . . . 19
2.1 History . . . 19
2.2 Intended Customers . . . 19
2.3 What Is the Intelligent Miner? . . . 19
2.4 Data Mining with the Intelligent Miner . . . 20
2.5 Overview of the Intelligent Miner Components . . . 20

2.5.1 Intelligent Miner Architecture . . . 20
2.5.2 Intelligent Miner TaskGuides . . . 22
2.5.3 Mining and Statistics Functions . . . 23
2.5.4 Processing Functions . . . 24
2.5.5 Modes . . . 24

Chapter 3. Case Study Framework . . . 27
3.1 Customer Relationship Management . . . 27
3.2 Case Studies . . . 29
3.3 Strategic Customer Segmentation . . . 29
3.4 Case Studies . . . 30

Chapter 4. Customer Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 33

Copyright IBM Corp. 1999 iii


4.1 Executive Summary . . . 33
4.2 Business Requirements . . . 33
4.3 Data Mining Process . . . 34

4.3.1 Data Selection . . . 36
4.3.2 Data Preparation . . . 38
4.3.3 Data Mining . . . 44

4.4 Data Mining Results . . . 50
4.4.1 Cluster Details Analysis . . . 53
4.4.2 Cluster Characterization . . . 55
4.4.3 Cluster Profiling . . . 64
4.4.4 Decision Tree Characterization . . . 66

4.5 Business Implementation and Next Steps . . . . . . . . . . . . . . . . . . . 67

Chapter 5. Cross-Selling Opportunity Identification . . . 69
5.1 Executive Summary . . . 69
5.2 Business Requirement . . . 69
5.3 Data Mining Process . . . 70

5.3.1 Cluster Selection . . . 71
5.3.2 Data Selection . . . 72
5.3.3 Data Preparation . . . 73
5.3.4 Product Association Analysis . . . 74

5.4 Data Mining Results . . . 76
5.4.1 Cluster Selection . . . 76
5.4.2 Association Rule Discovery . . . 77

5.5 Business Implementation and Next Steps . . . . . . . . . . . . . . . . . . . 85

Chapter 6. Target Marketing Model to Support a Cross-Selling Campaign . . . 87
6.1 Executive Summary . . . 87
6.2 Business Requirements . . . 88
6.3 Data Mining Process . . . 89

6.3.1 Create Objective Variable . . . 90
6.3.2 Data Preparation . . . 92
6.3.3 Data Sampling for Training and Test . . . 93
6.3.4 Feature Selection . . . 95
6.3.5 Train and Test . . . 95
6.3.6 Select “Best Model” . . . 102
6.3.7 Perform Population Stability Tests on Application Universe . . . 103

6.4 Data Mining Results . . . 103
6.4.1 Decision Tree . . . 103
6.4.2 RBF . . . 106
6.4.3 Neural Network . . . 108

6.5 Business Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Chapter 7. Attrition Model to Improve Customer Retention . . . 111
7.1 Executive Summary . . . 111
7.2 Business Requirement . . . 112
7.3 Data Mining Process . . . 113

7.3.1 Data Definition . . . 115
7.3.2 Data Preparation . . . 116
7.3.3 Data Mining . . . 117
7.3.4 Gains Chart . . . 120
7.3.5 Clustering . . . 120

7.4 Data Mining Results . . . 120
7.4.1 Decision Tree . . . 120
7.4.2 RBF Modeling . . . 122



7.4.3 Neural Network . . . 126
7.4.4 Clustering . . . 127
7.4.5 Time-Series Prediction . . . 128

7.5 Business Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Chapter 8. Intelligent Miner Advantages . . . . . . . . . . . . . . . . . . . . . 133

Appendix A. Special Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

Appendix B. Related Publications . . . 139
B.1 International Technical Support Organization Publications . . . 139
B.2 Redbooks on CD-ROMs . . . 139
B.3 Other Publications . . . 139

How to Get ITSO Redbooks . . . 141
How IBM Employees Can Get ITSO Redbooks . . . 141
How Customers Can Get ITSO Redbooks . . . 142
IBM Redbook Order Form . . . 143

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

ITSO Redbook Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155





Figures

1. New Customer Relationships Out of Reach . . . 4
2. Data Mining Positioning . . . 5
3. Data Mining and Business Intelligence . . . 7
4. Predictive Modeling . . . 10
5. Database Segmentation . . . 12
6. Pattern Matching . . . 13
7. The Data Mining Process . . . 14
8. The Intelligent Miner Architecture . . . 21
9. The Data Task Guide . . . 23
10. Customer Segmentation Model . . . 30
11. Data Mining Process: Customer Segmentation . . . 35
12. Customer Transaction Data Model . . . 37
13. Original Data Profile . . . 41
14. Post-Discretized Data Profile . . . 42
15. Post Logarithm Transformed Data Profile . . . 43
16. Clustering Process Flow . . . 45
17. Shareholder Value Demographic Clusters . . . 51
18. Shareholder Value Neural Network Clusters . . . 52
19. Shareholder Value Demographic Cluster Details . . . 54
20. Cluster 6 Detailed View . . . 56
21. Cluster 3 Detailed View . . . 58
22. Cluster 5 Detailed View . . . 60
23. Cluster 5 Tabulated Details . . . 61
24. Cluster 1 Detailed View . . . 63
25. Decision Tree Confusion Matrix . . . 66
26. Decision Tree Model . . . 67
27. Data Mining Process: Cross-Selling Opportunity . . . 71
28. Typical Transaction Record . . . 72
29. Product Association Analysis Workflow . . . 74
30. Parameter Settings for Associations . . . 78
31. Associations on Good Customer Set . . . 78
32. Associations on Good Customer Set Detail . . . 79
33. Associations for Good Customer Set: LIS Removed . . . 79
34. Associations for Good Customer Set: LIS Removed, Detail . . . 80
35. Associations on Okay Customer Set . . . 80
36. Associations on Okay Customer Set Detail . . . 81
37. Associations for Okay Customer Set: LIS Removed . . . 81
38. Associations for Good Customer Set: LIS Removed, Summary . . . 82
39. Associations for Good Customer Set: LIS Removed, Detail . . . 82
40. Associations for Good Customer Set: LIS and Certain Products Removed, Summary . . . 82
41. Associations for Good Customer Set: LIS and Certain Products Removed, Detail . . . 83
42. Associations for Okay Customer Set: LIS and Certain Products Removed, Summary . . . 83
43. Associations for Okay Customer Set: LIS and Certain Products Removed, Detail . . . 83
44. Associations for All Transactions: LIS Removed, Summary . . . 84
45. Associations for All Transactions: LIS Removed, Detail . . . 84
46. Data Mining Process: Cross-Selling . . . 90
47. Creating an Objective Variable . . . 91



48. Cross-Selling: Data Sampling . . . 94
49. Detailed Predictive Modeling Process . . . 96
50. Decision Tree Results: Isolating the Key Decision Criteria . . . 104
51. Gains Chart for Decision Tree Results . . . 105
52. RBF Results . . . 107
53. Cross-Selling: Comparison of Three Predictive Models . . . 109
54. Cross-Selling: ROI Analysis Figures . . . 110
55. Reducing Defections 5% Boosts Profits 25% to 85% . . . 111
56. Data Mining Process: Attrition Analysis . . . 114
57. Attrition Analysis: Data Definition . . . 116
58. Time Series: Setting the Parameters . . . 118
59. Attrition Analysis: Decision Tree Structure . . . 121
60. Decision Tree Gains Chart: Training and Testing . . . 122
61. RBF: Results Window . . . 123
62. Attrition Analysis: Predicting Values Result . . . 124
63. Attrition Analysis: Predicting Values . . . 125
64. Attrition Analysis: Comparative Gains Charts for All Methods . . . 126
65. Attrition Analysis: Demographic Clustering of Likely Defectors . . . 128
66. Profile of Time-Series Prediction . . . 129
67. Time Profile of Defection Probability for Defectors . . . 130
68. Time Profile of Defection Probability for Nondefectors . . . 131



Tables

1. Customer Revenue by Cluster . . . 64
2. Comparison of Neural and Demographic Clustering Results . . . 65
3. Demographic Clustering Results: Percentage . . . 77
4. Cross-Selling: Summary - Predictive Modeling More Than Doubles ROI . . . 88
5. Cross-Selling: Baseline ROI Calculation . . . 88
6. Cross-Selling: ROI Analysis Figures . . . 109





Preface

This redbook is a step-by-step guide to data mining with Intelligent Miner Version 2. It will help customers better understand the usability and the business value of the product.

The focus is on helping the Intelligent Miner V2 user determine which algorithms to use and how to effectively exploit them. The business utilized as a case study in the book is a retail bank client of Loyalty Consulting, an IBM business partner based in Toronto, Canada.

After a short introduction to data mining technology and Intelligent Miner V2, the case study framework is described. The rest of the book covers each data mining technique in detail and provides ideas on how to implement the techniques.

Although no in-depth knowledge of the Intelligent Miner V2 is required, a basic understanding of data mining technology is assumed.

The Team That Wrote This Redbook

This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, San Jose Center.

Peter Cabena is a data warehouse and data mining specialist at IBM's International Technical Support Organization - San Jose Center. He holds a Bachelor of Science degree in computer science from Trinity College, Dublin, Ireland. Peter has been extensively involved in the IBM data warehouse effort since its inception in 1991. In recent years, he has taught and presented internationally on the subjects of data warehousing and data mining.

Peter conceived and managed the project that produced this book.

Hyun Hee Choi is a data mining researcher at the Korea Software Development Institute, a branch of IBM in Korea. She holds a Master of Science degree in statistics from Korea University, Seoul, Korea, where she focused her research on time-series analysis. Hyun Hee has several years of experience in data mining and business intelligence consulting projects for airline, banking, insurance, and credit card customer data analysis. She can be reached by e-mail at [email protected].

Il Soo Kim is a Business Intelligence Solution Specialist at IBM Korea. He holds a Master of Science degree in engineering from Seoul National University, Seoul, Korea. Il Soo specializes in content management. Recently he has been involved in constructing an in-house patent data warehouse and designing a patent data analysis program.

Shuichi Otsuka works for the Business Intelligence Solution Center, IBM Japan. He has been engaged for several years in data mining projects, mainly in distribution industries. Shuichi and his colleagues have translated Data Mining with Neural Networks by Joe Bigus into Japanese.

Joerg Reinschmidt is a data management and data mining specialist at IBM's International Technical Support Organization, San Jose Center. He has been engaged for several years in all data-management-related topics, such as second-level support and technical marketing support. For the last several years, Joerg has taught several technical classes on DB2, while focusing on DB2 and IMS Internet connectivity.

Gary Saarenvirta is a principal consultant of Loyalty Consulting at The Loyalty Group in Toronto, Canada. He has worked in the business intelligence industry for more than eight years, providing data mining and data warehousing consulting services for Global 2000 companies. Gary joined The Loyalty Group to manage the design, construction, and operation of the company's data warehouse. He played a key role in the development of Loyalty Consulting's Decision Support business over the last few years.

Gary was the lead editor of this book and conceived the framework and data mining methodology for each case study.

Thanks to the following people for their invaluable contributions to this project:

Hanspeter Nagel
International Technical Support Organization, San Jose Center

Susan Dahm
IBM Santa Teresa Laboratory

Ingrid Foerster
IBM Santa Teresa Laboratory

Comments Welcome

Your comments are important to us!

We want our redbooks to be as helpful as possible. Please send us your comments about this or other redbooks in one of the following ways:

• Fax the evaluation form found in “ITSO Redbook Evaluation” on page 155 to the fax number shown on the form.

• Use the electronic evaluation form found on the Redbooks Web sites:

For Internet users: http://www.redbooks.ibm.com
For IBM Intranet users: http://w3.itso.ibm.com

• Send us a note at the following address:

[email protected]



Chapter 1. Introduction

Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases. The genesis of the field came with the realization that traditional decision-support methodologies, which combine simple statistical techniques with executive information systems, do not scale to the point where they can deal with large databases and data warehouses within the time limits imposed by today's business environment. Data mining has captured the imagination of the business and academic worlds, moving very quickly from a niche research discipline in the mid-eighties to a flourishing field today. In fact, 80% of the Fortune 500 companies are currently involved in a data mining pilot project or have already deployed one or more data mining production systems.

1.1 Why Now?

Much of the current upsurge of interest in data mining arises from the confluence of two forces: the need for data mining (drivers) and the means to implement it (enablers). The drivers are primarily the business environment changes that have resulted in an increasingly competitive marketplace. The enablers are mostly recent technical advances in machine learning research and database technologies. This happy coincidence of growing commercial pressures and major advances in research and information technology lends an inevitable push toward a more advanced approach to informing critical business decisions.

Before looking at these drivers and enablers in some detail, it is worth reviewing the commercial backdrop against which these two forces are coming together.

1.1.1 Changed Business Environment

Today's business environment is in flux. Fundamental changes are influencing the way organizations view and plan to approach their customers. Among these changes are:

• Customer behaviour patterns

Consumers are becoming more demanding and have access to better information through buyers' guides, catalogs, and the Web. New demographics are emerging: only 15% of U.S. families are now traditional single-earner units, that is, a married couple with or without children where only the husband works outside the home. Many consumers are reportedly confused by too many choices and are starting to limit the number of businesses with which they are prepared to deal. They are starting to put more value on the time they spend shopping for goods and services.

• Market saturation

Many markets have become saturated. For example, in the United States almost everyone uses a bank account, has at least one credit card, has some form of automobile and property insurance, and has well-established purchasing patterns in basic food items. Thus, in these areas, few options are available to organizations wanting to expand their market share. If a merger or takeover is not possible, such organizations often must resort to effectively stealing customers from competitors, frequently by what is called predatory pricing. Lowering prices is not a sound long-term strategy, however, as only one supplier can be the lowest-cost provider.

• New niche markets

New, untapped markets are opening up. Examples are the handicapped and ethnic groups or the current U.S. inner-city hip-hop culture. Also, highly specialized stores such as SunGlass Hut are emerging.

• Increased commoditization

Increased commoditization, where even many leading brand products and services are finding it increasingly difficult to differentiate themselves, has sent many suppliers in search of new distribution channels. Witness the increase in online service outlets, from catalogs to banking and insurance to Internet-based shopping malls.

• Traditional marketing approaches under pressure

Traditional mass marketing and even database marketing approaches are becoming ineffective, as customers are increasingly turning to more targeted channels. Customers are shopping in fewer stores and are expecting to do more one-stop shopping.

• Time to market

Time to market has become increasingly important. Witness the recent emergence and spectacular rise of Netscape Communications Corporation in the Web browser marketplace. With only a few months' lead over its rivals, Netscape captured an estimated 80% of the browser market within a year of establishment. This is the exception, of course; most companies operate by making small incremental changes to services or products to capture additional customers.

• Shorter product life cycles

Today products are brought to market quickly but often have a short life cycle. This phenomenon is currently exemplified by the personal computer and Internet industries, where new products and services are offered at arguably faster rates than at any other time in the history of computing. The result of these shortened life cycles is that providers have less time to turn a profit or to “milk” their products and services.

• Increased competition and business risks

Many of the above changes combine to create a highly competitive climate and a challenging risk management environment for many organizations. General trends like commoditization, globalization, deregulation, and the Internet make it increasingly difficult to keep track of competitive forces, both traditional and new. Equally, rapidly changing consumer trends inject new risks into doing business.

1.1.2 Drivers

Against this background, many organizations have been forced to reevaluate their traditional approaches to doing business and have started to look for ways to respond to changes in the business environment. The main requirements driving this reevaluation are:

• Focus on the customer

The requirement here is to rejuvenate customer relationships with an emphasis on greater intimacy, collaboration, and one-to-one partnership. In turn, this requirement has forced organizations to ask new questions about their existing customers and potential customers, for example:

− Which general classes of customer do I have?

− How can I sell more to my existing customers?

− Is there a recognizable pattern whereby my customers acquire products or use services?

− Which of my customers will prove to be good, long-term, valuable customers and which will not?

− Can I predict which of my customers are more likely to default on their payments or to defraud me?

• Focus on the competition

Organizations need to focus increasingly on competitive forces with a view to building up a modern armory of business weapons. Some of the approaches to building such an armory are:

− Prediction of potential strategies or major business plans by leading competitors

− Prediction of tactical movements by local competitors

− Discovery of subpopulations of existing customers that are especially vulnerable to competitive offers

• Focus on the data asset

Business and information technology (IT) managers are becoming increasingly aware that there is an information-driven opportunity to be seized. Many organizations are now beginning to view their accumulated data resources as a critical business asset.

Some of the factors contributing to this growing awareness are:

− Growing evidence of exponential return on investment (ROI) numbers from industry watchers and consultants on the benefits of a modern, corporate, decision-making strategy based on data-driven techniques such as data warehousing. Data mining is a high-leverage business where even small improvements in the accuracy of business decisions can have huge benefits.

− Growing availability of data warehouses. As the data warehouse approach becomes more pervasive, early adopters are forced to leverage further value from their investments by pushing into new technology areas to maintain their competitive edge.

− Growing availability of success stories, both anecdotal and otherwise, in the popular trade press.

Figure 1 summarizes the situation. The frustrated business executive is attempting to grasp new opportunities such as better customer relationships and improved services. He fails, however, given the combination of a rapidly changing business environment and poor or outdated in-house technology systems.


Figure 1. New Customer Relationships Out of Reach

1.1.3 Enablers

There is a set of enablers for data mining that, when combined with the driving forces discussed above, substantially increases the momentum toward a revised approach to business decision making:

• Data flood

Forty years of information technology have led to the storage of enormous amounts of data (measured in gigabytes and terabytes) on computer systems. A typical business trip today generates an automatic electronic audit trail of a traveler's habits and preferences in airline travel, car hire, credit card usage, reading material, mobile phone services, and perhaps Web sites.

In addition, the increasing availability of demographic and psychographic data from syndicated providers, such as A.C. Nielsen and Acxiom in the United States, has provided data miners with a useful data source. The availability of such data is particularly important given the focus in data mining on consumer behavior, which is often driven by preferences and choices that are not visible in a single organization's database.

• Growth of data warehousing

The growth of data warehousing in organizations has led to a ready supply of the basic raw material for data mining: clean and well-documented databases. Early adopters of the warehousing approach are now poised to further capitalize on their investment. See ″The Data Warehouse Connection″ on page 18 for a detailed discussion of the integration of data warehouse and data mining approaches.

• New information technology solutions

More cost-effective IT solutions in terms of storage and processing ability have made large-scale data mining projects possible. This is particularly true of parallel technologies, as many of the data mining algorithms are parallel by nature. Furthermore, increasingly affordable desktop power has enabled the emergence of sophisticated visualization packages, which are a key weapon in the data mining armory.

• New research in machine learning

New algorithms from research centers and universities are being pressed into commercial service more quickly than ever. Emphasis on commercial applications has focused attention on better and more scalable algorithms, which are beginning to come to market through commercial products. This movement is supported by increasing contact and joint ventures between research centers and commercial industries around the world.

The net effect of the changed business environment is that decision making has become much more complicated, problems have become more complex, and the decision-making process less structured. Decision makers today need a set of strategies and tools to address these fundamental changes.

1.2 What Is Data Mining?

It is difficult to make definitive statements about an evolving area, and data mining is certainly an area in rapid evolution. However, we need a framework within which to position and better understand the subject. Figure 2 shows a general positioning of the components in a data mining environment.

Figure 2. Data Mining Positioning

Although there is no one single definition of data mining that would meet with universal approval, the following definition is generally acceptable:


Data Mining...

is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions.

The highlighted words in the definition lend insight into the essential nature of data mining and help to explain the fundamental differences between it and the traditional approaches to data analysis, such as query and reporting and online analytical processing (OLAP). In essence, data mining is distinguished by the fact that it is aimed at the discovery of information, without a previously formulated hypothesis.

First, the information discovered must have been previously unknown. Although this sounds obvious, the real issue here is that it must be unlikely that the information could have been hypothesized in advance; that is, the data miner is looking for something that is not intuitive or, perhaps, even counterintuitive. The further away the information is from being obvious, potentially the more value it has. Data mining can uncover information that could not even have been hypothesized with other approaches.

Second, the new information must be valid. This element of the definition relates to the problem of overoptimism in data mining; that is, if data miners look hard enough in a large collection of data, they are bound to find something of interest sooner or later. For example, the potential number of associations between items in customers' shopping baskets rises exponentially with the number of items. The possibility of spurious results applies to all data mining and highlights the constant need for post-data-mining validation and sanity checking.
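The combinatorics behind this caution are easy to check. Counting the nonempty itemsets over n products (a simple sketch of our own, not a formula from the product documentation) shows how quickly the search space grows:

```python
def candidate_itemsets(n_items):
    """Number of distinct nonempty itemsets over n products: 2**n - 1."""
    return 2 ** n_items - 1

# Even a modest product range yields an enormous space of candidate
# associations, so some "interesting" pairs will occur by chance alone.
for n in (10, 20, 40):
    print(n, candidate_itemsets(n))
```

With 40 products there are already over a trillion candidate itemsets, which is why validation on held-out data is essential.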

Third, and most critically, the new information must be actionable; that is, it must be possible to translate it into some business advantage. Consider the classic example of the retail store manager who, using data mining, discovered a strong association between the sales of diapers and beer on Friday evenings: he could leverage the results of the analysis by placing the beer and diapers closer together in the store or by ensuring that the two items were not discounted at the same time. In many cases, however, the actionable criterion is not so simple.

The ability to use the mined data to inform crucial business decisions is another critical environmental condition for successful commercial data mining and underpins data mining's strong association with and applicability to business problems. Needless to say, an organization must have the necessary political will to carry out the action implied by the mining.

1.3 Data Mining and Business Intelligence

We use business intelligence as a global term for all the processes, techniques, and tools that support business decision making based on information technology. The approaches can range from a simple spreadsheet to a major competitive intelligence undertaking. Data mining is an important new component of business intelligence. Figure 3 on page 7 shows the logical positioning of different business intelligence technologies according to their potential value as a basis for tactical and strategic business decisions.


In general, the value of the information to support decision making increases from the bottom of the pyramid to the top. A decision based on data in the lower layers, where there are typically millions of data records, will typically affect only a single customer transaction. A decision based on the highly summarized data in the upper layers is much more likely to be about company or department initiatives or even a major redirection. Therefore we generally also find different types of users on the different layers. A database administrator works primarily with databases on the data source and data warehouse levels, whereas business analysts and executives work primarily on the higher levels of the pyramid.

Note that Figure 3 portrays a logical positioning and not a physical interdependence among the various technology layers. For example, data mining can be based on data warehouses or flat files, and the data presentation technologies can, of course, be used outside data mining.

Figure 3. Data Mining and Business Intelligence

1.3.1 Where to from Here?

It is probably a little early to ponder the future of data mining, but some trends on the horizon are already becoming clear.

Data mining technology trends are becoming established as we see vendors scramble to position their tools and services within the new data mining paradigm. This scramble will be followed by the inevitable technology shakeout, where some vendors will manage to establish leadership positions in the provision of tools and services and others will simply follow. Doubtless, new data mining algorithms will continue to be developed, but, over time, the technology will begin to dissolve into the general backdrop of database and data management technology. Already, we are seeing the merging of OLAP and multidimensional database analysis (MDA) tools and the introduction of structured query language (SQL) extensions for mining data directly from relational databases.


On the data mining process side, there will be more open sharing of experiences by the early adopters of data mining. Solid, verifiable success stories are already beginning to appear. Over time, as more of the implementation details of these successes emerge, knowledge of the data mining process will begin to move out into the public domain.

The final phase in the evolution will be the integration of the data mining process into the overall business intelligence machinery. In the long run, data mining, like all truly great technologies, may simply become transparent!

1.4 Data Mining Applications

Large customers in mature, competitive industries can no longer establish competitive advantage through transaction systems or business process improvement. To distinguish their strategies and operations from those of competitors, they must discover and extract strategic value from their operational data. Companies generate enormous amounts of data during the course of doing business, and business intelligence is the process of transforming that data into knowledge.

Business intelligence enables companies to make strategic marketing decisions about which markets to enter and which products to promote, all in an effort to increase profitability. Some customers use business intelligence for marketing purposes; others, to detect fraud. New marketing strategies and the implementation of fraud detection can also reduce operating costs through effective financial analysis, risk management, fraud management, distribution and logistics management, and sales analysis.

Perhaps the best known application area for data mining is database marketing. The objective is to drive targeted, and therefore effective, marketing and promotional campaigns through the analysis of corporate databases. Data known through credit card transactions or loyalty cards, for example, mixed with publicly available information from sources such as lifestyle studies, forms a potent concoction. Data mining algorithms then sift through the data, looking for clusters of ″model″ consumers who share the same characteristics, such as interests, income level, and spending habits. It is a win-win game for both the consumers and marketers: Consumers perceive greater value in the (reduced) number of advertising messages, and marketers save by limiting their distribution costs and getting an improved response to the campaign.

Another application area for data mining is that of determining customer purchasing patterns over time. Marketers can determine much about the behavior of consumers, such as the sequence in which they take up financial services as their family grows, or how they change their cars. Commonly the conversion of a single bank account to a joint account indicates marriage, which could lead to future opportunities to sell a mortgage, a loan for a honeymoon vacation, life insurance, a home equity loan, or a loan to cover college fees. By understanding these patterns, marketers can advertise just-in-time to these consumers, thus ensuring that the message is focused and likely to draw a response. In the long run, focusing on long-term customer purchasing patterns provides a full appreciation of the lifetime value of customers, where the strategy is to move away from share of market to share of customer. An average supermarket customer is worth $200,000 over his or her lifetime, and General Motors estimates that the lifetime value of an automobile customer is $400,000, which includes car, service, and income on loan financing. Clearly, understanding and cultivating long-term relationships bring commercial benefits.

Cross-selling campaigns constitute another application area where data mining is widely used. Cross selling is where a retailer or service provider makes it attractive for customers who buy one product or service to buy an associated product or service.

1.5 Data Mining Techniques

Data mining techniques are specific implementations of the algorithms that are used to carry out the data mining operations.

Predictive modeling, database segmentation, link analysis, and deviation detection are the four major operations for implementing any of the business applications. We deliberately do not show a fixed, one-to-one link between the business applications and data mining layers, to avoid the suggestion that only certain operations are appropriate for certain applications and vice versa. (On the contrary, truly breakthrough results can sometimes come from the use of nonintuitive approaches to problems.) Nevertheless, certain well-established links between the applications and the corresponding operations do exist. For example, modern target marketing strategies are almost always implemented by means of the database segmentation operation. However, fraud detection could be implemented by any of the four operations, depending on the nature of the problem and input data. Furthermore, the operations are not mutually exclusive. For example, a common approach to customer retention is to segment the database first and then apply predictive modeling to the resultant, more homogeneous segments. Typically the data analyst, perhaps in conjunction with the business analyst, selects the data mining operations to use.

Not all algorithms that implement a particular data mining operation are equal, and each has its own strengths and weaknesses.

The key message is this: There is rarely one foolproof technique for any given operation or application, and the success of the data mining exercise relies critically on the experience and intuition of the data analyst.

In the sections that follow we discuss in detail the operations associated withdata mining.

1.5.1 Predictive Modeling

Predictive modeling is akin to the human learning experience, where we use observations to form a model of the essential, underlying characteristics of some phenomenon. For example, in its early years, a young child observes several different examples of dogs and can then later in life use the essential characteristics of dogs to accurately identify (classify) new animals as dogs. This predictive ability is critical in that it helps us to make sound generalizations about the world around us and to fit new information into a general framework.

In data mining, we use a predictive model to analyze an existing database to determine some essential characteristics about the data. Of course, the data must include complete, valid observations from which the model can learn how to make accurate predictions. The model must be told the correct answer to some already solved cases before it can start to make up its own mind about new observations. When an algorithm works in this way, the approach is called supervised learning. Physically, the model can be a set of IF-THEN rules in some proprietary format, a block of SQL, or a segment of C source code.

Figure 4 illustrates the predictive modeling approach. Here a service company, for example an insurance company, is interested in understanding the increasing rates of customer attrition. A predictive model has determined that only two variables are of interest: the length of time the client has been with the company (Tenure) and the number of the company's services that the client uses (Services). The decision tree presents the analysis in an intuitive way. Clearly, those customers who have been with the company less than 2.5 years and use only one or two services are the most likely to leave.

Figure 4. Predictive Modeling
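A decision tree like the one in Figure 4 boils down to a small set of IF-THEN rules. As a minimal sketch (the 2.5-year and two-service thresholds come from the figure; the function name and class labels STAY/LEAVE are ours):

```python
def classify_customer(tenure_years, services_used):
    """Classify a customer as STAY or LEAVE, following the decision
    tree of Figure 4: short-tenure clients using few services leave."""
    if tenure_years < 2.5 and services_used <= 2:
        return "LEAVE"
    return "STAY"

# A recent client using one service is flagged as an attrition risk;
# a long-standing client using several services is not.
print(classify_customer(1.5, 1))   # LEAVE
print(classify_customer(6.0, 4))   # STAY
```

In practice the mining tool induces such rules from the historical data; the point of the sketch is only that the resulting model is simple enough to deploy anywhere.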

Models are developed in two phases: training and testing. Training refers to building a new model by using historical data, and testing refers to trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics. Training is typically done on a large proportion of the total data available, whereas testing is done on some small percentage of the data that has been held out exclusively for this purpose.
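The held-out split described above can be sketched in a few lines; the 80/20 proportion and the fixed seed are illustrative choices of ours, not a prescription of the product:

```python
import random

def train_test_split(records, test_fraction=0.2, seed=1):
    """Shuffle the historical records, train on the large share, and
    hold out a small share exclusively for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

history = list(range(1000))       # stand-in for 1,000 solved cases
train, test = train_test_split(history)
print(len(train), len(test))      # 800 200
```

Because the test records never influence training, the accuracy measured on them is an honest estimate of how the model will behave on genuinely new observations.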

The predictive modeling approach has broad applicability across many industries. Typical business applications that it supports are customer retention management, credit approval, cross selling, and target marketing.

There are two specializations of predictive modeling: classification and value prediction. Although both have the same basic objective, namely, to make an educated guess about some variable of interest, they can be distinguished by the nature of the variable being predicted.

With classification, a predictive model is used to establish a specific class for each record in a database. The class must be one from a finite set of possible, predetermined class values. The insurance example in Figure 4 is a case in point. The variable of interest is the class of customer, and it has two possible values: STAY and LEAVE.


With value prediction, a predictive model is used to estimate a continuous numeric value that is associated with a database record. For example, a car retailer may want to predict the lifetime value of a new customer. A mining run on the historical data of present long-standing clients, including some agreed-upon measure of their financial worth to date, produces a model that can estimate the likely lifetime value of new customers.

A specialization of value prediction is scoring, where the variable to be predicted is a probability or propensity. Probability and propensity are similar in that they are both indicators of likelihood. Both use an ordinal scale; that is, the higher the number, the more likely it is that the predicted event will occur. Typical applications are the prediction of the likelihood of fraud or the probability that a customer will respond to a promotional mailing.
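One common way to produce such a score is a logistic model, which squeezes a weighted combination of customer attributes into the 0-to-1 range. This is a hedged sketch with made-up weights and features, not the output of any actual mining run:

```python
import math

def propensity_score(weights, bias, features):
    """Map a linear combination of attributes to a 0-1 propensity
    via the logistic function. Weights here are illustrative only."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical mailing-response model: recency hurts, frequency helps.
score = propensity_score(weights=[-0.8, 0.5], bias=0.1,
                         features=[3.0, 4.0])  # months since purchase, purchases/yr
print(round(score, 3))
```

The absolute value matters less than the ordering: ranking customers by score lets the marketer mail only the most promising fraction of the list.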

1.5.2 Database Segmentation

The goal of database segmentation is to partition a database into segments of similar records, that is, records that share a number of properties and so are considered to be homogeneous. In some literature the words segmentation and clustering are used interchangeably. Here, we use segmentation to describe the data mining operation, and segments or clusters to describe the resulting groups of data records. By definition, two records in different segments are different in some way. The segments should have high internal (within-segment) homogeneity and high external (between-segment) heterogeneity.

Database segmentation is typically done to discover homogeneous subpopulations in a customer database to improve the accuracy of the profiles. A subpopulation, which might be ″wealthy, older males″ or ″urban, professional females,″ can be targeted for specialized treatment. Equally, as databases grow and are populated with diverse types of data, it is often necessary to partition them into collections of related records to obtain a summary of each database or before performing a data mining operation such as predictive modeling.

Figure 5 shows a scatterplot of income and age from a sample population. The population has been segmented into clusters (indicated by circles) that represent significant subpopulations within the database. For example, one cluster might be labeled ″young, well-educated professionals″ and another, ″older, highly paid managers.″

The grid lines and shaded sectors on the plot illustrate the comparative inefficiency of the traditional, slice-and-dice approach to the problem of database segmentation. The overlaid areas do not account for the truly homogeneous clusters because they either miss many of the cluster members or take in extraneous cluster members, which will skew the results.

In contrast, the segmentation algorithm can segment a database without any prompting from the user about the type of segments, or even the number of segments, it is expected to find in the database. Thus, any element of human bias or intuition is removed, and the true discovery nature of the mining can be leveraged. When an algorithm works in this way, the approach is called unsupervised learning.


Figure 5. Database Segmentation
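Intelligent Miner's demographic and neural algorithms differ in detail, but the flavor of unsupervised segmentation on a plot like Figure 5 can be sketched with a simple k-means pass of our own (the age/income points are invented):

```python
def kmeans(points, k, iters=20):
    """Minimal k-means: assign each record to its nearest center,
    recompute centers, repeat. No hints about what the segments are."""
    centers = points[:k]  # naive initialization: first k records
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [tuple(sum(vs) / len(cl) for vs in zip(*cl)) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return clusters

# Two obvious subpopulations on (age, income in $000s)
pts = [(25, 30), (27, 32), (24, 28), (55, 90), (58, 95), (60, 88)]
segments = kmeans(pts, 2)
print(sorted(len(s) for s in segments))  # [3, 3]
```

Note that the algorithm recovers the two subpopulations without being told what they look like; only the number of segments, k, is supplied here, and even that is discovered automatically by some methods.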

Database segmentation can be accomplished by using either demographic or neural clustering methods. The methods are distinguished by:

• The data types of the input attributes that are allowed

• The way in which they calculate the distance between records (that is, the measure of similarity or difference between the records, which is the essence of the segmentation operation)

• The way in which they organize the resulting segments for analysis

Demographic clustering methods operate primarily on records with categorical variables. They use a distance measurement technique based on the Condorcet voting principle, and the resulting segments are not arranged on output in any particular hierarchy.

Neural clustering methods are built on neural networks, typically by using Kohonen feature maps. Neural networks accept only numeric input, but categorical input is possible by first transforming the input variables into quantitative variables. The distance measurement technique is based on Euclidean distance, and the resulting segments are arranged in a hierarchy where the most similar segments are placed closest together.

Segmentation differs from other data mining techniques in that its objective is generally far less precise than the objectives of predictive modeling or link analysis. As a result, segmentation algorithms are sensitive to redundant and irrelevant features. This sensitivity can be alleviated by directing the segmentation algorithm to ignore a subset of the attributes that describe each instance or by assigning a weight factor to each variable.


Segmentation supports such business applications as customer profiling or target marketing, cross selling, and customer retention. Clearly, this operation has broad, cross-industry applicability.

1.5.3 Link Analysis

In contrast to the predictive modeling and database segmentation operations, which aim to characterize the contents of the database as a whole, the link analysis operation seeks to establish links (associations) between individual records, or sets of records, in the database. A classic application of this operation is associations discovery, that is, discovering the associations between the products or services that customers tend to purchase together or in a sequence over time. Other examples of business applications that link analysis supports are cross selling, target marketing, and stock price movement.

There are three specializations of link analysis: associations discovery, sequential pattern discovery, and similar time sequence discovery. The differences among the three are best illustrated by some examples. If we define a transaction as a set of goods purchased in one visit to a shop, associations discovery can be used to analyze the goods purchased within the transaction to reveal hidden affinities among the products, that is, which products tend to sell well together. This type of analysis is called market basket analysis (MBA) or product affinity analysis.
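The core of market basket analysis is counting: how often a pair of products appears together (support) and how often one appears given the other (confidence). A minimal pairwise sketch, with invented baskets and thresholds:

```python
from itertools import combinations
from collections import Counter

def association_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Report rules A -> B whose pair occurs in enough baskets
    (support) and often enough among baskets containing A (confidence)."""
    n = len(baskets)
    item_count = Counter(i for b in baskets for i in set(b))
    pair_count = Counter(frozenset(p) for b in baskets
                         for p in combinations(sorted(set(b)), 2))
    rules = []
    for pair, cnt in pair_count.items():
        support = cnt / n
        if support < min_support:
            continue
        a, b = sorted(pair)
        for x, y in ((a, b), (b, a)):
            confidence = cnt / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, round(support, 2), round(confidence, 2)))
    return rules

baskets = [["beer", "diapers", "chips"],
           ["beer", "diapers"],
           ["beer", "bread"],
           ["diapers", "wipes"]]
print(association_rules(baskets))
```

On these toy baskets only the beer/diapers affinity clears both thresholds; a real run sifts the same counts over millions of transactions.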

Sequential pattern discovery is used to identify associations across related purchase transactions over time that reveal information about the sequence in which consumers purchase goods and services. It aims to understand long-term customer buying behavior and thus leverage this new information through timely promotions.

Similar time sequence discovery, the discovery of links between two sets of data that are time dependent, is based on the degree of similarity between the patterns that both time series demonstrate. Retailers would use this approach when they want to see whether a product with a particular pattern of sales over time matches the sales curve of other products, even if the pattern match is lagging some time behind. Figure 6 shows an example of three apparently unrelated patterns that could represent sales histories or even stock movements over time. At first glance the graphs appear not to be related in any significant way. However, on closer examination, definite patterns can be identified, which, when translated into business terms, can be exploited for commercial gain.

Figure 6. Pattern Matching
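A lagged pattern match of the kind described above can be sketched by sliding one series against the other and scoring each offset with a correlation coefficient (the sales figures below are invented; real similarity search uses more robust measures):

```python
def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def best_lag(series_a, series_b, max_lag=4):
    """Slide series_b behind series_a and report the lag whose
    overlapping windows correlate most strongly."""
    scores = {}
    for lag in range(max_lag + 1):
        a = series_a[:len(series_a) - lag] if lag else series_a
        scores[lag] = correlation(a, series_b[lag:])
    return max(scores, key=scores.get)

sales_a = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2]
sales_b = [9, 9, 1, 2, 3, 4, 5, 6, 5, 4]   # same shape, two periods later
print(best_lag(sales_a, sales_b))  # 2
```

Here product B's sales curve repeats product A's with a two-period delay, exactly the kind of lagging match a retailer could exploit for stock planning.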


1.6 General Approach to Data Mining

Figure 7 depicts a general data mining process that fits into an overall business process. In this section we briefly describe most of the actions to be performed within the data mining process.

Figure 7. The Data Mining Process


1.6.1 Business Requirements Analysis

The first part of any data mining project is to understand the client's business requirements. The business requirements that form the project objectives should be clearly presented and understood by all members of the project team. The data mining process is driven by the client's business requirements.

• Economics of the business problem

In order to turn the data mining results into actionable business results, it is important to understand the economics and/or other drivers of the client's business requirements. The data mining activities undertaken must adhere to and improve the economics and/or other drivers of the requirements.

• Review of current methods used

An understanding of the current methods used and the current business performance of those methods is required to ensure that the application of new technologies and methods adds incremental value beyond the status quo. The status quo performance is the minimum performance required of any new method.

• Expected performance of a new method

The expected improvement over status quo methods should be presented by the client to ensure that the project team has clear objectives. It is also important to set attainable expectations of results.

1.6.2 Project Management

The second part of a data mining project is to define the scope of the project and the team to run the project.

• Project team identification

A cross-functional project management team, including representation from all parties, is defined to ensure that all project issues are appropriately discussed and resolved.

• Project plan design

The first task of the project management team is to agree on a project plan that covers the identification, resourcing, scheduling, and estimation of all project tasks.

• Project objectives

At the outset of the project, a clear set of objectives must be defined to maintain project focus and help resolve project issues. A project without clear objectives has a high probability of not being completed on time and with positive results.

• Project evaluation criteria

The client must present criteria that will be used to evaluate the success of the project. The project management team should modify the evaluation criteria as required to set an appropriate expectation of success and achieve an objective evaluation.


1.6.3 Business Solution Design

Before the actual data mining phase of a project, a business solution must be designed. The solution should define the detailed data mining tasks and can be illustrated as a flow diagram.

1.6.4 Data Mining Run

As illustrated in Figure 7 on page 14, the data mining action is iterative and consists of these steps:

• Data selection

This step involves identifying and selecting all relevant data that can be used for data mining. The business requirements drive the data selection; the data requirements activity defined above constitutes the data selection step of the mining process.

• Data preparation

Data preparation is a substantial portion of a data mining project. It involves the treatment of missing values and outliers and the creation of new variables based on data transformations. Each data mining algorithm has different data preparation requirements. Data preparation can also include data reduction, which is bounded by the maximum number of variables that an algorithm can effectively utilize, and data sampling. The sampling requirements are driven by the different data mining algorithms.

• Data mining

Data mining involves the execution of the various data mining algorithms against the prepared data sets. Several (tens to hundreds of) mining runs are completed for each data mining project, and the effects of algorithm parameters and data transformations are systematically evaluated.

• Results analysis

Once a data model has been created and tested, its performance is analyzed. The analysis includes a description of all key variables and findings that the model permits. All modeling assumptions are outlined, and implementation issues are presented.
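The data preparation step described above can be sketched in code. The following is a minimal, generic illustration of imputing missing values, capping outliers, and deriving a new variable; it is not the Intelligent Miner's own preprocessing, and the `prepare` helper and field names are hypothetical:

```python
import statistics

def prepare(records, field, ratio_field, cap_sigma=3.0):
    """Illustrative preparation pass over one numeric field:
    impute missing values with the field mean, cap outliers at
    cap_sigma standard deviations, and derive a ratio variable."""
    known = [r[field] for r in records if r[field] is not None]
    mean = statistics.mean(known)
    spread = statistics.pstdev(known)
    lo, hi = mean - cap_sigma * spread, mean + cap_sigma * spread
    out = []
    for r in records:
        v = mean if r[field] is None else r[field]   # treat missing values
        v = min(max(v, lo), hi)                      # cap outliers
        # keep the original record, overwrite the field, add the new variable
        out.append(dict(r, **{field: v, ratio_field: v / mean}))
    return out

rows = prepare(
    [{"balance": 100}, {"balance": None}, {"balance": 300}],
    "balance", "balance_ratio")
```

In practice each algorithm imposes its own preparation requirements, so a pass like this would be tuned per mining run.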

1.6.5 Business Implementation Design

Business implementation design involves designing the implementation of the data mining results, with the goal of meeting the defined business requirements. The design should support quality control, tracking of business results, and the ability to prove the causal effect of the data mining result. The design must also take into account any business implementation issues that are not part of the data mining project. The business implementation design is more experimental than fixed.

1.6.6 Business Implementation

The business implementation is the execution of the experimental design.


1.6.7 Results Tracking

If required by the client, preliminary business results can be tracked against the expected performance to ensure the success of the business implementation. Preliminary results can be used to modify the current business activity if warranted.

1.6.8 Final Business Result Determination

At the conclusion of the business activity, the profitability of the business implementation is fully analyzed. The performance of the model is also compared with its expected performance.

1.6.9 Business Result Analysis

The final business result should be analyzed to identify general lessons that can be fed into future projects. Many companies are beginning to create learning warehouses to store such corporate knowledge.



Chapter 2. Introduction to the Intelligent Miner

The IBM Intelligent Miner for Data (IM in this book) is leading the way in helping customers identify and extract high-value business intelligence from their data assets. The process is one of discovery. Companies are empowered to leverage information hidden within enterprise data and discover associations, patterns, and trends; detect deviations; group and classify information; and develop predictive models.

2.1 History

IBM's award-winning Intelligent Miner was released in 1996. It enables users to mine structured data stored in conventional databases or flat files. Customers and partners have successfully deployed its mining algorithms to address such business areas as market analysis, fraud and abuse, and customer relationship management.

2.2 Intended Customers

The Intelligent Miner offerings are intended for use by data analysts and business technologists in areas such as marketing, finance, product management, and customer relationship management. In addition, the text mining technologies have applicability to a wide range of users who regularly review or research documents - for example, patent attorneys, corporate librarians, public relations teams, researchers, and students.

2.3 What Is the Intelligent Miner?

The IBM Intelligent Miner is a suite of statistical, processing, and mining functions that you can use to analyze large databases. It also provides visualization tools for viewing and interpreting mining results. The server software runs on the AIX, AS/400, OS/390, and Sun Solaris operating systems. The AIX, OS/2, and Windows operating systems can be used for the clients.

Some of the features provided by the Intelligent Miner include:

• Extension of the associations, classification, clustering, and prediction functions

• Neural prediction

• Statistical functions

• Export and import of mining bases across operating systems

• Exploitation of DB2 Parallel Edition and DB2 Universal Database Enterprise Extended Edition

• Repeatable sequences

• API for all server platforms

The Intelligent Miner provides a complete graphical user interface with TaskGuides that lead you through the steps of creating the different Intelligent Miner objects. General help for each TaskGuide provides additional information, examples, and valid values for the controls on each page.

Copyright IBM Corp. 1999


In the sections that follow we introduce the data mining technology and the data mining process of the Intelligent Miner. We also explain in general the statistical, processing, and mining functions that the Intelligent Miner provides.

2.4 Data Mining with the Intelligent Miner

Data mining is the process of discovering valid, previously unknown, and ultimately comprehensible information from large stores of data. It can be used to extract information to form a prediction or classification model, or to identify similarities between database records. The resulting information can help you make more informed decisions.

The Intelligent Miner helps organizations perform data mining tasks. For example, a retail store might use the Intelligent Miner to identify groups of customers that are most likely to respond to new products and services, or to identify new opportunities for cross-selling. An insurance company might use the Intelligent Miner with claims data to isolate likely fraud indicators.

2.5 Overview of the Intelligent Miner Components

In this section we provide a high-level overview of the product architecture. See the Intelligent Miner Application Programming Interface and Utility Reference for more detailed information about the architecture and the APIs of the Intelligent Miner.

The Intelligent Miner links the mining and processing functions on the server with the administrative and visualization tools on the client. The client component includes a user interface from which you can invoke the mining and processing functions on an Intelligent Miner server. The results of the mining process can be returned to the client, where you can visualize and analyze them.

The client components are available for the AIX, OS/2, Windows NT, and Windows 95 operating systems. The server components are available for AIX, OS/390, AS/400, and Sun Solaris systems. They are also available for the RS/6000 SP, where they exploit parallel mining on multiple processing nodes. You can have client and server components on the same machine.

2.5.1 Intelligent Miner Architecture

Figure 8 on page 21 illustrates the client and server components of the Intelligent Miner and the way they are related to one another:


Figure 8. The Intelligent Miner Architecture

User Interface The user interface is a program that enables you to define data mining functions in a graphical environment. You can define preferences for the user interface that are stored on the client.

Environment Layer API The environment layer API is a set of API functions that control the execution of mining runs and results. Sequences of functions and mining operations can be defined and executed from the user interface through the environment layer API.

Data Definition This feature of the Intelligent Miner provides the ability to collect and prepare the data for the data mining process.

Visualizer The Intelligent Miner provides a rich set of visualization tools. You can also use other visualization tools.

Data Access The Intelligent Miner provides access to flat files, database tables, and database views.


Databases and Flat Files The Intelligent Miner components work directly with data stored in a relational database or in flat files. The data is not copied to a special format. You define input and output data objects that are logical descriptions of the physical data. Therefore the physical location of the data can be changed without affecting objects that use the data; only the logical descriptions must be changed. The change might be as simple as changing a database name.

Processing Library The processing library provides access to database functions such as bulk load of data and data transformation.

Mining Bases Mining bases are collections of data mining objects used for a mining objective or business problem. Mining bases are stored on the server, which allows access from different clients.

Mining Kernels Mining kernels provide the data mining and statistical functions.

Mining Results, Result API, and Export Tools Mining results are the data resulting from running a mining or statistics function. These components allow you to visualize results at the client. Results can be exported for use by other visualization tools.

2.5.2 Intelligent Miner TaskGuides

Data mining in the Intelligent Miner is accomplished through the creation of interrelated objects. The objects are displayed as icons and represent the collection of attributes or settings that define the data or function.

Working with the Intelligent Miner graphical user interface is fairly simple, as the Intelligent Miner offers TaskGuides. In this section we explain how to use a TaskGuide to create a settings object.

To create a settings object, use the Create menu or click on a settings object icon in the task bar. A TaskGuide opens to guide you through the creation of the object.

Each TaskGuide starts with a Welcome page that provides an overview of the type of settings object that you are creating. Each TaskGuide page provides step-by-step instructions for filling in the fields and making selections that define the settings for the object. You can click on a highlighted term to see a short definition of the term.

Click the Next button to navigate to the next TaskGuide page. The last page of every TaskGuide summarizes the settings object that you created. Click the Finish button to create the object.

Figure 9 on page 23 shows the TaskGuide for creating a data settings object.


Figure 9. The Data TaskGuide

You can have more than one TaskGuide open at a time. Thus you can leave a TaskGuide to create another object that is required to complete the first TaskGuide. For example, while you are in the process of defining a mining function, you might have to define or modify an input data object. You can open a Data TaskGuide to define the input data object and then continue with the Mining TaskGuide.

2.5.3 Mining and Statistics Functions

Mining and statistics settings objects are similar in that they represent analytical functions that are run against data. In both cases, you must indicate which data settings object you want to use.

Mining and statistics settings objects produce a results object when run. You can view and analyze the results object with visualization tools. You can also indicate in the settings for these functions that you want to create output data in addition to a results object.


The Intelligent Miner has many types of mining and statistics functions:

Mining functions:

• Associations
• Clustering − demographic
• Clustering − neural
• Sequential patterns
• Time sequence
• Classification − tree
• Classification − neural
• Prediction − Radial-Basis-Function
• Prediction − neural

Statistics functions:

• Cross-correlation
• Correlation matrixes
• Factor analysis
• Linear regression
• Principal component analysis
• Univariate curve fitting
• Bivariate statistics

2.5.4 Processing Functions

Processing functions are used to make data suitable for mining or analysis. Processing settings objects apply only to database tables and views because they take advantage of the processing capability of the database engine.

The Intelligent Miner has many processing functions:

Processing settings objects always read input from a database and create output data in a database. The only exception is the Copy Records to File function, which copies data to a file. When you create a processing settings object or update an existing one, you can use a data settings object to identify input data or output data. In this way the name of a database table or view is copied to the processing settings object. Subsequent changes to the data settings object have no effect on the processing settings object.

• Aggregate values
• Calculate values
• Clean up input data or output data
• Convert to lowercase or uppercase
• Copy records to file
• Discard records with missing values
• Discretization into quantiles
• Discretization using ranges
• Encode missing values
• Encode nonvalid values
• Filter fields
• Filter records
• Filter records using a value set
• Get random sample
• Group records
• Join data sources
• Map values
• Pivot fields to records
• Run SQL

2.5.5 Modes

How results objects are used with the Intelligent Miner depends on the mode in which functions are run. The Intelligent Miner provides the following modes under which to perform the mining process:

Training In training mode, a mining function builds a model on the basis of the selected input data.

Clustering In clustering mode, the clustering functions build a model on the basis of the selected input data. Clustering mode is similar to training mode for the predictive algorithms. Clustering mode offers the choice of using background statistics from the input data or an input result.

Test In test mode, a mining function uses new data, or the same data, with known results to verify that the model created in training mode produces consistent results. Results objects are used for input and created as output.

Application In application mode, a mining function uses a model created in training mode to predict the specified field for every record in the new input data. The data format must be identical to that used to generate the model.
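As a rough analogy for the training, test, and application modes, consider a toy model that simply memorizes the most frequent value of a target field. The class below is purely illustrative and is not part of the Intelligent Miner API:

```python
from collections import Counter

class MajorityModel:
    """Toy stand-in for a mining function, illustrating the three modes."""

    def train(self, records, target):
        # Training mode: build a model from data with known target values.
        self.label = Counter(r[target] for r in records).most_common(1)[0][0]
        return self

    def test(self, records, target):
        # Test mode: score the model against data with known results.
        hits = sum(1 for r in records if r[target] == self.label)
        return hits / len(records)

    def apply(self, records, target):
        # Application mode: predict the target field for new records.
        return [dict(r, **{target: self.label}) for r in records]

model = MajorityModel().train(
    [{"buys": "yes"}, {"buys": "yes"}, {"buys": "no"}], "buys")
accuracy = model.test([{"buys": "yes"}, {"buys": "no"}], "buys")
scored = model.apply([{"age": 41}], "buys")
```

Note that application mode, as in the real product, requires the new records to carry the same fields the model was built on.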

For more information about how to work with the Intelligent Miner, see Using the Intelligent Miner for Data, SH12-6325-01, the documentation shipped with the product.



Chapter 3. Case Study Framework

Customer Relationship Management (CRM) is a key focus area today in the marketing departments of many different industries, including finance, telecommunications, utilities, and insurance. Businesses in these industries have changed or are changing their marketing focus from a product-centric view to a customer-centric view. There are several reasons for this change in focus: increased competition for nongrowing markets, government deregulation, a technology revolution enabling the consolidation of corporate data and access to new data sources, and a growing awareness that the primary assets of a business are its customers.

3.1 Customer Relationship Management

CRM is a methodology used to market to customers. Its key features include customer profitability, customer lifetime value, and customer loyalty. In managing their customers, businesses recognize that not all customers are created equal and that they should focus their marketing efforts on retaining their best customers, increasing the profitability of their high-potential customers, spending less on marketing to their low-potential customers, and acquiring new high-potential customers at a lower cost. A segmentation of customers based on their key characteristics is central to CRM and is used to derive strategic marketing campaigns.

A consolidated customer view, enabled through the process of data warehousing, permits businesses to determine the current and potential value of customers. A business can associate customer purchase behaviors with its customers' value to the shareholders. By understanding the association between transaction behavior and shareholder value, marketers can influence customers to change their purchase behavior in ways profitable to the organization.

By further understanding the complete view of its customers, including demographic, geodemographic, and psychographic profiles, a business can do more than simply influence behavioral change through the use of customer rewards. By understanding the needs of customers, as exhibited through their purchase behavior, marketers can use the customer profile information to better serve these customers by targeting them for products and services that they are likely to purchase. Increased understanding of their customers also allows marketers to communicate relevant messages through customer-preferred channels such as direct mail or phone campaigns. Effectively serving the needs of the customer requires less incentive to change customer behavior. Increased targeting of customers, focusing on meeting strategic campaign initiatives for smaller customer segments, substantially reduces the cost of marketing and can increase its effectiveness.

Strategic campaign initiatives can be derived by creating customer segmentation models. Several different strategic initiatives can be applied to the different customer segments. Businesses have realized that a minority of customers, 10%-25%, contribute the lion's share, 40%-80%, of the bottom line. A retention strategy is the primary initiative for these "best" customers. As many as five average customers are required to replace a "best" customer. With the high cost of customer acquisition, businesses have a strong business case to invest heavily in retaining their "best" customers and best potential customers. Loyal customers increase in value over time; they spend more over time, consolidate their purchases, and refer new customers.

Another important customer segment to consider is that containing customers with a high potential value. In addition to retention, high-potential customers are candidates for cross-selling and up-selling campaigns. Additional products and services that can be marketed to this segment can be identified by analyzing the customers' purchase behavior. By profiling and understanding the characteristics of its best customers, a business can effectively target customer lists to acquire more profitable customers.

In addition to changing the way in which organizations market to their customers, a change is occurring in the way marketing campaigns are implemented. The status quo in marketing science is the implementation of marketing campaigns in a series of waves, or tactical campaigns. In this type of marketing, groups of customers are targeted for a specific promotion. The customers' buying behavior initiates a promotional period during which customers can respond to the promotional offer. At the end of such a campaign, the results are determined and then fed back into future waves of marketing activity.

A new method of continuous marketing has recently appeared. With multiple customer interaction channels, including the Internet, inbound telephone calls, outbound telephone calls, direct sales, and direct mail, organizations with the capability to provide CRM data to operational customer service applications can continuously market to individual customers. For instance, if a customer segment definition, and the segment's sensitivity to certain product or service information, are made available to customer service agents during inbound telephone calls, the customer service agent can be directed to deliver the appropriate marketing message to the customer interactively. Furthermore, if organizations had the capability to update customer segments and other purchase behavior models in real time, they would be able to conduct continuous interactive marketing campaigns. Organizations must track all customer interactions and provide timely and accurate customer behavioral information to the marketer to execute such a campaign. In the example above, customer service agents must have real-time information to know that the customer has not already purchased the products they are marketing. Failure to have real-time information in this instance can have a detrimental effect on customer service.

Continuous marketing is also driven by the technology revolution. The technical challenge in continuous marketing is the ability to access real-time information. In order to deliver real-time information, an organization must be able to transform its customer purchase behavior into decision support information in real time. With wave marketing campaigns, it can take an organization several weeks or months to provide the decision support information that drives the marketing strategy. Organizations can no longer wait for their knowledge workers to spend weeks creating models and decision support analysis to support marketing campaigns. Automated models and expert systems will create the decision support information required by continuous marketing. Data mining technology will play an ever-increasing role in providing decision support information to continuous marketing campaigns.

In summary, technology plays a fundamental role in CRM and continuous interactive marketing (CIM). Data warehousing permits the consolidation of an organization's operational data. Data mining is used to create customer segments and to identify profitable marketing opportunities. Campaign management tools are used to implement and manage the design, execution, tracking, and postanalysis of marketing campaigns.

Technology is the key enabler in the implementation of CIM. This case study guide illustrates the use of data warehousing and the Intelligent Miner to support CRM and CIM.

3.2 Case Studies

The business used for the case studies presented in this book is a retail bank, which is a client of Loyalty Consulting. Throughout the book this retail bank is referred to as the "Bank".

Loyalty Consulting, a subsidiary of The Loyalty Group, grew out of the experience of building and outsourcing the data warehouse for the Air Miles Reward Program (AMRP). By maintaining the data warehouse and providing analytical services to the AMRP and its sponsor companies, Loyalty Consulting gained substantial experience in the application of technology to real business requirements. It was one of IBM's original partners for the Intelligent Miner data mining product and has been applying the technology for more than two years.

Loyalty Consulting offers services that can be broadly categorized as:

• Database and data warehouse consulting

• Data mining or knowledge discovery in databases

• Geographic information system (GIS)

3.3 Strategic Customer Segmentation

In meeting its database marketing needs, the Bank currently uses standard analytical techniques. The Bank's business analysts use recency frequency monetary (RFM) analysis, OLAP tools, and linear statistical methods to mine the data for marketing opportunities and to analyze the success of the various marketing initiatives undertaken by the various lines of business. The Bank recognizes the opportunity to increase the efficiency of its database marketing activities and improve its knowledge of its customers through advanced data mining technology.
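RFM analysis ranks each customer on how recently, how often, and how much they purchase. A minimal sketch of quintile-based RFM scoring follows; the field names and the scoring scheme are assumptions for illustration, not the Bank's actual method:

```python
def quantile_scores(values, bins=5, low_is_best=False):
    """Rank values into `bins` roughly equal groups; higher score = better."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=low_is_best)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * bins // len(values) + 1
    return scores

def rfm_scores(customers):
    # A recent purchase means a *low* recency value, so low is best there.
    r = quantile_scores([c["recency"] for c in customers], low_is_best=True)
    f = quantile_scores([c["frequency"] for c in customers])
    m = quantile_scores([c["monetary"] for c in customers])
    return [dict(c, rfm=(r[i], f[i], m[i])) for i, c in enumerate(customers)]

scored = rfm_scores([
    {"recency": 10, "frequency": 2, "monetary": 100},
    {"recency": 5,  "frequency": 8, "monetary": 500},
    {"recency": 30, "frequency": 1, "monetary": 50},
    {"recency": 1,  "frequency": 9, "monetary": 900},
    {"recency": 20, "frequency": 4, "monetary": 200},
])
```

A (5, 5, 5) customer is a recent, frequent, high-spending customer; segmentations like the ones in the case studies go beyond such simple rank scores.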

The case studies presented in this book are driven by the Bank's business requirement to use data mining to identify new business opportunities and/or to reduce the cost of marketing campaigns to existing customers. In this section we describe a framework for customer relationship management. We illustrate the framework by using four data mining case studies, which we present in 3.4, "Case Studies" on page 30.

Customer segmentation is one of the most important data mining methods in marketing and CRM. Segmentation using behavioral data drives strategic business initiatives. The customer purchase data that a company collects forms the basis of the behavioral data. It is important to create customer segments by using the variables that determine customer profitability. These variables typically include current customer profitability and some measure of risk and/or a measure of the lifetime value of a customer.

Creating customer segments based on the variables that determine customer profitability will highlight obvious marketing opportunities. For example, a segment of high-profit, high-value, and low-risk customers is the segment a company wants to keep. This segment typically represents the 10% to 20% of customers who create 50% to 80% of a company's profits. The strategic initiative for this group is obviously retention; a company would not want to lose these customers. A low-profit, high-value, and low-risk customer segment is also attractive to a company. The obvious goal of the company for this segment would be to increase its profitability. Cross-selling (selling new products) and up-selling (selling more of what customers currently buy) to this segment are the marketing initiatives of choice.
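The profit concentration quoted above (10% to 20% of customers creating 50% to 80% of profits) is easy to check on a customer file. A small sketch, assuming positive profit figures per customer:

```python
def top_share(profits, top_fraction=0.2):
    """Fraction of total profit contributed by the most profitable
    top_fraction of customers (profits assumed positive)."""
    ranked = sorted(profits, reverse=True)        # best customers first
    k = max(1, int(len(ranked) * top_fraction))   # size of the top group
    return sum(ranked[:k]) / sum(profits)

# Ten customers; the top 20% (two customers) hold 150 of 200 profit units.
share = top_share([100, 50, 10, 10, 10, 5, 5, 5, 3, 2])
```

On this illustrative file the top 20% of customers contribute 75% of profit, squarely in the range the text describes.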

Within the behavioral segments, demographic clusters and/or segments are created. Customer demographic data does not typically correlate with customer profitability, which is why it should not be mixed with the behavioral data. Creating demographic segments allows the marketer to create relevant advertising, select the appropriate marketing channel, and identify campaigns within the strategic customer segments defined above.

Let us say a bank has both a high-profit and a low-profit behavioral customer segment with similar demographic subsegments. The profile of the subsegment is young, high-income professionals with families. The marketer would want to ask the following question: Why do these similar demographic segments behave differently, and how do I change the low-profit group into a high-profit group? It is difficult, if not impossible, to answer the why, but data mining provides an answer to the how. Affinity analysis discovers that the high-profit segment of young wealthy professionals has a distinct product pattern - mortgages, mutual funds, and credit cards. Using affinity analysis on the low-profit segment reveals that two of its product patterns are the same as those of the high-profit segment - mutual funds and credit cards. The marketing campaign to increase the profitability of the low-profit segment would thus be to market mortgages to it.
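At its simplest, the affinity analysis described here amounts to measuring how often product combinations are held together across customers. The following is a minimal sketch of pair-wise support counting; it is illustrative only, and the Intelligent Miner's associations function is far more complete:

```python
from collections import Counter
from itertools import combinations

def pair_affinities(baskets, min_support=0.5):
    """Support (fraction of customers) for each product pair held together."""
    counts = Counter()
    for basket in baskets:
        # Count each unordered product pair once per customer.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

patterns = pair_affinities([
    ["mortgage", "mutual funds", "credit card"],   # high-profit profile
    ["mutual funds", "credit card"],               # low-profit profile
    ["mortgage", "mutual funds", "credit card"],
])
```

In this toy data the mutual funds and credit card pair is held by every customer, while the mortgage pairs distinguish the high-profit pattern, mirroring the example in the text.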

In summary, behavioral segmentation helps derive strategic marketing initiatives by using the variables that determine customer profitability. Demographic segmentation within the behavioral segments defines tactical marketing campaigns and the appropriate marketing channel and advertising for the campaigns. It is then possible to target those customers most likely to exhibit the desired behavior (in the above example, those customers most likely to purchase a mortgage) by creating predictive models. See Figure 10.


Figure 10. Customer Segmentation Model


3.4 Case Studies

In this book we present the following four case studies, which highlight the role that IBM's Intelligent Miner and data mining technology play in supporting a CRM system:

• Customer Segmentation

The first case study creates a customer segmentation that will be used in the other case studies. Using shareholder value variables to create the segmentation will drive strategic initiatives for the customer segments discovered. Two of the Intelligent Miner's clustering techniques and a decision tree are used to build the segmentation models.

• Cross-Selling Opportunity Identification

Identifying a cross-selling opportunity that is actionable and profitable, using the Intelligent Miner's product associations algorithm, is the topic of this case study. The study is based on the customer segment from the first case study whose strategic initiative is to increase its profitability.

• Target Marketing Model to Support a Cross-Selling Campaign

In this case study, we build a predictive model to target those customers likely to buy the product identified as a cross-selling opportunity in the previous case study. Several algorithms from the Intelligent Miner are used. The models built with the Intelligent Miner decision tree, radial basis function (RBF) regression, and neural network are compared.

• Attrition Model to Improve Customer Retention

In this case study, profitable customer segments are selected from the segmentation model built in the first case study. An attrition model is built identifying those profitable customers likely to defect. Several algorithms from Intelligent Miner are compared. In addition to the predictive modeling algorithms used in the previous case study, a time-series neural network will be utilized.

The four case studies represent four major components of a CRM program that an organization can implement. The strengths of Intelligent Miner's algorithms and visualization tools and its ability to work on a wide variety of business problems are illustrated through the case study results. Figure 10 on page 30 shows the customer segmentation model used in the case studies shown in this book.


Chapter 4. Customer Segmentation

This case study creates a customer segmentation that will be used in the other case studies. Using shareholder value variables to create the segmentation will drive strategic initiatives for the customer segments discovered. Two of Intelligent Miner's clustering techniques and a decision tree are used to build the segmentation models.

4.1 Executive Summary

The Bank wanted to create an advanced segmentation of its customer base in order to further understand customer behavior. The segmentation was to be compared with the existing segmentation that was created through RFM analysis. A segmentation framework as described in 3.3, “Strategic Customer Segmentation” on page 29, was to be created to meet these key business requirements:

• Define ″shareholder value″ for the corporation

• Define strategic objectives for customer management

• Understand customer behavior in terms of shareholder value

• Understand the interaction between customer transaction behavior and shareholder value

Shareholder value was a well-understood concept for the Bank. However, the specific variables that make up shareholder value were not previously considered in detail. The selection or creation of these variables was a primary requirement.

Having defined the metrics or variables used to approximate shareholder value, the Bank wanted to understand how the customer base was segmented by shareholder value. An analysis of customer segments defined by shareholder value was to be used to derive strategic initiatives for managing the shareholder value of each of the segments.

Further segmentation using detailed customer transaction behavior, defined by RFM variables by product over time, would provide insight into which customer behaviors were related to positive and negative shareholder value. Understanding the relationship between customer behavior and shareholder value would drive the creation of tactical marketing initiatives that could be executed to meet the various customer segment strategies.

4.2 Business Requirements

The Bank wanted to create an advanced segmentation of its customer base to further understand customer behavior. This segmentation was to be compared to the existing segmentation that was created with RFM analysis. A segmentation framework as described in 3.3, “Strategic Customer Segmentation” on page 29, was to be created to meet the following key business requirements:

• Define ″shareholder value″ for the corporation

• Define strategic objectives for customer management

Copyright IBM Corp. 1999 33


• Understand customer behavior in terms of shareholder value

• Understand the interaction between customer transaction behavior and shareholder value

The Bank's data warehouse was used as a data source for this case study. The Bank had spent considerable effort cleaning and transforming the data prior to loading it into its warehouse. Therefore, some of the data preparation activities that are usually time consuming were not required in this case study.

Customer segments were to be determined using the following shareholder value variables, which were identified by the Bank's executives as key drivers of their business:

• Number of products used by the customer over a lifetime

• Number of products used by the customer in the last 12 months

• Revenue contribution of the customer over a lifetime

• Revenue contribution of the customer over the last 12 months

• Most recent Customer Credit Score

• Customer tenure in months

• Ratio of (number of products/tenure)

• Ratio of (revenue/tenure)

• Recency

A review of the clustering process was presented in sufficient detail so that technical analysts could use their own data to reproduce a clustering project.

The results showed that the existing segmentation scheme was valid but could use some additional refinement. The key drivers of profitability were verified. A highly profitable customer segment was identified and represented 35% of the corporate profit with only 9% of customers. Some cross-selling opportunities were quantified; they represented a potential profit increase of 18% over the entire customer base.

The Bank executives decided that there was potential value in data mining and started several data mining projects, including target marketing, opportunity identification, and further segmentation work.

4.3 Data Mining Process

In this section we outline the data mining process that was used to meet the business requirements of the Bank (see 4.2, “Business Requirements” on page 33). Figure 11 on page 35 highlights the major steps in the process:

1. Shareholder value definition

2. Data selection

3. Data preparation including discretization

4. Demographic clustering

5. Neural clustering

6. Cluster result analysis

7. Classification of clusters with decision tree


8. Comparison of results

9. Selection of clusters and/or segments for further analysis

We describe the first four topics in this section. We discuss topics 5 through 8 in 4.4, “Data Mining Results” on page 50.

Figure 11. Data Mining Process: Customer Segmentation

A high-level tabulated comparison of the demographic clustering and neural clustering results is made. We chose to present the demographic clustering results in detail because they are more interesting than the neural clustering results. This difference is not a general observation; it is true for this particular case study. In our experience both algorithms produce good results, usually one slightly better than the other, depending on the business problem and, more importantly, the characteristics of the data that was mined.


4.3.1 Data Selection

We used the data model in Figure 12 on page 37 as the primary source of data. Approximately 50,000 customers and their associated transaction data for a 12-month period were selected as a representative sample for the study. (We used this data in all of our case studies.) The transaction data used contained transactions across all possible products. We selected the complete transaction data because we wanted to develop an understanding of the different customer transaction behaviors. All customer transaction behaviors are contained entirely within the transaction and customer tables.

The shareholder value variables we defined for this case study included revenue, tenure, number of products purchased over the customer tenure, number of products purchased over the last 12 months, customer credit score, and recency (in months) of the last transaction. These variables form the core of the top layer of the hierarchical clustering model that we develop in this case study (see Figure 10 on page 30). We had to calculate all of these variables from the raw transaction data. The selection of these variables was driven entirely by the business requirement. These are the variables the business had decided to use in managing its customer base.

The profitability data in the data model in Figure 12 on page 37 was contained in the transactions table. Each transaction record contained a revenue figure that could be used to estimate profitability by applying a gross profit margin or interest rate spread. More sophisticated profit models could be developed but were outside the scope of this work.1 The other shareholder value variables were calculated by using aggregate functions on the transaction data while joining the data to each customer record.

1 More sophisticated profit models may include the transaction cost as well as the transaction gross revenue. The cost of marketing to the customer can be determined from the promotion history table as in Figure 12 on page 37. Other costs can be allocated by the customer's transaction intensity or by some other variable relevant to the business problem at hand.
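The aggregation SQL itself is not shown in the book. As an illustration only, the per-customer calculation of the shareholder value variables might be sketched in Python as follows; the record layout, field names, and toy values are all hypothetical:

```python
from collections import defaultdict

# Hypothetical transaction records: (customer_id, product, months_ago, revenue)
transactions = [
    (1, "cat #1", 2, 120.0),
    (1, "cat #2", 8, 300.0),
    (1, "cat #1", 15, 80.0),
    (2, "cat #3", 1, 50.0),
]

def shareholder_value_variables(transactions, tenure_by_customer):
    """Aggregate raw transactions into per-customer shareholder value variables."""
    by_cust = defaultdict(list)
    for cust, product, months_ago, revenue in transactions:
        by_cust[cust].append((product, months_ago, revenue))
    out = {}
    for cust, rows in by_cust.items():
        tenure = tenure_by_customer[cust]
        revenue_life = sum(r for _, _, r in rows)
        out[cust] = {
            "products_life": len({p for p, _, _ in rows}),
            "products_12m": len({p for p, m, _ in rows if m <= 12}),
            "revenue_life": revenue_life,
            "revenue_12m": sum(r for _, m, r in rows if m <= 12),
            "tenure": tenure,
            "recency": min(m for _, m, _ in rows),  # months since last transaction
            "ratio_products_tenure": len({p for p, _, _ in rows}) / tenure,
            "ratio_revenue_tenure": revenue_life / tenure,
        }
    return out

vars_ = shareholder_value_variables(transactions, {1: 24, 2: 6})
```

In a production setting this grouping would be done with SQL aggregate functions while joining to the customer table, as the text describes.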


Figure 12. Customer Transaction Data Model

The Bank has divided its products into the 14 categories listed below. The category labels in the results will be denoted by ″cat #____″ to protect the Bank's confidentiality.

• Loans

• Mortgages

• Leases

• Credit Card

• Term Deposits

• ATM Card

• Savings Accounts

• Personal Banking

• Internet Banking

• Telephone Banking

• Business Loans

• Business Mortgages

• Business Deposit Accounts

• Business Credit Cards

We created transaction variables for each of the above product categories. For each customer we calculated the recency in months, revenue by quarter, and number of transactions by quarter for two consecutive quarters in 1997.


4.3.2 Data Preparation

Once the data required for the data mining process is selected, it must be put in the appropriate format and distribution. Therefore it has to be cleaned and transformed to meet the requirements of the data mining algorithms.

4.3.2.1 Data Cleaning

Very little data cleaning was required for this case study because the data was extracted from the Bank's data warehouse. During the load process for this warehouse, substantial data cleaning occurs to minimize the data preparation required for all analytical activities, including data mining.

After we created all the variables on each customer record, we had to clean the data. We profiled the data to determine how many variables had records with missing values, unknown values, invalid values, or valid values. Following are the definitions for possible field contents:

• Missing Value

− A record has no value for a particular field.

• Unknown Value

− A record has a value for a particular field that has no known meaning.

• Invalid Value

− A record has a value for a particular field that is invalid but whose meaning is known.

• Valid Value

− A record has a value for a field that is valid.

Data cleaning is the process of assigning valid values to all records with missing, invalid, and unknown values. In this case study only the transaction variables had missing values. (Transaction data is usually very consistent and has no invalid or unknown values.) The missing values resulted from particular customers having no transaction activity for a particular product. We assigned these missing values a value of zero.

We assigned a new value to all categorical variables that had records with missing and unknown values. We corrected the invalid values for these variables to valid values.
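The cleaning rules above (zero for missing transaction activity, an explicit new category for missing or unknown categorical values) can be sketched as follows; the field names and records are hypothetical:

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"txns_cat1": 3, "txns_cat2": None, "region": "NE"},
    {"txns_cat1": None, "txns_cat2": 1, "region": None},
]

TRANSACTION_FIELDS = ("txns_cat1", "txns_cat2")
CATEGORICAL_FIELDS = ("region",)

def clean(record):
    """A missing transaction count means 'no activity', so it becomes 0;
    missing or unknown categorical values get an explicit new category."""
    fixed = dict(record)
    for f in TRANSACTION_FIELDS:
        if fixed.get(f) is None:
            fixed[f] = 0
    for f in CATEGORICAL_FIELDS:
        if fixed.get(f) is None:
            fixed[f] = "UNKNOWN"
    return fixed

cleaned = [clean(r) for r in records]
```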

4.3.2.2 Data Transformation

After we cleaned the data, handled all missing and invalid values, and made the known valid values consistent, we were ready to transform the data to maximize the information content that can be retrieved.

For statistical analysis the data transformation phase is critical, as some statistical methodologies require that the data be linearly related to an objective variable, normally distributed, and free of outliers. Artificial intelligence and machine learning methods do not strictly require the data to be normal or linearized, and some methods, like the decision tree, do not even require outliers to be dealt with. This is a major difference between statistical analysis and data mining. The machine learning algorithms can automatically deal with the nonlinearity and nonnormal distributions, although the algorithms work better in many cases if these criteria are met. A good statistician with a lot of time can manually linearize, standardize, and remove outliers better than the artificial


intelligence and machine learning methods. The challenge is that with millions of data records and thousands of variables, it is not feasible to do this work manually. Also, most analysts are not qualified statisticians, so using automated methods is the only reasonable solution.

After cleaning the original data variables, we created new variables using ratios, differences, and business intuition. We created total transaction variables, which were the sums of the transaction variables over two quarters. We used these totals to create ratio variables. We created time-series variables to capture the time difference in all transaction variables between quarters.

Other variables that we calculated on the basis of our knowledge of the business were:

• Number of products purchased by the customer over a lifetime

• Number of products purchased by the customer in the last 12 months

• Revenue contribution of the customer over a lifetime

• Revenue contribution of the customer over the last 12 months

• Most Recent Customer Credit Score

• Customer tenure in months

• Ratio of (number of products/tenure)

• Ratio of (revenue/tenure)

• Recency

This last group of variables was designated as the ″shareholder value″ variables; these were the variables selected by the business to be used to create strategic customer relationship marketing initiatives.

To use the data in the demographic clustering algorithm, we discretized it. Discretization facilitates interpreting the results, for both the neural clustering and demographic clustering algorithms, and takes care of outliers. The following quantiles were calculated for all numeric variables: 10, 25, 50, 75, 90. The values of the variables at these breakpoints were determined, and the data was divided into six ordinal values.

We arbitrarily chose the quantiles for the discretization, and we found the selection useful.2 The quantile breaks were generated in an automated fashion. We then profiled the resulting distributions and manually adjusted them to be unimodal or at least monotonic. We selected the modality and monotonicity criteria for ease of interpretation; in our experience these criteria provide useful results.
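A minimal sketch of the quantile discretization described above, using a nearest-rank quantile and six ordinal bins. This is an illustration under simplifying assumptions, not Intelligent Miner's actual implementation:

```python
def quantile_breakpoints(values, quantiles=(0.10, 0.25, 0.50, 0.75, 0.90)):
    """Return the variable's values at the given quantiles (nearest-rank method)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, int(q * n))] for q in quantiles]

def discretize(value, breaks):
    """Map a continuous value to one of six ordinal bins (0..5):
    the bin is the number of breakpoints the value exceeds."""
    bin_ = 0
    for b in breaks:
        if value > b:
            bin_ += 1
    return bin_

values = list(range(100))  # toy data
breaks = quantile_breakpoints(values)
codes = [discretize(v, breaks) for v in values]
```

The five breakpoints yield six ordinal values, and because every extreme value falls into the top or bottom bin, outliers are absorbed automatically, which is the effect the text describes.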

To improve the clustering results, advanced analysts removed the correlated variables. Factor analysis can be used to create linearly independent components. For easy interpretation of results, the original data can be clustered against the components, and the variables most representative of the components chosen as input to the clustering algorithm.

Refer to Figure 13 on page 41 for a view of the original data and to Figure 14 on page 42 for a post-discretized view of the data.

2 We give credit for this discretization scheme to Dr. Messatfa from the IBM ECAM lab in Paris, France.


Figure 13 on page 41 shows the variable names taken from the data source, and Figure 14 on page 42 shows the variable names as follows:

• Unchanged variables have the original variable names.

• Changed variables have an underscore added to the end of the variable name.

• New variables appear like unchanged variables.


Figure 13. Original Data Profile


Figure 14. Post-Discretized Data Profile


Some of the key features to note in Figure 13 on page 41 and Figure 14 on page 42 are:

• The original data has missing values treated. (The Bank data warehouse data is cleaned before loading, and thus less cleaning is required.)

• The original data has continuous variables that are extremely skewed.

• The original data has multimodal variables.

• The discretized data is much easier to interpret than the original data.

• Many of the previously skewed distributions are ″normal″ in shape, which enables the algorithms to obtain accurate results and/or allows the results to be easily interpreted.

• Some of the data in the discretized set is still skewed, indicating that the data may not be useful.

To prepare the data for clustering with the neural clustering algorithm, we standardized some of the continuous variables, using a logarithmic transform. See Figure 15.

Figure 15. Post Logarithm Transformed Data Profile

Some key features of the logarithm transformed data are:

• The data is much less skewed.

• Some of the variables are unimodal (LAVGBAL, LRATIO1, LRATIO2, LRATIO3, LTENURE).

• Some variables (LREV12, LREV3) have two peaks because of a large number of records with zero or small values.

• Some variables (LDIFF3, LDIFF3TX, LDIFF6, LDIFF6TX) have three modes or peaks because of the transformation used. The data in transformed form is


much easier to visualize than in its original pre-prepared state. The algorithms should achieve better results using this data and/or results that will be much easier to interpret.
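The logarithmic standardization described above can be sketched as follows. The use of log1p (so that zero values are tolerated) and the subsequent zero-mean, unit-variance scaling are assumptions for illustration; the book does not specify the exact transform:

```python
import math

def log_standardize(values):
    """Compress a right-skewed, non-negative variable with log1p,
    then scale it to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mean = sum(logged) / len(logged)
    var = sum((x - mean) ** 2 for x in logged) / len(logged)
    std = math.sqrt(var) or 1.0  # guard against a constant variable
    return [(x - mean) / std for x in logged]

# A heavily skewed toy variable: many small values, a few large ones.
revenue = [0, 1, 2, 3, 5, 10, 500, 10000]
scaled = log_standardize(revenue)
```

Because the transform is monotonic, the ordering of customers is preserved while the extreme values are pulled in toward the bulk of the distribution.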

Once the data has been selected, prepared, and transformed, it is possible to run the data mining algorithms.

4.3.3 Data Mining

Figure 16 on page 45 shows the clustering process flow for this case study. We used demographic clustering so that we could use the results to interpret the output from neural clustering. Neural clustering can be difficult to interpret because of the use of continuous data, which is typically skewed or has been logarithm transformed to remove the skew.


Figure 16. Clustering Process Flow

4.3.3.1 Parameter Selection

Referring to Figure 16, you see that the first step in the clustering process, after selecting the data set (the discretized data in this case) and after selecting an algorithm to be run (demographic clustering in this case), is to choose these basic run parameters for the algorithm:

• Maximum number of clusters


This parameter indicates the maximum number of clusters allowed. The algorithm may find fewer. This feature is unique to Intelligent Miner. Most other clustering algorithms require that the number of clusters be specified.

• Maximum number of passes through the data

This parameter indicates how many times the algorithm can read the data. The higher this number and the lower the accuracy criterion (see below), the longer the algorithm will run and the more accurate the result will be. This parameter is a stopping criterion for the algorithm. If the algorithm has not satisfied the accuracy criterion after the maximum number of passes, it stops anyway.

• Accuracy

This number is a stopping criterion for the algorithm. If the change in the Condorcet criterion between data passes is smaller than the accuracy (as a percentage), the algorithm terminates.

• Similarity threshold

This parameter defines the similarity threshold between two values in distance units. The default distance unit is the absolute difference. Therefore, with the default threshold of 0.5, two values are considered equal if their absolute difference is less than or equal to 0.5.
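As an illustration of the similarity threshold with the absolute-difference distance unit and the default threshold of 0.5, the equality test amounts to the following sketch (not Intelligent Miner's code):

```python
def similar(a, b, threshold=0.5):
    """Demographic-clustering-style similarity: two values count as equal
    when their absolute difference is within the threshold."""
    return abs(a - b) <= threshold

# With the default 0.5 threshold, 3.2 and 3.6 count as the same value,
# while 3.2 and 4.0 do not.
pairs = [(3.2, 3.6), (3.2, 4.0)]
votes = [similar(a, b) for a, b in pairs]
```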

The neural clustering algorithm has the following parameters:

• Number of rows and number of columns

Multiply the two numbers together to get the maximum number of clusters. The rectangle defined by the number of rows and columns of neural network nodes changes the resulting clusters. Unless you are an advanced user, we recommend choosing the most ″square″ output grid shape. For example, if you want 9 clusters, choose 3 rows by 3 columns (the default). If you want 12 clusters, choose 4 rows by 3 columns as opposed to 6 rows by 2 columns.

• Number of passes

This parameter indicates the number of passes through the data the algorithm will make to build the neural network.

For the first clustering run, we selected a maximum number of clusters larger than the number we wanted at the end of the project. By selecting more we allowed the algorithm to choose fewer if that is all that is in the data. If the algorithm comes back with the maximum, we know that there are likely more clusters. The number of clusters chosen is driven by how many clusters the business can manage. In our experience, this number is less than 10 for most companies. For this case study we chose 9 for the maximum number of clusters. For the maximum number of passes, we chose 5 and specified the accuracy as 0.1. We left the similarity threshold at the default value of 0.5. The parameter settings for the number of passes and accuracy were arbitrary. We wanted a reasonable number of passes through the data to ensure a reasonable convergence of the solution.

For the initial neural clustering run, we selected a three-row by three-column grid; this selection results in a maximum of nine clusters. We left the number of passes at the default.


The analysis of the results of each run will guide the selection of parameters for follow-on runs. The clustering process is highly iterative, as shown in Figure 16 on page 45.

For the first run of the demographic clustering algorithm, we left the advanced parameter settings at the default. Because we discretized the data ahead of time, and all the discretized variables had approximately the same range, many of the advanced parameters were not required. We used these advanced parameter settings to allow continuous data to be effectively clustered with the algorithm:

• Distance measure

− Absolute

One unit of absolute difference in the magnitude of two record values for one variable.

− Range

The range (difference between maximum and minimum) of a variable is considered one distance unit.

− Standard deviation

The standard deviation of a variable is considered one distance unit. This setting is only meaningful if the variable is normally distributed.

• Field weighting

− Probability weighting

Uses the probability of the occurrence of a variable value to compensate for its contribution to the overall cluster result.

− Information theoretic weighting

Uses manually selected weights to compensate for the contribution of a variable to the overall cluster result.

4.3.3.2 Input Field Selection

We selected these input field variables for the first run:

• Number of products purchased by the customer over a lifetime

• Number of products purchased by the customer in the last 12 months

• Revenue contribution of the customer over a lifetime

• Most Recent Credit Score

• Revenue contribution over the last 12 months

• Customer tenure in months

• Ratio of (revenue/tenure): Ratio 1

• Ratio of (number of products/tenure): Ratio 3

• Region

• Recency

• Tenure (number of months since the customer first activated at the bank)

We used the discretized versions of these variables for demographic clusteringand the log-transformed continuous versions for neural clustering.


As discussed in section 4.2, “Business Requirements” on page 33, the first layer of clusters in the CRM framework is created by using shareholder value variables and any other variables the business would like to use to manage its customers. All other discrete and categorical variables and some interesting continuous variables were input as supplementary variables to be profiled with the clusters but not used to define them. These supplementary variables can be used to interpret the clusters as well. The ability to add supplementary variables at the outset of clustering is a very useful feature of Intelligent Miner, which allows the direct interpretation of clusters using other data very quickly and easily.

4.3.3.3 Output Field Selection

The entire data set was output with the cluster information appended to the end of each record. This was done so that the results of other clustering runs using both the demographic clustering and neural clustering algorithms could be directly compared by cross-tabulating the cluster IDs from the various schemes. This is one advantage of Intelligent Miner. Having multiple algorithms allows the output of one algorithm to be used as the input to another. The algorithms used in combination are more powerful than those applied alone.
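The cross-tabulation of cluster IDs from two runs can be sketched as follows; the cluster IDs shown are hypothetical:

```python
from collections import Counter

# Hypothetical cluster IDs appended to the same records by two different runs.
demographic_ids = [0, 0, 1, 1, 2, 2, 2, 0]
neural_ids      = [4, 4, 7, 7, 7, 1, 1, 4]

def crosstab(a, b):
    """Count how often each (cluster-A, cluster-B) pair co-occurs
    across the records of the output data set."""
    return Counter(zip(a, b))

table = crosstab(demographic_ids, neural_ids)
```

A heavily diagonal table (most of each row concentrated in one cell) indicates that the two schemes largely agree on how they partition the customers.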

4.3.3.4 Results Visualization

The output of the clustering algorithms is an output data set and a visualization. The visual results display the number of clusters, the size of each cluster, the distribution of each variable in each cluster, and the importance of each variable to the definition of each cluster (based on several metrics, including the chi-square test, entropy, and the Condorcet criterion).

The result is completely unsatisfactory if there is only one cluster, or if there is one very large cluster (>90%) and several small clusters. This situation will occur if highly skewed continuous variables are used as input or if the modal frequency of some of the discretized variables is very large (>50%-90%). If this situation occurs, we recommend using probability field weighting for the discrete variables and discretization of the continuous variables. The statistics of the input variables can be viewed in the cluster details.

4.3.3.5 Cluster Details Analysis

The cluster details contain some tabulated statistics for the cluster model. The global measures include the Condorcet criterion for the demographic clustering algorithm and the quality for neural clustering. Realistic ″good″ values for the Condorcet criterion are in the 0.6-0.75 range. Higher values are usually associated with the case of one very large cluster and a number of smaller clusters. ″Good″ neural cluster quality values are in the 0.5-0.7 range.

For the demographic clustering algorithm the details view also shows the Condorcet criterion for each cluster and for each variable globally and within each cluster, the similarities among all clusters, and the global statistics and statistics within each cluster for each variable. The neural clustering algorithm also shows global statistics and statistics within each cluster for each variable.

The details can be used, for example, to assess the quality of the cluster models, assess the contribution of each variable to the model, and compare different cluster models.


4.3.3.6 Cluster Profiling

The next step in the clustering process is to profile the clusters by executing SQL queries. The purpose of profiling is to quantitatively assess the potential business value of each cluster by profiling the aggregate values of the shareholder value variables by cluster. The scientific quality of the clusters should also be profiled. Some of the variables for profiling include:

• Record scores

− Intelligent Miner provides a score on each record in addition to the cluster ID, which is a measure of how well the record fits the cluster model.

• 2nd choice cluster

− Intelligent Miner provides a cluster ID for the second choice cluster to which the record could have been assigned.

• 2nd choice scores

− Intelligent Miner provides the score for how well the record fits the second choice cluster assignment.

• Comparison of methods considering 2nd choice clusters and scores

• Other measures including entropy, chi-square, Euclidean distance
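The profiling SQL itself is not listed in the book. An equivalent in-memory sketch of grouping the shareholder value variables and fit scores by cluster (field names and values are hypothetical) might look like:

```python
from collections import defaultdict

# Hypothetical scored records: cluster ID plus shareholder value variables.
records = [
    {"cluster": 0, "revenue_12m": 900.0, "score": 0.82},
    {"cluster": 0, "revenue_12m": 700.0, "score": 0.91},
    {"cluster": 1, "revenue_12m": 50.0,  "score": 0.64},
]

def profile_clusters(records):
    """Aggregate shareholder value and fit score by cluster,
    mirroring a SQL 'GROUP BY cluster' profiling query."""
    groups = defaultdict(list)
    for r in records:
        groups[r["cluster"]].append(r)
    profile = {}
    for cid, rows in groups.items():
        n = len(rows)
        profile[cid] = {
            "size": n,
            "avg_revenue_12m": sum(r["revenue_12m"] for r in rows) / n,
            "avg_score": sum(r["score"] for r in rows) / n,
        }
    return profile

profile = profile_clusters(records)
```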

4.3.3.7 Cluster Characterization (Qualitative)

Once the cluster algorithm has been run, the next step is to qualitatively characterize the clusters. Cluster characterization can be completed using the results visualization. Each cluster should be considered variable by variable. The differences and similarities among the clusters, the variable distributions by cluster and global distribution, the cluster sizes, and the ordering of variables within the cluster by different metrics should be noted.

4.3.3.8 Cluster Characterization Using a Decision Tree

One of the disadvantages of cluster models is that there are no explicit rules to define each cluster. The model is thus difficult to implement, and there is no clear understanding of how the model assigns cluster IDs. The cluster model tends to be a black box. You can use a decision tree to classify the cluster IDs using all the input data and supplementary data that was used in the clustering algorithms. The decision tree will define rules that classify the records using the cluster ID. In many instances, based on our experience, the decision tree produces a very accurate representation of the cluster model (>90% accuracy). If the tree representation is accurate, it is preferable to implement the tree because it provides explicit, easy-to-understand rules for each cluster.
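Intelligent Miner's decision tree cannot be reproduced here; as a stand-in, a one-level "stump" illustrates the idea of deriving an explicit, human-readable rule that separates cluster IDs. The records and field names are toy data invented for the illustration:

```python
def best_stump(records, label="cluster"):
    """Find the single (field, threshold) split that correctly classifies
    the most records by majority vote on each side; this is a one-level
    stand-in for the full surrogate decision tree."""
    fields = [k for k in records[0] if k != label]
    best = None
    for f in fields:
        for r in records:
            t = r[f]
            # candidate rule: field <= t, majority label on each side
            left = [x[label] for x in records if x[f] <= t]
            right = [x[label] for x in records if x[f] > t]
            correct = 0
            for side in (left, right):
                if side:
                    correct += max(side.count(c) for c in set(side))
            if best is None or correct > best[0]:
                best = (correct, f, t)
    return best

records = [
    {"revenue": 100, "tenure": 5,  "cluster": "A"},
    {"revenue": 120, "tenure": 6,  "cluster": "A"},
    {"revenue": 900, "tenure": 40, "cluster": "B"},
    {"revenue": 950, "tenure": 42, "cluster": "B"},
]
correct, field, threshold = best_stump(records)
```

The resulting rule (here, a revenue threshold) is exactly the kind of explicit, easy-to-understand condition per cluster that the text says makes the tree preferable to the black-box cluster model, although a real tree would of course stack many such splits.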

4.3.3.9 Final Result

The final clustering result is selected on the basis of a combination of scientific and business reasons. Cluster models that have good global values of the Condorcet criterion or quality, whose clusters are distinct and different from each other, and which can be accurately modeled with a decision tree are ″scientifically″ good. Good business models are defined by sensible interpretation of the clusters, good segmentation in shareholder value variables, segmentation that drives obvious business strategies, and segments that are actionable.

Chapter 4. Customer Segmentation 49


4.4 Data Mining Results

Figure 17 on page 51 presents the results of several iterations of demographic clustering. This diagram is the cluster visualizer in Intelligent Miner that is used by both demographic clustering and neural clustering.

Here is some general information to help you read the diagram:

• Each strip of the diagram represents a cluster.

• The clusters are ordered from top to bottom according to their size.

• The numbers down the left side show the size of the cluster as a percentage of the universe.

• The numbers down the right side are cluster IDs.

• The variables are ordered from left to right in their order of importance to the cluster, based on chi-square tests between the variables and cluster IDs. This is the default metric. Among other ordering criteria you could use are entropy, the Condorcet criterion, and database order.

• The variables in square brackets are the supplementary variables. Variables without brackets are those used to define a cluster.

• Numeric (integer), discrete numeric (smallint), binary, and continuous variables have their frequency distribution or histogram shown as a bar graph. The outlines in the foreground of the bars indicate the distribution of the variable within the current cluster. The grey solid bars in the background indicate the distribution of the variable in the entire universe. The more different the cluster distribution is from the distribution within the entire universe, the more interesting or distinct the cluster is.

• Categorical variables are shown as pie charts. The inner pie represents the distribution of the categories for the current cluster, and the outer ring represents the distribution of the variable for the entire universe. Again, the more different the distribution of the variable is for the current cluster as compared to the average distribution, the more interesting or distinct the cluster is.
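The default chi-square ordering of variables can be approximated as follows. This is a hedged sketch with invented data: `scipy.stats.chi2_contingency` stands in for whatever test Intelligent Miner runs internally, and the variable names are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
cluster = rng.integers(0, 3, 600)
df = pd.DataFrame({
    "cluster": cluster,
    "var_a": rng.integers(0, 2, 600),     # unrelated to the clusters
    "var_b": (cluster == 2).astype(int),  # strongly tied to one cluster
})

# Score each variable by its chi-square statistic against the cluster ID
scores = {
    col: chi2_contingency(pd.crosstab(df[col], df["cluster"]))[0]
    for col in ["var_a", "var_b"]
}
# Order variables by importance to the clustering (highest chi-square first)
for col, chi2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{col}: chi2 = {chi2:.1f}")
```

The variable that is strongly associated with the cluster ID comes out with a much larger chi-square statistic and therefore sorts first, mirroring the left-to-right ordering in the visualizer.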

50 Intelligent Miner Applications Guide


Figure 17. Shareholder Value Demographic Clusters

The result shows that there are nine clusters in the model. There are likely more clusters in the data, as we chose nine to be the maximum number of clusters allowed. The clusters are reasonably distributed (there is not one very large cluster). The variable distributions within the clusters tend to be different from their global distributions. The Best98, Revenue, and CreditScore variables are commonly important to several clusters.

For comparison purposes, a high-level neural clustering is shown in Figure 18 to highlight some similarities and differences between the results from the two different methods.

Figure 18. Shareholder Value Neural Network Clusters

The input variables chosen for the neural network were the logarithm-transformed versions of the variables used for the demographic clustering. The discretized variables used for the demographic clustering were input as supplementary variables to aid in the interpretation of the neural clusters. Some key features to note are:


• The neural clusters are not quite as uniformly distributed with respect to cluster size as the demographic cluster results. In our experience, the opposite is usually the case.

• The same variables as in demographic clustering appear as the most important variables (for example, Best98, the other Best variables, REVENUE_, CREDSCORE_, and NUMPROD_).

• The discretized variables are more significant to the cluster definitions than the logarithm-transformed variables used to create the clusters. This illustrates one of the values of discretization: you can use the discrete variables to assist in the interpretation of clusters while using the continuous variables to build the clusters.

Because of the similarity of the neural clustering and demographic clustering results, and in an effort to reduce redundancy in the presentation of results, the discussion below focuses on the demographic clustering results.

4.4.1 Cluster Details Analysis

Figure 19 on page 54 shows the cluster details. From this result we can see that the global Condorcet value is 0.6098. This value is at the low end of a reasonable result. The lower value may be due to several factors, including the fact that we restricted the output to nine clusters when there may have been more, that some variables are not very good, or that the data does not contain distinct clusters.

The quality of the clusters ranges from a Condorcet criterion value of 0.42 to 0.72, as shown in the Cluster Characteristics section in Figure 19 on page 54.

In the Similarity Between Clusters section of Figure 19 on page 54, you can see that there is some similarity among clusters, with the similarity measure ranging from <0.25 to a maximum of 0.42.

Figure 19 on page 54 also shows that the REVENUE12_, CREDSCORE_, and NPRODL12 variables have low Condorcet values. Therefore they could be removed from the cluster model to improve the result (see the Reference Field Characteristics section in Figure 19 on page 54).

These results indicate that further iteration is warranted.


Figure 19. Shareholder Value Demographic Cluster Details


4.4.2 Cluster Characterization

In this section we discuss the characterizations of some of the interesting clusters in Figure 17 on page 51.

The Best98 variable is a binary variable that indicates the best customers in the database as determined by other means. The clustering model presented seems to agree very well with this existing definition, as most of the clusters contain almost all Best or no Best customers. As a first pass, this is an exciting result, as the status quo Best segment has been confirmed with little effort! To be confident of the data mining results, you should always observe the current business knowledge in the results. Any successful company knows its business well enough that the obvious results should show clearly in any data mining results. Observing the current business knowledge provides confidence that the data selection and data preparation efforts have been valid. If results are observed that were previously unknown, one can have confidence in them as long as they appear alongside currently known facts.

This clustering result not only validates the existing concept of best customers, it also extends the idea of best customers by creating clusters within Best. It can be seen from Figure 17 on page 51 that there are several clusters with varying levels of revenue. Perhaps this builds a case to create a "Very Best" customer group?

Cluster 6 can be interpreted as almost all Best98 customers, whose credit score, revenue in the last 12 months, revenue per month, and number of products used per month are in the 50th to 75th percentile. (Recall the discretization definition in 4.3.2.2, "Data Transformation" on page 38.) Cluster 6 represents 24% of the population. Refer to Figure 20 on page 56 for a detailed view of cluster 6.


Figure 20. Cluster 6 Detailed View


Cluster 3 can be interpreted as almost no Best98 customers, whose revenue, credit score, revenue in the last 12 months, revenue per month, and number of products per month are all in the 25th to 50th percentile. (Recall the discretization definition given in 4.3.2.2, "Data Transformation" on page 38.) Cluster 3 represents 23% of the population. Refer to Figure 21 on page 58 for a detailed view of cluster 3.


Figure 21. Cluster 3 Detailed View


Cluster 5 represents 9% of the population; the customers' revenue, credit score, and number of products per month are all in the 75th percentile and above, skewed to almost all greater than the 90th percentile. The Best95, Best96, and Best97 variables represent the status of the customers in the calendar years 1995, 1996, and 1997. The fraction of customers who were Best increased each year! This looks like a very profitable cluster. Refer to Figure 22 on page 60 for a detailed view of cluster 5.


Figure 22. Cluster 5 Detailed View

Figure 23 on page 61 provides the tabulated details for cluster 5.


Figure 23. Cluster 5 Tabulated Details


Cluster 5 contains 8.9% of the customer population. The Condorcet value for cluster 5 is 0.5946, just below the global value. Cluster 5 is most similar to cluster 7 and cluster 0. Notice that REVENUE_ and CREDSCORE_ have Condorcet values of 0.71 and 0.83, respectively. Recall that globally these variables had low Condorcet values, but for this cluster they have very high values. NPRODL12_ has a low Condorcet value, 0.37, for this cluster and is low globally. This information can be used to decide whether or not these variables should be included in the model. The details also present the chi-square value and the entropy value, which are measures of the association between the variable and the cluster to which the records have been assigned.

In cluster 1, the supplementary variable NEW is a binary variable that indicates whether or not the customer is new to the Bank. This cluster clearly consists of new customers. The recency is low (which means the customer has not had a recent transaction, that is, they have opened accounts but not transacted yet), and the tenure is low. It would be very interesting to track these customers over time to see how they progress. Refer to Figure 24 on page 63 for a detailed view of cluster 1.


Figure 24. Cluster 1 Detailed View


4.4.3 Cluster Profiling

In this section we present an example of a profile of revenue, number of products purchased, and customer tenure (see Table 1). The Leverage column is the ratio of the cluster's share of revenue to its share of customers. Table 1 shows that cluster 5 is the most profitable cluster in that it represents 35% of the revenue yet only 9% of the customers. The leverage ratio is the highest for this cluster. From Table 1 you can also see that as profitability increases, so does the average number of products purchased. The product index is the average number of products purchased by the customers in the cluster divided by the overall average number of products purchased. It is also interesting to note that customer profitability increases as customer tenure increases.
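The Leverage column can be recomputed directly from the revenue and customer shares reported in Table 1; a minimal sketch for two of the clusters:

```python
# Revenue share and customer share (%) for two clusters, from Table 1
table1 = {
    "5": {"revenue": 34.74, "customer": 8.82},
    "6": {"revenue": 26.13, "customer": 23.47},
}

# Leverage = cluster's share of revenue divided by its share of customers
for cid, row in table1.items():
    leverage = row["revenue"] / row["customer"]
    print(f"cluster {cid}: leverage = {leverage:.2f}")
```

Cluster 5's share of revenue is nearly four times its share of customers, which is why it stands out as the most profitable cluster.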

From this simple result it is possible to derive some high-level business strategies. From Table 1 it is obvious that the best customers (considering only the data in the table) are in clusters 2, 5, and 7. These customers have a higher revenue per person than other clusters, as indicated by the Leverage column.

Some possible high-level business strategies are:

• Retention strategy for best customers (clusters 2, 5, and 7)

− A business does not want to lose its best customers.

• Cross-sell strategy for clusters 2, 6, and 0 by contrasting them with clusters 5 and 7.

− Clusters 2, 6, and 0 have a product index close to that of clusters 5 and 7, which have the highest number of products purchased. Because the clusters are close in number of products purchased, it is not a big stretch to convert customers from clusters 2, 6, and 0. By comparing the products bought by the best customers to those purchased by clusters 2, 6, and 0, you can find missing products, which are candidates for cross-selling.

− If you could increase the number of products purchased by 10% of cluster 6 customers by one additional product, you could increase the profitability of cluster 7 by 20% and the entire base by 5%.

− If you could increase the number of products purchased by 10% of cluster 7 customers by two products, you could increase the profitability of cluster 2 by 25% and the entire base by 9%.

Table 1. Customer Revenue by Cluster

Cluster ID   Revenue   Customer   Product index   Leverage   Tenure
5            34.74%     8.82%     1.77            3.94       60.92
6            26.13%    23.47%     1.41            1.11       57.87
7            21.25%    10.71%     1.64            1.98       63.52
3             6.62%    23.32%     0.73            0.28       47.23
0             4.78%     3.43%     1.45            1.40       31.34
2             4.40%     2.51%     1.46            1.75       61.38
4             1.41%     2.96%     0.99            0.48       20.10
8             0.45%    14.14%     0.36            0.03       30.01
1             0.22%    10.64%     0.00            0.02        4.66


• You can similarly cross-sell clusters 3 and 4 compared to clusters 2, 6, and 0, as they are close in value.

• The strategy for cluster 1 would be a wait-and-see plus information strategy.

− Cluster 1 appears to be a group of new customers. As they are new customers, sufficient data has not been collected to determine the behaviors they may exhibit. Informing cluster 1 of the products and services the business offers could help make them profitable quickly.

• The strategy for cluster 8 may be not to spend any significant marketing dollars.

− Cluster 8 appears to be the worst cluster; it has a very low revenue percentage and purchases very few products, although it has been with the company for about 30 months.

4.4.3.1 Cluster Results Comparison

Intelligent Miner permits the output of one algorithm to be used as the input to another. Table 2 is a cross-tabulation of the cluster IDs created by the neural clustering model and the demographic clustering model. The neural network cluster ID distribution is presented by row, and the demographic clustering distribution by column. The comparison shows the similarity of the two models.

The highlighted cells indicate a significant overlap between the two models. From Table 2 it is possible to conclude the following:

• Cluster 1 and cluster 0 from the two models agree almost 100%. The agreement is not usually this good unless the cluster is very distinct. In this case the cluster contains new Bank customers with very little activity. The fact that both models agree would allow you to apply this particular cluster with confidence.

• The cluster models agree fairly well with each other. The results indicate that there are likely more than nine clusters. Rerunning the models with a higher maximum number of clusters should result in better agreement between the two models.

Table 2. Comparison of Neural and Demographic Clustering Results (rows: neural cluster ID; columns: demographic cluster ID)

            0     1     2      3     4     5      6     7     8   Total
0           3  5306     0      0     7     0      0     0     0    5316
1           2     7     0    183    89     0      1     0   567     849
2           1     8     3    665    21     0      0     0  5182    5880
3        1247     0    37     14   648   533    812   169     0    3460
4           3     0    11   2163   455     1    355     0    45    3033
5           2     0    28   5343    32     0      9     2  1277    6693
6          69     0   744      4     3  3733   4661  4625     0   13839
7         124     0   400   2461    33    99   4707   490     0    8314
8         262     0    34    828   193    43   1189    67     0    2616
Total    1713  5321  1257  11661  1481  4409  11734  5353  7071   50000
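A cross-tabulation of this kind can be produced with `pandas.crosstab`; the toy labels below stand in for the 50,000 customers of the case study:

```python
import pandas as pd

# Two cluster assignments for the same eight customers (invented labels)
neural = pd.Series([0, 0, 1, 1, 2, 2, 2, 0], name="neural")
demo = pd.Series([0, 0, 1, 2, 2, 2, 2, 1], name="demographic")

# Rows: neural cluster IDs; columns: demographic cluster IDs
xt = pd.crosstab(neural, demo, margins=True)
print(xt)
```

Large off-diagonal masses in such a table indicate where the two models disagree; a near-diagonal table (after relabeling) indicates the models found essentially the same segments.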


4.4.4 Decision Tree Characterization

One disadvantage of clustering methods is that the cluster definitions are not easily extracted. Building a decision tree model with the cluster ID as the field to be classified, using all available input data, allows explicit rules to be extracted for each cluster. The decision tree model built using the demographic clustering result from above showed an accuracy of 95% (see the confusion matrix in Figure 25). The confusion matrix shows the distribution of the classification errors and the global accuracy.
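A confusion matrix in the style of Figure 25 can be computed from true and predicted cluster IDs; scikit-learn stands in here for Intelligent Miner's output, and the eight labels are illustrative:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [5, 5, 3, 3, 6, 6, 6, 1]   # cluster IDs from the clustering model
y_pred = [5, 5, 3, 6, 6, 6, 6, 1]   # cluster IDs predicted by the tree

# Rows: true cluster; columns: predicted cluster (labels sorted ascending)
print(confusion_matrix(y_true, y_pred))
print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")
```

Off-diagonal entries count the misclassified records per cluster pair; the diagonal fraction is the global accuracy figure quoted in the text.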

Figure 25. Decision Tree Confusion Matrix

See Figure 26 on page 67 for a view of the decision tree model and a rule for cluster 5. Rules for each of the clusters can be extracted.


Figure 26. Decision Tree Model

As the accuracy of the decision tree is very high (95%), it is preferable to implement the decision tree version of the customer segmentation model rather than the original demographic clustering model.

4.5 Business Implementation and Next Steps

The results of this case study drew several reactions from the Bank executives:

1. Excellent visualization of results allows for more meaningful and actionable analysis.

2. The original segmentation methodology was validated very quickly.

3. Refinement to the original segmentation is indicated and meaningful.

Based on the results of this case study, several data mining opportunities were identified, and several projects were undertaken. Some of these projects include:

• Several predictive models for direct mail targeting

• Further work on segmentation, using more detailed behavioral data

• Opportunity identification using association algorithms within the segments discovered


Data mining tools can be used to quickly find business opportunities in customer transaction data. The simple example presented herein attempts to highlight a process that can be used to achieve profitable data mining results.

Once a segmentation model is built and the customer is satisfied with the result, the model is ready to be implemented. The first step in the implementation is to integrate the model into the data warehouse and to modify the data warehouse load process to automatically assign customers to the appropriate segments. The variables used in the final segmentation model should be calculated and stored in the data warehouse permanently. A data warehouse table should be created to track each customer over time and record which segment the customer was a part of in each time period. Such a table is very useful for analytical purposes and can be used to measure the overall effectiveness of marketing campaigns by observing their effect on customer behavior over time.
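Such a tracking table might look like the following sketch; the schema (one row per customer per period) and all column names are assumptions for illustration, not the Bank's warehouse design:

```python
import pandas as pd

# One row per customer per period, recording the assigned segment
history = pd.DataFrame({
    "customer_id": [101, 101, 102, 102],
    "period": ["1998-Q3", "1998-Q4", "1998-Q3", "1998-Q4"],
    "segment": [5, 5, 8, 6],
})

# Customers whose segment assignment changed between periods
changed = history.groupby("customer_id")["segment"].nunique()
print(changed[changed > 1].index.tolist())
```

Counting segment moves per customer between periods is one simple way to observe the effect of a marketing campaign on customer behavior over time.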

The segmentation model should also be rebuilt periodically (in our experience, from monthly to annually, depending on the organization). A comparison of segmentation models over time should reveal changing market dynamics and changing customer behavior due to an organization's marketing efforts, changing products and services, and social, political, and economic changes.

When the segmentation model has been implemented in the data warehouse, it is possible to begin using it to drive actionable business activities. The customer segment information can be used in operational data stores to support continuous marketing and other operational activities, to create standard reports highlighting the shareholder value, demographic profiles, and transaction behavior of each segment, and as a framework to support opportunity identification.

The next case study explores the use of affinity analysis within the segmentation model defined herein to find profitable cross-selling opportunities.


Chapter 5. Cross-Selling Opportunity Identification

This case study shows how to use Intelligent Miner's product association algorithms to identify a cross-selling opportunity that is actionable and profitable. It is based on the customer segments derived from the first case study, whose strategic initiative is to increase profitability.

5.1 Executive Summary

The business requirements for this case study are to identify cross-selling opportunities for the customer segments defined in Chapter 4 and to ensure that the opportunities discovered adhere to the corporate objectives.

Customer purchase transactions or billing data are required to perform product associations. We used the Bank's data warehouse to analyze transaction data, thereby reducing the data preparation requirements for the case study.

A review of the cross-selling process using association discovery is presented in sufficient detail for technical analysts to be able to reproduce the project using their own data.

The cluster selected for cross-selling opportunity represented 7% of revenue for 23% of customers. The target behavior cluster represented 26% of revenue for 23% of customers (see Figure 17 on page 51 or Table 1 on page 64 for the corresponding clusters). Changing the behavior of 10% of the cross-selling cluster to that of the target cluster would represent a 25% increase in the cluster's profitability, or a 3% increase in the overall profitability of the business. A credit card product category was identified as a cross-selling opportunity. Several specific products within the credit card category were also identified. The selection of these products was driven by the fact that the Bank's executives, based on previous analyses, knew that credit card products were very profitable. Our analyses also revealed that these products have the highest profit potential. Confirmation of the business intuition provided additional confidence in proceeding with a campaign. Although the data mining analysis simply confirmed the business intuition, it provided quantitative results and a specific target group, both of which were previously missing.

The recommended next steps include some demographic profiling of the target customer group to assist the marketer in creating appropriate advertising and marketing messages as well as in selecting marketing channels. To further refine the target group, we also recommend the construction of a predictive model, which is the content of a later case study (see Chapter 7, "Attrition Model to Improve Customer Retention" on page 111).

5.2 Business Requirement

The main objective of this case study was to use data mining techniques to find actionable cross-selling opportunities from the analysis of customer transaction data. Any opportunities that are identified should support strategic marketing initiatives for the customer segments used by the organization. The segmentation and the strategic initiatives recommended from the previous case study in Chapter 4, "Customer Segmentation" on page 33 should be used.

Copyright IBM Corp. 1999 69


Finally, the next steps required to implement the cross-selling opportunities as a marketing campaign should be recommended.

5.3 Data Mining Process

Figure 27 on page 71 highlights the data mining process implemented in this case study to meet the business requirements. The major steps in the process are:

1. Cluster (segment) selection

2. Transaction data selection

3. Data preparation

4. Product association mining

5. Results analysis

6. Compare to identify cross-selling opportunities

7. Compare methodology

8. Select a cross-selling opportunity

We cover steps 1 through 4 in this section. We cover the other steps in 5.4, "Data Mining Results" on page 76.


Figure 27. Data Mining Process: Cross-Selling Opportunity

5.3.1 Cluster Selection

The process to find cross-selling opportunities within a specific customer segment depends on contrasting the purchase behavior of two or more clusters. (The method discussed here is not the only method of finding cross-selling opportunities.) One cluster is selected to be the group of customers whose behavior is to be replicated in other clusters; this cluster is usually the more profitable one. For cross-selling opportunity identification, the purchase behaviors of interest are represented as product associations derived from purchase transactions. Comparing the patterns of two or more clusters highlights product pattern differences. For instance, if cluster A had a product association (A,B-->C) and cluster B had a product association (A-->B), the cross-selling opportunity would be to market product C to cluster B. The behavior of the clusters being contrasted should not differ significantly. The gap in missing products should not be more than one to three products, because it is difficult to change customer behavior drastically. Furthermore, too large a gap between clusters could indicate fundamentally different customer behaviors that would be impossible or difficult to bridge.
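The (A,B-->C) contrast described above reduces to simple set logic; a minimal sketch with invented itemsets, not the Bank's actual rules:

```python
# Frequent itemsets per cluster (illustrative only)
target_patterns = [{"A", "B", "C"}]   # profitable cluster: A,B -> C
crosssell_patterns = [{"A", "B"}]     # cluster to be converted: A -> B

# A target pattern the cross-sell cluster almost has (proper superset)
# points at the missing products to promote
for pattern in target_patterns:
    for owned in crosssell_patterns:
        if owned < pattern:           # proper subset: behaviors are close
            print("cross-sell candidates:", sorted(pattern - owned))
```

The proper-subset test encodes the "small gap" requirement: the cross-sell cluster already exhibits most of the target pattern, so only the set difference (here, product C) needs to be marketed.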

5.3.2 Data Selection

The data required to create product associations is customer transaction data, market basket data, or any other data that has a similar layout. Figure 28 illustrates a typical transaction record used as input in a data mining project.

Figure 28. Typical Transaction Record

For market basket analysis the customer is not usually known, and product associations are found using transaction or market basket data. In this case study the customer is known, and it is therefore possible to link customer transactions over time; this is much more powerful than an analysis of market baskets without the customer ID.

The following considerations are important in the selection of transaction data for association rule mining:

• Time window of transactions

• Level of product aggregation

• Definition of product activity

The selection of a time window for the transactions is driven by the product purchase cycle. We typically choose 2 to 4 product cycles, a range that has produced positive results. The average purchase cycle can be determined by query analysis of customer purchase transactions. (If the customer ID is not known, the product cycles must be determined by empirical or survey methods.) For frequently purchased products, a short time window is sufficient. A long time window, and hence more transaction records, is required for low-frequency items. It is typically more difficult to find patterns in low-frequency items because of the amount of data and the prevalence of too many product cycles of high-frequency-item transactions. To find patterns between low-frequency items, we recommend removing the transaction line items for all high-frequency products. If the objective is to find patterns between low- and high-frequency purchases, there is no choice but to use the long time window and all transaction details.


For this case study we selected a 12-month window of customer transaction data. This implies that the patterns or associations discovered will be for customer purchases that occur with a frequency of six months or less.

Another important consideration is the level of aggregation chosen for product definition. If product codes are too specific (that is, they are based on product details like size and flavor groupings), fewer associations will be discovered. The associations discovered will also be less actionable because of the specificity required in a promotional advertisement. A product taxonomy or hierarchy is usually helpful in guiding the selection of product definition.

For this case study we used product categories, which reduced the number of possible product codes from more than 130 to 13.
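The rollup from specific product codes to a coarser category level can be as simple as a lookup table; the codes and categories below are invented for the sketch:

```python
# Map detailed product codes to the coarser category level
taxonomy = {
    "VISA_GOLD": "credit card",
    "VISA_CLASSIC": "credit card",
    "SAV_BASIC": "savings",
    "SAV_PREMIUM": "savings",
}

# Replace each transaction's product code with its category
transactions = ["VISA_GOLD", "SAV_BASIC", "VISA_CLASSIC"]
categories = [taxonomy[p] for p in transactions]
print(categories)
```

Mining at the category level raises the support of each item and so surfaces more (and more actionable) associations than mining over the full set of detailed codes.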

A final consideration important to product association analysis is the definition of what constitutes a product purchase. This is more relevant when the customer ID is known. If all products that were purchased only once over time are included, more product patterns will be discovered, some of which may not be very strong (that is, they have lower confidence). Setting some minimum criterion for inclusion of a particular product for a particular customer should reduce the number of weak rules, and thus permit easier analysis. One possible threshold is to consider only products that have been purchased more than once by a customer, or products on which the customer has spent some minimum amount of money.
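One way to apply the purchased-more-than-once threshold, assuming a simple (customer, product) transaction layout with invented values:

```python
import pandas as pd

# Transaction lines: one row per purchase (layout assumed for the sketch)
tx = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3],
    "product": ["chequing", "chequing", "visa", "savings", "savings", "visa"],
})

# Keep only (customer, product) pairs purchased more than once
counts = tx.groupby(["customer", "product"]).size()
kept = counts[counts > 1].reset_index(name="purchases")
print(kept)
```

Single-purchase pairs drop out before association mining, which removes many of the weak, low-confidence rules the text warns about.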

5.3.3 Data Preparation

Transaction data for organizations that generate revenue through customer billing is typically very "clean." Such industry sectors include finance, telecommunications, insurance, and utilities. In these industries the transaction data to be analyzed is actually billing data. Very little data preparation is required to perform product association analysis for these industry sectors. The preparation activities conducted would typically include:

• Ensuring that product codes are consistent

• Adding product hierarchy information

• Creating new product hierarchy levels

Product IDs that reference the same product should be made consistent. Variations in the product ID can result from the use of different codes in different stores or regions, code changes due to supplier change, new coding systems being implemented, and errors. If the product IDs are not made consistent, the support for patterns, and hence the number of patterns discovered, will be lower.

The product codes in the customer transaction data used for this case study were consistent, with only a few exceptions.

Adding in the product hierarchy information (as illustrated in Figure 28 on page 72) allows the product association mining to be easily conducted using different levels of product definition. The Bank's data warehouse already contained the product hierarchy information on each transaction record, obviating this step in the process.

The final data preparation activity that may be required is the manual creation of new product hierarchy levels. This activity is required when too few patterns are discovered as a result of product definitions that are too specific. In such cases, we recommend using a higher level in the product hierarchy. If, however, using the higher level results in rules that are too general and hence difficult to action, the creation of an intermediate layer is required. This process can be very laborious if the number of possible products is large. (Most industries have several hundred products, except retail, which may have tens of thousands!)

The creation of the product hierarchy for the Bank's data was based on past analytical experience and was therefore appropriate for analysis without modification.

5.3.4 Product Association Analysis

Figure 29 illustrates the steps required to discover product associations:

1. Parameter Settings

2. Association Discovery

3. Profile Rules and Large Item Sets (LIS)

4. Selectively Remove Large Item Sets

5. Iteration back to step 1

6. Rebuild Rules

Figure 29. Product Association Analysis Workflow


5.3.4.1 Parameter Settings

The first step in setting up an association run is to select the algorithm parameters. The parameters available include:

• Minimum support

This is the minimum frequency of occurrence of a pattern required for a rule to be considered.

• Minimum confidence

This is the minimum conditional probability of the rule tail given the rule head required for a rule to be considered.

• Maximum rule length

This is the maximum number of products allowed in any rule to be considered.

• Item constraints

This is a list of items that all rules must contain in order to be considered.

Starting with values that are too low for support and confidence may cause unnecessary computer load. The association algorithm becomes very memory and CPU intensive as the number of products and the number of rules considered grow. We recommend choosing very high values for support and confidence (50% for both) and gradually lowering them until the number of patterns becomes unwieldy. We usually leave the confidence level at 50% to eliminate most of the permutations of rules that meet the minimum support criterion. For example, if rule (A-->B) meets the support criterion, so does rule (B-->A). If the original rule meets the confidence criterion, the permutation usually does not. Reducing the permutations results in fewer rules and permits easier analysis of the results. We never limit the maximum rule length or constrain the list of items within the algorithm. If certain items are not to be considered, it is more convenient to remove them from the transaction records.
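The support and confidence measures described above, and the asymmetry between a rule and its permutation, can be sketched with a toy example. This is illustrative Python, not the Intelligent Miner algorithm; the baskets are invented.

```python
# Sketch of support and confidence over a toy set of customer baskets.
def support(baskets, items):
    """Fraction of baskets containing all the given items."""
    hits = sum(1 for b in baskets if items <= b)
    return hits / len(baskets)

def confidence(baskets, head, tail):
    """P(tail | head) = support(head + tail) / support(head)."""
    return support(baskets, head | tail) / support(baskets, head)

baskets = [
    {"loan", "card"}, {"loan", "card"},
    {"card"}, {"card"}, {"card"}, {"loan"},
]

# (loan --> card) and its permutation (card --> loan) share the same support,
# but typically only one direction clears a 50% confidence floor.
s = support(baskets, {"loan", "card"})           # 2/6
c_fwd = confidence(baskets, {"loan"}, {"card"})  # 2/3, passes 50%
c_rev = confidence(baskets, {"card"}, {"loan"})  # 2/5, fails 50%
```

This is why leaving confidence at 50% prunes most mirror-image permutations of rules that meet the support criterion.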

5.3.4.2 Association Discovery

Association discovery is repeated for all the clusters that were selected for contrasting. The minimum set of rules in the two cases is compared to identify products not present in some of the clusters. The list of removed large item sets (LIS) is also compared to identify products not present in some clusters. These missing products are the cross-selling candidate opportunities. The acceptance of candidates as actionable opportunities is usually driven by the number of customers who have the missing product. Too small a group of customers will have too little return to justify the promotional investment.

5.3.4.3 Selectively Remove Large Item Sets

Having determined the parameter bounds, you can discover the association rules. The number of rules generated initially is usually very large and intimidating. Rather than changing the parameter settings at this point, it is possible to begin temporarily removing certain products from the transaction records. The products removed are LIS. There are two types of LIS:

1. Large item sets whose frequency in the entire transaction data universe is statistically equivalent to the frequency in the current data set. These items complicate the analysis and should be removed to achieve less complicated rules that are easy to analyze.


2. Large item sets whose frequency in the entire transaction data universe is statistically different from the frequency in the current data set. These items are the items that make up the patterns discovered.

After the first type of LIS is removed from the transaction data, the associations are rediscovered. If the number of rules is still unmanageable, begin removing the second type of LIS, noting carefully what is removed. The associations are again rediscovered. This process is repeated until the number of rules is manageable (usually 20 to 50 rules). Removing the LIS allows you to understand the "structure" of the rules. The remaining 20 to 50 rules at this point form the core of the rules. The initial unmanageable set of rules is created by permuting the LIS with the remaining rules and applying the support and confidence criteria.
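The pruning step of this loop could be sketched as follows. This is an illustrative Python fragment under stated assumptions: single-item "large item sets" are found by a simple frequency count, and the re-mining call itself is out of scope.

```python
# Sketch of one pruning iteration: find frequent items (1-item LIS) and
# temporarily strip them from every transaction basket before re-mining.
from collections import Counter

def frequent_items(baskets, min_support):
    """Single items whose support meets the threshold."""
    counts = Counter(item for b in baskets for item in b)
    n = len(baskets)
    return {item for item, c in counts.items() if c / n >= min_support}

def remove_items(baskets, items):
    """Temporarily remove the given items from every basket."""
    return [b - items for b in baskets]

baskets = [{"loan", "card"}, {"loan", "fund"}, {"loan"}, {"card", "fund"}]
lis = frequent_items(baskets, 0.75)   # "loan" appears in 3 of 4 baskets
pruned = remove_items(baskets, lis)   # re-mine associations on these baskets
```

In practice the removed items are noted carefully so they can be added back one by one when the rules are rebuilt.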

5.3.4.4 Profile Rules and Large Item Sets

This step is repeated for all the clusters that were selected for contrasting. The minimum set of rules in the two cases is compared to identify products not present in some of the clusters. The list of removed LIS is also compared to identify products not present in some clusters. These missing products are the cross-selling candidate opportunities. The acceptance of candidates as actionable opportunities is usually driven by the number of customers that bought the missing product. Too small a group of customers will have too little return to justify the promotional investment.

5.3.4.5 Rebuild Rules

Once you have determined the candidate opportunities, it is important to reconstruct the original and actual rules present in the data. Removing large item sets from the transaction data affects the statistics of the rules. To get accurate statistics, the LIS must be returned to the data. Adding the LIS back one by one and observing the change in the discovered associations will give useful insight into the rule "structure."

5.4 Data Mining Results

As mentioned before, the result of the demographic clustering process as described in our first case study has been used for this case study. Identifying opportunities for cross-selling is a two-step process:

1. From the customer segments created before, select those clusters containing valuable customers.

2. Perform the product association discovery on the selected cluster data.

5.4.1 Cluster Selection

We created Table 3 on page 77 using the result of the demographic clustering, enhanced by some data we selected through query analysis against the original cluster data.


The two clusters chosen for further study in this case study were clusters 3 and 6. From Table 3 you can see that cluster 6 represents a profitable customer segment, with 26% of revenue represented by 23% of customers. In contrast, cluster 3 represents only 7% of revenue for 23% of customers. The number of products used by cluster 6 customers (indicated by the product index) is greater than that for cluster 3. Query analysis reveals that the difference is on average two products. Furthermore, cluster 6 has a slightly longer tenure. These two clusters were chosen because of the sizeable opportunity and the small gap in purchase behavior between them.

Table 3. Demographic Clustering Results: Percentage

Cluster ID   Profit     Customer   Revenue    Product Index   Leverage (Profit/Cust)   Tenure
5            34.74%      8.82%     32.83%     1.77            3.94                     60.92
6            26.13%     23.47%     28.36%     1.41            1.11                     57.87
7            21.25%     10.71%     20.10%     1.64            1.98                     63.52
3             6.62%     23.32%      5.98%     0.73            0.28                     47.23
0             4.78%      3.43%      6.78%     1.45            1.40                     31.34
2             4.40%      2.51%      3.00%     1.46            1.75                     61.38
4             1.41%      2.96%      2.46%     0.99            0.48                     20.10
8             0.45%     14.14%      0.47%     0.36            0.03                     30.01
1             0.22%     10.64%      0.01%     0.00            0.02                      4.66
Total       100.00%    100.00%    100.00%

5.4.2 Association Rule Discovery

We initially performed product association discovery on the selected cluster data, using the Intelligent Miner parameter settings illustrated in Figure 30 on page 78.


Figure 30. Parameter Settings for Associations

5.4.2.1 Association Results for Cluster 6

Figure 31 shows that associations for the entire Good Customer Set returned many (2,218) rules.

Figure 31. Associations on Good Customer Set

Figure 32 on page 79 shows that the frequent item sets include loan (94%), mortgage (90%), and credit card (79%).


Figure 32. Associations on Good Customer Set Detail

Figure 33 shows the associations when loan, mortgage, and credit card are removed. Note that the number of rules has been reduced to 286.

Figure 33. Associations for Good Customer Set: LIS Removed

There are many multiple-item (more than four) rules in the Good Customer Set (see Figure 34 on page 80).


Figure 34. Associations for Good Customer Set: LIS Removed, Detail

5.4.2.2 Okay Customer Set

Figure 35 shows the associations for the entire Okay Customer Set. Many (212) rules have been generated.

Figure 35. Associations on Okay Customer Set

Figure 36 on page 81 shows that the frequent item sets include loan (70%), mortgage (70%), and credit card (24%). The substantially lower frequency of credit card activity in cluster 3 represents a cross-selling opportunity.


Figure 36. Associations on Okay Customer Set Detail

Figure 37 shows the associations when loan and mortgage are removed. Note that the number of rules has been reduced to 48.

Figure 37. Associations for Okay Customer Set: LIS Removed

No rules in the Okay Customer Set contain more than four items. Further detailed comparison of the association rules will reveal other cross-selling opportunities. The largest cross-selling opportunities are revealed by differences in the large item sets.

5.4.2.3 Association Rules Discovery: Product Detail Level

So far, all of the associations have been processed on product categories that summarize products. A comparison of Figure 38 on page 82 (and Figure 39 on page 82) with Figure 33 on page 79 (and Figure 34 on page 80) shows what happens when associations are run on a more detailed level. With low-level products instead of product categories specified, associations exploded from 286 to 1,521 for the Good Customer Set.

5.4.2.4 Good Customer Set


Figure 38. Associations for Good Customer Set: LIS Removed, Summary

Figure 39. Associations for Good Customer Set: LIS Removed, Detail

Figure 40 and Figure 41 on page 83 show the results of associations when, in addition to all types of loans, mortgages, and credit cards, we also removed frequent account types from the Good Customer Set product sets. Note that the number of rules has been reduced to 55.

Figure 40. Associations for Good Customer Set: LIS and Certain Products Removed, Summary


Figure 41. Associations for Good Customer Set: LIS and Certain Products Removed, Detail

5.4.2.5 Okay Customer Set

Figure 42 and Figure 43 show the results of associations when transactions containing large item sets are removed from the Okay Customer Set. The number of rules has been reduced from 513 to 15.

Figure 42. Associations for Okay Customer Set: LIS and Certain Products Removed, Summary

Figure 43. Associations for Okay Customer Set: LIS and Certain Products Removed, Detail

The difference in large item sets removed for cluster 6 and cluster 3 reveals the opportunity to cross-sell web banking to cluster 3. Furthermore, a lower frequency of occurrence of term deposits in cluster 3 reveals another substantial opportunity. Further detailed comparison of the differences in the rules generated in cluster 6 and cluster 3 will reveal additional cross-selling opportunities.

5.4.2.6 Association Rule Discovery for the Entire Universe

Figure 44 and Figure 45 show the association rule results discovered from mining against all customer transaction records without using segmentation. Product categories were used in this example.

Figure 44. Associations for All Transactions: LIS Removed, Summary

Figure 45. Associations for All Transactions: LIS Removed, Detail

The number of rules in this case is 480, compared to 286 from cluster 6. The increased number of rules results in a more complex analysis. Furthermore, the lack of segment objectives makes it difficult to know what to search for.


5.5 Business Implementation and Next Steps

Several cross-selling opportunities were identified. At the product category level, cross-selling CREDIT CARD to cluster 3 was the best opportunity. The strategic initiative for cluster 3 derived in the segmentation case study was to identify cross-selling opportunities. This CREDIT CARD opportunity is thus consistent with the segment objectives.

With previous methods and common business experience, the Bank recognized that cross-selling CREDIT CARD to its customer base was an objective. The method presented in this case study provides the additional benefit of targeting CREDIT CARD to customer segments that have high shareholder value.

Using more detailed product definitions revealed several specific product cross-selling opportunities. These included cross-selling term deposits and web banking to cluster 3.

Before the cross-selling opportunity can be implemented, several activities must be completed. Some demographic profiling or clustering of the target universe is required to assist the marketer and advertiser in creating the appropriate marketing message and selecting the appropriate marketing channel. It is also not efficient to target the entire group for this cross-selling campaign. Other factors must be considered to target those customers most likely to use a credit card. Building a predictive model to target those customers most likely to use or require a credit card, as well as targeting those customers most likely to pass a credit check, would further reduce the mailing cost in executing this campaign.

The creation of a predictive model to target these customers is the topic of the next case study.


Chapter 6. Target Marketing Model to Support a Cross-Selling Campaign

In this case study, we build three predictive models to target those customers likely to buy the product identified as a cross-selling opportunity in the previous case study. Several algorithms from Intelligent Miner are used. The models built with the Intelligent Miner (decision tree, radial basis function (RBF) regression, and neural network) are compared.

6.1 Executive Summary

In Chapter 5, "Cross-Selling Opportunity Identification" on page 69 we describe how we identified a new business opportunity: cross-selling credit cards to existing customers to make them more profitable. The focus was on customers in the Okay Customer Set. Our strategy was to market the Bank's credit card to these customers. By getting some of the customers in the Okay Customer Set to use a credit card, we would migrate them to the Good Customer Set and increase their profitability.

A simple approach would have been to conduct a direct mail campaign for all customers in the Okay Customer Set, but our limited marketing budget did not permit that. Furthermore, an additional goal was to reduce the cost per customer acquisition and thus increase the campaign ROI, while maximizing the number of customers cross-sold.

Using the customers from the Okay Customer Set and those of the Good Customer Set, we created a data set that we could use to predict which customers in the Okay Customer Set had a propensity to use a credit card. We built three predictive models, using the three prediction techniques available in Intelligent Miner:

• Decision tree

• Value prediction with RBF

• A neural network

A process for predictive modeling was presented to the analysts, and the results from each algorithm were compared.

The neural network model had the best performance. By mailing only 40% of the total Okay Customer Set, we managed to include 76% of the customers with the highest propensity to use a credit card. Furthermore, we expected to get an ROI of 113%, in contrast to the 60% ROI we could have expected by mailing the entire Okay Customer Set. The higher ROI was achieved by reducing the cost per customer acquisition from $167 to $88.

Table 4 on page 88 summarizes the financial details.

Copyright IBM Corp. 1999

Table 4. Cross-Selling: Summary - Predictive Modeling More Than Doubles ROI

                                                    Without Prediction   With Prediction
                                                    Model                Model
Mailed customers                                    25,000               10,000
Cost of mailing material and mailing per customer   $5                   $5
Total cost                                          $125,000             $50,000
New acquisitions                                    750                  570
Cost per acquisition                                $167                 $88
Average profit/year per customer                    $100                 $100
Total profit per year                               $75,000              $57,000
Return on investment                                60%                  113%

6.2 Business Requirements

The general objective in this case study was to improve company revenue and profitability by attracting more customers to the credit card. We specifically wanted to cross-sell customers from the Okay Customer Set and, in successfully doing so, increase their profitability and move them to the more profitable Good Customer Set. The average profit from a customer who uses a credit card is $100 per year.

We used a direct mailing campaign to target the best prospects for the credit card from the Okay Customer Set.

The first task in designing the campaign was to establish a baseline against which to measure the success of the planned mailing campaign. In other words, we had to calculate the ROI that would be expected from such a mailing campaign without any data mining. We calculated the ROI by looking at the historical trends in the movement of customers from the Okay Customer Set to the Good Customer Set. Table 5 summarizes the calculations.

The Okay Customer Set and Good Customer Set total approximately 25,000 customers. The expected response to the credit card offer is about 3%, based on the observed movement of customers from the Okay Customer Set to the Good

Table 5. Cross-Selling: Baseline ROI Calculation

Total number of customers                  25,000
Cost of mailing and creation (per piece)   $5
Total cost of mailing                      $125,000
Expected take-up rate                      734/25,000 = 3% (see Figure 48 on page 94)
Expected new acquisitions                  750 (3% of 25,000)
Cost per acquisition                       $125,000/750 = $167
Average profit per customer                $100
Total profit per year                      $100 * 750 = $75,000
Return on investment                       $75,000/$125,000 = 60%
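The baseline arithmetic above can be sketched in a few lines. The figures come from the text; this is an illustrative calculation, not part of any Intelligent Miner run.

```python
# Sketch of the baseline ROI calculation (figures from the case study).
customers = 25_000
cost_per_piece = 5           # $ mailing and creation per customer
take_up = 0.03               # expected response rate
profit_per_customer = 100    # $ average profit per year

total_cost = customers * cost_per_piece            # $125,000
acquisitions = int(customers * take_up)            # 750
cost_per_acquisition = total_cost / acquisitions   # about $167
total_profit = acquisitions * profit_per_customer  # $75,000
roi = total_profit / total_cost                    # 0.60, i.e. 60%
```

Reworking the same arithmetic with the with-prediction figures (10,000 mailed, 570 acquisitions) shows how the smaller, better-targeted mailing lifts the ROI.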


Customer Set (see Figure 48 on page 94 for details). The baseline ROI for a mass marketing campaign is 60%. Thus it was not feasible to use mass marketing methods to implement the direct mailing campaign. The specific campaign goals were to achieve a positive return while moving as many customers as possible from the Okay Customer Set to the Good Customer Set.

6.3 Data Mining Process

Our general approach was to build predictive models to help identify those customers in the Okay Customer Set who were the best prospects for using the credit card. In fact, we used several different predictive techniques in Intelligent Miner, both as a means of gaining more insight into the target customer set and as a cross-validation of the different mining algorithms. Figure 46 on page 90 illustrates the overall approach.


Figure 46. Data Mining Process: Cross-Selling

6.3.1 Create Objective Variable

The first step in the predictive modeling process is to determine the objective variable to be modeled. When building models for targeting direct mail campaigns, the objective variable is usually based on the customer's historical response data to similar campaigns. In this particular case, we did not have a historical campaign similar enough to the proposed campaign to use to create a response variable. In practice this situation occurs frequently. An alternative to building a response model is to build a propensity model. A propensity model


predicts which customers who do not currently purchase the product being cross-sold have a higher likelihood or propensity to purchase the product.

The first and critical step in creating the target or objective variable is to select the time period under consideration. Setting the objective variable correctly is critical. The size of the time window selected is driven by the time horizon of the desired prediction. For instance, if the marketing campaign to be executed has a six-month window for the customer to respond, the objective variable should be defined with a six-month window.

In this case, the marketing objective was to cross-sell a particular product to a group of customers targeted for a direct mail campaign that would have a six-month window of opportunity for the customer to respond. We thus considered customers who used the credit card product in question in the most recent six-month period. In fact, we only considered customers who had activated in the most recent six months; that is, they had never used the Bank's credit card more than six months ago. This last statement is extremely important. To create a predictive model we must be able to predict the future behavior of a customer before that customer exhibits the behavior. If we are to predict the propensity of a customer to use the Bank's credit card in the next six months, we must do so using past data for the customer, that is, data from a period before the customer used the credit card.

We assigned the objective variable a value of 1 if a customer had no credit card activity before the third quarter of 1997 but had activity in the third or fourth quarter of 1997. All other customer records were assigned a value of 0 for the objective variable. To predict those customers who activated in the third and fourth quarters of 1997, we used the customer transaction records for only the first two quarters of 1997 (see Figure 47).
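The labeling rule above could be sketched as follows. This is an illustrative Python fragment; the per-quarter activity flags are assumed field names, not the Bank's actual schema.

```python
# Sketch of the objective-variable rule: 1 if the customer first used the
# credit card in Q3/Q4 1997, 0 otherwise. Arguments are assumed boolean
# flags for credit card activity in each quarter of 1997.
def objective(card_q1, card_q2, card_q3, card_q4):
    no_prior_use = not (card_q1 or card_q2)   # no activity before Q3 1997
    activated = card_q3 or card_q4            # activity in Q3 or Q4 1997
    return 1 if (no_prior_use and activated) else 0
```

Only the Q1/Q2 transaction data would then be used as model input, so the prediction relies strictly on behavior observed before activation.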

The final consideration in creating data for prediction is to ensure that no data related to the objective variable is used for prediction. For instance, if you are predicting the credit card profitability of customers, do not use credit card data, because it is one and the same. Profitable credit card customers will have more activity on their cards, so using credit card activity to predict credit card profitability is a self-fulfilling prophecy.

Figure 47. Creating an Objective Variable

We also selected customers only from the Okay Customer Set and the Good Customer Set instead of sampling from the entire customer universe. By focusing on these customer clusters, both of which were profitable, we


automatically eliminated customers who were low profit from the direct mail campaign. Note that some of the customers in the Okay Customer Set and Good Customer Set had no activity in the last 12 months with the credit card we were cross-selling. It is important to have a mix of target groups and non-target groups in order to develop a model that can distinguish between the two extremes.

We specifically chose three different types of data to use for the predictive modeling problem:

1. Customer transaction data

Transaction data includes revenue, the number of transactions, and recency of transactions from each Bank product category by time period (in this case by quarter). The Bank's data warehouse categorizes products into 14 groups as explained in 4.3.1, "Data Selection" on page 36. We therefore created a total of 84 variables (3 variables * 14 categories * 2 quarters).

2. Customer demographic and summary data

This data consists of demographic data including customer age, gender, income, and household size. The Bank's data warehouse also contains summarized transaction data including total revenue lifetime to date (LTD), number of products used LTD, total transactions LTD, recency, and first transaction date.

3. Third party census and tax data

In Canada the government permits the reselling of census and tax data. This data is aggregated to the enumeration area, which contains 300 to 400 households. A variety of data is available, including profession, ethnicity, education, income, and income by source.

Promotion history data is also typically used when building response models for direct mail campaigns. In this case, promotion history was not available for the particular product we were targeting, so the model built is not a response model. (The credit card we identified as a cross-selling opportunity was a card with new features that had not been marketed via direct mail previously.) It is a propensity model, which predicts the customer's propensity to use a credit card.

6.3.2 Data Preparation

We performed two types of data preparation: data cleaning and data transformation.

6.3.2.1 Data Cleaning

The data cleaning required for predictive modeling is similar to data cleaning for clustering as discussed in 4.3.2, "Data Preparation" on page 38. The only difference is that more care is required in assigning values to missing records. In choosing a value to assign, the resulting distribution of the variable in question should not be drastically altered. Ideally you want to assign values that do not change the characteristics of the distribution (for example, the min, max, and mean). If it is not possible to assign values without dramatically altering a variable's distribution, discard that variable to avoid spurious correlations.

We assigned all transaction variables that had missing values a value of zero. Such an assignment is appropriate, as an absence of transaction activity (null in the database) implies zero activity. We discarded demographic data with missing values if the missing portion was significant. Also, we created binary variables indicating the missing portion of all categorical variables.
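The two cleaning rules above (zero-fill missing transaction values, add a binary missing-value indicator for categorical fields) could be sketched as follows. The record layout and field names are assumptions for illustration.

```python
# Sketch of the cleaning rules: null transaction activity becomes zero,
# and a binary flag records missing categorical data.
def clean(record):
    out = dict(record)
    # Absent transaction activity (null in the database) implies zero activity.
    out["revenue"] = record.get("revenue") or 0.0
    # Binary indicator variable marking a missing categorical value.
    out["gender_missing"] = 1 if record.get("gender") is None else 0
    return out

r = clean({"revenue": None, "gender": None})
```

The indicator variable preserves the fact that the value was missing, which can itself carry predictive information.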

92 Intell igent Miner Applications Guide

Page 107: Intelligent Miner for Data Applications Guidejliusun.bradley.edu/~jiangbo/Redbooks/sg245252IMGuide.pdf · to use and how to effectively exploit them. The business utilized as a case

6.3.2.2 Data Transformation

After we cleaned the data, handled all missing and invalid values, and made the known valid values consistent, we transformed the data to maximize the information content that can be retrieved.

For statistical analysis the data transformation phase is critical, as some statistical methodologies require that the data be linearly related to an objective variable, normally distributed, and free of outliers. Artificial intelligence and machine learning methods do not strictly require the data to be normal or linearized, and some methods, like the decision tree, do not even require outliers to be dealt with. This is a major difference between statistical analysis and data mining. The machine learning algorithms can automatically deal with nonlinearity and nonnormal distributions, although the algorithms work better in many cases if these criteria are met. A good statistician with a lot of time can manually linearize, standardize, and remove outliers better than the artificial intelligence and machine learning methods can. The challenge is that with millions of data records and thousands of variables, it is not feasible to do this work manually. Also, most analysts are not qualified statisticians, so using automated methods is the only reasonable solution.

After cleaning the original data variables, we created new variables using ratios, differences, and business intuition. We created total transaction variables, which were the sum of the transaction variables over two quarters. We used these totals as normalizing constants to create ratio variables. We created time-series variables to capture the time difference in all transaction variables between quarters.
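The three derived-variable families above (totals, ratios, quarter-over-quarter differences) can be sketched as simple arithmetic on one record. The field names are hypothetical stand-ins for the transaction variables in the case study.

```python
# Sketch of the derived variables described above: a two-quarter total,
# a ratio normalized by that total, and a time-series difference.
# Field names are hypothetical.
rec = {"txns_q1": 40, "txns_q2": 60, "rev_q1": 200.0, "rev_q2": 300.0}

rec["txns_total"] = rec["txns_q1"] + rec["txns_q2"]        # total over two quarters
rec["txns_q2_ratio"] = rec["txns_q2"] / rec["txns_total"]  # normalized ratio
rec["txns_diff"] = rec["txns_q2"] - rec["txns_q1"]         # quarter-over-quarter change

print(rec["txns_total"], rec["txns_q2_ratio"], rec["txns_diff"])  # 100 0.6 20
```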

The data set for predictive modeling is almost identical to that created for a clustering model, except that more care is taken in the data cleaning and data transformation processes. In addition, it is important to remove all collinearities from the input variables before you execute any algorithms. Collinear variables cause most data mining algorithms difficulty and worsen model performance. Collinearities can be removed by using either:

• Correlation analysis

• Principal component analysis

• Regression

Removing collinearities is especially important when you use RBF and the neural network. One of the assumptions made in the back-propagation algorithm is that the input variables are linearly independent. If this is not true, it may take a long time to train the neural network, and the results may be poor, depending on how correlated the inputs are.
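The first of the three techniques, correlation analysis, can be sketched as follows: compute pairwise Pearson correlations and greedily keep a variable only if it is not highly correlated with one already kept. The 0.95 threshold and the variable names are our own assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical candidate inputs; txns and revenue move together.
variables = {
    "txns_q1": [10, 20, 30, 40, 50],
    "rev_q1":  [101, 199, 305, 402, 498],  # nearly collinear with txns_q1
    "tenure":  [5, 1, 9, 2, 7],
}

# Keep a variable only if |r| <= 0.95 against everything already kept.
kept = []
for name, values in variables.items():
    if all(abs(pearson(values, variables[k])) <= 0.95 for k in kept):
        kept.append(name)

print(kept)  # ['txns_q1', 'tenure']
```

Principal component analysis goes further by producing fully independent factors, which is why PCA inputs suit RBF and the neural network particularly well.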

6.3.3 Data Sampling for Training and Test

Finally, we took a sample of the data for training and testing the Intelligent Miner prediction algorithm (see Figure 48 on page 94).

Chapter 6. Target Marketing Model to Support a Cross-Selling Campaign 93


Figure 48. Cross Selling: Data Sampling (5252f405/50)

It's important to create both a training data set, which is used to build the model, and a test or hold-back data set, which is used to test the model. A model should be tested against data that it has never seen before to ensure that there is no overfitting. In this case we were trying to build a model that would predict a binary outcome (that is, the customer's propensity to buy a particular product). In the customer universe of the Okay Customer Set and the Good Customer Set, we sampled approximately 23000 records. The distribution of a positive event (that is, the customer used a credit card in the second half of 1997 but not before) was 734 records out of the 23000. The minimum number of positive events required to build a predictive model is approximately 250. (We chose that number on the basis of our experience.) On the basis of this distribution, we randomly split the entire file into two equal-sized data sets as shown in Figure 48. One data set was to be used for testing and was left as is.

The training portion of the data set was further sampled to create a 50/50 distribution of the target variable. This is known as stratified sampling. When the distribution of the target to nontarget is less than 10%, stratified sampling tends to improve the modeling results. The stratified sample data set size is usually driven by the number of positive events, but when the number of records becomes small, as in this case, it is important to consider the sample size of the nontarget or negative events relative to the entire universe. To avoid sample bias, the sample of nontargets should not be too small. If sample bias is a concern, it is possible to distribute the target to nontarget events unevenly (for example, 20/80) or to duplicate records with positive events to permit a larger nontarget sample. To consider these effects, we recommend creating multiple training data sets with different target and nontarget distributions to ensure valid samples and to maximize model performance.
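The 50/50 stratified sample described above can be sketched as: keep every positive event and draw an equal-sized random sample of negatives. The universe below is synthetic (about 3% positives, mimicking the rare-event distribution discussed), not the case-study data.

```python
import random

random.seed(7)

# Hypothetical scored universe: ~3% positive target events.
universe = [{"id": i, "target": 1 if random.random() < 0.03 else 0}
            for i in range(10000)]

positives = [r for r in universe if r["target"] == 1]
negatives = [r for r in universe if r["target"] == 0]

# 50/50 stratified training set: all positives plus an equal-sized
# random sample of negatives.
training = positives + random.sample(negatives, len(positives))
random.shuffle(training)

rate = sum(r["target"] for r in training) / len(training)
print(rate)  # 0.5
```

A 20/80 split, as suggested for bias-sensitive cases, only changes the `random.sample` size to `4 * len(positives)`.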


In this case study we simply explored the 50/50 case for the sake of brevity, even though the nontarget sample is small. (The wavy results in the gains charts in Figure 51 on page 105 could be a symptom of these effects.)

6.3.4 Feature Selection

If the number of variables in the training and test data sets is very large (that is, greater than 300), it is useful to reduce the number of variables before building any models. Feature selection, the process of selecting from a larger set of variables the subset most correlated to the target variable, is an entire discipline in itself. Here we simply mention some of the methods for selecting variables:

• Linear and nonlinear correlation

• Principal components analysis

• Factor analysis

• Regression

• Decision trees

Most problems that we have worked on had over 1000 variables at the outset, and feature selection was a part of the predictive modeling process.

6.3.5 Train and Test

We used several methods to build the predictive model after preparing the data:

• Classification using a decision tree

• Value prediction with RBF regression

• Classification with a back-propagating neural network

Figure 49 on page 96 outlines the detailed steps in running a predictive modeling algorithm.


Figure 49. Detailed Predictive Modeling Process

6.3.5.1 Algorithm Selection

The first algorithm that is usually used to build a model is the decision tree. There are two reasons for using a decision tree:

1. The tree is very good at finding anomalies in the data. The first half dozen runs typically fine-tune the data preparation. The decision tree discovers missed details.

2. The tree can also be used as a data reduction tool. It typically reduces the number of variables by one order of magnitude if a few hundred variables are input into the algorithm. The tree algorithm is very scalable, and performance is not hampered by several hundred variables. Selecting the variables that are in the tree model as input to the value prediction and neural classification algorithms improves their accuracy and performance.

The tree is used to create a reduced set of variables. The top 10 to 20 variables are selected from the tree according to the position and number of occurrences in the tree (that is, the higher up in the tree a variable occurs and the more times it occurs, the more significant it is). Value prediction with RBF requires a reduced set of variables because the algorithm does a clustering pass before building the predictive model. The more variables present, the more difficult it is to get good clusters and the worse the results. If you use linearly independent input variables, such as those created by principal components analysis, RBF can handle many more variables. Creating training and test data sets with principal components analysis factors can improve the accuracy of both RBF and the neural network.

The neural network also requires a reduced set of input variables. The major concern in using the neural network is the algorithm run time. This is the reason for selecting a reduced variable set.
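The position-and-occurrence heuristic for picking the top tree variables can be sketched as a scoring pass over the fitted tree's splits. The `1 / (depth + 1)` weighting is our own assumption for illustration; the guide only states that higher and more frequent occurrences mean greater significance. The split list is hypothetical.

```python
# Sketch of the variable-ranking heuristic described above. Each split
# is (variable, depth), with the root split at depth 0.
splits = [  # hypothetical fitted tree
    ("total_txns_q1", 0),
    ("total_revenue", 1),
    ("savings_rev_q2", 1),
    ("total_revenue", 2),
    ("tenure", 3),
]

# Score each variable: occurrences weighted by how high up they appear.
scores = {}
for var, depth in splits:
    scores[var] = scores.get(var, 0.0) + 1.0 / (depth + 1)

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['total_txns_q1', 'total_revenue', 'savings_rev_q2', 'tenure']
```

The top 10 to 20 names from such a ranking would then feed the RBF and neural network runs.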

6.3.5.2 Parameter Selection

After completing the data preparation and selecting the data set to be mined (in this case the training data set first), you have to select the algorithm parameters. Set the basic parameters first and, if you are an advanced user, you can set advanced settings to be different from the defaults.

Decision Tree — For the decision tree the parameters available for selection are:

• Maximum tree depth

The maximum tree depth sets the maximum number of levels to which the tree can grow. We typically leave this at no limit. When no limit is chosen, the algorithm fits the data and then prunes back the tree, using minimum description length (MDL) pruning. If you want to prevent overfitting or limit the complexity of a tree, set this limit.

• Maximum purity per internal node

The maximum purity per internal node sets a limit for the purity beyond which the tree will no longer split the data. We typically leave this at 100%, which allows the tree to fit the data before pruning. If you are concerned about overfitting, choose a lower value.

• Minimum number of records per internal node

The minimum number of records per internal node sets a minimum number of records required per node. We typically set this parameter to 50. If a node contains at least 50 records, the resulting rule is likely to be statistically significant.

Value Prediction with RBF — For the RBF algorithm the parameters available for selection are:

• In-sample size

In-sample size is the number of consecutive records assigned to the training data set before out-sample records are assigned to the cross-validation data set. The ratio of in-sample size to out-sample size is the same as the ratio of the training to cross-validation data sets. A cross-validation data set is used to test the accuracy of the model between successive passes through the data and model iterations. Cross-validation is used to choose the best model and to minimize the likelihood of overfitting. Although this algorithm has cross-validation, we strongly recommend that the model be tested against the hold-back test data set. The in-sample to out-sample ratio is driven by the number of positive target events. You would like to have at least 250 positive target events in the training or in-sample data set. If this criterion is met, we usually use an 80/20 split, where the in-sample data set is the larger data set.

• Out-sample size

Out-sample size is the number of consecutive records assigned to the cross-validation data set.

• Maximum number of passes

Maximum number of passes is the maximum number of passes the algorithm makes through the data. This is a stopping criterion for the algorithm (that is, if the algorithm has not achieved its accuracy criterion, it will continue to run until it has made the maximum number of passes through the data). We usually start with 25 passes. If the algorithm uses fewer, the value chosen was good. If the algorithm stops at 25 passes, we recommend doubling the number of passes until the accuracy result is achieved before the maximum passes, or until it seems that the accuracy criterion will not be achieved no matter how high the number of passes.

• Maximum number of centers

Maximum number of centers is the maximum number of Gaussian regions that will be built by the model. If this value is set to zero, the algorithm chooses the number of centers to maximize the accuracy.

• Minimum region size

Minimum region size is the minimum number of records that the clustering portion of the algorithm will assign to one Gaussian region. Any Gaussian region with fewer than this number of records will be deleted after each pass through the data. We use approximately 50 records so that the Gaussians are assigned to regions that are statistically significant. If there are not sufficient data records to set the minimum region size to 50, choose a minimum region size that yields at least 5 to 10 regions in the output.

• Minimum number of passes

Minimum number of passes is the minimum number of passes the algorithm will take through the data. During these initial passes, the algorithm does not do cross-validation.

Neural Network — For the neural network algorithm the parameters are:

• In-sample size

The in-sample and out-sample size parameters are used to split the input data set into a training data set and cross-validation data set exactly as described above for value prediction with RBF regression. The neural network uses the cross-validation data set to choose a network architecture as well as to find the weights that minimize the model root mean square error. Again it is important to test the model against a hold-back test data set.

• Out-sample size

Out-sample size is the number of consecutive records assigned to the cross-validation data set.

• Maximum number of passes

Maximum number of passes is a stopping criterion for the algorithm. If the accuracy and error criteria are not achieved, the algorithm will stop after taking the maximum number of passes through the data. We use 500 passes as a starting point and test the effect of increasing the number of passes on the accuracy.

• Accuracy

Accuracy is a stopping criterion for the algorithm. It is the percentage of records that the algorithm classified correctly. The accuracy is tested against the out-sample or cross-validation data set.

• Error rate

Error rate is a stopping criterion for the algorithm. It is the percentage of records that the algorithm classified incorrectly. This is different from the accuracy rate, because an unknown class is assigned if the network cannot make a decision. In the predictive modeling case, where you are interested in simply rank ordering the records, which is different from classification, the accuracy and error rate of classification are not necessarily important. The network may have poor accuracy yet still rank order the records correctly. The network outputs a confidence value, the actual output of the neural network, which can be used to rank order the records.

• Network architecture

Using the manual architecture option of IM, it is possible to assign the number of nodes per hidden layer. The neural network can have up to three hidden layers. The number of nodes in each layer can be selected by specifying the number in the hidden layer 1, hidden layer 2, and hidden layer 3 parameters. Selecting the default setting, automatic architecture determination, causes the algorithm to iterate over several architectures and choose the best one based on preliminary cross-validation results. Unless you have some reason to specify an architecture, we recommend using automated architecture selection. Sometimes the algorithm creates a neural network with no hidden layers. In this case you may want to force some hidden layers to compare results.

• Learning rate

Learning rate can be used to control the rate of gradient descent. The parameter can range from 0 to 1. Too high a value causes the network to diverge, and too low a value causes the neural network to train very slowly. The academic literature recommends a value of 0.1, which seems to work best in most cases. This is the default setting. If the algorithm is converging too slowly, you might gradually increase the value of this parameter.

• Momentum

Momentum can be used to control the rate of convergence. It controls the direction of gradient descent: it is the fraction of the previous direction that is maintained in the current descent step. The parameter can range from 0 to 1. Too high a value causes the algorithm to converge very slowly or not at all, as the descent direction is not sufficiently changed. Too low a value causes very slow convergence, as the descent direction changes too much, causing it to "zig-zag" across the error surface. The academic literature recommends a value of 0.9, which is the default value.
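The interaction of the learning rate and momentum parameters can be sketched on a toy error surface. This is not the IM implementation, just the standard momentum update on a one-dimensional quadratic, using the default values 0.1 and 0.9 mentioned above.

```python
# Sketch of gradient descent with momentum on the error surface
# E(w) = w**2, whose minimum is at w = 0.
def grad(w):
    return 2.0 * w  # dE/dw for E(w) = w**2

lr, momentum = 0.1, 0.9  # the defaults discussed above
w, step = 5.0, 0.0

for _ in range(300):
    # Keep a fraction (momentum) of the previous descent direction.
    step = momentum * step - lr * grad(w)
    w += step

print(abs(w) < 1e-3)  # True: the weight has converged near the minimum
```

Setting `momentum` close to 1 in this sketch makes the iterate overshoot and spiral in slowly; setting it to 0 recovers plain gradient descent.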

6.3.5.3 Input Field Selection

In this case we selected transaction and demographic data that we had created as input to the tree algorithm. Refer to 6.3.1, "Data Definition" on page 146 for a detailed description. We also included some of the clustering algorithm data in the tree to test their significance in predicting propensity to use a credit card.


Decision Tree — To rank order the records in order of the customer propensity to buy a particular product, you must set the objective variable type to discrete numeric or continuous.

Value Prediction with RBF — The RBF algorithm requires that the objective variable be continuous. RBF also allows the use of supplementary variables, which are profiled by model region but not used to build the model. This useful feature of RBF enables you to immediately profile the model scores. We selected the top decision tree variables as input to the RBF algorithm.

Neural Network — The neural network requires that the objective variable be categorical. We selected the top decision tree variables as input to the neural network algorithm.

6.3.5.4 Output Field Selection

To build a gains chart, the minimum output requirements for each algorithm are:

• Customer ID

• Objective variable

• Algorithm prediction

If you want to use the output of one algorithm, including its prediction, as the input to another algorithm, you should output the entire original data set. Having the scores of multiple algorithms in the same file is useful for comparisons. For instance, if the tree places certain records in the top decile, and the RBF algorithm assigns them to the middle decile, it may be possible to correct the models by creating new variables or altering the training data set to compensate for this disagreement.

6.3.5.5 Results Visualization

We used three different mining algorithms in this case study. The following gives an idea of how the results may look for each of the algorithms. We also show how the most common problems for each algorithm appear within the results visualization.

Decision Tree — The tree algorithm outputs a summary screen showing the mean and root mean square error. From this screen it is possible to view both the unpruned and pruned trees. As discussed earlier, the tree will find all data anomalies. Symptoms of anomalous trees are:

• One leaf only

This symptom is caused by using a variable in the input data that is perfectly correlated with the objective variable. This typically occurs with variables such as dates, customer IDs, or other fields that are unique to each customer. These fields produce a one-to-one mapping to the objective.

• Highly unbalanced tree with only one leaf to one side of the root node

This symptom is also caused by a variable that is highly correlated with the objective variable.

• Very shallow tree when many input variables are used

This symptom can also be caused by variables that are highly correlated with the objective.

Reasonable tree visualizations should produce balanced trees with a reasonable number of levels, depending on the number of input variables. The purity of the leaf nodes should range from highly pure with either value of the target to leaf nodes with mixed distributions of the target values.

Value Prediction with RBF — The RBF algorithm outputs a visualization similar to that of the cluster viewer. The main difference between the two visualizations is that RBF presents the records by region and not cluster. A record is assigned to the region to which it has the highest probability of belonging. The visualization shows the average model score by region and the root mean square error by region as well.

Anomalous results are indicated by these symptoms:

• Visualizations with only one or two regions

This symptom is usually indicative of very strong predictor variables that mask the effect of other inputs. In this situation look at the decision tree results to determine whether there are segments at the top of the tree with the same variables that are most important to the regions. To correct this situation, remove the strong variables from the chosen input fields and split the data into multiple files based on the segmentation by the strong variables as indicated by the tree. It is then possible to run RBF against each of the separate files and, after scoring, simply append the results into one file.

• A low ratio of the average score in the top region to the average score in the bottom region

This symptom is caused by either too many input variables or simply data that is poorly suited to the prediction problem.

Good results are indicated by:

• A high ratio (>2-3) between the average prediction scores of the top and bottom regions

• Several (>5) regions present

Neural Network — The algorithm outputs a confusion matrix that shows the classification accuracy of the network. The algorithm adds an unknown class to the possible predicted class set. Anomalous output is indicated by too many records being classified as unknown. If too many records are unknown, try increasing the number of passes. The algorithm also outputs a sensitivity matrix that assigns each input variable a percentage. The percentage indicates how sensitive the output is to changes in that variable. Anomalous results may occur if one or a few variables contribute a very large fraction of the sensitivity. These variables may indicate the presence of segments in the data. If this occurs, observe the decision tree results and see whether the high sensitivity variables occur at the top of the tree. If they do, split the data into segments as indicated by the tree, and train a neural network for each segment. Once each segment is scored, it is possible to append the results together for analysis.

6.3.5.6 Results Analysis/Refinement

The results of a predictive model that rank orders records are typically displayed as a gains chart (see Figure 51 on page 105). A gains chart contrasts the performance of the model with the results achieved by random chance. Several iterations of the algorithm are executed, varying the parameters. Gains charts for each run should be compared and studied. In training mode the gains curves should be perfectly smooth, and the counts of the positive target event by decreasing decile should be monotonically decreasing and nonwavering.


6.3.5.7 Run Model against Test Data

To ensure that the model has not overfit the data and to assess the model performance against a data set that has the same characteristics as the application universe, the model should be executed against the test data in test mode. Test mode permits using an existing model to score the records. The test mode results should be approximately equal to the training results, except when stratified sampling is used. When stratified sampling is used, the test mode gains chart should be better for the test data set than for the training data set. The performance of the model prediction by descending decile should result in a monotonic decrease in the counts of positive target events. Any wavering in the top deciles of the model that are likely to be mailed should be studied. The cause of the wavering should be identified and corrected. If the model performs well against the test data set, it should perform similarly against the application universe, if both populations have the same statistics. (This point is discussed further in 6.3.7, “Perform Population Stability Tests on Application Universe” on page 103.)
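The monotonicity check described above can be automated: given the count of positive events in each descending score decile, flag every decile that holds more positives than the decile above it. The counts below are hypothetical.

```python
# Sketch of the test-mode check described above: positive-event counts
# by descending score decile should decrease monotonically; any rise
# ("wavering") in the top deciles needs investigation.

# Hypothetical positive-event counts; index 0 is decile 1 (highest scores).
counts = [180, 150, 110, 90, 95, 60, 40, 30, 20, 10]

wavering = [d + 2 for d in range(len(counts) - 1)
            if counts[d + 1] > counts[d]]  # deciles that rise vs. the one above

print(wavering)  # [5]: decile 5 holds more positives than decile 4
```

If the flagged deciles fall inside the planned mailing depth, the model should be refined before implementation.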

6.3.6 Select "Best Model"

After using gains charts to analyze the model results, you have to explain why the model is scoring as it is. Perform clustering on the input data, using the score decile (or other quantile) as a supplementary variable, and observe and characterize the clusters that appear. If the model is working properly, the clusters should separate the quantile field. The clusters can be used to explain the difference between records containing high scores and low scores. Compare the characterizations of the scores from each of the algorithms to determine whether the algorithms are observing the same effect or one algorithm is discovering something the others are not. Use the differences found to iteratively improve the results of each algorithm.

After having iteratively improved the models, you choose the "best model". Typically the best model has the highest performance as measured by the gains chart; that is, it rank orders the input records the best. Sometimes, however, you may choose a model that does not rank order the best, for several reasons:

• The model is easy to explain

Sometimes the best model contains variables that are not easily explained or are not related to the current business problem, and it may be difficult to justify its application. It is just as important to be able to explain why a model works as it is for the model to work well.

• The model agrees with the current business intuition

If the model reflects the current understanding of the factors that affect the business problem, more confidence can be assigned to the result. Furthermore, if some new learnings are present alongside the current understanding, more confidence can be assigned to the new learnings. If a model contains unusual factors that cannot be explained, the model should not be implemented.

• The model is simple to implement

A simple model with few variables, or one that requires little data processing, is preferable to more complex models. The implementation of complex models could result in errors as well as a high tendency to overfit.


6.3.7 Perform Population Stability Tests on Application Universe

After you have selected the "best" model, it is crucial to ensure that the application data set that the model will be implemented against is the same as the test data set that the model was tested against. The similarity can be determined by univariate and multivariate profiles of the data sets. A comparison of statistics from these profiles should show very little difference between the universes. If the statistics are very different, the model will probably not work properly. The statistics could be different for a few reasons:

• Sample bias

The test and training samples created were biased samples. If this is the case, the data should be re-created, and the modeling process repeated.

• Incorrect problem setup

The design of the test and training data differs from the design to which the model was intended to be applied. The process used to create the application data set should be identical to the process used to create the test data set, except of course for the difference in time periods.
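A univariate stability check of the kind described above can be sketched as a comparison of the mean and standard deviation of each input variable across the two universes. The 10% tolerance is our own assumption for illustration; the guide only says the profiles should show very little difference. All field values below are hypothetical.

```python
from statistics import mean, stdev

def unstable_fields(test_data, app_data, tol=0.10):
    """Flag fields whose mean or standard deviation shifts by more
    than tol (relative to the test universe) between the two data sets."""
    flagged = []
    for field in test_data:
        m1, m2 = mean(test_data[field]), mean(app_data[field])
        s1, s2 = stdev(test_data[field]), stdev(app_data[field])
        if abs(m1 - m2) > tol * abs(m1) or abs(s1 - s2) > tol * s1:
            flagged.append(field)
    return flagged

test_universe = {"revenue": [100, 120, 110, 130, 140],
                 "txns":    [10, 12, 11, 13, 14]}
app_universe  = {"revenue": [101, 119, 111, 129, 139],
                 "txns":    [20, 25, 22, 27, 28]}  # shifted distribution

print(unstable_fields(test_universe, app_universe))  # ['txns']
```

A flagged field means the model was tested on a population that no longer resembles the one it will score, so the data should be re-created before implementation.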

6.4 Data Mining Results

In this section we explain how to use the Intelligent Miner visualization tools to present the results of the mining algorithms and how to interpret those results.

6.4.1 Decision Tree

Figure 50 on page 104 shows the visualization results from the decision tree.


Figure 50. Decision Tree Results: Isolating the Key Decision Criteria

This result was achieved after several iterations during which some variables were removed. The variables that appeared in the tree rules included total revenue, total number of transactions in Q1, savings account revenue in Q2, the number of savings account transactions in Q2, best customer in 1996, and the second-choice cluster ID assigned by Intelligent Miner during the Customer Segmentation case study (see Chapter 4). All of these variables agreed with the current business understanding.

The gains chart for the training data produced a smooth curve as expected. The training results are typically uninteresting, as in most cases the models achieve good results against training data. A more important test is how well the model performs against the test or hold-back data set. A gains chart was created for the test data set (see Figure 51 on page 105).


Figure 51. Gains Chart for Decision Tree Results (s406/0.0)

Gains Chart — A gains chart is a graph created from the rank-ordered model scores. For algorithms that create a continuous score, such as RBF and the neural network, the score variable can be quantiled. The gains chart is then created by plotting the number of cumulative positive events by descending quantile versus the cumulative number of records by descending quantile. For an algorithm producing discontinuous scores, such as the decision tree, it is not possible to quantile the scores. The decision tree scores records by assigning the average leaf node score to all records in the leaf node. You can therefore build the gains chart by plotting the number of cumulative positive events by descending leaf node score versus the cumulative number of records by descending leaf node score. The line labeled random indicates that a random rank ordering of records results in an even number of positive events per quantile. This is the expected result for random ordering. If our model is rank ordering well, there should be more positive target events in the top quantiles, and the slope of the gains curve should be higher in the top quantiles than that of the random line. This higher slope results in a curve that is lifted above the random line.
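The construction above can be sketched directly: sort scored records by descending model score, accumulate the positive events, and compare the cumulative capture rate against the random (diagonal) line at a given depth. The scored pairs below are hypothetical.

```python
# Sketch of the gains-chart construction described above.
scored = [  # hypothetical (model_score, actual_target) pairs
    (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
    (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0),
]

# Rank order by descending model score.
scored.sort(key=lambda r: r[0], reverse=True)
total_pos = sum(t for _, t in scored)

# Cumulative fraction of positive events captured at each depth.
cum, gains = 0, []
for _, target in scored:
    cum += target
    gains.append(cum / total_pos)

# Lift at the top 30% of the universe vs. random targeting.
depth = 3
lift = gains[depth - 1] / (depth / len(scored))
print(lift)  # ~1.67: the top 30% of records holds 50% of the positives
```

Plotting `gains` against record depth gives the gains curve; the random line is simply `depth / len(scored)`.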

In most cases the business action taken using the output of a predictive model uses 10%-40% of the possible universe. It is therefore important to note the ratio of the gains curve to random at the implementation cutoff. In Figure 53 on page 109 we observe a lift of approximately 1.5 at 25% of the universe. This is a modest lift curve. In our experience most gains charts have a lift ratio ranging from 1.5 to 3.5. Too low a lift indicates that the data is not very predictive of what was being modeled. Too high a lift is also suspicious and may indicate sample bias or the use of input data that is too closely related to the target variable.

Another feature to note is the smoothness of the gains curve. If the curve is very smooth, the number of positive events by quantile is distributed monotonically. A monotonic distribution of positive events indicates that the model is rank ordering correctly. A wavy curve indicates that the positive events are not monotonically distributed. This implies that there is a secondary factor in the data that the model did not capture. If the waviness occurs in the top quantiles or in a range in which you intend to use the model, it should be corrected. If the waviness occurs in the bottom quantiles or out of range, you can ignore it.

In this case the tree gains chart had a modest positive lift of approximately 1.5 times random and was a smooth curve.

6.4.2 RBF
Figure 52 on page 107 presents the RBF visualization results. One immediate advantage of using RBF is apparent: the results of the RBF algorithm present a profile by model region, which can be used to characterize or explain why the model is working.

Figure 52. RBF Results

In Figure 52 on page 107, observe that the top region, with an average model score of 0.7778, is characterized by customers with higher than average revenue in Q2 and a larger positive revenue difference between Q2 and Q1, indicating growth in activity. The third region from the top is characterized by customers who were in the Best segment in 1996 and who have much higher than average withdrawal amounts from savings accounts. These characterizations are consistent with the current business understanding of customers likely to use a credit card. The gains chart for RBF is plotted in Figure 53 on page 109. The RBF model used against the training data set results in a gains curve similar to that of the decision tree, with a modest lift of 1.5 over random. The curve is, however, wavy in the top quantiles, which raises some concern and should be resolved before implementation.

6.4.3 Neural Network
The neural network algorithm was run against the top six variables selected from the decision tree. The following sensitivity analysis results were output:

Field Name             Sensitivity
Savings_Revenue_Q2      4.6
Savings_Txns_Q2         4.7
Loan_Revenue_Q1        12.0
Best96 Status           0.2
Total Revenue          22.2
Total_Txns_Q1          56.0

This result indicates that the output is most affected by changes in the total number of transactions in the first quarter of 1997, which accounts for 56% of the total change observed. Total revenue in the first half of 1997 accounts for 22.2% of the observed change. The large fraction of sensitivity accounted for by two of the variables raises some concern. These two variables are also at the very top of the decision tree. Better results may be achieved with the neural network if the training file is first segmented using the tree rules for these variables and then a neural network is trained on each segment.
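The segment-then-train refinement suggested above might look like the following sketch. The field name, the threshold, and the per-segment mean predictor standing in for a neural network are all illustrative assumptions, not the book's actual setup:

```python
# Split the training records with a top decision-tree rule, then fit one
# model per segment. A mean predictor stands in for the per-segment
# neural network to keep the sketch self-contained.
def segment_and_fit(records, split_field, threshold, target_field):
    segments = {"low": [], "high": []}
    for r in records:
        segments["high" if r[split_field] > threshold else "low"].append(r)
    return {
        key: sum(r[target_field] for r in rows) / len(rows)
        for key, rows in segments.items() if rows
    }

# Hypothetical training records split on the Total_Txns_Q1 rule
data = [
    {"Total_Txns_Q1": 50, "responded": 1},
    {"Total_Txns_Q1": 45, "responded": 1},
    {"Total_Txns_Q1": 5, "responded": 0},
    {"Total_Txns_Q1": 8, "responded": 0},
]
models = segment_and_fit(data, "Total_Txns_Q1", 20, "responded")
```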

Figure 53 on page 109 shows the gains chart for both training and test for this neural network model. The training results outperform both other models; however, the gains curve is much wavier below the top 20% of the list. In training, the lift of the model is approximately 2 times random. In test mode the inflection point of the lift curve moves to the left, which is expected when the model is built against a stratified training data set. The lift of the test curve is approximately 3 times random at 20% of the total population. The curve is very wavy at the top, however. This severe waviness indicates that the model has missed a major factor and should be resolved. The two overpowering variables in the model, as indicated by the sensitivity results above, could be masking other effects that would otherwise be present and could explain the waviness of the gains curve. Preliminary results indicate that the neural network model will work best in the end for this data set.

Figure 53. Cross-Selling: Comparison of Three Predictive Models

6.5 Business Implementation
You may recall that the business objective was: given a limited mailing budget, target the most likely prospects for a credit card offer to reduce the cost of customer acquisition and improve the campaign ROI.

The optimal number of customers to target can be decided by looking at Table 6 and its graphic representation in Figure 54 on page 110.

Table 6. Cross-Selling: ROI Analysis Figures

Percentage of   Predicted       Number of   Total Cost    Annual Credit   Predicted
Universe (%)    Response        Responses   of Mailing    Card Profit     ROI (%)
                Rate (%)                    (U.S. $)      (U.S. $)

 10              55             416          12,500        41,600         333
 20              58             435          25,000        43,500         174
 30              65             488          37,500        48,800         130
 40              76             570          50,000        57,000         114
 50              83             623          62,500        62,300          99.7
 60              90             675          75,000        67,500          90
 70              92             690          87,500        69,000          79
 80              95             713         100,000        71,300          71
 90              98             735         112,500        73,500          65
100             100             750         125,000        75,000          60

Figure 54. Cross-Selling: ROI Analysis Figures

The ROI analysis table is built from a combination of the comparative gains chart (Figure 53 on page 109) and the baseline ROI calculation in Table 5 on page 88. The first two columns in Table 6 are derived by reading the gains chart in Figure 53 from left to right as follows: contacting 20% of the universe will yield 58% of the respondents, contacting 40% of the universe will yield 76% of the respondents, and so on. The number of responses is derived from the neural network test model. The cost and profit figures are taken directly from Table 5 on page 88.
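The arithmetic behind Table 6 can be reproduced from two unit-economics figures inferred from the table itself (mailing the full universe costs U.S. $125,000, and each response yields U.S. $100 in annual credit card profit; both values are read off the table rather than stated in the text):

```python
# Reconstruct one row of the ROI analysis: cost scales with the fraction
# of the universe mailed, profit scales with the responses captured,
# and ROI is profit over cost.
FULL_MAILING_COST = 125_000   # cost to mail 100% of the universe (U.S. $)
PROFIT_PER_RESPONSE = 100     # annual credit card profit per response (U.S. $)

def roi_row(pct_universe, responses):
    cost = FULL_MAILING_COST * pct_universe / 100
    profit = responses * PROFIT_PER_RESPONSE
    return cost, profit, round(100 * profit / cost, 1)
```

For example, roi_row(10, 416) reproduces the first row of Table 6: a cost of $12,500, a profit of $41,600, and an ROI of about 333%.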

In conclusion, to achieve a positive return and at the same time maximize the migration of customers from the Okay Customer Set to the Good Customer Set, 40% of the potential customer universe should be targeted.

Chapter 7. Attrition Model to Improve Customer Retention

In this chapter we discuss attrition management analysis: how to keep your customers satisfied, how to predict which customers will leave within six months, and how to turn these expected defectors into loyal customers. In general, it is more profitable to influence nonloyal customers to become loyal to your company than to strive to acquire new customers.

With many analysts estimating customer attrition rates at almost 50% every five years, the challenge of managing customer attrition is driving companies to gain a more comprehensive understanding of their customers.

Figure 55 illustrates the point well. This chart is taken from the Harvard Business Review and demonstrates the value of good attrition control to the profitability of businesses in several different sectors. The message is clear: by decreasing the rate of attrition, you can increase the profitability of your business.

Figure 55. Reducing Defections 5% Boosts Profits 25% to 85%

Intelligent Miner can develop models that you can use to accurately target customers who might defect. If you take the appropriate business action to stop the potential defection, you can reduce customer attrition.

7.1 Executive Summary
The goal of this case study was to identify which profitable customers were likely to defect. The profitable customers selected for analysis were the Okay Customer Set and the Good Customer Set customers from the Customer Segmentation case study. In addition to being able to predict which customers had a higher likelihood of defection, we wanted to understand the characteristics of the defectors and nondefectors and how we could use that information to increase the company's retention rate.

We used four different methods to solve the business problem. The methodology to implement these techniques, including data preparation and analysis of results, is also presented. For this prediction we used a combination of customer transaction data and demographic data. We contrasted the results from the three standard prediction techniques with a time-series technique.

The neural networks, both the standard and time-series versions, were best able to predict which customers were likely to defect. The standard neural network could identify 95% of the defectors in only 20% of the customer population. The time-series neural network could identify 92% of the defectors in 20% of the customer population. In addition, the time-series neural network could narrow the window for predicting the time of defection to one month, instead of the six months of the standard techniques.

We profiled and characterized the output of all the techniques to distinguish between a typical defector and a typical nondefector. The characterizations from all algorithms agreed very well. The defining characteristics of defectors were:

• Mostly from the Okay Customer Set

• No Best customers

• Lower product usage than average

• Shorter tenure

• In general lower usage of all products, especially telebanking, credit card, mortgages, and loans

The defining characteristics of nondefectors were:

• Mostly from the Good Customer Set, although not as heavily skewed as the defectors

• Higher ratio of Best customers

• Higher product usage than average

• Longer tenure

• In general a higher usage of all products, especially telebanking, credit card, personal banking, mortgages, and loans

The Bank's personal banking service is a bundle of savings, checking, credit card, and lower fees for various services. This bundle was associated only with nondefector customers. Customers with a multifaceted relationship with the Bank are less likely to defect. Selling the personal banking bundle is a good start toward building a strong relationship with customers.

7.2 Business Requirement
The goal of this case study was to identify those customers in profitable segments who have a high probability of defection. Once customers likely to defect have been identified, it is possible to take business action to reduce that likelihood by offering the customers incentives to remain loyal. The reason for analyzing only profitable customers is that they provide sufficient margin to permit discounting and rewards and still be profitable. Customers who are likely to defect and are not profitable should be let go. We built an attrition model for the Okay Customer Set and the Good Customer Set, both of which are profitable customer segments.

We defined a customer defection as a customer who had no activity for at least six months. Our analysis was completed in January 1998, so the most recent defectors were customers who had no activity from July 1997 through December 1997.

In addition to identifying those customers most likely to defect, we analyzed how customers could be prevented from defecting. We also profiled the customers likely to defect.

7.3 Data Mining Process
We took two broad approaches to the problem:

• Model 1:

A combination of three tried-and-tested Intelligent Miner algorithms

• Model 2:

A new Intelligent Miner algorithm called time-series prediction

We combined decision tree, RBF, and neural classification much as we did for the Cross-Selling case study (see Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87). On the basis of detailed customer transactions from January through December 1997, we identified those customers with a high probability of defecting sometime in the last six months of 1997. The more sophisticated time-series analysis algorithm used the same input data but provided us with not just a single probability of defection but six different data points corresponding to the probability of defection in each of the last six months of 1997. Figure 56 on page 114 illustrates the general approach.

Figure 56. Data Mining Process: Attrition Analysis

• The modeling approach shown on the left-hand side of Figure 56 is identical to the approach we used in the Cross-Selling case study (see Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87).

• The modeling approach shown on the right-hand side of Figure 56 uses time-series prediction.

The only differences are the definition and creation of the objective variable. In fact, because the same time periods were involved for both case studies, we used the exact same initial data set to start the data mining process. Therefore we discuss only method 1, the combination of decision tree, RBF, and neural classification, for the definition of the objective variable; for details of the process refer to Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87. The data mining process discussion focuses on the second method. We present and discuss the results for both methods.

7.3.1 Data Definition
Refer to Figure 57 on page 116 for a layout of the data required for each method. Note that for the standard prediction methods there is one record per customer, with one variable being the target. For the time-series approach each customer has one record per month and one value of the objective variable per month; that is, the target variable has a profile, and the time-series approach will try to predict that profile.

7.3.1.1 Method 1
As a first step we created the objective variable to be modeled. We assigned the objective variable a value of 0 or 1. Customers who had activity in the first half of 1997 and no activity in the second half were assigned a value of 1; all other customers were assigned a value of 0.
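The Method 1 objective variable can be sketched as a single flag per customer. The set-of-active-months representation is an illustrative assumption:

```python
# Objective variable for Method 1: 1 if the customer was active in the
# first half of 1997 and silent in the second half, 0 otherwise.
def defector_flag(active_months):
    h1 = any(m in active_months for m in range(1, 7))    # Jan-Jun activity
    h2 = any(m in active_months for m in range(7, 13))   # Jul-Dec activity
    return 1 if h1 and not h2 else 0
```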

7.3.1.2 Method 2
For this method we defined the objective variable in the same way as in method 1 but implemented it differently. Customers who had activity in the first half of 1997 and no activity in the second half were assigned an objective variable value of 1 for each of the six months of the second half of 1997 and a value of 0 for each of the six months of the first half of 1997. All other customers were assigned a 0 for the objective variable for all 12 months. Because of the definition of customer defection in both methods, we actually built a model to identify which customers defected in July 1997.
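Under the same illustrative representation, Method 2 spreads the definition over twelve monthly values, giving each customer an objective-variable profile rather than a single flag:

```python
# Objective variable for Method 2: defectors get 0 for each month of the
# first half of 1997 and 1 for each month of the second half; all other
# customers get 0 for all 12 months.
def monthly_objective(active_months):
    h1 = any(m in active_months for m in range(1, 7))
    h2 = any(m in active_months for m in range(7, 13))
    if h1 and not h2:
        return [0] * 6 + [1] * 6   # the profile steps up in July 1997
    return [0] * 12
```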

Figure 57. Attrition Analysis: Data Definition

7.3.2 Data Preparation
Given the two models we used, shown in Figure 56 on page 114, we performed the data preparation for each model.

7.3.2.1 Method 1
Refer to Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87.

7.3.2.2 Method 2
One advantage of the time-series prediction method is that it does not require pivoting the transaction data, that is, taking the transaction data out of its natural time sequence and creating time variables for each customer record. The time-series algorithm uses data in its natural time sequence, whereas the standard prediction methods use a different variable on the same record for each time period. This fact limits the number of variables that can be analyzed with the standard prediction techniques: to capture time effects, a separate variable must be created for each time period being considered, for the differences between time periods, and for other factors. In method 1 the selection of quarterly time periods was driven by the desire to keep the number of variables small. For 14 product categories, with 3 variables (recency, number of transactions, and revenue) and 2 quarters, there are 84 base variables with 42 differences, 42 totals, and 84 ratios. If we had used months instead of quarters, we would have had 3 times the number of variables. With the time-series method we had only 42 variables (that is, 14 categories with 3 variables per category) and one record per month, or 12 times the number of records. The time-series neural network does not require the creation of time-derivative terms, as it captures those effects using time-delay layers. The algorithms tend to scale logarithmically with the number of records and exponentially with the number of variables. It is therefore much more efficient and elegant to use the time-series method.

The algorithm requires that each customer have a record for each time period. Unfortunately, if a customer has no transaction activity in a period, there is no transaction record. Therefore a dummy table must be created containing all customer ID and month-number pairs. The transaction data can then be joined through an outer join, which assigns null values to all missing customer ID and month-number pairs. The null values can then be updated to zeroes.
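The dummy-table-and-outer-join step can be sketched as follows, in Python rather than SQL; the activity-value representation is an illustrative assumption:

```python
# Build every (customer, month) pair, look up the transaction value for
# each pair, and substitute 0 where the outer join would produce a null.
def fill_missing_months(customers, n_months, txns):
    """txns: dict mapping (customer_id, month) -> activity value."""
    return {
        (c, m): txns.get((c, m), 0)   # missing pair -> null -> zero
        for c in customers
        for m in range(1, n_months + 1)
    }
```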

7.3.3 Data Mining
Figure 49 on page 96 outlines the detailed steps in running a predictive modeling algorithm. We refer to Figure 49 in our discussion of the time-series prediction method.

7.3.3.1 Parameter Selection
For the two methods described here, we used the following parameter settings:

Method 1 — Refer to Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87.

Method 2 — The time-series algorithm has the following parameters (see Figure 58 on page 118).

• In-sample size

Refer to 6.3.5.2, ″Parameter Selection,″ under Neural Network.

• Out-sample size

Refer to 6.3.5.2, ″Parameter Selection,″ under Neural Network.

• Maximum number of passes

Refer to 6.3.5.2, ″Parameter Selection,″ under Neural Network.

• Forecast horizon

The forecast horizon is the number of periods in the future, from the record in consideration, for which the algorithm is making a prediction. In this case we used a forecast horizon of 1; that is, we were trying to predict one month in advance.

• Window size

The window size is the number of historical records used to make a future prediction. In our case we used six months of historical data to predict one month in advance.

• Average error

Average error is an algorithm stopping criterion. It is the average root mean square (RMS) error limit. If the average RMS error is greater than this limit, the algorithm continues training until the criterion is met or the maximum number of passes is exceeded.

• Neural architecture

Refer to 6.3.5.2, ″Parameter Selection,″ under Neural Network.

• Learning Rate

Refer to 6.3.5.2, ″Parameter Selection,″ under Neural Network.

• Momentum

Refer to 6.3.5.2, ″Parameter Selection,″ under Neural Network.
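The window size and forecast horizon parameters can be illustrated with a sketch of how training samples are cut from one customer's monthly series. This is a generic sliding-window construction, not Intelligent Miner's internal format:

```python
# Each training sample pairs `window` consecutive historical values with
# the value `horizon` periods beyond the window, which is the quantity
# the time-series network learns to predict.
def make_windows(series, window=6, horizon=1):
    samples = []
    for start in range(len(series) - window - horizon + 1):
        history = series[start:start + window]
        target = series[start + window + horizon - 1]
        samples.append((history, target))
    return samples
```

With window=6 and horizon=1, an eight-month series yields two samples, each predicting the month immediately after its six-month history.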

Figure 58. Time Series: Setting the Parameters

7.3.3.2 Input Field Selection
We selected the input fields for each model as described below.

Method 1 — Refer to Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87.

Method 2 — The prediction variable should be continuous for the time-series algorithm. We selected the variables indicated by the decision tree to be important predictors of defection. The algorithm also profiles the output by quantile break points, which can be user defined. We input as supplementary data all data not used in the prediction.

7.3.3.3 Output Field Selection
We selected the following output fields for the two models used in this case study:

Method 1 — Refer to Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87.

Method 2 — The output of the time-series neural network can be viewed as a series of gains charts, one gains chart for each time period prediction. Therefore, at a minimum we required:

• Customer ID

• Objective variable

• Algorithm prediction

7.3.3.4 Results Visualization
As we used two different models in this case study, the following will give you an idea of what the results look like for each model:

Method 1 — Refer to Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87.

Method 2 — The algorithm generates a profile of the quantile breaks, using the clustering visualizer. This output also shows the average score for each quantile. A reasonable result should have the following characteristics:

• The ratio of the average score for the top quantile to that for the bottom quantile should be at least 2 or 3 to 1.

If this criterion is not satisfied, a mistake was made in setting up the data for the model, or the data selected is not predictive of the target.

• The characteristics of the top and bottom quantiles should be different.

If the top and bottom quantiles do not have differing characteristics, the model will have poor lift, or the data was not defined correctly.

• The average score should decrease monotonically with decreasing quantile bucket.

If the average score is not monotonic with decreasing bucket, there is sample bias or an effect that the model is not capturing.

• The order of importance of the variables within each quantile should be different.

If the order of importance of the first few variables in each quantile is the same, those variables are likely such powerful indicators that a separate model should be built for each segment. Compare the results of the prediction with a decision tree. If the powerful variables appear at the top of the tree, this is the case. If not, the variables may be systematically related to the target variable.
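The first and third checks above are mechanical and can be sketched as code over the average score per quantile bucket, ordered top bucket first; the function and thresholds are illustrative:

```python
# Sanity checks on a quantile profile: top-to-bottom average-score ratio
# of at least 2 to 1, and monotonically decreasing averages.
def profile_checks(avg_scores):
    return {
        "lift_ok": avg_scores[0] / avg_scores[-1] >= 2,
        "monotonic": all(a >= b for a, b in zip(avg_scores, avg_scores[1:])),
    }
```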

7.3.4 Gains Chart
Refer to Chapter 6, “Target Marketing Model to Support a Cross-Selling Campaign” on page 87 for a discussion of the role of gains charts in predictive modeling.

7.3.5 Clustering
In addition to building predictive models to identify those customers likely to defect, we did some clustering of the resulting models to explain the characteristics of defectors and nondefectors. We extracted the top and bottom decile customer records from the neural prediction model and appended them together into one data set. We used the demographic clustering algorithm to cluster all input data, not just the variables used to create the neural network model. The quantile number was input into the clustering algorithm as a supplementary variable to determine whether the algorithm could distinguish the customer records using other data. We then used the insight into the differences between defectors and nondefectors to design a campaign to reduce the defection rate in the targeted customer segments.
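The extraction step described above, taking the top and bottom deciles of the scored output and carrying the decile number along as a supplementary variable, can be sketched as follows; the record layout is an illustrative assumption:

```python
# Take the top and bottom decile of scored records and append them into
# one data set, tagging each record with its decile for the clusterer.
def decile_extremes(scored):
    ranked = sorted(scored, key=lambda sr: -sr[0])   # (score, record) pairs
    d = max(1, len(ranked) // 10)
    top = [dict(r, decile=1) for _, r in ranked[:d]]
    bottom = [dict(r, decile=10) for _, r in ranked[-d:]]
    return top + bottom
```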

In addition to clustering the neural network scores, the RBF output visualization automatically provides profile information about the customer records in each region. For the run of the RBF algorithm, we made sure to include all available input data as supplementary data if it was not already used as input to the model.

7.4 Data Mining Results
In this section we describe the results achieved with each of the data mining algorithms used in this case study. Remember that the decision tree, RBF modeling, neural network, and clustering represent the first model used, whereas the time-series prediction represents the second model used for the attrition management analysis.

7.4.1 Decision Tree
The following primary variables appeared in the decision tree:

• Lifetime to Date Revenue

• Mortgage Balance

• Total loan balance

• Customer tenure in months

• Number of products used in 1997

Figure 59 on page 121 shows a node for customers who will leave with 85% probability. The characteristics of these customers are:

• Total revenue is less than 1091.

• They have lower than average mortgage balances.

• They have no loans.

Figure 59. Attrition Analysis: Decision Tree Structure

All of these variables were meaningful for the business problem. Figure 60 on page 122 shows the gains chart for both training and test for the decision tree. The training results are very smooth, indicating monotonicity in the distribution of positive target events by descending leaf node score. The training result has a lift ratio of 1.75 times random at 20% of the customer population. The test gains curve has a higher lift ratio of 3.3 at 20%. This improvement is due to the use of stratified sampling in the training mode. The problem with the test mode is the severe waviness of the gains curve in its top portion, which is due to either a biased training sample (because of a small sample of negative target events) or some effect that the model was missing. The former explanation is the more probable.

Figure 60. Decision Tree Gains Chart: Training and Testing

7.4.2 RBF Modeling
The RBF results visualization in Figure 61 on page 123 is positive, as we have multiple regions. The top region and bottom region have a defection ratio of over 10 to 1, and the characteristics of each region are different and interpretable.

Figure 61. RBF: Results Window

Those customers in the top region of Figure 62 on page 124 have low activity across all products and tend to live in Ontario.

Figure 62. Attrition Analysis: Predicting Values Result

The second region from the top, which also has a high likelihood of defection, also contains customers skewed to Ontario on whom the Bank didn't have any demographics. The reason for the missing demographics would be that these customers had not applied for any credit and hence had not completed a credit check. This region also had low activity across all products. The first two regions also contained more male customers than average.

The regions that had a low probability of defection were mostly from the Good Customer Set and a small segment of the Okay Customer Set. These customers tended to be the Bank's Best customers and to come from Western Canada. They had high product activity and tended to have a longer tenure as well. All these characterizations reflect the current business understanding (see Figure 63 on page 125).

Figure 63. Attrition Analysis: Predicting Values

Figure 64 on page 126 shows the gains curves for the RBF models against both the training and test data sets. The RBF training result is similar to the decision tree gains chart at the top quantiles, below 25% of the customer universe. Above 25% of the cumulative customer population, the tree training model performs significantly better. The test result for the RBF model is worse than that of the decision tree because it has a lower lift ratio and is wavier in the top quantiles. The model is still very promising, showing a lift over random of approximately 2.8 at 20% of the customer population. Some further analysis to account for the source of the waviness may improve the result.

Figure 64. Attrition Analysis: Comparative Gains Charts for All Methods

7.4.3 Neural Network
We built the neural network model using the most significant variables as identified by the decision tree. The following sensitivity analysis resulted from the neural network:

Field Name       Sensitivity
REVENUE           7.4
TENURE           14.3
AVG_LOAN_BAL      1.7
AVG_MTG_BAL       2.2
PRODUCTS_USED    74.3

The PRODUCTS_USED variable seems to be highly related to customer defection. It is a concern that one variable accounts for such a high fraction of the observed output sensitivity. However, exploring this variable in the other algorithms reveals that it is not identified as the most sensitive variable by the decision tree or RBF. The business interpretation of the result is that the more products a customer uses, the less likely the customer is to defect, which reflects the current business understanding.
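One common way to obtain sensitivity figures like these is input perturbation: bump each input in turn, measure the output change, and normalize the changes to sum to 100. The sketch below uses a toy linear model; it illustrates the idea only and is not Intelligent Miner's actual procedure:

```python
# Perturbation-based sensitivity: the inputs whose changes move the
# output most receive the largest share of the normalized total.
def sensitivities(model, inputs, delta=1.0):
    base = model(inputs)
    effects = {}
    for name in inputs:
        bumped = dict(inputs, **{name: inputs[name] + delta})
        effects[name] = abs(model(bumped) - base)
    total = sum(effects.values()) or 1.0
    return {name: round(100 * e / total, 1) for name, e in effects.items()}

# Toy stand-in model: defection score driven mostly by products used
toy = lambda x: 3 * x["PRODUCTS_USED"] + x["TENURE"]
```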

Referring to Figure 64, you can see that the neural network achieves the best results in training, with a lift over random of 2 to 1 up to 40% of the customer population. The training curve is also smooth. The test results have an exceptional lift over random of approximately 4.75 to 1 at 20% of the customer population. The gains curve, however, is wavy at the top quantiles. The result should be accepted with caution: the high sensitivity of the one variable, the "too good" gains curve, and the waviness should all be explained before the result is accepted as valid. The initial results are very promising and indicate that the neural network will produce the best model for this business problem.

7.4.4 Clustering
Finally, we clustered the neural network results to:

 1. Validate the models that were developed

A clustering algorithm should be able to distinguish between the top decile and bottom decile of the predictive model's scored output if the model is valid.

2. Identify the distinguishing characteristics of defectors and nondefectors

The distinguishing features could then be utilized to create retention campaigns.

Figure 65 on page 128 shows the results of the clustering on the neural network model output. The data set used for the clustering contained a 50/50 split of top decile and bottom decile customer records. If the model had worked perfectly, we would see just two clusters. As expected, we got two large clusters, containing 43% and 38% of the records, respectively, along with five smaller clusters. The largest cluster contains the defectors, as indicated by the quantile variable (the left two bars are the quantile buckets for the bottom two 5-percentile buckets, and the right two bars are for the top two 5-percentile buckets). Notice the slight sliver of low-quantile customers in this cluster. These are mistakes made by either the neural network or the clustering algorithm; further analysis of these records may improve the prediction result. Because the clustering found a larger segment of defectors, there are probably fewer characteristics associated with defectors as opposed to nondefectors.
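The validation idea above (cluster a 50/50 mix of top- and bottom-decile records and check that the clusters separate them) can be sketched as follows. Since Intelligent Miner's demographic clustering is not publicly available, this uses a minimal k-means as a stand-in, and `quantile_flag` is a hypothetical 0/1 marker for top-decile records:

```python
import numpy as np

def kmeans(X, k=2, iters=50, seed=0):
    """Minimal k-means, a stand-in for Intelligent Miner's clustering."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels

def decile_purity(features, quantile_flag, k=2):
    """Cluster a 50/50 mix of top- and bottom-decile records and report,
    per cluster, the share of top-decile (likely-defector) records.
    A valid model should yield clusters dominated by one decile each."""
    labels = kmeans(features, k)
    return [float(np.mean(quantile_flag[labels == j])) for j in range(k)]
```

Purities near 1.0 and 0.0 correspond to the clean defector and nondefector clusters described above; a sliver of mixed records shows up as a purity slightly below 1.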

The characteristics of defectors are:

• Mostly from the Okay Customer Set segment

Recall that we only used customers from the Okay Customer Set and the Good Customer Set.

• No Best customers

• Lower product usage than average

• Shorter tenure

• In general lower usage of all products, especially telebanking, credit card, mortgages, and loans

In contrast, the characteristics of nondefectors are:

• Mostly from the Good Customer Set, although the skew is less pronounced than for the defectors

• Higher ratio of Best customers

• Higher product usage than average

• Longer tenure

• In general a higher usage of all products, especially telebanking, credit card, personal banking, mortgages, and loans

Chapter 7. Attrition Model to Improve Customer Retention 127


Figure 65. Attrition Analysis: Demographic Clustering of Likely Defectors

7.4.5 Time-Series Prediction

The time-series neural network outputs a profile of the model scores by quantile (see Figure 66 on page 129). The output is very good, with a ratio of almost 20 to 1 between the top and bottom quantiles. The scores are monotonically distributed by decreasing quantile, although the distribution could be a little smoother, and variables have differing importance to each quantile group. Customers with a high likelihood to defect, the top quantile in Figure 66 on page 129, have lower than average activity in the different products. Customers with a low likelihood to defect have high levels of activity in all products. This characterization of defectors and nondefectors agrees with the output of all other algorithms, including those using standard prediction methods.

Figure 66. Profile of Time-Series Prediction

Figure 64 on page 126 shows both the test and training gains curves for the time-series neural network. The training case is almost identical to the standard back-propagating neural network. The test case is slightly worse than the standard neural network, but the curve is smooth with no wave. The lift over random for the test case is very high, at approximately 4.6 to 1 at 20%. Again, you should be skeptical of such good performance and do some analysis to ensure that there is no sample bias and that the input data is not highly correlated with the target field.

The time-series neural network can also be used to generate prediction profiles over time by plotting the prediction versus time period. Figure 67 and Figure 68 on page 131 show the profiles of a few randomly selected customers from the test data set. These plotted profiles are representative of the many that we analyzed in the case study.

Figure 67. Time Profile of Defection Probability for Defectors


Figure 68. Time Profile of Defection Probability for Nondefectors

A defector's profile is by definition step-like, with a step between months 6 and 7, as visible in Figure 67 on page 130. The predicted profiles have a similar step-like shape, except that the height of the step is smaller and the location of the step is not always between months 6 and 7. One profile also has a blip in the first few months of the year. The nondefector profiles in Figure 68 should be a flat line at 0. In practice they are not flat but wavy, with no distinct step-like shape as the defectors had. The clear difference between a defector and a nondefector profile makes this algorithm very useful. Not only can you distinguish between defectors and nondefectors, as you can with the standard predictive methods, but you can approximate when the customer is likely to defect. This is of critical importance in customer defection problems, because a business activity to reduce customer defection should be timed very close to the time that a customer is likely to defect. With the standard techniques in this case, our prediction of defection is windowed to within 6 months. Using the time-series approach, we can narrow this approximation to a particular month. This is a substantial improvement. The added difficulty of predicting the time of defection tends to make the gains curves of the time-series prediction slightly worse than those of the standard neural technique, but in this case the model can distinguish defectors and nondefectors very easily.
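Locating the step in a predicted profile can be done by finding the largest month-over-month jump in the scores. This is our own heuristic sketch, not part of the product; the `min_step` threshold is an assumed cutoff that separates step-like defector profiles from the wavy nondefector ones:

```python
def defection_month(profile, min_step=0.3):
    """Locate the month at which a predicted defection profile steps up.
    `profile` is a list of monthly defection scores; returns the 1-based
    month after which the largest jump occurs, or None if no jump exceeds
    `min_step` (i.e. the profile looks like a nondefector's)."""
    jumps = [profile[i + 1] - profile[i] for i in range(len(profile) - 1)]
    best = max(range(len(jumps)), key=jumps.__getitem__)
    return best + 1 if jumps[best] >= min_step else None
```

A profile that steps up between months 6 and 7 yields 6, that is, the customer is predicted to defect right after month 6; a wavy, flat-ish profile yields None.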

7.5 Business Implementation

Once customers with a high likelihood of defection are identified, it is possible to execute a direct mail campaign to target them. If we are to believe the neural network model results, we could target 95% of customers likely to defect for 20% of the cost compared to mailing to all customers (in the Okay Customer Set and the Good Customer Set). This is a substantial cost savings. Using the time-series prediction, we can also time the customer communication to be as close as possible to the time of likely customer defection, to make the contact as relevant as possible.

Once we have identified a list of customers we intend to target, we can profile the profitability of those customers to determine the margin available to be used to increase customer retention.

We can then use this budget to try to turn likely defectors into nondefectors. The defining characteristic of nondefectors was heavy product usage, indicating a strong relationship with the Bank. A key product in building a multi-product relationship is the personal banking bundle, which was present only in nondefector customers. This bundle includes a savings account, checking account, credit card, and other services, all at a low fee. A campaign to cross-sell this product to profitable customers likely to defect may help consolidate the Bank's relationship with them.


Chapter 8. Intelligent Miner Advantages

In the first case study we created a segmentation model to be used as a basis for CRM. Using several techniques, we were able to create segments of customers with differing levels of shareholder value. The differences in shareholder value of the customer segments allowed us to identify the most profitable customers, high potential customers, and low potential customers. The best customer segment represented 35% of customer revenue from 9% of customers. Several high potential segments were also identified. If by marketing additional products and/or services to these customers we were able to change the purchase behavior of 10% of the high potential customers to be similar to our best customers, we could impact total revenue by 18%.

Selecting one of the high potential customer segments, we used product association techniques to find cross-selling opportunities in the second case study. By contrasting the behavior of the high potential segment against the behavior of a higher potential group, we were able to identify missing products that could be marketed in a cross-selling campaign. The product to cross-sell was identified to be a credit card. If through marketing we were able to activate 10% of the high potential cluster to use the credit card, we could impact the cluster revenue by 25% and the overall revenue by 3%.

Having identified a customer segment and a product for cross-selling, we could execute a promotion. Rather than marketing to every customer in the segment of interest, building a predictive model to target just those customers likely to activate with the bank's credit card would be more cost effective. In the third case study, we built several predictive models, using different techniques. The best of these models was able to predict 65% of likely activations with only 30% of the mailing cost. If we targeted 30% of the customer segment in question, our expected ROI would be 160%.
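The arithmetic behind such a targeting decision can be sketched as follows. The capture and cost fractions come from the text; the absolute mailing cost and return figures used in the example are purely illustrative assumptions, not figures from the case study:

```python
def campaign_roi(response_capture, cost_fraction,
                 full_mail_cost, full_mail_return):
    """Expected ROI (as a percentage) of mailing only the model-ranked
    top customers.  `response_capture` is the share of all likely
    responders reached (e.g. 0.65); `cost_fraction` is the share of the
    full mailing cost actually spent (e.g. 0.30).  The full-mailing cost
    and return figures are illustrative inputs."""
    cost = cost_fraction * full_mail_cost
    captured_return = response_capture * full_mail_return
    return 100.0 * (captured_return - cost) / cost
```

For instance, capturing 65% of responses at 30% of the cost, under the assumption that a full mailing would return 1.2 times its cost, gives an expected ROI of 160%.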

In the fourth case study, we built predictive models to identify which customers were likely to defect. From the customer segmentation model in the first case study we were able to identify the current best customers and future high potential value customers. An important marketing strategy for these customers is retention. By marketing to these customers, we would be able to reduce the defection rate in our best customer segments, ensuring the corporation's future earning potential and maintaining current revenue levels.

Using IBM's Intelligent Miner for Data product in these four case studies, we were able to illustrate how data mining can be used to support a CRM program within an organization. We also showed the power of Intelligent Miner and its capability to work on a wide variety of business problems. In selecting a tool for data mining, an organization should consider the range of problems to be solved and the potential for feeding the results of one business problem into the input of the next. Intelligent Miner was able to execute a sequence of business activities fundamental to CRM:

• Create strategic marketing initiatives

• Identify marketing opportunities to support strategic initiatives

• Effectively target customers for a particular promotion

Based on the case studies presented, the total impact on the bottom line was approximately a 25% increase in profitability.

Copyright IBM Corp. 1999 133


IBM's Intelligent Miner is the leading data mining product in today's marketplace, offering these competitive advantages:

• Algorithms based on open academic research

The algorithms within Intelligent Miner are based on open academic research. They were developed in IBM laboratories around the world by world-leading researchers in artificial intelligence and machine learning. This body of research dates back more than 20 years. For end users of this technology, this large body of research means higher quality algorithms that produce better results than other tools in the marketplace.

• Research grown out of IBM core competence

IBM has been cultivating artificial intelligence and machine learning through billions of dollars of investment in R&D for decades. In the corporate world IBM research labs are second to none. Organizations building competitive data mining products do not have nearly as strong a competency in the disciplines required for data mining.

In addition to a core competency in data mining, IBM has a core competency in software development. The technical challenge of implementing data mining technology with the ability to work against millions of customer records is immense. No other organization has developed such complex algorithms that are as scalable as Intelligent Miner's. Most of the competition produces data mining products for PCs and uniprocessor server platforms.

To customers, this advantage means higher quality algorithms that have been so efficiently implemented that the time required to create decision support information has been significantly reduced.

• Algorithms that have existed for a long time

Some of the algorithms in Intelligent Miner have existed in other IBM products for 10 years. The newest algorithms are more than three years old. The competitors are just creating and releasing products today. In addition to being first in the marketplace, IBM consultants have been using the algorithms in more than 100 engagements around the world. In fact, Intelligent Miner was created because the data mining consultants recognized the need for an integrated data mining product with several different analytical methods.

End users benefit from these advantages. More product use means that the current product has fewer software bugs; the algorithm bugs have been shaken out through years of practical application. Investments in million-dollar marketing campaigns are therefore more secure with Intelligent Miner.

• Wider variety of algorithms and visualizations in one tool

As shown in the case studies presented in this book, the wide variety of algorithms in Intelligent Miner allows for a wider range of analysis than other data mining packages. The powerful combination of visualization tools and data mining algorithms, some of which are unique to Intelligent Miner, permits better business results than other products permit.

• Unique algorithms

Intelligent Miner has two algorithms that are unique and were invented by IBM researchers. The demographic algorithm is the only clustering algorithm that can cluster categorical data. The product associations algorithms were also invented in IBM research labs.


These unique capabilities enable end users to perform analyses not possible with other tools.

• "Infinitely scalable"

Intelligent Miner runs on the SP2 MPP platform, which can scale to handle terabyte-sized data warehouses. No competitive products are as scalable. Intelligent Miner can also connect to operational databases for scoring and validation. This scalability enables end users with millions of customers to efficiently integrate data mining technology into their businesses today.

• Open technology

Intelligent Miner runs on other vendor platforms, including HP, Sun, and Windows NT. It can also interface to other databases, using IBM's DataJoiner product. Thus end user customers can use IBM data mining technology on their platforms today.

Chapter 8. Intelligent Miner Advantages 135


Appendix A. Special Notices

This publication is intended to help all customers to better understand the different data mining algorithms used by the Intelligent Miner for Data. The information in this publication is not intended as the specification of any programming interfaces that are provided by Intelligent Miner for Data. See the PUBLICATIONS section of the IBM Programming Announcement for Intelligent Miner for Data for more information about what publications are considered to be product documentation.

References in this publication to IBM products, programs or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.

Information in this book was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact IBM Corporation, Dept. 600A, Mail Drop 1329, Somers, NY 10589 USA.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS.

The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Any pointers in this publication to external Web sites are provided for convenience only and do not in any manner serve as an endorsement of these Web sites.


The following terms are trademarks of the International Business Machines Corporation in the United States and/or other countries:

The following terms are trademarks of other companies:

C-bus is a trademark of Corollary, Inc.

Java and HotJava are trademarks of Sun Microsystems, Incorporated.

Microsoft, Windows, Windows NT, and the Windows 95 logo are trademarksor registered trademarks of Microsoft Corporation.

PC Direct is a trademark of Ziff Communications Company and is used by IBM Corporation under license.

Pentium, MMX, ProShare, LANDesk, and ActionMedia are trademarks or registered trademarks of Intel Corporation in the U.S. and other countries.

UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited.

Other company, product, and service names may be trademarks or service marks of others.

AIX                      AIX/6000
DATABASE 2               DB2
DB2 Universal Database   IBM
Intelligent Miner        QMF
RISC System/6000         RS/6000
TextMiner                Visual Warehouse


Appendix B. Related Publications

The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this redbook.

B.1 International Technical Support Organization Publications

For information on ordering these ITSO publications see "How to Get ITSO Redbooks" on page 141.

• Discovering Data Mining, SG24-4839

• Mining Relational and Nonrelational Data with IBM Intelligent Miner for Data, SG24-5278

B.2 Redbooks on CD-ROMs

Redbooks are also available on CD-ROMs. Order a subscription and receive updates 2-4 times a year at significant savings.

CD-ROM Title                                            Subscription   Collection Kit
                                                        Number         Number
System/390 Redbooks Collection                          SBOF-7201      SK2T-2177
Networking and Systems Management Redbooks Collection   SBOF-7370      SK2T-6022
Transaction Processing and Data Management Redbook      SBOF-7240      SK2T-8038
AS/400 Redbooks Collection                              SBOF-7270      SK2T-2849
RS/6000 Redbooks Collection (HTML, BkMgr)               SBOF-7230      SK2T-8040
RS/6000 Redbooks Collection (PostScript)                SBOF-7205      SK2T-8041
Application Development Redbooks Collection             SBOF-7290      SK2T-8037
Personal Systems Redbooks Collection                    SBOF-7250      SK2T-8042

B.3 Other Publications

These publications are also relevant as further information sources:

• Using the Intelligent Miner for Data, SH12-6325


How to Get ITSO Redbooks

This section explains how both customers and IBM employees can find out about ITSO redbooks, CD-ROMs, workshops, and residencies. A form for ordering books and CD-ROMs is also provided.

This information was current at the time of publication, but is continually subject to change. The latest information may be found at http://www.redbooks.ibm.com.

How IBM Employees Can Get ITSO Redbooks

Employees may request ITSO deliverables (redbooks, BookManager BOOKs, and CD-ROMs) and information about redbooks, workshops, and residencies in the following ways:

• PUBORDER — to order hardcopies in United States

• GOPHER link to the Internet - type GOPHER.WTSCPOK.ITSO.IBM.COM

• Tools disks

To get LIST3820s of redbooks, type one of the following commands:

TOOLS SENDTO EHONE4 TOOLS2 REDPRINT GET SG24xxxx PACKAGE
TOOLS SENDTO CANVM2 TOOLS REDPRINT GET SG24xxxx PACKAGE (Canadian users only)

To get BookManager BOOKs of redbooks, type the following command:

TOOLCAT REDBOOKS

To get lists of redbooks, type one of the following commands:

TOOLS SENDTO USDIST MKTTOOLS MKTTOOLS GET ITSOCAT TXT
TOOLS SENDTO USDIST MKTTOOLS MKTTOOLS GET LISTSERV PACKAGE

To register for information on workshops, residencies, and redbooks, type the following command:

TOOLS SENDTO WTSCPOK TOOLS ZDISK GET ITSOREGI 1998

For a list of product area specialists in the ITSO, type the following command:

TOOLS SENDTO WTSCPOK TOOLS ZDISK GET ORGCARD PACKAGE

• Redbooks Web Site on the World Wide Web

http://w3.itso.ibm.com/redbooks

• IBM Direct Publications Catalog on the World Wide Web

http://www.elink.ibmlink.ibm.com/pbl/pbl

IBM employees may obtain LIST3820s of redbooks from this page.

• REDBOOKS category on INEWS

• Online — send orders to: USIB6FPL at IBMMAIL or DKIBMBSH at IBMMAIL

• Internet Listserver

With an Internet e-mail address, anyone can subscribe to an IBM Announcement Listserver. To initiate the service, send an e-mail note to [email protected] with the keyword subscribe in the body of the note (leave the subject line blank). A category form and detailed instructions will be sent to you.

Redpieces

For information so current it is still in the process of being written, look at "Redpieces" on the Redbooks Web Site (http://www.redbooks.ibm.com/redpieces.htm). Redpieces are redbooks in progress; not all redbooks become redpieces, and sometimes just a few chapters will be published this way. The intent is to get the information out much quicker than the formal publishing process allows.


How Customers Can Get ITSO Redbooks

Customers may request ITSO deliverables (redbooks, BookManager BOOKs, and CD-ROMs) and information about redbooks, workshops, and residencies in the following ways:

• Online Orders — send orders to:

• Telephone orders

• Mail Orders — send orders to:

• Fax — send orders to:

• 1-800-IBM-4FAX (United States) or (+1)001-408-256-5422 (Outside USA) — ask for:

Index # 4421 Abstracts of new redbooks
Index # 4422 IBM redbooks
Index # 4420 Redbooks for last six months

• Direct Services - send note to [email protected]

• On the World Wide Web

Redbooks Web Site                http://www.redbooks.ibm.com
IBM Direct Publications Catalog  http://www.elink.ibmlink.ibm.com/pbl/pbl

• Internet Listserver

With an Internet e-mail address, anyone can subscribe to an IBM Announcement Listserver. To initiate the service, send an e-mail note to [email protected] with the keyword subscribe in the body of the note (leave the subject line blank).

Redpieces

For information so current it is still in the process of being written, look at "Redpieces" on the Redbooks Web Site (http://www.redbooks.ibm.com/redpieces.htm). Redpieces are redbooks in progress; not all redbooks become redpieces, and sometimes just a few chapters will be published this way. The intent is to get the information out much quicker than the formal publishing process allows.

                        IBMMAIL               Internet
In United States:       usib6fpl at ibmmail   [email protected]
In Canada:              caibmbkz at ibmmail   [email protected]
Outside North America:  dkibmbsh at ibmmail   [email protected]

United States (toll free)  1-800-879-2755
Canada (toll free)         1-800-IBM-4YOU

Outside North America (long distance charges apply)
(+45) 4810-1320 - Danish
(+45) 4810-1420 - Dutch
(+45) 4810-1540 - English
(+45) 4810-1670 - Finnish
(+45) 4810-1220 - French

(+45) 4810-1020 - German
(+45) 4810-1620 - Italian
(+45) 4810-1270 - Norwegian
(+45) 4810-1120 - Spanish
(+45) 4810-1170 - Swedish

IBM Publications
Publications Customer Support
P.O. Box 29570
Raleigh, NC 27626-0570
USA

IBM Publications
144-4th Avenue, S.W.
Calgary, Alberta T2P 3N5
Canada

IBM Direct Services
Sortemosevej 21
DK-3450 Allerød
Denmark

United States (toll free)  1-800-445-9269
Canada                     1-403-267-4455
Outside North America      (+45) 48 14 2207 (long distance charge)


IBM Redbook Order Form

Please send me the following:

Title Order Number Quantity

First name Last name

Company

Address

City Postal code Country

Telephone number Telefax number VAT number

• Invoice to customer number

• Credit card number

Credit card expiration date Card issued to Signature

We accept American Express, Diners, Eurocard, Master Card, and Visa. Payment by credit card not available in all countries. Signature mandatory for credit card payment.

How to Get ITSO Redbooks 143


Glossary

A

adaptive connection. A numeric weight used to describe the strength of the connection between two processing units in a neural network. The connection is called adaptive because it is adjusted during training. Values typically range from zero to one, or -0.5 to +0.5.

aggregate. To summarize data in a field.

application program interface (API). A functional interface supplied by the operating system or a separate orderable licensed program that allows an application program written in a high-level language to use specific data or functions of the operating system or the licensed program.

architecture. The number of processing units in the input, output, and hidden layers of a neural network. The number of units in the input and output layers is calculated from the mining data and input parameters. An intelligent data mining agent calculates the number of hidden layers and the number of processing units in those hidden layers.

associations. The relationship of items in a transaction in such a way that items imply the presence of other items in the same transaction.

attribute. Characteristics or properties that can be controlled, usually to obtain a required appearance. For example, color is an attribute of a line. In object-oriented programming, a data element defined within a class.

B

back propagation. A general-purpose neural network named for the method used to adjust weights while learning data patterns. The Classification - Neural mining function uses such a network.

boundary field. The upper limit of an interval as used for discretization using ranges of a processing function.

bucket. One of the bars in a bar chart showing the frequency of a specific value.

C

categorical values. Discrete, nonnumerical data represented by character strings; for example, colors or special brands.

chi-square test. A test to check whether two variables are statistically dependent or not. Chi-square is calculated by subtracting the expected frequencies (imaginary values) from the observed frequencies (actual values). The expected frequencies represent the values that would be expected if the variables in question were statistically independent.
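As a worked illustration of this definition, the statistic is commonly computed by summing, over all cells, the squared difference between observed and expected frequencies divided by the expected frequency (a generic sketch, not Intelligent Miner code):

```python
def chi_square(observed, expected):
    """Chi-square statistic: for each cell, square the difference between
    the observed and expected frequencies, scale by the expected
    frequency, and sum over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

A larger statistic means the observed counts stray further from what statistical independence would predict.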

classification. The assignment of objects into groups or categories based on their characteristics.

cluster. A group of records with similar characteristics.

cluster prototype. The attribute values that are typical of all records in a given cluster. Used to compare the input records to determine whether a record should be assigned to the cluster represented by these values.

clustering. A mining function that creates groups of data records within the input data on the basis of similar characteristics. Each group is called a cluster.

confidence factor. Indicates the strength or the reliability of the associations detected.

continuous field. A field that can have any floating point number as its value.

D

DATABASE 2 (DB2). An IBM relational database management system.

database table. A table residing in a database.

database view. An alternative representation of data from one or more database tables. A view can include all or some of the columns contained in the database table or tables on which it is defined.

data field. In a database table, the intersection from table description and table column where the corresponding data is entered.

data format. There are different kinds of data formats, for example, database tables, database views, pipes, or flat files.

data table. A data table, regardless of the data format it contains.


data type. There are different kinds of Intelligent Miner data types, for example, discrete numeric, discrete nonnumeric, binary, or continuous.

discrete. Pertaining to data that consists of distinct elements such as characters, or to physical quantities having a finite number of distinctly recognizable values.

discretization. The act of making mathematically discrete.

E

envelope. The area between two curves that are parallel to a curve of time-sequence data. The first curve runs above the curve of time-sequence data, the second one below. Both curves have the same distance to the curve of time-sequence data. The width of the envelope, that is, the distance from the first parallel curve to the second, is defined as epsilon.

epsilon. The maximum width of an envelope that encloses a sequence. Another sequence is epsilon-similar if it fits in this envelope.

epsilon-similar. Two sequences are epsilon-similar if one sequence does not go beyond the envelope that encloses the other sequence.

equality compatible . Pertaining to different datatypes that can be operands for the = logicaloperator.

Euclidean distance. The square root of the sum of the squared differences between two numeric vectors. The Euclidean distance is used to calculate the error between the calculated network output and the target output in neural classification, and to calculate the difference between a record and a prototype cluster value in neural clustering. A zero value indicates an exact match; larger numbers indicate greater differences.
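The definition translates directly into code; a minimal sketch:

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([1.0, 2.0], [1.0, 2.0]))  # exact match -> 0.0
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # -> 5.0
```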

field. A set of one or more related data items grouped for processing. In this document, with regard to database tables and views, field is synonymous with column.

file. A collection of related data that is stored and retrieved by an assigned name.

file name. (1) A name assigned or declared for a file. (2) The name used by a program to identify a file.

flat file. (1) A one-dimensional or two-dimensional array: a list or table of items. (2) A file that has no hierarchical structure.

formatted information. An arrangement of information into discrete units and structures in a manner that facilitates its access and processing. Contrast with narrative information.

F-test. A statistical test that checks whether two estimates of the variances of two independent samples are the same. In addition, the F-test checks whether the null hypothesis is true or false.

function. Any instruction or set of related instructions that performs a specific operation.

fuzzy logic. In artificial intelligence, a technique using approximate rules of inference in which truth values and quantifiers are defined as possibility distributions that carry linguistic labels.

input data. The metadata of the database table, database view, or flat file containing the data you specified to be mined.

input layer. A set of processing units in a neural network which present the numeric values derived from user data to the network. The number of fields and the type of data in those fields are used to calculate the number of processing units in the input layer.

instance. In object-oriented programming, a single, actual occurrence of a particular object. Any level of the object class hierarchy can have instances. An instance can be considered in terms of a copy of the object type frame that is filled in with particular information.

interval. A set of real numbers between two numbers, either including or excluding both of them.

interval boundaries. Values that represent the upper and lower limits of an interval.

item category. A categorization of an item. For example, a room in a hotel can have the following categories: Standard, Comfort, Superior, Luxury. The lower category is called the child item category. Each child item category can have several parent item categories. Each parent item category can have several grandparent item categories.

item description. The descriptive name of a character string in a data table.

item ID. The identifier for an item.

item set. A collection of items. For example, all items bought by one customer during one visit to a department store.


Kohonen Feature Map. A neural network model comprised of processing units arranged in an input layer and an output layer. All processors in the input layer are connected to each processor in the output layer by an adaptive connection. The learning algorithm used involves competition between units for each input pattern and the declaration of a winning unit. Used in neural clustering to partition data into similar record groups.

large item sets. The total volume of items above the specified support factor returned by the Associations mining function.

learning algorithm. The set of well-defined rules used during the training process to adjust the connection weights of a neural network. The criteria and methods used to adjust the weights define the different learning algorithms.

learning parameters. The variables used by each neural network model to control the training of a neural network, which is accomplished by modifying network weights.

lift. Confidence factor divided by expected confidence.
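As a worked example (the rule and both confidence values are hypothetical): a rule whose confidence factor is 40% while its expected confidence is 20% has a lift of 2, meaning the rule body doubles the likelihood of the rule head.

```python
def lift(confidence_factor, expected_confidence):
    """Lift = confidence factor / expected confidence.
    Values above 1 mean the rule body makes the rule head more likely."""
    return confidence_factor / expected_confidence

# Hypothetical rule [bread => butter]: 40% of transactions containing bread
# also contain butter, while 20% of all transactions contain butter.
print(lift(0.40, 0.20))  # -> 2.0
```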

metadata. In databases, data that describes data objects.

mining. Synonym for analyzing or searching.

mining base. A repository where all information about the input data, the mining run settings, and the corresponding results is stored.

model. A specific type of neural network and its associated learning algorithm. Examples include the Kohonen Feature Map and back propagation.

narrative information. Information that is presented according to the syntax of a natural language. Contrast with formatted information.

neural network. A collection of processing units and adaptive connections that is designed to perform a specific processing function.

Neural Network Utility (NNU). A family of IBM application development products for creating neural network and fuzzy rule system applications.

nonsupervised learning. A learning algorithm that requires only input data to be present in the data source during the training process. No target output is provided; instead, the desired output is discovered during the mining run. A Kohonen Feature Map, for example, uses nonsupervised learning.

offset. (1) The number of measuring units from an arbitrary starting point in a record, area, or control block to some other point. (2) The distance from the beginning of an object to the beginning of a particular field.

operator. (1) A symbol that represents an operation to be done. (2) In a language statement, the lexical entity that indicates the action to be performed on operands.

output data. The metadata of the database table, database view, or flat file containing the data being produced or to be produced by a function.

output layer. A set of processing units in a neural network which contain the output calculated by the network. The number of outputs depends on the number of classification categories or the maximum clusters value in neural classification and neural clustering, respectively.

pass. One cycle of processing a body of data.

prediction. The dependency and the variation of one field's value within a record on the other fields within the same record. A profile is then generated that can predict a value for the particular field in a new record of the same form, based on its other field values.

processing unit. A processing unit in a neural network is used to calculate an output by summing all incoming values multiplied by their respective adaptive connection weights.
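A minimal sketch of this weighted sum (the input and weight values are hypothetical):

```python
def unit_output(inputs, weights):
    """Output of a processing unit: the sum of incoming values,
    each multiplied by its adaptive connection weight."""
    return sum(x * w for x, w in zip(inputs, weights))

print(unit_output([1.0, 2.0, 3.0], [0.5, 0.25, 0.5]))  # -> 2.5
```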

quantile. One of a finite number of nonoverlapping subranges or intervals, each of which is represented by an assigned value.

Q is an N%-quantile of a value set S when:

• Approximately N percent of the values in S are lower than or equal to Q.

• Approximately (100-N) percent of the values are greater than or equal to Q.


The approximation is less exact when there are many values equal to Q. N is called the quantile label. The 50%-quantile represents the median.
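A minimal sketch of one common convention (implementations differ in how ties and interpolation are handled):

```python
def quantile(values, n):
    """Return Q such that approximately n percent of values are <= Q."""
    ordered = sorted(values)
    # smallest element with at least n% of the values at or below it:
    # ceil(n * len / 100) - 1, clamped to a valid index
    idx = max(0, -(-n * len(ordered) // 100) - 1)
    return ordered[idx]

data = list(range(1, 101))   # the values 1..100
print(quantile(data, 50))    # the 50%-quantile (median) -> 50
print(quantile(data, 90))    # -> 90
```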

radial basis function. In data mining functions, radial basis functions are used to predict values. They represent functions of the distance or the radius from a particular point. They are used to build up approximations to more complicated functions.

record. A set of one or more related data items grouped for processing. In reference to a database table, record is synonymous with row.

region. A (sub)set of records with similar characteristics in their active fields. Regions are used to visualize a prediction result.

round-robin method. A method by which items are sequentially assigned to units. When an item has been assigned to the last unit in the series, the next item is assigned to the first again. This process is repeated until the last item has been assigned. The Intelligent Miner uses this method, for example, to store records in output files during a partitioning job.
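A minimal sketch of the assignment scheme (not Intelligent Miner's implementation):

```python
def round_robin_partition(records, n_units):
    """Assign records to n_units output buckets in turn, wrapping around
    to the first bucket after the last one has received a record."""
    buckets = [[] for _ in range(n_units)]
    for i, record in enumerate(records):
        buckets[i % n_units].append(record)
    return buckets

print(round_robin_partition([1, 2, 3, 4, 5], 3))  # -> [[1, 4], [2, 5], [3]]
```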

rule. A clause in the form head ⇐ body. It specifies that the head is true if the body is true.

rule body. Represents the specified input data for a mining function.

rule group. Covers all rules containing the same items in different variations.

rule head. Represents the derived items detected by the Associations mining function.

scale. A system of mathematical notation: fixed-point or floating-point scale of an arithmetic value.

scaling. To adjust the representation of a quantity by a factor in order to bring its range within prescribed limits.

scale factor. A number used as a multiplier in scaling. For example, a scale factor of 1/1000 would be suitable to scale the values 856, 432, -95, and 182 to lie in the range from -1 to +1, inclusive.
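The arithmetic works out as follows (reproducing the glossary's example values):

```python
# Scaling by a factor of 1/1000 brings the example values into [-1, +1].
scale_factor = 1 / 1000
values = [856, 432, -95, 182]
scaled = [v * scale_factor for v in values]
print(scaled)
assert all(-1 <= s <= 1 for s in scaled)  # every scaled value is within range
```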

self-organizing feature map. See Kohonen Feature Map.

sensitivity analysis report. An output from the Classification - Neural mining function that shows which input fields are relevant to the classification decision.

sequential patterns. Intertransaction patterns such that the presence of one set of items is followed by another set of items in a database of transactions over a period of time.

similar time sequences. Occurrences of similar sequences in a database of time sequences.

Structured Query Language (SQL). An established set of statements used to manage information stored in a database. By using these statements, users can add, delete, or update information in a table, request information through a query, and display results in a report.

supervised learning. A learning algorithm that requires input and resulting output pairs to be presented to the network during the training process. Back propagation, for example, uses supervised learning and makes adjustments during training so that the value computed by the neural network approaches the actual value as the network learns from the data presented. Supervised learning is used in the techniques provided for predicting classifications as well as for predicting values.

support factor. Indicates the occurrence of the detected association rules and sequential patterns based on the input data.

symbolic name. In a programming language, a unique name used to represent an entity such as a field, file, data structure, or label. In the Intelligent Miner you specify symbolic names, for example, for input data, name mappings, or taxonomies.

taxonomy. Represents a hierarchy or a lattice of associations between the item categories of an item. These associations are called taxonomy relations.

taxonomy relation. The hierarchical associations between the item categories you defined for an item. A taxonomy relation consists of a child item category and a parent item category.

trained network. A neural network containing connection weights that have been adjusted by a learning algorithm. A trained network can be considered a virtual processor: it transforms inputs to outputs.

training. The process of developing a model that understands the input data. In neural networks, the model is created by reading the records of the input and modifying the network weights until the network calculates the desired output data.

translation process. Converting the data provided in the database to scaled numeric values in the appropriate range for a mining kernel using neural networks. Different techniques are used depending on whether the data is numeric or symbolic. Also, converting neural network output back to the units used in the database.

transaction. A set of items or events that are linked by a common key value, for example, the articles (items) bought by a customer (customer number) on a particular date (transaction identifier). In this example, the customer number represents the key value.

transaction ID. The identifier for a transaction, for example, the date of a transaction.

transaction group. The identifier for a set of transactions. For example, a customer number can represent a transaction group that includes all purchases of a particular customer during the month of May.

vector. A quantity usually characterized by an ordered set of numbers.

weight. The numeric value of an adaptive connection representing the strength of the connection between two processing units in a neural network.

winner. The index of the cluster which has the minimum Euclidean distance from the input record. Used in the Kohonen Feature Map to determine which output units will have their weights adjusted.


List of Abbreviations

AMRP air miles reward program

API application programming interface

CIM continuous interactive marketing

CPU central processing unit

CRM customer relationship marketing

DB2 DATABASE 2

GB gigabyte

GIS graphical information system

IBM International Business Machines Corporation

IT information technology

ITSO International Technical Support Organization

LIS large item sets

MBA market basket analysis

MDA multidimensional database analysis

MDL minimum description length

MPP massive parallel processor

OLAP online analytical processing

PC personal computer

POS point of sale

PROFS Professional Office System

R&D research and development

RBF radial basis function

RFM recency frequency monetary

RMS root mean square

ROI return on investment

SQL structured query language

TB terabyte


Index

A
accuracy 46, 98
affinity analysis 30
aggregate function
aggregation
algorithm selection
analysis
   affinity 30
   cluster detail 48, 53
   data 6
   decision tree 120
   factor 39
   intelligence 6
   link 9, 13
   market basket 13
   multidimensional database 7
   product affinity 13
   result 16, 101, 120
   time-series 113
anomalous decision tree 100
application
   data mining 8
   mode 24
architecture
   Intelligent Miner 20
   neural 118
association discovery 13
association rule discovery
attrition model
   clustering result 126
   data definition 114
   data preparation 116
   decision tree result 120
   gains chart 119
   input field selection 118
   mining process 113
   neural network result 126
   output field selection 118
   parameter selection 117
   RBF result 122
   result visualization 119
   time-series result 128
average error

B
behavior pattern
   customer 1
binary variables 50
business
   analyst 7

C
calculate ROI 88
campaign
   cross selling 9
categoric variables 12, 50
chart
   gains 105
cleaning
   data 92
cluster
   characterization 49, 54
   detail analysis 48, 53
   maximum number of 45
   profiling 48, 63
   result comparison 65
   selection 71, 76
   values 48
clustering
   demographic 12
   disadvantages 49
   mode 24
   neural 12
   process 44
competition
   focus on 3
components
   CRM 32
   Intelligent Miner 20
Condorcet criteria 12
confidence
   minimum 74
confusion matrix 65, 101
continuous marketing 28
continuous variables 50
create
   objective variable 90
CRM
   components 32
cross selling
   association discovery 75
   association rule discovery 77
   campaign 9
   cluster selection 71, 76
   data preparation 73
   data selection 72
   identification 32
   large item set removal 75
   mining process 70
   mining results 76
   opportunity identification 68
   parameter settings 74
   rebuild rules 76
   target marketing model 32


customer
   behavior pattern 1
   focus on 2
   purchasing pattern 8
   relationship management 25
   retention 30, 32, 110
   retention management 10
   segmentation 29, 32
customer segmentation
   cluster analysis 48, 53
   cluster characterization 49, 54
   cluster comparison 65
   cluster profiling 48, 63
   data preparation 37
   data selection 35
   decision tree characterization 65
   input field selection 47
   mining process 34
   output field selection 48
   parameter selection 45
   result visualization 48, 50

D
data
   access 21
   analysis 6
   cleaning 38, 92
   definition 21, 114
   flood 4
   mining 5, 16
   preparation 16, 37, 73, 92, 116
   reduction 16
   sampling 16, 93
   selection 16, 35, 72
   transformation 38, 92
data mining
   application 8
   process 34, 70, 89
   results 103
   techniques 9
data warehouse 3, 7
database
   analysis 7
   marketing 8
   segmentation 9, 11
decision tree 10, 49, 65, 96, 99
   anomalous 100
   parameter 97
   results 103
demographic clustering 12, 45
demographic profile 27
deviation
   detection 9
   standard 47
discovery
   association 13, 75
   association rule 77
   subpopulations 3
discrete numeric variables 50
discretization 39
distance
   absolute 47
   range 47
   standard deviation 47
drivers 2

E
enablers 4
error
   average 117
   rate 98

F
factor analysis 39
feature selection 95
focus
   on competition 3
   on customer 2
   on data assets 3
   relationship 2
forecast horizon 117
function
   aggregate 36
   analytical 23
   mining 23
   processing 24
   statistical 23

G
gains chart 105
geodemographic profile 27

H
hierarchy 73
horizon
   forecast 117

I
input field selection 47, 99
integer variables 50
Intelligent Miner
   architecture 20
   components 20
invalid value 38
item constraints 75
item set
   large 75
   removal 75


K
Kohonen feature map 12

L
large item sets
   removal 75
learning
   supervised 9
   unsupervised 11
level aggregation 73
library processing 22
link analysis 9, 13
logarithmic transformation 43

M
machine learning 1, 5
management
   customer relationship 25
market
   niche 2
   saturation 1
market basket analysis 13
marketing
   continuous 28
   database 8
matrix
   confusion 101
maximum rule length 75
minimum
   confidence 74
   support 74
mining
   base 22
   data 5
   functions 23
   kernel 22
   result 22
missing value 38
model
   attrition 110
   propensity 90
modeling
   predictive 9
modes
   application 24
   clustering 24
   test 24
   training 24
momentum 98, 118

N
network
   neural 98, 100
neural
   architecture 118
   clustering 12
   network 12, 98
   network parameter 98, 100
   prediction 19
numeric variables 50

O
output field selection 48, 100

P
parameter
   accuracy 98
   clustering algorithm 46
   error rate 98
   in-sample size 97, 98
   item constraints 74
   learning rate 98
   maximum number of clusters 45
   maximum number of passes 45
   minimum confidence 74
   minimum support 74
   momentum 98
   number of centers 97
   number of passes 97, 98
   number of records 97
   out-sample size 97, 98
   purity per internal node 97
   region size 97, 98
   rule length 74
   selection 45, 97, 117
   settings 74
   tree depth 97
passes
   maximum number of 45
permutation 75
prediction
   neural 19
   potential strategies 3
   profile 130
   strategies 2
   tactical movements 3
   time-series 128
   value 97
   value with RBF 100, 101
predictive modeling 9
probability weighting 47
process
   clustering 44
   data mining 34, 70, 89
processing
   functions 24
   library 22
product
   aggregation 72
   association analysis 73, 74
   hierarchy 73


product (continued)
   ID 73
product affinity
   analysis 13
product association 73, 74
profile prediction 130
profiles
   demographic 27
   geodemographic 27
   psychographic 27
project
   design 15
   evaluation 15
   management 15
   objectives 15
   plan 15
   team 15
propensity 11
   model 90
psychographic profile 27

R
range
   distance measure 47
rate
   error 98
   learning 98
RBF modeling
   result 122
record scores 48
result
   analysis 16, 101, 120
   data mining 103
   decision tree 103
   RBF modeling 122
   visualization 49, 100, 119
result visualization
   customer segmentation 48
ROI
   calculate 88
rules
   rebuild 76

S
sampling
   data 93
   stratified 94
scatterplot 11
scores
   record 48
scoring 11
segmentation 11
   customer 29, 32
   database 9
selection
   algorithm 96
   cluster 76
   data 35
   feature 95
   input field 47, 99
   output field 48, 100
   parameter 45, 97, 117
selling
   cross 9, 13, 30
   up 30
shareholder value 33, 36
similarity threshold 46
standard deviation 47
statistical functions 23
stratified sampling 94
subpopulation discovery 3
supervised learning 9
support
   minimum 74

T
target marketing
   algorithm selection 96
   data preparation 92
   data sampling 93
   decision tree result 103
   feature selection 95
   input field selection 99
   mining process 89
   neural network result 108
   output field selection 100
   parameter selection 97
   RBF result 106
   result analysis 101
   result visualization 100
   train and test 95
   variable creation 90
TaskGuide 19, 22
techniques
   data mining 9
test 95
threshold similarity 46
time sequence 13
time-series
   analysis 113
   parameter 117
   prediction 128
   result analysis 128
train 95
transformation
   data 38, 92
   logarithmic 43

U
unknown value 38
unsupervised learning 11
up-selling 30


user interface 21

V
value
   invalid 38
   missing 38
   prediction with RBF 97, 100, 101
   shareholder 33, 36
   unknown 38
   valid 38
variable
   binary 50
   categoric 12
   categorical 50
   continuous 50
   create objective 90
   discrete numeric 50
   integer 50
   numeric 50
visualization
   result 48, 49, 100, 119
visualizer 21

W
weighting
   information theoretic 47
   probability 47
window size 117


ITSO Redbook Evaluation

Intelligent Miner for Data Applications Guide
SG24-5252-00

Your feedback is very important to help us maintain the quality of ITSO redbooks. Please complete this questionnaire and return it using one of the following methods:

• Use the online evaluation form found at http://www.redbooks.com
• Fax this form to: USA International Access Code + 1 914 432 8264
• Send your comments in an Internet note to [email protected]

Please rate your overall satisfaction with this book using the scale:
(1 = very good, 2 = good, 3 = average, 4 = poor, 5 = very poor)

Overall Satisfaction ____________

Please answer the following questions:

Was this redbook published in time for your needs? Yes____ No____

If no, please explain:_____________________________________________________________________________________________________

_____________________________________________________________________________________________________

_____________________________________________________________________________________________________

_____________________________________________________________________________________________________

What other redbooks would you like to see published?_____________________________________________________________________________________________________

_____________________________________________________________________________________________________

_____________________________________________________________________________________________________

Comments/Suggestions: ( THANK YOU FOR YOUR FEEDBACK! )_____________________________________________________________________________________________________

_____________________________________________________________________________________________________

_____________________________________________________________________________________________________

_____________________________________________________________________________________________________

_____________________________________________________________________________________________________


SG24-5252-00
Printed in the U.S.A.

Intelligent Miner for Data Applications Guide SG24-5252-00

IBM
