Data Mining in the Chemical Industry

Alex Kalos
The Dow Chemical Company
2301 N. Brazosport Blvd.
Freeport, TX, USA 77541

Tim Rey
The Dow Chemical Company
2020 Dow Center
Midland, MI, USA 48674
ABSTRACT
In this paper we describe the experience of introducing data mining to a large chemical manufacturing company. The multi-national nature of doing business with multiple business units presents a unique opportunity for the deployment of data mining. While each business unit has its own objectives and challenges, which may be at odds with those of other units, the units also share many common interests and resources. In this environment, data mining can be used to identify potential value-creating opportunities through large-site integration of multiple assets and synergies from the use of common assets, such as site-wide manufacturing facilities, a worldwide supply chain, purchasing, and other shared services. However, issues arise, on one hand, from overly complex systems and, on the other hand, from the danger of reaching sub-optimal solutions if a big enough picture is not considered when executing projects. The company-wide initiative and use of Six Sigma at all levels of the company provided fertile ground for making the case for data mining and facilitating its acceptance. The Six Sigma mindset of measuring the performance of processes and analyzing data promotes data-based decision making, making data mining a natural extension of this methodology. We describe the approach for launching a data mining capability within this framework and the strategy for securing upper management support, drawing from internal modeling, statistical, and other communities, and from external consultants and universities. Lessons learned from industrial case studies, enterprise-wide tool evaluation, and peer benchmarking are discussed.
Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development – modeling methodologies.

General Terms
Management, Measurement, Documentation, Performance, Design, Human Factors, Standardization, Verification.

Keywords
Data mining, manufacturing, chemical industry.
1. INTRODUCTION
1.1 Six Sigma Initiative
In the late 1990s The Dow Chemical Company launched the practice of the Six Sigma methodology [4]. By now, almost everyone at Dow has been exposed to, or has in some way been involved in, Six Sigma. The Measure, Analyze, Improve, and Control (MAIC) phases are clearly delineated, and significant efforts and resources have gone into developing and delivering training materials on these topics. Until recently, however, the Define phase was practiced inconsistently. As a result, many Six Sigma projects were delayed or terminated due to the lack of, or poor execution of, defined deliverables for this phase. Furthermore, projects that at first blush might have looked good often turned out not to be viable.
It became apparent that a better, data-driven method was needed to identify projects and generate charters with a greater potential for success. Data Mining and Modeling (DMM), the methodology of finding relationships between inputs and outputs (modeling) and converting this exploratory model into value, was identified as a viable approach for accomplishing this, and a team was formed to bring knowledge about the DMM methodology into the company and make it accessible to the Six Sigma community at large.
Fortunately, Six Sigma has played a key role in promoting a mentality of continuous improvement: its foundation is that in order to fix something you have to be able to measure it and analyze it first. This mindset has provided fertile ground for making the case for data mining and facilitating its acceptance as a natural extension of these phases of the Six Sigma methodology. Upper management in particular was much more open to giving data mining a try and providing adequate resources to launch it at a significant level.
1.2 Unique Data Mining Needs of a Global Chemical Company
Although the company shares some of the same issues and concerns as other companies when it comes to data mining, it also differs in many ways from the companies where traditional data mining has largely been applied: insurance, banking, credit card, and financial institutions, and large retail or on-line stores, where millions of transactions may take place on a daily basis. For example, we have only 40,000 customers and just over 1 million shipments per year. Our transaction load is nowhere near that of a large on-line retailer (millions per day). It is probably fair to say that much vendor attention has been devoted to providing tools and methods that address those types of problems. And although other industries
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'05, August 21–25, 2005, Chicago, Illinois, USA.
Copyright 2005 ACM 1-59593-135-X/05/0008...$5.00.
Industry/Government Track Poster
can be the beneficiaries of such developments, there are still unique requirements driving the need for a different sort of data mining approach. In contrast to huge terabyte data sets, for example, we generally deal with smaller, gigabyte-size sets of greater variety: manufacturing process data, research and development data for new material and product development, marketing and business data (orders, purchasing, etc.), and supply-chain data of a globally distributed company dealing with multi-national regulatory issues. The company is essentially a collection of businesses, each with different (and sometimes opposing) needs and objectives, yet they share in many of the benefits that come from large-scale implementation and integration, especially at the geographic site level, through pooled resources and services. The end result is that any widely deployed data mining methodology and tools must be general and flexible enough to accommodate diverse needs, in order to solve local problems effectively while avoiding sub-optimization.
2. Strategy for Deploying Data Mining
2.1 Engaged External Resources
Consultants were contracted to assess the existing situation and recommend approaches for how and where a core capability should be established. Relationships were also established with universities. A special arrangement with Central Michigan University was set up whereby funding was provided to launch the CMURC Business Intelligence/Data Mining resource center in close proximity to company headquarters. This arrangement made available to the company both hardware and software resources, as well as personnel, to kick off data mining projects. Training of selected personnel on specialized software was also provided.
The intent of the CMURC is to provide an incubator-like environment to "kick the tires," so to speak, before fully investing in a data mining infrastructure. As in any investment, there are hidden costs to be concerned with. Since 80% of a data mining effort is in the data preparation phase, the extent to which a given company has invested in data warehouses and data marts will determine the size of this initial investment. Therefore, for a company like ours, where the transactional load is low compared to that of the large retail stores, it is important to demonstrate value with low initial costs before jumping in and investing heavily in expensive tools.
2.2 Core Group Creation
A core group was created with members of diverse backgrounds drawn from different functions representing the three main chunks of the company (manufacturing, commercial, and R&D). The group was chartered to develop a business plan, set standards and best practices, define infrastructure, identify and deploy enterprise-wide tools, execute large-scale projects, and manage external relations. The group also launched an evangelical-type communications campaign throughout the company (businesses, functions, and departments) and at all levels of management.
2.3 Training Curriculum
The first task of the core group was to pull together various people who were already devoting a significant part of their time as data and knowledge workers (e.g., math modelers, Six Sigma black belts, etc.) and elevate their knowledge of data mining through an intense training program. The participants were selected such that they would reside within various businesses, in order to seed interest by placing knowledgeable people within major functions.
After an initial search in the marketplace, it was decided that the kind of DMM course that would address the company's overall needs, one that would address business, manufacturing, and research & development needs, was rather unique. Such a course was simply not available on the market, and the course curriculum had to be developed internally, with the help of an external consultant. The curriculum development team consisted of Six Sigma Master Black Belts and data mining & modeling domain experts from inside and outside the company, who in addition to developing and delivering the curriculum were also assigned as mentors to course participants. One constraint in developing the curriculum was that it would have to be built around tools that were already deployed in the company. Essentially, this was considered a pilot, and it was important to do it at as low a cost as possible.
The expectation upon successful completion of the two-week-long course is that DMM practitioners would be able to perform advanced data analyses and create models that would enable them to make intelligent decisions regarding the viability of fixing a perceived process or product defect, and to know which part should be fixed if needed; in other words, to develop good project charters -- the actual fixing would still be left to the MAIC process. Still, the course was designed at the "101" level, i.e., targeted to the relative newcomer to DMM; additional, more advanced training would need to be sought by those who intend to be practitioners in the field.

The team began the curriculum development late in 2001 and delivered two waves of the course in 2002. Thirty-five people took the course, seven of whom went on to actively put this knowledge into practice. Although it is still too early to assess the full potential of this effort, 34 projects have been identified (13 of which have been initiated) with a potential value of over $1 billion.
3. Business Plan
The core group developed a business plan for setting up and acting on a corporate data mining effort. The key elements of the plan include: the mission, product/service offering, identified customer base, need for the business, value proposition, business strategy, resource needs, timetable, metrics, communication plan and requirements, constraints, and risks.

The primary outcome of the business plan is the development of a long-term, sustainable resource model for data mining and modeling in the company by:
- Setting long-term DMM strategy
- Developing and managing best practices
- Solving commercial and technical problems
- Building/managing DMM skills across the company
- Enabling DMM projects and personnel
- Supporting a data miners community
- Consulting on tools and approaches
- Leveraging external resources
A very formal approach was put in place to assess projects, to determine the support model, and to assess and track value.

4. Best Practices – The Data Mining/Modeling Process
One of the main activities of the core group is to identify, define, and promote best practices with regard to data mining. This has to be done in a manner consistent with the company's overall approach to establishing well-defined work processes, according to what is known as Most Effective Technology, or MET. This calls for detailed processes, roles and responsibilities, resource models, technology, training, and documentation.

In order to accommodate the diverse business needs for data mining, the group developed the Data Mining and Modeling (DMM) process (see Figure 1), based on its core set of competencies in mathematical modeling. When compared to other data mining methodologies, such as CRISP-DM [22], SEMMA [21], the Virtuous DM Cycle [2], or the KDD process [7], there are both similarities and differences. DMM takes a somewhat deeper look at the system under study, using various methods generally found in the "Systems" literature to detail the system. Also, there is a more formal separation between the process and the methods used to aid certain steps in the process.
[Figure 1 shows the DMM process as a flow from metrics and raw data, through information and knowledge, to preliminary Six Sigma charters. Its phases are Strategic Intent; System & Data Identification; Data Preprocessing; Opportunity Discovery; and Opportunity Deployment, with feedback to the business plan and to systematically fixing data collection methods and defects. Roles shown include business, site, and functional leadership, the local champion, the data miner, and black belts/Master Black Belts.]

Figure 1. Data Mining and Modeling Process
It is important to note that, despite its "block" or linear appearance, the DMM process, much like the KDD process [7], is highly iterative and interactive. A high-level description of each of the phases of the DMM process follows:
4.1 Strategic Intent
4.1.1 Assess Current Situation
The main deliverable of this step is a preliminary DMM project charter, which may include modifying an existing project charter that has been handed to the data miner. Under consideration here are understanding the business objectives, alignment with strategic business goals, and an initial assessment of the value of the opportunity, including the costs of doing the project as well as the estimated hard and soft benefits (e.g., revenue generation, cost reduction, etc.). The preliminary plan will include scope and boundaries, assumptions, constraints, risks, expected deliverables, and an initial translation of the business goals into data mining project tasks.
4.1.2 Validate & Cross Reference
The purpose of this step is to ensure that the proposed data mining project is consistent with other projects and initiatives that the business may already have underway, or is planning for the near future, as defined in the business's managing improvement (MI) plans as well as enterprise-wide initiatives. This ensures the relevance of the proposed data mining project and also avoids the sub-optimization that could result if the project were done in a vacuum. Another outcome of this step is to identify the business success criteria, including various measurements (customer, process, and financial measurements) and any relevant benchmarking studies.
4.2 System & Data Identification
4.2.1 Conduct Stakeholder Analysis
The purpose of this step is to identify the key stakeholders from a long list of potential candidates, including sponsors (people driving the project), those who will fund the project, other decision makers, and the process owners. Other stakeholders may include data miners, math modelers, subject matter experts and other technical people, black belts (a Six Sigma term referring to the individuals who will actually implement the solution), and finally the people who will ultimately benefit from the work (the end-users). A desired outcome of this step is to determine hypotheses held by the key stakeholders. This is important both as a means of cross-validating project expectations and for formulating the data mining plan. These hypotheses are determined through interview sessions using brainstorming and mind mapping techniques. Finally, the other objective here is to develop a communication plan -- what should be communicated to whom, and how often.
4.2.2 Discern Previous Work
The objective here is to review both internal documents and external literature to identify other related work on similar chemistries, systems, and work processes. Ideally, prior analysis and modeling work would be identified that may be useful for the proposed project in terms of lessons learned: what worked, what didn't, etc.
4.2.3 Determine and Document System Structure and Operation -- Basis to Proceed
The purpose of this step is to document as much as possible about the system under consideration, in order to provide a meaningful context for data mining. The documentation may include system diagrams and process maps, mind maps, relationship maps, information exchange diagrams, flow diagrams of material and money flow, and envelope analysis.
4.2.4 Identify & Understand the Sources of Data
The objective here is to identify all sources of data and determine whether all of the data needed for analysis is already available or whether it is necessary to start a data acquisition campaign. The desired outcome is to identify the specific data sets (e.g., named sources and databases), to do a preliminary assessment of any gaps in the data, to identify important missing variables and ways to remedy them, and to determine whether not being able to acquire missing data or variables in a timely fashion would be detrimental to the project.
In addition, here we collect as much information as possible
regarding the data sources, including the collection method
(manual, electronic, or instrument).
4.2.5 Team Characteristics/Composition
Here, we identify the core group of people who will be involved in the execution of the data mining project. This typically includes data miners, process and system domain experts, data content experts, and data access/extraction experts.
4.2.6 Develop Specific Problem Statement
Given what has been found up to this point, including input from stakeholders, prior work, and what is now known about the available data, it is time to develop a specific problem statement. This is done in the form of a project charter, and templates have been developed to facilitate capturing the important elements. The problem statement is reviewed with the process owner and other key stakeholders.
4.2.7 Understand the Existing Data -- Build Context for the Data
This step is aimed at a detailed understanding of the data, including the sampling frequency (and, if the data is from a process information data historian, the granularity and any filtering or smoothing that may have been done), any intentional biases (and, if outliers have been omitted, the selection criteria), update frequencies, alignment criteria, collection gaps, and any processing algorithms that may already have been applied to the data. Also, here we identify the input and output variables; describe the variable types and attributes; and document attribute definitions, the scale (interval, nominal, ordinal), units of measure, standard operating ranges (upper and lower limits), and any real physical limits. We also determine whether a given variable is measurable, controllable, random, or descriptive. Formatting issues are documented, e.g., file types (flat files, relational files, delimiters, missing data indicators, etc.). Finally, sources and magnitudes of measurement error are identified.
All of this builds the appropriate context for mining the data and is driven by a principle of "functional paranoia" that can best be described via a series of questions: What has been done to the data? Is it filtered, aggregated, calculated, or measured? What are the update timing and time-stamp rules? What is the data taxonomy, and how does it map to the business structure? Why was the data collected? How does it fit in with the project goals? Why is it needed? Are there gaps or holes in time? Is the data available at the right frequency and the appropriate level of granularity to solve the problem at hand? What about the information content: are there lots of rows of data but not enough variation in the patterns?
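These audit questions lend themselves to a simple automated first pass. The sketch below profiles each variable's type, completeness, and cardinality with pandas; the column names (reactor_temp, batch_id) are hypothetical, and the paper does not prescribe any particular tool for this step:

```python
import numpy as np
import pandas as pd

def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Answer the basic 'functional paranoia' questions per column:
    what type is it, how complete is it, and what range does it span?"""
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "variable": col,
            "dtype": str(s.dtype),
            "pct_missing": round(100 * s.isna().mean(), 1),
            "n_unique": s.nunique(dropna=True),
            "min": s.min() if numeric else None,
            "max": s.max() if numeric else None,
        })
    return pd.DataFrame(rows)

# Hypothetical process-historian extract with a gap in one tag.
data = pd.DataFrame({
    "reactor_temp": [310.2, 311.0, np.nan, 309.8],
    "batch_id": ["A1", "A1", "A2", "A2"],
})
report = audit(data)
```

A report like this makes gaps and suspicious ranges visible before any modeling is attempted.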
4.2.8 Secure/Collect Needed Data
The purpose of this step is to assemble all relevant sources of data and/or start a data collection campaign to secure needed data that is not yet available. This step generally requires working with database analysts who will do the actual data extraction, so it is important to be very specific about how the data sets should be created, including prescriptions for the time frame, level/aggregation, frequency, scope, format, variable names, file type, delimiters, index variables, sort-merge-stack sequences, harmonization strategy for multiple data sets, etc.
4.3 Data Preprocessing
The data is prepared and structured so that it may be imported into the analysis platform. This may involve harmonizing data from multiple data sets, as well as merging and stacking the data, and is often done outside of the final analysis platform through the use of desktop databases. The merged dataset is then imported for analysis.
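A hypothetical illustration of the harmonize, merge, and stack steps; the work described here was done with desktop databases rather than Python, so this pandas sketch (with invented table and column names) only mirrors the logic:

```python
import pandas as pd

# Two hypothetical extracts with inconsistent key names.
orders = pd.DataFrame({"cust": ["C1", "C2"], "volume_kg": [1000, 2500]})
loyalty = pd.DataFrame({"customer_id": ["C1", "C2"], "score": [8.1, 6.4]})

# Harmonize: agree on one name for the shared key.
orders = orders.rename(columns={"cust": "customer_id"})

# Merge: join the two sources on that key.
merged = orders.merge(loyalty, on="customer_id", how="inner")

# Stack: two period files with identical layout become one table.
q1 = merged.assign(quarter="Q1")
q2 = merged.assign(quarter="Q2")
stacked = pd.concat([q1, q2], ignore_index=True)
```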
4.3.1 Preliminary Data Analyses
This phase essentially follows a traditional data analysis approach [14], [20]. Major steps include visual exploration and inspection of descriptive statistics for interval and ordinal variables, assessing information content, assessing collinearity in independent variables and eliminating redundant variables, and assessing variability and time characteristics. Also, we identify missing values and devise a strategy to handle them, and we identify and handle outliers.
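A minimal sketch of the collinearity and outlier screens on synthetic data (the variable names are invented; thresholds of |r| > 0.95 and |z| > 3 are common choices, not prescriptions from the paper):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "flow": x1,
    "flow_dup": x1 * 2 + 0.01 * rng.normal(size=200),  # near-duplicate signal
    "pressure": rng.normal(size=200),
})

# Collinearity screen: flag variables highly correlated with an earlier one.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
reduced = df.drop(columns=redundant)

# Outlier screen on the remaining variables (|z-score| > 3).
z = (reduced - reduced.mean()) / reduced.std()
n_outlier_rows = int((z.abs() > 3).any(axis=1).sum())
```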
4.3.2 Variable Selection
In this phase, we select the variables that will be used as inputs and outputs. It is characterized primarily by four aspects: imputation, feature creation, variable reduction, and variable partitioning. Imputation techniques are used to fill in missing data in cases where there is still adequate information to consider the variable as an input. Feature creation techniques are used to create meta variables, i.e., variables not directly found in the original data but that may be derived by combining or transforming other variables that are in the data. This generally requires domain expertise to draw from physicochemical principles. Variable reduction and variable partitioning are done when there are a lot of variables. In some cases, too many dimensions may pose a problem for some of the downstream modeling techniques, so down-selection becomes a necessity. Clustering, principal components, and other multivariate techniques are used for these activities.
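Three of the four aspects can be sketched with scikit-learn on synthetic data; the "ratio" meta variable stands in for a domain-derived feature and is purely illustrative, not one from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
X[::10, 2] = np.nan  # sporadic gaps in one sensor

# Imputation: fill gaps so the variable can still serve as an input.
X_filled = SimpleImputer(strategy="median").fit_transform(X)

# Feature creation: a hypothetical meta variable, e.g. a ratio of two
# measured quantities suggested by physicochemical reasoning.
ratio = X_filled[:, 0] / (np.abs(X_filled[:, 1]) + 1.0)
X_aug = np.column_stack([X_filled, ratio])

# Variable reduction: keep the principal components that explain
# 90% of the variance, rather than all raw inputs.
pca = PCA(n_components=0.90).fit(X_aug)
X_reduced = pca.transform(X_aug)
```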
4.3.3 Transform/Recode Data
Standard transformation techniques (e.g., standardization, normalization, log and other transforms) are used to recode the data. Categorical data is recoded into numerical data as appropriate.
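For example (hypothetical data; the base-10 log and the one-hot recoding are illustrative choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "conc": [1.0, 10.0, 100.0, 1000.0],      # spans decades: log transform
    "grade": ["low", "high", "low", "mid"],  # categorical: recode to numeric
})

# Log transform, then standardize to zero mean and unit variance.
df["log_conc"] = np.log10(df["conc"])
df["conc_std"] = (df["log_conc"] - df["log_conc"].mean()) / df["log_conc"].std()

# One-hot recode the categorical variable into numeric indicator columns.
encoded = pd.get_dummies(df, columns=["grade"])
```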
4.3.4 Document & Validate Data Findings and Derivations
The last step of the data preprocessing phase is to document the data findings and transformations and to communicate and validate these with the stakeholders. Pending acceptance, the data sets are finalized to be used for modeling.
4.4 Opportunity Discovery
This phase is essentially the heart of the DMM process, consisting of two main activities: exploratory data analysis and modeling.
4.4.1 Develop Data Analysis Strategy
Here, the decision is made as to what analyses and modeling techniques will be used. This is done on the basis of the type of problem at hand. For supervised-type problems (i.e., when there is a response variable), the choices depend on whether it is a prediction, classification, estimation, or optimization problem. For unsupervised-type problems (when there is no response variable), the choice may be clustering, association, or linkage
type models. A methods-selection decision tree, in the form of an interactive mind map, has been developed and is provided to the data miner to facilitate the selection of the appropriate analysis and modeling methods, emphasizing the assumptions, strengths, and limitations of each method, and providing a framework for assessing methods on their own as well as comparing them to one another. After the analysis/modeling methods have been selected, it may be necessary to re-format or re-structure the data to accommodate these methods. Also, it may be necessary to re-sample the data, as the data set may be too large for some techniques.
4.4.2 Conduct Exploratory Analysis
The aim here is to look for patterns and themes using visualization and other techniques, such as distributions and histograms, X-by-Y plots, contingency plots, linear regression, clustering, and recursive partitioning. If necessary, row reduction is done to enhance information content. Feature extraction and dimensionality reduction are done if necessary, using techniques like principal components analysis. Finally, in the case of highly non-linear systems, we consider the generation of alternative functional forms via genetic programming. Using such features can help to linearize the problem and thus make it amenable to standard techniques.
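A toy version of the alternative-functional-forms idea: rather than running genetic programming, the sketch below scores a small fixed set of candidate transforms by how linearly each relates to the response, on synthetic data with a hidden log relationship:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.1, 5.0, size=300)
y = 2.0 * np.log(x) + 0.05 * rng.normal(size=300)  # hidden nonlinearity

# Candidate functional forms; genetic programming would search a much
# richer space of expressions than this fixed list.
candidates = {"x": x, "x^2": x**2, "sqrt(x)": np.sqrt(x), "log(x)": np.log(x)}

# Pick the form most linearly related to the response.
scores = {name: abs(np.corrcoef(f, y)[0, 1]) for name, f in candidates.items()}
best = max(scores, key=scores.get)
```

The winning transform can then be fed into ordinary linear modeling, which is the "linearize the problem" payoff described above.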
4.4.3 Build Models & Assess Model Performance
The data is partitioned into training, validation, and test sets [10]. Again, depending on whether or not a response variable is present and on the types of variables, different techniques may be appropriate [5], but the general approach is to try linear methods first, since these provide ample tests for the significance of the models, and then to move on to non-linear models if necessary. The general suite of methods used includes clustering techniques, principal components analysis, discriminant and factor analysis, linear and logistic regression, decision trees, and neural networks [9], [15], [23]. For time-series-type models, special techniques are used [3], aimed at accounting for different sources of variability in order to identify any periodic patterns or trends in the data.

Individual models are assessed for performance according to technique-specific procedures (e.g., F-test, correlation coefficient, lack of fit, root mean square error, predicted-vs.-actual plots, residual plots, etc.), as well as the differences in model fit on the training, validation, and test sets. Comparison of performance between different types of models is trickier (e.g., linear regression vs. neural nets), since significance tests that apply to one do not always apply to the other. The stability of non-linear models in particular is checked to ensure that convergence has been reached and that the parameter estimation procedure did not get stuck at a local minimum. As is the case for the entire DMM process, this phase in particular is highly iterative; it may be necessary to cycle back to the beginning and re-build models. Finally, the best model or models are chosen and assessed against business criteria in order to identify and validate relationships that can be explored for further opportunities for improvement.
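The partition-then-linear-first progression can be sketched with scikit-learn on synthetic data; the practitioners described here used JMP and other commercial tools, so this is only an illustration of the workflow, not their implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=400)  # mildly nonlinear

# Partition into training, validation, and test sets.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4,
                                              random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                            random_state=0)

# Try the linear model first...
lin = LinearRegression().fit(X_tr, y_tr)
rmse_lin = mean_squared_error(y_val, lin.predict(X_val)) ** 0.5

# ...and move to a non-linear model only because the residuals warrant it.
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000,
                   random_state=0).fit(X_tr, y_tr)
rmse_net = mean_squared_error(y_val, net.predict(X_val)) ** 0.5
```

The held-out test set is touched only once, to report the chosen model's final performance.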
4.5 Opportunity Deployment
This phase is characterized by three main activities. First, the developed model(s) may be used to make an immediate business decision; this in and of itself may be the extent of a particular data mining project. Second, the developed model may be adequate/accurate enough to be implemented as part of an on-line or real-time system; in this case, further development may be needed (e.g., software development or re-coding in a format appropriate for the deployment platform). The third outcome of the data mining effort is that opportunities are identified that will need further work to be realized. These are cast as Six Sigma MAIC projects and are turned over to black belts. In this case, it is the job of the data miner to define the project charters, including an assessment of the relative value of the discovered opportunities, identification of uncertainties, full documentation of all data mining activities and key findings, a description of the model(s), a summary of the results, and any recommendations and identification of potential challenges.
5. Evaluation of Enterprise-wide Tools
As part of the development of most effective technology (MET), we have formally launched an evaluation of various technology solutions. Independent entities like Gartner, Forrester Research, Frost & Sullivan, etc. have done the same over the years, but each organization has its own set of requirements and needs to go through its own learning curve with the technology. Thus, we have established a formal pilot to review the top players in the industry and will then conduct a formal assessment and choose the technology that best suits the company's needs.
It is important to note that we do not assume that only one technology will suit all of our needs. In fact, we will adopt a layered approach to technology. Reporting and OLAP tools are at the base of this pyramid, used by thousands of people in the company. At the next level, we have mid-range statistical tools. Specifically, JMP [21] is broadly used at this level, with over 3000 users. JMP does in fact have some basic exploratory data analysis/data mining capabilities (PLS, neural networks, decision trees, and linear time series) and was selected as the basis for our DMM 101 training curriculum. In the JMP space there are two tiers of modelers: those who have been trained fully in basic statistics via our Six Sigma program, and those who have taken our entry-level DMM course. At the next level, and this is where the evaluation will take place, we expect to install a single enterprise-wide tool like SAS/EM, SPSS Clementine, S+ InsightfulMiner, IBM's IntelligentMiner, etc., for which we expect about 50-75 users with the appropriate training to utilize it as their primary tool. Beyond that layer, we will draw from specialty packages like those found in Wolfram's Mathematica or MathWorks' MATLAB (e.g., symbolic regression, support vector machines, genetic algorithms, etc.). In this tier we expect only a dozen or so of the highest-level modelers to be involved.
6. Collaboration with Peers & Key Customers
An important part of learning how to structure a data mining effort is collaborating with peers. The CMURC environment is designed to do just that. On a quarterly basis, companies such as The Dow Chemical Company, Ford, Eli Lilly, Steelcase, Henry Ford Health Care Systems, EDS, Kelly Services, GFS, KitchenAid, IBM, SAS, ESRI, Harris Interactive, etc. get together to discuss BI/DM applications, data sources, and technology trends. This allows companies to get "out of the box" a bit and generate ideas of where and how to apply data mining in their own situations.
6.1 Benchmarking
In order to better understand how to design, support, value, fund, and gain acceptance of a data mining effort, The Dow Chemical Company, Ford, and Eli Lilly have joined with CMURC to design and launch a BI/DM benchmarking study. This study will give participants a look into various kinds of organizations and industries that are at various stages of BI/DM implementation. Results of the study will be presented at the July 2005 BI Forum at CMU.
7. Case Study
The first project that we did in partnership with the CMURC was an effort to link Customer Loyalty to financial impact [12]. Dow has a long history of collecting Customer Satisfaction/Loyalty-type perception data. From 1999 to 2005, some 50+ separate studies were run across the globe, resulting in a Customer Loyalty data repository of over 30,000 observations, of which two-thirds is competitive data. Considering the costs of design, collection, analysis, reporting, and action, as the results of this work are used in market planning, setting service standards, and feeding Six Sigma projects, Customer Loyalty can cost a company a considerable amount of time and money. Thus, as with any large company initiative, the question is asked: does Customer Loyalty make any difference financially? As most people associated with the Customer Satisfaction industry realize, this is in fact the "holy grail."
In order to establish the value proposition for loyalty, a large data mining effort was undertaken. Data was amassed from sources ranging from perception data, which included point-in-time market orientation assessments of the different business units, global employee satisfaction studies, customer complaint data, and the customer loyalty data, to behavior data, which included volume and sales trend data, pricing data, profitability data, and attraction and retention data. These data sets ranged in size from dozens of variables and hundreds of observations to hundreds of variables and millions of observations.
This data was harmonized with a series of complex SAS programs in order to produce a modeling data set. Data preparation processes included, but were not limited to, imputation, hostage modeling and removal, outlier testing and removal, transformations, and smoothing. The fundamental model used was based on a blend of the theoretical frameworks for Customer Loyalty of Rust [19], Gale [8], Reichheld [16], Oliver [13], and Johnson and Gustafsson [11], and is primarily linked to the work of Rey and Johnson [17]. This fundamental hierarchical path modeling framework is well suited to the Customer Loyalty problem, but assumes that the relationships are all linear. Various authors have shown this not to be the case [1]. Thus, a structured neural network approach was used to model first the "within study" framework, then the "across many studies" framework, and finally the full customer loyalty-profit chain framework. This work was in fact unique, in that some authors have claimed that there has never been an account-level modeling effort that showed the linkage between customer loyalty and financial impact.
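The actual harmonization and preparation were done in SAS; as an illustrative sketch only, the same classes of preparation steps (imputation, outlier testing and removal, transformation, smoothing) look roughly like this in Python/pandas. All column names and thresholds here are hypothetical.

```python
import numpy as np
import pandas as pd

def prepare_modeling_data(df):
    """Illustrative data-preparation pipeline: impute, trim outliers,
    transform, and smooth. Column names are hypothetical."""
    out = df.copy()
    # Imputation: fill missing survey scores with the column median.
    out["loyalty_score"] = out["loyalty_score"].fillna(
        out["loyalty_score"].median())
    # Outlier testing and removal: drop rows more than 3 standard
    # deviations from the mean on a key financial measure.
    z = (out["sales_volume"] - out["sales_volume"].mean()) \
        / out["sales_volume"].std()
    out = out[z.abs() <= 3]
    # Transformation: log-transform a skewed financial measure.
    out["log_volume"] = np.log1p(out["sales_volume"])
    # Smoothing: 3-period rolling mean of the transformed measure.
    out["volume_smooth"] = out["log_volume"].rolling(
        3, min_periods=1).mean()
    return out.reset_index(drop=True)
```

Real pipelines of this kind also need the domain-specific steps mentioned above (e.g., hostage modeling and removal), which have no off-the-shelf equivalent.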
In the end, data miners would set themselves up for failure if they expected to find loyalty as the key driver of financial impact. In general, the global economy, regional economy, industry economy, market economy, company economy, business economy, and customer economy will play a larger role in the financial landscape than loyalty alone. Breaking a financial number like profitability down to its constituent parts, one sees that there are very few aspects of profitability that can actually be affected by loyalty. Ultimately, margin is a function of revenue and costs, and revenue is based on volume and price. Thus, what can loyalty affect in volume, price, and cost? It has been hypothesized by many authors that loyal customers bring more volume, allow higher prices, and cost less to serve. In an industrial commodity market with deep and frequent price cycles, like that found in the chemical industry, this often translates into a slower rate of change of volume and price for loyal customers, higher account share for loyal customers, and lower costs to serve in the long run.
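This decomposition can be made concrete with a small numeric sketch. All figures below are hypothetical, chosen only to illustrate the arithmetic of how slower volume/price erosion and lower cost to serve compound into margin.

```python
def margin(volume, price, unit_cost):
    """Margin = revenue - cost, with revenue = volume * price."""
    return volume * price - volume * unit_cost

# Two hypothetical accounts entering a price down-cycle.
# Loyal account: volume and price erode more slowly, cost to serve is lower.
loyal = margin(volume=950, price=0.92, unit_cost=0.70)
# Non-loyal account: sharper volume/price erosion, higher cost to serve.
nonloyal = margin(volume=850, price=0.88, unit_cost=0.74)
# The per-lever differences are modest, but the margin gap compounds.
gap = loyal - nonloyal
```

The point of the sketch is simply that loyalty's plausible effects enter through a small number of levers (volume, price, cost to serve), and even modest differences on each compound into a material margin gap.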
Using primarily a cross-sectional approach, and applying traditional time-lagged, econometrically adjusted models based on previous years' financial trends, this data mining effort did in fact show that customer loyalty perceptual measures, market orientation perceptual measures, and employee satisfaction perceptual measures contribute, at the account level, to the explanation of financial impact in a statistically significant fashion (Rey [18]). The findings relating perceptual vs. behavioral measures followed much of what Dick and Basu reported [6].
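The actual modeling was done in SAS with far richer controls and adjustments; the following is only a minimal sketch, on synthetic data, of the general shape of such a time-lagged account-level model. All variable names and coefficient values are invented.

```python
import numpy as np

def fit_lagged_model(prior_profit, loyalty, current_profit):
    """Ordinary least squares of current-period profit on an intercept,
    the prior period's profit (the time lag), and a loyalty perception
    score. Returns [intercept, b_lagged_profit, b_loyalty]."""
    X = np.column_stack([np.ones_like(prior_profit), prior_profit, loyalty])
    beta, *_ = np.linalg.lstsq(X, current_profit, rcond=None)
    return beta

# Synthetic account-level data in which loyalty has a true positive
# effect (3.0) after controlling for the prior year's profitability.
rng = np.random.default_rng(0)
n = 200
prior = rng.normal(100.0, 20.0, n)      # prior-year profit
loyalty = rng.uniform(1.0, 10.0, n)     # loyalty perception score
current = 5.0 + 0.8 * prior + 3.0 * loyalty + rng.normal(0.0, 5.0, n)

beta = fit_lagged_model(prior, loyalty, current)
```

With the lagged financial trend controlled for, the loyalty coefficient captures the incremental explanatory contribution of the perceptual measure, which is the kind of statistically significant account-level effect the study found.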
8. Conclusion
In this paper we described our approach to introducing data mining in a large, global chemical company. Due to the unique nature of the company and the dynamic nature of the needs of its constituent businesses, a customized and targeted methodology had to be developed, borrowing beneficial aspects from published methodologies, while fine-tuning other aspects to better suit our special needs.
Some lessons learned include that data mining is not for the faint of heart in terms of quantitative methods, and that data preparation is an important skill set (one that separates data extractors from data miners). Along the way, we had to deal with the ubiquitous abuse of the term "data mining": it means different things to different people; anyone who has ever manipulated a spreadsheet is a self-proclaimed data mining expert, and data extractors are considered by many to be data miners. In a way, we, the vendors, and the trade journals carry some of the blame: in our zeal to preach the virtues of data mining, we often end up overselling and hyping, which leads to unrealistic expectations. Some of the misconceptions at high management levels included that only "big iron" will do, that data mining is only for terabyte-scale problems, that it is too esoteric, and that no one within the company knows modeling. We also confirmed our notion that while it is important to leverage external resources as much as possible, it is equally important to have experienced people internally to jump-start the process, oversee projects, and keep the consultants honest.
On the plus side, the widespread use of Six Sigma methods, and the measuring and analyzing mindset that it promotes, proved to be catalysts for both motivating the use of data mining and facilitating its acceptance.
9. ACKNOWLEDGMENTS
Our thanks to Dorian Pyle of Data Miners Inc. for his assistance in developing and deploying the training program. We would also like to thank Jim Mentele, Tim Pletcher, and Carl Lee of CMURC, as well as our Dow colleagues Andy Paquet, Ken Beebe, Dave Rothman, and Mike Costa.
10. REFERENCES
[1] Anderson, E. W. and Mittal, V., Strengthening the Satisfaction-Profit Chain, Journal of Service Research, Vol. 3, No. 2, (Nov. 2000), 107-120.
[2] Berry, M. J. A. and Linoff, G. S., Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, (1997), 17-19 and 30-34.
[3] Box, G., Jenkins, G., and Reinsel, G., Time Series Analysis: Forecasting and Control, Third Edition, Pearson Education, Inc., 1994.
[4] Breyfogle, F. W. III, Implementing Six Sigma: Smarter Solutions Using Statistical Methods, Wiley-Interscience, 1999.
[5] Dhar, V. and Stein, R., Seven Methods for Transforming Corporate Data into Business Intelligence, Prentice Hall, 1996.
[6] Dick, A. S. and Basu, K., Customer Loyalty: Toward an Integrated Conceptual Framework, Journal of the Academy of Marketing Science, 22 (2), (1994), 99-113.
[7] Fayyad, U. M., Piatetsky-Shapiro, G., and Smyth, P., From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, pp. 1-34, AAAI Press/MIT Press, 1996.
[8] Gale, B. T., Managing Customer Value, The Free Press, New York, New York, 1994.
[9] Hand, D. J., Mannila, H., and Smyth, P., Principles of Data Mining, MIT Press, 2001.
[10] Haykin, S., Neural Networks: A Comprehensive Foundation, Second Edition, Prentice Hall, New Jersey, 1999.
[11] Johnson, M. and Gustafsson, A., Improving Customer Satisfaction, Loyalty and Profit: An Integrated Measurement and Management System, Jossey-Bass, San Francisco, 2000.
[12] Lee, C., Mentele, J., Gaver, Rey, T. D., Structured Neural Network Techniques for Modeling Loyalty and Profitability, SUGI 2005.
[13] Oliver, R. L., Satisfaction: A Behavioral Perspective on the Consumer, McGraw-Hill, New York, 1997.
[14] Pyle, D., Data Preparation for Data Mining, Morgan Kaufmann, 1999.
[15] Pyle, D., Business Modeling and Data Mining, Morgan Kaufmann, 2003.
[16] Reichheld, F., The Loyalty Effect: The Hidden Force Behind Growth, Profits, and Lasting Value, Harvard Business School Press, Boston, 1996.
[17] Rey, T. D. and Johnson, M., Modeling the Connection Between Loyalty and Financial Impact: A Journey. In Earning a Place at the Table, 23rd Annual Marketing Research Conference, American Marketing Association, Chicago, IL, September 8-11, 2002.
[18] Rey, T. D., Tying Customer Loyalty to Financial Impact. In Symposium on Complexity and Advanced Analytics Applied to Business, Government and Public Policy, Society for Industrial and Applied Mathematics, Great Lakes Section, University of Michigan, Dearborn Campus, October 23, 2004.
[19] Rust, R. T., Zahorik, A. J., and Keiningham, T. L., Return on Quality: Measuring the Financial Impact of Your Company's Quest for Quality, Probus Professional Publishing, 1993.
[20] Wang, X. Z., Data Mining and Knowledge Discovery for Process Monitoring and Control (Advances in Industrial Control), Springer Verlag, 1999.
[21] SAS Institute, http://www.sas.com/technologies/analytics/datamining/miner/semma.html, 2005.
[22] Shearer, C., The CRISP-DM Model: The New Blueprint for Data Mining, Journal of Data Warehousing, Vol. 5, No. 4, (2000), 13-22.
[23] Witten, I. H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, 1999.