Data Mining Tools Overview & Tutorial
Ahmed Sameh
Prince Sultan University, Department of Computer Science & Info Sys
May 2010
(Some slides belong to IBM)
Slide 2
Introduction Outline
- Define data mining
- Data mining vs. databases
- Basic data mining tasks
- Data mining development
- Data mining issues
Goal: Provide an overview of data mining.
Slide 3
Introduction
- Data is growing at a phenomenal rate
- Users expect more sophisticated information
- How? UNCOVER HIDDEN INFORMATION: DATA MINING
Slide 4
Data Mining Definition
- Finding hidden information in a database
- Fit data to a model
- Similar terms
  - Exploratory data analysis
  - Data driven discovery
  - Deductive learning
Slide 5
Data Mining Algorithm
- Objective: Fit Data to a Model
  - Descriptive
  - Predictive
- Preference: Technique to choose the best model
- Search: Technique to search the data
  - Query
Slide 6
Database Processing vs. Data Mining Processing
Database processing
- Query: well defined; SQL
- Data: operational data
- Output: precise; subset of database
Data mining processing
- Query: poorly defined; no precise query language
- Data: not operational data
- Output: fuzzy; not a subset of database
Slide 7
Query Examples
Database
- Find all credit applicants with last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.
Data Mining
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)
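The milk example can be sketched in a few lines of Python. This is only an illustration: the `baskets` data, the function names, and the `min_count` threshold are hypothetical and not part of the original slides. The first function answers a precise database-style query; the second answers the looser mining-style question by counting co-occurrences.

```python
from collections import Counter

# Hypothetical purchase records: (customer, set of items in one basket).
baskets = [
    ("ann", {"milk", "bread"}),
    ("bob", {"milk", "bread", "eggs"}),
    ("cat", {"beer", "chips"}),
    ("dan", {"milk", "eggs"}),
]

def customers_who_bought(item):
    """Database-style query: a precise predicate that returns a subset of the data."""
    return sorted(c for c, items in baskets if item in items)

def items_bought_with(item, min_count=2):
    """Mining-style question: which other items co-occur with `item` often enough?"""
    counts = Counter()
    for _, items in baskets:
        if item in items:
            counts.update(items - {item})
    return {other: n for other, n in counts.items() if n >= min_count}

milk_buyers = customers_who_bought("milk")
milk_companions = items_bought_with("milk")
print(milk_buyers)       # ['ann', 'bob', 'dan']
print(milk_companions)   # bread and eggs, each seen twice with milk
```

Note that the second result is not a subset of the stored rows; it is a summary derived from them, which is the distinction the slide draws.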
Slide 8
Related Fields
[Diagram: Data Mining and Knowledge Discovery shown alongside Statistics, Machine Learning, Databases, and Visualization]
Slide 9
Statistics, Machine Learning and Data Mining
- Statistics:
  - more theory-based
  - more focused on testing hypotheses
- Machine learning
  - more heuristic
  - focused on improving performance of a learning agent
  - also looks at real-time learning and robotics, areas not part of data mining
- Data Mining and Knowledge Discovery
  - integrates theory and heuristics
  - focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
- Distinctions are fuzzy
Slide 10
Definition
- A class of database applications that analyze data in a database using tools which look for trends or anomalies.
- IBM was among the early developers of commercial data mining tools.
Slide 11
Purpose
- To look for hidden patterns or previously unknown relationships in a body of data that can be used to predict future behavior.
- Ex: Data mining software can help retail companies find customers with common interests.
Slide 12
Background Information
- Many of the techniques used by today's data mining tools have been around for many years, having originated in the artificial intelligence research of the 1980s and early 1990s.
- Data mining tools are only now being applied to large-scale database systems.
Slide 13
The Need for Data Mining
- The amount of raw data stored in corporate data warehouses is growing rapidly.
- There is too much data, and too much complexity, that might be relevant to a specific problem.
- Data mining promises to bridge the analytical gap by giving knowledge workers the tools to navigate this complex analytical space.
Slide 14
The Need for Data Mining, cont.
- The need for information has resulted in the proliferation of data warehouses that integrate information from multiple sources to support decision making.
- These often include data from external sources, such as customer demographics and household information.
Slide 15
Definition (Cont.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
- Valid: The patterns hold in general.
- Novel: We did not know the pattern beforehand.
- Useful: We can devise actions from the patterns.
- Understandable: We can interpret and comprehend the patterns.
Slide 16
Of Laws, Monsters, and Giants
- Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
- Its more aggressive cousin:
  - Disk storage capacity doubles every 9 months
What do the two laws combined produce? A rapidly growing gap between our ability to generate data and our ability to make use of it.
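As a rough worked example of that gap, the sketch below compounds the two doubling periods quoted on the slide over three years. The function name and the 36-month horizon are illustrative assumptions only.

```python
def growth_factor(months, doubling_period_months):
    """How many times a quantity multiplies if it doubles every `doubling_period_months`."""
    return 2 ** (months / doubling_period_months)

months = 36                               # illustrative three-year horizon
processing = growth_factor(months, 18)    # processing capacity: two doublings
storage = growth_factor(months, 9)        # disk capacity: four doublings
gap = storage / processing                # how much faster storage grew

print(processing, storage, gap)           # 4.0 16.0 4.0
```

Under these assumptions, storage grows four times faster than processing over three years, and the ratio itself doubles every 18 months.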
Slide 17
What is Data Mining?
Finding interesting structure in data
- Structure: refers to statistical patterns, predictive models, hidden relationships
- Examples of tasks addressed by Data Mining
  - Predictive Modeling (classification, regression)
  - Segmentation (Data Clustering)
  - Summarization
  - Visualization
Slide 18
Slide 19
Major Application Areas for Data Mining Solutions
- Advertising
- Bioinformatics
- Customer Relationship Management (CRM)
- Database Marketing
- Fraud Detection
- eCommerce
- Health Care
- Investment/Securities
- Manufacturing, Process Control
- Sports and Entertainment
- Telecommunications
- Web
Slide 20
Data Mining
- The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets.
  - Extremely large datasets
  - Discovery of the non-obvious
  - Useful knowledge that can improve processes
  - Cannot be done manually
- Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.
- Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.
Slide 21
Data Mining (cont.)
Slide 22
Data Mining (cont.)
- Data Mining is a step of the Knowledge Discovery in Databases (KDD) process
  - Data Warehousing
  - Data Selection
  - Data Preprocessing
  - Data Transformation
  - Data Mining
  - Interpretation/Evaluation
- Data Mining is sometimes referred to as KDD; DM and KDD tend to be used as synonyms
Slide 23
Data Mining Evaluation
Slide 24
Data Mining is Not
- Data warehousing
- SQL / Ad Hoc Queries / Reporting
- Software Agents
- Online Analytical Processing (OLAP)
- Data Visualization
Slide 25
Data Mining Motivation
- Changes in the Business Environment
  - Customers becoming more demanding
  - Markets are saturated
- Databases today are huge:
  - More than 1,000,000 entities/records/rows
  - From 10 to 10,000 fields/attributes/variables
  - Gigabytes and terabytes
- Databases are growing at an unprecedented rate
- Decisions must be made rapidly
- Decisions must be made with maximum knowledge
Slide 26
Why Use Data Mining Today?
Human analysis skills are inadequate:
- Volume and dimensionality of the data
- High data growth rate
Availability of:
- Data
- Storage
- Computational power
- Off-the-shelf software
- Expertise
Slide 27
An Abundance of Data
- Supermarket scanners, POS data
- Preferred customer cards
- Credit card transactions
- Direct mail response
- Call center records
- ATM machines
- Demographic data
- Sensor networks
- Cameras
- Web server logs
- Customer web site trails
Slide 28
Evolution of Database Technology
- 1960s: IMS, network model
- 1970s: The relational data model, first relational DBMS implementations
- 1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
- 1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
- 2000s: High availability, zero-administration, seamless integration into business processes
- 2010: Sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???
Slide 29
Much Commercial Support
- Many data mining tools
  - http://www.kdnuggets.com/software
- Database systems with data mining support
- Visualization tools
- Data mining process support
- Consultants
Slide 30
Why Use Data Mining Today?
Competitive pressure!
"The secret of success is to know something that nobody else knows." (Aristotle Onassis)
- Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
- Personalization, CRM
- The real-time enterprise
- "Systemic listening"
- Security, homeland defense
Slide 31
The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes
Slide 32
Data Mining Step in Detail
2.1 Data preprocessing
  - Data selection: Identify target datasets and relevant fields
  - Data cleaning
    - Remove noise and outliers
  - Data transformation
    - Create common units
    - Generate new fields
2.2 Data mining model construction
2.3 Model evaluation
Slide 33
Preprocessing and Mining
[Diagram: Original Data -> (Data Integration and Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Model Construction) -> Patterns -> (Interpretation) -> Knowledge]
Slide 34
Data Mining Techniques
- Descriptive
  - Clustering
  - Association
  - Sequential Analysis
- Predictive
  - Classification
    - Decision Tree
    - Rule Induction
    - Neural Networks
    - Nearest Neighbor Classification
  - Regression
Slide 35
Data Mining Models and Tasks
Slide 36
Basic Data Mining Tasks
- Classification maps data into predefined groups or classes
  - Supervised learning
  - Pattern recognition
  - Prediction
- Regression is used to map a data item to a real valued prediction variable.
- Clustering groups similar data together into clusters.
  - Unsupervised learning
  - Segmentation
  - Partitioning
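To make the clustering task concrete, here is a minimal one-dimensional k-means sketch. The slides do not prescribe an algorithm at this point; k-means, the sample values, and the starting centers are all illustrative assumptions. No class labels are supplied, which is what makes the task unsupervised.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer spend values with two visible groups.
spend = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centers, final_clusters = k_means_1d(spend, centers=[0.0, 5.0])
print(final_centers)    # [2.0, 11.0]
print(final_clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```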
Slide 37
Basic Data Mining Tasks (contd)
- Summarization maps data into subsets with associated simple descriptions.
  - Characterization
  - Generalization
- Link Analysis uncovers relationships among data.
  - Affinity Analysis
  - Association Rules
  - Sequential Analysis determines sequential patterns.
Slide 38
Ex: Time Series Analysis
- Example: Stock Market
- Predict future values
- Determine similar patterns over time
- Classify behavior
Slide 39
Data Mining vs. KDD
- Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
- Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
Slide 40
Data Mining Development
[Diagram: areas and techniques that contributed to data mining]
- Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
- Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
- Neural Networks, Decision Tree Algorithms
- Algorithm Design Techniques, Algorithm Analysis, Data Structures
- Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques
Slide 45
Data Mining Applications: Retail
- Performing basket analysis
  - Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
- Sales forecasting
  - Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
- Database marketing
  - Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.
- Merchandise planning and allocation
  - When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.
Slide 46
Data Mining Applications: Banking
- Card marketing
  - By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
- Cardholder pricing and profitability
  - Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
- Fraud detection
  - Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.
- Predictive life-cycle management
  - DM helps banks predict each customer's lifetime value and service each segment appropriately (for example, offering special deals and discounts).
Slide 47
Data Mining Applications: Telecommunication
- Call detail record analysis
  - Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
- Customer loyalty
  - Some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.
Slide 48
Data Mining Applications: Other Applications
- Customer segmentation
  - All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
- Manufacturing
  - Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
- Warranties
  - Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
- Frequent flier incentives
  - Airlines can identify groups of customers that can be given incentives to fly more.
Slide 49
A producer wants to know...
- Which are our lowest/highest margin customers?
- Who are my customers and what products are they buying?
- Which customers are most likely to go to the competition?
- What impact will new products/services have on revenue and margins?
- What product promotions have the biggest impact on revenue?
- What is the most effective distribution channel?
Slide 50
Data, Data everywhere yet...
- I can't find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can't get the data I need
  - need an expert to get the data
- I can't understand the data I found
  - available data poorly documented
- I can't use the data I found
  - results are unexpected
  - data needs to be transformed from one form to other
Slide 51
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]
Slide 52
What are the users saying...
- Data should be integrated across the enterprise
- Summary data has a real value to the organization
- Historical data holds the key to understanding data over time
- What-if capabilities are required
Slide 53
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [Forrester Research, April 1996]
[Diagram: Data -> Information]
Slide 54
Very Large Data Bases
- Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
- Petabytes -- 10^15 bytes: Geographic Information Systems
- Exabytes -- 10^18 bytes: National Medical Records
- Zettabytes -- 10^21 bytes: Weather images
- Yottabytes -- 10^24 bytes: Intelligence Agency Videos
Slide 55
Data Warehousing -- It is a process
- Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
- A decision support database maintained separately from the organization's operational database
Slide 56
Data Warehouse
- A data warehouse is a
  - subject-oriented
  - integrated
  - time-varying
  - non-volatile
  collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
Slide 57
Data Warehousing Concepts
- Decision support is key for companies wanting to turn their organizational data into an information asset
- A traditional database is transaction-oriented, while a data warehouse is data-retrieval optimized for decision support
- Data Warehouse: "A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"
- OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications
Slide 58
What does a data warehouse do?
- integrates diverse information from various systems, which enables users to quickly produce powerful ad-hoc queries and perform complex analysis
- creates an infrastructure for reusing the data in numerous ways
- creates an open systems environment to make useful information easily accessible to authorized users
- helps managers make informed decisions
Slide 59
Benefits of Data Warehousing
- Potential high returns on investment
- Competitive advantage
- Increased productivity of corporate decision-makers
Slide 60
Comparison of OLTP and Data Warehousing

OLTP systems                           | Data warehousing systems
Holds current data                     | Holds historic data
Stores detailed data                   | Stores detailed, lightly, and highly summarized data
Data is dynamic                        | Data is largely static
Repetitive processing                  | Ad hoc, unstructured, and heuristic processing
High level of transaction throughput   | Medium to low transaction throughput
Predictable pattern of usage           | Unpredictable pattern of usage
Transaction driven                     | Analysis driven
Application oriented                   | Subject oriented
Supports day-to-day decisions          | Supports strategic decisions
Serves large number of clerical / operational users | Serves relatively lower number of managerial users
Slide 61
Data Warehouse Architecture
[Diagram components: Operational Data, Load Manager, Warehouse Manager, Query Manager, Detailed Data, Lightly and Highly Summarized Data, Archive / Backup Data, Meta-Data, End-user Access Tools]
Slide 62
End-user Access Tools
- Reporting and query tools
- Application development tools
- Executive Information System (EIS) tools
- Online Analytical Processing (OLAP) tools
- Data mining tools
Slide 63
Data Warehousing Tools and Technologies
- Extraction, Cleansing, and Transformation Tools
- Data Warehouse DBMS
  - Load performance
  - Load processing
  - Data quality management
  - Query performance
  - Terabyte scalability
  - Networked data warehouse
  - Warehouse administration
  - Integrated dimensional tools
  - Advanced query functionality
Slide 64
Data Marts
- A subset of a data warehouse that supports the requirements of a particular department or business function
Slide 65
Online Analytical Processing (OLAP)
- OLAP
  - The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
- Multi-dimensional OLAP
  - Cubes of data
Slide 66
Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenization
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration
Slide 67
Codd's Rules for OLAP
- Multi-dimensional conceptual view
- Transparency
- Accessibility
- Consistent reporting performance
- Client-server architecture
- Generic dimensionality
- Dynamic sparse matrix handling
- Multi-user support
- Unrestricted cross-dimensional operations
- Intuitive data manipulation
- Flexible reporting
- Unlimited dimensions and aggregation levels
Slide 68
OLAP Tools
- Multi-dimensional OLAP (MOLAP)
  - Multi-dimensional DBMS (MDDBMS)
- Relational OLAP (ROLAP)
  - Creation of multiple multi-dimensional views of the two-dimensional relations
- Managed Query Environment (MQE)
  - Deliver selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally
Slide 69
Data Mining Definition
- The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions
- Knowledge discovery
  - Association rules
  - Sequential patterns
  - Classification trees
- Goals
  - Prediction
  - Identification
  - Classification
  - Optimization
Slide 70
Data Mining Techniques
- Predictive Modeling
  - Supervised training with two phases
  - Training phase: building a model using a large sample of historical data called the training set
  - Testing phase: trying the model on new data
- Database Segmentation
- Link Analysis
- Deviation Detection
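The two phases of predictive modeling can be sketched as follows. The nearest-centroid model, the single `spend` feature, and the synthetic data are illustrative assumptions, not a method taken from the slides; any supervised learner follows the same train-then-test pattern.

```python
def train(training_set):
    """Training phase: compute the mean feature value (centroid) of each class."""
    sums, counts = {}, {}
    for x, label in training_set:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda label: abs(x - model[label]))

def accuracy(model, test_set):
    """Testing phase: fraction of unseen records the model labels correctly."""
    correct = sum(1 for x, label in test_set if predict(model, x) == label)
    return correct / len(test_set)

# Hypothetical historical data: (monthly spend, known class label).
low = [(float(10 + i), "low") for i in range(20)]     # spend 10..29
high = [(float(70 + i), "high") for i in range(20)]   # spend 70..89
history = low + high

training_set = history[::2]    # every other record builds the model
test_set = history[1::2]       # the held-out records evaluate it

model = train(training_set)
test_accuracy = accuracy(model, test_set)
print(model)           # {'low': 19.0, 'high': 79.0}
print(test_accuracy)   # 1.0 on this well-separated toy data
```

The important design point is that `test_set` is never seen by `train`; accuracy measured on the training set itself would overstate how well the model generalizes.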
Slide 71
What are Data Mining Tasks?
- Classification
- Regression
- Clustering
- Summarization
- Dependency modeling
- Change and Deviation Detection
Slide 72
What are Data Mining Discoveries?
- New Purchase Trends
- Plan Investment Strategies
- Detect Unauthorized Expenditure
- Fraudulent Activities
- Crime Trends
- Smugglers-border crossing
Slide 73
Data Warehouse Architecture
[Diagram components: sources (Relational Databases, Legacy Data, Purchased Data, ERP Systems); Extraction and Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze and Query]
Slide 74
Data Warehouse for Decision Support & OLAP
- Putting information technology to work to help the knowledge worker make faster and better decisions
  - Which of my customers are most likely to go to the competition?
  - What product promotions have the biggest impact on revenue?
  - How did the share price of software companies correlate with profits over the last 10 years?
Slide 75
Decision Support
- Used to manage and control business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad-hoc
- Used by managers and end-users to understand the business and make judgements
Slide 76
Data Mining works with Warehouse Data
- Data Warehousing provides the Enterprise with a memory
- Data Mining provides the Enterprise with intelligence
Slide 77
We want to know...
- Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
- Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
- If I raise the price of my product by Rs. 2, what is the effect on my ROI?
- If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
- If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
- Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Slide 78
Application Areas

Industry               | Application
Finance                | Credit Card Analysis
Insurance              | Claims, Fraud Analysis
Telecommunication      | Call record analysis
Transport              | Logistics management
Consumer goods         | Promotion analysis
Data Service providers | Value added data
Utilities              | Power usage analysis
Slide 79
Data Mining in Use
- The US Government uses Data Mining to track fraud
- A Supermarket becomes an information broker
- Basketball teams use it to track game strategy
- Cross Selling
- Warranty Claims Routing
- Holding on to Good Customers
- Weeding out Bad Customers
Slide 80
What makes data mining possible?
- Advances in the following areas are making data mining deployable:
  - data warehousing
  - better and more data (i.e., operational, behavioral, and demographic)
  - the emergence of easily deployed data mining tools and
  - the advent of new data mining techniques.
-- Gartner Group
Slide 81
Why Separate Data Warehouse?
- Performance
  - Operational databases are designed and tuned for known transactions and workloads.
  - Complex OLAP queries would degrade performance for operational transactions.
  - Special data organization, access, and implementation methods are needed for multidimensional views and queries.
- Function
  - Missing data: Decision support requires historical data, which operational databases do not typically maintain.
  - Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
  - Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
Slide 82
What are Operational Systems?
- They are OLTP systems
- Run mission critical applications
- Need to work with stringent performance requirements for routine tasks
- Used to run a business!
Slide 83
RDBMS used for OLTP
- Database Systems have been used traditionally for OLTP
  - clerical data processing tasks
  - detailed, up to date data
  - structured repetitive tasks
  - read/update a few records
  - isolation, recovery and integrity are critical
Slide 84
Operational Systems
- Run the business in real time
- Based on up-to-the-second data
- Optimized to handle large numbers of simple read/write transactions
- Optimized for fast response to predefined transactions
- Used by people who deal with customers, products -- clerks, salespeople etc.
- They are increasingly used by customers
Slide 85
Examples of Operational Data
Slide 86
Application-Orientation vs. Subject-Orientation
- Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
- Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
Slide 87
OLTP vs. Data Warehouse
- OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
- Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  - e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
Slide 88
OLTP vs Data Warehouse
- OLTP
  - Application Oriented
  - Used to run business
  - Detailed data
  - Current up to date
  - Isolated Data
  - Repetitive access
  - Clerical User
- Warehouse (DSS)
  - Subject Oriented
  - Used to analyze business
  - Summarized and refined
  - Snapshot data
  - Integrated Data
  - Ad-hoc access
  - Knowledge User (Manager)
Slide 89
OLTP vs Data Warehouse
- OLTP
  - Performance Sensitive
  - Few Records accessed at a time (tens)
  - Read/Update Access
  - No data redundancy
  - Database Size 100 MB - 100 GB
- Data Warehouse
  - Performance relaxed
  - Large volumes accessed at a time (millions)
  - Mostly Read (Batch Update)
  - Redundancy present
  - Database Size 100 GB - few terabytes
Slide 90
OLTP vs Data Warehouse
- OLTP
  - Transaction throughput is the performance metric
  - Thousands of users
  - Managed in entirety
- Data Warehouse
  - Query throughput is the performance metric
  - Hundreds of users
  - Managed by subsets
Slide 91
To summarize...
- OLTP Systems are used to "run" a business
- The Data Warehouse helps to "optimize" the business
Slide 92
Why Now?
- Data is being produced
- ERP provides clean data
- The computing power is available
- The computing power is affordable
- The competitive pressures are strong
- Commercial products are available
Slide 93
Myths surrounding OLAP Servers and Data Marts
- Data marts and OLAP servers are departmental solutions supporting a handful of users
- Million dollar massively parallel hardware is needed to deliver fast time for complex queries
- OLAP servers require massive and unwieldy indices
- Complex OLAP queries clog the network with data
- Data warehouses must be at least 100 GB to be effective
Source -- Arbor Software Home Page
Slide 94
II. On-Line Analytical Processing (OLAP) Making Decision
Support Possible
Slide 95
Typical OLAP Queries
- Write a multi-table join to compare sales for each product line YTD this year vs. last year.
- Repeat the above process to find the top 5 product contributors to margin.
- Repeat the above process to find the sales of a product line to new vs. existing customers.
- Repeat the above process to find the customers that have had negative sales growth.
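The first query above can be sketched with Python's built-in `sqlite3` module. The `product` and `sales` tables, their columns, and the figures are hypothetical, and the year-to-date cutoff is simplified to whole years. The sketch shows why each new question means writing another hand-built join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_line TEXT);
CREATE TABLE sales (product_id INTEGER, sale_year INTEGER, amount REAL);
INSERT INTO product VALUES (1, 'Audio'), (2, 'Video');
INSERT INTO sales VALUES
  (1, 2009, 100.0), (1, 2010, 150.0),
  (2, 2009, 200.0), (2, 2010, 180.0);
""")

# Join the fact table to the product table and pivot the two years into columns.
rows = conn.execute("""
SELECT p.product_line,
       SUM(CASE WHEN s.sale_year = 2010 THEN s.amount ELSE 0 END) AS this_year,
       SUM(CASE WHEN s.sale_year = 2009 THEN s.amount ELSE 0 END) AS last_year
FROM sales AS s
JOIN product AS p ON p.product_id = s.product_id
GROUP BY p.product_line
ORDER BY p.product_line
""").fetchall()

# A follow-up question (negative growth) is answered from the same result here,
# but in plain SQL it would typically need a further query.
negative_growth = [line for line, this_year, last_year in rows if this_year < last_year]

print(rows)              # [('Audio', 150.0, 100.0), ('Video', 180.0, 200.0)]
print(negative_growth)   # ['Video']
```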
Slide 96
What Is OLAP?
- Online Analytical Processing - coined by E. F. Codd in a 1993 paper contracted by Arbor Software*
- Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
- OLAP = Multidimensional Database
- MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
- ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
Slide 97
The OLAP Market
- Rapid growth in the enterprise market
  - 1995: $700 Million
  - 1997: $2.1 Billion
- Significant consolidation activity among major DBMS vendors
  - 10/94: Sybase acquires ExpressWay
  - 7/95: Oracle acquires Express
  - 11/95: Informix acquires Metacube
  - 1/97: Arbor partners up with IBM
  - 10/96: Microsoft acquires Panorama
- Result: OLAP shifted from small vertical niche to mainstream DBMS category
Slide 98
Strengths of OLAP
- It is a powerful visualization paradigm
- It provides fast, interactive response times
- It is good for analyzing time series
- It can be useful to find some clusters and outliers
- Many vendors offer OLAP tools
Slide 99
OLAP Is FASMI
- Fast
- Analysis
- Shared
- Multidimensional
- Information
Source: Nigel Pendse, Richard Creeth - The OLAP Report
Slide 100
Multi-dimensional Data
- "Hey... I sold $100M worth of goods"
[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N), and Month (1-7)]
Dimensions: Product, Region, Time
Hierarchical summarization paths:
- Product: Industry > Category > Product
- Region: Country > Region > City > Office
- Time: Year > Quarter > Month / Week > Day
Slide 101
A Visual Operation: Pivot (Rotate)
[Figure: cube with axes Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF), and Date (3/1 to 3/4, rolling up to Month); sample cell values 10, 47, 30, 12]
Slide 102
Slicing and Dicing
[Figure: cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Regions (India, Far East, Europe); "The Telecomm Slice" is highlighted]
Slide 103
Roll-up and Drill Down
Hierarchy, from highest to lowest level:
- Sales Channel
- Region
- Country
- State
- Location Address
- Sales Representative
Roll Up: toward a higher level of aggregation
Drill-Down: toward low-level details
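The cube operations from the preceding slides can be approximated on a small list of fact rows. This is a plain-Python sketch under assumed data; the rows, the `DIMS` tuple, and the helper names are hypothetical, and a real OLAP server would precompute or index these aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, month, sales).
facts = [
    ("Juice", "N", 1, 10),
    ("Juice", "S", 1, 5),
    ("Cola",  "N", 1, 47),
    ("Cola",  "N", 2, 30),
    ("Milk",  "W", 2, 12),
]
DIMS = ("product", "region", "month")

def roll_up(rows, keep):
    """Aggregate sales over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

def slice_cube(rows, dim, value):
    """Fix one dimension to a single value, keeping only the matching rows."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]

by_product = roll_up(facts, ["product"])                     # rolled up
by_product_region = roll_up(facts, ["product", "region"])    # drilled down one level
north_by_month = roll_up(slice_cube(facts, "region", "N"), ["month"])

print(by_product)       # {('Juice',): 15, ('Cola',): 77, ('Milk',): 12}
print(north_by_month)   # {(1,): 57, (2,): 30}
```

Drilling down is simply rolling up while keeping more dimensions, so both directions use the same function.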
Slide 104
Results of Data Mining Include:
- Forecasting what may happen in the future
- Classifying people or things into groups by recognizing patterns
- Clustering people or things into groups based on their attributes
- Associating what events are likely to occur together
- Sequencing what events are likely to lead to later events
Slide 105
Data mining is not
- Brute-force crunching of bulk data
- "Blind" application of algorithms
- Going to find relationships where none exist
- Presenting data in different ways
- A database intensive task
- A difficult to understand technology requiring an advanced degree in computer science
Slide 106
Data Mining versus OLAP
- OLAP - On-line Analytical Processing
  - Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening
Slide 107
Data Mining Versus Statistical Analysis
Data Mining
- Originally developed to act as expert systems to solve problems
- Less interested in the mechanics of the technique
- If it makes sense then let's use it
- Does not require assumptions to be made about data
- Can find patterns in very large amounts of data
- Requires understanding of data and business problem
Data Analysis
- Tests for statistical correctness of models
  - Are statistical assumptions of models correct?
  - E.g. Is the R-Square good?
- Hypothesis testing
  - Is the relationship significant?
  - Use a t-test to validate significance
- Tends to rely on sampling
- Techniques are not optimised for large amounts of data
- Requires strong statistical skills
Slide 108
Examples of What People are Doing with Data Mining:
- Fraud/Non-Compliance
  - Anomaly detection
  - Isolate the factors that lead to fraud, waste and abuse
  - Target auditing and investigative efforts more effectively
- Credit/Risk Scoring
- Intrusion detection
- Parts failure prediction
- Recruiting/Attracting customers
- Maximizing profitability (cross selling, identifying profitable customers)
- Service Delivery and Customer Retention
  - Build profiles of customers likely to use which services
- Web Mining
Slide 109
What data mining has done for...
The US Internal Revenue Service needed to improve customer service and...
Scheduled its workforce to provide faster, more accurate answers to questions.
Slide 110
What data mining has done for...
The US Drug Enforcement Agency needed to be more effective in their drug busts and...
Analyzed suspects' cell phone usage to focus investigations.
Slide 111
What data mining has done for...
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments and...
reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Slide 112
Suggestion: Predicting Washington
- C-SPAN has launched a digital archive of 500,000 hours of audio debates.
- Text mining or audio mining of these talks to reveal certain questions such as...
Slide 113
Example Application: Sports
IBM Advanced Scout analyzes NBA game statistics
  - Shots blocked
  - Assists
  - Fouls
- Google: IBM Advanced Scout
Slide 114
Market Basket Analysis
- DSS Agent
  - uses intelligent agents for data mining
  - provides multiple functions
    - recognizes sales patterns among stores
    - discovers sales patterns by time of day, day of year, category of product, etc.
    - swiftly identifies trends & shifts in customer tastes
  - performs Market Basket Analysis (MBA)
    - analyzes Point-of-Sale or -Service (POS) data
    - identifies relationships among products and/or services purchased
    - E.g. a customer who buys Brand X slacks has a 35% chance of buying Brand Y shirts.
- Agent tool is also used by other Fortune 1000 firms
  - average ROI > 300%
  - average payback in 1 to 2 years
Slide 157
Case Based Reasoning (CBR) General scheme for a case based
reasoning (CBR) model. The target case is matched against similar
precedents in the historical database, such as cases A and B.
Slide 158
Case Based Reasoning (CBR)
- Learning through the accumulation of experience
- Key issues
  - Indexing: storing cases for quick, effective access of precedents
  - Retrieval: accessing the appropriate precedent cases
- Advantages
  - Explicit knowledge form recognizable to humans
  - No need to re-code knowledge for computer processing
- Limitations
  - Retrieving precedents based on superficial features, e.g. matching Indonesia with the U.S. because both have similar population size
  - Traditional approach ignores the issue of generalizing knowledge
Slide 159
Genetic Algorithm
Generation of candidate solutions using the procedures of biological evolution.
Procedure
0. Initialize. Create a population of potential solutions ("organisms").
1. Evaluate. Determine the level of "fitness" for each solution.
2. Cull. Discard the poor solutions.
3. Breed.
   a. Select 2 "fit" solutions to serve as parents.
   b. From the 2 parents, generate offspring.
      * Crossover: Cut the parents at random and switch the 2 halves.
      * Mutation: Randomly change the value in a parent solution.
4. Repeat. Go back to Step 1 above.
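The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: solutions are bit strings, the fitness function is the toy "OneMax" problem (count of 1 bits), the fitter half survives each cull, and mutation is applied to the offspring rather than to a parent. The function name and all parameter values are hypothetical.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=60,
                      mutation_rate=0.02, seed=0):
    """Evolve bit strings toward higher fitness, following the steps above."""
    rng = random.Random(seed)
    # Step 0. Initialize: random population of bit-string "organisms".
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Step 1. Evaluate: rank solutions by fitness, best first.
        population.sort(key=fitness, reverse=True)
        # Step 2. Cull: keep only the fitter half (an assumed culling rule).
        survivors = population[:pop_size // 2]
        # Step 3. Breed: refill the population with offspring.
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_bits)        # crossover point
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            children.append(child)
        # Step 4. Repeat with the new generation.
        population = survivors + children
    return max(population, key=fitness)

best = genetic_algorithm(fitness=sum)  # OneMax: fitness = number of 1 bits
print(best, sum(best))
```

Because survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.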
Slide 160
Genetic Algorithm (Cont.)
- Advantages
  - Applicable to a wide range of problem domains.
  - Robustness: can obtain solutions even when the performance function is highly irregular or input data are noisy.
  - Implicit parallelism: can search in many directions concurrently.
- Limitations
  - Slow, like neural networks. But: computation can be distributed over multiple processors (unlike neural networks)
Source: www.pathology.washington.edu
Slide 161
Multistrategy Learning
- Every technique has advantages & limitations
- Multistrategy approach
  - Take advantage of the strengths of diverse techniques
  - Circumvent the limitations of each methodology
Slide 162
Types of Models
Prediction Models for Predicting and Classifying
- Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
- Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
Descriptive Models for Grouping and Finding Associations
- Clustering/grouping algorithms: K-means, Kohonen
- Association algorithms: Apriori, GRI
Slide 163
Neural Networks
- Description
  - Difficult interpretation
  - Tends to overfit the data
  - Extensive amount of training time
  - A lot of data preparation
  - Works with all data types
Slide 164
Rule Induction
Description
- Intuitive output
- Handles all forms of numeric data, as well as non-numeric (symbolic) data
C5 Algorithm: a special case of rule induction
- Target variable must be symbolic
Slide 165
Apriori
Description
- Seeks association rules in dataset
- Market basket analysis
- Sequence discovery
Slide 166
Data Mining Is
- The automated process of finding relationships and patterns in stored data
- It is different from the use of SQL queries and other business intelligence tools
Slide 167
Data Mining Is
- Motivated by business need, large amounts of available data, and humans' limited cognitive processing abilities
- Enabled by data warehousing, parallel processing, and data mining algorithms
Slide 168
Common Types of Information from Data Mining
- Associations: identifies occurrences that are linked to a single event
- Sequences: identifies events that are linked over time
- Classification: recognizes patterns that describe the group to which an item belongs
Slide 169
Common Types of Information from Data Mining
- Clustering: discovers different groupings within the data
- Forecasting: estimates future values
Slide 170
Commonly Used Data Mining Techniques
- Artificial neural networks
- Decision trees
- Genetic algorithms
- Nearest neighbor method
- Rule induction
Slide 171
The Current State of Data Mining Tools
- Many of the vendors are small companies
- IBM and SAS have been in the market for some time, and more large vendors are moving into this market
- BI tools and RDBMS products are increasingly including basic data mining capabilities
- Packaged data mining applications are becoming common
Slide 172
The Data Mining Process
- Requires personnel with domain, data warehousing, and data mining expertise
- Requires data selection, data extraction, data cleansing, and data transformation
- Most data mining tools work with highly granular flat files
- Is an iterative and interactive process
Slide 173
Why Data Mining
- Credit ratings/targeted marketing:
  - Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  - Identify likely responders to sales promotions
- Fraud detection:
  - Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
- Customer relationship management:
  - Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data Mining helps extract such information
Slide 174
Applications
- Banking: loan/credit card approval
  - predict good customers based on old customers
- Customer relationship management:
  - identify those who are likely to leave for a competitor.
- Targeted marketing:
  - identify likely responders to promotions
- Fraud detection: telecommunications, financial transactions
  - from an online stream of events, identify fraudulent events
- Manufacturing and production:
  - automatically adjust knobs when process parameters change
Slide 175
Applications (continued)
- Medicine: disease outcome, effectiveness of treatments
  - analyze patient disease history: find relationship between diseases
- Molecular/Pharmaceutical: identify new drugs
- Scientific data analysis:
  - identify new galaxies by searching for sub clusters
- Web site/store design and promotion:
  - find affinity of visitor to pages and modify layout
Slide 176
The KDD process
- Problem formulation
- Data collection
  - subset data: sampling might hurt if the data is highly skewed
  - feature selection: principal component analysis, heuristic search
- Pre-processing: cleaning
  - name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
- Transformation:
  - map complex objects (e.g. time series data) to features (e.g. frequency)
- Choosing mining task and mining method
- Result evaluation and visualization
Knowledge discovery is an iterative process
Slide 177
Relationship with other fields
- Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on
  - scalability of number of features and instances
  - algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
  - automation for handling large, heterogeneous data
Slide 178
Some basic operations
- Predictive:
  - Regression
  - Classification
  - Collaborative filtering
- Descriptive:
  - Clustering / similarity matching
  - Association rules and variants
  - Deviation detection
Slide 179
Classification
- Given old data about customers and payments, predict new applicant's loan eligibility.
[Figure: previous customers (Age, Salary, Profession, Location, Customer type) feed a classifier that produces decision rules (e.g. Salary > 5 L, Prof. = Exec); the rules label new applicant's data as good/bad]
Slide 180
Classification methods
- Goal: Predict class Ci = f(x1, x2, ..., xn)
- Regression: (linear or any other polynomial)
  - a*x1 + b*x2 + c = Ci.
- Nearest neighbour
- Decision tree classifier: divide decision space into piecewise constant regions.
- Probabilistic/generative models
- Neural networks: partition by non-linear boundaries
Slide 181
Nearest neighbor
- Define proximity between instances, find neighbors of new instance and assign majority class
- Case based reasoning: when attributes are more complicated than real-valued.
Pros
- Fast training
Cons
- Slow during application.
- No feature selection.
- Notion of proximity vague
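A minimal sketch of nearest-neighbor classification as described above, assuming Euclidean distance as the proximity measure and a majority vote among the k closest instances. The customer data, the function name and the choice k=3 are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label); returns the majority label of the k nearest."""
    # "Training" is just storing the data; all the work happens at query time,
    # which is why training is fast and application is slow.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (age, salary in thousands) -> credit label data.
customers = [((25, 30), "bad"), ((30, 35), "bad"), ((28, 32), "bad"),
             ((45, 90), "good"), ((50, 110), "good"), ((40, 85), "good")]
print(knn_classify(customers, (42, 95)))
```

Every attribute contributes to the distance with equal weight here, which illustrates the "no feature selection" drawback listed above.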
Slide 182
Clustering
- Unsupervised learning when old data with class labels not available, e.g. when introducing a new product.
- Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.
- Key requirement: Need a good measure of similarity between instances.
- Identify micro-markets and develop policies for each
Slide 183
Applications
- Customer segmentation, e.g. for targeted marketing
  - Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.
  - Identify micro-markets and develop policies for each
- Collaborative filtering:
  - group based on common items purchased
- Text clustering
- Compression
Slide 184
Distance functions
- Numeric data: Euclidean, Manhattan distances
- Categorical data: 0/1 to indicate presence/absence followed by
  - Hamming distance (# dissimilarity)
  - Jaccard coefficients: # similarity in 1s / (# of 1s)
  - data dependent measures: similarity of A and B depends on co-occurrence with C.
- Combined numeric and categorical data:
  - weighted normalized distance:
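The measures named above can be written out as short functions. This is a sketch, not the deck's own definitions: the Jaccard coefficient is taken in its usual form (positions where both vectors are 1, divided by positions where at least one is 1), which is an assumed reading of "# similarity in 1s / (# of 1s)".

```python
def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where two 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]), jaccard([1, 0, 1, 1], [1, 1, 0, 1]))
```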
Slide 185
Clustering methods
- Hierarchical clustering
  - agglomerative vs. divisive
  - single link vs. complete link
- Partitional clustering
  - distance-based: K-means
  - model-based: EM
  - density-based:
Slide 186
Agglomerative Hierarchical clustering
- Given: matrix of similarity between every point pair
- Start with each point in a separate cluster and merge clusters based on some criteria:
  - Single link: merge two clusters such that the minimum distance between two points from the two different clusters is the least
  - Complete link: merge two clusters such that all points in one cluster are close to all points in the other.
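A naive sketch of the merge loop described above. It assumes points are coordinate tuples and computes Euclidean distances on the fly instead of taking a precomputed similarity matrix; "complete link" is implemented in its usual sense (cluster distance = largest pairwise distance). The function name and sample points are hypothetical.

```python
import math

def agglomerative(points, n_clusters, linkage="single"):
    """Naive agglomerative clustering; returns a list of clusters (lists of points)."""
    clusters = [[p] for p in points]             # start: one cluster per point
    pick = min if linkage == "single" else max   # single vs. complete link

    def cluster_distance(c1, c2):
        return pick(math.dist(p, q) for p in c1 for q in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)           # merge cluster j into cluster i
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(agglomerative(pts, 2))
```

Recomputing every pairwise distance at each merge keeps the sketch short but is far too slow for large datasets.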
Slide 187
Partitional methods: K-means
- Criteria: minimize sum of square of distance
  - Between each point and centroid of the cluster.
  - Between each pair of points in the cluster
- Algorithm:
  - Select initial partition with K clusters: random, first K, K separated points
  - Repeat until stabilization:
    - Assign each point to closest cluster center
    - Generate new cluster centers
    - Adjust clusters by merging/splitting
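The K-means loop above might look as follows in Python. This sketch assumes the "first K" initialisation and Euclidean distance, and it omits the optional merging/splitting adjustment; the function name and data are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Basic K-means; the initial centers are the first k points."""
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the closest cluster center.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:   # stabilization: no point changed cluster
            break
        assignment = new_assignment
        # Generate new cluster centers as the mean of each cluster's members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old center if a cluster is empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment

data = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```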
Slide 188
Collaborative Filtering
- Given database of user preferences, predict preference of new user
- Example: predict what new movies you will like based on
  - your past preferences
  - others with similar past preferences
  - their preferences for the new movies
- Example: predict what books/CDs a person may want to buy
  - (and suggest it, or give discounts to tempt customer)
Slide 189
Association rules
- Given set T of groups of items
- Example: set of item sets purchased
- Goal: find all rules on itemsets of the form a --> b such that
  - support of a and b > user threshold s
  - conditional probability (confidence) of b given a > user threshold c
- Example: Milk --> bread
- Purchase of product A --> service B
[Figure: example set T of item sets: {Milk, cereal}, {Tea, milk}, {Tea, rice, bread}, {cereal}]
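Support and confidence can be computed directly by counting. The sketch below handles only single-item rules a --> b and scans all item pairs by brute force; it is not the Apriori algorithm, and the baskets and thresholds are hypothetical.

```python
from itertools import permutations

def association_rules(transactions, min_support, min_confidence):
    """Return single-item rules a --> b as (a, b, support, confidence) tuples."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        count_a = sum(1 for t in transactions if a in t)
        count_ab = sum(1 for t in transactions if a in t and b in t)
        support = count_ab / n                                # fraction containing a and b
        confidence = count_ab / count_a if count_a else 0.0   # estimate of P(b | a)
        if support > min_support and confidence > min_confidence:
            rules.append((a, b, support, confidence))
    return rules

# Hypothetical baskets, loosely modeled on the example set T.
T = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"},
     {"milk", "bread"}, {"milk", "cereal", "bread"}]
for rule in association_rules(T, min_support=0.3, min_confidence=0.6):
    print(rule)
```

With these baskets, "milk --> cereal" fails the confidence threshold (2 of 4 milk baskets) while "cereal --> milk" passes (2 of 3 cereal baskets), which shows that confidence is not symmetric.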
Slide 190
Prevalent ≠ Interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
[Figure: in 1995, "Milk and cereal sell together!" is news; in 1998 the same finding gets "Zzzz..."]
Slide 191
Applications of fast itemset counting
Find correlated events:
- Applications in medicine: find redundant tests
- Cross selling in retail, banking
- Improve predictive capability of classifiers that assume attribute independence
- New similarity measures of categorical attributes [Mannila et al, KDD 98]
Slide 192
Application Areas
Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value added data
Utilities                 Power usage analysis
Slide 193
Usage scenarios
- Data warehouse mining:
  - assimilate data from operational sources
  - mine static data
- Mining log data
- Continuous mining: example in process control
- Stages in mining:
  - data selection, pre-processing (cleaning), transformation, mining, result evaluation, visualization
Slide 194
Mining market
- Around 20 to 30 mining tool vendors
- Major tool players:
  - Clementine,
  - IBM's Intelligent Miner,
  - SGI's MineSet,
  - SAS's Enterprise Miner.
- All pretty much the same set of tools
- Many embedded products:
  - fraud detection:
  - electronic commerce applications,
  - health care,
  - customer relationship management: Epiphany
Slide 195
Vertical integration: Mining on the web
- Web log analysis for site design:
  - what are popular pages,
  - what links are hard to find.
- Electronic stores sales enhancements:
  - recommendations, advertisement:
  - Collaborative filtering: Net perception, Wisewire
  - Inventory control: what was a shopper looking for and could not find..
Slide 196
State of art in mining OLAP integration
- Decision trees [Information discovery, Cognos]
  - find factors influencing high profits
- Clustering [Pilot software]
  - segment customers to define hierarchy on that dimension
- Time series analysis: [Seagate's Holos]
  - Query for various shapes along time: e.g. spikes, outliers
- Multi-level Associations [Han et al.]
  - find association between members of dimensions
- Sarawagi [VLDB2000]
Slide 197
Data Mining in Use
- The US Government uses Data Mining to track fraud
- A Supermarket becomes an information broker
- Basketball teams use it to track game strategy
- Cross Selling
- Target Marketing
- Holding on to Good Customers
- Weeding out Bad Customers
Slide 198
Some success stories
- Network intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data
  - Won over (manual) knowledge engineering approach
  - http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process
- Major US bank: customer attrition prediction
  - First segment customers based on financial behavior: found 3 segments
  - Build attrition models for each of the 3 segments
  - 40-50% of attritions were predicted == factor of 18 increase
- Targeted credit marketing: major US banks
  - find customer segments based on 13 months credit balances
  - build another response model based on surveys
  - increased response 4 times -- 2%
Slide 199
Data Mining Tools: KnowledgeSeeker 4.5
What is KnowledgeSeeker?
- Produced by ANGOSS Software Corporation, who focus solely on data mining software.
- Offer training and consulting services
- Produce data mining add-ins which accept data from all major databases
- Works with popular query and reporting, spreadsheet, statistical and OLAP & ROLAP tools.
Slide 200
Major Competitors (Company / Software)
- Clementine 6.0
- Enterprise Miner 3.0
- Intelligent Miner
Slide 201
Major Competitors (Company / Software)
- Mineset 3.1
- Darwin
- Scenario
Slide 202
Current Applications
Manufacturing
- Used by the R.R. Donnelley & Sons commercial printing company to improve process control, cut costs and increase productivity.
- Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool, both to analyze factors impacting product quality and to generate rules for production control systems.
Slide 203
Current Applications
Auditing
- Used by the IRS to combat fraud, reduce risk, and increase collection rates.
Finance
- Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management.
Slide 204
Current Applications
CRM
Telephony
- Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology.
Slide 205
Current Applications
Marketing
- Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis.
Health Care
- Used by the Oxford Transplant Center to discover factors affecting transplant survival rates.
- Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea.
Slide 206
More Customers
Slide 207
Questions
1. What percentage of people in the test group have high blood pressure with these characteristics: 66-year-old male regular smoker who has low to moderate salt consumption?
2. Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?
3. If you are a 2% milk drinker, how many factors are still interesting?
4. Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?
5. Grow an automatic tree. Look to see if gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.
Slide 208
Association
- Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction.
- This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products.
- Example: 80 percent of all transactions in which beer was purchased also included potato chips.
Slide 209
Sequence-based analysis
- Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction.
- Sequence-based analysis instead looks across purchases over time, to identify a typical set of purchases that might predict the subsequent purchase of a specific item.
Slide 210
Clustering
- Clustering approaches address segmentation problems.
- These approaches assign records with a large number of attributes into a relatively small set of groups or "segments."
- Example: Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
Slide 211
Classification
- Most commonly applied data mining technique
- Algorithm uses preclassified examples to determine the set of parameters required for proper discrimination.
- Example: A classifier that is capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.
Slide 212
Issues of Data Mining
- Present-day tools are strong but require significant expertise to implement effectively.
- Issues of Data Mining
  - Susceptibility to "dirty" or irrelevant data.
  - Inability to "explain" results in human terms.
Slide 213
Issues
- Susceptibility to "dirty" or irrelevant data
  - Data mining tools of today simply take everything they are given as factual and draw the resulting conclusions.
  - Users must take the necessary precautions to ensure that the data being analyzed is "clean."
Slide 214
Issues, cont.
- Inability to "explain" results in human terms
  - Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms.
  - What good does the information do if you don't understand it?
Slide 215
Comparison with reporting, BI and OLAP
Reporting (also applies to visualisation & simple statistics)
- Simple relationships
- Choose the relevant factors
- Examine all details
Data Mining
- Complex relationships
- Automatically find the relevant factors
- Show only relevant details
- Prediction
Slide 216
Comparison with Statistics
Statistical analysis
- Mainly about hypothesis testing
- Focussed on precision
Data mining
- Mainly about hypothesis generation
- Focussed on deployment
Slide 217
Example: data mining and customer processes
- Insight: Who are my customers and why do they behave the way they do?
- Prediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer?
- Uses: Targeted marketing, mail-shots, call-centres, adaptive web-sites
Slide 218
Example: data mining and fraud detection
- Insight: How can (specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?
- Prediction: Recognise similarity to previous frauds: how similar? Spot abnormal events: how suspicious?
- Used by: Banks, telcos, retail, government
Slide 219
Example: data mining and diagnosing cancer
- Complex data from genetics
  - Challenging data mining problem
- Find patterns of gene activation indicating different diseases / stages
- "Changed the way I think about cancer" (oncologist from Chicago Children's Memorial Hospital)
Slide 220
Example: data mining and policing
- Knowing the patterns helps plan effective crime prevention
- Crime hot-spots understood better
- Sift through mountains of crime reports
- Identify crime series
- "Other people save money using data mining; we save lives." (police force homicide specialist and data miner)
Slide 221
Data mining tools: Clementine and its philosophy
Slide 222
How to do data mining
- Lots of data mining operations
- How do you glue them together to solve a problem?
- How do we actually do data mining?
- Methodology
  - Not just the right way, but any way
Slide 223
Myths about Data Mining (1): Data, Process and Tech
- Myth: Data mining is all about massive data
  - It can be, but some important datasets are very small, and sampling is often appropriate
- Myth: Data mining is a technical process
  - Business analysts perform data mining every day
  - It is a business process
- Myth: Data mining is all about algorithms
  - Algorithms are a key tool
  - But data mining is done by people, not by algorithms
- Myth: Data mining is all about predictive accuracy
  - It's about usefulness
  - Accuracy is only a small component
Slide 224
Myths about Data Mining (2): Data Quality
- Myth: Data mining only works with clean data
  - Cleaning the data is part of the data mining process
  - Need not be clean initially
- Myth: Data mining only works with complete data
  - Data mining works with whatever data you have.
  - Complete is good, incomplete is also ok.
- Myth: Data mining only works with correct data
  - Errors in data are inevitable.
  - Data mining helps you deal with them.
Slide 225
One last exploding myth
- Myth: Neural Networks are not useful when you need to understand the patterns that you find (which is nearly always in data mining)
  - Related to over-simplistic views of data mining
  - Data mining techniques form a toolkit
  - We often use techniques in surprising ways, e.g.
    - Neural nets for field selection
    - Neural nets for pattern confirmation
    - Neural nets combined with other techniques for cross-checking
  - What use is a pair of pliers?
Slide 226
Related Concepts Outline
- Database/OLTP Systems
- Fuzzy Sets and Logic
- Information Retrieval (Web Search Engines)
- Dimensional Modeling
- Data Warehousing
- OLAP/DSS
- Statistics
- Machine Learning
- Pattern Matching
Goal: Examine some areas which are related to data mining.
Slide 227
Fuzzy Sets and Logic
- Fuzzy Set: Set membership function is a real valued function with output in the range [0,1].
- f(x): Probability x is in F.
- 1-f(x): Probability x is not in F.
- EX:
  - T = {x | x is a person and x is tall}
  - Let f(x) be the probability that x is tall
  - Here f is the membership function
DM: Prediction and classification are fuzzy.
Slide 228
Information Retrieval
- Information Retrieval (IR): retrieving desired information from textual data.
- Library Science
- Digital Libraries
- Web Search Engines
- Traditionally keyword based
- Sample query: Find all documents about "data mining".
DM: Similarity measures; Mine text/Web data.
Slide 229
Dimensional Modeling (Prentice Hall)
- View data in a hierarchical manner, more as business executives might
- Useful in decision support systems and mining
- Dimension: collection of logically related attributes; axis for modeling data.
- Facts: data stored
- Ex: Dimensions: products, locations, date; Facts: quantity, unit price
DM: May view data as dimensional.
Slide 230
Dimensional Modeling Queries
- Roll Up: more general dimension
- Drill Down: more specific dimension
- Dimension (Aggregation) Hierarchy
- SQL uses aggregation
- Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.
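Roll up and drill down amount to aggregating the same facts at different levels of a dimension hierarchy. The sketch below assumes a hypothetical fact table with a location hierarchy (city within country) and sums the quantity fact at each level; all names and numbers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, city, country, month, quantity).
facts = [("milk", "Riyadh", "SA", "2010-01", 10),
         ("milk", "Jeddah", "SA", "2010-01", 5),
         ("bread", "Riyadh", "SA", "2010-02", 7),
         ("milk", "Cairo", "EG", "2010-01", 4)]

def aggregate(rows, key):
    """Sum quantity grouped by key(row), akin to SQL GROUP BY with SUM."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)

by_city = aggregate(facts, key=lambda r: r[1])     # drill down: city level
by_country = aggregate(facts, key=lambda r: r[2])  # roll up: country level
print(by_city, by_country)
```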
Slide 231
Cube view of Data
Slide 232
Data Warehousing
- "Subject-oriented, integrated, time-variant, nonvolatile" (William Inmon)
- Operational Data: Data used in day to day needs of company.
- Informational Data: Supports other functions such as planning and forecasting.
- Data mining tools often access data warehouses rather than operational data.
DM: May access data in warehouse.
Slide 233
OLAP
- Online Analytic Processing (OLAP): provides more complex queries than OLTP.
- OnLine Transaction Processing (OLTP): traditional database/transaction processing.
- Dimensional data; cube view
- Visualization of operations:
  - Slice: examine sub-cube.
  - Dice: rotate cube to look at another dimension.
  - Roll Up/Drill Down
DM: May use OLAP queries.
Slide 234
OLAP Operations
[Figure: cube views illustrating Single Cell, Multiple Cells, Slice, Dice, Roll Up, Drill Down]
Slide 235
Statistics
- Simple descriptive models
- Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
- Exploratory Data Analysis:
  - Data can actually drive the creation of the model
  - Opposite of traditional statistical view.
- Data mining targeted to business user
DM: Many data mining methods come from statistical techniques.
Slide 236
Machine Learning
- Machine Learning: area of AI that examines how to write programs that can learn.
- Often used in classification and prediction
- Supervised Learning: learns by example.
- Unsupervised Learning: learns without knowledge of correct answers.
- Machine learning often deals with small static datasets.
DM: Uses many machine learning techniques.
Slide 237
Pattern Matching (Recognition)
- Pattern Matching: finds occurrences of a predefined pattern in the data.
- Applications include speech recognition, information retrieval, time series analysis.
DM: Type of classification.
Slide 238
DM vs. Related Topics
Slide 239
Data Mining Techniques Outline
- Statistical
  - Point Estimation
  - Models Based on Summarization
  - Bayes Theorem
  - Hypothesis Testing
  - Regression and Correlation
- Similarity Measures
- Decision Trees
- Neural Networks
  - Activation Functions
- Genetic Algorithms
Goal: Provide an overview of basic data mining techniques
Slide 240
Point Estimation
- Point Estimate: estimate a population parameter.
- May be made by calculating the parameter for a sample.
- May be used to predict value for missing data.
- Ex:
  - R contains 100 employees
  - 99 have salary information
  - Mean salary of these is $50,000
  - Use $50,000 as value of the remaining employee's salary. Is this a good idea?
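The salary example can be sketched as mean imputation: compute the sample mean from the known values and use it for the missing one. The list below is a hypothetical four-employee stand-in for the 100-employee relation R.

```python
from statistics import mean

salaries = [48_000, 52_000, None, 50_000]   # hypothetical; None marks the missing salary
known = [s for s in salaries if s is not None]
estimate = mean(known)                      # point estimate of the population mean
filled = [estimate if s is None else s for s in salaries]
print(estimate, filled)
```

The estimate ignores everything else known about the employee with the missing value, which is one reason the question posed above deserves some thought.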
Slide 241
Estimation Error
- Bias: Difference between expected value and actual value.
- Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE} = E[(\hat{\theta} - \theta)^2]$
- Why square?
- Root Mean Square Error (RMSE): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
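The three error measures can be computed empirically when the actual value is known. A small sketch, assuming a hypothetical list of repeated estimates of one quantity:

```python
import math

def estimation_errors(estimates, actual):
    """Empirical bias, MSE and RMSE of repeated estimates of one known value."""
    n = len(estimates)
    bias = sum(estimates) / n - actual
    mse = sum((e - actual) ** 2 for e in estimates) / n
    return bias, mse, math.sqrt(mse)

# Hypothetical estimates of a quantity whose actual value is 10.
bias, mse, rmse = estimation_errors([8, 12, 9, 11], actual=10)
print(bias, mse, rmse)
```

Here the bias is zero because the over- and under-estimates cancel, while the MSE is positive. That is one answer to "Why square?": squaring keeps errors of opposite sign from cancelling.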
Slide 242
Expectation-Maximization (EM)
- Solves estimation with incomplete data.
- Obtain initial estimates for parameters.
- Iteratively use estimates for missing data and continue until convergence.
Slide 243
Models Based on Summarization
- Visualization: Frequency distribution, mean, variance, median, mode, etc.
- Box Plot:
Slide 244
Bayes Theorem
- Posterior Probability: $P(h_1 \mid x_i)$
- Prior Probability: $P(h_1)$
- Bayes Theorem: $P(h_1 \mid x_i) = \dfrac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$
- Assign probabilities of hypotheses given a data value.
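A worked numeric example may help. The sketch below applies Bayes theorem to a set of mutually exclusive hypotheses, computing P(x) by summing P(x | h) P(h) over all of them; the two hypotheses and their probabilities are hypothetical.

```python
def posterior(priors, likelihoods):
    """priors[h] = P(h); likelihoods[h] = P(x | h); returns P(h | x) for every h."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)   # P(x)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Hypothetical numbers: two hypotheses about a credit applicant.
priors = {"good": 0.8, "bad": 0.2}
likelihoods = {"good": 0.1, "bad": 0.6}   # P(observed data value | hypothesis)
post = posterior(priors, likelihoods)
print(post)
```

Although "good" starts with the higher prior, the observed value is much more likely under "bad", so the posterior shifts to roughly 0.4 versus 0.6.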
Slide 245
Hypothesis Testing
- Find model to explain behavior by creating and then testing a hypothesis about the data.
- Exact opposite of usual DM approach.
- $H_0$: Null hypothesis; hypothesis to be tested.
- $H_1$: Alternative hypothesis
Slide 246
Regression
- Predict future values based on past values
- Linear Regression assumes linear relationship exists: $y = c_0 + c_1 x_1 + \dots + c_n x_n$
- Find values to best fit the data
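For a single predictor, the best-fitting coefficients have a closed form. A minimal sketch, assuming "best fit" means ordinary least squares; the data points are hypothetical and lie exactly on y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx                 # slope
    c0 = mean_y - c1 * mean_x      # intercept
    return c0, c1

c0, c1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = c0 + c1 * 5           # predict a future value at x = 5
print(c0, c1, prediction)
```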
Slide 247
Correlation
- Examine the degree to which the values for two variables behave similarly.
- Correlation coefficient r:
  - 1 = perfect correlation
  - -1 = perfect but opposite correlation
  - 0 = no correlation
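The three reference values of r can be reproduced with a short function. This sketch assumes r is the Pearson correlation coefficient; the sample vectors are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

print(pearson_r([1, 2, 3], [2, 4, 6]))          # values rise together
print(pearson_r([1, 2, 3], [6, 4, 2]))          # one rises as the other falls
print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear relationship
```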
Slide 248
Similarity Measures
- Determine similarity between two objects.
- Similarity characteristics:
- Alternatively, distance measures measure how unlike or dissimilar objects are.
Slide 249
Distance Measures
- Measure dissimilarity between objects
Slide 250
Decision Trees
- Decision Tree (DT):
  - Tree where the root and each internal node is labeled with a question.
  - The arcs represent each possible answer to the associated question.
  - Each leaf node represents a prediction of a solution to the problem.
- Popular technique for classification; leaf node indicates class to which the corresponding tuple belongs.
Slide 251
Decision Trees
- A Decision Tree Model is a computational model consisting of three parts:
  - Decision Tree
  - Algorithm to create the tree
  - Algorithm that applies the tree to data
- Creation of the tree is the most difficult part.
- Processing is basically a search similar to that in a binary search tree (although DT may not be binary).
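Applying a tree to data is the easy part, as the following sketch suggests. It assumes a hypothetical hand-built tree in which each internal node is a (question, branches) pair and each leaf is a class label; the tree-creation algorithm is not shown.

```python
# An internal node is (question, {answer: subtree}); a leaf is a class label.
# The question is a function that reads the answer off a record (tuple).
tree = (lambda r: r["salary"] > 50,
        {True: "good",
         False: (lambda r: r["profession"],
                 {"exec": "good", "clerk": "bad"})})

def classify(node, record):
    """Walk from the root to a leaf, following the arc that matches each answer."""
    while isinstance(node, tuple):
        question, branches = node
        node = branches[question(record)]
    return node

print(classify(tree, {"salary": 30, "profession": "exec"}))
```

The second node has one arc per profession, so this tree is not binary, yet the search still follows a single root-to-leaf path.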
Slide 252
Neural Networks
- Based on observed functioning of human brain.
- Artificial Neural Networks (ANN)
- Our view of neural networks is very simplistic.
- We view a neural network (NN) from a graphical viewpoint.
- Alternatively, a NN may be viewed from the perspective of matrices.
- Used in pattern recognition, speech recognition, computer vision, and classification.
Slide 253
Generating Rules
- Decision tree can be converted into a rule set
- Straightforward conversion:
  - each path to the leaf becomes a rule; makes an overly complex rule set
- More effective conversions are not trivial
  - (e.g. C4.8 tests each node in root-leaf path to see if it can be eliminated without loss in accuracy)
Slide 254
Covering algorithms
- Strategy for generating a rule set directly: for each class in turn find rule set that covers all instances in it (excluding instances not in the class)
- This approach is called a covering approach because at each stage a rule is identified that covers some of the instances
Slide 255
Rules vs. trees
- Corresponding decision tree: (produces exactly the same predictions)
- But: rule sets can be more clear when decision trees suffer from replicated subtrees
- Also: in multi-class situations, covering algorithm concentrates on one class at a time whereas decision tree learner takes all classes into account
Slide 256
A simple covering algorithm
- Generates a rule by adding tests that maximize rule's accuracy
- Similar to situation in decision trees: problem of selecting an attribute to split on
  - But: decision tree inducer maximizes overall purity
- Each new test reduces rule's coverage:
(Source: Witten & Eibe)
Slide 257
Algorithm Components 1. The task the algorithm is used to
address (e.g. classification, clustering, etc.) 2. The structure of
the model or pattern we are fitting to the data (e.g. a linear
regression model) 3. The score function used to judge the quality
of the fitted models or patterns (e.g. accuracy, BIC, etc.) 4. The
search or optimization method used to search over parameters and/or
structures (e.g. steepest descent, MCMC, etc.) 5. The data
management technique used for storing, indexing, and retrieving
data (critical when the data are too large to reside in memory)
Slide 258
Slide 259
Models and Patterns Models Prediction Probability Distributions
Structured Data Linear regression Piecewise linear
Slide 260
Models Prediction Probability Distributions Structured Data
Linear regression Piecewise linear Nonparametric regression
Slide 261
Slide 262
Models Prediction Probability Distributions Structured Data
Linear regression Piecewise linear Nonparametric regression
Classification logistic regression naive Bayes/TAN/Bayesian networks
NN support vector machines Trees etc.
Slide 263
Models Prediction Probability Distributions Structured Data
Linear regression Piecewise linear Nonparametric regression
Classification Parametric models Mixtures of parametric models
Graphical Markov models (categorical, continuous, mixed)
Slide 264
Models Prediction Probability Distributions Structured Data
Linear regression Piecewise linear Nonparametric regression
Classification Parametric models Mixtures of parametric models
Graphical Markov models (categorical, continuous, mixed) Time
series Markov models Mixture Transition Distribution models Hidden
Markov models Spatial models
Slide 265
Bias-Variance Tradeoff High Bias - Low Variance vs. Low Bias -
High Variance. Overfitting: modeling the random component. The
score function should embody the compromise.
Slide 266
Patterns Global Local Clustering via partitioning Hierarchical
Clustering Mixture Models Outlier detection Changepoint detection
Bump hunting Scan statistics Association rules
Slide 267
Scan Statistics via Permutation Tests [Figure: a curve
representing a road, with each accident marked by an x. A red x
denotes an injury accident; a black x means no injury.] Is there a
stretch of road where there is an unusually large fraction of
injury accidents?
Slide 268
Scan with Fixed Window zIf we know the length of the stretch of
road that we seek, we could slide a window of that length along the
road and find the most unusual window location [Figure: the road
with accidents marked by x and a fixed-length window]
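A fixed-window scan combined with a permutation test can be sketched as follows. This is an illustrative sketch, not the method from the slides in detail: it assumes accidents are given as `(position, injured)` pairs along a one-dimensional road, that candidate windows start at accident positions, and that windows with fewer than `min_count` accidents are ignored. All function and parameter names are hypothetical.

```python
import random


def best_window(accidents, width, min_count=3):
    """Return (fraction, start) for the window [start, start + width) with the
    largest fraction of injury accidents. Windows start at accident positions."""
    best = (0.0, None)
    for start, _ in accidents:
        inside = [inj for pos, inj in accidents if start <= pos < start + width]
        if len(inside) >= min_count:
            frac = sum(inside) / len(inside)
            if frac > best[0]:
                best = (frac, start)
    return best


def permutation_p_value(accidents, width, n_perm=999, seed=0, min_count=3):
    """Shuffle the injury labels over the fixed accident positions and count
    how often the best shuffled window is at least as extreme as the observed."""
    rng = random.Random(seed)
    observed, _ = best_window(accidents, width, min_count)
    positions = [pos for pos, _ in accidents]
    labels = [inj for _, inj in accidents]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        stat, _ = best_window(list(zip(positions, labels)), width, min_count)
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the maximum over all window locations is recomputed for every shuffle, the p-value accounts for having searched many windows rather than testing a single pre-chosen one.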
Slide 269
Spatial-Temporal Scan Statistics zSpatial-temporal scan
statistics use cylinders where the height of the cylinder represents
a time window
Slide 270
270 Major Data Mining Tasks zClassification: predicting an item
class zClustering: finding clusters in data zAssociations: e.g. A
& B & C occur frequently zVisualization: to facilitate
human discovery zSummarization: describing a group zDeviation
Detection: finding changes zEstimation: predicting a continuous
value zLink Analysis: finding relationships
Slide 271
271 Classification Learn a method for predicting the instance
class from pre-labeled (classified) instances Many approaches:
Statistics, Decision Trees, Neural Networks,...
Slide 272
272 Clustering Find natural grouping of instances given
un-labeled data
Slide 274
274 Visualization & Data Mining zVisualizing the data to
facilitate human discovery zPresenting the discovered results in a
visually "nice" way
Slide 275
275 Summarization zDescribe features of the selected group zUse
natural language and graphics zUsually in combination with
deviation detection or other methods zExample: "Average length of
stay in this study area rose 45.7 percent, from 4.3 days to 6.2
days, because..."
Slide 276
276 Data Mining Central Quest Find true patterns and avoid
overfitting (finding seemingly significant but really random
patterns due to searching too many possibilities)
Slide 277
277 Classification Learn a method for predicting the instance
class from pre-labeled (classified) instances Many approaches:
Regression, Decision Trees, Bayesian, Neural Networks,... Given a
set of points from known classes, what is the class of a new point?
Slide 278
278 Classification: Linear Regression $w_0 + w_1 x + w_2 y \ge 0$.
Regression computes the weights $w_i$ from the data to minimize the
squared error of the fit. Not flexible enough.
Slide 279
279 Classification: Decision Trees (axes: X, Y) if X > 5 then
blue, else if Y > 3 then blue, else if X > 2 then green, else
blue
Slide 280
280 DECISION TREE zAn internal node is a test on an attribute.
zA branch represents an outcome of the test, e.g., Color=red. zA
leaf node represents a class label or class label distribution. zAt
each node, one attribute is chosen to split training examples into
distinct classes as much as possible zA new instance is classified
by following a matching path to a leaf node.
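The structure described above (internal nodes as tests, branches as outcomes, leaves as class labels) can be sketched as a small data structure. The tree built here encodes the X/Y thresholds from the "Classification: Decision Trees" slide; the class names `Leaf` and `Node` and the function `classify` are hypothetical, and tests are assumed to be binary.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class label stored at a leaf node


@dataclass
class Node:
    test: Callable[[dict], bool]      # test on an attribute
    if_true: Union["Node", Leaf]      # branch taken when the test succeeds
    if_false: Union["Node", Leaf]     # branch taken when the test fails


def classify(tree, instance):
    """Follow the matching path from the root down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.test(instance) else tree.if_false
    return tree.label


# if X > 5 then blue, else if Y > 3 then blue,
# else if X > 2 then green, else blue
TREE = Node(lambda p: p["X"] > 5, Leaf("blue"),
            Node(lambda p: p["Y"] > 3, Leaf("blue"),
                 Node(lambda p: p["X"] > 2, Leaf("green"), Leaf("blue"))))
```

For example, `classify(TREE, {"X": 3, "Y": 1})` fails the first two tests, passes `X > 2`, and ends at the green leaf.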
Slide 281
281 Classification: Neural Nets Can select more complex regions.
Can be more accurate. Can also overfit the data: find patterns in
random noise.
Slide 282
282 Evaluating which method works the best for classification
zNo model is uniformly the best zDimensions for Comparison yspeed
of training yspeed of model application ynoise tolerance
yexplanation ability zBest Results: Hybrid, Integrated models
Slide 283
283 Comparison of Major Classification Approaches A hybrid
method will have higher accuracy
Slide 284
284 Evaluation of Classification Models zHow predictive is the
model we learned? zError on the training data is not a good
indicator of performance on future data yThe new data will probably
not be exactly the same as the training data! zOverfitting: fitting
the training data too precisely - usually leads to poor results on
new data
Slide 285
285 Classification: Train, Validation, Test split [Figure: data
with known results is divided into a training set, a validation
set, and a final test set. The model builder learns from the
training set, its predictions (Y/N) are evaluated on the validation
set, and the final model gets a final evaluation on the final test
set.]
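A three-way split of this kind can be sketched as below. The 60/20/20 proportions, the function name `three_way_split`, and the fixed seed are assumptions made for illustration; the slide does not specify them.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and final test sets.
    The final test set must not be used until the model is finished."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test
```

The validation set is used repeatedly while choosing between models; the final test set is touched only once, so its error estimate is not biased by that selection.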
Slide 286
286 Cross-validation zCross-validation avoids overlapping test
sets yFirst step: data is split into k subsets of equal size
ySecond step: each subset in turn is used for testing and the
remainder for training zThis is called k-fold cross-validation
zOften the subsets are stratified before the cross-validation is
performed zThe error estimates are averaged to yield an overall
error estimate
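The two steps of k-fold cross-validation can be sketched as follows. This is a minimal sketch: it assumes the data are already in random order (no shuffling or stratification is done), and the toy majority-class learner and all function names are hypothetical stand-ins for a real model.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds whose sizes differ by at most one."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(xs, ys, k, train_fn, error_fn):
    """Use each fold as the test set once and average the k error estimates."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = train_fn([xs[i] for i in train_idx],
                         [ys[i] for i in train_idx])
        errors.append(error_fn(model,
                               [xs[i] for i in test_idx],
                               [ys[i] for i in test_idx]))
    return sum(errors) / len(errors)


# Toy learner (an assumption for illustration): predict the majority class.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def error_rate(model, xs, ys):
    return sum(1 for y in ys if y != model) / len(ys)
```

Every instance is tested exactly once, so the k test sets never overlap, which is the property the slide highlights.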
Slide 287
287 Cross-validation example: Break up the data into groups of
the same size. Hold aside one group for testing and use the rest to
build the model. Repeat for each group.
Slide 288
288 More on cross-validation zStandard method for evaluation:
stratified ten-fold cross-validation zWhy ten? Extensive
experiments have shown that this is the best choice to get an
accurate estimate zStratification reduces the estimate's variance
zEven better: repeated stratified cross-validation yE.g. ten-fold
cross-validation is repeated ten times and results are averaged
(reduces the variance)
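Stratification and repetition can be sketched as fold assignment alone, leaving the training step aside. This is a sketch under assumptions: class labels are hashable values, folds are filled round-robin within each class, and the names `stratified_folds` and `repeated_stratified_folds` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Assign each index to one of k folds so that every fold receives
    roughly the same share of each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the members of this class out to the folds in turn.
        for j, i in enumerate(indices):
            folds[(offset + j) % k].append(i)
        offset += len(indices)
    return folds


def repeated_stratified_folds(labels, k=10, repeats=10, seed=0):
    """Return `repeats` independent stratified k-fold partitions, e.g.
    ten times ten-fold; the resulting error estimates are then averaged."""
    rng = random.Random(seed)
    return [stratified_folds(labels, k, rng) for _ in range(repeats)]
```

Each repetition reshuffles within classes, so the ten partitions differ while every fold keeps approximately the class proportions of the full data set.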
Slide 289
289 Clustering Methods zMany different methods and algorithms:
yFor numeric and/or symbolic data yDeterministic vs. probabilistic
yExclusive vs. overlapping yHierarchical vs. flat yTop-down vs.
bottom-up
Slide 290
290 Clustering Evaluation zManual inspection zBenchmarking on
existing labels zCluster quality measures ydistance measures yhigh
similarity within a cluster, low across clusters
Slide 291
291 The distance function zSimplest case: one numeric attribute
A yDistance(X,Y) = $|A(X) - A(Y)|$ zSeveral numeric attributes:
yDistance(X,Y) = Euclidean distance between X and Y zNominal
attributes: distance is set to 1 if values are different, 0 if they
are equal zAre all attributes equally important? yWeighting the
attributes might be necessary
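The cases on this slide can be combined into one function. This is a sketch under assumptions the slide does not state: numeric and nominal contributions are squared and summed together in a Euclidean fashion, weights multiply the squared per-attribute differences, and numeric attributes are assumed to be on comparable scales already. The name `distance` and its `weights` parameter are hypothetical.

```python
import math


def distance(x, y, weights=None):
    """Distance between two instances given as equal-length tuples.
    Numeric attributes contribute their difference; nominal attributes
    contribute 0 if equal and 1 if different."""
    weights = weights or [1.0] * len(x)
    total = 0.0
    for a, b, w in zip(x, y, weights):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            d = a - b
        else:
            d = 0.0 if a == b else 1.0
        total += w * d * d
    return math.sqrt(total)
```

With a single numeric attribute this reduces to the absolute difference, and setting a weight to zero removes that attribute from the comparison entirely.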
Slide 292
292 Simple Clustering: K-means Works with numeric data only
1) Pick a number (K) of cluster centers (at random) 2) Assign every
item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2 and 3 until convergence (change in cluster
assignments less than a threshold)
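The four steps can be sketched directly. This is a minimal sketch with assumptions: the initial centers are drawn at random from the data points themselves, convergence is taken to mean that no assignment changed (a threshold of zero), and a center with no assigned items is left where it is. The names `sq_dist` and `k_means` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 1: pick K centers at random
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:       # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers, assignment
```

The result depends on the random initial centers, so in practice the procedure is often run several times with different seeds and the best clustering kept.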
Slide 293
293 Data Mining in CRM: Customer Life Cycle zCustomer Life
Cycle yThe stages in the relationship between a customer and a
business zKey stages in the customer lifecycle yProspects: people
who are not yet customers but are in the target market yResponders:
prospects who show an interest in a product or service yActive
Customers: people who are currently using the product or service
yFormer Customers: may be bad customers who did not pay their bills
or who incurred high costs zIt's important to know life cycle events
(e.g. retirement)
Slide 294
294 Data Mining in CRM: Customer Life Cycle zWhat marketers
want: Increasing customer revenue and customer profitability
yUp-sell yCross-sell yKeeping the customers for a longer period of
time zSolution: Applying data mining
Slide 295
295 Data Mining in CRM zDM helps to yDetermine the behavior
surrounding a particular lifecycle event yFind other people in
similar life stages and determine which customers are following
similar behavior patterns
Slide 296
296 Data Mining in CRM (cont.) Data Warehouse Data Mining
Campaign Management Customer Profile Customer Life Cycle Info.
Slide 297
CRISP-DM: Benefits of a standard methodology zCommunication yA
common language zRepeatability yRational structure zEducation yHow
do I start? www.crisp-dm.org
Slide 298
CRISP-DM Overview An industry-standard process model for data
mining. Not sector-specific Non-proprietary CRISP-DM Phases:
Business Understanding Data Understanding Data Preparation Modeling
Evaluation Deployment Not strictly ordered - respects iterative
aspect of data mining www.crisp-dm.org
Slide 299
299 Rules vs. decision lists zPRISM with outer loop removed
generates a decision list for one class ySubsequent rules are
designed for instances that are not covered by previous rules yBut:
order doesn't matter because all rules predict the same class zOuter
loop considers all classes separately yNo order dependence implied
zProblems: overlapping rules, default rule required
Slide 300
Process Standardization CRISP-DM: CRoss Industry Standard
Process for Data Mining. Initiative launched Sept. 1996: SPSS/ISL,
NCR, Daimler-Benz, OHRA. Funding from the European Commission. Over
200 members of the CRISP-DM SIG worldwide: DM Vendors - SPSS, NCR,
IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify,..; System
Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte &
Touche; End Users - BT, ABB, Lloyds Bank, AirTouch, Experian,...
Slide 301
CRISP-DM Non-proprietary Application/Industry neutral Tool
neutral Focus on business issues As well as technical analysis
Framework for guidance Experience base Templates for Analysis
Slide 302
Why CRISP-DM? The data mining process must be reliable and
repeatable by people with few data mining skills. CRISP-DM
provides a uniform framework for guidelines and experience
documentation. CRISP-DM is flexible to account for differences:
Different business/agency problems Different data