View
1.985
Download
1
Category
Tags:
Preview:
DESCRIPTION
Citation preview
© Copyright 2006, Natasha Balac 1
Introduction to Data Mining
Natasha Balac, Ph.D.
© Copyright 2006, Natasha Balac 2
Outline
Motivation: Why Data Mining?
What is Data Mining?
History of Data Mining
Data Mining Functionality and Terminology
Data Mining Applications
Are all the Patterns Interesting?
Issues in Data Mining
© Copyright 2006, Natasha Balac 3
Necessity is the Mother of Invention
Data explosion
Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information
repositories
We are drowning in data, but starving for
knowledge!
© Copyright 2006, Natasha Balac 4
Necessity is the Mother of Invention
We are drowning in data, but starving for
knowledge!
Solution
Data Mining
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
© Copyright 2006, Natasha Balac 5
Why DATA MINING?
Huge amounts of data Electronic records of our decisions
Choices in the supermarket Financial records Our comings and goings
We swipe our way through the world – every swipe is a record in a database
Data rich – but information poor Lying hidden in all this data is information!
© Copyright 2006, Natasha Balac 6
Data vs. Information
Society produces massive amounts of data business, science, medicine, economics, sports, …
Potentially valuable resource Raw data is useless
need techniques to automatically extract information Data: recorded facts Information: patterns underlying the data
© Copyright 2006, Natasha Balac 7
What is DATA MINING?
Extracting or “mining” knowledge from large amounts of data
Data -driven discovery and modeling of hidden patterns (we never new existed) in large volumes of data
Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
© Copyright 2006, Natasha Balac 8
Data mining: Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) information or patterns from data in large databases
What Is Data Mining?
© Copyright 2006, Natasha Balac 9
Data Mining is NOT
Data Warehousing (Deductive) query processing
SQL/ Reporting Software Agents Expert Systems Online Analytical Processing (OLAP) Statistical Analysis Tool Data visualization
© Copyright 2006, Natasha Balac 10
Data Mining
Programs that detect patterns and rules in the data
Strong patterns can be used to make non-trivial predictions on new data
© Copyright 2006, Natasha Balac 11
Data Mining Challenges
Problem 1: most patterns are not interesting
Problem 2: patterns may be inexact or completely spurious when noisy data present
© Copyright 2006, Natasha Balac 12
Machine Learning Techniques
Technical basis for data mining: algorithms for
acquiring structural descriptions from examples
Methods originate from artificial intelligence,
statistics, and research on databases
© Copyright 2006, Natasha Balac 13
Machine Learning Techniques
Structural descriptions represent patterns explicitly can be used to predict outcome in new situation understand and explain how prediction is derived
(maybe even more important)
© Copyright 2006, Natasha Balac 14
Multidisciplinary Field
Data Mining
Database Technology
Statistics
OtherDisciplines
Artificial Intelligence
MachineLearning Visualization
© Copyright 2006, Natasha Balac 15
Multidisciplinary Field
Database technology Artificial Intelligence
Machine Learning including Neural Networks Statistics Pattern recognition Knowledge-based systems/acquisition High-performance computing Data visualization
© Copyright 2006, Natasha Balac 16
History of Data Mining
© Copyright 2006, Natasha Balac 17
History
Emerged late 1980s Flourished –1990s Roots traced back along three family lines
Classical Statistics Artificial Intelligence Machine Learning
© Copyright 2006, Natasha Balac 18
Statistics
Foundation of most DM technologies Regression analysis, standard
distribution/deviation/variance, cluster analysis, confidence intervals
Building blocks Significant role in today’s data mining –
but alone is not powerful enough
© Copyright 2006, Natasha Balac 19
Artificial Intelligence
Heuristics vs. Statistics Human-thought-like processing Requires vast computer processing power Supercomputers
© Copyright 2006, Natasha Balac 20
Machine Learning
Union of statistics and AI Blends AI heuristics with advanced statistical
analysis Machine Learning – let computer programs
learn about data they study - make different decisions based on the quality of studied data
using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms
© Copyright 2006, Natasha Balac 21
Data Mining
Adoption of the Machine learning techniques to the real world problems
Union: Statistics, AI, Machine learning Used to find previously hidden trends or
patterns Finding increasing acceptance in science and
business areas which need to analyze large amount of data to discover trends which could not be found otherwise
© Copyright 2006, Natasha Balac 22
Terminology
Gold Mining Knowledge mining from databases Knowledge extraction Data/pattern analysis Knowledge Discovery Databases or KDD Information harvesting Business intelligence
© Copyright 2006, Natasha Balac 23
KDD Process
Database
Selection Transformation
Data Preparation
Data Data MiningMining
Training Data
Evaluation, Verification
Model, Patterns
© Copyright 2006, Natasha Balac 24
LEARNING ALGORITHMS
Fundamental idea:
learn rules/patterns/relationships automatically from the data
© Copyright 2006, Natasha Balac 25
Data Mining Tasks
Exploratory Data Analysis Predictive Modeling: Classification and Regression Descriptive Modeling
Cluster analysis/segmentation Discovering Patterns and Rules
Association/Dependency rules Sequential patterns Temporal sequences
Deviation detection
© Copyright 2006, Natasha Balac 26
Data Mining Tasks
Concept/Class description: Characterization and discrimination Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
Association (correlation and causality) Multi-dimensional or single-dimensional association
age(X, “20-29”) ^ income(X, “60-90K”) buys(X, “TV”)
© Copyright 2006, Natasha Balac 27
Data Mining Tasks
Classification and Prediction
Finding models (functions) that describe and distinguish classes or concepts for future prediction
Example: classify countries based on climate, or classify cars based on gas mileage
Presentation: If-THEN rules, decision-tree, classification rule,
neural network Prediction: Predict some unknown or missing
numerical values
© Copyright 2006, Natasha Balac 28
Cluster analysis Class label is unknown: Group data to form
new classes, Example: cluster houses to find distribution
patterns
Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
Data Mining Tasks
© Copyright 2006, Natasha Balac 29
Data Mining Tasks
Outlier analysis Outlier: a data object that does not comply with the
general behavior of the data
Mostly considered as noise or exception, but is
quite useful in fraud detection, rare events analysis
Trend and evolution analysis Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
© Copyright 2006, Natasha Balac 30
Data Mining: Classification Schemes
General functionality Descriptive data mining Vs. Predictive data mining
Different views - different classifications Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques employed
Kinds of applications
© Copyright 2006, Natasha Balac 31
A Multi-Dimensional View of Data Mining Classification
Databases to be mined Relational, transactional, object-oriented, object-
relational, active, spatial, time-series, text, multi-media,WWW, etc.
Knowledge to be mined Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier analysis, etc.
Multiple/integrated functions Mining at multiple levels of abstractions
© Copyright 2006, Natasha Balac 32
A Multi-Dimensional View of Data Mining Classification
Techniques utilized Decision/Regression trees, clustering, neural
networks, etc. Applications adapted
Retail, telecom, banking, DNA mining, stock market analysis, Web mining
© Copyright 2006, Natasha Balac 33
Data Mining Applications
Science: Chemistry, Physics, Medicine Biochemical analysis Remote sensors on a satellite Telescopes – star galaxy classification Medical Image analysis
© Copyright 2006, Natasha Balac 34
Data Mining Applications
Bioscience Sequence-based analysis Protein structure and function prediction Protein family classification Microarray gene expression
© Copyright 2006, Natasha Balac 35
Pharmaceutical companies, Insurance and Health care, Medicine Drug development Identify successful medical therapies Claims analysis, fraudulent behavior Medical diagnostic tools Predict office visits
Data Mining Applications
© Copyright 2006, Natasha Balac 36
Financial Industry, Banks, Businesses, E-commerce Stock and investment analysis Identify loyal customers vs. risky customer Predict customer spending Risk management Sales forecasting
Data Mining Applications
© Copyright 2006, Natasha Balac 37
Retail and Marketing Customer buying patterns/demographic
characteristics Mailing campaigns Market basket analysis Trend analysis
Data Mining Applications
© Copyright 2006, Natasha Balac 38
Database analysis and decision support Market analysis and management
target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
Risk analysis and management Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
Fraud detection and management
Data Mining Applications
© Copyright 2006, Natasha Balac 39
Sports and Entertainment IBM Advanced Scout analyzed NBA game
statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy JPL and the Palomar Observatory discovered
22 quasars with the help of data mining
Data Mining Applications
© Copyright 2006, Natasha Balac 40
DATA MINING EXAMPLES
Grocery store NBA Banking and Credit Card scoring
Fraud detection Personalization & Customer Profiling Campaign Management and Database
Marketing
© Copyright 2006, Natasha Balac 41
Data mining at work: Case study 1
© Copyright 2006, Natasha Balac 42
Processing Loan Applications
Given: questionnaire with financial and personal information
Problem: should money be lend? Borderline cases referred to loan officers But: 50% of accepted borderline cases defaulted! Solution:
reject all borderline cases? Borderline cases are most active customers!
© Copyright 2006, Natasha Balac 43
Enter Machine Learning
Given: 1000 training examples of borderline cases
20 attributes: age, years with current employer,years at current address,
years with the bank, years at current job, other credit cards Learned rules predicted 2/3 of borderline cases
correctly! Rules could be used to explain decisions to customers
© Copyright 2006, Natasha Balac 44
Case study 2:Screening images
Given: radar satellite images of coastal waters
Problem: detecting oil slicks in those images
Oil slicks = dark regions with changing size and shape
Look-alike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel
© Copyright 2006, Natasha Balac 45
Dark regions extracted from normalized image Attributes:
size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background
Constraints: Scarcity of training examples (oil slicks are rare!) Unbalanced data: most dark regions aren’t oil
slicks Regions from same image form a batch Requirement is adjustable false-alarm rate
Enter Machine Learning
© Copyright 2006, Natasha Balac 46
Data Mining Challenges
Computationally expensive to investigate all possibilities
Dealing with noise/missing information and errors in data
Choosing appropriate attributes/input representation
Finding the minimal attribute space Finding adequate evaluation function(s) Extracting meaningful information Not overfitting
© Copyright 2006, Natasha Balac 47
Are All the “Discovered” Patterns Interesting?
Interestingness measures: A pattern is
interesting if it is easily understood by humans,
valid on new or test data with some degree of
certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
© Copyright 2006, Natasha Balac 48
Are All the “Discovered” Patterns Interesting?
Objective vs. subjective measures: Objective: based on statistics and structures of
patterns support and confidence
Subjective: based on user’s belief in the data unexpectedness, novelty, action ability, etc.
© Copyright 2006, Natasha Balac 49
Can We Find All and Only Interesting Patterns?
Completeness - Find all the interesting
patterns Can a data mining system find all the interesting
patterns?
Association vs. classification vs. clustering
© Copyright 2006, Natasha Balac 50
Can We Find All and Only Interesting Patterns?
Optimization - Search for only interesting patterns
Can a data mining system find only the interesting
patterns?
Approaches First general all the patterns and then filter out the
uninteresting ones
Mining query optimization
© Copyright 2006, Natasha Balac 51
Major Issues in Data Mining
Mining methodology and user interaction Mining different kinds of knowledge in databases Incorporation of background knowledge Handling noise and incomplete data Pattern evaluation: the interestingness problem Expression and visualization of data mining results
© Copyright 2006, Natasha Balac 52
Performance and scalability Efficiency of data mining algorithms Parallel, distributed and incremental mining
methods Issues relating to the diversity of data types
Handling relational and complex types of data Mining information from diverse databases
Major Issues in Data Mining
© Copyright 2006, Natasha Balac 53
Issues related to applications and social impacts Application of discovered knowledge
Domain-specific data mining tools Intelligent query answering Expert systems Process control and decision making
A knowledge fusion problem Protection of data security, integrity, and privacy
Major Issues in Data Mining
© Copyright 2006, Natasha Balac 54
Summary
Data mining: discovering interesting patterns from large amounts of data
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
© Copyright 2006, Natasha Balac 55
Summary
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems Major issues in data mining
© Copyright 2006, Natasha Balac 56
Exercise
Practical Data mining example
© Copyright 2006, Natasha Balac 57
Kinds of Data Mining
Decision Tree Learning Clustering Neural Networks Association Rules Support Vector Machines Genetic Algorithms Nearest Neighbor Method
© Copyright 2006, Natasha Balac 58
Decision Tree ExampleGrandparents
A lot A little
© Copyright 2006, Natasha Balac 59
DECISION TREE FOR THE CONCEPT
“Play Tennis”Day Outlook Temp Humidity Wind PlayTennis
D1 Sunny Hot High Weak NoD2 Sunny Hot High Strong NoD3 Overcast Hot High Weak YesD4 Rain Mild High Weak YesD5 Rain Cool Normal Weak YesD6 Rain Cool Normal Strong NoD7 Overcast Cool Normal Strong YesD8 Sunny Mild High Weak NoD9 Sunny Cool Normal Weak YesD10 Rain Mild Normal Weak YesD11 Sunny Mild Normal Strong YesD12 Overcast Mild High Strong YesD13 Overcast Hot Normal Weak YesD14 Rain Mild High Strong NoMitchell, 1997
© Copyright 2006, Natasha Balac 60
DECISION TREE FOR THE CONCEPT
“Play Tennis”
[Mitchell,1997]
Recommended