CSE 8392 SPRING 1999
DATA MINING: PART I
Professor Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas 75275
(214) 768-3087
fax: (214) 768-3085
email: [email protected]
www: http://www.seas.smu.edu/~mhd
January 1999
CSE 8392 Spring 1999 2
CSE 8392 SPRING 1999 OUTLINE
• Course Objective: To examine Data Mining concepts. A database perspective (rather than AI or statistics) is taken.
• I. Introduction and Related Topics
• II. Core Topics
• III. Advanced Topics
• IV. Case Studies
• V. Student Presentations
• VI. Summary and Future Trends
INTRODUCTION AND RELATED TOPICS
• Section Objective: Provide an introduction to data mining concepts. Briefly examine related concepts and background topics.
• Historical Perspective
– Gleaning Knowledge from the Data
– User Expectations increase as amount/sophistication of collected data increases.
– Reality vs Extracted Data
[Figure: Reality vs. extracted data. Labels: Reality, Information Need, Query, Data; Physical View, Database View]
Related Topics (to be covered)
– Knowledge Discovery
– Information Retrieval
– Fuzzy Sets
– Data Warehousing and OLAP
– Dimensional Modeling
Data Mining Overview
• What is Data Mining?
– Definition: Fayyad, p. 9
– A.k.a.
• Exploratory data analysis
• Unsupervised pattern recognition
• Data driven discovery
• Deductive learning
• Data Mining determines patterns in the data
– Non-trivial
– Valid
– Novel
– Potentially useful
– Interesting
– General and simple
– Understandable
DM Techniques (R[1])
• DM involves many different algorithms to accomplish different tasks. All have the following components in common.
– Model (Must fit a model to the data.)
• Function/Purpose
• Representation
– Preference Criteria (How to choose one model over another?)
– Search Algorithm (How to search the data)
• Example (Loan Data, fig 1.1 p6 in Fayyad):
– Model: Classification, Linear Function
– Preference: What best fits data? (Fig 1.2 or 1.4)
– Search Algorithm: Linear search of database
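The three components above can be made concrete with a minimal sketch. The loan records, the single income feature, and the function names are hypothetical and are not taken from Fayyad's figures. Here the model is a one-variable linear decision boundary, the preference criterion is the number of misclassified records, and the search is a linear scan over candidate thresholds.

```python
# Hypothetical loan records: (income in $1000s, repaid_loan). Illustrative only.
loans = [(20, False), (25, False), (32, False), (38, True),
         (45, True), (51, False), (60, True), (72, True)]

def classify(income, threshold):
    """Model: predict 'repays' when income is at or above the threshold."""
    return income >= threshold

def errors(threshold):
    """Preference criterion: number of records the model misclassifies."""
    return sum(classify(income, threshold) != repaid for income, repaid in loans)

def best_threshold():
    """Search: linear scan over candidate thresholds taken from the data."""
    candidates = sorted(income for income, _ in loans)
    return min(candidates, key=errors)

threshold = best_threshold()
print(threshold, errors(threshold))
```

Even the best threshold misclassifies one record in this made-up data, which illustrates that the fitted model is an approximation rather than an exact description.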
DM Model Functions (R[1])
• Classification - Map data into predefined groups
• Regression - Map data to a real valued prediction variable
• Clustering - Map data into groups defined by data itself
• Summarization - Map subsets of data into simple description
• Dependency Modeling - Identify dependencies among data items
• Link Analysis - Identify other relationships among data (association rules)
• Sequence Analysis - Identify sequential patterns in data
DM Historical Perspective
• Late 70’s: Spreadsheet analysis
• 80’s: Transactional databases support data storage and retrieval
• Early 90’s: Growing interest in end user support (a.k.a. decision support)
– Issue: transactional databases are not designed for decision support
• Mid 90’s: Dedicated data warehouses for decision support and multidimensional analysis
• Late 90’s: Proliferation; new concepts (data marts)
• DM Tools: Neovista, Red Brick
Data Mining Metrics
• Berson, Tables 17-1,17-2,17-3, p 347
• Accuracy
• Clarity
• Dirty Data
• Dimensionality
• Raw Data (Preprocessing)
• RDBMS embedding
• Scalability
• Speed
• Validation
DM Issues
• Overfitting
• Outliers
• Closed World Assumption
• Database schemas and database models
• Algorithms for data mining
• Interpretation and visualization of results
• Size of databases
• Multimedia data, Spatio-Temporal Data
• Changing data
• Integration
• DM Applications
– Market basket analysis
– Stock analysis and selection
– Fraud detection and prevention
– Crisis prediction and prevention
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
• “Overall process of discovering useful knowledge from data.” (p28 in R[1])
• Defn: R[1] p 30
• Steps Fig 1, p29 R[1] (Fig 1.3 in Fayyad)
• Data Mining is one step in KDD process
• KDD objective is not usually clear or exact. May require time with the customer to understand their needs.
• Data usually has problems - needs cleaning
– Incorrect/missing data
– Extract from multiple sources and compare
– Delete anomalous data and sources
– Different data types/metrics
FUZZY SETS and LOGIC
• Set membership described by a real valued membership function with values in [0,1]
• Ex: Set of all tall people
• Set membership function: f(x) = "x is tall" iff height(x) > 6 ft.
• Note that this is a simple classification problem. As in the Loan example, the results are not exact.
• Basis of many classification and clustering approaches
• In a conventional DB how do you retrieve all tall people?
– Three valued logic: True, False, Maybe
– Multi-valued logic: More than 2 values
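The contrast between the crisp "taller than 6 ft" rule and a real valued membership function can be sketched as follows. The linear ramp and its breakpoints (5.5 ft and 6.5 ft) are assumptions chosen for illustration; the slides do not specify a particular fuzzy membership function.

```python
def crisp_tall(height_ft):
    """Crisp set membership: 1.0 if taller than 6 ft, otherwise 0.0."""
    return 1.0 if height_ft > 6.0 else 0.0

def fuzzy_tall(height_ft, low=5.5, high=6.5):
    """Fuzzy membership: rises linearly from 0 at `low` to 1 at `high`.

    The breakpoints are assumed values, not taken from the course material.
    """
    if height_ft <= low:
        return 0.0
    if height_ft >= high:
        return 1.0
    return (height_ft - low) / (high - low)

for h in (5.4, 5.9, 6.0, 6.1, 6.6):
    print(h, crisp_tall(h), round(fuzzy_tall(h), 2))
```

Under the crisp rule, people of 6.0 ft and 6.1 ft fall on opposite sides of the boundary, while the fuzzy function assigns them nearly equal degrees of membership.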
Fuzzy Logic
• Reasoning with uncertainty
• Extends multivalued logic; allows user to communicate using imprecise concepts, e.g.
– “good” and “bad”
– “close to” and “far away”
• Avoids brittleness of rule based reasoning by introducing a degree of set membership (loosely, a probability of membership)
– Allows for smoother transition between classification sets in the domain
– Example
• Berson figure 16.2, page 325
INFORMATION RETRIEVAL
• Store and retrieve documents based on fuzzy queries
• Predecessor of web based access
• Ex: Store information about all articles in all IEEE Transactions journals and Retrieve all documents dealing with heaps.
• Overview
– Conventional IR Systems
– Query Structures (Keywords)
– Matching (Multivalued logic)
– Measures
– Text Analysis Techniques
– IR Related Topics
Conventional IR Systems
• Library card catalogs
• Documents (Library Science)
– Formatted
– Unformatted (Text)
– Mixed
• Document Surrogates
– Identifiers
– Titles, names, and dates
– Abstracts, extracts, reviews
– Summaries of Numerical Data
– Image Descriptions
IR Queries
• Query Structures
– Matching Criteria
– Boolean Queries
– Vector
– Fuzzy
– Natural Language
• Logical combination of keywords
• Weight associated with keywords
• Similarity measures
Similarity Measures
– Document Vector: $D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$
– Different Measures: $Sim(D_i, D_j) = \sum_{k=1}^{n} d_{ik} \, d_{jk}$
– Salton and McGill, Introduction to Modern Information Retrieval, 1984, McGraw-Hill, pp201-204.
– Similarity uses:
• Document-Document
• Query-Query
• Document-Query
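The inner-product similarity between two document vectors on this slide can be sketched in a few lines. The example vectors are hypothetical term weights, and the cosine variant is included only as one commonly used normalized measure; it is an assumption about which other measures the slide has in mind.

```python
import math

def inner_product(d_i, d_j):
    """Sim(D_i, D_j) = sum over k of d_ik * d_jk."""
    if len(d_i) != len(d_j):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(d_i, d_j))

def cosine(d_i, d_j):
    """Inner product divided by the product of the vector lengths."""
    norm = math.sqrt(inner_product(d_i, d_i)) * math.sqrt(inner_product(d_j, d_j))
    return inner_product(d_i, d_j) / norm if norm else 0.0

# Hypothetical term-weight vectors over a 4-term vocabulary.
doc1 = [1, 0, 2, 0]
doc2 = [0, 1, 1, 0]
query = [0, 0, 1, 0]

print(inner_product(doc1, doc2))   # document-document
print(inner_product(doc1, query))  # document-query
print(cosine(doc1, query))
```

The same function serves all three uses listed above, because a query is represented as a vector over the same terms as the documents.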
IR Document/Query Matching
• Matching Process
– Relevance and Similarity Measures
– Boolean based matching
• Logical match
– Vector based matching
• Threshold match
– Probabilistic Match
• P(relevant) = n / N, where n = number of relevant documents and N = total number of documents
– Fuzzy Matching
– Proximity Matching
– Weighting
– Relative Importance of Items
IR Matching
• Scaling
– Impact of Sample Size
– Clustering
– Centroids
• Measures
– Precision
– Recall
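The two measures can be illustrated with a short sketch using their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]

p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Retrieving more documents tends to raise recall and lower precision, which is why the two are usually reported together.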
IR Indexing
• Text Analysis
– Indexing is the assignment of keywords or terms that represent document content
• Originally a library science problem that has grown with the advent of web based searches
– Indexing types
• Automated vs. manual
• Controlled vs. uncontrolled
• Single term vs. terms in context
• Deep vs. shallow
IR Indexing
• General Steps
– 1. Assignment of terms or concepts capable of representing content
– 2. Assignment of a weight or value to each term
• Indexing
– Vector based
• Start with excerpts, remove high frequency words
– Stop list
– Thesaurus
• Compute discrimination values of terms
IR Retrieval
• Retrieval or Classification
– Vector based
• Same starting point as with indexing
• Compute weighting factors
• Assign to each document a weighted term vector
– Similarity Measures
• Measure similarity between document/query
• Results normalized to range between 0 - 1
IR Retrieval
– Inverse Document Frequency
• Assumes a term's importance is proportional to its occurrence frequency within a document, and inversely proportional to the number of documents in which the term occurs.
• Also used for similarity measurement
– Inverted Indexing of Document
– Concept Hierarchy
• DAG of concepts
• Follow nodes from general to more specific
• Tag articles with low level concepts so that each may be distinguished from ancestors
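The inverse document frequency weighting listed on this slide can be sketched as follows. The formula tf × log(N / df) is one common formulation and is assumed here; the slides do not give an exact formula. The documents and the tiny stop list are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "of", "and", "in", "is"}  # tiny assumed stop list

# Hypothetical document collection.
docs = {
    "d1": "the heap is a tree",
    "d2": "a heap of data in the heap",
    "d3": "sorting data and searching data",
}

def terms(text):
    """Split text into lowercase terms and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    counts = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        doc_id: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
        for doc_id, c in counts.items()
    }

weights = tf_idf(docs)
for doc_id, vector in weights.items():
    print(doc_id, vector)
```

A term that appears in only one document ("tree") receives a higher weight than one spread across several documents ("heap"), and each resulting dictionary is a weighted term vector for its document.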
IR Related Topics
• Information Retrieval Related Topics
– Text Analysis
– Fuzzy Sets
– Extending Databases
– Hypertext
– Digital Libraries
– Data Mining
• Web based browsers
DATA WAREHOUSING AND OLAP
– Preparations for Mining: Data Warehousing
• Extracting the data (from RDBMS)
• Storing the data
– Data warehouse or data mart
• Cleansing the data
• Mining the data
– Often with multidimensional queries
• Definition
– Blend of technologies
– Integration
– Enables Strategic Use of Data
• Architecture
– Figure 6.1, page 116
DW Migration
• Migration from Relational Database to Data Warehouse
– Differences (Relational vs. Data Warehouse)
– Procedure for Migration
• Extraction
• Cleanup
• Transformation
• Migration
• Issues
– Multiple sources
– Database Heterogeneity
– Data Heterogeneity
DW Design
• Data Warehouse Design Considerations - Nine Step Method:
– Subject Matter
– Fact Table contents
– Dimensioning
– Fact Selection
– Precalculations
– Rounding out dimension table
– Duration selection
– What about change?
– Query priorities
• Technical Considerations
– Hardware
– Communications Infrastructure
– Data Structures
More on DW
• Benefits
– Development of strategic information and resources
– Hypothesis testing
– Knowledge discovery
• Data Marts
– Definition: a mini data warehouse for data mining
– Directed at a partition of data
– Dedicated user group
– May be physically separate
– Drivers
• Urgent user requirements
• Small budget
• Absence of sponsor
• Decentralization
• Smaller project size
DIMENSIONAL MODELING
• Dimensional Modeling
– Describes relationships in the data that will be mined
– Relatively new concept, still developing
– A technique for visualizing data models
– Schema (Star and Snowflake)
– Facts - A collection of related data items, consisting of measures and context data
– Dimensions - A collection of members or units of the same type of view. Axis for modeling. Sets the context for the facts.
– Measures - Numeric attribute of fact (What is stored about sales data)
• Focus - Tends to be on numeric data
• MD Analysis vs. DM - Figure 4, R[3]
Data Cube
• Way to visualize facts and dimensions
• Hypercube (more than 3 dimensions)
• May be nested
• Figure 13.1, p249, Berson
• Figure 15,R[3]
Star Schema
– Contains large fact table and a surrounding set of dimension tables
– A.k.a. constellation or multistar model
– Figure 9.1, p171, Berson
– The following is from Figure 18, R[3]
[Figure: Sales Facts table at the center, surrounded by Part No., Customer, Time, Salesperson, and Product dimension tables]
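A star schema of this shape can be sketched with Python's built-in sqlite3 module. The table and column names, and the sample rows, are hypothetical; only three of the dimensions named in the figure are modeled to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per member of the dimension.
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE time_dim (time_id     INTEGER PRIMARY KEY, month TEXT);

-- Fact table: foreign keys to each dimension plus a numeric measure.
CREATE TABLE sales (
    product_id  INTEGER REFERENCES product(product_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    time_id     INTEGER REFERENCES time_dim(time_id),
    amount      REAL
);

INSERT INTO product  VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO time_dim VALUES (1, '1999-01'), (2, '1999-02');
INSERT INTO sales VALUES (1, 1, 1, 100.0), (1, 2, 1, 50.0),
                         (2, 1, 2, 75.0), (1, 1, 2, 25.0);
""")

# A typical multidimensional query: total sales by product and month.
rows = conn.execute("""
    SELECT p.name, t.month, SUM(s.amount)
    FROM sales s
    JOIN product  p ON p.product_id = s.product_id
    JOIN time_dim t ON t.time_id    = s.time_id
    GROUP BY p.name, t.month
    ORDER BY p.name, t.month
""").fetchall()
print(rows)
```

Each query joins the central fact table directly to the dimensions it needs, which is the access pattern the star layout is meant to support.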
Snowflake Schema
• Sometimes dimensions have hierarchies among themselves
• N:1 relationships among members of a dimension may be subdivided
• Decomposition yields a snowflake like schema
[Figure: Sales Facts table surrounded by Part No., Customer, Time, Salesperson, and Product dimension tables, with further Location, Manager, Month, and Week dimension tables]
OLAP (On Line Analytic Processing)
• Multidimensional database
• Allows user to analyze data using elaborate, multidimensional, complex views
• MOLAP - Multidimensional OLAP. Supported by specialized DBMS/software systems. (Data structures, temporal)
– May not be general enough for other uses
– Access limited and optimized for OLAP processing
– Fig 13.3 p 253, Berson
• ROLAP - Underlying data stored in traditional (relational) DBMS and accessed by traditional query language (SQL).
– Layer on top of DBMS. Middleware.
– May have poor performance for OLAP applications
– Fig 13.4 p 254, Berson
OLAP Operations
• Move view of facts down/up dimensions
– Drill Down
– Roll Up
– Figure 3, R[3]
– Figure 16,R[3]
• Look at data by partitioning the cube
– Slice - Look at subcube to get more specific data
– Dice - Rotate cube to look at another dimension
– Figure 17,R[3]
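Roll up, drill down, and slice can be sketched over a small in-memory set of facts. The records, dimension names, and helper functions are hypothetical; this is a minimal illustration of the operations, not how an OLAP server implements them.

```python
from collections import defaultdict

# Hypothetical fact records: (product, city, month, amount).
facts = [
    ("widget", "Dallas", "1999-01", 100.0),
    ("widget", "Austin", "1999-01", 50.0),
    ("gadget", "Dallas", "1999-02", 75.0),
    ("widget", "Dallas", "1999-02", 25.0),
]
DIMS = ("product", "city", "month")

def aggregate(records, keep):
    """Sum amounts, grouping only by the dimensions named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(float)
    for rec in records:
        totals[tuple(rec[i] for i in idx)] += rec[-1]
    return dict(totals)

def slice_cube(records, dim, value):
    """Slice: keep only the facts where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return [rec for rec in records if rec[i] == value]

detailed = aggregate(facts, ("product", "city"))  # drilled-down view
rolled_up = aggregate(facts, ("product",))        # roll up over city
dallas = aggregate(slice_cube(facts, "city", "Dallas"), ("product", "month"))

print(detailed)
print(rolled_up)
print(dallas)
```

Moving from `rolled_up` to `detailed` corresponds to drilling down, the reverse to rolling up, and `dallas` is the subcube obtained by slicing on a single city.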