
CSE 8392 SPRING 1999 DATA MINING:

PART I

Professor Margaret H. Dunham

Department of Computer Science and Engineering

Southern Methodist University

Dallas, Texas 75275

(214) 768-3087

fax: (214) 768-3085

email: [email protected]

www: http://www.seas.smu.edu/~mhd

January 1999


CSE 8392 SPRING 1999 OUTLINE

• Course Objective: To examine Data Mining concepts. A database perspective (rather than AI or statistics) is taken.

• I. Introduction and Related Topics

• II. Core Topics

• III. Advanced Topics

• IV. Case Studies

• V. Student Presentations

• VI. Summary and Future Trends


INTRODUCTION AND RELATED TOPICS

• Section Objective: Provide an introduction to data mining concepts. Briefly examine related concepts and background topics.

• Historical Perspective

– Gleaning Knowledge from the Data

– User Expectations increase as amount/sophistication of collected data increases.

– Reality vs Extracted Data

[Figure: Reality vs. extracted data. Labels: Reality, Information Need, Query, Data; Physical View, Database View]


Related Topics (to be covered)

– Knowledge Discovery

– Information Retrieval

– Fuzzy Sets

– Data Warehousing and OLAP

– Dimensional Modeling


Data Mining Overview

• What is Data Mining?

– Definition: Fayyad, p. 9

– A.k.a.

• Exploratory data analysis

• Unsupervised pattern recognition

• Data driven discovery

• Deductive learning

• Data Mining determines patterns in the data

– Non-trivial

– Valid

– Novel

– Potentially useful

– Interesting

– General and simple

– Understandable


DM Techniques (R[1])

• DM involves many different algorithms to accomplish different things. All share the following components.

– Model (Must fit a model to the data.)

• Function/Purpose

• Representation

– Preference Criteria (How to choose one model over another?)

– Search Algorithm (How to search the data)

• Example (Loan Data, fig 1.1 p6 in Fayyad):

– Model: Classification, Linear Function

– Preference: What best fits data? (Fig 1.2 or 1.4)

– Search Algorithm: Linear search of database
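
The three components (model, preference criterion, search algorithm) can be illustrated with a minimal sketch. The loan records below are made up, and the model is reduced to a one-variable income threshold, the simplest case of a linear decision boundary; none of the names or values come from Fayyad.

```python
# Hypothetical loan records: (income in $1000s, repaid_loan). Illustrative only.
LOANS = [(15, False), (22, False), (28, False), (31, True),
         (35, False), (42, True), (50, True), (64, True)]


def classify(income, threshold):
    """Model: a threshold rule, the simplest linear decision boundary."""
    return income >= threshold


def errors(threshold, data):
    """Preference criterion: number of misclassified records (lower is better)."""
    return sum(classify(income, threshold) != label for income, label in data)


def best_threshold(data):
    """Search: linear scan over candidate thresholds drawn from the data."""
    candidates = sorted(income for income, _ in data)
    return min(candidates, key=lambda t: errors(t, data))


threshold = best_threshold(LOANS)
print(threshold, errors(threshold, LOANS))
```

Note that no threshold classifies every record correctly here, which mirrors the point that the fitted model is approximate.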


DM Model Functions (R[1])

• Classification - Map data into predefined groups

• Regression - Map data to real valued prediction variable

• Clustering - Map data into groups defined by data itself

• Summarization - Map subsets of data into simple description

• Dependency Modeling - Identify dependencies among data items

• Link Analysis - Identify other relationships among data (association rules)

• Sequence Analysis - Identify sequential patterns in data


DM Historical Perspective

• Late 70’s: Spreadsheet analysis

• 80’s: Transactional databases support data storage and retrieval

• Early 90’s: Growing interest in end user support (a.k.a. decision support)

– Issue: transactional databases are not designed for decision support

• Mid 90’s: Dedicated data warehouses for decision support and multidimensional analysis

• Late 90’s: Proliferation; new concepts (data marts)

• DM Tools: Neovista, Red Brick


Data Mining Metrics

• Berson, Tables 17-1, 17-2, 17-3, p 347

• Accuracy

• Clarity

• Dirty Data

• Dimensionality

• Raw Data (Preprocessing)

• RDBMS embedding

• Scalability

• Speed

• Validation


DM Issues

• Overfitting

• Outliers

• Closed World Assumption

• Database schemas and database models

• Algorithms for data mining

• Interpretation and visualization of results

• Size of databases

• Multimedia data, Spatio-Temporal Data

• Changing data

• Integration

• DM Applications

– Market basket analysis

– Stock analysis and selection

– Fraud detection and prevention

– Crisis prediction and prevention


KNOWLEDGE DISCOVERY IN DATABASES (KDD)

• “Overall process of discovering useful knowledge from data.” (p28 in R[1])

• Defn: R[1] p 30

• Steps Fig 1, p29 R[1] (Fig 1.3 in Fayyad)

• Data Mining is one step in KDD process

• KDD objective is usually not clear or exact. May require time with the customer to understand their needs.

• Data usually has problems - needs cleaning

– Incorrect/missing data

– Extract from multiple sources and compare

– Delete anomalous data and sources

– Different data types/metrics
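
A minimal sketch of the cleaning step, assuming hypothetical height records merged from two sources that use different units. The field names, the sample rows, and the plausibility bounds are all illustrative assumptions.

```python
# Hypothetical raw records merged from two sources; None marks missing data.
RAW = [
    {"id": 1, "height": 72.0, "unit": "in"},
    {"id": 2, "height": 180.0, "unit": "cm"},
    {"id": 3, "height": None, "unit": "cm"},    # missing value
    {"id": 4, "height": 700.0, "unit": "cm"},   # anomalous value
]

CM_PER_INCH = 2.54


def clean(records, low_cm=50.0, high_cm=250.0):
    """Return records with heights in cm, dropping missing or implausible rows."""
    cleaned = []
    for rec in records:
        if rec["height"] is None:
            continue  # incorrect/missing data
        # Different metrics: convert everything to centimeters.
        if rec["unit"] == "in":
            height_cm = rec["height"] * CM_PER_INCH
        else:
            height_cm = rec["height"]
        if not low_cm <= height_cm <= high_cm:
            continue  # anomalous data (bounds are an assumed plausibility range)
        cleaned.append({"id": rec["id"], "height_cm": round(height_cm, 2)})
    return cleaned


print(clean(RAW))
```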


FUZZY SETS and LOGIC

• Set membership described by a real valued membership function with range [0,1]

• Ex: Set of all tall people

• Set membership function: f(x)=x is tall iff height(x)>6 ft.

• Note that this is a simple classification problem. Just as in the loan example, the results are not exact.

• Basis of many classification and clustering approaches

• In a conventional DB how do you retrieve all tall people?

– Three valued logic: True, False, Maybe

– Multi-valued logic: More than 2 values
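
As a rough illustration of the difference between a crisp set and a fuzzy set of tall people, the sketch below contrasts the 6 ft rule with a graded membership function. The linear ramp and its breakpoints (5.5 ft and 6.5 ft) are assumptions chosen for the example, not values from the course material.

```python
def crisp_tall(height_ft):
    """Crisp set: tall iff height > 6 ft. Membership is 0 or 1."""
    return 1.0 if height_ft > 6.0 else 0.0


def fuzzy_tall(height_ft, low=5.5, high=6.5):
    """Fuzzy membership in [0, 1]: 0 below `low`, 1 above `high`, linear between.

    The breakpoints are assumptions chosen for illustration.
    """
    if height_ft <= low:
        return 0.0
    if height_ft >= high:
        return 1.0
    return (height_ft - low) / (high - low)


def retrieve_tall(people, min_degree):
    """Return names whose fuzzy membership is at least `min_degree`."""
    return [name for name, height in people if fuzzy_tall(height) >= min_degree]


people = [("Ann", 5.4), ("Bob", 6.0), ("Cy", 6.6)]
print(retrieve_tall(people, 0.5))
```

With the crisp rule, a person exactly 6 ft tall is excluded; with the fuzzy function the same person has membership 0.5, and the query decides how strict to be.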


Fuzzy Logic

• Reasoning with uncertainty

• Extends multivalued logic; allows user to communicate using imprecise concepts, e.g.

– “good” and “bad”

– “close to” and “far away”

• Avoids brittleness of rule based reasoning by introducing degree of set membership

– Allows for smoother transition between classification sets in the domain

– Example

• Berson figure 16.2, page 325


INFORMATION RETRIEVAL

• Store and retrieve documents based on fuzzy queries

• Predecessor of web based access

• Ex: Store information about all articles in all IEEE Transactions journals and retrieve all documents dealing with heaps.

• Overview

– Conventional IR Systems

– Query Structures (Keywords)

– Matching (Multivalued logic)

– Measures

– Text Analysis Techniques

– IR Related Topics


Conventional IR Systems

• Library card catalogs

• Documents (Library Science)

– Formatted

– Unformatted (Text)

– Mixed

• Document Surrogates

– Identifiers

– Titles, names, and dates

– Abstracts, extracts, reviews

– Summaries of Numerical Data

– Image Descriptions


IR Queries

• Query Structures

– Matching Criteria

– Boolean Queries

– Vector

– Fuzzy

– Natural Language

• Logical combination of keywords

• Weight associated with keywords

• Similarity measures


Similarity Measures

– Document Vector: $D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$

– Different Measures: $Sim(D_i, D_j) = \sum_{k=1}^{n} d_{ik} d_{jk}$

– Salton and McGill, Introduction to Modern Information Retrieval, 1984, McGraw-Hill, pp201-204.

– Similarity uses:

• Document-Document

• Query-Query

• Document-Query
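
A minimal sketch of the inner-product similarity between two term-weight vectors, Sim(D_i, D_j) = sum over k of d_ik * d_jk. The cosine variant is added as one example of the other measures discussed in Salton and McGill; the function names are assumptions made for this example.

```python
import math


def inner_product(d_i, d_j):
    """Sim(D_i, D_j) = sum over k of d_ik * d_jk."""
    return sum(a * b for a, b in zip(d_i, d_j))


def cosine(d_i, d_j):
    """Inner product normalized by the two vector lengths; 0.0 for a zero vector."""
    norm = math.sqrt(inner_product(d_i, d_i)) * math.sqrt(inner_product(d_j, d_j))
    return inner_product(d_i, d_j) / norm if norm else 0.0


doc = [1, 0, 2]      # weights for three index terms in a document
query = [1, 1, 1]    # a query is represented the same way
print(inner_product(doc, query), cosine(doc, query))
```

Because documents and queries share one representation, the same function covers the document-document, query-query, and document-query uses.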


IR Document/Query Matching

• Matching Process

– Relevance and Similarity Measures

– Boolean based matching

• Logical match

– Vector based matching

• Threshold match

– Probabilistic Match

• $P(\text{relevant}) = n / N$, where n = number of relevant documents and N = total number of documents

– Fuzzy Matching

– Proximity Matching

– Weighting

– Relative Importance of Items


IR Matching

• Scaling

– Impact of Sample Size

– Clustering

– Centroids

• Measures

– Precision

– Recall
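
Precision and recall can be computed from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the document identifiers are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
print(precision_recall([1, 2, 3, 4], [2, 4, 6]))
```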


IR Indexing

• Text Analysis

– Indexing is the assignment of keywords or terms that represent document content

• Originally a library science problem that has grown with the advent of web based searches

– Indexing types

• Automated vs. manual

• Controlled vs. uncontrolled

• Single term vs. terms in context

• Deep vs. shallow


IR Indexing

• General Steps

– 1. Assignment of terms or concepts capable of representing content

– 2. Assignment of a weight or value to each term

• Indexing

– Vector based

• Start with excerpts, remove high frequency words

– Stop list

– Thesaurus

• Compute discrimination values of terms
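
The first indexing steps (remove high frequency words with a stop list, normalize terms with a thesaurus, count what remains) might look like the sketch below. The stop list and thesaurus entries are tiny illustrative assumptions, and discrimination values are not computed here.

```python
from collections import Counter

STOP_LIST = {"a", "an", "and", "of", "the", "in", "for", "is"}   # assumed entries
THESAURUS = {"heaps": "heap", "trees": "tree"}                     # assumed entries


def index_terms(text):
    """Return a term -> raw frequency mapping for one document excerpt."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    terms = [THESAURUS.get(w, w) for w in words if w and w not in STOP_LIST]
    return Counter(terms)


print(index_terms("The analysis of heaps and the heap property."))
```

The resulting counts are the raw material for the term weights assigned in step 2.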


IR Retrieval

• Retrieval or Classification

– Vector based

• Same starting point as with indexing

• Compute weighting factors

• Assign to each document a weighted term vector

– Similarity Measures

• Measure similarity between document/query

• Results normalized to the range 0 to 1


IR Retrieval

– Inverse Document Frequency

• Assumes the importance of a term is proportional to its occurrence frequency in a document, and inversely proportional to the total number of documents in which the term appears.

• Also used for similarity measurement

– Inverted Indexing of Document

– Concept Hierarchy

• DAG of concepts

• Follow nodes from general to more specific

• Tag articles with low level concepts so that each may be distinguished from ancestors
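
A minimal sketch of an inverted index together with an inverse document frequency weight. The weighting formula tf * log(N / df) is one common variant and is an assumption here, as are the sample documents and function names.

```python
import math
from collections import defaultdict

# Hypothetical documents, already reduced to index terms.
DOCS = {
    "d1": ["heap", "sort", "heap"],
    "d2": ["tree", "sort"],
    "d3": ["heap", "tree", "graph"],
}


def inverted_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def tf_idf(term, doc_id, docs, index):
    """Weight = term frequency in the document * log(N / document frequency)."""
    tf = docs[doc_id].count(term)
    df = len(index.get(term, ()))
    return tf * math.log(len(docs) / df) if df else 0.0


index = inverted_index(DOCS)
print(sorted(index["heap"]), tf_idf("heap", "d1", DOCS, index))
```

A term that occurs in every document gets weight 0 under this variant, since log(N / N) = 0; such a term does not help distinguish documents.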


IR Related Topics

• Information Retrieval Related Topics

– Text Analysis

– Fuzzy Sets

– Extending Databases

– Hypertext

– Digital Libraries

– Data Mining

• Web based browsers


DATA WAREHOUSING AND OLAP

– Preparations for Mining: Data Warehousing

• Extracting the data (from RDBMS)

• Storing the data

– Data warehouse or data mart

• Cleansing the data

• Mining the data

– Often with multidimensional queries

• Definition

– Blend of technologies

– Integration

– Enables Strategic Use of Data

• Architecture

– Figure 6.1, page 116


DW Migration

• Migration from Relational Database to Data Warehouse

– Differences (Relational vs. Data Warehouse)

– Procedure for Migration

• Extraction

• Cleanup

• Transformation

• Migration

• Issues

– Multiple sources

– Database Heterogeneity

– Data Heterogeneity


DW Design

• Data Warehouse Design Considerations - Nine Step Method:

– Subject Matter

– Fact Table contents

– Dimensioning

– Fact Selection

– Precalculations

– Rounding out dimension table

– Duration selection

– What about change?

– Query priorities

• Technical Considerations

– Hardware

– Communications Infrastructure

– Data Structures


More on DW

• Benefits

– Development of strategic information and resources

– Hypothesis testing

– Knowledge discovery

• Data Marts

– Definition: a mini data warehouse for data mining

– Directed at a partition of data

– Dedicated user group

– May be physically separate

– Drivers

• Urgent user requirements

• Small budget

• Absence of sponsor

• Decentralization

• Smaller project size


DIMENSIONAL MODELING

• Dimensional Modeling

– Describes relationships in the data that will be mined

– Relatively new concept, still developing

– A technique for visualizing data models

– Schema (Star and Snowflake)

– Facts - A collection of related data items, consisting of measures and context data

– Dimensions - A collection of members or units of the same type of view. Axis for modeling. Sets the context for the facts.

– Measures - Numeric attribute of fact (What is stored about sales data)

• Focus - Tends to be on numeric data

• MD Analysis vs. DM - Figure 4, R[3]


Data Cube

• Way to visualize facts and dimensions

• Hypercube (more than 3 dimensions)

• May be nested

• Figure 13.1, p249, Berson

• Figure 15, R[3]


Star Schema

– Contains large fact table and a surrounding set of dimension tables

– A.k.a. constellation or multistar model

– Figure 9.1, p171, Berson

– Following from Figure 18, R[3]

[Figure: star schema with a central Sales Facts table surrounded by Part No., Customer, Time, Salesperson, and Product dimension tables]
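
To make the fact/dimension split concrete, the sketch below models a tiny star schema with plain Python dictionaries: a sales fact table holding foreign keys and one numeric measure, plus two dimension tables. All table contents and the helper name `total_by` are made-up assumptions.

```python
from collections import defaultdict

# Dimension tables keyed by surrogate id (all rows are made-up sample data).
PRODUCT = {1: {"name": "widget", "category": "hardware"},
           2: {"name": "gadget", "category": "hardware"},
           3: {"name": "manual", "category": "books"}}
TIME = {10: {"month": "Jan", "year": 1999},
        11: {"month": "Feb", "year": 1999}}

# Fact table: foreign keys into each dimension plus a numeric measure.
SALES = [
    {"product_id": 1, "time_id": 10, "amount": 100.0},
    {"product_id": 2, "time_id": 10, "amount": 50.0},
    {"product_id": 3, "time_id": 11, "amount": 20.0},
    {"product_id": 1, "time_id": 11, "amount": 70.0},
]


def total_by(facts, dimension, key, attribute):
    """Join facts to one dimension table and sum the measure per attribute value."""
    totals = defaultdict(float)
    for fact in facts:
        totals[dimension[fact[key]][attribute]] += fact["amount"]
    return dict(totals)


print(total_by(SALES, PRODUCT, "product_id", "category"))
print(total_by(SALES, TIME, "time_id", "month"))
```

The measure lives only in the fact table; the dimension tables supply the context used to group it.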


Snowflake Schema

• Sometimes dimensions have hierarchies among themselves

• N:1 relationships among members of a dimension may be subdivided

• Decomposition yields a snowflake like schema

[Figure: snowflake schema with a central Sales Facts table; Part No., Customer, Time, Salesperson, and Product dimension tables; and further Location, Manager, Month, and Week dimension tables]


OLAP (On Line Analytic Processing)

• Multidimensional database

• Allows user to analyze data using elaborate, multidimensional, complex views

• MOLAP - Multidimensional OLAP. Supported by specialized DBMS/software systems. (Data structures, temporal)

– May not be general enough for other uses

– Access limited and optimized for OLAP processing

– Fig 13.3 p 253, Berson

• ROLAP - Underlying data stored in traditional (relational) DBMS and accessed by traditional query language (SQL).

– Layer on top of DBMS. Middleware.

– May have poor performance for OLAP applications

– Fig 13.4 p 254, Berson


OLAP Operations

• Move view of facts down/up dimensions

– Drill Down

– Roll Up

– Figure 3, R[3]

– Figure 16, R[3]

• Look at data by partitioning the cube

– Slice - Look at subcube to get more specific data

– Dice - Rotate cube to look at another dimension

– Figure 17, R[3]
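
Roll up and slice can be sketched on a small cube stored as a dictionary of cells. The cube contents, the dimension names, and the function names are illustrative assumptions; drill down is the reverse of roll up (returning to the finer-grained cells), and dice is not shown.

```python
from collections import defaultdict

# Cells keyed by (product, city, month) -> sales; all values are made up.
CUBE = {
    ("widget", "Dallas", "Jan"): 10,
    ("widget", "Dallas", "Feb"): 12,
    ("widget", "Austin", "Jan"): 7,
    ("gadget", "Dallas", "Jan"): 5,
    ("gadget", "Austin", "Feb"): 9,
}
DIMS = ("product", "city", "month")


def roll_up(cube, keep):
    """Aggregate away every dimension not named in `keep`."""
    positions = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for cell, value in cube.items():
        totals[tuple(cell[p] for p in positions)] += value
    return dict(totals)


def slice_cube(cube, dim, member):
    """Fix one dimension to a single member, returning the matching subcube."""
    pos = DIMS.index(dim)
    return {cell: v for cell, v in cube.items() if cell[pos] == member}


print(roll_up(CUBE, ("product",)))
print(slice_cube(CUBE, "month", "Jan"))
```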