DATA MINING METHODS COURSE Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University



What is Data Mining?

The process of discovering useful information in large data repositories. (Tan, P-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006)

Discovered information should be:
  Valid
  Previously unknown
  Actionable


Course Objectives

Seven objectives of Lenox and Cuff in 2002 (based on ACM 2001 Ironman Report):
  Prepare and warehouse data
  Process data based on a set of DM algorithms
  Analyze results
  Make predictions
  Select proper algorithm
  Make application
  Motivated to continue graduate studies in DM

We have added:
  Get to know data using statistical analysis tools
  Use visualization tools for analysis and review


Overall Approach

1. Get to know the data.
2. Select an appropriate data mining algorithm based on the data and the mining objective.
3. Construct a model using the selected algorithm.
4. Analyze the results.
5. Make application.


Get to Know the Data

How is it structured?
  Single table/flat file
  Multi-table with relationships
Number of observations
Number of dimensions (attributes)
Compute summary statistics using a tool such as MS-Excel
Visually evaluate characteristics of the data
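
The slide names MS-Excel for summary statistics; the same first look at a data set can be sketched in a few lines of Python. This is a minimal illustration only: the `rows` data and the `summarize` helper are hypothetical and are not part of the course materials.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical flat-file data: each row is one observation,
# each key is one dimension (attribute).
rows = [
    {"age": 23, "income": 31000},
    {"age": 35, "income": 52000},
    {"age": 41, "income": 48000},
    {"age": 29, "income": 39000},
]

n_observations = len(rows)
n_dimensions = len(rows[0])

# One summary per attribute, keyed by attribute name.
summary = {name: summarize([row[name] for row in rows]) for name in rows[0]}
```

Counting observations and dimensions first gives a sense of scale before any algorithm is chosen.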


Visual Exploration

Tools developed:
  Correlation Matrix
  Scatter Plot
  Parallel Coordinate Plot
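
As a rough idea of what sits behind a correlation matrix display, the sketch below computes pairwise Pearson correlations for a few numeric columns. It is not the authors' tool; the `pearson` and `correlation_matrix` helpers and the sample columns are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict of correlations between every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric attributes.
columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],  # rises linearly with x
    "z": [8.0, 6.0, 4.0, 2.0],  # falls linearly with x
}
matrix = correlation_matrix(columns)
```

Values near +1 or -1 flag attribute pairs worth inspecting in a scatter plot.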


Visual Exploration Objectives

Distributions of data
  Data ranges of numeric attributes
  Cardinality of discrete attributes
  Shape of distribution
    Skewed
    Multimodal
Location of outliers
Identification of possible relationships between attributes
Identification of subpopulations within the data
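
Several of these objectives also have simple numeric counterparts that can back up what the plots show. The sketch below is an assumption-laden illustration, not the course's visualization tooling: it reports range and cardinality, estimates skewness, and flags outliers with Tukey's 1.5 x IQR rule, which is one common convention among several. The `prices` and `colors` data are hypothetical.

```python
import statistics

def skewness(values):
    """Population skewness: positive means a longer right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR outside the quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical attributes: one numeric, one discrete.
prices = [10, 12, 11, 13, 12, 11, 10, 95]
colors = ["red", "blue", "red", "green", "blue", "red"]

data_range = (min(prices), max(prices))  # range of a numeric attribute
cardinality = len(set(colors))           # distinct values of a discrete attribute
outliers = iqr_outliers(prices)
right_skewed = skewness(prices) > 0
```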


The Data Mining Methodologies

Microsoft Business Intelligence Tools
  Association Analysis (aka market basket analysis)
  Classification
    Decision Trees
    Artificial Neural Network
    Bayesian Analysis
  Regression
  Cluster Analysis

Custom Tools with Embedded Visual Presentation
  Artificial neural network for both classification and regression
  Self-Organizing Map (SOM) for cluster analysis
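
To make association (market basket) analysis concrete, the sketch below computes the two standard rule measures, support and confidence, over a handful of transactions. The baskets and helper names are hypothetical, and nothing here is meant to describe how the Microsoft tools compute their results.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

rule_support = support(baskets, {"diapers", "beer"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})  # 3 of the 4 diaper baskets
```

Minimum support and minimum confidence thresholds are typical examples of the parameter settings students must learn to choose.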


What do students need to know?

Purpose of each methodology
Steps of underlying algorithm
Data types supported
Issues in construction and application
  Parameter settings
  Results interpretation


Issue - Overtraining

Does the model fit the training data too well?

Need to separate the available data into training and validation subsets.

A visual view of training progress is valuable.
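
A minimal sketch of both ideas, assuming a simple random hold-out split and made-up error curves: the data is divided into training and validation subsets, and the epoch with the lowest validation error marks where overtraining begins. The `split` and `best_epoch` helpers and all numbers are hypothetical.

```python
import random

def split(rows, validation_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]

def best_epoch(validation_errors):
    """Index of the epoch with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)

training, validation = split(range(10))

# Hypothetical error curves: training error keeps falling, while
# validation error turns upward once the model starts to overtrain.
training_errors = [0.40, 0.30, 0.22, 0.15, 0.10, 0.06]
validation_errors = [0.42, 0.33, 0.27, 0.25, 0.28, 0.34]
stop_at = best_epoch(validation_errors)
```

Plotting the two error curves against the epoch number gives the visual view of training progress: the gap that opens after `stop_at` is the signature of overtraining.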


Classification Errors: What are the costs?

Mushroom edibility classifiers

Classifier A
                        Actual Edible   Actual Poisonous
  Predicted Edible           38%               0%
  Predicted Poisonous         8%              54%

Classifier B
                        Actual Edible   Actual Poisonous
  Predicted Edible           44%               1%
  Predicted Poisonous         2%              53%
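
The two matrices can be compared on accuracy and on cost. The sketch below uses the percentages from the slide; the cost values are hypothetical (the slides give none) and are chosen only to show how an asymmetric cost can reverse the ranking that accuracy alone suggests.

```python
# Confusion matrices from the slide, in percent of all mushrooms.
# Keys are (predicted, actual).
classifier_a = {
    ("edible", "edible"): 38, ("edible", "poisonous"): 0,
    ("poisonous", "edible"): 8, ("poisonous", "poisonous"): 54,
}
classifier_b = {
    ("edible", "edible"): 44, ("edible", "poisonous"): 1,
    ("poisonous", "edible"): 2, ("poisonous", "poisonous"): 53,
}

# Hypothetical cost per error, NOT from the slides: eating a poisonous
# mushroom is assumed 100 times worse than discarding an edible one.
costs = {("edible", "poisonous"): 100, ("poisonous", "edible"): 1}

def accuracy(matrix):
    """Percent of mushrooms classified correctly."""
    return sum(pct for (predicted, actual), pct in matrix.items() if predicted == actual)

def expected_cost(matrix, costs):
    """Total cost per 100 mushrooms; correct predictions cost nothing."""
    return sum(pct * costs.get(cell, 0) for cell, pct in matrix.items())

accuracy_a, accuracy_b = accuracy(classifier_a), accuracy(classifier_b)
cost_a, cost_b = expected_cost(classifier_a, costs), expected_cost(classifier_b, costs)
```

Under these assumed costs, Classifier B is more accurate (97% versus 92%) yet more expensive (102 versus 8 cost units per 100 mushrooms), because it labels 1% of mushrooms edible when they are actually poisonous.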


Prediction Model Evaluation

Black box: models built using sophisticated methodologies (ANNs, for example) can perform very well, but gaining an understanding of the model itself is difficult.

  Contribution of individual input attributes
  Nature of contribution (shape of curve)
  Interaction between input attributes
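
One generic way to probe these three questions is one-at-a-time sensitivity analysis: vary a single input while holding the others at a baseline and record the model's output. The sketch below assumes this technique for illustration; it is not presented as the authors' evaluation method, and the `model` function is a hypothetical stand-in for a trained black-box model.

```python
def response_curve(model, baseline, attribute, values):
    """Vary one input over `values`, holding the others at `baseline`."""
    curve = []
    for value in values:
        inputs = {**baseline, attribute: value}
        curve.append((value, model(inputs)))
    return curve

# Hypothetical stand-in for a trained black-box model.
def model(inputs):
    x1, x2 = inputs["x1"], inputs["x2"]
    return x1 ** 2 + 3 * x2 + x1 * x2

grid = [0.0, 1.0, 2.0]

# Shape of the contribution of x1 at two different settings of x2.
curve_x2_high = response_curve(model, {"x1": 1.0, "x2": 2.0}, "x1", grid)
curve_x2_zero = response_curve(model, {"x1": 1.0, "x2": 0.0}, "x1", grid)
```

If the curve for one attribute changes shape when another attribute's baseline changes, as it does here, the two inputs interact.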


See you tomorrow

For a detailed presentation of the mechanics of the software deployed, attend our workshop tomorrow morning.

Saturday: 8-10 AM, Kachina A

Microsoft SQL Server Business Intelligence Studio

Visualization Tools