CLASSIFICATION PROJECT FOR THE RETAIL INDUSTRY

Group Members: Raghavendra Kumar Punugu, Mewan Wijemanne
IE 5300 – Data Mining, Professor Shouyi Wang
Introduction
- Competition data set based on the retail industry
- Variables are unknown (anonymous)
- Goal: determine the optimal classification method
- Objective: achieve the highest accuracy
DATASET
- Retail industry
- Target classes: loyal / not loyal
- Behavioral variables (purchase)
Data Classification Approach

DATA → DATA EXPLORATION → DATA QUALITY → DATA PREPROCESSING → CLASSIFICATION

DATA
- Size: Train (25676 × 255), Test (11042 × 255)
- Attributes: 255 (anonymous)
- Attribute types: numerical (continuous), categorical (ordinal)
- Target variable: Loyal (1) / Not loyal (0)
EXPLORATION
Investigation of the data for a better understanding of its specific characteristics:
- Helps in selecting preprocessing tools
- Helps in identifying outliers, noise, and missing values
- Helps in better data visualization

Summary statistics (two columns as shown on the original slide; the attributes are anonymous):

Statistic            Value 1   Value 2
Mean                 14.5      18.4
Range                40        353
Standard deviation   12.09     38.49
Mode                 0         0
Median               11        4
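The exploration step above can be sketched in code. The project's data and tooling are not available, so this is a minimal Python/pandas sketch on synthetic stand-in columns (`attr_a`, `attr_b` are made-up names; the real attributes are anonymous), computing the same statistics the slide reports.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two anonymous attributes (hypothetical data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "attr_a": rng.poisson(14.5, size=1000),
    "attr_b": rng.exponential(18.4, size=1000),
})

# One row per attribute, one column per summary statistic.
stats = pd.DataFrame({
    "mean": df.mean(),
    "range": df.max() - df.min(),
    "std": df.std(),
    "mode": df.mode().iloc[0],
    "median": df.median(),
})
print(stats.round(2))
```

Computing these few statistics per attribute is usually enough to spot the skew and scale differences that motivate the preprocessing choices on the following slides.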
DATA QUALITY
- Missing values: Based on the analysis conducted, under 2% of the data was missing or non-applicable. To allow classification, these values were replaced with 0.
- Noise: Because the origin of the data set is unknown, it is difficult to determine whether noise is present.
- Outliers: Very few outliers were detected with boxplots. In future work, outlier tests such as the Bonferroni outlier test can be applied to detect any that remain.
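The two data-quality checks above can be sketched as follows. This is a Python/pandas analogue (the project used MATLAB) on a synthetic frame seeded with roughly 2% missing values, mirroring the slide's finding: fill missing cells with 0, then flag boxplot-style outliers outside the 1.5 × IQR whiskers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy frame with ~2% missing cells (hypothetical stand-in data).
df = pd.DataFrame(rng.poisson(10, size=(500, 4)).astype(float),
                  columns=list("abcd"))
df[rng.random(df.shape) < 0.02] = np.nan

missing_rate = df.isna().mean().mean()
df_filled = df.fillna(0)  # replace missing values with 0, as in the project

# Boxplot-style outlier screen: outside the 1.5*IQR whiskers.
q1, q3 = df_filled.quantile(0.25), df_filled.quantile(0.75)
iqr = q3 - q1
outliers = (df_filled < q1 - 1.5 * iqr) | (df_filled > q3 + 1.5 * iqr)
print(f"missing: {missing_rate:.1%}, outlier cells: {int(outliers.values.sum())}")
```

Zero-filling is the simplest imputation and matches the slides; mean or median imputation would be a common alternative when the attributes' meanings are known.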
DATA PREPROCESSING

Dimensionality reduction → Classification
- MRMR (minimum redundancy maximum relevance) → Ensemble, Linear discriminant analysis, Support vector machine
- Principal component analysis → Linear discriminant analysis
Dimensionality reduction
- Reduces the high-dimensional space to a lower-dimensional space
- Reduces time and storage requirements
- Makes the data easier to visualize (e.g., reducing 3D to 2D)
Principal component analysis
- SVD: eigenvalues of the covariance matrix
- Variance retained is 99.99% [161 components]
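Retaining a target fraction of variance, as the slide describes, can be sketched with scikit-learn (a Python analogue; the project used MATLAB, and the data here is synthetic): passing a fraction to `PCA(n_components=...)` keeps the smallest number of components whose cumulative explained variance reaches that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Correlated synthetic features stand in for the 255 anonymous attributes:
# 40 observed columns driven by 10 latent factors plus tiny noise.
latent = rng.normal(size=(300, 10))
X = latent @ rng.normal(size=(10, 40)) + 0.01 * rng.normal(size=(300, 40))

# n_components as a fraction keeps just enough components to
# reach 99.99% cumulative explained variance.
pca = PCA(n_components=0.9999, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X.shape[1], "->", pca.n_components_, "components")
```

On the project's data this criterion kept 161 of 255 components; here, with 10 latent factors, far fewer are needed.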
Minimum redundancy maximum relevance (MRMR)
- Maximizes the statistical dependency between selected features and the class variable, while minimizing redundancy among the selected features
- Ranks all features in decreasing order of importance [255]
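The greedy mRMR selection can be sketched as below. This is a simplified, correlation-based variant (a common surrogate for the mutual-information formulation the method is named for), on synthetic data; `mrmr_rank` is a hypothetical helper, not the project's implementation.

```python
import numpy as np

def mrmr_rank(X, y, k):
    """Greedy mRMR with a simplified criterion: relevance =
    |corr(feature, y)|, redundancy = mean |corr| with features
    already selected; pick the feature maximizing relevance - redundancy."""
    n_features = X.shape[1]
    rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    selected = [int(np.argmax(rel))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                           for s in selected])
            if rel[j] - red > best_score:
                best, best_score = j, rel[j] - red
        selected.append(best)
    return selected

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)
informative = y + rng.normal(scale=0.5, size=200)
X = np.column_stack([informative,
                     informative + rng.normal(scale=0.05, size=200),  # redundant copy
                     rng.normal(size=200),
                     rng.normal(size=200)])
ranked = mrmr_rank(X, y, 3)
print(ranked)
```

Note how the redundancy penalty pushes the near-duplicate of the first feature down the ranking, which is exactly why mRMR differs from ranking by relevance alone.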
Principal component analysis and MRMR (MATLAB)

[COEFF, SCORE, latent] = princomp(X)
- COEFF: for an n-by-p matrix X, returns a p-by-p matrix whose columns are the principal components, ordered by decreasing variance.
- latent: the variance accounted for by each component. The decomposition can be singular value decomposition (default), eigenvalue decomposition of the covariance matrix, or alternating least squares (for missing values in the data).
- SCORE: the representation of X in principal component space.

MRMR returns features ranked by importance: 11, 43, 33, 103, 247, 93, 96, 13, 244, 212, …
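For readers without MATLAB, the three `princomp` outputs can be reproduced with a short NumPy sketch (a Python analogue on synthetic data, not the project's code): center X, take the SVD, and read off loadings, scores, and per-component variances.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # toy data

Xc = X - X.mean(axis=0)                 # princomp centers the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

coeff = Vt.T                            # p-by-p loadings; columns = components
score = Xc @ coeff                      # representation of X in PC space
latent = S**2 / (X.shape[0] - 1)        # variance accounted for by each component

# Columns come out ordered by decreasing variance, as the slide states.
print(latent.round(3))
```

Because `score = Xc @ Vt.T` equals `U * S`, the scores are exactly the centered data rotated into the principal-component basis.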
DATA CLASSIFICATION
- Ensemble
- Linear discriminant analysis
- Support vector machine
Linear Discriminant Analysis (LDA)
- Developed by R. A. Fisher (1936)
- Comparable in performance to more complex methods
- Uses Fisher's score function
- Searches for the linear combination of features that best separates the classes
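The LDA bullets above can be illustrated with scikit-learn (a Python sketch on synthetic two-class data standing in for loyal/not-loyal; the project used MATLAB): the fitted model is exactly one linear combination of the features chosen to separate the classes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class problem: class means shifted by 1.5 in every dimension.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=0.0, size=(200, 4)),
               rng.normal(loc=1.5, size=(200, 4))])
y = np.array([0] * 200 + [1] * 200)  # 0 = not loyal, 1 = loyal

# LDA fits a single linear discriminant (Fisher's criterion).
lda = LinearDiscriminantAnalysis().fit(X, y)
accuracy = lda.score(X, y)
print(f"training accuracy: {accuracy:.3f}")
```

`lda.coef_` holds the weights of that single linear combination, which is why LDA remains so interpretable despite being competitive with more complex methods.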
Support Vector Machine (SVM)
- Creates hyperplanes to define decision boundaries
- Selects the hyperplane that gives the largest minimum distance (margin) to the training data
- Multiple kernel functions can be used with SVM; kernels used:
  - Radial basis function (RBF)
  - Linear
  - Polynomial (orders 2, 8, 10)
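The kernel sweep the project ran can be sketched with scikit-learn's `SVC` (a Python analogue on synthetic data; the project used MATLAB): the same five kernels, RBF, linear, and polynomial of orders 2, 8, and 10.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (150, 3)), rng.normal(2, 1, (150, 3))])
y = np.array([0] * 150 + [1] * 150)

# The five kernels tried in the project.
kernels = [("rbf", {}), ("linear", {}),
           ("poly", {"degree": 2}), ("poly", {"degree": 8}),
           ("poly", {"degree": 10})]
scores = {}
for kernel, kw in kernels:
    # Scaling matters for SVMs; the margin is distance-based.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, **kw))
    scores[f"{kernel}{kw.get('degree', '')}"] = clf.fit(X, y).score(X, y)
print(scores)
```

As in the project's results table, high-order polynomial kernels are often the weakest choice here: they warp the feature space aggressively and generalize poorly compared with RBF or linear kernels.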
Ensemble Classifier
- Itself a supervised learning algorithm that can be trained
- Combines a set of trained weak-learner models with the data set used
- Running multiple learning cycles adjusts the algorithm to classify better
- Multiple types of ensemble techniques exist; types used:
  - AdaBoost
  - LogitBoost
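The learning-cycle idea can be sketched with scikit-learn's `AdaBoostClassifier` (a Python analogue on synthetic data; LogitBoost, the project's other variant, is not built into scikit-learn, so only AdaBoost is shown): each cycle adds one weak learner and re-weights the points the current ensemble gets wrong.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)  # nonlinear toy target

# More learning cycles -> more weak learners -> (typically) lower
# training error, mirroring the project's error-vs-cycles plot.
for n_cycles in (10, 50, 200):
    clf = AdaBoostClassifier(n_estimators=n_cycles, random_state=0).fit(X, y)
    print(n_cycles, "cycles:", round(clf.score(X, y), 3))
```

Sweeping the cycle count this way is how the project arrived at its 300-cycle LogitBoost configuration; past some point the gains flatten out and validation accuracy, not training accuracy, should decide the stopping point.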
Visualization of Output
[Figure: classification error vs. number of learning cycles (ensemble)]
Computational Results

Variable Selection   Classification Method          Function     Order/Cycles   Variable Size   Accuracy
MRMR                 Linear Discriminant Analysis   -            -              150             0.662751
MRMR                 Linear Discriminant Analysis   -            -              130             0.663422
MRMR                 Linear Discriminant Analysis   -            -              100             0.665319
MRMR                 Linear Discriminant Analysis   -            -              90              0.664914
MRMR                 Linear Discriminant Analysis   -            -              95              0.665929
MRMR                 Linear Discriminant Analysis   -            -              98              0.666967
MRMR                 Support Vector Machine         RBF          -              98              0.543112
MRMR                 Support Vector Machine         Polynomial   10             98              0.548724
MRMR                 Support Vector Machine         Polynomial   2              98              0.465846
MRMR                 Support Vector Machine         Polynomial   8              98              0.503412
MRMR                 Support Vector Machine         Linear       -              98              0.534284
MRMR                 Fitted Ensemble                AdaBoost     100            98              0.669126
MRMR                 Fitted Ensemble                AdaBoost     80             98              0.669867
MRMR                 Fitted Ensemble                AdaBoost     50             98              0.671196
MRMR                 Fitted Ensemble                AdaBoost     40             98              0.671196
MRMR                 Fitted Ensemble                AdaBoost     43             98              0.671498
MRMR                 Fitted Ensemble                LogitBoost   43             98              0.667572
MRMR                 Fitted Ensemble                LogitBoost   100            98              0.669082
MRMR                 Fitted Ensemble                LogitBoost   200            98              0.670894
MRMR                 Fitted Ensemble                LogitBoost   300            98              0.674819
PCA                  Linear Discriminant Analysis   -            -              161             0.587564
Conclusion

Worst combination (lowest accuracy: 46.5846%):
- Variable selection method: MRMR
- Classification method: Support Vector Machine
- Function: Polynomial, order 2
- Variable size: 98

Best combination (highest accuracy: 67.4819%; current rank: 32nd out of 215):
- Variable selection method: MRMR
- Classification method: Fitted Ensemble
- Function: LogitBoost
- Learning cycles: 300
- Variable size: 98
Thank you!