CLASSIFICATION PROJECT FOR THE RETAIL INDUSTRY

Group Members: Raghavendra Kumar Punugu, Mewan Wijemanne
IE 5300 – Data Mining, Professor Shouyi Wang
Introduction
- Competition data set based on the retail industry
- Variables are unknown (anonymous)
- Goal: determine the optimal classification method
- Objective: achieve the highest accuracy
DATASET
- Retail industry
- Target classes: loyal / not loyal
- Behavioral variables (purchase)
Data Classification Approach

DATA → DATA EXPLORATION → DATA QUALITY → DATA PREPROCESSING → CLASSIFICATION

DATA
- Size: Train (25676 × 255), Test (11042 × 255)
- Attributes: 255 (anonymous)
- Attribute types: numerical (continuous), categorical (ordinal)
- Target variable: Loyal (1) / Not loyal (0)
EXPLORATION
Investigation of the data for a better understanding of its specific characteristics:
- Helps in selecting preprocessing tools
- Helps in identifying outliers, noise, and missing values
- Helps in better data visualization

Summary statistics (two columns as shown on the original slide; the attributes are anonymous):

Statistic            Value 1   Value 2
Mean                 14.5      18.4
Range                40        353
Standard deviation   12.09     38.49
Mode                 0         0
Median               11        4
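The exploration step above can be sketched in code. The project's data and tooling are not available, so this is a minimal Python/pandas sketch on synthetic stand-in columns (`attr_a`, `attr_b` are made-up names; the real attributes are anonymous), computing the same statistics the slide reports.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two anonymous attributes (hypothetical data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "attr_a": rng.poisson(14.5, size=1000),
    "attr_b": rng.exponential(18.4, size=1000),
})

# One row per attribute, one column per summary statistic.
stats = pd.DataFrame({
    "mean": df.mean(),
    "range": df.max() - df.min(),
    "std": df.std(),
    "mode": df.mode().iloc[0],
    "median": df.median(),
})
print(stats.round(2))
```

Computing these few statistics per attribute is usually enough to spot the skew and scale differences that motivate the preprocessing choices on the following slides.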
DATA QUALITY
- Missing values: Based on the analysis conducted, under 2% of the data was missing or non-applicable. To allow classification, these values were replaced with 0.
- Noise: Because the origin of the data set is unknown, it is difficult to determine whether noise is present.
- Outliers: Very few outliers were detected with boxplots. In future work, outlier tests such as the Bonferroni outlier test can be applied to detect any that remain.
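The two data-quality checks above can be sketched as follows. This is a Python/pandas analogue (the project used MATLAB) on a synthetic frame seeded with roughly 2% missing values, mirroring the slide's finding: fill missing cells with 0, then flag boxplot-style outliers outside the 1.5 × IQR whiskers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy frame with ~2% missing cells (hypothetical stand-in data).
df = pd.DataFrame(rng.poisson(10, size=(500, 4)).astype(float),
                  columns=list("abcd"))
df[rng.random(df.shape) < 0.02] = np.nan

missing_rate = df.isna().mean().mean()
df_filled = df.fillna(0)  # replace missing values with 0, as in the project

# Boxplot-style outlier screen: outside the 1.5*IQR whiskers.
q1, q3 = df_filled.quantile(0.25), df_filled.quantile(0.75)
iqr = q3 - q1
outliers = (df_filled < q1 - 1.5 * iqr) | (df_filled > q3 + 1.5 * iqr)
print(f"missing: {missing_rate:.1%}, outlier cells: {int(outliers.values.sum())}")
```

Zero-filling is the simplest imputation and matches the slides; mean or median imputation would be a common alternative when the attributes' meanings are known.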
DATA PREPROCESSING

Dimensionality reduction → Classification
- MRMR (minimum redundancy maximum relevance) → Ensemble, Linear discriminant analysis, Support vector machine
- Principal component analysis → Linear discriminant analysis
Dimensionality reduction
- Reduces the high-dimensional space to a lower-dimensional space
- Reduces time and storage requirements
- Makes the data easier to visualize (e.g., reducing 3D to 2D)
Principal component analysis
- SVD: eigenvalues of the covariance matrix
- Variance retained is 99.99% [161 components]
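Retaining a target fraction of variance, as the slide describes, can be sketched with scikit-learn (a Python analogue; the project used MATLAB, and the data here is synthetic): passing a fraction to `PCA(n_components=...)` keeps the smallest number of components whose cumulative explained variance reaches that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Correlated synthetic features stand in for the 255 anonymous attributes:
# 40 observed columns driven by 10 latent factors plus tiny noise.
latent = rng.normal(size=(300, 10))
X = latent @ rng.normal(size=(10, 40)) + 0.01 * rng.normal(size=(300, 40))

# n_components as a fraction keeps just enough components to
# reach 99.99% cumulative explained variance.
pca = PCA(n_components=0.9999, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X.shape[1], "->", pca.n_components_, "components")
```

On the project's data this criterion kept 161 of 255 components; here, with 10 latent factors, far fewer are needed.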
Minimum redundancy maximum relevance (MRMR)
- Maximizes the statistical dependency between selected features and the class variable, while minimizing redundancy among the selected features
- Ranks all features in decreasing order of importance [255]
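The greedy mRMR selection can be sketched as below. This is a simplified, correlation-based variant (a common surrogate for the mutual-information formulation the method is named for), on synthetic data; `mrmr_rank` is a hypothetical helper, not the project's implementation.

```python
import numpy as np

def mrmr_rank(X, y, k):
    """Greedy mRMR with a simplified criterion: relevance =
    |corr(feature, y)|, redundancy = mean |corr| with features
    already selected; pick the feature maximizing relevance - redundancy."""
    n_features = X.shape[1]
    rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    selected = [int(np.argmax(rel))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                           for s in selected])
            if rel[j] - red > best_score:
                best, best_score = j, rel[j] - red
        selected.append(best)
    return selected

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)
informative = y + rng.normal(scale=0.5, size=200)
X = np.column_stack([informative,
                     informative + rng.normal(scale=0.05, size=200),  # redundant copy
                     rng.normal(size=200),
                     rng.normal(size=200)])
ranked = mrmr_rank(X, y, 3)
print(ranked)
```

Note how the redundancy penalty pushes the near-duplicate of the first feature down the ranking, which is exactly why mRMR differs from ranking by relevance alone.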
Principal component analysis and MRMR (MATLAB)

[COEFF, SCORE, latent] = princomp(X)
- COEFF: for an n-by-p matrix X, returns a p-by-p matrix whose columns are the principal components, ordered by decreasing variance.
- latent: the variance accounted for by each component. The decomposition can be singular value decomposition (default), eigenvalue decomposition of the covariance matrix, or alternating least squares (for missing values in the data).
- SCORE: the representation of X in principal component space.

MRMR returns features ranked by importance: 11, 43, 33, 103, 247, 93, 96, 13, 244, 212, …
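For readers without MATLAB, the three `princomp` outputs can be reproduced with a short NumPy sketch (a Python analogue on synthetic data, not the project's code): center X, take the SVD, and read off loadings, scores, and per-component variances.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # toy data

Xc = X - X.mean(axis=0)                 # princomp centers the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

coeff = Vt.T                            # p-by-p loadings; columns = components
score = Xc @ coeff                      # representation of X in PC space
latent = S**2 / (X.shape[0] - 1)        # variance accounted for by each component

# Columns come out ordered by decreasing variance, as the slide states.
print(latent.round(3))
```

Because `score = Xc @ Vt.T` equals `U * S`, the scores are exactly the centered data rotated into the principal-component basis.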
DATA CLASSIFICATION
- Ensemble
- Linear discriminant analysis
- Support vector machine
Linear Discriminant Analysis (LDA)
- Developed by R. A. Fisher (1936)
- Comparable in performance to more complex methods
- Uses Fisher's score function
- Searches for the linear combination of features that best separates the classes
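The LDA bullets above can be illustrated with scikit-learn (a Python sketch on synthetic two-class data standing in for loyal/not-loyal; the project used MATLAB): the fitted model is exactly one linear combination of the features chosen to separate the classes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class problem: class means shifted by 1.5 in every dimension.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=0.0, size=(200, 4)),
               rng.normal(loc=1.5, size=(200, 4))])
y = np.array([0] * 200 + [1] * 200)  # 0 = not loyal, 1 = loyal

# LDA fits a single linear discriminant (Fisher's criterion).
lda = LinearDiscriminantAnalysis().fit(X, y)
accuracy = lda.score(X, y)
print(f"training accuracy: {accuracy:.3f}")
```

`lda.coef_` holds the weights of that single linear combination, which is why LDA remains so interpretable despite being competitive with more complex methods.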
Support Vector Machine (SVM)
- Creates hyperplanes to define decision boundaries
- Selects the hyperplane that gives the largest minimum distance (margin) to the training data
- Multiple kernel functions can be used with SVM; kernels used:
  - Radial basis function (RBF)
  - Linear
  - Polynomial (orders 2, 8, 10)
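The kernel sweep the project ran can be sketched with scikit-learn's `SVC` (a Python analogue on synthetic data; the project used MATLAB): the same five kernels, RBF, linear, and polynomial of orders 2, 8, and 10.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (150, 3)), rng.normal(2, 1, (150, 3))])
y = np.array([0] * 150 + [1] * 150)

# The five kernels tried in the project.
kernels = [("rbf", {}), ("linear", {}),
           ("poly", {"degree": 2}), ("poly", {"degree": 8}),
           ("poly", {"degree": 10})]
scores = {}
for kernel, kw in kernels:
    # Scaling matters for SVMs; the margin is distance-based.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, **kw))
    scores[f"{kernel}{kw.get('degree', '')}"] = clf.fit(X, y).score(X, y)
print(scores)
```

As in the project's results table, high-order polynomial kernels are often the weakest choice here: they warp the feature space aggressively and generalize poorly compared with RBF or linear kernels.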
Ensemble Classifier
- Itself a supervised learning algorithm that can be trained
- Combines a set of trained weak-learner models with the data set used
- Running multiple learning cycles adjusts the algorithm to classify better
- Multiple types of ensemble techniques exist; types used:
  - AdaBoost
  - LogitBoost
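The learning-cycle idea can be sketched with scikit-learn's `AdaBoostClassifier` (a Python analogue on synthetic data; LogitBoost, the project's other variant, is not built into scikit-learn, so only AdaBoost is shown): each cycle adds one weak learner and re-weights the points the current ensemble gets wrong.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)  # nonlinear toy target

# More learning cycles -> more weak learners -> (typically) lower
# training error, mirroring the project's error-vs-cycles plot.
for n_cycles in (10, 50, 200):
    clf = AdaBoostClassifier(n_estimators=n_cycles, random_state=0).fit(X, y)
    print(n_cycles, "cycles:", round(clf.score(X, y), 3))
```

Sweeping the cycle count this way is how the project arrived at its 300-cycle LogitBoost configuration; past some point the gains flatten out and validation accuracy, not training accuracy, should decide the stopping point.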
Visualization of Output
[Figure: classification error vs. number of learning cycles (ensemble)]
Computational Results

Variable Selection   Classification Method          Function     Order/Cycles   Variable Size   Accuracy
MRMR                 Linear Discriminant Analysis   -            -              150             0.662751
MRMR                 Linear Discriminant Analysis   -            -              130             0.663422
MRMR                 Linear Discriminant Analysis   -            -              100             0.665319
MRMR                 Linear Discriminant Analysis   -            -              90              0.664914
MRMR                 Linear Discriminant Analysis   -            -              95              0.665929
MRMR                 Linear Discriminant Analysis   -            -              98              0.666967
MRMR                 Support Vector Machine         RBF          -              98              0.543112
MRMR                 Support Vector Machine         Polynomial   10             98              0.548724
MRMR                 Support Vector Machine         Polynomial   2              98              0.465846
MRMR                 Support Vector Machine         Polynomial   8              98              0.503412
MRMR                 Support Vector Machine         Linear       -              98              0.534284
MRMR                 Fitted Ensemble                AdaBoost     100            98              0.669126
MRMR                 Fitted Ensemble                AdaBoost     80             98              0.669867
MRMR                 Fitted Ensemble                AdaBoost     50             98              0.671196
MRMR                 Fitted Ensemble                AdaBoost     40             98              0.671196
MRMR                 Fitted Ensemble                AdaBoost     43             98              0.671498
MRMR                 Fitted Ensemble                LogitBoost   43             98              0.667572
MRMR                 Fitted Ensemble                LogitBoost   100            98              0.669082
MRMR                 Fitted Ensemble                LogitBoost   200            98              0.670894
MRMR                 Fitted Ensemble                LogitBoost   300            98              0.674819
PCA                  Linear Discriminant Analysis   -            -              161             0.587564
Conclusion

Worst combination (lowest accuracy: 46.5846%):
- Variable selection method: MRMR
- Classification method: Support Vector Machine
- Function: Polynomial, order 2
- Variable size: 98

Best combination (highest accuracy: 67.4819%; current rank: 32nd out of 215):
- Variable selection method: MRMR
- Classification method: Fitted Ensemble
- Function: LogitBoost
- Learning cycles: 300
- Variable size: 98
Thank you!