Exploratory Data Analysis and Data Mining

Presentation of E.D.A and D.M submitted in partial fulfillment of the requirements for the course of

Raj Kumar 411MA5058

Department of Mathematics, N.I.T. Rourkela

Exploratory Data Analysis and Data Mining: Data mining is an interdisciplinary field that applies techniques from computer science, statistics, and machine learning to extract knowledge from a given dataset; that knowledge can then be used to predict future values or to classify new observations. Any data mining project goes through a cyclic sequence of steps, which is presented in the following diagram.

[Diagram: Data Integration and Selection → Data Cleaning and E.D.A. → Statistical Results and Data Transformation → Data Mining, Pattern Evaluation → Knowledge Discovery → Reiteration if results are not productive enough]

• Data Integration
• Data Selection
• Data Cleaning
• Exploratory Data Analysis
• Data Transformation
• Data Mining (Linear Regression, Neural Network)
• Pattern Evaluation
• Deployment
• Reiteration

Task Statement: To develop a model on the iris dataset.

Iris Dataset: The iris flower dataset is a multivariate dataset used by the statistician Ronald Fisher to develop a discriminant model. The flower has three subcategories (species); the task is to classify the observations in the dataset into these subcategories using their attributes and to perform discriminant analysis.

First, exploratory data analysis techniques were applied to understand the data.

• We plot basic histograms to understand the distribution of the data and to explore each of its attributes.

• From the following plot we can see that the different species of iris have different ratios of length to width, and that on the basis of sepal length and width they can be grouped into three different classes.

• The "setosa" species forms a separate cluster altogether and is separable from "versicolor" and "virginica".
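
The histograms mentioned above could be produced along the following lines. This is a minimal sketch in base R; the 2-by-2 panel layout and the per-species table of means are illustrative choices, not taken from the original slides.

```r
# Illustrative sketch: histograms of the four numeric iris attributes
data(iris)
num_cols <- names(iris)[1:4]

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels (assumed layout)
for (col in num_cols) {
  hist(iris[[col]], main = col, xlab = col)
}
par(old_par)                      # restore the previous graphics settings

# Per-species means of each attribute, to compare the three species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

The table of means gives a quick numeric counterpart to the plots: if one species has clearly different petal measurements, it should also stand apart in the histograms.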

Correlation Matrix: Correlation measures the linear dependency between two random variables and takes values between -1 and 1. If the value is zero, there is no linear relationship between them. The following R script produces the correlation matrix and the correlation plot.

library(corrplot)
iris1 <- iris
iris1$Species <- NULL              # drop the factor column; cor() needs numeric input
iris_cor <- cor(iris1)             # use iris1, not iris, which still contains Species
round(iris_cor, digits = 2)
corrplot(iris_cor, diag = FALSE)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

fit <- princomp(iris1, cor = TRUE)
summary(fit)
loadings(fit)
plot(fit, type = "lines")
biplot(fit)
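
To see how much of the total variance each principal component carries, the standard deviations stored in the fitted object can be squared and normalized. The sketch below assumes the same princomp call on the four numeric iris columns; the variable names are illustrative.

```r
# Illustrative sketch: proportion of variance explained by each component
data(iris)
fit <- princomp(iris[, 1:4], cor = TRUE)   # PCA on the correlation matrix

var_explained <- fit$sdev^2 / sum(fit$sdev^2)   # variance share per component
cum_explained <- cumsum(var_explained)          # running total

print(round(var_explained, 3))
print(round(cum_explained, 3))
```

The cumulative values indicate how many components are needed before most of the variance is retained.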

DATA TESTS:

There are different types of tests which we can apply to the dataset for preliminary analysis. They can be divided into two groups.

• Two-sample tests.
• Paired two-sample tests.

Two-Sample Tests:

Kolmogorov-Smirnov Test
• The Kolmogorov-Smirnov test is a non-parametric test of the similarity of two distributions. The null hypothesis is that the two samples are drawn from the same distribution. The two-sided and the two one-sided tests are performed.

Wilcoxon Rank Sum Test
• The two-sample non-parametric Wilcoxon rank sum test (equivalent to the Mann-Whitney test) is performed on the two specified samples. The null hypothesis is that the distributions are the same (i.e., there is no shift in the location of the two distributions), with an alternative hypothesis that they differ in location (based on the median).

Two-Sample t-Test
• The two-sample t-test is performed on the two specified samples. The null hypothesis is that the difference between the two means is zero. This test assumes that the two samples are normally distributed. If not, use the Wilcoxon rank sum test.
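
The three two-sample tests can be tried out with functions from base R's stats package. The sketch below is only an illustration: the choice of sepal length for the versicolor and virginica species as the two samples is an assumption made for the example, not part of the original slides.

```r
# Illustrative sketch: two-sample tests on sepal length of two iris species
data(iris)
x <- iris$Sepal.Length[iris$Species == "versicolor"]
y <- iris$Sepal.Length[iris$Species == "virginica"]

ks_res <- suppressWarnings(ks.test(x, y))      # ties in the data trigger a warning
wx_res <- suppressWarnings(wilcox.test(x, y))  # rank sum (Mann-Whitney) test
t_res  <- t.test(x, y)                         # Welch two-sample t-test (R default)

# Collect the p-values side by side
p_values <- c(ks = ks_res$p.value, wilcoxon = wx_res$p.value, t = t_res$p.value)
print(p_values)
```

The one-sided variants mentioned above can be requested through the alternative argument ("less" or "greater") of each function.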

PAIRED TWO-SAMPLE TESTS

Correlation Test
• The paired sample correlation test is performed on the two specified samples. The two samples are expected to be paired (two observations for the same entity). The null hypothesis is that the two samples have no (i.e., 0) correlation. Pearson's product moment correlation coefficient is used.

Wilcoxon Signed Rank Test
• The paired sample non-parametric Wilcoxon signed rank test is performed on the two specified samples. The two samples are expected to be paired (two observations for the same entity). The null hypothesis is that the distributions are the same.
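
Both paired tests are also available in base R. In the sketch below, sepal length and petal length of the same flower are treated as the paired observations; this pairing is an assumption chosen for illustration.

```r
# Illustrative sketch: paired tests on two measurements of the same flowers
data(iris)
x <- iris$Sepal.Length
y <- iris$Petal.Length

cor_res <- cor.test(x, y, method = "pearson")                 # H0: correlation is 0
sr_res  <- suppressWarnings(wilcox.test(x, y, paired = TRUE)) # signed rank test

print(cor_res$estimate)   # Pearson correlation coefficient
print(cor_res$p.value)
print(sr_res$p.value)
```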

SCATTERPLOT MATRIX

A scatterplot matrix (sometimes abbreviated as SPLOM) is simply a collection of scatterplots arranged in a grid. It is used to detect patterns among three or more variables. The scatterplot matrix is not a true multidimensional visualization because only two features are examined at a time. Still, it provides a general sense of how the data may be interrelated. The graph can be plotted in R using the following command.

```{r}
library(psych)
pairs.panels(iris1[c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
```

DISCRIMINANT ANALYSIS ON IRIS DATASET

Discriminant function analysis is a statistical analysis used to predict a categorical dependent variable (called a grouping variable) from one or more continuous or binary independent variables (called predictor variables). It differs from ANOVA and MANOVA, which are used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables from one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership.

data(iris)
library(MASS)
linear_disc <- lda(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris, prior = c(1, 1, 1) / 3)
linear_disc$prior
linear_disc$counts
linear_disc$means
linear_disc$scaling
linear_disc$lev
linear_disc$svd
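
Once the discriminant model is fitted, its predictions can be compared with the true species. The following is a minimal sketch, assuming the same lda call with equal priors; it evaluates on the data used for fitting, so the accuracy is an optimistic estimate.

```r
# Illustrative sketch: predicted classes from the fitted linear discriminant model
library(MASS)
data(iris)

fit_lda  <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)
pred_lda <- predict(fit_lda, iris)$class      # predicted species per flower

# Confusion table: rows are true species, columns are predictions
conf_mat <- table(actual = iris$Species, predicted = pred_lda)
print(conf_mat)

accuracy <- mean(pred_lda == iris$Species)    # resubstitution accuracy
print(accuracy)
```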

CLASSIFICATION OF IRIS DATASET USING THE NEURAL NETWORKS

Neural networks can be used for both classification and forecasting. In a neural network each layer has nodes with assigned weights, and these weights can be trained for classification as well as forecasting. The network is first trained to optimize its weights. Here we use the iris dataset to first train the model and then evaluate its performance. All operations are performed in the R environment using existing packages.

```{r}
# Normalize the data (min-max scaling) to improve the neural network's performance
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
iris1 <- iris
iris1$Sepal.Length <- normalize(iris1$Sepal.Length)
iris1$Sepal.Width <- normalize(iris1$Sepal.Width)
iris1$Petal.Length <- normalize(iris1$Petal.Length)
iris1$Petal.Width <- normalize(iris1$Petal.Width)

library(neuralnet)
# Random sample of 75 of the 150 rows for training
iris_train <- iris1[sample(1:150, 75), ]
# One logical indicator column per species
iris_train$setosa <- iris_train$Species == "setosa"
iris_train$versicolor <- iris_train$Species == "versicolor"
iris_train$virginica <- iris_train$Species == "virginica"
iris_net <- neuralnet(setosa + virginica + versicolor ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, iris_train, hidden = 2, lifesign = "full")
plot(iris_net, rep = "best")
plot(iris_net, rep = "best", intercept = FALSE)
```

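The R chunk above produces the neural network for classifying the iris data. We can change the function parameters, in particular hidden, to obtain different network structures and different relations between the nodes. Note that in neuralnet a single integer passed to hidden gives the number of neurons in one hidden layer. The following graphs are for hidden = 2, 3, and 4.

[Figures: Hidden layer = 2, Hidden layer = 3, Hidden layer = 4]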


CROSS VALIDATING THE RESULT

The trained neural network classifies the iris dataset using the weights assigned to each node. Here we validate the model to see whether it predicts accurately or not.

# Use the normalized data (iris1), since the network was trained on normalized inputs
pred <- compute(iris_net, iris1[1:4])
pred$net.result

# Output columns follow the order of the formula: setosa, virginica, versicolor
classes <- c("setosa", "virginica", "versicolor")
result <- classes[apply(pred$net.result, 1, which.max)]

comparison <- iris1
comparison$predicted <- result
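
To turn the comparison of actual and predicted species into numbers, a confusion table and an accuracy value can be computed. The helper below, confusion_summary, is a hypothetical name introduced for this sketch, and the labels in the example are toy values, not results from the network.

```r
# Hypothetical helper: confusion table and accuracy for predicted vs actual labels
confusion_summary <- function(actual, predicted) {
  # Use a common set of levels so the table is square even if a class is never predicted
  lv  <- union(levels(factor(actual)), levels(factor(predicted)))
  tab <- table(actual    = factor(actual, levels = lv),
               predicted = factor(predicted, levels = lv))
  list(table = tab, accuracy = sum(diag(tab)) / sum(tab))
}

# Toy labels only, to show the call; these are not results from the network
actual    <- c("setosa", "setosa", "versicolor", "virginica", "virginica")
predicted <- c("setosa", "setosa", "virginica",  "virginica", "virginica")

res <- confusion_summary(actual, predicted)
print(res$table)
print(res$accuracy)
```

With the objects from the slides, the same helper could be called with the Species column and the predicted column of the comparison data frame built above. Since the network was trained on a random half of the rows, restricting the check to the rows that were not sampled would give a more honest estimate.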

RESULT

The neural network formed can successfully separate the "setosa" subclass from the "versicolor" and "virginica" subclasses, while a reliable classification between the "versicolor" and "virginica" classes was not attained.

END
