Data Analysis using WEKA
Waikato Environment for Knowledge Analysis

Weka is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. This paper attempts to show the use of this software for Cluster Analysis and Decision Tree Analysis.

Prabhjot Singh Bhatia – 10BM60060
Vinod Gupta School of Management, Indian Institute of Technology Kharagpur, India





Table of Contents

Introduction
    Cluster Analysis
        K Means Clustering
    Decision Trees
        Features
Cluster Analysis using WEKA
    About the dataset
    Steps for K Means clustering
    Interpretation
        Cluster 0
        Cluster 1
        Cluster 2
        Cluster 3
        Cluster 4
Decision trees using WEKA
    About the Dataset
    Steps for decision tree generation
    Interpretation of Output
Works Cited


Table of Figures

Screenshot 1: The data file loaded into Weka
Screenshot 2: The Visualize Tab
Screenshot 3: Selecting Simple KMeans
Screenshot 4: The Cluster Tab
Screenshot 5: KMeans Clustering Options
Screenshot 6: SimpleKMeans, 2 Clusters
Screenshot 7: Simple KMeans, Results for 2-14 clusters
Screenshot 8: Knee point for number of clusters
Screenshot 9: Visualizing 5 cluster solution
Screenshot 10: Visualization of Clusters - I
Screenshot 11: Visualization of Clusters - II
Screenshot 12: The data file loaded into Weka
Screenshot 13: Visualizing the given dataset
Screenshot 14: Selecting J48 Tree Algorithm
Screenshot 15: The Classify Tab
Screenshot 16: The Output of J48 Algorithm
Screenshot 17: The Decision Tree Generated
Screenshot 18: Dissection of the Textual Decision Tree Output


Introduction

Identifying patterns in data, and being able to make predictions based on those patterns, plays a significant role in every aspect of an industry or an individual business. A plethora of methods and tools is available. This paper is an attempt to introduce two such methods, namely decision tree classification and K Means clustering, using a tool called Weka.

Cluster Analysis

As a part of exploratory data mining, cluster analysis is used to assign a set of objects to groups such that objects are more similar to other objects within their own group than to those belonging to another. As a statistical data analysis tool, cluster analysis forms the backbone of various disciplines, including marketing research, pattern recognition, intelligence gathering and image analysis.

Cluster analysis comprises a set of algorithms, each suited to specific situations. These include hierarchical clustering, K-Means clustering and others.

K Means Clustering

This method has the objective of partitioning a set of n objects into k clusters, based on the closeness of each object to the cluster centres. Closeness to the cluster centres is measured using a standard distance metric, e.g. Euclidean distance.

Features

K Means clustering is computationally very fast compared to other clustering algorithms. The number of clusters is expected as an input for the algorithm to work. This number "k" can be determined by several methods. One of them uses hierarchical clustering first to determine k and then continues with K Means; however, this approach is not computationally efficient and takes a long time to converge.

Another method, and the one that will be described here, is the knee-point method (Sugar & James, 2003). This involves running the K Means algorithm for k = 1 up to some arbitrary number, say 10-15. The distortion (average within-cluster sum of squared errors) is then plotted against the number of clusters to find a "knee" point, which gives the number of clusters.


KMeans Clustering Algorithm

The algorithm takes the number of clusters ("k") as a mandatory input and follows these steps: (Matteucci)

1. "k" points are placed into the space represented by the objects under consideration. These are the initial cluster centres, and serve as the group centroids.

2. Based on a distance metric, each object is assigned to the cluster closest to it.

3. Step 2 continues until all objects have been assigned to a cluster.

4. The positions of the cluster centroids are recalculated.

5. Steps 2-4 are repeated until the cluster centres no longer move.

6. This gives a partitioning of the objects into distinct groups.
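The steps above can be sketched in code. This is an illustrative sketch in Python, not Weka's implementation; the random initialisation and the exact-equality convergence test are simplifying assumptions.

```python
import math
import random

def kmeans(points, k, iters=100, seed=10):
    """Plain k-means: assign each object to its nearest centre, recompute, repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)              # step 1: initial cluster centres
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # steps 2-3: assign every object to the closest centre (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        # step 4: recompute each centroid as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centres[i]
               for i, cl in enumerate(clusters)]
        if new == centres:                       # step 5: centres no longer move
            break
        centres = new
    return centres, clusters                     # step 6: the final partitioning
```

For example, four points forming two well-separated pairs end up in two clusters of two.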

Decision Trees

The aim of this method is to predict which criteria determine the outcome, and in what order. We start with a sample of data called the training set and generate a model based on machine learning. We later use this model in the field to predict the outcome on the basis of the input criteria.

The actual splitting of data into groups is done on the basis of rules. A majority of algorithms depend on a technique called recursive partitioning, so named because it is repeated for each sub-group of data. Partitioning continues until further splits no longer add value. There are several algorithms for decision trees, including J48, BFTree, Random Forest, etc.
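At each step, recursive partitioning picks the attribute whose split most reduces uncertainty in the class label. J48 (an implementation of C4.5) actually uses a gain-ratio criterion; the simplified sketch below uses plain information gain, which conveys the idea.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting `rows` (dicts) on attribute `attr`."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    # weighted entropy remaining after the split
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder
```

A perfectly separating attribute (e.g. one value per class) yields a gain equal to the full entropy of the labels.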

Features

- Simple to understand and interpret.
- Little or no data preprocessing is required.
- Handles both numerical and categorical data.
- Performs well with large data in a short time.
- However, trees can become very complex and too large to visualize.
- Attributes with more levels tend to bias the tree.


Cluster Analysis using WEKA

About the dataset

The dataset used is the German credit dataset, available from (Hofman). The data has been compiled by Professor Dr. Hans Hofmann, Institut für Statistik und Ökonometrie, University of Hamburg, Germany. The dataset describes customers on the basis of a set of 20 parameters, including age, housing, number of years since becoming a resident, car owned, etc.

The data set also contains a "class" attribute classifying credit-worthiness. We do not include this attribute in our clustering algorithm.

Steps for K Means clustering

1. Open the file named credit-g.arff, available from the source above. The file is already pre-processed, so we can use the data as is. (Screenshot 1)

2. Click on “Visualize” tab to have a look at the data. This shows the plots of one field against

every other field, colour coded by the class. (Screenshot 2)

3. Click on the "Cluster" tab (Screenshot 4). Click on "Choose" and select "SimpleKMeans" (Screenshot 3). By default, Weka chooses commonly used values for its parameters (Screenshot 5). The various options are self-explanatory. One major decision variable is the number of clusters, which we determine as described above.

4. Keep the number of clusters at 2, since a 1-cluster solution would not be appropriate in this case. Click OK.

5. Click on "Ignore attributes". If the data set contains any column from other classification experiments, select that attribute from the list provided. This is needed to keep the clustering process unbiased.

6. Click “Start” to start the clustering algorithm. The Weka icon at the bottom right corner

keeps flipping while processing is still in progress. Once the icon stops, the analysis result is

displayed in the adjacent window. (Screenshot 6)

7. Click on the text box next to the "Choose" button to get the list of options. Increase the number of clusters to 3, and click OK. Repeat steps 5 and 6.

8. Similarly, repeat step 7 for cluster counts up to 14.

9. Tabulate the "Average Within Cluster Sum of Squared Errors" (as seen in Screenshot 7) and plot it using a spreadsheet program (Screenshot 8).

10. The knee point gives the number of clusters in the dataset; here it indicates 5 clusters, so use the 5-cluster output as the cluster solution.

11. Right-click on the 5-cluster solution and click on "Visualize Cluster Assignments". (Screenshot 9)
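The metric tabulated in step 9 can be computed as follows. This is a sketch: "average" is taken per instance here, which is an assumption — the figure Weka reports may be the plain within-cluster sum.

```python
import math

def avg_within_cluster_sse(clusters, centres):
    """Average squared Euclidean distance of each object to its cluster centroid.

    `clusters` is a list of lists of points; `centres[i]` is the centroid of
    `clusters[i]`. Lower values mean tighter clusters.
    """
    total = sum(math.dist(p, centres[i]) ** 2
                for i, cl in enumerate(clusters) for p in cl)
    return total / sum(len(cl) for cl in clusters)
```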


Screenshot 1: The data file loaded into Weka.

Screenshot 2: The Visualize Tab.


Screenshot 4: The Cluster Tab

Screenshot 3: Selecting Simple KMeans


Screenshot 5: KMeans Clustering Options

Screenshot 6: SimpleKMeans, 2 Clusters


Clusters   Avg. Within Cluster Sum of Squared Errors
 2         5365.998
 3         5145.269
 4         4927.793
 5         4691.713
 6         4613.818
 7         4530.644
 8         4437.524
 9         4273.035
10         4202.059
11         4197.927
12         4157.734
13         4113.83
14         4037.48

Screenshot 7: Simple KMeans, Results for 2-14 clusters
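The knee is read off the plot by eye in Screenshot 8. As an automated stand-in (a heuristic sketch, not a Weka feature), one can pick the k after which the marginal improvement in SSE drops most sharply:

```python
# SSE values tabulated above for k = 2..14 (from the SimpleKMeans runs)
sse = {2: 5365.998, 3: 5145.269, 4: 4927.793, 5: 4691.713, 6: 4613.818,
       7: 4530.644, 8: 4437.524, 9: 4273.035, 10: 4202.059, 11: 4197.927,
       12: 4157.734, 13: 4113.83, 14: 4037.48}

def knee(sse):
    """Return the k after which the marginal SSE improvement falls off most sharply."""
    ks = sorted(sse)
    gain = {k: sse[k - 1] - sse[k] for k in ks[1:]}   # improvement from one more cluster
    return max(ks[1:-1], key=lambda k: gain[k] - gain[k + 1])
```

On the values above this picks k = 5, matching the knee marked in Screenshot 8.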

[Figure: Avg. Within Cluster Sum of Squared Errors (y-axis, 4000-5600) plotted against No. of Clusters (x-axis, 2-14), with the knee point marked at (5, 4691.713).]

Screenshot 8: Knee point for number of clusters


=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10

Relation: german_credit

Instances: 1000

Attributes: 21

checking_status

duration

credit_history

purpose

credit_amount

savings_status

employment

installment_commitment

personal_status

other_parties

residence_since

property_magnitude

age

other_payment_plans

housing

existing_credits

job

num_dependents

own_telephone

foreign_worker

Ignored:

class

Test mode: evaluate on training data

=== Model and evaluation on training set ===

kMeans

======

Number of iterations: 9

Within cluster sum of squared errors: 4691.713078260774

Missing values globally replaced with mean/mode

Cluster centroids:

Attribute                Full Data      Cluster 0            Cluster 1                  Cluster 2           Cluster 3        Cluster 4
                         (1000)         (220)                (130)                      (230)               (267)            (153)
checking_status          no checking    no checking          <0                         0<=X<200            no checking      <0
duration                 20.903         21.7409              27.4462                    17.1522             20.779           19.9935
credit_history           existing paid  critical/other       existing paid              existing paid       existing paid    existing paid
                                        existing credit
purpose                  radio/tv       new car              used car                   radio/tv            radio/tv         new car
credit_amount            3271.258       3523.8136            5507.3846                  2465.4739           2763.4307        3105.6471
savings_status           <100           <100                 <100                       <100                <100             <100
employment               1<=X<4         1<=X<4               >=7                        1<=X<4              >=7              <1
installment_commitment   2.973          3.0591               3.0769                     2.8957              3.161            2.549
personal_status          male single    female div/dep/mar   male single                male single         male single      female div/dep/mar
other_parties            none           none                 none                       none                none             none
residence_since          2.845          2.5727               3.5769                     2.3261              3.0225           3.085
property_magnitude       car            car                  no known property          real estate         life insurance   car
age                      35.546         34.4091              44.4923                    33.6652             37.7004          28.6471
other_payment_plans      none           none                 none                       none                none             none
housing                  own            own                  for free                   own                 own              rent
existing_credits         1.407          1.6318               1.4                        1.2913              1.4045           1.268
job                      skilled        skilled              high qualif/               unskilled resident  skilled          skilled
                                                             self emp/mgmt
num_dependents           1.155          1.0727               1.2462                     1.2174              1.1835           1.0523
own_telephone            none           yes                  yes                        none                none             none
foreign_worker           yes            yes                  yes                        yes                 yes              yes

Time taken to build model (full training data) : 0.08 seconds


=== Model and evaluation on training set ===

Clustered Instances

0 220 ( 22%)

1 130 ( 13%)

2 230 ( 23%)

3 267 ( 27%)

4 153 ( 15%)

Screenshot 9: Visualizing 5 cluster solution


Screenshot 10: Notice that a majority of the people in cluster 1 are highly qualified and those in cluster 2 are unemployed

Screenshot 11: Notice that a majority of the people in cluster 0 are holders of previous unpaid/delayed credit


Interpretation

As evident from the visualization, the output has been classified into 5 clusters. The algorithm took 9 iterations to converge. The distortion (within-cluster sum of squared errors) is 4691 units.

The cluster centres are shown in the form of a table, from which the following observations can be made:

- A majority of people in all clusters are foreign workers, with no other payment plans from a bank or store, and have at least one family member wholly dependent on them. They also do not have any guarantors or third parties to vouch for their loan. Moreover, all of them have very little savings to their name.
- Clusters 0 and 4 consist of females who are divorced/separated/married; the others consist of single males.
- Only clusters 0 and 1 own a telephone.

Cluster 0

This set of customers comprises middle-aged divorced/separated/married female applicants who have been employed in a skilled job for the past 1-4 years and have been resident at their self-owned house for around 2.5 years. They currently have critically delayed existing credit to their name. They are seeking a loan for a second new car, and are willing to commit about 3% of their income towards installments, to be paid over 21 months. However, they do not have a checking account where their salary is drawn. In the absence of a guarantor, this seems to be a high-risk category.

Cluster 1

This set of customers comprises single male applicants in their mid-40s who have been employed in a highly skilled job or self-employed for more than 7 years, and have been resident at their freely provided/company-provided house for around 3.5 years. They do not have any property registered to their name, and their checking account is overdrawn. They are seeking a loan for a used car, and are willing to commit about 3% of their income towards installments, to be paid over 27 months. However, they are asking for too high an amount for a used car, considering other groups are asking only 60% of that amount for a new car. This is


indicative of an intention either to use the loan money for some other purpose, or of possible fraud. In the absence of a guarantor, this seems to be a high-risk category.

Cluster 2

This set of customers comprises middle-aged single male applicants who have been employed in an unskilled resident job for the past 1-4 years and have been resident at their self-owned house for around 2 years. They also own real estate in their name. They are seeking a loan for a radio/TV, and are willing to commit about 2.8% of their income towards installments, to be paid over 17 months. They maintain a checking balance of 0-200 Deutsche Mark in their account. In the absence of a guarantor, this seems to be a low-risk category.

Cluster 3

This set of customers comprises middle-aged single male applicants who have been employed in a skilled job for over 7 years and have been resident at their self-owned house for over 3 years. Apart from this, they own nothing more than a life insurance policy in their name. They are seeking a loan for a radio/TV, and are willing to commit about 3.1% of their income towards installments, to be paid over 20 months. However, they do not have a checking account where their salary is drawn. In the absence of a guarantor, this seems to be a medium-risk category.

Cluster 4

This set of customers comprises very young divorced/separated/married female applicants who have been employed in a skilled job for less than 1 year and have been resident at their rented house for around 3 years. They are seeking a loan for a second new car, and are willing to commit about 2.5% of their income towards installments, to be paid over 20 months. Their checking account is overdrawn. In the absence of a guarantor, this seems to be a high-risk category.


Decision trees using WEKA

About the Dataset

The data set, vote.arff, the 1984 US Congressional Voting Records dataset, is a part of the UCI Repository (University of California Irvine), a publicly available repository of data.

The dataset is basically a training set used to classify voters on the basis of their likelihood to vote for either Democrats or Republicans, on the basis of various criteria chosen from the parties' manifestos. These criteria include the parties' stands on issues such as cost sharing of a drinking water project, freezing physician fees, and spending on education, among other factors.

Steps for decision tree generation

1. Launch Weka and open the "Explorer" window. Click on "Open File" and browse to where "vote.arff" is stored. The data set opens up in Weka. (Screenshot 12)

2. Click the "Visualize" tab to see the data in graphical form, plotted against each attribute. (Screenshot 13)

3. Click on the "Classify" tab. We see "ZeroR" as the default classification rule. (Screenshot 15)

4. Click on the "Choose" button and select "J48" under "Trees". (Screenshot 14)

5. Weka makes workable assumptions about the various configuration options for this algorithm, so we can continue with the defaults. Click on "Start" to execute the algorithm. When the algorithm has finished, we get the output, including the textual decision tree. (Screenshot 16)

6. To visualize the decision tree in graphical format, right-click on the result in the left pane and click "Visualize Tree". A new window opens showing the decision tree in graphical format. (Screenshot 17)


Screenshot 12: The data file loaded into Weka

Screenshot 13: Visualizing the given dataset


Screenshot 15: The Classify Tab

Screenshot 14: Selecting J48 Tree Algorithm


Screenshot 16: The Output of J48 Algorithm

Screenshot 17: The Decision Tree Generated.


=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2

Relation: vote

Instances: 435

Attributes: 17

handicapped-infants

water-project-cost-sharing

adoption-of-the-budget-resolution

physician-fee-freeze

el-salvador-aid

religious-groups-in-schools

anti-satellite-test-ban

aid-to-nicaraguan-contras

mx-missile

immigration

synfuels-corporation-cutback

education-spending

superfund-right-to-sue

crime

duty-free-exports

export-administration-act-south-africa

Class

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree

------------------

physician-fee-freeze = n: democrat (253.41/3.75)

physician-fee-freeze = y

| synfuels-corporation-cutback = n: republican (145.71/4.0)

| synfuels-corporation-cutback = y

| | mx-missile = n

| | | adoption-of-the-budget-resolution = n: republican (22.61/3.32)

| | | adoption-of-the-budget-resolution = y

| | | | anti-satellite-test-ban = n: democrat (5.04/0.02)

| | | | anti-satellite-test-ban = y: republican (2.21)

| | mx-missile = y: democrat (6.03/1.03)

Number of Leaves : 6

Size of the tree : 11

Time taken to build model: 0.06 seconds
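The reported leaf count and tree size can be recovered from the textual tree itself. The sketch below assumes (based on the output format shown) that each internal node is printed once per outgoing branch, so counting distinct (depth, attribute) pairs counts internal nodes:

```python
tree_text = """\
physician-fee-freeze = n: democrat (253.41/3.75)
physician-fee-freeze = y
|   synfuels-corporation-cutback = n: republican (145.71/4.0)
|   synfuels-corporation-cutback = y
|   |   mx-missile = n
|   |   |   adoption-of-the-budget-resolution = n: republican (22.61/3.32)
|   |   |   adoption-of-the-budget-resolution = y
|   |   |   |   anti-satellite-test-ban = n: democrat (5.04/0.02)
|   |   |   |   anti-satellite-test-ban = y: republican (2.21)
|   |   mx-missile = y: democrat (6.03/1.03)"""

lines = tree_text.splitlines()
leaves = sum(1 for line in lines if ": " in line)        # lines ending in a class label
internal = len({(line.count("|"), line.split(" = ")[0].strip("| "))
                for line in lines})                      # one test node per (depth, attribute)
size = leaves + internal                                 # leaves == 6, size == 11
```

This reproduces the "Number of Leaves: 6" and "Size of the tree: 11" figures above.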


=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 419 96.3218 %

Incorrectly Classified Instances 16 3.6782 %

Kappa statistic 0.9224

Mean absolute error 0.0611

Root mean squared error 0.1748

Relative absolute error 12.887 %

Root relative squared error 35.9085 %

Total Number of Instances 435

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.97     0.048    0.97       0.97    0.97       0.971     democrat
               0.952    0.03     0.952      0.952   0.952      0.971     republican
Weighted Avg.  0.963    0.041    0.963      0.963   0.963      0.971

=== Confusion Matrix ===

a b <-- classified as

259 8 | a = democrat

8 160 | b = republican

Interpretation of Output

The complete output of the J48 algorithm is shown in the previous section. The beginning summarises the dataset and shows that the output was reached using 10-fold cross-validation. To avoid ambiguity, the attributes worked upon are also listed.

Under the heading "J48 pruned tree" we see the decision tree in textual format. The first split is based on whether the "physician fee freeze" is part of the manifesto or not. If not, the voter straight away votes for the Democrats.

A detailed description of the various elements of this textual decision tree is as follows (Screenshot 18):

1. The attribute which decides the outcome for that branch.

2. The value of the attribute for that branch.


3. The colon is the separator for the class label assigned to a particular leaf.

4. The class label assigned to a particular leaf.

5. The number of instances assigned to the leaf, as a decimal number.

6. The number of instances at this leaf that are wrongly classified.
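Putting the six elements together, a leaf line can be pulled apart mechanically. The pattern below is a hypothetical helper written for this paper, not part of Weka:

```python
import re

# Hypothetical helper: split one leaf line of the textual tree into its six elements.
LEAF = re.compile(r"(?P<attr>[\w-]+) = (?P<value>\w+)"      # 1-2: attribute and branch value
                  r": (?P<label>\w+)"                       # 3-4: separator and class label
                  r" \((?P<covered>[\d.]+)"                 # 5: instances reaching the leaf
                  r"(?:/(?P<errors>[\d.]+))?\)")            # 6: misclassified count, if shown

m = LEAF.match("physician-fee-freeze = n: democrat (253.41/3.75)")
```

For leaves with no misclassifications, such as "(2.21)", the error group is simply absent.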

The graphically formatted decision tree (Screenshot 17) is more visually appealing and has the same components as above. As mentioned, the number of leaves and the size of the tree are 6 and 11 respectively.

The next section of the output describes the performance of the tree: 3.6782% of instances are incorrectly classified. From the confusion matrix, it can be seen that a total of 16 instances are incorrectly classified (8 Democrats and 8 Republicans).

The high value of the Kappa statistic (0.9224) shows that the results generated from this model are quite believable, and that the model is suitable for deployment in the real world.
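As a sanity check, the Kappa statistic can be recomputed from the confusion matrix above:

```python
# Confusion matrix from the output above: rows are actual, columns predicted.
matrix = [[259, 8],     # democrat
          [8, 160]]     # republican

n = sum(sum(row) for row in matrix)                               # 435 instances
po = sum(matrix[i][i] for i in range(2)) / n                      # observed accuracy
pe = sum(sum(matrix[i]) * sum(row[i] for row in matrix)
         for i in range(2)) / n ** 2                              # agreement expected by chance
kappa = (po - pe) / (1 - pe)                                      # ≈ 0.9224
```

This reproduces both the 96.3218% correctly-classified figure and the reported Kappa of 0.9224.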

This decision tree gives us a fair idea of the importance and order of the various factors a voter takes into consideration while deciding whom to vote for.

Screenshot 18: Dissection of the Textual Decision Tree Output



Works Cited

Hofman, D. H. (n.d.). German Credit Dataset. Institut für Statistik und Ökonometrie, University of Hamburg, Germany. Retrieved April 16, 2012, from the Auckland University website: http://www.stat.auckland.ac.nz/~reilly/credit-g.arff

Matteucci, M. (n.d.). A Tutorial on Clustering Algorithms. Retrieved April 2, 2012, from Matteucci's personal page at the Politecnico di Milano website: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html

Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a data set: An information theoretic approach. Journal of the American Statistical Association, Vol. 98, 750-763.

University of California Irvine, Machine Learning Repository. (n.d.). 1984 US Congressional Voting Records Dataset. Retrieved April 16, 2012, from http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Wikipedia. (2012, March 28). Decision tree learning. Retrieved April 6, 2012, from http://en.wikipedia.org/wiki/Decision_tree_learning

Wikipedia. (2012, April 7). K-means clustering. Retrieved April 9, 2012, from http://en.wikipedia.org/wiki/K-means_clustering