Classification Technique KNN in Data Mining --- on dataset “Iris” · Comp722 Data Mining · Kaiwen Qi, UNC · Spring 2012


Page 1: Data mining project presentation

Classification Technique KNN in Data Mining

---on dataset “Iris”

Comp722 Data Mining

Kaiwen Qi, UNC

Spring 2012

Page 2: Data mining project presentation

Outline

• Dataset introduction
• Data processing
• Data analysis
• KNN & implementation
• Testing

Page 3: Data mining project presentation

Dataset

Raw dataset: Iris (http://archive.ics.uci.edu/ml/datasets/Iris)

150 total records

50 records Iris Setosa

50 records Iris Versicolour

50 records Iris Virginica

5 Attributes

Sepal length in cm (continuous number)

Sepal width in cm (continuous number)

Petal length in cm (continuous number)

Petal width in cm (continuous number)

Class (nominal data: Iris Setosa, Iris Versicolour, Iris Virginica)

(a) Raw data; (b) Data organization
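The raw UCI file can be parsed in a few lines. This is a minimal sketch, assuming the standard comma-separated `iris.data` layout (four numeric attributes followed by the class label); the `load_iris` helper name is illustrative, not from the deck:

```python
import csv

def load_iris(path="iris.data"):
    """Parse the UCI Iris file: 4 numeric attributes + class label per row."""
    records = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # the UCI file ends with a blank line
                continue
            *features, label = row
            records.append(([float(x) for x in features], label))
    return records
```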

Page 4: Data mining project presentation

Classification Goal

Task: predict the class (Setosa, Versicolour, or Virginica) of an unseen record from its four attributes

Page 5: Data mining project presentation

Data Processing

Original data

Page 6: Data mining project presentation

Data Processing

• Balanced distribution

Page 7: Data mining project presentation

Data Analysis

Statistics

Page 8: Data mining project presentation

Data Analysis

Histogram

Page 9: Data mining project presentation

Data Analysis

Histogram

Page 10: Data mining project presentation

KNN

KNN algorithm

The unknown sample, the green circle, is classified as square when K is 5, because squares form the majority among its 5 nearest neighbors. The distance between two points is the Euclidean distance d(p, q) = √(Σᵢ (pᵢ − qᵢ)²).
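The distance formula can be sketched directly in Python (the `euclidean` helper name is illustrative):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```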

Page 11: Data mining project presentation

KNN

Advantages

• Simplicity of implementation; it is good at dealing with numeric attributes.

• Builds no model: it just imports the dataset, with very low computational overhead.

• Does not need to compute a useful attribute subset. Compared with naïve Bayesian, we do not need to worry about a lack of available probability data.

Page 12: Data mining project presentation

Implementation of KNN Algorithm

Algorithm: KNN. Assign a classification label from training data to an unlabeled tuple.

Input: K, the number of neighbors; a dataset that includes the training data
Output: A string that indicates the unknown tuple's classification

Method:
1. Create a distance array whose size is K
2. Initialize the array with the distances between the unlabeled tuple and the first K records in the dataset
3. Let i = K + 1
4. Calculate the distance between the unlabeled tuple and the i-th record in the dataset; if the distance is smaller than the biggest distance in the array, replace that old maximum with the new distance; i = i + 1
5. Repeat step 4 until i is greater than the dataset size (150)
6. Count the class occurrences in the array; the class with the biggest count is the mining result
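The steps above can be sketched in Python. This is a minimal sketch, not the deck's actual implementation; `knn_classify` and the (features, label) record layout are assumptions:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(k, dataset, unknown):
    """Classify `unknown` by majority vote among its k nearest neighbors.

    `dataset` is a list of (features, label) pairs and is assumed to hold
    at least k records. Mirrors the slide's steps: seed the k-slot array
    with the first k records, then scan the rest, replacing the current
    farthest neighbor whenever a closer record is found.
    """
    # Steps 1-2: initialize the k-slot array with the first k records.
    neighbors = [(euclidean(unknown, feats), label) for feats, label in dataset[:k]]
    # Steps 3-5: scan the remaining records, keeping the k smallest distances.
    for feats, label in dataset[k:]:
        d = euclidean(unknown, feats)
        worst = max(range(k), key=lambda i: neighbors[i][0])
        if d < neighbors[worst][0]:
            neighbors[worst] = (d, label)
    # Step 6: majority vote over the classes left in the array.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```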

Page 13: Data mining project presentation

Implementation of KNN

UML

Page 14: Data mining project presentation

Testing

Testing (K=7, total 150 tuples)

Page 15: Data mining project presentation

Testing

Testing (K=7, 60% of the data as training data)

Page 16: Data mining project presentation

Testing

Input: a randomly shuffled dataset (random record order)

Accuracy test:
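The accuracy test can be sketched as follows, assuming a shuffled list of (features, label) records and the 60% training split from the previous slide; `knn_predict`, `accuracy_test`, and the fixed shuffle seed are illustrative assumptions:

```python
import math
import random
from collections import Counter

def knn_predict(k, train, x):
    """Plain k-NN vote: take the k training records nearest to x."""
    nearest = sorted(train, key=lambda r: math.dist(x, r[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy_test(dataset, k=7, train_frac=0.6, seed=0):
    """Shuffle, train on the first train_frac of records, score the rest."""
    data = dataset[:]
    random.Random(seed).shuffle(data)  # random distribution of records
    split = int(len(data) * train_frac)
    train, test = data[:split], data[split:]
    hits = sum(knn_predict(k, train, feats) == label for feats, label in test)
    return hits / len(test)
```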

Page 17: Data mining project presentation


Performance

Comparison

Decision tree

Advantages
• Comprehensibility
• Constructs a decision tree without any domain knowledge
• Handles high-dimensional data
• By eliminating unrelated attributes and pruning the tree, it simplifies the classification calculation

Disadvantages
• Requires good quality training data
• Usually runs in memory
• Not good at handling continuous number features

Naïve Bayesian

Advantages
• Relatively simple
• Classifies by simply calculating attribute frequencies from the training data, without any other operations (e.g. sort, search)

Disadvantages
• The attribute-independence assumption often does not hold
• Probabilities cannot be estimated when no matching probability data is available in the training set

Page 18: Data mining project presentation

Conclusion

KNN is a simple algorithm with high classification accuracy for datasets with continuous attributes.

It performs well when the training data has a balanced class distribution.

Page 19: Data mining project presentation

Thanks! Questions?