Classification Technique KNN in Data Mining --- on dataset “Iris” · Comp722 Data Mining · Kaiwen Qi, UNC · Spring 2012


Page 1: Data mining project presentation

Classification Technique KNN in Data Mining

---on dataset “Iris”

Comp722 Data Mining

Kaiwen Qi, UNC

Spring 2012

Page 2: Data mining project presentation

Outline

• Dataset introduction
• Data processing
• Data analysis
• KNN & implementation
• Testing

Page 3: Data mining project presentation

Dataset

Raw dataset: Iris (http://archive.ics.uci.edu/ml/datasets/Iris)

150 total records

50 records Iris Setosa

50 records Iris Versicolour

50 records Iris Virginica

5 Attributes

Sepal length in cm (continuous number)

Sepal width in cm (continuous number)

Petal length in cm (continuous number)

Petal width in cm (continuous number)

Class (nominal data: Iris Setosa, Iris Versicolour, Iris Virginica)

(a) Raw data; (b) Data organization
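The raw UCI file can be parsed in a few lines. This is a minimal sketch, assuming the standard comma-separated `iris.data` layout (four numeric attributes followed by the class label); the `load_iris` helper name is illustrative, not from the deck:

```python
import csv

def load_iris(path="iris.data"):
    """Parse the UCI Iris file: 4 numeric attributes + class label per row."""
    records = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # the UCI file ends with a blank line
                continue
            *features, label = row
            records.append(([float(x) for x in features], label))
    return records
```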

Page 4: Data mining project presentation

Classification Goal

Task: predict the class (Setosa, Versicolour, or Virginica) of an unseen record from its four attributes

Page 5: Data mining project presentation

Data Processing

Original data

Page 6: Data mining project presentation

Data Processing

• Balanced distribution

Page 7: Data mining project presentation

Data Analysis

Statistics

Page 8: Data mining project presentation

Data Analysis

Histogram

Page 9: Data mining project presentation

Data Analysis

Histogram

Page 10: Data mining project presentation

KNN

KNN algorithm

The unknown sample, the green circle, is classified as square when K is 5, because squares form the majority among its 5 nearest neighbors. The distance between two points is the Euclidean distance d(p, q) = √(Σᵢ (pᵢ − qᵢ)²).
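The distance formula can be sketched directly in Python (the `euclidean` helper name is illustrative):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```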

Page 11: Data mining project presentation

KNN

Advantages

• Simplicity of implementation; it is good at dealing with numeric attributes.

• Builds no model: it just imports the dataset, with very low computational overhead.

• Does not need to compute a useful attribute subset. Compared with naïve Bayesian, we do not need to worry about a lack of available probability data.

Page 12: Data mining project presentation

Implementation of KNN Algorithm

Algorithm: KNN. Assign a classification label from training data to an unlabeled tuple.

Input: K, the number of neighbors; a dataset that includes the training data
Output: A string that indicates the unknown tuple's classification

Method:
1. Create a distance array whose size is K
2. Initialize the array with the distances between the unlabeled tuple and the first K records in the dataset
3. Let i = K + 1
4. Calculate the distance between the unlabeled tuple and the i-th record in the dataset; if the distance is smaller than the biggest distance in the array, replace that old maximum with the new distance; i = i + 1
5. Repeat step 4 until i is greater than the dataset size (150)
6. Count the class occurrences in the array; the class with the biggest count is the mining result
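The steps above can be sketched in Python. This is a minimal sketch, not the deck's actual implementation; `knn_classify` and the (features, label) record layout are assumptions:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(k, dataset, unknown):
    """Classify `unknown` by majority vote among its k nearest neighbors.

    `dataset` is a list of (features, label) pairs and is assumed to hold
    at least k records. Mirrors the slide's steps: seed the k-slot array
    with the first k records, then scan the rest, replacing the current
    farthest neighbor whenever a closer record is found.
    """
    # Steps 1-2: initialize the k-slot array with the first k records.
    neighbors = [(euclidean(unknown, feats), label) for feats, label in dataset[:k]]
    # Steps 3-5: scan the remaining records, keeping the k smallest distances.
    for feats, label in dataset[k:]:
        d = euclidean(unknown, feats)
        worst = max(range(k), key=lambda i: neighbors[i][0])
        if d < neighbors[worst][0]:
            neighbors[worst] = (d, label)
    # Step 6: majority vote over the classes left in the array.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```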

Page 13: Data mining project presentation

Implementation of KNN

UML

Page 14: Data mining project presentation

Testing

Testing (K=7, total 150 tuples)

Page 15: Data mining project presentation

Testing

Testing (K=7, 60% of the data as training data)

Page 16: Data mining project presentation

Testing

Input: a randomly shuffled dataset (random record order)

Accuracy test:
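The accuracy test can be sketched as follows, assuming a shuffled list of (features, label) records and the 60% training split from the previous slide; `knn_predict`, `accuracy_test`, and the fixed shuffle seed are illustrative assumptions:

```python
import math
import random
from collections import Counter

def knn_predict(k, train, x):
    """Plain k-NN vote: take the k training records nearest to x."""
    nearest = sorted(train, key=lambda r: math.dist(x, r[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy_test(dataset, k=7, train_frac=0.6, seed=0):
    """Shuffle, train on the first train_frac of records, score the rest."""
    data = dataset[:]
    random.Random(seed).shuffle(data)  # random distribution of records
    split = int(len(data) * train_frac)
    train, test = data[:split], data[split:]
    hits = sum(knn_predict(k, train, feats) == label for feats, label in test)
    return hits / len(test)
```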

Page 17: Data mining project presentation


Performance

Comparison

Decision tree

Advantages
• Comprehensibility
• Constructs a decision tree without any domain knowledge
• Handles high-dimensional data
• By eliminating unrelated attributes and pruning the tree, it simplifies the classification calculation

Disadvantages
• Requires good quality training data
• Usually runs in memory
• Not good at handling continuous number features

Naïve Bayesian

Advantages
• Relatively simple
• Classifies by simply calculating attribute frequencies from the training data, without any other operations (e.g. sort, search)

Disadvantages
• The attribute-independence assumption often does not hold
• Probabilities cannot be estimated when no matching probability data is available in the training set

Page 18: Data mining project presentation

Conclusion

KNN is a simple algorithm with high classification accuracy for datasets with continuous attributes.

It performs well when the training data has a balanced class distribution.

Page 19: Data mining project presentation

Thanks! Questions?