Lecture 5: Automatic cluster detection. Lecture 6: Artificial neural networks. Lecture 7: Evaluation of discovered knowledge. Brief introduction to lectures. Transparencies prepared by Ho Tu Bao [JAIST].


Page 1

Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Brief introduction to lectures

Transparencies prepared by Ho Tu Bao [JAIST]

Page 2

Lecture 5: Automatic Cluster Detection

• One of the most widely used KDD techniques for unsupervised data (objects are grouped without predefined class labels).

• Content of the lecture
  1. Introduction
  2. Partitioning clustering
  3. Hierarchical clustering
  4. Software and case-studies

• Prerequisite: Nothing special

Page 3

Partitioning Clustering

A partition of a set of n objects X = {x1, x2, ..., xn} is a collection of K disjoint non-empty subsets P1, P2, ..., PK of X (K ≤ n), often called clusters, satisfying the following conditions:

(1) they are disjoint: Pi ∩ Pj = ∅ for all Pi and Pj, i ≠ j
(2) their union is X: P1 ∪ P2 ∪ ... ∪ PK = X

• Each cluster must contain at least one object
• Each object must belong to exactly one group

Denote the partition P = {P1, P2, ..., PK}; P1, ..., PK are called the components of P.
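As an aside (not part of the original transparencies), the two conditions translate directly into a small check. The following Python sketch, with our own hypothetical function name `is_partition`, verifies that a collection of subsets is non-empty, pairwise disjoint, and covers X:

```python
def is_partition(X, components):
    """Check that `components` is a partition of X:
    non-empty, pairwise disjoint, and covering all of X."""
    if any(len(P_i) == 0 for P_i in components):
        return False                      # each cluster must contain at least one object
    union = set()
    for P_i in components:
        if union & set(P_i):              # overlap with an earlier component
            return False                  # components must be disjoint
        union |= set(P_i)
    return union == set(X)                # the union must be exactly X

# Example partition of X = {x1, ..., x10}, written with the integers 1..10
X = range(1, 11)
P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
print(is_partition(X, P))                 # True
```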

Page 4

Partitioning Clustering

What is a "good" partitioning clustering?

Key ideas: Objects in each group are similar, and objects in different groups are dissimilar.

Minimize the within-group distance and maximize the between-group distance.

Notice: There are many ways to define the "within-group distance" (e.g. the average distance to the group's center, or the average distance between all pairs of objects) and the "between-group distance". It is in general infeasible to find the optimal clustering, because the number of possible partitions grows extremely quickly with n.

Example: P = {P1, P2, P3} = {{x1, x4, x7, x9, x10}, {x2, x8}, {x3, x5, x6}} is a partition of X = {x1, ..., x10} into three clusters.
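To make "within-group" and "between-group" distance concrete, here is a small illustration in Python (ours, not the lecture's code), using one of the possible definitions mentioned above: the average distance of a group's objects to its centroid, and the distance between group centroids. The function names are our own.

```python
import math

def centroid(points):
    """Mean of a list of d-dimensional points."""
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def within_group_distance(group):
    """Average distance of the group's objects to the group centroid."""
    c = centroid(group)
    return sum(euclid(p, c) for p in group) / len(group)

def between_group_distance(g1, g2):
    """Distance between the two group centroids."""
    return euclid(centroid(g1), centroid(g2))

# Toy 2-D data: a good clustering keeps within-group distances small
# and between-group distances large.
g1 = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]
g2 = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
print(within_group_distance(g1), within_group_distance(g2))
print(between_group_distance(g1, g2))
```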

Page 5

Hierarchical Clustering

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence.

Partition Q is nested into partition P if every component of Q is a subset of a component of P.

(This definition is for bottom-up hierarchical clustering. In the case of top-down hierarchical clustering, "next" becomes "previous".)

P = {x1, x4, x7, x9, x10}, {x2, x8}, {x3, x5, x6}
Q = {x1, x4, x9}, {x7, x10}, {x2, x8}, {x3, x5}, {x6}
(Q is nested into P: every component of Q is contained in a component of P.)
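The nesting condition can be checked mechanically. A minimal sketch (our own helper name `is_nested`), using the example above with the objects x1..x10 written as the integers 1..10:

```python
def is_nested(Q, P):
    """Return True if partition Q is nested into partition P,
    i.e. every component of Q is a subset of some component of P."""
    return all(any(q <= p for p in P) for q in Q)

P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {3, 5}, {6}]
print(is_nested(Q, P))   # True:  Q refines P
print(is_nested(P, Q))   # False: P does not refine Q
```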

Page 6

Bottom-up Hierarchical Clustering

{x1, x2, x3, x4, x5, x6}
{x1, x2, x3, x4}, {x5, x6}
{x1, x2, x3}, {x4}, {x5, x6}
{x1, x2}, {x3}, {x4}, {x5}, {x6}
{x1}, {x2}, {x3}, {x4}, {x5}, {x6}

[Figure: dendrogram over x1 x2 x3 x4 x5 x6, read from the singletons upward through these successive merges]
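Such a sequence of nested partitions is what a bottom-up (agglomerative) algorithm produces: start from singletons and repeatedly merge the closest pair of clusters. The sketch below is our own illustration using single-link distance on a one-dimensional toy data set; the lecture does not prescribe a particular linkage or implementation.

```python
def single_link_distance(c1, c2, dist):
    """Single-link distance: closest pair of objects across the two clusters."""
    return min(dist[a][b] for a in c1 for b in c2)

def agglomerative(objects, dist):
    """Return the sequence of nested partitions, from all singletons
    up to one cluster containing every object."""
    clusters = [frozenset([o]) for o in objects]
    sequence = [list(clusters)]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link_distance(clusters[ab[0]], clusters[ab[1]], dist),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        sequence.append(list(clusters))
    return sequence

# Toy distance matrix over objects x1..x6 (points on a line)
names = ["x1", "x2", "x3", "x4", "x5", "x6"]
coords = {"x1": 0.0, "x2": 0.5, "x3": 1.2, "x4": 2.5, "x5": 6.0, "x6": 6.3}
dist = {a: {b: abs(coords[a] - coords[b]) for b in names} for a in names}

for level in agglomerative(names, dist):
    print([sorted(c) for c in level])
```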

Page 7

Top-Down Hierarchical Clustering

{x1, x2, x3, x4, x5, x6}
{x1, x2, x3, x4}, {x5, x6}
{x1, x2, x3}, {x4}, {x5, x6}
{x1, x2}, {x3}, {x4}, {x5}, {x6}
{x1}, {x2}, {x3}, {x4}, {x5}, {x6}

[Figure: the same dendrogram over x1 x2 x3 x4 x5 x6, now read from the whole set downward as successive splits]

Page 8

OSHAM: Hybrid Model

[Figure: concept hierarchy discovered by OSHAM from the Wisconsin Breast Cancer data, showing the attributes, the discovered concepts, brief descriptions of the concepts, and multiple-inheritance concepts]

Page 9

Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Brief introduction to lectures

Page 10

Lecture 6: Neural networks

• One of the most widely used KDD classification techniques.
• Content of the lecture
  1. Neural network representation
  2. Feed-forward neural networks
  3. Using the back-propagation algorithm (a minimal example is sketched below)
  4. Case-studies
• Prerequisite: Nothing special
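As a preview of items 2 and 3 (not code from the lecture), the sketch below trains a tiny feed-forward network with back-propagation on the XOR problem using NumPy; the layer sizes, learning rate, and number of epochs are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: a classic problem that a single-layer network cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 4 units, one output unit
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    out = sigmoid(h @ W2 + b2)        # network output

    # backward pass (gradients of squared error, using sigmoid' = s(1 - s))
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # gradient-descent updates
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 2))   # should approach [[0], [1], [1], [0]]
```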

Page 11

Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Brief introduction to lectures

Page 12

Lecture 7: Evaluation of discovered knowledge

• Evaluation is an essential step when applying KDD classification techniques: it estimates how well a discovered model performs on unseen data.
• Content of the lecture
  1. Cross validation
  2. Bootstrapping (a small sketch follows below)
  3. Case-studies
• Prerequisite: Nothing special
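As an illustration of item 2 (not the lecture's code), the sketch below estimates error by bootstrapping: resample the data with replacement, train on the resample, and test on the objects left out. The names `bootstrap_error`, `train`, and `error` are our own placeholders for any induction method and error measure.

```python
import random

def bootstrap_error(data, train, error, n_resamples=200, seed=0):
    """Out-of-bag error estimate: train on a resample drawn with replacement,
    test on the objects that did not appear in the resample, and average."""
    rnd = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = [rnd.choice(data) for _ in range(len(data))]
        in_sample = set(id(x) for x in sample)
        out_of_bag = [x for x in data if id(x) not in in_sample]
        if not out_of_bag:
            continue
        model = train(sample)
        estimates.append(error(model, out_of_bag))
    return sum(estimates) / len(estimates)

# Toy usage: the "model" is the training-set mean, the error is mean absolute deviation
data = [1.0, 1.2, 0.9, 5.0, 1.1, 0.95, 1.05]
train = lambda s: sum(s) / len(s)
error = lambda m, test: sum(abs(x - m) for x in test) / len(test)
print(bootstrap_error(data, train, error))
```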

Page 13

Out-of-sample testing

[Figure: out-of-sample testing flow. Historical data (warehouse) → sampling method → sample data; the sample is split by a sampling method into training data (2/3) and testing data (1/3); an induction method builds a model from the training data, and error estimation on the testing data yields the error.]

The quality of the test-sample estimate depends on the number of test cases and on the validity of the independence assumption.
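A minimal sketch of the hold-out procedure shown in the diagram, with the 2/3 : 1/3 split; the function name is our own, and `train`/`error` stand for any induction method and error measure.

```python
import random

def out_of_sample_error(data, train, error, train_fraction=2/3, seed=0):
    """Hold-out estimate: train on ~2/3 of the data, test on the remaining ~1/3."""
    rnd = random.Random(seed)
    shuffled = data[:]
    rnd.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    training_data, testing_data = shuffled[:cut], shuffled[cut:]
    model = train(training_data)          # induction method
    return error(model, testing_data)     # error estimation on unseen cases
```

The larger the testing set, the more reliable the estimate, which is the dependence on the number of test cases noted above.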

Page 14

Cross Validation

[Figure: cross-validation flow. Historical data (warehouse) → sampling method → sample data; the sample is split by a sampling method into n samples (Sample 1, Sample 2, ..., Sample n) that are mutually exclusive and of equal size; the induction method and error estimation are iterated over the samples, each run producing a model and the run's error, and the run errors are combined into the overall error estimate.]

10-fold cross validation appears adequate (n = 10).
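A minimal sketch of n-fold cross validation as described in the diagram (our own code; `train` and `error` are placeholders, as in the hold-out sketch above). The folds are mutually exclusive and of nearly equal size, and the run errors are averaged.

```python
import random

def k_fold_error(data, train, error, k=10, seed=0):
    """k-fold cross validation: split into k mutually exclusive folds of
    (nearly) equal size, train on k-1 folds, test on the held-out fold,
    and average the k error estimates."""
    rnd = random.Random(seed)
    shuffled = data[:]
    rnd.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    estimates = []
    for i in range(k):
        testing_data = folds[i]
        training_data = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training_data)
        estimates.append(error(model, testing_data))
    return sum(estimates) / k
```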

Page 15

Evaluation: k-fold cross validation (k = 3)

Given a data set and a method to be evaluated:
• randomly split the data set into 3 subsets of equal size;
• in each run, use 2 of the subsets as training data to find knowledge;
• test on the remaining subset as testing data to evaluate the accuracy;
• average the accuracies as the final evaluation.

[Figure: the data set split into subsets 1, 2, 3; three runs, each training on two of the subsets and testing on the remaining one.]
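For comparison, the same 3-fold procedure can be run with an off-the-shelf toolkit. The snippet below uses scikit-learn and its bundled Iris data set purely as an illustration; it is not the software referred to in the lectures.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=3)
print(scores, scores.mean())   # per-fold accuracies and their average
```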

Page 16

Outline of the presentation

Objectives, Prerequisite and Content

Brief Introduction to Lectures

Discussion and Conclusion

This presentation summarizes the content and organization of the lectures in the module "Knowledge Discovery and Data Mining".