Clustering
Bamshad Mobasher, DePaul University
What is Clustering in Data Mining?
• Cluster: a collection of data objects that
  – are "similar" to one another and thus can be treated collectively as one group
  – but, as a collection, are sufficiently different from other groups
• Clustering is unsupervised classification: there are no predefined classes
• Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
• Helps users understand the natural grouping or structure in a data set
Applications of Cluster Analysis
• Data reduction
  – Summarization: preprocessing for regression, PCA, classification, and association analysis
  – Compression: image processing, e.g., vector quantization
• Hypothesis generation and testing
• Prediction based on groups
  – Cluster, then find characteristics/patterns for each group
• Finding K-nearest neighbors
  – Localizing search to one or a small number of clusters
• Outlier detection: outliers are often viewed as those "far away" from any cluster
Basic Steps to Develop a Clustering Task
• Feature selection / preprocessing
  – Select information relevant to the task of interest
  – Aim for minimal information redundancy
  – May need to normalize/standardize the data
• Distance/similarity measure: quantifies the similarity of two feature vectors
• Clustering criterion: expressed via a cost function or some rules
• Clustering algorithm: choice of algorithm
• Validation of the results
• Interpretation of the results in the context of the application
Quality: What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  – high intra-class similarity: cohesion within clusters
  – low inter-class similarity: distinctiveness between clusters
• The quality of a clustering method depends on
  – the similarity measure used by the method,
  – its implementation, and
  – its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
• Distance/similarity metric
  – Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  – The definitions of distance functions are usually quite different for interval-scaled, Boolean, categorical, ordinal, ratio, and vector variables
  – Weights should be associated with different variables based on the application and data semantics
• Quality of clustering
  – There is usually a separate "quality" function that measures the "goodness" of a cluster
  – It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective
Distance or Similarity Measures
• Common distance measures, for vectors $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$:
  – Manhattan distance: $dist(X, Y) = \sum_i |x_i - y_i|$
  – Euclidean distance: $dist(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}$
  – Cosine similarity: $sim(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}}$, with the corresponding distance $dist(X, Y) = 1 - sim(X, Y)$
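As an illustration (not from the original slides), here is a minimal NumPy sketch of the three measures; the sample vectors reuse terms T1 and T2 from the term-document example later in this deck:

```python
import numpy as np

def manhattan(x, y):
    # Manhattan (city-block) distance: sum of absolute coordinate differences
    return np.abs(x - y).sum()

def euclidean(x, y):
    # Euclidean distance: square root of the sum of squared differences
    return np.sqrt(((x - y) ** 2).sum())

def cosine_sim(x, y):
    # Cosine similarity: dot product normalized by the vector lengths
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([0, 3, 3, 0, 2])   # term T1 from the later example
y = np.array([4, 1, 0, 1, 2])   # term T2
print(manhattan(x, y), euclidean(x, y), 1 - cosine_sim(x, y))
```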
More Similarity Measures
• In the vector-space model many similarity measures can be used in clustering. For vectors X and Y:
  – Simple matching: $sim(X, Y) = \sum_i x_i y_i$
  – Cosine coefficient: $sim(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}}$
  – Dice's coefficient: $sim(X, Y) = \frac{2 \sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2}$
  – Jaccard's coefficient: $sim(X, Y) = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}$
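A short sketch of these coefficients under the same weighted-vector setting (an illustration; simple matching here is the unnormalized dot product, and the squared norms in the denominators follow the formulas above):

```python
import numpy as np

def simple_matching(x, y):
    # Unnormalized dot product of the two weight vectors
    return float(x.dot(y))

def dice(x, y):
    # Dice's coefficient: twice the dot product over the summed squared norms
    return 2 * x.dot(y) / (x.dot(x) + y.dot(y))

def jaccard(x, y):
    # Jaccard's coefficient: dot product over the union-like denominator
    dot = x.dot(y)
    return dot / (x.dot(x) + y.dot(y) - dot)
```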
Distance (Similarity) Matrix
• Based on the distance or similarity measure we can construct a symmetric matrix of distance (or similarity) values
  – the (i, j) entry in the matrix is the distance (or similarity) $d_{ij}$ between items $D_i$ and $D_j$
  – note that $d_{ij} = d_{ji}$, i.e., the matrix is symmetric, so we only need the lower triangle of the matrix
  – the diagonal is all 1's (for similarity) or all 0's (for distance)
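As an aside (not in the original slides), SciPy builds exactly this kind of matrix: pdist returns the lower-triangle entries in condensed form and squareform expands them. The five 3-dimensional items below are made-up data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform  # assumes SciPy is installed

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0],
              [3.0, 3.0, 0.0],
              [1.0, 0.5, 2.0],
              [2.5, 3.0, 0.5]])

# Full symmetric distance matrix with a zero diagonal
D = squareform(pdist(X, metric="euclidean"))
print(np.allclose(D, D.T), np.allclose(np.diag(D), 0))  # True True
```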
Example: Term Similarities in Documents
• Suppose we want to cluster terms that appear in a collection of documents with different frequencies; each term can be viewed as a vector of term frequencies (weights)
• We need to compute a term-term similarity matrix
  – For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity):

        $sim(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \cdot w_{jk})$

    where N is the total number of dimensions (in this case, documents) and $w_{ik}$ is the weight of term i in document k
  – Example term-document matrix:

             T1  T2  T3  T4  T5  T6  T7  T8
    Doc1      0   4   0   0   0   2   1   3
    Doc2      3   1   4   3   1   2   0   1
    Doc3      3   0   0   0   3   0   3   0
    Doc4      0   1   0   3   0   0   2   0
    Doc5      2   2   2   3   1   4   0   2

  – Sim(T1, T2) = <0,3,3,0,2> · <4,1,0,1,2> = 0×4 + 3×1 + 3×0 + 0×1 + 2×2 = 7
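Because each term is a column of the term-document matrix, the entire term-term similarity matrix shown on the next slide is a single matrix product; a small NumPy check (not from the slides) using the slide's data:

```python
import numpy as np

# Term-document matrix from the slide: rows are Doc1..Doc5, columns are T1..T8
W = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]])

S = W.T @ W          # dot-product similarity of every term pair
print(S[0, 1])       # sim(T1, T2) = 7, matching the slide
```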
Example: Term Similarities in Documents (continued)
• Applying $sim(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \cdot w_{jk})$ to the term-document matrix above yields the term-term similarity matrix (lower triangle; the matrix is symmetric):

          T1  T2  T3  T4  T5  T6  T7
    T2     7
    T3    16   8
    T4    15  12  18
    T5    14   3   6   6
    T6    14  18  16  18   6
    T7     9   6   0   6   9   2
    T8     7  17   8   9   3  16   3
Similarity (Distance) Thresholds
• A similarity (distance) threshold may be used to mark pairs that are "sufficiently" similar
• Using a threshold value of 10 on the term-term similarity matrix above yields the binary matrix:

          T1  T2  T3  T4  T5  T6  T7
    T2     0
    T3     1   0
    T4     1   1   1
    T5     1   0   0   0
    T6     1   1   1   1   0
    T7     0   0   0   0   0   0
    T8     0   1   0   0   0   1   0
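Continuing the earlier sketch, the thresholded matrix is one comparison away; the slide's threshold of 10 falls strictly between the observed values 9 and 12, so >= and > behave identically here:

```python
import numpy as np

# S is the 8 x 8 term-term similarity matrix from the earlier sketch
A = (S >= 10).astype(int)   # 1 marks a "sufficiently similar" pair
np.fill_diagonal(A, 0)      # ignore self-similarity on the diagonal
```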
Graph Representation
• The similarity matrix can be visualized as an undirected graph
  – each item is represented by a node, and edges represent the fact that two items are similar (a 1 in the thresholded similarity matrix above)
  – if no threshold is used, the matrix can be represented as a weighted graph

[Figure: graph over nodes T1-T8, with an edge between each pair marked 1 in the thresholded matrix]
Connectivity-Based Clustering Algorithms
• If we are interested only in the threshold (and not the degree of similarity or distance), we can use the graph directly for clustering
• Clique method (complete link)
  – all items within a cluster must be within the similarity threshold of all other items in that cluster
  – clusters may overlap
  – generally produces small but very tight clusters
• Single link method
  – any item in a cluster must be within the similarity threshold of at least one other item in that cluster
  – produces larger but weaker clusters
• Other methods
  – star method: start with an item and place all related items in that cluster
  – string method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
Simple Clustering Algorithms
• Clique method
  – a clique is a completely connected subgraph of a graph
  – in the clique method, each maximal clique in the graph becomes a cluster
• The maximal cliques (and therefore the clusters) in the previous example are:
  {T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}
• Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal.
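A small sketch that recovers the slide's clusters from the thresholded similarity graph; networkx is an assumption here (any library with maximal-clique enumeration would do):

```python
import networkx as nx  # assumed dependency

# Edges are the pairs marked 1 in the thresholded similarity matrix
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")   # T7 has no sufficiently similar neighbor

# Each maximal clique becomes a cluster; note that clusters may overlap
for clique in nx.find_cliques(G):
    print(sorted(clique))
```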
Simple Clustering Algorithms
• Single link method
  1. select an item not yet in a cluster and place it in a new cluster
  2. place all other items similar to it in that cluster
  3. repeat step 2 for each item in the cluster until nothing more can be added
  4. repeat steps 1-3 for each item that remains unclustered
• In this case the single link method produces only two clusters:
  {T1, T3, T4, T5, T6, T2, T8}, {T7}
• Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
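Single-link clusters at a fixed threshold are exactly the connected components of the same graph, so the two clusters above fall out directly (G is the graph built in the clique sketch):

```python
import networkx as nx

for component in nx.connected_components(G):
    print(sorted(component))
# -> ['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T8'] and ['T7']
```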
Major Clustering Approaches
• Partitioning approach:
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  – Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion
  – Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
  – Based on connectivity and density functions
  – Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
  – Based on a multiple-level granularity structure
  – Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (cont.)
• Model-based:
  – A model is hypothesized for each of the clusters; the algorithm tries to find the best fit of the data to the given model
  – Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
  – Based on the analysis of frequent patterns
  – Typical methods: p-Cluster
• User-guided or constraint-based:
  – Clustering by considering user-specified or application-specific constraints
  – Typical methods: COD (obstacles), constrained clustering
• Link-based clustering:
  – Objects are often linked together in various ways; massive links can be used to cluster objects
  – Typical methods: SimRank, LinkClus
Partitioning Approaches
• The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster
  – cluster representatives can be actual items in the cluster, or other "virtual" representatives such as the centroid
  – this methodology reduces the number of similarity computations in clustering
  – clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
• Partitioning methods
  – reallocation method: start with an initial assignment of items to clusters, then move items from cluster to cluster to obtain an improved partitioning
  – single pass method: simple and efficient, but produces large clusters and depends on the order in which items are processed
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to step 2; stop when the assignment no longer changes
K-Means Algorithm
• The basic algorithm (based on the reallocation method):
  1. Select K initial clusters by (possibly) random assignment of some items to clusters and compute each of the cluster centroids.
  2. Compute the similarity of each item x_i to each cluster centroid and (re-)assign each item to the cluster whose centroid is most similar to x_i.
  3. Re-compute the cluster centroids based on the new assignments.
  4. Repeat steps 2 and 3 until there is no change in clusters from one iteration to the next.
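A minimal, self-contained sketch of these steps (an illustration, not the authors' code), using Euclidean distance for the reassignment step:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means over the rows of X (n_items x n_features)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct items as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each item to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                       # Step 4: assignments unchanged, stop
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned items
        for j in range(k):
            if np.any(labels == j):     # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```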
Example: Clustering Documents

Initial (arbitrary) assignment: C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}. The cluster centroids are shown in the last three rows of the document-term matrix:

          T1   T2   T3   T4   T5
    D1     0    3    3    0    2
    D2     4    1    0    1    2
    D3     0    4    0    0    2
    D4     0    3    0    3    3
    D5     0    1    3    0    1
    D6     2    2    0    0    4
    D7     1    0    3    2    0
    D8     3    1    0    0    2
    C1   4/2  4/2  3/2  1/2  4/2
    C2   0/2  7/2  0/2  3/2  5/2
    C3   2/2  3/2  3/2  0/2  5/2
Example: K-Means

Now compute the similarity (or distance) of each item to each cluster centroid, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure):

           D1    D2    D3    D4    D5    D6    D7    D8
    C1   29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
    C2   31/2  20/2  38/2  45/2  12/2  34/2   6/2  17/2
    C3   28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2

For each document, reallocate the document to the cluster to which it has the highest similarity. After the reallocation we have the following new clusters:

    C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment. This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation.
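The similarity matrix above can be checked in a few lines using the slide's data (an illustration, not the authors' code). One caveat: D5 ties between C1 and C3 (17/2 each); the slide resolves the tie in favor of C3, whereas argmax below picks the first maximum:

```python
import numpy as np

# Document-term matrix from the slide (rows D1..D8, columns T1..T5)
D = np.array([[0, 3, 3, 0, 2], [4, 1, 0, 1, 2], [0, 4, 0, 0, 2], [0, 3, 0, 3, 3],
              [0, 1, 3, 0, 1], [2, 2, 0, 0, 4], [1, 0, 3, 2, 0], [3, 1, 0, 0, 2]])

# Centroids of the initial clusters C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}
C = np.array([D[[0, 1]].mean(axis=0),
              D[[2, 3]].mean(axis=0),
              D[[4, 5]].mean(axis=0)])

S = C @ D.T                # 3 x 8 cluster-document similarity (dot products)
print(S[0, 0])             # sim(C1, D1) = 14.5 = 29/2, as on the slide
print(S.argmax(axis=0))    # best cluster index for each document
```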
Example: K-Means (continued)

Current clusters: C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}. Now compute the new cluster centroids, using the original document-term matrix:

          T1    T2   T3   T4    T5
    C1   8/3   2/3  3/3  3/3   4/3
    C2   2/4  12/4  3/4  3/4  11/4
    C3   0/1   1/1  3/1  0/1   1/1

This leads to a new cluster-document similarity matrix, analogous to the previous slide. Again, the items are reallocated to the clusters with the highest similarity:

            D1     D2     D3     D4     D5     D6     D7     D8
    C1    7.67  15.01   5.34   9.00   5.00  12.00   7.67  11.34
    C2   16.75  11.25  17.50  19.50   8.00   6.68   4.25  10.00
    C3   14.00   3.00   6.00   6.00  11.00   9.34   9.00   3.00

New assignment: C1 = {D2, D6, D8}, C2 = {D1, D3, D4}, C3 = {D5, D7}

Note: this process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.
K-Means Algorithm
• Strengths of k-means:
  – Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n
  – Often terminates at a local optimum
• Weaknesses of k-means:
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
• Variations of k-means usually differ in:
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
Single Pass Method
• The basic algorithm:
  1. Assign the first item T1 as the representative for cluster C1
  2. For the next item Ti, calculate its similarity S to the centroid of each existing cluster
  3. If Smax is greater than the threshold value, add the item to the corresponding cluster and recalculate the centroid; otherwise use the item to initiate a new cluster
  4. If any item remains unclustered, go to step 2
• See: Example of Single Pass Clustering Technique
• This algorithm is simple and efficient, but has some problems
  – generally does not produce optimum clusters
  – order dependent: using a different order of processing items will result in a different clustering
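A hedged sketch of the single-pass method, assuming dot-product similarity against running centroids and a caller-supplied threshold:

```python
import numpy as np

def single_pass(items, threshold):
    """Single-pass clustering sketch; items is an (n x f) array of vectors."""
    clusters, centroids = [], []              # member indices, running centroids
    for i, x in enumerate(items):
        if centroids:
            sims = [c @ x for c in centroids]     # similarity to each centroid
            best = int(np.argmax(sims))
            if sims[best] > threshold:
                clusters[best].append(i)          # join the closest cluster
                centroids[best] = items[clusters[best]].mean(axis=0)
                continue
        clusters.append([i])                  # otherwise start a new cluster
        centroids.append(x.astype(float))
    return clusters
```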
Hierarchical Clustering Algorithms
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time
Hierarchical Clustering Algorithms
• Use the distance/similarity matrix as the clustering criterion
  – does not require the number of clusters as input, but needs a termination condition

[Figure: five points a-e; agglomerative clustering merges a+b into ab, c+d into cd, cd+e into cde, and ab+cde into abcde over steps 0-4, while divisive clustering runs the same steps in reverse]
Hierarchical Agglomerative Clustering
• Hierarchical agglomerative methods
  – start with individual items as clusters
  – then successively combine smaller clusters to form larger ones
  – combining clusters requires a method to determine the similarity or distance between two existing clusters
• Some commonly used HACM methods for combining clusters
  – Single link: at each step join the most similar pair of objects that are not yet in the same cluster
  – Complete link: use the least similar pair between each cluster pair to determine inter-cluster similarity; all items within one cluster are linked to each other within a similarity threshold
  – Group average (mean): use the average value of the pairwise links within a cluster to determine inter-cluster similarity (i.e., all objects contribute to inter-cluster similarity)
  – Ward's method: at each step join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares (based on distance between centroids); also called the minimum variance method
Hierarchical Agglomerative Clustering
• Basic procedure
  1. Place each of the N items into a cluster of its own.
  2. Compute all pairwise item-item similarity coefficients (N(N-1)/2 coefficients in total).
  3. Form a new cluster by combining the most similar pair of current clusters i and j:
     – use one of the methods described on the previous slide, e.g., complete link, group average, Ward's, etc.;
     – update the similarity matrix by deleting the rows and columns corresponding to i and j;
     – calculate the entries in the row corresponding to the new cluster i+j.
  4. Repeat step 3 while the number of clusters left is greater than 1.
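This procedure is available off the shelf; for instance, SciPy's linkage implements the single, complete, average, and Ward combination methods over the condensed distance vector of step 2. The data below is illustrative, not from the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).random((10, 4))   # 10 made-up items

dists = pdist(X, metric="euclidean")   # the N(N-1)/2 coefficients of step 2
Z = linkage(dists, method="average")   # agglomerative merging (steps 3-4)

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```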
Hierarchical Agglomerative Clustering: Example

[Figure: nested clusters over six points (1-6) and the corresponding dendrogram; the dendrogram's leaves appear in the order 3, 6, 4, 1, 2, 5, and its vertical axis shows merge distances from roughly 0.05 to 0.4]
Input / Initial Setting
• Start with clusters of individual points and a distance/similarity matrix

[Figure: points p1, p2, p3, p4, p5, ... alongside the corresponding pairwise distance/similarity matrix]
Intermediate State
• After some merging steps, we have some clusters

[Figure: clusters C1-C5 alongside the cluster-level distance/similarity matrix]
Intermediate State
• Merge the two closest clusters (C2 and C5) and update the distance matrix

[Figure: clusters C1-C5 with C2 and C5 highlighted for merging, alongside the cluster-level distance/similarity matrix]
After Merging
• "How do we update the distance matrix?"

[Figure: the merged cluster C2+C5 alongside the updated matrix, whose entries relating C2+C5 to C1, C3, and C4 are marked "?" because they must be recomputed]
Distance between two clusters
• Single-link distance between clusters C_i and C_j is the minimum distance between any object in C_i and any object in C_j
  – the distance is defined by the two most similar objects

    $D_{sl}(C_i, C_j) = \min_{x, y} \{\, d(x, y) \mid x \in C_i,\; y \in C_j \,\}$

The following slides use this pairwise similarity matrix over items I1-I5:

           I1    I2    I3    I4    I5
    I1   1.00  0.90  0.10  0.65  0.20
    I2   0.90  1.00  0.70  0.60  0.50
    I3   0.10  0.70  1.00  0.40  0.30
    I4   0.65  0.60  0.40  1.00  0.80
    I5   0.20  0.50  0.30  0.80  1.00
Distance between two clusters
• Complete-link distance between clusters C_i and C_j is the maximum distance between any object in C_i and any object in C_j
  – the distance is defined by the two least similar objects

    $D_{cl}(C_i, C_j) = \max_{x, y} \{\, d(x, y) \mid x \in C_i,\; y \in C_j \,\}$

(Using the same I1-I5 similarity matrix shown above.)
Distance between two clusters
• Group average distance between clusters C_i and C_j is the average distance between objects in C_i and objects in C_j
  – the distance is defined by the average of the pairwise similarities

    $D_{avg}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y)$

(Again using the I1-I5 similarity matrix; a code sketch of all three inter-cluster distances follows.)
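A small sketch of all three inter-cluster distances, using the I1-I5 matrix from these slides; converting similarities to distances via d = 1 - s is one common choice here, not something the slides mandate:

```python
import numpy as np
from itertools import product

def single_link(D, ci, cj):
    # Minimum pairwise distance between the two clusters
    return min(D[x, y] for x, y in product(ci, cj))

def complete_link(D, ci, cj):
    # Maximum pairwise distance between the two clusters
    return max(D[x, y] for x, y in product(ci, cj))

def group_average(D, ci, cj):
    # Mean of all pairwise distances between the two clusters
    return sum(D[x, y] for x, y in product(ci, cj)) / (len(ci) * len(cj))

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])
D = 1 - S                                  # distances from similarities
print(single_link(D, [0, 1], [2, 3, 4]))   # {I1,I2} vs {I3,I4,I5} -> 0.3
```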
Strengths of single-link clustering
• Can handle non-elliptical shapes

[Figure: original points and the two clusters found by single link]
Limitations of single-link clustering
• Sensitive to noise and outliers
• Produces long, elongated clusters

[Figure: original points and the two clusters found by single link]
Strengths of complete-link clustering
• More balanced clusters (with roughly equal diameter)
• Less susceptible to noise

[Figure: original points and the two clusters found by complete link]
Limitations of complete-link clustering
• Tends to break large clusters
• All clusters tend to have the same diameter; small clusters are merged with larger ones

[Figure: original points and the two clusters found by complete link]
Average-link clustering
• Compromise between single and complete link
• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters
Clustering Application: Collaborative Filtering
• Discovering aggregate profiles
  – Goal: capture "user segments" based on their common behavior or interests
  – Method: cluster user transactions to obtain user segments automatically, then represent each cluster by its centroid
    • aggregate profiles are obtained from each centroid after sorting by weight and filtering out low-weight items in each centroid
  – Profiles are represented as weighted collections of items (pages, products, etc.)
    • weights represent the significance of the item within each cluster
  – Profiles are overlapping, so they capture common interests among different groups/types of users (e.g., customer segments)
Aggregate Profiles: An Example

Original session/user data:

             A  B  C  D  E  F
    user0    1  1  0  0  0  1
    user1    0  0  1  1  0  0
    user2    1  0  0  1  1  0
    user3    1  1  0  0  0  1
    user4    0  0  1  1  0  0
    user5    1  0  0  1  1  0
    user6    1  1  0  0  0  1
    user7    0  0  1  1  0  0
    user8    1  0  1  1  1  0
    user9    0  1  1  0  0  1

Result of clustering:

                         A  B  C  D  E  F
    Cluster 0   user1    0  0  1  1  0  0
                user4    0  0  1  1  0  0
                user7    0  0  1  1  0  0
    Cluster 1   user0    1  1  0  0  0  1
                user3    1  1  0  0  0  1
                user6    1  1  0  0  0  1
                user9    0  1  1  0  0  1
    Cluster 2   user2    1  0  0  1  1  0
                user5    1  0  0  1  1  0
                user8    1  0  1  1  1  0

    PROFILE 0 (Cluster Size = 3)
    1.00 C
    1.00 D

    PROFILE 1 (Cluster Size = 4)
    1.00 B
    1.00 F
    0.75 A
    0.25 C

    PROFILE 2 (Cluster Size = 3)
    1.00 A
    1.00 D
    1.00 E
    0.33 C

Given an active session containing A and B, the best matching profile is Profile 1. This may result in a recommendation for item F, since it appears with high weight in that profile.
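The three profiles can be reproduced by averaging each cluster's rows and dropping low-weight items; the 0.25 cutoff below is an assumption chosen to match the slide's output:

```python
import numpy as np

items = ["A", "B", "C", "D", "E", "F"]
UT = np.array([[1,1,0,0,0,1], [0,0,1,1,0,0], [1,0,0,1,1,0], [1,1,0,0,0,1],
               [0,0,1,1,0,0], [1,0,0,1,1,0], [1,1,0,0,0,1], [0,0,1,1,0,0],
               [1,0,1,1,1,0], [0,1,1,0,0,1]])   # user0..user9 from the slide
clusters = [[1, 4, 7], [0, 3, 6, 9], [2, 5, 8]]  # the clustering result above

for k, members in enumerate(clusters):
    centroid = UT[members].mean(axis=0)
    # Keep items whose centroid weight passes the filtering threshold
    profile = {items[i]: round(w, 2) for i, w in enumerate(centroid) if w >= 0.25}
    print(f"PROFILE {k}:", dict(sorted(profile.items(), key=lambda p: -p[1])))
```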
Web Usage Mining: Clustering Example
• Transaction clusters:
  – Clustering similar user transactions and using the centroid of each cluster as an aggregate usage profile (a representative for a user segment)

Sample cluster centroid from the dept. Web site (cluster size = 330):

    Support  URL                                                         Pageview Description
    1.00     /courses/syllabus.asp?course=450-96-303&q=3&y=2002&id=290   SE 450 Object-Oriented Development class syllabus
    0.97     /people/facultyinfo.asp?id=290                              Web page of the lecturer who taught the above course
    0.88     /programs/                                                  Current Degree Descriptions 2002
    0.85     /programs/courses.asp?depcode=96&deptmne=se&courseid=450    SE 450 course description in the SE program
    0.82     /programs/2002/gradds2002.asp                               M.S. in Distributed Systems program description
Clustering Application: Discovery of Content Profiles
• Content profiles
  – Goal: automatically group together documents that partially deal with similar concepts
  – Method:
    • identify concepts by clustering features (keywords) based on their common occurrences among documents (this can also be done using association discovery or correlation analysis)
    • cluster centroids represent documents in which the features in the cluster appear frequently
  – Content profiles are derived from centroids after filtering out low-weight documents in each centroid
  – Note that each content profile is represented as a collection of item-weight pairs (similar to usage profiles)
    • however, the weight of an item in a profile represents the degree to which the features in the corresponding cluster appear in that item
Content Profiles: An Example (filtering threshold = 0.5)

    PROFILE 0 (Cluster Size = 3)
    1.00 C.html (web, data, mining)
    1.00 D.html (web, data, mining)
    0.67 B.html (data, mining)

    PROFILE 1 (Cluster Size = 4)
    1.00 B.html (business, intelligence, marketing, ecommerce)
    1.00 F.html (business, intelligence, marketing, ecommerce)
    0.75 A.html (business, intelligence, marketing)
    0.50 C.html (marketing, ecommerce)
    0.50 E.html (intelligence, marketing)

    PROFILE 2 (Cluster Size = 3)
    1.00 A.html (search, information, retrieval)
    1.00 E.html (search, information, retrieval)
    0.67 C.html (information, retrieval)
    0.67 D.html (information, retrieval)
User Segments Based on Content
• Essentially combines the usage and content profiling techniques discussed earlier
• Basic idea:
  – for each user/session, extract important features of the selected documents/items
  – based on the global dictionary, create a user-feature matrix
  – each row is a feature vector representing the significant terms associated with documents/items selected by the user in a given session
  – weights can be determined as before (e.g., using the tf.idf measure)
  – next, cluster users/sessions using the features as dimensions
• Profile generation:
  – from the user clusters we can now generate overlapping collections of features based on the cluster centroids
  – the weights associated with the features in each profile represent the significance of that feature for the corresponding group of users
User transaction matrix UT:

             A.html  B.html  C.html  D.html  E.html
    user1       1       0       1       0       1
    user2       1       1       0       0       1
    user3       0       1       1       1       0
    user4       1       0       1       1       1
    user5       1       1       0       0       1
    user6       1       0       1       1       1

Feature-document matrix FP:

                  A.html  B.html  C.html  D.html  E.html
    web              0       0       1       1       1
    data             0       1       1       1       0
    mining           0       1       1       1       0
    business         1       1       0       0       0
    intelligence     1       1       0       0       1
    marketing        1       1       0       0       1
    ecommerce        0       1       1       0       0
    search           1       0       1       0       0
    information      1       0       1       1       1
    retrieval        1       0       1       1       1
Content-Enhanced Transactions

User-feature matrix UF; note that UF = UT × FP^T:

             web  data  mining  business  intelligence  marketing  ecommerce  search  information  retrieval
    user1     2    1      1        1           2            2          1         2         3           3
    user2     1    1      1        2           3            3          1         1         2           2
    user3     2    3      3        1           1            1          2         1         2           2
    user4     3    2      2        1           2            2          1         2         4           4
    user5     1    1      1        2           3            3          1         1         2           2
    user6     3    2      2        1           2            2          1         2         4           4

Example: users 4 and 6 are more interested in concepts related to Web information retrieval, while user 3 is more interested in data mining.
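The user-feature matrix is a one-line computation from the two matrices on the previous slide:

```python
import numpy as np

# UT: users x documents; FP: features x documents (both from the slides)
UT = np.array([[1,0,1,0,1], [1,1,0,0,1], [0,1,1,1,0],
               [1,0,1,1,1], [1,1,0,0,1], [1,0,1,1,1]])
FP = np.array([[0,0,1,1,1],   # web
               [0,1,1,1,0],   # data
               [0,1,1,1,0],   # mining
               [1,1,0,0,0],   # business
               [1,1,0,0,1],   # intelligence
               [1,1,0,0,1],   # marketing
               [0,1,1,0,0],   # ecommerce
               [1,0,1,0,0],   # search
               [1,0,1,1,1],   # information
               [1,0,1,1,1]])  # retrieval

UF = UT @ FP.T    # each row sums the feature vectors of that user's documents
print(UF[0])      # user1 -> [2 1 1 1 2 2 1 2 3 3], matching the slide
```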
Clustering and Collaborative Filtering: Clustering Based on Ratings (MovieLens)

[Figure: example of user clusters derived from MovieLens movie ratings]
Clustering and Collaborative Filtering: Tag Clustering Example

[Figure: example of clustered tags]
Hierarchical Clustering: Example – Clustered Search Results

[Figure: search interface with results grouped into clusters] Users can drill down within clusters to view sub-topics or to view the relevant subset of results.