+ Clustering Very Large Textual Unstructured Customers' Reviews in a Natural Language

Jan Žižka, Karel Burda, František Dařena
Department of Informatics, Faculty of Business and Economics
Mendel University in Brno, Czech Republic

AIMSA 2012


+ Introduction

Many companies collect opinions expressed by their customers.

These opinions can hide valuable knowledge.

Discovering such knowledge manually can be a very demanding task because

the opinion database can be very large,

the customers can use different languages,

people can handle the opinions subjectively,

sometimes additional resources (like lists of positive and negative words) might be needed.

+ Introduction

Our previous research focused on analyzing what was significant for including a certain opinion in one of the categories, like satisfied or dissatisfied customers.

However, this requires having the reviews separated into classes sharing a common opinion/sentiment.

Clustering, the most common form of unsupervised learning, enables automatic grouping of unlabeled documents into subsets called clusters.

+ Objective

The objective is to find out how well a computer can separate the classes expressing a certain opinion, without prior knowledge of the nature of such classes, and to find a clustering algorithm with the best set of parameters: similarity and clustering-criterion functions, word representation, and the role of stemming for the given specific data.


+ Data description

Processed data included reviews of hotel clients collected from publicly available sources

The reviews were labeled as positive and negative

Review characteristics:

more than 5,000,000 reviews

written in more than 25 natural languages

written only by real customers, based on a real experience

written relatively carefully, but still containing errors typical of natural language text

+ Properties of data used for experiments

The subset used in our experiments contained almost two million opinions marked as written in English.

Review category          Positive       Negative
Number of reviews        1,190,949      741,092
Maximal review length    391 words      396 words
Average review length    21.67 words    25.73 words
Variance                 403.34 words   618.47 words

+ Review examples

(Reviews are quoted verbatim, including the customers' spelling errors.)

Positive

The breakfast and the very clean rooms stood out as the best features of this hotel.

Clean and moden, the great loation near station. Friendly reception!

The rooms are new. The breakfast is also great. We had a really nice stay.

Nothing, the hotel is very noisy, no sound insulation whatsoever. Room very small. Shower not nice with a curtain. This is a 2/3 star max.

Negative

High price charged for internet access which actual cost now is extreamly low.

water in the shower did not flow away

The room was noisy and the room temperature was higher than normal.

The train almost running through your room every 10 minutes, the old man at the restaurant was ironic beyond friendly, the food was ok but very German.

+ Data preparation

Data collection, cleaning (removing tags and non-letter characters), converting to upper case (see the sketch after this list)

Removing words shorter than 3 characters

Porter's stemming

Stopword removal, spell checking, diacritics removal, etc. were not carried out

Creating 14 smaller subsets containing positive and negative reviews in the following proportions: 131:144, 229:211, 987:1029, 1031:1085, 2096:2211, 4932:4757, 4832:4757, 7432:7399, 10023:8946, 10251:9352, 15469:14784, 24153:23956, 52146:49986, and 365921:313752
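A minimal sketch of the preprocessing steps above, assuming Python and NLTK's PorterStemmer (the slides do not name the implementation that was actually used):

```python
import re
from nltk.stem import PorterStemmer  # NLTK's implementation of Porter's algorithm

stemmer = PorterStemmer()

def preprocess(review: str) -> list[str]:
    """Clean and stem one review roughly as described on this slide."""
    text = re.sub(r"<[^>]+>", " ", review)   # remove tags
    text = re.sub(r"[^A-Za-z]+", " ", text)  # remove non-letter characters
    tokens = [t for t in text.split() if len(t) >= 3]  # drop words shorter than 3 letters
    # Porter's rules expect lower case; the slides store tokens in upper case
    return [stemmer.stem(t.lower()).upper() for t in tokens]

print(preprocess("Clean and moden, the great loation near station. Friendly reception!"))
# ['CLEAN', 'AND', 'MODEN', 'THE', 'GREAT', 'LOATION', 'NEAR', 'STATION', 'FRIENDLI', 'RECEPT']
```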

+ Experimental steps

Random selection of the desired number of reviews

Transformation of the data into the vector representation

Loading the data into Cluto* and performing clustering

Evaluating the results

* Free software providing different clustering methods that work with several clustering criterion functions and similarity measures, suitable for operating on very large datasets.

+ Clustering algorithm parameters

Clustering algorithm – describes the way objects to be clustered are assigned to individual groups

Available algorithms (a cosine-based stand-in for the first is sketched below):

Cluto's k-means variation – iteratively adapts the positions of k initially randomly generated cluster centroids

Repeated bisection – a sequence of cluster bisections

Graph-based – partitioning a graph representing the objects to be clustered
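CLUTO's k-means variant is internal to the toolkit; as a rough stand-in, assuming scikit-learn, one can L2-normalize the document vectors and run ordinary k-means, since for unit vectors squared Euclidean distance is a monotone function of cosine similarity (‖a − b‖² = 2 − 2 cos(a, b)):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cosine_kmeans(X, k=2, seed=0):
    """Approximate cosine-based k-means: cluster the L2-normalized rows of X."""
    Xn = normalize(X)  # every document vector now has unit length
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xn)

# Toy usage: four documents over a three-term vocabulary
X = np.array([[2.0, 0, 0], [3, 1, 0], [0, 0, 5], [0, 1, 4]])
print(cosine_kmeans(X, k=2))  # two clusters; label numbering is arbitrary
```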

+ Clustering algorithm parameters

Similarity – an important measure affecting the results of clustering, because the objects within one cluster need to be similar while objects from different clusters should be dissimilar

Available similarity/distance measures (written out in code below):

Cosine similarity – measures the cosine of the angle between pairs of vectors representing the documents

Pearson's correlation coefficient – measures the linear correlation between the values of two vectors

Euclidean distance – computes the distance between the points representing documents in the abstract space
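The three measures for a pair of document vectors, written out in NumPy (an illustration only; CLUTO computes them internally):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson_correlation(a, b):
    """Pearson's r is the cosine similarity of the mean-centered vectors."""
    return cosine_similarity(a - a.mean(), b - b.mean())

def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return np.linalg.norm(a - b)

a, b = np.array([1.0, 2.0, 0.0]), np.array([2.0, 3.0, 1.0])
print(cosine_similarity(a, b), pearson_correlation(a, b), euclidean_distance(a, b))
```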

+ Clustering algorithm parameters

Criterion functions – a particular clustering criterion function, defined over the entire clustering solution, is optimized

Internal functions are defined over the documents that are part of each cluster and do not take into account documents assigned to different clusters

External criterion functions derive the clustering solution from how the individual clusters differ from one another

Internal and external functions can be combined into hybrid criterion functions that optimize both simultaneously

Available criterion functions (the two best performers in this study are written out below):

Internal – I1, I2

External – E1, E2

Hybrid – H1, H2

Graph-based – G1
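The slide gives only the names. For orientation, the two functions that performed best in this study are, as defined in the CLUTO literature (Zhao & Karypis; restated here from the toolkit's documentation, which remains the authoritative source):

\[
\mathcal{I}_2 = \sum_{r=1}^{k} \sqrt{\sum_{d_i, d_j \in S_r} \mathrm{sim}(d_i, d_j)},
\qquad
\mathcal{H}_2 = \frac{\mathcal{I}_2}{\mathcal{E}_1},
\]

where $S_r$ is the $r$th cluster, $\mathrm{sim}$ is the chosen similarity function, and $\mathcal{E}_1$ is an external function that grows as clusters become more similar to the whole collection; $\mathcal{I}_2$ and $\mathcal{H}_2$ are maximized.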

+ Clustering algorithm parameters

Document representation – documents are represented using the vector-space model

Vector dimensions – document properties (terms; words, in our experiments)

Vector values (computed in the sketch below):

Term Presence (TP)

Term Frequency (TF)

Term Frequency × Inverse Document Frequency (TF-IDF)

Term Presence × Inverse Document Frequency (TP-IDF)

$\mathrm{idf}(t_i) = \log \frac{N}{n(t_i)}$, where $N$ is the number of documents and $n(t_i)$ is the number of documents containing term $t_i$
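A compact NumPy illustration of the four vector-value schemes, using the slide's idf formula (a toy corpus; real term-document matrices would be sparse):

```python
import numpy as np

# tf[d, t] = raw count of term t in document d (toy 3-document, 4-term corpus)
tf = np.array([[2, 1, 0, 0],
               [0, 1, 3, 0],
               [1, 0, 1, 1]], dtype=float)

tp = (tf > 0).astype(float)     # Term Presence (binary)
N = tf.shape[0]                 # number of documents
n_t = tp.sum(axis=0)            # n(t_i): documents containing each term
idf = np.log(N / n_t)           # idf(t_i) = log(N / n(t_i))

tf_idf = tf * idf               # TF-IDF
tp_idf = tp * idf               # TP-IDF
print(tf_idf)
print(tp_idf)
```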

+ Evaluation of cluster quality

Purity-based measures – measure the extent to which each cluster contains documents from primarily one class

Purity of cluster $S_r$ of size $n_r$:

$P(S_r) = \frac{1}{n_r} \max_i n_r^i$

Purity of the entire solution with $k$ clusters (implemented below):

$\mathrm{Purity} = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r)$

A perfect clustering solution – clusters contain documents from only a single class: Purity = 1
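A small implementation of these two purity formulas (a hypothetical helper, assuming integer class labels):

```python
import numpy as np

def purity(classes, clusters):
    """Purity of a clustering: the majority-class count summed over clusters,
    divided by n (equivalent to sum_r (n_r/n) * P(S_r))."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    total = 0
    for c in np.unique(clusters):
        members = classes[clusters == c]     # documents in cluster S_r
        total += np.bincount(members).max()  # max_i n_r^i
    return total / len(classes)

# Toy usage: two classes (0 = positive, 1 = negative), two clusters
print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 5/6 ≈ 0.833
```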

+ Evaluation of cluster quality

Entropy-based measures – how the various classes of documents are distributed within each cluster

Entropy of cluster $S_r$ of size $n_r$:

$E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r}$,

where $q$ is the number of classes and $n_r^i$ is the number of documents of the $i$th class that were assigned to the $r$th cluster

Entropy of the entire solution with $k$ clusters (implemented below):

$\mathrm{Entropy} = \sum_{r=1}^{k} \frac{n_r}{n} E(S_r)$

A perfect clustering solution – clusters contain documents from only a single class: Entropy = 0
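And the corresponding entropy measure (same assumptions as the purity sketch; the 1/log q factor normalizes values to [0, 1]):

```python
import numpy as np

def weighted_entropy(classes, clusters):
    """Entropy of a clustering solution: sum_r (n_r/n) * E(S_r). Assumes q >= 2."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    n, q = len(classes), len(np.unique(classes))  # documents, classes
    total = 0.0
    for c in np.unique(clusters):
        members = classes[clusters == c]
        p = np.bincount(members, minlength=q) / len(members)  # n_r^i / n_r
        p = p[p > 0]                                          # skip log(0) terms
        total += len(members) / n * (-(p * np.log(p)).sum() / np.log(q))
    return total

print(weighted_entropy([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # ≈ 0.541
```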

+ Results

The best results were achieved by k-means, repeated bisection, and cosine similarity, as demonstrated in the following tables.

A boundary was found, around 10,000 documents, beyond which the entropy value oscillates and does not change much with an increasing number of documents.

IDF weighting had a considerably positive impact on clustering results in comparison with simple TP/TF.

The TF-IDF document representation provided almost the same results as TP-IDF.

+ Results

Using cosine similarity provided the best results, unlike the Euclidean distance and Pearson's correlation coefficient. For example, for the set of documents containing 4,932 positive and 4,745 negative reviews, the entropy was 0.594 for cosine similarity, while the Euclidean distance gave an entropy of 0.740 and Pearson's coefficient 0.838.

The H2 and I2 criterion functions provided the best results. For the I1 criterion function, the entropy of one cluster was very low (less than 0.2); on the other hand, the second cluster's entropy was extremely high.

Stemming applied during the preprocessing phase had no impact on the entropy at all.

+ Weighted entropy

                 K-means                    Repeated bisection
                 TF-IDF       TP-IDF        TF-IDF       TP-IDF
Ratio P:N        I2     H2    I2     H2     I2     H2    I2     H2
131:144          0.792  0.785 0.793  0.741  0.726  0.767 0.774  0.774
229:211          0.694  0.632 0.695  0.627  0.648  0.643 0.650  0.647
987:1029         0.624  0.610 0.618  0.605  0.624  0.609 0.618  0.611
4832:4757        0.601  0.581 0.599  0.579  0.600  0.584 0.598  0.580
7432:7399        0.605  0.596 0.599  0.587  0.605  0.595 0.594  0.586
15469:14784      0.604  0.583 0.598  0.579  0.604  0.582 0.598  0.579
24153:23956      0.597  0.580 0.589  0.572  0.597  0.580 0.589  0.572
52164:49986      0.596  0.582 0.600  0.573  0.604  0.582 0.598  0.574
201346:204716    0.599  0.583 0.592  0.575  0.597  0.583 0.593  0.576
365921:313752    0.602  0.586 0.598  0.584  0.599  0.581 0.598  0.580

+ Percentage ratios of documents in the clusters

Each cell gives the P:N percentage ratio in cluster 0 and cluster 1.

                 K-means I2       K-means H2       Rep. bisection I2   Rep. bisection H2
Ratio P:N        cl. 0   cl. 1    cl. 0   cl. 1    cl. 0   cl. 1       cl. 0   cl. 1
131:144          76:24   24:74    78:24   22:74    75:22   25:76       78:19   22:78
229:211          84:21   16:79    86:18   14:82    84:20   16:80       84:18   16:82
987:1029         80:12   19:87    85:16   14:83    79:11   20:88       85:15   15:84
4832:4757        83:13   17:87    87:15   13:85    83:12   17:87       86:14   14:86
7432:7399        82:12   17:87    85:14   14:85    82:12   17:86       86:14   14:85
15469:14784      80:11   19:89    85:13   15:86    81:10   19:89       85:13   15:87
24153:23956      81:11   19:89    85:13   14:86    81:10   18:89       86:13   14:87
52164:49986      18:89   81:11    15:87   85:13    19:89   80:10       15:87   85:12
201346:204716    82:11   18:88    85:13   15:86    82:11   18:89       15:87   85:12
365921:313752    19:89   80:10    16:88   83:12    80:10   20:90       16:87   84:12

+ Weighted entropy for different data set sizes

[Figure not preserved: weighted entropy plotted against the number of clustered documents.]

+ Conclusions

The goal was to automatically build clusters representing positive and negative opinions, and to find a clustering algorithm with the best set of parameters: similarity measure, clustering-criterion function, word representation, and the role of stemming.

The main focus was on clustering large real-world data in a reasonable time, without applying any sophisticated methods that would increase the computational complexity.

+ Conclusions

The best results were obtained with:

k-means – performed better than the other algorithms and proved to be faster

binary vector representation (TP)

IDF weighting

cosine similarity

the H2 criterion function

Stemming did not improve the results.


+ Future work

Clustering of reviews in other languages

Analysis of “incorrectly” categorized reviews

Clustering smaller units of reviews (e.g., sentences)