Seminar1

+ Mining Textual Significant Expressions Reflecting Opinions in Natural Languages

Jan Žižka

František Dařena

Department of Informatics

Faculty of Business and Economics

Mendel University in Brno

Czech Republic

+ Introduction

Many companies collect opinions expressed by their customers.

These opinions can hide valuable knowledge.

Discovering this knowledge manually can be a very demanding task because:

the opinion database can be very large,

the customers can use different languages,

people can assess the opinions subjectively,

additional resources (such as lists of positive and negative words) might sometimes be needed.

+ Introduction

Text mining can reveal units of text (words, phrases, sentences, etc.) that represent the meaning or sentiment.

Individual words usually do not carry enough information.

Phrases can provide more information, but extracting them through linguistic analysis requires additional knowledge that is specific to each language.

+ Objective

The objective is to find a way for a computer to reveal phrases that express a certain opinion without the demanding and time-consuming linguistic analysis, which differs between natural languages.

+ Data description

The processed data consisted of hotel guest reviews collected from publicly available sources.

The reviews were labeled as positive or negative.

Review characteristics:

more than 5,000,000 reviews,

written in more than 25 natural languages,

written only by real customers, based on real experience,

written relatively carefully but still containing errors that are typical for natural-language text.

+ Review examples

Positive:

The breakfast and the very clean rooms stood out as the best features of this hotel.

Clean and moden, the great loation near station. Friendly reception!

The rooms are new. The breakfast is also great. We had a really nice stay.

Good location - very quiet and good breakfast.

Negative:

High price charged for internet access which actual cost now is extreamly low.

water in the shower did not flow away

The room was noisy and the room temperature was higher than normal.

The air conditioning wasn't working

+ Data preparation

Data collection, cleaning (removing tags and non-letter characters), and conversion to upper case.

Transformation into the bag-of-words representation, with term frequencies (TF) used as attribute values.

Removal of words with a global frequency < 2.

Stemming, stopword removal, spell checking, diacritics removal, etc. were not carried out.
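The preparation steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names `tokenize` and `bag_of_words` are hypothetical, "global frequency" is assumed to mean the total count of a word over the whole collection, and tags are assumed to look like HTML markup.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")        # assumed tag format: <...>
LETTERS_RE = re.compile(r"[^\W\d_]+")  # runs of letters, diacritics included


def tokenize(review):
    """Remove tags and non-letter characters, convert to upper case."""
    text = TAG_RE.sub(" ", review)
    return [word.upper() for word in LETTERS_RE.findall(text)]


def bag_of_words(reviews, min_global_freq=2):
    """Return the vocabulary and one term-frequency Counter per review.

    Words whose frequency over the whole collection is below
    min_global_freq are dropped. No stemming, stopword removal,
    spell checking or diacritics removal is applied.
    """
    token_lists = [tokenize(r) for r in reviews]
    global_freq = Counter(w for tokens in token_lists for w in tokens)
    vocabulary = sorted(w for w, c in global_freq.items() if c >= min_global_freq)
    kept = set(vocabulary)
    vectors = [Counter(w for w in tokens if w in kept) for tokens in token_lists]
    return vocabulary, vectors


reviews = [
    "<p>Good location - very quiet and good breakfast.</p>",
    "The breakfast is also great.",
]
vocabulary, vectors = bag_of_words(reviews)
```

In this toy collection only GOOD and BREAKFAST occur at least twice, so they form the whole vocabulary; misspelled words such as "loation" would typically be dropped by the same frequency filter unless the same misspelling recurs.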

+ Data characteristics – number of reviews

[Figure: bar chart of the number of positive and negative reviews for English, French, Spanish, German, Italian and Czech; vertical axis: number of reviews, 0 to 1,200,000.]

+ Data characteristics – dictionary sizes

[Figure: bar chart of the number of unique words for English, German, French, Spanish, Italian and Czech, shown for MinTF=1 and MinTF=2; vertical axis: number of unique words, 0 to 250,000.]

+ Finding significant words

Thanks to the large collection of labeled examples, a classifier that separates positive and negative reviews could be created.

To reveal significant attributes (words), a decision tree was built using the tree-generating algorithm c5, which is based on entropy minimization.

The goal was not to achieve the best classification accuracy but to find the relevant attributes that contribute to assigning a text to a given class.

The significant words appeared in the nodes of the decision tree.
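To illustrate what "entropy minimization" means for word selection, the sketch below ranks words by information gain, i.e. by how much knowing whether a word is present reduces the entropy of the class labels. This is a simplified stand-in and not the c5 algorithm itself: it scores each word once on the whole collection instead of growing a tree, and it uses word presence rather than term-frequency values. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(documents, labels, word):
    """Entropy reduction obtained by splitting on presence of `word`.

    `documents` is a list of sets of words, `labels` the matching classes.
    """
    with_word = [lab for doc, lab in zip(documents, labels) if word in doc]
    without_word = [lab for doc, lab in zip(documents, labels) if word not in doc]
    n = len(labels)
    remainder = (len(with_word) / n) * entropy(with_word) \
        + (len(without_word) / n) * entropy(without_word)
    return entropy(labels) - remainder


def rank_words(documents, labels):
    """Return (word, gain) pairs, highest gain first, ties alphabetical."""
    vocabulary = set().union(*documents)
    gains = {w: information_gain(documents, labels, w) for w in vocabulary}
    return sorted(gains.items(), key=lambda kv: (-kv[1], kv[0]))


documents = [
    {"GOOD", "LOCATION"},
    {"GOOD", "BREAKFAST"},
    {"NOISY", "ROOM"},
    {"NOISY", "SHOWER"},
]
labels = ["positive", "positive", "negative", "negative"]
ranked = rank_words(documents, labels)
```

In this toy data GOOD and NOISY separate the two classes perfectly and therefore rank first; a tree-building algorithm would place such words in its top nodes.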

+ Finding the significant words

The classification accuracy, which is proportional to the relevance of the words, was between 89.5% and 92.5%.

The decision tree provided a list of about 200–300 words that are significant for classification from the sentiment perspective.

These words are used as the basis for extracting significant expressions, which avoids having to consider all possible combinations of words.

+ Extracting significant expressions

The extraction of significant expressions starts from the list of significant words: the reviews are searched in the proximity of these words.

Parameters of the significant-expression extraction algorithm:

D – the distance from a significant word within which the search is carried out,

N – the number of words forming a significant expression,

M – the minimal number of occurrences of a specific group of words.
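One possible reading of the three parameters is sketched below. The slides do not spell out the exact procedure, so several details are assumptions: the window extends D words to both sides of a significant word, an expression is any N-word group from that window that contains the significant word (kept in text order), and each expression is counted at most once per review. The function name `extract_expressions` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_expressions(reviews, significant_words, D=3, N=3, M=2):
    """Count N-word groups found within distance D of a significant word.

    `reviews` is a list of token lists. Returns a dict mapping each
    expression (a tuple of words) to the number of reviews containing it,
    restricted to expressions that occur in at least M reviews.
    """
    counts = Counter()
    for tokens in reviews:
        seen = set()  # assumption: count each expression once per review
        for i, word in enumerate(tokens):
            if word not in significant_words:
                continue
            lo, hi = max(0, i - D), min(len(tokens), i + D + 1)
            neighbours = [j for j in range(lo, hi) if j != i]
            # choose the remaining N-1 words from the window
            for others in combinations(neighbours, N - 1):
                positions = sorted(others + (i,))
                seen.add(tuple(tokens[p] for p in positions))
        counts.update(seen)
    return {expr: c for expr, c in counts.items() if c >= M}


reviews = [
    ["GOOD", "LOCATION", "VERY", "QUIET"],
    ["VERY", "GOOD", "CENTRAL", "LOCATION"],
]
found = extract_expressions(reviews, {"GOOD"}, D=2, N=2, M=2)
```

Here the pair (GOOD, LOCATION) is found in both reviews even though the two words are adjacent in the first and one word apart in the second, so it is the only group that reaches M = 2.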

+ An example

Searching for significant expressions in a review with the algorithm parameters D = 3, N = 3.

+ Results

Lists of significant expressions extracted from the original text reviews were obtained.

The expressions then need to be assessed by people.

+ Significant expressions for English

+ Significant expressions for German

+ Significant expressions for Spanish

+ Significant expressions for Spanish

+ Discussion

Some of the significant expressions were very similar to each other.

Most of the significant expressions were meaningful and potentially useful for the target audience.

Some of the expressions were, naturally, not useful at all.

It is necessary to find a trade-off between the size of the expressions, the length of the text span in which the search is carried out, and the informative value of the expressions.

+ Discussion

Examples of different distances between the words forming the same significant expression "good location"

+ Discussion

However, the same expression can also be formed from words that come from different contexts:

“... Breakfast was really good. The location is a little out of the center ...”

or

“Good service. Convenient location”

or

“It is a quiet location for a good nights sleep”

+ Handling large collections

For languages with a large number of reviews, the datasets were randomly split into subsets of 50,000 reviews because of memory requirements, and a decision tree was created for each subset.

Each of the 50,000-review subsets gave almost the same list of words.

The relevancies of the extracted words were averaged.
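The splitting and averaging can be sketched as follows. The helper names are hypothetical, and two details are assumptions that the slides do not state: the last subset may be smaller than the others, and a word that is missing from a subset's list contributes a relevancy of 0 for that subset.

```python
import random
from collections import defaultdict


def split_into_subsets(items, subset_size=50_000, seed=0):
    """Shuffle the items and cut them into consecutive subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + subset_size]
            for i in range(0, len(shuffled), subset_size)]


def average_relevancies(per_subset_scores):
    """Average word relevancies over all subsets.

    `per_subset_scores` holds one {word: relevancy} dict per subset;
    a word absent from a subset is treated as having relevancy 0 there.
    """
    totals = defaultdict(float)
    for scores in per_subset_scores:
        for word, score in scores.items():
            totals[word] += score
    k = len(per_subset_scores)
    return {word: total / k for word, total in totals.items()}


subsets = split_into_subsets(range(10), subset_size=4, seed=1)
averaged = average_relevancies([
    {"GOOD": 0.5, "NOISY": 0.25},
    {"GOOD": 1.0},
])
```

In the real setting each subset would first be passed through the decision-tree step to obtain its own word relevancies, and only those per-subset results would be averaged.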

+ Conclusions

A procedure was presented that applies machine learning and natural language processing to find significant expressions automatically.

Of the total number of words (80,000–200,000), only about 200–300 were identified as significant and used as the basis for expression extraction.

The simple, unified procedure worked well for many languages.

Follow-up research focuses on the preprocessing phase (e.g. eliminating meaningless words).

The procedure might be used in marketing research or marketing intelligence, for filtering reviews, generating keyword lists, etc.

Thank you for your attention

Vielen Dank für Ihre Aufmerksamkeit

Gracias por vuestra atención

Merci de votre attention

Grazie per la vostra attenzione

Děkuji za vaši pozornost
