Opinion Mining on the Web 2.0: Characteristics of User Generated Content and Their Impacts. ITEC 547 Text Mining. Ass. Professor: Nazife Dimililer. Name: Feras Allababidi. ID: 145416


1

Opinion Mining on the Web 2.0

Characteristics of User Generated Content and Their Impacts

ITEC 547 Text Mining

Ass. Professor: Nazife Dimililer

Name: Feras Allababidi

ID: 145416

2

Introduction

Opinion mining is the analysis of people’s opinions, sentiments, attitudes and emotions.

3

Web 2.0 Made it Easy

Because Web 2.0 enabled user generated content, people now express their opinions on almost everything on the web, and it has become important to understand and analyze that feedback.

4

Importance

Why: to answer new geopolitical, social and business-related questions.

Examples..

5

Challenges

Besides the typical challenges known from natural language processing and text processing, opinion mining faces several challenges of its own:

Noisy texts: User generated content in social media tends to be informally written, less grammatically correct and prone to spelling mistakes. These texts often use emoticons, abbreviations or unorthodox capitalization.

Language variations: Texts in user generated content often contain irony and sarcasm; they lack contextual information but assume implicit knowledge about a specific topic.
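
The noise characteristics listed above can be turned into simple surface features. The following is a minimal sketch, not the procedure used in the paper: the emoticon pattern, the tiny slang list and the function name `noise_features` are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately small resources; a real system would use larger lexicons.
EMOTICON_RE = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|]")
SLANG = {"lol", "omg", "btw", "thx", "imo"}
ELONGATION_RE = re.compile(r"(\w)\1{2,}")  # same character three or more times, e.g. "sooo"


def noise_features(text: str) -> dict:
    """Return a few surface indicators of informal, noisy writing."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return {
        "n_tokens": len(tokens),
        "has_emoticon": bool(EMOTICON_RE.search(text)),
        "has_slang": any(t.lower() in SLANG for t in tokens),
        "has_elongation": bool(ELONGATION_RE.search(text)),
        # Count fully capitalized words longer than one letter ("GREAT", not "I").
        "n_all_caps": sum(1 for t in tokens if len(t) > 1 and t.isupper()),
    }


noisy = noise_features("OMG the new phone is sooo GREAT :)")
clean = noise_features("The battery lasts two days.")
```

Here `noisy` flags an emoticon, slang, an elongated word and two all-caps tokens, while `clean` flags none of them. Irony and sarcasm cannot be detected with surface features of this kind.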

6

Challenges

Relevance and boilerplate: Relevant content on webpages is usually surrounded by irrelevant elements such as advertisements, navigational components or previews of other articles; discussions and comment threads can also drift to unrelated topics.

Target identification: Search-based approaches to opinion mining often face the problem that the topic of the retrieved document does not necessarily match the mentioned object.

Complexity of opinions and the rate at which they change.

7

Structure

Opinion mining has been investigated mainly at three different levels:

1. Document level

2. Sentence level

3. Entity/aspect-level

8

Structure cont.

An opinion is defined as a quintuple (E_i, A_ij, S_ijkl, H_k, T_l)

E_i: Name of the entity

A_ij: Aspect j of entity i

S_ijkl: Sentiment on aspect A_ij, expressed by holder H_k at time T_l (positive, negative, or neutral)

H_k: Opinion holder

T_l: Time at which the opinion was expressed
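
As an illustration, the quintuple maps naturally onto a small record type. This is only a sketch: the class and field names, the `"GENERAL"` aspect convention for entity-level opinions, and the example values are assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class Opinion:
    entity: str           # E_i: name of the entity
    aspect: str           # A_ij: aspect of the entity ("GENERAL" here means entity-level)
    sentiment: Sentiment  # S_ijkl: sentiment on the aspect
    holder: str           # H_k: opinion holder
    time: date            # T_l: time the opinion was expressed


# Made-up example values, for illustration only.
example = Opinion(
    entity="smartphone X",
    aspect="battery life",
    sentiment=Sentiment.NEGATIVE,
    holder="user_42",
    time=date(2012, 5, 1),
)
```

Document-level and sentence-level analysis fill in only the sentiment for a whole text or sentence, while aspect-level analysis aims to recover all five fields.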

9

Technical Approaches

Sentiment classification

Feature-based opinion mining (or aspect-based opinion mining)

Comparison-based opinion mining

10

Paper Objective

The objective of this paper is to investigate the differences between social media channels and to discuss the impact of their characteristics on opinion mining approaches.

11

Paper Methodology

Identify the most popular approaches to opinion mining in the scientific literature and their underlying principles for detecting and analyzing text.

Identify and deduce criteria from the literature that expose differences between the kinds of social media sources and may affect the quality of opinion mining.

Conduct an empirical analysis based on the deduced criteria to determine the differences between several social media channels:

Social network services (Facebook)

Microblogs (Twitter)

Comments on weblogs

Product reviews (Amazon and other product review sites).

In the last step, correlate the social media source types with applicable opinion mining approaches based on their respective characteristics.

12

Algorithms Used

Supervised learning

Unsupervised learning

Partially supervised learning

Latent variable models (Hidden Markov Model, HMM)

Conditional Random Fields (CRF)

Latent Semantic Association (LSA)

Pointwise Mutual Information (PMI)
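
Of the techniques listed, PMI is the simplest to show in a few lines. It compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The sketch below estimates the probabilities from document-level co-occurrence counts; the function name and the toy corpus are assumptions made for illustration, not data from the paper.

```python
import math


def pmi(word_x: str, word_y: str, documents: list[list[str]]) -> float:
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))), estimated per document."""
    n = len(documents)
    count_x = sum(1 for doc in documents if word_x in doc)
    count_y = sum(1 for doc in documents if word_y in doc)
    count_xy = sum(1 for doc in documents if word_x in doc and word_y in doc)
    if count_xy == 0:
        # The words never co-occur (or one of them never occurs at all).
        return float("-inf")
    return math.log2((count_xy / n) / ((count_x / n) * (count_y / n)))


# Toy corpus of tokenized postings (made up for this example).
toy_corpus = [
    ["screen", "excellent"],
    ["screen", "excellent", "battery"],
    ["battery", "poor"],
    ["battery", "poor", "screen"],
]

screen_excellent = pmi("screen", "excellent", toy_corpus)
screen_poor = pmi("screen", "poor", toy_corpus)
```

In this toy corpus "screen" is positively associated with "excellent" (PMI above zero) and negatively associated with "poor" (PMI below zero). Comparing a word's association with known positive and known negative words in this way is one common use of PMI in unsupervised sentiment classification.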

17

Empirical Analysis

Focused on a specific brand (Samsung)

Time period: June 15, 2011 to January 28, 2013

Data labeled manually by four different human labelers

Sources were taken in four different languages

Number of sources from each channel:

Facebook: 410 postings, collected using the API

Twitter: 287 tweets, collected using the API

Blogs: 387 blog posts

Discussion forums: 417 posts from 4 different forums, collected manually

Product reviews: 433 reviews (from Amazon and two product review pages), collected using a web crawler

18

Evaluation Criteria

19

Results of Survey: Facebook

Length of postings: 19 words, compared with 119 in product reviews

Emoticons and Internet slang: Emoticon usage is the highest at 27.8%, while slang usage is, surprisingly, the lowest at only 8.3%

Grammatical and orthographical correctness: Second-highest error ratio, at 42%

Aspects and details: 33% of postings contain one or more aspects. Most postings (65.4%) are at entity level.

Subjectivity: Lowest subjectivity, with 67.3% subjective and 26.1% objective

Opinion holder: Between 95% and 97.6% of postings reveal the author as the opinion holder

Topic relatedness: Lowest, at 82.3%. 1.1% contain both topic-related and unrelated content

20

Results of Survey: Twitter

Length of postings: Shortest, at 14 words (the longest, product reviews, reach 119)

Emoticons and Internet slang: Emoticon usage is the second lowest at 24.4%, while slang usage is the highest at 20.2%

Grammatical and orthographical correctness: Highest error ratio, at 48.8%

Aspects and details: 60.6% of tweets contain one or more aspects. 56.6% are at entity level

Subjectivity: Highest subjectivity, with 82.9% subjective and 12.8% objective

Opinion holder: Between 95% and 97.6% of postings reveal the author as the opinion holder

Topic relatedness: 95.3% topic-related. 0% contain both topic-related and unrelated content

21

Results of Survey: Blogs

Emoticons and Internet slang: Emoticon usage is the second highest at 27.6%, very close to Facebook, while slang usage is 12.8%, higher than on Facebook

Grammatical and orthographical correctness: Lowest error ratio, at 35.4%

Aspects and details: 55.3% of postings go into detail. 5.6% contain aspects as well as entity-level opinions.

Subjectivity: 69.3% subjective and 19.6% objective

Opinion holder: Between 95% and 97.6% of postings reveal the author as the opinion holder

Topic relatedness: 92.6% topic-related. 1.1% contain both topic-related and unrelated content

22

Results of Survey: Product Reviews

Length of postings: Longest, at 119 words

Emoticons and Internet slang: Emoticon usage is the lowest at only 20.1%, while slang usage is 12.8%, higher than on Facebook.

Grammatical and orthographical correctness: Second-lowest error ratio, at 37.2%

Aspects and details: 39.6% of product reviews go into detail, and 27.0% contain aspects as well as entity-level opinions

Subjectivity: 71.7% subjective, 2.9% objective and 25.4% both

Opinion holder: In 90% of reviews the author is the opinion holder

Topic relatedness: 93.1% topic-related. 5.8% contain both topic-related and unrelated content

23

Impact on Opinion Mining

Blogs

Many research papers that focus on blogs do not explain how comments on the blog posts are taken into consideration.

Depending on the type of blog (corporate blog vs. j-blog), both the blog postings and the blog comments can be interesting sources for opinion mining.

24

Impact on Opinion Mining

Product review:

Several researchers proposed models to identify aspects and sentiments.

Few assume that all of the words in a sentence cover one single topic.

Social Network (Facebook): Because users interact with each other and respond to questions, and because postings contain many grammatical mistakes, the challenges are similar to those of discussion forums. More research work is required.

25

Impact on Opinion Mining

Microblog (Twitter): Many grammatical errors, short sentences, and heavy use of hashtags and other abbreviations.

Researchers mainly use supervised or semi-supervised learning.

Davidov et al. use Twitter characteristics and language conventions as features.

Zhang et al. combine lexicon-based and learning-based methods for Twitter sentiment analysis.

The usage of part-of-speech features does not seem to be useful in the microblogging domain.

26

Further Research

Further research work should:

(i) Measure and compare the actual implications of the characteristics of social media for the performance of the different opinion mining approaches.

(ii) Explore alternative (statistical / mathematical) approaches.

28

The End

Thank you for listening…

Any questions?