
Cross-domain Sentiment Classification: Resource Selection and Algorithms

Natalia Ponomareva

Statistical Cybermetrics Research Group, University of Wolverhampton, UK

December 17, 2011

Outline

1 Background: Introduction; State-of-the-art research

2 Preliminary experiments: In-domain study; Cross-domain experiments

3 Modeling accuracy loss for cross-domain SC: Domain similarity; Domain complexity; Model construction and validation

4 Graph-based algorithms: Comparison; Document similarity; Strategy for choosing the best parameters

What is Sentiment Classification?

A task within the research field of Sentiment Analysis.

It concerns the classification of documents on the basis of the overall sentiment expressed by their authors.

Different scales can be used:

positive/negative;
positive, negative and neutral;
rating: 1*, 2*, 3*, 4*, 5*.

Example

"The film was fun and I enjoyed it." ⇒ positive
"The film lasted too long and I got bored." ⇒ negative

Applications

Business Intelligence
Event prediction
Opinion search

Why challenging?

Irony, humour.

Example
If you are reading this because it is your darling fragrance, please wear it at home exclusively and tape the windows shut.

Generally positive words.

Example
This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it cannot hold up.

Context dependency.

Example
This is a great camera.
A great amount of money was spent for promoting this camera.
One might think this is a great camera. Well think again, because.....

Rejection or advice?

Example
Go read the book.

Approaches to Sentiment Classification

Lexical approaches

Supervised machine learning

Semi-supervised and unsupervised approaches

Cross-domain Sentiment Classification (SC)

Lexical approaches

Use of dictionaries of sentiment words with a given semantic orientation.

Dictionaries are built either manually or (semi-)automatically.

A special scoring function is applied in order to calculate the final semantic orientation of a text.

Example
lightweight +3, good +4, ridiculous −2
"Lightweight, stores a ridiculous amount of books and good battery life."
SO₁ = (3 + 4 − 2) / 3 = 5/3 ≈ 1.67
SO₂ = max{|3|, |4|, |−2|} · sign(max{|3|, |4|, |−2|}) = 4
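To make the two scoring functions above concrete, here is a minimal sketch of a dictionary-based scorer. The tiny lexicon and the naive tokenisation are illustrative assumptions; a real system would use a full sentiment dictionary.

```python
# Minimal sketch of a lexical (dictionary-based) sentiment scorer.
# The lexicon below is a toy example; real systems use full dictionaries.
LEXICON = {"lightweight": 3, "good": 4, "ridiculous": -2}

def semantic_orientation(text, aggregate="average"):
    """Score a text by looking up its words in a sentiment lexicon."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not scores:
        return 0.0
    if aggregate == "average":              # SO1: mean of the matched scores
        return sum(scores) / len(scores)
    if aggregate == "max_magnitude":        # SO2: score with the largest |value|
        return max(scores, key=abs)
    raise ValueError(aggregate)

review = "Lightweight, stores a ridiculous amount of books and good battery life."
cleaned = review.replace(",", " ").replace(".", " ")   # strip punctuation first
print(semantic_orientation(cleaned, "average"))        # 5/3 ~ 1.67
print(semantic_orientation(cleaned, "max_magnitude"))  # 4
```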

Supervised Machine Learning

Learn sentiment phenomena from an annotated corpus.

Different machine learning methods have been tested (NB, SVM, ME). In the majority of cases SVM demonstrates the best performance.

For review data the ML approach performs better than the lexical one when training and test data belong to the same domain.

But it needs a substantial amount of annotated data.
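The talk does not name a toolkit; as a rough illustration of this supervised setup (binary unigram+bigram features with a linear SVM, matching the in-domain findings later in the deck), here is a minimal scikit-learn sketch on toy data.

```python
# Hedged sketch of the supervised pipeline: binary bag-of-words + linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled reviews; a real experiment would use the 2000 reviews per domain.
train_texts = ["I highly recommend this book", "A complete waste of money",
               "Excellent and inspiring read", "Boring and poorly written"]
train_labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),  # binary unigram+bigram features
    LinearSVC(),                                        # linear SVM classifier
)
model.fit(train_texts, train_labels)
print(model.predict(["waste of time, poorly made", "my favorite book, excellent"]))
```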

Semi-supervised and unsupervised approaches

Require a small amount of annotated data or no data at all.

Different techniques have been exploited:

Automatic extraction of sentiment words from the Web using seed words (Turney, 2002).
Exploiting spectral clustering and active learning (Dasgupta et al., 2009).
Applying co-training (Li et al., 2010).
Bootstrapping (Zagibalov, 2010).
Using graph-based algorithms (Goldberg et al., 2006).

Cross-domain SC

Main approaches:

Ensemble of classifiers (Read, 2005; Aue and Gamon, 2005);

Structural Correspondence Learning (Blitzer, 2007);

Graph-based algorithms (Wu, 2009).

Ensemble of classifiers

Classifiers are learned on data belonging to different source domains.

Various methods can be used to combine classifiers:

Majority voting;

Weighted voting, where a development data set is used to learn credibility weights for each classifier;

Learning a meta-classifier on a small amount of target-domain data.

Structural Correspondence Learning

Blitzer et al., 2007:

Introduce pivot features that appear frequently in both source and target domains.

Find projections of source features that co-occur with pivots in a target domain.

Example
The laptop is great, it is extremely fast.
The book is great, it is very engaging.
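Structural Correspondence Learning additionally trains a linear predictor for each pivot and applies SVD to the learned weights; the sketch below only illustrates the first step, a simplified pivot selection by frequency in both domains (a rough approximation, not Blitzer et al.'s full method).

```python
# Simplified pivot selection: features that are frequent in BOTH domains.
# This is only the first step of SCL; the full method also trains a predictor
# for each pivot and applies SVD to the resulting weight matrix.
from collections import Counter

def tokens(docs):
    return [w for d in docs for w in d.lower().split()]

source = ["the laptop is great , it is extremely fast",
          "battery life is great and it is light"]
target = ["the book is great , it is very engaging",
          "the plot is great and the writing is superb"]

src_counts, tgt_counts = Counter(tokens(source)), Counter(tokens(target))

min_count = 2  # frequency threshold; tiny here because the corpora are toys
pivots = [w for w in src_counts
          if src_counts[w] >= min_count and tgt_counts[w] >= min_count]
print(sorted(pivots))   # ['great', 'is'] for these toy corpora
```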

Discussion

Machine learning methods demonstrate very good performance, and when the size of the data is substantial they outperform lexical approaches.

On the other hand, there is a plethora of annotated resources on the Web, and the possibility to re-use them would be very beneficial.

Structural Correspondence Learning and similar approaches are good for binary classification but difficult to apply to the multi-class problem.

That motivates us to exploit graph-based cross-domain algorithms.

Data

The corpus consists of Amazon product reviews on 7 different topics: books (BO), electronics (EL), kitchen & housewares (KI), DVDs (DV), music (MU), health & personal care (HE) and toys & games (TO).

Reviews are rated either as positive or negative.

Data within each domain are balanced: they contain 1000 positive and 1000 negative reviews.

Data statistics

corpus   num words   mean words   vocab size   vocab size (>= 3)
BO       364k        181.8        23k          8 256
DV       397k        198.7        24k          8 632
MU       300k        150.1        19k          6 163
EL       236k        117.9        12k          4 465
KI       198k        98.9         11k          4 053
TO       206k        102.9        11k          4 018
HE       188k        93.9         11k          4 022

BO, DV, MU: longer reviews, richer vocabularies.

Feature selection

We compared several characteristics of features:

words vs. stems and lemmas;

unigrams vs. unigrams + bigrams;

binary weights vs. frequency, idf and tfidf;

features filtered by presence of verbs, adjectives, adverbs and modal verbs vs. unfiltered features.
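To make the compared configurations concrete, here is a small sketch of how they could be instantiated with scikit-learn vectorizers (an assumed toolkit; the stem/lemma variants and POS-based filtering would additionally need a tagger or stemmer and are omitted).

```python
# Sketch of several of the feature configurations compared above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I highly recommend this book", "A complete waste of money"]

configs = {
    "unigrams, binary":       CountVectorizer(ngram_range=(1, 1), binary=True),
    "uni+bigrams, binary":    CountVectorizer(ngram_range=(1, 2), binary=True),
    "uni+bigrams, frequency": CountVectorizer(ngram_range=(1, 2)),
    "uni+bigrams, tfidf":     TfidfVectorizer(ngram_range=(1, 2)),
    "uni+bigrams, idf":       TfidfVectorizer(ngram_range=(1, 2), binary=True),
}
for name, vec in configs.items():
    X = vec.fit_transform(docs)          # document-feature matrix
    print(f"{name:24s} -> {X.shape[1]} features")
```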

Feature selection

Filtering of features worsens the accuracy for all domains.

Unigrams + bigrams generally perform significantly better than unigrams alone.

Binary, idf and delta idf weights generally give better results than frequency, tfidf and delta tfidf weights.

Feature selection

domain   features preference      confidence interval, α = 0.01
BO       word ≈ lemma ≈ stem      inside
DV       word ≈ lemma ≈ stem      inside
MU       lemma > stem > word      boundary
EL       word > lemma ≈ stem      inside
KI       word ≈ lemma > stem      inside
TO       word ≈ stem > lemma      boundary
HE       stem > lemma > word      inside

10 most discriminative positive features

BO: highly recommend, concise, for anyone, i highly, excellent, my favorite, unique, inspiring, must read, and also
EL: plenty, plenty of, highly recommend, highly, ps NUM, please with, very happy, beat, glad, well as
KI: perfect for, be perfect, favorite, highly recommend, fiestaware, be easy, easy to, perfect, eliminate, easy
DV: album, magnificent, superb, debut, wolf, join, charlie, love it, highly recommend, rare

10 most discriminative negative features

BO: poorly, disappointing, waste of, your money, waste, annoying, bunch, boring, bunch of, to finish
EL: refund, repair, do not buy, waste of, waste, defective, forum, junk, stop work, worst
KI: waste of, return it, it break, refund, to return, waste, return, very disappoint, worst, I return
DV: your money, so bad, ridiculous, waste of, waste, worst movie, pointless, talk and, pathetic, horrible

Results

(Figure: in-domain results.)

Results for cross-domain SC

(Figure panels: Accuracy; Accuracy drop.)

Motivation

Usually cross-domain algorithms do not work well for very different source and target domains.

Combinations of classifiers from different domains in some cases perform much worse than a single classifier trained on the closest domain (Blitzer et al., 2007).

Finding the closest domain can help to improve the results of cross-domain sentiment classification.

How to compare data sets?

Machine-learning techniques are based on the assumption that training and test data are drawn from the same probability distribution, and, therefore, they perform much better when training and test data sets are alike.

The task of finding the best training data transforms into the task of finding data whose feature distribution is similar to the test one.

We propose two characteristics to model accuracy loss: domain similarity and domain complexity or, more precisely, domain complexity variance.

Domain similarity approximates the similarity between distributions for frequent features.

Domain complexity compares the tails of the distributions.

Domain similarity

We are not interested in all terms but rather in those bearing sentiment.

Studies on SA suggest that adjectives, verbs and adverbs are the main indicators of sentiment, so we keep only unigrams and bigrams that contain those POS as features.

We compare different weighting schemes (frequencies, TF-IDF and IDF) to compute corpus similarity.

Measures of domain similarity

χ² is taken from Corpus Linguistics, where it was demonstrated to have the best correlation with the gold standard.

Kullback–Leibler divergence (DKL) and its symmetric analogue, Jensen–Shannon divergence (DJS), were borrowed from Information Theory.

The Jaccard coefficient (Jaccard) and cosine similarity (cosine) are well-known similarity measures.
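A sketch of how these measures could be computed from two relative-frequency distributions over a shared vocabulary. The exact variants used in the experiments (feature filtering, the normalisation behind χ²_inv, smoothing) are not specified here, so the implementations below are simplified assumptions.

```python
# Hedged sketch of the domain-similarity measures over two term distributions.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def chi2(p, q, eps=1e-12):
    e = 0.5 * (p + q) + eps            # expected frequency under pooling
    return float(np.sum((p - e) ** 2 / e) + np.sum((q - e) ** 2 / e))

def cosine(p, q):
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def jaccard(p, q):
    a, b = p > 0, q > 0                # presence/absence of each term
    return float(np.logical_and(a, b).sum() / np.logical_or(a, b).sum())

# Toy relative frequencies for a shared 5-term vocabulary.
books = np.array([0.30, 0.20, 0.40, 0.00, 0.10])
electronics = np.array([0.25, 0.10, 0.05, 0.35, 0.25])

for name, fn in [("chi2", chi2), ("KL", kl), ("JS", js),
                 ("cosine", cosine), ("Jaccard", jaccard)]:
    print(f"{name:8s} {fn(books, electronics):.3f}")
```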

Correlation for different domain similarity measures

Table: Correlation with accuracy drop

measure   R (freq)   R (filtr., freq)   R (filtr., TFIDF)   R (filtr., IDF)
cosine    -0.790     -0.840             -0.836              -0.863
Jaccard   -0.869     -0.879             -0.879              -0.879
χ²        0.855      0.869              0.876               0.879
DKL       0.734      0.827              0.676               0.796
DJS       0.829      0.833              0.804               0.876

Domain similarity: χ²_inv

The boundary between similar and distinct domains approximately corresponds to χ²_inv = 1.7.

Domain complexity

Similarity between domains is mostly controlled by frequent words, but the shape of the corpus distribution is also influenced by rare words representing its tail.

It has been shown that richer domains with more rare words are more complex for SC.

We also observed that the accuracy loss is higher in cross-domain settings when the source domain is more complex than the target one.

Measures of domain complexity

We propose several measures to approximate domain complexity:

percentage of rare words;

word richness (proportion of vocabulary size to corpus size);

relative entropy.

Correlation of domain complexity measures with in-domain accuracy:

% of rare words   word richness   rel. entropy
-0.904            -0.846          0.793

Domain complexity

corpus   accuracy   % of rare words   word richness   rel. entropy
BO       0.786      64.77             0.064           9.23
DV       0.796      64.16             0.061           8.02
MU       0.774      67.16             0.063           8.98
EL       0.812      61.71             0.049           12.66
KI       0.829      61.49             0.053           14.44
TO       0.816      63.37             0.053           15.27
HE       0.808      61.83             0.056           15.82
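A minimal sketch of the three complexity measures. The talk does not spell out what counts as a "rare" word or the exact form of the relative entropy, so the choices below (count below 3, unigram entropy normalised by its maximum) are assumptions and will not reproduce the table's values.

```python
# Hedged sketch of the domain-complexity measures on a token list.
import math
from collections import Counter

def complexity_measures(tokens, rare_threshold=3):
    counts = Counter(tokens)
    n_tokens, n_types = len(tokens), len(counts)
    # share of vocabulary items occurring fewer than `rare_threshold` times
    pct_rare = 100.0 * sum(1 for c in counts.values() if c < rare_threshold) / n_types
    word_richness = n_types / n_tokens            # vocabulary size / corpus size
    probs = [c / n_tokens for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    rel_entropy = entropy / math.log2(n_types)    # one possible normalisation
    return pct_rare, word_richness, rel_entropy

corpus = ("the plot was great but the ending was boring "
          "the characters were flat and the pacing was slow").split()
print(complexity_measures(corpus))
```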

Modeling accuracy loss

To model the performance drop we assume a linear dependency on domain similarity and complexity variance and propose the following linear regression model:

F(s_ij, ∆c_ij) = β₀ + β₁·s_ij + β₂·∆c_ij,    (1)

where
s_ij is the domain similarity (or distance) between target domain i and source domain j, and
∆c_ij = c_i − c_j is the difference between domain complexities.

The unknown coefficients β_i are the solution of the following system of linear equations:

β₀ + β₁·s_ij + β₂·∆c_ij = ∆a_ij,    (2)

where ∆a_ij is the accuracy drop when adapting the classifier from domain i to domain j.
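A minimal sketch of fitting the coefficients of Eq. (1) by ordinary least squares with NumPy; the arrays are illustrative placeholders, not the 42 domain pairs used in the talk.

```python
# Least-squares fit of the accuracy-loss model in Eq. (1):
#   drop_ij ~ beta0 + beta1 * s_ij + beta2 * dc_ij
import numpy as np

s = np.array([0.40, 0.55, 0.70, 0.85, 1.20, 1.60])    # domain distance s_ij (placeholder)
dc = np.array([-2.0, 1.5, -0.5, 3.0, -1.0, 2.5])      # complexity difference dc_ij (placeholder)
drop = np.array([2.1, 5.0, 8.5, 13.0, 22.0, 34.0])    # observed accuracy drop, % (placeholder)

X = np.column_stack([np.ones_like(s), s, dc])          # design matrix [1, s, dc]
beta, *_ = np.linalg.lstsq(X, drop, rcond=None)
print("beta0, beta1, beta2 =", beta)
print("predicted drops:", X @ beta)
```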

Model evaluation

The evaluation of the constructed regression model includes the following steps:

Global test (or F-test) to verify the statistical significance of the regression model with respect to all its predictors.

Test on individual variables (or t-test) to reveal regressors that do not bring a significant impact into the model.

Leave-one-out cross-validation for the data set of 42 examples.
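The leave-one-out step can be sketched as follows: refit the linear model on all pairs but one, predict the held-out accuracy drop, and accumulate the errors. The data are again placeholders, not the real 42 domain pairs.

```python
# Sketch of leave-one-out cross-validation for the linear accuracy-loss model.
import numpy as np

s = np.array([0.40, 0.55, 0.70, 0.85, 1.20, 1.60])     # placeholder predictors
dc = np.array([-2.0, 1.5, -0.5, 3.0, -1.0, 2.5])
drop = np.array([2.1, 5.0, 8.5, 13.0, 22.0, 34.0])
X = np.column_stack([np.ones_like(s), s, dc])

errors = []
for i in range(len(drop)):
    keep = np.arange(len(drop)) != i                    # hold out pair i
    beta, *_ = np.linalg.lstsq(X[keep], drop[keep], rcond=None)
    errors.append(abs(X[i] @ beta - drop[i]))           # absolute prediction error

print("mean abs error: %.3f  max error: %.3f" % (np.mean(errors), np.max(errors)))
```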

Global test

The null hypothesis for the global test states that there is no correlation between the regressors and the response variable.

Our purpose is to demonstrate that this hypothesis must be rejected with a high level of confidence.

In other words, we have to show that the coefficient of determination R² is high enough to consider its value significantly different from zero.

R²      R       F-value   p-value
0.873   0.935   134.60    << 0.0001

Test on individual coefficients

                   β₀           β₁           β₂
value              -8.67        27.71        -0.55
standard error     1.08         1.77         0.11
t-value            -8.00        15.67        -4.86
p-value            << 0.0001    << 0.0001    << 0.0001

All coefficients are statistically significant with a confidence level higher than 99.9%.

Leave-one-out cross-validation results

accuracy drop   standard error   standard deviation   max error, 95%
all data        1.566            1.091                3.404
< 5%            1.465            1.133                3.373
> 5%, < 10%     1.646            1.173                3.622
> 10%           1.556            1.166                3.519

We are able to predict accuracy loss with a standard error of 1.5% and a maximum error not exceeding 3.4%.

Lower values are observed for domains which are more similar.

This is a strength of the model, as our main purpose is to identify the closest domains.

Comparing actual and predicted drop

(Figures comparing the actual and the predicted accuracy drop.)

BackgroundPreliminary experiments

Modeling accuracy loss for cross-domain SCGraph-based algorithms

ComparisonDocument similarityStrategy for choosing the best parameters

Graph-based algorithms: OPTIM

Goldberg et al., 2006:

The algorithm is based on theassumption that the rating function issmooth with respect to the graph.

Rating difference between the closestnodes is minimised.

Difference between initial rating andthe final value is also minimised.

The result is a solution of anoptimisation problem.

Natalia Ponomareva Cross-domain Sentiment Classification

Graph-based algorithms: RANK

Wu et al., 2009:

On each iteration of the algorithm, the sentiment scores of unlabeled documents are updated on the basis of the weighted sum of the sentiment scores of the nearest labeled neighbours and the nearest unlabeled neighbours.

The process stops when convergence is achieved.
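The description above can be turned into a small iterative sketch, where γ weighs the labeled against the unlabeled neighbours (the same γ discussed later in the deck). The exact update rule and normalisation of Wu et al. (2009) are not reproduced here; this is a simplified approximation.

```python
# Simplified RANK-style update: each unlabeled document's score is repeatedly
# recomputed as a gamma-weighted mix of its nearest labeled and unlabeled
# neighbours' scores, until convergence.
import numpy as np

def rank(sim_lab, sim_unlab, labels, gamma=0.9, n_iter=50, tol=1e-6):
    """sim_lab:   (U, L) similarities to labeled documents,
       sim_unlab: (U, U) similarities among unlabeled documents,
       labels:    (L,) sentiment scores of labeled documents (+1 / -1)."""
    scores = np.zeros(sim_lab.shape[0])
    w_lab = sim_lab / sim_lab.sum(axis=1, keepdims=True)       # row-normalised weights
    w_unlab = sim_unlab / sim_unlab.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        new = gamma * w_lab @ labels + (1 - gamma) * w_unlab @ scores
        if np.max(np.abs(new - scores)) < tol:                  # stop at convergence
            break
        scores = new
    return scores

# Toy graph: 3 unlabeled documents, 2 labeled ones (one positive, one negative).
sim_lab = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
sim_unlab = np.array([[0.0, 0.3, 0.7], [0.3, 0.0, 0.7], [0.7, 0.3, 0.0]])
print(rank(sim_lab, sim_unlab, labels=np.array([1.0, -1.0])))
```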

Comparison

OPTIM algorithm (Goldberg et al., 2006) vs. RANK algorithm (Wu et al., 2009):

The initial setting of RANK does not allow in-domain and out-of-domain neighbours to be different: easy to change!

The condition of smoothness of the sentiment function over the nodes is satisfied for both algorithms.

Unlike RANK, OPTIM requires the closeness of the initial sentiment values and the output ones for unlabeled nodes.

The last condition makes the OPTIM solution more stable.

What about the measure of similarity between graph nodes?

Document representation

We consider 2 types of document representation:

feature-based, which involves weighted document features.
Features are filtered by POS: adjectives, verbs and adverbs.
Features are weighted using either tfidf or idf.

sentiment-units-based, which is based upon the percentage of positive and negative units in a document.
Units can be either sentences or words.
PSP stands for positive sentences percentage, PWP for positive words percentage.
A lexical approach was exploited to calculate the semantic orientation of sentiment units, with the use of SentiWordNet and the SOCAL dictionary.
The SO of a sentence is averaged over its positive and negative words.
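A sketch of the sentiment-units representation (PWP and PSP). The tiny lexicon stands in for SentiWordNet or SOCAL, whose lookups are not reproduced here, and sentence splitting is a naive period split.

```python
# Sketch of the sentiment-units document representation: the share of
# positive words (PWP) and positive sentences (PSP).
LEXICON = {"great": 1, "excellent": 1, "love": 1, "boring": -1, "waste": -1, "poor": -1}

def pwp_psp(document):
    sentences = [s.split() for s in document.lower().split(".") if s.strip()]
    word_scores = [LEXICON[w] for s in sentences for w in s if w in LEXICON]
    pwp = sum(sc > 0 for sc in word_scores) / max(len(word_scores), 1)
    sent_scores = []
    for s in sentences:
        scs = [LEXICON[w] for w in s if w in LEXICON]
        if scs:                                   # sentence SO = average word SO
            sent_scores.append(sum(scs) / len(scs))
    psp = sum(sc > 0 for sc in sent_scores) / max(len(sent_scores), 1)
    return pwp, psp

doc = "The plot is great and the acting is excellent. The ending is boring."
print(pwp_psp(doc))   # (2/3, 1/2) for this toy document
```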

Results

Correlation between document ratings and document features/units:

domain   idf     tfidf   PSP SWN   PSP SOCAL   PWP SWN   PWP SOCAL
BO       0.387   0.377   0.034     0.206       0.067     0.252
DV       0.376   0.368   0.064     0.251       0.098     0.316
EL       0.433   0.389   0.048     0.182       0.043     0.196
KI       0.444   0.416   0.068     0.238       0.076     0.230

Feature-based document representation with idf weights correlates better with document rating than any other representation.

SentiWordNet does not provide good results for this task, probably due to the high level of noise which comes from its automatic construction.

Document similarity is calculated using the cosine measure.

Best accuracy improvement achieved by the algorithms

We tested the performance of each algorithm for several values of their parameters.

The best accuracy improvement given by each algorithm:

(Figure panels: OPTIM; RANK.)

General observations

We selected and examined only those results that were inside the confidence interval of the best accuracy for α = 0.01.

RANK: tends to depend a lot on the values of its parameters, and the most unstable results are obtained when source and target domains are different.

RANK: a great improvement is achieved when adapting the classifier from more complex to simpler domains.

OPTIM: stable, but the results are modest.

Analysis of RANK behaviour

Within clusters of similar domains the majority of good answers have γ ≥ 0.9.

This demonstrates that the information provided by labeled data is more valuable.

For non-similar domains, when the source domain is more complex than the target one, the best results are achieved with a smaller γ, close to 0.5.

This means that the algorithm benefits much from unlabeled data.

For non-similar domains, when the target domain is more complex than the source one, γ tends to increase to 0.7.

That gives preference to the simpler labeled data.

The numbers of labeled and unlabeled neighbours are not equal: there is a clear tendency to prefer results with a smaller number of unlabeled and a higher number of labeled examples.

A proportion of 50 against 150 seems to be ideal, covering most of the cases.

RANK best vs. RANK

(Figure.)

OPTIM best vs. RANK

(Figure.)

Conclusions and future work

Our strategy seems reasonable; the RANK performance is still higher than the OPTIM performance.

In the future we aim to apply the gradient descent method to refine parameter values.

Thank you for your attention!

Relevant references

Aue, A., Gamon, M.: Customizing sentiment classifiers to new domains: A case study. In Proceedings of RANLP'05 (2005).

Blitzer, J., Dredze, M., Pereira, F.: Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL'07, pp. 440–447 (2007).

Dasgupta, S., Ng, V.: Mine the easy, classify the hard: A semi-supervised approach to automatic sentiment classification. In Proceedings of ACL'09, pp. 701–709 (2009).

Goldberg, A.B., Zhu, X.: Seeing stars when there aren't many stars: graph-based semi-supervised learning for sentiment categorization. In Proceedings of TextGraphs'06, pp. 45–52 (2006).

Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1–2), 1–135 (2008).

Read, J.: Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, pp. 43–48 (2005).

Turney, P.: Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of ACL'02, pp. 417–424 (2002).

Wu, Q., Tan, S., Cheng, X.: Graph ranking for sentiment transfer. In Proceedings of ACL-IJCNLP'09, pp. 317–320 (2009).

Zagibalov, T.: Unsupervised and Knowledge-poor Approaches to Sentiment Analysis. Ph.D. thesis, University of Sussex (2010).