A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources

Carmen Banea, Rada Mihalcea (University of North Texas)
Janyce Wiebe (University of Pittsburgh)


Page 1: A Bootstrapping Method  for Building Subjectivity Lexicons  for Languages with Scarce Resources

Carmen Banea, Rada Mihalcea
University of North Texas

Janyce Wiebe
University of Pittsburgh

Page 2: Subjectivity analysis

  • Subjectivity analysis (opinions and sentiments) is used in a wide variety of applications:
      • Tracking sentiment timelines in news (Lloyd et al., 2005)
      • Review classification (Turney, 2002; Pang et al., 2002)
      • Mining opinions from product reviews (Hu and Liu, 2004)
      • Expressive text-to-speech synthesis (Alm et al., 2005)
      • Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006)
      • Question answering (Yu and Hatzivassiloglou, 2003)
  • Much work on subjectivity analysis has focused on English
      • Other languages studied: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)

Page 3: Proportion of Languages on the Web

[Figure: proportion of languages on the Web. Source: internetworldstats.com, updated November 30, 2007]

Page 4: Objective

  • Develop a method for subjectivity analysis that
      • Requires few electronic resources
      • Can be easily ported to a new language
  • Applicable to the large number of languages that have scarce electronic resources

Page 5: Related Work

  • Tools that rely on manually or semi-automatically constructed lexicons
      • Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006
      • Enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text
  • These tools assume the availability of
      • Advanced language processing tools: syntactic parsers (Wiebe, 2000), information extraction (Riloff and Wiebe, 2003)
      • Broad-coverage rich lexical resources: WordNet (Esuli and Sebastiani, 2006)
  • Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity
      • We address the task of acquiring a subjectivity lexicon
      • We rely on fewer, smaller-scale resources

Page 6: Our Method

  • Based on bootstrapping
  • Requires:
      • A small seed set of subjective entries
      • One or more electronic dictionaries
      • A small training corpus (approx. 500,000 words)
  • Experiments focused on Romanian
      • Applicable to other languages as well

Page 7: Bootstrapping Process

[Figure: flowchart of the bootstrapping process. Elements: seeds, query, online dictionary, candidate synonyms, fixed filtering, selected synonyms, decision "Max. no. of iterations?" (no / yes), variable filtering]
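
The loop in this flowchart can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `bootstrap`, the callables `lookup` and `similarity`, the parameter names, and all toy data are assumptions made for the example. It assumes that candidates returned by the dictionary are kept only if their similarity score exceeds a fixed threshold, and that the kept words are queried again until the iteration limit is reached.

```python
def bootstrap(seeds, lookup, similarity, max_iterations, threshold):
    """Grow a lexicon from seed words (illustrative sketch).

    lookup(word)     -> iterable of candidate words (hypothetical dictionary query)
    similarity(word) -> score of the word against the original seeds (hypothetical)
    """
    lexicon = set(seeds)
    frontier = set(seeds)
    for _ in range(max_iterations):
        # Query the dictionary with every word selected in the previous round.
        candidates = set()
        for word in frontier:
            candidates.update(lookup(word))
        candidates -= lexicon  # skip words already in the lexicon
        # Fixed filtering: keep only candidates above the threshold.
        selected = {w for w in candidates if similarity(w) > threshold}
        if not selected:
            break  # nothing new to expand
        lexicon |= selected
        frontier = selected
    return lexicon


# Toy data, invented for illustration only (not real dictionary output or scores).
toy_dictionary = {"dulce": ["placut", "dulceag"], "placut": ["agreabil"]}
toy_scores = {"placut": 0.8, "dulceag": 0.3, "agreabil": 0.6}

lexicon = bootstrap(
    ["dulce"],
    lookup=lambda w: toy_dictionary.get(w, []),
    similarity=lambda w: toy_scores.get(w, 0.0),
    max_iterations=2,
    threshold=0.5,
)
print(sorted(lexicon))
```

Passing the dictionary query and the similarity measure in as callables keeps the loop independent of any particular dictionary or similarity model.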

Page 8: Seed Set

Category   Sample entries (with their English translation)
Noun       blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb       iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective  frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb     posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)

  • 60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs
  • Manually selected
  • Seed sources:
      • 11th-grade curriculum for Romanian Language and Literature
      • Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)

Page 9: Expansion

  • Romanian dictionary: http://www.dexonline.ro
      • Dictionaries for other languages are also available, or can be obtained from paper dictionaries through OCR
  • Candidate synonyms are taken from the dictionary definition of each seed:
      • All open-class words that have a definition in the dictionary and are longer than 3 letters
      • Diacritics are removed

[Figure: seed, its definition, and the resulting candidate synonyms]
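
The extraction rules on this slide could be sketched in Python as below. The function names, the toy definition string, and the toy entry set are assumptions made for illustration. In particular, the slides do not say how open-class words were identified; this sketch approximates it by excluding a small, hypothetical closed-class word list.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'plăcut' -> 'placut'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def extract_candidates(definition, has_entry, closed_class=frozenset()):
    """Return candidate words from one dictionary definition, in order of appearance.

    has_entry(word) -> True if the word has its own dictionary entry (hypothetical)
    closed_class    -> words to skip as not open-class (an assumption of this sketch)
    """
    words = re.findall(r"[^\W\d_]+", strip_diacritics(definition.lower()))
    candidates = []
    for word in words:
        if (len(word) > 3                  # longer than 3 letters
                and word not in closed_class
                and has_entry(word)
                and word not in candidates):
            candidates.append(word)
    return candidates


# Toy definition and entry list, invented for illustration only.
toy_entries = {"care", "gust", "dulce", "placut", "dulceag"}
candidates = extract_candidates(
    "care are gust dulce, plăcut, dulceag",
    has_entry=toy_entries.__contains__,
    closed_class={"care"},
)
print(candidates)
```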

Page 10: Filtering

  • Candidates are filtered based on a measure of similarity with the original seeds
      • We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993)
  • After each iteration, only candidates with an LSA score higher than a given threshold are selected for further expansion
  • Example:
      • Seed: dulce (sweet)
      • Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
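
A minimal sketch of threshold filtering over word vectors, assuming the vectors have already been produced by an LSA model (the model itself is not shown). The slide does not state how a candidate's score against the seed set is aggregated; averaging the cosine similarity over the seeds is an assumption of this example, as are all names and the toy vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def seed_similarity(word, seeds, vectors):
    """Average cosine similarity of a word to the seeds (aggregation is an assumption)."""
    if word not in vectors:
        return 0.0
    scores = [cosine(vectors[word], vectors[s]) for s in seeds if s in vectors]
    return sum(scores) / len(scores) if scores else 0.0


def filter_candidates(candidates, seeds, vectors, threshold):
    """Keep candidates whose similarity to the seeds exceeds the threshold."""
    return [w for w in candidates if seed_similarity(w, seeds, vectors) > threshold]


# Toy 2-dimensional vectors, invented for illustration (not real LSA output).
toy_vectors = {"dulce": [1.0, 0.0], "placut": [0.8, 0.6], "gust": [0.0, 1.0]}
kept = filter_candidates(["placut", "gust"], ["dulce"], toy_vectors, threshold=0.5)
print(kept)
```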

Page 11: Filtering

  • Several iterations of the bootstrapping process result in a subjectivity lexicon consisting of a ranked list of candidates in decreasing order of similarity to the original seeds
  • A variable filtering threshold can be used to further restrict the similarity, yielding a purer lexicon
  • Filtering parameters:
      • Similarity threshold
      • Number of iterations
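
Ranking the lexicon and then cutting it at a variable threshold could look like the sketch below. The function names and the toy scores are hypothetical; the scores stand in for whatever similarity values the bootstrapping run produced.

```python
def rank_lexicon(scores):
    """Sort (word, score) pairs by decreasing similarity to the seeds."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def apply_variable_threshold(ranked, threshold):
    """Keep only the words whose score exceeds the chosen threshold."""
    return [word for word, score in ranked if score > threshold]


# Toy scores, invented for illustration only.
toy_scores = {"a": 0.9, "b": 0.55, "c": 0.7}
ranked = rank_lexicon(toy_scores)
purer = apply_variable_threshold(ranked, threshold=0.6)
print(ranked)
print(purer)
```

Because the full ranked list is kept, different thresholds can be tried afterwards without rerunning the bootstrapping.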

Page 12: Lexicon Acquisition

[Figure: Lexicon Acquisition]

Page 13: Evaluation

  • Rule-based classifier of subjectivity (Riloff and Wiebe, 2003)
      • Subjective sentence: three or more subjective entries
      • Objective sentence: two or fewer subjective entries
  • Gold standard data set (Mihalcea, Banea and Wiebe, 2007)
      • 504 sentences from five SemCor documents (manually translated into Romanian)
      • Labeled by two annotators
      • Agreement (all): 83% (κ = 0.67)
      • Agreement (uncertain removed): 89% (κ = 0.77)
      • Baseline: 54% (all subjective)
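
The classification rule on this slide (three or more subjective entries means subjective, otherwise objective) can be written as a short Python sketch. Whitespace tokenization, exact-match lookup without lemmatization, and counting every matching token are assumptions of this example; the function name and the toy sentences are invented for illustration.

```python
def classify_sentence(tokens, lexicon, min_subjective=3):
    """Label a tokenized sentence by counting tokens found in the subjectivity lexicon."""
    hits = sum(1 for token in tokens if token in lexicon)
    return "subjective" if hits >= min_subjective else "objective"


# Toy lexicon built from seed words on the slides; the sentences are invented.
toy_lexicon = {"frumos", "fericit", "iubi", "urat"}
label_a = classify_sentence("frumos fericit iubi casa".split(), toy_lexicon)
label_b = classify_sentence("frumos casa mare".split(), toy_lexicon)
print(label_a, label_b)
```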

Page 14: Number of Iterations

[Figure: F-measure for the bootstrapping subjectivity lexicon over 5 iterations, with an LSA threshold of 0.5]
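
For reference, the standard balanced F-measure (harmonic mean of precision and recall) can be computed as in the sketch below. It is assumed here that the slides report this balanced form; the function name and the counts in the example are invented and are not results from the experiments.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1) from raw counts; returns 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: precision 0.75, recall 0.6.
score = f_measure(true_positives=6, false_positives=2, false_negatives=4)
print(round(score, 4))
```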

Page 15: Similarity Threshold

[Figure: F-measure for the fifth bootstrapping iteration for varying LSA scores]

Page 16: Comparison

  • Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity lexicon obtained through 5 iterations and a similarity threshold of 0.5

Page 17: Conclusions

  • Our bootstrapping method uses few electronic resources:
      • A small seed set
      • One or more dictionaries
      • A small corpus of half a million words
  • A large subjectivity lexicon of approx. 4,000 entries was extracted
  • Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20% and an overall F-measure of 61.69% can be achieved

Page 18: Questions?