
1

Toward Opinion Summarization: Linking the Sources

Veselin Stoyanov and Claire Cardie
Department of Computer Science
Cornell University
Ithaca, NY 14850, USA
{ves,cardie}@cs.cornell.edu

Advisor: Hsin-Hsi Chen
Speaker: Yong-Sheng Lo
Date: 2006/10/23

ACL 2006 Workshop on Sentiment and Subjectivity in Text


2

Agenda

Introduction

Toward opinion summarization

– Source coreference resolution

Data set

The method

– Transformation to standard noun phrase coreference resolution

– Coreference resolution, following Ng and Cardie (2002)

Evaluation

Conclusion


3

Introduction 1/4

The problem of opinion summarization
– Addressing the dearth of approaches for summarizing opinion information
– Source coreference resolution
» Deciding which source mentions (opinion holders) are associated with opinions that belong to the same real-world entity
– Example (see next page)

Coreference resolution
– Deciding which noun phrases in a text refer to the same real-world entities
– e.g., 阿扁 (A-bian), 陳總統 (President Chen), and 中華民國陳總統 (ROC President Chen) all refer to 陳水扁 (Chen Shui-bian)


4

Introduction 2/4

Example (corpus of manually annotated opinions)

“[Target Delaying of Bulgaria’s accession to the EU] would be a serious mistake,” [Source Bulgarian Prime Minister Sergey Stanishev] said in an interview for the German daily Süddeutsche Zeitung. “[Target Our country] serves as a model and encourages countries from the region to follow despite the difficulties,” [Source he] added.

[Target Bulgaria] is criticized by [Source the EU] because of slow reforms in the judiciary branch, the newspaper notes. Stanishev was elected prime minister in 2005. Since then, [Source he] has been a prominent supporter of [Target his country’s accession to the EU].


5

Introduction 3/4

[This slide showed only a figure; it is not recoverable from this transcript.]


6

Introduction 4/4

Example (source coreference resolution)

“[Target Delaying of Bulgaria’s accession to the EU] would be a serious mistake,” [Source Bulgarian Prime Minister Sergey Stanishev] said in an interview for the German daily Süddeutsche Zeitung. “[Target Our country] serves as a model and encourages countries from the region to follow despite the difficulties,” [Source he] added.

[Target Bulgaria] is criticized by [Source the EU] because of slow reforms in the judiciary branch, the newspaper notes. Stanishev was elected prime minister in 2005. Since then, [Source he] has been a prominent supporter of [Target his country’s accession to the EU].

Source coreference resolution links [Source Bulgarian Prime Minister Sergey Stanishev] and the two [Source he] mentions into one chain (Stanishev), while [Source the EU] forms a separate chain.


7

Data set 1/2

MPQA corpus (Wilson and Wiebe, 2003)
– Multi-Perspective Question Answering
– Annotations developed using GATE
» General Architecture for Text Engineering
» Example (see next page)

– 535 documents manually annotated with phrase-level opinion information
– Collected over an 11-month period, between June 2001 and May 2002
– Suitable for the political, governmental, and commercial domains
– Source coreference chains can be derived from the annotations
– Contains no coreference information for general NPs (those that are not sources)


8

Data set 2/2

Example of annotations in GATE


9

The method 1/10

Solving source coreference resolution
– Transformation: how can source coreference resolution (SCR) be transformed into standard noun phrase coreference resolution (NPCR)?

Differences between SCR and NPCR:
1. The sources of opinions do not correspond exactly to the automatic extractors’ notion of noun phrases (NPs)
2. The time-consuming nature of coreference annotation (the MPQA corpus links only sources, not all NPs)


10

The method 2/10

The general approach to SCR (a minimal pipeline sketch follows below)
1. Preprocessing
– To obtain an augmented set of NPs in the text
– As in Ng and Cardie (2002):
» running a tokenizer, sentence splitter, POS tagger, parser, a base NP finder, and a named-entity finder
2. Source-to-noun-phrase mapping
– Three problems (next slide)
– Handled by a set of heuristics
3. Coreference resolution
– Applying a state-of-the-art coreference resolution approach to the transformed data
» “Improving Machine Learning Approaches to Coreference Resolution” [Ng and Cardie (2002)]
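To make the shape of the three stages concrete, here is a minimal Python skeleton. All function names and the Span type are hypothetical illustrations; the slides do not show the authors' code, and the stages are stubbed out.

```python
# Hypothetical skeleton of the three-stage SCR pipeline sketched above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    text: str

def preprocess(document: str) -> list[Span]:
    """Stage 1: tokenizer, sentence splitter, POS tagger, parser,
    base NP finder, and named-entity finder produce an augmented
    set of NPs (stubbed here; see Ng and Cardie (2002))."""
    raise NotImplementedError

def map_sources_to_nps(sources: list[Span],
                       nps: list[Span]) -> dict[Span, Span]:
    """Stage 2: heuristic source-to-NP mapping (Rules 1-3,
    sketched on a later slide)."""
    raise NotImplementedError

def resolve(source_nps: list[Span]) -> list[set[Span]]:
    """Stage 3: pairwise classification + single-link clustering."""
    raise NotImplementedError

def source_coreference(document: str, sources: list[Span]) -> list[set[Span]]:
    nps = preprocess(document)
    mapping = map_sources_to_nps(sources, nps)
    return resolve(list(mapping.values()))
```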


11

The method 3/10

Three problems
– Inexact span match
» “Venezuelan people” vs. “the Venezuelan people”
» “Muslims rulers” was not recognized, while “Muslims” and “rulers” were recognized by the NP extractor
– Multiple NP match
» “the country’s new president, Eduardo Duhalde”
» “Latin American leaders at a summit meeting in Costa Rica”
» “Britain, Canada and Australia”
– No matching NP
» “Carmona named new ministers, including two military officers who rebelled against Chavez”
» “many”, “which”, and “domestically”
» “lash” and “taskforce”


12

The method 4/10

Using a set of heuristics
– Rule 1
» If a source matches any NP exactly in span, map the source to that NP; do this even if multiple NPs overlap the source
» Example 1
· [determiner] (the annotated source span): “the Venezuelan people”
· [NP extractor]: “the Venezuelan people”
» Example 2
· [determiner]: “the country’s new president, Eduardo Duhalde”
· [NP extractor]: “the country’s new president”, “Eduardo Duhalde”


13

The method 5/10

Rule 2
– If no NP matches exactly in span, then:
» If a single NP overlaps the source, map the source to that NP
» If multiple NPs overlap the source, prefer, in order:
1. The outermost NP
· Because longer NPs contain more information
2. The last NP
· Because it is likely to be the head NP of the phrase
3. The NP before a preposition
· Because a preposition signals an explanatory prepositional phrase


14

The method 6/10

Examples
1. The outermost NP
– [determiner]: “Prime Minister Sergey Stanishev”
– [NP extractor]: “Bulgarian Prime Minister”, “Sergey Stanishev”, “Bulgarian Prime Minister Sergey Stanishev”
2. The last NP
– [determiner]: “new president, Eduardo Duhalde”
– [NP extractor]: “the country’s new president”, “Eduardo Duhalde”
3. The NP before a preposition
– [determiner]: “Latin American leaders at a summit meeting in Costa Rica”
– [NP extractor]: “Latin American leaders”, “summit meeting”, “Costa Rica”


15

The method 7/10

Rule 3
– If no NP overlaps the source, select the last NP before the source
– Example: Stanishev was elected prime minister in 2005. Since then, [Source he] has been a prominent supporter.
» [determiner]: “he”
» [NP extractor]: “Stanishev”, “prime minister”, “prominent supporter”
– In half of these cases the source is the word who, which typically refers to the last preceding NP
» “Carmona named new ministers, including two military officers who rebelled against Chavez”

(A combined sketch of Rules 1–3 follows below.)
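Read together, Rules 1–3 amount to the following heuristic, shown here as a minimal Python sketch over character-offset spans. This is my reading of the slides, not the authors' code; in particular, the last-NP and before-preposition tie-breakers need token and POS information and are only noted in comments.

```python
# A sketch of the source-to-NP mapping heuristics (Rules 1-3).
# Spans are (start, end) character offsets; nps is in document order.

def overlaps(a, b):
    """True if the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def map_source(source, nps):
    # Rule 1: an exact span match wins outright, even if other NPs
    # also overlap the source.
    for np in nps:
        if np == source:
            return np

    overlapping = [np for np in nps if overlaps(np, source)]

    # Rule 2: a single overlapping NP is taken as-is; among several,
    # prefer the outermost (widest) NP. The slides also prefer the
    # last NP and the NP before a preposition as further tie-breakers;
    # those need token/POS information and are omitted here.
    if len(overlapping) == 1:
        return overlapping[0]
    if overlapping:
        return max(overlapping, key=lambda np: np[1] - np[0])

    # Rule 3: no overlap at all -- take the last NP before the source.
    before = [np for np in nps if np[1] <= source[0]]
    return before[-1] if before else None
```

For instance, `map_source((10, 32), [(0, 5), (10, 32), (10, 20)])` returns `(10, 32)` by Rule 1 even though `(10, 20)` also overlaps.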


16

The method 8/10

Coreference resolution
– Using the standard combination of classification and single-link clustering
» Soon et al. (2001); Ng and Cardie (2002)

Machine learning approach
– Computing a vector of 57 features for every pair of source noun phrases from the preprocessed corpus
» (source, NP)
– Training
» To predict whether a source NP pair should be classified as positive (the NPs refer to the same entity) or negative
– Testing
» To predict whether a source NP pair is positive
» Then single-link clustering groups together sources that belong to the same entity


17

The method 9/10

Example (single-link clustering, with mentions of 李登輝 Lee Teng-hui)

Training (positive instances)
– (source, NP) + feature set
– ( 李登輝 , 李前總統 ) + 57 features
– ( 李登輝 , 登輝先生 ) + 57 features
– ( 阿輝伯 , 登輝先生 ) + 57 features

Testing
– ( 李前總統 , 登輝先生 ) => positive
– ( 阿輝伯 , 李前總統 ) => positive
» Single-link clustering chains 阿輝伯 – 李前總統 – 登輝先生 into one cluster (see the sketch below)
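Single-link clustering over the positive pairwise decisions is exactly a union-find pass: every positive pair merges two clusters. A minimal sketch, using the mentions from this slide; an illustration of the technique, not the authors' implementation.

```python
# Single-link clustering of mentions from positive pairwise decisions.

def single_link_clusters(mentions, positive_pairs):
    parent = {m: m for m in mentions}

    def find(x):
        # Walk to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in positive_pairs:          # union every positive pair
        parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:                   # group mentions by their root
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())

mentions = ["李登輝", "李前總統", "登輝先生", "阿輝伯"]
positives = [("李前總統", "登輝先生"), ("阿輝伯", "李前總統")]
print(single_link_clusters(mentions, positives))
# -> [{'李登輝'}, {'李前總統', '登輝先生', '阿輝伯'}]
```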


18

The method 10/10

Machine learning techniques
– Trying the reportedly best techniques for pairwise classification
– RIPPER (Cohen, 1995)
» Repeated Incremental Pruning to Produce Error Reduction
» Using 24 different settings
– SVMlight
» Support vector machines
» Using 56 different settings

Feature set
– 57 = 12 + 41 + ??
» 12 from Soon et al. (2001)
» 41 from Ng and Cardie (ACL 2002)


19

Feature set (12 features)

Features are computed over each NP pair ( NPi , NPj ). [The slide showed the 12-feature table of Soon et al. (2001); the table itself is not recoverable from this transcript.]
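Since the table did not survive extraction, here is a hedged sketch of a few Soon et al. (2001)-style pair features (sentence distance, string match after determiner stripping, pronoun indicators, a crude alias test), reconstructed from the literature rather than from the slide; all names are my own.

```python
# A few illustrative Soon-style features for a pair (NP_i, NP_j).

PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "their"}

def strip_determiner(np):
    """Drop a leading determiner, as Soon et al. do before string match."""
    lowered = np.lower()
    for det in ("the ", "a ", "an "):
        if lowered.startswith(det):
            return np[len(det):]
    return np

def pair_features(np_i, np_j, sent_i, sent_j):
    """sent_i and sent_j are the sentence indices of the two mentions."""
    bare_i = strip_determiner(np_i).lower()
    bare_j = strip_determiner(np_j).lower()
    return {
        "distance": sent_j - sent_i,            # sentences between mentions
        "string_match": bare_i == bare_j,       # match after determiners
        "i_pronoun": np_i.lower() in PRONOUNS,  # is NP_i a pronoun?
        "j_pronoun": np_j.lower() in PRONOUNS,  # is NP_j a pronoun?
        "alias": bare_i in bare_j or bare_j in bare_i,  # crude alias test
    }

print(pair_features("Sergey Stanishev", "Stanishev", 0, 2))
# -> distance 2, string_match False, alias True, pronoun flags False
```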


20

Feature set (41 features)

[The slide showed the 41 additional features of Ng and Cardie (2002); the table is not recoverable from this transcript.]


21

Feature set (41 features), continued

[Table not recoverable from this transcript.]


22

Evaluation

MPQA corpus (535 documents)
– 400 documents for the training set (chosen at random)
– 135 documents for the test set (the remainder)

The purpose of the evaluation
– To create a strong baseline
» Using the best settings for NP coreference resolution


23

Evaluation

Instance selection
– Adopting the method of Soon et al. (2001), which selects, for each NP, the pairs with the n preceding coreferent instances and all intervening non-coreferent pairs (a sketch follows below)
– Soon 1 (n = 1) [Ng and Cardie (2002)]
– Soon 2 (n = 2) [Ng and Cardie (2002)]
– None (no instance selection)
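A minimal sketch of this Soon-style instance selection, written from the description above; the function name and the chain_of representation are my own illustration, not the authors' code.

```python
# Soon-style training-instance selection: for each anaphoric NP, pair it
# with its n nearest preceding coreferent mentions (positives) and with
# every non-coreferent mention in between (negatives).

def soon_instances(mentions, chain_of, n=1):
    """mentions: mention ids in document order.
    chain_of: mention id -> gold chain id.
    Returns ((antecedent, anaphor), label) training instances."""
    instances = []
    for j in range(len(mentions)):
        anaphor = mentions[j]
        negatives, found = [], 0
        for antecedent in reversed(mentions[:j]):   # scan leftward
            if chain_of[antecedent] == chain_of[anaphor]:
                instances.append(((antecedent, anaphor), 1))
                instances.extend((pair, 0) for pair in negatives)
                negatives, found = [], found + 1
                if found == n:        # Soon 1: n = 1, Soon 2: n = 2
                    break
            else:
                negatives.append((antecedent, anaphor))
    return instances

# e.g., with gold chains A: m1-m3-m4 and B: m2
print(soon_instances(["m1", "m2", "m3", "m4"],
                     {"m1": "A", "m2": "B", "m3": "A", "m4": "A"}, n=1))
# -> [(('m1', 'm3'), 1), (('m2', 'm3'), 0), (('m3', 'm4'), 1)]
```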


24

Evaluation

Performance measures for coreference resolution
– B-CUBED (Bagga and Baldwin, 1998) (a sketch follows below)
– MUC score (Vilain et al., 1995)
– Positive identification
» Precision, recall, and F1 on the identification of the positive class
» Using the pairwise decisions as the classifier outputs them
» Example (see next page)
– Actual positive identification
» Precision, recall, and F1 on the identification of the positive class
» By clustering the source NPs and then considering a pairwise decision to be positive if the two source NPs belong to the same cluster
» Example (see next page)
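B-CUBED averages a per-mention precision and recall over all mentions: for each mention, precision is the fraction of its system cluster that is truly coreferent with it, and recall is the fraction of its gold cluster that the system recovered. A minimal sketch of the published definition (Bagga and Baldwin, 1998); not the scorer the authors used.

```python
# B-CUBED: per-mention precision/recall averaged over all mentions.

def b_cubed(system, gold):
    """system, gold: clusterings as lists of sets of mention ids."""
    sys_of = {m: c for c in system for m in c}   # mention -> system cluster
    gold_of = {m: c for c in gold for m in c}    # mention -> gold cluster
    mentions = list(gold_of)
    p = r = 0.0
    for m in mentions:
        overlap = len(sys_of[m] & gold_of[m])
        p += overlap / len(sys_of[m])            # per-mention precision
        r += overlap / len(gold_of[m])           # per-mention recall
    p, r = p / len(mentions), r / len(mentions)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(b_cubed(system=[{"a"}, {"b", "c"}], gold=[{"a", "b"}, {"c"}]))
# -> (0.666..., 0.666..., 0.666...)
```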


25

Sample of answer set (the speaker's illustration, using mentions of 陳水扁 Chen Shui-bian and 馬英九 Ma Ying-jeou)

The classifier's output (positive pairs; ** marks incorrect decisions):

( 陳水扁 , 陳水扁總統 )
( 陳水扁 , 陳總統 ) **
( 陳水扁 , 阿扁總統 ) **
( 馬英九 , 市長馬英九 )
( 陳總統 , 陳水扁總統 ) **
( 陳總統 , 陳總統 )
( 陳總統 , 阿扁總統 )
( 阿扁 , 陳水扁總統 ) **
( 阿扁 , 陳總統 )
( 阿扁 , 阿扁總統 )

Positive identification (the annotated answer pairs the raw decisions are scored against):

( 陳水扁 , 陳水扁總統 )
( 馬英九 , 市長馬英九 )
( 陳總統 , 陳總統 )
( 阿扁 , 阿扁總統 )

Actual positive identification (all pairs implied once the mentions are clustered):

( 陳水扁 , 陳水扁總統 )
( 馬英九 , 市長馬英九 )
( 陳總統 , 陳總統 )
( 陳總統 , 阿扁總統 )
( 阿扁 , 陳總統 )
( 阿扁 , 阿扁總統 )

compared against the annotated answer pairs:

( 陳水扁 , 陳水扁總統 )
( 馬英九 , 市長馬英九 )
( 陳總統 , 陳總統 )
( 阿扁 , 阿扁總統 )

Answer key (source coreference chains):
– (source) 陳水扁 → (NP) 陳水扁總統
– (source) 馬英九 → (NP) 市長馬英九
– (source) 陳總統 , 阿扁 → (NP) 陳總統 , 阿扁總統
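The two pairwise scores can be computed as below, using this slide's example. "Positive identification" scores the classifier's raw pairwise decisions; "actual positive identification" would first cluster the mentions (with the single-link sketch earlier) and count every within-cluster pair as positive. This is my own illustration of how such scores are typically computed, not the authors' evaluation code.

```python
# Pairwise precision/recall/F1 over canonicalized mention pairs.
from itertools import combinations

def prf(predicted, gold):
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def pairs_from_clusters(clusters):
    """Every unordered within-cluster pair, canonically ordered."""
    return {tuple(sorted(pair)) for c in clusters
            for pair in combinations(c, 2)}

gold_clusters = [{"陳水扁", "陳水扁總統"},
                 {"馬英九", "市長馬英九"},
                 {"陳總統", "阿扁", "阿扁總統"}]
gold_pairs = pairs_from_clusters(gold_clusters)

# The classifier's raw positive decisions from the slide. The pair
# ( 陳總統 , 陳總統 ) is omitted: with mentions as plain strings it
# would collapse to a self-pair.
predicted = {tuple(sorted(p)) for p in [
    ("陳水扁", "陳水扁總統"), ("陳水扁", "陳總統"), ("陳水扁", "阿扁總統"),
    ("馬英九", "市長馬英九"), ("陳總統", "陳水扁總統"), ("陳總統", "阿扁總統"),
    ("阿扁", "陳水扁總統"), ("阿扁", "陳總統"), ("阿扁", "阿扁總統"),
]}

# Positive identification: score the raw decisions directly.
print(prf(predicted, gold_pairs))   # here P = 5/9, R = 1.0

# Actual positive identification: run single_link_clusters(...) on the
# predicted pairs first, then score pairs_from_clusters(clusters).
```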

26–30

Evaluation

[Slides 26–30 presented the evaluation results as tables and plots; they are not recoverable from this transcript.]

31

Conclusion

As a first step toward opinion summarization:
– Targeted the problem of source coreference resolution
– Showed that this problem can be tackled effectively as noun phrase coreference resolution
– Created a strong baseline

Next step
– To develop a method that utilizes the unlabeled NPs in the corpus using a structured rule learner