
  • ACL 2010

    SemEval 2010

    5th International Workshop on Semantic Evaluation

    Proceedings of the Workshop

    15-16 July 2010, Uppsala University, Uppsala, Sweden

  • Production and Manufacturing by Taberg Media Group AB, Box 94, 562 02 Taberg, Sweden

    ©2010 The Association for Computational Linguistics

    Order copies of this and other ACL proceedings from:

    Association for Computational Linguistics (ACL)
    209 N. Eighth Street
    Stroudsburg, PA 18360, USA
    Tel: +1-570-476-8006, Fax: [email protected]

    ISBN 978-1-932432-70-1 / 1-932432-70-1


  • Preface

    Welcome to SemEval 2010!

    Thank you for offering so many different and intriguing semantic analysis tasks, and for creating so many great systems to solve them. We are very much looking forward to this workshop, and are curious to hear about your work.

    Katrin and Carlo.


  • Organizers:

    Katrin Erk, University of Texas at Austin
    Carlo Strapparava, ITC IRST

    Program Committee:

    Eneko Agirre, Timothy Baldwin, Marco Baroni, Chris Biemann, Chris Brew, Nicoletta Calzolari, Dmitriy Dligach, Phil Edmonds, Dan Gildea, Iris Hendrickx, Véronique Hoste, Nancy Ide, Elisabetta Jezek, Peng Jin, Adam Kilgarriff, Su Nam Kim, Ioannis Klapaftis, Dimitrios Kokkinakis, Anna Korhonen, Zornitsa Kozareva, Sadao Kurohashi, Els Lefever, Ken Litkowski, Oier Lopez de Lacalle, Suresh Manandhar, Katja Markert, Lluís Màrquez, Diana McCarthy, Saif Mohammad, Roser Morante, Preslav Nakov, Vivi Nastase, Hwee Tou Ng, Manabu Okumura, Martha Palmer, Ted Pedersen, Marco Pennacchiotti, Massimo Poesio, Valeria Quochi, German Rigau, Lorenza Romano, Anna Rumshisky, Josef Ruppenhofer, Emili Sapena, Kiyoaki Shirai, Ravi Sinha, Caroline Sporleder, Mark Stevenson, Stan Szpakowicz, Mariona Taulé, Dan Tufiş, Tony Veale, Marc Verhagen, Yannick Versley, Richard Wicentowski, Yunfang Wu, Dekai Wu, Nianwen Xue, Deniz Yuret, Diarmuid Ó Séaghdha


  • Table of Contents

    SemEval-2010 Task 1: Coreference Resolution in Multiple Languages
    Marta Recasens, Lluís Màrquez, Emili Sapena, M. Antònia Martí, Mariona Taulé, Véronique Hoste, Massimo Poesio and Yannick Versley . . . . . . 1

    SemEval-2010 Task 2: Cross-Lingual Lexical SubstitutionRada Mihalcea, Ravi Sinha and Diana McCarthy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    SemEval-2010 Task 3: Cross-Lingual Word Sense DisambiguationEls Lefever and Veronique Hoste . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    SemEval-2010 Task 5 : Automatic Keyphrase Extraction from Scientific ArticlesSu Nam Kim, Olena Medelyan, Min-Yen Kan and Timothy Baldwin . . . . . . . . . . . . . . . . . . . . . . . . . 21

    SemEval-2010 Task 7: Argument Selection and CoercionJames Pustejovsky, Anna Rumshisky, Alex Plotnick, Elisabetta Jezek, Olga Batiukova and Valeria

    Quochi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of NominalsIris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O Seaghdha, Sebastian

    Pado, Marco Pennacchiotti, Lorenza Romano and Stan Szpakowicz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    SemEval-2 Task 9: The Interpretation of Noun Compounds Using Paraphrasing Verbs and PrepositionsCristina Butnariu, Su Nam Kim, Preslav Nakov, Diarmuid O Seaghdha, Stan Szpakowicz and Tony

    Veale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    SemEval-2010 Task 10: Linking Events and Their Participants in DiscourseJosef Ruppenhofer, Caroline Sporleder, Roser Morante, Collin Baker and Martha Palmer . . . . . . 45

    SemEval-2010 Task 12: Parser Evaluation Using Textual EntailmentsDeniz Yuret, Aydin Han and Zehra Turgut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    SemEval-2010 Task 13: TempEval-2Marc Verhagen, Roser Sauri, Tommaso Caselli and James Pustejovsky . . . . . . . . . . . . . . . . . . . . . . . 57

    SemEval-2010 Task 14: Word Sense Induction & DisambiguationSuresh Manandhar, Ioannis Klapaftis, Dmitriy Dligach and Sameer Pradhan . . . . . . . . . . . . . . . . . . 63

    SemEval-2010 Task: Japanese WSDManabu Okumura, Kiyoaki Shirai, Kanako Komiya and Hikaru Yokono . . . . . . . . . . . . . . . . . . . . . . 69

    SemEval-2010 Task 17: All-Words Word Sense Disambiguation on a Specific DomainEneko Agirre, Oier Lopez de Lacalle, Christiane Fellbaum, Shu-Kai Hsieh, Maurizio Tesconi,

    Monica Monachini, Piek Vossen and Roxanne Segers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

    SemEval-2010 Task 18: Disambiguating Sentiment Ambiguous AdjectivesYunfang Wu and Peng Jin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

    SemEval-2010 Task 11: Event Detection in Chinese News SentencesQiang Zhou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

    SemEval-2 Task 15: Infrequent Sense Identification for Mandarin Text to Speech SystemsPeng Jin and Yunfang Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87


  • RelaxCor: A Global Relaxation Labeling Approach to Coreference Resolution
    Emili Sapena, Lluís Padró and Jordi Turmo . . . . . . 88

    SUCRE: A Modular System for Coreference ResolutionHamidreza Kobdani and Hinrich Schutze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

    UBIU: A Language-Independent System for Coreference ResolutionDesislava Zhekova and Sandra Kubler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

    Corry: A System for Coreference ResolutionOlga Uryupina . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

    BART: A Multilingual Anaphora Resolution SystemSamuel Broscheit, Massimo Poesio, Simone Paolo Ponzetto, Kepa Joseba Rodriguez, Lorenza Ro-

    mano, Olga Uryupina, Yannick Versley and Roberto Zanoli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

    TANL-1: Coreference Resolution by Parse Analysis and Similarity ClusteringGiuseppe Attardi, Maria Simi and Stefano Dei Rossi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

    FCC: Modeling Probabilities with GIZA++ for Task 2 and 3 of SemEval-2
    Darnes Vilarino Ayala, Carlos Balderas Posada, David Eduardo Pinto Avendano, Miguel Rodríguez Hernandez and Saul Leon Silverio . . . . . . 112

    Combining Dictionaries and Contextual Information for Cross-Lingual Lexical SubstitutionWilker Aziz and Lucia Specia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

    SWAT: Cross-Lingual Lexical Substitution using Local Context Matching, Bilingual Dictionaries andMachine Translation

    Richard Wicentowski, Maria Kelly and Rachel Lee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

    COLEPL and COLSLM: An Unsupervised WSD Approach to Multilingual Lexical Substitution, Tasks 2and 3 SemEval 2010

    Weiwei Guo and Mona Diab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

    UHD: Cross-Lingual Word Sense Disambiguation Using Multilingual Co-Occurrence GraphsCarina Silberer and Simone Paolo Ponzetto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

    OWNS: Cross-lingual Word Sense Disambiguation Using Weighted Overlap Counts and Wordnet BasedSimilarity Measures

    Lipta Mahapatra, Meera Mohan, Mitesh Khapra and Pushpak Bhattacharyya . . . . . . . . . . . . . . . . 138

    273. Task 5. Keyphrase Extraction Based on Core Word Identification and Word ExpansionYou Ouyang, Wenjie Li and Renxian Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

    DERIUNLP: A Context Based Approach to Automatic Keyphrase ExtractionGeorgeta Bordea and Paul Buitelaar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

    DFKI KeyWE: Ranking Keyphrases Extracted from Scientific ArticlesKathrin Eichler and Gunter Neumann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

    Single Document Keyphrase Extraction Using Sentence Clustering and Latent Dirichlet AllocationClaude Pasquier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

    SJTULTLAB: Chunk Based Method for Keyphrase ExtractionLetian Wang and Fang Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158


  • Likey: Unsupervised Language-Independent Keyphrase ExtractionMari-Sanna Paukkeri and Timo Honkela . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

    WINGNUS: Keyphrase Extraction Utilizing Document Logical StructureThuy Dung Nguyen and Minh-Thang Luong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

    KX: A Flexible System for Keyphrase eXtractionEmanuele Pianta and Sara Tonelli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

    BUAP: An Unsupervised Approach to Automatic Keyphrase Extraction from Scientific ArticlesRoberto Ortiz, David Pinto, Mireya Tovar and Hector Jimenez-Salazar . . . . . . . . . . . . . . . . . . . . . . 174

    UNPMC: Naive Approach to Extract Keyphrases from Scientific ArticlesJungyeul Park, Jong Gun Lee and Beatrice Daille . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

    SEERLAB: A System for Extracting Keyphrases from Scholarly DocumentsPucktada Treeratpituk, Pradeep Teregowda, Jian Huang and C. Lee Giles . . . . . . . . . . . . . . . . . . . .182

    SZTERGAK : Feature Engineering for Keyphrase ExtractionGabor Berend and Richard Farkas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

    KP-Miner: Participation in SemEval-2Samhaa R. El-Beltagy and Ahmed Rafea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

    UvT: The UvT Term Extraction System in the Keyphrase Extraction TaskKalliopi Zervanou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

    UNITN: Part-Of-Speech Counting in Relation ExtractionFabio Celli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

    FBK NK: A WordNet-Based System for Multi-Way Classification of Semantic RelationsMatteo Negri and Milen Kouylekov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

    JU: A Supervised Approach to Identify Semantic Relations from Paired NominalsSantanu Pal, Partha Pakray, Dipankar Das and Sivaji Bandyopadhyay . . . . . . . . . . . . . . . . . . . . . . . 206

    TUD: Semantic Relatedness for Relation ClassificationGyorgy Szarvas and Iryna Gurevych . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

    FBK-IRST: Semantic Relation Extraction Using CycKateryna Tymoshenko and Claudio Giuliano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

    ISTI@SemEval-2 Task 8: Boosting-Based Multiway Relation ClassificationAndrea Esuli, Diego Marcheggiani and Fabrizio Sebastiani . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

    ISI: Automatic Classification of Relations Between Nominals Using a Maximum Entropy ClassifierStephen Tratz and Eduard Hovy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

    ECNU: Effective Semantic Relations Classification without Complicated Features or Multiple ExternalCorpora

    Yuan Chen, Man Lan, Jian Su, Zhi Min Zhou and Yu Xu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

    UCD-Goggle: A Hybrid System for Noun Compound ParaphrasingGuofu Li, Alejandra Lopez-Fernandez and Tony Veale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230


  • UCD-PN: Selecting General Paraphrases Using Conditional ProbabilityPaul Nulty and Fintan Costello . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

    UvT-WSD1: A Cross-Lingual Word Sense Disambiguation SystemMaarten van Gompel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

    UBA: Using Automatic Translation and Wikipedia for Cross-Lingual Lexical SubstitutionPierpaolo Basile and Giovanni Semeraro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

    HUMB: Automatic Key Term Extraction from Scientific Articles in GROBIDPatrice Lopez and Laurent Romary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

    UTDMet: Combining WordNet and Corpus Data for Argument Coercion DetectionKirk Roberts and Sanda Harabagiu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252

    UTD: Classifying Semantic Relations by Combining Lexical and Semantic ResourcesBryan Rink and Sanda Harabagiu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

    UvT: Memory-Based Pairwise Ranking of Paraphrasing VerbsSander Wubben . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

    SEMAFOR: Frame Argument Resolution with Log-Linear ModelsDesai Chen, Nathan Schneider, Dipanjan Das and Noah A. Smith . . . . . . . . . . . . . . . . . . . . . . . . . . 264

    Cambridge: Parser Evaluation Using Textual Entailment by Grammatical Relation ComparisonLaura Rimell and Stephen Clark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

    MARS: A Specialized RTE System for Parser EvaluationRui Wang and Yi Zhang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

    TRIPS and TRIOS System for TempEval-2: Extracting Temporal Information from TextNaushad UzZaman and James Allen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

    TIPSem (English and Spanish): Evaluating CRFs and Semantic Roles in TempEval-2Hector Llorens, Estela Saquete and Borja Navarro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284

    CityU-DAC: Disambiguating Sentiment-Ambiguous Adjectives within ContextBin LU and Benjamin K. Tsou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

    VENSES++: Adapting a deep semantic processing system to the identification of null instantiationsSara Tonelli and Rodolfo Delmonte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296

    CLR: Linking Events and Their Participants in Discourse Using a Comprehensive FrameNet DictionaryKen Litkowski . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

    PKU HIT: An Event Detection System Based on Instances Expansion and Rich Syntactic FeaturesShiqi Li, Pengyuan Liu, Tiejun Zhao, Qin Lu and Hanjing Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

    372:Comparing the Benefit of Different Dependency Parsers for Textual Entailment Using SyntacticConstraints Only

    Alexander Volokh and Gunter Neumann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

    SCHWA: PETE Using CCG Dependencies with the C&C ParserDominick Ng, James W.D. Constable, Matthew Honnibal and James R. Curran . . . . . . . . . . . . . . 313


  • ID 392:TERSEO + T2T3 Transducer. A systems for Recognizing and Normalizing TIMEX3Estela Saquete Boro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317

    HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal ExpressionsJannik Strotgen and Michael Gertz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321

    KUL: Recognition and Normalization of Temporal ExpressionsOleksandr Kolomiyets and Marie-Francine Moens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

    UC3M System: Determining the Extent, Type and Value of Time Expressions in TempEval-2
    María Teresa Vicente-Díez, Julian Moreno-Schneider and Paloma Martínez . . . . . . 329

    Edinburgh-LTG: TempEval-2 System DescriptionClaire Grover, Richard Tobin, Beatrice Alex and Kate Byrne . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333

    USFD2: Annotating Temporal Expresions and TLINKs for TempEval-2Leon Derczynski and Robert Gaizauskas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

    NCSU: Modeling Temporal Relations with Markov Logic and Lexical OntologyEun Ha, Alok Baikadi, Carlyle Licata and James Lester . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

    JU CSE TEMP: A First Step towards Evaluating Events, Time Expressions and Temporal RelationsAnup Kumar Kolya, Asif Ekbal and Sivaji Bandyopadhyay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

    KCDC: Word Sense Induction by Using Grammatical Dependencies and Sentence Phrase StructureRoman Kern, Markus Muhr and Michael Granitzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .351

    UoY: Graphs of Unambiguous Vertices for Word Sense Induction and DisambiguationIoannis Korkontzelos and Suresh Manandhar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355

    HERMIT: Flexible Clustering for the SemEval-2 WSI TaskDavid Jurgens and Keith Stevens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

    Duluth-WSI: SenseClusters Applied to the Sense Induction Task of SemEval-2Ted Pedersen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363

    KSU KDD: Word Sense Induction by Clustering in Topic SpaceWesam Elshamy, Doina Caragea and William Hsu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367

    PengYuan@PKU: Extracting Infrequent Sense Instance with the Same N-Gram Pattern for the SemEval-2010 Task 15

    Peng-Yuan Liu, Shi-Wen Yu, Shui Liu and Tie-Jun Zhao. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .371

    RALI: Automatic Weighting of Text Window DistancesBernard Brosseau-Villeneuve, Noriko Kando and Jian-Yun Nie . . . . . . . . . . . . . . . . . . . . . . . . . . . . .375

    JAIST: Clustering and Classification Based Approaches for Japanese WSDKiyoaki Shirai and Makoto Nakamura . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379

    MSS: Investigating the Effectiveness of Domain Combinations and Topic Features for Word Sense Dis-ambiguation

    Sanae Fujita, Kevin Duh, Akinori Fujino, Hirotoshi Taira and Hiroyuki Shindo . . . . . . . . . . . . . . 383

    IIITH: Domain Specific Word Sense DisambiguationSiva Reddy, Abhilash Inumella, Diana McCarthy and Mark Stevenson . . . . . . . . . . . . . . . . . . . . . . 387


  • UCF-WS: Domain Word Sense Disambiguation Using Web SelectorsHansen A. Schwartz and Fernando Gomez . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392

    TreeMatch: A Fully Unsupervised WSD System Using Dependency Knowledge on a Specific DomainAndrew Tran, Chris Bowes, David Brown, Ping Chen, Max Choly and Wei Ding . . . . . . . . . . . . 396

    GPLSI-IXA: Using Semantic Classes to Acquire Monosemous Training Examples from Domain TextsRuben Izquierdo, Armando Suarez and German Rigau . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402

    HIT-CIR: An Unsupervised WSD System Based on Domain Most Frequent Sense EstimationYuhang Guo, Wanxiang Che, Wei He, Ting Liu and Sheng Li . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407

    RACAI: Unsupervised WSD Experiments @ SemEval-2, Task 17Radu Ion and Dan Stefanescu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411

    Kyoto: An Integrated System for Specific Domain WSDAitor Soroa, Eneko Agirre, Oier Lopez de Lacalle, Wauter Bosma, Piek Vossen, Monica Monachini,

    Jessie Lo and Shu-Kai Hsieh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417

    CFILT: Resource Conscious Approaches for All-Words Domain Specific WSDAnup Kulkarni, Mitesh Khapra, Saurabh Sohoney and Pushpak Bhattacharyya . . . . . . . . . . . . . . . 421

    UMCC-DLSI: Integrative Resource for Disambiguation TaskYoan Gutierrez Vazquez, Antonio Fernandez Orqun, Andres Montoyo Guijarro and Sonia Vazquez

    Perez . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .427

    HR-WSD: System Description for All-Words Word Sense Disambiguation on a Specific Domain at SemEval-2010

    Meng-Hsien Shih . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433

    Twitter Based System: Using Twitter for Disambiguating Sentiment Ambiguous AdjectivesAlexander Pak and Patrick Paroubek. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .436

    YSC-DSAA: An Approach to Disambiguate Sentiment Ambiguous Adjectives Based on SAAOLShi-Cai Yang and Mei-Juan Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440

    OpAL: Applying Opinion Mining Techniques for the Disambiguation of Sentiment Ambiguous Adjectivesin SemEval-2 Task 18

    Alexandra Balahur and Andres Montoyo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444

    HITSZ CITYU: Combine Collocation, Context Words and Neighboring Sentence Sentiment in SentimentAdjectives Disambiguation

    Ruifeng Xu, Jun Xu and Chunyu Kit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .448


  • Conference Program

    Thursday, July 15, 2010

    09:00-10:40 Task description papers

    09:00-09:20 SemEval-2010 Task 1: Coreference Resolution in Multiple Languages
    Marta Recasens, Lluís Màrquez, Emili Sapena, M. Antònia Martí, Mariona Taulé, Véronique Hoste, Massimo Poesio and Yannick Versley

    09:2009:40 SemEval-2010 Task 2: Cross-Lingual Lexical SubstitutionRada Mihalcea, Ravi Sinha and Diana McCarthy

    09:4010:00 SemEval-2010 Task 3: Cross-Lingual Word Sense DisambiguationEls Lefever and Veronique Hoste

    10:0010:20 SemEval-2010 Task 5 : Automatic Keyphrase Extraction from Scientific ArticlesSu Nam Kim, Olena Medelyan, Min-Yen Kan and Timothy Baldwin

    10:2010:40 SemEval-2010 Task 7: Argument Selection and CoercionJames Pustejovsky, Anna Rumshisky, Alex Plotnick, Elisabetta Jezek, OlgaBatiukova and Valeria Quochi

    11:00-12:40 Task description papers

    11:0011:20 SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations betweenPairs of NominalsIris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid OSeaghdha, Sebastian Pado, Marco Pennacchiotti, Lorenza Romano and Stan Sz-pakowicz

    11:2011:40 SemEval-2 Task 9: The Interpretation of Noun Compounds Using ParaphrasingVerbs and PrepositionsCristina Butnariu, Su Nam Kim, Preslav Nakov, Diarmuid O Seaghdha, Stan Sz-pakowicz and Tony Veale

    11:4012:00 SemEval-2010 Task 10: Linking Events and Their Participants in DiscourseJosef Ruppenhofer, Caroline Sporleder, Roser Morante, Collin Baker and MarthaPalmer

    12:0012:20 SemEval-2010 Task 12: Parser Evaluation Using Textual EntailmentsDeniz Yuret, Aydin Han and Zehra Turgut

    12:2012:40 SemEval-2010 Task 13: TempEval-2Marc Verhagen, Roser Sauri, Tommaso Caselli and James Pustejovsky


  • Thursday, July 15, 2010 (continued)

    14:00-15:20 Task description papers

    14:0014:20 SemEval-2010 Task 14: Word Sense Induction & DisambiguationSuresh Manandhar, Ioannis Klapaftis, Dmitriy Dligach and Sameer Pradhan

    14:2014:40 SemEval-2010 Task: Japanese WSDManabu Okumura, Kiyoaki Shirai, Kanako Komiya and Hikaru Yokono

    14:4015:00 SemEval-2010 Task 17: All-Words Word Sense Disambiguation on a Specific DomainEneko Agirre, Oier Lopez de Lacalle, Christiane Fellbaum, Shu-Kai Hsieh, MaurizioTesconi, Monica Monachini, Piek Vossen and Roxanne Segers

    15:0015:20 SemEval-2010 Task 18: Disambiguating Sentiment Ambiguous AdjectivesYunfang Wu and Peng Jin

    16:00-17:30 Task description posters

    SemEval-2010 Task 11: Event Detection in Chinese News SentencesQiang Zhou

    SemEval-2 Task 15: Infrequent Sense Identification for Mandarin Text to Speech SystemsPeng Jin and Yunfang Wu

    16:00-17:30 Posters

    RelaxCor: A Global Relaxation Labeling Approach to Coreference Resolution
    Emili Sapena, Lluís Padró and Jordi Turmo

    SUCRE: A Modular System for Coreference ResolutionHamidreza Kobdani and Hinrich Schutze

    UBIU: A Language-Independent System for Coreference ResolutionDesislava Zhekova and Sandra Kubler

    Corry: A System for Coreference ResolutionOlga Uryupina


  • Thursday, July 15, 2010 (continued)

    BART: A Multilingual Anaphora Resolution SystemSamuel Broscheit, Massimo Poesio, Simone Paolo Ponzetto, Kepa Joseba Rodriguez,Lorenza Romano, Olga Uryupina, Yannick Versley and Roberto Zanoli

    TANL-1: Coreference Resolution by Parse Analysis and Similarity ClusteringGiuseppe Attardi, Maria Simi and Stefano Dei Rossi

    FCC: Modeling Probabilities with GIZA++ for Task 2 and 3 of SemEval-2
    Darnes Vilarino Ayala, Carlos Balderas Posada, David Eduardo Pinto Avendano, Miguel Rodríguez Hernandez and Saul Leon Silverio

    Combining Dictionaries and Contextual Information for Cross-Lingual Lexical Substitu-tionWilker Aziz and Lucia Specia

    SWAT: Cross-Lingual Lexical Substitution using Local Context Matching, Bilingual Dic-tionaries and Machine TranslationRichard Wicentowski, Maria Kelly and Rachel Lee

    COLEPL and COLSLM: An Unsupervised WSD Approach to Multilingual Lexical Substi-tution, Tasks 2 and 3 SemEval 2010Weiwei Guo and Mona Diab

    UHD: Cross-Lingual Word Sense Disambiguation Using Multilingual Co-OccurrenceGraphsCarina Silberer and Simone Paolo Ponzetto

    OWNS: Cross-lingual Word Sense Disambiguation Using Weighted Overlap Counts andWordnet Based Similarity MeasuresLipta Mahapatra, Meera Mohan, Mitesh Khapra and Pushpak Bhattacharyya

    273. Task 5. Keyphrase Extraction Based on Core Word Identification and Word ExpansionYou Ouyang, Wenjie Li and Renxian Zhang

    DERIUNLP: A Context Based Approach to Automatic Keyphrase ExtractionGeorgeta Bordea and Paul Buitelaar

    DFKI KeyWE: Ranking Keyphrases Extracted from Scientific ArticlesKathrin Eichler and Gunter Neumann

    Single Document Keyphrase Extraction Using Sentence Clustering and Latent DirichletAllocationClaude Pasquier


  • Thursday, July 15, 2010 (continued)

    SJTULTLAB: Chunk Based Method for Keyphrase ExtractionLetian Wang and Fang Li

    Likey: Unsupervised Language-Independent Keyphrase ExtractionMari-Sanna Paukkeri and Timo Honkela

    WINGNUS: Keyphrase Extraction Utilizing Document Logical StructureThuy Dung Nguyen and Minh-Thang Luong

    KX: A Flexible System for Keyphrase eXtractionEmanuele Pianta and Sara Tonelli

    BUAP: An Unsupervised Approach to Automatic Keyphrase Extraction from Scientific Ar-ticlesRoberto Ortiz, David Pinto, Mireya Tovar and Hector Jimenez-Salazar

    UNPMC: Naive Approach to Extract Keyphrases from Scientific ArticlesJungyeul Park, Jong Gun Lee and Beatrice Daille

    SEERLAB: A System for Extracting Keyphrases from Scholarly DocumentsPucktada Treeratpituk, Pradeep Teregowda, Jian Huang and C. Lee Giles

    SZTERGAK : Feature Engineering for Keyphrase ExtractionGabor Berend and Richard Farkas

    KP-Miner: Participation in SemEval-2Samhaa R. El-Beltagy and Ahmed Rafea

    UvT: The UvT Term Extraction System in the Keyphrase Extraction TaskKalliopi Zervanou

    UNITN: Part-Of-Speech Counting in Relation ExtractionFabio Celli

    FBK NK: A WordNet-Based System for Multi-Way Classification of Semantic RelationsMatteo Negri and Milen Kouylekov


  • Thursday, July 15, 2010 (continued)

    JU: A Supervised Approach to Identify Semantic Relations from Paired NominalsSantanu Pal, Partha Pakray, Dipankar Das and Sivaji Bandyopadhyay

    TUD: Semantic Relatedness for Relation ClassificationGyorgy Szarvas and Iryna Gurevych

    FBK-IRST: Semantic Relation Extraction Using CycKateryna Tymoshenko and Claudio Giuliano

    ISTI@SemEval-2 Task 8: Boosting-Based Multiway Relation ClassificationAndrea Esuli, Diego Marcheggiani and Fabrizio Sebastiani

    ISI: Automatic Classification of Relations Between Nominals Using a Maximum EntropyClassifierStephen Tratz and Eduard Hovy

    ECNU: Effective Semantic Relations Classification without Complicated Features or Mul-tiple External CorporaYuan Chen, Man Lan, Jian Su, Zhi Min Zhou and Yu Xu

    UCD-Goggle: A Hybrid System for Noun Compound ParaphrasingGuofu Li, Alejandra Lopez-Fernandez and Tony Veale

    UCD-PN: Selecting General Paraphrases Using Conditional ProbabilityPaul Nulty and Fintan Costello

    Friday, July 16, 2010

    09:00-10:30 System papers

    09:0009:15 UvT-WSD1: A Cross-Lingual Word Sense Disambiguation SystemMaarten van Gompel

    09:1509:30 UBA: Using Automatic Translation and Wikipedia for Cross-Lingual Lexical SubstitutionPierpaolo Basile and Giovanni Semeraro

    09:3009:45 HUMB: Automatic Key Term Extraction from Scientific Articles in GROBIDPatrice Lopez and Laurent Romary


  • Friday, July 16, 2010 (continued)

    09:4510:00 UTDMet: Combining WordNet and Corpus Data for Argument Coercion DetectionKirk Roberts and Sanda Harabagiu

    10:0010:15 UTD: Classifying Semantic Relations by Combining Lexical and Semantic ResourcesBryan Rink and Sanda Harabagiu

    10:1510:30 UvT: Memory-Based Pairwise Ranking of Paraphrasing VerbsSander Wubben

    11:00-12:30 System papers

    11:0011:15 SEMAFOR: Frame Argument Resolution with Log-Linear ModelsDesai Chen, Nathan Schneider, Dipanjan Das and Noah A. Smith

    11:1511:30 Cambridge: Parser Evaluation Using Textual Entailment by Grammatical Relation Com-parisonLaura Rimell and Stephen Clark

    11:3011:45 MARS: A Specialized RTE System for Parser EvaluationRui Wang and Yi Zhang

    11:4512:00 TRIPS and TRIOS System for TempEval-2: Extracting Temporal Information from TextNaushad UzZaman and James Allen

    12:0012:15 TIPSem (English and Spanish): Evaluating CRFs and Semantic Roles in TempEval-2Hector Llorens, Estela Saquete and Borja Navarro

    12:1512:30 CityU-DAC: Disambiguating Sentiment-Ambiguous Adjectives within ContextBin LU and Benjamin K. Tsou

    14:00-15:30 PANEL

    16:00-17:30 Posters

    VENSES++: Adapting a deep semantic processing system to the identification of nullinstantiationsSara Tonelli and Rodolfo Delmonte


  • Friday, July 16, 2010 (continued)

    CLR: Linking Events and Their Participants in Discourse Using a ComprehensiveFrameNet DictionaryKen Litkowski

    PKU HIT: An Event Detection System Based on Instances Expansion and Rich SyntacticFeaturesShiqi Li, Pengyuan Liu, Tiejun Zhao, Qin Lu and Hanjing Li

    372:Comparing the Benefit of Different Dependency Parsers for Textual Entailment UsingSyntactic Constraints OnlyAlexander Volokh and Gunter Neumann

    SCHWA: PETE Using CCG Dependencies with the C&C ParserDominick Ng, James W.D. Constable, Matthew Honnibal and James R. Curran

    ID 392:TERSEO + T2T3 Transducer. A systems for Recognizing and NormalizingTIMEX3Estela Saquete Boro

    HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expres-sionsJannik Strotgen and Michael Gertz

    KUL: Recognition and Normalization of Temporal ExpressionsOleksandr Kolomiyets and Marie-Francine Moens

    UC3M System: Determining the Extent, Type and Value of Time Expressions in TempEval-2
    María Teresa Vicente-Díez, Julian Moreno-Schneider and Paloma Martínez

    Edinburgh-LTG: TempEval-2 System DescriptionClaire Grover, Richard Tobin, Beatrice Alex and Kate Byrne

    USFD2: Annotating Temporal Expresions and TLINKs for TempEval-2Leon Derczynski and Robert Gaizauskas

    NCSU: Modeling Temporal Relations with Markov Logic and Lexical OntologyEun Ha, Alok Baikadi, Carlyle Licata and James Lester

    JU CSE TEMP: A First Step towards Evaluating Events, Time Expressions and TemporalRelationsAnup Kumar Kolya, Asif Ekbal and Sivaji Bandyopadhyay


  • Friday, July 16, 2010 (continued)

    KCDC: Word Sense Induction by Using Grammatical Dependencies and Sentence PhraseStructureRoman Kern, Markus Muhr and Michael Granitzer

    UoY: Graphs of Unambiguous Vertices for Word Sense Induction and DisambiguationIoannis Korkontzelos and Suresh Manandhar

    HERMIT: Flexible Clustering for the SemEval-2 WSI TaskDavid Jurgens and Keith Stevens

    Duluth-WSI: SenseClusters Applied to the Sense Induction Task of SemEval-2Ted Pedersen

    KSU KDD: Word Sense Induction by Clustering in Topic SpaceWesam Elshamy, Doina Caragea and William Hsu

    PengYuan@PKU: Extracting Infrequent Sense Instance with the Same N-Gram Pattern forthe SemEval-2010 Task 15Peng-Yuan Liu, Shi-Wen Yu, Shui Liu and Tie-Jun Zhao

    RALI: Automatic Weighting of Text Window DistancesBernard Brosseau-Villeneuve, Noriko Kando and Jian-Yun Nie

    JAIST: Clustering and Classification Based Approaches for Japanese WSDKiyoaki Shirai and Makoto Nakamura

    MSS: Investigating the Effectiveness of Domain Combinations and Topic Features forWord Sense DisambiguationSanae Fujita, Kevin Duh, Akinori Fujino, Hirotoshi Taira and Hiroyuki Shindo

    IIITH: Domain Specific Word Sense DisambiguationSiva Reddy, Abhilash Inumella, Diana McCarthy and Mark Stevenson

    UCF-WS: Domain Word Sense Disambiguation Using Web SelectorsHansen A. Schwartz and Fernando Gomez

    TreeMatch: A Fully Unsupervised WSD System Using Dependency Knowledge on a Spe-cific DomainAndrew Tran, Chris Bowes, David Brown, Ping Chen, Max Choly and Wei Ding


  • Friday, July 16, 2010 (continued)

    GPLSI-IXA: Using Semantic Classes to Acquire Monosemous Training Examples fromDomain TextsRuben Izquierdo, Armando Suarez and German Rigau

    HIT-CIR: An Unsupervised WSD System Based on Domain Most Frequent Sense Estima-tionYuhang Guo, Wanxiang Che, Wei He, Ting Liu and Sheng Li

    RACAI: Unsupervised WSD Experiments @ SemEval-2, Task 17Radu Ion and Dan Stefanescu

    Kyoto: An Integrated System for Specific Domain WSDAitor Soroa, Eneko Agirre, Oier Lopez de Lacalle, Wauter Bosma, Piek Vossen, MonicaMonachini, Jessie Lo and Shu-Kai Hsieh

    CFILT: Resource Conscious Approaches for All-Words Domain Specific WSDAnup Kulkarni, Mitesh Khapra, Saurabh Sohoney and Pushpak Bhattacharyya

    UMCC-DLSI: Integrative Resource for Disambiguation TaskYoan Gutierrez Vazquez, Antonio Fernandez Orqun, Andres Montoyo Guijarro and SoniaVazquez Perez

    HR-WSD: System Description for All-Words Word Sense Disambiguation on a SpecificDomain at SemEval-2010Meng-Hsien Shih

    Twitter Based System: Using Twitter for Disambiguating Sentiment Ambiguous AdjectivesAlexander Pak and Patrick Paroubek

    YSC-DSAA: An Approach to Disambiguate Sentiment Ambiguous Adjectives Based onSAAOLShi-Cai Yang and Mei-Juan Liu

    OpAL: Applying Opinion Mining Techniques for the Disambiguation of Sentiment Am-biguous Adjectives in SemEval-2 Task 18Alexandra Balahur and Andres Montoyo

    HITSZ CITYU: Combine Collocation, Context Words and Neighboring Sentence Senti-ment in Sentiment Adjectives DisambiguationRuifeng Xu, Jun Xu and Chunyu Kit


  • Proceedings of the 5th International Workshop on Semantic Evaluation, ACL 2010, pages 1-8, Uppsala, Sweden, 15-16 July 2010. ©2010 Association for Computational Linguistics

    SemEval-2010 Task 1: Coreference Resolution in Multiple Languages

    Marta Recasens, Lluís Màrquez, Emili Sapena, M. Antònia Martí, Mariona Taulé, Véronique Hoste, Massimo Poesio, Yannick Versley

    CLiC, University of Barcelona: {mrecasens,amarti,mtaule}@ub.edu
    TALP, Technical University of Catalonia: {lluism,esapena}@lsi.upc.edu
    University College Ghent: [email protected]
    University of Essex / University of Trento: [email protected]
    University of Tübingen: [email protected]

    Abstract

    This paper presents the SemEval-2010 task on Coreference Resolution in Multiple Languages. The goal was to evaluate and compare automatic coreference resolution systems for six different languages (Catalan, Dutch, English, German, Italian, and Spanish) in four evaluation settings and using four different metrics. Such a rich scenario had the potential to provide insight into key issues concerning coreference resolution: (i) the portability of systems across languages, (ii) the relevance of different levels of linguistic information, and (iii) the behavior of scoring metrics.

    1 Introduction

    The task of coreference resolution, defined as the identification of the expressions in a text that refer to the same discourse entity (1), has attracted considerable attention within the NLP community.

    (1) Major League Baseball sent its head of security to Chicago to review the second incident of an on-field fan attack in the last seven months. The league is reviewing security at all ballparks to crack down on spectator violence.

    Using coreference information has been shown to be beneficial in a number of NLP applications including Information Extraction (McCarthy and Lehnert, 1995), Text Summarization (Steinberger et al., 2007), Question Answering (Morton, 1999), and Machine Translation. There have been a few evaluation campaigns on coreference resolution in the past, namely MUC (Hirschman and Chinchor, 1997), ACE (Doddington et al., 2004), and ARE (Orasan et al., 2008), yet many questions remain open:

    To what extent is it possible to implement a general coreference resolution system portable to different languages? How much language-specific tuning is necessary?

    How helpful are morphology, syntax and semantics for solving coreference relations? How much preprocessing is needed? Does its quality (perfect linguistic input versus noisy automatic input) really matter?

    How (dis)similar are different coreference evaluation metrics (MUC, B-CUBED, CEAF and BLANC)? Do they all provide the same ranking? Are they correlated?

    Our goal was to address these questions in a shared task. Given six datasets in Catalan, Dutch, English, German, Italian, and Spanish, the task we present involved automatically detecting full coreference chains, composed of named entities (NEs), pronouns, and full noun phrases, in four different scenarios. For more information, the reader is referred to the task website.[1]

    The rest of the paper is organized as follows. Section 2 presents the corpora from which the task datasets were extracted, and the automatic tools used to preprocess them. In Section 3, we describe the task by providing information about the data format, evaluation settings, and evaluation metrics. Participating systems are described in Section 4, and their results are analyzed and compared in Section 5. Finally, Section 6 concludes.

    2 Linguistic Resources

    In this section, we first present the sources of the data used in the task. We then describe the automatic tools that predicted input annotations for the coreference resolution systems.

    [1] http://stel.ub.edu/semeval2010-coref


                Training                      Development                   Test
                #docs  #sents  #tokens        #docs  #sents  #tokens        #docs  #sents  #tokens
    Catalan       829   8,709  253,513          142   1,445   42,072          167   1,698   49,260
    Dutch         145   2,544   46,894           23     496    9,165           72   2,410   48,007
    English       229   3,648   79,060           39     741   17,044           85   1,141   24,206
    German        900  19,233  331,614          199   4,129   73,145          136   2,736   50,287
    Italian        80   2,951   81,400           17     551   16,904           46   1,494   41,586
    Spanish       875   9,022  284,179          140   1,419   44,460          168   1,705   51,040

    Table 1: Size of the task datasets.

    2.1 Source Corpora

    Catalan and Spanish: The AnCora corpora (Recasens and Martí, 2009) consist of a Catalan and a Spanish treebank of 500k words each, mainly from newspapers and news agencies (El Periódico, EFE, ACN). Manual annotation exists for arguments and thematic roles, predicate semantic classes, NEs, WordNet nominal senses, and coreference relations. AnCora are freely available for research purposes.

    Dutch: The KNACK-2002 corpus (Hoste and De Pauw, 2006) contains 267 documents from the Flemish weekly magazine Knack. They were manually annotated with coreference information on top of semi-automatically annotated PoS tags, phrase chunks, and NEs.

    English: The OntoNotes Release 2.0 corpus (Pradhan et al., 2007) covers newswire and broadcast news data: 300k words from The Wall Street Journal, and 200k words from the TDT-4 collection, respectively. OntoNotes builds on the Penn Treebank for syntactic annotation and on the Penn PropBank for predicate argument structures. Semantic annotations include NEs, word senses (linked to an ontology), and coreference information. The OntoNotes corpus is distributed by the Linguistic Data Consortium.[2]

    German: The TüBa-D/Z corpus (Hinrichs et al., 2005) is a newspaper treebank based on data taken from the daily issues of die tageszeitung (taz). It currently comprises 794k words manually annotated with semantic and coreference information. Due to licensing restrictions of the original texts, a taz-DVD must be purchased to obtain a license.[2]

    Italian: The LiveMemories corpus (Rodríguez et al., 2010) will include texts from the Italian Wikipedia, blogs, news articles, and dialogues (MapTask). They are being annotated according to the ARRAU annotation scheme with coreference, agreement, and NE information on top of automatically parsed data. The task dataset included Wikipedia texts already annotated.

    [2] Free user license agreements for the English and German task datasets were issued to the task participants.

    The datasets that were used in the task were extracted from the above-mentioned corpora. Table 1 summarizes the number of documents (docs), sentences (sents), and tokens in the training, development and test sets.[3]

    2.2 Preprocessing Systems

    Catalan, Spanish, English: Predicted lemmas and PoS were generated using FreeLing[4] for Catalan/Spanish and SVMTagger[5] for English. Dependency information and predicate semantic roles were generated with JointParser, a syntactic-semantic parser.[6]

    Dutch: Lemmas, PoS and NEs were automatically provided by the memory-based shallow parser for Dutch (Daelemans et al., 1999), and dependency information by the Alpino parser (van Noord et al., 2006).

    German: Lemmas were predicted by TreeTagger (Schmid, 1995), PoS and morphology by RFTagger (Schmid and Laws, 2008), and dependency information by MaltParser (Hall and Nivre, 2008).

    Italian: Lemmas and PoS were provided by TextPro,[7] and dependency information by MaltParser.[8]

    [3] The German and Dutch training datasets were not completely stable during the competition period due to a few errors. Revised versions were released on March 2 and 20, respectively. As to the test datasets, the Dutch and Italian documents with formatting errors were corrected after the evaluation period, with no variations in the ranking order of systems.

    [4] http://www.lsi.upc.es/~nlp/freeling
    [5] http://www.lsi.upc.edu/~nlp/SVMTool
    [6] http://www.lsi.upc.edu/~xlluis/?x=cat:5
    [7] http://textpro.fbk.eu
    [8] http://maltparser.org


  • 3 Task Description

    Participants were asked to develop an automatic system capable of assigning a discourse entity to every mention,[9] thus identifying all the NP mentions of every discourse entity. As there is no standard annotation scheme for coreference and the source corpora differed in certain aspects, the coreference information of the task datasets was produced according to three criteria:

    Only NP constituents and possessive determiners can be mentions.

    Mentions must be referential expressions, thus ruling out nominal predicates, appositives, expletive NPs, attributive NPs, NPs within idioms, etc.

    Singletons are also considered as entities (i.e., entities with a single mention).

    To help participants build their systems, the task datasets also contained both gold-standard and automatically predicted linguistic annotations at the morphological, syntactic and semantic levels. Considerable effort was devoted to providing participants with a common and relatively simple data representation for the six languages.

    3.1 Data Format

    The task datasets as well as the participants' answers were displayed in a uniform column-based format, similar to the style used in previous CoNLL shared tasks on syntactic and semantic dependencies (2008/2009).[10] Each dataset was provided as a single file per language. Since coreference is a linguistic relation at the discourse level, documents constitute the basic unit, and are delimited by "#begin document ID" and "#end document ID" comment lines. Within a document, the information of each sentence is organized vertically with one token per line, and a blank line after the last token of each sentence. The information associated with each token is described in several columns (separated by \t characters) representing the following layers of linguistic annotation.

    ID (column 1). Token identifiers in the sentence.
    Token (column 2). Word forms.

    [9] Following the terminology of the ACE program, a mention is defined as an instance of reference to an object, and an entity is the collection of mentions referring to the same object in a document.

    [10] http://www.cnts.ua.ac.be/conll2008

    ID   Token      Intermediate columns   Coref

    1    Major      . . .                  (1
    2    League     . . .
    3    Baseball   . . .                  1)
    4    sent       . . .
    5    its        . . .                  (1)|(2
    6    head       . . .
    7    of         . . .
    8    security   . . .                  (3)|2)
    9    to         . . .
    ...
    27   The        . . .                  (1
    28   league     . . .                  1)
    29   is         . . .

    Table 2: Format of the coreference annotations (corresponding to example (1) in Section 1).

    Lemma (column 3). Token lemmas.
    PoS (column 5). Coarse PoS.
    Feat (column 7). Morphological features (PoS type, number, gender, case, tense, aspect, etc.) separated by a pipe character.
    Head (column 9). ID of the syntactic head (0 if the token is the tree root).
    DepRel (column 11). Dependency relations corresponding to the dependencies described in the Head column ("sentence" if the token is the tree root).
    NE (column 13). NE types in open-close notation.
    Pred (column 15). Predicate semantic class.
    APreds (column 17 and subsequent ones). For each predicate in the Pred column, its semantic roles/dependencies.
    Coref (last column). Coreference relations in open-close notation.

    The above-mentioned columns are gold-standard columns, whereas columns 4, 6, 8, 10, 12, 14, 16 and the penultimate contain the same information as the respective previous column but automatically predicted, using the preprocessing systems listed in Section 2.2. Neither all layers of linguistic annotation nor all gold-standard and predicted columns were available for all six languages (underscore characters indicate missing information).

    The coreference column follows an open-close notation with an entity number in parentheses (see Table 2). Every entity has an ID number, and every mention is marked with the ID of the entity it refers to: an opening parenthesis shows the beginning of the mention (first token), while a closing parenthesis shows the end of the mention (last token). For tokens belonging to more than one mention, a pipe character is used to separate multiple entity IDs. The resulting annotation is a well-formed nested structure (CF language).
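    The open-close notation can be decoded with very little bookkeeping. The following is a minimal Python sketch, written for illustration only (it is not part of the task distribution); it assumes that an underscore marks tokens outside any mention and that nested mentions of the same entity close in last-opened, first-closed order.

        from collections import defaultdict

        def parse_coref_column(coref_cells):
            """Turn the Coref column of one document into per-entity mention spans.

            coref_cells: one string per token, e.g. "(1", "_", "(1)|(2", "(3)|2)".
            Returns a dict: entity ID -> list of (start, end) token offsets, 0-based inclusive.
            """
            entities = defaultdict(list)    # entity ID -> completed spans
            open_spans = defaultdict(list)  # entity ID -> stack of pending start offsets
            for i, cell in enumerate(coref_cells):
                if cell == "_":
                    continue                # token outside any mention (assumed marker)
                for part in cell.split("|"):
                    eid = int(part.strip("()"))
                    if part.startswith("("):   # a mention of entity eid starts here
                        open_spans[eid].append(i)
                    if part.endswith(")"):     # a mention of entity eid ends here
                        entities[eid].append((open_spans[eid].pop(), i))
            return dict(entities)

        # Tokens 1-9 of example (1), as in Table 2:
        cells = ["(1", "_", "1)", "_", "(1)|(2", "_", "_", "(3)|2)", "_"]
        spans = parse_coref_column(cells)
        # entity 1 -> [(0, 2), (4, 4)], entity 2 -> [(4, 7)], entity 3 -> [(7, 7)]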

    3.2 Evaluation Settings

    In order to address our goal of studying the effect of different levels of linguistic information (preprocessing) on solving coreference relations, the test was divided into four evaluation settings that differed along two dimensions.

    Gold-standard versus Regular setting. Only in the gold-standard setting were participants allowed to use the gold-standard columns, including the last one (of the test dataset) with true mention boundaries. In the regular setting, they were allowed to use only the automatically predicted columns. Obtaining better results in the gold setting would provide evidence for the relevance of using high-quality preprocessing information. Since not all columns were available for all six languages, the gold setting was only possible for Catalan, English, German, and Spanish.

    Closed versus Open setting. In the closed setting, systems had to be built strictly with the information provided in the task datasets. In contrast, there was no restriction on the resources that participants could utilize in the open setting: systems could be developed using any external tools and resources to predict the preprocessing information, e.g., WordNet, Wikipedia, etc. The only requirement was to use tools that had not been developed with the annotations of the test set. This setting provided an open door into tools or resources that improve performance.

    3.3 Evaluation Metrics

    Since there is no agreement at present on a standard measure for coreference resolution evaluation, one of our goals was to compare the rankings produced by four different measures. The task scorer provides results in the two mention-based metrics B3 (Bagga and Baldwin, 1998) and CEAF-φ3 (Luo, 2005), and the two link-based metrics MUC (Vilain et al., 1995) and BLANC (Recasens and Hovy, in prep). The first three measures have been widely used, while BLANC is a newly proposed measure that was interesting to test.
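    For concreteness, here is a textbook formulation of the link-based MUC score (Vilain et al., 1995) as a Python sketch; it is meant only to illustrate how the metric counts coreference links and is not the official task scorer.

        def muc_recall(key_entities, response_entities):
            """MUC recall: proportion of key links recovered by the response.

            key_entities, response_entities: lists of sets of mention identifiers.
            MUC precision is obtained by swapping the two arguments.
            """
            num = den = 0
            for key in key_entities:
                if len(key) < 2:
                    continue                 # singleton entities carry no links
                touched, missing = set(), 0
                for mention in key:
                    owner = next((i for i, resp in enumerate(response_entities)
                                  if mention in resp), None)
                    if owner is None:
                        missing += 1         # an unresolved mention is its own partition
                    else:
                        touched.add(owner)
                partitions = len(touched) + missing
                num += len(key) - partitions
                den += len(key) - 1
            return num / den if den else 0.0

        key = [{"MLB", "its", "the_league"}, {"its_head_of_security"}, {"security"}]
        response = [{"MLB", "its"}, {"the_league", "security"}]
        recall = muc_recall(key, response)        # 0.5
        precision = muc_recall(response, key)     # 0.5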

    The mention detection subtask is measured with recall, precision, and F1. Mentions are rewarded with 1 point if their boundaries coincide with those of the gold NP, with 0.5 points if their boundaries are within the gold NP including its head, and with 0 otherwise.
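    As an illustration, the partial-credit rule can be read as the following sketch (our paraphrase of the description above, not the official scorer): an exact boundary match earns 1 point, a mention that stays inside the gold NP and still covers its head earns 0.5, and anything else earns 0.

        def mention_reward(sys_span, gold_span, gold_head):
            """Score one system mention against one gold mention.

            sys_span, gold_span: (start, end) token offsets, inclusive.
            gold_head: token offset of the gold NP's syntactic head.
            """
            if sys_span == gold_span:
                return 1.0                           # exact boundary match
            (s_start, s_end), (g_start, g_end) = sys_span, gold_span
            inside_gold = g_start <= s_start and s_end <= g_end
            covers_head = s_start <= gold_head <= s_end
            return 0.5 if inside_gold and covers_head else 0.0

        # Gold mention "The league" spans tokens 26-27 with head "league" at 27:
        mention_reward((26, 27), (26, 27), 27)   # 1.0
        mention_reward((27, 27), (26, 27), 27)   # 0.5  (inside the gold NP, keeps the head)
        mention_reward((25, 27), (26, 27), 27)   # 0.0  (extends beyond the gold NP)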

    4 Participating Systems

    A total of twenty-two participants registered for the task and downloaded the training materials. From these, sixteen downloaded the test set but only six (of which two were task organizers) submitted valid results (corresponding to nine system runs or variants). These numbers show that the task raised considerable interest but that the final participation rate was comparatively low (slightly below 30%).

    The participating systems differed in terms of architecture, machine learning method, etc. Table 3 summarizes their main properties. Systems like BART and Corry support several machine learners, but Table 3 indicates the one used for the SemEval run. The last column indicates the external resources that were employed in the open setting; it is thus empty for systems that participated only in the closed setting. For more specific details we refer the reader to the system description papers in Erk and Strapparava (2010).

    5 Results and Evaluation

    Table 4 shows the results obtained by two naive baseline systems: (i) SINGLETONS considers each mention as a separate entity, and (ii) ALL-IN-ONE groups all the mentions in a document into a single entity. These simple baselines reveal limitations of the evaluation metrics, like the high scores of CEAF and B3 for SINGLETONS. Interestingly enough, the naive baseline scores turn out to be hard to beat by the participating systems, as Table 5 shows. Similarly, ALL-IN-ONE obtains high scores in terms of MUC. Table 4 also reveals differences between the distribution of entities in the datasets. Dutch is clearly the most divergent corpus, mainly due to the fact that it only contains singletons for NEs.
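    The two baselines are straightforward to reproduce from a list of mention spans; a hedged sketch (reusing the span representation from the parsing example above) is:

        def singletons_baseline(mentions):
            """SINGLETONS: every mention becomes its own entity."""
            return [{m} for m in mentions]

        def all_in_one_baseline(mentions):
            """ALL-IN-ONE: all mentions in the document share a single entity."""
            return [set(mentions)] if mentions else []

        mentions = [(0, 2), (4, 4), (4, 7), (7, 7), (26, 27)]
        singletons_baseline(mentions)   # five entities with one mention each
        all_in_one_baseline(mentions)   # one entity containing all five mentions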

    Table 5 displays the results of all systems for all languages and settings in the four evaluation metrics (the best scores in each setting are highlighted in bold). Results are presented sequentially by language and setting, and participating systems are ordered alphabetically. The participation of systems across languages and settings is rather irregular,[11] thus making it difficult to draw firm conclusions about the aims initially pursued by the task.

    [11] Only 45 entries in Table 5 from 192 potential cases.


    BART (Broscheit et al., 2010)
        Architecture: Closest-first with entity-mention model (English); closest-first model (German, Italian)
        ML methods: MaxEnt (English, German); decision trees (Italian)
        External resources: GermaNet & gazetteers (German); I-Cab gazetteers (Italian); Berkeley parser, Stanford NER, WordNet, Wikipedia name list, U.S. census data (English)

    Corry (Uryupina, 2010)
        Architecture: ILP, pairwise model
        ML methods: SVM
        External resources: Stanford parser & NER, WordNet, U.S. census data

    RelaxCor (Sapena et al., 2010)
        Architecture: Graph partitioning (solved by relaxation labeling)
        ML methods: Decision trees, rules
        External resources: WordNet

    SUCRE (Kobdani and Schütze, 2010)
        Architecture: Best-first clustering, relational database model, regular feature definition language
        ML methods: Decision trees, Naive Bayes, SVM, MaxEnt

    TANL-1 (Attardi et al., 2010)
        Architecture: Highest entity-mention similarity
        ML methods: MaxEnt
        External resources: PoS tagger (Italian)

    UBIU (Zhekova and Kübler, 2010)
        Architecture: Pairwise model
        ML methods: MBL

    Table 3: Main characteristics of the participating systems.

    In the following, we summarize the most relevant outcomes of the evaluation.

    Regarding languages, English concentrates the most participants (fifteen entries), followed by German (eight), Catalan and Spanish (seven each), Italian (five), and Dutch (three). The number of languages addressed by each system ranges from one (Corry) to six (UBIU and SUCRE); BART and RelaxCor addressed three languages, and TANL-1 five. The best overall results are obtained for English, followed by German, then Catalan, Spanish and Italian, and finally Dutch. Apart from differences between corpora, there are other factors that might explain this ranking: (i) the fact that most of the systems were originally developed for English, and (ii) differences in corpus size (German having the largest corpus, and Dutch the smallest).

    Regarding systems, there are no clear winners. Note that no language-setting was addressed by all six systems. The BART system, for instance, is either on its own or competing against a single system. It emerges from partial comparisons that SUCRE performs the best in closed-regular for English, German, and Italian, although it never outperforms the CEAF or B3 singleton baseline. While SUCRE always obtains the best scores according to MUC and BLANC, RelaxCor and TANL-1 usually win based on CEAF and B3. The Corry system presents three variants optimized for CEAF (Corry-C), MUC (Corry-M), and BLANC (Corry-B). Their results are consistent with the bias introduced in the optimization (see English:open-gold).

Depending on the evaluation metric, then, the rankings of systems vary, with considerable score differences. There is a strong positive correlation between CEAF and B3 (Pearson's r = 0.91, p < 0.01), but a much weaker correlation between CEAF and MUC in terms of recall (Pearson's r = 0.44, p < 0.01). This stresses the importance of defining appropriate metrics (or a combination of them) for coreference evaluation.
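Such metric-to-metric correlations can be recomputed from the per-system columns of Table 5. The sketch below is a minimal illustration of the computation, using only the six Catalan CEAF and B3 recall values as sample input (the reported r = 0.91 was computed over all entries).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Catalan recall columns from Table 5 (closed-gold and closed-regular rows),
# used here only as a small sample of the full data.
ceaf_recall = [70.5, 68.7, 66.0, 46.6, 51.3, 57.5]
b3_recall   = [68.6, 76.6, 64.4, 47.8, 59.6, 55.8]
print(round(pearson_r(ceaf_recall, b3_recall), 2))
```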

Finally, regarding evaluation settings, the results in the gold setting are significantly better than those in the regular setting. However, this might be a direct effect of the mention recognition task: mention recognition in the regular setting falls more than 20 F1 points with respect to the gold setting (where correct mention boundaries were given). As for the open versus closed setting, only one system, RelaxCor for English, addressed both. As expected, its results show a slight improvement from closed-gold to open-gold.

    6 Conclusions

This paper has introduced the main features of the SemEval-2010 task on coreference resolution.


Language | CEAF R P F1 | MUC R P F1 | B3 R P F1 | BLANC R P BLANC

SINGLETONS: each mention forms a separate entity.
Catalan  61.2 61.2 61.2 | 0.0 0.0 0.0 | 61.2 100 75.9 | 50.0 48.7 49.3
Dutch    34.5 34.5 34.5 | 0.0 0.0 0.0 | 34.5 100 51.3 | 50.0 46.7 48.3
English  71.2 71.2 71.2 | 0.0 0.0 0.0 | 71.2 100 83.2 | 50.0 49.2 49.6
German   75.5 75.5 75.5 | 0.0 0.0 0.0 | 75.5 100 86.0 | 50.0 49.4 49.7
Italian  71.1 71.1 71.1 | 0.0 0.0 0.0 | 71.1 100 83.1 | 50.0 49.2 49.6
Spanish  62.2 62.2 62.2 | 0.0 0.0 0.0 | 62.2 100 76.7 | 50.0 48.8 49.4

ALL-IN-ONE: all mentions are grouped into a single entity.
Catalan  11.8 11.8 11.8 | 100 39.3 56.4 | 100 4.0 7.7  | 50.0 1.3 2.6
Dutch    19.7 19.7 19.7 | 100 66.3 79.8 | 100 8.0 14.9 | 50.0 3.2 6.2
English  10.5 10.5 10.5 | 100 29.2 45.2 | 100 3.5 6.7  | 50.0 0.8 1.6
German    8.2  8.2  8.2 | 100 24.8 39.7 | 100 2.4 4.7  | 50.0 0.6 1.1
Italian  11.4 11.4 11.4 | 100 29.0 45.0 | 100 2.1 4.1  | 50.0 0.8 1.5
Spanish  11.9 11.9 11.9 | 100 38.3 55.4 | 100 3.9 7.6  | 50.0 1.2 2.4

Table 4: Baseline scores.

The goal of the task was to evaluate and compare automatic coreference resolution systems for six different languages, in four evaluation settings, and using four different metrics. This complex scenario aimed at providing insight into several aspects of coreference resolution, including portability across languages, relevance of linguistic information at different levels, and behavior of alternative scoring metrics.

The task attracted considerable attention from a number of researchers, but only six teams submitted their final results. The participating teams did not run their systems for all the languages and evaluation settings, thus making direct comparisons between them very difficult. Nonetheless, we were able to observe some interesting aspects from the empirical evaluation.

An important conclusion was the confirmation that different evaluation metrics provide different system rankings and that the scores are not commensurate. Attention thus needs to be paid to coreference evaluation. The behavior and applicability of the scoring metrics require further investigation in order to guarantee a fair evaluation when comparing systems in the future. We hope to have the opportunity to thoroughly discuss this and the other interesting questions raised by the task during the SemEval workshop at ACL 2010.

An additional valuable benefit is the set of resources developed throughout the task. As task organizers, we intend to facilitate the sharing of datasets, scorers, and documentation by keeping them available for future research use. We believe that these resources will help to set future benchmarks for the research community and will contribute positively to the progress of the state of the art in coreference resolution. We will maintain and update the task website with post-SemEval contributions.

    Acknowledgments

We would like to thank the following people who contributed to the preparation of the task datasets: Manuel Bertran (UB), Oriol Borrega (UB), Orphée De Clercq (U. Ghent), Francesca Delogu (U. Trento), Jesús Giménez (UPC), Eduard Hovy (ISI-USC), Richard Johansson (U. Trento), Xavier Lluís (UPC), Montse Nofre (UB), Lluís Padró (UPC), Kepa Joseba Rodríguez (U. Trento), Mihai Surdeanu (Stanford), Olga Uryupina (U. Trento), Lente Van Leuven (UB), and Rita Zaragoza (UB). We would also like to thank LDC and die tageszeitung for freely distributing the English and German datasets.

This work was funded in part by the Spanish Ministry of Science and Innovation through the projects TEXT-MESS 2.0 (TIN2009-13391-C04-04), OpenMT-2 (TIN2009-14675-C03), and KNOW2 (TIN2009-14715-C04-04), and an FPU doctoral scholarship (AP2006-00994) held by M. Recasens. It also received financial support from the Seventh Framework Programme of the EU (FP7/2007-2013) under GA 247762 (FAUST), from the STEVIN program of the Nederlandse Taalunie through the COREA and SoNaR projects, and from the Provincia Autonoma di Trento through the LiveMemories project.


System | Mention detection R P F1 | CEAF R P F1 | MUC R P F1 | B3 R P F1 | BLANC R P BLANC

Catalan, closed-gold
RelaxCor 100 100 100   | 70.5 70.5 70.5 | 29.3 77.3 42.5 | 68.6 95.8 79.9 | 56.0 81.8 59.7
SUCRE    100 100 100   | 68.7 68.7 68.7 | 54.1 58.4 56.2 | 76.6 77.4 77.0 | 72.4 60.2 63.6
TANL-1   100 96.8 98.4  | 66.0 63.9 64.9 | 17.2 57.7 26.5 | 64.4 93.3 76.2 | 52.8 79.8 54.4
UBIU     75.1 96.3 84.4 | 46.6 59.6 52.3 | 8.8 17.1 11.7  | 47.8 76.3 58.8 | 51.6 57.9 52.2

Catalan, closed-regular
SUCRE    75.9 64.5 69.7 | 51.3 43.6 47.2 | 44.1 32.3 37.3 | 59.6 44.7 51.1 | 53.9 55.2 54.2
TANL-1   83.3 82.0 82.7 | 57.5 56.6 57.1 | 15.2 46.9 22.9 | 55.8 76.6 64.6 | 51.3 76.2 51.0
UBIU     51.4 70.9 59.6 | 33.2 45.7 38.4 | 6.5 12.6 8.6   | 32.4 55.7 40.9 | 50.2 53.7 47.8

Catalan, open-gold and open-regular: no submissions.

Dutch, closed-gold
SUCRE    100 100 100   | 58.8 58.8 58.8 | 65.7 74.4 69.8 | 65.0 69.2 67.0 | 69.5 62.9 65.3

Dutch, closed-regular
SUCRE    78.0 29.0 42.3 | 29.4 10.9 15.9 | 62.0 19.5 29.7 | 59.1 6.5 11.7  | 46.9 46.9 46.9
UBIU     41.5 29.9 34.7 | 20.5 14.6 17.0 | 6.7 11.0 8.3   | 13.3 23.4 17.0 | 50.0 52.4 32.3

Dutch, open-gold and open-regular: no submissions.

English, closed-gold
RelaxCor 100 100 100   | 75.6 75.6 75.6 | 21.9 72.4 33.7 | 74.8 97.0 84.5 | 57.0 83.4 61.3
SUCRE    100 100 100   | 74.3 74.3 74.3 | 68.1 54.9 60.8 | 86.7 78.5 82.4 | 77.3 67.0 70.8
TANL-1   99.8 81.7 89.8 | 75.0 61.4 67.6 | 23.7 24.4 24.0 | 74.6 72.1 73.4 | 51.8 68.8 52.1
UBIU     92.5 99.5 95.9 | 63.4 68.2 65.7 | 17.2 25.5 20.5 | 67.8 83.5 74.8 | 52.6 60.8 54.0

English, closed-regular
SUCRE    78.4 83.0 80.7 | 61.0 64.5 62.7 | 57.7 48.1 52.5 | 68.3 65.9 67.1 | 58.9 65.7 61.2
TANL-1   79.6 68.9 73.9 | 61.7 53.4 57.3 | 23.8 25.5 24.6 | 62.1 60.5 61.3 | 50.9 68.0 49.3
UBIU     66.7 83.6 74.2 | 48.2 60.4 53.6 | 11.6 18.4 14.2 | 50.9 69.2 58.7 | 50.9 56.3 51.0

English, open-gold
Corry-B  100 100 100   | 77.5 77.5 77.5 | 56.1 57.5 56.8 | 82.6 85.7 84.1 | 69.3 75.3 71.8
Corry-C  100 100 100   | 77.7 77.7 77.7 | 57.4 58.3 57.9 | 83.1 84.7 83.9 | 71.3 71.6 71.5
Corry-M  100 100 100   | 73.8 73.8 73.8 | 62.5 56.2 59.2 | 85.5 78.6 81.9 | 76.2 58.8 62.7
RelaxCor 100 100 100   | 75.8 75.8 75.8 | 22.6 70.5 34.2 | 75.2 96.7 84.6 | 58.0 83.8 62.7

English, open-regular
BART     76.1 69.8 72.8 | 70.1 64.3 67.1 | 62.8 52.4 57.1 | 74.9 67.7 71.1 | 55.3 73.2 57.7
Corry-B  79.8 76.4 78.1 | 70.4 67.4 68.9 | 55.0 54.2 54.6 | 73.7 74.1 73.9 | 57.1 75.7 60.6
Corry-C  79.8 76.4 78.1 | 70.9 67.9 69.4 | 54.7 55.5 55.1 | 73.8 73.1 73.5 | 57.4 63.8 59.4
Corry-M  79.8 76.4 78.1 | 66.3 63.5 64.8 | 61.5 53.4 57.2 | 76.8 66.5 71.3 | 58.5 56.2 57.1

German, closed-gold
SUCRE    100 100 100   | 72.9 72.9 72.9 | 74.4 48.1 58.4 | 90.4 73.6 81.1 | 78.2 61.8 66.4
TANL-1   100 100 100   | 77.7 77.7 77.7 | 16.4 60.6 25.9 | 77.2 96.7 85.9 | 54.4 75.1 57.4
UBIU     92.6 95.5 94.0 | 67.4 68.9 68.2 | 22.1 21.7 21.9 | 73.7 77.9 75.7 | 60.0 77.2 64.5

German, closed-regular
SUCRE    79.3 77.5 78.4 | 60.6 59.2 59.9 | 49.3 35.0 40.9 | 69.1 60.1 64.3 | 52.7 59.3 53.6
TANL-1   60.9 57.7 59.2 | 50.9 48.2 49.5 | 10.2 31.5 15.4 | 47.2 54.9 50.7 | 50.2 63.0 44.7
UBIU     50.6 66.8 57.6 | 39.4 51.9 44.8 | 9.5 11.4 10.4  | 41.2 53.7 46.6 | 50.2 54.4 48.0

German, open-gold
BART     94.3 93.7 94.0 | 67.1 66.7 66.9 | 70.5 40.1 51.1 | 85.3 64.4 73.4 | 65.5 61.0 62.8

German, open-regular
BART     82.5 82.3 82.4 | 61.4 61.2 61.3 | 61.4 36.1 45.5 | 75.3 58.3 65.7 | 55.9 60.3 57.3

Italian, closed-gold
SUCRE    98.4 98.4 98.4 | 66.0 66.0 66.0 | 48.1 42.3 45.0 | 76.7 76.9 76.8 | 54.8 63.5 56.9

Italian, closed-regular
SUCRE    84.6 98.1 90.8 | 57.1 66.2 61.3 | 50.1 50.7 50.4 | 63.6 79.2 70.6 | 55.2 68.3 57.7
UBIU     46.8 35.9 40.6 | 37.9 29.0 32.9 | 2.9 4.6 3.6    | 38.4 31.9 34.8 | 50.0 46.6 37.2

Italian, open-gold: no submissions.

Italian, open-regular
BART     42.8 80.7 55.9 | 35.0 66.1 45.8 | 35.3 54.0 42.7 | 34.6 70.6 46.4 | 57.1 68.1 59.6
TANL-1   90.5 73.8 81.3 | 62.2 50.7 55.9 | 37.2 28.3 32.1 | 66.8 56.5 61.2 | 50.7 69.3 48.5

Spanish, closed-gold
RelaxCor 100 100 100   | 66.6 66.6 66.6 | 14.8 73.8 24.7 | 65.3 97.5 78.2 | 53.4 81.8 55.6
SUCRE    100 100 100   | 69.8 69.8 69.8 | 52.7 58.3 55.3 | 75.8 79.0 77.4 | 67.3 62.5 64.5
TANL-1   100 96.8 98.4  | 66.9 64.7 65.8 | 16.6 56.5 25.7 | 65.2 93.4 76.8 | 52.5 79.0 54.1
UBIU     73.8 96.4 83.6 | 45.7 59.6 51.7 | 9.6 18.8 12.7  | 46.8 77.1 58.3 | 52.9 63.9 54.3

Spanish, closed-regular
SUCRE    74.9 66.3 70.3 | 56.3 49.9 52.9 | 35.8 36.8 36.3 | 56.6 54.6 55.6 | 52.1 61.2 51.4
TANL-1   82.2 84.1 83.1 | 58.6 60.0 59.3 | 14.0 48.4 21.7 | 56.6 79.0 66.0 | 51.4 74.7 51.4
UBIU     51.1 72.7 60.0 | 33.6 47.6 39.4 | 7.6 14.4 10.0  | 32.8 57.1 41.6 | 50.4 54.6 48.4

Spanish, open-gold and open-regular: no submissions.

    Table 5: Official results of the participating systems for all languages, settings, and metrics.


References

Giuseppe Attardi, Stefano Dei Rossi, and Maria Simi. 2010. TANL-1: Coreference resolution by parse analysis and similarity clustering. In Proceedings of SemEval-2.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of the LREC Workshop on Linguistic Coreference, pages 563–566.

Samuel Broscheit, Massimo Poesio, Simone Paolo Ponzetto, Kepa Joseba Rodríguez, Lorenza Romano, Olga Uryupina, Yannick Versley, and Roberto Zanoli. 2010. BART: A multilingual anaphora resolution system. In Proceedings of SemEval-2.

Walter Daelemans, Sabine Buchholz, and Jorn Veenstra. 1999. Memory-based shallow parsing. In Proceedings of CoNLL 1999.

George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The Automatic Content Extraction (ACE) program: Tasks, data, and evaluation. In Proceedings of LREC 2004, pages 837–840.

Katrin Erk and Carlo Strapparava, editors. 2010. Proceedings of SemEval-2.

Johan Hall and Joakim Nivre. 2008. A dependency-driven parser for German dependency and constituency representations. In Proceedings of the ACL Workshop on Parsing German (PaGe 2008), pages 47–54.

Erhard W. Hinrichs, Sandra Kübler, and Karin Naumann. 2005. A unified representation for morphological, syntactic, semantic, and referential annotations. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, pages 13–20.

Lynette Hirschman and Nancy Chinchor. 1997. MUC-7 Coreference Task Definition (Version 3.0). In Proceedings of MUC-7.

Véronique Hoste and Guy De Pauw. 2006. KNACK-2002: A richly annotated corpus of Dutch written text. In Proceedings of LREC 2006, pages 1432–1437.

Hamidreza Kobdani and Hinrich Schütze. 2010. SUCRE: A modular system for coreference resolution. In Proceedings of SemEval-2.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of HLT-EMNLP 2005, pages 25–32.

Joseph F. McCarthy and Wendy G. Lehnert. 1995. Using decision trees for coreference resolution. In Proceedings of IJCAI 1995, pages 1050–1055.

Thomas S. Morton. 1999. Using coreference in question answering. In Proceedings of TREC-8, pages 85–89.

Constantin Orasan, Dan Cristea, Ruslan Mitkov, and Antonio Branco. 2008. Anaphora Resolution Exercise: An overview. In Proceedings of LREC 2008.

Sameer S. Pradhan, Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2007. OntoNotes: A unified relational semantic representation. In Proceedings of the International Conference on Semantic Computing (ICSC 2007), pages 517–526.

Marta Recasens and Eduard Hovy. In prep. BLANC: Implementing the Rand Index for Coreference Evaluation.

Marta Recasens and M. Antònia Martí. 2009. AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation, DOI:10.1007/s10579-009-9108-x.

Kepa Joseba Rodríguez, Francesca Delogu, Yannick Versley, Egon Stemle, and Massimo Poesio. 2010. Anaphoric annotation of Wikipedia and blogs in the Live Memories Corpus. In Proceedings of LREC 2010, pages 157–163.

Emili Sapena, Lluís Padró, and Jordi Turmo. 2010. RelaxCor: A global relaxation labeling approach to coreference resolution for the SemEval-2 Coreference Task. In Proceedings of SemEval-2.

Helmut Schmid and Florian Laws. 2008. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of COLING 2008, pages 777–784.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT Workshop, pages 47–50.

Josef Steinberger, Massimo Poesio, Mijail A. Kabadjov, and Karel Ježek. 2007. Two uses of anaphora resolution in summarization. Information Processing and Management: an International Journal, 43(6):1663–1680.

Olga Uryupina. 2010. Corry: A system for coreference resolution. In Proceedings of SemEval-2.

Gertjan van Noord, Ineke Schuurman, and Vincent Vandeghinste. 2006. Syntactic annotation of large corpora in STEVIN. In Proceedings of LREC 2006.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of MUC-6, pages 45–52.

Desislava Zhekova and Sandra Kübler. 2010. UBIU: A language-independent system for coreference resolution. In Proceedings of SemEval-2.



    SemEval-2010 Task 2: Cross-Lingual Lexical Substitution

Rada Mihalcea
University of North Texas

    [email protected]

Ravi Sinha
University of North Texas

    [email protected]

Diana McCarthy
Lexical Computing Ltd.

    [email protected]

    Abstract

In this paper we describe the SemEval-2010 Cross-Lingual Lexical Substitution task, where, given an English target word in context, participating systems had to find an alternative substitute word or phrase in Spanish. The task is based on the English Lexical Substitution task run at SemEval-2007. We provide background and motivation for the task, describe the data annotation process and the scoring system, and present the results of the participating systems.

    1 Introduction

In the Cross-Lingual Lexical Substitution task, annotators and systems had to find an alternative substitute word or phrase in Spanish for an English target word in context. The task is based on the English Lexical Substitution task run at SemEval-2007, where both target words and substitutes were in English.

An automatic system for cross-lingual lexical substitution would be useful for a number of applications. For instance, such a system could be used to assist human translators in their work, by providing a number of correct translations that the human translator can choose from. Similarly, the system could be used to assist language learners, by providing them with the interpretation of the unknown words in a text written in the language they are learning. Last but not least, the output of a cross-lingual lexical substitution system could be used as input to existing systems for cross-language information retrieval or automatic machine translation.

    2 Motivation and Related Work

While there has been a lot of discussion on the relevant sense distinctions for monolingual WSD systems, for machine translation applications there is a consensus that the relevant sense distinctions are those that reflect different translations. One early and notable work was the SENSEVAL-2 Japanese Translation task (Kurohashi, 2001), which obtained alternative translation records of typical usages of a test word, also referred to as a translation memory. Systems could either select the most appropriate translation memory record for each instance and be scored against a gold-standard set of annotations, or they could provide a translation that was scored by translation experts after the results were submitted. In contrast to this work, in our task we provided actual translations for target instances in advance, rather than predetermining translations using lexicographers or relying on post-hoc evaluation, which does not permit evaluation of new systems after the competition.

Previous standalone WSD tasks based on parallel data have obtained distinct translations for senses as listed in a dictionary (Ng and Chan, 2007). In this way, fine-grained senses with the same translations can be lumped together; however, this does not fully allow for the fact that some senses of the same word may have some translations in common but others that are not shared (Sinha et al., 2009).

In our task, we collected a dataset which allows instances of the same word to have some translations in common, while not necessitating a clustering of translations from a specific resource into senses (in comparison to Lefever and Hoste (2010)).¹

¹ Note, though, that in that task it is possible for a translation to occur in more than one cluster. It will be interesting to see the extent to which this actually occurred in their data, and the extent to which the translations that our annotators provided might be clustered.


Resnik and Yarowsky (2000) also conducted experiments using words in context rather than a predefined sense inventory; however, in those experiments the annotators were asked for a single preferred translation. In our case, we allowed annotators to supply as many translations as they felt were equally valid. This allows us to examine more subtle relationships between usages and to give partial credit to systems that produce a close approximation to the annotators' translations. Unlike a full-blown machine translation task (Carpuat and Wu, 2007), annotators and systems are not required to translate the whole context but just the target word.

3 Background: The English Lexical Substitution Task

The English Lexical Substitution task (hereafter referred to as LEXSUB) was run at SemEval-2007 (McCarthy and Navigli, 2007; McCarthy and Navigli, 2009). LEXSUB was proposed as a task which, while requiring contextual disambiguation, did not presuppose a specific sense inventory. In fact, it is quite possible to use alternative representations of meaning, such as those proposed by Schütze (1998) and Pantel and Lin (2002).

The motivation for a substitution task was that it would reflect capabilities that might be useful for natural language processing tasks such as paraphrasing and textual entailment, while not requiring a complete system that might mask system capabilities at a lexical level and make participation in the task difficult for small research teams.

The task required systems to produce a substitute word for a word in context. The data was collected for 201 words from open-class parts of speech (PoS), i.e., nouns, verbs, adjectives and adverbs. Words were selected that have more than one meaning with at least one near synonym. Ten sentences for each word were extracted from the English Internet Corpus (Sharoff, 2006). There were five annotators who annotated each target word as it occurred in the context of a sentence. The annotators were each allowed to provide up to three substitutes, though they could also provide a NIL response if they could not come up with a substitute. They had to indicate if the target word was an integral part of a multiword.


4 The Cross-Lingual Lexical Substitution Task

The Cross-Lingual Lexical Substitution task follows LEXSUB except that the annotations are translations rather than paraphrases. Given a target word in context, the task is to provide several correct translations for that word in a given language. We used English as the source language and Spanish as the target language.

We provided both development and test sets, but no training data. As for LEXSUB, any systems requiring training data had to obtain it from other sources. We included nouns, verbs, adjectives and adverbs in both the development and test data. We used the same set of 30 development words as in LEXSUB, and a subset of 100 words from the LEXSUB test set, selected so that they exhibit a wide variety of substitutes. For each word, the same example sentences were used as in LEXSUB.

4.1 Annotation

We used four annotators for the task, all native Spanish speakers from Mexico, with a high level of proficiency in English. As in LEXSUB, the annotators were allowed to use any resources they wanted to, and were required to provide as many substitutes as they could think of.

The inter-tagger agreement (ITA) was calculated as pairwise agreement between sets of substitutes from annotators, as done in LEXSUB. The ITA without mode was determined to be 0.2777, which is comparable with the ITA of 0.2775 determined for LEXSUB.
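As an illustration of how a pairwise ITA figure of this kind can be computed, the sketch below averages agreement over all annotator pairs and items. The intersection-over-union agreement used here is an assumption made for illustration only (the task followed the LEXSUB formula, which is not reproduced here), and the substitute sets are invented.

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """annotations: list of items, each a dict mapping annotator -> set of substitutes.

    Agreement for a pair on one item is taken here as intersection over union
    (an assumption; the official LEXSUB ITA formula may differ in detail).
    """
    scores = []
    for item in annotations:
        for a1, a2 in combinations(sorted(item), 2):
            s1, s2 = item[a1], item[a2]
            if s1 | s2:
                scores.append(len(s1 & s2) / len(s1 | s2))
    return sum(scores) / len(scores)

# Invented toy data: two items annotated by three annotators.
items = [
    {"ann1": {"severamente", "duramente"}, "ann2": {"severamente"}, "ann3": {"seriamente"}},
    {"ann1": {"rigurosamente"}, "ann2": {"rigurosamente", "seriamente"}, "ann3": {"seriamente"}},
]
print(round(pairwise_agreement(items), 4))
```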

4.2 An Example

One significant outcome of this task is that there are not necessarily clear divisions between usages and senses, because we do not use a predefined sense inventory or restrict the annotations to distinctive translations. This means that there can be usages that overlap to different extents with each other but do not have identical translations. An example is the target adverb severely. Five sentences are shown in Figure 1, with the translations provided by one annotator given in {} braces. Here, all the token occurrences seem related to each other in that they share some translations, but not all. There are sentences like 1 and 2 that appear not to have anything in common. However, 1, 3, and 4 seem to be partly related (they share severamente), and 2, 3, and 4 are also partly related (they share seriamente). When we look again, sentences 1 and 2, though not directly related, both have translations in common with sentences 3 and 4.



4.3 Scoring

We adopted the best and out-of-ten precision and recall scores from LEXSUB (oot in the equations below). The systems were allowed to supply as many translations as they feel fit the context. The system translations are then given credit depending on the number of annotators that picked each translation. The credit is divided by the number of annotator responses for the item and, since for the best score the credit for the system answers for an item is also divided by the number of answers the system provides, this allows more credit to be given to instances where there is less variation. For that reason, a system is better off guessing the translation that is most frequent unless it really wants to hedge its bets. Thus, if i is an item in the set of instances I, Ti is the multiset of gold standard translations from the human annotators for i, and a system provides a set of answers Si for i, then the best score for item i is²:

\[ \mathrm{best\ score}(i) = \frac{\sum_{s \in S_i} \mathrm{frequency}(s \in T_i)}{|S_i| \cdot |T_i|} \quad (1) \]

Precision is calculated by summing the scores for each item and dividing by the number of items that the system attempted, whereas recall divides the sum of scores for each item by |I|. Thus:

\[ \mathrm{best\ precision} = \frac{\sum_i \mathrm{best\ score}(i)}{|\{i \in I : \mathrm{defined}(S_i)\}|} \quad (2) \]

\[ \mathrm{best\ recall} = \frac{\sum_i \mathrm{best\ score}(i)}{|I|} \quad (3) \]

The out-of-ten scorer allows up to ten system responses and does not divide the credit attributed to each answer by the number of system responses. This allows a system to be less cautious, and allows for the fact that there is considerable variation in the task and there may be cases where systems select a perfectly good translation that the annotators had not thought of. By allowing up to ten translations in the out-of-ten task, the systems can hedge their bets to find the translations that the annotators supplied.

² NB: scores are multiplied by 100, though for out-of-ten this is not strictly a percentage.

\[ \mathrm{oot\ score}(i) = \frac{\sum_{s \in S_i} \mathrm{frequency}(s \in T_i)}{|T_i|} \quad (4) \]

\[ \mathrm{oot\ precision} = \frac{\sum_i \mathrm{oot\ score}(i)}{|\{i \in I : \mathrm{defined}(S_i)\}|} \quad (5) \]

\[ \mathrm{oot\ recall} = \frac{\sum_i \mathrm{oot\ score}(i)}{|I|} \quad (6) \]

We note that there was an issue in that the original LEXSUB out-of-ten scorer allowed duplicates (McCarthy and Navigli, 2009). The effect of duplicates is that systems can get inflated scores, because the credit for each item is not divided by the number of substitutes and because the frequency of each annotator response is used. McCarthy and Navigli (2009) describe this oversight, identify the systems that had included duplicates, and explain the implications. For our task, we decided to continue to allow duplicates, so that systems can boost their scores with duplicates on translations with higher probability.

For both the best and out-of-ten measures, we also report a mode score, which is calculated against the mode of the annotators' responses, as was done in LEXSUB. Unlike the LEXSUB task, we did not run a separate multi-word subtask and evaluation.
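The definitions in equations 1–6 translate directly into code. The following sketch is a simplified re-implementation for illustration only (the official scorer also computes mode scores and other bookkeeping not shown here); gold translations form a multiset per item, and system answers may contain duplicates for out-of-ten.

```python
from collections import Counter

def best_score(system, gold):
    """Equation 1: credit summed over system answers, divided by |Si| * |Ti|."""
    if not system:
        return None                      # item not attempted
    t = Counter(gold)                    # Ti as a multiset
    credit = sum(t[s] for s in system)   # frequency(s in Ti)
    return credit / (len(system) * sum(t.values()))

def oot_score(system, gold):
    """Equation 4: like best_score, but without dividing by |Si| (duplicates allowed)."""
    if not system:
        return None
    t = Counter(gold)
    return sum(t[s] for s in system) / sum(t.values())

def precision_recall(scores):
    """Equations 2/3 and 5/6: precision over attempted items, recall over all items (x100)."""
    attempted = [s for s in scores if s is not None]
    precision = 100 * sum(attempted) / len(attempted)
    recall = 100 * sum(attempted) / len(scores)
    return precision, recall

# Toy item (invented): four annotators supplied six translations in total.
gold = ["severamente", "severamente", "seriamente", "seriamente", "seriamente", "duramente"]
print(best_score(["seriamente"], gold))        # 3/6 = 0.5
print(oot_score(["seriamente"] * 10, gold))    # duplicates boost oot credit: 30/6 = 5.0
```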

    5 Baselines and Upper bound

To place results in perspective, several baselines as well as the upper bound were calculated.

5.1 Baselines

We calculated two baselines, one dictionary-based and one dictionary- and corpus-based. The baselines were produced with the help of an online Spanish-English dictionary³ and the Spanish Wikipedia. For the first baseline, denoted by DICT, we collected, for all target words, all the Spanish translations provided by the dictionary, in the order returned on the online query page. The best baseline was produced by taking the first translation provided by the online dictionary, while the out-of-ten baseline was produced by taking the first 10 translations provided.

The second baseline, DICTCORP, also accounted for the frequency of the translations within a Spanish corpus. All the translations provided by the online dictionary for a given target word were ranked according to their frequencies in the Spanish Wikipedia, producing the DICTCORP baseline.

³ www.spanishdict.com


1. Perhaps the effect of West Nile Virus is sufficient to extinguish endemic birds already severely stressed by habitat losses. {fuertemente, severamente, duramente, exageradamente}

2. She looked as severely as she could muster at Draco. {rigurosamente, seriamente}

3. A day before he was due to return to the United States Patton was severely injured in a road accident. {seriamente, duramente, severamente}

4. Use market tools to address environmental issues, such as eliminating subsidies for industries that severely harm the environment, like coal. {peligrosamente, seriamente, severamente}

5. This picture was severely damaged in the flood of 1913 and has rarely been seen until now. {altamente, seriamente, exageradamente}

Figure 1: Translations from one annotator for the adverb severely.

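The ranking logic of the two baselines described in Section 5.1 can be sketched as follows; the dictionary lookup and the Wikipedia counts are placeholder data structures standing in for the online dictionary and the Spanish Wikipedia frequencies actually used.

```python
# Placeholder resources (assumptions for illustration only).
dictionary = {  # target word -> translations in the order returned by the online dictionary
    "severely": ["severamente", "gravemente", "seriamente", "duramente", "con severidad"],
}
wiki_freq = {"severamente": 1200, "gravemente": 5400, "seriamente": 8100,
             "duramente": 3000, "con severidad": 150}

def dict_baseline(word):
    """DICT: first translation for 'best', first ten for out-of-ten."""
    translations = dictionary.get(word, [])
    return translations[:1], translations[:10]

def dictcorp_baseline(word):
    """DICTCORP: same candidates, re-ranked by their Spanish Wikipedia frequency."""
    ranked = sorted(dictionary.get(word, []),
                    key=lambda t: wiki_freq.get(t, 0), reverse=True)
    return ranked[:1], ranked[:10]

print(dict_baseline("severely")[0])      # ['severamente']
print(dictcorp_baseline("severely")[0])  # ['seriamente']
```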

5.2 Upper bound

The results for the best task reflect the inherent variability, as less credit is given where annotators express differences. The theoretical upper bound for the best recall (and precision, if all items are attempted) score is calculated as:

\[ \mathrm{best}_{ub} = \frac{\sum_{i \in I} \frac{\mathrm{freq}_{\mathrm{most\ freq\ substitute}_i}}{|T_i|}}{|I|} \times 100 = 40.57 \quad (7) \]

Note of course that this upper bound is theoretical and assumes a human could find the most frequent substitute selected by all annotators. Performance of annotators will undoubtedly be lower than the theoretical upper bound because of human variability on this task. Since we allow for duplicates, the out-of-ten upper bound assumes the most frequent word type in Ti is selected for all ten answers. Thus we would obtain ten times the best upper bound (equation 7).

\[ \mathrm{oot}_{ub} = \frac{\sum_{i \in I} \frac{10 \cdot \mathrm{freq}_{\mathrm{most\ freq\ substitute}_i}}{|T_i|}}{|I|} \times 100 = 405.78 \quad (8) \]

If we had not allowed duplicates, the out-of-ten upper bound would have been just less than 100% (99.97). This is calculated by assuming the top 10 most frequent responses from the annotators are picked in every case. There are only a couple of cases where there are more than 10 translations from the annotators.
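Equations 7 and 8 can be computed from the gold standard alone, as in the sketch below (a simplified illustration over invented gold data; the reported figures of 40.57 and 405.78 come from the full test set).

```python
from collections import Counter

def upper_bounds(gold_per_item):
    """Equations 7 and 8: assume the most frequent gold translation is always chosen.

    gold_per_item: list of items, each a list of annotator translations (the multiset Ti).
    """
    best_ub = 0.0
    for gold in gold_per_item:
        counts = Counter(gold)
        best_ub += counts.most_common(1)[0][1] / sum(counts.values())
    best_ub = 100 * best_ub / len(gold_per_item)
    # Out-of-ten repeats the most frequent translation ten times, so it is 10x best.
    return best_ub, 10 * best_ub

# Invented gold data for two items.
gold = [
    ["severamente", "severamente", "seriamente", "duramente"],
    ["rigurosamente", "seriamente", "seriamente"],
]
print(upper_bounds(gold))   # roughly (58.33, 583.33) on this toy data
```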

    6 Systems

Nine teams participated in the task, and several of them entered two systems. The systems used various resources, including bilingual dictionaries, parallel corpora such as Europarl or corpora built from Wikipedia, monolingual corpora such as Web1T or newswire collections, and translation software such as Moses, GIZA or Google. Some systems attempted to select the substitutes on the English side, using a lexical substitution framework or word sense disambiguation, whereas other systems made the selection on the Spanish side, using lexical substitution in Spanish.

In the following, we briefly describe each participating system.

CU-SMT relies on a phrase-based statistical machine translation system, trained on the Europarl English-Spanish parallel corpora.

The UvT-v and UvT-g systems make use of k-nearest neighbour classifiers to build one word expert for each target word, and select translations on the basis of a GIZA alignment of the Europarl parallel corpus.

The UBA-T and UBA-W systems both use candidates from the Google dictionary, SpanishDict.com and Babylon, which are then confirmed using parallel texts. UBA-T relies on the automatic translation of the source sentence using the Google Translation API, combined with several heuristics. The UBA-W system uses a parallel corpus automatically constructed from DBpedia.

SWAT-E and SWAT-S use a lexical substitution framework applied to either English or Spanish. The SWAT-E system first performs lexical substitution in English, and then each substitute is translated into Spanish. SWAT-S translates the source sentences into Spanish, identifies the Spanish word corresponding to the target word, and then performs lexical substitution in Spanish.



TYO uses an English monolingual substitution module, and then translates the substitution candidates into Spanish using Freedict and the Google English-Spanish dictionary.

FCC-LS uses the probability of a word being translated into a candidate, based on estimates obtained from the GIZA alignment of the Europarl corpus. These translations are subsequently filtered to include only those that appear in a translation of the target word produced with Google Translate.

WLVUSP determines candidates using the best N translations of the test sentences obtained with the Moses system, which are further filtered using an English-Spanish dictionary. USPWLV uses candidates from an alignment of Europarl, which are then selected using various features and a classifier tuned on the development data.

IRST-1 generates the best substitute using a PoS-constrained alignment of Moses translations of the source sentences, with a back-off to a bilingual dictionary. For out-of-ten, dictionary translations are filtered using the LSA similarity between candidates and the sentence translation into Spanish. IRSTbs is intended as a baseline, and it uses only the PoS-constrained Moses translation for best, and the dictionary translations for out-of-ten.

ColEur and ColSlm use a supervised word sense disambiguation algorithm to distinguish between senses in the English source sentences. Translations are then assigned by using GIZA alignments from a parallel corpus, collected for the word senses of interest.

    7 Results

Tables 1 and 2 show the precision P and recall R for the best and out-of-ten tasks respectively, for normal and mode scoring. The rows are ordered by R. The out-of-ten systems were allowed to provide up to 10 substitutes and did not gain any advantage by providing fewer. Since duplicates were allowed, so that a system can put more emphasis on items it is more confident of, out-of-ten R and P scores might exceed 100%, because the credit for each of the human answers is used for e