
MultiLing 2017

MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

Proceedings of the Workshop

EACL 2017 Workshop
April 3, 2017

Valencia, Spain


Technology sponsor: SciFY PNPC (www.scify.org)

©2017 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-945626-41-8


Introduction

Welcome to the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres. The Workshop is a continuation of the MultiLing community effort (cf. http://multiling.iit.demokritos.gr/) on advancing the state-of-the-art related to summarization and summary evaluation over different languages, genres and settings. We elaborate on the MultiLing focus in the next paragraphs.

Multilingual summarization across genres and sources: Summarization has been receiving increasing attention during the last years. This is mostly due to the increasing volume and redundancy of available online information, but also to user-created content. Recently, more and more interest has arisen in methods able to function on a variety of languages and across different types of content and genres (news, social media, transcripts). This topic of research is mapped to different community tasks, covering different genres and source types: multilingual single-document summarization; news headline generation (new task in MultiLing 2017); user-supplied comments summarization (OnForumS task); conversation transcripts summarization (CCCS task). The spectrum of the tasks covers a variety of real settings, identifying individual requirements and intricacies, similarly to previous MultiLing endeavors.

Multilingual summary evaluation: Summary evaluation has been an open question for several years, even though there exist methods that correlate well with human judgment when called upon to compare systems. In the multilingual setting, it is not obvious that these methods will perform as well as in the English-language setting. In fact, some preliminary results have shown that several problems may arise in the multilingual setting. The same challenges arise across different source types and genres. This section of the workshop aims to cover and discuss these research problems and corresponding solutions.

The workshop will build upon the results of a set of research community tasks, as outlined below:

Single document summarization Following the pilot task of 2015, the multilingual single-document summarization task will be to generate a single-document summary for each of the given Wikipedia featured articles, in one of the about 40 languages provided. The provided training data will be the Single-Document Summarization Task data from MultiLing 2015. A new set of data will be generated based on additional Wikipedia featured articles. The summaries will be evaluated via automatic methods, and participants will be required to perform some limited summarization evaluations.

Headline Generation The objective of the Headline Generation (HG) task is to explore some of the challenges highlighted by current state-of-the-art approaches to creating informative headlines for news articles: non-descriptive headlines, out-of-domain training data, and generating headlines from long documents which are not well represented by the head heuristic.

Summary Evaluation This task aims to examine how well automated systems can evaluate summaries from different languages. The task takes as input the summaries generated by automatic systems and humans in the Summarization Tasks of MultiLing 2015, but also in the single-document summarization tasks of 2015 and 2017 (when the latter is completed). The output should be a grading of the summaries. Ideally, we would want the automatic evaluation to correlate maximally with human judgement; thus the evaluation will be based on measuring the correlation between estimated grades and human grades.

Online Forum Summarization Further to the successful pilot of OnForumS at MultiLing 2015, we are organizing the task again in 2017 with a brand new dataset. The OnForumS task investigates how the mass of comments found on news providers' web sites (e.g., The Guardian) can be summarized. We posit that a crucial initial step towards that goal is to determine what comments link to, be that either specific news snippets or comments by other users. Furthermore, a set of labels for a given link may be articulated to capture phenomena such as agreement and sentiment with respect to the comment target.

Call Centre Conversation Summarization The Call Centre Conversation Summarization (CCCS) task — run for the first time as a pilot task in 2015 — consists in automatically generating summaries of spoken conversations in the form of textual synopses that shall inform on the content of a conversation and might be used for browsing a large database of recordings. As in CCCS 2015, participants in the task shall generate abstractive summaries from conversation transcripts that inform a reader about the main events of the conversations, such as the objective of the participants and how it is met.

Please keep in mind that the above tasks will also run during and after the Workshop itself. An addendum to the proceedings, containing system reports and evaluations, will be provided on the MultiLing website (cf. http://multiling.iit.demokritos.gr/pages/view/1638/multiling-2017-proceedings-addendum) as reports become available and tasks progress.

We thank you for your contribution and participation in MultiLing 2017, and we hope that we will all enjoy this opportunity to further the summarization state-of-the-art through a productive and knowledge-rich meeting.

George Giannakopoulos

On behalf of the MultiLing Organizers


Organizers:

George Giannakopoulos - NCSR Demokritos (General Chair, Summary Evaluation Task)
Benoit Favre - LIF (CCCS, Headline Generation Task)
Elena Lloret - University of Alicante (Headline Generation Task)
John M. Conroy - IDA Center for Computing Sciences (Single Document Summarization Task)
Josef Steinberger - University of West Bohemia (OnForumS task)
Marina Litvak - Sami Shamoon College of Engineering (Headline Generation Task)
Peter Rankel - Elder Research Inc. (Single Document Summarization Task)

Program Committee:

Udo Kruschwitz - University of Essex
Horacio Saggion - Universitat Pompeu Fabra
Katja Filippova - Google
John M. Conroy - IDA Center for Computing Sciences
Vangelis Karkaletsis - NCSR Demokritos
Laura Plaza - UNED
Francesco Ronzano - Universitat Pompeu Fabra
Mark Last - Ben-Gurion University of the Negev
George Petasis - NCSR Demokritos
Elena Lloret - University of Alicante
Ahmet Aker - Universität Duisburg-Essen
Josef Steinberger - University of West Bohemia
Benoit Favre - LIF
Marina Litvak - Sami Shamoon College of Engineering
Mijail Alexandrov Kabadjov - University of Essex
Natalia Vanetik - Sami Shamoon College of Engineering
Florian Boudin - University of Nantes
Mahmoud El-Haj - Lancaster University
George Giannakopoulos - NCSR Demokritos

Invited Speaker:

Enrique Alfonseca, Google


Table of Contents

MultiLing 2017 Overview
George Giannakopoulos, John Conroy, Jeff Kubina, Peter A. Rankel, Elena Lloret, Josef Steinberger, Marina Litvak and Benoit Favre . . . . . . . . . . 1

Decoupling Encoder and Decoder Networks for Abstractive Document Summarization
Ying Xu, Jey Han Lau, Timothy Baldwin and Trevor Cohn . . . . . . . . . . 7

Centroid-based Text Summarization through Compositionality of Word Embeddings
Gaetano Rossiello, Pierpaolo Basile and Giovanni Semeraro . . . . . . . . . . 12

Query-based summarization using MDL principle
Marina Litvak and Natalia Vanetik . . . . . . . . . . 22

Word Embedding and Topic Modeling Enhanced Multiple Features for Content Linking and Argument / Sentiment Labeling in Online Forums
Lei Li, Liyuan Mao and Moye Chen . . . . . . . . . . 32

Ultra-Concise Multi-genre Summarisation of Web2.0: towards Intelligent Content Generation
Elena Lloret, Ester Boldrini, Patricio Martinez-Barco and Manuel Palomar . . . . . . . . . . 37

Machine Learning Approach to Evaluate MultiLingual Summaries
Samira Ellouze, Maher Jaoua and Lamia Hadrich Belguith . . . . . . . . . . 47


Workshop Program

April 3rd, 2017

09:30–11:00 Opening and Invited talk

09:30–09:45 MultiLing 2017 Welcome
George Giannakopoulos

09:45–11:00 Invited talk: Sentence Compression for Conversational Search
Enrique Alfonseca, Google

11:00–11:30 Coffee break

11:30–13:00 MultiLing Tasks and Systems I

11:30–11:50 MultiLing 2017 Overview
George Giannakopoulos, John Conroy, Jeff Kubina, Peter A. Rankel, Elena Lloret, Josef Steinberger, Marina Litvak and Benoit Favre

11:50–12:10 Decoupling Encoder and Decoder Networks for Abstractive Document Summarization
Ying Xu, Jey Han Lau, Timothy Baldwin and Trevor Cohn

12:10–12:30 Centroid-based Text Summarization through Compositionality of Word Embeddings
Gaetano Rossiello, Pierpaolo Basile and Giovanni Semeraro

12:30–12:50 Query-based summarization using MDL principle
Marina Litvak and Natalia Vanetik


April 3rd, 2017 (continued)

14:30–16:00 MultiLing Tasks and Systems II

14:30–14:50 Multilingual Single Document Summarization Overview
John Conroy

14:50–15:10 Word Embedding and Topic Modeling Enhanced Multiple Features for Content Linking and Argument / Sentiment Labeling in Online Forums
Lei Li, Liyuan Mao and Moye Chen

15:10–15:30 Ultra-Concise Multi-genre Summarisation of Web2.0: towards Intelligent Content Generation
Elena Lloret, Ester Boldrini, Patricio Martinez-Barco and Manuel Palomar

15:30–15:50 Machine Learning Approach to Evaluate MultiLingual Summaries
Samira Ellouze, Maher Jaoua and Lamia Hadrich Belguith

16:00–16:30 Coffee break and poster setup

16:30–18:00 Poster session and Closing

16:30–17:30 Poster session

17:30–18:00 Closing remarks and planning


Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, pages 1–6, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

MultiLing 2017 Overview

George Giannakopoulos, NCSR “Demokritos”, Greece
John M. Conroy, IDA / Center for Comp. Sciences, U.S.A.
Jeff Kubina, U.S. Dep. of Defense, U.S.A.
Peter A. Rankel, Elder Research, U.S.A.
Elena Lloret, Univ. of Alicante, Spain
Josef Steinberger, Univ. of West Bohemia, Czech Republic
Marina Litvak, Shamoon College of Engineering, Israel
Benoit Favre, LIF, France

Abstract

In this brief report we present an overview of the MultiLing 2017 effort and workshop, as implemented within EACL 2017. MultiLing is a community-driven initiative that pushes the state-of-the-art in Automatic Summarization by providing data sets and fostering further research and development of summarization systems. This year the scope of the workshop was widened, bringing together researchers that work on summarization across sources, languages and genres. We summarize the main tasks planned and implemented this year, also providing insights on next steps.

1 Overview

MultiLing covers a variety of topics on Natural Language Processing, focused on the multilingual aspect of summarization:

• Multilingual summarization across genres and sources: Summarization has been receiving increasing attention during the last years. This is mostly due to the increasing volume and redundancy of available online information, but also to user-created content. Recently, more and more interest arises for methods that will be able to function on a variety of languages and across different types of content and genres (news, social media, transcripts).

This topic of research is mapped to different community tasks, covering different genres and source types:

– Multilingual single-document summarization (Giannakopoulos et al., 2015);

– user-supplied comments summarization (OnForumS task (Kabadjov et al., 2015));

– conversation transcripts summarization (see also (Favre et al., 2015)).

The spectrum of the tasks covers a variety of real settings, identifying individual requirements and intricacies, similarly to previous MultiLing endeavours (Giannakopoulos et al., 2011a; Giannakopoulos, 2013; Elhadad et al., 2013; Giannakopoulos et al., 2015).

• Multilingual summary evaluation: Summary evaluation has been an open question for several years, even though there exist methods that correlate well with human judgement when called upon to compare systems. In the multilingual setting, it is not obvious that these methods will perform as well as in the English-language setting. In fact, some preliminary results have shown that several problems may arise in the multilingual setting (Giannakopoulos et al., 2011a). The same challenges arise across different source types and genres. This aspect of the workshop aims to cover and discuss these research problems and corresponding solutions.

The workshop builds upon the results of a set of research community tasks, which are elaborated on in the following paragraphs. However, this year MultiLing also hosts works beyond the tasks themselves, but still within the scope of automatic summarization and evaluation in different genres and settings.

2 Community Tasks

In this year's MultiLing community effort we are implementing the following tasks:


• Multilingual Single-Document Summarization (MSS)

• Multilingual Summary Evaluation (MSE)

• Online Forum Summarization (OnForumS)

• Call Centre Conversation Summarization (CCCS)

• Headline Generation Task (HG)

Due to time limitations, all but the MSS and OnForumS tasks will run beyond the workshop timespan; thus the proceedings will be complemented by the proceedings addendum,1 containing system reports and evaluation results.

3 Multilingual Single-DocumentSummarization Task Overview

The Multilingual Single-document Summarization 2017 task was designed to measure the performance of multilingual, single-document summarization systems using a dataset derived from the featured articles of 41 Wikipedias. The objective was to assess the performance of automatic summarization techniques on text documents covering a diverse range of languages and topics outside the news domain. This section describes the task, the dataset and the methods to be used to evaluate the submitted summaries. To give ample time for evaluation, the results and analysis will be presented at the workshop and published later. The objective of this task, like the 2015 Multilingual Single-document Summarization Task, was to stimulate research and assess the performance of automatic single-document summarization systems on documents covering a large range of sizes, languages, and topics.

3.1 Task and Dataset Description

Each participating system was to compute a summary for each document in at least one of the dataset's 41 languages. To remove any potential bias in the evaluation of generated summaries that are too small, the human summary length in characters was provided for each test document, and generated summaries were expected to be close to it.

1 Cf. http://multiling.iit.demokritos.gr/pages/view/1638/multiling-2017-proceedings-addendum

The testing dataset was created using the same steps as reported in Section 2 of (Giannakopoulos et al., 2015) and excluded the articles in the training dataset (which was the testing dataset for the task in 2015). For each language, Table 1 contains the mean character size of the summary and body of the articles selected for the test dataset. Within the dataset there is no correlation between the summary and body size of the articles; in fact, the variance in the summary size is small. This is likely because Wikipedia style requirements dictate that a summary be at most four paragraphs,2 regardless of article size, and that paragraphs be reasonably sized.3

3.2 Preprocessing and Evaluation

For the evaluation, the baseline summary for each article in the dataset was the prefix substring of the article's body text having the same length as the human summary of the article. An oracle summary was also computed for each article using the combinatorial covering algorithm of (Davis et al., 2012), selecting sentences from its body text to cover the tokens in the human summary using as few sentences as possible, until its size exceeded that of the human summary, at which point it was truncated.
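As a rough illustration, the oracle construction can be approximated by the greedy sketch below (a simplification under assumed helper functions, not the OCCAMS algorithm of Davis et al. (2012) itself; the tokenize argument and sentence list are placeholders):

    # Simplified greedy approximation of the oracle summary (illustrative only):
    # repeatedly pick the body sentence covering the most still-uncovered tokens
    # of the human summary, stop once the human summary length is exceeded,
    # then truncate to that length.
    def greedy_oracle(body_sentences, human_summary, tokenize):
        target_tokens = set(tokenize(human_summary))
        target_len = len(human_summary)
        covered, chosen = set(), []
        remaining = list(body_sentences)
        while remaining and len(" ".join(chosen)) <= target_len:
            gains = [(len(set(tokenize(s)) & (target_tokens - covered)), s) for s in remaining]
            gain, best = max(gains)
            if gain == 0:
                break  # no sentence adds new coverage
            chosen.append(best)
            covered |= set(tokenize(best))
            remaining.remove(best)
        return " ".join(chosen)[:target_len]  # truncate to the human summary size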

Preprocessing of all the submitted and human summaries was performed using the Natural Language Toolkit (Bird et al., 2009). Sentence splitting was done using Punkt. Models based on the Wikipedia data were built for each language. For each summary, the preprocessing steps were as follows (a rough code sketch is given after the list):

1. all multiple white-spaces and control characters are converted to a single space;

2. any leading space is removed;

3. the resulting text string is truncated to the human summary length;

4. the text is tokenized and, if possible, lemmatized;

5. all tokens without a letter or number are discarded;

6. all remaining tokens are lowercased.
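A minimal sketch of these normalization steps in Python with NLTK (illustrative only: the function name, the optional lemmatize argument and the exact regular expression are assumptions, not the organizers' actual code, and the NLTK "punkt" resource is assumed to be installed):

    import re
    from nltk.tokenize import word_tokenize

    def preprocess_summary(text, human_summary_length, lemmatize=None):
        # steps 1-2: collapse whitespace/control characters and drop leading space
        text = re.sub(r"\s+", " ", text).lstrip()
        # step 3: truncate to the human summary length (in characters)
        text = text[:human_summary_length]
        # step 4: tokenize and, if a lemmatizer is available, lemmatize
        tokens = word_tokenize(text)
        if lemmatize is not None:
            tokens = [lemmatize(tok) for tok in tokens]
        # step 5: discard tokens without a letter or number
        tokens = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
        # step 6: lowercase all remaining tokens
        return [tok.lower() for tok in tokens]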

2 https://en.wikipedia.org/wiki/Wikipedia:LEAD
3 https://en.wikipedia.org/wiki/Wikipedia:WBA


Table 1: Dataset Languages and Sizes

    ISO  LANGUAGE     SUMMARY      BODY            | ISO     LANGUAGE    SUMMARY      BODY
    af   Afrikaans    1743 (784)   32407 (20378)   | ka      Georgian    1114 (682)   23626 (23018)
    ar   Arabic       2129 (1045)  38682 (16354)   | ko      Korean       905 (491)   15723 (7098)
    az   Azerbaijani  1375 (937)   48687 (45855)   | li      Limburgish   569 (237)   14177 (16326)
    bg   Bulgarian    1451 (782)   29421 (10774)   | lv      Latvian     1334 (514)   25292 (13464)
    bs   Bosnian      1275 (801)   26497 (15319)   | mr      Marathi      970 (653)   14727 (8438)
    ca   Catalan      1733 (906)   28536 (14460)   | ms      Malay       1420 (952)   22820 (16851)
    cs   Czech        1947 (745)   33751 (24010)   | nl      Dutch       1316 (562)   36638 (18062)
    de   German       1122 (470)   42838 (30382)   | nn      Norwegian    965 (493)   17772 (9073)
    el   Greek        1582 (905)   36081 (16652)   | no      Nor.-Bok.   1808 (913)   37128 (22024)
    en   English      1878 (735)   20683 (9644)    | pl      Polish      1470 (687)   31460 (16319)
    eo   Esperanto    1286 (875)   22905 (10279)   | pt      Portuguese  2247 (759)   37189 (16777)
    es   Spanish      2083 (892)   47670 (39981)   | ro      Romanian    2204 (710)   38973 (20349)
    eu   Basque       1105 (742)   23558 (16672)   | ru      Russian     1855 (915)   59337 (27360)
    fa   Persian      1850 (581)   29525 (13172)   | simple  Simp. Eng.   973 (351)    9793 (7027)
    fi   Finnish      1135 (406)   23971 (10538)   | sk      Slovak      1104 (631)   26102 (11024)
    fr   French       1924 (884)   65960 (41289)   | th      Thai        1851 (951)   30549 (15203)
    hr   Croatian     1398 (1119)  22430 (13583)   | tr      Turkish     2059 (807)   32240 (23667)
    id   Indonesian   1813 (964)   26634 (18564)   | tt      Tagalog     1149 (779)   23648 (14139)
    it   Italian      1743 (701)   51461 (20832)   | uk      Ukrainian   1023 (758)   35552 (32014)
    ja   Japanese      383 (275)   21349 (14694)   | zh      Chinese      662 (245)   10614 (6338)
    jv   Javanese     1118 (855)   14033 (10810)   |

Table 1: The table lists the languages in the dataset, with the first column containing the ISO code for each language, the second column the name of the language, and the remaining columns containing the mean size, in characters, and standard deviation, in parentheses, of the summary and body of the articles. For example, for English the mean size of the human summaries is 1,857 characters.


As of the time of publication of the proceedings, three teams have participated, and automatic methods of scoring the submissions, using ROUGE (Lin, 2004) and MeMoG (Gia, ), are underway and will be presented at the EACL 2017 workshop. A human evaluation will proceed afterwards.

4 OnForumS Task

Further to the pilot of OnForumS in 2015, we organized the task again in 2017 with a brand new dataset. The OnForumS task investigates how the mass of comments found on news providers' web sites can be summarized. We posit that a crucial initial step towards that goal is to determine what comments link to, be that either specific news snippets or comments by other users. Furthermore, a set of labels for a given link may be articulated to capture phenomena such as agreement and sentiment with respect to the comment target. Solving this labelled linking problem can enable recognition of salience (e.g., snippets/comments with most links) and relations between comments (e.g., agreement).

The OnForumS task is a particular specification of the linking task, in which systems take as input a news article with comments and are asked to link and label (sentiment, argument) each comment to sentences in the article, to the article topic as a whole, or to other comments. The set of possible sentiment labels is [POS, NEUT, NEG] and the set of possible argument labels is [IN FAVOR, AGAINST, IMPARTIAL].

This year we focus on English (The Guardian) and Italian (La Repubblica), as in the previous edition, and we released the 2015 test data as training data.

The 2017 text collection contains 19 English and 19 Italian articles. This year we had 4 participating teams and, together with two baselines, we received 9 runs. The evaluation focuses on how many of the links and labels were correctly identified, as in the previous OnForumS run. The next step is to manually validate the links and labels using CrowdFlower.

5 MultiLingual Summary Evaluation

The summary evaluation task revisits the multilingually applicable evaluation challenge. The aim is to introduce novel automatic methods for summary evaluation. Even though, currently, systems are evaluated using the ROUGE (Lin, 2004) and MeMoG (Gia, ) metrics, there exists a big gap between automatic methods and manual annotations, especially in non-English settings (Giannakopoulos et al., 2011b).

This year's task reuses the MultiLing 2013 and 2015 single-document and multi-document summarization corpora and evaluations. Furthermore, we generate summary variations (often through inducing "noise"), which the evaluation systems will be asked to grade. These variations include:

• Sentence re-ordering;

• Random sentence replacement;

• Merging between different summaries.

All the above changes will be studied to understand the strengths and weaknesses of different evaluation methods with respect to these synthetic deviations. Then, a human evaluation will be conducted to see whether humans respond similarly to the automatic methods with respect to the different noise types.
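For illustration, the following sketch shows one way such synthetic variations could be generated from sentence-split summaries (the inputs and function names are assumptions; this is not the organizers' actual generation code):

    import random

    def reorder_sentences(sentences):
        # sentence re-ordering: return the sentences in a random order
        return random.sample(sentences, len(sentences))

    def replace_random_sentence(sentences, other_summary_sentences):
        # random sentence replacement: swap one sentence for a sentence
        # drawn from a different summary
        noisy = list(sentences)
        noisy[random.randrange(len(noisy))] = random.choice(other_summary_sentences)
        return noisy

    def merge_summaries(sentences_a, sentences_b):
        # merging between different summaries: keep the first half of one
        # summary and the second half of another
        return sentences_a[: len(sentences_a) // 2] + sentences_b[len(sentences_b) // 2 :]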

The aim of this task and study is to understand how variations of a text change its perceived quality as a summary. It also aims to highlight the (in)sufficiency of existing methods in the multilingual setting and to promote new, more robust approaches for summary evaluation.

6 Tasks in preparation

6.1 Headline Generation

The Headline Generation (HG) task aims to explore some of the challenges highlighted by current state-of-the-art approaches to creating informative headlines for news articles: non-descriptive headlines, out-of-domain training data, and generating headlines from long documents which are not well represented by the head heuristic. This task has been previously addressed in past summarization challenges, such as the well-known Document Understanding Conferences (DUC) in the 2002, 2003 and 2004 editions.

With the high rate of information increase, novel summarization methods that could condense and extract relevant information in just one sentence (i.e., headlines) would fit perfectly in today's society for creating better information access and processing tools. We will rerun the headline generation task under the DUC 2002, 2003 and 2004 conditions,4 in order to create comparable results, and determine to what extent the techniques and methods have improved with respect to former participants.

4 Cf. http://duc.nist.gov/

Moreover, we will encourage multilingual or cross-lingual approaches able to generate headlines for at least two languages. We expect to make available a large set of training data for headline generation, and to create evaluation conditions to objectively assess and compare different approaches.

6.2 Call Centre Conversation Summarization

The Call Centre Conversation Summarization (CCCS) task — run for the first time as a pilot task in 2015 — consists in automatically generating summaries of spoken conversations in the form of textual synopses that shall inform on the content of a conversation and might be used for browsing a large database of recordings. As in CCCS 2015, participants in the task shall generate abstractive summaries from conversation transcripts that inform a reader about the main events of the conversations, such as the objective of the participants and how it is met. Evaluation will be performed by ROUGE-like measures based on human-written summaries, as in CCCS 2015, and — if possible — will be complemented by manual evaluation, depending on the funding we can secure for the task.

7 Conclusion

This year MultiLing covers a number of challenging problems related to summarization. In the proceedings (and the addendum) one can find various methods using deep learning and word embeddings, topic modeling, optimization and other approaches to achieve summarization and summary evaluation across settings.

The rest of the proceedings will allow the reader to examine interesting challenges related to abstractive summarization, argument labeling, and multi-genre, multi-document and query-based summarization. They will also identify and attempt to tackle important challenges related to summary evaluation beyond English.

We hope that the conclusion of the tasks after the workshop will provide the grounds for further research and open systems development, revising and improving the way summarization is modeled, faced, evaluated and implemented in the years to come.

8 Acknowledgements

This work was supported by the project MediaGist, EU's FP7 People Programme (Marie Curie Actions), no. 630786.

References

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Sashka T. Davis, John M. Conroy, and Judith D. Schlesinger. 2012. OCCAMS - an optimal combinatorial covering algorithm for multi-document summarization. In 12th IEEE International Conference on Data Mining Workshops, ICDM Workshops, Brussels, Belgium, December 10, 2012, pages 454–463.

Michael Elhadad, Sabino Miranda-Jimenez, Josef Steinberger, and George Giannakopoulos. 2013. Multi-document multilingual summarization corpus preparation, part 2: Czech, Hebrew and Spanish. MultiLing 2013, page 13.

Benoit Favre, Evgeny Stepanov, Jeremy Trione, Frederic Bechet, and Giuseppe Riccardi. 2015. Call centre conversation summarization: A pilot task at MultiLing 2015. In 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page 232.

Summarization system evaluation variations based on n-gram graphs.

George Giannakopoulos, Mahmoud El-Haj, Benoit Favre, Marina Litvak, Josef Steinberger, and Vasudeva Varma. 2011a. TAC2011 MultiLing pilot overview. In TAC 2011 Workshop.

George Giannakopoulos, Mahmoud El-Haj, Benoit Favre, Marina Litvak, Josef Steinberger, and Vasudeva Varma. 2011b. TAC2011 MultiLing pilot overview. In TAC 2011 Workshop.

George Giannakopoulos, Jeff Kubina, John M. Conroy, Josef Steinberger, Benoit Favre, Mijail Kabadjov, Udo Kruschwitz, and Massimo Poesio. 2015. MultiLing 2015: Multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In SIGDIAL 2015.

George Giannakopoulos. 2013. Multi-document multilingual summarization and evaluation tracks in ACL 2013 MultiLing workshop. In Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization, pages 20–28.

Mijail Kabadjov, Josef Steinberger, Emma Barker, Udo Kruschwitz, and Massimo Poesio. 2015. OnForumS: The shared task on online forum summarisation at MultiLing'15. In Proceedings of the 7th Forum for Information Retrieval Evaluation, pages 21–26. ACM.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain.


Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, pages 7–11, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Decoupling Encoder and Decoder Networks for Abstractive Document Summarization

Ying Xu1, Jey Han Lau2,3, Timothy Baldwin3 and Trevor Cohn3
1 Monash University, 2 IBM Research, 3 The University of Melbourne
[email protected], [email protected]@ldwin.net, [email protected]

Abstract

Abstractive document summarization seeks to automatically generate a summary for a document, based on some abstract "understanding" of the original document. State-of-the-art techniques traditionally use attentive encoder–decoder architectures. However, due to the large number of parameters in these models, they require large training datasets and long training times. In this paper, we propose decoupling the encoder and decoder networks, and training them separately. We encode documents using an unsupervised document encoder, and then feed the document vector to a recurrent neural network decoder. With this decoupled architecture, we decrease the number of parameters in the decoder substantially, and shorten its training time. Experiments show that the decoupled model achieves comparable performance with state-of-the-art models for in-domain documents, but less well for out-of-domain documents.

1 Introduction

Abstractive document summarization is a challenging natural language understanding task. Abstractive methods first encode the original document into a high-level representation, and then decode it into the target summary.

Rush et al. (2015) proposed the task of headline generation as the first step towards abstractive summarization. Instead of using the full document, the authors experimented with using the first sentence as input, with the aim of generating a coherent headline given the sentence.

The current state-of-the-art system for the task is based on an attentive encoder and a recurrent decoder (Chopra et al., 2016), which is an extension of the methodology of Rush et al. (2015). The encoder and decoder are trained jointly, and the decoder attends to different parts of the document during generation. It has a large number of parameters, and thus requires large-scale training data and long training times.

In this paper, we propose decoupling the encoder and decoder. We encode documents using doc2vec (Le and Mikolov, 2014), as it has been demonstrated to be a competitive unsupervised document encoder (Lau and Baldwin, 2016). We incorporate doc2vec vectors of input documents into the decoder as an additional signal, to generate sentences that are not only coherent but are also related to the original documents. Compared to the standard joint encoder–decoder design, the decoupled architecture has fewer parameters in the decoder, and thus requires less training data and trains faster.1 The downside of the decoupled architecture is that the doc2vec signal is not updated in the decoder, and its document representation could be sub-optimal for the decoder to generate good summaries. Our experiments reveal that the decoupled architecture works well in-domain, but less well out-of-domain, as a consequence of the fixed capacity of the document encoding as well as having no explicit copy mechanism.

2 Attentive Recurrent Neural Network: A Joint Encoder–decoder Architecture

The attentive recurrent neural network is composed of an attentive encoder and a recurrent decoder (Chopra et al., 2016), where the encoder is a fixed-window feed-forward network and the decoder is a recurrent neural network (RNN: Elman (1998), Mikolov et al. (2010)) language model, and parameters from both networks are trained together. Let x denote the document, y the summary, θ the set of parameters to be learnt, and y = y_1, y_2, ..., y_N a word sequence of length N. When decoding, y is computed as argmax P(y|x; θ), where the conditional probability P(y|x; θ) can be calculated from each word y_t in the sequence, i.e.

    P(y|x; θ) = \prod_{t=1}^{N} P(y_t | {y_1, ..., y_{t-1}}, x; θ)

1 The training time is decreased from 4 days (with full GIGAWORD) for the coupled model (Rush et al., 2015) to 2 days (with 75% GIGAWORD) in our model, with comparable in-domain performance.

[Figure 1: Encoder–decoder architectures. (a) Joint encoder–decoder architecture; (b) Decoupled encoder–decoder architecture.]

For the recurrent decoder, the conditional probability of each word is given as P(y_t | {y_1, ..., y_{t-1}}, x; θ) = g_θ(h_t, c_t), where h_t represents the hidden state of the RNN, i.e. h_t = g_θ(y_{t-1}, h_{t-1}, c_t), and c_t is the output of the encoder at time t.

For the attentive encoder, c_t is computed by attending to some of the source words using the previous hidden state h_{t-1}.2 Figure 1a demonstrates the dependencies between the next generated word y_t and the document x, given the parameters θ.

The joint architecture ensures that the information from document x is adapted to the current context of the generated summary.

3 Decoupled Encoder–decoder Architecture for Document Summarization

A decoupled encoder–decoder architecture has a clear boundary between the encoder and decoder: it can be seen as a pipeline model where the output of the encoder is fed as an input to the decoder, so the encoder and decoder modules can be trained separately. Figure 1b illustrates how the encoder and decoder are decoupled from each other. Here, we use doc2vec as the document encoder and a long short-term memory network (LSTM: Hochreiter and Schmidhuber (1997)) as the decoder.

2 The unnormalised attention weights are computed by combining h_{t-1} with the convolutional embedding of each source word via dot product.

doc2vec is an extension of word2vec (Mikolov et al., 2013), a popular deep learning method for learning word embeddings. In word2vec (based on the skipgram variant) the embedding of a target word is learnt by optimising it to predict its position-indexed neighbouring words. doc2vec (based on the dbow variant) is based on the same idea, except that the target word is now the document itself, and the document vector is optimised to predict the document words. Note that the dbow implementation does not take into account the order of the words (hence its name "distributed bag of words"). Once the model is trained, embeddings of new/unseen documents can be inferred from the pre-trained model efficiently. As an encoder, doc2vec is completely unsupervised, and uses no labelled information or signal from the decoder.
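As an illustration of how such an encoder can be built in practice (a sketch using the gensim library with its 4.x API; the corpus variable and parameter values are assumptions, not the settings used in this paper):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # train_docs: a list of token lists, one per training document (assumed to exist)
    tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(train_docs)]

    # dm=0 selects the dbow variant described above
    model = Doc2Vec(tagged, dm=0, vector_size=300, window=15, min_count=5, epochs=20, workers=4)

    # embeddings for new/unseen documents are inferred from the trained model
    doc_vector = model.infer_vector(["north", "korean", "man", "seeks", "asylum", "in", "seoul"])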

The decoder is an RNN language model (Mikolov et al., 2010), implemented as an LSTM (Hochreiter and Schmidhuber, 1997). Formally:

    i_t = W_i x_t + U_i h_{t-1} + b_i
    f_t = W_f x_t + U_f h_{t-1} + b_f
    o_t = W_o x_t + U_o h_{t-1} + b_o        (1)
    j_t = W_j x_t + U_j h_{t-1} + b_j
    c_t = c_{t-1} * σ(f_t) + tanh(j_t) * σ(i_t)
    h_t = tanh(c_t) * σ(o_t)

where i_t, f_t and o_t are input, forget and output gates, respectively; j_t, c_t and h_t represent the new input, new context and new hidden state, respectively; * is the elementwise vector product; and σ is the sigmoid activation function.

    Combination    Equation
    add-input      i'_t = i_t + d
    add-hidden     h'_t = h_t + d
    stack-input    i'_t = [i_t; d]
    stack-hidden   h'_t = [h_t; d]
    mlp-input      i'_t = tanh(W_i i_t + W_d d + b)
    mlp-hidden     h'_t = tanh(W_i h_t + W_d d + b)

Table 1: Incorporation of the doc2vec signal in the decoder. d denotes the doc2vec vector; i_t (h_t) is the input (hidden) vector at time t; and "[·; ·]" denotes vector concatenation.

Given an input word and the previous hidden state, the decoder predicts the next word and generates the summary one word at a time.

To generate summaries that are related to the document, we incorporate the doc2vec signal of the input document into the decoder using several methods proposed by Hoang et al. (2016). There are two layers where we can incorporate doc2vec: the input layer (input) or the hidden layer (hidden). There are three methods of incorporation: addition (add), stacking (stack), or via a multilayer perceptron (mlp). Table 1 illustrates the 6 possible approaches to incorporation.

Note that add requires doc2vec to have the same vector dimensionality as the layer it is combined with, and stack-hidden doubles the hidden size (assuming they have the same dimensions), resulting in a large output projection matrix and longer training time.
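To make this concrete, the sketch below implements a single decoder step of Equation (1) combined with the stack-input variant, in which the doc2vec vector d is concatenated with the word input before the gates are computed (plain NumPy with random weights; function and parameter names are illustrative, not the authors' implementation):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step_stack_input(x_t, d, h_prev, c_prev, params):
        # stack-input: concatenate the doc2vec vector d with the word input x_t
        x_t = np.concatenate([x_t, d])
        W, U, b = params["W"], params["U"], params["b"]   # each projects to 4 * hidden units
        z = W @ x_t + U @ h_prev + b
        i, f, o, j = np.split(z, 4)                       # input, forget, output gates, new input
        c_t = c_prev * sigmoid(f) + np.tanh(j) * sigmoid(i)
        h_t = np.tanh(c_t) * sigmoid(o)
        return h_t, c_t

    hidden, emb_dim, doc_dim = 8, 6, 4
    params = {
        "W": 0.1 * np.random.randn(4 * hidden, emb_dim + doc_dim),
        "U": 0.1 * np.random.randn(4 * hidden, hidden),
        "b": np.zeros(4 * hidden),
    }
    h, c = np.zeros(hidden), np.zeros(hidden)
    h, c = lstm_step_stack_input(np.random.randn(emb_dim), np.random.randn(doc_dim), h, c, params)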

4 Experiments and Results

4.1 Datasets

We test our decoupled architecture for the headline generation task. Following Chopra et al. (2016), we run experiments using GIGAWORD, DUC03 and DUC04.3

GIGAWORD is preprocessed according to Rush et al. (2015), yielding 4.3 million examples. For in-domain experiments, we randomly sample 2,000 examples each for the validation and test sets, and use the remaining examples for training. We tune hyper-parameters based on validation perplexity and evaluate performance on the test set using the ROUGE metric (Lin, 2004), following the same evaluation style as the benchmark systems (Rush et al., 2015; Chopra et al., 2016). For out-of-domain experiments, we use the same models trained on GIGAWORD, but tune using DUC03 and test on DUC04; DUC03 and DUC04 each have 500 examples.

3 GIGAWORD: https://catalog.ldc.upenn.edu/LDC2003T05; DUC: http://duc.nist.gov/

For the doc2vec encoder, we train using GIGAWORD and infer document vectors for validation and test examples using the trained model. Validation and test examples are excluded from the doc2vec training data.

4.2 Hyper-parameter tuning

For the encoder, we explore using a range of document lengths (first 20/30/40/50 words) to generate the input representation. Validation results show that using the first 20 words produces the best performance, suggesting that this length contains sufficient information to generate headlines.

We next test the 6 different ways to incorporate doc2vec into the decoder. We find that stacking the doc2vec vector with the input (stack-input) has the most consistent performance, while mlp is competitive, and add performs the worst. Interestingly, for mlp and stack, we find the difference between input and hidden to be small.

For the recurrent decoder, hyper-parameters that are tuned include the mini-batch size, hidden layer size, number of LSTM layers, number of training epochs, learning rate, and dropout rate. The best results are achieved with a mini-batch size of 128, a hidden size of 900, and one LSTM layer. The best perplexity is obtained after 3 to 4 epochs, with a learning rate of 0.001. More training epochs are needed when we reduce the learning rate to 0.0001. For in-domain experiments, the best results are achieved with a dropout rate of 0.1, while for out-of-domain experiments the best performance is obtained with a higher dropout rate of 0.4. This suggests that dropout plays an important role in combating over-fitting, and it is especially useful for out-of-domain data.

4.3 Results

We compare our model (RDS: Recurrent Decoupled Summarizer) with 4 state-of-the-art models: ABS, ABS+ (Rush et al., 2015), and RAS-LSTM and RAS-Elman (Chopra et al., 2016), which are all joint encoder–decoder models. For in-domain results, we present ROUGE-1/2/L full-length F-scores in Table 2. For out-of-domain results we report ROUGE-1/2/L recall at 75 bytes in Table 3, where only the first 75 bytes of the model-generated summary are used for evaluation against the references.

ABS employs an attentive encoder and a feed-forward neural network decoder. ABS+ works in the same way as ABS, but further tunes the decoder using Z-MERT (Zaidan, 2009). RAS-LSTM and RAS-Elman are detailed in Section 2; the only difference between them is that RAS-LSTM uses an LSTM decoder, while RAS-Elman uses a simple RNN decoder. Prefix is a baseline model where the first 75 bytes of the document are used as the title.

    System       ROUGE-1   ROUGE-2   ROUGE-L
    ABS          29.6      11.3      26.4
    ABS+         29.8      11.9      30.0
    RAS-LSTM     32.6      14.7      30.0
    RAS-Elman    33.8      16.0      31.2
    RDS          30.7      11.3      27.6
    RDS (75%)    29.1      10.0      26.3
    RDS (50%)    27.4       8.9      24.9

Table 2: Comparison of ROUGE scores (full-length F-score) for in-domain experiments.

    System       ROUGE-1   ROUGE-2   ROUGE-L
    Prefix       22.4       6.5      19.7
    ABS          26.6       7.1      22.1
    ABS+         28.2       8.5      23.8
    RAS-LSTM     27.4       7.7      23.1
    RAS-Elman    29.0       8.3      24.1
    RDS          16.7       3.7      14.4

Table 3: Comparison of ROUGE scores (recall at 75 bytes) for out-of-domain experiments.

Looking at Table 2, models with a recurrent decoder (RDS and RAS) perform better than those with a feed-forward decoder (ABS). The decoupled architecture RDS achieves competitive performance, although the fully joint RAS models achieve the best results. Chopra et al. (2016) found that RAS models perform best with a beam size of 10, while we found that RDS performs best with greedy argmax decoding.

We further experiment with training RDS using less data (50% and 75%), and find that its performance degrades slightly. Encouragingly, its ROUGE-1 is comparable to ABS when RDS is trained using only 75% of the training data, and it takes 2 days of training time (RDS) instead of 4 days (ABS).

    I: A North Korean man arrived in Seoul Wednesday and sought asylum after escaping his hunger-stricken homeland, government officials said.
    A: A North Korean man arrives in Seoul to seek asylum for homeland security officials say.
    D: North Korean man defects to South Korea.

    I: King Norodom Sihanouk has declined requests to chair a summit of Cambodia's top political leaders, saying the meeting would not bring any progress in deadlocked negotiations to form a government.
    A: King Sihanouk declines to meet Cambodian leaders on eve of talks with Cambodia.
    D: Cambodian king refuses to meet with leaders of political leaders.

    I: Former U.S. president Jimmy Carter, who seems a perennial Nobel peace prize also-ran, could have won the coveted honor in 1978 had it not been for strict deadline rules for nominations.
    A: Former U.S. president Jimmy Carter wins top honor for Nobel peace prize nominations.
    D: Carter wins Nobel prize in literature.

Table 4: System-generated article summaries. I: reference summary; A: RAS-Elman; and D: RDS.

For out-of-domain experiments, we compute ROUGE recall at 75 bytes and find that RDS performs poorly, even worse than the baseline method Prefix. This suggests that the decoupled architecture is sensitive to domain differences, and highlights a potential downside of the architecture. Investigating methods that can improve cross-domain performance is a direction for future work.

5 Discussion and Conclusion

We present some summaries for DUC documents generated by RAS-Elman and RDS in Table 4 to better understand possible reasons for the poor out-of-domain performance of RDS. The first observation is the shorter headlines generated by RDS compared to the DUC reference summaries and RAS-Elman headlines. RDS headlines are short — at an average length of 8.13 words — and this is due to the short length of the GIGAWORD titles it is trained on, at 8.50 words on average. On the other hand, the average length of DUC reference summaries and RAS-Elman headlines is 11.63 and 13.08 words respectively; this explains why RAS-Elman achieves better performance, since a longer sentence would have a better chance to score higher in ROUGE.

We also find that RDS often generates similar (but not identical) words and is penalised by ROUGE, which uses exact word matching, as evidenced by Sihanouk and Cambodian king in the second example.4 Lastly, the bag-of-words view of the doc2vec encoder results in some meaning loss: in the third example, Jimmy Carter did not actually win the Nobel peace prize, for example.

We also computed the source copy rate of the systems.5 We find that, on average, RDS copies only 50.7% of its predicted words from the input document, while ABS and ABS+ copy at a rate of 85.4% and 91.5% respectively. This is interesting, as it suggests that RDS paraphrases more than the other systems, while achieving similar ROUGE performance.
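A copy rate of this kind could be computed with a few lines of Python (a sketch following the definition in footnote 5; tokenization is assumed to have been done already):

    def source_copy_rate(generated_tokens, source_tokens):
        # fraction of generated words that also appear in the original source document
        source = set(source_tokens)
        copied = sum(1 for tok in generated_tokens if tok in source)
        return copied / max(len(generated_tokens), 1)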

To summarise, we proposed decoupling the encoder–decoder architecture that is traditionally used in sequence-to-sequence problems. We tested the decoupled system on news title generation, and found that it performed competitively in-domain. Out-of-domain experiments, however, reveal sub-par performance, suggesting that the decoupled architecture is susceptible to domain differences.

4 Informal manual evaluation for RDS, ABS, ABS+ and RAS on GIGAWORD reveals that ABS, ABS+ and RAS also generate words that are of similar meaning, and that overall, RDS is less preferred by annotators.

5 Source copy rate is defined as the fraction of generated words that are in the original source documents.

References

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California, June. Association for Computational Linguistics.

Jeffrey Elman. 1998. Generalization, simple recurrent networks, and the emergence of structure. In Proceedings of the Twentieth Annual Conference of the Cognitive Science Society.

Cong Duy Vu Hoang, Trevor Cohn, and Gholamreza Haffari. 2016. Incorporating side information into recurrent neural network language models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1250–1255, San Diego, California, June. Association for Computational Linguistics.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.

Jey Han Lau and Timothy Baldwin. 2016. An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 78–86, Berlin, Germany, August. Association for Computational Linguistics.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July. Association for Computational Linguistics.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech, volume 2, page 3.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal, September. Association for Computational Linguistics.

Omar F. Zaidan. 2009. Z-MERT: A fully configurable open source tool for minimum error rate training of machine translation systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88.


Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, pages 12–21, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Centroid-based Text Summarization through Compositionality of Word Embeddings

Gaetano Rossiello, Pierpaolo Basile, Giovanni Semeraro
Department of Computer Science, University of Bari, 70125 Bari, Italy
{firstname.secondname}@uniba.it

Abstract

The textual similarity is a crucial aspect for many extractive text summarization methods. A bag-of-words representation does not allow one to grasp the semantic relationships between concepts when comparing strongly related sentences with no words in common. To overcome this issue, in this paper we propose a centroid-based method for text summarization that exploits the compositional capabilities of word embeddings. The evaluations on multi-document and multilingual datasets prove the effectiveness of the continuous vector representation of words compared to the bag-of-words model. Despite its simplicity, our method achieves good performance even in comparison to more complex deep learning models. Our method is unsupervised and it can be adopted in other summarization tasks.

1 Introduction

The goal of text summarization is to produce a shorter version of a source text while preserving the meaning and the key contents of the original text. This is a very complex problem since it requires emulating the cognitive capacity of human beings to generate summaries. Thus, text summarization poses open challenges in both natural language understanding and generation. Due to the difficulty of this task, research work in the literature has focused on the extractive aspect of summarization, where the generated summary is a selection of relevant sentences from a document (or a set of documents) in a copy-paste fashion. A good extractive summarization method must satisfy and optimize both coverage and diversity properties, where the selected sentences should cover a sufficient amount of topics from the original source text while avoiding redundancy of information in the summary. The diversity property is fundamental especially for multi-document summarization. For instance, in a news aggregator, a selection of too similar sentences may compromise the quality of the generated summary.

An extractive method should define a sentence representation model, a technique for assigning a score to each sentence in the original source, and a ranking module to properly select the most relevant sentences by relying on a similarity function. Following this vision, several summarization methods proposed in the literature use the bag of words (BOW) as the representation model for the sentence scoring and selection modules (Radev et al., 2004; Erkan and Radev, 2004; Lin and Bilmes, 2011). Despite their proven effectiveness, these methods rely heavily on the notion of similarity between sentences, and a BOW representation is often not suitable for grasping the semantic relationships between concepts when comparing sentences. For example, given the two sentences "Syd leaves Pink Floyd" and "Barrett abandons the band", in the BOW model their (sparse) vector representations are orthogonal since they have no words in common, even though the two sentences are strongly related.

In an attempt to solve this issue, in this work we propose a novel and simple extractive summarization method based on the geometric meaning of the centroid vector of a (multi-)document, taking advantage of the compositional properties of word embeddings (Mikolov et al., 2013b). Empirically, we prove the effectiveness of word embeddings with a fair comparison to the BOW representation by limiting, as much as possible, the parameters and the complexity of the method. Surprisingly, the results achieved by our method on the gold standard DUC-2004 dataset are comparable, and in some cases better, to those obtained using more complex sentence representations coming from deep learning models.

In the following section we provide a brief description of word embeddings and text summarization methods. The centroid-based summarization method that uses word embeddings is described in Section 3, followed by experimental results in Section 4. Final remarks and a discussion of our future plans are reported in Section 5.

2 Related Work

2.1 Word Embeddings

Word embedding stands for a continuous vector representation able to capture syntactic and semantic information of a word. Several methods have been proposed in order to create word embeddings that follow the Distributional Hypothesis (Harris, 1954). In our work we use two models,1 continuous bag-of-words and skip-gram, introduced by (Mikolov et al., 2013a). These models learn a vector representation for each word using a neural network language model and can be trained efficiently on billions of words. Word2vec allows learning complex semantic relationships using simple vectorial operators, such as vec(king) − vec(man) + vec(woman) ≈ vec(queen) and vec(Barrett) − vec(singer) + vec(guitarist) ≈ vec(Gilmour). However, our method is general and other approaches for building word embeddings can be used (Goldberg, 2015).
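Such vector arithmetic can be reproduced, for example, with pre-trained vectors loaded in gensim (a sketch; the model file name is a placeholder and the retrieved neighbours depend on the training corpus):

    from gensim.models import KeyedVectors

    # load pre-trained word vectors (the path is a placeholder)
    vectors = KeyedVectors.load_word2vec_format("word2vec_vectors.bin", binary=True)

    # vec(king) - vec(man) + vec(woman) is expected to be close to vec(queen)
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))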

2.2 Text Summarization

Since the first method proposed by (Luhn, 1958), automatic text summarization has been widely addressed by the research community, with the proposal of different methodologies as well as toolkits (Saggion and Gaizauskas, 2004). Good surveys are provided by (Jones, 2007; Saggion and Poibeau, 2013). Since our method exploits word embeddings as an alternative representation to BOW, here we focus on methods sharing this feature. Methods based on matrix factorization, such as Latent Semantic Analysis (LSA) (Ozsoy et al., 2011) and Non-Negative Matrix Factorization (NMF) (Lee et al., 2009), aim to reveal the latent factors by producing dense and compact representations of sentences. Recently, riding the wave of the prominent results of modern Deep Learning (DL) models in many natural language processing tasks (LeCun et al., 2015; Goodfellow et al., 2016), several groups have started to exploit deep neural networks for both abstractive (Rush et al., 2015; Nallapati et al., 2016) and extractive (Kageback et al., 2014; Cao et al., 2015; Cheng and Lapata, 2016) text summarization.

1 Commonly called word2vec.

Figure 1: Sentence and centroid embeddings 2D visualization of the Donkey Kong (video game) Wikipedia article. Dimensionality reduction is performed using the t-SNE algorithm. For each sentence, its position in the document is shown. The sentences closest to the centroid embedding are marked in green.

3 Centroid-based Method

The centroid-based method for extractive summarization was introduced by (Radev et al., 2004). The centroid represents a pseudo-document that condenses the meaningful information of a document2. The main idea is to project into the vector space the vector representations of both the centroid and each sentence of a document. Then, the sentences closest to the centroid are selected. The original method adopts the BOW model for the vector representations, using the tf*idf weighting scheme (Salton and McGill, 1986), where the size of the vectors is equal to that of the document vocabulary. We adapt the centroid-based method by introducing a distributed representation of words in which each word in a document is represented by a vector of real numbers of an established size. Formally, given a corpus of documents [D1, D2, . . . ] and its vocabulary V with size N = |V|, we define a matrix E ∈ R^(N×k), the so-called lookup table, where the i-th row is the word embedding, of size k (k ≪ N), of the i-th word in V. The values of the word embedding matrix E are learned using the neural network model introduced by (Mikolov et al., 2013b). The model can be trained on the collection of documents to be summarized or on a larger corpus. This is a peculiar advantage of Representation Learning (RL) (Bengio et al., 2013), which allows external knowledge to be reused; this is especially useful for the summarization of documents in specific domains, where large amounts of data are not available. After learning the lookup table, our summarization method consists of four steps: 1) preprocess the input document; 2) build the centroid embedding; 3) compute the sentence scores; 4) select the relevant sentences.

2 In this section we refer to a single document, but the method can be extended to a cluster of documents.

arcade  | donkey | kong   | game        | nintendo | coleco        | centroid embedding
arcades | goat   | hong   | gameplay    | mario    | intellivision | nes
pac-man | pig    | macao  | multiplayer | wii      | atari         | gamecube
console | monkey | fung   | videogame   | console  | nes           | konami
famicom | horse  | taiwan | rpg         | nes      | msx           | wii
sega    | cow    | wong   | gamespot    | gamecube | 3do           | famicom

Table 1: Centroid words of the Donkey Kong (video game) article having tf-idf values greater than a topic threshold of 0.3. For each centroid word, the five closest words are shown using the skip-gram model trained on the Wikipedia (en) content. The last column shows the words most similar to the centroid embedding computed using element-wise addition.

3.1 Preprocessing

The first step follows the common pipeline for the summarization task: split the document into sentences, convert all words to lower case and remove stopwords. Stemming is not performed, because we let the word embeddings discover the linguistic regularities of words with the same root (Mikolov et al., 2013c). For instance, the words most similar to the ones that compose the centroid vector, using the skip-gram model trained on the Wikipedia content, are reported in Table 1. The closest word to arcade is its plural arcades, although the two are orthogonal in the vector space according to the BOW representation.

3.2 Centroid Embedding

In order to build a centroid vector using word embeddings, we first select the meaningful words in the document. For simplicity, and for a fair comparison with the original method, we select those words having a tf*idf weight greater than a topic threshold t. Then, we compute the centroid embedding as the sum3 of the embeddings of the top ranked words in the document, using the lookup table E:

    C = Σ_{w ∈ D, tfidf(w) > t} E[idx(w)]        (1)

In eq. (1) we denote by C the centroid embedding related to the document D and by idx(w) a function that returns the index of the word w in the vocabulary. The headers of Table 1 show the centroid words extracted from a Wikipedia article. The last column reports the words most similar to the centroid embedding computed using element-wise addition. It is important to underline that all five closest words to the centroid vector are semantically related to the main topic of the document, despite the size of the Wikipedia vocabulary (about 1 million words).
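A minimal sketch of this step is given below. It is our own illustration, not the authors' released code: it assumes preprocessed sentences as strings and a plain dictionary of word vectors, and, for brevity, it estimates the tf*idf weights from the document's own sentences with scikit-learn rather than from the whole collection:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def centroid_embedding(sentences, word_vectors, topic_threshold=0.3):
        """sentences: preprocessed sentence strings of one document;
        word_vectors: dict mapping a word to its embedding (numpy array)."""
        # For brevity, tf*idf is estimated over the document's own sentences;
        # the paper computes the weights over the (multi-)document collection.
        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(sentences)
        weights = np.asarray(tfidf.max(axis=0).todense()).ravel()  # best weight per word
        vocab = vectorizer.get_feature_names_out()

        dim = len(next(iter(word_vectors.values())))
        centroid = np.zeros(dim)
        for word, weight in zip(vocab, weights):
            if weight > topic_threshold and word in word_vectors:
                centroid += word_vectors[word]                     # element-wise sum, Eq. (1)
        return centroid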

3.3 Sentence Scoring

For each sentence in the document, we create an embedding representation by summing the vectors, stored in the lookup table E, of each word in the sentence:

    S_j = Σ_{w ∈ S_j} E[idx(w)]        (2)

In eq. (2) we denote by S_j the j-th sentence in the document D. The sentence score is then computed as the cosine similarity between the embedding of the sentence S_j and that of the centroid C of the document D:

    sim(C, S_j) = (C^T · S_j) / (||C|| ||S_j||)        (3)
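The two equations translate directly into a few lines of code; the sketch below (our own, with illustrative names) reuses the same word-vector dictionary as in the previous sketch:

    import numpy as np

    def sentence_embedding(sentence, word_vectors, dim):
        """Sum of the word vectors of a preprocessed sentence string, Eq. (2)."""
        emb = np.zeros(dim)
        for word in sentence.split():
            if word in word_vectors:
                emb += word_vectors[word]
        return emb

    def score(centroid, sent_emb, eps=1e-12):
        # cosine similarity between centroid and sentence embedding, Eq. (3)
        return float(np.dot(centroid, sent_emb) /
                     (np.linalg.norm(centroid) * np.linalg.norm(sent_emb) + eps))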

Figure 1 shows a visualization of the sentence and centroid embeddings of a Wikipedia article.

3 Some works use the average rather than the sum to compose word embeddings. However, the sum and the average do not change the similarity value when using the cosine distance, since the angle between the vectors remains the same.


Sent ID | Sentence | Score
136 | The original arcade version of the game appears in the Nintendo 64 game Donkey Kong 64. | 0.9533
131 | The game was ported to Nintendo's Family Computer (Famicom) console in 1983 as one of the system's three launch titles; the same version was a launch title for the Famicom's North American version, the Nintendo Entertainment System (NES). | 0.9375
186 | In 2004, Nintendo released Mario vs. Donkey Kong, a sequel to the Game Boy title. | 0.9366
192 | In 2007, Donkey Kong Barrel Blast was released for the Nintendo Wii. | 0.9362
135 | The NES version was re-released as an unlockable game in Animal Crossing for the GameCube and as an item for purchase on the Wii's Virtual Console. | 0.9308

Table 2: The most relevant sentences of the Donkey Kong article selected with the centroid-based summarization method using word embeddings. For each sentence, the position ID in the document and the similarity score between the sentence and centroid embeddings are reported. The words that compose the centroid vector are marked in bold. The words most similar to the centroid ones are reported in italic.

We use the t-SNE method (van der Maaten and Hinton, 2008) to reduce the dimensionality of the vectors from 300 to 2. For each sentence, its position ID in the document is shown. The sentences closest to the centroid embedding are marked in green. The words that compose the centroid are the same as those shown in Table 1. In Table 2 we report the sentences nearest to the centroid with the related cosine similarity values. As expected, the most relevant sentence (136) contains many words close to the centroid vector. However, the most interesting aspect concerns the last sentence (135). Although this sentence does not contain any centroid word, it has a high similarity value, so it is close to the centroid embedding in the vector space. The reason is the presence of words, such as NES, GameCube and Wii, that are among the closest words to the centroid embedding (Table 1). This proves the effectiveness of the compositionality of word embeddings in encoding the semantic relations between words through dense vector representations.
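For reference, a rough sketch of how such a 2-D projection can be obtained with scikit-learn's t-SNE implementation is shown below; stacking the centroid as the last row and the small perplexity value are our own choices, not details taken from the paper:

    import numpy as np
    from sklearn.manifold import TSNE

    def project_2d(sentence_embeddings, centroid):
        """sentence_embeddings: list of 1-D numpy arrays; centroid: 1-D numpy array."""
        X = np.vstack(sentence_embeddings + [centroid])   # centroid as the last row
        return TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)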

3.4 Sentence Selection

The sentences are sorted in descending order of their similarity scores. The top ranked sentences are iteratively selected and added to the summary until the limit4 is reached. In order to satisfy the redundancy property, during the iteration we compute the cosine similarity between the next sentence and each one already in the summary. We discard the incoming sentence if the similarity value is greater than a threshold. This procedure is reported in Algorithm 1. Similar sentence selection approaches are described in (Carbonell and Goldstein, 1998; Saggion and Gaizauskas, 2004).

4 The limit can be the number of bytes/words in the summary or a compression rate.

Algorithm 1 Sentence selection
Input: S, Scores, st, limit
Output: Summary
  S ← SortDesc(S, Scores)
  k ← 1
  for i ← 1 to m do
      length ← Length(Summary)
      if length > limit then return Summary
      SV ← SumVectors(S[i])
      include ← True
      for j ← 1 to k do
          SV2 ← SumVectors(Summary[j])
          sim ← Similarity(SV, SV2)
          if sim > st then include ← False
      if include then
          Summary[k] ← S[i]
          k ← k + 1
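A compact Python rendering of Algorithm 1 follows. It is a sketch under our own assumptions (the summary limit is counted in words, and the cosine helper is inlined), not the authors' implementation:

    import numpy as np

    def cosine(u, v, eps=1e-12):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

    def select_sentences(sentences, scores, embeddings, sim_threshold, limit):
        """sentences: raw sentences; scores: centroid similarity per sentence;
        embeddings: sentence embeddings (same order); limit: summary length in words."""
        order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
        summary, chosen, length = [], [], 0
        for i in order:
            if length > limit:
                break
            # redundancy check against sentences already in the summary
            if any(cosine(embeddings[i], embeddings[j]) > sim_threshold for j in chosen):
                continue
            summary.append(sentences[i])
            chosen.append(i)
            length += len(sentences[i].split())
        return summary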

4 Experiments

In this section we describe the benchmarks conducted on two text summarization tasks. The main goal is to compare the centroid-based method using two different representations (bag-of-words and word embeddings). In Section 4.1 and Section 4.2 we report the experimental results carried out on the Multi-Document and the Multilingual Single-Document summarization tasks, respectively.


4.1 Multi-Document Summarization

Dataset and Metrics During the Document Understanding Conference (DUC)5, from 2001 to 2007, several gold standard datasets were developed to evaluate summarization methods. In particular, we evaluate our method on multi-document summarization using the DUC-2004 Task 2 dataset, composed of 50 clusters, each of which consists of 10 documents coming from Associated Press and New York Times newswires. For each cluster, four summaries written by different humans were supplied. For the evaluation, we adopt ROUGE (Lin, 2004), a set of recall-based metrics that compare the automatic and human summaries on the basis of their n-gram overlap. In our experiment, we adopt both ROUGE-1 and ROUGE-26.

Baselines For the comparison, we consider several baselines. First, we adapt the centroid method proposed by (Radev et al., 2004) (C BOW) for a fair comparison. In the original work, the sentence scores are a linear combination of the centroid score, the positional value and the first-sentence overlap, where the centroid score is the sum of the tf*idf weights of the words occurring both in the sentence and in the centroid. In our experiment, we apply our sentence scoring and selection algorithms to it. LEAD simply chooses the first 665 bytes of the most recent article in each cluster. SumBasic is a simple probabilistic method proposed by (Nenkova and Vanderwende, 2005), commonly used as a baseline in summarization evaluations. Peer65 is the winning system of DUC-2004 Task 2. To compare our method with others that also use compact and dense representations, we use the method proposed by (Lee et al., 2009), which adopts the generic relevance of sentences method using NMF. Another method often used in summarization evaluations is LexRank, proposed by (Erkan and Radev, 2004), which uses the TextRank algorithm (Mihalcea and Tarau, 2004) to establish a ranking between sentences. Finally, we compare our method with the one proposed by (Cao et al., 2015), which uses a Recursive Neural Network (RNN) to learn sentence embeddings by encoding syntactic features.

5 http://duc.nist.gov
6 ROUGE-1.5.5 with options -c 95 -b 665 -m -n 2 -x

System   | R1    | R2   | tt  | st   | size
LEAD     | 32.42 | 6.42 |     |      |
SumBasic | 37.27 | 8.58 |     |      |
Peer65   | 38.22 | 9.18 |     |      |
NMF      | 31.60 | 6.31 |     |      |
LexRank  | 37.58 | 8.78 |     |      |
RNN      | 38.78 | 9.86 |     |      |
C BOW    | 37.76 | 8.08 | 0.1 | 0.6  |
C GNEWS  | 37.91 | 8.45 | 0.2 | 0.9  | 300
C CBOW   | 38.68 | 8.93 | 0.3 | 0.93 | 200
C SKIP   | 38.81 | 9.97 | 0.3 | 0.94 | 400

Table 3: ROUGE scores (%) on the DUC-2004 dataset. tt and st are the topic and similarity thresholds, respectively. size is the dimension of the embeddings.

Implementation Our system7 is written in Python, relying on the nltk, scikit-learn and gensim libraries for text preprocessing, building the sentence-term matrix and importing the word2vec model. We train the word embeddings on the DUC-2004 corpus using the original word2vec8 implementation. We test both the continuous bag-of-words (C CBOW) and skip-gram (C SKIP) neural architectures proposed in (Mikolov et al., 2013a), using the same parameters9 but varying the embedding size. Moreover, we compare our method using the model trained on part of the Google News dataset (C GNEWS), which consists of about 100 billion words. In the preprocessing step each cluster of documents is split into sentences and stopwords are removed. We do not perform stemming, as reported in Section 3. To find the best parameter configuration, we run a grid search using the following setting: embedding size in [100, 200, 300, 400, 500], topic and similarity thresholds in [0, 0.5] and [0.5, 1] respectively, with a step of 0.01.
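A possible shape of this grid search is sketched below; evaluate_rouge() is a hypothetical helper standing in for summarizing the corpus and scoring it with ROUGE, and is not part of the released system:

    import itertools

    sizes = [100, 200, 300, 400, 500]
    topic_ts = [round(0.01 * i, 2) for i in range(0, 51)]       # 0.00 .. 0.50
    sim_ts = [round(0.50 + 0.01 * i, 2) for i in range(0, 51)]  # 0.50 .. 1.00

    def evaluate_rouge(size, topic_threshold, sim_threshold):
        # Hypothetical helper: summarize the DUC-2004 clusters with these settings
        # and return the ROUGE-1 recall; not part of the authors' released code.
        raise NotImplementedError

    best = None
    for size, tt, st in itertools.product(sizes, topic_ts, sim_ts):
        r1 = evaluate_rouge(size, tt, st)
        if best is None or r1 > best[0]:
            best = (r1, size, tt, st)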

Results and Discussion The results of the experiment are shown in Table 3. We report the best scores of our method using the three different word2vec models, along with their parameters. For all word embedding models, our method outperforms the original centroid one. In detail, with the skip-gram model we obtain an increment of 1.05% and 1.71% with respect to the BOW model on ROUGE-1 and ROUGE-2, respectively. Moreover, our simple method with skip-gram performs better than the more complex models based on RNNs. This proves the effectiveness of the compositional capability of word embeddings in encoding the information of word sequences through a simple sum of word vectors, as already shown in (Wieting et al., 2015). Although our method with the model pre-trained on Google News does not achieve the best score, it is interesting to notice the flexibility of word embeddings in reusing external knowledge. Regarding the comparison between the BOW and embedding representations, the experiment shows different behaviors of the similarity threshold. In particular, the use of word2vec requires a higher threshold because word embeddings are dense vectors, unlike the sparse BOW representation. This shows that the embeddings of sentences are closer in the vector space, so the cosine similarity returns values that are closer together.

7 https://github.com/gaetangate/text-summarizer
8 https://code.google.com/archive/p/word2vec/
9 -hs 1 -min-count 0 -window 10 -negative 10 -iter 10

System  | R1    | R2   | tt | st   | size
C BOW   | 37.56 | 8.26 | 0  | 0.6  |
C GNEWS | 36.91 | 7.35 | 0  | 0.9  | 300
C CBOW  | 37.69 | 7.64 | 0  | 0.83 | 300
C SKIP  | 37.61 | 8.10 | 0  | 0.91 | 300

Table 4: ROUGE scores without topic threshold.

The topic threshold also shows different trends. The word embeddings require a higher threshold value to make our method effective. In order to analyze this aspect, we ran another experiment with the topic threshold set to 0. The results are reported in Table 4. They show that the BOW representation is more stable and obtains the best ROUGE-2 score, while the performance obtained by word2vec decreases considerably. This means that word embeddings are more sensitive to noise and require an accurate choice of the meaningful words used to compose the centroid vector.

Summaries Overlap Although the different methods achieve similar ROUGE scores, they do not necessarily generate similar summaries. An example is reported in Table 6. In this section we conduct a further analysis by comparing the summaries generated by the best four configurations of the centroid method reported in Table 3. We adopt the same criterion presented in (Hong et al., 2014), where the different summaries are compared in terms of sentence and word overlap using the Jaccard coefficient. Due to space constraints, we report in Table 5 only the sentence overlap. The results show that different word representations lead to different summaries. In particular, the summaries using BOW differ considerably from those generated using word2vec, and this holds even between different embedding models. On the other hand, the models trained on the DUC-2004 corpus (CBOW and SKIP) tend to generate more similar summaries. This analysis suggests that a combination of various models trained on different corpora could result in good performance.

      | GNEWS | CBOW  | SKIP  | BOW
GNEWS | 1     | 0.109 | 0.171 | 0.075
CBOW  |       | 1     | 0.460 | 0.072
SKIP  |       |       | 1     | 0.105
BOW   |       |       |       | 1

Table 5: Sentence overlap.
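The overlap measure itself is straightforward; the following sketch shows the Jaccard coefficient applied to two summaries represented as lists of sentences (an illustration, not the authors' evaluation script):

    def jaccard(a, b):
        """Jaccard coefficient between two summaries given as collections of sentences (or words)."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    # e.g. sentence-level overlap between two generated summaries:
    # jaccard(summary_bow_sentences, summary_skip_sentences)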

4.2 Multilingual Document Summarization

Task Description We carried out an experiment on Multilingual Single-document Summarization (MSS). Our main goal is to empirically prove the effectiveness of word embeddings in the document summarization task across different languages. For this purpose, we evaluate our method on the MSS task proposed at MultiLing 2015 (Giannakopoulos et al., 2015), a special session at SIGDIAL 2015. Since 2011, the aim of the MultiLing community has been to promote cutting-edge research in automatic summarization by providing datasets and by introducing several pilot tasks to encourage further developments in single- and multi-document summarization and in summarizing human dialogs in online forums and customer call centers. The goal of the MSS 2015 task is to generate a single-document summary for a selection of some of the best written Wikipedia articles in at least one of the 38 languages defined by the organizers of the task. The dataset10 is divided into a training and a test set, both consisting of 30 documents for each of the 38 languages. For both sets, the body of the articles and the related abstracts, with their character length limits, are provided. Since the Wikipedia abstracts are summaries written by humans, they can be used for automatic evaluations. We evaluate our method on five different languages: English, Italian, German, Spanish and French.

10 http://multiling.iit.demokritos.gr/pages/view/1532/task-mss-single-document-summarization-data-and-information


BOW - Bag of Words baseline

The controversy centers on the payment of nearly dlrs 400,000 in scholarships to relatives of IOC members by the Salt Lake bid committee which won the right to stage the 2002 games. Pound said the panel would investigate allegations that "there may or may not have been payments for the benefit of members of the IOC or their families connected with the Salt Lake City bid." Samaranch said he was surprised at the allegations of corruption in the International Olympic Committee made by senior Swiss member Marc Hodler.

CBOW - Continuous Bag of Words trained on DUC-2004 dataset

Marc Hodler, a senior member of the International Olympic Committee executive board, alleged malpractices in the voting for the 1996 Atlanta Games, 2000 Sydney Olympics and 2002 Salt Lake Games. The IOC, meanwhile, said it was prepared to investigate allegations made by Hodler of bribery in the selection of Olympic host cities. The issue of vote-buying came to the fore in Lausanne because of the recent disclosure of scholarship payments made to six relatives of IOC members by Salt Lake City officials during their successful bid to play host to the 2002 Winter Games.

SKIP - Skip-gram trained on DUC-2004 dataset

Marc Hodler, a senior member of the International Olympic Committee executive board, alleged malpractices in the voting for the 1996 Atlanta Games, 2000 Sydney Olympics and 2002 Salt Lake Games. The IOC, meanwhile, said it was prepared to investigate allegations made by Hodler of bribery in the selection of Olympic host cities. Saying "if we have to clean, we will clean," Juan Antonio Samaranch responded on Sunday to allegations of corruption in the Olympic bidding process by declaring that IOC members who were found to have accepted bribes from candidate cities could be expelled.

GNEWS - Skip-gram trained on Google News dataset

The International Olympic Committee has ordered a top-level investigation into the payment of nearly dlrs 400,000 in scholarships to relatives of IOC members by the Salt Lake group which won the bid for the 2002 Winter Games. The mayor of the Japanese city of Nagano, site of the 1998 Winter Olympics, denied allegations that city officials bribed members of the International Olympic Committee to win the right to host the games. Swiss IOC executive board member Marc Hodler said Sunday he might be thrown out of the International Olympic Committee for making allegations of corruption within the Olympic movement.

Table 6: Summaries of the cluster d30038 in the DUC-2004 dataset using the centroid-based summarization method with different sentence representations.

Model Configuration In order to learn word embeddings for the different languages, we exploit five Wikipedia dumps11, one for each chosen language. We extract the plain text from the Wiki markup language using WikiExtractor12, a Wikimedia parser written in Python. Each article is converted from UTF-8 to ASCII encoding using the Unidecode Python package. Since in the previous evaluation we observed similar behavior for the continuous bag-of-words and skip-gram models, in this evaluation we adopt only the skip-gram model, using the same training parameters13 for all five languages. Table 7 reports the Wikipedia statistics for the five languages, regarding the number of words and the size of the vocabularies.

Language | # Words       | Vocabulary
English  | 1,890,356,976 | 973,839
Italian  | 371,218,773   | 378,286
German   | 657,234,125   | 1,042,683
Spanish  | 464,465,399   | 419,683
French   | 551,057,299   | 458,748

Table 7: Wikipedia statistics.
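For illustration, training such a skip-gram model on a preprocessed Wikipedia dump might look as follows with gensim; the file names, the vector size and the gensim (>= 4.0) parameter names are our assumptions, while the hyper-parameters mirror footnote 13:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # One preprocessed sentence per line (assumed file produced by WikiExtractor + cleanup).
    sentences = LineSentence("enwiki-plaintext.txt")
    # Hyper-parameters mirror footnote 13; vector_size is not stated there and is an assumption.
    model = Word2Vec(sentences, sg=1, hs=1, min_count=10, window=8,
                     negative=5, epochs=5, vector_size=300)
    model.wv.save("enwiki-skipgram.kv")   # keyed vectors for later lookup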

Experiment Protocol In order to reproduce the same challenging scenario of the MultiLing 2015 MSS task, we performed the parameter tuning using only the training set. To find the best topic and similarity threshold parameters we run a grid search, as explained in Section 4.1. The grid search is performed for each language separately, using both the BOW and skip-gram representations. The parameter configurations are in line with those of the previous experiment on DUC-2004. In detail, the topic thresholds are in the range [0.1, 0.2] for the BOW model and in the range [0.3, 0.5] for word embeddings, while the similarity thresholds are slightly higher than in the multi-document experiment: about 0.7 and 0.95 for BOW and skip-gram, respectively. This is due to the fact that overly similar sentences are rare, especially in well-written documents such as Wikipedia articles. The best parameter configuration for each language is used to generate summaries for the documents in the test set. Also for this task, each document is preprocessed with sentence segmentation and stopword removal, without stemming. We adopt the same automatic evaluation metrics used by the participating systems in the MSS 2015 task: ROUGE-1, -2 and -SU414. ROUGE-SU4 computes the score between the generated and human summaries considering the overlap of skip-bigrams with a maximum skip distance of 4, as well as of unigrams. Finally, the generated summary for each document must comply with a specific length constraint (rather than a unique length limit for the whole collection). This differs from the previous evaluation on DUC-2004.

11 https://dumps.wikimedia.org/{lang}wiki/20161220/ with lang in [en, it, de, es, fr]
12 https://github.com/attardi/wikiextractor/wiki
13 -hs 1 -min-count 10 -window 8 -negative 5 -iter 5
14 ROUGE-1.5.7 with options -n 2 -2 4 -u -x -m


        | English        | Italian       | German        | Spanish       | French
        | R1     R2      | R1     R2     | R1     R2     | R1     R2     | R1     R2
LEAD    | 44.33  11.68   | 30.46  4.38   | 29.13  3.21   | 43.02  9.17   | 42.73  8.07
WORST   | 37.17  9.93    | 39.68  10.01  | 33.02  4.88   | 45.20  13.04  | 46.68  12.96
BEST    | 50.38  15.10   | 43.87  12.50  | 40.58  8.80   | 53.23  17.86  | 51.39  15.38
C BOW   | 49.06  13.43   | 33.44  4.82   | 35.28  4.93   | 48.38  12.88  | 46.13  10.45
C W2V   | 50.43‡ 13.34†  | 35.12  6.81   | 35.38† 5.39†  | 49.25† 12.99  | 47.82† 12.15
ORACLE  | 61.91  22.42   | 53.31  17.51  | 54.34  13.32  | 62.55  22.36  | 58.68  17.18

Table 8: ROUGE-1, -2 scores (%) on the MultiLing MSS 2015 dataset for five different languages.


Results and Discussion The results for each language are shown in Table 8. We report the ROUGE-1 and ROUGE-2 scores for each chosen language. LEAD and C BOW represent the same baselines used in the multi-document experiment: the former uses the initial text of each article truncated to the length of the Wikipedia abstract, while the latter is the centroid-based method with the BOW representation. Our method using the word embeddings learned with the skip-gram model is labeled C W2V. For each metric and language we also report the WORST and the BEST scores15 obtained by the 23 systems participating in the MSS 2015 task. Finally, the ORACLE scores can be considered an upper-bound approximation for extractive summarization methods: they are obtained with a covering algorithm (Davis et al., 2012) that selects sentences from the original text covering the words in the summary while respecting the length limit. We highlight in bold the scores of our method when it outperforms the C BOW baseline. The superscripts † and ‡ indicate a better performance of our method with respect to the WORST and the BEST scores, respectively.

Both centroid-based methods outperform the simple LEAD baseline for all languages. Our method always achieves better scores than the BOW model, except for the ROUGE-2 metric for English. This confirms the effectiveness of word embeddings as an alternative sentence representation able to capture the semantic similarities between the centroid words and the sentences when summarizing single documents as well. Moreover, our method substantially outperforms the lowest scoring systems participating in the MSS 2015 task for the English and German languages. For English, our method even obtains a ROUGE-1 score better than that of the best system at MSS 2015. On the other hand, our method fails in summarizing Italian documents and achieves the worst ROUGE-2 scores for Spanish and French. The reason may lie in the size of the Wikipedia dumps used to learn the word embeddings for the different languages. As shown in Table 7, the sizes of the various corpora, as well as the ratios between the number of words and the dimension of the vocabularies, differ considerably. The English version of Wikipedia consists of nearly 2 billion words, against roughly 370 million words for the Italian one. Thus, according to the distributional hypothesis (Harris, 1954), we expect better performance of our method in summarizing English or German articles with respect to the other languages, for which the word embeddings are learned from a smaller corpus. Our results, and in particular the ROUGE-SU4 scores reported in Figure 2, support this hypothesis.

15 http://multiling.iit.demokritos.gr/file/view/1629/mss-multilingual-single-document-summarization-submission-and-automatic-score

Figure 2: ROUGE-SU4 scores (%) comparison on the MultiLing MSS 2015 dataset.


5 Conclusion

In this paper, we proposed a centroid-based method for extractive summarization which exploits the compositional capability of word embeddings. One of the advantages of our method lies in its simplicity: indeed, it can be used as a baseline when experimenting with new, more articulated semantic representations in summarization tasks. Moreover, following the idea of representation learning, it makes it possible to infuse knowledge by training the word embeddings on external sources. Finally, the proposed method is fully unsupervised, so it can be adopted for other summarization tasks, such as query-based document summarization. As future work, we plan to evaluate the centroid-based summarization method using a topic model, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) or Non-negative Matrix Factorization (NMF) (Berry et al., 2007), to extract the meaningful words used to compute the centroid embedding, as well as to carry out a comprehensive comparison of different sentence representations using more complex neural language models (Le and Mikolov, 2014; Zhang and LeCun, 2015; Jozefowicz et al., 2016). Finally, the combination of distributional and relational semantics (Fried and Duh, 2014; Verga and McCallum, 2016; Rossiello, 2016) applied to extractive text summarization is a promising further direction that we want to investigate.

References

Y. Bengio, A. Courville, and P. Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, Aug.

Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. 2007. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, September.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March.

Ziqiang Cao, Furu Wei, Li Dong, Sujian Li, and Ming Zhou. 2015. Ranking with recursive neural networks and its application to multi-document summarization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2153–2159. AAAI Press.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 335–336, New York, NY, USA. ACM.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. pages 484–494, August.

S. T. Davis, J. M. Conroy, and J. D. Schlesinger. 2012. Occams – an optimal combinatorial covering algorithm for multi-document summarization. In 2012 IEEE 12th International Conference on Data Mining Workshops, pages 454–463, Dec.

Gunes Erkan and Dragomir R. Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479, December.

Daniel Fried and Kevin Duh. 2014. Incorporating both distributional and relational semantics in word representations. CoRR, abs/1412.4369.

George Giannakopoulos, Jeff Kubina, John M. Conroy, Josef Steinberger, Benoît Favre, Mijail A. Kabadjov, Udo Kruschwitz, and Massimo Poesio. 2015. Multiling 2015: Multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the SIGDIAL 2015 Conference, pages 270–274.

Yoav Goldberg. 2015. A primer on neural network models for natural language processing. CoRR, abs/1510.00726.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Kai Hong, John Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A repository of state of the art and competitive baseline summaries for generic news summarization. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

Karen Sparck Jones. 2007. Automatic summarising: The state of the art. Information Processing & Management, 43(6):1449–1481.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. CoRR, abs/1602.02410.


Mikael Kageback, Olof Mogren, Nina Tahmasebi, and Devdatt Dubhashi. 2014. Extractive summarization using continuous vector space models. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) @ EACL, pages 31–39.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196. JMLR Workshop and Conference Proceedings.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521:436–444.

Ju-Hong Lee, Sun Park, Chan-Min Ahn, and Daeho Kim. 2009. Automatic generic document summarization based on non-negative matrix factorization. Inf. Process. Manage., 45(1):20–34, January.

Hui Lin and Jeff Bilmes. 2011. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 510–520, Portland, Oregon, USA, June. Association for Computational Linguistics.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Marie-Francine Moens and Stan Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July. Association for Computational Linguistics.

H. P. Luhn. 1958. The automatic creation of literature abstracts. IBM J. Res. Dev., 2(2):159–165, April.

Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into texts. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain, July. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June. Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. pages 280–290, August.

Ani Nenkova and Lucy Vanderwende. 2005. The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101.

Makbule G. Ozsoy, Ferda N. Alpaslan, and Ilyas Cicekli. 2011. Text summarization using latent semantic analysis. Journal of Information Science, 37(4):405–417.

Dragomir R. Radev, Hongyan Jing, Malgorzata Stys, and Daniel Tam. 2004. Centroid-based summarization of multiple documents. Inf. Process. Manage., 40(6):919–938, November.

Gaetano Rossiello. 2016. Neural abstractive text summarization. In Proceedings of the Doctoral Consortium of AI*IA 2016, co-located with the 15th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2016), Genova, Italy, November 29, 2016, pages 70–75.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal, September. Association for Computational Linguistics.

Horacio Saggion and Robert Gaizauskas. 2004. Multi-document summarization by cluster/profile relevance and redundancy removal. In Proceedings of the HLT/NAACL Document Understanding Workshop (DUC 2004), Boston, May.

Horacio Saggion and Thierry Poibeau. 2013. Automatic text summarization: Past, present and future. In Multi-source, Multilingual Information Extraction and Summarization, pages 3–21. Springer.

Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.

Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Patrick Verga and Andrew McCallum. 2016. Row-less universal schema. CoRR, abs/1604.06361.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198.

Xiang Zhang and Yann LeCun. 2015. Text understanding from scratch. CoRR, abs/1502.01710.


Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, pages 22–31, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Query-based summarization using MDL principle

Natalia Vanetik
Shamoon College of Engineering
Beer Sheva, Israel
[email protected]

Marina Litvak
Shamoon College of Engineering
Beer Sheva, Israel
[email protected]

Abstract

Query-based text summarization is aimed at extracting from an original text the essential information that answers a given query. The answer is presented in a minimal, often predefined, number of words. In this paper we introduce a new unsupervised approach for query-based extractive summarization, based on the minimum description length (MDL) principle and employing the Krimp compression algorithm (Vreeken et al., 2011). The key idea of our approach is to select frequent word sets related to a given query that compress document sentences better and therefore describe the document better. A summary is extracted by selecting the sentences that best cover the query-related frequent word sets. The approach is evaluated on the DUC 2005 and DUC 2006 datasets, which are specifically designed for query-based summarization (DUC, 2005 2006). It is competitive with the best reported results.

1 Introduction

Query-based summarization (QS) is directed toward generating a summary most relevant to a given query. It can relate to a single-document or to a multi-document input. Our approach to QS is based on the MDL principle, defining the best summary as the one that leads to the best compression of the text with query-related information by providing its shortest and most concise description. The MDL principle is widely used in compression techniques for non-textual data, such as summarization of query results for online analytical processing (OLAP) applications (Lakshmanan et al., 2002; Bu et al., 2005). However, only a few works about text summarization using MDL can be found in the literature. Nomoto and Matsumoto (2001) used K-means clustering extended with the MDL principle to find diverse topics in the summarized text. Nomoto (2004) also extended the C4.5 classifier with MDL for learning rhetorical relations. In (Nguyen et al., 2015) the problem of micro-review summarization is formulated within the MDL framework, where the authors view the tips as being encoded by snippets and seek a collection of snippets that produces the encoding with the minimum number of bits.

This work proposes an MDL approach in which the sentences that are best described by the query-related word sequences are selected for a summary. It differs from the works mentioned above in (1) using frequent itemsets rather than single words in the description model, (2) compressing entire documents instead of summaries, (3) the ranking method for sentences, and (4) the description model itself. We tested our approach on the DUC 2005 and DUC 2006 data for English query-based summarization.

2 Related Work

Multiple works about QS have been published in recent years. Daume III and Marcu (2006) presented BayeSum, a model for sentence extraction in QS. BayeSum is based on the concepts of three models: a language model, a Bayesian statistical model, and a graphical model. Mohamed and Rajasekaran (2006) proposed an approach for QS based on document graphs, which are directed graphs of concept or entity nodes and relations between them. The work in (Bosma, 2005) introduced a graph search algorithm that looks for relevant sentences in the discourse structure represented as a graph. The author used Rhetorical Structure Theory for creating a graph representation of a text document - a weighted graph with nodes standing for sentences and weighted edges representing the distance between sentences. Conroy et al. (2005) presented the CLASSY summarizer, which used a hidden Markov model based on signature terms and query terms for sentence selection within a document, and a pivoted question answering algorithm for redundancy removal. Liu et al. (2012) proposed QS with multi-document input using unsupervised deep learning. Schilder and Kondadadi (2008) presented FastSum, a fast query-based multi-document summarizer based solely on word-frequency features of clusters, documents, and topics, where summary sentences are ranked by a regression support vector machine. Tang et al. (2009) proposed two strategies to incorporate the query information into a probabilistic model. Park et al. (2006) introduced a method that uses non-negative matrix factorization to extract query-relevant sentences. Some works deal with domain-specific data (Chen and Verma, 2006) and use domain-specific terms when measuring the distance between sentences and a query. Zhou et al. (2006) describe a query-based multi-document summarizer based on basic elements, a head-modifier-relation triple representation of document content. Recently, many works have integrated topic modeling into their summarization models. For example, Li and Li (2014) extend the standard graph ranking algorithm by proposing a two-layer (sentence layer and topic layer) graph-based semi-supervised learning approach based on topic modeling techniques. Wang et al. (2014) present a submodular function-based framework for query-focused opinion summarization; within their framework, relevance ordering produced by a statistical ranker and information coverage with respect to topic distribution and diverse viewpoints are both encoded as submodular functions. Some works (Li et al., 2015) use external resources with the goal of better representing the importance of a text unit and its semantic similarity with the given query. Otterbacher et al. (2009) present the Biased LexRank method, which represents a text as a graph of passages linked based on their pairwise lexical similarity, identifies passages that are likely to be relevant to a user's natural language question, and then performs a random walk on the lexical similarity graph in order to recursively retrieve additional passages that are similar to relevant passages. Williams et al. (2014) provide a task-based evaluation of multiple query-biased summarization methods for cross-language information retrieval using relevance prediction. In (Litvak et al., 2015) we applied the MDL principle to generic summarization, where we considered frequent word sets as the means for encoding text; the results demonstrated the superiority of the proposed method over other methods on DUC data. This paper continues the above work by constructing a model where the frequent word sets depend on the query.

3 Methodology

3.1 Overview

Our approach consists of the following steps: (1) text preprocessing, (2) query-related frequent itemset mining, (3) finding the best MDL model, and (4) sentence ranking for the summary construction. The general scheme of our approach is depicted in Figure 1. Details of every step are given in the sections below. Section 3.8 contains an example of the intermediate and the final (summary) outputs for one of the document clusters of the DUC 2005 dataset.

3.2 Query-based MDL principle

The Minimum Description Length (MDL) principle is based on the idea that any regularity in the data can be used to compress the data, and this compression should use fewer symbols than the data itself. Intuitively, a model is a partial function from data subsets to codes, where the code sizes grow logarithmically.

In general, given a set of models M, a model M ∈ M is considered the best if it minimizes L(M) + L(D|M), where L(M) is the bit length of the description of M and L(D|M) is the bit length of the dataset D encoded with M. As such, the frequency and the length of the codes that replace data subsets are the most important features of the MDL model. Because we aim at query-based summarization, in our approach we seek a query-dependent model MQ that minimizes L(MQ) + L(D|MQ); it may not be the best model overall, but it has to be the best among the models for the query Q. Note that the MDL approach does not use the actual codes but only takes into account their bit size.
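The following toy sketch illustrates how such a two-part description length can be computed once code lengths are fixed; it is a generic illustration of the L(M) + L(D|M) objective, not the exact encoding used later in this paper:

    import math

    def total_description_length(coding_table, encoded_sentences):
        """coding_table: dict mapping an itemset (frozenset of terms) to its code length in bits;
        encoded_sentences: list of lists of itemsets used to cover each sentence."""
        # L(M): describing the model means spelling out the terms of every itemset
        # (here with an illustrative fixed per-term cost) plus its assigned code.
        per_term_bits = 8.0
        model_bits = sum(per_term_bits * len(itemset) + code_len
                         for itemset, code_len in coding_table.items())
        # L(D|M): the data is written as a sequence of codes.
        data_bits = sum(coding_table[i] for sentence in encoded_sentences for i in sentence)
        return model_bits + data_bits

    # Example: two "sentences" covered by itemsets with 2-bit and 3-bit codes.
    ct = {frozenset({"crime", "border"}): 2.0, frozenset({"drug"}): 3.0}
    sents = [[frozenset({"crime", "border"})],
             [frozenset({"drug"}), frozenset({"crime", "border"})]]
    print(total_description_length(ct, sents))   # (16+2) + (8+3) + (2 + 3 + 2) = 36.0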

3.3 Query-based data setup

In our case both the text and the query undergo preprocessing that includes sentence splitting, tokenization, stemming, and stop-word removal. Additionally, sentences that are too long (over 40 words), too short (less than 5 words), or consisting primarily of direct speech (the main portion of the sentence is contained within quotes) are omitted. The number of such sentences is small, and we noted that their inclusion in summaries, although rare, decreases summary quality. No deep linguistic analysis is performed, and therefore this method is suitable for any language with basic preprocessing tools.

Figure 1: Query-based MDL summarization

A query is considered to be a set of stemmed tokens, i.e., terms, even if it contains more than one sentence. A document or a document set (in the case of multi-document summarization) D is treated as a dataset where each sentence is a transaction, i.e., a set of stemmed tokens. The order of words in a sentence is ignored in our model, because we consider the relation of a sentence to a query to be more important than the order in which a sentence utilizes query tokens. Formally, we have sentences S1, . . . , Sn of a document set, where every sentence is a subset of the unique terms (stemmed tokens) denoted by T1, . . . , Tm. A query Q is a subset of unique terms as well.

3.4 Frequent itemsets

In our approach, we refer to the text as a transactional dataset, where each sentence is considered to be a single transaction consisting of items. An item in our case is a term, i.e., a stemmed word. Therefore, a sentence is viewed as the set of terms contained in it. A set of items, called an itemset, is frequent if it is contained (as a set) in at least S sentences, where S ≥ 1 is a user-defined parameter.

The paper (Agrawal and Srikant, 1994) proposed two algorithms, Apriori and Apriori-TID, for processing large databases and mining frequent itemsets in efficient time. The Apriori algorithm makes multiple passes over the database, while the Apriori-TID algorithm uses the database only once, in the first pass. In this work, we use the Apriori-TID algorithm for frequent itemset mining. While a multitude of algorithms performing the same task exist, Apriori-TID is sufficient for our purposes because texts, treated as transactional datasets, are rarely dense, and therefore the number of frequent itemsets found in texts is usually not very large.
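For illustration only, a naive level-wise miner in the spirit of Apriori is sketched below (the paper uses Apriori-TID; this simplified version merely shows what "frequent itemset over sentences-as-transactions" means):

    from itertools import combinations

    def frequent_itemsets(sentences, min_support, max_size=3):
        """sentences: list of sets of stemmed terms (transactions);
        min_support: minimum number of sentences an itemset must appear in."""
        frequent = {}
        level = list({frozenset([t]) for s in sentences for t in s})  # single-term candidates
        size = 1
        while level and size <= max_size:
            counts = {c: sum(1 for s in sentences if c <= s) for c in level}
            current = {c: n for c, n in counts.items() if n >= min_support}
            frequent.update(current)
            # Apriori-style candidate generation: join frequent sets of the current size
            keys = list(current)
            level = list({a | b for a, b in combinations(keys, 2) if len(a | b) == size + 1})
            size += 1
        return frequent

    # Example usage on three toy "sentences" (dictionary order may vary):
    sents = [{"crime", "border", "drug"}, {"crime", "border"}, {"drug", "cocain"}]
    print(frequent_itemsets(sents, min_support=2))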

3.5 Data encoding

In the general MDL approach, a Coding Table is a collection CT of sets from D that is used as a model of our dataset. According to the MDL principle, CT is considered to be the best when it minimizes the encoded dataset size size(D, CT) = L(CT) + L(D|CT).

In our approach, the sets included in the Coding Table come from the set F of all frequent word sets in our text. Moreover, we only keep a frequent set in F if it is query-related, and therefore all the sets in CT are query-related as well. Every member I ∈ CT is associated with its code(I) of logarithmic growth (for instance, prefix codes may be used). In our case, the choice of a specific code is not important, as we only use its size |code(I)| when computing size(D, CT). We use Huffman coding in the current version, where |code(I)| = log |I|. The general approach of using frequent sets for MDL representation of a dataset first appeared in (Vreeken et al., 2011); here, we apply it to text and only consider word sets related to a query.

There are two main orders to consider. The Standard Candidate Order, in which F is kept, has the purpose of building the coding table faster: itemsets in F are sorted first by increasing support, then by decreasing sequence length, and then lexicographically. The Standard Cover Order of CT keeps its members sorted first by decreasing sequence length, then by decreasing support, and finally in lexicographical order. Using this order ensures that encoding the dataset with CT indeed produces the minimal encoding length L(D|CT) for a fixed CT. The validity of this approach is proven in (Vreeken et al., 2011).

3.6 Sentence ranking and summary construction

We are interested in the dataset D|CT after it is compressed in the best possible way with the best compressing query-related set CT. The dataset D|CT is obtained by replacing in D every itemset in CT by its code, where shorter codes are selected first. We use an upper bound on the size of CT in order to limit document compression, and we set it equal to the target summary size, denoted by SummarySize. The ideal compression in this case will compress only the words most relevant to the summary and will ignore everything else; additionally, this limitation speeds up computation.

The summary is constructed by iteratively selecting sentences according to their coverage of CT. At each step, a sentence that covers the most important uncovered itemset in CT is added to the summary. The importance of itemsets in CT is determined by their order - higher itemsets in CT (those with shorter codes) have higher importance.

3.7 Query-based frequent itemsets and data encoding

In order to direct the summarization process towards the given query, we compute and use for encoding only frequent itemsets that are related to the given query. We tested two different types of constraints on frequent itemsets:

• (C1) All terms in a frequent itemset I must be contained in the query: I ⊆ Q. With this approach, a set of words in a sentence is encoded only if these words appear in the query.

• (C2) Every frequent itemset and the query must have a common term: I ∩ Q ≠ ∅. Here, a set of words in a sentence is encoded if at least one word in the set appears in the query.

Both methods ensure that only terms related to the query are taken into account. Therefore, instead of all frequent itemsets F, we only use its subset FCi, i = 1, 2. The general Standard Candidate Order is used, and the members of FCi are sorted by increasing support, then by decreasing sequence length, and then lexicographically.

Because the coding table CT can now contain members of FCi only, we modify the Standard Cover Order of CT accordingly, in order to compress query-related terms first. The order is modified as follows:

• (C1) Because every member of CT contains only terms used in the query, we sort it first by decreasing sequence length, then by increasing support, and then lexicographically. Here, the sequence length is precisely the number of query terms contained in a frequent itemset, and itemsets containing more query terms get higher priority.

• (C2) Because every member of CT has some common term with the query, we sort CT first by the number of terms common to an itemset and the query, then by decreasing sequence length, then by increasing support, and then lexicographically. Here, precedence is given to itemsets that have more in common with the query. Note that we tested other measures of the distance between itemsets and the query (Jaccard similarity, cosine similarity), but this method provided better results.
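Both constraints amount to a simple filter over the mined itemsets; the sketch below is a direct transcription with illustrative names:

    def filter_c1(itemsets, query):
        # C1: every term of the itemset appears in the query (I is a subset of Q)
        return [i for i in itemsets if set(i) <= set(query)]

    def filter_c2(itemsets, query):
        # C2: the itemset shares at least one term with the query (non-empty intersection)
        return [i for i in itemsets if set(i) & set(query)]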

A detailed description of our query-based summarization method Qump (Query-based Krimp) is given in Algorithm 1.


Algorithm 1: Qump: Query-Based Krimp
Input:
  (1) a document or a document set D, preprocessed as in Section 3.3,
  (2) a query Q, preprocessed as in Section 3.3,
  (3) target summary word limit SummarySize,
  (4) support bound S,
  (5) constraint Cx on frequent itemsets, as described in Section 3.7.
Output: Extractive summary Summary

/* STEP 1: Query-related frequent set mining */
(1a) F ← frequent sets of terms from {T1, . . . , Tm} appearing in at least Supp fraction of sentences that satisfy constraint Cx;
(1b) Sort F according to Standard Candidate Order;
/* STEP 2: Initialize the coding table */
(2a) Add all terms T1, . . . , Tm and their support to CT;
(2b) Keep CT always sorted according to Standard Cover Order;
(2c) Initialize prefix codes according to the order of sets in CT;
/* STEP 3: Find the best encoding */
(3a) EncodedData ← PrefixEncoding({S1, . . . , Sn}, CT);
(3b) CodeCount ← 0;
while CodeCount < SummarySize and F ≠ ∅ do
    BestCode ← arg min_{c ∈ F} L(CT ∪ {c}) + L(PrefixEncoding({S1, . . . , Sn}, CT ∪ {c}));
    CT ← CT ∪ {BestCode};
    F ← F \ {BestCode};
    /* If a code is used, its supercodes cannot appear in the data */
    F ← F \ {d ∈ F | BestCode ⊂ d};
    CodeCount++;
end
/* STEP 4: Build the summary */
Summary ← ∅;
for codes c ∈ CT do
    importance(c) := serial number of c in CT
end
while |Summary| < SummarySize do
    for all unselected sentences S do
        nCov(S) ← Σ_{c ∈ CT, c ⊆ S} importance(c) / |S|
    end
    S ← arg max_S nCov(S);
    Summary ← Summary ∪ {S};
    CT := CT \ {c};
    for d ⊂ c do
        CT := CT \ {d}
    end
end
return Summary
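Step 4 of Algorithm 1 (coverage-based sentence ranking) can be sketched in Python as follows; the importance weighting and the word-limit approximation are our own simplifications of the pseudo-code above:

    def build_summary(sentences, coding_table, summary_size):
        """sentences: list of sets of terms; coding_table: list of itemsets (sets of terms)
        already in Standard Cover Order (most important first); summary_size: rough word limit."""
        remaining = [frozenset(c) for c in coding_table]
        importance = {c: len(remaining) - i for i, c in enumerate(remaining)}
        selected, length = [], 0
        while length < summary_size and remaining:
            def ncov(s):
                # length-normalised importance of the not-yet-covered codes contained in s
                return sum(importance[c] for c in remaining if c <= s) / max(len(s), 1)
            best = max((i for i in range(len(sentences)) if i not in selected),
                       key=lambda i: ncov(sentences[i]), default=None)
            if best is None:
                break
            selected.append(best)
            length += len(sentences[best])
            remaining = [c for c in remaining if not c <= sentences[best]]  # mark as covered
        return [sentences[i] for i in sorted(selected)]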

3.8 Example

Here we demonstrate the intermediate and the final (summary) outputs for the document cluster D301I and the query "International Organized Crime Identify and describe types of organized crime that crosses borders or involves more than one country. Name the countries involved. Also identify the perpetrators involved with each type of crime, including both individuals and organizations if possible." from the DUC 2005 dataset.

• The coding table CT (only its top 8 itemsets with the shortest codes, sorted from the most important to the least) is given in Table 1.

Code # | Itemset
0      | cross border
1      | countri includ
2      | crime countri
3      | border cross
4      | countri identifi
5      | drug
6      | cocain
7      | offici
...    | ...

Table 1: CT example, top records.

• The sentence "The drugs organisation used intricate methods - including bank accounts, couriers and ships as well as dummy and real companies in many countries - to smuggle cocaine from South America to Europe." after encoding (replacement of phrases by codes) looks like this: "code#5 code#24 intric method includ code#19 account courier ship dummi real compani mani code#16 code#23 code#6 south america code#40", and it covers 7 out of the 50 codes in the CT. Normalized by the sentence length, its coverage is the largest among all sentences, and therefore it is selected for the summary.

• The summary contains the following 8 sentences with the highest CT coverage, ordered by their appearance in the summarized documents:
“The drugs organization used intricate methods - including bank accounts, couriers and ships as well as dummy and real companies in many countries - to smuggle cocaine from South America to Europe. But this week the New York Times gave extensive coverage to a report from a US intelligence officer that warned Mexican drug-traffickers were planning to take advantage of lax border controls. The measures announced yesterday include a CDollars 5 tax cut per carton of cigarettes, bringing federal taxes down to CDollars 11 a carton. Mr Louis Freeh, the FBI director who arrived in Moscow yesterday as part of a central and East European tour, said the mounting crime wave in Russia posed ’common threats’ to all. Crime Without Frontiers is the story of how western and eastern criminal syndicates secured the former Soviet safe-house, the last piece in constructing a global pax mafiosa. In the pax mafiosa, business is business - the Chinese Triads are partners in crime with the American Mafia; the Italians use the Russians to launder for the Colombian cartels, the Japanese Yakuza work hand in hand with the Italians. Last January, after the ouster of Panamanian strongman Manuel A. Noriega, the new government of Panama agreed to U.S. requests for records of bank accounts identified as having been used by cartel money launderers. Cuba’s interior minister, the Cabinet officer in charge of domestic law enforcement, was fired Thursday as that nation’s drug purge continued, but the crackdown has failed to touch other leaders who U.S. officials say are involved in trafficking.”

4 Experiments

4.1 Dataset

We selected two English corpora of the Document Understanding Conference (DUC) for our experiments: DUC 2006 and DUC 2005 (DUC, 2005–2006), which are standard datasets for evaluating query-based summarization methods.

The DUC 2005 dataset contains 50 document sets of 25-50 related documents each. The average number of words in a document set is 20185. For every document set a short (1-3 sentences) query is supplied. From 4 to 9 gold standard summaries are supplied for every document set, and the target summary size is 250 words.

The DUC 2006 dataset contains 50 document sets of 25 related documents each. The average number of words in a document set is 15293. For every document set a short (1-3 sentences) query is supplied. Four gold standard summaries are supplied for every document set, and the target summary size is 250 words.

4.2 Experiment setup

We generated summaries for each set of related documents (by considering each set of documents as one meta-document) in the DUC 2005 and DUC 2006 corpora. The summarization quality was measured by ROUGE-1 recall scores (Lin, 2004), computed with ROUGE-1.5.5 and the command line “-a -l 250 -n 2 -2 4 -u”, with the word limit set to 250, without stemming and stopword removal. We limited the size of the coding table to 250, as described in Section 3.4, and set the support count S = 2 in order to take into account all terms repeated twice or more in the text.

4.3 Results

Table 2 shows the ROUGE-1 scores of our algorithm compared to the scores of the 32 systems that participated in the DUC 2005 competition. Two variants of our algorithm, corresponding to the constraints described in Section 3.7, appear in the "Systems" column of Table 2, denoted Qump(C1) and Qump(C2). Qump(C2) places third on ROUGE-1 recall and F-measure, and the difference between the top systems (ID=15 and ID=4) and our algorithm is statistically insignificant.

System 15 stands for the NUS summarizer from the National University of Singapore. This summarizer is based on the concept link approach (Ye et al., 2005). The NUS method uses two features: sentence semantic similarity and redundancy minimization based on Maximal Marginal Relevance (MMR). The first is computed as an overall similarity score between each sentence and the remainder of the document cluster. This overall similarity score reflects the representative power of the sentence with regard to the rest of the document cluster and is used as the primary sentence ranking metric while forming the summary. Then, a module similar to MMR is employed to build the summary incrementally, minimizing redundancy and maintaining the summary's relevance to the query's topic. In order to reduce the run-time computational cost (required for a scan through all possible pairs of senses for all pairs of concepts), the authors pre-computed the semantic similarity between all possible pairs of WordNet entries offline.

System 4 represents the Columbia summarizer from Columbia University (Blair-Goldensohn, 2005). This is an adaptation of the DefScriber question answering (QA) system (Blair-Goldensohn et al., 2004). DefScriber (1) identifies relevant sentences which contain information pertinent to the target individual or term (i.e., the X in the “Who/What is X?” question); (2) incrementally clusters the extracted sentences using a cosine distance metric; then (3) selects sentences for the output summary using a fitness function which maximizes inclusion of core definitional predicates, coverage of the highest ranking clusters, and answer cohesion; and finally (4) applies reference rewriting techniques to the extracted sentences to improve the readability of the summary, using an auxiliary system (Nenkova and McKeown, 2003). The key adaptations made for the DUC 2005 task were in relevant-passage selection (step 1), combining the following techniques:

1. Term frequency-based weighting for filtering the less relevant terms from the topic statement (“topic terms”), based on IDF calculated over a large news corpus;

2. Topic structure for adjusting the topic term weights, with the simple heuristic of giving terms in the title double the weight of terms in the extended question/topic body;

3. Stemming, to maximize coverage of relevant terms when measuring the overlap of topic terms and document sentences;

4. Including the content of nearby sentences in the determination of a given sentence's relevance.

Using these techniques, the algorithm made two passes over each document: in the first pass it assigned relevance scores to each sentence based on overlap with topic terms; in the second pass, these scores were adjusted using the first-pass scores of nearby sentences. Finally, the sentences scoring above a certain cutoff were selected.

In comparison to the two top systems, our approach does not require any pre-computed data from external resources (as NUS does), and it has a very simple pipeline with few stages (unlike the Columbia summarizer) that do not involve external tools and have a low computational cost.

The difference of Qump(C2) from system with ID=17 was statistically insignificant, and the difference from system with ID=11 was statistically significant.

Table 3 shows how the ROUGE-1 scores of our algorithm compare to the scores of the 35 systems that participated in the DUC 2006 competition. Two variants of our algorithm, corresponding to the constraints described in Section 3.7, appear in the table, denoted Qump(C1) and Qump(C2).

System ID   Recall   Precision   F-measure
15          0.3446   0.3436      0.3440
4           0.3424   0.3355      0.3388
Qump(C2)    0.3416   0.3334      0.3374
17          0.3400   0.3329      0.3363
11          0.3336   0.3134      0.3231
6           0.3310   0.3256      0.3282
19          0.3305   0.3249      0.3276
10          0.3304   0.3225      0.3263
7           0.3300   0.3211      0.3254
8           0.3292   0.3314      0.3301
5           0.3281   0.3406      0.3339
Qump(C1)    0.3276   0.3194      0.3233
25          0.3264   0.3197      0.3229
24          0.3223   0.3253      0.3237
9           0.3222   0.3138      0.3179
16          0.3209   0.3203      0.3205
3           0.3177   0.3179      0.3177
14          0.3172   0.3325      0.3235
12          0.3115   0.3043      0.3078
21          0.3107   0.3095      0.3100
29          0.3107   0.3159      0.3131
27          0.3069   0.2976      0.3021
28          0.3047   0.3074      0.3059
13          0.3039   0.3186      0.3109
18          0.3003   0.3350      0.3161
32          0.2977   0.3056      0.3014
30          0.2931   0.2900      0.2914
26          0.2824   0.3088      0.2949
2           0.2801   0.3052      0.2914
22          0.2795   0.3160      0.2878
31          0.2719   0.3062      0.2797
20          0.2552   0.3554      0.2930
1           0.2532   0.3104      0.2644
23          0.1647   0.3708      0.2196

Table 2: DUC 2005. ROUGE-1 scores.

Qump(C2) places second on ROUGE-1 recall and F-measure, and the difference between the top system (ID=24) and our algorithm is statistically insignificant.

ID 24 represents the IIITH-Sum system from the International Institute of Information Technology (Jagarlamudi et al., 2006). IIITH-Sum used two features to score the sentences, and then picked the top-scored ones into the summary in a greedy manner. The first feature is a query-dependent adaptation of the HAL feature (Jagadeesh et al., 2005), where additional importance is given to a word/phrase from the query. The second feature calculates query-independent sentence importance using external resources on the web. First, the Yahoo search engine was used to get a ranked list of retrieved documents, and a unigram language model was learned on the text content extracted from them. Then, an Information Measure (IM), using entropy to compute the information content of a sentence based on the learned unigram model, was used for scoring a sentence. The final sentence ranks were computed as a weighted linear combination of the modified HAL feature and IM.

In contrast to IIITH-Sum, our approach does not require any external resources and is strictly based on the internal content of the analyzed corpus. As a result, it also consumes less run-time.




System ID   Recall   Precision   F-measure
24          0.3797   0.3781      0.3789
Qump(C2)    0.3745   0.3732      0.3745
12          0.3736   0.3734      0.3734
31          0.3675   0.3730      0.3702
10          0.3720   0.3680      0.3699
33          0.3700   0.3698      0.3699
15          0.3717   0.3675      0.3696
23          0.3726   0.3661      0.3692
28          0.3677   0.3707      0.3691
8           0.3702   0.3665      0.3683
27          0.3574   0.3707      0.3637
5           0.3665   0.3607      0.3635
Qump(C1)    0.3647   0.3623      0.3635
13          0.3553   0.3713      0.3629
3           0.3539   0.3650      0.3593
2           0.3587   0.3580      0.3583
6           0.3543   0.3567      0.3555
19          0.3552   0.3534      0.3542
4           0.3518   0.3567      0.3542
22          0.3518   0.3554      0.3536
29          0.3432   0.3598      0.3512
9           0.3373   0.3641      0.3492
32          0.3519   0.3466      0.3492
14          0.3501   0.3481      0.3490
30          0.3317   0.3575      0.3439
25          0.3374   0.3478      0.3425
20          0.3413   0.3412      0.3412
7           0.3417   0.3368      0.3392
16          0.3384   0.3377      0.3380
18          0.3335   0.3423      0.3377
17          0.3105   0.3647      0.3351
21          0.3244   0.3541      0.3344
34          0.3320   0.3300      0.3310
35          0.3058   0.3488      0.3242
26          0.3023   0.3399      0.3199
1           0.2789   0.3231      0.2962
11          0.1965   0.3014      0.2366

Table 3: DUC 2006. ROUGE-1 scores.

The difference of Qump(C2) from system with ID=12 was statistically insignificant, and the difference from system with ID=31 was statistically significant. It is not surprising that constraint C2 performed better than C1, as limiting frequent word sets to words appearing in the query only decreases the overall number of frequent word sets; in this case, many repetitive word sets that are related to the query are missed.

The actual running time of Qump (both versions) was around 1-3 seconds per document set. We also observed that long sentences do not affect the computation cost of our approach.

5 Conclusions

This work introduces Qump, a system following a new MDL-based approach to query-oriented summarization. Qump extracts sentences that best describe the query-related frequent sets of words. The evaluation results show that Qump has excellent performance. In absolute ranking, it outperforms all but two of the participating systems in the DUC 2005 competition and all but one of the competing systems in the DUC 2006 contest. According to the significance tests, Qump has the same performance as the leading systems in both competitions (ID=15, 4, 17 in DUC 2005 and ID=24, 12 in DUC 2006).

In addition, Qump is an efficient algorithm with polynomial complexity. Qump's runtime is limited by Apriori, which is known to be a PSPACE-complete problem. However, because it is a rare occasion to have a set of words repeated in more than 4-5 different sentences in the entire document set, we have at most O(n^5) frequent itemsets, where n is the number of terms. The encoding process is bounded by the number of frequent sets times the number of sentences (m) times the number of words (k). Therefore, we can say that Qump's runtime is polynomial in the number of terms and is bounded by O(m × k × n^5). In conclusion, the presented technique has the following advantages over other techniques: (1) it is unsupervised and does not require any external resources (many of the top-rated systems from the DUC competitions are supervised or use external data); (2) it has efficient time complexity (polynomial in the number of terms); (3) it is language-independent and can be applied to any language, as long as a tokenizer is available for that language (for example, we obtained excellent results with generic summarization in Chinese with this approach²); (4) despite its robustness (independence from annotated data and language, and its efficiency), its performance is comparable to that of the top systems.

In the future, we intend to enrich this approach with word vectors for a better match between a query and a sentence. We also plan to integrate it with a novel summarization technique using an OLAP representation, where frequent itemsets of words will represent an additional dimension.

References

Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules. In 20th International Conference on Very Large Databases, pages 487–499.

Sasha Blair-Goldensohn, Kathleen McKeown, and Andrew H. Schlaikjer. 2004. Answering definitional questions: A hybrid approach, chapter 4. AAAI Press.

Sasha Blair-Goldensohn. 2005. From definitions to complex topics: Columbia University at DUC 2005. In Document Understanding Conference.

Wauter Bosma. 2005. Query-Based Summarization using Rhetorical Structure Theory, pages 29–44. LOT.

² Litvak, M., Vanetik, N., and Li, L. 2017. Summarizing Weibo with Topics Compression. Unpublished manuscript.



Shaofeng Bu, Laks V. S. Lakshmanan, and Raymond T. Ng. 2005. MDL summarization with holes. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 433–444.

Ping Chen and Rakesh Verma. 2006. A query-based medical information summarization system using ontology knowledge. In Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS 2006), pages 37–42.

John M. Conroy, Judith D. Schlesinger, and Jade Goldstein Stewart. 2005. CLASSY: Query-based multi-document summarization. In DUC 2005 Conference Proceedings.

Hal Daume III and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 305–312, Sydney, Australia, July. Association for Computational Linguistics.

DUC. 2005–2006. Document Understanding Conference. http://duc.nist.gov.

Jagarlamudi Jagadeesh, Prasad Pingali, and Vasudeva Varma. 2005. A relevance-based language modeling approach to DUC 2005. In Document Understanding Conferences (along with HLT-EMNLP 2005).

Jagadeesh Jagarlamudi, Prasad Pingali, and Vasudeva Varma. 2006. Query independent sentence scoring approach to DUC 2006.

Laks V. S. Lakshmanan, Raymond T. Ng, Christine Xing Wang, Xiaodong Zhou, and Theodore J. Johnson. 2002. The generalized MDL approach for summarization. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 766–777.

Yanran Li and Sujian Li. 2014. Query-focused multi-document summarization: Combining a topic model with graph-based semi-supervised learning. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1197–1207, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

Chen Li, Yang Liu, and Lin Zhao. 2015. Using external resources and joint learning for bigram weighting in ILP-based multi-document summarization. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 778–787, Denver, Colorado, May–June. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), pages 25–26.

Marina Litvak, Mark Last, and Natalia Vanetik. 2015. Krimping texts for better summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1931–1935, Lisbon, Portugal, September. Association for Computational Linguistics.

Yan Liu, Sheng-hua Zhong, and Wenjie Li. 2012. Query-oriented multi-document summarization via unsupervised deep learning. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pages 1699–1705.

Ahmed A. Mohamed and Sanguthevar Rajasekaran. 2006. Query-based summarization based on document graphs. In Proceedings of the IEEE International Symposium on Signal Processing and Information Technology, pages 408–410.

Ani Nenkova and Kathleen McKeown. 2003. References to named entities: a corpus study. In NAACL-HLT 2003.

Thanh-Son Nguyen, Hady W. Lauw, and Panayiotis Tsaparas. 2015. Review synthesis for micro-review summarization. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, pages 169–178.

Tadashi Nomoto and Yuji Matsumoto. 2001. A new approach to unsupervised text summarization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 26–34.

Tadashi Nomoto. 2004. Machine Learning Approaches to Rhetorical Parsing and Open-Domain Text Summarization. Ph.D. thesis, Nara Institute of Science and Technology.

Jahna Otterbacher, Gunes Erkan, and Dragomir R. Radev. 2009. Biased LexRank: Passage retrieval using random walks with question-based priors. Information Processing & Management, 45(1):42–54.

Sun Park, Ju-Hong Lee, Chan-Min Ahn, Jun Sik Hong, and Seok-Ju Chun. 2006. Query Based Summarization Using Non-negative Matrix Factorization, volume 4253 of Lecture Notes in Computer Science, pages 84–89.

Frank Schilder and Ravikumar Kondadadi. 2008. FastSum: Fast and accurate query-based multi-document summarization. In Proceedings of ACL-08: HLT, Short Papers, pages 205–208, Columbus, Ohio, June. Association for Computational Linguistics.

Jie Tang, Limin Yao, and Dewei Chen. 2009. Multi-topic based query-oriented summarization. In SIAM International Conference on Data Mining.

Jilles Vreeken, Matthijs van Leeuwen, and Arno Siebes. 2011. Krimp: Mining itemsets that compress. Data Mining and Knowledge Discovery, 23(1):169–214, July.



Lu Wang, Hema Raghavan, Claire Cardie, and Vittorio Castelli. 2014. Query-focused opinion summarization for user-generated content. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1660–1669, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

Jennifer Williams, Sharon Tam, and Wade Shen. 2014. Finding good enough: A task-based evaluation of query biased summarization for cross-language information retrieval. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 657–669, Doha, Qatar, October. Association for Computational Linguistics.

Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concept links. In Document Understanding Conference.

Liang Zhou, Chin-Yew Lin, and Eduard Hovy. 2006. Summarizing answers for complicated questions. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-2006).



Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, pages 32–36, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Word Embedding and Topic Modeling Enhanced Multiple Features for Content Linking and Argument/Sentiment Labeling in Online Forums

Lei Li, Liyuan Mao, and Moye Chen
Center for Intelligence Science and Technology (CIST)
School of Computer Science
Beijing University of Posts and Telecommunications (BUPT), China


Abstract

Multiple grammatical and semantic features are adopted in content linking and argument/sentiment labeling for online forums in this paper. There are mainly two different methods for content linking. First, we utilize the deep feature obtained from a Word Embedding Model in deep learning and compute sentence similarity. Second, we use multiple traditional features to locate candidate linking sentences, and then adopt a voting method to obtain the final result. LDA topic modeling is used to mine latent semantic features and K-means clustering is implemented for argument labeling, while features from sentiment dictionaries and rule-based sentiment analysis are integrated for sentiment labeling. Experimental results have shown that our methods are valid.

1 Introduction

Comments on news and their providers in online forums have been increasing rapidly in recent years, with a large number of user participants and a huge amount of interactive content. How can we understand this mass of comments effectively? A crucial initial step towards this goal is content linking, which is to determine what comments link to, be that either specific news snippets or comments by other users. Furthermore, a set of labels for a given link may be articulated to capture phenomena such as agreement and sentiment with respect to the comment target.

Content linking is a relatively new research topic and it has attracted the focus of TAC 2014 (https://tac.nist.gov//2014/KBP/), BIRNDL 2016 (Jaidka et al., 2016), MultiLing 2015 (Kabadjov et al., 2015) and MultiLing 2017.

The main method is based on the calculation of sentence similarity (Aggarwal and Sharma, 2016; Cao et al., 2016; Jaidka et al., 2016; Saggion et al., 2016; Nomoto, 2016; Moraes et al., 2016; Malenfant and Lapalme, 2016; Lu et al., 2016; Li et al., 2016; Klampfl et al., 2016), with the key point of mining semantic information better.

Researchers have tried various features and methods for sentiment and argument labeling. The main features are different kinds of sentiment dictionaries, while the basic method is the rule-based one. The major method for sentiment and argument labeling is based on statistical machine learning algorithms (Aker et al., 2015; Hristo Tanev, 2015; Maynard and Funk, 2011).

2 Task Description

We work on three tasks for English and Italian in this paper. The first one is content linking, which is to find all the linking pairs for comment sentences. In every pair, one sentence belongs to the original article or to a former comment by an author, and the other belongs to a comment by a later commentator. The second and third tasks are to attach two kinds of labels to the linking pairs found in the first task: an argument label and a sentiment label. The argument label focuses on whether or not a commentator agrees with the commented author. The sentiment label concerns the sentiment of comment sentences. Experiments are implemented on the training data released by MultiLing 2017, including 20 English news articles (from The Guardian) and 5 Italian news articles (from Le Monde) with some comments.

3 Methods

For content linking, we adopt the Word Embedding Model to dig up word vectors as linking information of sentence pairs with deeper semantic features.



Figure 1: Content linking process

Besides, we also use some traditional features of sentence similarity which performed well in experiments, and explore how to fuse them together with Word Embedding features. For this purpose, we first use every single feature to get one linking sentence; next, we choose the most repeated sentence as the final result via a voting method. Then we mainly use rule-based sentiment analysis to obtain the sentiment label. An LDA (Latent Dirichlet Allocation) (Blei et al., 2003) topic model and K-means (Hartigan and Wong, 1979) are integrated to obtain the argument label.

3.1 Content Linking

Figure 1 shows the process for content linking.

3.1.1 Pre-Processing

We crawl 1.5G of data from the English Guardian website to train word vectors for English, and about 1G of data from Wikipedia for Italian. Then we use the word2vec tool (Mikolov et al., 2013) for training.
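As an illustration, such word vectors can be trained with the gensim implementation of word2vec; this is a minimal sketch in which the library choice, the corpus file name and the hyperparameters are our assumptions — the paper only fixes the dimensionality at 300 (used in Section 3.1.2).

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# hypothetical file of pre-tokenized crawled text, one sentence per line
corpus = LineSentence("guardian_crawl_en.txt")

# 300-dimensional vectors to match Eq. (1); other settings are illustrative defaults
model = Word2Vec(corpus, vector_size=300, window=5, min_count=5, workers=4)
model.save("word2vec_en.model")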

3.1.2 Method 1: Word Vector Algorithm

After training the word embedding models, a sentence in the corpus can be expressed as:

W_t = (w_t, w_{t+1}, \dots, w_{t+k})   (1)

where w_t is the 300-dimensional word vector of word t. Then two sentences W_i and W_j can form a calculation matrix M_{i,j}:

M_{i,j} = W_i W_j^T =
\begin{pmatrix}
w_t w_v & \cdots & w_t w_{v+l} \\
\vdots & \ddots & \vdots \\
w_{t+k} w_v & \cdots & w_{t+k} w_{v+l}
\end{pmatrix}   (2)

Before the computation of (w_t, w_v), we need some processing steps: stemming as well as stop-word and punctuation removal. Besides, it is essential to check relations between word t and word v based on WordNet: if they appear in each other's hyponym/hypernym lists, they can be treated as the same.

The cosine distance represents (w_t, w_v), and the similarity of sentences i and j is:

Sim_{i,j} = \frac{\sum_{m=i,\,n=j} \max(M_{m,n})}{\sqrt{length_i \, length_j}}   (3)

where max(M_{m,n}) is obtained through the following concrete steps. First, find the maximum of M_{i,j}, then delete the row and column of that maximum. Next, find the maximum of the remaining matrix and remove its row and column as in the former step. Repeat this procedure until the matrix is empty. Finally, add up all the maximum values. length_i is the number of word vectors in the sentence, and \sqrt{length_i \, length_j} is used to reduce the influence of sentence length.

We think that the maximum value in the matrix represents the best-matching word pair in the two sentences. We choose the maximum value in each step and delete its word pair from the matrix for the next iteration, until the matrix is empty. As a result, we can find all the best-matching word pairs in the two sentences. Hence, accumulating the word similarities of all these best-matching word pairs can represent the similarity of the two sentences.

Based on the above sentence similarities, we can extract the sentences with the highest similarity to a comment sentence as its linking result.
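A minimal Python/NumPy sketch of this similarity computation (Eqs. 1-3) is given below; it assumes the word vectors have already been looked up, and it omits the stemming, stop-word and WordNet steps described above. The function name is ours.

import numpy as np

def sentence_similarity(vecs_i, vecs_j):
    # vecs_i, vecs_j: lists of word vectors (e.g., 300-dimensional) for the two sentences
    A = np.array([v / np.linalg.norm(v) for v in vecs_i])
    B = np.array([v / np.linalg.norm(v) for v in vecs_j])
    M = A @ B.T  # cosine similarity between every word pair, i.e., the matrix M_{i,j}
    total = 0.0
    # greedily take the largest entry, then delete its row and column, until the matrix is empty
    while M.size > 0:
        r, c = np.unravel_index(np.argmax(M), M.shape)
        total += M[r, c]
        M = np.delete(np.delete(M, r, axis=0), c, axis=1)
    # normalize by the geometric mean of the sentence lengths (Eq. 3)
    return total / np.sqrt(len(vecs_i) * len(vecs_j))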

3.1.3 Method 2: Feature Fusion Algorithm

This algorithm is only for English. We have two kinds of features: one is from lexicons, and the other is from sentence similarities.

We have three lexicons: a Linked Text high-frequency word lexicon (Lexicon 1), an LDA lexicon (Lexicon 2), and a Comment Text and Linked Text co-occurrence lexicon (Lexicon 3).



Figure 2: Argument label process (sentence pair → generate input file for LDA → construct the model → utilize K-means to get the cluster of every sentence pair → get the result label of every sentence pair).

For Lexicon 1, we manually pick the high-frequency words from the standard answers, and then expand them through WordNet and word vectors, resulting in a lexicon. For Lexicon 2, we use the LDA model to train on the news and comments to get a lexicon of 25 latent topics in every file independently. For Lexicon 3, we obtain the co-occurrence degree between words from the word frequency statistics of a comment and its linked sentence in the training corpus.

As for sentence similarities, we have word vector similarity, Jaccard similarity, idf similarity, res similarity, jcn similarity and path similarity. Word vector similarity is calculated through Method 1. We add up the idf values of the words shared by two sentences to represent their idf similarity. The last three similarities are from WordNet.

For every feature, we can use it to get a sentence with the highest score. Then, among these nine sentences chosen by the nine features, we use a voting method to choose the most repeated sentence as the final linking result. When some sentences get the same number of votes, we choose the first one according to sentence order in the input news and comments.
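A minimal sketch of this voting step, assuming each of the nine features is available as a scoring function over candidate sentences (the names below are ours, not part of the authors' system):

from collections import Counter

def vote_linking_sentence(candidates, feature_scorers):
    # candidates: candidate sentence indices in document order
    # feature_scorers: one scoring function per feature (nine in this setting)
    votes = Counter()
    for scorer in feature_scorers:
        votes[max(candidates, key=scorer)] += 1
    top = max(votes.values())
    # ties are broken by sentence order in the input news and comments
    winner = next(c for c in candidates if votes.get(c, 0) == top)
    return winner, top  # chosen sentence index and its vote count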

3.2 Argument Label

Figure 2 shows the process for argument labeling. Given a collection of sentences in the input file, we wish to discover the topic distribution of every sentence through an LDA model. We generate the input file for LDA first. For every sentence, we change it into its bag-of-words representation, which assumes that the order of words can be neglected. During LDA modeling, we set the topic number to 15 according to the experiments. That is to say, later in K-means clustering, our feature is the 15-dimension vector. We run K-means to cluster all sentences into two categories. For every sentence pair, if the two sentences belong to the same category, then we set the label to in favour; else, against.
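A minimal sketch of this labeling step, assuming the gensim LDA implementation and scikit-learn K-means (the paper does not name the toolkits it uses, and the function name is ours):

import numpy as np
from gensim import corpora, models
from sklearn.cluster import KMeans

def argument_labels(sentences, sentence_pairs, num_topics=15):
    # sentences: list of token lists (bag-of-words input); sentence_pairs: (i, j) index pairs
    dictionary = corpora.Dictionary(sentences)
    bows = [dictionary.doc2bow(s) for s in sentences]
    lda = models.LdaModel(bows, id2word=dictionary, num_topics=num_topics)
    # 15-dimensional topic-distribution feature for every sentence
    feats = np.array([[p for _, p in lda.get_document_topics(b, minimum_probability=0.0)]
                      for b in bows])
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
    # same cluster -> "in favour", different clusters -> "against"
    return ["in favour" if clusters[i] == clusters[j] else "against" for i, j in sentence_pairs]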

Figure 3: Sentiment label process (sentiment pair → construct seed sentiment dictionary → expand the dictionary, using machine translation and DLDA → utilize rules to get the label of every sentence pair).

3.3 Sentiment Label

Figure 3 shows the process for sentiment labeling. There are three kinds of seed sentiment dictionaries discovered from the OpinionFinder system (MPQA, http://mpqa.cs.pitt.edu/). One is the subjectivity lexicon; the other two are called the Intensifier and Valenceshifters lexicons. The Intensifier lexicon contains words which can raise the sentiment level. The Valenceshifters lexicon contains words which can alter the sentiment label.

The original dictionary is in English; we use machine translation to add Italian vocabulary. With DLDA (Chen et al., 2014), we can get sentiment weights for all words in the corpus. Finally, a word which is not included in the seed dictionary receives the same polarity as a seed word if the distance between their sentiment weights is negligible.

Through DLDA, every word gets a sentiment state. We map the sentiment state to a numeric word score as in Table 1. We accumulate the word scores in a sentence to obtain the sentence score, which is then mapped to the sentiment label as in Table 2.

Sentiment state           Word score
Weak neg (only)           -1
Strong neg (only)         -2
Strong pos (only)          2
Weak pos (only)            1
Neutral                    0
Intensifier + weak neg    -2
Intensifier + weak pos     2

Table 1: Scoring strategy

Note that when the current sentence score is bigger than 0, the current word is in Valenceshifters, and the score of the current word is less than 0, then sentence score = sentence score * (-1); when the current sentence score is less than 0, the current word is in Valenceshifters, and the score of the current word is more than 0, the sentence score strategy is the same. Under any other conditions, we simply accumulate the word score.

Sentence final score   Label
> 0                    Positive
= 0                    Neutral
< 0                    Negative

Table 2: Mapping sentence score to sentiment label.
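A minimal sketch of this rule-based scoring, with word-level scores following Table 1 and the final mapping following Table 2 (the lexicon dictionaries and tokenization are assumed inputs; the function name is ours):

def sentence_sentiment(words, word_score, valence_shifters):
    # words: tokens of one sentence; word_score: word -> Table 1 score;
    # valence_shifters: set of words from the Valenceshifters lexicon
    score = 0
    for w in words:
        s = word_score.get(w, 0)
        # a valence shifter whose score opposes the running sentence score flips its sign
        if w in valence_shifters and ((score > 0 and s < 0) or (score < 0 and s > 0)):
            score = -score
        else:
            score += s
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"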

4 Experiments

4.1 Content Linking

A threshold is used for extracting sentences. We choose a sentence as a linking result only when its score (for Method 1) or its vote count (for Method 2) is bigger than the threshold.

Threshold   Linking Precision   Threshold   Linking Precision
0           77.9                0.5         81.9
0.1         78.2                0.6         80.6
0.2         80.8                0.7         84.2
0.3         80.6                0.8         87.0
0.4         80.8                0.9         87.8

Table 3: The performance of Method 1

Table 3 and Table 4 show the performance of Method 1 and Method 2 in our experiments, respectively. The first and third rows in Table 4 are the threshold. F1 to F9 refer to the 9 features (word vector, Jaccard, idf, res, jcn, path, lexicon 2, lexicon 1 and lexicon 3), and the number is the vote given to the corresponding feature.

From Table 3, we can see that, for Method 1, a bigger threshold usually brings higher precision. But the sentences we obtain may also be fewer, which will cause a low recall rate. According to the precision evaluation method used by MultiLing 2015, a precision of 86 is high; thus we can achieve good precision here. For Method 2 in Table 4, although its precision is a little lower than that of Method 1, it also achieves good results.

Threshold           2      2      2
F1                  1      0      1
F2                  1      1      1
F3                  1      0      0
F4                  1      0      0
F5                  1      1      0
F6                  1      0      0
F7                  1      0      0
F8                  1      0      0
F9                  1      2      2
Linking Precision   78.4   85.2   85.2

Threshold           3      3      3
F1                  1      1      1
F2                  1      0      1
F3                  1      0      1
F4                  1      1      0
F5                  1      0      0
F6                  1      0      0
F7                  1      0      1
F8                  1      0      0
F9                  1      2      2
Linking Precision   78.2   82.7   82.6

Table 4: The performance of Method 2

Lexicon 3 shows good performance; other features like Jaccard and idf perform well, too. Hence, how to combine them is an important question for us in the future. Besides, the linking precision for Italian is 10.1 with a threshold of 0.6, as shown in Table 5.

Threshold   Linking Precision   Threshold   Linking Precision
0.3         8.14                0.5         8.8
0.4         8.25                0.6         10.1

Table 5: The performance for Italian

4.2 Argument and Sentiment Label

From Table 6, we can see that when we set the threshold at 0.2 and 0.3, we get the highest precision for the argument label and the sentiment label, respectively. However, unlike the linking precision mentioned above, bigger thresholds result in lower precision. The reason may be that when we set a bigger threshold, we obtain far fewer linking sentences. Sometimes we can only get one or two sentence pairs, and if there are any wrong answers in the results, they will obviously decrease the precision.



Threshold   Argument   Sentiment   Threshold   Argument   Sentiment
0           85.9       77.6        0.5         76.7       78.3
0.1         86.1       79.6        0.6         67.6       60.7
0.2         86.3       79.1        0.7         52.9       52.0
0.3         84.8       81.1        0.8         47.6       59.1
0.4         80.5       75.4        0.9         47.3       58.9

Table 6: The performance of Labeling

5 Conclusion

For content linking, our system has tried to mine both syntactic and semantic information, and the performance is good. For argument and sentiment labeling, we focus on machine learning algorithms and sentiment dictionaries, and there is still room for improvement. Our future work is to find better ways to mine and use more semantic features for both content linking and labeling.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 91546121, 71231002 and 61202247; National Social Science Foundation of China under Grant 16ZDA055; EU FP7 IRSES MobileCloud Project 612212; the 111 Project of China under Grant B08004; Engineering Research Center of Information Networks, Ministry of Education; the project of Beijing Institute of Science and Technology Information; and the project of CapInfo Company Limited.

References

Peeyush Aggarwal and Richa Sharma. 2016. Lexical and syntactic cues to identify reference scope of citance. In BIRNDL@JCDL, pages 103–112.

Ahmet Aker, Fabio Celli, Adam Funk, Emina Kurtic, Mark Hepple, and Rob Gaizauskas. 2015. Sheffield-Trento system for sentiment and argument structure enhanced comment-to-article linking in the online news domain.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Ziqiang Cao, Wenjie Li, and Dapeng Wu. 2016. PolyU at CL-SciSumm 2016. In BIRNDL@JCDL, pages 132–138.

Xue Chen, Wenqing Tang, Hao Xu, and Xiaofeng Hu. 2014. Double LDA: A sentiment analysis model based on topic model. In International Conference on Semantics, Knowledge and Grids, pages 49–56.

J. A. Hartigan and M. A. Wong. 1979. Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28(1):100–108.

Hristo Tanev and Alexandra Balahur. 2015. Tackling the OnForumS challenge.

Kokil Jaidka, Muthu Kumar Chandrasekaran, Sajal Rustagi, and Min-Yen Kan. 2016. Overview of the CL-SciSumm 2016 shared task. In BIRNDL@JCDL, pages 93–102.

Mijail Kabadjov, Josef Steinberger, Emma Barker, Udo Kruschwitz, and Massimo Poesio. 2015. OnForumS: The shared task on online forum summarisation at MultiLing'15. In Forum for Information Retrieval Evaluation, pages 21–26.

Stefan Klampfl, Andi Rexha, and Roman Kern. 2016. Identifying referenced text in scientific publications by summarisation and classification techniques. In BIRNDL@JCDL, pages 122–131.

Lei Li, Liyuan Mao, Yazhao Zhang, Junqi Chi, Taiwen Huang, Xiaoyue Cong, and Heng Peng. 2016. CIST system for CL-SciSumm 2016 shared task. In BIRNDL@JCDL, pages 156–167.

Kun Lu, Jin Mao, Gang Li, and Jian Xu. 2016. Recognizing reference spans and classifying their discourse facets. In BIRNDL@JCDL, pages 139–145.

Bruno Malenfant and Guy Lapalme. 2016. RALI system description for CL-SciSumm 2016 shared task. In BIRNDL@JCDL, pages 146–155.

Diana Maynard and Adam Funk. 2011. Automatic detection of political opinions in tweets. In Extended Semantic Web Conference, pages 88–99. Springer.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Computer Science.

Luis Moraes, Shahryar Baki, Rakesh Verma, and Daniel Lee. 2016. University of Houston at CL-SciSumm 2016: SVMs with tree kernels and sentence similarity. In BIRNDL@JCDL, pages 113–121.

Tadashi Nomoto. 2016. NEAL: A neurally enhanced approach to linking citation and reference. In BIRNDL@JCDL, pages 168–174.

Horacio Saggion, Ahmed AbuRa'ed, and Francesco Ronzano. 2016. Trainable citation-enhanced summarization of scientific articles. In BIRNDL@JCDL, pages 175–186.



Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, pages 37–46, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Ultra-Concise Multi-genre Summarisation of Web 2.0: towards Intelligent Content Generation

Elena Lloret, Ester Boldrini, Patricio Martínez-Barco, Manuel Palomar
Department of Software and Computing Systems
University of Alicante
Apdo. de correos, 99
E-03080 Alicante, Spain
{elloret,eboldrini,patricio,mpalomar}@dlsi.ua.es

Abstract

The electronic Word of Mouth has become the most powerful communication channel thanks to the wide usage of Social Media. Our research proposes an approach towards the production of automatic ultra-concise summaries from multiple Web 2.0 sources. We exploit user-generated content from reviews and microblogs in different domains, and compile and analyse four types of ultra-concise summaries: a) positive information, b) negative information, c) both, or d) objective information. The appropriateness and usefulness of our model is demonstrated by its successful results and great potential in real-life applications, thus representing a relevant advancement over state-of-the-art approaches.

1 Introduction and Motivation

The Web 2.0 has created a framework where users from all over the world express their opinion on a wide range of topics via different Social Media communication channels, such as blogs, fora, micro-blogs, reviews, etc. Undoubtedly, all this information is of great value in today's competitive business environment, increasing the need for businesses to collect, monitor, and analyse user-generated data on their own and on their competitors' Social Media, such as Twitter (He et al., 2015). Moreover, this context is also fostering the electronic Word of Mouth (eWOM) (Jansen et al., 2009), an unpaid form of promotion (Duan et al., 2008) in which customers share with other users their experience with, for example, the product they bought. WOM is an ancient phenomenon that originated orally in the streets; now, with the flourishing of the Web 2.0, it has evolved into eWOM, whose essence is the same; the only difference is that it is not implemented orally, but using Social Media instead (Boldrini et al., 2010): fora, blogs, online reviews and microblogs. However, the huge amount and heterogeneity of online data pose great challenges to the development of applications able to effectively retrieve, extract and synthesise the main content spread within Social Media. Due to the richness of Social Media data, its exploitation is crucial for business-oriented applications, such as market analysis, competence monitoring or simply understanding the reasons behind customers' opinions on a product. Having effective applications for information analysis and exploitation at their disposal would give businesses a competitive advantage.

Recently, three Natural Language Processing (NLP) applications have been gaining predominance, especially in the field of Social Media content analysis: i) information retrieval (Croft et al., 2009), ii) opinion mining (Pang and Lee, 2008) and iii) automatic text summarisation (Nenkova and McKeown, 2011). Information retrieval aims to search for and determine relevant documents on the Web according to a specific user need or topic. The goal of opinion mining is to identify subjective language and classify it according to its sentiment or polarity (i.e., positive, negative or neutral information). Finally, text summarisation detects the most relevant pieces of information from one or multiple texts and presents the main ideas in a coherent fragment of text.

The main objective of this article is to apply the aforementioned NLP techniques to exploit the Social Media data generated through online reviews and microblogs. In particular, our aim is to produce innovative automatic ultra-concise summaries in the form of tweets (140 characters), reliable in terms of content (they reflect the opinions - positive/negative - expressed on a topic) and form. Even if there are some previous studies on this topic (Ganesan et al., 2012), the novelty of our approach comes from the usage of multiple textual genres simultaneously and the production of an ultra-concise summary (multi-genre summarisation). This means that our final summary is presented to the user in the form of a tweet. The summary is representative of what has been said on a pre-defined topic. It is reliable since we perform a robust treatment of our source data: we treat each of the sources separately (because each textual genre has specific needs) and then merge the distinctive and relevant information to automatically build up the tweet as the final outcome.

Microblogs, and especially tweets, have a direct impact on eWOM communication. They empower people to share these brand-affecting points of view anywhere, with almost anyone. Moreover, Princeton Survey Research Associates International¹ found that the microblogging site Twitter experienced massive growth over the past years, with millions of new users joining and engaging with the site on a daily basis (Smith and Brennen, 2012). While the conciseness of microblogs keeps people from writing extensive reflections, it is exactly the micro part that makes microblogs peculiar compared with other eWOM channels, such as blogs, webs, etc. Moreover, the advantage of having a tweet as a final outcome is that: a) we provide immediate information, b) users can take it and exploit it in the way they prefer (i.e., post it), c) they get a complete overview covering content of different genres in a friendly format, and d) they save a lot of their time, since the system carries out the job for them: retrieving, analysing, selecting, and providing them with the information they are looking for. Due to the limited length of a tweet, it is necessary to analyse to what extent and how current approaches could be adapted.

The motivation of our article lies in the fact that microblogging is one of the most used Social Media channels and is thus considered a point of reference for many users. This implies that the brief summaries we generate would be useful for users, who can use them directly in their Social Networks. Microblogs have the potential of reaching a huge number of users. But not only ordinary users can take advantage of this; for instance, a company can exploit such ultra-concise summaries by disseminating them through different channels for advertising purposes, to attract more customers or to make its potential customers aware of the high reputation of their products; thus, this technology becomes a real-life application that will allow users to save a lot of time and effort, since the system will do the job automatically, analysing the selected texts and summarising their content in a reliable way.

¹ http://www.psrai.com/ (last access 30 January 2017)

The paper is organised as follows: Section 2 presents the most relevant related work, while Section 3 describes the corpus creation and Section 4 its annotation. Section 5 describes the methodology for generating ultra-concise summaries and Section 6 the evaluation of the results obtained. Section 7 analyses our approach in depth, and finally, Section 8 outlines the main conclusions and future work.

2 Related Work

In the last years, there has been much interest in summarisation from Social Media, within the wider context of opinion summarisation. Twitter is now the most popular microblogging service. It is a huge repository of data and it is gaining popularity in different NLP tasks, especially in automatic text summarisation focusing on generating brief summaries starting from a collection of microblog entries such as tweets (O'Connor et al., 2010; Sharifi et al., 2010; Kim et al., 2014), possibly enriched with other sources of information, such as Webpage links or newswire (Liu et al., 2011). In (Sharifi et al., 2010) a trending topic is considered as a starting point from which all related posts are collected and summarised. They generally use machine learning algorithms to detect the sentences most related to the topic phrase. In (Inouye and Kalita, 2011) a comparative analysis of different summarisation techniques is carried out to determine which is the most adequate for this type of summaries, and it is concluded that simple word frequency and redundancy reduction are the best techniques for Twitter topic summarisation. In (Weng et al., 2011), the approach to summarise Twitter posts consists of two stages: i) classification of the posts and responses into different groups, according to their intention (interrogation, sharing, discussion and chat) and ii) analysis of different strategies for building the summary through sentiment analysis techniques or simply analysing the responses for each post. Their final summaries are generated depending on the category they belong to.



Domain                                        Technology      Motor
Topic                                         Mobile phones   Cars
Number of topics                              10              10
Number of tweets per topic                    13              10
Number of online reviews per topic            9               10
Microblogs (Avg. number of words per tweet)   28.05           14.76
Microblogs (Number of words in total)         3647            1476
Reviews (Avg. number of words per review)     146.30          156.13
Reviews (Number of words in total)            13167           15613

Table 1: Corpus statistics.

For example, if the summarised posts are within the sharing group, the summary is a pie chart showing the percentage of positive and negative opinions. Other relevant studies aim at generating other types of summaries, like event summaries (Chakrabarti and Punera, 2011) or ultra-concise opinion summaries (Ganesan et al., 2012). In the former, the goal is to produce a real-time summary of events, focusing on American Football games, and to analyse the performance of different summarisers. In the latter, the objective is to generate a tweet from a set of reviews, where each tweet is a summary of a key opinion in text, and it relies on techniques based on Web N-grams.

Our research idea is based on (Ganesan et al., 2012), but the main novelty and added value of our research with respect to it is precisely the multi-genre perspective addressed: starting simultaneously from multiple sources of information, tweets and online reviews, we treat them and produce an ultra-concise summary that is reliable and representative in terms of content. This output can be directly used by users in their own Social Networks, so that they do not waste their time analysing what they found and preparing a summary of the huge amount of information available.

3 Corpus Creation

We automatically gathered a corpus of online reviews and tweets in English for 10 different mobile phones and 10 cars using the crawler developed in (Fernandez et al., 2010). A crawler is an automatic process in charge of retrieving and extracting the HTML content of a Website. In this process, relevant information sources are identified (e.g., Websites of reviews), and then the content of useful web pages within the selected sources of information is retrieved, downloaded, and extracted in an automatic, quick and easy manner using the well-known vector space model. In this manner, a corpus with a large number of documents can be created automatically. Using this crawler, 10 mobile phone brands and 10 car brand models were selected to conduct the experiments. For each brand, we obtained on average 10 microblogs extracted from Twitter and 10 online reviews from Amazon² and WhatCar³, having 400 texts in total. We selected cars and mobile phones since they are frequently discussed by people with different profiles and, at the same time, they are from two different domains (important for demonstrating the efficiency of our approach in multi-domain contexts). Furthermore, mobile phone and car topics can be seen as lying between the high-level terminology complexity of a specific and technical domain, such as medicine, and an easier one, like food or music. In our case, for this step of our research we focused on topics of medium complexity to check whether our techniques were pertinent for a real-life application.

Table 1 summarises the main features of our corpus. Analysing the specific properties and features of each of the Social Media textual genres, the most outstanding difference is the length of the texts. While the tweets are very short, having no more than 28 words on average, the reviews are quite a bit longer, with more than 145 words on average. Both textual genres will therefore be complementary: tweets provide conciseness, whereas reviews provide opinions in a specific context.

4 Annotation Process

Once the documents were retrieved, we pre-processed them by extracting only the main text; a very important step, since we eliminate all possible characters that are not part of the information, links, and emoticons. Then, we started the annotation process, where one expert annotator manually labelled the documents using the coarse-grained version of the EmotiBlog annotation schema (Boldrini et al., 2010). Having the documents annotated by just one expert was considered enough for this preliminary study, since the EmotiBlog model had been previously evaluated (Boldrini, 2012) and proved to be easy to use. The annotation schema developed in EmotiBlog was created for automatic systems to detect subjective expressions in the new textual genres of the Web 2.0 and has been employed to improve the performance of different NLP applications dealing with complex tasks, including opinion summarisation, where it obtained satisfactory results (Balahur et al., 2009).

² http://www.amazon.com (last access 30 January 2017)
³ http://www.whatcar.com (last access 30 January 2017)

Although EmotiBlog was originally a very fine-grained annotation scheme, we decided to use the less fine-grained part of the resource, since our research purposes at this level only required detecting subjectivity and classifying the polarity of a statement. Therefore, the expert annotator labelled the corpus at sentence level over 4 weeks of part-time dedication, using the following EmotiBlog elements: POLARITY (positive/negative/neutral) + INTENSITY (high/medium/low). The fragment below shows an example of a labelled sentence for the topic "Nokia 2700": a subjective sentence whose polarity is positive, with high intensity.

<phenomenon degree1="high" category="phrase"
    polarity="positive" source="w"
    target="Nokia 2700" confidence="high">It's
inexpensive, it's primarily a phone but it has
some useful features, it's neat and it works
well.</phenomenon>

We selected the abovementioned elements of the EmotiBlog model since our purpose is to discriminate between objective and subjective sentences, to further discriminate the subjective ones into positive/negative, and finally to have the summarisation system treat those two groups to produce a reliable ultra-concise summary that reflects the reality of users' feelings. This process implies the added challenge for the summarisation system of being able to treat the two typologies (objective/subjective), quite a challenging task if we take into consideration the high language variability present in the Web 2.0 (Pacea, 2011). The reason why we performed manual annotation was that we wanted to ensure precise labelling and minimise cascade errors derived from the use of NLP tools. In this manner, we can focus more on the evaluation of the automatic summarisation approach.

5 Ultra-concise Summary Generation

To the best of our knowledge, this is one of the first attempts at generating this new type of summaries starting from text sources of a different nature (i.e., multi-genre) and about different opinion polarities (positive/negative/objective). This requires the summarisation system to have high coverage in order to produce such a small text with full meaning, relevant to the topic, and with grammatical adequacy. These types of summaries have enormous benefits, as presented in Section 1. As was previously said, the ultra-concise summarisation approach employed in this research is novel considering the following main aspects. First of all, it is able to deal with different types of Web 2.0 textual genres (tweets and reviews), thus employing NLP techniques able to treat the linguistic phenomena encountered (thanks to the annotation performed beforehand). The proposed approach takes as a starting point the corpus previously created, and then it performs a series of steps to determine the relevant information and to analyse the most appropriate manner to present it in the form of a tweet. In addition, its modular architecture allows the inclusion of tools for deeper language/linguistic/content analysis. Figure 1 depicts an overview of our proposed approach. Next, the stages involved in the process are explained in more detail:

1. Polarity detection and classification: our main goal in this stage is to group the sentences into three sets: one for the positive, one for the negative, and one for the objective sentences. In this manner, taking as a starting point a set of topic-related reviews and tweets, the first stage is to detect and classify the opinions contained. In our research work, we take advantage of the manual annotation process previously explained, because in this manner precise information concerning the positive, negative and objective sentences is obtained.



Figure 1: Overview of our proposed approach.

2. Sentence ranking: the aim of this stage is to assign a relevance score to each positive, negative or objective sentence. This relevance score is determined automatically, relying on automatic summarisation techniques. Specifically, two heuristics were used to compute the relevance of a sentence: term frequency and noun-phrase length. On the one hand, term frequency is a statistical technique that assumes that the more important sentences are determined by the most frequent words, without taking into account the words that do not carry any semantic information (i.e., stopwords such as "the", "a", etc.). On the other hand, the use of noun-phrase length has a linguistic motivation (Givon, 1990), where it is stated that longer noun-phrases carry more important information. Then, for calculating the score of each sentence, these two heuristics were combined, thus considering more important those sentences containing longer noun-phrases composed of highly frequent words. The combination of both techniques has been proven to work well for producing automatic summaries by means of the COMPENDIUM summariser (Lloret and Palomar, 2013).

3. Determining candidate text fragments for tweets: although a relevance score was assigned to each sentence, one of the key issues of the task we are facing is how to produce a tweet that is informative enough while, at the same time, respecting the 140-character length restriction. At this stage, regardless of the relevance of each sentence, we identify within each group of sentences those not exceeding 140 characters, to be combined with the relevance score in the next stage. Although this may seem a trivial approach, current semantic analysis and natural language generation tools are not capable of dealing with the textual genres of the Web 2.0, making many errors that could be detrimental for the final summaries.

4. Selecting the most appropriate tweet: having, on the one hand, the score for each sentence, and, on the other hand, the group of sentences that would fit the tweet length, an added value of our proposed approach is to take into account in a joint manner the sentence relevance score, distinguishing sentences with respect to their polarity, and the potential tweet candidates that fit the right length. The strategy followed in this stage is to select as tweets the candidates that are most similar to the top-most relevant sentence in each group while, at the same time, not surpassing 140 characters. To achieve this, we used the cosine similarity measure: the tweet candidate whose cosine score is most similar to the sentences identified as most relevant is extracted as the final tweet. This works well for the generation of positive, negative or objective tweets. However, when a mixed tweet needs to be produced, the result of combining subjective and objective information may end up being a sentence longer than 140 characters. If this happens, we use the same method as in (Lloret et al., 2013) for compressing sentences. In the end, we obtain four tweets (positive, negative, objective, and one containing a mix of subjective/objective information). We therefore propose these four tweets as reliable ultra-concise summaries.
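The following snippet is a minimal sketch of stages 2-4, not the authors' implementation: it assumes the sentences of one polarity group are already available as strings, uses a toy stopword list and a hypothetical list of noun-phrase chunks in place of a full parser, scores each sentence by the frequency of its non-stopword terms weighted by noun-phrase length, and then returns the candidate of at most 140 characters that is most cosine-similar to the top-ranked sentence.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "and", "of", "to"}  # toy stopword list

def sentence_score(sentence, term_freq, np_chunks):
    """Stage 2: combine term frequency and noun-phrase length, so sentences
    containing longer noun phrases made of frequent words score higher."""
    words = [w for w in sentence.lower().split() if w not in STOPWORDS]
    tf = sum(term_freq[w] for w in words)
    np_len = max((len(phrase.split()) for phrase in np_chunks
                  if phrase in sentence.lower()), default=1)
    return tf * np_len

def cosine(a, b):
    """Cosine similarity between two sentences using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(va[w] * vb[w] for w in set(va) & set(vb))
    den = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return num / den if den else 0.0

def select_tweet(sentences, np_chunks):
    """Stages 2-4: rank the sentences, keep candidates of at most 140
    characters, and return the one most similar to the top-ranked sentence."""
    term_freq = Counter(w for s in sentences for w in s.lower().split()
                        if w not in STOPWORDS)
    ranked = sorted(sentences, key=lambda s: sentence_score(s, term_freq, np_chunks),
                    reverse=True)
    candidates = [s for s in sentences if len(s) <= 140]
    return max(candidates, key=lambda s: cosine(s, ranked[0])) if candidates else None

# Hypothetical positive sentences from one topic
positive = ["The battery life is good and will last me 2-3 days.",
            "This is a fab little phone which I highly recommend."]
print(select_tweet(positive, np_chunks=["battery life", "fab little phone"]))
```

In the paper the noun phrases come from a linguistic analysis and the mixed tweets additionally go through sentence compression when they exceed 140 characters; both are omitted here for brevity.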

All these stages, together with the corpus creation, are then included in a semi-automatic sequential process, where basic and intermediate NLP components are used in order to analyse and pre-process the input documents to the summarisation engine.

6 Evaluation and Results

The evaluation of our ultra-concise summarisation approach was carried out by two expert users who evaluated, in an independent manner, the automatic ultra-concise summaries (i.e., the generated tweets). For this evaluation, we relied on the qualitative criteria employed in the TAC conferences4. These criteria were: a) the content of the summary, b) its readability and c) its overall responsiveness. The first criterion determines whether the tweet reflects important information of the source inputs; the second assesses whether the tweet is well written and easy to understand; and the third evaluates whether it is reliable and suitable for a real-life application. Evaluating our results with these criteria allows us to determine whether the output is useful and of high quality, so that no additional treatment is needed. The summaries were evaluated from a conceptual point of view, focusing on the general idea expressed rather than on the individual words building them up.

4 http://www.nist.gov/tac/ (last access 30 January 2017)

For each topic (mobile phones and cars) we produce different alternatives as tweets: i) positive information; ii) negative information; iii) objective information; and iv) mixed subjective and objective information. Each of these tweets (40 in total) was rated according to a 3-level Likert scale, with values ranging from 1 to 3 (1 = poor or very poor; 2 = barely acceptable; and 3 = good or very good). The reason for choosing such a scale and not a 5-level one was to avoid assessment dispersion. In our case, the agreement between the two assessors was quite high: 60% for content, 75% for readability, and 65% for the overall responsiveness criterion, meaning that both assessors assigned the same score to the summary. It is important to stress that the assessors had access to the original reviews and tweets from which the automatic tweets were generated, and they read them in advance in order to be able to determine whether the automatic tweets were reliable and a good representative of the source documents. Table 2 shows the average results obtained for each type of automatically generated tweet; the last two columns provide the global average obtained taking all the results into account.

The results are very encouraging, indicating that our proposed approach can be considered appropriate for synthesising relevant information into a single tweet or ultra-concise summary. Next, we show an example of a potential mixed tweet of 126 characters that contains subjective and objective information and belongs to the group that obtained the best results:

“Most annoyingly, the alarm is MUCH too quiet. Aside from the alarm though, this is a fab little phone which I highly recommend.”

Table 2: Average results for the ultra-concise summaries generated with our approach

             Content          Readability      Over. Resp.      Global
             Phones  Cars     Phones  Cars     Phones  Cars     Phones  Cars
Positive     2.55    2.70     2.65    2.65     2.55    2.70     2.58    2.68
Negative     2.72    2.50     2.83    2.60     2.76    2.65     2.76    2.58
Objective    2.01    2.35     2.61    2.40     2.22    2.35     2.22    2.37
Mixed        2.75    2.70     2.85    2.80     2.77    2.75     2.77    2.75

Generally, the results for each criterion are high, since most of them are over 2.50 and, in many cases, very close to 3, the maximum value. Thus, the tweets are readable, easy to understand, and reflect relevant information or the key aspects of the phones. If we look at the results obtained, we can also notice a better performance in the case of negative sentences. This could be due to the fact that negative concepts are generally expressed in a much stronger and more direct way than positive ones, and are thus easier to detect and treat from a linguistic point of view. Moreover, we can also appreciate that the objective sentences have lower performance; after having analysed our test set, we conclude that this is because of the huge amount of advertisement that companies launch through such communication channels, adding noise and negatively influencing the retrieval of the sentences and thus the performance of the system.

Analysing the generated tweets in more detail, we find some beneficial aspects of our approach and some other aspects with room for improvement. Concerning the positive ones, the modularity of our approach makes it easy to adapt it to other textual genres and languages. Moreover, a strong advantage of our approach is that it is able to produce different types of tweets to be used for different purposes. Each type of tweet can be more or less suitable depending on the users' needs and interests, thus allowing the user's or company's profile to be taken into account. For instance, if a company wants to emphasise a good feature of a product, a positive tweet would be the best option. In contrast, if they want to improve the weaknesses of their products, a negative tweet could be more appropriate (always seeing this automatically generated tweet as an additional tool to be properly managed by market experts). On the other hand, we also observed some cases in which the automatically generated tweets did not meet our expectations. We found that some original tweets were in languages other than English, such as Spanish or French, and therefore they were counted as incorrect, since we are focusing only on the English language. This was probably because the crawler used did not filter the language of the tweets; this did not occur for the reviews. However, this stresses the necessity and importance of multilingual approaches that could deal with these challenges and exploit a larger amount of data. Another issue worth discussing is the informality frequently found in Web 2.0 content. In our case, this was higher in tweets than in on-line reviews, and although the automatic tweets were not much affected by it, it could become problematic when dealing with highly informal texts.

Table 3: Comparison results for mixed tweet generation across systems

                  Content          Readability      Over. Resp.      Global
                  Phones  Cars     Phones  Cars     Phones  Cars     Phones  Cars
Our approach      2.75    2.70     2.85    2.80     2.77    2.75     2.77    2.75
Swotti            2.22    2.35     2.45    2.50     2.30    2.40     2.32    2.42
Ganesan           1.20    1.30     1.45    1.85     1.25    1.45     1.30    1.48
COMPENDIUMsem     1.75    2.65     1.65    2.75     1.70    2.60     1.70    2.67

For both domains tested, the mixed summaries were the ones with the best results. To compare our summaries in the form of a tweet with other existing approaches, we took into account the following systems, and we evaluated the results following the same criteria as in the previous evaluation of our approach:

• COMPENDIUMsem (Vodolazova et al., 2012) for determining relevant sentences. This summariser takes into account semantic features, such as concept identification and disambiguation, textual entailment and anaphora resolution.

• The approach proposed in (Ganesan et al., 2012), which is also able to produce ultra-concise opinion summaries.

• Swotti5, which is a commercial system that provides summarised opinion information for a wide range of products.

5 http://swotti.starmedia.com/ (last access 30 January 2017)

Table 3 shows the comparison results of our approach (mixed tweet configuration) with respect to the other systems, where the last two columns provide the global average per row. As can be seen, our approach obtains the best performance, showing again its appropriateness for use in real-life applications. From the comparison results, we would like to note that, even though we also tested summarisation approaches relying on semantic knowledge, the results of these summaries were lower than when using simple word frequency techniques, as in our approach. This confirms the findings of (Inouye and Kalita, 2011) and verifies the power of lexical-based techniques for summarisation, which can also be seen as a competitive method for the generation of ultra-concise summaries. In addition, the results achieved with Swotti, a real on-line platform that provides opinion summarisation, could not surpass the ones obtained with our approach either.

7 Potentials and Limitations of Our Approach

After having described our approach and the results obtained, it is worth underlining that, since this is the first step of an experimental approach, we decided to take into account only the text. As explained above, all non-conventional characters were deleted. In addition, emoticons have also been removed; however, our idea is to take them into account in future steps, since they are usually charged with polarity that could also be an added value for the correct interpretation of the text.

The retrieval and pre-processing stages confirmed that the text available in Social Media is extremely informal and does not always follow conventional rules. There are many cases of informal language containing sayings or collocations that are hard to interpret automatically. They will be taken into account in future work, since the EmotiBlog annotation model is able to capture and interpret them in terms of polarity. In addition to semantic challenges, the issue of informality is a main concern. Last but not least, working with different textual genres poses the problem of having to deal with different linguistic phenomena, typical of one genre or another. As a result, despite the satisfactory performance, we also obtained cases with room for improvement. Examples of this are:

• Help please?!

• 2.5mm Standard Headset Adapter for LG GW520 [Wireless Phone Accessory]: A top quality standard headset adapter (... http://amzn.to/wnzLVP

• Easy AdSense Pro by Unreal Here is download link LG KS360/KS365 PC suite/Sync/USB Driver for http://goo.gl/fb/uxV5o

From the examples above, we can deduce that, in the cases in which the system performance was not satisfactory, we mostly obtain sentences with no relevant content. Examples of sentences for which the system performance was satisfactory are the following:

• The only annoying thing I do find is all the Blackberry apps are geared up for the USA, not UK, which limits their desirability.

• The battery life is good and will last me 2-3 days if I am careful.

• All the Blackberry apps are geared up for the USA. The internet drains is pretty quickly. The battery is good and will last me 2-3 days.

As can be seen, the content is relevant to the topic searched. In addition, in the third case we obtain a mixed tweet, which includes different ideas taken from the positive, negative and objective sentences. This is very important to maintain the relevance of the content and thus the quality and reliability of the system output.

8 Conclusion and Future Work

In this article, we presented an innovative method for creating an effective system that produces ultra-concise summaries of no more than 140 characters (i.e., in the form of a tweet). The approach represents an advancement of the state of the art, since we work with multiple textual genres simultaneously and produce a reliable ultra-concise summary. We start from the information contained in a selected corpus composed of on-line reviews and microblogs written in English about mobile phones and cars; more concretely, we selected a set of 10 cell phones and 10 cars of different brands. We presented our corpus and the annotation process carried out. For the corpus annotation, we applied the coarse-grained version of the EmotiBlog annotation schema, a fine-grained annotation schema for the detection of subjectivity in the new textual genres of the Web 2.0. We took advantage of such annotation to propose a novel summarisation approach that employed statistical and linguistic techniques for detecting relevant sentences. We generated four types of ultra-concise summaries (or tweets) that were then evaluated following a standard qualitative framework, and compared to similar approaches. From the results obtained we conclude that our approach is reliable and appropriate for generating these types of summaries. The successful performance of the tweets (in terms of content reliability and syntactic adequacy) clearly indicates that they can be used within a real-life application. The selection of which tweet to show is out of the scope of this paper and would depend on the target users and the purpose of the tweet. However, one strategy that could be adopted would be to rely on existing automatic rating models, such as EmotiReview (Boldrini et al., 2011). In this manner, depending on the rating given to a specific product, we could decide which type of tweet should be generated.

Despite the good results achieved, there are several issues that have to be tackled to improve the generation of ultra-concise summaries, and which we plan to address as future work. On the one hand, in the short term, we will mainly focus on two aspects: multilinguality and the inclusion of more Web 2.0 textual genres and domains. In this manner, we will be able to extend our approach to languages such as Italian or Spanish, to deal with other types of texts, such as fora or blogs, and to increase the domains covered (tourism, politics, etc.). Moreover, we want to analyse in more detail the impact of each proposed stage in the summarisation process, as well as the influence of each textual genre. We will also substitute the manual annotation for sentiment analysis with an automatic one, analyse more features for sentiment analysis (e.g., target, intensity, etc.), and test other summarisation techniques. On the other hand, for the medium- and long-term research, we will increase the size of the annotated corpus and use it to train a machine learning system that automatically detects and classifies the objective/subjective information. In addition, a topic detection stage able to identify concepts and their relationships would be necessary in order to personalise the resulting summaries. Other important issues to take into account will be the informality of the text and some sentiment-analysis-related phenomena, such as irony, which were out of the scope of this paper due to their great difficulty. The informality of a text could be detected and used to normalise the texts using, for instance, the TENOR tool (Mosquera and Moreda, 2012), or, vice versa, to produce an informal summary in the form of a tweet. For detecting ironic expressions, we could also rely on already existing approaches, such as the one described in (Reyes et al., 2012).

Acknowledgments

This research work has been partially funded by the Generalitat Valenciana and the Spanish Government through the projects DIIM2.0 (PROMETEOII/2014/001), TIN2015-65100-R and TIN2015-65136-C2-2-R.

References

Alexandra Balahur, Elena Lloret, Ester Boldrini, Andrés Montoyo, Manuel Palomar, and Patricio Martínez-Barco. 2009. Summarizing threads in blogs using opinion polarity. In Proceedings of the Workshop on Events in Emerging Text Types, pages 23–31, Borovets, Bulgaria, September. Association for Computational Linguistics.

Ester Boldrini, Alexandra Balahur, Patricio Martínez-Barco, and Andrés Montoyo. 2010. EmotiBlog: A finer-grained and more precise learning of subjectivity expression models. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 1–10, Uppsala, Sweden, July. Association for Computational Linguistics.

Ester Boldrini, Javier Fernández, José Manuel Gómez, and Patricio Martínez-Barco. 2011. EmotiBlog: Towards a finer-grained sentiment analysis and its application to opinion mining. In Proceedings of IV Jornadas TIMM, pages 49–54.

Ester Boldrini. 2012. EmotiBlog: A Model to Learn Subjective Information Detection in the New Textual Genres of the Web 2.0 - a Multilingual and Multi-Genre Approach. Ph.D. thesis, University of Alicante.

Deepayan Chakrabarti and Kunal Punera. 2011. Event summarization using tweets. In Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Catalonia, Spain, July 17-21, 2011.

Bruce Croft, Donald Metzler, and Trevor Strohman. 2009. Search Engines: Information Retrieval in Practice. Addison-Wesley Publishing Company, USA, 1st edition.

Wenjing Duan, Bin Gu, and Andrew B. Whinston. 2008. Do Online Reviews Matter? An Empirical Investigation of Panel Data. Decision Support Systems, 45(4):1007–1016, November.

Javi Fernández, José Manuel Gómez, and Patricio Martínez-Barco. 2010. Evaluation of web information retrieval systems on restricted domains. Procesamiento del Lenguaje Natural, 45:273–276.

Kavita Ganesan, ChengXiang Zhai, and Evelyne Viegas. 2012. Micropinion generation: An unsupervised approach to generating ultra-concise summaries of opinions. In Proceedings of the 21st International Conference on World Wide Web, pages 869–878, New York, NY, USA. ACM.

Talmy Givon. 1990. Syntax: A Functional-Typological Introduction, Volume II. John Benjamins, Amsterdam.

Wu He, Harris Wu, Gongjun Yan, Vasudeva Akula, and Jiancheng Shen. 2015. A novel social media competitive analytics framework with sentiment benchmarks. Information & Management, 52(7):801–812.

David Inouye and Jugal K. Kalita. 2011. Comparing Twitter summarization algorithms for multiple post summaries. In Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on and 2011 IEEE Third International Conference on Social Computing (SocialCom), Boston, MA, USA, 9-11 October 2011, pages 298–306.

B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury. 2009. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science, 60(11):2169–2188, November.

Tae-Yeon Kim, Jaekwang Kim, Jaedong Lee, and Jee-Hyong Lee. 2014. A tweet summarization method based on a keyword graph. In Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication, ICUIMC '14, pages 96:1–96:8, New York, NY, USA. ACM.

Fei Liu, Yang Liu, and Fuliang Weng. 2011. Why is "sxsw" trending? Exploring multiple text sources for Twitter topic summarization. In Proceedings of the Workshop on Languages in Social Media, LSM '11, pages 66–75, Stroudsburg, PA, USA. Association for Computational Linguistics.

Elena Lloret and Manuel Palomar. 2013. COMPENDIUM: A text summarisation tool for generating summaries of multiple purposes, domains, and genres. Natural Language Engineering, 19:147–186.

Elena Lloret, María Teresa Romá-Ferri, and Manuel Palomar. 2013. COMPENDIUM: A text summarization system for generating abstracts of research papers. Data & Knowledge Engineering, 88:164–175.

Alejandro Mosquera and Paloma Moreda. 2012. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue - 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012, Proceedings, volume 7499 of Lecture Notes in Computer Science, pages 535–542. Springer.

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2-3):103–233.

Brendan O'Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the Fourth International Conference on Weblogs and Social Media, ICWSM 2010, Washington, DC, USA, May 23-26, 2010.

O. Pacea. 2011. CorpTweet: Brands, Language and Identity in Web 2.0. SERIA FILOLOGIE, 22(2):157–168.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, January.

Antonio Reyes, Paolo Rosso, and Davide Buscaldi. 2012. From Humor Recognition to Irony Detection: The Figurative Language of Social Media. Data & Knowledge Engineering, 74:1–12.

Beaux Sharifi, Mark-Anthony Hutton, and Jugal Kalita. 2010. Summarizing microblogs automatically. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 685–688, Los Angeles, California, June. Association for Computational Linguistics.

A. Smith and J. Brennen. 2012. Twitter Use 2012. Technical report.

Tatiana Vodolazova, Elena Lloret, Rafael Muñoz, and Manuel Palomar. 2012. A comparative study of the impact of statistical and semantic features in the framework of extractive text summarization. In Proceedings of the 15th International Conference on Text, Speech, and Dialogue, TSD '12, pages 306–313.

Jui-Yu Weng, Cheng-Lun Yang, Bo-Nian Chen, Yen-Kai Wang, and Shou-De Lin. 2011. IMASS: An intelligent microblog analysis and summarization system. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages 133–138, Portland, Oregon, June. Association for Computational Linguistics.


Machine Learning Approach to Evaluate MultiLingual Summaries

Samira Ellouze, Maher Jaoua and Lamia Hadrich Belguith
ANLP-RG, MIRACL Laboratory, FSEG Sfax, University of Sfax, Sfax, Tunisia

[email protected], [email protected], [email protected]

Abstract

The present paper introduces a new MultiLing text summary evaluation method. This method relies on a machine learning approach which operates by combining multiple features to build models that predict the human score (overall responsiveness) of a new summary. We have tried several single and "ensemble learning" classifiers to build the best model. We have experimented with our method in summary-level evaluation, where we evaluate the quality of each text summary separately. The correlation between the built models and the human score is better than the correlation between the baselines and the manual score.

1 Introduction

Nowadays, the evaluation of summarization systems is an important step in the development cycle of those systems. In fact, it accelerates the development cycle by giving an analysis of errors, allowing an optimization of systems and comparing each system with others. The evaluation of a text summary covers its content, its linguistic quality or both. Whatever the type of evaluation (content and/or linguistic quality), the evaluation of system summary output is a difficult task, given that in most cases there is not a single good summary. In the extreme case, two summaries of the same document set may have completely different words and/or sentences with different structures. Several metrics have been proposed to evaluate the content, the linguistic quality and the overall responsiveness of monolingual text summaries. We can cite ROUGE (Lin and Hovy, 2003), BE (Hovy et al., 2006), AutoSummENG (Giannakopoulos et al., 2008), BEwTE (Tratz and Hovy, 2008), etc. Some of those metrics, such as ROUGE and AutoSummENG, can also assess multilingual text summaries, but they only evaluate the content of multilingual text summaries.

To encourage research on the development of automatic multilingual multi-document summarization systems, a new task, dubbed MultiLing Pilot (Giannakopoulos et al., 2011), was introduced for the first time at the TAC 2011 conference. Later, two workshops, the ACL 2013 MultiLing Pilot (Giannakopoulos, 2013) and MultiLing 2015 at SIGdial 2015 (Giannakopoulos et al., 2015), were organised with the same purpose as the MultiLing Pilot 2011. The summarization systems participating in the MultiLing task have been assessed using automatic content metrics such as ROUGE-1, ROUGE-2 and MeMoG, and a manual metric named Overall Responsiveness, which covers both the content and the linguistic quality of a text summary. However, the manual evaluation of both the content and the linguistic quality of multilingual multi-document summarization systems is an arduous and costly process. In addition, the automatic evaluation of only the content of a summary is not enough, because a summary should also have a good linguistic quality. For this reason, automatic metrics that evaluate the content and the linguistic quality of summaries in several languages should be developed. In this context, we propose a new method based on a machine learning approach for evaluating the overall quality of automatic text summaries. This method can predict the human score (Overall Responsiveness) of English and Arabic text summaries by combining multiple content and linguistic quality features.

The rest of the paper is organized as follows: first, in Section 2 we introduce the main metrics that have been proposed to evaluate text summaries; then, in Section 3 we explain the methodology adopted in our work. In Section 4 we present the different experiments and results for summary-level evaluation. Finally, Section 5 describes the main conclusions and possible future work.

2 Related Works

The summary evaluation task started as a monolingual evaluation task. Several manual and automatic metrics have been developed to evaluate the content and the linguistic quality of a text summary. Manual evaluation is expensive and time-consuming, so there is a need to assess text summaries automatically. One of the standards in automatic evaluation is ROUGE (Lin and Hovy, 2003). It measures overlapping content between a candidate summary and reference summaries; ROUGE scores are obtained through the comparison of common word N-grams. Later, Giannakopoulos et al. (2008) introduced the AutoSummENG metric, which is based on statistically extracting textual information from the summary. The information extracted from the summary represents a set of relations between n-grams in this summary. The n-grams and their relations are represented as a graph, where the nodes are the n-grams and the edges represent the relations between them. The similarity is calculated by comparing the graph of the candidate summary with the graph of each reference summary. In a subsequent work, (Giannakopoulos and Karkaletsis, 2010) presented the Merge Model Graph (MeMoG), another variation of AutoSummENG based on n-gram graphs. This variation calculates the merged graph of all reference summaries and then compares the candidate summary graph to this merged graph. Afterwards, the SIMetrix (Summary Input similarity Metrics) measurement was developed by (Louis and Nenkova, 2013); it assesses a candidate summary by comparing it with the source documents. SIMetrix computes ten measures of similarity based on the comparison between the source documents and the candidate summary; among the similarity measures used, we can cite the cosine similarity, the Jensen-Shannon divergence, the Kullback-Leibler divergence, etc.
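As a rough illustration of the n-gram graph idea behind AutoSummENG and MeMoG, the sketch below builds character n-gram graphs whose edges are weighted by co-occurrence within a window, and compares two graphs with a simplified value similarity. It is a toy reduction under stated assumptions (character trigrams, a fixed window, a single similarity formula) and does not reproduce the exact definitions used by those metrics.

```python
from collections import Counter

def ngram_graph(text, n=3, window=3):
    """Nodes are character n-grams; edge weights count how often two n-grams
    occur within `window` positions of each other."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[frozenset((g, grams[j]))] += 1
    return edges

def value_similarity(g1, g2):
    """Simplified value similarity: reward common edges with close weights."""
    if not g1 or not g2:
        return 0.0
    common = set(g1) & set(g2)
    sim = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return sim / max(len(g1), len(g2))

peer = "the summary evaluation method"
model = "the method for summary evaluation"
print(value_similarity(ngram_graph(peer), ngram_graph(model)))
```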

Recently, (Giannakopoulos and Karkaletsis, 2013) proposed the NPowER (N-gram graph Powered Evaluation via Regression) metric, which is a combination of AutoSummENG and MeMoG: they build a linear regression model that predicts a manual (human) score. All the above metrics (ROUGE, AutoSummENG, NPowER and SIMetrix) are used in monolingual and multilingual summary evaluation. Some of those metrics were adapted to multilingual evaluation, while others (i.e., AutoSummENG) support multilingual evaluation from the beginning.

3 Proposed Method

From Table 1, we notably remark that, in the Arabic language, the correlation between ROUGE-2 and Overall Responsiveness is very low. In addition, almost no correlation exists between MeMoG, AutoSummENG, NPowER and Overall Responsiveness for the Arabic language; perhaps this is due to the complexity of the Arabic language structure. For the English language, we note that the correlation between the automatic metrics and Overall Responsiveness is better than for the Arabic language, but it is still low. This motivated us to combine those automatic metrics in order to predict Overall Responsiveness, expecting that the combination of those metrics will give a better correlation. In addition, the Overall Responsiveness score is a real number between 1 and 5 which assesses the content and the linguistic quality of a text summary. This means that we should combine multiple features related to the content and the linguistic quality of a summary; for this reason we have added multiple syntactic features. Then, a predictive model for each language is built by combining multiple features.

The basic idea of the proposed evaluation methodology is the prediction of the human score (Overall Responsiveness) (Dang and Owczarzak, 2008) for a candidate summary in the Arabic or English language. This prediction is obtained by extracting features from the candidate summary itself, and by comparing the candidate summary with the source documents or with reference summaries. To obtain the predictive model for each language, the extracted features are combined using a linear regression algorithm. In the following subsections, we first give the list of the features used, and then we describe the combination scheme.

3.1 Used features

Table 1: Kendall's Tau correlation between gradings (R2, MeMoG, AutoSummENG, NPowER and OR) with p-value < 0.1 on the MultiLing 2013 corpus

Language    R2 to OR    MeMoG* to OR    AutoSummENG* to OR    NPowER* to OR
Arabic      0.125       0.018           0.029                 0.031
English     0.216       0.202           0.239                 0.234

* The Kendall correlation for MeMoG, AutoSummENG and NPowER is given with parameters: minimum length of N-grams = maximum length of N-grams = window size = 3.

In the proposed method we use several classes of features related to the content and the linguistic quality of a text summary. The features used are:

• ROUGE Scores: ROUGE scores are designed to evaluate the content of a text summary. They are based on the overlap of word N-grams between a candidate summary and one or more reference summaries. According to (Conroy and Dang, 2008), ROUGE variants which take into account large contexts may capture linguistic qualities of the summary, such as some grammatical phenomena; that is, ROUGE variants that use bigrams, trigrams or more can capture some grammatical phenomena from the well-formedness of reference sentences. For this reason, we include in the ROUGE feature class the scores which take into account large contexts: ROUGE-1 (R1), ROUGE-2 (R2), ROUGE-3 (R3), ROUGE-4 (R4) and ROUGE-5 (R5), which calculate word overlaps of unigrams, bigrams, trigrams, 4-grams and 5-grams, respectively.

• AutoSummENG, MeMoG and NPowER scores: these three scores, based on n-gram graphs (Giannakopoulos and Karkaletsis, 2010), are used to assess the content and the readability of a summary. To calculate these scores, we must adjust three parameters: the minimum length of N-grams, the maximum length of N-grams and the window size between two N-grams. In our experiments, we have used three configurations for each score. The first configuration sets the minimum length of N-grams to 1, the maximum length of N-grams to 2 and the window size to 3. The second configuration assigns 3 to the minimum length of N-grams, 3 to the maximum length of N-grams and 3 to the window size. Finally, the third one assigns 4 to the minimum length of N-grams, 4 to the maximum length of N-grams and 3 to the window size. In fact, because the Overall Responsiveness score evaluates both the content and the linguistic quality of a summary, we have chosen the first configuration to assess the content and the two other configurations to capture some grammatical phenomena from the well-formedness of reference sentences. We have assumed that, for these scores as well, configurations which take into account large contexts may capture the linguistic qualities of the summary.

• SIMetrix scores: we have used the following six scores calculated by SIMetrix (Louis and Nenkova, 2013): the Kullback-Leibler (KL) divergence between the source documents (SDs) and the candidate summary (CS) (KLInputSummary), the KL divergence between the CS and the SDs (KLSummaryInput), the unsmoothed version of the Jensen-Shannon divergence between the SDs and the CS (unsmoothedJSD) and the smoothed one (smoothedJSD), the probability of unigrams of the CS given the SDs (unigramProb), and the multinomial probability of the CS given the SDs (multinomialProb).

• Syntactic features: the syntactic structure of sentences is an important factor that can determine the linguistic quality of texts. (Schwarm and Ostendorf, 2005) and (Feng et al., 2010) used syntactic features to gauge the readability of texts for reading-level assessment, while (Kate et al., 2010) used syntactic features to predict the linguistic quality of natural-language documents. We implement some of these features using the Stanford parser (Klein and Manning, 2003). We calculate the number and the average number of noun phrases (NP), verbal phrases (VP) and prepositional phrases (PP). The average number of each phrase type is calculated as the ratio between the number of phrases of that type and the total number of sentences.
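To make two of the feature classes above concrete, here is a minimal sketch, assuming the summary and the source documents are available as plain strings and that parse trees are available as bracketed strings. It computes an unsmoothed Jensen-Shannon divergence between unigram distributions (in the spirit of the SIMetrix scores, not their exact implementation) and counts NP/VP/PP nodes as simple syntactic features; the regex over bracketed parses stands in for the Stanford parser output used in the paper.

```python
import math
import re
from collections import Counter

def unigram_dist(text):
    """Relative frequencies of whitespace-separated tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jensen_shannon(p, q):
    """Unsmoothed Jensen-Shannon divergence between two unigram distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(a.get(w, 0.0) * math.log2(a.get(w, 0.0) / m[w])
                   for w in vocab if a.get(w, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def phrase_counts(parse_trees):
    """Count NP, VP and PP labels in bracketed parses and average per sentence."""
    counts = Counter()
    for tree in parse_trees:
        for label in re.findall(r"\((NP|VP|PP)\b", tree):
            counts[label] += 1
    n = max(len(parse_trees), 1)
    return {lab: (counts[lab], counts[lab] / n) for lab in ("NP", "VP", "PP")}

source = "the summary evaluation of multilingual summaries is difficult"
summary = "multilingual summary evaluation is difficult"
print(jensen_shannon(unigram_dist(source), unigram_dist(summary)))
print(phrase_counts(["(S (NP (DT the) (NN summary)) (VP (VBZ is) (PP (IN of) (NP (NN quality)))))"]))
```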

3.2 Combination scheme

Before building a predictive model, we should first calculate the values of all the features.


Then, we select the relevant ones using the "wrapper method" (Kohavi and John, 1997). This method evaluates subsets of features, which makes it possible to detect interactions between features. It evaluates the performance of each subset of features and returns the best one. This does not mean that the other features are not good, but rather that the combination of features from the best subset gives the best performance.
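The paper performs this selection with Weka's wrapper subset evaluator; the snippet below is a sketch of the same idea with scikit-learn instead (it assumes a recent version providing SequentialFeatureSelector, and a feature matrix X and Overall Responsiveness vector y already loaded). A greedy forward wrapper repeatedly adds the feature whose inclusion most improves the cross-validated performance of the downstream regressor, which is close in spirit to, but not identical with, Weka's search strategy.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Toy data standing in for the real feature matrix: each column would be one
# candidate feature (ROUGE variants, n-gram graph scores, SIMetrix scores,
# phrase counts, ...); y stands in for the manual Overall Responsiveness score.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = 0.6 * X[:, 0] + 0.3 * X[:, 3] + rng.normal(scale=0.5, size=150)

# Forward wrapper selection: subsets are scored by the cross-validated
# performance of the regressor itself, so feature interactions are considered.
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="forward", cv=10)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```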

Now, to build the predictive model (combination scheme) for a language, we have used several basic (single) algorithms and "ensemble learning" algorithms, implemented in the Weka environment (Witten et al., 2011), using a regression method. As basic algorithms we use "GaussianProcesses", LinearRegression and SMOReg. As "ensemble learning" algorithms, we use "Bagging" (Breiman, 1996), "AdditiveRegression" (Friedman, 1999), "Stacking" (Wolpert, 1992) and "Vote" (Kuncheva, 2004).

After testing the algorithms, we adopt the one that produces the best predictive model. The validation of each model is performed with two methods: 10-fold cross-validation and a supplied test set.
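As a rough scikit-learn analogue of this combination scheme (the paper itself uses Weka), the sketch below trains a plain linear regression and two ensemble regressors, bagging and stacking, on an assumed feature matrix and scores them with 10-fold cross-validation; AdditiveRegression and Vote have no exact drop-in counterparts here and are omitted.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))                             # selected features (stand-in)
y = 1 + 2 * X[:, 0] + rng.normal(scale=0.4, size=150)     # Overall Responsiveness (stand-in)

models = {
    "LinearRegression": LinearRegression(),
    "Bagging": BaggingRegressor(n_estimators=10, random_state=1),
    "Stacking": StackingRegressor(estimators=[("lr", LinearRegression()), ("svr", SVR())],
                                  final_estimator=LinearRegression()),
}

for name, model in models.items():
    # 10-fold cross-validation; the negative RMSE returned by scikit-learn is
    # flipped back to a positive error for reporting.
    rmse = -cross_val_score(model, X, y, cv=10,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE = {rmse:.3f}")
```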

4 Evaluation

4.1 Corpus

In this article, we use the TAC 2011 MultiLing Pilot corpus (Giannakopoulos et al., 2011) and the MultiLing 2013 corpus (Giannakopoulos, 2013). The two corpora include the source documents, peer summaries, model summaries, and automatic and manual evaluation results. The first corpus is available in 7 languages; we use only the Arabic and English documents. For the Arabic language, there are seven participating systems and two baseline systems, while for the English language there are eight participating systems and two baseline systems. For each language, the source documents are divided into ten collections of newspaper articles. Each collection includes ten articles related to the same topic and has three model (human) summaries. Each summarization system is invited to generate a summary for each collection of documents. The MultiLing 2013 corpus is available in 10 languages; again we use only the Arabic and English documents. For each language, there are eight participating systems, two baseline systems and 15 collections of newspaper articles. Each collection includes ten articles related to the same topic, and each summarization system is invited to generate a summary for each collection.

4.2 Experiments and results

We have experimented with our method in summary-level evaluation (micro-evaluation). At this level, we take, for each summarization system, each produced summary as a separate entry. It is worth mentioning that this evaluation level is more difficult than system-level evaluation (i.e., where the average quality of a summarizing system is measured), even for monolingual summary evaluation (Ellouze et al., 2013; Ellouze et al., 2016). For each language, we have tested several single and "ensemble learning" classifiers integrated in the Weka environment and based on a regression method, such as GaussianProcesses, LinearRegression, Vote, Bagging, etc.

We validate our models using 10-fold cross-validation and using a supplied test set. For the cross-validation method, we calculated the features on the "MultiLing 2013" corpus, while for the supplied test set method we used the "MultiLing 2013" corpus as training set and the "MultiLing Pilot TAC 2011" corpus as test set. We chose to train our models on the "MultiLing 2013" corpus because it contains more summaries (150 summaries for Arabic and 149 for English). To evaluate the proposed method, we study the Pearson (Pearson, 1895), Spearman (Spearman, 1910) and Kendall (Kendall, 1938) correlations between the manual scores (Overall Responsiveness) and the scores produced by the proposed method. Furthermore, we report the Root Mean Squared Error (RMSE) generated by each model; this measure is based on the difference between the manual scores (Overall Responsiveness) and the predicted scores.
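These correlations and the RMSE can be computed directly with SciPy and NumPy; the short sketch below assumes that `predicted` holds the model's scores and `manual` the Overall Responsiveness judgments for the same summaries (the values shown are made-up examples).

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# Hypothetical predicted scores and manual Overall Responsiveness scores
predicted = np.array([2.8, 3.4, 1.9, 4.1, 3.0, 2.2])
manual = np.array([3.0, 3.5, 2.0, 4.0, 2.5, 2.5])

pearson, _ = pearsonr(predicted, manual)
spearman, _ = spearmanr(predicted, manual)
kendall, _ = kendalltau(predicted, manual)
rmse = np.sqrt(np.mean((predicted - manual) ** 2))   # root mean squared error

print(f"Pearson={pearson:.3f} Spearman={spearman:.3f} Kendall={kendall:.3f} RMSE={rmse:.3f}")
```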

Arabic Summary Evaluation

We begin with the experiments performed on the Arabic language. The selected features for the Arabic models are: autosummeng443, unsmoothedJSD, unigramProb, multinomialProb, ROUGE-3 and the number of NP phrases in the summary. The Pearson, Spearman and Kendall correlations and the root mean squared error (RMSE) generated by each classifier for the Arabic language are presented in Table 2.

Table 2: Pearson, Spearman and Kendall correlations with Overall Responsiveness and RMSE for various single and ensemble learning classifiers for the Arabic language

                                    Cross-validation                         Supplied test set
Classifiers                         Pearson  Spearman  Kendall  RMSE         Pearson  Spearman  Kendall  RMSE
Single classifiers
GaussianProcesses                   0.329    0.328     0.236    0.696        0.224    0.229     0.163    0.591
LinearRegression                    0.306    0.292     0.207    0.708        0.196    0.197     0.148    0.647
SMOReg                              0.299    0.304     0.216    0.711        0.128    0.181     0.142    0.632
"Ensemble learning" classifiers
AdditiveRegression                  0.337    0.327     0.232    0.697        0.185    0.194     0.150    0.643
Vote                                0.320    0.330     0.236    0.705        0.212    0.226     0.169    0.650
Bagging                             0.330    0.335     0.239    0.700        0.185    0.218     0.160    0.637
Stacking                            0.308    0.322     0.228    0.701        0.217    0.232     0.172    0.625

Table 2 shows the performance of the selected features in building the predictive models using several single and ensemble learning classifiers. In the case of the cross-validation method, the results show that the model built from the "ensemble learning" classifier "Bagging" produced the best Kendall (0.239) and Spearman (0.335) correlations, "AdditiveRegression" produced the best Pearson (0.337) correlation, while "GaussianProcesses" produced the lowest RMSE (0.696). In the case of the supplied test set method, Table 2 indicates that the best "ensemble learning" classifier is "Stacking", which provides a model with a Kendall correlation of 0.172 and a Spearman correlation of 0.232, while "GaussianProcesses" produced the best Pearson correlation (0.224) and the lowest RMSE. Another notable observation is that the correlations using cross-validation are higher than those using the supplied test set, whereas the RMSE using the supplied test set is lower than using cross-validation; this means that the error between the predicted values and the actual values is smaller when using the supplied test set. The decrease of correlation between the cross-validation method and the supplied test set method needs to be studied further in future work.

We now compare the performance of the best obtained model with the baseline metrics adopted by the MultiLing workshop, such as ROUGE-2 and MeMoG, to which we add the best variant of each of the three other well-known metrics: AutoSummENG, NPowER and SIMetrix. Table 3 details the different correlations and RMSEs of the baseline metrics and of our different experiments.

From Table 3, the model built from the combination of selected features has the best correlation and RMSE compared to the baselines. When observing Table 3, we see the gap between the baseline metrics and the model built from the selected features. In addition, we notice a decrease of correlation with both validation methods (cross-validation, supplied test set) when we try to remove one of the classes of features. Moreover, we remark that removing the SIMetrix metrics from the selected features has a big effect on the correlation with Overall Responsiveness when using the supplied test set as validation method.

Besides, we note that the correlation of the best model with Overall Responsiveness is low, although it is higher than the correlation of the baselines. This may be due to the small set of observations for the Arabic language: we need a larger set of observations to determine the best combination of features and to obtain a better correlation. Furthermore, it may also be due to the complexity of the Arabic language structure: Arabic is an agglutinative language where agglutination (Grefenstette et al., 2005) occurs when articles, prepositions and conjunctions are attached to the beginning of words and pronouns are attached to the end of words. This phenomenon can greatly influence the operation of comparing the candidate summary with the reference summaries, especially when a word appears in the candidate summary without agglutination while it appears in a reference summary in an agglutinative form, and vice versa.

Table 3: Pearson, Spearman and Kendall correlations with the Overall Responsiveness score and RMSE for the Arabic language

Baselines
Score                      Pearson  Spearman  Kendall
ROUGE-2                    0.164    0.175     0.125
AutoSummENG443             0.055    0.063     0.045
MeMoG443                   0.066    0.039     0.03
NPowER443                  0.063    0.064     0.049
SIMetrix unigramProb       0.258    0.257     0.182

Our experimentations
                                        Cross-validation                       Supplied test set
Score                                   Pearson  Spearman  Kendall  RMSE       Pearson  Spearman  Kendall  RMSE
Combining selected features (CSF)       0.330    0.335     0.239    0.700      0.217    0.232     0.172    0.625
CSF without ROUGE                       0.276    0.298     0.213    0.713      0.194    0.149     0.107    0.638
CSF without AutoSummENG                 0.315    0.319     0.227    0.704      0.190    0.225     0.160    0.647
CSF without SIMetrix                    0.310    0.340     0.243    0.717      0.057    0.048     0.039    0.646
CSF without Synt Feat                   0.285    0.244     0.172    0.708      0.199    0.154     0.111    0.601

English Summary Evaluation

We now move to the different experiments performed on the English language. The selected features for the English models are NPowER123, autosummeng443, the number of NP phrases in the text summary and the average number of PP per sentence in the text summary. The Pearson, Spearman and Kendall correlations and the root mean squared error (RMSE) generated by each classifier for the English language are presented in Table 4.

Table 4 shows the performance of the selected features in building the predictive models using several single and ensemble learning classifiers for the English language. For the cross-validation method, the results show that the model built from the "ensemble learning" classifier "Bagging" produced the best Kendall (0.393), Spearman (0.537) and Pearson (0.529) correlations and the lowest RMSE (0.652).

For the supplied test set validation method, Table 4 indicates that the best "ensemble learning" classifier in terms of correlation and RMSE is also "Bagging". In fact, this classifier has the best correlations (e.g., Kendall: 0.322) and the lowest RMSE (0.754). Again, we note that the correlations using cross-validation are higher than those using the supplied test set. The decrease of correlation between the cross-validation method and the supplied test set method may be caused by the variation of the human evaluators and/or the change of evaluation guidelines from MultiLing 2011 to MultiLing 2013.

We now move to the comparison between the performance of the best obtained model and the baseline metrics adopted by the MultiLing workshop, such as ROUGE-2 and MeMoG, to which we add the best variant of each of the three other well-known metrics: AutoSummENG, NPowER and SIMetrix. Table 5 details the different correlations and RMSEs of the baseline metrics, the other well-known metrics and our best model.

From Table 5, we see the gap between the baseline metrics and our experiments, with both validation methods. We have retained the model built from the "Bagging" classifier with both validation methods. We also observe that the elimination of one of the used classes of features decreases the correlation of the best model (built from the selected features) with Overall Responsiveness and increases the RMSE. Furthermore, we note that the elimination of the syntactic features class decreases the correlation enormously with both methods of validation. A surprising observation is that the elimination of the AutoSummENG score increases the correlation instead of decreasing it. Generally, we have noted the effect of the syntactic features on the best model for both languages (Arabic, English).

Table 4: Pearson, Spearman and Kendall correlations with Overall Responsiveness and RMSE for various single and ensemble learning classifiers for the English language

                                    Cross-validation                         Supplied test set
Classifiers                         Pearson  Spearman  Kendall  RMSE         Pearson  Spearman  Kendall  RMSE
Single classifiers
GaussianProcesses                   0.519    0.508     0.367    0.656        0.395    0.365     0.258    0.780
LinearRegression                    0.514    0.490     0.353    0.658        0.236    0.384     0.277    1.542
SMOReg                              0.510    0.5184    0.375    0.668        0.372    0.310     0.227    0.803
"Ensemble learning" classifiers
AdditiveRegression                  0.522    0.499     0.360    1.092        0.276    0.427     0.313    3.028
Vote                                0.523    0.522     0.380    0.661        0.232    0.395     0.285    1.475
Bagging                             0.529    0.537     0.393    0.652        0.465    0.444     0.322    0.754
Stacking                            0.503    0.519     0.379    0.663        0.372    0.427     0.304    0.837

Table 5: Pearson, Spearman and Kendall correlations with the Overall Responsiveness score and RMSE for the English language

Baselines
Score                        Pearson  Spearman  Kendall
ROUGE-2                      0.314    0.316     0.216
AutoSummENG123               0.358    0.385     0.263
MeMoG123                     0.370    0.362     0.254
NPowER123                    0.385    0.386     0.266
SIMetrix unsmoothedJSD       0.235    0.248     0.173

Our experimentations
                                        Cross-validation                       Supplied test set
Score                                   Pearson  Spearman  Kendall  RMSE       Pearson  Spearman  Kendall  RMSE
Combining selected features (CSF)       0.529    0.537     0.393    0.652      0.465    0.444     0.322    0.754
CSF without AutoSummENG                 0.466    0.459     0.333    0.680      0.502    0.452     0.335    0.802
CSF without NPowER                      0.505    0.498     0.363    0.663      0.310    0.285     0.203    0.794
CSF without Synt Feat                   0.396    0.388     0.267    0.705      0.377    0.312     0.236    0.834

5 Conclusion

We have presented a method for evaluating the Overall Responsiveness of text summaries in both the Arabic and English languages. This method is based on a combination of ROUGE scores, AutoSummENG scores, MeMoG scores, NPowER scores, SIMetrix scores and a variety of syntactic features. We have combined these features using a regression method. Before building the linear regression model, we select the relevant features using the "Wrapper subset evaluator" method. The selected features include automatic metrics and syntactic features, and generally the automatic features that take into account a large context are selected (autosummeng443, ROUGE-3, etc.). This confirms the hypothesis of (Conroy and Dang, 2008), which indicates that the integration of content scores that take into account large contexts may capture some grammatical phenomena.

To evaluate our method, we have compared the correlation with the manual Overall Responsiveness of the best model (built with the selected features) and of the baselines. We have tested two methods of validation of the predictive models: 10-fold cross-validation and a supplied test set. The results show that, in both languages, the correlation of the best model with Overall Responsiveness is low, although it is higher than the correlation of the baselines. This may be due to the small set of observations per language; we need a larger set of observations to determine the best combination of features and to obtain a better correlation. Moreover, we note that the correlation using cross-validation is higher than using the supplied test set. The decrease of correlation between the cross-validation method and the supplied test set method needs to be studied further in future work.

The main steps we plan to take in future work are the construction of predictive models for more languages and the addition of other types of features, such as entity-based features, part-of-speech features, co-reference features, shallow features, etc.

References

Leo Breiman. 1996. Bagging predictors. Machine Learning, 24:123–140.

John M. Conroy and Hoa T. Dang. 2008. Mind the gap: Dangers of divorcing evaluations of summary content from linguistic quality. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 145–152.

Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 update summarization task. In TAC 2008 Workshop - Notebook Papers and Results, pages 10–23.

Samira Ellouze, Maher Jaoua, and Lamia Hadrich Belguith. 2013. An evaluation summary method based on a combination of content and linguistic metrics. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), pages 245–251, Hissar, Bulgaria.

Samira Ellouze, Maher Jaoua, and Lamia Hadrich Belguith. 2016. Automatic evaluation of a summary's linguistic quality. In Proceedings of Natural Language Processing and Information Systems (NLDB), Lecture Notes in Computer Science, pages 392–400.

Lijun Feng, Martin Jansche, Matt Huenerfauth, and Noémie Elhadad. 2010. A Comparison of Features for Automatic Readability Assessment. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 276–284.

Jerome H. Friedman. 1999. Stochastic gradient boosting. Technical report, Stanford University.

George Giannakopoulos and Vangelis Karkaletsis. 2010. Summarization system evaluation variations based on n-gram graphs. In Proceedings of the Third Text Analysis Conference, TAC 2010.

George Giannakopoulos and Vangelis Karkaletsis. 2013. Summary evaluation: Together we stand NPowER-ed. In Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing - Volume 2, pages 436–450.

George Giannakopoulos, Vangelis Karkaletsis, George Vouros, and P. Stamatopoulos. 2008. Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing, 5(3):1–39.

George Giannakopoulos, Mahmoud El-Haj, Benoît Favre, Marianna Litvak, Josef Steinberger, and Vasudeva Varma. 2011. TAC 2011 MultiLing pilot overview. In Proceedings of the Fourth Text Analysis Conference.

George Giannakopoulos, Jeff Kubina, John M. Conroy, Josef Steinberger, Benoît Favre, Mijail A. Kabadjov, Udo Kruschwitz, and Massimo Poesio. 2015. MultiLing 2015: Multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the 16th Annual SIGdial Meeting on Discourse and Dialogue.

George Giannakopoulos. 2013. Multi-document multilingual summarization and evaluation tracks in ACL 2013 MultiLing workshop. In Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization.

Gregory Grefenstette, Nasredine Semmar, and Faïza Elkateb-Gara. 2005. Modifying a natural language processing system for European languages to treat Arabic in information processing and information retrieval applications. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pages 31–37.

Eduard Hovy, Chin-Yew Lin, Liang Zhou, and Junichi Fukumoto. 2006. Automated summarization evaluation with basic elements. In Proceedings of the Fifth Conference on Language Resources and Evaluation (LREC).

Rohit J. Kate, Xiaoqiang Luo, Siddharth Patwardhan, Martin Franz, Radu Florian, Raymond J. Mooney, Salim Roukos, and Chris Welty. 2010. Learning to predict readability using diverse linguistic features. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 546–554.

Maurice Kendall. 1938. A new measure of rank correlation. Biometrika, 30:81–89.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1, pages 423–430.

Ron Kohavi and George H. John. 1997. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324.

Ludmila I. Kuncheva. 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 71–78.

Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300, June.

Karl Pearson. 1895. Mathematical contributions to the theory of evolution, II: Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London (A), 186:343–414.

Sarah E. Schwarm and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 523–530.

Charles Spearman. 1910. Correlation calculated from faulty data. British Journal of Psychology, 3:271–295.

Stephen Tratz and Eduard Hovy. 2008. BEwTE: Basic elements with transformations for evaluation. In Proceedings of the Text Analysis Conference (TAC) Workshop.

Ian H. Witten, Eibe Frank, and Mark A. Hall. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition.

David H. Wolpert. 1992. Stacked generalization. Neural Networks, 5:241–259.


Author Index

Baldwin, Timothy, 7
Basile, Pierpaolo, 12
Boldrini, Ester, 37

Chen, Moye, 32
Cohn, Trevor, 7
Conroy, John, 1

Ellouze, Samira, 47

Favre, Benoit, 1

Giannakopoulos, George, 1

Hadrich Belguith, Lamia, 47

Jaoua, Maher, 47

Kubina, Jeff, 1

Lau, Jey Han, 7
Li, Lei, 32
Litvak, Marina, 1, 22
Lloret, Elena, 1, 37

Mao, Liyuan, 32
Martinez-Barco, Patricio, 37

Palomar, Manuel, 37

Rankel, Peter A., 1
Rossiello, Gaetano, 12

Semeraro, Giovanni, 12
Steinberger, Josef, 1

Vanetik, Natalia, 22

Xu, Ying, 7
