

Guiding Extractive Summarization with Question-Answering Rewards

Kristjan Arumae and Fei Liu
Computer Science Department

University of Central Florida, Orlando, FL 32816, USA
[email protected] [email protected]

Abstract

Highlighting while reading is a natural behavior for people to track salient content of a document. It would be desirable to teach an extractive summarizer to do the same. However, a major obstacle to the development of a supervised summarizer is the lack of ground-truth. Manual annotation of extraction units is cost-prohibitive, whereas acquiring labels by automatically aligning human abstracts and source documents can yield inferior results. In this paper we describe a novel framework to guide a supervised, extractive summarization system with question-answering rewards. We argue that quality summaries should serve as a document surrogate to answer important questions, and such question-answer pairs can be conveniently obtained from human abstracts. The system learns to promote summaries that are informative, fluent, and perform competitively on question-answering. Our results compare favorably with those reported by strong summarization baselines, as evaluated by automatic metrics and human assessors.

1 Introduction

Our increasingly digitized lifestyle calls for summarization techniques to produce short and accurate summaries that can be accessed at any time. These summaries should factually adhere to the content of the source text and present the reader with its key points. Although neural abstractive summarization has shown promising results (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017), these methods can have potential drawbacks. It was revealed that abstracts generated by neural systems sometimes alter or falsify objective details, and introduce new meanings not present in the original text (Cao et al., 2018). Reading these abstracts can lead to misinterpretation of the source materials, which is clearly undesirable. In this work, we focus on extractive summarization, where the summaries are guaranteed

(CNN) A judge this week sentenced a former TSA agent to six months in jail for secretly videotaping a female co-worker while she was in the bathroom, prosecutors said.

During the investigation, detectives with the Metro Nashville Police Department in Tennessee also found that the agent, 33-year-old Daniel Boykin, entered the woman's home multiple times, where he took videos, photos and other data.

Police found more than 90 videos and 1,500 photos of the victim on Boykin's phone and computer.

The victim filed a complaint after seeing images of herself on his phone last year. [...]

Comprehension Questions (Human Abstract):

Former ____ Daniel Boykin, 33, videotaped his female co-worker in the restroom, authorities say.

Authorities say they found 90 videos and 1,500 photos of the victim on ____ and computer.

Table 1: An example extractive summary bolded in the article (top). Highlighted sections indicate salient segments useful for answering fill-in-the-blank questions generated from human abstracts (bottom).

to remain faithful to the original content. Our system seeks to identify salient and consecutive sequences of words from the source document and highlight them in the text to assist users in browsing and comprehending lengthy documents. An example is illustrated in Table 1.

A primary challenge faced by extractive summarizers is the lack of annotated data. The cost of hiring humans to label a sufficient number of source articles with summary words, enough to train a modern classifier, can be prohibitive. Previous work has exploited using human abstracts to derive labels for extraction units (Woodsend and Lapata, 2010): e.g., a source word is tagged 1 if it appears in the abstract, 0 otherwise. Although pairs of source articles and human abstracts are abundant, labels derived in this way are not necessarily best, since summary saliency cannot be easily captured with rule-based categorization. Considering that human abstracts involve generalization and paraphrasing, and can


contain words not present in the source text, leveraging them to derive labels for extraction units can be suboptimal. In this work, we investigate a new strategy that seeks to better utilize human abstracts to guide the extraction of summary text units.

We hypothesize that quality extractive summaries should contain informative content so that they can be used as document surrogates to answer important questions, thereby satisfying users' information needs. The question-answer pairs can be conveniently developed from human abstracts. Our proposed approach identifies answer tokens from each sentence of the human abstract, then replaces each answer token with a blank to create a Cloze-style question-answer pair. To answer all questions (≈ human abstract), the system summary must contain content that is semantically close to and collectively resembles the human abstract.

In this paper, we construct an extractive summary by selecting consecutive word sequences from the source document. To accomplish this we utilize a novel reinforcement learning framework to explore the space of possible extractive summaries, assessing each summary with a novel reward function that judges its adequacy, fluency, length, and competency to answer important questions. The system learns to sample extractive summaries yielding the highest expected rewards, with no pre-derived extraction labels needed. This work extends the methodology of Arumae and Liu (2018) with new representations of extraction units and a thorough experimental evaluation. The contributions of this research can be summarized as follows:

• We describe a novel framework for generating extractive summaries by selecting consecutive sequences of words from source documents. This new system explores various encoding mechanisms, as well as new sampling techniques to capture phrase-level data. Such a framework has not been thoroughly investigated in the past;

• We conduct a methodical empirical evaluation from the point of view of information saliency. Rather than relying solely on automatic summarization evaluation methods, we also demonstrate the advantages of our system by assessing summary quality with reading comprehension tasks. Our summaries compare favorably with state-of-the-art systems on automatic metrics, and show promising results against baselines when evaluated by humans for question answering.

2 Related Work

Extractive summarization has seen growing popularity in the past decades (Nenkova and McKeown, 2011). The methods focus on selecting representative sentences from the document(s) and optionally deleting unimportant sentence constituents to form a summary (Knight and Marcu, 2002; Radev et al., 2004; Zajic et al., 2007; Martins and Smith, 2009; Gillick and Favre, 2009; Lin and Bilmes, 2010; Wang et al., 2013; Li et al., 2013, 2014; Hong et al., 2014; Yogatama et al., 2015). A majority of the methods are unsupervised. They estimate sentence importance based on the sentence's length and position in the document, whether the sentence contains topical content, and its relationship with other sentences. The summarization objective is to select a handful of sentences that maximize the coverage of important content while minimizing summary redundancy. Although unsupervised methods are promising, they cannot benefit from the large-scale training data harvested from the Web (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018).

Neural extractive summarization has focused primarily on extracting sentences (Nallapati et al., 2017; Cao et al., 2017; Isonuma et al., 2017; Tarnpradab et al., 2017; Zhou et al., 2018; Kedzie et al., 2018). These studies exploit parallel training data consisting of source articles and story highlights (i.e., human abstracts) to create ground-truth labels for sentences. A neural extractive summarizer learns to predict a binary label for each source sentence indicating if it is to be included in the summary. These studies build distributed sentence representations using neural networks (Cheng and Lapata, 2016; Yasunaga et al., 2017) and use reinforcement learning to optimize the evaluation metric (Narayan et al., 2018b) and improve summary coherence (Wu and Hu, 2018). However, sentence extraction can be coarse, and in many cases only part of a sentence is worth adding to the summary. In this study, we perform finer-grained extractive summarization by allowing the system to select consecutive sequences of words rather than sentences to form a summary.

Interestingly, studies reveal that summaries generated by recent neural abstractive systems are, in fact, quite "extractive." Abstractive systems often adopt the encoder-decoder architecture with an attention mechanism (Rush et al., 2015; Nallapati et al., 2016; Paulus et al., 2017; Guo et al., 2018; Gehrmann et al., 2018; Lebanoff et al., 2018; Celikyilmaz et al., 2018).


The encoder condenses a source sequence to a fixed-length vector and the decoder takes the vector as input and generates a summary by predicting one word at a time. See, Liu, and Manning (2017) suggest that about 35% of the summary sentences occur in the source documents, and 90% of summary n-grams appear in the source. Moreover, the summaries may contain inaccurate factual details and introduce new meanings not present in the original text (Cao et al., 2018; Song et al., 2018). This raises concerns as to whether such systems can be used in real-world scenarios to summarize materials such as legal documents. In this work, we choose to focus on extractive summarization, where selected word sequences can be highlighted in the source text to avoid changes of meaning.

Our proposed method is inspired by the work of Lei et al. (2016), who seek to identify rationales from textual input to support sentiment classification and question retrieval. Distinct from this previous work, we focus on generating generic document summaries. We present a novel supervised framework encouraging the selection of consecutive sequences of words to form an extractive summary. Further, we leverage reinforcement learning to explore the space of possible extractive summaries and promote those that are fluent, adequate, and competent in question answering. We seek to test the hypothesis that successful summaries can serve as document surrogates to answer important questions, and moreover, that ground-truth question-answer pairs can be derived from human abstracts. In the following section we describe our proposed approach in detail.

3 Our Approach

Let S be an extractive summary consisting of text segments selected from a source document x. The summary can be mapped to a sequence of binary labels y assigned to document words. In this section we first present a supervised framework for identifying consecutive sequences of words that are summary-worthy, then proceed by describing our question-answering rewards and a deep reinforcement learning framework to guide the selection of summaries so that they can be used as document surrogates to answer important questions.¹

¹We have made our code and models available at https://github.com/ucfnlp/summ_qa_rewards

3.1 Representing an Extraction Unit

How best to decompose a source document into a set of text units useful for extractive summarization remains an open problem. A natural choice is to use words as extraction units. However, this choice ignores the cohesiveness of text. A text chunk (e.g., a prepositional phrase) can be either selected for the summary in its entirety or not at all. In this paper we experiment with both schemes, using either words or chunks as extraction units. When a text chunk is selected for the summary, all its constituent words are selected. We obtain text chunks by breaking down the sentence constituent parse tree in a top-down manner until each tree fragment governs at most 5 words. A chunk thus can contain from 1 to 5 words. Additionally, word-level modeling can be considered a special case of chunks where the length of each phrase is 1. It is important to note that using sentences as extraction units is out of the scope of this paper, because our work focuses on finer-grained extraction units such as words and phrases, which is notably a more challenging task.
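A minimal sketch of this top-down chunking, assuming the parse tree is given as nested lists (a real system would adapt the output of a constituency parser); this illustrates the idea rather than reproducing our released code.

```python
# Top-down decomposition of a constituent parse tree into chunks that
# each govern at most 5 words. Trees are nested lists: [label, child, ...],
# and leaves are plain word strings (an illustrative format).

MAX_CHUNK_LEN = 5

def leaves(tree):
    """Collect the words governed by a (sub)tree."""
    if isinstance(tree, str):           # a leaf is a word
        return [tree]
    words = []
    for child in tree[1:]:              # tree[0] is the constituent label
        words.extend(leaves(child))
    return words

def chunk(tree):
    """Split the tree top-down until every fragment has <= 5 words."""
    words = leaves(tree)
    if isinstance(tree, str) or len(words) <= MAX_CHUNK_LEN:
        return [words]                  # small enough: keep as one chunk
    chunks = []
    for child in tree[1:]:              # otherwise recurse into children
        chunks.extend(chunk(child))
    return chunks

tree = ["S",
        ["NP", "the", "former", "TSA", "agent"],
        ["VP", "was", ["VP", "sentenced", ["PP", "to", ["NP", "six", "months"]]]]]
print(chunk(tree))
# [['the', 'former', 'TSA', 'agent'], ['was', 'sentenced', 'to', 'six', 'months']]
```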

The most successful neural models for encoding a piece of text to a fixed-length vector include the recurrent (Hochreiter and Schmidhuber, 1997) and convolutional neural networks (CNN; Kim et al., 2014), among others. A recent study by Khandelwal et al. (2018) reported that recurrent networks are capable of memorizing a recent context of about 20 tokens and the model is highly sensitive to word order, whereas this is less the case for CNNs, whose max-pooling operation makes them agnostic to word order. We implement both networks and are curious to compare their effectiveness at encoding extraction units for summarization.

$\{h^e_t\} = f^{\text{Bi-LSTM}}_1(x) \quad (1)$

or $\{h^e_t\} = f^{\text{CNN}}_2(x) \quad (2)$

Our model first encodes the source document using a bidirectional LSTM with forward and backward passes (Eq. (1)). The representation of the t-th source word, $h^e_t = [\overleftarrow{h}^e_t \,\|\, \overrightarrow{h}^e_t]$, is the concatenation of the hidden states in both directions. A chunk is similarly denoted by $h^e_t = [\overleftarrow{h}^e_t \,\|\, \overrightarrow{h}^e_{t+n}]$, where $t$ and $t+n$ are the indices of its beginning and ending words. In both cases, a fixed-length vector ($h^e_t \in \mathbb{R}^m$) is created for the word/chunk. Further, our CNN encoder (Eq. (2)) uses a sliding window of {1, 3, 5, 7} words, corresponding to the kernel sizes, to scan through the source document. We apply a number of filters to each window size


to extract local features. The t-th source word is represented by the concatenation of feature maps (an m-dimensional vector). To obtain the chunk vector we perform max-pooling over the representations of its constituent words (from t to t+n). In the following we use $h^e_t$ to denote the vector representation of the t-th extraction unit, be it a word or a chunk, generated using either encoder.
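As a small illustration of the pooling just described, the sketch below derives a chunk vector from per-word representations; h_words is a placeholder for the concatenated CNN feature maps.

```python
import numpy as np

def chunk_vector(h_words, t, n):
    """Elementwise max-pool the word vectors with indices t .. t+n."""
    return h_words[t:t + n + 1].max(axis=0)

h_words = np.random.default_rng(0).normal(size=(50, 512))  # placeholder maps
h_chunk = chunk_vector(h_words, t=10, n=3)                 # 4-word chunk -> (512,)
```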

3.2 Constructing an Extractive Summary

It is desirable to first develop a supervised framework for identifying summary-worthy text segments from a source article. These segments collectively form an extractive summary to be highlighted on the source text. The task can be formulated as a sequence labeling problem: a source text unit (a word or chunk) is labelled 1 if it is to be included in the summary and 0 otherwise. It is not unusual to develop an auto-regressive model to perform sequence labeling, where the label of the t-th extraction unit ($y_t$) depends on all previous labels ($y_{<t}$). Given this hypothesis, we build a framework to extract summary units where the importance of the t-th source unit is characterized by its informativeness (encoded in $h^e_t$), its position in the document, and its relationship with the partial summary. The details are presented below.

We use a positional embedding ($g_t$) to signify the position of the t-th text unit in the source document. The position corresponds to the index of the source sentence containing the t-th unit; all text units belonging to the same sentence share the same positional embedding. We apply sinusoidal initialization to the embeddings, following Vaswani et al. (2017). Importantly, positional embeddings allow us to inject macro-positional knowledge about words/chunks into a neural summarization framework to offset the natural bias that humans tend to have toward putting important content at the beginning of an article.
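A minimal sketch of these sentence-level positional embeddings, assuming the standard sinusoidal construction of Vaswani et al. (2017); the sentence-index array is hypothetical.

```python
import numpy as np

def sinusoidal_embeddings(num_positions, dim=30):
    """Sinusoidal position encodings (Vaswani et al., 2017)."""
    pe = np.zeros((num_positions, dim))
    pos = np.arange(num_positions)[:, None]
    div = np.exp(np.arange(0, dim, 2) * -(np.log(10000.0) / dim))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe

# Every text unit inherits the embedding of the sentence containing it.
sent_index_of_unit = [0, 0, 0, 1, 1, 2]      # hypothetical: unit t -> sentence id
pe = sinusoidal_embeddings(num_positions=40, dim=30)
g = pe[sent_index_of_unit]                   # g_t for each of the six units
```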

Next, we build a representation for the partial summary to aid the system in selecting future text units. The representation $s_t$ is expected to encode the extraction decisions up to time t-1, and it can be realized using a unidirectional LSTM network (Eq. (3)). The t-th input to the network is represented as $y_{t-1} \otimes h^e_{t-1}$, where $y_{t-1}$ is a binary label serving as a gating mechanism to control whether the semantic content of the previous text unit ($h^e_{t-1}$) is to be included in the summary ("⊗" denotes elementwise product). During training, we apply teacher forcing and $y_{t-1}$ is the ground-truth extraction label for the (t-1)-th unit; at test time,

Figure 1: A unidirectional LSTM (blue, Eq. (3)) encodes the partial summary, while the multilayer perceptron network (orange, Eqs. (4-5)) utilizes the text unit representation ($h^e_t$), its positional embedding ($g_t$), and the partial summary representation ($s_t$) to determine if the t-th text unit is to be included in the summary. Best viewed in color.

$y_{t-1}$ is generated on-the-fly by obtaining the label yielding the highest probability according to Eq. (5). In the previous work of Cheng and Lapata (2016) and Nallapati et al. (2017), similar auto-regressive models are developed to identify summary sentences. Different from the previous work, this study focuses on extracting consecutive sequences of words and chunks from the source document, and the partial summary representation is particularly useful for predicting if the next unit is to be included in the summary to improve summary fluency.

$s_t = f^{\text{Uni-LSTM}}_3(s_{t-1}, y_{t-1} \otimes h^e_{t-1}) \quad (3)$

Given the partial summary representation ($s_t$), the representation of the text unit ($h^e_t$), and its positional encoding ($g_t$), we employ a multilayer perceptron to predict how likely the unit is to be included in the summary. This process is described by Eqs. (4-5) and further illustrated in Figure 1.

$a_t = f^{\text{ReLU}}(W^a [h^e_t; g_t; s_t] + b^a) \quad (4)$

$p(y_t \mid y_{<t}, x) = \sigma(w^y a_t + b^y) \quad (5)$
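To show how Eqs. (3)-(5) interact at test time, the following NumPy sketch runs the greedy extraction loop; the uni-LSTM is collapsed into a single tanh update and all weights are random placeholders, so this is schematic rather than the trained model.

```python
import numpy as np

m, d_s, d_g = 512, 128, 30                    # unit, state, position dims
rng = np.random.default_rng(0)
W_s = rng.normal(size=(d_s, d_s + m)) * 0.01  # stands in for the uni-LSTM
W_a = rng.normal(size=(d_s, m + d_g + d_s)) * 0.01
b_a = np.zeros(d_s)
w_y = rng.normal(size=d_s) * 0.01
b_y = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract(h, g):
    """h: (T, m) extraction-unit vectors; g: (T, d_g) positional embeddings."""
    s, y = np.zeros(d_s), []
    y_prev, h_prev = 0.0, np.zeros(m)
    for t in range(len(h)):
        s = np.tanh(W_s @ np.concatenate([s, y_prev * h_prev]))           # Eq. (3)
        a = np.maximum(0.0, W_a @ np.concatenate([h[t], g[t], s]) + b_a)  # Eq. (4)
        p = sigmoid(w_y @ a + b_y)                                        # Eq. (5)
        y_t = int(p > 0.5)       # greedy at test time; sampled during RL training
        y.append(y_t)
        y_prev, h_prev = float(y_t), h[t]
    return y

labels = extract(rng.normal(size=(10, m)), rng.normal(size=(10, d_g)))
```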

Our model parameters include $\{W^a, b^a, w^y, b^y\}$ along with those required by $f^{\text{Bi-LSTM}}_1$, $f^{\text{CNN}}_2$, and $f^{\text{Uni-LSTM}}_3$. It is possible to train this model in a fully supervised fashion by minimizing the negative log-likelihood of the training data. We generate ground-truth labels for source text units as follows. A source word receives a label of 1 if both it and its adjacent word appear in the human abstract (excluding cases where both words are stopwords). This heuristic aims to label consecutive source words (2 or more) as summary-worthy, as opposed to picking single words, which can be less informative. A source text chunk receives a label of 1 if one of its component words is labelled 1 in the above process.
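The labeling heuristic itself takes only a few lines; in the sketch below the stopword list and example strings are illustrative only.

```python
# Label a source word 1 when it and its neighbor both appear in the
# abstract (and are not both stopwords); label a chunk 1 if any of its
# words is labelled 1.

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and"}  # tiny, illustrative

def word_labels(source, abstract_vocab):
    labels = [0] * len(source)
    for t in range(len(source) - 1):
        w, w_next = source[t], source[t + 1]
        both_stop = w in STOPWORDS and w_next in STOPWORDS
        if w in abstract_vocab and w_next in abstract_vocab and not both_stop:
            labels[t] = labels[t + 1] = 1
    return labels

def chunk_labels(chunks, labels):
    """chunks: list of (start, end) word-index spans, end exclusive."""
    return [int(any(labels[s:e])) for (s, e) in chunks]

src = "police found more than 90 videos and 1,500 photos".split()
abs_vocab = set("they found 90 videos and 1,500 photos of the victim".split())
print(word_labels(src, abs_vocab))  # consecutive matches such as "90 videos" get 1
```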


Because human abstracts are often short and contain novel words not present in source documents, they can be suboptimal for generating ground-truth labels for extraction units. Only a small portion of the source words (about 8% in our dataset) are labelled as positive, whereas the vast majority are negative. Such labels can be ineffective in providing supervision. In the following section, we investigate a new learning paradigm, which encourages extractive summaries to contain informative content useful for answering important questions, while question-answer pairs can be automatically derived from human abstracts.

3.3 Using Summaries to Answer Questions

Our hypothesis is that high-quality summaries should contain informative content, making them appropriate to serve as document surrogates to satisfy users' information needs. We train the extractive summarizer to identify source text units necessary for answering questions, where the question-answer (QA) pairs can be conveniently developed from human abstracts.

To obtain QA pairs, we set an answer token to be either a salient word or a named entity, to limit the space of potential answers. For any sentence in the human abstract, we identify an answer token from it, then replace the answer token with a blank to create a Cloze-style question-answer pair (see Table 1). When a sentence contains multiple answer tokens, a set of QA pairs can be obtained from it. It is important to note that at least one QA pair should be extracted from each sentence of the abstract. Because a system summary is trained to contain content useful for answering all questions (≈ human abstract), any missing QA pair is likely to cause the summary to be insufficient.

We collect answer tokens using the following methods: (a) we extract a set of entities with tags {PER, LOC, ORG, MISC} from each sentence using the Stanford CoreNLP toolkit (Manning et al., 2014); (b) we also identify the ROOT word of each sentence's dependency parse tree along with the sentence's subject/object word, whose type is {NSUBJ, CSUBJ, OBJ, IOBJ} (if present), then add them to the collection of answer tokens. Further, we prune the answer space by excluding those which appear fewer than 5 times overall. Having several methods for question construction allows us to explore the answer space properly. In the results section we perform experiments on root, subject/object, and named entities to see which model provides the best extraction guide.
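A sketch of the Cloze-pair construction is given below; in our pipeline the answer spans come from CoreNLP entity and dependency annotations, whereas here they are supplied by hand for illustration.

```python
from collections import Counter

def make_qa_pairs(sentence_tokens, answer_spans):
    """Replace each candidate answer span with a blank, one pair per span."""
    pairs = []
    for (start, end) in answer_spans:                   # end exclusive
        answer = " ".join(sentence_tokens[start:end])
        question = sentence_tokens[:start] + ["_____"] + sentence_tokens[end:]
        pairs.append((" ".join(question), answer))
    return pairs

sent = "Former TSA agent Daniel Boykin videotaped his co-worker".split()
print(make_qa_pairs(sent, [(3, 5)]))                    # (3, 5) = "Daniel Boykin"
# [('Former TSA agent _____ videotaped his co-worker', 'Daniel Boykin')]

def prune_rare_answers(pairs, min_count=5):
    """Drop pairs with rare answers; counts would be corpus-wide in practice."""
    counts = Counter(answer for _, answer in pairs)
    return [(q, a) for (q, a) in pairs if counts[a] >= min_count]
```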

Given an extractive summary S containing a set of source text units, and a collection of question-answer pairs $P = \{(Q_k, e^*_k)\}_{k=1}^{K}$ related to the source document, we want to develop a mechanism leveraging the extractive summary to answer these questions. We first encode each question $Q_k$ into a vector representation ($q_k$). This is achieved by concatenating the last hidden states of the forward/backward passes of a bidirectional LSTM (Eq. (6)). Next, we exploit the attention mechanism to locate summary parts that are relevant to answering the k-th question. Given the attention mechanism, an extractive summary S can be used to answer multiple questions related to the document. We define $\alpha_{t,k}$ to be the semantic relatedness between the t-th source text unit and the k-th question. Following Chen et al. (2016a), we introduce a bilinear term to characterize their relationship ($\alpha_{t,k} \propto h^e_t W^\alpha q_k$; see Eq. (7)). In this process, we consider only those source text units selected in summary S. Using $\alpha_{t,k}$ as weights, we then compute a context vector $c_k$ condensing summary content related to the k-th question (Eq. (8)).

$q_k = f^{\text{Bi-LSTM}}_4(Q_k) \quad (6)$

$\alpha_{t,k} = \dfrac{\exp(h^e_t W^\alpha q_k)}{\sum_t \exp(h^e_t W^\alpha q_k)} \quad (7)$

$c_k = \sum_t \alpha_{t,k} h^e_t \quad (8)$

$u_k = [c_k; q_k; |c_k - q_k|; c_k \otimes q_k] \quad (9)$

To predict the most probable answer, we construct a fully-connected network as the output layer. The input to the network is a concatenation of the context vector ($c_k$), question vector ($q_k$), absolute difference ($|c_k - q_k|$) and element-wise product ($c_k \otimes q_k$) of the two vectors (Eq. (9)). A softmax function is used to estimate a probability distribution over the space of candidate answers:

$P(e_k \mid S, Q_k) = \mathrm{softmax}(W^e f^{\text{ReLU}}(W^u u_k + b^u))$

Such a fully-connected output layer has achieved success on natural language inference (Mou et al., 2016; Chen et al., 2018); here we test its efficacy on answer selection. The model parameters include $\{W^\alpha, W^e, W^u, b^u\}$ and those of $f^{\text{Bi-LSTM}}_4$.
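The answer-selection computation (Eqs. (6)-(9) plus the softmax layer) can be sketched as follows; the question encoder is abstracted away and all weights are random stand-ins for trained parameters.

```python
import numpy as np

m, n_answers = 512, 1000
rng = np.random.default_rng(0)
W_alpha = rng.normal(size=(m, m)) * 0.01
W_u = rng.normal(size=(m, 4 * m)) * 0.01
b_u = np.zeros(m)
W_e = rng.normal(size=(n_answers, m)) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def answer_distribution(h_summary, q_k):
    """h_summary: (T, m) vectors of units selected in the summary; q_k: (m,)."""
    alpha = softmax(h_summary @ W_alpha @ q_k)          # bilinear term, Eq. (7)
    c_k = alpha @ h_summary                             # context vector, Eq. (8)
    u_k = np.concatenate([c_k, q_k, np.abs(c_k - q_k), c_k * q_k])  # Eq. (9)
    hidden = np.maximum(0.0, W_u @ u_k + b_u)
    return softmax(W_e @ hidden)                        # P(e_k | S, Q_k)

p = answer_distribution(rng.normal(size=(12, m)), rng.normal(size=m))
print(p.argmax(), round(p.sum(), 6))                    # predicted answer id, 1.0
```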

3.4 A Reinforcement Learning Framework

In this section we introduce a reinforcement learning framework to explore the space of possible extractive summaries and present a novel reward function to promote summaries that are adequate,


fluent, restricted in length, and competent in question answering. Our reward function consists of four components, whose interpolation weights γ, α, and β are tuned on the dev set.

$R(y) = R_c(y) + \gamma R_a(y) + \alpha R_f(y) + \beta R_l(y)$

We define QA competency (Eq. (10)) as the average log-likelihood of correctly answering questions using the system summary (y). A high-quality system summary is expected to resemble the reference summary by using similar wording. The adequacy metric (Eq. (11)) measures the percentage of overlapping unigrams between the system summary (y) and the reference summary (y*). The fluency criterion (Eq. (12)) encourages consecutive sequences of source words to be selected by penalizing frequent 0/1 switches in the label sequence (i.e., $|y_t - y_{t-1}|$). Finally, we limit the summary size by constraining the ratio of selected words to be close to a threshold δ (Eq. (13)).

QA: $R_c(y) = \dfrac{1}{K} \sum_{k=1}^{K} \log P(e^*_k \mid y, Q_k) \quad (10)$

Adequacy: $R_a(y) = \dfrac{1}{|y^*|} U(y, y^*) \quad (11)$

Fluency: $R_f(y) = -\sum_{t=2}^{|y|} |y_t - y_{t-1}| \quad (12)$

Length: $R_l(y) = -\left| \dfrac{1}{|y|} \sum_t y_t - \delta \right| \quad (13)$
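The four reward terms are cheap to compute once a label sequence is sampled; the sketch below evaluates Eqs. (10)-(13) on a toy sequence, with the QA log-likelihoods passed in as placeholders.

```python
import numpy as np

def reward(y, y_star_len, n_overlap_unigrams, log_p_answers,
           gamma=1.0, alpha=1.0, beta=2.0, delta=0.15):
    """Toy weights; in the paper beta = 2*alpha, and alpha, gamma are tuned."""
    y = np.asarray(y, dtype=float)
    r_c = float(np.mean(log_p_answers))          # QA competency, Eq. (10)
    r_a = n_overlap_unigrams / y_star_len        # adequacy, Eq. (11)
    r_f = -np.abs(np.diff(y)).sum()              # fluency, Eq. (12)
    r_l = -abs(y.mean() - delta)                 # length, Eq. (13)
    return r_c + gamma * r_a + alpha * r_f + beta * r_l

y = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]               # sampled extraction labels
print(reward(y, y_star_len=8, n_overlap_unigrams=3, log_p_answers=[-0.4, -1.1]))
```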

The reward function R(y) combines intrinsic measures of summary fluency and adequacy (Goldstein et al., 2005) with an extrinsic measure of summary responsiveness to given questions (Dang, 2006; Murray et al., 2008). A reinforcement learning agent finds a policy P(y|x) to maximize the expected reward $\mathbb{E}_{P(y|x)}[R(y)]$. Training the system with policy gradient (Eq. (14)) involves repeatedly sampling an extractive summary y from the source document x. At time t, the agent takes an action by sampling a decision based on $p(y_t \mid y_{<t}, x)$ (Eq. (5)), indicating whether the t-th source text unit is to be included in the summary. Once the full summary sequence y is generated, it is compared to the ground-truth sequence to compute the reward R(y). In this way, reinforcement learning explores the space of extractive summaries and promotes those yielding high rewards. At inference time, rather than sampling actions from $p(y_t \mid y_{<t}, x)$, we choose the $y_t$ that yields the highest probability to generate the system summary y. This process is deterministic and no QA is required.

$\nabla_\theta \mathbb{E}_{P(y|x)}[R(y)] = \mathbb{E}_{P(y|x)}[R(y) \nabla_\theta \log P(y|x)] \approx \dfrac{1}{N} \sum_{n=1}^{N} R(y^{(n)}) \nabla_\theta \log P(y^{(n)}|x) \quad (14)$
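As a simplified illustration of Eq. (14), the sketch below forms the Monte-Carlo policy-gradient estimate for a policy that, unlike our auto-regressive extractor, treats each label as an independent Bernoulli; this choice keeps the gradient of log P(y|x) analytic (it equals y - p with respect to the logits).

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, delta = 20, 8, 0.15
z = rng.normal(size=T)                         # placeholder logits
p = 1.0 / (1.0 + np.exp(-z))                   # p(y_t = 1), independent Bernoulli

def toy_reward(y):                             # length term only, for brevity
    return -abs(y.mean() - delta)

grad = np.zeros(T)
for _ in range(N):
    y = (rng.uniform(size=T) < p).astype(float)    # sample a summary y^(n)
    grad += toy_reward(y) * (y - p)                # R(y) * grad_z log P(y|x)
grad /= N                                          # Eq. (14), Monte-Carlo estimate
```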

4 Experiments

We proceed by discussing the dataset and settings, comparison systems, and experimental results obtained through both automatic metrics and human evaluation in a reading comprehension setting.

4.1 Dataset and Settings

Our goal is to build an extractive summarizer identifying important textual segments from source articles. To investigate the effectiveness of the proposed approach, we conduct experiments on the CNN/Daily Mail dataset, using the version provided by See et al. (2017). The reference summaries of this dataset were created by human editors and exhibit a moderate degree of extractiveness: e.g., 83% of summary unigrams and 45% of bigrams appear in source articles (Narayan et al., 2018a). On average, a CNN article contains 761 words / 34 sentences and a DM article contains 653 words / 29 sentences. We report results separately for the CNN and DM portions of the dataset.

Our hyperparameter settings are as follows. We set the hidden state dimension of the LSTM to be 256 in either direction. A bidirectional LSTM $f^{\text{Bi-LSTM}}_1(\cdot)$ produces a 512-dimensional vector for each content word. Similarly, $f^{\text{Bi-LSTM}}_4(\cdot)$ generates a question vector $q_k$ of the same size. Our CNN encoder $f^{\text{CNN}}_2(\cdot)$ uses multiple window sizes of {1, 3, 5, 7} and 128 filters per window size. $h^e_t$ is thus a 512-dimensional vector using either the CNN or LSTM encoder. We set the hidden state dimension of $s_t$ to be 128. We also use 100-dimensional word embeddings (Pennington et al., 2014) and sinusoidal positional encodings (Vaswani et al., 2017) of 30 dimensions.

The maximum article length is set to 400 words. Compared to the study of Arumae and Liu (2018), we expand the search space dramatically from 100 to 400 words, which poses a challenge to RL-based summarizers. We associate each article with at most 10 QA pairs (K = 10) and use them to guide the extraction of summary segments. We apply mini-batch training with the Adam optimizer (Kingma and Ba, 2014), where a mini-batch contains 128


CNN
System             #Ans.    R-1      R-2      R-L
Lead-3             –        28.8     11.5     19.3
PointerGen+Cov.    –        29.9     10.9     21.1
Graph Attn.        –        30.3      9.8     20.0
LexRank            –        26.1      9.6     17.7
SumBasic           –        22.9      5.5     14.8
KLSum              –        20.7      5.9     13.7
Distraction-M3     –        27.1      8.2     18.7
QASumm+NoQ         0        16.38     7.25    11.30
QASumm+SUBJ/OBJ    9,893    26.16     8.97    18.24
QASumm+ROOT        3,678    26.67     9.19    18.76
QASumm+NER         6,167    27.38     9.38    19.02

Table 2: Summarization results on the CNN test set. Summaries are evaluated at their full length by ROUGE F1-scores.

Daily Mail
System             #Ans.     R-1      R-2      R-L
Lead-3             –         22.5      9.3     20.0
PointerGen+Cov.    –         31.2     17.0     28.9
Graph Attn.        –         27.4     11.3     15.1
NN-WE              –         15.7      6.4      9.8
NN-SE              –         22.7      8.5     12.5
SummaRuNNer        –         26.2     10.8     14.4
QASumm+NoQ         0         22.26     9.16    19.78
QASumm+SUBJ/OBJ    19,151    23.38     9.54    20.14
QASumm+ROOT        5,498     26.87    11.97    23.07
QASumm+NER         15,342    25.74    11.89    22.38

Table 3: Summarization results on the DM test set. To ensure a fair comparison, we follow the convention of reporting ROUGE recall scores evaluated at 75 bytes.

articles and their QA pairs. The summary ratio δ is set to 0.15, yielding extractive summaries of about 60 words. Following Arumae and Liu (2018), we set hyperparameters β = 2α; α and γ are tuned on the dev set using grid search.

4.2 Experimental Results

Comparison systems We compare our method with a number of extractive and abstractive systems that have reported results on the CNN/DM datasets. We consider non-neural approaches that extract sentences from the source article to form a summary. These include LexRank (Radev et al., 2004), SumBasic (Vanderwende et al., 2007), and KLSum (Haghighi and Vanderwende, 2009). Such methods treat sentences as bags of words and then select sentences containing topically important words. We further include the Lead-3 baseline that extracts the first 3 sentences from any given article. This method has been shown to be a strong baseline for summarizing news articles.

Neural extractive approaches focus on learning vector representations for sentences and words, then performing extraction based on the learned representations. Cheng and Lapata (2016) describe a neural network method composed of a hierarchical document encoder and an attention-based extractor. The system has two variants: NN-WE extracts words from the source article and NN-SE extracts sentences. SummaRuNNer (Nallapati et al., 2017) presents an autoregressive sequence labeling method based on recurrent neural networks. It selects summary sentences based on their content, salience, position, and novelty representations.

Abstractive summarization methods are not directly comparable to our approach, but we choose to include three systems that report results respectively on the CNN and DM datasets. Distraction-M3 (Chen et al., 2016b) trains the summarization system to distract its attention to traverse different regions of the source article. Graph attention (Tan et al., 2017) introduces a graph-based attention mechanism to enhance the encoder-decoder framework. PointerGen+Cov. (See et al., 2017) allows the system to not only copy words from the source text but also generate summary words by selecting them from a vocabulary. Abstractive methods can thus introduce new words to the summary that are not present in the source article. However, system summaries may change the meaning of the original texts due to this flexibility.

Results We present summarization results of various systems in Tables 2 and 3, evaluated on the standard CNN/DM test sets by the R-1, R-2, and R-L metrics (Lin, 2004), which respectively measure the overlap of unigrams, bigrams, and longest common subsequences between system and reference summaries. We investigate four variants of our method: QASumm+NoQ does not utilize any question-answer pairs during training. It extracts summary text chunks by learning from ground-truth labels (§3.2) and the chunks are encoded by $f^{\text{Bi-LSTM}}_1$. The other variants initialize their models using pretrained parameters from QASumm+NoQ, then integrate the reinforcement learning objective (§3.4) to explore the space of possible extractive summaries and reward those that are useful for answering questions. We consider three types of QA pairs: the answer token is the root of a sentence dependency parse tree (+ROOT), a subject or object (+SUBJ/OBJ), or an entity found in the sentence (+NER). In all cases, the question is generated by replacing the answer token with a blank symbol.

As illustrated in Tables 2 and 3, our QASumm methods with reinforcement learning (+ROOT,


            NoText              QASumm+NoQ          GoldSumm            FullText
            Train  Dev   Gap    Train  Dev   Gap    Train  Dev   Gap    Train  Dev   Gap
SUBJ/OBJ    49.7   24.4  25.3   55.9   31.2  24.7   69.3   48.6  20.7   67.6   43.3  24.3
ROOT        68.1   34.9  33.2   71.6   36.3  35.3   76.9   44.9  32.0   76.0   35.7  40.3
NER         61.0   15.8  45.2   66.0   32.7  33.3   85.2   54.0  31.2   82.4   46.3  36.1

Table 4: Question-answering accuracies using different types of QA pairs (ROOT, SUBJ/OBJ, NER) and different source input (NoText, QASumm+NoQ, GoldSumm, and FullText) as the basis for predicting answers.

+SUBJ/OBJ, +NER) perform competitively with strong baselines. They outperform the counterpart QASumm+NoQ, which makes no use of the QA pairs, by a substantial margin. They outperform or perform at a comparable level to state-of-the-art published systems on the CNN/DM datasets but are generally inferior to PointerGen. We observe that extracting summary chunks is highly desirable in real-world applications as it provides a mechanism to generate concise summaries. Nonetheless, accurately identifying summary chunks is challenging because the search space is vast and spuriousness arises in chunking sentences. Cheng and Lapata (2016) report a substantial performance drop when adapting their system to extract words. Our QASumm methods focusing on chunk extraction perform on par with competitive systems that extract whole sentences. We additionally present human evaluation results of summary usefulness for a reading comprehension task in §4.3.

In Tables 2 and 3, we further show the number of unique answers per QA type. We find that the ROOT-type QA pairs have the fewest unique answers. They are often the main verbs of sentences. In contrast, the SUBJ/OBJ-type has the most answers. They are subjects and objects of sentences and correspond to an open class of content words. The NER-type has a moderate number of answers compared to the others. Note that all answer tokens have been filtered by frequency; those appearing fewer than 5 times in the dataset are removed to avoid overfitting.

Among variants of the QASumm method, we find that QASumm+ROOT achieves the highest scores on the DM dataset. QASumm+NER performs consistently well on both CNN and DM datasets, suggesting QA pairs of this type are effective in guiding the system to extract summary chunks. We conjecture that maintaining a moderate number of answers is important to maximize performance. To answer questions with missing entities, the summary is encouraged to contain similar content as the question body. Because questions are derived from the human abstract, this in turn requires the system summary to carry similar semantic content as the human abstract.

Question-answering accuracy We next dive into the QA component of our system to investigate question-answering performance when different types of summaries and QA pairs are supplied to the system (§3.3). Given a question, the system predicts an answer using an extractive summary as the source input. Intuitively, an informative summary can lead to high QA accuracy, as the summary content serves well as the basis for predicting answers. With the same summary as input, certain types of questions can be more difficult to answer than others, and the system must rely heavily on the summary to gauge correct answers.

We compare various types of summaries. These include (a) QASumm+NoQ, which extracts summary chunks without requiring QA pairs; and (b) GoldSumm, gold-standard extractive summaries generated by collecting source words appearing in human summaries. We further consider NoText and FullText, corresponding to using no source text or the full source article as input. They represent the two extremes. In all cases the QA component (§3.3) is trained on the training set and we report QA accuracies on the dev set.

In Table 4, we observe that question-answering with GoldSumm performs the best for all QA types. It outperforms the scenarios using FullText as the source input. This indicates that the distilled information contained in a high-quality summary can be useful for answering questions, as searching for answers in a succinct summary can be more efficient than in a full article. Moreover, we observe that the performance of QASumm+NoQ is in between NoText and GoldSumm for all answer types. The results suggest that extractive summaries with even modest ROUGE scores can prove useful for question-answering. Regarding different types of QA pairs, we find that the ROOT-type can achieve high QA accuracy when using NoText input. It suggests that ROOT answers can to some extent be predicted based on the question context. The NER-type QA


Figure 2: Summarization results using the $f^{\text{LSTM}}_1$ or $f^{\text{CNN}}_2$ encoder with word/chunk as the extraction unit.

pairs work the best for both GoldSumm and FullText, likely because the source texts contain the entities required to correctly answer those questions. We also find that the SUBJ/OBJ-type QA pairs have the smallest gap between train/dev accuracies, despite having a large answer space. Based on this analysis we suggest that future work consider using NER-based QA pairs, as they encourage the summaries to contain salient source content and be informative.

Extraction units We finally compare the performance of using either words or chunks as extraction units (§3.1). The chunks are obtained by breaking down sentence constituent parse trees in a top-down manner until all tree fragments contain 5 words or less. We observe that 70% of the chunks are 1-grams, and 2/3/4/5-grams account for 9%, 7%, 6%, and 8% respectively. We compare the bidirectional LSTM ($f^{\text{LSTM}}_1$) and CNN ($f^{\text{CNN}}_2$) encoders for their effectiveness at generating representations for extraction units. Figure 2 presents the results of the QASumm+NoQ system under various settings. We find that extracting chunks performs better than extracting words, and combining chunks with LSTM representations yields the highest scores.

4.3 Human Evaluation

The usefulness of an extractive system driven by reading comprehension is not inherently measured by automatic metrics (e.g., ROUGE). We therefore conducted a human evaluation to assess whether the highlighted summaries contribute to document understanding. Similar to our training paradigm, we presented each participant with the document and three fill-in-the-blank questions created from the human abstracts. It was guaranteed that each question came from a unique human abstract to avoid seeing the answer adjacent to the same template. The missing section was randomly generated to be either the root word, the subject or

Summary         Time      Accuracy    Inform.
Human           69.5s     87.3        4.23
QASumm+NoQ      87.9s     53.7        1.76
Pointer+Cov.    100.9s    52.3        2.14
QASumm+NER      95.0s     62.3        2.14

Table 5: Amazon Mechanical Turk experiments. Human abstracts were the gold-standard summaries; Pointer+Cov. summaries were generated by See et al. (2017). Our systems tested were the supervised extractor and our full model (NER).

object of the sentence, or a named entity. We compare our reinforced extractive summary (presented as a bold overlay on the document) against our supervised method (Section 3.2), abstractive summaries generated by See et al. (2017), and the human abstracts in full. Additionally, we asked the participants to rate the quality of the summary presented (1-5, with 5 being most informative). We used Amazon Mechanical Turk and conducted an experiment on 80 documents sampled from the CNN test set. The articles were evenly split across the four competing systems, and each HIT was completed by 5 turkers. Upon completion the data was analyzed manually for accuracy, since turkers entered each answer as free text, and to remove any meaningless datapoints.

Table 5 shows the average time (in seconds) to complete a single question, the overall accuracy of the participants, and the informativeness of each summary type. Excluding the use of human abstracts, all systems resulted in similar completion times. However, we observe a large margin in QA accuracy for our full system compared to the abstractive system and our supervised approach. Although participants rated the informativeness of our full system and the abstractive summaries the same, ours yielded higher QA accuracy. This strongly indicates that a system that makes use of document comprehension has a tangible effect when applied to a real-world task.

5 Conclusion

We presented an extractive summarization framework that uses deep reinforcement learning to identify consecutive word sequences from a document to form an extractive summary. Our reward function promotes adequate and fluent summaries that can serve as document surrogates to answer important questions, directly addressing users' information needs. Experimental results on benchmark datasets demonstrated the efficacy of our proposed method over state-of-the-art baselines, as assessed by both automatic metrics and human evaluators.


References

Kristjan Arumae and Fei Liu. 2018. Reinforced extractive summarization with question-focused rewards. In Proceedings of the ACL Student Research Workshop.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2017. Improving multi-document summarization via text classification. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016a. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018. Neural natural language inference models enhanced with external knowledge. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. 2016b. Distraction-based neural networks for document summarization. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI).

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of ACL.

Hoa Trang Dang. 2006. DUC 2005: Evaluation of question-focused summarization systems. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the NAACL Workshop on Integer Linear Programming for Natural Language Processing.

Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss. 2005. Intrinsic and extrinsic evaluation measures for machine translation and/or summarization. In Proceedings of the ACL-05 Workshop.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. NEWSROOM: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Soft, layer-specific multi-task summarization with entailment and question generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Neural Information Processing Systems (NIPS).

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Kai Hong, John M. Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A repository of state of the art and competitive baseline summaries for generic news summarization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).

Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, and Ichiro Sakata. 2017. Extractive summarization using multi-task learning with document classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chris Kedzie, Kathleen McKeown, and Hal Daume III. 2018. Content selection in deep learning models of summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Gunhee Kim, Leonid Sigal, and Eric P. Xing. 2014. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In Proceedings of CVPR.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence.

Logan Lebanoff, Kaiqiang Song, and Fei Liu. 2018. Adapting the neural encoder-decoder framework from single to multi-document summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document summarization via guided sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chen Li, Yang Liu, Fei Liu, Lin Zhao, and Fuliang Weng. 2014. Improving multi-document summarization by sentence compression based on expanded constituent parse tree. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Proceedings of NAACL.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Association for Computational Linguistics (ACL) System Demonstrations.

Andre F. T. Martins and Noah A. Smith. 2009. Summarization with a joint model for sentence extraction and compression. In Proceedings of the ACL Workshop on Integer Linear Programming for Natural Language Processing.

Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Gabriel Murray, Thomas Kleinbauer, Peter Poller, Steve Renals, Jonathan Kilgour, and Tilman Becker. 2008. Extrinsic summarization evaluation: A decision audit task. Machine Learning for Multimodal Interaction.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI).

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of SIGNLL.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018a. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Dragomir R. Radev, Hongyan Jing, Małgorzata Stys, and Daniel Tam. 2004. Centroid-based summarization of multiple documents. Information Processing and Management, 40(6):919-938.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for sentence summarization. In Proceedings of EMNLP.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Kaiqiang Song, Lin Zhao, and Fei Liu. 2018. Structure-infused copy mechanisms for abstractive summarization. In Proceedings of the International Conference on Computational Linguistics (COLING).

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Sansiri Tarnpradab, Fei Liu, and Kien A. Hua. 2017. Toward extractive summarization of online forum discussions via hierarchical attention networks. In Proceedings of the 30th Florida Artificial Intelligence Research Society Conference (FLAIRS).

Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing and Management, 43(6):1606-1618.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS).

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In Proceedings of ACL.

Kristian Woodsend and Mirella Lapata. 2010. Automatic generation of story highlights. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL).

Dani Yogatama, Fei Liu, and Noah A. Smith. 2015. Extractive summarization by maximizing semantic volume. In Proceedings of EMNLP.

David Zajic, Bonnie J. Dorr, Jimmy Lin, and Richard Schwartz. 2007. Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing and Management.

Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).