
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081, Florence, Italy, July 28 - August 2, 2019. ©2019 Association for Computational Linguistics


    Hierarchical Transformers for Multi-Document Summarization

Yang Liu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
[email protected], [email protected]

    Abstract

In this paper, we develop a neural summarization model which can effectively process multiple input documents and distill abstractive summaries. Our model augments a previously proposed Transformer architecture (Liu et al., 2018) with the ability to encode documents in a hierarchical manner. We represent cross-document relationships via an attention mechanism which allows information to be shared, as opposed to simply concatenating text spans and processing them as a flat sequence. Our model learns latent dependencies among textual units, but can also take advantage of explicit graph representations focusing on similarity or discourse relations. Empirical results on the WikiSum dataset demonstrate that the proposed architecture brings substantial improvements over several strong baselines.1

    1 Introduction

Automatic summarization has enjoyed renewed interest in recent years, thanks to the popularity of neural network models and their ability to learn continuous representations without recourse to preprocessing tools or linguistic annotations. The availability of large-scale datasets (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018) containing hundreds of thousands of document-summary pairs has driven the development of neural architectures for summarizing single documents. Several approaches have shown promising results with sequence-to-sequence models that encode a source document and then decode it into an abstractive summary (See et al., 2017; Celikyilmaz et al., 2018; Paulus et al., 2018; Gehrmann et al., 2018).

Multi-document summarization — the task of producing summaries from clusters of thematically related documents — has received significantly less attention, partly due to the paucity of suitable data for the application of learning methods. High-quality multi-document summarization datasets (i.e., document clusters paired with multiple reference summaries written by humans) have been produced for the Document Understanding and Text Analysis Conferences (DUC and TAC), but are relatively small (in the range of a few hundred examples) for training neural models. In an attempt to drive research further, Liu et al. (2018) tap into the potential of Wikipedia and propose a methodology for creating a large-scale dataset (WikiSum) for multi-document summarization with hundreds of thousands of instances. Wikipedia articles, specifically lead sections, are viewed as summaries of various topics indicated by their title, e.g., “Florence” or “Natural Language Processing”. Documents cited in the Wikipedia articles or web pages returned by Google (using the section titles as queries) are seen as the source cluster which the lead section purports to summarize.

1 Our code and data are available at https://github.com/nlpyang/hiersumm.

Aside from the difficulties in obtaining training data, a major obstacle to the application of end-to-end models to multi-document summarization is the sheer size and number of source documents, which can be very large. As a result, it is practically infeasible (given memory limitations of current hardware) to train a model which encodes all of them into vectors and subsequently generates a summary from them. Liu et al. (2018) propose a two-stage architecture, where an extractive model first selects a subset of salient passages, and subsequently an abstractive model generates the summary while conditioning on the extracted subset. The selected passages are concatenated into a flat sequence and the Transformer (Vaswani et al., 2017), an architecture well-suited to language modeling over long sequences, is used to decode the summary.

Although the model of Liu et al. (2018) takes an important first step towards abstractive multi-document summarization, it still considers the multiple input documents as a concatenated flat sequence, remaining agnostic of the hierarchical structures and the relations that might exist among documents. For example, different web pages might repeat the same content, include additional content, present contradictory information, or discuss the same fact in a different light (Radev, 2000). The realization that cross-document links are important in isolating salient information, eliminating redundancy, and creating overall coherent summaries has led to the widespread adoption of graph-based models for multi-document summarization (Erkan and Radev, 2004; Christensen et al., 2013; Wan, 2008; Parveen and Strube, 2014). Graphs conveniently capture the relationships between textual units within a document collection and can be easily constructed under the assumption that text spans represent graph nodes and edges are semantic links between them.

In this paper, we develop a neural summarization model which can effectively process multiple input documents and distill abstractive summaries. Our model augments the previously proposed Transformer architecture with the ability to encode multiple documents in a hierarchical manner. We represent cross-document relationships via an attention mechanism which allows information to be shared across multiple documents, as opposed to simply concatenating text spans and feeding them as a flat sequence to the model. In this way, the model automatically learns richer structural dependencies among textual units, thus incorporating well-established insights from earlier work. Advantageously, the proposed architecture can easily benefit from information external to the model, i.e., by replacing inter-document attention with a graph matrix computed on the basis of lexical similarity (Erkan and Radev, 2004) or discourse relations (Christensen et al., 2013).

We evaluate our model on the WikiSum dataset and show experimentally that the proposed architecture brings substantial improvements over several strong baselines. We also find that the addition of a simple ranking module which scores documents based on their usefulness for the target summary can greatly boost the performance of a multi-document summarization system.

    2 Related Work

Most previous multi-document summarization methods are extractive, operating over graph-based representations of sentences or passages. Approaches vary depending on how edge weights are computed, e.g., based on cosine similarity with tf-idf weights for words (Erkan and Radev, 2004) or on discourse relations (Christensen et al., 2013), and on the specific algorithm adopted for ranking text units for inclusion in the final summary. Several variants of the PageRank algorithm have been adopted in the literature (Erkan and Radev, 2004) in order to compute the importance or salience of a passage recursively based on the entire graph. More recently, Yasunaga et al. (2017) propose a neural version of this framework, where salience is estimated using features extracted from sentence embeddings and graph convolutional networks (Kipf and Welling, 2017) applied over the relation graph representing cross-document links.

Abstractive approaches have met with limited success. A few systems generate summaries based on sentence fusion, a technique which identifies fragments conveying common information across documents and combines these into sentences (Barzilay and McKeown, 2005; Filippova and Strube, 2008; Bing et al., 2015). Although neural abstractive models have achieved promising results on single-document summarization (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Celikyilmaz et al., 2018), the extension of sequence-to-sequence architectures to multi-document summarization is less straightforward. Apart from the lack of sufficient training data, neural models also face the computational challenge of processing multiple source documents. Previous solutions include model transfer (Zhang et al., 2018; Lebanoff and Liu, 2018), where a sequence-to-sequence model is pretrained on single-document summarization data and fine-tuned on DUC (multi-document) benchmarks, or unsupervised models relying on reconstruction objectives (Ma et al., 2016; Chu and Liu, 2018).

Liu et al. (2018) propose a methodology for constructing large-scale summarization datasets and a two-stage model which first extracts salient information from source documents and then uses a decoder-only architecture (that can attend to very long sequences) to generate the summary. We follow their setup in viewing multi-document summarization as a supervised machine learning problem and for this purpose assume access to large, labeled datasets (i.e., source document-summary pairs). In contrast to their approach, we use a learning-based ranker and our abstractive model can hierarchically encode the input documents, with the ability to learn latent relations across documents and additionally incorporate information encoded in well-known graph representations.

Figure 1: Pipeline of our multi-document summarization system. $L$ source paragraphs are first ranked and the $L'$-best ones serve as input to an encoder-decoder model which generates the target summary.

    3 Model Description

We follow Liu et al. (2018) in treating the generation of lead Wikipedia sections as a multi-document summarization task. The input to a hypothetical system is the title of a Wikipedia article and a collection of source documents, while the output is the Wikipedia article's first section. Source documents are webpages cited in the References section of the Wikipedia article and the top 10 search results returned by Google (with the title of the article as the query). Since source documents could be relatively long, they are split into multiple paragraphs by line-breaks. More formally, given title $T$ and $L$ input paragraphs $\{P_1, \cdots, P_L\}$ (retrieved from Wikipedia citations and a search engine), the task is to generate the lead section $D$ of the Wikipedia article.

Our summarization system is illustrated in Figure 1. Since the input paragraphs are numerous and possibly lengthy, instead of directly applying an abstractive system, we first rank them and summarize the $L'$-best ones. Our summarizer follows the very successful encoder-decoder architecture (Bahdanau et al., 2015), where the encoder encodes the input text into hidden representations and the decoder generates target summaries based on these representations. In this paper, we focus exclusively on the encoder part of the model; our decoder follows the Transformer architecture introduced in Vaswani et al. (2017) and generates a summary token by token while attending to the source input. We also use beam search and a length penalty (Wu et al., 2016) in the decoding process to generate more fluent and longer summaries.

3.1 Paragraph Ranking

Unlike Liu et al. (2018), who rank paragraphs based on their similarity with the title (using tf-idf-based cosine similarity), we adopt a learning-based approach. A logistic regression model is applied to each paragraph to calculate a score indicating whether it should be selected for summarization. We use two recurrent neural networks with Long Short-Term Memory units (LSTM; Hochreiter and Schmidhuber 1997) to represent title $T$ and source paragraph $P$:

$$\{u_{t1}, \cdots, u_{tm}\} = \mathrm{lstm}_t(\{w_{t1}, \cdots, w_{tm}\}) \quad (1)$$
$$\{u_{p1}, \cdots, u_{pn}\} = \mathrm{lstm}_p(\{w_{p1}, \cdots, w_{pn}\}) \quad (2)$$

where $w_{ti}$, $w_{pj}$ are word embeddings for tokens in $T$ and $P$, and $u_{ti}$, $u_{pj}$ are the updated vectors for each token after applying the LSTMs.

A max-pooling operation is then used over the title vectors to obtain a fixed-length representation $\hat{u}_t$:

$$\hat{u}_t = \mathrm{maxpool}(\{u_{t1}, \cdots, u_{tm}\}) \quad (3)$$

We concatenate $\hat{u}_t$ with the vector $u_{pi}$ of each token in the paragraph and apply a non-linear transformation to extract features for matching the title and the paragraph. A second max-pooling operation yields the final paragraph vector $\hat{p}$:

$$p_i = \tanh(W_1([u_{pi}; \hat{u}_t])) \quad (4)$$
$$\hat{p} = \mathrm{maxpool}(\{p_1, \cdots, p_n\}) \quad (5)$$

Finally, to estimate whether a paragraph should be selected, we use a linear transformation and a sigmoid function:

$$s = \mathrm{sigmoid}(W_2\, \hat{p}) \quad (6)$$

where $s$ is the score indicating whether paragraph $P$ should be used for summarization.

All input paragraphs $\{P_1, \cdots, P_L\}$ receive scores $\{s_1, \cdots, s_L\}$. The model is trained by minimizing the cross entropy loss between $s_i$ and ground-truth scores $y_i$ denoting the relatedness of a paragraph to the gold standard summary. We adopt ROUGE-2 recall (of paragraph $P_i$ against the gold target text $D$) as $y_i$. In testing, input paragraphs are ranked based on the model-predicted scores and an ordering $\{R_1, \cdots, R_L\}$ is generated. The first $L'$ paragraphs $\{R_1, \cdots, R_{L'}\}$ are selected as input to the second, abstractive stage.
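For illustration, a minimal sketch of the scoring model in Equations (1)-(6) is given below, assuming a PyTorch implementation; module and variable names are illustrative and do not come from the released code.

```python
import torch
import torch.nn as nn

class ParagraphRanker(nn.Module):
    """Scores a paragraph for its usefulness given the article title (Eqs. 1-6)."""

    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm_title = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.lstm_para = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.W1 = nn.Linear(2 * hidden, hidden)   # matches title and paragraph features (Eq. 4)
        self.W2 = nn.Linear(hidden, 1)            # final scoring layer (Eq. 6)

    def forward(self, title_ids, para_ids):
        # Contextualise title and paragraph tokens with separate LSTMs (Eqs. 1-2).
        u_t, _ = self.lstm_title(self.embed(title_ids))   # [B, m, hidden]
        u_p, _ = self.lstm_para(self.embed(para_ids))     # [B, n, hidden]

        # Max-pool over title tokens to get a fixed-length title vector (Eq. 3).
        u_hat = u_t.max(dim=1).values                      # [B, hidden]

        # Concatenate the title vector with every paragraph token and transform (Eq. 4).
        u_hat_exp = u_hat.unsqueeze(1).expand(-1, u_p.size(1), -1)
        p = torch.tanh(self.W1(torch.cat([u_p, u_hat_exp], dim=-1)))  # [B, n, hidden]

        # Max-pool over paragraph tokens (Eq. 5) and score with a sigmoid (Eq. 6).
        p_hat = p.max(dim=1).values                        # [B, hidden]
        return torch.sigmoid(self.W2(p_hat)).squeeze(-1)   # [B]

# Toy usage: score two paragraphs against their titles.
ranker = ParagraphRanker(vocab_size=32000)
title = torch.randint(0, 32000, (2, 6))
paras = torch.randint(0, 32000, (2, 40))
scores = ranker(title, paras)   # trained with cross entropy against ROUGE-2 recall targets
print(scores.shape)             # torch.Size([2])
```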

    3.2 Paragraph Encoding

Instead of treating the selected paragraphs as a very long sequence, we develop a hierarchical model based on the Transformer architecture (Vaswani et al., 2017) to capture inter-paragraph relations. The model is composed of several local and global transformer layers which can be stacked freely. Let $t_{ij}$ denote the $j$-th token in the $i$-th ranked paragraph $R_i$; the model takes vectors $x^0_{ij}$ (for all tokens) as input. For the $l$-th transformer layer, the input will be $x^{l-1}_{ij}$, and the output is written as $x^l_{ij}$.

    3.2.1 Embeddings

Input tokens are first represented by word embeddings. Let $w_{ij} \in \mathbb{R}^d$ denote the embedding assigned to $t_{ij}$. Since the Transformer is a non-recurrent model, we also assign a special positional embedding $pe_{ij}$ to $t_{ij}$, to indicate the position of the token within the input.

To calculate positional embeddings, we follow Vaswani et al. (2017) and use sine and cosine functions of different frequencies. The embedding $e_p$ for the $p$-th element in a sequence is:

$$e_p[2i] = \sin(p/10000^{2i/d}) \quad (7)$$
$$e_p[2i+1] = \cos(p/10000^{2i/d}) \quad (8)$$

where $e_p[i]$ indicates the $i$-th dimension of the embedding vector. Because each dimension of the positional encoding corresponds to a sinusoid, for any fixed offset $o$, $e_{p+o}$ can be represented as a linear function of $e_p$, which enables the model to distinguish relative positions of input elements.

In multi-document summarization, token $t_{ij}$ has two positions that need to be considered, namely $i$ (the rank of the paragraph) and $j$ (the position of the token within the paragraph). The positional embedding $pe_{ij} \in \mathbb{R}^d$ represents both positions (via concatenation) and is added to the word embedding $w_{ij}$ to obtain the final input vector $x^0_{ij}$:

$$pe_{ij} = [e_i; e_j] \quad (9)$$
$$x^0_{ij} = w_{ij} + pe_{ij} \quad (10)$$
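A short sketch of how the two-part positional embedding of Equations (7)-(10) can be computed, again assuming PyTorch; the sinusoid table is shared by the paragraph-rank and token-position halves, which are concatenated to dimension d. Names are illustrative.

```python
import math
import torch

def sinusoid_table(max_pos: int, dim: int) -> torch.Tensor:
    """e_p[2i] = sin(p / 10000^(2i/dim)), e_p[2i+1] = cos(...) (Eqs. 7-8)."""
    pos = torch.arange(max_pos, dtype=torch.float).unsqueeze(1)          # [max_pos, 1]
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float)
                    * (-math.log(10000.0) / dim))                        # [dim/2]
    table = torch.zeros(max_pos, dim)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)
    return table

def hierarchical_input(word_emb: torch.Tensor) -> torch.Tensor:
    """word_emb: [n_paras, n_tokens, d]. Adds pe_ij = [e_i; e_j] (Eqs. 9-10)."""
    n_paras, n_tokens, d = word_emb.shape
    half = d // 2
    table = sinusoid_table(max(n_paras, n_tokens), half)
    para_pe = table[:n_paras].unsqueeze(1).expand(n_paras, n_tokens, half)  # e_i: paragraph rank
    tok_pe = table[:n_tokens].unsqueeze(0).expand(n_paras, n_tokens, half)  # e_j: token position
    return word_emb + torch.cat([para_pe, tok_pe], dim=-1)                  # x^0_ij (Eq. 10)

x0 = hierarchical_input(torch.randn(24, 70, 256))   # e.g., 24 ranked paragraphs of 70 tokens
print(x0.shape)                                      # torch.Size([24, 70, 256])
```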

    3.2.2 Local Transformer Layer

A local transformer layer is used to encode contextual information for tokens within each paragraph. The local transformer layer is the same as the vanilla transformer layer (Vaswani et al., 2017) and is composed of two sub-layers:

$$h = \mathrm{LayerNorm}(x^{l-1} + \mathrm{MHAtt}(x^{l-1})) \quad (11)$$
$$x^l = \mathrm{LayerNorm}(h + \mathrm{FFN}(h)) \quad (12)$$

where LayerNorm is the layer normalization proposed in Ba et al. (2016); MHAtt is the multi-head attention mechanism introduced in Vaswani et al. (2017), which allows each token to attend to other tokens with different attention distributions; and FFN is a two-layer feed-forward network with ReLU as the hidden activation function.
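For completeness, a compact sketch of the local layer in Equations (11)-(12), assuming PyTorch's built-in multi-head attention; this is the standard post-norm Transformer encoder layer rather than the authors' exact module.

```python
import torch
import torch.nn as nn

class LocalTransformerLayer(nn.Module):
    """Vanilla post-norm Transformer layer applied within each paragraph (Eqs. 11-12)."""

    def __init__(self, d_model=256, n_heads=8, d_ff=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: [n_paras, n_tokens, d]; attention only mixes tokens of the same paragraph.
        attn_out, _ = self.attn(x, x, x)
        h = self.norm1(x + attn_out)            # Eq. 11
        return self.norm2(h + self.ffn(h))      # Eq. 12

layer = LocalTransformerLayer()
out = layer(torch.randn(24, 70, 256))
print(out.shape)   # torch.Size([24, 70, 256])
```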

    3.2.3 Global Transformer Layer

A global transformer layer is used to exchange information across multiple paragraphs. As shown in Figure 2, we first apply a multi-head pooling operation to each paragraph. Different heads will encode paragraphs with different attention weights. Then, for each head, an inter-paragraph attention mechanism is applied, where each paragraph can collect information from other paragraphs by self-attention, generating a context vector to capture contextual information from the whole input. Finally, context vectors are concatenated, linearly transformed, added to the vector of each token, and fed to a feed-forward layer, updating the representation of each token with global information.

Multi-head Pooling  To obtain fixed-length paragraph representations, we apply a weighted-pooling operation. Instead of using only one representation for each paragraph, we introduce a multi-head pooling mechanism, where for each paragraph, weight distributions over tokens are calculated, allowing the model to flexibly encode paragraphs in different representation subspaces by attending to different words.

Let $x^{l-1}_{ij} \in \mathbb{R}^d$ denote the output vector of the last transformer layer for token $t_{ij}$, which is used as input for the current layer. For each paragraph $R_i$ and for each head $z \in \{1, \cdots, n_{head}\}$, we first transform the input vectors into attention scores $a^z_{ij}$ and value vectors $b^z_{ij}$. Then, for each head, we calculate a probability distribution $\hat{a}^z_{ij}$ over the tokens within the paragraph based on the attention scores:

$$a^z_{ij} = W^z_a\, x^{l-1}_{ij} \quad (13)$$
$$b^z_{ij} = W^z_b\, x^{l-1}_{ij} \quad (14)$$
$$\hat{a}^z_{ij} = \exp(a^z_{ij}) \Big/ \sum_{j'=1}^{n} \exp(a^z_{ij'}) \quad (15)$$

where $W^z_a \in \mathbb{R}^{1 \times d}$ and $W^z_b \in \mathbb{R}^{d_{head} \times d}$ are weights, $d_{head} = d/n_{head}$ is the dimension of each head, and $n$ is the number of tokens in $R_i$.

We next apply a weighted summation with another linear transformation and layer normalization to obtain the vector $head^z_i$ for the paragraph:

$$head^z_i = \mathrm{LayerNorm}\Big(W^z_c \sum_{j=1}^{n} \hat{a}^z_{ij}\, b^z_{ij}\Big) \quad (16)$$

where $W^z_c \in \mathbb{R}^{d_{head} \times d_{head}}$ is the weight.

The model can flexibly incorporate multiple heads, with each paragraph having multiple attention distributions, thereby focusing on different views of the input.
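The multi-head pooling of Equations (13)-(16) can be sketched as follows (PyTorch assumed; names illustrative). Each head computes a softmax over the tokens of a paragraph and returns one d_head-dimensional vector per paragraph; the per-head output matrix W_c is shared across heads here for brevity.

```python
import torch
import torch.nn as nn

class MultiHeadPooling(nn.Module):
    """Collapses each paragraph into n_heads vectors of size d_head (Eqs. 13-16)."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.score = nn.Linear(d_model, n_heads)        # a^z_ij, one scalar per head (Eq. 13)
        self.value = nn.Linear(d_model, d_model)        # b^z_ij, d_head dims per head (Eq. 14)
        self.out = nn.Linear(self.d_head, self.d_head)  # W^z_c (shared across heads here)
        self.norm = nn.LayerNorm(self.d_head)

    def forward(self, x):
        # x: [n_paras, n_tokens, d_model] -> heads: [n_paras, n_heads, d_head]
        n_paras, n_tokens, _ = x.shape
        a = torch.softmax(self.score(x), dim=1)                            # Eq. 15, over tokens
        b = self.value(x).view(n_paras, n_tokens, self.n_heads, self.d_head)
        pooled = (a.unsqueeze(-1) * b).sum(dim=1)                          # weighted sum over tokens
        return self.norm(self.out(pooled))                                 # Eq. 16

pool = MultiHeadPooling()
heads = pool(torch.randn(24, 70, 256))
print(heads.shape)   # torch.Size([24, 8, 32])
```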

Inter-paragraph Attention  We model the dependencies across multiple paragraphs with an inter-paragraph attention mechanism. Similar to self-attention, inter-paragraph attention allows each paragraph to attend to the other paragraphs by calculating an attention distribution:

$$q^z_i = W^z_q\, head^z_i \quad (17)$$
$$k^z_i = W^z_k\, head^z_i \quad (18)$$
$$v^z_i = W^z_v\, head^z_i \quad (19)$$
$$\mathrm{context}^z_i = \sum_{i'=1}^{m} \frac{\exp({q^z_i}^{\top} k^z_{i'})}{\sum_{o=1}^{m} \exp({q^z_i}^{\top} k^z_o)}\, v^z_{i'} \quad (20)$$

where $q^z_i$, $k^z_i$, $v^z_i \in \mathbb{R}^{d_{head}}$ are the query, key, and value vectors, linearly transformed from $head^z_i$ as in Vaswani et al. (2017); $\mathrm{context}^z_i \in \mathbb{R}^{d_{head}}$ represents the context vector generated by a self-attention operation over all paragraphs; and $m$ is the number of input paragraphs. Figure 2 provides a schematic view of inter-paragraph attention.
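A sketch of the inter-paragraph attention in Equations (17)-(20), under the same PyTorch assumption; the per-head projection matrices of the paper are shared across heads here to keep the example short.

```python
import torch
import torch.nn as nn

class InterParagraphAttention(nn.Module):
    """Each paragraph head attends to the same head of all paragraphs (Eqs. 17-20)."""

    def __init__(self, d_head=32):
        super().__init__()
        self.q = nn.Linear(d_head, d_head)   # W^z_q
        self.k = nn.Linear(d_head, d_head)   # W^z_k
        self.v = nn.Linear(d_head, d_head)   # W^z_v

    def forward(self, heads):
        # heads: [n_paras, n_heads, d_head] from multi-head pooling
        q, k, v = self.q(heads), self.k(heads), self.v(heads)
        # attention scores between paragraphs, computed independently for every head
        scores = torch.einsum('ihd,jhd->hij', q, k)        # [n_heads, n_paras, n_paras]
        attn = torch.softmax(scores, dim=-1)               # Eq. 20 normalisation over paragraphs
        context = torch.einsum('hij,jhd->ihd', attn, v)    # [n_paras, n_heads, d_head]
        return context

inter = InterParagraphAttention()
context = inter(torch.randn(24, 8, 32))
print(context.shape)   # torch.Size([24, 8, 32])
```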

Figure 2: A global transformer layer. Different colors indicate different heads in multi-head pooling and inter-paragraph attention.

Feed-forward Networks  We next update token representations with contextual information. We first fuse information from all heads by concatenating all context vectors and applying a linear transformation with weight $W_c \in \mathbb{R}^{d \times d}$:

$$c_i = W_c[\mathrm{context}^1_i; \cdots; \mathrm{context}^{n_{head}}_i] \quad (21)$$

We then add $c_i$ to each input token vector $x^{l-1}_{ij}$ and feed the result to a two-layer feed-forward network with ReLU as the activation function and a highway layer normalization on top:

$$g_{ij} = W_{o2}\, \mathrm{ReLU}(W_{o1}(x^{l-1}_{ij} + c_i)) \quad (22)$$
$$x^l_{ij} = \mathrm{LayerNorm}(g_{ij} + x^{l-1}_{ij}) \quad (23)$$

where $W_{o1} \in \mathbb{R}^{d_{ff} \times d}$ and $W_{o2} \in \mathbb{R}^{d \times d_{ff}}$ are weights and $d_{ff}$ is the hidden size of the feed-forward layer. In this way, each token within paragraph $R_i$ can collect information from other paragraphs in a hierarchical and efficient manner.
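The fusion step of Equations (21)-(23) broadcasts the paragraph-level context back to every token; a minimal sketch (PyTorch assumed, names illustrative):

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Adds the paragraph context vector to every token and applies the FFN (Eqs. 21-23)."""

    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.fuse = nn.Linear(d_model, d_model)     # W_c over concatenated heads (Eq. 21)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))   # W_o1, W_o2 (Eq. 22)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, context):
        # x: [n_paras, n_tokens, d_model]; context: [n_paras, n_heads, d_head]
        c = self.fuse(context.flatten(start_dim=1))          # [n_paras, d_model], Eq. 21
        g = self.ffn(x + c.unsqueeze(1))                     # broadcast c_i to each token, Eq. 22
        return self.norm(g + x)                              # residual + LayerNorm, Eq. 23

fusion = ContextFusion()
x_new = fusion(torch.randn(24, 70, 256), torch.randn(24, 8, 32))
print(x_new.shape)   # torch.Size([24, 70, 256])
```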

3.2.4 Graph-informed Attention

The inter-paragraph attention mechanism can be viewed as learning a latent graph representation (self-attention weights) of the input paragraphs. Although previous work has shown that similar latent representations are beneficial for downstream NLP tasks (Liu and Lapata, 2018; Kim et al., 2017; Williams et al., 2018; Niculae et al., 2018; Fernandes et al., 2019), much work in multi-document summarization has taken advantage of explicit graph representations, each focusing on different facets of the summarization task (e.g., capturing redundant information or representing passages referring to the same event or entity). One advantage of the hierarchical transformer is that we can easily incorporate graphs external to the model to generate better summaries.

We experimented with two well-established graph representations, which we discuss briefly below. However, there is nothing inherent in our model that restricts us to these; any graph modeling relationships across paragraphs could have been used instead. Our first graph aims to capture lexical relations; graph nodes correspond to paragraphs and edge weights are cosine similarities based on tf-idf representations of the paragraphs. Our second graph aims to capture discourse relations (Christensen et al., 2013); it builds an Approximate Discourse Graph (ADG) (Yasunaga et al., 2017) over paragraphs, where edges between paragraphs are drawn by counting (a) co-occurring entities and (b) discourse markers (e.g., however, nevertheless) connecting two adjacent paragraphs (see the Appendix for details on how ADGs are constructed).

We represent such graphs with a matrix $G$, where $G_{ii'}$ is the weight of the edge connecting paragraphs $i$ and $i'$. We can then inject this graph into our hierarchical transformer by simply substituting one of its (learned) heads $z'$ with $G$. Equation (20) for calculating the context vector for this head is modified as:

$$\mathrm{context}^{z'}_i = \sum_{i'=1}^{m} \frac{G_{ii'}}{\sum_{o=1}^{m} G_{io}}\, v^{z'}_{i'} \quad (24)$$
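In code, injecting an external graph amounts to replacing the learned attention weights of one head with the row-normalized graph matrix of Equation (24); a hedged sketch:

```python
import torch

def graph_informed_context(G: torch.Tensor, v: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Eq. 24: context_i = sum_i' (G[i, i'] / sum_o G[i, o]) * v[i'].

    G: [n_paras, n_paras] external graph weights (similarity or discourse);
    v: [n_paras, d_head] value vectors of the substituted head z'.
    """
    weights = G / (G.sum(dim=-1, keepdim=True) + eps)   # row-normalise the graph
    return weights @ v                                   # weighted average over paragraphs

# Toy usage: a random non-negative graph over 24 paragraphs.
G = torch.rand(24, 24)
v = torch.randn(24, 32)
print(graph_informed_context(G, v).shape)   # torch.Size([24, 32])
```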

    4 Experimental Setup

WikiSum Dataset  We used the scripts and URLs provided in Liu et al. (2018) to crawl Wikipedia articles and source reference documents. We successfully crawled 78.9% of the original documents (some URLs have become invalid and the corresponding documents could not be retrieved). We further removed clone paragraphs (which are exact copies of some parts of the Wikipedia articles); these were paragraphs in the source documents whose bigram recall against the target summary was higher than 0.8. On average, each input has 525 paragraphs, and each paragraph has 70.1 tokens. The average length of the target summary is 139.4 tokens. We split the dataset into 1,579,360 instances for training, 38,144 for validation and 38,205 for test.

Method        L′ = 5    L′ = 10   L′ = 20   L′ = 40
Similarity    24.86     32.43     40.87     49.49
Ranking       39.38     46.74     53.84     60.42

Table 1: ROUGE-L recall against the target summary for the L′-best paragraphs obtained with tf-idf cosine similarity and with our ranking model.

For both the ranking and summarization stages, we encode source paragraphs and target summaries using subword tokenization with SentencePiece (Kudo and Richardson, 2018). Our vocabulary consists of 32,000 subwords and is shared for both source and target.

Paragraph Ranking  To train the regression model, we calculated the ROUGE-2 recall (Lin, 2004) of each paragraph against the target summary and used this as the ground-truth score. The hidden size of the two LSTMs was set to 256, and dropout (with dropout probability of 0.2) was used before all linear layers. Adagrad (Duchi et al., 2011) with learning rate 0.15 is used for optimization. We compare our ranking model against the method proposed in Liu et al. (2018), who use the tf-idf cosine similarity between each paragraph and the article title to rank the input paragraphs. We take the first L′ paragraphs from the ordered paragraph set produced by our ranker and by the similarity-based method, respectively. We concatenate these paragraphs and calculate their ROUGE-L recall against the gold target text. The results are shown in Table 1. We can see that our ranker effectively extracts related paragraphs and produces more informative input for the downstream summarization task.

Model                                               ROUGE-1   ROUGE-2   ROUGE-L
Lead                                                38.22     16.85     26.89
LexRank                                             36.12     11.67     22.52
FT (600 tokens, no ranking)                         35.46     20.26     30.65
FT (600 tokens)                                     40.46     25.26     34.65
FT (800 tokens)                                     40.56     25.35     34.73
FT (1,200 tokens)                                   39.55     24.63     33.99
T-DMCA (3,000 tokens)                               40.77     25.60     34.90
HT (1,600 tokens)                                   40.82     25.99     35.08
HT (1,600 tokens) + Similarity Graph                40.80     25.95     35.08
HT (1,600 tokens) + Discourse Graph                 40.81     25.95     35.24
HT (train on 1,600 tokens / test on 3,000 tokens)   41.53     26.52     35.76

Table 2: Test set results on the WikiSum dataset using ROUGE F1.

Training Configuration  In all abstractive models, we apply dropout (with probability 0.1) before all linear layers; label smoothing (Szegedy et al., 2016) with smoothing factor 0.1 is also used. Training proceeds in the traditional sequence-to-sequence manner with maximum likelihood estimation. The optimizer was Adam (Kingma and Ba, 2014) with learning rate 2, β1 = 0.9, and β2 = 0.998; we also applied learning rate warmup over the first 8,000 steps and decay as in Vaswani et al. (2017). All transformer-based models had 256 hidden units; the feed-forward hidden size was 1,024 for all layers. All models were trained on 4 GPUs (NVIDIA TITAN Xp) for 500,000 steps. We used gradient accumulation to keep training time for all models approximately consistent. We selected the 5 best checkpoints based on performance on the validation set and report averaged results on the test set.
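The quoted learning rate of 2 only makes sense in combination with the warmup-then-decay schedule of Vaswani et al. (2017); the sketch below shows that schedule under the stated settings (the exact scaling constant used by the authors is an assumption).

```python
def transformer_lr(step: int, base_lr: float = 2.0, d_model: int = 256,
                   warmup: int = 8000) -> float:
    """Noam-style schedule: linear warmup for `warmup` steps, then inverse-sqrt decay."""
    step = max(step, 1)
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Effective learning rate at a few points of training.
for s in (100, 8000, 100000, 500000):
    print(s, round(transformer_lr(s), 6))
```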

During decoding we use beam search with beam size 5 and a length penalty with α = 0.4 (Wu et al., 2016); we decode until an end-of-sequence token is reached.

Comparison Systems  We compared the proposed hierarchical transformer against several strong baselines:

Lead is a simple baseline that concatenates the title and ranked paragraphs and extracts the first k tokens; we set k to the length of the ground-truth target.

LexRank (Erkan and Radev, 2004) is a widely-used graph-based extractive summarizer; we build a graph with paragraphs as nodes and edges weighted by tf-idf cosine similarity; we run a PageRank-like algorithm on this graph to rank and select paragraphs until the length of the ground-truth summary is reached.

Flat Transformer (FT) is a baseline that applies a Transformer-based encoder-decoder model to a flat token sequence. We used a 6-layer transformer. The title and ranked paragraphs were concatenated and truncated to 600, 800, and 1,200 tokens.

T-DMCA is the best performing model of Liu et al. (2018) and a shorthand for Transformer Decoder with Memory Compressed Attention; it uses only a Transformer decoder and compresses the keys and values in self-attention with a convolutional layer. The model has 5 layers as in Liu et al. (2018); its hidden size is 512 and its feed-forward hidden size is 2,048. The title and ranked paragraphs were concatenated and truncated to 3,000 tokens.

Hierarchical Transformer (HT) is the model proposed in this paper. The architecture is a 7-layer network (with 5 local-attention layers at the bottom and 2 global-attention layers at the top). The model takes the title and L′ = 24 paragraphs as input to produce a target summary, which leads to approximately 1,600 input tokens per instance.

    5 Results

Automatic Evaluation  We evaluated summarization quality using ROUGE F1 (Lin, 2004). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) as a means of assessing informativeness, and the longest common subsequence (ROUGE-L) as a means of assessing fluency.

Table 2 summarizes our results. The first block in the table includes extractive systems (Lead, LexRank), the second block includes several variants of Flat Transformer-based models (FT, T-DMCA), while the rest of the table presents the results of our Hierarchical Transformer (HT). As can be seen, abstractive models generally outperform extractive ones. The Flat Transformer achieves its best results when the input length is set to 800 tokens, while longer input (i.e., 1,200 tokens) actually hurts performance. The Hierarchical Transformer with 1,600 input tokens outperforms FT, and even T-DMCA when the latter is presented with 3,000 tokens. Adding an external graph also seems to help the summarization process. The similarity graph does not have an obvious influence on the results, while the discourse graph boosts ROUGE-L by 0.16.

Model       R1      R2      RL
HT          40.82   25.99   35.08
HT w/o PP   40.21   24.54   34.71
HT w/o MP   39.90   24.34   34.61
HT w/o GT   39.01   22.97   33.76

Table 3: Hierarchical Transformer and versions thereof without (w/o) paragraph position (PP), multi-head pooling (MP), and the global transformer layer (GT).

We also found that the performance of the Hierarchical Transformer further improves when the model is presented with longer input at test time.2 As shown in the last row of Table 2, when testing on 3,000 input tokens, summarization quality improves across the board. This suggests that the model can potentially generate better summaries without increasing training time.

2 This was not the case with the other Transformer models.

Table 3 summarizes ablation studies aiming to assess the contribution of individual components. Our experiments confirmed that encoding paragraph position in addition to token position within each paragraph is beneficial (see row w/o PP), as well as multi-head pooling (w/o MP is a model where the number of heads is set to 1), and the global transformer layer (w/o GT is a model with only 5 local transformer layers in the encoder).

Human Evaluation  In addition to automatic evaluation, we also assessed system performance by eliciting human judgments on 20 randomly selected test instances. Our first evaluation study quantified the degree to which summarization models retain key information from the documents following a question-answering (QA) paradigm (Clarke and Lapata, 2010; Narayan et al., 2018). We created a set of questions based on the gold summary under the assumption that it contains the most important information from the input paragraphs. We then examined whether participants were able to answer these questions by reading system summaries alone, without access to the gold summary. The more questions a system can answer, the better it is at summarization. We created 57 questions in total, varying from two to four questions per gold summary. Examples of questions and their answers are given in Table 5. We adopted the same scoring mechanism used in Clarke and Lapata (2010), i.e., correct answers are marked with 1, partially correct ones with 0.5, and 0 otherwise. A system's score is the average of all question scores.

Model     QA      Rating
Lead      31.59   -0.383
FT        35.69    0.000
T-DMCA    43.14    0.147
HT        54.11    0.237

Table 4: System scores based on questions answered by AMT participants and summary quality rating.

Our second evaluation study assessed the overall quality of the summaries by asking participants to rank them taking into account the following criteria: Informativeness (does the summary convey important facts about the topic in question?), Fluency (is the summary fluent and grammatical?), and Succinctness (does the summary avoid repetition?). We used Best-Worst Scaling (Louviere et al., 2015), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). Participants were presented with the gold summary and summaries generated from 3 out of 4 systems and were asked to decide which summary was the best and which one was the worst in relation to the gold standard, taking into account the criteria mentioned above. The rating of each system was computed as the percentage of times it was chosen as best minus the times it was selected as worst. Ratings range from −1 (worst) to 1 (best).

Both evaluations were conducted on the Amazon Mechanical Turk platform with 5 responses per HIT. Participants evaluated summaries produced by the Lead baseline, the Flat Transformer, T-DMCA, and our Hierarchical Transformer. All evaluated systems were variants that achieved the best performance in automatic evaluations. As shown in Table 4, on both evaluations participants overwhelmingly prefer our model (HT). All pairwise comparisons among systems are statistically significant (using a one-way ANOVA with post-hoc Tukey HSD tests; p < 0.01). Examples of system output are provided in Table 5.

Pentagoet Archeological District

GOLD: The Pentagoet Archeological District is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Castine, Maine. It is the site of Fort Pentagoet, a 17th-century fortified trading post established by fur traders of French Acadia. From 1635 to 1654 this site was a center of trade with the local Abenaki, and marked the effective western border of Acadia with New England. From 1654 to 1670 the site was under English control, after which it was returned to France by the Treaty of Breda. The fort was destroyed in 1674 by Dutch raiders. The site was designated a National Historic Landmark in 1993. It is now a public park.

QA: What is the Pentagoet Archeological District? [a National Historic Landmark District] / Where is it located? [Castine, Maine] / What did the Abenaki Indians use the site for? [trading center]

LEAD: The Pentagoet Archeological District is a National Historic Landmark District located in Castine, Maine. This district forms part of the traditional homeland of the Abenaki Indians, in particular the Penobscot tribe. In the colonial period, Abenakis frequented the fortified trading post at this site, bartering moosehides, sealskins, beaver and other furs in exchange for European commodities. "Pentagoet Archeological district" is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Treaty Of Breda.

FT: the Pentagoet Archeological district is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Treaty Of Breda. It was listed on the national register of historic places in 1983.

T-DMCA: The Pentagoet Archeological District is a national historic landmark district located in castine , maine . this district forms part of the traditional homeland of the abenaki indians , in particular the Penobscot tribe. The district was listed on the national register of historic places in 1982.

HT: The Pentagoet Archeological district is a National Historic Landmark District located in Castine, Maine. This district forms part of the traditional homeland of the Abenaki Indians, in particular the Penobscot tribe. In the colonial period, Abenaki frequented the fortified trading post at this site, bartering moosehides, sealskins, beaver and other furs in exchange for European commodities.

Melanesian Whistler

GOLD: The Melanesian whistler or Vanuatu whistler (Pachycephala chlorura) is a species of passerine bird in the whistler family Pachycephalidae. It is found on the Loyalty Islands, Vanuatu, and Vanikoro in the far south-eastern Solomons.

QA: What is the Melanesian Whistler? [a species of passerine bird in the whistler family Pachycephalidae] / Where is it found? [Loyalty Islands, Vanuatu, and Vanikoro in the far south-eastern Solomons]

LEAD: The Australian golden whistler (Pachycephala pectoralis) is a species of bird found in forest, woodland, mallee, mangrove and scrub in Australia (except the interior and most of the north). Most populations are resident, but some in south-eastern Australia migrate north during the winter.

FT: The Melanesian whistler (P. Caledonica) is a species of bird in the family Muscicapidae. It is endemic to Melanesia.

T-DMCA: The Australian golden whistler (Pachycephala chlorura) is a species of bird in the family Pachycephalidae, which is endemic to Fiji.

HT: The Melanesian whistler (Pachycephala chlorura) is a species of bird in the family Pachycephalidae, which is endemic to Fiji.

Table 5: GOLD human-authored summaries, questions based on them (answers shown in square brackets), and automatic summaries produced by the LEAD baseline, the Flat Transformer (FT), T-DMCA (Liu et al., 2018), and our Hierarchical Transformer (HT).

    6 Conclusions

In this paper we conceptualized abstractive multi-document summarization as a machine learning problem. We proposed a new model which is able to encode multiple input documents hierarchically, learn latent relations across them, and additionally incorporate structural information from well-known graph representations. We have also demonstrated the importance of a learning-based approach for selecting which documents to summarize. Experimental results show that our model produces summaries which are both fluent and informative, outperforming competitive systems by a wide margin. In the future we would like to apply our hierarchical transformer to question answering and related textual inference tasks.

    Acknowledgments

We would like to thank Laura Perez-Beltrachini for her help with preprocessing the dataset. This research is supported by a Google PhD Fellowship to the first author. The authors gratefully acknowledge the financial support of the European Research Council (award number 681760).


References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California.

Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–327.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1587–1597, Beijing, China.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675, New Orleans, Louisiana.

Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2013. Towards coherent multi-document summarization. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1173, Atlanta, Georgia. Association for Computational Linguistics.

Eric Chu and Peter J. Liu. 2018. Unsupervised neural multi-document abstractive summarization. arXiv preprint arXiv:1810.05739.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured neural summarization. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, Louisiana.

Katja Filippova and Michael Strube. 2008. Sentence fusion via dependency graph compression. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 177–185, Honolulu, Hawaii.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico.

Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 465–470, Vancouver, Canada.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Logan Lebanoff and Fei Liu. 2018. Automatic detection of vague words and sentences in privacy policies. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3508–3517, Brussels, Belgium.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.

Yang Liu and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6:63–75.

Jordan J. Louviere, Terry N. Flynn, and Anthony Alfred John Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.

Shulei Ma, Zhi-Hong Deng, and Yunlun Yang. 2016. An unsupervised multi-document summarization framework based on neural document model. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1514–1523, Osaka, Japan.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana.

Vlad Niculae, André F. T. Martins, and Claire Cardie. 2018. Towards dynamic computation graphs via sparse latent structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 905–911, Brussels, Belgium.

Daraksha Parveen and Michael Strube. 2014. Multi-document summarization using bipartite graphs. In Proceedings of TextGraphs-9: the Workshop on Graph-based Methods for Natural Language Processing, pages 15–24, Doha, Qatar.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.

Dragomir Radev. 2000. A common theory of information fusion from multiple text sources step one: Cross-document structure. In 1st SIGdial Workshop on Discourse and Dialogue, pages 74–83, Hong Kong, China.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12).

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Xiaojun Wan. 2008. An exploration of document impact on graph-based multi-document summarization. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 755–762, Honolulu, Hawaii.

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 452–462, Vancouver, Canada.

Jianmin Zhang, Jiwei Tan, and Xiaojun Wan. 2018. Adapting neural single-document summarization model for abstractive multi-document summarization: A pilot study. In Proceedings of the International Conference on Natural Language Generation.

    A Appendix

We describe here how the similarity and discourse graphs discussed in Section 3.2.4 were created. These graphs were added to the hierarchical transformer model as a means to enhance summary quality (see Section 5 for details).


    A.1 Similarity Graph

The similarity graph $S$ is based on tf-idf cosine similarity. The nodes of the graph are paragraphs. We first represent each paragraph $p_i$ as a bag of words. Then, we calculate the tf-idf value $v_{ik}$ for each token $t_{ik}$ in a paragraph:

$$v_{ik} = N_w(t_{ik}) \log\!\left(\frac{N_d}{N_{dw}(t_{ik})}\right) \quad (25)$$

where $N_w(t)$ is the count of word $t$ in the paragraph, $N_d$ is the total number of paragraphs, and $N_{dw}(t)$ is the total number of paragraphs containing the word. We thus obtain a tf-idf vector for each paragraph. Then, for all paragraph pairs $\langle p_i, p_{i'} \rangle$, we calculate the cosine similarity of their tf-idf vectors and use this as the weight $S_{ii'}$ of the edge connecting the pair in the graph. We remove edges with weights lower than 0.2.
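A minimal sketch of the similarity-graph construction (Equation (25), cosine similarity, and the 0.2 threshold); tokenization is deliberately naive here and names are illustrative.

```python
import math
from collections import Counter

import numpy as np

def similarity_graph(paragraphs, threshold=0.2):
    """Builds S from Eq. 25 plus cosine similarity, dropping edges below `threshold`."""
    docs = [Counter(p.lower().split()) for p in paragraphs]   # bag-of-words per paragraph
    n_d = len(docs)
    df = Counter()                                            # N_dw: paragraphs containing each word
    for doc in docs:
        df.update(doc.keys())
    vocab = {w: i for i, w in enumerate(df)}

    # tf-idf vectors: v_ik = N_w(t_ik) * log(N_d / N_dw(t_ik))  (Eq. 25)
    tfidf = np.zeros((n_d, len(vocab)))
    for i, doc in enumerate(docs):
        for w, count in doc.items():
            tfidf[i, vocab[w]] = count * math.log(n_d / df[w])

    # cosine similarity between all paragraph pairs
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True) + 1e-9
    S = (tfidf / norms) @ (tfidf / norms).T
    S[S < threshold] = 0.0                                    # remove weak edges
    return S

S = similarity_graph(["the fort was destroyed in 1674",
                      "the fort was a trading post",
                      "a species of passerine bird"])
print(np.round(S, 2))
```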

    A.2 Discourse Graphs

To build the Approximate Discourse Graph (ADG) $D$, we follow Christensen et al. (2013) and Yasunaga et al. (2017). The original ADG makes use of several complex features. Here, we create a simplified version with only two features (nodes in this graph are again paragraphs).

Co-occurring Entities  For each paragraph $p_i$, we extract a set of entities $E_i$ in the paragraph using the spaCy3 NER recognizer. We only use entities with type {PERSON, NORP, FAC, ORG, GPE, LOC, EVENT, WORK_OF_ART, LAW}. For each paragraph pair $\langle p_i, p_j \rangle$, we count $e_{ij}$, the number of entities with an exact match.

Discourse Markers  We use the following 36 explicit discourse markers to identify edges between two adjacent paragraphs in a source webpage:

again, also, another, comparatively, furthermore, at the same time, however, immediately, indeed, instead, to be sure, likewise, meanwhile, moreover, nevertheless, nonetheless, notably, otherwise, regardless, similarly, unlike, in addition, even, in turn, in exchange, in this case, in any event, finally, later, as well, especially, as a result, example, in fact, then, the day before

3 https://spacy.io/api/entityrecognizer

If two paragraphs $\langle p_i, p_{i'} \rangle$ are adjacent in one source webpage and they are connected with one of the above 36 discourse markers, $m_{ii'}$ will be 1; otherwise it will be 0.

The final edge weight $D_{ii'}$ is the weighted sum of $e_{ii'}$ and $m_{ii'}$:

$$D_{ii'} = 0.2 \cdot e_{ii'} + m_{ii'} \quad (26)$$
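A sketch of the simplified ADG construction, assuming spaCy for NER; the discourse-marker check below is a rough string match on paragraph openings and uses only a subset of the 36 markers, so it approximates the procedure above rather than implementing it exactly.

```python
import numpy as np
import spacy  # assumes an installed model, e.g. `python -m spacy download en_core_web_sm`

KEEP = {"PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "EVENT", "WORK_OF_ART", "LAW"}
MARKERS = ("however", "moreover", "meanwhile", "in addition", "as a result")  # subset of the 36

def discourse_graph(paragraphs, nlp=None):
    """D_ii' = 0.2 * e_ii' + m_ii' (Eq. 26) from entity overlap and adjacency markers."""
    nlp = nlp or spacy.load("en_core_web_sm")
    ents = [{e.text for e in nlp(p).ents if e.label_ in KEEP} for p in paragraphs]
    n = len(paragraphs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            e_ij = len(ents[i] & ents[j])           # co-occurring entities
            m_ij = 0
            if abs(i - j) == 1:                     # adjacent paragraphs (here: adjacent in input order)
                follower = paragraphs[max(i, j)].lower()
                m_ij = int(follower.startswith(MARKERS))
            D[i, j] = 0.2 * e_ij + m_ij
    return D

D = discourse_graph(["Fort Pentagoet was built by France.",
                     "However, France later lost the fort."])
print(D)
```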