
SummAE: Zero-Shot Abstractive Text Summarization using Length-Agnostic Auto-Encoders

Peter J. Liu∗
Google Brain

[email protected]

Yu-An Chung∗†
MIT CSAIL

[email protected]

Jie Ren
Google Brain

[email protected]

Abstract

We propose an end-to-end neural model for zero-shot abstractive text summarization of paragraphs, and introduce a benchmark task, ROCSumm, based on ROCStories, a subset for which we collected human summaries. In this task, five-sentence stories (paragraphs) are summarized with one sentence, using human summaries only for evaluation. We show results for extractive and human baselines to demonstrate a large abstractive gap in performance. Our model, SummAE, consists of a denoising auto-encoder that embeds sentences and paragraphs in a common space, from which either can be decoded. Summaries for paragraphs are generated by decoding a sentence from the paragraph representations. We find that traditional sequence-to-sequence auto-encoders fail to produce good summaries and describe how specific architectural choices and pre-training techniques can significantly improve performance, outperforming extractive baselines. The data, training and evaluation code, and best model weights are open-sourced.

1 Introduction

Extractive summarization has been studied extensively over the past several decades (Gupta and Lehal, 2010; Ferreira et al., 2013). However, humans typically summarize abstractively, paraphrasing and performing non-trivial compression of details that are difficult to encode in classical summarization algorithms. Recent progress in neural language models (Sutskever et al., 2014; Jozefowicz et al., 2016; Radford et al., 2019) has enabled models to generate near-fluent language that is not a mere regurgitation of training data.

∗ Equal contribution. † Work done while interning at Google Brain.

(a) Story (paragraph) to summarize
- Jason listened to the weather and heard it was going to be sunny. He thought the kids might like to go swimming. He gathered up the swimsuits, towels and sunscreen. Jason and the kids got into the truck and drove to the beach. They spent the next 2 hours playing and splashing in the surf.

(b) Three human summaries
- Jason saw a nice weather forecast and went to the beach with his kids for 2 hours.
- Jason took the kids swimming at the beach on a sunny day.
- Jason decided to take the kids to the beach since it was a sunny day.

(c) Best extractive sentence
- Jason listened to the weather and heard it was going to be sunny.

(d) Unsupervised abstractive summary (ours)
- Jason listened to the weather to be the coolest and fun day and he went to the beach and got ready.

Figure 1: (a) An example story to summarize from the validation set for which we collected human summaries. (b) Human summaries from independent raters. (c) The extractive sentence with the highest ROUGE-1. (d) Summary generated from our model without seeing any summaries.

With large datasets of (document, summary) pairs, primarily from the news domain, abstractive summarization has been approached as a supervised neural sequence transduction problem (Rush et al., 2015; Nallapati et al., 2016; Narayan et al., 2018; Fabbri et al., 2019). Outside of news, however, such large parallel datasets are rare, due to the cost-prohibitive “labeling” process (i.e., reading long documents and writing summaries). Much more commonly available are large corpora of documents without summaries. It is therefore desirable to have models capable of automatically summarizing documents abstractively with little to no supervision.

In contrast to abstractive methods, many extractive approaches do not rely on example summaries. Inspired by this, we study the extreme case of no exposure to summaries during training, or unsupervised (zero-shot) abstractive summarization (UAS).

Recently there has been some, thus far limited, work on UAS for both the multi-document (Chu and Liu, 2019) and single-document (Isonuma et al., 2019) cases. In this work we focus on UAS of paragraphs with a sentence, which is perhaps the most basic form of multi-sentence single-document summarization. In particular, we summarize the five-sentence stories from ROCStories (Mostafazadeh et al., 2016) and show that there is a non-trivial abstractive gap between human and extractive performance, making it a suitable benchmark for measuring progress in UAS.

Our approach is based on training a denoising auto-encoder (Vincent et al., 2008) that encodes sentences and paragraphs in a shared space. The decoder input is pre-pended with a special token to signal whether to decode a sentence or a paragraph, and a summarizing sentence is generated by decoding a sentence from an encoded paragraph. However, we found that traditional approaches to training such an auto-encoder resulted in non-overlapping sentence and paragraph latent sub-spaces, which we call segregation, resulting in long, multi-sentence summaries. We describe architectural modifications and self-supervised pre-training objectives that prevent segregation and improve performance significantly beyond sentence-extractive baselines. While human performance is still out of reach, we believe the techniques presented here are a major step in that direction.

In summary, our contributions are as follows.

1. We introduce a new benchmark task, ROCSumm, for measuring progress toward human performance on UAS.

2. We propose a novel end-to-end, fully-differentiable neural model for UAS of paragraphs.

3. We describe novel self-supervised (pre-training) and denoising objectives that significantly improve performance beyond sentence-extractive baselines.

4. We conduct ablation experiments showing the importance of architectural choices and model objectives.

2 A new task for unsupervised abstractive summarization

Our new task re-purposes and augments an existing dataset, ROCStories (Mostafazadeh et al., 2016, 2017), originally designed for the “Story Cloze Test” (SCT), where a model must choose the correct fifth sentence of two candidates given the first four. The stories are self-contained, diverse, realistic, non-technical, high-quality, and have a coherent story arc. Human performance on the SCT task is close to 100%.

Our proposed UAS task involves summarizing the five-sentence training ROCStories with a single sentence, without access to summaries at training time, i.e., zero-shot summarization. We found that summaries by independent human raters have high similarity on this task, suggesting it is well-defined and relatively unambiguous, in contrast to other summarization tasks where the desired length or topic of the summary is unclear. The simplicity of the task is conducive to iterating quickly and making rapid progress in UAS. Having only five sentences and a low bound on the total number of words avoids engineering issues that often arise with very long sequences. Due to these constraints, it is simple to calculate the maximum (sentence) extractive summarization performance, which is far from human performance (see Table 1), suggesting a need for abstractive models. In contrast, it is unclear, for example, what human performance is on the popular CNN/DailyMail (supervised) summarization task (See et al., 2017) and whether abstractive models provide much of a benefit over extractive ones on it (Kryscinski et al., 2019).

2.1 Collection of reference summaries for evaluation

To evaluate summarization models, we collected multiple summaries from independent, experienced, and highly-reputable Amazon Mechanical Turk (AMT) workers. The full worker selection criteria and AMT template can be found in Appendix A.1. We split the original 98,159 training stories into 95% train, 5% valid, and 5% test examples, and collected 3 human summaries each for 500 validation examples and 500 test examples. Collecting multiple summaries allowed us to estimate human performance as well as treat multiple “right” answers more fairly by averaging metrics across the summaries for a given example. An example story with 3 human summaries, the best extractive sentence, and one of our model summaries can be found in Figure 1.

3 Related Work

Chu and Liu (2019) proposed a model for zero-shot multi-document abstractive summarization, where the mean of the representations from an auto-encoder for input documents is used to decode a summary. Isonuma et al. (2019) proposed to summarize a product review by describing it as a discourse tree, where the summary is the root and the child sentences explain their parent.

Baziotis et al. (2019) performed sentence compression by chaining two sequence-to-sequence (seq2seq) models as an auto-encoder. A Straight-Through Gumbel-Softmax estimator (Jang et al., 2017) was used to sample an output sequence from the first seq2seq model, which was encouraged to be language via a pre-trained language model loss. It was also encouraged to be related to the original sentence by using it as input to the second seq2seq model, which was trained to reconstruct the original sentence. Fevry and Phang (2018) also used a denoising auto-encoder for sentence compression. An input sentence was artificially extended and word-shuffled, encouraging the model to learn to exclude and compress, producing shorter sentences.

Wang and Lee (2018) trained a Cycle-GAN model (Zhu et al., 2017) to learn a document-to-summary mapping given large datasets of unpaired documents and summaries. However, because the model is exposed to summaries during training, it is not zero-shot summarization. Further, unlike the original Cycle-GAN model on images, it is non-differentiable, since the discriminator must distinguish real from generated samples in the discrete language domain, and it relies on REINFORCE (Williams, 1992).

Radford et al. (2019) trained a large language model on a large Web text dataset and found that the model could produce zero-shot summaries if prompted with a document followed by TL;DR, though they considered them rudimentary and unusable.

Figure 2: Model architecture of SummAE, whose backbone is a denoising auto-encoder. The encoder φenc maps paragraphs and sentences into a common space, from which either can be decoded via the decoder φdec by conditioning on two different beginning-of-sequence tokens β ∈ {0, 1}. During training, we add noise to text sequences before mapping to encourage φenc to learn more robust text representations.

Historically there have been strong parallels in the development of neural sequence transduction models for translation and summarization, relying on some flavor of sequence-to-sequence learning. We depart significantly from recent unsupervised translation work (Lample et al., 2018a,b; Artetxe et al., 2018), where models are exposed to both source and target sequences, though unpaired. In our work, models must learn to produce target sequences (i.e., summarize) having only been exposed to source sequences (documents) during training.

4 Model and Methods

4.1 Architecture

Our summarization model, SummAE, is depicted in Figure 2 and consists of a denoising auto-encoder, (φenc, φdec), capable of auto-encoding text sequences T that can be sentences or paragraphs. In particular, the encoder, φenc : T → R^{zdim}, is a deterministic function, parameterized by θE, mapping text to a latent vector representation, φenc(t) = z ∈ R^{zdim}. For an input sequence t ∈ T, we add random noise to t (described in Section 4.2), obtaining t̃ = η(t) ∈ T. We consider two encoder implementations:

1. a bidirectional RNN (Schuster and Paliwal, 1997), where φenc(t) = z is derived from the RNN's final hidden state, h = [h_→, h_←] ∈ R^{2 hdim}, followed by an affine transformation:

   h = \mathrm{RNN}_{enc}(\tilde{t}), \qquad z = Wh + b;

2. a Transformer, stacked with N identical Transformer encoder blocks (denoted as TRFenc) from Vaswani et al. (2017), where φenc(t) = z is derived from the output representation of the first token, as in Devlin et al. (2019), followed by an affine transformation:

   h_0 = \tilde{t},
   h_l = \mathrm{TRF}_{enc}(h_{l-1}), \quad \forall l \in [1, N],
   h = h_N[0, :],
   z = Wh + b.
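To make the two pooling choices concrete, the following is a minimal PyTorch-style sketch of the encoder variants. It is our illustration, not the released implementation; sizes such as hdim = 512, zdim = 256, and the 128-dim embedding follow Section 5.3, and everything else (including the omission of positional encodings) is a simplification.

```python
import torch
import torch.nn as nn

class RNNEncoder(nn.Module):
    """Bidirectional GRU encoder: final forward/backward states -> affine -> z."""
    def __init__(self, vocab_size, emb_dim=128, hdim=512, zdim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hdim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hdim, zdim)            # z = W h + b

    def forward(self, token_ids):                        # (B, T)
        _, h_n = self.rnn(self.emb(token_ids))           # h_n: (2, B, hdim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)          # [h_fwd, h_bwd]: (B, 2*hdim)
        return self.proj(h)                              # (B, zdim)

class TRFEncoder(nn.Module):
    """Transformer encoder: output of the first token -> affine -> z (as in BERT)."""
    def __init__(self, vocab_size, emb_dim=128, n_layers=2, n_heads=8, ffn=512, zdim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)     # positional encodings omitted here
        layer = nn.TransformerEncoderLayer(emb_dim, n_heads,
                                           dim_feedforward=ffn, batch_first=True)
        self.trf = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(emb_dim, zdim)

    def forward(self, token_ids):                        # (B, T)
        h = self.trf(self.emb(token_ids))                # (B, T, emb_dim)
        return self.proj(h[:, 0, :])                     # first-token pooling
```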

The decoder, parameterized by θG, is an auto-regressive generative model defining a probability distribution over T conditioned on z. We also condition the decoder on whether to decode a sentence or a paragraph, using two different beginning-of-sequence tokens indicated by β ∈ {0, 1}. The reconstructed input, t̂, is obtained by sampling one token at a time until a special end-of-sequence token is produced:

   \hat{t} = \phi_{dec}(z, \beta) \sim p_{\theta_G}(t \mid z, \beta) = \prod_i p_{\theta_G}(t_i \mid t_1, \ldots, t_{i-1}, z, \beta).   (1)

We consider two decoder implementations:

1. a unidirectional RNN that conditions on z by concatenating the decoder input embeddings with z at each time-step;

2. a Transformer with causal, masked attention that conditions by adding z to each input embedding. This is similar to the Transformer decoder in Vaswani et al. (2017), but without decoder-encoder attention.

In both cases, we avoid decoder-encoder attention to encourage all semantic information to be encapsulated in z.
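A matching sketch of the RNN decoder variant, showing how z conditions every decoding step by concatenation rather than through decoder-encoder attention. This is again our illustration under the same assumptions as above; the Transformer variant would instead add z to each input embedding.

```python
import torch
import torch.nn as nn

class RNNDecoder(nn.Module):
    """Unidirectional GRU decoder conditioned on z by concatenation at every step."""
    def __init__(self, vocab_size, emb_dim=128, hdim=512, zdim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim + zdim, hdim, batch_first=True)
        self.out = nn.Linear(hdim, vocab_size)

    def forward(self, token_ids, z):
        # token_ids start with one of two BOS tokens (beta = 0 -> sentence,
        # beta = 1 -> paragraph); z is broadcast to every time-step.
        x = self.emb(token_ids)                           # (B, T, emb_dim)
        z_rep = z.unsqueeze(1).expand(-1, x.size(1), -1)  # (B, T, zdim)
        h, _ = self.rnn(torch.cat([x, z_rep], dim=-1))
        return self.out(h)                                # (B, T, vocab) logits
```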

In our dataset, a single example is a paragraph consisting of n(p) sentences, p = (s_1, ..., s_{n(p)}), and the auto-encoder contributes two reconstruction loss terms, one for the sentences and one for the paragraph, weighted by λs and λp:

   \ell_{rec}(p, \theta_E, \theta_G) = \lambda_s \frac{1}{n(p)} \sum_{i=1}^{n(p)} \ell_{ce}(s_i, \hat{s}_i; \theta_E, \theta_G) + \lambda_p\, \ell_{ce}(p, \hat{p}; \theta_E, \theta_G),   (2)

where ℓce(x, x̂) is the standard cross-entropy loss between the input sequence and its reconstruction. We optimize it using gradient descent and teacher forcing (Williams, 1992).
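As a sketch of how Equation 2 could be computed for a single paragraph under teacher forcing, assuming the encoder/decoder interfaces from the sketches above and using token-level cross-entropy as ℓce (all names here are ours):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumed padding id

def seq_ce(logits, targets):
    """Token-level cross-entropy under teacher forcing (our stand-in for l_ce)."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=PAD_ID)

def reconstruction_loss(enc, dec, sents, noisy_sents, para, noisy_para,
                        lambda_s=1.0, lambda_p=1.0):
    """Eq. 2: mean sentence reconstruction loss plus paragraph reconstruction loss.

    `sents` is a list of (1, T) id tensors whose first token is the sentence BOS;
    `para` is a (1, T) tensor starting with the paragraph BOS; `noisy_*` are the
    noised versions fed to the encoder.
    """
    sent_terms = []
    for s, s_noisy in zip(sents, noisy_sents):
        z = enc(s_noisy)
        sent_terms.append(seq_ce(dec(s[:, :-1], z), s[:, 1:]))
    z_p = enc(noisy_para)
    para_term = seq_ce(dec(para[:, :-1], z_p), para[:, 1:])
    return lambda_s * torch.stack(sent_terms).mean() + lambda_p * para_term
```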

Our approach to summarizing paragraphs with a sentence is to prompt the decoder to generate a sentence (using β = 0) conditioned on the latent vector z of a paragraph. However, simply training an RNN or Transformer auto-encoder as described generally fails (as we show in Section 5.4); we hypothesize that the encoder can learn to map sentences and paragraphs to separate regions of the latent space, so that the decoder can recognize whether to decode a sentence or a paragraph during reconstruction based solely on the location of z, ignoring β. We find this can result in the decoder generating a paragraph even when prompted for a sentence. We call this phenomenon segregation.

Ideally, the auto-encoder learns a higher-level latent concept conveyed in paragraphs and sentences, disentangled from their original expression as paragraphs or sentences. To explicitly encourage this, we investigated adding an adversarial discriminator/critic D, which is trained to classify whether a latent vector z is an encoded sentence or paragraph. In other words, it learns pD(y(t) | φenc(t)), where y(t) is 0 if t is a sentence and 1 if it is a paragraph, while the auto-encoding loss is augmented to encourage fooling the discriminator into classifying paragraphs as sentences. Similar approaches have been used in style transfer (Hu et al., 2017; Shen et al., 2017; Romanov et al., 2019) and unsupervised machine translation (Lample et al., 2018a), although not for abstractive summarization. Details of our implementation can be found in Appendix A.2.

In our experiments we found that adding the critic was very effective for generating short, one-sentence summaries. However, for some encoder-decoder configurations, the critic was found to be unnecessary and even harmful to performance, which we discuss in Section 5.4.

4.2 Adding noise to text

We use a denoising auto-encoder rather than a standard auto-encoder by reconstructing a text sequence from a noisy version of it. Denoising can be seen as a useful self-supervised objective for improving representations, as in Devlin et al. (2019); it also serves as a form of data augmentation, effectively increasing the number of training examples; finally, it discourages merely learning the identity function without having to reduce the information bottleneck zdim to a very small value. We employ two techniques for adding noise:

Randomly masking tokens  Similar to Devlin et al. (2019) and Song et al. (2019), we randomly mask the input sequence tokens during training before feeding them to the encoder. However, instead of only predicting masked tokens, we generate the full denoised sequence. We apply the masking with the following procedure:

1. Select sequences to mask with probability ps < 1.0, so that some of the sequences are left unperturbed, as they are at test time.

2. For selected sequences, replace each token with a <MASK> token with probability pm.

Permuting order of sentences within paragraphs  Even with token-masking, we observed a failure mode where the latent representation of a paragraph overly focuses on the first sentence of the paragraph and memorizes it. Indeed, it is the best single sentence to memorize for the purpose of reconstructing a paragraph. However, to encourage learning the structure and coherence of paragraphs beyond the first sentence, with probability pperm we permute the order of sentences in a paragraph and train the auto-encoder to recover the original paragraph.
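A small Python sketch of the two noising operations, with ps, pm, and pperm set to the values reported in Section 5.3; the <MASK> token name follows the text, and the function names are ours:

```python
import random

MASK = "<MASK>"

def mask_tokens(tokens, p_select=0.15, p_mask=0.8):
    """With prob. p_select, corrupt the sequence by masking each token w.p. p_mask."""
    if random.random() >= p_select:
        return list(tokens)          # leave some sequences clean, as at test time
    return [MASK if random.random() < p_mask else tok for tok in tokens]

def permute_sentences(sentences, p_perm=0.5):
    """With prob. p_perm, shuffle sentence order; the reconstruction target stays
    the original order, so the model must recover it."""
    if random.random() >= p_perm:
        return list(sentences)
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled
```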

4.3 Pre-training of encoder and decoder

Motivated by the recent success of self-supervised language representation learning (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Song et al., 2019; Zhang et al., 2019; Yang et al., 2019), we propose several strategies that pre-train the encoder and decoder before optimizing them jointly with the auto-encoding objective (Equation 2). The strategies are applied jointly in the pre-training phase by adding the corresponding losses.

Although we adopt the paradigm of pre-training followed by fine-tuning, there are two significant differences from other work. In past work, labeled data are available for the downstream tasks and fine-tuning is supervised; in our work, both pre-training and fine-tuning are unsupervised. Additionally, most previous work pre-trained models on extremely large corpora different from their downstream datasets, whereas our model learns everything from the same dataset (we tried pre-training on large corpora, but it did not help).

Encoder pre-training #1: Corrupted Paragraph Prediction  Summarizing a paragraph requires understanding how its sentences follow each other to form a coherent narrative. To encourage good paragraph representations we propose a novel pre-training task: classifying whether a paragraph has been corrupted by swapping the order of the i-th sentence with the next sentence. We implemented this by adding a logistic regression layer to the encoder's paragraph output z, and minimizing cross-entropy error with 50% of paragraphs corrupted and 50% unmodified. We refer to this task as Corrupted Paragraph Prediction (CPP).
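A sketch of how CPP training examples could be constructed (the 50/50 corruption rate follows the text; the function name is ours). A logistic regression layer on the paragraph vector z is then trained with cross-entropy against the resulting label.

```python
import random

def make_cpp_example(sentences):
    """Return (possibly corrupted paragraph, label); label 1 = corrupted.

    Corruption swaps sentence i with sentence i+1 for a random i.
    """
    if random.random() < 0.5:
        return list(sentences), 0                      # unmodified paragraph
    i = random.randrange(len(sentences) - 1)
    corrupted = list(sentences)
    corrupted[i], corrupted[i + 1] = corrupted[i + 1], corrupted[i]
    return corrupted, 1
```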

Encoder pre-training #2: Next Sentence Prediction  We propose another pre-training objective encouraging the encoder to understand how sentences follow one another within a paragraph. The objective, referred to as Next Sentence or Same Paragraph (NSSP), shares an idea with BERT's Next Sentence Prediction (NSP) objective (Devlin et al., 2019), namely classifying whether two sentences are adjacent, but is modified to be more difficult. As in BERT, we sample sentence pairs A and B such that 50% of the time B follows A, and otherwise it does not (negatives). However, negative pairs are sampled from within the same paragraph instead of from the whole corpus. This is more challenging, as sentences from the same paragraph are usually more similar and about the same topic. We observed that this harder negative sampling leads to better downstream summarization performance.

Further, our implementation differs from BERT's NSP in the following ways:

• In BERT, each input sequence is constructed by concatenating A and B separated by a <SEP> token. To better differentiate A and B, BERT further employs segment embeddings. In contrast, we encode A and B independently, avoiding the need for the separator token and segment embeddings.

• In BERT, a binary multi-layer perceptron is added during pre-training. Instead of introducing these extra parameters, we directly take the dot product of the two sentence representations followed by a sigmoid function:

   \sigma(\phi_{enc}(A) \cdot \phi_{enc}(B)), \quad A, B \in \mathbb{T}.


This also restricts the capacity of the classifier, forcing all relevant features into z. This implementation is simpler and more generic, since it is not tied to a specific encoder (e.g., BERT/Transformer) architecture.

In contrast to CPP, which learns to encode sentence relationships in paragraph representations, NSSP is taught to encode sentence relationships in individual sentence representations.
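A sketch of NSSP pair sampling and of the dot-product-plus-sigmoid classifier described above, assuming the encoder returns a batch of latent vectors (names are ours):

```python
import random
import torch
import torch.nn.functional as F

def make_nssp_pair(sentences):
    """Sample (A, B, label): label 1 iff B immediately follows A.

    Negatives are drawn from the SAME paragraph (harder than corpus-level negatives).
    """
    i = random.randrange(len(sentences) - 1)
    a = sentences[i]
    if random.random() < 0.5:
        return a, sentences[i + 1], 1
    j = random.choice([k for k in range(len(sentences)) if k not in (i, i + 1)])
    return a, sentences[j], 0

def nssp_loss(enc, ids_a, ids_b, label):
    """Binary cross-entropy on sigmoid(phi_enc(A) . phi_enc(B))."""
    score = (enc(ids_a) * enc(ids_b)).sum(dim=-1)        # dot product of sentence vectors
    target = torch.full_like(score, float(label))
    return F.binary_cross_entropy_with_logits(score, target)
```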

Decoder pre-training: Auto-regressive Language Modeling  We pre-train the decoder with a standard auto-regressive language modeling (LM) objective, similar to Ramachandran et al. (2017). We implement this by setting z = 0 in Equation 1 regardless of the input sequence during pre-training, so that the decoder receives no conditioning signal during teacher-forcing, making the process equivalent to a standard LM task.
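A sketch of the decoder-only LM pre-training step: zeroing out z removes all conditioning, so teacher-forced decoding reduces to ordinary language modeling. This reuses the decoder interface and the seq_ce helper assumed in the Equation 2 sketch above.

```python
import torch

def lm_pretrain_loss(dec, token_ids, zdim=256):
    """Decoder LM objective: an all-zero z gives the decoder no conditioning signal."""
    z = torch.zeros(token_ids.size(0), zdim, device=token_ids.device)
    logits = dec(token_ids[:, :-1], z)        # teacher forcing on the clean sequence
    return seq_ce(logits, token_ids[:, 1:])   # same cross-entropy helper as in the Eq. 2 sketch
```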

5 Experiments

5.1 Metrics

We found human summaries, which are quite abstractive, to be unfairly punished by ROUGE-2 and ROUGE-X precision variants, and thus we only report ROUGE-1 and ROUGE-L recall scores (Lin, 2004). We truncated generated summaries to their first sentence and to a word limit of 20 to prevent preferring models with overly long summaries, similar to Rush et al. (2015), although they used a byte limit. The limitations of ROUGE are well-known; in particular, it favors extractive methods, since paraphrasing and synonyms are not rewarded fairly. However, at the time of writing, ROUGE remains a dominant metric in summarization research (Kryscinski et al., 2019). We treated the 3 summaries per evaluation example equally and report average ROUGE scores over the 500 test examples, using the 500 validation examples for hyper-parameter tuning. We observed little difference between averaged validation and test scores.
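For concreteness, a sketch of the truncation applied to generated summaries before scoring (first sentence, then at most 20 words); the ROUGE computation itself is delegated to whatever ROUGE package is used and is not shown:

```python
import re

def truncate_summary(text, max_words=20):
    """Keep only the first sentence, then clip to at most `max_words` words."""
    first = re.split(r"(?<=[.!?])\s+", text.strip())[0]   # naive sentence split
    return " ".join(first.split()[:max_words])
```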

5.2 Baselines

Extractive  Because the documents consist of only five sentences, for each example there are only five possible sentence-extractive summaries. Thus we computed the performance of selecting sentence i as the summary, for i ∈ {1, ..., 5}, and denote these baselines Extract i. We also computed the maximum achievable sentence-extractive score by selecting, for each example, the best-performing sentence evaluated by ROUGE-1 (recall) against the 3 human summaries; we refer to this as Extract Oracle. Although this method cheats by looking at the test data and may not actually be achievable in practice, it is useful as an estimate of the ceiling of extractive methods.
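A sketch of the Extract-i and Extract Oracle baselines, assuming a rouge1_recall(candidate, references) scorer that averages over the 3 reference summaries (the scorer itself is not shown; names are ours):

```python
def extract_i(story_sentences, i):
    """Extract-i baseline: always pick the i-th sentence (1-indexed) as the summary."""
    return story_sentences[i - 1]

def extract_oracle(story_sentences, references, rouge1_recall):
    """Per-example upper bound: the sentence with the highest ROUGE-1 recall
    against the human references (this peeks at the evaluation data)."""
    return max(story_sentences, key=lambda s: rouge1_recall(s, references))
```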

Human  We estimated human performance by computing (a) the maximum and (b) the average ROUGE between all pairs of human summaries for each evaluation example, and took the mean across examples.

MeanSum-single  Although it was designed for multi-document summarization, we adapted MeanSum (Chu and Liu, 2019) to the single-document case by treating each sentence as a document, similar to Isonuma et al. (2019), where this adaptation is called MeanSum-single. Since MeanSum generates summaries of the same shape as the input documents, this adaptation generates sentence-summaries.

5.3 Training details and hyper-parameters

Text Tokenization  We used a data-driven sub-word tokenizer (Sennrich et al., 2016) with a vocabulary size of 2^15 to convert text into integer ID tokens, followed by a 128-dimensional embedding layer. We shared the input and output embedding layers as in Press and Wolf (2017).

Architecture  We experimented with three base encoder-decoder configurations for SummAE: RNN-RNN, TRF-TRF, and TRF-RNN, where TRF stands for Transformer. Both RNNenc and RNNdec were single-layer GRUs (Chung et al., 2014) with hdim = 512; RNNenc was bidirectional and RNNdec was unidirectional. TRFenc was a stack of 2 Transformer encoder layers, each consisting of a self-attention sub-layer with 8 attention heads and a point-wise feed-forward network with 512 hidden units. TRFdec had the same hyper-parameter setting as TRFenc but was composed of Transformer decoder layers. zdim was set to 256. All model weights, including embeddings, were initialized randomly. Decoding at each time-step was performed greedily using arg-max.

Optimization  We performed gradient descent using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 and a batch size of 64. The models were trained with early stopping, using maximum validation ROUGE-1 (recall).

Figure 3: 2-D t-SNE visualizations of the SummAE latent space. Each blue circle corresponds to a sentence and each red triangle corresponds to a paragraph (best viewed in color). (a) RNN-RNN without critic. (b) RNN-RNN with critic. (c) TRF-RNN with LM pre-training and without critic.

The number of pre-training steps was set to 100,000 when pre-training was employed.

When a critic/discriminator was used, it was implemented as a multi-layer perceptron with 128 hidden units. When token masking was added, we set ps = 0.15 and pm = 0.8. For permuting sentences in paragraphs, we set pperm = 0.5. The critic and noise were only added during the fine-tuning phase.

We experimented with different hyper-parameters for constructing SummAE but found that the reported settings worked best empirically.

5.4 Results and discussion

Table 1 shows ROUGE scores and summary lengths for human and extractive baselines, and for SummAE with the enhancements described in Section 4. For the base encoder-decoder configurations (RNN-RNN, TRF-TRF, and TRF-RNN), token masking and paragraph shuffling were added as noise, but no pre-training was done and no critic was added.

Human and extractive baseline performance  The best extractive sentences were, unsurprisingly, the first and last, being the sentences that introduce the subject and reveal the ending. Extract Oracle is by definition the best, and its performance was considerably higher than extracting any fixed-index sentence. Estimated human performance was much higher than even the Extract Oracle, suggesting an abstractive gap.

Critic effectively restricts summary length  All three base encoder-decoder configurations tended to generate overly long or invalid summaries, as shown by the number of words (close to 40) and sentences (exceeding 1), indicating that β in Equation 1 was likely ignored. Once the critic was added, the model was able to generate short, single-sentence summaries.

To validate our hypothesis that segregation in the latent space was the underlying problem causing long summaries, we visualized the latent space of SummAE RNN-RNN (plots for TRF-TRF and TRF-RNN are similar) in 2-D using t-SNE (Maaten and Hinton, 2008), with and without the critic. In Figure 3a, where there was no critic, sentence and paragraph representations were indeed mapped to completely separate regions, while in Figure 3b the adversarial feedback from the critic effectively merged the two clusters, supporting our hypothesis.

Effect of LM pre-training  As can be seen from the table, LM pre-training always improved model performance while keeping the summaries short: for the three encoder-decoder configurations, it boosted their ROUGE-1 scores from 19.8 to 26.4, 22.9 to 25.8, and 23.2 to 26.4, respectively. Qualitatively, we found that models with LM pre-training generated more fluent summaries as well.

Best encoder-decoder configuration  Interestingly, we observed that TRF-RNN outperformed the other two configurations, which use the same sequential architecture for both encoder and decoder. Similar results were reported by Chen et al. (2018), who found that a hybrid model composed of a Transformer encoder and an RNN decoder worked best in machine translation. Another possible reason is that the decoder here does not need to attend to long-term dependencies, which is where Transformers have major advantages over RNNs.

Table 1: ROUGE recall scores and generated summary lengths (number of words and sentences) for estimated human performance, extractive baselines, MeanSum-single, and variants of SummAE. By default, all SummAE models incorporate token masking and paragraph shuffling, without a critic or any pre-training. TRF stands for Transformer. Superscript letters a, b, c, d denote model pairs that are compared in the human evaluations of Table 2.

Model                            ROUGE-1  ROUGE-L  Num words  Num sentences
Human average                       45.0     37.7       13.9            1.0
Human maximum^d                     52.7     44.1       17.6            1.0
Extractive
  Extract 1                         27.3     24.8        7.9            1.0
  Extract 2                         19.5     16.7        8.9            1.0
  Extract 3                         19.4     16.4        9.0            1.0
  Extract 4                         20.5     17.7        8.7            1.0
  Extract 5                         24.6     20.9        9.3            1.0
  Extract Oracle                    36.7     31.9        9.7            1.0
Abstractive
  SummAE RNN-RNN                    33.7     27.2       41.2            4.2
    + critic^{a,b}                  19.8     17.4        9.5            1.0
      + LM pre-training^{b,c}       26.4     21.6       11.5            1.0
  SummAE TRF-TRF                    22.4     20.3       44.1            5.1
    + critic                        22.9     19.9        9.4            1.0
      + LM pre-training             25.8     21.2       10.2            1.0
  SummAE TRF-RNN                    26.5     22.8       39.7            3.0
    + critic                        23.2     19.9       10.4            1.0
      + LM pre-training             26.4     22.7       10.7            1.0
    + LM pre-training               33.4     27.5       16.3            1.0
      - token masking               30.9     25.1       16.5            1.0
      - paragraph shuffling         31.8     26.5       14.7            1.0
      + CPP pre-training            34.4     28.0       19.6            1.0
      + NSSP pre-training^{c,d}     36.5     29.3       19.4            1.0
  MeanSum-single^a                  15.6     13.4        9.5            1.0
    - token masking                 12.0     10.5        9.4            1.0

Surprisingly, we found that the TRF-RNN variant enhanced with LM pre-training did not require a critic to generate one-sentence summaries and prevent segregation, unlike the other encoder-decoder configurations. Figure 3c visualizes its latent space, where the paragraph and sentence representations were mapped to the same region, suggesting that LM pre-training has the same effect as the critic in this specific configuration. It also outperformed the model with both critic and LM pre-training (33.4 vs. 26.4).

Effect of token masking and paragraph shuffling  By default, both token masking and paragraph shuffling, introduced in Section 4.2, were added as noise to base SummAE models during auto-encoding. Removing either one degraded performance: the ROUGE-1 score decreased from 33.4 to 30.9 when token masking was removed, and to 31.8 when paragraph shuffling was removed. During experimentation, we also found that adding this noise produced more stable results across different runs.

Effect of NSSP & CPP pre-training  SummAE TRF-RNN with only decoder pre-training (33.4) already surpassed the best extractive sentence (Extract 1: 27.3). With encoder pre-training using either NSSP or CPP, we observed a further improvement from 33.4 to 36.5 and to 34.4, respectively.

Our best SummAE model, constructed from a Transformer encoder and an RNN decoder, was pre-trained with the NSSP and LM objectives, followed by denoising auto-encoding of masked and shuffled input sequences. It achieved (ROUGE-1, ROUGE-L) = (36.5, 29.3), significantly outperforming the fixed-index extractive sentences and comparable with Extract Oracle, which looks at the test data.

Human evaluation of model summaries  To validate that making progress on the ROUGE metrics correlates with real progress as judged by humans, we conducted several side-by-side model comparisons, covering the range of ROUGE scores from the low to the high end, using Amazon Mechanical Turk workers. Workers were presented with the 5-sentence story/paragraph along with two model summaries and asked to rate each on two dimensions: fluency and information relevance. To minimize inter-rater noise, scores from 3 distinct workers were collected for each example and averaged. We aggregated results across 100 random examples from the test set. Results showing the average preference of the workers on the two dimensions are presented in Table 2.

Table 2: Human evaluation results comparing model summary pairs side-by-side on fluency and information relevance. Superscript letters a, b, c, d correspond to models in Table 1. The * denotes statistical significance at p < 0.001 using a two-tailed Binomial test (null hypothesis: the models are equally good).

    Model preference                                  Fluency  Information
a   RNN-RNN + critic ≥ MeanSum-single + noise           77%*      80%*
b   RNN-RNN + critic + LM ≥ RNN-RNN + critic            80%*      80%*
c   TRF-RNN + LM + NSSP ≥ RNN-RNN + critic + LM         45%       86%*
d   Human ≥ TRF-RNN + LM + NSSP                         99%*      99%*

We observed that fluency improved significantly from the MeanSum-single model through the RNN-RNN+critic+LM models, while the information aspect continued to improve through our best model, TRF-RNN with NSSP and language model pre-training. Human performance was still far better on both dimensions when compared side-by-side. Additional model samples can be viewed in Appendix Figures 6 and 7.

In addition to decoding a sentence from a paragraph representation, we found it informative to look at reconstructed paragraphs from the auto-encoder, which are also included in Figure 7. The paragraph reconstructions show some coherence, although with some disfluencies and factual inaccuracies that are common with neural generative models. Since the summaries are decoded from the same latent vector as the reconstructions, improving the reconstructions could lead to more accurate summaries.

6 Conclusions

We introduce ROCSumm, a new benchmark task for zero-shot, unsupervised abstractive summarization (UAS) that is useful for iterating and measuring progress on this challenging problem. As one of the first works to approach single-document UAS, we propose a novel neural model, SummAE, based on a denoising auto-encoder, along with several self-supervised pre-training techniques for enhancing the model. While performance is still far behind humans, SummAE outperforms extractive baselines and is a major step toward UAS.

7 Code and data release

The code to reproduce our experimental setup is provided at https://github.com/google-research/google-research/tree/master/summae. We include the code to process the ROCStories dataset into ROCSumm, our train/test splits, the 3000 human summaries used in validation and test evaluation, and the Amazon Mechanical Turk templates used for data collection and evaluation. We also include model training and evaluation code, and the model weights for our best model.

References

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In ICLR.

Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. SEQˆ3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In NAACL-HLT.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.

Eric Chu and Peter Liu. 2019. MeanSum: A neural model for unsupervised multi-document abstractive summarization. In ICML.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS Deep Learning and Representation Learning Workshop.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy. Association for Computational Linguistics.

Rafael Ferreira, Luciano de Souza Cabral, Rafael Dueire Lins, Gabriel Pereira e Silva, Fred Freitas, George Cavalcanti, Rinaldo Lima, Steven Simske, and Luciano Favaro. 2013. Assessing sentence scoring techniques for extractive text summarization. Expert Systems with Applications, 40(14):5755–5764.

Thibault Fevry and Jason Phang. 2018. Unsupervised sentence compression using denoising auto-encoders. In CoNLL.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.

Vishal Gupta and Gurpreet Singh Lehal. 2010. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, 2(3):258–268.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric Xing. 2017. Toward controlled generation of text. In ICML.

Masaru Isonuma, Junichiro Mori, and Ichiro Sakata. 2019. Unsupervised neural single-document summarization of reviews via learning latent discourse structure and its ranking. In ACL.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In ICLR.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In EMNLP.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In ICLR.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, et al. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Workshop on Text Summarization Branches Out.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11):2579–2605.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. In NAACL-HLT.

Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. 2017. LSDSem 2017 shared task: The story cloze test. In EACL Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.

Shashi Narayan, Shay Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In EACL.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In EMNLP.

Alexey Romanov, Anna Rumshisky, Anna Rogers, and David Donahue. 2019. Adversarial decomposition of text representation. In NAACL-HLT.

Alexander Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.

Mike Schuster and Kuldip Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Abigail See, Peter Liu, and Christopher Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In NIPS.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In ICML.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In ICML.

Yaushian Wang and Hung-Yi Lee. 2018. Learning to encode text as human-readable summaries using generative adversarial networks. In EMNLP.

Ronald Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In ACL.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.

A Appendix

A.1 Amazon Mechanical Turk settings

Figure 4 shows the detailed instructions for the ROCStories summarization task and the task user interface. We rewarded $0.15 per summary and restricted worker requirements to the following:

• HIT Approval Rate % > 98

• Location: United States

• Number of HITs Approved > 1000

We used the same worker requirements for the human evaluation experiment discussed in Section 5.4.

The templates used are available at our code repository: https://github.com/google-research/google-research/tree/master/summae.

A.2 Training the adversarially-regularized auto-encoder

The model architecture augmented with a critic is depicted in Figure 5. The critic was implemented as a multi-layer perceptron (MLP) with one hidden layer, parameterized by θD, and trained with a standard classification loss:

   \ell_D(p, \theta_D) = -\mathbb{E}\left[ p_D(y = y(t) \mid \phi_{enc}(t); \theta_D) \right],   (3)

where t is p or a randomly sampled sentence. The auto-encoder loss was augmented with ℓa, encouraging the encoder to fool D into classifying paragraph vectors as sentences:

   \ell_a(p, \theta_E) = -\mathbb{E}\left[ p_D(y = 0 \mid \phi_{enc}(p)) \right],   (4)

   \ell_{aae}(p, \theta_E, \theta_G) = \ell_{rec}(p, \theta_E, \theta_G) + \lambda_a\, \ell_a(p, \theta_E).   (5)

As with Generative Adversarial Networks (Goodfellow et al., 2014), this is a two-player mini-max game, and in practice we alternate training the auto-encoder (with respect to the encoder and decoder parameters, θE and θG) and the discriminator (with respect to θD). However, unlike GANs, we do not generate data from the training domain, in this case textual sequences, and instead operate in the continuous latent space, which allowed us to maintain differentiability.

Algorithm 1: Our model training loop with critic

Input: A collection of paragraphs, P
for p in P do
    step ← step + 1
    if step < Npretrain then
        Update (θE, θG) by gradient descent on ℓrec(p, θE, θG) (Eq. 2)
    else if step mod 2 = 0 then
        Update θD by gradient descent on ℓD(p, θD) (Eq. 3)
    else
        Update (θE, θG) by gradient descent on ℓaae(p, θE, θG) (Eq. 5)
    end
end
Output: Auto-encoder parameters: (θE, θG)
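A runnable-style Python sketch of Algorithm 1, combining the reconstruction loss of Equation 2 with the critic losses of Equations 3-5. Module interfaces and batching are assumed (following the earlier sketches), and optimizer settings follow Section 5.3; this is our illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

def train_with_critic(enc, dec, critic, paragraphs, n_pretrain=100_000,
                      lambda_a=1.0, lr=1e-3):
    """Sketch of Algorithm 1: reconstruction-only warm-up (Eq. 2), then alternate
    critic updates (Eq. 3) with adversarially-augmented auto-encoder updates (Eq. 5)."""
    ae_opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    d_opt = torch.optim.Adam(critic.parameters(), lr=lr)
    for step, p in enumerate(paragraphs, start=1):
        # p is a dict with the tensors expected by reconstruction_loss (Eq. 2 sketch).
        if step < n_pretrain:
            loss, opt = reconstruction_loss(enc, dec, **p), ae_opt               # Eq. 2
        elif step % 2 == 0:
            # Critic update: classify latent vectors (y=1 paragraph, y=0 sentence).
            z_para = enc(p["noisy_para"]).detach()
            z_sent = enc(p["noisy_sents"][0]).detach()   # the paper samples a sentence
            lp, ls = critic(z_para), critic(z_sent)
            logits = torch.cat([lp, ls])
            labels = torch.cat([torch.ones_like(lp), torch.zeros_like(ls)])
            loss, opt = F.binary_cross_entropy_with_logits(logits, labels), d_opt  # Eq. 3
        else:
            # Auto-encoder update: reconstruction plus fooling the critic into
            # labeling paragraph vectors as sentences (y=0).
            d_out = critic(enc(p["noisy_para"]))
            adv = F.binary_cross_entropy_with_logits(d_out, torch.zeros_like(d_out))
            loss, opt = reconstruction_loss(enc, dec, **p) + lambda_a * adv, ae_opt  # Eq. 5
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enc, dec
```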


Figure 4: AMT task instructions.

Figure 5: The auto-encoder model augmented with critic.


Story (paragraph) to summarize:
Ben decided to go for a walk.
But he stepped outside and shivered.
The weather had turned cold overnight!
Ben went back in and put on another sweater.
Now he was ready for his walk!

Human (from 3 different workers)
- Ben put on more layers after noticing the cold and went on a walk.
- Ben went outside and realized it was cold so he put a sweater on.
- Ben put on more layers after noticing the cold and went on a walk.

MeanSum-single + Noise
Ben decided back in to also take him.

RNN-RNN + critic
Ben decided to go for an clean glass.

RNN-RNN + critic + LM
Ben decided to go outside and he was in a dark!

TRF-RNN + NSSP + LM (best model)
Ben decided to go outside for awhile and he was ready for a ride back on and shivering!

Figure 6: For the same story, we show summaries produced by different models used in human evaluation (Table 2).


Sample #1
Story: Bill was always bullied. His bully was a star athlete. One day Bill pushed him back. The bully fell wrong and broke his ankle. It ruined his athletic career.
Summary: Bill was always bullied by a bully and he broke his ankle but it broke.
Reconstruction: Bill was always bullied. One day a bully was bullied. Bill beat his opponent. The ball broke down. Bill was crushed his ankle laced.

Sample #2
Story: Antonio was happy. His new pizza place was doing great business. His pizza place had an edge on all the others. He only used the freshest and best ingredients. His pizza place would be there for a long time.
Summary: Antonio was happy to get his best pizza place and had a great time at the place.
Reconstruction: Antonio was happy to do the best pizza place. is favorite place had a great time. His best friend had an extra place. His place was the best pizza place. His place had a good time with the pizza.

Sample #3
Story: Charles was sure he wouldn't qualify for an auto loan. He was 36 years old and didn't have a car. On a lunch break, he met a Nissan car salesman. The salesman invited Charles to apply for a car loan. Charles was approved and the salesman sold him a Nissan.
Summary: Charles was a salesman and told him he needed a loan from a dealership to buy a car and was approved.
Reconstruction: Charles was a salesman and wanted to quit. Charles applied for a job and wanted a car. Charles applied for a loan and asked him to pay a loan. Charles was offered a loan and asked him to quit. Charles was a salesman and told him he needed a loan.

Sample #4
Story: Sue wasn't feeling well. She had stayed in bed all day. Sue hadn't eaten all day. She barely drank anything. Sue got dehydrated.
Summary: Sue wasn't feeling well in bed because she had barely drank after all.
Reconstruction: Sue wasn't feeling well. She had never drank before. She had all night. Sue got all cold. Sue felt very refreshed.

Sample #5
Story: My boss asked me to find the mean value of a data set. I told him the mean was not appropriate for the data set. He asked me to explain why it was not appropriate. I told him it was because the data was not symmetrical. He thanked me for my brilliant insight into the problem.
Summary: My boss asked me to make the mistake of the data entry for me, so I was pleased to find it.
Reconstruction: My boss asked me to make the mistake of the data entry. He told me what the expectation was making him a special name. He told me it was the wrong size. He told me not to make the decision to be careful. I told me it was the perfect decision for me.

Sample #6
Story: Bob just got a new puppy. Bob's cat did not like the new puppy. They fought with each other constantly. Eventually Bob's pets became more familiar with each other. Bob's new puppy and cat eventually became best friends.
Summary: Bob just got a new puppy to the other cat and was very happy with their new puppy.
Reconstruction: Bob just got a new puppy. They were very cute and a little girl. They got the best cat they'd met with other kids. They eventually got a new puppy. The other person became very lonely now.

Sample #7
Story: Nathan wanted to be a computer programmer. He would code everyday after school. Eventually he got a degree in computer science. He was hired by Microsoft to help program their new operating system. Nathan was a big contribution to Microsoft's development team.
Summary: Nathan wanted to be a great job at all his company's computer and was able to improve his new role.
Reconstruction: Nathan wanted to be a programmer. He worked hard to get a new company. He was hired to improve his computer. He was hired to sponsor a month and was able to own a company. He was able to improve his own company.

Sample #8
Story: Fred has a fever. He was not able to go to work. He had to call out. His work boss was angry. He got fired.
Summary: Fred has a fever and was not able to go to work out of work and got fired.
Reconstruction: Fred has a fever. He was not able to go to work. He got out of work. He was mad. He was able to get out.

Figure 7: Some summaries generated by our best SummAE model along with the reconstructed stories.