
Dataset for Automatic Summarization of Russian News

Ilya Gusev[0000−0002−8930−729X]

Moscow Institute of Physics and Technology, Moscow, Russia
ilya.gusev@phystech.edu

Abstract. Automatic text summarization has been studied in a variety of domains and languages. However, this does not hold for the Russian language. To overcome this issue, we present Gazeta, the first dataset for summarization of Russian news. We describe the properties of this dataset and benchmark several extractive and abstractive models. We demonstrate that the dataset is a valid task for methods of text summarization for Russian. Additionally, we show that the pretrained mBART model is useful for Russian text summarization.

Keywords: Text summarization · Russian language · Dataset · mBART

1 Introduction

Text summarization is the task of creating a shorter version of a document that captures the essential information. Methods of automatic text summarization can be extractive or abstractive.

The extractive methods copy chunks of the original documents to form a summary. In this case, the task usually reduces to tagging of words or sentences. The resulting summary will be grammatically coherent, especially in the case of sentence copying. However, this is not enough for high-quality summarization, as a good summary should paraphrase and generalize the original text.

Recent advances in the field usually rely on abstractive models to get better summaries. These models can generate new words that do not exist in the original texts, which allows them to compress text better via sentence fusion and paraphrasing.

Before the dominance of sequence-to-sequence models [1], the most common approach was extractive. The design of the approach allows us to use classic machine learning methods [2] as well as various neural network architectures such as RNNs [3,4] or Transformers [5], and pretrained models such as BERT [6,8]. The approach can still be useful on some datasets, but modern abstractive methods have outperformed extractive ones on the CNN/DailyMail dataset since Pointer-Generators [7]. Various pretraining tasks such as MLM and NSP (used in BERT [6]) or denoising autoencoding (used in BART [9]) allow models to incorporate rich language knowledge, understand the original document, and generate grammatically correct and reasonable summaries.


In recent years, many new text summarization datasets have been released [7,11,12,13]. XSum [11] focuses on very abstractive summaries; Newsroom [12] has more than a million pairs; Multi-News [13] reintroduces multi-document summarization. However, datasets for any language other than English are still scarce. For Russian, there are only headline generation datasets such as the RIA corpus [14]. The main aim of this paper is to fix this situation by presenting a Russian summarization dataset and benchmarking some of the existing methods on it.

Moreover, we adapted the mBART [10] model (originally used for machine translation) to the summarization task. The BART [9] model was successfully used for text summarization on English datasets, so it is natural to expect mBART to handle the task for all of its training languages.

We believe that text summarization is a vital task for many news agencies and news aggregators. It is hard for humans to compose a good summary, so automation in this area will be useful for news editors and readers. Furthermore, text summarization is one of the benchmarks for general natural language understanding models.

Our contributions are as follows: we introduce the first Russian summarization dataset in the news domain¹. We benchmark extractive and abstractive methods on this dataset to inspire further work in the area. Finally, we adapt the mBART model to summarize Russian texts, and it achieves the best results of all benchmarked models².

2 Data

2.1 Source

There are several requirements for a data source. First, we wanted news summaries, as most of the datasets in English are in this domain. Second, these summaries should be human-generated. Third, no legal issues should exist with the data and its publishing. The last requirement was hard to fulfill, as many news agencies have explicit restrictions on publishing their data and tend not to reply to any inquiries.

Gazeta.ru was one of the agencies that grant explicit permission on their website to use their data for non-commercial purposes. Moreover, they have summaries for many of their articles.

There are also requirements for the content of summaries. A summary should not be a part of the original text, as it would not be a summarization task anymore. It should mention entities and places from the original text but merge and paraphrase some of its sentences.

We collected texts, dates, URLs, titles, and summaries of all their articles from the website's foundation to March 2020. We parsed summaries as the content of a "meta" tag with the "description" property. A small percentage of all articles had a summary.

¹ https://github.com/IlyaGusev/gazeta
² https://github.com/IlyaGusev/summarus
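For illustration, a minimal sketch of pulling such a summary out of a page is shown below, assuming requests and BeautifulSoup; the attribute names mirror the description above rather than the real page markup, and the actual scraper lives in the gazeta repository linked above.

```python
# Hypothetical sketch: extract a summary candidate from an article page.
# Not the project's actual scraper; see the gazeta repository for the real code.
from typing import Optional

import requests
from bs4 import BeautifulSoup


def extract_summary(url: str) -> Optional[str]:
    """Return the content of a "meta" tag with the "description" property, if any."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"property": "description"})
    if tag is None:
        # Some pages expose the description under the "name" attribute instead.
        tag = soup.find("meta", attrs={"name": "description"})
    return tag.get("content") if tag else None
```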


2.2 Cleaning

After scraping, we performed some cleaning. We removed too long or too short summaries and texts, as well as pairs with very high or very low unigram intersection. Moreover, we removed all data earlier than the 1st of June 2010, because the texts in the meta tag before that date were not news summaries. The complete code of the cleaning phase is available online, along with a raw version of the dataset.
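A rough sketch of such filters is given below; the thresholds and the overlap measure are placeholders chosen for illustration, not the exact values used to build Gazeta (those are in the released cleaning code).

```python
# Illustrative cleaning filters; all thresholds below are placeholders,
# not the exact values used for the released dataset.
def unigram_overlap(text_tokens, summary_tokens):
    summary_set = set(summary_tokens)
    return len(summary_set & set(text_tokens)) / max(1, len(summary_set))


def is_valid_pair(text_tokens, summary_tokens, date,
                  min_text=50, max_text=1500, min_summary=15, max_summary=85,
                  min_overlap=0.30, max_overlap=0.95, min_date="2010-06-01"):
    if date < min_date:  # ISO-formatted date string; earlier meta tags are not summaries
        return False
    if not (min_text <= len(text_tokens) <= max_text):
        return False
    if not (min_summary <= len(summary_tokens) <= max_summary):
        return False
    return min_overlap <= unigram_overlap(text_tokens, summary_tokens) <= max_overlap
```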

2.3 Statistics

The resulting dataset consists of 63435 text-summary pairs. To form train, validation, and test datasets, these pairs were sorted by time. We define the first 52400 pairs as the training dataset, the following 5265 pairs as the validation dataset, and the remaining 5770 pairs as the test dataset. It is still essential to randomly shuffle the training dataset before training any models to further reduce time bias.
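The split can be reproduced with a few lines, assuming each pair carries an ISO-formatted date field; only the training part is shuffled.

```python
# Sketch of the time-based split into contiguous train/validation/test chunks.
import random

def split_by_time(pairs):
    pairs = sorted(pairs, key=lambda pair: pair["date"])  # assumes an ISO date field
    train, val, test = pairs[:52400], pairs[52400:57665], pairs[57665:]
    random.shuffle(train)  # shuffle only the training part to reduce time bias
    return train, val, test
```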

Table 1. Dataset statistics after lowercasing

                       Train                Validation           Test
                       Text      Summary    Text      Summary    Text      Summary
Dates                  01.06.10 - 31.05.19  01.06.19 - 30.09.19  01.10.19 - 23.03.20
Texts                  52400                5265                 5770
Unique words           611829    148073     167612    42104      175369    44169
Unique lemmas          282867    63351      70210     19698      75214     20637
Common unique lemmas   60992                19138                20098
Min words              28        15         191       18         357       18
Max words              1500      85         1500      85         1498      85
Avg words              766.5     48.8       772       55         750.3     53.2
Avg sentences          37.2      2.7        38.5      3.0        37.0      2.9
Avg unique words       419.1     41.3       424       46         415.7     45.1
Avg unique lemmas      350.0     40.2       352       45         345.4     43.9

Fig. 1. Distribution of documents by the number of tokens in a text


Fig. 2. Distribution of documents by the number of tokens in a summary

To evaluate the dataset's bias towards extractive or abstractive methods, we measured how many novel n-grams the summaries have. The results are presented in Table 2 and show that more than 65% of summary bi-grams do not appear in the original text. This number decreases to 58% if we account for different word forms and compute it on lemmatized bi-grams. Although we cannot directly compare these numbers with CNN/DailyMail or any other English dataset, as this statistic is heavily language-dependent, we note that it is 53% for CNN/DailyMail and 83% for XSum. From this, we conclude that a bias towards extractive methods may exist.
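A sketch of this statistic follows; whether the original computation counts distinct n-grams or n-gram occurrences is an assumption here (the sketch uses distinct n-grams over lowercased tokens).

```python
# Share of summary n-grams that never occur in the source text.
# Counting distinct n-grams (rather than occurrences) is an assumption of this sketch.
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(text_tokens, summary_tokens, n=2):
    summary_ngrams = ngram_set(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngram_set(text_tokens, n)
    return len(novel) / len(summary_ngrams)
```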

Another way to evaluate the abstractiveness is by calculating metrics of oracle summaries. To evaluate all benchmark models, we used ROUGE [22] metrics. For CNN/DailyMail, oracle summaries score 31.2 ROUGE-2-F [8], and for our dataset, it is 23.0 ROUGE-2-F.

Table 2. Average % of novel n-grams

                        Train   Val     Test
Uni-grams               34.2    30.5    30.6
Lemmatized uni-grams    21.4    17.8    17.6
Bi-grams                68.6    65.0    65.5
Lemmatized bi-grams     61.4    58.0    58.5
Tri-grams               84.5    81.5    81.9

2.4 BPE

We extensively utilized byte-pair encoding (BPE) tokenization in most of the described models. For Russian, models that use BPE tokenization perform better than those that use word tokenization, as it handles rich morphology and decreases the number of unknown tokens. The encoding was trained on the training dataset using the sentencepiece [25] library.
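Training such a model takes a few lines with the sentencepiece Python API; the vocabulary size below is a placeholder, not necessarily the one used for the benchmarked models.

```python
# Sketch of BPE training with sentencepiece; the vocabulary size is illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_texts.txt",      # one text per line, from the training split only
    model_prefix="gazeta_bpe",
    vocab_size=30000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="gazeta_bpe.model")
pieces = sp.encode("пример текста для токенизации", out_type=str)
```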


2.5 Lowercasing

We lower-cased all texts and summaries in most of our experiments. It is a controversial decision. On the one hand, we reduced vocabulary size and focused on the essential properties of the models, but on the other hand, we lost some information that could be important for a model. Moreover, if we consider possible end-users of our summarization system, it is better to generate summaries in the original case.

We provide a non-lower-cased version of the dataset as the main version for possible future research.

3 Benchmark methods

We used several groups of methods. TextRank [15] and LexRank [16] are fully unsupervised extractive summarization methods. SummaRuNNer [4] is a supervised extractive method. PG [7], CopyNet [20], and mBART [10] are abstractive summarization methods.

3.1 Unsupervised methods

This group of methods does not have any access to reference summaries and utilizes only original texts. All of the considered methods in this group extract whole sentences from the text, not separate words.

TextRank. TextRank [15] is a classic graph-based method for unsupervised text summarization. It splits a text into sentences, calculates the similarity matrix for every distinct pair of them, and applies the PageRank algorithm to obtain a final score for every sentence. After that, it takes the best sentences by score as the predicted summary. We used the TextRank implementation from the summa [17]³ library. It defines sentence similarity as a function of the count of common words between sentences and the lengths of both sentences.
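Usage of the summa implementation is straightforward; the ratio below is illustrative, and Russian stop-word support via the language argument is assumed to be present in the installed version of the library.

```python
# Sketch of TextRank summarization with the summa library.
from summa.summarizer import summarize

text = "Первое предложение новости. Второе предложение. Третье предложение с деталями."
# The ratio of sentences to keep is illustrative; "russian" stop words are assumed
# to be available in the installed version of summa.
summary = summarize(text, language="russian", ratio=0.3)
print(summary)
```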

LexRank. Continuous LexRank [16] can be seen as a modification of TextRank that utilizes TF-IDF of words to compute sentence similarity as IDF-modified cosine similarity. There is a continuous version that uses the original similarity matrix and a base version that performs binary discretization of this matrix by a threshold. We used the LexRank implementation from the lexrank Python package⁴.
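A minimal sketch with this package is shown below; it estimates IDF weights from a background corpus, and the availability of Russian stop words under the "ru" key is an assumption about the installed version.

```python
# Sketch of LexRank summarization with the lexrank package.
from lexrank import LexRank, STOPWORDS

# A tiny background corpus for IDF estimation; in practice, the training texts.
documents = [
    ["первое предложение первого документа.", "второе предложение."],
    ["предложение из другого документа.", "и ещё одно предложение."],
]
lxr = LexRank(documents, stopwords=STOPWORDS["ru"])  # "ru" key is assumed to exist

article_sentences = [
    "первое предложение статьи.",
    "второе предложение статьи с деталями.",
    "третье предложение.",
]
summary = lxr.get_summary(article_sentences, summary_size=2, threshold=0.1)
```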

³ https://github.com/summanlp/textrank
⁴ https://github.com/crabcamp/lexrank


LSA. Latent semantic analysis can be used for text summarization [21]. It constructs a term-by-sentence matrix with term frequencies, applies singular value decomposition to it, and searches for the maximum values in the right singular vectors. This corresponds to finding the best sentence describing the k-th topic. We evaluated this method with the sumy library⁵.
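A minimal sketch with sumy follows; Russian tokenization support (which relies on NLTK data) is assumed to be available in the installed version.

```python
# Sketch of LSA summarization with the sumy library.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

text = "Первое предложение новости. Второе предложение. Третье предложение с деталями."
parser = PlaintextParser.from_string(text, Tokenizer("russian"))
summary_sentences = LsaSummarizer()(parser.document, sentences_count=2)
print(" ".join(str(sentence) for sentence in summary_sentences))
```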

3.2 Extractive methods

A method in this group has access to reference summaries, and the task is cast as binary sentence classification. For every sentence in the original text, the algorithm must decide whether to include it in the predicted summary.

To perform the reduction to this task, we used a greedy algorithm similar to those in the SummaRuNNer paper [4] and the BertSumExt paper [8]. The algorithm generates a summary consisting of multiple sentences that maximize the ROUGE-2 score against the reference summary.
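A sketch of such a greedy oracle is given below; the ROUGE scorer and the cap on the number of selected sentences are assumptions of the sketch, not necessarily the exact procedure used here.

```python
# Greedy oracle sketch: keep adding the sentence that most improves ROUGE-2 F
# against the reference, and stop when no sentence improves it any further.
# rouge_2_f is assumed to be any callable returning ROUGE-2 F for two strings.
def build_oracle_summary(sentences, reference, rouge_2_f, max_sentences=3):
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_candidate = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            candidate = selected + [i]
            hypothesis = " ".join(sentences[j] for j in sorted(candidate))
            score = rouge_2_f(hypothesis, reference)
            if score > best_score:
                best_score, best_candidate = score, candidate
        if best_candidate is None:
            break
        selected = best_candidate
    return sorted(selected), best_score
```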

SummaRuNNer. SummaRuNNer [4] is one of the simplest and yet effective neural approaches to extractive summarization. It uses a 2-layer hierarchical RNN and positional embeddings to choose a binary label for every sentence. We used our own implementation on top of the AllenNLP [19]⁶ framework, along with a Pointer-Generator [7] implementation.

3.3 Abstractive methods

All of the tested models are based on the sequence-to-sequence framework. Pointer-generator and CopyNet were trained only on our training dataset, and mBART was pretrained on texts of 25 languages extracted from Common Crawl. We performed no additional pretraining, though it is possible to utilize Russian headline generation datasets here.

Pointer-generator. Pointer-generator [7] is a modification of a sequence-to-sequence RNN model with attention [18]. In the generation phase, it samples words not only from the vocabulary but also from the source text, based on the attention distribution. Furthermore, there is a second modification, the coverage mechanism, which prevents the model from attending to the same places many times and thus reduces repetition in the summary.

CopyNet. CopyNet [20] is another variation of a sequence-to-sequence RNN model with attention, with a slightly different copying mechanism. We used the stock implementation from AllenNLP [19].

⁵ https://github.com/miso-belica/sumy
⁶ https://github.com/allenai/allennlp


mBART for summarization. BART [9] and mBART [10] are sequence-to-sequence Transformer models with an autoregressive decoder trained on the denoising autoencoding task. Unlike preceding pretrained models like BERT, they focus on text generation even in the pretraining phase.

mBART is pretrained on monolingual corpora for 25 languages, including Russian. In the original paper, it was successfully used for machine translation, and BART was used for text summarization, so it is natural to try a pretrained mBART model for Russian summarization.

We used training and prediction scripts from fairseq [27]⁷. However, it is possible to convert the model for use within HuggingFace's Transformers⁸. We had to truncate the input of every text to 600 tokens to fit the model into GPU memory. We also used the <unk> token instead of language codes to condition mBART.
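For illustration, a generation sketch with a converted checkpoint in HuggingFace's Transformers might look as follows; the checkpoint path and generation parameters are placeholders, not the exact fairseq setup used for the reported results.

```python
# Hypothetical sketch: summary generation with a converted mBART checkpoint
# via HuggingFace Transformers. The path and generation parameters are placeholders.
from transformers import MBartTokenizer, MBartForConditionalGeneration

model_path = "path/to/converted-mbart-checkpoint"  # hypothetical local path
tokenizer = MBartTokenizer.from_pretrained(model_path)
model = MBartForConditionalGeneration.from_pretrained(model_path)

article = "Текст новости, который нужно сжать в короткое резюме."
inputs = tokenizer(article, max_length=600, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=5,
    max_length=128,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```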

4 Results

We measured the quality of summarization with three sets of automatic metrics: ROUGE [22], BLEU [23], and METEOR [24]. All of them are used in various text generation tasks. ROUGE and METEOR are more prevalent in text summarization research, while BLEU is the primary automatic metric in machine translation.

We lower-cased and tokenized the predicted and reference summaries with the razdel tokenizer to unify the methodology across all models.
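The preprocessing step can be sketched as follows; razdel provides the tokenizer, and the rouge package stands in here for whichever ROUGE implementation is used, so absolute scores may differ slightly.

```python
# Sketch of the evaluation preprocessing: lowercase and tokenize with razdel,
# then score with a ROUGE implementation (the `rouge` package is one possible choice).
from razdel import tokenize
from rouge import Rouge

def preprocess(text):
    return " ".join(token.text for token in tokenize(text.lower()))

predictions = ["сгенерированное резюме новости."]
references = ["эталонное резюме новости."]

scores = Rouge().get_scores(
    [preprocess(p) for p in predictions],
    [preprocess(r) for r in references],
    avg=True,
)
print(scores["rouge-1"], scores["rouge-2"], scores["rouge-l"])
```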

We provide all the results in Table 3. Lead-1 and lead-2 are the most basic baselines, where we take the first one or two sentences of every text as the summary. The oracle summarization is an upper bound for extractive methods.

Table 3. Automatic scores for all models

                       ROUGE-1  ROUGE-2  ROUGE-L  BLEU   METEOR
Lead-1                 27.9     12.1     21.3     20.3   23.7
Lead-2                 22.7     9.1      14.9     13.2   21.4
Greedy Oracle          44.9     23.0     40.1     54.2   36.0
TextRank               21.4     6.3      16.4     28.6   17.5
LexRank                23.7     7.8      19.9     37.7   18.1
LSA                    19.3     5.0      15.0     30.7   15.2
SummaRuNNer            31.7     13.9     27.2     45.7   26.3
CopyNet                28.7     12.3     23.6     37.2   21.0
PG small               29.4     12.7     24.6     38.8   21.2
PG words               29.4     12.6     24.4     35.9   20.9
PG big                 29.6     12.8     24.6     39.0   21.5
PG small + coverage    30.2     12.9     26.0     42.8   22.7
Finetuned mBART        32.1     14.2     27.9     50.1   25.7

⁷ https://github.com/pytorch/fairseq
⁸ https://github.com/huggingface/transformers


Unsupervised methods give summaries that are very dissimilar to the original ones. The extracted summaries win in BLEU over the baseline but lose in the other metrics. The reason for this is the different length penalties of BLEU, ROUGE, and METEOR, and precision-recall differences. LexRank was the best unsupervised method in our experiments.

The SummaRuNNer model has the best METEOR score and high BLEU and ROUGE scores. In Figure 3, one can see that SummaRuNNer has a bias towards sentences at the beginning of the text compared to the oracle summaries. In contrast, LexRank sentence positions are almost uniformly distributed, except for the first sentence.

Fig. 3. Proportion of extracted sentences according to their position in the original document.

It seems that more complex extractive models should perform better on this dataset, but unfortunately we did not have time to verify this.

As for abstractive models, mBART has the best result among all the models in terms of ROUGE and BLEU. However, Figure 4 shows that it has fewer novel n-grams than Pointer-Generator with coverage. Consequently, it has worse extraction and plagiarism scores [26] (Table 4).

We also performed a side-by-side annotation of mBART and human summaries with Yandex.Toloka⁹, a Russian crowdsourcing platform. We sampled 1000 pairs from the test dataset. Nine people annotated every pair. We asked them which summary is better and provided three options: left summary wins, draw, right summary wins. The side of the human summary was random.

⁹ https://toloka.yandex.ru/


Table 4. Extraction scores

                       Extraction score  Plagiarism score
Reference              0.031             0.124
PG small + coverage    0.325             0.501
Finetuned mBART        0.332             0.502
SummaRuNNer            0.513             0.662

Fig. 4. Proportion of novel n-grams in model-generated summaries


Annotators were required to pass training and an exam, and their work was continuously evaluated through control pairs ("honeypots").

Table 5. Human side-by-side evaluation

Votes for winner   Reference wins   mBART wins
Majority           265              735
9/9                7                47
8/9                18               106
7/9                30               185
6/9                54               200
5/9                123              180
4/9                32               17
3/9                1                0

Table 5 shows the results of the annotation. There were no full draws, so we do not include them in the table. mBART wins in more than 73% of cases. We cannot conclude from these results alone that it performs at a superhuman level. We did not ask the annotators to evaluate the abstractiveness of the summaries. The reference summaries are often provocative and subjective. In contrast, mBART generates highly extractive summaries without errors and with many essential details, and annotators tend to prefer them. The annotation task should be changed to account for the abstractiveness of the models. Even so, that is an excellent result for mBART.

Table 6 shows examples of mBART summaries that lost against reference summaries. In the first example, there is an unnamed entity in the first sentence, "by them" ("ими"). In the second example, there is a factual error and a repetition. In the last example, the last sentence is not cohesive.

5 Conclusion

We present the first corpus for text summarization in the Russian language. We demonstrate that most of the methods for text summarization in English work well for Russian without any special modifications. Moreover, mBART performs exceptionally well even though it was not initially designed for text summarization in the Russian language.

We wanted to extend the dataset using data from other sources, but in most cases there were significant legal issues, as most of the sources explicitly forbid any publishing of their data, even for non-commercial purposes.

In future work, we will pre-train BART ourselves on standard Russian text collections and open news datasets. Furthermore, we will try headline generation as a pretraining task for this dataset. We believe this will increase the performance of the models.


Table 6. mBART summaries that lost 9/9

the identification method developed by them can extract proteins specific to a particular person from a strand of hair only a centimeter long. this will make it possible to identify people with a high degree of accuracy even without extracting DNA.

russian president vladimir putin, at a meeting with veterans and representatives of public patriotic associations, stated that every year one-time payments for victory day amount to 10 thousand rubles for veterans and 5 thousand rubles for home-front workers. 50 thousand rubles will also be paid to home-front workers. earlier, in his address to the federal assembly, the president also emphasized the importance of the upcoming anniversary of the great patriotic war.

the loneliest actor in hollywood has finally, officially, found a partner. keanu reeves, who for many years preferred not to talk about his personal life and after a long-ago tragedy decided not to have children, came to a society event with 46-year-old los angeles artist alexandra grant, which caused a stir among journalists. at the lacma art + film gala, held with the support of gucci, actor keanu reeves got a girlfriend for the first time in the last 20 years. according to the artist, reeves, who for several years has drawn the sympathy of social media users with photos from his birthday celebrations, also rarely moves in such circles.

References

1. Sutskever, I., Vinyals, O., Le, Q.: Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, pp. 3104–3112, MIT Press, Cambridge (2014).

2. Wong, K., Wu, M., Li, W.: Extractive Summarization Using Supervised and Semi-supervised Learning. In: Proceedings of the 22nd International Conference on Computational Linguistics, pp. 985–992, Coling 2008 Organizing Committee (2008).

3. Hochreiter, S., Schmidhuber, J.: Long short-term memory. In: Neural Computation, vol. 9, issue 8, pp. 1735–1780 (1997).

4. Nallapati, R., Zhai, F., Zhou, B.: SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 3075–3081 (2017).

5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017).

6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 4171–4186, Minneapolis, Minnesota (2019).

7. See, A., Liu, P., Manning, C.: Get To The Point: Summarization with Pointer-Generator Networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1073–1083, Association for Computational Linguistics, Vancouver (2017).


8. Liu, Y., Lapata, M.: Text Summarization with Pretrained Encoders. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3730–3740, Association for Computational Linguistics, Hong Kong (2019).

9. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4003–4015, Association for Computational Linguistics, Hong Kong (2019).

10. Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L.: Multilingual Denoising Pre-training for Neural Machine Translation. arXiv preprint arXiv:2001.08210 (2020).

11. Narayan, S., Cohen, S., Lapata, M.: Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels (2018).

12. Grusky, M., Naaman, M., Artzi, Y.: NEWSROOM: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, New Orleans (2018).

13. Fabbri, A., Li, I., She, T., Li, S., Radev, D.: Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1074–1084, Association for Computational Linguistics, Florence (2019).

14. Gavrilov, D., Kalaidin, P., Malykh, V.: Self-attentive Model for Headline Generation. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol. 11438. Springer, Cham (2019).

15. Mihalcea, R., Tarau, P.: TextRank: Bringing Order into Text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411, Association for Computational Linguistics, Barcelona (2004).

16. Erkan, G., Radev, D.: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. In: Journal of Artificial Intelligence Research, vol. 22, issue 1, AI Access Foundation (2004).

17. Barrios, F., Lopez, F., Argerich, L., Wachenchauzer, R.: Variations of the Similarity Function of TextRank for Automated Summarization. arXiv preprint arXiv:1602.03606 (2016).

18. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representations (2015).

19. Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L.: AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv preprint arXiv:1803.07640 (2018).

20. Gu, J., Lu, Z., Li, H., Li, V.: Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1631–1640, Association for Computational Linguistics (2016).


21. Gong, Y., Liu, X.: Generic text summarization using relevance measure and latent semantic analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 19–25 (2001).

22. Lin, C.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81, Barcelona (2004).

23. Papineni, K., Roukos, S., Ward, T., Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002).

24. Denkowski, M., Lavie, A.: Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In: Proceedings of the EACL 2014 Workshop on Statistical Machine Translation (2014).

25. Kudo, T., Richardson, J.: SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71 (2018).

26. Cibils, A., Musat, C., Hossmann, A., Baeriswyl, M.: Diverse beam search for increased novelty in abstractive summarization. arXiv preprint arXiv:1802.01457 (2018).

27. Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., Auli, M.: fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In: Proceedings of NAACL-HLT 2019: Demonstrations (2019).