
A Deep Cascade Model for Multi-Document Reading Comprehension

Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang, Haiqing Chen

Alibaba Group
{ym119608, jiangnan.xjn, wuchen.wc, b.bi}@alibaba-inc.com

{zhongzhou.zhaozz, zj122146, luo.si, masi.wr, hebian.ww, haiqing.chenhq}@alibaba-inc.com

Abstract

A fundamental trade-off between effectiveness and efficiency needs to be balanced when designing an online question answering system. Effectiveness comes from sophisticated functions such as extractive machine reading comprehension (MRC), while efficiency is obtained from improvements in preliminary retrieval components such as candidate document selection and paragraph ranking. Given the complexity of the real-world multi-document MRC scenario, it is difficult to jointly optimize both in an end-to-end system. To address this problem, we develop a novel deep cascade learning model, which progressively evolves from the document-level and paragraph-level ranking of candidate texts to more precise answer extraction with machine reading comprehension. Specifically, irrelevant documents and paragraphs are first filtered out with simple functions for efficiency consideration. Then we jointly train three modules on the remaining texts for better tracking the answer: document extraction, paragraph extraction and answer extraction. Experiment results show that the proposed method outperforms previous state-of-the-art methods on two large-scale multi-document benchmark datasets, i.e., TriviaQA and DuReader. In addition, our online system can stably serve typical scenarios with millions of daily requests in less than 50ms.

Introduction

Machine reading comprehension (MRC), which empowers computers with the ability to read and comprehend knowledge and then answer questions from textual data, has made rapid progress in recent years. From the early cloze-style test (Hermann et al. 2015; Hill et al. 2015) to answer extraction from a single paragraph (Rajpurkar et al. 2016), and to the more complex open-domain question answering from web data (Joshi et al. 2017; Nguyen et al. 2016), great efforts have been made to push the MRC technique to more practical applications.

The rapid progress of MRC in recent years mostly owes to the release of the single-paragraph benchmark dataset SQuAD (Rajpurkar et al. 2016), on which various deep attention-based methods have been proposed to constantly push the state-of-the-art performance (Seo et al. 2016; Wang et al. 2017c; Yu et al. 2018). It is a significant milestone that several MRC models have exceeded the performance of human annotators on the SQuAD dataset [1]. However, the SQuAD dataset makes a strong assumption that the answers are contained in the given paragraphs. Besides, the paragraphs are rather short, approximately 200 words on average, while a real-world scenario usually involves multiple documents of much longer length. Therefore, several recent studies (Joshi et al. 2017; Clark and Gardner 2017; Tan et al. 2017) begin to re-design the task into more realistic settings: the MRC models are required to read and comprehend multiple documents to reach the final answer.

In multi-document MRC, depending on the way of combining the two components, document selection and extractive reading comprehension, there are two categories of approaches: 1) the pipeline approach treats document selection and extractive reading comprehension as two separate parts, where a document is first selected through document ranking and then passed to the MRC model for extracting the final answer (Joshi et al. 2017; Wang et al. 2017a); 2) several recent studies (Tan et al. 2017; Clark and Gardner 2017; Wang et al. 2018) adopt a joint learning method that optimizes both sub-tasks in a unified framework simultaneously.

The pipeline method relies heavily on the quality of the document ranking module. When it fails to rank the relevant documents higher, or filters out the ones that contain the correct answers, the downstream MRC module has no way to recover and extract the answers of interest. For the joint learning method, it is computationally expensive to jointly optimize both tasks with all the documents. This computation cost limits its application to operational online environments, such as Amazon [2] and Taobao [3], where efficiency is a critical factor.

To address the above problems, we propose a deep cascade model which combines the advantages of both methods in a coarse-to-fine manner. The deep cascade model is designed to properly balance effectiveness and efficiency. At the early stages of the model, simple features and ranking functions are used to select a candidate set of the most relevant contents, filtering out the irrelevant documents and paragraphs as much as possible.

[1] https://rajpurkar.github.io/SQuAD-explorer/
[2] https://www.amazon.com/
[3] https://www.taobao.com/



Then the selected paragraphs are passed to the attention-based deep MRC model for extracting the actual answer span at the word level. To better support the answer extraction, we also introduce document extraction and paragraph extraction as two auxiliary tasks, which helps to quickly narrow down the entire search space. We jointly optimize all three tasks in a unified deep MRC model, which shares some common bottom layers. This cascaded structure enables the model to perform coarse-to-fine pruning at different stages, so that better models can be learnt effectively and efficiently.

The overall framework of our model is demonstrated in Figure 1, and consists of three modules: document retrieval, paragraph retrieval and answer extraction. The first module takes the question and a collection of raw documents as input. The module at each subsequent stage consumes the output of the previous stage and further prunes the documents, paragraphs and answer spans given the question. For each of the first two modules, we define a ranking function and an extraction function. The ranking function is first used as a preliminary filter to discard most of the irrelevant documents or paragraphs, so as to keep our framework efficient. The extraction function is then designed to handle the auxiliary document and paragraph extraction tasks, which are jointly optimized with the final answer extraction module for better extraction performance. The local ranking functions in the different modules gradually increase in cost and complexity, to properly balance effectiveness and efficiency.
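To make the three-stage data flow concrete, here is a minimal Python sketch of the cascade. The toy word-overlap scorer and every function name are hypothetical stand-ins for the ranking functions and the deep MRC model described in the following sections, not the authors' implementation; only the control flow mirrors the paper.

```python
# Hypothetical end-to-end sketch of the three cascade stages.
# A toy word-overlap score stands in for the real ranking functions
# and for the deep MRC model.

def word_overlap(question, text):
    """Toy relevance score: fraction of question words found in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / max(len(q_words), 1)

def answer_question(question, documents, K=4, N=2):
    # Stage 1: cheap document ranking, keep only the top-K documents.
    top_docs = sorted(documents,
                      key=lambda doc: word_overlap(question, " ".join(doc)),
                      reverse=True)[:K]
    # Stage 2: paragraph ranking within each kept document, top-N per doc.
    candidates = []
    for doc in top_docs:
        ranked = sorted(doc, key=lambda p: word_overlap(question, p),
                        reverse=True)[:N]
        candidates.extend(ranked)
    # Stage 3: the expensive joint MRC model would extract a span here;
    # the stub simply returns the best-matching remaining paragraph.
    return max(candidates, key=lambda p: word_overlap(question, p))

docs = [["Paris is the capital of France.", "It hosts the Louvre museum."],
        ["Berlin is the capital of Germany."]]
print(answer_question("What is the capital of France ?", docs))
```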

The main contributions can be summarized as follows:

• We propose a deep cascade learning framework to address the practical multi-document machine reading comprehension task, which considers both effectiveness and efficiency in a coarse-to-fine manner.

• We incorporate the auxiliary document extraction and paragraph extraction tasks into the pure answer span prediction, which helps to narrow down the search space and improves the final extraction result in the multi-document MRC scenario.

• We conduct extensive experiments on two large-scale multi-document MRC benchmark datasets: TriviaQA (Joshi et al. 2017) and DuReader (He et al. 2017). The results show that our deep cascade model outperforms the previous state-of-the-art on both datasets. Besides, the proposed model has been successfully applied in our online system, where it stably serves various scenarios with a response time of less than 50ms.

Related Work

Machine Reading Comprehension

Recently, we can see emerging interest in multi-document MRC research (Nguyen et al. 2016; Clark and Gardner 2017; Wang et al. 2017b; He et al. 2017; Wang et al. 2018), where multiple documents are given as input. There are two categories of approaches: pipeline-based approaches and joint learning models. The pipeline approach first selects a single document via ranking and then passes it to the MRC model to extract the precise answer (Joshi et al. 2017; Wang et al. 2017a).

Figure 1: The overall framework of our deep cascade model, which consists of the document retrieval, paragraph retrieval and answer extraction modules.

This approach places a heavy burden on the document ranking model: the downstream MRC model has no way to extract the right answer if the relevant documents are missed. The joint learning approaches take all the documents into consideration and extract the answer by comparing candidates across documents (Clark and Gardner 2017; Tan et al. 2018; Wang et al. 2018). (Clark and Gardner 2017) propose a confidence-based method with a shared normalization training objective, which enables the model to produce globally correct output. (Tan et al. 2018) propose an extraction-then-synthesis framework, incorporating passage ranking into answer span prediction. (Wang et al. 2018) further propose a verification method that uses the answers extracted from different documents to verify each other for more accurate prediction. However, taking all the documents into consideration inevitably brings more computation cost, which can be unbearable in an operational online environment. Our deep cascade model serves as a proper trade-off between the pipeline method and the joint learning method. Its coarse-to-fine structure eliminates irrelevant documents and paragraphs in the early stages with simple features and models, and then identifies the most relevant answers with a well-designed multi-task deep MRC model on the remaining content.

Cascade Learning

In designing online systems, the trade-off between effectiveness and efficiency remains a long-standing problem. Cascade learning is a strategy that can better balance the two: it utilizes a sequence of functions in different stages and allows different sets of features for different instances. It was first introduced for traditional classification and detection problems such as fast visual object detection (Schneiderman 2004; Bourdev and Brandt 2005), and then widely applied in ranking applications for achieving high top-k rank effectiveness in an efficient manner (Lefakis and Fleuret 2010; Wang, Lin, and Metzler 2011; Liu et al. 2017). (Wang, Lin, and Metzler 2011) use an AdaBoost-style framework with two independent ranking functions in each stage, one for pruning the input ranked documents and the other for refining the rank order.

We apply the idea of cascade learning to machine reading comprehension, moving from a preliminary document-level and paragraph-level ranking of the candidate texts to a more precise answer span extraction. The extracted answer spans are progressively narrowed down across the different levels, and the ranking and extraction functions progressively increase in complexity for more precise answer prediction.


Figure 2: The deep attention-based multi-task MRC model.


The Deep Cascade Model

Following the overview in Figure 1, our approach consists of three cascade modules: document retrieval, paragraph retrieval and answer extraction. The cascade ranking functions in the first two modules aim to quickly filter out irrelevant document content based on basic statistical and structural features, and obtain a coarse ranking for the candidate documents. For the remaining document content, we design three extraction tasks at different granularities, with the goal of simultaneously extracting the right document, the right paragraph and the answer span. A deep attention-based MRC model is designed to jointly optimize all three extraction tasks by sharing the common bottom layers, as shown in Figure 2. The final answer is thus determined not only by the answer span prediction score, but also by the corresponding document and paragraph prediction scores.

Cascade Ranking Functions

Given a question $Q$ and a set of candidate documents $\{D_i\}$, we first introduce the cascade ranking functions of the first two modules for pruning the documents, which gradually increase in complexity.

Document Ranking This part aims at quickly filtering out irrelevant documents and obtaining a coarse ranking of the candidate documents. We utilize traditional information retrieval measures, such as BM25 and TF-IDF distance, to measure the relevance between the question and a document. The matching is conducted on the textual metadata of the question and the document, including the document title and main content. Besides, the recall ratio of the question words within the document metadata is used as another feature to indicate the relevance of the document.

To learn the importance of the different features, we use a learning-to-rank model to assign a weighted relevance score to each retrieved document. By design, the first stage needs to be quick and simple, so we cast the task as a binary classification problem and adopt pointwise logistic regression as the ranking function. The documents containing the answer are labeled as positive. After this ranking, we only keep the top-K ranked documents for further processing.
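As a concrete illustration of this first-stage ranker, here is a minimal sketch assuming scikit-learn; the three feature columns (BM25 score, TF-IDF similarity, question-word recall) and all numbers are toy stand-ins, not the production features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [bm25_score, tfidf_similarity, question_word_recall]
# computed between a question and one candidate document (toy values).
X_train = np.array([[12.3, 0.61, 0.80],
                    [ 1.2, 0.05, 0.10],
                    [ 9.8, 0.47, 0.67],
                    [ 0.4, 0.02, 0.00]])
# Label is 1 iff the document contains the gold answer.
y_train = np.array([1, 0, 1, 0])

ranker = LogisticRegression().fit(X_train, y_train)

# At serving time, score all candidates and keep the top-K documents.
X_test = np.array([[7.5, 0.40, 0.50], [2.1, 0.10, 0.20]])
scores = ranker.predict_proba(X_test)[:, 1]
K = 4
top_k = np.argsort(-scores)[:K]
print(top_k, scores[top_k])
```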

Paragraph Ranking This part aims at quickly discarding irrelevant content within each document at the paragraph level. Specifically, given an output document $D_i = \{P_{ij}\}$ from the previous stage, we first prune the noisy paragraphs that have no word or entity match with the question. Simple question-paragraph textual matching features are also extracted, as in document ranking. Moreover, the document structure can carry inherent information; for example, the first paragraph of a document may tend to contain more informative content, acting as an abstract. Therefore, we also add structural features, such as whether the paragraph is the first or last paragraph of the document, the length of the paragraph, and the lengths of the previous and subsequent paragraphs. To better understand the question, we also incorporate the question type as several binary features when it is given, e.g., for the DuReader dataset.

To better combine the different kinds of features, we adopt the scalable tree boosting method XGBoost (Chen and Guestrin 2016) for ranking, which is widely used to achieve state-of-the-art results on many large-scale machine learning challenges. Again, we use the binary logistic loss for model training and label the paragraphs containing the answer as positive. As a result, we select the top-N paragraphs from each document for the subsequent answer prediction.
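A corresponding sketch of the second-stage ranker, assuming the xgboost Python package; the feature rows (a match score plus structural flags and lengths) and the labels are again toy values:

```python
import numpy as np
import xgboost as xgb

# Toy feature rows per (question, paragraph) pair: a matching feature plus
# structural ones (is_first_paragraph, is_last_paragraph, paragraph_length).
X = np.array([[0.8, 1, 0, 120],
              [0.1, 0, 0,  45],
              [0.5, 0, 1,  80],
              [0.9, 1, 0, 200]], dtype=float)
y = np.array([1, 0, 0, 1])  # 1 iff the paragraph contains the answer

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=50)

scores = model.predict(xgb.DMatrix(X))
N = 2
print(np.argsort(-scores)[:N])  # keep the top-N paragraphs per document
```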

Multi-task Deep Attention Model

Given the paragraphs selected from the top-ranked K documents, the final task is to extract an answer span that answers the question $Q$. A deep attention-based MRC model is designed to achieve this goal. However, with all these documents and paragraphs, it may still be difficult to directly conduct pure answer prediction at the precise word level, as is done for the SQuAD dataset, and the document and paragraph information would not be fully exploited. Therefore, we split the answer prediction task into three joint tasks: document extraction, paragraph extraction and answer span extraction. The three tasks share the same bottom layers, which represent the semantics of the document context with respect to the question words, as shown in Figure 2. By introducing the auxiliary document extraction and paragraph extraction tasks, the proposed model can progressively narrow down the search space from coarse to fine, which helps to better locate the final answer span. The final answer prediction is based on the results of all three tasks, which are jointly optimized with a joint learning method.

Page 4: A Deep Cascade Model for Multi-Document Reading Comprehension · tractive reading comprehension, there are two categories of approaches: 1) The pipeline approach treats the doc-ument

Shared Q&D Modeling Given a question $Q$ and a set of selected documents $\{D_i\}$, one of the keys of MRC modeling lies in how to incorporate the question context into the document, so that important information can be highlighted. We follow the attention and fusion mechanism used in (Wang, Yan, and Wu 2018), a previous state-of-the-art MRC method on the SQuAD dataset.

Specifically, we first map each word into the vector space by concatenating its word embedding and a CNN-based character embedding. Then we use a bi-directional LSTM (BiLSTM) to encode the question $Q$ and the documents $\{D_i\}$:

$$u^Q_t = \mathrm{BiLSTM}_Q(u^Q_{t-1}, [e^Q_t, c^Q_t]), \qquad u^D_t = \mathrm{BiLSTM}_D(u^D_{t-1}, [e^D_t, c^D_t]) \tag{1}$$

where $e_t$ and $c_t$ are the word embedding and character embedding of the $t$-th word, and $u^Q_t$ and $u^D_t$ are the encoding vectors of the $t$-th word in $Q$ and $D$, respectively.

After the encoding, we use a co-attention method to effectively incorporate the question information into the document context, and obtain the question-aware document representation $\tilde{u}^D_j = \sum_i \alpha_{ij} \cdot u^Q_i$. We adopt the attention function used in DrQA (Chen et al. 2017a), which computes the attention score $\alpha_{ij}$ from the dot product between nonlinear mappings of the word representations:

$$\alpha_{ij} = \mathrm{softmax}\big(\mathrm{ReLU}(W_l^\top u^Q_i)^\top \mathrm{ReLU}(W_l^\top u^D_j)\big) \tag{2}$$

where $W_l$ is a linear projection matrix, softmax is the normalization function (taken over the question words $i$), and ReLU is the nonlinear activation function.
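A small numpy sketch of Eq. 2 and the aligned representation it produces; all tensors and sizes are toy values, and we assume the DrQA-style convention in which each document word attends over the question words:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 8, 5, 12            # hidden size, #question words, #document words
uQ = rng.normal(size=(m, d))  # question encodings u^Q
uD = rng.normal(size=(n, d))  # document encodings u^D
Wl = rng.normal(size=(d, d))  # shared linear projection W_l

relu = lambda x: np.maximum(x, 0)
# Attention logits between every (question word i, document word j), Eq. 2.
logits = relu(uQ @ Wl) @ relu(uD @ Wl).T            # shape (m, n)
# Normalize over question words for each document word.
alpha = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
# Question-aware document representation: for each document word,
# a weighted sum of the question encodings.
uD_tilde = alpha.T @ uQ                              # shape (n, d)
print(uD_tilde.shape)
```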

To combine the original representation $u^D_t$ and the attention vector $\tilde{u}^D_t$, we adopt the fusion kernel used in (Wang, Yan, and Wu 2018) for better semantic understanding:

$$v^D_t = \mathrm{Fuse}(u^D_t, \tilde{u}^D_t) \tag{3}$$

where the fusion kernel $\mathrm{Fuse}(\cdot, \cdot)$ is a gating layer that combines the two representations; we omit the details due to space limitations.
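Since the paper leaves Fuse unspecified, the following numpy sketch shows one plausible gated-fusion form in the spirit of (Wang, Yan, and Wu 2018); the particular candidate/gate parameterization is our assumption, not the authors' definition:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 12
x = rng.normal(size=(n, d))    # original representation u^D
y = rng.normal(size=(n, d))    # attended representation \tilde{u}^D

# One plausible gated fusion (assumed form): a candidate state built from
# several views of (x, y) and a sigmoid gate interpolating with the input.
Wm = rng.normal(size=(4 * d, d)) * 0.1
Wg = rng.normal(size=(4 * d, d)) * 0.1
feats = np.concatenate([x, y, x * y, x - y], axis=-1)  # (n, 4d)
m = np.tanh(feats @ Wm)                                 # candidate state
g = 1.0 / (1.0 + np.exp(-(feats @ Wg)))                 # gate in (0, 1)
v = g * m + (1.0 - g) * x                               # fused output
print(v.shape)
```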

To model the long-distance dependencies of the document context, we also introduce a self-attention layer to further align the document representation $v^D_t$ against itself:

$$\beta_{ij} = \mathrm{softmax}(v^D_i \cdot W_s^\top \cdot v^D_j), \qquad \tilde{v}^D_t = \sum_j \beta_{tj} \cdot v^D_j, \qquad d^D_t = \mathrm{Fuse}(v^D_t, \tilde{v}^D_t) \tag{4}$$

where $W_s$ is a trainable bilinear projection matrix, and another fusion kernel is used to combine the original and self-attentive representations. All the preceding encoding and attention steps process each document independently given the question. Finally, we obtain a question-aware representation $D^{D_i} = \{d^{D_i}_t\}$ for each word in each document.

For the question side, since the question is generally short, we directly self-align it to a vector $r^Q$, which is independent of the documents:

$$\gamma_t = \mathrm{softmax}(w_q^\top u^Q_t), \qquad r^Q = \sum_t \gamma_t \cdot u^Q_t \tag{5}$$

where $w_q$ is a trainable linear weight vector.
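A matching numpy sketch of the question self-alignment in Eq. 5, again with toy tensors:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 8, 5
uQ = rng.normal(size=(m, d))   # question word encodings u^Q
wq = rng.normal(size=(d,))     # trainable weight vector w_q

logits = uQ @ wq                                 # scores before softmax
gamma = np.exp(logits) / np.exp(logits).sum()    # Eq. 5 softmax weights
rQ = gamma @ uQ                                  # weighted question vector r^Q
print(rQ.shape)
```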

This shared question and document modeling lays the foundation for the three subsequent extraction tasks. Based on the document representations $D^{D_i} = \{d^{D_i}_t\}$ and the question vector $r^Q$, we now introduce the three joint extraction tasks.

Document Extraction In multi-document MRC, in addition to annotating the answer span, the benchmark datasets generally also annotate which documents are correct for extracting the answer; alternatively, this can easily be derived from the labeled answer. Therefore, we introduce an auxiliary document extraction task to help improve the answer prediction. Compared to answer span extraction, document extraction is relatively easy; its aim is to lay the foundation for the answer prediction and to help learn the shared bottom layers.

First, we self-align the document representation $D^{D_i} = \{d^{D_i}_t\}$ of each selected document $D_i$ to obtain a weighted document vector $r^{D_i}$:

$$\mu_t = \mathrm{softmax}(w_d^\top d^{D_i}_t), \qquad r^{D_i} = \sum_t \mu_t \cdot d^{D_i}_t \tag{6}$$

Next, the question vector $r^Q$ and the document vector $r^{D_i}$ are matched with a bilinear function to produce a relevance score:

$$s^{D_i} = r^Q \cdot W_{qd} \cdot r^{D_i} \tag{7}$$

where $W_{qd}$ is a trainable bilinear projection matrix, which helps to match the two vectors in the same space.

For one question, each selected document $D_i$ has a matching score $s^{D_i}$. We normalize these scores and optimize the following objective function:

$$\tilde{s}^{D_i} = 1/(1 + \exp(-s^{D_i})) \tag{8}$$

$$L_{DE} = -\frac{1}{K} \sum_{i=1}^{K} \left[ y^{D_i} \log \tilde{s}^{D_i} + (1 - y^{D_i}) \log(1 - \tilde{s}^{D_i}) \right] \tag{9}$$

where $K$ is the number of selected documents and $y^{D_i} \in \{0, 1\}$ is the label: $y^{D_i} = 1$ means document $D_i$ contains a golden answer, and $y^{D_i} = 0$ otherwise.
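A numpy sketch of Eqs. 7-9 with random toy tensors; all sizes and labels are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d, K = 8, 4
rQ = rng.normal(size=(d,))            # question vector r^Q
rD = rng.normal(size=(K, d))          # one vector per selected document
Wqd = rng.normal(size=(d, d)) * 0.1   # bilinear matching matrix W_qd
yD = np.array([1, 0, 0, 1])           # 1 iff the document holds a gold answer

s = rD @ Wqd @ rQ                     # bilinear scores, Eq. 7
p = sigmoid(s)                        # normalized scores, Eq. 8
# Binary cross-entropy averaged over the K selected documents, Eq. 9.
L_DE = -np.mean(yD * np.log(p) + (1 - yD) * np.log(1 - p))
print(L_DE)
```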

Paragraph Extraction In general, the golden answer usually comes from one or two paragraphs of each document. We can also annotate the correct paragraphs from which the answer is extracted by some distant supervision method (Chen et al. 2017a). Therefore, we introduce a mid-level paragraph extraction task, so that our model can not only distinguish among different documents but also select the relevant paragraphs within each document.

We first organize each selected document into paragraphs and follow the same procedure as in document extraction to calculate a question-paragraph matching score for each paragraph. Specifically, for each paragraph of document $D_i$ with $D^{D_i} = \{V^{P_{i1}}, \cdots, V^{P_{iN}}\}$, we first self-align the word-level paragraph representation $V^{P_{ij}}$ to a weighted vector representation $r^{P_{ij}}$ as in Eq. 6. Then a bilinear matching function between $r^Q$ and $r^{P_{ij}}$ yields the corresponding relevance score:

$$s^{P_{ij}} = r^Q \cdot W_{qp} \cdot r^{P_{ij}} \tag{10}$$


where $W_{qp}$ is the trainable bilinear projection matrix between question and paragraph.

For one document, each paragraph $P_{ij}$ has a matching score $s^{P_{ij}}$. We normalize the scores within each document to obtain $\tilde{s}^{P_{ij}}$ as in Eq. 8. In this sub-task, we optimize the average cross-entropy loss over all the selected documents and paragraphs:

$$L_{PE} = -\frac{1}{K} \frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{N} \left[ y^{P_{ij}} \log \tilde{s}^{P_{ij}} + (1 - y^{P_{ij}}) \log(1 - \tilde{s}^{P_{ij}}) \right] \tag{11}$$

where $N$ is the number of remaining paragraphs per document and $y^{P_{ij}} \in \{0, 1\}$ denotes the paragraph-level label of the $j$-th paragraph in the $i$-th document.

Answer Span Extraction The ultimate goal is to predict the correct answer; the aforementioned document extraction and paragraph extraction act as two auxiliary tasks so that the shallow semantic representations can be better learnt. In this stage, we combine all the available information to accurately extract the answer from all the selected documents at the span level. To make the document representation aware of information in the other documents, and to enable a direct comparison across documents, we concatenate all the selected documents together and introduce a multi-document shared LSTM layer for contextual modeling:

$$g^D_t = \mathrm{BiLSTM}(g^D_{t-1}, [d^D_t; r^Q; f]) \tag{12}$$

where $f$ is a manual feature vector including popular features such as whether the document word occurs in the question and whether the word is a sentence-ending separator. Here we also concatenate the question vector $r^Q$ to each word representation $d^D_t$ of the document to better model the interaction.

Since all the words from the different documents are passed to the shared LSTM layer, the sequence order is very important. We follow the document ranking order obtained via the document ranking function of the document retrieval module, as shown at the top of Figure 2. In this way, we expect the answer prediction model to also keep the ranking relevance of the document retrieval module in mind, and this ordering shows good performance in our experiments.

Finally, the pointer network (Wang and Jiang 2016) is used to predict the start and end positions of the answer with probabilities $\alpha^1_t$ and $\alpha^2_t$, and the answer extraction model is trained by minimizing the negative log probabilities of the true start and end indices:

$$\alpha_t = \exp(w_a^\top g^D_t) \Big/ \sum_{j=1}^{|D_w|} \exp(w_a^\top g^D_j) \tag{13}$$

$$L_{AE} = -\frac{1}{M} \sum_{i=1}^{M} \left( \log \alpha^1_{y^1_i} + \log \alpha^2_{y^2_i} \right) \tag{14}$$

where $w_a$ is a trainable vector, $|D_w|$ is the total number of words, $M$ is the number of question samples, and $y^1_i$, $y^2_i$ are the golden start and end positions within the entire concatenated document sequence.
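A numpy sketch of the span loss in Eqs. 13-14 for a single question; for brevity it replaces the recurrent pointer network of (Wang and Jiang 2016) with two independent linear pointers (w1 and w2 are hypothetical), so it illustrates the loss rather than the exact architecture:

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()                  # numerically stable softmax in log space
    return x - np.log(np.exp(x).sum())

rng = np.random.default_rng(4)
d, n = 8, 30                     # hidden size, total words across documents
g = rng.normal(size=(n, d))      # shared-LSTM outputs g^D_t
w1 = rng.normal(size=(d,))       # pointer weights for the start position
w2 = rng.normal(size=(d,))       # pointer weights for the end position
y1, y2 = 10, 13                  # gold start / end indices in the concatenation

log_a1 = log_softmax(g @ w1)     # Eq. 13 for the start pointer
log_a2 = log_softmax(g @ w2)     # Eq. 13 for the end pointer
L_AE = -(log_a1[y1] + log_a2[y2])  # Eq. 14 for one question (M = 1)
print(L_AE)
```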

Joint Training and Prediction By design, the three extraction tasks share the same embedding, encoding and matching layers. Therefore, we train them together with multi-task learning. The joint objective function is formulated as follows:

$$L = L_{AE} + \lambda_1 L_{DE} + \lambda_2 L_{PE} \tag{15}$$

where $\lambda_1$ and $\lambda_2$ are two hyper-parameters that control the weights of the auxiliary tasks.

To keep the training process stable, we adopt a coarse-to-fine joint training strategy and progressively fine-tune each upper task with the joint loss. Specifically, we first train the lower-level document extraction and paragraph extraction tasks to obtain an initial shallow representation, and then jointly train all three tasks with Eq. 15 on top of it. Besides, when training a new upper task, we follow the method of (Hashimoto et al. 2016) and introduce a successive regularization term on the shared parameters:

$$L' = L + \delta \| \theta_s - \theta'_s \|^2 \tag{16}$$

where $\theta_s$ and $\theta'_s$ are the shared parameters at successive training stages. In this way, we restrain the joint training process so that the shared parameters do not change too much.
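A sketch of Eqs. 15-16 written as plain functions; the helper names and the toy parameter lists are ours:

```python
import numpy as np

def joint_loss(L_AE, L_DE, L_PE, lam1=0.5, lam2=0.5):
    """Eq. 15: answer span loss plus weighted auxiliary losses."""
    return L_AE + lam1 * L_DE + lam2 * L_PE

def with_successive_regularization(L, theta_s, theta_s_prev, delta=0.01):
    """Eq. 16: penalize drift of the shared parameters between stages."""
    drift = sum(np.sum((a - b) ** 2) for a, b in zip(theta_s, theta_s_prev))
    return L + delta * drift

theta_prev = [np.ones((4, 4))]          # shared parameters after stage 1
theta_now = [np.ones((4, 4)) * 1.1]     # shared parameters during stage 2
print(with_successive_regularization(joint_loss(2.0, 1.0, 0.8),
                                     theta_now, theta_prev))
```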

When predicting the final answer, we take the document matching score, the paragraph matching score and the answer span score into consideration, and choose the answer with the highest combined prediction score:

$$s = (\alpha^1_k \cdot \alpha^2_k) \cdot \tilde{s}^{D_i} \cdot \tilde{s}^{P_{ij}} \tag{17}$$

Experiments

This section presents our experimental methodology. We first verify the effectiveness of our model on two benchmark datasets: TriviaQA (Joshi et al. 2017) and DuReader (He et al. 2017). Then we test our model in an operational online environment, where it stably and effectively serves different scenarios with prompt responses.

Datasets

Off-line Benchmark Datasets We choose the TriviaQA Web and DuReader benchmark datasets to test our method, since both are multi-document MRC datasets, which are more realistic and challenging.

TriviaQA is a recently released large-scale multi-document MRC dataset, which consists of 650K context-query-answer triples. There are 95K distinct question-answer pairs, authored by trivia enthusiasts, with an average of six evidence documents (contexts) per question, gathered from either Wikipedia or web search. In this paper, we focus on the TriviaQA Web dataset, which contains more context data for each question.

DuReader is so far the largest Chinese MRC dataset, which contains 200K questions, 1M documents and more than 420K human-summarized answers. All the questions and documents are extracted from real data by the largest Chinese search engine, Baidu. The average length of the documents is 396.0 words; on average each question has 5 evidence documents, and each document has about 7 paragraphs.


On-line Environment We also apply our model to the AliMe chatbot system, an intelligent online assistant designed to create an innovative online shopping experience in e-commerce. Currently, it serves millions of customer questions per day. We test our model in two practical scenarios, i.e., e-commerce promotion and tax policy reading. The e-commerce promotion scenario is about consulting instructions on shopping games and sales promotions, which mostly involve short documents of no more than 500 words. The tax policy scenario is about reading tax policy articles, which can be viewed as a multi-document MRC task; the articles are much longer and consist of many sections and paragraphs.

Implementation Details

For the cascade ranking functions, the numbers of selected documents K and paragraphs N are the key factors that balance the effectiveness-efficiency trade-off. We choose K = 4 and N = 2 for their good performance on the dev set. Since the TriviaQA documents often contain many small paragraphs, we also restructure the documents by merging consecutive paragraphs up to a maximum size of 600 words per paragraph, as in (Clark and Gardner 2017). A detailed analysis is given in the next section.

For the multi-task deep attention framework, we adopt the Adam optimizer for training, with a mini-batch size of 32 and an initial learning rate of 0.0005. We use the 300-dimensional GloVe word embeddings for TriviaQA, and train word2vec embeddings on the whole DuReader corpus for DuReader. The word embeddings are fixed during training. The hidden size of the LSTM is set to 150 for TriviaQA and 128 for DuReader. The task-specific hyper-parameters λ1 and λ2 in Eq. 15 are set to λ1 = λ2 = 0.5. The regularization parameter δ in Eq. 16 is set to a small value of 0.01. All models are trained on Nvidia Tesla M40 GPUs with the Cudnn LSTM cell in TensorFlow 1.3.
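For quick reference, the reported hyper-parameters gathered into a single Python mapping; the key names are our own shorthand:

```python
# Hyper-parameters as reported above, collected into one config sketch.
CONFIG = {
    "optimizer": "adam",
    "batch_size": 32,
    "learning_rate": 5e-4,
    "hidden_size": {"triviaqa": 150, "dureader": 128},
    "word_embeddings": {"triviaqa": "GloVe 300d",
                        "dureader": "word2vec trained on DuReader"},
    "train_embeddings": False,       # embeddings are fixed during training
    "lambda1": 0.5, "lambda2": 0.5,  # task weights in Eq. 15
    "delta": 0.01,                   # successive regularization in Eq. 16
    "top_k_documents": 4, "top_n_paragraphs": 2,
}
```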

Off-line Evaluation

Main Results The results of our single deep cascade model [4] on TriviaQA Web and DuReader 1.0 are summarized in Table 1 and Table 2, respectively. By adopting the deep cascade learning framework, the proposed model outperforms the previous state-of-the-art methods by an evident margin on both datasets, which validates the effectiveness of the proposed method for the challenging multi-document MRC task.

Ablation Study To get better insight into our model architecture, we conduct an in-depth ablation study on the development sets of DuReader and TriviaQA, shown in Table 3. The main goal is to validate the effectiveness of the critical components of our architecture, including the manual features and the multi-document shared LSTM in the pure answer span extraction task, the cascade document and paragraph ranking functions for pruning irrelevant document content, and the multi-task learning strategy.

[4] We only submit the single model, without any model ensemble.

Table 1: Performance of our method and competing models on the TriviaQA Web leaderboard.

Model                               | Full (EM / F1)  | Verified (EM / F1)
BiDAF Baseline (Joshi et al. 2017)  | 40.74 / 47.05   | 49.54 / 55.80
Smarnet (Chen et al. 2017b)         | 40.87 / 47.09   | 51.11 / 55.98
M-Reader (Hu, Peng, and Qiu 2017)   | 46.65 / 52.89   | 56.96 / 61.48
Re-Ranker (Wang et al. 2017b)       | 63.04 / 68.53   | 69.70 / 74.57
S-Norm (Clark and Gardner 2017)     | 66.37 / 71.32   | 79.97 / 83.70
Weissenborn (Weissenborn 2017)      | 67.46 / 72.80   | 77.63 / 82.01
Our-Single                          | 68.65 / 73.07   | 82.44 / 85.35

Table 2: Performance on the DuReader 1.0 test set.

Model                                   | BLEU-4 | ROUGE-L
Match-LSTM (Wang and Jiang 2016)        | 31.8   | 39.0
BiDAF (Seo et al. 2016)                 | 31.9   | 39.2
PR + BiDAF (Wang et al. 2018)           | 37.55  | 41.81
Cross-Passage Verify (Wang et al. 2018) | 40.97  | 44.18
R-net (Wang et al. 2017c)               | 44.88  | 47.71
Our-Single                              | 49.39  | 50.71
Human Performance                       | 56.1   | 57.4

From the results, we can see that: 1) The shared LSTM plays an important role in answer extraction among multiple documents. The benefit is two-fold: a) it helps to normalize the content probability scores across multiple documents, so that answers extracted from different documents can be directly compared; b) it can keep the ranking order from the document ranking component in mind, which serves as an additional signal when predicting the best answer. By incorporating the manual features, the performance can be further improved slightly. 2) Both the preliminary cascade ranking and the multi-task answer extraction strategy are vital to the final performance, serving as a good trade-off between the pure pipeline method and the fully joint learning method. By removing abundant irrelevant noisy data in the cascade document and paragraph ranking stage, the downstream MRC model can better extract the answer from the more relevant content. Jointly training the three extraction tasks provides great benefits, which shows that the three tasks are closely related and can boost each other through the shared representations of the bottom layers.

Effectiveness vs. Efficiency Trade-off We further examine how the performance of our model changes with respect to the number of documents and paragraphs selected in the cascade ranking stage, which is the key factor controlling the effectiveness-efficiency trade-off. The result on the DuReader development set is presented in Table 4. We can see that: 1) When more documents or paragraphs are taken into consideration, the performance of the model gradually increases until it reaches 4 documents and 2 paragraphs, and then decreases slightly, which may be because too much noisy data is introduced. 2) The time cost can be largely reduced by removing more irrelevant documents and paragraphs in the cascade ranking stage, while keeping the performance largely unchanged. For example, compared with the best setting of 4 documents and 2 paragraphs, if we instead keep only the top-1 paragraph of each document, the time cost is reduced by 36.7% while the performance only decreases by about 2.4%.


Table 3: Ablation study on model components.

Model                   | DuReader BLEU-4 | Δ    | TriviaQA F1 | Δ
Complete Model          | 50.8            | –    | 73.8        | –
w/o Manual Features     | 49.8            | -1.0 | 73.0        | -0.8
w/o Shared LSTM         | 48.7            | -2.1 | 70.5        | -3.3
w/o Cascade Ranking     | 47.0            | -3.8 | 71.1        | -2.7
w/o Multi-task Learning | 48.5            | -2.3 | 70.9        | -2.9
Boundary Baseline       | 41.0            | -9.8 | 61.5        | -12.3

Table 4: Effectiveness and efficiency w.r.t. the document and paragraph selection numbers on the DuReader development set (efficiency is indicated by the time cost at prediction stage).

Document No. | Paragraph No. | Time cost (s/batch) | BLEU-4
1            | 1             | 0.42                | 32.1
1            | 2             | 0.53                | 35.4
1            | 3             | 0.69                | 36.0
2            | 1             | 0.56                | 40.5
2            | 2             | 0.89                | 44.8
2            | 3             | 1.14                | 44.2
3            | 1             | 0.71                | 48.1
3            | 2             | 1.09                | 49.5
3            | 3             | 1.36                | 49.0
4            | 1             | 0.88                | 49.6
4            | 2             | 1.39                | 50.8
4            | 3             | 1.75                | 50.0
5            | 1             | 0.98                | 49.6
5            | 2             | 1.70                | 50.0
5            | 3             | 2.03                | 48.8

As a result, we can adaptively adjust the model to meet the practical situation; we choose 4 documents and 2 paragraphs in our off-line experiments, where effectiveness is most emphasized.

Advantage of Multi-task Learning Next, we analyze in detail the benefits brought by the multi-task learning strategy. The performance of jointly training the answer extraction module with different auxiliary tasks on the DuReader development set is shown in Table 5. Incorporating either the auxiliary document extraction task or the paragraph extraction task into the joint learning framework always improves performance, which again shows the advantage of introducing auxiliary tasks to help learn the shared bottom representations. Besides, the performance gain from adding the document extraction task is larger, which may be because it better lays the foundation of the model by making information from different documents distinguishable.

On-line Evaluation

Results on E-commerce and Tax Data We also test the effectiveness and efficiency of our model in two practical scenarios, e-commerce promotion and tax policy reading, where real-time responses are expected and a large number of customers are served simultaneously. The comparative results are shown in Table 6. By introducing the cascade ranking stage and keeping the selection numbers small, our method can serve requests at a much higher speed of less than 50ms, especially in the tax scenario, where the speed-up is about 3 times.

Table 5: Performance with different extraction tasks.

Task                              | BLEU-4 | Δ
Pure Answer Span Extraction       | 48.5   | –
+ Document Extraction             | 49.7   | +1.2
+ Paragraph Extraction            | 49.2   | +0.7
+ Document & Paragraph Extraction | 50.8   | +2.3

Table 6: Performance and response time (RT) in two real-world online scenarios.

Model                            | Tax (F1 / RT) | E-Commerce (F1 / RT)
BiDAF (Joshi et al. 2017)        | 40 / 130ms    | 63 / 65ms
DrQA (Chen et al. 2017a)         | 46 / 122ms    | 67 / 61ms
Our-Single (w/o Cascade Ranking) | 55 / 138ms    | 71 / 70ms
Our-Single (K=3, N=1)            | 76.5 / 45ms   | 73 / 38ms

Besides, the F1 score is also largely improved with the proposed multi-document MRC model, which demonstrates the effectiveness of our method at removing the abundant irrelevant noisy content in our online scenarios.

Results on Different Document Lengths We further examine how the F1 score and the response time change in the tax scenario when processing documents of different lengths, ranging from 50 to 2000 words. The result is shown in Figure 3. Without the cascade ranking module, the answer extraction module performs rather poorly in both effectiveness and efficiency as the document length increases. In particular, when the document length exceeds 1,000 words, the total response time increases 3 to 6 times, while our full cascade model needs only 15ms more.

Figure 3: F1 score and average response time w.r.t. different document lengths.

Conclusion

In this paper, we propose a novel deep cascade learning framework to balance effectiveness and efficiency in the realistic multi-document MRC setting. We design three cascade modules, which eliminate irrelevant document content in the earlier stages with simple features and models, and discern the more relevant answers at later stages. The experimental results show that our method achieves state-of-the-art performance on two large-scale benchmark datasets. Besides, the proposed method has also been effectively and efficiently applied in our online system.



References

Bourdev, L., and Brandt, J. 2005. Robust object detection via soft cascade. In CVPR 2005, volume 2, 236-243. IEEE.

Chen, T., and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. ACM.

Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017a. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.

Chen, Z.; Yang, R.; Cao, B.; Zhao, Z.; Cai, D.; and He, X. 2017b. Smarnet: Teaching machines to read and comprehend like human. arXiv preprint arXiv:1710.02772.

Clark, C., and Gardner, M. 2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.

Hashimoto, K.; Xiong, C.; Tsuruoka, Y.; and Socher, R. 2016. A joint many-task model: Growing a neural network for multiple NLP tasks. arXiv preprint arXiv:1611.01587.

He, W.; Liu, K.; Lyu, Y.; Zhao, S.; Xiao, X.; Liu, Y.; Wang, Y.; Wu, H.; She, Q.; Liu, X.; et al. 2017. DuReader: A Chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073.

Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1693-1701.

Hill, F.; Bordes, A.; Chopra, S.; and Weston, J. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.

Hu, M.; Peng, Y.; and Qiu, X. 2017. Reinforced mnemonic reader for machine comprehension. CoRR, abs/1705.02798.

Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

Lefakis, L., and Fleuret, F. 2010. Joint cascade optimization using a product of boosted classifiers. In Advances in Neural Information Processing Systems, 1315-1323.

Liu, S.; Xiao, F.; Ou, W.; and Si, L. 2017. Cascade ranking for operational e-commerce search. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1557-1565. ACM.

Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; and Deng, L. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Schneiderman, H. 2004. Feature-centric evaluation for efficient cascaded object detection. In CVPR 2004, volume 2. IEEE.

Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Tan, C.; Wei, F.; Yang, N.; Lv, W.; and Zhou, M. 2017. S-Net: From answer extraction to answer generation for machine reading comprehension. arXiv preprint arXiv:1706.04815.

Tan, C.; Wei, F.; Yang, N.; Du, B.; Lv, W.; and Zhou, M. 2018. S-Net: From answer extraction to answer synthesis for machine reading comprehension. In AAAI.

Wang, S., and Jiang, J. 2016. Machine comprehension using Match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.

Wang, S.; Yu, M.; Guo, X.; Wang, Z.; Klinger, T.; Zhang, W.; Chang, S.; Tesauro, G.; Zhou, B.; and Jiang, J. 2017a. R^3: Reinforced reader-ranker for open-domain question answering. arXiv preprint arXiv:1709.00023.

Wang, S.; Yu, M.; Jiang, J.; Zhang, W.; Guo, X.; Chang, S.; Wang, Z.; Klinger, T.; Tesauro, G.; and Campbell, M. 2017b. Evidence aggregation for answer re-ranking in open-domain question answering. arXiv preprint arXiv:1711.05116.

Wang, W.; Yang, N.; Wei, F.; Chang, B.; and Zhou, M. 2017c. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 189-198.

Wang, Y.; Liu, K.; Liu, J.; He, W.; Lyu, Y.; Wu, H.; Li, S.; and Wang, H. 2018. Multi-passage machine reading comprehension with cross-passage answer verification. arXiv preprint arXiv:1805.02220.

Wang, L.; Lin, J.; and Metzler, D. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 105-114. ACM.

Wang, W.; Yan, M.; and Wu, C. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1705-1714.

Weissenborn, D., et al. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

Yu, A. W.; Dohan, D.; Luong, M.-T.; Zhao, R.; Chen, K.; Norouzi, M.; and Le, Q. V. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.