Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2241–2251, Florence, Italy, July 28 - August 2, 2019. ©2019 Association for Computational Linguistics


Inferential Machine Comprehension: Answering Questions by Recursively Deducing the Evidence Chain from Text

Jianxing Yu 1,2, Zheng-Jun Zha 3, Jian Yin 1,2 *

1 School of Data and Computer Science, Sun Yat-sen University
2 Guangdong Key Laboratory of Big Data Analysis and Processing, China

3 School of Information Science and Technology, University of Science and Technology of China
{yujx26, issjyin}@mail.sysu.edu.cn

[email protected]

Abstract

This paper focuses on inferential machine comprehension, which aims to fully understand the meaning of a given text in order to answer generic questions, especially those that need reasoning skills. In particular, we first encode the given document, question and options in a context-aware way. We then propose a new network that solves the inference problem by decomposing it into a series of attention-based reasoning steps, where the result of the previous step acts as the context of the next step. To make each step directly inferable from the text, we design an operational cell with a prior structure. By recursively linking the cells, the inferred results are synthesized together to form the evidence chain for reasoning, and the reasoning direction can be guided by imposing structural constraints that regulate interactions among the cells. Moreover, a termination mechanism is introduced to dynamically determine the uncertain reasoning depth, and the network is trained by reinforcement learning. Experimental results on 3 popular data sets, including MCTest, RACE and MultiRC, demonstrate the effectiveness of our approach.

1 Introduction

Machine comprehension is one of the hottest research topics in natural language processing. It measures a machine's ability to understand the semantics of a given document by answering questions related to the document. Many datasets and corresponding methods have been proposed for this task. In most of these datasets, such as CNN/Daily Mail (Hermann et al., 2015) and SQuAD (Rajpurkar et al., 2016), the answer is often a single entity or a text span in the document. As a result, many questions can be solved trivially via word and context matching (Trischler et al., 2016a) instead of

* Corresponding author.

Figure 1: Sample of a question requiring reasoning skill. The correct answer is marked with an asterisk.

genuine comprehension of the text. To alleviate this issue, several datasets have been released, such as MCTest (Richardson et al., 2013), RACE (Lai et al., 2017) and MultiRC (Khashabi et al., 2018), where the answers are not restricted to text spans in the document; instead, they can be described in any words. Notably, a significant proportion of the questions require reasoning, a sophisticated comprehension ability, to choose the right answers. As shown in Figure 1, the question asks for the reason behind the phenomenon in sentence S5. The answer has to be deduced over the logical relations among sentences S3, S4 and S5, and then entailed from S3 to the correct option B. Crucially, such a deduction chain is not given explicitly but expressed in the text semantics. Existing methods primarily focus on document-question interaction to capture the context similarity for answer span matching (Wang et al., 2017). They have minimal capability to synthesize supporting facts scattered across multiple sentences into the evidence chain that is crucial for reasoning.

To support inference, mainstream methods fall into three categories. The first converts the unstructured document into formal predicate expressions, on which mathematical deduction is performed via Bayesian networks or first-order logic. The conversion lacks the robustness to be broadly applicable. The second direction explicitly parses the document into a relation tree, on which answers are generated via hand-crafted rules (Sun et al., 2018b). However, the parser typically has to be cascaded with the model, which is difficult to train globally and suffers from error propagation. The third exploits memory networks to imitate reasoning with a multi-layer architecture and an iterative attention mechanism (Weston et al., 2014). Nevertheless, the reasoning ability is insufficient due to the lack of prior structural knowledge to guide the inference direction.

We observe that when humans answer an inferential question, they finely analyze the question details and comprehend contextual relations to derive an evidence chain step by step. Using the sample in Figure 1 for illustration, humans first investigate the question to find the useful details, such as the question type "why" and the aspect asked about, i.e. "some newspapers refused delivery to distant suburbs". Such details often play a critical role in answering; for example, a why question usually expects a causal relation, which can indicate the reasoning direction. Based on the question details, they then carefully read the document to identify the content that the question aspect mentions, that is, sentence S5. From that content, they deduce new supporting evidence step by step, guided by the question type and contextual relations, such as the explanatory relation between S5 and S4, and the causal relation between S3 and S4. By considering the options, they decide to stop when the observed information is already adequate to answer the question. For instance, by relevant paraphrasing, S3 can entail option B, which may be the answer. In this process, contextual relations and multi-step deduction are efficient mechanisms for deriving the evidence chain.

Based on the above observations, we propose an end-to-end approach that mimics the human process of deducing the evidence chain. In particular, we first encode the given document, question and options by considering contextual information. We then tackle the inference problem with a novel network that consists of a set of operational cells. Each cell is designed with a structural prior to capture the inner working procedure of an elementary reasoning step, where the step can be directly inferred from the text without strong supervision. The cell includes a memory and three operating units that work in tandem: the master unit derives a series of attention-based operations based on the question; the reader unit extracts the document content relevant to the operation; and the writer unit performs the operation to deduce a result and update the memory. The cells are recursively connected, where the result of the previous step acts as the context of the next step. The interactions of the cells are restricted by structural constraints, so as to regulate the reasoning direction. With this structural multi-step design, the network can integrate the supporting facts through contextual relations to build the evidence chain in arbitrarily complex acyclic form. Since the reasoning depth is uncertain, a termination mechanism is exploited to adaptively determine the ending. Moreover, a reinforcement approach is employed for effective training. Experiments are conducted on 3 popular data sets that contain questions requiring reasoning skills, including MCTest, RACE and MultiRC. The results show the effectiveness of our approach.

The main contributions of this paper include:

• We design a new network that can answer inferential questions by recursively deducing the evidence chain from the text.
• We propose an effective termination mechanism which can dynamically determine the uncertain reasoning depth.
• We employ a reinforcement training approach and conduct extensive experiments.

The rest of this paper is organized as follows. Section 2 elaborates our inferential framework. Section 3 presents the experimental results. Section 4 reviews related work and Section 5 concludes the paper with future work.

2 Approach

As shown in Figure 2, our approach consists of three components: input representation, an inferential network composed of multiple cells, and output. Next, we define some notation and then elaborate on each component.

2.1 Notations and Problem Formulation

Figure 2: Overview of our approach

Given a document $D$ in unstructured text, the task of machine comprehension is to answer questions according to the semantics of the document. In this paper, multiple-choice questions are our major focus. Thus, a set of plausible answer options is assumed to be provided, and the task reduces to selecting the correct option from the given set. Formally, let $q$ represent the question, of length $S$, where $\{w_1, \cdots, w_S\}$ are the question words; $O = \{o_1, \cdots, o_L\}$ denotes the option set. For a given document and question $x = (D, q)$, a score $h(x, y) \in \mathbb{R}$ is assigned to each candidate $y = o \in O$ in the option set, measuring its probability of being the correct answer. The option with the highest score is output as the answer: $y^{*} = \mathrm{argmax}_{y \in O}\, h(x, y)$.
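The scoring-and-argmax selection rule can be sketched as follows; note that the dot-product scorer below is a toy stand-in for the learned scoring function $h(x, y)$, not the paper's actual model.

```python
# Toy sketch of y* = argmax_{y in O} h(x, y). The scorer is a placeholder
# dot product over illustrative vectors, not the paper's learned h(x, y).
def score(x_vec, option_vec):
    return sum(a * b for a, b in zip(x_vec, option_vec))

def answer(x_vec, options):
    """Return the index of the highest-scoring option."""
    scores = [score(x_vec, o) for o in options]
    return max(range(len(options)), key=lambda i: scores[i])
```

With a toy question vector `[1.0, 0.0]` and three option vectors, the option most aligned with the question wins.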

2.2 Input Representation

We first encode the input text into distributed vector representations, taking the context into account.

Question: Two stages are conducted in the encoding. (1) We convert the question into a sequence of learned word embeddings by looking up pre-trained vectors such as GloVe (Pennington et al., 2014). Since the question type helps inference, we customize an embedding to indicate that type via linguistic prior knowledge; for example, the positions of interrogative words are often relatively fixed, and their parts of speech (POS) are mainly adverbs or conjunctions. Practically, we utilize the position and POS embeddings (Li et al., 2018b) generated by a word embedding tool. That is, the embedding layer is a matrix $W \in \mathbb{R}^{d \times v}$, where $d$ is the dimension and $v$ denotes the number of instances. (2) We concatenate the embeddings of word, position and POS, and feed them into a bi-directional GRU (BiGRU) (Cho et al., 2014) to incorporate sequential context. This yields two kinds of representations: (a) contextual words: a series of output states $\{cw_s\}_{s=1}^{S}$ that represent each word in the context of the question, where $cw_s = [\overleftarrow{h}_s, \overrightarrow{h}_s]$, and $\overleftarrow{h}_s$ and $\overrightarrow{h}_s$ are the $s$th hidden states of the backward and forward GRU passes respectively; and (b) overall encoding: $q = [\overleftarrow{cw}_1, \overrightarrow{cw}_S]$, the concatenation of the final hidden states.

Options: Each option word is embedded via pre-trained vectors, and each option is then contextually encoded by a BiGRU to generate an overall vector.

Document: Three steps are performed in the encoding. (1) We encode each document sentence by considering context via a BiGRU, as aforementioned. Sentence $i$ is transformed into an $n_i \times d$ matrix, where $n_i$ is the number of words in the sentence and $d$ is the dimension. (2) We apply attention to compress the sentence encoding into a fixed-size vector, focusing on the important components. Intuitively, a long sentence may contain multiple significant parts, each of which can help inference. For example, two clauses are linked by "or" with a causal relation in the sentence "The oil spill must be stopped or it will spread for miles." The clauses and the contextual relation can help answer the question "Why must the oil spill be stopped?" To model such situations, the structured self-attention technique proposed by Lin et al. (2017) is utilized. It converts the sentence into a $J \times d$ matrix, attending to $J$ significant parts of the sentence in a context-aware way. (3) All sentence matrices are fed into another BiGRU, so as to capture the context between sentences. That is, $D^{H \times J \times d} = \{ds_{h,j}\}_{h,j=1,1}^{H,J}$, where $H$ is the number of sentences and $ds$ is the sentence vector.
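As a rough illustration of the structured self-attention of Lin et al. (2017) used in step (2), the sketch below compresses a sentence of $n$ word vectors into $J$ attended rows; `W1` and `W2` are toy stand-ins for the learned attention parameters, not values from the paper.

```python
import math

# Hedged sketch of structured self-attention (Lin et al., 2017): a sentence
# of n word states H is compressed into J attended summaries A·H, where
# A = softmax(W2 · tanh(W1 · Hᵀ)). All weights here are illustrative.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def structured_self_attention(H, W1, W2):
    """H: n x d word states; W1: r x d; W2: J x r. Returns a J x d matrix."""
    # hidden[i] = tanh(W1 · H[i]), shape n x r
    hidden = [[math.tanh(sum(w * h for w, h in zip(row, hi))) for row in W1]
              for hi in H]
    out = []
    for w2 in W2:  # one row of scores per attention head
        scores = softmax([sum(a * b for a, b in zip(w2, hid)) for hid in hidden])
        out.append([sum(s * hi[k] for s, hi in zip(scores, H))
                    for k in range(len(H[0]))])
    return out
```

With a zero scoring row the attention is uniform, so each output row is simply the average of the word states.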

2.3 Micro-Infer Cell

Micro-infer is a recurrent cell designed to model the mechanism of an atomic reasoning step. The cell consists of one memory unit and three operational units: the master unit, reader unit and writer unit. The memory independently stores the intermediate results obtained from the reasoning process up to the $t$th step. Based on the memory state, the three operational units work together in series to accomplish the reasoning process. In particular, the master unit analyzes the question details to focus on a certain aspect via self-attention; the reader unit then extracts related content, guided by the question aspect and text context; and the writer unit iteratively integrates the content with preceding results from the memory to produce a new intermediate result. The interactions between the cell's units are regulated by structural constraints. Specially, the master outcome can only indirectly guide the integration of relevant content into the memory state through soft-attention maps and gating mechanisms. Moreover, a termination gate is introduced to adaptively determine the ending of the inference. In the following, we detail the formal specifications of the three operational units in the cell.

Figure 3: Flow chart of the master unit

2.3.1 Master Unit

As presented in Figure 3, this unit consists of two components: the termination mechanism and question analysis.

Termination Mechanism

A maximum step is set to guarantee termination. Since the complexity of the questions differs, the reasoning depths are uncertain. To dynamically adjust to such depth, a termination gate is designed by considering two conditions: the correlation between the intermediate result $m_{t-1}$ and the reasoning operation $a_{t-1}$ of the previous step, as well as between $m_{t-1}$ and the candidate answer options $\{o_l\}_{l=1}^{L}$. When both conditions are met, an acceptable answer is highly probable. Technically, the correlations are calculated by Eq. (1), i.e. $m_{t-1} \odot a_{t-1}$ and $m_{t-1} \odot o_l$, respectively. We then combine these two factors to get $ta_{t,l}$, and utilize a sigmoid layer to estimate the ending probability for a certain option. By maximizing over all the options, a termination function $f_{ts}(m_{t-1}, a_{t-1}, \{o_l\}_{l=1}^{L}; \theta_{ts})$ is generated, where $\theta_{ts}$ is a parameter set, namely $(W_{ta}^{d \times 2d}, b_{ta}^{d})$. Based on the function, a binary random variable $t_t$ is probabilistically drawn as $t_t \sim p(\cdot \mid f_{ts}(\cdot; \theta_{ts}))$. If $t_t$ is True, we stop and execute the answer module accordingly; otherwise, we continue with the $t$th reasoning step.

$$ta_{t,l} = W_{ta}^{d \times 2d}[m_{t-1} \odot a_{t-1},\, m_{t-1} \odot o_l] + b_{ta}^{d}$$
$$f_{ts}(\cdot; \theta_{ts}) = \max\{\mathrm{sigmoid}(ta_{t,l})\}_{l=1}^{L} \quad (1)$$
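The termination gate of Eq. (1) can be sketched as follows. For simplicity the linear layer here produces a scalar per option (the paper's $W_{ta}$, $b_{ta}$ are stand-ins `W`, `b`), and all vectors are toy 2-dimensional examples.

```python
import math

# Hedged sketch of the termination gate (Eq. 1), scalar-output simplification.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def termination_prob(m_prev, a_prev, options, W, b):
    """Max over options of sigmoid(W · [m ⊙ a, m ⊙ o] + b)."""
    probs = []
    for o in options:
        # concatenate the two elementwise correlations (length 2d)
        feats = [m * a for m, a in zip(m_prev, a_prev)] + \
                [m * v for m, v in zip(m_prev, o)]
        ta = sum(w * f for w, f in zip(W, feats)) + b
        probs.append(sigmoid(ta))
    return max(probs)
```

If the gated probability exceeds a sampled threshold, reasoning stops; otherwise the next cell runs.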

Question Analysis

We design a soft-attention based mechanism to analyze the question and determine the basic operation performed at each step. Instead of grasping the complex meaning of the whole question at once, the model is encouraged to focus on a certain question aspect at a time, so that the reasoning operation can be directly inferred from the text. Three stages are performed as follows.

Firstly, we project the question $q$ through a learned linear transformation to derive the aspect related to the $t$th reasoning step, as $q_t = W_{qt}^{d \times d} q + b_{qt}^{d}$.

Secondly, we use the previously performed operation $a_{t-1}$ and memory result $m_{t-1}$ as the decision base to guide the $t$th reasoning operation. In detail, we validate the previous reasoning result by leveraging the termination conditions in Eq. (1), that is, $pa_t = W_{pa}^{d \times Ld}[ta_{t,1}, \cdots, ta_{t,L}] + b_{pa}^{d}$. We then integrate $q_t$ with the preceding operation $a_{t-1}$ and the validation $pa_t$ through a linear transformation into $aq_t = W_{aq}^{d \times 3d}[q_t, a_{t-1}, pa_t] + b_{aq}^{d}$.

Thirdly, $aq_t$ is regulated by casting it back to the original question words $\{cw_s\}_{s=1}^{S}$ via the attention in Eq. (2), so as to restrict the space of valid reasoning operations and boost the convergence rate. In particular, we calculate the correlation $ac_{t,s}$ and pass it through a softmax layer to yield a distribution $av_{t,s}$ over the question words. By aggregation, a new reasoning operation $a_t$ is generated, represented in terms of the question words.

$$ac_{t,s} = W_{ac}^{1 \times d}[aq_t \odot cw_s] + b_{ac}^{1}$$
$$av_{t,s} = \mathrm{softmax}(ac_{t,s})$$
$$a_t = \sum_{s=1}^{S} av_{t,s} \cdot cw_s \quad (2)$$

Briefly, the new reasoning operation $a_t$ is modeled by a function $f_{na}(q, a_{t-1}, cw_s; \theta_{na})$, where $\theta_{na}$ is a set of parameters, including $(W_{qt}^{d \times d}, b_{qt}^{d}, W_{aq}^{d \times 3d}, b_{aq}^{d}, W_{pa}^{d \times Ld}, b_{pa}^{d}, W_{ac}^{1 \times d}, b_{ac}^{1})$.
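A minimal sketch of the attention in Eq. (2): each contextual question word is scored against the regulated aspect $aq_t$ via the elementwise product, the scores are softmaxed, and the new operation is the attention-weighted sum of the word states. `w_ac` and `b_ac` stand in for the learned $W_{ac}$, $b_{ac}$; all vectors are illustrative.

```python
import math

# Hedged sketch of Eq. (2): attention over question words.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend_question(aq_t, cw, w_ac, b_ac=0.0):
    """cw: S x d contextual word states. Returns (a_t, attention distribution)."""
    # ac_{t,s} = w_ac · (aq_t ⊙ cw_s) + b_ac
    scores = [sum(w * (a * c) for w, a, c in zip(w_ac, aq_t, cw_s)) + b_ac
              for cw_s in cw]
    av = softmax(scores)
    a_t = [sum(p * cw_s[k] for p, cw_s in zip(av, cw))
           for k in range(len(aq_t))]
    return a_t, av
```

Words whose context vectors align with the current aspect dominate the distribution, so $a_t$ is expressed in terms of the question words, as the text describes.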

2.3.2 Reader Unit

Figure 4: Flow chart of the reader unit

As shown in Figure 4, the reader unit retrieves the relevant document content required for performing the $t$th reasoning operation. The relevance is measured against the content context in a soft-attention manner, taking account of the current reasoning operation and prior memory. We do not rely on external tools, which facilitates global training.

To support transitive reasoning, we first extract the document content relevant to the preceding result $m_{t-1}$, yielding $dm_{t,h,j} = [W_{m}^{d \times d} m_{t-1} + b_{m}^{d}] \odot [W_{ds}^{d \times d} ds_{h,j} + b_{ds}^{d}]$. The relevance often indicates a contextual relation in the distributed space. For instance, given a question aspect why, contents with a causal relation are highly expected and their relevance scores are likely to be large.

Then, $dm_{t,h,j}$ is independently incorporated with the document content $ds_{h,j}$ to produce $dn_{t,h,j} = W_{dn}^{d \times 2d}[dm_{t,h,j}, ds_{h,j}] + b_{dn}^{d}$. This also allows us to consider new information that is not directly related to the prior intermediate result, so as to assist parallel and inductive reasoning.

Lastly, we use soft attention to select content that is relevant to the reasoning operation $a_t$ and candidate options $\{o_l\}_{l=1}^{L}$. Precisely, we unify $a_t$ and $\{o_l\}_{l=1}^{L}$ by a linear transformation to obtain $oa_t = W_{oa}^{d \times Ld}[a_t, \{o_l\}_{l=1}^{L}] + b_{oa}^{d}$, where the option set size $L$ is fixed and predefined. We then measure the correlation between $oa_t$ and the extracted content $dn_{t,h,j}$, passing the result through a softmax layer to produce an attention distribution. By taking the weighted average over the distribution, we retrieve the related content $ri_t$ by Eq. (3).

$$ad_{t,h,j} = W_{ad}^{d \times d}[oa_t \odot dn_{t,h,j}] + b_{ad}^{d}$$
$$rv_{t,h,j} = \mathrm{softmax}(ad_{t,h,j})$$
$$ri_t = \sum_{h=1;j=1}^{H;J} rv_{t,h,j} \cdot ds_{h,j} \quad (3)$$

In short, the retrieved content $ri_t$ is formulated by a function $f_{ri}(m_{t-1}, ds_{h,j}, a_t, \{o_l\}_{l=1}^{L}; \theta_{ri})$, where $\theta_{ri}$ is a parameter set involving $(W_{m}^{d \times d}, b_{m}^{d}, W_{ds}^{d \times d}, b_{ds}^{d}, W_{dn}^{d \times 2d}, b_{dn}^{d}, W_{oa}^{d \times Ld}, b_{oa}^{d}, W_{ad}^{d \times d}, b_{ad}^{d})$.
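The retrieval in Eq. (3) can be sketched roughly as below, with the learned projections ($W_m$, $W_{ds}$, $W_{dn}$, $W_{ad}$ and their biases) replaced by identity maps so that the memory-conditioned relevance reduces to a dot product over the flattened $(h, j)$ content grid; this is an illustrative simplification, not the paper's parameterization.

```python
import math

# Hedged sketch of the reader unit's retrieval (Eq. 3), identity-projection
# simplification: score each content vector against the prior memory, then
# return the softmax-weighted average of the content vectors.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def retrieve(m_prev, doc):
    """doc: list of H*J content vectors ds_{h,j} (flattened grid)."""
    scores = [sum(m * d for m, d in zip(m_prev, ds)) for ds in doc]
    rv = softmax(scores)
    d = len(m_prev)
    return [sum(p * ds[k] for p, ds in zip(rv, doc)) for k in range(d)]
```

Content aligned with the prior memory dominates the retrieved vector, which is the transitive-reasoning behavior the paragraph describes.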

2.3.3 Writer Unit

Figure 5: Flow chart of the writer unit

As illustrated in Figure 5, the writer unit is responsible for computing the intermediate result of the $t$th reasoning process and updating the memory state. It integrates the retrieved content from the reader unit with the preceding intermediate result in the memory, guided by the $t$th reasoning operation from the master unit. Details are presented as follows.

(1) Motivated by work on relational reasoning (Santoro et al., 2017), we linearly incorporate the retrieved content $ri_t$, prior result $m_{t-1}$, and question $q$ to get $mc_t = W_{mc}^{d \times 3d}[ri_t, m_{t-1}, q] + b_{mc}^{d}$, so as to measure their correlations.

(2) To support non-sequential reasoning, such as tree or graph style, we refer to all previously memorized results instead of just the preceding one $m_{t-1}$. Motivated by work on scalable memory networks (Miller et al., 2016), we compute the attention of the current operation $a_t$ against all previous ones $\{a_i\}_{i=1}^{t-1}$, yielding $sa_{t,i} = \mathrm{softmax}(W_{sa}^{1 \times d}[a_t \odot a_i] + b_{sa}^{1})$. We then average over the previous results $\{m_i\}_{i=1}^{t-1}$ to get the preceding relevant support $mp_t = \sum_{i=1}^{t-1} sa_{t,i} \cdot m_i$. By combining $mp_t$ with the correlated result $mc_t$ above, we obtain a plausible result $mu_t = W_{mp}^{d \times d} mp_t + W_{mc}^{d \times d} mc_t + b_{mu}^{d}$.

(3) Operations on some question aspects, such as why, need multi-step reasoning and updating, while others do not. To regulate the valid reasoning space, an update gate is introduced to determine whether to refresh the previous result $m_{t-1}$ in the memory with the new plausible result $mu_t$. The gate $\alpha_t$ is conditioned on the operation $a_t$ through a learned linear transformation and a sigmoid function. If the gate is open, the unit writes the new result to the memory; otherwise, it skips this operation and performs the next one.

$$\alpha_t = \mathrm{sigmoid}(W_{a}^{1 \times d} a_t + b_{a}^{1})$$
$$m_t = \alpha_t \cdot m_{t-1} + (1 - \alpha_t) \cdot mu_t \quad (4)$$
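The update gate of Eq. (4) amounts to a learned interpolation between the old memory and the new candidate result. A minimal sketch, with `w_a` and `b_a` standing in for the learned $W_a$, $b_a$:

```python
import math

# Hedged sketch of the update gate in Eq. (4).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(m_prev, mu, a_t, w_a, b_a=0.0):
    """m_t = alpha * m_prev + (1 - alpha) * mu, alpha = sigmoid(w_a · a_t + b_a)."""
    alpha = sigmoid(sum(w * a for w, a in zip(w_a, a_t)) + b_a)
    return [alpha * m + (1.0 - alpha) * u for m, u in zip(m_prev, mu)]
```

With a zero gate weight the sigmoid gives 0.5, so the memory is an even blend of old and new; a strongly positive gate keeps the old memory (skip), a strongly negative one overwrites it.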

In brief, the new reasoning result $m_t$ is modeled by a function $f_{nm}(m_{t-1}, ri_t, q, a_t; \theta_{nm})$, where $\theta_{nm}$ is a parameter set including $(W_{mc}^{d \times 3d}, b_{mc}^{d}, W_{sa}^{1 \times d}, b_{sa}^{1}, W_{mp}^{d \times d}, W_{mc}^{d \times d}, b_{mu}^{d}, W_{a}^{1 \times d}, b_{a}^{1})$.


2.4 Output and Training

After the termination condition is met, we obtain the memory state $m_{t-1}$, which holds the final intermediate result of the reasoning process. For the multiple-choice questions focused on in this paper, there is a fixed set of possible answers. We then leverage a classifier to predict an answer by referring to the question $q$ and options $\{o_l\}_{l=1}^{L}$. Precisely, we first measure the correlation of $m_{t-1}$ against $q$ and $o_l$, obtaining $m_{t-1} \odot q$ and $m_{t-1} \odot o_l$. After concatenation, we pass the outcome through a 2-layer fully-connected softmax network to derive an answer option by Eq. (5), with a ReLU activation function to alleviate over-fitting. In summary, the parameter set $\theta_{ans}$ is $(W_{u}^{d \times 2d}, b_{u}^{d}, W_{an}^{1 \times Ld}, b_{an}^{1})$.

$$u_l = \mathrm{ReLU}(W_{u}^{d \times 2d}[m_{t-1} \odot q,\, m_{t-1} \odot o_l] + b_{u}^{d})$$
$$Ans_t = \mathrm{softmax}(W_{an}^{1 \times Ld}[u_1, \cdots, u_L] + b_{an}^{1}) \quad (5)$$

Reinforcement Learning

Due to the discreteness of the termination steps, the proposed network cannot be directly optimized by back-propagation. To facilitate training, a reinforcement approach is used that views the inference operations as policies, including the reasoning operation flow $G_{1:T}$, the termination decision flow $t_{1:T}$ and the answer prediction $A_T$, where $T$ is the reasoning depth. Given the $i$th training instance $\langle q^i; D^i; o^i \rangle$, the expected reward $r$ is defined to be 1 if the predicted answer is correct, and 0 otherwise. The rewards on intermediate steps are 0, i.e. $\{r_t = 0\}_{t=1}^{T-1}$. Each possible value tuple of $(G; t; A)$ corresponds to an episode, and the set of all possible episodes is denoted $\mathbb{A}^{\dagger}$. Let $J(\theta) = \mathbb{E}_{\pi}\left[\sum_{t=1}^{T} r_t\right]$ be the total expected reward, where $\pi(G, t, A; \theta)$ is a policy parameterized by the network parameters $\theta$, involving the encoding matrices $\theta_W$, question network $\theta_{na}$, termination gate $\theta_{ts}$, reader network $\theta_{ri}$, writer network $\theta_{nm}$, and answer network $\theta_{ans}$. To maximize the reward $J$, we employ gradient descent optimization with the Monte-Carlo REINFORCE (Williams, 1992) estimator in Eq. (6).

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi(G,t,A;\theta)}\left[\nabla_\theta \log \pi(G, t, A; \theta)(r - b)\right] = \sum_{(G,t,A) \in \mathbb{A}^{\dagger}} \pi(G, t, A; \theta)\left[\nabla_\theta \log \pi(G, t, A; \theta)(r - b)\right] \quad (6)$$

where $b$ is a critic value function. It is usually set as $\sum_{(G,t,A)} \pi(G, t, A; \theta)\, r$ (Shen et al., 2016), and $(r/b - 1)$ is often used instead of $(r - b)$ to achieve stability and boost the convergence speed.
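As a small sanity check of the estimator in Eq. (6), the sketch below enumerates all episodes of a toy two-action softmax policy exactly (rather than Monte-Carlo sampling), with the baseline $b = \mathbb{E}_{\pi}[r]$ as noted above. All numbers are illustrative and not the paper's.

```python
import math

# Hedged sketch of Eq. (6), enumerated exactly for a toy softmax policy.
def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def policy_gradient(logits, rewards):
    """sum_a pi(a) * grad_log_pi(a) * (r(a) - b), with critic b = E_pi[r]."""
    pi = softmax(logits)
    b = sum(p * r for p, r in zip(pi, rewards))  # expected-reward baseline
    grads = []
    for k in range(len(logits)):
        g = 0.0
        for a, (p, r) in enumerate(zip(pi, rewards)):
            # d log pi(a) / d logit_k = 1{a == k} - pi(k)  (softmax policy)
            dlog = (1.0 if a == k else 0.0) - pi[k]
            g += p * dlog * (r - b)
        grads.append(g)
    return grads
```

With uniform logits and rewards [1, 0], the gradient pushes the first action's logit up and the second's down by equal amounts, as expected.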

3 Evaluations

In this section, we extensively evaluate the effectiveness of our approach, including comparisons with the state-of-the-arts and component analysis.

3.1 Data and Experimental Setting

As shown in Table 1, experiments were conducted on 3 popular data sets in 9 domains: MCTest, RACE and MultiRC. Unlike data sets such as bAbI (Weston et al., 2015) that are synthetic, the questions in the evaluated data sets are of high quality and reflect real-world applications.

Data set   #doc    #q      #domain  ratio
MCTest     660     2,640   1        54.2%
  MC160    160     640     1        53.3%
  MC500    500     2,000   1        54.6%
RACE       27,933  97,687  1        25.8%
  RACE-M   7,139   20,794  1        22.6%
  RACE-H   28,293  69,394  1        26.9%
MultiRC    871     9,872   7        59.0%

Table 1: Statistics of the data sets. #doc and #q denote the numbers of documents and questions, respectively; ratio is the proportion of questions that require reasoning over multiple sentences. MC160 is a human double-checked subset of MCTest, while MC500 is an unchecked one; RACE-M and RACE-H are the subsets of RACE from middle and high school exams, respectively.

Hyper-parameters were set as follows. For question encoding, the POS tags were obtained using the OpenNLP toolkit. Multiple cells were connected to form the network, with weights shared across cells. The maximum number of connected cells was 16. The network was optimized via Adam (Kingma and Ba, 2014) with a learning rate of $10^{-4}$ and a batch size of 64. We used gradient clipping with a clipnorm of 8, and employed early stopping based on validation accuracy. For word embedding, we leveraged 300-dimensional pre-trained word vectors from GloVe; the word embeddings were initialized randomly using a standard uniform distribution and not updated during training. Out-of-vocabulary words were initialized with zero vectors. The number of hidden units in the GRU was set to 256, and the recurrent weights were initialized as random orthogonal matrices. The other weights in the GRU were initialized from a uniform distribution between −0.01 and 0.01. We maintained exponential moving averages of the model weights with a decay rate of 0.999, and used them at test time instead of the raw weights. Variational dropout of 0.15 was used across the network, and the maximum reasoning step was set to 5. Training usually converged within 30 epochs.
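The exponential moving average kept over the model weights amounts to a per-parameter interpolation; a one-line sketch (the decay in the usage below is a toy value for illustration, whereas the paper uses 0.999):

```python
# Hedged sketch of the exponential moving average over model weights:
# shadow <- decay * shadow + (1 - decay) * weights, applied per parameter.
# At test time the shadow values replace the raw weights.
def ema_update(shadow, weights, decay=0.999):
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```

Called once per training step, the shadow copy tracks a smoothed trajectory of the raw weights.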

3.2 Comparisons with the State-of-the-Arts

We compared our approach with all baselines published at the time of submission on the evaluated data sets. The baselines are summarized as follows. (1) On the RACE data set, six baselines were employed, including three introduced in the release of the data set, namely Sliding Window (Richardson et al., 2013), Stanford AR (Chen et al., 2016), and GA (Dhingra et al., 2016), and three recently proposed methods, namely DFN (Xu et al., 2017), BiAttention 250d MRU (Tay et al., 2018), and OFT (Radford et al., 2018). (2) For the MCTest data set, nine baselines were investigated: four based on lexical matching, i.e. RTE, SWD, RTE+SWD (Richardson et al., 2013) and Linguistic (Smith et al., 2015); two using hidden alignment, namely Discourse (Narasimhan and Barzilay, 2015) and Syntax (Wang et al., 2015); and three based on deep learning, i.e. EK (Wang et al., 2016), PH (Trischler et al., 2016b), and HV (Li et al., 2018a). (3) For the multiple-choice questions in the MultiRC data set, we replace the softmax with a sigmoid at the answer generation layer, so as to make a prediction for each option. Accordingly, five baselines were exploited: three used in the release of the data set, namely IR, SurfaceLR, and LR (Khashabi et al., 2018), and two recently proposed methods, namely OFT (Radford et al., 2018) and Strategies (Sun et al., 2018a).

As elaborated in Figure 6, our approach outper-formed the individual baselines on all three datasets 1. Specifically, for RACE data set, our ap-proach achieved the best performance and outper-formed the second one (i.e. OFT) in terms of aver-age accuracy by over 4.12%, 5.00% on RACE-Mand RACE-H, respectively. On MCTest data set,the outperformance was 5.55%, 7.14% over PHbaseline which was the second best on MC160-multi and MC500-multi, respectively, where mul-ti is a subset of the data set that is more diffi-cult and needs understanding multiple sentencesto answer. For MultiRC data set, our approachled to a performance boost against the second bestone (i.e. Strategies) in terms of macro-average F1by over 4.06%, while in terms of micro-average

1 The leaderboard rankings have been refreshed quickly, but our performance was still competitive at camera-ready time.

Figure 6: Comparison of our approach against the state of the art on the RACE, MCTest, and MultiRC data sets, respectively. Statistically significant with p-values < 0.01 using a two-tailed paired test.

F1 and exact match accuracy by over 5.20% and 6.64%, respectively. These results showed that our approach, with its structural multi-step design and context-aware inference, can correctly answer the questions, especially the non-trivial ones requiring reasoning, and thus boost the overall performance.
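For reference, the three MultiRC metrics quoted above can be sketched as follows. This is an illustrative reimplementation over per-option boolean labels; the official scorer should be treated as authoritative:

```python
def f1(tp, fp, fn):
    # Standard F1 from true-positive / false-positive / false-negative counts.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def multirc_metrics(gold, pred):
    # gold/pred: one list of option labels per question (True = correct option).
    per_question, pooled, exact = [], [0, 0, 0], 0
    for g, p in zip(gold, pred):
        tp = sum(gi and pi for gi, pi in zip(g, p))
        fp = sum(pi and not gi for gi, pi in zip(g, p))
        fn = sum(gi and not pi for gi, pi in zip(g, p))
        per_question.append(f1(tp, fp, fn))
        pooled = [pooled[0] + tp, pooled[1] + fp, pooled[2] + fn]
        exact += int(g == p)
    macro = sum(per_question) / len(per_question)  # mean of per-question F1
    micro = f1(*pooled)                            # F1 over pooled option counts
    em = exact / len(gold)                         # fraction of fully correct questions
    return macro, micro, em

gold = [[True, False, True], [True, True, False]]
pred = [[True, False, False], [True, True, False]]
macro, micro, em = multirc_metrics(gold, pred)
```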

3.3 Ablation Studies

To gain better insight into the relative contributions of the various components in our approach, empirical ablation studies were performed on seven aspects, including (1) position- and POS-aware embedding on the question; (2) structural self-attention in document encoding; (3) two in the master unit, that is, guiding the reasoning operation by the previous memory result, and casting back to the original question words; (4) extracting relevant content based on the preceding memory result in the reader unit; (5) two in the writer unit, namely, the non-sequential reasoning and updating gate mechanisms. They are denoted as pos_aware, doc_self_att, rsn_prior_mem, que_w_reg, prior_mem_res, non_seq_rsn, and udt_gate, respectively.

Figure 7: Ablation studies on how various components of our approach affect performance

As displayed in Figure 7, ablating any of the evaluated components in our approach led to a performance drop. The drop was more than 10% for four components: (1) rsn_prior_mem: lacking the memory guidance, the result inferred in the previous step cannot serve as context for the next. Losing such valuable context may lead to misalignment of the reasoning chain. (2) prior_mem_res: discarding the preceding memory result, the relevant content with contextual relations would not be identified. Such relations are the key to transitive reasoning. (3) que_w_reg: without casting back to the original question words,

Figure 8: Evaluation of the termination mechanism

it is equivalent to processing the complex question in one step without identifying the details. Such coarse-grained processing fails to effectively regulate the space of valid reasoning operations, and may confuse the reasoning direction. (4) udt_gate: the gate helps balance complex and simple questions, and reduces long-range dependencies in the reasoning process by skipping, which improves performance. These results further convinced us of the significant value of imposing strong structural priors to help the network derive the evidence chain from text.
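The skipping behavior attributed to the updating gate is essentially a convex blend between the previous memory and the newly inferred one. A scalar sketch, where all names and values are illustrative assumptions rather than the paper's equations:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(prev_mem, new_mem, gate_score):
    # A gate near 0 keeps the previous memory (i.e., skips the hop,
    # which suits simple questions); a gate near 1 accepts the new
    # result. Scalars stand in for memory vectors to keep this small.
    g = sigmoid(gate_score)
    return g * new_mem + (1.0 - g) * prev_mem

# A strongly negative gate score effectively skips this reasoning hop ...
kept = gated_update(prev_mem=0.7, new_mem=0.1, gate_score=-12.0)
# ... while a strongly positive score overwrites the memory.
taken = gated_update(prev_mem=0.7, new_mem=0.1, gate_score=12.0)
```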

Furthermore, we evaluated the efficiency of the termination mechanism by replacing it with a fixed number of steps from 1 up to 5. The results on the RACE and MCTest data sets showed that this replacement led to a drop in average accuracy and a slowdown in convergence. As demonstrated in Figure 8, for fixed-size reasoning, more steps performed well at first but soon deteriorated, while the dynamic strategy can adaptively determine the optimal termination point, which may help boost accuracy.
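The dynamic termination strategy evaluated above can be sketched as a halting loop: run attention-based hops until a termination unit fires or a step cap is reached. All names, the cap, and the threshold below are illustrative assumptions:

```python
import math

MAX_STEPS = 5         # hypothetical cap, mirroring the 1-to-5 fixed-step range
STOP_THRESHOLD = 0.5  # hypothetical firing threshold for the termination unit

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reason(step_fn, stop_score_fn, state):
    # step_fn performs one reasoning hop; the previous result serves as
    # the context of the next. stop_score_fn plays the termination unit.
    trace = []
    for _ in range(MAX_STEPS):
        state = step_fn(state)
        trace.append(state)
        if sigmoid(stop_score_fn(state)) > STOP_THRESHOLD:
            break  # dynamic depth: stop as soon as the unit fires
    return state, trace

# Toy stand-ins: each hop increments the state; the stop score turns
# positive once state >= 3, so the loop halts after 3 of 5 allowed steps.
final, trace = reason(lambda s: s + 1, lambda s: 10.0 if s >= 3 else -10.0, 0)
```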

3.4 Case Study

Due to the use of soft attention, the proposed network offers a traceable reasoning path that can interpret the generation of the answer based on the attended words. To better understand the reasoning behavior, we plotted the attention maps over the document, question, and options in Figure 9 with respect to the sample in Figure 1. From the sequence of maps, we observed that the network adaptively decided which part of the input question should be analyzed at each hop. For example, it first focused on the question aspect "some newspapers refused delivery to distant suburbs." It then generated evidence attended at S5 with respect to the focused aspect by similarity. Subsequently, the aspect "why" was focused on and evidence attended at S4 was identified. We may infer that since


Figure 9: Visualized attention maps on the Figure 1 sample

S4 and the previous intermediate result S5 contain an explainable relation, they would most likely be correlated in the distributed space under sentence-level context-aware encoding. Later, "why" was re-focused, and the evidence attended at S3 was derived. Finally, option B was attended, and the process ended as the termination unit was triggered. These results showed that the network can derive the answer by capturing the underlying semantics of the question and sequentially traversing the relations in the document based on context.

4 Related Work

Earlier studies on machine comprehension mainly focused on text span selection questions. This task is often transformed into a similarity matching problem and solved by feature engineering-based methods (Smith et al., 2015) or deep neural networks. Classical features include lexical features (e.g., word overlap, N-grams, POS tags) (Richardson et al., 2013), syntactic features (Wang et al., 2015), discourse features (Narasimhan and Barzilay, 2015), etc. Typical networks include Stanford AR (Chen et al., 2016), AS Reader (Kadlec et al., 2016), BiDAF (Seo et al., 2016), Match-LSTM (Wang and Jiang, 2017), etc., which used distributed vectors rather than discrete features to better compute contextual similarity.

To support inference, existing models can be classified into three categories: predicate-based methods (Richardson and Domingos, 2006); rule-based methods relying on an external parser (Sun et al., 2018b) or a pre-built tree (Yu et al., 2012); and multi-layer memory networks (Hill et al., 2015), such as the gated-attention net (Dhingra et al., 2016) and the double-sided attention net (Cui et al., 2016). These models either lack an end-to-end design for global training or have no prior structure to subtly guide the reasoning direction. On the topic of multi-hop reasoning, current models often have to rely on a predefined graph constructed by external tools, such as the interpretable network (Zhou et al., 2018) on a knowledge graph. The graph plainly links the facts, from which the intermediate result in the next hop can be directly derived. However, in this paper, the evidence graph is not explicitly given but embodied in the text semantics.

Other related work is on Visual QA, which aims to answer compositional questions with regard to a given image, such as "What color is the matte thing to the right of the sphere in front of the tiny blue block?" In particular, Santoro et al. (2017) proposed a relation net, yet the net was restricted to relational questions, such as comparisons. Later, Hudson and Manning (2018) introduced an iterative network that separated memory and control to improve interpretability. Our work leverages this separated design. Different from previous research, we are dedicated to inferential machine comprehension, where the question may not be compositional (e.g., a why question) but requires reasoning over an unknown evidence chain with uncertain depth. The chain has to be inferred from the text semantics. To the best of our knowledge, no previous studies have investigated an end-to-end approach to this problem.

5 Conclusions and Future Work

We have proposed a network to answer generic questions, especially those requiring reasoning. We decomposed the inference problem into a series of atomic steps, where each was executed by an operational cell designed with prior structure. Multiple cells were recursively linked to produce an evidence chain in a multi-hop manner. Besides, a termination gate was presented to dynamically determine the uncertain reasoning depth, and a reinforcement learning method was used to train the network. Experiments on 3 popular data sets demonstrated the effectiveness of the approach. The approach is currently applied mainly to multiple-choice questions. In the future, we will extend it to support text span selection questions by using the relation type rather than the option as the termination condition. For example, given a why question, the reasoning process should stop when an unrelated relation, such as a transitional relation, is met.

Acknowledgments

This work is supported by the National Key R&D Program of China (2018YFB1004404), the Key R&D Program of Guangdong Province (2018B010107005, 2019B010120001), and the National Natural Science Foundation of China (U1711262, U1401256, U1501252, U1611264, U1711261).


References

D. Chen, J. Bolton, and C.D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th ACL, volume abs/1606.02858.

K. Cho, B.V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP.

Y. Cui, Z. Chen, S. Wei, S. Wang, T. Liu, and G. Hu. 2016. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th ACL, volume abs/1607.04423.

B. Dhingra, H. Liu, W.W. Cohen, and R. Salakhutdinov. 2016. Gated-attention readers for text comprehension. In Proceedings of the 55th ACL, volume abs/1606.01549.

K.M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS, volume abs/1506.03340.

F. Hill, A. Bordes, S. Chopra, and J. Weston. 2015. The goldilocks principle: Reading children's books with explicit memory representations. In Journal of Computer Science, abs/1511.02301.

D.A. Hudson and C.D. Manning. 2018. Compositional attention networks for machine reasoning. In Proceedings of ICLR, volume abs/1803.03067.

R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. 2016. Text understanding with the attention sum reader network. In Proceedings of the 54th ACL, volume abs/1603.01547.

D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of NAACL-HLT, pages 252–262.

D.P. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of ICLR, volume abs/1412.6980.

G. Lai, Q. Xie, H. Liu, Y. Yang, and E.H. Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of EMNLP, volume abs/1704.04683.

C. Li, Y. Wu, and M. Lan. 2018a. Inference on syntactic and semantic structures for machine comprehension. In Proceedings of AAAI, pages 5844–5851.

L. Li, Y. Liu, and A. Zhou. 2018b. Hierarchical attention based position-aware network for aspect-level sentiment analysis. In Proceedings of CoNLL, pages 181–189.

Z. Lin, M. Feng, C.N. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. 2017. A structured self-attentive sentence embedding. In Proceedings of ICLR, volume abs/1703.03130.

A.H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 54th ACL, volume abs/1606.03126.

K. Narasimhan and R. Barzilay. 2015. Machine comprehension with discourse relations. In Proceedings of the 53rd ACL, pages 1253–1262.

J. Pennington, R. Socher, and C.D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. 2018. Improving language understanding by generative pre-training. In eprint.

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. Volume abs/1606.05250.

M. Richardson, C. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP, pages 193–203.

M. Richardson and P. Domingos. 2006. Markov logic networks. In Journal of Machine Learning, 62(1-2):107–136.

A. Santoro, D. Raposo, D.G.T. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T.P. Lillicrap. 2017. A simple neural network module for relational reasoning. In eprint arXiv:1706.01427.

M.J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, volume abs/1611.01603.

Y. Shen, P. Huang, J. Gao, and W. Chen. 2016. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD, volume abs/1609.05284.

E. Smith, N. Greco, M. Bosnjak, and A. Vlachos. 2015. A strong lexical matching method for the machine comprehension test. In Proceedings of EMNLP, pages 1693–1698.

K. Sun, D. Yu, D. Yu, and C. Cardie. 2018a. Improving machine reading comprehension with general reading strategies. In eprint arXiv:1810.13441.

Y. Sun, G. Cheng, and Y. Qu. 2018b. Reading comprehension with graph-based temporal-casual reasoning. In Proceedings of COLING, pages 806–817.

Y. Tay, L.A. Tuan, and S.C. Hui. 2018. Multi-range reasoning for machine comprehension. In eprint arXiv:1803.09074.

A. Trischler, Z. Ye, X. Yuan, and K. Suleman. 2016a. Natural language comprehension with the EpiReader. In Proceedings of EMNLP, volume abs/1606.02270.

A. Trischler, Y. Zheng, X. Yuan, H. Jing, and P. Bachman. 2016b. A parallel-hierarchical model for machine comprehension on sparse data. In Proceedings of the 54th ACL, pages 432–441.

B. Wang, S. Guo, L.L. Kang, S. He, and J. Zhao. 2016. Employing external rich knowledge for machine comprehension. In Proceedings of IJCAI.

H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In Proceedings of the 53rd ACL, pages 700–706.

S. Wang and J. Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. In Proceedings of ICLR, volume abs/1608.07905.

W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th ACL, pages 189–198.

J. Weston, A. Bordes, S. Chopra, and T. Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. In Proceedings of ICLR, volume abs/1502.05698.

J. Weston, S. Chopra, and A. Bordes. 2014. Memory networks. In eprint arXiv, abs/1410.3916.

R.J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Journal of Machine Learning, 8(3-4):229–256.

Y. Xu, J. Liu, J. Gao, Y. Shen, and X. Liu. 2017. Dynamic fusion networks for machine reading comprehension. In eprint arXiv:1711.04964.

J. Yu, Z.J. Zha, and T.S. Chua. 2012. Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In Proceedings of EMNLP.

M. Zhou, M. Huang, and X. Zhu. 2018. An interpretable reasoning network for multi-relation question answering. In Proceedings of COLING, volume abs/1801.04726.