
Complex Program Induction for Querying Knowledge Bases in the Absence of Gold Programs

Amrita Saha1 Ghulam Ahmed Ansari1 Abhishek Laddha∗1

Karthik Sankaranarayanan1 Soumen Chakrabarti2

1IBM Research India, 2Indian Institute of Technology
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

Recent years have seen increasingly complex question-answering on knowledge bases (KBQA) involving logical, quantitative, and comparative reasoning over KB subgraphs. Neural Program Induction (NPI) is a pragmatic approach toward modularizing the reasoning process by translating a complex natural language query into a multi-step executable program. While NPI has been commonly trained with the ‘‘gold’’ program or its sketch, for realistic KBQA applications such gold programs are expensive to obtain. There, practically only natural language queries and the corresponding answers can be provided for training. The resulting combinatorial explosion in program space, along with extremely sparse rewards, makes NPI for KBQA ambitious and challenging. We present Complex Imperative Program Induction from Terminal Rewards (CIPITR), an advanced neural programmer that mitigates reward sparsity with auxiliary rewards, and restricts the program space to semantically correct programs using high-level constraints, KB schema, and inferred answer type. CIPITR solves complex KBQA considerably more accurately than key-value memory networks and neural symbolic machines (NSM). For moderately complex queries requiring 2- to 5-step programs, CIPITR scores at least 3× higher F1 than the competing systems. On one of the hardest classes of programs (comparative reasoning) with 5–10 steps, CIPITR outperforms NSM by a factor of 89 and memory networks by 9 times.1

∗ Now at Hike Messenger.
1 The NSM baseline in this work is a re-implemented version, as the original code was not available.

1 Introduction

Structured knowledge bases (KB) like Wikidata and Freebase can support answering questions (KBQA) over a diverse spectrum of structural complexity. This includes queries with single-hop (Obama's birthplace) (Yao, 2015; Berant et al., 2013), or multi-hop (who voiced Meg in Family Guy) (Bast and Haußmann, 2015; Yih et al., 2015; Xu et al., 2016; Guu et al., 2015; McCallum et al., 2017; Das et al., 2017), or complex queries such as ‘‘how many countries have more rivers and lakes than Brazil?’’ (Saha et al., 2018). Complex queries require a proper assembly of selected operators from a library of graph, set, logical, and arithmetic operations into a complex procedure; this assembly is the subject of this paper.

Relatively simple query classes, in particular, in which answers are KB entities, can be served with feed-forward (Yih et al., 2015) and seq2seq (McCallum et al., 2017; Das et al., 2017) networks. However, such systems show copying or rote learning behavior when Boolean or open numeric domains are involved. More complex queries need to be evaluated as an acyclic expression graph over nodes representing KB access, set, logical, and arithmetic operators (Andreas et al., 2016a). A practical alternative to inferring a stateless expression graph is to generate an imperative sequential program to solve the query. Each step of the program selects an atomic operator and a set of previously defined variables as arguments and writes the result to scratch memory, which can then be used in subsequent steps. Such imperative programs are preferable to opaque, monolithic networks for their interpretability and generalization to diverse domains. Another


Transactions of the Association for Computational Linguistics, vol. 7, pp. 185–200, 2019. Action Editor: Scott Wen-tau Yih. Submission batch: 8/2018; Revision batch: 11/2018; Final submission: 1/2019; Published 4/2019.

© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


motivation behind opting for the program induction paradigm for solving complex tasks, such as complex question answering, is modularizing the end-to-end complex reasoning process. With this approach it is now possible to first train separate modules for each of the atomic operations involved and then train a program induction model that learns to use these separately trained models and invoke the sub-modules in the correct fashion to solve the task. These sub-modules can even be task-agnostic generic models that can be pre-trained with much more extensive training data, while the program induction model learns from examples pertaining to the specific task. This paradigm of program induction has been used for decades, with rule induction and probabilistic program induction techniques in Lake et al. (2015) and by constructing algorithms utilizing formal theorem-proving techniques in Waldinger and Lee (1969). These traditional approaches (e.g., Muggleton and Raedt, 1994) incorporated domain-specific knowledge about programming languages instead of applying learning techniques. More recently, to promote generalizability and reduce dependency on domain-specific knowledge, neural approaches have been applied to problems like addition, sorting, and word algebra problems (Reed and de Freitas, 2016; Bosnjak et al., 2017) as well as for manipulating a physical environment (Bunel et al., 2018).

Program induction has also seen initial promise in translating simple natural language queries into programs executable in one or two hops over a KB to obtain answers (Liang et al., 2017). In contrast, many of the complex queries from Saha et al. (2018), such as the one in Figure 1, require up to 10-step programs involving multiple relations and several arithmetic and logical operations. Sample operations include gen_set: collecting {t : (h, r, t) ∈ KB}, computing set_union, counting set sizes (set_count), comparing numbers or sets, and so forth. These operations need to be executed in the correct order, with correct parameters, sharing information via intermediate results to arrive at the correct answer. Note also that the actual gold program is not available for supervision, and therefore the large space of possible translation actions at each step, coupled with a large number of steps needed to get any payoff, makes the reward very sparse. This renders complex KBQA in the absence of gold programs extremely challenging.

Figure 1: The CIPITR framework reads a natural language query and writes a program as a sequence of actions, guided at every step by constraints posed by the KB and the answer type. Because the space of actions is discrete, REINFORCE is used to learn the action selection by computing the reward from the output answer obtained by executing the program and the target answer, which is the only source of supervision.

Main Contributions

• We present ‘‘Complex Imperative Program Induction from Terminal Rewards’’ (CIPITR),2 an advanced Neural Program Induction (NPI) system that is able to answer complex logical, quantitative, and comparative queries by inducing programs of length up to 7, using 20 atomic operators and 9 variable types. This, to our knowledge, is the first NPI system to be trained with only the gold answer as (very distant) supervision for inducing such complex programs.

• CIPITR reduces the combinatorial program space to only semantically correct programs by (i) incorporating symbolic constraints guided by the KB schema and inferred answer type, and (ii) adopting pragmatic programming techniques by decomposing the final goal into a hierarchy of sub-goals, thereby mitigating the sparse reward problem by considering additional auxiliary rewards in a generic, task-independent way.

We evaluate CIPITR on the following two challenging tasks: (i) complex KBQA posed by the recently published CSQA data set (Saha et al., 2018) and (ii) multi-hop KBQA in one of the more

2 The code and reinforcement learning environment of CIPITR are made public at https://github.com/CIPITR/CIPITR.



popularly used KBQA data sets, WebQuestionsSP (Yih et al., 2016). WebQuestionsSP involves complex multi-hop inferencing, sometimes with additional constraints, as we will describe later. However, CSQA poses a much greater challenge, with its more diverse classes of complex queries and almost 20 times larger scale. On a data set such as CSQA, contemporary models like neural symbolic machines (NSM) fail to handle the exponential growth of the program search space caused by a large number of operator choices at every step of a lengthy program. Key-value memory networks (KVMnet) (Miller et al., 2016) are also unable to perform the necessary complex multi-step inference. CIPITR outperforms them both by a significant margin while avoiding exploration of unwanted program space or memorization of low-entropy answer distributions. On even moderately complex programs of length 2–5, CIPITR scored at least 3× higher F1 than both. On one of the hardest classes of programs, of around 5–10 steps (i.e., comparative reasoning), CIPITR outperformed NSM by a factor of 89 and KVMnet by a factor of 9. Further, we empirically observe that among all the competing models, CIPITR shows the best generalization across diverse program classes.

2 Related Work

Whereas most of the earlier efforts to handle complex KBQA did not involve writable memory, some recent systems (Miller et al., 2016; Neelakantan et al., 2015, 2016; Andreas et al., 2016b; Dong and Lapata, 2016) used end-to-end differentiable neural networks. One of the state-of-the-art neural models for KBQA, the key-value memory network KVMnet (Miller et al., 2016), learns to answer questions by attending on the relevant KB subgraph stored in its memory. Neelakantan et al. (2016) and Pasupat and Liang (2015) support simple queries over tables, for example, of the form ‘‘find the sum of a specified column’’ or ‘‘list elements in a column more than a given value.’’ The query is read by a recurrent neural network (RNN), and then, in each translation step, the column and operator are selected using the query representation and the history of operators and columns selected in the past. Andreas et al. (2016b) use a ‘‘stateless’’ model where neural network based subroutines are assembled using syntactic parsing.

Recently, Reed and de Freitas (2016) took an early influential step with the NPI compositional framework that learns to decompose high-level tasks like addition and sorting into program steps (carry, comparison) aided by persistent memory. It is trained with the high-level task input and output as well as all the program steps. Li et al. (2016) and Bosnjak et al. (2017) took another important step forward by replacing NPI's expensive strong supervision with supervision of the program sketch. This form of supervision at every intermediate step still keeps the problem simple, by arresting the program space to a tractable size. Although such data are easy to generate for simpler problems such as arithmetic and sorting, they are expensive for KBQA. Liang et al. (2017) proposed the NSM framework in the absence of the gold program, which translates the KB query to a structured program token-by-token. While being a natural approach for program induction, NSM has several inherent limitations preventing generalization towards longer programs that are critical for complex KBQA. Subsequently, it was evaluated only on WebQuestionsSP (Yih et al., 2016), which requires relatively simpler programs. We consider NSM as the primary and KVMnet as an additional baseline and show that CIPITR significantly outperforms both, especially on the more complex query types.

3 Complex KBQA Problem Set-up

3.1 CSQA Data Set

The CSQA data set (Saha et al., 2018) contains 1.15M natural language questions and their corresponding gold answers from the Wikidata knowledge base. Figure 1 shows a sample query from the data set along with its true program-decomposed form, the latter not provided by CSQA. CSQA is particularly suited to study the Complex Program Induction (CPI) challenge over other KBQA data sets because:

• It contains large-scale training data of question-answer pairs across diverse classes of complex queries, each requiring different inference tools over large KB sub-graphs.

• Poor state-of-the-art performance of memory networks on it motivates the need for sweeping changes to the NPI's learning strategy.



• The massive size of the KB involved (13 million entities and 50 million tuples) poses a scalability challenge for prior NPI techniques.

• Availability of KB metadata helps standardize comparisons across techniques (explained subsequently).

We adapt CSQA in two ways for the CPI problem.

Removal of extended conversations: To be consistent with the NSM work on KBQA, we discard QA pairs that depend on the previous dialogue context. This is possible as every query is annotated with information on whether it is self-contained or depends on the previous context. Relevant statistics of the resulting data set are presented in Table 3.

Use of gold entity, type, and relation annotations to standardize comparisons: Our focus being on the reasoning aspect of the KBQA problem, we use the gold annotations of canonical KB entities, types, and relations available in the data set along with the queries, in order to remove a prominent source of confusion in comparing KBQA systems (i.e., all systems take as inputs the natural language query, with spans identified with KB IDs of entities, types, relations, and integers). Although annotation accuracy affects a complete KBQA system, our focus here is on complex, multi-step program generation with only the final answer as the distant supervision, and not entity/type/relation linking.

3.2 WebQuestionsSP Data Set

In Figure 2 we illustrate one of the most complex questions from the WebQuestionsSP data set and its semantic-parsed version provided by a human annotator. Questions in the WebQuestionsSP data set are answerable from the Freebase KB and typically require up to 2-hop inference chains, sometimes with the additional requirement of satisfying specific constraints. These constraints can be temporal (e.g., governing_position_held_from) or non-temporal (e.g., government_office_position_or_title). The human-annotated semantic parse of the questions provides the exact structure of the subgraph and the inference process on it to reach the final answer. As in this work we are focusing on inducing programs where the gold entity and relation annotations are known, for this data set as well we use the human annotations to collect all

Figure 2: Semantic-parsed form of a sample question from WebQuestionsSP, along with the depiction of the reasoning over the subgraph to reach the answer.

the entities and relations in the oracle subgraph associated with the query. The NPI model has to understand the role of these gold program inputs in question answering and learn to induce a program that reflects the same inferencing.

4 Complex Imperative Program Induction from Terminal Rewards

4.1 Notation

This subsection introduces the different notations commonly used by our model.

Nine variable-types: (distinct from KB types)

• KB artifacts: ent (entity), rel (relation), type

• Base data types: int, bool, None (empty argument type used for padding)

• Composite data types: set (i.e., a set of KB entities), and map_set and map_int (i.e., a mapping function from an entity to a set of KB entities or to an integer)

Twenty Operators:

• gen_set(ent, rel, type) → set

• verify(ent, rel, ent) → bool

• gen_map_set(type, rel, type) → map_set

• map_count(map_set) → map_int

• set_{union/ints/diff}(set, set) → set

• map_{union/ints/diff}(map_set, map_set) → map_set

• set_count(set) → int

• select_{atleast/atmost/more/less/equal/approx}(map_int, int) → set

• select_{max/min}(map_int) → ent

• no_op() (i.e., no action taken)
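To make the operator vocabulary concrete, the following minimal sketch (our own illustration rather than code from the paper; the toy KB triples, the type table, and the relation name flow are assumptions) composes two of the operators above into the 2-step program from Figure 4, A = gen_set(Brazil, flow, river); B = set_count(A):

```python
# Toy KB of (head, relation, tail) triples and an entity-type table (illustrative only).
TOY_KB = {
    ("Brazil", "flow", "Amazon"),
    ("Brazil", "flow", "Parana"),
    ("Egypt", "flow", "Nile"),
}
TOY_TYPES = {"Amazon": "river", "Parana": "river", "Nile": "river"}

def gen_set(ent, rel, typ):
    """gen_set(ent, rel, type) -> set: tails t with (ent, rel, t) in the KB and type(t) == typ."""
    return {t for (h, r, t) in TOY_KB if h == ent and r == rel and TOY_TYPES.get(t) == typ}

def set_count(s):
    """set_count(set) -> int."""
    return len(s)

A = gen_set("Brazil", "flow", "river")   # {"Amazon", "Parana"}
B = set_count(A)                         # 2
print(A, B)
```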

Symbols and Hyperparameters: (typical values)

• num_op: Number of operators (20)

• num_var_types: Number of variable types (9)

• max_var: Maximum number of variables accommodated in memory for each type (3)

• m: Maximum number of arguments for an operator, with None padding for fewer arguments (3)

• d_key & d_val: Dimensions of the key and value embeddings (100, 300)

• n_p & n_v: Number of operators, and of argument variables per operator, sampled at each step (4, 10)

• f with subscript: some feed-forward network

Embedding Matrices: The model is trained with a vocabulary of operators and variable types. In order to sample operators, two matrices M_op_key ∈ R^{num_op × d_key} and M_op_val ∈ R^{num_op × d_val} are needed for encoding the operators' key and value embeddings. The key embedding is used for looking up and retrieving an entry from the operator vocabulary, and the corresponding value embedding encodes the operator information. A variable type has only the value embedding M_vtype_val ∈ R^{num_var_types × d_val}, as no lookup is needed on it.

Operator Prototype Matrices: These matrices store the argument variable-type information for the m arguments of every operator in M_op_arg ∈ {0, 1, . . . , num_var_types}^{num_op × m}, and the output variable type created by it in M_op_out ∈ {0, 1, . . . , num_var_types}^{num_op}.

Memory Matrices: This is the query-specific scratch memory for storing new program variables as they get created by CIPITR. For each variable type, we have separate key and value embedding matrices M_var_key ∈ R^{num_var_types × max_var × d_key} and M_var_val ∈ R^{num_var_types × max_var × d_val}, respectively, for looking up a variable in memory and accessing the information in it. In addition, we also have a variable attention matrix M_var_att ∈ R^{num_var_types × max_var}, which stores the attention vector over the variables declared of each type.

CIPITR consists of three components:

The preprocessor takes the input query and the KB and performs the task of entity, relation, and type linking, which acts as input to the program induction. It also pre-populates the variable memory matrices with any entity, relation, type, or integer variable directly extracted from the query.

The programmer model takes as input the natural language question, the KB, and the pre-populated variable memory tables to generate a program (i.e., a sequence of operators invoked with past instantiated variables as their arguments, generating new variables in memory).

The interpreter executes the generated program with the help of the KB and scratch memory and outputs the system answer.

During training, the predicted answer is compared with the gold answer to obtain a reward, which is sent back to CIPITR to update its model parameters through a REINFORCE (Williams, 1992) objective. In the current version of CIPITR, the preprocessor consults an oracle to link entities, types, and relations in the query to the KB. This is to isolate the programming performance of CIPITR from the effect of imperfect linkage. Extending earlier studies (Karimi et al., 2012; Khalid et al., 2008) to investigate robustness of CIPITR to linkage errors may be of future interest.

4.2 Basic Memory Operations in CIPITR

We describe some of the foundational modules invoked by the rest of CIPITR.

Memory Lookup: The memory lookup looks up scratch memory with a given probe, say x (of arbitrary dimension), and retrieves the memory entry having the closest key embedding to x. It first passes x through a feed-forward layer to transform its dimension to the key embedding dimension, giving x_key. Then, by computing a softmax over the matrix product of M_x_key and x_key, the distribution over the memory variables for lookup is obtained:

x_key = f(x),   x_dist = softmax(M_x_key · x_key)
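As a rough illustration of this lookup (a NumPy sketch; the single-layer tanh form of f and the parameter names W_f and b_f are our own assumptions, not the paper's exact parameterization):

```python
import numpy as np

def memory_lookup(x, W_f, b_f, M_x_key):
    """Project the probe x to the key dimension with a feed-forward layer f,
    then softmax over its similarity to every stored key."""
    x_key = np.tanh(W_f @ x + b_f)                  # f(x)
    scores = M_x_key @ x_key                        # one score per memory entry
    scores -= scores.max()                          # numerical stability
    x_dist = np.exp(scores) / np.exp(scores).sum()  # softmax
    return x_key, x_dist
```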



Figure 3: CIPITR control flow (with n_p = 1 and n_v = 1 for simplicity) depicting the order in which the different modules (see Section 4) are invoked. A corresponding example execution trace of the CIPITR algorithm is given in Figure 4.

Feasibility Sampling: To restrict the search space to meaningful programs, CIPITR incorporates both high-level generic and task-specific constraints when sampling any action. The generic constraints can help it adopt more pragmatic programming styles, like not repeating lines of code or avoiding syntactical errors. The task-specific constraints ensure that the generated program is consistent with the KB schema or, on execution, gives an answer of the desired variable type. To sample from the feasible subset using these constraints, the input sampling distribution x_dist is element-wise transformed by a feasibility vector x_feas, followed by L1-normalization. Along with the transformed distribution, the top-k entries x_sampled are also returned.

Algorithm 1 Feasibility Sampling
Input:
• x_dist ∈ R^N (where N is the size of the population set over which lookup needs to be done)
• x_feas ∈ {0, 1}^N (Boolean feasibility vector)
• k (top-k sampled)
Procedure: FeasSampling(x_dist, x_feas, k)
  x_dist = x_dist ⊙ x_feas (element-wise multiply)
  x_dist = L1-Normalize(x_dist)
  x_sampled = k-argmax(x_dist)
Output: x_dist, x_sampled
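A minimal NumPy sketch of Algorithm 1 (the handling of an all-infeasible distribution is our own choice, not specified by the paper):

```python
import numpy as np

def feas_sampling(x_dist, x_feas, k):
    """Mask out infeasible entries, L1-normalize what remains, return the top-k indices."""
    masked = x_dist * x_feas            # element-wise feasibility mask
    total = masked.sum()
    if total > 0:
        masked = masked / total         # L1 normalization
    top_k = np.argsort(-masked)[:k]     # k-argmax
    return masked, top_k
```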

Writing a new variable to memory: This operation takes a newly generated variable, say x, of type x_type and adds its key and value embeddings to the row corresponding to x_type in the memory matrices. Further, it updates the attention vector for x_type to give maximum weight to the newest variable generated, thus emulating stack-like behavior.

Algorithm 2 Write a new variable to memory
Input:
• x_key, x_val: the key and value embeddings of x
• x_type: a scalar denoting the type of variable x
Procedure: WriteVarToMem(x_key, x_val, x_type)
  i = the first empty slot in the row M_var_key[x_type, :]
  M_var_key[x_type, i] = x_key
  M_var_val[x_type, i] = x_val
  M_var_att[x_type, :] = L1-Normalize(M_var_att[x_type, :] + One-Hot(i))
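A NumPy sketch of Algorithm 2 (the `occupied` slot counter is our own bookkeeping for finding the first empty slot):

```python
import numpy as np

def write_var_to_mem(x_key, x_val, x_type, M_var_key, M_var_val, M_var_att, occupied):
    """Store the new variable in the first free slot of its type's row and shift the
    attention vector toward it, emulating the stack-like behavior described above."""
    i = occupied[x_type]                                       # first empty slot for this type
    M_var_key[x_type, i] = x_key
    M_var_val[x_type, i] = x_val
    att = M_var_att[x_type] + np.eye(M_var_att.shape[1])[i]   # add One-Hot(i)
    M_var_att[x_type] = att / att.sum()                       # L1 normalization
    occupied[x_type] = i + 1
```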

4.3 CIPITR Architecture

In Figure 3, we sketch the CIPITR components; in this section we describe them in the order they appear in the model.

Query Encoder: The query is first parsed into a sequence of KB entities and non-KB words. KB entities e are embedded with the concatenated vector [TransE(e), 0] using Bordes et al. (2013), and non-KB words ω with [0, GloVe(ω)]. The final query representation q is obtained from a GRU encoder.
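The sketch below (PyTorch; the hidden sizes, batching, and the pre-computed TransE/GloVe lookups are assumptions for illustration, not the paper's exact configuration) shows one way such an encoder could be wired:

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """KB entities contribute [TransE(e), 0], other words [0, GloVe(w)];
    a GRU over the concatenated sequence yields the query representation q."""
    def __init__(self, d_kb=100, d_word=300, d_hidden=300):
        super().__init__()
        self.gru = nn.GRU(d_kb + d_word, d_hidden, batch_first=True)

    def forward(self, transe_vecs, glove_vecs, is_entity):
        # transe_vecs: (B, L, d_kb), glove_vecs: (B, L, d_word), is_entity: (B, L) bool
        mask = is_entity.unsqueeze(-1).float()
        kb_part = transe_vecs * mask            # zeroed for non-KB words
        word_part = glove_vecs * (1.0 - mask)   # zeroed for KB entities
        tokens = torch.cat([kb_part, word_part], dim=-1)
        _, h_last = self.gru(tokens)
        return h_last.squeeze(0)                # q, shape (B, d_hidden)
```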

NPI Core: The query representation q is fed at the initial timestep to an environment encoding RNN, which gives out the environment state e_t at every timestep. This, along with the value embedding u_val_{t−1} of the last output variable generated by the NPI engine, is fed at every timestep into another RNN that finally outputs the program state h_t. h_t is then fed into the successive modules of the program induction engine as described below. The ‘OutVarGen’ procedure describes how to obtain u_val_{t−1}.

Procedure: NPICore(e_{t−1}, h_{t−1}, u_val_{t−1})
  e_t = GRU(e_{t−1}, u_val_{t−1})
  h_t = GRU(e_t, u_val_{t−1}, h_{t−1})
Output: e_t, h_t
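One possible realization of this core with two GRU cells (a PyTorch sketch under assumed dimensions; the exact way the inputs are combined is our guess, not specified beyond the equations above):

```python
import torch
import torch.nn as nn

class NPICore(nn.Module):
    """Two-RNN NPI core: one GRU cell tracks the environment state e_t,
    the other produces the program state h_t from [e_t; u_val_{t-1}]."""
    def __init__(self, d_val=300, d_state=300):
        super().__init__()
        self.env_cell = nn.GRUCell(d_val, d_state)
        self.prog_cell = nn.GRUCell(d_state + d_val, d_state)

    def forward(self, e_prev, h_prev, u_val_prev):
        e_t = self.env_cell(u_val_prev, e_prev)                             # e_t = GRU(e_{t-1}, u_val_{t-1})
        h_t = self.prog_cell(torch.cat([e_t, u_val_prev], dim=-1), h_prev)  # h_t = GRU(e_t, u_val_{t-1}, h_{t-1})
        return e_t, h_t
```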

Operator Sampler: It takes the program state h_t, a Boolean vector p_feas_t denoting operator feasibility, and the number of operators to sample, n_p. It passes h_t through the Lookup operation followed by Feasibility Sampling to obtain the top n_p operators P_t.

Argument Variable Sampler: For each sampled operator p, it takes: (i) the program state h_t, (ii) the list of variable types V_type_p of the m arguments, obtained by looking up the operator prototype matrix M_op_arg, and (iii) a Boolean vector V_feas_p that indicates the valid variable configurations for the m-tuple arguments of the operator p. For each of the m arguments, a feed-forward network f_vtype first transforms the program state h_t to a vector in R^{max_var}. This is then element-wise multiplied with the current attention state over the variables in memory of that type. This provides the program-state-specific attention over variables v_att_{p,j}, which is then passed through the Lookup function to obtain the distribution over the variables in memory. Next, feasibility sampling is applied over the joint distribution of its argument variables, comprised of the m individual distributions. This provides the top n_v tuples of m-variable instantiations V_p.

Output Variable Generator: The new variable u_p of type u_type_p = M_op_out[p] is generated by the procedure OutVarGen by invoking a sampled operator p with m variables v_{p,1} · · · v_{p,m} of types v_type_{p,1} · · · v_type_{p,m} as arguments. This also requires generating the key and value embeddings, which are both obtained by applying different feed-forward layers over the concatenated representation of the value embedding of the operator M_op_val[p], the argument types (M_vtype_val[v_type_{p,1}] · · · M_vtype_val[v_type_{p,m}]), and the instantiated variables (M_var_val[v_type_{p,1}, v_{p,1}] · · · M_var_val[v_type_{p,m}, v_{p,m}]). The newly generated variable is then written to memory using Algorithm WriteVarToMem.

Procedure: ArgVarSampler(h_t, V_type_p, V_feas_p, n_v)
  for j ∈ 1, 2, ..., m do
    v_att_{p,j} = softmax(M_var_att[V_type_{p,j}]) ⊙ f_vtype(h_t)
    v_dist_{p,j} = Lookup(v_att_{p,j}, f_var, M_var_key[V_type_{p,j}])
  V_dist_p = v_dist_{p,1} × v_dist_{p,2} × ... × v_dist_{p,m}   (joint distribution)
  V_dist_p, V_p = FeasSampling(V_dist_p, V_feas_p, n_v)
Output: V_p

End-to-End CIPITR training: CIPITR takes a natural language query and generates an output program in a number of steps. A program is composed of actions, which are operators applied over variables (as in Figure 3). In each step, it selects an operator and a set of previously defined variables as its arguments, and writes the operator output to a dynamic memory, to be subsequently used for further search of the next actions. To reduce exposure bias (Ranzato et al., 2015), CIPITR uses a beam search to obtain multiple candidate programs to provide feedback to the model from a single training instance. Algorithm 3 shows the pseudocode of the program induction algorithm (with beam size b as 1 for simplicity), which goes over T time steps, each time sampling n_p feasible operators conditional on the program state. Then, for each of the n_p operators, it samples n_v feasible

Algorithm 3 CIPITR pseudo-code (beam size = 1)
Query Encoding: q = GRU(Query)
Initialization: e_1, h_1 = f(q), A = [ ]
for t ∈ 1, ..., T do
  p_feas_t = FeasibleOp()
  P_t = OperatorSampler(h_t, p_feas_t, n_p)
  C = {}
  for p ∈ P_t do
    V_type_p = [v_type_{p,1}, ..., v_type_{p,m}] = M_op_arg[p]
    V_feas_p = FeasibleVar(p)
    V_p = ArgVarSampler(h_t, V_type_p, V_feas_p, n_v)
    for V ∈ V_p do
      C = C ∪ {(p, V, V_type_p)}
  (p, V, V_type_p) = argmax(C)
  u_key_p, u_val_p, u_type_p = OutVarGen(p, V_type_p, V)
  WriteVarToMem(u_key_p, u_val_p, u_type_p)
  e_{t+1}, h_{t+1} = NPICore(e_t, h_t)
  A.append((p, V))
Output: A



variable instantiations, resulting in a total of n_p * n_v candidates, out of which the b most-likely actions are sampled for the b beams and the corresponding newly generated variables written into memory. This way the algorithm progresses to finally output b candidate programs, each of which will feed the model back with some reward. Finally, in order to learn from the discrete action samples, the REINFORCE objective (Williams, 1992) is used. Because of lack of space, we do not provide the equation for REINFORCE, but our objective formulation remains very similar to that in Liang et al. (2017). We next describe several learning challenges that arise in the context of this overall architecture.
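The sketch below shows the shape of such a terminal-reward REINFORCE update (a simplified PyTorch illustration with a mean-reward baseline of our own choosing; it is not the exact objective of Liang et al. (2017) or of this paper):

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=None):
    """log_probs: one 1-D tensor of per-action log-probabilities per beam;
    rewards: one terminal reward per beam. Each beam contributes
    -(R - b) * sum(log pi(a_t)) to the loss."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    b = rewards.mean() if baseline is None else baseline   # variance-reducing baseline
    per_beam = torch.stack([lp.sum() for lp in log_probs])
    return -((rewards - b) * per_beam).mean()
```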

5 Mitigating Large Program Space and Sparse Reward

Handling complex queries by expanding the operator set and generating longer programs blows up the program space to a huge size of (num_op × (max_var)^m)^T. This, in the absence of gold programs, poses serious training challenges for the programmer. Additionally, whereas the relatively simple NSM architecture could explore a large beam size (50–100), the complex architecture of CIPITR entailed by the CPI problem could only afford to operate with a smaller beam size (≤ 20), which further exacerbates the sparsity of the reward space. For example, for integer answers, only a single point in the integer space returns a positive reward, without any notion of partial reward. Such a delayed (indeed, terminal) reward causes high variance, instability, and local minima issues. A problem as complex as ours requires not only generic constraints for producing semantically correct programs, but also incorporation of prior knowledge, if the model permits. We now describe how to guide CIPITR more efficiently through such a challenging environment using both generic and task-specific constraints.
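With the typical values from Section 4.1 (num_op = 20, max_var = 3, m = 3) and T = 7 steps, this bound can be evaluated directly; it matches the roughly 1.33 × 10^19 candidate programs of length 7 quoted later in Section 6:

```python
num_op, max_var, m, T = 20, 3, 3, 7
per_step = num_op * max_var ** m   # 20 * 27 = 540 candidate actions per step
print(per_step ** T)               # 540**7, about 1.3e19 programs of length 7
```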

Phase change network: For complex real-world problems, the reinforcement learning community has proposed various task abstractions (Parr and Russell, 1998; Dietterich, 2000; Bakker and Schmidhuber, 2004; Barto and Mahadevan, 2003; Sutton et al., 1999) to address the curse of dimensionality in exponential action spaces. HAMs, proposed by Parr and Russell (1998), are one such important form of abstraction, aimed at restricting

Figure 4: An example of a CIPITR execution trace depicting the internals of memory and action sampling to generate the program: (A = gen_set(Brazil, flow, river), B = set_count(A)).

the realizable action sequences. Inspired by HAMs, we decompose the program synthesis into phases having restricted action spaces. The first phase (retrieval phase) constitutes gathering the information from the preprocessed input variables only (i.e., KB entities, relations, types, integers). This restricts the feasible operator set to gen_set, gen_map_set, and verify. In the second phase (algorithm phase) the model is allowed to operate on all the generated variables in order to reach the answer. The programmer learns whether to switch from the first phase to the second at any timestep t, based on parameter φ_t (φ_t = 1 indicating



change of phase, where φ_0 = 0), which is obtained as φ_t = 𝟙{max(sigmoid(f(h_t)), φ_{t−1}) ≥ φ_thresh} if t < T/2, else 1 (T being the total number of time steps; φ_thresh is set to 0.8 in our experiments). The motivation behind this is similar to the multi-staged techniques that have been adopted to make QA tasks more tractable, as in Yih et al. (2015) and Iyyer et al. (2017). In contrast, here we further allow the model to learn when to switch from one stage to the next. Note that this is a generic characteristic, as for every task this kind of phase division is possible.
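A compact sketch of this gating decision (the function f stands for the feed-forward network over the program state; its signature here is our own):

```python
import numpy as np

def phase_gate(h_t, phi_prev, t, T, f, phi_thresh=0.8):
    """Return phi_t: 1 switches to the algorithm phase, 0 stays in the retrieval phase."""
    if t >= T / 2:
        return 1.0                               # forced switch in the second half
    score = 1.0 / (1.0 + np.exp(-f(h_t)))        # sigmoid(f(h_t))
    return 1.0 if max(score, phi_prev) >= phi_thresh else 0.0
```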

Generating semantically correct programs: Other than the generic syntactical and semantic rules, the NPI paradigm also allows us to leverage prior knowledge and to incorporate task-specific symbolic constraints in the program representation learning in an end-to-end differentiable way.

• Enforcing KB consistency: Operators used in the retrieval phase (described above) must honor the KB-imposed constraints, so as not to initialize variables that are inconsistent with respect to the KB. For example, a set variable assigned from gen_set is considered valid only when the ent, rel, type arguments to gen_set are consistent with the KB.

• Biasing the last operator using an answer type predictor: Answer type prediction is a standard preprocessing step in question answering (Li and Roth, 2002). For this we use a rule-based predictor that has 98% accuracy. The predicted answer type helps in directing the program search toward the correct answer type by biasing the sampling towards feasible operators that can produce the desired answer type.

• Auxiliary reward strategy: The Jaccard score of the executed program's output and the gold answer set is used as the reward. An invalid program gets a reward of −1. Further, to mitigate the sparsity of the extrinsic rewards, an additional auxiliary feedback is designed to reward the model for generating an answer of the predicted answer type; a linear decay makes the effect of this auxiliary reward vanish eventually (a sketch of the reward follows this list). Such a curriculum learning mechanism, while being particularly useful for the more complex queries, is still quite generic, as it does not require any additional task-specific prior knowledge.
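A rough sketch of this reward computation (the decay horizon and the exact combination of terms are illustrative assumptions; Table 1 only specifies up to which iteration the auxiliary reward remains active):

```python
def program_reward(predicted, gold, answer_type_ok, step, decay_steps=800):
    """Jaccard overlap with the gold answer set, -1 for invalid programs,
    plus a linearly decaying auxiliary bonus for matching the predicted answer type."""
    if predicted is None:                        # invalid / non-executable program
        return -1.0
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    jaccard = len(predicted & gold) / len(union) if union else 0.0
    decay = max(0.0, 1.0 - step / decay_steps)   # auxiliary effect vanishes over time
    return jaccard + decay * (1.0 if answer_type_ok else 0.0)
```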

Beam Management and Action Sampling

• Pruning beams by target answer type: Penalize beams that terminate with an answer type not matching the predicted answer type.

• Length-based normalization of beam scores: To counteract the tendency of beam search to favor shorter beams as more probable, and to ensure the scoring is fair to longer beams, we normalize the beam scores with respect to their length.

• Penalizing beams for no_op operators: Another way of biasing the beams toward generating longer sequences is by penalizing the number of times a beam takes no_op as the action. Specifically, we reduce the beam score by a hyperparameter-controlled logarithmic factor of the number of no_op actions taken so far (see the sketch after this list).

• Stochastic beam exploration with entropy annealing: To avoid early local minima where the model severely biases towards specific actions, we added techniques like (i) a stochastic version of beam search that samples operators in an ε-greedy fashion, (ii) dropout, and (iii) entropy-based regularization of the action distribution.
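The following sketch combines the score-level heuristics above (weights are illustrative; in particular, answer-type pruning is shown here as a soft penalty rather than outright beam removal):

```python
import math

def adjusted_beam_score(log_score, length, num_no_ops, answer_type_ok,
                        no_op_wt=1.0, type_penalty=5.0):
    """Length-normalize a beam's log-score, penalize no_op actions logarithmically,
    and penalize beams whose answer type mismatches the predicted one."""
    score = log_score / max(length, 1)              # length normalization
    score -= no_op_wt * math.log(1 + num_no_ops)    # no_op penalty
    if not answer_type_ok:
        score -= type_penalty                       # answer-type pruning (soft form)
    return score
```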

Sampling only feasible actions: Sampling a feasible action requires first sampling a feasible operator and then its feasible variable arguments:

• The operator must be allowed in the current phase of the model's program induction.

• Valid variable instantiation: A feasible operator must have at least one valid instantiation of its formal arguments with non-empty variable values that are also consistent with the KB.

• Action repetition: An action (i.e., an operator invoked with a specific argument instantiation) should not be repeated at any time step.

• Some operators disallow some arguments; for example, union or intersection of a set with itself.



                                    Simple   Logical  Verify  Quanti  Quant Count  Comp   Comp Count  WebQSP (All)
Timesteps                             2        4        5       7       7           7      7           3 to 5
Entropy-Loss Wt.                      5e−4     5e−4     5e−6    5e−3    5e−3        5e−2   5e−2        5e−3
Feasible Program after iterations     1000     2000     500     1300    1300        1500   1500        50
Beam Pruning after iterations         100      100      100     1300    1300        1300   1000        100
Auxiliary Reward till iterations      0        0        0       800     800         800    800         200
Learning Rate                         1e−5     1e−5     1e−5    1e−5    1e−5        1e−5   1e−5        1e−4

Table 1: Critical hyperparameters.

6 Experiments

We compare CIPITR against baselines (Miller et al., 2016; Liang et al., 2017) on complex KBQA and further identify the contributions of the ideas presented in Section 5 via ablation studies. For this work, we limit our effort on KBQA to the setting where the query is annotated with the gold KB artifacts, which standardizes the input to the program induction for the competing models.

6.1 Hyperparameter Settings

We trained our model using the Adam optimizer and tuned all hyperparameters on the validation set. Some parameters are selectively turned on/off after a few training iterations, the number of which is itself a hyperparameter (see Table 1). We combined rewards/losses such as entropy annealing and auxiliary rewards using different weights, detailed in Table 1. The key and value embedding dimensions are set to 100 and 300.

6.2 WebQuestionsSP Data Set

We first evaluate our model on the more popularly used WebQuestionsSP data set.

6.2.1 Rule-Based Model on WebQuestionsSP

Though quite a few recent works on KBQA have evaluated their models on WebQuestionsSP, the reported performance is always in a setting where the gold entities/relations are not known. They either internally handle the entity and relation linking problem or outsource it to some external or in-house model, which itself might have been trained with additional data. Additionally, the entity/relation linker outputs used by these models are also not made public, making it difficult to set up a fair ground for evaluating the program induction model, especially because we are interested in the program induction given the program inputs, and handling the entity/relation linking is beyond the scope of this work. To avoid these issues, we use the human-annotated entity/relation linking data available along with the questions as input to the program induction model. Consequently, the performance reported here is not comparable to previous works evaluated on this data set, as the query annotation is obtained here from an oracle linker.

Further, to gauge the proficiency of the proposed program induction model, we construct a rule-based model which is aware of the human-annotated semantic-parsed form of the query, that is, the inference chain of relations and the exact constraints that need to be additionally applied to reach the answer. The pseudocode below elaborates how the rule-based model works on the human-annotated parse of the given query, taking as input the central entity, the inference chain, and the associated constraints and their types. This

Procedure: RuleBasedModel(parse, KB)
  ent1 ← parse[‘TopicEntityMid’]
  rel1 ← parse[‘InferentialChain’][0]
  ans ← {x | (ent1, rel1, x) ∈ KB}
  for c ∈ parse[‘Constraints’] do
    c_rel ← c[‘NodePredicate’]
    c_op ← c[‘Operator’]
    c_arg ← c[‘Argument’]
    if c[‘ArgumentType’] == ‘Entity’ then
      ans ← ans ∩ {x | (c_arg, c_rel, x) ∈ KB}
    else
      ans ← ⋃_{x ∈ ans} {x | (x, c_rel, y) ∈ KB, c_arg c_op y}
  if len(parse[‘InferentialChain’]) > 1 then
    rel2 ← parse[‘InferentialChain’][1]
    ans ← ⋃_{x ∈ ans} {y | (x, rel2, y) ∈ KB}
Output: ans



Question Type                                         Rule-Based   CIPITR
Inference-chain-len-1, no constraint                    87.34       89.09
Inference-chain-len-1, with constraint                  93.64       79.94
Inference-chain-len-2, no constraint                    82.85       88.69
Inference-chain-len-2, with nontemporal constraint      61.26       63.07
Inference-chain-len-2, with temporal constraint         35.63       48.86
All                                                     81.19       82.85

Table 2: F1 scores (%) of CIPITR and the rule-based model (as in Section 6.2.1) on the WebQuestionsSP test set, which has 1,639 queries.

inference rule, manually derived, can be written out in program form, which on execution will give the final answer. On the other hand, the task of CIPITR is to actually learn the program by looking at training examples of the query and corresponding answer. Both models need to induce the program using the gold entity/relation data. Consequently, the rule-based model is indeed a very strong competitor, as it is generated by annotators having detailed knowledge about the KB.

6.2.2 Results on WebQuestionsSP

A comparative performance analysis of the proposed CIPITR model, the rule-based model, and the SPARQL executor is tabulated in Table 2. The main take-away from these results is that CIPITR is indeed able to learn the rules behind the multi-step inference process simply from the distant supervision provided by the question-answer pairs, and even performs slightly better on some of the query classes.

6.3 CSQA Data Set

We now showcase the performance of the proposed models and related baselines on the CSQA data set.

6.3.1 Baselines on CSQA

KVMnet with decoder (2016), which performed best on the CSQA data set (Saha et al., 2018) (as discussed in Section 2), learns to attend on a KB subgraph in memory and decode the attention over memory entries as their likelihood of being in the answer. Further, it can also decode a vocabulary of non-KB words like integers or Booleans. However, because of the inherent architectural constraints, it is not possible to incorporate most of the symbolic constraints presented in Section 5 in this model, other than KB-guided consistency

and biasing towards the answer type. More importantly, the usage of these models has recently been criticized for numerical and Boolean question answering, as these deep networks can easily memorize answers without ‘‘understanding’’ the logic behind the queries, simply because of the skew in the answer distribution. In our case this effect is more pronounced, as CSQA evinces a curious skew in integer answers to ‘‘count’’ queries: 56% of training and 52% of test count-queries have single-digit answers, and 90% of training and 81% of test count-queries have answers less than 200. Though this makes it unfair to compare NPI models (which are oblivious to the answer vocabulary) with KVMnet on such queries, we still train a KVMnet version on a balanced resample of CSQA, where, for only the count queries, the answer distribution over integers has been made uniform.

NSM (2017) uses a key-variable memory and decodes the program as a sequence of operators and memory variables. As the NSM code was not available, we implemented it and further incorporated most of the six techniques presented in Table 4. However, constraints like action repetition, biasing last operator selection, and phase change cannot be incorporated in NSM while keeping the model generic, as it decodes the program token by token.

6.3.2 Results on CSQA

In Table 3 we compare the F1 scores obtained by our system, CIPITR, against the KVMnet and NSM baselines. For NSM and CIPITR, we train seven models with different hyperparameters tuned on each of the seven question types. For the train and valid splits, a rule-based query type classifier with 97% accuracy was used to bucket queries into the classes listed in Table 3. For each of these three systems, we also train and evaluate



Run name \ Question type        Simple  Logical  Verify  Quanti.  Quant Count  Compar.  Comp Count  All
Training Size Stats.            462K    93K      43K     99K      122K         41K      42K         904K
Test Size Stats.                81K     18K      9K      9K       18K          7K       7K          150K
KVMnet                          41.40   37.56    27.28   0.89     17.80        1.63     9.60        26.67
NSM, best at top beam           78.38   35.40    28.70   4.31     12.38        0.17     0.00        10.63
NSM, best over top 2 beams      80.12   41.23    35.67   4.65     15.34        0.21     0.00        11.02
NSM, best over top 5 beams      86.46   64.70    50.80   6.98     29.18        0.48     0.00        12.07
NSM, best over top 10 beams     96.78   69.86    60.18   10.69    30.71        2.09     0.00        14.36
CIPITR, best at top beam        96.52   87.72    89.43   23.91    51.33        15.12    0.33        58.92
CIPITR, best over top 2 beams   96.55   87.78    90.48   25.85    51.72        19.85    0.41        62.52
CIPITR, best over top 5 beams   97.18   87.96    90.97   27.19    52.01        29.45    1.01        69.25
CIPITR, best over top 10 beams  97.18   88.92    90.98   28.92    52.71        32.98    1.54        73.71

Table 3: F1 score (%) of KVMnet, NSM, and CIPITR. Bold numbers (in the original) indicate the best among KVMnet and the top-beam scores of NSM and CIPITR.

one single model over all question types. KVMnet does not have any beam search, the NSM model uses a beam size of 50, and CIPITR uses only 20 beams for exploring the program space.

Our manual inspection of these seven query categories shows that simple and verify are the simplest in nature, requiring 1-line programs, while logical is moderately difficult, with around 3 lines of code. The query categories next in order of complexity are quantitative and quantitative count, needing a sequence of 2–5 operations. The hardest types are comparative and comparative count, which translate to programs of 5–10 lines on average.

Analysis: The experiments show that on the simple to moderately difficult (i.e., the first three) query classes, CIPITR's performance at the top beam is up to 3 times better than both baselines. The superiority of CIPITR over NSM is showcased better on the more complex classes, where it outperforms the latter by 5–10 times, with the biggest impact (a factor of 89) on the ‘‘comparative’’ questions. Also, the 5× better performance of CIPITR over NSM on the All category evinces the better generalizability of the abstract high-level program decomposition approach of the former.

On the other hand, training the KVMnet model on the balanced data helps showcase the real performance of the model, where CIPITR outperforms KVMnet significantly on most of the harder query classes. The only exception is the hardest class (comparative count, with numerical answers), where the abrupt ‘‘best performance’’ of KVMnet can be attributed to its rote learning

Feature omitted                 F1 (%)
Action Repetition                1.68
Phase Change                     2.41
Valid Variable Instantiation     3.08
Biasing last operator            7.52
Auxiliary Reward                 9.85
Beam Pruning                    10.34

Table 4: Ablation testing on comparative questions: the top beam's F1 score obtained by omitting each of the above features from CIPITR, which originally has an F1 of 15.123%.

abilities, simply because of its knowledge of the answer vocabulary, which the program induction models are oblivious to, as they never see the actual answer.

Lastly, in our experimental configurations, whereas CIPITR's and NSM's parameter sizes are almost comparable, KVMnet's is approximately 6× larger.

Ablation Study: To quantitatively analyze the utility of the features mentioned in Section 5, we experiment with various ablations in Table 4 by turning off each feature, one at a time. We show the effect on the hardest question category (‘‘comparative’’) on which our proposed model achieved reasonable performance. We see in the table that each of the 6 techniques helped the model significantly. Some of them boosted F1 by 1.5–4 times, while others proved instrumental in obtaining large improvements in F1 score of over 6–9 times.

To summarize, CIPITR has the following advantages, inducing programs more efficiently



Simple. Query: ‘‘Who is associated with Robert Emmett O'Malley?’’ Inputs: E0: Robert Emmett O'Malley, R0: associated with, T0: person.
  CIPITR: A = gen_set(E0, R0, T0)
  NSM: A = gen_set(E0, R0, T0)

Verify. Query: ‘‘Is Sergio Mattarella the chief of state of Italy?’’ Inputs: E0: Sergio Mattarella, E1: Italy, R0: chief of state.
  CIPITR: A = verify(E0, R0, E1)
  NSM: A = verify(E0, R0, E1)

Logical. Query: ‘‘Which cities were Animal Kingdom filmed on or share border with Pedralba?’’ Inputs: E0: Animal Kingdom, E1: Pedralba, R0: filmed on, R1: share border, T0: cities.
  CIPITR: A = gen_set(E0, R0, T0); B = gen_set(E1, R1, T0); C = set_union(A, B)
  NSM: A = gen_set(E1, R1, T0); B = gen_set(E1, R1, None); C = set_union(A, B)

Quant Count. Query: ‘‘How many nucleic acid sequences encodes Calreticulin or Neurotensin/neuromedin-N?’’ Inputs: E0: Calreticulin, E1: Neurotensin/neuromedin-N, R0: encoded by, T0: nucleic acid.
  CIPITR: A = gen_set(E0, R0, T0); B = gen_set(E1, R0, T0); C = set_union(A, B); D = set_count(C)
  NSM: A = gen_set(E0, R0, T0); B = set_count(A); C = set_union(A, B)

Quant. Query: ‘‘What municipal councils are the legislative bodies for max US administrative territories?’’ Inputs: T0: municipal council, T1: US administrative territories, R0: legislative body of.
  CIPITR: A = gen_map_set(T0, R0, T1); B = map_count(A); C = select_max(B)
  NSM: A = gen_map_set(T0, R0, T1); B = gen_map_set(T0, R0, T1); C = map_count(A)

Comp. Query: ‘‘Which works did less number of people do the dubbing for than Hercules y el rey de Tesalia?’’ Inputs: E0: Hercules y el rey de Tesalia, R0: dubbed by, T0: works, T1: people.
  CIPITR: A = gen_map_set(T0, R0, T1); B = gen_set(E0, R0, T1); C = set_count(B); D = map_count(A); E = select_less(D, C)
  NSM: A = gen_set(E0, R0, T1); B = gen_set(E0, R0, T1); C = gen_map_set(T0, R0, T1); D = set_diff(A, B)

Table 5: Qualitative analysis of programs generated by CIPITR and NSM for different types of questions.

and pragmatically, as illustrated by the sample outputs in Table 5:

• Generating syntactically correct programs: Because of the token-by-token decoding of the program, NSM cannot restrict its search to only syntactically correct programs, but rather only resorts to a post-filtering step during training. However, at test time, it could still generate programs with wrong syntax, as shown in Table 5. For example, for the Logical question, it invokes a gen_set with a wrong argument type None, and for the Quantitative count question, it invokes the set_union operator on a non-set argument. On the other hand, CIPITR, by design, can never generate a syntactically incorrect program, because at every step it implicitly samples only feasible actions.

• Generating semantically correct programs: CIPITR is capable of incorporating different generic programming styles as well as problem-specific constraints, restricting its search space to only semantically correct programs. As shown in Table 5, CIPITR is able to generate at least meaningful programs having the desired answer type and without repeating lines of code. On the other hand, the NSM-generated programs are often semantically wrong; for instance, in both the Quantitative and Quantitative Count questions, the type of the answer is itself wrong, rendering the program meaningless. This arises, once again, owing to the token-by-token decoding of the program by NSM, which makes it hard to incorporate high-level rules to guide or constrain the search.

• Efficient search-space exploration: Owing to the different strategies used to explore the program space more intelligently, CIPITR scales better to a wide variety of complex queries by using less than half of NSM's beam size. We experimentally established that for programs of length 7 these various techniques reduced the average program space from 1.33 × 10^19 to 2,998 programs.

7 Conclusion

We presented CIPITR, an advanced NPI framework that significantly pushes the frontier of complex program induction in the absence of gold programs. CIPITR uses auxiliary rewarding techniques to mitigate the extreme reward sparsity and incorporates generic pragmatic programming styles to constrain the combinatorial program space to only semantically correct programs. As future directions of work, CIPITR can be further improved to handle the hardest question types by making the search more strategic, and can be further generalized to a diverse set of goals when training on all question categories together.



Other potential directions of research could be toward learning to discover sub-goals to further decompose the most complex classes beyond just the two-level phase transition proposed here. Additionally, further improvements are required to induce complex programs without the availability of gold program input variables.

References

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1545–1554.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Learning to compose neural networks for question answering. In NAACL-HLT, pages 1545–1554.

B. Bakker and J. Schmidhuber. 2004. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proceedings of the 8th Conference on Intelligent Autonomous Systems IAS-8, pages 438–445.

Andrew G. Barto and Sridhar Mahadevan. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77.

Hannah Bast and Elmar Haußmann. 2015. More accurate question answering on Freebase. In CIKM, pages 1431–1440.

J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP Conference, pages 1533–1544.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NIPS Conference, pages 2787–2795.

Matko Bosnjak, Tim Rocktaschel, Jason Naradowsky, and Sebastian Riedel. 2017. Programming with a differentiable Forth interpreter. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pages 547–556.

Rudy Bunel, Matthew J. Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. 2018. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations (ICLR).

Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017. Question answering on knowledge bases and text using universal schema and memory networks. In ACL (2), pages 358–365.

Peter Dayan and Geoffrey E. Hinton. 1993. Feudal reinforcement learning. In Advances in Neural Information Processing Systems 5, [NIPS Conference], pages 271–278.

Thomas G. Dietterich. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13(1):227–303.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In ACL, volume 1, pages 33–43.

Kelvin Guu, John Miller, and Percy Liang. 2015. Traversing knowledge graphs in vector space. In EMNLP Conference.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In ACL, volume 1, pages 1821–1831.

Sarvnaz Karimi, Justin Zobel, and Falk Scholer. 2012. Quantifying the impact of concept recognition on biomedical information retrieval. Information Processing & Management, 48(1):94–106.

Mahboob Alam Khalid, Valentin Jijkoun, and Maarten De Rijke. 2008. The impact of named entity normalization on information retrieval for question answering. In Proceedings of the IR Research, 30th European Conference on Advances in Information Retrieval, ECIR'08, pages 705–710.

Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.



Chengtao Li, Daniel Tarlow, Alexander L. Gaunt, Marc Brockschmidt, and Nate Kushman. 2016. Neural program lattices. In International Conference on Learning Representations (ICLR).

X. Li and D. Roth. 2002. Learning question classifiers. In COLING, pages 556–562.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 23–33.

Andrew McCallum, Arvind Neelakantan, Rajarshi Das, and David Belanger. 2017. Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Volume 1: Long Papers, pages 132–141.

Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In EMNLP, pages 1400–1409.

Stephen Muggleton and Luc De Raedt. 1994. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19/20:629–679.

Arvind Neelakantan, Quoc V. Le, Martin Abadi, Andrew McCallum, and Dario Amodei. 2016. Learning a natural language interface with neural programmer. arXiv preprint, arXiv:1611.08945.

Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2015. Neural programmer: Inducing latent programs with gradient descent. CoRR, abs/1511.04834.

Ronald Parr and Stuart Russell. 1998. Reinforcement learning with hierarchies of machines. In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, NIPS '97, pages 1043–1049.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732.

Scott Reed and Nando de Freitas. 2016. Neural programmer-interpreters. In International Conference on Learning Representations (ICLR).

Amrita Saha, Vardaan Pahuja, Mitesh M. Khapra, Karthik Sankaranarayanan, and Sarath Chandar. 2018. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In AAAI.

Richard S. Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.

Reut Tsarfaty, Ilia Pogrebezky, Guy Weiss, Yaarit Natan, Smadar Szekely, and David Harel. 2014. Semantic parsing using content and context: A case study from requirements elicitation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1296–1307.

Richard J. Waldinger and Richard C. T. Lee. 1969. PROW: A step toward automatic program writing. In Proceedings of the 1st International Joint Conference on Artificial Intelligence, pages 241–252.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, Springer, pages 5–32.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. arXiv preprint, arXiv:1603.00957.

Xuchen Yao. 2015. Lean question answering over Freebase from scratch. In NAACL Conference, pages 66–70.

Scott Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL Conference, pages 1321–1331.

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 201–206.
