
Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

Aman Jain∗
Mayank Kothyari∗
IIT Bombay
India

Vishwajeet Kumar
IBM Research, India

Preethi Jyothi
IIT Bombay
India

Ganesh Ramakrishnan
IIT Bombay
India

Soumen Chakrabarti
IIT Bombay
India

ABSTRACT

Multimodal IR, spanning text corpus, knowledge graph and images,

called outside knowledge visual question answering (OKVQA), is

of much recent interest. However, the popular data set has serious

limitations. A surprisingly large fraction of queries do not assess the

ability to integrate cross-modal information. Instead, some are in-

dependent of the image, some depend on speculation, some require

OCR or are otherwise answerable from the image alone. To add to

the above limitations, frequency-based guessing is very effective

because of (unintended) widespread answer overlaps between train

and test folds. Overall, it is hard to determine when state-of-the-

art systems exploit these weaknesses rather than really infer the

answers, because they are opaque and their ‘reasoning’ process is

uninterpretable. An equally important limitation is that the dataset

is designed for the quantitative assessment only of the end-to-end

answer retrieval task, with no provision for assessing the correct

(semantic) interpretation of the input query. In response, we iden-

tify a key structural idiom in OKVQA, viz., S3 (select, substitute and search), and build a new data set and challenge around it. Specif-

ically, the questioner identifies an entity in the image and asks a

question involving that entity which can be answered only by con-

sulting a knowledge graph or corpus passage mentioning the entity.

Our challenge consists of (i) OKVQAS3, a subset of OKVQA anno-

tated based on the structural idiom and (ii) S3VQA, a new dataset

built from scratch. We also present a neural but structurally trans-

parent OKVQA system, S3, that explicitly addresses our challenge

data set, and outperforms recent competitive baselines. We make

our code and data available at https://s3vqa.github.io/

1 INTRODUCTION

Multimodal question answering (QA), specifically, combining visual

and textual information, is of much recent interest [1, 17, 20, 24].

∗Both authors contributed equally to this research.

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or distributed

for profit or commercial advantage and that copies bear this notice and the full citation

on the first page. Copyrights for components of this work owned by others than ACM

must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,

to post on servers or to redistribute to lists, requires prior specific permission and/or a

fee. Request permissions from [email protected].

SIGIR ’21, © 2018 Association for Computing Machinery.

ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00

https://doi.org/10.1145/1122445.1122456

Figure 1: Sample OKVQA image and query: ‘What is this animal poached for?’. As discussed in Section 3, we associate the following additional ground truth with each such question: ‘What is this <select, substitute=“elephant”>animal</select> poached for?’

In the simplest visual QA (VQA) variant, the query is textual and a

given image suffices, by itself, to answer it. A more challenging vari-

ant is where outside knowledge (OK) from a corpus or knowledge

graph needs to be combined with image information to answer the

query [25]. This task is referred to as OKVQA.

An example OKVQA instance is shown in Figure 1. The system

cannot answer the question by detecting objects from the image, or

by recognizing colour, count, or shape as in traditional VQA [1, 17, 20, 24]. The system first needs to understand the relationship between

the image content and the question and then needs to query the

outside world for retrieving relevant facts (in this particular case,

about the poaching of elephants).

For our purposes, a host of modern image processing tech-

niques [16, 35] (which we consider as black boxes) can identify

image patches representing salient entities (elephant, man, jeep,

tree, sky, camera) and some spatial relations between them (e.g., man holding camera, man sitting in a jeep, etc.). This is called a

scene graph. Knowledge that elephant is-a animal may come from a

knowledge graph [7] or open information extraction [13]. Treated

thus, OKVQA is less of a visual recognition challenge and more of

a challenge to fuse unstructured text and (graph) structured data

extracted from the image, which is relevant to the Information

Retrieval (IR) community.

1.1 Discontent with current state of OKVQA

Small step vs. giant leap. QA in the TREC community [45, 46] be-

gan as “corpus QA” long before neural NLP became widespread. Nat-

urally, any structural query interpretation or scoring of responses


was completely transparent and deliberate. QA entered the modern

NLP community as “KGQA” [5, 48], which was focused on precise

semantic interpretation of queries. Initial efforts at corpus QA in the

NLP community also had modest goals: given a query and passage,

identify a span from the passage that best answers the query. Such

“reading comprehension” (RC) generalized proximity-sensitive scor-

ing [8, 23, 32]. Compared to these “small steps”, OKVQA was a

“giant leap”, placing enormous faith in opaque neural machinery to

answer visual queries that have no controlled space of reasoning

structure. Some questions require optical character recognition to

‘read’ street signs in the image, others are speculative (“can you

guess the celebration where people are enjoying?”), some are independent of the image (“Pigeons are also known as what?”), and some require only the image and no outside information to answer (“Which breed of dog is it?”).

The unwelcome power of guesswork. Rather than score candidate

answers from a large set, such as entities in a KG, or spans from pas-

sages in a large open-domain corpus, current systems treat OKVQA

as a classification problem over the 𝑘 most frequent answers in the

training set. We found a huge train-test overlap, which condones

and even rewards this restrictive strategy: 48.9% of answers in the

test set of OKVQA [25] were found to be present in the train set.

Retrospection. Such limitations are not specific to OKVQA, but

have been witnessed in other communities as well. Guessing has

been found unduly successful for the widely used SQuAD QA

dataset [9, 34]. The complex QA benchmark HotPotQA [47] was

supposed to test multi-step reasoning. However, Min et al. [28]

found that top-scoring QA systems as well as humans could guess

the answer without such reasoning for a large fraction of queries.

Their conclusion is particularly instructive: “there should be an

increasing focus on the role of evidence in multi-hop reasoning and

possibly even a shift towards information retrieval style evaluations

with large and diverse evidence collections”. Even more damning is

the work of Tang et al. [44], who report that “state-of-the-art multi-

hop QA models fail to answer 50–60% of sub-questions, although

the corresponding multi-hop questions are correctly answered”.

Other tasks such as entailment [26] have faced similar fate.

Remedies. Awareness of the above-mentioned pitfalls has led to

algorithmic improvements, principally by reverting toward explicit

query structure decomposition. Training data can be augmented

with decomposed clauses [44]. The system can learn to decom-

pose queries by looking up a single-clause query collection [31, 50].

Explicit and partly interpretable control can be retained on the

assembly of responses for subqueries [40, 41]. Earlier datasets and

evaluations are being fixed and augmented as well [9, 10, 36]. New

datasets are being designed to be compositional from the ground

up [43], so that system query decomposition quality can be directly

assessed. Here we pursue both the data and algorithm angles: we

present a new data set where the structural intent of every query

is clear by design, and we also propose an interpretable neural

architecture to handle this form of reasoning. Our new dataset

and our supporting algorithm are designed keeping in mind the

interpretability of the entire system and can thus potentially

contribute to advancement of research in other applications

that include question answering over video, audio as well as

over knowledge graphs. As mentioned earlier, most state-of-the-

art VQA systems are trained to fit an answer distribution; this

brings into question their ability to generalize well to test queries.

To address this issue and improve the explanatory capabilities of

VQA systems, prior work has investigated the use of visual explana-

tions [11] and textual explanations [30? ] to guide the VQA process.

Our new challenge dataset and our proposed OKVQA architecture

have the advantage of being inherently explanatory, without having to rely on any external explanations. Further, we contribute a new benchmark, S3VQA, that is explanatory in its very genesis, as will be described in Section 3.2.

1.2 A new OKVQA Challenge Dataset

In response to the serious shortcomings we have described above,

we present a new OKVQA challenge. Our dataset consists of two

parts, each of which is provided with rich, compositional annota-

tions that enable assessment of a system’s query reformulation, viz., (i) OKVQAS3, a subset of the OKVQA [25] dataset with

such compositional annotations (Section 3.1) and (ii) S3VQA, a com-

pletely new dataset (c.f., Section 3.2) that has been sanitized from

the ground up against information leakage or guesswork-friendly

answering strategies. Every question in our new benchmark is guar-

anteed to require integration of KG, image-extracted information,

and search in an open-domain corpus (such as the Web). Specifi-

cally, the new data set focuses on a specific but extremely common

reasoning idiom, shown in Figure 1. The query pertains to an entity

𝑒 in the image (such as elephant), but refers to it in more general

terms, such as type 𝑡 (animal) of which 𝑒 is an instance. There is

no control over the vocabulary for mentioning 𝑒 or 𝑡 . In our run-

ning example, an OKVQA system may implement a substitution to generate the query “what is elephant poached for?”, drawing

‘elephant’ out of a visual object recognizer’s output. While many

reasoning paradigms may be important for OKVQA, we argue that

this is an important one. Our data set exercises OKVQA systems in

fairly transparent ways and reveals how (well) they perform this

basic form of reasoning. Addressing other structures of reasoning

is left for future work. Despite this restricted reasoning paradigm,

which ought to be well within the capabilities of general-purpose

state-of-the-art OKVQA systems, we are surprised to find that existing OKVQA models yield evaluation scores close to 0 on S3VQA.

1.3 An interpretable OKVQA system

Continuing in the spirit of “small steps before giant leap”, we present

S3 (c.f., Section 5), a neural OKVQA system that targets this class

of queries and reasoning structure. Our system thus has a known

interpretation target for each query, and is therefore interpretable

and affords systematic debugging of its modules. S3 has access to

the query and the scene graph of the image. It first selects a query span to substitute with (the string description of) an object from the scene graph; this object, too, has to be selected from many objects in

the image. Sometimes, the question span and the object description

would be related through an instance-of or subtype-of relation in a

typical KG or linguistic database; ideally, S3 should take advantage

of this signal. There are now two ways to use the reformulated

question. We can regard answer selection as a classification problem

like much of prior work, or we can prepare a Web search query,


get responses, and learn to extract and report answer span(s) in the RC style (referred to as the open-domain setting). S3 is wired to do both (cf. Section 6.4).

Figure 2: Example questions for each category: (a) Type 1, (b) Type 2, (c) Type 3, (d) rest.

2 LIMITATIONS IN EXISTING OKVQA DATA

The widely used benchmark OKVQA dataset [25] consists of over

14000 question-image pairs, with 9000 training examples and 5000

examples in the test set. We identify two broad issues with it.

First, significant overlap exists between answers in the train and

test folds. Recall from Section 1.1 that 48.9% of answers in the test

set are present in the training set. Existing systems leverage this

limitation to boost their accuracy by limiting their test answers to

the most frequent answers in the training set.

Second, unlike WebQuestions [6] or ComplexWebQuestions [42],

OKVQA questions, even when grouped into some categories (see

below) have no clear pattern of reasoning. 18% (type-1) of the

questions require detecting objects and subsequent reasoning over

an external knowledge source to arrive at the answer. 7% of the

questions (type-2) require reading text from the image (OCR) (and

no other information) to answer. 12% of the questions (type-3) are

based on personal opinion or speculation. The remaining questions

(rest) can perhaps be described best through Figure 2d. We provide

an example of each type in Figure 2.

We found that several queries of type-1 have a structural simi-

larity to the bridging queries in ComplexWebQuestions [42]. There,

each query has exactly two clauses. The first clause, when issued

to a Web search engine, returns (via RC) an entity which plugs into

the second clause, which is again sent to the Web search engine,

fetching the overall answer. We found that type-1 questions can

be reformulated, with the help of the scene graph, to a query that

can be answered directly using Web search. Inspired by Complex-

WebQuestions, we next develop our challenge data set.

3 DESIGN OF A NEW OKVQA CHALLENGE

DATA SET

Consider the type-1 query associated with Figure 1: “what is this animal poached for?” We discovered that such questions can be

answered through a sequence of three well-defined steps:

Select: This step identifies a span in the question (“this animal”)

that needs to be replaced with some information from the image, as

part of the query reformulation process. As part of our dataset(s),

we make available the ground truth span for each question, to

facilitate the development and evaluation of the selection model,

independently of other components of an OKVQA system. We

hypothesize that any OKVQA system that fetches the correct

answer must solve this subproblem correctly in the first place.

Substitute: Having identified the query span to replace, the substitute operation determines a key-phrase associated with the image

that should replace the selected span. Such a key-phrase could be

the name of an object in the image (e.g., ‘elephant’) or some at-

tributes within the image (such as color), or a relationship between

objects/people (e.g., ‘carrying in hand’). As part of our dataset(s),

we release the keyphrase substituting each selected span in the

question. This data can be used for independent training and eval-

uation of implementations of the substitution operation, assuming

oracle assistance from the other operations.

Search: After the query has been reformulated as described above,

it can be used to harvest candidate answers from a corpus or a

KG. The reformulated query in Figure 1 will be “what is::::::elephant

poached for?” with associated ground truth answer ‘tusk’. By

providing gold query reformulations and gold answers, we also

facilitate the evaluation of the search operation, independent of

the selection and substitution operations.

As another example, the question in Figure 2a can be reformu-

lated from “how many chromosomes do these creatures have?” to “how many chromosomes do humans have?” by replacing the hyper-

nym or super-type “these creatures” with the hyponym or sub-type

‘humans’. Again, the reformulated query can be answered using an

open domain QA system.
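To make the three operations concrete, here is a minimal Python sketch that walks the running example through select, substitute and search. The span, the detected-object list, and the returned answer are hard-coded stand-ins for illustration, not model outputs.

```python
# A minimal sketch of the select -> substitute -> search idiom on the
# running example. All inputs are hard-coded stand-ins; a real system
# would predict the span, the object, and the answer with learned models.

def select(question: str) -> str:
    # Assume the span selector has identified "this animal" as the span
    # to be replaced (this is the ground-truth span in our annotations).
    return "this animal"

def substitute(question: str, span: str, detected_objects: list[str]) -> str:
    # Assume the object detector returned these labels and the
    # hypernym-hyponym scorer picked "elephant" as the best hyponym
    # of "animal" among them.
    best_object = "elephant"
    return question.replace(span, best_object)

def search(reformulated_question: str) -> str:
    # Placeholder for issuing the reformulated query to a search engine
    # and running machine reading comprehension over the snippets.
    return "tusk"

question = "What is this animal poached for?"
span = select(question)
reformulated = substitute(question, span, ["elephant", "man", "jeep", "tree"])
print(reformulated)          # -> "What is elephant poached for?"
print(search(reformulated))  # -> "tusk"
```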

In the following two subsections, we describe the two parts of our challenge data set. First, in Section 3.1, we describe OKVQAS3, pro-

duced by subsetting and annotating OKVQA to fit the specifications

justified above. Next, in Section 3.2, we describe S3VQA, created

from the ground up to our specifications.

3.1 OKVQAS3: Annotated subset of OKVQA

This is a subset of the OKVQA dataset in which we have annotated

every question with spans, substitutions and gold answers. More

specifically, we divide the OKVQA dataset into two parts based on

the category of the question. As discussed in Section 2, the first

part consists of all the questions of category Type 1 and the second

part consists of rest of the questions. We refer to the first part as


OKVQAS3 and the second part as OKVQA\S3. The name OKVQAS3 represents three operations (Select, Substitute and Search) needed to answer all the questions of category Type 1. OKVQA\S3 is the

remaining OKVQA dataset after removing OKVQAS3 .

2640 of the 14000 question-image pairs in OKVQA are of type-1,

amounting to 18.8% of the total. For these questions, our ground

truth annotations also include the object and span to be used for

select and substitution for every image question pair. The average

length of the span selected was 2.44 words (13.14 characters), with a

standard deviation of 1.36 (7.24). When COCO [22], ImageNet [38]

and OpenImages [21] object detection models were run on this set,

24.7% of the question-image pairs did not have the ground truth

object in the detections owing to detection errors and vocabulary

limitations.

3.2 S3VQA: New dataset built from scratch

Gathering experience from the process of annotating the subset

OKVQAS3 of OKVQA, we build a new benchmark dataset S3VQA

in a bottom-up manner, pivoting on the select, substitute and search operations, applied in that order, to each query. Our S3VQA data is built on the Open Images collection [21]. For each entity/object/class

name 𝑒 (e.g., peacock) in the OpenImages dataset, we identified its

corresponding ‘parent’ label 𝑡 (e.g., bird) that generalizes 𝑒 . We em-

ployed the hierarchical structure specified within the Open Images

dataset itself to ensure that 𝑒 → 𝑡 enjoy a hyponym-hypernym or

instance-category relation.

Next, we (semi-automatically) identified an appropriate Wikipedia

page for 𝑒 and, using the Wikimedia parser [14] on that page, ex-

tracted text snippets. A question generation (QG) model based on

T5 [33] was used to generate <question, answer> pairs from these

snippets. We retained only those pairs in which the question had

an explicit mention of 𝑒 . Note that unlike OKVQA, by design, the

S3VQA dataset has exactly one correct ground truth answer per question. In our experimental results (in Section 6.4, specifically

Table 2) we observe how this leads to a much stricter evaluation of

search systems and therefore much lower numbers for S3VQA in

comparison to OKVQA, which offers 10 answers per question [25].

Subsequently, each mention of 𝑒 in the filtered question set was

replaced with the ‘parent’ label 𝑡 using manually defined templates.

One such template is to replace 𝑒 and the determiner preceding it

(e.g., a, an, the) with the string ‘this 𝑡’. Finally, this question set was

filtered and cleaned up manually using the following guidelines:

(i) any question which could be answered just by using the image

or was not fact-based was eliminated; (ii) any question that needed

grammatical improvements was corrected, without changing its

underlying meaning.
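As an illustration of the templating step described above, the sketch below applies the “replace 𝑒 and its preceding determiner with ‘this 𝑡’” rule via a regular expression; the hypernym table and the example question are hypothetical stand-ins for the Open Images hierarchy.

```python
import re

# Hypothetical entity -> parent (hyponym -> hypernym) pairs standing in
# for the Open Images hierarchy; for illustration only.
PARENT = {"peacock": "bird", "elephant": "animal"}

def templatize(question: str, entity: str) -> str:
    """Replace the entity mention (and a preceding determiner, if any)
    with 'this <parent>'; one of the manually defined templates."""
    parent = PARENT[entity]
    pattern = re.compile(r"\b(?:a|an|the)?\s*" + re.escape(entity) + r"\b",
                         flags=re.IGNORECASE)
    return pattern.sub(f"this {parent}", question, count=1)

# Hypothetical generated question about the entity 'peacock'.
print(templatize("How many eyespots does a peacock display?", "peacock"))
# -> "How many eyespots does this bird display?"
```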

We also provided paraphrased questions (generated using T5 and Pegasus [51]) as suggestions along with each templated question

for the annotators to pick instead of the original templated question

in order to bring in variety among the questions. Finally, for each

question we pick an image from OpenImages corresponding to the

object 𝑒 referred to in it. We ensure that we pick a distinct image each time in case the object 𝑒 is referred to in multiple questions.

4 PRIOR ARCHITECTURES

VQA that requires reasoning over information from external knowl-

edge sources has recently gained a lot of research interest. VQA

systems [25] have started incorporating external knowledge for

question answering. Existing methods use one of the following

approaches to integrate external knowledge:

(1) Retrieve relevant facts about objects in the image and entities

in the question and reason over extracted facts to arrive at the

final answer for the question.

(2) Collect evidence from text snippets returned from a search

engine and extract answer from the text snippet.

The baseline architecture proposed by Marino et al. [25] intro-

duced ArticleNet to retrieve Wikipedia articles relevant to the ques-

tion and image entities. ArticleNet encodes a Wikipedia article

using a GRU and it is trained to predict whether the ground truth

answer is present in the article. The hidden states of sentences in

the article are used as the encoded representation. This encoded

representation is given as input to a multimodal fusion model along

with the question and image features. The answer is predicted by

formulating the problem as a classification on the most frequent

answers in the training set, leveraging the huge overlap in answers

in training and test sets. Gardères et al. [15] jointly learn knowledge,

visual and language embeddings. They use ConceptNet [39] as the

knowledge source, and graph convolution networks [19] to inte-

grate the information. Similar to the OKVQA baseline system, they

also formulate the problem as classification on the most frequent

answers in the training set [49].

5 PROPOSED ARCHITECTURE

Figure 3 shows an overall schematic diagram of our proposed S3

(Select, Substitute, Search) architecture. As described in Section 3,

Select and Substitute are two operations that, when sequentially

applied to a question, result in a reformulated query that could be

answered without further reference to the image. Question refor-

mulation is followed by a Search operation that involves a Machine

Reading Comprehension (MRC) module to extract answers from

the top snippets retrieved via a search engine for each of the refor-

mulated questions.

5.1 Select and Substitute

The Select operation entails determining a span from the question

that serves as a good placeholder for an object in the image. We

refer to this module as the SpanSelector in Figure 3. To implement

this module, we use a large pre-trained language model (such as

BERT [12]) with additional feed-forward layers (denoted by FFL1)

that are trained to predict the start and end of the span. These addi-

tional layers are trained to minimize the cross-entropy loss between

the predicted spans and the ground truth spans that accompany all

the questions in our dataset.

Once we have a predicted span from the SpanSelector, we want

to determine which object from the image is best described by the

question span. First, we rely on a state-of-the-art object detection

system to provide a list of most likely objects present in the image.

Next, we aim to identify one object from among this list that would

be most appropriate as a substitution for the span that the question

is centred around. This is achieved by minimizing the following


Figure 3: Schematic diagram of S3. FFL refers to feedforward layers that are trained using our datasets.

combined loss function that predicts an object as the best substitute

for a question span:

L = 𝛼 · Ltrip (𝑠, 𝑝, 𝑛) + (1 − 𝛼) · LBCE (𝑠, 𝑝, 𝑛) (1)

where Ltrip is a triplet loss defined over triplets of embeddings

corresponding to the question span (𝑠), the ground-truth (gold)

object (𝑝) and a detected object different from the ground-truth

object (𝑛), LBCE is a binary cross-entropy loss over each detected

object and the span with the gold object acting as the reference, and

𝛼 is a mixing coefficient that is tuned as a hyperparameter. Here,

Ltrip is defined as:

Ltrip (𝑠, 𝑝, 𝑛) = max(𝑑 (𝑠, 𝑝) − 𝑑 (𝑠, 𝑛) + 𝑚, 0) (2)

where 𝑑 (·, ·) is a distance function on the embedding space (such

as cosine distance) and𝑚 is a predefined margin hyperparameter

that forces examples from the same class (𝑠 and 𝑝) to be closer than

examples from different classes (𝑠 and 𝑛). (Both Ltrip and LBCE

have been explicitly listed in Figure 3.)
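For concreteness, a minimal PyTorch sketch of this combined objective is given below. It assumes the span, gold-object and negative-object embeddings (and the per-object match logits) have already been produced by the encoder described in Section 6.3; the margin and mixing values are illustrative, not the tuned hyperparameters.

```python
import torch
import torch.nn.functional as F

def combined_loss(span_emb, pos_emb, neg_emb, pos_score, neg_score,
                  alpha=0.5, margin=0.2):
    """Sketch of Eq. (1): a mix of a triplet loss (Eq. (2), cosine distance)
    and a binary cross-entropy loss over per-object match scores."""
    # Triplet loss: pull the gold object closer to the span than the negative.
    d_pos = 1.0 - F.cosine_similarity(span_emb, pos_emb, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(span_emb, neg_emb, dim=-1)
    l_trip = torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

    # BCE loss: the gold object is labelled 1, the negative detection 0.
    logits = torch.stack([pos_score, neg_score], dim=-1)
    labels = torch.stack([torch.ones_like(pos_score),
                          torch.zeros_like(neg_score)], dim=-1)
    l_bce = F.binary_cross_entropy_with_logits(logits, labels)

    return alpha * l_trip + (1.0 - alpha) * l_bce

# Toy usage with random embeddings / scores for a batch of 4 examples.
span = torch.randn(4, 1024)
pos, neg = torch.randn(4, 1024), torch.randn(4, 1024)
pos_s, neg_s = torch.randn(4), torch.randn(4)
print(combined_loss(span, pos, neg, pos_s, neg_s))
```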

Computing LBCE (𝑠, 𝑝, 𝑛) also involves the use of a hypernym-

hyponym scorer that examines whether or not each (question span,

detected object) pair exhibits a hypernym-hyponym relationship.

The scorer makes use of a lexical database such as WordNet [27]

that explicitly contains hypernym-hyponym relationships between

words. We map the span and detected object to the corresponding

synsets in WordNet, and use the pre-trained Poincaré embeddings

by Nickel and Kiela [29] to embed them. These pre-trained embed-

dings were trained on the (hypernymy based) hierarchical represen-

tations of the word sequences in WordNet synsets by embedding

them into an 𝑛-dimensional Poincaré ball by leveraging the distance

property of hyperbolic spaces.
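A rough sketch of such a scorer is shown below. The two-dimensional embedding table is a hypothetical stand-in for the pre-trained Poincaré embeddings of Nickel and Kiela, and synset selection is simplified to the first WordNet sense; a smaller hyperbolic distance is treated as stronger evidence of a hypernym-hyponym link.

```python
import math
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# Hypothetical pre-computed Poincaré embeddings keyed by synset name;
# in practice these would be the pre-trained WordNet embeddings of
# Nickel and Kiela [29], loaded from disk.
POINCARE = {
    "animal.n.01": [0.01, 0.02],
    "elephant.n.01": [0.45, 0.60],
}

def poincare_distance(u, v):
    """Distance in the Poincaré ball:
    arcosh(1 + 2*||u-v||^2 / ((1-||u||^2) * (1-||v||^2)))."""
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))

def hyper_hypo_score(span_word: str, object_word: str) -> float:
    """Score a (question span, detected object) pair: the negated Poincaré
    distance between their (first-sense) synset embeddings."""
    s1 = wn.synsets(span_word)[0].name()    # e.g. 'animal.n.01'
    s2 = wn.synsets(object_word)[0].name()  # e.g. 'elephant.n.01'
    return -poincare_distance(POINCARE[s1], POINCARE[s2])

print(hyper_hypo_score("animal", "elephant"))
```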

Finally, the reformulated question is obtained by substituting

the predicted span in the question with the object that is predicted

with highest probability as being the best substitute for the span.

5.2 Search

After the question has been reformulated, we pass it through Google’s search engine to retrieve the top 10 most relevant snippets.¹ These

snippets are further passed as input to a Machine Reading Com-

prehension (MRC) module. An MRC system takes a question and

a passage as its input and predicts a span from the passage that

¹To ensure reproducibility, we release these snippets as part of our dataset.

is most likely to be the answer. Using the MRC module, we aim

at meaningfully pruning the relatively large amounts of text in

Google’s top 10 snippets and distilling it down to what is most

relevant to the reformulated question.

The MRC module also allows us to move beyond the classifica-

tion setting (as mentioned in Section 1.1) and directly predict an

answer given a question and its corresponding context. We refer

to this as the open-domain task, shown in Figure 3. For fair com-

parisons to prior work that adopt the classification setting, we also

support classification as shown in Figure 3. Here, embeddings for

both the reformulated question and the output from the MRC are

concatenated and fed as input to a softmax output layer over the

top 𝑘 answers in the training set.
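As a rough illustration of the search step, the sketch below runs an off-the-shelf extractive-QA pipeline over retrieved snippets. The HuggingFace model identifier and the example snippet are assumptions (not the SpanBERT checkpoint used in our experiments), and taking the per-snippet maximum is a simplification of the MRC variants described in Section 6.4.

```python
from transformers import pipeline

# A generic extractive-QA pipeline stands in for the SpanBERT MRC module;
# the model identifier below is an assumption, not the exact checkpoint
# used in the paper.
reader = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

def answer_from_snippets(reformulated_question: str, snippets: list[str]) -> str:
    """Run MRC over each retrieved snippet and keep the highest-scoring span
    (MRC_sing instead concatenates all snippets into a single context)."""
    best = {"score": -1.0, "answer": ""}
    for snippet in snippets:
        pred = reader(question=reformulated_question, context=snippet)
        if pred["score"] > best["score"]:
            best = pred
    return best["answer"]

# Toy usage with a hypothetical snippet; in our release, the top-10 Google
# snippets for each reformulated question are provided with the dataset.
snippets = ["Elephants are poached primarily for their ivory tusks."]
print(answer_from_snippets("What is elephant poached for?", snippets))
```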

6 EXPERIMENTS AND RESULTS

We report experiments with OKVQAS3 and S3VQA, using our S3

model in comparison with a strong baseline (the BLOCK [4] model).

In the following subsections, we explain the approaches being com-

pared (Section 6.1), evaluation settings employed in the experiments

(Section 6.2) and the results (Section 6.4).

6.1 Methods

BLOCK [4]: BLOCK is a state-of-the-art multimodal fusion tech-

nique for VQA, where the image and question pairs are embedded

in a bimodal space. This bimodal representation is then passed to

a softmax output layer, to predict the final answer. Note that, as

per the currently prevalent practice in OKVQA (referred to in Sec-

tions 1.1 and 4) the softmax is performed over the 𝑘 most frequent

answers in the training set. We chose BLOCK as our baseline, since

it is an improved version of a VQA system called MUTAN [3] that

was originally reported as a baseline for the OKVQA [25] dataset.

By design, BLOCK is a vanilla VQA model that takes an image

and question as its inputs and predicts an answer. Modifying it to

additionally accept context as an input did not help improve its

accuracies.

S3: As described in Section 5, our model S3 processes each question

using the select, substitute and search operations.

We use SpanBERT [18] as our MRC system to extract candidate

answers from the relevant text snippets. We use the pretrained


BERT-base model² to implement the select and substitute opera-

tions.

6.2 Evaluation

Closed-domain (classification) setting: In this setting, the an-

swer is predicted from a predefined vocabulary that is constructed

using the top-𝑘 answers in the training set. Since OKVQAS3 is a

relatively small subset of the OKVQA dataset and since all sys-

tems were observed to improve in their evaluation scores with

increasing values of 𝑘, we report numbers by setting 𝑘 = 1084 in all experiments, which corresponds to setting the predefined

answer vocabulary to the entire train set. The evaluation criterion for OKVQAS3 is the standard VQA metric [2], which is defined as:

Accuracy(ans) = min(#humans that said ans / 3, 1)    (3)
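A minimal sketch of this metric (with hypothetical human answers) is:

```python
from collections import Counter

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Standard VQA accuracy: min(#humans that said the answer / 3, 1)."""
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[predicted.strip().lower()] / 3.0, 1.0)

# OKVQA provides 10 human answers per question; the answers below are
# hypothetical. For S3VQA there is exactly one gold answer, so accuracy
# reduces to exact string match.
print(vqa_accuracy("tusk", ["tusk", "ivory", "tusk", "tusk", "ivory",
                            "tusk", "ivory", "tusks", "tusk", "ivory"]))  # 1.0
```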

For S3VQA, we note that accuracy is calculated by doing an ex-

act string match with the answer, since we have exactly one correct

ground truth answer per question. In retrospect, 42.1% of answers

in the test set of OKVQAS3 are found to be present in the train

set. Thus 42.1% serves as a generous skyline in the classification

setting for OKVQAS3 (for S3VQA, the overlap is 11.6%).

Open-domain setting: The vocabulary of answers is unconstrained

while predicting the answer. The predicted answer should exactly

match one of the annotated answers to be counted as correct. The

evaluation metric is the same as in the classification setting.

6.3 Implementation details

Select module. To select the span, the input question is first en-

coded using a pretrained BERT model. BERT outputs a 768 dimen-

sional representation for each of the tokens in the input. Each

token’s output representation is then passed to a linear layer

(size: 768×2) with two values in the output that correspond to the

probability of the token being the start (and end) of the desired

span. This linear layer is trained with cross entropy loss using the

gold span labels. During prediction, the tokens with the highest

values for start and end are used to mark the start and end of the

spans. In case the start token comes after the end token during

prediction, an empty span is returned.
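A minimal PyTorch/HuggingFace sketch of this span selector is given below; the gold start/end indices in the usage example are hypothetical, and the head size follows the description above.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class SpanSelector(nn.Module):
    """BERT encoder with a 768x2 linear head producing start/end logits."""
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        start_logits, end_logits = self.head(hidden).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = SpanSelector()
enc = tokenizer("What is this animal poached for?", return_tensors="pt")
start_logits, end_logits = model(enc["input_ids"], enc["attention_mask"])

# Training minimizes cross-entropy against the gold start/end token indices;
# at prediction time the argmax positions mark the selected span.
loss_fn = nn.CrossEntropyLoss()
gold_start, gold_end = torch.tensor([3]), torch.tensor([4])  # hypothetical
loss = loss_fn(start_logits, gold_start) + loss_fn(end_logits, gold_end)
print(loss.item())
```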

Substitute module. It comprises two parts: a) We apply a pre-

trained BERT model to encode the input question and all the other

reformulated questions using each detection as the replacement

for the span. BERT outputs a 768 dimensional representation for

each of the tokens in the input. We compute the representation

for a span using the averaged representation of all its tokens. The

span representation and all other BERT representations of the

detections within the reformulated questions are passed through

a linear layer (size: 768 × 2048) followed by tanh activation and

another linear layer (size: 2048 × 1024). We use the span as the

anchor, the gold object as the positive instance and all other de-

tections as negative instances for the triplet loss. b) We take the

representation of the span and the detection from the network

above and compute the cosine distance between these represen-

tations. This is further passed through a linear layer (size: 2 × 1)

²https://github.com/google-research/bert

                          Reformulation
                     Original    Gold    Predicted
BLOCK  W/o context     24.12     25.74     25.0
S3     W/o MRC         12.03     25.02     19.44
S3     With MRCsing    13.51     31.81     26.43
S3     With MRCmult    18.07     33.55     28.57

Table 1: Classification results on OKVQAS3. ‘W/o’ expands as ‘without’.

along with the hypernym-to-hyponym score described in Sec-

tion 5.1. This linear layer is trained using a binary cross-entropy

loss.

SpanBERT. We use a SpanBERT model pretrained on SQuAD to

find the answer from the retrieved snippets. We further fine-tune

this model on OKVQAS3 by using the gold reformulated question,

the snippets retrieved from it and the ground truth answers.

S3 classification. We merge the candidate answer representation

(from the S3 open-domain setting) with the question representa-

tion using BLOCK’s fusion module [4]. This fused representation

is then fed to a softmax output layer with labels as the most fre-

quent 𝑘 = 1084 answers for OKVQAS3 (𝑘 = 2048 for S3VQA) in

the training set. The label with the highest score is returned as

the predicted answer.

6.4 Results

Classification Results. In Table 1, we report classification results

on OKVQAS3 using three different variants of S3. For these experi-

ments, we use SpanBERT as our MRC system. The questions can

either be in their original form (“Original”), or reformulated using the ground-truth annotations (“Gold”), or reformulated using our span selector and substitution module (“Predicted”). We experiment

with three variants of S3 that use the top 10 snippets from Google

concatenated together (henceforth referred to as T10text):

Without MRC: We bypass the use of an MRC module altogether;

T10text is directly fed as an input to the classification module.

With MRCsing: T10text is passed as input, along with the reformu-

lated question, to SpanBERT. The output from SpanBERT, along

with the reformulated question, are combined and fed as input to

the classification module.

With MRCmult: Each of the snippets in T10text is passed to SpanBERT to produce ten output spans. We use a simple attention layer over these spans to derive a single output representation, which is further fed as input to the classification module.
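A minimal sketch of such an attention layer over the per-snippet span representations (dimensions assumed) is:

```python
import torch
import torch.nn as nn

class SpanAttentionPooler(nn.Module):
    """Simple attention over per-snippet MRC span representations,
    producing a single vector for the classification module (MRC_mult)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, span_reprs):                 # (batch, n_spans, dim)
        weights = torch.softmax(self.scorer(span_reprs), dim=1)
        return (weights * span_reprs).sum(dim=1)   # (batch, dim)

# Toy usage: ten span representations from SpanBERT, one per snippet.
pooler = SpanAttentionPooler()
spans = torch.randn(2, 10, 768)
print(pooler(spans).shape)   # torch.Size([2, 768])
```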

The main observation from Table 1 is that the MRC system,

with its ability to extract relevant spans from the T10text snippets,

is critical to performance. Using either MRCsing or MRCmult in S3 provides a substantial boost in performance compared to S3 without any MRC system. And MRCmult consistently outperforms MRCsing.

The BLOCK model significantly underperforms compared to S3.

Open-domain Results: In Table 2, we present open-domain re-

sults on both OKVQAS3 and our newly constructed S3VQA using

MRCsing. We report accuracies on all the three main steps of our

proposed architecture. While the open-domain setting is free of

any answer bias, unlike the classification setting, it does not have

the advantage of drawing from a fixed vocabulary of answers. De-

spite this perceived disadvantage, the best open-domain results on


Train on     Test on     Select   Substitute   Search
OKVQAS3      S3VQA       63.8     24.6         14.6*
S3VQA        S3VQA       94.9     61.9         23.3*
S3VQA        OKVQAS3     57.2     32.5         24.2
OKVQAS3      OKVQAS3     67.2     55.1         30.5

Table 2: Open-domain results on S3VQA and OKVQAS3. As for the low search results (marked *), we recall from Section 3.2 that unlike OKVQA, S3VQA has exactly one correct ground truth answer per question, leading to a much stricter evaluation and therefore much lower numbers. In fact, BLOCK yields close to 0 evaluation score on S3VQA.

                   Testing on OKVQAS3
MRC System   without fine-tuning   with fine-tuning
T5                    16           37.6 (G), 30.7 (P)
SpanBERT              18           33.9 (G), 28.9 (P)

Table 3: Open-domain results using different MRC systems. (G) refers to ground-truth reformulated questions, (P) refers to reformulated questions predicted using S3.

OKVQAS3 are comparable to (in fact slightly better than) the best

classification results (i.e. 28.9 vs. 28.57). Recall from Section 3.2 that

unlike OKVQA, S3VQA has exactly one correct ground truth answer

per question leading to a much stricter evaluation and therefore

much lower ‘search’ numbers. Thus, while BLOCK yields close to 0 evaluation score on S3VQA (not reported in the table), results using

S3 are marginally better.

Different MRC Systems. We experiment with two different MRC

systems, T5 and SpanBERT, and present open-domain results on

OKVQAS3 using both these systems. The columns labeled “without

fine-tuning" refer to the original pretrained models for both systems

and "with fine-tuning" refers to fine-tuning the pretrained models

using OKVQAS3. Fine-tuning the models with target data signifi-

cantly helps performance, as has been shown in prior work [37].

Scores labeled with (G) show the performance when select and

substitute work perfectly.

7 CONCLUSION

In this paper, we identify key limitations of existing multimodal

QA datasets in being opaque and uninterpretable in their reasoning.

Towards addressing these limitations, we present OKVQAS3, an

annotated subset of the existing OKVQA dataset, as well as design and

build a new challenge data set S3VQA that focuses on a specific

structural idiom that frequently appears in VQA. We also present

a structurally transparent and interpretable system S3 tailored to

answer questions from our challenge data set and show that it

outperforms strong baselines in both existing classification as well

as the proposed open-domain settings.

REFERENCES

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,

C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering.

In ICCV. 2425–2433. http://openaccess.thecvf.com/content_iccv_2015/papers/

Antol_VQA_Visual_Question_ICCV_2015_paper.pdf

[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,

C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In

Proceedings of the IEEE international conference on computer vision. 2425–2433.
[3] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. 2017. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE international conference on computer vision. 2612–2620.

[4] Hedi Ben-Younes, Remi Cadene, Nicolas Thome, and Matthieu Cord. 2019. Block:

Bilinear superdiagonal fusion for visual question answering and visual relation-

ship detection. In AAAI, Vol. 33. 8102–8109. https://ojs.aaai.org/index.php/AAAI/article/view/4818/4691

[5] J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic Parsing on Freebase

from Question-Answer Pairs. In EMNLP Conference. 1533–1544. http://aclweb.

org/anthology//D/D13/D13-1160.pdf

[6] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic

Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, 1533–1544. https:

//www.aclweb.org/anthology/D13-1160

[7] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor.

2008. Freebase: a collaboratively created graph database for structuring human

knowledge. In SIGMOD Conference. 1247–1250. http://ids.snu.ac.kr/w/images/9/

98/sc17.pdf

[8] Stefan Büttcher, Charles L. A. Clarke, and Brad Lushman. 2006. Term proximity

scoring for ad-hoc retrieval on very large text collections. In SIGIR Conference (Seattle, Washington, USA). ACM, 621–622. https://doi.org/10.1145/1148170.

1148285

[9] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. 2019. Don’t Take the

Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases. In

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China,

4069–4082. https://doi.org/10.18653/v1/D19-1418

[10] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa

Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering?

try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018).

https://arxiv.org/pdf/1803.05457.pdf),

[11] Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. 2017.

Human attention in visual question answering: Do humans and deep networks

look at the same regions? Computer Vision and Image Understanding 163 (2017),

90–100.

[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert:

Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[13] Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and

Mausam Mausam. 2011. Open Information Extraction: The Second Genera-

tion. In IJCAI. 3–10. https://www.aaai.org/ocs/index.php/IJCAI/IJCAI11/paper/

viewFile/3353/3408

[14] Wikimedia Foundation. 2021. Mediawiki Parser. Code. https://pypi.org/project/

pymediawiki/

[15] François Gardères, Maryam Ziaeefard, Baptiste Abeloos, and Freddy Lecue. 2020.

ConceptBert: Concept-Aware Representation for Visual Question Answering. In

Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 489–498. https://doi.org/10.18653/v1/

2020.findings-emnlp.44

[16] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich

Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014. IEEE Computer Society, 580–587. https:

//doi.org/10.1109/CVPR.2014.81

[17] Drew A Hudson and Christopher D Manning. 2019. GQA: A new

dataset for real-world visual reasoning and compositional question

answering. In CVPR. 6700–6709. https://openaccess.thecvf.com/

content_CVPR_2019/papers/Hudson_GQA_A_New_Dataset_for_Real-

World_Visual_Reasoning_and_Compositional_CVPR_2019_paper.pdf

[18] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and

Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and

Predicting Spans. Transactions of the Association for Computational Linguistics 8 (2020), 64–77. https://doi.org/10.1162/tacl_a_00300

[19] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph

convolutional networks. arXiv preprint arXiv:1609.02907 (2016).

[20] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua

Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al.

2017. Visual genome: Connecting language and vision using crowdsourced dense

image annotations. International journal of computer vision 123, 1 (2017), 32–73.

https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf


[21] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi

Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov,

Tom Duerig, and Vittorio Ferrari. 2020. The Open Images Dataset V4: Unified

Image Classification, Object Detection, and Visual Relationship Detection at

Scale. International Journal of Computer Vision 128 (03 2020). https://doi.org/10.

1007/s11263-020-01316-z

[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva

Ramanan, Piotr Dollár, and C. Zitnick. 2014. Microsoft COCO: Common Objects

in Context. (05 2014).

[23] Yuanhua Lv and ChengXiang Zhai. 2009. Positional language models for infor-

mation retrieval. In SIGIR Conference. 299–306. https://doi.org/10.1145/1571941.

1571994

[24] Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question

answering about real-world scenes based on uncertain input. arXiv preprint arXiv:1410.0210 (2014). https://arxiv.org/pdf/1410.0210

[25] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi.

2019. OK-VQA: A visual question answering benchmark requiring external

knowledge. In CVPR. 3195–3204. https://openaccess.thecvf.com/content_CVPR_

2019/papers/Marino_OK-VQA_A_Visual_Question_Answering_Benchmark_

Requiring_External_Knowledge_CVPR_2019_paper.pdf

[26] Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the Wrong Rea-

sons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 3428–3448.

https://doi.org/10.18653/v1/P19-1334

[27] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 3(4) (1990), 235–244. https://wordnet.princeton.edu/

[28] Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi,

and Luke Zettlemoyer. 2019. Compositional Questions Do Not Necessitate Multi-

hop Reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence,

Italy, 4249–4257. https://doi.org/10.18653/v1/P19-1416

[29] Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning

hierarchical representations. In Advances in neural information processing systems. 6338–6347.

[30] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt

Schiele, Trevor Darrell, and Marcus Rohrbach. 2018. Multimodal explanations:

Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8779–8788.

[31] Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020.

Unsupervised Question Decomposition for Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 8864–8880. https:

//doi.org/10.18653/v1/2020.emnlp-main.713

[32] Desislava Petkova and W Bruce Croft. 2007. Proximity-based document repre-

sentation for named entity retrieval. In CIKM. ACM, 731–740. https://doi.org/

10.1145/1321440.1321542

[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,

Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the lim-

its of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).

[34] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016.

SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392.

https://doi.org/10.18653/v1/D16-1264

[35] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-

CNN: Towards Real-Time Object Detection with Region Proposal Networks.

In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama,

and Roman Garnett (Eds.). 91–99. https://proceedings.neurips.cc/paper/2015/

hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html

[36] Marco Tulio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are Red Roses

Red? Evaluating Consistency of Question-Answering Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6174–6184. https://doi.org/10.

18653/v1/P19-1621

[37] Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How Much Knowledge

Can You Pack into the Parameters of a Language Model?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 5418–5426.

[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,

Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander

Berg, and Li Fei-Fei. 2014. ImageNet Large Scale Visual Recognition Challenge.

International Journal of Computer Vision 115 (09 2014). https://doi.org/10.1007/

s11263-015-0816-y

[39] Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open

Multilingual Graph of General Knowledge. Proceedings of the AAAI Conference on Artificial Intelligence 31, 1 (Feb. 2017). https://ojs.aaai.org/index.php/AAAI/

article/view/11164

[40] Haitian Sun, Tania Bedrax-Weiss, and William Cohen. 2019. PullNet: Open Do-

main Question Answering with Iterative Retrieval on Knowledge Bases and Text.

In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong,

China, 2380–2390. https://doi.org/10.18653/v1/D19-1242

[41] Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational

Linguistics, Brussels, Belgium, 4231–4242. https://doi.org/10.18653/v1/D18-1455

[42] Alon Talmor and Jonathan Berant. 2018. The Web as a Knowledge-Base for

Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational

Linguistics, New Orleans, Louisiana, 641–651. https://doi.org/10.18653/v1/N18-

1059

[43] Alon Talmor and Jonathan Berant. 2019. MultiQA: An Empirical Investigation of

Generalization and Transfer in Reading Comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4911–4921. https://doi.org/10.

18653/v1/P19-1485

[44] Yixuan Tang, Hwee Tou Ng, and Anthony K. H. Tung. 2020. Do Multi-Hop Question Answering Systems Know How to Answer the Single-Hop Sub-Questions? arXiv preprint arXiv:2002.09919 (2020). https://arxiv.org/pdf/2002.09919

[45] Ellen Voorhees. 2001. Overview of the TREC 2001 Question Answering Track.

In The Tenth Text REtrieval Conference (NIST Special Publication), Vol. 500-250. 42–51. http://trec.nist.gov/pubs/trec10/t10_proceedings.html

[46] Ellen M Voorhees. 1999. The TREC-8 Question Answering Track Report. In TREC. http://trec.nist.gov/pubs/trec8/papers/qa_report.pdf

[47] Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong

Yu. 2019. Dynamically Fused Graph Network for Multi-hop Reasoning. arXiv preprint arXiv:1905.06933 (2019). https://arxiv.org/pdf/1905.06933

[48] Xuchen Yao and Benjamin Van Durme. 2014. Information Extraction over

Structured Data: Question Answering with Freebase. In ACL Conference. ACL. http://www.cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf

[49] Jing Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, and Jianlong Tan.

2020. Cross-modal knowledge reasoning for knowledge-based visual question

answering. Pattern Recognition 108 (2020), 107563. https://arxiv.org/abs/2009.

00145

[50] Haoyu Zhang, Jingjing Cai, Jianjun Xu, and Ji Wang. 2019. Complex Question

Decomposition for Semantic Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational

Linguistics, Florence, Italy, 4477–4486. https://doi.org/10.18653/v1/P19-1440

[51] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus:

Pre-training with extracted gap-sentences for abstractive summarization. In

International Conference on Machine Learning. PMLR, 11328–11339.
