Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

Aman Jain∗
Mayank Kothyari∗
IIT Bombay
India
Vishwajeet Kumar
IBM Research, India
Preethi Jyothi
IIT Bombay
India
Ganesh Ramakrishnan
IIT Bombay
India
Soumen Chakrabarti
IIT Bombay
India
ABSTRACT
Multimodal IR, spanning text corpus, knowledge graph and images,
called outside knowledge visual question answering (OKVQA), is
of much recent interest. However, the popular data set has serious
limitations. A surprisingly large fraction of queries do not assess the
ability to integrate cross-modal information. Instead, some are in-
dependent of the image, some depend on speculation, some require
OCR or are otherwise answerable from the image alone. To add to
the above limitations, frequency-based guessing is very effective
because of (unintended) widespread answer overlaps between train
and test folds. Overall, it is hard to determine when state-of-the-
art systems exploit these weaknesses rather than really infer the
answers, because they are opaque and their ‘reasoning’ process is
uninterpretable. An equally important limitation is that the dataset
is designed for the quantitative assessment only of the end-to-end
answer retrieval task, with no provision for assessing the correct
(semantic) interpretation of the input query. In response, we iden-
tify a key structural idiom in OKVQA, viz., S3 (select, substitute and search), and build a new data set and challenge around it. Specif-
ically, the questioner identifies an entity in the image and asks a
question involving that entity which can be answered only by con-
sulting a knowledge graph or corpus passage mentioning the entity.
Our challenge consists of (i) OKVQAS3, a subset of OKVQA anno-
tated based on the structural idiom and (ii) S3VQA, a new dataset
built from scratch. We also present a neural but structurally trans-
parent OKVQA system, S3, that explicitly addresses our challenge
data set, and outperforms recent competitive baselines. We make
our code and data available at https://s3vqa.github.io/
1 INTRODUCTION
Multimodal question answering (QA), specifically, combining visual
and textual information, is of much recent interest [1, 17, 20, 24].
∗Both authors contributed equally to this research.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from [email protected].
SIGIR ’21. © 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00
https://doi.org/10.1145/1122445.1122456
Figure 1: Sample OKVQA image and query: ‘What is this animal poached for?’. As discussed in Section 3, we associate the
following additional ground truth with each such question:
‘What is this <select, substitute=“elephant”>animal</select> poached for?’
In the simplest visual QA (VQA) variant, the query is textual and a
given image suffices, by itself, to answer it. A more challenging vari-
ant is where outside knowledge (OK) from a corpus or knowledge
graph needs to be combined with image information to answer the
query [25]. This task is referred to as OKVQA.
An example OKVQA instance is shown in Figure 1. The system
cannot answer the question by detecting objects from the image, or
by recognizing colour, count, shape as in traditional VQA [1, 17, 20,
24]). The system first needs to understand the relationship between
the image content and the question and then needs to query the
outside world for retrieving relevant facts (in this particular case,
about the poaching of elephants).
For our purposes, a host of modern image processing tech-
niques [16, 35] (which we consider as black boxes) can identify
image patches representing salient entities (elephant, man, jeep,
tree, sky, camera) and some spatial relations between them (e.g., man holding camera, man sitting in a jeep, etc.). This is called a
scene graph. Knowledge that elephant is-a animal may come from a
knowledge graph [7] or open information extraction [13]. Treated
thus, OKVQA is less of a visual recognition challenge and more of
a challenge to fuse unstructured text and (graph) structured data
extracted from the image, which is relevant to the Information
Retrieval (IR) community.
1.1 Discontent with current state of OKVQA
Small step vs. giant leap. QA in the TREC community [45, 46] be-
gan as “corpus QA” long before neural NLP became widespread. Nat-
urally, any structural query interpretation or scoring of responses
arXiv:2103.05568v3 [cs.CV] 10 Aug 2021
was completely transparent and deliberate. QA entered the modern
NLP community as “KGQA” [5, 48], which was focused on precise
semantic interpretation of queries. Initial efforts at corpus QA in the
NLP community also had modest goals: given a query and passage,
identify a span from the passage that best answers the query. Such
“reading comprehension” (RC) generalized proximity-sensitive scor-
ing [8, 23, 32]. Compared to these “small steps”, OKVQA was a
“giant leap”, placing enormous faith in opaque neural machinery to
answer visual queries that have no controlled space of reasoning
structure. Some questions require optical character recognition to
‘read’ street signs in the image, others are speculative (“can you
guess the celebration where people are enjoying?”), questions that
are independent of the image (“Pigeons are also known as what?”),
and questions that require only the image and no outside informa-
tion to answer (“Which breed of dog is it?”).
The unwelcome power of guesswork. Rather than score candidate
answers from a large set, such as entities in a KG, or spans from pas-
sages in a large open-domain corpus, current systems treat OKVQA
as a classification problem over the 𝑘 most frequent answers in the
training set. We found a huge train-test overlap, which condones
and even rewards this restrictive strategy: 48.9% of answers in the
test set of OKVQA [25] were found to be present in the train set.
Retrospection. Such limitations are not specific to OKVQA, but
have been witnessed in other communities as well. Guessing has
been found unduly successful for the widely used SQuAD QA
dataset [9, 34]. The complex QA benchmark HotPotQA [47] was
supposed to test multi-step reasoning. However, Min et al. [28]
found that top-scoring QA systems as well as humans could guess
the answer without such reasoning for a large fraction of queries.
Their conclusion is particularly instructive: “there should be an
increasing focus on the role of evidence in multi-hop reasoning and
possibly even a shift towards information retrieval style evaluations
with large and diverse evidence collections”. Even more damning is
the work of Tang et al. [44], who report that “state-of-the-art multi-
hop QA models fail to answer 50–60% of sub-questions, although
the corresponding multi-hop questions are correctly answered”.
Other tasks such as entailment [26] have faced similar fate.
Remedies. Awareness of the above-mentioned pitfalls has led to
algorithmic improvements, principally by reverting toward explicit
query structure decomposition. Training data can be augmented
with decomposed clauses [44]. The system can learn to decom-
pose queries by looking up a single-clause query collection [31, 50].
Explicit and partly interpretable control can be retained on the
assembly of responses for subqueries [40, 41]. Earlier datasets and
evaluations are being fixed and augmented as well [9, 10, 36]. New
datasets are being designed to be compositional from the ground
up [43], so that system query decomposition quality can be directly
assessed. Here we pursue both the data and algorithm angles: we
present a new data set where the structural intent of every query
is clear by design, and we also propose an interpretable neural
architecture to handle this form of reasoning. Our new dataset
and our supporting algorithm are designed keeping in mind the
interpretability of the entire system and can thus potentially
contribute to the advancement of research in other applications
that include question answering over video, audio as well as
over knowledge graphs. As mentioned earlier, most state-of-the-
art VQA systems are trained to fit an answer distribution; this
brings into question their ability to generalize well to test queries.
To address this issue and improve the explanatory capabilities of
VQA systems, prior work has investigated the use of visual explana-
tions [11] and textual explanations [30? ] to guide the VQA process.
Our new challenge dataset and our proposed OKVQA architecture
have the advantage of being inherently explanatory, without having
to rely on any external explanations. Further, we contribute a
new benchmark S3VQA that is explanatory in its very gene-
sis, as will be described in Section 3.2.
1.2 A new OKVQA Challenge Dataset
In response to the serious shortcomings we have described above,
we present a new OKVQA challenge. Our dataset consists of two
parts, each of which is provided with rich, compositional annota-
tions that provision for assessment of a system’s query reformu-
lation viz., (i) OKVQAS3, a subset of the OKVQA [25] dataset with
such compositional annotations (Section 3.1) and (ii) S3VQA, a com-
pletely new dataset (c.f., Section 3.2) that has been sanitized from
the ground up against information leakage or guesswork-friendly
answering strategies. Every question in our new benchmark is guar-
anteed to require integration of KG, image-extracted information,
and search in an open-domain corpus (such as the Web). Specifi-
cally, the new data set focuses on a specific but extremely common
reasoning idiom, shown in Figure 1. The query pertains to an entity
𝑒 in the image (such as elephant), but refers to it in more general
terms, such as type 𝑡 (animal) of which 𝑒 is an instance. There is
no control over the vocabulary for mentioning 𝑒 or 𝑡 . In our run-
ning example, an OKVQA system may implement a substitution to generate the query “what is elephant poached for?”, drawing
‘elephant’ out of a visual object recognizer’s output. While many
reasoning paradigms may be important for OKVQA, we argue that
this is an important one. Our data set exercises OKVQA systems in
fairly transparent ways and reveals how (well) they perform this
basic form of reasoning. Addressing other structures of reasoning
is left for future work. Despite this restricted reasoning paradigm,
which ought to be well within the capabilities of general-purpose
state-of-the-art OKVQA systems, we are surprised to find existing
OKVQA models yield close to 0 evaluation score on S3VQA.
1.3 An interpretable OKVQA system
Continuing in the spirit of “small steps before giant leap”, we present
S3 (c.f., Section 5), a neural OKVQA system that targets this class
of queries and reasoning structure. Our system thus has a known
interpretation target for each query, and is therefore interpretable
and affords systematic debugging of its modules. S3 has access to
the query and the scene graph of the image. It first selects a query span to substitute with (the string description of) an object from the
scene graph; this object, too, has to be selected from many objects in
the image. Sometimes, the question span and the object description
would be related through an instance-of or subtype-of relation in a
typical KG or linguistic database; ideally, S3 should take advantage
of this signal. There are now two ways to use the reformulated
question.We can regard answer selection as a classification problem
like much of prior work, or we can prepare a Web search query,
Figure 2: Example question for each category: (a) Type 1, (b) Type 2, (c) Type 3, (d) rest.
get responses, and learn to extract and report answer span/s in the
RC style (referred to as the open-domain setting). S3 is wired to do
both (c.f., Section 6.4).
2 LIMITATIONS IN EXISTING OKVQA DATA
The widely used benchmark OKVQA dataset [25] consists of over
14000 question-image pairs, with 9000 training examples and 5000
examples in the test set. We identify two broad issues with it.
First, significant overlap exists between answers in the train and
test folds. Recall from Section 1.1 that 48.9% of answers in the test
set are present in the training set. Existing systems leverage this
limitation to boost their accuracy by limiting their test answers to
the most frequent answers in the training set.
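The extent of this overlap is easy to quantify. The sketch below (using toy data, not the actual OKVQA folds) measures the fraction of test answers that also occur among the train answers:

```python
def answer_overlap(train_answers, test_answers):
    """Fraction of test answers that also occur among the train answers."""
    train_set = {a.strip().lower() for a in train_answers}
    hits = sum(1 for a in test_answers if a.strip().lower() in train_set)
    return hits / len(test_answers)

# Toy data, not the actual OKVQA folds:
train = ["tusk", "dog", "kite", "rain"]
test = ["tusk", "umbrella", "dog", "snow"]
print(answer_overlap(train, test))  # 0.5
```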
Second, unlike WebQuestions [6] or ComplexWebQuestions [42],
OKVQA questions, even when grouped into some categories (see
below) have no clear pattern of reasoning. 18% (type-1) of the
questions require detecting objects and subsequent reasoning over
an external knowledge source to arrive at the answer. 7% of the
questions (type-2) require reading text from the image (OCR) (and
no other information) to answer. 12% of the questions (type-3) are
based on personal opinion or speculation. The remaining questions
(rest) can perhaps be described best through Figure 2d. We provide
an example of each type in Figure 2.
We found that several queries of type-1 have a structural simi-
larity to the bridging queries in ComplexWebQuestions [42]. There,
each query has exactly two clauses. The first clause, when issued
to a Web search engine, returns (via RC) an entity which plugs into
the second clause, which is again sent to the Web search engine,
fetching the overall answer. We found that type-1 questions can
be reformulated, with the help of the scene graph, to a query that
can be answered directly using Web search. Inspired by Complex-
WebQuestions, we next develop our challenge data set.
3 DESIGN OF A NEW OKVQA CHALLENGE
DATA SET
Consider the type-1 query associated with Figure 1: “what is this
animal poached for?” We discovered that such questions can be
answered through a sequence of three well-defined steps:
Select: This step identifies a span in the question (“this animal”)
that needs to be replaced with some information from the image, as
part of the query reformulation process. As part of our dataset(s),
we make available the ground truth span for each question, to
facilitate the development and evaluation of the selection model,
independently of the other components of an OKVQA system. We
hypothesize that any OKVQA system that fetches the correct
answer must solve this subproblem correctly in the first place.
Substitute: Having identified the query span to replace, the substitute operation determines a key-phrase associated with the image
that should replace the selected span. Such a key-phrase could be
the name of an object in the image (e.g., ‘elephant’) or some at-
tributes within the image (such as color), or a relationship between
objects/people (e.g., ‘carrying in hand’). As part of our dataset(s),
we release the keyphrase substituting each selected span in the
question. This data can be used for independent training and eval-
uation of implementations of the substitution operation, assuming
oracle assistance from the other operations.
Search: After the query has been reformulated as described above,
it can be used to harvest candidate answers from a corpus or a
KG. The reformulated query in Figure 1 will be “what is elephant
poached for?” with associated ground truth answer ‘tusk’. By
providing gold query reformulations and gold answers, we also
facilitate the evaluation of the search operation, independent of
the selection and substitution operations.
As another example, the question in Figure 2a can be reformu-
lated from “how many chromosomes do these creatures have?” to
“how many chromosomes do humans have?” by replacing the hyper-
nym or super-type “these creatures” with the hyponym or sub-type
‘humans’. Again, the reformulated query can be answered using an
open domain QA system.
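Taken together, the select and substitute steps amount to a single span replacement in the question string. A minimal sketch (function and variable names are ours, for illustration):

```python
def reformulate(question, span, substitute):
    """Select + substitute: swap the selected hypernym span
    for the image-derived key-phrase."""
    assert span in question, "the selected span must occur in the question"
    return question.replace(span, substitute, 1)

print(reformulate("what is this animal poached for?", "this animal", "elephant"))
# what is elephant poached for?
```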
In the following two subsections, we describe the two parts of our
challenge data set. First, in Section 3.1, we describe OKVQAS3, pro-
duced by subsetting and annotating OKVQA to fit the specifications
justified above. Next, in Section 3.2, we describe S3VQA, created
from the ground up to our specifications.
3.1 OKVQAS3: Annotated subset of OKVQA
This is a subset of the OKVQA dataset in which we have annotated
every question with spans, substitutions and gold answers. More
specifically, we divide the OKVQA dataset into two parts based on
the category of the question. As discussed in Section 2, the first
part consists of all the questions of category Type 1 and the second
part consists of rest of the questions. We refer to the first part as
OKVQAS3 and the second part as OKVQA\S3. The name OKVQAS3
represents the three operations (Select, Substitute and Search) needed
to answer all the questions of category Type 1. OKVQA\S3 is the
remaining OKVQA dataset after removing OKVQAS3 .
2640 of the 14000 question-image pairs in OKVQA are of type-1,
amounting to 18.8% of the total. For these questions, our ground
truth annotations also include the object and span to be used for
select and substitution for every image question pair. The average
length of the span selected was 2.44 words (13.14 characters), with a
standard deviation of 1.36 (7.24). When COCO [22], ImageNet [38]
and OpenImages [21] object detection models were run on this set,
24.7% of the question-image pairs did not have the ground truth
object in the detections owing to detection errors and vocabulary
limitations.
3.2 S3VQA: New dataset built from scratch
Gathering experience from the process of annotating the subset
OKVQAS3 of OKVQA, we build a new benchmark dataset S3VQA
in a bottom-up manner, pivoting on the select, substitute and search operations, applied in that order, to each query. Our S3VQA data is
built on the Open Images collection [21]. For each entity/object/class
name 𝑒 (e.g., peacock) in the OpenImages dataset, we identified its
corresponding ‘parent’ label 𝑡 (e.g., bird) that generalizes 𝑒 . We em-
ployed the hierarchical structure specified within the Open Images
dataset itself to ensure that 𝑒 → 𝑡 enjoy a hyponym-hypernym or
instance-category relation.
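A parent label can be recovered from such a hierarchy with a simple depth-first search. The nested structure below is a toy stand-in for (not an excerpt of) the Open Images hierarchy file:

```python
# Toy class hierarchy in the nested-dictionary style of the
# Open Images hierarchy file (structure assumed for illustration).
HIERARCHY = {
    "LabelName": "Entity",
    "Subcategory": [
        {"LabelName": "Bird",
         "Subcategory": [{"LabelName": "Peacock"}, {"LabelName": "Owl"}]},
        {"LabelName": "Vehicle",
         "Subcategory": [{"LabelName": "Jeep"}]},
    ],
}

def find_parent(node, target, parent=None):
    """Depth-first search for the immediate parent label of target."""
    if node["LabelName"] == target:
        return parent
    for child in node.get("Subcategory", []):
        found = find_parent(child, target, node["LabelName"])
        if found is not None:
            return found
    return None

print(find_parent(HIERARCHY, "Peacock"))  # Bird
```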
Next, we (semi-automatically) identified an appropriate Wikipedia
page for 𝑒 and, using the Wikimedia parser [14] on that page, ex-
tracted text snippets. A question generation (QG) model based on
T5 [33] was used to generate ⟨question, answer⟩ pairs from these
snippets. We retained only those pairs in which the question had
an explicit mention of 𝑒 . Note that unlike OKVQA, by design, the
S3VQA dataset has exactly one correct ground truth answer per question. In our experimental results (in Section 6.4, specifically
Table 2) we observe how this leads to a much stricter evaluation of
search systems and therefore much lower numbers for S3VQA in
comparison to OKVQA, which offers 10 answers per question [25].
Subsequently, each mention of 𝑒 in the filtered question set was
replaced with the ‘parent’ label 𝑡 using manually defined templates.
One such template is to replace 𝑒 and the determiner preceding it
(e.g., a, an, the) with the string ‘this 𝑡 ’. Finally, this question set was
filtered and cleaned up manually using the following guidelines:
(i) any question which could be answered just by using the image
or was not fact-based was eliminated; (ii) any question that needed
grammatical improvements was corrected, without changing its
underlying meaning.
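One of the templates above, replacing 𝑒 and its preceding determiner with ‘this 𝑡’, can be sketched with a regular expression (the exact templates used are not reproduced here; this is an illustrative approximation):

```python
import re

def generalize(question, entity, parent):
    """Replace a mention of the entity, together with any immediately
    preceding determiner (a/an/the), with 'this <parent>'."""
    e = re.escape(entity)
    pattern = re.compile(
        r"\b(?:a|an|the)\s+" + e + r"\b" + r"|\b" + e + r"\b",
        re.IGNORECASE,
    )
    return pattern.sub("this " + parent, question, count=1)

print(generalize("why does a peacock spread its feathers?", "peacock", "bird"))
# why does this bird spread its feathers?
```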
We also provided paraphrased questions (generated using T5
and Pegasus [51]) as suggestions along with each templated question
for the annotators to pick instead of the original templated question
in order to bring in variety among the questions. Finally, for each
question we pick an image from OpenImages corresponding to the
object 𝑒 referred to in it. We ensure that we pick a distinct image each
time in case the object 𝑒 is referred to in multiple questions.
4 PRIOR ARCHITECTURES
VQA that requires reasoning over information from external knowl-
edge sources has recently gained a lot of research interest. VQA
systems [25] have started incorporating external knowledge for
question answering. Existing methods use one of the following
approaches to integrate external knowledge:
(1) Retrieve relevant facts about objects in the image and entities
in the question and reason over extracted facts to arrive at the
final answer for the question.
(2) Collect evidence from text snippets returned from a search
engine and extract answer from the text snippet.
The baseline architecture proposed by Marino et al. [25] intro-
duced ArticleNet to retrieve Wikipedia articles relevant to the ques-
tion and image entities. ArticleNet encodes a Wikipedia article
using a GRU and it is trained to predict whether the ground truth
answer is present in the article. The hidden states of sentences in
the article are used as the encoded representation. This encoded
representation is given as input to a multimodal fusion model along
with the question and image features. The answer is predicted by
formulating the problem as a classification on the most frequent
answers in the training set, leveraging the huge overlap in answers
in training and test sets. Gardères et al. [15] jointly learn knowledge,
visual and language embeddings. They use ConceptNet [39] as the
knowledge source, and graph convolution networks [19] to inte-
grate the information. Similar to the OKVQA baseline system, they
also formulate the problem as classification on the most frequent
answers in the training set [49].
5 PROPOSED ARCHITECTURE
Figure 3 shows an overall schematic diagram of our proposed S3
(Select, Substitute, Search) architecture. As described in Section 3,
Select and Substitute are two operations that, when sequentially
applied to a question, result in a reformulated query that could be
answered without further reference to the image. Question refor-
mulation is followed by a Search operation that involves a Machine
Reading Comprehension (MRC) module to extract answers from
the top snippets retrieved via a search engine for each of the refor-
mulated questions.
5.1 Select and Substitute
The Select operation entails determining a span from the question
that serves as a good placeholder for an object in the image. We
refer to this module as the SpanSelector in Figure 3. To implement
this module, we use a large pre-trained language model (such as
BERT [12]) with additional feed-forward layers (denoted by FFL1)
that are trained to predict the start and end of the span. These addi-
tional layers are trained to minimize the cross-entropy loss between
the predicted spans and the ground truth spans that accompany all
the questions in our dataset.
Once we have a predicted span from the SpanSelector, we want
to determine which object from the image is best described by the
question span. First, we rely on a state-of-the-art object detection
system to provide a list of most likely objects present in the image.
Next, we aim to identify one object from among this list that would
be most appropriate as a substitution for the span that the question
is centred around. This is achieved by minimizing the following
Figure 3: Schematic diagram of S3. FFL refers to feedforward layers that are trained using our datasets.
combined loss function that predicts an object as the best substitute
for a question span:
L = 𝛼 · Ltrip (𝑠, 𝑝, 𝑛) + (1 − 𝛼) · LBCE (𝑠, 𝑝, 𝑛) (1)
where Ltrip is a triplet loss defined over triplets of embeddings
corresponding to the question span (𝑠), the ground-truth (gold)
object (𝑝) and a detected object different from the ground-truth
object (𝑛), LBCE is a binary cross-entropy loss over each detected
object and the span with the gold object acting as the reference, and
𝛼 is a mixing coefficient that is tuned as a hyperparameter. Here,
Ltrip is defined as:
Ltrip (𝑠, 𝑝, 𝑛) = max(𝑑 (𝑠, 𝑝) − 𝑑 (𝑠, 𝑛) +𝑚, 0) (2)
where 𝑑 (·, ·) is a distance function on the embedding space (such
as cosine distance) and𝑚 is a predefined margin hyperparameter
that forces examples from the same class (𝑠 and 𝑝) to be closer than
examples from different classes (𝑠 and 𝑛). (Both Ltrip and LBCE
have been explicitly listed in Figure 3.)
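A minimal sketch of Equations (1) and (2), using cosine distance for 𝑑(·, ·). How the BCE probabilities are produced from the scorer is our assumption here (probabilities in (0, 1) per detection), not a detail specified above:

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(s, p, n, margin=0.2):
    # Eq. (2): the gold object p must be closer to the span s than the
    # negative n, by at least the margin m.
    return max(cosine_distance(s, p) - cosine_distance(s, n) + margin, 0.0)

def bce_loss(prob_pos, prob_neg, eps=1e-9):
    # Binary cross-entropy with the gold object as the positive reference;
    # prob_* are assumed scorer outputs in (0, 1) (our assumption).
    return -(np.log(prob_pos + eps) + np.log(1.0 - prob_neg + eps)) / 2.0

def combined_loss(s, p, n, prob_pos, prob_neg, alpha=0.5):
    # Eq. (1): convex combination of the two losses, mixed by alpha.
    return alpha * triplet_loss(s, p, n) + (1.0 - alpha) * bce_loss(prob_pos, prob_neg)

s = np.array([1.0, 0.0])   # span embedding (toy)
p = np.array([0.9, 0.1])   # gold object embedding
n = np.array([0.0, 1.0])   # negative: another detected object
print(combined_loss(s, p, n, prob_pos=0.9, prob_neg=0.2))
```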
Computing LBCE (𝑠, 𝑝, 𝑛) also involves the use of a hypernym-
hyponym scorer that examines whether or not each (question span,
detected object) pair exhibits a hypernym-hyponym relationship.
The scorer makes use of a lexical database such as WordNet [27]
that explicitly contains hypernym-hyponym relationships between
words. We map the span and detected object to the corresponding
synsets in WordNet, and use the pre-trained Poincaré embeddings
by Nickel and Kiela [29] to embed them. These pre-trained embed-
dings were trained on the (hypernymy based) hierarchical represen-
tations of the word sequences in WordNet synsets by embedding
them into an 𝑛-dimensional Poincaré ball by leveraging the distance
property of hyperbolic spaces.
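The distance used on the Poincaré ball has a closed form, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))), which can be computed directly:

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_norm = lambda x: sum(t * t for t in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Distances blow up near the boundary of the ball, which is what lets
# deep hypernym hierarchies be embedded with low distortion.
print(poincare_distance([0.0, 0.0], [0.5, 0.0]))  # ln 3 ≈ 1.0986
```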
Finally, the reformulated question is obtained by substituting
the predicted span in the question with the object that is predicted
with highest probability as being the best substitute for the span.
5.2 Search
After the question has been reformulated, we pass it through Google’s
search engine to retrieve the top 10 most relevant snippets.1 These
snippets are further passed as input to a Machine Reading Com-
prehension (MRC) module. An MRC system takes a question and
a passage as its input and predicts a span from the passage that
1To ensure reproducibility, we release these snippets as part of our dataset.
is most likely to be the answer. Using the MRC module, we aim
at meaningfully pruning the relatively large amounts of text in
Google’s top 10 snippets and distilling it down to what is most
relevant to the reformulated question.
The MRC module also allows us to move beyond the classifica-
tion setting (as mentioned in Section 1.1) and directly predict an
answer given a question and its corresponding context. We refer
to this as the open-domain task, shown in Figure 3. For fair com-
parisons to prior work that adopt the classification setting, we also
support classification as shown in Figure 3. Here, embeddings for
both the reformulated question and the output from the MRC are
concatenated and fed as input to a softmax output layer over the
top 𝑘 answers in the training set.
6 EXPERIMENTS AND RESULTS
We report experiments with OKVQAS3 and S3VQA, using our S3
model in comparison with a strong baseline (the BLOCK [4] model).
In the following subsections, we explain the approaches being com-
pared (Section 6.1), evaluation settings employed in the experiments
(Section 6.2) and the results (Section 6.4).
6.1 Methods
BLOCK [4]: BLOCK is a state-of-the-art multimodal fusion tech-
nique for VQA, where the image and question pairs are embedded
in a bimodal space. This bimodal representation is then passed to
a softmax output layer, to predict the final answer. Note that, as
per the currently prevalent practice in OKVQA (referred to in Sec-
tions 1.1 and 4) the softmax is performed over the 𝑘 most frequent
answers in the training set. We chose BLOCK as our baseline, since
it is an improved version of a VQA system called MUTAN [3] that
was originally reported as a baseline for the OKVQA [25] dataset.
By design, BLOCK is a vanilla VQA model that takes an image
and question as its inputs and predicts an answer. Modifying it to
additionally accept context as an input did not help improve its
accuracies.
S3: As described in Section 5, our model S3 processes each question
using the select, substitute and search operations.
We use SpanBERT [18] as our MRC system to extract candidate
answers from the relevant text snippets. We use the pretrained
BERT-base model2 to implement the select and substitute opera-
tions.
6.2 Evaluation
Closed-domain (classification) setting: In this setting, the an-
swer is predicted from a predefined vocabulary, that is constructed
using the top-𝑘 answers in the training set. Since OKVQAS3 is a
relatively small subset of the OKVQA dataset and since all sys-
tems were observed to improve in their evaluation scores with
increasing values of 𝑘 , we report numbers by setting 𝑘 = 1084 in all experiments, which corresponds to setting the predefined
answer vocabulary to the entire train set. The evaluation criteria
for OKVQAS3 is the standard VQA metric [2] which is defined as:
Accuracy(ans) = min{ (#humans that said ans) / 3 , 1 } (3)
For S3VQA, we note that accuracy is calculated by doing an ex-
act string match with answer since we have exactly one correct
ground truth answer per question. In retrospect, 42.1% of answers
in the test set of OKVQAS3 are found to be present in the train
set. Thus 42.1% serves as a generous skyline in the classification
setting for OKVQAS3 (for S3VQA, the overlap is 11.6%).
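The standard VQA metric above reduces to a short function (assuming simple case-insensitive string matching, which is our simplification of the official answer normalization):

```python
def vqa_accuracy(predicted, human_answers):
    """VQA accuracy: fully correct if at least 3 annotators gave the
    predicted answer, partially correct otherwise."""
    norm = lambda a: a.strip().lower()
    matches = sum(1 for a in human_answers if norm(a) == norm(predicted))
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("tusk", ["tusk", "ivory", "tusk", "tusks", "tusk"]))  # 1.0
```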
Open-domain setting: The vocabulary of answers is unconstrained
while predicting the answer. The predicted answer should exactly
match one of the annotated answers to be counted as correct. The
evaluation metric is the same as in the classification setting.
6.3 Implementation details
Select module To select the span, the input question is first en-
coded using a pretrained BERT model. BERT outputs a 768 dimen-
sional representation for each of the tokens in the input. Each
token’s output representation is then passed to a linear layer
(size:768×2) with two values in the output that correspond to the
probability of the token being the start (and end) of the desired
span. This linear layer is trained with cross entropy loss using the
gold span labels. During prediction, the tokens with the highest
values for start and end are used to mark the start and end of the
spans. In case the start token comes after the end token during
prediction, an empty span is returned.
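The decoding rule just described (argmax over start and end positions, empty span if the start falls after the end) can be sketched as:

```python
def decode_span(start_logits, end_logits, tokens):
    """Argmax span decoding; returns an empty span if start > end."""
    start = max(range(len(tokens)), key=lambda i: start_logits[i])
    end = max(range(len(tokens)), key=lambda i: end_logits[i])
    if start > end:
        return ""
    return " ".join(tokens[start:end + 1])

tokens = ["what", "is", "this", "animal", "poached", "for", "?"]
start_logits = [0.1, 0.0, 2.5, 0.3, 0.1, 0.0, 0.0]   # peaks at "this"
end_logits   = [0.0, 0.1, 0.2, 3.0, 0.1, 0.0, 0.0]   # peaks at "animal"
print(decode_span(start_logits, end_logits, tokens))  # this animal
```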
Substitute module It comprises two parts: a) We apply a pre-
trained BERT model to encode the input question and all the other
reformulated questions using each detection as the replacement
for the span. BERT outputs a 768 dimensional representation for
each of the tokens in the input. We compute the representation
for a span using the averaged representation of all its tokens. The
span representation and all other BERT representations of the
detections within the reformulated questions are passed through
a linear layer (size: 768 × 2048) followed by tanh activation and
another linear layer (size: 2048 × 1024). We use the span as the
anchor, the gold object as the positive instance and all other de-
tections as negative instances for the triplet loss. b) We take the
representation of the span and the detection from the network
above and compute the cosine distance between these represen-
tations. This cosine distance, along with the hypernym-to-hyponym score described in Section 5.1, is further passed through a linear layer (size: 2 × 1). This linear layer is trained using a binary cross-entropy loss.

2https://github.com/google-research/bert

                       Reformulation
                  Original   Gold    Predicted
BLOCK  W/o context   24.12    25.74    25.0
S3     W/o MRC       12.03    25.02    19.44
S3     With MRCsing  13.51    31.81    26.43
S3     With MRCmult  18.07    33.55    28.57

Table 1: Classification results on OKVQAS3. ‘W/o’ expands as ‘without’.
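The two parts of the substitute module can be sketched as follows. This is a schematic stand-in: plain NumPy vectors replace the BERT and linear-layer representations, and we assume cosine distance means 1 − cosine similarity:

```python
import numpy as np

def triplet_loss(anchor, positive, negatives, margin=1.0):
    """Part (a): triplet loss with the question span as anchor, the gold
    object detection as positive, and all other detections as negatives."""
    dist = lambda a, b: np.linalg.norm(a - b)
    return sum(max(0.0, margin + dist(anchor, positive) - dist(anchor, n))
               for n in negatives)

def substitute_score(span_vec, det_vec, hypernym_score, w, b):
    """Part (b): combine the cosine distance between span and detection
    embeddings with the hypernym-to-hyponym score via a 2x1 linear layer,
    followed here by a sigmoid to match the binary cross-entropy loss."""
    cos = span_vec @ det_vec / (np.linalg.norm(span_vec) * np.linalg.norm(det_vec))
    z = np.array([1.0 - cos, hypernym_score]) @ w + b
    return 1.0 / (1.0 + np.exp(-z))
```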
SpanBERT: We use a SpanBERT model pretrained on SQuAD to
find the answer from the retrieved snippets. We further fine-tune
this model on OKVQAS3 by using the gold reformulated question,
the snippets retrieved from it and the ground truth answers.
S3 classification: We merge the candidate answer representation
(from the S3 open-domain setting) with the question representa-
tion using BLOCK’s fusion module [4]. This fused representation
is then fed to a softmax output layer with labels as the most fre-
quent 𝑘 = 1084 answers for OKVQAS3 (𝑘 = 2048 for S3VQA) in
the training set. The label with the highest score is returned as
the predicted answer.
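This final classification step can be sketched as follows; an elementwise product stands in for BLOCK's bilinear fusion here, which is a simplification of ours:

```python
import numpy as np

def classify_answer(answer_repr, question_repr, W_out, vocab):
    """Fuse the candidate-answer and question representations (elementwise
    product as a placeholder for BLOCK's fusion), project to the top-k
    answer vocabulary, apply softmax, and return the argmax label."""
    fused = answer_repr * question_repr       # placeholder fusion
    logits = fused @ W_out                    # shape: (k,), k = len(vocab)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the k answers
    return vocab[int(np.argmax(probs))]
```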
6.4 Results
Classification Results. In Table 1, we report classification results
on OKVQAS3 using three different variants of S3. For these experi-
ments, we use SpanBERT as our MRC system. The questions can
either be in their original form ("Original"), reformulated using the ground-truth annotations ("Gold"), or reformulated using our span selector and substitution module ("Predicted"). We experiment
with three variants of S3 that use the top 10 snippets from Google
concatenated together (henceforth referred to as T10text):
Without MRC: We bypass the use of an MRC module altogether;
T10text is directly fed as an input to the classification module.
With MRCsing: T10text is passed as input, along with the reformu-
lated question, to SpanBERT. The output from SpanBERT, along
with the reformulated question, is combined and fed as input to the classification module.
With MRCmult: Each of the snippets in T10text is passed to SpanBERT to produce ten output spans. We use a simple attention layer over these spans to derive a single output representation, which is then fed as input to the classification module.
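The attention layer used in MRCmult can be sketched as a simple learned weighted pooling; the single scoring vector `w_att` is our assumption:

```python
import numpy as np

def attend_over_spans(span_reprs, w_att):
    """Attention over the ten SpanBERT output spans: score each span
    representation, softmax the scores, and return the weighted sum
    as a single pooled representation."""
    scores = span_reprs @ w_att               # shape: (num_spans,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return weights @ span_reprs               # shape: (dim,)
```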
The main observation from Table 1 is that the MRC system,
with its ability to extract relevant spans from the T10text snippets,
is critical to performance. Using either MRCsing or MRCmult in S3 provides a substantial boost in performance compared to S3 without any MRC system, and MRCmult consistently outperforms MRCsing.
The BLOCK model significantly underperforms compared to S3.
Open-domain Results: In Table 2, we present open-domain re-
sults on both OKVQAS3 and our newly constructed S3VQA using
MRCsing. We report accuracies on all three main steps of our
proposed architecture. While the open-domain setting is free of
any answer bias, unlike the classification setting, it does not have
the advantage of drawing from a fixed vocabulary of answers. De-
spite this perceived disadvantage, the best open-domain results on
Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering. SIGIR ’21.
Train on    Test on     Select   Substitute   Search
OKVQAS3     S3VQA        63.8      24.6        14.6
S3VQA       S3VQA        94.9      61.9        23.3
S3VQA       OKVQAS3      57.2      32.5        24.2
OKVQAS3     OKVQAS3      67.2      55.1        30.5

Table 2: Open-domain results on S3VQA and OKVQAS3. As for the low ‘search’ results, we recall from Section 3.2 that unlike OKVQA, S3VQA has exactly one correct ground truth answer per question, leading to a much stricter evaluation and therefore much lower numbers. In fact, BLOCK yields an evaluation score close to 0 on S3VQA.
MRC System   without fine-tuning   with fine-tuning
T5                  16             37.6 (G)   30.7 (P)
SpanBERT            18             33.9 (G)   28.9 (P)

Table 3: Open-domain results on OKVQAS3 using different MRC systems. (G) refers to ground-truth reformulated questions, (P) refers to reformulated questions predicted using S3.
OKVQAS3 are comparable to (in fact slightly better than) the best
classification results (i.e. 28.9 vs. 28.57). Recall from Section 3.2 that
unlike OKVQA, S3VQA has exactly one correct ground truth answer
per question leading to a much stricter evaluation and therefore
much lower ‘search’ numbers. Thus, while BLOCK yields an evaluation score close to 0 on S3VQA (not reported in the table), results using
S3 are marginally better.
Different MRC Systems. We experiment with two different MRC
systems, T5 and SpanBERT, and present open-domain results on
OKVQAS3 using both these systems. The columns labeled "without fine-tuning" refer to the original pretrained models for both systems, and "with fine-tuning" refers to fine-tuning the pretrained models
using OKVQAS3. Fine-tuning the models with target data signifi-
cantly helps performance, as has been shown in prior work [37].
Scores labeled with (G) show the performance when select and
substitute work perfectly.
7 CONCLUSION
In this paper, we identify key limitations of existing multimodal QA datasets, which allow systems to remain opaque and uninterpretable in their reasoning. Towards addressing these limitations, we present OKVQAS3, a refinement of the existing OKVQA dataset, and we design and build a new challenge data set, S3VQA, that focuses on a specific structural idiom that frequently appears in VQA. We also present a structurally transparent and interpretable system, S3, tailored to answer questions from our challenge data set, and show that it outperforms strong baselines in both the existing classification setting and the proposed open-domain setting.
REFERENCES
[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV. 2425–2433. http://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf
[2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision. 2425–2433.
[3] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. 2017. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE international conference on computer vision. 2612–2620.
[4] Hedi Ben-Younes, Remi Cadene, Nicolas Thome, and Matthieu Cord. 2019. Block: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In AAAI, Vol. 33. 8102–8109. https://ojs.aaai.org/index.php/AAAI/article/view/4818/4691
[5] J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP Conference. 1533–1544. http://aclweb.org/anthology//D/D13/D13-1160.pdf
[6] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, 1533–1544. https://www.aclweb.org/anthology/D13-1160
[7] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference. 1247–1250. http://ids.snu.ac.kr/w/images/9/98/sc17.pdf
[8] Stefan Büttcher, Charles L. A. Clarke, and Brad Lushman. 2006. Term proximity scoring for ad-hoc retrieval on very large text collections. In SIGIR Conference (Seattle, Washington, USA). ACM, 621–622. https://doi.org/10.1145/1148170.1148285
[9] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. 2019. Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 4069–4082. https://doi.org/10.18653/v1/D19-1418
[10] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018). https://arxiv.org/pdf/1803.05457.pdf
[11] Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. 2017.
Human attention in visual question answering: Do humans and deep networks
look at the same regions? Computer Vision and Image Understanding 163 (2017),
90–100.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[13] Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam Mausam. 2011. Open Information Extraction: The Second Generation. In IJCAI. 3–10. https://www.aaai.org/ocs/index.php/IJCAI/IJCAI11/paper/viewFile/3353/3408
[14] Wikimedia Foundation. 2021. Mediawiki Parser. Code. https://pypi.org/project/pymediawiki/
[15] François Gardères, Maryam Ziaeefard, Baptiste Abeloos, and Freddy Lecue. 2020. ConceptBert: Concept-Aware Representation for Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 489–498. https://doi.org/10.18653/v1/2020.findings-emnlp.44
[16] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014. IEEE Computer Society, 580–587. https://doi.org/10.1109/CVPR.2014.81
[17] Drew A Hudson and Christopher D Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR. 6700–6709. https://openaccess.thecvf.com/content_CVPR_2019/papers/Hudson_GQA_A_New_Dataset_for_Real-World_Visual_Reasoning_and_Compositional_CVPR_2019_paper.pdf
[18] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics 8 (2020), 64–77. https://doi.org/10.1162/tacl_a_00300
[19] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[20] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al.
2017. Visual genome: Connecting language and vision using crowdsourced dense
image annotations. International journal of computer vision 123, 1 (2017), 32–73.
https://link.springer.com/content/pdf/10.1007/s11263-016-0981-7.pdf
[21] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. 2020. The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale. International Journal of Computer Vision 128 (03 2020). https://doi.org/10.1007/s11263-020-01316-z
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Zitnick. 2014. Microsoft COCO: Common Objects in Context. (05 2014).
[23] Yuanhua Lv and ChengXiang Zhai. 2009. Positional language models for information retrieval. In SIGIR Conference. 299–306. https://doi.org/10.1145/1571941.1571994
[24] Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. arXiv preprint arXiv:1410.0210 (2014). https://arxiv.org/pdf/1410.0210
[25] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR. 3195–3204. https://openaccess.thecvf.com/content_CVPR_2019/papers/Marino_OK-VQA_A_Visual_Question_Answering_Benchmark_Requiring_External_Knowledge_CVPR_2019_paper.pdf
[26] Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 3428–3448. https://doi.org/10.18653/v1/P19-1334
[27] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 3(4) (1990), 235–244. https://wordnet.princeton.edu/
[28] Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional Questions Do Not Necessitate Multi-hop Reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4249–4257. https://doi.org/10.18653/v1/P19-1416
[29] Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems. 6338–6347.
[30] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. 2018. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8779–8788.
[31] Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised Question Decomposition for Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 8864–8880. https://doi.org/10.18653/v1/2020.emnlp-main.713
[32] Desislava Petkova and W Bruce Croft. 2007. Proximity-based document representation for named entity retrieval. In CIKM. ACM, 731–740. https://doi.org/10.1145/1321440.1321542
[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[34] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264
[35] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett (Eds.). 91–99. https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html
[36] Marco Tulio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are Red Roses Red? Evaluating Consistency of Question-Answering Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6174–6184. https://doi.org/10.18653/v1/P19-1621
[37] Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How Much Knowledge Can You Pack into the Parameters of a Language Model?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 5418–5426.
[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander Berg, and Li Fei-Fei. 2014. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (09 2014). https://doi.org/10.1007/s11263-015-0816-y
[39] Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. Proceedings of the AAAI Conference on Artificial Intelligence 31, 1 (Feb. 2017). https://ojs.aaai.org/index.php/AAAI/article/view/11164
[40] Haitian Sun, Tania Bedrax-Weiss, and William Cohen. 2019. PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 2380–2390. https://doi.org/10.18653/v1/D19-1242
[41] Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 4231–4242. https://doi.org/10.18653/v1/D18-1455
[42] Alon Talmor and Jonathan Berant. 2018. The Web as a Knowledge-Base for Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 641–651. https://doi.org/10.18653/v1/N18-1059
[43] Alon Talmor and Jonathan Berant. 2019. MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4911–4921. https://doi.org/10.18653/v1/P19-1485
[44] Yixuan Tang, Hwee Tou Ng, and Anthony K H Tung. 2020. Do Multi-Hop Question Answering Systems Know How to Answer the Single-Hop Sub-Questions? arXiv preprint arXiv:2002.09919 (2020). https://arxiv.org/pdf/2002.09919
[45] Ellen Voorhees. 2001. Overview of the TREC 2001 Question Answering Track. In The Tenth Text REtrieval Conference (NIST Special Publication), Vol. 500-250. 42–51. http://trec.nist.gov/pubs/trec10/t10_proceedings.html
[46] Ellen M Voorhees. 1999. The TREC-8 Question Answering Track Report. In TREC. http://trec.nist.gov/pubs/trec8/papers/qa_report.pdf
[47] Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically Fused Graph Network for Multi-hop Reasoning. arXiv preprint arXiv:1905.06933 (2019). https://arxiv.org/pdf/1905.06933
[48] Xuchen Yao and Benjamin Van Durme. 2014. Information Extraction over Structured Data: Question Answering with Freebase. In ACL Conference. ACL. http://www.cs.jhu.edu/~xuchen/paper/yao-jacana-freebase-acl2014.pdf
[49] Jing Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, and Jianlong Tan. 2020. Cross-modal knowledge reasoning for knowledge-based visual question answering. Pattern Recognition 108 (2020), 107563. https://arxiv.org/abs/2009.00145
[50] Haoyu Zhang, Jingjing Cai, Jianjun Xu, and Ji Wang. 2019. Complex Question Decomposition for Semantic Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4477–4486. https://doi.org/10.18653/v1/P19-1440
[51] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus:
Pre-training with extracted gap-sentences for abstractive summarization. In
International Conference on Machine Learning. PMLR, 11328–11339.