Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
Michael Wray
University of Bristol
Diane Larlus
Naver Labs Europe
Gabriela Csurka
Naver Labs Europe
Dima Damen
University of Bristol
Abstract
We address the problem of cross-modal fine-grained ac-
tion retrieval between text and video. Cross-modal retrieval
is commonly achieved through learning a shared embedding
space into which both modalities can be embedded. In
this paper, we propose to enrich the embedding by disen-
tangling parts-of-speech (PoS) in the accompanying cap-
tions. We build a separate multi-modal embedding space
for each PoS tag. The outputs of multiple PoS embed-
dings are then used as input to an integrated multi-modal
space, where we perform action retrieval. All embeddings
are trained jointly through a combination of PoS-aware and
PoS-agnostic losses. Our proposal enables learning spe-
cialised embedding spaces that offer multiple views of the
same embedded entities.
We report the first retrieval results on fine-grained ac-
tions for the large-scale EPIC dataset, in a generalised
zero-shot setting. Results show the advantage of our ap-
proach for both video-to-text and text-to-video action re-
trieval. We also demonstrate the benefit of disentangling
the PoS for the generic task of cross-modal video retrieval
on the MSR-VTT dataset.
1. Introduction
With the onset of the digital age, millions of hours of
video are being recorded and searching this data is becom-
ing a monumental task. It is even more tedious when search-
ing shifts from video-level labels, such as ‘dancing’ or ‘ski-
ing’, to short action segments like ‘cracking eggs’ or ‘tight-
ening a screw’. In this paper, we focus on the latter and
refer to them as fine-grained actions. We thus explore the
task of fine-grained action retrieval where both queries and
retrieved results can be either a video sequence, or a textual
caption describing the fine-grained action. Such free-form
action descriptions allow for a more subtle characterisation
of actions but require going beyond training a classifier on
a predefined set of action labels [20, 30].
Figure 1. We target fine-grained action retrieval. Action captions
are broken using part-of-speech (PoS) parsing. We create separate
embedding spaces for the relevant PoS (e.g. Noun or Verb) and
then combine these embeddings into a shared embedding space
for action retrieval (best viewed in colour).
As is common in cross-modal search tasks [26, 36], we
learn a shared embedding space onto which we project both
videos and captions. By nature, fine-grained actions can
be described by an actor, an act and the list of objects
involved in the interaction. We thus propose to learn a
separate embedding for each part-of-speech (PoS), such as for
instance verbs, nouns or adjectives. This is illustrated in
Fig. 1 for two PoS (verbs and nouns). When embedding
verbs solely, relevant entities are those that share the same
verb/act regardless of the nouns/objects used. Conversely,
for a PoS embedding focusing on nouns, different actions
performed on the same object are considered relevant enti-
ties. This enables a PoS-aware embedding, specialised for
retrieving a variety of relevant entities, given that PoS. The
outputs from the multiple PoS embedding spaces are then
combined within an encoding module that produces the fi-
nal action embedding. We train our approach end-to-end,
jointly optimising the multiple PoS embeddings and the fi-
nal fine-grained action embedding.
This approach has a number of advantages over training
a single embedding space as is standardly done [7, 8, 15,
22, 24]. Firstly, this process builds different embeddings
that can be seen as different views of the data, which
contribute to the final goal in a collaborative manner. Secondly,
it allows us to inject additional information in a principled
way without requiring additional annotation, as parsing
a caption for PoS is done automatically. Finally, when
considering a single PoS at a time, for instance verbs, the
corresponding PoS-embedding learns to generalise across the
variety of actions involving each verb (e.g. the many ways
‘open’ can be used). This generalisation is key to tackling
more actions including new ones not seen during training.
We present the first retrieval results for the recent
large-scale EPIC dataset [6] (Sec. 4.1), utilising the released
free-form narrations, previously unexplored for this dataset, as
our supervision. Additionally, we show that our second con-
tribution, learning PoS-aware embeddings, is also valuable
for general video retrieval by reporting results on the MSR-
VTT dataset [39] (Sec. 4.2).
2. Related Work
Recently, neural networks trained with a ranking loss
considering image pairs [27], triplets [35], quadruplets [5]
or beyond [32], have been considered for metric learn-
ing [17, 35] and for a broad range of search tasks such
as face/person identification [29, 5, 16, 2] or instance re-
trieval [10, 27]. These learning-to-rank approaches have
been generalised to two or more modalities. Standard ex-
amples include building a joint embedding for images and
text [11, 36], videos and audio [33] and, more related to
our work, for videos and action labels [15], videos and
text [8, 14, 40] or some of those combined [25, 24, 22].
Representing text. Early works in image-to-text cross-
modal retrieval [9, 11, 36] used TF-IDF as a weighted bag-
of-words model for text representations (either from a word
embedding model or one-hot vectors) in order to aggre-
gate variable length text captions into a single fixed sized
representation. With the advent of neural networks, works
shifted to use RNNs, Gated Recurrent Units (GRU) or Long
Short-Term Memory (LSTM) units to extract textual fea-
tures [8] or to use these models within the embedding net-
work [15, 18, 24, 25, 34] for both modalities.
Action embedding and retrieval. Joint embedding spaces
are a standard tool to perform action retrieval. Zhang et
al. [42] use a Semantic Similarity Embedding (SSE) in or-
der to perform action recognition. Their method, inspired
by sparse coding, splits train and test data into a mixture of
proportions of already seen classes which then generalises
to unseen classes at test time. Mithun et al. [24] create two
embedding spaces: an activity space using flow and audio,
along with an object space from RGB. Their method
encodes the captions with GRUs and the output vectors
from the activity and object spaces are concatenated to rank
videos. We instead create Part of Speech embedding spaces
which we learn jointly, allowing our method to capture re-
lationships between e.g. verbs and nouns.
Hahn et al. [15] use two LSTMs to directly project
videos into the Word2Vec embedding space. This method
is evaluated on higher-level activities, showing that such
a visual embedding aligns well with the learned space
of Word2Vec to perform zero-shot recognition of these
coarser-grained classes. Miech et al. [21] found that us-
ing NetVLAD [3] results in an increase in accuracy over
GRUs or LSTMs for aggregation of both visual and text
features. A follow up on this work [22] learns a mixture
of experts embedding from multiple modalities such as
appearance, motion, audio or face features. It learns a single
output embedding which is the weighted similarity between
the different implicit visual-text embeddings. Recently,
Miech et al. [23] proposed the HowTo100M dataset:
a large dataset collected automatically using generated
captions from YouTube ‘how to’ task videos. They find that
fine-tuning on these weakly-paired video clips allows for
state-of-the-art performance on a number of different datasets.
Fine-grained action recognition. Recently, several large-
scale datasets have been published for the task of fine-
grained action recognition [6, 12, 13, 31, 28]. These gener-
ally focus on a closed vocabulary of class labels describing
short and/or specific actions.
Rohrbach et al. [28] investigate hand and pose estimation
techniques for fine-grained activity recognition. By com-
positing separate actions, and treating them as attributes,
they can predict unseen activities via novel combinations of
seen actions. Mahdisoltani et al. [20] train for four different
tasks, including both coarse and fine grain action recogni-
tion. They conclude that training on fine-grain labels allows
for better learning of features for coarse-grain tasks.
In our previous work [38], we explored action retrieval
and recognition using multiple verb-only representations,
collected via crowd-sourcing. We found that a soft-assigned
representation was beneficial for retrieval tasks over using
the full verb-noun caption. While the approach enables
scaling to a broader vocabulary of action labels, such multi-
verb labels are expensive to collect for large datasets.
While focusing on fine-grained actions, we diverge from
these works by using open vocabulary captions for supervision.
As recognition is not suitable, we instead formulate this as
a retrieval problem. To our knowledge, no prior work has
attempted cross-modal retrieval on fine-grained actions. Our
endeavour has been facilitated by the recent release of open
vocabulary narrations on the EPIC dataset [6] which we
note is the only fine-grained dataset to do so. While our
work is related to both fine-grained action recognition and
general action retrieval, we emphasise that it is neither. We
next describe our proposed model.
3. Method
Figure 2. Overview of the JPoSE model. We first disentangle a caption into its parts of speech (PoS) and learn a Multi-Modal Embedding
Network (MMEN, Sec. 3.1) for each PoS (Sec. 3.2). The outputs of these PoS-MMENs are then encoded (ev, et) to get new representations
v̂i and t̂i on top of which the final embeddings f̂ and ĝ are learnt. JPoSE learns all of those jointly (Sec. 3.3), using a combination of
PoS-aware L^1, L^2, defined in Eq. (5) and PoS-agnostic L̂ losses, defined in Eq. (6). Non-trained modules are shown in grey.
Our aim is to learn representations suitable for cross-modal
search where the query modality is different from the
target modality. Specifically, we use video sequences with
textual captions/descriptions and perform video-to-text (vt)
or text-to-video (tv) retrieval tasks. Additionally, we would
like to make sure that classical search (where the query and
the retrieved results have the same modality) could still
be performed in that representation space. The latter are
referred to as video-to-video (vv) and text-to-text (tt) search
tasks. As discussed in the previous section, several possibil-
ities exist, the most common being embedding both modal-
ities in a shared space such that, regardless of the modality,
the representation of two relevant entities in that space are
close to each other, while the representation of two non-
relevant entities are far apart.
We first describe how to build such a joint embedding
between two modalities, enforcing both cross-modal and
within-modal constraints (Sec. 3.1). Then, based on the
knowledge that different parts of the caption encode differ-
ent aspects of an action, we describe how to leverage this in-
formation and build several disentangled Part of Speech em-
beddings (Sec. 3.2). Finally, we propose a unified represen-
tation well-suited for fine-grained action retrieval (Sec. 3.3).
3.1. Multi-Modal Embedding Network (MMEN)
This section describes a Multi-Modal Embedding Net-
work (MMEN) that encodes the video sequence and the text
caption into a common descriptor space.
Let {(vi, ti) | vi ∈ V, ti ∈ T} be a set of videos, with
vi being the visual representation of the ith video sequence
and ti the corresponding textual caption. Our aim is to learn
two embedding functions f : V → Ω and g : T → Ω,
such that f(vi) and g(ti) are close in the embedded space Ω.
Note that f and g can be linear projection matrices or more
complex functions e.g. deep neural networks. We denote
the parameters of the embedding functions f and g by
θf and θg respectively, and we learn them jointly with a
weighted combination of two cross-modal (Lv,t, Lt,v) and
two within-modal (Lv,v, Lt,t) triplet losses. Note that other
point-wise, pairwise or list-wise losses can also be consid-
ered as alternatives to the triplet loss.
The cross-modal losses are crucial to the task and en-
sure that the representations of a query and a relevant item
for that query from a different modality are closer than the
representations of this query and a non-relevant item. We
use cross-modal triplet losses [19, 36]:
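\[
L_{v,t}(\theta) = \sum_{(i,j,k) \in T_{v,t}} \max\big(\gamma + d(f_{v_i}, g_{t_j}) - d(f_{v_i}, g_{t_k}),\, 0\big),
\qquad T_{v,t} = \{(i,j,k) \mid v_i \in V,\ t_j \in T_{i+},\ t_k \in T_{i-}\}
\tag{1}
\]
\[
L_{t,v}(\theta) = \sum_{(i,j,k) \in T_{t,v}} \max\big(\gamma + d(g_{t_i}, f_{v_j}) - d(g_{t_i}, f_{v_k}),\, 0\big),
\qquad T_{t,v} = \{(i,j,k) \mid t_i \in T,\ v_j \in V_{i+},\ v_k \in V_{i-}\}
\tag{2}
\]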
where γ is a constant margin, θ = [θf , θg], and d(.) is the
distance function in the embedded space Ω. Ti+ and Ti− respectively
define the sets of relevant and non-relevant captions,
and Vi+ and Vi− the sets of relevant and non-relevant video
sequences, for the multi-modal object (vi, ti). To simplify the
notation, fvi denotes f(vi) ∈ Ω and gtj denotes g(tj) ∈ Ω.
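As an illustration of Eq. (1), the following is a minimal NumPy sketch of the video-to-text triplet loss. It assumes a Euclidean distance for d, which the text leaves unspecified, and takes precomputed embeddings and an explicit list of triplets; the function names and the toy values are illustrative and not part of the original method description.

```python
import numpy as np

def euclidean(a, b):
    """Distance d(., .) in the embedded space; Euclidean is an assumption here."""
    return float(np.linalg.norm(a - b))

def cross_modal_triplet_loss(f_v, g_t, triplets, margin):
    """Eq. (1): sum over (i, j, k) of max(margin + d(f_vi, g_tj) - d(f_vi, g_tk), 0).

    f_v: (N, D) embedded videos, g_t: (M, D) embedded captions,
    triplets: (i, j, k) with caption j relevant and caption k non-relevant to video i.
    """
    total = 0.0
    for i, j, k in triplets:
        total += max(margin + euclidean(f_v[i], g_t[j]) - euclidean(f_v[i], g_t[k]), 0.0)
    return total

# Toy embeddings (made-up values).
f_v = np.array([[0.0, 0.0], [1.0, 0.0]])
g_t = np.array([[0.0, 0.0], [3.0, 4.0]])

# Relevant caption already closer than the non-relevant one by more than the margin.
easy = cross_modal_triplet_loss(f_v, g_t, [(0, 0, 1)], margin=1.0)   # 0.0
# Roles swapped: the "relevant" caption is 5 away, the "non-relevant" one is 0 away.
hard = cross_modal_triplet_loss(f_v, g_t, [(0, 1, 0)], margin=1.0)   # 6.0
```

The text-to-video loss of Eq. (2) follows by swapping the roles of the two embedding arrays.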
Additionally, within-modal losses, also called structure
preserving losses [19, 36], ensure that the neighbourhood
structure within each modality is preserved in the newly
built joint embedding space. Formally,
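\[
L_{v,v}(\theta) = \sum_{(i,j,k) \in T_{v,v}} \max\big(\gamma + d(f_{v_i}, f_{v_j}) - d(f_{v_i}, f_{v_k}),\, 0\big),
\qquad T_{v,v} = \{(i,j,k) \mid v_i \in V,\ v_j \in V_{i+},\ v_k \in V_{i-}\}
\tag{3}
\]
\[
L_{t,t}(\theta) = \sum_{(i,j,k) \in T_{t,t}} \max\big(\gamma + d(g_{t_i}, g_{t_j}) - d(g_{t_i}, g_{t_k}),\, 0\big),
\qquad T_{t,t} = \{(i,j,k) \mid t_i \in T,\ t_j \in T_{i+},\ t_k \in T_{i-}\}
\tag{4}
\]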
using the same notation as before. The final loss used for
the MMEN network is a weighted combination of these four
losses, summed over all triplets in T defined as follows:
\[
L(\theta) = \lambda_{v,v} L_{v,v} + \lambda_{v,t} L_{v,t} + \lambda_{t,v} L_{t,v} + \lambda_{t,t} L_{t,t}
\tag{5}
\]
where λ is a weighting for each loss term.
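The weighted combination in Eq. (5) can be sketched as follows. This is an illustrative NumPy version, assuming a Euclidean distance and explicit triplet lists per loss term; the dictionary keys and helper names are hypothetical. The weights used in the example are the ones reported in the experimental details (0.1 for within-modal terms, 1.0 for cross-modal terms).

```python
import numpy as np

def triplet_loss(anchors, candidates, triplets, margin):
    """Sum of max(margin + d(a_i, c_j) - d(a_i, c_k), 0) over triplets (i, j, k)."""
    i, j, k = np.asarray(triplets).T
    d_pos = np.linalg.norm(anchors[i] - candidates[j], axis=1)
    d_neg = np.linalg.norm(anchors[i] - candidates[k], axis=1)
    return float(np.maximum(margin + d_pos - d_neg, 0.0).sum())

def mmen_loss(f_v, g_t, T, lam, margin):
    """Eq. (5): weighted sum of two within-modal and two cross-modal triplet losses."""
    return (lam["vv"] * triplet_loss(f_v, f_v, T["vv"], margin)
            + lam["vt"] * triplet_loss(f_v, g_t, T["vt"], margin)
            + lam["tv"] * triplet_loss(g_t, f_v, T["tv"], margin)
            + lam["tt"] * triplet_loss(g_t, g_t, T["tt"], margin))

# Toy data: item 1 is relevant to item 0, item 2 is not (made-up values).
f_v = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0]])
g_t = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
T = {name: [(0, 1, 2)] for name in ("vv", "vt", "tv", "tt")}
lam = {"vv": 0.1, "vt": 1.0, "tv": 1.0, "tt": 0.1}

loss = mmen_loss(f_v, g_t, T, lam, margin=3.0)
# vv: max(3+1-4, 0) = 0, vt: max(3+1-3, 0) = 1, tv: 0, tt: 1  ->  0.1*0 + 1 + 0 + 0.1*1 = 1.1
```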
3.2. Disentangled Part of Speech Embeddings
The previous section described the generic Multi-Modal
Embedding Network (MMEN). In this section, we pro-
pose to disentangle different caption components so each
component is encoded independently in its own embedding
space. To do this, we first break down the text caption
into different PoS tags. For example, the caption “I di-
vided the onion into pieces using wooden spoon” can be
divided into verbs, [divide, using], pronouns, [I], nouns,
[onion, pieces, spoon] and adjectives, [wooden]. In our
experiments, we focus on the most relevant ones for fine-
grained action recognition: verbs and nouns, but we explore
other types for general video retrieval. We extract all words
from a caption for a given PoS tag and train one MMEN to
only embed these words and the video representation in the
same space. We refer to it as a PoS-MMEN.
To train a PoS-MMEN, we propose to adapt the notion of
relevance specifically to the PoS. This has a direct impact on
the sets Vi+, Vi−, Ti+, Ti− defined in Equations (1)-(4). For
example, the caption ‘cut tomato’ is disentangled into the
verb ‘cut’ and the noun ‘tomato’. Consider a PoS-MMEN
focusing on verb tags solely. The caption ‘cut carrots’ is a
relevant caption as the pair share the same verb ‘cut’. In an-
other PoS-MMEN focusing on noun tags solely, the two re-
main irrelevant. As the relevant/irrelevant sets differ within
each PoS-MMEN, these embeddings specialise to that PoS.
It is important to note that, although the same visual fea-
tures are used as input for all PoS-MMEN, the fact that we
build one embedding space per PoS trains multiple visual
embedding functions f^k that can be seen as multiple views
of the video sequence.
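The disentangling step described in this section can be sketched in a few lines of Python. The example below assumes the caption has already been lemmatised and tagged; the hand-written tags stand in for a parser's output and the tag names are an assumption, as is the helper name `group_by_pos`.

```python
from collections import defaultdict

def group_by_pos(tagged_tokens, keep=("VERB", "NOUN")):
    """Collect the words of one caption under each PoS tag of interest."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        if tag in keep:
            groups[tag].append(word)
    # Return a plain dict with one (possibly empty) list per requested tag.
    return {tag: groups[tag] for tag in keep}

# Hand-tagged version of the example caption from the text (illustrative tags).
caption = [("I", "PRON"), ("divide", "VERB"), ("the", "DET"), ("onion", "NOUN"),
           ("into", "ADP"), ("pieces", "NOUN"), ("using", "VERB"),
           ("wooden", "ADJ"), ("spoon", "NOUN")]

parts = group_by_pos(caption)
# {'VERB': ['divide', 'using'], 'NOUN': ['onion', 'pieces', 'spoon']}
```

Each list would then be fed to the text branch of the PoS-MMEN for that tag, while the video branch of every PoS-MMEN receives the same visual features.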
3.3. PoS-Aware Unified Action Embedding
The previous section describes how to extract different
PoS from captions and how to build PoS-specific MMENs.
These PoS-MMENs can already be used alone for PoS-specific
retrieval tasks, for instance a verb-retrieval task
(e.g. retrieve all videos where “cut” is relevant) or a
noun-retrieval task.¹ More importantly, the output of different
PoS-MMENs can be combined to perform more complex
tasks, including the one we are interested in, namely
fine-grained action retrieval.
¹ Our experiments focus on action retrieval but we report on these other
tasks in the supplementary material.
Let us denote the kth PoS-MMEN visual and textual embedding
functions by f^k : V → Ω^k and g^k : T → Ω^k. We
define:
\[
\hat{v}_i = e_v\big(f^1_{v_i}, f^2_{v_i}, \ldots, f^K_{v_i}\big), \qquad
\hat{t}_i = e_t\big(g^1_{t^1_i}, g^2_{t^2_i}, \ldots, g^K_{t^K_i}\big)
\tag{6}
\]
where ev and et are encoding functions that combine the
outputs of the PoS-MMENs. We explore multiple pooling
functions for ev and et: concatenation, max and average; the
latter two assume all Ω^k share the same dimensionality.
When v̂i and t̂i have the same dimension, we can perform
action retrieval by directly computing the distance between
these representations. We instead propose to train a final
PoS-agnostic MMEN that unifies the representation, leading
to our final JPoSE model.
Joint Part of Speech Embedding (JPoSE). Considering
the PoS-aware representations v̂i and t̂i as input and,
still following our learning to rank approach, we learn
the parameters θ̂f and θ̂g of the two embedding functions
f̂ : V̂ → Γ and ĝ : T̂ → Γ which project into our final
embedding space Γ. We again consider this as the task of building
a single MMEN with the inputs v̂i and t̂i, and follow
the process described in Sec. 3.1. In other words, we train
using the loss defined in Equation (5), which we denote
L̂ here, and which combines two cross-modal and two
within-modal losses using the triplets Tv,t, Tt,v, Tv,v, Tt,t formed
using relevance between videos and captions based on the
action retrieval task. As relevance here is not PoS-aware,
we refer to this loss as PoS-agnostic. This is illustrated in
Fig. 2.
We learn the multiple PoS-MMENs and the final MMEN
jointly with the following combined loss:
\[
L(\hat{\theta}, \theta^1, \ldots, \theta^K) = \hat{L}(\hat{\theta}) + \sum_{k=1}^{K} \alpha^k L^k(\theta^k)
\tag{7}
\]
where α^k are weighting factors, L̂ is the PoS-agnostic loss
described above and L^k are the PoS-aware losses corresponding
to the K PoS-MMENs.
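To make the encoding functions of Eq. (6) and the combined loss of Eq. (7) concrete, here is a small NumPy sketch. It assumes the PoS-MMEN outputs and the individual loss values have already been computed; `encode` and `joint_loss` are hypothetical names and the numbers are made up.

```python
import numpy as np

def encode(pos_embeddings, mode="concat"):
    """Combine K PoS-MMEN outputs, each of shape (N, D_k), into one representation."""
    if mode == "concat":
        return np.concatenate(pos_embeddings, axis=1)
    stacked = np.stack(pos_embeddings, axis=0)  # needs the same D_k for every PoS
    if mode == "max":
        return stacked.max(axis=0)
    if mode == "average":
        return stacked.mean(axis=0)
    raise ValueError(f"unknown mode: {mode}")

def joint_loss(agnostic_loss, pos_losses, alphas):
    """Eq. (7): PoS-agnostic loss plus the weighted sum of the PoS-aware losses."""
    return agnostic_loss + sum(a * l for a, l in zip(alphas, pos_losses))

# Outputs of a verb and a noun PoS-MMEN for two videos (toy values).
verb_emb = np.array([[1.0, 0.0], [0.0, 2.0]])
noun_emb = np.array([[0.5, 3.0], [4.0, 1.0]])

v_concat = encode([verb_emb, noun_emb], "concat")   # shape (2, 4)
v_max = encode([verb_emb, noun_emb], "max")         # [[1, 3], [4, 2]]
v_avg = encode([verb_emb, noun_emb], "average")     # [[0.75, 1.5], [2, 1.5]]

total = joint_loss(0.5, [0.2, 0.3], alphas=[1.0, 1.0])
```

Concatenation is the only option that works when the PoS embedding spaces have different dimensionalities, which is why the max and average branches stack the inputs first.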
4. Experiments
We first tackle fine-grained action retrieval on the EPIC
dataset [6] (Sec. 4.1) and then the general video retrieval
task on the MSR-VTT dataset [39] (Sec. 4.2). This allows
us to explore two different tasks using the proposed multi-
modal embeddings.
The large English spaCy parser [1] was used to find
the Part Of Speech (PoS) tags and disentangle them in the
captions of both datasets. Statistics on the most frequent
PoS tags are shown in Table 1. As these statistics show,
EPIC contains mainly nouns and verbs, while MSR-VTT
has longer captions and more nouns. This will have an impact
on the PoS chosen for each dataset when building the
JPoSE model.
4.1. Fine-Grained Action Retrieval on EPIC
Dataset. The EPIC dataset [6] is an egocentric dataset with
32 participants cooking in their own kitchens who then nar-
rated the actions in their native language. The narrations
were translated to English but maintain the open vocabu-
lary selected by the participants. We employ the released
free-form narrations to use this dataset for fine-grained ac-
tion retrieval. We follow the provided train/test splits. Note
that by construction there are two test sets: Seen and Un-
seen, referring to whether the kitchen has been seen in the
training set. We follow the terminology from [6], and note
that this terminology should not be confused with the zero-
shot literature which distinguishes seen/unseen classes. The
actual sequences are strictly disjoint between all sets.
Additionally, we train only on the many-shot examples
from EPIC, excluding all examples of the few-shot classes
from the training set. This ensures each action has more
than 100 relevant videos during training and increases the
number of zero-shot examples in both test sets.
Building relevance sets for retrieval. The EPIC dataset
offers an opportunity for fine-grained action retrieval, as
the open vocabulary has been grouped into semantically
relevant verb and noun classes for the action recognition
challenge. For example, ‘put’, ‘place’ and ‘put-down’ are
grouped into one class. As far as we are aware, this paper
presents the first attempt to use the open vocabulary narra-
tions released to the community.
We determine retrieval relevance scores from these semantically
grouped verb and noun classes², defined in [6].
These indicate which videos and captions should be con-
sidered related to each other. Following these semantic
groups, a query ‘put mug’ and a video with ‘place cup’
in its caption are considered relevant as ‘place’ and ‘put’
share the same verb class and ‘mug’ and ‘cup’ share the
same noun class. Subsequently, we define the triplets
Tv,t, Tt,v, Tv,v, Tt,t used to train the MMEN models and to
compute the loss L in JPoSE.
When training a PoS-MMEN, two videos are considered
relevant only within that PoS. Accordingly, ‘put onion’ and
‘put mug’ are relevant for verb retrieval, whereas, ‘put cup’
and ‘take mug’ are for noun retrieval. The corresponding
PoS-based relevances define the triplets T^k for L^k.
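The relevance rules above can be illustrated with a short Python sketch. The class dictionaries below are a hypothetical excerpt written for this example (the actual groupings are those defined in [6]), and captions are simplified to a single (verb, noun) pair.

```python
# Hypothetical excerpt of the semantic groupings: words in the same class share an id.
VERB_CLASS = {"put": 0, "place": 0, "put-down": 0, "take": 1}
NOUN_CLASS = {"mug": 0, "cup": 0, "onion": 1}

def is_relevant(a, b, mode="action"):
    """a, b: (verb, noun) captions. Relevance is decided on class ids, not raw words."""
    same_verb = VERB_CLASS[a[0]] == VERB_CLASS[b[0]]
    same_noun = NOUN_CLASS[a[1]] == NOUN_CLASS[b[1]]
    if mode == "verb":      # relevance used for the verb PoS-MMEN
        return same_verb
    if mode == "noun":      # relevance used for the noun PoS-MMEN
        return same_noun
    return same_verb and same_noun  # PoS-agnostic action relevance

action_match = is_relevant(("put", "mug"), ("place", "cup"))           # True
verb_match = is_relevant(("put", "onion"), ("put", "mug"), "verb")     # True
noun_match = is_relevant(("put", "cup"), ("take", "mug"), "noun")      # True
action_miss = is_relevant(("put", "cup"), ("take", "mug"))             # False
```

Triplets for each loss are then formed by pairing a query with one item for which the matching rule returns True and one for which it returns False.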
4.1.1 Experimental Details
Video features. We extract flow and appearance features
using the TSN BNInception model [37] pre-trained on
Kinetics and fine-tuned on our training set. TSN averages
the features from 25 uniformly sampled snippets within the
video. We then concatenate appearance and flow features to
create a 2048-dimensional vector (vi) per action segment.
² We use the verb and noun classes purely to establish relevance scores;
the training is done with the original open vocabulary captions.
                         EPIC                    MSR-VTT
Parts of Speech     count   avg/caption     count    avg/caption
Noun               34,546      1.21        418,557      3.33
Verb               30,279      1.06        245,177      1.95
Determiner          6,149      0.22        213,065      1.69
Adposition          5,048      0.18        151,310      1.20
Adjective           2,271      0.08         79,417      0.63
Table 1. Statistics of the 5 most common PoS tags in the training
sets of both datasets: total counts and average counts per caption.
Text features. We map each lemmatised word to its feature
vector using a 100-dimension Word2Vec model, trained on
the Wikipedia corpus. Multiple word vectors with the same
part of speech were aggregated by averaging. We also
experimented with the pre-trained 300-dimension GloVe
model, and found the results to be similar.
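The aggregation of word vectors per PoS can be sketched as below. The 3-dimensional vectors are made up and stand in for the 100-dimension Word2Vec model; returning a zero vector when no word of that PoS is known is an assumption of this sketch, not something stated in the text.

```python
import numpy as np

def pos_text_feature(words, word_vectors, dim):
    """Average the vectors of all words sharing one PoS tag."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return np.zeros(dim)  # assumption: fall back to zeros for an empty PoS
    return np.mean(known, axis=0)

# Toy 3-d vectors (made-up values).
toy_vectors = {"onion": np.array([1.0, 0.0, 2.0]),
               "spoon": np.array([3.0, 2.0, 0.0])}

noun_feature = pos_text_feature(["onion", "spoon"], toy_vectors, dim=3)  # [2, 1, 1]
empty_feature = pos_text_feature([], toy_vectors, dim=3)                 # [0, 0, 0]
```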
Architecture details. We implement f^k and g^k in each
MMEN as a 2-layer perceptron (fully connected layers) with
ReLU. Additionally, the input vectors and output vectors
are L2 normalised. In all cases, we set the dimension of
the embedding space to 256, a dimension we found to be
suitable across all settings. We use a single-layer perceptron
with shared weights for f̂ and ĝ that we initialise with PCA.
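A forward pass through one such embedding function might look like the NumPy sketch below. Only the overall shape is taken from the text (two fully connected layers, ReLU, L2-normalised input and output, 256-dimensional output); the hidden width of 256, the placement of the ReLU between the two layers, and the random initialisation are assumptions made for illustration.

```python
import numpy as np

def l2_normalise(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for two fully connected layers (initialisation is an assumption)."""
    return {"W1": rng.normal(0, 0.01, (in_dim, hidden_dim)), "b1": np.zeros(hidden_dim),
            "W2": rng.normal(0, 0.01, (hidden_dim, out_dim)), "b2": np.zeros(out_dim)}

def embed(x, p):
    """L2-normalise the input, apply FC -> ReLU -> FC, then L2-normalise the output."""
    h = np.maximum(l2_normalise(x) @ p["W1"] + p["b1"], 0.0)
    return l2_normalise(h @ p["W2"] + p["b2"])

rng = np.random.default_rng(0)
params = init_mlp(in_dim=2048, hidden_dim=256, out_dim=256, rng=rng)
videos = rng.normal(size=(4, 2048))   # stand-in for four 2048-d video features
out = embed(videos, params)           # shape (4, 256), unit-norm rows
```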
Training details. The triplet weighting parameters are set
to λv,v = λt,t = 0.1 and λv,t = λt,v = 1.0 and the
loss weightings αk are set to 1. The embedding models
were implemented in Python using the TensorFlow library.
We trained the models with an Adam solver and a learning
rate of 1e−5, considering batch sizes of 256, where for
each query we sample 100 random triplets from the corresponding
Tv,t, Tt,v, Tv,v, Tt,t sets. Training generally
converges after a few thousand iterations; we report all results
after 4000 iterations.
Evaluation metrics. We report mean average preci-
sion (mAP), i.e. for each query we consider the average pre-
cision over all relevant elements and take the mean over all
queries. We consider each element in the test set as a query
in turn. When reporting within-modal retrieval mAP, the
corresponding item (video or caption) is removed from the
test set for that query.
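The mAP metric described above can be computed as in the following NumPy sketch. It assumes a precomputed query-to-item distance matrix and a binary relevance matrix; assigning an AP of 0 to a query with no relevant items is a choice made for this sketch.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query, given 0/1 relevance of the items in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0  # assumption: queries without relevant items score 0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(distances, relevance):
    """distances: (Q, N) query-to-item distances; relevance: (Q, N) 0/1 matrix."""
    aps = []
    for d, r in zip(distances, relevance):
        order = np.argsort(d)  # closest items first
        aps.append(average_precision(np.asarray(r)[order]))
    return float(np.mean(aps))

distances = np.array([[0.1, 0.5, 0.3],
                      [0.9, 0.2, 0.4]])
relevance = np.array([[1, 0, 1],
                      [1, 0, 0]])
score = mean_average_precision(distances, relevance)
# Query 0 ranks both relevant items first (AP = 1); query 1 ranks its only
# relevant item last (AP = 1/3); the mean is 2/3.
```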
4.1.2 Results
First, we consider cross-modal and within-modal fine-
grained action retrieval. Then, we present an ablation study
as well as qualitative results to get more insights. Finally we
show that our approach is well-suited for zero-shot settings.
EPIC                         SEEN            UNSEEN
                          vt      tv       vt      tv
Random Baseline           0.6     0.6      0.9     0.9
CCA Baseline             20.6     7.3     14.3     3.7
MMEN (Verb)               3.6     4.0      3.9     4.2
MMEN (Noun)               9.9     9.2      7.9     6.1
MMEN (Caption)           14.0    11.2     10.1     7.7
MMEN ([Verb, Noun])      18.7    13.6     13.3     9.5
JPoSE (Verb, Noun)       23.2    15.8     14.6    10.2
Table 2. Cross-modal action retrieval on EPIC.
EPIC                         SEEN            UNSEEN
                          vv      tt       vv      tt
Random Baseline           0.6     0.6      0.9     0.9
CCA Baseline             13.8    62.2     18.9    68.5
Features (Word2Vec)        –     62.5       –     71.3
Features (Video)         13.6      –      21.0      –
MMEN (Verb)              15.2    11.7     20.1    15.8
MMEN (Noun)              16.8    30.1     21.2    34.1
MMEN (Caption)           17.2    63.8     20.7    69.6
MMEN ([Verb, Noun])      17.6    83.5     22.5    84.7
JPoSE (Verb, Noun)       18.8    87.7     23.2    87.7
Table 3. Within-modal action retrieval on EPIC.
Compared approaches. Across a wide range of experiments,
we compare the proposed JPoSE (Sec. 3.3) with
some simpler variants based on MMEN (Sec. 3.1).
For the captions, we use 1) all the words together without
distinction, denoted as ‘Caption’, 2) only one PoS such as
‘Verb’ or ‘Noun’, 3) the concatenation of their respective
representations, denoted as ‘[Verb, Noun]’.
These models are also compared to standard baselines.
The Random Baseline randomly ranks all the database
items, providing a lower bound on the mAP scores. The
CCA-baseline applies Canonical Correlation Analysis to
both modalities vi and ti to find a joint embedding space for
cross-modal retrieval [9]. Finally, Features (Word2Vec)
and Features (Video), which are only defined for within-
modal retrieval (i.e. vv and tt), show the performance when
we directly use the video representation vi or the averaged
Word2Vec caption representation ti.
Cross-modal retrieval. Table 2 presents cross-modal re-
sults for fine-grained action retrieval. The main observation
is that the proposed JPoSE outperforms all the MMEN vari-
ants and the baselines for both video-to-text (vt) and text-
to-video retrieval (tv), on both test sets. We also note that
MMEN ([Verb, Noun]) outperforms other MMEN variants,
showing the benefit of learning specialised embeddings. Yet
the full JPoSE is crucial to get the best results.
Within-modal retrieval. Table 3 shows the within-modal
retrieval results for both text-to-text (tt) and
video-to-video (vv) retrieval. Again, JPoSE outperforms all the
flavours of MMEN on both test sets. This shows that by
learning a cross-modal embedding we inject information
from the other modality that helps to better disambiguate
and hence to improve the search.
EPIC SEEN
Learn    L̂    (ev, et)    (f̂, ĝ)        vv      vt      tv      tt
indep    ×     Sum         ×             17.4    20.7    13.3    86.5
indep    ×     Max         ×             17.5    21.2    13.3    86.5
indep    ×     Conc.       ×             18.3    21.5    14.6    87.1
joint    ✓     Sum         (Id, Id)      18.1    21.0    14.3    87.3
joint    ✓     Max         (Id, Id)      18.1    22.4    14.8    87.5
joint    ✓     Conc.       (Id, Id)      18.3    22.7    15.4    87.6
joint    ✓     Conc.       (θ̂f, θ̂g)      18.8    23.2    15.8    87.7
Table 4. Ablation study for JPoSE showing the effects of different
encodings, training PoS-MMENs independently or jointly, with f̂
and ĝ being the identity function Id or being learnt.
Ablation study. We evaluate the role of the components
of the proposed JPoSE model, for both cross-modal and
within-modal retrieval. Table 4 reports results comparing
different options for the encoding functions ev and et, in
addition to learning the model jointly both with and without
learned functions f̂ and ĝ. Joint training with concatenation
and learnt f̂ and ĝ gives the best result on all four
retrieval tasks. In the supplementary material,
we also compare the performance when using the closed vo-
cabulary classes from EPIC to learn the embedding. Results
demonstrate the benefit of utilising the open vocabulary at
training time.
Zero-shot experiments. The use of the open vocabulary
in EPIC lends itself well to zero-shot settings. These are
the cases for which the open vocabulary verb or noun in
the test set is not present in the training set. Accordingly,
all previous results can be seen as a Generalised Zero-Shot
Learning (GZSL) [4] set-up: there exist both actions in
the test sets that have been seen in training and actions that
have not. Table 5 shows the zero-shot (ZS) counts in both
test sets. In total 12% of the videos in both test sets are
zero-shot instances. We separate cases where the noun is
present in the training set but the verb is not, denoted by
ZSV (zero-shot verb), from ZSN (zero-shot noun) where the
verb is present but not the noun. Cross-modal ZS retrieval
results for this interesting setting are shown in Table 6. We
compare JPoSE to MMEN (Caption) and baselines.
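The ZSV and ZSN definitions above can be expressed as a small Python helper. The vocabularies in the example are made up, and captions are simplified to a single (verb, noun) pair.

```python
def zero_shot_type(verb, noun, train_verbs, train_nouns):
    """Label a test caption as 'ZSV', 'ZSN' or None, following the definitions in the text."""
    verb_seen = verb in train_verbs
    noun_seen = noun in train_nouns
    if noun_seen and not verb_seen:
        return "ZSV"   # zero-shot verb: noun seen in training, verb not
    if verb_seen and not noun_seen:
        return "ZSN"   # zero-shot noun: verb seen in training, noun not
    return None        # both seen, or both unseen (not covered by ZSV/ZSN)

# Made-up training vocabularies.
train_verbs = {"cut", "put"}
train_nouns = {"onion", "mug"}

labels = [zero_shot_type(v, n, train_verbs, train_nouns)
          for v, n in [("dice", "onion"), ("cut", "leek"), ("cut", "onion")]]
# ['ZSV', 'ZSN', None]
```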
Results show that the proposed JPoSE model clearly improves
over MMEN (Caption) and the baselines in these
zero-shot settings, thanks to the different views captured by
the multiple PoS embeddings, specialised to acts and objects.
Qualitative results. Fig. 3 illustrates both video-to-text and
text-to-video retrieval. For several queries, it shows the rel-
evance of the top-50 retrieved items (relevant in green, non-
relevant in grey).
Fig. 4 illustrates our motivation that disentangling PoS
embeddings would learn different visual functions. It
presents maximum activation examples on chosen neurons
within fi for both verb and noun embeddings. Each cluster
represents the 9 videos that respond maximally to one of
these neurons (video in supplementary). We observe that
noun activations indeed correspond to objects of shared
appearance occurring in different actions (in the figure,
chopping boards in one and cutlery in the second), while verb
embedding neuron activations reflect the same action
applied to different objects (open/close vs. put/take).
Figure 3. Qualitative results for video-to-text (top) and text-to-video (bottom) action retrieval on EPIC. For several query videos (top) or
query captions (bottom), we show the quality of the top 50 captions (resp. videos) retrieved with green/grey representing relevant/irrelevant
retrievals. The number in front of the colour-coded bar shows the rank of the first relevant retrieval (lower rank is better).
Figure 4. Maximum activation examples for visual embedding in the noun (left) and the verb (right) PoS-MMEN. Examples of similar
objects over different actions are shown in the noun embedding (left) [chopping board vs cutlery] while the same action is shown over
different objects in the verb embedding (right) [open/close vs put/take].
EPIC                    All                    ZSV              ZSN
               Videos   Verbs   Nouns    Videos   Verbs   Videos   Nouns
Train          26,710    192    1,005      –        –       –        –
Seen Test       8,047    232      530     452      119     367       80
Unseen Test     2,929    136      243     257       63     275      127
Table 5. Number of videos, and number of open-vocabulary verbs
and nouns in the captions, for the three splits of EPIC. For both test
sets we also report zero-shot (ZS) instances, showing the number
of verbs and nouns that were not seen in the training set, as well as
the corresponding number of videos.
EPIC                    ZSV              ZSN
                     vt      tv       vt      tv
Random Baseline     1.57    1.57     1.64    1.64
CCA Baseline        2.92    2.96     4.36    3.25
MMEN (Caption)      5.77    5.51     4.17    3.32
JPoSE               7.50    6.47     7.68    6.17
Table 6. Zero shot experiments on EPIC.
4.2. General Video Retrieval on MSR-VTT
Dataset. We select MSR-VTT [39] as a public dataset
for general video retrieval. Originally used for video cap-
tioning, this large-scale video understanding dataset is in-
creasingly evaluated for video-to-text and text-to-video re-
trieval [8, 22, 24, 41, 23]. We follow the code and setup
of [22] using the same train/test split that includes 7,656
training videos each with 20 different captions describing
the scene and 1,000 test videos with one caption per video.
We also follow the evaluation protocol in [22] and compute
recall@K (R@K) and median rank (MR).
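The two metrics can be sketched in NumPy as follows, assuming a square query-to-item distance matrix in which the correct match for query i sits in column i (consistent with one caption per test video). This is an illustrative implementation, not the evaluation code of [22].

```python
import numpy as np

def retrieval_ranks(distances):
    """Return the 1-based rank of the correct item (column i) for each query i."""
    order = np.argsort(distances, axis=1)  # closest items first
    return np.array([int(np.where(order[i] == i)[0][0]) + 1
                     for i in range(len(order))])

def recall_at_k(ranks, k):
    """Percentage of queries whose correct item appears in the top k."""
    return 100.0 * float(np.mean(ranks <= k))

# Toy 3x3 distance matrix (made-up values).
distances = np.array([[0.1, 0.9, 0.8],
                      [0.2, 0.3, 0.7],
                      [0.6, 0.5, 0.4]])

ranks = retrieval_ranks(distances)         # [1, 2, 1]
r_at_1 = recall_at_k(ranks, 1)             # 2 of 3 queries, about 66.7
median_rank = float(np.median(ranks))      # 1.0
```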
In contrast to the EPIC dataset, there are no semantic
groupings of the captions in MSR-VTT. Each caption is
considered relevant only for a single video, and two cap-
tions describing different videos are considered irrelevant
even if they share semantic similarities. Furthermore, dis-
entangling captions yields further semantic similarities. For
example, “A cooking tutorial” and “A person is cooking”,
for a verb-MMEN, will be considered irrelevant as they be-
long to different videos even though they share the same
single verb ‘cook’.
                                       Video-to-text                   Text-to-Video
MSR-VTT Retrieval               R@1    R@5    R@10    MR        R@1    R@5    R@10    MR
Mixture of Experts [22]*         –      –      –      –        12.9   36.4   51.8    10.0
Random Baseline                 0.3    0.7    1.1   502.0       0.3    0.7    1.1   502.0
CCA Baseline                    2.8    5.6    8.2   283.0       7.0   14.4   18.7   100.0
MMEN (Verb)                     0.7    4.0    8.3    70.0       2.9    7.9   13.9    63.0
MMEN (Caption\Noun)             5.7   18.7   28.2    31.1       5.3   17.0   26.1    33.3
MMEN (Noun)                    10.8   31.3   42.7    14.0      10.8   30.7   44.5    13.0
MMEN ([Verb, Noun])            15.6   39.4   55.1     9.0      13.6   36.8   51.7    10.0
MMEN (Caption)                 15.8   40.2   53.6     9.0      13.8   36.7   50.7    10.3
JPoSE (Caption\Noun, Noun)     16.4   41.3   54.4     8.7      14.3   38.1   53.0     9.0
Table 7. MSR-VTT Video-Caption Retrieval results. *We include results from [22], only available for Text-to-Video retrieval.
Figure 5. Qualitative results of text-to-video action retrieval on MSR-VTT. A ← B shows the rank B of the retrieved video when using the
full caption, MMEN (Caption), and the rank A when disentangling the caption, JPoSE (Caption\Noun, Noun).
Consequently, we cannot directly apply JPoSE as proposed
in Sec. 3.3. Instead, we adapt JPoSE to this problem
as follows. We use the Mixture-of-Expert Embeddings
(MEE) model from [22] as our core MMEN network. In
fact, MEE is a form of multi-modal embedding network in
that it embeds videos and captions into the same space. We
instead focus on assessing whether disentangling PoS and
learning multiple PoS-aware embeddings produce better
results. In this adapted JPoSE we encode the output of the
disentangled PoS-MMENs with ev and et (i.e. concatenated)
and use NetVLAD [3] to aggregate Word2Vec representations.
Instead of the combined loss in Equation (7), we use
the pair loss, used also in [22]:
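$$
L(\theta) = \frac{1}{B} \sum_{i}^{B} \sum_{j \neq i} \Big[ \max\big(\gamma + d(f_{v_i}, g_{t_i}) - d(f_{v_i}, g_{t_j}),\ 0\big) + \max\big(\gamma + d(f_{v_i}, g_{t_i}) - d(f_{v_j}, g_{t_i}),\ 0\big) \Big] \qquad (8)
$$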
This same loss is used when we train different MMENs.
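The pair loss in Equation (8) can be sketched in plain Python as follows. This is an illustrative sketch rather than the training code: `pair_loss`, the distance-matrix layout and the toy values are assumptions, with `dist[i][j]` standing for d(f_{v_i}, g_{t_j}) and `margin` for γ.

```python
def pair_loss(dist, margin):
    """Bidirectional max-margin ranking loss over one mini-batch.

    dist[i][j] is the distance between embedded video i and embedded
    caption j, where (video i, caption i) is the matching pair.
    """
    batch = len(dist)
    total = 0.0
    for i in range(batch):
        positive = dist[i][i]  # distance of the matching pair
        for j in range(batch):
            if j == i:
                continue
            # Video i as anchor, caption j as the negative.
            total += max(margin + positive - dist[i][j], 0.0)
            # Caption i as anchor, video j as the negative.
            total += max(margin + positive - dist[j][i], 0.0)
    return total / batch


# Hypothetical 2x2 batch: pair 0 is too close to video 1 / caption 0.
toy_dist = [
    [0.25, 1.0],
    [0.5, 0.25],
]
loss = pair_loss(toy_dist, margin=0.5)

# A batch whose negatives are all farther than positive + margin costs nothing.
separated = pair_loss([[0.0, 1.0], [1.0, 0.0]], margin=0.5)
print(loss, separated)
```

Only negatives that fall within the margin of the matching pair contribute, so a well-separated batch yields zero loss.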
Visual and text features. We use the pre-extracted appearance, flow, audio
and facial visual features provided by [22]. For the captions, we extract
the encodings ourselves³ using the same Word2Vec model as for EPIC.
³ Note that this explains the difference between the results reported
in [22] (shown in the first row of Table 7) and MMEN (Caption).
Results. We report video-to-text and text-to-video retrieval results on
MSR-VTT in Table 7 for the standard baselines and several MMEN variants.
Comparing MMENs, we note that nouns are much more informative than verbs
for this retrieval task. MMEN results with other PoS tags (shown in
the supplementary) are even lower, indicating that they are
not informative alone. Building on these findings, we report
results of a JPoSE combining two MMENs, one for nouns
and one for the remainder of the caption (Caption\Noun).
Our adapted JPoSE model consistently outperforms the full-caption
single embedding for both video-to-text and text-to-video retrieval.
We report other PoS disentanglement results in the supplementary material.
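To make the Noun and Caption\Noun inputs concrete, the sketch below splits a PoS-tagged caption into its nouns and the remaining words. The tags are hand-assigned for illustration (in practice a parser would supply them), and `split_caption` is a hypothetical helper, not part of any released code.

```python
def split_caption(tagged_tokens, pos="NOUN"):
    """Split a PoS-tagged caption into the words tagged `pos` and the rest.

    tagged_tokens is a list of (word, tag) pairs in caption order.
    """
    selected = [word for word, tag in tagged_tokens if tag == pos]
    remainder = [word for word, tag in tagged_tokens if tag != pos]
    return selected, remainder


# Hand-assigned tags (an assumption for this example).
tagged = [("a", "DET"), ("person", "NOUN"), ("is", "AUX"), ("cooking", "VERB")]

nouns, caption_without_nouns = split_caption(tagged)
print(nouns)                  # words fed to the noun MMEN
print(caption_without_nouns)  # words fed to the Caption\Noun MMEN
```

Each of the two word lists would then be encoded separately before the resulting embeddings are combined.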
Qualitative results. Figure 5 shows qualitative results comparing the
full-caption MMEN with JPoSE. The disentangled model often ranks videos
closer to their corresponding captions.
5. Conclusion
We have proposed a method for fine-grained action retrieval. By learning
distinct embeddings for each PoS, our model is able to combine these in a
principled manner and to create a space suitable for action retrieval,
outperforming approaches which learn such a space through captions
alone. We tested our method on a fine-grained action retrieval dataset,
EPIC, using the open vocabulary labels. Our results demonstrate the
ability of the method to generalise to zero-shot cases. Additionally, we
show the applicability of the notion of disentangling the caption for the
general video-retrieval task on MSR-VTT.
Acknowledgement Research supported by EPSRC LO-
CATE (EP/N033779/1) and EPSRC Doctoral Training Part-
nerships (DTP). We use publicly available datasets. Part
of this work was carried out during Michael’s internship at
Naver Labs Europe.
References
[1] English spaCy parser https://spacy.io/. 4
[2] Jon Almazan, Bojana Gajic, Naila Murray, and Diane Lar-
lus. Re-id done right: towards good practices for person re-
identification. CoRR, abs/1801.05339, 2018. 2
[3] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pa-
jdla, and Josef Sivic. NetVLAD: CNN architecture for
weakly supervised place recognition. In CVPR, 2016. 2,
8
[4] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei
Sha. An empirical study and analysis of generalized zero-
shot learning for object recognition in the wild. In ECCV,
2016. 6
[5] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi
Huang. Beyond triplet loss: a deep quadruplet network for
person re-identification. In CVPR, 2017. 2
[6] Dima Damen, Hazel Doughty, Giovanni Maria Farinella,
Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide
Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and
Michael Wray. Scaling egocentric vision: The EPIC-KITCHENS
dataset. In ECCV, 2018. 2, 4, 5
[7] Jianfeng Dong, Xirong Li, and Cees GM Snoek. Predict-
ing visual features from text for image and video caption re-
trieval. IEEE Trans. Multimed., 2018. 1
[8] Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, and Xun
Wang. Dual dense encoding for zero-example video re-
trieval. CoRR, arXiv:1809.06181, 2018. 1, 2, 7
[9] Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hocken-
maier, and Svetlana Lazebnik. Improving image-sentence
embeddings using large weakly annotated photo collections.
In ECCV, 2014. 2, 6
[10] Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Lar-
lus. Deep image retrieval: Learning global representations
for image search. In ECCV, 2016. 2
[11] Albert Gordo and Diane Larlus. Beyond instance-level im-
age retrieval: Leveraging captions to learn a global visual
representation for semantic retrieval. In CVPR, 2017. 2
[12] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal-
ski, Joanna Materzynska, Susanne Westphal, Heuna Kim,
Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz
Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo
Bax, and Roland Memisevic. The “Something Something”
Video Database for Learning and Evaluating Visual Com-
mon Sense. In ICCV, 2017. 2
[13] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Car-
oline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan,
George Toderici, Susanna Ricco, Rahul Sukthankar, et al.
Ava: A video dataset of spatio-temporally localized atomic
visual actions. In CVPR, 2018. 2
[14] Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkar-
nenkar, Subhashini Venugopalan, Raymond Mooney, Trevor
Darrell, and Kate Saenko. Youtube2text: Recognizing and
describing arbitrary activities using semantic hierarchies and
zero-shot recognition. In ICCV, 2013. 2
[15] Meera Hahn, Andrew Silva, and James M. Rehg. Ac-
tion2vec: A crossmodal embedding approach to action learn-
ing. In BMVC, 2018. 1, 2
[16] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In de-
fense of the triplet loss for person re-identification. CoRR,
arXiv:1703.07737, 2017. 2
[17] Elad Hoffer and Nir Ailon. Deep metric learning using triplet
network. In ICLR, 2015. 2
[18] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel.
Unifying visual-semantic embeddings with multimodal neu-
ral language models. TACL, 2015. 2
[19] Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik.
Learning two-branch neural networks for image-text match-
ing tasks. TPAMI, 2019. 3
[20] Farzaneh Mahdisoltani, Guillaume Berger, Waseem Ghar-
bieh, David Fleet, and Roland Memisevic. On the effec-
tiveness of task granularity for transfer learning. CoRR,
arXiv:1804.09235, 2018. 1, 2
[21] Antoine Miech, Ivan Laptev, and Josef Sivic. Learnable
pooling with context gating for video classification. CoRR,
arXiv:1706.06905, 2017. 2
[22] Antoine Miech, Ivan Laptev, and Josef Sivic. Learning a
text-video embedding from incomplete and heterogeneous
data. CoRR, arXiv:1804.02516, 2018. 1, 2, 7, 8
[23] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac,
Makarand Tapaswi, Ivan Laptev, and Josef Sivic.
HowTo100M: Learning a Text-Video Embedding by
Watching Hundred Million Narrated Video Clips. In ICCV,
2019. 2, 7
[24] Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze,
and Amit K Roy-Chowdhury. Learning joint embedding
with multimodal cues for cross-modal video-text retrieval.
In ICMR, 2018. 1, 2, 7
[25] Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkila,
and Naokazu Yokoya. Learning joint representations of
videos and sentences with web image search. In ECCV,
2016. 2
[26] Bryan A Plummer, Matthew Brown, and Svetlana Lazebnik.
Enhancing video summarization via vision-language embed-
ding. In CVPR, 2017. 1
[27] Filip Radenovic, Giorgos Tolias, and Ondrej Chum. CNN
image retrieval learns from BoW: Unsupervised fine-tuning
with hard examples. In ECCV, 2016. 2
[28] Marcus Rohrbach, Anna Rohrbach, Michaela Regneri,
Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, and
Bernt Schiele. Recognizing fine-grained and composite ac-
tivities using hand-centric features and script data. IJCV,
2016. 2
[29] Florian Schroff, Dmitry Kalenichenko, and James Philbin.
Facenet: A unified embedding for face recognition and clus-
tering. In CVPR, 2015. 2
[30] Jie Shao, Kai Hu, Yixin Bao, Yining Lin, and Xiangyang
Xue. High order neural networks for video classification.
CoRR, arXiv:1811.07519, 2018. 1
[31] Gunnar A. Sigurdsson, Gul Varol, Xiaolong Wang, Ali
Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in
homes: Crowdsourcing data collection for activity under-
standing. In ECCV, 2016. 2
[32] Kihyuk Sohn. Improved deep metric learning with multi-
class n-pair loss objective. In NIPS, 2016. 2
[33] Dídac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres,
and Xavier Giró-i-Nieto. Cross modal embeddings for video
and audio retrieval. In ECCVW, 2018. 2
[34] Atousa Torabi, Niket Tandon, and Leonid Sigal. Learning
language-visual embedding for movie understanding with
natural-language. CoRR, arXiv:1609.08124, 2016. 2
[35] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg,
Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learn-
ing fine-grained image similarity with deep ranking. In
CVPR, 2014. 2
[36] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning
deep structure-preserving image-text embeddings. In CVPR,
2016. 1, 2, 3
[37] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua
Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment
networks: Towards good practices for deep action recogni-
tion. In ECCV, 2016. 5
[38] Michael Wray and Dima Damen. Learning visual actions
using multiple verb-only labels. In BMVC, 2019. 2
[39] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large
video description dataset for bridging video and language. In
CVPR, 2016. 2, 4, 7
[40] Ran Xu, Caiming Xiong, Wei Chen, and Jason J Corso.
Jointly modeling deep video and compositional text to bridge
vision and language in a unified framework. In AAAI, 2015.
2
[41] Youngjae Yu, Jongseok Kim, and Gunhee Kim. A joint se-
quence fusion model for video question answering and re-
trieval. In ECCV, 2018. 7
[42] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning
via semantic similarity embedding. In ICCV, 2015. 2