
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2569–2588, July 5–10, 2020. © 2020 Association for Computational Linguistics


Learning to Segment Actions from Observation and Narration

Daniel Fried‡  Jean-Baptiste Alayrac†  Phil Blunsom†  Chris Dyer†  Stephen Clark†  Aida Nematzadeh†

†DeepMind, London, UK   ‡Computer Science Division, UC Berkeley

[email protected]   {jalayrac,pblunsom,cdyer,clarkstephen,nematzadeh}@google.com

Abstract

We apply a generative segmental model of task structure, guided by narration, to action segmentation in video. We focus on unsupervised and weakly-supervised settings where no action labels are known during training. Despite its simplicity, our model performs competitively with previous work on a dataset of naturalistic instructional videos. Our model allows us to vary the sources of supervision used in training, and we find that both task structure and narrative language provide large benefits in segmentation quality.

1 Learning to Segment Actions

Finding boundaries in a continuous stream is a crucial process for human cognition (Martin and Tversky, 2003; Zacks and Swallow, 2007; Levine et al., 2019; Ünal et al., 2019). To understand and remember what happens in the world around us, we need to recognize the action boundaries as they unfold and also distinguish the important actions from the insignificant ones. This process, referred to as temporal action segmentation, is also an important first step in systems that ground natural language in videos (Hendricks et al., 2017). These systems must identify which frames in a video depict actions – which amounts to distinguishing these frames from background ones – and identify which actions (e.g., boiling potatoes) each frame depicts. Despite recent advances (Miech et al., 2019; Sun et al., 2019), unsupervised action segmentation in videos remains a challenge.

The recent availability of large datasets of naturalistic instructional videos provides an opportunity for modeling of action segmentation in a rich task context (Yu et al., 2014; Zhou et al., 2018; Zhukov et al., 2019; Miech et al., 2019; Tang et al., 2019); in these videos, a person teaches a specific high-level task (e.g., making croquettes) while describing the lower-level steps involved in that task (e.g., boiling potatoes). However, the real-world nature of these datasets introduces many challenges. For example, more than 70% of the frames in one of the YouTube instructional video datasets, CrossTask (Zhukov et al., 2019), consist of background regions (e.g., the video presenter is thanking their viewers), which do not correspond to any of the steps for the video's task.

These datasets are interesting because they provide (1) narrative language that roughly corresponds to the activities demonstrated in the videos and (2) structured task scripts that define a strong signal of the order in which steps in a task are typically performed. As a result, these datasets provide an opportunity to study the extent to which task structure and language can guide action segmentation. Interestingly, young children can segment actions without any explicit supervision (Baldwin et al., 2001; Sharon and Wynn, 1998), by tapping into similar cues – action regularities and language descriptions (Levine et al., 2019).

While previous work mostly focuses on building action segmentation models that perform well on a few metrics (Richard et al., 2018; Zhukov et al., 2019), we aim to provide insight into how various modeling choices impact action segmentation. How much do unsupervised models improve when given implicit supervision from task structure and language, and which types of supervision help most? Are discriminative or generative models better suited for the task? Does explicit structure modeling improve the quality of segmentation? To answer these questions, we compare two existing models with a generative hidden semi-Markov model, varying the degree of supervision.

On a challenging and naturalistic dataset of instructional videos (Zhukov et al., 2019), we find that our model and models from past work both benefit substantially from the weak supervision provided by task structure and narrative language, even on top of rich features from state-of-the-art pretrained action and object classifiers. Our analysis also shows that: (1) Generative models tend to do better than discriminative models of the same or similar model class at learning the full range of step types, which benefits action segmentation; (2) Task structure affords strong, feature-agnostic baselines that are difficult for existing systems to surpass; (3) Reporting multiple metrics is necessary to understand each model's effectiveness for action segmentation; we can devise feature-agnostic baselines that perform well on single metrics despite producing low-quality action segments.

* Work begun while DF was interning at DeepMind. Code is available at https://github.com/dpfried/action-segmentation.

2 Related Work

Typical methods (Rohrbach et al., 2012; Singh et al., 2016; Xu et al., 2017; Zhao et al., 2017; Lea et al., 2017; Yeung et al., 2018; Farha and Gall, 2019) for temporal action segmentation consist of assigning action classes to intervals of videos and rely on manually-annotated supervision. Such annotation is difficult to obtain at scale. As a result, recent work has focused on training such models with less supervision: one line of work assumes that only the order of actions happening in the video is given and uses this weak supervision to perform action segmentation (Bojanowski et al., 2014; Huang et al., 2016; Kuehne et al., 2017; Richard et al., 2017; Ding and Xu, 2018; Chang et al., 2019). Other approaches weaken this supervision and use only the set of actions that occur in each video (Richard et al., 2018), or are fully unsupervised (Sener and Yao, 2018; Kukleva et al., 2019).

Instructional videos have gained interest over the past few years (Yu et al., 2014; Sener et al., 2015; Malmaud et al., 2015; Alayrac et al., 2016; Zhukov et al., 2019) since they enable weakly-supervised modeling: previous work most similar to ours consists of models that localize actions in narrated videos with minimal supervision (Alayrac et al., 2016; Sener et al., 2015; Elhamifar and Naing, 2019; Zhukov et al., 2019).

We present a generative model of action segmentation that incorporates duration modeling, narration and ordering constraints, and can be trained in all of the above supervision conditions by maximizing the likelihood of the data; while these past works have had these individual components, they have not yet all been combined.

3 The CrossTask Dataset

We use the recent CrossTask dataset (Zhukov et al., 2019) of instructional videos. To our knowledge, CrossTask is the only available dataset that has tasks from more than one domain, includes background regions, provides step annotations and naturalistic language. Other datasets lack one of these; e.g., they focus on one domain (Kuehne et al., 2014) or do not have natural language (Tang et al., 2019) or step annotations (Miech et al., 2019). An example instance from the dataset is shown in Figure 1, and we describe each aspect below.

Tasks  Each video comes from a task, e.g., make a latte, with tasks taken from the titles of selected WikiHow articles, and videos curated from YouTube search results for the task name. We focus on the primary section of the dataset, containing 2,700 videos from 18 different tasks.

Steps and canonical order  Each task has a set of steps: lower-level action step types, e.g., steam milk and pour milk, which are typically completed when performing the task. Step names consist of a few words, typically naming an action and an object it is applied to. The dataset also provides a canonical step order for each task: an ordering, like a script (Schank and Abelson, 1977; Chambers and Jurafsky, 2008), in which a task's steps are typically performed. For each task, the set of step types and their canonical order were hand-constructed by the dataset creators based on section headers in the task's WikiHow article.

Annotations  Each video in the primary section of the dataset is annotated with labeled temporal segments identifying where steps occur. (In the weak supervision setting, these step segment labels are used only in evaluation, and never in training.) A given step for a task can occur multiple times, or not at all, in any of the task's videos. Steps in a video also need not occur in the task's canonical ordering (although in practice our results show that this ordering is a helpful inductive bias for learning). Most of the frames in videos (72% over the entire corpus) are background – not contained in any step segment.

Narration  Videos also have narration text (transcribed by YouTube's automatic speech recognition system) which typically consists of a mix of the task demonstrator describing their actions and talking about unrelated topics. Although narration is temporally aligned with the video, and steps (e.g., pour milk) are sometimes mentioned, these mentions often do not occur at the same time as the step they describe (e.g., "let the milk cool before pouring it"). Zhukov et al. (2019) guide weakly-supervised training using the narration by defining a set of narration constraints for each video, which identify where in the video steps are likely to occur, using similarity between the step names and temporally-aligned narration (see Sec. 6.1).

Figure 1: An example video instance from the CrossTask dataset (Sec. 3). The video depicts a task, make pancakes, and is annotated with region segments, which can be either action steps (e.g., pour mixture into pan) or background regions. Videos are also temporally aligned with transcribed narration. We learn to segment the video into these regions and label them with the action steps (or background), without access to region annotations during training.

4 Model

Our generative model of the video features and labeled task segments is a first-order semi-Markov model. We use a semi-Markov model for the action segmentation task because it explicitly models temporal regions of the video, their duration, their probable ordering, and their features.1 It can be trained in an unsupervised way, without labeled regions, to maximize the likelihood of the features.

Timesteps  Our atomic unit is a one-second region of the video, which we refer to as a timestep. A video with T timesteps has feature vectors x_{1:T}. The features x_t at timestep t are derived from the video, its narration, or both, and in our work (and past work on the dataset) are produced by pre-trained neural models which summarize some non-local information in the region containing each timestep, which we describe in Sec. 6.3.

Regions  Our model segments a video with T timesteps into a sequence of regions, each of which consists of a consecutive number of timesteps (the region's duration). The number of regions K in a video and the duration d_k of each region can vary; the only constraint is that the sum of the durations equals the video length: \sum_{k=1}^{K} d_k = T. Each region has a label r_k, which is either one of the task's step labels (e.g., pour milk) or a special label BKG indicating the region is background. In our most general, unconstrained model, a given task step can occur multiple times (or not at all) as a region label in any video for the task, allowing step repetitions, dropping, and reordering.

1 Semi-Markov models are also shown to be successful in the similar domain of speech recognition (e.g., Pylkkonen and Kurimo, 2004).

Structure  We define a first-order Markov (bigram) model over these region labels:

    P(r_{1:K}) = P(r_1) \prod_{k=2}^{K} P(r_k | r_{k-1})    (1)

with tabular conditional probabilities. While region labels are part of the dataset, they are primarily used for evaluation: we seek models that can be trained in the unsupervised and weakly-supervised conditions where labels are unavailable. This model structure, while simple, affords a dynamic program allowing efficient enumeration over both all possible segmentations of the video into regions and assignments of labels to the regions, allowing unsupervised training (Sec. 4.1).
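For concreteness, the tabular bigram in Eq. (1) can be stored as a matrix of log-probabilities and used to score a candidate sequence of region labels. The sketch below is our own illustration (not the released code); the integer label ids, toy probabilities, and function name are assumptions made for the example.

```python
import numpy as np

def score_region_labels(labels, log_init, log_trans):
    """Log-probability of a region-label sequence r_1..r_K under Eq. (1).

    labels:    list of integer label ids (steps and BKG) for the K regions
    log_init:  array of shape (L,), log P(r_1)
    log_trans: array of shape (L, L), log P(r_k | r_{k-1})
    """
    score = log_init[labels[0]]
    for prev, cur in zip(labels[:-1], labels[1:]):
        score += log_trans[prev, cur]
    return score

# toy example: 3 labels (0 = BKG, 1 = "pour milk", 2 = "steam milk")
log_init = np.log(np.array([0.6, 0.3, 0.1]))
log_trans = np.log(np.array([[0.4, 0.4, 0.2],
                             [0.5, 0.1, 0.4],
                             [0.6, 0.2, 0.2]]))
print(score_region_labels([0, 1, 0, 2], log_init, log_trans))
```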

Duration  Our model, following past work (Richard et al., 2018), parameterizes region durations using Poisson distributions, where each label type r has its own mean duration λ_r: d_k ∼ Poisson(λ_{r_k}). These durations are constrained so that they partition the video: e.g., region r_2 begins at timestep d_1 (after region r_1), and the final region r_K ends at the final timestep T.

Timestep labels  The region labels r_{1:K} (step, or background) and region durations d_{1:K} together give a sequence of timestep labels l_{1:T} for all timesteps, where a timestep's label is equal to the label for the region it is contained in.


Feature distribution  Our model's feature distribution p(x_t | l_t) is a class-conditioned multivariate Gaussian distribution: x_t ∼ Normal(μ_{l_t}, Σ), where l_t is the step label at timestep t. (We note that the assignment of labels to steps is latent and unobserved during unsupervised and weakly-supervised training.) We use a separate learned mean μ_l for each label type l, both steps and background. Labels are atomic and task-specific, e.g., the step type pour milk when it occurs in the task make a latte does not share parameters with the step add milk when it occurs in the task make pancakes.2 We use a diagonal covariance matrix Σ which is fixed to the empirical covariance of each feature dimension.3
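A minimal sketch of this class-conditional Gaussian emission with a shared diagonal covariance follows; it is our own illustration, and the array layout and names are assumptions.

```python
import numpy as np

def emission_log_probs(x, means, diag_var):
    """Per-label Gaussian log-densities for one timestep's features.

    x:        feature vector of shape (D,)
    means:    per-label means, shape (L, D); learned
    diag_var: shared diagonal covariance, shape (D,); fixed to the
              empirical variance of each feature dimension
    Returns an array of shape (L,) containing log N(x; mu_l, diag(diag_var)).
    """
    diff = x[None, :] - means                              # (L, D)
    log_det = np.sum(np.log(2.0 * np.pi * diag_var))       # normalization term
    mahalanobis = np.sum(diff ** 2 / diag_var[None, :], axis=1)
    return -0.5 * (log_det + mahalanobis)
```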

4.1 Training

In the unsupervised setting, labels l are unavailable at training (used only in evaluation). We describe training in this setting, as well as two supervised training methods which we use to analyze properties of the dataset and compare model classes.

Unsupervised  We train the generative model as a hidden semi-Markov model (HSMM). We optimize the model's parameters to maximize the log marginal likelihood of the features for all video instance features x^{(i)} in the training set:

    ML = \sum_{i=1}^{N} \log P(x^{(i)}_{1:T_i})    (2)

Applying the semi-Markov forward algorithm (Murphy, 2002; Yu, 2010) allows us to marginalize over all possible sequences of step labels to compute the log marginal likelihood for each video as a function of the model parameters, which we optimize directly using backpropagation and mini-batched gradient descent with the Adam (Kingma and Ba, 2015) optimizer.4 See Appendix A for optimization details.
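For concreteness, the sketch below shows a semi-Markov forward recursion that computes the log marginal likelihood in Eq. (2) for one video, with durations truncated at a maximum length. It is our own numpy illustration (the released code would implement this inside an autodiff framework so that Eq. (2) can be optimized with Adam); the function name, argument layout, and truncation are assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson

def hsmm_log_marginal(emit_lp, log_init, log_trans, dur_means, max_dur):
    """Semi-Markov forward algorithm: log P(x_{1:T}) with segmentations and
    labels marginalized out (durations truncated at max_dur in this sketch).

    emit_lp:   (T, L) per-timestep emission log-probs log p(x_t | l)
    log_init:  (L,)   log P(r_1)
    log_trans: (L, L) log P(r_k | r_{k-1}), rows indexed by the previous label
    dur_means: (L,)   Poisson mean duration for each label
    """
    T, L = emit_lp.shape
    # prefix sums so a segment's emission score is a constant-time difference
    cum = np.vstack([np.zeros((1, L)), np.cumsum(emit_lp, axis=0)])   # (T+1, L)
    durations = np.arange(1, max_dur + 1)
    dur_lp = poisson.logpmf(durations[:, None], dur_means[None, :])   # (max_dur, L)

    # alpha[t, l]: log-prob that a segment labeled l ends exactly at timestep t
    alpha = np.full((T, L), -np.inf)
    for t in range(T):
        for d in range(1, min(max_dur, t + 1) + 1):
            s = t - d + 1                                  # segment start
            seg_emit = cum[t + 1] - cum[s]                 # (L,) emission sum over segment
            if s == 0:
                prev = log_init                            # first segment of the video
            else:
                prev = logsumexp(alpha[s - 1][:, None] + log_trans, axis=0)
            alpha[t] = np.logaddexp(alpha[t], prev + dur_lp[d - 1] + seg_emit)
    # the last segment must end at the final timestep
    return logsumexp(alpha[T - 1])
```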

Generative supervised  Here the labels l are observed; we train the model as a generative semi-Markov model (SMM) to maximize the log joint likelihood:

    JL = \sum_{i=1}^{N} \log P(l^{(i)}_{1:T_i}, x^{(i)}_{1:T_i})    (3)

We maximize this likelihood over the entire training set using the closed form solution given the dataset's sufficient statistics (per-step feature means, average durations, and step transition frequencies).

2 We experimented with sharing steps, or step components, across tasks in initial experiments, but found that it was helpful to have task-specific structural probabilities.
3 We found that using a shared diagonal covariance matrix outperformed using full or unshared covariance matrices.
4 This is the same as mini-batched Expectation Maximization using gradient descent on the M-objective (Eisner, 2016).

                    Richard et al. (2018)   Zhukov et al. (2019)   Ours
step reordering     X                                              X
step repetitions    X                                              X
step duration       X                                              X
language                                    X                      X
generative model    X                                              X

Table 1: Characteristics of each model we compare.
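The closed-form generative supervised estimates described above can be read off the labeled data directly; the sketch below is our own illustration of collecting those sufficient statistics (the data layout and names are assumptions, and the transition counts would still need to be normalized, and possibly smoothed, into probabilities).

```python
import numpy as np
from collections import Counter

def fit_smm_supervised(videos):
    """Sufficient statistics for supervised SMM training from labeled videos.

    videos: list of (features, labels) pairs, where features is (T, D) and
            labels is a length-T list of integer region labels.
    Returns per-label feature means, mean durations (Poisson rates), and
    bigram transition counts over region labels.
    """
    feat_sums, feat_counts = {}, Counter()
    dur_sums, dur_counts = Counter(), Counter()
    trans_counts = Counter()
    for feats, labels in videos:
        # collapse timestep labels into (label, duration) regions
        regions, start = [], 0
        for t in range(1, len(labels) + 1):
            if t == len(labels) or labels[t] != labels[start]:
                regions.append((labels[start], t - start))
                start = t
        for lab, dur in regions:
            dur_sums[lab] += dur
            dur_counts[lab] += 1
        for (a, _), (b, _) in zip(regions[:-1], regions[1:]):
            trans_counts[(a, b)] += 1
        for t, lab in enumerate(labels):
            feat_sums[lab] = feat_sums.get(lab, 0.0) + feats[t]
            feat_counts[lab] += 1
    means = {lab: feat_sums[lab] / feat_counts[lab] for lab in feat_sums}
    poisson_rates = {lab: dur_sums[lab] / dur_counts[lab] for lab in dur_sums}
    return means, poisson_rates, trans_counts
```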

Discriminative supervised  To train the SMM model discriminatively in the supervised setting, we use gradient descent to maximize the log conditional likelihood:

    CL = \sum_{i=1}^{N} \log P(l^{(i)}_{1:T_i} | x^{(i)}_{1:T_i})    (4)

5 Benchmarks

We identify five modeling choices made in recent work: imposing a fixed ordering on steps (not allowing step reordering); allowing for steps to repeat in a video; modeling the duration of steps; using the language (narrations) associated with the video; and using a discriminative/generative model. We picked the recent models of Zhukov et al. (2019) and Richard et al. (2018) since they have non-overlapping strengths (see Table 1).

ORDEREDDISCRIM  This work (Zhukov et al., 2019) uses a discriminative classifier which gives a probability distribution over labels at each timestep: p(l_t | x_t). Inference finds an assignment of steps to timesteps that maximizes \sum_t \log p(l_t | x_t) subject to the constraints that: all steps are predicted exactly once; steps occur in the fixed canonical ordering defined for the task; one background region occurs between each step. Unsupervised training of the model alternates between inferring labels using the dynamic program, and updating the classifier to maximize the probability of these inferred labels.5

ACTIONSETS  This work (Richard et al., 2018) uses a generative model which has structure similar to ours, but uses dataset statistics (e.g., average video length and number of steps) to learn the structure distributions, rather than setting parameters to maximize the likelihood of the data. As in our model, region durations are modeled using a class-conditional Poisson distribution. The feature distribution is modeled using Bayesian inversion of a discriminative classifier (a multi-layer perceptron) with an estimated label prior. The structural parameters of the model (durations and class priors) are estimated using the length of each video, and the number of possible step types. As originally presented, this model depends on knowing which steps occur in a video at training time; for fair comparison, we adapt it to the same supervision conditions of Zhukov et al. (2019) by enforcing the canonical step ordering for the task during both training and evaluation.

5 To allow the model to predict step regions with duration longer than a single timestep, we modify this classifier to also predict a background class, and incorporate the scores of the background class into the dynamic program.

6 Experimental Setting

We compare models on the CrossTask dataset across supervision conditions. We primarily evaluate the models on action segmentation (Sec. 1). Past work on the dataset (Zhukov et al., 2019) has focused on a step recognition task, where models identify individual timesteps in videos that correspond to possible steps; for comparison, we also report performance for all models on this task.

6.1 Supervision Conditions

In all settings, the task for a given video is known (and hence the possible steps), but the settings vary in the availability of other sources of supervision: step labels for each timestep in a video, and constraints from language and step ordering. Models are trained on a training set and evaluated on a separate held-out testing set, consisting of different videos (from the same tasks).

Supervised  Labels for all timesteps l_{1:T} are provided for all videos in the training set.

Fully unsupervised  No labels for timesteps are available during training. The only supervision is the number of possible step types for each task (and, as in all settings, which task each video is from). In evaluation, the task for a given video (and hence the possible steps, but not their ordering) are known. We follow past work in this setting (Sener et al., 2015; Sener and Yao, 2018) by finding a mapping from model states to region labels that maximizes label accuracy, averaged across all videos in the task. See Appendix C for details.

Weakly supervised  No labels for timesteps are available, but two supervision types are used in the form of constraints (Zhukov et al., 2019):

(1) Step ordering constraints: Step regions are constrained to occur in the canonical step ordering (see Sec. 3) for the task, but steps may be separated by background. We constrain the structure prior distributions p(r_1) and transition distributions p(r_{k+1} | r_k) of the HSMM to enforce this ordering. For p(r_1), we only allow non-zero probability for the background region, BKG, and for the first step in the task's ordering. p(r_k | r_{k-1}) constrains each step type to only transition to the next step in the constrained ordering, or to BKG.6 As step ordering constraints change the parameters of the model, when we use them we enforce them during both training and testing. While this obviates most of the learned structure of the HSMM, the duration model (as well as the feature model) is still learned.
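One way to encode these ordering constraints is to mask the initial and transition tables of the HSMM, duplicating the background label so that it remembers the preceding step (as in footnote 6). The sketch below is our own simplified illustration rather than the paper's implementation; the index layout and the uniform placeholder probabilities are assumptions (in the model, allowed entries would be learned subject to the mask).

```python
import numpy as np

NEG_INF = -1e9

def ordered_structure(num_steps):
    """Ordering-constrained initial and transition tables.

    Labels 0..S-1 are the task's steps in canonical order. Label S is an
    initial background state, and label S+1+s is background following step s,
    so the ordering can be enforced through background regions as well.
    Disallowed entries are set to a large negative log-probability.
    """
    S = num_steps
    L = 2 * S + 1
    bkg_init = S
    bkg_after = lambda s: S + 1 + s

    log_init = np.full(L, NEG_INF)
    log_init[0] = np.log(0.5)          # start with the first step...
    log_init[bkg_init] = np.log(0.5)   # ...or with initial background

    log_trans = np.full((L, L), NEG_INF)
    log_trans[bkg_init, 0] = 0.0                    # initial BKG -> first step
    for s in range(S):
        targets = [bkg_after(s)]                    # step s -> BKG after step s
        if s + 1 < S:
            targets.append(s + 1)                   # step s -> step s+1
            log_trans[bkg_after(s), s + 1] = 0.0    # BKG after step s -> step s+1
        for nxt in targets:
            log_trans[s, nxt] = np.log(1.0 / len(targets))
    return log_init, log_trans
```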

(2) Narration constraints: These give regions in the video where each step type is likely to occur. Zhukov et al. (2019) obtained these using similarities between word vectors for the transcribed narration and the words in the step labels, and a dynamic program to produce constraint regions that maximize these similarities, subject to the step ordering matching the canonical task ordering. See Zhukov et al. for details. We enforce these constraints in the HSMM by penalizing the feature distributions to prevent any step labels that occur outside of one of the allowed constraint regions for that step. Following Zhukov et al., we only use these narration constraints during training.7
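In practice this can be implemented by masking the per-timestep emission scores so that a step label receives a large penalty outside its allowed constraint windows (training-time only). The sketch below is our own illustration; the window representation and names are assumptions.

```python
import numpy as np

NEG_INF = -1e9

def apply_narration_constraints(emit_lp, windows, bkg_labels):
    """Penalize emission log-probs so a step label can only occur inside its
    narration-constraint windows.

    emit_lp:    (T, L) emission log-probs
    windows:    dict mapping step label id -> list of (start, end) timestep
                intervals where that step is allowed
    bkg_labels: set of background label ids, which are never constrained
    """
    T, L = emit_lp.shape
    constrained = emit_lp.copy()
    for lab in range(L):
        if lab in bkg_labels:
            continue
        mask = np.zeros(T, dtype=bool)
        for start, end in windows.get(lab, []):
            mask[start:end] = True
        constrained[~mask, lab] = NEG_INF   # disallow the step outside its windows
    return constrained
```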

6.2 Evaluation

We use three metrics from past work, outlined here and described in more detail in Appendix D. To evaluate action segmentation, we use two varieties of the standard label accuracy metric (Sener and Yao, 2018; Richard et al., 2018): all label accuracy, which is computed on all timesteps, including background and non-background, as well as step label accuracy: accuracy only for timesteps that occur in a non-background region (according to the ground-truth annotations). Since these two accuracy metrics are defined on individual frames, they penalize models if they don't capture the full temporal extent of actions in their predicted segmentations. Our third metric is step recall, used by past work on the CrossTask dataset (Zhukov et al., 2019) to measure step recognition (defined in Sec. 6). This metric evaluates the fraction of step types which are correctly identified by a model when it is allowed to predict only one frame per step type, per video. A high step recall indicates a model can accurately identify at least one representative frame of each action type in a video.

We also report three other statistics to analyze the predicted segmentations: (1) Sequence similarity: the similarity of the sequence of region labels predicted in the video to the ground truth, using inverse Levenshtein distance normalized to be between 0 and 100. See Appendix D for more details. (2) Predicted background percentage: the percentage of timesteps for which the model predicts the background label. Models with a higher percentage than the ground truth background percentage (72%) are overpredicting background. (3) Number of segments: the number of step segments predicted in a video. Values higher than the ground truth average (7.7) indicate overly-fragmented steps. Sequence similarity and number of segments are particularly relevant for measuring the effects of structure, as they do not factor over individual timesteps (as do the all label and step label accuracies and step recall).

We average values across the 18 tasks in the evaluation set (following Zhukov et al., 2019).

6 To enforce ordering when steps are separated by BKG, we annotate BKG labels with the preceding step type (but all BKG labels for a task share feature and duration parameters, and are merged for evaluation).
7 We also experiment with using features derived from transcribed narration in Appendix G.

6.3 Features

For our features x_{1:T}, we use the same base features as Zhukov et al. (2019), which are produced by convolutional networks pre-trained on separate activity, object, and audio classification datasets. See Appendix B for details. In our generative models, we apply PCA (following Kuehne et al., 2014 and Richard et al., 2018) to project features to 300 dimensions and decorrelate dimensions (see Appendix B for details).8

7 Results

We first define several baselines based on dataset statistics (Sec. 7.1), which we will find to be strong in comparison to past work. We then analyze each aspect of our proposed model on the dataset in a supervised training setting (Sec. 7.2), removing some error sources of unsupervised learning and evaluating whether a given model fits the dataset (Liang and Klein, 2008). Finally, we move to our main setting, the weakly-supervised setting of past work, incrementally adding step ordering and narration constraints (see Sec. 6.1) to evaluate the degree to which each helps (Sec. 7.3).

8 This reduces the number of parameters that need to be learned in the emission distributions, both by reducing the dimensionality and allowing a diagonal covariance matrix. In early experiments we found PCA improved performance.

Figure 2: Baseline and model performance on two key metrics: step label accuracy and step recall. Points are colored according to their supervision type, and labeled with their row number from Table 2. We also label particular important models.

Results are given in Table 2 for models trained on the CrossTask training set of primary tasks, and evaluated on the held-out validation set. We will describe and analyze each set of results in turn. See Figure 2 for a plot of models' performance on two key metrics, and Appendix I for example predictions.

7.1 Dataset Statistic Baselines

Table 2 (top block) shows baselines that do not use video (or narration) features, but predict steps according to overall statistics of the training data. These demonstrate characteristics of the data, and the importance of using multiple metrics.

Predict background (B1)  Since most timesteps are background, a model that predicts background everywhere can obtain high overall label accuracy, showing the importance of also using step label accuracy as a metric for action segmentation.

Sample from the training distribution (B2)  For each timestep in each video, we sample a label from the empirical distribution of step and background label frequencies for the video's task in the training data.


#    Model                               All Label  Step Label  Step    Sequence    Predicted  Num.
                                         Accuracy   Accuracy    Recall  Similarity  Bkg. %     Segments

Dataset Statistic Baselines (Sec. 7.1)
GT   Ground truth                        100.0      100.0       100.0   100.0       71.9       7.7
B1   Predict background                  71.9       0.0         0.0     9.0         100.0      0.0
B2   Sample from train distribution      54.6       7.2         8.3     12.8        72.4       69.5
B3   Ordered uniform                     55.6       8.1         12.2    55.0        73.0       7.4

Supervised (Sec. 7.2)
Unstructured
S1   Discriminative linear               71.0       36.0        31.6    30.7        73.3       27.1
S2   Discriminative MLP                  75.9       30.4        27.7    41.1        82.8       13.0
S3   Gaussian mixture                    69.4       40.6        31.5    33.3        68.9       23.9
Structured
S4   ORDEREDDISCRIM                      75.2       18.1        45.4    54.4        90.7       7.4
S5   SMM, discriminative                 66.0       37.3        24.1    50.5        65.9       8.5
S6   SMM, generative                     60.5       49.4        28.7    46.6        52.4       10.6

Un- and Weakly-Supervised (Sec. 7.3)
Fully Unsupervised
U1   HSMM (with opt. acc. assignment)    31.8       28.8        10.6    31.0        31.1       15.4
Ordering Supervision
U2   ACTIONSETS                          40.8       14.0        12.1    55.0        49.8       7.4
U3   ORDEREDDISCRIM (without Narr.)      69.5       0.2         2.8     55.0        97.2       7.4
U4   HSMM + Ord                          55.5       8.3         7.3     55.0        70.6       7.4
Narration Supervision
U5   HSMM + Narr                         65.7       9.6         8.5     35.1        84.6       4.5
Ordering + Narration Supervision
U6   ORDEREDDISCRIM                      71.0       1.8         24.5    55.0        97.2       7.4
U7   HSMM + Narr + Ord                   61.2       15.9        17.2    55.0        73.7       7.4

Table 2: Model comparison on the CrossTask validation data. We evaluate primarily using all label accuracy and step label accuracy to evaluate action segmentation, and step recall to evaluate step recognition.

Ordered uniform (B3)  For each video, we predict step regions in the canonical step order, separated by background regions. The length of each region is set so that all step regions in a video have equal duration, and the percentage of background timesteps is equal to the corpus average. See Uniform in Figure 3a for sample predictions.
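A sketch of how such an ordered uniform segmentation can be constructed is shown below; it is our own illustration, and the exact placement of the background gaps in the paper's baseline may differ.

```python
import numpy as np

def ordered_uniform_segmentation(num_timesteps, ordered_steps, bkg_frac=0.72, bkg_label=-1):
    """Ordered uniform baseline (B3): step regions in canonical order, all of
    equal length, separated by background so that roughly bkg_frac of the
    timesteps are background. ordered_steps is a list of integer step ids."""
    S = len(ordered_steps)
    labels = np.full(num_timesteps, bkg_label)
    step_len = max(int(round(num_timesteps * (1.0 - bkg_frac))) // S, 1)
    gap = max((num_timesteps - S * step_len) // (S + 1), 0)   # background gaps
    pos = gap
    for step in ordered_steps:
        labels[pos:pos + step_len] = step
        pos += step_len + gap
    return labels

print(ordered_uniform_segmentation(20, [0, 1, 2]))
```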

Sampling each timestep label independently from the task distribution (row B2), and using a uniform step assignment in the task's canonical ordering with background (B3) both obtain similar step label accuracy, but the ordered uniform baseline improves substantially on the step recall metric, indicating that step ordering is a useful inductive bias for step recognition.

7.2 Full Supervision

Models in the unstructured block of Table 2 are classification models applied independently to all timesteps, allowing us to compare the performance of the feature models used as components in our structured models. We find that a Gaussian mixture model (row S3), which is used as the feature model in the HSMM, obtains comparable step recall and substantially higher step label accuracy than a discriminative linear classifier (row S1) similar to the one used in Zhukov et al. (2019), which is partially explained by the discriminative classifier overpredicting the background class (comparing Predicted Background % for those two rows). Using a higher capacity discriminative classifier, a neural net with a single hidden layer (MLP), improves performance over the linear model on several metrics (row S2); however, the MLP still overpredicts background, substantially underperforming the Gaussian mixture on the step label accuracy metric.

In the structured block of Table 2, we compare the full models which use step constraints (Zhukov et al., 2019) or learned transition distributions (the SMM) to model task structure. The structured models learn (or in the case of Zhukov et al., enforce) orderings over the steps, which greatly improve their sequence similarity scores when compared to the unstructured models, and decrease step fragmentation (as measured by num. segments). Figure 3a shows predictions for a typical video, demonstrating this decreased fragmentation.9

9 We also perform an ablation study to understand the effect of the duration model. See Appendix F for details.


Figure 3: Step segmentation visualizations for two sample videos in supervised (left) and unsupervised (right) conditions. The x-axes show timesteps, in seconds. (a) Step segmentations in the full supervision condition for a video from the make kimchi fried rice task, comparing the ground truth (GT), ordered uniform baseline (Uniform), and predictions from the Gaussian mixture (GMM) and semi-Markov (SMM) models. (b) Step segmentations in the no- or weak-supervision conditions for a video from the make pancakes task, comparing the ground truth (GT) to predictions from our model without (HSMM) and with constraint supervision (HSMM+Narr+Ord) and from Zhukov et al. (2019) (ORDEREDDISCRIM). See Appendix I for more visualizations.

We see two trends in the supervised results:

(1) Generative models obtain substantially higher step label accuracy than discriminative models of the same or similar class. This is likely due to the fact that the generative models directly parameterize the step distribution. (See Appendix E.)

(2) Structured sequence modeling naturally improves performance on sequence-level metrics (sequence similarity and number of segments predicted) over the unstructured models. However, none of the learned structured models improve on the strong ordered uniform baseline (B3) which just predicts the canonical ordering of a task's steps (interspersed with background regions). This will motivate using this canonical ordering as a constraint in unsupervised learning.

Overall, the SMM models obtain strong action segmentation performance (high step label accuracy without fragmenting segments or overpredicting background).

7.3 No or Weak Supervision

Here models are trained without supervision for the labels l_{1:T}. We compare models trained without any constraints, to those that use constraints from step ordering and narration, in the Un- and Weakly-Supervised block of Table 2. Example outputs are shown in Appendix I.

Our generative HSMM model affords training without any constraints (row U1). This model has high step label accuracy (compared to the other unsupervised models) but low all label accuracy, and similar scores for both metrics. This hints, and other metrics confirm, that the model is not adequately distinguishing steps from background: the percentage of predicted background is very low (31%) compared to the ground truth (72%, row GT). See HSMM in Figure 3b for predictions for a typical video. These results are attributable to features within a given video (even across step types) being more similar than features of the same step type in different videos (see Appendix H for feature visualizations). The induced latent model states typically capture this inter-video diversity, rather than distinguishing steps across tasks.

We next add in constraints from the canonical step ordering, which our supervised results showed to be a strong inductive bias. Unlike in the fully unsupervised setting, the HSMM model with ordering (HSMM+Ord, row U4) learns to distinguish steps from background when constrained to predict each step region once in a video, with predicted background timesteps (70.6%) close to the ground truth (72%). However, performance of this model is still very low on the task metrics – comparable to or underperforming the ordered uniform baseline with background (row B3) on all metrics.

This constrained step ordering setting also allows us to apply ACTIONSETS (Richard et al., 2018) and ORDEREDDISCRIM (Zhukov et al., 2019). ACTIONSETS obtains high step label accuracy, but substantially underpredicts background, as evidenced by both the all label accuracy and the low predicted background percentage. The tendency of ORDEREDDISCRIM to overpredict background which we saw in the supervised setting (row S4) is even more pronounced in this weakly-supervised setting (row U3), resulting in scores very close to the predict background baseline (B1).

Next, we use narration constraints (U5), which are enforced only during training time, following Zhukov et al. (2019). Narration constraints substantially improve all label accuracy (comparing U1 and U5). However, the model overpredicts background, likely because it doesn't enforce each step type to occur in a given video. Overpredicting background causes step label accuracy and step recall to decrease.

                    All Label   Step Label   Step
                    Acc.        Acc.         Recall
ORDEREDDISCRIM      71.3        1.2          17.9
HSMM+Narr+Ord       66.0        5.6          14.2

Table 3: Unsupervised and weakly supervised results in the cross-validation setting.

Finally, we compare the HSMM and ORDEREDDISCRIM models when using both narration constraints (in training) and ordering constraints (in training and testing) in the ordering + narration block. Both models benefit substantially from narration on all metrics when compared to using only ordering supervision, more than doubling their performance on step label accuracy and step recall (comparing U6 and U7 to U3 and U4).

Our weakly-supervised results show that:

(1) Both action segmentation metrics – all label accuracy and step label accuracy – are important to evaluate whether models adequately distinguish meaningful actions from background.

(2) Step constraints derived from the canonical step ordering provide a strong inductive bias for unsupervised step induction. Past work requires these constraints and the HSMM, when trained without them, does poorly, learning to capture diversity across videos rather than to identify steps.

(3) However, ordering supervision alone is not sufficient to allow these models to learn better segmentations than a simple baseline that just uses the ordering to assign labels (ordered uniform); narration is also required.

7.4 Comparison to Past Work

Finally, we compare our full model to the ORDEREDDISCRIM model of Zhukov et al. (2019) in the primary data evaluation setup from that work: averaging results over 20 random splits of the primary data (Table 3). This is a low data setting which uses only 30 videos per task as training data in each split.

Accordingly, both models have lower performance, although the relative ordering is the same: higher step label accuracy for the HSMM, and higher all label accuracy and step recall for ORDEREDDISCRIM. Although in this low-data setting, models overpredict background even more, this problem is less pronounced for the HSMM: 97.4% of timesteps for ORDEREDDISCRIM are predicted background (explaining its high all label accuracy), and 87.1% for HSMM.

8 Discussion

We find that unsupervised action segmentation in naturalistic instructional videos is greatly aided by the inductive bias given by typical step orderings within a task, and narrative language describing the actions being done. While some results are more mixed (with the same supervision, different models are better on different metrics), we do observe that across settings and metrics, step ordering and narration increase performance. Our results also illustrate the importance of strong baselines: without weak supervision from step orderings and narrative language, even state-of-the-art unsupervised action segmentation models operating on rich video features underperform feature-agnostic baselines. We hope that future work will continue to evaluate broadly.

While action segmentation in videos from diverse domains remains challenging – videos contain both a large variety of types of depicted actions, and high visual variety in how the actions are portrayed – we find that structured generative models provide a strong benchmark for the task due to their abilities to capture the full diversity of action types (by directly modeling distributions over action occurrences), and to benefit from weak supervision. Future work might explore methods for incorporating richer learned representations both of the diverse visual observations in videos, and the narration that describes them, into such models.

Acknowledgments

Thanks to Dan Klein, Andrew Zisserman, Lisa Anne Hendricks, Aishwarya Agrawal, Gabor Melis, Angeliki Lazaridou, Anna Rohrbach, Justin Chiu, Susie Young, the DeepMind language team, and the anonymous reviewers for helpful feedback on this work. DF is supported by a Google PhD Fellowship.

References

Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. 2016. Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Dare A Baldwin, Jodie A Baird, Megan M Saylor, and M Angela Clark. 2001. Infants parse dynamic action. Child Development, 72(3):708–717.

Piotr Bojanowski, Remi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. 2014. Weakly supervised action labeling in videos under ordering constraints. In Proceedings of the European Conference on Computer Vision (ECCV).

Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles. 2019. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Li Ding and Chenliang Xu. 2018. Weakly-supervised action segmentation with iterative soft boundary assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Proceedings of the Workshop on Structured Prediction for NLP.

Ehsan Elhamifar and Zwe Naing. 2019. Unsupervised procedure learning via joint dynamic summarization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Yazan Abu Farha and Jurgen Gall. 2019. MS-TCN: Multi-stage temporal convolutional network for action segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. 2016. Connectionist temporal modeling for weakly supervised action labeling. In Proceedings of the European Conference on Computer Vision (ECCV).

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Dan Klein and Christopher D. Manning. 2002. Conditional structure versus conditional estimation in NLP models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Hilde Kuehne, Alexander Richard, and Juergen Gall. 2017. Weakly supervised learning of actions from transcripts. In CVIU.

Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Anna Kukleva, Hilde Kuehne, Fadime Sener, and Jurgen Gall. 2019. Unsupervised learning of action classes with continuous temporal embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. 2017. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Dani Levine, Daphna Buchsbaum, Kathy Hirsh-Pasek, and Roberta M Golinkoff. 2019. Finding events in a continuous world: A developmental account. Developmental Psychobiology, 61(3):376–389.

Percy Liang and Dan Klein. 2008. Analyzing the errors of unsupervised learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(Nov):2579–2605.

Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nicholas Johnston, Andrew Rabinovich, and Kevin Murphy. 2015. What's cookin'? Interpreting cooking videos using text, speech and vision. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Bridgette A Martin and Barbara Tversky. 2003. Segmenting ambiguous events. In Proceedings of the Annual Meeting of the Cognitive Science Society.

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Kevin Murphy. 2002. Hidden semi-Markov models. Unpublished tutorial.

Janne Pylkkonen and Mikko Kurimo. 2004. Duration modeling techniques for continuous speech recognition. In Eighth International Conference on Spoken Language Processing.

Alexander Richard, Hilde Kuehne, and Juergen Gall. 2017. Weakly supervised action learning with RNN based fine-to-coarse modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Alexander Richard, Hilde Kuehne, and Juergen Gall. 2018. Action sets: Weakly supervised action segmentation without ordering constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. 2012. A database for fine grained activity detection of cooking activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252.

Roger C Schank and Robert P Abelson. 1977. Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Psychology Press.

Fadime Sener and Angela Yao. 2018. Unsupervised learning and segmentation of complex activities from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

O. Sener, A. Zamir, S. Savarese, and A. Saxena. 2015. Unsupervised semantic parsing of video collections. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Tanya Sharon and Karen Wynn. 1998. Individuation of actions from continuous motion. Psychological Science, 9(5):357–362.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR).

Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel, and Ming Shao. 2016. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743.

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ercenur Ünal, Yue Ji, and Anna Papafragou. 2019. From event representation to linguistic meaning. Topics in Cognitive Science.

Huijuan Xu, Abir Das, and Kate Saenko. 2017. R-C3D: Region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2018. Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision (IJCV), 126(2-4):375–389.

Shoou-I Yu, Lu Jiang, and Alexander Hauptmann. 2014. Instructional videos for unsupervised harvesting and learning of action examples. In Proceedings of the ACM International Conference on Multimedia (MM).

Shun-Zheng Yu. 2010. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243.

Jeffrey M Zacks and Khena M Swallow. 2007. Event segmentation. Current Directions in Psychological Science, 16(2):80–84.

Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. 2017. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Luowei Zhou, Xu Chenliang, and Jason J. Corso. 2018. Towards automatic learning of procedures from web instructional videos. In Proceedings of the Conference on Artificial Intelligence (AAAI).

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. 2019. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

A Optimization

For both training conditions for our semi-Markov models that require gradient descent (generative unsupervised and discriminative supervised), we initialize parameters randomly and use Adam (Kingma and Ba, 2015) with an initial learning rate of 5e-3 and a batch size of 5 videos, and decay the learning rate when the training log likelihood does not improve for more than one epoch.
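A minimal sketch of this optimization setup in PyTorch follows; it is our own wiring rather than the released training script, the decay factor is an assumption, and the loss is a stand-in for the negative of Eq. (2).

```python
import torch

params = [torch.nn.Parameter(torch.randn(10))]          # stand-in for HSMM parameters
optimizer = torch.optim.Adam(params, lr=5e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=1)      # decay factor is our assumption

def batch_negative_log_likelihood(params):
    # placeholder for -log P(x) summed over a mini-batch of 5 videos (Eq. 2)
    return (params[0] ** 2).sum()

for epoch in range(10):
    epoch_nll = 0.0
    for _ in range(20):                                  # mini-batches of 5 videos
        optimizer.zero_grad()
        nll = batch_negative_log_likelihood(params)
        nll.backward()
        optimizer.step()
        epoch_nll += nll.item()
    scheduler.step(epoch_nll)                            # decay lr if training NLL plateaus
```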

B Features

For our features x_{1:T}, we use the same base features as Zhukov et al. (2019). There are three feature types: activity recognition features, produced by an I3D model (Carreira and Zisserman, 2017) trained on the Kinetics-400 dataset (Kay et al., 2017); object classification features, from a ResNet-152 (He et al., 2016) trained on ImageNet (Russakovsky et al., 2015), and audio classification features10 from the VGG model (Simonyan and Zisserman, 2015) trained by Hershey et al. (2017) on a preliminary version of the YouTube-8M dataset (Abu-El-Haija et al., 2016).11

10 https://github.com/tensorflow/models/tree/master/research/audioset/vggish
11 We also experiment with using features derived from transcribed narration in Appendix G.

For the generative models which use Gaussian emission distributions, we apply PCA to the base features above to reduce the feature dimensionality and decorrelate dimensions. We perform PCA separately for features within task and within each feature group (I3D, ResNet, and audio features), but on features from all videos within that task. We use 100 components for each feature group, which explained roughly 70-100% of the variance in the features, depending on the task and feature group. The 100-dimensional PCA representations for the I3D, ResNet, and audio features for each frame, at timestep t, are then concatenated to give a 300-dimensional vector for the frame, x_t.
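A sketch of this per-task, per-feature-group PCA step, assuming scikit-learn and an array layout of our own choosing:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_by_group(feature_groups, n_components=100):
    """Project each feature group (e.g., I3D, ResNet, audio) for one task down
    to n_components with PCA and concatenate per frame.

    feature_groups: list of arrays, each (num_frames, group_dim), stacked over
                    all videos of one task.
    Returns an array of shape (num_frames, n_components * len(feature_groups)).
    """
    projected = []
    for feats in feature_groups:
        pca = PCA(n_components=n_components)
        projected.append(pca.fit_transform(feats))   # also decorrelates dimensions
    return np.concatenate(projected, axis=1)
```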

C Unsupervised Evaluation

The HSMM model, when trained in a fully unsupervised setting, induces class labels for regions in the video; however, while these class labels are distinct, they do not correspond a priori to any of the actual region labels (which can be step types, or background) for our task. Just as with other unsupervised tasks and models (e.g., part-of-speech induction), we need a mapping from these classes to step types (and background) in order to evaluate the model's predictions. We follow the evaluation procedure of past work (Sener and Yao, 2018; Sener et al., 2015) by finding the mapping from model states to region labels that maximizes label accuracy, averaged across all videos in the task, using the Hungarian method (Kuhn, 1955). This evaluation condition is only used in the "Unsupervised" section of Table 2 (in the rows marked with optimal accuracy assignment).
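This mapping can be computed with the Hungarian method over a state-label overlap matrix; the sketch below is our own illustration using scipy, with function and argument names that are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_state_to_label_mapping(pred_states, true_labels, num_states, num_labels):
    """Map induced HSMM states to region labels so that label accuracy is
    maximized, using the Hungarian method.

    pred_states: (N,) array of predicted state ids over all timesteps of a task
    true_labels: (N,) array of ground-truth label ids for the same timesteps
    """
    # overlap[s, l] = number of timesteps where state s coincides with label l
    overlap = np.zeros((num_states, num_labels), dtype=np.int64)
    np.add.at(overlap, (pred_states, true_labels), 1)
    # the Hungarian method maximizes total overlap (scipy minimizes, so negate)
    state_idx, label_idx = linear_sum_assignment(-overlap)
    return dict(zip(state_idx, label_idx))
```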

D Evaluation Metrics

Label accuracy  The standard metric for action segmentation (Sener and Yao, 2018; Richard et al., 2018) is timestep label accuracy, or, in datasets with a large amount of background, label accuracy on non-background timesteps. The CrossTask dataset has multiple reference step labels in the ground truth for around 1% of timesteps, due to noisy region annotations that overlap slightly. We obtain a single reference label for these timesteps by taking the step that appears first in the canonical step ordering for the task. We then compute accuracy of the model predictions against these reference labels across all timesteps and all videos for a task (in the all label accuracy condition), or by filtering to those timesteps which have a step label (non-background) in the reference (to focus on the model's ability to accurately predict step labels), in the step label accuracy condition.

Step recall This metric (Zhukov et al., 2019) measures a model's ability to pick out instants for each of the possible step types for a task, if they occur in a video. The model of Zhukov et al. (2019) predicted a single frame for each step type; while our extension of their model, ORDEREDDISCRIM, and our HSMM model can predict multiple, when computing this metric we obtain a single frame for each step type to make the numbers comparable to theirs. When a model predicts multiple frames per step type, we obtain a single one by taking the one closest to the middle of the temporal extent of the predicted frames for that step type. We then apply their recall metric: first, count the number of recovered steps, i.e., step types from the true labels for the video that were identified by one of the predicted labels (that have a predicted label of the same type at one of the true label's frames). These recovered step counts are summed across videos in the evaluation set for a given task, and normalized by the maximum number of possible recovered steps (the number of step types in each video, summed across videos) to produce a step recall fraction for the task.
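A minimal sketch of this metric for a single video (summing across videos is left to the caller); the representative-frame selection follows the middle-of-extent rule described above, and variable names are illustrative.

```python
import numpy as np

def step_recall_counts(pred, ref, step_types):
    # pred, ref: per-timestep labels for one video; step_types: the task's steps
    pred, ref = np.asarray(pred), np.asarray(ref)
    recovered, possible = 0, 0
    for step in step_types:
        if not np.any(ref == step):   # only count step types occurring in the video
            continue
        possible += 1
        frames = np.nonzero(pred == step)[0]
        if len(frames) == 0:
            continue
        # collapse multiple predicted frames to the one closest to the middle
        # of the predicted temporal extent for this step type
        mid = (frames.min() + frames.max()) / 2.0
        t = frames[np.argmin(np.abs(frames - mid))]
        if ref[t] == step:
            recovered += 1
    return recovered, possible  # sum over videos; recall = recovered / possible
```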

Sequence similarity This measures the similarity of the predicted sequence of regions in a video against the true sequence of regions. As in speech recognition, we are interested in the high-level sequence of steps recognized in a video (and wish to abstract away from noise in the boundaries of the annotated regions). We first compute the negated Levenshtein distance between the true sequence of steps and background r_1, ..., r_K for a video and the predicted sequence r'_1, ..., r'_{K'}. The negated distances for the sequence pairs for a given video are scaled to be between 0 and 100, where 0 indicates the Levenshtein distance is the maximum possible between two sequences of their respective lengths, and 100 corresponds to the sequences being identical. These similarities are then averaged across all videos in a task.
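A minimal sketch of this similarity for one video, assuming the maximum possible Levenshtein distance between two sequences is the length of the longer one:

```python
def levenshtein(a, b):
    # standard edit-distance dynamic program (insert / delete / substitute)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def sequence_similarity(true_regions, pred_regions):
    dist = levenshtein(true_regions, pred_regions)
    max_dist = max(len(true_regions), len(pred_regions))  # worst possible distance
    return 100.0 * (1.0 - dist / max_dist) if max_dist else 100.0
```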

E Comparing Generative and Discriminative Models

We observe that the generative models tend to obtain higher performance on the action segmentation task, as measured by step label accuracy, than discriminative models of the same or similar class. We attribute this finding to two factors: first, the generative models explicitly parameterize probabilities for the steps, allowing better modeling of the full distribution of step labels. Second, the discriminative models are trained to optimize p(l_t | x_t) for all timesteps t. We would expect this to produce better accuracies on metrics aligned with this objective (Klein and Manning, 2002), and indeed the all timestep accuracy is higher for the discriminative models. However, the discriminative models' high accuracy often comes at the expense of predicting background more frequently, leading to lower performance on step label accuracy.

F Duration Model Ablation

We examine the effect of the (hidden) semi-Markov model's Poisson duration model by comparing to a (hidden) Markov model (HMM in the unsupervised/weakly-supervised settings, or MM in the supervised setting). We use the model as described in Sec. 4, except that we fix all durations to be a single timestep, and then train as described in Sec. 4.1. While this does away with explicit modeling of duration, the transition distribution still allows the model to learn expected durations for each region type by implicitly parameterizing a geometric distribution over region length. Results are shown in Table 4. We observe that results are overall very similar, with the exceptions that removing the duration model decreases performance substantially on all metrics in the discriminative supervised setting, and increases performance on step label accuracy and step recall in the constrained unsupervised setting (HSMM+Ord+Narr and HMM+Ord+Narr). This suggests that the HMM transition distribution is able to model region duration as well as the HSMM's explicit duration model, or that duration overall plays a small role in most settings relative to the importance of the features.

Model              All Label Acc.   Step Label Acc.   Step Recall   Seq. Sim.
Supervised
  SMM, gen.             60.5             49.4             28.7          46.6
  MM, gen.              60.1             48.6             28.2          46.8
  SMM, disc.            66.0             37.3             24.1          50.5
  MM, disc.             62.8             32.2             20.1          41.8
Weakly-Supervised
  HSMM                  31.8             28.8             10.6          31.0
  HMM                   28.8             30.8             10.3          29.9
  HSMM+Ord+Narr         61.2             15.9             17.2          55.0
  HMM+Ord+Narr          60.6             17.0             20.0          55.0

Table 4: Comparison of the semi-Markov and hidden semi-Markov models (SMM and HSMM) with the Markov and hidden Markov models (MM and HMM), which ablate the semi-Markov models' duration model.
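For intuition about the geometric duration claim above: if a state self-transitions with probability p, the induced region length is geometric with mean 1/(1−p). A tiny simulation (illustrative only, not a claim about the learned parameters):

```python
import numpy as np

def sample_region_length(p_self, rng):
    # stay in the state for as long as the self-transition fires
    length = 1
    while rng.random() < p_self:
        length += 1
    return length

rng = np.random.default_rng(0)
lengths = [sample_region_length(0.9, rng) for _ in range(10000)]
print(np.mean(lengths))  # close to 1 / (1 - 0.9) = 10
```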

G Narration Features

The benefit of narration-derived hard constraints on labels (following past work by Zhukov et al. 2019) raises the question of how much narration would help when used to provide features for the models. We obtain narration features for each video using FastText word embeddings (Mikolov et al., 2018) for the video's time-aligned transcribed narration (see Zhukov et al. 2019 for details on this transcription), pooled within a sliding window to allow for imperfect alignment between activities mentioned in the narration and their occurrence in the video. The features for a given timestep t are produced by a weighted sum of embeddings for all the words in the transcribed narration within a 5-second window of t (i.e., from t−2 to t+2), weighted using a Hanning window (numpy.hanning, https://docs.scipy.org/doc/numpy/reference/generated/numpy.hanning.html) so that words in the center of each window are most heavily weighted for that window. We did not tune the window size, or experiment with other weighting functions. The word embeddings are pretrained on Common Crawl, and are not fine-tuned with the rest of the model parameters.

Once these narration features are produced, as above, we treat them in the same way as the other feature types (activity recognition, object classification, and audio) described in Appendix B: reducing their dimensionality with PCA, and concatenating them with the other feature groups to produce the features x_t.

In Table 5, we show performance of key supervised and weakly-supervised models on the validation set, when using these narration features in addition to activity recognition, object detection, and audio features. Narration features improve performance over the corresponding systems from Table 2 (differences are shown in parentheses) in 13 out of 15 cases, typically by 1-4%.

Model                All Label Acc.   Step Label Acc.   Step Recall
Supervised
  Gaussian mixture      70.4 (+1.0)      43.7 (+3.1)      34.9 (+3.4)
  SMM, generative       63.3 (+2.8)      53.2 (+3.8)      32.1 (+3.4)
Weakly-Supervised
  HSMM+Ord              53.6 (-1.9)       9.5 (+1.2)       8.5 (+1.2)
  HSMM+Narr             68.9 (+3.2)       8.0 (-1.6)      12.6 (+4.1)
  HSMM+Narr+Ord         64.3 (+3.1)      17.9 (+2.0)      21.9 (+4.7)

Table 5: Performance of key supervised and weakly-supervised models on the validation data when adding narration vectors as features. Numbers in parentheses give the change from adding narration vectors to the systems from Table 2.

Figure 4: t-SNE visualizations of frame feature vectors (see Appendix H). (a) Feature vectors colored by their step label in the reference annotations. (b) Feature vectors colored by the id of the video they occur in.
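A minimal sketch of the pooling described above, assuming words have already been aligned to (fractional) timesteps and embedded with FastText; variable names and the alignment representation are illustrative assumptions.

```python
# Hanning-weighted sum of word embeddings within a 5-timestep window of t.
import numpy as np

def narration_features(word_times, word_embeddings, num_timesteps, window=5):
    # word_times: array of word midpoints (in timesteps); word_embeddings: [N, D]
    dim = word_embeddings.shape[1]
    feats = np.zeros((num_timesteps, dim))
    taps = np.hanning(window + 2)[1:-1]       # nonzero weights for offsets -2..+2
    offsets = np.arange(window) - window // 2
    for t in range(num_timesteps):
        for off, w in zip(offsets, taps):
            in_window = np.rint(word_times) == t + off   # words at timestep t + off
            feats[t] += w * word_embeddings[in_window].sum(axis=0)
    return feats
```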

H Feature Visualizations

To give a sense for feature similarities both within step types and within a video, we visualize feature vectors for 20 videos randomly chosen from the change a tire task, dimensionality-reduced using t-SNE (Maaten and Hinton, 2008) so that similar feature vectors are close in the visualization.

Figure 4a shows feature vectors colored by step type: we see little consistent clustering of feature vectors by step. On the other hand, we observe a great deal of similarity across step types within a video (see Figure 4b); when we color feature vectors by video, different steps from the same video are close to each other in space. Together, these suggest that better featurization of videos can improve action segmentation.
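A minimal sketch of this visualization procedure (feature loading, colors, and file names are assumptions):

```python
# t-SNE on per-frame feature vectors, colored by step label or by video id.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, colors, outfile):
    # features: [N, D] frame feature vectors; colors: length-N array of ids
    points = TSNE(n_components=2, random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(points[:, 0], points[:, 1], c=colors, s=4, cmap="tab20")
    plt.savefig(outfile, bbox_inches="tight")
```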

I Segmentation Visualizations

In the following pages, we show example segmentations from the various systems. Figures 5 and 6 visualize predicted model segmentations for the unstructured Gaussian mixture and structured semi-Markov model in the supervised setting, in comparison to the ground truth and the ordered uniform baseline. We see that while both models typically make similar predictions in the same temporal regions of the video, the structured model produces steps that are much less fragmented.

Figures 7 and 8 visualize segmentations in the unsupervised and weakly-supervised settings for the HSMM model and the ORDEREDDISCRIM model of Zhukov et al. (2019). The unsupervised HSMM has difficulty distinguishing steps from background (see Appendix H), while the model trained with weak supervision from ordering and narration (HSMM+Ord+Narr) is better able to induce meaningful steps. The ORDEREDDISCRIM model, although it has been modified to allow predicting multiple timesteps per step, collapses to predicting a single label, background, nearly everywhere, which we conjecture is because the model is discriminatively trained: jointly inferring labels that are easy to predict, and the model parameters to predict them.


Figure 5: Supervised segmentations. We visualize segmentations from the validation set for a video from the task make kimchi fried rice. We show the ground truth (GT), the ordered uniform baseline (Uniform), and predictions from the unstructured Gaussian mixture model (GMM) and the structured semi-Markov model (SMM) trained in the supervised setting. Predictions from the unstructured model are more fragmented than predictions from the SMM. The x-axis gives the timestep in the video.


Figure 6: Supervised segmentations. We visualize segmentations from the validation set for a video from the task build simple floating shelves. We show the ground truth (GT), the ordered uniform baseline (Uniform), and predictions from the unstructured Gaussian mixture model (GMM) and the structured semi-Markov model (SMM) trained in the supervised setting. Predictions from the unstructured model are more fragmented than predictions from the SMM. The x-axis gives the timestep in the video.


Figure 7: Unsupervised and weakly-supervised segmentations. We visualize segmentations from the validation set for a video from the task make pancakes. We show the ground truth (GT), the ordered uniform baseline (Uniform), and predictions from the hidden semi-Markov model trained without constraints (HSMM) and with constraints from narration and ordering (HSMM+Narr+Ord), as well as the system of Zhukov et al. (2019). The x-axis gives the timestep in the video.


Figure 8: Unsupervised and weakly-supervised segmentations. We visualize segmentations from the validation set for a video from the task grill steak. We show the ground truth (GT), the ordered uniform baseline (Uniform), and predictions from the hidden semi-Markov model trained without constraints (HSMM) and with constraints from narration and ordering (HSMM+Narr+Ord), as well as the system of Zhukov et al. (2019). The x-axis gives the timestep in the video.