

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1482–1492, Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics

Jointly Learning Grounded Task Structures from Language Instruction and Visual Demonstration

Changsong Liu1*, Shaohua Yang1*, Sari Saba-Sadiya1, Nishant Shukla2, Yunzhong He2, Song-Chun Zhu2, and Joyce Y. Chai1

1Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824

2Center for Vision, Cognition, Learning, and Autonomy, University of California, Los Angeles, CA 90095

{cliu,yangshao,sadiyasa,jchai}@cse.msu.edu, {shukla,yunzhong}@cs.ucla.edu, [email protected]

* The first two authors contributed equally to this paper.

Abstract

To enable language-based communication and collaboration with cognitive robots, this paper presents an approach where an agent can learn task models jointly from language instruction and visual demonstration using an And-Or Graph (AoG) representation. The learned AoG captures a hierarchical task structure where linguistic labels (for language communication) are grounded to corresponding state changes in the physical environment (for perception and action). Our empirical results on a cloth-folding domain have shown that, although state detection through visual processing is uncertain and error-prone, the agent is able to learn an effective AoG for task representation through a tight integration with language. The learned AoG can be further applied to infer and interpret on-going actions from new visual demonstrations using linguistic labels at different levels of granularity.

1 Introduction

Given tremendous advances in robotics, computer vision, and natural language processing, a new generation of cognitive robots has emerged that aims to collaborate with humans in joint tasks. To facilitate natural and efficient communication with these physical agents, natural language processing needs to go beyond traditional symbolic representations and instead ground language to the sensors (e.g., visual perception) and actuators (e.g., lower-level control systems) of physical agents. The internal task representation needs to capture both higher-level concepts (for language communication) and lower-level visual features (for perception and action).

To address this need, we have developed an approach to learning procedural tasks jointly from language instruction and visual demonstration. In particular, we use an And-Or Graph (AoG), which has been used in many computer vision tasks and robotic applications (Zhao and Zhu, 2013; Li et al., 2016; Xiong et al., 2016), to represent a hierarchical task model that captures not only symbolic concepts (extracted from language instructions) but also the corresponding visual state changes in the physical environment (detected by computer vision algorithms).

Different from previous work that grounds language to perception (Liu et al., 2012; Matuszek et al., 2012; Kollar et al., 2013; Yu and Siskind, 2013; Yang et al., 2016), a key innovation in our framework is that language is no longer grounded just to perceived objects in the environment, but is further grounded to a hierarchical structure of state changes, where the states are perceived from the environment during visual demonstration. The state of the environment is an important notion in robotic systems, as the change of states drives planning for lower-level robotic actions. Thus, by connecting language concepts to state changes, our learned AoG provides a unified representation that integrates language and vision to not only support language-based communication but also facilitate robot action planning and execution in the future.

More specifically, within this AoG framework, we have developed and evaluated our algorithms in the context of learning a cloth-folding task. Although cloth-folding appears simple and intuitive for humans, it represents significant challenges for both vision and robotics systems. Furthermore, although symbolic language processing in this domain is easy due to the limited vocabulary, grounded language understanding is particularly challenging: a simple phrase (e.g., "fold in half") could have different grounded meanings (e.g., lower-level representations) given different contexts. Thus, the cloth-folding domain is a good starting point to focus on grounding language to task structures.

Our empirical results have shown that, although state detection from the physical world can be extremely noisy, our learning algorithm, which tightly incorporates language, is capable of acquiring an effective and meaningful task model that compensates for the uncertainties in visual processing. Once the AoG for the task is learned, it can be applied by our inference algorithm, for example, to infer on-going actions from new visual demonstrations and generate linguistic labels at different levels of granularity to facilitate human-agent communication.

2 Related Work

Recent years have seen an increasing amount of work on grounding language to visual perception (Liu et al., 2012; Matuszek et al., 2012; Yu and Siskind, 2013; Kollar et al., 2013; Naim et al., 2015; Yang et al., 2016; Gao et al., 2016). Furthermore, the robotics community has made significant efforts to utilize novel grounding techniques to facilitate task execution given natural language instructions (Chen et al., 2010; Kollar et al., 2010; Tellex et al., 2011; Misra et al., 2014) and task learning from demonstration (Saunders et al., 2006; Chernova and Veloso, 2008).

Figure 1: The setting of our situated task learning, where a human teacher teaches the robot how to fold a T-shirt through both task demonstrations and language instructions.

Research on Learning from Demonstration (LfD) has employed various approaches to model tasks (Argall et al., 2009), such as state-to-action mapping (Chernova and Veloso, 2009), predicate calculus (Hofmann et al., 2016), and Hierarchical Task Networks (Nejati et al., 2006; Hogg et al., 2009). However, aspiring to enable human-robot communication, the framework developed in this paper focuses on task representation using language grounded to a structure of state changes detected from the physical world. As demonstrated in recent work (She et al., 2014a; Misra et al., 2015; She and Chai, 2016), explicitly modeling the change of states is an important step towards interacting with robots in the physical world.

Additionally, there has been an increasing amount of work that learns new tasks either using methods like supervised learning on large corpora of data (Branavan et al., 2010; Branavan et al., 2012; Tellex et al., 2014; Misra et al., 2014), or by learning from humans through dialogue (Cantrell et al., 2012; Mohan et al., 2013; Kirk and Laird, 2013; She et al., 2014b; Mohseni-Kabir et al., 2015). In this paper, we focus on jointly learning new tasks through visual demonstration and language instruction. The learned task model is explicitly represented by an AoG, a hierarchical structure consisting of both linguistic labels and corresponding changes of states from the physical world. This rich task model will facilitate not only language-based communication, but also lower-level action planning and execution.

3 Task and Data

In this paper, we use cloth-folding (e.g., teaching a robot how to fold a T-shirt) as the task to demonstrate and evaluate our joint task learning approach. As mentioned earlier, cloth-folding, although simple for humans, represents a challenging task for the robotics community due to the complex state and action space.

Figure 1 illustrates the setting for our situated task learning. A human teacher can teach a robot how to fold a T-shirt through simultaneous verbal instructions and visual demonstrations. A Microsoft Kinect2 camera is mounted on the robot to record the human's visual demonstration, and the human's corresponding verbal instructions are recorded by the Kinect2's embedded microphone array.

Figure 2: Examples of our parallel data, where language instructions are paired with a sequence of visual states detected from the video.

A recorded video of a task demonstration and its corresponding verbal instruction become one training example for our task learning system. Figure 2 shows two examples of such "parallel data". The visual demonstration is processed into a sequence of visual states, where each state is a numeric vector (v_i) capturing the visual features of the T-shirt at a particular time (see Section 5.1 for details). The recorded verbal instructions are then aligned with the sequence of visual states based on the timing information.

While teaching the task, we specifically asked the demonstrator to describe and perform each step at roughly the same time, which greatly simplified the alignment problem. Since our ultimate goal is to enable humans to teach the robot through natural language dialogue and demonstration, our hypothesis is that the alignment issue can be alleviated by certain dialogue mechanisms (e.g., asking to repeat the action, asking for step-by-step aligned instructions, etc.). As it is in the human's best interest that the robot gets the clearest instructions, we also anticipate that during dialogue human teachers will be collaborative and provide mostly aligned instructions. Certainly, these hypotheses will need to be validated in the dialogue setting in our future work.

In our collected data, each change of state, i.e., a transition between two visual states, is caused by one or more physical actions. Some language descriptions align with only a single-step change of state. For instance, "fold right sleeve" is aligned with the change (v_0 → v_1) and "fold left sleeve" is aligned with (v_1 → v_2) in Example 1. This kind of single-step change of state is considered a primitive action. Other language descriptions are aligned with a sequence of multiple state changes. For instance, "fold the two sleeves" in Example 2 is aligned with two consecutive changes: (v′_0 → v′_1 → v′_2). This kind of sequence of state changes is considered a complex action, which can be decomposed into partially ordered primitive actions. A complex action can also be concisely represented by the change from the initial state to the end state of the sequence, such as (v′_0 → v′_2) in Example 2.

These parallel data are used to train and test our learning and inference algorithms presented later.

4 And-Or Graph Representation

We use an AoG as the formal model of a procedural task. Figure 3 shows an example AoG for the cloth-folding task. It is a hierarchical structure that explicitly captures the compositionality and reconfigurability of a procedural task. The terminal nodes capture state changes associated with primitive actions of the task, and the non-terminal nodes capture state changes associated with complex actions, which are further composed of lower-level actions.

In addition to state changes, the learned AoG is also labeled with linguistic information (e.g., verb frames) capturing the causes of the corresponding state changes. The state changes are also considered the grounded meanings of these verb frames. For example, Figure 3 shows two "fold the t-shirt" labels at the top layer. Note that, although these two phrases have the same symbolic meaning (e.g., the same verb frames), their grounded meanings are different, as they correspond to different changes of state. Being able to represent such differences or ambiguities in grounded meanings is crucial for connecting language to perception and action.

Figure 3: An example of the learned AoG.

Formally, an AoG is defined as a 5-tuple G = (S, Σ, N, R, Θ), where

• S is a root node (or a start symbol) representing a complete task.

• Σ is a set of terminal nodes, each of which represents a change of state associated with a primitive action.

• N = N_AND ∪ N_OR is a set of non-terminal nodes, which is divided into two disjoint subsets of And-nodes and Or-nodes.

• R is a "child-parent" relation (many-to-one mapping), i.e., R(n_ch) = n_pa (meaning n_pa is the parent node of n_ch), where n_ch ∈ Σ ∪ N and n_pa ∈ N ∪ {S}.

• Θ is a set of conditional probabilities p(n_ch | n_pa), where n_pa ∈ N_OR and n_ch ∈ {n | R(n) = n_pa}. Namely, for each Or-node, Θ defines a probability distribution over the set of all its children nodes (see the sketch after this list).
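To make the definition concrete, a minimal sketch of how such a 5-tuple could be encoded is given below. The class and field names are our own illustrative choices, not the paper's implementation; states are assumed to already be symbolic cluster IDs (see Section 5.2.1).

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# A "state" here is a symbolic cluster ID produced by the clustering of Section 5.2.1.
State = str

@dataclass(frozen=True)
class Terminal:
    """A terminal node: a change of state caused by a primitive action."""
    change: Tuple[State, State]        # (initial state, end state)
    label: str = ""                    # optional linguistic label (verb frame)

@dataclass
class AndNode:
    """An And-node: a complex action decomposed into sequentially ordered sub-actions."""
    change: Tuple[State, State]
    children: List["Node"] = field(default_factory=list)   # order matters
    label: str = ""

@dataclass
class OrNode:
    """An Or-node: alternative ways of achieving the same change of state."""
    change: Tuple[State, State]
    children: List["Node"] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)        # Theta: p(child | this Or-node)

Node = Union[Terminal, AndNode, OrNode]

@dataclass
class AoG:
    """G = (S, Sigma, N, R, Theta); R and Theta are implicit in the child lists above."""
    root: Node                                   # S
    terminals: List[Terminal]                    # Sigma
    nonterminals: List[Union[AndNode, OrNode]]   # N = N_AND ∪ N_OR
```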

In essence, our AoG model is equivalent to a Probabilistic Context-Free Grammar (PCFG). An AoG can be converted into a PCFG as follows:

• Each And-node and its children form a production rule

  n_AND → n_ch1 ∧ n_ch2 ∧ . . .

  that represents the decomposition of a complex action into sequentially ordered sub-actions.

• Each Or-node and its children form a production rule

  n_OR → n_ch1 | n_ch2 | . . .

  that represents all the alternative ways of accomplishing an action. Each alternative also comes with a probability, as specified in Θ (see the conversion sketch after this list).
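As an illustration of this equivalence, the toy sketch below enumerates PCFG production rules from a hypothetical And/Or structure for the cloth-folding task; all node names and probabilities are invented for the example and do not come from the learned model.

```python
from typing import Dict, List, Tuple

# Hypothetical lightweight encoding of an AoG, for illustration only:
# each And-node maps its name to an ordered list of child names;
# each Or-node maps its name to a list of (child name, probability) branches.
and_nodes: Dict[str, List[str]] = {
    "fold_tshirt": ["fold_sleeves", "fold_in_half"],
    "fold_sleeves": ["fold_right_sleeve", "fold_left_sleeve"],
}
or_nodes: Dict[str, List[Tuple[str, float]]] = {
    "fold_in_half": [("fold_bottom_up", 0.7), ("fold_top_down", 0.3)],
}

def to_pcfg(and_nodes, or_nodes) -> List[Tuple[str, Tuple[str, ...], float]]:
    """Emit PCFG rules (lhs, rhs, probability) from the And/Or structure."""
    rules = []
    for lhs, children in and_nodes.items():
        # And-node: one rule expanding into the ordered sequence of children.
        rules.append((lhs, tuple(children), 1.0))
    for lhs, branches in or_nodes.items():
        # Or-node: one rule per alternative, weighted by Theta.
        for child, prob in branches:
            rules.append((lhs, (child,), prob))
    return rules

for lhs, rhs, p in to_pcfg(and_nodes, or_nodes):
    print(f"{lhs} -> {' '.join(rhs)}  [{p}]")
```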

5 Method

5.1 Vision and Language Processing

The input data to our AoG learning algorithm consist of co-occurring visual demonstrations and language instructions, as described in Section 3. Based on the RGB-D information provided by the Kinect2 sensor, we developed a vision processing system to keep track of the human's actions and the status of the T-shirt object.

To learn a meaningful task structure, the most important visual information is the set of key statuses that the object goes through. Therefore, our vision system processes each visual demonstration into a sequence of states. Each state v is a multi-dimensional numeric vector that encodes the geometric information of the detected T-shirt, such as its smallest bounding rectangle and largest inscribed contour-fitting rectangle. These key states are detected by tracking the human's folding actions: whenever a folding action is detected [1], we append the new state caused by the action to the sequence of observed states, until the end of the demonstration.

[1] The vision system keeps track of the human's hands, and detects a folding action as a gripping action followed by moving and releasing the hand(s).

The verbal instructions given by the demonstrators were mainly verb phrases such as "fold which-part", "fold to which-position", or "fold in-what-manner". A semantic parser [2] is applied to parse each instruction text into a canonical verb-frame representation, such as

FOLD: [PART: left sleeve] [POSITION: middle].

[2] We use CMU's Phoenix parser: http://wiki.speech.cs.cmu.edu/olympus/index.php/Phoenix
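The paper relies on CMU's Phoenix parser for this step. Purely as a stand-in to show the kind of slot-filled frame assumed, the sketch below uses a naive keyword matcher; the slot vocabularies and function name are hypothetical, not Phoenix itself.

```python
import re
from typing import Dict

# Illustrative slot vocabularies for the cloth-folding domain (not from the paper).
PARTS = ["right sleeve", "left sleeve", "two sleeves", "bottom", "top"]
POSITIONS = ["middle", "half", "center"]

def parse_instruction(text: str) -> Dict[str, str]:
    """Map an instruction to a canonical verb-frame, e.g.
    'fold the left sleeve to the middle' ->
    {'verb': 'FOLD', 'PART': 'left sleeve', 'POSITION': 'middle'}."""
    text = text.lower()
    frame: Dict[str, str] = {}
    if re.search(r"\bfold\b", text):
        frame["verb"] = "FOLD"
    for part in PARTS:
        if part in text:
            frame["PART"] = part
            break
    for pos in POSITIONS:
        if pos in text:
            frame["POSITION"] = pos
            break
    return frame

print(parse_instruction("Fold the left sleeve to the middle"))
# {'verb': 'FOLD', 'PART': 'left sleeve', 'POSITION': 'middle'}
```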

Through the vision and language processing, each task demonstration becomes two parallel sequences, i.e., a sequence of extracted visual states and a sequence of parsed language instructions. The alignment between these two sequences is also extracted from their co-occurrence timing information. Thus, an instance of a task demonstration is formally represented as a 3-tuple x = (D, L, Δ), where D = {v_1, v_2, . . . , v_M} is the sequence of visual states, L = {l_1, l_2, . . . , l_K} is the sequence of linguistic verb-frames, and Δ(k) = (i, j) is an "alignment function" specifying the correspondence between a linguistic verb-frame l_k and a single visual state or a sub-sequence of visual states {v_i, . . . , v_j} (i ≤ j).
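A minimal sketch of this 3-tuple follows, assuming a simple timestamp-overlap rule for deriving Δ from the co-occurrence timing information; the field names and the overlap heuristic are our assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Demonstration:
    """x = (D, L, Delta) for one recorded demonstration."""
    states: List[List[float]]              # D: visual state vectors v_1 ... v_M
    frames: List[dict]                     # L: parsed verb frames l_1 ... l_K
    alignment: Dict[int, Tuple[int, int]]  # Delta(k) = (i, j), with i <= j

def align_by_time(frame_spans: List[Tuple[float, float]],
                  state_times: List[float]) -> Dict[int, Tuple[int, int]]:
    """Derive Delta by mapping each instruction's time span to the range of
    state indices whose timestamps fall inside it (illustrative heuristic)."""
    delta = {}
    for k, (start, end) in enumerate(frame_spans):
        covered = [i for i, t in enumerate(state_times) if start <= t <= end]
        if covered:
            delta[k] = (covered[0], covered[-1])
    return delta

# Example: two instructions, three detected states at times 1.0, 4.0, 7.0.
print(align_by_time([(0.0, 5.0), (5.0, 9.0)], [1.0, 4.0, 7.0]))
# {0: (0, 1), 1: (2, 2)}
```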

Then, given a dataset X of such task demonstrations, our AoG learning algorithm learns an AoG G as defined in Section 4. The next section describes our learning algorithm in detail.

5.2 AoG Learning Algorithm

Learning an AoG G = (S, Σ, N, R, Θ) is carried out in two stages. First, we learn the set of terminal nodes Σ to represent the primitive actions (i.e., the actions that can be performed in a single step); this is done by clustering the observed visual states. Second, the hierarchical structure (i.e., N and R) and the parameters Θ of the AoG are learned using an iterative grammar induction algorithm.

5.2.1 Learning Terminal Nodes

A terminal node in the AoG represents a primitive action, which causes the object to change directly from one state to another. Thus we represent a terminal node as a 2-tuple of states (a "change of state"). Since the visual states detected by computer vision are numeric vectors with continuous values, we first apply a clustering algorithm to form a finite set of discrete state representations. Each cluster then represents a unique situation that one can encounter in the task. Since we usually do not know how many unique situations exist when learning a new task, we employ a greedy clustering algorithm (Ng and Cardie, 2002), which does not assume a fixed number of clusters.

As the greedy clustering algorithm relies on the pairwise similarities of all the visual states, we also train an SVM classifier on a separate dataset of 22 T-shirt folding videos and use its classification output to measure the similarity between two visual states. The SVM classifier takes two numeric vectors as input and predicts whether these two vectors represent the same status of a T-shirt. We then apply this SVM classifier to each pair of detected visual states in our new dataset (i.e., the dataset for learning the AoG), and use the SVM's output class label (1 or −1) multiplied by its classification confidence score as the similarity measurement between two visual states.

After clustering all the observed visual states in the data, we replace each numeric vector state representation with the "ID" of the cluster it belongs to. Thus each visual demonstration now becomes a sequence of symbolic values, denoted as D′ = {s_1, s_2, . . . , s_M}. We further transform it into an equivalent change-of-state sequence C = {(s_1, s_2), (s_2, s_3), . . . , (s_M−1, s_M)}, in which each change of state essentially represents a primitive action in this task. These change-of-state pairs then form the set of terminal nodes Σ.
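A minimal sketch of this stage is given below: greedy clustering with a pluggable pairwise similarity function (standing in for the SVM label-times-confidence score), followed by conversion of the symbolic sequence into change-of-state pairs. The threshold and the toy 1-D states are assumptions for illustration.

```python
from typing import Callable, List, Tuple

def greedy_cluster(states: List, similar: Callable[[object, object], float],
                   threshold: float = 0.0) -> List[int]:
    """Assign each state to the first existing cluster whose representative is
    similar enough; otherwise start a new cluster (no fixed number of clusters).
    `similar` stands in for the SVM-based similarity score of Section 5.2.1."""
    reps, ids = [], []
    for v in states:
        for cid, rep in enumerate(reps):
            if similar(v, rep) > threshold:
                ids.append(cid)
                break
        else:
            reps.append(v)
            ids.append(len(reps) - 1)
    return ids

def to_changes(symbolic: List[int]) -> List[Tuple[int, int]]:
    """C = {(s_1, s_2), (s_2, s_3), ...}: each pair represents a primitive action."""
    return list(zip(symbolic, symbolic[1:]))

# Toy example with 1-D "states" and a distance-based similarity.
seq = [0.1, 0.12, 0.9, 0.88, 0.5]
ids = greedy_cluster(seq, similar=lambda a, b: 1.0 - abs(a - b), threshold=0.8)
print(ids)              # [0, 0, 1, 1, 2]
print(to_changes(ids))  # [(0, 0), (0, 1), (1, 1), (1, 2)]
```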

5.2.2 Learning the Structure and Parameters

With the sequences of numeric vector states replaced by the "symbolic" change-of-state sequences in the first stage, we can further learn the structure and parameters of the AoG, namely the N, R, and Θ that maximize the posterior probability:

argmax_{N,R,Θ} P(N, R, Θ | X, Σ) = argmax_{N,R,Θ} P(N, R | X, Σ) P(Θ | X, Σ, N, R).

Following the iterative grammar induction paradigm (Tu et al., 2013; Xiong et al., 2016), we employ an iterative procedure that alternately solves argmax_{N,R} P(N, R | X, Σ) and argmax_Θ P(Θ | X, Σ, N, R).

To solve the first term, we use greedy or beam search with a heuristic function similar to (Solan et al., 2005). To solve the second term, we estimate the probability of each branch of an Or-node by computing the frequency of that branch, which is essentially a maximum likelihood estimation similar to (Pei et al., 2013).

In detail, the learning procedure first initializes empty N, R, and Θ, and then iterates through the following two steps until no further update can be made.

Step (1): search for new And-nodes.

This step searches for new And-node candidates from Σ ∪ N, and updates N and R with the top-ranked candidates. Specifically, we denote an And-node candidate to be searched as A = (s_l → s_m → s_r).


Here s_l is the initial state of an existing node whose end state is s_m, and s_r is the end state of another existing node whose initial state is s_m. Thus an And-node candidate always has two child nodes, and represents a pattern of sub-sequences that starts from state s_l, ends at s_r, and has s_m occurring somewhere in the middle.

Using the above notation, the heuristic function for ranking And-node candidates is defined as

h(A) = (1 − λ) P_state(A) + λ P_label(A),

where P_state(A) captures the prevalence of a particular And-node candidate based on the observed state-change sequences:

P_state(A) = (P_R(A) + P_L(A)) / 2,

and P_R(A) is the ratio between the number of times (s_l → s_m → s_r) appears and the number of times (s_l → s_m) appears, and P_L(A) is the ratio between the number of times (s_l → s_m → s_r) appears and the number of times (s_m → s_r) appears.

The component P_label(A) captures the prevalence of linguistic labels associated with the sequential state-change patterns. It is computed as the ratio between the number of times (s_l → s_m → s_r) co-occurs with a linguistic instruction [3] and the total number of times (s_l → s_m → s_r) appears.

[3] Such information is encoded in the Δ function mentioned in Section 5.1.
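The heuristic can be computed directly from counts over the symbolic state sequences, assuming for simplicity that a candidate's children correspond to adjacent transitions. The sketch below transcribes the ratios defined above; the encoding of instruction-covered spans (standing in for the information carried by Δ) is an assumption. The weight λ is 0.5 in the tight setting and 0 in the loose setting described next.

```python
from typing import List, Set, Tuple

def and_node_score(sl, sm, sr, sequences: List[List[str]],
                   labeled_spans: Set[Tuple[int, int, int]], lam: float = 0.5) -> float:
    """h(A) = (1 - lam) * P_state(A) + lam * P_label(A) for candidate A = (sl -> sm -> sr).
    `labeled_spans` holds (sequence index, start, end) spans that co-occur with an
    instruction (the information carried by Delta); its exact form is an assumption."""
    tri = bi_left = bi_right = labeled = 0
    for d, seq in enumerate(sequences):
        for i in range(len(seq) - 1):
            if seq[i] == sl and seq[i + 1] == sm:
                bi_left += 1
            if seq[i] == sm and seq[i + 1] == sr:
                bi_right += 1
        for i in range(len(seq) - 2):
            if (seq[i], seq[i + 1], seq[i + 2]) == (sl, sm, sr):
                tri += 1
                if (d, i, i + 2) in labeled_spans:
                    labeled += 1
    if tri == 0:
        return 0.0
    p_state = (tri / bi_left + tri / bi_right) / 2   # (P_R(A) + P_L(A)) / 2
    p_label = labeled / tri
    return (1 - lam) * p_state + lam * p_label

# Tight integration: lam=0.5; loose integration: lam=0.0.
seqs = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
print(and_node_score("a", "b", "c", seqs, labeled_spans={(0, 0, 2)}, lam=0.5))
# 0.5 * ((2/3 + 2/2) / 2) + 0.5 * (1/2) ≈ 0.667
```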

We define two AoG learning settings based on the role that language plays:

• Tight language integration: incorporate the heuristics on linguistic labels (i.e., λ = 0.5). In this setting, the learned AoG prefers And-nodes that not only occur frequently, but also can be described by a linguistic label.

• Loose language integration: do not incorporate the heuristics on linguistic labels (λ = 0). Each And-node is learned only based on the frequency of its state-change pattern. The learned node can still acquire a linguistic label if there happens to be a co-occurring one, but the chance is lower than in the "tight" setting.

Step (2): search for new Or-nodes or new branches of existing Or-nodes, then update Θ.

Once new And-nodes are added by the previous step, the next step is to search for Or-nodes that can be created or updated. An Or-node in the AoG essentially represents the set of all And-nodes that share the same initial and end states, denoted here as (s_l → s_r) (s_l and s_r are the common initial and end states, respectively). Suppose (s_l → s′_m → s_r) is a newly added And-node; it is then assigned as a child of the Or-node (s_l → s_r). To further update Θ, the branching probability is computed as the ratio between the number of times (s_l → s′_m → s_r) appears and the number of times (s_l → s_r) appears.
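A sketch of this update under stated assumptions: And-node candidates are grouped by their shared (initial, end) state pair, and Θ is estimated by maximum likelihood, approximating the count of (s_l → s_r) by the sum of its branches' counts. The dictionary encoding is illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def build_or_nodes(and_counts: Dict[Tuple[str, str, str], int]):
    """Group And-nodes (s_l -> s_m -> s_r) by their (s_l, s_r) pair and estimate
    Theta as branch frequency / total frequency of the Or-node (MLE)."""
    groups: Dict[Tuple[str, str], List[Tuple[str, int]]] = defaultdict(list)
    for (sl, sm, sr), count in and_counts.items():
        groups[(sl, sr)].append((sm, count))
    or_nodes = {}
    for (sl, sr), branches in groups.items():
        total = sum(c for _, c in branches)
        or_nodes[(sl, sr)] = {sm: c / total for sm, c in branches}
    return or_nodes

# Two alternative ways of getting from state "unfolded" to "half_folded".
counts = {("unfolded", "sleeves_folded", "half_folded"): 6,
          ("unfolded", "bottom_folded", "half_folded"): 2}
print(build_or_nodes(counts))
# {('unfolded', 'half_folded'): {'sleeves_folded': 0.75, 'bottom_folded': 0.25}}
```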

5.3 Inference Using AoG

Once a task AoG is learned, it can be applied to interpret and explain new visual demonstrations using linguistic labels. Due to the noise and uncertainties of computer vision processing, one key challenge in interpreting a visual demonstration is to reliably identify the different states of the T-shirt.

To tackle this issue, we formulate a joint inference problem. Namely, given a task demonstration video, we first process it into a sequence of numeric vector states D = {v_1, v_2, . . . , v_M} as described in Section 5.1. Then the goal of inference is to find the most likely parse tree T and sequence of "symbolic states" D′ = {s_1, s_2, . . . , s_M} based on the AoG G and the input D:

(T*, D′*) = argmax_{T, D′} P(T, D′ | G, D).

We apply a chart parsing algorithm (Klein and Manning, 2001) to efficiently solve this problem. Furthermore, to accommodate the ambiguities in mapping a numeric vector state v_m to a symbolic state, we take into consideration the top-k hypotheses measured by the similarity between v_m and a symbolic state s_k [4]. For each state-mapping hypothesis, we add a completed edge between indices m and m + 1 in the chart, with s_k as its symbol and a probability p based on the similarity between v_m and s_k. Based on the given AoG, the chart parsing algorithm then uses dynamic programming to search for the parse tree that maximizes the joint probability P(T, D′ | G, D).

[4] A symbolic state is represented by a cluster of numeric vector states learned from the training data.
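We do not reproduce a full chart parser here, but the sketch below illustrates how the top-k state-mapping hypotheses could seed the chart with weighted single-state edges before parsing. The Euclidean similarity and the softmax conversion from similarity to probability are assumptions, not the paper's exact choices.

```python
import math
from typing import Dict, List, Tuple

def seed_chart(vectors: List[List[float]],
               clusters: Dict[str, List[float]], k: int = 5):
    """For each detected vector v_m, add up to k weighted edges (m, m+1, symbol, prob)
    to the chart, one per candidate symbolic state, ranked by similarity."""
    def similarity(a: List[float], b: List[float]) -> float:
        # Negative Euclidean distance as a stand-in for the learned similarity.
        return -math.dist(a, b)

    edges: List[Tuple[int, int, str, float]] = []
    for m, v in enumerate(vectors):
        scored = sorted(((similarity(v, rep), s) for s, rep in clusters.items()),
                        reverse=True)[:k]
        # Convert scores to probabilities with a softmax over the top-k hypotheses.
        z = sum(math.exp(sc) for sc, _ in scored)
        for sc, symbol in scored:
            edges.append((m, m + 1, symbol, math.exp(sc) / z))
    return edges

clusters = {"s0": [0.0, 0.0], "s1": [1.0, 0.0], "s2": [0.0, 1.0]}
for edge in seed_chart([[0.1, 0.0], [0.9, 0.1]], clusters, k=2):
    print(edge)
```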

Figure 4 illustrates the input and output of our inference algorithm. As illustrated by this example, the parse tree represents a hierarchical structure underlying the observed task procedure, and the linguistic labels associated with the nodes can be used to describe the primitive and complex actions involved in the procedure.

Figure 4: An illustration of the input and output of our AoG-based inference algorithm.

6 Evaluation

Using the setting described in Section 3, we collected 45 T-shirt folding demonstrations from 6 people to evaluate our AoG learning and inference methods. More specifically, we conducted a 5-fold cross validation. In each fold, 36 demonstrations were used for training to learn a task AoG. The remaining 9 demonstrations were used for testing, in which the learned AoG was further applied to process each of the testing visual demonstrations.

Motivated by earlier work on plan/activity recognition using CFG-based models (Carberry, 1990; Pynadath and Wellman, 2000), we use an extrinsic task that automatically assigns linguistic labels to new demonstrations to evaluate the quality of the learned AoG and the effectiveness of the inference algorithm. This involves three steps: (1) parse the video using the learned AoG; (2) identify linguistic labels associated with terminal or nonterminal nodes in the parse tree; and (3) compare the identified linguistic labels with the manually annotated labels.

We conduct the evaluation at two levels:

• Primitive actions: use linguistic labels associated with terminal nodes to describe the primitive actions in each video. This level provides detailed descriptions of how the observed task procedure is performed step by step.

• Complex actions: use linguistic labels associated with nonterminal nodes to describe complex actions. This level provides a high-level "summary" of the detailed low-level actions.
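Under assumptions about how the parse tree is encoded, the label-based evaluation loop can be sketched as below: it reads off labels at either granularity and pairs them with the gold annotations for scoring. `parse_with_aog` is a placeholder for the inference procedure of Section 5.3, and the nested-dict tree encoding is hypothetical.

```python
from typing import Callable, Dict, List

def collect_labels(node: Dict, level: str) -> List[str]:
    """Walk a parse tree (encoded as nested dicts) and collect linguistic labels
    from terminal nodes ("primitive") or nonterminal nodes ("complex")."""
    labels = []
    is_terminal = not node.get("children")
    if node.get("label") and ((level == "primitive") == is_terminal):
        labels.append(node["label"])
    for child in node.get("children", []):
        labels.extend(collect_labels(child, level))
    return labels

def evaluate(videos: List, gold: List[List[str]],
             parse_with_aog: Callable[[object], Dict], level: str = "primitive"):
    """For each test video: parse with the learned AoG, read off labels at the
    requested granularity, and pair them with the gold annotations for scoring."""
    return [(collect_labels(parse_with_aog(v), level), g)
            for v, g in zip(videos, gold)]

tree = {"label": "fold the t-shirt", "children": [
    {"label": "fold right sleeve", "children": []},
    {"label": "fold left sleeve", "children": []}]}
print(collect_labels(tree, "primitive"))  # ['fold right sleeve', 'fold left sleeve']
print(collect_labels(tree, "complex"))    # ['fold the t-shirt']
```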

The capability to recognize fine-grained primitive actions as well as high-level complex actions in a task procedure, and to communicate them in language, is important for many real-world AI applications such as human-robot collaboration (Mohseni-Kabir et al., 2015) and visual question answering (Tu et al., 2014).

6.1 Primitive Actions

We first compare the performance of interpreting primitive actions using the learned AoG against a baseline. The baseline applies a memory-based (or similarity-based) approach. Given a testing video, it extracts all the different visual states and maps each state to the nearest cluster learned from the training data (see Section 5.2.1). It then pairs every two consecutive states as a change-of-state instance, and uses the linguistic label corresponding to the identical change of state found in the training data as the label of a primitive action.

We measure the primitive action recognition performance in terms of the normalized Minimal Edit Distance (MED). Namely, for each testing demonstration we calculate the MED between the ground-truth sequence of primitive action labels and the automatically generated sequence of labels, and divide the MED value by the length of the ground-truth sequence to produce a normalized score (a smaller score indicates better performance in recognizing the primitive actions).
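A minimal implementation of this score, assuming standard Levenshtein edit distance over label sequences normalized by the ground-truth length:

```python
from typing import Sequence

def normalized_med(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Minimal edit distance between two label sequences, normalized by the
    length of the ground-truth sequence (lower is better)."""
    m, n = len(gold), len(predicted)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if gold[i - 1] == predicted[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution (0 for a match)
    return dist[m][n] / m

gold = ["fold right sleeve", "fold left sleeve", "fold in half"]
pred = ["fold right sleeve", "fold bottom up", "fold in half"]
print(normalized_med(gold, pred))  # 1 substitution / 3 labels ≈ 0.33
```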


Figure 5: Performance of interpreting primitive actions. Different numbers of state mapping hypotheses (k) are used in the inference algorithm. The x-axis is the number of training examples used for learning the AoG.

The performances of the baseline and our AoG-based approach are shown in Figure 5. For the AoG-based approach, Figure 5 also shows the performance of incorporating different numbers of state mapping hypotheses (i.e., k = 1, 5, 10, 15, 20) into the inference algorithm (Section 5.3). Here we only report the performance of the AoG learned with the tight language integration (see Section 5.2.2), since there is no difference in performance between the tight and loose language integration settings in recognizing primitive actions [5].

[5] This is because the linguistic labels generated for primitive actions all come from terminal nodes, while the two AoG learning settings only affect nonterminal nodes.

As Figure 5 shows, the baseline performance is rather weak (i.e., high MED scores). This is largely due to the noise in state clustering and mapping from vision. After manually inspecting the collected demonstration videos, we found 18 unique statuses associated with folding a T-shirt. However, the computer-vision-based clustering on average produces more than 30 clusters when all 36 training examples are used. This makes it difficult to directly match the state changes as in the baseline. For our AoG-based method, when the inference algorithm only takes the single best state mapping hypothesis into consideration (i.e., k = 1), it yields very weak performance because the observed state change sequence often cannot be parsed using the learned AoG.

However, the performance of the AoG-based method improves significantly when multiple state mapping hypotheses are incorporated into the inference process. When the top-5 (k = 5) state mapping hypotheses are incorporated into the AoG-based inference, its MED score already outperforms the baseline by a 0.3 gap (p < 0.001 using the Wilcoxon signed-rank test). When k = 20, the MED score drops by more than 0.6 compared to k = 1 (p < 0.001).

These results indicate that our AoG-based method is capable of learning useful task structure from a small amount of data. When multiple hypotheses of visual state mapping are incorporated, the learned AoG can compensate for the uncertainties in vision processing and identify highly reliable primitive actions from unseen demonstrations.

6.2 Complex Actions

We further evaluate the performance of interpreting complex actions using the learned AoG. The baseline for comparison is similar to the one used in the previous section. It first converts a test video into a sequence of "symbolic" states by mapping each detected visual state to its nearest cluster. It then enumerates all the possible segments that consist of more than two consecutive states and searches for identical segments in the training data. If a matching segment is found, the corresponding linguistic label (if any) is used as the label for a complex action. Since complex actions correspond to nonterminal nodes in the parse tree generated by AoG-based inference, and some of these nodes may have linguistic labels while others may not, we use precision, recall, and F-score to measure how well the generated linguistic labels match the manually segmented and annotated complex actions in the testing videos.
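Precision, recall, and F-score can then be computed as sketched below; for simplicity the matching criterion is exact agreement on (segment, label) pairs, which is our assumption rather than the paper's exact protocol.

```python
from typing import Set, Tuple

# A labeled complex action: (start state index, end state index, linguistic label).
Labeled = Tuple[int, int, str]

def prf(predicted: Set[Labeled], gold: Set[Labeled]) -> Tuple[float, float, float]:
    """Precision/recall/F1 over exactly matching (segment, label) tuples."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = {(0, 2, "fold the two sleeves"), (2, 3, "fold in half")}
gold = {(0, 2, "fold the two sleeves"), (2, 4, "fold in half")}
print(prf(pred, gold))  # (0.5, 0.5, 0.5)
```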

Figure 6 shows the F-scores of recognizing complex actions using the AoG learned from the loose and the tight language integration, respectively. In this figure, results are based on k = 20 state mapping hypotheses incorporated into the inference algorithm. As shown here, the performances from both settings are significantly better than the baseline (p < 0.001). The AoG learned based on the tight integration with language yields significantly better performance than the loose integration (over a 0.2 gain in F-score, p < 0.001).

Figure 6: Performances (F-score) of recognizing complex actions. The lowest curve shows the performance of the baseline. The two other curves represent the performance using the AoG learned from the loose integration and the tight integration with language, respectively (where k = 20 is used in inference).

This result indicates that the tight integration of language during AoG learning favors And-node patterns that are more likely to be described by natural language (or are more consistent with human conceptualization of the task structure) [6]. Such an AoG representation can lead to the recognition of video segments that can be better explained or summarized by human language. This capability of learning explicit and language-oriented task representations is important for linking language and vision to enable situated human-agent communication and collaboration.

[6] By further investigating the learned AoG under the two different settings, we found that the nonterminal nodes learned from the tight language integration setting are more likely to acquire a linguistic label (33%) than the nonterminal nodes learned from the loose setting (18%).

Table 1 further shows the results for different numbers of state mapping hypotheses incorporated into the inference algorithm. As shown there, the trend of performance improvement with increasing k is again observed. When multiple state mapping hypotheses are incorporated in inference, the learned AoG is capable of compensating for uncertainties in vision processing and producing better parses for unseen visual demonstrations.

Table 1: Performance of recognizing complex actions using the AoG learned from the loose and tight integration of language as described in Section 5.2. Different numbers (k) of state mapping hypotheses are used in the inference algorithm.

                   k=1    k=5    k=10   k=15   k=20
Precision  Loose   0.34   0.76   0.79   0.84   0.84
           Tight   0.34   0.80   0.86   0.89   0.90
Recall     Loose   0.12   0.33   0.35   0.38   0.38
           Tight   0.12   0.51   0.59   0.64   0.65
F-Score    Loose   0.17   0.46   0.49   0.52   0.52
           Tight   0.18   0.63   0.70   0.74   0.75

7 Conclusion and Future Work

This paper presents an approach to task learning where an agent can learn a grounded task model from human demonstrations and language instructions. A key innovation of this work is grounding language to a perceived structure of state changes based on an AoG representation. Once the task model is acquired, it can be used as a basis to support collaboration and communication between humans and agents/robots. Using cloth-folding as an example, our empirical results have demonstrated that tightly integrating language with vision can effectively produce task structures in an AoG that generalize well to new demonstrations.

Although we have only made an initial attempt on a small task, our approach can be naturally extended to more complex tasks such as assembly and cooking. Both the AoG representation and the task learning approach are general and applicable to different domains. What needs to be adapted is the representation of the visual states and the computer vision algorithms that detect these states for a specific task.

Grounding language to a structure of perceived state changes will provide an important stepping stone towards integrating language, perception, and action for human-robot communication and collaboration. Currently, our algorithms learn the task structures based on offline parallel data. Our future work will explore incremental learning through human-agent dialogue to acquire grounded task structures.

Acknowledgments

The authors are grateful to Sarah Fillwock and James Finch for their help on data collection and processing, to Mun Wai Lee for his helpful discussions, and to the anonymous reviewers for their valuable comments and suggestions. This work was supported in part by N66001-15-C-4035 from the DARPA SIMPLEX program, and IIS-1208390 and IIS-1617682 from the National Science Foundation.


References

Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483.

S. R. K. Branavan, Luke S. Zettlemoyer, and Regina Barzilay. 2010. Reading between the lines: Learning to map high-level instructions to commands. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1268–1277.

S. R. K. Branavan, Nate Kushman, Tao Lei, and Regina Barzilay. 2012. Learning high-level planning from text. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pages 126–135.

Rehj Cantrell, J. Benton, Kartik Talamadupula, Subbarao Kambhampati, Paul Schermerhorn, and Matthias Scheutz. 2012. Tell me when and why to do it! Run-time planner model updates via natural language instruction. In 7th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 471–478.

Sandra Carberry. 1990. Plan Recognition in Natural Language Dialogue. MIT Press.

David L. Chen, Joohyun Kim, and Raymond J. Mooney. 2010. Training a multilingual sportscaster: Using perceptual context to learn language. Journal of Artificial Intelligence Research, 37(1):397–436.

Sonia Chernova and Manuela Veloso. 2008. Teaching multi-robot coordination using demonstration of communication and state sharing. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3, pages 1183–1186.

Sonia Chernova and Manuela Veloso. 2009. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34(1):1.

Qiaozi Gao, Malcolm Doering, Shaohua Yang, and Joyce Y. Chai. 2016. Physical causality of action verbs in grounded language understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers, pages 1814–1824.

Till Hofmann, Tim Niemueller, Jens Claßen, and Gerhard Lakemeyer. 2016. Continual planning in Golog. In Thirtieth AAAI Conference on Artificial Intelligence.

Chad Hogg, Ugur Kuter, and Hector Munoz-Avila. 2009. Learning hierarchical task networks for nondeterministic planning domains. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI), pages 1708–1714.

James R. Kirk and John E. Laird. 2013. Learning task formulations through situated interactive instruction. In Proceedings of the Second Annual Conference on Advances in Cognitive Systems (ACS), volume 219, page 236.

Dan Klein and Christopher D. Manning. 2001. An O(n³) agenda-based chart parser for arbitrary probabilistic context-free grammars. Stanford Technical Report.

Thomas Kollar, Stefanie Tellex, Deb Roy, and Nicholas Roy. 2010. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 259–266.

Thomas Kollar, Jayant Krishnamurthy, and Grant P. Strimel. 2013. Toward interactive grounded language acquisition. In Robotics: Science and Systems.

Bo Li, Tianfu Wu, Caiming Xiong, and Song-Chun Zhu. 2016. Recognizing car fluents from video. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Changsong Liu, Rui Fang, and Joyce Y. Chai. 2012. Towards mediating shared perceptual basis in situated dialogue. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 140–149.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the 29th International Conference on Machine Learning (ICML).

Dipendra K. Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2014. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. In Proceedings of Robotics: Science and Systems (RSS).

Dipendra K. Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 992–1002.

Shiwali Mohan, James Kirk, and John Laird. 2013. A computational model for situated task learning with interactive instruction. In Proceedings of the 12th International Conference on Cognitive Modeling (ICCM).

Anahita Mohseni-Kabir, Charles Rich, Sonia Chernova, Candace L. Sidner, and Daniel Miller. 2015. Interactive hierarchical task learning from a single demonstration. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 205–212.

Iftekhar Naim, Young Chol Song, Qiguang Liu, Liang Huang, Henry Kautz, Jiebo Luo, and Daniel Gildea. 2015. Discriminative unsupervised alignment of natural language instructions with corresponding video segments. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Negin Nejati, Pat Langley, and Tolga Konik. 2006. Learning hierarchical task networks by observation. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 665–672.

Vincent Ng and Claire Cardie. 2002. Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 104–111.

Mingtao Pei, Zhangzhang Si, Benjamin Z. Yao, and Song-Chun Zhu. 2013. Learning and parsing video events with goal and intent prediction. Computer Vision and Image Understanding, 117(10):1369–1383.

David V. Pynadath and Michael P. Wellman. 2000. Probabilistic state-dependent grammars for plan recognition. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 507–514. Morgan Kaufmann Publishers Inc.

Joe Saunders, Chrystopher L. Nehaniv, and Kerstin Dautenhahn. 2006. Teaching robots by moulding behavior and scaffolding the environment. In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction (HRI), pages 118–125.

Lanbo She and Joyce Y. Chai. 2016. Incremental acquisition of verb hypothesis space towards physical world interaction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).

Lanbo She, Yu Cheng, Joyce Y. Chai, Yunyi Jia, Shaohua Yang, and Ning Xi. 2014a. Teaching robots new actions through natural language instructions. In Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication, pages 868–873.

Lanbo She, Shaohua Yang, Yu Cheng, Yunyi Jia, Joyce Y. Chai, and Ning Xi. 2014b. Back to the blocks world: Learning new actions through situated human-robot dialogue. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue.

Zach Solan, David Horn, Eytan Ruppin, and Shimon Edelman. 2005. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences of the United States of America, 102(33):11629–11634.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth J. Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence.

Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167.

Kewei Tu, Maria Pavlovskaia, and Song-Chun Zhu. 2013. Unsupervised structure learning of stochastic and-or grammars. In Advances in Neural Information Processing Systems (NIPS), pages 1322–1330.

Kewei Tu, Meng Meng, Mun Wai Lee, Tae Eun Choe, and Song-Chun Zhu. 2014. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2):42–70.

Caiming Xiong, Nishant Shukla, Wenlong Xiong, and Song-Chun Zhu. 2016. Robot learning with a spatial, temporal, and causal and-or graph. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA).

Shaohua Yang, Qiaozi Gao, Changsong Liu, Caiming Xiong, Song-Chun Zhu, and Joyce Y. Chai. 2016. Grounded semantic role labeling. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 149–159.

Haonan Yu and Jeffrey M. Siskind. 2013. Grounded language learning from video described with sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 53–63.

Yibiao Zhao and Song-Chun Zhu. 2013. Scene parsing by integrating function, geometry and appearance models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3119–3126.