

Neural Program Synthesis from Diverse Demonstration Videos

Shao-Hua Sun *1, Hyeonwoo Noh *2, Sriram Somasundaram 1, Joseph J. Lim 1

Abstract

Interpreting decision making logic in demonstration videos is key to collaborating with and mimicking humans. To empower machines with this ability, we propose a neural program synthesizer that is able to explicitly synthesize underlying programs from behaviorally diverse and visually complicated demonstration videos. We introduce a summarizer module as part of our model to improve the network's ability to integrate multiple demonstrations varying in behavior. We also employ a multi-task objective to encourage the model to learn meaningful intermediate representations for end-to-end training. We show that our model is able to reliably synthesize underlying programs as well as capture diverse behaviors exhibited in demonstrations. The code is available at https://shaohua0116.github.io/demo2program.

1. Introduction

Imagine you are watching others driving cars. You will easily notice many common behaviors even if you know nothing about driving. For example, cars stop when the traffic light turns red and move again when the light turns green. Cars also slow down when pedestrians are seen jay-walking. Through observation, humans can abstract behaviors and understand the reasoning behind them, especially by extracting the structural relationship between actions (e.g. start, slow down, stop) and perception (e.g. light, pedestrian).

Can machines also reason about the decision making logic behind behaviors? There has been tremendous effort and success in understanding behaviors, such as recognizing actions (Simonyan & Zisserman, 2014), describing activities in language (Venugopalan et al., 2015), and predicting future outcomes (Srivastava et al., 2015).

* Equal contribution. 1 Department of Computer Science, University of Southern California, California, USA. 2 Department of Computer Science and Engineering, POSTECH, Pohang, Korea. Correspondence to: Shao-Hua Sun <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

[Figure 1: three demonstration videos (demo 1, demo 2, demo 3) and the synthesized program:]

    def run():
        while frontIsClear():
            move()
        turnRight()
        if thereIsPig():
            attack()
        else:
            if not thereIsWolf():
                spawnPig()
            else:
                giveBone()

Figure 1. An illustration of neural program synthesis from demonstrations. Given multiple demonstration videos exhibiting diverse behaviors, our neural program synthesizer learns to produce interpretable and executable underlying programs. The divergence above occurs based on perception in the second frame.

Yet, interpreting the reasons behind behaviors is relatively unexplored and is a crucial skill for machines to collaborate with and mimic humans. Hence, our goal is to step towards developing a method that can interpret perception-based decision making logic from diverse behaviors seen in multiple visual demonstrations.

Our insight is to exploit declarative programs, structured in a formal language, as representations of decision making logic. The formal language is composed of action blocks, perception blocks, and control flow (e.g. if/else). Programs written in such a language can explicitly model the connection between an observation (e.g. traffic light, biker) and an action (e.g. stop). An example is shown in Figure 1.¹ Described in a formal language, programs are logically interpretable and executable. Thus, the problem of interpreting decision making logic from visual demonstrations can be reduced to extracting an underlying program.

In fact, many neural network frameworks have been proposed recently for program induction or synthesis. First, a variety of frameworks (Kaiser & Sutskever, 2016; Reed & De Freitas, 2016; Xu et al., 2018; Devlin et al., 2017a) propose to induce latent representations of underlying programs. While they can be efficient at mimicking desired behaviors, they do not explicitly yield interpretable programs, resulting in inexplicable failure cases.

¹ The illustrated environment is not tested in our experiments.


On the other hand, another line of work (Devlin et al., 2017b; Bunel et al., 2018) directly synthesizes programs from input/output pairs, giving full interpretability. While successful, the limited information in the input/output pairs restricts applicability to synthesizing programs with rich expressiveness. Hence, in this paper, we develop a model that synthesizes programs from visually complex and sequential inputs that demonstrate more branching conditions and long-term effects, increasing the complexity of the underlying programs.

To this end, we develop a program synthesizer augmented with a summarizer module that is capable of encoding the interrelationship between multiple demonstrations and summarizing them into compact aggregated representations. In addition, to enable efficient end-to-end training, we introduce auxiliary tasks to encourage the model to learn the knowledge that is essential to infer an underlying program.

We extensively evaluate our model in two environments: a fully observable, third-person environment (Karel) and a partially observable, egocentric game (ViZDoom). Our experiments in both environments, under a variety of settings, demonstrate the strength of explicitly modeling programs for reasoning about underlying conditions, and the necessity of the proposed components (the summarizer module and the auxiliary tasks).

In summary, we introduce a novel problem of program synthesis from diverse demonstration videos and a method to address it. This is a substantial step towards enabling machines to explicitly interpret decision making logic and interact with humans. We also demonstrate that our algorithm can synthesize programs reliably in multiple environments.

2. Related Work

Program Induction. Learning to perform a specific task by inducing latent representations of underlying task-specific programs is known as program induction. Various approaches have been developed: designing end-to-end differentiable architectures (Graves et al., 2014; 2016; Zaremba & Sutskever, 2015; Kaiser & Sutskever, 2016; Joulin & Mikolov, 2015; Grefenstette et al., 2015; Neelakantan et al., 2015), learning to call subprograms using step-by-step supervision (Reed & De Freitas, 2016; Cai et al., 2017), and few-shot program induction (Devlin et al., 2017a). Contrary to our work, those methods do not return explicit programs.

Program Synthesis. This line of work focuses on explicitly producing programs restricted to certain languages. Balog et al. (2017) train a model to predict program attributes and use external search algorithms for inductive program synthesis. Parisotto et al. (2017) and Devlin et al. (2017b) directly synthesize simple string transformation programs. Bunel et al. (2018) employ reinforcement learning to directly optimize the execution of generated programs.

Program     m := def run() : s
Statement   s := while(b) : (s) | s1 ; s2 | a | repeat(r) : (s)
                 | if(b) : (s) | ifelse(b) : (s1) else : (s2)
Repetition  r := Number of repetitions
Condition   b := percept | not b
Perception  p := Domain dependent perception primitives
Action      a := Domain dependent action primitives

Figure 2. Domain specific language for the program representation. The program is composed of domain dependent perception and action primitives and control flows.

However, those methods are limited to synthesizing programs from input/output pairs, which substantially restricts the expressiveness of the programs that can be considered; instead, we address the problem of synthesizing programs from full demonstration videos.

Imitation Learning. Methods concerned with acquiring skills from expert demonstrations, dubbed imitation learning, can be split into behavioral cloning (Pomerleau, 1989; 1991; Ross et al., 2011), which casts the problem as a supervised learning task, and inverse reinforcement learning (Ng et al., 2000), which extracts estimated reward functions from demonstrations. Recently, Duan et al. (2017), Finn et al. (2017), and Xu et al. (2018) have studied the task of mimicking a few given demonstrations. This line of work can be considered program induction, as they imitate demonstrations without explicitly modeling underlying programs. While those methods are able to mimic a few given demonstrations, it is not clear whether they could deal with multiple demonstrations with diverse branching conditions.

3. Problem Overview

In this section, we define our formulation of program synthesis from diverse demonstration videos. We define programs in a domain specific language (DSL) with perception primitives, action primitives, and control flow. Action primitives define the ways that agents can interact with an environment, while perception primitives describe how agents can perceive it. Control flow can include if/else statements, while loops, repeat statements, and simple logic operations. An example DSL, based on the control flow introduced in (Pattis, 1981), is shown in Figure 2. Note that we focus on perceptions with boolean types in this paper, although a more generic perception type constraint is possible.
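To make the grammar in Figure 2 concrete, below is a minimal Python sketch, written for this illustration, of how a program in such a DSL could be represented and executed as an abstract syntax tree. The node names and the env interface are our own assumptions, not the paper's implementation.

    # A minimal AST sketch for the DSL in Figure 2 (illustrative; names are assumptions).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Action:            # domain dependent action primitive, e.g. move()
        name: str

    @dataclass
    class If:                # if(b): (s) and ifelse(b): (s1) else: (s2)
        percept: str
        negated: bool
        then_body: List
        else_body: List

    @dataclass
    class While:             # while(b): (s)
        percept: str
        negated: bool
        body: List

    @dataclass
    class Repeat:            # repeat(r): (s)
        times: int
        body: List

    def execute(stmts, env):
        """Run a statement list against an environment exposing
        env.perceive(name) -> bool and env.act(name)."""
        for s in stmts:
            if isinstance(s, Action):
                env.act(s.name)
            elif isinstance(s, If):
                cond = env.perceive(s.percept) ^ s.negated
                execute(s.then_body if cond else s.else_body, env)
            elif isinstance(s, While):
                while env.perceive(s.percept) ^ s.negated:
                    execute(s.body, env)
            elif isinstance(s, Repeat):
                for _ in range(s.times):
                    execute(s.body, env)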

A program η is a deterministic function that outputs an action a ∈ A given a history of states at time step t, H_t = (s_1, s_2, ..., s_t), where s ∈ S is a state of the environment. The generation of an action given the history of states is represented as a_t = η(H_t).


[Figure 3: k demonstration videos pass through per-demonstration encoders; the summarizer module turns the encoded demonstrations v_demo^1, ..., v_demo^k into v_summary; a program decoder emits program tokens (e.g. def run(): move() ...), while auxiliary action and perception decoders predict per-step actions and perception conditions.]

Figure 3. Model Architecture. The demonstration encoder encodes each of the k demonstrations separately and the summarizer network aggregates them to construct a summary vector. The summary vector is used by the program decoder to produce program tokens sequentially. The encoded demonstrations are used to decode the action sequences and perception conditions as additional supervision.

In this paper, we focus on programs that can be represented in the DSL by a code C = (w_1, w_2, ..., w_N), which consists of a sequence of tokens w.

A demonstration τ = ((s_1, a_1), (s_2, a_2), ..., (s_T, a_T)) is a sequence of state and action tuples generated by an underlying program η* given an initial state s_1. Given an initial state s_1 and its corresponding state history H_1, the program generates a new action a_1 = η*(H_1). The following state s_2 is generated by a state transition function T: s_2 ∼ T(s_1, a_1). The newly sampled state is appended to the state history, H_2 = (s_1, s_2), and this process is iterated until the end-of-file action EOF ∈ A is returned by the program. A set of demonstrations D = {τ_1, τ_2, ..., τ_K} can be generated by running a single program η* on different initial states s_1^1, s_1^2, ..., s_1^K, where each initial state is sampled from an initial state distribution (i.e. s_1^k ∼ P_0(s_1)).
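As an illustration of this generation process, the following sketch rolls out a program on a sampled initial state until EOF; sample_initial_state, transition, and the program callable are assumed interfaces for this sketch rather than the paper's code.

    def rollout(program, sample_initial_state, transition, max_steps=100):
        """Generate one demonstration tau = ((s1,a1),...,(sT,aT)) by running
        a deterministic program on a sampled initial state (illustrative sketch)."""
        s = sample_initial_state()            # s_1 ~ P_0(s_1)
        history, demo = [s], []
        for _ in range(max_steps):
            a = program(history)              # a_t = eta(H_t)
            demo.append((s, a))
            if a == "EOF":                    # program signals termination
                break
            s = transition(s, a)              # s_{t+1} ~ T(s_t, a_t)
            history.append(s)
        return demo

    # A set of K demonstrations from the same program:
    # D = [rollout(program, sample_initial_state, transition) for _ in range(K)]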

While we are interested in inferring a program η* from a set of demonstrations D, it is preferable to predict a code C* instead, because it is a more accessible representation that is immediately convertible to a program. Formally, we formulate the problem as sequence prediction, where the input is a set of demonstrations D and the output is a code sequence C. Note that our objective is not to infer the code perfectly, but instead to generate a code that captures the underlying program, which models the diverse behaviors appearing in the demonstrations in an executable form.

4. Approach

Inferring a program behind a set of demonstrations requires (1) interpreting each demonstration video, (2) spotting and summarizing the differences among demonstrations to infer the conditions behind the taken actions, and (3) describing the understanding of demonstrations in a written language. Based on this intuition, we design a neural architecture composed of three components:

• Demonstration Encoder receives a demonstrationvideo as input and produces an embedding that cap-tures an agent’s actions and perception.

• Summarizer Module discovers and summarizeswhere actions diverge between demonstrations andupon which branching conditions subsequent actionsare taken.

• Program Decoder represents the summarized under-standing of demonstrations as a code sequence.

The details of the three main components are described in Section 4.1, and the learning objective of the proposed model is described in Section 4.2. Section 4.3 introduces auxiliary tasks for encouraging the model to learn the knowledge that is essential to infer a program.

4.1. Model Architecture

Figure 3 illustrates the overall architecture of the proposed model. The details of each component are described in the following sections.

4.1.1. DEMONSTRATION ENCODER

The demonstration encoder receives a demonstration video as input and produces a latent vector that captures the actions and perception of an agent. At each time step, to interpret visual input, we employ a stack of convolutional layers to encode a state s_t to its embedding as a state vector v_state^t = CNN_enc(s_t) ∈ R^d, where t ∈ [1, T] is the time step.

Since the demonstration encoder needs to handle demonstrations with variable numbers of frames, we employ an LSTM (Long Short-Term Memory) (Hochreiter & Schmidhuber, 1997) to encode each state vector and a summarized representation at the same time:

c_enc^t, h_enc^t = LSTM_enc(v_state^t, c_enc^{t-1}, h_enc^{t-1}),    (1)


where, t ∈ [1, T ] is the time step, while ctenc and htenc denotethe cell state and the hidden state. While final state tuples(cTenc, h

Tenc) encode the overall idea of the demonstration,

intermediate hidden states h1enc, h2enc, ..., h

Tenc contain high

level understanding of each state, which are used as an inputto the following modules. Note that these operations areapplied to allK demonstrations while the index k is droppedin the equations for simplicity.
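For concreteness, a demonstration encoder of this form could be sketched in PyTorch as below; the layer sizes and the exact per-frame CNN are illustrative assumptions, not the published configuration.

    import torch
    import torch.nn as nn

    class DemoEncoder(nn.Module):
        """Sketch of the demonstration encoder: a per-frame CNN followed by an
        LSTM over time (Eq. 1). Sizes are illustrative assumptions."""
        def __init__(self, d=128):
            super().__init__()
            self.cnn = nn.Sequential(               # CNN_enc: frame -> state vector
                nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, d),
            )
            self.lstm = nn.LSTM(d, d, batch_first=True)

        def forward(self, frames):                  # frames: (T, 3, H, W)
            v_state = self.cnn(frames)              # (T, d) state vectors
            h_all, (h_T, c_T) = self.lstm(v_state.unsqueeze(0))
            # h_all: per-step hidden states h_enc^1..h_enc^T; (h_T, c_T): final tuple
            return h_all.squeeze(0), (h_T.squeeze(), c_T.squeeze())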

4.1.2. SUMMARIZER MODULE

Inferring an underlying program from demonstrations that exhibit different behaviors requires the ability to discover and summarize where actions diverge between demonstrations and upon which branching conditions subsequent actions are taken. The summarizer module first re-encodes each demonstration in the context of all encoded demonstrations to infer branching conditions. Then, the module aggregates all encoded demonstration vectors to obtain the summarized representation. An illustration of the summarizer is shown in Figure 4.

The first summarization is performed by a reviewer module, an LSTM initialized with the average-pooled final state tuples of the demonstration encoder outputs, which can be written as follows:

c_review^0 = (1/K) Σ_{k=1}^K c_enc^{T,k},    h_review^0 = (1/K) Σ_{k=1}^K h_enc^{T,k},    (2)

where (c_enc^{T,k}, h_enc^{T,k}) is the final state tuple of the k-th demonstration encoder. Then the reviewer LSTM encodes the hidden states by

c_review^{t,k}, h_review^{t,k} = LSTM_review(h_enc^{t,k}, c_review^{t-1,k}, h_review^{t-1,k}),    (3)

where the final hidden state becomes a demonstration vector v_demo^k = h_review^{T,k} ∈ R^d, which includes the summarized information within a single demonstration.

The final summarization, performed across multiple demonstrations, is carried out by an aggregation module, which takes K demonstration vectors and aggregates them into a single compact vector representation. To effectively model complex relations between demonstrations, we employ a relation network (RN) module (Santoro et al., 2017). The aggregation process is formally written as follows:

v_summary = RN(v_demo^1, ..., v_demo^K) = (1/K^2) Σ_{i,j}^K g_θ(v_demo^i, v_demo^j),    (4)

where v_summary ∈ R^d is the summarized demonstration vector and g_θ is an MLP parameterized by θ, jointly trained with the summarizer module.
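A sketch of this relation-network aggregation (Eq. 4) is given below; the width of the MLP g_θ is an assumption.

    import torch
    import torch.nn as nn

    class RelationAggregator(nn.Module):
        """Sketch of Eq. 4: average g_theta over all ordered pairs of
        demonstration vectors to obtain v_summary (sizes are assumptions)."""
        def __init__(self, d=128):
            super().__init__()
            self.g = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

        def forward(self, v_demo):                      # v_demo: (K, d)
            K = v_demo.size(0)
            vi = v_demo.unsqueeze(1).expand(K, K, -1)   # (K, K, d)
            vj = v_demo.unsqueeze(0).expand(K, K, -1)   # (K, K, d)
            pairs = torch.cat([vi, vj], dim=-1)         # all (i, j) pairs
            return self.g(pairs).mean(dim=(0, 1))       # (1/K^2) sum of g(vi, vj)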

We show in Section 5 that employing the summarizer module significantly alleviates the difficulty of handling multiple demonstrations and improves generalization across different numbers of demonstrations.

[Figure 4: per-demonstration encoder LSTMs (inner layer, initialized with zero states) feed reviewer LSTMs whose initial states come from average pooling; a relation network aggregates the resulting demonstration vectors v_demo^1, ..., v_demo^k into v_summary.]

Figure 4. Summarizer Module. The demonstration encoder (inner layer) encodes each demonstration starting from a zero state. The summarizer module (outer layer) aggregates the outputs of the demonstration encoder with a relation network to provide context from other demonstrations.


4.1.3. PROGRAM DECODER

The program decoder synthesizes programs from a summarized representation of all the demonstrations. We use LSTMs similar to (Sutskever et al., 2014; Vinyals et al., 2015) as the program decoder. Initialized with the summarized vector v_summary, the LSTM at each time step takes the previous token embedding as input and outputs a probability distribution over the following program tokens, as in Eq. (5). During training, the previous ground truth token is fed as the input; during inference, the token predicted in the previous step is fed as the input.
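A minimal sketch of such a decoder, with teacher forcing at training time and greedy decoding at inference time, might look as follows; the vocabulary handling and initialization details are assumptions for this sketch.

    import torch
    import torch.nn as nn

    class ProgramDecoder(nn.Module):
        """Sketch of the LSTM program decoder: initialized from v_summary,
        it emits a distribution over DSL tokens step by step (details assumed)."""
        def __init__(self, vocab_size, d=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d)
            self.lstm = nn.LSTM(d, d, batch_first=True)
            self.out = nn.Linear(d, vocab_size)

        def forward(self, v_summary, tokens):
            # Teacher forcing: condition on ground-truth previous tokens.
            h0 = v_summary.view(1, 1, -1)               # init hidden from summary
            c0 = torch.zeros_like(h0)
            h, _ = self.lstm(self.embed(tokens).unsqueeze(0), (h0, c0))
            return self.out(h.squeeze(0))               # logits per time step

        @torch.no_grad()
        def greedy_decode(self, v_summary, start_id, eos_id, max_len=50):
            state = (v_summary.view(1, 1, -1),
                     torch.zeros(1, 1, v_summary.numel()))
            tok, out = torch.tensor([[start_id]]), []
            for _ in range(max_len):
                h, state = self.lstm(self.embed(tok), state)
                tok = self.out(h)[:, -1].argmax(dim=-1, keepdim=True)
                if tok.item() == eos_id:
                    break
                out.append(tok.item())
            return out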

4.2. Learning

The proposed model learns a conditional distribution between a set of demonstrations D and the corresponding code C = (w_1, w_2, ..., w_N). By employing the LSTM program decoder, this problem becomes autoregressive sequence prediction (Sutskever et al., 2014). Given the demonstrations and the previous code token w_{i-1}, our model is trained to predict the following ground truth token w_i^*, where the cross entropy loss is optimized:

L_code = -(1/(NM)) Σ_{m=1}^M Σ_{n=1}^N log p(w_{m,n}^* | W_{m,n-1}, D_m),    (5)

where M is the total number of training examples, w_{m,n} is the n-th token of the m-th training example, and D_m are the m-th training demonstrations. W_{m,n} = (w_{m,1}, ..., w_{m,n}) is the history of previous token inputs at time step n.


4.3. Multi-task Objective

To reason about an underlying program from a set of demonstrations, the primary and essential step is recognizing the actions and perceptions happening at each step of the demonstration. However, it can be difficult to learn meaningful representations solely from the sequence loss on programs when environments increase in visual complexity. To alleviate this issue, we propose to predict action sequences and perception vectors from the demonstrations as auxiliary tasks. An overview of the auxiliary tasks is illustrated in Figure 3.

Predicting action sequences. Given a demonstration vector v_demo^k encoded by the summarizer, an action decoder LSTM produces a sequence of actions. During training, a sequential cross entropy loss similar to Eq. (5) is optimized:

L_action = -(1/(MKT)) Σ_{m=1}^M Σ_{k=1}^K Σ_{t=1}^T log p(a_{m,t}^{k*} | A_{m,t-1}^k, v_demo^k),    (6)

where a_{m,t}^k is the t-th action token in the k-th demonstration of the m-th training example, and A_{m,t}^k = (a_{m,1}^k, ..., a_{m,t}^k) is the history of previous actions at time step t.

Predicting perceptions. We denote a perception vector Φ = (φ_1, ..., φ_L) ∈ {0, 1}^L as an L-dimensional binary vector obtained by executing L perception primitives (e.g. frontIsClear()) on a given state s. Specifically, we formulate perception vector prediction as a sequential multi-label binary classification problem and optimize the binary cross entropy:

L_perception = -(1/(MKTL)) Σ_{m=1}^M Σ_{k=1}^K Σ_{t=1}^T Σ_{l=1}^L log p(φ_{m,t,l}^{k*} | P_{m,t-1}^k, v_demo^k),    (7)

where P_{m,t}^k = (f(Φ_{m,1}^k), ..., f(Φ_{m,t}^k)) is the history of encoded previous perception vectors and f(·) is an encoding function.

The aggregated multi-task objective is L = L_code + α L_action + β L_perception, where α and β are hyperparameters controlling the importance of each loss. We set α = β = 1 to weight the objectives equally in all experiments.
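As a sketch, the aggregated objective simply combines the three cross entropy terms; the flattened logit/target shapes assumed here are illustrative, not the paper's exact implementation.

    import torch.nn.functional as F

    def multitask_loss(code_logits, code_targets,
                       action_logits, action_targets,
                       percept_logits, percept_targets,
                       alpha=1.0, beta=1.0):
        """Sketch of L = L_code + alpha * L_action + beta * L_perception.
        Logits and targets are assumed flattened over examples and time."""
        l_code = F.cross_entropy(code_logits, code_targets)          # Eq. (5)
        l_action = F.cross_entropy(action_logits, action_targets)    # Eq. (6)
        # Eq. (7): multi-label binary classification over L perception primitives.
        l_percept = F.binary_cross_entropy_with_logits(percept_logits,
                                                       percept_targets.float())
        return l_code + alpha * l_action + beta * l_percept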

5. Experiments

We perform experiments in two environments: Karel (Pattis, 1981) and ViZDoom (Kempka et al., 2016). We first describe the experimental setup and then present the experimental results.

5.1. Evaluation Metric

To verify whether a model is able to infer an underlying program η* from a given set of demonstrations D, we evaluate accuracy based on the synthesized codes and the underlying program (sequence accuracy and program accuracy) as well as the execution of the program (execution accuracy).

Sequence accuracy. Comparison in the code space is based on the instantiated code C* of a ground truth program and the synthesized code C from a program synthesizer. The sequence accuracy counts exact matches of two code sequences, formally written as Acc_seq = (1/M) Σ_{m=1}^M 1_seq(C_m^*, C_m), where M is the number of testing examples and 1_seq(·, ·) is the indicator function of exact sequence match.

Program accuracy. While the sequence accuracy is simple, it is a pessimistic estimate of program accuracy since it does not consider program aliasing: different codes with identical program semantics (e.g. repeat(2):(move()) and move() move()). Therefore, we measure the program accuracy by enumerating variations of codes. Specifically, we exploit the syntax of the DSL to identify variations, e.g. unfolding repeat statements, decomposing an if-else statement into two if statements, etc. Formally, the program accuracy is Acc_program = (1/M) Σ_{m=1}^M 1_prog(C_m^*, C_m), where 1_prog(C_m^*, C_m) is an indicator function that returns 1 if any variation of C_m matches any variation of C_m^*. Note that the program accuracy is only computable when the DSL is relatively simple and some assumptions are made, i.e. termination of loops. The details of computing program accuracy are presented in the supplementary material.
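For instance, one such syntactic variation, unfolding repeat statements, can be sketched on the AST representation from Section 3 (the Repeat/While/If node types are our illustrative assumptions):

    def unfold_repeats(stmts):
        """Sketch: rewrite repeat(n):(body) as n copies of body, recursively,
        so that semantically identical codes can be compared token by token."""
        out = []
        for s in stmts:
            if isinstance(s, Repeat):
                out.extend(unfold_repeats(s.body) * s.times)
            elif isinstance(s, While):
                out.append(While(s.percept, s.negated, unfold_repeats(s.body)))
            elif isinstance(s, If):
                out.append(If(s.percept, s.negated,
                              unfold_repeats(s.then_body),
                              unfold_repeats(s.else_body)))
            else:
                out.append(s)
        return out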

Execution accuracy. To evaluate how well a synthesized program captures the behaviors of an underlying program, we compare the execution results of the synthesized program code C and the demonstrations D* generated by a ground truth program η*, where both are generated from the same set of sampled initial states I_K = {s_1^1, ..., s_1^K}. We formally define the execution accuracy as Acc_execution = (1/M) Σ_{m=1}^M 1_execution(D_m^*, D_m), where 1_execution(D_m^*, D_m) is the indicator function of exact sequence match. Note that as the number of sampled initial states becomes infinitely large, the execution accuracy converges to the program accuracy.

5.2. Evaluation Setting

For training and evaluation, we collect M_train training programs and M_test test programs. Each program code C_m^* is randomly sampled from an environment specific DSL and compiled into an executable form η_m^*. The corresponding demonstrations D_m^* = {τ_1, ..., τ_K} are generated by running the program on K = K_seen + K_unseen different initial states.


[Figure 5: Karel qualitative examples with seen (top) and unseen (bottom) demonstrations.]

(a) Underlying and synthesized program (sequence match):

    def run():
        while frontIsClear():
            move()
        putMarker()
        turnLeft()
        move()
        putMarker()
        move()
        move()

(b) Underlying program:

    def run():
        turnRight()
        turnRight()
        while frontIsClear():
            move()
        if markersPresent():
            turnLeft()
            move()
        else:
            turnRight()

Synthesized program:

    def run():
        turnRight()
        turnRight()
        while frontIsClear():
            move()
        else:
            turnRight()

Figure 5. Karel Results. Seen training examples are on the top row (in blue) and unseen testing examples are on the bottom row (in green). (a) A successful case with a program sequence match. (b) Due to a missing branch condition execution in the training data (top images), the synthesized program does not incorporate the condition, resulting in an execution mismatch in the lower-right testing image.

The seen demonstrations are used as inputs to the program synthesizer, and the unseen demonstrations are used for computing the execution accuracy. We train our model on the training set Ω_train = {(C_1^*, D_1^*), ..., (C_{M_train}^*, D_{M_train}^*)} and test it on the testing set Ω_test = {(C_1^*, D_1^*), ..., (C_{M_test}^*, D_{M_test}^*)}. Note that Ω_train and Ω_test are disjoint. Both the sequence and execution accuracies are used for the evaluation. The training details are described in the supplementary material.

5.3. Baselines

We compare our proposed model (ours) against baselines to evaluate the effectiveness of (1) explicitly modeling the underlying programs and (2) our proposed summarizer module and multi-task objective. To address (1), we design a program induction baseline based on (Duan et al., 2017), which bypasses synthesizing programs and directly predicts action sequences. We modified the architecture to incorporate multiple demonstrations as well as pixel inputs. The details are presented in the supplementary material. For a fair comparison with our model, which gets supervision of perception primitives, we feed the perception primitive vector of every frame as an input to the induction baseline. To verify (2), we compose a program synthesis baseline consisting simply of a demonstration encoder and a program decoder, without a summarizer module or multi-task loss. To integrate the demonstration encoder outputs across demonstrations, an average pooling layer is applied.

5.4. Karel

We first focus on a visually simple environment to verify the feasibility of program synthesis from demonstrations. We consider Karel (Pattis, 1981), featuring an agent navigating through a gridworld with walls and interacting with markers

based on the underlying program.

5.4.1. ENVIRONMENT AND DATASET

Karel has 5 action primitives for moving and interacting with markers and 5 perception primitives for detecting obstacles and markers. A gridworld of size 8×8 is used for our experiments. To evaluate the generalization of the program synthesizer to novel programs, we randomly generate 35,000 unique programs and split them into a training set with 25,000 programs, a validation set with 5,000 programs, and a testing set with 5,000 programs. The maximum length of the program codes is 43. For each program, 10 seen demonstrations and 5 unseen demonstrations are generated. The maximum length of the demonstrations is 20.

Methods                       | Execution     | Program | Sequence
Induction baseline            | 62.8% (69.1%) | -       | -
Synthesis baseline            | 64.1%         | 42.4%   | 35.7%
+ summarizer (ours)           | 68.6%         | 45.3%   | 38.3%
+ multi-task loss (ours-full) | 72.1%         | 48.9%   | 41.0%

Table 1. Performance evaluation on the Karel environment. The synthesis baseline outperforms the induction baseline. The summarizer module and the multi-task objective introduce significant improvements.

5.4.2. PERFORMANCE EVALUATION

The evaluation results of our proposed model and the baselines are shown in Table 1. Comparison of the execution accuracy shows the relative performance of the proposed model and the baselines. The synthesis baseline outperforms the induction baseline in execution accuracy, which shows the advantage of explicitly modeling the underlying programs. The induction baseline often matches some of the K_unseen demonstrations but fails to match all of them from a single program. This observation is supported by the number in parentheses (69.1%), which counts the number of correct demonstrations, while the execution accuracy counts the number of programs whose demonstrations all match perfectly.


This finding has also been reported in (Devlin et al., 2017b).

The proposed model shows consistent improvement over the synthesis baseline on all evaluation metrics. The sequence accuracy of our full model is 41.0%, which is reasonable generalization performance given that none of the test programs are seen during training. We observe that our model often synthesizes programs that do not exactly match the ground truth program but are semantically identical. For example, given a ground truth program repeat(4):(turnLeft; turnLeft; turnLeft), our model predicts repeat(12):(turnLeft). These cases are considered correct for program accuracy. Note that comparisons based on the execution and sequence accuracies are consistent with the program accuracy, which justifies using them as proxies for the program accuracy when it is not computable.

Qualitative success and failure cases of the proposed model are shown in Figure 5. Figure 5(a) shows a correct case where a single program generates diverse action sequences. Figure 5(b) shows a failure case, where part of the ground truth program tokens are not generated because no seen demonstration hits that condition.

Methods             | k=3   | k=5   | k=10
Synthesis baseline  | 58.5% | 60.1% | 64.1%
+ summarizer (ours) | 60.6% | 63.1% | 68.6%
Improvement         | 2.1%  | 3.0%  | 4.5%

Table 2. Effect of the summarizer module. Employing the proposed summarizer module brings more improvement over the synthesis baseline as the number of seen demonstrations increases.

5.4.3. EFFECT OF SUMMARIZER

To verify the effectiveness of our proposed summarizer module, we conduct experiments where models are trained on varying numbers of demonstrations and compare the execution accuracy in Table 2. As the number of demonstrations increases, both models enjoy a performance gain due to the extra available information. However, the gap between our proposed model and the synthesis baseline also grows, which demonstrates the effectiveness of our summarizer module.

5.5. ViZDoom

Doom is a 3D first-person shooter game where a player can move in a continuous space and interact with monsters, items, and weapons. We use ViZDoom (Kempka et al., 2016), an open-source Doom-based AI platform, for our experiments. ViZDoom's increased visual complexity and a richer DSL test the boundary of models in state comprehension, demonstration summarization, and program synthesis.

5.5.1. ENVIRONMENT AND DATASET

The ViZDoom environment has 7 action primitives, including diverse motions and attack, as well as 6 perception primitives checking the existence of different monsters and whether they are targeted. Each state is represented by an image with 120×160×3 pixels. For each demonstration, the initial state is sampled by randomly spawning different types of monsters and ammo in different locations and placing an agent randomly. To ensure that the program behavior results in the same execution, we control the environment to be deterministic.

We generate 80,000 training programs and 8,000 testing programs. To encourage diverse behavior of the generated programs, we give a higher sampling rate to the perception primitives that have higher entropy over K different initial states. We use 25 seen demonstrations for program synthesis and 10 unseen demonstrations for the execution accuracy measure. The maximum length of programs is 32 and the maximum length of demonstrations is 20.

5.5.2. PERFORMANCE EVALUATION

Table 3 shows the results on the ViZDoom environment. The synthesis baseline outperforms the induction baseline in terms of the execution accuracy, which shows the strength of program synthesis for understanding diverse demonstrations. In addition, the proposed summarizer module and the multi-task objective bring improvement on all evaluation metrics. We also found that the syntax of the synthesized programs is about 99.9% accurate, indicating that the program synthesizer correctly learns the syntax of the DSL.

Figure 6 shows a qualitative result. The generated program successfully covers the different conditional behaviors in the demonstrations. In this example, the synthesized program does not match the underlying program in the code space, while matching it in the program space.

Methods            | Execution     | Program | Sequence
Induction baseline | 35.1% (60.6%) | -       | -
Synthesis baseline | 48.2%         | 39.9%   | 33.1%
Ours-full          | 78.4%         | 62.5%   | 53.2%

Table 3. Performance evaluation on the ViZDoom environment. The proposed model outperforms the induction and synthesis baselines significantly as the environment is more visually complex.

5.5.3. ANALYSIS

To verify the importance of inferring underlying conditions, we perform an evaluation only with programs containing a single if-else statement with two branching consequences. This setting is sufficiently simple to isolate other diverse factors that might affect the evaluation result.


[Figure 6: frames from two demonstrations, annotated per frame with perception conditions and the resulting actions, e.g. inTarget HellKnight → attack(), inTarget HellKnight and inTarget Demon → attack(), not inTarget Demon → moveRight(), inTarget Demon → attack().]

Underlying program:

    def run():
        if inTarget HellKnight:
            attack()
        if inTarget Demon:
            attack()
        else:
            moveRight()

Synthesized program:

    def run():
        if inTarget HellKnight:
            attack()
        if not inTarget Demon:
            moveRight()
        else:
            attack()

Figure 6. ViZDoom results. Annotations below the frames are the perception conditions and actions. The Hellknight, Revenant, and Demon monsters are white, black, and pink, respectively. The model correctly perceives the conditions and actions and synthesizes a precise program. Note that the synthesized and the underlying programs are semantically identical.

Methods            | Execution     | Program | Sequence
Induction baseline | 26.5% (83.1%) | -       | -
Synthesis baseline | 59.9%         | 44.4%   | 36.1%
Ours-full          | 89.4%         | 69.1%   | 58.8%

Table 4. If-else experiment on the ViZDoom environment. A single if-else statement with two branching consequences is used to evaluate the ability to infer underlying conditions.

For this experiment, we use 25 seen demonstrations to understand a behavior and 10 unseen demonstrations for testing. The results are shown in Table 4. The induction baseline has difficulty inferring the underlying condition well enough to match all unseen demonstrations most of the time. In addition, our proposed model outperforms the synthesis baseline, which demonstrates the effectiveness of the summarizer module and the multi-task objective.

Figure 7 illustrates how models trained with a fixed number (25) of seen demonstrations generalize to fewer or more seen demonstrations at testing time. Our model and the synthesis baseline are able to leverage more seen demonstrations to synthesize more accurate programs, and they achieve reasonable performance when fewer demonstrations are given. On the contrary, the induction baseline cannot exploit more than 10 demonstrations well.

5.5.4. DEBUGGING THE SYNTHESIZED PROGRAM

One of the intriguing properties of program synthesis is that synthesized programs are interpretable by and interactable with humans. This makes it possible to debug a synthesized program and fix minor mistakes to correct its behaviors. To verify this idea, we use the edit distance between the synthesized program and the ground truth program as the minimum number of tokens that must be changed to obtain an exactly correct program.

Figure 7. Generalization over different numbers of seen demonstrations (K_seen). The baseline models and our model, trained with 25 seen demonstrations, are evaluated with fewer or more seen demonstrations.

With this setting, we found that fixing at most 2 program tokens provides a 4.9% improvement in sequence accuracy and a 4.1% improvement in execution accuracy.
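A token-level edit distance like the one used here can be computed with standard dynamic programming; the following sketch is our illustration, not the paper's code.

    def token_edit_distance(a, b):
        """Levenshtein distance between two token sequences: the minimum number
        of insertions, deletions, and substitutions to turn a into b."""
        dp = list(range(len(b) + 1))              # distances for the empty prefix of a
        for i, ta in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, tb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,           # delete ta
                                         dp[j - 1] + 1,       # insert tb
                                         prev + (ta != tb))   # substitute
        return dp[-1]

    # e.g. token_edit_distance(synthesized_tokens, ground_truth_tokens) <= 2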

6. Conclusion

We propose the task of synthesizing a program from diverse demonstration videos. To address it, we introduce a model augmented with a summarizer module to deal with branching conditions and a multi-task objective to induce meaningful latent representations. Our method is evaluated on a fully observable, third-person environment (Karel) and a partially observable, egocentric game (ViZDoom). The experiments demonstrate that the proposed model is able to reliably infer underlying programs and achieve satisfactory performance.


Acknowledgments

We thank the anonymous ICML reviewers for insightful comments. This project was supported by the Center for Super Intelligence, Kakao Brain, and SKT.

References

Balog, Matej, Gaunt, Alexander L., Brockschmidt, Marc, Nowozin, Sebastian, and Tarlow, Daniel. DeepCoder: Learning to write programs. In International Conference on Learning Representations, 2017.

Bunel, Rudy R., Hausknecht, Matthew, Devlin, Jacob, Singh, Rishabh, and Kohli, Pushmeet. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations, 2018.

Cai, Jonathon, Shin, Richard, and Song, Dawn. Making neural programming architectures generalize via recursion. In International Conference on Learning Representations, 2017.

Devlin, Jacob, Bunel, Rudy R., Singh, Rishabh, Hausknecht, Matthew, and Kohli, Pushmeet. Neural program meta-induction. In Neural Information Processing Systems, 2017a.

Devlin, Jacob, Uesato, Jonathan, Bhupatiraju, Surya, Singh, Rishabh, Mohamed, Abdel-rahman, and Kohli, Pushmeet. RobustFill: Neural program learning under noisy I/O. In International Conference on Machine Learning, 2017b.

Duan, Yan, Andrychowicz, Marcin, Stadie, Bradly, Ho, Jonathan, Schneider, Jonas, Sutskever, Ilya, Abbeel, Pieter, and Zaremba, Wojciech. One-shot imitation learning. In Neural Information Processing Systems, 2017.

Finn, Chelsea, Yu, Tianhe, Zhang, Tianhao, Abbeel, Pieter, and Levine, Sergey. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, 2017.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwinska, Agnieszka, Colmenarejo, Sergio Gomez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to transduce with unbounded memory. In Neural Information Processing Systems, 2015.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 1997.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets. In Neural Information Processing Systems, 2015.

Kaiser, Łukasz and Sutskever, Ilya. Neural GPUs learn algorithms. In International Conference on Learning Representations, 2016.

Kempka, Michał, Wydmuch, Marek, Runc, Grzegorz, Toczek, Jakub, and Jaśkowski, Wojciech. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Computational Intelligence and Games, 2016.

Neelakantan, Arvind, Le, Quoc V., and Sutskever, Ilya. Neural programmer: Inducing latent programs with gradient descent. In International Conference on Learning Representations, 2015.

Ng, Andrew Y., Russell, Stuart J., et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, 2000.

Parisotto, Emilio, Mohamed, Abdel-rahman, Singh, Rishabh, Li, Lihong, Zhou, Dengyong, and Kohli, Pushmeet. Neuro-symbolic program synthesis. In International Conference on Learning Representations, 2017.

Pattis, Richard E. Karel the Robot: A Gentle Introduction to the Art of Programming. John Wiley & Sons, Inc., 1981.

Pomerleau, Dean A. ALVINN: An autonomous land vehicle in a neural network. In Neural Information Processing Systems, 1989.

Pomerleau, Dean A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 1991.

Reed, Scott and De Freitas, Nando. Neural programmer-interpreters. In International Conference on Learning Representations, 2016.

Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.

Santoro, Adam, Raposo, David, Barrett, David G., Malinowski, Mateusz, Pascanu, Razvan, Battaglia, Peter, and Lillicrap, Tim. A simple neural network module for relational reasoning. In Neural Information Processing Systems, 2017.


Simonyan, Karen and Zisserman, Andrew. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems, 2014.

Srivastava, Nitish, Mansimov, Elman, and Salakhutdinov, Ruslan. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, 2014.

Venugopalan, Subhashini, Rohrbach, Marcus, Donahue, Jeffrey, Mooney, Raymond, Darrell, Trevor, and Saenko, Kate. Sequence to sequence - video to text. In International Conference on Computer Vision, 2015.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. In Conference on Computer Vision and Pattern Recognition, 2015.

Xu, Danfei, Nair, Suraj, Zhu, Yuke, Gao, Julian, Garg, Animesh, Fei-Fei, Li, and Savarese, Silvio. Neural task programming: Learning to generalize across hierarchical tasks. In International Conference on Robotics and Automation, 2018.

Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural Turing machines - revised. arXiv preprint arXiv:1505.00521, 2015.