
Neural Event Semantics for Grounded Language Understanding



Shyamal Buch  Li Fei-Fei  Noah D. Goodman
Stanford University

{shyamal,feifeili}@cs.stanford.edu [email protected]

Abstract

We present a new conjunctivist framework, neural event semantics (NES), for compositional grounded language understanding. Our approach treats all words as classifiers that compose to form a sentence meaning by multiplying output scores. These classifiers apply to spatial regions (events) and NES derives its semantic structure from language by routing events to different classifier argument inputs via soft attention. NES is trainable end-to-end by gradient descent with minimal supervision. We evaluate our method on compositional grounded language tasks in controlled synthetic and real-world settings. NES offers stronger generalization capability than standard function-based compositional frameworks, while improving accuracy over state-of-the-art neural methods on real-world language tasks.

1 Introduction

Capturing the compositional semantics of grounded language is a long-standing goal in natural language processing. Composition yields systematicity, and is thus essential to developing systems that can generalize broadly in real-world settings. Recent progress with neural module networks (Andreas et al., 2016b; Hu et al., 2017) and related models (Johnson et al., 2017b; Yi et al., 2018; Bahdanau et al., 2019a) has moved neural network methods closer to this goal.

These works are largely based on the idea, functionism (Montague, 1970), that semantic composition is function composition. In Fig. 1(a), function predicates compose by nesting: predicates like "red" and "circle" operate on sets of elements, progressively filtering them at each step (circle(red(x))). The final relational predicate above is thus several steps removed from

Project Website: https://neural-event-semantics.github.io/
A version of this pre-print paper will appear in TACL 2021.

[Figure 1 (diagram): Text input: "A red circle is above a green square." (a) Prior Work, function composition (w/ module layout): filter[red] → re-filter[circle]; filter[green] → re-filter[square] → relation[above]; λx,y.above(circle(red(x)), square(green(y))) → True. (b) NES (Ours), conjunctive composition (w/ events routing): ∃e1,e2. red(e1) ∧ circle(e1) ∧ above(e1,e2) ∧ green(e2) ∧ square(e2) → True.]

Figure 1: (a) Prior neural network methods for compositional semantics, such as neural module networks, derive compositional structure through nested application of function modules. This paradigm, rooted in functionism, is powerful but retains drawbacks to learnability due to its complexity. (b) We propose neural event semantics (NES), a new framework based on conjunctivism, where all words are classifiers and output scores compose by simple multiplication. We call the input spatial regions to these classifiers events: NES derives semantic structure from language by learning how to route event inputs of classifiers for different words in a context-sensitive manner. By relaxing this routing operation with soft attention, NES enables end-to-end differentiable learning without low-level supervision for compositional grounded language understanding.

the original inputs x,y. Similarly, in module networks, atomic module blocks compose by sequentially passing outputs of intermediate blocks to later modules. The diverse composition ruleset needed to coordinate function inputs and outputs leads to complexity in this paradigm, which has practical implications for its fundamental learnability. Indeed, neural module network instantiations of this framework often depend on low-level ground truth module layout programs (Johnson et al., 2017b) or large amounts of training data to sustain end-to-end reinforcement learning methods (Yi et al., 2018; Mao et al., 2019).


While functionism is the dominant paradigm in linguistic semantics, there is an intriguing alternative: event semantics (Davidson, 1967). Conjunctivism (Pietroski, 2005) is a particularly powerful version of event semantics, wherein the only composition operator is conjunction — structure arises by routing event variables to the function predicates. We illustrate this key difference between paradigms in Figure 1: in a conjunctivist setting (Fig. 1(b)), even the relational above has events e1, e2 directly routed as input, rather than taking inputs that are output results of a sequence of filter operations. Overall meaning is still preserved, since e1 concurrently routes to (red, circle) and e2 to (green, square). All module outputs directly contribute to the final truth value without intermediate steps. Altogether, this shift from deriving compositional structure by functional module layout to conjunctive events routing offers a path to improved learnability; we explore the implications of this line of thinking in the context of compositional neural models.

We propose neural event semantics (NES), a new conjunctivist framework for compositional grounded language understanding. Our work addresses the drawbacks of modern neural module approaches by re-examining the underlying semantics framework, shifting from functionism to conjunctivism. The focus of NES revolves around event variables, abstractions of entities in the world (e.g. in images, we can think of events as spatial regions). We treat all words as event classifiers: for each word, a single score indicates the presence of a concept on a specific input (e.g. red, above in Fig. 1(b)). We compose output scores from classifiers by multiplication, generalizing logical conjunction. The structural heart of NES is the intermediate soft (attentional) event routing stage, which ensures that these otherwise independent word-level modules receive contextually consistent event inputs. In this way, the simple product of all classifier scores accurately represents the intended compositional structure of the full sentence. Our NES framework is end-to-end differentiable, able to learn from high-level supervision by gradient descent while providing interpretability at the level of individual words.

We evaluate our NES framework on a series of grounded language tasks aimed at assessing its generalizability. We verify the merits of our conjunctivist design in a controlled comparison with functionist methods on the synthetic ShapeWorld benchmark (Kuhnle and Copestake, 2017). We show NES exhibits stronger systematic generalization over prior techniques, without requiring any low-level supervision. Further, we verify the flexibility of the framework in real-world language settings, offering significant gains (+4 to 6 points) in the state-of-the-art accuracy for language and zero-shot generalization tasks on the CiC reference game benchmark (Achlioptas et al., 2019).

2 Background and Related Work

Compositional Neural Networks. The advent of neural module networks (NMN) (Andreas et al., 2016a,b; Hu et al., 2017) and related techniques (Johnson et al., 2017b; Yi et al., 2018; Bahdanau et al., 2019a) has proven to be a driving force in compositional language understanding. These techniques share a key principle: small, reusable neural network modules stack together as functional building blocks in an overall executable neural program. A parsing system determines the programmatic layout, wiring the outputs of intermediate blocks to the inputs of other blocks.

The reliance of these techniques on pre-specified module libraries, ground truth supervision on functional module layouts, and/or sample-inefficient reinforcement learning methods (Williams, 1992) has motivated subsequent work to eschew explicit semantics for recurrent attentional computation techniques (Hudson and Manning, 2018; Perez et al., 2018; Hu et al., 2018; Hudson and Manning, 2019). This class of more implicit semantics methods offers the benefits of end-to-end differentiability of traditional non-compositional neural networks (Lake et al., 2017), making them better suited for real-world settings. As a trade-off, however, these methods exhibit less systematic generalization than their more explicit counterparts (Marois et al., 2018; Jayram et al., 2019; Bahdanau et al., 2019b).

Recent work has also suggested that the modular network approach leads to limitations of systematic generalization: functional module layout can lead to entangled concept understanding (Bahdanau et al., 2019a; Subramanian et al., 2020). While these works go on to propose mitigating measures, such as module-level pretraining, we consider an orthogonal approach: re-visiting the underlying semantics foundation. This enables us to address the challenges jointly: our NES frame-


[Figure 2 (diagram): Inputs: input text T ("The blue circle is on the red square") and input world W with event candidates V (v1, v2, v3, v4). Text to Neural Logical Form: sequence encoder; Sec 3.2: event argument routing (attention over argument slots 1 and 2 across e1, e2, e∅, with e∅ as background event); Sec 3.3: words as event classifiers (blue(e1), circle(e1), on(e1,e2), red(e2), square(e2)); Sec 3.4: conjunctive composition into neural logical form F. Semantic Evaluation (Grounding), Sec 3.5: existential event variable interpretation, with event assignment (e ← V) and overall score max over e1,e2 of [blue(e1) · circle(e1) · on(e1,e2) · red(e2) · square(e2)]; final output "True", s∗F = 1.0.]

Figure 2: We propose neural event semantics (NES), an end-to-end differentiable framework based on conjunctivist event semantics (Sec 3.1). NES parses input text to a neural logical form F, which can score a given set of input events. In NES, all words are event classifiers (Sec. 3.3) whose scores compose by multiplication (Sec. 3.4). The structural heart of NES is a differentiable event argument routing operation (Sec. 3.2), ensuring arguments to each event classifier are contextually correct. NES semantically grounds F to an input world W by existential event variable interpretation (Sec. 3.5), finding a satisfying assignment (if one exists) of events e from values V.

work retains the end-to-end learnability of implicit methods, while improving upon the systematic generalizability of explicit ones.

Grounded Compositional Semantics. Our work is also closely related to the broader, pre-neural network body of prior work which developed models for compositional semantics in grounded language settings (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013; Malinowski and Fritz, 2014). These methods all share the two-stage approach of semantic parsing and evaluation, and combine functionist and conjunctivist elements. The parsing stage typically leverages a (functionist) combinatory categorial grammar (CCG) parser (Zettlemoyer and Collins, 2005) to map input language to a discrete (conjunctive) logical form bound by an existential closure. The evaluation stage passes visual segments as input to these predicates to obtain a final score representing its truth condition. In our work, we aim to generalize these frameworks to a modular neural network setting, embracing conjunctivist design across all stages to improve end-to-end learnability. Our proposed soft event routing mechanism relaxes prior discrete constraints and offers an alternative to probabilistic program (Krishnamurthy et al., 2016) formulations. Together, NES is able to learn how to predict the (soft) conjunctive neural logical forms while jointly learning the underlying semantics of each concept (without pre-specification) end-to-end from denotation alone.

Grounded Language Understanding. The space of grounded language understanding methods and tasks is large, encompassing tasks in image-caption agreement (Kuhnle and Copestake, 2017; Suhr et al., 2019), reference grounding (Monroe et al., 2017; Achlioptas et al., 2019), instruction following (Ruis et al., 2020; Vogel and Jurafsky, 2010; Chaplot et al., 2018), captioning (Chen et al., 2015), and question answering (Antol et al., 2015; Johnson et al., 2017a; Hudson and Manning, 2019), among others. Often, the ability to operate with only high-level labels is critical (Karpathy and Fei-Fei, 2015). Consistent with recent work (Bahdanau et al., 2019a), we center our analysis on foundational tasks of caption agreement and reference grounding, on both synthetic and real-world language data, with the understanding that core insights can translate to related tasks.

3 Technical Approach

3.1 Prelude: Classical Conjunctivism to NES

To explain our proposed differentiable neural approach, we first revisit classical logic in our current context. In conjunctivist event semantics (Pietroski, 2005), we work with the space of existentially quantified conjunctions of predicates. For


illustration, consider the partial logical form:

∃e1, e2 ∈ V. [[circle(e1) ∧ on(e1, e2)]] (1)

where ei are event variables and V is the domain of candidate event values. To evaluate this expression, we need an interpretation of the variables: an assignment of event values in V to each event variable ei. We then route these events to the arguments of predicates based on the logical form. The logical form gives the abstract template for which events route to which inputs and, most crucially, which arguments are shared across predicates (e1 routes to "circle" and the first argument of "on"). We make this routing explicit by a routing tensor Awri ∈ {0, 1}: for each argument slot (r) of each predicate (w, for word), Awr∗ ∈ {0, 1}n is a one-hot vector indicating which of the n event variables ei ∈ e belongs in this argument slot. We can thus rewrite the matrix expression in Eq. 1 as:

[[circle(A11∗e, A12∗e) ∧ on(A21∗e, A22∗e)]] (2)

Without loss of generality1, we upgrade each predicate to take a fixed m arguments; here m = 2. Eq. 2 makes it clear that the routing tensor A is the key syntactical element specifying the structure of the logical form in Eq. 1. Having routed events ei to predicate arguments via A, we can evaluate the predicates ("circle", "on"). These predicates are Boolean functions, assigned by a lookup table (lexicon). We compose the outputs of these Boolean functions by conjunction to get the truth value of the entire matrix. This describes how we evaluate the matrix expression in Eq. 1 for a specific assignment of ei in V. We arrive at the final interpretation of Eq. 1 by existential quantification: searching over the possible assignments to see if there exists one that makes the matrix true.
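To make the classical evaluation concrete, the following is a minimal illustrative sketch (not the authors' code: the toy predicates, event dictionaries, and helper names are our own stand-ins) of evaluating Eq. 1 with a discrete one-hot routing tensor and existential search over assignments:

```python
from itertools import permutations

# Toy world: events are hypothetical attribute dicts; NULL stands in for e∅.
NULL = {"shape": None, "y": None}

def circle(a, b):   # unary predicate padded to arity m = 2; slot 2 receives e∅
    return a["shape"] == "circle"

def on(a, b):       # toy spatial relation: a sits directly above b
    return a["y"] == b["y"] + 1

# Routing tensor: A[word][slot] is a one-hot row over (e1, e2, e∅).
A = {"circle": [(1, 0, 0), (0, 0, 1)],
     "on":     [(1, 0, 0), (0, 1, 0)]}
PREDS = {"circle": circle, "on": on}

def route(row, events):
    """Select the event the one-hot row points at."""
    return events[row.index(1)]

def evaluate(values):
    """Existential closure: search assignments of (e1, e2) over values."""
    for e1, e2 in permutations(values, 2):
        events = (e1, e2, NULL)
        if all(PREDS[w](*(route(r, events) for r in rows))
               for w, rows in A.items()):
            return True
    return False

world = [{"shape": "circle", "y": 1}, {"shape": "square", "y": 0}]
print(evaluate(world))  # True: the circle is on the square
```

The conjunction is the `all(...)` call; the routing tensor alone determines which event reaches which argument slot, mirroring Eq. 2.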

We emphasize that the logical form is fully determined by the routing tensor A and the lexicon mapping each word/predicate to a Boolean function. Evaluation is specified by conjunctive composition and finding a satisfying variable interpretation. Our strategy to develop a learnable framework is to soften each of the key components: argument routing (Sec. 3.2), predicate evaluation (Sec. 3.3), conjunctive composition (Sec. 3.4), and existential event interpretation (Sec. 3.5).

1 We add a background event variable e∅, backed by a null representation; A can route e∅ to extra slots. In Eq. 2, routed events to "circle" are A11∗e = e1 and A12∗e = e∅.

[Figure 3 (diagram): a word module Mw for "circle" with decontextualized word embedding qw; event routing rows Aw1∗, Aw2∗ (m slots × n events over e1, e2, e∅) select routed arguments Aw1∗e = e1 and Aw2∗e = e∅; the event classifier outputs score 0.97.]

Figure 3: Words as Classifiers of Routed Events. All words w correspond to modules Mw of a single type signature. Predicted argument routing attention A routes input events e from the overall logical form F to the specific arguments in the event classifier Mw (per Eq. 5), ensuring contextual consistency between event classifiers for different words. qw, a decontextualized word embedding, indicates to Mw its lexical concept. Mw shown here with maximum arity m = 2 slots and n = 3 events (including the ungrounded background event e∅); since "circle" only binds to one argument e1, the second slot is bound to e∅. See Sec. 3.2 and 3.3.

Overview. We propose a neural event semantics (NES) framework, illustrated in Fig. 2, which relaxes this classical logic into a differentiable computation that can be learned end-to-end. NES takes a text statement T and constructs a neural logical form F. This form is specified by a now real-valued routing tensor Awri ∈ [0, 1] and a lexicon associating event classifiers Mw to each word w. NES specifies composition via the product of classifier prediction scores, as a relaxation of conjunction. Finally, evaluation is completed by existentially interpreting event variables ei into a domain of event values V (grounded representations extracted from a visual world W) by a max operator.

3.2 Differentiable Event Argument Routing

Our first key operation in NES is to predict the argument routing tensor A from the input language. Critically, we relax A from its original discrete formulation in Sec. 3.1 to a continuous-valued one Awri ∈ [0, 1], where Awr∗ ∈ [0, 1]n is normalized by softmax over the index for the n events ei. This softened routing can be seen as a form of attention, determining which argument slot r for a word w will attend to which event variables ei (see Fig. 3). We predict these attentions directly from the input tokenized text sequence T = [t1, . . . , tl], of length l. For each token word tw, we pass a word embedding qw as input to a bidirectional LSTM (Graves and Schmidhuber, 2005) which serves as the sequence encoder and outputs forward/backward hidden states (h→w, h←w ∈ Rdh) capturing the bidirectional context surrounding tw. Passing the concatenated states through a linear layer, we obtain a final hidden state:

hw = W([h→w, h←w]) + b ∈ Rdh    (3)

From hw, a multilayer perceptron (MLPROUTE) network outputs for each argument slot r:

Awr∗ = softmax(MLPROUTE(hw))    (4)

Over the full input sequence of length l, we obtain the full argument routing tensor A ∈ [0, 1]l×m×n, with m argument slots per word and n events. Note that the prediction of A from input text T plays the role of capturing syntax for NES, using the language to derive coordination of argument routings across different words.2
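As a rough numpy illustration of Eqs. 3-4, with random matrices standing in for the trained BiLSTM, linear layer, and MLPROUTE (these stand-ins are our assumption, not the model's actual parameterization), the routing prediction produces one distribution over events per (word, slot) pair:

```python
import numpy as np

rng = np.random.default_rng(0)
l, dh, m, n = 5, 8, 2, 3   # words, hidden dim, argument slots, events (incl. e∅)

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

H = rng.normal(size=(l, dh))            # h_w per word, as from Eq. 3
W_route = rng.normal(size=(dh, m * n))  # stand-in for MLP_ROUTE

# Eq. 4: softmax over the event index for every (word, slot) pair.
A = softmax((H @ W_route).reshape(l, m, n), axis=-1)

print(A.shape)  # (5, 2, 3): l words × m slots × n events
```

Each row A[w, r, :] is a soft (attentional) version of the one-hot routing vector from Sec. 3.1, so it sums to 1 over the n events.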

A key design aspect of the routing operation: because A can route an ungrounded background event e∅ to (extra) argument slots, NES can implicitly learn the arity of each word. Further, the attention formulation enables partial routing of such background events; we observe later in Sec. 4.1.4 that this is critical to enabling the more complex coordination necessary to handle negation.

3.3 Words as Event Classifiers

In NES, all words are event classifiers: words are associated with modules Mw that output a real-valued score sw of how true a lexical concept is for a given set of (routed) event inputs (Sec. 3.2, Fig. 3). Denoting events ei ∈ Rde, e∅ as a null background event, and e = [e1 · · · en−1 | e∅] ∈ Rn×de, we can formalize the routed inputs as Awr∗e ∈ Rde. The concatenation of these routed inputs over all m argument slots is input to Mw.

While in principle the modules can be completely separate for each word in the lexicon, we choose to share the weights of the different classifiers Mw: this improves memory efficiency for large vocabularies and is helpful in real-world language generalization settings. Thus, we can realize modules Mw by an MLP network that receives the word embedding qw as further input (see Fig. 3), computing its output sw as:

sw = σ(MLPMw([Aw1∗e, . . . , Awm∗e; qw]))    (5)

where σ denotes the sigmoid function that normalizes the output score sw ∈ [0, 1].3
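A minimal numpy sketch of Eq. 5, with random weights standing in for the shared learned MLP (the dimensions and two-layer form here are illustrative assumptions, not the paper's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, de, dq, dh = 2, 3, 4, 4, 16   # slots, events, event dim, embed dim, hidden dim

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

e = rng.normal(size=(n, de))             # event matrix [e1, ..., e_{n-1} | e∅]
q = rng.normal(size=dq)                  # decontextualized word embedding q_w
A_w = rng.dirichlet(np.ones(n), size=m)  # soft routing rows for this word (m × n)

routed = A_w @ e                         # A_{wr*}e for each slot r, shape (m, de)
x = np.concatenate([routed.ravel(), q])  # [A_{w1*}e, ..., A_{wm*}e; q_w]

# Stand-in for the shared MLP_Mw: one hidden layer, sigmoid output.
W1 = rng.normal(size=(dh, x.size))
W2 = rng.normal(size=dh)
s_w = sigmoid(W2 @ np.tanh(W1 @ x))      # classifier score s_w ∈ (0, 1)

print(0.0 < s_w < 1.0)  # True
```

Because the routing rows are convex weights over events, each routed input is a soft mixture of event representations, recovering the discrete case when a row is one-hot.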

2 We emphasize that this is a language-only operation: coordination here is not conditional on the later grounding step to specific event values V in the visual world (Sec. 3.5).

3 qw is a decontextualized embedding that only represents

3.4 Conjunctive Composition in NES

Per Sec. 3.1, the matrix of a classical conjunctive logical form (for a given interpretation of variables) is evaluated by composing Boolean predicate outputs by conjunction. For the neural logical form F in NES, we consider the real-valued generalization of conjunction: we compose the l word-level scores sw from the classifiers Mw (Eq. 5) by multiplication (∏_w sw). For numerical stability, we calculate the combined log score in log space:

log sF = (1/l) Σ_w log sw    (6)

where the length normalization is optional but helps with training on variable length sequences.
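The composition in Eq. 6 can be checked numerically: with the 1/l normalization the combined score is the geometric mean of the word scores, and without it, their product (the scores below are toy values, not model outputs):

```python
import numpy as np

s = np.array([0.9, 0.8, 0.95, 0.99])   # per-word classifier scores s_w

log_sF = np.log(s).mean()              # Eq. 6: (1/l) * sum_w log s_w
sF = np.exp(log_sF)                    # geometric mean of the scores

# Without the 1/l normalization, exponentiating recovers the plain product:
print(np.isclose(np.exp(np.log(s).sum()), s.prod()))  # True
```

Working in log space avoids underflow when many scores multiply together, while any single near-zero score still drives the product toward zero, as conjunction requires.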

3.5 Existential Event Variable Interpretation

In the previous Sec. 3.2-3.4, we've described how NES translates input language to a neural logical form F, and how such a logical form can operate for a specific interpretation (binding) of events to candidate values V. Now, we describe the final existential variable interpretation step, which relaxes the existential quantification of classical logic (Eq. 1) into a max operation over possible event interpretations of a specific input domain V.

Candidate Event Values V. We decompose our input world W into a set of candidate event proposals, with corresponding representation values V. In our experiments, we process input visual scenes W with a pre-trained convolutional visual encoder φ (Simonyan and Zisserman, 2015; He et al., 2016) to provide a set of up to k candidate event value representations V = {v1, . . . , vk}, with v ∈ Rde. These candidate values capture the information corresponding to the localized image segment surrounding that specific event; we base our approach on recent findings of object-centric representations for compositional modular network approaches (Yi et al., 2018). To capture spatial information, we augment each representation with the spatial coordinates of the center of its bounding box; this enables NES and our relevant baseline methods (e.g. NMN) to assess the semantics of spatial relationships (e.g. "below") while operating directly on event values.

Assignment and Final Scoring. Given the domain V of candidate event values, an interpretation is thus an assignment of each of the n − 1

the standalone lexical concept, not the recurrent embedding hw. Consistent with Subramanian et al. (2020), we find this improves systematic generalization in NES and baselines.


grounded event variables (we don't include e∅) to a unique value in V: we denote this assignment operation as e ← V. We translate the existential closure (∃e1,e2 in Fig. 2) as an operation that determines the best scoring assignment of event candidate values to event variables. Expanded, the final grounded score s∗F = max_{e←V} sF is:

s∗F = max_{e←V} (1/l) Σ_w log Mw(Aw1∗e, . . . , Awm∗e; qw)    (7)

Fig. 4 visualizes output score tables (including sw, sF, s∗F) with k = 2 candidate event values and n = 3 events including background e∅. We highlight that Fig. 4 shows how each individual module provides consistent outputs depending on the specific event interpretation e ← V (e.g. "below" is only true if (e1, e2) bind to (v2, v1), not (v1, v2)). The final score s∗F reflects the sF of that correct assignment, since it is the max score.

3.6 Training: Learning from Denotation

We train our overall system end-to-end with gradient descent with a dataset of (statement T, world scene W, true/false denotation label Y) triplets. We apply a straightforward binary cross entropy loss at the level of text statements and their truth labels to the final output score s∗F, without needing any low-level ground truth supervision of the neural logical form. Overall, our full NES framework offers advantages from both traditional neural module network methods and end-to-end differentiable implicit semantics techniques.

The max operation in Eq. 7 is a technical challenge for the end-to-end training. To improve gradient flow, we propose to use a tunable approximation fmax, which approaches the max as β → ∞ and is always upper-bounded by it:

fmax(s; β) = Σ_q (sq)^(β+1) / Σ_q (sq)^β ≤ max(s)    (8)

In context, s is a vector of all the scores sF (Eq. 6) corresponding to the assignments e ← V, and the output of Eq. 8 is a bounded approximation of s∗F in Eq. 7. See Appendix A for correctness and details.4 During test-time inference, we still use the original max operation shown in Eq. 7.

4 Bound follows from Hölder's inequality. Eq. 8 is important since standard alternatives (e.g. log-sum-exp) do not have this upper-bound, and the possibility of multiple valid assignments e ← V renders softmax inappropriate. Since fmax converges quickly to the max operation as β increases, we can use numerically stable values β ≤ 4 during training.
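A small re-implementation sketch of the approximation in Eq. 8 (our own code, not the authors'), checking the two properties the text relies on: it never exceeds the true max, and it converges to the max as β grows:

```python
import numpy as np

def fmax(s, beta):
    """Smooth max of Eq. 8: sum(s^(beta+1)) / sum(s^beta)."""
    s = np.asarray(s, dtype=float)
    return (s ** (beta + 1)).sum() / (s ** beta).sum()

s = np.array([0.2, 0.5, 0.9])   # toy assignment scores s_F

# Upper-bounded by the hard max for every beta (Hölder's inequality).
for beta in [0, 1, 2, 4]:
    assert fmax(s, beta) <= s.max() + 1e-12

# Monotone approach to the hard max as beta increases.
print(float(fmax(s, 0)), float(fmax(s, 4)), float(s.max()))
```

At β = 0 the expression reduces to the arithmetic mean; moderate values such as β = 4 already sit close to the hard max, consistent with the numerically stable training range noted in the footnote.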

[Figure 4 (score tables): input world W + candidates V, with v1, v2 for the two objects. Example 1: "a green shape is below a red circle." — module-level scores sw for event assignments (e1, e2) ∈ {(v1, v2), (v2, v1)}; s∗F = 1.0, output True. Example 2: "a triangle is above a red circle." — s∗F = 0.0, output False.]

Figure 4: Qualitative Results (Scoring). Example end-to-end NES results on ShapeWorld. We show an input world with two event candidates (for clarity) with representations v1, v2 for the red and green circles, respectively. We visualize the possible event assignments (e1, e2) ∈ {(v1, v2), (v2, v1)} and the classifier scores sw ∈ [0, 1] for each assignment, including stop words. We find NES provides correct and consistent predictions across assignments and concepts, without any explicit logical form-level supervision. See Sec. 4.1.4.

4 Experiments

4.1 Experiments: Synthetic Language

We design the first series of experiments to highlight key compositional and generalization properties of NES in a controlled, synthetic setting.

4.1.1 Dataset and Tasks

ShapeWorld. Our synthetic tasks and datasets are based on the ShapeWorld benchmark suite (Kuhnle and Copestake, 2017), which was designed specifically for evaluation of compositional models for grounded semantics. Here, events are based on simple objects: shapes with different color attributes and spatial relationships. Images are generated by sampling events from task-specific distributions with visual noise (e.g. hue, size variance), and are placed without hard grid constraints. For each image, multiple true/false language statements are generated with a templated grammar (Copestake et al., 2016). Negative statements are generated close to the distribution of positive statements to ensure difficulty: models must understand all aspects of the statement correctly to output a truth condition label. We visualize an example in the qualitative results (Fig. 4).

Task A: Standard Generalization. This generalization task evaluates compositional models on the standard setting where train and evaluation splits are based on the same underlying input event distribution. This task is similar to the original SHAPES dataset (Andreas et al., 2016b), without shape positions locked to a 3 × 3 spatial grid.


[Figure 5 (bar chart): accuracy (0.7 to 1.0) of NMN-GT vs. NES-GT on Standard (Task A) and Compositional (Task B).]

Figure 5: Validating Conjunctivism. Here, we provide ground truth (GT) logical forms for both functionist (NMN) and conjunctivist (NES) approaches. Controlling other factors, we observe that our conjunctivist NES framework provides better systematic generalization (Task B) than a functionist one. See Sec. 4.1.3.

[Figure 6: response graphs over input color for the classifiers red(e), blue(e), green(e), yellow(e), magenta(e), cyan(e), comparing NES (ours) and NMN responses.]

Figure 6: Validating Conjunctivism: Attribute Response. Response graphs for color attributes on ShapeWorld data in Task B. Our conjunctivist NES framework offers stronger disentangled understanding of each word as a concept classifier, compared to the prior functionist NMN framework. See Sec. 4.1.3.

Task B: Compositional Generalization. The compositional generalization task examines the systematic generalization of models to an unseen event distribution. During test time, every instance has at least one event sampled from a held-out distribution. For example, while red triangles and blue squares may be present at train time, blue triangles and red squares are only present during test time. Critically, any language associated with these unseen events is always false during training since these events are never actually present. Thus, models that overfit on complete phrases during training will not generalize well at test time.
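The held-out split above can be sketched as follows. This is a minimal illustration over attribute pairs; the actual task samples full events (with position, size, and visual noise), not just attribute combinations.

```python
from itertools import product

def compositional_split(colors, shapes, held_out_pairs):
    """Partition (color, shape) combinations so that held-out pairs
    appear only at test time, as in Task B: every attribute is seen in
    training, but not in the held-out combination."""
    all_pairs = set(product(colors, shapes))
    held = set(held_out_pairs)
    assert held <= all_pairs, "held-out pairs must be valid combinations"
    train = sorted(all_pairs - held)
    return train, sorted(held)
```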

Task Variant: Negation. For both tasks, we include a variation with negation to ensure NES can model non-intersective modifiers, which are prevalent in real-world grounded language. In these variants, true and false statements that include attribute-level negation (e.g. phrases like "not red") are also generated for each image.

4.1.2 Baseline and Model Details

Baselines. Across our synthetic experiments, we compare NES against baselines in 3 categories:
• Black-box neural networks. These baseline neural network models combine CNN, LSTM, and attention components (Johnson et al., 2017a) and represent standard end-to-end black-box techniques for language + vision tasks.
• Functionist approaches. For our functionist baselines, we consider the prevailing parameterizations of the neural module networks (NMN) framework (Andreas et al., 2016b). For the modules, we leverage the base generic module design introduced in the E2ENMN framework (Hu et al., 2017; Bahdanau et al., 2019a). Because our experiments are event-centric, the inputs and implementation of the framework are consistent with prior work (Yi et al., 2018; Mao et al., 2019; Subramanian et al., 2020). Thus, each module takes as input a set of localized event values (originally from the image), an attention over these values (from a preceding module step), and a decontextualized word embedding. The module then applies the attention and processes the input, before outputting an updated attention to be used in dependent downstream module steps. For end-to-end (E2E) experiments, ground truth programs are used to pre-train the parsing module layout generator, which is the structural heart of NMN. This parser is implemented using a sequence-to-sequence Bi-LSTM (Hu et al., 2017; Johnson et al., 2017b). We emphasize that, in our experiments, we ensure consistent hidden state sizes for both the modules and the sequence encoder for NMN and NES, as well as consistent event-centric visual + decontextualized word embedding input.
• Implicit semantics methods. This class of models leverages recursive computation units with attention over visual and textual input to provide better compositionality than traditional end-to-end black-box neural network methods. We examine the MAC model (Hudson and Manning, 2018, 2019) as a representative baseline, following recent prior work (Bahdanau et al., 2019a). Similar to our NMN baseline, we report results with an event-centric version of the MAC model, following Mao et al. (2019), such that MAC is able to attend over a discrete set of localized event values. Thus, we can enable fair and consistent comparison of MAC, NMN, and other baselines with NES.
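As a rough sketch of the generic module interface described above, the following pure-Python filter-style module re-weights an incoming attention over localized event values by each event's match to a (decontextualized) word embedding. The dot-product-plus-sigmoid scorer is a hypothetical stand-in for the learned network inside each module.

```python
import math

def generic_module(event_values, attn_in, word_emb):
    """One module step: score each localized event value against the word
    embedding, re-weight the incoming attention by those scores, and
    emit a renormalized attention for downstream module steps."""
    # toy per-event match score in [0, 1] (stand-in for a learned net)
    scores = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(word_emb, v))))
              for v in event_values]
    weighted = [a * s for a, s in zip(attn_in, scores)]
    z = sum(weighted) or 1.0  # guard against an all-zero attention
    return [w / z for w in weighted]
```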

[Figure 7: accuracy of CNN-LSTM, CNN-LSTM-Attn, MAC, E2E-Func (NMN), and NES (ours) on the Standard (Task A) and Compositional (Task B) settings, plotted on a 0.5–1.0 accuracy scale.]

Figure 7: End-to-End Methods. Generalization performance of end-to-end methods on ShapeWorld tasks. We observe that our conjunctivist NES framework offers stronger generalization performance on both standard (Task A) and systematic (Task B) compositional task settings. See Section 4.1.4 for additional details and analysis.

Implementation Details. Models and baselines are implemented in PyTorch (Paszke et al., 2019). Localized event candidate values V are extracted by a pre-processing step. Our encoder φ is a ResNet-101 network (He et al., 2016), and localized event feature representations are based on conv4 features per prior work (Johnson et al., 2017a; Hudson and Manning, 2018) with pixel grid coordinates (per Sec. 3.5) to capture the necessary spatial and visual information for the downstream semantics. Following standard work in object detection (He et al., 2017), we use pooling to ensure all localized event values have the same dimension. Word embeddings are 300-dim GloVe.6B embeddings (Pennington et al., 2014). All text and visual inputs are consistent across all models for fair comparison. As noted previously, model sizes are also kept consistent across models where applicable. Please refer to the supplement for implementation and additional details.5
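The event pre-processing above (pooling region features to a fixed dimension and appending grid coordinates) can be sketched as follows. The real pipeline pools ResNet-101 conv4 features with RoI-style pooling, so this mean-pooling helper is illustrative only.

```python
def event_representation(region_features, cx, cy, width, height):
    """Mean-pool a variable-size set of feature vectors from one region
    and append the region's normalized center coordinates, so every
    localized event value has the same dimension."""
    n = len(region_features)
    dim = len(region_features[0])
    pooled = [sum(f[d] for f in region_features) / n for d in range(dim)]
    # normalized pixel-grid coordinates carry the spatial information
    return pooled + [cx / width, cy / height]
```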

4.1.3 Validating Conjunctivism

Overview. Our first experiments are centered around validating a fundamental design principle underlying our NES framework: that concept meaning can be effectively represented by conjunction of event classifiers. Both NMN and NES leverage syntax to guide their compositional structure: functional module layout (NMN) and event routing (NES), respectively.6 Here, we isolate the impact of the design philosophy on the quality of the learned semantics by providing ground truth (GT) "syntax" (layout or routing) to each framework, assessing performance on Tasks A and B.

5 Available at https://neural-event-semantics.github.io/
6 We note that while we focus on the functionist realizations of NMNs prevalent in prior work, we recognize that the broader family of modular network approaches can include conjunctivist elements as well. A key intention of these experiments is to illustrate the value of our conjunctivist design as a compelling direction for future modular network design.

Systematic Generalization. Figure 5 shows the results for both NMN-GT and NES-GT. Both frameworks perform equally well on the standard generalization task (Task A), showing that the NES conjunctivist design preserves the efficacy of the functionist paradigm. In Task B however, while both frameworks perform reasonably well, NES exhibits stronger systematic generalization capability than the NMN model when evaluated on an unseen event distribution. These quantitative results suggest that NES enables a stronger decoupling of individual concepts, yielding higher accuracy when they are composed for unseen events.

To explore concept disentanglement further, we analyze the color sensitivity of color words in Figure 6. For this analysis, we take the trained models from Task B and examine the normalized response score of different modules (e.g. red) to a continuous spectrum of color input. We sample the input shapes for each color classifier from the unseen event distribution. Our analysis suggests that NES offers stronger disentanglement of attribute concepts: color words respond to separated and appropriate spectral regions, in contrast to NMN.7
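The response analysis above amounts to probing each word classifier over a sweep of input colors. A toy sketch, with a hand-built Gaussian-bump classifier standing in for a learned module:

```python
import math

def make_color_classifier(center_hue, width=0.05):
    """Toy stand-in for a learned color-word classifier: responds to a
    band of hues around center_hue (hue in [0, 1), treated circularly)."""
    def classify(hue):
        d = min(abs(hue - center_hue), 1 - abs(hue - center_hue))
        return math.exp(-(d / width) ** 2)
    return classify

def response_curve(classifier, n=100):
    """Probe a classifier over an evenly spaced hue grid."""
    hues = [i / n for i in range(n)]
    return hues, [classifier(h) for h in hues]
```

A disentangled classifier, as in the analysis above, should peak in one spectral region and stay low elsewhere.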

4.1.4 End-to-End Experiments

Overview. Having validated that conjunctivist composition can support strong performance with known event routings, our second set of synthetic experiments is designed to assess the full end-to-end learning capability of the NES framework, including the critical event routing stage. In this setting, we offer no ground truth logical form input or supervision to the NES model, and evaluate performance on all tasks. We do necessary program layout pre-training for the E2E-Func (NMN) baseline prior to end-to-end REINFORCE training.

7 This finding, with respect to NMN, is analogous and consistent with concurrent prior work (Subramanian et al., 2020).

[Figure 8: predicted event routing attention vectors over (e1, e2, e∅) for argument 1 and argument 2 of each word in "a triangle is above a red circle.", yielding the logical form ∃e1,e2. a(e1) ∧ triangle(e1) ∧ is(e1,e2) ∧ above(e1,e2) ∧ a(e2) ∧ red(e2) ∧ circle(e2) ∧ .()]

Figure 8: Learned Event Routing. We visualize the predicted soft event routings of a sentence from Figure 4. (i) shows how "a" and "triangle" are effectively arity-1 functions, with the same event e1 routed to their first argument, and e∅ to the second. (iii) shows the same, with e2. (ii) shows an arity-2 routing for relational predicates, and (iv) shows how punctuation can be given an arity-0 routing. See Sec. 4.1.4.

Generalization. In Figure 7, we show that our initial findings in Sec. 4.1.3 hold in the more general end-to-end setting, across the broader set of model classes. While compositional methods consistently outperform the noncompositional baselines, there is a clear differentiation between MAC and NES/E2E-Func on Task B (systematic novel-event generalization). This suggests that MAC relies too strongly on correlative associations of text phrases for unseen events, overfitting at training.

In Fig. 4, we visualize a table of NES score predictions on a specific input V, using a two-event setting for visual clarity. An input statement is considered true if there is an assignment (grounding to V) of the events with a high overall score. Across different event assignments e ← V, NES provides consistent and correct score outputs. Because NES considers each word as its own event classifier (with appropriate routing), it provides interpretable indicators for which attributes are specifically not present for each assignment.
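The scoring rule described above (conjunction as a product of classifier scores, existential quantification as a max over event assignments) can be sketched as follows; the word classifiers used in the test are hand-built stand-ins for the learned modules.

```python
from itertools import permutations

def sentence_score(word_classifiers, n_vars, candidates):
    """Conjunctivist composition: each word is a classifier over routed
    event arguments; a statement's score is the max over assignments of
    candidate events to the n_vars logical-form variables of the product
    of per-word scores."""
    best = 0.0
    for assign in permutations(candidates, n_vars):
        score = 1.0
        for fn, args in word_classifiers:
            score *= fn(*(assign[i] for i in args))
        best = max(best, score)
    return best
```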

In Figure 8, we visualize the event routing predictions from an example NES model trained end-to-end. Consistent with our observation in Figure 4, we see that the model can learn approximate routings and implicit arity of the different event classifiers. Though event routings are modeled as soft attention and classifier output scores are continuous, both have approached nearly discrete outputs by the end of learning, capturing the underlying logical structure of the domain.8

[Figure 9: (a) routing attention vectors over (e1, e2, e∅) for "red" and "not red", where negation shifts the first-argument attention toward a mixture of e1 and e∅; (b) accuracy comparison on the negation task variants.]

Figure 9: Negation with NES. (a) We visualize one way in which NES can handle coordination for non-intersective modifiers (e.g. attribute negation) by leveraging the background event e∅. NES soft routing leads to a modified event argument input e′1 attending over e1 and e∅, enabling the red classifier to output the opposite prediction (now, output score = 1.0 if original e1 is not red). (b) NES performance on Task A and B negation variants remains consistent. Ablation (Sec. 4.1.4) highlights the impact of the event routing mechanism.
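The soft routings visualized in Figures 8 and 9 can be sketched as convex combinations over the candidate events plus the ungrounded background event e∅; the attention values below are illustrative, not learned.

```python
def route_argument(attention, events, e_background):
    """Build one argument input to a word classifier as an
    attention-weighted mixture of the candidate events and the
    ungrounded background event."""
    slots = events + [e_background]
    assert len(attention) == len(slots)
    assert abs(sum(attention) - 1.0) < 1e-6  # soft attention sums to 1
    dim = len(e_background)
    return [sum(a * e[d] for a, e in zip(attention, slots)) for d in range(dim)]
```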

Negation. Finally, we demonstrate that NES is capable of handling non-intersective modifiers by examining its ability to model property negation. In contrast with functionist models, conjunctivist event semantics must handle negation through modification of the input event to the given predicate (Pietroski, 2005). In Figure 9, we show the results from these experiments. First, we observe that NES can maintain the same level of generalization accuracy in variants of Task A and B that contain negation. Visualizing an example model, we see that NES learns to coordinate negation through its event routing stage: the presence of "not" in the textual input can lead NES to predict a soft routing Aw1∗ that attends to a combination of both e1 and the ungrounded background e∅ for the first argument of "red" (denoted as e′1 in the example). Now, when this specific "red" attribute classifier processes its updated event arguments, its classification behavior is reversed: a high score when the attribute is not present in the original e1.

8 Without low-level supervision to break symmetry, it is possible for separate end-to-end training runs to learn different but equivalent routing schemes (and matching event classifiers): for example, NES can learn event classifiers M where argument slot 2 is consistently its primary slot (instead of slot 1). In such a case, we can use the jointly learned (consistently inverted) event routings to remap for visualization.

We compare with an ablation variant of NES that removes this routing flexibility: for attribute classifiers Mw, we restrict their routing attention Aw∗∗ to only consider the n − 1 grounded events in the first argument slot (removing e∅ from consideration) and fix the second slot a2 to the background e∅. Because individual event classifier modules only take decontextualized word embeddings, the event routing mechanism is the only way for context information to influence the classification. Thus, this ablation directly reflects the impact of the flexible event routing mechanism and its usage of the ungrounded background event to handle more complex language settings. We find that while the ablation maintains performance on the standard tasks, its accuracy significantly decreases in this setting where some input statements have negation. Overall, we observe that the rich, augmented event space and flexible event routing stage enable our conjunctivist framework to learn how to model non-intersective modifiers, a crucial step for real-world language (Sec. 4.2).

4.2 Experiments: Real-World Language

Having validated the efficacy of NES in a controlled synthetic setting, we now explore NES in a grounded reference game task to demonstrate its broader applicability. Because the overall end-to-end NES framework requires no low-level supervision during training, it mirrors the broader applicability of implicit semantics methods (MAC) to less structured, human-generated language.

Chairs-in-Context (CiC). The Chairs-in-Context (CiC) dataset (Achlioptas et al., 2019) contains chairs and other objects from the ShapeNet dataset, paired with human-generated language collected in the context of a reference game. Each CiC input consists of a set of 3 chairs representing a contrastive communication context, with a human utterance (up to 33 tokens) intended to identify one of the chairs. In total, there are over 75k triplets with an 80-10-10 split for train-val-test. CiC also contains a zero-shot evaluation set with triplets of unseen object classes (e.g. tables). CiC is challenging due to its relatively long-tail language diversity and varied visual inputs.

Task A: Language Generalization. Our first CiC benchmark task is language generalization, where a model must ground the specific chair from the input set given a referring utterance. The dataset

Table 1: CiC-Language Generalization. NES on real-world language from the Chairs-in-Context (CiC) dataset. *SG architectures from Achlioptas et al. (2019) are the previously reported state-of-the-art method. NES+ grounds sub-events on the feature grid input. -SN indicates ShapeNet pre-trained features.

METHOD       INPUT      LISTENER ACC.
MAJORITY     N/A        0.333
*SG-NOATTN   VGG16-SN   0.812 ± 0.008
*SG-ATTN     VGG16-SN   0.817 ± 0.008
LSTM-ATTN    VGG16-SN   0.731 ± 0.012
POE          VGG16-SN   0.752 ± 0.009
NMN          VGG16-SN   0.763 ± 0.023
MAC          VGG16-SN   0.818 ± 0.013
NES          VGG16      0.842 ± 0.005
NES          VGG16-SN   0.856 ± 0.005
NES          RES101     0.853 ± 0.011
NES+         RES101     0.870 ± 0.009

Table 2: CiC-Zero Shot Generalization. Zero-shot generalization to unseen objects on the Chairs-in-Context (CiC) dataset. Results suggest NES can learn words as event classifiers in a general, object-agnostic manner. *SG model from (Achlioptas et al., 2019).

                     ZERO-SHOT CLASSES
MODEL          LAMP   BED    TABLE  SOFA   ALL
MAJOR.         0.333  0.333  0.333  0.333  0.333
*SG            0.501  0.564  0.637  0.536  0.560
POE            0.422  0.466  0.587  0.483  0.490
NMN            0.462  0.492  0.572  0.532  0.515
MAC            0.533  0.531  0.632  0.551  0.567
NES w/VGG16    0.544  0.578  0.693  0.588  0.601
NES w/Res101   0.573  0.589  0.715  0.610  0.622

split ensures no overlap in speaker-listener pairs between training and evaluation, so models must generalize to new communication contexts.

Task B: Zero-Shot Generalization. Our second CiC benchmark task is zero-shot generalization, which examines the ability for the model to generalize from understanding attribute concepts learned in a chairs context to contexts with unseen object classes like tables and lamps. The overall task setting is the same as before, but during evaluation the triplets are composed of objects from a particular unseen class. For consistency with prior work, all models here are evaluated on an image-only setting (i.e. no 3D point-cloud representation). We provide a breakdown of the results on the full zero-shot transfer set by class.

[Figure 10: chair and bed triplets with per-word NES classifier score tables for utterances such as "three openings in the back" and "messy bed, four pillows", alongside top/bottom-5 retrievals for "tall" and "short".]

Figure 10: CiC Qualitative Results. We visualize results on the (a) CiC evaluation set and (b) zero-shot evaluation set. Chair and bed triplets (c∗, b∗) are shown with NES output scores. Tables show relative classifier scores that are normalized per word, for visualization purpose only (e.g. if a word has classifier scores of 1.0 across events, then we show them as 0.333). NES grounds real-world reference language and provides meaningful interpretability on how individual classifiers contribute to the final score. (c) Event classifiers can be used standalone for retrieval, showing lexical consistency between antonyms – (d) shows Pearson correlations (p-value < 1e−13). See Sec. 4.2.

Models and Implementation. Our main baseline is the recent ShapeGlot (SG) architecture (Achlioptas et al., 2019). The SG baseline leverages recurrent, convolutional, and attention components in an end-to-end architecture to achieve state-of-the-art performance on the language and zero-shot generalization datasets. We also consider a conjunctive baseline with event classifiers without the soft event routing stage, reminiscent of a product-of-experts (PoE) classification setting. This baseline serves to illustrate the impact of the flexible routing stage on compositionality, and in particular handling of non-intersective modifiers. We additionally report two compositional baselines from Sec. 4.1.2, MAC and NMN, following the protocols outlined by our previous end-to-end synthetic experiments (Sec. 4.1.4). Because CiC contains unstructured human-generated text and it is difficult to train NMN end-to-end from denotation alone, we initialize the sequence-to-sequence program generator in the NMN baseline by pre-training on auxiliary parse information for 1,000 examples (Suhr et al., 2019; Yi et al., 2018); all other baselines do not have any additional supervision data. Finally, we also consider a denser input event space for NES corresponding to sub-regions in the image input. Here, sub-events are additionally sampled from the (unannotated) final conv4 feature grid of the encoder network; we denote this as NES+ in our experiments. We adopt consistent experimental settings from Achlioptas et al. (2019), treating each chair as an event candidate space, with predictions normalized by 3-way softmax over possible target images. All model sizes are kept comparable in number of parameters for fair comparison. We leverage the same pre-trained VGG16 features (Simonyan and Zisserman, 2015; Chang et al., 2015) and GloVe (Wiki.6B) embeddings (Pennington et al., 2014). For completeness, we report results with VGG16 and ResNet-101 without ShapeNet pre-training for both tasks.

Analysis. We report our results in Table 1 and Table 2 against the prior state-of-the-art SG architecture (Achlioptas et al., 2019). The MAC baseline provides comparable performance to the prior state-of-the-art. The NMN baseline has reasonable accuracy, albeit lower than the MAC and SG baselines. This is likely due to the ambiguity in longer token sequences (up to 33 tokens), which can contain filler words and occasional disfluencies that hurt the efficacy of the sequence-to-sequence program generator. Nonetheless, NMN outperforms the PoE baseline, which serves as a simplistic conjunctive modular baseline without the NES event routing framework.
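The 3-way normalization described in the implementation above can be sketched as a softmax over per-chair NES scores (a minimal illustration):

```python
import math

def listener_probs(chair_scores):
    """Normalize per-chair scores into a 3-way distribution over the
    candidate target images with a (numerically stable) softmax."""
    m = max(chair_scores)
    exps = [math.exp(s - m) for s in chair_scores]
    z = sum(exps)
    return [e / z for e in exps]
```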

We observe that our model improves over the prior state-of-the-art work on this dataset by a large margin on the original neural listener task. Further, NES significantly improves zero-shot generalization performance, indicating that it has learned event classifiers for attributes (e.g. "messy", "tall") that can generalize to entirely unseen input event distributions. We visualize qualitative results in Fig. 10: NES can provide interpretable event classifier outputs at the word level without any additional low-level supervision, in both the main (chairs) and unseen zero-shot settings. We also show how learned event classifiers are lexically consistent by performing standalone retrieval of antonym pairs. We observe that high-ranked retrievals for a word classifier correlate with low-ranked retrievals of its antonym.
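The antonym-consistency analysis above can be sketched with a plain Pearson correlation between one word's scores and its antonym's; the toy score lists in the test are hypothetical (the paper's Figure 10(d) correlates retrieval results).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score
    lists; near -1 indicates the antonym-style anticorrelation above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```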

4.3 Overall Discussion

We provide additional discussion of the overall NES framework, considering its broader implications, limitations, and avenues for further work.

Broader Generality. In the sections above, we have described our key results of NES on the ShapeWorld and CiC benchmarks. However, modular neural network approaches like NMN are intuitively suited to settings where the visual and language environments are particularly regular, context-free, and unambiguous. In its current formulation, NES is similarly suited to such structured settings: effective generalization to highly irregular and context-sensitive vision and language settings in images and videos (Zhou et al., 2019; Huang* et al., 2018) remains outside the current scope of the presented paper. Nonetheless, we believe that careful consideration of key elements in the NES framework, such as the proposed soft event routing system with ungrounded events for coordinating richer meaning, can offer a promising route towards improving the state-of-the-art.

Computational Complexity. Through its existential quantification over events, the complexity of event assignment (Eq. 7) during inference scales as O(k^(n−1)), where k is the number of visual event candidates V and n − 1 the number of events e in the logical form F (excluding e∅). This was not an issue in the domains examined here, but may become one in complex vision-language domains. Exploring potential relationships with concurrent techniques (Bogin et al., 2020) that increase computational complexity but also improve systematicity may prove insightful here as well.
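The assignment count behind the complexity above can be verified with a tiny enumeration (assuming assignments ground distinct events, as in the two-event examples earlier):

```python
from itertools import permutations

def num_assignments(k, m):
    """Number of ways to ground m logical-form events (excluding the
    background e∅) in k visual candidates with distinct events:
    k * (k-1) * ... * (k-m+1), i.e. O(k^m) for fixed m."""
    return len(list(permutations(range(k), m)))
```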

5 Conclusion

In this work, we introduced neural event semantics (NES) for compositional grounded language understanding. Our framework's conjunctivist design offers a compelling alternative to designs rooted primarily in function-based semantics: by deriving structure from events and their (soft) routings, NES operates with a simpler composition ruleset (conjunction) and effectively learns semantic concepts without any low-level ground truth supervision. Controlled synthetic experiments (ShapeWorld) show the generalization benefits of our framework, and we demonstrate broader applicability of NES on real-world language data (CiC) by significantly improving language and zero-shot generalization over prior state-of-the-art. Ultimately, our work shows that deep consideration of the mechanisms for compositional neural methods may yield techniques better suited for differentiable neural modeling, maintaining core expressivity for grounded language understanding tasks.

A Appendix: Equation 8

We describe a formula for an approximation of the max function (L∞-norm), used during training to improve end-to-end gradient flow (Sec. 3.6). Let s ∈ R^n with s_i ≥ 0 be a vector of non-negative scores. We consider f_max : R^n → R (Eq. 8 in Sec. 3.6):

    f_max(s; β) = (Σ_i s_i^{β+1}) / (Σ_i s_i^β)

where β ≥ 0 is a hyperparameter. As β → ∞, f_max(s; β) → s*, where s* = max(s) = L∞(s). We can show this by dividing the numerator and denominator by (s*)^{β+1} and taking the limit:

    lim_{β→∞} f_max(s) = lim_{β→∞} (Σ_i s_i^{β+1} / (s*)^{β+1}) / (Σ_i s_i^β / (s*)^{β+1})    (9)

Now all terms where |s_i| < s* tend to 0, leaving us just the maximum terms in the numerator and denominator where |s_i| = s*. Thus, Eq. 9 reduces to 1 / (1/s*) = s*, as desired.

Further, |f_max(s)| has the essential property of always being upper-bounded by s*. We show this by Hölder's inequality. Let x_i = s_i, y_i = (s_i)^β, and let p → ∞ and q → 1 (satisfying the conditions 1/p + 1/q = 1 and p, q ∈ (1, ∞)). Then,

    Σ_i |x_i y_i| ≤ (Σ_i |x_i|^p)^{1/p} (Σ_i |y_i|^q)^{1/q}

    Σ_i |x_i y_i| / (Σ_i |y_i|^q)^{1/q} ≤ (Σ_i |x_i|^p)^{1/p}

    |f_max(s)| ≤ L∞(s) = s*

Thus, with non-negative scores s_i ≥ 0, we have lim_{β→∞} f_max(s) = max(s) and f_max(s) ≤ max(s).
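A minimal implementation of Eq. 8, useful for checking the limit and the upper bound numerically:

```python
def fmax(scores, beta):
    """Smooth approximation of max(scores) (Eq. 8): a ratio of power
    sums. As beta grows, the largest score dominates both sums, so the
    ratio approaches max(scores) while never exceeding it (Holder)."""
    num = sum(s ** (beta + 1) for s in scores)
    den = sum(s ** beta for s in scores)
    return num / den
```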


Acknowledgments

We sincerely thank Alex Tamkin, Jesse Mu, Mike Wu, Panos Achlioptas, Robert Hawkins, Boris Perkhounkov, Fereshte Khani, Robin Jia, Elisa Kreiss, Jim Fan, Juan Carlos Niebles, Thomas Icard, and Paul Pietroski for helpful discussions and support. We are grateful to our action editor, Luke Zettlemoyer, and our anonymous reviewers for their insightful and constructive feedback during the review process. S. Buch is supported by an NDSEG Fellowship. This work was supported in part by the Office of Naval Research grant ONR MURI N00014-16-1-2007 and by a NSF Expeditions Grant, Award Number (FAIN) 1918771. This article reflects the authors' opinions and conclusions, and not any other entity.

References

Panos Achlioptas, Judy Fan, Robert Hawkins,Noah Goodman, and Leonidas J Guibas. 2019.ShapeGlot: Learning Language for Shape Dif-ferentiation. In Proceedings of the IEEE Inter-national Conference on Computer Vision, pages8938–8947.

Jacob Andreas, Marcus Rohrbach, Trevor Dar-rell, and Dan Klein. 2016a. Learning to Com-pose Neural Networks for Question Answering.NAACL.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell,and Dan Klein. 2016b. Neural Module Net-works. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition,pages 39–48.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu,Margaret Mitchell, Dhruv Batra, C. LawrenceZitnick, and Devi Parikh. 2015. VQA: VisualQuestion Answering. In Proceedings of theIEEE International Conference on ComputerVision.

Dzmitry Bahdanau, Shikhar Murty, MichaelNoukhovitch, Thien Huu Nguyen, Harmde Vries, and Aaron Courville. 2019a. System-atic Generalization: What is Required and CanIt Be Learned? ICLR.

Dzmitry Bahdanau, Harm de Vries, Timothy JO’Donnell, Shikhar Murty, Philippe Beau-doin, Yoshua Bengio, and Aaron Courville.

2019b. CLOSURE: Assessing Systematic Gen-eralization of CLEVR Models. arXiv preprintarXiv:1912.05783.

Ben Bogin, Sanjay Subramanian, Matt Gard-ner, and Jonathan Berant. 2020. Latent Com-positional Representations Improve SystematicGeneralization in Grounded Question Answer-ing. arXiv preprint arXiv:2007.00266.

Angel X Chang, Thomas Funkhouser, LeonidasGuibas, Pat Hanrahan, Qixing Huang, ZimoLi, Silvio Savarese, Manolis Savva, ShuranSong, Hao Su, et al. 2015. ShapeNet: AnInformation-Rich 3D Model Repository. arXivpreprint arXiv:1512.03012.

Devendra Singh Chaplot, Kanthashree MysoreSathyendra, Rama Kumar Pasumarthi,Dheeraj Rajagopal, and Ruslan Salakhut-dinov. 2018. Gated-Attention Architecturesfor Task-Oriented Language Grounding. InThirty-Second AAAI Conference on ArtificialIntelligence.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr-ishna Vedantam, Saurabh Gupta, Piotr Dollar,and C Lawrence Zitnick. 2015. MicrosoftCOCO Captions: Data Collection and Evalua-tion Server. arXiv preprint arXiv:1504.00325.

Ann Copestake, Guy Emerson, Michael WayneGoodman, Matic Horvat, Alexander Kuhnle,and Ewa Muszynska. 2016. Resources forbuilding applications with dependency minimalrecursion semantics. In Proceedings of theTenth International Conference on LanguageResources and Evaluation (LREC’16), pages1240–1247.

Donald Davidson. 1967. The Logical Form of Ac-tion Sentences. In The Logic of Decision andAction, pages 81–120. Pittsburgh: University ofPittsburgh Press.

Alex Graves and Jurgen Schmidhuber. 2005.Framewise Phoneme Classification with Bidi-rectional LSTM and Other Neural Network Ar-chitectures. Neural networks, 18(5-6):602–610.

Kaiming He, Georgia Gkioxari, Piotr Dollar, andRoss Girshick. 2017. Mask R-CNN. In Pro-ceedings of the IEEE international conferenceon computer vision, pages 2961–2969.

Page 14: Neural Event Semantics for Grounded Language Understanding

Kaiming He, Xiangyu Zhang, Shaoqing Ren, andJian Sun. 2016. Deep Residual Learning forImage Recognition. In Proceedings of theIEEE conference on computer vision and pat-tern recognition, pages 770–778.

Ronghang Hu, Jacob Andreas, Trevor Darrell, andKate Saenko. 2018. Explainable Neural Com-putation via Stack Neural Module Networks.In Proceedings of the European Conference onComputer Vision (ECCV), pages 53–69.

Ronghang Hu, Jacob Andreas, Marcus Rohrbach,Trevor Darrell, and Kate Saenko. 2017. Learn-ing to Reason: End-to-End Module Networksfor Visual Question Answering. In Proceedingsof the IEEE International Conference on Com-puter Vision.

De-An Huang*, Shyamal Buch*, Lucio Dery,Animesh Garg, Li Fei-Fei, and Juan Car-los Niebles. 2018. Finding “It”: Weakly-Supervised, Reference-Aware Visual Ground-ing in Instructional Videos. In IEEE Confer-ence on Computer Vision and Pattern Recogni-tion (CVPR).

Drew A Hudson and Christopher D Manning.2018. Compositional attention networks formachine reasoning. In International Confer-ence on Learning Representations (ICLR).

Drew A Hudson and Christopher D Manning.2019. GQA: A new dataset for real-world vi-sual reasoning and compositional question an-swering. In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recogni-tion, pages 6700–6709.

TS Jayram, Vincent Marois, Tomasz Kornuta,Vincent Albouy, Emre Sevgen, and Ahmet SOzcan. 2019. Transfer Learning in Visualand Relational Reasoning. arXiv preprintarXiv:1911.11938.

Justin Johnson, Bharath Hariharan, Laurensvan der Maaten, Li Fei-Fei, C Lawrence Zit-nick, and Ross Girshick. 2017a. CLEVR: A Di-agnostic Dataset for Compositional Languageand Elementary Visual Reasoning. In Pro-ceedings of the IEEE Conference on ComputerVision and Pattern Recognition, pages 1988–1997. IEEE.

Justin Johnson, Bharath Hariharan, Laurensvan der Maaten, Judy Hoffman, Li Fei-Fei,C Lawrence Zitnick, and Ross B Girshick.2017b. Inferring and Executing Programs forVisual Reasoning. In Proceedings of the IEEEInternational Conference on Computer Vision,pages 3008–3017.

Andrej Karpathy and Li Fei-Fei. 2015. DeepVisual-Semantic Alignments for GeneratingImage Descriptions. In Proceedings of theIEEE Conference on Computer Vision and Pat-tern Recognition, pages 3128–3137.

Jayant Krishnamurthy and Thomas Kollar. 2013.Jointly Learning to Parse and Perceive: Con-necting Natural Language to the PhysicalWorld. Transactions of the Association forComputational Linguistics, 1:193–206.

Jayant Krishnamurthy, Oyvind Tafjord, andAniruddha Kembhavi. 2016. Semantic Pars-ing to Probabilistic Programs for Situated Ques-tion Answering. In Proceedings of the 2016Conference on Empirical Methods in NaturalLanguage Processing, pages 160–170, Austin,Texas. Association for Computational Linguis-tics.

Alexander Kuhnle and Ann A. Copestake. 2017. ShapeWorld - A New Test Methodology for Multimodal Language Understanding. CoRR, abs/1704.04517.

Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2017. Building Machines that Learn and Think Like People. Behavioral and Brain Sciences, 40.

Mateusz Malinowski and Mario Fritz. 2014. A Multi-World Approach to Question Answering About Real-World Scenes Based on Uncertain Input. Advances in Neural Information Processing Systems, 27:1682–1690.

Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. 2019. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision. In International Conference on Learning Representations.

Vincent Marois, TS Jayram, Vincent Albouy, Tomasz Kornuta, Younes Bouhadjar, and Ahmet S Ozcan. 2018. On Transfer Learning using a MAC Model Variant. arXiv preprint arXiv:1811.06529.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. International Conference on Machine Learning.

Will Monroe, Robert XD Hawkins, Noah D Goodman, and Christopher Potts. 2017. Colors in Context: A pragmatic neural model for grounded language understanding. Transactions of the Association for Computational Linguistics, 5:325–338.

Richard Montague. 1970. Universal Grammar. Theoria, 36(3):373–398.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. FiLM: Visual Reasoning with a General Conditioning Layer. In Thirty-Second AAAI Conference on Artificial Intelligence.

Paul M. Pietroski. 2005. Events and Semantic Architecture. Oxford University Press, U.S.A.

Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M Lake. 2020. A Benchmark for Systematic Generalization in Grounded Language Understanding. Advances in Neural Information Processing Systems.

Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.

Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, and Matt Gardner. 2020. Obtaining Faithful Interpretations from Compositional Neural Networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5594–5608.

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A Corpus for Reasoning about Natural Language Grounded in Photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428.

Adam Vogel and Dan Jurafsky. 2010. Learning to Follow Navigational Directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 806–814.

Ronald J Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256.

Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. 2018. Neural-Symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pages 1039–1050.

Luke S Zettlemoyer and Michael Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. UAI '05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence.

Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J Corso, and Marcus Rohrbach. 2019. Grounded Video Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6578–6587.