Compositionality in Recursive Neural Networks
Martha Lewis
ILLC, University of Amsterdam
SYCO3, March 2019
Oxford, UK
Outline
Compositional distributional semantics
Pregroup grammars and how to map to vector spaces
Recursive neural networks (TreeRNNs)
Mapping pregroup grammars to TreeRNNs
Implications
Compositional Distributional Semantics
Frege’s principle of compositionality
The meaning of a complex expression is determined by the meanings of its parts
and the rules used for combining them.
Distributional hypothesis
Words that occur in similar contexts have similar meanings [Harris, 1958].
Symbolic Structure
A pregroup algebra is a partially ordered monoid, where each element p has a left and a right adjoint such that:
p · p^r ≤ 1 ≤ p^r · p        p^l · p ≤ 1 ≤ p · p^l
Elements of the pregroup are basic (atomic) grammatical types, e.g. B = {n, s}.
Atomic grammatical types can be combined to form types of higher order (e.g. n · n^l or n^r · s · n^l).
A sentence w1 w2 . . . wn (with word wi of type ti) is grammatical whenever:
t1 · t2 · . . . · tn ≤ s
Pregroup derivation: example
p · p^r ≤ 1 ≤ p^r · p        p^l · p ≤ 1 ≤ p · p^l
[Parse tree: [S [NP [Adj trembling] [N shadows]] [VP [V play] [N hide-and-seek]]]]
trembling shadows play hide-and-seek
n · n^l        n        n^r · s · n^l        n

n · n^l · n · n^r · s · n^l · n ≤ n · 1 · n^r · s · 1 = n · n^r · s ≤ 1 · s = s
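As a rough illustration of this reduction, here is a minimal Python sketch (not from the slides) that greedily cancels adjacent p · p^r and p^l · p pairs; greedy cancellation is not a full pregroup parser, but it suffices for the example above.

```python
# Minimal sketch (not from the slides): check the reduction above by greedily
# cancelling adjacent p · p^r and p^l · p pairs. Atoms are (base, adjoint) pairs,
# e.g. ("n", "") for n, ("n", "l") for n^l, ("n", "r") for n^r.

def reduce_types(atoms):
    atoms = list(atoms)
    changed = True
    while changed:
        changed = False
        for i in range(len(atoms) - 1):
            (b1, a1), (b2, a2) = atoms[i], atoms[i + 1]
            # p · p^r <= 1   and   p^l · p <= 1
            if b1 == b2 and (a1, a2) in {("", "r"), ("l", "")}:
                del atoms[i:i + 2]
                changed = True
                break
    return atoms

# trembling shadows play hide-and-seek :  n·n^l  n  n^r·s·n^l  n
sentence = [("n", ""), ("n", "l"), ("n", ""), ("n", "r"),
            ("s", ""), ("n", "l"), ("n", "")]
print(reduce_types(sentence))  # [("s", "")]: reduces to s, so the sentence is grammatical
```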
Distributional Semantics
Words are represented as vectors.
Entries of the vector represent how often the target word co-occurs with the context word.
[Figure: co-occurrence counts for "iguana" against the context words cuddly, smelly, scaly, teeth, cute (counts 1, 10, 15, 7, 2), and a plot locating "iguana" and "Wilbur" in a space with axes such as "scaly" and "cuddly"]
Similarity is given by the cosine of the angle between the vectors:
sim(v, w) = cos(θ_{v,w}) = ⟨v, w⟩ / (||v|| ||w||)
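As a quick sketch (not from the slides), the same similarity in NumPy; the count vectors below are invented for illustration.

```python
# Cosine similarity between two (made-up) co-occurrence count vectors.
import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# hypothetical counts over context words (cuddly, smelly, scaly, teeth, cute)
iguana = np.array([1.0, 10.0, 15.0, 7.0, 2.0])
wilbur = np.array([12.0, 2.0, 1.0, 5.0, 9.0])
print(cosine(iguana, wilbur))
```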
The role of compositionality
Compositional distributional models
We can produce a sentence vector by composing the vectors of the words in that sentence.
s = f(w1, w2, . . . , wn)
Three generic classes of CDMs:
Vector mixture models [Mitchell and Lapata (2010)]
Tensor-based models [Coecke, Sadrzadeh, Clark (2010); Baroni and Zamparelli (2010)]
Neural models [Socher et al. (2012); Kalchbrenner et al. (2014)]
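To make the first of these classes concrete, here is a minimal sketch (not from the slides) of the vector-mixture compositions of Mitchell and Lapata (2010), using invented word vectors.

```python
# Vector mixture composition: pointwise addition and pointwise multiplication.
import numpy as np

clowns = np.array([0.2, 0.7, 0.1])   # made-up word vectors
tell   = np.array([0.5, 0.3, 0.9])
jokes  = np.array([0.4, 0.6, 0.2])

s_add  = clowns + tell + jokes       # additive model
s_mult = clowns * tell * jokes       # multiplicative model
print(s_add, s_mult)
```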
A multi-linear model
The grammatical type of a word defines the vector space in which the word lives:
Nouns are vectors in N;
adjectives are linear maps N → N, i.e. elements in N ⊗ N;
intransitive verbs are linear maps N → S, i.e. elements in N ⊗ S;
transitive verbs are bi-linear maps N ⊗ N → S, i.e. elements of N ⊗ S ⊗ N;
The composition operation is tensor contraction, i.e. elimination of matching dimensions by application of the inner product.
Coecke, Sadrzadeh, Clark 2010
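As a small sketch of what tensor contraction looks like in code (not from the slides; dimensions and values are made up), an adjective is a matrix in N ⊗ N and a transitive verb an order-3 tensor in N ⊗ S ⊗ N:

```python
# Composition by tensor contraction for "trembling shadows play hide-and-seek".
import numpy as np

n_dim, s_dim = 4, 3
rng = np.random.default_rng(0)

shadows       = rng.random(n_dim)                  # noun in N
trembling     = rng.random((n_dim, n_dim))         # adjective in N ⊗ N
play          = rng.random((n_dim, s_dim, n_dim))  # transitive verb in N ⊗ S ⊗ N
hide_and_seek = rng.random(n_dim)                  # noun in N

subject  = trembling @ shadows                     # apply the adjective: N -> N
sentence = np.einsum("i,isj,j->s", subject, play, hide_and_seek)  # contract both N dimensions
print(sentence.shape)  # (3,): a vector in the sentence space S
```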
Diagrammatic calculus: Summary
[Diagrams: morphisms as boxes on wires; tensors as states of V, V ⊗ W, V ⊗ W ⊗ Z; the ε-map and η-map drawn as a cup and a cap; the snake equation (ε^r_A ⊗ 1_A) ∘ (1_A ⊗ η^r_A) = 1_A]
Diagrammatic calculus: example
[Diagram: "trembling shadows play hide-and-seek" with its parse tree (Adj N)(V N) and word wires typed N·N^l, N, N^r·S·N^l, N; cups contract the matching N wires, leaving a single S wire]
The sentence meaning is F(α)(trembling ⊗ shadows ⊗ play ⊗ hide-and-seek): the tensor product ⊗_i wi of the word representations, followed by the linear map F(α) that F assigns to the grammatical reduction α.
Recursive Neural Networks
[Tree: "Clowns tell jokes", composed bottom-up]
p1 = g(tell, jokes)
p2 = g(Clowns, p1)
g_RNN : R^n × R^n → R^n :: (v1, v2) ↦ f1(M · [v1; v2])

g_RNTN : R^n × R^n → R^n :: (v1, v2) ↦ g_RNN(v1, v2) + f2(v1^T · T · v2)
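A minimal NumPy sketch of these two composition functions (not from the slides; tanh is assumed for the nonlinearities f1 and f2, and the parameters are random):

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
M = rng.standard_normal((n, 2 * n))   # TreeRNN weight matrix
T = rng.standard_normal((n, n, n))    # RNTN tensor, one n x n slice per output dimension

def g_rnn(v1, v2):
    return np.tanh(M @ np.concatenate([v1, v2]))

def g_rntn(v1, v2):
    bilinear = np.einsum("i,kij,j->k", v1, T, v2)   # v1^T · T · v2, slice by slice
    return g_rnn(v1, v2) + np.tanh(bilinear)

tell, jokes, clowns = rng.standard_normal((3, n))
p1 = g_rntn(tell, jokes)     # compose "tell jokes"
p2 = g_rntn(clowns, p1)      # compose "Clowns [tell jokes]"
print(p2)
```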
How compositional is this?
Successful
Some element of grammatical structure
The compositionality function has to do everything
Does that help us understand what’s going on?
Information-routing words
[Tree: "Clowns who tell jokes"; in the TreeRNN, the relative pronoun "who" is just another word vector fed to g]
[Tree: "John introduces himself"; the reflexive "himself" likewise enters as an ordinary word vector]
Can we map pregroup grammar onto TreeRNNs?
[Tree: "Clowns tell jokes" again]
p1 = g(tell, jokes)
p2 = g(Clowns, p1)
Can we map pregroup grammar onto TreeRNNs?
[Tree: "Clowns tell jokes", now composed with the tensor-based function g_LinTen]
p1 = g_LinTen(tell, jokes)
p2 = g_LinTen(Clowns, p1)
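Assuming g_LinTen is the purely multilinear variant of the composition above (the nonlinearities and the matrix term dropped, leaving only a tensor contraction), a sketch might look like:

```python
# Hypothetical g_LinTen: bilinear composition with no nonlinearity, so each
# application is just a tensor contraction, as in the pregroup/tensor picture.
import numpy as np

n = 4
T = np.random.default_rng(2).standard_normal((n, n, n))

def g_lin_ten(v1, v2):
    return np.einsum("i,kij,j->k", v1, T, v2)

tell, jokes, clowns = np.random.default_rng(3).standard_normal((3, n))
p1 = g_lin_ten(tell, jokes)
p2 = g_lin_ten(clowns, p1)
```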
Can we map pregroup grammar onto TreeRNNs?
[Tree: "Clowns tell jokes" with a g_LinTen box at each internal node]
Why?
Opens up more possibilities to use tools from formal semantics in computational linguistics.
We can immediately see possibilities for building alternative networks - perhaps different compositionality functions for different parts of speech.
Decomposing the tensors for functional words into repeated applications of a compositionality function gives options for learning representations.
Why?
who : n^r n s^l s
[String diagrams: "dragons who breathe fire", with "who" contributing only wire re-routing, equals the diagram for "dragons breathe fire"]
Why?
himself : n s^r n^rr n^r s
[String diagrams: "John loves himself" reduces to a diagram containing only the words "John" and "loves", with "himself" contributing only wire re-routing]
Experiments?
Not yet, but there are a number of avenues for exploration:
Comparing the performance of this kind of model against standard categorical compositional distributional models
Different compositionality functions for different word types
Testing the performance of TreeRNNs with formally analyzed information-routing words.
Investigating the effects of switching between word types.
Investigating meanings of logical words and quantifiers.
Extending the analysis to other types of recurrent neural network, such as long short-term memory networks or gated recurrent units.
Summary
We have shown how to interpret a simplification of recursive neural networks within a formal semantics framework
We can then analyze ‘information routing’ words such as pronouns as specific functions rather than as vectors
This also provides a simplification of tensor-based vector composition architectures, reducing the number of high-order tensors to be learnt, and making representations more flexible and reusable.
Plenty of work to do on both the experimental and thetheoretical side!
Thanks!
NWO Veni grant ‘Metaphorical Meanings for Artificial Agents’
Category-Theoretic Background
The category of pregroups Preg and the category of finite-dimensional vector spaces FdVect are both compact closed
This means that they share a structure, namely:
Both have a tensor product ⊗ with a unit 1
Both have adjoints A^r, A^l
Both have special morphisms
ε^r : A ⊗ A^r → 1,    ε^l : A^l ⊗ A → 1
η^r : 1 → A^r ⊗ A,    η^l : 1 → A ⊗ A^l
These morphisms interact in a certain way.
In Preg:
p · p^r ≤ 1 ≤ p^r · p        p^l · p ≤ 1 ≤ p · p^l
A functor from syntax to semantics
We define a functor F : Preg → FdVect such that:
F(p) = P   ∀ p ∈ B
F(1) = ℝ
F(p · q) = F(p) ⊗ F(q)
F(p^r) = F(p^l) = F(p)
F(p ≤ q) = F(p) → F(q)
F(ε^r) = F(ε^l) = inner product in FdVect
F(η^r) = F(η^l) = identity maps in FdVect
[Kartsaklis, Sadrzadeh, Pulman and Coecke, 2016]
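As a toy illustration (not from the slides), the object part of such a functor can be read as assigning a tensor shape to every pregroup type; the dimensions below are invented.

```python
# F on objects: atomic types go to (made-up) spaces, adjoints to the same space,
# and concatenation of types to the tensor product of spaces.
SPACE_DIM = {"n": 4, "s": 3}

def F(pregroup_type):
    # F(p^r) = F(p^l) = F(p)  and  F(p · q) = F(p) ⊗ F(q)
    return tuple(SPACE_DIM[base] for base, _adj in pregroup_type)

transitive_verb = [("n", "r"), ("s", ""), ("n", "l")]
print(F(transitive_verb))  # (4, 3, 4): an element of N ⊗ S ⊗ N
```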