Edinburgh MT: Neurons and semantics and stuff

Neurons and semantics and stuff

[Figure: a single “neuron”. The inputs x1 … x5 form a vector x and the weights w1 … w5 form a vector w; the output y is their weighted sum, optionally passed through a non-linearity g.]

y = x · w

y = g(x · w)
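A minimal sketch of a single neuron in NumPy, following the two equations above (the input values, the weights, and the choice of a sigmoid for g are illustrative assumptions):

import numpy as np

def g(u):
    # an example non-linearity: the logistic sigmoid
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([1.0, 0.0, 2.0, 0.5, 1.5])   # inputs x1 ... x5
w = np.array([0.2, -0.1, 0.4, 0.0, 0.3])  # weights w1 ... w5

y_linear = x @ w        # y = x . w
y = g(x @ w)            # y = g(x . w)
print(y_linear, y)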

“Neural” Networks

[Figure: a layer of “neurons”. The input vector x (x1 … x5) connects to each output unit y1, y2, y3 through its own weight vector; collecting these weights gives a matrix W, where an entry such as w4,1 connects x4 to y1.]

y = xᵀW

y = g(xᵀW)

“Soft max”: g(u)_i = exp(u_i) / Σ_{i′} exp(u_{i′})
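The same in matrix form: a minimal sketch of one layer with a softmax output (sizes and values are illustrative assumptions):

import numpy as np

def softmax(u):
    # g(u)_i = exp(u_i) / sum_i' exp(u_i')
    e = np.exp(u - u.max())          # subtract the max for numerical stability
    return e / e.sum()

x = np.array([1.0, 0.0, 2.0, 0.5, 1.5])  # input vector, 5 dimensions
W = np.random.randn(5, 3) * 0.1          # weight matrix: 5 inputs -> 3 outputs

y = softmax(x @ W)                       # y = g(x^T W)
print(y, y.sum())                        # a distribution over the 3 outputs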

“Deep”

[Figure: stacking layers. The input x (x1 … x5) feeds a hidden layer y (y1, y2, y3) through the matrix W; a second matrix V maps y to an output layer z (z1 … z4).]

z = g(yᵀV)

z = g(h(xᵀW)ᵀV)

z = g(Vh(Wx))

Note: if g(x) = h(x) = x, then z = (VW)x = Ux, i.e. without non-linearities the two layers collapse into a single linear map U = VW.
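A small sketch of the two-layer network, together with the collapse noted above when g and h are the identity (shapes and values are illustrative assumptions):

import numpy as np

def h(u):
    return np.tanh(u)              # hidden non-linearity

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

x = np.random.randn(5)             # input x1 ... x5
W = np.random.randn(3, 5) * 0.1    # input -> hidden (3 hidden units)
V = np.random.randn(4, 3) * 0.1    # hidden -> output (4 outputs)

z = softmax(V @ h(W @ x))          # z = g(V h(W x))

# if g and h are both the identity, the two layers collapse into one matrix
U = V @ W                          # z = (VW) x = U x
print(np.allclose(V @ (W @ x), U @ x))   # True: no extra capacity without non-linearity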

“Recurrent”

[Figure: a recurrent layer. The hidden units y1, y2, y3 at time t are computed from the current input x (x1 … x5) and from their own values at the previous time step, y1^{t−1}, y2^{t−1}, y3^{t−1}; an output layer z1 … z4 is computed from the hidden layer.]
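The slides give no equation for the recurrent case; a common formulation consistent with the figure is y_t = g(W x_t + R y_{t−1}), sketched below (the matrix names W, R, V and all sizes are assumptions):

import numpy as np

def g(u):
    return np.tanh(u)

W = np.random.randn(3, 5) * 0.1    # input -> hidden
R = np.random.randn(3, 3) * 0.1    # previous hidden -> hidden (the recurrence)
V = np.random.randn(4, 3) * 0.1    # hidden -> output

y = np.zeros(3)                    # initial hidden state y_0
for x_t in np.random.randn(6, 5):  # a sequence of 6 input vectors
    y = g(W @ x_t + R @ y)         # hidden state depends on input and previous state
    z = V @ y                      # per-step output z1 ... z4
print(z)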

Design Decisions

• How to represent inputs and outputs?

• Neural architecture?

• How many layers? (Extra layers only add capacity if there are non-linearities!)

• How many neurons?

• Recurrent or not?

• What kind of non-linearities?

Representing Language

• “One-hot” vectors

• Each position in a vector corresponds to a word type

• Sequence of words, sequence of vectors

• Bag of words: a collection of such vectors, with word order discarded

• Distributed representations

• Vectors encode “features” of input words (character n-grams, morphological features, etc.)
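A minimal sketch of the “one-hot” and bag-of-words representations listed above (the toy vocabulary is an illustrative assumption):

import numpy as np

vocab = ["anna", "misses", "her", "cat", "<unk>"]   # toy vocabulary of word types
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # one position per word type; a 1 at the position of this word
    v = np.zeros(len(vocab))
    v[index.get(word, index["<unk>"])] = 1.0
    return v

sentence = ["anna", "misses", "her", "cat"]
sequence = np.stack([one_hot(w) for w in sentence])  # sequence of words -> sequence of vectors
bag = sequence.sum(axis=0)                           # bag of words: order is discarded
print(bag)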

Training Neural Networks

• Neural networks are supervised models - you need a set of inputs paired with outputs

• Algorithm

• Run for a while

• Give input to the network, see what it predicts

• Compute loss(y,y*) and (sub)gradient with respect to parameters. Use the chain rule, aka “back propagation”

• Update parameters (SGD, AdaGrad, LBFGS, etc.)
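A minimal end-to-end sketch of this recipe: a single softmax layer trained with plain SGD on made-up data (the data, sizes, and learning rate are all illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # inputs
Y = rng.integers(0, 3, size=100)         # paired outputs y* (3 classes)
W = np.zeros((5, 3))                     # parameters

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

lr = 0.1
for epoch in range(50):                  # "run for a while"
    for x, y_star in zip(X, Y):
        p = softmax(x @ W)               # give input to the network, see what it predicts
        loss = -np.log(p[y_star])        # loss(y, y*): negative log-likelihood
        grad_u = p.copy()                # gradient via the chain rule ("back propagation")
        grad_u[y_star] -= 1.0
        grad_W = np.outer(x, grad_u)
        W -= lr * grad_W                 # update parameters (plain SGD)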

Bengio et al. (2003)

p(e) = ∏_{i=1}^{|e|} p(e_i | e_{i−n+1}, …, e_{i−1})

[Figure: the Bengio et al. (2003) neural n-gram language model. Each previous word e_{i−3}, e_{i−2}, e_{i−1} is embedded with a shared matrix C; the embeddings are combined through a matrix W and a tanh layer, then through V and a softmax to give p(e_i | e_{i−n+1}, …, e_{i−1}).]
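A sketch of the forward pass of such a model for n = 4 (the parameter names C, W, V follow the figure; dimensions and values are illustrative assumptions):

import numpy as np

V_SIZE, D, H = 1000, 30, 50            # vocabulary size, embedding size, hidden size
C = np.random.randn(V_SIZE, D) * 0.01  # shared word embedding matrix
W = np.random.randn(H, 3 * D) * 0.01   # concatenated context -> hidden
V = np.random.randn(V_SIZE, H) * 0.01  # hidden -> output scores

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def next_word_dist(context_ids):
    # p(e_i | e_{i-3}, e_{i-2}, e_{i-1}): embed, concatenate, tanh, softmax
    x = np.concatenate([C[j] for j in context_ids])
    h = np.tanh(W @ x)
    return softmax(V @ h)

p = next_word_dist([17, 42, 7])        # arbitrary word ids for the 3 previous words
print(p.shape, p.sum())                # a distribution over the whole vocabulary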

Devlin et al. (2014)

• Turn Bengio et al. (2003) into a translation model

• Conditional model; generate the next English word conditioned on

• The previous n English words you generated

• The aligned source word, and its m neighbors

p(e | f, a) = ∏_{i=1}^{|e|} p(e_i | e_{i−2}, e_{i−1}, f_{a_i−1}, f_{a_i}, f_{a_i+1})

[Figure: as in Bengio et al. (2003), the previous target words e_{i−3}, e_{i−2}, e_{i−1} are embedded with a shared matrix C, while the aligned source word f_{a_i} and its neighbours f_{a_i−1}, f_{a_i+1} are embedded with a matrix D; everything is combined through W, a tanh layer, V and a softmax to give p(e_i | …).]
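Relative to the Bengio sketch above, the only change is that the context now also contains a window of source words; a sketch of that forward pass (the matrix D follows the figure; all dimensions and word ids are illustrative assumptions):

import numpy as np

V_E, V_F, DIM, H = 1000, 1200, 30, 50       # target/source vocab sizes, embedding, hidden
C = np.random.randn(V_E, DIM) * 0.01        # target-word embeddings
D = np.random.randn(V_F, DIM) * 0.01        # source-word embeddings
W = np.random.randn(H, 5 * DIM) * 0.01      # 2 target + 3 source embeddings -> hidden
V = np.random.randn(V_E, H) * 0.01

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def p_next(target_hist, source_window):
    # p(e_i | e_{i-2}, e_{i-1}, f_{a_i-1}, f_{a_i}, f_{a_i+1})
    x = np.concatenate([C[j] for j in target_hist] + [D[j] for j in source_window])
    return softmax(V @ np.tanh(W @ x))

print(p_next([3, 8], [15, 16, 17]).sum())   # arbitrary word ids; sums to 1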

Kalchbrenner & Blunsom (2013)

• Can we get rid of alignments?

• Conditional model

• Represent f as a vector (or matrix)

• Use recurrent model + vector representation of f to generate translation

Summary of neural MT

• Two problems in standard statistical models

• We don’t condition on enough stuff

• We don’t know what features to use when we condition on lots of structure

• Punchline: Neural networks let us condition on a lot of stuff without an exponential growth in parameters

The diagram that will not die

(Vauquois, 1968)

Do we really need semantics?

(Jones et al. 2012)

Source: Anna fehlt ihrem Kater

MT: Anna is missing her cat

Reference: Anna’s cat is missing her

"Fehlen" means "ARG1 to be missing to ARG0".  There's a slew of German (active voice) verbs that behave like this---dative NP translating to subject NP in English---including certain uses of "sein" (to be).   "Passen", for example, can mean "ARG1 to be acceptable to ARG0".   "Mir ist kalt" -- "I am feeling cold" or literally "to me is cold"... None of these are idiomatic.  They are just forms that the other Germanic

languages have (I think) but English lost at the Battle of Hastings.(Asad Sayeed, personal communication)

Semantic transfer

Anna fehlt ihrem Kater

[Meaning graph: an instance of MISS whose agent is an instance of CAT and whose patient is an instance of ANNA; the CAT node has an owner edge to the ANNA node.]

Anna’s cat is missing her

Figure 1: A string to meaning graph to string translation pipeline.

Experimental results demonstrate that our system is capable of learning semantic abstractions, and more specifically, to both analyse text into these abstractions and decode them back into text in multiple languages.

The need to manipulate graph structures adds an additional level of complexity to the standard MT task. While the problems of parsing and rule-extraction are well-studied for strings and trees, there has been considerably less work within the NLP community on the equivalent algorithms for graphs. In this paper, we use hyperedge replacement grammars (HRGs) (Drewes et al., 1997) for the basic machinery of graph manipulation; in particular, we use a synchronous HRG (SHRG) to relate graph and string derivations.

We provide the following contributions:

1. Introduction of string ⇔ graph transduction with HRGs to NLP

2. Efficient algorithms for

• string–graph alignment

• inference of graph grammars from aligned graph/string pairs

3. Empirical results from a working machine translation system, and analysis of that system's performance on the subproblems of semantic parsing and generation.

We proceed as follows: Section 2 explains the SHRG formalism and shows how it is used to derive graph-structured meaning representations. Section 3 introduces two algorithms for learning SHRG rules automatically from semantically-annotated corpora. Section 4 describes the details of our machine translation system, and explains how a SHRG is used to transform a natural language sentence into a meaning representation and vice-versa. Section 6 discusses related work and Section 7 summarizes the main results of the paper.

2 Synchronous Hyperedge Replacement Grammars

Hyperedge replacement grammars (Drewes et al., 1997) are an intuitive generalization of context free grammars (CFGs) from strings to hypergraphs. Where in CFGs strings are built up by successive rewriting of nonterminal tokens, in hyperedge replacement grammars (HRGs), nonterminals are hyperedges, and rewriting steps replace these nonterminal hyperedges with subgraphs rather than strings.

A hypergraph is a generalization of a graph in which edges may link an arbitrary number of nodes. Formally, a hypergraph over a set of edge labels C is a tuple H = ⟨V, E, l, X⟩, where V is a finite set of nodes, E is a finite set of edges, where each edge is a subset of V, and l : E → C is a labeling function. |e| ∈ N denotes the type of a hyperedge e ∈ E (the number of nodes connected by the edge). For the directed hypergraphs we are concerned with, each edge contains a distinguished source node and one or more target nodes.

∃x1, x2, x3. instance(x1, MISS) ∧ agent(x1, x2) ∧ patient(x1, x3) ∧ instance(x2, CAT) ∧ instance(x3, ANNA) ∧ owner(x2, x3)
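The same conjunction can be read as a small labelled graph; a minimal sketch of holding it as a set of (source, relation, target) triples (the data structure is an illustrative choice, not the paper's representation):

# the meaning graph for "Anna misses her cat" as (source, relation, target) triples
graph = {
    ("x1", "instance", "MISS"),
    ("x1", "agent",    "x2"),
    ("x1", "patient",  "x3"),
    ("x2", "instance", "CAT"),
    ("x3", "instance", "ANNA"),
    ("x2", "owner",    "x3"),
}

def neighbours(node):
    # all outgoing (relation, target) pairs of a node
    return {(rel, tgt) for src, rel, tgt in graph if src == node}

print(neighbours("x2"))   # the cat node's instance and owner edges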


Problems we must solve

• Where do we get data that looks like this?

• How do we go from sentences to graphs (analysis)?

• How do we go from graphs to sentences (generation)?

• How do we do this efficiently?

Note: generation from arbitrary conjunctions is NP-complete

(Moore, ENLG 2002)

AMRbank

(s / say-01
   :ARG0 (g / organization
            :name (n / name :op1 "UN"))
   :ARG1 (f / flee-01
            :ARG0 (p / person
                     :quant (a / about :op1 14000))
            :ARG1 (h / home :poss p)
            :time (w / weekend)
            :time (a2 / after
                      :op1 (w2 / warn-01
                               :ARG1 (t / tsunami)
                               :location (l / local))))
   :medium (s2 / site :poss g :mod (w3 / web)))

http://amr.isi.edu

About 14,000 people fled their homes at the weekend after a local tsunami warning was issued, the UN said on its Web site

Hyperedge Replacement Grammar

Synchronous Hyperedge Replacement Grammar

A HRG over a set of labels C is a rewriting system G = ⟨N, T, P, S⟩, where N and T ⊂ C are the finite sets of nonterminal and terminal labels (T ∩ N = ∅), and S ∈ N is the start symbol. P is a finite set of productions of the form A → R, where A ∈ N and R is a hypergraph over C, with a set of distinguished external nodes, X_R.

To describe the rewriting mechanism, let H[e/R] be the hypergraph obtained by replacing the edge e = (v1 · · · vn) with the hypergraph R. The external nodes of R "fuse" to the nodes of e, (v1 · · · vn), so that R connects to H[e/R] at the same nodes that e does to H. Note that H[e/R] is undefined if |e| ≠ |X_R|. Given some hypergraph H with an edge e, if there is a production p : l_H(e) → R in P and |X_R| = |e|, we write H ⇒_p H[e/R] to indicate that p can derive H[e/R] from H in a single step. We write H ⇒*_G R to mean that R is derivable from H by G in some finite number of rewriting steps. The grammars we use in this paper do not contain terminal hyperedges, thus the yield of each complete derivation is a graph (but note that intermediate steps in the derivation may contain hyperedges).
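A rough illustration of the H[e/R] operation (a simplified sketch: the node and edge representation, the fresh-node naming, and the toy example are assumptions, not the paper's implementation):

from dataclasses import dataclass
from itertools import count

_fresh = count()

@dataclass(frozen=True)
class Edge:
    label: str
    nodes: tuple          # ordered attachment nodes (v1 ... vn); len(nodes) is the type |e|

@dataclass
class Hypergraph:
    nodes: set
    edges: list
    external: tuple = ()  # distinguished external nodes X_R (empty for a complete graph)

def replace(H, e, R):
    """Sketch of H[e/R]: remove the hyperedge e from H and splice in the
    right-hand-side hypergraph R, fusing R's external nodes onto e's nodes."""
    assert len(e.nodes) == len(R.external), "H[e/R] is undefined if |e| != |X_R|"
    # external nodes of R fuse to the attachment nodes of e; all other R nodes are fresh
    m = dict(zip(R.external, e.nodes))
    for v in R.nodes:
        m.setdefault(v, f"_n{next(_fresh)}")
    nodes = set(H.nodes) | {m[v] for v in R.nodes}
    edges = [x for x in H.edges if x is not e]
    edges += [Edge(x.label, tuple(m[v] for v in x.nodes)) for x in R.edges]
    return Hypergraph(nodes, edges, H.external)

# minimal usage: rewrite a nonterminal edge X attached to one node into a two-node subgraph
H = Hypergraph({"v"}, [Edge("X", ("v",))])
R = Hypergraph({"ext", "new"}, [Edge("a", ("ext", "new"))], external=("ext",))
G = replace(H, H.edges[0], R)
print(G.nodes, [(x.label, x.nodes) for x in G.edges])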

A Synchronous Hyperedge Replacement Grammar (SHRG) is a HRG whose productions have pairs of right hand sides. Productions have the form (A → ⟨R, Q⟩, ∼), where A ∈ N and R and Q are hypergraphs over N ∪ T. ∼ is a bijection linking nonterminal mentions in R and Q. We call the R side of a rule the source and the Q side the target. Isolating each side produces a projection HRG of the SHRG. In general the target representation can be any hypergraph, or even a string, since strings can be represented as monadic (non-branching) graphs. Because we are interested in translation between MRs and natural language we focus on graph–string SHRGs. The target projection of such a SHRG is a context free string grammar. To ensure that the source and target projections allow the same derivations, we constrain the relation ∼ such that every linked pair of nonterminals has the same label in R and Q.

Figure 2 shows an example SHRG with start symbol ROOT/S (a complex nonterminal with semantic part ROOT and syntactic part S). External nodes are shaded black.

[Figure 2 rules, with complex nonterminals written as semantic/syntactic (e.g. A0/NNP):

R1: A0/NNP → ⟨A0:anna, Anna⟩

R2: ROOT/VB → ⟨ROOT:miss, misses⟩

R3: POSS/PRP$ → ⟨poss:anna, her⟩

R4: A1/NN → ⟨A1:cat, cat⟩

R5: A0/NP → ⟨A0/NNP, A0/NNP⟩

R6: A1/NP → ⟨A1/NN POSS/PRP$, POSS/PRP$ A1/NN⟩

R7: ROOT/VP → ⟨ROOT/VB A1/NP, ROOT/VB A1/NP⟩

R8: ROOT/S → ⟨A0/NP ROOT/VP, A0/NP ROOT/VP⟩]

Figure 2: A graph-string SHRG automatically extracted from the meaning representation graph in Figure 3a using the SYNSEM algorithm. Note the hyperedge in rule R8.

The graph language captures a type of meaning representation in which semantic predicates and concepts are connected to their semantic arguments by directed edges. The edges are labeled with PropBank-style semantic roles (A0, A1, poss). Nonterminal symbols in this SHRG are complex symbols consisting of a semantic and a syntactic part, notated with the former above the latter.

Since HRG derivations are context free, we can represent them as trees. As an example, Figure 3c shows a derivation tree using the grammar in Figure 2, Figure 3a shows the resulting graph and Figure 3b the corresponding string. Describing graphs as their SHRG derivation trees allows us to use a number of standard algorithms from the NLP literature.

Finally, an Adaptive Synchronous Hyperedge Replacement Grammar (ASHRG) is a SHRG G = ⟨N, T, P*, S, V⟩, where V is a finite set of variables. ASHRG production templates are of the same form as SHRG productions, (A → ⟨R, Q⟩, ∼), but A ∈ N ∪ V and Q, R ∈ N ∪ T ∪ V. A production template p* ∈ P* is realised as a set of rules P by substituting all variables v for any symbol s ∈ N ∪ T: P = {∀v∈V ∀s∈N∪T p*[v/s]}. ASHRGs are a useful formalism for defining canonical grammars over the structure of graphs, with production templates describing graph structure transformations without regard to edge labels. We make use of this formalism in the production template R* in Figure 4a.

Figure 3: (a) an example meaning representation graph for the sentence ‘Anna misses her cat.’ (edges root:miss, A0:anna, A1:cat, poss:anna), (b) the corresponding syntax tree (Anna/NNP misses/VB her/PRP$ cat/NN under NP, VP, S). Subscripts indicate which words align to which graph edges. (c) a SHRG derivation tree for (a) using the grammar in Figure 2, built from rules R8, R5, R1, R7, R2, R6, R3, R4.

[Figure 4a: the production template R*: NT → ⟨(role):(concept), (string)⟩ and two structural rules R1, R2 that rewrite an NT hyperedge into one or two NT hyperedges.]

Figure 4: (a) The canonical grammar of width 2. R* is a production template and values in parentheses denote variables as defined by the ASHRG formalism. (b) A SHRG derivation tree for the MR graph in Figure 3a using the canonical grammar in (a), as created by the CANSEM algorithm.