
Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3567–3580, November 16–20, 2020. © 2020 Association for Computational Linguistics


    Accurate polyglot semantic parsing with DAG grammars

Federico Fancellu, Akos Kádár, Ran Zhang, Afsaneh Fazly
Samsung AI Centre Toronto (SAIC Toronto)

    {federico.f, ran.zhang, a.fazly}@samsung.com

    Abstract

Semantic parses are directed acyclic graphs (DAGs), but in practice most parsers treat them as strings or trees, mainly because models that predict graphs are far less understood. This simplification, however, comes at a cost: there is no guarantee that the output is a well-formed graph. A recent work by Fancellu et al. (2019) addressed this problem by proposing a graph-aware sequence model that utilizes a DAG grammar to guide graph generation. We significantly improve upon this work, by proposing a simpler architecture as well as more efficient training and inference algorithms that can always guarantee the well-formedness of the generated graphs. Importantly, unlike Fancellu et al., our model does not require language-specific features, and hence can harness the inherent ability of DAG-grammar parsing in multilingual settings. We perform monolingual as well as multilingual experiments on the Parallel Meaning Bank (Abzianidze et al., 2017). Our parser outperforms previous graph-aware models by a large margin, and closes the performance gap between string-based and DAG-grammar parsing.

    1 Introduction

Semantic parsers map a natural language utterance into a machine-readable meaning representation, thus helping machines understand and perform inference and reasoning over natural language data. Various semantic formalisms have been explored as the target meaning representation for semantic parsing, including dependency-based compositional semantics (Liang et al., 2013), abstract meaning representation (AMR, Banarescu et al., 2013), minimum recursion semantics (MRS, Copestake et al., 2005), and discourse representation theory (DRT, Kamp, 1981). Despite meaningful differences across formalisms or parsing models, a representation in any of these formalisms can be expressed as a directed acyclic graph (DAG).

Figure 1: The discourse representation structure for 'We barred the door and locked it'. For ease of reference in later figures, each box includes a variable corresponding to the box itself, at top right in gray.

Consider for instance the sentence 'We barred the door and locked it', whose meaning representation as a Discourse Representation Structure (DRS) is shown in Figure 1. A DRS is usually represented as a set of nested boxes (e.g. b1), containing variable-bound discourse referents (e.g. 'lock(e2)'), semantic constants (e.g. 'speaker'), predicates (e.g. AGENT) expressing relations between variables and constants, and discourse relations between the boxes (e.g. CONTINUATION). This representation can be expressed as a DAG by turning referents and constants into vertices, and predicates and discourse relations into connecting edges, as shown in Figure 2.

How can we parse a sentence into a DAG? Commonly-adopted approaches view graphs as strings (e.g. van Noord and Bos, 2017; van Noord et al., 2018), or trees (e.g. Zhang et al., 2019a; Liu et al., 2018), taking advantage of the linearized graph representations provided in annotated data (e.g. Figure 3, where the graph in Figure 2 is represented in PENMAN notation (Goodman, 2020)).


Figure 2: The DRS of Figure 1 expressed as a DAG, with referents and constants as nodes (b1/□, b2/□, b3/□, e1/bar, e2/lock, c1/speaker, x1/door) and predicates and discourse relations as edges (DRS, CONTINUATION, AGENT, THEME, PATIENT).

(b1/□ :CONTINUATION1 (b2/□ :DRS (e1/bar :AGENT (c1/speaker) :THEME (x1/door^p))) :CONTINUATION2 (b3/□ :DRS (e2/lock :AGENT c1 :PATIENT x1)))

Figure 3: The DAG of Figure 2 expressed as a string.

An advantage of these linearized representations is that they allow for the use of well-understood sequential decoders and provide a general framework to parse into any arbitrary formalism. However, these representations are unaware of the overall graph structure they build, as well as of reentrant semantic relations, such as coordination, co-reference, and control, that are widespread in language. Parsers such as that of Zhang et al. (2019b), although able to generate reentrancies in their output, do so by simply predicting pointers back to already generated nodes.

Parsing directly into DAGs, although desirable, is less straightforward than string-based parsing. Whereas probabilistic models of strings and trees are ubiquitous in NLP, developing formalisms that can define probability distributions over DAGs of practical interest is still an active problem in modern formal language theory.¹ A successful line of work derives semantic graphs using graph grammars, which generate a graph by rewriting non-terminal symbols with graph fragments. Among these, hyperedge replacement grammar (HRG) has been explored for parsing into semantic graphs (Habel, 1992; Chiang et al., 2013). However, parsing with HRGs is not practical due to its complexity and the large number of possible derivations per graph (Groschwitz et al., 2015). Thus, work has looked at ways of constraining the space of possible derivations, usually in the form of alignment or syntax (Peng et al., 2015).

¹See Gilroy (2019) for an extensive review of the issue.

For example, Groschwitz et al. (2018) and Donatelli et al. (2019) extracted fine-grained typed grammars whose productions are aligned to the input sentence and combined over a dependency-like structure. Similarly, Chen et al. (2018) draw on constituent parses to combine HRG fragments.

Björklund et al. (2016) show that there exists a restricted subset of HRGs, Restricted DAG grammars (RDGs), that provides a unique derivation per graph. A unique derivation means that a graph is generated by a unique sequence of productions, which can then be predicted using sequential decoders, without the need for an explicit alignment model or an underlying syntactic structure. Furthermore, the grammar places hard constraints on the rewriting process, which can be used to guarantee the well-formedness of output graphs during decoding. Drawing on this result, a recent work by Fancellu et al. (2019) introduces recurrent neural network RDGs, a sequential decoder that models graph generation as a rewriting process with an underlying RDG. However, despite the promising framework, the approach of FA19² falls short in several aspects.

In this paper, we address these shortcomings, and propose an accurate, efficient, polyglot model for neural RDG parsing. Specifically, our contributions are as follows:
Grammar: In practice, RDGs extracted from training graphs can be large and sparse. We show a novel factorization of the RDG production rules that reduces the sparsity of the extracted grammars. Furthermore, we make use of RDGs extracted from fully human-annotated training data to filter out samples from a larger, noisy, machine-generated dataset that cannot be derived using such grammars. We find that this strategy not only drastically reduces the size of the grammar, but also improves the final performance.
Model: FA19 use a syntactic-parsing-inspired architecture, a stack-LSTM, trained on a gamut of syntactic and semantic features. We replace this with a novel architecture that allows for batched input, while adding a multilingual transformer encoder that relies on word-embedding features only.
Constrained Decoding: We identify a limitation in the decoding algorithm presented by FA19, in that it only partially makes use of the well-formedness constraints of an RDG.

²For the sake of brevity, we refer to Fancellu et al. (2019) as FA19 throughout the paper.


We describe the source of this error, implement a correction, and show that we can guarantee well-formed DAGs.
Multilinguality: Training data in languages other than English is often small and noisy. FA19 addressed this issue with cross-lingual models using features available only for a small number of languages, but did not observe improvements over monolingual baselines in languages other than English. We instead demonstrate the flexibility of RDGs by extracting a joint grammar from graph annotations in different languages. At the same time, we make full use of our multilingual encoder to build a polyglot model that can accept training data in any language, allowing us to experiment with different combinations of data. Our results tell a different story, where models that use combined training data from multiple languages always substantially outperform monolingual baselines.

We test our approach on the Parallel Meaning Bank (PMB, Abzianidze et al., 2017), a multilingual graphbank. Our experimental results demonstrate that our new model outperforms that of FA19 by a large margin on English, while fully exploiting the power of RDGs to always guarantee a well-formed graph. We also show that the ability to train simultaneously on multiple languages substantially improves performance for each individual language. Importantly, we close the performance gap between graph-aware parsing and state-of-the-art string-based models.

    2 Restricted DAG Grammar

We model graph generation as a process of graph rewriting with an underlying grammar. Our grammar is a restricted DAG grammar (RDG, Björklund et al., 2016), a type of context-free grammar designed to model linearized DAGs. For ease of understanding, we represent fragments in grammar productions as strings. This is shown in Figure 4, where the right-hand-side (RHS) fragment can be represented as its left-to-right linearization, with reentrant nodes flagged by a dedicated $ symbol.

An RDG is a tuple ⟨P, N, Σ, S, V⟩ where P is a set of productions of the form α → β; N is the set of non-terminal symbols {L, T0, ..., Tn} up to a maximum number n; Σ is the set of terminal symbols; S is the start symbol; and V is an unbounded set of variable references {$1, $2, ...}, whose role is described below.

S → (b1/□ :CONTINUATION T2($1, $2) :CONTINUATION T2($1, $2))

Figure 4: An example production for a grammar. The graph fragment on the right-hand side can be replaced with a string representing its depth-first traversal.

The left-hand-side (LHS) α of a production p ∈ P is a function Ti ∈ N (where i is the rank) that takes i variable references as arguments. Variable references are what ensure the well-formedness of a generated graph in an RDG, by keeping track of how many reentrancies are expected in a derivation as well as how they are connected to their neighbouring nodes. Rank, in turn, is an indication of how many reentrancies are present in a graph derivation. For instance, in the graph fragment in Figure 4, given that there are two variable references and a non-terminal of rank 2, we are expecting two reentrant nodes at some point in the derivation. The RHS β is a typed fragment made up of three parts: a variable v describing the semantic type³, a label non-terminal L, and a list of tuples ⟨e, s⟩ where e is an edge label from a set of labels E and s is either a non-terminal function T or a variable reference. The non-terminal L can only be rewritten as a terminal symbol l ∈ Σ. If a node is reentrant, we mark it with a superscript ∗ over v. Variable references are percolated down the derivation and are replaced once a reentrant variable v∗ is found on the RHS.
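To make the rewriting machinery concrete, the sketch below shows one possible in-memory representation of a delexicalized RDG production; the class and field names (NonTerminal, Fragment, Production, edges) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of an RDG production; names and fields are assumptions.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class NonTerminal:
    rank: int            # number of variable references the function T_i takes
    args: List[str]      # the references themselves, e.g. ["$1", "$2"]

@dataclass
class Fragment:
    var_type: str        # semantic type of the node variable: e, x, s, t, c or b
    reentrant: bool      # True if the node is marked v*
    # each outgoing edge pairs a label (or placeholder ê_i) with either a
    # non-terminal function or a variable reference such as "$1"
    edges: List[Tuple[str, Union["NonTerminal", str]]]

@dataclass
class Production:
    lhs: NonTerminal     # T_i on the left-hand side
    rhs: Fragment        # delexicalized fragment; its label non-terminal L is
                         # rewritten to a terminal symbol in a separate step
```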

Following FA19, we show a complete derivation in Figure 5 that reconstructs the graph in Figure 2. Our grammar derives strings by first rewriting the start symbol S, a non-terminal function T0. At each subsequent step, the leftmost non-terminal function in the partially derived string is rewritten, with special handling for variable references described below. A derivation is complete when no non-terminals remain.

Variable references are resolved when applying a production that maps a reentrant variable name to a reference, as shown for production r4, where the variable c is mapped to $1.

³where v ∈ {e, x, t, s, c, b} (e=event, x=individual, s=state, t=time, c=constant, b=box). Note that the type is optional: in other formalisms such as AMR, variables do not have a semantic type, so one may name all variables with the same letter.


Step 1 (r1): (b1/L :CONT T2($1, $2) :CONT T2($1, $2))
Step 2 (r2): (b1/□ :CONT (b2/L :DRS T2($1, $2)) :CONT T2($1, $2))
Step 3 (r3): (b1/□ :CONT (b2/□ :DRS (e1/L :AGENT T1($1) :THEME T1($2))) :CONT T2($1, $2))
Step 4 (r4): (b1/□ :CONT (b2/□ :DRS (e1/bar :AGENT (c*/L) :THEME T1($2))) :CONT T2(c, $2))
Step 5 (r5): (b1/□ :CONT (b2/□ :DRS (e1/bar :AGENT (c*/speaker) :THEME (x*/L))) :CONT T2(c, x))
Step 6 (r2): (b1/□ :CONT (b2/□ :DRS (e1/bar :AGENT (c*/speaker) :THEME (x*/door^p))) :CONT (b3/□ :DRS T2(c, x)))
Step 7 (r6): (b1/□ :CONT (b2/□ :DRS (e1/bar :AGENT (c*/speaker) :THEME (x*/door^p))) :CONT (b3/□ :DRS (e2/lock :AGENT c :PATIENT x)))

Figure 5: A full RDG derivation for the graph in Figure 2. At each step t the leftmost non-terminal Tn (in blue) is rewritten into a fragment (underlined) and its label non-terminal L (in red) replaced with a terminal. Variable references are percolated down the derivation unless a reentrant variable v∗ is found (steps 4 and 5).

Once this mapping is performed, all instances of $1 in the RHS are replaced by the corresponding variable name. In this way, the reference to c is kept track of during the derivation, becoming the target of AGENT in r6. The same applies in r5, where x is mapped to $2.
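As a minimal illustration of this bookkeeping, the snippet below (an assumed helper, not from the paper) substitutes a reentrant variable name for a pending reference throughout the partial derivation, mirroring step 4 of Figure 5 where $1 is bound to c.

```python
# Illustrative sketch: once a reentrant variable is generated, its name is
# substituted for the corresponding reference everywhere in the partial string.
def resolve_reference(partial: str, reference: str, variable: str) -> str:
    return partial.replace(reference, variable)

# e.g. step 4 of Figure 5: $1 is bound to c, so T2($1, $2) becomes T2(c, $2)
partial = "(b1/□ :CONT (b2/□ :DRS (e1/bar :AGENT (c*/L) :THEME T1($2))) :CONT T2($1, $2))"
print(resolve_reference(partial, "$1", "c"))
```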

All our fragments are delexicalized. This is achieved by the separate non-terminal L that at every step is rewritten into the corresponding terminal label (e.g. bar). Delexicalization reduces the size of the grammar and allows the prediction of fragments and labels to be factorized separately.

However, DAG grammars can still be large due to the many combinations in which edge labels and their corresponding non-terminals can appear in a fragment. For this reason, we propose a further simplification where edge labels are replaced with placeholders ê1...ê|e|, which we exemplify using the production in Figure 4 as follows:

S → (b1/L ê1 T2($1, $2) ê2 T2($1, $2))

After a fragment is predicted, placeholders are then replaced with actual edge labels by a dedicated module (see § 3.2 for more details).

Comparison with Groschwitz et al. (2018)'s AM algebra. RDG is very similar to other graph grammars proposed for semantic parsing, in particular to Groschwitz et al. (2018)'s AM algebra used for AMR parsing. Their framework relies on a fragment extraction process similar to ours, where each node in a graph along with its outgoing edges makes up a fragment. However, the two grammars differ mainly in how typing, and as a consequence composition, is thought of: whereas in the AM algebra both the fragments themselves and the non-terminal edges are assigned thematic types (e.g. S[ubject], O[bject], MOD[ifier]), we only place rank information on the non-terminals and assign a more generic semantic type to the fragment.

The fine-grained thematic types in the AM algebra add a level of linguistic sophistication that RDG lacks, in that fragments fully specify the roles a word is expected to fill. This ensures that the output graphs are always semantically well-formed; in this respect the AM algebra behaves very similarly to CCG. However, this sophistication not only requires ad-hoc heuristics that are tailored to a specific formalism (AMR in this case), but also relies on alignment information with the source words.

On the other hand, our grammar is designed to predict a graph structure in sequential models. Composition is constrained by the rank of a non-terminal, so as to ensure that at each decoding step the model is always aware of the placement of reentrant nodes. However, we do not ensure semantic well-formedness, in that words are predicted separately from their fragments and we do not rely on alignment information. As a result, our grammar extraction algorithm does not rely on any heuristics and can be easily applied to any semantic formalism.

    3 Architecture

Our model is an encoder-decoder architecture that takes as input a sentence and generates a DAG G as a sequence of fragments with their corresponding labels, using the rewriting system in § 2. In what follows we describe how we obtain the logits for each target prediction, all of which are normalized with the softmax function to yield probability distributions. A detailed diagram of our architecture is shown in Figure 7 in Appendix A.

    3.1 Encoder

We encode the input sentence w_1, . . . , w_|n| using a pre-trained multilingual BERT (mBERT) model (Devlin et al., 2018).⁴ The final word-level representations are obtained through mean-pooling the sub-word representations of mBERT computed using the WordPiece algorithm (Schuster and Nakajima, 2012). We do not rely on any additional (language-specific) features, hence making the encoder polyglot. The word vectors are then fed to a two-layer BiLSTM encoder, whose forward and backward states are concatenated to produce the final token encodings s^enc_1, . . . , s^enc_n.
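A sketch of how such an encoder could be assembled with PyTorch and the HuggingFace Transformers library is shown below; the model name, hidden size, and the single-sentence (batch size 1) interface are assumptions for illustration, not the authors' exact configuration.

```python
# Sketch of the polyglot encoder: mBERT sub-words, mean-pooled into words,
# then a 2-layer BiLSTM. Dimensions and interface are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class PolyglotEncoder(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-multilingual-uncased")
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              num_layers=2, bidirectional=True, batch_first=True)

    def forward(self, input_ids, attention_mask, word_spans):
        # word_spans: (start, end) sub-word index pairs, one per input word
        subwords = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state
        # mean-pool the WordPiece vectors of each word into one word vector
        words = torch.stack([subwords[0, s:e].mean(dim=0) for s, e in word_spans])
        enc, (h_n, _) = self.bilstm(words.unsqueeze(0))  # single sentence, batch of 1
        # enc holds the concatenated forward/backward states, one per token
        return enc.squeeze(0), h_n
```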

    3.2 Decoder

The backbone of the decoder is a two-layer LSTM, with two separate attention mechanisms, one for each layer. Our decoding strategy follows steps similar to those in Figure 5. At each step we first predict a delexicalized fragment f_t, and substitute a terminal label l_t in place of L. We initialize the decoder LSTM with the encoder's final state s^enc_n. At each step t, the network takes as input [f_{t-1}; l_{t-1}], the concatenation of the embeddings of the fragment and its label output at the previous time step. At t = 0, we initialize both fragment and label encodings with a 〈START〉 token. The first layer in the decoder is responsible for predicting fragments. The second layer takes as input the output representations of the first layer, and predicts terminal labels. The following paragraphs provide details on the fragment and label predictions.

Fragment prediction. We make the prediction of a fragment dependent on the embedding of the parent fragment and the decoder history. We define as parent fragment the fragment containing the non-terminal that the current fragment is rewriting; for instance, in Figure 5, the fragment in step 1 is the parent of the fragment underlined in step 2. Following this intuition, at time t, we concatenate the hidden state of the first layer h^1_t with a context vector c^1_t and the embedding of its parent fragment u_t. The logits for fragment f_t are predicted with a single linear layer W^f [c^1_t; u_t; h^1_t] + b.

⁴Available through the HuggingFace Transformers library (Wolf et al., 2019) at https://huggingface.co/.

We compute c^1_t using a standard soft attention mechanism (Xu et al., 2015) as follows, where s^{enc}_{1:N} represents the concatenation of all encoder hidden states:

c^1_t = \sum_i \alpha_i s^{enc}_i    (1)
a = \mathrm{MLP}_1([h^1_t; s^{enc}_{1:N}])    (2)
\alpha_i = \exp(a_i) / \sum_j \exp(a_j)    (3)
\mathrm{MLP}_1(x) = \mathrm{ReLU}(Wx + b)    (4)
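The sketch below illustrates one way to implement the attention and fragment-logit computation of Equations (1)-(4) in PyTorch; tensor shapes, the extra scalar projection inside the attention MLP, and module names are assumptions rather than the authors' code.

```python
# Sketch of soft attention (Eqs. 1-4) and the fragment logits W^f[c^1_t; u_t; h^1_t] + b.
import torch
import torch.nn as nn

class FragmentHead(nn.Module):
    def __init__(self, enc_dim, dec_dim, frag_dim, n_fragments, attn_dim=256):
        super().__init__()
        # Eqs. (2)/(4): an MLP scores each encoder state against the decoder state;
        # the final scalar projection is an added assumption to obtain one score
        # per input token.
        self.attn_mlp = nn.Sequential(
            nn.Linear(dec_dim + enc_dim, attn_dim), nn.ReLU(),
            nn.Linear(attn_dim, 1))
        self.out = nn.Linear(enc_dim + frag_dim + dec_dim, n_fragments)

    def forward(self, h1_t, s_enc, u_t):
        # h1_t: (dec_dim,) first-layer decoder state; s_enc: (N, enc_dim) encoder
        # states; u_t: (frag_dim,) parent-fragment embedding
        scores = self.attn_mlp(torch.cat(
            [h1_t.expand(s_enc.size(0), -1), s_enc], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)              # Eq. (3)
        c1_t = (alpha.unsqueeze(-1) * s_enc).sum(dim=0)    # Eq. (1)
        return self.out(torch.cat([c1_t, u_t, h1_t], dim=-1))  # fragment logits
```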

Label prediction. Terminal labels in the output graph can either correspond to a lemma in the input sentence (e.g. 'bar', 'lock') or to a semantic constant (e.g. 'speaker'). We make use of this distinction by incorporating a selection mechanism that learns to choose between predicting a lemma from the input and predicting a token from a vocabulary of labels L. We concatenate the hidden state of the second layer h^2_t with the embedding of the fragment predicted at the current time step f_t and the second-layer context vector c^2_t. Let us refer to this representation as z_t = [f_t; h^2_t; c^2_t]. The context vector for the second layer is computed in the same way as c^1_t, but using h^2_t in place of h^1_t and separate attention MLP parameters. To compute the logits for label prediction we apply a linear transformation to the encoder representations, e = W^s s^{enc}_{1:N}. We concatenate the resulting vectors with the label embedding matrix L and compute the dot product z_t^T [e; L] to obtain the final unnormalized scores jointly for all tokens in the input and all labels in L.

In the PMB, each label is also annotated with its sense tag and with information about whether it is presupposed in the context or not. We predict the former, s_t, from a class of sense tags S extracted from the training data, and the latter, p_t, a binary variable, by passing z_t through two distinct linear layers to obtain the logits for each.
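A possible implementation of this select-or-generate scorer is sketched below; dimensions and module names are assumptions, and the resulting scores would be normalized jointly with a single softmax as described above.

```python
# Sketch of the label scorer: one score per input token (copy a lemma) and per
# vocabulary label (generate), as a dot product with z_t. Names are assumptions.
import torch
import torch.nn as nn

class LabelHead(nn.Module):
    def __init__(self, enc_dim, z_dim, n_labels):
        super().__init__()
        self.proj = nn.Linear(enc_dim, z_dim)            # e = W^s s^enc
        self.label_emb = nn.Embedding(n_labels, z_dim)   # label embedding matrix L

    def forward(self, z_t, s_enc):
        # z_t: (z_dim,) = [f_t; h^2_t; c^2_t];  s_enc: (N, enc_dim)
        e = self.proj(s_enc)                                       # (N, z_dim)
        candidates = torch.cat([e, self.label_emb.weight], dim=0)  # (N + |L|, z_dim)
        # one unnormalized score per input token (copy) and per label (generate)
        return candidates @ z_t
```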

Edge factorization. In § 2, we discussed how we made grammars even less sparse by replacing the edge labels in a production fragment with placeholders. From a modelling perspective, this allows us to factorize edge label prediction: the decoder first predicts all the fragments in the graph and then predicts the edge labels e_1...e_|e| that substitute in place of the placeholders.

To do so, we cache the intermediate representations z_t over time. We use these as features to replace the edge placeholders ê_i with the corresponding true edge labels e_i. To obtain the edge-label logits we pass the second-layer representation for the child fragment z_c and the parent fragment z_p to a pairwise linear layer: W^e [W^c z_c ⊙ W^p z_p].
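The following sketch shows the edge scorer under the assumption that the child and parent projections are combined with an element-wise product before the final linear layer; names and dimensions are illustrative, not the authors' code.

```python
# Sketch of the pairwise edge-label scorer W^e[W^c z_c (x) W^p z_p].
import torch.nn as nn

class EdgeHead(nn.Module):
    def __init__(self, z_dim, hidden, n_edge_labels):
        super().__init__()
        self.w_c = nn.Linear(z_dim, hidden)   # child projection W^c
        self.w_p = nn.Linear(z_dim, hidden)   # parent projection W^p
        self.w_e = nn.Linear(hidden, n_edge_labels)

    def forward(self, z_child, z_parent):
        # combine the two projections (assumed to be an element-wise product)
        # and map to one logit per edge label
        return self.w_e(self.w_c(z_child) * self.w_p(z_parent))
```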

3.3 Graph-aware decoding

At inference time, our graph decoder rewrites non-terminals left-to-right by choosing the fragment with the highest probability, and then predicts terminal and/or edge labels. The rank of a non-terminal and the variable references it takes as arguments place a hard constraint on the fragment that rewrites in its place (as shown in § 2). Only by satisfying these constraints can the model ensure the well-formedness of generated graphs.

By default, our decoder does not explicitly follow these constraints and can substitute a non-terminal with any fragment in the grammar. This is to assess whether a vanilla decoder can learn to substitute in a fragment that correctly matches a non-terminal. On top of the vanilla decoder, we then exploit these hard constraints in two different ways, as follows:

Rank prediction. We incorporate information about rank as a soft constraint during learning by having the model predict it at each time step. This means that the model can still predict a fragment whose rank and variable references do not match those of a non-terminal, but it is guided not to do so. We treat rank prediction as a classification task, where we use the same features as fragment prediction and pass them to a linear layer: r_t = W^r [c^1_t; u_t; h^1_t] + b^r. Note that the range of predicted ranks is determined by the training grammar, so it is not possible to generate a rank that has not been observed and does not have associated rules.

Constrained decoding. We explicitly ask the model to choose only amongst those fragments that can match the rank and variable references of a non-terminal. This may override model predictions but always ensures that a graph is well-formed. To ensure well-formedness, FA19 only check for rank. This can lead to infelicitous consequences. Consider for instance the substitution in Figure 6. Both fragments at the bottom of the middle and right representations are of rank 2, but whereas the first allows the edges to refer back to the reentrant nodes, the second introduces an extra reentrant node, therefore leaving one of the reentrant nodes disconnected. Checking just for rank is therefore not enough; one also needs to check whether a reentrant node that will substitute in a variable reference has already been generated. If not, any fragment of the same rank can be accepted. If such a node already exists, only fragments that do not introduce another reentrant node can be accepted. This constrained decoding strategy is what allows us to always generate well-formed graphs; we integrate this validation step in the decoding algorithm when selecting the candidate fragment.
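A simplified version of this validity check might look as follows; the attribute names (rank, introduces_reentrant) and the boolean flag are assumptions used only to illustrate the two conditions described above.

```python
def is_valid(fragment, nonterminal, ref_already_bound):
    """fragment.rank / fragment.introduces_reentrant: rank of the candidate
    fragment and whether it introduces a new reentrant node v*;
    nonterminal.rank: rank of the non-terminal being rewritten;
    ref_already_bound: True if a generated reentrant node is already bound to
    the variable reference this fragment would fill."""
    if fragment.rank != nonterminal.rank:
        return False
    if ref_already_bound and fragment.introduces_reentrant:
        # a second reentrant node would leave the existing one disconnected
        return False
    return True
```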

Finally, we integrate these hard constraints in the softmax layer as well. Instead of normalizing the logits across all fragment types with a single softmax operation, we normalize them separately for each rank. The errors are only propagated through the subset of parameters in W^f and b^f responsible for the logits within the target rank r_t.

3.4 Training objective

Our objective is to maximize the log-likelihood of the full graph, P(G|s), approximated by the decomposition over each prediction task separately:

\sum_t [ \log P(f_t) + \log P(\ell_t) + \log P(r_t) + \log P(s_t) + \log P(p_t) + \sum_i \log P(e_i) ]    (5)

where f_t is the fragment; \ell_t is the label; r_t is the (optional) rank of f_t; s_t and p_t are the sense and presupposition labels of terminal label \ell_t; and e_1...e_|e| are the edge labels of f_t. To prevent our model from overfitting, rather than directly optimizing the log-likelihoods, we apply label smoothing for each prediction term (Szegedy et al., 2016).

    4 Experimental setup

4.1 Data

We evaluate our parser on the Parallel Meaning Bank (Abzianidze et al., 2017), a multilingual graphbank where sentences in four languages (English (en), Italian (it), German (de) and Dutch (nl)) are annotated with their semantic representations in the form of Discourse Representation Structures (DRS). We test on v.2.2.0 to compare with previous work, and present the first results on v.3.0 on all four languages. We also present results when training on both gold and silver data, where the latter is ∼10x larger but contains machine-generated parses, of which only a small fraction has been manually edited.


             # training   # fragments     # fragments     avg.
             instances    (+edge label)   (-edge label)   rank
PMB2.2.0-g   4585         1196            232             1.56
PMB2.2.0-s   63960        17414           2586            2.85
PMB3-g       6618         1695            276             2.22
PMB3-s       94776        36833           6251            3.01
PMB3-it      2743         1827            378             2.32
PMB3-de      5019         4025            843             2.61
PMB3-nl      1238         1338            318             2.29

Table 1: Statistics for the grammars extracted from the PMB (g - gold; s - silver).

Statistics for both versions of the PMB are reported in Appendix B.

Our model requires an explicit grammar, which we obtain by automatically converting each DAG in the training data⁵ into a sequence of productions. This conversion follows the one in FA19 with minor changes; we include details in Appendix C.

Statistics regarding the grammars extracted from the PMB are presented in Table 1, where along with the number of training instances and fragments we report average rank, an indication of how many reentrancies (on average) are present in the graphs. RDGs can be large, especially in the case of silver data, where incorrect parses lead to a larger number of extracted fragments and more complex, noisy constructions, as attested by the higher average ranks. More importantly, we show that removing the edge labels from the fragments leads to a drastic reduction in the number of fragments, especially for the silver corpora.

4.2 Evaluation metrics

To evaluate our parser, we need to compare its output DRSs to the gold-standard graph structures. For this, we use the Counter tool of Van Noord et al. (2018), which calculates an F-score by searching for the best match between the variables of the predicted and the gold-standard graphs. Counter's search algorithm is similar to the SMATCH evaluation system for AMR parsing (Cai and Knight, 2013).

There might be occurrences where our graph is deemed ill-formed by Counter; we assign these graphs a score of 0. The ill-formedness is, however, not due to the model itself but to specific requirements placed on the output DRS by the Counter script.

⁵Our DAGs are different from the DRGs (Discourse Representation Graphs) of Basile and Bos (2013); we elaborate more on this in Appendix C.

                         P     R     F1
baseline                 80.0  70.9  75.2
+ rank-prediction        81.0  72.3  76.4
+ constrained-decoding   80.5  75.2  77.8
+ edge-factorization     82.5  78.5  80.4

ours-best + silver       83.8  80.6  82.2
ours-best + filtering    83.1  80.5  81.8

Table 2: Ablation results on the dev portion of PMB2.2.0. The top half shows results for models trained on gold data only. The bottom half shows results of models trained on silver+gold data.

    5 Experimental Results

We first present results of ablation experiments to understand which model configuration performs best (§ 5.1). We then compare our best-performing model with several existing semantic parsers (§ 5.2), and present our model's performance in multilingual settings (§ 5.3).

    5.1 Ablation experiments

Table 2 shows results for our model in various settings. Our baseline is trained on gold data alone, uses a full grammar, and performs unconstrained decoding, with and without rank prediction. Note that unconstrained decoding could lead to ill-formed graphs. To better understand the effect of this, we compare the performance of the baseline with a model that uses constrained decoding and thus always generates well-formed graphs. We train all our models on a single TitanX GPU v100. We report hyperparameters and other training details in Appendix D.

Our results are different from those of FA19, who show that a baseline model outperforms one with constrained decoding. Not only do we find that constrained decoding outperforms the baseline, but we also observe that without it, 26 graphs (∼4%) are ill-formed. In addition, our results show that predicting edge labels separately from fragments (edge factorization) leads to a substantial improvement in performance, while also drastically reducing the size of the grammar (as shown in Table 1).

We also train our best-performing model (ours-best) on the silver and gold data combined (+silver). This is to assess whether more data, albeit noisy, results in better performance. However, noisy data can lead to a noisy grammar; to reduce this noise, we experiment with first extracting a grammar from the gold training set and using it to filter the silver set, where only instances that can be derived using the gold grammar are kept (+filtering).


Figure 6: Example of correct (middle) and wrong (right) substitution of a non-terminal function (left, in blue) during constrained decoding. The correct fragment, T2($1,$2) → (e/L :AGENT $1 :PATIENT $2), refers back to the existing reentrant nodes, whereas the wrong one, T2($1,$2) → (e*/L :AGENT $2), introduces an extra reentrant node and leaves one reentrant node disconnected.

The filtering results in a smaller grammar (232 vs. 2586 fragments), while at the same time sacrificing only a small percentage of training instances (10%).

van Noord et al. (2019), Liu et al. (2019) and FA19 found that models trained on silver data require additional fine-tuning on gold data alone to achieve the best performance; we also follow this strategy in our experiments. Overall, the results show that adding silver data improves performance, and that filtering the input silver data leads only to a slight loss in performance while keeping the size of the grammar small.

5.2 Comparison to previous work

We compare our best-performing model against previous work on PMB2.2.0. We first compare performance for models trained solely on gold data. Besides the DAG-grammar parser of FA19, we compare with the transition-based stack-LSTM of Evang (2019), which utilizes a buffer-stack architecture to predict a DRS fragment for each input token using the alignment information in the PMB; our graph parser does not make use of such information and relies solely on attention.

We then compare our best-performing model with two models trained on gold plus silver data. van Noord et al. (2019) is a seq2seq parser that decodes an input sentence into a concatenation of clauses, essentially a flattened version of the boxes in Figure 1. Similar to FA19, their model also uses a wide variety of language-dependent features, including part-of-speech, dependency and CCG tags, while ours relies solely on word embeddings. In this respect, our model is similar to that of Liu et al. (2019), which uses the same architecture as the model of van Noord et al. (2019) but replaces the LSTM encoder with a transformer model, without the use of additional features.

                          P     R     F1
Fancellu et al. (2019)    -     -     73.4
Evang (2019)              -     -     74.4
ours-best                 84.5  81.3  82.9

van Noord et al. (2019)   -     -     86.8
Liu et al. (2019)         85.8  84.5  85.1
ours-best + silver        86.1  83.6  84.9

Table 3: Comparison with previous work on the test portion of PMB2.2.0. Results in the top half are for models trained on gold data, whereas the bottom half shows results for models trained on silver+gold data.

Results are summarized in Table 3. When trained on gold data alone, our model outperforms previous models by a large margin, without relying on alignment information or extra features besides word embeddings. When trained on silver+gold, we close the performance gap with state-of-the-art models that decode into strings, despite relying solely on multilingual word embeddings.

    5.3 Multilingual experiments

Table 4 shows the results for languages other than English. In our multilingual experiments, we first train and test monolingual models for each language. In addition, we perform zero-shot experiments by training a model on English and testing it on the other languages (cross-lingual). We also take full advantage of the fact that our models rely solely on multilingual word embeddings, and experiment with two other multilingual settings: the bilingual models are trained on data in English plus data in a target language (and tested on the target language), and the polyglot models combine the training data of all four languages (and are tested on each language). Parameters for all languages in the bilingual and polyglot models are fully shared.


PMB2.2.0
                        en    de    nl    it
FA19 (monolingual)      -     67.9  65.8  75.9
FA19 (cross-lingual)    -     63.5  65.1  72.1
Ours (cross-lingual)    -     73.4  73.9  76.9

ours-best (various) trained and tested on PMB3
monolingual             80    64.2  60.9  71.5
cross-lingual           -     73.2  74.1  75.2
bilingual               -     71.8  76.0  77.7
polyglot                79.8  72.5  74.1  77.9

Table 4: Results for the multilingual experiments on the test sets for PMB2.2.0 (top half) and PMB3.0 (bottom half). For the sake of brevity, we report only F1 scores here, and refer the reader to Table 6 in Appendix E for precision and recall values.

FA19 only experiment with a cross-lingual model trained with additional language-dependent features, some of which are available only for a small number of languages (on PMB2.2.0). We therefore compare our cross-lingual models with theirs on PMB2.2.0. We then introduce the first results on PMB3, where we experiment with the other two multilingual settings.

Our results tell a different story from FA19: all of our multilingual models (bilingual, polyglot and cross-lingual) outperform the corresponding monolingual baselines. We hypothesize this is mainly due to the fact that, for languages other than English, only small silver training sets are available, and adding a large gold English dataset might help dramatically with performance. This hypothesis is also reinforced by the fact that a cross-lingual model trained on English data alone can reach a performance comparable to the other two models.

    6 Conclusions

In this paper, we have introduced a graph parser that can fully harness the power of DAG grammars in a seq2seq architecture. Our approach is efficient, fully multilingual, always guarantees well-formed graphs, and can rely on small grammars, while outperforming previous graph-aware parsers in English, Italian, German and Dutch by a large margin. At the same time, we close the gap between string-based and RDG-based decoding. In the future, we plan to extend this work to other semantic formalisms (e.g. AMR, UCCA) as well as to test on other languages, so as to encourage work in languages other than English.

    Acknowledgments

We thank three anonymous reviewers for their useful comments. This research was conducted at Samsung AI Centre Toronto and funded by Samsung Research, Samsung Electronics Co., Ltd.

References

Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The Parallel Meaning Bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Valerio Basile and Johan Bos. 2013. Aligning formal meaning representations with surface strings for wide-coverage text generation. In Proceedings of the 14th European Workshop on Natural Language Generation, pages 1–9, Sofia, Bulgaria. Association for Computational Linguistics.

Henrik Björklund, Frank Drewes, and Petter Ericson. 2016. Between a rock and a hard place: uniform parsing for hyperedge replacement DAG grammars. In Proceedings of the International Conference on Language and Automata Theory and Applications, pages 521–532. Springer.

Shu Cai and Kevin Knight. 2013. Smatch: An evaluation metric for semantic feature structures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 748–752.

Yufei Chen, Weiwei Sun, and Xiaojun Wan. 2018. Accurate SHRG-based semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 408–418.

David Chiang, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, Bevan Jones, and Kevin Knight. 2013. Parsing graphs with hyperedge replacement grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 924–932.



Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal recursion semantics: An introduction. Research on Language and Computation, 3(2-3):281–332.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Lucia Donatelli, Meaghan Fowlie, Jonas Groschwitz, Alexander Koller, Matthias Lindemann, Mario Mina, and Pia Weißenhorn. 2019. Saarland at MRP 2019: Compositional parsing across all graphbanks. In Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning, pages 66–75.

Kilian Evang. 2019. Transition-based DRS parsing using stack-LSTMs. In Proceedings of the IWCS Shared Task on Semantic Parsing, Gothenburg, Sweden. Association for Computational Linguistics.

Federico Fancellu, Sorcha Gilroy, Adam Lopez, and Mirella Lapata. 2019. Semantic graph parsing with recurrent neural network DAG grammars. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2769–2778.

Sorcha Gilroy. 2019. Probabilistic graph formalisms for meaning representations.

Michael Wayne Goodman. 2020. Penman: An open-source library and tool for AMR graphs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 312–319.

Jonas Groschwitz, Alexander Koller, and Christoph Teichmann. 2015. Graph parsing with s-graph grammars. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1481–1490.

Jonas Groschwitz, Matthias Lindemann, Meaghan Fowlie, Mark Johnson, and Alexander Koller. 2018. AMR dependency parsing with a typed semantic algebra. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Annegret Habel. 1992. Hyperedge Replacement: Grammars and Languages, volume 643. Springer Science & Business Media.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Hans Kamp. 1981. A theory of truth and semantic representation. Formal Semantics: the Essential Readings, pages 189–222.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Jiangming Liu, Shay B. Cohen, and Mirella Lapata. 2018. Discourse representation structure parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 429–439.

Jiangming Liu, Shay B. Cohen, and Mirella Lapata. 2019. Discourse representation structure parsing with recurrent neural networks and the transformer model. In Proceedings of the IWCS Shared Task on Semantic Parsing.

Rik van Noord, Lasha Abzianidze, Antonio Toral, and Johan Bos. 2018. Exploring neural methods for parsing discourse representation structures. Transactions of the Association for Computational Linguistics, 6:619–633.

Rik van Noord and Johan Bos. 2017. Neural semantic parsing by character-based translation: Experiments with abstract meaning representations. Computational Linguistics in the Netherlands (CLIN).

Rik van Noord, Antonio Toral, and Johan Bos. 2019. Linguistic information in neural semantic parsing with multiple encoders. In Proceedings of the 13th International Conference on Computational Semantics (Short Papers), pages 24–31.

Xiaochang Peng, Linfeng Song, and Daniel Gildea. 2015. A synchronous hyperedge replacement grammar based approach for AMR parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 32–41.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In Proceedings of the China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Rik van Noord, Lasha Abzianidze, Hessel Haagsma, and Johan Bos. 2018. Evaluating scoped meaning representations. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC).



Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv, abs/1910.03771.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048–2057, Lille, France. PMLR.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019a. AMR parsing as sequence-to-graph transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy. Association for Computational Linguistics.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019b. Broad-coverage semantic parsing as transduction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).



    A System architecture

An illustration of our system architecture is shown in Figure 7.

    B PMB - data statistics

             train            dev   test
PMB2.2.0-g   4597 (4585)      682   650
PMB2.2.0-s   67965 (63960)    -     -

PMB3-g       6620 (6618)      885   898
PMB3-s       97598 (94776)    -     -
PMB3-it      2772 (2743)*     515   547
PMB3-de      5250 (5019)*     417   403
PMB3-nl      1301 (1238)*     529   483

Table 5: Data statistics for the PMB v.2.2.0 and 3.0 (g - gold; s - silver). Numbers in parentheses are the instances we used during training, i.e. those for which we were able to extract a derivation tree. *: training instances for languages other than English are silver, whereas dev and test are gold.

    C DAG-grammar extraction

Our grammar extraction consists of three steps:

Preprocess the DRS. First, we treat all constants as lexical elements and bind them to a variable c. For instance, in Figure 1 we bind 'speaker' to a variable c1 and change the relations AGENT(e1, 'speaker') and AGENT(e2, 'speaker') into AGENT(e1, c1) and AGENT(e2, c1), respectively. Second, we deal with multiple lexical elements that map to the same variable (e.g. cat(x1) ∧ entity(x1), where the second predicate specifies the 'nature' of the first) by renaming the second variable as i and creating a dummy relation OF that maps from the first to the second. Finally, we remove relations that generate cycles. We found 25 cycles in the PMB, all related to the same phenomenon, where the relations 'Role' and 'Of' have inverted source and target (e.g. person(x1) - Role - mother(x4), mother(x4) - Of - person(x1)). We remove cyclicity by merging the two relations into one edge label. All these changes are reverted before evaluation.

Convert the DRS into a DAG. We convert all main boxes, lexical predicates and constants (now bound to a variable) to nodes, whereas binary relations between predicates and boxes are treated as edges. For each box, we identify a root variable (if any) and attach it as a child to the box node with an edge :DRS. A root variable is defined as a variable belonging to a box that is *not* at the receiving end of any binary predicates; in Figure 1, these are e1 and e2 for b2 and b3 respectively. We then follow the binary relations to expand the graph. In doing so, we also incorporate presuppositional boxes in the graph (i.e. b4 in Figure 1). Each of these boxes contains predicates that are presupposed in context (usually definite descriptions like 'the door'). To link presupposed boxes to the main boxes (i.e. to get a fully connected DAG), we assign a (boolean) presupposition feature to the root variable of the presupposed box (this feature is marked with the superscript p in Figure 2). Any descendant predicates of this root variable will be considered as part of the presupposed DRS. During post-processing, when we need to reconstruct the DRS from a DAG, we rebuild the presupposed box around variables for which presupposition is predicted as 'True', and their descendants.

Note that Basile and Bos (2013) proposed a similar conversion to generate Discourse Representation Graphs (DRGs), exemplified in Figure 8 using our working example. We argue that our representation is more compact in that: 1) we ignore 'in' edges, where each variable is explicitly marked as part of the box by means of a dedicated edge; this is possible since each box (the square nodes) has a main predicate and all its descendants belong to the box; 2) we treat binary predicates (e.g. AGENT) as edge labels and not as nodes; 3) we remove presupposition boxes (in Figure 8, the subgraph rooted in a P-labelled edge) and assign a (boolean) presupposition variable to the presupposed predicates.

Convert the DAGs into derivation trees. DAGs are converted into derivation trees in two passes following the algorithm in Björklund et al. (2016), which we summarize here; the reader is referred to the original paper for further details. The algorithm consists of two steps: first, for each node n we traverse the graph post-order and store information on the reentrant nodes in the subgraph rooted at n. More precisely, each outgoing edge e_i from n defines a subgraph s_i along which we extract a list of all the reentrant nodes we encounter. This list also includes the node itself, if reentrant.

We then traverse the tree depth-first to collect the grammar fragments and build the derivation tree. Each node contains information about its variable (and type), lexical predicate and features, as well as a list of the labels on outgoing edges that we plug into the fragments. In order to add variable references, if any, we need to know whether there are any reentrant nodes that are shared across the subgraphs s_i...s_|e|. If so, these become variable references. If the node n itself is reentrant, we flag it with ∗ so that we know that its variable name can substitute a variable reference.
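The first pass can be sketched as a memoized post-order traversal; the graph interface assumed below (node.children as (edge label, child) pairs, an indegree map to detect reentrant nodes) is illustrative, not the authors' code.

```python
def collect_reentrant(node, indegree, memo):
    """Post-order pass: return the set of reentrant nodes reachable from
    `node`, including `node` itself if it is reentrant (in-degree > 1)."""
    if node in memo:
        return memo[node]
    reachable = set()
    if indegree[node] > 1:
        reachable.add(node)
    for _edge_label, child in node.children:
        reachable |= collect_reentrant(child, indegree, memo)
    memo[node] = reachable
    return reachable
```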


Figure 7: Overview of our architecture, following the description in § 3. Our encoder (on the left) computes multilingual word embeddings using mBERT, which then feed into a 2-layer BiLSTM. At time step t, a 2-layer decoder LSTM (on the right) reconstructs a graph G by predicting fragment ft and terminal label lt. Additionally, parsing the PMB requires predicting for each label lt a sense tag st and presupposition information pt (a boolean flag). To predict ft we use the hidden state of the decoder's first layer (in blue) along with a context vector c^f_t and information about the parent fragment ut (yellow edges). All other predictions are made using the hidden state of the decoder's second layer (in red) along with a separate context vector c^l_t. Both context vectors are computed using soft attention over the input representations (top left). Predicted fragments are used to substitute the leftmost non-terminal in the partial graph G (in pink), as shown at the top for G2...G5. For G1, the first fragment predicted initializes the graph (this corresponds to substituting the start symbol S). The edge labels in the fragments above are replaced with placeholders ê1...ê|e| to display how edge factorization works. We assume here, for brevity, that G5 is our final output graph and show the prediction of two edges that substitute in place of the placeholders (box at the bottom). For edge prediction, we use a bundle of features collected during decoding, namely the parent and child fragment embeddings ft, the second-layer hidden state (in red) and the context vector c^l at time t.


               en                de                nl                it
               P     R     F     P     R     F     P     R     F     P     R     F
monolingual    81.6  78.4  80    64.5  64    64.2  62.6  59.2  60.9  72.4  70.6  71.5
cross-lingual  -     -     -     72.8  73.6  73.2  73.4  74.9  74.1  74.2  76.2  75.2
bilingual      -     -     -     72    71.5  71.8  76.7  75.3  76    76.8  78.6  77.7
polyglot       81    78.8  79.8  72.2  72.9  72.5  74.3  73.8  74.1  78.2  77.5  77.9

Table 6: Results for the multilingual experiments on PMB v.3.0 (test set). Monolingual results (top half) are compared with different combinations of multilingual training data (bottom half).

Figure 8: The DRS of Figure 2 expressed as a Discourse Representation Graph (DRG).


    D Implementation Details

We use the pre-trained uncased multilingual BERT base model from Wolf et al. (2019). All models trained on English data, monolingual or multilingual, share the same hyper-parameter settings. Languages other than English in the PMB v3.0 have less training data, especially in the cases of Dutch and Italian. Hence, we reduce the model capacity across the board and increase dropout to avoid over-fitting. Hyperparameter settings are shown in Table 7.

We found fine-tuning the BERT model necessary to achieve good performance. Following Sun et al. (2019) and Howard and Ruder (2018), we experimented with different fine-tuning strategies, all applied after model performance plateaued:

1. setting a constant learning rate for the BERT layers

2. gradually unfreezing BERT layer by layer with a decaying learning rate

    3. slanted-triangular learning rate schedulingfollowing Howard and Ruder (2018).

We concluded that strategy 1 works best for our task, with a fine-tuning learning rate of 2e-5 for English and a smaller learning rate of 1e-5 for the other languages.
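Strategy 1 can be realized with two optimizer parameter groups, as sketched below; the learning rates and weight decay follow Table 7, while grouping parameters by a 'bert' name prefix is an assumed implementation detail.

```python
# Sketch of a constant-LR fine-tuning setup: small LR for mBERT, larger LR for the rest.
import torch

def build_optimizer(model, bert_lr=2e-5, base_lr=1e-3, weight_decay=1e-4):
    # assumes the mBERT encoder lives under an attribute whose name starts with "bert"
    bert_params = [p for n, p in model.named_parameters() if n.startswith("bert")]
    other_params = [p for n, p in model.named_parameters() if not n.startswith("bert")]
    return torch.optim.Adam(
        [{"params": bert_params, "lr": bert_lr},
         {"params": other_params, "lr": base_lr}],
        weight_decay=weight_decay)
```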

Model Parameters
BERT                                   768
Num. of Encoder Layers                 2
Encoder                   en           2@512
                          de/nl/it     1@512
Fragment/Relation/Label   en           100
                          de/nl/it     75
Edge Prediction Layer     en           100
                          de/nl/it     75
Decoder                   en           1024
                          de/nl/it     512

Optimization Parameters
Optimizer                              ADAM
Learning Rate                          0.001
Weight Decay                           1e-4
Gradient Clipping                      5
Label Smoothing ε                      0.1
BERT Finetune LR          en           2e-5
                          de/nl/it     1e-5
Dropout                   en           0.33
                          de/nl/it     0.5

Table 7: Hyper-parameter settings.

    E Multilingual experiments - full results

All results for the multilingual experiments, including precision and recall, are shown in Table 6.