
Awakening Latent Grounding from Pretrained Language Models for Semantic Parsing

Qian Liu†∗, Dejian Yang§, Jiahui Zhang†∗, Jiaqi Guo♦∗, Bin Zhou†, Jian-Guang Lou§

†Beihang University, Beijing, China    §Microsoft Research, Beijing, China

♦Xi'an Jiaotong University, Xi'an, China
†{qian.liu, 17231043, zhoubin}@buaa.edu.cn; ♦[email protected]
§{dejian.yang, jlou}@microsoft.com

Abstract

In recent years, pretrained language models (PLMs) have achieved success on several downstream tasks, demonstrating their power in modeling language. To better understand and leverage what PLMs have learned, several techniques have emerged to explore the syntactic structures entailed by PLMs. However, few efforts have been made to explore the grounding capabilities of PLMs, which are also essential. In this paper, we highlight the ability of PLMs to discover which token should be grounded to which concept when combined with our proposed erasing-then-awakening approach. Empirical studies on four datasets demonstrate that our approach can awaken latent grounding that is understandable to human experts, even though it is never exposed to such labels during training. More importantly, our approach shows great potential to benefit downstream semantic parsing models. Taking text-to-SQL as a case study, we successfully couple our approach with two off-the-shelf parsers, obtaining an absolute improvement of up to 9.8%.

1 Introduction

Recent breakthroughs of Pretrained Language Models (PLMs) such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) have demonstrated the effectiveness of self-supervised learning for a range of downstream tasks. Without being guided by structural information during training, PLMs show the potential for learning implicit syntactic structures and language semantics, which can be transferred to other tasks. To better understand and leverage what PLMs have learned, a line of work has emerged to probe or induce syntactic structures from PLMs. According to prior studies (Rogers et al., 2020), most existing work focuses on syntactic structures such as parts of speech (Liu et al.,

∗Work done during an internship at Microsoft Research. The first three authors contributed equally.

Figure 1: Typical scenarios for grounding: the linguistic tokens "george washington" in the question "what war was george washington associated with?" can be grounded into different real-world concepts, e.g., (a) a cell value in a structured table or (b) an entity in a knowledge base (such as the U.S. President George Washington).

2019), constituency trees (Wu et al., 2020) and dependency trees (Hewitt and Manning, 2019; Jawahar et al., 2019), paying much less attention to language semantics (Tenney et al., 2019). However, as is well known, semantic information is essential for high-level tasks like machine reading comprehension (Wang and Jiang, 2019).

Regarding language semantics, an important branch is grounding, which has been overlooked by most previous work. Broadly speaking, grounding means "connecting linguistic symbols to real-world perception or actions" (Roy, 2005). It is generally thought to be important for a variety of tasks, such as video description (Zhou et al., 2019), visual question answering (Zhu et al., 2016) and semantic parsing (Guo et al., 2019). In this paper, we focus on single-modal scenarios, where grounding refers more specifically to mapping linguistic tokens into a real-world concept described in natural language. As shown in Figure 1, "george washington" can be grounded into either a cell value in a structured table, or an entity in a knowledge base.

In single-modal scenarios, grounding is especially important for semantic parsing, the task of translating a natural language sentence into its corresponding executable logic form. For earlier work, grounding was essential, since that work largely conceptualized semantic parsing as grounding an


Figure 2: The illustration of ETA, which consists of a PLM module, a Concept Prediction (CP) module and a grounding module (example question: "How many total games were at braly stadium"). Two models (gray and blue) are drawn for illustration purposes; they are in fact the same model. Training involves three steps: (1) the concept prediction module is trained to predict the confidence of any concept occurring in a given question (left); (2) the erasing mechanism erases tokens in the question sequentially, feeds them into CP, and obtains the confidence differences (e.g., 0.92 − 0.65 = 0.27) as the pseudo alignment; only the process related to "stadium" is shown (bottom right); (3) the pseudo alignment is employed to awaken the latent grounding, i.e., to supervise the grounding module (top right). We show only one concept, "Venue", for the sake of brevity, which in practice is a sequence of concepts.

utterance to a task-specific meaning representation (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2013; Cheng et al., 2017). As for modern approaches based on the encoder-decoder architecture, grounding also plays an important role, and considerable work has demonstrated its positive effect (Guo et al., 2019; Dong et al., 2019; Liu et al., 2020a; Wang et al., 2020b; Chen et al., 2020). Despite this success, existing grounding methods mainly rely on heavy manual effort such as high-quality lexicons (Reddy et al., 2016) or on ad-hoc heuristic rules like n-gram matching (Guo et al., 2019), and thus suffer from poor flexibility. To explore more flexible methods, researchers have recently tried a data-driven way: they collected grounding annotations as supervision to train grounding models (Li et al., 2020a; Lei et al., 2020; Shi et al., 2020). However, this modeling flexibility requires expensive grounding annotations, which most of the time are not available.

To alleviate the above issues, we present a novel approach, Erasing-then-Awakening (ETA)¹. It is inspired by recent advances in interpretable machine learning (Samek et al., 2017), where the importance of individual pixels can be quantified with respect to the classification decision. Similarly, our approach first quantifies the contribution of each word with respect to each concept, by erasing it and probing the variation of concept prediction decisions (elaborated later). Then it employs these contributions as pseudo labels to awaken latent grounding from PLMs. In contrast to prior work, our approach only needs supervision for concept prediction, which can be easily derived from downstream tasks (e.g., text-to-SQL), instead of full grounding supervision. Empirical studies on four datasets demonstrate that our approach can awaken latent grounding that is understandable to human experts. This is highly non-trivial because our approach is not exposed to any human-annotated grounding label during training. More importantly, we find that the grounding can be easily coupled with downstream models to boost their performance, with an absolute improvement of up to 9.8%. In summary, our contributions are three-fold:

¹Our code is available at https://github.com/microsoft/ContextualSP

1. To the best of our knowledge, we are the first to highlight and demonstrate the possibility of awakening latent grounding from PLMs.

2. We propose a novel weakly supervised approach, erasing-then-awakening, to awaken latent grounding from PLMs. Empirical studies on four datasets demonstrate that our approach can awaken latent grounding that is understandable to human experts.

3. Taking text-to-SQL as a case study, we successfully couple our approach with two off-the-shelf parsers. Experimental results on two benchmarks show the effectiveness of our approach in boosting downstream performance.


2 Method: Erasing-then-Awakening

In the task of grounding, we are given a question x = ⟨x_1, ..., x_N⟩ and a concept set C = {c_1, ..., c_K}, where each concept consists of several tokens. The goal of grounding is to find tokens (also known as mentions) in x that are relevant to concepts in C. Generally, the grounding procedure learns to create an N×K matrix, which we call the latent grounding. In some cases, a set of pairs is needed, in which each pair explicitly indicates that a token and a concept are grounded. We call such pairs grounding pairs below.

As illustrated in Figure 2, our model consists of a PLM module, a CP module and a grounding module. In this section, we first present the training procedure of ETA, which at a high level involves three steps: (1) train an auxiliary concept prediction module; (2) erase tokens in a question to obtain the concept prediction confidence differences as the pseudo alignment; (3) awaken latent grounding from PLMs by applying the pseudo alignment as supervision. Then we introduce the procedure to produce grounding pairs during inference.

2.1 Training a Concept Prediction Module

Given x and C, the goal of the concept prediction module is to identify whether each concept c_k ∈ C is mentioned in the question x. Although this does not seem directly related to grounding, it is a prerequisite for the erasing mechanism, which will be elaborated later. As for c_k's supervision l_k ∈ {0, 1}, it is the weak supervision ETA relies on, and it can be readily obtained from downstream task signals. Taking text-to-SQL as an illustration, each database schema item (i.e., table, column and cell value) appearing in an annotated SQL can be considered as mentioned in the question (l_k = 1), with all others as negative examples (l_k = 0).
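The sketch below illustrates how such weak labels could be derived from an annotated SQL query. It is a minimal, illustrative sketch rather than the authors' preprocessing code: the helper `extract_schema_items` and the naive string normalization are assumptions made for this example.

```python
# Hypothetical sketch of deriving weak labels l_k from a gold SQL query:
# schema items that appear in the SQL are marked as mentioned (1), others as 0.
import re
from typing import List

def extract_schema_items(sql: str) -> set:
    # Naive tokenization of the SQL string; real preprocessing would parse the SQL.
    return set(re.findall(r"[a-z_][a-z0-9_.]*", sql.lower()))

def concept_labels(concepts: List[str], sql: str) -> List[int]:
    mentioned = extract_schema_items(sql)
    labels = []
    for concept in concepts:
        # A concept counts as mentioned if one of its normalized forms occurs in the SQL.
        name = concept.lower().replace(" ", "_")
        labels.append(1 if name in mentioned or concept.lower() in mentioned else 0)
    return labels

# Toy example: four mentioned schema items and one unmentioned one.
concepts = ["singer", "name", "country", "age", "hometown"]
sql = "SELECT name, country, age FROM singer ORDER BY age DESC"
print(concept_labels(concepts, sql))  # [1, 1, 1, 1, 0]
```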

Once the supervision is prepared, the CP module is trained to conduct binary classification over the representation of each concept. As done in previous work (Hwang et al., 2019), we first concatenate the question and all concepts into a sequence as input to the PLM module. As illustrated in Figure 2, the input sequence starts with [CLS], with the question and each concept separated by [SEP]. Then, the sequence is fed into the PLM module to produce deep contextual representations at each position. Denoting ⟨q_1, q_2, ..., q_N⟩ and ⟨e_1, e_2, ..., e_K⟩ as the token representations and concept representations, they can be obtained by:

$\{q_n\}_{n=1}^{N}, \{e_k\}_{k=1}^{K} = \mathrm{PLM}\big([\mathrm{CLS}], x, [\mathrm{SEP}], \{c_k\}_{k=1}^{K}\big),$   (1)

where q_n and e_k correspond to the representations at the position of the n-th question token and the first token of c_k, respectively. Finally, each concept representation e_k is passed to a classifier to predict whether it is mentioned in x as:

$p_k = \mathrm{Sigmoid}(W_l\, e_k),$   (2)

where W_l is a learnable parameter. p_k is the probability that c_k is mentioned in the question, which is referred to as the concept prediction confidence below.
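A minimal sketch of this CP module, assuming a Hugging Face BERT encoder, is shown below. Variable names, the concatenated input string, and the concept positions are illustrative, not the authors' implementation.

```python
# Sketch of the concept prediction module of Eq. (1)-(2): encode the question and
# concepts jointly, then classify each concept's first-token representation.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ConceptPrediction(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)  # plays the role of W_l

    def forward(self, input_ids, attention_mask, concept_positions):
        # concept_positions: (batch, K) index of the first token of each concept.
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Gather e_k, the representation at each concept's first token.
        e_k = torch.gather(
            hidden, 1,
            concept_positions.unsqueeze(-1).expand(-1, -1, hidden.size(-1)),
        )
        # p_k = Sigmoid(W_l e_k): probability that concept k is mentioned.
        return torch.sigmoid(self.classifier(e_k)).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "how many total games were at braly stadium [SEP] venue [SEP] date"
enc = tokenizer(text, return_tensors="pt")
model = ConceptPrediction()
p_k = model(enc["input_ids"], enc["attention_mask"],
            concept_positions=torch.tensor([[11, 13]]))  # positions are illustrative
# Training would use binary cross-entropy between p_k and the weak labels l_k.
```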

2.2 Erasing Question Tokens

Once the concept prediction module has converged, we apply an erasing mechanism to assist the following awakening phase. It follows a similar idea to interpretable document classification (Arras et al., 2016), where a word is considered important for document classification if removing it and classifying the modified document results in a strong decrease of the classification score. In our case, a token is considered highly relevant to certain concepts if there is a large drop in these concepts' prediction confidences after erasing the token. Therefore, we need the above-mentioned concept prediction module to provide a reasonable concept prediction confidence.

Concretely, as shown in Figure 2, the erasing mechanism erases the input sequentially, and feeds each erased input into the PLM module and the subsequent CP module. For example, with x_n substituted by a special token [UNK], we obtain an erased input [CLS], x_1, ..., x_{n-1}, [UNK], x_{n+1}, ..., c_K. Denoting by p_{n,k} the concept prediction confidence for c_k after erasing x_n, we believe the difference between p_{n,k} and p_k reveals c_k's relevance to x_n from the PLM's view. The confidence difference ∆_{n,k} can be obtained by ∆_{n,k} = l_k · max(0, p_k − p_{n,k}). Repeating the above procedure over the input question sequentially, ∆ ∈ R^{N×K} is filled completely.
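A minimal sketch of this erasing loop is given below. The callable `score_concepts` stands in for a call to a trained CP module like the one sketched above; it is an assumption for illustration, not the authors' API.

```python
# Sketch of the erasing mechanism: replace each question token by [UNK] in turn,
# re-score the concepts, and record Delta_{n,k} = l_k * max(0, p_k - p_{n,k}).
import torch

def compute_pseudo_alignment(question_tokens, concepts, labels, score_concepts, unk="[UNK]"):
    # p_k: confidence of every concept on the intact question, shape (K,).
    p = score_concepts(question_tokens, concepts)
    delta = torch.zeros(len(question_tokens), len(concepts))  # shape (N, K)
    for n in range(len(question_tokens)):
        erased = question_tokens[:n] + [unk] + question_tokens[n + 1:]
        p_n = score_concepts(erased, concepts)                # p_{n,k}, shape (K,)
        # Only mentioned concepts (l_k = 1) contribute; negative drops are clipped to 0.
        delta[n] = torch.tensor(labels) * torch.clamp(p - p_n, min=0.0)
    return delta
```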

2.3 Awakening Latent Grounding

As mentioned above, we believe ∆ reflects the relevance between each token and each concept from the PLM's view. Therefore, we could directly use ∆ as ETA's output. However, according to our preliminary study, this method performs poorly and


cannot produce high-quality alignment². Instead of directly using ∆, we employ it to "awaken" the latent grounding. To be specific, we introduce a grounding module on top of the representations of the PLM module and train it using ∆ as pseudo labels (i.e., the pseudo alignment). The grounding module first obtains grounding scores g_{n,k} between each question token x_n and each concept c_k based on their deep contextual representations q_n and e_k as:

$g_{n,k} = \frac{(W_e\, e_k) \cdot (W_q\, q_n)^{T}}{\sqrt{d}},$   (3)

where W_e and W_q are learnable parameters and d is the dimension of e_k. Then it normalizes the grounding scores into the latent grounding α as:

$\alpha_{n,k} = \frac{\exp(g_{n,k})}{\sum_{i} \exp(g_{i,k})}.$   (4)

Finally, the grounding module is trained to maximize the likelihood with ∆ as the weight:

$\sum_{n} \sum_{k} \Delta_{n,k} \cdot \log \alpha_{n,k}.$   (5)
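The sketch below shows one way Eq. (3)-(5) could be realized: scaled dot-product scores between token and concept representations, a softmax over question tokens for each concept, and a log-likelihood weighted by the pseudo alignment ∆. Names are illustrative, not the authors' code.

```python
# Sketch of the grounding module and the awakening loss of Eq. (3)-(5).
import torch
import torch.nn as nn

class GroundingModule(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.w_q = nn.Linear(hidden_size, hidden_size, bias=False)  # W_q
        self.w_e = nn.Linear(hidden_size, hidden_size, bias=False)  # W_e
        self.scale = hidden_size ** 0.5

    def forward(self, q, e):
        # q: (N, d) token representations, e: (K, d) concept representations.
        scores = (self.w_e(e) @ self.w_q(q).T) / self.scale   # g_{n,k}, shape (K, N)
        alpha = torch.softmax(scores, dim=-1).T               # normalize over tokens, then (N, K)
        return alpha

def awakening_loss(alpha, delta):
    # Maximize sum_{n,k} Delta_{n,k} * log(alpha_{n,k}); return the negative for gradient descent.
    return -(delta * torch.log(alpha + 1e-9)).sum()
```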

2.4 Producing Grounding Pairs

Repeating erasing and awakening iteratively for several epochs until the grounding module converges, we can readily produce grounding pairs. Formally, we aim to obtain a set of pairs, where each pair ⟨x_n, c_k⟩ indicates that x_n is grounded to c_k. Noticing that c_k may contain several tokens, we keep all probabilities in α_{·,k} that exceed τ/|c_k|, where τ is a threshold and |c_k| is the number of tokens in c_k. Also, taking into account that x_n should be grounded to only one concept, we keep only the highest probability over α_{n,·}. Finally, each pair ⟨x_n, c_k⟩ is considered a grounding pair if α_{n,k} is kept and p_k ≥ 0.5; otherwise it is not.
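A minimal sketch of this pair extraction, under the assumptions that alpha and p come from the modules above, is:

```python
# Sketch of producing grounding pairs (Section 2.4): keep alpha[n, k] only if it
# exceeds tau / |c_k|, restrict each token to its best concept, and require p_k >= 0.5.
import torch

def grounding_pairs(alpha, p, concept_lengths, tau=0.2):
    # alpha: (N, K) latent grounding, p: (K,) concept prediction confidence,
    # concept_lengths: list of |c_k| (number of tokens in each concept).
    n_tokens, _ = alpha.shape
    thresholds = tau / torch.tensor(concept_lengths, dtype=torch.float)
    best_concept = alpha.argmax(dim=1)            # each token grounds to at most one concept
    pairs = []
    for n in range(n_tokens):
        k = best_concept[n].item()
        if alpha[n, k] >= thresholds[k] and p[k] >= 0.5:
            pairs.append((n, k))
    return pairs
```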

3 Experiments

In this section, we conduct experiments to evaluate whether the latent grounding awakened by ETA is understandable to human experts. We accomplish this evaluation by comparing the grounding pairs produced by ETA with human annotations.

3.1 Experimental Setup

Datasets  We select two representative grounding tasks where human annotations are available: schema linking and entity linking. Schema linking

²More experimental results can be found in §3.3.

is to ground questions into database schemas, while entity linking is to ground questions into entities of knowledge bases. For schema linking, we select SPIDER-L (Lei et al., 2020) and SQUALL (Shi et al., 2020) as our evaluation benchmarks. As mentioned in §2.1, the supervision for our model is obtained from SQL queries. As for entity linking, we select WebQSP_EL and GraphQ_EL (Sorokin and Gurevych, 2018). The supervision for our model is obtained from SPARQL queries in a similar way.

Evaluation  For schema linking, as done in previous work (Lei et al., 2020), we report the micro-average precision, recall and F1-score for both columns (Col_P, Col_R, Col_F) and tables (Tab_P, Tab_R, Tab_F). For entity linking, we report the weak matching precision, recall and F1-score for entities (Ent_P, Ent_R, Ent_F). The weak matching metric is commonly used in previous work (Sorokin and Gurevych, 2018); it considers a prediction as correct whenever the correct entity is identified and the predicted mention boundary overlaps with the ground-truth boundary. More details can be found in §A.

Baselines  For schema linking, we consider four strong baselines. (1) N-gram Matching enumerates all n-gram (n ≤ 5) phrases in a natural language question, and links them to database schemas by fuzzy string matching. (2) SIM computes the dot-product similarity between each question token and schema item using their PLM representations without fine-tuning, to explore the grounding capacities of unawakened PLMs. (3) CONTRAST learns by comparing the aggregated grounding scores of mentioned schemas with unmentioned ones in a contrastive learning style, as done in Liu et al. (2020b). Concretely, in training, CONTRAST is trained to accomplish the same concept prediction task as our approach. With a similar architecture to the Receiver used in Liu et al. (2020b), it first computes the similarity score between each token and each concept, and then uses max pooling to aggregate the similarity scores of a concept over an utterance into a concept prediction score. Finally, a margin-based loss is used to encourage the baseline to give higher concept prediction scores on mentioned concepts than on unmentioned ones. (4) SLSQL_L & ALIGN_L. SLSQL_L (ALIGN_L) is the learnable schema linking module³ proposed in

³SLSQL and ALIGN use multi-task learning to simultaneously learn schema linking and SQL generation.


Model | SPIDER-L (Col_P / Col_R / Col_F / Tab_P / Tab_R / Tab_F) | SQUALL (Col_P / Col_R / Col_F)
N-gram Matching | 61.4 / 69.1 / 65.1 / 78.2 / 69.6 / 73.6 | 71.6 / 50.8 / 59.4
SIM + BERT | 16.6 / 8.0 / 10.8 / 8.5 / 11.6 / 9.8 | 13.9 / 18.0 / 15.7
CONTRAST + BERT | 83.7 / 68.4 / 75.3 / 84.0 / 76.9 / 80.3 | 47.9 / 31.2 / 37.8
ETA + BERT | 86.1 / 79.3 / 82.5 / 81.1 / 85.3 / 83.1 | 77.3 / 62.4 / 69.0
SLSQL_L + BERT♥ (Lei et al., 2020) | 82.6 / 82.0 / 82.3 / 80.6 / 84.0 / 82.2 | – / – / –
ALIGN_L + BERT♥ (Shi et al., 2020) | – / – / – / – / – / – | 79.2 / 72.8 / 75.8

Table 1: Experimental results on schema linking dev sets. ♥ means the model uses schema linking supervision, while the other learnable models use weak supervision. + BERT means using BERT as the encoder; the same applies to Table 2.

Model | WebQSP_EL (Ent_P / Ent_R / Ent_F) | GraphQ_EL, zero-shot (Ent_P / Ent_R / Ent_F)
Heuristic (Sorokin and Gurevych, 2018) | 30.2 / 60.8 / 40.4 | – / – / –
ETA + BERT | 76.6 / 72.5 / 74.5 | 43.1 / 42.1 / 42.7
VCG♥ (Sorokin and Gurevych, 2018) | 82.4 / 68.3 / 74.7 | 54.1 / 30.6 / 39.0
ELQ + BERT♥ (Li et al., 2020a) | 90.0 / 85.0 / 87.4 | 60.1 / 57.2 / 58.6

Table 2: Experimental results on entity linking test sets. ♥ means the model uses entity linking supervision from WebQSP_EL, while ETA uses the weak supervision derived from WebQSP. Following previous work (Sorokin and Gurevych, 2018), we use GraphQ_EL only in the evaluation phase to test the generalization ability of our model.

SLSQL (ALIGN). Unlike our method, these two methods are trained with full schema linking supervision. Please refer to Shi et al. (2020) and Lei et al. (2020) for more details. Notably, for baselines that require a threshold, we tuned their thresholds on the dev sets for a fair comparison.

For entity linking, we compare ETA with three powerful methods. (1) Heuristic picks the most frequent entity among the candidates found by string matching over Wikidata. (2) VCG (Sorokin and Gurevych, 2018) aggregates and mixes contexts of different granularities to perform entity linking. (3) ELQ (Li et al., 2020a) uses a bi-encoder to perform entity linking in one pass, achieving state-of-the-art performance on WebQSP_EL and GraphQ_EL. VCG and ELQ utilize entity linking supervision in training, while ETA does not.

Implementation  For schema linking we follow the procedure in §2.4 to produce grounding pairs for evaluation, while for entity linking we further merge adjacent grounding pairs to produce span-level grounding pairs. We implement ETA in PyTorch (Paszke et al., 2019). With respect to the PLMs used in the experiments, we use the uncased BERT-base (BERT)⁴ and BERT-large (BERT_L) from the

⁴Our approach is theoretically applicable to different PLMs. In this paper, we chose BERT as a representative and leave the exploration of different PLMs for future work.

Transformers library (Wolf et al., 2020). As for the optimizer, we employ AdamW (Loshchilov and Hutter, 2019). More details (e.g., learning rates) of each experiment can be found in §C.1.

3.2 Experimental Results

Table 1 shows the experimental results on the schema linking task. As shown, our method outperforms all weakly supervised and heuristic-based methods by a large margin. For example, on SPIDER-L, ETA + BERT achieves an absolute improvement of 7.2% Col_F and 2.8% Tab_F over the best baseline, CONTRAST. The same conclusion can be drawn from the experimental results on the entity linking task shown in Table 2. For instance, ETA + BERT obtains an Ent_F of up to 74.5% on WebQSP_EL, which is a satisfying performance for downstream tasks. All the results above demonstrate the superiority of our approach in awakening latent grounding from PLMs. As for the reason why PLMs work well on both schema linking and entity linking, it may be because both tasks require text-based semantic matching (e.g., synonyms), which PLMs excel at.

Furthermore, it is very surprising that, although not trained under fine-grained grounding supervision, our model is comparable with or only slightly worse than the fully supervised models across


Error Type | Example Error
Missed Grounding (43.1%) | "How many points did arnaud demare receive?" GOLD: points → "UCI world tour points"; PRED: (none)
Technically Correct (21.0%) | "Total population of millbrook first nation?" GOLD: population → "Population"; PRED: population → "Population", nation → "Community"
Partially Correct (15.8%) | "Who was the first winning captain?" GOLD: the first → "Year", winning captain → "Winning Captain"; PRED: first → "Year", winning captain → "Winning Captain"
Wrong Grounding (10.1%) | "Were the matinee and evening performances held earlier than the 8th anniversary?" GOLD: earlier → "Date"; PRED: matinee → "Performance", earlier → "Date"

Table 3: Four main error types made by ETA, along with their proportions, on the SQUALL dataset.

datasets. For instance, on SPIDER-L, our model exceeds the fully supervised baseline SLSQL_L by 0.9 points on Tab_F. On SQUALL, our model performs only slightly worse than the fully supervised baseline ALIGN_L. This is highly nontrivial, since CONTRAST, the best weakly supervised baseline on SPIDER-L, is far from the fully supervised model on SQUALL, while our model shows only a small drop. Besides, on WebQSP_EL and GraphQ_EL, although our model is inferior to the state-of-the-art model ELQ, it achieves a performance comparable to the fully supervised baseline VCG. These results provide strong evidence that PLMs do have very good grounding capabilities, and that our approach can awaken them from PLMs.

3.3 Model Analysis

In this section, we try to answer four research questions via a thorough analysis: RQ1. Does the grounding capability come mainly from the PLM? RQ2. Is the awakening phase necessary? RQ3. Do larger PLMs have better grounding capabilities? RQ4. What are the remaining errors?

RQ1  There is a long-standing debate in the literature about whether knowledge is primarily learned by PLMs when extra parameters are employed in the analysis (Hewitt and Liang, 2019). Similarly, since our approach depends on extra modules (e.g., the grounding module), it faces the same dilemma: how can we know whether the latent grounding is learned from PLMs or from the extra modules? Therefore, we apply our approach to a randomly initialized Transformer encoder (Vaswani et al., 2017), to probe the grounding capability of a model that has not been pretrained. To make it comparable, the encoder has the same architecture as BERT. However, it only achieves a 40% Col_F on SQUALL, not even as good

Figure 3: Col_F score on the dev set of SQUALL at different training epochs for Awakening, Pseudo Alignment, Pseudo w/ Softmax and Pseudo w/ Sum. "Pseudo w/ Softmax" means normalizing the pseudo alignment with Softmax, while "Pseudo w/ Sum" means normalizing by dividing each value by their sum.

as the N-gram baseline. Considering that it contains the same extra modules as ETA + BERT, the huge gap between it and ETA + BERT supports the view that the latent grounding is mainly learned from PLMs. Meanwhile, one concern shared by our reviewers is the risk of supervision exposure during training of the concept prediction module. In other words, our approach may "steal" some supervision in the concept prediction module to achieve good grounding performance. However, the above experiment demonstrates that a non-pretrained model is far from strong grounding capability even with the same concept prediction module. We hope this finding alleviates the concern.

RQ2  As mentioned in §2.3, the pseudo alignment ∆ can also be employed directly as the model prediction. Therefore, we conduct experiments to verify whether our proposed awakening phase is necessary. As shown in Figure 3, even with various normalization methods (e.g., Softmax), ∆ does not produce satisfactory alignment. In contrast, our model consistently performs well. To investigate further, we conduct a careful analysis of ∆, and we are surprised to


Figure 4: The illustration of the solution to couple ETA with downstream text-to-SQL parsers: the PLM encodes the question and schema, ETA produces the latent grounding matrix, and the fused schema-aware representations are fed into an encoder-decoder that generates the SQL query.

find that the values of ∆ are generally small and not as significantly different from each other as we would expect. Therefore, we believe the success of our approach stems from the fact that it encourages the grounding module to capture subtle differences and strengthen them.

RQ3  We apply our approach to BERT-large (BERT_L) and conduct experiments on SPIDER-L. The results show that BERT_L brings an improvement of 2.5% Col_F and 0.5% Tab_F, suggesting the possibility of awakening better latent grounding from larger PLMs. Nevertheless, the improvement may also come from the larger number of parameters, so this conclusion needs further investigation.

RQ4  We manually examine 20% of our model's errors on the SQUALL dataset and summarize four main error types: (1) missed grounding, where our model did not ground any token to a concept; (2) technically correct, where our model was technically correct but the annotation was missing; (3) partially correct, where our model did not find all tokens of a concept; (4) wrong grounding, where the model produced an incorrect grounding. As shown in Table 3, only a small fraction of errors are wrong grounding, indicating that the main challenge of our approach is recall rather than precision.

4 Case Study: Text-to-SQL

The ETA model is proposed for general-purpose use and is intended to enhance different downstream semantic parsing models. To verify this, we take the text-to-SQL task as a case study. In this section, we first present a general solution to couple ETA with different text-to-SQL parsers. Then, we conduct experiments on two off-the-shelf parsers to verify the effectiveness of ETA.

4.1 Coupling with Text-to-SQL Parsers

Inspired by Lei et al. (2020), we present a general solution to couple ETA with downstream parsers in

Model | Dev Ex.Match | Dev Ex.Acc | Test Ex.Acc
ALIGN_P | 37.8 ± 0.6 | 56.9 ± 0.7 | 46.6 ± 0.5
ALIGN_P + BERT | 44.7 ± 2.1 | 63.8 ± 1.1 | 51.8 ± 0.4
ETA + BERT | 47.6 ± 2.5 | 66.6 ± 1.7 | 53.8 ± 0.3
ALIGN♥ | 42.2 ± 1.5 | 61.3 ± 0.8 | 49.7 ± 0.4
ALIGN + BERT♥ | 47.2 ± 1.2 | 66.5 ± 1.2 | 54.1 ± 0.2

Table 4: Ex.Match and Ex.Acc results on the dev and test sets of WTQ. + BERT means using BERT to enhance the encoder. ♥ means the model uses extra schema linking supervision. Both also apply to Table 5.

Figure 4. As shown, we first obtain a schema-aware representation for each question token by fusing the token representation with its related schema representations according to the latent grounding α ∈ R^{N×K} (the gray matrix in Figure 4). Specifically, given a token representation q_n and all schema representations ⟨e_1, e_2, ..., e_K⟩, the schema-aware representation q̃_n for q_n can be computed as:

$\tilde{q}_n = q_n \oplus \sum_{k} \alpha_{n,k}\, e_k.$   (6)

Then we feed every q̃_n into a question encoder to generate hidden states, which are attended to by a decoder to decode the SQL query. By contributing to the schema-aware representation, ETA is able to prompt the decoder to predict appropriate schemas during decoding. Notably, the encoder and decoder are not limited to specific modules, and we follow the original paper settings in the subsequent experiments.
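A minimal sketch of the fusion in Eq. (6) is given below. Concatenation is used for the ⊕ operator here, which is one plausible reading of the figure; the actual fusion inside a given parser may differ.

```python
# Sketch of Eq. (6): fuse each token representation with a grounding-weighted
# sum of schema representations before feeding it to the downstream encoder.
import torch

def schema_aware_representations(q, e, alpha):
    # q: (N, d) token representations, e: (K, d) schema representations,
    # alpha: (N, K) latent grounding produced by ETA.
    grounded = alpha @ e                      # (N, d): sum_k alpha_{n,k} * e_k
    return torch.cat([q, grounded], dim=-1)   # (N, 2d): q_n concatenated with its grounded summary
```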

4.2 Experimental Setup

Datasets and Evaluation  We conduct experiments on two text-to-SQL benchmarks: WikiTableQuestions (WTQ) (Pasupat and Liang, 2015)⁵ and Spider (Yu et al., 2018b). Following previous work, we employ three evaluation metrics: Exact Match (Ex.Match), Exact Set Match (Ex.Set) and Execution Accuracy (Ex.Acc). Ex.Match evaluates the correctness of the predicted SQL by checking whether it is equal to the ground truth, while Ex.Set evaluates structural correctness by checking the set match of each SQL clause in the predicted query with respect to the ground truth. Ex.Acc evaluates the functional correctness of the predicted SQL by checking whether it yields the ground-truth answer.

⁵Note that the original WTQ only contains answer annotations; here we use the version with SQL annotations provided by Shi et al. (2020). Our training data is a subset of the original train set, while the test data remains the same.


Model | Dev | Test
IRNet + BERT (Guo et al., 2019) | 61.9 | 54.7
IRNet v2 + BERT (Guo et al., 2019) | 63.9 | 55.0
BRIDGE + BERT_L (Lin et al., 2020) | 70.0 | 65.0
RATSQL + BERT_L (Wang et al., 2020a) | 69.7 | 65.6
SLSQL_P + BERT | 57.4 | –
SLSQL_P + BERT_L | 61.0 | –
ETA + BERT | 64.5 | 59.5
ETA + BERT_L | 70.8 | 65.3

Table 5: Ex.Set results on the dev and test sets of Spider.

Baselines  On WTQ, our baselines include ALIGN_P and ALIGN, where the former is a vanilla attention-based sequence-to-sequence model and the latter enhances ALIGN_P with an additional schema linking task (Shi et al., 2020). Similarly, on Spider, our main baselines are SLSQL_P and its schema-linking-enhanced version SLSQL (Lei et al., 2020). SLSQL_P is made up of a question encoder and a two-step SQL decoder. In the first decoding step, a coarse SQL (i.e., without aggregation functions) is generated. Then the coarse SQL is used to synthesize the final SQL in the second decoding step. We also report the performance of SLSQL + BERT (Oracle), where the learnable schema linking module is replaced with human annotations at inference time. It represents the maximum potential benefit of schema linking for the text-to-SQL task. Meanwhile, for a comprehensive comparison, we also compare our model with state-of-the-art models on the Spider benchmark⁶. We refer readers to their papers for details.

Implementation  As for our approach, on WTQ we employ ALIGN_P⁷ as our base parser, while on Spider we select SLSQL_P⁸ as our base parser. For both parsers, we follow the same hyperparameters as described in the original papers to reduce other factors that may affect performance. More implementation details can be found in §C.2.

4.3 Experimental Results

Table 4 and Table 5 show the experimental results of several methods on WTQ and Spider, respectively. As observed, introducing ETA dramatically improves the performance of both base parsers, demonstrating its effectiveness on downstream tasks. Taking Spider as an illustration, our model ETA + BERT boosts SLSQL_P + BERT by

⁶https://yale-lily.github.io/spider
⁷https://github.com/tzshi/squall
⁸https://github.com/WING-NUS/slsql

Figure 5: The latent grounding produced by ETA + BERT_L for the question "Where is the youngest teacher from?", over the concepts teacher, teacher.age and teacher.hometown.

an absolute improvement of 7.1% on the Ex.Set metric. As the PLM becomes larger (e.g., BERT_L), the improvement becomes more significant, up to 9.8%. Compared with state-of-the-art methods, our model ETA + BERT_L also obtains competitive performance, which is extremely impressive since it is based on a simple parser.

More interestingly, on both datasets, our model achieves similar or even better performance compared to methods that employ extra grounding supervision. For instance, in comparison with SLSQL + BERT on Spider, our ETA + BERT outperforms it by 3.7%. Taking into account that SLSQL utilizes additional supervision, this performance gain is very surprising. We attribute the gain to two possible reasons: (1) the PLMs already learn latent grounding that is understandable to human experts; (2) compared with training with strong schema linking supervision, training with weak supervision alleviates the issue of exposure bias, and thus enhances the generalization ability of ETA.

Table 6 presents the predictions of ETA + BERT_L on three real cases. As observed, ETA has learned grounding for adjectives (e.g., oldest → age), entities (e.g., where → hometown) and semantic matching (e.g., registered → student_enrolment). Meanwhile, grounding pairs provide a useful guide to better understand the model predictions. Figure 5 visualizes the latent grounding for Q2 in Table 6, and more visualizations can be found in §D.

5 Related Work

The work most related to ours is the line of inducing or probing knowledge in pretrained language models. According to the knowledge category, there are mainly two kinds of methods: one focuses on syntactic knowledge and the other pays attention to semantic knowledge. Under the category of


Question with Alignment | SQL with Alignment

1. Question: Show name1, country2, age3 for all singers4 ordered by age3 from the oldest3 to the youngest.
   SQL: SELECT name1, country2, age3 FROM singer4 ORDER BY age3 DESC

2. Question: Where1 is the youngest2 teacher3 from?
   SQL: SELECT hometown1 FROM teacher3 ORDER BY age2 ASC LIMIT 1

3. Question: For each semester1, what is the name2 and id3 of the one with the most students registered4?
   SQL: SELECT semester_name2, semester_id3 FROM semesters1 JOIN student_enrolment4 ON semesters.semester_id = student_enrolment.semester_id GROUP BY semester_id3 ORDER BY COUNT(*) DESC LIMIT 1

Table 6: The predicted grounding pairs and SQLs of our best model on three real cases from the Spider dev set. A question token and a schema item with the same subscript are grounded to each other.

syntactic knowledge, several works have shown that BERT embeddings encode syntactic information in a structural form that can be recovered (Lin et al., 2019b; Warstadt and Bowman, 2020; Hewitt and Manning, 2019; Wu et al., 2020). However, recent work has also shown that BERT does not rely on syntactic information for downstream task performance, and thus doubts the role of syntactic knowledge (Ettinger, 2020; Glavas and Vulic, 2020). As for semantic knowledge, although it is less explored than syntactic knowledge, previous work has shown that BERT contains some semantic information, such as entity types (Ettinger, 2020), semantic roles (Tenney et al., 2019) and factual knowledge (Petroni et al., 2019). Different from the above work, we focus on the grounding capability, an under-explored branch of language semantics.

Our work is also closely related to entity linking and schema linking, which can be viewed as subareas of grounding in specific scenarios. Given an utterance, entity linking aims at finding all mentioned entities in it, using a knowledge base as the candidate pool (Tan et al., 2017; Chen et al., 2018; Li et al., 2020a), while schema linking tries to find all mentioned schema items related to specific databases (Dong et al., 2019; Lei et al., 2020; Shi et al., 2020). Previous work generally either employed full supervision to train linking models (Li et al., 2020a; Lei et al., 2020; Shi et al., 2020), or treated linking as a minor pre-processing step (Yu et al., 2018a; Guo et al., 2019; Lin et al., 2019a) and used heuristic rules to obtain the result. Our work differs in that we optimize the linking model with weak supervision from downstream signals, which is flexible and practicable. Similarly, Dong et al. (2019) utilized downstream supervision to train their linking model. Compared with their use of policy gradients, our method is more efficient since it directly learns the grounding module using the pseudo alignment as supervision.

6 Conclusion & Future Work

In summary, we propose a novel weakly supervised approach to awaken latent grounding from pretrained language models via erasing. With only downstream signals, our approach can induce latent grounding from pretrained language models that is understandable to human experts. More importantly, we demonstrate that our approach can be applied to off-the-shelf text-to-SQL parsers and significantly improve their performance. For future work, we plan to extend our approach to more downstream tasks such as visual question answering. We also plan to utilize our approach to improve the error locator module in existing interactive semantic parsing systems (Li et al., 2020b).

Acknowledgement

We would like to thank all the anonymous reviewers for their constructive feedback and useful comments. We also thank Tao Yu and Bo Pang for evaluating our submitted models on the test set of Spider. The first author, Qian, is supported by the Academic Excellence Foundation of Beihang University for PhD Students.

Ethical Considerations

This paper conducts experiments on several existing datasets covering the areas of entity linking, schema linking and text-to-SQL. All claims in this paper are based on the experimental results. Every experiment can be conducted on a single Tesla P100 or P40 GPU. No demographic or identity characteristics information is used in this paper.

References

Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2016. "What is relevant in a text document?": An interpretable machine learning approach. CoRR, abs/1612.07843.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Lihan Chen, Jiaqing Liang, Chenhao Xie, and Yanghua Xiao. 2018. Short text entity linking with fine-grained topics. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, pages 457–466, New York, NY, USA. Association for Computing Machinery.

Sanxing Chen, Aidan San, Xiaodong Liu, and Yangfeng Ji. 2020. A tale of two linkings: Dynamically gating between schema linking and structural linking for text-to-SQL parsing. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2900–2912, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Jianpeng Cheng, Siva Reddy, Vijay Saraswat, and Mirella Lapata. 2017. Learning structured natural language representations for semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 44–55, Vancouver, Canada. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Zhen Dong, Shizhao Sun, Hongzhi Liu, Jian-Guang Lou, and Dongmei Zhang. 2019. Data-anonymous encoding for text-to-SQL generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5405–5414, Hong Kong, China. Association for Computational Linguistics.

Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.

Goran Glavaš and Ivan Vulić. 2020. Is supervised syntactic parsing beneficial for language understanding? An empirical investigation. CoRR, abs/2008.06788.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4524–4535, Florence, Italy. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Wonseok Hwang, Jinyeung Yim, Seunghyun Park, and Minjoon Seo. 2019. A comprehensive exploration on WikiSQL with table-aware word contextualization. CoRR, abs/1902.01069.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Wenqiang Lei, Weixin Wang, Zhixin Ma, Tian Gan, Wei Lu, Min-Yen Kan, and Tat-Seng Chua. 2020. Re-examining the role of schema linking in text-to-SQL. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6943–6954, Online. Association for Computational Linguistics.

Belinda Z. Li, Sewon Min, Srinivasan Iyer, Yashar Mehdad, and Wen-tau Yih. 2020a. Efficient one-pass end-to-end entity linking for questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6433–6441, Online. Association for Computational Linguistics.

Yuntao Li, Bei Chen, Qian Liu, Yan Gao, Jian-Guang Lou, Yan Zhang, and Dongmei Zhang. 2020b. "What do you mean by that?" A parser-independent interactive approach for enhancing text-to-SQL. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6913–6922, Online. Association for Computational Linguistics.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446.

Kevin Lin, Ben Bogin, Mark Neumann, Jonathan Berant, and Matt Gardner. 2019a. Grammar-based neural text-to-SQL generation. CoRR, abs/1905.13326.

Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2020. Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4870–4888, Online. Association for Computational Linguistics.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019b. Open sesame: Getting inside BERT's linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253, Florence, Italy. Association for Computational Linguistics.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Qian Liu, Bei Chen, Jiaqi Guo, Jian-Guang Lou, Bin Zhou, and Dongmei Zhang. 2020a. How far are we from effective context modeling? An exploratory study on semantic parsing in context. In IJCAI, pages 3580–3586.

Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, and Dongmei Zhang. 2020b. You impress me: Dialogue generation via mutual persona perception. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1417–1427, Online. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing, China. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics, 4:127–140.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Deb Roy. 2005. Grounding words in perception and action: computational insights. Trends in Cognitive Sciences, 9(8):389–396.

W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller. 2017. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673.

Tianze Shi, Chen Zhao, Jordan Boyd-Graber, Hal Daumé III, and Lillian Lee. 2020. On the potential of lexico-logical alignments for semantic parsing to SQL queries. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1849–1864, Online. Association for Computational Linguistics.

Daniil Sorokin and Iryna Gurevych. 2018. Mixing context granularities for improved entity linking on question answering data across entity categories. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 65–75, New Orleans, Louisiana. Association for Computational Linguistics.

Chuanqi Tan, Furu Wei, Pengjie Ren, Weifeng Lv, and Ming Zhou. 2017. Entity linking for queries by searching Wikipedia sentences. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 68–77, Copenhagen, Denmark. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020a. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online. Association for Computational Linguistics.

Chao Wang and Hui Jiang. 2019. Explicit utilization of general knowledge in machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2263–2272, Florence, Italy. Association for Computational Linguistics.

Kai Wang, Weizhou Shen, Yunyi Yang, Xiaojun Quan, and Rui Wang. 2020b. Relational graph attention network for aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3229–3238, Online. Association for Computational Linguistics.

Alex Warstadt and Samuel R. Bowman. 2020. Can neural networks acquire a structural bias from raw linguistic data? CoRR, abs/2007.06761.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4166–4176, Online. Association for Computational Linguistics.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 588–594, New Orleans, Louisiana. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, Portland, Oregon, USA, August 4-8, 1996, Volume 2, pages 1050–1055. AAAI Press / The MIT Press.

Luke Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI '05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, 2005, pages 658–666. AUAI Press.

Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, and Marcus Rohrbach. 2019. Grounded video description. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 6578–6587. Computer Vision Foundation / IEEE.

Yuke Zhu, Oliver Groth, Michael S. Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4995–5004. IEEE Computer Society.


A Evaluation Details

A.1 Schema Linking

Let $\Omega_{col} = \{(c, q)_i \mid 1 \le i \le N\}$ be the set containing the $N$ gold (column, question token) tuples, and let $\hat{\Omega}_{col} = \{(c, q)_j \mid 1 \le j \le M\}$ be the set containing the $M$ predicted (column, question token) tuples. We define the precision (Col_P), recall (Col_R) and F1-score (Col_F) as:

$\mathrm{Col}_P = \frac{|\Gamma_{col}|}{|\hat{\Omega}_{col}|}, \quad \mathrm{Col}_R = \frac{|\Gamma_{col}|}{|\Omega_{col}|}, \quad \mathrm{Col}_F = \frac{2\,\mathrm{Col}_P\,\mathrm{Col}_R}{\mathrm{Col}_P + \mathrm{Col}_R},$

where $\Gamma_{col} = \Omega_{col} \cap \hat{\Omega}_{col}$. The definitions of Tab_P, Tab_R and Tab_F are similar. Note that the results reported in Table 8 of Shi et al. (2020) use different evaluation metrics. Here we re-evaluate their model with the above metrics for a fair comparison.

A.2 Entity Linking

Let $\Omega = \{(e, [q_s, q_e])_i \mid 1 \le i \le N\}$ be the gold entity-mention set and $\hat{\Omega} = \{(\hat{e}, [\hat{q}_s, \hat{q}_e])_j \mid 1 \le j \le M\}$ be the predicted entity-mention set, where $e$ is the entity and $q_s, q_e$ are the mention boundaries in the question $q$. In the weak matching setting, a prediction is correct only if the ground-truth entity is identified and the predicted mention boundaries overlap with the ground-truth boundaries. Therefore, the true-positive prediction set is defined as:

$\Gamma = \{e \mid (e, [q_s, q_e]) \in \Omega,\ (e, [\hat{q}_s, \hat{q}_e]) \in \hat{\Omega},\ [q_s, q_e] \cap [\hat{q}_s, \hat{q}_e] \neq \emptyset\}.$

The corresponding precision (Ent_P), recall (Ent_R) and F1 (Ent_F) are:

$\mathrm{Ent}_P = \frac{|\Gamma|}{|\hat{\Omega}|}, \quad \mathrm{Ent}_R = \frac{|\Gamma|}{|\Omega|}, \quad \mathrm{Ent}_F = \frac{2\,\mathrm{Ent}_P\,\mathrm{Ent}_R}{\mathrm{Ent}_P + \mathrm{Ent}_R}.$
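A minimal sketch of the weak matching computation, under the assumption that spans are (start, end) token indices, is:

```python
# Sketch of the weak matching metric for entity linking: a predicted (entity, span)
# counts as correct if the entity matches a gold entity and the spans overlap.
def weak_match_prf(gold, pred):
    # gold, pred: lists of (entity, (start, end)) tuples.
    def overlaps(a, b):
        return max(a[0], b[0]) <= min(a[1], b[1])
    correct = sum(
        1 for ent_p, span_p in pred
        if any(ent_p == ent_g and overlaps(span_p, span_g) for ent_g, span_g in gold)
    )
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```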

B Dataset Statistics

All details of the datasets used in this paper are shown in Table 7.

C Implementation Details

For all experiments, we employ the AdamW optimizer and the default learning rate schedule strategy provided by the Transformers library (Wolf et al., 2020).

C.1 Experiments on Grounding

SQUALL  We use uncased BERT-base as the encoder. The learning rate is 3 × 10⁻⁵. Training lasts 50 epochs with a batch size of 16. The dropout rate and the threshold τ are set to 0.3 and 0.2, respectively. The training process lasts 6 hours on a single 16GB Tesla P100 GPU.

SPIDER-L  We implement two versions: uncased BERT-base and uncased BERT-large. For both versions, the learning rate is 5 × 10⁻⁵ and the number of training epochs is 50. For the BERT-base (BERT-large) version, the batch size and gradient accumulation steps are set to 12 (6) and 6 (4). The dropout rate and the threshold τ are set to 0.3 and 0.2, respectively. As for training time, the BERT-base (BERT-large) version is trained on a 24GB Tesla P40 and takes about 16 (48) hours to finish the training process.

WebQSP_EL & GraphQ_EL  Due to the large number of entity candidates, we first use the candidate retrieval method proposed in Sorokin and Gurevych (2018) to reduce the number of candidates. After that, we still cannot feed all candidates along with the question due to the maximum encoding length of BERT. Therefore, we divide the candidates into multiple chunks and feed each chunk (along with the question) into BERT sequentially.

In the implementation, we use uncased BERT-base as the encoder. The learning rate is 1 × 10⁻⁵. Training lasts 50 epochs with a batch size of 16. The dropout rate and the threshold τ are set to 0.3 and 0.3, respectively. The training procedure finishes within 10 hours on a single Tesla M40 GPU.

C.2 Experiments on Text-to-SQL

For the experiments on the text-to-SQL task, we employ the official code released along with Shi et al. (2020) (on WTQ) and Lei et al. (2020) (on Spider). When coupling ETA with these models, we first produce a one-hot grounding matrix derived from the grounding pairs and then feed it into them as described in §4.

WTQ  We use uncased BERT-base as the encoder. Training lasts 50 epochs with a batch size of 8. The learning rate is 1 × 10⁻⁵ for the BERT module and 1 × 10⁻³ for the other modules. The dropout rate is set to 0.2. The training process finishes within 16 hours on a single 16GB Tesla P100 GPU.

Meanwhile, we follow previous work (Shi et al., 2020) in employing 5-fold cross-validation, and


Dataset | Train #Q | Train #C | Dev #Q | Dev #C | Test #Q | Test #C
SQUALL | 9,030 | 19,185 | 2,246 | 4,774 | – | –
SPIDER-L | 7,000 | 28,848 | 1,034 | 4,360 | – | –
WTQ | 9,030 | – | 2,246 | – | 4,344 | –
Spider | 7,000 | – | 1,034 | – | 2,147 | –
WebQSP_EL | 2,974 | 3,242 | – | – | 1,603 | 1,806
GraphQ_EL | 2,089 | 2,253 | – | – | 2,075 | 2,229

Table 7: Statistics for all datasets used in our experiments. For SQUALL and WTQ, we only show the size of Split 0; details of the other splits can be found in Table 8. #Q represents the number of questions, #C represents the number of concepts.

Split | Train | Dev
0 | 9,030 | 2,246
1 | 9,032 | 2,244
2 | 9,028 | 2,248
3 | 8,945 | 2,331
4 | 9,069 | 2,207

Table 8: The size of the train set and dev set of the five splits of SQUALL and WTQ.

Split | Dev Ex.Match | Dev Ex.Acc | Test Ex.Acc
0 | 45.10 | 64.43 | 53.57
1 | 47.39 | 67.01 | 54.17
2 | 47.24 | 65.93 | 53.61
3 | 45.99 | 65.72 | 53.41
4 | 52.38 | 69.73 | 52.41

Table 9: The experimental results of all splits on WTQ.

the experimental results of all five splits on WTQ using ETA + BERT are shown in Table 9.

Spider  We implement two versions: uncased BERT-base and uncased BERT-large. For BERT-base (BERT-large), the learning rate is 1.25 × 10⁻⁵ (6.25 × 10⁻⁶) for the BERT module and 1 × 10⁻⁴ (5 × 10⁻⁵) for the other modules. The batch size and gradient accumulation steps are set to 10 (6) and 5 (4) for the BERT-base (BERT-large) version. The dropout rate is set to 0.3. As for training time, the BERT-base (BERT-large) version is trained on a 24GB Tesla P40 and takes about 36 (56) hours to finish the training process.

D Latent Grounding Visualization

Figure 6 and Figure 7 show the latent grounding visualizations corresponding to the examples in Table 6.


Figure 6: The latent grounding produced by ETA + BERT_L for the question "Show name, country, age for all singers ordered by age from the oldest to the youngest.", over the concepts singer, singer.name, singer.country and singer.age.

Figure 7: The latent grounding produced by ETA + BERT_L for the question "For each semester, what is the name and id of the one with the most students registered?", over the concepts semesters, student_enrolment, semesters.semester_id and semesters.semester_name.