
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, July 5–10, 2020. © 2020 Association for Computational Linguistics


TABERT: Pretraining for Joint Understanding of Textual and Tabular Data

Pengcheng Yin∗  Graham Neubig
Carnegie Mellon University

{pcyin,gneubig}@cs.cmu.edu

Wen-tau Yih  Sebastian Riedel
Facebook AI Research

{scottyih,sriedel}@fb.com

Abstract

Recent years have witnessed the burgeoning of pretrained language models (LMs) for text-based natural language (NL) understanding tasks. Such models are typically trained on free-form NL text, hence may not be suitable for tasks like semantic parsing over structured data, which require reasoning over both free-form NL questions and structured tabular data (e.g., database tables). In this paper we present TABERT, a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables. TABERT is trained on a large corpus of 26 million tables and their English contexts. In experiments, neural semantic parsers using TABERT as feature representation layers achieve new best results on the challenging weakly-supervised semantic parsing benchmark WIKITABLEQUESTIONS, while performing competitively on the text-to-SQL dataset SPIDER.1

1 Introduction

Recent years have witnessed a rapid advance in the ability to understand and answer questions about free-form natural language (NL) text (Rajpurkar et al., 2016), largely due to large-scale, pretrained language models (LMs) like BERT (Devlin et al., 2019). These models allow us to capture the syntax and semantics of text via representations learned in an unsupervised manner, before fine-tuning the model to downstream tasks (Melamud et al., 2016; McCann et al., 2017; Peters et al., 2018; Liu et al., 2019b; Yang et al., 2019; Goldberg, 2019). It is also relatively easy to apply such pretrained LMs to comprehension tasks that are modeled as text span selection problems, where the boundary of an answer span can be predicted using a simple classifier on top of the LM (Joshi et al., 2019).

∗ Work done while at Facebook AI Research.
1 Available at github.com/facebookresearch/TaBERT

However, it is less clear how one could pretrain and fine-tune such models for other QA tasks that involve joint reasoning over both free-form NL text and structured data. One example task is semantic parsing for access to databases (DBs) (Zelle and Mooney, 1996; Berant et al., 2013; Yih et al., 2015), the task of transducing an NL utterance (e.g., “Which country has the largest GDP?”) into a structured query over DB tables (e.g., SQL querying a database of economics). A key challenge in this scenario is understanding the structured schema of DB tables (e.g., the name, data type, and stored values of columns), and more importantly, the alignment between the input text and the schema (e.g., the token “GDP” refers to the Gross Domestic Product column), which is essential for inferring the correct DB query (Berant and Liang, 2014).

Neural semantic parsers tailored to this task therefore attempt to learn joint representations of NL utterances and the (semi-)structured schema of DB tables (e.g., representations of its columns or cell values, as in Krishnamurthy et al. (2017); Bogin et al. (2019b); Wang et al. (2019a), inter alia). However, this unique setting poses several challenges in applying pretrained LMs. First, information stored in DB tables exhibits strong underlying structure, while existing LMs (e.g., BERT) are solely trained for encoding free-form text. Second, a DB table could potentially have a large number of rows, and naively encoding all of them using a resource-heavy LM is computationally intractable. Finally, unlike most text-based QA tasks (e.g., SQuAD, Rajpurkar et al. (2016)), which could be formulated as a generic answer span selection problem and solved by a pretrained model with additional classification layers, semantic parsing is highly domain-specific, and the architecture of a neural parser is strongly coupled with the structure of its underlying DB (e.g., systems for SQL-based and other types of DBs use different encoder models). In fact, existing systems have attempted to leverage BERT, but each with their own domain-specific, in-house strategies to encode the structured information in the DB (Guo et al., 2019; Zhang et al., 2019a; Hwang et al., 2019), and importantly, without pretraining representations on structured data. These challenges call for the development of general-purpose pretraining approaches tailored to learning representations for both NL utterances and structured DB tables.

In this paper we present TABERT, a pretraining approach for joint understanding of NL text and (semi-)structured tabular data (§ 3). TABERT is built on top of BERT, and jointly learns contextual representations for utterances and the structured schema of DB tables (e.g., a vector for each utterance token and table column). Specifically, TABERT linearizes the structure of tables to be compatible with a Transformer-based BERT model. To cope with large tables, we propose content snapshots, a method to encode a subset of table content most relevant to the input utterance. This strategy is further combined with a vertical attention mechanism to share information among cell representations in different rows (§ 3.1). To capture the association between tabular data and related NL text, TABERT is pretrained on a parallel corpus of 26 million tables and English paragraphs (§ 3.2).

TABERT can be plugged into a neural semantic parser as a general-purpose encoder to compute representations for utterances and tables. Our key insight is that although semantic parsers are highly domain-specific, most systems rely on representations of input utterances and the table schemas to facilitate subsequent generation of DB queries, and these representations can be provided by TABERT, regardless of the domain of the parsing task.

We apply TABERT to two different semantic parsing paradigms: (1) a classical supervised learning setting on the SPIDER text-to-SQL dataset (Yu et al., 2018c), where TABERT is fine-tuned together with a task-specific parser using parallel NL utterances and labeled DB queries (§ 4.1); and (2) a challenging weakly-supervised learning benchmark WIKITABLEQUESTIONS (Pasupat and Liang, 2015), where a system has to infer latent DB queries from its execution results (§ 4.2). We demonstrate TABERT is effective in both scenarios, showing that it is a drop-in replacement of a parser’s original encoder for computing contextual representations of NL utterances and DB tables.

Specifically, systems augmented with TABERT outperform their counterparts using BERT, registering state-of-the-art performance on WIKITABLEQUESTIONS, while performing competitively on SPIDER (§ 5).

2 Background

Semantic Parsing over Tables  Semantic parsing tackles the task of translating an NL utterance u into a formal meaning representation (MR) z. Specifically, we focus on parsing utterances to access database tables, where z is a structured query (e.g., an SQL query) executable on a set of relational DB tables T = {T_t}. A relational table T is a listing of N rows {R_i}_{i=1}^N of data, with each row R_i consisting of M cells {s_〈i,j〉}_{j=1}^M, one for each column c_j. Each cell s_〈i,j〉 contains a list of tokens.

Depending on the underlying data representation schema used by the DB, a table could either be fully structured with strongly-typed and normalized contents (e.g., a table column named distance has a unit of kilometers, with all of its cell values, like 200, bearing the same unit), as is commonly the case for SQL-based DBs (§ 4.1). Alternatively, it could be semi-structured with unnormalized, textual cell values (e.g., 200 km, § 4.2). The query language could also take a variety of forms, from general-purpose DB access languages like SQL to domain-specific ones tailored to a particular task.
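To make the data model concrete, here is a minimal sketch of how a relational table and its cells could be represented in code. The class and field names are our own illustration, not part of any released TABERT code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Column:
    name: str    # e.g., "Year"
    dtype: str   # "text" or "real"

@dataclass
class Table:
    columns: List[Column]        # the schema: M columns c_1, ..., c_M
    rows: List[List[List[str]]]  # N rows; each row has M cells, each cell a list of tokens

# A tiny semi-structured table with unnormalized, textual cell values:
table = Table(
    columns=[Column("Year", "real"), Column("Venue", "text")],
    rows=[[["2005"], ["Erfurt"]], [["2007"], ["Bangkok"]]],
)
```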

Given an utterance and its associated tables, a neural semantic parser generates a DB query from the vector representations of the utterance tokens and the structured schema of tables. In this paper we refer to the schema as the set of columns in a table, and to its representation as the list of vectors that represent its columns.2 We will introduce how TABERT computes these representations in § 3.1.

Masked Language Models  Given a sequence of NL tokens x = x_1, x_2, . . . , x_n, a masked language model (e.g., BERT) is an LM trained using the masked language modeling objective, which aims to recover the original tokens in x from a “corrupted” context created by randomly masking out certain tokens in x. Specifically, let x_m = {x_{i_1}, . . . , x_{i_m}} be the subset of tokens in x selected to be masked out, and x̃ denote the masked sequence with tokens in x_m replaced by a [MASK] symbol. A masked LM defines a distribution p_θ(x_m | x̃) over the target tokens x_m given the masked context x̃.

2 Column representations for more complex schemas, e.g., those capturing inter-table dependency via primary and foreign keys, could be derived from these table-wise representations.

Figure 1: Overview of TABERT for learning representations of utterances and table schemas, with an example from WIKITABLEQUESTIONS.3 (A) A content snapshot of the table is created based on the input NL utterance. (B) Each row in the snapshot is encoded by a Transformer (only R2 is shown), producing row-wise encodings for utterance tokens and cells. (C) All row-wise encodings are aligned and processed by V vertical self-attention layers, generating utterance and column representations.


BERT parameterizes p_θ(x_m | x̃) using a Transformer model. During the pretraining phase, BERT maximizes p_θ(x_m | x̃) on large-scale textual corpora. In the fine-tuning phase, the pretrained model is used as an encoder to compute representations of input NL tokens, and its parameters are jointly tuned with other task-specific neural components.
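For reference, the masked LM setup described above can be sketched in a few lines of PyTorch. This is a simplified illustration: the model, mask token id, and 15% masking rate are stand-ins for BERT's actual components and hyper-parameters.

```python
import torch
import torch.nn.functional as F

def mlm_loss(token_ids, model, mask_id, mask_prob=0.15):
    """Corrupt a token sequence and score the model on recovering the masked tokens."""
    is_masked = torch.rand(token_ids.shape) < mask_prob    # choose x_m at random
    corrupted = token_ids.masked_fill(is_masked, mask_id)  # build the masked context
    logits = model(corrupted)                              # (seq_len, vocab_size)
    # cross-entropy only on masked positions, i.e. -log p_theta(x_m | masked context)
    return F.cross_entropy(logits[is_masked], token_ids[is_masked])
```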

3 TABERT: Learning Joint Representations over Textual and Tabular Data

We first present how TABERT computes representations for NL utterances and table schemas (§ 3.1), and then describe the pretraining procedure (§ 3.2).

3.1 Computing Representations for NL Utterances and Table Schemas

Fig. 1 presents a schematic overview of TABERT. Given an utterance u and a table T, TABERT first creates a content snapshot of T. This snapshot consists of sampled rows that summarize the information in T most relevant to the input utterance. The model then linearizes each row in the snapshot, concatenates each linearized row with the utterance, and uses the concatenated string as input to a Transformer (e.g., BERT) model, which outputs row-wise encoding vectors of utterance tokens and cells. The encodings for all the rows in the snapshot are fed into a series of vertical self-attention layers, where a cell representation (or an utterance token representation) is computed by attending to vertically-aligned vectors of the same column (or the same NL token). Finally, representations for each utterance token and column are generated from a pooling layer.

3 Example adapted from stanford.io/38iZ8Pf


Content Snapshot  One major feature of TABERT is its use of the table contents, as opposed to just using the column names, in encoding the table schema. This is motivated by the fact that contents provide more detail about the semantics of a column than just the column’s name, which might be ambiguous. For instance, the Venue column in Fig. 1, which is used to answer the example question, actually refers to host cities, and encoding the sampled cell values while creating its representation may help match the term “city” in the input utterance to this column.

However, a DB table could potentially have a large number of rows, with only a few of them actually relevant to answering the input utterance. Encoding all of the contents using a resource-heavy Transformer is both computationally intractable and likely not necessary. Thus, we instead use a content snapshot consisting of only a few rows that are most relevant to the input utterance, providing an efficient approach to calculate content-sensitive column representations from cell values.

We use a simple strategy to create content snapshots of K rows based on the relevance between the utterance and a row. For K > 1, we select the top-K rows in the input table that have the highest n-gram overlap ratio with the utterance.4 For K = 1, to include in the snapshot as much information relevant to the utterance as possible, we create a synthetic row by selecting the cell values from each column that have the highest n-gram overlap with the utterance. Using synthetic rows in this restricted setting is motivated by the fact that cell values most relevant to answering the utterance could come from multiple rows. As an example, consider the utterance “How many more participants were there in 2008 than in the London Olympics?” and an associated table with columns Year, Host City and Number of Participants: the most relevant cells to the utterance, 2008 (from Year) and London (from Host City), are from different rows, and could be included in a single synthetic row. In our initial experiments we found synthetic rows also help stabilize learning.
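A minimal sketch of this snapshot heuristic is shown below, assuming whitespace-tokenized cells and a Table object with `columns` and `rows` fields as in the earlier sketch; the exact n-gram scoring and tie-breaking in the released code may differ.

```python
def ngrams(tokens, n_max=3):
    """All n-grams with n <= n_max, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)}

def overlap(cell_tokens, utterance_ngrams):
    return len(ngrams(cell_tokens) & utterance_ngrams)

def content_snapshot(table, utterance_tokens, k):
    utt = ngrams(utterance_tokens)
    if k > 1:
        # K > 1: keep the top-K rows with the highest overlap with the utterance
        ranked = sorted(table.rows,
                        key=lambda row: sum(overlap(cell, utt) for cell in row),
                        reverse=True)
        return ranked[:k]
    # K = 1: one synthetic row built from the best-matching cell of each column
    synthetic = [max((row[j] for row in table.rows),
                     key=lambda cell: overlap(cell, utt))
                 for j in range(len(table.columns))]
    return [synthetic]
```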

Row Linearization  TABERT creates a linearized sequence for each row in the content snapshot as input to the Transformer model. Fig. 1(B) depicts the linearization for R2, which consists of a concatenation of the utterance, columns, and their cell values. Specifically, each cell is represented by the name and data type5 of the column, together with its actual value, separated by a vertical bar. As an example, the cell s_〈2,1〉 valued 2005 in R2 in Fig. 1 is encoded as

$$\underbrace{\text{Year}}_{\text{Column Name}} \;|\; \underbrace{\text{real}}_{\text{Column Type}} \;|\; \underbrace{\text{2005}}_{\text{Cell Value}} \tag{1}$$

The linearization of a row is then formed by concatenating the above string encodings of all the cells, separated by the [SEP] symbol. We then prefix the row linearization with utterance tokens as the input sequence to the Transformer.
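Concretely, a row linearization along these lines could look as follows; how [CLS] and [SEP] interact with BERT's own input conventions is simplified here, and the helper is ours rather than the released implementation.

```python
def linearize_row(utterance, columns, row_cells):
    """Build "[CLS] utterance [SEP] name | type | value [SEP] ..." for one row."""
    pieces = ["[CLS]", utterance, "[SEP]"]
    for col, cell_tokens in zip(columns, row_cells):
        pieces.append(f"{col.name} | {col.dtype} | {' '.join(cell_tokens)}")
        pieces.append("[SEP]")
    return " ".join(pieces)

# linearize_row("In which city did ...", table.columns, table.rows[0])
# -> "[CLS] In which city did ... [SEP] Year | real | 2005 [SEP] Venue | text | Erfurt [SEP]"
```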

Existing works have applied different linearization strategies to encode tables with Transformers (Hwang et al., 2019; Chen et al., 2019), while our row linearization approach is specifically designed for encoding content snapshots. We present in § 5 results with different linearization choices.

4 We use n ≤ 3 in our experiments. Empirically, this simple matching heuristic is able to correctly identify the best-matched rows for 40 out of 50 sampled examples on WIKITABLEQUESTIONS.

5 We use two data types, text and real (for numbers), predicted by majority voting over the NER labels of cell tokens.

Vertical Self-Attention Mechanism  The base Transformer model in TABERT outputs vector encodings of utterance and cell tokens for each row. These row-level vectors are computed separately and therefore independent of each other. To allow for information flow across cell representations of different rows, we propose vertical self-attention, a self-attention mechanism that operates over vertically aligned vectors from different rows.

As in Fig. 1(C), TABERT has V stacked vertical-level self-attention layers. To generate aligned inputs for vertical attention, we first compute a fixed-length initial vector for each cell at position 〈i, j〉, which is given by mean-pooling over the sequence of the Transformer’s output vectors that correspond to its variable-length linearization as in Eq. (1). Next, the sequence of word vectors for the NL utterance (from the base Transformer model) are concatenated with the cell vectors as initial inputs to the vertical attention layer.

Each vertical attention layer has the same parameterization as the Transformer layer in Vaswani et al. (2017), but operates on vertically aligned elements, i.e., utterance and cell vectors that correspond to the same question token and column, respectively. This vertical self-attention mechanism enables the model to aggregate information from different rows in the content snapshot, allowing TABERT to capture cross-row dependencies on cell values.

Utterance and Column Representations  A representation c_j is computed for each column c_j by mean-pooling over its vertically aligned cell vectors, {s_〈i,j〉 : R_i in content snapshot}, from the last vertical layer. A representation for each utterance token, x_j, is computed similarly over the vertically aligned token vectors. These representations will be used by downstream neural semantic parsers. TABERT also outputs an optional fixed-length table representation T using the representation of the prefixed [CLS] symbol, which is useful for parsers that operate on multiple DB tables.
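The path from row-wise Transformer outputs to the final utterance and column vectors can be summarized with the PyTorch-style sketch below. It assumes the cell vectors have already been mean-pooled from their linearized token spans; the module choices and tensor shapes are illustrative rather than a faithful re-implementation of TABERT's vertical attention.

```python
import torch
import torch.nn as nn

class VerticalEncoder(nn.Module):
    def __init__(self, hidden=768, n_layers=3, n_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads, batch_first=True)
        self.vertical_layers = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, row_encodings):
        # row_encodings: (K rows, n_utterance_tokens + n_columns, hidden),
        # one aligned slice per row in the content snapshot.
        aligned = row_encodings.permute(1, 0, 2)   # (positions, K, hidden)
        # self-attention runs across the K rows of each vertically aligned position
        aligned = self.vertical_layers(aligned)
        # final pooling: mean over rows gives one vector per utterance token / column
        return aligned.mean(dim=1)                 # (positions, hidden)
```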

3.2 Pretraining Procedure

Training Data  Since there is no large-scale, high-quality parallel corpus of NL text and structured tables, we instead use semi-structured tables that commonly exist on the Web as a surrogate data source. As a first step in this line, we focus on collecting parallel data in English, while extending to multilingual scenarios would be an interesting avenue for future work. Specifically, we collect tables and their surrounding NL text from English Wikipedia and the WDC WebTable Corpus (Lehmberg et al., 2016), a large-scale table collection from CommonCrawl. The raw data is extremely noisy, and we apply aggressive cleaning heuristics to filter out invalid examples (e.g., examples with HTML snippets or in foreign languages, and non-relational tables without headers). See Appendix § A.1 for details of data pre-processing. The pre-processed corpus contains 26.6 million parallel examples of tables and NL sentences. We perform sub-tokenization using the Wordpiece tokenizer shipped with BERT.

Unsupervised Learning Objectives  We apply different objectives for learning representations of the NL context and structured tables. For NL contexts, we use the standard Masked Language Modeling (MLM) objective (Devlin et al., 2019), with a masking rate of 15% of sub-tokens in an NL context.

For learning column representations, we design two objectives motivated by the intuition that a column representation should contain both the general information of the column (e.g., its name and data type) and representative cell values relevant to the NL context. First, a Masked Column Prediction (MCP) objective encourages the model to recover the names and data types of masked columns. Specifically, we randomly select 20% of the columns in an input table, masking their names and data types in each row linearization (e.g., if the column Year in Fig. 1 is selected, the tokens Year and real in Eq. (1) will be masked). Given the column representation c_j, TABERT is trained to predict the bag of masked (name and type) tokens from c_j using a multi-label classification objective. Intuitively, MCP encourages the model to recover column information from its contexts.

Next, we use an auxiliary Cell Value Recovery (CVR) objective to ensure that information of representative cell values in content snapshots is retained after additional layers of vertical self-attention. Specifically, for each masked column c_j in the above MCP objective, CVR predicts the original tokens of each cell s_〈i,j〉 (of c_j) in the content snapshot conditioned on its cell vector s_〈i,j〉.6 For instance, for the example cell s_〈2,1〉 in Eq. (1), we predict its value 2005 from s_〈2,1〉. Since a cell could have multiple value tokens, we apply the span-based prediction objective (Joshi et al., 2019). Specifically, to predict a cell token s_〈i,j〉,k ∈ s_〈i,j〉, its positional embedding e_k and the cell representation s_〈i,j〉 are fed into a two-layer network f(·) with GeLU activations (Hendrycks and Gimpel, 2016). The output of f(·) is then used to predict the original value token s_〈i,j〉,k from a softmax layer.

6 The cell value tokens are not masked in the input sequence, since predicting masked cell values is challenging even with the presence of its surrounding context.

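A minimal sketch of the two column-level losses described above is given below; the head architectures, vocabulary size, and tensor shapes are illustrative stand-ins, not the released training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColumnPretrainingHeads(nn.Module):
    def __init__(self, hidden=768, vocab_size=30522):
        super().__init__()
        self.mcp_head = nn.Linear(hidden, vocab_size)   # bag of masked name/type tokens
        self.cvr_net = nn.Sequential(                   # two-layer f(.) with GeLU activations
            nn.Linear(2 * hidden, hidden), nn.GELU(), nn.Linear(hidden, vocab_size))

    def mcp_loss(self, column_vec, masked_token_multihot):
        # multi-label classification: recover the masked column name and type tokens
        return F.binary_cross_entropy_with_logits(
            self.mcp_head(column_vec), masked_token_multihot)

    def cvr_loss(self, cell_vec, position_emb, target_token_id):
        # span-based prediction of one cell value token from its cell vector and position
        logits = self.cvr_net(torch.cat([cell_vec, position_emb], dim=-1))
        return F.cross_entropy(logits.unsqueeze(0), target_token_id.view(1))
```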

4 Example Application: Semantic Parsing over Tables

We apply TABERT for representation learning on two semantic parsing paradigms, a classical supervised text-to-SQL task over structured DBs (§ 4.1), and a weakly supervised parsing problem on semi-structured Web tables (§ 4.2).

4.1 Supervised Semantic Parsing

Benchmark Dataset  Supervised learning is the typical scenario of learning a parser using parallel data of utterances and queries. We use SPIDER (Yu et al., 2018c), a text-to-SQL dataset with 10,181 examples across 200 DBs. Each example consists of an utterance (e.g., “What is the total number of languages used in Aruba?”), a DB with one or more tables, and an annotated SQL query, which typically involves joining multiple tables to get the answer (e.g., SELECT COUNT(*) FROM Country JOIN Lang ON Country.Code = Lang.CountryCode WHERE Name = ‘Aruba’).

Base Semantic Parser  We aim to show TABERT could help improve upon an already strong parser. Unfortunately, at the time of writing, none of the top systems on SPIDER were publicly available. To establish a reasonable testbed, we developed our in-house system based on TranX (Yin and Neubig, 2018), an open-source general-purpose semantic parser. TranX translates an NL utterance into an intermediate meaning representation guided by a user-defined grammar. The generated intermediate MR could then be deterministically converted to domain-specific query languages (e.g., SQL).

We use TABERT as the encoder of utterances and table schemas. Specifically, for a given utterance u and a DB with a set of tables T = {T_t}, we first pair u with each table T_t in T as inputs to TABERT, which generates |T| sets of table-specific representations of utterances and columns. At each time step, an LSTM decoder performs hierarchical attention (Libovicky and Helcl, 2017) over the list of table-specific representations, constructing an MR based on the predefined grammar. Following the IRNet model (Guo et al., 2019), which achieved the best performance on SPIDER at the time of writing, we use SemQL, a simplified version of SQL, as the underlying grammar. We refer interested readers to Appendix § B.1 for details of our system.

4.2 Weakly Supervised Semantic Parsing

Benchmark Dataset  Weakly supervised semantic parsing considers the reinforcement learning task of inferring the correct query from its execution results (i.e., whether the answer is correct). Compared to supervised learning, weakly supervised parsing is significantly more challenging, as the parser does not have access to the labeled query, and has to explore the exponentially large search space of possible queries guided by the noisy binary reward signal of execution results.

WIKITABLEQUESTIONS (Pasupat and Liang, 2015) is a popular dataset for weakly supervised semantic parsing, which has 22,033 utterances and 2,108 semi-structured Web tables from Wikipedia.7 Compared to SPIDER, examples in this dataset do not involve joining multiple tables, but typically require compositional, multi-hop reasoning over a series of entries in the given table (e.g., to answer the example in Fig. 1 the parser needs to reason over the row set {R2, R3, R5}, locating the Venue field with the largest value of Year).

Base Semantic Parser  MAPO (Liang et al., 2018) is a strong system for weakly supervised semantic parsing. It improves the sample efficiency of the REINFORCE algorithm by biasing the exploration of queries towards the high-rewarding ones already discovered by the model. MAPO uses a domain-specific query language tailored to answering compositional questions on single tables, and its utterance and column representations are derived from an LSTM encoder, which we replaced with our TABERT model. See Appendix § B.2 for details of MAPO and our adaptation.

5 Experiments

In this section we evaluate TABERT on downstream tasks of semantic parsing to DB tables.

7 While some of the 421 testing Wikipedia tables might be included in our pretraining corpora, they only account for a very tiny fraction. In our pilot study, we also found pretraining only on Wikipedia tables resulted in worse performance.

Pretraining Configuration  We train two variants of the model, TABERTBase and TABERTLarge, with the underlying Transformer model initialized with the uncased versions of BERTBase and BERTLarge, respectively.8 During pretraining, for each table and its associated NL context in the corpus, we create a series of training instances of paired NL sentences (as synthetically generated utterances) and tables (as content snapshots) by (1) sliding a (non-overlapping) context window of sentences with a maximum length of 128 tokens, and (2) using the NL tokens in the window as the utterance, and pairing it with randomly sampled rows from the table as content snapshots. TABERT is implemented in PyTorch using distributed training. Refer to Appendix § A.2 for details of pretraining.
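The instance-creation step can be pictured with the short sketch below: slide a non-overlapping window over the sub-tokenized NL context and pair each window with randomly sampled table rows. Window handling, row sampling, and length limits are simplified relative to the actual pipeline.

```python
import random

def make_pretraining_instances(context_tokens, table, k_rows=3, max_ctx_len=128):
    """Yield (synthetic utterance, content snapshot) pairs for one table and its NL context."""
    for start in range(0, len(context_tokens), max_ctx_len):  # non-overlapping windows
        utterance = context_tokens[start:start + max_ctx_len]
        snapshot = random.sample(table.rows, min(k_rows, len(table.rows)))
        yield utterance, snapshot
```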

Comparing Models  We mainly present results for two variants of TABERT by varying the size of content snapshots K. TABERT(K = 3) uses three rows from input tables as content snapshots and three vertical self-attention layers. TABERT(K = 1) uses one synthetically generated row as the content snapshot as described in § 3.1. Since this model does not have multi-row input, we do not use additional vertical attention layers (and the cell value recovery learning objective). Its column representation c_j is defined by mean-pooling over the Transformer’s output encodings that correspond to the column name (e.g., the representation for the Year column in Fig. 1 is derived from the vector of the Year token in Eq. (1)). We find this strategy gives better results compared with using the cell representation s_j as c_j. We also compare with BERT using the same row linearization and content snapshot approach as TABERT(K = 1), which reduces to a TABERT(K = 1) model without pretraining on tabular corpora.

Evaluation Metrics  As standard, we report execution accuracy on WIKITABLEQUESTIONS and exact-match accuracy of DB queries on SPIDER.

5.1 Main Results

Tab. 1 and Tab. 2 summarize the end-to-end evaluation results on WIKITABLEQUESTIONS and SPIDER, respectively. First, comparing with existing strong semantic parsing systems, we found our parsers with TABERT as the utterance and table encoder perform competitively.

8 We also attempted to train TABERT on our collected corpus from scratch without initialization from BERT, but with inferior results, potentially due to the average lower quality of web-scraped tables compared to purely textual corpora. We leave improving the quality of training data as future work.


Previous Systems on WikiTableQuestions
Model                         DEV     TEST
Pasupat and Liang (2015)      37.0    37.1
Neelakantan et al. (2016)     34.1    34.2
  Ensemble 15 Models          37.5    37.7
Zhang et al. (2017)           40.6    43.7
Dasigi et al. (2019)          43.1    44.3
Agarwal et al. (2019)         43.2    44.1
  Ensemble 10 Models          –       46.9
Wang et al. (2019b)           43.7    44.5

Our System based on MAPO (Liang et al., 2018)
Model                      DEV          Best    TEST         Best
Base Parser†               42.3 ±0.3    42.7    43.1 ±0.5    43.8
w/ BERTBase (K = 1)        49.6 ±0.5    50.4    49.4 ±0.5    49.2
  − content snapshot       49.1 ±0.6    50.0    48.8 ±0.9    50.2
w/ TABERTBase (K = 1)      51.2 ±0.5    51.6    50.4 ±0.5    51.2
  − content snapshot       49.9 ±0.4    50.3    49.4 ±0.4    50.0
w/ TABERTBase (K = 3)      51.6 ±0.5    52.4    51.4 ±0.3    51.3
w/ BERTLarge (K = 1)       50.3 ±0.4    50.8    49.6 ±0.5    50.1
w/ TABERTLarge (K = 1)     51.6 ±1.1    52.7    51.2 ±0.9    51.5
w/ TABERTLarge (K = 3)     52.2 ±0.7    53.0    51.8 ±0.6    52.3

Table 1: Execution accuracies on WIKITABLEQUESTIONS. †Results from Liang et al. (2018). (TA)BERT models are evaluated with 10 random runs. We report mean, standard deviation, and the best results. TEST→Best refers to the result from the run with the best performance on the DEV set.

On the test set of WIKITABLEQUESTIONS, MAPO augmented with a TABERTLarge model with three-row content snapshots, TABERTLarge(K = 3), registers a single-model exact-match accuracy of 52.3%, surpassing the previously best ensemble system (46.9%) from Agarwal et al. (2019) by 5.4% absolute.

On SPIDER, our semantic parser based on TranX and SemQL (§ 4.1) is conceptually similar to the base version of IRNet, as both systems use the SemQL grammar, while our system has a simpler decoder. Interestingly, we observe that its performance with BERTBase (61.8%) matches the full BERT-augmented IRNet model with a stronger decoder using augmented memory and coarse-to-fine decoding (61.9%). This confirms that our base parser is an effective baseline. Augmented with representations produced by TABERTLarge(K = 3), our parser achieves up to 65.2% exact-match accuracy, a 2.8% increase over the base model using BERTBase. Note that while other competitive systems on the leaderboard use BERT with more sophisticated semantic parsing models, our best DEV. result is already close to the score registered by the best submission (RyanSQL+BERT). This suggests that if they instead used TABERT as the representation layer, they would see further gains.


Top-ranked Systems on Spider Leaderboard
Model                                     DEV. ACC.
Global–GNN (Bogin et al., 2019a)          52.7
EditSQL + BERT (Zhang et al., 2019a)      57.6
RatSQL (Wang et al., 2019a)               60.9
IRNet + BERT (Guo et al., 2019)           60.3
  + Memory + Coarse-to-Fine               61.9
IRNet V2 + BERT                           63.9
RyanSQL + BERT (Choi et al., 2020)        66.6

Our System based on TranX (Yin and Neubig, 2018)
Model                      Mean         Best
w/ BERTBase (K = 1)        61.8 ±0.8    62.4
  − content snapshot       59.6 ±0.7    60.3
w/ TABERTBase (K = 1)      63.3 ±0.6    64.2
  − content snapshot       60.4 ±1.3    61.8
w/ TABERTBase (K = 3)      63.3 ±0.7    64.1
w/ BERTLarge (K = 1)       61.3 ±1.2    62.9
w/ TABERTLarge (K = 1)     64.0 ±0.4    64.4
w/ TABERTLarge (K = 3)     64.5 ±0.6    65.2

Table 2: Exact match accuracies on the public development set of SPIDER. Models are evaluated with 5 random runs.

Comparing semantic parsers augmented with TABERT and BERT, we found TABERT is more effective across the board. We hypothesize that the performance improvements can be attributed to two factors. First, pre-training on large parallel textual and tabular corpora helps TABERT learn to encode structure-rich tabular inputs in their linearized form (Eq. (1)), whose format is different from the ordinary natural language data that BERT is trained on. Second, pre-training on parallel data could also help the model produce representations that better capture the alignment between an utterance and the relevant information presented in the structured schema, which is important for semantic parsing.

Overall, the results on the two benchmarks demonstrate that pretraining on aligned textual and tabular data is necessary for joint understanding of NL utterances and tables, and that TABERT works well with both structured (SPIDER) and semi-structured (WIKITABLEQUESTIONS) DBs, while remaining agnostic to the task-specific structures of semantic parsers.

Effect of Content Snapshots  In this paper we propose using content snapshots to capture the information in input DB tables that is most relevant to answering the NL utterance. We therefore study the effectiveness of including content snapshots when generating schema representations. We include in Tab. 1 and Tab. 2 results of models without using content in row linearization (“−content snapshot”).


u: How many years before was the film Bacchae out before the Watermelon?

Input to TABERTLarge (K = 3): Content Snapshot with Three Rows
Film                Year    Function            Notes
The Bacchae         2002    Producer            Screen adaptation of...
The Trojan Women    2004    Producer/Actress    Documentary film...
The Watermelon      2008    Producer            Oddball romantic comedy...

Input to TABERTLarge (K = 1): Content Snapshot with One Synthetic Row
Film                Year    Function            Notes
The Watermelon      2013    Producer            Screen adaptation of...

Table 3: Content snapshots generated by two models for a WIKITABLEQUESTIONS DEV. example. Matched tokens between the question and content snapshots are underlined.

Under this setting a column is represented as “Column Name | Type” without cell values (c.f., Eq. (1)). We find that content snapshots are helpful for both BERT and TABERT, especially for TABERT. As discussed in § 3.1, encoding sampled values from columns in learning their representations helps the model infer alignments between entity and relational phrases in the utterance and the corresponding column. This is particularly helpful for identifying relevant columns from a DB table that are mentioned in the input utterance. As an example, we empirically observe that on SPIDER our semantic parser with TABERTBase using just one row of content snapshots (K = 1) registers a higher accuracy of selecting the correct columns when generating SQL queries (e.g., columns in SELECT and WHERE clauses), compared to the TABERTBase model without encoding content information (87.4% vs. 86.4%).

Additionally, comparing TABERT using one synthetic row (K = 1) and three rows from input tables (K = 3) as content snapshots, the latter generally performs better. Intuitively, encoding more table contents relevant to the input utterance could potentially help answer questions that involve reasoning over information across multiple rows in the table. Tab. 3 shows such an example: to answer this question, a parser needs to subtract the values of Year in the rows for “The Watermelon” and “The Bacchae”. TABERTLarge(K = 3) is able to capture the two target rows in its content snapshot and generates the correct DB query, while the TABERTLarge(K = 1) model with only one row as its content snapshot fails to answer this example.

Effect of Row Linearization  TABERT uses row linearization to represent a table row as sequential input to the Transformer. Tab. 4 (upper half) presents results using various linearization methods. We find adding type information and content snapshots improves performance, as they provide more hints about the meaning of a column.

Cell Linearization Template                      WIKIQ.       SPIDER
Pretrained TABERTBase Models (K = 1)
Column Name                                      49.6 ±0.4    60.0 ±1.1
Column Name | Type† (−content snap.)             49.9 ±0.4    60.4 ±1.3
Column Name | Type | Cell Value†                 51.2 ±0.5    63.3 ±0.6
BERTBase Models
Column Name (Hwang et al., 2019)                 49.0 ±0.4    58.6 ±0.3
Column Name is Cell Value (Chen et al., 2019)    50.2 ±0.4    63.1 ±0.7

Table 4: Performance of pretrained TABERTBase models and BERTBase on the DEV. sets with different linearization methods. Slot names are underlined. †Results copied from Tab. 1 and Tab. 2.

Learning Objective    WIKIQ.       SPIDER
MCP only              51.6 ±0.7    62.6 ±0.7
MCP + CVR             51.6 ±0.5    63.3 ±0.7

Table 5: Performance of pretrained TABERTBase(K = 3) on DEV. sets with different pretraining objectives.

We also compare with existing linearization methods in the literature using a TABERTBase model, with results shown in Tab. 4 (lower half). Hwang et al. (2019) use BERT to encode concatenated column names to learn column representations. In line with our previous discussion on the effectiveness of content snapshots, this simple strategy without encoding cell contents underperforms (although with TABERTBase pretrained on our tabular corpus the results become slightly better). Additionally, we remark that linearizing table contents has also been applied to other BERT-based tabular reasoning tasks. For instance, Chen et al. (2019) propose a “natural” linearization approach for checking whether an NL statement entails the factual information listed in a table using a binary classifier with representations from BERT, where a table is linearized by concatenating the semicolon-separated cell linearizations of all rows. Each cell is represented by a phrase “column name is cell value”. For completeness, we also tested this cell linearization approach, and find BERTBase achieves improved results. We leave pretraining TABERT with this linearization strategy as promising future work.

Impact of Pretraining Objectives  TABERT uses two objectives (§ 3.2), a masked column prediction (MCP) and a cell value recovery (CVR) objective, to learn column representations that could capture both the general information of the column (via MCP) and its representative cell values related to the utterance (via CVR). Tab. 5 shows ablation results of pretraining TABERT with different objectives. We find TABERT trained with both MCP and the auxiliary CVR objective gets a slight advantage, suggesting CVR could potentially lead to more representative column representations with additional cell information.

6 Related Works

Semantic Parsing over Tables  Tables are important media of world knowledge. Semantic parsers have been adapted to operate over structured DB tables (Wang et al., 2015; Xu et al., 2017; Dong and Lapata, 2018; Yu et al., 2018b; Shi et al., 2018; Wang et al., 2018), and open-domain, semi-structured Web tables (Pasupat and Liang, 2015; Sun et al., 2016; Neelakantan et al., 2016). To improve representations of utterances and tables for neural semantic parsing, existing systems have applied pretrained word embeddings (e.g., GloVe, as in Zhong et al. (2017); Yu et al. (2018a); Sun et al. (2018); Liang et al. (2018)), and BERT-family models for learning joint contextual representations of utterances and tables, but with domain-specific approaches to encode the structured information in tables (Hwang et al., 2019; He et al., 2019; Guo et al., 2019; Zhang et al., 2019a). TABERT advances this line of research by presenting a general-purpose, pretrained encoder over parallel corpora of Web tables and NL context. Another relevant direction is to augment representations of columns from an individual table with global information of its linked tables defined by the DB schema (Bogin et al., 2019a; Wang et al., 2019a). TABERT could also potentially improve performance of these systems with improved table-level representations.

Knowledge-enhanced Pretraining  Recent pretraining models have incorporated structured information from knowledge bases (KBs) or other structured semantic annotations into training contextual word representations, either by fusing vector representations of entities and relations on KBs into word representations of LMs (Peters et al., 2019; Zhang et al., 2019b,c), or by encouraging the LM to recover KB entities and relations from text (Sun et al., 2019; Liu et al., 2019a). TABERT is broadly relevant to this line in that it also exposes an LM to structured data (i.e., tables), while aiming to learn joint representations for both textual and structured tabular data.

7 Conclusion and Future Work

We present TABERT, a pretrained encoder for joint understanding of textual and tabular data. We show that semantic parsers using TABERT as a general-purpose feature representation layer achieved strong results on two benchmarks. This work also opens up several avenues for future work. First, we plan to evaluate TABERT on other related tasks involving joint reasoning over textual and tabular data (e.g., table retrieval and table-to-text generation). Second, following the discussions in § 5, we will explore other table linearization strategies with Transformers, improving the quality of pretraining corpora, as well as novel unsupervised objectives. Finally, to extend TABERT to cross-lingual settings with utterances in foreign languages and structured schemas defined in English, we plan to apply more advanced semantic similarity metrics for creating content snapshots.

References

Rishabh Agarwal, Chen Liang, Dale Schuurmans, and Mohammad Norouzi. 2019. Learning to generalize from sparse and underspecified rewards. In ICML.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP.
Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of ACL.
Ben Bogin, Matt Gardner, and Jonathan Berant. 2019a. Global reasoning over database structures for text-to-SQL parsing. ArXiv, abs/1908.11214.
Ben Bogin, Matthew Gardner, and Jonathan Berant. 2019b. Representing schema structure with graph neural networks for text-to-SQL parsing. In Proceedings of ACL.
Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. TabFact: A large-scale dataset for table-based fact verification. ArXiv, abs/1909.02164.
Donghyun Choi, Myeong Cheol Shin, EungGyun Kim, and Dong Ryeol Shin. 2020. RyanSQL: Recursively applying sketch-based slot fillings for complex text-to-SQL in cross-domain databases. ArXiv, abs/2004.03125.
Pradeep Dasigi, Matt Gardner, Shikhar Murty, Luke S. Zettlemoyer, and Eduard H. Hovy. 2019. Iterative search for weakly supervised semantic parsing. In Proceedings of NAACL-HLT.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.
Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of ACL.
Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. ArXiv, abs/1901.05287.
Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. In Proceedings of ACL.
Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-SQL: Reinforce schema representation with context. ArXiv, abs/1908.08113.
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). ArXiv, abs/1606.08415.
Wonseok Hwang, Jinyeung Yim, Seunghyun Park, and Minjoon Seo. 2019. A comprehensive exploration on WikiSQL with table-aware word contextualization. ArXiv, abs/1902.01069.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke S. Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. In Proceedings of EMNLP.
Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of EMNLP.
Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A large public corpus of web tables containing time and context metadata. In Proceedings of WWW.
Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, and Ni Lao. 2018. Memory augmented policy optimization for program synthesis and semantic parsing. In Proceedings of NIPS.
Jindrich Libovicky and Jindrich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of ACL.
Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2019a. K-BERT: Enabling language representation with knowledge graph. ArXiv, abs/1909.07606.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proceedings of NIPS.
Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of CoNLL.
Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. 2016. Neural programmer: Inducing latent programs with gradient descent. In Proceedings of ICLR.
Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of ACL.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.
Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of EMNLP.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.
Tianze Shi, Kedar Tatwawadi, Kaushik Chakrabarti, Yi Mao, Oleksandr Polozov, and Weizhu Chen. 2018. IncSQL: Training incremental text-to-SQL parsers with non-deterministic oracles. ArXiv, abs/1809.05054.
Huan Sun, Hao Ma, Xiaodong He, Wen-tau Yih, Yu Su, and Xifeng Yan. 2016. Table cell search for question answering. In Proceedings of WWW.
Yibo Sun, Duyu Tang, Nan Duan, Jianshu Ji, Guihong Cao, Xiaocheng Feng, Bing Qin, Ting Liu, and Ming Zhou. 2018. Semantic parsing with syntax- and table-aware SQL generation. In Proceedings of EMNLP.
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. ArXiv, abs/1904.09223.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.
Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019a. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. ArXiv, abs/1911.04942.
Bailin Wang, Ivan Titov, and Mirella Lapata. 2019b. Learning semantic parsers from denotations with latent structured alignments and abstract programs. In EMNLP/IJCNLP.
Chenglong Wang, Kedar Tatwawadi, Marc Brockschmidt, Po-Sen Huang, Yi Xin Mao, Oleksandr Polozov, and Rishabh Singh. 2018. Robust text-to-SQL generation with execution-guided decoding. ArXiv, abs/1807.03100.
Daniel C. Wang, Andrew W. Appel, Jeffrey L. Korn, and Christopher S. Serra. 1997. The Zephyr abstract syntax description language. In Proceedings of DSL.
Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of ACL.
Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. ArXiv, abs/1711.04436.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of NIPS.
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of ACL.
Pengcheng Yin and Graham Neubig. 2018. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. In Proceedings of EMNLP Demonstration Track.
Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir R. Radev. 2018a. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of NAACL-HLT.
Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir R. Radev. 2018b. SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. In Proceedings of EMNLP.
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. 2018c. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of EMNLP.
John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of AAAI.
Rui Zhang, Tao Yu, He Yang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir R. Radev. 2019a. Editing-based SQL query generation for cross-domain context-dependent questions. ArXiv, abs/1909.00786.
Yuchen Zhang, Panupong Pasupat, and Percy Liang. 2017. Macro grammars and holistic triggering for efficient semantic parsing. In Proceedings of EMNLP.
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019b. ERNIE: Enhanced language representation with informative entities. In Proceedings of ACL.
Zhuosheng Zhang, Yu-Wei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiaodong Zhou. 2019c. Semantics-aware BERT for language understanding. ArXiv, abs/1909.02209.
Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. ArXiv, abs/1709.00103.


Supplementary Materials

A Pretraining Details

A.1 Training Data

We collect parallel examples of tables and their surrounding NL sentences from two sources:

Wikipedia Tables  We extract all the tables on English Wikipedia9. For each table, we use the preceding three paragraphs as the NL context, as we observe that most Wiki tables are located after where they are described in the body text.

WDC WebTable Corpus (Lehmberg et al., 2016) is a large collection of Web tables extracted from the Common Crawl Web scrape10. We use its 2015 English-language relational subset, which consists of 50.8 million relational tables and their surrounding NL contexts.

Preprocessing  Our dataset is collected from arbitrary Web tables, which are extremely noisy. We develop a set of heuristics to clean the data by: (1) removing columns whose names have more than 10 tokens; (2) filtering cells with more than two non-ASCII characters or 20 tokens; (3) removing empty or repetitive rows and columns; (4) filtering tables with less than three rows and four columns; (5) running spaCy to identify the data type of columns (text or real value) by majority voting over the NER labels of column tokens; and (6) rotating vertically oriented tables. We sub-tokenize the corpus using the Wordpiece tokenizer in Devlin et al. (2019). The pre-processing results in 1.3 million tables from Wikipedia and 25.3 million tables from the WDC corpus.
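For illustration, the filtering part of these heuristics could be written as a simple predicate over the Table sketch used earlier; spaCy-based type detection, removal of repetitive rows and columns, and table rotation are omitted, and cell-level filters are treated as whole-table rejections for brevity.

```python
def keep_table(table):
    """True if a scraped table passes the basic cleaning filters described above."""
    if any(len(col.name.split()) > 10 for col in table.columns):
        return False                          # (1) overly long column names
    for row in table.rows:
        for cell in row:
            text = " ".join(cell)
            if sum(ord(ch) > 127 for ch in text) > 2 or len(cell) > 20:
                return False                  # (2) too many non-ASCII characters or tokens
    if len(table.rows) < 3 or len(table.columns) < 4:
        return False                          # (4) table too small to be relational
    return True
```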

A.2 Pretraining Setup

As discussed in § 5, we create training instances of NL sentences (as synthetic utterances) and content snapshots from tables by sampling from the parallel corpus of NL contexts and tables. Each epoch contains 37.6M training instances. We train TABERT for 10 epochs. Tab. 6 lists the hyper-parameters used in training. Learning rates are validated on the development set of WIKITABLEQUESTIONS. We use a batch size of 512 for large models to reduce training time. The training objective is the sum of the three pretraining objectives in § 3.2:

9 We do not use infoboxes (tables on the top-right of a Wiki page that describe properties of the main topic), as they are not relational tables.

10 http://webdatacommons.org/webtables

the masked language modeling (MLM) objective for utterance tokens, the masked column prediction (MCP) objective for columns, and the cell value recovery (CVR) objective for their cell values. An exception is pretraining the TABERT(K = 1) models. Since there are no additional vertical attention layers, we do not use the CVR objective, and the MCP objective reduces to the vanilla MLM objective over encodings from the base Transformer model. Our largest model, TABERTLarge(K = 3), takes six days to train for 10 epochs on 128 Tesla V100 GPUs using mixed precision training.

B Semantic Parsers

B.1 Supervised Parsing on SPIDER

Model  We develop our text-to-SQL parser based on TranX (Yin and Neubig, 2018), which translates an NL utterance into a tree-structured abstract meaning representation following a user-specified grammar, before deterministically converting the generated abstract MR into an SQL query. TranX models the construction process of an abstract MR (a tree-structured representation of an SQL query) using a transition-based system, which decomposes its generation story into a sequence of actions following the user-defined grammar.

Formally, given an input NL utterance u and a database with a set of tables T = {T_i}, the probability of generating an SQL query z (i.e., its semantically equivalent MR) is decomposed as the product of action probabilities:

p(z | u, T) = \prod_t p(a_t | a_{<t}, u, T)    (2)

where a_t is the action applied to the hypothesis at time step t, and a_{<t} denotes the previous action history. We refer readers to Yin and Neubig (2018) for details of the transition system and how individual action probabilities are computed. In our adaptation of TranX to text-to-SQL parsing on SPIDER, we follow Guo et al. (2019) and use SemQL as the underlying grammar, which is a simplification of the SQL language. Fig. 2 lists the SemQL grammar specified using the abstract syntax description language (Wang et al., 1997). Intuitively, generation starts from a tree-structured derivation with the root production rule select_stmt ↦ SelectStatement, which lays out the overall structure of an SQL query. At each time step, the decoder locates the current opening node on the derivation tree, following a depth-first, left-to-right order.


Parameter              | TABERTBase(K = 1) | TABERTLarge(K = 1) | TABERTBase(K = 3) | TABERTLarge(K = 3)
Batch Size             | 256               | 512                | 512               | 512
Learning Rate          | 2 × 10^-5         | 2 × 10^-5          | 4 × 10^-5         | 4 × 10^-5
Max Epoch              | 10 (all settings)
Weight Decay           | 0.01 (all settings)
Gradient Norm Clipping | 1.0 (all settings)

Table 6: Hyper-parameters used in pretraining

If the opening node is not a leaf node, the decoder invokes an action a_t that expands it using a production rule of the appropriate type. If the current opening node is a leaf node (e.g., a node denoting a string literal), the decoder fills it in using actions that emit terminal values.
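Schematically, this decoding loop can be sketched as below; the Node class and the choose_action callback are illustrative stand-ins for TranX's hypothesis data structures and action scoring, not its actual implementation.

class Node:
    """A derivation-tree node; leaf-typed nodes hold terminal values (e.g., literals)."""
    def __init__(self, symbol, is_leaf=False):
        self.symbol = symbol
        self.is_leaf = is_leaf
        self.children = []
        self.value = None
        self.expanded = False

def frontier(node):
    """Return the first unexpanded ("opening") node in depth-first, left-to-right order."""
    if not node.expanded:
        return node
    for child in node.children:
        found = frontier(child)
        if found is not None:
            return found
    return None

def decode(root, choose_action):
    """Repeatedly expand the opening node until the derivation is complete."""
    node = frontier(root)
    while node is not None:
        if node.is_leaf:
            node.value = choose_action(node)      # emit a terminal value
        else:
            node.children = choose_action(node)   # apply a production rule -> child nodes
        node.expanded = True
        node = frontier(root)
    return root

A call such as decode(Node("select_stmt"), choose_action) would drive the expansion, with choose_action scoring candidate actions using the decoder LSTM.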

To use such a transition system to generate SQL queries, we extend its action space with two new types of actions: SELECTTABLE(T_i) for nodes of type table_ref in Fig. 2, which selects a table T_i (e.g., for predicting the target tables of a FROM clause), and SELECTCOLUMN(T_i, c_j) for nodes of type column_ref, which selects the column c_j from table T_i (e.g., for predicting a result column used in the SELECT clause).
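The two added action types can be pictured as simple records carried by the transition system; the class and field names below are our own illustration, not TranX's actual action classes.

from dataclasses import dataclass

@dataclass(frozen=True)
class SelectTable:
    table_id: int      # index of the chosen table T_i, e.g., for a FROM clause

@dataclass(frozen=True)
class SelectColumn:
    table_id: int      # table T_i the column belongs to
    column_id: int     # index of column c_j within T_i, e.g., for the SELECT clause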

As described in § 4.1, TABERT produces a list of entries, with one entry ⟨T_i, X_i, C_i⟩ for each table T_i:

M = { ⟨T_i, X_i = {x_1, x_2, ...}, C_i = {c_1, c_2, ...}⟩ }_{i=1}^{|T|}    (3)

where each entry ⟨T_i, X_i, C_i⟩ in M consists of T_i, the representation of table T_i given by the output vector of the prefixed [CLS] symbol; the table-specific representations of utterance tokens, X_i = {x_1, x_2, ...}; and the representations of the columns in T_i, C_i = {c_1, c_2, ...}. At each time step t, the decoder in TranX performs hierarchical attention over the representations in M to compute a context vector. First, a table-wise attention score is computed from the LSTM's previous state state_{t-1} and the set of table representations:

score(T_i) = Softmax(DotProduct(state_{t-1}, key(T_i)))    (4)

where the linear projection key(·) ∈ R^256 projects the table representations to the key space. Next, for each table T_i ∈ T, a table-wise context vector ctx(T_i) is generated by attending over the union of the utterance token representations X_i and the column representations C_i:

ctx(T_i) = DotProductAttention(state_{t-1}, key(X_i ∪ C_i), value(X_i ∪ C_i))    (5)

with the LSTM state as the query, key(·) as the key, and another linear transformation value(·) ∈ R^256 projecting the representations to value vectors. The final context vector is then given by the sum of the table-wise context vectors ctx(T_i) (i ∈ {1, . . . , |T|}), weighted by the attention scores score(T_i). This context vector is then used to update the decoder LSTM state to state_t.
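Putting Eqs. (4) and (5) together, the hierarchical attention can be sketched as follows. The per-table entry layout (one (T_i, X_i, C_i) triple of tensors), the module name, and the extra query projection used to keep dimensions consistent are our assumptions, not TaBERT or TranX internals.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttention(nn.Module):
    """Entries are (T_i: [H], X_i: [num_tokens, H], C_i: [num_cols, H]) tensor triples."""
    def __init__(self, hidden_size, key_size=256):
        super().__init__()
        self.key = nn.Linear(hidden_size, key_size)     # key(.)   in Eqs. (4)-(5)
        self.value = nn.Linear(hidden_size, key_size)   # value(.) in Eq. (5)
        self.query = nn.Linear(hidden_size, key_size)   # our addition: project the LSTM state to key space

    def forward(self, state, entries):
        q = self.query(state)                                           # [key_size]
        table_keys = self.key(torch.stack([T for T, _, _ in entries]))  # [|T|, key_size]
        scores = F.softmax(table_keys @ q, dim=0)                       # Eq. (4): score(T_i)
        ctxs = []
        for T, X, C in entries:
            mem = torch.cat([X, C], dim=0)                 # X_i ∪ C_i
            attn = F.softmax(self.key(mem) @ q, dim=0)     # Eq. (5): attention weights
            ctxs.append(attn @ self.value(mem))            # table-wise context ctx(T_i)
        # final context: table-wise contexts weighted by the table scores
        return (scores.unsqueeze(1) * torch.stack(ctxs)).sum(dim=0)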

The updated decoder state is then used to compute the probability of the action a_t taken at time step t. For a SELECTTABLE(T_i) action, the probability is defined analogously to Eq. (4). For a SELECTCOLUMN(T_i, c_j) action, it is factorized as the probability of selecting the table T_i (given by Eq. (4)) times the probability of selecting the column c_j; the latter is defined as

score(c_j) = Softmax(DotProduct(state_t, c_j))    (6)

We also add simple entity linking features to the representations in M, defined by the following heuristics: (1) if an utterance token x ∈ u matches the name of a table T, we concatenate a trainable embedding vector (table_match ∈ R^16) to the representations of x and T; (2) similarly, we concatenate an embedding vector (column_match ∈ R^16) to the representations of an utterance token and a column if their names match; (3) finally, we concatenate a zero vector (0 ∈ R^16) to the representations of all unmatched elements.
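These heuristics amount to appending a 16-dimensional indicator embedding to each representation. A minimal sketch of that step is shown below; it is our own formulation with a naive matching flag, not the released code.

import torch
import torch.nn as nn

class MatchFeatures(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.table_match = nn.Parameter(torch.randn(dim))    # trainable, heuristic (1)
        self.column_match = nn.Parameter(torch.randn(dim))   # trainable, heuristic (2)
        self.register_buffer("no_match", torch.zeros(dim))   # zero vector, heuristic (3)

    def tag(self, vec, matched_table=False, matched_column=False):
        # Append the appropriate 16-d feature to a single representation vector.
        if matched_table:
            feat = self.table_match
        elif matched_column:
            feat = self.column_match
        else:
            feat = self.no_match
        return torch.cat([vec, feat], dim=-1)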



select_stmt = SelectStatement(
    distinct distinct,                    # DISTINCT keyword
    expr* result_columns,                 # Columns in SELECT clause
    expr? where_clause,                   # WHERE clause
    order_by_clause? order_by_clause,     # ORDER BY clause
    int? limit_value,                     # LIMIT clause
    table_ref* join_with_tables,          # Tables in the JOIN clause
    compound_stmt? compound_statement     # Compound statements (e.g., UNION, EXCEPT)
)

distinct = None | Distinct

order_by_clause = OrderByClause(expr* expr_list, order order)

order = ASC | DESC

expr = AndExpr(expr* expr_list)
     | OrExpr(expr* expr_list)
     | NotExpr(expr expr)
     | CompareExpr(compare_op op, expr left_value, expr right_value)
     | AggregateExpr(aggregate_op op, expr value, distinct distinct)
     | BinaryExpr(binary_op op, expr left_value, expr right_value)
     | BetweenExpr(expr field, expr left_value, expr right_value)
     | InExpr(column_ref left_value, expr right_value)
     | LikeExpr(column_ref left_value, expr right_value)
     | AllRows(table_ref table_name)
     | select_stmt
     | Literal(string value)
     | ColumnReference(column_ref column_name)

aggregate_op = Sum | Max | Min | Count | Avg

compare_op = LessThan | LessThanEqual | GreaterThan
           | GreaterThanEqual | Equal | NotEqual

binary_op = Add | Sub | Divide | Multiply

compound_stmt = CompoundStatement(compound_op op, select_stmt query)

compound_op = Union | Intersect | Except

Figure 2: ASDL grammar of SemQL used in TranX

Configuration  We use the default configuration of TranX. For the TABERT parameters, we use an Adam optimizer with a learning rate of 3e−5 and a linearly decayed learning rate schedule, and another Adam optimizer with a constant learning rate of 1e−3 for all remaining parameters. During training, we update model parameters for 25,000 iterations and freeze the TABERT parameters for the first 1,000 update steps. We use a batch size of 30 and a beam size of 3. We use gradient accumulation for large models to fit a batch into GPU memory.
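This optimization recipe roughly corresponds to the setup sketched below; the parameter-group accessors and the exact shape of the linear decay are placeholders rather than the released training code, and gradient accumulation is omitted.

import torch

def build_optimizers(parser, total_steps=25_000):
    # parser.tabert_parameters() / parser.other_parameters() are hypothetical accessors
    # for the two parameter groups described above.
    bert_opt = torch.optim.Adam(parser.tabert_parameters(), lr=3e-5)
    rest_opt = torch.optim.Adam(parser.other_parameters(), lr=1e-3)
    # Linear decay for the TaBERT learning rate; the remaining parameters keep a constant rate.
    bert_sched = torch.optim.lr_scheduler.LambdaLR(
        bert_opt, lambda step: max(0.0, 1.0 - step / total_steps))
    return bert_opt, rest_opt, bert_sched

def train_step(step_idx, loss, bert_opt, rest_opt, bert_sched, frozen_steps=1_000):
    loss.backward()
    if step_idx >= frozen_steps:   # TaBERT parameters are not updated for the first 1,000 steps
        bert_opt.step()
        bert_sched.step()
    rest_opt.step()
    bert_opt.zero_grad()
    rest_opt.zero_grad()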

B.2 Weakly-supervised Parsing on WIKITABLEQUESTIONS

Model  We use MAPO (Liang et al., 2018), a strong weakly-supervised semantic parser. The original MAPO model comes with an LSTM encoder, which generates the utterance and column representations used by the decoder to predict table queries. We directly substitute this encoder with TABERT and project the utterance and table representations from TABERT to the original embedding space using a linear transformation.

MAPO uses a domain-specific query language tailored to answering compositional questions over a single table. For instance, the example question in Fig. 1 could be answered using the following query:

Table.contains(column=Position, value=1st)
    # Get rows whose ‘Position’ field contains ‘1st’
  .argmax(order_by=Year)
    # Get the row which has the largest ‘Year’ field
  .hop(column=Venue)
    # Select the value of ‘Venue’ in the result row

MAPO is written in TensorFlow. In our experiments we use an optimized re-implementation in PyTorch, which yields a 4× training speedup.

Configuration  We use the same optimizer and learning rate schedule as in § B.1. We use a batch size of 10 and train the model for 20,000 steps, with the TABERT parameters frozen for the first 5,000 steps. Other hyper-parameters are kept the same as in the original MAPO system.