
Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning

Subhojeet Pramanik*
University of Alberta, Canada
[email protected]

Shashank Mujumdar*
IBM Research, India
[email protected]

Hima Patel
IBM Research, India
[email protected]

* Authors contributed equally to the work.

Abstract

Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal representations across text, layout and image dimensions for documents and (ii) inability to process multi-page documents. Pre-training techniques have been shown in the Natural Language Processing (NLP) domain to learn generic textual representations from large unlabelled datasets, applicable to various downstream NLP tasks. In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation applicable to various downstream document tasks. Specifically, we introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks to learn rich image representations along with the text and layout representations for documents. We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion. We showcase the applicability of our pre-training framework on a variety of different real-world document tasks such as document classification, document information extraction, and document retrieval. We evaluate our framework on different standard document datasets and conduct exhaustive experiments to compare performance against various ablations of our framework and state-of-the-art baselines.

1 Introduction

1.1 Problem Motivation

In the era of digitization, most businesses are turning towards leveraging artificial intelligence (AI) techniques to exploit the information contained in business documents. Traditional information extraction (IE) approaches utilize Natural Language Processing (NLP) methods to process the information from documents expressed in the form of natural language text (Manevitz and Yousef, 2001). However, documents contain rich multi-modal information that includes both text and the document layout. The document layout organises the textual information into different formats such as sections, paragraphs, tables, multi-column etc., utilising different font-types/colors/positions/sizes/styles. Further, important visual cues are also indicated through figures/charts/logos etc. and the overall document page appearance. In general, information in a document spans over multiple pages, which gives rise to a variety of complex document layouts that can be observed in scientific articles, invoices, receipts, emails, contracts, blogs, etc. Analyzing and understanding these documents is a challenging endeavor and requires a multi-disciplinary perspective combining NLP, computer vision (CV), and knowledge representation to learn a generic document representation suitable for various downstream applications (DI, 2019).

1.2 Limitations of Prior-art Approaches

Recent approaches towards document analysis have explored frameworks that utilize information from document text, document layout and document image in different capacities (Majumder et al., 2020; Katti et al., 2018; Yang et al., 2017) for specific document tasks. In (Majumder et al., 2020), the authors have proposed joint training of document text and structure for the task of IE from form-like documents, while in (Yang et al., 2017), the authors combine text and image information for the task of semantic segmentation of documents. Their proposed frameworks optimize the network performance with respect to a specific downstream task, which makes them unsuitable for other tasks.



Figure 1: (a) Multi-modal information used from input documents. (b) Distinction of proposed framework with LayoutLM (Xu et al., 2020).

Figure 2: Applicability of our framework on multi-page documents for different downstream tasks - (a) Document Classification (b) Information Extraction (c) Document Retrieval.

To address this limitation, in (Xu et al., 2020) a pre-training technique is proposed based on the BERT transformer architecture (Devlin et al., 2018) to combine text and layout information from a large set of unlabelled documents. They showcase the applicability of their pre-trained network on different downstream tasks, further utilizing the image information during fine-tuning for each downstream task. As shown in Figure 1(b), pre-training the network in an end-to-end fashion allows for cross-modality interaction which facilitates learning shared representations across document text tokens and their 2D positions/layout. However, there are two major limitations to their approach - (i) the proposed pre-training tasks cannot utilize image information for learning the document representation and (ii) the framework only allows for single-page documents. As shown in Figure 1(a), apart from the document text tokens and their positions, the visual appearance of the individual tokens as well as the overall page serve as an important indicator for learning the document representation. In the real-world scenario, multi-page documents are common, with different pages potentially containing different information across text, layout, and image dimensions. Thus, for serving different document tasks, a unified pre-training framework that learns a generic document representation from all three modalities and works on multi-page documents is necessary.

1.3 Our Proposition

In this paper, we propose a generic document representation learning framework that takes as input the document text, layout, and image information, applicable to various document tasks as shown in Figure 1(a). Specifically, we encode the multi-modal document information as - (i) text and position embeddings similar to BERT (Devlin et al., 2018), (ii) text token 2D position embeddings to capture the layout, (iii) text token image embeddings to capture their appearance, and (iv) document page image and position embeddings to learn the document representation capable of handling multi-page documents. In order to handle large token sequences courtesy of multi-page documents, we utilize the Longformer model (Beltagy et al., 2020) as the backbone of our framework, which introduces an attention mechanism that scales linearly with the sequence length. We utilize the Masked Visual Language Modelling (MVLM) task and a document classification (CLF) task that enforces the joint pre-training of all the input embeddings (Xu et al., 2020). To facilitate the cross-modality interaction and ensure that the network learns from the visual information, we introduce two novel self-supervised pre-training tasks in our framework - (i) document topic modeling (DTM) and (ii) document shuffle prediction (DSP). We mine the latent topics from the document text and train our framework to predict the topic distribution using only the document page image embeddings for the task of DTM (Gomez et al., 2017). On the other hand, DSP involves shuffling the page image order while keeping the other embeddings intact for randomly sampled documents during training to identify if the document is tampered with. We employ a multi-task learning framework to simultaneously train multiple objectives of the different pre-training tasks to learn shared representations across the text, layout, and image modalities of the documents. Figure 2 signifies the applicability of our pre-trained embeddings for various downstream document tasks. We evaluate the performance of our proposed framework on various document tasks - (i) Information Extraction, (ii) Document Classification, (iii) Table Token Classification, and (iv) Document Retrieval. In summary, the main contributions of this work are:

• We introduce a multi-modal, multi-task learning based pre-training framework to learn a generic document representation.

• We introduce document topic modeling and document shuffle prediction as self-supervised pre-training tasks.

• Our proposed framework is able to process multi-page documents, which is a limitation of prior-art approaches.

• We conduct exhaustive experiments to compare performance against various ablations of our framework and SOTA baselines.

2 Approach

In this section, we describe the details of our proposed architecture.

2.1 The Proposed architecture

Common Transformer variants such as BERT (Devlin et al., 2018) & RoBERTa (Liu et al., 2019) are adept at processing text tokens and learning semantic representations from text sequences. However, they do not leverage the visual and layout information present in documents. LayoutLM (Xu et al., 2020) incorporates the layout information provided by the token-level 2D bounding boxes and adds these layout embeddings to the existing token-level embeddings in the BERT architecture during pre-training. We introduce two new embeddings in addition to the layout & token embeddings: (1) the visual information present in the corresponding bounding box for each token, and (2) the page-level information present in multi-page PDF documents (as shown in Figure 1). Additionally, we use the Longformer (Beltagy et al., 2020) network architecture for encoding the multi-modal inputs. BERT variants (Devlin et al., 2018) scale quadratically in terms of memory and CPU requirements; (Beltagy et al., 2020) introduce local windowed attention and a task-motivated global attention mechanism that scales linearly with sequence length.

Figure 3 showcases our proposed network architecture. For a multi-page document, we parse its constituent tokens using a standard document parser (Tesseract, 2021) and store the token text and the corresponding bounding boxes along with the page numbers and page images. Every document is encoded as a sequence of tokens t ∈ Z, 0 ≤ t < nv, page numbers p ∈ Z, 0 ≤ p < np, bounding box coordinates (x1, y1, x2, y2), and the image of the entire page corresponding to the given token; where nv is the vocab size, np is the maximum number of page embeddings; (x1, y1) are the coordinates of the upper left and (x2, y2) are the coordinates of the lower right corner of the bounding box, and h = y2 − y1, w = x2 − x1 capture the height and width of the bounding box respectively. We use four embedding layers to encode the layout information: X dimension (x1, x2), Y dimension (y1, y2), h and w. Embeddings from the same dimension share the embedding layers.
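To make the input encoding concrete, the following is a minimal sketch (with hypothetical module and parameter names, not the authors' released code) of how the token, layout, page, and token-image embeddings could be summed before entering the Longformer encoder:

```python
import torch.nn as nn

class MultiModalInputEmbedding(nn.Module):
    """Sketch: sums text, layout, page and token-image embeddings for the encoder."""

    def __init__(self, vocab_size, hidden_size, max_coord=1000, max_pages=5):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden_size)
        # Coordinates on the same axis share one embedding layer (x1/x2 and y1/y2).
        self.x_emb = nn.Embedding(max_coord, hidden_size)
        self.y_emb = nn.Embedding(max_coord, hidden_size)
        self.w_emb = nn.Embedding(max_coord, hidden_size)
        self.h_emb = nn.Embedding(max_coord, hidden_size)
        self.page_emb = nn.Embedding(max_pages, hidden_size)  # sinusoidally initialized in the paper

    def forward(self, tokens, bboxes, pages, token_image_emb):
        # tokens: (B, S); bboxes: (B, S, 4) integer (x1, y1, x2, y2); pages: (B, S)
        x1, y1, x2, y2 = bboxes.unbind(dim=-1)
        layout = (self.x_emb(x1) + self.x_emb(x2)
                  + self.y_emb(y1) + self.y_emb(y2)
                  + self.w_emb(x2 - x1) + self.h_emb(y2 - y1))
        return self.tok_emb(tokens) + layout + self.page_emb(pages) + token_image_emb
```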

Figure 3: Demonstration of the proposed architecture encoding two sample pages from a PDF document.

Novel to our approach, we use a ResNet-50 (He et al., 2016) architecture combined with a Feature Pyramid Network (FPN) (Lin et al., 2017) to generate multi-level image embeddings for the given page corresponding to each token. For an image of size (u, v), the Resnet+FPN layer produces feature maps of size (u′, v′). The bounding boxes (x1, y1, x2, y2), which are originally in the range x ∈ Z, 0 ≤ x ≤ u & y ∈ Z, 0 ≤ y ≤ v, are linearly scaled to match the feature map dimension (u′, v′) respectively. We select the final layer of the FPN network, which has the highest semantic representation. To generate the final image embedding corresponding to the region indicated by the bounding box coordinates, a Region of Interest (RoI) pooling operation (Dai et al., 2016) is performed on the page image feature map with an output size of 1 × 1 using the interpolated bounding box coordinates. Using RoI pooling allows us to efficiently select the embeddings for the regions indicated by all the bounding boxes for each token from the page feature map. For each token, we also pass the corresponding page number through an embedding layer initialized using sinusoidal embeddings, as described in (Vaswani et al., 2017). The embeddings for the images, layout, and pages are added to the existing text embeddings and passed to the Longformer encoder to generate sequence representations for the document.
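As an illustration, a minimal sketch of this step using torchvision's resnet_fpn_backbone and roi_pool as stand-ins for the authors' implementation (the function name and the older `pretrained` argument are assumptions):

```python
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.ops import roi_pool

# ResNet-50 + FPN backbone; 'pretrained' follows the older torchvision API.
backbone = resnet_fpn_backbone("resnet50", pretrained=True)

def token_image_embeddings(page_image, token_boxes):
    """page_image: (3, H, W) float tensor; token_boxes: (N, 4) float boxes in image coordinates."""
    feats = backbone(page_image.unsqueeze(0))       # dict of FPN levels: '0'..'3', 'pool'
    fmap = feats["3"]                               # last FPN level, highest semantic representation
    scale = fmap.shape[-1] / page_image.shape[-1]   # linear scaling from image to feature-map coords
    pooled = roi_pool(fmap, [token_boxes], output_size=1, spatial_scale=scale)
    return pooled.flatten(1)                        # (N, 256): one 1x1-pooled embedding per token box
```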

For the special tokens <CLS> and <SEP>, which are predominantly used in BERT variants for sequence inputs, we use (0, 0, u, v) as the bounding box, as it captures the image embedding for the entire page, thereby benefitting the downstream tasks that require the representation of the <CLS> token for prediction. For all our experiments, we freeze all except the last layer of Resnet-50.

2.2 Multi-task learning framework

We use a multi-task learning framework to pre-train our network on a combination of three self-supervised tasks that are posed as classification tasks, along with a supervised category classification task. At each training step, we optimize all the pre-training tasks in a joint fashion. For each pre-training task, the task-specific inputs are encoded according to their respective input strategies, and the task-specific loss is calculated. The gradients are computed with respect to each task-specific loss and accumulated across all tasks to be optimized using the AdamW optimizer (Loshchilov and Hutter, 2018). All tasks use cross-entropy loss for classification except DTM, which uses a soft cross-entropy loss.
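A minimal sketch of this joint optimization step (task names, batch structure, and the model interface are illustrative assumptions, not the authors' code):

```python
from torch.optim import AdamW

def pretraining_step(model, task_batches, optimizer):
    """One joint step: per-task losses are backpropagated so gradients accumulate,
    then a single AdamW update is applied over all tasks."""
    optimizer.zero_grad()
    total_loss = 0.0
    for task_name, batch in task_batches.items():   # e.g. {"mvlm": ..., "clf": ..., "dsp": ..., "dtm": ...}
        loss = model(task=task_name, **batch)        # task-specific inputs and loss computed inside the model
        loss.backward()                              # gradients accumulate across the task losses
        total_loss += loss.item()
    optimizer.step()
    return total_loss

# optimizer = AdamW(model.parameters(), lr=3e-5)
```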

2.3 Pre-training dataset:

We use Arxiv PDFs (Arxiv, 2020) for pre-training our architecture, comprising scientific articles belonging to 16 different categories such as mathematics, physics, computer science, etc. We extract the first 130k PDF documents from the arxiv bulk server and use a (train, val, test) split of (110k, 10k, 10k) respectively. We process the documents (Tesseract, 2021) to extract and store the text tokens, corresponding bounding boxes, page numbers, and the page images along with the document category to feed as input to our network.
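For illustration, a minimal sketch of this parsing step (using pytesseract and pdf2image; the exact pipeline and helper names are assumptions):

```python
import pytesseract
from pytesseract import Output
from pdf2image import convert_from_path  # rasterizes PDF pages to PIL images

def parse_pdf(path):
    """Returns one record per word token: text, bounding box, page number, page image."""
    records = []
    for page_no, page_img in enumerate(convert_from_path(path)):
        data = pytesseract.image_to_data(page_img, output_type=Output.DICT)
        for i, word in enumerate(data["text"]):
            if not word.strip():
                continue                              # skip empty OCR entries
            x1, y1 = data["left"][i], data["top"][i]
            x2, y2 = x1 + data["width"][i], y1 + data["height"][i]
            records.append({"token": word, "bbox": (x1, y1, x2, y2),
                            "page": page_no, "page_image": page_img})
    return records
```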

2.4 Pre-training tasks:

1. Masked Visual Language Modelling (MVLM): The BERT model utilizes Masked Language Modelling (MLM), where input tokens are masked during pre-training and predicted as output using the context from non-masked tokens. Compared to MLM, MVLM, introduced in (Xu et al., 2020), masks the input tokens by replacing them with a designated <MASK> token but keeps only the layout information provided by the bounding boxes. It, however, does not utilize the visual information of the tokens during pre-training. On the other hand, we also utilize the image embedding generated by the Resnet+FPN layers along with the layout embeddings at the masked locations.
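A minimal sketch of the masking step (the 15% mask probability and the helper name are assumptions; the paper does not state the ratio here), where the layout and token-image inputs at masked positions are left untouched:

```python
import torch
import torch.nn as nn

def mvlm_mask(token_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Masks token ids only; layout and token-image inputs at these positions stay intact."""
    masked = torch.rand(token_ids.shape) < mask_prob
    labels = token_ids.clone()
    labels[~masked] = ignore_index          # loss is computed on masked positions only
    inputs = token_ids.clone()
    inputs[masked] = mask_token_id          # bounding boxes / image embeddings are NOT replaced
    return inputs, labels

# loss = nn.CrossEntropyLoss(ignore_index=-100)(logits.view(-1, vocab_size), labels.view(-1))
```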

2. Document Category Classification (CLF): Each document in the Arxiv dataset belongs to one of 16 categories denoting the relevant subject area of the document. The category prediction is performed at the <CLS> token, by passing the output of the token into a fully-connected (FC) layer appended with a softmax classification layer.

3. Document Shuffle Prediction (DSP): For DSP, given a document, we randomly shuffle the order of the page images while preserving the order of other embeddings before passing to the network. Thus, although the token text and bounding box embeddings are in order, the corresponding token image embeddings are uncorrelated since the page images are shuffled. For a given document, the page images are shuffled with a probability of 0.5, and the model is trained to predict whether the input document pages are shuffled or intact using all the embeddings. We argue that, in order to successfully train on the DSP task, the network is forced to correlate the token image embeddings with the corresponding token text and bounding box embeddings.
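A minimal sketch of this example construction (the function name is illustrative):

```python
import random

def make_dsp_example(page_images):
    """With probability 0.5, shuffle only the page-image order; tokens, boxes and
    page numbers keep their original order. Label 1 = shuffled, 0 = intact."""
    shuffled = random.random() < 0.5 and len(page_images) > 1
    if shuffled:
        page_images = page_images[:]         # copy so other modalities stay aligned to the original order
        random.shuffle(page_images)          # note: a shuffle can occasionally reproduce the original order
    return page_images, int(shuffled)
```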

4. Document Topic Modelling (DTM): Although training on the DSP task forces the network to correlate image and text modalities at the token level, we introduce the DTM task to learn improved page image representations. The objective is to learn discriminative visual features by employing the semantic context as soft labels during training (Gomez et al., 2017). We encode the semantic context for each document as a probability distribution over a set of latent topics. We utilize the Latent Dirichlet Allocation (LDA) algorithm (Blei et al., 2003) to mine the latent topics over the set of text tokens parsed from the Arxiv training set. During training, the vector of topic probabilities is computed using the learned LDA model for each document. For the DTM task, we pass the page images of the document to our network, while a single <MASK> token is passed for the text embedding. Further, the bounding box coordinates of the complete page are passed as part of the layout embedding. A soft cross-entropy loss is applied to the predicted output of the network against the vector of topic probabilities for learning. Since the Arxiv dataset has 16 subject areas as categories within the documents, we chose to mine 30 latent topics to further identify granular categorization among the documents.
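A minimal sketch of the topic mining and the soft cross-entropy objective (scikit-learn's LDA is used here purely as a stand-in implementation; helper names are assumptions):

```python
import torch.nn.functional as F
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Mine 30 latent topics from the training-set text.
vectorizer = CountVectorizer(max_features=50000)
lda = LatentDirichletAllocation(n_components=30)

def fit_topic_targets(train_texts):
    counts = vectorizer.fit_transform(train_texts)
    return lda.fit_transform(counts)          # per-document topic probability vectors (soft labels)

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy against a soft topic distribution instead of a one-hot class."""
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```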

3 Datasets and Experiments

3.1 Datasets

FUNSD: The FUNSD dataset (Guillaume Jaume, 2019) consists of 199 fully annotated, scanned single-page forms with overall 31,485 words. Semantic entities comprising multiple tokens are annotated with labels among ‘question’, ‘answer’, ‘header’, or ‘other’. Additionally, the text, bounding boxes for each token, and links to other entities are present. The dataset has 149 train and 50 test images. We evaluate our network on the semantic labeling task and measure the word-level F1 scores (Xu et al., 2020).

RVL-CDIP: The RVL-CDIP dataset (Harley et al., 2015) consists of 400,000 grayscale images organized into 16 classes, with around 25,000 images per class. The images are characterized by low quality, presence of noise, and low resolution, typically 100 dpi. The dataset consists of 320,000 training, 40,000 validation and 40,000 test images. The 16 classes include letter, form, email, handwritten, advertisement, scientific report, scientific publication, specification, file folder, news article, budget, invoice, presentation, questionnaire, resume, and memo. We evaluate our architecture on the document classification task using the 16 labels.

ICDAR19: The Track A Modern data released as part of the ICDAR19 dataset (Gao et al., 2019) contains 600 train & 240 test images from various PDF documents such as scientific journals, forms, financial statements, etc., annotated with table bounding box coordinates. We perform word-level binary classification on this dataset.

Table 1: Model performance numbers for the Arxiv Classification task. (Prec: Precision, Rec: Recall)

Model | Input | Pre-training Tasks | Pre-training Size | Prec | Rec | F1
Our | Text | MLM+CLF | 110K (5 epochs) | 90.72% | 90.40% | 90.46%
Our | Text+Layout | MVLM+CLF | 110K (5 epochs) | 90.79% | 90.72% | 90.71%
Our | All | MVLM+CLF | 110K (5 epochs) | 98.92% | 98.90% | 98.90%
Our | All | MVLM+CLF+DSP | 110K (5 epochs) | 98.91% | 98.90% | 98.90%
Our | All | MVLM+CLF+DTM | 110K (5 epochs) | 98.92% | 98.91% | 98.92%
Our | All | All | 110K (5 epochs) | 98.93% | 98.92% | 98.93%

Table 2: Model performance numbers for the RVL-CDIP classification task. LayoutLM*_BASE uses Resnet-101 image embeddings during fine-tuning. (Acc: Accuracy)

Model | Input | Pre-train Tasks | Pre-train Size | Acc
Our | Text | MLM+CLF | 110K (5 epochs) | 84.48%
Our | Text+Layout | MVLM+CLF | 110K (5 epochs) | 86.55%
Our | All | MVLM+CLF | 110K (5 epochs) | 91.22%
Our | All | All | 110K (5 epochs) | 91.72%
Our (VGG-16) | All | All | 110K (5 epochs) | 93.36%
LayoutLM_BASE | Text+Layout | MVLM | 500K (6 epochs) | 91.25%
LayoutLM*_BASE | Text+Layout | MVLM+MDC | 1M (6 epochs) | 94.31%
VGG-16 (Afzal et al.) | Image | - | - | 90.97%
Stacked CNN Ensemble (Das et al.) | Image | - | - | 92.21%
LadderNet (Sarkhel and Nandi) | Image | - | - | 92.77%
Multimodal Ensemble (Dauphinee et al.) | Text+Image | - | - | 93.07%

3.2 Model Pre-training

We initialize the Longformer Encoder and Word embedding layer with the pre-trained weights from Longformer_BASE (12 layers, 512 hidden size) (Beltagy et al., 2020), as shown in Figure 3. We utilize a global+sliding window attention of length 512. The weights of Resnet-50 are initialized using the Resnet-50 model pre-trained on the ImageNet dataset (He et al., 2016). Across all pre-training and downstream tasks, we resize all page images to 563 × 750 and correspondingly scale the bounding box coordinates. We limit the maximum number of pages to 5 per document and limit the number of tokens to 500 per page during pre-training for sequence classification tasks. For the different pre-training tasks, we use a batch size (BS) and gradient accumulation (GA) of - (i) MVLM & CLF (BS=32 & GA=2); (ii) DSP (BS=16 & GA=1); (iii) DTM (BS=16 & GA=1). We pre-train our architecture for 15K steps (∼5 epochs) with a learning rate of 3e-5 on a single NVIDIA Tesla V100 32GB GPU.

3.3 Experiment Setup

We evaluate our model on the following different downstream tasks to demonstrate its efficacy.

Document Classification: We finetune our model to perform multi-page document classification on the Arxiv dataset and single-page document classification on the RVL-CDIP dataset. For both tasks, each document is encoded as a sequence of tokens, bounding boxes, page images, and page numbers, as shown in Figure 3. We use (Tesseract, 2021) to extract the word-level tokens and bounding boxes. The category prediction is performed at the <CLS> token by passing its output through an FC+Softmax layer. We use a learning rate of 3e-5, (BS=12 & GA=4) for Arxiv and (BS=64 & GA=1) for RVL-CDIP, and we fine-tune our model and different ablations for 5 epochs on both datasets independently. We use weighted precision, recall, and F1 as our evaluation metrics.

Form Understanding: We perform the semantic labeling task on the FUNSD dataset as a sequence labeling problem. Each form is treated as a single-page document and sequenced as a list of tokens, bounding boxes, and the page image. For each token, we pass its learned representation through an FC+Softmax layer to predict its category. We use word-level weighted precision, recall, and F1 score as the evaluation metrics (Xu et al., 2020). For fine-tuning, we use BS=12, GA=1, a learning rate of 3e-5, and train for 20 epochs.

Table Token Classification: For this task, the model is fine-tuned as a sequence labeling problem to classify the word-tokens in a document as ‘table’ or ‘other’. Table bounding boxes are used to generate ground-truth labels for each token in the document as detected using (Tesseract, 2021). Processing the document, generating the input embeddings, and the token-level prediction are performed similarly to the Form Understanding task. For fine-tuning, we use BS=4, GA=2, a learning rate of 3e-5, and train for 14 epochs.
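For illustration, a minimal sketch of the ground-truth generation (the containment test used here is an assumption about how tokens are assigned to tables):

```python
def label_tokens(token_boxes, table_boxes):
    """Label a token 'table' if its box lies inside any annotated table box, else 'other'."""
    def inside(tok, tab):
        tx1, ty1, tx2, ty2 = tok
        bx1, by1, bx2, by2 = tab
        return tx1 >= bx1 and ty1 >= by1 and tx2 <= bx2 and ty2 <= by2
    return ["table" if any(inside(t, b) for b in table_boxes) else "other"
            for t in token_boxes]
```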

Document Retrieval: Similar to the classification task, we process the multi-page documents in the Arxiv dataset and fine-tune our pre-trained model with all inputs on all pre-training tasks, and the BERT_BASE model with only the text input, on the Arxiv training set. We utilize 10k documents from the Arxiv test set, split into a 2k query and an 8k retrieval set. For a given query document, we use the fine-tuned embeddings from each model and compute its cosine distance with the retrieval set for retrieval. We compare the mean average precision (MAP) and the normalized discounted cumulative gain (NDCG) for evaluation.
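A minimal sketch of this ranking step (the helper name is illustrative):

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query_emb, retrieval_embs):
    """query_emb: (d,); retrieval_embs: (N, d). Returns retrieval indices, most similar first."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), retrieval_embs, dim=-1)
    return torch.argsort(sims, descending=True)
```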

4 Results and Discussion

Multi-page Document Classification: Prior-art approaches do not support multi-page documents as their network architectures do not enable encoding multi-page information. Thus, we compare the results of various ablations of our architecture for this task on the Arxiv dataset. As seen in Table 1, a significant boost in performance can be observed with the introduction of the image embeddings to the pre-training against the Text and Text+Layout ablations. The DSP and DTM tasks further improve the performance marginally. The ablation involving the DTM pre-training task showcases higher improvement, which suggests that utilizing the image information to predict the topic distribution during pre-training helps in learning improved image embeddings. The performance gain is also attributed to the underlying sequence encoder Longformer_BASE, whose attention mechanism over up to 4096 tokens enables multi-page processing to learn multi-modal contextual representations.

Single-page Document Classification: Table 2 shows the results of the single-page document classification on the RVL-CDIP dataset. All our models are pre-trained on the Arxiv dataset, LayoutLM is pre-trained on the IIT-CDIP dataset, whereas all other baselines perform no pre-training. Although our standard model with Resnet image layers beats the comparable LayoutLM_BASE model pre-trained on 500K documents and the image-based VGG-16 model, the task-specific approaches such as Stacked CNN Ensemble, LadderNet, and Multimodal Ensemble outperform our model. Since RVL-CDIP inherently contains low-quality images, task-specific approaches propose clever network architectures to utilize discriminative multi-scale features (Sarkhel and Nandi, 2019), multiple VGG-16 models to process different parts of the document image (Das et al., 2018), and augmenting image features with raw text features (Dauphinee et al., 2019) to achieve high classification performance. Further, it is known that Resnet-50 performs poorly on the RVL-CDIP dataset and even a smaller network such as VGG-16 performs better (Afzal et al., 2017). Thus, we also consider a variation of our architecture with the Resnet-50 image layers replaced with VGG-16 image layers. With the VGG-16 layers, we see a significant improvement in the performance from 91.72% to 93.36%, and our VGG-16 based model beats the existing image-based and image+text-based baselines trained on comparable dataset sizes. We achieve comparable performance with fine-tuning for 5 epochs to LayoutLM*_BASE, which is pre-trained on 1M documents and fine-tuned using the Faster R-CNN embeddings for 30 epochs.

Semantic Labelling Task: We present results of our model fine-tuned on the FUNSD semantic labeling task in Table 3. Our best model, pre-trained on all four tasks and all inputs, achieves an F1 score of 77.44%, outperforming the comparable LayoutLM*_BASE model which achieves an F1 score of 74.41%. We attribute the increase in performance to the inclusion of RoI-pooled image embeddings indicated by the various bounding box regions for each text token during pre-training, as well as to fine-tuning the Resnet-50 layers; the LayoutLM architecture is agnostic to both these properties. Further, our architecture pre-trains using 110K documents compared to LayoutLM which uses 500K & 1M documents. Thus, we argue that even for a significantly smaller dataset size, our model generalizes better by incorporating image embeddings during pre-training on the FUNSD task.


Table 3: Model performance numbers for the semantic labelling task on the FUNSD dataset. LayoutLM*_BASE uses Resnet-101 image embeddings during fine-tuning. (Prec: Precision, Rec: Recall)

Model | Input | Pre-train Tasks | Pre-train Size | Prec | Rec | F1
Our | Text | MLM+CLF | 110K (5 epochs) | 77.25% | 68.40% | 69.66%
Our | Text+Layout | MVLM+CLF | 110K (5 epochs) | 75.45% | 74.93% | 75.15%
Our | All | MVLM+CLF | 110K (5 epochs) | 77.31% | 76.50% | 76.79%
Our | All | MVLM+CLF+DSP | 110K (5 epochs) | 77.55% | 76.80% | 77.30%
Our | All | MVLM+CLF+DTM | 110K (5 epochs) | 77.84% | 77.10% | 77.42%
Our | All | All | 110K (5 epochs) | 78.41% | 77.35% | 77.44%
LayoutLM_BASE | Text+Layout | MVLM | 500K (6 epochs) | 66.50% | 73.55% | 69.85%
LayoutLM*_BASE | Text+Layout | MVLM+CLF | 1M (6 epochs) | 71.01% | 78.15% | 74.41%

Table 4: Results of the Multi-page Document Retrieval Task

Model | MAP | NDCG-1 | NDCG-10
BERT | 91.01% | 90.08% | 93.00%
Our_All | 98.99% | 98.94% | 99.21%

Table 5: Inference Ablation Results on the FUNSD dataset (F1 score)

Pre-training Tasks | All | Text Only | Image Only
MVLM+CLF | 76.79% | 73.64% | 33.24%
MVLM+CLF+DSP | 77.30% | 74.10% | 35.42%
MVLM+CLF+DTM | 77.42% | 74.68% | 38.20%
All | 77.44% | 75.10% | 40.12%


Similar to the FUNSD task, the table token classification task performs semantic labeling. However, the impact of jointly learning the text, layout, and image embeddings is much more evident from the results shown in Figure 4. Our model can correctly classify all the tokens belonging to tables with a negligible amount of false positives. We get a precision, recall, and F1-score of 94.99%, 94.98%, and 94.97% respectively on the ICDAR2019 test set. It is noteworthy that only fine-tuning our model on the train set can achieve promising results, which the prior-art approaches employ careful heuristics to achieve (Gao et al., 2019).

Figure 4: Results of Table Token Classification. Tokens predicted as “table” by our model are marked in green.

Multi-page Document Retrieval Task: To ascertain the utility of our proposed framework to process multi-page documents, we compare the results of our framework against the standard BERT model for the task of multi-page document retrieval in Table 4. Fine-tuned embeddings from our model significantly outperform those from the BERT model. The high values of MAP and NDCG-10 indicate that the retrieved samples are not only correct but also ranked higher than the incorrect ones for most of the queries. Although our model captures richer embeddings, the significant boost in performance is also attributed to the Longformer architecture that can encode much more information across document pages compared to the vanilla BERT architecture.

4.1 Inference Ablation Study

In order to further investigate the utility of the DSP and DTM tasks, we compare the performance of four models during inference on the FUNSD task. For each model, we conduct two ablations during inference, where only the text or image embedding is used to make the prediction while excluding the layout embeddings. As shown in Table 5, for both the ablations, Our_DSP, Our_DTM & Our_All retain higher performance than Our_MC. In particular, for the image-only ablation, the difference in the performance drop (All% - Image Only%) for Our_All (∼37%) is lower than that for Our_MC (∼43%). Similarly, for Our_DSP (∼42%) and Our_DTM (∼39%), the performance drop is lower than that for Our_MC, however, higher than for Our_All. This suggests that with the introduction of the DSP and DTM tasks, the learnt image embeddings for Our_DSP, Our_DTM & Our_All capture richer image representations than Our_MC. These pre-training tasks help learn better image representations that retain more performance even when the text information is missing during inference. The trend observed is similar when using text-only embeddings. However, the difference is not that significant, as both share the MVLM and classification tasks, which are more adept at learning textual representations.

5 Related Work

In recent years, different prior-art approaches have focussed on taking a multi-modal approach to document understanding. Exploring the document semantics as well as structure allows learning a granular understanding of the document information necessary to solve problems such as information extraction, semantic segmentation, layout analysis, table structure detection, etc. These approaches fundamentally involve analysing document structure in addition to the primary modality of document text or document image. (Katti et al., 2018) introduce a document representation that encodes the character-level textual information while preserving the 2D document layout. They train a fully convolutional encoder-decoder network that learns from this input representation to extract semantic information from invoices. For the similar task of information extraction from invoices, (Zhao et al., 2019) propose a convolutional network that learns both semantic and structural information from scanned invoices by creating a gridded text representation that preserves the spatial relationship among the text tokens. Contrary to these approaches, (Majumder et al., 2020) utilize the knowledge of key fields to be extracted from a document to generate candidates and learn their dense representation that encodes information from its positional neighbors. For analysing the tables in scanned documents, (Schreiber et al., 2017; Paliwal et al., 2019; Prasad et al., 2020) propose different modifications to standard CNN network architectures such as VGGNet (Simonyan and Zisserman, 2014) used for classification and Faster R-CNN (Ren et al., 2015) for object detection in images to recognise tables and identify their structure. Similarly, (Soto and Yoo, 2019) propose to augment the Faster R-CNN object detection network architecture (Ren et al., 2015) with contextual features about the document pages and region bounding boxes to segment key regions in scientific articles. On the contrary, (Yang et al., 2017) propose to solve this as a pixel-wise semantic segmentation task utilising a multi-modal encoder-decoder network architecture that takes as input both the text and image embeddings. To learn a generic representation supporting different tasks such as document image classification and document information extraction, (Xu et al., 2020) propose to utilise the BERT transformer architecture (Devlin et al., 2018) to encode text as well as layout information to learn pre-trained embeddings, and further utilise image information to fine-tune for a specific task.

Most of the approaches in prior art utilize the multi-modal document information from single-page documents, and extending their applicability to multi-page documents needs further exploration. Further, these approaches rely on limited labeled data; thus, exploring self-supervised learning to leverage large unlabeled datasets also needs exploration. We attempt to address these limitations in this paper.

6 Conclusion and Future work

We present a multi-modal pre-training framework that utilizes multi-task learning to learn a generic document representation. Our framework encodes the visual, layout, and textual information and supports real-world multi-page documents. Our network is pre-trained on the publicly available Arxiv dataset utilizing self-supervised tasks that promote learning multi-modal shared representations across the text, layout, and image dimensions. We fine-tune our pre-trained network to showcase state-of-the-art performance on different document tasks such as document classification, information extraction, and document retrieval. In future, we will investigate pre-training on large datasets such as PubLayNet (Zhong et al., 2019) and further explore new architecture designs that will enable document image tasks such as table detection/page segmentation using our framework.

References

Muhammad Zeshan Afzal, Andreas Kölsch, Sheraz Ahmed, and Marcus Liwicki. 2017. Cutting the error by half: Investigation of very deep cnn and advanced training strategies for document image classification. In ICDAR.

Arxiv. 2020. arxiv bulk data access. https://arxiv.org/help/bulk_data.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. JMLR.

Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. 2016. R-fcn: Object detection via region-based fully convolutional networks. In NIPS.

Arindam Das, Saikat Roy, Ujjwal Bhattacharya, and Swapan K Parui. 2018. Document image classification with intra-domain transfer learning and stacked generalization of deep convolutional neural networks. In ICPR.

Tyler Dauphinee, Nikunj Patel, and Mohammad Rashidi. 2019. Modular multimodal architecture for document classification. arXiv.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv.

DI. 2019. Workshop on document intelligence at neurips 2019. https://sites.google.com/view/di2019.

Liangcai Gao, Yilun Huang, Hervé Déjean, Jean-Luc Meunier, Qinqin Yan, Yu Fang, Florian Kleber, and Eva Lang. 2019. Icdar 2019 competition on table detection and recognition (ctdar). In ICDAR.

Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, and CV Jawahar. 2017. Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR.

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. Funsd: A dataset for form understanding in noisy scanned documents. In ICDAR-OST.

Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. 2015. Evaluation of deep convolutional nets for document image classification and retrieval. In ICDAR.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.

Anoop R Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards understanding 2d documents. In EMNLP.

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In CVPR.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv.

Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in adam.

Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, and Marc Najork. 2020. Representation learning for information extraction from form-like documents. In ACL.

Larry M Manevitz and Malik Yousef. 2001. One-class svms for document classification. JMLR.

Shubham Singh Paliwal, D Vishwanath, Rohit Rahul, Monika Sharma, and Lovekesh Vig. 2019. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In ICDAR.

Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. 2020. Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents. In CVPR Workshops.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.

Ritesh Sarkhel and Arnab Nandi. 2019. Deterministic routing between layout abstractions for multi-scale classification of visually rich documents. In IJCAI.

Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. 2017. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In ICDAR.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv.

Carlos Soto and Shinjae Yoo. 2019. Visual detection with context for document layout analysis. In EMNLP-IJCNLP.

Python Tesseract. 2021. Python wrapper for google's tesseract-ocr engine. https://github.com/madmaze/pytesseract.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. Layoutlm: Pre-training of text and layout for document image understanding. In ACM SIGKDD.

Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C Lee Giles. 2017. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In CVPR.

Xiaohui Zhao, Endi Niu, Zhuo Wu, and Xiaoguang Wang. 2019. Cutie: Learning to understand documents with convolutional universal text information extractor. arXiv.

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. Publaynet: largest dataset ever for document layout analysis. In ICDAR.