
ABCNet v2: Adaptive Bezier-Curve Network for Real-time End-to-end Text Spotting

Yuliang Liu‡†, Chunhua Shen†∗, Lianwen Jin‡∗, Tong He†, Peng Chen†, Chongyu Liu‡, Hao Chen†

Abstract—End-to-end text spotting, which aims to integrate detection and recognition in a unified framework, has attracted increasing attention due to the simplicity of jointly handling the two complementary tasks. It remains an open problem, especially when processing arbitrarily-shaped text instances. Previous methods can be roughly categorized into two groups: character-based and segmentation-based, which often require character-level annotations and/or complex post-processing due to the unstructured output. Here, we tackle end-to-end text spotting by presenting Adaptive Bezier Curve Network v2 (ABCNet v2). Our main contributions are four-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text by a parameterized Bezier curve, which, compared with segmentation-based methods, can not only provide structured output but also a controllable representation. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance of arbitrary shapes, significantly improving the precision of recognition over previous methods. 3) Different from previous methods, which often suffer from complex post-processing and sensitive hyper-parameters, our ABCNet v2 maintains a simple pipeline whose only post-processing is non-maximum suppression (NMS). 4) As the performance of text recognition closely depends on feature alignment, ABCNet v2 further adopts a simple yet effective coordinate convolution to encode the position of the convolutional filters, which leads to a considerable improvement with negligible computation overhead. Comprehensive experiments conducted on various bilingual (English and Chinese) benchmark datasets demonstrate that ABCNet v2 can achieve state-of-the-art performance while maintaining very high efficiency. More importantly, as there is little work on quantization of text spotting models, we quantize our models to improve the inference time of the proposed ABCNet v2, which can be valuable for real-time applications. Code and models are available at: https://git.io/AdelaiDet

Index Terms—Bezier curve, Scene text spotting, Text detection and recognition


1 INTRODUCTION

Text spotting in the natural environment, which aims to detect and recognize text instances in unconstrained conditions, has drawn increasing research attention in the communities of computer vision and image understanding. The recovered text information has proven valuable for image retrieval, automatic organization of photos, and visual assistance, to name a few. To date, it remains challenging for the following reasons: 1) text instances often exhibit diverse patterns in shape, color, font, and language, and these inevitable variations in the data often require heuristic settings to achieve satisfactory performance; 2) real-time applications require the algorithm to achieve a better trade-off between efficiency and effectiveness. Although the emergence of deep learning has significantly improved performance on scene text spotting, there remains a considerable gap between current methods and generic real-world applications.

Previous approaches often involve two independent modules for scene text spotting: text detection and recognition, which are applied sequentially. Many of them only address one task and directly borrow top-performing modules for the other. Such a simplified approach is unlikely to exploit the full potential of deep convolutional networks, as the two tasks are isolated without shareable features.

Recently, end-to-end scene text spotting methods [3],

‡South China University of Technology; †The University of Adelaide, Australia. ∗Corresponding authors.

(a) TPS-based alignment. (b) BezierAlign-ed results.

Fig. 1 – Comparison of the warping results. In Figure (a), we follow previous methods by using TPS [1] and STN [2] to warp the curved text region into a rectangular shape. In Figure (b), we use generated Bezier curves and the proposed BezierAlign to warp the results, leading to improved accuracy.

[4], [5], [6], [7], [8], [9], [10], [11], which directly build the mapping between the input image and sets of word transcripts in a unified framework, are drawing increasing attention. Compared to models in which detection and recognition are two separate modules, the advantages of designing an end-to-end framework are as follows. Firstly, word recognition can significantly improve the accuracy of detection. One of the most notable characteristics of text is its sequential nature. However, false positives exhibiting the appearance of sequences exist in unconstrained environments, such as blocks, buildings, and railings. To empower the network with the discriminative capability to distinguish such different patterns, some approaches [3], [4],



[5] propose to share the features between the two tasks and train the network in an end-to-end manner. Moreover, due to the shared features, end-to-end frameworks often show superiority in inference speed, which is more suitable for real-time applications. Finally, current standalone recognition models usually adopt perfectly cropped text images or heuristic synthetic images for training. An end-to-end module can force the recognition module to adapt to the detection outputs, and thus the results can be more robust [10].

Existing approaches for end-to-end text spotting can be roughly categorized into two groups: character-based and segmentation-based. Character-based methods first detect and recognize individual characters and then output words by applying an extra grouping module. Although effective, laborious character-level annotations are required. Besides, a few predefined hyper-parameters are often necessary for the grouping algorithm, showing limited robustness and generalization capability. Another line of research is segmentation-based, where text instances are represented by unstructured contours, making the subsequent recognition step difficult. For example, the work in [12] relies on a TPS [1] or an STN [2] step to warp the original ground truths into a rectangular shape. Note that the characters can be significantly distorted, as shown in Figure 1. Besides, compared with detection, text recognition requires a significantly larger amount of training data, resulting in optimization difficulties in a unified framework.

To address these limitations, we propose the Adaptive Bezier Curve Network v2 (ABCNet v2), an end-to-end trainable framework for real-time arbitrarily-shaped scene text spotting. ABCNet v2 enables arbitrarily-shaped scene text detection with simple yet effective Bezier curve adaptation, which introduces negligible computation overhead compared with standard rectangular bounding box detection. In addition, we design a novel feature alignment layer, termed BezierAlign, to precisely calculate convolutional features of text instances in curved shapes, and thus high recognition accuracy can be achieved. For the first time, we successfully adopt the parameter space (Bezier curves) for multi-oriented or curved text spotting, enabling a very concise and efficient pipeline.

Inspired by recent work in [13], [14], [15], we improve ABCNet of our conference version [16] in four aspects: the feature extractor, the detection branch, the recognition branch, and end-to-end training. Due to the inevitable variation of scales, the proposed ABCNet v2 incorporates iterative bidirectional features to achieve a better accuracy-efficiency trade-off. In addition, based on our observations, feature alignment in the detection branch is essential for the subsequent text recognition. To this end, we adopt a coordinate encoding approach with negligible computation overhead to explicitly encode the position in the convolutional filters, leading to a considerable improvement in accuracy. For the recognition branch, we integrate a character attention module, which can recursively predict the characters of each word without using character-level annotations. To enable effective end-to-end training, we further propose an Adaptive End-to-End Training (AET) strategy that matches detection results to recognition targets during end-to-end training. This makes the recognition branch more robust to the detection behavior.

Thus, the proposed ABCNet v2 enjoys several advantages over previous state-of-the-art methods, which are summarized as follows:

• For the first time, we introduce a new, concise parametric representation of curved scene text using Bezier curves. It introduces negligible computation overhead compared with the standard bounding box representation.

• We propose a new feature alignment method, a.k.a. BezierAlign, so that the recognition branch can be seamlessly connected to the overall structure. By sharing backbone features, the recognition branch can be designed with a light-weight structure for efficient inference.

• The detection model of ABCNet v2 is more general for processing multi-scale text instances by considering bidirectional multi-scale pyramid global textual features.

• To our knowledge, our method is the first framework that can simultaneously detect and recognize horizontal, multi-oriented, and arbitrarily-shaped text in a single-shot manner, while keeping a real-time inference speed.

• To further speed up inference, we also exploit the technique of model quantization, showing that ABCNet v2 can reach a much faster inference speed with only marginal accuracy reduction.

• Comprehensive experiments on various benchmarks demonstrate the state-of-the-art text spotting performance of the proposed ABCNet v2 in terms of accuracy and speed.

2 RELATED WORK

Scene text spotting requires detecting and recognizing text simultaneously, instead of addressing only one task. In the early days, scene text spotting methods were usually simply connected by independent detection and recognition models, which are separately optimized with different architectures. Recently, end-to-end methods (§2.2) have significantly advanced the performance of text spotting by integrating detection and recognition into one unified network.

2.1 Separate Scene Text Spotting

In this section, we briefly review the literature, focusing on either detection or recognition.

2.1.1 Scene Text Detection

The development trend of text detection can be observed through the detection flexibility: from focused horizontal scene text detection represented by horizontal rectangular bounding boxes, to multi-oriented scene text detection represented by rotated rectangular or quadrilateral bounding boxes, and to arbitrarily-shaped scene text detection represented by instance segmentation masks or polygons.

The early horizontal rectangle based methods can be dated back to Lucas et al. [17], in which the pioneering horizontal ICDAR'03 benchmark is constructed. ICDAR'03 and its successive datasets (ICDAR'11 [18] and ICDAR'13 [19])


have attracted considerable research efforts [20], [21], [22], [23], [24], [25] in studies on horizontal scene text detection.

Before 2010, most methods merely focused on regular horizontal scene text, which limits their generalization to real applications, where multi-oriented scene text is ubiquitous. To this end, Yao et al. [26] put forward a practical detection system as well as a multi-oriented benchmark (MSRA-TD500) for multi-oriented scene text detection. Both the method and the dataset use rotated rectangular bounding boxes to detect and annotate multi-oriented text instances. Besides MSRA-TD500, the emergence of other multi-oriented datasets, including NEOCR [27] and USTB-SV1K [28], further facilitates numerous rotated rectangle based methods [3], [26], [28], [29], [30], [31]. Since 2015, ICDAR'15 [32] has used four-point quadrilateral annotations for each text instance, which facilitates numerous methods that successfully demonstrate the superiority of tighter and more flexible quadrilateral detection. The SegLink method [33] predicts text regions by oriented segments and learns connecting links to recombine the results. DMPNet [34] observes that rotated rectangles may still contain unnecessary background noise, imperfect matching, or unnecessary overlap, and thus proposes the use of quadrilateral bounding boxes to detect text with auxiliary, predefined quadrilateral sliding windows. EAST [35] employs a dense prediction structure for directly predicting quadrilateral bounding boxes. WordSup [36] proposes an iterative strategy to generate character regions automatically, which shows robustness on complicated scenes. The success of ICDAR 2015 motivated numerous quadrilateral based datasets, such as RCTW'17 [37], MLT [38], and ReCTS [39].

Recently, the research focus has shifted from multi-oriented scene text detection to arbitrarily-shaped text detection. The arbitrary shape is mainly presented by curved text in the wild, which is also very common in the real world, e.g., on columnar objects (bottles and stone piles), spherical objects, plicate planes (clothes, streamers, and receipts), coins, logos, seals, signboards, and so on. The first curved text dataset, CUTE80 [40], was constructed in 2014. But this dataset is mainly used for scene text recognition, as it contains only 80 clean images with relatively few text instances. For detection of arbitrarily-shaped scene text, two recent benchmarks, Total-Text [41] and SCUT-CTW1500 [42], have been proposed and have advanced many influential works [43], [44], [45], [46], [47], [48], [49], [50], [51]. TextSnake [47] designs an FCN to predict the geometric attributes of text instances and then groups them into the final output. CRAFT [44] predicts the character regions of the text and the affinity between adjacent ones. SegLink++ [48] proffers an instance-aware component grouping framework for dense and arbitrarily-shaped text detection. PSENet [46] proposes to learn text kernels and then expands them to cover the whole text instances. PAN [45], based on PSENet [46], adopts a learnable post-processing method by predicting similarity vectors of pixels. Wang et al. [52] propose to learn an adaptive text region representation for the detection of text of arbitrary shape. DRRN [53] proposes to first detect text components and then group them together through a graph network. ContourNet [50] adopts an adaptive RPN and an extra contour prediction branch to improve the precision.

2.1.2 Scene Text Recognition

Scene text recognition aims at recognizing text from a cropped text image. Many previous methods [54], [55], following the bottom-up approach, first segment character regions through sliding windows and classify each character, then group the characters into a word while taking the dependence on neighbors into consideration. They achieve good performance in scene text recognition, but are limited by costly character-level annotations for character detection. Without large training datasets, the models in this category typically cannot generalize well. The works proposed by Su and Lu [56], [57] present a scene text recognition system using HOG features and a Recurrent Neural Network (RNN), which are among the pioneering works that successfully introduce RNNs for scene text recognition. Later, CNN-based methods with recurrent neural networks were proposed to perform in a top-down manner, which can predict a text sequence end-to-end without any character detection. Shi et al. [58] apply Connectionist Temporal Classification (CTC) [59] upon a network integrating CNNs with RNNs, termed CRNN. Guided by the CTC loss, CRNN-based models can effectively transcribe the image content. Besides CTC, the attention mechanism [60] is also employed for text recognition.

The above methods are mainly applied to regular text recognition and are not sufficiently robust for irregular text. In recent years, approaches for arbitrarily-shaped text recognition have become dominant, which can be categorized into rectification-based methods and rectification-free methods. For the former, STN [2] and Thin-Plate-Spline (TPS) [61] are two widely used methods for text rectification. Shi et al. [62] are the first to introduce STN and an attention-based decoder for predicting the text sequence. The work in [63] achieves better performance using iterative text rectification. Besides, Luo et al. [64] propose MORAN, which rectifies the text by regressing offsets for location shift. Liu et al. [65] propose a Character-Aware Neural Network (Char-Net), which detects characters first and then separately transforms them into a horizontal layout. ESIR [66] presents an iterative rectification pipeline that can turn text from perspective distortion to a regular format, and thus an effective end-to-end scene text recognition system can be built. Litman et al. [67] first apply TPS on input images and then stack several selective attention decoders for both visual and contextual features.

In the category of rectification-free methods, Cheng et al. [68] propose an arbitrary orientation network (AON) to extract features in four directions together with character position clues. Li et al. [69] apply a 2D attention mechanism to capture irregular text features and achieve impressive results. To tackle the attention drift issue, Yue et al. [70] design a novel position enhancement branch in the recognition model. In addition, some rectification-free methods are based on semantic segmentation. Liao et al. [71] and Wan et al. [72] both propose to segment characters and classify their categories through visual features.


2.2 End-to-End Scene Text Spotting

2.2.1 Regular End-to-End Scene Text Spotting

Li et al. [3] may be the first to propose an end-to-end trainable scene text spotting method. The method successfully uses RoI Pooling [73] to join detection and recognition features via a two-stage framework. It is designed to process horizontal and focused text, and its improved version [11] significantly improves the performance. Busta et al. [74] also propose an end-to-end deep text spotter. He et al. [4] and Liu et al. [5] adopt an anchor-free mechanism to improve both the training and inference speed. They use similar sampling strategies, i.e., Text-Align-Sampling and RoI-Rotate, respectively, to extract features from quadrilateral detection results. Note that neither of these two methods is capable of spotting arbitrarily-shaped scene text.

2.2.2 Arbitrarily-shaped End-to-End Scene Text Spotting

To detect arbitrarily-shaped scene text, Liao et al. [6] propose Mask TextSpotter, which subtly refines Mask R-CNN and uses character-level supervision to simultaneously detect and recognize characters and instance masks. The method significantly improves the performance of spotting arbitrarily-shaped scene text. Its improved version [10] significantly alleviates the reliance on character-level annotations. Sun et al. [75] propose TextNet, which produces quadrilateral detection bounding boxes in advance and then uses a region proposal network to feed the detection features for recognition.

Recently, Qin et al. [7] propose to use RoI Masking to focus on the arbitrarily-shaped text region. Note that extra computation is needed to fit polygons. The work in [8] proposes an arbitrarily-shaped scene text spotting method, termed CharNet, which requires character-level training data and relies on TextField [76] to group recognition results. The authors of [9] propose a novel sampling method, RoISlide, which fuses features from the predicted segments of the text instances, and thus is robust to long arbitrarily-shaped text.

Wang et al. [12] first detect the boundary points of text of arbitrary shapes, rectify the detected text via TPS, and then feed it into the recognition branch. Liao et al. [77] propose a Segmentation Proposal Network (SPN) to accurately extract the text regions, and follow [10] to attain the final results.

3 OUR METHOD

An intuitive pipeline of our method is shown in Figure 2. Inspired by [78], [79], [80], we adopt a single-shot, anchor-free convolutional neural network as the detection framework. Removal of anchor boxes significantly simplifies the detection for our task. Here the detection is densely predicted on the output feature maps of the detection head, which is constructed by 4 stacked convolution layers with stride of 1, padding of 1, and 3×3 kernels. Next, we present the key components of the proposed ABCNet v2 in six parts: 1) Bezier curve detection; 2) the coordinate convolution module; 3) BezierAlign; 4) the light-weight attention recognition module; 5) the adaptive end-to-end training strategy; and 6) text spotting quantization.

3.1 Bezier Curve Detection

Compared to segmentation-based methods [44], [46], [47], [49], [76], [81], regression-based methods are more suitable for arbitrarily-shaped text detection, as demonstrated in [42], [52]. A drawback of these methods is the complicated pipeline, and they often require complex post-processing steps to obtain the final results.

To simplify the detection of arbitrarily-shaped scene text instances, we propose to fit a Bezier curve by regressing several key points. The Bezier curve represents a parametric curve $c(t)$ that uses the Bernstein polynomials as its basis. The definition is shown in Equation (1):

$$c(t) = \sum_{i=0}^{n} b_i B_{i,n}(t), \quad 0 \le t \le 1, \qquad (1)$$

where $n$ represents the degree, $b_i$ represents the $i$-th control point, and $B_{i,n}(t)$ represents the Bernstein basis polynomials, as shown in Equation (2):

$$B_{i,n}(t) = \binom{n}{i} t^i (1-t)^{n-i}, \quad i = 0, \ldots, n, \qquad (2)$$

where $\binom{n}{i}$ is the binomial coefficient. To fit arbitrary shapes of text with Bezier curves, we examine the arbitrarily-shaped scene text from the existing datasets and empirically show that a cubic Bezier curve (i.e., $n = 3$) is sufficient to fit different formats of curved scene text, especially on datasets with word-level annotation. A higher order may work better on text-line level datasets, where multiple waves may be present in one instance. We provide comparisons in terms of the order of the Bezier curves in the experiment section. An illustration of cubic Bezier curves is shown in Figure 3.

Based on the cubic Bezier curve, we can formulate arbitrarily-shaped scene text detection as a regression problem similar to bounding box regression, but with eight control points in total. Note that straight text, which has four control points (four vertices), is a typical case of arbitrarily-shaped scene text. For consistency, we interpolate two additional control points at the tripartite points of each long side.
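To make the parameterization concrete, the following is a minimal sketch (not the authors' released implementation) of evaluating a Bezier curve from its control points via the Bernstein basis of Equations (1) and (2); the function names are illustrative only.

```python
import numpy as np
from math import comb

def bernstein(i, n, t):
    """Bernstein basis polynomial B_{i,n}(t) from Equation (2)."""
    return comb(n, i) * (t ** i) * ((1.0 - t) ** (n - i))

def bezier_points(control_points, num_samples=20):
    """Sample points on a Bezier curve defined by (n+1) x 2 control points.

    For ABCNet v2, each long side of a text instance uses a cubic curve
    (n = 3, i.e., 4 control points), so one instance has 8 points in total.
    """
    control_points = np.asarray(control_points, dtype=np.float64)   # (n+1, 2)
    n = len(control_points) - 1
    ts = np.linspace(0.0, 1.0, num_samples)
    basis = np.stack([[bernstein(i, n, t) for i in range(n + 1)] for t in ts])  # (m, n+1)
    return basis @ control_points                                    # (m, 2) sampled (x, y)

# Example: a gently curved top boundary of a text instance.
top_boundary = bezier_points([[0, 0], [30, -10], [70, -10], [100, 0]])
```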

To learn the coordinates of the control points, we first generate the Bezier curve annotations as described in §3.1.1 and follow a regression method similar to [34] to regress the targets. For each text instance, we use

$$\Delta x = b_{ix} - x_{\min}, \quad \Delta y = b_{iy} - y_{\min}, \qquad (3)$$

where $x_{\min}$ and $y_{\min}$ represent the minimum $x$ and $y$ values of the 4 vertices, respectively. The advantage of predicting the relative distance is that it is irrelevant whether the Bezier curve control points are beyond the image boundary. Inside the detection head, we only use one convolution layer with $4(n+1)$ output channels ($n$ is the order of the Bezier curve) to learn $\Delta x$ and $\Delta y$, which is nearly cost-free while the results can still be accurate. We discuss the details in §4.
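As a small illustrative sketch of Equation (3), assuming the 8 control points of one instance are stored as an (8, 2) array and that the first and last control point of each side are the corner vertices (this ordering is an assumption, not specified here), the regression targets can be encoded as offsets from the top-left of the axis-aligned box of the four vertices:

```python
import numpy as np

def encode_control_point_targets(control_points):
    """Encode Bezier control points as offsets (Equation (3)).

    control_points: (8, 2) array of (x, y) for a cubic-curve instance.
    Returns a flat vector of 4(n+1) = 16 offsets relative to (x_min, y_min).
    """
    cp = np.asarray(control_points, dtype=np.float32)   # (8, 2)
    corners = cp[[0, 3, 4, 7]]                           # assumed corner ordering (hypothetical)
    x_min, y_min = corners[:, 0].min(), corners[:, 1].min()
    offsets = cp - np.array([x_min, y_min], dtype=np.float32)
    return offsets.reshape(-1)                           # targets for the 16-channel conv head
```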

3.1.1 Bezier Ground-truth Generation

In this section, we briefly introduce how to generate the Bezier curve ground truth based on the original annotations.


[Figure 2 diagram components: Input Image, Backbone, BiFPN, Coord Conv, Bezier Curve Detection, BezierAlign, Light-weight Recognition Head.]

Fig. 2 – The framework of the proposed ABCNet v2. We use cubic Bezier curves and BezierAlign to extract multi-scale curved sequence features using the Bezier curve detection results. We concatenate coordinate channels to encode the position coordinates in the FPN output features before sending them to BezierAlign. The overall framework is end-to-end trainable with high efficiency. Here purple dots represent the control points of the cubic Bezier curve.

The arbitrarily-shaped datasets, e.g., Total-Text [41] and SCUT-CTW1500 [42], use polygonal annotations for the text regions. Given the annotated points $\{p_i\}_{i=1}^{m}$ from a curved boundary, where $p_i$ represents the $i$-th annotated point, the main goal is to obtain the optimal parameters of the cubic Bezier curve $c(t)$ in Equation (1). To achieve this, we can simply apply standard least-squares fitting as follows:

$$
\begin{bmatrix}
B_{0,n}(t_0) & \cdots & B_{n,n}(t_0) \\
B_{0,n}(t_1) & \cdots & B_{n,n}(t_1) \\
\vdots & \ddots & \vdots \\
B_{0,n}(t_m) & \cdots & B_{n,n}(t_m)
\end{bmatrix}
\begin{bmatrix}
b_{x_0} & b_{y_0} \\
b_{x_1} & b_{y_1} \\
\vdots & \vdots \\
b_{x_n} & b_{y_n}
\end{bmatrix}
=
\begin{bmatrix}
p_{x_0} & p_{y_0} \\
p_{x_1} & p_{y_1} \\
\vdots & \vdots \\
p_{x_m} & p_{y_m}
\end{bmatrix}. \qquad (4)
$$

Here $m$ represents the number of annotated points on a curved boundary. For Total-Text and SCUT-CTW1500, $m$ is 5 and 7, respectively. $t$ is calculated using the ratio of the cumulative length to the perimeter of the poly-line. According to Equation (1) and Equation (4), we convert the original poly-line annotation into a parameterized Bezier curve. Note that we directly use the first and the last annotated points as the first ($b_0$) and the last ($b_n$) control points, respectively. A visual comparison is shown in Figure 1, which shows that the generated results can be even visually better than the original annotation. In addition, thanks to the structured output, the task of text recognition can be easily formulated by applying our proposed BezierAlign (see §3.3), which warps the curved text into a horizontal representation. More results of Bezier curve generation are shown in Figure 4. The simplicity of our method allows it to deal with various shapes in a unified representation format.
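A hedged NumPy sketch of this fitting step, under the assumption stated above that t is the cumulative poly-line length divided by the total length and that the endpoints are pinned; the helper name is illustrative:

```python
import numpy as np
from math import comb

def fit_bezier(points, n=3):
    """Least-squares fit of an n-th order Bezier curve (Equation (4)).

    points: (m+1, 2) annotated (x, y) points along one curved boundary.
    Returns (n+1, 2) control points.
    """
    pts = np.asarray(points, dtype=np.float64)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)]) / seg.sum()        # cumulative-length parameter
    A = np.stack([comb(n, i) * t**i * (1 - t)**(n - i)             # Bernstein design matrix
                  for i in range(n + 1)], axis=1)                   # (m+1, n+1)
    ctrl, *_ = np.linalg.lstsq(A, pts, rcond=None)                  # solve A @ ctrl = pts
    ctrl[0], ctrl[-1] = pts[0], pts[-1]                             # pin b_0 and b_n as in the paper
    return ctrl
```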

3.2 CoordConv

As pointed out in [14], conventional convolutions show limitations when learning a mapping between coordinates

Fig. 3 – Cubic Bezier curves. $b_i$ represents the control points. The green lines form a control polygon, and the black curve is the cubic Bezier curve. Note that with only two end-points $b_1$ and $b_4$, the Bezier curve degenerates to a straight line.

in (x, y) Cartesian space and coordinates in one-hot pixel space. The problem can be effectively solved by concatenating the coordinates to the feature maps. The recent practice of encoding relative coordinates [15] also shows that relative coordinates can provide informative cues for instance segmentation.

Let $f_{\text{outs}}$ denote the features of different scales of the FPN, and let $O_{i,x}$ and $O_{i,y}$ represent the absolute $x$ and $y$ coordinates, respectively, of all the locations (i.e., the locations where the filters are applied) of the $i$-th FPN level. All $O_{i,x}$ and $O_{i,y}$ form two feature maps $f_{ox}$ and $f_{oy}$. We simply concatenate $f_{ox}$ and $f_{oy}$ to the last channel of $f_{\text{outs}}$ along the channel dimension. Therefore, new features $f_{\text{coord}}$ with two additional channels are formed, which are subsequently input to three convolutional layers with kernel size, stride, and padding set to 3, 1, and 1, respectively. We find that using such simple coordinate convolutions can considerably improve the performance of scene text spotting.
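A minimal PyTorch sketch (assumed layout and module name, not the released code) of concatenating absolute coordinate channels to one FPN feature map before the extra 3×3 convolutions described above:

```python
import torch
import torch.nn as nn

class CoordConvBlock(nn.Module):
    """Concatenate (x, y) coordinate maps to the features, then apply 3x3 convs."""

    def __init__(self, in_channels, out_channels, num_convs=3):
        super().__init__()
        layers, c = [], in_channels + 2                  # +2 for the coordinate channels
        for _ in range(num_convs):
            layers += [nn.Conv2d(c, out_channels, kernel_size=3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
            c = out_channels
        self.convs = nn.Sequential(*layers)

    def forward(self, feat):
        n, _, h, w = feat.shape
        ys = torch.arange(h, device=feat.device, dtype=feat.dtype).view(1, 1, h, 1).expand(n, 1, h, w)
        xs = torch.arange(w, device=feat.device, dtype=feat.dtype).view(1, 1, 1, w).expand(n, 1, h, w)
        return self.convs(torch.cat([feat, xs, ys], dim=1))
```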

3.3 BezierAlign

To enable end-to-end training, most previous methods adopt various sampling (feature alignment) methods to connect the recognition branch. Typically, a sampling method represents an in-network region cropping procedure. In other words, given a feature map and a Region-of-Interest (RoI), the sampling method extracts the features of the RoI and efficiently outputs a feature map of fixed size. However, the sampling methods of previous non-segmentation based methods, e.g., RoI Pooling [3], RoI-Rotate [5], Text-Align-Sampling [4], or RoI Transform [75], cannot properly align features of arbitrarily-shaped text. By exploiting the parametric nature of a structured Bezier curve bounding box, we propose BezierAlign for feature sampling/alignment, which may be viewed as a flexible extension of RoIAlign [82]. Unlike RoIAlign, the sampling grid of BezierAlign is not rectangular. Instead, each column of the arbitrarily-shaped grid is orthogonal to the Bezier curve boundary of the text. The sampling points have equidistant intervals in width and height, respectively, and are bilinearly interpolated with respect to the coordinates.

Formally, given an input feature map and the Bezier curve control points, we process all the output pixels of the rectangular output feature map of size $h_{out} \times w_{out}$.


Fig. 4 – Example results of Bezier curve generation. Green lines are the final Bezier curve results. Red dashed lines represent the control polygon, and the 4 red end points represent the control points. Zoom in for better visualization.

(a) Horizontal sampling. (b) Quadrilateral sampling. (c) BezierAlign.

Fig. 5 – Comparison between previous sampling methods and BezierAlign. The proposed BezierAlign can more accurately sample features of the text region, which is essential for achieving good recognition accuracy. Note that the alignment procedure is applied to intermediate convolutional features.

Taking a pixel $g_i$ from the output feature map with position $(g_{iw}, g_{ih})$ as an example, we calculate $t$ as follows:

$$t = \frac{g_{iw}}{w_{out}}. \qquad (5)$$

Then the point $tp$ on the upper Bezier curve boundary and the point $bp$ on the lower Bezier curve boundary are calculated according to Equation (1). Using $tp$ and $bp$, we can linearly index the sampling point $op$ by Equation (6):

$$op = bp \cdot \frac{g_{ih}}{h_{out}} + tp \cdot \left(1 - \frac{g_{ih}}{h_{out}}\right). \qquad (6)$$

With the position of $op$, we can easily apply bilinear interpolation to calculate the sampled value. Due to the accurate sampling of features, the performance of text recognition is improved substantially. We compare BezierAlign with other sampling strategies in Figure 5.
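The following is a simplified, CPU-side sketch of computing the BezierAlign sampling locations of Equations (5) and (6) for one text instance; it is not the CUDA kernel used in practice, the helper names are illustrative, and bilinear interpolation of the feature map at the returned locations is left to a standard grid-sampling routine.

```python
import numpy as np

def bezier_points_at(ctrl, t):
    """Evaluate a cubic Bezier curve at parameters t (Equation (1))."""
    ctrl = np.asarray(ctrl, dtype=np.float64)            # (4, 2) control points
    t = np.asarray(t, dtype=np.float64)[:, None]
    basis = [(1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t ** 2 * (1 - t), t ** 3]
    return sum(b * c for b, c in zip(basis, ctrl))        # (len(t), 2)

def bezieralign_grid(top_ctrl, bottom_ctrl, h_out, w_out):
    """Compute the (h_out, w_out, 2) sampling locations for BezierAlign.

    top_ctrl / bottom_ctrl: (4, 2) control points of the upper and lower
    text boundaries, already scaled to feature-map coordinates.
    """
    t = np.arange(w_out) / w_out                          # Equation (5): t = g_iw / w_out
    tp = bezier_points_at(top_ctrl, t)                    # (w_out, 2) upper-boundary points
    bp = bezier_points_at(bottom_ctrl, t)                 # (w_out, 2) lower-boundary points
    r = (np.arange(h_out) / h_out)[:, None, None]         # g_ih / h_out in Equation (6)
    return bp[None] * r + tp[None] * (1.0 - r)            # Equation (6): (h_out, w_out, 2)
```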

3.4 Attention-based Recognition Branch

Benefiting from the shared backbone features and BezierAlign, we design a light-weight recognition branch, as shown in Table 1, for faster execution. It consists of 6 convolutional layers, 1 bidirectional LSTM layer, and an attention-based recognition module. In the conference version [16], we applied the CTC loss [59] for text string alignment between predictions and ground truth, but we find that the attention-based recognition module [10], [60], [83], [84] is more powerful and leads to better results. In the inference phase, the RoI region is replaced by the detected Bezier curve, as in §3.1. Note that in [16], we only use the generated Bezier curves to extract the RoI features during training. In this paper, we also take advantage of the detection results (see §3.5).

The attention mechanism takes zero RNN initial states and the embedding features of an initial symbol for the sequential prediction. During each step, the c-category softmax prediction (representing the predicted character), the previous hidden state, and a weighted sum of the cropped Bezier curve features are recursively used to compute the results. The prediction continues until an End-of-Sequence (EOS) symbol is predicted. The number of classes is set to 96 (excluding the EOS symbol) for English, while for the bilingual task including Chinese and English, the number of classes is set to 5462. Formally, at time step $t$, the attention weights are calculated by:

$$e_{t,s} = K^{\top} \tanh(W h_{t-1} + U h_s + b), \qquad (7)$$

where $h_{t-1}$ is the previous hidden state, and $K$, $W$, $U$, and $b$ are learnable weight matrices and bias. The weighted sum of the sequential feature vectors is formulated as:

$$c_t = \sum_{s=1}^{n} a_{t,s} h_s, \qquad (8)$$

where $a_{t,s}$ is defined as:

$$a_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'=1}^{n} \exp(e_{t,s'})}. \qquad (9)$$

Then, the hidden state can be updated, as follows:

$$h_t = \mathrm{GRU}\big((\mathrm{embed}_{t-1},\, c_t),\, h_{t-1}\big). \qquad (10)$$

Here $\mathrm{embed}_{t-1}$ is an embedding vector of the previous decoding result $y_{t-1}$, which is generated by the classifier:

$$y_t = w\, h_t + b. \qquad (11)$$


TABLE 1 – Structure of the recognition branch, which is a simplified version of CRNN [58]. For all convolutional layers, the padding size is set to 1. n denotes the batch size, c the channel size, h and w the height and width of the output feature map, and nclass the number of predicted classes.

Layers (CNN - RNN) | Parameters (kernel size, stride) | Output Size (n, c, h, w)
conv. layers ×4 | (3, 1) | (n, 256, h, w)
conv. layers ×2 | (3, (2, 1)) | (n, 256, h, w)
average pool over h | - | (n, 256, 1, w)
channels permute | - | (w, n, 256)
BLSTM | - | (w, n, 512)
attention-based decoder | - | (w, n, nclass)
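A hedged PyTorch sketch of the convolutional and recurrent part of Table 1 (channel counts and the stride-(2, 1) placement follow the table; the activation choice and class name are assumptions, and the attention decoder is omitted here):

```python
import torch
import torch.nn as nn

class RecognitionFeatureExtractor(nn.Module):
    """CNN + BLSTM feature extractor of the light-weight recognition branch (Table 1)."""

    def __init__(self, in_channels=256, hidden=256):
        super().__init__()
        convs = []
        for _ in range(4):                                    # conv. layers x4, kernel 3, stride 1
            convs += [nn.Conv2d(in_channels, 256, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
            in_channels = 256
        for _ in range(2):                                    # conv. layers x2, stride (2, 1) on height
            convs += [nn.Conv2d(256, 256, 3, stride=(2, 1), padding=1), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*convs)
        self.blstm = nn.LSTM(256, hidden, bidirectional=True)  # output size 2 * hidden = 512

    def forward(self, x):                                     # x: (n, c, h, w) BezierAlign-ed features
        x = self.convs(x)
        x = x.mean(dim=2)                                     # average pool over h -> (n, 256, w)
        x = x.permute(2, 0, 1)                                # channels permute -> (w, n, 256)
        out, _ = self.blstm(x)                                # (w, n, 512), fed to the attention decoder
        return out
```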

Therefore, we use the softmax function to estimate the probability distribution $p(u_t)$:

$$u_t = \mathrm{softmax}(V^{\top} h_t), \qquad (12)$$

where $V$ represents the parameters to be learned. To stabilize training, we also use a teacher forcing strategy [85], which delivers a ground-truth character instead of the GRU prediction for the next step, under a predefined probability set to 0.5 in our implementation.
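For illustration, below is a minimal PyTorch sketch of one decoding step following Equations (7) to (12); the dimension names and module layout are assumptions rather than the exact released implementation, and during training the teacher forcing strategy above would replace `prev_label` with the ground-truth character with probability 0.5.

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=97, embed_dim=256):
        super().__init__()
        self.W = nn.Linear(hidden, hidden)           # applied to the previous hidden state h_{t-1}
        self.U = nn.Linear(feat_dim, hidden)         # applied to the sequence features h_s
        self.K = nn.Linear(hidden, 1, bias=False)    # scoring vector of Equation (7)
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats, prev_hidden, prev_label):
        # feats: (w, n, feat_dim) BLSTM outputs; prev_hidden: (n, hidden); prev_label: (n,)
        e = self.K(torch.tanh(self.W(prev_hidden).unsqueeze(0) + self.U(feats)))  # (w, n, 1), Eq. (7)
        a = torch.softmax(e, dim=0)                                               # Eq. (9)
        c = (a * feats).sum(dim=0)                                                # (n, feat_dim), Eq. (8)
        h = self.gru(torch.cat([self.embed(prev_label), c], dim=1), prev_hidden)  # Eq. (10)
        logits = self.classifier(h)                                               # Eq. (11)/(12) pre-softmax
        return logits, h
```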

3.5 Adaptive End-to-End Training

In our published conference version [16], we only use the ground truth for BezierAlign in the text recognition branch during the training phase, while in the testing phase, we use the detection results for feature cropping. Based on our observations, errors may occur when the detection results are not as accurate as the ground-truth Bezier curve bounding boxes. To alleviate this issue, we propose a simple yet effective strategy, termed Adaptive End-to-End Training (AET).

Formally, the detection results are first filtered by confidence thresholds, and then NMS is used to eliminate redundant detections. The corresponding recognition ground truth is then assigned to each detection result based on the minimum sum of distances between the coordinates of the control points:

$$\mathrm{rec} = \arg\min_{\mathrm{rec}^* \in cp^*} \sum_{i=1}^{n} \left| cp^*_{x_i, y_i} - cp_{x_i, y_i} \right|, \qquad (13)$$

where $cp^*$ denotes the ground-truth control points and $n$ is the number of control points. After assigning the recognition annotation to the detection results, we simply concatenate the new targets to the original ground-truth set for further recognition training.
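A small sketch of the AET assignment of Equation (13), assuming detections and ground-truth instances are given as arrays of control points with aligned ordering; the function and variable names are illustrative:

```python
import numpy as np

def assign_recognition_targets(det_ctrl_pts, gt_ctrl_pts, gt_texts):
    """Assign each detection the text of the nearest ground-truth Bezier instance.

    det_ctrl_pts: (D, 8, 2) control points of detections surviving NMS.
    gt_ctrl_pts:  (G, 8, 2) ground-truth control points; gt_texts: list of G strings.
    """
    assigned = []
    for det in det_ctrl_pts:
        # Equation (13): summed absolute coordinate differences over all control points.
        dists = np.abs(gt_ctrl_pts - det[None]).sum(axis=(1, 2))
        assigned.append(gt_texts[int(dists.argmin())])
    return assigned   # these detection/text pairs are appended to the recognition training set
```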

3.6 Text Spotting Quantization

Scene text reading applications usually require real-time performance; however, few works have attempted to use quantization for the scene text spotting task. Model quantization aims at discretizing full-precision tensors into low-bit tensors without much degradation of network performance. Only a limited number of representation levels (quantization levels) are available: given a quantization bit width of b bits, the number of quantization levels is $2^b$. It is easy to see that deep learning models might suffer a significant performance drop as the quantization bit width becomes low. To maintain accuracy, the discretization error should be minimized:

$$Q^*(x) = \arg\min_{Q} \sum \big(Q(x) - x\big)^2, \qquad (14)$$

where $Q(x)$ is the quantization function. Motivated by LSQ [86], we employ the following equations as the activation quantizers in this paper. Specifically, for any value $x^a$ from the activation tensor $X^a$, its quantized value $Q(x^a)$ is computed by a series of transformations.

Firstly, as indicated in PACT [87], not all the full-precision data should be linearly mapped to quantized values. It is common to find some abnormally large values, which occur rarely in the full-precision tensor. We visualize the data distribution of some layers of ABCNet v2 in Figure 6 and observe a similar phenomenon. Therefore, a learnable parameter $\alpha^a$ is introduced to dynamically define the discretization range, with data beyond it clipped to the boundary:

$$y^a = \min(\max(x^a, 0), \alpha^a). \qquad (15)$$

Secondly, data in the clipped range (the so-called quantization interval) is linearly mapped to the nearest integer, as shown in Equation (16):

$$z^a = \left\lfloor \frac{y^a}{\alpha^a} \cdot (l - 1) \right\rceil, \qquad (16)$$

where $l = 2^b$ is the number of quantization levels mentioned above and $\lfloor\cdot\rceil$ is the nearest-rounding function. Thirdly, to keep the data magnitude similar before and after quantization, we apply the corresponding scale factor to $z^a$ to obtain $Q(x^a)$:

$$Q(x^a) = z^a \cdot \frac{\alpha^a}{l - 1}. \qquad (17)$$

In summary, the quantization of activations can be written as:

$$Q(x^a) = \left\lfloor \min\big(\max(x^a, 0), \alpha^a\big) \cdot \frac{l-1}{\alpha^a} \right\rceil \cdot \frac{\alpha^a}{l-1}
= \left\lfloor \min\!\left(\max\!\left(\frac{x^a}{\alpha^a}, 0\right), 1\right) \cdot (l-1) \right\rceil \cdot \frac{\alpha^a}{l-1}. \qquad (18)$$

Different from activations, weight parameters generally contain both positive and negative values, so extra linear transforms are introduced before discretization as follows:

$$z^w = \left\lfloor \left( \min\!\left(\max\!\left(\frac{x^w}{\alpha^w}, -1\right), 1\right) + 1 \right) \Big/ 2 \cdot (l-1) \right\rceil, \qquad
Q(x^w) = \left( \frac{z^w}{l-1} \cdot 2 - 1 \right) \cdot \alpha^w = \big( 2 \cdot z^w - (l-1) \big) \cdot \frac{\alpha^w}{l-1}. \qquad (19)$$

One issue for model quantization is the gradient vanishing caused by the rounding function $\lfloor\cdot\rceil$, because the rounding function has an almost-everywhere zero gradient. The straight-through estimator (STE) is employed to tackle this problem. Specifically, we override the derivative of the rounding function to be constantly 1 ($\partial\lfloor\cdot\rceil = 1$). We employ a mini-batch gradient descent optimizer to train the quantization-related parameters $\alpha^a$ and $\alpha^w$ in each layer together with the original parameters of the network.
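A compact PyTorch sketch of this LSQ/PACT-style activation and weight quantizer with a straight-through estimator, written as a simulated (fake) quantizer for clarity; the per-layer clipping parameters and their initial values are assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn

def round_ste(x):
    """Nearest rounding whose gradient is treated as 1 (straight-through estimator)."""
    return (x.round() - x).detach() + x

class ActQuant(nn.Module):
    """b-bit activation quantizer following Equations (15)-(18)."""
    def __init__(self, bits=8, init_alpha=6.0):
        super().__init__()
        self.levels = 2 ** bits
        self.alpha = nn.Parameter(torch.tensor(init_alpha))   # learnable clipping threshold

    def forward(self, x):
        y = torch.clamp(x / self.alpha, 0.0, 1.0)              # clip to [0, alpha], normalized, Eq. (15)
        z = round_ste(y * (self.levels - 1))                   # map to integer levels, Eq. (16)
        return z * self.alpha / (self.levels - 1)              # rescale, Eq. (17)

class WeightQuant(nn.Module):
    """b-bit weight quantizer following Equation (19)."""
    def __init__(self, bits=8, init_alpha=1.0):
        super().__init__()
        self.levels = 2 ** bits
        self.alpha = nn.Parameter(torch.tensor(init_alpha))

    def forward(self, w):
        y = (torch.clamp(w / self.alpha, -1.0, 1.0) + 1.0) / 2.0
        z = round_ste(y * (self.levels - 1))
        return (z / (self.levels - 1) * 2.0 - 1.0) * self.alpha
```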

It should be noted that for each convolutional layer, the quantization parameters $\alpha^a$ and $\alpha^w$ are shared by all elements in the input activation tensor $X^a$ and weight


[Figure 6 panels: (a) 1st layer activation; (b) 21st layer activation; (c) 31st layer activation. Histograms of activation values with the maximum value marked.]

Fig. 6 – Data distribution in ABCNet v2, with the maximum value marked in each panel. We can see that the abnormally large values have a low occurrence frequency.

tensor $X^w$, respectively. Thus, it is possible to exchange the computational order during network forward propagation, as depicted in Equation (20), for better efficiency. With this exchange, the time-consuming convolutional computation is operated in integer format only (all elements $z^a \in Z^a$ and $z^w \in Z^w$ are b-bit integers). Therefore, benefits in terms of latency, memory footprint, and energy consumption can be achieved compared with the corresponding floating-point counterpart.

$$Q(X^a) \cdot Q(X^w) = \Big( Z^a \cdot \big(2 \cdot Z^w - (l-1)\big) \Big) \cdot \frac{\alpha^a \cdot \alpha^w}{(l-1)^2}. \qquad (20)$$

In theory, for a b-bit quantization network, the input activations and weights enjoy a $\frac{32}{b}\times$ memory saving. For energy consumption, we list the estimated energy cost per operation of different types on chip in Table 2. As we can see, the energy cost of floating-point ADD and MULT is much larger than that of the fixed-point operations. Moreover, DRAM access costs orders of magnitude more energy than the ALU operations. Therefore, it is clear that the quantized model can potentially save considerable energy compared to the full-precision counterpart.

In terms of inference latency, the actual speedup of quantized models against full-precision counterparts is determined by the throughput of fixed-point versus floating-point arithmetic on the platform. Table 3 shows the operations per cycle per SM on the Nvidia Turing architecture. We can learn from Table 3 that an 8-bit network can potentially achieve a 2× speedup over the full-precision counterpart on this platform. More impressively, a 4-bit network and a binary neural network (1-bit) are able to run faster than the full-precision model by 4× and 16×, respectively.

4 EXPERIMENTS

To evaluate the effectiveness of ABCNet v2, we conduct experiments on various scene text benchmarks,

TABLE 2 – Energy consumption of different operations in a 45 nm CMOS process.

Operation | Energy (pJ)
32-bit fixed-point ADD | 0.1
32-bit floating-point ADD | 0.9
32-bit fixed-point MULT | 3.1
32-bit floating-point MULT | 3.7
32-bit 32KB SRAM access | 5
32-bit DRAM access | 640

TABLE 3 – Computational ability (Ops per cycle per SM) comparison on the Nvidia Turing architecture.

Input precision | Output | Ops/Cycle/SM
FP16 | FP16 or FP32 | 1024
INT8 | INT32 | 2048
INT4 | INT32 | 4096
INT1 | INT32 | 16384

including the multi-oriented scene text benchmarks ICDAR'15 [32], MSRA-TD500 [26], and ReCTS [39], and two arbitrarily-shaped benchmarks, Total-Text [41] and SCUT-CTW1500 [42]. The ablation studies are conducted on Total-Text and SCUT-CTW1500 to verify each component of our proposed method.

4.1 Implementation Details

The backbone here follows the common setting of most previous work, i.e., ResNet-50 [88] together with a Feature Pyramid Network (FPN) [89], unless specified otherwise. For the detection branch, we apply RoIAlign on 5 feature maps with 1/8, 1/16, 1/32, 1/64, and 1/128 of the input image resolution, while for the recognition branch, BezierAlign is conducted on three feature maps with 1/4, 1/8, and 1/16 sizes, and the width and height of the sampling grid are set to 8 and 32, respectively. For the English-only datasets, the pretraining data are collected from publicly available English word-level datasets, including the 150K synthesized images described in the next section, 7K ICDAR-MLT images [38], and the corresponding training data of each dataset. The pretrained model is then fine-tuned on the training set of the target dataset. Note that the 15K COCO-Text [90] images used in our previous manuscript [16] are not used in this improved version. For the ReCTS dataset, we adopt LSVT [91], ArT [92], ReCTS [39], and the synthetic pretraining data to train the model.

In addition, we also adopt data augmentation strategies, e.g., random scale training, with the short side randomly chosen from 640 to 896 (at intervals of 32) and the long side kept below 1600; and random cropping, where we make sure that the cropped image does not cut through text instances (for some special cases where this condition is hard to meet, we do not apply random cropping).

We train our model on 4 Tesla V100 GPUs with an image batch size of 8. The maximum number of iterations is 260K; the initial learning rate is 0.01, which is reduced to 0.001 at the 160K-th iteration and 0.0001 at the 220K-th iteration.


Fig. 7 – Qualitative results of ABCNet v2 on various datasets. The detection results are shown with blue bounding boxes. Prediction confidence scores are also shown. Best viewed on screen.

4.2 Benchmarks

Bezier Curve Synthetic Dataset 150K. For end-to-end scene text spotting methods, a massive amount of free synthesized data is always necessary. However, the existing 800K SynthText dataset [93] only provides quadrilateral bounding boxes for mostly straight text. To diversify and enrich arbitrarily-shaped scene text, we make an effort to synthesize a dataset of 150K images (94,723 images containing mostly straight text, and 54,327 images containing mostly curved text) with the VGG synthetic method [93].

Specifically, we filter out 40K text-free background images from COCO-Text [90] and then prepare the segmentation mask and scene depth of each background image for the subsequent text rendering. To enlarge the shape diversity of the synthetic text, we modify the VGG synthetic method by synthesizing scene text with various artistic fonts and corpora, and generate polygonal annotations for all the text instances. The annotations are then used to produce the Bezier curve ground truth by the generation method described in §3.1.1. Examples of our synthesized data are shown in Figure 8. For Chinese pretraining, we synthesize 100K images following the same method, with some examples shown in Figure 9.

Total-Text dataset [41] is one of the most important arbitrarily-shaped scene text benchmarks, proposed in 2017, which was collected from various scenes, including text-like scene complexity and low-contrast backgrounds. It contains 1,555 images, with 1,255 for training and 300 for testing. To resemble real-world scenarios, most of the images in this dataset contain a large amount of regular text, while it is guaranteed that each image has at least one curved text instance. The text instances are annotated with word-level polygons. Its extended version [41] improves the annotation of the training set by annotating each text instance with a fixed ten points following the text recognition sequence. The dataset contains English text only. To evaluate the end-to-end results, we follow the same metric as previous methods, which uses the F-measure to measure word accuracy.

SCUT-CTW1500 dataset [42] is another important arbitrarily-shaped scene text benchmark, proposed in 2017. Compared to Total-Text, this dataset contains both English and Chinese text. In addition, the annotation is at text-line level, and it also includes some document-like text, i.e., numerous small words may stack together. SCUT-CTW1500 contains 1K training images and 500 testing images.

ICDAR 2015 dataset [32] provides images that are incidentally captured in the real world, unlike previous ICDAR datasets, in which the text is clean, well captured, and horizontally centered in the images. The dataset includes 1,000 training images and 500 testing images with complicated backgrounds. Some text may appear in any orientation and any location, with small size or low resolution.

Fig. 8 – Examples of English Bezier curve synthesized data.


Fig. 9 – Examples of Chinese Bezier curve synthesized data.

The annotation is word-level, and the dataset includes only English samples.

MSRA-TD500 dataset [26] contains 500 multi-oriented Chinese and English images, with 300 images for training and 200 images for testing. Most of the images are captured indoors. To overcome the insufficiency of the training data, we use the synthetic Chinese data mentioned above for model pretraining.

ICDAR'19-ReCTS dataset [39] contains 25K annotated signboard images, of which 20K images are for the training set and the rest for the testing set. Compared to English text, Chinese text normally has a significantly larger number of classes, with more than 6K commonly used characters, complicated layouts, and various fonts. This dataset mainly contains text from shop signs, and it also provides annotations for each character.

ICDAR'19-ArT dataset [92] is currently the largest dataset for arbitrarily-shaped scene text. It is a combination and extension of Total-Text and SCUT-CTW1500. The new images also contain at least one arbitrarily-shaped text instance per image, with high diversity in text orientations. The ArT dataset is split into a training set with 5,603 images and a testing set with 4,563 images. All the English and Chinese text instances are annotated with tight polygons.

ICDAR'19-LSVT dataset [91] provides an unprecedentedly large number of text instances from street views. It provides 450K images in total with rich information of real scenes, among which 50K are fully annotated (30K for training and the remaining 20K for testing). Similar to ArT [92], this dataset also contains some curved text, which is annotated with polygons.

4.3 Ablation Study

To evaluate the effectiveness of the proposed components, we conduct ablation studies on two datasets, Total-Text and SCUT-CTW1500. We find that there is some training variance caused by different initializations. In addition, the text spotting task requires that all characters be correctly recognized. To account for this, we train each method three times and report the average results. The results are shown in Table 5, which demonstrate that all modules lead to an obvious improvement over the baseline model on both datasets.

We can see that using the attention-based recognition module, the results can be improved by 2.7% on Total-Text and 7.9% on SCUT-CTW1500, respectively.

We then evaluate all other modules using the attention-based recognition branch. Some conclusions are as follows:

TABLE 4 – Ablation study for BezierAlign. Horizontal sampling follows [3], and quadrilateral sampling follows [4].

Method | Sampling method | F-measure (%)
baseline | Horizontal Sampling | 38.4
baseline | Quadrilateral Sampling | 44.7
baseline | BezierAlign | 61.9

• Using a BiFPN architecture, the results can be improved by an additional 1.9% and 1.6%, while the inference speed is only reduced by 1 FPS. Thus, we achieve a better trade-off between speed and accuracy.

• Using the coordinate convolution mentioned in §3.2, the results can be significantly improved by 2.8% and 2.9% on the two datasets, respectively. Note that such improvement does not introduce noticeable computation overhead.

• We also test the AET strategy mentioned in §3.5, which results in 1.2% and 1.7% improvement.

• Lastly, we conduct experiments to show how the order of the Bezier curve affects the results. Specifically, we regenerate all the ground truths for the same synthetic and real images using 4th-order Bezier curves. We then train ABCNet v2 by regressing the control points and using 4th-order BezierAlign. Other parts remain the same as in the 3rd-order setting. The results shown in Table 5 demonstrate that increasing the order can be conducive to the text spotting results, especially on SCUT-CTW1500, which adopts text-line annotation. We further conduct experiments using 5th-order Bezier curves on the Total-Text dataset following the same experimental setting; however, compared with the baseline, we find that the performance drops from 66.2% to 65.5% in terms of the E2E Hmean. Based on this observation, we surmise that the decline might be because an extremely high order may result in drastic variation of the control points, which could exacerbate the difficulty of the regression. Some results using 4th-order Bezier curves are shown in Figure 10. We can see that the detection bounding box can be more compact, and thus the textual features can be more accurately cropped for subsequent recognition.

We further evaluate BezierAlign by comparing it with previous sampling methods, shown in Figure 5. For a fair and fast comparison, we use a small training and testing scale. The results shown in Table 4 demonstrate that BezierAlign can dramatically improve the end-to-end results. Qualitative examples are shown in Figure 11. Another ablation study is conducted to evaluate the time consumption of Bezier curve detection, and we observe that Bezier curve detection only introduces negligible computation overhead compared with standard bounding box detection.

4.4 Comparison with State-of-the-art

We compare our method to previous methods on both the detection and end-to-end text spotting tasks. An optimal setting, including inference thresholds and testing scale, is


TABLE 5 – Ablation study on both the Total-Text and SCUT-CTW1500 datasets. Attn: attention recognition module. 4O: using 4th-order Bezier curves.

Method | Total-Text E2E Hmean | Impr. | FPS | SCUT-CTW1500 E2E Hmean | Impr.
baseline (CTC) [16] | 63.5 | - | 13 | 45.0 | -
baseline + Attn | 66.2 | ↑ 2.7% | 11 | 52.9 | ↑ 7.9%
baseline + Attn + biFPN | 68.1 | ↑ 4.6% | 10 | 54.5 | ↑ 9.5%
baseline + Attn + CoordConv | 69.0 | ↑ 5.5% | 11 | 55.8 | ↑ 10.8%
baseline + Attn + AET | 67.4 | ↑ 3.9% | 11 | 54.6 | ↑ 9.6%
baseline + Attn + 4O | 67.0 | ↑ 3.5% | 11 | 54.4 | ↑ 9.4%
ABCNet v2 (Attn + biFPN + CoordConv + AET) | 70.4 | ↑ 6.9% | 10 | 57.5 | ↑ 12.5%

TABLE 6 – Detection results on the Total-Text, SCUT-CTW1500, MSRA-TD500, ICDAR 2015, and ReCTS datasets. R, P, and H denote recall, precision, and H-mean, respectively.

Methods | Total-Text (R / P / H) | SCUT-CTW1500 (R / P / H) | MSRA-TD500 (R / P / H) | ICDAR 2015 (R / P / H) | ReCTS (R / P / H)
SegLink [33] | 30.3 / 23.8 / 26.7 | 48.4 / 38.3 / 42.8 | 70.0 / 86.0 / 77.0 | 76.8 / 73.1 / 75.0 | -
DMPNet [34] | - | 61.7 / 63.9 / 62.7 | - | 68.2 / 73.2 / 70.6 | -
CTD+TLOC [42] | 74.5 / 82.7 / 78.4 | 85.3 / 67.9 / 75.6 | 73.9 / 83.2 / 78.3 | 77.1 / 84.5 / 80.6 | -
TextSnake [47] | 74.5 / 82.7 / 78.4 | 77.8 / 82.7 / 80.1 | 81.7 / 84.2 / 82.9 | 84.9 / 80.4 / 82.6 | -
EAST [35] | 50.0 / 36.2 / 42.0 | 49.7 / 78.7 / 60.4 | 67.4 / 87.3 / 76.1 | 78.3 / 83.3 / 80.7 | 73.7 / 74.3 / 74.0
He et al. [4] | - | - | 61.0 / 71.0 / 69.0 | - | -
DeepReg [80] | - | - | 70.0 / 77.0 / 74.0 | - | -
Textboxes++ [94] | - | - | - | 78.5 / 87.8 / 82.9 | -
LSE [81] | - | 77.8 / 82.7 / 80.1 | 81.7 / 84.2 / 82.9 | - | -
ATTR [52] | 76.2 / 80.9 / 78.5 | 80.2 / 80.1 / 80.1 | 82.1 / 85.2 / 83.6 | 83.3 / 90.4 / 86.8 | -
MSR [51] | 73.0 / 85.2 / 78.6 | 79.0 / 84.1 / 81.5 | 76.7 / 87.4 / 81.7 | - | -
TextDragon [9] | 75.7 / 85.6 / 80.3 | 82.8 / 84.5 / 83.6 | - | - | -
TextField [76] | 79.9 / 81.2 / 80.6 | 79.8 / 83.0 / 81.4 | 75.9 / 87.4 / 81.3 | 80.1 / 84.3 / 82.4 | -
PSENet-1s [46] | 78.0 / 84.0 / 80.9 | 79.7 / 84.8 / 82.2 | - | 85.5 / 88.7 / 87.1 | 83.9 / 87.3 / 85.6
SegLink++ [48] | 80.9 / 82.1 / 81.5 | 79.8 / 82.8 / 81.3 | - | 80.3 / 83.7 / 82.0 | -
LOMO [49] | 79.3 / 87.6 / 83.3 | 76.5 / 85.7 / 80.8 | - | 83.5 / 91.3 / 87.2 | -
CRAFT [44] | 79.9 / 87.6 / 83.6 | 81.1 / 86.0 / 83.5 | 78.2 / 88.2 / 82.9 | 84.3 / 89.8 / 86.9 | -
PAN [45] | 81.0 / 89.3 / 85.0 | 81.2 / 86.4 / 83.7 | 83.8 / 84.4 / 84.1 | 81.9 / 84.0 / 82.9 | -
Mask TTD [95] | 74.5 / 79.1 / 76.7 | 79.0 / 79.7 / 79.4 | 81.1 / 85.7 / 83.3 | 87.6 / 86.6 / 87.1 | -
ContourNet [50] | 83.9 / 86.9 / 85.4 | 84.1 / 83.7 / 83.9 | - | 86.1 / 87.6 / 86.9 | -
DB [43] | 82.5 / 87.1 / 84.7 | 80.2 / 86.9 / 83.4 | 77.7 / 76.6 / 81.9 | 82.7 / 88.2 / 85.4 | -
Mask TextSpotter [10] | 82.4 / 88.3 / 85.2 | - | - | 87.3 / 86.6 / 87.0 | 88.8 / 89.3 / 89.0
DRRN [53] | 84.9 / 86.5 / 85.7 | 83.0 / 85.9 / 84.5 | 82.3 / 88.1 / 85.1 | 84.7 / 88.5 / 86.6 | -
ABCNet v2 | 84.1 / 90.2 / 87.0 | 83.8 / 85.6 / 84.7 | 81.3 / 89.4 / 85.2 | 86.0 / 90.4 / 88.1 | 87.5 / 93.6 / 90.4

TABLE 7 – End-to-end text spotting results on Total-Text, SCUT-CTW1500, ICDAR 2015, and ReCTS. “None” denotes lexicon-free. “Full” denotes a lexicon containing all the words appearing in the test set. “S”, “W”, and “G” denote recognition with the “Strong”, “Weak”, and “Generic” lexicons, respectively. FPS is for reference only, as it varies with settings and devices. The RoIRotate* results are reported in [9]; for ABCNet v2*, * denotes multi-scale testing.

Methods                     Total-Text     SCUT-CTW1500   ICDAR 2015 End-to-End   ReCTS   FPS
                            None   Full    None   Full    S      W      G         1-NED
TextBoxes++ [94]            36.3   48.9    -      -       73.3   65.9   51.9      -       1.4
Mask TextSpotter’18 [6]     52.9   71.8    -      -       79.3   73.0   62.4      -       2.6
TextNet [75]                54.0   -       -      -       -      -      -         -       2.7
Li et al. [11]              57.8   -       -      -       -      -      -         -       1.4
Deep Text Spotter [74]      -      -       -      -       54.0   51.0   47.0      -       -
Mask TextSpotter’19 [10]    65.3   77.4    -      -       83.0   77.7   73.5      67.8    2.0
Qin et al. [7]              67.8   -       -      -       -      -      -         -       4.8
CharNet [8]                 -      -       -      -       80.1   74.5   62.2      -       1.2
FOTS [5]                    -      -       21.1   39.7    83.6   79.1   65.3      50.8    -
RoIRotate* [9]              -      -       38.6   70.9    82.5   79.2   65.4      -       -
TextDragon [9]              48.8   74.8    39.7   72.4    82.5   78.3   65.2      -       2.6
ABCNet [16]                 64.2   75.7    45.2   74.1    -      -      -         -       17.9
Boundary TextSpotter [12]   -      -       -      -       79.7   75.2   64.1      -       -
Craft [96]                  78.7   -       -      -       83.1   82.1   74.9      -       5.4
Mask TextSpotter v3 [77]    71.2   78.4    -      -       83.3   78.1   74.2      -       2.5
Feng et al. [97]            55.8   79.2    42.2   74.9    87.3   83.1   69.5      -       7.2
ABCNet v2                   70.4   78.1    57.5   77.2    82.7   78.5   73.0      62.7    10
ABCNet v2*                  73.5   80.7    58.4   79.0    83.0   80.7   75.0      65.4    -

decided using grid search. For the detection task, we conduct experiments on five datasets: two arbitrarily-shaped datasets (Total-Text and SCUT-CTW1500), two multi-oriented datasets (MSRA-TD500 and ICDAR 2015), and one bilingual dataset (ReCTS). The results in Table 6 demonstrate that our method can achieve state-of-the-art



performance on all of these datasets, outperforming previous state-of-the-art methods.

For end-to-end scene text spotting, ABCNet v2 achieves the best performance on the SCUT-CTW1500 and ICDAR 2015 datasets, significantly outperforming previous methods. The results are shown in Table 7. Although our method is worse than Mask TextSpotter [10] in terms of 1-NED on the ReCTS dataset, we do not use the provided character-level bounding boxes, and our method shows a clear advantage in inference speed. Moreover, ABCNet v2 still achieves better detection performance than Mask TextSpotter [10] according to Table 6.
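For reference, 1-NED is commonly computed as one minus the normalized edit distance between each predicted transcription and its matched ground truth, averaged over all instances. The following is our own minimal sketch of this metric, not the official ReCTS evaluation script.

# Minimal sketch of the 1-NED metric (illustrative; not the official evaluation script).
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming over a single row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def one_minus_ned(preds, gts):
    # Average of 1 - edit_distance / max(len(pred), len(gt)) over matched pairs.
    scores = []
    for p, g in zip(preds, gts):
        denom = max(len(p), len(g)) or 1
        scores.append(1.0 - edit_distance(p, g) / denom)
    return sum(scores) / len(scores)

print(one_minus_ned(["ABCNet"], ["ABCNet2"]))  # 1 - 1/7 ~= 0.857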

Qualitative results on the test sets are shown in Figure 7. From the figure, we can see that ABCNet v2 achieves strong recall for various text, including horizontal, multi-oriented, and curved instances, as well as long and densely arranged text.

4.5 Comprehensive Comparison with Mask TextSpotter v3

Comparison using limited data. We find that the proposed method can achieve promising spotting results using only a small amount of training data. To validate this, we use the official code of Mask TextSpotter v3 [77] and conduct experiments under the same setting, training both models with only the official training data of Total-Text. Specifically, the optimizer and the learning rate (0.002) of our method are set to be the same as those of Mask TextSpotter v3. The batch sizes are set to 4, and both methods are trained for 230K iterations. To ensure the best setting for both methods, Mask TextSpotter v3 is trained with minimum sizes of 800, 1000, 1200, and 1400, and a maximum size of 2333; its testing is conducted with a minimum size of 1000 and a maximum size of 4000. Our method is trained with a minimum size from 640 to 896 at an interval of 32, and the maximum size is set to 1600. To stabilize the training, the AET strategy is not used in this low-data setting. Testing is conducted with a minimum size of 1000 and a maximum size of 1824. We also conduct a grid search to find the best threshold for Mask TextSpotter v3 (a minimal sketch of such a search is given below). The results at different iterations are shown in Figure 12. Although Mask TextSpotter v3 converges faster at the beginning, the final result of our method is better (56.41% vs. 53.82%).
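The threshold search itself is straightforward; the sketch below illustrates the kind of grid search we mean. It is illustrative only: evaluate_hmean and the candidate values are placeholders, not part of the released code.

# Illustrative grid search over the detection score threshold and test scale.
# evaluate_hmean() is a placeholder for running inference plus the official evaluation.
import itertools

def grid_search(evaluate_hmean, thresholds, min_sizes):
    best_cfg, best_hmean = None, -1.0
    for thr, size in itertools.product(thresholds, min_sizes):
        hmean = evaluate_hmean(score_threshold=thr, test_min_size=size)
        if hmean > best_hmean:
            best_cfg, best_hmean = (thr, size), hmean
    return best_cfg, best_hmean

# Example candidate grid (hypothetical values):
# grid_search(evaluate_hmean, thresholds=[0.3, 0.4, 0.5, 0.6], min_sizes=[800, 1000, 1200])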

Fig. 10 – Comparison between the cubic Bezier curve (left) and the 4th-order Bezier curve (right). There are some slight differences.

Fig. 11 – Qualitative recognition results of the quadrilateral sampling method and BezierAlign. Left: original image. Top right: results using quadrilateral sampling. Bottom right: results using BezierAlign.

TABLE 8 – Comparison with Mask TextSpotter v3 using a large-scale training set. MTSv3: Mask TextSpotter v3 [77].

                MTSv3   ABCNet v1   ABCNet v2
F-measure (%)   65.1    64.2        70.4
FPS             2.5     14.3        8.7

Comparison using large-scale data. We also use sufficient training data for a more thorough comparison with Mask TextSpotter v3. Specifically, we carefully train Mask TextSpotter v3 using the Bezier Curve Synthetic Dataset (150k), MLT (7k), and Total-Text, which are exactly the same data as used for our method. The training scales, batch sizes, number of iterations, and other settings are all the same as described in Section 4.1. Grid search is again used to find the best threshold for Mask TextSpotter v3. The results are shown in Table 8: Mask TextSpotter v3 outperforms ABCNet v1 by 0.9% in terms of F-measure, while ABCNet v2 outperforms Mask TextSpotter v3 (70.4% vs. 65.1%). The inference time is measured under the same testing scale (maximum size set to 1824) and device (RTX A40), which further demonstrates the effectiveness of our method.

4.6 Limitations

We further conduct error analysis on the incorrectly predicted samples. We observe two types of common errors that may limit the scene text spotting performance of ABCNet v2.



TABLE 9 – Quantization results of ABCNet v2 (ResNet-18 as the backbone). “A/W” indicates the bit widths of the activations and weights, respectively. † indicates the model is trained with the progressive training strategy. FPS is based on profiling on a single NVIDIA 2080 Ti GPU.

Data set                  A/W     End-to-end Results                 Detection-only Results             FPS
                                  precision (%)  recall (%)  hmean (%)   precision (%)  recall (%)  hmean (%)
Pretrain for Total-Text   32/32   66.1           51.0        57.6        83.3           76.7        79.9      16.4
Finetune on Total-Text    32/32   70.3           64.4        67.2        86.2           82.9        84.5      16.4
Finetune on CTW1500       32/32   58.8           48.3        53.0        88.2           81.4        84.7      37.9
Pretrain for Total-Text   4/4     67.2           52.4        58.9        83.4           77.7        80.5      25.9
Finetune on Total-Text    4/4     70.7           62.8        66.5        86.0           82.3        84.1      25.9
Finetune on CTW1500       4/4     58.0           47.3        52.2        87.2           81.0        84.0      47.8
Pretrain for Total-Text   1/1     62.0           34.4        44.3        82.0           63.6        71.6      29.5
Pretrain for Total-Text   1/1†    65.6           48.5        55.8        81.8           74.9        78.2      29.5
Finetune on Total-Text    1/1†    71.2           60.8        65.5        88.0           82.2        85.0      29.5
Finetune on CTW1500       1/1†    58.2           46.5        51.7        85.1           80.4        82.7      51.2

[Figure 12 data – end-to-end F-measure (None) at 40k/80k/120k/160k/200k/230k iterations. MTSv3: 35.83%, 45.45%, 51.91%, 51.94%, 53.82%, 52.74%. Ours: 9.30%, 46.89%, 50.51%, 56.41%, 54.87%, 53.46%.]

Fig. 12 – Comparison with Mask TextSpotter v3 using only the training set of Total-Text. MTSv3: Mask TextSpotter v3 [77].

The first type is shown in the example of Figure 13. The text instance contains two characters. For each character, the reading order is from left to right, but for the whole instance, the reading order is from top to bottom. As the Bezier curve is interpolated along the longer side of the text instance, the BezierAlign feature is rotated with respect to the original feature, which can result in a completely different meaning. Moreover, such cases make up only a minority of the training set, and are therefore prone to be mistakenly recognized or predicted as an unseen category, as represented by “�” for the second character.

The second type of error occurs with unusual fonts, as shown in the middle of Figure 13. The first two characters are written in unusual calligraphic fonts, which makes them difficult to recognize. In general, this challenge can only be alleviated with more training images.

We also find an extremely curved case in the test set of SCUT-CTW1500, where more than three crests exist in the same text instance, as shown in the third row of Figure 13. In such a case, a lower-order representation such as a cubic Bezier curve may be limited: the character “i” is incorrectly recognized as the uppercase “I” because of the inaccurate shape representation. However, such cases are rarely seen, especially in datasets that use word-level bounding boxes.

Fig. 13 – Error analysis of ABCNet v2.

4.7 Inference Speed

To further test the potential real-time performance of the proposed method, we exploit quantization techniques to significantly improve the inference speed. The baseline model adopts the same setting as the baseline with an attention-based recognition branch shown in Table 5, except that the backbone is replaced by ResNet-18, which significantly improves the speed with only a marginal accuracy reduction.

The performance of the quantized network with various quantization bit configurations (4-bit and 1-bit) is reported in Table 9, with the full-precision performance also listed for comparison. To train the quantized network, we first pretrain the low-bit model on the synthetic dataset and then fine-tune it on the dedicated datasets, Total-Text and CTW1500, for better performance.
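As a rough illustration of what k-bit quantization of the convolutional layers involves, the sketch below shows a generic uniform fake-quantizer with a straight-through estimator. It is our own simplified example; the trained models follow learned-step-size/PACT-style quantizers [86], [87].

# Generic k-bit uniform fake-quantization with a straight-through estimator (illustrative;
# the actual models use LSQ/PACT-style quantizers as in [86], [87]).
import torch

def fake_quantize(x, bits=4, signed=True):
    # Quantize a tensor to `bits` bits and dequantize, keeping gradients via STE.
    if signed:                      # e.g. weights
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:                           # e.g. non-negative activations
        qmin, qmax = 0, 2 ** bits - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / max(abs(qmin), qmax)
    q = torch.clamp(torch.round(x / scale), qmin, qmax) * scale
    # Straight-through estimator: the forward pass uses q, the backward pass acts as identity.
    return x + (q - x).detach()

# Each quantized convolution would use, e.g.,
#   w_q = fake_quantize(conv.weight, bits=4, signed=True)
#   a_q = fake_quantize(activation, bits=4, signed=False)
# while the input and output layers stay in full precision.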

Accuracy results for both the pretrained and fine-tuned models are reported. During the pre-training stage,



the model is trained for 260K iterations with a batch size of 8. The initial learning rate is set to 0.01 and divided by 10 at the 160K-th and 220K-th iterations. For fine-tuning on the Total-Text dataset, the batch size remains 8, the initial learning rate is set to 0.001, and only 5K iterations are fine-tuned. The batch size and initial learning rate are the same when fine-tuning on the CTW1500 dataset; however, the total number of iterations is set to 120K and the learning rate is divided by 10 at iteration 80K. Similar to previous quantization work, we quantize all convolutional layers in the network except the input and output layers. Unless otherwise stated, the quantized networks are initialized from their full-precision counterparts.
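Assuming a standard PyTorch-style setup (the optimizer type is our assumption, not stated above), the pre-training schedule corresponds to a simple multi-step decay:

# Minimal sketch of the pre-training schedule described above (optimizer choice is assumed).
import torch

params = [torch.nn.Parameter(torch.zeros(1))]      # placeholder for the model parameters
optimizer = torch.optim.SGD(params, lr=0.01)
# Divide the learning rate by 10 at 160K and 220K iterations; train for 260K in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[160_000, 220_000], gamma=0.1)

for iteration in range(260_000):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()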

From Table 9, we can see that the 4-bit models obtained by our quantization method achieve performance comparable with their full-precision counterparts. For example, the end-to-end hmean of the 4-bit model pretrained on the synthetic dataset is even better than that of the full-precision model (58.9% vs. 57.6%). After fine-tuning, the end-to-end hmean of the 4-bit model on Total-Text and CTW1500 is only 0.7% (67.2% vs. 66.5%) and 0.8% (53.0% vs. 52.2%) lower than that of the full-precision model, respectively.

The fact that there is almost no performance drop for the 4-bit models (the same observation has been made on image classification and object detection tasks [98]) indicates considerable redundancy in the full-precision scene text spotting model.

However, the performance drops considerably for the binary network, with an end-to-end hmean of only 44.3%. To compensate, we propose to train the BNN (binary neural network) model with progressive training, in which the quantization bit width is progressively decreased (e.g., 4-bit → 2-bit → 1-bit). With this training strategy (rows marked with † in the table), the performance of the binary network is significantly improved. For example, the end-to-end hmean of the BNN model trained on the synthetic dataset is only 1.8% (55.8% vs. 57.6%) lower than that of its full-precision counterpart.
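The progressive strategy amounts to re-running the same fine-tuning loop while lowering the bit width stage by stage, with each stage initialized from the previous one. A hedged sketch follows (the train_quantized helper and stage schedule are illustrative, not the exact recipe):

# Illustrative progressive quantization: 4-bit -> 2-bit -> 1-bit, each stage warm-started
# from the weights of the previous, higher-precision stage.
def progressive_quantization(model, train_quantized, bit_schedule=(4, 2, 1)):
    # train_quantized(model, bits) is a placeholder that fine-tunes `model` with
    # `bits`-bit weights/activations and returns the updated model.
    for bits in bit_schedule:
        model = train_quantized(model, bits)
    return model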

Apart from the accuracy evaluation, we also compare the overall speed of the quantized models against the full-precision models. In practice, only the quantized convolution layers are accelerated, while other layers, such as the LSTM, are kept in full precision. As shown in Table 9, with a limited performance drop, the binary version of ABCNet v2 is able to run in real time on both the Total-Text and CTW1500 datasets.

5 CONCLUSION

We have proposed ABCNet v2, a real-time end-to-end method that uses Bezier curves for arbitrarily-shaped scene text spotting. By reformulating arbitrarily-shaped scene text with parameterized Bezier curves, ABCNet v2 detects curved text while introducing negligible computation cost compared with standard bounding box detection. With such Bezier curve bounding boxes, we can naturally connect a lightweight recognition branch via the new BezierAlign layer, which is critical for accurate feature extraction, especially for curved text instances.

Comprehensive experiments on various datasets demonstrate the effectiveness of the proposed components, including the attention-based recognition module, the biFPN structure, coordinate convolution, and the new adaptive end-to-end training strategy. Finally, we apply quantization techniques to deploy our model for real-time tasks, showing great potential for a wide range of applications.

REFERENCES

[1] F. L. Bookstein, “Principal warps: Thin-plate splines and the decomposition of deformations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 6, pp. 567–585, 1989.
[2] M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial transformer networks,” in Proc. Advances in Neural Inf. Process. Syst., pp. 2017–2025, 2015.
[3] H. Li, P. Wang, and C. Shen, “Towards end-to-end text spotting with convolutional recurrent neural networks,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 5238–5246, 2017.
[4] T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun, “An end-to-end textspotter with explicit alignment and attention,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5020–5029, 2018.
[5] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan, “Fots: Fast oriented text spotting with a unified network,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5676–5685, 2018.
[6] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, “Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes,” in Proc. Eur. Conf. Comp. Vis., pp. 67–83, 2018.
[7] S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao, “Towards unconstrained end-to-end text spotting,” Proc. IEEE Int. Conf. Comp. Vis., 2019.
[8] X. Linjie, T. Zhi, H. Weilin, and R. S. Matthew, “Convolutional Character Networks,” in Proc. IEEE Int. Conf. Comp. Vis., 2019.
[9] F. Wei, H. Wenhao, Y. Fei, Z. Xu-Yao, and C.-L. Liu, “TextDragon: An end-to-end framework for arbitrary shaped text spotting,” in Proc. IEEE Int. Conf. Comp. Vis., 2019.
[10] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai, “Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes,” IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[11] H. Li, P. Wang, and C. Shen, “Towards end-to-end text spotting in natural scenes,” arXiv: Comp. Res. Repository, 2019.
[12] H. Wang, P. Lu, H. Zhang, M. Yang, X. Bai, Y. Xu, M. He, Y. Wang, and W. Liu, “All you need is boundary: Toward arbitrary-shaped text spotting,” in Proc. AAAI Conf. Artificial Intell., pp. 12160–12167, 2020.

[13] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 10781–10790, 2020.
[14] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski, “An intriguing failing of convolutional neural networks and the coordconv solution,” in Proc. Advances in Neural Inf. Process. Syst., pp. 9605–9616, 2018.
[15] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “Solov2: Dynamic and fast instance segmentation,” in Proc. Advances in Neural Inf. Process. Syst., vol. 33, 2020.
[16] Y. Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang, “Abcnet: Real-time scene text spotting with adaptive bezier-curve network,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9809–9818, 2020.
[17] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, “ICDAR 2003 robust reading competitions,” in Proc. Int. Conf. Document Analysis and Recogn., pp. 682–687, Citeseer, 2003.
[18] A. Shahab, F. Shafait, and A. Dengel, “ICDAR 2011 robust reading competition challenge 2: Reading text in scene images,” in Proc. Int. Conf. Document Analysis and Recognition, pp. 1491–1496, IEEE, 2011.
[19] D. Karatzas, F. Shafait, S. Uchida, et al., “ICDAR 2013 Robust Reading Competition,” in Proc. IAPR Int. Conf. Document Analysis Recog., pp. 1484–1493, 2013.
[20] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2963–2970, IEEE, 2010.
[21] W. Huang, Z. Lin, J. Yang, and J. Wang, “Text localization in natural images using stroke feature transform and text covariance descriptors,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 1241–1248, 2013.
[22] W. Huang, Y. Qiao, and X. Tang, “Robust scene text detection with convolution neural network induced mser trees,” in Proc. Eur. Conf. Comp. Vis., pp. 497–511, Springer, 2014.



[23] G. Liang, P. Shivakumara, T. Lu, and C. L. Tan, “Multi-spectral fusion based approach for arbitrarily oriented scene text detection in video images,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 4488–4501, 2015.
[24] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, “Textboxes: A fast text detector with a single deep neural network,” in Proc. AAAI Conf. Artificial Intell., 2017.
[25] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, “Detecting text in natural image with connectionist text proposal network,” in Proc. Eur. Conf. Comp. Vis., pp. 56–72, Springer, 2016.
[26] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts of arbitrary orientations in natural images,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1083–1090, 2012.
[27] R. Nagy, A. Dicker, and K. Meyer-Wegener, “NEOCR: A configurable dataset for natural image text recognition,” in Proc. Int. Workshop Camera-Based Document Analysis and Recognition, pp. 150–163, Springer, 2011.
[28] X.-C. Yin, W.-Y. Pei, J. Zhang, and H.-W. Hao, “Multi-orientation scene text detection with adaptive clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1930–1937, 2015.
[29] P. Shivakumara, T. Q. Phan, and C. L. Tan, “A laplacian approach to multi-oriented text detection in video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 412–419, 2010.
[30] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, “Robust text detection in natural scene images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 5, pp. 970–983, 2013.
[31] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, “Multi-oriented text detection with fully convolutional networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4159–4167, 2016.
[32] D. Karatzas, L. Gomez-Bigorda, et al., “ICDAR 2015 competition on robust reading,” in Proc. IAPR Int. Conf. Document Analysis Recog., pp. 1156–1160, 2015.

[33] B. Shi, X. Bai, and S. Belongie, “Detecting oriented text in natural images by linking segments,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[34] Y. Liu and L. Jin, “Deep matching prior network: Toward tighter multi-oriented text detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[35] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “EAST: An efficient and accurate scene text detector,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[36] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding, “Wordsup: Exploiting word annotations for character based text detection,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 4940–4949, 2017.
[37] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai, “ICDAR2017 competition on reading chinese text in the wild (RCTW-17),” in Proc. IAPR Int. Conf. Document Analysis and Recognition, vol. 1, pp. 1429–1434, 2017.
[38] N. Nayef, Y. Patel, M. Busta, P. N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, J.-C. Burie, C.-l. Liu, et al., “ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition–RRC-MLT-2019,” Proc. IAPR Int. Conf. Document Analysis Recog., 2019.
[39] R. Zhang, Y. Zhou, Q. Jiang, Q. Song, N. Li, K. Zhou, L. Wang, D. Wang, M. Liao, M. Yang, et al., “ICDAR 2019 robust reading challenge on reading chinese text on signboard,” in Proc. IAPR Int. Conf. Document Analysis Recog., pp. 1577–1581, 2019.
[40] A. Risnumawan, P. Shivakumara, C.-S. Chan, and C. Tan, “A robust arbitrary text detection system for natural scene images,” Expert Systems with Applications, vol. 41, pp. 8027–8048, 2014.
[41] C.-K. Ch’ng, C. S. Chan, and C.-L. Liu, “Total-text: toward orientation robustness in scene text detection,” Int. J. Document Analysis Recogn., pp. 1–22, 2019.
[42] Y. Liu, L. Jin, S. Zhang, C. Luo, and S. Zhang, “Curved scene text detection via transverse and longitudinal sequence connection,” Pattern Recognition, vol. 90, pp. 337–345, 2019.
[43] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai, “Real-time scene text detection with differentiable binarization,” in Proc. AAAI Conf. Artificial Intell., pp. 11474–11481, 2020.
[44] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness for text detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9365–9374, 2019.
[45] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, and C. Shen, “Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network,” Proc. IEEE Int. Conf. Comp. Vis., 2019.

[46] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao, “Shape Robust Text Detection with Progressive Scale Expansion Network,” Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[47] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, “Textsnake: A flexible representation for detecting text of arbitrary shapes,” in Proc. Eur. Conf. Comp. Vis., pp. 20–36, 2018.
[48] J. Tang, Z. Yang, Y. Wang, Q. Zheng, Y. Xu, and X. Bai, “Seglink++: Detecting dense and arbitrary-shaped scene text by instance-aware component grouping,” Pattern Recognition, vol. 96, p. 106954, 2019.
[49] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding, “Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes,” Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019.
[50] Y. Wang, H. Xie, Z.-J. Zha, M. Xing, Z. Fu, and Y. Zhang, “Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 11753–11762, 2020.
[51] C. Xue, S. Lu, and W. Zhang, “MSR: Multi-Scale Shape Regression for Scene Text Detection,” Proc. Int. Joint Conf. Artificial Intell., 2019.
[52] X. Wang, Y. Jiang, Z. Luo, C.-L. Liu, H. Choi, and S. Kim, “Arbitrary Shape Scene Text Detection with Adaptive Text Region Representation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6449–6458, 2019.
[53] S.-X. Zhang, X. Zhu, J.-B. Hou, C. Liu, C. Yang, H. Wang, and X.-C. Yin, “Deep relational reasoning graph network for arbitrary shape text detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9699–9708, 2020.
[54] L. Neumann and J. Matas, “Real-time scene text localization and recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3538–3545, IEEE, 2012.
[55] C. Yao, X. Bai, B. Shi, and W. Liu, “Strokelets: A learned multi-scale representation for scene text recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4042–4049, 2014.
[56] B. Su and S. Lu, “Accurate scene text recognition based on recurrent neural network,” in Proc. Asian Conf. Comp. Vis., pp. 35–48, Springer, 2014.
[57] B. Su and S. Lu, “Accurate recognition of words in scenes without character segmentation using recurrent neural network,” Pattern Recognition, vol. 63, pp. 397–405, 2017.

[58] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, 2017.
[59] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., pp. 369–376, ACM, 2006.
[60] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. Int. Conf. Learn. Representations, 2015.
[61] F. L. B. P. Warps, “Thin-plate splines and the decompositions of deformations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 6, 1989.
[62] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, “Robust scene text recognition with automatic rectification,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4168–4176, 2016.
[63] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: An attentional scene text recognizer with flexible rectification,” IEEE Trans. Pattern Anal. Mach. Intell., 2018.
[64] C. Luo, L. Jin, and Z. Sun, “Moran: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109–118, 2019.
[65] W. Liu, C. Chen, and K.-Y. K. Wong, “Char-Net: A character-aware neural network for distorted scene text recognition,” in Proc. AAAI Conf. Artificial Intell., pp. 7154–7161, 2018.
[66] F. Zhan and S. Lu, “Esir: End-to-end scene text recognition via iterative image rectification,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2059–2068, 2019.
[67] R. Litman, O. Anschel, S. Tsiper, R. Litman, S. Mazor, and R. Manmatha, “SCATTER: selective context attentional scene text recognizer,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[68] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, “AON: Towards arbitrarily-oriented text recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5571–5579, 2018.
[69] H. Li, P. Wang, C. Shen, and G. Zhang, “Show, attend and read: A simple and strong baseline for irregular text recognition,” in Proc. AAAI Conf. Artificial Intell., pp. 8610–8617, 2019.



[70] X. Yue, Z. Kuang, C. Lin, H. Sun, and W. Zhang, “RobustScanner: Dynamically enhancing positional clues for robust text recognition,” in Proc. Eur. Conf. Comp. Vis., 2020.
[71] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai, “Scene text recognition from two-dimensional perspective,” in Proc. AAAI Conf. Artificial Intell., pp. 8714–8721, 2019.
[72] Z. Wan, M. He, H. Chen, X. Bai, and C. Yao, “Textscanner: Reading characters in order for robust scene text recognition,” in Proc. AAAI Conf. Artificial Intell., 2020.
[73] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Advances in Neural Inf. Process. Syst., pp. 91–99, 2015.
[74] M. Busta, L. Neumann, and J. Matas, “Deep textspotter: An end-to-end trainable scene text localization and recognition framework,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 2204–2212, 2017.
[75] Y. Sun, C. Zhang, Z. Huang, J. Liu, J. Han, and E. Ding, “TextNet: Irregular Text Reading from Images with an End-to-End Trainable Network,” in Proc. Asian Conf. Comp. Vis., pp. 83–99, Springer, 2018.
[76] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai, “Textfield: Learning a deep direction field for irregular scene text detection,” IEEE Trans. Image Process., 2019.
[77] M. Liao, G. Pang, J. Huang, T. Hassner, and X. Bai, “Mask textspotter v3: Segmentation proposal network for robust scene text spotting,” in Proc. Eur. Conf. Comp. Vis., 2020.
[78] Z. Zhong, L. Sun, and Q. Huo, “An anchor-free region proposal network for faster R-CNN-based text detection approaches,” Int. J. Document Analysis Recogn., vol. 22, no. 3, pp. 315–327, 2019.
[79] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proc. IEEE Int. Conf. Comp. Vis., 2019.

[80] W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu, “Deep direct regression for multi-oriented scene text detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[81] Z. Tian, M. Shu, P. Lyu, R. Li, C. Zhou, X. Shen, and J. Jia, “Learning Shape-Aware Embedding for Scene Text Detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4234–4243, 2019.
[82] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comp. Vis., 2017.
[83] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2625–2634, 2015.
[84] T. Wang, Y. Zhu, L. Jin, C. Luo, X. Chen, Y. Wu, Q. Wang, and M. Cai, “Decoupled attention network for text recognition,” in Proc. AAAI Conf. Artificial Intell., pp. 12216–12224, 2020.
[85] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
[86] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned step size quantization,” in Proc. Int. Conf. Learn. Representations, 2020.
[87] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, “PACT: Parameterized clipping activation for quantized neural networks,” arXiv: Comp. Res. Repository, 2018.
[88] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. Eur. Conf. Comp. Vis., pp. 630–645, 2016.
[89] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2117–2125, 2017.
[90] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, “Coco-text: Dataset and benchmark for text detection and recognition in natural images,” arXiv: Comp. Res. Repository, 2016.
[91] Y. Sun, Z. Ni, C.-K. Chng, Y. Liu, C. Luo, C. C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas, et al., “ICDAR 2019 Competition on Large-scale Street View Text with Partial Labeling–RRC-LSVT,” Proc. IAPR Int. Conf. Document Analysis Recog., 2019.
[92] C.-K. Chng, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, J. Han, E. Ding, et al., “ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text (RRC-ArT),” Proc. IAPR Int. Conf. Document Analysis Recog., 2019.
[93] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2315–2324, 2016.
[94] M. Liao, B. Shi, and X. Bai, “Textboxes++: A single-shot oriented scene text detector,” IEEE Trans. Image Process., vol. 27, no. 8, pp. 3676–3690, 2018.
[95] Y. Liu, L. Jin, and C. Fang, “Arbitrarily shaped scene text detection with a mask tightness text detector,” IEEE Trans. Image Process., vol. 29, pp. 2918–2930, 2019.
[96] Y. Baek, S. Shin, J. Baek, S. Park, J. Lee, D. Nam, and H. Lee, “Character region attention for text spotting,” in Proc. Eur. Conf. Comp. Vis., pp. 504–521, 2020.
[97] W. Feng, F. Yin, X.-Y. Zhang, W. He, and C.-L. Liu, “Residual dual scale scene text spotting by fusing bottom-up and top-down processing,” Int. J. Comput. Vision, pp. 1–19, 2020.
[98] R. Li, Y. Wang, F. Liang, H. Qin, J. Yan, and R. Fan, “Fully quantized network for object detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2019.