



UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World

Shangbang Long, Carnegie Mellon University

[email protected]

Cong Yao, Megvii (Face++) Technology Inc.

[email protected]

Figure 1: Demonstration of the proposed UnrealText synthesis engine, which achieves photo-realistic lighting conditions, finds suitable text regions, and realizes natural occlusion (from left to right, zoomed-in views marked with green squares).

Abstract

Synthetic data has been a critical tool for training scene text detection and recognition models. On the one hand, synthetic word images have proven to be a successful substitute for real images in training scene text recognizers. On the other hand, however, scene text detectors still rely heavily on a large amount of manually annotated real-world images, which are expensive. In this paper, we introduce UnrealText, an efficient image synthesis method that renders realistic images via a 3D graphics engine. The 3D synthesis engine provides realistic appearance by rendering scene and text as a whole, and allows for better text region proposals with access to precise scene information, e.g., normals and even object meshes. Comprehensive experiments verify its effectiveness on both scene text detection and recognition. We also generate a multilingual version for future research into multilingual scene text detection and recognition. Additionally, we re-annotate scene text recognition datasets in a case-sensitive way and include punctuation marks for more comprehensive evaluations. The code and the generated datasets are released at: https://jyouhou.github.io/UnrealText/.

1. Introduction

With the resurgence of neural networks, the past few years have witnessed significant progress in the field of scene text detection and recognition. However, these models are data-thirsty, and it is expensive and sometimes difficult, if not impossible, to collect enough data. Moreover, various applications, from traffic sign reading in autonomous vehicles to instant translation, require a large amount of data specifically for each domain, further escalating this issue. Therefore, synthetic data and synthesis algorithms are important for scene text tasks. Furthermore, synthetic data can provide detailed annotations, such as character-level or even pixel-level ground truths, which are rare for real images due to their high cost.

Currently, there exist several synthesis algorithms [46, 10, 6, 50] that have proven beneficial. In particular, in scene text recognition, training on synthetic data [10, 6] alone has become a widely accepted standard practice. Researchers who attempt training on both synthetic and real data report only marginal improvements [15, 20] on most datasets. Mixing synthetic and real data only improves performance on a few difficult cases that are not yet well covered by existing synthetic datasets, such as seriously blurred or curved text. This is reasonable, since cropped text images have much simpler backgrounds, and synthetic data enjoys advantages in larger vocabulary size and diversity of backgrounds, fonts, and lighting conditions, as well as thousands of times more data samples.

In contrast, scene text detection still depends heavily on real-world data. Synthetic data [6, 50] plays a less significant role and brings only marginal improvements.



Existing synthesizers for scene text detection follow the same paradigm. First, they analyze background images, e.g., by performing semantic segmentation and depth estimation with off-the-shelf models. Then, potential locations for text embedding are extracted from the segmented regions. Finally, text images (foregrounds) are blended into the background images, with perspective transformation inferred from the estimated depth. However, the analysis of background images with off-the-shelf models can be rough and imprecise. The errors propagate to the text proposal modules and result in text being embedded in unsuitable locations. Moreover, the text embedding process is ignorant of overall image conditions such as the illumination and occlusions of the scene. These two factors make text instances stand out from the backgrounds, leading to a gap between synthetic and real images.

In this paper, we propose a synthesis engine that synthesizes scene text images from a 3D virtual world. The proposed engine is based on the famous Unreal Engine 4 (UE4) and is therefore named UnrealText. Specifically, text instances are regarded as planar polygon meshes with text foregrounds loaded as textures. These meshes are placed in suitable positions in the 3D world and rendered together with the scene as a whole.

As shown in Fig. 1, the proposed synthesis engine, by its very nature, enjoys the following advantages over previous methods: (1) Text and scenes are rendered together, achieving realistic visual effects, e.g., illumination, occlusion, and perspective transformation. (2) The method has access to precise scene information, e.g., normals, depth, and object meshes, and can therefore generate better text region proposals. These aspects are crucial in training detectors.

To further exploit the potential of UnrealText, we design three key components: (1) A view-finding algorithm that explores the virtual scenes and generates camera viewpoints to obtain more diverse and natural backgrounds. (2) An environment randomization module that regularly changes the lighting conditions to simulate real-world variations. (3) A mesh-based text region generation method that finds suitable positions for text by probing the 3D meshes.

The contributions of this paper are summarized as follows: (1) We propose a brand-new scene text image synthesis engine, termed UnrealText, that renders images from a 3D world, which is entirely different from previous approaches that embed text on 2D background images. The proposed engine achieves realistic rendering effects and high scalability. (2) With the proposed techniques, the synthesis engine significantly improves the performance of detectors and recognizers. (3) We also generate a large-scale multilingual scene text dataset that will aid further research. (4) Additionally, we notice that many of the popular scene text recognition datasets are annotated in an incomplete way, providing only case-insensitive word annotations. With such limited annotations, researchers are unable to carry out comprehensive evaluations and tend to overestimate the progress of scene text recognition algorithms. To address this issue, we re-annotate these datasets to include upper-case and lower-case characters, digits, punctuation marks, and spaces where present. We urge researchers to use the new annotations and evaluate in such a full-symbol mode for a better understanding of the advantages and disadvantages of different algorithms.

2. Related Work

2.1. Synthetic Images

The synthesis of photo-realistic datasets has been a popular topic, since such datasets provide detailed ground-truth annotations at multiple granularities and cost less than manual annotation. In scene text detection and recognition, the use of synthetic datasets has become a standard practice. For scene text recognition, where images contain only one word, synthetic images are rendered through several steps [46, 10], including font rendering, coloring, homography transformation, and background blending. Later, GANs [5] were incorporated to maintain style consistency for the implanted text [51], but only for single-word images. As a result of this progress, synthetic data alone is enough to train state-of-the-art recognizers.

To train scene text detectors, SynthText [6] proposes generating synthetic data by printing text on background images. It first analyzes images with off-the-shelf models and searches for suitable text regions within semantically consistent areas. Text is implanted with perspective transformation based on estimated depth. To maintain semantic coherency, VISD [50] proposes using semantic segmentation to filter out unreasonable surfaces such as human faces. It also adopts an adaptive coloring scheme to fit the text into the artistic style of the backgrounds. However, without considering the scene as a whole, these methods fail to render text instances in a photo-realistic way, and the text instances stand out too much from the backgrounds. So far, the training of detectors still relies heavily on real images.

Although GANs and other learning-based methods have also shown great potential in generating realistic images [48, 17, 12], the generation of scene text images still requires a large amount of manually labeled data [51]. Furthermore, such data is sometimes not easy to collect, especially for cases such as low-resource languages.

More recently, synthesizing images with 3D graphics engines has become popular in several fields, including human pose estimation [43], scene understanding/segmentation [28, 24, 33, 35, 37], and object detection [29, 42, 8]. However, these methods either consider simplistic cases, e.g., rendering 3D objects on top of static background images [29, 43] and randomly arranging scenes filled with objects [28, 24, 35, 8], or passively use off-the-shelf 3D scenes without modifying them [33].


In contrast to these studies, our proposed synthesis engine actively and regularly interacts with the 3D scenes to generate realistic and diverse scene text images.

This paper is also a sequel to our previous attempt, SynthText3D [16]. SynthText3D closely follows the design of the SynthText method. While SynthText uses off-the-shelf computer vision models to estimate segmentation and depth maps for background images, SynthText3D uses the ground-truth segmentation and depth maps provided by the 3D engines. The rendering process of SynthText3D does not involve interactions with the 3D worlds, such as the object meshes, and therefore faces various limitations. We present a complete technical comparison between SynthText3D and UnrealText in the Appendix, and we encourage readers to read the SynthText3D paper and that comparison.

2.2. Scene Text Detection and Recognition

Scene text detection and recognition, possibly the most human-centric computer vision task, has been a popular research topic for many years [49, 21]. In scene text detection, there are mainly two branches of methodology: top-down methods, which inherit the idea of region proposal networks from general object detectors and detect text instances as rotated rectangles and polygons [19, 53, 11, 52, 47]; and bottom-up approaches, which predict local segments and local geometric attributes and compose them into individual text instances [38, 22, 2, 40]. Despite significant improvements on individual datasets, the most widely used benchmark datasets are usually very small, with only around 500 to 1000 images in their test sets, and are therefore prone to over-fitting. The generalization ability across different domains remains an open question and has not yet been studied. The reason lies in the very limited real data and in synthetic data that is not effective enough. Therefore, one important motivation of our synthesis engine is to serve as a stepping stone towards general scene text detection.

Most scene text recognition models consist of CNN-based image feature extractors and attentional LSTM [9]- or Transformer [44]-based encoder-decoders that predict the textual content [3, 39, 15, 23]. Since the encoder-decoder module is in essence a language model, scene text recognizers have a high demand for training data with a large vocabulary, which is extremely difficult to obtain from real-world data. Besides, scene text recognizers work on image crops with simple backgrounds, which are easy to synthesize. Therefore, synthetic data is necessary for scene text recognizers, and synthetic data alone is usually enough to achieve state-of-the-art performance. Moreover, since the recognition modules require a large amount of data, synthetic data is also necessary for training end-to-end text spotting systems [18, 7, 30].

3. Scene Text in 3D Virtual World

3.1. Overview

In this section, we give a detailed introduction to our scene text image synthesis engine, UnrealText, which is developed upon UE4 and the UnrealCV plugin [31]. The synthesis engine: (1) produces photo-realistic images; (2) is efficient, taking only about 1 to 1.5 seconds to render and generate a new scene text image; and (3) is general and compatible with off-the-shelf 3D scene models. As shown in Fig. 2, the pipeline mainly consists of a Viewfinder module (Section 3.2), an Environment Randomization module (Section 3.3), a Text Region Generation module (Section 3.4), and a Text Rendering module (Section 3.5).

First, the viewfinder module explores the 3D scene with the camera, generating camera viewpoints. Then, the environment lighting is randomly adjusted. Next, text regions are proposed based on 2D scene information and refined with 3D mesh information in the graphics engine. After that, text foregrounds are generated with randomly sampled fonts, colors, and text content, and are loaded as planar meshes. Finally, we retrieve the RGB image together with the corresponding text locations and text content to build the synthetic dataset.

3.2. Viewfinder

The aim of the viewfinder module is to automatically determine a set of camera locations and rotations, out of the whole space of a 3D scene, that are reasonable and non-trivial, discarding unsuitable viewpoints such as those inside object meshes (e.g., Fig. 3 bottom right).

Learning-based methods such as navigation and exploration algorithms may require extra training data and are not guaranteed to generalize to different 3D scenes. Therefore, we turn to rule-based methods and design a physically-constrained 3D random walk (Fig. 3, first row) equipped with auxiliary camera anchors.

3.2.1 Physically-Constrained 3D Random Walk

Starting from a valid location, the physically-constrained 3D random walk aims to find the next valid and non-trivial location. In contrast to valid locations, invalid locations are, for example, inside object meshes or far away from the scene boundary. A non-trivial location should not be too close to the current location; otherwise, the new viewpoint will be similar to the current one. The proposed 3D random walk uses physically-constrained ray casting [36] to inspect the physical environment and determine valid and non-trivial locations.

In each step, we first randomly change the pitch and yaw values of the camera rotation, making the camera point in a new direction.


Figure 2: The pipeline of the proposed synthesis method. The arrows indicate the order. For simplicity, we only show one text region. From left to right: scene overview, diverse viewpoints, various lighting conditions (light color, intensity, shadows, etc.), text region generation, and text rendering.

Then, we cast a ray from the camera location along the new viewing direction. The ray stops when it hits any object mesh or reaches a fixed maximum length. By design, the path from the current location to the stopping position is free of any barriers, i.e., it does not pass inside any object mesh; therefore, all points along this ray path are valid. Finally, we randomly sample one point between the 1/3 and 2/3 marks of this path and set it as the new, non-trivial location of the camera. The proposed random walk algorithm can generate diverse camera viewpoints.
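To make the step concrete, below is a minimal Python sketch of one such random-walk step. The `cast_ray` callback is an assumed wrapper around the engine's line/ray trace that returns the free distance before the first mesh hit; the pitch sampling range and the maximum ray length are illustrative values, not the exact settings of our engine.

```python
import math
import random

def move_along(location, pitch_deg, yaw_deg, dist):
    """Advance `dist` units from `location` along the (pitch, yaw) direction
    (degrees; yaw rotates around the vertical axis, pitch above the horizon)."""
    x, y, z = location
    p, w = math.radians(pitch_deg), math.radians(yaw_deg)
    return (x + dist * math.cos(p) * math.cos(w),
            y + dist * math.cos(p) * math.sin(w),
            z + dist * math.sin(p))

def random_walk_step(location, yaw, cast_ray, max_ray_len=3000.0):
    """One physically-constrained random-walk step: perturb the camera
    rotation, cast a ray along the new direction, and move to a random point
    between 1/3 and 2/3 of the barrier-free path."""
    # Randomly change pitch and yaw so the camera points in a new direction.
    pitch = random.uniform(-30.0, 30.0)
    yaw = (yaw + random.uniform(-180.0, 180.0)) % 360.0
    # Free distance before the ray hits any object mesh (or max_ray_len).
    free_dist = cast_ray(location, pitch, yaw, max_ray_len)
    # Every point on the free path is valid; the middle third is non-trivial.
    step = random.uniform(free_dist / 3.0, 2.0 * free_dist / 3.0)
    return move_along(location, pitch, yaw, step), (pitch, yaw)
```

The free-path test is what keeps the sampled location outside object meshes, since every point on the un-hit segment is reachable from the current location.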

3.2.2 Auxiliary Camera Anchors

The proposed random walk algorithm, however, is inefficient in terms of exploration. Therefore, we manually select a set of N camera anchors across the 3D scenes as starting points. After every T steps, we reset the location of the camera to a randomly sampled camera anchor. We set N = 150-200 and T = 100. Note that the selection of camera anchors requires little care; we only need to ensure coverage of the space. It takes around 20 to 30 seconds per scene, which is trivial and not a bottleneck for scalability. This manual but efficient selection of camera anchors is compatible with the proposed random walk algorithm, which generates diverse viewpoints.
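The interplay between the anchors and the random walk can be sketched as a simple generator; `step_fn` is assumed to wrap the random-walk step above (e.g., with the ray-casting helper bound to it), and `reset_every` plays the role of T.

```python
import random

def generate_viewpoints(anchors, step_fn, num_views, reset_every=100):
    """Yield camera poses: random-walk from manually selected anchors and
    reset to a random anchor every `reset_every` (= T) steps.
    `step_fn(location, yaw)` returns (new_location, (pitch, yaw))."""
    location, rotation = random.choice(anchors), (0.0, 0.0)
    for i in range(num_views):
        if i % reset_every == 0:
            location = random.choice(anchors)     # jump back to an anchor
        location, rotation = step_fn(location, rotation[1])
        yield location, rotation
```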

3.3. Environment Randomization

To reproduce real-world variations such as changing lighting conditions, we randomly alter the intensity, color, and direction of all light sources in the scene. In addition to illumination, we also add fog and randomly adjust its intensity. The environment randomization proves to increase the diversity of the generated images and results in stronger detector performance. The proposed randomization can also benefit sim-to-real domain adaptation [41].
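A sketch of this randomization is given below. The `set_light` and `set_fog` callables are assumed engine-side wrappers (e.g., built on UE4 console commands or a custom UnrealCV extension), and the sampling ranges are illustrative rather than the exact values used in our engine.

```python
import random

def randomize_environment(set_light, set_fog, num_lights):
    """Randomly perturb all light sources and the fog of the scene."""
    for i in range(num_lights):
        intensity = random.uniform(0.3, 3.0)                       # brightness
        color = tuple(random.uniform(0.6, 1.0) for _ in range(3))  # RGB tint
        direction = (random.uniform(-90.0, 0.0),                   # pitch
                     random.uniform(0.0, 360.0))                   # yaw
        set_light(i, intensity, color, direction)                  # engine-side call
    set_fog(random.uniform(0.0, 0.05))                             # fog density
```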

Figure 3: In the first row, (1)-(4) illustrate the physically-constrained 3D random walk. For better visualization, we use a camera object to represent the viewpoint (marked with green boxes and arrows). In the second row, we compare viewpoints from the proposed method with randomly sampled viewpoints.

3.4. Text Region Generation

In the real world, text instances are usually embedded on well-defined surfaces, e.g., traffic signs, to maintain good legibility. Previous works find suitable regions using estimated scene information, such as gPb-UCM [1] in SynthText [6] or saliency maps in VISD [50], as an approximation. However, these methods are imprecise and often fail to find appropriate regions. Therefore, we propose to find text regions by probing around object meshes in the 3D world. Since inspecting all object meshes is time-consuming, we propose a two-stage pipeline: (1) we retrieve the ground-truth surface normal map to generate initial text region proposals; (2) the initial proposals are then projected to and refined in the 3D world using object meshes. Finally, we sample a subset of the refined proposals to render.


To avoid occlusion among proposals, we project them back to screen space and discard overlapping regions one by one in a shuffled order until no occlusion remains.
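One way to realize this shuffled de-occlusion step is the greedy filter sketched below, which keeps proposals in a random order and drops any proposal that overlaps an already kept one; boxes are assumed to be axis-aligned screen-space rectangles (x1, y1, x2, y2).

```python
import random

def overlaps(a, b):
    """Axis-aligned overlap test for screen-space boxes (x1, y1, x2, y2)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def remove_occlusions(boxes):
    """Keep proposals in a shuffled order, dropping any box that overlaps an
    already kept one, so that no two remaining regions occlude each other."""
    order = list(range(len(boxes)))
    random.shuffle(order)
    kept = []
    for i in order:
        if all(not overlaps(boxes[i], boxes[j]) for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]
```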

3.4.1 Initial Proposals from Normal Maps

In computer graphics, normal values are unit vectors perpendicular to a surface. Therefore, when projected to 2D screen space, a region with similar normal values tends to be a well-defined region on which to embed text. We find valid image regions by applying sliding windows of 64x64 pixels across the surface normal map and retrieving those with smooth surface normals, i.e., windows in which the minimum cosine similarity between any two pixels is larger than a threshold t. We set t to 0.95, which proves to produce reasonable results. We randomly sample at most 10 non-overlapping valid image regions to make the initial proposals. Making proposals from normal maps is an efficient way to find potential and visible regions.
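The window test can be sketched as follows. The stride and the pixel subsampling inside each window are illustrative efficiency choices rather than part of the method; the retained windows would then be randomly thinned to at most 10 non-overlapping proposals as described above.

```python
import numpy as np

def propose_from_normals(normal_map, win=64, stride=32, t=0.95, subsample=4):
    """Slide a win x win window over the screen-space surface normal map
    (an (H, W, 3) array of unit normals) and keep windows whose minimum
    pairwise cosine similarity exceeds the threshold t."""
    h, w, _ = normal_map.shape
    proposals = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            patch = normal_map[y:y + win:subsample, x:x + win:subsample]
            n = patch.reshape(-1, 3).astype(np.float64)
            n /= np.linalg.norm(n, axis=1, keepdims=True) + 1e-8
            if np.min(n @ n.T) > t:          # min cosine similarity in window
                proposals.append((x, y, x + win, y + win))
    return proposals
```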

3.4.2 Refining Proposals in 3D Worlds

As shown in Fig. 4, rectangular initial proposals in 2D screen space are distorted when projected into the 3D world. Thus, we first rectify the proposals in 3D. We project the center point of each initial proposal into 3D space and re-initialize an orthogonal square on the corresponding mesh surface around the center point, with the horizontal sides orthogonal to the gravity direction. The side length is set to the shortest side of the quadrilateral created by projecting the four corners of the initial proposal into 3D space. We then enlarge the width and height along the horizontal and vertical sides alternately. The expansion in one direction stops when the sides of that direction leave the surface1, hit other meshes, or reach a preset maximum expansion ratio. The proposed refinement algorithm works in 3D world space and is able to produce natural homography transformations in 2D screen space.
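A simplified sketch of the alternating expansion is shown below. It grows the square symmetrically about its center and delegates the geometric checks (staying on the surface, not hitting other meshes) to an assumed engine-side predicate `side_ok`; the growth step and maximum expansion ratio are illustrative parameters.

```python
def refine_proposal(center, init_side, side_ok, max_ratio=4.0, step=0.05):
    """Alternately expand the width and height of a surface-aligned square.
    `center` is the 3D point of the re-initialized square, `init_side` its
    initial side length, and `side_ok(center, width, height)` an assumed
    engine-side check that the region stays on the surface and hits no other
    mesh. Growth in a direction stops once the check fails or the preset
    maximum expansion ratio is reached."""
    w = h = init_side
    grow_w = grow_h = True
    while grow_w or grow_h:
        if grow_w:
            new_w = w * (1.0 + step)
            if new_w / init_side <= max_ratio and side_ok(center, new_w, h):
                w = new_w
            else:
                grow_w = False            # horizontal expansion is done
        if grow_h:
            new_h = h * (1.0 + step)
            if new_h / init_side <= max_ratio and side_ok(center, w, new_h):
                h = new_h
            else:
                grow_h = False            # vertical expansion is done
    return w, h
```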

3.5. Text Rendering

Generating Text Images: Given the text regions proposed and refined in Section 3.4, the text generation module samples text content and renders text images with certain fonts and text colors. The number of lines and the number of characters per line are determined by the font size and the size of the refined proposal in 2D space, to ensure that the characters are not too small and remain legible. For a fairer comparison, we use the same font set from Google Fonts2 as SynthText does, as well as the same text corpus, Newsgroup20. The generated text images have zero alpha values on non-stroke pixels and non-zero values on stroke pixels.

1 When the distances from the rectangular proposal's corners to the nearest point on the underlying surface mesh exceed a certain threshold.

2 https://fonts.google.com/
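The alpha-texture idea behind the generated text images can be illustrated with a short Pillow snippet; this single-line version is only a sketch, whereas the engine samples multi-line layouts, fonts, colors, and texture attributes as described above.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_texture(text, font_path, font_size, color=(255, 255, 255),
                        size=(512, 128)):
    """Render a text foreground as an RGBA texture: alpha is zero on
    non-stroke pixels and non-zero (here, fully opaque) on stroke pixels."""
    img = Image.new("RGBA", size, (0, 0, 0, 0))        # fully transparent canvas
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, font_size)
    draw.text((10, 10), text, font=font, fill=color + (255,))
    return img
```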

Figure 4: Illustration of the refinement of initial proposals. We draw green bounding boxes to represent proposals in 2D screen space, and use planar meshes to represent proposals in 3D space. (1) Initial proposals are made in 2D space. (2) When we project them into the 3D world and inspect them from the front view, they appear distorted. (3) Based on the sizes of the distorted proposals and the positions of the center points, we re-initialize orthogonal squares on the same surfaces with horizontal sides orthogonal to the gravity direction. (5) Then we expand the squares. (6) Finally, we obtain text regions in 2D screen space with natural perspective distortion.

Rendering Text in the 3D World: We first triangulate the refined proposals to generate planar triangular meshes that are closely attached to the underlying surface. Then we load the text images as textures onto the generated meshes. We also randomly sample the texture attributes, such as the ratio of diffuse to specular reflection.

3.6. Implementation Details

The proposed synthesis engine is implemented on UE4.22 and the UnrealCV plugin. On an Ubuntu workstation with an 8-core Intel CPU, an NVIDIA GeForce RTX 2070 GPU, and 16 GB of RAM, the synthesis speed is 0.7-1.5 seconds per image at a resolution of 1080x720, depending on the complexity of the scene model.

We collect 30 scene models from the official UE4 marketplace. The engine is used to generate 600K scene text images with English words. With the same configuration, we also generate a multilingual version, making it the largest multilingual scene text dataset.

4. Experiments on Scene Text Detection

4.1. Settings

We first verify the effectiveness of the proposed engine by training detectors on the synthesized images and evaluating them on real-image datasets. We use a previous yet time-tested state-of-the-art model, EAST [53], which is fast and accurate and forms the basis of several widely recognized end-to-end text spotting models [18, 7].


We adopt an open-source implementation3. In all experiments, models are trained on 4 GPUs with a batch size of 56. During evaluation, the test images are resized so that the short side is 800 pixels.

Benchmark Datasets: We use the following scene text detection datasets for evaluation: (1) ICDAR 2013 Focused Scene Text (IC13) [14], containing horizontal text with zoomed-in views; (2) ICDAR 2015 Incidental Scene Text (IC15) [13], consisting of images captured casually with Google Glass, where images are blurred and text is small; (3) MLT 2017 [27] for multilingual scene text detection, which is composed of scene text images in 9 languages.

4.2. Experiment Results

Pure Synthetic Data: We first train EAST models on different synthetic datasets alone, to compare our method with previous ones in a direct and quantitative way. Note that ours, SynthText, and VISD have different numbers of images, so we also control the number of images used in the experiments. Results are summarized in Tab. 1.

First, we limit the total number of images to 10K, which is the full size of the smallest synthetic dataset, VISD. We observe a considerable improvement on IC15 over the previous state of the art, by +0.9% in F1-score, and significant improvements on IC13 (+3.5%) and MLT 2017 (+2.8%). Second, we also train models on the full sets of SynthText and ours, since scalability is an important factor for synthetic data, especially considering the demand for training recognizers. The extra training images further improve F1-scores on IC15, IC13, and MLT by +2.6%, +2.3%, and +2.1%, respectively. Models trained with our UnrealText data outperform all other synthetic datasets. Moreover, a subset of only 10K of our images even surpasses 800K SynthText images significantly on all datasets. These results demonstrate the effectiveness of the proposed synthesis engine and datasets.

Training Data           IC15   IC13   MLT 2017
SynthText 10K           46.3   60.8   38.9
VISD 10K (full)         64.3   74.8   51.4
Ours 10K                65.2   78.3   54.2
SynthText 800K (full)   58.0   67.7   44.8
Ours 600K (full)        67.8   80.6   56.3
Ours 5K + VISD 5K       66.9   80.4   55.7

Table 1: Detection results (F1-scores) of EAST models trained on different synthetic data.

Complementary Synthetic Data: One unique characteristic of the proposed UnrealText is that its images are generated from 3D scene models instead of real background images, resulting in a potential domain gap due to different artistic styles. We conduct experiments training on both UnrealText data (5K) and VISD (5K), also shown in Tab. 1 (last row, marked in italics), which achieves better performance than the other 10K synthetic datasets. This result demonstrates that UnrealText is complementary to existing synthetic datasets that use real images as backgrounds: while UnrealText simulates photo-realistic effects, synthetic data with real background images can help adapt to real-world datasets.

3 https://github.com/argman/EAST

Combining Synthetic and Real Data: One important role of synthetic data is to serve as pretraining data and further improve performance on domain-specific real datasets. We first pretrain the EAST models with different synthetic data and then fine-tune the models with domain data. The results are summarized in Tab. 2. On all domain-specific datasets, models pretrained with our synthetic dataset surpass the others by considerable margins, verifying the effectiveness of our synthesis method for boosting performance on domain-specific datasets.

Evaluation on ICDAR 2015
Training Data                 P      R      F1
IC15                          84.6   78.5   81.4
IC15 + SynthText 10K          85.6   79.5   82.4
IC15 + VISD 10K               86.3   80.0   83.1
IC15 + Ours 10K               86.9   81.0   83.8
IC15 + Ours 600K (full)       88.5   80.8   84.5

Evaluation on ICDAR 2013
Training Data                 P      R      F1
IC13                          82.6   70.0   75.8
IC13 + SynthText 10K          85.3   72.4   78.3
IC13 + VISD 10K               85.9   73.1   79.0
IC13 + Ours 10K               88.5   74.7   81.0
IC13 + Ours 600K (full)       92.3   73.4   81.8

Evaluation on MLT 2017
Training Data                 P      R      F1
MLT 2017                      72.9   67.4   70.1
MLT 2017 + SynthText 10K      73.1   67.7   70.3
MLT 2017 + VISD 10K           73.3   67.9   70.5
MLT 2017 + Ours 10K           74.6   68.7   71.6
MLT 2017 + Ours 600K (full)   82.2   67.4   74.1

Table 2: Detection performance of EAST models pretrained on synthetic data and then fine-tuned on real datasets.

Pretraining on the Full Dataset: As shown in the last rows of Tab. 2, when we pretrain the detector models with our full dataset, performance improves significantly, demonstrating the advantage of the scalability of our engine. Notably, the EAST model achieves an F1-score of 74.1 on MLT17, which is even better than recent state-of-the-art results, including 73.9 by CRAFT [2] and 73.1 by LOMO [52]. Although the margin is not large, it suffices to claim that the EAST model revives and reclaims state-of-the-art performance with the help of our synthetic dataset.

4.3. Module Level Ablation Analysis

A reasonable concern about synthesizing from 3D virtual scenes lies in scene diversity. In this section, we examine the importance of the proposed viewfinder module and the environment randomization module in increasing the diversity of the synthetic images.

Ablating the Viewfinder Module: We derive two baselines from the proposed viewfinder module: (1) Random Viewpoint + Manual Anchor, which randomly samples camera locations and rotations from norm-ball spaces centered around the auxiliary camera anchors; (2) Random Viewpoint Only, which randomly samples camera locations and rotations from the whole scene space without checking their quality. For these experiments, we fix the number of scenes to 10 to control scene diversity, generate different numbers of images, and compare the performance curves. By fixing the number of scenes, we compare how well different view-finding methods exploit the scenes.

Ablating Environment Randomization: We remove the environment randomization module and keep the scene models unchanged during synthesis. For these experiments, we fix the total number of images to 10K and use different numbers of scenes. In this way, we can compare the diversity of images generated by the different methods.

We train EAST models with different numbers of images or scenes, evaluate them on the 3 real datasets, and compute the arithmetic mean of the F1-scores. As shown in Fig. 5 (a), the proposed combination, i.e., Random Walk + Manual Anchor, consistently achieves significantly higher F1-scores across different numbers of images; in particular, larger training sets yield larger performance gaps. We also inspect the images generated by these methods. When starting from the same anchor point, the proposed random walk generates more diverse viewpoints and traverses a much larger area. In contrast, the Random Viewpoint + Manual Anchor method degenerates either into random rotation only, when we set a small norm-ball size for the random location, or into Random Viewpoint Only, when we set a large norm-ball size. As a result, the Random Viewpoint + Manual Anchor method requires careful manual selection of anchors, and we would also need to manually tune the norm-ball sizes for different scenes, which restricts the scalability of the synthesis engine. Meanwhile, our proposed random-walk-based method is more flexible and robust to the selection of manual anchors. As for the Random Viewpoint Only method, a large proportion of its generated viewpoints are invalid, e.g., inside object meshes, which is out-of-distribution for real images. This explains why it results in the worst performance.

From Fig. 5 (b), the main observation is that the environment randomization module consistently improves performance across different numbers of scenes.

Figure 5: Results of the ablation tests: (a) ablating the viewfinder module; (b) ablating the environment randomization module.

Besides, the improvement is more significant when fewer scenes are used. We can therefore conclude that environment randomization increases image diversity and, at the same time, reduces the number of scenes needed. Furthermore, the random lighting conditions reproduce different real-world variations, which we also regard as a key factor.

5. Experiments on Scene Text Recognition

In addition to its superior performance in training scene text detection models, we also verify the engine's effectiveness on the task of scene text recognition.

5.1. Recognizing Latin Scene Text

5.1.1 Settings

Model: We select a widely accepted baseline method, ASTER [39], and adopt the implementation4 that ranks first on the ICDAR 2019 ArT competition on curved scene text recognition (Latin) by [20]. The models are trained with a batch size of 512. A total of 95 symbols are recognized, including an end-of-sentence mark, 52 case-sensitive letters, 10 digits, and 32 printable punctuation symbols.

Training Datasets: From the 600K English synthetic images, we obtain a total of 12M word-level image regions to make our training dataset. Note that our synthetic dataset also provides character-level annotations, which can be useful for some recognition algorithms.

Evaluation Datasets: We evaluate models trained on different synthetic datasets on several widely used real-image datasets: IIIT [25], SVT [45], ICDAR 2015 (IC15) [13], SVTP [32], CUTE [34], and Total-Text [4].
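For reference, the 95-symbol alphabet described in the Model paragraph can be reconstructed from Python's standard string constants (a sketch of the symbol set, not our training code):

```python
import string

# 52 case-sensitive letters + 10 digits + 32 printable punctuation marks,
# plus one End-of-Sentence token used by the attentional decoder.
EOS = "<EOS>"
CHARSET = list(string.ascii_letters + string.digits + string.punctuation) + [EOS]
assert len(CHARSET) == 95
```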

Some of these datasets, however, have incomplete annotations, namely IIIT, SVT, SVTP, and CUTE. While the word images in these datasets contain punctuation symbols, digits, and upper-case and lower-case characters, the datasets, in their current form, only provide case-insensitive annotations and ignore all punctuation symbols. To enable a more comprehensive evaluation of scene text recognition, we re-annotate these 4 datasets in a case-sensitive way and also include punctuation symbols. We release the new annotations and believe that they will become better benchmarks for scene text recognition in the future.

4 https://github.com/Jyouhou/ICDAR2019-ArT-Recognition-Alchemy

5.1.2 Experiment Results

The experiment results are summarized in Tab. 4. First, we compare our method with previous synthetic datasets. We limit the size of the training datasets to 1M images, since VISD only releases 1M word images. Our synthetic data achieves consistent improvements on all datasets. In particular, it surpasses the other synthetic datasets by a considerable margin on datasets with diverse text styles and complex backgrounds, such as SVTP (+2.4%). The experiments verify the effectiveness of our synthesis method for scene text recognition, especially in complex cases.

Since small-scale experiments alone do not tell researchers how best to utilize these datasets, we further train models on combinations of Synth90K, SynthText, and ours. We first limit the total number of training images to 9M. When we train on a combination of all 3 synthetic datasets, with 3M each, the model performs better than the model trained on only the two previous datasets (4.5M x 2). We further observe that training on the 3M x 3 combination is comparable to training on the whole of Synth90K and SynthText, while using much less training data. This result suggests that the best practice is to combine the proposed synthetic dataset with previous ones.

5.2. Recognizing Multilingual Scene Text

5.2.1 Settings

Although MLT 2017 has been widely used as a benchmark for detection, the task of recognizing multilingual scene text remains largely untouched, mainly due to the lack of a proper training dataset. To pave the way for future research, we also generate a multilingual version with 600K images containing the 10 languages included in MLT 2019 [26]: Arabic, Bangla, Chinese, English, French, German, Hindi, Italian, Japanese, and Korean. Text content is sampled from corpora extracted from the Wikimedia dump5.

Model: We use the same model and implementation as Section 5.1, except that the symbols to recognize are expanded to all characters that appear in the generated dataset.

Training and Evaluation Data: We crop word images from the proposed multilingual dataset. We discard images with widths shorter than 32 pixels, as they are too blurry, and obtain 4.1M word images in total. We compare with the multilingual version of SynthText provided by the MLT 2019 competition, which contains a total of 1.2M images. For evaluation, we randomly split off 1500 images for each language (including symbols and mixed) from the training set of MLT 2019; the rest of the training set is used for training.

5 https://dumps.wikimedia.org

5.2.2 Experiment Results

The experiment results are shown in Tab. 3. When we use only synthetic data and control the number of images to 1.2M, ours yields a considerable improvement of 1.6% in overall accuracy, with significant improvements on some scripts, e.g., Latin (+7.6%) and Mixed (+21.6%). Using the whole training set of 4.1M images further improves the overall accuracy to 39.5%. When we train models on combinations of synthetic data and our training split of MLT19, as shown in the bottom of Tab. 3, we still observe a considerable margin of our method over SynthText, 3.2% in overall accuracy. The experiment results demonstrate that our method is also superior for multilingual scene text recognition, and we believe this result will become a stepping stone for further research.

6. Limitations and Future Work

Several aspects are worth exploring further: (1) Overall, the engine is based on rules and human-selected parameters. Automating the selection of and search for these parameters would save human effort and help adapt to different scenarios. (2) While rendering small text can help train detectors, the low image quality of small text makes recognizers harder to train and harms performance. Designing a method to mark illegible instances as difficult and exclude them from the loss calculation may help mitigate this problem. (3) For multilingual scene text, scripts other than Latin have far fewer fonts that we have easy access to. To improve performance on more languages, researchers may consider learning-based methods that transfer Latin fonts to other scripts.

7. Conclusion

In this paper, we introduce a scene text image synthesis engine that renders images with a 3D graphics engine, where text instances and scenes are rendered as a whole. In experiments, we verify the effectiveness of the proposed engine for both scene text detection and recognition models. We also study the key components of the proposed engine. We believe our work will be a solid stepping stone towards better synthesis algorithms.

Acknowledgement

This research was supported by the National Key R&D Program of China (No. 2017YFA0700800).


Training Data                     Latin  Arabic  Bangla  Chinese  Hindi  Japanese  Korean  Symbols  Mixed  Overall
ST (1.2M)                         34.6   50.5    17.7    43.9     15.7   21.2      55.7    44.7     9.8    34.9
Ours (1.2M)                       42.2   50.3    16.5    44.8     30.3   21.7      54.6    16.7     25.0   36.5
Ours (full, 4.1M)                 44.3   51.1    19.7    47.9     33.1   24.2      57.3    25.6     31.4   39.5
MLT19-train (90K)                 64.3   47.2    46.9    11.9     46.9   23.3      39.1    35.9     3.6    45.7
MLT19-train (90K) + ST (1.2M)     63.8   62.0    48.9    50.7     47.7   33.9      64.5    45.5     10.3   54.7
MLT19-train (90K) + Ours (1.2M)   67.8   63.0    53.7    47.7     64.0   35.7      62.9    44.3     26.3   57.9

Table 3: Multilingual scene text recognition results (word-level accuracy). Latin aggregates English, French, German, and Italian, as they are all marked as Latin in the MLT dataset.

Training Data          IIIT   SVT    IC15   SVTP   CUTE   Total
90K [10] (1M)          51.6   39.2   35.7   37.2   30.9   30.5
ST [6] (1M)            53.5   30.3   38.4   29.5   31.2   31.1
VISD [50] (1M)         53.9   37.1   37.1   36.3   30.5   30.9
Ours (1M)              54.8   40.3   39.1   39.6   31.6   32.1
ST+90K (4.5M x 2)      80.5   70.1   58.4   60.0   63.9   43.2
ST+90K+Ours (3M x 3)   81.6   71.9   61.8   61.7   67.7   45.7
ST+90K (16M)           81.2   71.2   62.0   62.3   65.1   44.7

Table 4: Results on English datasets (word-level accuracy).

References

[1] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.

[2] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9365–9374, 2019.

[3] Zhanzhan Cheng, Xuyang Liu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Arbitrarily-oriented text recognition. CVPR 2018, 2017.

[4] Chee Kheng Ch'ng and Chee Seng Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In Proc. ICDAR, volume 1, pages 935–942, 2017.

[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. NIPS, pages 2672–2680, 2014.

[6] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proc. CVPR, pages 2315–2324, 2016.

[7] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Proc. CVPR, pages 5020–5029, 2018.

[8] Stefan Hinterstoisser, Olivier Pauly, Hauke Heibel, Martina Marek, and Martin Bokeloh. An annotation saved is an annotation earned: Using fully synthetic training for object instance detection. CoRR, abs/1902.09967, 2019.

[9] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[10] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.

[11] Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017.

[12] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, and Sanja Fidler. Meta-Sim: Learning to generate synthetic datasets. arXiv preprint arXiv:1904.11621, 2019.

[13] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.

[14] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere de las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pages 1484–1493. IEEE, 2013.

[15] Hui Li, Peng Wang, Chunhua Shen, and Guyu Zhang. Show, attend and read: A simple and strong baseline for irregular text recognition. AAAI, 2019.

[16] Minghui Liao, Boyu Song, Shangbang Long, Minghang He, Cong Yao, and Xiang Bai. SynthText3D: Synthesizing scene text images from 3D virtual worlds. Science China Information Sciences, 63(2):120105, 2020.

[17] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. ST-GAN: Spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9455–9464, 2018.

[18] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. FOTS: Fast oriented text spotting with a unified network. In Proc. CVPR, 2018.

[19] Yuliang Liu and Lianwen Jin. Deep matching prior network: Toward tighter multi-oriented text detection. In Proc. CVPR, 2017.

[20] Shangbang Long, Yushuo Guan, Bingxuan Wang, Kaigui Bian, and Cong Yao. Alchemy: Techniques for rectification based irregular scene text recognition. arXiv preprint arXiv:1908.11834, 2019.


[21] Shangbang Long, Xin He, and Cong Yao. Scene text detection and recognition: The deep learning era. arXiv preprint arXiv:1811.04256, 2018.

[22] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proc. ECCV, 2018.

[23] Pengyuan Lyu, Zhicheng Yang, Xinhang Leng, Xiaojun Wu, Ruiyu Li, and Xiaoyong Shen. 2D attentional irregular scene text recognizer. arXiv preprint arXiv:1906.05708, 2019.

[24] John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J. Davison. SceneNet RGB-D: 5M photorealistic images of synthetic indoor trajectories with ground truth. CoRR, abs/1612.05079, 2016.

[25] Anand Mishra, Karteek Alahari, and CV Jawahar. Scene text recognition using higher order language priors. In BMVC - British Machine Vision Conference. BMVA, 2012.

[26] Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, et al. ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition - RRC-MLT-2019. arXiv preprint arXiv:1907.00945, 2019.

[27] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT. In Proc. ICDAR, volume 1, pages 1454–1459. IEEE, 2017.

[28] Jeremie Papon and Markus Schoeler. Semantic pose using deep networks trained on synthetic RGB-D. In Proc. ICCV, pages 774–782, 2015.

[29] Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning deep object detectors from 3D models. In Proc. ICCV, pages 1278–1286, 2015.

[30] Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE International Conference on Computer Vision, pages 4704–4714, 2019.

[31] Weichao Qiu and Alan Yuille. UnrealCV: Connecting computer vision to Unreal Engine. In Proc. ECCV, pages 909–916, 2016.

[32] Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, and Chew Lim Tan. Recognizing text with perspective distortion in natural scenes. In Proc. ICCV, pages 569–576, 2013.

[33] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In European Conference on Computer Vision, pages 102–118. Springer, 2016.

[34] Anhar Risnumawan, Palaiahankote Shivakumara, Chee Seng Chan, and Chew Lim Tan. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 41(18):8027–8048, 2014.

[35] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proc. CVPR, pages 3234–3243, 2016.

[36] Scott D Roth. Ray casting for modeling solids. Computer Graphics & Image Processing, 18(2):109–144, 1982.

[37] Fatemeh Sadat Saleh, Mohammad Sadegh Aliakbarian, Mathieu Salzmann, Lars Petersson, and Jose M Alvarez. Effective use of synthetic data for urban scene semantic segmentation. In Proc. ECCV, pages 86–103, 2018.

[38] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[39] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Xiang Bai, and Cong Yao. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):855–868, 2018.

[40] Zhuotao Tian, Michelle Shu, Pengyuan Lyu, Ruiyu Li, Chao Zhou, Xiaoyong Shen, and Jiaya Jia. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4234–4243, 2019.

[41] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.

[42] Jonathan Tremblay, Thang To, and Stan Birchfield. Falling things: A synthetic dataset for 3D object detection and pose estimation. In Proc. CVPR Workshops, pages 2038–2041, 2018.

[43] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In Proc. CVPR, pages 109–117, 2017.

[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NIPS, pages 5998–6008, 2017.

[45] Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 1457–1464. IEEE, 2011.

[46] Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. End-to-end text recognition with convolutional neural networks. In 2012 21st International Conference on Pattern Recognition (ICPR), pages 3304–3308. IEEE, 2012.

[47] Xiaobing Wang, Yingying Jiang, Zhenbo Luo, Cheng-Lin Liu, Hyunsoo Choi, and Sungjin Kim. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6449–6458, 2019.

[48] Xinlong Wang, Zhipeng Man, Mingyu You, and Chunhua Shen. Adversarial generation of training examples: Applications to moving vehicle license plate recognition. arXiv preprint arXiv:1707.03124, 2017.

[49] Qixiang Ye and David Doermann. Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7):1480–1500, 2015.


[50] Fangneng Zhan, Shijian Lu, and Chuhui Xue. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In Proc. ECCV, 2018.

[51] Fangneng Zhan, Hongyuan Zhu, and Shijian Lu. Spatial fusion GAN for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3653–3662, 2019.

[52] Chengquan Zhang, Borong Liang, Zuming Huang, Mengyi En, Junyu Han, Errui Ding, and Xinghao Ding. Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[53] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In Proc. CVPR, 2017.

A. Scene Models

In this work, we use a total of 30 scene models, all obtained from the Internet. However, most of these models are not free, so we are not allowed to share the models themselves. Instead, we list the models we use and their links in Tab. 5.

B. New Annotations for Scene Text Recognition Datasets

During the scene text recognition experiments for English scripts, we notice that several of the most widely used benchmark datasets have incomplete annotations, namely IIIT5K, SVT, SVTP, and CUTE-80. The annotations of these datasets are case-insensitive and ignore punctuation marks.

The common practice in recent scene text recognition research is to convert both the predicted and ground-truth text strings to lower case before comparing them. This means that the current evaluation protocol is flawed: it ignores letter case and punctuation marks, which are crucial to understanding the text content. Besides, evaluating on a much smaller vocabulary set leads to over-optimistic estimates of the performance of recognition models.
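The difference between the two protocols can be made explicit with a small evaluation sketch; `full_symbol=False` reproduces the common lower-case, punctuation-free comparison, while `full_symbol=True` is the evaluation mode we advocate.

```python
import string

KEEP = set(string.ascii_lowercase + string.digits)

def normalize(text, full_symbol=True):
    """Full-symbol protocol keeps the string as-is; the common protocol
    lower-cases it and drops everything except letters and digits."""
    if full_symbol:
        return text
    return "".join(c for c in text.lower() if c in KEEP)

def word_accuracy(preds, gts, full_symbol=True):
    """Word-level accuracy under the chosen evaluation protocol."""
    correct = sum(normalize(p, full_symbol) == normalize(g, full_symbol)
                  for p, g in zip(preds, gts))
    return correct / max(len(gts), 1)
```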

To aid further research, we use Amazon Mechanical Turk (AMT) to re-annotate the aforementioned 4 datasets, which amount to 6837 word images in total. Each word image is annotated by 3 workers, and we manually check and correct images where the 3 annotations differ. The annotated datasets are released via GitHub at https://github.com/Jyouhou/Case-Sensitive-Scene-Text-Recognition-Datasets.

B.1 Samples

We select some samples from the 4 datasets to demonstrate the new annotations in Fig. 6.

B.2 Benchmark Performances

As we encourage case-sensitive evaluation (also including punctuation marks) for scene text recognition, we would like to provide benchmark performance on these widely used datasets. We evaluate two implementations of the ASTER model, by Long et al.6 and Baek et al.7, respectively. Results are summarized in Tab. 6.

The two benchmark implementations perform comparably, with Baek's better on straight text and Long's better on curved text. Compared with the lower-case + digits evaluation, performance drops considerably for both models when we evaluate with all symbols. These results indicate that recognizing a larger vocabulary may still be challenging and is worth further research.

^6 https://github.com/Jyouhou/ICDAR2019-ArT-Recognition-Alchemy

^7 https://github.com/clovaai/deep-text-recognition-benchmark


Scene Name | Link
Urban City | https://www.unrealengine.com/marketplace/en-US/product/urban-city
Medieval Village | https://www.unrealengine.com/marketplace/en-US/product/medieval-village
Loft | https://ue4arch.com/shop/complete-projects/archviz/loft/
Desert Town | https://www.unrealengine.com/marketplace/en-US/product/desert-town
Archinterior 1 | https://www.unrealengine.com/marketplace/en-US/product/archinteriors-vol-2-scene-01
Desert Gas Station | https://www.unrealengine.com/marketplace/en-US/product/desert-gas-station
Modular School | https://www.unrealengine.com/marketplace/en-US/product/modular-school-pack
Factory District | https://www.unrealengine.com/marketplace/en-US/product/factory-district
Abandoned Factory | https://www.unrealengine.com/marketplace/en-US/product/modular-abandoned-factory
Buddhist | https://www.unrealengine.com/marketplace/en-US/product/buddhist-monastery-environment
Castle Fortress | https://www.unrealengine.com/marketplace/en-US/product/castle-fortress
Desert Ruin | https://www.unrealengine.com/marketplace/en-US/product/modular-desert-ruins
HAL Archviz | https://www.unrealengine.com/marketplace/en-US/product/hal-archviz-toolkit-v1
Hospital | https://www.unrealengine.com/marketplace/en-US/product/modular-sci-fi-hospital
HQ House | https://www.unrealengine.com/marketplace/en-US/product/hq-residential-house
Industrial City | https://www.unrealengine.com/marketplace/en-US/product/industrial-city
Archinterior 2 | https://www.unrealengine.com/marketplace/en-US/product/archinteriors-vol-4-scene-02
Office | https://www.unrealengine.com/marketplace/en-US/product/retro-office-environment
Meeting Room | https://drive.google.com/file/d/0B_mjKk7NOcnEUWZuRDVFQ09STE0/view
Old Village | https://www.unrealengine.com/marketplace/en-US/product/old-village
Modular Building | https://www.unrealengine.com/marketplace/en-US/product/modular-building-set
Modular Home | https://www.unrealengine.com/marketplace/en-US/product/supergenius-modular-home
Dungeon | https://www.unrealengine.com/marketplace/en-US/product/top-down-multistory-dungeons
Old Town | https://www.unrealengine.com/marketplace/en-US/product/old-town
Root Cellar | https://www.unrealengine.com/marketplace/en-US/product/root-cellar
Victorian | https://www.unrealengine.com/marketplace/en-US/product/victorian-street
Spaceship | https://www.unrealengine.com/marketplace/en-US/product/spaceship-interior-environment-set
Top-Down City | https://www.unrealengine.com/marketplace/en-US/product/top-down-city
Utopian City | https://www.unrealengine.com/marketplace/en-US/product/utopian-city

Table 5: The list of 3D scene models used in this work.

Dataset | Original Annotation | New Annotation
CUTE80 | TEAM | Team
IIIT5K | 15 | 15%.
SVT | DONALD | Donald'
SVTP | MARLBORO | Marlboro

Figure 6: Examples of the new annotations (the corresponding sample word images are not reproduced here).


C. Technical Differences between SynthText3D and UnrealText

C.1 Introduction

In this appendix, we summarize the key differences between the methods proposed in SynthText3D and UnrealText, and emphasize the novel contributions of UnrealText.

In the first part, we present a high-level comparison of the design philosophies behind these two methods, include a brief statement of other technical novelties, and finally point out the significant contributions achieved by UnrealText.

In the second part, we present a detailed comparison between SynthText3D and UnrealText. This detailed comparison covers each step in the rendering pipeline and demonstrates the significant differences between the two, as well as how UnrealText is superior to SynthText3D. We also attach links to the relevant code snippets for readers' verification.

C.2 High-level comparison

C.2.1 Interactions with 3D worlds

The main design philosophy behind SynthText3D is to utilize the ground-truth normal segmentation and depth maps provided by 3D engines, with which we can find better text regions and render them with correct perspective distortion. The design of SynthText3D closely follows SynthText, which uses gPb-UCM segmentation and depth maps estimated with off-the-shelf computer vision models. The paper did not go further than utilizing the ground-truth segmentation and depth values, and did not consider using the more informative object meshes to guide the region finding.

The main design philosophy of UnrealText is that the algorithms should interact deeply with the 3D engine and should utilize further information, such as the object meshes: object meshes provide complete information about the scene, whereas segmentation and depth maps are incomplete. The main steps in the pipeline are based on direct interactions with object meshes through ray tracing and collision detection.

In addition, we would like to point out that, while there are many other papers using 3D engines to synthesize images for computer vision tasks (see the related work section of the UnrealText paper), UnrealText is the first to consider such interactions with 3D worlds.

In this sense, the two works have been designed with completely different philosophies: SynthText3D is SynthText with a 3D engine, while UnrealText is far more integrated with the 3D engine, and is significantly superior to SynthText3D. For more details, please refer to Sec. C.3.1, C.3.2, and C.3.4 in this document.

C.2.2 Technical novelties

In addition, there are several non-trivial improvements compared to SynthText3D:

(1) We re-designed the text foreground generation module so that it renders text foregrounds about 100 times faster than SynthText3D and is easily compatible with multilingual scripts.

(2) We embedded an online environment randomization module that changes the lighting conditions for each shot, whereas SynthText3D used the same set of 4 fixed conditions (day, night, dawn, fog) for each scene model. The new system provides much more dynamic changes in lighting conditions. We designed this step to fight the domain adaptation problem via domain randomization (see Sec. 3.5 of the paper).

(3) In UnrealText, we integrated the control logic and a major part of the rendering pipeline into the game program, which greatly reduces the communication load between the game process and the controller process, a major bottleneck in SynthText3D. Besides, different text instances can now be rendered in parallel. Together, these give a significant speed improvement.

C.2.3 Efficiency and scalability

When designing UnrealText, we took efficiency and scalability into consideration and automated the pipeline, especially the viewfinding module. Also note that the deep interactions with 3D worlds in UnrealText make the text region proposal and refinement steps robust to camera views. This is why UnrealText no longer needs careful human annotation of views, which in turn allows for automated viewfinding and thus much more diverse views.

C.2.4 Conclusion and contributions of UnrealText

In conclusion, UnrealText has been designed to take full advantage of 3D engines, and its algorithms have been designed to complement each other. In contrast, as an initial attempt, SynthText3D inherits heavily from SynthText and only uses the ground-truth segmentation and depth maps in place of those estimated by the computer vision models used in SynthText. Overall, SynthText3D resembles SynthText, while UnrealText is a truly 3D synthesis engine.

With the newly designed methods, UnrealText is able to achieve the following, which SynthText3D cannot:

(1) Richer diversity: the images synthesized by UnrealText have much larger diversity in camera views, lighting conditions, locations of text, etc., and thus better represent the real-world data distribution.

(2) Speed: the rendering speed is faster by an order of magnitude. When rendering on the same machine as specified in the UnrealText paper, SynthText3D takes 30-40 seconds to render a view with 5 text regions, whereas UnrealText takes only 1 second to render a view with 15 text regions (see Sec. 3.6 of the paper).

(3) Scalability: with the improved speed and the automation of the pipeline, we are able to generate a large-scale dataset, which is important for training scene text recognizers and end-to-end models (we easily generated 1.3M images, while SynthText3D only generated 10K).

(4) Multilingual text: with the new text foreground generation module, which is more compatible with different language scripts, we are able to generate a large-scale multilingual dataset that will aid research into this important yet largely untouched topic.

C.3 Technical details

First, we list the code for the two papers here for reference:

• UnrealText: https://github.com/Jyouhou/UnrealText

• SynthText3D: https://github.com/MhLiao/SynthText3D

We released the code right after acceptance, and everyone is free to inspect it. Below we address the differences in each part of the pipeline.

C.3.1 Viewfinding

In this step, we need to select a camera location and rotation. SynthText3D is entirely based on manual annotation.

Each annotation (both rotation and location) has to be selected carefully. As discussed below in Sec. C.3.2, the quality of this annotation is very important, so it requires significant human effort. The viewfinding module then randomly selects views from these annotated anchors and additionally perturbs the selected locations and rotations with white noise. In other words, viewfinding in SynthText3D is manual selection plus norm-ball noise (i.e., the random viewpoint augmentation), and the resulting diversity is in fact very limited. Note that the white noise is not accumulated; if it were, the procedure would resemble a 3D random walk (though without considering physics constraints). Besides, the noise in SynthText3D does not respect physics constraints: it can produce views inside objects, which is highly undesirable. The implementation of SynthText3D's viewfinding consists of (1) sampling manually selected camera rotations and locations^8 and (2) random viewpoint augmentation^9,^10.

UnrealText, in contrast, is a semi-automatic method that actively perceives its surroundings with ray tracing.

^8 https://github.com/MhLiao/SynthText3D/blob/144d9a0696495f8aa88786882600ade4b6f5d415/Code/GenerateData.py#L202

^9 https://github.com/MhLiao/SynthText3D/blob/144d9a0696495f8aa88786882600ade4b6f5d415/Code/GenerateData.py#L483

^10 https://github.com/MhLiao/SynthText3D/blob/144d9a0696495f8aa88786882600ade4b6f5d415/Code/GenerateData.py#L541

We first annotate some camera locations (no rotation is needed). One merit here is that the selection of these locations does not require any special care, as long as they cover most areas of the scene model. In practice, we simply wandered through the scene once and randomly recorded locations, which is enough for our method; the anchors are only used to ensure coverage. Then, a newly designed algorithm uses ray tracing to explore and navigate the scene model under physics constraints (neither colliding with nor getting inside object meshes).

The manual selection of anchors and the random walk algorithm are combined to achieve the goals of this module: automatic, fast, explorative, and diverse. UnrealText's method achieves much higher coverage and diversity while requiring little human effort.

Also recall that a random walk in 3D space is not recurrent; therefore, a pure 3D random walk is inefficient in terms of exploration. The combination of low-cost auxiliary camera anchor selection and a diverse 3D random walk is key to the efficiency and scalability of the method.

The implementation of UnrealText's viewfinding is here^11.
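For illustration, the following is a minimal, self-contained sketch of the anchor-plus-constrained-random-walk idea described above; it is not the actual C++ implementation in CameraWanderActor.cpp, and `line_trace_blocked` is a hypothetical stand-in for the engine's collision query:

```python
import math
import random

def line_trace_blocked(start, end):
    """Hypothetical stand-in for the engine's ray/line trace: returns True if
    the straight segment from `start` to `end` hits any mesh.  Stubbed out so
    that the sketch runs on its own."""
    return False

def random_walk_views(anchors, steps_per_anchor=10, step_len=100.0):
    """Explore the scene by random-walking from coarse, manually recorded
    camera anchors, rejecting steps that would collide with or enter meshes.
    Returns a list of (location, rotation) camera views."""
    views = []
    for anchor in anchors:
        loc = list(anchor)
        for _ in range(steps_per_anchor):
            # Propose a random step in the horizontal plane plus a small
            # vertical component.
            yaw = random.uniform(0.0, 2.0 * math.pi)
            dz = random.uniform(-0.2, 0.2) * step_len
            proposal = [loc[0] + step_len * math.cos(yaw),
                        loc[1] + step_len * math.sin(yaw),
                        loc[2] + dz]
            # Physics constraint: reject the step if the path is blocked,
            # which also keeps the camera from ending up inside a mesh.
            if line_trace_blocked(loc, proposal):
                continue
            loc = proposal
            # A random rotation (pitch, yaw, roll) is sampled for each shot.
            rot = (random.uniform(-30, 30), math.degrees(yaw), 0.0)
            views.append((tuple(loc), rot))
    return views

print(len(random_walk_views([(0.0, 0.0, 150.0)])))
```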

Based on the above analysis, SynthText3D's algorithm is inferior to UnrealText's.

C.3.2 Text region proposals and refinements

We first describe the step-by-step algorithms.

SynthText3D: Large bounding boxes are mined from the normal segmentation map and projected into 3D coordinates using the depth map. The four corners are then iteratively clipped/shrunk until the shape looks like a rectangle. This clipping step only computes the angles of the four corners of the projected quadrilaterals from their coordinates; there is no actual interaction with the surrounding objects. In this sense, SynthText3D is in spirit very similar to SynthText or VISD, where the methods are built solely upon scene semantics (segmentation and depth), and segmentation and depth are merely proxy values for the whole scene.

They are implemented in the methods CreateTriangleCameraLight for preStereo and CreateTriangleCameraLight2^12.
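For illustration, a minimal sketch of such a coordinate-only rectangle test (our own simplification, not SynthText3D's actual code):

```python
import math

def corner_angles(quad):
    """Interior angles (in degrees) at the four corners of a quadrilateral
    given as a list of (x, y) points in order."""
    angles = []
    for i in range(4):
        p_prev, p, p_next = quad[i - 1], quad[i], quad[(i + 1) % 4]
        v1 = (p_prev[0] - p[0], p_prev[1] - p[1])
        v2 = (p_next[0] - p[0], p_next[1] - p[1])
        cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_a)))))
    return angles

def looks_like_rectangle(quad, tol_deg=10.0):
    """True if every corner angle is close enough to 90 degrees; the test uses
    coordinates only, with no interaction with the surrounding meshes."""
    return all(abs(a - 90.0) <= tol_deg for a in corner_angles(quad))

print(corner_angles([(0, 0), (4, 0), (4, 2), (0, 2)]))         # four right angles
print(looks_like_rectangle([(0, 0), (4, 1), (5, 3), (0, 2)]))  # False
```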

UnrealText: Small prototype squares are mined from the normal segmentation map and projected into the 3D world with ray tracing.

^11 https://github.com/Jyouhou/UnrealText/blob/949e3196278e8d33916aab11b454b6d776f477cf/code/UnrealText/Source/UnrealCV/Private/UnrealText/CameraWanderActor.cpp#L63

^12 https://github.com/MhLiao/SynthText3D/blob/master/Code/Unrealtext-Source/UnrealCV/Private/PugTextPawn.cpp


Then, the program calculates the vectors representing the horizontal and gravitational directions and instantiates a small square object whose upper edge is orthogonal to gravity and which is parallel to the underlying surface. The square is then iteratively expanded until it hits other mesh boundaries, etc. During each expansion, ray tracing is performed to determine whether the current text region has collided with other meshes, or gone off or beneath the underlying surface. This step takes full advantage of interactions with the 3D world (meshes, etc.). The code is here^13.
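For illustration, a minimal sketch of the expansion loop described above (not the engine's C++ code); `region_is_valid` is a hypothetical stand-in for the ray-tracing and collision checks, stubbed out here so the sketch runs on its own:

```python
def region_is_valid(center, half_w, half_h, max_extent=200.0):
    """Hypothetical stand-in for the ray-tracing checks: in the engine this
    tests for collisions with other meshes and departures from the underlying
    surface; here it simply limits the region to a fixed extent."""
    return half_w <= max_extent and half_h <= max_extent

def expand_region(center, init_half=5.0, step=5.0, max_iters=1000):
    """Grow the prototype square alternately in width and height while the
    expanded region is still valid."""
    half_w = half_h = init_half
    for i in range(max_iters):
        grow_w = (i % 2 == 0)
        new_w = half_w + step if grow_w else half_w
        new_h = half_h if grow_w else half_h + step
        if not region_is_valid(center, new_w, new_h):
            break
        half_w, half_h = new_w, new_h
    return half_w, half_h

print(expand_region((0.0, 0.0, 0.0)))  # (200.0, 200.0) with the stub above
```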

SynthText3D's method has the following limitation: proposing from the 2D screen space using the normal map cannot exhaustively find the whole suitable surface, and it tends to focus on the middle of available surfaces. Consider this example: the camera faces a wall diagonally, so the wall's region in the normal segmentation map is trapezium-shaped. SynthText3D's method can only mine medium-sized rectangular proposal boxes, which are then clipped even smaller; it is impossible to render text at locations such as those near the borders of the wall. As a result, significant care and human effort are required when labeling camera rotations and locations; otherwise, the quality of the proposed and refined text regions drops significantly. In contrast, UnrealText's method can actually span the whole wall.

We refer readers to Fig. 4 on page 5 of the UnrealText paper. UnrealText can easily fit a text instance onto the stone eaves; SynthText3D is unable to do so unless we annotate a camera right in front of the eaves, facing directly at them. Also note that, for complex scenes, ill-posed surfaces are ubiquitous in nearly all camera locations and rotations.

In conclusion, this limitation results in two problems for SynthText3D's implementation: (1) the camera anchors need to be selected very carefully; otherwise, in the refinement step, the boxes all degenerate into single points and rendering fails; (2) it fails to cover surfaces with large aspect ratios, especially when the camera is not facing them head-on.

There is another limitation of relying solely on the normal segmentation map. For example, suppose a square pillar stands in front of a wall, with the frontal side of the pillar parallel to the wall; in the normal map, the boundary between the pillar and the wall is then indistinguishable. We implemented a mechanism in UnrealText that makes sure the proposed prototypes are indeed located on single surfaces, instead of spanning across different parallel surfaces^14.

^13 https://github.com/Jyouhou/UnrealText/blob/949e3196278e8d33916aab11b454b6d776f477cf/code/UnrealText/Source/UnrealCV/Private/UnrealText/StickerTextActor.cpp#L554

^14 https://github.com/Jyouhou/UnrealText/blob/949e3196278e8d33916aab11b454b6d776f477cf/code/UnrealText/Source/UnrealCV/Private/UnrealText/StickerTextActor.cpp#L473

Basically, it determines whether, when projected into the 3D world, all pixels in the same proposed prototype lie on the same planar surface. This is a newly designed mechanism in UnrealText.
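For illustration, a minimal sketch of such a planarity test; in the engine the 3D points come from ray tracing, and this helper is our own simplification rather than the actual BoxSanityCheck implementation:

```python
import numpy as np

def on_single_plane(points_3d, tol=1.0):
    """True if all points lie within `tol` of the plane fitted through them."""
    pts = np.asarray(points_3d, dtype=float)
    centroid = pts.mean(axis=0)
    # The smallest singular vector of the centered points is the plane normal.
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]
    distances = np.abs((pts - centroid) @ normal)
    return bool(distances.max() <= tol)

wall = [(0, 0, 0), (100, 0, 0), (100, 0, 50), (0, 0, 50)]
wall_plus_pillar = wall + [(50, 30, 25)]   # one point sits on a nearer, parallel surface
print(on_single_plane(wall))               # True
print(on_single_plane(wall_plus_pillar))   # False
```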

In terms of design philosophy, SynthText3D is SynthText with a 3D engine: the overall design still follows the idea of SynthText, and the de facto implementation only exploits the fact that 3D engines provide precise segmentation and depth information. UnrealText, in contrast, was designed with a totally different philosophy: it is built upon interactions with the 3D world (the meshes and objects). Therefore, UnrealText exhibits many advantages over SynthText3D.

C.3.2.1 Does UnrealText need the normal segmentation map?

While both SynthText3D and UnrealText use the normal segmentation map (though in different ways), UnrealText does not really need it in principle. UnrealText is equipped with a powerful mechanism to filter out boxes that fail to fall on well-defined regions (the BoxSanityCheck^15 method mentioned right above). Besides, the refinement module, as described above, can automatically fill any surface and search for well-posed regions by interacting with the 3D world. In practice, the small prototype boxes in UnrealText can be generated randomly; the BoxSanityCheck function then filters out unsuitable locations, and the refinement module afterwards fills the suitable surface and finds good text regions correctly.

In fact, when we developed the refinement module in UnrealText, we used the same set of box proposals in screen space to tune the refinement module under different views. It turned out that the refinement module could still generate well-positioned text regions with these preset box proposals, regardless of the views and the normal segmentation map.

The reason we ended up using the normal segmentation map is that it lets us make sure the proposed boxes are sampled evenly over different object surfaces, instead of all landing on a single large surface. We achieve this by making a copy of the normal map and iteratively masking out the normal segmentation regions already occupied by previously proposed boxes^16. Nevertheless, this represents a different usage from SynthText3D.
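For illustration, a minimal sketch of this iterative masking idea; the map size, box size, and sampling rule are illustrative only, not the actual BoxProposing.py logic:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prototypes(valid_mask, num_boxes=5, box_size=40):
    """Sample prototype boxes from a binary map of usable pixels, masking out
    the neighborhood of each accepted box so later samples spread over the
    remaining, still-unoccupied regions."""
    mask = valid_mask.copy()          # work on a copy, keep the original intact
    boxes = []
    for _ in range(num_boxes):
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:
            break                     # every usable region is already occupied
        i = rng.integers(len(ys))
        y, x = int(ys[i]), int(xs[i])
        boxes.append((x, y, box_size, box_size))
        # Mask out the area taken by this prototype (plus its own footprint).
        y0, y1 = max(0, y - box_size), y + box_size
        x0, x1 = max(0, x - box_size), x + box_size
        mask[y0:y1, x0:x1] = False
    return boxes

valid = np.ones((240, 320), dtype=bool)   # pretend the whole map is usable
print(sample_prototypes(valid))
```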


^15 https://github.com/Jyouhou/UnrealText/blob/949e3196278e8d33916aab11b454b6d776f477cf/code/UnrealText/Source/UnrealCV/Private/UnrealText/StickerTextActor.cpp#L473

^16 https://github.com/Jyouhou/UnrealText/blob/949e3196278e8d33916aab11b454b6d776f477cf/code/DataGenerator/BoxProposing.py#L118


C.3.3 Text foreground generation

SynthText3D directly uses SynthText's original code^17 to generate text foregrounds, which is based on the PyGame library. In our experiments, this module has several shortcomings: (1) it is slow; (2) it cannot generate images with designated heights and widths and frequently produces oversized images with completely different aspect ratios; since SynthText3D has no remedy for this problem, a proportion of the SynthText3D data have incorrect aspect ratios and are seriously distorted; (3) it is based on single-character rendering, which does not support languages such as Arabic where characters are not separated, and is therefore unfriendly to multilingual data generation.

In UnrealText, the module is implemented^18 with the PIL library. Its merits are as follows: (1) it is about 100 times faster; (2) the pipeline is carefully designed so that the generated text foregrounds have correct aspect ratios; (3) diverse text layouts are incorporated; (4) the newly implemented module adapts easily to multilingual text (which is why we are able to make a large-scale multilingual dataset).
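For illustration, a minimal PIL-based sketch of producing a foreground with exactly the designated size and aspect ratio; the function and the placeholder font handling are ours, not the actual WordImageGenerationModule.py:

```python
from PIL import Image, ImageDraw, ImageFont

def render_word(text, target_w, target_h, font_path=None, font_size=64):
    """Draw `text` at a natural size, then resize the whole foreground to the
    requested width and height so it always matches the target region.  Any
    TrueType font (including non-Latin ones for multilingual scripts) can be
    supplied via `font_path`; without one, PIL's built-in font is used."""
    font = (ImageFont.truetype(font_path, font_size)
            if font_path else ImageFont.load_default())
    # Measure the tight bounding box of the text with a scratch canvas.
    scratch = ImageDraw.Draw(Image.new("RGBA", (1, 1)))
    left, top, right, bottom = scratch.textbbox((0, 0), text, font=font)
    w, h = right - left, bottom - top
    # Draw onto a transparent canvas of exactly the measured size.
    canvas = Image.new("RGBA", (w, h), (0, 0, 0, 0))
    ImageDraw.Draw(canvas).text((-left, -top), text, font=font,
                                fill=(255, 255, 255, 255))
    # Resize to the designated region size, so downstream rendering never
    # receives an oversized or wrongly proportioned foreground.
    return canvas.resize((target_w, target_h))

foreground = render_word("UnrealText", 256, 64)
print(foreground.size)  # (256, 64)
```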

C.3.4 Text mesh generation

This step instantiates a mesh object in the 3D world to load the text foregrounds.

SynthText3D first uses the depth map to compute an approximate 3D location for each pixel (since the depth map is stored with limited precision, the depth values are approximate); it then casts a ray from the camera towards the approximate location until the ray hits some mesh surface. This does not allow occlusion, since every pixel has to be reachable from the camera; moreover, if a pixel is invisible, its depth value is meaningless.

For UnrealText, recall that step C.3.2 finds a text region that is closely aligned with some mesh surface. Ray tracing starting from proximal points (slightly above this surface along the normal direction) then hits the underlying surface, and the text foregrounds are printed onto it. The whole generation process is independent of the camera views. In this way, UnrealText can render text even when it is not fully visible (i.e., occluded, e.g., the right image of Fig. 1 of the UnrealText paper), while SynthText3D requires the whole text region to be visible to the camera.
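For illustration, a minimal geometric sketch of this camera-independent placement; `cast_ray` is a hypothetical stand-in for the engine's line trace and simply intersects a plane here so that the sketch is self-contained:

```python
import numpy as np

def cast_ray(origin, direction, plane_point, plane_normal):
    """Intersect a ray with a plane (stand-in for the engine's line trace)."""
    direction = direction / np.linalg.norm(direction)
    denom = float(np.dot(direction, plane_normal))
    t = float(np.dot(plane_point - origin, plane_normal)) / denom
    return origin + t * direction

def attach_corners(region_corners, surface_normal, surface_point, offset=2.0):
    """Lift each corner of the refined text region slightly above the surface
    along its normal, then cast a ray back along the negative normal to find
    the exact attachment point on the mesh, independently of any camera."""
    surface_normal = np.asarray(surface_normal, float)
    hits = []
    for corner in np.asarray(region_corners, float):
        proximal = corner + offset * surface_normal       # slightly above surface
        hit = cast_ray(proximal, -surface_normal,
                       np.asarray(surface_point, float), surface_normal)
        hits.append(hit)
    return np.array(hits)

corners = [(0, 0, 0.3), (50, 0, 0.1), (50, 20, -0.2), (0, 20, 0.0)]
print(attach_corners(corners, (0, 0, 1), (0, 0, 0)))      # all hits land on z = 0
```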

In conclusion, UnrealText directly realizes the occlusion effect, whereas SynthText3D can only realize such an effect by changing the camera location and rotation after the rendering. For more details, see the AStickerTextActor::CreateRectTriangle function in StickerTextActor.cpp of the UnrealText repo.

^17 https://github.com/MhLiao/SynthText3D/blob/master/Code/text_utils.py

^18 https://github.com/Jyouhou/UnrealText/blob/949e3196278e8d33916aab11b454b6d776f477cf/code/DataGenerator/WordImageGenerationModule.py


C.3.5 Environment randomization

As mentioned in the SynthText3D paper, there are 4 preset lighting conditions that are baked before generating the data, and the data generation pipeline is applied to 4 different game executables. In other words, the environment conditions are fixed and pre-computed offline. Furthermore, exactly the same 4 types of lighting conditions (day, night, fog, dawn) are used for all scene models, which leads to a lack of diversity. Such an approach is also not scalable.

In UnrealText, the environment is randomized for each image, with much richer randomization techniques such as rotation of the light directions. The lighting intensity, color, and direction, the fog intensity, etc. are all highly randomized. In this way, UnrealText achieves better domain adaptation via such domain randomization techniques. See the code here^19,^20.
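For illustration, a minimal sketch of per-shot environment randomization; the parameter names and ranges are illustrative only, and in the real pipeline the sampled values are applied to the engine's light and fog actors:

```python
import random

def sample_environment(rng=random):
    """Sample a fresh set of lighting and fog parameters for one shot,
    instead of reusing a few pre-baked conditions."""
    return {
        "light_intensity": rng.uniform(0.5, 10.0),
        "light_color": (rng.uniform(0.7, 1.0),
                        rng.uniform(0.7, 1.0),
                        rng.uniform(0.7, 1.0)),
        "light_direction": (rng.uniform(-90.0, 0.0),   # pitch
                            rng.uniform(0.0, 360.0),   # yaw
                            0.0),                      # roll
        "fog_density": rng.uniform(0.0, 0.05),
    }

for shot in range(3):
    print(sample_environment())
```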

C.3.6 Data generation (taking shots)

For each camera location, SynthText3D renders once and then perturbs the camera location and rotation (within a norm-ball proximity) to take 20 shots. In fact, the rendering step is much slower: it can take 30-40 seconds to render a view with 5 text regions, while taking the photos is effortless, so the averaged overall timing only appears acceptable. Another potential issue is that the different shots taken from one view may look similar, limiting the overall diversity of the generated dataset.

UnrealText takes only one shot for each rendering, and each rendering generates 15 text regions. It yields much more diversity since each rendered view is used only once. This relies on the efficiency of UnrealText: it takes only 1 second to render a view with 15 text regions, which is faster than SynthText3D by an order of magnitude.

(Note that the timings presented here are measured using the same machine.)

C.3.7 Clarification regarding some potential similarities

C.3.7.1 Pipeline

The overall rendering pipelines might look similar, because we follow an intuitive and straightforward pipeline that has been used widely in previous works [1-3]: (1) viewfinding (or finding backgrounds without text), (2) scene analysis (segmentation, depth estimation, saliency estimation, etc.), (3) proposing text regions, (4) refining text regions (perspective distortion, etc.), (5) generating text foregrounds, and (6) rendering text onto the backgrounds.

^19 https://github.com/Jyouhou/UnrealText/blob/949e3196278e8d33916aab11b454b6d776f477cf/code/UnrealText/Source/UnrealCV/Private/UnrealText/EnvJitterActor.cpp#L39

^20 https://github.com/Jyouhou/UnrealText/blob/949e3196278e8d33916aab11b454b6d776f477cf/code/DataGenerator/DataGeneratorModule.py#L147



This pipeline, which consists of several steps, is ordinary; what distinguishes these papers and ours are the algorithms used in each step. This is similar to an NLP pipeline in which one first removes stop words, performs stemming/lemmatization, and then feeds the result into the model.

[1] Gupta et al. Synthetic data for text localisation in natural images (CVPR 2016)

[2] Zhan et al. Verisimilar image synthesis for accurate detection and recognition of texts in scenes (ECCV 2018)

[3] Zhan et al. Scene Text Synthesis for Efficient and Effective Deep Network Training (TPAMI 2019)

C.3.7.2 Experiment settings

For the experiments, training EAST detection models and ASTER recognition models on different datasets is another common practice in scene text detection and recognition research [2-7]; we simply follow this protocol. This is similar to the ImageNet experiments in neural architecture search papers: a widely accepted testbed.

[4] Zhan et al. GA-DAN: Geometry-Aware Domain Adaptation Network for Scene Text Detection and Recognition (ICCV 2019)

[5] Yang et al. SwapText: Image Based Texts Transfer in Scenes (CVPR 2020)

[6] Baek et al. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis (ICCV 2019)

[7] Xu et al. Geometry Normalization Networks for Accurate Scene Text Detection (ICCV 2019)

C.4 Conclusion

As summarized above, the two papers have significant technical differences: there is little similarity in any step of the rendering pipeline, and they are designed with totally different philosophies. While SynthText3D can be viewed as SynthText with a 3D engine, UnrealText is a truly 3D engine in which the rendering process interacts deeply with the 3D world.

Furthermore, there are many technical novelties and improvements in the design of UnrealText. Together, they make an efficient and effective scene text image synthesizer with much higher speed, better scalability, richer diversity, and support for multilingual data generation.