
Tens of Thousands of Nom Character Recognition by Deep Convolution Neural Networks

Kha Cong Nguyen, Cuong Tuan Nguyen, Nakagawa Masaki
Tokyo University of Agriculture and Technology

2-24-16 Naka-chou, Koganei-shi, Tokyo, Japan, 184-0012
[email protected], [email protected], [email protected]

ABSTRACT
This paper proposes a method to recognize a large set of 32,695 Nom characters, which were used in Vietnam from the tenth century to the twentieth century before the Latin-based Vietnamese alphabet became common. So far, the largest category sets for which character recognition methods, including the latest deep neural networks, have been studied are about 10,000 for the current sets of Chinese, Japanese, and Korean, but ancient languages of Chinese origin require much larger category sets. Moreover, the lack of training patterns makes the development of Optical Character Recognition (OCR) for Nom a big challenge. On the other hand, the demand to archive Nom historical documents is very high, since a large number of documents remain uninterpreted and the number of scholars who can comprehend Nom is decreasing. Therefore, we propose a method to recognize a very large set of Nom categories with deep convolutional neural networks (CNNs). The proposed method introduces coarse categories, which are prepared beforehand by K-means clustering. We construct a deep CNN composed of a coarse category feature extractor, a coarse category classifier, and a fine category classifier that includes several inception modules. We pre-train the coarse category feature extractor and the coarse category classifier on the coarse categories, freeze them, and then perform fine tuning to recognize characters in the whole set of Nom categories. Unlike a conventional cascade of coarse classification and fine classification, the coarse and fine category classifiers are applied in parallel to the feature maps generated by the feature extractor, and their likelihoods are multiplied. The experiment shows that this architecture provides a better recognition rate than the former cascade of GLVQ and MQDF.

    CCS CONCEPTS • Computing methodologies → Neural networks; Image processing

    KEYWORDS Nom script, VGG net, inception, clustering, OCR

    ACM Reference format: Kha Cong Nguyen, Cuong Tuan Nguyen, and Nakagawa Masaki. 2017. Tens of Thousands of Nom Character Recognition by Deep Convolution Neural Networks. In Proceedings of The 4th International Workshop on Historical Document Imaging and Processing, Kyoto, Japan, November 10-11, 2017 (HIP 2017), 5 pages.

    DOI: 10.1145/3151509.3151517

1 INTRODUCTION
Nom is an ancient script used in Vietnam until the current Latin-based Vietnamese alphabet became common. From the tenth century to the twentieth century, the documents in Vietnam were recorded in Nom script, so the script is invaluable. Nowadays, tens of thousands of Nom documents are stored in families, pagodas, churches, and libraries. Most of them have not yet been digitized and are at high risk of being damaged beyond repair in the near future. As a result, a part of Vietnamese history will be buried forever, inaccessible to the next generations. Recently, there have been many projects to digitize Nom documents in order to preserve this heritage, such as the projects by the National Library of Vietnam [1], by the General Library of Thua Thien Hue and the Temple University Library [2], and by the Tue Quang wisdom light foundation [3].

The common drawback of these projects is that, after scanning Nom documents, they lack a highly accurate OCR for Nom characters, so the digitization process depends mostly on the interpretation of Nom experts. However, fewer than 100 people worldwide can now comprehend Nom script, and most of them are elderly.

Several projects on ancient document digitization have been reported. Kim et al. developed a system for digitizing more than ten million handwritten Hanja documents; Hanja was a popular script in Korea until the late nineteenth century [4]. The system applies the Mahalanobis distance for OCR. In China, Digital Heritage Publishing Ltd. digitized more than 36,000 volumes (4.7 million pages) of Siku Quanshu (四庫全書), the largest collection of books, compiled by 361 scholars during the Qianlong period [5]. Both systems use OCR to recognize segmented characters and manually check the recognition results.

For Nom document digitization, we developed a system to recognize Nom characters in the documents scanned by the National Library of Vietnam [6]. In this system, Nom documents are first preprocessed and binarized, then segmented by the X-Y cut method and the Voronoi diagram method. The segmented patterns are recognized by an OCR using generalized learning vector quantization (GLVQ) and a modified quadratic discriminant function (MQDF).



Finally, the system provides a GUI for users to revise the recognition results and save them to text files. Because of the lack of training patterns and the large number of Nom categories, however, the recognition rate is still low. Although the system allows users to automatically convert scanned Nom pages into text files, they have to check the recognized results again, so the digitization process still consumes huge human resources.

Here, we propose a combined architecture of deep CNNs to create a more effective OCR for Nom. Based on the origin and formation of Nom script, we first cluster the Nom categories into small groups, called coarse categories, by the K-means algorithm using directional features. The proposed deep CNN architecture consists of three components: a coarse category feature extractor, a coarse category classifier, and a fine category classifier. We utilize the VGG 11 network [7] for the coarse category feature extractor. We pre-train the coarse category classifier and the coarse category feature extractor, freeze them, and use the coarse category feature extractor to extract coarse features. Then, we apply the two classifiers to the extracted features. The output probability of the fine category classifier is multiplied by that of the coarse category classifier when we train the whole combined network for fine tuning. We do not extend the coarse classifier with more convolution layers, as in the VGG 16 and VGG 19 networks [7], to recognize all fine Nom categories, because such networks consume very large computational resources and memory. Recently, inception layers have been proposed, which can extract very deep features while consuming fewer computational resources and less memory [8]; therefore, we utilize this type of layer for the fine category classifier. In the inception layers, we also apply batch normalization to accelerate the training process and improve learning accuracy. The coarse category classifier allows the fine category classifier to converge quickly and improves the training accuracy during fine tuning.

The remainder of the paper is organized as follows. Section 2 presents the origin and formation of Nom script. Section 3 reviews related work on coarse and fine classification by CNNs. Section 4 describes the proposed network model. Section 5 presents the experimental results, and Section 6 concludes the paper.

2 ORIGIN AND FORMATION OF NOM SCRIPT
In previous work [6], we showed that Nom script includes at least 32,695 characters, based on studies of Nom fonts and publications from the Vietnamese Nom Preservation Foundation. Nom script originated from Chinese and is generally formed by the following methods [9]:

Borrowing Chinese characters: A large proportion of Nom characters were borrowed from Chinese in the Tang period. There are three ways of borrowing: using both the meaning and the sound of the original character, borrowing the sound but not the meaning, and borrowing the meaning but not the sound.

Locally invented by Vietnamese: The most common kind of invented Nom character is the phono-semantic compound, which combines two Chinese characters or components, one representing the word's meaning and the other approximating its sound, such as 爸 (father), made from 父 (representing its meaning, father) and 巴 (indicating its sound). A smaller group is made by composing the semantics of characters, such as 𡗶 (heaven), composed of 天 (sky) and 上 (upper). An even smaller group is modified from original Chinese characters, such as 𧘇 (Vietnamese sound: ý) from the original Chinese character 衣 (Vietnamese sound: y).

About 60% of Nom characters were invented by Vietnamese people [9]. The later a document was produced, the higher this proportion, while the proportion of Chinese-origin characters decreased.

3 RELATED WORK
Many studies group objects into coarse categories before classifying them into fine categories. Cevahir et al. combine deep belief nets and deep auto-encoder neural network models to classify 28,338 fine categories of large-scale e-commerce data belonging to five super classes [10]. They utilize textual contents such as titles and descriptions but ignore the image contents of products. Jie et al. consider image classification under weakly supervised learning [11]. Training data are separated into coarse labels, each of which covers several fine labels. They investigate how coarsely labeled data can help improve fine-label classification by a CNN model. The model first generates a feature map for each fine category. Average pooling is applied to the fine feature maps to obtain the classification scores, and min-pooling is used to obtain the coarse-class feature maps. The classification scores for fine classes and coarse classes are finally combined. Yan et al. apply a hierarchical deep convolutional neural network (HD-CNN) to large-scale visual recognition [12]. They first pre-train a CNN model for 20 coarse categories. Then, they pre-train a model for each coarse category. After both the coarse category network and the fine category networks for the 20 coarse categories are properly pre-trained, they fine-tune the complete HD-CNN. Because the number of parameters in the fine category networks grows linearly with the number of coarse categories, they compress the parameters of the fine category networks by K-means clustering.

4 STRUCTURE OF NETWORK
We first prepare coarse categories for Nom. We tentatively set the number of coarse categories to 304, mainly because Nom script includes 304 radicals, and partially because we have been employing several hundred coarse categories for Kanji character classification. However, if we group Nom categories by radicals, some characters in a group have very complicated structures while others have simple structures, with the result that coarse category classification is hard to converge. Therefore, we use only this number of categories and do not employ radical-based classification.

    Fig. 1. Coarse category creation



Instead, we use K-means to cluster the 32,695 Nom categories into 304 coarse categories. The clustering process used to create the coarse categories is shown in Fig. 1. We normalize Nom character patterns by the non-linear line density projection interpolation (LDPI) method, which is recognized as the best normalization method [13]. Then, we extract 512-dimensional directional features by the normalization-cooperated gradient feature (NCGF) method. To make the features more discriminative so that K-means clustering is more effective, we reduce the dimensionality of the original features to 160 by Fisher Linear Discriminant Analysis (FLDA).
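
As a concrete illustration, the clustering step could be sketched as follows, assuming the 512-dimensional NCGF features have already been extracted into a matrix. Here scikit-learn's linear discriminant analysis stands in for FLDA, representing each fine category by its mean feature vector is an assumption (the paper does not state how categories are summarized for clustering), and the function name is hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def build_coarse_categories(features, fine_labels, n_coarse=304, n_dims=160):
        """Cluster the fine Nom categories into coarse categories (a sketch).

        features    : (n_samples, 512) NCGF directional features, assumed given
        fine_labels : (n_samples,) fine-category index (0 .. n_fine-1) of each sample
        """
        # Reduce the 512-d features to 160-d with (Fisher) linear discriminant analysis
        lda = LinearDiscriminantAnalysis(n_components=n_dims)
        reduced = lda.fit_transform(features, fine_labels)

        # Represent each fine category by the mean of its reduced feature vectors
        n_fine = fine_labels.max() + 1
        means = np.stack([reduced[fine_labels == c].mean(axis=0) for c in range(n_fine)])

        # K-means groups the fine-category representatives into 304 coarse categories
        kmeans = KMeans(n_clusters=n_coarse, n_init=10, random_state=0)
        return kmeans.fit_predict(means)   # coarse label for each fine category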

We compose a deep CNN of a coarse category feature extractor (A), a coarse category classifier (B), and a fine category classifier, as shown in Fig. 2. We construct the A block from six convolution layers, shown in the blue text box in Fig. 3, based on the VGG 11 network structure. We adjust the sizes of padding, stride, and pooling so that the layers extract shallower features than the originally proposed VGG 11 net. For the B block, we combine two fully connected layers and a softmax layer, as shown in the red text box in Fig. 3. We use dropout layers in both blocks to reduce overfitting and accelerate learning. We then pre-train the two blocks and freeze their weights. The probability that an input pattern x is assigned to a coarse class C_i, i in {1, ..., M = 304}, by the coarse category classifier after the softmax layer is P(C_i | x). The probability of a fine category F_j, j in {1, ..., N = 32,695}, given the coarse category C_i, is P(F_j | C_i, x):

P(F_j \mid C_i, x) = P(C_i \mid x) \quad \text{if } F_j \in C_i \qquad (1)
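
The following is a minimal PyTorch-style sketch of the A and B blocks described above. The exact kernel, padding, stride, pooling, and channel sizes are given only in Fig. 3, so the values below are placeholders, not the authors' configuration.

    import torch.nn as nn

    class CoarseFeatureExtractor(nn.Module):           # block A: six VGG-style conv layers
        def __init__(self):
            super().__init__()
            chans = [1, 64, 128, 256, 256, 512, 512]   # placeholder channel sizes
            layers = []
            for i in range(6):
                layers += [nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
                if i % 2 == 1:                         # pool after every second conv (assumed)
                    layers.append(nn.MaxPool2d(2))
            self.features = nn.Sequential(*layers)

        def forward(self, x):
            return self.features(x)

    class CoarseClassifier(nn.Module):                 # block B: two FC layers and softmax
        def __init__(self, in_features, n_coarse=304):
            super().__init__()
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(in_features, 1024), nn.ReLU(inplace=True), nn.Dropout(0.5),
                nn.Linear(1024, n_coarse))             # softmax is applied after this layer

        def forward(self, x):
            return self.classifier(x)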

The coarse category feature extractor with frozen weights is used to extract features for the fine category classifier, as shown on the right side of Fig. 3. An adaptation layer is used to fit the output of the coarse category feature extractor to the fine category classifier. For the fine category classifier, we employ several inception blocks as proposed for the Inception-v4 network [8]. The reason we use inception modules instead of extending the VGG 11 network to the VGG 16 or VGG 19 networks in [7] is that inception modules allow us to create a very deep network without increasing the computational complexity or the number of operations compared with the VGG 16 and VGG 19 networks. Inception blocks, which include several branches of filters of different sizes, allow us to extract features from the input feature maps in parallel and concatenate the feature maps of all branches. Fig. 4 shows the Inception-A block of Fig. 3, which contains four parallel branches of filters. Reduction blocks are the same as inception blocks, except that their pooling layers reduce the size of the feature maps.
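
For illustration, a simplified Inception-A-style block in the same PyTorch style is given below. The branch widths and filter sizes are placeholders rather than the exact configuration of Fig. 4; each convolution is followed by batch normalization, as discussed in the next paragraph.

    import torch
    import torch.nn as nn

    def conv_bn(in_ch, out_ch, k, padding=0):
        # convolution followed by batch normalization and ReLU
        return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=padding),
                             nn.BatchNorm2d(out_ch),
                             nn.ReLU(inplace=True))

    class InceptionABlock(nn.Module):
        """Four parallel branches whose outputs are concatenated along the channel axis."""
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = conv_bn(in_ch, 64, 1)                        # 1x1 only
            self.branch2 = nn.Sequential(conv_bn(in_ch, 48, 1),         # 1x1 reduce, then 3x3
                                         conv_bn(48, 64, 3, padding=1))
            self.branch3 = nn.Sequential(conv_bn(in_ch, 48, 1),         # 1x1 reduce, then two 3x3
                                         conv_bn(48, 64, 3, padding=1),
                                         conv_bn(64, 64, 3, padding=1))
            self.branch4 = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                         conv_bn(in_ch, 64, 1))         # pooling branch

        def forward(self, x):
            return torch.cat([self.branch1(x), self.branch2(x),
                              self.branch3(x), self.branch4(x)], dim=1)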

We use the new generation of inception layers proposed for the Inception-v4 network, which consume less computation and memory than those of the earlier inception networks. The inception layers utilize 1×1 convolution layers to reduce the number of feature maps before they reach the expensive parallel branches. Batch normalization is applied after each convolution layer in the inception modules: it computes the mean and standard deviation of all output feature maps of the layer and then normalizes them, so that all feature maps have zero mean and the same range. The next layer therefore does not have to learn shifting offsets of the data, known as the internal covariate shift problem [14], which accelerates training and improves training accuracy.
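
For reference, the standard batch normalization transform of [14] normalizes each feature map x over a mini-batch B and then applies a learned scale and shift:

    \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta

where \mu_B and \sigma_B^2 are the mini-batch mean and variance, \epsilon is a small constant for numerical stability, and \gamma and \beta are learned per-channel parameters.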

    Fig. 2. Proposed CNN structure

    Fig. 3. Coarse category feature extractor, classifier and fine category classifier

    Fig. 4. Inception-A block



The output of the inception blocks is flattened and fed into a fully connected layer and a softmax layer. The output probability of the fine category classifier, P(F_j | x), is multiplied by the output of the coarse classifier to produce the final probability R(F_j | x):

R(F_j \mid x) = \frac{P(F_j \mid x) \, P(F_j \mid C_i, x)}{\sum_{k=1,\; F_k \in C_m}^{N} P(F_k \mid x) \, P(F_k \mid C_m, x)} \qquad (2)
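
A minimal sketch of Eq. (2) on top of the two softmax outputs is shown below, assuming a precomputed mapping from each fine category to its coarse category; the variable names are illustrative.

    import torch

    def combine_probabilities(p_fine, p_coarse, coarse_of_fine):
        """Combine fine and coarse softmax outputs as in Eq. (2).

        p_fine         : (batch, 32695) fine-category probabilities P(F_j | x)
        p_coarse       : (batch, 304)   coarse-category probabilities P(C_i | x)
        coarse_of_fine : (32695,) long tensor mapping fine index j to its coarse index i
        """
        # P(F_j | C_i, x) = P(C_i | x) for the coarse category that contains F_j
        p_coarse_per_fine = p_coarse[:, coarse_of_fine]               # (batch, 32695)
        unnormalized = p_fine * p_coarse_per_fine
        return unnormalized / unnormalized.sum(dim=1, keepdim=True)   # R(F_j | x)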

5 EXPERIMENT
In this section, we describe an experiment on Nom classification with 32,695 fine categories. Since we do not have a sufficient number of real patterns, we prepared artificial Nom character patterns. For each fine category we generated 500 artificial patterns, as shown in Fig. 5 (a), by applying affine, rotation, shear, shrink, and perspective deformations to characters rendered from 27 Nom fonts such as Nom Na Tong, Nom Khai, and Nom Minh (the deformation step is sketched below). We then applied K-means clustering to create the 304 coarse categories. Fig. 6 shows two examples of coarse categories produced by K-means; characters in the same coarse category clearly have relatively similar structures.

The testing dataset was collected from 47 pages of real Nom documents and contains 11,669 character patterns from 2,539 categories, as shown in Fig. 5 (b). The method used to collect these real patterns is described in [6]. The same testing dataset was previously used to test the OCR based on GLVQ and MQDF.
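
A rough sketch of the deformation step used to generate the artificial patterns is given below; OpenCV warps are used, and the deformation ranges are illustrative guesses, not the parameters used by the authors.

    import cv2
    import numpy as np

    def deform(img, rng):
        """Apply a random small affine (rotation, shear, shrink) or perspective warp."""
        h, w = img.shape[:2]
        if rng.random() < 0.5:
            # affine: rotation combined with a small shear and shrink
            M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-5, 5), rng.uniform(0.9, 1.0))
            M[0, 1] += rng.uniform(-0.1, 0.1)                  # horizontal shear term
            return cv2.warpAffine(img, M, (w, h), borderValue=255)
        # perspective: jitter the four corners slightly
        d = 0.05 * min(h, w)
        src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
        dst = (src + rng.uniform(-d, d, size=(4, 2))).astype(np.float32)
        M = cv2.getPerspectiveTransform(src, dst)
        return cv2.warpPerspective(img, M, (w, h), borderValue=255)

    rng = np.random.default_rng(0)
    glyph = np.full((64, 64), 255, dtype=np.uint8)      # stand-in for a glyph rendered from a Nom font
    cv2.rectangle(glyph, (16, 16), (48, 48), 0, 2)      # dummy strokes for the sketch
    samples = [deform(glyph, rng) for _ in range(500)]  # 500 patterns per category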

We trained the coarse category classifier and fine-tuned the whole network on an Intel Xeon E5-2630 v2 2.6 GHz CPU, two Tesla K20c GPUs, and 32 GB of RAM. Fig. 7 shows the training loss and accuracy of the coarse category classifier every five epochs; it converged within 30 epochs with an accuracy of 95.88%. The fine-tuning of the whole network ran for 22 epochs with a batch size of 250. We used the Adadelta algorithm for optimizing the training process, which is recognized as the best method for optimizing CNNs [15]. The training time for each epoch is about 17 hours on the above-mentioned system. Fig. 8 shows the testing loss and accuracy during fine-tuning of the whole network on the real Nom test set over the 22 epochs.
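
Under the assumption that the combined network outputs the normalized probability R(F_j | x) of Eq. (2), the fine-tuning loop could look roughly like the following; the model and data loader are placeholders, and the use of a negative log-likelihood loss on the combined probability is our assumption, not a detail stated in the paper.

    import torch
    import torch.nn as nn

    def fine_tune(model, train_loader, epochs=22):
        """Fine-tune the combined network with Adadelta; frozen layers keep requires_grad=False."""
        optimizer = torch.optim.Adadelta(p for p in model.parameters() if p.requires_grad)
        criterion = nn.NLLLoss()
        for epoch in range(epochs):
            for images, labels in train_loader:      # batches of 250 artificial patterns
                optimizer.zero_grad()
                probs = model(images)                # combined probabilities R(F_j | x)
                loss = criterion(torch.log(probs + 1e-12), labels)
                loss.backward()
                optimizer.step()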

Fig. 5. Artificial Nom training patterns (a) and real Nom testing patterns (b)

Fig. 6. Fine category characters in coarse categories: (a) coarse category 1, (b) coarse category 277

Fig. 7. Training loss and accuracy of the coarse category classifier

Fig. 8. Testing accuracy and loss after fine-tuning the whole network



Table I compares the recognition rates of the proposed CNN model and our previous OCR composed of GLVQ and MQDF. The recognition rate of the proposed CNN model is more than 12 points higher than that of the previous method. Although the difference in the top-10 rate is not large, the difference in the top-1 rate is significant. This seems to be because the proposed CNN can extract very deep features while the previous method uses only gradient features. We also trained CNN models based on the VGG 19, 152-layer ResNet, and Inception-V3 networks for ten epochs with the same training dataset and the same optimization method, but these networks did not converge within the ten epochs. The cost of each epoch is very high since we have 500 × 32,695 training patterns. Fig. 9 shows the training loss of these three deep CNNs. The coarse category classifier makes the proposed network converge quickly and improves its accuracy.

Fig. 10 shows some examples that are successfully recognized by the proposed CNN but misrecognized by the previous method. It is clear that the MQDF-based method often misrecognizes characters when they are noisy or degraded, or when their shapes are very similar to those of other characters.

6 CONCLUSION
In this paper, we proposed a deep CNN model combining a feature extractor and a coarse category classifier based on the VGG network structure with a fine category classifier built from the latest inception layers, which save computational resources and memory. The experimental results show the benefit of the coarse category classifier for the whole network and a higher recognition rate than the previous method using GLVQ and MQDF. From this result, we can create a more effective OCR for Nom script.

Several tasks remain. We must investigate the number of coarse categories: the value of 304 was derived from the number of radicals, but since the proposed method is not radical-based, this number should be determined through experiments. We also need to prepare a large dataset of real Nom character patterns for testing. It would also be interesting to apply the method to other languages such as ancient Japanese and ancient Chinese.

ACKNOWLEDGMENT
We thank the National Library of Vietnam and the Vietnamese Nom Preservation Foundation for providing the Nom historical document pages. This work is partially supported by the Grant-in-Aid for Scientific Research (S)-25220401.

REFERENCES
[1] Shih, V. J. and Chu, T. L., "The Han Nom Digital Library." In The International Nom Conference, The National Library of Vietnam, Hanoi, Nov. 2004.
[2] Nhan, N. T. and Mai, C., "A Mini-Conference in Nôm Studies." Hanoi, 2014.
[3] Khanh, T. T. and Huyen, T. T., "A Program to Translate the Chinese Taisho Tripitaka into English and Other Western Languages." Presentation at United Nations Vesak Day, Hanoi, Vietnam, May 13-16, 2008.
[4] Kim, M. S., Jang, M. D., Choi, H. I., Rhee, T. H., Kim, J. H., and Kwag, H. K., "Digitalizing scheme of handwritten Hanja historical documents." In Proc. of the 1st International Workshop on Document Image Analysis for Libraries, USA, Jan. 2004, 321-327.
[5] Liu, C. L. and Lu, Y., "Advances in Chinese Document and Text Processing." World Scientific (Vol. 2), 2017.
[6] Van Phan, T., Nguyen, K. C., and Nakagawa, M., "A Nom historical document recognition system for digital archiving." International Journal on Document Analysis and Recognition (IJDAR), 19(1), pp. 49-64, 2016.
[7] Simonyan, K. and Zisserman, A., "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556, 2014.
[8] Szegedy, C., et al., "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning." AAAI, 2017.
[9] Dao, D. A., "Chữ Nôm: nguồn gốc, cấu tạo, diễn biến (Chu Nom: origins, formation, and transformations)." Hà Nội: Nhà Xuất Bản Khoa Học Xã Hội, 1979.
[10] Cevahir, A. and Murakami, K., "Large-scale Multi-class and Hierarchical Product Categorization for an E-commerce Giant." In Proc. COLING, pp. 525-535, 2016.
[11] Jie, L., Zhenyu, G., and Yang, W., "Weakly Supervised Image Classification with Coarse and Fine Labels." The 14th Conference on Computer and Robot Vision (CRV), 2017.
[12] Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., and Yu, Y., "HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition." In Proceedings of the IEEE International Conference on Computer Vision, pp. 2740-2748, 2015.
[13] Van Phan, T., Gao, J., Zhu, B., and Nakagawa, M., "Effects of line densities on nonlinear normalization for online handwritten Japanese character recognition." In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pp. 834-838. IEEE, 2011.
[14] Ioffe, S. and Szegedy, C., "Batch normalization: Accelerating deep network training by reducing internal covariate shift." In International Conference on Machine Learning, pp. 448-456, 2015.
[15] Zeiler, M. D., "ADADELTA: an adaptive learning rate method." arXiv preprint arXiv:1212.5701, 2012.

Table I. Recognition rate after 22 epochs

                 Recog. rate top 1 (%)   Recog. rate top 10 (%)   No. misrecognized
GLVQ+MQDF        69.08                   86.03                    3,608
Proposed CNN     81.73                   86.34                    2,132

Fig. 9. Training loss for different very deep CNNs

Fig. 10. Examples recognized correctly by the proposed method but misrecognized by the former method: (未→木) (宅→田) (喔→陸) (福→福) (頭→現) (王→上) (之→辷) (冲→沖) (檊→榊) (成→戒)
