Combining a Hybrid Approach for Features Selection and Hidden Markov Models in Multifont Arabic Characters Recognition

1 Nadia Ben Amor and 2 Najoua Essoukri Ben Amara
1 National Engineering School of Tunis (ENIT), Tunisia
2 National Engineering School of Sousse (ENISo), Tunisia
Laboratory of Systems and Signal Processing (LSTS)
[email protected], [email protected], [email protected]


Abstract

Optical Character Recognition (OCR) has been an active subject of research since the early days of computers. Despite the age of the subject, it remains one of the most challenging and exciting areas of research in computer science. In recent years it has grown into a mature discipline, producing a huge body of work. In this paper, we present an Arabic optical multifont character recognition approach based on both the Hough transform and the wavelet transform for feature selection, and on Hidden Markov Models for classification. In the next sections, the whole OCR system is presented, together with the different tests carried out on a set of about 170,000 samples of multifont Arabic characters and the results obtained so far.

1. Introduction

Arabic belongs to the group of Semitic alphabetical scripts in which mainly the consonants are represented in writing, while the marking of vowels (using diacritics) is optional.

This language is spoken by almost 250 million people and is the official language of 19 countries. There are two main types of written Arabic: classical Arabic, the language of the Quran and of classical literature, and Modern Standard Arabic, the universal language of the Arabic-speaking world, which is understood by all Arabic speakers. Each Arabic-speaking country or region also has its own variety of colloquial spoken Arabic.

Due to the cursive nature of the script, there are several characteristics that make recognition of Arabic distinct from the recognition of Latin or Chinese scripts.

The work we present in this paper belongs to the general field of Arabic document recognition and explores the use of multiple sources of information. In fact, several experiments we have carried out [1, 2, 3, 4] have proved the importance of combining different types of information at different levels (feature extraction, classification, etc.) in order to overcome the variability of Arabic characters, especially in a multifont context.

The following figure shows some Arabic characters in the five fonts we have considered so far.


Fig. 1. Illustration of the characters 'Mim', 'Noun', 'Kef', 'Dhel' and 'Té' in the five considered fonts: Arabic Transparent, Badr, Diwani, Alhada and Kufi


Despite the various research efforts in the field of Arabic OCR (AOCR), it is still not possible to evaluate the achieved performances objectively, since the reported tests have not been carried out on the same database. Thus, the idea is to develop several single and hybrid approaches and to test them on the same database of multifont Arabic characters, so that we can identify the most suitable combination or method for Arabic character recognition.

In this paper, we present an Arabic optical multifont character recognition approach based on both the Hough transform and the wavelet transform for feature selection, and on Hidden Markov Models for classification.

2. Character Recognition System

The main stages of the AOCR system we have developed are presented in the block diagram of Figure 2.

2.1. Pre-Processing

Pre-processing covers all those functions carried out prior to features extraction to produce a cleaned up version of the original image so that it can be used directly and efficiently by the feature extraction components of the OCR.

In our case, when wavelet-transform features are used, this step is limited to noise reduction, noise being a random error in pixel values usually introduced as a result of the reproduction and digitisation of the original image.

For Hough transform feature extraction, the goal of image pre-processing is to generate a simple line-drawing image such as the one in Figure 3, which shows the detected edges of the character 'Noun'. Our implementation uses the Canny edge detector [6] for this extraction. While the extracted edges are generally good, they include many short, incorrect (noise) edges as well as the correct boundaries.

Noise edges are removed through a two-step process: first, connected components are extracted from the thresholded edge image; then the smallest components, those with the fewest edge pixels, are eliminated. After noise removal, the resulting edges are quite clean.

Fig. 3. Edge extraction using the Canny edge detector
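A minimal sketch of this pre-processing step is given below, assuming the OpenCV library is used; the Canny thresholds and the minimum component size are illustrative values, not the ones used by the authors.

# Sketch of the edge extraction and noise-edge removal described above.
# Assumes OpenCV; thresholds and the minimum component size are illustrative.
import cv2
import numpy as np

def extract_clean_edges(gray_image, min_component_size=10):
    """Detect edges with Canny, then drop small connected components (noise edges)."""
    # Canny edge detection on the grey-level character image (uint8).
    edges = cv2.Canny(gray_image, 50, 150)

    # Label connected components in the binary edge map.
    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(edges, connectivity=8)

    # Keep only components with enough edge pixels; label 0 is the background.
    clean = np.zeros_like(edges)
    for label in range(1, num_labels):
        if stats[label, cv2.CC_STAT_AREA] >= min_component_size:
            clean[labels == label] = 255
    return clean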

2.2. Features Extraction

We will emphasise this important step of an OCR system, since it represents the main point of the work presented in this paper.

In fact, one of the two basic steps of pattern recognition is features selection. We quote from Lippmann [10]: “Features should contain information required to distinguish between classes, be insensitive to irrelevant variability in the input, and also be limited in number to permit efficient computation of discriminant functions and to limit the amount of training data required.”

Features extraction involves measuring those features of the input character that are relevant to classification. After features extraction, the character is represented by the set of extracted features. There is an infinite number of potential features that one can extract from a finite 2D pattern. However, only those features that are of possible relevance to classification need to be considered. This entails that, during the design stage, the expert focuses on those features which, given a certain classification technique, will produce the most efficient classification results.

Obviously, the extraction of suitable features helps the system reach the best recognition rate. In a previous work, we have used wavelet transform in order to extract features and we have obtained very encouraging results. We have also developed a Hough Transform based method for features extraction.

Fig. 2. Block diagram of the OCR: acquisition and pre-processing, feature extraction, character learning (models) and character recognition (recognized characters)


The main idea behind this work is to take advantage of the two methods and to combine their respective results, by developing an Arabic optical multifont character recognition approach based on both the Hough transform and the wavelet transform for feature selection and Hidden Markov Models for classification.

Wavelet transform

Thanks to its efficiency, the wavelet transform is increasingly used in writing recognition systems and in signature verification [8].

In fact, the wavelet transform, which is widely used in image and signal compression [7], appears very promising as far as feature extraction is concerned.

We tested several kinds of wavelets such as Haar, Symmlet and Daubechies, but we retained the Daubechies 3 wavelet since it yielded better results than the other ones.

In addition to the features obtained from the wavelet transform, we kept the black-pixel density of the image, which is an interesting criterion, especially in a multifont context.

Thus, the extracted parameters are as follows (a sketch of how they might be computed is given after the list):

• the black-pixel density
• the mean absolute deviation and the standard deviation of the matrix corresponding to the approximate image
• the mean absolute deviation and the standard deviation of the matrix associated with the horizontal details
• the mean absolute deviation and the standard deviation of the matrix associated with the vertical details
• the mean absolute deviation and the standard deviation of the matrix associated with the diagonal details
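The following is a minimal sketch of how these nine parameters could be computed, assuming the PyWavelets library and a single-level Daubechies-3 decomposition; the decomposition level and the pixel coding convention are assumptions, not details stated in the paper.

# Sketch of the nine wavelet-based parameters listed above (assumptions:
# PyWavelets, single-level db3 decomposition, black pixels coded as 1).
import numpy as np
import pywt

def mad_and_std(coeffs):
    """Mean absolute deviation and standard deviation of a coefficient matrix."""
    c = np.asarray(coeffs, dtype=float)
    return float(np.mean(np.abs(c - c.mean()))), float(c.std())

def wavelet_features(binary_char_image):
    """Black-pixel density plus MAD/std of the approximation and detail sub-bands."""
    img = np.asarray(binary_char_image, dtype=float)

    # Black-pixel density (assuming black pixels are coded as 1).
    density = img.sum() / img.size

    # Single-level 2D DWT: approximation and horizontal/vertical/diagonal details.
    cA, (cH, cV, cD) = pywt.dwt2(img, 'db3')

    features = [density]
    for band in (cA, cH, cV, cD):
        features.extend(mad_and_std(band))
    return np.array(features)   # 1 + 4*2 = 9 parameters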

Hough transform

In addition to the wavelet-transform method for feature extraction, we used the Hough transform [9], which was previously employed as the only method for feature selection in an earlier work [2].

The Hough Transform (HT) gathers evidence for the parameters of the equation that defines a shape by mapping image points into the space defined by the parameters of the curve. After gathering evidence, shapes are extracted by finding local maxima in the parameter space (i.e., local peaks). The HT is a robust technique capable of handling significant levels of noise and occlusion.

The Hough technique is particularly useful for computing a global description of a feature (where the number of solution classes need not be known a priori) from local measurements. The motivating idea behind the Hough technique for line detection is that each input measurement (e.g., a coordinate point) contributes to a globally consistent solution.

The Hough transform is used to identify features of a particular shape within a character image, such as straight lines, curves and circles. When using the HT to detect straight lines, we rely on the fact that a line can be expressed in parametric form by the formula r = x cos θ + y sin θ, where r is the length of the normal from the origin to the line and θ is the orientation of this normal with respect to the x-axis.

To find all the lines within the character image, we need to build up the Hough parameter space H. This is a two-dimensional array of accumulator cells. These cells are initialised with zero values and are filled with line lengths (pixel counts) for particular values of θ and r. The range of θ is usually from 0° to 180°, although often only a subset of these angles needs to be considered, since we are usually only interested in lines lying in particular directions.

In our case, we have used only nine effective orientations for each character, whereas in a previous work fifty orientations were considered. For each black pixel p(x, y) in the image, we take each angle θ along which we wish to find lines, calculate the value r as defined above, and increment the value held in the accumulator cell H(r, θ) by 1. The resulting matrix then holds, for each (r, θ), the number of pixels that lie on the line r = x cos θ + y sin θ.

Lines passing through more pixels will have higher values than lines passing through fewer pixels. A line can be plotted by substituting values for either x and y or r and θ and calculating the corresponding coordinates.

The advantage of the Hough transform is the fact that it operates globally on the image rather than locally. The Hough transform works by allowing each edge point in the image to vote for all lines that pass through the point, and then selecting the lines with the most votes. After all edge points are considered, the peaks in the parameter space indicate which lines are supported by the most points from the image.


The transformation between feature space and parameter space is the following:

Project a line through each edge pixel at every possible angle (the angles can also be incremented in discrete steps). For each line, calculate the minimum distance between the line and the origin, and increment the appropriate parameter-space accumulator by one.
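A minimal sketch of this accumulator voting in NumPy is given below. The nine orientations are evenly spaced here for illustration, and reducing the accumulator to one value per orientation is our assumption; the exact orientations and features used by the authors are not specified in this level of detail.

# Sketch of the Hough accumulator voting described above (NumPy only).
# Orientations and the per-orientation feature reduction are assumptions.
import numpy as np

def hough_line_features(edge_image, n_angles=9):
    """Accumulate votes for r = x*cos(theta) + y*sin(theta) over edge pixels."""
    ys, xs = np.nonzero(edge_image)                      # coordinates of edge pixels
    thetas = np.deg2rad(np.linspace(0.0, 180.0, n_angles, endpoint=False))

    # The image diagonal bounds the r axis of the accumulator.
    h, w = edge_image.shape
    r_max = int(np.ceil(np.hypot(h, w)))
    accumulator = np.zeros((2 * r_max + 1, n_angles), dtype=int)

    # Each edge pixel votes once per orientation.
    for x, y in zip(xs, ys):
        for j, theta in enumerate(thetas):
            r = int(round(x * np.cos(theta) + y * np.sin(theta)))
            accumulator[r + r_max, j] += 1               # offset so negative r fits

    # One feature per orientation, e.g. the strongest line support in that direction.
    return accumulator.max(axis=0)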

Through these two methods, a total of eighteen parameters is extracted. The classification is performed by the Hidden Markov Model classifier presented in the next section.
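Under the same assumptions as the two sketches above (the hypothetical wavelet_features and hough_line_features helpers), the hybrid feature vector could simply concatenate the two sets of parameters; the concatenation order is an assumption.

# Combining the two hypothetical feature extractors sketched above into a
# single 18-dimensional observation vector (9 wavelet-based + 9 Hough-based).
import numpy as np

def hybrid_features(binary_char_image, edge_image):
    wavelet_part = wavelet_features(binary_char_image)   # 9 parameters
    hough_part = hough_line_features(edge_image)         # 9 parameters (one per orientation)
    return np.concatenate([wavelet_part, hough_part])    # 18 parameters in total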

2.3. Hidden Markov Models Classifier

Hidden Markov Models, or HMMs, are widely used in many fields where temporal (or spatial) dependencies are present in the data [5]. During the last decade, HMMs, which can be thought of as a generalization of dynamic programming techniques, have become a very interesting approach in character recognition. The power of HMMs lies in the fact that the parameters used to model the signal can be well optimized, and this results in lower computational complexity in the decoding procedure as well as improved recognition accuracy. Furthermore, other knowledge sources can also be represented with the same structure, which is one of the important advantages of Hidden Markov modelling [11].

HMM Topology

The retained HMMs use a left-to-right topology, in which each state has a transition to itself and to the next state. The HMM for each character has 4 to 7 states, and we have found that 5 is approximately the optimal number of states.

Fig. 4. HMM left-to-right topology
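A minimal sketch of the 5-state left-to-right transition structure described above is given below: each state may loop on itself or move to the next state. The 0.5/0.5 probabilities are placeholders, not the trained values.

# Sketch of a 5-state left-to-right transition matrix (self + next transitions).
# The probability values are illustrative placeholders only.
import numpy as np

n_states = 5
transmat = np.zeros((n_states, n_states))
for i in range(n_states - 1):
    transmat[i, i] = 0.5        # self-transition
    transmat[i, i + 1] = 0.5    # transition to the next state
transmat[-1, -1] = 1.0          # final state loops on itself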

The standard approach is to assume a simple probabilistic model of character production whereby a specified character C produces an observation sequence O with probability P(O | C). The goal is then to decode the character, based on the observation sequence, so that the decoded character has the maximum a posteriori probability.

Recognition: since the models are labelled by the identity of the corresponding characters, the task of recognition is to identify, among a set of L models λk, k = 1…L, the one (i.e. the character) which gives the best interpretation of the observation sequence to be decoded, i.e.:

Car = arg max P(O | λk), 1 ≤ k ≤ L
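A sketch of this recognition rule is given below, assuming one trained model per character; the choice of the hmmlearn library, Gaussian emissions and the way the 18 parameters are arranged into an observation sequence are all assumptions, not details stated in the paper.

# Sketch of recognition as arg max over per-character model likelihoods.
# Assumes hmmlearn GaussianHMM models trained elsewhere, one per character.
from hmmlearn import hmm  # noqa: F401  (dependency of the trained models)

def recognise(observation_sequence, models):
    """Return the character label whose model best explains the sequence.

    `models` maps a character label to a trained hmm.GaussianHMM;
    `observation_sequence` is an (n_frames, n_features) array built from the
    extracted parameters (illustrative arrangement).
    """
    scores = {label: model.score(observation_sequence) for label, model in models.items()}
    return max(scores, key=scores.get)   # arg max over P(O | lambda_k)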


3. Experimental Results

The recognition rates achieved per character are shown in Table 1.

Table 1: Recognition rate per character (%)

95.75 95.60 97.32 98.35 94.42 97.01 97.01 95.83 98.27 95.60 98.03 98.97 96.30 98.49
98.08 97.78 98.64 95.63 95.00 95.41 97.67 97.15 96.68 94.58 94.98 95.39 94.91 94.73

4. Conclusion

A wide variety of techniques is used to perform Arabic character recognition. In this paper we presented a hybrid technique based on both wavelet decomposition and the Hough transform for feature extraction and hidden Markov models for classification. As the results show, designing an appropriate set of features for the classifier is a vital part of the system, and the achieved recognition rate owes much to the selection of features, especially when dealing with multifont characters. We intend to carry out other hybrid approaches at the level of feature extraction as well as at the level of classification, in order to take advantage of the characteristics of the most efficient methods tested individually.

5. References

[1] N. Ben Amor and N. Essoukri Ben Amara, “A Hybrid Approach for Multifont Arabic Characters Recognition”, Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED'06), Madrid, Spain, February 15-17, 2006.

[2] N. Ben Amor and N. Essoukri Ben Amara, “Multifont Arabic Character Recognition Using Hough Transform and Hidden Markov Models”, ISPA 2005, 4th IEEE International Symposium on Image and Signal Processing and Analysis, Zagreb, Croatia, September 15-17, 2005.

[3] N. Ben Amor and N. Essoukri Ben Amara, “Hidden Markov Models and Wavelet Transform in Multifont Arabic Characters Recognition”, International Conference on Computing, Communications and Control Technologies (CCCT 2005), Austin, Texas, USA, July 24-27, 2005.

[4] N. Ben Amor and N. Essoukri Ben Amara, “Two Approaches for Multifont Arabic Characters Recognition”, Second International Conference on Machine Intelligence (ACIDCA-ICMI'2005), Tozeur, Tunisia, November 5-7, 2005.

[5] R.-D. Bippus and M. Lehning, “Cursive Script Recognition Using Semi-Continuous Hidden Markov Models in Combination with Simple Features”, European Workshop on Handwriting Analysis and Recognition, Brussels, July 1994.

[6] J. F. Canny, “A Computational Approach to Edge Detection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 6, pp. 679-698, 1986.

[7] N. Ben Amor and N. Essoukri Ben Amara, “DICOM Image Compression by Wavelet Transform”, Proc. IEEE International Conference on Systems, Man and Cybernetics, vol. 2, October 6-9, 2002.

[8] A. Fadhel and P. Bhattacharyya, “Application of a Steerable Wavelet Transform Using Neural Network for Signature Verification”, Pattern Analysis & Applications, Springer-Verlag, London, pp. 184-195, 1999.

[9] J. Illingworth and J. Kittler, “A Survey of the Hough Transform”, Computer Vision, Graphics and Image Processing, vol. 44, pp. 87-116, 1988.

[10] R. Lippmann, “Pattern Classification Using Neural Networks”, IEEE Communications Magazine, p. 48, November 1989.

[11] R.-D. Bippus and V. Maergner, “Script Recognition Using Inhomogeneous P2DHMM and Hierarchical Search Space Reduction”, Proc. ICDAR'99, pp. 773-776, Bangalore, India, September 1999.

