Writer Identification Using Texture Analysis

Preview:

Citation preview

Writer Identification Using Texture Analysis

Manish Manoria1 Amit Sinhal2 Manish Maheshwari3

1 Department of Computer Science and Engineering, Samrat Ashok Technological Institute, Vidisha (M.P.), India

manishmanoria@rediffmail.com 2 Department of Computer Science and Engineering, University Institute of Technology,

Barkatullah University, Bhopal (M.P.), India amit_sinhal@rediffmail.com

3 Department of Computer Science, Makhanlal Chaturvedi Rashtriya Patrakarita Vishwa Vidyalaya (MCRPVV), Bhopal (M.P.), India

manishbhom@yahoo.com Abstract. In this paper, we describe a new method to identify the writer of handwritten documents. There are many methods for signature verification or writer identification, but most of them require segmentation or connected component analysis. They are the kinds of content dependent identification methods as signature verification requires the writer to write the same text (e.g. his name). In our new method, we take the handwriting as an image containing some special texture, and writer identification is regarded as texture identification. This is a content independent for a particular script. We apply the well-established 2-D Gabor filtering technique and grey scale co-occurrence matrices to extract features of such textures and a weighted Euclidean distance classifier and K-nearest neighbor classifier to fulfill the identification task. Experiments are made using handwritings from different people in different languages and very promising results were achieved.

1 Introduction Biometric personal identification is an important research area aiming at automatic identity recognition and is receiving growing interest from both academia and industry. There are two types of biometric features: physiological (e.g. face, iris pattern and fingerprint) and behavioral (e.g. voice and handwriting). Personal identification based on handwriting is a behavioral biometric identification approach. Handwriting is easy to obtain and different people have different handwritings. Handwriting (HW) based personal identification has a wide variety of potential applications, from security, forensics, financial activities to archeology (e.g. to identify ancient document writers).

Writer Identification based on HW is an active research topic in the computer vision and pattern recognition filed [1]. However, most existing methods assume that the text is fixed (e.g. in signature verification). These methods extract such features as writing speed, direction, duration, height, width, and slant angle, black pixels etc. They rely on the interactive localization and segmentation of the relevant text information which remains to be difficult and far from being solved.

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

Bhanu Prasad (Editor): IICAI-05, pp. 527-536, 2005.Copyright © IICAI 2005

Since the purpose of Writer Identification based on HW is to identify the writer of a specific handwriting, one does not need to know what the written text is. In this paper, we describe a new algorithm. The key point is using texture analysis to extract features. Each person’s handwriting is seen as having a specific texture. The spatial frequency and orientation contents represent the features of each texture. It is these texture features that we use to identify writers of handwritings. It is the content for a particular script and requires no segmentation or connected component analysis. We use the well-established multi-channel Gabor filters to extract these features. It has demonstrated good performance in texture discrimination and segmentation [2-4].

Handwriting is one of the easiest and natural ways of communication between humans and computers. Signature verification has been an active research topic for several decades in the image processing and signature verification remains a challenging issue. It provides a way of identifying the writer of a piece of handwriting in order to verify the claimed identity in security and related applications. It requires the writer to write the same fixed text. In the sense, signature verification may also be called text dependent writer verification. In practice, the requirement and the use of fixed text makes writer verification it relatively less useful as it's not useful in many application. For example the identification of the writer of archived handwritten documents, crime suspect, identification in forensic science, etc. In these applications, the writer of a piece of handwriting is often identified by professional handwriting examiners (graphologist). Although human intervention in the text independent writer identification has been effective, it is costly and prone to fatigue. This task is much more difficult and useful than signature verification. 2 Previous Approaches There have been a few attempts in writer identification in the past few years [5]. Shrihari and Cha [6] extracted twelve shape features from the handwriting text lines to represent personal handwriting style. The features mainly contained visible characteristics of the handwriting, such as width, slant and height of the main writing zones. Some other papers also adopted multiple features integration to writer identification [7]. In reference [8] and [9], two writer verification systems were proposed, judging an individual identity based on a short sentence and each word in the sentence is used to tackle the individual verification problem. So the proposed method makes it more difficult for the forgery to simulate the authentic writer than the previous methods using only signature.

We had studied several papers on Handwriting Analysis and Document Analysis and try to use a new approach for each phase. 3 Proposed Approach Research into writer identification has been focused on two streams, off-line and on-line writer identification [10,11]. Offline is further sub divided into two parts Text Independent and Text Dependent; here we are working on the former one as that is the more general case. This approach is called "Texture Analysis Technique". This uses a

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

528

feature set whose component describes global statistical feature extracted from the entire image of a text.

Handwriting Analysis

On Line Offline

Text Independent Text Dependent

Texture Analysis Technique The approach is based on the "Computer Image Processing and Pattern Recognition" techniques to solve the different phases of the problem- pre processing, feature extraction and selection of the correct writer. The algorithm is based on texture analysis and illustrated in figure. The three main stages are described below:

Preprocessing Feature Extraction Identification

As we shown above basically we can contain the whole work in three steps. We had described all the three phases as below. 3.1 Preprocessing of handwriting images Texture Analysis cannot be applied directly to input handwriting images as texture is affected by noise, different word spacing, varying line spacing etc. The influence of such factors is minimized by preprocessing [12-16]. This is subdivided into following steps. 3.1.1 Thresholding and noise removal The input to a thresholding operation is typically a grayscale or color image. In the simplest implementation, the output is a binary image representing the segmentation. Black pixels correspond to background and white pixels correspond to foreground (or vice versa). In simple implementations, the segmentation is determined by a single parameter known as the intensity threshold.

• Each pixel in the image is compared with this threshold. If the pixel's intensity is higher than the threshold, the pixel is set to, say, white in the output. If it is less than the threshold, it is set to black.

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

529

Noise removal is important part of preprocessing which can affect the feature set. Noise can occur during scanning the Handwriting

• Apply the Salt and Pepper filter to remove the noise. 3.1.2 Skew correction Locating Text Lines Using Horizontal Projection Profile (HPP) • Construct Histogram of black pixels against rows • Find peaks in the histogram. This corresponds to approximately the center of a

line of text. • Find the points with maximum radius of curvature on both sides of the peaks.

This corresponds to the top and bottom of a line of text • Repeat the process for various angles. Say from -15o to +15o • For each of the angles find height from points of maximum curvature to peak • The angle having the minimum of these heights corresponds to the skew angle. • Rotate the image by the angle calculated above. Here we are using the simple

technique of transformation. Now the whole document will become horizontal.

3.1.3 Padding Padding is the process of adding the text in the incomplete lines of the handwritten image. It is also necessary as this will also affect the feature set • Incomplete lines are padded with text taken from the start of the line 3.1.4 Spacing between words/lines The spacing between lines and words could be varying that will change the feature so we had predefined these spacing so that the feature won't get changed. • Define margins, spacing between words and lines. • Construct a new image with the above parameters.

Now we will choose the random non overlapping blocks from this pre-processed page from which the features will be extracted. We are choosing the randomized page as to improve the performance. After this step we will get an image with the specified spacing between lines and words and with the appropriate padding. 3.2 Feature Extraction In principle, any texture analysis technique can be applied to extract features from each uniform block of handwriting.

By using the multi-channel Gabor filtering technique and Grey Scale Co-occurrence Matrices technique, we can fully analyze handwriting texture in different scales. Both the inter-class variability and intra-class similarity are included in these texture features.

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

530

3.2.1 Gabor filter The multi channel Gabor filtering technique is inspired by the psycho physical findings that the processing of pictorial information in the human visual cortex involves a set of parallel and quasi independent mechanisms or cortical channels which can be modeled by band pass filters. Mathematically the 2D Gabor filter ca be expressed as follows:

(1) h(x,y) = g(x,y)e-2∏j(uox+voy) Where g(x, y) is a 2-D Gaussian function of the form

(2) g(x,y) = e1/2[(x2+y2)/σ2] Each cortical channel is modeled by a pair of Gabor filters

He(x,y,f:θ) = g(x,y)Cos(2Πf(xCos θ +ySin θ) (3) Ho(x,y,f:θ) = g(x,y)Sin(2Πf(xCos θ +ySin θ)

We will convolve the input image with these two filters and then we will got two

images say qe (x, y) and qo (x, y) then by applying the following formula we will got the final image as a gabor output for a particular value of f and theta.

q(x,y) = √[qe(x,y)2+qo(x,y)2] (4)

The two important parameters of Gabor filter are f (the radial frequency) and

theta (orientation), which define the location of the channel in the frequency plane. Commonly used frequencies are power of 2. It is known that for any image of size N x N , the important frequency components are within f<=N/4 cycles/degree.

In this we will use frequencies of 4, 8, 16, 32 cycles/degree. For each central frequency f, filtering is performed at an angle of 00, 450, 900, and 1350. This gives a total of 16 output images (four for each frequency) from which the writer's features are extracted. 3.2.2 Grey Scale co-occurrence matrices (GSCM) For an image represented using N x N grey levels, each GSCM is of size N x N. Generally speaking, GSCMs are very expensive to compute. In the case of binary handwriting images, we have only two grey levels. Therefore, in this case, it is reasonable to use the GSCM technique.

In this paper, GSCMs were constructed for five distances (d=1,2,3,4,5) and four directions 0, 45, 90 and 135. This gives each input handwriting image 20 matrices of dimension 2x2. When the size of the GSCM is too large to allow the direct use of matrix elements, measurements such as energy, entropy, contrast and correlation are computed from the matrix and used as features. For each 2x2 GSCM derived from a binary handwriting image however, there are only 3 independent

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

531

values due to the diagonal symmetry. There these 3 values are used directly as features. We then have 60 (20x3) GSCM features per handwriting image. 3.3 Writer identification We will require a database to start with. So for this we will first define our domain of writers then from each of these writers we will take 10 documents and then we will maintain a database of features.

There are essentially two classifiers, namely the Weighted Euclidean Distance (WED) classifier and the K-Nearest neighbor classifier (K-NN). 3.3.1 The weighted Euclidean distance classifier Representative features for each writer are determined from the features extracted from training handwriting texts of the writer. Similar feature extraction operations are carried out, for an input handwritten text block by an unknown writer. The extracted features are then compared with the representative features of a set of known writers. The writer of the handwriting is identified as writer K by the WED classifier if the following distance function is a minimum at K: N

d(k) = ∑ [(fn-nfk)2/(nvk)2] (5) n=1

Where fn is the nth feature of the input document, and nf(k) and nv(k) are the sample mean and the sample standard deviation of the nth feature of writer k respectively.

3.3.2 The nearest neighbor classifier (K-NN) When using the nearest neighbor classifier (K-NN), for each class K in the training set the ideal feature vectors is given as fk. Then we detect and measure the features of the unknown writer represented as U. To determine the class K of the writer we measure the similarity with each class by computing the distance between the feature vectors fk and U. The distance measure used here is the Euclidean distance. Then the distance computed dk of the unknown writer from class is given by N

dk= [ ∑ (Uj - fkj)2 ]1/2 (6) j=1

Where j=1, 2, …, N (N is the number of the features considered). The writer is then assigned to the class R such that:

dr = min(dk) (7)

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

532

where k is the number of classes. Other more sophisticated measures and classifiers such as neural network classifiers could have been used. The emphasis in this paper, however, is computational simplicity. 4 Experimental Results We trained our software with images of handwritten documents in English, French and Hindi languages for 10 writers with 10 pages per writer and maintained a database. With this system we now tried various test documents (these documents were not used during training).

Fig. 1 -Original Image Fig. 2 –After Thresholding

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

533

Fig. 3–After Noise Removal Fig. 4 –After Skew Correction

Table 1. Gabor filtering technique under WED and K-NN

Features Accuracy WED K-NN All 94% 78% SD 90% 82% Mean 86% 76% All at f=32 82% 68% All at f=16 67% 65% All at f=8 60% 55%

Table 2. GSCM technique under WED and K-NN

Features Accuracy WED K-NN d=1,2,3,4,5 74% 68% d=2,3,4,5 70% 72% d=3,4,5 66% 61% d=4,5 62% 58% d=1,2, 90% 88% d=1,2,3 60% 75%

Fig. 5 After Padding & Space Normalization

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

534

5 Conclusions The experiments use 10 different writer classes. Features were extracted from handwriting images using the multi-channel Gabor filtering and the grey scale co-occurrence matrix (GSCM) technique. Identification was performed using two different classifiers (the weighted Euclidean distance (WED) and the K-nearest neighbor (K-NN) classifiers). The results achieved were very promising and identification accuracy as high as 96.0% was obtained. The two classifiers have shown good performance, but at some stages the KNN classifier’s performance is relatively poor when compared to the WED classifier with Gabor filtering technique.

We have presented a new algorithm for Writer Identification. Different to most existing methods, our algorithm is text independent. Compared with other techniques, the method proposed in this paper has the following important advantages: • Unlike signature verification, this method is content independent for a

particular Script. The content of training handwriting samples is not required to be the same as testing samples.

• This method is Script independent. We have applied for English, French and Hindi language.

• The method needs neither segmentation nor connected component analysis. • In theory, any texture classification and analysis technique (such as Gabor

filtering, GSCM) can be applied in our method. • Because of preprocessing, the new method can function well even if the

input image contains a small amount of text. • The method needs no complex computing. It may easily be applied in

practical applications.

All of these demonstrate that the new method is able to handle writer identification tasks efficiently. It is a promising technique for Writer identification. References

1. Plamond and G. Lorette: Automatic Signature Verification and Writer Identification-The State of Art, Pattern Recognition, Vol.22, No.2. (1989)107-131.

2. K. Jain, and S. Bhattacharjee: Text Segmentation Using Gabor Filters for Automatic Document Processing, Machine and Vision Applications. (1992)169-184.

3. T. R. Reed and J. M. Hans Du Buf: A Review of Recent Texture Segmentation and Feature Extraction Techniques, CVGIP: Image Understanding, Vol.57. (1993) pp.359-372.

4. R. Jain, R. Kasturi, B. G. Schunck: Machine Vision, McGraw-Hill, Inc. (1995) 5. W. Kuckuck: Writer Recognition by Spectra Analysis, Proc. Int. Conf. In Security

through Science Engineering. (1980) 1-3.

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

535

6. S. Cha and S. N. Srihari: Multiple Feature Integration for Writer Verification, the Proceedings of 7 th IWFHR2000, Amsterdam, Netherlands. ( 2000) 333-342

7. E.N. Zois, and V. Anastassopouls: Fusion of Correlated Decisions for Writer Verification, Pattern Recognition, vol. 32, NO. 10. (1999) 1821-1823

8. H. Bunke, Dr. U.-V. Marti, R. Messerli: Writer Identification Using Text Line Based Features, Proceedings of 6th ICDAR '01,

9. Yasushi Yamazaki and Naohisa Komatsu: A proposal for Text-Indicated Writer Verification Method. Proceedings of ICDAR’97

10. Rejean Plamondon and Sargur N. Srihari: On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey, IEEE Trans. Pattern Analysis and Machine Intelligence, vol.22, no.1. ( 2000.)

11. Mami KIKUCHI and Norio AKAMATSU: Development of Speedy and High Sensitive Pen System for Writing Pressure and Writer Identification. Proceedings of 6th ICDAR '01.

12. H.S Baird: The Skew Angle of Printed Documents, Proceedings of Society of Photographic Scientific Engineering, Vol. 40. (1987) 21-24.

13. T. Akiyama and N. Hagita: Automatic Entry for Printed Documents, Pattern Recognition, Vol.23. (1990) 141-1,154.

14. T. Pavlidis and J. Zhou: Page Segmentation and classification, Computer Vision Graphics and Image Processing, Vol.26. (1995) 2,107-2,123.

15. W. Postal: Detection of Linear Oblique Structures and Skew in Digitized Documents, Proceeding of Eighth International Conference on Pattern Recognition (1986) 464-468.

16. A. L. Spitz: Text Characterization by Connected Component Transformations, Proc. Of SPIE, Document Recognition (1994) 97-105

2nd Indian International Conference on Artificial Intelligence (IICAI-05)

536