Abdelwaheb BELAID

1

Natural Language Processing: Arabic Cursive Handwriting Recognition

A. Belaïd


2

Preamble

Arabic
– Part of the Semitic language family

One of the most spoken languages in the world
– Nearly 250 million people speak Arabic

Spoken outside Arabic countries
– Over 600,000 people in the United States speak Arabic

Gave rise to other alphabets
– Farsi, Urdu…
• spoken by millions of people in Iran, Pakistan, India…

3

Preamble

The work presented here is that of numerous researchers working with me…
– Najoua Ben Amara
– Samia Maddouri
– Afef Kacem
– Imen Ben Cheikh
– Hiba Khelil
– Mohamed Yazid Boudaren
– Nazih Ouwayed
– Christophe Choisy
– Umapada Pal

4

Presentation outline

Introduction
– Brief history, specific applications

Writing characteristics
– Those shared with Latin
– Those specific to Arabic

Issues of cursive word recognition
– Reading models
– Global word-based (holistic), local letter-based (analytical) and hybrid approaches

Language Processing
– Some basic solutions for handwriting recognition

5

Introduction: The field arose before the advent of computers

[Timeline figure, 1900 to 2008 (marks at 1900, 1916, 1950, 1965, 1980, 2008), with curves labelled Maturity and Price. Milestones in order: patents on OCR (blind, telegraph); working models; OCR in industry; first postal address reader; forms; handwritten forms; small devices, intelligent pen.]

6

Today for Latin: an industry and a real market

Hardware:
– Scanners adapted to documents

We know how to
– scan documents in huge quantities while preserving the image quality
– compress them and publish them on the net
– recognize them by OCR and identify some structural elements

But…
– only on very good quality documents
– rather recent, poorly structured, printed
– Handwriting:
• only a few works are available, for specific applications with a small vocabulary

7

Introduction: Arabic

Started around 1980
– increasing demand for information indexing and retrieval

Pioneers
– A. Amin (Loria),
– M. Cheriet (ETS, Montreal),
– N. Ben Amara (ENIT)…

Today
– Many labs: REGIM (Sfax), LRI (Annaba), READ (Loria), ETS (Montreal), CEDAR (USA), ISI (India)

Many dedicated sessions and workshops
– IWFHR, ICFHR, CIFED, ICDAR, SACH’06

Several public datasets
– IFN/ENIT, DARPA/SAIC, CENPARMI, Farsi-City…
• See Volker Märgner's list

8

Introduction: Arabic

Commercial Arabic OCR
– A number of commercial Arabic OCR engines
• Sakhr's Automatic Reader: ~$1500
• Readiris from IRIS: ~$500
• Verus from NovoDynamics: ~$1300
• OmniPage
– Evaluation by UNLV
• Sakhr (90.33%), OmniPage (86.89%)

Open source Arabic OCR projects
– The Siragi project (started in 2005)
• Part of the Arabic Unix open source project

9

Introduction: Some specific applications

Bank check recognition of courtesy amounts

10

Introduction: Signature recognition, verification, forgery detection…

V. K. Madasu et al., Pattern Recognition, 38 (2005)

Normalized vector angle (α) in boxes

M. A. Ismail et al., Pattern Recognition, 33 (2000)

Global and local features in boxes

Algorithms based on fuzzy concepts

11


Writer identification

Comparison of Gabor-Based Features for Writer Identification of Farsi/Arabic Handwriting, IWFHR,’06, F. Shahabi et al.

Automatic Writer Identification Using Connected-Component Contours and Edge-Based Features of Uppercase Western Script, L. Schomaker et al, PAMI, V. 26, N. 6, 2004

12

Introduction: Word spotting by request

Spotting Words in Handwritten Arabic Documents, S. Srihari et al., SACH 2006

Ch. Choisy, A. Belaid, Cross-learning in Analytic Word Recognition Without Segmentation, IJDAR’02.

Template matching: ALMALIK
Stochastic model: MADAME

[Figure labels: Prototypes, Candidates]

13

Introduction: Newspaper segmentation

Connected Pattern Segmentation and Title Grouping in Newspaper Images, P. E Mitchella et al, ICPR’04

Arabic Page Segmentation, Planet, K. Hadjar et al. ICDAR03

14


Paleographic inspection

INSA Lyon: Auto-similarité de formes pour la discrimination des styles d’écriture des manuscrits médiévaux, I. Moalla, F. LeBourgois, H. Emptoz, A. M. Alimi

Progressive evolution between the 6th and 16th centuries

University of Pisa: to classify and identify medieval scripts

University of Annaba:

15

Introduction: The issue of recognition

Handwritten Latin recognition first showed the way
– In terms of modalities
• On-line vs off-line
– In terms of scripts
• Printed vs handwritten
– In terms of pre-processing
• Shape normalization
• Feature extraction: indices or graphemes
– In terms of methodologies, classified according to:
• whether or not a lexicon is used
• nature of primitives / model: structural, statistical, stochastic
• vision level: local or global

16

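Introduction: Initiated by speech recognition

Today: a well-established PR system

[Figure: recognition pipeline. Preprocessing, then feature extraction and vector quantization, produce a sequence; recognition relies on a tree-structured lexicon (example entries: start, stand, store) and an HMM models database built during enrollment. Example inputs: the spoken sentence « Bonsoir, à demain pour une nouvelle édition du journal » and the Arabic word images بقة, نقش, تفت, لثة.]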

17

Introduction: Where does LP contribute?

[Figure: two-stage system. Enrollment: training data → preprocessing + feature extraction → phoneme / character modeling → character models; text → language modeling → lexicon + grammar. Recognition: image input → preprocessing + feature extraction → recognition search → word sequence.]

18

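Introduction: The bases of the process are well established

[Figure: taxonomy of approaches along the segmentation axis. Global vs analytic; analytic approaches rely on pre-segmentation or on internal segmentation (sliding window over letter models a, b, …, z); learning and recognition by discriminant path or discriminant model.]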

19

Introduction: Performance: criteria influencing the quality

[Figure: three axes. Number of writers: mono, multi, omni. Lexicon size: reduced, large. Writing: guided, unconstrained.]

20

Introduction: The performances are satisfactory for Arabic

[Table columns: Script, Process, Model]

21

Outline

1. Introduction
2. Writing characteristics
3. Issues of Segmentation
4. Natural Language Processing

22

Writing characteristics: Some of them are similar to Latin

Writing lines

Alignment
– Upper line
– Mean line
– Baseline
– Lower line

[Figure labels: Baseline, X-height]

23


Perceptive invariants: what J. C. Simon called regularities and singularities

Letter support

Letter peculiarities

24


Perceptive invariants / regularities and singularities

25



26



27

Latin vs Arabic: What changes?

Essentially the script, always complex

Arabic systems
– The difficulty is permanent: cursiveness, ligatures, tashkeel

Latin systems
– The difficulty is gradual

28


The gaps are significant for Latin, not always for Arabic

Latin: between words

Arabic: everywhere

29


The ligatures are permanent: horizontal and vertical

30

Latin vs Arabic: Arabic has some peculiarities

1. Helpful: accents and diacritical dots contribute to the recognition

31


2. Helpful: the letter elongation contributes to the segmentation

32


3. Helpful: the position of the hamza (16) and the descenders

33

Latin vs Arabic: Arabic peculiarities

5. Helpful: PAWs (pieces of Arabic words) offer a pause in the writing, a decomposition of the writing
• They simplify the apprehension of the script and make linear recognition easier

[Figure label: PAW]

[Al-Badr and R. M. Haralick 1998]
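The decomposition into PAWs can be approximated on a binarized word image. The sketch below assumes, as a simplification not stated on the slide, that each large connected component of ink is one PAW candidate and that small components (dots, diacritics) can be filtered by size; `min_pixels` and the function names are hypothetical.

```python
from collections import deque

def connected_components(image):
    """Label 8-connected ink regions of a binary image (list of rows of 0/1)."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if image[y][x] and not seen[y][x]:
                queue, pixels = deque([(y, x)]), []
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    # visit the 8 neighbours of the current pixel
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and image[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(pixels)
    return components

def paw_candidates(image, min_pixels=3):
    """Keep large components (assumed PAW bodies), ordered right to left."""
    big = [c for c in connected_components(image) if len(c) >= min_pixels]
    return sorted(big, key=lambda c: -max(x for _, x in c))

word = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
]
paws = paw_candidates(word)
```

Ordering from right to left follows the Arabic writing direction; a real system would also reattach diacritics to their PAW rather than discard them.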

34

Arabic script: In conclusion

• Arabic: more global than syllabic
• PAWs: facilitate the recognition
• PAW level ~ letter level in Latin
• For recognition
– In Arabic: reach the PAW level (characteristic information)
– In Latin: reach the letter level
• The PAW level is the stable level, which makes recognition semi-global

35

Outline

1. Introduction
2. Writing characteristics
3. Issues of Recognition
4. Natural Language Processing

36

Issue of segmentation

Due to the local variability
– It is widely accepted that
• segmenting Arabic words into letters is very delicate and not always reliable
• in most attempts, the Arabic word is segmented into graphemes (an approach copied from Latin)
– This is an error!

37

Issues of recognition: Considering Arabic peculiarities

Reading models
– The recognition of a word
• implies the processing of visual data and its interpretation at the linguistic level
– Psychologists call "mental lexicon access"
• the process by which a human associates the image of a word with its meaning

Several models have emerged

38

Interactive Activation Model
McClelland and Rumelhart 1981

Important assumptions
– Perception takes place in multilevel processing:
• Feature, letter, word

A consequence:
– more abstract levels of representation are only accessed via intermediate levels

A third assumption
– Processing combines both bottom-up and top-down information
• readers can use their (top-down) knowledge of words to help identify letter sequences from (bottom-up) visual input

[Figure: three-level network (features, letters, words) with the example words MATE and MOVE; connections are marked + or −.]

Neurons have excitatory and inhibitory connections
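The interplay between bottom-up evidence and top-down word knowledge can be sketched with a toy two-level network. This is only an illustration under simplifying assumptions, not the McClelland and Rumelhart model: the lexicon is hypothetical, and inhibition is approximated by normalizing competing activations instead of using explicit inhibitory weights.

```python
LEXICON = ["MATE", "MOVE"]  # hypothetical 4-letter lexicon

def run_cycles(letter_evidence, n_cycles=3):
    """letter_evidence: one dict per position, mapping letter -> support in [0, 1]."""
    letters = [dict(pos) for pos in letter_evidence]
    words = {}
    for _ in range(n_cycles):
        # Bottom-up: a word is supported by the letters it contains.
        raw = {w: sum(letters[i].get(ch, 0.0) for i, ch in enumerate(w)) / len(w)
               for w in LEXICON}
        # Competition between words (stand-in for inhibition): normalize to 1.
        total = sum(raw.values()) or 1.0
        words = {w: r / total for w, r in raw.items()}
        # Top-down: each letter is boosted by the words that contain it at
        # that position, then letters at the same position compete.
        for i, pos in enumerate(letters):
            boosted = {l: v * (1.0 + sum(a for w, a in words.items() if w[i] == l))
                       for l, v in pos.items()}
            norm = sum(boosted.values()) or 1.0
            letters[i] = {l: v / norm for l, v in boosted.items()}
    return words, letters

# The third letter is visually ambiguous (T or V); the second favours A.
evidence = [{"M": 1.0}, {"A": 0.6, "O": 0.4}, {"T": 0.5, "V": 0.5}, {"E": 1.0}]
words, letters = run_cycles(evidence)
```

After a few cycles the word level favours MATE, and that preference flows back down to resolve the ambiguous third letter in favour of T.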

39

IA & Arabic Recognition

Arabic writing
– fits the reading principle of IA very well
• It clearly privileges the superiority of the whole
• Local perceptual information is only used to help word understanding

But the corresponding model
– should be adapted to consider the PAW level and letter distortions:
• PAWs introduce an intermediate global level

Hence, the similarity is perfect once the model is adapted

40

Perceptro [Côté, Cheriet 98]

Limited number of features
– Ascender, descender, loop
– a word having none of these features cannot be initialized

No training
– no inhibition, rapid saturation

Recognition
– Perceptive cycles
– Top-down & bottom-up

41

Transparent Neural Network

[Maddouri, Belaïd, Ellouze, 03]
– Input correction by FD

[Ben Cheikh, Kacem, 07]
– Slightly extended vocabulary (Tunisian city names)
– Training possibility

[Ben Cheikh, Belaïd, Kacem, 08]
– Wide vocabulary

42

Arabic Recognition: Correction process

[Figure: propagation and back-propagation (?) between the original image, the reconstructed image (harmonics) and the real image. Word images: لص, لحملسر.]

43

Arabic Recognition: Experiments
– 2100 images, 70 words, 63 PAWs
– Without perceptive cycles
• PAW RR: 68.42%
• Word RR: 90%
– With perceptive cycles
• PAW RR: 95%
• Word RR: 97%

44

Arabic Recognition Methodologies

Considering human perception of Arabic writing, with the particularity of PAWs, we revisited the approaches in the literature according to their degree of vision:

• Global-based vision classifiers
• Semi-global-based vision classifiers
• Local-based vision classifiers
• Hybrid-level classifiers

and examined their proximity with IA

45

Methodologies
I. Global-Based Vision Classifier

The word
– regarded as a whole

The features
– do not need to be precise:
• presence and some relationships

The approach
– comparable to a segmentation-free approach
– even if a segmentation is used, no local interpretation is made
– information is gathered at the word level

Its use is limited to small vocabularies

46

GBVC: Examples

Srihari et al. [2005]:
– Several preprocessing steps
– Feature extraction for PAWs and words:
• aspect measurements
– Word resemblance by NN
• 10 writers writing 10 documents each: word extraction is ~60%, RR = 70%

[Figure panels: noise suppression and binarization; suppression of internal contours; fusion of minor components]

47

GBVC: Examples

Al-Badr et al. [1998]
– Segmentation-free method:
• detects a set of shape primitives on the word
• matches the regions of the word with a set of symbol models
• maximizes the a posteriori probability of the arrangement of symbol models
– Word recognition scores: clean (99.39%), degraded (95.60%) or scanned (73.13%)

[Figure panels: matching with a symbol model; corresponding regions of the model (in shades of gray)]

48

Local-Based Vision Classifier: Example

Shirin Saleem et al. [2008]:
– BBN Technologies, Cambridge, MA: BBN Byblos OCR System (DARPA data set):
– Locate line tops and bottoms
– Extract narrow overlapping vertical slices of the image
• measure features on each slice
• reduce the size of the feature vector using Linear Discriminant Analysis (typically 15 features)

49

Simple Frame-based Features

Examples of features:– Intensity as a function of vertical position

– Vertical derivative of intensity

– Horizontal derivative of Intensity

– Local angle within a small window

– Difference of angle
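A minimal sketch of frame-based feature extraction, assuming a text line given as a list of pixel rows (0 = background, 1 = ink). It covers only the first three feature types listed above, and the frame width, step and function names are hypothetical; it is not the BBN Byblos implementation, and the dimensionality reduction step is left out.

```python
def sliding_frames(image, width=3, step=1):
    """Yield narrow overlapping vertical slices of a 2-D image (list of rows)."""
    n_cols = len(image[0])
    for start in range(0, n_cols - width + 1, step):
        yield [row[start:start + width] for row in image]

def frame_features(frame):
    # Intensity as a function of vertical position: mean ink per row.
    intensity = [sum(row) / len(row) for row in frame]
    # Vertical derivative: difference between consecutive rows.
    vertical = [b - a for a, b in zip(intensity, intensity[1:])]
    # Horizontal derivative: mean ink of the last column minus the first.
    height = len(frame)
    horizontal = (sum(r[-1] for r in frame) - sum(r[0] for r in frame)) / height
    return intensity + vertical + [horizontal]

def extract(image, width=3, step=1):
    """One feature vector per frame, in left-to-right frame order."""
    return [frame_features(f) for f in sliding_frames(image, width, step)]

line = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
features = extract(line)
```

Each frame yields a fixed-length vector, so a word image becomes a sequence of observations that an HMM can consume directly.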

50

Character Hidden Markov Model

51


R. Al Hadj et al. [2006]
– HMM for letters and words with sliding windows
– Windows correspond to 3 different orientations: density description
– A second system integrates all the orientations in each position

Recognition: 85.02% (Top1), 91.29% (Top2), 93.14% (Top3)

52

Global-Based Vision Classifier: Examples

Khorsheed et al. [2000]
– Polar transformation coupled with a Fourier transform
– Each word: a template of Fourier coefficients
– Recognition
• normalized Euclidean distance (ED) from templates
• In a multi-font approach:
– 95.4% correct word classification on 1700 samples of different size, angle and translation

[Figure panels: original images; images normalized by polar transform]
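Why a Fourier transform after a polar mapping tolerates changes of angle and size can be shown on a 1-D signal. The sketch below is an illustration under assumptions, not the Khorsheed pipeline: the angular profile stands in for the polar-transformed word image, so that a rotation becomes a circular shift, and all names and data are hypothetical.

```python
import cmath
import math

def dft_magnitudes(signal, n_coeffs=4):
    """First DFT magnitudes of a 1-D profile, divided by the DC term."""
    n = len(signal)
    mags = []
    for k in range(n_coeffs):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
        mags.append(abs(s))  # the magnitude ignores circular shifts
    dc = mags[0] or 1.0
    return [m / dc for m in mags]  # dividing by DC removes the overall scale

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(profile, templates):
    """Return the name of the template closest to the profile's features."""
    feats = dft_magnitudes(profile)
    return min(templates, key=lambda name: euclidean(feats, templates[name]))

word_a = [1, 3, 1, 0, 0, 2, 0, 0]  # hypothetical angular profiles
word_b = [2, 2, 2, 2, 0, 0, 0, 0]
templates = {"a": dft_magnitudes(word_a), "b": dft_magnitudes(word_b)}

# A rotated (circularly shifted) and rescaled version of word_a.
probe = [2 * x for x in word_a[3:] + word_a[:3]]
label = classify(probe, templates)
```

The probe has the same features as its template up to rounding, which is the invariance that template matching relies on here.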

53

GBVC: Synthesis

The related works
– support the word superiority
– Many feature combinations and models perform well

The proximity with IA?
– can operate
– but limited to 2 levels
– needs more precision in feature extraction

Adaptation of GC to Arabic
– Possible if high-level features are used

[Figure: levels used: Input, Feature, Word]

54

Methodologies
II. Semi-Global Vision Classifier

The word
– a natural concatenation of independent PAWs, which provides a natural segmentation

The features
– are numerous and different
– require normalization of the image before extraction

The approach
– reduces the vocabulary, as only the PAWs are considered

Important to find features

55

Semi-Global-Based VC: Examples

Planar HMM: Ben Amara, Belaïd, Ellouze [1996]

For the main: band width:
• observation P of the S HMM
• a specific function (normal density) of the duration

For the secondary: band description:
• list of black-and-white segments in each line of the band
– Morphology of each PAW

– 99.84% for 33168 samples, 100 PAWs

56

Semi-Global-Based VC: Examples: town name recognition

Burrow [2000]
– Method
• trace the lines making up the town name and use these as a representation
– Features
• Vector angles + average length + …
– Results
• ED (converted into pseudo-probabilities) between the test feature vectors and all those in the training set
• Recognition rate: 74%
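One way to turn distances into pseudo-probabilities is sketched below. The slide does not say which conversion was used; the exponential of the negative Euclidean distance, normalized over labels, is an assumption made for illustration, and the town names and feature vectors are made up.

```python
import math

def pseudo_probabilities(test_vec, training_set):
    """training_set: list of (label, feature_vector) pairs.

    Returns a dict label -> pseudo-probability (values sum to 1).
    """
    # Keep, for each label, the distance to its nearest training sample.
    nearest = {}
    for label, vec in training_set:
        d = math.dist(test_vec, vec)
        nearest[label] = min(d, nearest.get(label, float("inf")))
    # Assumed conversion: closer samples get exponentially larger weights.
    weights = {label: math.exp(-d) for label, d in nearest.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

training = [
    ("Tunis", [0.0, 1.0]),
    ("Sfax", [3.0, 4.0]),
    ("Tunis", [0.5, 1.0]),
]
probs = pseudo_probabilities([0.4, 1.0], training)
best = max(probs, key=probs.get)
```

Because the values sum to 1, they can be combined with other knowledge sources (for example a lexicon prior) in the same way as probabilities.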

57

SGBVC: Synthesis

The related works
– are similar to those for GBVC
– some are reported on PAWs

The proximity with IA is limited
– only the feature and PAW levels are considered

The adaptation of semi-global BVC to Arabic
– fits well but is limited to PAWs
– fits better if a procedure for gathering PAWs is possible

58

IA architecture for Semi-Global VC

[Figure: network levels: Input, Feature, Letter, PAW]

59

Methodologies
III. Local-Based Vision Classifier

The word
– regarded as a list of letters or smaller entities

The features
– should be located precisely, unlike in the other approaches, where flexibility is tolerated

The approach
– should gather and compare these entities to identify the word

The interest
– can cope with large vocabularies

60

LBVC: Example

Multi-level handwritten word recognition for Tunisian city names, Miled [1997]

61

LBVC: The strategy: 2 perceptive levels

First perceptive level:
– takes a global view by extracting visual indices through tracing and grapheme extraction
– This global information is extracted in the main zones: (b) diacritics; (c) baseline and middle-zone characters

62

LBVC: Example

Then, visual indices are extracted by tracing

63

LBVC: Example

Finally, Markovian modeling is applied to the list of visual indices

Recognition: 58.9% (top1) to 86.8% (top10)
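How a Markovian model scores a list of visual indices, and how a top-N list comes out of it, can be sketched with the forward algorithm for discrete HMMs. The two word models, their probabilities and the index names (`asc`, `desc`, `loop`) are hypothetical and are not taken from the system described above.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM; obs must be non-empty."""
    n = len(start)
    alpha = [start[s] * emit[s].get(obs[0], 0.0) for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s].get(o, 0.0)
            for s in range(n)
        ]
    return sum(alpha)

def rank_words(obs, models, top_n=10):
    """Rank word models by likelihood of the observed visual indices."""
    ranked = sorted(models, key=lambda w: forward_likelihood(obs, *models[w]),
                    reverse=True)
    return ranked[:top_n]

# Two hypothetical left-to-right word models: (start, transitions, emissions).
left_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "w1": ([1.0, 0.0], left_right,
           [{"asc": 0.8, "loop": 0.2}, {"desc": 0.9, "loop": 0.1}]),
    "w2": ([1.0, 0.0], left_right,
           [{"loop": 0.9, "asc": 0.1}, {"loop": 0.5, "desc": 0.5}]),
}
observed = ["asc", "desc"]
ranking = rank_words(observed, models)
```

Reporting the first N entries of the ranking is what the top1 to top10 recognition rates measure.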

64

LBVC: Example

The 2nd perceptive level takes an analytical approach by extracting finer features: graphemes

18 classes
• 1. A: alef
• 2. B – D: graphemes with ascenders
• 3. E – H: graphemes with both ascenders and descenders
• 4. I – M: graphemes with descenders
• 5. N – R: graphemes within the middle zone

Recognition: 69.68% (top1) to 91.66% (top10)

65


Finally, the 3rd level performs pseudo-analytical modeling and recognition of PAWs and words

37 words: 80.11% (Top1), 90.79% (Top5)

66

LBVC: Synthesis

The related works
– give good results, showing that the analytic approach can perform well
– point out the drawbacks of over- and under-segmentation

As letters or segments are recognized independently, any error can perturb the whole recognition process

The proximity with IA is remote
– WSE is not taken into account, because
– the word is not seen globally but as a sum of small parts

67

Methodologies
IV. Hybrid-Level Classifier

The word
– regarded both globally, as a whole, and in its details

The features
– correspond to precise locations, reinforced according to the level of detail needed

The approach
– combines different strategies to come closer to human reading:
• the analysis must be global for a good synthesis of the information
• while being based on local information suitable for making this information emerge

68

Hybrid-Level Classifier: Examples

NSHP-HMM [Choisy & Belaïd 02]:
– a random field drawing its observations directly from the image
– an HMM taking into account the column observations

[Figure symbols: $X_{ij}$ and $X_{\theta_{ij}}$]

69


Analytical aspect : Local-Global aspect :

70


NSHP-HMM [Vajda & Belaïd 06] :– Combination of structural and pixel information

71

HM & Synthesis

The related works
– seem efficient
– IA is seen as a meta-model bringing together models working at different visual levels: global, local, semi-global

The proximity with IA is close
– if we add the PAW level
– it combines different levels, as proposed in IA

The interest
– to do as much as possible without segmentation
– if needed, a segmentation guided by the context can be performed

72

Outline

1. Introduction
2. Writing characteristics
3. Issues of Recognition
4. Language Processing

73

NLP

The number of effective Arabic words exceeds 60 billion!
– due to its morphological complexity [K. Darwish 02]
– which makes their automatic processing unrealistic
– and handicaps dictionary building, IR, automatic spelling…

Simplifying their patterns becomes mandatory for their processing

One solution is to turn towards morphological analysis and word stemming

74

NLP

Many studies
– highlight the richness and the stability of Arabic in terms of the morpho-phonology peculiar to this language [A. Ben Hamadou 93], [S. Kanoun 02], [W. Kammoun 04], [M. Cheriet 06]

Questions
– Importance of the kind of linguistic knowledge
– The most appropriate location for its incorporation

75

NLP

Most of them confirm
– The morphological structure of Arabic
• can be analyzed in terms of consonantal roots, considered as independent morphological units

Tri-literal roots, the most common of them
– [Watson 06]
• give rise to up to 15 verbal forms or stems, one basic and the rest derived
– [Ben Hamadou 93]
• an average of 80 currently used words derive from a given root
– [Kanoun 02]
• 808 healthy tri-consonantal roots yield a lexicon of 98 413 words

76

Word decomposition

An Arabic word is decomposable (i.e. derived from a root) (school: مدرسة) or not (doctor: دڪتور)

A decomposable word is composed of morphemes: prefix, radical and suffix

The radical (or the verbal core) is
– the derivation of a root according to a given scheme by introducing “access” letters: ا, م

A root is either
– tri-consonant (three letters): كتب
– quadri-consonant (four letters): دحرج
– Healthy (جرح) or non-healthy (قال contains at least one vowel)

[Figure label: Radical]

77

Arabic Morpho-phonological Concepts

Schemes can go up to 70
– تفاعل, فعل, مفعال, مفاعل, استفعال, مفعول, افتعال, منفعل

Scheme classes are:
– Verb: “كتب” / “رحل”
– Agent noun: “كاتب” / “راحل”
– Accentuated agent noun: “رحّال”
– Patient noun: “مكتوب”
– Machine noun: “ تاب آ ”
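Deriving a radical from a root and a scheme can be sketched as a letter substitution. In Arabic grammar a scheme is conventionally written with the letters ف, ع and ل standing for the three root consonants; every other letter of the scheme is an access letter and is copied unchanged. The function name is hypothetical and the sketch ignores vowel marks and non-healthy roots.

```python
PLACEHOLDERS = "فعل"  # slots for the 1st, 2nd and 3rd root consonants

def apply_scheme(root, scheme):
    """Insert a tri-consonantal root into a scheme written with ف ع ل."""
    if len(root) != 3:
        raise ValueError("this sketch handles tri-consonantal roots only")
    slots = dict(zip(PLACEHOLDERS, root))
    # Root slots are replaced; access letters (ا, م, ت, ...) are kept as-is.
    return "".join(slots.get(ch, ch) for ch in scheme)

agent = apply_scheme("كتب", "فاعل")     # agent noun scheme
patient = apply_scheme("كتب", "مفعول")  # patient noun scheme
```

The same function applied to the root كثر and the scheme تفاعل gives تكاثر, written here with the standard letter ك.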

78

The approach: Transparent Neural Network

Easy to train:
– Decomposable into 3 mono-layers
– Training is rapid

But it does not allow too many outputs

79

To process a wide vocabulary: several improvements

First trick
– To consider the word as the conjugation of a root according to a given scheme
– To separate the outputs into roots and schemes
• For 8000 words that derive from 100 roots, the maximum number of schemes is 1400

Problem of size 8000 → problem of size 1500 (still high!)

[Figure: word → TNN → 100 roots + 1400 conjugated schemes]

80

Second trick

To consider the scheme as a brief scheme (a non-conjugated one) and a set of conjugation elements
– e.g. يتفاعلون is defined by the brief scheme تفاعل
– The number of brief schemes is around 75
– The number of conjugation elements is 12 (tense, gender, person, definition)

87 neurons represent 1400 conjugated schemes

8000 → 1500 → 187 (100 roots + 75 schemes + 12 conjugation elements)
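The reduction of the output layer is plain arithmetic on the figures quoted in these slides; the short sketch below only makes the three counts explicit. The function and key names are hypothetical.

```python
def output_sizes(n_words, n_roots, n_conjugated, n_brief, n_conj_elements):
    """Number of output neurons under each decomposition of the vocabulary."""
    return {
        "one output per word": n_words,
        "roots + conjugated schemes": n_roots + n_conjugated,
        "roots + brief schemes + conjugation elements":
            n_roots + n_brief + n_conj_elements,
    }

sizes = output_sizes(n_words=8000, n_roots=100, n_conjugated=1400,
                     n_brief=75, n_conj_elements=12)
```

The counts are sums rather than products because each factor (root, brief scheme, conjugation element) gets its own output neurons and is recognized separately.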

81

Third trick

Root and scheme trainings are independent
– These two trainings do not require the same information
• The information about the PAWs of a word is:
– useless for the training of its root: كافح and كفاح share the root كفح
– useful for the training of its scheme: كافح has the scheme فاعل, كفاح has the scheme فعال

82

Third trick

To separate them, and so lighten them, by splitting the TNN into two models:

[Figure: Word → TNN_R (100 outputs) and Word → TNN_S (87 outputs)]

These sizes are now practicable

83

Neuro-Linguistic approach

TNN_R: three-layer network
– learns how to focus on root letters and to ignore the access ones reserved for schemes; trains roots from structural primitives

[Figure: TNN_R layers: Primitives (70) → Letters (117) → Roots (100), with weighted connections. Example: the primitives PD, QM, HF, RD, RF activate the root بعد.]

84

TNN_S: 4 layers
– learns schemes from structural primitives: how to ignore root letters and focus on access letters (prefix, suffix…)
– PAWs of Arabic schemes: more reduction

[Figure: TNN_S layers: Primitives → Letters → PAWs of schemes → Schemes, plus Conjugation elements (10). Example word: تڪاثر, with primitives PD + HM + HF + JF, access letters ت, ا, scheme تفاعل and conjugation elements singular, accomplished, masculine.]

85

TNN_R training: words containing the root
– As the information trained corresponds to two independent cases:
• letter constitution / root formation: letter location and sisters
• and there is no local error to treat
– we separate it into two Mono-LPs: 1 & 2

[Figure: structural features → letter position & sister letters. Example words: أتصرف, انصرف, تصرف, يتصرف.]

86

TNN_S training: same corpus as for TNN_R

In the same way, with 3 sub-networks: Mono-LP1, Mono-LP3, Mono-LP4

[Figure: example words أتصرف, انصرف, تصرف, يتصرف; letter position & sisters; outputs fixed manually to indicate those that should be activated.]

Recognition: perceptive cycles + linguistic restriction

[Figure: recognition of the word انصرف from the primitives HI PD BM JrF BPI. TNN_R proposes root candidates (صرف, سرق, حرف, صرخ) and TNN_S proposes scheme candidates (تفاعل, انفعل); perceptive cycles and the linguistic restriction discard (X) candidates such as انحرف.]
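A minimal sketch of how the outputs of the two networks might be combined under a linguistic restriction. The combination rule (product of the two scores, restricted to root and scheme pairs licensed by a lexicon) is an assumption made for illustration and is not described on the slides; the transliterated roots, schemes and scores are hypothetical.

```python
def rank_candidates(root_scores, scheme_scores, licensed_pairs, top_n=4):
    """Rank (root, scheme) pairs by the product of their network scores.

    licensed_pairs plays the role of the linguistic restriction: pairs that
    do not form a word of the vocabulary are never proposed.
    """
    scored = [
        (root_scores[r] * scheme_scores[s], r, s)
        for (r, s) in licensed_pairs
        if r in root_scores and s in scheme_scores
    ]
    scored.sort(reverse=True)
    return [(r, s) for _, r, s in scored[:top_n]]

# Hypothetical activations, with roots and schemes in Latin transliteration.
root_scores = {"srf": 0.7, "hrf": 0.2, "srq": 0.1}
scheme_scores = {"infa3ala": 0.6, "tafa3ala": 0.4}
licensed = {("srf", "infa3ala"), ("hrf", "infa3ala"), ("srf", "tafa3ala")}

candidates = rank_candidates(root_scores, scheme_scores, licensed)
```

A word never seen as a whole can still be proposed, provided its root and its scheme each received a score and the pair is licensed.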

88

Experiments

Vocabulary size
– 1531 words
– 51 roots
– 25 brief schemes

Training base
– The same training corpus for TNN_R and TNN_S
• TNN_R corpus size: 1531 (words) to train 50 roots
• TNN_S corpus size: 1531 (words) to train 25 schemes

Test base
– size: 765 = 255 (words) * 3 (samples)

Comparison with approaches dedicated to wide vocabularies (word base)

– [Kanoun 02]: typeset, analytic; Top1 69%, Top2 84%, Top3 95%
– [Kammoun 06]: typeset, analytic; Top1 81.3%, Top2 95.7%, Top3 96.4%
– [Touj & Ben Amara 07]: handwritten, analytic, vocabulary size 25; 88.7%
– [Ben Cheikh et al. 08]: typeset / handwritten, pseudo-global, vocabulary size 1531; TNN_R: Top1 80.7%, Top2 89.4%, Top3 91.9%; TNN_S: 93%

Other table entries: vocabulary sizes 1423 and 545; Top4 column: 92%, 99.7%, 97%
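The Top1 to Top4 figures are top-N recognition rates: a test word counts as recognized when the correct answer appears among the first N candidates. A minimal sketch with made-up data follows; the function name is hypothetical.

```python
def top_n_rate(ranked_lists, truths, n):
    """Fraction of samples whose true label is among the first n candidates."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths)
               if truth in ranked[:n])
    return hits / len(truths)

# Made-up candidate lists for four test samples and their true labels.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["a", "c", "b"]]
truths = ["a", "a", "a", "b"]

rates = [top_n_rate(ranked, truths, n) for n in (1, 2, 3)]
```

By construction the rate can only grow with N, which is why every system above improves from Top1 to Top3.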

90

Conclusion (1)

Neural model + linguistic knowledge
– Arabic writing recognition with a wide vocabulary
– Knowledge: analysis of Arabic morphology

Favors the recognition of words that have never been learned
– It is only necessary that their root and their scheme have already been learned via other words

91

Example

The words « أكثر », « كثير », « كثرة » and « تكثر » participate in the training of the root « كثر » (in addition to their schemes)

The words « تعانق », « تقارب », « تداخل » and « تماسك » participate in the training of the scheme « تفاعل » (in addition to their roots)

Hence, when recognizing the word « تكاثر », our model should be able to recognize:
– the root « كثر »
– the scheme « تفاعل »

92

Perspectives

The improvements will continue:
– Knowledge
• Considering other aspects of Arabic morphology: other kinds of roots, derivations…
– Recognition stage
• More linguistic restriction in the perceptive cycles
– Database
• To work on a more realistic vocabulary by further enlarging its size

Thank you