1
Natural Language Processing: Arabic Cursive Handwriting
Recognition
A. Belaïd
2
Preamble
Arabic
– Part of the Semitic language family
One of the most widely spoken languages in the world
– Nearly 250 million people speak Arabic
Spoken outside Arabic countries
– Over 600 000 people in the United States speak Arabic
Gave rise to other alphabets
– Farsi, Urdu…
• spoken by millions of people in Iran, Pakistan, India…
3
Preamble
The work presented here is that of numerous researchers working with me…
– Najoua Ben Amara
– Samia Maddouri
– Afef Kacem
– Imen Ben Cheikh
– Hiba Khelil
– Mohamed Yazid Boudaren
– Nazih Ouwayed
– Christophe Choisy
– Umapada Pal
4
Presentation outline
Introduction
– Brief history, specific applications
Writing characteristics
– Those shared with Latin
– Those proper to Arabic
Issues of cursive word recognition
– Reading models
– Global word-based (holistic), local letter-based (analytical) and hybrid approaches
Language Processing
– Some basic solutions for handwriting recognition
5
Introduction
The field arose before the advent of computers
[Figure: timeline from 1900 to 2008 (marks at 1916, 1950, 1965, 1980) plotting maturity and price; milestones: patents on OCR (blind, telegraph), working models, OCR in industry, first postal address reader, forms, handwritten forms, small devices, intelligent pen]
6
Today for Latin: an industry and a real market
Material:
– Scanners adapted to documents
We know how to
– scan documents in huge quantities while preserving the image quality
– compress them and publish them on the net
– recognize them by OCR and identify some structure elements
But…
– only on very good quality documents
– rather recent, with poor structure, printed
– Handwriting:
• Only a little work is available, for specific applications with a small vocabulary
7
Introduction
Arabic
Started in 1980
– increasing demand for information indexing and retrieval
Pioneers
– A. Amin (Loria)
– M. Cheriet (ETS, Montreal)
– N. Ben Amara (ENIT)…
Today
– Many labs: REGIM (Sfax), LRI (Annaba), READ (Loria), ETS (Montreal), CEDAR (USA), ISI (India)
Many dedicated sessions and workshops
– IWFHR, ICFHR, CIFED, ICDAR, SACH’06
Several public datasets
– IFN/ENIT, DARPA/SAIC, CENPARMI, Farsi-City…
• See Volker Märgner's list
8
Introduction
Arabic
Commercial Arabic OCR
– A number of commercial Arabic OCR engines
• Sakhr's Automatic Reader: ~$1500
• Readiris from IRIS: ~$500
• Verus from NovoDynamics: ~$1300
• OmniPage
– Evaluation by UNLV
• Sakhr (90.33%), OmniPage (86.89%)
Open source Arabic OCR projects
– The Siragi project (started in 2005)
• Part of the Arabic Unix open source project
9
Introduction
Some specific applications
Bank check recognition of courtesy amounts
10
Introduction
Signature recognition, verification, forgery detection…
V. K. Madasu et al., Pattern Recognition, 38 (2005)
Normalized vector angle (α) in boxes
M. A. Ismail et al., Pattern Recognition, 33 (2000)
Global and local features in boxes
Algorithms based on fuzzy concepts
11
Introduction
Some specific applications
Writer identification
– Comparison of Gabor-Based Features for Writer Identification of Farsi/Arabic Handwriting, F. Shahabi et al., IWFHR’06
– Automatic Writer Identification Using Connected-Component Contours and Edge-Based Features of Uppercase Western Script, L. Schomaker et al., PAMI, vol. 26, no. 6, 2004
12
Introduction
Word spotting by request
Spotting Words in Handwritten Arabic Documents, S. Srihari et al., SACH 2006
Ch. Choisy, A. Belaïd, Cross-learning in Analytic Word Recognition Without Segmentation, IJDAR’02
Template matching: ALMALIK
Stochastic model: MADAME
[Figure labels: prototypes, candidates]
13
Introduction
Newspaper segmentation
Connected Pattern Segmentation and Title Grouping in Newspaper Images, P. E Mitchella et al, ICPR’04
Arabic Page Segmentation, Planet, K. Hadjar et al. ICDAR03
14
Introduction
Some specific applications
Paleographic inspection
INSA Lyon: Auto-similarité de formes pour la discrimination des styles d’écriture des manuscrits médiévaux, I. Moalla, F. LeBourgois, H. Emptoz, A. M. Alimi
Progressive evolution between the 6th and 16th centuries
University of Pisa: to classify and identify medieval scripts
University of Annaba:
15
Introduction
The issue of recognition
Handwritten Latin recognition first showed the way
– In terms of modalities
• On-line vs off-line
– In terms of scripts
• Printed vs handwritten
– In terms of pre-processing
• Shape normalization
• Feature extraction: indices or graphemes
– In terms of methodologies, classified according to:
• use of a lexicon or not
• nature of primitives / model: structural, statistical, stochastic
• vision level: local or global
16
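Introduction
Initiated by speech recognition
Today: a well-established PR system
[Figure: recognition pipeline: preprocessing, then feature extraction and vector quantization, then sequence recognition using a tree-structured lexicon and an HMM models database built at enrollment; illustrated with a spoken French sentence (« Bonsoir, à demain pour une nouvelle édition du journal », "Good evening, see you tomorrow for a new edition of the news"), handwritten Arabic words and the lexicon entries "start", "stand", "store"]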
17
Introduction
When does LP contribute?
[Figure: two-stage system. Enrollment: training data, preprocessing + feature extraction, phoneme / character modeling producing character models; text, language modeling producing lexicon + grammar. Recognition: image input, preprocessing + feature extraction, recognition search, word sequence]
18
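Introduction
The bases of the process are well established
[Figure: taxonomy of approaches: global vs analytic; analytic segmentation is either a pre-segmentation or internal (sliding window over letter models a, b, … z); learning and recognition by discriminating path or discriminating model]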
19
Introduction
Performance: criteria influencing the quality
[Figure: three axes: number of writers (mono, multi, omni), lexicon size (reduced, large), writing (guided, unconstrained)]
20
Introduction
Performance is satisfactory for Arabic
[Figure labels: script, process, model]
21
Outline
1. Introduction
2. Writing characteristics
3. Issues of Segmentation
4. Natural Language Processing
22
Writing characteristics
Some of them are similar to Latin
Writing lines
– Alignment: upper line, mean line, baseline, lower line
– Baseline
– X-height
23
Writing characteristics
Some of them are similar to Latin
Perceptive invariants: J. C. Simon called: regularities and singularities
Letter support
Letter peculiarities
24
Writing characteristics
Some of them are similar to Latin
Perceptive invariants / regularities and singularities
27
Latin vs Arabic
What changes?
Essentially the script, which is always complex
Arabic systems
– The difficulty is permanent: cursiveness, ligatures, tashkeel
Latin systems
– The difficulty is gradual
28
Latin vs Arabic
What changes?
The gaps are significant for Latin, not always for Arabic
– Latin: between words
– Arabic: everywhere
29
Latin vs Arabic
What changes?
The ligatures are permanent: horizontal and vertical
30
Latin vs Arabic
Arabic has some peculiarities
1. Helpful: accents and diacritical dots contribute to the recognition
31
Latin vs Arabic
Arabic has some peculiarities
2. Helpful: the letter elongation contributes to the segmentation
32
Latin vs Arabic
Arabic has some peculiarities
3. Helpful: the position of the hamza (16) and the descenders
33
Latin vs Arabic
Arabic peculiarities?
5. Helpful: PAWs (pieces of Arabic words) offer a pause in the writing, a decomposition of the writing
• They simplify the apprehension of the script and make linear recognition easier
[Figure label: PAW]
[Al-Badr and R. M. Haralick 1998]
34
Arabic Script
In conclusion
• Arabic: more global than syllabic
• PAWs facilitate the recognition
• PAW level ~ letter level in Latin
• For recognition
– In Arabic: reach the PAW level, which carries the characteristic information
– In Latin: reach the letter level
• The PAW level is the stable level, which makes the recognition semi-global
35
Outline
1. Introduction
2. Writing characteristics
3. Issues of Recognition
4. Natural Language Processing
36
Issue of segmentation
Due to the local variability
– It is widely accepted that
• segmenting an Arabic word into letters is very delicate and not always reliable
• in most attempts, the Arabic word is segmented into graphemes (a practice copied from Latin)
– This is an error!
37
Issues of recognition
Considering Arabic peculiarities
Reading models
– The recognition of a word
• implies the processing of visual data and its interpretation at the linguistic level
– Psychologists call "mental lexicon access"
• the process by which a human associates the image of a word with its meaning
Several models have emerged
38
Interactive Activation Model
McClelland and Rumelhart 1981
Important assumptions
– Perception takes place in a multilevel processing:
• feature, letter, word
A consequence:
– more abstract levels of representation are only accessed via intermediate levels
A third assumption
– Processing combines both bottom-up and top-down information
• readers can use their (top-down) knowledge of words to help identify letter sequences from (bottom-up) visual input
[Figure: three-level network (features, letters, words) with the example words MATE and MOVE]
Neurons have excitatory and inhibitory connections
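The interplay between bottom-up evidence and top-down word knowledge can be illustrated with a small sketch. This is not the Interactive Activation network itself (there are no iterative excitatory and inhibitory updates); it is a single-pass simplification, and the evidence values, the function names and the two-word lexicon are assumptions chosen only for illustration.

```python
def bottom_up_reading(letter_evidence):
    """Pick the best letter at each position from visual evidence alone."""
    return "".join(max(pos, key=pos.get) for pos in letter_evidence)


def word_support(letter_evidence, lexicon):
    """Average letter evidence received by each lexicon word of matching length."""
    support = {}
    for word in lexicon:
        if len(word) == len(letter_evidence):
            total = sum(pos.get(ch, 0.0) for pos, ch in zip(letter_evidence, word))
            support[word] = total / len(word)
    return support


def top_down_reading(letter_evidence, lexicon):
    """Let word knowledge decide: return the lexicon word with the most support."""
    support = word_support(letter_evidence, lexicon)
    return max(support, key=support.get)


# Hypothetical evidence: the second letter is ambiguous between A and O.
evidence = [{"M": 0.9}, {"A": 0.4, "O": 0.5}, {"T": 0.8, "V": 0.1}, {"E": 0.9}]
lexicon = ["MATE", "MOVE"]

letters_only = bottom_up_reading(evidence)           # letters read in isolation
with_lexicon = top_down_reading(evidence, lexicon)   # letters constrained by words
```

Read in isolation, the letters give a string that is not a word; once the lexicon is consulted, the ambiguous letter is resolved in favor of the word that best explains all positions.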
39
IA & Arabic Recognition
Arabic writing
– fits the reading principle of IA very well
• It clearly favors the superiority of the whole
• Local perceptual information is only used to help word understanding
But the corresponding model
– should be adapted to consider the PAW level and letter distortions:
• PAWs introduce an intermediate global level
Hence, the match is very close once the model is adapted
40
Perceptro [Côté, Cheriet 98]
Limited number of features
– Ascender, descender, loop
– A word without these features cannot be initialized
No training
– no inhibition, rapid saturation
Recognition
– Perceptive cycles
– Top-down & bottom-up
41
Transparent Neural Network
[Maddouri, Belaïd, Ellouze, 03]
– Input correction by FD
[Ben Cheikh, Kacem, 07]
– Slightly extended vocabulary (Tunisian city names)
– Training possibility
[Ben Cheikh, Belaïd, Kacem, 08]
– Wide vocabulary
42
Arabic Recognition
Correction process
[Figure: propagation and back-propagation between the original image, the reconstructed image (harmonics) and the real image]
43
Arabic Recognition
Experiments
– 2100 images, 70 words, 63 PAWs
– Without perceptive cycles
• PAW RR: 68.42%
• Word RR: 90%
– With perceptive cycles
• PAW RR: 95%
• Word RR: 97%
44
Arabic Recognition Methodologies
Considering the human perception of Arabic writing and the particularity of PAWs, we
– reviewed the approaches of the literature according to their vision degree:
• Global-based vision classifiers
• Semi-global-based vision classifiers
• Local-based vision classifiers
• Hybrid-level classifiers
– and examined their proximity to IA
45
Methodologies
I Global-Based Vision Classifier
The word
– regarded as a whole
The features
– do not need to be precise:
• presence and some relationships
The approach
– assimilated to segmentation-free
– even if a segmentation is used, no local interpretation is made
– information is gathered at the word level
Its use is limited to small vocabularies
46
GBVC
Examples
Srihari et al. [2005]:
– Several preprocessing steps
• Noise suppression and binarization
• Suppression of internal contours
• Fusion of minor components
– Feature extraction for PAWs and words:
• aspect measurements
– Word resemblance by NN
• 10 writers writing 10 documents each: word extraction is ~60%, RR = 70%
47
GBVC
Examples
Al-Badr et al. [1998]
– Segmentation-free method:
• detects a set of shape primitives on the word
• matches the regions of the word with a set of symbol models
• maximizes the a posteriori probability of the arrangement of symbol models
– Word recognition scores: clean (99.39%), degraded (95.60%), scanned (73.13%)
[Figure: matching with a symbol model; corresponding regions of the model shown in shades of gray]
48
Local-Based-Vision Classifier
Example
Shirin Saleem et al. [2008]:
– BBN Technologies, Cambridge, MA: BBN Byblos OCR System (DARPA data set):
– Locate line tops and bottoms
– Extract narrow overlapping vertical slices of the image
• measure features on each slice
• reduce the size of the feature vector using Linear Discriminant Analysis (typically 15 features)
49
Simple Frame-based Features
Examples of features:– Intensity as a function of vertical position
– Vertical derivative of intensity
– Horizontal derivative of Intensity
– Local angle within a small window
– Difference of angle
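A minimal sketch of frame-based feature extraction follows, assuming a binarized line image stored as a list of rows. It only illustrates overlapping vertical slices and two of the listed features (intensity as a function of vertical position and its vertical derivative); the function names, the slice width and the toy image are assumptions, and the Linear Discriminant Analysis step of the cited system is not reproduced.

```python
def vertical_slices(image, width=2, step=1):
    """Cut a 2D image (list of rows) into overlapping vertical slices."""
    n_cols = len(image[0])
    return [[row[x:x + width] for row in image]
            for x in range(0, n_cols - width + 1, step)]


def intensity_profile(frame):
    """Mean intensity of each row: intensity as a function of vertical position."""
    return [sum(row) / len(row) for row in frame]


def vertical_derivative(profile):
    """Difference between consecutive rows of an intensity profile."""
    return [below - above for above, below in zip(profile, profile[1:])]


# Toy 3x5 binary image (1 = ink), purely illustrative.
image = [
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
]

frames = vertical_slices(image)                      # width 2, step 1: slices overlap by one column
profiles = [intensity_profile(f) for f in frames]    # one feature vector per slice
first_derivative = vertical_derivative(profiles[0])
```

Each slice yields one observation vector, so a text line becomes a left-to-right (or right-to-left) sequence of vectors that a character HMM can consume.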
50
Character Hidden Markov Model
51
Local-Based-Vision Classifier
Example
R. Al Hadj et al. [2006]
– HMM for letters and words with sliding windows
– Windows correspond to 3 different orientations: density description
– A second system integrates all the orientations at each position
Recognition: 85.02% (Top1), 91.29% (Top2), 93.14% (Top3)
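Top1, Top2 and Top3 figures such as those above count a word as recognized when the correct answer appears among the first k candidates returned by the classifier. A small sketch of that computation, with made-up candidate lists (the data and the function name are assumptions, not taken from the cited experiments):

```python
def top_k_rate(ranked_candidates, truths, k):
    """Fraction of samples whose true label is among the first k candidates."""
    hits = sum(1 for candidates, truth in zip(ranked_candidates, truths)
               if truth in candidates[:k])
    return hits / len(truths)


# Hypothetical ranked outputs for four test words.
ranked = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
truths = ["a", "a", "a", "b"]

rates = {k: top_k_rate(ranked, truths, k) for k in (1, 2, 3)}
```

The rate can only grow with k, which is why the Top3 score of a system is always at least as high as its Top1 score.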
52
Global-Based Vision Classifier
Examples
Khorsheed et al. [2000]
– Polar transformation coupled with a Fourier transform
– Each word: a template of Fourier coefficients
– Recognition
• normalized ED (Euclidean distance) from templates
• In a multi-font approach:
– 95.4% correct word classification on 1700 samples of different size, angle and translation
[Figure: original images; images normalized by polar transform]
53
GBVC
Synthesis
The related works
– support the word superiority
– Many feature combinations and models perform well
The proximity with IA?
– can operate
– but is limited to 2 levels
– needs more precision in feature extraction
Adaptation of GC to Arabic
– Possible if high-level features are used
[Figure: two-level architecture: input, feature, word]
54
Methodologies
II Semi-Global Vision Classifier
The word
– natural concatenation of independent PAWs, which provides a natural segmentation
The features
– are numerous and varied
– require normalization of the image before extraction
The approach
– reduces the vocabulary, as only the PAWs are considered
Important to find features
55
Semi-Global-Based VC
Examples
Planar HMM: Ben Amara, Belaïd, Ellouze [1996]
For the main model: band width
• observation P of the S HMM
• a specific function (normal density) of the duration
For the secondary model: band description
• list of B&W segments in each line of the band
– Morphology of each PAW
– 99.84% for 33168 samples, 100 PAWs
56
Semi-Global-Based VC
Examples: town name recognition
Burrow [2000]
– Method
• trace the lines making up the town name, and use these as a representation
– Features
• Vector angles + average length + …
– Results
• ED (converted into pseudo-probabilities) between the test feature vectors and all those in the training set
• Recognition rate: 74%
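The slide does not say how the Euclidean distances are turned into pseudo-probabilities. The sketch below uses one common choice, a normalized exp(-distance) weight accumulated per label; this conversion, the function names and the toy feature vectors are assumptions for illustration, not the method of the cited work.

```python
import math


def euclidean(u, v):
    """Euclidean distance (ED) between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pseudo_probabilities(test_vector, training_set):
    """Turn distances to all training vectors into per-label scores summing to 1.

    Assumed conversion: each training vector contributes exp(-distance)
    to its label, and the totals are normalized.
    """
    weights = {}
    for label, vector in training_set:
        weights[label] = weights.get(label, 0.0) + math.exp(-euclidean(test_vector, vector))
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Hypothetical two-dimensional features for two town names.
training = [("Tunis", [0.0, 1.0]), ("Sfax", [3.0, 4.0]), ("Tunis", [0.5, 1.0])]
probs = pseudo_probabilities([0.0, 0.0], training)
best = max(probs, key=probs.get)
```

Closer training samples dominate the score, so the ranking matches a nearest-neighbor decision while still giving a graded confidence for each town name.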
57
SGBVC
Synthesis
The related works
– are similar to those for GBVC
– some are reported on PAWs
The proximity with IA is limited
– only the feature and PAW levels are considered
The adaptation of Semi-Global BVC to Arabic
– fits well but is limited to PAWs
– fits better if a procedure for gathering PAWs is possible
58
IA architecture for Semi-Global VC
[Figure: layered network with the levels input, feature, letter, PAW]
59
Methodologies
III Local-Based Vision Classifier
The word
– regarded as a list of letters or smaller entities
The features
– should be located precisely, unlike in the other approaches, where flexibility is tolerated
The approach
– should gather and compare these entities to identify the word
The interest
– can cope with large vocabularies
60
LBVC
Example
Multi-level handwritten word recognition for Tunisian city names, Miled [1997]
61
LBVC
The strategy: 2 perceptive levels
First perceptive level:
– takes the global view by extracting visual indices, by tracing and grapheme extraction
– This global information is extracted in the main zones: (b) diacritics; (c) baseline and middle-zone characters
62
LBVC
Example
Then, visual indices are extracted by tracing
63
LBVC
Example
Finally: a Markovian modeling is applied to the list of visual indices
Recognition: 58.9% (top1) to 86.8% (top10)
64
LBVC
Example
The 2nd perceptive level takes an analytical approach by extracting finer features: graphemes
18 classes
• 1. A: alef
• 2. B – D: graphemes with ascenders
• 3. E – H: graphemes with both ascenders and descenders
• 4. I – M: graphemes with descenders
• 5. N – R: graphemes within the middle zone
Recognition: 69.68% (top1) to 91.66% (top10)
65
Local-Based-Vision Classifier
Example
Finally, the 3rd level performs a pseudo-analytical modeling and recognition of PAWs and words
37 words: 80.11% (Top1), 90.79% (Top5)
66
LBVC
Synthesis
The related works
– give good results, showing that the analytic approach can perform well
– point out the drawbacks of over- and under-segmentation
• As letters or segments are recognized independently, any error can disturb the whole recognition process
The proximity with IA is distant
– WSE (word superiority effect) is not taken into account because
– there is no global vision of the word, which is seen only as a sum of small parts
67
Methodologies
IV Hybrid-Level Classifier
The word
– regarded as a whole, both globally and in detail
The features
– correspond to precise locations, reinforced according to the level of detail needed
The approach
– combines different strategies to come closer to human reading:
• the analysis must be global for a good synthesis of the information
• while being based on local information suitable for making this information emerge
68
Hybrid-Level Classifier
Examples
NSHP-HMM [Choisy & Belaïd 02]:
– a random field drawing its observations directly from the image
– an HMM taking into account the column observations
[Figure symbols: X_ij, θ_Xij]
69
Hybrid-Level Classifier
Examples
Analytical aspect : Local-Global aspect :
70
Hybrid-Level Classifier
Examples
NSHP-HMM [Vajda & Belaïd 06]:
– Combination of structural and pixel information
71
Hybrid-Level Classifier
Synthesis
The related works
– seem efficient
– IA is seen as a meta-model bringing together models working at different visual levels: global, local, semi-global
The proximity with IA is close
– if we add the PAW level
– it combines different levels, as proposed in IA
The interest
– to do the maximum without segmentation
– if needed, a segmentation guided by the context can be performed
72
Outline
1. Introduction
2. Writing characteristics
3. Issues of Recognition
4. Language Processing
73
NLP
The number of effective Arabic words exceeds 60 billion!
– due to its morphological complexity [K. Darwish 02]
– this makes their automatic processing unrealistic
– handicaps: dictionary building, IR, automatic spelling…
Simplifying their patterns becomes mandatory for their processing
One solution seems to be morphological analysis and word stemming
74
NLP
Many studies
– highlight the richness and the stability of Arabic in terms of the morpho-phonology peculiar to this language [A. Ben Hamadou 93], [S. Kanoun 02], [W. Kammoun 04], [M. Cheriet 06]
Questions
– Importance of the kind of linguistic knowledge
– Most appropriate location for its incorporation
75
NLP
Most of them confirm
– The morphological structure of Arabic
• can be analyzed in terms of consonantal roots, considered as independent morphological units
Tri-literal roots are the most common
– [Watson 06]
• give rise to up to 15 verbal forms or stems, one basic and the rest derived
– [Ben Hamadou 93]
• an average of 80 currently used words derive from a given root
– [Kanoun 02]
• 808 healthy tri-consonant roots give a lexicon of 98 413 words
76
Word decomposition
An Arabic word is decomposable (i.e. derived from a root) (school: مدرسة) or not (doctor: دكتور)
A decomposable word is composed of morphemes: prefix, radical and suffix
The radical (or the verbal core) is
– the derivation of a root according to a given scheme by introducing “access” letters: ا, م
A root is either
– tri-consonant (three letters): كتب
– quadri-consonant (four letters): دحرج
– healthy (جرح) or non-healthy (قال contains at least one vowel)
[Figure label: radical]
77
Arabic Morpho-phonological Concepts
The number of schemes can go up to 70
– فعل, تفاعل, مفعال, مفاعل, استفعال, مفعول, افتعال, منفعل
Scheme classes are:
– Verb: “كتب” / “رحل”
– Agent noun: “كاتب” / “راحل”
– Accentuated agent noun: “رحّال”
– Patient noun: “مكتوب”
– Machine noun: “كتاب”
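Deriving a word from a root and a scheme can be sketched as a letter substitution, using the conventional placeholder letters ف, ع, ل for the three root consonants. The function name is an assumption, and the sketch is deliberately limited: it handles only healthy tri-consonant roots and schemes in which ف, ع, ل appear solely as placeholders (no gemination, no non-healthy roots).

```python
PLACEHOLDERS = "فعل"  # conventional stand-ins for the 1st, 2nd and 3rd root consonants


def apply_scheme(root, scheme):
    """Derive a word by substituting the root consonants into the scheme.

    Simplification: every ف, ع or ل in the scheme is treated as a placeholder;
    all other letters (the "access" letters) are copied unchanged.
    """
    if len(root) != 3:
        raise ValueError("this sketch handles tri-consonant roots only")
    mapping = dict(zip(PLACEHOLDERS, root))
    return "".join(mapping.get(letter, letter) for letter in scheme)


school = apply_scheme("درس", "مفعلة")      # root d-r-s in the scheme maf3ala
proliferate = apply_scheme("كثر", "تفاعل")  # root k-th-r in the scheme tafa3ala
```

The same scheme applied to different roots, or the same root placed in different schemes, generates families of related words, which is what makes the root and the scheme usable as separate recognition targets.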
78
The approach: Transparent Neural Network
Easy to train:
– Decomposable into 3 mono-layers
– Training is rapid
But it does not allow too many outputs
79
To process a wide vocabulary: several improvements
First trick
– Consider the word as the conjugation of a root according to a given scheme
– Separate the outputs into roots and schemes
• For 8000 words derived from 100 roots, the maximum number of schemes is 1400
An 8000-size problem becomes a 1500-size problem (still high!)
[Figure: word → TNN → 100 roots + 1400 conjugated schemes]
80
Second trick
Consider the scheme as a brief scheme plus a set of conjugation elements (يتفاعلون is defined by a brief, non-conjugated scheme, تفاعل)
– The number of brief schemes is around 75
– The number of conjugation elements is 12 (tense, gender, person, definition)
87 neurons represent 1400 conjugated schemes
8000 → 1500 → 187 (100 roots + 75 schemes + 12 conjugation elements)
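The reduction of the output layer is plain addition, using only the counts quoted on these slides; the function and key names below are assumptions for illustration.

```python
def output_sizes(words, roots, conjugated_schemes, brief_schemes, conjugation_elements):
    """Number of output units needed under each decomposition of the vocabulary."""
    return {
        "one output per word": words,
        "roots + conjugated schemes": roots + conjugated_schemes,
        "roots + brief schemes + conjugation elements":
            roots + brief_schemes + conjugation_elements,
    }


sizes = output_sizes(words=8000, roots=100, conjugated_schemes=1400,
                     brief_schemes=75, conjugation_elements=12)
scheme_units = 75 + 12  # units standing in for the 1400 conjugated schemes
```

Factoring the labels replaces one large output layer by a few small ones whose combinations cover the same vocabulary.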
81
Third trick
Root and scheme trainings are independent
– These two trainings do not require the same information
• The information about the PAWs of a word is:
– useless for the training of its root: كافح and كفاح share the root كفح
– useful for the training of its scheme: كافح has the scheme فاعل, كفاح has the scheme فعال
82
Third trick
Separate them, and so lighten them, by splitting the TNN into two models:
[Figure: word → TNN_R (100 outputs) and TNN_S (87 outputs)]
These sizes are now practicable
83
Neuro-Linguistic approach
TNN_R: three-layer network
– learns how to focus on root letters and to ignore the access letters reserved for schemes; trains roots from structural primitives
[Figure: TNN_R layers: primitives (70), letters (117), roots (100); example: the primitives PD, QM, HF, RD, RF activate the root بعد]
84
Neuro-Linguistic approach
TNN_S: 4 layers
– learns schemes from structural primitives; learns to ignore root letters and to focus on access letters: prefix, suffix…
– PAWs of Arabic schemes: more reduction
[Figure: TNN_S layers: primitives, letters, PAWs of schemes, schemes, conjugation elements (10); example: for the word تكاثر, the primitives PD + HM + HF + JF and the access letters ت, ا activate the scheme تفاعل with the conjugation elements singular, accomplished, masculine]
85
Neuro-Linguistic approach
TNN_R training: words containing the root
– As the information trained corresponds to two independent cases:
• letter constitution / root formation: letter location and sister letters
• and there is no local error to treat
– we separate it into two Mono-LPs: 1 & 2
[Figure: structural features → letter position & sister letters; example words: أتصرف, انصرف, تصرف, يتصرف]
86
Neuro-Linguistic approach
TNN_S training: same corpus as for TNN_R
In the same way: 3 sub-networks: Mono-LP1, Mono-LP3, Mono-LP4
[Figure: letter position & sisters; outputs fixed manually to indicate those that should be activated; example words: أتصرف, انصرف, تصرف, يتصرف]
Recognition: perceptive cycles + linguistic restriction
[Figure: the primitive sequences HI PD BM JrF BPI and HI PD RM JrF BPI are propagated through TNN_R (candidate roots صرف, سرق, حرف, صرخ) and TNN_S (candidate schemes تفاعل, انفعل); perceptive cycles and linguistic restriction reject inconsistent candidates when separating انصرف from انحرف]
88
Experiments
Vocabulary size
– 1531 words
– 51 roots
– 25 brief schemes
Training base
– The same training corpus for TNN_R and TNN_S
• TNN_R corpus size: 1531 (words) to train 50 roots
• TNN_S corpus size: 1531 (words) to train 25 schemes
Test base
– size: 765 = 255 (words) × 3 (samples)
Word base
Comparison with approaches dedicated to wide vocabularies
(reference: writing, approach, vocabulary size; Top1, Top2, Top3, Top4)
– [Kanoun 02]: typeset, analytic, 545; 69%, 84%, 95%, 97%
– [Kammoun 06]: typeset, analytic, 1423; 81.3%, 95.7%, 96.4%, 99.7%
– [Touj & Ben Amara 07]: handwritten, analytic, 25; 88.7%
– [Ben Cheikh et al. 08]: typeset / handwritten, pseudo-global, 1531; TNN_R: 80.7%, 89.4%, 91.9%, 92%; TNN_S: 93%
90
Conclusion (1)
Conclusion
Neural model + linguistic knowledge
– Arabic writing recognition with a wide vocabulary
– Knowledge: Arabic morphology analysis
Favors the recognition of words which have never been learned
– It is only necessary that the root and the scheme of the word have already been learned via other words
91
Example
The words «أكثر», «كثير», «كثرة» and «تكثر» participate in the training of the root «كثر» (in addition to their schemes)
The words «تعانق», «تقارب», «تداخل» and «تماسك» participate in the training of the scheme «تفاعل» (in addition to their roots)
Hence, when recognizing the word «تكاثر», our model should be able to recognize the root «كثر» and the scheme «تفاعل»
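The generalization claim can be checked with a few lines: a word absent from the training set is still recognizable when its root and its scheme were each learned from other words. The training pairs below are illustrative annotations, and the function names are assumptions; this is not the corpus of the experiments.

```python
def learned_units(training_words):
    """Collect the roots and schemes seen at least once during training."""
    roots = {root for root, _ in training_words.values()}
    schemes = {scheme for _, scheme in training_words.values()}
    return roots, schemes


def is_recognizable(root, scheme, roots, schemes):
    """A word can be recognized if both of its components were learned."""
    return root in roots and scheme in schemes


# word -> (root, scheme); hypothetical annotations for illustration
training_words = {
    "كثير": ("كثر", "فعيل"),
    "كثرة": ("كثر", "فعلة"),
    "تقارب": ("قرب", "تفاعل"),
    "تداخل": ("دخل", "تفاعل"),
}

roots, schemes = learned_units(training_words)
unseen_word = "تكاثر"                         # never presented during training
covered = is_recognizable("كثر", "تفاعل", roots, schemes)
not_covered = is_recognizable("درس", "تفاعل", roots, schemes)
```

Here the unseen word is covered because its root comes from the first two training words and its scheme from the last two, whereas a word built on an unlearned root is not.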
92
Perspectives
The improvements will continue:
– Knowledge
• Considering other aspects of Arabic morphology: other kinds of roots, derivations…
– Recognition stage
• More linguistic restriction in the perceptive cycles
– Database
• Working on a more realistic vocabulary by further enlarging its size
Thank you