Arabic Spell Checking Approaches
By: Banan AlHadlaq, Dalal AlZeer, Monirah AlOrf
Supervised by: Dr. Amal Al-Saif
Natural language processing - CS465
Introduction
Common Arabic Spell Errors
Towards Automatic Spell Checking for Arabic
Towards Arabic Spell-Checker Based on N-Grams Scores
Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
Arabic Word Generation and Modeling for Spell Checking
Improved Spelling Error Detection and Correction for Arabic
Conclusion
Outline
Arabic language
NLP applications
Approaches for solving the Arabic spell checking problem
Introduction
Common Arabic Spell Errors
• Reading Errors
{ا، أ، إ، آ} {ب، ت، ث، نـ، يـ} {ج، ح، خ} {د، ذ} {ر، ز} {س، ش} {ص، ض} {ط، ظ} {ع، غ} {ف، ق} {ي، ى} {ه، ة} {و، ؤ}
• Hearing Errors
{ى، ا} {ق، أ} {ض، د، ت، ط} {ش، س، ث} {ق، ج} {ز، ظ، ذ} {ي، ل، ر} {ق، ك} {ة، ه}
• Touch-Typing Errors
• Morphological Errors
• Editing Errors
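The reading-error groups above can be treated as confusion sets: two letters are confusable when they belong to the same group. The following is a minimal sketch under that assumption. It uses only a subset of the groups, and the names `READING_GROUPS`, `confusable`, and `reading_variants` are illustrative, not taken from any of the surveyed papers.

```python
# Reading-error confusion groups: letters that look alike in Arabic script.
# Only a subset of the groups listed on the slide is included here.
READING_GROUPS = [
    {"ج", "ح", "خ"},
    {"د", "ذ"},
    {"ر", "ز"},
    {"س", "ش"},
    {"ص", "ض"},
    {"ط", "ظ"},
    {"ع", "غ"},
    {"ف", "ق"},
]


def confusable(a, b, groups=READING_GROUPS):
    """Return True if a and b are distinct letters of the same group."""
    return a != b and any(a in g and b in g for g in groups)


def reading_variants(word, groups=READING_GROUPS):
    """Generate words obtained by swapping one letter for a look-alike."""
    variants = set()
    for i, ch in enumerate(word):
        for g in groups:
            if ch in g:
                for alt in g - {ch}:
                    variants.add(word[:i] + alt + word[i + 1:])
    return variants
```

A spell checker could intersect `reading_variants(word)` with its word list to propose corrections for this error class.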
1. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
• Stochastic-based approach for misspelling correction of Arabic text.
• A context-based, two-layer system that automatically corrects misspelled words in large datasets.
Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)
Candidates’ generation component
Error detection
Best candidate selection component
Single-Word Errors
Space Deletion Errors
Space Insertion Errors
Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)
Candidates’ generation component: Space Deletion Errors
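A space-deletion error merges two or more words into one token, so candidate generation for this error type can try every split point and keep the splits whose parts are all known words. The sketch below illustrates that idea only. The function name `split_candidates`, the `max_parts` limit, and the toy vocabulary are assumptions, and the paper's actual component may work differently.

```python
def split_candidates(token, vocabulary, max_parts=3):
    """Return every way to cut token into 2..max_parts in-vocabulary words."""
    results = []

    def extend(rest, parts):
        if not rest:
            # A complete segmentation counts only if a space was restored.
            if len(parts) >= 2:
                results.append(" ".join(parts))
            return
        if len(parts) == max_parts:
            return
        for cut in range(1, len(rest) + 1):
            head = rest[:cut]
            if head in vocabulary:
                extend(rest[cut:], parts + [head])

    extend(token, [])
    return results


# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"في", "البيت", "بيت", "ال"}
print(split_candidates("فيالبيت", VOCAB))
```

The resulting segmentations would then be passed to the best-candidate selection component, which picks among them using context.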
Result
• A standard Arabic text corpus (TRN_DB_I)
• An extra standard Arabic text corpus (TRN_DB_II)
• The test data (TST_DB)
• The test results show that performance improves as the training set grows, reaching an F1 score of 97.9% for detection and 92.3% for correction.
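As a reminder of how the reported metric is defined, F1 is the harmonic mean of precision and recall. The short sketch below computes it from raw counts. The counts in the usage line are hypothetical and are not taken from the paper.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; 0.0 when it is undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(round(f1_score(90, 10, 5), 3))
```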
2. Towards Automatic Spell Checking for Arabic
• Developing an Arabic spelling checker program.
• Using SICStus Prolog language.
• Recognizes common Arabic spelling errors and offers suggestions for error correction.
• Recognizes common spelling errors in Standard Arabic and Egyptian dialects.
• Can be integrated with other text processing software, such as word processors.
• Analyzes common spelling errors, which are then used to detect misspelled Arabic words.
• Detection is limited to isolated-word (non-word) errors, e.g. ‘محد’ for ‘محمد’.
• Perform a series of heuristic steps to find a replacement candidate:
Add missing character
Replace incorrect character
Remove excessive character
Add a space to split words
Towards Automatic Spell Checking for Arabic (Cont.)
• Add missing character: e.g. the candidates of the misspelled word "معض" are: "دمعض", "ضرمع"
• Replace incorrect character: e.g. the candidates of the misspelled word "معض" are: "عضن", "دمع", "عضك"
• Remove excessive character: e.g. the candidates of the misspelled word "معض" are: "عض", "مع"
• Add a space to split words: e.g. the candidates of the misspelled word "معض" are: "عض", "مع"
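The four heuristic steps can be sketched as a candidate generator that applies each edit and keeps only results found in a word list. The original system was written in SICStus Prolog, so the Python below is an illustration and not the authors' code. The names `ARABIC_LETTERS` and `candidates`, the 28-letter alphabet without hamza forms, and the toy vocabulary are all assumptions.

```python
# The 28 basic Arabic letters (hamza forms and ta marbuta are left out).
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"


def candidates(word, vocabulary, alphabet=ARABIC_LETTERS):
    """Apply the four heuristic steps; keep results found in vocabulary."""
    edits = set()
    # 1. Add a missing character at every position.
    for i in range(len(word) + 1):
        for ch in alphabet:
            edits.add(word[:i] + ch + word[i:])
    # 2. Replace each character with a different letter.
    for i in range(len(word)):
        for ch in alphabet:
            if ch != word[i]:
                edits.add(word[:i] + ch + word[i + 1:])
    # 3. Remove each character in turn.
    for i in range(len(word)):
        edits.add(word[:i] + word[i + 1:])
    found = {c for c in edits if c in vocabulary}
    # 4. Add a space wherever both halves are known words.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in vocabulary and right in vocabulary:
            found.add(left + " " + right)
    return found


# Hypothetical toy vocabulary; 'محد' is the slide's misspelling of 'محمد'.
print(candidates("محد", {"محمد", "مجد", "حد"}))
```

In a fuller system the replace step could be restricted to the letters in the neighbors table or the confusion groups, which would shrink the candidate set considerably.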
Towards Automatic Spell Checking for Arabic (Cont.)
Neighbors table
Towards Automatic Spell Checking for Arabic (Cont.)
• Developing a simple and flexible spell checker for the Arabic language (error detection).
• Based on N-Grams scores.
• Using matrix approach.
• The corpus used is adapted from Muaidi's PhD thesis.
• It consists of 101,987 word types.
3. Towards Arabic Spell-Checker Based on N-Grams Scores
Enter the text to be tested → Tokenizing process → Cleaning process → Matrix method deals with each word
Towards Arabic Spell-Checker Based on N-Grams Scores (Cont.)
• Building the matrices
Number of matrices = length of the longest word in the corpus - 1.
Dimension of each matrix is 28×28 (one row and one column per Arabic letter).
M1 holds the combinations of the first and second letters of a word, M2 the combinations of the second and third letters, and so on.
All the matrices are initialized to zeros.
Matrix Method Deals With Each Word
• 2-Gram set (S)
Each item in (S) consists of two adjacent letters.
Each item is assigned the value 1 or 2 in the corresponding matrix:
2 if the word ends with these two letters.
1 if the two letters are connected and the word is not yet finished.
e.g. for the word [ناجح]:
the 2-Gram set is S = {نا، اج، جح}
M1[ن][ا] = 1, M2[ا][ج] = 1, M3[ج][ح] = 2.
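The matrix construction can be sketched as follows. This is an illustrative reading of the slides, not the paper's implementation. It uses dictionaries instead of dense 28×28 arrays, and it keeps a set of values per cell because the slides do not say what happens when the same letter pair is word-final in one corpus word and word-internal in another. The acceptance rule in `is_accepted` and the rejection of one-letter tokens are likewise assumptions.

```python
from collections import defaultdict


def build_matrices(corpus_words):
    """matrices[k][(a, b)] holds the values seen for letters k and k+1 (0-based)."""
    matrices = defaultdict(lambda: defaultdict(set))
    for word in corpus_words:
        for k in range(len(word) - 1):
            is_last_pair = k == len(word) - 2
            # 2 marks a pair that ends a word, 1 a pair that continues it.
            matrices[k][(word[k], word[k + 1])].add(2 if is_last_pair else 1)
    return matrices


def is_accepted(word, matrices):
    """Accept a word if every pair is known at its position with the right value."""
    if len(word) < 2:
        return False  # one-letter tokens are outside this sketch
    for k in range(len(word) - 1):
        needed = 2 if k == len(word) - 2 else 1
        seen = matrices.get(k, {}).get((word[k], word[k + 1]), set())
        if needed not in seen:
            return False
    return True


matrices = build_matrices(["ناجح"])
print(is_accepted("ناجح", matrices), is_accepted("ناج", matrices))
```

With 0-based indexing, `matrices[0]` corresponds to M1 on the slide. Note that a checker of this kind can accept letter sequences that never occurred as a whole word in the corpus, as long as each pair is valid at its position.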
Matrix Method Deals With Each Word
Enter the text to be tested → Tokenizing process → Cleaning process → Matrix method deals with each word
Matrix Method Deals With Each Word (Cont.)
Result
• The training dataset consists of 71,390 Arabic words (70%), while the testing dataset consists of 30,597 Arabic words (30%).
The Overall Evaluation of the Results
• Increasing the size of the data set increases the accuracy.
Bridge the critical gap in available open-source spell-checking resources for Arabic.
Create an open-source, large-coverage word list for Arabic (9,000,000 words).
Error Detection:
Direct method: match words in an open text against a list of correct words.
Language modeling method: build a character-based tri-gram language model using SRILM in order to classify generated words as valid or invalid.
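The language modeling method can be illustrated with a small character trigram model. The paper uses SRILM, so the pure-Python class below is a stand-in with add-one smoothing rather than the authors' setup. The class name `CharTrigramModel`, the padding symbols, and the toy training words are assumptions. A word would be classified as valid when its score exceeds a threshold tuned on held-out data.

```python
import math
from collections import Counter


class CharTrigramModel:
    """Character trigram model with add-one smoothing (illustrative only)."""

    def __init__(self, words):
        self.trigrams = Counter()
        self.contexts = Counter()
        self.symbols = set()
        for word in words:
            padded = "^^" + word + "$"  # start and end markers
            self.symbols.update(padded)
            for i in range(len(padded) - 2):
                self.trigrams[padded[i:i + 3]] += 1
                self.contexts[padded[i:i + 2]] += 1

    def avg_log_prob(self, word):
        """Average smoothed log-probability per predicted character."""
        padded = "^^" + word + "$"
        steps = len(padded) - 2
        size = len(self.symbols)
        total = 0.0
        for i in range(steps):
            tri = padded[i:i + 3]
            prob = (self.trigrams[tri] + 1) / (self.contexts[tri[:2]] + size)
            total += math.log(prob)
        return total / steps


# Hypothetical toy training words, for illustration only.
model = CharTrigramModel(["كتاب", "كاتب", "مكتب", "كتب"])
print(model.avg_log_prob("كتاب") > model.avg_log_prob("ظظظظ"))
```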
4. Arabic Word Generation and Modeling for Spell Checking
[Figure: Flow chart of spelling error correction. Recoverable labels: Input; Finite-State Transducer; Arabic word list; Error? (Yes/No); Candidates list; Score; Candidates ranker (augmented edit distance and language-specific rules); Noisy channel model; Gigaword corpus; Post-processing; Suggestion list; Display suggestions.]
Best accuracy score = 75%
Evaluation on:
Microsoft Word 2010 = 80.54%
Hunspell using Ayaspell = 45.64%
Result
Language model
Spelling error detection and correction components
Dictionary (or reference word list)
Error model
5. Improved Spelling Error Detection and Correction for Arabic
AraComLex Extended word list
• Matching its word list against the Gigaword corpus
• Double-checked by the Buckwalter Arabic Morphological Analyzer
• Creating a dictionary of 9.3 million Arabic words
Improving the Dictionary
Finite-state transducer to propose candidate corrections
Discard candidates that are not found in the word list
Rank the remaining candidates
Spelling Correction: N-gram language models. The candidate with the lowest perplexity score is selected as the correction.
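Selecting the candidate with the lowest perplexity can be sketched with a small word-level bigram model. This is a simplified illustration with add-one smoothing, not the paper's language model. The function names, the `<s>` and `</s>` markers, and the two training sentences are assumptions.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized text."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence, model):
    """Perplexity of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_sum = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_sum += math.log(prob)
    return math.exp(-log_sum / (len(tokens) - 1))


def best_candidate(left, options, right, model):
    """Pick the option that gives the sentence the lowest perplexity."""
    return min(
        options,
        key=lambda c: perplexity(" ".join(left + [c] + right), model),
    )


# Hypothetical toy training data, for illustration only.
model = train_bigram(["ذهب الولد الى المدرسة", "ذهب الولد الى البيت"])
print(best_candidate(["ذهب"], ["البلد", "الولد"], ["الى", "البيت"], model))
```

Because every option is scored inside the same sentence frame, the perplexities are directly comparable across candidates.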
Improving the Error Model: Candidate Generation
Analyze the level of noise in different sources of data.
Agence France-Presse (AFP) is the noisiest, while Al-Jazeera data is the cleanest.
Select the optimal subset to train the system on.
Improving the language model: Analyzing the Training Data
AFP = 73.93%
Al-Jazeera data = 80.97%
Gigaword corpus = 82.86%
Clean data is better than noisy data when they are comparable in size; however, more data is better than clean data.
Evaluation on:
Google Docs = 9.32%
Ayaspell for OpenOffice = 41.86%
Microsoft Word 2010 = 57.15%
Result
Having reviewed these approaches, we see that the results are promising and represent a good starting point for future research on improving Arabic spell checkers.
Conclusion
THANKS