Arabic spell checkers

Natural Language Processing - CS465

Supervised by:

Dr. Amal Al-Saif

Done by:

Hanan Al-Mohammadi

Mona Al-Mutairi

Imam Muhammad ibn Saud University, Department of Computer Science and Information System

Outline
- Introduction
- Arabic Spell Checker Techniques

First Paper

“An Approach for Analyzing and Correcting Spelling Errors for Non-native Arabic Learners”

o Based on a questioning environment.

• Error Detection
Two types of errors:

1. Ill-formed word errors.
o Detected with Buckwalter’s Arabic Morphological Analyzer.

Ex. ‘حائيط’ is an ill-formed version of the word ‘حائط’.

2. Semantically incorrect errors.
Ex. A spelling question displays a happy face to a learner and asks him to write a word that describes the picture, and he enters ‘ساعد’/helped instead of ‘سعيد’/happy.
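The two error types can be told apart with a simple check, sketched below. This is only an illustration: the small `VALID_WORDS` set is a hypothetical stand-in for a real morphological analyzer, and the function name `classify_answer` is an assumption, not something taken from the paper.

```python
# Hypothetical toy lexicon standing in for a morphological analyzer.
# A real system would query the analyzer instead of a fixed set.
VALID_WORDS = {"حائط", "سعيد", "ساعد"}


def classify_answer(answer: str, expected: str, valid_words=VALID_WORDS) -> str:
    """Classify a learner's answer to a spelling question.

    Returns 'correct', 'ill-formed' (not a valid word at all), or
    'semantically incorrect' (a valid word, but not the expected one).
    """
    if answer == expected:
        return "correct"
    if answer not in valid_words:
        return "ill-formed"
    return "semantically incorrect"


if __name__ == "__main__":
    print(classify_answer("حائيط", "حائط"))
    print(classify_answer("ساعد", "سعيد"))
```

The semantic check is possible here only because the questioning environment knows which word it expects.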

• Error Correction
Edit distance technique.
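As a reminder of what edit distance measures, here is a minimal sketch of the classic Levenshtein distance (insertions, deletions, and substitutions each cost 1). The slides do not say which variant or weighting the paper uses, so treat this as an assumption.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    # previous[j] = distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("حائيط", "حائط"))
```

Candidate corrections are then the valid words within a small distance of the misspelled word.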

• Filtering
1. Morphological Analyzer Filter.

Ex. After applying the correction techniques to the word ‘حئط’, ‘حائيط’ appears as a correction, so the morphological filter will exclude it.

2. Gloss Filter.
Ex. If the user misspells the word ‘سعيد’/happy as ‘ساعيد’ (the second letter ‘ا’ is incorrectly written in place of the short vowel Fatha), applying the correction techniques yields two possible corrections: ‘سعيد’/happy and ‘ساعد’/helped. Both are valid Arabic words. Applying the gloss filter will exclude the word ‘ساعد’/helped.
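The two filters can be sketched as list filters over the candidate corrections. The `GLOSSES` dictionary below is a hypothetical stand-in for the glosses a morphological analyzer would return, and both function names are assumptions made for illustration.

```python
# Hypothetical word -> English gloss table standing in for analyzer output.
GLOSSES = {"حائط": "wall", "سعيد": "happy", "ساعد": "helped"}


def morphological_filter(candidates, glosses=GLOSSES):
    """Keep only candidates the (toy) analyzer recognizes as valid words."""
    return [word for word in candidates if word in glosses]


def gloss_filter(candidates, expected_gloss, glosses=GLOSSES):
    """Keep only candidates whose gloss matches what the question expects."""
    return [word for word in candidates if glosses.get(word) == expected_gloss]


if __name__ == "__main__":
    print(morphological_filter(["حائط", "حائيط"]))
    print(gloss_filter(["سعيد", "ساعد"], "happy"))
```

The first call drops the ill-formed ‘حائيط’; the second keeps only the word meaning "happy".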

• Evaluation:
Done using real test data composed of 190 misspelled words, including both single-error and multi-error misspellings with up to three errors per word. The average word length is 5 letters.

• Result
Over 80% recall and over 90% precision were achieved for each type of spelling error.
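The slides do not define the two metrics, so the sketch below assumes the standard definitions: precision is the share of proposed corrections that are right, and recall is the share of misspellings that end up correctly fixed. The counts in the example are made up for illustration and are not the paper's data.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) from raw counts, using 0.0 for empty denominators."""
    proposed = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / proposed if proposed else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts: 9 right corrections, 1 wrong correction, 3 misses.
    print(precision_recall(9, 1, 3))
```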

Second Paper

“Towards Automatic Spell Checking for Arabic”

• Composed of an Arabic morphological analyzer, a lexicon, a spelling detector, and a spelling corrector.

• Spelling detection
• Two possibilities:

1. The misspelled word is an invalid word. Ex. ‘محد’ for ‘محمد’
2. The misspelled word is a valid word. Ex. ‘مال’ in place of ‘نال’

• Spelling correction:
• Add missing character: the candidates for the misspelled ‘معض’ are ‘معوض’, ‘معضد’ and ‘معرض’.

• Replace incorrect character: the candidates for the misspelled ‘معض’ are ‘نعض’, ‘معد’ and ‘كعض’.

• Remove excessive character: the candidates for the misspelled word ‘معض’ are ‘مع’ and ‘عض’.

• Add a space to split words: the candidates for the misspelled word ‘معض’ are ‘مع’ and ‘عض’.
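The four correction operations can be sketched as candidate generators. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, `ARABIC_LETTERS` is assumed to be just the 28 basic letters (hamza forms, ta marbuta, and alef maqsura are left out), and the split operation here inserts a single space inside the word.

```python
# Assumption: the 28 basic Arabic letters only.
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"


def add_character(word, alphabet=ARABIC_LETTERS):
    """All strings obtained by inserting one letter at any position."""
    return {word[:i] + c + word[i:] for i in range(len(word) + 1) for c in alphabet}


def replace_character(word, alphabet=ARABIC_LETTERS):
    """All strings obtained by changing exactly one letter."""
    return {word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in alphabet if c != word[i]}


def remove_character(word):
    """All strings obtained by deleting exactly one letter."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def split_word(word):
    """All strings obtained by inserting one space inside the word."""
    return {word[:i] + " " + word[i:] for i in range(1, len(word))}


def generate_candidates(word):
    """Union of the four operations; a lexicon check would come afterwards."""
    return (add_character(word) | replace_character(word)
            | remove_character(word) | split_word(word))


if __name__ == "__main__":
    print(sorted(remove_character("معض")))
```

Most generated strings are not real words, so the candidates still have to be checked against the lexicon before being offered to the user.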

• Arabic morphological analyzer
• Breaks down the inflected word ‘المسئولين’ into the prefix ‘ال’, the suffix ‘ين’, and the stem ‘مسئول’. It then checks the stem against the stem lexicon; if the stem has an entry in the lexicon, it is correct.
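The affix-stripping step can be sketched as follows. The one-entry affix lists and the stem lexicon are toy assumptions that cover only the example from the slide; a real analyzer would use full affix tables and compatibility rules.

```python
# Toy data covering only the slide's example (assumptions, not the paper's tables).
PREFIXES = ["ال"]
SUFFIXES = ["ين"]
STEM_LEXICON = {"مسئول"}


def analyze(word, prefixes=PREFIXES, suffixes=SUFFIXES, lexicon=STEM_LEXICON):
    """Return every (prefix, stem, suffix) split whose stem is in the lexicon."""
    analyses = []
    for prefix in [""] + list(prefixes):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + list(suffixes):
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if stem in lexicon:
                analyses.append((prefix, stem, suffix))
    return analyses


if __name__ == "__main__":
    print(analyze("المسئولين"))
```

A word with no analysis (an empty list) would be flagged as misspelled.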

• Evaluation:
This approach is theoretical; no experimental results were reported.

Third Paper

- Algorithm defined by B. Haddad and M. Yassen
- Error patterns

Simple Errors:

Editing Errors and Boundary Problems

Substitution: (/فال/ → /قال/, fāl → qāl, he said), the letter (/ق/, q) is mistakenly substituted by (/ف/, f).

Deletion: (/اسخدم/ → /استخدم/, ’sḫdama → ’staḫdama, he or it used), the letter (/ت/, t) is missing.

Insertion: (/مكتتوب/ → /مكتوب/, makttūb → maktūb, a letter in the sense of a message), the letter (/ت/, t) is additionally inserted.

Transposition: (/اجمتاع/ → /اجتماع/, ’ğmitā‘ → ’ğtimā‘, meeting), the letter (/ت/, t) is swapped.

Boundary (missing space): (/رئيسالجامعة/ → /رئيس الجامعة/, ra’īs’alğami‘h → ra’īs ’alğami‘h)

Boundary (extra space): (/ف قال/ → /فقال/, fa qāl → faqāl, and then he said)

Cognitive and Phonetic Mistakes

(/هاذا/ or /هازا/ → /هذا/, hādā or hāzā → hadā, the particle that)

Syntax Errors

(/ذهب البنت إلى المدرسة/ → /ذهبت البنت إلى المدرسة/, the girl went to [the]-school), (/ذهب/, dahaba) instead of (/ذهبت/, dahabat).

Semantic Errors

(/كريات الذم الحمراء/ → /كريات الدم الحمراء/, red rebuking cells → red blood cells), (/الذم/, ’ldam, the rebuking) instead of (/الدم/, ’ldam, the-blood).
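The four editing errors and the two boundary cases each differ from the intended text by a single operation, so they can be recognized mechanically. The sketch below is an illustration only; the function names and the returned labels are assumptions and are not part of the algorithm by Haddad and Yassen.

```python
def _single_extra_char(longer: str, shorter: str):
    """Return the one character whose removal turns longer into shorter, else None."""
    for i in range(len(longer)):
        if longer[:i] + longer[i + 1:] == shorter:
            return longer[i]
    return None


def classify_simple_error(misspelled: str, intended: str) -> str:
    """Name the single-operation error that turns intended into misspelled."""
    if misspelled == intended:
        return "none"
    length_diff = len(misspelled) - len(intended)
    if length_diff == 0:
        diffs = [i for i in range(len(intended)) if misspelled[i] != intended[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and misspelled[diffs[0]] == intended[diffs[1]]
                and misspelled[diffs[1]] == intended[diffs[0]]):
            return "transposition"
    elif length_diff == -1:
        missing = _single_extra_char(intended, misspelled)
        if missing is not None:
            return "boundary" if missing == " " else "deletion"
    elif length_diff == 1:
        extra = _single_extra_char(misspelled, intended)
        if extra is not None:
            return "boundary" if extra == " " else "insertion"
    return "other"


if __name__ == "__main__":
    print(classify_simple_error("اجمتاع", "اجتماع"))
```

Cognitive, syntax, and semantic errors fall outside this check, since they need phonetic or contextual knowledge rather than a character comparison.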

- Knowledge base: D&C = (DAWKB, NDAKB, CORSTR)

- Derivative Arabic Word Knowledge Base (DAWKB)

- For each valid Arabic root there is a certain number of consistent patterns.
- A consistent root-pattern relationship means a word that has at least one lexical occurrence in the Arabic vocabulary.
- dwj = ( Prefji + PtjΘsubMGRi + Suffji ) MSR PNGRi

- Database for NDW & AW: considered as stems or lexemes collected in the knowledge base.

- Non-Word Recognition and Error Correction Strategy
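The root-pattern idea behind DAWKB can be illustrated with a small sketch. It assumes triliteral roots and patterns written with the conventional placeholder letters ف ع ل; the function names and the tiny lexicon are hypothetical and only serve the example.

```python
# Conventional placeholder letters for the three radicals of a root.
PLACEHOLDERS = "فعل"


def apply_pattern(root: str, pattern: str) -> str:
    """Fill a pattern such as 'مفعول' with the three letters of a root."""
    if len(root) != 3:
        raise ValueError("this sketch handles triliteral roots only")
    # Map each placeholder to the matching radical in a single pass, so a
    # root that itself contains ف, ع or ل is not substituted twice.
    # Limitation: a pattern using one of these letters as an affix
    # would be substituted too.
    return pattern.translate(str.maketrans(PLACEHOLDERS, root))


def consistent_patterns(root, patterns, lexicon):
    """Patterns whose derived word has an entry in the (toy) lexicon."""
    return [p for p in patterns if apply_pattern(root, p) in lexicon]


if __name__ == "__main__":
    toy_lexicon = {"كاتب", "مكتوب"}  # deliberately tiny, for illustration
    print(apply_pattern("كتب", "مفعول"))
    print(consistent_patterns("كتب", ["فاعل", "مفعول", "فعيل"], toy_lexicon))
```

A word that cannot be explained by any consistent root-pattern pair, and is not among the stored stems, becomes a non-word candidate for the correction strategy.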

Fourth Paper

- Paper proposed by A. Hattab and A. Hussein.

- The proposed system consists of three models.

- The detection and correction model classifies words as non-words or misspellings.

Evaluation:
- The proposed system was run twice: first without the detection and correction method, and then with it.

- The same data was used in both experiments. The results of these experiments are shown in tables.

- The detection and correction algorithm outperformed the Bayes algorithm by about 10%. Without checking for misspelling errors, accuracy is 68.85%, while the average accuracy of the classification system with misspelling detection and correction is 71.77%.

Thank You For Your Attention
