Arabic spell checking approaches

By: Banan AlHadlaq, Dalal AlZeer, Monirah AlOrf

Supervised by: Dr. Amal Al-Saif

Natural language processing - CS465

Outline

• Introduction

• Common Arabic Spell Errors

• Towards Automatic Spell Checking for Arabic

• Towards Arabic Spell-Checker Based on N-Grams Scores

• Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions

• Arabic Word Generation and Modeling for Spell Checking

• Improved Spelling Error Detection and Correction for Arabic

• Conclusion

Introduction

• Arabic language

• NLP applications

• Approaches for solving the Arabic spell checking problem

Common Arabic Spell Errors

• Reading Errors

{ا, أ, إ, آ} {ب, ت, ث, نـ, يـ} {ج, ح, خ} {د, ذ} {ر, ز} {س, ش} {ص, ض} {ط, ظ} {ع, غ} {ف, ق} {ي, ى} {ه, ة} {و, ؤ}

• Hearing Errors

{ا, ى} {أ, ق} {ت, د, ط, ض} {ث, س, ش} {ج, ق} {ذ, ز, ظ} {ر, ل, ي} {ق, ك} {ه, ة}

• Touch-Typing Errors

• Morphological Errors

• Editing Errors
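The reading-error letter groups listed above can be used to propose alternative spellings of a word. The following is a minimal sketch, assuming only a handful of those groups (the two- and three-letter ones that are unambiguous in the list); `READING_SETS` and `confusable_variants` are hypothetical names introduced for illustration, not part of any of the surveyed systems.

```python
# A few of the visually confusable letter groups from the "Reading Errors" list.
READING_SETS = [
    set("جحخ"), set("دذ"), set("رز"), set("سش"),
    set("صض"), set("طظ"), set("عغ"), set("فق"),
]

def confusable_variants(word, sets=READING_SETS):
    """Return every spelling obtained by swapping one letter of `word`
    for another letter from the same confusable group."""
    variants = set()
    for i, ch in enumerate(word):
        for group in sets:
            if ch in group:
                for other in group - {ch}:
                    variants.add(word[:i] + other + word[i + 1:])
    return variants

print(confusable_variants("درس"))
```

A spell checker would typically keep only the variants that appear in its word list.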

1. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions

• Stochastic-based approach for misspelling correction of Arabic text.

• A context-based, two-layer system that automatically corrects misspelled words in large datasets.

Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)

• Error detection

• Candidates’ generation component: Single-Word Errors, Space Deletion Errors, Space Insertion Errors

• Best candidate selection component

Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)

Candidates’ generation component: Space Deletion Errors
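A space deletion error occurs when two words are run together into one token. The slide does not detail how candidates are produced, so the following is only a minimal sketch of one simple way to do it: try every split point and keep the splits whose two halves are both known words. The `lexicon` set and the function name are assumptions made for illustration.

```python
def split_candidates(token, lexicon):
    """Return every (left, right) split of `token` where both parts are in `lexicon`."""
    candidates = []
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in lexicon and right in lexicon:
            candidates.append((left, right))
    return candidates

# Hypothetical three-word lexicon; "فيالبيت" is "في" and "البيت" run together.
lexicon = {"في", "البيت", "من"}
print(split_candidates("فيالبيت", lexicon))
```

Tokens that merge more than two words would need the split to be applied recursively, which this sketch does not attempt.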

Result

• A standard Arabic text corpus (TRN_DB_I)

• An extra standard Arabic text corpus (TRN_DB_II)

• The test data (TST_DB)

• The testing results show that performance improves as the size of the training set increases, reaching an F1 score of 97.9% for detection and 92.3% for correction.

2. Towards Automatic Spell Checking for Arabic

• Developing an Arabic spelling checker program.

• Implemented in the SICStus Prolog language.

• Recognizes common Arabic spelling errors and offers suggestions for error correction.

• Able to recognize common spelling errors in Standard Arabic and Egyptian dialects.

• Can be integrated with other text processing software, such as word processors.

Towards Automatic Spell Checking for Arabic (Cont.)

• Analysis of common spelling errors, which is used to detect misspelled Arabic words.

• Detection of spelling errors is limited to isolated words (non-words), e.g. ‘محد’ for ‘محمد’.

• Performs a series of heuristic steps to find replacement candidates:

Add missing character

Replace incorrect character

Remove excessive character

Add a space to split words

Add missing character

• e.g. the candidates of the misspelled word ‘معض’ are: ‘دمعض’, ‘ضرمع’

Replace incorrect character

• e.g. the candidates of the misspelled word ‘معض’ are: ‘عضن’, ‘دمع’, ‘عضك’

Remove excessive character

• e.g. the candidates of the misspelled word ‘معض’ are: ‘عض’, ‘مع’

Add a space to split words

• e.g. the candidates of the misspelled word ‘معض’ are: ‘عض’, ‘مع’
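The four heuristic steps can be illustrated with a short sketch. It assumes a plain set of valid words (`lexicon`) and tries every letter of the basic 28-letter alphabet at every position; the original program is written in SICStus Prolog and may restrict substitutions (for example through a neighbors table), so this is an illustration of the idea rather than the authors' implementation. All names here are hypothetical.

```python
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # the 28 basic letters

def heuristic_candidates(word, lexicon, alphabet=ARABIC_LETTERS):
    """Generate candidates by adding, replacing, or removing one character,
    or by inserting one space, and keep those found in `lexicon`."""
    edits = set()
    for i in range(len(word) + 1):
        for c in alphabet:
            edits.add(word[:i] + c + word[i:])          # add missing character
    for i in range(len(word)):
        for c in alphabet:
            edits.add(word[:i] + c + word[i + 1:])      # replace incorrect character
        edits.add(word[:i] + word[i + 1:])              # remove excessive character
    edits.discard(word)
    singles = {w for w in edits if w in lexicon}
    splits = {
        word[:i] + " " + word[i:]                       # add a space to split words
        for i in range(1, len(word))
        if word[:i] in lexicon and word[i:] in lexicon
    }
    return singles | splits

# The misspelling 'محد' for 'محمد' is the example used earlier in the slides.
print(heuristic_candidates("محد", {"محمد", "حد"}))
```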


Neighbors table


3. Towards Arabic Spell-Checker Based on N-Grams Scores

• Developing a simple and flexible spell-checker for the Arabic language (detects errors).

• Based on N-gram scores.

• Using a matrix approach.

• The corpus used is adapted from Muaidi's PhD thesis.

• It consists of 101,987 word types.

Towards Arabic Spell-Checker Based on N-Grams Scores (Cont.)

• Enter the text to be tested

• Tokenizing process

• Cleaning process

• Matrix method deals with each word

Matrix Method Deals With Each Word

• Building the matrices

Number of matrices = length of the longest word in the corpus – 1.

The dimension of each matrix is 28×28 (one row and one column per Arabic letter).

(M1) holds the combinations of the first and second letters of a word, (M2) the combinations of the second and third letters, and so on.

All the matrices are initialized with zeros.

Matrix Method Deals With Each Word

• 2-Gram set (S)

Each item in (S) consists of two letters.

Each item is assigned the value 1 or 2 in the corresponding matrix:

Assigned 2 if the word ends with these two letters.

Assigned 1 if the two letters are connected and the word is not over yet.

e.g. for the word [ناجح]:

the 2-Gram set is S = {نا، اج، جح}

M1[ن][ا] = 1, M2[ا][ج] = 1, M3[ج][ح] = 2
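The matrix construction and lookup can be sketched as follows. Two points are assumptions not stated on the slides: when the same letter pair both ends one word and continues another at the same position, this sketch keeps the larger value (2), and words containing characters outside the 28 basic letters are skipped, standing in for the cleaning process. Function names are hypothetical.

```python
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # 28 letters -> 28x28 matrices
INDEX = {ch: i for i, ch in enumerate(ARABIC_LETTERS)}

def build_matrices(corpus):
    """Build one 28x28 matrix per adjacent letter position (M1, M2, ...)."""
    words = [w for w in corpus if len(w) >= 2 and all(ch in INDEX for ch in w)]
    longest = max((len(w) for w in words), default=1)
    n = len(ARABIC_LETTERS)
    matrices = [[[0] * n for _ in range(n)] for _ in range(longest - 1)]
    for word in words:
        for k in range(len(word) - 1):
            a, b = INDEX[word[k]], INDEX[word[k + 1]]
            mark = 2 if k == len(word) - 2 else 1   # 2 = word ends with this pair
            # Assumption: keep the larger mark when a pair plays both roles.
            matrices[k][a][b] = max(matrices[k][a][b], mark)
    return matrices

def is_valid(word, matrices):
    """Accept a word if every 2-gram is marked and the last one is marked 2."""
    if len(word) < 2 or len(word) - 1 > len(matrices):
        return False
    if any(ch not in INDEX for ch in word):
        return False
    for k in range(len(word) - 1):
        value = matrices[k][INDEX[word[k]]][INDEX[word[k + 1]]]
        needed = 2 if k == len(word) - 2 else 1
        if value < needed:
            return False
    return True

matrices = build_matrices(["ناجح"])
print(is_valid("ناجح", matrices), is_valid("ناج", matrices))
```

Because each matrix only records adjacent pairs, a scheme like this can accept some non-words whose 2-grams all occur in the corpus at the right positions.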

Matrix Method Deals With Each Word (Cont.)

• Enter the text to be tested

• Tokenizing process

• Cleaning process

• Matrix method deals with each word

Result

• The training dataset consists of 71,390 Arabic words (70%), while the testing dataset consists of 30,597 Arabic words (30%).

The Overall Evaluation of the Results

• Increasing the size of the dataset increases the accuracy.

4. Arabic Word Generation and Modeling for Spell Checking

• Bridge the critical gap in available open-source spell checking resources for Arabic.

• Create an open-source, large-coverage word list for Arabic (9,000,000 words).

• Error Detection:

Direct method: match words in an open text against a list of correct words.

Language modeling method: build a character-based tri-gram language model using SRILM in order to classify generated words as valid or invalid.
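The language modeling method can be illustrated with a small character tri-gram model. The original work uses SRILM; the sketch below is a pure-Python stand-in with add-one smoothing, written only to show the idea, and the class name, padding symbols, and tiny training list are all assumptions. A real system would train on a large word list and tune a score threshold to separate valid from invalid words.

```python
import math
from collections import Counter

class CharTrigramModel:
    """Character tri-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, words):
        self.tri = Counter()   # counts of three-character sequences
        self.bi = Counter()    # counts of two-character contexts
        self.vocab = set()
        for w in words:
            padded = "^^" + w + "$"          # start padding and end marker
            self.vocab.update(padded)
            for i in range(len(padded) - 2):
                self.bi[padded[i:i + 2]] += 1
                self.tri[padded[i:i + 3]] += 1

    def avg_logprob(self, word):
        """Average log-probability per character; higher means more word-like."""
        padded = "^^" + word + "$"
        v = len(self.vocab) + 1              # +1 reserves mass for unseen characters
        total = 0.0
        for i in range(len(padded) - 2):
            context, gram = padded[i:i + 2], padded[i:i + 3]
            total += math.log((self.tri[gram] + 1) / (self.bi[context] + v))
        return total / (len(padded) - 2)

model = CharTrigramModel(["كتاب", "كاتب", "مكتب", "كتب"])
print(model.avg_logprob("كتاب"), model.avg_logprob("بتكا"))
```

The known word scores higher than the scrambled string, which is the signal a classifier would threshold on.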

[Figure: Flow chart of spelling error correction. Elements shown: Input; Finite-State Transducer; Arabic word list; Error? (Yes / No); Candidates list score; Candidates ranker (augmented edit distance and language-specific rules); Noisy channel model; Gigaword corpus; Suggestion list; Post-processing; Display suggestions.]

Result

• Best accuracy score = 75%

• Evaluation on:

Microsoft Word 2010 = 80.54%

Hunspell using Ayaspell = 45.64%

5. Improved Spelling Error Detection and Correction for Arabic

Spelling error detection and correction components:

• Dictionary (or reference word list)

• Error model

• Language model

Improving the Dictionary

AraComLex Extended word list

• Matching its word list against the Gigaword corpus

• Double-checked by the Buckwalter Arabic Morphological Analyzer

• Creating a dictionary of 9.3 million Arabic words

Improving the Error Model: Candidate Generation

• Finite-state transducer to propose candidate corrections

• Discard candidates that are not found in the word list

• Rank the remaining candidates

Spelling Correction: N-gram language models. The candidate with the lowest perplexity score is selected as the best correction.
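The discard-then-rank step can be sketched with a small word bigram model: candidates missing from the word list are dropped, and the remaining ones are ranked by the perplexity of the sentence they produce. The surveyed system generates candidates with a finite-state transducer and trains on far larger data; here the candidates are simply passed in, the model uses add-one smoothing, and every name and sentence is a hypothetical example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect context and bigram counts from whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(s.split())
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def perplexity(sentence, model):
    """Perplexity of `sentence` under the add-one smoothed bigram model."""
    contexts, bigrams, v = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_sum = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        log_sum += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    return math.exp(-log_sum / (len(tokens) - 1))

def best_correction(left, candidates, right, word_list, model):
    """Drop candidates not in `word_list`, then pick the lowest-perplexity one."""
    kept = [c for c in candidates if c in word_list]
    if not kept:
        return None
    return min(kept, key=lambda c: perplexity(" ".join(left + [c] + right), model))

model = train_bigram(["ذهب الولد الى المدرسة", "قرأ الولد الكتاب"])
word_list = {"الولد", "الوعد"}
print(best_correction(["ذهب"], ["الولد", "الوعد", "الولض"], ["الى", "المدرسة"],
                      word_list, model))
```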

Improving the Language Model: Analyzing the Training Data

• Analyze the level of noise in different sources of data.

• Agence France-Presse (AFP) is the noisiest, while Al-Jazeera data is the cleanest.

• Select the optimal subset to train the system on.

Result

• AFP = 73.93%

• Al-Jazeera data = 80.97%

• Gigaword corpus = 82.86%

• Clean data is better than noisy data when they are comparable in size; however, more data is better than clean data.

• Evaluation on:

Google Docs = 9.32%

Ayaspell for OpenOffice = 41.86%

Microsoft Word 2010 = 57.15%

Conclusion

The approaches surveyed here show promising results and represent a good starting point for future research to improve Arabic spell checking.

THANKS
