Page 1: Arabic spell checking approaches

Arabic Spell Checking Approaches

By: Banan AlHadlaq, Dalal AlZeer, Monirah AlOrf

Supervised by: Dr. Amal Al-Saif

Natural language processing - CS465

Page 2: Arabic spell checking approaches

Outline

• Introduction

• Common Arabic Spell Errors

• Towards Automatic Spell Checking for Arabic

• Towards Arabic Spell-Checker Based on N-Grams Scores

• Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions

• Arabic Word Generation and Modeling for Spell Checking

• Improved Spelling Error Detection and Correction for Arabic

• Conclusion

Page 3: Arabic spell checking approaches

Introduction

• Arabic language

• NLP applications

• Approaches for solving the Arabic spell checking problem

Page 4: Arabic spell checking approaches

Common Arabic Spell Errors

• Reading Errors

{ا, أ, إ, آ} {ب, ت, ث, نـ, يـ} {ج, ح, خ} {د, ذ} {ر, ز} {س, ش} {ص, ض} {ط, ظ} {ع, غ} {ف, ق} {ي, ى} {ه, ة} {و, ؤ}

• Hearing Errors

{ى, ا} {ق, أ} {ض, د, ت, ط} {ش, س, ث} {ق, ج} {ز, ظ, ذ} {ي, ل, ر} {ق, ك} {ة, ه}

• Touch-Typing Errors

• Morphological Errors

• Editing Errors

Page 5: Arabic spell checking approaches

1. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions

• A stochastic approach to correcting misspellings in Arabic text.

• A context-based, two-layer system that automatically corrects misspelled words in large datasets.

Page 6: Arabic spell checking approaches

Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)

System components:

• Error detection

• Candidates’ generation component, covering single-word errors, space deletion errors, and space insertion errors

• Best candidate selection component

Page 7: Arabic spell checking approaches


Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (Cont.)

Candidates’ generation component: Space Deletion Errors
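The slide's figure did not survive extraction, so the authors' exact procedure is not shown here. As a minimal sketch of the general idea, assuming a space deletion error means two or more valid words were typed without the space between them, candidates can be generated by trying every segmentation of the token into dictionary words. The function name `space_deletion_candidates` and the toy lexicon are hypothetical.

```python
def space_deletion_candidates(token, lexicon):
    """Return every way to split `token` into two or more lexicon words."""
    results = []

    def extend(start, parts):
        # Reached the end of the token: keep the split if it has >= 2 words.
        if start == len(token):
            if len(parts) >= 2:
                results.append(" ".join(parts))
            return
        # Try every prefix of the remaining text that is a known word.
        for end in range(start + 1, len(token) + 1):
            piece = token[start:end]
            if piece in lexicon:
                extend(end, parts + [piece])

    extend(0, [])
    return results


# Toy lexicon (assumption): "في" (in) and "البيت" (the house).
lexicon = {"في", "البيت"}
print(space_deletion_candidates("فيالبيت", lexicon))
```

A real system would still need the best candidate selection component to choose among the splits this returns.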

Page 8: Arabic spell checking approaches


Result

• A standard Arabic text corpus (TRN_DB_I)

• An extra standard Arabic text corpus (TRN_DB_II)

• The test data (TST_DB)

• Testing shows that performance improves as the training set grows, reaching an F1 score of 97.9% for detection and 92.3% for correction.
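For reference, the F1 score reported above is the harmonic mean of precision $P$ and recall $R$. The slides do not give the underlying precision and recall values, so only the definition is shown:

```latex
F_1 = \frac{2 \, P \, R}{P + R}
```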

Page 9: Arabic spell checking approaches

2. Towards Automatic Spell Checking for Arabic

• Develops an Arabic spelling checker program.

• Implemented in the SICStus Prolog language.

• Recognizes common Arabic spelling errors and offers suggestions for error correction.

• Recognizes common spelling errors in both Standard Arabic and Egyptian dialects.

• Can be integrated with other text processing software, such as word processors.

Page 10: Arabic spell checking approaches

Towards Automatic Spell Checking for Arabic (Cont.)

• Analyzes common spelling errors and uses that analysis to detect misspelled Arabic words.

• Limits the detection of spelling errors to isolated words (non-words), e.g. ‘محد’ for ‘محمد’.

• Performs a series of heuristic steps to find replacement candidates:

  - Add a missing character

  - Replace an incorrect character

  - Remove an excessive character

  - Add a space to split words
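The four heuristic steps can be sketched as follows. This is an illustrative Python sketch, not the authors' SICStus Prolog implementation. The 28-letter `ARABIC_LETTERS` string, the function name `heuristic_candidates`, and the toy lexicon are assumptions, and candidates are kept only if they appear in the lexicon.

```python
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # 28 base letters (assumed inventory)


def heuristic_candidates(word, lexicon, alphabet=ARABIC_LETTERS):
    """Group lexicon words reachable from `word` by each heuristic step."""
    add, replace, remove, split = set(), set(), set(), set()

    # 1. Add a missing character at every position.
    for i in range(len(word) + 1):
        for ch in alphabet:
            add.add(word[:i] + ch + word[i:])

    for i in range(len(word)):
        # 2. Replace the character at position i with another letter.
        for ch in alphabet:
            if ch != word[i]:
                replace.add(word[:i] + ch + word[i + 1:])
        # 3. Remove the character at position i.
        remove.add(word[:i] + word[i + 1:])

    # 4. Add a space so that both halves are valid words.
    for i in range(1, len(word)):
        if word[:i] in lexicon and word[i:] in lexicon:
            split.add(word[:i] + " " + word[i:])

    return {
        "add": add & lexicon,
        "replace": replace & lexicon,
        "remove": remove & lexicon,
        "split": split,
    }


# The misspelling from the slide, checked against a toy lexicon (assumption).
lexicon = {"محمد", "مجد", "مد", "حد"}
for step, words in heuristic_candidates("محد", lexicon).items():
    print(step, sorted(words))
```

Grouping the results by step keeps the output comparable with the per-heuristic examples in the slides.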

Page 11: Arabic spell checking approaches

Towards Automatic Spell Checking for Arabic (Cont.)

Candidates for the misspelled word ‘معض’ under each heuristic step:

• Add missing character: e.g. ‘معضد’, ‘معرض’

• Replace incorrect character: e.g. ‘نعض’, ‘معد’, ‘كعض’

• Remove excessive character: e.g. ‘عض’, ‘مع’

• Add a space to split words: e.g. ‘عض’, ‘مع’

Page 12: Arabic spell checking approaches

Towards Automatic Spell Checking for Arabic (Cont.)

Neighbors table

Page 13: Arabic spell checking approaches

3. Towards Arabic Spell-Checker Based on N-Grams Scores

• Develops a simple and flexible spell-checker for the Arabic language (it detects errors).

• Based on N-gram scores.

• Uses a matrix approach.

• The corpus is adapted from Muaidi's PhD thesis.

• It consists of 101,987 word types.

Page 14: Arabic spell checking approaches

Towards Arabic Spell-Checker Based on N-Grams Scores (Cont.)

Processing flow:

• Enter the text to be tested

• Tokenizing process

• Cleaning process

• Matrix method deals with each word

Page 15: Arabic spell checking approaches

Matrix Method Deals With Each Word

• Building the matrices

  - Number of matrices = length of the longest word in the corpus - 1.

  - Each matrix has dimension 28×28 (one row and one column per Arabic letter).

  - M1 holds the combinations of the first and second letters of a word, M2 the combinations of the second and third letters, and so on.

  - All matrices are initialized to zeros.

Page 16: Arabic spell checking approaches

Matrix Method Deals With Each Word

• 2-Gram set (S)

  - Each item in S consists of two letters.

  - Each item is assigned the value 1 or 2 in the corresponding matrix.

  - It is assigned 2 if the word ends with these two letters.

  - It is assigned 1 if the two letters are connected and the word has not ended yet.

  - e.g. for the word [ناجح], the 2-Gram set is S = {نا, اج, جح}, so M1[ن][ا] = 1, M2[ا][ج] = 1, M3[ج][ح] = 2.
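The matrix construction described above can be sketched in Python as below. This is not the authors' code: the letter ordering in `ALPHABET`, the function names, and the toy word list are assumptions. Two choices are also mine rather than the slides'. When a letter pair both ends one word and continues another at the same position, the larger value 2 is kept. A word is accepted only if every pair is non-zero and its final pair is marked 2.

```python
ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # 28 letters (assumed ordering)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
SIZE = len(ALPHABET)


def build_matrices(words):
    """Build one 28x28 matrix per letter position; words must use ALPHABET only."""
    longest = max(len(w) for w in words)
    # Number of matrices = length of the longest word - 1, all zeros at first.
    matrices = [[[0] * SIZE for _ in range(SIZE)] for _ in range(longest - 1)]
    for word in words:
        for k in range(len(word) - 1):
            row, col = INDEX[word[k]], INDEX[word[k + 1]]
            value = 2 if k == len(word) - 2 else 1  # 2 marks a word-final pair
            # Assumption: keep the larger value if the cell was already set.
            matrices[k][row][col] = max(matrices[k][row][col], value)
    return matrices


def is_accepted(word, matrices):
    """Assumed detection rule: all pairs seen, and the last pair seen as final."""
    if len(word) < 2 or len(word) - 1 > len(matrices):
        return False
    for k in range(len(word) - 1):
        a, b = word[k], word[k + 1]
        if a not in INDEX or b not in INDEX:
            return False
        needed = 2 if k == len(word) - 2 else 1
        if matrices[k][INDEX[a]][INDEX[b]] < needed:
            return False
    return True


matrices = build_matrices(["ناجح", "نار"])
print(matrices[0][INDEX["ن"]][INDEX["ا"]])  # M1[ن][ا]
print(matrices[2][INDEX["ج"]][INDEX["ح"]])  # M3[ج][ح]
print(is_accepted("ناجح", matrices), is_accepted("ناجر", matrices))
```

Real text would first need the cleaning step, since letters outside the 28-letter set (for example أ or ة) are not handled by this sketch.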

Page 17: Arabic spell checking approaches

Matrix Method Deals With Each Word (Cont.)

Processing flow:

• Enter the text to be tested

• Tokenizing process

• Cleaning process

• Matrix method deals with each word

Page 18: Arabic spell checking approaches

Result

• The training dataset consists of 71,390 Arabic words (70%), while the testing dataset consists of 30,597 Arabic words (30%).

The Overall Evaluation of the Results

• Increasing the size of the dataset increases the accuracy.

Page 19: Arabic spell checking approaches

4. Arabic Word Generation and Modeling for Spell Checking

• Bridges the critical gap in available open-source spell checking resources for Arabic.

• Creates an open-source, large-coverage word list for Arabic (9,000,000 words).

• Error detection:

  - Direct method: match the words of an open text against a list of correct words.

  - Language modeling method: build a character-based tri-gram language model using SRILM in order to classify generated words as valid or invalid.
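To illustrate the language modeling method, here is a minimal character tri-gram scorer written in plain Python. It does not use SRILM: it substitutes simple add-one smoothing, and the class name `CharTrigramModel`, the padding symbols, the training words, and any threshold passed to `is_valid` are all assumptions for illustration.

```python
import math
from collections import Counter


def char_trigrams(word):
    """Pad the word and list its overlapping character trigrams."""
    padded = "^^" + word + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class CharTrigramModel:
    def __init__(self, words):
        self.trigrams = Counter()
        self.contexts = Counter()  # counts of the two-character histories
        self.vocab = {"$"}         # characters that can be predicted
        for word in words:
            self.vocab.update(word)
            for gram in char_trigrams(word):
                self.trigrams[gram] += 1
                self.contexts[gram[:2]] += 1

    def avg_logprob(self, word):
        """Average add-one-smoothed log probability per trigram."""
        grams = char_trigrams(word)
        v = len(self.vocab)
        total = sum(
            math.log((self.trigrams[g] + 1) / (self.contexts[g[:2]] + v))
            for g in grams
        )
        return total / len(grams)

    def is_valid(self, word, threshold):
        # `threshold` is a hypothetical tuning parameter.
        return self.avg_logprob(word) >= threshold


model = CharTrigramModel(["كتاب", "كاتب", "مكتب", "كتب"])
print(model.avg_logprob("كتاب"), model.avg_logprob("ظظظظ"))
```

A word built from familiar letter sequences scores higher than one built from unseen sequences, which is the property a valid/invalid classifier would rely on.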

Page 20: Arabic spell checking approaches

Flow chart of spelling error correction.

[Figure: flow chart. Recoverable elements: Input; Finite-State Transducer; Arabic word list; decision "Error?" (Yes / No); Suggestion list; Candidates ranker (augmented edit distance and language-specific rules); Noisy channel model; Gigaword corpus; Candidates list score; Post-processing; Display suggestions.]
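The slides name the noisy channel model without giving its formula. In its standard textbook form, which may differ in detail from what the authors implemented, the ranker picks the candidate $c$ that maximizes the product of a language model probability and an error model probability for the observed misspelling $w$. Here $C(w)$ denotes the candidate set, $P(c)$ would be estimated from a corpus such as Gigaword, and $P(w \mid c)$ models how likely $c$ is to be mistyped as $w$:

```latex
\hat{c} = \arg\max_{c \in C(w)} P(c) \, P(w \mid c)
```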

Page 21: Arabic spell checking approaches

Result

• Best accuracy score = 75%

• Evaluation on:

  - Microsoft Word 2010 = 80.54%

  - Hunspell using Ayaspell = 45.64%

Page 22: Arabic spell checking approaches

5. Improved Spelling Error Detection and Correction for Arabic

Spelling error detection and correction components:

• Dictionary (or reference word list)

• Error model

• Language model

Page 23: Arabic spell checking approaches

Improving the Dictionary

• AraComLex Extended word list

• Matching its word list against the Gigaword corpus

• Double-checked by the Buckwalter Arabic Morphological Analyzer

• Creating a dictionary of 9.3 million Arabic words

Page 24: Arabic spell checking approaches

Improving the Error Model: Candidate Generation

• A finite-state transducer proposes candidate corrections.

• Candidates that are not found in the word list are discarded.

• The remaining candidates are ranked.

Spelling correction: N-gram language models are used, and the candidate with the lowest perplexity score is selected as the correction.
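The filter-then-rank step can be sketched as follows. The slide does not state the N-gram order or the toolkit, so this sketch assumes an add-one-smoothed word bigram model trained on a toy corpus. The function names, the candidate list, and the word list are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized text."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence, model):
    """Perplexity of a sentence under the add-one-smoothed bigram model."""
    contexts, bigrams, v = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_sum = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        log_sum += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + v))
    return math.exp(-log_sum / (len(tokens) - 1))


def best_correction(left, candidates, right, word_list, model):
    """Drop candidates missing from the word list, then pick the lowest perplexity."""
    kept = [c for c in candidates if c in word_list]
    return min(
        kept,
        key=lambda c: perplexity(" ".join(left + [c] + right), model),
        default=None,
    )


model = train_bigram([
    "ذهب الولد الى المدرسة",
    "ذهب الرجل الى البيت",
    "قرأ الولد الكتاب",
])
word_list = {"الولد", "الوله"}
print(best_correction(["ذهب"], ["الولد", "الوله", "الولك"],
                      ["الى", "المدرسة"], word_list, model))
```

Scoring the whole sentence rather than the isolated candidate lets the surrounding context decide between candidates that are all valid words.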

Page 25: Arabic spell checking approaches

Improving the Language Model: Analyzing the Training Data

• Analyze the level of noise in different sources of data.

• Agence France-Presse (AFP) data is the noisiest, while Al-Jazeera data is the cleanest.

• Select the optimal subset to train the system on.

Page 26: Arabic spell checking approaches

Result

• AFP = 73.93%

• Al-Jazeera data = 80.97%

• Gigaword corpus = 82.86%

• Clean data is better than noisy data when they are comparable in size; however, more data is better than clean data.

• Evaluation on:

  - Google Docs = 9.32%

  - Ayaspell for OpenOffice = 41.86%

  - Microsoft Word 2010 = 57.15%

Page 27: Arabic spell checking approaches

Conclusion

Having reviewed these approaches, we see that the results are promising and represent a good starting point for future research on improving Arabic spell checkers.

Page 28: Arabic spell checking approaches


THANKS