
Arabic spell checking approaches

1. Arabic Spell Checking Approaches
  By: Banan AlHadlaq, Dalal AlZeer, Monirah AlOrf
  Supervised by: Dr. Amal Al-Saif
  Natural Language Processing - CS465

2. Outline
  • Introduction
  • Common Arabic Spell Errors
  • Towards Automatic Spell Checking for Arabic
  • Towards Arabic Spell-Checker Based on N-Grams Scores
  • Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
  • Arabic Word Generation and Modeling for Spell Checking
  • Improved Spelling Error Detection and Correction for Arabic
  • Conclusion

3. Introduction
  • Arabic language
  • NLP applications
  • Approaches for solving the Arabic spell checking problem

4. Common Arabic Spell Errors
  • Reading errors
  • Hearing errors
  • Touch-typing errors
  • Morphological errors
  • Editing errors

5. Approach 1: Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions
  • A stochastic approach to correcting misspellings in Arabic text.
  • A context-based, two-layer system that automatically corrects misspelled words in large datasets.

6. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (cont.)
  • Components: error detection, candidates generation component, best candidate selection component.
  • Error types: single-word errors, space deletion errors, space insertion errors.

7. Automatic Stochastic Arabic Spelling Correction With Emphasis on Space Insertions and Deletions (cont.)
  • Candidates generation component: space deletion errors.

8. Result
  • A standard Arabic text corpus (TRN_DB_I).
  • An extra standard Arabic text corpus (TRN_DB_II).
  • The test data (TST_DB).
  • The testing results show that performance improves as the training set grows, reaching an F1 score of 97.9% for detection and 92.3% for correction.

9. Approach 2: Towards Automatic Spell Checking for Arabic
  • Develops an Arabic spelling checker program.
  • Written in the SICStus Prolog language.
  • Recognizes common Arabic spelling errors and offers suggestions for error correction.
  • Able to recognize common spelling errors in Standard Arabic and Egyptian dialects.
  • Can be integrated with other text processing software, such as word processors.
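As an illustration of the space deletion errors handled by the first approach (slides 6 and 7): when a space is dropped, two words fuse into one token, and one plausible way to generate candidates is to try every split point and keep the splits whose two halves are both known words. The slides do not describe the actual candidate generator, so the function name `split_candidates`, the toy vocabulary, and the exhaustive-split strategy below are assumptions made for this sketch.

```python
def split_candidates(token, vocabulary):
    """Propose two-word splits of a token that may hide a deleted space.

    Hypothetical helper: tries every split point and keeps the splits
    whose left and right parts are both in the vocabulary.
    """
    candidates = []
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in vocabulary and right in vocabulary:
            candidates.append((left, right))
    return candidates


# Toy vocabulary (assumption): "في" (in), "البيت" (the house), "بيت" (house).
vocabulary = {"في", "البيت", "بيت"}

# "فيالبيت" is "في البيت" with the space deleted.
print(split_candidates("فيالبيت", vocabulary))
```

In the approach described on the slides, a separate best candidate selection component would then choose among such candidates; that step is not shown here.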
10. Towards Automatic Spell Checking for Arabic (cont.)
  • Analyzes the common spelling errors, which are then used to detect misspelled Arabic words.
  • Detection is limited to isolated-word (non-word) errors, e.g. […] for […].
  • Performs a series of heuristic steps to find a replacement candidate:
      • Add a missing character
      • Replace an incorrect character
      • Remove an excessive character
      • Add a space to split words

11. Towards Automatic Spell Checking for Arabic (cont.)
  • Add a missing character: e.g. the candidates of the misspelled word […] are: […], […]
  • Replace an incorrect character: e.g. the candidates of the misspelled word […] are: […], […], […]
  • Remove an excessive character: e.g. the candidates of the misspelled word […] are: […], […]
  • Add a space to split words: e.g. the candidates of the misspelled word […] are: […], […]

12. Towards Automatic Spell Checking for Arabic (cont.)
  • Neighbors table

13. Approach 3: Towards Arabic Spell-Checker Based on N-Grams Scores
  • Develops a simple and flexible spell checker for Arabic (error detection).
  • Based on N-gram scores.
  • Uses a matrix approach.
  • The corpus is adapted from Muaidi's PhD thesis and consists of 101,987 word types.

14. Towards Arabic Spell-Checker Based on N-Grams Scores (cont.)
  • Processing flow: enter the text to be tested → tokenizing process → cleaning process → matrix method deals with each word.

15. Matrix Method Deals With Each Word
  • Building the matrices:
      • Number of matrices = (length of the longest word in the corpus) - 1.
      • The dimension of each matrix is 28 × 28 (one row and one column per Arabic letter).
      • M1 holds the combinations of the first and second letters of a word, M2 the combinations of the second and third letters, and so on.
      • All matrices are initialized with zeros.

16. Matrix Method Deals With Each Word
  • 2-gram set (S): each item in S consists of two letters.
  • Each item is assigned the value 1 or 2 in the corresponding matrix:
      • 2 if the word ends with these two letters;
      • 1 if there is a connection and the word is not over yet.
  • e.g. for a four-letter word […], the 2-gram set S = {[…]} has three items, giving M1[…][…] = 1, M2[…][…] = 1, M3[…][…] = 2.
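The matrix method described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `build_matrices` and `is_accepted`, the toy corpus, and the 28-letter alphabet string are assumptions, and the text is assumed to be already cleaned down to those base letters. The slides do not say what happens when the same letter pair both ends one word and continues another at the same position; this sketch keeps the larger value (2) and lets non-final positions accept any nonzero entry.

```python
ALPHABET = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"  # 28 base letters (assumption)
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}


def build_matrices(words):
    """Build one 28 x 28 matrix per letter position (longest word - 1)."""
    words = [w for w in words if len(w) >= 2 and all(c in INDEX for c in w)]
    longest = max(len(w) for w in words)
    n = len(ALPHABET)
    # All matrices start out filled with zeros.
    matrices = [[[0] * n for _ in range(n)] for _ in range(longest - 1)]
    for w in words:
        for k in range(len(w) - 1):
            row, col = INDEX[w[k]], INDEX[w[k + 1]]
            value = 2 if k == len(w) - 2 else 1  # 2 marks a word-final pair
            # Assumption: on conflict, keep the larger value.
            matrices[k][row][col] = max(matrices[k][row][col], value)
    return matrices


def is_accepted(word, matrices):
    """Accept a word if every 2-gram is known and the last one is word-final."""
    if len(word) < 2 or len(word) - 1 > len(matrices):
        return False
    if any(c not in INDEX for c in word):
        return False
    for k in range(len(word) - 1):
        value = matrices[k][INDEX[word[k]]][INDEX[word[k + 1]]]
        if value == 0:
            return False
        if k == len(word) - 2 and value != 2:
            return False
    return True


# Toy corpus (assumption): three common words.
matrices = build_matrices(["كتب", "كتاب", "باب"])
print(len(matrices))                 # 3 matrices: longest word has 4 letters
print(is_accepted("كتاب", matrices))  # True: every pair known, last pair is final
print(is_accepted("كتا", matrices))   # False: last pair never ends a word
```

Because each matrix only records adjacent letter pairs per position, a word assembled from pairs seen in different corpus words can still be accepted; the method detects errors but does not propose corrections.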
17. Matrix Method Deals With Each Word (cont.)
  • Processing flow: enter the text to be tested → tokenizing process → cleaning process → matrix method deals with each word.

18. Result
  • The training dataset consists of 71,390 Arabic words (70%), while the testing dataset consists of 30,597 Arabic words (30%).
  • Overall evaluation of the results: increasing the size of the dataset increases the accuracy.

19. Approach 4: Arabic Word Generation and Modeling for Spell Checking
  • Bridges the critical gap in available open-source spell checking resources for Arabic.
  • Creates an open-source, large-coverage word list for Arabic (9,000,000 words).
  • Error detection:
      • Direct method: match words in an open text against a list of correct words.
      • Language modeling method: build a character-based tri-gram language model using SRILM in order to classify generated words as valid or invalid.

20. Flow chart of spelling error correction
  • Elements shown: input; error? (yes/no); finite-state transducer; Arabic word list; candidates list; candidates ranker (augmented edit distance and language-specific rules); noisy channel model; Gigaword corpus; score; suggestion list; post-processing; display suggestions.

21. Result
  • Best accuracy score = 75%
  • Evaluation on:
      • Microsoft Word 2010 = 80.54%
      • Hunspell using Ayaspell = 45.64%

22. Approach 5: Improved Spelling Error Detection and Correction for Arabic
  • Spelling error detection and correction components: dictionary (or reference word list), error model, language model.

23. Improving the Dictionary
  • AraComLex Extended word list.
  • Its word list is matched against the Gigaword corpus.
  • Double-checked by the Buckwalter Arabic Morphological Analyzer.
  • Creates a dictionary of 9.3 million Arabic words.

24. Improving the Error Model: Candidate Generation
  • A finite-state transducer proposes candidate corrections.
  • Candidates that are not found in the word list are discarded.
  • The remaining candidates are ranked.
  • Spelling correction: N-gram language models. The candidate with the lowest perplexity score is selected as the best correction.

25. Improving the Language Model: Analyzing the Training Data
  • Analyze the level of noise in different sources of data.
  • Agence France-Presse (AFP) is the noisiest, while Al-Jazeera data is the cleanest.
  • Select the optimal subset to train the system on.

26. Result
  • AFP = 73.93%
  • Al-Jazeera data = 80.97%
  • Gigaword corpus = 82.86%
  • Clean data is better than noisy data when the two are comparable in size; however, more data is better than clean data.
  • Evaluation on:
      • Google Docs = 9.32%
      • Ayaspell for OpenOffice = 41.86%
      • Microsoft Word 2010 = 57.15%

27. Conclusion
  • The results of the approaches presented are promising and represent a good starting point for future research on improving Arabic spell checking.

28. Thanks
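The candidate selection step of the last approach (slide 24: discard candidates missing from the word list, then pick the candidate with the lowest perplexity) can be sketched as follows. This is a toy illustration under stated assumptions: the add-one-smoothed word bigram model stands in for the N-gram language models mentioned on the slide, the candidate list is passed in by the caller instead of coming from a finite-state transducer, the English sentences are placeholder data, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams for an add-one-smoothed bigram model."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence, model):
    """Perplexity of a sentence under the smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / (len(tokens) - 1))


def best_correction(words, position, candidates, word_list, model):
    """Drop candidates outside the word list, then pick the lowest perplexity."""
    valid = [c for c in candidates if c in word_list]
    if not valid:
        return None

    def score(candidate):
        trial = words[:position] + [candidate] + words[position + 1:]
        return perplexity(" ".join(trial), model)

    return min(valid, key=score)


# Placeholder training data and word list (assumptions).
model = train_bigram_model([
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the mat",
])
word_list = {"the", "cat", "cot", "sat", "on", "mat"}
words = "the cta sat on the mat".split()

# "cut" is discarded (not in the word list); "cat" fits the context best.
print(best_correction(words, 1, ["cut", "cot", "cat"], word_list, model))
```

Scoring the whole sentence with each candidate substituted lets the surrounding words decide between candidates that are equally close to the misspelling.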
