Language Processing Applications: spelling checkers
Nepali spell checker
Bal Krishna Bal
Project Manager
PAN Localization Project
Madan Puraskar Pustakalaya, Nepal
URL : www.madanpuraskar.org
Email: [email protected]
Regional Conference on Localized ICT Development and Dissemination across Asia. PAN Localization Project. 12th-16th January, 2009, Novotel Hotel, Vientiane, Laos
Contents
• Background
• Hunspell and the OpenOffice.org suite
• Main features of Hunspell
• Hunspell file prerequisites
• Fitting the Nepali inflections and derivations into the Hunspell format
• Nepali Spell checker coverage and robustness
• Conclusion
Background
• Nepali spell checkers do not have a long history.
• The first spell checkers for both the MS Office package and the OpenOffice.org suite were released for public use in 2005.
• The spell checking facility for Nepali has been taken up with great interest.
• Major beneficiaries: publication houses, writers, journalists, etc.
Hunspell and the OpenOffice.org suite
• Hunspell is the default spell checker from OpenOffice.org 2.0 onwards.
• Hunspell is a spell checker and morphological analyzer library, initially developed for the Hungarian language.
• Hunspell can be extended to other languages that have Unicode support.
• Hence, the Nepali spell checker is a customized version of Hunspell.
Main features of Hunspell
• Extended support for language peculiarities; Unicode character encoding, compounding and complex morphology.
• Improved suggestion using n-gram similarity, rule and dictionary based pronunciation data.
• Morphological analysis, stemming and generation.
• Hunspell is based on MySpell and also works with MySpell dictionaries.
• C++ library under GPL/LGPL/MPL tri-license.
• Interfaces and ports: Enchant (generic spelling library from the Abiword project), OpenXSpell (Mac OS X Enchant port), Delphi, Java (JNA, JNI), Perl, Python, Ruby, UNO.
Source: http://hunspell.sourceforge.net/
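The n-gram similarity idea behind the suggestion feature can be illustrated with a small sketch. This is not Hunspell's actual scoring, which is more elaborate; it swaps in a plain Dice coefficient over character bigrams. The function names and the English word list are hypothetical placeholders chosen for illustration.

```python
def ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with boundary marks."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=2):
    """Dice coefficient over n-gram sets: 1.0 for identical words, 0.0 for no overlap."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def suggest(word, dictionary, limit=3):
    """Rank dictionary words by similarity to the misspelled word (ties broken alphabetically)."""
    ranked = sorted(dictionary, key=lambda entry: (-similarity(word, entry), entry))
    return ranked[:limit]


# Hypothetical word list; a real checker would use the expanded dictionary.
WORDS = ["spelling", "spending", "apple"]
print(suggest("speling", WORDS))
```

A ranking like this only orders candidates by surface similarity; the rule-based and pronunciation-based data mentioned above would refine it further.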
Hunspell file prerequisites
• Hunspell uses separate language files for each language and territory (locale).
• It requires two files to define the language being spell checked.
• The first file is a dictionary containing the words for the language and the second is an “affix” file.
• The “affix” file defines the meaning of special flags in the dictionary.
• These files are located together in one folder ~openofficefolder/share/dic/ooo/
• The spell checking is done using the .aff file for the language together with the .dic file.
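How the two files work together can be sketched in a few lines of Python. The affix and dictionary contents below are hypothetical English placeholders, not the Nepali data, and the parser handles only simple suffix (SFX) rules with one-character flags. The real Hunspell format has many more options, and the Nepali dictionary uses multi-character flags such as r1.

```python
import re
from typing import NamedTuple


class SuffixRule(NamedTuple):
    strip: str      # characters removed from the end of the stem ("" for none)
    add: str        # characters appended after stripping
    condition: str  # pattern the end of the stem must match


# Hypothetical affix file: a header line, then one line per sub-rule.
AFF_TEXT = """\
SET UTF-8
SFX A Y 2
SFX A 0 s [^sy]
SFX A y ies y
"""

# Hypothetical dictionary file: the first line is the entry count.
DIC_TEXT = """\
2
cat/A
city/A
"""


def parse_aff(text):
    """Collect suffix rules per flag; header lines (4 fields) are skipped."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[0] == "SFX":
            flag, strip, add, condition = parts[1:5]
            rule = SuffixRule("" if strip == "0" else strip,
                              "" if add == "0" else add,
                              condition)
            rules.setdefault(flag, []).append(rule)
    return rules


def parse_dic(text):
    """Return (stem, flags) pairs, skipping the count on the first line."""
    entries = []
    for line in text.splitlines()[1:]:
        if line.strip():
            stem, _, flags = line.strip().partition("/")
            entries.append((stem, flags))
    return entries


def expand(stem, flags, rules):
    """Generate the stem plus every suffixed form its flags allow."""
    forms = [stem]
    for flag in flags:  # assumption: one-character flags
        for rule in rules.get(flag, []):
            if not re.search(f"(?:{rule.condition})$", stem):
                continue
            if not stem.endswith(rule.strip):
                continue
            base = stem[:len(stem) - len(rule.strip)]
            forms.append(base + rule.add)
    return forms


RULES = parse_aff(AFF_TEXT)
LEXICON = {form
           for stem, flags in parse_dic(DIC_TEXT)
           for form in expand(stem, flags, RULES)}


def is_correct(word):
    """A word passes the check if it is a stem or a form generated by the rules."""
    return word in LEXICON
```

Hunspell itself does not pre-generate the full word list in this way; it strips affixes from the input word and looks up the remaining stem. The sketch is only meant to show what the flags in the dictionary file refer to.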
Fitting the Nepali inflections and derivations into the Hunspell format
A sample entry of the dict file:
[Nepali verb root]/r1
A sample of the affix file: [sample not legible in this copy]
Rule r1 contains 4 sub-rules that generate four inflected forms of the verb root. In the second level, rule r2 is applied to each of the inflected forms. Together, rules r1 and r2 yield as many as 320 inflections from a single verb entry in the dictionary file.
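The two-level application of rules can be sketched as follows. The suffix strings and the stem are hypothetical Latin placeholders, since the actual Nepali sub-rules are not reproduced here. The number of sub-rules in r2 is also not stated in the source, so the sketch uses three for brevity.

```python
def apply_rule(forms, suffixes):
    """Attach every suffix of a rule to every form, returning all combinations."""
    return [form + suffix for form in forms for suffix in suffixes]


# Hypothetical stand-ins: r1 has 4 sub-rules as described; r2's size is an assumption.
R1_SUFFIXES = ["-a", "-b", "-c", "-d"]
R2_SUFFIXES = ["-x", "-y", "-z"]


def two_level_forms(stem):
    """Apply r1 to the stem, then r2 to each first-level form."""
    level_one = apply_rule([stem], R1_SUFFIXES)
    level_two = apply_rule(level_one, R2_SUFFIXES)
    return level_one, level_two


level_one, level_two = two_level_forms("STEM")
print(len(level_one), len(level_two))
```

The second level multiplies the number of forms, which is how a single dictionary entry can account for hundreds of inflections.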
Nepali spellchecker, coverage and robustness
• A total of 37,000 head words in the dictionary file.
• 1,800 affix rules in the affix file.
• Word coverage of around 6.2 million Nepali words
• Random tests of the spell checker exhibit:
– 90% accuracy (43 words unhandled out of 450 words)
– 94% accuracy (25 words unhandled out of 400 words)
– 89% accuracy (100 words unhandled out of 923 words)
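The accuracy figures follow from the unhandled-word counts. A minimal sketch, assuming accuracy is defined as the share of test words that were handled correctly (the function name is a placeholder):

```python
def accuracy(unhandled, total):
    """Percentage of test words handled, given the number left unhandled."""
    return 100.0 * (total - unhandled) / total


# (unhandled words, total words) for the three random tests reported above
TEST_RUNS = [(43, 450), (25, 400), (100, 923)]

for unhandled, total in TEST_RUNS:
    print(f"{unhandled} of {total} unhandled: {accuracy(unhandled, total):.2f}% accuracy")
```

Under that definition the three runs come to roughly 90.4%, 93.75% and 89.2%, which matches the rounded figures in the list.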
Nepali spell checking in OpenOffice.org Writer
Conclusion
• With the substantial enhancements made, the current version of the Nepali spell checker is believed to have attained the industrial strength and robustness required by its target audience, i.e., publication houses, writers and journalists.
• Further testing and additional enhancements will be made to the spell checker in the days to come.
Acknowledgment
This work was carried out with the aid of a grant from the Language Resource Association (GSK) of Japan and the International Development Research Centre (IDRC), Ottawa, Canada, administered through the Centre for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences (NUCES), Pakistan.
Thank You!!