Transcript
Page 1: Language Processing Applications: spelling checkers Nepali ... · Applications: spelling checkers Nepali spell checker ... rule and dictionary based ... dictionary file. • 1,800

Language Processing Applications: spelling checkers

Nepali spell checker

Bal Krishna Bal

Project Manager

PAN Localization Project

Madan Puraskar Pustakalaya, Nepal

URL : www.madanpuraskar.org

Email: [email protected]

Regional Conference on Localized ICT Development and Dissemination across Asia. PAN Localization Project. 12th-16th January, 2009, Novotel Hotel, Vientiane, Laos


Contents

• Background

• Hunspell and the OpenOffice.org suite

• Main features of Hunspell

• Hunspell file prerequisites

• Fitting the Nepali inflections and derivations into the Hunspell format

• Nepali Spell checker coverage and robustness

• Conclusion


Background

• Nepali spell checkers do not have a long history.

• The first spell checkers, for both the MS Office package and the OpenOffice.org suite, were released for public use in 2005.

• Spell checking facility for Nepali – taken up with great interest.

• Major beneficiaries – publication houses, writers, journalists etc.


Hunspell and OpenOffice.org suite

• Hunspell is the default spell checker from OpenOffice.org 2.0 onwards.

• Hunspell is a spell checker and morphological analyzer library, initially developed for the Hungarian language.

• Hunspell can be extended to other languages having Unicode support.

• Hence, the Nepali spell checker is a customized version of Hunspell.


Main features of Hunspell

• Extended support for language peculiarities: Unicode character encoding, compounding and complex morphology.

• Improved suggestion using n-gram similarity, rule and dictionary based pronunciation data.

• Morphological analysis, stemming and generation.

• Hunspell is based on MySpell and also works with MySpell dictionaries.

• C++ library under GPL/LGPL/MPL tri-license.

• Interfaces and ports: Enchant (generic spelling library from the Abiword project), OpenXSpell (Mac OS X Enchant port), Delphi, Java (JNA, JNI), Perl, Python, Ruby, UNO.

Source: http://hunspell.sourceforge.net/
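The n-gram similarity mentioned in the feature list can be illustrated with a minimal sketch. This is not Hunspell's actual suggestion algorithm: the function names, the bigram size, the Dice coefficient and the English word list are all assumptions chosen for illustration.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '#' at both ends."""
    padded = "#" + word + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Dice coefficient over the two words' character n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def suggest(misspelled, dictionary, limit=3):
    """Rank dictionary words by n-gram similarity to the misspelled word."""
    ranked = sorted(dictionary,
                    key=lambda w: ngram_similarity(misspelled, w),
                    reverse=True)
    return ranked[:limit]


# Hypothetical word list (English stand-ins, not Nepali dictionary data).
words = ["checker", "checked", "chapter", "spelling"]
print(suggest("chekcer", words, limit=1))
```

Words that share more character pairs with the misspelling rank higher, so a transposition such as "chekcer" still lands closest to "checker".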


Hunspell file prerequisites

• Hunspell works with language files that are specific to each language and territory.

• It requires two files in order to define the language that it is spell checking.

• The first file is a dictionary containing the words for the language and the second is an “affix” file.

• The “affix” file defines the meaning of special flags in the dictionary.

• These files are located together in one folder ~openofficefolder/share/dic/ooo/

• The spell checking is done using the .aff file for the language together with the .dic file.
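How the two files work together can be sketched as below. The sample data follows the general Hunspell layout (a word count on the first line of the .dic file, `word/FLAGS` entries, and `SFX` header and rule lines in the .aff file), but the English words, the flag `S` and the helper names are hypothetical, and the parser handles only simple suffix rules. It is an illustration under those assumptions, not Hunspell's implementation.

```python
import re

# Hypothetical sample data (English stand-ins, not the Nepali files).
AFF_TEXT = """\
SET UTF-8
SFX S Y 2
SFX S 0 s [^y]
SFX S y ies y
"""

DIC_TEXT = """\
2
book/S
city/S
"""


def parse_suffix_rules(aff_text):
    """Collect SFX rule lines as {flag: [(strip, add, condition), ...]}."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have 5 fields; the 4-field header line is skipped.
        if len(parts) >= 5 and parts[0] == "SFX":
            _, flag, strip, add, condition = parts[:5]
            strip = "" if strip == "0" else strip  # "0" means nothing to strip
            add = "" if add == "0" else add        # "0" means nothing to add
            rules.setdefault(flag, []).append((strip, add, condition))
    return rules


def expand(dic_text, rules):
    """Return every surface form licensed by the dictionary and its flags."""
    forms = set()
    for entry in dic_text.splitlines()[1:]:  # first line is the entry count
        stem, _, flags = entry.partition("/")
        forms.add(stem)
        for flag in flags:  # assumes single-character flags
            for strip, add, condition in rules.get(flag, []):
                matches = re.search("(?:" + condition + ")$", stem)
                if matches and stem.endswith(strip):
                    forms.add(stem[:len(stem) - len(strip)] + add)
    return forms


print(sorted(expand(DIC_TEXT, parse_suffix_rules(AFF_TEXT))))
```

The dictionary stays small because each entry only names the flags it accepts; the affix file carries the rules that turn those flags into word forms.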


Fitting the Nepali inflections and derivations into the Hunspell format

A sample entry of the dict file:

[Nepali verb]/r1

A sample of the affix file: [sample lost in extraction]

Rule r1 contains 4 sub-rules that generate four inflected forms of the verb. In the second level, rule r2 is applied to each of the inflected forms. The rules r1 and r2 yield as many as 320 inflections from a single verb in the dictionary file.
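The two-level expansion described above can be sketched as follows. The stem and suffix strings are hypothetical Latin placeholders, and the size of r2 is chosen arbitrarily; only the fact that r1 has four sub-rules comes from the slide. The sketch shows why the number of generated forms grows multiplicatively when a second rule is applied to the output of the first.

```python
def apply_rule(stems, suffixes):
    """Attach every suffix of one rule to every stem."""
    return [stem + suffix for stem in stems for suffix in suffixes]


# Hypothetical placeholder data, not the real Nepali rules.
R1 = ["-a", "-b", "-c", "-d"]  # four sub-rules, as stated for rule r1
R2 = ["+x", "+y", "+z"]        # size chosen arbitrarily for illustration

level_one = apply_rule(["STEM"], R1)   # first level: r1 applied to the head word
level_two = apply_rule(level_one, R2)  # second level: r2 applied to each r1 form

print(len(level_one), len(level_two))
```

With distinct suffixes, the second level produces len(R1) × len(R2) forms, which is how two rules attached to a single dictionary entry can account for hundreds of inflections.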


Nepali spellchecker, coverage and robustness

• A total of 37,000 head words in the dictionary file.

• 1,800 affix rules in the affix file.

• Word coverage of around 6.2 million Nepali words

• Random tests of the spell checker exhibit:

– ~90% accuracy (43 words unhandled out of 450 words)

– ~94% accuracy (25 words unhandled out of 400 words)

– ~89% accuracy (100 words unhandled out of 923 words)
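The accuracy figures follow from the share of handled words in each random test. A short sketch of the arithmetic, using the counts above; the helper name is an assumption:

```python
def accuracy(unhandled, total):
    """Percentage of test words the spell checker handled."""
    return 100 * (total - unhandled) / total


# (unhandled words, total words) for the three random tests on the slide.
TESTS = [(43, 450), (25, 400), (100, 923)]

for unhandled, total in TESTS:
    handled = total - unhandled
    print(f"{handled}/{total} handled = {accuracy(unhandled, total):.1f}%")
```

Rounded to whole numbers, the three results match the approximate 90%, 94% and 89% reported.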


Nepali spell checking in OpenOffice.org Writer


Conclusion

• The current version of the Nepali spell checker, with substantial enhancements made, is believed to have attained the industrial strength or robustness required for the target audience, i.e., publication houses, writers and journalists.

• Further testing and additional enhancements will be made to the spell checker in the days to come.


Acknowledgment

This work was carried out with the aid of a grant from the Language Resource Association (GSK) of Japan and the International Development Research Centre (IDRC), Ottawa, Canada, administered through the Centre for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences (NUCES), Pakistan.


Thank You!!
