  • Language Processing Applications: spelling checkers

    Nepali spell checker

    Bal Krishna Bal

    Project Manager

    PAN Localization Project

    Madan Puraskar Pustakalaya, Nepal

    URL : www.madanpuraskar.org

    Email: [email protected]

    Regional Conference on Localized ICT Development and Dissemination across Asia. PAN Localization Project. 12th-16th January, 2009, Novotel Hotel, Vientiane, Laos

  • Contents

    Background

    Hunspell and the OpenOffice.org suite

    Main features of Hunspell

    Hunspell file prerequisites

    Fitting the Nepali inflections and derivations into the Hunspell format

    Nepali Spell checker coverage and robustness

    Conclusion

  • Background

    Nepali spell checkers do not have a long history.

    The first spell checkers for both the MS Office package and the OpenOffice.org suite were released for public use in 2005.

    The spell checking facility for Nepali was taken up with great interest.

    Major beneficiaries: publication houses, writers, journalists, etc.

  • Hunspell and OpenOffice.org suite

    Hunspell is the default spell checker from OpenOffice.org 2.0 onwards.

    Hunspell is a spell checker and morphological analyzer library, initially developed for the Hungarian language.

    Hunspell can be extended to other languages having Unicode support.

    Hence, the Nepali spell checker is a customized version of Hunspell.

  • Main features of Hunspell

    Extended support for language peculiarities; Unicode character encoding, compounding and complex morphology.

    Improved suggestion using n-gram similarity, rule and dictionary based pronunciation data.

    Morphological analysis, stemming and generation.

    Hunspell is based on MySpell and also works with MySpell dictionaries.

    C++ library under GPL/LGPL/MPL tri-license.

    Interfaces and ports: Enchant (generic spelling library from the Abiword project), OpenXSpell (Mac OS X Enchant port), Delphi, Java (JNA, JNI), Perl, Python, Ruby, UNO.

    Source: http://hunspell.sourceforge.net/
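
    To give a feel for n-gram based suggestion ranking, the sketch below scores candidate words by how many character bigrams they share with a misspelled word. It is an illustration only, not Hunspell's actual algorithm: the function names `ngrams`, `similarity` and `suggest`, the choice of the Dice coefficient, and the English sample words are all assumptions made for this example.

```python
from collections import Counter

def ngrams(word, n=2):
    """Return the multiset of character n-grams in a word."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))

def similarity(a, b, n=2):
    """Dice coefficient over character n-grams, between 0.0 and 1.0."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total

def suggest(word, dictionary, limit=3):
    """Rank dictionary words by n-gram similarity to the given word."""
    ranked = sorted(dictionary, key=lambda w: similarity(word, w), reverse=True)
    return ranked[:limit]

# Hypothetical word list; a real checker would draw candidates from its .dic file.
words = ["checker", "checked", "cheaper", "chicken"]
print(suggest("chekcer", words, limit=2))
```

    Because the functions operate on Python strings, the same sketch would also run on Devanagari text, although it compares individual code points rather than full syllables.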

  • Hunspell file prerequisites

    Hunspell uses a separate set of language files for each language and territory (locale).

    It requires two files in order to define the language that it is spell checking.

    The first file is a dictionary containing the words for the language and the second is an affix file.

    The affix file defines the meaning of special flags in the dictionary.

    These files are located together in one folder ~openofficefolder/share/dic/ooo/

    The spell checking is done using the .aff file for the language together with the .dic file.
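
    The interplay of the two files can be sketched as follows. The sample below uses a tiny, hypothetical English affix and dictionary file (not the Nepali project's actual data) in Hunspell's `SFX flag strip add condition` layout, with the two-character flag `r1` enabled by `FLAG long`. The parser is a simplified assumption that handles only suffix rules and is not a substitute for Hunspell itself.

```python
import re

# Hypothetical .aff content: flag r1 defines two suffix rules.
AFF_TEXT = """\
SET UTF-8
FLAG long
SFX r1 Y 2
SFX r1 0 s [^y]
SFX r1 y ies [^aeiou]y
"""

# Hypothetical .dic content: first line is the approximate entry count.
DIC_TEXT = """\
2
walk/r1
carry/r1
"""

def parse_suffix_rules(aff_text):
    """Collect SFX rules per flag as (strip, add, condition) tuples."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have five fields; the four-field header line is skipped.
        if len(parts) >= 5 and parts[0] == "SFX":
            _, flag, strip, add, cond = parts[:5]
            strip = "" if strip == "0" else strip
            add = "" if add == "0" else add
            rules.setdefault(flag, []).append((strip, add, cond))
    return rules

def expand(dic_text, rules):
    """Return every surface form licensed by the dictionary and its flags."""
    forms = set()
    for entry in dic_text.splitlines()[1:]:
        if not entry.strip():
            continue
        word, _, flags = entry.partition("/")
        forms.add(word)
        # With FLAG long, each flag is exactly two characters.
        for i in range(0, len(flags), 2):
            for strip, add, cond in rules.get(flags[i:i + 2], []):
                if re.search(f"(?:{cond})$", word) and word.endswith(strip):
                    forms.add(word[:len(word) - len(strip)] + add)
    return forms

print(sorted(expand(DIC_TEXT, parse_suffix_rules(AFF_TEXT))))
```

    Keeping the rules in the affix file means the dictionary only needs head words plus flags, which is how a modest word list can cover a much larger set of inflected forms.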

  • Fitting the Nepali inflections and derivations into the Hunspell format

    A sample entry of the dict file: [Nepali verb]/r1

    Rule r1 contains 4 sub-rules that generate four inflected forms of the verb. [The Nepali headword and its four inflected forms are not recoverable from the extracted text.]

    A sample of the affix file: [not recoverable from the extracted text]

    In the second level, rule r2 is applied to each of the inflected forms. Together, rules r1 and r2 yield as many as 320 inflections from a single verb entry in the dictionary file.
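
    The two-level scheme can be pictured with a small sketch. The stem and suffix strings below are Latin placeholders, not the Nepali rules from the slides, and the number of second-level sub-rules is an assumption (the slides do not state how many sub-rules r2 contains). The point is only that applying a second rule to every first-level form multiplies the number of generated forms.

```python
def two_level_forms(stem, first_level, second_level):
    """Apply first-level suffixes to the stem, then second-level suffixes to each result."""
    level1 = [stem + suffix for suffix in first_level]
    level2 = [form + suffix for form in level1 for suffix in second_level]
    return level1, level2

# Hypothetical placeholders standing in for the sub-rules of r1 and r2.
R1_SUFFIXES = ["-a", "-b", "-c", "-d"]   # four sub-rules, as stated for r1
R2_SUFFIXES = ["-x", "-y", "-z"]         # count chosen arbitrarily for illustration

level1, level2 = two_level_forms("STEM", R1_SUFFIXES, R2_SUFFIXES)
print(len(level1), "first-level forms;", len(level2), "second-level forms")
```

    In Hunspell's affix format this kind of chaining is expressed by attaching continuation flags to a suffix rule, so the dictionary entry itself only needs to carry the first flag.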

  • Nepali spellchecker, coverage and robustness

    A total of 37,000 head words in the dictionary file.

    1,800 affix rules in the affix file.

    Word coverage of around 6.2 million Nepali words.

    Random tests of the spell checker exhibit:

    90% accuracy (43 words unhandled out of 450 words)

    94% accuracy (25 words unhandled out of 400 words)

    89% accuracy (100 words unhandled out of 923 words)
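
    The percentages above follow from the unhandled and total word counts. The short sketch below reproduces that arithmetic; the pooled figure across all three tests is an additional calculation made here and is not reported in the slides.

```python
def accuracy(unhandled, total):
    """Percentage of test words that the spell checker handled."""
    return 100.0 * (total - unhandled) / total

# (unhandled words, total words) for the three random tests listed above.
tests = [(43, 450), (25, 400), (100, 923)]

for unhandled, total in tests:
    print(f"{unhandled} of {total} unhandled: {accuracy(unhandled, total):.1f}% accuracy")

# Pooled over all three tests (derived here, not stated in the slides).
pooled_unhandled = sum(u for u, _ in tests)
pooled_total = sum(t for _, t in tests)
print(f"pooled: {accuracy(pooled_unhandled, pooled_total):.1f}% accuracy")
```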

    Nepali spell checking in OpenOffice.org Writer

  • Conclusion

    The current version of the Nepali spell checker, with the substantial enhancements made, is believed to have attained the industrial strength or robustness required for the target audience, i.e., publication houses, writers and journalists.

    Further testing and additional enhancements will be made to the spell checker in the days to come.

  • Acknowledgment

    This work was carried out with the aid of a grant from the Language Resource Association (GSK) of Japan and International Development Research Centre (IDRC), Ottawa, Canada, administered through the Centre for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences (NUCES), Pakistan.

  • Thank You!!
