Where do we stand?
Harold Somers
Centre for Computational Linguistics, UMIST, Manchester, England
Panel session, MT Summit VIII, September 2001
Part I: the other 6,000+ languages
• LE R&D has focussed on a dozen or so languages of major commercial interest
• Many other languages equally “deserving”
• Not just MT but large range of LE resources needed
Which languages?
• “Minority” languages
• NIMLs (non-indigenous minority languages)
– Immigrants
– Refugees
– Asylum seekers
• E.g. languages of the Indian subcontinent, and Africa
• Hardly “minority” languages
Example of Hindi
• 180 million speakers in India
• Spoken as first language in Northern States
• 400-700 million speakers worldwide
• 450,000 speakers in Britain
• Hindi-Urdu - if taken together (!) #2 in world, ahead of English
Translation software - what would you expect?
• Word processing
• Fonts
• Hyphenation
• Spell checker
• Style checker
• Mono-/bilingual on-line dictionary
• Multi-lingual on-line dictionary
• Thesaurus (i.e. synonym dictionary)
• Terminology
• Translation memory
• Computer-aided translation
• MT
Translation software - what is available?
• Word processing
• Fonts
• But not much else
What can we do about it?
• Long term: computational linguistics research on a wider variety of languages
• Short term: make use of existing resources (corpora, MRDs, web pages) and extract linguistic data from them
Part II: India - the forgotten jewel
• Three visits to India earlier this year
– MT workshop in Kanpur
– NLP workshop in Kolkata
– Anglo-Indian summit in Mumbai
• Several major groups working on NLP, including MT
• Government initiatives
India’s problem
• 13 official languages
• Using 6 different writing systems
• Special status of English
• Widespread low levels of literacy
• Introspective focus vis-à-vis interlingual communication
Problems being addressed
• Agreed exchange formats for writing systems
• OCR for writing systems
• Speech recognition (including Indian English)
• Word processing and related packages (dictionaries, spell checkers)
Contd.
• Terminology
• Corpus collection
• MT and CAT tools
– English <> Hindi
– Between Indian languages