Localization and Language Technology Standards Kavi Narayana Murthy University of Hyderabad ELITEX - 2007 New Delhi, 10-11 January 2007

Localization and Language Technology Standards

Kavi Narayana Murthy
University of Hyderabad

ELITEX - 2007
New Delhi, 10-11 January 2007

Outline
  Character Encoding Standards
  Fonts, Glyphs, Mapping Standards
  OS/Browser Support, Drivers
  Transliteration, Romanization
  Translation, Linguistic Resources
  Speech and OCR Technologies
  Enforcement

Goals
  Functionality
    Whatever we can do with English, we must be able to do with our own languages and scripts with equal ease
  Inter-operability, Platform Independence
    All applications must work seamlessly on all hardware and software platforms
  Language and Script Independence
    Multi-lingual, Multi-Script Support

Standards
  Even a poor standard is better than no standard
  Standards save us a lot in the long run
  Commercial forces promoting non-standard, proprietary, secret systems must not be allowed to succeed
  Let us not say “Let the Market Decide”!!!

Character Encoding Standards
  ISCII and Unicode
  ISCII is a BIS Standard, Unicode is not
  Unicode is based on ISCII
  In some sense, Unicode is a step in the backward direction
  Let us understand ISCII first

Language and Script
  Do not confuse one for the other
  Many-to-Many
  Script is neither language nor font
  Script and SuperScript
  Phonetic Basis
    Common SuperScript for all ILs (Indian Languages)
  Script Grammar

Language and Script
  Sanskrit is written in Devanagari, Telugu, Kannada, Bangla etc. scripts
  Devanagari is used for writing Sanskrit, Hindi, Marathi, etc.
  English words are often written (transliterated) in local language scripts

Phonetic Basis
  Words: Meanings, Sounds, Written Symbols
  Meanings are supreme but difficult to quantify and encode
  Sounds are the next best
    A ‘ka’ sound is a ‘ka’ sound, whatever be the language. Hence ‘Universal’
  No need for ‘Spellings’
    What we write is what we speak, directly

Orthography
  Written symbols correspond with phonemes (basic sound units)
  Minor variations in sounds (allophones, co-articulation effects etc.) are not depicted in orthography
    t: Mountain, tea, truck, spilt, little
  Special Symbols are not to be confused with basic Characters

What is a Character?
  Indian Languages:
    No ‘alphabet’, no letters, no spellings
    Phoneme-based
    Units are syllable-like: called ‘akshara’-s
    akshara-s very large in number
      Corpus studies not sufficient
    Made up of vowels, consonants etc.
    Not all sequences valid

Script Grammar
  A Grammar for Scripts
  Allows all valid sequences, only valid sequences
  No need to code all possible akshara-s
  Script grammar must be part of standards: ISCII includes one. Does Unicode?
  Script Grammar to be enforced by software
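
As an illustration of how software could enforce a script grammar, here is a minimal sketch of a finite-state checker for Devanagari words. The character classes use Unicode code point ranges; the states, the transition table, and the function names are simplifying assumptions made for this example, not the grammar defined in ISCII.

```python
# Illustrative only: a toy finite-state script grammar for Devanagari.
# The classes and transitions below are assumptions, not the ISCII grammar.

def classify(ch):
    """Map a character to a coarse class, or None if it is not covered."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant
    if 0x0905 <= cp <= 0x0914:
        return "V"  # independent vowel
    if 0x093E <= cp <= 0x094C:
        return "S"  # dependent vowel sign
    if cp == 0x094D:
        return "H"  # virama (halant)
    if cp in (0x0901, 0x0902, 0x0903):
        return "M"  # candrabindu, anusvara, visarga
    return None

# state -> {character class -> next state}; a missing entry means "invalid"
TRANSITIONS = {
    "START":  {"C": "CONS", "V": "VOW"},
    "CONS":   {"C": "CONS", "V": "VOW", "S": "VOW", "H": "HALANT", "M": "MOD"},
    "HALANT": {"C": "CONS"},
    "VOW":    {"C": "CONS", "V": "VOW", "M": "MOD"},
    "MOD":    {"C": "CONS", "V": "VOW"},
}
ACCEPTING = {"CONS", "HALANT", "VOW", "MOD"}

def is_valid_word(word):
    """Return True if every character follows an allowed transition."""
    state = "START"
    for ch in word:
        state = TRANSITIONS[state].get(classify(ch))
        if state is None:
            return False
    return state in ACCEPTING

print(is_valid_word("\u0915\u093f"))  # consonant + vowel sign
print(is_valid_word("\u093f\u0915"))  # vowel sign with no consonant before it
```

A table-driven machine of this kind accepts all sequences the table allows and rejects everything else, without ever listing individual akshara-s.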

SuperScript
  ILs: 10 Scripts with a nearly common sound system, all derived from the ancient ‘braahmi’ script
  => SuperScript
    Super Set of all Phonemes
  Common encoding: ISCII
  Extendable to all languages of the world

ISCII: (BIS – 1991: IS 13194)
  128 codes more than sufficient
  Uses the upper half of the 8-bit code space; the lower (ASCII) half is untouched, which allows mixing with English
  SuperScript: Transliteration built-in
  Long Standing: ISCII 1988, 1991
  Well thought out and well designed
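
Because the ASCII half is left untouched, English and Indian-language text can be told apart byte by byte in a mixed 8-bit stream. The following is a small sketch of that idea; the function name is hypothetical and the upper-half byte values in the usage line are arbitrary placeholders, not real ISCII text.

```python
from itertools import groupby

def split_runs(data: bytes):
    """Split an 8-bit stream into runs of ASCII bytes (< 0x80)
    and upper-half bytes (>= 0x80), preserving order."""
    runs = []
    for is_upper, group in groupby(data, key=lambda b: b >= 0x80):
        runs.append(("upper" if is_upper else "ascii", bytes(group)))
    return runs

# Placeholder bytes: 0xB3 0xDA stand in for some upper-half (ISCII-range) codes.
print(split_runs(b"Hi \xb3\xda ok"))
```

Each "upper" run can then be handed to an ISCII-aware routine while the "ascii" runs pass through unchanged.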

Why did ISCII fail to catch on?
  Silent on Character-to-Font mapping
    A complex many-to-many mapping
  Fonts not standardized, fonts not available
  Not registered, no OS/Browser Support
  (BIS – 1991: IS 13194)
  Rationale not explained
  Not publicized, not enforced

History
  Proprietary, non-standard, secret font based encoding schemes
  Promoted by commercial companies
  Near Zero Inter-operability
  Ad-hoc ISCII-to-font mapping schemes
  Mapping schemes not made public
  To be made Illegal and Punishable
  Put India back by at least a decade!

Improving ISCII
  Register, to get OS/Browser Support
  Remove encoding of allophones, allographs
  Script Grammar: a finite state machine (FSM) is enough, a context-free grammar (CFG) is not needed
  Include Rationale, explanatory notes
  Remove Attribute/Extension codes
  Standardize ISCII-to-Font Mapping Scheme
  Promote, Enforce

Character-to-Font Mapping
  Complex scripts, not linear
  Glyphs: shape units convenient for rendering
  Poor correspondence with sound units
  Many-to-Many mappings
    Glyph selection, scaling, positioning
  No Glyph Encoding Standard

From Character to Font
  Must be provably complete and 100% consistent
  Current systems are all ad-hoc, neither complete nor consistent
  Finite State Transducers: Necessary and Sufficient
  Without restricting Creativity and Flexibility
  Simple, Efficient, Re-Usable
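
To make the transducer idea concrete, here is a toy sketch that maps a character sequence to a glyph sequence. It handles one well-known non-linear case in Devanagari: the vowel sign i (U+093F) is stored after its consonant but drawn to its left. The glyph names and the tiny character table are hypothetical, and a real font mapping would need many more states and rules.

```python
# Toy character-to-glyph transducer. Glyph names are made up for illustration.
GLYPH = {
    "\u0915": "ka",       # consonant KA
    "\u092e": "ma",       # consonant MA
    "\u093e": "aa_sign",  # vowel sign AA, drawn to the right
    "\u093f": "i_sign",   # vowel sign I, drawn to the left
}
CONSONANTS = {"\u0915", "\u092e"}
I_SIGN = "\u093f"

def to_glyphs(text):
    """Emit glyph names in visual order for a string in logical order.
    Raises KeyError for characters outside the toy table."""
    out = []
    pending = None  # the only state: a consonant glyph read but not yet emitted
    for ch in text:
        glyph = GLYPH[ch]
        if ch == I_SIGN and pending is not None:
            out.extend([glyph, pending])  # reorder: sign first, then consonant
            pending = None
            continue
        if pending is not None:
            out.append(pending)
            pending = None
        if ch in CONSONANTS:
            pending = glyph
        else:
            out.append(glyph)
    if pending is not None:
        out.append(pending)
    return out

print(to_glyphs("\u0915\u093f"))  # logical order: KA, then sign I
```

Since the machine remembers at most one pending glyph from a finite set, its state space is finite, which is what makes completeness and consistency checkable in principle.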

Encoding Standards: Unicode
  For Language/Script/SuperScript?
    CJK. Why not for ILs?
  Script Grammar?
  Character-to-Font: relegated to font level font effects
  ISCII-88 Based, Has Errors
  Once added, cannot be deleted!

ISCII or Unicode?
  Unicode:
    To be with the World, to know and be known
    ‘Correcting’ Mistakes, Improving Standards
    Support (OS, Fonts, etc.), Education, Training
    Converting Legacy Data: A Huge Task
      ISCII-to-Unicode is not trivial
  Ignore BIS Standard and embrace what is not yet ‘standardized’?
  Why not co-exist? Internal and External Views

Keyboard Layouts, Drivers
  Several de-facto standards and many variations in use
  To select a few and standardize
  So called Roman Phonetic
    Typing ILs through English!
    OK for oldies, not for future!
  INSCRIPT: ISCII Standard, Good for newcomers
  To strictly enforce Script Grammar

Document Encoding Standards
  Plain Text: pure ISCII/UNICODE
    Mono-lingual Plain Text?
  Annotated Text (Ex. Word Processors)
  XML Style, Open, Readable formats to be encouraged
  Proprietary, secret, non-standard encodings must be discouraged
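
As a sketch of what an open, readable, XML-style annotated format might look like, the example below stores multilingual text with a language tag per segment and reads it back. The element names `document` and `segment` and the `lang` attribute are hypothetical choices for this illustration, not part of any standard mentioned in the talk.

```python
import xml.etree.ElementTree as ET

def build_document(segments):
    """Serialize (language, text) pairs to a small XML document string."""
    doc = ET.Element("document")
    for lang, text in segments:
        seg = ET.SubElement(doc, "segment", {"lang": lang})
        seg.text = text
    return ET.tostring(doc, encoding="unicode")

def read_document(xml_text):
    """Recover the (language, text) pairs from the XML string."""
    root = ET.fromstring(xml_text)
    return [(seg.get("lang"), seg.text) for seg in root.iter("segment")]

segments = [("en", "Hello"), ("hi", "\u0928\u092e\u0938\u094d\u0924\u0947")]
xml_text = build_document(segments)
print(xml_text)
print(read_document(xml_text) == segments)
```

Any tool that understands XML can read such a file, which is the inter-operability argument for open formats over secret ones.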

Transliteration
  Widely used, part of our Tradition
    Sanskrit texts in local scripts
    English, Hindi, Urdu words in local scripts
    Music Compositions
  Automatic in ISCII. Unicode?
  Quality of transliteration
    To and From English?
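
In Unicode, the major Indian script blocks are 128 code points wide and laid out in parallel, following the ISCII ordering, so a rough script-to-script transliteration can be sketched as a fixed offset between blocks. The function below is a minimal sketch under that assumption; it does not check whether the target position is actually assigned in the target script, which a real converter would have to do.

```python
# Start of each 128-code-point Unicode block.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali":    0x0980,
    "telugu":     0x0C00,
    "kannada":    0x0C80,
}

def transliterate(text, src, dst):
    """Shift characters of the source block into the target block.
    Characters outside the source block are copied unchanged."""
    lo = BLOCK_START[src]
    shift = BLOCK_START[dst] - lo
    return "".join(
        chr(ord(c) + shift) if lo <= ord(c) < lo + 0x80 else c
        for c in text
    )

# Devanagari KA (U+0915) to Telugu KA (U+0C15)
print(transliterate("\u0915", "devanagari", "telugu") == "\u0c15")
```

The quality concerns raised above still apply: sounds present in one script but not another have no well-defined target, and nothing here addresses conversion to and from English.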

Romanization
  Need:
    Where there is no support for local languages
      English dailies, posters, advertisements etc.
      Lack of support: OS/Browser/Fonts etc.
    Where users prefer Roman
  A variety of ad-hoc schemes in use
    iTRANS, RTS, W-X, etc.
  Standards badly needed

Romanization
  Multi-dimensional optimization problem
  Case Mix-up
    26 Letters not sufficient
    52 nearly sufficient
    Not always supported
  Storage space, Ease of Typing, Aesthetics
  Scientific/Logical Design/Naturalness
    English-like, for the oldies: a, ee, oo, a, oa ???
    Futuristic: aa/ii/uu/ee/oo

Romanization
  Clashes: a+u/au, k+h/kh, s’
  Two way conversion, cyclic check
  Ex. Long Vowels:
    a:         (-) clashes with colon
    diacritic  (-) not supported
    IPA        (-) not understood  (-) not supported
    A          (+) single char.  (+) saves space  (-) ugly  (-) difficult to type  (-) case mix-up
    aa         (+) logical (like ee)  (+) easy to type

Romanization: An Example
  a aa i ii u uu R RR e ee ai o oo au M H
  k kh g gh n~
  c ch j jh n`
  T TH D DH N
  t th d dh n
  p ph b bh m
  y r l v s’ S s h L
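
The clashes and the cyclic check discussed for Romanization can be demonstrated on this example scheme. The sketch below tokenizes Roman text by greedy longest match and tests whether a symbol sequence survives a round trip. The function names are hypothetical, and a plain ASCII apostrophe is assumed in place of the typographic one in s’.

```python
# Symbols of the example scheme (ASCII apostrophe assumed for s').
SYMBOLS = (
    "a aa i ii u uu R RR e ee ai o oo au M H "
    "k kh g gh n~ c ch j jh n` T TH D DH N "
    "t th d dh n p ph b bh m y r l v s' S s h L"
).split()
SYMBOL_SET = set(SYMBOLS)
MAX_LEN = max(len(s) for s in SYMBOLS)

def tokenize(text):
    """Split Roman text into scheme symbols, longest match first."""
    tokens = []
    i = 0
    while i < len(text):
        for n in range(MAX_LEN, 0, -1):
            piece = text[i:i + n]
            if piece in SYMBOL_SET:
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no symbol matches at position {i}: {text[i:]!r}")
    return tokens

def round_trip_ok(tokens):
    """Cyclic check: join the symbols, tokenize again, compare."""
    return tokenize("".join(tokens)) == list(tokens)

print(round_trip_ok(["kh", "aa"]))  # unambiguous sequence
print(round_trip_ok(["k", "h"]))    # clashes with the single symbol kh
print(round_trip_ok(["a", "u"]))    # clashes with the single symbol au
```

Sequences that fail the check are exactly the ones where a scheme needs a separator or a different symbol choice to stay reversible.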

Translation
  Create Material Afresh
  Translate by Hand
  Automatic/Machine Translation
  Machine Aided Translation
  English – Local Language Translation
  Local – Local Language Translation

Translation
  Resource Intensive
    Manpower, Time, Cost
  Quality/Uniformity
    Standards, Bench-Mark Data, Testing and Evaluation Procedures
  Dictionaries, Terminology Databases
  Pan-Indian Terms/Sanskritize/Localize

Linguistic Resources
  Dictionaries: General, Domain Specific
  Terminological Databases
  Thesauri, WordNets, Ontologies
  Morphological Analyzers, Generators
  Spell/Grammar/Style Checkers
  Annotated Text and Speech Corpora

India: Future is in Speech
  One Billion People, A Sixth of the World
  More than 150 Languages, 22 Recognized
  95% not comfortable with English
  Computers, Current, Connectivity
  Info Revolution benefits: Majority Deprived
  10 M Computers, 100 M Phones
  Future is in Speech

Speech
  Natural
  Easy, Fast
  Hands-Free
  No need to Learn
    Technology
    Language
  Available to all

Text and Speech
  Speech is Natural
  Reading/Writing is learnt, Artificial
  Some never learn: Illiterates
  Oral Tradition
  Speech is more permanent than Text!
  “I did not steal that ring of gold”
  Trust Yourself!

Speech Technologies
  Speech Recognition: Speech to Text
  Speech Synthesis: Text to Speech
  Speaker Recognition, Verification, ID
  Speech Coding/Decoding, Compression
  Slow down, Speed up
  Speech as Evidence

Applications
  Telephone Dialing
  Form Filling
  Dictation Machine
  Command and Control
  Voice enabled Web
  OCR+WP+TTS
  MT: Cross-Lingual IR, S2S

OCR
  OCR in Local Scripts Needed
    To digitize and save legacy data
    To compile/process/edit/refine data
  For Printed Texts/Manuscripts
  Old Data
    deterioration of paper
    old type fonts, problems of type-setting

Multi-Modal Interfaces
  To Reach out to 1 Billion People, we must get the best of many worlds:
    Speech Recognition and Synthesis
    Graphics and iconic Interfaces
    OCR Technologies
    Translation, CLIR
    Camera, Gestures, Touch Screen

Balance Between
  Backward Compatibility and Future-Proof Designs
  Quick Fix Solutions and Long Haul
  One Standard or Several?
  Economics and Business Sense versus Social Responsibilities
  Acceptance versus Enforcement

The 3 Most Important Things
  1. Develop/Refine/Update Standards
       Detailed Documentation
       Including rationale, issues, evaluation, etc.
  2. Education and Training
  3. Enforcement
       Make use of non-standard methods illegal and punishable under law
  Technical Workshops for detailing

Thank You!

Visit www.LanguageTechnologies.ac.in