Language Identification and IT

Peter Constable and Gary Simons

SIL International

[email protected]

[email protected]

www.sil.org

17th International Unicode Conference San Jose, CA September 2000

Language identification

The use of identificational codes for tagging information objects to indicate the language in which the information is expressed

<body xml:lang="en">
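
As a rough sketch of how software might consume such a tag, the Python below uses the standard library's xml.etree.ElementTree to report the language in effect for each element. xml:lang applies to an element's children unless they override it. The sample document and the helper name effective_langs are illustrative assumptions, not part of the presentation.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined "xml:" prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_langs(elem, inherited=None):
    """Yield (tag, language) pairs; xml:lang is inherited unless overridden."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_langs(child, lang)

# Illustrative document: an English body containing one Spanish paragraph.
doc = ET.fromstring(
    '<body xml:lang="en"><p>Hello</p><p xml:lang="es">Hola</p></body>'
)
print(list(effective_langs(doc)))
```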

Language identification

Not considering automated language detection

Considering only language identifiers, not identifiers for paralinguistic notions, such as writing system or locale

About the Ethnologue

SIL Ethnologue
• catalogue of all modern languages in the world
• lists over 6,800 living languages
• result of decades of research
• system of three-letter codes
• http://www.sil.org/ethnologue

About the Ethnologue

Existing user base for Ethnologue codes:
• SIL
• UNESCO
• Linguistic Data Consortium (850+ agencies)
• The Linguist List (12,500 individual linguists)
• The Endangered Language Fund
• others

Linguistic diversity

Number of languages:
• Africa: 2062
• Americas: 1020
• Europe: 237
• Asia: 2202
• Pacific: 1312

Motivation for this paper

Languages covered by standards
• ISO 639-x covers approx. 400 languages
• existing needs go much further: over 6,800 languages
• immediate need among linguists and other researchers for use in XML

Five issues

Change

Categorization

Inadequate definition

Scale

Documentation

The need for language identifiers

Language-specific processing
• spell-checking
• sorting
• morphological parsing
• speech recognition/synthesis
• language-specific typographic behaviour
• etc.

The need for language identifiers

Language-specific processing
• choosing appropriate resources

Los eventos deportivos pra la juventud

ህጩቦአፈ ዸድ ማዽ ጸመቂትወይቴ።

The need for language identifiers

Two distinct issues:
• identify the language
• apply the specific processing for that language
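
The separation between identifying the language and applying its processing can be sketched in Python as a lookup keyed by the language tag. The word lists and the function name misspelled are hypothetical and tiny; a real spell-checker would load a full lexicon for each language. The sample sentence is the Spanish one from the previous slide.

```python
# Hypothetical per-language word lists, keyed by language tag.
WORDLISTS = {
    "es": {"los", "eventos", "deportivos", "para", "la", "juventud"},
    "en": {"the", "sporting", "events", "for", "youth"},
}

def misspelled(text, lang):
    """Return the words of text that are absent from the lexicon chosen by lang."""
    lexicon = WORDLISTS.get(lang)
    if lexicon is None:
        # Step 1 (identification) succeeded, but no resource exists for step 2.
        raise LookupError(f"no spelling resource for language tag {lang!r}")
    return [word for word in text.lower().split() if word not in lexicon]

print(misspelled("Los eventos deportivos pra la juventud", "es"))  # ['pra']
```

Checking the same sentence against the "en" list would flag every word, which is why the tag has to be known before the resource is chosen.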

The need for language identifiers

Language detection
• identify language by inspection of data itself
• available only for a few languages
• not practical for searching large corpora (e.g. the Internet)
• doesn’t work on short text segments

She said, “chat”.
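
A toy Python illustration of why a short segment defeats detection: the quoted word belongs to more than one lexicon, so inspecting the data alone cannot decide. The two lexicons and the function name candidate_languages are assumptions made for this sketch and do not represent a real detector.

```python
# Hypothetical miniature lexicons; "chat" is a word in both English and French.
LEXICONS = {
    "en": {"she", "said", "chat"},
    "fr": {"elle", "dit", "chat"},
}

def candidate_languages(word):
    """Return every language whose lexicon contains the word."""
    return sorted(lang for lang, words in LEXICONS.items() if word.lower() in words)

print(candidate_languages("chat"))  # ['en', 'fr']
```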

The need for language identifiers

Language-specific processing
• in general, must tag information objects to indicate language
• identifiers are needed to distinguish every language

Issue #1: change

Languages are constantly changing

Implications:
• systems of language tags cannot be static
• the speech variety (varieties) denoted by a tag is time-bound

“English” c. 1700 A.D. ≠ “English” c. 2000 A.D.

Issue #2: categorization

Typical question: Are Serbian and Croatian the same language, or different languages?

Operational definitions of language
• many different ways to formulate a definition
• different definitions create different categorizations
• different categorizations serve different purposes

Issue #3: inadequate definition

Existing systems do not consistently employ a single operational definition

• ISO 639-2: codes for “languages” and for groups of languages
  nav = Navajo
  ath = Athapascan languages

• ISO 639-2: some “languages” are groups of languages
  que = “Quechua” (47 distinct languages)

Issue #3: inadequate definition

Consistent use of a single definition in a given namespace is beneficial

“Requiring a single definition imposes too much constraint on users”

• users may legitimately have different requirements
• but no control results in confusion, especially when thousands of identifiers are added

Issue #4: Scale

The number of languages exceeds the coverage of existing systems by an order of magnitude (400 vs. 6,800)

Existing systems do not scale well

Issue #4: Scale

ISO 639-x
• slow process unable to cope with large volume of requests
• minimal attestation (50 documents) not appropriate for lesser-known languages
• mnemonic codes (impossible for thousands of languages)
• confusion due to inconsistent definition

Issue #4: Scale

RFC 1766
• process unable to cope with large volume of requests
• confusion due to inconsistent definition
• unclear how to create tags

Issue #5: documentation

Existing systems: can’t tell what codes denote

• ISO 639-x: language, or group of languages?
  ara, “Arabic”: Standard only? all variants?

• ISO 639-x: which of several alternate possibilities?
  bin, “Bini”
  = dial. of Yoruba (Nigeria; 20,000,000)
  = dial. of Anyin (Côte d'Ivoire; 810,000)
  = alt. name for Edo (Nigeria; 1,000,000)
  = alt. name for Pini (Australia; dying)

Issue #5: documentation

• ISO 639-x: 2- vs. 3-letter codes

st, “Sesotho”
= nso, “Sotho, Northern”?
= sot, “Sotho, Southern”?
= both?

to, “Tonga”
= tog, “Tonga (Nyasa)”?
= ton, “Tonga (Tonga Islands)”?
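
The practical consequence for software can be sketched in Python: when a two-letter code has more than one plausible three-letter counterpart, an automatic conversion cannot pick one without documentation. The table holds only the candidate pairings listed on this slide. The names TWO_TO_THREE, three_letter_candidates, and is_ambiguous are illustrative assumptions.

```python
# Candidate correspondences as listed on the slide; which one is intended
# is exactly what the existing documentation fails to say.
TWO_TO_THREE = {
    "st": ["nso", "sot"],
    "to": ["tog", "ton"],
}

def three_letter_candidates(two_letter):
    """Return all three-letter codes that might correspond to a two-letter code."""
    return TWO_TO_THREE.get(two_letter, [])

def is_ambiguous(two_letter):
    """True when more than one three-letter code could be meant."""
    return len(three_letter_candidates(two_letter)) > 1

for code in ("st", "to"):
    print(code, three_letter_candidates(code), is_ambiguous(code))
```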

Solving these problems

Requirements of an adequate system:
• able to scale
• able to deal with change, track history of change
• use a single operational definition for a given namespace
• apply definition consistently within a namespace
• complete, maintained, online documentation

What the Ethnologue offers

Scale: already there
• enumeration of languages
• set of three-letter codes

Change: careful management
• no re-use of codes
• have begun recording revision history
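
One way to picture the two change-management rules (no re-use of codes, a recorded revision history) is the toy Python registry below. It is a sketch under assumptions and does not describe how the Ethnologue is actually maintained. The class CodeRegistry, its fields, and the code "xxa" with its names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CodeRegistry:
    """Toy registry: codes may be retired but are never assigned a second time."""
    active: dict = field(default_factory=dict)    # code -> language name
    retired: set = field(default_factory=set)     # codes withdrawn from use
    history: list = field(default_factory=list)   # (year, action, code, name)

    def assign(self, code, name, year):
        if code in self.active or code in self.retired:
            raise ValueError(f"code {code!r} has already been used")
        self.active[code] = name
        self.history.append((year, "assign", code, name))

    def retire(self, code, year):
        name = self.active.pop(code)
        self.retired.add(code)
        self.history.append((year, "retire", code, name))

registry = CodeRegistry()
registry.assign("xxa", "Example language", 2000)   # hypothetical code and name
registry.retire("xxa", 2001)
try:
    registry.assign("xxa", "Another language", 2002)
except ValueError as err:
    print(err)  # the retired code is refused
```

Keeping retired codes in their own set means that data tagged years ago still resolves to what the code meant at the time.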

What the Ethnologue offers

Definition: single definition, applied quite consistently

• definition: primary criterion of mutual non-intelligibility as a basis for identifying candidates for separate literacy, literature

• all categories are of the same type; no language families, groups, writing systems

What the Ethnologue offers

Documentation
• extensive information maintained for every language
• new site will provide various reports
  • alternate names, location, population, etc.
  • related ISO codes, relationship
  • return Ethnologue data given an ISO code
• evaluating possibilities for returning results as XML

Integration with RFC 1766, XML

Ethnologue codes immediately available using “x-”

• private-use tags not ultimately satisfactory

“Hopi”:

<body xml:lang="x-hop">

<body xml:lang="x-sil-hop">

Integration with RFC 1766, XML

Register thousands of new tags with IANA
• process would not be able to cope
• problems devising that many tags
• create considerable confusion in the single namespace

Integration with RFC 1766, XML

Register “i-sil-” to specify a namespace maintained by a particular agency

<body xml:lang="i-sil-hop">

• deals with scale
• creates a namespace with a particular definition that is consistently applied
• avoids confusion of having a single namespace for all needs
• allows alternate namespaces

Integration with RFC 1766, XML

Possible refinement: define primary tag “n-”

• first sub-tag identifies a registered namespace of identifiers
• each namespace provides its own operational definition(s)
• “i-” usage more consistent (languages only)
• “i-” specifies a privileged namespace (doesn’t require “n-”)

<body xml:lang="n-sil-hop">
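
The tag shapes discussed on these slides can be told apart mechanically, as the Python sketch below suggests. It treats “x-” and “i-” as the RFC 1766 private-use and registered primary tags and “n-” as the refinement proposed here, which RFC 1766 does not define. The type LangTag, the function parse_tag, and the kind labels are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

class LangTag(NamedTuple):
    kind: str                 # "private", "registered", "namespaced" or "standard"
    namespace: Optional[str]  # e.g. "sil", when the tag names one
    code: str                 # the language code itself

def parse_tag(tag):
    """Split a tag such as 'i-sil-hop' into its kind, namespace, and code."""
    parts = tag.lower().split("-")
    kinds = {"x": "private", "i": "registered", "n": "namespaced"}
    if parts[0] in kinds and len(parts) == 3:
        return LangTag(kinds[parts[0]], parts[1], parts[2])
    if parts[0] in kinds and len(parts) == 2:
        return LangTag(kinds[parts[0]], None, parts[1])
    # Anything else is treated as an ordinary tag; sub-tags are ignored here.
    return LangTag("standard", None, parts[0])

for tag in ("en", "x-hop", "x-sil-hop", "i-sil-hop", "n-sil-hop"):
    print(tag, parse_tag(tag))
```

Under this reading, a consumer that knows the “sil” namespace can look up “hop” in that agency's documentation, while an unknown namespace can be rejected explicitly instead of being guessed at.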

Conclusions

Language identifiers required for language-specific processing

Immediate need for thousands of new language identifiers; in particular, for use in XML

Five problem areas need to be considered in any system

SIL Ethnologue codes address all five problems

Revising RFC 1766 to add a namespace mechanism can support this and would offer many benefits