Language Identification and IT
Peter Constable and Gary Simons
SIL International
www.sil.org
17th International Unicode Conference San Jose, CA September 2000
Language identification
The use of identificational codes for tagging information objects to indicate the language in which the information is expressed
<body xml:lang="en">
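A minimal sketch of how software might read such a tag, assuming Python's standard xml.etree.ElementTree parser. The helper name languages_in_effect and the sample markup are illustrative assumptions, not part of the presentation. It relies on the XML convention that an element without its own xml:lang inherits the value of its parent.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages_in_effect(xml_text):
    """Return (element name, language) pairs, inheriting xml:lang from the parent."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(element, inherited):
        # An explicit xml:lang overrides whatever was inherited.
        lang = element.get(XML_LANG, inherited)
        result.append((element.tag, lang))
        for child in element:
            walk(child, lang)

    walk(root, None)
    return result


# Hypothetical sample document.
sample = '<body xml:lang="en"><p>Hello</p><p xml:lang="es">Hola</p></body>'
print(languages_in_effect(sample))
```

The second paragraph carries its own tag, so it is reported as "es" while the first inherits "en" from the body element.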
Language identification
Not considering automated language detection
Considering only language identifiers, not identifiers for paralinguistic notions, such as writing system or locale
About the Ethnologue
SIL Ethnologue
• catalogue of all modern languages in the world
• lists over 6,800 living languages
• result of decades of research
• system of three-letter codes
• http://www.sil.org/ethnologue
About the Ethnologue
Existing user base for Ethnologue codes:
• SIL
• UNESCO
• Linguistic Data Consortium (850+ agencies)
• The Linguist List (12,500 individual linguists)
• The Endangered Language Fund
• others
Linguistic diversity
Number of languages:
Africa: 2062
Americas: 1020
Europe: 237
Asia: 2202
Pacific: 1312
Motivation for this paper
Languages covered by standards
• ISO 639-x covers approx. 400 languages
• existing needs go much further: over 6,800 languages
• immediate need among linguists and other researchers for use in XML
Five issues
Change
Categorization
Inadequate definition
Scale
Documentation
The need for language identifiers
Language-specific processing
• spell-checking
• sorting
• morphological parsing
• speech recognition/synthesis
• language-specific typographic behaviour
• etc.
The need for language identifiers
Language-specific processing
• choosing appropriate resources
Los eventos deportivos pra la juventud
ህጩቦአፈ ዸድ ማዽ ጸመቂትወይቴ።
The need for language identifiers
Two distinct issues:
• identify the language
• apply the specific processing for that language
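The two steps can be sketched as a lookup followed by a dispatch. This is an illustration only: the tiny word lists, the function name misspelled, and the choice of spell-checking as the processing step are assumptions, and a real system would load full dictionaries.

```python
# Hypothetical miniature word lists keyed by language identifier.
WORD_LISTS = {
    "es": {"los", "eventos", "deportivos", "para", "la", "juventud"},
    "en": {"the", "sporting", "events", "for", "youth"},
}


def misspelled(text, lang_tag):
    """Step 1: the tag identifies the language. Step 2: apply that language's checker."""
    words = WORD_LISTS.get(lang_tag)
    if words is None:
        # Without resources for the tagged language, no processing is possible.
        raise LookupError(f"no spell-checking resources for {lang_tag!r}")
    return [w for w in text.lower().split() if w not in words]


print(misspelled("Los eventos deportivos pra la juventud", "es"))  # ['pra']
```

Checked against the Spanish list, only "pra" is flagged. Checked against the English list, every word would be flagged, which is why the identifier has to be right before the processing is applied.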
The need for language identifiers
Language detection
• identify language by inspection of data itself
• available only for a few languages
• not practical for searching large corpora (e.g. the Internet)
• doesn’t work on short text segments
She said, “chat”.
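A toy illustration of why short segments defeat detection, assuming a naive lexicon-lookup detector. The lexicons and the function name candidate_languages are hypothetical, and real detectors use statistical models rather than word lists, but the ambiguity is the same.

```python
# Hypothetical miniature lexicons, for illustration only.
LEXICONS = {
    "en": {"she", "said", "chat", "the", "cat"},
    "fr": {"elle", "a", "dit", "chat", "le"},
}


def candidate_languages(segment):
    """Return every language whose lexicon contains all words of the segment."""
    words = segment.lower().split()
    return sorted(
        lang
        for lang, lexicon in LEXICONS.items()
        if all(w in lexicon for w in words)
    )


print(candidate_languages("chat"))      # ['en', 'fr']
print(candidate_languages("she said"))  # ['en']
```

The single word "chat" is valid in both lexicons, so inspection of the data alone cannot decide. An explicit tag on the quoted word can.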
The need for language identifiers
Language-specific processing
• in general, must tag information objects to indicate language
• identifiers are needed to distinguish every language
Issue #1: change
Languages are constantly changing
Implications:
• systems of language tags cannot be static
• the speech variety (varieties) denoted by a tag is time-bound
“English” c. 1700 A.D. ≠ “English” c. 2000 A.D.
Issue #2: categorization
Typical question: Are Serbian and Croatian the same language, or different languages?
Operational definitions of language
• many different ways to formulate a definition
• different definitions create different categorizations
• different categorizations serve different purposes
Issue #3: inadequate definition
Existing systems do not consistently employ a single operational definition
• ISO 639-2: codes for “languages” and for groups of languages
  nav = Navajo
  ath = Athapascan languages
• ISO 639-2: some “languages” are groups of languages
  que = “Quechua” (47 distinct languages)
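A small sketch of the consequence for software: when one code list mixes categories of different types, each entry needs an explicit type before it can be used reliably. The kind field and the CodeEntry class are assumptions made for illustration, not part of ISO 639-2. The three entries are the ones named on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeEntry:
    code: str
    name: str
    kind: str  # hypothetical field: "language" or "group"


ENTRIES = [
    CodeEntry("nav", "Navajo", "language"),
    CodeEntry("ath", "Athapascan languages", "group"),
    CodeEntry("que", "Quechua", "group"),  # described as 47 distinct languages
]


def individual_languages(entries):
    """Return only the codes that denote a single language."""
    return [entry.code for entry in entries if entry.kind == "language"]


print(individual_languages(ENTRIES))  # ['nav']
```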
Issue #3: inadequate definition
Consistent use of a single definition in a given namespace is beneficial
“Requiring a single definition imposes too much constraint on users”
• users may legitimately have different requirements
• but no control results in confusion, especially when thousands of identifiers are added
Issue #4: Scale
The number of languages exceeds the coverage of existing systems by an order of magnitude (400 vs. 6,800)
Existing systems do not scale well
Issue #4: Scale
ISO 639-x
• slow process unable to cope with large volume of requests
• minimal attestation (50 documents) not appropriate for lesser-known languages
• mnemonic codes (impossible for thousands of languages)
• confusion due to inconsistent definition
Issue #4: Scale
RFC 1766
• process unable to cope with large volume of requests
• confusion due to inconsistent definition
• unclear how to create tags
Issue #5: documentation
Existing systems: can’t tell what codes denote
• ISO 639-x: language, or group of languages?
  ara, “Arabic”: Standard only? all variants?
• ISO 639-x: which of several alternate possibilities?
  bin, “Bini”
  = dial. of Yoruba (Nigeria; 20,000,000)
  = dial. of Anyin (Côte d'Ivoire; 810,000)
  = alt. name for Edo (Nigeria; 1,000,000)
  = alt. name for Pini (Australia; dying)
Issue #5: documentation
• ISO 639-x: 2- vs. 3-letter codes
  st, “Sesotho”
  = nso, “Sotho, Northern”?
  = sot, “Sotho, Southern”?
  = both?
  to, “Tonga”
  = tog, “Tonga (Nyasa)”?
  = ton, “Tonga (Tonga Islands)”?
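The practical effect can be sketched as a conversion routine that has to refuse to answer. The table holds only the candidate correspondences listed on the slide. The names CANDIDATES and resolve are hypothetical, and nothing here settles which candidate is actually intended.

```python
# Candidate 3-letter codes for each 2-letter code, as listed on the slide.
# Which candidate is intended is exactly the open question.
CANDIDATES = {
    "st": ["nso", "sot"],
    "to": ["tog", "ton"],
}


def resolve(two_letter):
    """Return the unique 3-letter code, or raise if the mapping is ambiguous or unknown."""
    options = CANDIDATES.get(two_letter, [])
    if len(options) != 1:
        raise ValueError(f"{two_letter!r} is ambiguous or unknown: {options}")
    return options[0]


for code in ("st", "to"):
    try:
        print(resolve(code))
    except ValueError as error:
        print(error)
```

Without documentation stating what each code denotes, an automatic conversion has no basis for choosing between the candidates.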
Solving these problems
Requirements of an adequate system:
• able to scale
• able to deal with change, track history of change
• use a single operational definition for a given namespace
• apply definition consistently within a namespace
• complete, maintained, online documentation
What the Ethnologue offers
Scale: already there
• enumeration of languages
• set of three-letter codes
Change: careful management
• no re-use of codes
• have begun recording revision history
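One way the two change-management rules could be enforced in software is sketched below. The CodeRegistry class and its methods are assumptions made for illustration and do not describe how the Ethnologue is actually maintained.

```python
from dataclasses import dataclass, field


@dataclass
class CodeRegistry:
    active: dict = field(default_factory=dict)   # code -> language name
    retired: set = field(default_factory=set)    # codes that may never be reassigned
    history: list = field(default_factory=list)  # (action, code, name) records

    def assign(self, code, name):
        # No re-use: a code that was ever assigned cannot be assigned again.
        if code in self.active or code in self.retired:
            raise ValueError(f"code {code!r} has already been used")
        self.active[code] = name
        self.history.append(("assign", code, name))

    def retire(self, code):
        # Retiring removes the code from active use but keeps it reserved.
        name = self.active.pop(code)
        self.retired.add(code)
        self.history.append(("retire", code, name))


registry = CodeRegistry()
registry.assign("hop", "Hopi")
print(registry.history)
```

Because retired codes stay reserved, data tagged years ago can never silently come to mean a different language, and the history list records what each code denoted and when it changed.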
What the Ethnologue offers
Definition: single definition, applied quite consistently
• definition: primary criterion of mutual non-intelligibility as a basis for identifying candidates for separate literacy, literature
• all categories are of the same type; no language families, groups, writing systems
What the Ethnologue offers
Documentation
• extensive information maintained for every language
• new site will provide various reports
  • alternate names, location, population, etc.
  • related ISO codes, relationship
  • return Ethnologue data given an ISO code
• evaluating possibilities for returning results as XML
Integration with RFC 1766, XML
Ethnologue codes immediately available using “x-”
• private-use tags not ultimately satisfactory
“Hopi”:
<body xml:lang="x-hop">
<body xml:lang="x-sil-hop">
Integration with RFC 1766, XML
Register thousands of new tags with IANA
• process would not be able to cope
• problems devising that many tags
• create considerable confusion in the single namespace
Integration with RFC 1766, XML
Register “i-sil-” to specify a namespace maintained by a particular agency
<body xml:lang="i-sil-hop">
• deals with scale
• creates a namespace with a particular definition that is consistently applied
• avoids confusion of having a single namespace for all needs
• allows alternate namespaces
Integration with RFC 1766, XML
Possible refinement: define primary tag “n-”
• first sub-tag identifies a registered namespace of identifiers
• each namespace provides its own operational definition(s)
• “i-” usage more consistent (languages only)
• “i-” specifies a privileged namespace (doesn’t require “n-”)
<body xml:lang="n-sil-hop">
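A sketch of how a processor might take such tags apart. The well-formedness check follows the RFC 1766 syntax of a primary tag plus optional subtags, each one to eight ASCII letters. Reading the first subtag of an “i-” or “n-” tag as a namespace is the proposal made in these slides, not existing RFC 1766 behaviour, and the function names are hypothetical.

```python
import re

# RFC 1766 syntax: primary tag and subtags are each 1 to 8 ASCII letters.
TAG_RE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def parse_tag(tag):
    """Split a well-formed tag into (primary tag, list of subtags), lower-cased."""
    if not TAG_RE.fullmatch(tag):
        raise ValueError(f"not a well-formed language tag: {tag!r}")
    primary, *subtags = tag.lower().split("-")
    return primary, subtags


def namespaced_code(tag):
    """For proposed forms such as i-sil-hop or n-sil-hop, return (namespace, code).

    Returns None for any other tag, including private-use "x-" tags.
    """
    primary, subtags = parse_tag(tag)
    if primary in {"i", "n"} and len(subtags) == 2:
        return subtags[0], subtags[1]
    return None


for example in ("en", "x-sil-hop", "i-sil-hop", "n-sil-hop"):
    print(example, parse_tag(example), namespaced_code(example))
```

Under this reading, “i-sil-hop” and “n-sil-hop” both resolve to the code “hop” in the “sil” namespace, while “x-sil-hop” stays an opaque private-use tag whose internal structure no receiver is entitled to interpret.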
Conclusions
Language identifiers are required for language-specific processing
Immediate need for thousands of new language identifiers, in particular for use in XML
Five problem areas need to be considered in any system
SIL Ethnologue codes address all five problems
Revising RFC 1766 to add a namespace mechanism can support this and would offer many benefits