Automatic Identification of Close Languages - Case study: Malay and Indonesian

126 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.2, NO.2 NOVEMBER 2006

Automatic Identification of Close Languages -Case study: Malay and Indonesian

Bali Ranaivo-Malancon, Non-member

ABSTRACT

Identifying the language of an unknown text isnot a new problem but what is new is the task ofidentifying close languages. Malay and Indonesianas many other languages are very similar, and there-fore it is a real difficulty to search, retrieve, classify,and above all translate texts written in one of thetwo languages. We have built a language identifier todetermine whether the text is written in Malay or In-donesian which could be used in any similar situation.It uses the frequency and rank of trigrams of charac-ters, the lists of exclusive words, and the format ofnumbers. The trigrams are derived from the mostfrequent words in each language. The current pro-gram contains as language models: Malay/Indonesian(661 trigrams), Dutch (826 trigrams), English (652trigrams), French (579 trigrams), and German (482trigrams). The trigrams of an unknown text aresearched in each language model. The language ofthe input text is the language having the highest ra-tio in “number of shared trigrams / total number oftrigrams” and “number of winner trigrams / numberof shared trigrams”. If the language found at trigramsearch level is ’Malay or Indonesian’, the text is thenscanned by searching the format of numbers and ofsome exclusive words.

Keywords: Language identifier, Trigram, Close lan-guages, Malay, Indonesian, Exclusive words

1. INTRODUCTION

As long as human communicate by language, inwritten or spoken form, it is natural to identify thelanguage used. The constant increase of the numberof electronic documents combined with the perpetuallack of satisfaction of users - they request accurateand understandable documents - lead to the elabora-tion of a variety of tools making possible the auto-matic classification of these documents by language.

Language identification is not a new problem andmany methods have been tried to resolve it. A quickglance to the Web offers a glimpse of the great numberof published papers related to the topic and also thenumber of free or commercial tools that can perform

Manuscript received on June 12, 2006.The author is with Computer Aided Translation Unit

(UTMK) School of Computer Sciences Universiti SainsMalaysia 11800 Minden, Pulau Pinang, Malaysia; E-mail:[email protected]

the task of language identification. Identifying themain language of a document is very important formany applications in addition to the fact that notevery one can speak and understand more than onelanguage.

A language identifier finds its application in anytask involving multilingual electronic documents.Most of the current Internet search engines allow theusers to restrict the search to documents written in aspecified language. A language identifier is not onlyused for searching specific documents in the Web. Itallows also the evaluation or establishment of the lan-guages used in the Web, “in order to determine theneeds for language specific processing it is helpful toknow the distribution and relative importance of lan-guages on the web” [1]. In 1997, the Babel team [2]used SILC language identifier [3] in order to estab-lish Web languages hit parade. Another applicationof language identifier is found in Microsoft Word thatintegrates a tool to set up the language of the currentdocument and if the text within this document is notthe one that has been defined, the text is identified asincorrect. Other applications are e-mail managementtools, information retrieval, and speech synthesis [4].For Natural Language Processing applications, a lan-guage identifier can be used as a filter. For example,before submitting a document to a bidirectional ma-chine translation system, the document is processedby the language identifier. Once the language is iden-tified, the translation can start automatically withouthuman interference.

Different types of identification can be obtained bya language identifier. One common result is the iden-tification of the main language of a document. In thiscase, the output of the language identifier is the nameor family language name of this identified language.As we know, many documents contain more than onelanguage, one main language and one or more thanone language generally used to cite original texts intheir own language. In this situation, the languageidentifier should identify the main language of thedocument and the languages of each foreign citation.

In this paper, we present a language identifier thataims to discriminate very close languages like Malay(henceforth BM) and Indonesian (henceforth BI). Be-sides the two tasks that are identifying the main lan-guage and the languages of foreign citations (isolatedby quotation marks), the language identifier is able todistinguish close languages and very short texts liketitles or snippets.

Automatic Identification of Close Languages - Case study: Malay and Indonesian 127

The identification of the language is done in twosteps. During the first step, close languages are dis-carded from other languages present in the languagemodel. At this level, the language identifier uses thefrequency and rank of trigrams of characters. Dur-ing the second step, close languages are discarded be-tween them. For BM and BI, the language identifieruses the list of exclusive words and the format of num-bers.

This paper is organised as follows. Section 2 pro-vides an overview of similarities and differences thatmay exist between two close languages like BM andBI. Section 3 presents some techniques that have beenused to identify languages. Section 4 explains themethod that we have used in building our languageidentifier. Section 5 concerns the evaluation of the ac-curacy of our language identifier by comparing manu-ally its results with the results of two other languageidentifiers.

2. CLOSE LANGUAGES: SIMILARITIESAND DIFFERENCES

Most of existing language identifiers has been builtto recognise the language of any document withouttaking into account the problem of close languages.Our first motivation was to find an accurate, simpleand reusable method that can identify without ambi-guity close languages.

Two languages (or a group of languages) are said tobe close if they share relatively important similaritiesat word level. Close languages share many lexical andmorphological units with the same spelling, pronun-ciation and meaning. To be close imply differencessince they are not identical. These differences can beused to discriminate close languages.

Two examples of close languages are BM (officiallanguage of Malaysia) and BI (official language of In-donesia). The two languages belong to the family ofAustronesian languages and both derive from BahasaMelayu, which was the lingua franca of SoutheastAsia since at least the 7th century until 13th century[5-6]. We use BM and BI to illustrate the similari-ties and differences that may exist between close lan-guages, and also because our language identifier hasbeen built in order to differentiate the two languages.

2.1 Lexical similarities

Two languages are said close or similar if they havean important volume of shared vocabulary. The fol-lowing table (Table 1), built from the informationgiven by Ethnologue (a catalogue of known living lan-guages, www.ethnologue.com), shows the percentageof vocabulary shared by a pair of close languages.

There is no real and statistical evaluation of thesimilarity or difference between BM and BI. As-mah [5] did some small tests with her students in1998. The result showed that 30% of the two Indone-sian texts, submitted to 81 Malaysian students, were

Table 1: priority constant of design patterns com-ponents

“odd, unintelligible, and unusual”. Asmah dividedthese unintelligibilities into five categories, and thecategory “Presence of words and phrases totally un-familiar to Malaysians” is less than 10%. From thissmall but very informative test, we may start to statethat BM and BI have more 90% lexical similarity.

Table 2: Examples of borrowings in BM and BI

2.2 Different spellings

The Malay used in Malaysia is written whetherwith a variety of Arabic script (tulisan Jawi) or Ro-man script (tulisan Rumi). A unified spelling ofnative and borrowed words in BM and BI was in-troduced in 1972, known in Indonesia as “the Per-fect Spelling” (Ejaan Yang Disempurnakan), and inMalaysia as “the New Spelling of Malaysian Lan-guage” (Ejaan Baharu Bahasa Malaysia). Table 3shows some examples of spelling changes adopted byboth countries, Malaysia and Indonesia, in 1972. Be-sides these changes, the apostrophe for glottal stop(hamzah) and the inverted comma were omitted, and


were replaced by the letter ‘k’. New letters were in-cluded to write borrowed terms: ‘o’, ‘v’, and ‘x’.

Table 3: Examples of spelling agreement (1972)

In 1975, the Language Council of Indonesia andMalaysia (MBIM) proposed a set of rules for the coin-ing of technical terms. The introduction of BruneiDarussalam in the Council has changed the name ofthe Council which became the Language Council ofBrunei Darussalam, Indonesia, and Malaysia (MAB-BIM). Whatever the efforts done by the Council, wemust say that all these reforms did not achieve theiraims. We still have different spellings for some wordsin both BI and BM as it is shown in Table 4. Thesymbol ’#ı́ndicates word boundary.

Table 4: Spelling differences between BM and BI

Another difference between BM and BI can be no-ticed in writing numbers (Table 5). The difference isrelated to the use of full stop and comma.

In BI, the spelling of monetary value in trillion isnot standardised. Sometimes the three digits are sep-arated with a comma (e.g. Rp 6,565 triliun, Rp1,536triliun), sometimes with a full stop (e.g. Rp 1.000triliun, Rp 24.000 triliun).

2.3 Different vocabularies

To express the same concept or meaning, BM andBI may use two different words (Table 6). The twowords are totally synonymous.

Table 5: Format of numbers in BM and BI

Table 6: Different words for the same meaning inBM and BI


3. LANGUAGE IDENTIFICATION TECH-NIQUES

Many parameters can be used to identify the lan-guage - or at least to guess the family language - ofa document. The identification can be performed atdifferent levels of the document with for each level itsgood or bad reliability.

3.1 Language encoding information

Markup metalanguages, like SGML, HTML, XML,etc., include attributes that allow the users to spec-ify both language and script used in a document. InLatex, the author can indicate with a specific tag,the language used for any portion of the text whichis not the same as the main language. Unfortu-nately, the use of these attributes is simply recom-mended. Most of the time, the author of a doc-ument omits to indicate the language of the docu-ment. However, if this language information is men-tioned, there is no standardised format of the wayhow to represent the name of the language accord-ing to “Language Identifiers in the Markup Context”(http://xml.coverpages.org/languageIdentifiers.html #oss). For these reasons, a language identifier cannotrely fully on this unpredictable information. There-fore most language identifiers handle only ASCIItexts and do not rely fully on the language and char-acter encoding information.

3.2 Word or sub-word identification

The easiest and simplest solution in the identifica-tion of the language of a given text is to perform lexi-con look-up for each language available in the library.This kind of approach will create a language identi-fier that is undoubtedly one-hundred percent accu-rate. Adopting this method means that all or almostall existing and recognised languages have their elec-tronic dictionaries ready in the language identifier.Unfortunately not all living languages have a dictio-nary and even less an electronic dictionary. It meansalso that a good and fast search string must be ap-plied allowing the comparison of a given documentwith thousand dictionaries.

Giguet [8] declared that “it was possible to cat-egorize long sentences and texts using only linguis-tic knowledge”. Following this idea, the size of thelexicon can be reduced by looking for specific subsetof the lexicon: common words for [9][10], grammat-ical words to discriminate the language of sentencesfor [11], the most frequent words for [12], and shortstrings of characters which are unique to each lan-guage for [13][9]. The common limitation of this kindof approach is the size of the unknown text. A shortinput text may not contain those particular words.

Other language identifiers use linguistic segments.Giguet [11] realised that accented letters can be onlyused to discard languages but not to identify the right

language of unknown texts. The reason is that in ashort text, accented letters are not so frequent. Toimprove his approach, he explored other natural lan-guage properties: “using knowledge upon word mor-phology via syllabation: the idea is to check the goodsyllabation of words in a language, distinguishing thefirst, middles, and last syllables; in the same way, us-ing word endings and using sequences of vowels orconsonants” [11].

3.3 N-gram identification

The most popular statistical language models forlanguage identification are n-gram characters. Theidea is to accumulate the frequency or probability ofeach sequence of characters and detect the sequencesthat are specific and recurrent to each language. Thevalue of n varies from 1 to 5: 2 and 3 for [12], 1to 5 for [16] but usually n equals to 3 is often used[9]-[14][15]-[16]-[12][17].

Trigrams work well with short texts, 40-80 char-acters thought Liberman in his message in LinguistList 2.530 answering Kulikowski on how to identifythe language of a text line. Liberman’s thought isconfirmed by Poutsma’s tests [10]. Poutsma realisedthat “the n-gram method performed best with smallamounts of input (i.e. less than 100 characters in-put)”. Prager [18] found during his tests that com-bining words with unrestricted length and 4-gramsgave the best performance.

In 1994, Cavnar and Trenkle [16] have developeda language identifier very similar to our program in asense that it uses n-grams and computes the similar-ity between an unknown text and any of the languagemodels by calculating “out-of-place” measure. Thismeasure corresponds to the distance that exists be-tween the rank of an n-gram in the unknown text andin a language model. The language of the unknowntext is the language of the language model that has aminimum distance. The performance of Textcat andour language identifier is shown in section 5.

4. THE PRESENT WORK

Our motivation in building a new language identi-fier is due to the fact that the performance of availablelanguage identifiers is not wholly satisfactory whenguessing close languages like BM and BI. We startedour research by looking for a method that could sat-isfy the criteria for the best automatic language iden-tification. The tool must require a very small size aslanguage models. It has to be fast since it will beused as a pre-processing tool for other applications.It must be able to handle multilingual documents andshort texts. And more importantly, it has to be ac-curate.

We have built a language identifier for closelanguages written with Roman alphabet by usingtrigrams and some linguistic information like the


spelling of numbers and exclusive words. Close lan-guages mean that some words are used only in onelanguage and never in the other. We call ‘exclusivewords’ those words that appear only in one language.

Currently five languages are made available in ourlanguage identifier that is BI, BM, English, French,German, and Dutch (Table 7). The language modelfor each language is very small: between 231 (for Ger-man) and 270 words (for English). Those words areconsidered as being the most frequent words in eachrespective language. The most frequent words in BIand BM have been mixed as they show two similarlists. For each language, all words are transformedinto lower case and the trigrams are extracted, sorted,and reduced to one occurrence with its frequency.The list of trigrams for each language is saved in asimple text file with one line per each trigram and itsnumber of occurrences in the language.

Table 7: Language models

The language of an unknown text is given afterprocessing the text through four main modules as itis shown in figure 1.

In the pre-processing module, the unknown textis divided by sentences and all letters are changedinto lower case. All sequences of white spaces are re-duced to one white space which in turn is changedinto one underscore. After the pre-processing mod-ule, the unknown text is sent to the ‘Trigram seg-menter’. This module transforms a stream of char-acters (corresponding to a sentence or a quoted text)into a list of trigrams. Once a list of trigrams is ob-tained, it is sent to the module called ‘Language Iden-tifier’. At this level, each trigram in the unknown textis searched in each list of trigrams representing eachavailable language. A binary search is an appropri-ate algorithm for being a fast search in a sorted list.The Language Identifier module calculates the lan-guage of the unknown text based on two values: (1)the ratio of the total number of ‘shared trigrams’ di-vided by the total number of trigrams in the unknowntext, and (2) the ratio of the total number of ‘win-ner trigrams’ divided by the total number of sharedtrigrams. We called ‘shared trigrams’ trigrams that

appear in the unknown text and the current languagemodel. ‘Winner trigrams’ are trigrams that appear inthe unknown text, in at least one language model, andhave the highest rank among all language models.

Table 8 provides an illustration of shared and win-ner trigrams. The BM input sentence ‘Saya sukamakan nasi goreng.’ means ‘I like eating fried rice.’It generates 27 trigrams. The five trigrams ‘a m’,‘a s’, ‘gor’, ‘i g’, and ‘n n’ do not appear inany of the language models. The winner trigrams arehighlighted in bold.

Table 8: Language models

The language of the unknown text is the one thathas its ratio-1 and 2 equal or bigger to 0.45. Thisvalue has been defined after running many tests onthe language identifier. In our example, the text iseither BM and BI. To specify to right language of theunknown text, the original text is sent to the modulecalled ‘Malay-Indonesian Language Identifier’. Thismodule performs two actions to discard BM and BItexts. First, it checks the format of each number (ifthere is any) in the unknown text and start countingthe differences based on Table 5. Then it looks up inthe exclusive lists of words for BM and BI sian. Thecorrect language of the unknown text is the one withthe highest number of markers: format of numbers


Fig.1: Architecture of the Malay-Indonesian language identifier.

and exclusive words.

5. EVALUATION AND RESULTS

A method for evaluating the accuracy of languageidentifiers is not straightforward as the objective, thediscriminators, and the number of languages to beidentified is often different.

To evaluate the accuracy of our language identi-fier, we need to compare its results with some existingtools that contain BM and BI. We have found onlyfour language identifiers that include BM and BI lan-guages: Xerox Language Guesser, Textcat, RosetteLanguage Identifier, and Lextek Language IdentifierSDK. The other available language identifiers con-sider BI and BM as the same language. We haveconducted two tests and evaluate manually the re-sults.

The first test has been done based on the num-ber of words and characters in five BM texts and fiveBI texts. The ten texts have been chosen at ran-dom from our BM and BI corpora. We comparedthe results given by Textcat [19], Lextek [20], andour program. Textcat is a free online language iden-tifier having more than 75 languages (including BM

and BI). Textcat provides most of the time more thanone language leaving the user to guess the correct one.Lextek is also a free language identification programthat offers over 260 language (including BM and BI)and encoding modules. Lextek is more precise by pro-viding only one possible language. Table 9 (LI standsfor language identifier) gives the results of this firsttest in which our LI performs very well - zero error- compared with Textcat and Lextek. As one of ourreviewer highlighted, the good performance of our LIis not a real surprise since it uses lists of exclusivewords and rules for writing numbers, two data thatTextcat and Lextek do not have. The remark couldbe true if Lextek does not try to identify clearly BItexts from BM texts.

The second test was conducted by comparingTextcat and our language identifier by looking at theaccuracy of each tool at sentence length. The lengthof a sentence corresponds to the number of words thatit contains. We compared 180 sentences (90 BM sen-tences and 90 BI sentences) chosen at random basedon their length: only sentences with less or equal to10 words. Each sentence length has a set of 10 sen-tences. For example, there are 10 sentences of lengthtwo, 10 sentences of length three, and so on. Table


Table 9: Identification errors by number of wordsand characters.

12 and Table 11 show the results of the comparisonwhen we consider as an error any output with morethan one language.

Table 10: Identification errors by BM sentencelength (more than one language in the output).

The two language identifiers, Textcat and our LI,show very bad results with short BM texts. The re-sults are slightly better with short BI texts. Tworeasons may explain these results. Firstly, BI shorttexts contain more specific words (21 sentences over90 sentences) than short BM texts (12 sentences over90 sentences). Secondly, many short texts cannot beclearly identified as BI or BM as all words that formthe texts belong to both languages. 147 sentencesover the 180 sentences tested are in this situation. Ifwe take in account this second reason, the results ofthe two language identifiers are reviewed in Table 12.

It appears that Textcat performs slightly betterthan our language identifier. But since the expectedtask of a language identifier is to provide “the lan-guage” and not “the possible languages” of a givendocument, we are still comforting in our approach.Table 10 and Table 11 show that our LI is more pre-

Table 11: Identification errors by BI sentencelength (more than one language in the output).

Table 12: Identification errors by sentence length(21 BI sentences, 12 BM sentences, 147 BI/BM sen-tences).

cise than Textcat.

6. CONCLUSION

We have described in this paper the first languageidentifier that aims to guess close languages writtenin Roman alphabet. The method is simple and fast.The language identifier has been built to distinguishBM and BI texts but it can be applied for other closelanguages like American English and British English.The identification task is done in two steps. First, theBM or BI text is identified among other texts writ-ten in other languages. The use of trigrams providesacceptable results as it is shown in our results. Then,in the second step, the language of the unknown in-put text is identified clearly by applying two crite-ria: the presence of exclusive words and the format


of numbers. For the moment the two exclusive listsare not sufficient to cover all BM and BI texts. Wehave built manually the lists of exclusive words. Weare currently looking for a method that can extractautomatically these exclusive words from two alignedtexts of close languages. There is no doubt that get-ting a perfect language identifier for close languagesmust contain two successive filters: a simple statisti-cal method that is n-gram of characters, and a lookupto an exclusive list of words combined with some spe-cific spelling rules (e.g. format of numbers, presenceof diacritic characters, high frequency of some mor-phological endings, etc.).

ACKNOWLEDGMENTS

The author wishes to thank the anonymous review-ers for their valuable remarks and suggestions. Thelists of sentences and outputs provided by the testedlanguage identifiers presented in this paper are avail-able on request. The author gratefully acknowledgesUniversiti Sains Malaysia through Research Creativ-ity and Management Office for their financial support(USM Short Grant, 304/PKOMP/634121). The au-thor thanks Ng Pek Kuan for implementing the lan-guage identifier in Java and making it available online(http://utmk.cs.usm.my:8080/identifier/Identifier.j-sp).

References

[1] S. Langer. Natural languages and the World WideWeb, Bulag, Revue annuelle, Presses Universi-taires Franc-Comtoises, S. pp. 89-100, 2001.

[2] Web Languages Hit Parade,http://alis.isoc.org/palmares.en.html.(07/11/2006)

[3] SILC (Systeme dIdentification de la Langueet du Codage), http://rali.iro.umontreal.ca/.(07/11/2006)

[4] S. Lewis, K. McGrath, J. Reuppel, LanguageIdentification and Language Specific Letter-to-Sound Rules, Colorado Research in Linguistics,Vol. 17, Issue 1, 2004.

[5] O. Asmah, The Malay Language In Malaysia AndIndonesia: From Lingua Franca To National Lan-guage,. The Asianists ASIA, vol. II, 2001.

[6] D. A. Aziz, The Development of the MalayLanguage: Contemporary Challenges, 2003,http://english.gdufs.edu.cn/ArticleShow.asp?Ar-ticleID = 182.(07/11/2006)

[7] R. G. Jr. Gordon (ed.), Ethnologue: Lan-guages of the World, Fifteenth edition, Dallas,Texas: SIL International, 2005, Online version:http://www.ethnologue.com/. (07/11/2006)

[8] E. Giguet, Categorization according to Language:A step toward combining Linguistic Knowledgeand Statistic Learning, Proceedings of the Inter-national Workshop of Parsing Technologies, Sept.1995, Prague - Karlovy Vary, Czech Republic.

[9] G. Grefenstette, Comparing two language identi-fication schemes, JADT 1995, 3rd InternationalConference on Statistical Analysis of TextualData, Dec. 1995, Rome, pp. 11-13.

[10] A. Poutsma, Applying Monte Carlo Techniquesto Language Identification, Language and Com-puter, Computational Linguistics in the Nether-lands 2001, Selected Papers from the TwelfthCLIN Meeting, Edited by M. Theune, A. Nijholtand H. Hondord, pp. 179-189, 2001.

[11] E. Giguet, Multilingual Sentence Categorizationaccording to Language, Proceedings of the Eu-ropean Chapter of the Association for Computa-tional Linguistics SIGDAT Workshop From textto tags: Issues in multilingual Language Analysis,March 1995, Dublin, Ireland, pp. 73-76.

[12] C. Souter, G. Churcher, J. Hayes, J. Hughes S.Johnson, Natural Language Identification usingCorpus-Based Models, Hermes Journal of Lin-guistics, no. 13, pp. 183-203, 1994.

[13] S. Kulikowski, Stan, Using short words: alanguage identification algorithm, UnpublishedTechnical Report, 1991.

[14] M. Damashek, Gauging Similarity via N-Grams:Language-Independent Sorting, Categorization,and Retrieval of Text, Science, Vol. 267, pp. 843-848,1995.

[15] K. R. Beesley, Language Identifier: A computerprogram for automatic natural-language identi-fication of on-line text, Language at Crossroads:Proceedings of the 29th Annual Conference of theAmerican Translators Association, Oct. 1988, pp.47-54.

[16] W. B. Cavnar, J. M. Trenkle, N-Gram-BasedText Categorization, Proceedings of SDAIR-94,3rd Annual Symposium on Document Analysisand Information Retrieval, April 1994, Las Vegas,pp. 161-175.

[17] M. Padr?, L. Padro, Comparing methods forlanguage identification, Proceedings of the XXCongress de la Sociedad Espanola para el Proce-samiento del Lenguaje Natural (SEPLN), July2004, Barcelona, Spain, pp. 155-161.

[18] J. M. Prager, Linguini: Language Identificationfor Multilingual Documents, Proceedings of the32nd Hawaii International Conference on SystemSciences, 1999, Hawaii, pp. 1-11.

[19] Textcat Language Guesser,http://odur.let.rug.nl/ vanno-ord/TextCat/Demo/. (07/11/2006)

[20] Lextek Language Identifier SDK,http://www.languageidentifier.com/.(07/11/2006)


Bali Ranaivo-Malancon was bornin Madagascar. She received a Ph.D.in Natural Language Processing (NLP)from the National Institute for Ori-ental Languages and Civilisations (IN-ALCO), Paris, France, in 2001. Since2002, she has been with the Schoolof Computer Sciences, Universiti SainsMalaysia, Penang, Malaysia, where sheis currently a Lecturer. Her researchinterests include the development of a

generic language identifier for close languages and the creationof NLP tools for the analysis and annotation, of Malay andIndonesian texts.