The CLUES database: automated search for linguistic cognates


Overview of the design of the CLUES database, developed as an aid to the comparative method in historical linguistics, and of the strategies used to detect correlate forms (potential cognates), including the metrics used to rate similarity of form and meaning.


  • 1. The CLUES database: automated search for cognate forms. Australian Linguistics Society Conference, Canberra, 4 December 2011. Mark Planigale (Mark Planigale Research & Consultancy), Tonya Stebbins (RCLT, La Trobe University)

2. Introduction
Overview of the design of the CLUES database, being developed as a tool to aid the search for correlates across multiple datasets. The linguistic model underlying the database. Key issues in developing the methodology. Examples of output from the database. Because the design of CLUES is relatively generic, it is potentially applicable to a wide range of languages, and to tasks other than correlate detection.

3. Context

4. What is CLUES?
Correlate Linking and User-defined Evaluation System. A database designed to handle lexical data from multiple languages simultaneously, using add-on modules for comparative functions. Primary purpose: identify correlates across two or more languages. A correlate is a pair of lexemes which are similar in phonetic form and/or meaning. The linguist assesses which of the identified correlates are cognates, and which are similar for some other reason (borrowing, universal tendencies, accidental similarity). CLUES allows the user to adjust the criteria used to evaluate the degree of correlation between lexemes, and can store, filter and organise the results of comparisons.

5. Computational methods in historical linguistics
Lexicostatistics; typological comparison; phylogenetics; phoneme inventory comparison; modelling the effects of sound change rules; correlate search (the niche CLUES occupies).

6. A few examples
Lowe & Mazaudon 1994: the Reconstruction Engine (models the operation of proposed sound change rules as a means of checking hypotheses). Nakhleh et al. 2005: Indo-European, phylogenetic. Holman et al. 2008: Automated Similarity Judgment Program; 4350 languages; 40 lexical items (edit distance); 85 most stable grammatical (typological) features from the WALS database. Austronesian Basic Vocabulary Database: 874 mostly Austronesian languages, each represented by around 210 words (the project had a phylogenetic focus; some manual comparative work was done in preparing the data). Greenhill & Gray 2009: Austronesian, phylogenetic. Dunn, Burenhult et al.
2011: Aslian. Proto-TaioMatic: merges, searches, and extends several wordlists and proposed reconstructions of Proto-Tai and Southwestern Tai.

7. Broad vs. deep approaches to automated lexical comparison
Broad and shallow: relatively large language sample; vocabulary constrained to a standardised wordlist (e.g. Swadesh 200, 100 or 40); purpose: establish (hypothesised) genetic relationships; methods: lexicostatistics, phylogenetics; typical metrics: phonetic (e.g. edit distance), typological (shared grammatical features), maximum likelihood.
Narrow and deep: relatively small language sample; all available lexical data for the selected languages; purpose: linguistic and/or cultural reconstruction, modelling of language contact and semantic shift; method: the comparative method with fuzzy matching; typical metrics: phonetic (e.g. edit distance), semantic, grammatical.
CLUES comparisons can be constrained to core vocabulary (using the wordlist feature); however, it is intended to be used within a narrow and deep approach.

8. Design of CLUES

9. CLUES: Desiderata
Accuracy: results agree with human expert judgment; minimisation of false positives and negatives. Validity: the computed similarity level does measure degree of correlation, and varies directly with cognacy. Reliability: like results for like comparison pairs; like results for a single comparison pair on repetition. Generalisability: the system performs accurately on new (unseen) data as well as on the data the similarity metrics were trained on. Efficiency: comparisons are performed fast enough to be useful.

10. Lexical model (partial)
(Diagram: a Language has many Lexemes; a Lexeme carries orthography, part of speech, source and temporal information, and has one or more Senses and Written forms; a Sense has a Gloss and a Semantic domain and may link to a Wordlist item; a Written form maps to Phones.) 11.
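As a rough illustration only, the lexical model above might be rendered as Python dataclasses. All class and field names here are our own sketch, not the actual CLUES schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sense:
    gloss: str
    semantic_domain: str                 # e.g. "A3"
    wordlist_item: Optional[str] = None  # link to a standard wordlist, if any

@dataclass
class WrittenForm:
    orthography: str
    phones: List[str] = field(default_factory=list)  # derived phonetic form

@dataclass
class Lexeme:
    language: str
    part_of_speech: str
    source: Optional[str] = None
    senses: List[Sense] = field(default_factory=list)
    written_forms: List[WrittenForm] = field(default_factory=list)

# Example entry: the Mali lexeme kunngga 'sun' (cf. the worked examples below)
kunngga = Lexeme(
    language="Mali",
    part_of_speech="N",
    senses=[Sense(gloss="sun", semantic_domain="A3")],
    written_forms=[WrittenForm(orthography="kunngga")],
)
```

A structure along these lines gives the three comparison dimensions (form, meaning, grammar) distinct homes in the data model.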
11. Three dimensions of lexical similarity
Phonetic/phonological (phonetic form of the lexeme): written form, mapped to phonetic content. Semantic (meaning of the lexeme): semantic domain; gloss. Grammatical (grammatical features of the lexeme): word class. In the context of correlate detection, grammatical features may be of interest as a dis-similarising feature for lexemes that are highly correlated on form and meaning.

12. What affects the results?
Selection and evaluation of metrics: the choice of appropriate formal (quantifiable) criteria for similarity; impact: validity of results, generalisability of the system. Inconsistent representations: systematic differences in the representations used for different data sets within the corpus; impact: validity of results. Noise: random fluctuations within the data that obscure the true value of individual data items but do not change the underlying nature of the distribution; impact: reliability of data and results; less controllable.

13. CLUES: Managing representational issues
Automated generation of phonetic form(s) from written form(s); where required, manual standardisation to common lexicographic conventions; manual assignment to a common ontology (semantic domain set); automated mapping onto a shared common set of grammatical features, values and terms.

14. Calculating similarity

15. Similarity scores
The overall score is a weighted total (weights w5, w6, w7) of three subtotals: form, meaning and grammar. Each subtotal is in turn a weighted combination (weights w1-w4) of base similarities: written form similarity (form subtotal); semantic domain similarity and gloss similarity (meaning subtotal); word class similarity (grammar subtotal).

16. Ura unga vs. Mali kunngga 'sun' (example 4a)
Written form(s): unga vs. kunngga, base similarity 0.896 (weight 1.0); form subtotal 0.896, subtotal weight 0.45. Gloss(es): sun vs. sun, 1.0; semantic domain(s): A3 vs. A3, 1.0; meaning subtotal 1.0, subtotal weight 0.45. Word class: N vs. N, 1.0; grammar subtotal weight 0.1. Overall score: 0.953.

17. Sulka kolkha 'sun' vs.
Mali dulka 'stone' (example 4b)
Written form(s): kolkha vs. dulka, base similarity 0.828 (weight 1.0); form subtotal 0.828, subtotal weight 0.45. Gloss(es): sun vs. stone, 0.0 (weight 0.5); semantic domain(s): A3 vs. A5, 0.333 (weight 0.5); meaning subtotal 0.167, subtotal weight 0.45. Word class: N vs. N, 1.0; grammar subtotal weight 0.1. Overall score: 0.548.
With adjusted weights (example 4c): form subtotal 0.828, subtotal weight 0.7; gloss 0.0 (weight 0.5), semantic domain 0.333 (weight 0.0); meaning subtotal 0.0, subtotal weight 0.2; word class 1.0, grammar subtotal weight 0.1. Overall score: 0.68.

18. Sample results: across domains (5a)
A small set of lexical data from 7 languages; a symmetrical matrix of overall scores for nine items: kabarak 'blood' (tau), unga 'sun' (ura), kre 'stone' (sul), ka ptaik 'skin' (sul), kunngga 'sun' (mal), slp 'bone' (ura), ltigi 'fire' (qaq), lt 'light a fire' (mal), slpki 'bone' (mal). The highest off-diagonal scores pick out plausible correlates: unga vs. kunngga 'sun' 0.948; slp vs. slpki 'bone' 0.8905; ltigi 'fire' vs. lt 'light a fire' 0.6945; kabarak 'blood' vs. ka ptaik 'skin' 0.657.

19. Sample results: within a domain (5b)
Overall similarity scores for eight lexemes, each labelled with language and semantic domain (N M1 for the fighting stone; N A5 for the rest).
The lexemes: dududul (kua) 'fighting stone'; dul (qaq), dul (ura), dulka (mal), aaletpala (tau), kre (sul), vat (kua) and fat (sia), all 'stone'. Highlights of the symmetrical matrix: dul (qaq) vs. dul (ura) 1.0 (identical forms); dul vs. dulka 0.875; vat vs. fat 0.9805; dududul scores low (0.30-0.43) against all the 'stone' items.

20. Sample results: within a domain, form similarity only (5c)
The same lexemes compared on form similarity alone: dul vs. dul 1.0; dul vs. dulka 0.75; vat vs. fat 0.961; dududul vs. dul 0.6.

21. Metrics
A wide variety of metrics can be implemented and plugged into the comparison strategy. Metrics return a real value in the range [0.0, 1.0] representing the level of similarity of the items being compared. The user can control which set of metrics is used, can apply multiple comparison strategies to the same data set, and can store and compare the results. The metrics discussed here are those used to produce the sample results. General principle: when seeking the best match, prefer false positives to false negatives.
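The weighted scoring used in the worked examples above can be sketched in a few lines of Python. This is a minimal illustration assuming the subtotal weights sum to 1.0; the function and dict names are ours, not CLUES internals:

```python
def weighted_score(values, weights):
    """Weighted sum of base similarities or subtotals.

    Both arguments are dicts keyed by component name; the weights are
    assumed to sum to 1.0, so the result stays in [0.0, 1.0].
    """
    return sum(values[k] * weights[k] for k in values)

# Example 4a: Ura unga vs. Mali kunngga 'sun'
subtotals = {"form": 0.896, "meaning": 1.0, "grammar": 1.0}
weights   = {"form": 0.45,  "meaning": 0.45, "grammar": 0.1}
print(round(weighted_score(subtotals, weights), 3))  # 0.953
```

Re-running with the example 4c weights (form 0.7, meaning 0.2, grammar 0.1, and a meaning subtotal of 0.0) reproduces the 0.68 score, showing how user-adjusted weights change the ranking of a candidate pair.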
22. Phonetic form similarity metric
Edit distance with a phone substitution probability matrix. f1, f2 := the phonetic forms being compared (lists of phones generated automatically from written forms, or transcribed manually). Apply the edit distance algorithm to f1 and f2 with the following costs: deletion cost = 1.0 (constant); insertion cost = 1.0 (constant); substitution cost = 2 x (1 - sp), where sp is phone similarity, so substitution cost falls in the range [0.0, 2.0]. dmin := the minimum edit distance for f1 and f2; dmax := the maximum possible edit distance for f1 and f2 (the sum of the lengths of f1 and f2). Similarity = 1 - (dmin / dmax). This finds a maximal unbounded alignment of the two forms; it can also be understood as detecting the contribution of each form to a putative combined form. Examples: mbias vs. biaska: dmin = 3, dmax = 11, similarity = 1 - (3/11) = 0.727 (combined form mbiaska). vat vs. fat: dmin = 0.236, dmax = 6, similarity = 1 - (0.236/6) = 0.96 (combined form {v,f}at).

23. Phone similarity metric
Phone similarity sp for a pair of phones is a real number in the range [0, 1], drawn from a phone similarity matrix. The matrix is calculated automatically on the basis of a weighted sum of similarities between the phonetic features of the two phones. Examples of phonetic features include nasality (universal), frontness (vowels), and place of articulation (consonants). Each phonetic feature has a set of possible values and a similarity matrix for these values; this matrix is user-editable. The feature similarity matrix should reflect the probability of various paths of diachronic change. It is possible to under-specify feature values for phones. The similarity of a phone with itself is always 1.0. Default similarities can be overridden for particular phones (universal) and/or phonemes (language pair-specific).
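The cost scheme above amounts to a weighted Levenshtein distance. The sketch below is a minimal illustration; the `sp` callable stands in for a lookup into the phone similarity matrix, and the toy identity-based `sp` is our assumption, not the CLUES matrix:

```python
def form_similarity(f1, f2, sp):
    """Edit distance with deletion/insertion cost 1.0 and substitution
    cost 2 * (1 - sp(a, b)), normalised as 1 - dmin / dmax, where
    dmax = len(f1) + len(f2)."""
    m, n = len(f1), len(f2)
    # dp[i][j] = minimum edit distance between f1[:i] and f2[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)                      # i deletions
    for j in range(1, n + 1):
        dp[0][j] = float(j)                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 2.0 * (1.0 - sp(f1[i - 1], f2[j - 1]))
            dp[i][j] = min(dp[i - 1][j] + 1.0,   # delete
                           dp[i][j - 1] + 1.0,   # insert
                           dp[i - 1][j - 1] + sub)  # substitute
    dmax = m + n
    return 1.0 - dp[m][n] / dmax if dmax else 1.0

# Toy phone similarity: identical phones 1.0, otherwise 0.0
sp = lambda a, b: 1.0 if a == b else 0.0
print(round(form_similarity(list("mbias"), list("biaska"), sp), 3))  # 0.727
```

With a real phone similarity matrix, a near-identical pair such as v/f would get a small substitution cost rather than the full 2.0, which is how the vat vs. fat example reaches 0.96.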
24. Semantic domain similarity metric
Depth of the deepest subsumer as a proportion of the maximum local depth of the semantic domain tree. n1, n2 := the semantic domains being compared (nodes in the semantic domain tree). S := the subsumer: the deepest node in the semantic domain tree that subsumes both n1 and n2. ds := the depth of S in the tree (path length from the root node to S). dm := the maximum local depth of the tree (length of the longest path from the root node to an ancestor of n1 or n2). Similarity = ds / dm. Examples (for a tree with root A, children B and C, nodes D and E under B, and F under D): F vs. F = 1.0; D vs. E = 0.333; B vs. C = 0.0. See also Li et al. (2003).

25. Gloss similarity metric
A crude sentence comparison metric: the proportion of tokens in common. g1, g2 := the glosses being compared. r1, r2 := the reduced glosses (after removal of stop words, e.g. a, the, of). len1, len2 := the lengths of r1 and r2 (in tokens). L := max(len1, len2). If L = 0, similarity = 1.0; otherwise C := the count of common tokens (tokens that appear in both r1 and r2), and similarity = C / L. Examples: house vs. house = 1.0; house vs. a house = 1.0; house vs. raised sleeping house = 0.333; house vs. hut = 0.0. This metric needs refinement.

26. Conclusion

27. Possible extensions; unresolved questions
Extensions: find borrowings; detect duplicate lexicographic entries; orthographic conversion; ... Analytical questions: How should tone be represented and incorporated within phonetic comparison? Should the phonetic feature system be multi-valued or binary? At what level should segmentation occur (phone, phone sequence or phoneme)? The edit distance metric may be improved by privileging uninterrupted identical sequences. Elaborate semantic matching: more sophisticated approaches using taxonomies (e.g. WordNet, with some way to map lexemes onto concepts) or compositional semantics (primitives). Performance: since comparison is parameterised, it may be possible to use genetic algorithms to optimise performance; a quantitative way to evaluate the performance of the system is needed. Relation to theory: How much theory is embedded in the instrument? What effect does this have on results?
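The two meaning metrics (semantic domain and gloss similarity, slides 24 and 25) can be sketched directly from their definitions. The toy domain tree mirrors the A-F example on the slide, and the stop-word list is the slide's illustrative one; neither is the actual CLUES ontology:

```python
# Toy semantic domain tree from the slide: child -> parent (root "A").
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "D"}

def path_to_root(node):
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path  # node ... root

def domain_similarity(n1, n2, max_depth):
    """Depth of the deepest common subsumer divided by the maximum
    local depth of the tree (3 for the toy tree: A -> B -> D -> F)."""
    anc1 = set(path_to_root(n1))
    ds = max(len(path_to_root(a)) - 1 for a in path_to_root(n2) if a in anc1)
    return ds / max_depth

STOP_WORDS = {"a", "the", "of"}  # illustrative list from the slide

def gloss_similarity(g1, g2):
    """Proportion of stop-word-filtered tokens common to both glosses."""
    r1 = [t for t in g1.split() if t not in STOP_WORDS]
    r2 = [t for t in g2.split() if t not in STOP_WORDS]
    L = max(len(r1), len(r2))
    if L == 0:
        return 1.0
    return len(set(r1) & set(r2)) / L

print(round(domain_similarity("D", "E", 3), 3))                 # 0.333
print(round(gloss_similarity("house", "raised sleeping house"), 3))  # 0.333
```

Both functions reproduce the slides' worked examples, including the stop-word behaviour that makes "house" and "a house" identical.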
Inter-operability between databases is a key issue in the ultimate usability of the tool.

28. Acknowledgements
Thanks to Christina Eira, Claire Bowern, Beth Evans, Sander Adelaar, Friedel Frowein, Sheena Van Der Mark, and Nicolas Tournadre for their comments and suggestions on this project.

29. References
Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer, Anthony Grant, and Eric W. Holman. 2009. Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13: 167-179.
Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42.2: 331-354.
Atkinson et al. 2005.
Li, Yuhua, Zuhair Bandar, and David McLean. 2003. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 15.4: 871-882.
Lowe, John Brandon, and Martine Mazaudon. 1994. The Reconstruction Engine: a computer implementation of the comparative method. Computational Linguistics (Special Issue on Computational Phonology) 20.3: 381-417.
Nakhleh, Luay, Don Ringe, and Tandy Warnow. 2005. Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420. (From Bakker et al. 2009.)

