Automatic Identification of Cognates, False Friends, and Partial Cognates


  • Automatic Identification of Cognates, False Friends, and Partial Cognates

    University of Ottawa, Canada

  • Outline

    - Overview of the Thesis
    - Research Contribution
    - Cognate and False Friend Identification
    - Partial Cognate Disambiguation
    - CLPA - Cognate and False Friend Annotator
    - Conclusions and Future Work

  • Overview of the Thesis

    Tasks
    - Automatic Identification of Cognates and False Friends
    - Automatic Disambiguation of Partial Cognates

    Areas of Application
    - Computer-Assisted Language Learning (CALL), Machine Translation (MT), Word Alignment, Cross-Language Information Retrieval

    CALL Tool - CLPA

  • Definitions

    Cognates, or True Friends (Vrais Amis), are pairs of words that are perceived as similar and are mutual translations: nature - nature, reconnaissance - recognition.

    False Friends (Faux Amis) are pairs of words in two languages that are perceived as similar but have different meanings: main (= hand) - main (principal, essential), blesser (= to injure) - bless (bénir in French).

    Partial Cognates are words that share the same meaning in two languages in some but not all contexts: note - note, facteur - factor (or mailman, maker).

  • Research Contribution

    - Novel method based on ML algorithms to identify Cognates and False Friends
    - A method to create complete lists of Cognates and False Friends
    - Define a novel task, Partial Cognate Disambiguation, and solve it using a supervised and a semi-supervised method
    - Combine and use corpora from different domains
    - Implement a CALL tool, CLPA, to annotate Cognates and False Friends

  • Cognates and False Friends Identification

    Our method: Machine Learning techniques with different algorithms
    - Instances: French-English pairs of words
    - Feature space: 13 orthographic similarity measures
    - Classes: Cog_FF and Unrelated

    Experiments done for:
    - Each measure separately
    - Average of all measures
    - All 13 measures
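
    To make the feature space concrete, here is a minimal sketch of how a French-English word pair could be turned into orthographic similarity features. The slides do not list the 13 measures, so the three used here (longest common subsequence ratio, normalized edit distance, and common-prefix ratio) are assumptions chosen for illustration, and all function names are hypothetical.

```python
from typing import Dict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def common_prefix_length(a: str, b: str) -> int:
    """Number of identical leading characters."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def features(fr: str, en: str) -> Dict[str, float]:
    """Illustrative feature vector; each value lies in [0, 1]."""
    longest = max(len(fr), len(en))
    if longest == 0:
        return {"lcsr": 0.0, "ned": 0.0, "prefix": 0.0}
    return {
        "lcsr": lcs_length(fr, en) / longest,
        "ned": 1 - edit_distance(fr, en) / longest,
        "prefix": common_prefix_length(fr, en) / longest,
    }


def average_similarity(fr: str, en: str) -> float:
    """The 'average of all measures' setting, over the measures above."""
    values = list(features(fr, en).values())
    return sum(values) / len(values)
```

    Each measure on its own, the average, or the full vector could then be passed to a classifier that separates Cog_FF pairs from Unrelated pairs. A Cognate (nature - nature) and a False Friend (main - main) get the same high scores, which is consistent with the two being merged into one class at this stage.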

  • Cognates and False Friends Identification: Data

  • Results for classification (COG_FF/UNREL)


  • Complete Lists of Cognates and False Friends: Method

    - Use the XXDICE orthographic similarity measure
    - Use a list of pairs of words in two languages (the words may or may not be translations of each other), or monolingual lists of words
    - Use a bilingual dictionary to determine whether the words in a pair are translations of each other
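
    The method above can be sketched as a two-stage filter: an orthographic similarity test followed by a dictionary lookup. The sketch below substitutes the plain bigram Dice coefficient for XXDICE, which is a more elaborate variant that is not reproduced here. The threshold value, the toy dictionary, and all function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def bigrams(word: str) -> List[str]:
    """Adjacent character pairs of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice(a: str, b: str) -> float:
    """Plain bigram Dice coefficient (a stand-in for XXDICE, not XXDICE itself)."""
    counts_a, counts_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        # Words too short to have bigrams: fall back to exact match.
        return 1.0 if a == b else 0.0
    shared = sum((counts_a & counts_b).values())
    return 2 * shared / total


def label_pair(fr: str, en: str, dictionary: Dict[str, Set[str]],
               threshold: float = 0.5) -> str:
    """Label a French-English pair; the threshold is a hypothetical value."""
    if dice(fr, en) < threshold:
        return "UNRELATED"
    # Orthographically similar: the dictionary decides between the two classes.
    if en in dictionary.get(fr, set()):
        return "COGNATE"
    return "FALSE_FRIEND"


# Toy French-to-English dictionary, for illustration only.
toy_dictionary = {"nature": {"nature"}, "main": {"hand"}, "chien": {"dog"}}
```

    With the toy dictionary, nature - nature comes out as a Cognate, main - main as a False Friend, and chien - dog as unrelated because the two words share no bigrams.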

  • Complete Lists of Cognates and False Friends: Evaluation

    On the entry list of a French-English bilingual dictionary:
    - 55% Cognates
    - 2% False Friends (5,619,270 pairs)

    We created pairs of words from two large monolingual lists of words in French and English:
    - 11,469,662 orthographically similar pairs (0.8%)
    - 3,496 Cognates (0.03%)
    - 3,767,435 False Friends (32%)

  • Cognates and False Friends Identification: Conclusion

    We tested a number of orthographic similarity measures individually, and also combined using different Machine Learning algorithms

    We evaluated the methods on a training set using 10-fold cross-validation, and on a test set

    We proposed an extension of the method to create complete lists of Cognates and False Friends

    The results show that, for French and English, it is possible to achieve very good accuracy based on the orthographic measures of word similarity

  • Partial Cognate Disambiguation: Task

    To determine the sense/meaning (Cognate or False Friend with the equivalent English word) of a Partial Cognate in a French context.

    Example: note
    - COG: Le comité prend note de cette information. / The Committee takes note of this reply.
    - FF: Mais qui a dû payer la note? / So who got left holding the bill?

  • Data

    - A set of 10 Partial Cognates
    - Parallel sentences that have the French Partial Cognate on the French side and the English Cognate (English False Friend) on the English side, labeled as COG (FF)
    - Collected from Europarl and Hansard
    - ~115 sentences per class for training
    - ~60 sentences per class for testing

  • Supervised Method

    Traditional ML algorithms

    Features

    - used the bag-of-words (BOW) approach to modeling context, with binary feature values

    - context words from the training corpus that appeared at least 3 times in the training sentences

    Classes: COG and FF
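
    The feature construction described above can be sketched as follows. The tokenizer (lowercasing and splitting on word characters) is an assumption, since the slides do not say how the sentences were tokenized, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(sentence: str) -> List[str]:
    """Lowercase word tokens; an assumed, deliberately simple tokenizer."""
    return re.findall(r"\w+", sentence.lower())


def build_vocabulary(sentences: List[str], min_count: int = 3) -> List[str]:
    """Keep words that occur at least min_count times in the training sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return sorted(word for word, count in counts.items() if count >= min_count)


def to_binary_vector(sentence: str, vocabulary: List[str]) -> List[int]:
    """Binary BOW vector: 1 if the vocabulary word occurs in the sentence, else 0."""
    present = set(tokenize(sentence))
    return [1 if word in present else 0 for word in vocabulary]
```

    Each training sentence then becomes one binary vector paired with its COG or FF label, which is the input format most traditional classifiers expect.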

  • Monolingual Bootstrapping

    For each pair of partial cognates (PC):

    1. Train a classifier on the training seeds, using the BOW approach and an NB-K classifier with attribute selection on the features.

    2. Apply the classifier to unlabeled data: sentences that contain the PC word, extracted from LeMonde (MB-F) or from the BNC (MB-E).

    3. Take the first k newly classified sentences from both the COG and the FF class and add them to the training seeds (the most confident ones, with prediction confidence greater than or equal to a threshold of 0.85).

    4. Rerun the experiments, training on the new training set.

    5. Repeat steps 2 and 3 t times.

    End for
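
    The bootstrapping loop can be sketched as below for a single partial cognate. This is an illustration under several assumptions: a small presence-only naive Bayes classifier stands in for the NB-K classifier with attribute selection, "first k" is read as the k most confident sentences per class, and the value of k, the number of rounds, and all names are hypothetical. Only the 0.85 threshold comes from the slides.

```python
import math
import re
from collections import Counter
from typing import List, Set, Tuple

Example = Tuple[str, str]  # (sentence, label), label is "COG" or "FF"


def tokenize(sentence: str) -> Set[str]:
    """Set of lowercase word tokens (binary bag of words)."""
    return set(re.findall(r"\w+", sentence.lower()))


class NaiveBayes:
    """Simplified naive Bayes over word presence, with add-one smoothing."""

    def fit(self, examples: List[Example]) -> "NaiveBayes":
        self.labels = sorted({label for _, label in examples})
        self.doc_counts = Counter(label for _, label in examples)
        self.word_counts = {label: Counter() for label in self.labels}
        for sentence, label in examples:
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, sentence: str) -> Tuple[str, float]:
        """Return the most probable label and its posterior probability."""
        words = tokenize(sentence) & self.vocab
        total = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.labels:
            n = self.doc_counts[label]
            score = math.log(n / total)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / (n + 2))
            log_scores[label] = score
        top = max(log_scores.values())
        exp_scores = {l: math.exp(s - top) for l, s in log_scores.items()}
        best = max(exp_scores, key=exp_scores.get)
        return best, exp_scores[best] / sum(exp_scores.values())


def bootstrap(seeds: List[Example], unlabeled: List[str], k: int = 2,
              threshold: float = 0.85, rounds: int = 3) -> List[Example]:
    """Grow the training set with confidently classified unlabeled sentences."""
    train, pool = list(seeds), list(unlabeled)
    for _ in range(rounds):
        model = NaiveBayes().fit(train)              # train on current seeds
        scored = [(model.predict(s), s) for s in pool]
        added: List[Example] = []
        for label in model.labels:                   # top k per class
            confident = sorted(
                ((conf, s) for (lab, conf), s in scored
                 if lab == label and conf >= threshold),
                reverse=True)[:k]
            added.extend((s, label) for _, s in confident)
        if not added:
            break                                    # nothing confident enough
        train.extend(added)
        chosen = {s for s, _ in added}
        pool = [s for s in pool if s not in chosen]
    return train
```

    The returned list is the enlarged training set; retraining a classifier on it and evaluating on the held-out test sentences corresponds to step 4.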

  • Bilingual Bootstrapping

    1. Translate the English sentences that were collected in the MB-E step into French using an online MT tool and add them to the French seed training data.

    2. Repeat the MB-F and MB-E steps T times.

  • Additional Data

    - LeMonde: an average of 250 sentences for each class
    - BNC: an average of 200 sentences for each class
    - Multi-Domain corpus: an average of 80 sentences for each class

  • Results

  • Partial Cognate Disambiguation: Conclusions

    Simple methods and available tools are used successfully for a task that is hard even for humans

    Additional use of unlabeled data improves the learning process for the Partial Cognates Disambiguation task

    Semi-Supervised Learning proves to be as good as Supervised Learning

  • CLPA - Cross Language Pair Annotator

  • Future Work

    Apply the Cognate and False Friend Identification method and create complete lists for other pairs of languages

    Improve the accuracy of the Partial Cognate Disambiguation task

    Use lemmatization for French texts and human evaluation for CLPA

  • Thank you!