Bilingual terminology extraction
Vít Baisa
6th Sketch Engine Workshop
Herstmonceux, August 10, 2015
Vít Baisa (Lexical Computing Ltd.) August 10, 2015 1 / 12
Terminology extraction: recap
combination of rules & statistics
languages: Czech, Dutch, English, French, German, Chinese Simplified, Chinese Traditional, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish
you can help us to add your language
currently in progress: Turkish, Hungarian
Terminology extraction: what is a term?
unithood: how is it grammatically defined? (e.g. noun phrases)
termhood: does it belong to a domain?
Unithood
Sketch Grammar formalism
CQL (corpus query language) rules

# English, computer mouse
*COLLOC "%(2.lc)_%(1.lc)"
2:[tag=="NN" | tag=="JJ" | tag=="VVG"] 1:[tag=="NN"]

# German, kleines Haus
*COLLOC "%(2.adj_stem)%(1.gender_ending)_%(1.lemma_cap)"
2:[kind="ADJA"] 1:[kind="N"] & 1.case = 2.case

# Czech, Ústav národního zdraví
*COLLOC "%(1.gender_lemma)_%(2.lc)_%(3.lc)"
1:noun 2:adj_genitive 3:noun_genitive & agree(2,3)
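
To make the English rule concrete, here is a minimal Python sketch of the same idea: scan adjacent tagged tokens, keep pairs where a noun, adjective or gerund is followed by a noun, and emit the lowercased pair joined by an underscore, mirroring the "%(2.lc)_%(1.lc)" template. This is only an illustration, not how Sketch Engine evaluates its grammars; the function name and the input format (a list of (word, tag) pairs) are assumptions.

```python
from typing import List, Tuple

# Tags allowed in the modifier position (label 2 in the rule) and the head (label 1).
MODIFIER_TAGS = {"NN", "JJ", "VVG"}
HEAD_TAG = "NN"


def english_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return term candidates matching [NN|JJ|VVG] [NN] over adjacent tokens.

    Hypothetical helper: the input is a list of (word, tag) pairs.
    """
    candidates = []
    for (word2, tag2), (word1, tag1) in zip(tagged, tagged[1:]):
        if tag2 in MODIFIER_TAGS and tag1 == HEAD_TAG:
            # Same shape as the "%(2.lc)_%(1.lc)" output template.
            candidates.append(f"{word2.lower()}_{word1.lower()}")
    return candidates


sentence = [("The", "DT"), ("Computer", "NN"), ("mouse", "NN"),
            ("is", "VBZ"), ("broken", "JJ")]
print(english_candidates(sentence))  # ['computer_mouse']
```

The German and Czech rules additionally need agreement checks (case, gender), which this sketch does not attempt.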
Termhood
simple math parameter N
$$\frac{f_{focus} + N}{f_{ref} + N}$$
f is relative (per million) frequency of a term
the formula is used also for keyword extraction
N influences whether rare or frequent words are preferred
a reference corpus in the same language is needed
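
A small Python sketch of the score may help show how N shifts the preference between rare and frequent terms. The function names and the example frequencies are made up for illustration; only the formula itself comes from the slide.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Relative frequency: hits per million tokens."""
    return count * 1_000_000 / corpus_size


def simple_maths_score(f_focus: float, f_ref: float, n: float = 1.0) -> float:
    """(f_focus + N) / (f_ref + N), with both frequencies per million."""
    return (f_focus + n) / (f_ref + n)


# Hypothetical frequencies (per million) in the focus and reference corpora.
rare = {"f_focus": 10.0, "f_ref": 0.0}
frequent = {"f_focus": 1000.0, "f_ref": 200.0}

for n in (1.0, 100.0):
    print(n,
          round(simple_maths_score(n=n, **rare), 3),
          round(simple_maths_score(n=n, **frequent), 3))
# N = 1:   rare 11.0, frequent 4.98  (the rare term ranks higher)
# N = 100: rare 1.1,  frequent 3.667 (the frequent term ranks higher)
```

Adding N to the denominator also keeps the score defined when a term never occurs in the reference corpus.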
Fine-tuning: options
stoplists (blacklists)
simple math parameter
minimum frequency
minimum score
minimum character length
only alphanumerical strings
...
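
The filtering options above could be applied to scored candidates roughly as in the following Python sketch. Everything here is hypothetical: the `Candidate` class, the parameter names, the default thresholds and the sample data are illustrative and do not reflect Sketch Engine's actual settings or API. The simple math parameter is left out because it changes the score itself rather than filtering the list.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Candidate:
    term: str
    frequency: int
    score: float


def filter_candidates(candidates: Iterable[Candidate],
                      stoplist: Set[str],
                      min_frequency: int = 2,
                      min_score: float = 1.0,
                      min_length: int = 3,
                      alnum_only: bool = True) -> List[Candidate]:
    """Keep only candidates passing every enabled filter (assumed defaults)."""
    kept = []
    for c in candidates:
        if c.term.lower() in stoplist:
            continue  # stoplist (blacklist)
        if c.frequency < min_frequency or c.score < min_score:
            continue  # minimum frequency / minimum score
        if len(c.term) < min_length:
            continue  # minimum character length
        if alnum_only and not all(w.isalnum() for w in c.term.split()):
            continue  # every word must be purely alphanumeric
        kept.append(c)
    return kept


candidates = [
    Candidate("computer mouse", 12, 8.4),
    Candidate("et al", 40, 3.0),           # on the stoplist
    Candidate("x1", 5, 9.0),               # too short
    Candidate("e-mail", 7, 2.0),           # contains a hyphen
    Candidate("parallel corpus", 1, 5.0),  # too rare
]
print([c.term for c in filter_candidates(candidates, stoplist={"et al"})])
# ['computer mouse']
```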
Bilingual (multilingual) terminology extraction
recent development
parallel corpora needed
Two-step procedure
1. extraction of terms in source and target languages
2. counting co-occurrences of the terms
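
The second step can be sketched in Python as follows, assuming step 1 has already produced, for each aligned segment pair, the set of source terms and the set of target terms found in it. The slide only says that co-occurrences are counted; ranking the pairs by the Dice coefficient is an assumption added here for illustration, not necessarily the measure Sketch Engine uses. The function name and the sample data are also hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List, Set, Tuple


def rank_term_pairs(aligned: Iterable[Tuple[Set[str], Set[str]]]
                    ) -> List[Tuple[str, str, int, float]]:
    """Count co-occurrences of source/target terms over aligned segments.

    Returns (source term, target term, co-occurrence count, Dice score),
    best pairs first. Dice is an illustrative choice of association measure.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in aligned:
        src_freq.update(src_terms)
        tgt_freq.update(tgt_terms)
        # Every source term co-occurs with every target term in the segment.
        pair_freq.update(product(src_terms, tgt_terms))

    ranked = [
        (s, t, count, 2 * count / (src_freq[s] + tgt_freq[t]))
        for (s, t), count in pair_freq.items()
    ]
    ranked.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return ranked


# Hypothetical English-German aligned segments with terms already extracted.
segments = [
    ({"small house"}, {"kleines Haus"}),
    ({"small house", "garden"}, {"kleines Haus", "Garten"}),
    ({"garden"}, {"Garten"}),
]
for row in rank_term_pairs(segments):
    print(row)
```

Here "garden"/"Garten" and "small house"/"kleines Haus" each co-occur twice and get a Dice score of 1.0, while the two crossed pairs co-occur once and score 0.5.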
Bilingual terminology extraction
we need to evaluate the extraction properly
data can be saved as TBX
granularity affects quality
The (not so distant) future for BTE
parallel vs. comparable corpora
definition finding
finding term hypernyms and hyponyms
term thesaurus
the ultimate goal: one-click terminology :)
terminology consistency checking
multi- instead of bilingual extraction
The last slide
API available
IntelliWebSearch configurations
plugins for SDL, Kilgray products planned
one-off terminology extractions
promising results so far