Korean-Spanish CLIR System Development: Translation of Unknown Words
August 23, 2005
Qing Li
Information Retrieval and Natural Language Processing Lab.
Information and Communications University (ICU)
Outline
• Introduction
• Unknown Word Translation Method
• Evaluation
• Unknown word module Demo
How to retrieve documents?
Crawler
Search Engine
Query
CLIR
Crawling Spanish doc.
Search Engine indexed Spanish doc.
Spanish Query
Korean Person
Translation Module
Korean Query
Spanish Query
Unknown words translation
Unknown words:
• Words that are not currently available in the bilingual language list
  – Named entities (person, organization, and location)
  – Book/movie titles
  – Terminology (medical, sci&tech, military, …)
• Most of them are compound nouns
  – The meaning cannot be directly derived from the components
  – Translation requires more world knowledge
• Important for NLP applications:
  – Machine Translation (MT)
  – Cross-lingual Information Retrieval (CLIR)
  – Question Answering (QA)
Our strategy
Manually construct an unknown-word list
• Accurate, but almost impossible in practice
A better strategy is to construct that list automatically
• Mine the web for the corresponding translations of unknown words.
Searching the web for the translation?
Searching for parallel data on the web (e.g., STRAND; Resnik and Smith 2003)
Searching the web for the translation?
Searching comparable corpora on the web (Fung and Yee 1998)
Searching the web for the translation?
Anchor texts pointing to the same page (Lu et al. 2004)
Searching the web for the translation?
Mining mixed-language pages.
Mining translations from mixed-lang. pages
Crawling Chinese web pages that contain English text (Zhang and Vines, SIGIR 2004)
– Use Google to locate the web pages containing the Chinese terms
– English expressions that occur next to the Chinese terms are considered their translations
– Crawled 2 GB of web data; 1,168 distinct English terms were found, 61% of which are correct translations
Searching for the Chinese terms among English pages (Cheng et al., SIGIR 2004)
– Use Google to retrieve “English” pages containing the Chinese terms
– Extract translations from the snippets
– LiveTrans system
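The adjacency heuristic described above (expressions in the other script that occur next to the source term are treated as candidate translations) can be sketched roughly as follows. This is a minimal illustration, not the implementation of either cited system: the function name `adjacent_latin_expressions`, the separator set, and the definition of a "Latin run" are all assumptions made for the example.

```python
import re

# Assumption: a candidate is a run of Latin letters, spaces, dots,
# apostrophes, or hyphens that starts and ends with a letter.
LATIN_RUN = r"[A-Za-z][A-Za-z .'\-]*[A-Za-z]"
# Assumption: only whitespace, commas, colons, and ASCII or full-width
# parentheses may sit between the term and the candidate.
SEP = r"[\s(),:()]*"


def adjacent_latin_expressions(text, term):
    """Return Latin-script expressions directly before or after `term`."""
    t = re.escape(term)
    before = re.findall("(" + LATIN_RUN + ")" + SEP + t, text)
    after = re.findall(t + SEP + "(" + LATIN_RUN + ")", text)
    return before + after


snippet = "世界卫生组织(World Health Organization)发布了报告"
candidates = adjacent_latin_expressions(snippet, "世界卫生组织")
```

In a real pipeline the input would be search-engine snippets rather than a hand-written string, and the candidates would still need to be ranked.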
Our method
– Preprocessing
– Multiple features
  • Phonetic
  • Physical structure
    – Parentheses
  • Frequency-length
– Feature fusion
Preprocessing – case 1
Preprocessing – case 2
Frequency-length model
\[ w_{FL}(c_i) = \alpha \times \frac{\mathrm{len}(c_i)}{\mathrm{len}_{\max}} + (1 - \alpha) \times \frac{\mathrm{Freq}(c_i)}{\mathrm{Freq}_{\max}} \]
Candidate      Freq  Len  Weight_FL
Moo hyun       4     2    1.67
Roh Moo hyun   4     3    2
Moo            4     1    1.33
Roh Moo        4     2    1.67
…
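As a rough illustration of the frequency-length weight, the sketch below computes α × len/len_max + (1 − α) × Freq/Freq_max for the candidates listed in the table. The function name, the choice of α = 0.5, and measuring length in whitespace-separated tokens are assumptions; the slides do not state them. The Weight_FL values in the table appear to equal len/len_max + Freq/Freq_max, which is twice this weight at α = 0.5, so the ranking is the same.

```python
def frequency_length_weights(freqs, alpha=0.5):
    """Map each candidate to alpha*len/len_max + (1-alpha)*freq/freq_max.

    `freqs` maps a candidate string to its frequency count.
    Length is counted in whitespace-separated tokens (an assumption).
    """
    lengths = {c: len(c.split()) for c in freqs}
    len_max = max(lengths.values())
    freq_max = max(freqs.values())
    return {
        c: alpha * lengths[c] / len_max + (1 - alpha) * freqs[c] / freq_max
        for c in freqs
    }


# Candidates and frequencies taken from the table on this slide.
counts = {"Moo hyun": 4, "Roh Moo hyun": 4, "Moo": 4, "Roh Moo": 4}
weights = frequency_length_weights(counts)
best = max(weights, key=weights.get)
```

With equal frequencies, the longest candidate ("Roh Moo hyun") receives the highest weight, which is the behaviour the length term is meant to produce.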
Physical structure model
Candidates are retrieved from parentheses
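One simple way to retrieve candidates from parentheses is a regular expression that captures the text enclosed in parentheses immediately after the unknown word. The sketch below is only an assumed reading of this model: the function name `parenthesis_candidates` is hypothetical, and the pattern accepts both ASCII and full-width parentheses.

```python
import re


def parenthesis_candidates(text, term):
    """Return the strings found in parentheses right after `term`."""
    pattern = re.compile(
        re.escape(term) + r"\s*[((]([^()()]+)[))]"
    )
    return [m.group(1).strip() for m in pattern.finditer(text)]


page = "유엔의 코피 아난(Kofi Annan) 사무총장, 코피 아난 (Annan) 총장"
found = parenthesis_candidates(page, "코피 아난")
```

Each occurrence contributes one candidate, so repeated pages also give a frequency count for free.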
Phonetic model
• Captures phonetic similarity
• Person, location, and brand names
• Probabilistic surface string alignment
• Romanized source phrase vs. target phrase
• Letters are aligned according to their pronunciation similarity (not orthographic forms)
Huang, Vogel and Waibel, Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization, ACL 03 Multilingual NE Recognition Workshop
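To give a feel for letter alignment by pronunciation similarity, the sketch below uses a weighted edit distance in which substituting similar-sounding letters is cheaper than substituting unrelated ones. This is a deliberately simplified stand-in, not the probabilistic alignment model of Huang, Vogel and Waibel: the letter groups in `GROUPS`, the costs 0.5 and 1.0, and all function names are assumptions chosen for illustration.

```python
# Assumption: hand-picked groups of letters treated as sounding alike.
GROUPS = ["bpv", "dt", "gkcq", "lr", "sz", "iy", "ou", "mn"]


def _letters(s):
    """Lowercase the string and keep alphabetic characters only."""
    return [ch for ch in s.lower() if ch.isalpha()]


def _sub_cost(a, b):
    if a == b:
        return 0.0
    if any(a in g and b in g for g in GROUPS):
        return 0.5  # similar pronunciation: cheap substitution
    return 1.0


def phonetic_distance(source, target):
    """Weighted edit distance between a romanized source and a target."""
    s, t = _letters(source), _letters(target)
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,        # delete a
                           cur[j - 1] + 1.0,     # insert b
                           prev[j - 1] + _sub_cost(a, b)))
        prev = cur
    return prev[-1]


def phonetic_similarity(source, target):
    """Scale the distance into [0, 1]; 1.0 means identical letters."""
    longest = max(len(_letters(source)), len(_letters(target)))
    if longest == 0:
        return 1.0
    return 1.0 - phonetic_distance(source, target) / longest
```

For example, `phonetic_similarity("Kofi", "Cofi")` is higher than `phonetic_similarity("Kofi", "Mary")` because k and c fall in the same group.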
Feature combination
Stage 1:
• Combine the results from the physical structure model and the frequency-length model.
• Rank the candidates by weight.
Stage 2:
• If the phonetic-model weights of some candidates are larger than a certain threshold, move those candidates to the top and rerank them by their phonetic-model weight.
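The two stages can be sketched as below. The slides do not say how the physical structure and frequency-length results are combined, so adding a fixed bonus to candidates found in parentheses is an assumption, as are the function name, the default `paren_bonus`, the default `threshold`, and the example scores.

```python
def rank_candidates(fl_weights, paren_candidates, phonetic_weights,
                    paren_bonus=1.0, threshold=0.8):
    """Return (stage-1 ranking, stage-2 ranking) as lists of candidates."""
    # Stage 1: frequency-length weight plus an assumed bonus for
    # candidates that were also retrieved from parentheses.
    score = {
        c: w + (paren_bonus if c in paren_candidates else 0.0)
        for c, w in fl_weights.items()
    }
    stage1 = sorted(score, key=lambda c: -score[c])

    # Stage 2: candidates whose phonetic weight exceeds the threshold
    # move to the top, ordered by phonetic weight; the rest keep
    # their stage-1 order.
    promoted = [c for c in stage1 if phonetic_weights.get(c, 0.0) > threshold]
    promoted.sort(key=lambda c: -phonetic_weights[c])
    rest = [c for c in stage1 if c not in promoted]
    return stage1, promoted + rest


# Hypothetical scores, for illustration only.
fl = {"Annan": 0.9, "Kofi Annan": 0.8, "UN": 1.0, "Secretary": 0.7}
phon = {"Kofi Annan": 0.95, "Annan": 0.85, "UN": 0.1}
stage1, stage2 = rank_candidates(fl, {"UN"}, phon)
```

In this made-up example, stage 1 puts "UN" first because of the parenthesis bonus, and stage 2 moves the two phonetically similar candidates above it.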
Sample
English unknown word: Kofi Annan (UN Secretary-General)
Result from stage 1
Result from stage 2
One problem
• Web pages that contain both Korean and Spanish words are limited in number.
• Solution: mine the web pages that contain both Korean and English words to obtain the corresponding English translations of the unknown words.
• Is this reasonable?
  – Unknown words are often spelled the same in English and Spanish
  – Chilean websites contain many pages written in English
  – Even within a single line of a Chilean web page, some words are written in English.
Reason
Evaluation
Test set
• 300 manually selected key phrases
• Manual translations as reference
• One phrase may have several correct translations
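Because one phrase may have several correct translations, a natural way to score such a test set is to count a phrase as correct when any of its top-n candidates matches any of its reference translations. The slides do not name the metric that was used, so the top-n accuracy below is an assumption, and the function name and placeholder data are hypothetical.

```python
def top_n_accuracy(results, references, n=1):
    """Share of phrases whose top-n candidates contain a reference.

    `results` maps a phrase to its ranked candidate list;
    `references` maps a phrase to its accepted translations.
    Matching is case-insensitive exact match (an assumption).
    """
    hits = 0
    for phrase, refs in references.items():
        accepted = {r.lower() for r in refs}
        top = [c.lower() for c in results.get(phrase, [])[:n]]
        if any(c in accepted for c in top):
            hits += 1
    return hits / len(references)


# Placeholder data, not results from the presentation.
refs = {"phrase_1": ["Kofi Annan", "Annan"], "phrase_2": ["Roh Moo hyun"]}
ranked = {"phrase_1": ["annan", "UN"], "phrase_2": ["Moo hyun", "Roh Moo hyun"]}
top1 = top_n_accuracy(ranked, refs, n=1)
top2 = top_n_accuracy(ranked, refs, n=2)
```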
Overall Translation Quality
Snapshot of translation result
On-going work
We have collected 40,000 Korean unknown words from a newspaper collection.
We will translate those words into Spanish/English to extend the current bilingual language list.
We will continue to refine the method by applying techniques from the MT field.
Demo
http://220.69.185.118:8080/tran/
References
Fung, P. and Yee, L. Y. An IR Approach for Translating New Words from Nonparallel, Comparable Texts. In Proc. of COLING-ACL, pp. 414-420, 1998.
Huang, F., Vogel, S., and Waibel, A. Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization. In Proc. of the 41st ACL, Workshop on Multilingual and Mixed-Language Named Entity Recognition, Sapporo, Japan, July 2003.
Lu, W.-H., Chien, L.-F., and Lee, H.-J. Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach. ACM Transactions on Information Systems 22(2), pp. 242-269, 2004.
Resnik, P. and Smith, N. A. The Web as a Parallel Corpus. Computational Linguistics 29(3), pp. 349-380, 2003.
Zhang, Y. and Vines, P. Detection and Translation of OOV Terms Prior to Query Time. In Proc. of SIGIR '04, pp. 524-525, ACM Press, 2004.
Zhang, Y., Huang, F., and Vogel, S. Mining Translations of OOV Terms from the Web through Cross-lingual Query Expansion. In Proc. of SIGIR ’05.
Thank you.
Questions are welcome!