Upload
sauda
View
46
Download
0
Embed Size (px)
DESCRIPTION
Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval. Advisor : Dr. Hsu Presenter : Zih-Hui Lin Author :Ying Zhang and Phil Vines. Outline. Motivation Objective Previous work Methodology Experiments and results Conclusions. Motivation. - PowerPoint PPT Presentation
Citation preview
1Intelligent Database Systems Lab
國立雲林科技大學National Yunlin University of Science and Technology
Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval
Advisor : Dr. Hsu
Presenter : Zih-Hui Lin
Author :Ying Zhang and Phil Vines
2
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation Objective Previous work Methodology Experiments and results Conclusions
Outline
3
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation One of the major remaining reasons that CLIR
does not perform as well as monolingual retrieval is the presence of out of vocabulary (OOV) terms.─ it will not be recognized, and segmented into either sm
aller sequences of characters or individual characters ─ 北野武→ (north limit military)
Previous work has either relied on manual intervention or has only been partially successful in solving this problem.
4
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objective
We propose a segmentation free method which can be applied to both Chinese-English and English-Chinese CLIR, correctly extracting translations of OOV terms from the Web automatically, and thus is a significant improvement on earlier work
5
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.English translation extraction in Chinese-English CLIR
Chinese OOV term detection─ 北野武 (north limit military) → Pvalue given by the HMM will be very low if Pval
ue < Pmin → contains OOV terms web text extraction
─ we extract strings that contain the Chinese query terms and some English text from the Web.
collection of co-occurrence statistics,
translation selection.search for longest Chinese substring Ct:
search for the English term etwith the highest frequency:
1. |Ctargets| = max(|Cij|).
2. f(et, Ct) = max(f(ei,Ctargets)).
3. Add (Ct, et) into the translation dictionary.
1.f(etargets) = max(f(ei)).
2.f(et’ ,Ct’) = max(f(etargets,Cij )).
3. if Ct’ ≠ Ct and et’ ≠ et , add (Ct’ , et’ ) into the translation dictionary.
北野武( Kitano Takeshi )c4 c5 c6 e 1
導演北野武 ( Kitano Takeshi )c2 c3 c4 c5 c6 e1
6
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Chinese translation extraction in English-Chinese CLIR
Extraction of web text─ use Google to fetch the top100 Chinese documents with the English OOV term eoov as t
he query. Collection of co-occurrence statistics
─
─ accumulate the frequency foov.─ considering all substrings in Sleft and Sright, and collecting
the frequency fn and the length |sn| of each Chinese substring.─
Translation selection─ exclude any substring that
already in the translation dictionary doesn’t occur in the document collection
與區域貿易………兩岸經貿關係( Canada and Cross Straits Economic Relations )三組發表九………英茂、加
Sleft ( 包含 20 個字 ) Sright ( 包含 20 個字 )eoov
Longest length
Highest frequency
7
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments and results
Chinese-English CLIR English-Chinese CLIR
30
10 10
8
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction When translating from Chinese to English, a standard
first step is to segment the text into words based on an existing segmentation dictionary.
However where an OOV term occurs, it will not be recognized, and segmented into either smaller sequences of characters or individual characters.
We propose a segmentation free method based on frequency and length analysis and corpus-based disambiguation
9
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Previous work
Dictionary-based translation schemes need to address three major issues─ phrase identification and translation
ex. non proliferation treaty and cross straits.
─ translation ambiguity using techniques such as term co-occurrence , mutual
information or language modeling.
─ out of vocabulary (OOV) terms. ex. Dioxin
10
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Previous work-Existing approaches to the OOV problem Depending on the language, it may be possible to ded
uce appropriate transliterated translations automatically.─ that they successfully applied in English-Arabic CLIR.
However the issue is more difficult in Chinese as many characters have the same sound, and many English syllables do not have equivalent sounds in Chinese, meaning that selecting the correct characters to represent a transliterated word can be problematic.─ cross straits( 兩岸 ) 、北野武 (north limit military)
11
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Previous work-Segmentation free translation extraction It is common to find a small amount of English text
in Chinese web documents, but extremely rare to find Chinese text in English web documents.
We therefore rely on Chinese web documents to extract translations in both directions.
The problem is that the Chinese OOV term we are looking for is currently unknown, and thus we have no information about how it should be segmented.─ In previous work, this problem was overcome by manual
intervention to provide appropriate segmentation.
12
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments and results Chinese-English CLIR
─ retrieving English documents using Chinese queries.30
10
10
13
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments and results (cont.) English-Chinese CLIR
─ retrieving Chinese documents using English queries. The aim of our work is to find appropriate Chinese translations of English OOV terms
14
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Conclusions
We have also described improved ways to extract the translation of OOV terms from the Web in a way that does not rely on prior segmentation.
Although the Web is constantly changing, we were able to find most OOV terms, many of which related to news events up to 10 years ago.
15
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.My opinion
Advantage: Segmentation free translation extraction
Disadvantage: Apply: 線上翻譯 ……… ..