TokyoNLP is a meetup about natural language processing in Tokyo. This slide deck was given as the 5th talk at the 8th event.
A Report of IJCNLP 2011 @nokuno
#tokyonlp
About the presenter
• Name: Yoh Okuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, etc.
• Website: http://www.yoh.okuno.name/
Recent nokuno (1)–(3)
#emnlpreading 2011. 12. 23.
at Cybozu Labs
Today’s Topic
• Japanese Pronunciation Prediction as Statistical Machine Translation
• Integrating Models Derived from Non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System
• Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs
Japanese Pronunciation Prediction as Statistical Machine Translation
Jun Hatori and Hisami Suzuki
University of Tokyo, Microsoft Research
IJCNLP 2011
Motivation
• Japanese words and sentences have multiple pronunciations
• The proposed method predicts the pronunciations of out-of-vocabulary (OOV) words [Hatori+ 11] and of known words in a sentence simultaneously
• Uses a statistical machine translation (SMT) framework at the word and character level
An Example
• Input: 東京都美術館の狩野探幽展に行った (“[I] went to the Kanō Tan’yū exhibition at the Tokyo Metropolitan Art Museum”)
• Output: とうきょうとびじゅつかんのかのうたんゆうてんにいった (the reading in hiragana)
• Training corpus: a Japanese dictionary and a corpus annotated with pronunciations
Discriminative Model
• Similar to phrase-based SMT with monotone alignment, no insertion, and no deletion
• Trained with the averaged perceptron
• Score = λ · f (λ: parameter vector, f: feature vector)
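The averaged perceptron training mentioned above can be sketched generically; the `extract_features` and `decode` helpers below are hypothetical stand-ins, not the paper’s implementation:

```python
from collections import defaultdict

def averaged_perceptron(train, extract_features, decode, epochs=5):
    """train: list of (source, gold_target) pairs.
    extract_features(src, tgt) -> dict of feature counts (the f above).
    decode(src, weights) -> best target under the current weights (lambda)."""
    weights = defaultdict(float)
    totals = defaultdict(float)   # running sum of weights, for averaging
    step = 0
    for _ in range(epochs):
        for src, gold in train:
            pred = decode(src, weights)
            if pred != gold:
                # standard perceptron update: reward gold, penalize prediction
                for k, v in extract_features(src, gold).items():
                    weights[k] += v
                for k, v in extract_features(src, pred).items():
                    weights[k] -= v
            for k in weights:
                totals[k] += weights[k]
            step += 1
    # averaging reduces overfitting to the last few updates
    return {k: totals[k] / step for k in totals}
```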
Features
• Bidirectional translation probability
• Target character n-gram model
• Target character length
• Joint n-gram model
– probability over (source, target) pairs
Translation Process
Training
• Produce a translation table and a language model
Experimental Result
• The dictionary-based approach outperformed the substring-based approach [Hatori+ 11]
References
• [Hatori+ 11] Predicting Word Pronunciation in Japanese
• MeCab: [Kudo+ 04] Applying Conditional Random Fields to Japanese Morphological Analysis
• KyTea: [Neubig+ 10] Word-based Partial Annotation for Efficient Corpus Construction
• [Suzuki+ 05] Microsoft Research IME Corpus
• [Maekawa+ 08] Compilation of the KOTONOHA-BCCWJ Corpus (in Japanese)
Integrating Models Derived from Non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System
Andrew Finch and Eiichiro Sumita (NICT)
NEWS 2011
Transliteration Task
• Transliteration is defined as the phonetic translation of names across languages [Zhang+ 11]
Non-parametric Co-segmentation
• Extends monolingual word segmentation [Mochihashi+ 09] [Goldwater+ 06]
• Uses a unigram Dirichlet process model as the language model and a Poisson distribution as the base measure (no character-level LM)
• Simple Gibbs sampling with forward-backward [Finch+ 10]
Joint Source-Channel Model
• Models the parallel corpus as bilingual sequence pairs
• Bilingual sequence pairs do not cross word boundaries
[Finch+ 10]
s: sources, t: targets, w: words, γ: bilingual segmentation
Unigram Dirichlet Process Model
• Bilingual sequence pairs are generated from a unigram Dirichlet process
• Uses the Chinese restaurant process representation
• Each bilingual sequence pair is generated as:
1. one of the existing types, with probability proportional to its count, or
2. a new type, with probability controlled by the constant α (= 0.3 in this case)
[Finch+ 10]
The Base Measure
• A double Poisson distribution over bilingual sequence pairs: the length of each side is Poisson-distributed
• Characters are generated uniformly
[Finch+ 10]
v: vocabulary size, λ: Poisson parameter (= 2 in this case)
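A minimal sketch of this double-Poisson base measure, assuming one shared λ for both sides; the vocabulary sizes passed in are illustrative, not from the paper:

```python
import math

def poisson_pmf(k, lam=2.0):
    # P(length = k) under a Poisson with mean lam (lam = 2 on the slide)
    return math.exp(-lam) * lam**k / math.factorial(k)

def base_measure(src, tgt, v_src, v_tgt, lam=2.0):
    """Base-measure probability of a bilingual sequence pair (src, tgt):
    each side's length is Poisson-distributed, and each character is
    drawn uniformly from a vocabulary of size v_src / v_tgt."""
    p = poisson_pmf(len(src), lam) * (1.0 / v_src) ** len(src)
    p *= poisson_pmf(len(tgt), lam) * (1.0 / v_tgt) ** len(tgt)
    return p
```

Longer pairs get geometrically smaller probability, which discourages the sampler from memorizing whole names as single pairs.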
The Generative Model
• Each pair is generated conditioned on the history of bilingual sequence pairs, via the Dirichlet process predictive distribution:
P(pair_k | pairs_{−k}) ∝ count_{−k}(pair_k) + α · G0(pair_k)
• −k: “up to but not including k”; α = 0.3 weights new bilingual sequence pairs
[Finch+ 10]
The Generative Process [Finch+ 10]
1. Generate a new pair? (yes with probability proportional to α)
2. No: sample a pair from the multinomial over previously generated pairs
3. Yes: sample the length of each side of the pair, then sample each character uniformly
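The process above can be sketched as follows; `base_sampler_demo` is a hypothetical stand-in for the double-Poisson base measure, using toy alphabets rather than real character sets:

```python
import math
import random
from collections import Counter

def sample_pair(history, alpha, base_sampler):
    """Chinese-restaurant-process draw for the next bilingual pair.
    history: Counter of previously generated pairs; alpha: concentration
    (0.3 on the slides); base_sampler(): draws a fresh pair from the
    base measure."""
    n = sum(history.values())
    r = random.random() * (n + alpha)
    for pair, count in history.items():
        r -= count
        if r < 0:
            return pair            # reuse an existing type
    return base_sampler()          # back off to the base measure

def base_sampler_demo(lam=2.0, src_chars="ab", tgt_chars="xy"):
    """Hypothetical base measure: Poisson lengths, uniform characters."""
    def poisson_len():
        # inversion sampling for a Poisson length (clamped to >= 1)
        k, p, u = 0, math.exp(-lam), random.random()
        acc = p
        while u > acc:
            k += 1
            p *= lam / k
            acc += p
        return max(k, 1)
    src = "".join(random.choice(src_chars) for _ in range(poisson_len()))
    tgt = "".join(random.choice(tgt_chars) for _ in range(poisson_len()))
    return (src, tgt)
```

Repeated draws show the rich-get-richer behavior of the CRP: with α = 0.3, almost all draws reuse an existing pair, so only a handful of distinct types emerge.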
Gibbs Sampling
• Uses the blocked version of Forward-Filtering Backward-Sampling (FFBS) [Mochihashi+ 09]
[Finch+ 10]
Graph over all co-segmentations of (abba, アッバ)
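FFBS over a segmentation lattice can be sketched as below. For clarity this is a monolingual simplification (the paper samples bilingual co-segmentations over the graph of sequence pairs), and `word_prob` stands in for the Dirichlet process predictive probability:

```python
import random

def ffbs_segment(s, word_prob, max_len=4):
    """Forward-filtering backward-sampling over all segmentations of s.
    word_prob(w) gives the probability of substring w as one unit;
    max_len bounds the length of a unit to keep the lattice small."""
    n = len(s)
    # Forward filtering: alpha[t] = total prob of all segmentations of s[:t]
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for k in range(max(0, t - max_len), t):
            alpha[t] += alpha[k] * word_prob(s[k:t])
    # Backward sampling: walk back from n, choosing each previous
    # boundary k proportionally to alpha[k] * word_prob(s[k:t])
    t, bounds = n, [n]
    while t > 0:
        weights = [(k, alpha[k] * word_prob(s[k:t]))
                   for k in range(max(0, t - max_len), t)]
        total = sum(w for _, w in weights)
        r = random.random() * total
        for k, w in weights:
            r -= w
            if r < 0:
                t = k
                break
        bounds.append(t)
    bounds.reverse()
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]
```

Sampling whole segmentations at once (blocked), rather than flipping one boundary at a time, is what makes the Gibbs sampler mix quickly.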
Experimental Result
• Outperformed the m2m baseline on all language pairs
Translation table example
References
• [Zhang+ 11] Whitepaper of NEWS 2011 Shared Task on Machine Transliteration
• [Finch+ 10] A Bayesian Model of Bilingual Segmentation for Transliteration
• [Mochihashi+ 09] Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling
• [Goldwater+ 06] Contextual Dependencies in Unsupervised Word Segmentation
Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs
Wang Ling, João Graça, David Martins de Matos, Isabel Trancoso and Alan Black
Carnegie Mellon University
IJCNLP 2011
Reordering in Phrase-based SMT
• The reordering model plays an important role in language pairs like Japanese-English
P(e|f) = P(e) · ∏_{i=1}^{I} P(f_i | e_i) · P(p_i, o_i)
(P(e): LM, P(f_i | e_i): translation, P(p_i, o_i): reordering)
[Koehn+ 03]
History of Reordering Models
• Distance-based reordering model [Koehn+ 03]
• Word-based lexicalized reordering [Koehn+ 05]
• Phrase-based lexicalized reordering [Tillmann+ 04]
• Weighted word-based lexicalized reordering [Ling+ 11]
– Weighted alignment matrices [Liu+ 09]
– Reordering graph representation [Su+ 10]
• This paper proposes weighted phrase-based lexicalized reordering
Three Types of “Orientation”
• Orientations fall into three types:
– monotone (m)
– swap (s)
– discontinuous (d)
[Koehn+ 05]
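These three orientations can be read off the source-side spans of consecutive target phrases; the span convention and function below are my own illustration, not code from the paper:

```python
def orientation(prev_src_span, cur_src_span):
    """Classify the orientation of the current phrase relative to the
    previous one, given their source-side spans as (start, end) with
    end exclusive, following the monotone/swap/discontinuous scheme."""
    prev_start, prev_end = prev_src_span
    cur_start, cur_end = cur_src_span
    if prev_end == cur_start:
        return "m"   # monotone: current phrase directly follows the previous
    if cur_end == prev_start:
        return "s"   # swap: current phrase directly precedes the previous
    return "d"       # discontinuous: any gap or larger jump
```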
Word-based Reordering
• Currently the most popular reordering model
• Extends counts to weighted sums of probabilities
[Koehn+ 05]
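Under the “weighted sum” extension, orientation counts become fractional; a sketch of estimating P(orientation | phrase pair) from such weighted observations follows (the smoothing scheme is an illustrative choice of mine, not the paper’s):

```python
from collections import defaultdict

def reordering_probs(observations, smoothing=0.5):
    """Estimate P(orientation | phrase pair) from possibly fractional
    counts. observations: iterable of (phrase_pair, orientation, weight).
    With weight = 1.0 this reduces to ordinary counting; with posterior
    weights it matches the weighted-sum extension."""
    counts = defaultdict(lambda: defaultdict(float))
    for pair, o, w in observations:
        counts[pair][o] += w
    probs = {}
    for pair, cs in counts.items():
        total = sum(cs.values()) + 3 * smoothing  # 3 orientations: m, s, d
        probs[pair] = {o: (cs.get(o, 0.0) + smoothing) / total
                       for o in ("m", "s", "d")}
    return probs
```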
Weighted alignment matrices [Liu+ 09]
Weighted Reordering Graph [Su+ 10]
Forward-Backward Algorithm
• Used to calculate the reordering probability P(p, o)
Choosing the Weight Matrix
• Weighted alignment matrix
• Distance-based edge weights, following [Liu+ 09]
Experimental Result
References
• [Koehn+ 03] Statistical Phrase-based Translation
• [Koehn+ 05] Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation
• [Liu+ 09] Weighted Alignment Matrices for Statistical Machine Translation
• [Su+ 10] Learning Lexicalized Reordering Models from Reordering Graphs
• [Ling+ 11] Reordering Modeling Using Weighted Alignment Matrices
Phrase Extraction for Japanese Predictive Input Method as Post-Processing
Yoh Okuno
Yahoo Japan Corporation
IJCNLP 2011
Call for Papers: TokyoNLP #9
EMNLP 2011 Reading
Any Questions?