TokyoNLP is a meetup about natural language processing in Tokyo. This slide deck was given as the 5th talk at the 8th event.
A Report of IJCNLP 2011 @nokuno
#tokyonlp
About the presenter
• Name: Yoh Okuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, etc.
• Website: http://www.yoh.okuno.name/
Recent nokuno (1)–(3)
#emnlpreading 2011. 12. 23.
at Cybozu Labs
Today’s Topic
• Japanese Pronunciation Prediction as Statistical Machine Translation
• Integrating Models Derived from Non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System
• Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs
Japanese Pronunciation Prediction as Statistical Machine Translation
Jun Hatori and Hisami Suzuki
University of Tokyo, Microsoft Research
IJCNLP 2011
Motivation
• Japanese words and sentences have multiple pronunciations
• The proposed method predicts the pronunciations of out-of-vocabulary (OOV) words [Hatori+ 11] and of known words in a sentence simultaneously
• Uses a statistical machine translation (SMT) framework at the word and character level
An Example
• Input: 東京都美術館の狩野探幽展に行った (“[I] went to the Kanō Tan’yū exhibition at the Tokyo Metropolitan Art Museum”)
• Output: とうきょうとびじゅつかんのかのうたんゆうてんにいった (the reading in hiragana)
• Training corpus: a Japanese dictionary and a corpus annotated with pronunciations
Discriminative Model
• Similar to phrase-based SMT with monotone alignment, no insertion, and no deletion
• Trained with the averaged perceptron
• Score = λ · f (λ: parameter vector, f: feature vector)
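The averaged perceptron training mentioned above can be sketched generically; the `extract_features` and `decode` helpers below are hypothetical stand-ins, not the paper’s implementation:

```python
from collections import defaultdict

def averaged_perceptron(train, extract_features, decode, epochs=5):
    """train: list of (source, gold_target) pairs.
    extract_features(src, tgt) -> dict of feature counts (the f above).
    decode(src, weights) -> best target under the current weights (lambda)."""
    weights = defaultdict(float)
    totals = defaultdict(float)   # running sum of weights, for averaging
    step = 0
    for _ in range(epochs):
        for src, gold in train:
            pred = decode(src, weights)
            if pred != gold:
                # standard perceptron update: reward gold, penalize prediction
                for k, v in extract_features(src, gold).items():
                    weights[k] += v
                for k, v in extract_features(src, pred).items():
                    weights[k] -= v
            for k in weights:
                totals[k] += weights[k]
            step += 1
    # averaging reduces overfitting to the last few updates
    return {k: totals[k] / step for k in totals}
```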
Features
• Bidirectional translation probability
• Target character n-gram model
• Target character length
• Joint n-gram model
– probability over (source, target) pairs
Translation Process
Training
• Produce a translation table and a language model
Experimental Result
• The dictionary-based approach outperformed the substring-based approach [Hatori+ 11]
References
• [Hatori+ 11] Predicting Word Pronunciation in Japanese
• MeCab: [Kudo+ 04] Applying Conditional Random Fields to Japanese Morphological Analysis
• KyTea: [Neubig+ 10] Word-based Partial Annotation for Efficient Corpus Construction
• [Suzuki+ 05] Microsoft Research IME Corpus
• [Maekawa+ 08] Compilation of the KOTONOHA-BCCWJ Corpus (in Japanese)
Integrating Models Derived from Non-Parametric Bayesian Co-segmentation into a Statistical Machine Transliteration System
Andrew Finch and Eiichiro Sumita (NICT)
NEWS 2011
Transliteration Task
• Transliteration is defined as the phonetic translation of names across languages [Zhang+ 11]
Non-parametric Co-segmentation
• Extends monolingual word segmentation [Mochihashi+ 09] [Goldwater+ 06]
• Uses a unigram Dirichlet process model as the language model and a Poisson distribution as the base measure (no character-level LM)
• Simple Gibbs sampling with forward-backward [Finch+ 10]
Joint Source-Channel Model
• Models the parallel corpus as bilingual sequence pairs
• Bilingual sequence pairs do not cross word boundaries
[Finch+ 10]
s: sources, t: targets, w: words, γ: bilingual segmentation
Unigram Dirichlet Process Model
• Bilingual sequence pairs are generated from a unigram Dirichlet process
• Uses the Chinese restaurant process representation
• Each bilingual sequence pair is generated as:
1. one of the existing types, with probability proportional to its count, or
2. a new type, with probability controlled by the constant α (= 0.3 in this case)
[Finch+ 10]
The Base Measure
• A double Poisson distribution over bilingual sequence pairs: the length of each side is Poisson-distributed
• Characters are generated uniformly
[Finch+ 10]
v: vocabulary size, λ: Poisson parameter (= 2 in this case)
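A minimal sketch of this double-Poisson base measure, assuming one shared λ for both sides; the vocabulary sizes passed in are illustrative, not from the paper:

```python
import math

def poisson_pmf(k, lam=2.0):
    # P(length = k) under a Poisson with mean lam (lam = 2 on the slide)
    return math.exp(-lam) * lam**k / math.factorial(k)

def base_measure(src, tgt, v_src, v_tgt, lam=2.0):
    """Base-measure probability of a bilingual sequence pair (src, tgt):
    each side's length is Poisson-distributed, and each character is
    drawn uniformly from a vocabulary of size v_src / v_tgt."""
    p = poisson_pmf(len(src), lam) * (1.0 / v_src) ** len(src)
    p *= poisson_pmf(len(tgt), lam) * (1.0 / v_tgt) ** len(tgt)
    return p
```

Longer pairs get geometrically smaller probability, which discourages the sampler from memorizing whole names as single pairs.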
The Generative Model
• Each pair is generated conditioned on the history of bilingual sequence pairs, via the Dirichlet process predictive distribution:
P(pair_k | pairs_{−k}) ∝ count_{−k}(pair_k) + α · G0(pair_k)
• −k: “up to but not including k”; α = 0.3 weights new bilingual sequence pairs
[Finch+ 10]
The Generative Process [Finch+ 10]
1. Generate a new pair? (yes with probability proportional to α)
2. No: sample a pair from the multinomial over previously generated pairs
3. Yes: sample the length of each side of the pair, then sample each character uniformly
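The process above can be sketched as follows; `base_sampler_demo` is a hypothetical stand-in for the double-Poisson base measure, using toy alphabets rather than real character sets:

```python
import math
import random
from collections import Counter

def sample_pair(history, alpha, base_sampler):
    """Chinese-restaurant-process draw for the next bilingual pair.
    history: Counter of previously generated pairs; alpha: concentration
    (0.3 on the slides); base_sampler(): draws a fresh pair from the
    base measure."""
    n = sum(history.values())
    r = random.random() * (n + alpha)
    for pair, count in history.items():
        r -= count
        if r < 0:
            return pair            # reuse an existing type
    return base_sampler()          # back off to the base measure

def base_sampler_demo(lam=2.0, src_chars="ab", tgt_chars="xy"):
    """Hypothetical base measure: Poisson lengths, uniform characters."""
    def poisson_len():
        # inversion sampling for a Poisson length (clamped to >= 1)
        k, p, u = 0, math.exp(-lam), random.random()
        acc = p
        while u > acc:
            k += 1
            p *= lam / k
            acc += p
        return max(k, 1)
    src = "".join(random.choice(src_chars) for _ in range(poisson_len()))
    tgt = "".join(random.choice(tgt_chars) for _ in range(poisson_len()))
    return (src, tgt)
```

Repeated draws show the rich-get-richer behavior of the CRP: with α = 0.3, almost all draws reuse an existing pair, so only a handful of distinct types emerge.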
Gibbs Sampling
• Uses the blocked version of Forward-Filtering Backward-Sampling (FFBS) [Mochihashi+ 09]
[Finch+ 10]
Graph over all co-segmentations of (abba, アッバ)
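FFBS over a segmentation lattice can be sketched as below. For clarity this is a monolingual simplification (the paper samples bilingual co-segmentations over the graph of sequence pairs), and `word_prob` stands in for the Dirichlet process predictive probability:

```python
import random

def ffbs_segment(s, word_prob, max_len=4):
    """Forward-filtering backward-sampling over all segmentations of s.
    word_prob(w) gives the probability of substring w as one unit;
    max_len bounds the length of a unit to keep the lattice small."""
    n = len(s)
    # Forward filtering: alpha[t] = total prob of all segmentations of s[:t]
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for k in range(max(0, t - max_len), t):
            alpha[t] += alpha[k] * word_prob(s[k:t])
    # Backward sampling: walk back from n, choosing each previous
    # boundary k proportionally to alpha[k] * word_prob(s[k:t])
    t, bounds = n, [n]
    while t > 0:
        weights = [(k, alpha[k] * word_prob(s[k:t]))
                   for k in range(max(0, t - max_len), t)]
        total = sum(w for _, w in weights)
        r = random.random() * total
        for k, w in weights:
            r -= w
            if r < 0:
                t = k
                break
        bounds.append(t)
    bounds.reverse()
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]
```

Sampling whole segmentations at once (blocked), rather than flipping one boundary at a time, is what makes the Gibbs sampler mix quickly.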
Experimental Result
• Outperformed the m2m baseline on all language pairs
Translation table example
References
• [Zhang+ 11] Whitepaper of NEWS 2011 Shared Task on Machine Transliteration
• [Finch+ 10] A Bayesian Model of Bilingual Segmentation for Transliteration
• [Mochihashi+ 09] Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling
• [Goldwater+ 06] Contextual Dependencies in Unsupervised Word Segmentation
Discriminative Phrase-based Lexicalized Reordering Models using Weighted Reordering Graphs
Wang Ling, João Graça, David Martins de Matos, Isabel Trancoso and Alan Black
Carnegie Mellon University
IJCNLP 2011
Reordering in Phrase-based SMT
• The reordering model plays an important role in language pairs like Japanese-English
P(e|f) = P(e) · ∏_{i=1}^{I} P(f_i | e_i) · P(p_i, o_i)
(P(e): LM, P(f_i | e_i): translation, P(p_i, o_i): reordering)
[Koehn+ 03]
History of Reordering Models
• Distance-based reordering model [Koehn+ 03]
• Word-based lexicalized reordering [Koehn+ 05]
• Phrase-based lexicalized reordering [Tillmann+ 04]
• Weighted word-based lexicalized reordering [Ling+ 11]
– Weighted alignment matrices [Liu+ 09]
– Reordering graph representation [Su+ 10]
• This paper proposes weighted phrase-based lexicalized reordering
Three Types of “Orientation”
• Orientations fall into three types:
– monotone (m)
– swap (s)
– discontinuous (d)
[Koehn+ 05]
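These three orientations can be read off the source-side spans of consecutive target phrases; the span convention and function below are my own illustration, not code from the paper:

```python
def orientation(prev_src_span, cur_src_span):
    """Classify the orientation of the current phrase relative to the
    previous one, given their source-side spans as (start, end) with
    end exclusive, following the monotone/swap/discontinuous scheme."""
    prev_start, prev_end = prev_src_span
    cur_start, cur_end = cur_src_span
    if prev_end == cur_start:
        return "m"   # monotone: current phrase directly follows the previous
    if cur_end == prev_start:
        return "s"   # swap: current phrase directly precedes the previous
    return "d"       # discontinuous: any gap or larger jump
```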
Word-based Reordering
• Currently the most popular reordering model
• Extends counts to weighted sums of probabilities
[Koehn+ 05]
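Under the “weighted sum” extension, orientation counts become fractional; a sketch of estimating P(orientation | phrase pair) from such weighted observations follows (the smoothing scheme is an illustrative choice of mine, not the paper’s):

```python
from collections import defaultdict

def reordering_probs(observations, smoothing=0.5):
    """Estimate P(orientation | phrase pair) from possibly fractional
    counts. observations: iterable of (phrase_pair, orientation, weight).
    With weight = 1.0 this reduces to ordinary counting; with posterior
    weights it matches the weighted-sum extension."""
    counts = defaultdict(lambda: defaultdict(float))
    for pair, o, w in observations:
        counts[pair][o] += w
    probs = {}
    for pair, cs in counts.items():
        total = sum(cs.values()) + 3 * smoothing  # 3 orientations: m, s, d
        probs[pair] = {o: (cs.get(o, 0.0) + smoothing) / total
                       for o in ("m", "s", "d")}
    return probs
```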
Weighted alignment matrices [Liu+ 09]
Weighted Reordering Graph [Su+ 10]
Forward-Backward Algorithm
• Used to calculate the reordering probability P(p, o)
Choosing the Weight Matrix
• Weighted alignment matrix
• Distance-based edge weights, following [Liu+ 09]
Experimental Result
References
• [Koehn+ 03] Statistical Phrase-based Translation
• [Koehn+ 05] Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation
• [Liu+ 09] Weighted Alignment Matrices for Statistical Machine Translation
• [Su+ 10] Learning Lexicalized Reordering Models from Reordering Graphs
• [Ling+ 11] Reordering Modeling Using Weighted Alignment Matrices
Phrase Extraction for Japanese Predictive Input Method as Post-Processing
Yoh Okuno
Yahoo Japan Corporation
IJCNLP 2011
Call for Papers: TokyoNLP #9
EMNLP 2011 Reading
Any Questions?