mike-tian-jian-jiang
“A plague of statistics has descended on our houses.”
– Ed Hovy
5
e.g. 11,001 New Features for Statistical Machine Translation……
First Brick in the Wall
• Via Negativa
• False positive/negative
• Error propagation
• Unknown unknown
9
Funny Autocomplete
• “autocomplete is not a function” is currently the top-1 Google autocomplete for “autocomplete is”.
10
Autocomplete is NOT a function
• Neither is auto-suggestion
• They are many-to-many relations with scores.
• Recognize this?
11
Many-to-many Scoring
• Map by prefix, rank by popularity
• Google search box autocomplete
• Map by occurrence, rank by similarity
• Search (information retrieval)
• Map by information, rank by knowledge
• Translation
12
Information?
• Surface patterns and……
• Imaginations
• Quantum information theory
• Tensor (Network Algorithm)
• Quantum Physics and Linguistics
• Frobenius (diagrammatic) algebras (for semantics)
13
Popularity & Similarity
• Popularity: famous or infamous?
• Consensus: social choice?
• Similarity
• Distance: rational choice?
15
Prefix, Occurrence
• Surface pattern
• Regular
• Context-free
• Context-sensitive
• Recursively blahblah……
16
Regular expression
• [a-z]+
• Colours of cats and dogs.
• [^o]{2}
• Colours of cats and dogs.
• cat|dog
• Colours of cats and dogs.
• Colou?rs?
• Colours of cats and dogs.
• Colors of cats and dogs.
• Color of a cat.
• <[A-Za-z][A-Za-z]*>
• <html>Colours of cats and dogs.</html>
18
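The patterns above behave the same way in any standard regex engine; a quick check in Python, using the slide’s own sample sentences:

```python
import re

text = "Colours of cats and dogs."

# [a-z]+ : maximal runs of lowercase letters (the capital C is excluded)
print(re.findall(r"[a-z]+", text))

# cat|dog : alternation matches inside "cats" and "dogs"
print(re.findall(r"cat|dog", text))

# Colou?rs? : optional letters cover all three spellings
print(re.findall(r"Colou?rs?", "Colours Colors Color"))

# <[A-Za-z][A-Za-z]*> : a crude opening-tag pattern
print(re.findall(r"<[A-Za-z][A-Za-z]*>", "<html>Colours of cats and dogs.</html>"))
```

Note the last pattern finds only `<html>`: the closing tag `</html>` fails because `/` is not in the character class.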
Edit Distance
• Colors
• Delete s
• Color
• Insert u
• Colour
• Replace C with c
• colour
• Distance from Colors to colour: 3 (or 4 if the cost of replacing is 2)
19
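A minimal dynamic-programming sketch of the Levenshtein computation above, with a configurable substitution cost so it reproduces both distances from the slide (3 with unit costs, 4 when replacing costs 2):

```python
def edit_distance(a, b, sub_cost=1):
    """Levenshtein distance via dynamic programming over a (m+1) x (n+1) table."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete everything
    for j in range(n + 1):
        dp[0][j] = j                      # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # delete a[i-1]
                           dp[i][j - 1] + 1,       # insert b[j-1]
                           dp[i - 1][j - 1] + cost)  # substitute (or match)
    return dp[m][n]

print(edit_distance("Colors", "colour"))              # 3
print(edit_distance("Colors", "colour", sub_cost=2))  # 4
```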
Normalization
• time flies like an arrow. fruit flies like bananas.
• Case restoration
• Time flies like an arrow. Fruit flies like bananas.
• Sentence segmentation
• time flies like an arrow.
• fruit flies like bananas.
• Word normalization: stemming or lemmatization?
21
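The two normalization steps above can be sketched naively; real segmenters and truecasers must also handle abbreviations, decimals, quotes, and proper nouns, which this toy version ignores:

```python
def segment_sentences(text):
    """Naive segmentation: split after sentence-final periods."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def restore_case(sentence):
    """Naive case restoration: upper-case only the first character."""
    return sentence[0].upper() + sentence[1:]

text = "time flies like an arrow. fruit flies like bananas."
print([restore_case(s) for s in segment_sentences(text)])
# ['Time flies like an arrow.', 'Fruit flies like bananas.']
```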
Stemming
• Porter Stemmer (mainly suffix stripping)
• flies → fli
• bananas → banana
• How about “flies → fly”?
• Lemmatization
22
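The stemming-versus-lemmatization contrast above can be sketched in a few lines. The stemmer below implements just two suffix-stripping rules in the spirit of Porter’s step 1a, NOT the full Porter algorithm, and the lemmatizer’s tiny lexicon is a hypothetical stand-in for a real dictionary:

```python
def toy_stem(word):
    """Two suffix-stripping rules; reproduces the slide's examples only."""
    if word.endswith("ies"):
        return word[:-3] + "i"       # flies -> fli
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]             # bananas -> banana
    return word

def lemmatize(word, lexicon):
    """Lemmatization needs linguistic knowledge: here, a dictionary lookup."""
    return lexicon.get(word, word)

lexicon = {"flies": "fly"}           # hypothetical tiny lexicon
print(toy_stem("flies"))             # fli  (stemming gives a non-word)
print(toy_stem("bananas"))           # banana
print(lemmatize("flies", lexicon))   # fly  (lemmatization gives the real lemma)
```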
Confidence Score
• Confidence interval? Confidence level?
• Not really
• But it can be
• Just a buzz word from speech recognition
• Shannon’s game
• Hidden-Markov models
• Generative
• The Italian who went to Malta
• Can be any reasonable score
• Mostly probability
26
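Shannon’s game — guess the next symbol — is easy to sketch with bigram counts, using the relative frequency of the winning guess as a makeshift confidence score. The toy corpus reuses the sentences from the normalization slide:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which word follows which, for Shannon's guessing game."""
    follow = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            follow[a][b] += 1
    return follow

def predict_next(follow, word):
    """Most likely next word, with its relative frequency as the score."""
    counts = follow[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

corpus = ["time flies like an arrow", "fruit flies like bananas"]
follow = train_bigrams(corpus)
print(predict_next(follow, "flies"))   # ('like', 1.0)
print(predict_next(follow, "like"))    # ties split the probability mass
```

As the slide says, this score “can be any reasonable score”; here it happens to be a maximum-likelihood probability.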
Calculate Sentence Similarity
(flowchart: a sentence is rated Confident on a [partial match], Trusted on an [exact match], Doubted on [no match])
a / b < threshold, since b is higher, when
a = prob. of (#2(w1 w2 w3 w4) #1(w1 w2 w3) #1(w2 w3 w4) #1(w1 w2) #1(w2 w3) #1(w3 w4) #2(w1 w3) #2(w2 w4) #3(w1 w4));
b = avg. prob. of all known exact matches;
where #n: any other (n − 1) words in-between.
Sentence: “w1 w2 w3 w4.”
27
Evaluate Pair {Source, Target} Confidence
(flowchart: the pair is rated Confident / Trusted / Doubted via the [Trusted Source], [Confident Source], and [Doubted Source] branches)
Triple: {Source, Target, Back}
(flowchart: Source → Target via [Trusted Target] / [Not Doubted Target]; then Evaluate Back Confidence, with a [Doubted Back] branch)
28
Summarization
• Extraction, Classification: Discriminative
• Abstraction, Aggregation: Generative
30
Even Intractable
• Minimum Feedback Arc Set
• NP-complete, APX-hard
• Bipartite Tournament
• Hypergraph Grammar
• Synchronous Grammar
• Arrow’s Impossibility Theorem
• Social Choice
• Voting System
35
There are two kinds of…
PAIN. The sort of pain that makes you strong, or useless pain. The sort of pain that's only suffering. I have no patience for useless things.
37
What might make me stronger……
(See also http://www.no-free-lunch.org)
38
HTML Side-effect
<span class="notranslate">Hello, WorldJumper!</span>
<!-- Are you talking to me? -->
41
More Anomalies
• 【米】
• 飛来物
• 菜の花
• 桃白白
• 白立斌
• Oh, I also want [[[this part to be a partially matched TM]]] pre-edited for MT, please?
44
Transliteration
• Alignment
• Alignment
• Alignment
• (And better be more than bilingual)
47
system using M2M-aligner, CRF models, and AV features in this work is explained in Section III. Section IV describes experiment results, and discussion is provided in Section V. Finally, Section VI draws a conclusion.
II. RELATED WORKS
A. CRF-based Transliteration
A phrase-based transliteration system that groups characters into substrings mapping to target names was presented in [16], demonstrating how a substring representation can be incorporated into a CRF model with local context and phonemic information. Shishtla et al. [17] adopted a statistical transliteration technique that consists of the alignment model of GIZA++ [19] and CRF models. In [13], M2M-aligner was used instead of GIZA++, and source-grapheme AV was applied to CRF-based transliteration.
A two-stage CRF method for transliteration was first designed to pipeline two independent processes [7][10]. The first stage predicts the syllable boundaries of source names, and the second stage uses those boundaries to obtain the corresponding characters of target names. The advantage of the two-stage CRF method is that it considerably decreases the training cost with complex features, compared to the one-stage character-based labeling method. The downside, compared with the one-stage method, is that features of the target language are not directly applied in the first stage. To recover from error propagation in the pipeline, a joint optimization of the two-stage CRF method was then proposed that utilizes the n-best candidates of source name segmentations [11]. Another approach to reducing local errors in boundary segmentation uses pools of CRF models for second-stage model training [8][14].
B. Accessor Variety
Accessor variety (AV) is a criterion for deciding whether consecutive Chinese characters in a sentence form a meaningful word [12]. Another, similar criterion for the measurement of English and Chinese words, called boundary entropy or branching entropy (BE), has been used in several works. The basic idea behind these measurements is closely related to one particular perspective on n-grams and the information-theoretic notions of cross entropy and perplexity. Zhao and Kit [20] observed that AV and BE both assume that the border of a potential word is located where the uncertainty of successive characters increases — AV and BE being the discrete and continuous versions, respectively, of the fundamental work of Harris [18] — and then chose to adopt AV as an additional feature for CRF-based Chinese word segmentation (CWS). The AV of a string s is defined as:
AV(s) = \min\{ L_{av}(s), R_{av}(s) \} .  (1)
In Eq. (1), Lav(s) and Rav(s) are defined as the number of distinct preceding and succeeding characters, except when the adjacent character is absent due to a sentence boundary, and then the pseudo-character of the beginning or end of a sentence is accumulated indistinctly. More heuristic rules were also developed to remove strings that contain known words or adhesive characters [12]. For the strict meaning of
unsupervised features and for simplicity, this work does not include those additional rules.
The necessity of AV stems primarily from the demand for semi-supervised learning. Since AV can be extracted from large corpora without any manual segmentation or annotation, hidden variables underlying frequent surface patterns of languages may be captured via an inexpensive and unsupervised method such as a suffix array. Unsupervised selection of AV or similar features has generally improved the effectiveness of supervised CWS on cross-domain and unlabeled data, and this work consequently considers that the AV of un-segmented English names from the training, development, and test data might help enhance E2C transliteration.
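As a concrete sketch, the AV count — the minimum of the numbers of distinct left and right neighbours, per the definition above — can be computed directly from raw strings. The toy corpus below is hypothetical, and the boundary pseudo-characters are simplified to a single symbol per side:

```python
def accessor_variety(corpus, s):
    """AV(s) = min(#distinct left neighbours, #distinct right neighbours);
    a sentence boundary counts as one indistinct pseudo-character."""
    left, right = set(), set()
    for sentence in corpus:
        i = sentence.find(s)
        while i != -1:
            left.add(sentence[i - 1] if i > 0 else "<s>")
            j = i + len(s)
            right.add(sentence[j] if j < len(sentence) else "</s>")
            i = sentence.find(s, i + 1)
    return min(len(left), len(right))

corpus = ["abcd", "xbcy", "bce"]       # hypothetical character strings
print(accessor_variety(corpus, "bc"))  # 3: left {a, x, <s>}, right {d, y, e}
```

Because it only needs raw co-occurrence counts, this is exactly the kind of cheap unsupervised feature the paragraph above describes.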
III. METHODOLOGY
A. Basic CRF Theorem
Conditional random fields (CRF) are undirected graphical models trained to maximize a conditional probability, and the concept is well established for sequence labeling problems [9]. Given an input sequence X = x_1 … x_T and a label sequence Y = y_1 … y_T, the conditional probability of a linear-chain CRF with parameters \Lambda = \{\lambda_1, …, \lambda_n\} can be defined as:
P(Y|X) = \frac{1}{Z_X} \exp\left[ \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, X, t) \right]  (2)

where Z_X is the normalization constant that makes the probabilities of all label sequences sum to one, f_k(y_{t-1}, y_t, X, t) is a feature function which is often binary valued but can be real valued, and \lambda_k is a learned weight associated with feature f_k.
Given such a model as defined in Eq. (2), the most probable label sequence for an input X is:

Y^* = \arg\max_{Y} P_\Lambda(Y|X)  (3)

Eq. (3) can be calculated efficiently by dynamic programming using the Viterbi algorithm.
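That arg max can be sketched with a tiny Viterbi decoder. The two-label toy model below (syllable-boundary labels B/I with log-scale weights that prefer alternation) is hypothetical, not from the paper:

```python
def viterbi(obs, states, log_init, log_trans, log_emit):
    """Most probable label sequence (Eq. 3) by dynamic programming."""
    V = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            scores[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        V.append(scores)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])   # best final state
    path = [last]
    for ptr in reversed(back):                   # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model: two labels that prefer to alternate, uniform emissions.
states = ["B", "I"]
log_init = {"B": 0.0, "I": -10.0}
log_trans = {"B": {"B": -2.0, "I": -0.5}, "I": {"B": -0.5, "I": -2.0}}
log_emit = {s: {c: 0.0 for c in "abet"} for s in states}
print(viterbi(list("abet"), states, log_init, log_trans, log_emit))
# ['B', 'I', 'B', 'I']
```

The run time is O(T · |states|²), which is what makes exact decoding of Eq. (3) tractable for linear-chain CRFs.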
B. EM for Initial Alignments
In [15], the authors argued that previous work had generally assumed one-to-one alignment for simplicity, but letter strings and phoneme strings are typically not of the same length, so null phonemes or null letters must be introduced to make one-to-one alignments possible. Furthermore, two letters frequently combine to produce a single phoneme (double letters), and a single letter can sometimes produce two phonemes (double phonemes). For example, the English word “ABERT” with its Chinese transliteration “���”, which can be referred to as the “phonemes”, is aligned as [15]:
A  BE RT
|  |  |
�  �  �
Reinforcement
• Explore vs. Exploit
• Interactive
• Online
• Free Lunches
• Second moments and higher of algorithms' generalisation error
• Coevolution
• Confidence intervals can give a priori distinctions between algorithms
• People respond to incentives
53
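The explore-vs-exploit trade-off in its simplest form is an epsilon-greedy bandit. The sketch below (reward history and arm names hypothetical, nothing beyond the concept is from the slides) picks a random arm with probability epsilon and otherwise exploits the best observed mean:

```python
import random

def epsilon_greedy(rewards_seen, epsilon=0.1):
    """Pick an arm index: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(rewards_seen))        # explore: any arm
    means = [sum(r) / len(r) if r else 0.0 for r in rewards_seen]
    return max(range(len(means)), key=means.__getitem__)  # exploit: best mean

# Hypothetical reward history for three arms.
history = [[1, 0], [1, 1], [0]]
print(epsilon_greedy(history, epsilon=0.0))  # with epsilon=0, always arm 1
```

Interactive and online settings differ only in how `history` is updated after each pull.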
Translate X for Y
• {restaurant AD, coupon}
• {game, credit}
• {subtitle, DRM-free video}
• {Heart Sūtra, inner peace}
• {inside news, outside support}
• Taiwanese protesters
• {anything, incentives}
• See also: Unbabel, Duolingo
54
New Types of Assistance for Translators
by Philipp Koehn (http://www.mastar.jp/wfdtr/shiryou2013/Philipp%20Koehn.pdf
via http://www.mastar.jp/wfdtr/index-e.html)
55
Wrap up
• Where’s my pony semantics?
• Adaptation
• Chinese restaurant process
• Indian buffet process
• 信 (adequate)、達 (fluent)
• 雅 (elegant)?貼 (pertinent)?
• Bilingual might be insufficient: 全日空 → ANA
• Pony: you can’t always get what you want
• Extrinsic evaluation
• Embrace and enjoy changes
57