This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates go to http://www.statmt.org/mosescore/ or follow us on Twitter - #MosesCore
Andrejs Vasiļjevs
chairman of the board
[email protected]
Localization World, Santa Clara, October 9, 2013
MT & Terminology:
better together
• Language technology developer
• Localization service provider
• Leadership in smaller languages
• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
• 130 employees
• Strong R&D team
• 5 PhDs, 80+ research papers
• Trusted partner of the EU for significant R&D projects
data
challenge
platform
challenge
[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz
% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT
% train-model.perl \
--corpus factored-corpus/proj-syndicate \
--root-dir unfactored \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm:0
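The `--lm` argument above packs several fields into one colon-separated string. As a minimal sketch, assuming the field order factor:order:filename:type used by the Moses training script, the value can be unpacked like this (the `LmSpec` and `parse_lm_spec` names are hypothetical, introduced only for illustration):

```python
from typing import NamedTuple


class LmSpec(NamedTuple):
    """Hypothetical container for one --lm value (illustration only)."""
    factor: int   # which factor the language model scores (0 = surface form)
    order: int    # n-gram order
    path: str     # language model file
    lm_type: int  # numeric code selecting the LM implementation (assumed meaning)


def parse_lm_spec(spec: str) -> LmSpec:
    """Split a 'factor:order:filename:type' string into typed fields."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(parts)}")
    factor, order, path, lm_type = parts
    return LmSpec(int(factor), int(order), path, int(lm_type))


spec = parse_lm_spec("0:3:factored-corpus/surface.lm:0")
print(spec.order, spec.path)
```

Read this way, the slide's command asks for a 3-gram model over factor 0 stored in `factored-corpus/surface.lm`.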
% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"
use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5
tokenize
in: raw-stem
out: tokenized-stem
default-name: corpus/tok
pass-unless: input-tokenizer output-tokenizer
template-if: input-tokenizer IN.$input-extension OUT.$input-extension
template-if: output-tokenizer IN.$output-extension OUT.$output-extension
parallelizable: yes
working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data
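The two settings above show one value referring to another through a `$name` reference. A minimal sketch of how such references could be expanded, assuming names consist of letters, digits, underscores and hyphens (the `expand` helper is hypothetical and is not the experiment management system's own code):

```python
import re

# Assumption: a reference is '$' followed by letters, digits, '_' or '-'.
REFERENCE = re.compile(r"\$([A-Za-z0-9_-]+)")


def expand(settings: dict) -> dict:
    """Return a copy of settings with every $name reference substituted."""
    resolved = {}

    def resolve(name, seen=()):
        if name in resolved:
            return resolved[name]
        if name in seen:
            raise ValueError(f"circular reference involving '{name}'")
        raw = settings[name]  # raises KeyError for an undefined name
        value = REFERENCE.sub(
            lambda match: resolve(match.group(1), seen + (name,)), raw
        )
        resolved[name] = value
        return value

    for name in settings:
        resolve(name)
    return resolved


config = expand({
    "working-dir": "/home/pkoehn/experiment",
    "wmt10-data": "$working-dir/data",
})
print(config["wmt10-data"])  # prints /home/pkoehn/experiment/data
```

The reference stops at the first character outside the assumed name alphabet, which is why `$working-dir/data` resolves `working-dir` and leaves `/data` untouched.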
customization
challenge
do-it-yourself
MT factory
on the cloud
• Automated training of SMT systems from specified collections of data
• Repository of parallel and monolingual corpora
• Based on the open-source MT tools GIZA++ and Moses
• Services for data collection, MT generation, customization and running of a variety of user-tailored MT systems
Platform Architecture
[Diagram. Recoverable labels:]
Training Data Provided
Stages: Sharing of training data, Training, Using
Training components: Giza++, Moses SMT toolkit
Repositories: SMT Resource Repository, SMT Multi-Model Repository (trained SMT models)
Data flow: Upload; Processing, Evaluation ...
Access: Anonymous access, Authenticated access
System management, user authentication, access rights control ...
Interfaces: Web page, Web service, Web page translation widget, CAT tools, Web browser plug-ins
Directories: SMT Resource Directory, SMT System Directory
Moses decoder
• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration
integration
• Training data on the LetsMT! platform
• 119 languages
• 2.1 B parallel units in total
• 253 language pairs
• 860 corpora
• 249 production MT systems
currently on
the platform
General Domain MT
English – Lithuanian
DATA
5.3 M parallel sentences
81 M monolingual sentences
QUALITY
LetsMT – 26.65 BLEU
Google – 25.85 BLEU
Beating
Google Translate
• MT service for e-Government
• Mobile Translation app
• Desktop Translation Tool
Productivity
► Average translation productivity:
► Baseline with TM only: 550 w/h
► With TM and MT: 731 w/h
32.9% productivity increase
► High variability in individual performance
► Error score increased from 20.2 to 28.6 points, but remained at the “GOOD” level (<30 points)
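The productivity figure follows directly from the two words-per-hour rates. A short worked check, using only the numbers on the slide (the helper name `relative_increase` is illustrative):

```python
def relative_increase(baseline: float, improved: float) -> float:
    """Percentage change from baseline to improved."""
    return (improved - baseline) / baseline * 100.0


# Words per hour, as reported on the slide.
tm_only = 550
tm_and_mt = 731

gain = relative_increase(tm_only, tm_and_mt)
print(f"{gain:.1f}% productivity increase")  # prints 32.9% productivity increase

# Error score before and after; the slide places the "GOOD" level below 30 points.
error_before, error_after = 20.2, 28.6
print(round(error_after - error_before, 1))  # prints 8.4
print(error_after < 30)                      # prints True
```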
[Chart: productivity increase by language]
Czech: 25.1%
Polish: 28.5%
Latvian: 32.9%
How to instruct
SMT to use the
right terms?
terminology
as a
service
cloud-based platform for acquiring, cleaning, sharing, and reusing multilingual terminological data
TaaS Services
Term identification and annotation
Identifying and marking terms
[Diagram. Recoverable labels:]
TaaS Terminology Services
Terminology Annotation Web Service API
Showcase Web Page
Content types: Plaintext, Term-annotated content, ITS 2.0 enriched content, ITS 2.0 term-annotated content
export / visualisation
Machine users: CAT Tools, MT Systems
Human users (e.g., translators, terminologists)
• New W3C standard: Internationalization Tag Set (ITS) 2.0
HTML Term Annotation
Term entries for terms identified in EuroTermBank are stored in TBX format in a <script> element that is placed in the HTML5 document.
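A rough sketch of what such an annotation might look like, generated here with Python. The `its-term` and `its-term-info-ref` attributes come from the ITS 2.0 Terminology data category; everything else is an assumption made for illustration and not taken from the TaaS output: the helper names, the `term-1` identifier, the `type="text/xml"` value on the script element, the minimal TBX entry, and the way the span points at the entry.

```python
from html import escape


def annotate_term(text: str, term: str, entry_id: str) -> str:
    """Wrap the first occurrence of term in a span with ITS 2.0 term attributes."""
    before, found, after = text.partition(term)
    if not found:
        return escape(text)
    span = (
        f'<span its-term="yes" its-term-info-ref="#{escape(entry_id)}">'
        f"{escape(term)}</span>"
    )
    return escape(before) + span + escape(after)


def tbx_script(tbx_xml: str) -> str:
    """Embed a TBX fragment in a script element (type value is an assumption)."""
    return f'<script type="text/xml">{tbx_xml}</script>'


# Hypothetical, minimal TBX entry for one English term.
entry = (
    '<termEntry id="term-1"><langSet xml:lang="en">'
    "<tig><term>gearbox</term></tig></langSet></termEntry>"
)

print(annotate_term("Check the gearbox oil.", "gearbox", "term-1"))
print(tbx_script(entry))
```

Keeping the term data in a script element leaves the visible text unchanged, since browsers do not render script content.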
XLIFF Term Annotation
Narrow Domain Automotive MT
English – Latvian
DATA
2 M unique parallel sentences
1.9 M monolingual sentences
0.2 M in-domain monolingual
QUALITY
16% improvement from
terminology integration
Beating
Google Translate
synergy of machine translation and terminology services on the cloud
tilde.com
The research within the projects LetsMT! and TaaS has received funding from the European Commission ICT Policy Support Programme (ICT PSP) and FP7 Programme
thank you