TAUS MT SHOWCASE, MT & Terminology Better Together, Andrejs Vasiljevs, 10 October 2013

Andrejs Vasiļjevs

chairman of the [email protected]

Localization World, Santa Clara, October 9, 2013

MT & Terminology:

better together

• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 130 employees

• Strong R&D team

• 5 PhDs, 80+ research papers

• Trusted partner of the EU for significant R&D projects

data

challenge

platform

challenge

[ttable-file]

0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat

steps/1/LM_toy_tokenize.1

steps/1/LM_toy_tokenize.1.DONE

steps/1/LM_toy_tokenize.1.INFO

steps/1/LM_toy_tokenize.1.STDERR

steps/1/LM_toy_tokenize.1.STDERR.digest

steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \

--corpus factored-corpus/proj-syndicate \

--root-dir unfactored \

--f de --e en \

--lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true

alignment-symmetrization-method = berkeley

berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh

berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh

berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar

berkeley-java-options = "-server -mx30000m -ea"

berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"

berkeley-process-options = "-EMWordAligner.numThreads 8"

berkeley-posterior = 0.5

tokenize

in: raw-stem

out: tokenized-stem

default-name: corpus/tok

pass-unless: input-tokenizer output-tokenizer

template-if: input-tokenizer IN.$input-extension OUT.$input-extension

template-if: output-tokenizer IN.$output-extension OUT.$output-extension

parallelizable: yes

working-dir = /home/pkoehn/experiment

wmt10-data = $working-dir/data

customization

challenge


do-it-yourself

MT factory

on the cloud

• Automated training of SMT systems from specified collections of data

• Repository of parallel and monolingual corpora

• Based on the open-source MT tools GIZA++ and Moses

• Services for data collection, MT generation, customization and running of a variety of user-tailored MT systems

Platform Architecture

[Diagram. Labels recovered from the figure: Training Data Provided; Sharing of training data; Training; Using; Upload; Anonymous access; Authenticated access; Giza++; Moses SMT toolkit; Processing, Evaluation ...; SMT Resource Repository; SMT Resource Directory; SMT Multi-Model Repository (trained SMT models); SMT System Directory; Moses decoder; Web page; Web service; Web page translation widget; CAT tools; Web browser plug-ins; System management, user authentication, access rights control ...]

• Integration with CAT tools

• Integration in web pages

• Integration in web browsers

• API-level integration

integration

• Training data on the LetsMT! platform

• 119 languages

• 2.1 B parallel units in total

• 253 language pairs

• 860 corpora

• 249 production MT systems

currently on

the platform

General Domain MT

English – Lithuanian

DATA

5.3 M parallel sentences

81 M monolingual sentences

QUALITY

LetsMT – 26.65 BLEU

Google – 25.85 BLEU

Beating

Google Translate

• MT service for

e-Government

• Mobile Translation

app

• Desktop Translation

Tool

Productivity

►Average translation productivity:

►Baseline with TM only: 550 w/h

►With TM and MT: 731 w/h

32.9% productivity increase

►High variability in individual performance

►Error score increased from 20.2 to 28.6 points, still within the “GOOD” level (<30 points)
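The 32.9% figure follows directly from the two throughput numbers on this slide. A minimal check in Python; the helper name `relative_increase` is only illustrative and not part of any tool mentioned in the talk:

```python
def relative_increase(baseline: float, improved: float) -> float:
    """Return the relative increase of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Words per hour, as reported on the slide
tm_only = 550.0
tm_plus_mt = 731.0

productivity_gain = relative_increase(tm_only, tm_plus_mt)
print(f"{productivity_gain:.1f}% productivity increase")  # 32.9% productivity increase

# Error score change reported on the slide (absolute points, not percent)
error_delta = 28.6 - 20.2
print(f"Error score rose by {error_delta:.1f} points")  # Error score rose by 8.4 points
```

The productivity gain is a relative change in words per hour, while the error score change is an absolute difference in points, so the two numbers are not directly comparable.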

Productivity increase by language: Czech 25.1%, Polish 28.5%, Latvian 32.9%

How to instruct

SMT to use the

right terms?

koks

timber
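One mechanism the Moses decoder offers for this is XML markup in the input: a source span is wrapped in a tag whose `translation` attribute carries the preferred target term, and the decoder is run with `-xml-input exclusive`. The slides do not say whether the platform described here uses this mechanism, so the following is only a sketch of the idea. The function `mark_terms`, the tag name `n` (borrowed from the Moses manual's examples) and the glossary entry are illustrative assumptions.

```python
import re


def mark_terms(sentence: str, glossary: dict[str, str]) -> str:
    """Wrap glossary source terms in Moses-style XML markup.

    Assumes `sentence` is already tokenized for Moses and that target
    terms contain no double quotes or angle brackets.
    """
    lookup = {src.lower(): tgt for src, tgt in glossary.items()}
    for tgt in lookup.values():
        if any(ch in tgt for ch in '"<>'):
            raise ValueError(f"unsupported character in target term: {tgt!r}")
    if not lookup:
        return sentence

    # Longest terms first, so multi-word terms win over their sub-terms.
    longest_first = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in longest_first) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match: re.Match) -> str:
        target = lookup.get(match.group(0).lower())
        if target is None:  # unusual case-folding mismatch: leave text untouched
            return match.group(0)
        return f'<n translation="{target}">{match.group(0)}</n>'

    return pattern.sub(wrap, sentence)


# Illustrative English-Latvian glossary entry (an assumption, not from the slides)
glossary = {"timber": "kokmateriāli"}
marked = mark_terms("the timber is stored outside .", glossary)
print(marked)
# the <n translation="kokmateriāli">timber</n> is stored outside .
```

The marked-up sentence would then be piped to the decoder, for example `moses -f moses.ini -xml-input exclusive`. This simple version does not handle inflected forms, which matters for morphologically rich languages such as Latvian.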


terminology

as a

service


cloud-based

platform for

acquiring, cleaning,

sharing, and reusing

multilingual

terminological data

TaaS Services

Term identification and annotation: identifying and marking terms

[Diagram. Labels recovered from the figure: TaaS Terminology Services; Showcase Web Page; Terminology Annotation Web Service API; Plaintext; ITS 2.0 enriched content; Term-annotated content; ITS 2.0 term-annotated content; export / visualisation; Machine users (CAT Tools, MT Systems); Human users (e.g., translators, terminologists)]

• New W3C standard: Internationalization Tag Set (ITS) 2.0

HTML Term Annotation

Term entries for terms identified in EuroTermBank are stored in TBX format in a <script> element that is placed in the HTML5 document.
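To make the HTML annotation concrete: ITS 2.0 defines the local HTML5 attributes `its-term` and `its-term-info-ref` for the Terminology data category. The sketch below wraps one term in such a span. The function `annotate_term` and the identifier `term-1` are illustrative assumptions; the exact markup TaaS emits, including how the TBX `<script>` element is identified, is not shown on the slides.

```python
from html import escape


def annotate_term(text: str, term: str, entry_id: str) -> str:
    """Wrap the first occurrence of `term` in an ITS 2.0 term-annotated <span>.

    `entry_id` is a hypothetical identifier of the matching term entry;
    the reference is emitted as a same-document fragment (#entry_id).
    """
    start = text.find(term)
    if start < 0:
        return escape(text)  # term not present: return the escaped text as-is
    end = start + len(term)
    span = (
        f'<span its-term="yes" its-term-info-ref="#{escape(entry_id)}">'
        f"{escape(term)}</span>"
    )
    return escape(text[:start]) + span + escape(text[end:])


annotated = annotate_term("The timber is dry.", "timber", "term-1")
print(annotated)
# The <span its-term="yes" its-term-info-ref="#term-1">timber</span> is dry.
```

A consumer such as a CAT tool or an MT system can then locate terms by looking for `its-term="yes"` and follow `its-term-info-ref` to the corresponding term entry.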

XLIFF Term Annotation

Narrow Domain Automotive MT

English – Latvian

DATA

2 M unique parallel sentences

1.9 M monolingual sentences

0.2 M in-domain monolingual

QUALITY

16% improvement from

terminology integration

Beating

Google Translate

synergy of machine translation and terminology services on the cloud

tilde.com

The research within the projects LetsMT! and TaaS has received funding from the European Commission ICT Policy Support Programme (ICT PSP) and FP7 Programme.

thank you

Technology
