MMT – Modern, Next Generation Machine Translation, Achim Ruopp (TAUS)

Achim Ruopp, Director of R&D, TAUS

MMT Project

Horizon 2020 Innovation Action

3M € funding

3 years: 2015-2017

Goal: deliver a large-scale commercial online machine translation service based on a new open-source distributed architecture.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 645487.

MMT Team

Business Research

Special thanks to Marcello Federico (FBK) and Ulrich Germann (University of Edinburgh) for many of the slides!

Setting up MT for CAT today

1. Select TMs

2. Collect extra data

3. Train and evaluate engine

4. Doesn’t work? Back to 2.

5. Analyse/process input documents

6. Apply MT on fake TM

7. Import TMs in CAT tool

8. Start translating

9. Adapt engine to new data - go back to 3.

10. New project? Back to 1.

The MMT way

1. Drag & drop your private TMs

2. Connect your CAT with a key

3. Start translating!

Modern MT in a nutshell

Zero training time

Manages context

Learns from users

Scales with data and users

Prototype (April 2016) - Fast training

Context aware translation

SENTENCE: party

CONTEXT: We are going out.

TRANSLATION: fête

SENTENCE: party

CONTEXT: We approved the law

TRANSLATION: parti
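The idea of context-aware translation can be illustrated with a toy sketch. It is not the MMT implementation: the two domains, their sample texts, the one-word lexicons, and the bag-of-words cosine score are all assumptions made up for this example. It only shows how a context sentence could select which domain's translation of "party" wins.

```python
import math
import re
from collections import Counter

# Hypothetical domains: a sample text plus a tiny domain-specific lexicon.
# "europarl" is named on the REST API slide; "casual" is invented here.
DOMAINS = {
    "europarl": {
        "sample": "the parliament approved the law after the vote",
        "lexicon": {"party": "parti"},
    },
    "casual": {
        "sample": "we are going out tonight with friends",
        "lexicon": {"party": "fête"},
    },
}


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_domains(context):
    """Score every domain against the context, best match first."""
    ctx = bag_of_words(context)
    scores = [
        {"id": name, "score": cosine(ctx, bag_of_words(domain["sample"]))}
        for name, domain in DOMAINS.items()
    ]
    return sorted(scores, key=lambda item: item["score"], reverse=True)


def translate(word, context):
    """Translate a word with the lexicon of the best-matching domain."""
    best_domain = score_domains(context)[0]["id"]
    return DOMAINS[best_domain]["lexicon"][word]
```

With these toy domains, `translate("party", "We are going out.")` picks the "casual" lexicon and `translate("party", "We approved the law")` picks the "europarl" one, mirroring the two cases on the slide.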

Prototype (March 2016)

MS Translator Hub vs Modern MT

MMT vs. Moses core language processing

● More supported languages

● Faster processing

● Simpler to use

● Tags and XML management

● Localisation of expressions

REST API

GET /translate?q=party&context=We+approved+the+law

{
  "translation": "parti",
  "context": [
    { "id": "europarl", "score": 0.10343984 },
    …
  ]
}
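A minimal client-side sketch of the request and response shown above. The host name `mmt.example` is a placeholder, the helper names are hypothetical, and the sample response simply repeats the fields from the slide with the outer braces assumed. No network call is made; the code only builds the URL and reads the JSON.

```python
import json
from urllib.parse import urlencode

# Placeholder host: the slides do not say where the service runs.
BASE_URL = "http://mmt.example"


def build_translate_url(sentence, context, base_url=BASE_URL):
    """Build the GET /translate URL with URL-encoded q and context."""
    query = urlencode({"q": sentence, "context": context})
    return f"{base_url}/translate?{query}"


def read_translation(response_text):
    """Return the translation and the id of the highest-scoring context."""
    payload = json.loads(response_text)
    scored = payload.get("context", [])
    top = max(scored, key=lambda item: item["score"], default=None)
    return payload["translation"], (top["id"] if top else None)


# Sample body with the fields from the slide (structure assumed).
SAMPLE_RESPONSE = (
    '{"translation": "parti", '
    '"context": [{"id": "europarl", "score": 0.10343984}]}'
)
```

`build_translate_url("party", "We approved the law")` yields the same query string as on the slide, and `read_translation(SAMPLE_RESPONSE)` returns the translation together with the best-matching context id.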

MMT Architecture

MMT Data Pooling

Partner’s repositories: MyMemory (Translated)

Data Cloud (TAUS)

Volume pooled for the English-Italian prototypes

ca 785M words & 423M segments in total

MMT Data Collection from CommonCrawl

commoncrawl.org – US-based non-profit

“CommonCrawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”

On average 1.5 billion unique URLs per crawl, vs. an estimated 50 billion pages in the Google index and 20 billion pages in the Microsoft Bing index

What can be considered the “surface web” vs. the “deep web”?

Two questions

1. What language are these pages in?

2. Which pages are translations of each other?
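One common heuristic for the second question is to pair URLs that differ only in a language code. The sketch below assumes that convention (for example `/en/` vs. `/it/` as a path segment); the slides do not say which method the project uses, and the function names are hypothetical. The first question, language identification, would need a separate tool and is not covered here.

```python
import re
from collections import defaultdict

# Language codes to look for as a whole path segment (assumed convention).
LANGS = ("en", "it")
LANG_SEGMENT = re.compile(r"(?<=/)(" + "|".join(LANGS) + r")(?=/|$)")


def split_language(url):
    """Replace the first language path segment with a placeholder.

    Returns (key, language), or (url, None) when no segment matches.
    """
    match = LANG_SEGMENT.search(url)
    if match is None:
        return url, None
    key = url[:match.start()] + "{lang}" + url[match.end():]
    return key, match.group(1)


def candidate_pairs(urls, source="en", target="it"):
    """Pair URLs that are identical apart from the language segment."""
    by_key = defaultdict(dict)
    for url in urls:
        key, lang = split_language(url)
        if lang is not None:
            by_key[key][lang] = url
    return sorted(
        (versions[source], versions[target])
        for versions in by_key.values()
        if source in versions and target in versions
    )
```

Pairs found this way are only candidates: the page contents would still have to be checked before being treated as translations of each other.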

Monolingual Data Including English

Monolingual Data Excluding English

Parallel Data Projections from en→it

MMT is Open Source

LGPL/Apache licences

new core technology

github.com/ModernMT/MMT

soon: github.com/ModernMT/DataCollection

email me if you are interested

Roadmap

2015 Q1: development started

2016 Q2: first alpha release. 10 langs, fast training, context aware, distributed

2016 Q4: first beta release. 45 langs, incremental learning

2017 Q4: final release, enterprise ready

This slide may not be used or copied without permission from TAUS

THANK YOU!
