MMT – Modern, Next Generation Machine Translation
Achim Ruopp, Director of R&D
MMT Project
Horizon 2020 Innovation Action
3M € funding
3 years: 2015-2017
Goal:
Deliver a large-scale commercial online machine translation service based on a new open-source distributed architecture.
This project has received funding from the European Union's Horizon 2020
research and innovation programme under grant agreement No 645487.
MMT Team
Business Research
Special thanks to Marcello Federico (FBK) and Ulrich Germann
(University of Edinburgh) for many of the slides!
Setting up MT for CAT today
1. Select TMs
2. Collect extra data
3. Train and evaluate engine
4. Doesn’t work? Go back to 2.
5. Analyse/process input documents
6. Apply MT on fake TM
7. Import TMs in CAT tool
8. Start translating
9. Adapt engine to new data: go back to 3.
10. New project? Go back to 1.
The MMT way
1. Drag & drop your private TMs
2. Connect your CAT tool with a key
3. Start translating!
Modern MT in a nutshell
Zero training time
Manages context
Learns from users
Scales with data and users
Prototype (April 2016) - Fast training
Context aware translation
SENTENCE: party
CONTEXT: We are going out.
TRANSLATION: fête
SENTENCE: party
CONTEXT: We approved the law
TRANSLATION: parti
Prototype (March 2016)
MS Translator Hub vs Modern MT
MMT vs. Moses core language processing
● More supported languages
● Faster processing
● Simpler to use
● Tags and XML management
● Localisation of expressions
REST API
GET /translate?q=party&context=We+approved+the+law
{
  "translation": "parti",
  "context": [
    { "id": "europarl", "score": 0.10343984 },
    …
  ]
}
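A minimal Python sketch of how a client might build this request and read the response. It relies only on what the slide shows: the `/translate` path, the `q` and `context` parameters, and the `translation` and `context` response fields. The function names and the sample response body are illustrative assumptions, and no server is contacted.

```python
import json
from urllib.parse import urlencode


def build_translate_path(text, context):
    """Build the request path with the q and context parameters from the slide."""
    return "/translate?" + urlencode({"q": text, "context": context})


def parse_translation(body):
    """Return (translation, id of the highest-scoring context entry or None)."""
    data = json.loads(body)
    top = max(data.get("context", []), key=lambda c: c["score"], default=None)
    return data["translation"], (top["id"] if top else None)


# Assumed sample body: the slide's response is truncated, so only the one
# context entry that is visible there is reproduced here.
SAMPLE_RESPONSE = (
    '{"translation": "parti", '
    '"context": [{"id": "europarl", "score": 0.10343984}]}'
)

path = build_translate_path("party", "We approved the law")
translation, top_context = parse_translation(SAMPLE_RESPONSE)
print(path)
print(translation, top_context)
```

Sending `path` to a real server would need a host name and, as the "connect your CAT with a key" step suggests, probably a key; neither is given on the slides.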
MMT Architecture
MMT Data Pooling
Partners’ repositories:
MyMemory (Translated)
Data Cloud (TAUS)
Volume pooled for the English-Italian prototypes:
ca. 785M words & 423M segments in total
MMT Data Collection from CommonCrawl
commoncrawl.org – US-based non-profit
“CommonCrawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.”
On average 1.5 billion unique URLs per crawl
vs. an estimated 50 billion pages in the Google index and 20 billion pages in the Microsoft Bing index
What can be considered the “surface web” vs. the “deep web”?
Two questions
1. What language are these pages in?
2. Which pages are translations of each other?
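The second question is often approached with a URL-matching heuristic: pages whose URLs differ only in a language marker are treated as candidate translations. The Python sketch below illustrates that general idea. It is not necessarily the method used in the MMT data collection, the list of language codes is a hypothetical sample, and the language of each page (question 1) is assumed to have been detected already.

```python
import re
from collections import defaultdict

# Hypothetical sample of language markers to look for in URLs.
LANG_CODES = ("en", "it", "fr", "de", "es")
_LANG_TOKEN = re.compile(
    r"(?<![a-z])(" + "|".join(LANG_CODES) + r")(?![a-z])", re.IGNORECASE
)


def url_key(url):
    """Replace standalone language markers in a URL with a placeholder."""
    return _LANG_TOKEN.sub("*", url.lower())


def candidate_pairs(pages, src="en", tgt="it"):
    """Pair pages whose URLs match once language markers are masked.

    pages: iterable of (url, detected_language) tuples.
    """
    buckets = defaultdict(dict)
    for url, lang in pages:
        # Keep the first URL seen for each language under a given key.
        buckets[url_key(url)].setdefault(lang, url)
    return [
        (by_lang[src], by_lang[tgt])
        for by_lang in buckets.values()
        if src in by_lang and tgt in by_lang
    ]


pages = [
    ("http://example.com/en/about.html", "en"),
    ("http://example.com/it/about.html", "it"),
    ("http://example.com/en/news.html", "en"),
]
print(candidate_pairs(pages))
```

Pairs found this way are only candidates; the page contents would still need to be compared before they are treated as parallel data.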
Monolingual Data Including English
Monolingual Data Excluding English
Parallel Data Projections from en→it
MMT is Open Source
LGPL/Apache licences
new core technology
github.com/ModernMT/MMT
soon: github.com/ModernMT/DataCollection
email me if you are interested
Roadmap
2015 Q1: development started
2016 Q2: first alpha release. 10 langs, fast training, context aware, distributed
2016 Q4: first beta release. 45 langs, incremental learning
2017 Q4: final release, enterprise ready
This slide may not be used or copied without permission from TAUS
THANK YOU!