Large Scale Machine Translation Architectures
Qin Gao

Outline

Typical problems in machine translation
Programming model for machine translation: MapReduce
Required system components / supporting software:
◦ Distributed streaming data storage system
◦ Distributed structured data storage system
Integration: how to build a fully distributed system

23/4/20, Qin Gao, LTI, CMU

Why large-scale MT?

We need more data...

But...

Some representative MT problems

Counting events in corpora
◦ N-gram counting
Sorting
◦ Phrase table extraction
Preprocessing data
◦ Parsing, tokenization, etc.
Iterative optimization
◦ GIZA++ (all EM algorithms)

Characteristics of different tasks

Counting events in corpora
◦ Extract knowledge from data
Sorting
◦ Process data; the knowledge is inside the data
Preprocessing data
◦ Process data; requires external knowledge
Iterative optimization
◦ In each iteration, process the data using existing knowledge, then update the knowledge


Components required for large-scale MT

(Diagram: three components: Stream Data, Structured Knowledge, Processor)

Problems for each component

Stream data:
◦ As the amount of data grows, even one complete pass over the data becomes impossible.
Processor:
◦ A single processor's computation power is not enough.
Knowledge:
◦ The tables are too large to fit into memory.
◦ Cache-based or distributed knowledge bases suffer from low speed.

Make it simple: what is the underlying problem?

We have a huge cake and we want to cut it into pieces and eat it.

Different cases:
◦ We just need to eat the cake.
◦ We also want to count how many peanuts are inside the cake.
◦ (Sometimes) we have only one fork!

Parallelization

Solutions

Large-scale distributed processing
◦ MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean, Sanjay Ghemawat. Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113.
Handling huge streaming data
◦ The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003, pp. 20-43.
Handling structured data
◦ Large Language Models in Machine Translation. Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. Proceedings of EMNLP-CoNLL 2007, pp. 858-867.
◦ Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber. 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006, pp. 205-218.

MapReduce

MapReduce can refer to:
◦ A programming model for massive, unordered, streaming data processing tasks (MUD)
◦ A supporting software environment implemented by Google Inc.
Alternative implementation:
◦ Hadoop, by the Apache Foundation

MapReduce programming model

Abstracts the computation into two functions:
◦ Map
◦ Reduce
The user is responsible for implementing the Map and Reduce functions; the supporting software takes care of executing them.
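This division of labor can be sketched as a toy, single-machine driver. The names here (`run_mapreduce`, `map_fn`, `reduce_fn`) are illustrative, not any framework's API; a real implementation distributes the map calls and the sort across a cluster:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine MapReduce driver.

    records:   iterable of input (key, value) pairs.
    map_fn:    map_fn(key, value) yields intermediate (key, value) pairs.
    reduce_fn: reduce_fn(key, values) returns the result for one key.
    """
    # Map phase: emit intermediate pairs for every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort phase: group all values sharing an intermediate key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct intermediate key.
    return {k: reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))}
```

The driver makes the contract explicit: Map never sees other records, and Reduce never sees other keys, which is exactly what allows both phases to run in parallel.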

Representation of data

The streaming data is abstracted as a sequence of key/value pairs.
Example:
◦ (sentence_id : sentence_content)

Map function

The Map function takes an input key/value pair and outputs a set of intermediate key/value pairs.

(Diagram: each Map() call consumes one input pair, e.g. (Key1 : Value1), and emits intermediate pairs (Key1 : Value1), (Key2 : Value2), (Key3 : Value3), ...)

Reduce function

The Reduce function accepts one intermediate key and the set of intermediate values for that key, and produces the result.

(Diagram: all pairs (Key1 : Value1), (Key1 : Value2), (Key1 : Value3), ... go to one Reduce() call; all Key2 pairs go to another; each Reduce() call produces one result.)

The architecture of MapReduce

(Diagram: Map function → distributed sort → Reduce function)

Benefits of MapReduce

Automatic data splitting
Fault tolerance
High-throughput computing; uses the nodes efficiently
Most important: simplicity. You just need to convert your algorithm to the MapReduce model.

Requirements for expressing an algorithm in MapReduce

Process unordered data
◦ The data must be unordered: no matter in what order the data is processed, the result should be the same.
Produce independent intermediate keys
◦ The Reduce function cannot see the values of other keys.

Example

Distributed word count (1)
◦ Input key: word
◦ Input value: 1
◦ Intermediate key: constant
◦ Intermediate value: 1
◦ Reduce(): sum all intermediate values

Distributed word count (2)
◦ Input key: document/sentence ID
◦ Input value: document/sentence content
◦ Intermediate key: constant
◦ Intermediate value: number of words in the document/sentence
◦ Reduce(): sum all intermediate values
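Variant (2) can be written as one pair of functions, a minimal sketch with illustrative names:

```python
def wordcount_map(sentence_id, sentence):
    # Intermediate key is a constant; the value is this sentence's word count.
    yield ("TOTAL", len(sentence.split()))

def wordcount_reduce(key, values):
    # Sum the per-sentence counts into the corpus-wide total.
    return sum(values)
```

Since every pair shares the constant key "TOTAL", a single Reduce() call sees all the per-sentence counts and adds them up.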

Example 2

Distributed unigram count
◦ Input key: document/sentence ID
◦ Input value: document/sentence content
◦ Intermediate key: word
◦ Intermediate value: number of occurrences of the word in the document/sentence
◦ Reduce(): sum all intermediate values
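The same scheme in code, a sketch with illustrative function names:

```python
from collections import Counter

def unigram_map(sentence_id, sentence):
    # Emit one (word, count) pair per distinct word in this sentence.
    for word, count in Counter(sentence.split()).items():
        yield (word, count)

def unigram_reduce(word, counts):
    # Sum the per-sentence counts into the corpus-wide count for this word.
    return sum(counts)
```

Here the intermediate key is the word itself, so the framework's grouping step routes all counts for one word to the same Reduce() call.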

Example 3

Distributed sort
◦ Input key: entry key
◦ Input value: entry content
◦ Intermediate key: entry key (modification may be needed for ascending/descending order)
◦ Intermediate value: entry content
◦ Reduce(): output all the entry contents
This makes use of the built-in sorting functionality.
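Because the framework already sorts by intermediate key, distributed sort needs only identity-like functions, sketched here with illustrative names:

```python
def sort_map(entry_key, entry_content):
    # Identity map: the shuffle phase sorts by intermediate key, so
    # emitting the entry key unchanged yields a globally sorted output.
    yield (entry_key, entry_content)

def sort_reduce(entry_key, contents):
    # Entries arrive grouped by key, in sorted key order; emit them as-is.
    return list(contents)
```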

Supporting MapReduce: distributed storage

A reminder of what we are dealing with in MapReduce:
◦ Massive, unordered, streaming data
Motivation:
◦ We need to store a large amount of data
◦ Make use of the storage on all the nodes
◦ Automatic replication
   Fault tolerant
   Avoids hot spots: a client can read from many servers
Examples: Google FS and Hadoop FS (HDFS)

Design principles of Google FS

Optimized for a special workload:
◦ Large streaming reads, small random reads
◦ Large streaming writes, rare modification
Supports concurrent appending
◦ It effectively assumes the data are unordered
High sustained bandwidth is more important than low latency; fast response time is not important
Fault tolerant

Google FS architecture

Optimized for large streaming reads and large, concurrent writes
Small random reads/writes are also supported, but not optimized
Allows appending to existing files
Files are split into chunks and stored on several chunk servers
A master is responsible for storing and answering queries about chunk information

Google FS architecture (diagram)

Replication

When a chunk is frequently or "simultaneously" read by many clients, the server holding it may be overwhelmed or fail
A fault in one server may make the file unusable
Solution: store the chunks on multiple machines
The number of replicas of each chunk is the replication factor

HDFS

HDFS shares similar design principles with Google FS
Write-once-read-many: a file can only be written once; even appending is not allowed
"Moving computation is cheaper than moving data"

Are we done?

NO... There are problems with the existing architecture

We are good at dealing with data

What about knowledge, i.e., structured data?
What if the size of the knowledge is HUGE?

A good example: GIZA

A typical EM algorithm:

(Flowchart: Word Alignment → Collect Counts → "Has more sentences?"; if yes, loop back to Word Alignment; if no, Normalize Counts → "Has more iterations?"; if yes, start the next iteration; if no, stop.)

When parallelized: seems to be a perfect MapReduce application

(Flowchart: the Word Alignment / Collect Counts loop is replicated across the nodes of a cluster; each copy aligns its share of the sentences and collects counts, then the counts are normalized and the next iteration begins.)

However:

(Diagram: the large parallel corpus is split into corpus chunks; the Map step aligns each chunk and holds a count table in memory; the Reduce step combines them into a combined count table; renormalization produces the statistical lexicon, which is redistributed for the next iteration. Both the in-memory tables and the data I/O between stages are substantial.)

Huge tables

Lexicon probability table: the T-table
Up to 3 GB in the early stages
As the number of workers increases, they all need to load this 3 GB file!
And all the nodes need 3 GB+ of memory. Do we need a cluster of supercomputers?

Another example: decoding

Consider language models: what can we do if the language model grows to several TBs?
We need a storage/query mechanism for large, structured data
Considerations:
◦ Distributed storage
◦ Fast access: the network has high latency

Google Language Model

Storage:
◦ Central storage or distributed storage
How to deal with latency?
◦ Modify the decoder: collect a number of queries and send them in one batch.
It is a specific application; we still need something more general.

Again, made in Google: Bigtable

It is specially optimized for structured data
It serves many applications now
It is not a complete database
Definition:
◦ A Bigtable is a sparse, distributed, persistent, multi-dimensional, sorted map

Data model in Bigtable

A four-dimensional table:
◦ Row
◦ Column family
◦ Column
◦ Timestamp

(Diagram: a cell is addressed by row → column family → column → timestamp)
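The sorted-map definition can be sketched as a map keyed by (row, column family, column, timestamp). This is a toy model of the data model only, not Bigtable's API; class and method names are invented:

```python
class ToyBigtable:
    """Toy model of Bigtable's data model: a sparse, sorted map from
    (row, column_family, column, timestamp) to a byte-array value."""

    def __init__(self):
        self._cells = {}

    def put(self, row, family, column, timestamp, value):
        self._cells[(row, family, column, timestamp)] = value

    def get(self, row, family, column):
        # Return the value with the latest timestamp for this cell.
        versions = [(ts, v) for (r, f, c, ts), v in self._cells.items()
                    if (r, f, c) == (row, family, column)]
        return max(versions)[1] if versions else None

    def scan_row_range(self, start_row, end_row):
        # Rows are kept in sorted order, which is what makes range
        # scans over a tablet (a contiguous row range) cheap.
        for key in sorted(self._cells):
            if start_row <= key[0] < end_row:
                yield key, self._cells[key]
```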

Distributed storage unit: tablet

A tablet consists of a range of rows
Tablets can be stored on different nodes and served by different servers
Concurrent reading of multiple rows can therefore be fast

Random access unit: column family

Each tablet is a string-to-string map
(Though not stated explicitly, the API suggests that) at the level of a column family, the index is loaded into memory, so fast random access is possible
The set of column families should be fixed

Tables inside tables: column and timestamp

A column can be any arbitrary string value
A timestamp is an integer
A value is a byte array
So it is actually a table of tables

Performance

Measured as the number of 1000-byte values read/written per second.

What is shocking:
◦ Effective I/O for random reads (from GFS) is more than 100 MB/second
◦ Effective I/O for random reads from memory is more than 3 GB/second

An example: phrase table

Row: the first bigram/trigram of the source phrase
Column family: the length of the source phrase, or some hash of the remaining part of the source phrase
Column: the remaining part of the source phrase
Value: all the phrase pairs for the source phrase
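The key layout above might be sketched as follows, purely for illustration (the function name and the choice of bigram row key plus length-based family are assumptions, not an actual phrase-table implementation):

```python
def phrase_table_key(source_phrase):
    """Split a source phrase into (row, column_family, column)
    coordinates following the layout described above."""
    words = source_phrase.split()
    row = " ".join(words[:2])     # row: first bigram of the source phrase
    column = " ".join(words[2:])  # column: remaining part of the phrase
    family = str(len(words))      # column family: source phrase length
    return row, family, column
```

Because rows sharing a first bigram sort together, phrases with a common prefix land in the same tablet, while distinct prefixes spread across servers.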

Benefits

Different source phrases are served by different servers
The load is balanced, and reading can be concurrent and much faster
Filtering the phrase table before decoding becomes much more efficient

Another example: GIZA++

Lexicon table:
◦ Row: source word ID
◦ Column family: (none)
◦ Column: target word ID
◦ Value: the probability value
With a simple local cache, table loading can be extremely efficient compared to the current implementation.
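A local cache in front of such a table could look like this. The `store` object is a hypothetical stand-in for a Bigtable-like client with a `get(row, column)` method; none of these names come from GIZA++:

```python
class CachedLexicon:
    """Cache t-table lookups so each (source, target) pair is fetched
    from the distributed store at most once per worker."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def prob(self, source_id, target_id):
        key = (source_id, target_id)
        if key not in self._cache:
            # Only a cache miss goes over the network; repeats are local.
            self._cache[key] = self._store.get(source_id, target_id)
        return self._cache[key]
```

Each worker then pulls only the rows its sentences actually touch, instead of loading the whole multi-gigabyte table into memory.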

Conclusion

Strangely, this talk is all about how Google does it
A useful framework for distributed MT systems requires three components:
◦ MapReduce software
◦ A distributed streaming data storage system
◦ A distributed structured data storage system

Open Source Alternatives

MapReduce library → Hadoop
Google FS → Hadoop FS (HDFS)
BigTable → HyperTable

THANK YOU!
