
Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture)

Dr. Gerald Friedland, fractor@icsi.berkeley.edu


Today


• Comments on the Mid-Term
• MapReduce
  - Intro
  - Implementation
  - Algorithm Design

Slide excerpts by:
- Malcolm Slaney, Microsoft Research
- Jimmy Lin, Chris Dyer, UMD

Midterm Results

• Max Score: 40 points = 100%
• Mean Result: 33.2 points = 83%
• Variance: 18.4 points², StdDev: 4.28 points

Comments on the Mid-Term

Memory thrashing: if a process does not have enough page frames, the page-fault rate is high and the system spends most of its time paging rather than executing.

=> low CPU utilization, high I/O.

Unix Command: top


MapReduce

• Idea originated in functional programming
• First used at large scale for parallelization at Google, for accessing BigTable
• Killer app: text indexing!

Basic Parallelization Pattern

  "Work"  --Partition-->  w1, w2, w3
  each wi is processed by a "worker", producing ri
  r1, r2, r3  --Combine-->  "Result"
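As a concrete illustration of this partition/worker/combine pattern, here is a minimal single-machine sketch using Python's multiprocessing Pool; the chunk count and the sum-of-squares work function are illustrative assumptions, not something from the slides.

from multiprocessing import Pool

def worker(chunk):
    # Hypothetical unit of "work": sum of squares over one partition.
    return sum(x * x for x in chunk)

def partition(data, n_parts):
    # Split the "Work" into w1..wn of roughly equal size.
    step = (len(data) + n_parts - 1) // n_parts
    return [data[i:i + step] for i in range(0, len(data), step)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = partition(data, 4)               # Partition
    with Pool(processes=4) as pool:
        partials = pool.map(worker, chunks)   # workers produce r1..rn
    result = sum(partials)                    # Combine
    print(result)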

Reality

Different programming models: message passing (processes P1...P5 exchanging messages) vs. shared memory (processes P1...P5 attached to a common memory).

Fundamental issues: scheduling, data distribution, synchronization, inter-process communication, robustness, fault tolerance, ...

Architectural issues: Flynn's taxonomy (SIMD, MIMD, etc.), network topology, bisection bandwidth, UMA vs. NUMA, cache coherence.

Different programming constructs: mutexes, condition variables, barriers, ...; masters/slaves, producers/consumers, work queues, ...

Common problems: livelock, deadlock, data starvation, priority inversion, ...; dining philosophers, sleeping barbers, cigarette smokers, ...

The reality: the programmer shoulders the burden of managing concurrency. (Solutions: see PARLAB.)

Common Workflow

• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output

Key idea: provide a functional abstraction for these two operations. (Dean and Ghemawat, OSDI 2004)
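To make the Map/Reduce split concrete, here is a minimal in-memory sketch of the workflow (map over records, shuffle/sort by key, reduce per key); it is a toy illustration, not the Hadoop or Google implementation, and the function names are my own.

import itertools

def run_mapreduce(records, map_fn, reduce_fn):
    # Map: extract (key, value) pairs from each record.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # Shuffle and sort: group intermediate values by key.
    intermediate.sort(key=lambda kv: kv[0])
    output = []
    for key, group in itertools.groupby(intermediate, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        # Reduce: aggregate all values that share a key.
        output.append(reduce_fn(key, values))
    return output

# Example: word count.
def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    return (word, sum(counts))

print(run_mapreduce(["a b a", "b c"], map_fn, reduce_fn))
# [('a', 2), ('b', 2), ('c', 1)]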

MapReduce as a Diagram

  Map:            f  f  f  f  f   (f is applied independently to each input element)
  Fold (Reduce):  g  g  g  g  g   (g successively combines the mapped results into a final value)

In the functional view, Map applies f to every element in parallel, and Reduce folds the per-element results together with g.
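The same picture in Python's functional vocabulary (map plus functools.reduce), purely as a sketch of the diagram; f and g here are arbitrary placeholder functions, not from the slides.

from functools import reduce

xs = [1, 2, 3, 4, 5]
f = lambda x: x * x          # applied independently to each element (Map)
g = lambda acc, y: acc + y   # combines the mapped results (Fold / Reduce)

mapped = map(f, xs)
result = reduce(g, mapped, 0)
print(result)  # 55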

MapReduce in Haskell


Very parallelizable!

Google: Part 2

• We already know that Google have a massive amount of computational power:
  - This is largely based on a cluster of workstation-type computers.
  - The conglomerate result is a MIMD-type parallel computer.
• We also know that to use this, we need some efficient parallel algorithms:
  - The PageRank web-page ranking system is one example of this.
  - To efficiently execute PageRank we need good graph algorithms, good data structures and so on ...
• The question is, how do we bridge the gap between the two and implement software on the Google cluster?
• We look at MapReduce, a system that performs this task in an interesting way.

(Dan Page, COMS21102: Software Engineering, Slide 1)

Introduction (1)

• One option might be to use a system like Condor:
  - Condor is basically a system for harnessing idle processing time.
  - Jobs are submitted to a queue and distributed to nodes in the cluster in a managed way.
  - When a node has no user load, the job is run, but it is stopped again when user load returns.
• Negatives:
  - Doesn't really support parallelism, only batch processing.
  - No support for inter-process communication.
• Positives:
  - Due to the lack of parallelism, the programming model is easy to understand.
  - Can automatically checkpoint processes and recover from faults.
  - Fairly extensible in that new processors can be easily added.

(Dan Page, COMS21102: Software Engineering, Slide 2)

Introduction (2)

• One option might be to use a system like MPI:
  - MPI is basically a system for writing message-passing programs.
  - Jobs are forcibly executed on nodes in the cluster.
• Negatives:
  - Need to explicitly define parallelism in programs; the programming model is only moderately easy.
  - Doesn't automatically deal with issues of fault tolerance.
  - There is no real management or load balancing.
• Positives:
  - Can be used as an API to C, a well-known and portable language.
  - Some issues of communication are automatically optimised for the platform.

(Dan Page, COMS21102: Software Engineering, Slide 3)

Introduction (3)

• Neither of these options is ideal (obviously there are some others):
  - We might have to manually deal with parallelism.
  - We might have to manually deal with scalability.
  - We might have to manually deal with fault tolerance.
• Google decided that, in order to get the best out of their platform, they would change the way people wrote parallel programs ...

(Dan Page, COMS21102: Software Engineering, Slide 4)

A Recap on Haskell (1)

• The concept of higher-order functions is central to programming in Haskell:
  - Functions can take other functions as arguments.
  - Functions can return other functions as results.
• Very roughly, this concept is like using function pointers in C or interfaces in Java:
  - In Haskell we can create functions on-the-fly; in C and Java they are static, so we have to define them in the source code.
  - In Haskell the concept is natural; in C and Java it (arguably) isn't.

(Dan Page, COMS21102: Software Engineering, Slide 5)

A Recap on Haskell (2) – map

• As an example, consider the map function:

map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x:xs) = f x : map f xs

• The inputs to map are:
  - A unary function f : A → B.
  - A list ⟨v_0, v_1, ..., v_{n-1}⟩ with v_i ∈ A.
• The output of map is the list ⟨v'_0, v'_1, ..., v'_{n-1}⟩ with v'_i ∈ B such that v'_i = f(v_i).
• The key point is the fact that we can pass a function into map for it to use.
• So, for example, we can execute map (+1) [1,2,3] and get the result [2,3,4].

(Dan Page, COMS21102: Software Engineering, Slide 6)

A Recap on Haskell (3) – reduce

• Another example is the reduce function:

reduce :: (a -> b -> b) -> b -> [a] -> b
reduce f y [] = y
reduce f y (x:xs) = f x (reduce f y xs)

• The inputs to reduce are:
  - A binary function f : A × B → B.
  - A value y ∈ B.
  - A list ⟨v_0, v_1, ..., v_{n-1}⟩ with v_i ∈ A.
• The output of reduce is the value f(v_0, f(v_1, ... f(v_{n-1}, y))) ∈ B.
• Again, the key point is the fact that we can pass a function into reduce for it to use.
• So, for example, we can execute reduce (+) 0 [1,2,3] and get the result 6.

(Dan Page, COMS21102: Software Engineering, Slide 7)

A Recap on Haskell (4) – reduce

• In Haskell programming, reduce is often called foldr, or fold-right.
• We can also define foldl, or fold-left, which applies the function f in the opposite order:

foldl :: (a -> b -> a) -> a -> [b] -> a
foldl f y [] = y
foldl f y (x:xs) = foldl f (f y x) xs

• So, for example, we can still write foldr (+) 0 [1,2,3] and get the result 6, since addition is associative.
• But division is not associative, so:
  - Executing foldr (/) 64 [4,2,4] gives 0.125.
  - Executing foldl (/) 64 [4,2,4] gives 2.000.

(Dan Page, COMS21102: Software Engineering, Slide 8)
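For readers following along in Python, here is a rough analogue of the left/right fold distinction (an addition for illustration, not part of the original slides): functools.reduce behaves like foldl, and a right fold can be emulated explicitly.

from functools import reduce

print(reduce(lambda acc, x: acc + x, [1, 2, 3], 0))   # 6; order is irrelevant for (+)
print(reduce(lambda acc, x: acc / x, [4, 2, 4], 64))  # 2.0, like foldl (/) 64 [4,2,4]

def foldr(f, z, xs):
    # Right fold: f(x0, f(x1, ... f(x_{n-1}, z)))
    acc = z
    for x in reversed(xs):
        acc = f(x, acc)
    return acc

print(foldr(lambda x, acc: x / acc, 64, [4, 2, 4]))   # 0.125, like foldr (/) 64 [4,2,4]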

A Recap on Haskell (5) – reduce

• It turns out that many useful and quite powerful functions can be expressed using just reduce.
• All these examples use the concept of creating a function on-the-fly.
• You can even define map using reduce if you want!

length :: [a] -> Int
length xs = reduce (\y n -> 1 + n) 0 xs

reverse :: [a] -> [a]
reverse xs = reduce (\y ys -> ys ++ [y]) [] xs

map :: (a -> b) -> [a] -> [b]
map f xs = reduce (\y ys -> f y : ys) [] xs

filter :: (a -> Bool) -> [a] -> [a]
filter p xs = reduce (\y ys -> if p y then y : ys else ys) [] xs

• It might be a good idea to write Haskell library code like this ...
  - Imagine you are writing a Haskell compiler such as GHC.
  - To ensure your generated programs are efficient, you just need to optimise the reduce implementation in the library.

(Dan Page, COMS21102: Software Engineering, Slide 9)


MapReduce in Practice

• Programmers specify two functions:
  map (k, v) → <k', v'>*
  reduce (k', v') → <k', v'>*
  - All values with the same key are reduced together
• Usually, programmers also specify:
  partition (k', number of partitions) → partition for k'
  - Often a simple hash of the key, e.g. hash(k') mod n
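A minimal sketch of these three functions in Python, using word count as the example; the function names and the comments are illustrative assumptions, not Hadoop's API.

def map_fn(key, value):
    # key: document id (unused here), value: one line of text.
    # Emits <k', v'>* pairs: one (word, 1) per word.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # All values with the same key arrive together; sum the partial counts.
    return [(key, sum(values))]

def partition_fn(key, num_partitions):
    # Often a simple hash of the key.
    return hash(key) % num_partitions

# e.g. partition_fn("hello", 4) picks which reducer receives the counts for "hello".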


MapReduce

http://people.apache.org/~rdonkin/hadoop-talk/diagrams/map-reduce.png

A MapReduce Engine Typically

• Handles scheduling
  - Assigns workers to map and reduce tasks
• Handles "data distribution"
  - Moves the process to the data
• Handles synchronization
  - Gathers, sorts, and shuffles intermediate data
• Handles faults
  - Detects worker failures and restarts

What to do with I/O?

• Don't move data to workers... move workers to the data!
  - Store data on the local disks of nodes in the cluster
  - Start up the workers on the node that has the data local
• Why?
  - Not enough RAM to hold all the data in memory
  - Disk access is slow and needs to be serialized
  - The disk becomes the bottleneck!

Distributed File System

• A distributed file system is the answer
  - GFS (Google File System)
  - HDFS for Hadoop (= GFS clone)
  - Amazon S3

GFS: Design Decisions

• Files stored as chunks
  - Fixed size (64 MB)
• Reliability through replication
  - Each chunk replicated across 3+ chunkservers
• Single master to coordinate access and keep metadata
  - Simple centralized management
• No data caching
  - Little benefit due to large data sets, streaming reads
• Simplify the API
  - Push some of the issues onto the client

MapReduce Implementation


Hadoop


Hadoop Word Count

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python


Hadoop Word Count

#!/usr/bin/env python

import sys

word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# write the totals to STDOUT, one word and its count per line
for word, count in sorted(word2count.items()):
    print '%s\t%s' % (word, count)


Word Count Command

bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
  -mapper /home/hadoop/mapper.py \
  -reducer /home/hadoop/reducer.py \
  -input gutenberg/* \
  -output gutenberg-output


HadoopMapper

def ReadAndDispatch():
    while True:
        theLine = sys.stdin.readline()
        if theLine == '' or theLine == None:
            break
        if theLine[0] == '#':
            continue
        sys.stderr.write("Working on " + theLine + "\n")
        args = theLine.split()
        if len(args) < 1:
            continue
        if args[0] == 'expected':
            GetOneResult(int(args[1]), int(args[2]), \
                         int(args[3]), [int(args[4])], \
                         [float(args[5])])
        elif args[0] == 'hyperexpected':
            GetOneResult(int(args[1]), int(args[2]), \
                         int(args[3]), [int(args[4])], \
                         [float(args[5])], hyper=float(args[6]))
        else:
            print "Unknown command:", args[0]


Run Hadoop Example

$HADOOP_HOME/bin/hadoop fs -rmr myOutputDir
EXP=run.tst8

$HADOOP_HOME/bin/hadoop fs -rm $EXP
$HADOOP_HOME/bin/hadoop fs -put $EXP .

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -Dmapred.job.queue.name=keystone \
  -Dmapred.map.tasks=50 \
  -Dmapred.task.timeout=9999000 \
  -Dmapred.reduce.tasks=0 \
  -archives hdfs://axoniteblue-nn1.blue.ygrid.yahoo.com:8020/user/malcolm/numpy4python2.5.tgz \
  -cmdenv PYTHONPATH=./numpy4python2.5.tgz/ \
  -cmdenv LD_LIBRARY_PATH=./numpy4python2.5.tgz/numpy \
  -input $EXP \
  -mapper "${HOD_PYTHON_HOME} TestRecall.py -hadoop" \
  -output myOutputDir \
  -reducer /bin/cat \
  -file lsh.py \
  -file TestRecall.py

$HADOOP_HOME/bin/hadoop fs -cat myOutputDir/part* > $EXP.out


Hadoop Status


Pricing (Amazon)


MapReduce Algorithm Design

Adapted from work reported in (Lin, EMNLP 2008)

Managing Dependencies

• Remember: Mappers run in isolation
  - You have no idea in what order the mappers run
  - You have no idea on what node the mappers run
  - You have no idea when each mapper finishes
• Tools for synchronization:
  - Ability to hold state in the reducer across multiple key-value pairs
  - Sorting function for keys
  - Partitioner
  - Cleverly-constructed data structures

Motivating Example

• Term co-occurrence matrix for a text collection
  - M = N x N matrix (N = vocabulary size)
  - Mij: number of times terms i and j co-occur in some context (for concreteness, let's say context = sentence)
• Why?
  - Distributional profiles as a way of measuring semantic distance
  - Semantic distance useful for many language processing tasks

MapReduce: Large Counting Problems

• Term co-occurrence matrix for a text collection = specific instance of a large counting problem
  - A large event space (number of terms)
  - A large number of observations (the collection itself)
  - Goal: keep track of interesting statistics about the events
• Basic approach
  - Mappers generate partial counts
  - Reducers aggregate partial counts

How do we aggregate partial counts efficiently?

First Try: "Pairs"

• Each mapper takes a sentence:
  - Generate all co-occurring term pairs
  - For all pairs, emit (a, b) → count
• Reducers sum up the counts associated with these pairs
• Use combiners!
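A rough sketch of the "pairs" approach as Python map/reduce functions (an illustration, not code from the slides; whitespace tokenization and ordered pairs within a sentence are simplifying assumptions).

from itertools import permutations

def pairs_map(sentence):
    # Emit ((a, b), 1) for every ordered pair of co-occurring terms in the sentence.
    words = sentence.split()
    return [((a, b), 1) for a, b in permutations(words, 2)]

def pairs_reduce(pair, counts):
    # The reducer sums the partial counts for one (a, b) pair.
    return (pair, sum(counts))

# Example:
# pairs_map("the dog saw the cat") emits pairs like (('dog', 'cat'), 1), ...
# pairs_reduce(('dog', 'cat'), [1, 1, 3]) == (('dog', 'cat'), 5)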

"Pairs" Analysis

• Advantages
  - Easy to implement, easy to understand
• Disadvantages
  - Lots of pairs to sort and shuffle around (upper bound?)

Another Try: "Stripes"

• Idea: group together pairs into an associative array
• Each mapper takes a sentence:
  - Generate all co-occurring term pairs
  - For each term, emit a → { b: count_b, c: count_c, d: count_d ... }
• Reducers perform element-wise sum of associative arrays

Pairs view:
  (a, b) → 1, (a, c) → 2, (a, d) → 5, (a, e) → 3, (a, f) → 2
Stripes view:
  a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

Element-wise sum in the reducer:
  a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
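A corresponding sketch of the "stripes" approach (again illustrative Python, with the same simplified tokenization assumption):

from collections import Counter

def stripes_map(sentence):
    # For each term a, emit a -> associative array of co-occurring terms and counts.
    words = sentence.split()
    stripes = []
    for i, a in enumerate(words):
        stripe = Counter(w for j, w in enumerate(words) if j != i)
        stripes.append((a, stripe))
    return stripes

def stripes_reduce(term, stripes):
    # Element-wise sum of the associative arrays for one term.
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return (term, dict(total))

# Example:
# stripes_reduce('a', [Counter({'b': 1, 'd': 5, 'e': 3}),
#                      Counter({'b': 1, 'c': 2, 'd': 2, 'f': 2})])
# == ('a', {'b': 2, 'c': 2, 'd': 7, 'e': 3, 'f': 2})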

"Stripes" Analysis

• Advantages
  - Far less sorting and shuffling of key-value pairs
  - Can make better use of combiners
• Disadvantages
  - More difficult to implement
  - Underlying object is more heavyweight
  - Fundamental limitation in terms of the size of the event space

Cluster size: 38 cores
Data source: Associated Press Worldstream (APW) section of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)

Conditional Probabilities: Approach

• How do we estimate conditional probabilities from counts?
  P(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B' count(A, B')
• Why do we want to do this?
• How do we do this with MapReduce?

P(B|A): "Stripes"

• Easy!
  - One pass to compute (a, *)
  - Another pass to directly compute P(B|A)

a → { b1: 3, b2: 12, b3: 7, b4: 1, ... }
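With stripes, the whole distribution for a is available in one associative array, so normalizing it is a local operation; a sketch (illustrative, not from the slides):

def stripe_to_conditional(stripe):
    # stripe: {b1: 3, b2: 12, b3: 7, b4: 1, ...} for a fixed term a.
    marginal = sum(stripe.values())          # count(a, *)
    return {b: count / float(marginal) for b, count in stripe.items()}

print(stripe_to_conditional({'b1': 3, 'b2': 12, 'b3': 7, 'b4': 1}))
# {'b1': 0.13..., 'b2': 0.52..., 'b3': 0.30..., 'b4': 0.04...}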

P(B|A): "Pairs"

• For this to work:
  - Must emit an extra (a, *) for every b_n in the mapper
  - Must make sure all a's get sent to the same reducer (use a partitioner)
  - Must make sure (a, *) comes first (define the sort order)
  - Must hold state in the reducer across different key-value pairs

(a, b1) → 3, (a, b2) → 12, (a, b3) → 7, (a, b4) → 1, ...

becomes, at the reducer:

(a, *) → 32          ← the reducer holds this value in memory
(a, b1) → 3 / 32
(a, b2) → 12 / 32
(a, b3) → 7 / 32
(a, b4) → 1 / 32
...
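The reducer side of this "pairs" trick can be sketched as follows; the key encoding, the function name, and the assumption that sorting and partitioning have already been done are all illustrative, not Hadoop specifics.

def conditional_pairs_reduce(sorted_pairs):
    # sorted_pairs: iterable of ((a, b), count), with (a, '*') sorted before
    # (a, b1), (a, b2), ... and all keys for the same a routed to this reducer.
    marginal = None                          # state held across key-value pairs
    for (a, b), count in sorted_pairs:
        if b == '*':
            marginal = float(count)          # count(a, *) arrives first
        else:
            yield (a, b), count / marginal   # P(b | a)

pairs = [(('a', '*'), 32), (('a', 'b1'), 3), (('a', 'b2'), 12),
         (('a', 'b3'), 7), (('a', 'b4'), 1)]
print(list(conditional_pairs_reduce(pairs)))
# [(('a', 'b1'), 0.09375), (('a', 'b2'), 0.375), (('a', 'b3'), 0.21875), (('a', 'b4'), 0.03125)]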

Synchronization in Hadoop

• Approach 1: turn synchronization into an ordering problem
  - Sort keys into the correct order of computation
  - Partition the key space so that each reducer gets the appropriate set of partial results
  - Hold state in the reducer across multiple key-value pairs to perform the computation
  - Illustrated by the "pairs" approach

Synchronization in Hadoop

• Approach 2: construct data structures that "bring the pieces together"
  - Each reducer receives all the data it needs to complete the computation
  - Illustrated by the "stripes" approach

Issues and Tradeoffs

• Number of key-value pairs
  - Object creation overhead
  - Time for sorting and shuffling pairs across the network
• Size of each key-value pair
  - De/serialization overhead
• Combiners make a big difference!
  - RAM vs. disk and network
  - Arrange data to maximize opportunities to aggregate partial results

In general: Issues

• The optimally-parallelized version doesn't exist!
• It's all about the right level of abstraction
• Hadoop has overhead!

Next Week (Project Meeting)

• Stephanie Pancoast (Stanford)

Next Week (Lecture)


Mehmet Emre Sargin (Google Research/YouTube)

