zilong-tan
The Data Model
● A (logical) file is a string a_1a_2...a_n, where each a_j is a substring.
Eg: “Hello\nworld!” ⇒ a_1 = “Hello”, a_2 = “world!”; “\n” is a separator.
● Q1: How to equally split a file?
  ○ Eg: a_1a_2...a_{2n} ⇒ a_1a_2...a_n and a_{n+1}a_{n+2}...a_{2n}.
● Q2: What about splitting the file into more segments?
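One possible answer to Q1 and Q2 can be sketched as follows. This is only an illustration under stated assumptions: "equal" is taken to mean an equal number of substrings a_j (not equal byte counts), the separator is "\n", and the helper name `split_records` is hypothetical.

```python
def split_records(text, m, sep="\n"):
    """Split text into m segments whose record counts differ by at most one.

    Assumption: segments are balanced by number of records, not by bytes.
    """
    records = text.split(sep)
    base, extra = divmod(len(records), m)
    segments = []
    start = 0
    for i in range(m):
        # The first `extra` segments take one additional record.
        size = base + (1 if i < extra else 0)
        segments.append(records[start:start + size])
        start += size
    return segments


halves = split_records("a1\na2\na3\na4", 2)
thirds = split_records("a1\na2\na3\na4", 3)
```

Cutting only at separators means no a_j is ever divided between two segments.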
The Map(a_j) Function
● Map: a_j → {(key(a_j), val(a_j))}
● key(a_j) and val(a_j) are strings.
Eg: Map(“Hello”) = (“Hello”, “1”); Map(“Hello world”) = {(“Hello”, “1”), (“world”, “1”)}; or, under a different Map, Map(“Hello world”) = (“world”, “Hello”).
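The examples above correspond to two different choices of Map. A minimal sketch of both, with hypothetical function names and pairs returned as Python tuples inside a list:

```python
def map_word_count(record):
    """Emit one (word, "1") pair for every whitespace-separated word."""
    return [(word, "1") for word in record.split()]


def map_second_word_as_key(record):
    """Emit a single pair whose key is the second word and value the first.

    Assumption: the record contains exactly two words.
    """
    first, second = record.split()
    return [(second, first)]


single = map_word_count("Hello")
double = map_word_count("Hello world")
swapped = map_second_word_as_key("Hello world")
```

Keys and values are kept as strings, matching the definition of key(a_j) and val(a_j).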
Contd.
● The input file a_1a_2...a_m is organized as

Key        Value 1    Value 2    Value 3    ...
key(a_1)   val(a_1)   val(a_7)   val(a_2)   #
key(a_5)   val(a_n)   val(a_5)   #
key(a_3)   val(a_m)   val(a_2)   val(a_3)   ...
...
Each row shares the same key.
Mistake! a_2 cannot appear in two rows (it is listed in both the first and the third row above).
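How such a table might be built can be sketched as below. The names `build_table` and `first_letter_map` are hypothetical, and the Map used here is assumed to emit exactly one pair per record, so every record lands in exactly one row.

```python
from collections import defaultdict


def first_letter_map(record):
    """Illustrative Map: key is the first letter, value is the record itself."""
    return [(record[0], record)]


def build_table(records, map_fn):
    """Group the values emitted by map_fn into rows that share a key."""
    table = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            table[key].append(value)
    return dict(table)


table = build_table(["apple", "bob", "avocado", "cat"], first_letter_map)
all_values = [value for row in table.values() for value in row]
```

Because each record produces a single key here, no value is duplicated across rows.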
The Reduce(k,v_1,v_2,...,v_d) Function
● Reduce: (k,v_1,v_2,...,v_d) → v.
Eg: Reduce(“Hello”,“2”,“1”,“5”) = “Hello 8”. (WordCount)
The arguments form a row of the table: key(s), val(s_1), val(s_2), ..., val(s_d).
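The WordCount example of Reduce can be written as a short sketch. The function name `reduce_word_count` is hypothetical, and values are assumed to be decimal strings.

```python
def reduce_word_count(key, *values):
    """Sum the string-encoded counts of one row and return "key total"."""
    total = sum(int(v) for v in values)
    return f"{key} {total}"


result = reduce_word_count("Hello", "2", "1", "5")
```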
Parallel Computation
● The table we have seen is global.
● A Map node is assigned a file segment s_js_{j+1}...s_{j+k}, and executes Map() on each s.
● A Reduce node is associated with one or more rows of the table, and executes Reduce() on each associated row.
● Map() and Reduce() execute concurrently on multiple machines.
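The division of work can be imitated on a single machine. In this sketch threads stand in for Map and Reduce nodes, which is an assumption made purely for illustration; the Map phase also finishes before the Reduce phase starts, which a real system need not require. All function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_segment(segment):
    """A Map node: run a WordCount-style Map on every record of its segment."""
    return [(word, "1") for record in segment for word in record.split()]


def reduce_row(row):
    """A Reduce node: sum the values of one row of the global table."""
    key, values = row
    return key, str(sum(int(v) for v in values))


def run(segments, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each segment is processed by one worker.
        mapped = list(pool.map(map_segment, segments))
        # Build the global table: one row per key.
        table = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                table[key].append(value)
        # Each row is reduced by one worker.
        reduced = list(pool.map(reduce_row, sorted(table.items())))
    return dict(reduced)


result = run([["cat dog", "bird"], ["cat cat", "dog"]])
```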
WordCount Example
● Input: w = w_1w_2...w_k.
● Map(w) = {(w_j, “1”)}, j = 1,2,...,k.
● Reduce(w,v_1,v_2,...,v_d) = (w, Σ_j v_j).
Contd.
● w = “cat … dog … bird …”.
● Map(w) = {(w_j, “1”)}, j = 1,2,...,k.
● Reduce(w,v_1,v_2,...,v_d) = (w, Σ_j v_j).

Key      Value 1   Value 2   Value 3   ...
“cat”    “1”       “1”       “1”       ...
“dog”    “1”       “1”       “1”       ...
“bird”   “1”       “1”       “1”       ...
Contd. (rows sorted by key)

Key      Value 1   Value 2   Value 3   ...
“bird”   “1”       “1”       “1”       ...
“cat”    “1”       “1”       “1”       ...
“dog”    “1”       “1”       “1”       ...
Contd. (after Reduce)

Key      Value 1
“bird”   “39”
“cat”    “20”
“dog”    “11”
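The three table states in the WordCount example (as emitted, sorted by key, reduced) can be reproduced on a tiny input. The input string below is made up for illustration and is unrelated to the counts shown in the slides.

```python
words = "cat dog bird cat bird cat".split()

# Map: one (word, "1") pair per word.
pairs = [(w, "1") for w in words]

# Table as emitted: rows appear in order of first occurrence.
table = {}
for key, value in pairs:
    table.setdefault(key, []).append(value)

# Table with rows sorted by key.
sorted_table = dict(sorted(table.items()))

# Reduce: each row collapses to a single summed value.
counts = {key: str(sum(int(v) for v in values))
          for key, values in sorted_table.items()}
```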
The Bursting I/O Problem
● Let N be the file size.
● What would be the table size?
● At least Ω(N).
  ○ Each word in the input file corresponds to a value in the table.
● Too much I/O traffic!
The Combine_k(v_1,v_2,...,v_d) Function
● Goal: to reduce the table size.
● Assumptions:
Combine_k(v) = v,
Combine_k(v_1,...,v_d) = Combine_k(Combine_k(v_1,...,v_{d-1}), v_d),
Reduce(k,v_1,v_2,...,v_d) = Reduce(k, Combine_k(v_1,...,v_d)).
● Table size reduction (m Map nodes):
Reduce(k,v_1,v_2,...,v_d) = Reduce(k, Combine_k(v_1,...,v_{d/m}), Combine_k(v_{d/m+1},...,v_{2d/m}), ...).
Contd.
● Assume m map nodes:
  ○ Best case: each map node has a combiner.
  ○ Minimum possible space: Θ(m).

Key      Value 1   Value 2   Value 3   ...
“bird”   “300”     “351”     “310”     ...
“cat”    “109”     “1112”    “207”     ...
“dog”    “4”       “2”       “3”       ...
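For WordCount, a natural choice of combiner is summation, which satisfies the assumptions stated for Combine_k. The sketch below applies it once per Map node, so each row of the table holds at most m values. Function names and the sample segments are hypothetical.

```python
from collections import defaultdict


def combine(values):
    """WordCount combiner: collapse string-encoded counts into one count."""
    return str(sum(int(v) for v in values))


def map_node(segment):
    """Map every word to "1", then combine locally before emitting."""
    local = defaultdict(list)
    for word in segment:
        local[word].append("1")
    return {key: combine(values) for key, values in local.items()}


def build_table(segments):
    """Global table: each Map node contributes at most one value per key."""
    table = defaultdict(list)
    for segment in segments:
        for key, value in map_node(segment).items():
            table[key].append(value)
    return dict(table)


segments = [["cat", "bird", "cat"], ["bird", "bird", "dog"], ["cat", "bird"]]
table = build_table(segments)
totals = {key: combine(values) for key, values in table.items()}
```

With m = 3 segments, no row has more than three values, yet the final totals match what an uncombined WordCount would produce.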
The Partition(k,M) Function
● How to assign rows to reduce nodes?
● Partition: key → node.
● Typically Partition(k,M) = HashFunction(k) mod M.
Eg. (taking HashFunction(“cat”) = 1): Partition(“cat”, 5) = 1 % 5 = 1.
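A hash-based Partition might look like the sketch below. The choice of `zlib.crc32` as HashFunction is an assumption made for this example; it is used because Python's built-in `hash()` on strings varies between processes, whereas a partitioner must send the same key to the same node every time.

```python
import zlib


def partition(key, num_nodes):
    """Assign a key to a reduce node: HashFunction(key) mod M."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


# Group the rows (keys) by the node that will reduce them.
assignment = {}
for key in ["bird", "cat", "dog"]:
    assignment.setdefault(partition(key, 5), []).append(key)
```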