Web Search Engines Lecture 8 2016-11-14
Cong Yu, Google Research NYC
Big Data
NYU-CS Web Search Engines Fall 2016
Midterm
P1: Crawling • (10 pt) circular for consistent hashing + replica for load balancing
• Hashing on hostname, reassign existing URLs upon machine removal, rate limiting per machine
• “.edu”!
P2: Segmentation
• (10 pt) applying statistical methods
• Pseudo-code, discussing batch and live evaluation
P3: Index Compression → byte-aligned is decodable.
Grades are released: mean is 73, median is 75.
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing
• Query Log Cube Analysis
Fallacies of Big Data
Handling Large Scale Data is Necessary
Google: 20PB per day processed by MapReduce
Facebook: 2.5 PB user data, growing 15TB per day in 2009
Petabytes of data are the norm for Internet companies.
Scientific studies generate similar amounts of data • Sloan Digital Sky Survey: 200GB per night
• Human Genome Project: 3GB of raw data per individual
Leveraging Large Scale Data is Powerful
Translation: analyzing parallel texts (sentences written in two languages in parallel) at a large scale drastically improves the accuracy of automatic translation.
Google Flu Trends: analyzing search queries predicted flu spread more accurately and faster than the CDC in 2009.
Big Data vs Cloud Computing
Big Data / Data Science: techniques for storing, analyzing, and mining large
scale, fast moving, heterogeneous data • Volume, Velocity, Variety, Veracity
Cloud Computing: hosting various services for remote customers • Infrastructure, Platform, Software, Data
IaaS: Virtual Machines • Amazon EC2
PaaS: IaaS+OS+Servers+Runtime → Developer facing • Cloud Foundry, Google App Engine, Microsoft Azure
SaaS and DaaS: PaaS+Applications+Data → User facing • Google Apps
Big data techniques are usually integrated at the level of PaaS.
The MapReduce Paper
MapReduce: Simplified Data Processing on Large Clusters, by Jeff Dean and Sanjay Ghemawat, OSDI 2004.
Built upon decades of research in parallel and distributed computing
Demonstrated, to a large audience, the power of the shared-nothing parallel processing architecture
Extremely valuable to data processing on the Web, essentially launched the Big Data era.
MapReduce now refers to both the Google implementation and the programming model with open and proprietary implementations
• Hadoop, Dryad • Latest advancements: Spark (iterative), Kafka (streaming), TensorFlow
Intuition and Infrastructure Divide and Conquer • Divide the large amount of input data into smaller chunks
• Process each chunk separately in a shared-nothing environment, and produce intermediate outputs from each chunk
• Aggregate those intermediate outputs into the final output
A MapReduce stack provides the infrastructure for: • Retrieving input from and writing output to the distributed storage system
• Data processing: • Running user provided code for processing data chunks • Shuffling intermediate outputs to their destinations • Running user provided code for aggregating intermediate outputs to produce
the final output
• A master for monitoring failures and restarting failed tasks
• A scheduling system for coordination among tasks and jobs
Key Ideas
Offline batch processing
Lots of cheap machines are better than a few high-end machines
Failures are inevitable, as a consequence of the above
Moving data is more expensive than moving computation
As simple as possible for developers
But not suitable for all data processing tasks • If possible, adjust the tough tasks to suit the MapReduce model
Basic Concepts
Mapper • Workers that execute map tasks, each of which processes one
individual input chunk
• The Map function is provided by the user • input record → {(Key1, Value1), (Key2, Value2), …}
Reducer • Workers that execute reduce tasks, each of which aggregates
intermediate output records of the same Key
• The Reduce function is provided by the user • (Key, {Value1, Value2, …}) → (Key, Result)
Document Word Count: Simple Flow
Mapper output:   a 1, b 1, c 1, c 1  |  a 1, c 1, b 1, c 1
Reducer input:   a → {1, 1};  b → {1, 1};  c → {1, 1, 1, 1}
Reducer output:  a 2, b 2, c 4
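The flow above can be simulated end to end in a few lines (a minimal sketch; `run_mapreduce` is an illustrative stand-in for the framework, not a real Hadoop API):

```python
from collections import defaultdict

def map_fn(doc):
    # emit (word, 1) for every word in the input chunk
    for word in doc.split():
        yield word, 1

def reduce_fn(key, values):
    # sum all counts collected for one word
    return key, sum(values)

def run_mapreduce(docs, map_fn, reduce_fn):
    # shuffle/sort: group intermediate values by key, then reduce each group
    groups = defaultdict(list)
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run_mapreduce(["a b c", "c a c", "b c"], map_fn, reduce_fn)
# matches the flow above: {'a': 2, 'b': 2, 'c': 4}
```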
Document Word Count
Execution Framework
Execution Framework The infrastructure provides the critical layer of shuffle/sort • Performance differentiator between implementations
Shuffle (i.e. Partition) • Partitions the intermediate output based on Key • Assigns each partition to specific reducers
Sort, for each partition • Sorts the records based on Key • Groups values with the same Key together • Each reducer can expect the records to come in sorted by the Key • There is no guarantee of order across different reducers
User can provide customized partition and sort functions to control the shuffle/sort behavior
In-Class Exercise
Write a MapReduce program that, for every term, computes its average per-document frequency over the documents in which it appears
Answer
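A hedged sketch of one possible answer (framework simulated in memory; helper names are illustrative): each mapper emits one term-frequency record per distinct term per document, and each reducer averages the frequencies it receives.

```python
from collections import Counter, defaultdict

def map_fn(docid, text):
    # emit one (term, tf_in_doc) record per distinct term in the document
    for term, tf in Counter(text.split()).items():
        yield term, tf

def reduce_fn(term, tfs):
    # average term frequency over the documents the term appears in
    return term, sum(tfs) / len(tfs)

def run(docs):
    # shuffle/sort: group term frequencies by term, then reduce
    groups = defaultdict(list)
    for docid, text in docs:
        for term, tf in map_fn(docid, text):
            groups[term].append(tf)
    return dict(reduce_fn(t, tfs) for t, tfs in groups.items())

avg_tf = run([("d1", "a a b"), ("d2", "a b b b")])
# a: (2 + 1) / 2 = 1.5;  b: (1 + 3) / 2 = 2.0
```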
Notable Issues
Reduce workers cannot start until all map workers are done • Speculative execution can speed things up in anticipation of mapper
failures: i.e., run duplicate copies of the same map task if it is running slowly
Straggling reducers • When data distribution is skewed, a single Key may have lots of Values
• The reducer being assigned that task becomes the job bottleneck
Failure handling • Master polls each worker periodically and considers one has failed if no
response within certain amount of time
• All tasks (even those completed b/c master doesn’t know) executed by that worker will be reset and executed by a new worker.
• All reduce workers will be notified to read from the new worker
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing
• Query Log Cube Analysis
Fallacies of Big Data
Local Aggregation
In the most basic MapReduce model, aggregations only happen on the reducers
• All the intermediate data must be shipped (often over the wire) from the mapper to the remote reducer
Sometimes, it is much more efficient to execute the reduce tasks partially on the mapper to avoid the extra communication cost
• This helps alleviate the straggling reducer problem, although it does not eliminate it
Local Aggregation within Mapper
The mapper can easily run out of memory, so it is important to flush the data in blocks
process multiple docs
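In-mapper combining with block flushing, as described above, can be sketched like this (`flush_size` is an assumed threshold; a real mapper would flush based on memory pressure):

```python
from collections import Counter

class InMapperCombiner:
    # accumulate partial counts across multiple documents, flushing the
    # buffer whenever it grows too large (assumed size threshold)
    def __init__(self, flush_size=1000):
        self.buffer = Counter()
        self.flush_size = flush_size
        self.emitted = []          # stand-in for EMIT to the framework

    def map(self, doc):
        for word in doc.split():
            self.buffer[word] += 1
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # emit locally aggregated (word, partial_count) records
        for word, count in self.buffer.items():
            self.emitted.append((word, count))
        self.buffer.clear()

m = InMapperCombiner(flush_size=2)
for doc in ["a b a", "c c"]:
    m.map(doc)
m.flush()    # final flush at mapper close
```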
Local Aggregation through Combiner

Mapper output:       a 1, b 1, c 1, c 1  |  a 1, c 1, b 1, c 1
After the combiner:  a 1, b 1, c 2       |  a 1, c 1, b 1, c 1   (combiner ran on the first mapper only)
Reducer input:       a → {1, 1};  b → {1, 1};  c → {2, 1, 1}   (without any combiner, c → {1, 1, 1, 1})

Important notes on combiners • Not guaranteed to be executed and may get executed multiple times
• Must process the intermediate output in the same way as reducers
• Must produce intermediate output that can be fed into the reducers
Algebraic Measure and Combiner
Let A and B be two different sets of numbers
AVERAGE(A U B) != AVERAGE(AVERAGE(A), AVERAGE(B))
• AVERAGE({1, 2, 3, 4, 5}) == 3
• AVERAGE(AVERAGE({1, 2}), AVERAGE({3, 4, 5})) = 2.75
AVERAGE(A U B) == SUM(SUM(A), SUM(B)) / SUM(COUNT(A), COUNT(B))
• SUM(SUM({1, 2}), SUM({3, 4, 5})) == 15
• SUM(COUNT({1, 2}), COUNT({3, 4, 5})) == 5
Measures like sum, count, average are algebraic measures that lend themselves to optimization through combiner
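AVERAGE becomes combiner-friendly once the intermediate value carries a (sum, count) pair instead of a bare average; combining such pairs is associative, so a combiner can safely merge them before the reducer. A minimal sketch:

```python
def combine(partials):
    # each partial is a (sum, count) pair; merging pairs is associative,
    # so this same function can run as a combiner or as the reducer
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

def average(partials):
    s, c = combine(partials)
    return s / c

a, b = [1, 2], [3, 4, 5]
partial_a = (sum(a), len(a))   # (3, 2)
partial_b = (sum(b), len(b))   # (12, 3)
avg = average([partial_a, partial_b])
# (3 + 12) / (2 + 3) = 3.0, the correct global average
```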
Using a Combiner
Using a Combiner, Correctly
Thought Exercise
How to write a MapReduce program that computes the number of times each pair of words co-occurs in the same sentence?
• E.g., given “new york university” and “new york city”:
• {“new”, “york”} = 2, {“new”, “university”} = 1, etc.
Pairs and Stripes
There is a fundamental tension in most MapReduce algorithms between two aspects
• Number of intermediate output records
• Size of intermediate output records
Pair-based approaches aim to reduce the latter
Stripe-based approaches aim to reduce the former
The Pair Approach
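A possible sketch of the Pair approach for the co-occurrence exercise (unordered word pairs; the framework's shuffle is simulated in memory):

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(sentence):
    # emit one small record per unordered word pair in the sentence
    words = sorted(set(sentence.split()))
    for w1, w2 in combinations(words, 2):
        yield (w1, w2), 1

def reduce_pairs(groups):
    # each reducer simply sums the counts for its pair key
    return {pair: sum(vs) for pair, vs in groups.items()}

groups = defaultdict(list)
for s in ["new york university", "new york city"]:
    for k, v in map_pairs(s):
        groups[k].append(v)
cooc = reduce_pairs(groups)
# ('new', 'york') -> 2, ('new', 'university') -> 1, ...
```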
The Stripe Approach
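The Stripe approach can be sketched as follows: each word emits an associative array (a "stripe") of its co-occurring words, and the reducer sums stripes element-wise (a minimal in-memory simulation):

```python
from collections import Counter, defaultdict

def map_stripes(sentence):
    # for each word, emit one stripe: co-occurring word -> count
    words = sentence.split()
    for i, w in enumerate(words):
        yield w, Counter(words[:i] + words[i + 1:])

def reduce_stripes(word, stripes):
    # element-wise sum of all stripes for one word
    total = Counter()
    for s in stripes:
        total.update(s)
    return word, total

groups = defaultdict(list)
for s in ["new york university", "new york city"]:
    for k, v in map_stripes(s):
        groups[k].append(v)
stripes = {k: dict(reduce_stripes(k, vs)[1]) for k, vs in groups.items()}
# stripes['new'] == {'york': 2, 'university': 1, 'city': 1}
```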
Pair versus Stripe
There is no single right way; it all depends on the problem
The Pair approach • Each intermediate output record is small
• There are a lot of them → all possible word pair combinations
The Stripe approach • The number of intermediate output records is limited: # of distinct words
• Each record can be very large → an array of many words and their counts • Can cause the straggling reducer problem
If applicable, the Stripe approach is slightly preferred • Less burden on shuffle and sort due to the lesser number of records
• Also moves a slightly smaller amount of data due to savings on the Keys
Neither solves the straggling reducer problem
Thought Exercise
Write a MapReduce program that computes Pr(wj|wi) • How likely wj will appear if wi appears (say in a sentence):
Straight-forward using the Stripe approach
• In the Reduce function, the total count is readily available
How to do that using the Pair approach?
Customized Partition and Sort
A Customized Partition function on the key can be applied to guide how the shuffle phase distributes intermediate records
• All records with the same partition ID go to the same reducer
A Customized Sort function on the Key can be applied to determine the order of intermediate records being processed by the reducer
Computing Pr(w1|w2) using the Pair Approach
In addition to word pairs, for each word in a sentence, output
(w, *) → # of all words
Customized Partition function to hash on only the first word
Customized Sort function to sort “*” before any real word • Using a stateful Reduce function
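This order-inversion trick can be sketched as follows (the sort emulation and helper names are illustrative, not the Hadoop API). Because “*” sorts before any real word within a partition, the (wi, *) marginal reaches the stateful reducer before any real pair, so the total is known when the pair counts arrive:

```python
from collections import defaultdict

def map_fn(sentence):
    # emit one record per ordered co-occurrence, plus a (wi, '*')
    # marker record counting all co-occurrences of wi
    words = sentence.split()
    for i, wi in enumerate(words):
        for j, wj in enumerate(words):
            if i != j:
                yield (wi, wj), 1
                yield (wi, '*'), 1

groups = defaultdict(int)
for s in ["a b", "a c", "a b"]:
    for k, v in map_fn(s):
        groups[k] += v          # combiner-style aggregation

def reduce_in_sort_order(groups):
    # emulate the customized sort: for each first word, '*' comes first,
    # so the stateful reducer sees the total before any real pair
    probs, total = {}, {}
    ordered = sorted(groups.items(),
                     key=lambda kv: (kv[0][0], kv[0][1] != '*', kv[0][1]))
    for (wi, wj), count in ordered:
        if wj == '*':
            total[wi] = count
        else:
            probs[(wi, wj)] = count / total[wi]
    return probs

probs = reduce_in_sort_order(groups)
# Pr(b|a) = 2/3, Pr(c|a) = 1/3
```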
Design Principles for MapReduce Algorithms
Carefully design the Key and Value data structures to work with Partition and Sort functions
Apply Combiner whenever possible
Be aware of potential data skews and the straggling reducer problem
Chain multiple MapReduce jobs together to accomplish complex tasks
• Check out Flume, Pig, Dryad
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing (very simple)
• Query Log Cube Analysis (advanced)
Fallacies of Big Data
MapReduce and Web Search Engines
MapReduce was designed to handle two important tasks that are at the heart of Web Search Engines
• Document processing and indexing at Web Scale
• Mining search query logs
Big data techniques became necessary because of the enormous amount of data
• It’s not a coincidence that MapReduce was developed at Google
Inverted Indexing via MapReduce: Basic
Inverted Indexing using MapReduce: Basic
Any Issue?
The in-memory sort in the Reduce function can be costly
Apply the techniques we learned earlier
• Push docid to be part of the Key, i.e., <term, docid>
• Customize the Partition function to partition on term only
• Customize the Sort function to sort term first and docid second
Then the MapReduce execution framework will automatically perform the much more scalable distributed sort
This value-to-key strategy is often used to leverage the inherent sorting facility within MapReduce
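The value-to-key pattern for indexing can be sketched like this (in-memory simulation; the `partition` helper is illustrative): docid moves into the composite key, the partitioner keeps all postings of a term on one reducer, and the framework's sort delivers postings in order so the reducer never sorts in memory.

```python
from collections import defaultdict

def map_fn(docid, text):
    # value-to-key: docid is pushed into the Key, so the framework's
    # distributed sort orders the postings for us
    for term in text.split():
        yield (term, docid), 1

def partition(key, num_reducers):
    # partition on the term only, so every posting of a term
    # reaches the same reducer despite the composite key
    term, _docid = key
    return hash(term) % num_reducers

# emulate combining plus the framework's sort on the composite key
records = defaultdict(int)
for docid, text in [(2, "b a"), (1, "a b a")]:
    for k, v in map_fn(docid, text):
        records[k] += v            # term frequency per (term, docid)

index = defaultdict(list)
for (term, docid), tf in sorted(records.items()):
    index[term].append((docid, tf))   # postings arrive already sorted
# index['a'] == [(1, 2), (2, 1)]
```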
Inverted Indexing via MapReduce: No User Sort
Take-Home Exercise
Compute PageRank via MapReduce
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing (very simple)
• Query Log Cube Analysis (advanced)
Fallacies of Big Data
Cube Analysis of Query Logs
Search engines accumulate huge query logs • Each query event has a lot of information
• Location (inferred from IP)
• Demographics (inferred from past behavior) • The query itself (mapped to semantic categories)
Used to identify interesting facts across all combinations of dimensions
• E.g. “Women in 40s from Michigan search a lot about knitting.”
Over an extended period, there can be a lot of query events
Distributed Cube Materialization
Distributed Cube Materialization on Holistic Measures, by Arnab Nandi, Cong Yu, Philip Bohannon, Raghu Ramakrishnan. ICDE, 2011.
A brief look at an advanced usage of MapReduce
Techniques proposed in the paper are incorporated into the official Hadoop implementation.
Holistic Measures
Algebraic measures • Value for parent can be computed easily from values of children.
• COUNT(A U B) = SUM(COUNT(A), COUNT(B))
Holistic Measures • Value for parent cannot be computed from values of children.
• COUNT(DISTINCT(A U B)) • distinct(A = {a, b, b, c}) = 3; distinct(B = {a, c}) = 2;
• distinct(A U B) = 3 cannot be computed from 3 and 2 alone
Real Example Task
Given the location and topic hierarchies, compute volume and reach (number of distinct users) of all cube groups whose reach is at least 5.
User | Query (Event) | Time     | Topic         | Location
u1   | Nikon d40     | 20090601 | SLR           | Ann Arbor
u2   | Canon sd100   | 20090601 | Point & Shoot | Detroit
u1   | Apple iPhone  | 20090602 | Smartphone    | Ann Arbor
u3   | Nikon d40     | 20090604 | SLR           | New York
u4   | Nokia 1100    | 20090605 | Basic Phone   | Sunnyvale
3 levels each: Topic → Cat → Subcat;  Country → State → City
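Mapping one event to its cube groups is a small cross product: with 3 hierarchy levels plus the all-value “*” per dimension, each event belongs to 4 x 4 = 16 groups. A minimal sketch (the topic-level names below are hypothetical illustrations):

```python
from itertools import product

def cube_groups(location_path, topic_path):
    # each path lists ancestors from most general to most specific;
    # prepending '*' gives the 4 generalization levels per dimension
    loc_levels = ['*'] + location_path
    top_levels = ['*'] + topic_path
    return list(product(loc_levels, top_levels))

groups = cube_groups(['USA', 'Michigan', 'Ann Arbor'],
                     ['Electronics', 'Camera', 'SLR'])   # hypothetical names
# 16 groups, from ('*', '*') down to ('Ann Arbor', 'SLR')
```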
All Combinations, i.e., the Cube
Naïve Algorithm
…of the typical tasks that analysts demand and will be used throughout this paper. They both involve holistic measures: reach for the former and top-k for the latter. The schema, data and lattice for both tasks is the same as our running example in Sec. II.

Example 1 (Coverage Analysis): Given the location and topic hierarchies, compute volume and reach of all cube groups whose reach is at least 5. Highlight those cube groups whose reach is unusually high compared with their volume. Here, the measure volume is defined as the number of search tuples in the group, while reach is defined as the number of unique users issuing those searches.

This analysis is inspired by the need to identify query topics that are issued relatively infrequently by the users, but cover a large number of users: these queries are often missed by traditional frequency based analyses because of their relatively low volume, even though those query topics can in fact have an impact on a very large user population. A SQL-style specification corresponding to the cubing task for this analysis is shown in Fig. 1, while the dimension hierarchies and hierarchical cube lattice are shown in Fig. 5 and Fig. 6 respectively.

Example 2 (Top-k Query Analysis): Given the location and topic hierarchies, compute the top-5 frequent queries for all cube groups. Highlight those cube groups that share fewer than 3 top queries with their parent groups.

This analysis aims to discover location bias for various query topics. For example, the top political queries in Austin can be very different from those in Texas in general, which in turn can be different from those in the entire USA.

We note again that cube computation is only part of both analysis tasks, albeit an essential part. Highlighting the desired cube groups sits on top of the cube computation layer and is beyond the scope of this work. We provide some discussion on how to efficiently identify interesting groups in Sec. VIII.

IV. CHALLENGES

Typically, the CUBE operation can be expressed in high-level MapReduce languages (e.g., Pig [19]) as a disjunction of group-by queries. A query optimizer would then (in the ideal case) combine all the queries into a single MapReduce job. Algo. 1 represents this combined cube computation. Intuitively, this naive algorithm divides the full cubing task into a collection of aggregation tasks, one for each cube group, and distributes them for computation using the MapReduce framework. In particular, the function R(e) extracts, from the tuple e, the values of the group-by attributes specified by the cube region R.

For example, given the cube lattice in Fig. 6, the search tuple e1 = <091203, u1, 64.97.55.3, iPhone> is mapped to 16 groups (one per region in the cube), including the smallest group <Ann Arbor, Smart> and the broadest group <*, *>. Each reducer then computes the measures for its assigned groups by applying the measure function. Measure-specific optimization can be incorporated into the naive algorithm to reduce the amount of intermediate data. As an example, for coverage analysis (Example 1), to compute the measure reach, we only need the attribute uid. Therefore the mapper can emit just e.uid instead of the full tuple.

Algorithm 1 Naive Algorithm

MAP(e)
1 # e is a tuple in the data
2 let C be the Cube Lattice
3 for each Region R in C
4   do k = R(e)
5      EMIT k → e

REDUCE/COMBINE(k, {e1, e2, ...})
1 let M be the measure function
2 EMIT k → M({e1, e2, ...})

It should be noted that despite its simplicity, the Naive algorithm fares quite well for small datasets. As we will see in Sec. VI, it outperforms its competitors for such datasets due to the extremely low overhead costs. However, as the scale of data increases, we encounter two key challenges that cause this algorithm to perform poorly and eventually fail: the size of intermediate data and the size of large groups. We briefly describe these challenges next.

A. Size of Intermediate Data

The first challenge arises from the large size of intermediate data being generated from the map phase, which measures at |C| x |D|, where |C| is the number of regions in the cube lattice and |D| is the size of the input data. Since |C| increases exponentially with both the number and depth of dimensions to be explored, the naive approach can quickly lead to the system running out of disk space during the map phase or struggling through the shuffle (i.e., sort) phase.

B. Size of Large Groups

The second challenge arises from cube groups belonging to cube regions at the bottom part of the cube lattice, such as <USA, Shopping> or even <*, *> (i.e., the cube group containing all tuples). The reducer that is assigned the latter group essentially has to compute the measure for the entire dataset, which is usually large enough to cause the reducer to take significantly longer to finish than others, or even fail. As the size of the data increases, the number of such groups also increases. We call such groups reducer-unfriendly. A cube region with a significant percentage of reducer-unfriendly groups is called a reducer-unfriendly region.

For algebraic measures, this challenge can be addressed by not processing those groups directly: we can first compute measures only for the smaller, reducer-friendly groups, then combine those measures to produce the measure for the larger, reducer-unfriendly groups. Such measures are also amenable to mapper-side aggregation, which further decreases the load on the shuffle and reduce phases. For holistic measures, however, measures for larger groups cannot be assembled from their smaller child groups, and mapper-side aggregation is also not possible. Hence, we need a different approach.
Naïve Algorithm
Map phase (16 records per tuple!):
<Ann Arbor, SLR> → u1
<Michigan, Camera> → u1
…
<*, *> → u1

Reduce phase:
<Ann Arbor, SLR> → {u1}
<Michigan, Camera> → {u1, u2}
…
<*, *> → {u1, u2, u3, u4, u5}
Challenges
Map phase: many intermediate keys • Number of Keys = |C| x |D|, where |C| is the number of regions in
the lattice, and |D| is the input size.
• As the number of dimensions increases, this can cause problems for the map/shuffle phase.
Reduce phase: extremely large reduce groups
• Some reduce shards can be very large and cause the reducer dealing with those groups to run out of memory.
• In fact, one group is guaranteed to have this problem: <*, *>
Handle Straggling Reducer
Inspired by a common method to compute the unique number of users in a query log
• The usual practice involves two steps: 1) a MapReduce job to get the set of unique user IDs; 2) computing the size of that set to produce the final answer
Holistic measure computation cannot be distributed, but • What if the input is split into disjoint sets with regard to the
attribute of the measure (in the above example, user ID)?
Example
Reduce phase:
<*, *, uid> → {u1}
<*, *, uid> → {u2}
…
<*, *, uid> → {u5}

No more large groups!
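The uid-partitioning idea can be sketched as follows (a minimal sketch, not the paper's implementation): the shards are disjoint in uid, so per-shard distinct counts simply add up to the true reach, and no single reducer ever sees the whole <*, *> group.

```python
def reach_partitioned(uids, num_shards=4):
    # route each event to a shard by hashing its uid, dedupe within
    # the shard, then sum the per-shard distinct counts
    shards = [set() for _ in range(num_shards)]
    for uid in uids:
        shards[hash(uid) % num_shards].add(uid)
    return sum(len(shard) for shard in shards)

events = ["u1", "u2", "u1", "u3", "u4"]
# reach = 4 distinct users, however the uids hash across shards
```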
Detecting Large Cube Groups
Run a small MapReduce job over a sample of the data using the naïve algorithm
• Goal is to detect any cube group that might be reducer-unfriendly
With a sample size of 2M tuples, we can obtain a sufficiently accurate estimate for 20B tuples
• Proof based on Chernoff Bounds
Slightly Less Naïve Approach
Reducer-unfriendly regions: <*, *, uid>, <*, topic, uid>, <*, cat, uid>, <*, subcat, uid>, <country, *, uid>

Reducer-friendly regions: <country, topic>, <country, cat>, <country, subcat>, <state, *>, <state, topic>, <state, cat>, <state, subcat>, <city, *>, <city, topic>, <city, cat>, <city, subcat>

Value-partitioning the reducer-unfriendly regions produces even more intermediate records!
Incorporating Partition Factor
MR-Cube Algorithm
!"#$!!%&&!!%'()'$!!*+,)&-!!".$!!&/0$!!*+,)&-!!!!!!!!!".$!!1-2')*2$!!&*3)&!!145! !6
%+!
7,"8
-!
9-1"
0-!
!"#"$%&'()&*+(+,)-(./"&'("0(-1+/,/)-2(3(*,4)-(,/+,5(
6,7&+%',/88"05()"$*10+9(10('"54%'/")+5510.(
!:;*0,*<%&$=>?!!"#$!!%&&!!%'()'$!!+,)&-!!:=$=>$!#!!!!!!!!!!?!!"#$!!"@%!
!:&-A!/)'3$=>?!!".$!!&/0$!!+,)&-!!:=$=>$!.!!!!!!!!!!?!!".$!!"@%!
!:;*0,*<%&$=>?!!".$!!1-2')*2$!!0%;-'%!!:=$=>$!.!!!!!!!!!!?!!".$!!"@%!
!:;*0,*<%&$=>?!!"#$!!%&&!!%'()'$!!+,)&-!
!:&-A!/)'3$=>?!!".$!!&/0$!!+,)&-!
!:=$=>$!.!!!!!!!!!!?!!".$!!"@%!
!:;*0,*<%&$=>?!!".$!!1-2')*2$!!0%;-'%!
!:=$=>$!.!!!!!!!!!!?!!".$!!"@%!
!:=$=>$!#!!!!!!!!!!?!!"#$!!"@%!
!:=$=>?!#!
!:=$=>?!#!
!:&-A!/)'3$!=>?!#!
!:&/0$!=>!!!!!!!!!!!!?!#!
!:&-A!/)'3$!0%;-'%>!?!#!
!:&/0$!0%;-'%>!!!!!!!!!!!!!?!#!
!:;*0,*<%&$!+,)&->!!!?!#!
!:;*0,*<%&$!0%;-'%>!?!#!
!:%&&!%'()'$!+,)&->!!?!#!
!:1-2')*2$!0%;-'%>!!!!!?!#!
!:;*0,*<%&$!=>!?!.!
!:%&&!%'()'$!=>?!#!
!:1-2')*2$!=>!!!!!?!#!
Fig. 11. Walkthrough for MR-Cube with reach as measure, as described in Sec. V-D. Both the Map and Reduce stages take the AnnotatedCube Lattice(Fig. 10) as a parameter. Walkthrough is shown only for batch areas b1 (denoted by red ⇥ on tuples) & b5 (denoted by blue �).
D. Cube Materialization & Walkthrough

As shown in Algo. 2, an annotated lattice is generated and then used to perform the main MR-CUBE-MAP-REDUCE.

Algorithm 2 MR-Cube Algorithm

ESTIMATE-MAP(e)
1 # e is a tuple in the data
2 let C be the Cube Lattice
3 for each ci in C
4   do EMIT (ci, ci(e)) → 1  # the group is the secondary key

ESTIMATE-REDUCE/COMBINE((r, g), {e1, e2, ...})
1 # (r, g) are the primary/secondary keys
2 MaxSize S ← {}
3 for each r, g
4   do S[r] ← MAX(S[r], |g|)
5      # |g| is the number of tuples {ei, ..., ej} in g
6 return S

MR-CUBE-MAP(e)
1 # e is a tuple in the data
2 let Ca be the Annotated Cube Lattice
3 for each bi in Ca.batch_areas
4   do s ← bi[0].partition_factor
5      EMIT (e.SLICE(bi[0]) + e.id % s) → e
6      # value partitioning: 'e.id % s' is appended to the primary key

MR-CUBE-REDUCE(k, V)
1 let Ca be the Annotated Cube Lattice
2 let M be the measure function
3 cube ← BUC(DIMENSIONS(Ca, k), V, M)
4 EMIT-ALL (k, cube)
5 if (MEMORY-EXCEPTION)
6   then D' ← D' ∪ (k, V)

MR-CUBE(Cube Lattice C, Dataset D, Measure M)
1 Dsample = SAMPLE(D)
2 RegionSizes R = ESTIMATE-MAPREDUCE(Dsample, C)
3 Ca = ANNOTATE(R, C)  # value partitioning & batching
4 while (D)
5   do R ← R ∪ MR-CUBE-MAPREDUCE(Ca, M, D)
6      D ← D'  # retry failed groups D' from MR-Cube-Reduce
7      Ca ← INCREASE-PARTITIONING(Ca)
8 Result ← MERGE(R)  # post-aggregate value partitions
9 return Result

In Fig. 11 we present a walkthrough of MR-Cube over our running example. Based on the sampling results, cube regions <*, *> and <country, *> have been deemed reducer-unfriendly and require partitioning into 10 parts. We depict materialization for 2 of the 5 batch areas, b1 and b5. For each tuple in the dataset, MR-CUBE-MAP emits key:value pairs for each batch area. If required, keys are appended with a hash based on value partitioning, e.g. the 2 in <*, *>, 2 : u2, usa. The shuffle phase then sorts them according to key, yielding 4 reducer tasks. Algorithm BUC is run on each reducer, and the cube aggregates are generated. The value-partitioned groups representing <*, *> are merged during post-processing to produce the final result for that group.

Note that if a reducer fails due to a wrong estimation of group size, all the data for that group is written back to disk, and follow-up MapReduce jobs are then run with more aggressive value partitioning until the cube is completed. It should be noted that in practice, a follow-up job is rarely, if at all, needed.
VI. EXPERIMENTAL EVALUATION

We perform the experimental evaluation on a production Yahoo! Hadoop 0.20 cluster as described in [24]. Each node has 2 Quad-Core 2.5GHz Intel Xeon CPUs, 8GB RAM, 4 x 1TB SATA HDDs, and runs RHEL/AS 4 Linux. The heap size for each mapper or reducer task is set to 4GB. All tasks are implemented in Python and executed via Hadoop Streaming. Similar to [1], we collect results only from successful runs, where all nodes are available, operating correctly, and there is no delay in allocating tasks. We report the average number from three successful runs for each task.

A. Experimental Setup

1) Datasets: We adopt two datasets. The Real dataset contains real-life click streams obtained from query logs of Yahoo! Search. We examined clicks to Amazon, Wikipedia and IMDB on the search results. For each click tuple, we retain the following information: uid, ip, query, url. We establish three dimensions containing a total of six levels. The location dimension contains three levels (from lowest to highest, with cardinalities): city (74214) → state (2669) → country (235), and is derived from ip [5]. The time dimension

[5] Mapping is done using the free MaxMind GeoLite City database: http://www.maxmind.com/app/geolitecity.
Experimental Setup
Yahoo! Hadoop Cluster • Hadoop 0.20, 2x4core 2.5GHz Xeon/8GB RAM/4x1TB
• 4GB reserved for the workers
• Python + Hadoop Streaming
Datasets:
• A sample of real click log: 516M tuples, 3 dims, 6 levels
Baselines:
• Naïve
• MR-BPP and MR-PT, adapted from earlier works by Ng et al., You et al., and Sergey et al.
Scalability over Real Query Logs
!"#
$%"#
%&"#
&'"#
(!"#
$# %# &# %"# $""# )$!#
!"#$%&'(%
)*+*%,"-$%&./#0$1%23%45$6+'7%"6%8"99"26'(%
Dataset: Real Measure: Reach
*+,-..#
*+,./#
*+,0123#
45673#
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing (very simple)
• Query Log Cube Analysis (advanced)
Fallacies of Big Data
Why Big Data Analysis Can be Wrong Errors in traditional statistical analysis • Sampling error
• Sampling bias
Sampling error: a randomly chosen sample may not reflect the total population
• Big Data addresses this issue → larger sample, less error
Sampling bias: the samples are not randomly chosen • Depending on the analysis, Big Data may make this worse!
• Only solution is to gather all data points, which may be impossible
E.g., the Facebook/Twitter population is very different from the general population; see the 2016 US election!
The Best Practice
Be well versed in statistical techniques
Be aware of sampling bias and avoid drawing conclusions purely based on ‘found’ data
Be more hypothesis driven and test ideas with experiments • With the large volume of data, exploratory analysis inevitably
lead to some interesting patterns that are simply there by chance • Cross-validation is critical
Review of Today’s Material
Introduction to Big Data, Cloud Computing
Various MapReduce techniques
Two case studies relevant to core tasks of search engines
What to pay attention to when doing big data analysis