Web Search Engines Lecture 8 2016-11-14
Cong Yu, Google Research NYC
Big Data
NYU-CS Web Search Engines Fall 2016
Midterm
P1: Crawling • (10 pt) circular for consistent hashing + replica for load balancing
• Hashing on hostname, reassign existing URLs upon machine removal, rate limiting per machine
• “.edu”!
P2: Segmentation
• (10 pt) applying statistical methods
• Pseudo-code, discussing batch and live evaluation
P3: Index Compression → byte-aligned is decodable.
Grades are released: mean is 73, median is 75.
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing
• Query Log Cube Analysis
Fallacies of Big Data
Handling Large Scale Data is Necessary
Google: 20PB per day processed by MapReduce
Facebook: 2.5 PB user data, growing 15TB per day in 2009
Petabytes of data are the norm for Internet companies.
Scientific studies generate similar amounts of data • Sloan Digital Sky Survey: 200GB per night
• Human Genome Project: 3GB of raw data per individual
Leveraging Large Scale Data is Powerful
Translation: analyzing parallel texts (sentences written in two languages in parallel) at a large scale drastically improves the accuracy of automatic translation.
Google Flu Trends: analyzing search queries predicted flu spread more accurately and faster than the CDC in 2009.
Big Data vs Cloud Computing
Big Data / Data Science: techniques for storing, analyzing, and mining large
scale, fast moving, heterogeneous data • Volume, Velocity, Variety, Veracity
Cloud Computing: hosting various services for remote customers • Infrastructure, Platform, Software, Data
IaaS: Virtual Machines • Amazon EC2
PaaS: IaaS+OS+Servers+Runtime → Developer facing • Cloud Foundry, Google App Engine, Microsoft Azure
SaaS and DaaS: PaaS+Applications+Data → User facing • Google Apps
Big data techniques are usually integrated at the level of PaaS.
The MapReduce Paper
MapReduce: Simplified Data Processing on Large Clusters, by Jeff Dean and Sanjay Ghemawat, OSDI 2004.
Built upon decades of research in parallel and distributed computing
Demonstrated, to a large audience, the power of the shared-nothing parallel processing architecture
Extremely valuable to data processing on the Web, essentially launched the Big Data era.
MapReduce now refers to both the Google implementation and the programming model with open and proprietary implementations
• Hadoop, Dryad • Latest advancements: Spark (iterative), Kafka (streaming), TensorFlow
Intuition and Infrastructure Divide and Conquer • Divide the large amount of input data into smaller chunks
• Process each chunk separately in a shared-nothing environment, and produce intermediate outputs from each chunk
• Aggregate those intermediate outputs into the final output
A MapReduce stack provides the infrastructure for: • Retrieving input from and writing output to the distributed storage system
• Data processing: • Running user provided code for processing data chunks • Shuffling intermediate outputs to their destinations • Running user provided code for aggregating intermediate outputs to produce
the final output
• A master for monitoring failures and restarting failed tasks
• A scheduling system for coordination among tasks and jobs
Key Ideas
Offline batch processing
Lots of cheap machines are better than a few high-end machines
Failures are inevitable, as a consequence of the above
Moving data is more expensive than moving computation
As simple as possible for developers
But not suitable for all data processing tasks • If possible, adjust the tough tasks to suit the MapReduce model
Basic Concepts
Mapper • Workers that execute map tasks, each of which processes one
individual input chunk
• The Map function is provided by the user • input record → {(Key1, Value1), (Key2, Value2), …}
Reducer • Workers that execute reduce tasks, each of which aggregates
intermediate output records of the same Key
• The Reduce function is provided by the user • (Key, {Value1, Value2, …}) → (Key, Result)
Document Word Count: Simple Flow
Mapper output:   a 1, b 1, c 1, c 1  |  a 1, c 1, b 1, c 1
Reducer input:   a → {1, 1};  b → {1, 1};  c → {1, 1, 1, 1}
Reducer output:  a 2, b 2, c 4
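The flow above can be simulated end to end in a few lines (a minimal sketch; `run_mapreduce` is an illustrative stand-in for the framework, not a real Hadoop API):

```python
from collections import defaultdict

def map_fn(doc):
    # emit (word, 1) for every word in the input chunk
    for word in doc.split():
        yield word, 1

def reduce_fn(key, values):
    # sum all counts collected for one word
    return key, sum(values)

def run_mapreduce(docs, map_fn, reduce_fn):
    # shuffle/sort: group intermediate values by key, then reduce each group
    groups = defaultdict(list)
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run_mapreduce(["a b c", "c a c", "b c"], map_fn, reduce_fn)
# matches the flow above: {'a': 2, 'b': 2, 'c': 4}
```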
Document Word Count
Execution Framework
Execution Framework The infrastructure provides the critical layer of shuffle/sort • Performance differentiator between implementations
Shuffle (i.e. Partition) • Partitions the intermediate output based on Key • Assigns each partition to specific reducers
Sort, for each partition • Sorts the records based on Key • Groups values with the same Key together • Each reducer can expect the records to come in sorted by the Key • There is no guarantee of order across different reducers
User can provide customized partition and sort functions to control the shuffle/sort behavior
In-Class Exercise
Write a MapReduce program that, for every term, computes its average per-document frequency over the documents in which it appears
Answer
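A hedged sketch of one possible answer (framework simulated in memory; helper names are illustrative): each mapper emits one term-frequency record per distinct term per document, and each reducer averages the frequencies it receives.

```python
from collections import Counter, defaultdict

def map_fn(docid, text):
    # emit one (term, tf_in_doc) record per distinct term in the document
    for term, tf in Counter(text.split()).items():
        yield term, tf

def reduce_fn(term, tfs):
    # average term frequency over the documents the term appears in
    return term, sum(tfs) / len(tfs)

def run(docs):
    # shuffle/sort: group term frequencies by term, then reduce
    groups = defaultdict(list)
    for docid, text in docs:
        for term, tf in map_fn(docid, text):
            groups[term].append(tf)
    return dict(reduce_fn(t, tfs) for t, tfs in groups.items())

avg_tf = run([("d1", "a a b"), ("d2", "a b b b")])
# a: (2 + 1) / 2 = 1.5;  b: (1 + 3) / 2 = 2.0
```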
Notable Issues
Reduce workers cannot start until all map workers are done • Speculative execution can speed things up in anticipation of mapper
failures: i.e., run duplicate copies of the same map task if it is running slowly
Straggling reducers • When data distribution is skewed, a single Key may have lots of Values
• The reducer being assigned that task becomes the job bottleneck
Failure handling • Master polls each worker periodically and considers one has failed if no
response within certain amount of time
• All tasks (even those completed b/c master doesn’t know) executed by that worker will be reset and executed by a new worker.
• All reduce workers will be notified to read from the new worker
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing
• Query Log Cube Analysis
Fallacies of Big Data
Local Aggregation
In the most basic MapReduce model, aggregations only happen on the reducers
• All the intermediate data must be shipped (often over the wire) from the mapper to the remote reducer
Sometimes, it is much more efficient to execute the reduce tasks partially on the mapper to avoid the extra communication cost
• This helps alleviate the straggling reducer problem, although it does not eliminate it
Local Aggregation within Mapper
The mapper can easily run out of memory, so it is important to flush the data in blocks
process multiple docs
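In-mapper combining with block flushing, as described above, can be sketched like this (`flush_size` is an assumed threshold; a real mapper would flush based on memory pressure):

```python
from collections import Counter

class InMapperCombiner:
    # accumulate partial counts across multiple documents, flushing the
    # buffer whenever it grows too large (assumed size threshold)
    def __init__(self, flush_size=1000):
        self.buffer = Counter()
        self.flush_size = flush_size
        self.emitted = []          # stand-in for EMIT to the framework

    def map(self, doc):
        for word in doc.split():
            self.buffer[word] += 1
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        # emit locally aggregated (word, partial_count) records
        for word, count in self.buffer.items():
            self.emitted.append((word, count))
        self.buffer.clear()

m = InMapperCombiner(flush_size=2)
for doc in ["a b a", "c c"]:
    m.map(doc)
m.flush()    # final flush at mapper close
```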
Local Aggregation through Combiner

Mapper output:       a 1, b 1, c 1, c 1  |  a 1, c 1, b 1, c 1
After the combiner:  a 1, b 1, c 2       |  a 1, c 1, b 1, c 1   (combiner ran on the first mapper only)
Reducer input:       a → {1, 1};  b → {1, 1};  c → {2, 1, 1}   (without any combiner, c → {1, 1, 1, 1})

Important notes on combiners • Not guaranteed to be executed and may get executed multiple times
• Must process the intermediate output in the same way as reducers
• Must produce intermediate output that can be fed into the reducers
Algebraic Measure and Combiner
Let A and B be two different sets of numbers
AVERAGE(A U B) != AVERAGE(AVERAGE(A), AVERAGE(B))
• AVERAGE({1, 2, 3, 4, 5}) == 3
• AVERAGE(AVERAGE({1, 2}), AVERAGE({3, 4, 5})) = 2.75
AVERAGE(A U B) == SUM(SUM(A), SUM(B)) / SUM(COUNT(A), COUNT(B))
• SUM(SUM({1, 2}), SUM({3, 4, 5})) == 15
• SUM(COUNT({1, 2}), COUNT({3, 4, 5})) == 5
Measures like sum, count, average are algebraic measures that lend themselves to optimization through combiner
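AVERAGE becomes combiner-friendly once the intermediate value carries a (sum, count) pair instead of a bare average; combining such pairs is associative, so a combiner can safely merge them before the reducer. A minimal sketch:

```python
def combine(partials):
    # each partial is a (sum, count) pair; merging pairs is associative,
    # so this same function can run as a combiner or as the reducer
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

def average(partials):
    s, c = combine(partials)
    return s / c

a, b = [1, 2], [3, 4, 5]
partial_a = (sum(a), len(a))   # (3, 2)
partial_b = (sum(b), len(b))   # (12, 3)
avg = average([partial_a, partial_b])
# (3 + 12) / (2 + 3) = 3.0, the correct global average
```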
Using a Combiner
Using a Combiner, Correctly
Thought Exercise
How to write a MapReduce program that computes the number of times each pair of words co-occurs in the same sentence?
• E.g., given “new york university” and “new york city”:
• {“new”, “york”} = 2, {“new”, “university”} = 1, etc.
Pairs and Stripes
There is a fundamental tension in most MapReduce algorithms between two aspects
• Number of intermediate output records
• Size of intermediate output records
Pair-based approaches aim to reduce the latter
Stripe-based approaches aim to reduce the former
The Pair Approach
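A possible sketch of the Pair approach for the co-occurrence exercise (unordered word pairs; the framework's shuffle is simulated in memory):

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(sentence):
    # emit one small record per unordered word pair in the sentence
    words = sorted(set(sentence.split()))
    for w1, w2 in combinations(words, 2):
        yield (w1, w2), 1

def reduce_pairs(groups):
    # each reducer simply sums the counts for its pair key
    return {pair: sum(vs) for pair, vs in groups.items()}

groups = defaultdict(list)
for s in ["new york university", "new york city"]:
    for k, v in map_pairs(s):
        groups[k].append(v)
cooc = reduce_pairs(groups)
# ('new', 'york') -> 2, ('new', 'university') -> 1, ...
```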
The Stripe Approach
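The Stripe approach can be sketched as follows: each word emits an associative array (a "stripe") of its co-occurring words, and the reducer sums stripes element-wise (a minimal in-memory simulation):

```python
from collections import Counter, defaultdict

def map_stripes(sentence):
    # for each word, emit one stripe: co-occurring word -> count
    words = sentence.split()
    for i, w in enumerate(words):
        yield w, Counter(words[:i] + words[i + 1:])

def reduce_stripes(word, stripes):
    # element-wise sum of all stripes for one word
    total = Counter()
    for s in stripes:
        total.update(s)
    return word, total

groups = defaultdict(list)
for s in ["new york university", "new york city"]:
    for k, v in map_stripes(s):
        groups[k].append(v)
stripes = {k: dict(reduce_stripes(k, vs)[1]) for k, vs in groups.items()}
# stripes['new'] == {'york': 2, 'university': 1, 'city': 1}
```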
Pair versus Stripe
There is no single right way; it all depends on the problem
The Pair approach • Each intermediate output record is small
• There are a lot of them → all possible word pair combinations
The Stripe approach • The number of intermediate output records is limited: # of distinct words
• Each record can be very large → an array of many words and their counts • Can cause the straggling reducer problem
If applicable, the Stripe approach is slightly preferred • Less burden on shuffle and sort due to the lesser number of records
• Also moves a slightly smaller amount of data due to savings on the Keys
Neither solves the straggling reducer problem
Thought Exercise
Write a MapReduce program that computes Pr(wj|wi) • How likely wj will appear if wi appears (say in a sentence):
Straight-forward using the Stripe approach
• In the Reduce function, the total count is readily available
How to do that using the Pair approach?
Customized Partition and Sort
A Customized Partition function on the key can be applied to guide how the shuffle phase distributes intermediate records
• All records with the same partition ID go to the same reducer
A Customized Sort function on the Key can be applied to determine the order of intermediate records being processed by the reducer
Computing Pr(w1|w2) using the Pair Approach
In addition to word pairs, for each word in a sentence, output
(w, *) → # of all words
Customized Partition function to hash on only the first word
Customized Sort function to sort “*” before any real word • Using a stateful Reduce function
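This order-inversion trick can be sketched as follows (the sort emulation and helper names are illustrative, not the Hadoop API). Because “*” sorts before any real word within a partition, the (wi, *) marginal reaches the stateful reducer before any real pair, so the total is known when the pair counts arrive:

```python
from collections import defaultdict

def map_fn(sentence):
    # emit one record per ordered co-occurrence, plus a (wi, '*')
    # marker record counting all co-occurrences of wi
    words = sentence.split()
    for i, wi in enumerate(words):
        for j, wj in enumerate(words):
            if i != j:
                yield (wi, wj), 1
                yield (wi, '*'), 1

groups = defaultdict(int)
for s in ["a b", "a c", "a b"]:
    for k, v in map_fn(s):
        groups[k] += v          # combiner-style aggregation

def reduce_in_sort_order(groups):
    # emulate the customized sort: for each first word, '*' comes first,
    # so the stateful reducer sees the total before any real pair
    probs, total = {}, {}
    ordered = sorted(groups.items(),
                     key=lambda kv: (kv[0][0], kv[0][1] != '*', kv[0][1]))
    for (wi, wj), count in ordered:
        if wj == '*':
            total[wi] = count
        else:
            probs[(wi, wj)] = count / total[wi]
    return probs

probs = reduce_in_sort_order(groups)
# Pr(b|a) = 2/3, Pr(c|a) = 1/3
```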
Design Principles for MapReduce Algorithms
Carefully design the Key and Value data structures to work with Partition and Sort functions
Apply Combiner whenever possible
Be aware of potential data skews and the straggling reducer problem
Chain multiple MapReduce jobs together to accomplish complex tasks
• Check out Flume, Pig, Dryad
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing (very simple)
• Query Log Cube Analysis (advanced)
Fallacies of Big Data
MapReduce and Web Search Engines
MapReduce was designed to handle two important tasks that are at the heart of Web Search Engines
• Document processing and indexing at Web Scale
• Mining search query logs
Big data techniques became necessary because of the enormous amount of data
• It’s not a coincidence that MapReduce was developed at Google
Inverted Indexing via MapReduce: Basic
Inverted Indexing using MapReduce: Basic
Any Issue?
The in-memory sort in the Reduce function can be costly
Apply the techniques we learned earlier
• Push docid to be part of the Key, i.e., <term, docid>
• Customize the Partition function to partition on term only
• Customize the Sort function to sort term first and docid second
Then the MapReduce execution framework will automatically perform the much more scalable distributed sort
This value-to-key strategy is often used to leverage the inherent sorting facility within MapReduce
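The value-to-key pattern for indexing can be sketched like this (in-memory simulation; the `partition` helper is illustrative): docid moves into the composite key, the partitioner keeps all postings of a term on one reducer, and the framework's sort delivers postings in order so the reducer never sorts in memory.

```python
from collections import defaultdict

def map_fn(docid, text):
    # value-to-key: docid is pushed into the Key, so the framework's
    # distributed sort orders the postings for us
    for term in text.split():
        yield (term, docid), 1

def partition(key, num_reducers):
    # partition on the term only, so every posting of a term
    # reaches the same reducer despite the composite key
    term, _docid = key
    return hash(term) % num_reducers

# emulate combining plus the framework's sort on the composite key
records = defaultdict(int)
for docid, text in [(2, "b a"), (1, "a b a")]:
    for k, v in map_fn(docid, text):
        records[k] += v            # term frequency per (term, docid)

index = defaultdict(list)
for (term, docid), tf in sorted(records.items()):
    index[term].append((docid, tf))   # postings arrive already sorted
# index['a'] == [(1, 2), (2, 1)]
```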
Inverted Indexing via MapReduce: No User Sort
Take-Home Exercise
Compute PageRank via MapReduce
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing (very simple)
• Query Log Cube Analysis (advanced)
Fallacies of Big Data
Cube Analysis of Query Logs
Search engines accumulate huge query logs • Each query event has a lot of information
• Location (inferred from IP)
• Demographics (inferred from past behavior) • The query itself (mapped to semantic categories)
Used to identify interesting facts across all combinations of dimensions
• E.g. “Women in 40s from Michigan search a lot about knitting.”
Over an extended period, there can be a lot of query events
Distributed Cube Materialization
Distributed Cube Materialization on Holistic Measures, by Arnab Nandi, Cong Yu, Philip Bohannon, Raghu Ramakrishnan. ICDE, 2011.
A brief look at an advanced usage of MapReduce
Techniques proposed in the paper are incorporated into the official Hadoop implementation.
Holistic Measures
Algebraic measures • Value for parent can be computed easily from values of children.
• COUNT(A U B) = SUM(COUNT(A), COUNT(B))
Holistic Measures • Value for parent cannot be computed from values of children.
• COUNT(DISTINCT(A U B)) • distinct(A = {a, b, b, c}) = 3; distinct(B = {a, c}) = 2;
• distinct(A U B) = 3 cannot be computed from 3 and 2 alone
Real Example Task
Given the location and topic hierarchies, compute volume and reach (number of distinct users) of all cube groups whose reach is at least 5.
User | Query (Event) | Time     | Topic         | Location
u1   | Nikon d40     | 20090601 | SLR           | Ann Arbor
u2   | Canon sd100   | 20090601 | Point & Shoot | Detroit
u1   | Apple iPhone  | 20090602 | Smartphone    | Ann Arbor
u3   | Nikon d40     | 20090604 | SLR           | New York
u4   | Nokia 1100    | 20090605 | Basic Phone   | Sunnyvale
3 levels each: Topic → Cat → Subcat;  Country → State → City
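Mapping one event to its cube groups is a small cross product: with 3 hierarchy levels plus the all-value “*” per dimension, each event belongs to 4 x 4 = 16 groups. A minimal sketch (the topic-level names below are hypothetical illustrations):

```python
from itertools import product

def cube_groups(location_path, topic_path):
    # each path lists ancestors from most general to most specific;
    # prepending '*' gives the 4 generalization levels per dimension
    loc_levels = ['*'] + location_path
    top_levels = ['*'] + topic_path
    return list(product(loc_levels, top_levels))

groups = cube_groups(['USA', 'Michigan', 'Ann Arbor'],
                     ['Electronics', 'Camera', 'SLR'])   # hypothetical names
# 16 groups, from ('*', '*') down to ('Ann Arbor', 'SLR')
```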
All Combinations, i.e., the Cube
Naïve Algorithm
…of the typical tasks that analysts demand and will be used throughout this paper. They both involve holistic measures: reach for the former and top-k for the latter. The schema, data and lattice for both tasks is the same as our running example in Sec. II.

Example 1 (Coverage Analysis): Given the location and topic hierarchies, compute volume and reach of all cube groups whose reach is at least 5. Highlight those cube groups whose reach is unusually high compared with their volume. Here, the measure volume is defined as the number of search tuples in the group, while reach is defined as the number of unique users issuing those searches.

This analysis is inspired by the need to identify query topics that are issued relatively infrequently by the users, but cover a large number of users: these queries are often missed by traditional frequency based analyses because of their relatively low volume, even though those query topics can in fact have an impact on a very large user population. A SQL-style specification corresponding to the cubing task for this analysis is shown in Fig. 1, while the dimension hierarchies and hierarchical cube lattice are shown in Fig. 5 and Fig. 6 respectively.

Example 2 (Top-k Query Analysis): Given the location and topic hierarchies, compute the top-5 frequent queries for all cube groups. Highlight those cube groups that share fewer than 3 top queries with their parent groups.

This analysis aims to discover location bias for various query topics. For example, the top political queries in Austin can be very different from those in Texas in general, which in turn can be different from those in the entire USA.

We note again that cube computation is only part of both analysis tasks, albeit an essential part. Highlighting the desired cube groups sits on top of the cube computation layer and is beyond the scope of this work. We provide some discussion on how to efficiently identify interesting groups in Sec. VIII.

IV. CHALLENGES

Typically, the CUBE operation can be expressed in high-level MapReduce languages (e.g., Pig [19]) as a disjunction of group-by queries. A query optimizer would then (in the ideal case) combine all the queries into a single MapReduce job. Algo. 1 represents this combined cube computation. Intuitively, this naive algorithm divides the full cubing task into a collection of aggregation tasks, one for each cube group, and distributes them for computation using the MapReduce framework. In particular, the function R(e) extracts, from the tuple e, the values of the group-by attributes specified by the cube region R.

For example, given the cube lattice in Fig. 6, the search tuple e1 = <091203, u1, 64.97.55.3, iPhone> is mapped to 16 groups (one per region in the cube), including the smallest group <Ann Arbor, Smart> and the broadest group <*, *>. Each reducer then computes the measures for its assigned groups by applying the measure function. Measure-specific optimization can be incorporated into the naive algorithm to reduce the amount of intermediate data. As an example, for coverage analysis (Example 1), to compute the measure reach, we only need the attribute uid. Therefore the mapper can emit just e.uid instead of the full tuple.

Algorithm 1 Naive Algorithm

MAP(e)
1 # e is a tuple in the data
2 let C be the Cube Lattice
3 for each Region R in C
4   do k = R(e)
5      EMIT k → e

REDUCE/COMBINE(k, {e1, e2, ...})
1 let M be the measure function
2 EMIT k → M({e1, e2, ...})

It should be noted that despite its simplicity, the Naive algorithm fares quite well for small datasets. As we will see in Sec. VI, it outperforms its competitors for such datasets due to the extremely low overhead costs. However, as the scale of data increases, we encounter two key challenges that cause this algorithm to perform poorly and eventually fail: the size of intermediate data and the size of large groups. We briefly describe these challenges next.

A. Size of Intermediate Data

The first challenge arises from the large size of intermediate data being generated from the map phase, which measures at |C| x |D|, where |C| is the number of regions in the cube lattice and |D| is the size of the input data. Since |C| increases exponentially with both the number and depth of dimensions to be explored, the naive approach can quickly lead to the system running out of disk space during the map phase or struggling through the shuffle (i.e., sort) phase.

B. Size of Large Groups

The second challenge arises from cube groups belonging to cube regions at the bottom part of the cube lattice, such as <USA, Shopping> or even <*, *> (i.e., the cube group containing all tuples). The reducer that is assigned the latter group essentially has to compute the measure for the entire dataset, which is usually large enough to cause the reducer to take significantly longer to finish than others, or even fail. As the size of the data increases, the number of such groups also increases. We call such groups reducer-unfriendly. A cube region with a significant percentage of reducer-unfriendly groups is called a reducer-unfriendly region.

For algebraic measures, this challenge can be addressed by not processing those groups directly: we can first compute measures only for the smaller, reducer-friendly groups, then combine those measures to produce the measure for the larger, reducer-unfriendly groups. Such measures are also amenable to mapper-side aggregation, which further decreases the load on the shuffle and reduce phases. For holistic measures, however, measures for larger groups cannot be assembled from their smaller child groups, and mapper-side aggregation is also not possible. Hence, we need a different approach.
Naïve Algorithm
Map phase (16 records per tuple!):
<Ann Arbor, SLR> → u1
<Michigan, Camera> → u1
…
<*, *> → u1

Reduce phase:
<Ann Arbor, SLR> → {u1}
<Michigan, Camera> → {u1, u2}
…
<*, *> → {u1, u2, u3, u4, u5}
Challenges
Map phase: many intermediate keys • Number of Keys = |C| x |D|, where |C| is the number of regions in
the lattice, and |D| is the input size.
• As the number of dimensions increases, this can cause problems for the map/shuffle phase.
Reduce phase: extremely large reduce groups
• Some reduce shards can be very large and cause the reducer dealing with those groups to run out of memory.
• In fact, one group is guaranteed to have this problem: <*, *>
Handle Straggling Reducer
Inspired by a common method to compute the unique number of users in a query log
• The usual practice involves two steps: 1) a MapReduce job to get the set of unique user IDs; 2) computing the size of that set to produce the final answer
Holistic measure computation cannot be distributed, but • What if the input is split into disjoint sets with regard to the
attribute of the measure (in the above example, user ID)?
Example
Reduce phase:
<*, *, uid> → {u1}
<*, *, uid> → {u2}
…
<*, *, uid> → {u5}

No more large groups!
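The uid-partitioning idea can be sketched as follows (a minimal sketch, not the paper's implementation): the shards are disjoint in uid, so per-shard distinct counts simply add up to the true reach, and no single reducer ever sees the whole <*, *> group.

```python
def reach_partitioned(uids, num_shards=4):
    # route each event to a shard by hashing its uid, dedupe within
    # the shard, then sum the per-shard distinct counts
    shards = [set() for _ in range(num_shards)]
    for uid in uids:
        shards[hash(uid) % num_shards].add(uid)
    return sum(len(shard) for shard in shards)

events = ["u1", "u2", "u1", "u3", "u4"]
# reach = 4 distinct users, however the uids hash across shards
```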
Detecting Large Cube Groups
Run a small MapReduce job over a sample of the data using the naïve algorithm
• Goal is to detect any cube group that might be reducer-unfriendly
With a sample size of 2M tuples, we can obtain a sufficiently accurate estimate for 20B tuples
• Proof based on Chernoff Bounds
Slightly Less Naïve Approach
Reducer-unfriendly regions: <*, *, uid>, <*, topic, uid>, <*, cat, uid>, <*, subcat, uid>, <country, *, uid>

Reducer-friendly regions: <country, topic>, <country, cat>, <country, subcat>, <state, *>, <state, topic>, <state, cat>, <state, subcat>, <city, *>, <city, topic>, <city, cat>, <city, subcat>

Value-partitioning the reducer-unfriendly regions produces even more intermediate records!
Incorporating Partition Factor
MR-Cube Algorithm
!"#$!!%&&!!%'()'$!!*+,)&-!!".$!!&/0$!!*+,)&-!!!!!!!!!".$!!1-2')*2$!!&*3)&!!145! !6
%+!
7,"8
-!
9-1"
0-!
!"#"$%&'()&*+(+,)-(./"&'("0(-1+/,/)-2(3(*,4)-(,/+,5(
6,7&+%',/88"05()"$*10+9(10('"54%'/")+5510.(
!:;*0,*<%&$=>?!!"#$!!%&&!!%'()'$!!+,)&-!!:=$=>$!#!!!!!!!!!!?!!"#$!!"@%!
!:&-A!/)'3$=>?!!".$!!&/0$!!+,)&-!!:=$=>$!.!!!!!!!!!!?!!".$!!"@%!
!:;*0,*<%&$=>?!!".$!!1-2')*2$!!0%;-'%!!:=$=>$!.!!!!!!!!!!?!!".$!!"@%!
!:;*0,*<%&$=>?!!"#$!!%&&!!%'()'$!!+,)&-!
!:&-A!/)'3$=>?!!".$!!&/0$!!+,)&-!
!:=$=>$!.!!!!!!!!!!?!!".$!!"@%!
!:;*0,*<%&$=>?!!".$!!1-2')*2$!!0%;-'%!
!:=$=>$!.!!!!!!!!!!?!!".$!!"@%!
!:=$=>$!#!!!!!!!!!!?!!"#$!!"@%!
!:=$=>?!#!
!:=$=>?!#!
!:&-A!/)'3$!=>?!#!
!:&/0$!=>!!!!!!!!!!!!?!#!
!:&-A!/)'3$!0%;-'%>!?!#!
!:&/0$!0%;-'%>!!!!!!!!!!!!!?!#!
!:;*0,*<%&$!+,)&->!!!?!#!
!:;*0,*<%&$!0%;-'%>!?!#!
!:%&&!%'()'$!+,)&->!!?!#!
!:1-2')*2$!0%;-'%>!!!!!?!#!
!:;*0,*<%&$!=>!?!.!
!:%&&!%'()'$!=>?!#!
!:1-2')*2$!=>!!!!!?!#!
Fig. 11. Walkthrough for MR-Cube with reach as measure, as described in Sec. V-D. Both the Map and Reduce stages take the AnnotatedCube Lattice(Fig. 10) as a parameter. Walkthrough is shown only for batch areas b1 (denoted by red ⇥ on tuples) & b5 (denoted by blue �).
D. Cube Materialization & Walkthrough

As shown in Algo. 2, an annotated lattice is generated and then used to perform the main MR-CUBE-MAP-REDUCE.

Algorithm 2 MR-Cube Algorithm

ESTIMATE-MAP(e)
1 # e is a tuple in the data
2 let C be the Cube Lattice
3 for each ci in C
4   do EMIT (ci, ci(e)) → 1  # the group is the secondary key

ESTIMATE-REDUCE/COMBINE((r, g), {e1, e2, ...})
1 # (r, g) are the primary/secondary keys
2 MaxSize S ← {}
3 for each r, g
4   do S[r] ← MAX(S[r], |g|)
5      # |g| is the number of tuples {ei, ..., ej} in g
6 return S

MR-CUBE-MAP(e)
1 # e is a tuple in the data
2 let Ca be the Annotated Cube Lattice
3 for each bi in Ca.batch_areas
4   do s ← bi[0].partition_factor
5      EMIT (e.SLICE(bi[0]) + e.id % s) → e
6      # value partitioning: 'e.id % s' is appended to the primary key

MR-CUBE-REDUCE(k, V)
1 let Ca be the Annotated Cube Lattice
2 let M be the measure function
3 cube ← BUC(DIMENSIONS(Ca, k), V, M)
4 EMIT-ALL (k, cube)
5 if (MEMORY-EXCEPTION)
6   then D' ← D' ∪ (k, V)

MR-CUBE(Cube Lattice C, Dataset D, Measure M)
1 Dsample = SAMPLE(D)
2 RegionSizes R = ESTIMATE-MAPREDUCE(Dsample, C)
3 Ca = ANNOTATE(R, C)  # value partitioning & batching
4 while (D)
5   do R ← R ∪ MR-CUBE-MAPREDUCE(Ca, M, D)
6      D ← D'  # retry failed groups D' from MR-Cube-Reduce
7      Ca ← INCREASE-PARTITIONING(Ca)
8 Result ← MERGE(R)  # post-aggregate value partitions
9 return Result

In Fig. 11 we present a walkthrough of MR-Cube over our running example. Based on the sampling results, cube regions <*, *> and <country, *> have been deemed reducer-unfriendly and require partitioning into 10 parts. We depict materialization for 2 of the 5 batch areas, b1 and b5. For each tuple in the dataset, MR-CUBE-MAP emits key:value pairs for each batch area. If required, keys are appended with a hash based on value partitioning, e.g. the 2 in <*, *>, 2 : u2, usa. The shuffle phase then sorts them according to key, yielding 4 reducer tasks. Algorithm BUC is run on each reducer, and the cube aggregates are generated. The value-partitioned groups representing <*, *> are merged during post-processing to produce the final result for that group.

Note that if a reducer fails due to a wrong estimation of group size, all the data for that group is written back to disk, and follow-up MapReduce jobs are then run with more aggressive value partitioning until the cube is completed. It should be noted that in practice, a follow-up job is rarely, if at all, needed.
VI. EXPERIMENTAL EVALUATION

We perform the experimental evaluation on a production Yahoo! Hadoop 0.20 cluster as described in [24]. Each node has 2 Quad-Core 2.5GHz Intel Xeon CPUs, 8GB RAM, 4 x 1TB SATA HDDs, and runs RHEL/AS 4 Linux. The heap size for each mapper or reducer task is set to 4GB. All tasks are implemented in Python and executed via Hadoop Streaming. Similar to [1], we collect results only from successful runs, where all nodes are available, operating correctly, and there is no delay in allocating tasks. We report the average number from three successful runs for each task.

A. Experimental Setup

1) Datasets: We adopt two datasets. The Real dataset contains real-life click streams obtained from query logs of Yahoo! Search. We examined clicks to Amazon, Wikipedia and IMDB on the search results. For each click tuple, we retain the following information: uid, ip, query, url. We establish three dimensions containing a total of six levels. The location dimension contains three levels (from lowest to highest, with cardinalities): city (74214) → state (2669) → country (235), and is derived from ip [5]. The time dimension

[5] Mapping is done using the free MaxMind GeoLite City database: http://www.maxmind.com/app/geolitecity.
Experimental Setup
Yahoo! Hadoop Cluster • Hadoop 0.20, 2x4core 2.5GHz Xeon/8GB RAM/4x1TB
• 4GB reserved for the workers
• Python + Hadoop Streaming
Datasets:
• A sample of real click log: 516M tuples, 3 dims, 6 levels
Baselines:
• Naïve
• MR-BPP and MR-PT, adapted from earlier works by Ng et al., You et al., and Sergey et al.
Scalability over Real Query Logs
!"#
$%"#
%&"#
&'"#
(!"#
$# %# &# %"# $""# )$!#
!"#$%&'(%
)*+*%,"-$%&./#0$1%23%45$6+'7%"6%8"99"26'(%
Dataset: Real Measure: Reach
*+,-..#
*+,./#
*+,0123#
45673#
Outline
Big Data
MapReduce Techniques
Case Studies
• Document Inverted Indexing (very simple)
• Query Log Cube Analysis (advanced)
Fallacies of Big Data
Why Big Data Analysis Can be Wrong Errors in traditional statistical analysis • Sampling error
• Sampling bias
Sampling error: a randomly chosen sample may not reflect the total population
• Big Data addresses this issue → larger sample, less error
Sampling bias: the samples are not randomly chosen • Depending on the analysis, Big Data may make this worse!
• Only solution is to gather all data points, which may be impossible
E.g., the Facebook/Twitter population is very different from the general population; see the 2016 US election!
The Best Practice
Be well versed in statistical techniques
Be aware of sampling bias and avoid drawing conclusions purely based on ‘found’ data
Be more hypothesis driven and test ideas with experiments • With the large volume of data, exploratory analysis inevitably
lead to some interesting patterns that are simply there by chance • Cross-validation is critical
Review of Today’s Material
Introduction to Big Data, Cloud Computing
Various MapReduce techniques
Two case studies relevant to core tasks of search engines
What to pay attention to when doing big data analysis