MRShare: Sharing Across Multiple Queries in MapReduce

By Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)

Presented by Xiaolan Wang and Pengfei Tang

Motivation

• Reducing the execution time
• Reducing energy consumption
• Monetary savings

*http://aws.amazon.com/ec2/#pricing

MRShare – a sharing framework for Map Reduce

• MRShare framework:
  – Inspired by sharing primitives from the relational domain
  – Introduces a cost model for Map Reduce jobs
  – Searches for the optimal sharing strategies
  – Does not change the Map Reduce computational model


Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
• Cost model for MapReduce
• MRShare – Grouping algorithms
• MRShare Implementation and Evaluation
• Summary


Map Reduce recap.

[Figure: MapReduce data flow. Labels: HDFS (input), Map, network, Reduce, Output, HDFS.]

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
  – Sharing scans
  – Sharing intermediate data
• Cost model for MapReduce
• MRShare – Grouping algorithms
• MRShare Implementation and Evaluation
• Summary

Sharing opportunities – sharing scans

• SELECT COUNT(*) FROM user GROUP BY hometown

• SELECT AVG(age) FROM user GROUP BY hometown

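[Figure: both queries scan the same input table with columns User_id, Hometown, Occupation, Age.]

• Query 1 (COUNT): Map emits (Toronto, 1) for a Toronto record; Reduce receives Toronto 1, Toronto 1, Toronto 1, Ottawa 1, Ottawa 1 and outputs Toronto 3, Ottawa 2.

• Query 2 (AVG): Map emits (Toronto, 17) for the same record; Reduce receives Toronto 17, Toronto 19, Montreal 20, Ottawa 23, Ottawa 25 and outputs Toronto 18, Montreal 20, Ottawa 24.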

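MRShare – sharing scans (map)

[Figure: a meta-map reads the input once and feeds it to Map 1, Map 2, Map 3 and Map 4, which together produce the map output.]

MRShare – sharing scans (reduce)

[Figure: a meta-reduce receives (key, value) pairs tagged per job J1 to J4, for example the values 1, 17, 19, 2 and 5 for key Toronto, and passes them to Reduce 1, Reduce 2, Reduce 3 and Reduce 4.]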
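The meta-map and meta-reduce idea for sharing scans can be sketched in a few lines of Python. This is a minimal in-memory simulation, not the MRShare implementation: the job names J1 and J2, the record layout, and the choice to carry the tag inside the value are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical per-job map and reduce functions for the two example queries.
def map_count(rec):            # SELECT COUNT(*) FROM user GROUP BY hometown
    yield rec["hometown"], 1

def map_avg(rec):              # SELECT AVG(age) FROM user GROUP BY hometown
    yield rec["hometown"], rec["age"]

def reduce_count(key, values):
    return sum(values)

def reduce_avg(key, values):
    return sum(values) / len(values)

# Assumed job registry: tag -> (map function, reduce function).
JOBS = {"J1": (map_count, reduce_count), "J2": (map_avg, reduce_avg)}

def meta_map(record):
    """Run every job's map on one record and tag each output with its job."""
    for tag, (map_fn, _) in JOBS.items():
        for key, value in map_fn(record):
            yield key, (tag, value)

def meta_reduce(key, tagged_values):
    """Split the values by tag and hand each subset to the matching reduce."""
    per_job = defaultdict(list)
    for tag, value in tagged_values:
        per_job[tag].append(value)
    return {tag: JOBS[tag][1](key, vals) for tag, vals in per_job.items()}

def run_shared(records):
    """Simulate one shared job: a single scan, a shuffle by key, then reduce."""
    shuffle = defaultdict(list)
    for rec in records:                      # the input is read only once
        for key, tagged in meta_map(rec):
            shuffle[key].append(tagged)
    return {key: meta_reduce(key, vals) for key, vals in shuffle.items()}

users = [
    {"hometown": "Toronto", "age": 17},
    {"hometown": "Toronto", "age": 19},
    {"hometown": "Montreal", "age": 20},
    {"hometown": "Ottawa", "age": 23},
    {"hometown": "Ottawa", "age": 25},
]
result = run_shared(users)
```

Here `result["Toronto"]` holds the count under `"J1"` and the average age under `"J2"`, both computed from one pass over `users`.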

Sharing Map Output

• SELECT T.a, sum(T.b) FROM T WHERE T.a>10 AND T.a<20 GROUP BY T.a

• SELECT T.a, avg(T.b) FROM T WHERE T.b>10 AND T.c<100 GROUP BY T.a
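Both queries above emit the same (T.a, T.b) pair for a qualifying row, so one way to share map output is to emit the pair once and tag it with every job whose WHERE clause the row satisfies. The sketch below illustrates that idea only; the tag names J1 and J2, the dictionary row layout, and the tuple-of-tags encoding are assumptions, not the encoding used by MRShare.

```python
def shared_map_output(row):
    """Emit (T.a, T.b) at most once, tagged with all jobs the row qualifies for."""
    tags = []
    if 10 < row["a"] < 20:                    # WHERE clause of the sum query
        tags.append("J1")
    if row["b"] > 10 and row["c"] < 100:      # WHERE clause of the avg query
        tags.append("J2")
    if tags:                                  # rows matching no job emit nothing
        yield row["a"], (tuple(tags), row["b"])

both = list(shared_map_output({"a": 15, "b": 12, "c": 50}))
only_first = list(shared_map_output({"a": 15, "b": 5, "c": 50}))
neither = list(shared_map_output({"a": 30, "b": 5, "c": 50}))
```

A row that satisfies both predicates contributes a single intermediate tuple instead of two, which is where the saving in sorted and transferred data would come from.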

Sharing Map

• SELECT T.c, sum(T.b) FROM T WHERE T.c > 10 GROUP BY T.c

• SELECT T.a, avg(T.b) FROM T WHERE T.c > 10 GROUP BY T.a

Same reducing.

Sharing Parts of Map

• SELECT T.a, sum(T.b) FROM T WHERE T.c>10 AND T.a<20 GROUP BY T.a

• SELECT T.a, avg(T.b) FROM T WHERE T.c>10 AND T.c<100 GROUP BY T.a

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
• Cost model for MapReduce
• MRShare – Grouping algorithms
• MRShare Implementation and Evaluation
• Summary

Cost model for Map Reduce (single job)

• Reading – f(input size)
• Sorting – f(intermediate data size)
• Transferring – f(intermediate data size)
• Writing – f(output size)

T(J) = T_read(J) + T_sort(J) + T_tr(J)

Cost of executing a group of jobs

[Figure: Read, Sort, Transfer and Write costs of the individual jobs compared with the combined job J1+J2+J3, marking potential costs and potential savings.]

Cost without grouping

• n – the number of jobs
• m – the number of map tasks
• r – the number of reduce tasks
• |Mi| – the average output size of a map task
• |Ri| – the average input size of a reduce task
• |Di| – the size of the intermediate data of job Ji

|Di| = |Mi| · m = |Ri| · r

n MapReduce jobs, J = {J1, . . . , Jn}, read from the same input file F.

Cost with grouping

• m – the number of map tasks
• r – the number of reduce tasks
• |Xm| – the average size of the combined output of map tasks
• |Xr| – the average size of the combined input of reduce tasks
• |XG| – the size of the intermediate data

|XG| = |Xm| · m = |Xr| · r

A single group G contains all n jobs and is executed as a single job JG.
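To make the comparison between the two cases concrete, the following sketch evaluates a simplified version of the cost model for n jobs that read the same input file. The constants, the function names, and the rule used for the number of sort passes are illustrative assumptions rather than the formulas from the presentation. It also assumes that only scans are shared, so the grouped intermediate data is the sum of the individual |Di|.

```python
def sort_passes(data_per_task, merge_factor):
    """Assumed rule: one extra merge pass for each factor of merge_factor in size."""
    passes, capacity = 1, merge_factor
    while capacity < data_per_task:
        capacity *= merge_factor
        passes += 1
    return passes

def job_cost(input_size, inter_size, m, c_read=1.0, c_sort=1.0, c_tr=1.0, merge_factor=10):
    """T(J) = T_read(J) + T_sort(J) + T_tr(J) under the simplified assumptions above."""
    t_read = c_read * input_size
    t_sort = c_sort * inter_size * sort_passes(inter_size / m, merge_factor)
    t_tr = c_tr * inter_size
    return t_read + t_sort + t_tr

def cost_without_grouping(input_size, inter_sizes, m):
    """Each job scans the input and sorts its own intermediate data."""
    return sum(job_cost(input_size, d, m) for d in inter_sizes)

def cost_with_grouping(input_size, inter_sizes, m):
    """One scan; the combined intermediate data is sorted and transferred together."""
    return job_cost(input_size, sum(inter_sizes), m)

small = [50, 50, 50]      # jobs with little intermediate data
large = [900, 900, 900]   # jobs with intermediate data close to the input size
print(cost_without_grouping(1000, small, 10), cost_with_grouping(1000, small, 10))
print(cost_without_grouping(1000, large, 10), cost_with_grouping(1000, large, 10))
```

With these made-up numbers, the three small jobs cost 3300 separately and 1450 as a group, while the three large jobs cost 11100 separately and 11800 as a group: the saved scans are outweighed by the extra sort pass over the larger combined data.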

Beneficial conditions

n <= B

Finding the optimal sharing strategy

• An optimization problem

“NoShare”

“GreedyShare”

Sharing scans - cost based optimization

• Savings come from the reduced number of scans
• The sorting cost might change
• The costs of copying and writing the output do not change

[Figure: Read and Sort costs of the individual jobs compared with the combined job J1+J2+J3, marking potential costs and savings.]

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing opportunities in Map-Reduce
• Cost model for MapReduce
• MRShare – Grouping algorithms
  – SplitJobs – cost based algorithm for sharing scans
  – MultiSplitJobs – an improvement of SplitJobs
• MRShare Evaluation
• Summary

SplitJobs – a DP solution for sharing scans.

• We reduce the problem of grouping to the problem of splitting a sorted list of jobs – by approximating the cost of sorting.

• Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points.

[Figure: the sorted list of jobs J1 to J6 split into groups G1, G2 and G3.]

SplitJobs

SplitJobs (cont.)

GS(i, l) = GAIN(i, l) − f

c(l) is the savings of the optimal grouping of jobs J1,…Jl.
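The dynamic program over split points can be sketched as follows, using the recurrence c(l) = max over i of c(i − 1) + GS(i, l) with c(0) = 0. The `gain` argument is a caller-supplied stand-in for GAIN(i, l), and the toy gain function below (each job gains f minus a penalty proportional to its distance from the last job in the group) is a hypothetical example, not the GAIN definition from the presentation.

```python
from typing import Callable, List, Tuple

def split_jobs(n: int, gain: Callable[[int, int], float], f: float) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (best savings, groups) for sorted jobs 1..n; groups are inclusive (i, l) ranges."""
    c = [0.0] * (n + 1)       # c[l]: savings of the best grouping of J1..Jl
    start = [0] * (n + 1)     # start[l]: first job of the last group in that grouping
    for l in range(1, n + 1):
        best, best_i = float("-inf"), l
        for i in range(1, l + 1):
            gs = gain(i, l) - f                 # GS(i, l) = GAIN(i, l) - f
            candidate = c[i - 1] + gs
            if candidate > best:
                best, best_i = candidate, i
        c[l], start[l] = best, best_i
    groups, l = [], n
    while l > 0:                                # walk the split points backwards
        groups.append((start[l], l))
        l = start[l] - 1
    groups.reverse()
    return c[n], groups

# Toy instance (all values hypothetical): jobs sorted by an abstract "size".
sizes = [1, 1, 2, 5, 5, 6]
F_OVERHEAD, PENALTY = 10, 4

def toy_gain(i: int, l: int) -> float:
    return sum(F_OVERHEAD - PENALTY * (sizes[l - 1] - sizes[j - 1]) for j in range(i, l + 1))

savings, groups = split_jobs(len(sizes), toy_gain, F_OVERHEAD)
```

For this toy instance the best split puts J1 to J3 in one group and J4 to J6 in another, for total savings of 24. The two nested loops make the sketch quadratic in the number of jobs.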

MultiSplitJobs – an improvement of SplitJobs

[Figure: jobs J1 to J8 processed by repeated applications of SplitJobs within MultiSplitJobs, yielding groups including G4.]

MultiSplitJobs (cont.)

Outline

• Introduction
• Map Reduce recap.
• MRShare – Sharing primitives in Map-Reduce
• MRShare – Cost based approach to sharing
• MRShare Implementation and Evaluation
• Summary

Implementing MRShare

• MRShare is implemented on Hadoop
• First, a batch of jobs is acquired from the queries arriving within a short time interval T
• Second, MultiSplitJobs is called to compute the optimal grouping of the jobs
• Third, the groups are rewritten using a meta-map and a meta-reduce function. These are MRShare-specific containers, and their functionality relies on tagging.
• Finally, the new jobs are submitted for execution

Tagging for Sharing Only Scans

Tagging for Sharing Map Output

Evaluation setup

• 40 EC2 small instance virtual machines
• Modified Hadoop engine
• 30 GB text dataset consisting of blogs
• Multiple grep-wordcount queries
  – Counts words matching a regular expression
  – Allows for variable intermediate data sizes
  – Generic aggregation Map Reduce job

Validation of the Cost Model

Evaluation goals

• Sharing is not always beneficial.– ‘GreedyShare’ policy

• How much can we save on sharing scans?– MRShare - MultiSplitJobs evaluation

• How much can we save on sharing intermediate data? – MRShare - γ-MultiSplitJobs evaluation

Is sharing always beneficial? – ‘GreedyShare’ policy

Group of jobs   Group size   d = |intermediate data| / |input data|
H1              16           0.3 < d < 0.7
H2              16           0.7 < d
H3              16           0.9 < d

How much we save on sharing scans – MRShare MultiSplitJobs

Group of jobs   Group size   d = |intermediate data| / |input data|
G1              16           0.7 < d
G2              16           0.2 < d < 0.7
G3              16           0.0 < d < 0.2
G4              16           0.0 < d < max
G5              64           0.0 < d < max

How much we save on sharing Map-output – MRShare MultiSplitJobs

How much we save on sharing intermediate data – MRShare γ-MultiSplitJobs

Group of jobs   Group size   d = |intermediate data| / |input data|
G1              16           0.7 < d
G2              16           0.2 < d < 0.7
G3              16           0.0 < d < 0.2

Summary

• We introduced MRShare – a framework for automatic work sharing in Map Reduce.

• We identified sharing primitives and demonstrated the implementation thereof in a Map-Reduce engine.

• We established a cost model and solved several work sharing optimization problems.

• We demonstrated vast savings when using MRShare.

Thank you!!!

Questions?
