Tuple MapReduce is a new foundational model that extends MapReduce with the notion of tuples. It bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting, and joins. This paper also presents Pangool, an open-source framework implementing Tuple MapReduce. Pangool eases the design and implementation of MapReduce-based applications and increases their flexibility while maintaining Hadoop’s performance.
Tuple MapReduce: Beyond classic MapReduce
Pedro Ferrera, Ivan de Prado, Eric Palacios
DataSalt
Barcelona, SPAIN
pere,ivan,[email protected]
Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo
University of Geneva, CUI
Geneva, SWITZERLAND
Outline
● Introduction
● Related Work
● Classic MapReduce
– The problems of MapReduce
● Tuple MapReduce
– The basic Tuple MapReduce
– Joins
– Generalization of MapReduce
● Pangool
● Conclusions and Future work
Introduction
● Huge amounts of information → need for new processing technologies.
● MapReduce → a major contribution ...
– … but it involves a steep learning curve.
● Most design patterns found in real-world problems are not well covered.
● We propose Tuple MapReduce as a better foundational model.
● Tuple MapReduce on Hadoop → Pangool
– No key architectural changes needed.
Related work
● MapReduce: Google paper from 2004
● Hadoop
● Higher-level tools
– Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading
● Higher-level abstractions are very popular
– This supports the idea that MapReduce is too low-level a paradigm
● Merge MapReduce
– Targets the problem of relational operations (joins)
– Implies changes in the architecture and a new merge step
Classic MapReduce
● Jobs
– Input file, output file
– Developer provides two functions: map and reduce
● Distributed execution of work
– First, the map function runs in the map phase
– Then, the reduce function runs in the reduce phase
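The two-phase flow above can be sketched in a single process. This is only an illustration of the programming model, not Hadoop code: `run_mapreduce`, `word_count_map` and `word_count_reduce` are hypothetical names, and the in-memory dictionary stands in for the distributed shuffle.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for a MapReduce job (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: one call per distinct key, keys visited in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    # Sum all the 1s emitted for this word.
    yield word, sum(counts)


result = run_mapreduce(["a b a", "b c"], word_count_map, word_count_reduce)
```

Note that the values reaching `reduce_fn` arrive in no guaranteed order in a real cluster; here they happen to keep input order.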
The problems of MapReduce
● Compound records
– Real-world problems include multi-field records. They don’t fit well into the key/value schema
● Sorting
– No inherent sorting of the records within a reduce group.
– “Secondary sorting trick” in implementations (Hadoop)
● Join
– A quite common operation
– Not directly possible in MapReduce without using “tricks”:
● secondary sorting
● compound records
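The "secondary sorting trick" can be illustrated as follows: the field to sort on is packed into a composite key, records are sorted on the whole composite key, but they are grouped on its first part only. This is a single-process sketch with hypothetical names (`secondary_sort`, `pairs`); in Hadoop the same effect needs a custom partitioner and grouping comparator.

```python
from itertools import groupby


def secondary_sort(pairs):
    """pairs: iterable of ((natural_key, secondary_key), value)."""
    # Sort on the full composite key ...
    ordered = sorted(pairs, key=lambda kv: kv[0])
    # ... but group on the natural key only, so each group arrives
    # already ordered by the secondary key.
    for natural, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield natural, [(key[1], value) for key, value in group]


pairs = [(("b.com", 2), 7), (("a.com", 3), 1), (("a.com", 1), 5)]
grouped = list(secondary_sort(pairs))
```

The awkward part is visible even in this small sketch: the developer has to build and unpack the composite key by hand.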
Tuple MapReduce
● Idea: replace key/value pairs with tuples
● group-by and sort-by clauses
Tuple MapReduce (II)
● group-by and sort-by constraint
– group-by must be a prefix of sort-by
– Needed in order to implement Tuple MapReduce over a MapReduce architecture
● In contrast to MapReduce, Tuple MapReduce:
– provides compound records → tuples
– provides intra-reduce sorting
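A minimal sketch of the reduce side of this model, assuming tuples are represented as Python dicts and that both clauses are non-empty lists of field names. `run_tuple_mapreduce` and the sample fields are hypothetical and are not Pangool's API.

```python
from itertools import groupby
from operator import itemgetter


def run_tuple_mapreduce(tuples, group_by, sort_by, reduce_fn):
    """tuples: list of dicts; group_by / sort_by: non-empty lists of field names."""
    if not group_by or sort_by[:len(group_by)] != group_by:
        raise ValueError("group_by must be a non-empty prefix of sort_by")
    # Sorting by sort_by also brings equal group_by values together,
    # because group_by is a prefix of sort_by.
    ordered = sorted(tuples, key=itemgetter(*sort_by))
    output = []
    for _, group in groupby(ordered, key=itemgetter(*group_by)):
        # Each group reaches the reducer already sorted by sort_by.
        output.extend(reduce_fn(list(group)))
    return output


def first_action(group):
    # The earliest tuple of the group is simply the first one.
    return [(group[0]["user"], group[0]["action"])]


events = [
    {"user": "u2", "ts": 5, "action": "buy"},
    {"user": "u1", "ts": 9, "action": "view"},
    {"user": "u1", "ts": 3, "action": "login"},
]
firsts = run_tuple_mapreduce(events, ["user"], ["user", "ts"], first_action)
```

The prefix check is what makes a single sort sufficient: if group-by were not a prefix of sort-by, tuples of the same group could end up non-adjacent after sorting.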
Example: cumulative visits
● Cumulative number of visits per URL up to each date
Input → URL, date, visits
Expected output → URL, date, cumulative visits
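One way to read this example is group-by URL and sort-by (URL, date), so that a reducer only needs a running total. The sketch below assumes records are `(url, date, visits)` triples with ISO-formatted date strings (so that string order equals date order); `cumulative_visits` is a hypothetical name.

```python
from itertools import groupby


def cumulative_visits(records):
    """records: iterable of (url, date, visits); dates as ISO strings."""
    # sort-by: url, date
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    out = []
    # group-by: url
    for url, group in groupby(ordered, key=lambda r: r[0]):
        total = 0
        for _, date, visits in group:
            # Dates arrive in ascending order, so a running sum is enough.
            total += visits
            out.append((url, date, total))
    return out


rows = cumulative_visits([
    ("a.com", "2012-01-02", 3),
    ("b.com", "2012-01-01", 4),
    ("a.com", "2012-01-01", 2),
])
```

If the input holds several records for the same URL and date, this sketch emits one output row per record rather than merging them.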
Join-Tuple MapReduce
● Joins among heterogeneous datasets
– Tuples associated with a source-id.
● Tuples reach the reducer sorted by source-id
– enabling memoryless reduce joins
– and grouped by some common fields
11 / 18
Example: join between clients and payments
[Figure: inner join of the clients and payments datasets. Fields shown: client_id, name, payment_id, amount]
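A sketch of such a reduce-side join, under assumptions that are not in the slides: clients are `(client_id, name)` pairs with a unique `client_id`, payments are `(payment_id, client_id, amount)` triples, and the source ids are 0 for clients and 1 for payments. Grouping by `client_id` and sorting by source id makes the client tuple arrive first, so the reducer keeps only one name in memory while streaming the payments.

```python
from itertools import groupby

CLIENTS, PAYMENTS = 0, 1  # hypothetical source ids


def inner_join(clients, payments):
    """clients: (client_id, name); payments: (payment_id, client_id, amount)."""
    # Tag every tuple with its source id.
    tagged = [(cid, CLIENTS, (name,)) for cid, name in clients]
    tagged += [(cid, PAYMENTS, (pid, amount)) for pid, cid, amount in payments]
    # sort-by: client_id, source id; group-by: client_id
    tagged.sort(key=lambda t: (t[0], t[1]))
    out = []
    for cid, group in groupby(tagged, key=lambda t: t[0]):
        name = None
        for _, source, fields in group:
            if source == CLIENTS:
                name = fields[0]  # the client tuple arrives before its payments
            elif name is not None:
                pid, amount = fields
                out.append((cid, name, pid, amount))
    return out


joined = inner_join(
    [(1, "Ann"), (2, "Bob")],
    [(10, 1, 5.0), (11, 3, 9.0), (12, 1, 2.5)],
)
```

Clients without payments and payments without a matching client produce no output, which is the inner-join behaviour.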
12 / 18
Generalization of MapReduce
● MapReduce is a Tuple MapReduce with...
– tuples of two values and
– group-by and sort-by set to the first value
● The opposite is also possible → implementing Tuple MapReduce on top of existing MapReduce implementations.
– Architectural changes are not needed.
– Pangool is a proof of that.
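The first direction can be sketched by running a classic key/value reduce on a small tuple-based engine. Both functions below are hypothetical illustrations: tuples are plain Python tuples, and fields are addressed by position.

```python
from itertools import groupby


def tuple_reduce_phase(tuples, group_by, sort_by, reduce_fn):
    """group_by / sort_by are lists of field positions; group_by prefixes sort_by."""
    if sort_by[:len(group_by)] != group_by:
        raise ValueError("group_by must be a prefix of sort_by")
    ordered = sorted(tuples, key=lambda t: tuple(t[i] for i in sort_by))
    out = []
    for _, group in groupby(ordered, key=lambda t: tuple(t[i] for i in group_by)):
        out.extend(reduce_fn(list(group)))
    return out


def classic_reduce_phase(pairs, reduce_fn):
    """Classic MapReduce: 2-field tuples, group-by = sort-by = field 0."""
    def adapter(group):
        # Rebuild the classic (key, list of values) call from a tuple group.
        key = group[0][0]
        values = [value for _, value in group]
        return reduce_fn(key, values)
    return tuple_reduce_phase(pairs, [0], [0], adapter)


counts = classic_reduce_phase(
    [("b", 1), ("a", 1), ("b", 1)],
    lambda key, values: [(key, sum(values))],
)
```

The adapter is the whole translation: nothing else about the engine changes when the tuples have exactly two fields.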
13 / 18
Pangool
● Tuple MapReduce implementation on top of Hadoop.
– On top of an existing MapReduce implementation.
● It is just a library. No architecture change was needed.
● Used in real-world applications
– Banking
– Searching
– Social networks
pangool.net
14 / 18
Pangool benchmark – secondary sort
15 / 18
Pangool benchmark – join
16 / 18
Pangool performance
● Only between 5% and 8% worse than Hadoop
– Pretty good considering that Pangool is built on top of the Hadoop API
● The difference would probably disappear with a native implementation
● Much better than higher-level APIs
– Probably because Pangool is a low-level API
17 / 18
Conclusions and Future work
● The MapReduce key/value model has been shown to be too strict.
● Tuple MapReduce keeps MapReduce's features
– Enhancing it with
● compound records,
● joins and
● intra-reduce sorting.
● Pangool is a proof of its viability,
– including on existing implementations like Hadoop, without changing the architecture
● Future work would involve abstractions for flow creation
– Simplifying job chaining and data flow.
18 / 18
Thanks!
● Any questions or doubts?
– @ivanprado