MapReduce Extension Topics in Data Science
Cheng Ren, Lixing Lian
• Scientific Workloads
  - Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads
• Iterative extension
  - HaLoop: Efficient Iterative Data Processing on Large Clusters
• Adaptive Indexes
  - Only Aggressive Elephants are Fast Elephants (HAIL)
Outline
Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads
HaLoop: Efficient Iterative Data Processing on Large Clusters
Yingyi Bu, Bill Howe, Magda Balazinska, Michael D. Ernst
• Motivation
• Examples that cannot be executed perfectly
• Architecture
• Caching ideas
Outline
Motivation
• MapReduce can't express recursion/iteration
• Lots of interesting programs need loops
  - graph algorithms
  - clustering
  - machine learning
  - recursive queries (CTEs, Datalog, WITH clause)
• Dominant solution: use a driver program outside of MapReduce
• Hypothesis: making MapReduce loop-aware affords optimization
  - lays a foundation for scalable implementations of recursive languages
Example 1: PageRank
PageRank Implementation on MapReduce
What’s the problem?
L and Count are loop invariants, but
1. They are loaded on each iteration
2. They are shuffled on each iteration
3. Also, the fixpoint is evaluated as a separate MapReduce job per iteration
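A minimal sketch of the driver-loop pattern behind this problem, written in plain Python rather than against Hadoop. The function names, the damping factor, and the `invariant_reads` counter are assumptions made for illustration; the point is only that the linkage table and the out-degree counts are passed into every iteration unchanged.

```python
# Plain-Python stand-in for a driver program that launches one "job" per iteration.
# The linkage table (links) and the out-degree counts never change, yet every
# iteration takes them as input again.

def pagerank_iteration(ranks, links, out_degree, damping=0.85):
    """One iteration: join ranks with links, then aggregate contributions."""
    n = len(ranks)
    contrib = {url: 0.0 for url in ranks}
    for src, dst in links:  # loop-invariant input, rescanned every iteration
        contrib[dst] += ranks[src] / out_degree[src]
    return {url: (1 - damping) / n + damping * c for url, c in contrib.items()}


def driver(links, iterations=10):
    """Driver outside the 'job': re-submits the invariant data each round."""
    urls = sorted({u for edge in links for u in edge})
    out_degree = {u: 0 for u in urls}
    for src, _ in links:
        out_degree[src] += 1
    ranks = {u: 1.0 / len(urls) for u in urls}
    invariant_reads = 0  # hypothetical counter: link tuples re-read so far
    for _ in range(iterations):
        ranks = pagerank_iteration(ranks, links, out_degree)
        invariant_reads += len(links)
    return ranks, invariant_reads
```

In this sketch the number of invariant tuples read grows linearly with the iteration count, which is the cost a loop-aware cache would aim to remove.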
Example 2: Transitive Closure
Transitive Closure on MapReduce
What’s the problem?
Friend is loop invariant, but
1. Friend is loaded on each iteration
2. Friend is shuffled on each iteration
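The same pattern can be sketched for transitive closure. This is an illustrative in-memory version, not the MapReduce implementation from the slides: each round joins the newly found pairs with `friend` and then removes duplicates, and `friend` is the side of the join that never changes. The function name and the tuple representation are assumptions.

```python
def transitive_closure(friend):
    """Iterate join(delta, friend) until no new pairs appear."""
    closure = set(friend)
    delta = set(friend)
    iterations = 0
    while delta:
        iterations += 1
        # Join step: friend is the loop-invariant side of the join.
        joined = {(a, c) for (a, b) in delta for (b2, c) in friend if b == b2}
        # Duplicate-elimination step: keep only pairs not seen before.
        delta = joined - closure
        closure |= delta
    return closure, iterations
```

On a three-edge chain a→b→c→d this takes three rounds, and `friend` is scanned in full in each of them.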
• Architecture
• Cache loop-invariant data
• Programming Model
Push loops into MapReduce!
HaLoop Architecture
Inter-iteration caching
RI: Reducer Input Cache
• Provides:
  - Access to loop-invariant data without map/shuffle
• Data for:
  - Reducer function
• Assumes:
  1. Static partitioning (implies: no new nodes)
  2. Deterministic mapper implementation
• PageRank
  - Avoids loading and shuffling the web graph at every iteration
• Transitive Closure
  - Avoids loading and shuffling the friends graph at every iteration
RO: Reducer Output Cache
• Provides:
  - Distributed access to output of previous iterations
• Used by:
  - Fixpoint evaluation
• Assumes:
  1. Partitioning constant across iterations
  2. Reducer output key functionally determines reducer input key
• PageRank
  - Allows distributed fixpoint evaluation
  - Obviates extra MapReduce job
• Transitive Closure
  - No help
MI: Mapper Input Cache
• Provides:
  - Access to non-local mapper input on later iterations
• Data for:
  - Map function
• Assumes:
  - Mapper input does not change
• Avoids non-local data reads on iterations > 0
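The role of the reducer output cache in fixpoint evaluation can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: each list element stands for one partition's output, the distance is an L1 sum, and the names `distance` and `reached_fixpoint` are hypothetical. Because partitioning is constant across iterations, each partition can compare its current output with its own cached previous output, and only the partial sums need to be combined.

```python
def distance(prev, curr):
    """L1 distance between two iterations' outputs, keyed by reducer output key."""
    return sum(abs(curr[k] - prev.get(k, 0.0)) for k in curr)


def reached_fixpoint(cached_prev_partitions, curr_partitions, threshold):
    """Combine per-partition distances and compare against the threshold."""
    partial = [
        distance(prev, curr)
        for prev, curr in zip(cached_prev_partitions, curr_partitions)
    ]
    return sum(partial) < threshold
```

With the previous output available locally in each partition, no separate job is needed to bring the two iterations' results together.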
• Mapper/reducer stay the same!
• Touch points
  - Input/Output: for each <iteration, step>
  - Cache filter: which tuples to cache?
  - Distance function: optional
• Nested job containing child jobs as loop body
• Minimize extra programming effort
Programming Model
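To make the touch points concrete, here is a hypothetical loop-aware job object in Python. It is not HaLoop's actual API: the class name `LoopJob`, its fields, and the `run` method are all invented for illustration. It only shows how a multi-step loop body, a cache filter, and an optional distance function could fit together while the per-step functions stay ordinary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopJob:  # hypothetical; not HaLoop's real interface
    steps: list                               # loop body: one callable per step
    cache_filter: Callable[[tuple], bool]     # which invariant tuples to cache
    distance: Optional[Callable] = None       # optional fixpoint test
    threshold: float = 0.0
    max_iterations: int = 10

    def run(self, invariant, state):
        # The cache is built once, before the first iteration.
        cache = [t for t in invariant if self.cache_filter(t)]
        for i in range(self.max_iterations):
            new_state = state
            for step in self.steps:           # child jobs forming the loop body
                new_state = step(cache, new_state)
            if (self.distance is not None
                    and self.distance(state, new_state) <= self.threshold):
                return new_state, i + 1
            state = new_state
        return state, self.max_iterations
```

If no distance function is given, the sketch simply runs for `max_iterations` rounds.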
• Relatively simple changes to MapReduce/Hadoop can support iterative/recursive programs
  - TaskTracker (cache management)
  - Scheduler (cache awareness)
  - Programming model (multi-step loop bodies, cache control)
• Optimizations
  - Caching reducer input realizes the largest gain
  - Good to eliminate the extra MapReduce step for termination checks
  - Mapper input cache benefit inconclusive; needs a busier cluster
Conclusions
Only Aggressive Elephants are Fast Elephants
Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad
• Motivation
• Comparison between Hadoop and HAIL • Upload pipeline
• Query pipeline
Outline
• Analyzes a large web log by applying filter conditions (source IP, web address).
• He uses a sequence of different filter conditions, each one triggering a new MapReduce job.
• He is not exactly sure what he is looking for.
• "Let's see what I am going to encounter on the way."
Bob
• This kind of use case illustrates an exploratory usage of Hadoop MapReduce.
• It is a major use case of Hadoop MapReduce.
  - One major problem: slow query runtimes.
  - Time is dominated by the I/O for reading all input data.
Bob
Hadoop Aggressive Indexing Library
HAIL + MapReduce vs. HDFS + MapReduce
HDFS + MapReduce
HDFS
• Horizontal partitions: HDFS blocks, 64 MB (default)
• Blocks are stored on datanodes
• Replication allows two failovers
MapReduce
map(row) -> set of (ikey, value)
Example: map(docID, document) -> set of (term, docID)
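Bob's filter queries fit this map signature directly. The sketch below is a plain-Python illustration, with an assumed row layout of `(source_ip, url)` and hypothetical helper names; it shows that a filter expressed as a map function still touches every input row, which is why I/O dominates the runtime.

```python
def make_filter_map(source_ip):
    """Build a map function that keeps only rows from one source IP."""
    def map_row(row):
        ip, url = row  # assumed row layout: (source_ip, url)
        return {(ip, url)} if ip == source_ip else set()
    return map_row


def run_map_only_job(rows, map_fn):
    """Apply map_fn to every row, counting how many rows were read."""
    out, rows_read = set(), 0
    for row in rows:
        rows_read += 1
        out |= map_fn(row)
    return out, rows_read
```

Every new filter condition repeats the full scan, regardless of how few rows qualify.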
HAIL + MapReduce
HAIL
• Horizontal partitions: HDFS blocks, 64 MB (default)
1. Convert the input file into binary PAX.
2. Create a series of different sort orders.
3. Create multiple clustered indexes.
- If indexes cannot help, fall back to standard Hadoop scanning.
HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica.
HAIL Upload Pipeline
Why Clustered Indexes?
- Unclustered indexes are only competitive for very selective queries, as they may trigger considerable random I/O for non-selective index traversals.
- Clustered indexes do not have that problem. Whatever the selectivity, we read the clustered index and scan the qualifying blocks.
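A small sketch of why a clustered index avoids random I/O, using an in-memory list as a stand-in for one data block. The function names and the tuple row layout are assumptions. Because the rows are physically sorted on the indexed attribute, any range of keys maps to one contiguous slice.

```python
from bisect import bisect_left, bisect_right


def build_clustered(rows, key_pos):
    """Sort the block on one attribute; the sorted keys act as the index."""
    data = sorted(rows, key=lambda r: r[key_pos])
    keys = [r[key_pos] for r in data]
    return keys, data


def range_query(keys, data, lo, hi):
    """Return rows with lo <= key <= hi as one contiguous slice."""
    start = bisect_left(keys, lo)
    end = bisect_right(keys, hi)
    return data[start:end]
```

An unclustered index would instead yield scattered row positions, so a non-selective query would jump around the block.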
HAIL Upload Pipeline
The semantics of an ACK for a packet of a block are changed from "packet received, validated, and flushed" to "packet received and validated".
HAIL Upload Pipeline
In parallel to forwarding and reassembling packets, each datanode sorts the data, creates indexes, and forms a HAIL block.
HAIL Upload Pipeline
Enrich the HDFS namenode to schedule map tasks close to replicas having a suitable index
• Dir_rep mapping: (blockID, datanode) → HAILBlockReplicaInfo
• HAILBlockReplicaInfo contains detailed information about the types of available indexes
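How such a mapping could drive scheduling is sketched below. This is an assumption-laden illustration, not HAIL's implementation: `dir_rep` is modeled as a Python dict, the replica info as a dict with a hypothetical `indexed_on` field, and the attribute names are made up. The function picks a datanode whose replica is indexed on the filter attribute and otherwise falls back to a plain scan on any replica.

```python
def pick_replica(dir_rep, block_id, filter_attr):
    """Return (datanode, use_index) for one block.

    dir_rep maps (block_id, datanode) -> replica info; the "indexed_on"
    field is a hypothetical stand-in for HAILBlockReplicaInfo.
    """
    replicas = {dn: info for (bid, dn), info in dir_rep.items() if bid == block_id}
    if not replicas:
        raise KeyError(block_id)
    for dn in sorted(replicas):
        if replicas[dn]["indexed_on"] == filter_attr:
            return dn, True          # index scan on a suitable replica
    return sorted(replicas)[0], False  # no suitable index: standard scan
```

Since each replica of a block can carry a different clustered index, several different filter attributes can be served from the same set of replicas.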
HAIL Query Pipeline
Upload Time
Query Times Individual Jobs: Weblog
Fast Indexing and Fast Querying
Questions? Ask Bob.