MapReduce Extension - Brown University, cs.brown.edu/courses/cs195w/slides/mrextensions.pdf

MapReduce Extension Topics in Data Science

Cheng Ren, Lixing Lian

•  Scientific Workloads
   - Hadoop’s Adolescence: An Analysis of Hadoop Usage in Scientific Workloads
•  Iterative extension
   - HaLoop: Efficient Iterative Data Processing on Large Clusters
•  Adaptive Indexes
   - Only Aggressive Elephants are Fast Elephants (HAIL)

Outline

Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads

HaLoop: Efficient Iterative Data Processing on Large Clusters

Yingyi Bu, Bill Howe, Magda Balazinska, Michael D. Ernst

•  Motivation

•  Examples that cannot be executed perfectly

•  Architecture

•  Caching ideas

Outline

•  MapReduce can’t express recursion/iteration
•  Lots of interesting programs need loops
   - graph algorithms
   - clustering
   - machine learning
   - recursive queries (CTEs, Datalog, WITH clause)
•  Dominant solution: use a driver program outside of MapReduce
•  Hypothesis: making MapReduce loop-aware affords optimization
   - lays a foundation for scalable implementations of recursive languages

Motivation

Example 1: PageRank

PageRank Implementation on MapReduce

What’s the problem?

L and Count are loop invariants, but
1. They are loaded on each iteration
2. They are shuffled on each iteration
3. Also, the fixpoint is evaluated as a separate MapReduce job per iteration
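The driver-program pattern and its cost can be sketched in plain Python. This is a toy, in-memory illustration and not Hadoop code: `run_job`, `pagerank_driver`, and the record tags `"L"`/`"R"` are hypothetical names, the out-degree (`len(dsts)`) stands in for the Count table, and dangling pages are assumed not to occur. The counter shows that the invariant link table is pushed through map and shuffle once per iteration.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Toy stand-in for one MapReduce job: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out

def pagerank_driver(links, iterations=10, damping=0.85):
    """links: dict url -> list of outgoing urls (the loop-invariant table L)."""
    n = len(links)
    ranks = {url: 1.0 / n for url in links}
    invariant_records_shuffled = 0

    def join_map(rec):
        tag, url, payload = rec
        yield url, (tag, payload)

    def join_reduce(url, values):
        dsts = next(p for t, p in values if t == "L")
        rank = next(p for t, p in values if t == "R")
        for dst in dsts:
            yield dst, rank / len(dsts)

    def agg_map(rec):
        yield rec

    def agg_reduce(url, values):
        yield url, (1 - damping) / n + damping * sum(values)

    # The loop lives in the driver, outside the MapReduce jobs.
    for _ in range(iterations):
        # Job 1: join ranks with the link table; L is re-read and re-shuffled.
        records = [("L", src, dsts) for src, dsts in links.items()]
        records += [("R", url, rank) for url, rank in ranks.items()]
        invariant_records_shuffled += len(links)
        contribs = run_job(records, join_map, join_reduce)
        # Job 2: aggregate the contributions into new ranks.
        new_ranks = dict(run_job(contribs, agg_map, agg_reduce))
        ranks = {url: new_ranks.get(url, (1 - damping) / n) for url in links}
    return ranks, invariant_records_shuffled
```

For a three-page cycle and ten iterations, the three link records are shuffled thirty times even though they never change.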

Example 2: Transitive Closure

Transitive Closure on MapReduce

What’s the problem?

Friend is loop invariant, but
1.  Friend is loaded on each iteration
2.  Friend is shuffled on each iteration
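A minimal sketch of the iterative transitive-closure computation, assuming an in-memory set of edges and a join followed by duplicate elimination in each round (the function name and data layout are illustrative, not taken from the slides). The `friend` relation is read in every round but never modified, which is what makes it a candidate for caching.

```python
def transitive_closure(friend):
    """friend: set of (a, b) edges; loop-invariant across iterations."""
    closure = set(friend)
    delta = set(friend)          # pairs discovered in the previous round
    iterations = 0
    while delta:
        iterations += 1
        # Join step: in plain MapReduce, `friend` is re-loaded and
        # re-shuffled here on every iteration.
        by_src = {}
        for a, b in friend:
            by_src.setdefault(a, set()).add(b)
        joined = {(a, c) for a, b in delta for c in by_src.get(b, ())}
        # Duplicate-elimination step: keep only pairs not seen before.
        delta = joined - closure
        closure |= delta
    return closure, iterations
```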

•  Architecture

•  Cache loop-invariant data

•  Programming Model

Push loops into MapReduce!

HaLoop Architecture

Inter-iteration caching

RI: Reducer Input Cache
•  Provides:
   - Access to loop-invariant data without map/shuffle
•  Data for:
   - Reducer function
•  Assumes:
   1. Static partitioning (implies: no new nodes)
   2. Deterministic mapper implementation
•  PageRank
   - Avoids loading and shuffling the web graph at every iteration
•  Transitive Closure
   - Avoids loading and shuffling the friends graph at every iteration

RO: Reducer Output Cache
•  Provides:
   - Distributed access to output of previous iterations
•  Used by:
   - Fixpoint evaluation
•  Assumes:
   1. Partitioning constant across iterations
   2. Reducer output key functionally determines reducer input key
•  PageRank
   - Allows distributed fixpoint evaluation
   - Obviates extra MapReduce job
•  Transitive Closure
   - No help
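The idea behind distributed fixpoint evaluation can be sketched as follows. This is an illustration under assumptions, not HaLoop code: each partition is modeled as a dict from key to rank, the distance is taken to be the sum of absolute changes, and the function names are hypothetical. Because partitioning is constant across iterations, each reducer can compare its new output with its own cached output, and only one number per reducer has to be aggregated.

```python
def local_distance(cached_prev, current):
    """Sum of absolute changes for the keys held by one reducer."""
    return sum(abs(v - cached_prev.get(k, 0.0)) for k, v in current.items())

def reached_fixpoint(prev_partitions, curr_partitions, threshold):
    # Each reducer works only on its own partition and its own cache;
    # no extra MapReduce job is needed to bring old and new output together.
    partial = [local_distance(p, c)
               for p, c in zip(prev_partitions, curr_partitions)]
    return sum(partial) < threshold
```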

MI: Mapper Input Cache
•  Provides:
   - Access to non-local mapper input on later iterations
•  Data for:
   - Map function
•  Assumes: Mapper input does not change
   - Avoids non-local data reads on iterations > 0

•  Mapper/reducer stay the same!
•  Touch points
   - Input/Output: for each <iteration, step>
   - Cache filter: which tuples to cache?
   - Distance function: optional
•  Nested job containing child jobs as loop body
•  Minimizes extra programming effort

Programming Model
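To make the shape of such a nested job concrete, here is a toy Python analogue. HaLoop itself extends Hadoop's Java API, which is not reproduced here; `LoopJob`, `run_step`, and the example functions are hypothetical names, and the sketch covers only the multi-step loop body and the optional distance function (input/output selection per step and the cache filter are left out). The mapper and reducer are ordinary functions and know nothing about the loop.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

def run_step(records, mapper, reducer):
    """One child job of the loop body: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reducer(k, groups[k]))
    return out

@dataclass
class LoopJob:
    steps: List[Tuple[Callable, Callable]]   # loop body: (mapper, reducer) pairs
    max_iterations: int
    distance: Optional[Callable] = None      # optional termination test
    threshold: float = 0.0

    def run(self, records):
        for iteration in range(1, self.max_iterations + 1):
            result = records
            for mapper, reducer in self.steps:
                result = run_step(result, mapper, reducer)
            if (self.distance is not None
                    and self.distance(records, result) <= self.threshold):
                return result, iteration
            records = result
        return records, self.max_iterations

# Example loop body with a single step: halve every value.
def halve(key, value):
    yield key, value / 2

def total(key, values):
    yield key, sum(values)

def max_change(old, new):
    new_by_key = dict(new)
    return max(abs(new_by_key[k] - v) for k, v in old)

job = LoopJob(steps=[(halve, total)], max_iterations=10,
              distance=max_change, threshold=1.0)
result, iterations = job.run([("x", 8.0)])
```

The loop stops once the largest change between two consecutive iterations is at most the threshold, or after `max_iterations` rounds if no distance function is given.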

•  Relatively simple changes to MapReduce/Hadoop can support iterative/recursive programs
   - TaskTracker (cache management)
   - Scheduler (cache awareness)
   - Programming model (multi-step loop bodies, cache control)
•  Optimizations
   - Caching reducer input realizes the largest gain
   - Good to eliminate the extra MapReduce step for termination checks
   - Mapper input cache benefit inconclusive; need a busier cluster

Conclusions

Only Aggressive Elephants are Fast Elephants

Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad

•  Motivation

•  Comparison between Hadoop and HAIL

•  Upload pipeline

•  Query pipeline

Outline

•  Bob analyzes a large web log using filter conditions (source IP, web address).
•  He uses a sequence of different filter conditions, each one triggering a new MapReduce job.
•  He is not exactly sure what he is looking for.
•  “Let’s see what I am going to encounter on the way.”

Bob

•  This kind of use case illustrates an exploratory usage of Hadoop MapReduce.
•  It is a major use case of Hadoop MapReduce.
   - One major problem: slow query runtimes.
   - Time is dominated by the I/O for reading all input data.

Bob

Hadoop Aggressive Indexing Library (HAIL)

HAIL + MapReduce vs. HDFS + MapReduce

HDFS
•  Horizontal partitions
•  HDFS blocks: 64 MB (default)
•  Datanodes
•  Allows two failovers

MapReduce
•  map(row) -> set of (ikey, value)
•  Example: map(docID, document) -> set of (term, docID)
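The map signature above is the classic inverted-index example. A minimal in-memory sketch follows; the whitespace tokenization, the lowercasing, and the grouping step that plays the role of shuffle and reduce are assumptions added for illustration, since the slide only gives the map signature.

```python
from collections import defaultdict

def map_doc(doc_id, document):
    """map(docID, document) -> set of (term, docID)."""
    return {(term, doc_id) for term in document.lower().split()}

def build_inverted_index(docs):
    """Group the map output by term, as shuffle and reduce would."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term, d in map_doc(doc_id, text):
            index[term].add(d)
    return {term: sorted(ids) for term, ids in index.items()}
```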

HAIL + MapReduce

HAIL
•  Horizontal partitions
•  HDFS blocks: 64 MB (default)

HAIL
1.  Convert the input file into binary PAX
2.  Create a series of different sort orders
3.  Create multiple clustered indexes
    - If indexes cannot help, fall back to standard Hadoop scanning.
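The effect of keeping each replica in a different sort order can be sketched with sorted lists and binary search. This is a simplified model under assumptions: rows are Python dicts, the PAX layout is not modeled, a sorted key list stands in for a clustered index, and `Replica` and `query` are hypothetical names. A query on a column that some replica is sorted on reads only the qualifying rows; otherwise it falls back to a full scan.

```python
from bisect import bisect_left, bisect_right

class Replica:
    """One copy of a block, optionally sorted (clustered) on one column."""
    def __init__(self, rows, sort_column=None):
        self.sort_column = sort_column
        if sort_column is None:
            self.rows = list(rows)
            self.keys = []
        else:
            self.rows = sorted(rows, key=lambda r: r[sort_column])
            self.keys = [r[sort_column] for r in self.rows]

    def lookup(self, column, value):
        """Return (matching rows, number of rows read)."""
        if column == self.sort_column:
            lo = bisect_left(self.keys, value)
            hi = bisect_right(self.keys, value)
            return self.rows[lo:hi], hi - lo
        # No suitable index on this replica: scan every row.
        return [r for r in self.rows if r[column] == value], len(self.rows)

def query(replicas, column, value):
    for replica in replicas:
        if replica.sort_column == column:
            return replica.lookup(column, value)
    return replicas[0].lookup(column, value)   # standard scan fallback

rows = [
    {"ip": "10.0.0.2", "url": "/b", "status": 200},
    {"ip": "10.0.0.1", "url": "/a", "status": 404},
    {"ip": "10.0.0.2", "url": "/a", "status": 200},
]
replicas = [Replica(rows, "ip"), Replica(rows, "url"), Replica(rows)]
```

With these three replicas, filters on `ip` or `url` touch only matching rows, while a filter on `status` has to scan the whole block.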

HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica.

HAIL Upload Pipeline

Why Clustered Indexes?
-  Unclustered indexes are only competitive for very selective queries, as they may trigger considerable random I/O for non-selective index traversals.
-  Clustered indexes do not have that problem. Whatever the selectivity, we read the clustered index and scan the qualifying blocks.


The semantics of an ACK for a packet of a block are changed from “packet received, validated, and flushed” to “packet received and validated”.

In parallel to forwarding and reassembling packets, each datanode sorts the data, creates indexes, and forms a HAIL block.

HAIL Upload Pipeline

Enrich the HDFS namenode to schedule map tasks close to replicas having a suitable index
•  Dir_rep mapping: (blockID, datanode) → HAILBlockReplicaInfo
•  HAILBlockReplicaInfo contains detailed information about the types of available indexes
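How such a mapping could guide scheduling is sketched below. The dict layout and the `indexed_column` field are assumptions standing in for HAILBlockReplicaInfo, whose real contents are not shown in the slides, and `choose_replica` is a hypothetical helper. It assumes the block has at least one replica.

```python
def choose_replica(dir_rep, block_id, filter_column):
    """Pick a datanode for a map task over `block_id`.

    dir_rep: dict (block_id, datanode) -> info dict; the info dict is a
    stand-in for HAILBlockReplicaInfo.
    Returns (datanode, has_suitable_index).
    """
    candidates = {dn: info for (b, dn), info in dir_rep.items() if b == block_id}
    for datanode in sorted(candidates):
        if candidates[datanode]["indexed_column"] == filter_column:
            return datanode, True
    # No replica has a suitable index: any replica will do, and the
    # task falls back to scanning.
    return min(candidates), False

dir_rep = {
    (42, "dn1"): {"indexed_column": "ip"},
    (42, "dn2"): {"indexed_column": "url"},
    (42, "dn3"): {"indexed_column": "date"},
}
```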

HAIL Query Pipeline

Upload Time

Query Times Individual Jobs: Weblog

Fast Indexing and Fast Querying

Questions? Ask Bob.
